Split Dataset
Splitting your image dataset into training, validation, and test sets
The Split Dataset screen divides your original image dataset into three subsets used during the machine learning workflow: training (75%), validation (15%), and test (10%). This fixed ratio ensures that the model is trained on one portion of data, tuned on another, and evaluated on a completely separate set.
Prerequisites
Before splitting your dataset, ensure that:
You have an open project (the screen is disabled otherwise).
Your image dataset is organized into class folders, where each subfolder name represents a class label. For example:
Original/
    OK/
        img_001.jpg
        img_002.jpg
        ...
    Defect_Black/
        img_001.jpg
        ...
    Defect_Sour/
        img_001.jpg
        ...
Images are in a supported format: .jpg, .jpeg, .png, .bmp, .tiff, or .webp.
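To check these prerequisites outside the application, a small script can count the supported images in each class folder. This is a minimal sketch, not the application's own code: the function name count_images_per_class is hypothetical, and matching extensions case-insensitively is an assumption.

```python
from pathlib import Path

# Extensions listed as supported in the prerequisites above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}


def count_images_per_class(dataset_root):
    """Return {class_name: image_count} for each subfolder of dataset_root.

    Hypothetical helper: each immediate subfolder is treated as one class,
    and only files with a supported extension are counted.
    """
    root = Path(dataset_root)
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            # Assumption: extensions are compared case-insensitively.
            if f.is_file() and f.suffix.lower() in SUPPORTED_EXTENSIONS
        )
    return counts
```

A class that reports zero images, or a root folder that yields an empty result, usually means the folder layout does not match the structure shown above.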
Step-by-Step Walkthrough
1. Select the Original Dataset Folder
Click Select next to the first field to choose the folder containing your class-organized images. This is the source dataset that will be read during splitting.
When you create a new project, the default path is set to {project}/Dataset/Original/V00. You can change this to point to any folder on your system.
After selecting the folder, the class distribution chart on the right panel updates to show a bar chart of how many images exist in each class. Use this to understand the balance of your dataset before splitting.
2. Select the Split Output Folder
Click Select next to the second field to choose where the split results will be saved. The application creates train/, val/, and test/ subfolders inside this location, each mirroring the class folder structure.
The default output path is {project}/Dataset/Split/V00. If you re-run the split with different settings, consider creating a new version folder (e.g., V01) to preserve previous splits.
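If you prepare version folders by script, the next free name can be derived from the folders that already exist. The sketch below assumes the V00, V01, ... naming shown above; the helper next_version_folder is hypothetical and is not part of the application.

```python
import re
from pathlib import Path


def next_version_folder(split_root):
    """Return the path of the next unused version folder under split_root.

    Hypothetical helper: looks for subfolders named V<number> (V00, V01, ...)
    and returns the one after the highest number found, or V00 if none exist.
    """
    root = Path(split_root)
    pattern = re.compile(r"^V(\d{2,})$")
    versions = []
    if root.is_dir():
        for entry in root.iterdir():
            match = pattern.match(entry.name)
            if entry.is_dir() and match:
                versions.append(int(match.group(1)))
    next_index = max(versions) + 1 if versions else 0
    return root / f"V{next_index:02d}"
```

The function only computes the path. Create the folder yourself and then select it as the split output folder.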
3. Configure Class Balancing (Optional)
By default, the split respects the natural distribution of your dataset — classes with more images produce more samples in each subset. You can optionally set a cutoff to cap the number of images per class, which helps balance an unevenly distributed dataset.
There are two modes, controlled by a toggle switch:
Use Minimum Class Count (toggle ON): The application automatically sets the cutoff to match the class with the fewest images. This creates a perfectly balanced dataset but excludes the surplus images of larger classes from the split.
Type Value (toggle OFF): Enter a specific number manually and click Save to apply it. Each class will contribute at most this many images to the split.
The cutoff value appears as a red dashed reference line on the class distribution chart, so you can visualize how many images will be included versus excluded.
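The effect of the two modes can be illustrated with a short calculation. This is a sketch under stated assumptions: the function names and the example counts are made up, and it only shows the arithmetic described above, not how the application selects which images to keep.

```python
def resolve_cutoff(class_counts, use_minimum=True, typed_value=None):
    """Return the per-class cap for the chosen mode.

    use_minimum=True mirrors "Use Minimum Class Count" (toggle ON);
    use_minimum=False mirrors "Type Value" (toggle OFF) with typed_value.
    """
    if use_minimum:
        return min(class_counts.values())
    if typed_value is None or typed_value <= 0:
        raise ValueError("A positive cutoff value is required in Type Value mode.")
    return typed_value


def apply_cutoff(class_counts, cutoff):
    """Return, per class, how many images fall under and over the cutoff."""
    return {
        name: {"included": min(count, cutoff), "excluded": max(count - cutoff, 0)}
        for name, count in class_counts.items()
    }


# Made-up counts for the example classes used earlier on this page.
example_counts = {"OK": 500, "Defect_Black": 120, "Defect_Sour": 80}

balanced = apply_cutoff(example_counts, resolve_cutoff(example_counts))
capped = apply_cutoff(example_counts, resolve_cutoff(example_counts, False, 100))
```

With these counts, the minimum mode caps every class at 80 images, while a typed value of 100 keeps 100 images from OK and Defect_Black and all 80 from Defect_Sour.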
4. Run the Split
Click Split Dataset to begin. The application:
Validates that all required paths are set and the folders exist.
Removes any existing output in the split folder (to ensure a clean split).
Launches the Python splitting script.
Streams progress logs to the right panel in real time.
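The four steps above can be sketched roughly as follows. This is an illustration only, not the application's implementation: run_split is a hypothetical name, the command to launch is supplied by the caller because the actual script name and its arguments are not documented here, and cancellation is left out.

```python
import shutil
import subprocess
from pathlib import Path


def run_split(original_dir, output_dir, command, on_log=print):
    """Validate paths, clear old output, run a command, and stream its logs.

    Hypothetical sketch: `command` is an argument list for the splitting
    script and `on_log` receives each output line as it arrives.
    Returns the exit code of the process.
    """
    # 1. Validate that both paths are set and the folders exist.
    if not original_dir or not output_dir:
        raise ValueError("Both the original and output paths must be set.")
    original, output = Path(original_dir), Path(output_dir)
    for folder in (original, output):
        if not folder.is_dir():
            raise FileNotFoundError(f"Folder does not exist: {folder}")

    # 2. Remove any existing output so the split starts clean.
    for child in output.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

    # 3. Launch the script, merging stderr into stdout.
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    # 4. Stream log lines as they are produced.
    for line in process.stdout:
        on_log(line.rstrip("\n"))
    return process.wait()
```

Because step 2 deletes everything in the output folder, point the output path at a new version folder if you want to keep an earlier split.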
The status indicator below the button shows the current state: Idle, Running, Completed, Error, or Cancelled.
A progress bar appears during the operation. You can click Stop Splitting at any time to cancel.
The sidebar navigation is locked while the split is running to prevent accidental navigation. It unlocks automatically when the operation completes or is cancelled.
Understanding the Class Distribution Chart
The bar chart on the right panel visualizes your dataset composition:
X-axis: Class names (from folder names)
Y-axis: Image count per class
Bar colors: Gradient from light to dark blue, ordered by class
Red dashed line: The cutoff threshold (if configured)
Total count: Displayed above the chart header
Click the refresh button (circular arrow icon) to reload the distribution if you have modified the dataset outside the application.
Output Structure
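After a successful split, the output folder contains train/, val/, and test/ subfolders, each mirroring the class folder structure of the original dataset.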
The ratio is fixed at 75% train / 15% validation / 10% test, applied per class to maintain proportional representation.
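To make the per-class ratio concrete, the sketch below divides one class's file list into the three subsets. The shuffling, the seed, and the rounding rule (round down for train and validation, remainder to test) are assumptions, since the application's exact behavior is not documented here, and split_class is a hypothetical name.

```python
import random


def split_class(files, seed=0):
    """Split one class's files into train/val/test at 75% / 15% / 10%.

    Assumptions: files are shuffled with a fixed seed, train and val sizes
    are rounded down, and whatever remains goes to test.
    """
    shuffled = list(files)
    random.Random(seed).shuffle(shuffled)

    total = len(shuffled)
    n_train = total * 75 // 100  # integer arithmetic avoids float rounding
    n_val = total * 15 // 100

    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

Under these assumptions, a class with 100 images yields 75 training, 15 validation, and 10 test images. Applying the same function to every class keeps each class's share the same in all three subsets.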
Configuration Persistence
All settings on this screen (original path, output path, cutoff value) are saved automatically to the project's config.yaml file. If you navigate away and return, your previous configuration is restored.
If you modify the cutoff value without saving and attempt to navigate away, a confirmation dialog appears asking whether to save or discard your changes.
Troubleshooting
Problem: "Split Dataset" button is disabled
Cause: No project is open
Solution: Open or create a project from the Home screen
Problem: Chart shows "No data to display"
Cause: Dataset folder is empty or path is incorrect
Solution: Verify that the selected folder contains class subfolders with images
Problem: Split fails with an error
Cause: Output folder is on a read-only drive, or disk is full
Solution: Check disk space and folder permissions
Problem: Classes are severely imbalanced
Cause: Natural dataset distribution
Solution: Use the class balancing cutoff to cap images per class