Split Dataset

Splitting your image dataset into training, validation, and test sets

The Split Dataset screen divides your original image dataset into three subsets used during the machine learning workflow: training (75%), validation (15%), and test (10%). The ratio is fixed. Keeping the subsets separate means the model is trained on one portion of the data, tuned on another, and evaluated on a set it never saw during training or tuning.

Prerequisites

Before splitting your dataset, ensure that:

  • You have an open project (the screen is disabled otherwise).

  • Your image dataset is organized into class folders, where each subfolder name represents a class label. For example:

Original/
  OK/
    img_001.jpg
    img_002.jpg
    ...
  Defect_Black/
    img_001.jpg
    ...
  Defect_Sour/
    img_001.jpg
    ...
  • Images are in a supported format: .jpg, .jpeg, .png, .bmp, .tiff, or .webp.
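
If you want to check a dataset folder against these prerequisites before opening it in the application, a minimal sketch along the following lines may help. It is not the application's own code: the function name `count_images_per_class` is hypothetical, and it assumes that extensions are matched case-insensitively and that images sit directly inside each class folder (no nested subfolders).

```python
from pathlib import Path

# Extensions listed in the prerequisites above.
SUPPORTED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp", ".tiff", ".webp"}


def count_images_per_class(dataset_root):
    """Return {class_name: image_count} for each class subfolder of dataset_root.

    Assumption: only files directly inside a class folder are counted,
    and the extension check ignores case.
    """
    root = Path(dataset_root)
    counts = {}
    for class_dir in sorted(p for p in root.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in SUPPORTED_EXTENSIONS
        )
    return counts
```

Calling `count_images_per_class("Original")` on the layout shown above would return one entry per class folder. A class that maps to 0 most likely contains no files in a supported format.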

Step-by-Step Walkthrough

1. Select the Original Dataset Folder

Click Select next to the first field to choose the folder containing your class-organized images. This is the source dataset that will be read during splitting.

When you create a new project, the default path is set to {project}/Dataset/Original/V00. You can change this to point to any folder on your system.

After selecting the folder, the class distribution chart on the right panel updates to show a bar chart of how many images exist in each class. Use this to understand the balance of your dataset before splitting.

2. Select the Split Output Folder

Click Select next to the second field to choose where the split results will be saved. The application creates train/, val/, and test/ subfolders inside this location, each mirroring the class folder structure.

The default output path is {project}/Dataset/Split/V00. If you re-run the split with different settings, consider creating a new version folder (e.g., V01) to preserve previous splits.
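
If you create version folders often, picking the next free name can be scripted. The sketch below is an illustration only: `next_version_folder` is a hypothetical helper, and it assumes version folders follow the two-digit `V00`, `V01`, ... pattern suggested by the default paths. It only computes the path and does not create the folder.

```python
import re
from pathlib import Path

# Assumed naming scheme: "V" followed by digits, as in V00 and V01.
VERSION_PATTERN = re.compile(r"^V(\d+)$")


def next_version_folder(split_root):
    """Return the path of the next unused V## folder under split_root."""
    root = Path(split_root)
    existing = []
    if root.is_dir():
        for p in root.iterdir():
            match = VERSION_PATTERN.match(p.name)
            if p.is_dir() and match:
                existing.append(int(match.group(1)))
    # Start at V00 when no version folder exists yet.
    next_number = max(existing) + 1 if existing else 0
    return root / f"V{next_number:02d}"
```

With `V00` and `V01` already present under the split folder, this returns a path ending in `V02`, which you can create and then select as the split output folder.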

3. Configure Class Balancing (Optional)

By default, the split respects the natural distribution of your dataset: classes with more images produce more samples in each subset. You can set a cutoff to cap the number of images per class, which helps balance an unevenly distributed dataset.

There are two modes, controlled by a toggle switch:

Use Minimum Class Count (toggle ON): The application automatically sets the cutoff to the image count of the smallest class. This produces a perfectly balanced dataset, but images above the cutoff in larger classes are left out of the split.

Type Value (toggle OFF): Enter a cutoff manually and click Save to apply it. Each class contributes at most this many images to the split.

The cutoff value appears as a red dashed reference line on the class distribution chart, so you can visualize how many images will be included versus excluded.
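
The effect of the two modes can be worked through with a small sketch. The function names `resolve_cutoff` and `apply_cutoff` are hypothetical and the class counts are made-up numbers for illustration. The code only reproduces the arithmetic described above, not the application's implementation.

```python
def resolve_cutoff(class_counts, use_minimum=True, typed_value=None):
    """Return the per-class cap, or None when no cutoff is configured."""
    if use_minimum:
        # "Use Minimum Class Count": cap every class at the smallest class size.
        return min(class_counts.values())
    # "Type Value": use the manually entered number (None means no cap).
    return typed_value


def apply_cutoff(class_counts, cutoff):
    """Return {class_name: (included, excluded)} for the given cutoff."""
    result = {}
    for name, count in class_counts.items():
        included = count if cutoff is None else min(count, cutoff)
        result[name] = (included, count - included)
    return result


# Illustrative counts only.
counts = {"OK": 1200, "Defect_Black": 300, "Defect_Sour": 450}
balanced = apply_cutoff(counts, resolve_cutoff(counts))
capped = apply_cutoff(counts, resolve_cutoff(counts, use_minimum=False, typed_value=400))
```

Here `balanced` keeps 300 images in every class, leaving out 900 from OK and 150 from Defect_Sour. `capped` keeps 400 from OK and Defect_Sour and all 300 from Defect_Black, so a typed value above the smallest class count reduces the imbalance without removing it.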

4. Run the Split

Click Split Dataset to begin. The application:

  1. Validates that all required paths are set and the folders exist.

  2. Removes any existing output in the split folder (to ensure a clean split).

  3. Launches the Python splitting script.

  4. Streams progress logs to the right panel in real time.

The status indicator below the button shows the current state: Idle, Running, Completed, Error, or Cancelled.

A progress bar appears during the operation. You can click Stop Splitting at any time to cancel.
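
The four steps above can be sketched with the Python standard library. This is an approximation for readers who want to understand or reproduce the flow, not the application's code: `run_split`, `script_args`, and `on_log` are hypothetical names, and the actual script path and its arguments are not documented here.

```python
import shutil
import subprocess
import sys
from pathlib import Path


def run_split(original_dir, output_dir, script_args, on_log=print):
    """Validate paths, clear old output, run the script, and stream its logs.

    script_args is a placeholder for the script path and its options.
    Returns "Completed" or "Error".
    """
    original = Path(original_dir)
    output = Path(output_dir)

    # 1. Validate that the required folders exist.
    if not original.is_dir() or not output.is_dir():
        return "Error"

    # 2. Remove existing output so the new split starts clean.
    for child in output.iterdir():
        if child.is_dir():
            shutil.rmtree(child)
        else:
            child.unlink()

    # 3. Launch the splitting script as a child process.
    process = subprocess.Popen(
        [sys.executable, *script_args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,
        text=True,
    )

    # 4. Forward each log line as it arrives. A Python child may buffer its
    #    output when piped; running it with "-u" makes it unbuffered.
    for line in process.stdout:
        on_log(line.rstrip("\n"))

    return "Completed" if process.wait() == 0 else "Error"
```

Because step 2 deletes everything in the output folder, point the output at a new version folder if you want to keep an earlier split. A cancel action in this sketch would correspond to calling `process.terminate()` from another thread.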

Understanding the Class Distribution Chart

The bar chart on the right panel visualizes your dataset composition:

  • X-axis: Class names (from folder names)

  • Y-axis: Image count per class

  • Bar colors: Gradient from light to dark blue, ordered by class

  • Red dashed line: The cutoff threshold (if configured)

  • Total count: Displayed above the chart header

Click the refresh button (circular arrow icon) to reload the distribution if you have modified the dataset outside the application.

Output Structure

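After a successful split, the output folder contains train/, val/, and test/ subfolders, each mirroring the class folder structure of the original dataset. For the example classes shown in the prerequisites:

{output folder}/
  train/
    OK/
    Defect_Black/
    Defect_Sour/
  val/
    OK/
    Defect_Black/
    Defect_Sour/
  test/
    OK/
    Defect_Black/
    Defect_Sour/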

The ratio is fixed at 75% train / 15% validation / 10% test, applied per class to maintain proportional representation.
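
To see what the per-class ratio means for a single class, consider the sketch below. It is an assumption-laden illustration: `split_class` is a hypothetical function, and the seeded shuffle, the rounding down of the train and validation counts, and the rule that leftover images go to test are all choices made for this example. The application's script may handle shuffling and rounding differently.

```python
import random


def split_class(files, seed=0):
    """Shuffle one class's files and split them 75% / 15% / 10%.

    Assumption: train and val counts are rounded down and the
    remainder goes to test.
    """
    shuffled = sorted(files)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding surprises.
    n_train = len(shuffled) * 75 // 100
    n_val = len(shuffled) * 15 // 100
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }
```

A class with 100 images yields 75 training, 15 validation, and 10 test images. With 20 images the counts under these rounding assumptions are 15, 3, and 2. Applying the function to each class separately is what keeps every class represented in all three subsets.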

Configuration Persistence

All settings on this screen (original path, output path, cutoff value) are saved automatically to the project's config.yaml file. If you navigate away and return, your previous configuration is restored.

If you modify the cutoff value without saving and attempt to navigate away, a confirmation dialog appears asking whether to save or discard your changes.

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| "Split Dataset" button is disabled | No project is open | Open or create a project from the Home screen |
| Chart shows "No data to display" | Dataset folder is empty or path is incorrect | Verify that the selected folder contains class subfolders with images |
| Split fails with an error | Output folder is on a read-only drive, or disk is full | Check disk space and folder permissions |
| Classes are severely imbalanced | Natural dataset distribution | Use the class balancing cutoff to cap images per class |
