Cluster Dataset
Grouping similar images into manageable clusters for review and quality control
The Cluster Dataset screen groups visually similar images within a class folder into numbered subfolders. This is useful for reviewing large datasets where manually inspecting every image is impractical. By organizing images into smaller groups, you can quickly scan each cluster for misclassified or low-quality images.
After clustering, you can also rename the groups sequentially — for example, after manually removing unwanted images from certain clusters.
Prerequisites
An open project.
A class folder containing images to cluster. This can be a folder from your split dataset (e.g., train/OK/) or any folder with coffee bean images.
Step-by-Step Walkthrough
1. Select the Class Folder
Click Select next to the folder field and choose the folder containing images to cluster. The detected image count is displayed after selection. The application validates that the folder exists on disk.
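The folder check and image count described above can be approximated with a few lines of Python. This is only a sketch: the function name count_images and the list of extensions are assumptions, since the formats the application actually recognizes are not documented here.

```python
from pathlib import Path

# Assumed set of image extensions; the application's real list may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(folder: str) -> int:
    """Return the number of image files directly inside `folder`."""
    path = Path(folder)
    # Mirror the "folder exists on disk" validation from the walkthrough.
    if not path.is_dir():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    return sum(
        1
        for entry in path.iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Calling count_images on a class folder before clustering gives a quick sanity check that the expected images are present. Subfolders are ignored in this sketch.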
2. Set Images Per Group
Enter the maximum number of images per cluster group in the Images per Group field. The default is 500.
Smaller values create more groups with fewer images each, making manual review faster per group but producing more folders.
Larger values create fewer groups with more images, better for large datasets where you want fewer clusters to browse.
Click Save after changing the value. A confirmation dialog appears if you navigate away with unsaved changes.
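To get a feel for how the Images per Group value affects the number of folders, the lower bound can be worked out directly: because each group holds at most the configured number of images, at least ceil(image count / images per group) groups are needed. The helper below is a hypothetical illustration, and the clustering script may produce more groups than this minimum.

```python
import math


def minimum_group_count(image_count: int, images_per_group: int = 500) -> int:
    """Smallest number of groups that can hold `image_count` images."""
    if images_per_group < 1:
        raise ValueError("images_per_group must be at least 1")
    # Each group holds at most `images_per_group` images, so round up.
    return math.ceil(image_count / images_per_group)
```

For example, 1,200 images with the default of 500 need at least three groups, while lowering the value to 100 raises the minimum to twelve.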
3. Run Clustering
Click Cluster Images to begin. The application launches the Python clustering script, which:
Extracts visual features from each image.
Groups images by similarity.
Creates numbered subfolders (e.g., group_000/, group_001/, group_002/) inside the selected class folder.
Moves images into their corresponding group.
Progress is streamed to the log monitor. When complete, the result folder opens automatically.
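The grouping and moving steps can be sketched as follows. This is not the application's clustering script, whose algorithm is not described here. It assumes feature vectors have already been extracted into a dictionary mapping file names to lists of numbers, and it swaps in a simple greedy nearest-neighbour ordering in place of whatever similarity method the real script uses. The function names are hypothetical.

```python
import math
import shutil
from pathlib import Path


def order_by_similarity(features: dict[str, list[float]]) -> list[str]:
    """Greedy ordering: each image is followed by its closest unvisited one."""
    remaining = sorted(features)
    if not remaining:
        return []
    ordered = [remaining.pop(0)]
    while remaining:
        last = features[ordered[-1]]
        # Pick the unvisited image whose feature vector is nearest (Euclidean).
        nearest = min(remaining, key=lambda name: math.dist(last, features[name]))
        remaining.remove(nearest)
        ordered.append(nearest)
    return ordered


def move_into_groups(
    class_folder: str,
    features: dict[str, list[float]],
    images_per_group: int = 500,
) -> list[Path]:
    """Move images into group_000/, group_001/, ... inside `class_folder`."""
    root = Path(class_folder)
    ordered = order_by_similarity(features)
    created = []
    # Cut the similarity ordering into chunks of at most `images_per_group`.
    for start in range(0, len(ordered), images_per_group):
        group_dir = root / f"group_{start // images_per_group:03d}"
        group_dir.mkdir(exist_ok=True)
        for name in ordered[start:start + images_per_group]:
            shutil.move(str(root / name), str(group_dir / name))
        created.append(group_dir)
    return created
```

The greedy ordering compares every image with every remaining one, so it is only practical for small folders. It is meant to show the shape of the workflow, not to match the script's speed or grouping quality.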
4. Review Clusters
After clustering:
Browse each numbered group folder.
Look for images that do not belong — misclassified samples, blurred images, or other anomalies.
Delete or move unwanted images out of the group folders.
5. Rename Groups (Optional)
After manually removing images from clusters, the group numbering may have gaps, or you may want sequential numbering that starts from a specific index. The Rename Groups operation renumbers all group folders sequentially.
Set the Rename Start Index — the number the first group will be assigned (default is 0).
Click Rename Groups.
All group folders are renamed in order, starting at the chosen index. With the default of 0, the result is group_000, group_001, etc.
This is particularly useful when you plan to merge clusters from different runs or when you need consistent numbering for downstream processing.
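The renumbering can be sketched in Python as a two-phase rename. The function name rename_groups, the temporary folder prefix, and the assumption that group names are group_ followed by at least three digits are all illustrative. The application's own implementation may differ.

```python
import re
from pathlib import Path

# Assumed naming pattern: "group_" followed by at least three digits.
GROUP_PATTERN = re.compile(r"^group_(\d{3,})$")


def rename_groups(class_folder: str, start_index: int = 0) -> list[str]:
    """Renumber group folders sequentially, beginning at `start_index`."""
    root = Path(class_folder)
    groups = sorted(
        (p for p in root.iterdir() if p.is_dir() and GROUP_PATTERN.match(p.name)),
        key=lambda p: int(GROUP_PATTERN.match(p.name).group(1)),
    )
    # Phase 1: move every folder to a temporary name so that a new name
    # can never collide with a folder that has not been renamed yet.
    staged = []
    for position, folder in enumerate(groups):
        temp = root / f"__renaming_{position}"
        folder.rename(temp)
        staged.append(temp)
    # Phase 2: assign the final sequential names in the original order.
    new_names = []
    for offset, temp in enumerate(staged):
        name = f"group_{start_index + offset:03d}"
        temp.rename(root / name)
        new_names.append(name)
    return new_names
```

The temporary names matter when the start index shifts folders onto names that are still in use, for example when group_000 must become group_001 while the old group_001 still exists.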
Stopping an Operation
Click Stop during either clustering or renaming to cancel. The sidebar is locked during execution.
Configuration Persistence
The selected folder and images-per-group value are stored in the project's config.yaml under the cluster section:
cluster.input_folder: last used class folder path
cluster.images_per_group: configured group size
These values are restored when you return to the screen.
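A script that needs the same settings could read them from the parsed configuration. The sketch below assumes config.yaml has already been loaded into a Python dictionary (for instance with a YAML library) and that a missing value falls back to the documented default of 500. The function name and the fallback behaviour are assumptions.

```python
from typing import Any, Optional

DEFAULT_IMAGES_PER_GROUP = 500  # default stated in the walkthrough


def read_cluster_settings(config: dict[str, Any]) -> tuple[Optional[str], int]:
    """Return (input_folder, images_per_group) from the cluster section."""
    # Treat a missing or empty "cluster" section as having no saved values.
    section = config.get("cluster") or {}
    folder = section.get("input_folder")
    size = int(section.get("images_per_group", DEFAULT_IMAGES_PER_GROUP))
    return folder, size
```

With a dictionary such as {"cluster": {"input_folder": "train/OK", "images_per_group": 200}}, the function returns the saved folder and group size. With an empty dictionary it returns no folder and the default size.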
Troubleshooting
Problem: "Cluster Images" button is disabled
Likely cause: No folder is selected, or the folder path is invalid
Solution: Select a valid folder with images
Problem: Very few groups created
Likely cause: Images per group is set too high
Solution: Lower the images-per-group value
Problem: Clustering takes a long time
Likely cause: Very large dataset
Solution: This is expected for datasets with thousands of images; monitor progress in the log panel
Problem: Rename fails
Likely cause: Group folders were manually renamed to non-standard names
Solution: Ensure group folders follow the group_NNN naming pattern
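When a rename fails, it can help to list the subfolders that do not match the expected naming. The following sketch assumes group_NNN means group_ followed by at least three digits. The function name is hypothetical, and the application's exact rule may be stricter or looser.

```python
import re
from pathlib import Path

# Assumed reading of "group_NNN": the prefix plus at least three digits.
STANDARD_NAME = re.compile(r"^group_\d{3,}$")


def nonstandard_subfolders(class_folder: str) -> list[str]:
    """Return names of subfolders that do not follow the group_NNN pattern."""
    root = Path(class_folder)
    return sorted(
        entry.name
        for entry in root.iterdir()
        if entry.is_dir() and not STANDARD_NAME.match(entry.name)
    )
```

An empty result means every subfolder matches the assumed pattern. Any names returned are candidates to rename by hand before running Rename Groups again.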