Cluster Dataset

Grouping similar images into manageable clusters for review and quality control

The Cluster Dataset screen groups visually similar images within a class folder into numbered subfolders. This is useful for reviewing large datasets where manually inspecting every image is impractical. By organizing images into smaller groups, you can quickly scan each cluster for misclassified or low-quality images.

After clustering, you can also rename the groups sequentially — for example, after manually removing unwanted images from certain clusters.

Prerequisites

An open project.
A class folder containing images to cluster. This can be a folder from your split dataset (e.g., train/OK/) or any folder with coffee bean images.

Step-by-Step Walkthrough

1. Select the Class Folder

Click Select next to the folder field and choose the folder containing images to cluster. The detected image count is displayed after selection. The application validates that the folder exists on disk.
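The folder check and image count described above can be approximated with a short sketch. This is not the application's code: the function name `count_images` and the extension list are assumptions made for illustration, and the application may recognize a different set of file types.

```python
from pathlib import Path

# Assumed set of image extensions; the application's actual list may differ.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".bmp"}


def count_images(folder):
    """Return the number of image files directly inside `folder`.

    Raises FileNotFoundError if the folder does not exist on disk,
    mirroring the validation the screen performs after selection.
    """
    folder = Path(folder)
    if not folder.is_dir():
        raise FileNotFoundError(f"Folder does not exist: {folder}")
    return sum(
        1
        for entry in folder.iterdir()
        if entry.is_file() and entry.suffix.lower() in IMAGE_EXTENSIONS
    )
```

Running a check like this before clustering is a quick way to confirm that the folder you picked actually contains the images you expect.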

2. Set Images Per Group

Enter the maximum number of images per cluster group in the Images per Group field. The default is 500.

Smaller values create more groups with fewer images each, which makes each group faster to review but produces more folders.
Larger values create fewer groups with more images each, which suits large datasets where you want fewer clusters to browse.
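To get a feel for the trade-off, the arithmetic below gives the smallest number of groups a setting can produce. Because the value is a maximum per group, the clustering step may create more groups than this lower bound. The function name `min_group_count` is hypothetical.

```python
def min_group_count(image_count, images_per_group=500):
    """Lower bound on the number of groups: ceil(image_count / images_per_group)."""
    if images_per_group < 1:
        raise ValueError("images_per_group must be at least 1")
    # Integer ceiling division, avoids floating-point rounding.
    return (image_count + images_per_group - 1) // images_per_group


# 1,800 images with the default of 500 per group need at least 4 groups;
# lowering the value to 100 raises that to at least 18.
print(min_group_count(1800, 500))  # 4
print(min_group_count(1800, 100))  # 18
```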

Click Save after changing the value. A confirmation dialog appears if you navigate away with unsaved changes.

3. Run Clustering

Click Cluster Images to begin. The application launches the Python clustering script, which:

Extracts visual features from each image.
Groups images by similarity.
Creates numbered subfolders (e.g., group_000/, group_001/, group_002/) inside the selected class folder.
Moves images into their corresponding group.

Progress is streamed to the log monitor. When complete, the result folder opens automatically.
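The folder-creation and file-moving steps can be sketched as follows. This is not the application's clustering script: feature extraction and similarity grouping are assumed to have already happened and to have produced `ordered_images`, a hypothetical list of image paths sorted so that similar images sit next to each other. Splitting that list into fixed-size chunks is likewise an assumption made for illustration.

```python
import shutil
from pathlib import Path


def move_into_groups(class_folder, ordered_images, images_per_group=500):
    """Move images into group_000/, group_001/, ... inside `class_folder`.

    `ordered_images` is assumed to be pre-sorted by visual similarity;
    each consecutive chunk of at most `images_per_group` becomes one group.
    """
    if images_per_group < 1:
        raise ValueError("images_per_group must be at least 1")
    class_folder = Path(class_folder)
    created = []
    for start in range(0, len(ordered_images), images_per_group):
        group_dir = class_folder / f"group_{start // images_per_group:03d}"
        group_dir.mkdir(exist_ok=True)
        for image in ordered_images[start:start + images_per_group]:
            image = Path(image)
            # Files are moved, not copied, so the class folder ends up
            # containing only the numbered group subfolders.
            shutil.move(str(image), str(group_dir / image.name))
        created.append(group_dir)
    return created
```

Because the images are moved rather than copied, work on a copy of the class folder if you need to keep the original flat layout.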

4. Review Clusters

After clustering:

Browse each numbered group folder.
Look for images that do not belong — misclassified samples, blurred images, or other anomalies.
Delete or move unwanted images out of the group folders.
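During review it can help to see how many files remain in each group, for example to spot groups that have been emptied. The sketch below is a convenience helper, not part of the application; the name `group_sizes` is hypothetical, and it assumes group folders start with the `group_` prefix shown earlier.

```python
from pathlib import Path


def group_sizes(class_folder):
    """Map each group folder name to the number of files it contains."""
    class_folder = Path(class_folder)
    return {
        group.name: sum(1 for item in group.iterdir() if item.is_file())
        for group in sorted(class_folder.iterdir())
        if group.is_dir() and group.name.startswith("group_")
    }
```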

5. Rename Groups (Optional)

After you remove images or delete whole group folders during review, the group numbering may have gaps, or you may want the numbering to start from a specific index. The Rename Groups operation renumbers all group folders sequentially.

Set the Rename Start Index — the number the first group will be assigned (default is 0).
Click Rename Groups.
All group folders are renamed in order, beginning at the start index. With the default of 0, the result is group_000, group_001, and so on.

This is particularly useful when you plan to merge clusters from different runs or when you need consistent numbering for downstream processing.
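A minimal sketch of sequential renumbering is shown below, assuming group folders are named `group_` followed by digits. It is an illustration, not the application's implementation: the function name `rename_groups`, the temporary `__renaming_` prefix, and the two-pass approach are all assumptions. The two passes ensure a target name never collides with a folder that has not been renamed yet.

```python
import re
from pathlib import Path

GROUP_PATTERN = re.compile(r"^group_(\d+)$")


def rename_groups(class_folder, start_index=0):
    """Renumber group_* folders sequentially, starting at `start_index`."""
    class_folder = Path(class_folder)
    # Keep the existing order by sorting on the numeric suffix.
    groups = sorted(
        (p for p in class_folder.iterdir()
         if p.is_dir() and GROUP_PATTERN.match(p.name)),
        key=lambda p: int(GROUP_PATTERN.match(p.name).group(1)),
    )
    # Pass 1: move every folder to a temporary name.
    staged = []
    for offset, path in enumerate(groups):
        temp = path.with_name(f"__renaming_{offset}")
        path.rename(temp)
        staged.append(temp)
    # Pass 2: assign the final sequential names.
    renamed = []
    for offset, temp in enumerate(staged):
        final = temp.with_name(f"group_{start_index + offset:03d}")
        temp.rename(final)
        renamed.append(final.name)
    return renamed
```

For example, folders group_000, group_002, and group_005 become group_000, group_001, and group_002 with a start index of 0, or group_010, group_011, and group_012 with a start index of 10.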

Stopping an Operation

Click Stop during either clustering or renaming to cancel the running operation. The sidebar is locked while an operation is running.

Configuration Persistence

The selected folder and images-per-group value are stored in the project's config.yaml under the cluster section:

cluster.input_folder — last used class folder path
cluster.images_per_group — configured group size

These values are restored when you return to the screen.
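The dotted names above presumably correspond to keys nested under a `cluster` mapping. The sketch below shows that layout as a Python dictionary, as it might look after config.yaml has been parsed, together with a hypothetical `get_setting` helper for dotted lookups. The example values are illustrative only, and the file may contain other sections not shown here.

```python
def get_setting(config, dotted_key, default=None):
    """Look up a dotted key such as 'cluster.images_per_group' in nested dicts."""
    node = config
    for part in dotted_key.split("."):
        if not isinstance(node, dict) or part not in node:
            return default
        node = node[part]
    return node


# Assumed shape of the parsed config; values are examples, not real project data.
config = {
    "cluster": {
        "input_folder": "train/OK",
        "images_per_group": 500,
    }
}

images_per_group = get_setting(config, "cluster.images_per_group", default=500)
input_folder = get_setting(config, "cluster.input_folder")
```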

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| "Cluster Images" button is disabled | No folder selected or folder path is invalid | Select a valid folder with images |
| Very few groups created | Images per group is set too high | Lower the images-per-group value |
| Clustering takes a long time | Very large dataset | This is expected for datasets with thousands of images; monitor progress in the log panel |
| Rename fails | Group folders were manually renamed to non-standard names | Ensure group folders follow the group_NNN naming pattern |
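For the rename failure in particular, a quick way to find the offending folders is to list every subfolder whose name does not fit the pattern. The sketch below assumes that group_NNN means `group_` followed by at least three digits; the exact rule the application enforces is not documented here, and the function name `nonstandard_folders` is hypothetical.

```python
import re
from pathlib import Path

# Assumed interpretation of the group_NNN pattern: three or more digits.
GROUP_NAME = re.compile(r"group_\d{3,}")


def nonstandard_folders(class_folder):
    """Return the names of subfolders that do not match the group_NNN pattern."""
    return sorted(
        p.name
        for p in Path(class_folder).iterdir()
        if p.is_dir() and not GROUP_NAME.fullmatch(p.name)
    )
```

Any folder reported by this check can be renamed back to the expected pattern, or moved elsewhere, before you retry Rename Groups.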


Last updated 6 days ago
