Test AI Model
Evaluating a trained model's performance against a held-out test dataset
The Test screen evaluates a trained model's classification performance by running it against your held-out test dataset. The test produces accuracy metrics, confusion matrices, ROC curves, and precision-recall statistics that help you decide whether the model is ready for production deployment.
Prerequisites
An open project with a completed training session (at least one checkpoint in the Checkpoints folder).
A test dataset — typically the test/ subfolder created by the Split Dataset step.
Step-by-Step Walkthrough
1. Verify the Test Dataset Folder
The test dataset path is automatically derived from your split dataset configuration. By default, it points to the test/ subfolder of your split output (e.g., Dataset/Split/V00/test/).
If you need to use a different test set, click Select to choose an alternative folder. The folder must contain class subfolders that match the classes the model was trained on.
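A quick pre-flight check along these lines can catch a class mismatch before you start a run. This is only a sketch: `validate_test_folder` and `trained_classes` are illustrative names, not part of the application.

```python
from pathlib import Path

def validate_test_folder(test_dir, trained_classes):
    """Compare the test folder's class subfolders against the trained classes.

    Returns (missing, extra): classes the model expects but the folder lacks,
    and folders present that the model was never trained on.
    """
    found = {p.name for p in Path(test_dir).iterdir() if p.is_dir()}
    missing = set(trained_classes) - found
    extra = found - set(trained_classes)
    return missing, extra
```

An empty `missing` and `extra` means the folder layout matches what the model expects.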
2. Select the Checkpoint
Click Select to choose which .ckpt checkpoint to evaluate. Checkpoints are located in your project's Checkpoints/version_XX/ folders.
Each training session produces one or more checkpoints. The "best" checkpoint (lowest validation loss) is typically named with the epoch and loss value. The last.ckpt file represents the final training state regardless of performance.
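If you script checkpoint selection, the validation loss can often be parsed from the filename. The naming scheme below (e.g. `epoch=12-val_loss=0.0834.ckpt`) is an assumption, not confirmed by this document; adjust the regular expression to whatever your checkpoints are actually called.

```python
import re
from pathlib import Path

# Assumed filename pattern; change to match your real checkpoint names.
LOSS_RE = re.compile(r"val_loss=(\d+\.\d+)")

def best_checkpoint(version_dir):
    """Return the .ckpt with the lowest validation loss encoded in its
    filename, falling back to last.ckpt when no loss-tagged file exists."""
    candidates = []
    for ckpt in Path(version_dir).glob("*.ckpt"):
        m = LOSS_RE.search(ckpt.name)
        if m:
            candidates.append((float(m.group(1)), ckpt))
    if candidates:
        return min(candidates)[1]
    return Path(version_dir) / "last.ckpt"
```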
3. Run the Test
Click Start Test to begin evaluation. The application launches the Python test script with the selected checkpoint and dataset. Progress is streamed to the log monitor.
The test script:
Loads the trained model from the checkpoint.
Runs inference on every image in the test dataset.
Compares predictions against the ground-truth labels (folder names).
Computes classification metrics.
Generates visualization plots.
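The core of that loop can be sketched as follows. `predict` stands in for the real model inference (loading the .ckpt and running a forward pass); only the folder-names-as-ground-truth convention comes from this document.

```python
from pathlib import Path

IMAGE_EXTS = frozenset({".jpg", ".jpeg", ".png"})

def evaluate(test_dir, predict):
    """Score a predict(image_path) -> class_name function against the
    ground-truth labels given by the class subfolder names."""
    correct = total = 0
    records = []  # (ground_truth, prediction) pairs for later metrics
    for class_dir in sorted(Path(test_dir).iterdir()):
        if not class_dir.is_dir():
            continue
        for img in sorted(class_dir.iterdir()):
            if img.suffix.lower() not in IMAGE_EXTS:
                continue
            pred = predict(img)
            records.append((class_dir.name, pred))
            correct += int(pred == class_dir.name)
            total += 1
    return {"accuracy": correct / total if total else 0.0, "records": records}
```

The collected pairs are what the confusion matrix and per-class metrics are computed from.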
Stopping a Test
Click Stop Test to cancel. The sidebar is locked during execution.
Understanding Results
After testing completes, the results are saved to a test_metric_plots/test_data_XX/ subfolder within the checkpoint's version directory. These include:
Metrics Generated
Overall accuracy — Percentage of correctly classified images across all classes.
Per-class precision — Of all images predicted as a given class, what fraction were correct.
Per-class recall — Of all images belonging to a given class, what fraction were correctly identified.
F1 score — Harmonic mean of precision and recall for each class.
Confusion matrix — Visual grid showing prediction vs. ground truth for every class pair.
ROC curves — Receiver Operating Characteristic curves per class.
For a detailed explanation of these metrics and how to interpret them for coffee classification, see the Fundamentals section.
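As a rough illustration of how per-class precision, recall, and F1 fall out of the (ground truth, prediction) pairs, independent of any particular framework:

```python
from collections import Counter

def per_class_metrics(pairs):
    """Compute per-class precision, recall, and F1 from a list of
    (ground_truth, prediction) pairs."""
    classes = sorted({c for pair in pairs for c in pair})
    confusion = Counter(pairs)  # (truth, pred) -> count
    metrics = {}
    for c in classes:
        tp = confusion[(c, c)]
        fp = sum(confusion[(t, c)] for t in classes if t != c)  # predicted c, wasn't
        fn = sum(confusion[(c, p)] for p in classes if p != c)  # was c, missed
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```

The `confusion` counter here is exactly the data behind the confusion-matrix heatmap: one cell per (truth, prediction) class pair.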
Result Files
The test output folder contains:
test_metrics.json — Raw metric values in JSON format
confusion_matrix.png — Confusion matrix heatmap
roc_curve.png — ROC curves for each class
precision_recall.png — Precision-recall curves
These files are also used by the Export screen when generating PDF reports.
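If you post-process results yourself, the raw values can be read back from test_metrics.json. The exact schema is whatever the test script writes and is not documented here, so this sketch simply assumes a name-to-value mapping.

```python
import json
from pathlib import Path

def load_test_metrics(test_data_dir):
    """Load the raw metric values saved by a test run.

    Assumes test_metrics.json is a flat name -> value mapping;
    inspect a real file to confirm the actual schema.
    """
    with open(Path(test_data_dir) / "test_metrics.json") as f:
        return json.load(f)
```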
Iterative Testing
You can run multiple test sessions against the same checkpoint or test different checkpoints against the same dataset. Each run creates a new test_data_XX subfolder, preserving previous results.
This is useful for:
Comparing the "best" checkpoint vs. the "last" checkpoint.
Testing the model against different test sets (e.g., different coffee origins or crop years).
Tracking improvement across training iterations.
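The preserve-previous-results behavior amounts to picking the next free index. A sketch of that logic (the zero-padded two-digit numbering is an assumption based on the test_data_XX pattern):

```python
import re
from pathlib import Path

def next_test_data_dir(plots_dir):
    """Choose the next test_data_XX subfolder name so earlier runs are kept."""
    pat = re.compile(r"test_data_(\d+)$")
    indices = []
    for p in Path(plots_dir).iterdir():
        m = pat.match(p.name)
        if p.is_dir() and m:
            indices.append(int(m.group(1)))
    n = max(indices, default=-1) + 1  # first run gets index 0
    return Path(plots_dir) / f"test_data_{n:02d}"
```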
Troubleshooting
Test fails with a class mismatch
Cause: The test folder contains different classes than the training data.
Fix: Ensure the test folder contains the same class subfolders used during training.
Test completes but accuracy is very low
Cause: The model is undertrained, the dataset is noisy, or the test set comes from a different distribution.
Fix: Review the training metrics, clean the dataset, or retrain with more data.
Test is slow
Cause: A large test dataset, or no GPU available.
Fix: Testing runs on the GPU when one is available; reduce the test set size for quick validation runs.