Test AI Model

Evaluating a trained model's performance against a held-out test dataset

The Test screen evaluates a trained model's classification performance by running it against your held-out test dataset. The test produces accuracy metrics, confusion matrices, ROC curves, and precision-recall statistics that help you decide whether the model is ready for production deployment.

Prerequisites

  • An open project with a completed training session (at least one checkpoint in the Checkpoints folder).

  • A test dataset — typically the test/ subfolder created by the Split Dataset step.

Step-by-Step Walkthrough

1. Verify the Test Dataset Folder

The test dataset path is automatically derived from your split dataset configuration. By default, it points to the test/ subfolder of your split output (e.g., Dataset/Split/V00/test/).

If you need to use a different test set, click Select to choose an alternative folder. The folder must contain class subfolders that match the classes the model was trained on.
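If you want to sanity-check a folder before pointing the Test screen at it, the class-matching rule above can be sketched like this. `check_test_classes` is a hypothetical helper, not part of the application; it just compares subfolder names against the class list the model was trained on:

```python
from pathlib import Path

def check_test_classes(test_dir, trained_classes):
    """Compare class subfolders in a test dataset against the trained
    class list. Returns (missing, extra): classes the model expects but
    the folder lacks, and subfolders the model was never trained on."""
    found = {p.name for p in Path(test_dir).iterdir() if p.is_dir()}
    trained = set(trained_classes)
    return sorted(trained - found), sorted(found - trained)
```

Both lists should come back empty for a valid test folder; anything in either list will cause a class mismatch at test time.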

2. Select the Checkpoint

Click Select to choose which .ckpt checkpoint to evaluate. Checkpoints are located in your project's Checkpoints/version_XX/ folders.


Each training session produces one or more checkpoints. The "best" checkpoint (lowest validation loss) is typically named with the epoch and loss value. The last.ckpt file represents the final training state regardless of performance.
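If you prefer to pick the best checkpoint programmatically rather than by eye, the naming convention described above can be parsed. The sketch below assumes a filename pattern like `epoch=12-val_loss=0.034.ckpt` (a common PyTorch Lightning convention; adjust the regex if your checkpoints are named differently), and `pick_best_checkpoint` is an illustrative helper, not an application function:

```python
import re

def pick_best_checkpoint(filenames):
    """Among checkpoint names embedding a validation loss (assumed
    pattern: 'val_loss=<float>'), return the one with the lowest loss.
    'last.ckpt' has no loss in its name, so it is skipped here."""
    best, best_loss = None, float("inf")
    for name in filenames:
        m = re.search(r"val_loss=(\d+\.\d+)", name)
        if m and float(m.group(1)) < best_loss:
            best, best_loss = name, float(m.group(1))
    return best
```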

3. Run the Test

Click Start Test to begin evaluation. The application launches the Python test script with the selected checkpoint and dataset. Progress is streamed to the log monitor.

The test script:

  1. Loads the trained model from the checkpoint.

  2. Runs inference on every image in the test dataset.

  3. Compares predictions against the ground-truth labels (folder names).

  4. Computes classification metrics.

  5. Generates visualization plots.
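The core of steps 2 through 4 can be sketched as a single evaluation loop. This is a minimal illustration of the idea, not the application's actual test script: `model` stands in for the loaded checkpoint (any callable mapping an image to a predicted label), and `samples` for the (image, ground-truth label) pairs read from the class subfolders:

```python
from collections import Counter

def evaluate(model, samples):
    """Run inference on every (image, true_label) pair, tally a
    confusion count per (truth, prediction) pair, and return overall
    accuracy alongside the confusion counts."""
    confusion = Counter()
    correct = 0
    for image, truth in samples:
        pred = model(image)
        confusion[(truth, pred)] += 1
        correct += (pred == truth)
    accuracy = correct / len(samples)
    return accuracy, confusion
```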

Stopping a Test

Click Stop Test to cancel a running evaluation. The sidebar remains locked while the test executes.

Understanding Results

After testing completes, the results are saved to a test_metric_plots/test_data_XX/ subfolder within the checkpoint's version directory. These include:

Metrics Generated

  • Overall accuracy — Percentage of correctly classified images across all classes.

  • Per-class precision — Of all images predicted as a given class, what fraction were correct.

  • Per-class recall — Of all images belonging to a given class, what fraction were correctly identified.

  • F1 score — Harmonic mean of precision and recall for each class.

  • Confusion matrix — Visual grid showing prediction vs. ground truth for every class pair.

  • ROC curves — Receiver Operating Characteristic curves per class.
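The precision, recall, and F1 definitions above all fall out of the confusion matrix. The following sketch shows the standard formulas applied to confusion counts keyed by (true label, predicted label); it illustrates how the metrics relate, not the application's exact implementation:

```python
def per_class_metrics(confusion, classes):
    """For each class c: TP = correctly predicted c, FP = predicted c
    but actually another class, FN = actually c but predicted otherwise.
    Precision = TP/(TP+FP), recall = TP/(TP+FN), F1 = harmonic mean."""
    metrics = {}
    for c in classes:
        tp = confusion.get((c, c), 0)
        fp = sum(v for (t, p), v in confusion.items() if p == c and t != c)
        fn = sum(v for (t, p), v in confusion.items() if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        metrics[c] = {"precision": precision, "recall": recall, "f1": f1}
    return metrics
```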


For a detailed explanation of these metrics and how to interpret them for coffee classification, see the Fundamentals section.

Result Files

The test output folder contains:

  • test_metrics.json — Raw metric values in JSON format.

  • confusion_matrix.png — Confusion matrix heatmap.

  • roc_curve.png — ROC curves for each class.

  • precision_recall.png — Precision-recall curves.

These files are also used by the Export screen when generating PDF reports.
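Because test_metrics.json holds the raw values, you can load it for your own comparisons across runs. The file name comes from the table above, but the key names inside are whatever the test script wrote, so inspect one file before relying on specific keys. `load_test_metrics` is an illustrative helper:

```python
import json
from pathlib import Path

def load_test_metrics(results_dir):
    """Read the raw metrics JSON from a test output folder
    (e.g., a test_metric_plots/test_data_XX/ directory)."""
    path = Path(results_dir) / "test_metrics.json"
    with open(path) as f:
        return json.load(f)
```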

Iterative Testing

You can run multiple test sessions against the same checkpoint or test different checkpoints against the same dataset. Each run creates a new test_data_XX subfolder, preserving previous results.

This is useful for:

  • Comparing the "best" checkpoint vs. the "last" checkpoint.

  • Testing the model against different test sets (e.g., different coffee origins or crop years).

  • Tracking improvement across training iterations.
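The result-preserving numbering described above can be sketched as follows. This assumes the zero-padded `test_data_XX` naming shown earlier; `next_test_run_dir` is a hypothetical helper illustrating the scheme, not application code:

```python
import re
from pathlib import Path

def next_test_run_dir(plots_dir):
    """Return the next unused test_data_XX folder name so that earlier
    runs are never overwritten."""
    existing = []
    for p in Path(plots_dir).glob("test_data_*"):
        m = re.fullmatch(r"test_data_(\d+)", p.name)
        if m:
            existing.append(int(m.group(1)))
    return f"test_data_{max(existing, default=-1) + 1:02d}"
```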

Troubleshooting

  • Test fails with a class mismatch — Cause: the test folder has different classes than the training data. Solution: ensure the test folder contains the same class subfolders used during training.

  • Test completes but accuracy is very low — Cause: the model is undertrained, the dataset is noisy, or the test set comes from a different distribution. Solution: review the training metrics, clean the dataset, or retrain with more data.

  • Test is slow — Cause: a large test dataset or no GPU. Solution: testing runs on the GPU when one is available; reduce the test set size if you only need a quick validation.
