AI Model Metrics

Classification Fundamentals

Before diving into specific metrics, it is important to understand the two main paradigms that classification models follow.

Binary classification involves assigning each instance to one of exactly two classes — typically labeled as positive and negative. In coffee quality analysis, a binary classifier might label each seed as either OK (negative for defects) or NOK (positive for defects). The simplicity of two classes makes it straightforward to define metrics like precision, recall, and specificity.

Multiclass classification extends this to three or more mutually exclusive classes. Instead of a simple pass/fail decision, the model assigns each instance to one of N categories — for example, classifying seeds into specific defect types such as black, sour, broken, insect-damaged, or immature. Multiclass problems require additional considerations for how metrics are computed and aggregated, since each class has its own set of correct and incorrect predictions.

Note: Understanding whether you are dealing with a binary or multiclass problem is the first step in choosing the right evaluation metrics. Some metrics (like specificity) are straightforward in binary settings but require adaptation for multiclass scenarios.

The Confusion Matrix

The confusion matrix is the foundation of nearly all classification metrics. It is a table that summarizes how a model's predictions compare to the actual (true) labels.

Binary Confusion Matrix

For binary classification, the confusion matrix is a 2×2 table with four possible outcomes:

|                   | Predicted Positive  | Predicted Negative  |
| ----------------- | ------------------- | ------------------- |
| Actually Positive | True Positive (TP)  | False Negative (FN) |
| Actually Negative | False Positive (FP) | True Negative (TN)  |

  • True Positive (TP): The model correctly predicts the positive class. The instance is positive and the model says positive.

  • True Negative (TN): The model correctly predicts the negative class. The instance is negative and the model says negative.

  • False Positive (FP): The model incorrectly predicts positive when the instance is actually negative. Also called a Type I error or false alarm.

  • False Negative (FN): The model incorrectly predicts negative when the instance is actually positive. Also called a Type II error or missed detection.

Every classification metric — accuracy, precision, recall, specificity, F1-score — is derived from these four counts. The confusion matrix provides a complete picture of model performance that a single number cannot capture.
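The four counts can be tallied directly from paired label lists. A minimal sketch in plain Python (the helper name and the 0/1 encoding of NOK/OK are illustrative, not part of any particular tool):

```python
def binary_confusion_counts(y_true, y_pred, positive=1):
    """Tally TP, TN, FP, FN for a binary task (illustrative sketch)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn

# 1 = NOK (defective, positive class), 0 = OK (negative class)
y_true = [1, 0, 1, 1, 0, 0]
y_pred = [1, 0, 0, 1, 1, 0]
print(binary_confusion_counts(y_true, y_pred))  # (2, 2, 1, 1)
```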

Multiclass Confusion Matrix

For multiclass classification with N classes, the confusion matrix becomes an N×N table. Each row represents the actual class and each column represents the predicted class. The diagonal elements show correct predictions, while off-diagonal elements reveal the specific types of errors the model makes. By examining the off-diagonal entries, you can identify which classes the model confuses with one another — for example, whether the model frequently misclassifies sour seeds as immature seeds.
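As a sketch, such an N×N matrix can be built from paired label lists (the helper name and the defect classes in the example are illustrative):

```python
def confusion_matrix(y_true, y_pred, classes):
    """Build an N x N confusion matrix: rows = actual class, columns = predicted class."""
    idx = {c: i for i, c in enumerate(classes)}
    m = [[0] * len(classes) for _ in classes]
    for t, p in zip(y_true, y_pred):
        m[idx[t]][idx[p]] += 1
    return m

classes = ["black", "sour", "immature"]
y_true = ["sour", "sour", "black", "immature"]
y_pred = ["immature", "sour", "black", "immature"]
# The off-diagonal 1 in row "sour", column "immature" exposes a sour seed
# that the model misclassified as immature.
for row in confusion_matrix(y_true, y_pred, classes):
    print(row)
```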

[Figure: Example of a Confusion Matrix]

AI Metrics for Evaluating Machine Learning Models

AI metrics are essential tools for evaluating the performance of machine learning models, offering insights into their effectiveness and reliability. Accuracy measures the overall correctness of the model by calculating the proportion of correctly classified instances. Recall (or sensitivity) evaluates the model's ability to identify all relevant instances, making it particularly useful in detecting rare or critical cases. Specificity, on the other hand, assesses the model's ability to correctly reject irrelevant or negative instances, ensuring it avoids false positives. Additional metrics, such as precision, focus on the accuracy of positive predictions, while the F1-score balances precision and recall to provide a harmonic mean, especially useful when dealing with imbalanced datasets. Together, these metrics enable developers to fine-tune models for specific applications and optimize their real-world performance.

Key Metrics Explained:

  • Accuracy = (True Positives + True Negatives) / Total Instances. Accuracy measures the overall proportion of correct predictions across all classes. While intuitive, accuracy can be misleading when classes are imbalanced. Consider a dataset where 95% of coffee seeds are OK and only 5% are defective: a model that simply predicts every seed as OK achieves 95% accuracy while catching zero defects. This is known as the accuracy paradox — high accuracy does not necessarily mean a useful model (Provost et al., 1998).

  • Precision = True Positives / (True Positives + False Positives). Precision (positive predictive value) is the fraction of instances predicted as positive that are truly positive.

  • Recall (Sensitivity) = True Positives / (True Positives + False Negatives). Recall is the fraction of truly positive instances that the model successfully identifies. Sensitivity (True Positive Rate) refers to the probability of a positive prediction given that the instance is truly positive.

  • Specificity (True Negative Rate) = True Negatives / (True Negatives + False Positives). Specificity measures the probability of a negative prediction given that the instance is truly negative.

  • F1-Score = 2 × Precision × Recall / (Precision + Recall). The F1-score is the harmonic mean of precision and recall. It provides a single metric that balances both concerns, which is especially useful when you cannot afford to optimize for one at the expense of the other. The harmonic mean penalizes extreme values: if either precision or recall is very low, the F1-score will also be low, even if the other metric is high. This makes the F1-score a more reliable indicator of model quality than the arithmetic mean of precision and recall.
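These formulas translate directly into code. The helper below is an illustrative sketch (the function and key names are assumptions); the call reproduces the accuracy-paradox example from above — 95 OK seeds, 5 defective, and a model that predicts OK for everything:

```python
def classification_metrics(tp, tn, fp, fn):
    """Derive the key binary metrics from the four confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "specificity": specificity, "f1": f1}

# Accuracy paradox: predict every seed OK -> zero defects caught
m = classification_metrics(tp=0, tn=95, fp=0, fn=5)
print(m["accuracy"], m["recall"])  # 0.95 0.0
```

High accuracy coexists with zero recall on the defect class, which is exactly why precision, recall, and F1 are reported alongside accuracy.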

[Figure: Representation of Precision and Recall]

Averaging Strategies for Multiclass Metrics

In multiclass classification, precision, recall, and F1-score are computed per class. To summarize these into a single number, you must choose an averaging strategy. The choice of averaging method can significantly affect the reported performance, especially when class sizes differ.

  • Macro Average: Compute the metric independently for each class, then take the unweighted mean. Macro averaging treats all classes equally regardless of their size. This is useful when every class is equally important, even if some classes are rare.

  • Micro Average: Aggregate the contributions of all classes (sum all TPs, FPs, and FNs globally) and then compute the metric from the totals. Micro averaging gives more weight to classes with more instances and is equivalent to overall accuracy in multiclass settings.

  • Weighted Average: Compute the metric for each class, then take the mean weighted by the number of instances (support) in each class. Weighted averaging accounts for class imbalance by giving more influence to larger classes while still providing per-class granularity.
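The three strategies can be sketched for a single metric — precision is used here because, in a small example, all three averages can differ (the helper name and the toy labels are illustrative):

```python
def averaged_precision(y_true, y_pred, classes, average="macro"):
    """Per-class precision combined via macro / micro / weighted averaging (sketch)."""
    tp = {c: sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c) for c in classes}
    fp = {c: sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c) for c in classes}
    if average == "micro":  # pool all counts globally, then compute once
        return sum(tp.values()) / (sum(tp.values()) + sum(fp.values()))
    prec = {c: tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0 for c in classes}
    if average == "macro":  # unweighted mean: every class counts equally
        return sum(prec.values()) / len(classes)
    # weighted: mean weighted by each class's support (number of true instances)
    support = {c: sum(1 for t in y_true if t == c) for c in classes}
    return sum(prec[c] * support[c] for c in classes) / len(y_true)

y_true = ["ok", "ok", "ok", "sour"]
y_pred = ["ok", "sour", "sour", "sour"]
for avg in ("macro", "micro", "weighted"):
    print(avg, round(averaged_precision(y_true, y_pred, ["ok", "sour"], avg), 3))
```

Note that for recall the weighted average always collapses to the micro average (overall accuracy), which is another reason to state explicitly which metric and which averaging strategy a report uses.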

| Strategy | Treats all classes equally? | Sensitive to class imbalance?   | Best used when                                   |
| -------- | --------------------------- | ------------------------------- | ------------------------------------------------ |
| Macro    | Yes                         | No (rare classes count equally) | All classes are equally important                |
| Micro    | No                          | Yes (majority classes dominate) | Overall performance matters most                 |
| Weighted | No                          | Partially (weighted by support) | Class sizes vary but each sample matters equally |

Note: When reporting model performance, always specify which averaging strategy is used. Two models can appear to have very different F1-scores depending on whether macro or micro averaging is applied.

Class Imbalance

Class imbalance occurs when the number of samples differs substantially across classes. In many real-world applications, including coffee quality analysis, some categories are naturally more frequent than others — for instance, OK seeds may outnumber any single defect type by a factor of 10 or more.

Class imbalance matters because most machine learning algorithms optimize for overall accuracy, which biases the model toward the majority class. The model learns that predicting the dominant class most of the time yields high accuracy, at the expense of correctly identifying minority classes. This leads to:

  • High accuracy but low recall for rare classes

  • A confusion matrix with strong diagonal entries for majority classes and weak entries for minority classes

  • Metrics that appear strong on the surface but mask poor performance on the classes that matter most

Strategies for handling class imbalance include:

  • Oversampling: Increase the number of minority class samples (e.g., through duplication or synthetic generation via SMOTE)

  • Undersampling: Reduce the number of majority class samples to balance the dataset

  • Class weighting: Assign higher loss weights to minority classes during training so that errors on rare classes are penalized more heavily

  • Focal loss: A modified loss function that down-weights easy (well-classified) examples and focuses training on hard (misclassified) examples (Lin et al., 2017)

The choice of strategy depends on the dataset size, the degree of imbalance, and the specific application requirements. In practice, a combination of strategies often yields the best results (He and Garcia, 2009).

Overfitting and Generalization

A model's ultimate goal is to perform well on new, unseen data — not just the data it was trained on. Understanding how a model generalizes is essential for building reliable systems.

  • Overfitting occurs when a model memorizes the training data, including its noise and outliers, rather than learning the underlying patterns. An overfitted model achieves excellent performance on training data but performs poorly on new data. Signs of overfitting include a large gap between training accuracy and validation accuracy.

  • Underfitting occurs when a model is too simple to capture the underlying patterns in the data. An underfitted model performs poorly on both training and new data, indicating that it lacks the capacity or complexity to learn the task.

  • Generalization is the ability of a model to perform well on data it has never seen. A well-generalized model shows similar performance on training, validation, and test data.

Training, Validation, and Test Splits

To assess generalization, datasets are typically divided into three subsets:

  1. Training set: Used to train the model — the model learns patterns from this data.

  2. Validation set: Used during training to tune hyperparameters and monitor for overfitting. The model does not learn from this data directly, but design decisions are based on its performance here.

  3. Test set: Held out entirely until final evaluation. It provides an unbiased estimate of the model's performance on truly unseen data.

Learning curves — plots of training and validation performance over time (or over increasing dataset sizes) — are a powerful diagnostic tool for detecting overfitting and underfitting. When training performance is high but validation performance plateaus or decreases, overfitting is occurring.
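A minimal sketch of such a three-way split; the 70/15/15 fractions and the fixed seed are arbitrary illustrative choices, not a recommendation:

```python
import random

def three_way_split(data, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle and split a dataset into train / validation / test subsets (sketch)."""
    items = list(data)
    random.Random(seed).shuffle(items)  # fixed seed makes the split reproducible
    n_test = int(len(items) * test_frac)
    n_val = int(len(items) * val_frac)
    test = items[:n_test]                 # held out until final evaluation
    val = items[n_test:n_test + n_val]    # used for tuning and overfitting checks
    train = items[n_test + n_val:]        # used for learning
    return train, val, test

train, val, test = three_way_split(range(100))
print(len(train), len(val), len(test))  # 70 15 15
```

Each sample lands in exactly one subset; leaking validation or test samples into training invalidates the generalization estimate.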

Entropy and Prediction Confidence

Entropy, originally introduced by Shannon (1948), is a measure of uncertainty or unpredictability in a probability distribution. In information theory, entropy quantifies the average amount of information (or "surprise") contained in a set of possible outcomes.

For a discrete probability distribution with N classes and predicted probabilities p₁, p₂, ..., p_N, Shannon entropy is defined as:

H = − Σ pᵢ × log₂(pᵢ)

where the sum runs over all classes i from 1 to N, and by convention 0 × log₂(0) = 0.

  • When the model is fully confident (assigns probability 1.0 to a single class), entropy is 0 — there is no uncertainty.

  • When the model is completely uncertain (assigns equal probability to all classes), entropy reaches its maximum value of log₂(N).

Normalized entropy divides the raw entropy by its theoretical maximum (log₂(N)), producing a value between 0 and 1 regardless of the number of classes. This makes it possible to compare prediction confidence across models with different numbers of output classes.
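Both quantities translate into a few lines of Python (the function name is illustrative):

```python
import math

def normalized_entropy(probs):
    """Shannon entropy of a predicted distribution, divided by its maximum log2(N)."""
    h = sum(-p * math.log2(p) for p in probs if p > 0)  # convention: 0 * log2(0) = 0
    return h / math.log2(len(probs))

print(normalized_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0 -> fully confident
print(normalized_entropy([0.25, 0.25, 0.25, 0.25]))  # 1.0 -> maximally uncertain
```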

Note: Low entropy indicates the model is confident in its prediction, while high entropy signals uncertainty. Monitoring entropy across predictions helps identify samples where the model struggles, which can guide data collection and model refinement efforts.

Cohen's Kappa

Cohen's Kappa (Cohen, 1960) measures the level of agreement between two raters (or between predictions and ground truth) while accounting for agreement that would occur by chance alone. This makes it a more robust measure than simple accuracy, especially when class distributions are skewed.

The formula is:

κ = (pₒ − pₑ) / (1 − pₑ)

where:

  • pₒ is the observed agreement — the proportion of instances where the predicted label matches the actual label (equivalent to accuracy).

  • pₑ is the expected agreement by chance — the proportion of agreement expected if predictions were made randomly, based on the marginal distributions of each class.

A kappa of 1.0 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate agreement worse than chance (systematic disagreement).
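A direct transcription of the formula (illustrative helper; it assumes pₑ < 1, i.e. the marginals leave some room for disagreement):

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa from paired label lists (illustrative sketch)."""
    n = len(y_true)
    labels = set(y_true) | set(y_pred)
    # observed agreement: plain accuracy
    p_o = sum(1 for t, p in zip(y_true, y_pred) if t == p) / n
    # chance agreement: product of the marginal frequencies, summed over labels
    p_e = sum((sum(1 for t in y_true if t == c) / n) *
              (sum(1 for p in y_pred if p == c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# 3 of 4 predictions agree with the labels, but half of that is expected by chance
print(cohens_kappa([1, 1, 1, 0], [1, 1, 0, 0]))  # 0.5
```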

Interpretation Scale (Landis and Koch, 1977):

| Kappa Range | Strength of Agreement |
| ----------- | --------------------- |
| < 0.00      | Poor                  |
| 0.00 – 0.20 | Slight                |
| 0.21 – 0.40 | Fair                  |
| 0.41 – 0.60 | Moderate              |
| 0.61 – 0.80 | Substantial           |
| 0.81 – 1.00 | Almost perfect        |

Note: Cohen's Kappa is particularly valuable in domains with class imbalance. A model can achieve high accuracy simply by predicting the majority class, but its kappa will remain low because such predictions are expected by chance. Kappa rewards models that perform genuinely better than random guessing.

Csmart Model Metrics

In Csmart-Digit, AI models are evaluated using the metrics described above, applied to both binary classification (defect detection) and multiclass classification (categorizing seeds by defect class). The software presents these metrics to give users insight into the effectiveness, reliability, and confidence of the model’s predictions.

Key AI Metrics in Csmart-Digit

  1. Inference Confidence Level: Based on the normalized entropy of each prediction (see Entropy and Prediction Confidence above), Csmart-Digit categorizes model confidence into discrete levels:

    • High Confidence: Entropy < 12%

    • Medium Confidence: 12% ≤ Entropy < 20%

    • Low Confidence: 20% ≤ Entropy < 40%

    • Low Reliability: 40% ≤ Entropy < 75%

    • Very Low Reliability: 75% ≤ Entropy < 100%

    Lower confidence levels highlight areas where the model may need improvement to enhance prediction reliability.

  2. Cohen’s Kappa in Csmart-Digit: Cohen’s Kappa (see Cohen’s Kappa above) is computed only when the user provides feedback by correcting the model’s predictions within the software. This feedback supplies the reference labels needed to measure agreement between the model’s predictions and the corrected labels, enabling a meaningful kappa score.

  3. Binary Accuracy and Binary Error

    • Binary Accuracy reflects the proportion of correctly classified instances in binary tasks, such as identifying defective or non-defective seeds. For example, a binary accuracy of 98% indicates strong performance in this task.

    • Binary Error indicates the rate of incorrect classifications in binary tasks. A low error rate ensures that defective beans are reliably identified.

  4. Multiclass Accuracy and Multiclass Error

    • Multiclass Accuracy measures the proportion of correct classifications across all coffee categories, such as bean grades or flavor profiles. For example, a multiclass accuracy of 95% demonstrates effective classification across categories.

    • Multiclass Error represents the rate of misclassifications in multiclass tasks. A low error rate confirms the model’s ability to distinguish between diverse coffee categories accurately.

  5. Confusion Matrix: Csmart-Digit displays a multiclass confusion matrix (see The Confusion Matrix above) to help users identify which classes the model confuses most frequently and guide targeted improvements.
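The confidence bands listed in item 1 above can be expressed as a simple threshold lookup. The thresholds are the ones given above, but the function name, the string labels, and the mapping logic are assumptions for illustration, not Csmart-Digit's actual implementation:

```python
def confidence_level(normalized_entropy_pct):
    """Map a normalized entropy (as a percentage) to a confidence band.

    Thresholds follow the bands listed above; the helper itself is an
    illustrative sketch, not the software's internal code.
    """
    bands = [(12, "High Confidence"),
             (20, "Medium Confidence"),
             (40, "Low Confidence"),
             (75, "Low Reliability"),
             (100, "Very Low Reliability")]
    for upper, label in bands:
        if normalized_entropy_pct < upper:
            return label
    return "Very Low Reliability"  # entropy of exactly 100%

print(confidence_level(8))   # High Confidence
print(confidence_level(35))  # Low Confidence
```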

References

  • Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46.

  • Cover, T. M., and Thomas, J. A. (2006). Elements of Information Theory (2nd ed.). Wiley-Interscience.

  • He, H., and Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284.

  • Landis, J. R., and Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174.

  • Lin, T.-Y., Goyal, P., Girshick, R., He, K., and Dollár, P. (2017). Focal loss for dense object detection. Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2980–2988.

  • Provost, F., Fawcett, T., and Kohavi, R. (1998). The case against accuracy estimation for comparing induction algorithms. Proceedings of the 15th International Conference on Machine Learning (ICML), 445–453.

  • Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27(3), 379–423.
