AI Grading Validation Standard
CSMART COFFEE TECHNOLOGIES
Guideline
AI Grading Validation for Green Coffee
Computer Vision Systems for Physical Analysis
Printed copies of this document are not controlled.
March 2026
Document ID: CSMART-P-01
Version: 0.1 (Draft)
Valid from: TBD
Page count: ---
1. Scope
This document provides a guideline for the development, verification, and validation of computer vision (AI) systems used in green coffee physical analysis. It establishes minimum quality control expectations and performance indicators for AI tools that classify coffee bean defects, estimate bean weights from images, determine defect percentages, assess screen size distribution, and derive commercial grade (Type).
The scope covers AI systems that operate on digital images of green coffee beans and produce results equivalent to those of a trained human grader performing physical analysis according to established origin-specific grading standards.
This guideline is applicable to all origin-specific grading methods, including but not limited to:
Brazil --- Classificação Oficial Brasileira (COB), Instrução Normativa nº 8/2003
Vietnam --- Weight-percentage method with defect group aggregation (BB, FM)
ISO 10470 --- Green coffee defect reference chart (Robusta weighted factors)
Other origin-specific methods as defined in the origin-specific modules (Annexes)
2. Objectives
Define a standardized, multi-level validation framework for AI grading systems applied to green coffee physical analysis.
Establish performance indicators at each measurement level: bean classification, weight estimation, sample-level defect percentages, screen size distribution, and commercial Type.
Provide clear acceptance criteria that account for the logarithmic nature of the Type scale and the inherent image-to-weight conversion error.
Specify minimum requirements for human reference data, including blind grading protocols and inter-grader variability benchmarks.
Enable consistent comparison of AI model performance across origins, grading standards, and model versions.
Facilitate ongoing monitoring and revalidation of deployed AI grading systems.
3. Related Documents
ISTA TCOM-P-12 v1.0 (May 2025)
Advanced Technology Applications for Seed Testing --- Computer Vision. Reference framework for Bland-Altman methodology and validation protocol structure.
Instrução Normativa nº 8 (June 2003)
Classificação Oficial Brasileira (COB) --- Brazilian official green coffee classification.
ISO 10470:2004
Green coffee --- Defect reference chart.
SENAR Colecao 192
Café: classificação e degustação --- Brazilian coffee classification and cupping manual.
Bland & Altman (1986)
Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307--310.
4. Responsibilities
Tool developer (Csmart): Responsible for providing model training reports, classification accuracy metrics, latent space analysis, and support for validation data collection. Responsible for retraining or calibrating models when validation reveals unacceptable bias.
User laboratory / quality team: Responsible for collecting blind grading data from qualified human graders, performing scans with the AI system, and maintaining validation records. Responsible for ongoing monitoring according to the schedule defined in Section 9.
Quality manager: Responsible for reviewing validation results against acceptance criteria, authorizing deployment of validated models, and initiating revalidation when required.
5. Abbreviations
AI: Artificial Intelligence
BB: Black + Broken defect group (Vietnam method)
COB: Classificação Oficial Brasileira
DNN: Deep Neural Network
FM: Foreign Matter defect group (Vietnam method)
LoA: Limits of Agreement (Bland-Altman)
MAE: Mean Absolute Error
pp: Percentage points
QC: Quality Control
StDev: Standard Deviation
6. Definitions
Physical analysis: The classification of green coffee beans by identifying and quantifying defects, impurities, and screen size distribution through visual inspection and weighing. Unlike sensory analysis, physical analysis deals with objectively existing physical properties of the beans, though the classification of borderline beans introduces inter-grader variability.
Bias (systematic error): The average signed difference between the AI measurement and the human reference. A positive bias indicates the AI consistently overestimates; a negative bias indicates underestimation. Bias is correctable through calibration. Bias is distinct from MAE: a system can have high MAE with zero bias (random scatter) or high MAE that is almost entirely bias (systematic offset).
Mean Absolute Error (MAE): The average of the absolute differences between AI and human measurements, ignoring direction. MAE captures the total magnitude of disagreement, combining both systematic (bias) and random components.
Human consensus mean: The reference value for each sample and defect class, calculated as the arithmetic mean of all human graders' results for that sample. Since the exact true defect composition of each sample is not known with certainty, the consensus mean serves as the best available approximation.
Leave-one-out MAE: The method used to compute each human grader's MAE. Grader k's reference is the mean of all other graders excluding k. This avoids inflating agreement by including the grader in their own reference.
Equivalent defects: A standardized defect count derived from the origin-specific equivalence table. Different physical defects contribute different numbers of equivalent defects (e.g., in COB: 1 black bean = 1 defect, 5 broken beans = 1 defect). Equivalent defects are the input to the Type formula.
Type (coffee grade): A commercial classification derived from the equivalent defect count. The relationship between defect count and Type follows a logarithmic scale (see Section 7.5), meaning that equal absolute differences in defect counts do not correspond to equal differences in Type.
Weight estimation: The process by which an AI system infers the physical weight of individual beans or bean groups from 2D image data (pixel area, shape). This introduces a specific error source that is independent of classification accuracy, since images capture area but not thickness or density.
Blind sample: A sample for which the human graders and AI operator do not know the expected defect composition, origin characteristics, or any other information that could influence the grading result.
7. Measurement Levels and Performance Indicators
AI coffee grading systems operate through a multi-level pipeline. Errors at each level can propagate and compound through subsequent levels. Validation must therefore assess each level independently, as well as the final commercial output.
1. Bean classification: bean image -> defect class label. Error source: misclassification.
2. Weight estimation: bean image (pixels) -> estimated weight (g). Error source: area-to-weight conversion.
3. Sample-level: all bean classes + weights -> defect % per class. Error source: combined (1 + 2).
4. Screen size: bean image (dimensions) -> screen distribution. Error source: pixel-to-size conversion.
5. Commercial (Type): defect % -> equivalence table -> Type grade. Error source: cascaded (1 + 2 + 3).
7.1 Bean-Level Classification
Bean-level classification measures how accurately the AI assigns individual beans to their correct defect class. This is evaluated during model training/testing using labelled reference datasets controlled by experienced analysts.
Performance indicators
Overall accuracy: percentage of beans assigned the correct class label. Criterion: >= 90%.
Per-class precision: of all beans the AI labelled as class c, the fraction that truly belongs to c. Criterion: report all; flag any < 80%.
Per-class recall: of all beans truly belonging to class c, the fraction the AI correctly identified. Criterion: report all; flag any < 80%.
Confusion matrix: full cross-tabulation of true vs. predicted classes. Criterion: flag any off-diagonal pair > 5%.
Note: Bean-level metrics come from the model's internal test set during training. They are necessary but not sufficient --- a model with 92% bean-level accuracy can still show significant bias at the sample level if misclassification errors are concentrated in specific classes (e.g., Brown <-> Black confusion).
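As an illustration, all of the bean-level indicators above can be derived from the confusion matrix alone. The sketch below uses a small hypothetical 4-class matrix; the class names and counts are invented for illustration only.

```python
# Sketch: computing Section 7.1 indicators from a confusion matrix.
# Rows = true class, columns = predicted class. All values are illustrative.
classes = ["Black", "Brown", "Broken", "OK"]
cm = [  # cm[i][j] = number of beans of true class i predicted as class j
    [ 90,   8,  1,  1],
    [  6, 110,  2,  2],
    [  1,   1, 95,  3],
    [  2,   3,  4, 171],
]

n_total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(classes))) / n_total

for i, c in enumerate(classes):
    true_total = sum(cm[i])                 # all beans truly in class c
    pred_total = sum(row[i] for row in cm)  # all beans the AI labelled c
    recall = cm[i][i] / true_total
    precision = cm[i][i] / pred_total
    # Flag per Section 7.1: precision or recall below 80%
    flag = " <-- flag" if min(precision, recall) < 0.80 else ""
    print(f"{c}: precision={precision:.2f}, recall={recall:.2f}{flag}")

print(f"Overall accuracy: {accuracy:.3f}")

# Off-diagonal pairs above 5% of the true class total are flagged
for i in range(len(classes)):
    for j in range(len(classes)):
        if i != j and cm[i][j] / sum(cm[i]) > 0.05:
            print(f"Confusion {classes[i]} -> {classes[j]}: "
                  f"{cm[i][j] / sum(cm[i]):.1%}")
```

In this invented matrix the Black -> Brown pair (8% of true Black beans) would be flagged, illustrating exactly the kind of concentrated confusion the note above warns about.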
7.2 Weight Estimation
Traditional coffee grading methods rely on physical weighing. AI systems operating on images must estimate weight from 2D pixel data (area, shape). This conversion introduces a systematic error source that is independent of classification accuracy, because:
Images capture projected area but not bean thickness or density.
Different defect types have different density profiles (e.g., black beans are typically lighter per unit area).
Overlapping or touching beans in the image may cause area estimation errors.
Performance indicators
Total weight error: |W_AI - W_scale| / W_scale x 100%. Criterion: <= 2%.
Per-class weight error: |W_AI,c - W_scale,c| / W_scale,c x 100% (where feasible). Criterion: report; no fixed threshold.
Note: Per-class weight error requires the human grader to weigh each separated defect group individually (not just count them). This data is valuable but may not always be available. When available, it helps isolate whether sample-level errors originate from classification or weight estimation.
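A minimal sketch of the total weight error check, with invented weights (a 300 g COB sample weighed on a reference scale versus the AI estimate):

```python
# Sketch: total weight error per Section 7.2. Input weights are illustrative.
def total_weight_error(w_ai: float, w_scale: float) -> float:
    """Relative weight error in percent: |W_AI - W_scale| / W_scale x 100."""
    return abs(w_ai - w_scale) / w_scale * 100.0

err = total_weight_error(w_ai=295.4, w_scale=300.0)  # grams, hypothetical
print(f"Total weight error: {err:.2f}%  ({'PASS' if err <= 2.0 else 'FAIL'})")
```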
7.3 Sample-Level Defect Percentages
Sample-level validation measures how closely the AI's reported defect percentages (per class) match the human consensus. This is the most direct comparison of the AI's end-to-end output against standard grading practice. Errors at this level reflect the combined effect of classification and weight estimation.
Reference standard
The reference for each sample and defect class is the human consensus mean: the arithmetic mean of all participating human graders' results for that sample. Each human grader's own performance is measured using the leave-one-out method: grader k's reference is the mean of all graders except k.
Performance indicators
MAE per class: MAE_c = (1/n) Sum_i |y_i,c - x_i,c|. Criterion: <= 2x human inter-grader StDev for class c.
Bias per class: Bias_c = (1/n) Sum_i (y_i,c - x_i,c). Criterion: paired t-test p >= 0.05 (no significant bias).
Limits of Agreement: LoA = Bias +/- 1.96 x SD of differences. Criterion: <= 2 outliers per 30 samples (see Section 8.4).
Pearson correlation (r): correlation between AI and consensus values. Criterion: >= 0.8 for commercially critical classes.
Overall MAE: mean of per-class MAEs across all defect classes. Criterion: report; compare to human leave-one-out MAE.
Where:
y_i,c = AI measurement for sample i, class c
x_i,c = human consensus mean for sample i, class c
n = number of samples
Important --- Bias vs. MAE: These are distinct metrics. MAE measures the magnitude of error (ignoring direction). Bias measures the direction. A high MAE driven primarily by bias (e.g., Brown: MAE 0.91 pp, Bias -0.78 pp) indicates a systematic, correctable error. A high MAE with near-zero bias indicates random scatter, which is harder to correct. Both must be reported.
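The per-class MAE and Bias formulas above can be computed directly from paired AI and consensus values; a minimal sketch with invented data for one defect class:

```python
# Sketch: per-class MAE and Bias (Section 7.3). All values are illustrative.
# y = AI defect percentage per sample, x = human consensus mean per sample.
y = [2.1, 1.8, 3.0, 2.5, 1.2]   # AI (% by weight), hypothetical
x = [1.9, 2.0, 2.6, 2.4, 1.5]   # human consensus, hypothetical

n = len(y)
mae = sum(abs(yi - xi) for yi, xi in zip(y, x)) / n   # magnitude, unsigned
bias = sum(yi - xi for yi, xi in zip(y, x)) / n       # signed, systematic part

print(f"MAE  = {mae:.3f} pp")
print(f"Bias = {bias:.3f} pp")
```

Note how the signed differences partly cancel in the bias while the absolute differences do not in the MAE, which is exactly the bias-versus-MAE distinction made above.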
7.4 Screen Size Distribution
Screen size (granulometry) classification measures the distribution of bean sizes across standard sieve numbers. Rather than validating each individual sieve band independently (which produces excessive granularity and noise), this guideline uses two commercially relevant summary metrics.
Performance indicators
% above screen 15 (15+): cumulative weight percentage retained on screens 15 and above. This is the primary commercial cutoff (chato médio/graúdo). Criteria: same as Section 7.3 (MAE, Bias, Bland-Altman).
Weighted average screen: WAS = Sum(S_k x W_k) / Sum(W_k), where S_k is the screen number and W_k is the weight retained on screen k. Criteria: same as Section 7.3.
Note: Screen size estimation from images is subject to pixel-to-millimeter calibration error. Unlike defect classification, screen size depends on accurate dimensional measurement rather than visual pattern recognition. The validation protocol should verify that the imaging system's dimensional calibration is stable and traceable.
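Both summary metrics follow directly from the per-screen retained weights; a minimal sketch with an invented sieve distribution:

```python
# Sketch: the two screen-size summary metrics of Section 7.4.
# weights[k] = grams retained on screen number k (illustrative values).
weights = {13: 5.0, 14: 10.0, 15: 25.0, 16: 35.0, 17: 20.0, 18: 5.0}

total = sum(weights.values())
pct_15_plus = sum(w for s, w in weights.items() if s >= 15) / total * 100
was = sum(s * w for s, w in weights.items()) / total  # weighted average screen

print(f"% above screen 15 (15+): {pct_15_plus:.1f}%")
print(f"Weighted average screen: {was:.2f}")
```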
7.5 Commercial-Level Type Agreement
The commercial grade (Type) is derived from the total equivalent defect count according to the origin-specific grading standard. The relationship between equivalent defect count (d) and Type follows a logarithmic scale:
Type = ln(d) / b where b ~ 0.709
This logarithmic relationship has a critical implication for validation: absolute differences in defect counts do not correspond to equal differences in Type. Small defect count errors have a large impact on high-quality coffees (low defect counts) and a negligible impact on lower grades (high defect counts):
4 defects (Type 2) -> 14 defects: +1.8 Types
26 defects (Type 4--5) -> 36 defects: +0.5 Types
86 defects (Type 6) -> 96 defects: +0.2 Types
160 defects (Type 7) -> 170 defects: +0.1 Types
Therefore, validation at the commercial level must be performed on the Type scale, not on raw defect counts.
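The Type formula and its logarithmic sensitivity can be reproduced directly. The sketch below uses the document's constant b = 0.709 and shows that the same +10 defect error produces very different Type impacts at the two ends of the scale:

```python
import math

# Sketch: Type formula and Delta Type sensitivity (Section 7.5).
B = 0.709  # scale constant from the document's Type formula

def coffee_type(defects: float) -> float:
    """Type = ln(d) / b."""
    return math.log(defects) / B

def delta_type(d_ai: float, d_human: float) -> float:
    """|Delta Type| = |ln(d_AI) - ln(d_human)| / b."""
    return abs(math.log(d_ai) - math.log(d_human)) / B

# Same +10 defect error, very different Type impact:
print(f"4 -> 14 defects:    +{delta_type(14, 4):.1f} Types")    # high-quality lot
print(f"160 -> 170 defects: +{delta_type(170, 160):.1f} Types") # low-grade lot
```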
Performance indicators
The Type difference between AI and human is computed as:
|Delta Type| = |ln(d_AI) - ln(d_human)| / b
Mean |Delta Type|: average absolute Type difference across all samples. Criterion: report.
% within +/-0.5 Type: proportion of samples where AI and human agree within half a Type step. Criterion: >= 90%.
% within +/-0.25 Type: proportion of samples where AI and human show excellent agreement. Criterion: report (target >= 75%).
Bland-Altman on Delta Type: paired t-test and LoA on the Type scale. Criterion: p >= 0.05; <= 2 outliers beyond LoA.
Type match rate: proportion of samples where AI and human assign exactly the same integer Type. Criterion: report.
Critical: Bland-Altman analysis for Type agreement must be performed on Type values (or equivalently on ln(d)), not on raw defect counts. Performing the analysis on raw counts would violate the assumption of uniform measurement scale and would underweight errors on high-quality coffees.
8. Validation Protocol
8.1 Sample Requirements
Number of blind samples: 30 per origin/grading standard (aligned with ISTA TCOM-P-12 Annex 4).
Type range coverage: samples must span at least 3 full Type steps, e.g., Type 2--3 (high quality), Type 4--5 (mid), Type 6+ (low). This ensures the logarithmic scale is tested across its range.
Sample diversity: samples from at least 3 different lots, to prevent overfitting to a single production batch.
Sample size: per origin standard (e.g., 300 g for COB); must match the standard human grading method.
8.2 Human Reference Standard
Number of human graders: >= 3 across the sample set. Each sample must be graded by at least 2 independent graders; at least 3 different graders must participate across the full set.
Grader qualification: trained and experienced in the origin's grading method. Graders must be current QC staff routinely performing this analysis.
Blind protocol: required. Graders must not know the AI result, each other's results, or the expected composition of the sample.
Independence: required. Graders must perform their analysis independently, without discussion or comparison.
Note on consensus stability: The human consensus mean is the reference for all AI performance metrics. This consensus shifts depending on which graders participate. The more graders contribute, the more stable the consensus. With only 2 graders, replacing one changes the reference substantially. With 4+, the consensus is more robust. This is why >= 3 graders are required and 4+ are recommended.
8.3 Data Collection
Human grading
Each grader receives the same physical samples, identified only by sample code.
Graders perform standard physical analysis per the origin method:
Separate defective beans from the 300 g sample.
Classify defects by type.
Weigh each defect group separately (if feasible --- enables weight estimation validation).
Record defect weight percentages per class.
Perform screen size analysis on the catação (clean fraction) using standard sieve set.
Record weight retained per screen number.
Results are submitted to the quality manager without inter-grader discussion.
AI scanning
The same physical samples are scanned by the AI system.
Record:
Bean class assignments (if exportable)
AI-estimated total sample weight
Defect weight percentages per class
Screen size distribution (per screen number)
Model version, date, and scan parameters
Record the actual scale weight of the sample for weight estimation validation (Section 7.2).
8.4 Statistical Analysis
Step 1: Compute human consensus and inter-grader variability
For each sample i and defect class c:
Consensus mean: x_i,c = mean of all human graders for that sample/class.
Human inter-grader StDev: computed per class across all samples using the leave-one-out method.
Human leave-one-out MAE: for each grader k, MAE_k,c = mean |x_k,i,c - x_(minus k),i,c|, where x_(minus k) excludes grader k.
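The Step 1 computations can be sketched as follows, with invented grades for one defect class (rows are graders, columns are samples):

```python
# Sketch: human consensus and leave-one-out MAE (Section 8.4, Step 1).
# grades[k][i] = grader k's defect % for sample i (one class; illustrative).
grades = [
    [2.0, 3.5, 1.0, 4.2],   # grader A
    [2.4, 3.1, 1.3, 4.0],   # grader B
    [1.9, 3.6, 0.9, 4.4],   # grader C
]
n_samples = len(grades[0])

# Consensus mean per sample (all graders)
consensus = [sum(g[i] for g in grades) / len(grades) for i in range(n_samples)]

# Leave-one-out MAE per grader: the reference excludes the grader being scored
for k, gk in enumerate(grades):
    others = [g for j, g in enumerate(grades) if j != k]
    ref = [sum(g[i] for g in others) / len(others) for i in range(n_samples)]
    mae_k = sum(abs(gk[i] - ref[i]) for i in range(n_samples)) / n_samples
    print(f"Grader {k}: leave-one-out MAE = {mae_k:.3f} pp")
```

Excluding grader k from their own reference avoids the agreement inflation described in Section 6.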
Step 2: Compute AI performance metrics (Sections 7.1--7.5)
For each measurement level, compute all indicators listed in the corresponding section. The Bland-Altman analysis follows the procedure described in ISTA TCOM-P-12 Annex 4:
Let x_i = human consensus and y_i = AI measurement for sample i.
Compute Bias = (1/n) Sum(y_i - x_i).
Compute SD = sqrt[(1/(n-1)) Sum((y_i - x_i) - Bias)^2].
Perform a paired t-test. If p < 0.05, the bias is statistically significant --- the AI shows systematic error on this metric.
Compute Limits of Agreement: Lower LoA = Bias - 1.96 x SD; Upper LoA = Bias + 1.96 x SD.
Count the number of points falling outside the LoA. Accept if <= 2 out of 30 (approximately 5%).
Generate Bland-Altman plots: x-axis = mean of AI and human, y-axis = difference (AI - human).
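The Step 2 procedure can be sketched with standard-library Python only. The exact p-value requires a statistics package (e.g., scipy.stats.ttest_rel), so this sketch compares |t| against the tabulated critical value instead; all data are illustrative.

```python
import math

# Sketch: Bland-Altman analysis per Section 8.4, Step 2 (illustrative data).
# A full validation would use n = 30 paired values per metric.
x = [2.0, 3.5, 1.0, 4.2, 2.8, 3.1]   # human consensus
y = [2.2, 3.4, 1.3, 4.5, 2.9, 3.3]   # AI measurement

n = len(x)
d = [yi - xi for yi, xi in zip(y, x)]
bias = sum(d) / n
sd = math.sqrt(sum((di - bias) ** 2 for di in d) / (n - 1))

# Paired t statistic; |t| above the two-tailed critical value for df = n - 1
# (e.g., 2.045 for n = 30 at alpha = 0.05) means significant bias.
t = bias / (sd / math.sqrt(n))

lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
outliers = sum(1 for di in d if di < lower or di > upper)

print(f"Bias = {bias:+.3f}, SD = {sd:.3f}, t = {t:.2f}")
print(f"LoA = [{lower:+.3f}, {upper:+.3f}], outliers = {outliers}")
```

For the plot itself, x-axis = (x_i + y_i) / 2 and y-axis = d_i, with horizontal lines at the bias and both LoA, as described above.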
Step 3: For commercial-level (Type), apply the logarithmic transformation
Convert AI and human equivalent defect counts to Type: Type = ln(d) / 0.709.
Perform the Bland-Altman analysis on the Type values, not the raw defect counts.
Compute |Delta Type| per sample and the acceptance percentages (+/-0.25, +/-0.5 Type).
8.5 Acceptance Criteria Summary
Bean classification --- overall accuracy: >= 90%.
Bean classification --- confusion pairs > 5%: flagged for review.
Weight estimation --- total weight error: <= 2%.
Sample-level (per class) --- bias (paired t-test): p >= 0.05.
Sample-level (per class) --- LoA outliers: <= 2 / 30.
Sample-level (per class) --- MAE: <= 2x human inter-grader StDev.
Sample-level (per class) --- Pearson r: >= 0.8 (critical classes).
Screen size --- 15+ bias (paired t-test): p >= 0.05.
Screen size --- weighted avg screen bias: p >= 0.05.
Commercial (Type) --- % within +/-0.5 Type: >= 90%.
Commercial (Type) --- bias on Type scale (paired t-test): p >= 0.05.
Partial compliance: If a model meets acceptance criteria on some but not all defect classes, it may be deployed with documented limitations. For example, a model that passes on all classes except Brown and Black (due to known confusion) may be deployed if the quality team is aware that these two classes require manual review. Such limitations must be documented in the validation record (Annex 1).
9. Ongoing Monitoring and Revalidation
Routine monitoring
Quarterly: abbreviated check (key MAEs, bias direction, weight estimation error) on 10 blind samples with >= 2 human graders.
Annually: full validation per Section 8 on 30 blind samples with >= 3 human graders.
Revalidation triggers
Full revalidation (30 samples) is required when any of the following occur:
The AI model is updated, retrained, or replaced.
The imaging hardware is changed or recalibrated.
The software version is updated in a way that affects the classification or weight estimation pipeline.
Routine monitoring reveals a previously passing metric now fails acceptance criteria.
The origin-specific grading standard is revised.
A new coffee variety or processing method is introduced that was not represented in the original validation dataset.
Corrective actions
When validation or monitoring reveals unacceptable performance:
Identify whether the error source is classification (confusion matrix), weight estimation, or both.
For classification errors: review the training database for the affected class boundaries; consider adding representative samples or correcting borderline labels.
For weight estimation errors: verify imaging hardware calibration; review the weight estimation model.
For systematic bias on specific classes: evaluate whether a calibration offset is appropriate as a temporary measure while retraining is prepared. Document any offsets applied.
After correction, perform full revalidation.
10. Record Keeping
The following records must be retained for as long as the AI system is in use, and a minimum of six years after decommissioning:
Validation reports including all metrics from Sections 7.1--7.5, Bland-Altman plots, and acceptance decisions (Annex 2).
Raw grading data: human graders' results, AI outputs, sample identifiers.
Model identification: version, architecture, training date, training report.
Monitoring results from quarterly checks.
Revalidation records, including the trigger for revalidation.
Any calibration offsets applied, with justification and date.
Corrective action records.
11. Annexes
Annex 1: Tool Overview and Application Scope
To be completed for each AI system and origin.
Name of the tool:
Laboratory / Site:
Scope: (e.g., Green coffee defect analysis, Vietnam method)
Origin / Grading standard:
Defect classes supported: (e.g., Broken, Brown, Fragment, Black, Mold, Husk, Pod, SilverSkin, Stick, Stone, Immature, Floaters, InsectDamage, OK)
Defect groups: (e.g., BB (Black + Broken), FM (Fragment + Mold + ...)) --- per origin method
Screen size range: (e.g., Screens 11--18+)
Model version:
Model architecture: (e.g., ConvNeXt Large, EVA-02 Large)
Bean-level accuracy: (from training report)
Deployment date:
Validation status: VALIDATED / CONDITIONAL / NOT VALIDATED
Known limitations: (e.g., Brown vs Black confusion above threshold)
SOP reference:
Responsible person:
Annex 2: Validation Scorecard
To be completed for each validation round.
Model version:
Validation date:
Number of samples: ____ / 30 minimum
Human graders:
Origin / Grading standard:
Bean classification --- overall accuracy: >= 90%
Weight estimation --- total weight error: <= 2%
Sample: Broken --- MAE (pp): <= 2x human StDev
Sample: Broken --- Bias (pp): ---
Sample: Broken --- paired t-test (p): >= 0.05
Sample: Broken --- Pearson r: >= 0.8
... repeat for each defect class ...
Screen size --- 15+ bias (p): >= 0.05
Screen size --- weighted avg screen bias (p): >= 0.05
Commercial (Type) --- % within +/-0.5 Type: >= 90%
Commercial (Type) --- bias on Type scale (p): >= 0.05
Commercial (Type) --- mean |Delta Type|: report
Decision: VALIDATED / CONDITIONAL / NOT VALIDATED
Conditions (if applicable):
Authorized by:
Date:
Annex 3: Bland-Altman Procedure for Sample-Level and Type Validation
This annex describes the Bland-Altman method as applied to coffee AI grading validation, adapted from ISTA TCOM-P-12 Annex 4 (Bland & Altman, 1986).
A3.1 For sample-level defect percentages
For each defect class c, with n >= 30 blind samples:
Let x_i = human consensus mean for sample i, and y_i = AI measurement for sample i.
Compute differences: d_i = y_i - x_i.
Compute Bias = (1/n) Sum d_i.
Compute SD = sqrt[(1/(n-1)) Sum(d_i - Bias)^2].
Paired t-test: t = Bias / (SD / sqrt(n)). If p < 0.05, bias is significant -> FAIL.
Limits of Agreement: Lower = Bias - 1.96 x SD; Upper = Bias + 1.96 x SD.
Count outliers beyond LoA. If <= 2 out of 30 (~ 5%) -> PASS.
Generate Bland-Altman plot: x-axis = (x_i + y_i) / 2; y-axis = d_i. Draw Bias (dashed red) and LoA (dotted gray) lines.
A3.2 For commercial-level Type agreement
The procedure is identical to A3.1, except the measurements are on the Type scale:
Convert equivalent defect counts to Type: Type_AI = ln(d_AI) / 0.709; Type_human = ln(d_human) / 0.709.
Proceed with steps 2--8 from A3.1, using Type values instead of defect percentages.
Do not apply the Bland-Altman procedure to raw defect counts. The logarithmic relationship between defects and Type means that the assumption of a uniform measurement scale (required by Bland-Altman) is only met on the Type scale, not on the raw count scale.
A3.3 Interpretation guide
Bias near zero, few outliers (t-test PASS, LoA PASS): AI agrees with human consensus; no systematic error.
Significant bias, few outliers (t-test FAIL, LoA PASS): systematic offset. The error is predictable and correctable (calibration or retraining); precision is adequate.
Bias near zero, many outliers (t-test PASS, LoA FAIL): no systematic error, but high random scatter on some samples. Investigate whether outlier samples have unusual properties.
Significant bias AND many outliers (t-test FAIL, LoA FAIL): both systematic and random errors. The model needs significant improvement before deployment.
Annex 4: Origin-Specific Module --- Brazil (COB)
A4.1 Grading standard
Instrução Normativa nº 8, 11 June 2003. Sample size: 300 g of beneficiated green coffee.
A4.2 Defect equivalence table
The number given is the count of beans (or objects) equivalent to 1 defect, unless stated otherwise:
Grão preto (Black): 1
Grão ardido (Sour/Brown): 2
Grão preto-verde (Stinker): 2
Grão brocado (Insect damage): 2--5
Grão concha (Shell): 3
Grão verde (Immature): 5
Grão quebrado (Broken): 5
Grão chocho (Withered): 5
Grão esmagado (Crushed): 5
Coco (Cherry): 1
Casca grande (Large husk): 1
Casca pequena (Small husk): 2--3
Pergaminho (Parchment): 2
Marinheiro (Sailor): 2
Pau/Pedra/Torrão grande: 1 = 5 defects
Pau/Pedra/Torrão regular: 1 = 2 defects
Pau/Pedra/Torrão pequeno: 1 = 1 defect
A4.3 Type conversion formula
Type = ln(d) / 0.709
Where d = total equivalent defects in 300 g sample.
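As a sketch of how the A4.2 table and the Type formula combine, the example below converts invented per-class bean counts to equivalent defects and then to Type. Only a subset of the table is included; ranges such as brocado's 2--5 need grader judgement and are omitted here.

```python
import math

# Sketch: defect counts -> equivalent defects -> Type (Annex 4, COB).
# beans_per_defect[d] = beans of defect d that count as 1 equivalent defect.
beans_per_defect = {
    "preto": 1,       # 1 black bean = 1 defect
    "ardido": 2,      # 2 sour/brown beans = 1 defect
    "verde": 5,       # 5 immature beans = 1 defect
    "quebrado": 5,    # 5 broken beans = 1 defect
}

counts = {"preto": 3, "ardido": 6, "verde": 10, "quebrado": 15}  # per 300 g, invented

equiv = sum(counts[d] / beans_per_defect[d] for d in counts)
coffee_type = math.log(equiv) / 0.709

print(f"Equivalent defects: {equiv:.0f}")   # 3 + 3 + 2 + 3 = 11
print(f"Type: {coffee_type:.2f}")
```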
A4.4 Type scale reference
4 defects: Type 2
12 defects: Type 2--3
26 defects: Type 4
46 defects: Type 5
86 defects: Type 6
160 defects: Type 7
360 defects: Type 8
A4.5 Screen size method
100 g of the catação (clean fraction) is passed through a standard sieve set (screens 10--19 for chato; 8--13 for moca). Weight retained per screen is recorded. Validation metrics per Section 7.4: % above screen 15 and weighted average screen number.
Annex 5: Origin-Specific Module --- Vietnam
A5.1 Grading standard
Weight-percentage method. All defect groups have equal weight factor (= 1). This is distinct from ISO 10470, which assigns different weight factors per defect group.
A5.2 Defect groups
BB (Black + Broken): FullBlack, Broken. Weight factor: 1.
FM (Foreign Matter): Fragment, Husk, Stick, Stone, Pod, SilverSkin. Weight factor: 1.
Total Defects: all defect classes except OK. Weight factor: 1.
A5.3 Validation specifics
Since Vietnam does not use an equivalence table or Type system, validation at the commercial level (Section 7.5) is replaced by validation of the defect group percentages (BB, FM, Total Defects) using the same Bland-Altman methodology as Section 7.3.
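The group percentages that replace Type validation here are plain weight-percentage aggregations; a minimal sketch with invented per-class defect weights from a 300 g sample:

```python
# Sketch: Vietnam defect-group percentages (Annex 5). All weights illustrative.
class_weights = {  # per-class defect weight in grams (hypothetical)
    "FullBlack": 2.1, "Broken": 4.5, "Fragment": 1.2, "Husk": 0.6,
    "Stick": 0.3, "Stone": 0.2, "Pod": 0.4, "SilverSkin": 0.5,
}
sample_weight = 300.0  # grams

groups = {
    "BB": ["FullBlack", "Broken"],
    "FM": ["Fragment", "Husk", "Stick", "Stone", "Pod", "SilverSkin"],
}

for name, members in groups.items():
    pct = sum(class_weights[c] for c in members) / sample_weight * 100
    print(f"{name}: {pct:.2f}%")

# Total Defects = all defect classes except OK (OK beans are not listed here)
total_defects = sum(class_weights.values()) / sample_weight * 100
print(f"Total Defects: {total_defects:.2f}%")
```

Each group percentage is then compared against the human consensus with the Section 7.3 Bland-Altman procedure, exactly as stated above.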
Annex 6: Origin-Specific Module --- ISO 10470 (Weighted Factors)
A6.1 Grading standard
ISO 10470:2004 --- Green coffee defect reference chart. Used primarily for Robusta grading (e.g., Nestle purchasing specifications). Each defect group has a different weight factor, unlike the Vietnam equal-weight method.
A6.2 Validation specifics
Validation follows the same protocol as Section 8, but with the origin-specific weight factors applied to compute weighted defect percentages. The Bland-Altman analysis is performed on the weighted percentages.
Detailed weight factor table to be added based on the specific ISO 10470 edition in use.
Revision History
Version 0.1 --- March 2026 --- Initial draft.
Csmart Coffee Technologies SA