AI Grading Validation Standard
CSMART COFFEE TECHNOLOGIES
Guideline
AI Grading Validation for Green Coffee
Computer Vision Systems for Physical Analysis
Printed copies of this document are not controlled.
March 2026
Document ID: CSMART-P-01
Version: 0.1 (Draft)
Valid from: TBD
Page count: ---
1. Scope
This document provides a guideline for the development, verification, and validation of computer vision (AI) systems used in green coffee physical analysis. It establishes minimum quality control expectations and performance indicators for AI tools that classify coffee bean defects, estimate bean weights from images, determine defect percentages, assess screen size distribution, and derive commercial grade (Type).
The scope covers AI systems that operate on digital images of green coffee beans and produce results equivalent to those of a trained human grader performing physical analysis according to established origin-specific grading standards.
This guideline is applicable to all origin-specific grading methods, including but not limited to:
Brazil --- Classificação Oficial Brasileira (COB), Instrução Normativa nº 8/2003
Vietnam --- Weight-percentage method with defect group aggregation (BB, FM)
ISO 10470 --- Green coffee defect reference chart (Robusta weighted factors)
Other origin-specific methods as defined in the origin-specific modules (Annexes)
2. Objectives
Define a standardized, multi-level validation framework for AI grading systems applied to green coffee physical analysis.
Establish performance indicators at each measurement level: bean classification, weight estimation, sample-level defect percentages, screen size distribution, and commercial Type.
Provide clear acceptance criteria that account for the logarithmic nature of the Type scale and the inherent image-to-weight conversion error.
Specify minimum requirements for human reference data, including blind grading protocols and inter-grader variability benchmarks.
Enable consistent comparison of AI model performance across origins, grading standards, and model versions.
Facilitate ongoing monitoring and revalidation of deployed AI grading systems.
3. Related Documents
ISTA TCOM-P-12 v1.0 (May 2025)
Advanced Technology Applications for Seed Testing --- Computer Vision. Reference framework for Bland-Altman methodology and validation protocol structure.
Instrução Normativa nº 8 (June 2003)
Classificação Oficial Brasileira (COB) --- Brazilian official green coffee classification.
ISO 10470:2004
Green coffee --- Defect reference chart.
SENAR Colecao 192
Café: classificação e degustação --- Brazilian coffee classification and cupping manual.
Bland & Altman (1986)
Statistical methods for assessing agreement between two methods of clinical measurement. The Lancet, 327(8476), 307--310.
4. Responsibilities
Tool developer (Csmart): Responsible for providing model training reports, classification accuracy metrics, latent space analysis, and support for validation data collection. Responsible for retraining or calibrating models when validation reveals unacceptable bias.
User laboratory / quality team: Responsible for collecting blind grading data from qualified human graders, performing scans with the AI system, and maintaining validation records. Responsible for ongoing monitoring according to the schedule defined in Section 9.
Quality manager: Responsible for reviewing validation results against acceptance criteria, authorizing deployment of validated models, and initiating revalidation when required.
5. Abbreviations
AI: Artificial Intelligence
BB: Black + Broken defect group (Vietnam method)
COB: Classificação Oficial Brasileira
DNN: Deep Neural Network
FM: Foreign Matter defect group (Vietnam method)
LoA: Limits of Agreement (Bland-Altman)
MAE: Mean Absolute Error
pp: Percentage points
QC: Quality Control
StDev: Standard Deviation
6. Definitions
Physical analysis: The classification of green coffee beans by identifying and quantifying defects, impurities, and screen size distribution through visual inspection and weighing. Unlike sensory analysis, physical analysis deals with objectively existing physical properties of the beans, though the classification of borderline beans introduces inter-grader variability.
Bias (systematic error): The average signed difference between the AI measurement and the human reference. A positive bias indicates the AI consistently overestimates; a negative bias indicates underestimation. Bias is correctable through calibration. Bias is distinct from MAE: a system can have high MAE with zero bias (random scatter) or high MAE that is almost entirely bias (systematic offset).
Mean Absolute Error (MAE): The average of the absolute differences between AI and human measurements, ignoring direction. MAE captures the total magnitude of disagreement, combining both systematic (bias) and random components.
Human consensus mean: The reference value for each sample and defect class, calculated as the arithmetic mean of all human graders' results for that sample. Since the exact true defect composition of each sample is not known with certainty, the consensus mean serves as the best available approximation.
Leave-one-out MAE: The method used to compute each human grader's MAE. Grader k's reference is the mean of all other graders excluding k. This avoids inflating agreement by including the grader in their own reference.
Equivalent defects: A standardized defect count derived from the origin-specific equivalence table. Different physical defects contribute different numbers of equivalent defects (e.g., in COB: 1 black bean = 1 defect, 5 broken beans = 1 defect). Equivalent defects are the input to the Type formula.
Type (coffee grade): A commercial classification derived from the equivalent defect count. The relationship between defect count and Type follows a logarithmic scale (see Section 7.5), meaning that equal absolute differences in defect counts do not correspond to equal differences in Type.
Weight estimation: The process by which an AI system infers the physical weight of individual beans or bean groups from 2D image data (pixel area, shape). This introduces a specific error source that is independent of classification accuracy, since images capture area but not thickness or density.
Blind sample: A sample for which the human graders and AI operator do not know the expected defect composition, origin characteristics, or any other information that could influence the grading result.
7. Measurement Levels and Performance Indicators
AI coffee grading systems operate through a multi-level pipeline. Errors at each level can propagate and compound through subsequent levels. Validation must therefore assess each level independently, as well as the final commercial output.
1. Bean classification: bean image -> defect class label. Error source: misclassification.
2. Weight estimation: bean image (pixels) -> estimated weight (g). Error source: area-to-weight conversion.
3. Sample-level: all bean classes + weights -> defect % per class. Error source: combined (1 + 2).
4. Screen size: bean image (dimensions) -> screen distribution. Error source: pixel-to-size conversion.
5. Commercial (Type): defect % -> equivalence table -> Type grade. Error source: cascaded (1 + 2 + 3).
7.1 Bean-Level Classification
Bean-level classification measures how accurately the AI assigns individual beans to their correct defect class. This is evaluated during model training/testing using labelled reference datasets controlled by experienced analysts.
Performance indicators
Overall accuracy: percentage of beans assigned the correct class label. Criterion: >= 90%.
Per-class precision: of all beans the AI labelled as class c, the fraction that truly belongs to c. Criterion: report all; flag any < 80%.
Per-class recall: of all beans truly belonging to class c, the fraction the AI correctly identified. Criterion: report all; flag any < 80%.
Confusion matrix: full cross-tabulation of true vs. predicted classes. Criterion: flag any off-diagonal pair > 5%.
Note: Bean-level metrics come from the model's internal test set during training. They are necessary but not sufficient --- a model with 92% bean-level accuracy can still show significant bias at the sample level if misclassification errors are concentrated in specific classes (e.g., Brown <-> Black confusion).
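As an illustration, all of the bean-level indicators above can be derived from the confusion matrix alone. The sketch below uses a small hypothetical 4-class matrix; the class names and counts are invented for illustration only.

```python
# Sketch: computing Section 7.1 indicators from a confusion matrix.
# Rows = true class, columns = predicted class. All values are illustrative.
classes = ["Black", "Brown", "Broken", "OK"]
cm = [  # cm[i][j] = number of beans of true class i predicted as class j
    [ 90,   8,  1,  1],
    [  6, 110,  2,  2],
    [  1,   1, 95,  3],
    [  2,   3,  4, 171],
]

n_total = sum(sum(row) for row in cm)
accuracy = sum(cm[i][i] for i in range(len(classes))) / n_total

for i, c in enumerate(classes):
    true_total = sum(cm[i])                 # all beans truly in class c
    pred_total = sum(row[i] for row in cm)  # all beans the AI labelled c
    recall = cm[i][i] / true_total
    precision = cm[i][i] / pred_total
    # Flag per Section 7.1: precision or recall below 80%
    flag = " <-- flag" if min(precision, recall) < 0.80 else ""
    print(f"{c}: precision={precision:.2f}, recall={recall:.2f}{flag}")

print(f"Overall accuracy: {accuracy:.3f}")

# Off-diagonal pairs above 5% of the true class total are flagged
for i in range(len(classes)):
    for j in range(len(classes)):
        if i != j and cm[i][j] / sum(cm[i]) > 0.05:
            print(f"Confusion {classes[i]} -> {classes[j]}: "
                  f"{cm[i][j] / sum(cm[i]):.1%}")
```

In this invented matrix the Black -> Brown pair (8% of true Black beans) would be flagged, illustrating exactly the kind of concentrated confusion the note above warns about.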
7.2 Weight Estimation
Traditional coffee grading methods rely on physical weighing. AI systems operating on images must estimate weight from 2D pixel data (area, shape). This conversion introduces a systematic error source that is independent of classification accuracy, because:
Images capture projected area but not bean thickness or density.
Different defect types have different density profiles (e.g., black beans are typically lighter per unit area).
Overlapping or touching beans in the image may cause area estimation errors.
Performance indicators
Total weight error: |W_AI - W_scale| / W_scale x 100%. Criterion: <= 2%.
Per-class weight error: |W_AI,c - W_scale,c| / W_scale,c x 100% (where feasible). Criterion: report; no fixed threshold.
Note: Per-class weight error requires the human grader to weigh each separated defect group individually (not just count them). This data is valuable but may not always be available. When available, it helps isolate whether sample-level errors originate from classification or weight estimation.
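A minimal sketch of the total weight error check, with invented weights (a 300 g COB sample weighed on a reference scale versus the AI estimate):

```python
# Sketch: total weight error per Section 7.2. Input weights are illustrative.
def total_weight_error(w_ai: float, w_scale: float) -> float:
    """Relative weight error in percent: |W_AI - W_scale| / W_scale x 100."""
    return abs(w_ai - w_scale) / w_scale * 100.0

err = total_weight_error(w_ai=295.4, w_scale=300.0)  # grams, hypothetical
print(f"Total weight error: {err:.2f}%  ({'PASS' if err <= 2.0 else 'FAIL'})")
```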
7.3 Sample-Level Defect Percentages
Sample-level validation measures how closely the AI's reported defect percentages (per class) match the human consensus. This is the most direct comparison of the AI's end-to-end output against standard grading practice. Errors at this level reflect the combined effect of classification and weight estimation.
Reference standard
The reference for each sample and defect class is the human consensus mean: the arithmetic mean of all participating human graders' results for that sample. Each human grader's own performance is measured using the leave-one-out method: grader k's reference is the mean of all graders except k.
Performance indicators
MAE per class: MAE_c = (1/n) Sum_i |y_i,c - x_i,c|. Criterion: <= 2x human inter-grader StDev for class c.
Bias per class: Bias_c = (1/n) Sum_i (y_i,c - x_i,c). Criterion: paired t-test p >= 0.05 (no significant bias).
Limits of Agreement: LoA = Bias +/- 1.96 x SD of differences. Criterion: <= 2 outliers per 30 samples (see Section 8.4).
Pearson correlation (r): correlation between AI and consensus values. Criterion: >= 0.8 for commercially critical classes.
Overall MAE: mean of per-class MAEs across all defect classes. Criterion: report; compare to human leave-one-out MAE.
Where:
y_i,c = AI measurement for sample i, class c
x_i,c = human consensus mean for sample i, class c
n = number of samples
Important --- Bias vs. MAE: These are distinct metrics. MAE measures the magnitude of error (ignoring direction). Bias measures the direction. A high MAE driven primarily by bias (e.g., Brown: MAE 0.91 pp, Bias -0.78 pp) indicates a systematic, correctable error. A high MAE with near-zero bias indicates random scatter, which is harder to correct. Both must be reported.
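The per-class MAE and Bias formulas above can be computed directly from paired AI and consensus values; a minimal sketch with invented data for one defect class:

```python
# Sketch: per-class MAE and Bias (Section 7.3). All values are illustrative.
# y = AI defect percentage per sample, x = human consensus mean per sample.
y = [2.1, 1.8, 3.0, 2.5, 1.2]   # AI (% by weight), hypothetical
x = [1.9, 2.0, 2.6, 2.4, 1.5]   # human consensus, hypothetical

n = len(y)
mae = sum(abs(yi - xi) for yi, xi in zip(y, x)) / n   # magnitude, unsigned
bias = sum(yi - xi for yi, xi in zip(y, x)) / n       # signed, systematic part

print(f"MAE  = {mae:.3f} pp")
print(f"Bias = {bias:.3f} pp")
```

Note how the signed differences partly cancel in the bias while the absolute differences do not in the MAE, which is exactly the bias-versus-MAE distinction made above.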
7.4 Screen Size Distribution
Screen size (granulometry) classification measures the distribution of bean sizes across standard sieve numbers. Rather than validating each individual sieve band independently (which produces excessive granularity and noise), this guideline uses two commercially relevant summary metrics.
Performance indicators
% above screen 15 (15+): cumulative weight percentage retained on screens 15 and above. This is the primary commercial cutoff (chato médio/graúdo). Criteria: same as Section 7.3 (MAE, Bias, Bland-Altman).
Weighted average screen: WAS = Sum(S_k x W_k) / Sum(W_k), where S_k is the screen number and W_k is the weight retained on screen k. Criteria: same as Section 7.3.
Note: Screen size estimation from images is subject to pixel-to-millimeter calibration error. Unlike defect classification, screen size depends on accurate dimensional measurement rather than visual pattern recognition. The validation protocol should verify that the imaging system's dimensional calibration is stable and traceable.
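Both summary metrics follow directly from the per-screen retained weights; a minimal sketch with an invented sieve distribution:

```python
# Sketch: the two screen-size summary metrics of Section 7.4.
# weights[k] = grams retained on screen number k (illustrative values).
weights = {13: 5.0, 14: 10.0, 15: 25.0, 16: 35.0, 17: 20.0, 18: 5.0}

total = sum(weights.values())
pct_15_plus = sum(w for s, w in weights.items() if s >= 15) / total * 100
was = sum(s * w for s, w in weights.items()) / total  # weighted average screen

print(f"% above screen 15 (15+): {pct_15_plus:.1f}%")
print(f"Weighted average screen: {was:.2f}")
```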
7.5 Commercial-Level Type Agreement
The commercial grade (Type) is derived from the total equivalent defect count according to the origin-specific grading standard. The relationship between equivalent defect count (d) and Type follows a logarithmic scale:
Type = ln(d) / b where b ~ 0.709
This logarithmic relationship has a critical implication for validation: absolute differences in defect counts do not correspond to equal differences in Type. Small defect count errors have a large impact on high-quality coffees (low defect counts) and a negligible impact on lower grades (high defect counts):
4 defects (Type 2) -> 14 defects: +1.8 Types
26 defects (Type 4--5) -> 36 defects: +0.5 Types
86 defects (Type 6) -> 96 defects: +0.2 Types
160 defects (Type 7) -> 170 defects: +0.1 Types
Therefore, validation at the commercial level must be performed on the Type scale, not on raw defect counts.
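The Type formula and its logarithmic sensitivity can be reproduced directly. The sketch below uses the document's constant b = 0.709 and shows that the same +10 defect error produces very different Type impacts at the two ends of the scale:

```python
import math

# Sketch: Type formula and Delta Type sensitivity (Section 7.5).
B = 0.709  # scale constant from the document's Type formula

def coffee_type(defects: float) -> float:
    """Type = ln(d) / b."""
    return math.log(defects) / B

def delta_type(d_ai: float, d_human: float) -> float:
    """|Delta Type| = |ln(d_AI) - ln(d_human)| / b."""
    return abs(math.log(d_ai) - math.log(d_human)) / B

# Same +10 defect error, very different Type impact:
print(f"4 -> 14 defects:    +{delta_type(14, 4):.1f} Types")    # high-quality lot
print(f"160 -> 170 defects: +{delta_type(170, 160):.1f} Types") # low-grade lot
```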
Performance indicators
The Type difference between AI and human is computed as:
|Delta Type| = |ln(d_AI) - ln(d_human)| / b
Mean |Delta Type|: average absolute Type difference across all samples. Criterion: report.
% within +/-0.5 Type: proportion of samples where AI and human agree within half a Type step. Criterion: >= 90%.
% within +/-0.25 Type: proportion of samples where AI and human show excellent agreement. Criterion: report (target >= 75%).
Bland-Altman on Delta Type: paired t-test and LoA on the Type scale. Criterion: p >= 0.05; <= 2 outliers beyond LoA.
Type match rate: proportion of samples where AI and human assign exactly the same integer Type. Criterion: report.
Critical: Bland-Altman analysis for Type agreement must be performed on Type values (or equivalently on ln(d)), not on raw defect counts. Performing the analysis on raw counts would violate the assumption of uniform measurement scale and would underweight errors on high-quality coffees.
8. Validation Protocol
8.1 Sample Requirements
Number of blind samples: 30 per origin/grading standard (aligned with ISTA TCOM-P-12 Annex 4).
Type range coverage: samples must span at least 3 full Type steps, e.g., Type 2--3 (high quality), Type 4--5 (mid), Type 6+ (low). This ensures the logarithmic scale is tested across its range.
Sample diversity: samples from at least 3 different lots, to prevent overfitting to a single production batch.
Sample size: per origin standard (e.g., 300 g for COB); must match the standard human grading method.
8.2 Human Reference Standard
Number of human graders: >= 3 across the sample set. Each sample must be graded by at least 2 independent graders; at least 3 different graders must participate across the full set.
Grader qualification: trained and experienced in the origin's grading method. Graders must be current QC staff routinely performing this analysis.
Blind protocol: required. Graders must not know the AI result, each other's results, or the expected composition of the sample.
Independence: required. Graders must perform their analysis independently, without discussion or comparison.
Note on consensus stability: The human consensus mean is the reference for all AI performance metrics. This consensus shifts depending on which graders participate. The more graders contribute, the more stable the consensus. With only 2 graders, replacing one changes the reference substantially. With 4+, the consensus is more robust. This is why >= 3 graders are required and 4+ are recommended.
8.3 Data Collection
Human grading
Each grader receives the same physical samples, identified only by sample code.
Graders perform standard physical analysis per the origin method:
Separate defective beans from the 300 g sample.
Classify defects by type.
Weigh each defect group separately (if feasible --- enables weight estimation validation).
Record defect weight percentages per class.
Perform screen size analysis on the catação (clean fraction) using standard sieve set.
Record weight retained per screen number.
Results are submitted to the quality manager without inter-grader discussion.
AI scanning
The same physical samples are scanned by the AI system.
Record:
Bean class assignments (if exportable)
AI-estimated total sample weight
Defect weight percentages per class
Screen size distribution (per screen number)
Model version, date, and scan parameters
Record the actual scale weight of the sample for weight estimation validation (Section 7.2).
8.4 Statistical Analysis
Step 1: Compute human consensus and inter-grader variability
For each sample i and defect class c:
Consensus mean: x_i,c = mean of all human graders for that sample/class.
Human inter-grader StDev: computed per class across all samples using the leave-one-out method.
Human leave-one-out MAE: for each grader k, MAE_k,c = mean |x_k,i,c - x_(minus k),i,c|, where x_(minus k) excludes grader k.
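The Step 1 computations can be sketched as follows, with invented grades for one defect class (rows are graders, columns are samples):

```python
# Sketch: human consensus and leave-one-out MAE (Section 8.4, Step 1).
# grades[k][i] = grader k's defect % for sample i (one class; illustrative).
grades = [
    [2.0, 3.5, 1.0, 4.2],   # grader A
    [2.4, 3.1, 1.3, 4.0],   # grader B
    [1.9, 3.6, 0.9, 4.4],   # grader C
]
n_samples = len(grades[0])

# Consensus mean per sample (all graders)
consensus = [sum(g[i] for g in grades) / len(grades) for i in range(n_samples)]

# Leave-one-out MAE per grader: the reference excludes the grader being scored
for k, gk in enumerate(grades):
    others = [g for j, g in enumerate(grades) if j != k]
    ref = [sum(g[i] for g in others) / len(others) for i in range(n_samples)]
    mae_k = sum(abs(gk[i] - ref[i]) for i in range(n_samples)) / n_samples
    print(f"Grader {k}: leave-one-out MAE = {mae_k:.3f} pp")
```

Excluding grader k from their own reference avoids the agreement inflation described in Section 6.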
Step 2: Compute AI performance metrics (Sections 7.1--7.5)
For each measurement level, compute all indicators listed in the corresponding section. The Bland-Altman analysis follows the procedure described in ISTA TCOM-P-12 Annex 4:
Let x_i = human consensus and y_i = AI measurement for sample i.
Compute Bias = (1/n) Sum(y_i - x_i).
Compute SD = sqrt[(1/(n-1)) Sum((y_i - x_i) - Bias)^2].
Perform a paired t-test. If p < 0.05, the bias is statistically significant --- the AI shows systematic error on this metric.
Compute Limits of Agreement: Lower LoA = Bias - 1.96 x SD; Upper LoA = Bias + 1.96 x SD.
Count the number of points falling outside the LoA. Accept if <= 2 out of 30 (approximately 5%).
Generate Bland-Altman plots: x-axis = mean of AI and human, y-axis = difference (AI - human).
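The Step 2 procedure can be sketched with standard-library Python only. The exact p-value requires a statistics package (e.g., scipy.stats.ttest_rel), so this sketch compares |t| against the tabulated critical value instead; all data are illustrative.

```python
import math

# Sketch: Bland-Altman analysis per Section 8.4, Step 2 (illustrative data).
# A full validation would use n = 30 paired values per metric.
x = [2.0, 3.5, 1.0, 4.2, 2.8, 3.1]   # human consensus
y = [2.2, 3.4, 1.3, 4.5, 2.9, 3.3]   # AI measurement

n = len(x)
d = [yi - xi for yi, xi in zip(y, x)]
bias = sum(d) / n
sd = math.sqrt(sum((di - bias) ** 2 for di in d) / (n - 1))

# Paired t statistic; |t| above the two-tailed critical value for df = n - 1
# (e.g., 2.045 for n = 30 at alpha = 0.05) means significant bias.
t = bias / (sd / math.sqrt(n))

lower, upper = bias - 1.96 * sd, bias + 1.96 * sd
outliers = sum(1 for di in d if di < lower or di > upper)

print(f"Bias = {bias:+.3f}, SD = {sd:.3f}, t = {t:.2f}")
print(f"LoA = [{lower:+.3f}, {upper:+.3f}], outliers = {outliers}")
```

For the plot itself, x-axis = (x_i + y_i) / 2 and y-axis = d_i, with horizontal lines at the bias and both LoA, as described above.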
Step 3: For commercial-level (Type), apply the logarithmic transformation
Convert AI and human equivalent defect counts to Type: Type = ln(d) / 0.709.
Perform the Bland-Altman analysis on the Type values, not the raw defect counts.
Compute |Delta Type| per sample and the acceptance percentages (+/-0.25, +/-0.5 Type).
8.5 Acceptance Criteria Summary
Bean classification --- overall accuracy: >= 90%.
Bean classification --- confusion pairs > 5%: flagged for review.
Weight estimation --- total weight error: <= 2%.
Sample-level (per class) --- bias (paired t-test): p >= 0.05.
Sample-level (per class) --- LoA outliers: <= 2 / 30.
Sample-level (per class) --- MAE: <= 2x human inter-grader StDev.
Sample-level (per class) --- Pearson r: >= 0.8 (critical classes).
Screen size --- 15+ bias (paired t-test): p >= 0.05.
Screen size --- weighted avg screen bias: p >= 0.05.
Commercial (Type) --- % within +/-0.5 Type: >= 90%.
Commercial (Type) --- bias on Type scale (paired t-test): p >= 0.05.
Partial compliance: If a model meets acceptance criteria on some but not all defect classes, it may be deployed with documented limitations. For example, a model that passes on all classes except Brown and Black (due to known confusion) may be deployed if the quality team is aware that these two classes require manual review. Such limitations must be documented in the validation record (Annex 1).
9. Ongoing Monitoring and Revalidation
Routine monitoring
Quarterly: abbreviated check (key MAEs, bias direction, weight estimation error) on 10 blind samples with >= 2 human graders.
Annually: full validation per Section 8 on 30 blind samples with >= 3 human graders.
Revalidation triggers
Full revalidation (30 samples) is required when any of the following occur:
The AI model is updated, retrained, or replaced.
The imaging hardware is changed or recalibrated.
The software version is updated in a way that affects the classification or weight estimation pipeline.
Routine monitoring reveals a previously passing metric now fails acceptance criteria.
The origin-specific grading standard is revised.
A new coffee variety or processing method is introduced that was not represented in the original validation dataset.
Corrective actions
When validation or monitoring reveals unacceptable performance:
Identify whether the error source is classification (confusion matrix), weight estimation, or both.
For classification errors: review the training database for the affected class boundaries; consider adding representative samples or correcting borderline labels.
For weight estimation errors: verify imaging hardware calibration; review the weight estimation model.
For systematic bias on specific classes: evaluate whether a calibration offset is appropriate as a temporary measure while retraining is prepared. Document any offsets applied.
After correction, perform full revalidation.
10. Record Keeping
The following records must be retained for as long as the AI system is in use, and a minimum of six years after decommissioning:
Validation reports including all metrics from Sections 7.1--7.5, Bland-Altman plots, and acceptance decisions (Annex 2).
Raw grading data: human graders' results, AI outputs, sample identifiers.
Model identification: version, architecture, training date, training report.
Monitoring results from quarterly checks.
Revalidation records, including the trigger for revalidation.
Any calibration offsets applied, with justification and date.
Corrective action records.
11. Annexes
Annex 1: Tool Overview and Application Scope
To be completed for each AI system and origin.
Name of the tool:
Laboratory / Site:
Scope: (e.g., Green coffee defect analysis, Vietnam method)
Origin / Grading standard:
Defect classes supported: (e.g., Broken, Brown, Fragment, Black, Mold, Husk, Pod, SilverSkin, Stick, Stone, Immature, Floaters, InsectDamage, OK)
Defect groups: (e.g., BB (Black + Broken), FM (Fragment + Mold + ...)) --- per origin method
Screen size range: (e.g., Screens 11--18+)
Model version:
Model architecture: (e.g., ConvNeXt Large, EVA-02 Large)
Bean-level accuracy: (from training report)
Deployment date:
Validation status: VALIDATED / CONDITIONAL / NOT VALIDATED
Known limitations: (e.g., Brown vs Black confusion above threshold)
SOP reference:
Responsible person:
Annex 2: Validation Scorecard
To be completed for each validation round.
Model version:
Validation date:
Number of samples: ____ / 30 minimum
Human graders:
Origin / Grading standard:
Bean classification --- overall accuracy: >= 90%
Weight estimation --- total weight error: <= 2%
Sample: Broken --- MAE (pp): <= 2x human StDev
Sample: Broken --- Bias (pp): ---
Sample: Broken --- paired t-test (p): >= 0.05
Sample: Broken --- Pearson r: >= 0.8
... repeat for each defect class ...
Screen size --- 15+ bias (p): >= 0.05
Screen size --- weighted avg screen bias (p): >= 0.05
Commercial (Type) --- % within +/-0.5 Type: >= 90%
Commercial (Type) --- bias on Type scale (p): >= 0.05
Commercial (Type) --- mean |Delta Type|: report
Decision: VALIDATED / CONDITIONAL / NOT VALIDATED
Conditions (if applicable):
Authorized by:
Date:
Annex 3: Bland-Altman Procedure for Sample-Level and Type Validation
This annex describes the Bland-Altman method as applied to coffee AI grading validation, adapted from ISTA TCOM-P-12 Annex 4 (Bland & Altman, 1986).
A3.1 For sample-level defect percentages
For each defect class c, with n >= 30 blind samples:
Let x_i = human consensus mean for sample i, and y_i = AI measurement for sample i.
Compute differences: d_i = y_i - x_i.
Compute Bias = (1/n) Sum d_i.
Compute SD = sqrt[(1/(n-1)) Sum(d_i - Bias)^2].
Paired t-test: t = Bias / (SD / sqrt(n)). If p < 0.05, bias is significant -> FAIL.
Limits of Agreement: Lower = Bias - 1.96 x SD; Upper = Bias + 1.96 x SD.
Count outliers beyond LoA. If <= 2 out of 30 (~ 5%) -> PASS.
Generate Bland-Altman plot: x-axis = (x_i + y_i) / 2; y-axis = d_i. Draw Bias (dashed red) and LoA (dotted gray) lines.
A3.2 For commercial-level Type agreement
The procedure is identical to A3.1, except the measurements are on the Type scale:
Convert equivalent defect counts to Type: Type_AI = ln(d_AI) / 0.709; Type_human = ln(d_human) / 0.709.
Proceed with steps 2--8 from A3.1, using Type values instead of defect percentages.
Do not apply the Bland-Altman procedure to raw defect counts. The logarithmic relationship between defects and Type means that the assumption of a uniform measurement scale (required by Bland-Altman) is only met on the Type scale, not on the raw count scale.
A3.3 Interpretation guide
Bias near zero, few outliers (t-test PASS, LoA PASS): AI agrees with human consensus; no systematic error.
Significant bias, few outliers (t-test FAIL, LoA PASS): systematic offset. The error is predictable and correctable (calibration or retraining); precision is adequate.
Bias near zero, many outliers (t-test PASS, LoA FAIL): no systematic error, but high random scatter on some samples. Investigate whether outlier samples have unusual properties.
Significant bias AND many outliers (t-test FAIL, LoA FAIL): both systematic and random errors. The model needs significant improvement before deployment.
Annex 4: Origin-Specific Module --- Brazil (COB)
A4.1 Grading standard
Instrução Normativa nº 8, 11 June 2003. Sample size: 300 g of beneficiated green coffee.
A4.2 Defect equivalence table
The number given is the count of beans (or objects) equivalent to 1 defect, unless stated otherwise:
Grão preto (Black): 1
Grão ardido (Sour/Brown): 2
Grão preto-verde (Stinker): 2
Grão brocado (Insect damage): 2--5
Grão concha (Shell): 3
Grão verde (Immature): 5
Grão quebrado (Broken): 5
Grão chocho (Withered): 5
Grão esmagado (Crushed): 5
Coco (Cherry): 1
Casca grande (Large husk): 1
Casca pequena (Small husk): 2--3
Pergaminho (Parchment): 2
Marinheiro (Sailor): 2
Pau/Pedra/Torrão grande: 1 = 5 defects
Pau/Pedra/Torrão regular: 1 = 2 defects
Pau/Pedra/Torrão pequeno: 1 = 1 defect
A4.3 Type conversion formula
Type = ln(d) / 0.709
Where d = total equivalent defects in 300 g sample.
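As a sketch of how the A4.2 table and the Type formula combine, the example below converts invented per-class bean counts to equivalent defects and then to Type. Only a subset of the table is included; ranges such as brocado's 2--5 need grader judgement and are omitted here.

```python
import math

# Sketch: defect counts -> equivalent defects -> Type (Annex 4, COB).
# beans_per_defect[d] = beans of defect d that count as 1 equivalent defect.
beans_per_defect = {
    "preto": 1,       # 1 black bean = 1 defect
    "ardido": 2,      # 2 sour/brown beans = 1 defect
    "verde": 5,       # 5 immature beans = 1 defect
    "quebrado": 5,    # 5 broken beans = 1 defect
}

counts = {"preto": 3, "ardido": 6, "verde": 10, "quebrado": 15}  # per 300 g, invented

equiv = sum(counts[d] / beans_per_defect[d] for d in counts)
coffee_type = math.log(equiv) / 0.709

print(f"Equivalent defects: {equiv:.0f}")   # 3 + 3 + 2 + 3 = 11
print(f"Type: {coffee_type:.2f}")
```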
A4.4 Type scale reference
4 defects: Type 2
12 defects: Type 2--3
26 defects: Type 4
46 defects: Type 5
86 defects: Type 6
160 defects: Type 7
360 defects: Type 8
A4.5 Screen size method
100 g of the catação (clean fraction) is passed through a standard sieve set (screens 10--19 for chato; 8--13 for moca). Weight retained per screen is recorded. Validation metrics per Section 7.4: % above screen 15 and weighted average screen number.
Annex 5: Origin-Specific Module --- Vietnam
A5.1 Grading standard
Weight-percentage method. All defect groups have equal weight factor (= 1). This is distinct from ISO 10470, which assigns different weight factors per defect group.
A5.2 Defect groups
BB (Black + Broken): FullBlack, Broken. Weight factor: 1.
FM (Foreign Matter): Fragment, Husk, Stick, Stone, Pod, SilverSkin. Weight factor: 1.
Total Defects: all defect classes except OK. Weight factor: 1.
A5.3 Validation specifics
Since Vietnam does not use an equivalence table or Type system, validation at the commercial level (Section 7.5) is replaced by validation of the defect group percentages (BB, FM, Total Defects) using the same Bland-Altman methodology as Section 7.3.
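The group percentages that replace Type validation here are plain weight-percentage aggregations; a minimal sketch with invented per-class defect weights from a 300 g sample:

```python
# Sketch: Vietnam defect-group percentages (Annex 5). All weights illustrative.
class_weights = {  # per-class defect weight in grams (hypothetical)
    "FullBlack": 2.1, "Broken": 4.5, "Fragment": 1.2, "Husk": 0.6,
    "Stick": 0.3, "Stone": 0.2, "Pod": 0.4, "SilverSkin": 0.5,
}
sample_weight = 300.0  # grams

groups = {
    "BB": ["FullBlack", "Broken"],
    "FM": ["Fragment", "Husk", "Stick", "Stone", "Pod", "SilverSkin"],
}

for name, members in groups.items():
    pct = sum(class_weights[c] for c in members) / sample_weight * 100
    print(f"{name}: {pct:.2f}%")

# Total Defects = all defect classes except OK (OK beans are not listed here)
total_defects = sum(class_weights.values()) / sample_weight * 100
print(f"Total Defects: {total_defects:.2f}%")
```

Each group percentage is then compared against the human consensus with the Section 7.3 Bland-Altman procedure, exactly as stated above.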
Annex 6: Origin-Specific Module --- ISO 10470 (Weighted Factors)
A6.1 Grading standard
ISO 10470:2004 --- Green coffee defect reference chart. Used primarily for Robusta grading (e.g., Nestle purchasing specifications). Each defect group has a different weight factor, unlike the Vietnam equal-weight method.
A6.2 Validation specifics
Validation follows the same protocol as Section 8, but with the origin-specific weight factors applied to compute weighted defect percentages. The Bland-Altman analysis is performed on the weighted percentages.
Detailed weight factor table to be added based on the specific ISO 10470 edition in use.
Revision History
Version 0.1 --- March 2026 --- Initial draft.
Csmart Coffee Technologies SA