Creating a CNN Training Dataset

An effective training dataset is crucial to developing a robust AI model. The process begins with clear planning and definition of the classes. One can use the SCA definition, which comprises 16 defect classes plus 1 non-defective class, or any other standard. The important step is to define the classes properly before gathering raw material. Consider the following examples:

Example of a Detailed Class List: Broken, Floater, Fox Bean, Full Black, Full Sour, Husk, Immature, Ok, Parchment, Partial Black, Partial Sour, Pod, Severe Insect Damage, Shell, Slight Insect Damage, Sticks, Stones, Withered.

Example of a Simplified Class List: Broken, Black, Floater, Foreign Matter, Husk, Insect Damage, Immature, Ok, Parchment, Pod, Shell, Sour, Withered.

Both lists are valid and can serve as the basis of an AI model for Csmart Digit. The choice of classes depends solely on the intended use by the company creating the model. Note that both defective and non-defective (good) classes must be included. A CNN does not operate by exclusion: it can only assign a seed to a class it has seen during training, so the dataset must cover the entire spectrum of classes the model is meant to represent.

Additionally, it could be useful to create custom classes that may not be part of standard classification methods, nor considered proper defects, but are relevant as a quality indicator, such as spotted, faded, or even fermented seeds. This approach ensures the retrieval of comprehensive information about the analyzed samples, beyond defect counting.

CNN Constraints: Focus on Visual Information

A CNN has its limitations, as it is inherently designed to process and analyze visual information. It cannot effectively retrieve or interpret non-visual information, which requires different types of sensors and analyses. For instance, attempting to create classes related to moisture content or density will likely fail, and the network may instead learn unrelated visual patterns that happen to coincide with the labels.

Class Subsets

After listing the classes, it is mandatory to group them into four subsets: Ok Classes, Primary Defects, Secondary Defects, and Foreign Matter. Each model must incorporate these subsets with the specified division. However, the allocation of classes within each subset is left to the user's discretion. The following is an example of a possible subset division for the detailed class list:

  • Ok Classes: Ok, Fox Bean

  • Primary Defects: Full Black, Full Sour, Immature, Pod

  • Secondary Defects: Husk, Partial Black, Partial Sour, Severe Insect Damage, Shell, Slight Insect Damage

  • Foreign Matter: Sticks, Stones
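The subset division above can be written down as a small configuration and checked before any images are collected. The following is a minimal sketch, not the format Csmart Digit itself uses: the names `REQUIRED_SUBSETS`, `SUBSETS`, and `build_class_to_subset` are assumptions made for illustration. It verifies that all four required subsets are present and that no class is allocated to more than one of them.

```python
# The four subset names come from the text; the allocation copies the example
# division above. The data structure itself is a hypothetical illustration.
REQUIRED_SUBSETS = ("Ok Classes", "Primary Defects", "Secondary Defects", "Foreign Matter")

SUBSETS = {
    "Ok Classes": ["Ok", "Fox Bean"],
    "Primary Defects": ["Full Black", "Full Sour", "Immature", "Pod"],
    "Secondary Defects": ["Husk", "Partial Black", "Partial Sour",
                          "Severe Insect Damage", "Shell", "Slight Insect Damage"],
    "Foreign Matter": ["Sticks", "Stones"],
}


def build_class_to_subset(subsets):
    """Return a {class: subset} lookup, rejecting missing subsets and duplicates."""
    missing = [name for name in REQUIRED_SUBSETS if not subsets.get(name)]
    if missing:
        raise ValueError(f"missing or empty subsets: {missing}")
    lookup = {}
    for subset, classes in subsets.items():
        for cls in classes:
            if cls in lookup:
                raise ValueError(f"{cls!r} is in both {lookup[cls]!r} and {subset!r}")
            lookup[cls] = subset
    return lookup


CLASS_TO_SUBSET = build_class_to_subset(SUBSETS)
print(CLASS_TO_SUBSET["Pod"])  # Primary Defects
```

A lookup of this kind also makes it easy to spot classes from the class list that have not yet been allocated to any subset.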

Importance of a Large Dataset

With the initial planning done, it is time to sort the raw material into the desired classes. For a standard coffee model, the suggested amount is between 6,000 and 8,000 seeds per class. These numbers are not fixed and will vary with the variability that needs to be represented. Under the hood, the gathered images are divided into three datasets: training, validation, and test. Typically, about 70% of the data is used for training, 15% for validation, and 15% for testing. The model learns patterns from the training dataset only. The validation dataset is evaluated periodically during training to monitor progress and guide tuning decisions, but the model's weights are never updated from it. The test dataset, also unseen during training, is used for the final evaluation of performance. Because every class has to supply all three datasets, this division highlights the need for a large and diverse set of images.
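To make the 70/15/15 division concrete, here is a minimal sketch of a reproducible random split. It is an illustration only and does not describe how the split is actually implemented under the hood: the function name `split_dataset`, the fixed random seed, and the file names are all assumptions.

```python
import random


def split_dataset(items, train=0.70, val=0.15, seed=0):
    """Shuffle items reproducibly and cut them into train/validation/test lists."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # the remainder becomes the test set


# Hypothetical file names for one class with 6,000 images.
images = [f"broken_{i:04d}.png" for i in range(6000)]
train_set, val_set, test_set = split_dataset(images)
print(len(train_set), len(val_set), len(test_set))  # 4200 900 900
```

With 6,000 images in a class, only 900 remain for validation and 900 for testing, which is one reason the suggested per-class quantity is in the thousands.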

Importance of Representative Sampling

The goal is to ensure that the dataset comprehensively covers the variability found in the real-world scenarios the AI model will encounter. For example, if the AI model is intended to represent the coffee of an entire country, the classes must be representative of the diversity within that country. This includes different regions, altitudes, growing conditions, and processing methods. Therefore, creating a class from a single batch of a single producer will not provide a representative sample. Instead, the dataset should include samples from various producers, regions, and harvests to capture the full spectrum of possible variations.

Balancing Quantity and Quality

While the suggested range is between 6,000 and 8,000 images per class, the actual number can vary. The key is to balance quantity with quality. Having a large number of high-quality, well-labeled images is crucial, but it is equally important to ensure that these images are representative of the diversity within each class. A smaller, high-quality dataset that accurately reflects the variations within each class is often more valuable than a larger, lower-quality dataset.

Balance Between Classes

Maintaining a balance in the quantity of images per class is crucial for the effective training of an AI model. When the dataset has a similar number of images for each class, the model learns to recognize and differentiate between classes more accurately. If some classes have significantly more images than others, the model may become biased, favoring the more represented classes and potentially neglecting the underrepresented ones. This imbalance can lead to poor generalization and inaccurate predictions, as the model might not adequately learn the distinguishing features of the less frequent classes. Some classes naturally occur less frequently than others, which makes a balanced dataset difficult to obtain. This is one of the main challenges of creating a good training dataset, and addressing it is essential for developing a robust and fair model.
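A simple way to keep an eye on class balance is to compare each class count with the largest class. The sketch below assumes a hypothetical threshold (`min_ratio=0.8`) and made-up counts. Neither comes from the source, and the acceptable ratio is a judgment call for each project.

```python
def find_underrepresented(counts, min_ratio=0.8):
    """Return classes whose image count is below min_ratio of the largest class."""
    largest = max(counts.values())
    return {cls: n for cls, n in counts.items() if n < min_ratio * largest}


# Hypothetical counts, for illustration only.
counts = {"Ok": 8000, "Broken": 7200, "Pod": 2500, "Stones": 6100}
print(find_underrepresented(counts))  # {'Pod': 2500, 'Stones': 6100}
```

Classes flagged this way are candidates for further raw material collection before training.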

Registering the Dataset

With all raw material thoroughly sorted by hand, it is time to digitize the data. Each class of coffee should be processed through Csmart Digit in the same manner as when performing a sample analysis. The difference now is that the expected outcome is known, so the result is not relevant in this step. Follow these steps in the Csmart Digit software to produce the new dataset:

  1. Create a new analysis, naming it with the intended class name;

  2. Feed the hopper with the specified class, ensuring only the specified class is used;

Note: Do not run the same sample twice in order to produce more images from fewer seeds. Doing so is counterproductive: images of the same seed could end up in both the training and test sets. This overlap makes the test artificially easy, because the model is evaluated on seeds it has already seen, so the reported performance becomes overly optimistic. This issue is known as data leakage.

  3. Record all seeds and then run any AI model. As previously mentioned, the intention is not to classify the images yet, but to store them;

  4. Navigate to the Export Images section and select Export as a single class;

  5. Open the exported folder and review the saved images, ensuring only images of the intended class are present;

  6. Repeat this operation for all classes in the new dataset.
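The data leakage described in the note can be illustrated with a small sketch. It assumes, purely for illustration, that every image can be traced back to a seed identifier. The software is not stated to provide this, so the mapping `image_to_seed` and the function `split_by_seed` are hypothetical. The idea is that whole seeds, not individual images, are assigned to one side of the split, so that no seed is ever seen in both training and test.

```python
import random
from collections import defaultdict


def split_by_seed(image_to_seed, test_fraction=0.15, rng_seed=0):
    """Assign whole seeds, not single images, to the train or test side."""
    by_seed = defaultdict(list)
    for image, seed_id in image_to_seed.items():
        by_seed[seed_id].append(image)
    seed_ids = sorted(by_seed)
    random.Random(rng_seed).shuffle(seed_ids)
    n_test = max(1, round(len(seed_ids) * test_fraction))
    test = [img for s in seed_ids[:n_test] for img in by_seed[s]]
    train = [img for s in seed_ids[n_test:] for img in by_seed[s]]
    return train, test


# Hypothetical case: the same 20 seeds were run twice, giving 40 images.
image_to_seed = {f"run{r}_seed{s:02d}.png": s for r in (1, 2) for s in range(20)}
train, test = split_by_seed(image_to_seed)
print(len(train), len(test))  # 34 6
```

In practice such identifiers are rarely available after the fact, which is why the safer course is the one given in the note: run each seed only once.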

By the end of this process, one should have a set of folders named after each class, which together comprise the new training dataset.
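The resulting folder structure can be sanity-checked with a short script. This is a sketch under assumptions: the image file extensions, the function name `count_images_per_class`, and the idea of passing in the expected class names are illustrative and not part of the Csmart Digit software.

```python
from pathlib import Path

IMAGE_SUFFIXES = {".png", ".jpg", ".jpeg", ".bmp"}  # assumed export formats


def count_images_per_class(dataset_root, expected_classes):
    """Count image files in each class folder and report folder-name mismatches."""
    root = Path(dataset_root)
    folders = {p.name: p for p in root.iterdir() if p.is_dir()}
    missing = sorted(set(expected_classes) - set(folders))
    unexpected = sorted(set(folders) - set(expected_classes))
    counts = {
        name: sum(1 for f in folder.iterdir()
                  if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES)
        for name, folder in folders.items()
    }
    return counts, missing, unexpected
```

Calling `count_images_per_class(root, classes)` on the dataset folder returns the number of images per class together with any expected class that has no folder yet and any folder whose name is not in the class list, for example because of a misspelled analysis name.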

Class Image Samples

Ok Class

Sour

Insect Damage

Husk

Pod

Broken

Parchment

Sticks

Stones

Black

Immature
