Introduction to Neural Networks and CNNs

What Is a Neural Network?

A neural network is a type of machine-learning model, loosely inspired by the structure of the human brain, that learns to recognize patterns and make decisions. It is a way for a computer to learn from examples, much as a child learns to identify objects by looking at pictures. The basic building blocks of a neural network are neurons: tiny decision-makers that look at their input and decide what to pass on to the next layer.

These neurons are organized into layers. The first layer, known as the input layer, is where the network first receives data, in our case, images of coffee. Between the input and output layers are hidden layers, which perform various transformations and computations to help the network learn from the data. These layers are called "hidden" because they are not directly visible in the input or output; they do the heavy lifting behind the scenes. The output layer is where the network produces its final prediction.

Diagram of a neural network.

How Neurons Work

Each neuron receives one or more numerical inputs, multiplies each input by a weight, sums the results, adds a bias term, and passes the total through an activation function. The weight determines how much influence each input has, and the bias shifts the decision boundary so the neuron can better fit the data.
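This computation is small enough to write out directly. The sketch below is a toy single neuron in plain Python, with made-up inputs, weights, and bias, using ReLU as the activation:

```python
def neuron(inputs, weights, bias):
    """One artificial neuron: weighted sum plus bias, then ReLU activation."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return max(0.0, total)  # ReLU: pass positives through, clip negatives to zero

# Hypothetical numbers purely for illustration:
out = neuron([0.5, -1.0, 2.0], weights=[0.8, 0.2, -0.5], bias=0.1)
# 0.5*0.8 + (-1.0)*0.2 + 2.0*(-0.5) + 0.1 = -0.7, which ReLU clips to 0.0
print(out)
```

A real network computes millions of these tiny operations per layer; the weights and bias are the quantities adjusted during training.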

The activation function introduces non-linearity into the network. Without it, stacking multiple layers would be no more powerful than a single layer, because a chain of linear operations is still linear. Common activation functions include:

  • ReLU (Rectified Linear Unit): Outputs the input directly if it is positive; otherwise outputs zero. ReLU is the most widely used activation in modern networks because it is simple and trains efficiently.

  • Sigmoid: Squashes the input into a range between 0 and 1, making it useful for outputting probabilities.

  • Softmax: Applied to the output layer in classification tasks. It converts a vector of raw scores into a probability distribution that sums to 1, so the network can express how likely each class is.
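These functions are one-liners in practice. The snippet below is an illustrative plain-Python softmax applied to made-up scores, showing that the outputs form a probability distribution:

```python
import math

def softmax(scores):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    m = max(scores)  # subtract the max score first for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])  # hypothetical logits for three classes
print(probs)       # the highest logit gets the highest probability
print(sum(probs))  # always 1.0 (up to floating-point rounding)
```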

Forward Pass and Predictions

The process where data moves from the input layer, through the hidden layers, to the output layer, generating predictions, is called the forward pass. At each layer, every neuron computes its weighted sum, applies the activation function, and passes the result to the next layer. When the data reaches the output layer, the network produces a prediction.

It is important to understand that the final prediction is a statistical decision made by the network. The neural network calculates the probabilities of the input belonging to each class and subsequently selects the class with the highest probability as its prediction. Consequently, even if the network is uncertain, it will still produce a prediction, as it is designed to provide an answer based on the learned patterns in the data.

Loss Functions

After making a prediction, the network needs a way to measure how wrong it was. This measurement is provided by a loss function (also called a cost function), which quantifies the difference between the predicted output and the true label.

  • Cross-Entropy Loss: The standard loss function for classification tasks. It penalizes confident wrong predictions heavily and rewards confident correct predictions. For a binary classification problem, the formula uses the predicted probability and the true label (0 or 1); for multiclass problems, it generalizes across all classes.

  • Mean Squared Error (MSE): More common in regression tasks, where the output is a continuous value rather than a class label.

The goal of training is to minimize the loss — that is, to adjust the network so that its predictions get closer to the true answers over time.
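To make this concrete, here is a minimal cross-entropy calculation in plain Python with hypothetical predicted probabilities, showing the asymmetry described above:

```python
import math

def cross_entropy(predicted_probs, true_class):
    """Cross-entropy loss: the negative log of the probability
    the model assigned to the true class."""
    return -math.log(predicted_probs[true_class])

# A confident correct prediction gives a small loss...
print(cross_entropy([0.9, 0.05, 0.05], true_class=0))  # about 0.11
# ...while a confident wrong prediction is penalized heavily.
print(cross_entropy([0.05, 0.9, 0.05], true_class=0))  # about 3.0
```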

Backpropagation and Training

After computing the loss, the network adjusts its weights and biases to improve future predictions through a process known as backpropagation. Backpropagation works by calculating how much each weight contributed to the error (using the mathematical chain rule) and then nudging each weight in the direction that reduces the loss.

The size of each adjustment is controlled by the learning rate, a hyperparameter that determines how big each step is:

  • If the learning rate is too large, the network may overshoot the optimal weights and fail to converge.

  • If the learning rate is too small, training becomes very slow and may get stuck in suboptimal solutions.
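Both failure modes are easy to see on a toy one-dimensional problem. The sketch below runs plain gradient descent on the made-up function f(w) = (w - 3)^2, whose minimum is at w = 3:

```python
def minimize(lr, steps=20):
    """Gradient descent on f(w) = (w - 3)^2, whose minimum is at w = 3."""
    w = 0.0
    for _ in range(steps):
        grad = 2 * (w - 3)  # derivative of (w - 3)^2
        w = w - lr * grad   # step against the gradient, scaled by the learning rate
    return w

print(minimize(lr=0.1))    # converges close to 3
print(minimize(lr=0.001))  # too small: barely moves in 20 steps
print(minimize(lr=1.1))    # too large: overshoots and diverges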

The algorithm that performs these updates is called an optimizer. Common optimizers include:

  • SGD (Stochastic Gradient Descent): Updates weights using a small random subset (mini-batch) of the training data at each step.

  • Adam: An adaptive optimizer that adjusts the learning rate for each weight individually, often converging faster than SGD.

The continuous cycle of forward pass, loss computation, and backpropagation is referred to as training. One complete pass through the entire training dataset is called an epoch. Training typically runs for many epochs until the loss stabilizes or validation performance stops improving.

Training a neural network involves providing it with a dataset of inputs (e.g., images) and their corresponding labels. The network measures the gap between its predictions and the true answers and iteratively adjusts its weights and biases to drive the loss down over time.

Domain Shift, Confident Errors, and Out-of-Distribution Detection

The likelihood of an incorrect prediction increases when the data seen at inference time differs substantially from the training data. This mismatch is known as domain shift. For example, if a model is trained exclusively on images of dry-processed coffee beans but is later used to classify wet-processed beans, the visual differences (color, texture, surface patterns) may cause the model to make unreliable predictions. Ensuring that the training dataset is representative of the conditions the model will encounter in production is essential for minimizing domain shift.

Why a Model Can Be Confident and Still Wrong

A critical subtlety of neural networks is that the softmax output layer always produces a probability distribution that sums to 1 — even when the input is completely unlike anything the model has seen during training. The softmax function forces the network to distribute all of its confidence across the known classes, regardless of whether the input actually belongs to any of them. This means a model presented with, say, a stone or a piece of wood instead of a coffee bean will still confidently assign it to one of its learned classes (e.g., "92% black defect"), because the model has no concept of "none of the above."

This is not a flaw in a specific model — it is an inherent property of how classification neural networks work. The output probabilities reflect the model's relative preference among the classes it knows, not an absolute measure of whether the input belongs to the problem domain at all. A model trained on five coffee defect classes will always pick one of those five, even if the true answer is something entirely different.
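This behavior is easy to demonstrate. The snippet below applies softmax to a set of made-up logits that a five-class defect model might produce for an out-of-domain input; the top probability is high even though the input belongs to none of the classes:

```python
import math

def softmax(scores):
    """Convert raw scores into probabilities that sum to 1."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw scores a five-class defect classifier might produce
# for an input that is not a coffee bean at all (e.g., a small stone):
logits = [4.0, 1.0, 0.5, 0.2, -1.0]
probs = softmax(logits)
print(max(probs))  # about 0.90: high "confidence" despite an out-of-domain input
```

Nothing in the softmax output signals that the input was foreign; the 0.90 here reflects only the model's relative preference among its known classes.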


The Penultimate Layer and Latent Space

To understand why a prediction might be unreliable, we need to look beyond the final output and examine what happens inside the network — specifically at the penultimate layer (the layer immediately before the output layer).

Throughout the forward pass, each layer transforms the input into increasingly abstract representations. By the time data reaches the penultimate layer, the network has compressed the original image (thousands of pixel values) into a compact numerical vector — typically a few hundred or a few thousand numbers. This vector is called a feature embedding or latent representation, and the mathematical space in which these vectors live is called the latent space (also known as the embedding space or feature space).

The latent space is where the real "understanding" of the model resides. During training, the network learns to organize this space so that:

  • Images of the same class are mapped to nearby points (their embeddings are similar).

  • Images of different classes are mapped to distant points (their embeddings are dissimilar).

Think of it as the model creating an internal map where similar-looking seeds cluster together and different defect types occupy separate regions. When a new image arrives, the model places it on this map and checks which cluster it falls closest to — that determines the predicted class.

Why the Penultimate Layer Matters More Than the Output

The final softmax layer collapses all of this rich spatial information into a simple probability distribution. Two very different inputs can produce similar softmax outputs, making it impossible to distinguish between a genuinely confident prediction and a falsely confident one by looking at probabilities alone.

The penultimate layer, however, preserves the full geometric relationships. By examining where a new sample's embedding falls in the latent space, we can ask much more informative questions:

  • Is this embedding close to known clusters of training data? If so, the prediction is likely reliable.

  • Is this embedding far from all known clusters, in an empty region of the latent space? If so, the input is likely out-of-distribution, and the prediction should not be trusted — regardless of how high the softmax confidence is.

Out-of-Distribution (OOD) Detection

Out-of-distribution (OOD) detection refers to techniques designed to identify inputs that differ significantly from the training data, so that unreliable predictions can be flagged rather than blindly accepted. This is a critical safety mechanism for any deployed AI system.

Common approaches to OOD detection include:

  • Distance-based methods: Measure the distance between a new sample's embedding and the embeddings of the training data in the latent space. If the new sample is far from all known clusters (using metrics such as Euclidean distance or Mahalanobis distance), it is flagged as out-of-distribution. These methods are effective because they directly leverage the geometric structure of the latent space.

  • Entropy-based methods: Monitor the entropy (uncertainty) of the softmax output distribution. While not foolproof (as discussed above, softmax can be overconfident), unusually high entropy can still indicate that the model is struggling to decide between classes, suggesting an ambiguous or unfamiliar input.

  • Reconstruction-based methods: Use an auxiliary model (such as an autoencoder) to reconstruct the input. If the reconstruction error is high, the input is likely unlike the training data.

  • Ensemble methods: Run multiple models or multiple forward passes (e.g., with Monte Carlo dropout) and measure the disagreement between predictions. High disagreement signals uncertainty about the input.
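A minimal sketch of the distance-based approach, using a toy two-dimensional latent space with invented class centroids and an invented distance threshold (real embeddings have hundreds of dimensions, and the threshold is calibrated on held-out data):

```python
import math

def euclidean(a, b):
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def is_out_of_distribution(embedding, class_centroids, threshold):
    """Flag an input whose embedding is far from every known class centroid."""
    nearest = min(euclidean(embedding, c) for c in class_centroids.values())
    return nearest > threshold

# Toy 2-D "latent space" with made-up centroids and threshold:
centroids = {"ok": (1.0, 1.0), "black": (5.0, 1.0), "sour": (3.0, 4.0)}

print(is_out_of_distribution((1.2, 0.9), centroids, threshold=2.0))    # False: near "ok"
print(is_out_of_distribution((12.0, 12.0), centroids, threshold=2.0))  # True: far from all clusters
```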


In practice, combining softmax probabilities with latent-space analysis provides a much more robust measure of prediction reliability than either approach alone. This is why modern AI systems increasingly rely on feature embeddings from the penultimate layer — not just the final class probabilities — to assess whether a prediction should be trusted.

Convolutional Neural Networks (CNNs)

A Convolutional Neural Network (CNN) is a specialized type of neural network designed primarily for analyzing visual data such as images. While a standard neural network treats the input as a flat list of numbers (losing spatial relationships), a CNN preserves the 2D structure of images and exploits the fact that nearby pixels are related to each other. This makes CNNs exceptionally good at recognizing patterns and features within images, and highly effective for tasks such as image classification.

Why CNNs for Images?

A standard (fully connected) neural network connecting every pixel to every neuron would require an enormous number of parameters. For example, a modest 224×224 color image has over 150,000 input values — connecting each to just 1,000 hidden neurons would already require 150 million weights in the first layer alone. This is computationally impractical and prone to overfitting.

CNNs solve this problem through three key ideas:

  1. Local connectivity: Instead of looking at the entire image at once, each neuron only looks at a small local region (its receptive field).

  2. Weight sharing: The same set of weights (a filter) is applied across the entire image, drastically reducing the number of parameters.

  3. Translation invariance: Because the same filter scans the whole image, the network can detect a feature (such as an edge or texture) regardless of where it appears in the image.
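The parameter savings from weight sharing can be checked with simple arithmetic. The image size below matches the example in the previous section; the 32-filter, 3×3 configuration is illustrative:

```python
# Fully connected: every input pixel connects to every hidden neuron.
inputs = 224 * 224 * 3          # 150,528 values for a 224x224 RGB image
dense_weights = inputs * 1000   # one weight per (pixel, neuron) pair
print(dense_weights)            # 150,528,000: over 150 million

# Convolutional: 32 filters of size 3x3 over 3 input channels,
# with the same weights shared across every image position.
conv_weights = 32 * 3 * 3 * 3
print(conv_weights)             # 864: the same filters reused everywhere
```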

Building Blocks of a CNN

A CNN is composed of several types of layers stacked together. Each type of layer has a specific role in processing the image data.

Convolutional Layers

The core building block of a CNN. A convolutional layer applies a set of small learnable filters (also called kernels) to the input image. Each filter is a small matrix (commonly 3×3 or 5×5 pixels) that slides across the image — a process called convolution. At each position, the filter computes a dot product between its weights and the overlapping image pixels, producing a single number. As the filter slides across the entire image, it produces a 2D output called a feature map (or activation map).

Each filter is designed to detect a specific feature. In early layers, filters typically learn to detect simple features such as edges, corners, and color gradients. In deeper layers, filters combine these simple features to detect increasingly complex patterns — textures, shapes, and eventually entire objects or defect types.

A convolutional layer usually applies multiple filters in parallel, producing multiple feature maps. For example, a layer with 32 filters produces 32 feature maps, each highlighting a different aspect of the input.
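The convolution operation itself is a short loop. Below is a toy plain-Python version (single channel, no padding, stride 1) applying a hand-written vertical-edge kernel; in a trained CNN the filter values are learned, not hand-designed:

```python
def convolve2d(image, kernel):
    """Slide a small kernel over a 2-D image (no padding, stride 1)."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    feature_map = []
    for i in range(out_h):
        row = []
        for j in range(out_w):
            # Dot product between the kernel and the overlapping image region
            total = sum(image[i + di][j + dj] * kernel[di][dj]
                        for di in range(kh) for dj in range(kw))
            row.append(total)
        feature_map.append(row)
    return feature_map

# Toy image: bright left half, dark right half
image = [[9, 9, 0, 0],
         [9, 9, 0, 0],
         [9, 9, 0, 0]]
edge_kernel = [[1, -1]]  # responds where brightness drops left-to-right
print(convolve2d(image, edge_kernel))  # strong response along the vertical edge
```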

Pooling Layers

After convolution, pooling layers reduce the spatial dimensions (width and height) of the feature maps while retaining the most important information. This serves two purposes: it reduces computational cost and helps the network become more robust to small shifts or distortions in the input.

The most common type is max pooling, which divides the feature map into non-overlapping regions (e.g., 2×2 blocks) and keeps only the maximum value in each region. This effectively halves the width and height of the feature map. Average pooling takes the mean value instead, and global average pooling reduces each entire feature map to a single number by averaging all its values.
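Max pooling is equally simple to sketch. A toy implementation on a made-up 4×4 feature map, halving each spatial dimension:

```python
def max_pool(feature_map, size=2):
    """Max pooling: keep the largest value in each non-overlapping block."""
    pooled = []
    for i in range(0, len(feature_map) - size + 1, size):
        row = []
        for j in range(0, len(feature_map[0]) - size + 1, size):
            block = [feature_map[i + di][j + dj]
                     for di in range(size) for dj in range(size)]
            row.append(max(block))
        pooled.append(row)
    return pooled

fmap = [[1, 3, 2, 0],
        [4, 2, 1, 1],
        [0, 1, 5, 6],
        [2, 2, 7, 3]]
print(max_pool(fmap))  # [[4, 2], [2, 7]]: each 2x2 block reduced to its maximum
```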

Fully Connected (Dense) Layers

After the convolutional and pooling layers have extracted and compressed the spatial features, the resulting feature maps are flattened into a one-dimensional vector and passed through one or more fully connected layers. These layers work exactly like a standard neural network: every neuron is connected to every input from the previous layer. Their role is to combine the extracted features and learn the final mapping from features to class predictions.

Output Layer

The final fully connected layer has one neuron per class. A softmax activation function converts the raw scores into probabilities that sum to 1. The class with the highest probability is the network's prediction.

How a CNN Processes an Image — Step by Step

  1. Input: A coffee seed image (e.g., 224×224 pixels with 3 color channels — red, green, blue) enters the network.

  2. Early convolutional layers: Filters detect low-level features like edges, color boundaries, and simple textures.

  3. Deeper convolutional layers: Filters combine low-level features to detect higher-level patterns such as surface cracks, discoloration patches, or insect damage marks.

  4. Pooling: Between convolutional layers, pooling reduces spatial size while keeping the strongest activations.

  5. Flattening: The final set of feature maps is reshaped into a single long vector.

  6. Fully connected layers: The vector passes through dense layers that weigh all the extracted features together.

  7. Output: The softmax layer produces a probability for each class (e.g., 85% black defect, 10% sour, 5% OK), and the highest probability determines the prediction.

Feature Hierarchy

One of the most powerful aspects of CNNs is their ability to learn a hierarchy of features automatically, without human intervention:

Layer Depth   | Features Detected              | Example in Coffee Analysis
Early layers  | Edges, gradients, colors       | Boundaries of the seed, color variations
Middle layers | Textures, shapes, patterns     | Surface roughness, crack patterns, spots
Deep layers   | Object parts, complex patterns | Characteristic defect signatures (black, sour, insect bore holes)

This hierarchical learning is what makes CNNs so effective: they automatically discover which visual features are relevant for the classification task during training, rather than requiring a human engineer to hand-design feature extractors.

Common CNN and Vision Architectures

Over the years, researchers have developed several influential architectures for image classification, each contributing key innovations. The field has evolved from pure convolutional networks to hybrid and transformer-based approaches. Below are the major architecture families, including the ones available for training in Csmart Studio.

Foundational CNN Architectures

  • LeNet-5 (LeCun et al., 1998): One of the earliest CNNs, designed for handwritten digit recognition. It introduced the basic pattern of alternating convolutional and pooling layers followed by fully connected layers.

  • AlexNet (Krizhevsky et al., 2012): Demonstrated that deep CNNs trained on GPUs could dramatically outperform traditional methods on large-scale image classification (ImageNet). It popularized ReLU activations and dropout regularization.

  • VGGNet (Simonyan and Zisserman, 2014): Showed that using very small (3×3) filters consistently throughout a deep network achieves strong performance. Its simplicity and uniform structure make it a popular choice for transfer learning.

ResNet Family

  • ResNet (He et al., 2016): Introduced residual connections (skip connections) that allow gradients to flow directly through the network, enabling training of extremely deep architectures without degradation. A skip connection adds the input of a block directly to its output, so the network only needs to learn the residual (the difference), which makes optimization much easier. Csmart Studio provides ResNet-18, ResNet-50, and ResNet-101 as training options; the number indicates the total number of layers, with deeper variants offering more capacity at the cost of more computation.

  • ResNeXt (Xie et al., 2017): Extends ResNet by replacing each residual block with a set of parallel pathways (called "cardinality"). Instead of one wide block, ResNeXt uses many narrower blocks computed in parallel and aggregated, improving accuracy without significantly increasing computational cost. Csmart Studio provides ResNeXt-50 and ResNeXt-101 variants.

  • Wide ResNet (Zagoruyko and Komodakis, 2016): An alternative approach to increasing capacity — instead of adding more layers (depth), Wide ResNets increase the number of channels per layer (width). This often achieves comparable or better accuracy than very deep networks while being faster to train, since wider layers parallelize more efficiently on modern GPUs.

ConvNeXt Family

  • ConvNeXt (Liu et al., 2022): A modernized pure-CNN architecture that incorporates design principles from Vision Transformers (such as larger kernel sizes, layer normalization, and fewer activation functions) back into a convolutional framework. ConvNeXt demonstrates that CNNs can match or exceed transformer performance when properly modernized. Csmart Studio provides ConvNeXt Base and ConvNeXt Large.

  • ConvNeXtV2 (Woo et al., 2023): Builds on ConvNeXt with improved self-supervised pre-training using a Fully Convolutional Masked Autoencoder (FCMAE), leading to better feature representations. Csmart Studio provides ConvNeXtV2 Large.

EfficientNet

  • EfficientNet (Tan and Le, 2019): Introduced a systematic method for scaling CNNs by balancing network width, depth, and input resolution together using a compound scaling coefficient. This produces models that achieve strong accuracy with significantly fewer parameters and less computation than prior architectures. Csmart Studio provides EfficientNet-B0 as a lightweight option.

Vision Transformers (ViT)

Vision Transformers adapt the transformer architecture — originally developed for natural language processing — to image classification. Instead of convolutions, a ViT divides the input image into fixed-size patches (e.g., 16×16 or 14×14 pixels), flattens each patch into a vector, and processes the sequence of patches using self-attention mechanisms. Self-attention allows each patch to attend to every other patch in the image, capturing long-range dependencies that convolutional layers (limited by their local receptive field) may miss.

  • ViT (Dosovitskiy et al., 2021): The original Vision Transformer. When pre-trained on large datasets, ViT achieves excellent performance on image classification tasks. Csmart Studio provides ViT Base (patch size 16, input 224×224).

  • DINOv2 (Oquab et al., 2024): A family of ViT models trained with a self-supervised method called self-distillation (the model learns from itself without labeled data). DINOv2 produces highly generalizable feature representations that transfer well to diverse downstream tasks, even with minimal fine-tuning. Csmart Studio provides DINOv2 Base, DINOv2 Large, and DINOv2 Giant.

  • EVA-02 (Fang et al., 2024): A ViT family that combines masked image modeling pre-training with distillation from large-scale vision-language models (CLIP). This dual pre-training strategy produces strong visual representations. Csmart Studio provides EVA-02 Base, EVA-02 Large, and EVA-02 Enormous.

  • BioCLIP 2 (Stevens et al., 2024): A ViT-Large model pre-trained on biological imagery using contrastive learning (CLIP-style). Its training data includes diverse biological specimens, making its learned features particularly relevant for natural-product classification tasks such as coffee seed analysis. Csmart Studio provides BioCLIP 2.

  • SigLIP (Zhai et al., 2023): A vision-language model that replaces the standard softmax-based contrastive loss of CLIP with a sigmoid-based loss. This simplifies training and improves scalability while maintaining strong visual representations.

Hybrid Architectures

  • MaxViT (Tu et al., 2022): Combines convolutional layers with multi-axis self-attention in each block. The convolutional component captures local features efficiently, while the attention component captures global relationships. This hybrid design achieves strong performance across a range of model sizes. Csmart Studio provides MaxViT Base and MaxViT Large.


Csmart Studio offers a wide range of architectures — from lightweight CNNs like ResNet-18 and EfficientNet-B0 to large-scale vision transformers like DINOv2 Giant and EVA-02 Enormous. The choice depends on the tradeoff between accuracy and computational resources: lightweight models are faster to train and run inference, while larger models generally achieve higher accuracy on complex classification tasks.

Transfer Learning

Training a CNN from scratch requires large amounts of labeled data and significant computational resources. Transfer learning is a technique that addresses this by starting with a model that has already been trained on a large general-purpose dataset (such as ImageNet, which contains millions of labeled images across thousands of categories).

The key insight behind transfer learning is that the early and middle layers of a CNN learn general visual features (edges, textures, shapes) that are useful across many different tasks. Only the deeper layers need to be adapted to the specific domain.

A typical transfer learning workflow:

  1. Take a pre-trained CNN (e.g., a ResNet trained on ImageNet).

  2. Remove or replace the final classification layer(s) to match the new task's number of classes.

  3. Fine-tune the network on the domain-specific dataset (e.g., labeled coffee seed images), either updating all layers or freezing the early layers and only training the later ones.

Transfer learning enables high-accuracy models even when the domain-specific dataset is relatively small, because the network already "knows" how to extract basic visual features from its prior training.

Regularization Techniques

Deep neural networks with millions of parameters are prone to overfitting — memorizing the training data rather than learning generalizable patterns. Several techniques help prevent this:

  • Dropout: During training, randomly sets a fraction of neuron outputs to zero at each step. This forces the network to learn redundant representations and prevents any single neuron from becoming overly specialized. At inference time, all neurons are active.

  • Data augmentation: Artificially expands the training dataset by applying random transformations to the input images — rotations, flips, crops, brightness changes, and color shifts. This helps the model generalize to variations it may encounter in real-world data.

  • Batch normalization: Normalizes the inputs to each layer so that they have a consistent mean and variance. This stabilizes and accelerates training, and also provides a mild regularization effect.

  • Weight decay (L2 regularization): Adds a penalty proportional to the squared magnitude of the weights to the loss function, discouraging the network from relying on any single large weight.
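As a concrete illustration of the first technique, here is a toy version of inverted dropout (the variant used by most modern frameworks) in plain Python; the activation values are made up:

```python
import random

def dropout(activations, rate=0.5, training=True):
    """Inverted dropout: zero a random fraction of activations during training.

    Surviving values are scaled by 1/(1 - rate) so the expected total is
    unchanged; at inference time the layer passes values through untouched.
    """
    if not training:
        return list(activations)
    keep = 1.0 - rate
    return [a / keep if random.random() < keep else 0.0 for a in activations]

acts = [0.5, 1.2, 0.8, 2.0, 0.1, 0.9]
print(dropout(acts, rate=0.5))        # roughly half the values zeroed, rest scaled up
print(dropout(acts, training=False))  # inference: unchanged
```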

Inference

Once a CNN has been trained, it is deployed for inference — the process of making predictions on new, unseen data. During inference, the network performs only the forward pass (no backpropagation or weight updates). The input image passes through all layers, and the output layer produces the predicted class probabilities.

In Csmart-Digit, inference is performed using ONNX Runtime, an optimized inference engine that runs trained models efficiently. The trained CNN model is exported to the ONNX (Open Neural Network Exchange) format, which is a standardized representation that allows models trained in various frameworks (PyTorch, TensorFlow) to be deployed consistently across different platforms.

References

  • Dosovitskiy, A., Beyer, L., Kolesnikov, A., et al. (2021). An image is worth 16x16 words: Transformers for image recognition at scale. Proceedings of the 9th International Conference on Learning Representations (ICLR).

  • Fang, Y., Sun, Q., Wang, X., et al. (2024). EVA-02: A visual representation for neon genesis. Image and Vision Computing, 149, 105171.

  • He, K., Zhang, X., Ren, S., and Sun, J. (2016). Deep residual learning for image recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 770–778.

  • Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems (NeurIPS), 25, 1097–1105.

  • LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. (1998). Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11), 2278–2324.

  • Liu, Z., Mao, H., Wu, C.-Y., et al. (2022). A ConvNet for the 2020s. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 11976–11986.

  • Oquab, M., Darcet, T., Moutakanni, T., et al. (2024). DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research.

  • Simonyan, K., and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556.

  • Stevens, S., Wu, J., Thompson, M. J., et al. (2024). BioCLIP: A vision foundation model for the tree of life. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 19412–19424.

  • Tan, M., and Le, Q. V. (2019). EfficientNet: Rethinking model scaling for convolutional neural networks. Proceedings of the 36th International Conference on Machine Learning (ICML), 6105–6114.

  • Tu, Z., Talebi, H., Zhang, H., et al. (2022). MaxViT: Multi-axis vision transformer. Proceedings of the European Conference on Computer Vision (ECCV), 459–479.

  • Woo, S., Debnath, S., Hu, R., et al. (2023). ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 16133–16142.

  • Xie, S., Girshick, R., Dollar, P., Tu, Z., and He, K. (2017). Aggregated residual transformations for deep neural networks. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 1492–1500.

  • Zagoruyko, S., and Komodakis, N. (2016). Wide residual networks. Proceedings of the British Machine Vision Conference (BMVC).

  • Zhai, X., Mustafa, B., Kolesnikov, A., and Beyer, L. (2023). Sigmoid loss for language image pre-training. Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 11975–11986.
