Model Settings

Configuring data augmentation, preprocessing, callbacks, and hyperparameters

The Model Settings screen provides fine-grained control over the training pipeline's configuration. These settings are organized into four sections: Data Augmentation, Preprocessing, Callbacks, and Hyperparameters. All values are stored in the project's config.yaml and applied automatically when training begins.

For most use cases, the default values work well. Only adjust these settings if you understand their impact on training. The Training screen provides preset-based defaults that configure many of these values automatically.

Data Augmentation

Data augmentation artificially increases the diversity of your training data by applying random transformations to images during training. Instead of showing the model the same image repeatedly, each epoch presents slightly different versions — rotated, color-shifted, blurred, or otherwise transformed. This forces the model to learn the underlying features of each class rather than memorizing specific pixel patterns, which directly reduces overfitting and improves generalization to new, unseen data.

Studio Desktop applies augmentation using a custom pipeline built on Kornia, a differentiable computer vision library for PyTorch. Transforms are applied on the GPU during training, adding minimal overhead.

Augmentation Preset

Select from four levels of augmentation intensity (from lightest to heaviest):

| Preset | Intensity | When to Use |
| --- | --- | --- |
| Mild | Lightest | Small or clean datasets where you want to preserve original image characteristics. Applies only gentle geometric and color transforms. |
| Strong | Moderate | General-purpose default for most datasets. Good balance between variety and realism. Introduces blur, noise, sharpening, and compression artifacts at controlled probabilities. |
| Robust | Heavy | Large datasets, or when the Strong preset is not producing enough generalization. Widens all transform ranges and increases the probability of each augmentation being applied. |
| Super Strong | Maximum | Very large datasets, heavily overfitting models, or when you need maximum robustness to real-world imaging variation. Applies the most aggressive transforms across all categories. |

Each preset controls over 20 individual transform parameters including rotation range, brightness and contrast ranges, color space adjustments (white balance, saturation, LAB chroma, hue), blur kernels (Gaussian, motion, box), noise injection (Gaussian, salt-and-pepper), sharpening, perspective distortion, JPEG compression artifacts, coarse dropout (random rectangular holes), and flip probabilities.

If none of the presets fit your needs, you can adjust individual augmentation parameters directly in the project's config.yaml file. The application will recognize the configuration as a "custom" preset.

Mixup and CutMix

Toggle Mixup/CutMix to enable advanced augmentation techniques introduced by Zhang et al. (2018) and Yun et al. (2019) that operate at the batch level rather than on individual images.

Mixup creates virtual training examples by blending two images and their labels through linear interpolation. For example, a 70/30 mix of an "OK" bean image and a "Defect" bean image produces a new training sample with soft labels (0.7 OK, 0.3 Defect). This encourages the model to learn smoother decision boundaries and has been shown to improve robustness to adversarial examples and reduce overconfident predictions.

CutMix takes a different approach: it cuts a random rectangular patch from one training image and pastes it onto another, mixing labels proportionally to the patch area. Unlike Mixup, CutMix preserves local image structure, which helps the model learn to focus on less discriminative parts of objects — improving both classification accuracy and localization ability.

| Parameter | Range | Description |
| --- | --- | --- |
| Mixup Alpha | 0–1 | Controls the Beta distribution that determines blend ratios. Higher values produce more uniform blending (images are more equally mixed). Lower values keep blends closer to one of the two source images. |
| CutMix Alpha | 0–1 | Controls the Beta distribution that determines patch sizes. Higher values produce more varied patch sizes. |
| Probability | 0–1 | Chance that either Mixup or CutMix is applied to a given training batch. Set to 0 to effectively disable both. |
| Switch Probability | 0–1 | Controls which technique is used for a given batch. A value of 0 means only Mixup is applied; 1 means only CutMix; 0.5 gives equal probability to both. |
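The batch-level arithmetic behind both techniques is small enough to sketch in plain Python (function names are illustrative, not the application's API; real implementations operate on whole image tensors, but the blend is the same):

```python
import random

def sample_lambda(alpha):
    """Draw the blend ratio from Beta(alpha, alpha), as in Zhang et al. (2018)."""
    return random.betavariate(alpha, alpha)

def mixup(x1, y1, x2, y2, lam):
    """Blend two flattened images and their one-hot labels by ratio lam."""
    x = [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]
    y = [lam * a + (1 - lam) * b for a, b in zip(y1, y2)]
    return x, y

# A 70/30 blend of an "OK" image and a "Defect" image yields soft labels.
x, y = mixup([1.0, 0.0], [1, 0], [0.0, 1.0], [0, 1], lam=0.7)
```

With `lam=0.7` the soft label comes out as approximately (0.7, 0.3), matching the 70/30 "OK"/"Defect" example described above. The switch probability then simply decides, per batch, whether this blend or the CutMix patch-paste variant is used.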

Mixup and CutMix are most effective on larger datasets (1,000+ images per class). On very small datasets, they can introduce confusing training signals. Start with the default values and only adjust if you observe overfitting that standard augmentation cannot address.

Preprocessing

Preprocessing settings control how images are adjusted before being fed to the model. These transforms are deterministic (not random) and are applied identically during both training and inference, ensuring that the model sees images in the same format during production as it did during training.

Saturation Adjustment

Toggle Adjust Saturation to enable color saturation modification.

Saturation Factor — A multiplier applied to the saturation channel. A factor of 1.0 leaves the image unchanged. Values above 1.0 intensify colors (useful when training images appear washed out compared to production images), while values below 1.0 desaturate toward grayscale.
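For a single pixel, the adjustment amounts to scaling the S channel in HSV space (a sketch using Python's standard `colorsys` module; the actual pipeline applies the equivalent operation to whole image tensors):

```python
import colorsys

def adjust_saturation(rgb, factor):
    """Scale the HSV saturation of one RGB pixel (channels in 0..1)."""
    h, s, v = colorsys.rgb_to_hsv(*rgb)
    s = min(max(s * factor, 0.0), 1.0)  # clamp to the valid range
    return colorsys.hsv_to_rgb(h, s, v)

washed_out = (0.8, 0.6, 0.6)
boosted = adjust_saturation(washed_out, 1.5)  # colors intensified
gray = adjust_saturation(washed_out, 0.0)     # fully desaturated
```

A factor of 1.0 returns the pixel unchanged, matching the description above.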

In coffee classification, saturation adjustment can help normalize color variation caused by different lighting conditions or camera white balance settings between the training environment and the production Csmart Digit device.

CLAHE Equalization

Toggle CLAHE Equalization to enable Contrast Limited Adaptive Histogram Equalization, a well-established image processing technique that enhances local contrast while preventing over-amplification of noise.

Unlike standard histogram equalization (which adjusts contrast globally across the entire image), CLAHE divides the image into a grid of tiles and performs histogram equalization independently on each tile. This preserves local detail — for example, enhancing the visibility of subtle defect textures on a coffee bean surface without washing out highlights or crushing shadows elsewhere in the image.
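The "contrast limited" step can be illustrated on a single tile's histogram: counts above the clip limit are trimmed, and the trimmed mass is redistributed evenly across all bins before the tile is equalized. This is a simplified one-pass sketch of the idea, not the application's implementation (real implementations may re-clip iteratively and interpolate between tiles):

```python
def clip_histogram(hist, clip_limit):
    """One redistribution pass of CLAHE's contrast limiting.

    Counts above clip_limit are cut off and the excess is spread
    evenly over all bins, preserving the total pixel count.
    """
    excess = sum(max(c - clip_limit, 0) for c in hist)
    clipped = [min(c, clip_limit) for c in hist]
    bonus = excess // len(hist)      # integer share per bin
    leftover = excess % len(hist)    # first bins absorb the remainder
    return [c + bonus + (1 if i < leftover else 0)
            for i, c in enumerate(clipped)]

# A tile histogram with one dominant bin: the spike is capped and its
# mass is shared, limiting how steep the equalization mapping can get.
hist = [100, 2, 2, 2, 2, 2, 2, 2]
limited = clip_histogram(hist, clip_limit=20)
```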

Research has consistently shown that CLAHE preprocessing improves deep learning model performance, particularly for classification tasks where fine-grained texture discrimination is important. A study on bone scan classification found that CLAHE preprocessing significantly improved model AUC compared to no preprocessing (p < 0.001), and multiple medical imaging studies have confirmed its effectiveness for improving segmentation and classification accuracy across different imaging modalities.

| Parameter | Description | Guidance |
| --- | --- | --- |
| Clip Limit | Controls the maximum contrast amplification in each tile. Higher values allow more contrast enhancement but may amplify noise in homogeneous regions. | Start with the default value. Increase if defect textures are hard to see; decrease if the output images appear noisy or have halo artifacts at tile boundaries. |
| Tile Grid Size (W × H) | The number of tiles the image is divided into for local equalization. Smaller tiles (a finer grid) produce more localized enhancement; larger tiles behave more like global equalization. | An 8×8 grid is a common default. Use a finer grid (e.g., 16×16) only if you need very localized contrast enhancement; a coarser grid (e.g., 4×4) behaves closer to global equalization. |

CLAHE is particularly effective when training images come from cameras or environments with uneven illumination — a common scenario in industrial coffee analysis where belt lighting may create hotspots or shadows across the field of view.

Callbacks

Callbacks are functions that execute at specific points during the training loop to modify training behavior.

Stochastic Weight Averaging (SWA)

Toggle SWA to enable Stochastic Weight Averaging, a training technique introduced by Izmailov et al. (2018) that has been shown to improve generalization in deep learning at essentially no additional computational cost.

How it works: Standard training (SGD or Adam) converges to a single point in the loss landscape — often at the boundary of a low-loss region. SWA modifies the learning rate schedule in the final phase of training to keep the optimizer exploring rather than converging, then averages the model weights visited during this exploration phase. The resulting averaged model tends to land in the center of a wide, flat minimum in the loss landscape, rather than at a sharp boundary.

Why flat minima matter: A model sitting at a sharp minimum performs well on the exact training data but is sensitive to small perturbations — meaning it may perform worse on slightly different test data. A model at a flat minimum is more robust: small shifts in the input distribution (different camera, different lighting, different bean moisture content) cause smaller changes in output, leading to better real-world performance.

SWA has been demonstrated to improve performance across computer vision, semi-supervised learning, and low-precision training. As of PyTorch 1.6, SWA is a built-in feature of the framework, and Studio Desktop uses the PyTorch Lightning SWA callback implementation.
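The averaging itself is just a running mean over the weight snapshots collected after the SWA start epoch, sketched here on plain Python lists (the real implementation is the PyTorch Lightning callback, which also takes care of updating batch-norm statistics at the end of training):

```python
def swa_update(swa_weights, new_weights, n_averaged):
    """Fold one more snapshot into the running average of weights."""
    return [avg + (w - avg) / (n_averaged + 1)
            for avg, w in zip(swa_weights, new_weights)]

# Average three per-epoch snapshots of a two-parameter "model".
snapshots = [[1.0, 4.0], [2.0, 5.0], [3.0, 6.0]]
swa, n = snapshots[0], 1
for w in snapshots[1:]:
    swa = swa_update(swa, w, n)
    n += 1
# swa is now the element-wise mean of all snapshots: [2.0, 5.0]
```

In PyTorch Lightning this corresponds to the `StochasticWeightAveraging` callback, whose `swa_lrs`, `swa_epoch_start`, `annealing_epochs`, and `annealing_strategy` arguments map onto the parameters below.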

| Parameter | Description | Guidance |
| --- | --- | --- |
| SWA Learning Rate | The constant learning rate used during the SWA phase. Must be high enough to enable exploration but low enough to stay in the low-loss region. | Typically 10–100× lower than the initial training learning rate. |
| Epoch Start | The epoch at which SWA begins. Before this epoch, training proceeds normally. | Set to approximately 75% of total epochs. Starting too early means the model has not yet converged to a good region; starting too late leaves insufficient time for meaningful averaging. |
| Annealing Epochs | The number of epochs over which the learning rate transitions from its current value to the SWA learning rate. | 5–10 epochs is typical. Shorter annealing is more abrupt; longer annealing provides a smoother transition. |
| Annealing Strategy | How the learning rate transitions: cosine (smooth curve) or linear (constant rate of decrease). | Cosine is generally preferred as it provides a gentler transition. |

SWA is most effective when applied in the last 20–30% of total training epochs. If you plan 100 epochs of training, set the epoch start to around 70–80.

Hyperparameters

Core training hyperparameters that control how the model learns. These settings have a significant impact on model convergence, accuracy, and generalization.

Hyperparameter Preset

| Preset | Description |
| --- | --- |
| Stage-wise LLRD | Layer-wise Learning Rate Decay — applies higher learning rates to the top (later) layers of the network and progressively lower rates to the bottom (earlier) layers. This is based on the principle that early layers learn general features (edges, textures) that transfer well across tasks, while later layers learn task-specific features that need more adaptation. LLRD prevents the destructive update of well-learned general features while allowing the classifier head to adapt aggressively. Research has shown this approach consistently outperforms uniform learning rates for fine-tuning pretrained vision models, particularly with transformer architectures like ViT and MaxViT. |
| Fixed | Uses a single, uniform learning rate for all layers. Simpler to configure but does not account for the different roles of early vs. late layers. May cause early layers to "forget" useful pretrained features during fine-tuning (catastrophic forgetting). Best suited for training from scratch or when using very small learning rates. |
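A common way to realize stage-wise LLRD is a geometric decay of the base learning rate from the head backward through the backbone stages. The sketch below is illustrative: the grouping into four stages and the 0.75 decay factor are assumptions, not the application's exact values:

```python
def stagewise_lrs(base_lr, num_stages, decay=0.75):
    """Per-stage learning rates: the head (last stage) gets base_lr;
    each earlier stage gets `decay` times the rate of the stage above it."""
    return [base_lr * decay ** (num_stages - 1 - i) for i in range(num_stages)]

# Four stages with base_lr=1e-4: the earliest layers update most gently,
# preserving pretrained low-level features while the head adapts freely.
lrs = stagewise_lrs(1e-4, 4)
```

Each stage's parameter group would then be passed to the optimizer with its own rate.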

Stage-wise LLRD is the recommended default for all fine-tuning tasks (Pretrained or Transfer starting weights). Use Fixed only when training from scratch or when you need a simpler baseline for comparison.

Learning Rate

The base learning rate controls the step size of each parameter update during training. It is the single most impactful hyperparameter — too high and the model diverges or oscillates; too low and training is extremely slow or gets stuck in poor local minima.

| Aspect | Value |
| --- | --- |
| Typical range | 1e-5 to 1e-3 |
| Step size | 1e-5 (precision to 5 decimal places) |
| Recommended starting point for fine-tuning | 1e-4 to 5e-4 |
| Recommended starting point for training from scratch | 1e-3 |

When using Stage-wise LLRD, this value is the learning rate assigned to the topmost layer. Lower layers receive progressively smaller rates according to the decay factor. When using Fixed, this value is applied uniformly to all layers.

If training loss oscillates or explodes, your learning rate is likely too high — reduce by 2–5×. If training loss decreases very slowly or plateaus at a high value, your learning rate may be too low — increase by 2–5×.

Loss Function

The loss function defines the mathematical objective the model optimizes during training.

| Function | How It Works | When to Use |
| --- | --- | --- |
| Cross Entropy Loss | The standard loss for multi-class classification. Measures the difference between the model's predicted probability distribution and the true class label. The model is rewarded for assigning high probability to the correct class and penalized for distributing probability to incorrect classes. | Default for most classification tasks. Use this when your primary goal is accurate class prediction — e.g., distinguishing OK beans from defect types. |
| Focal Loss | A variant of Cross Entropy introduced by Lin et al. (2017) for RetinaNet. Adds a modulating factor that down-weights the loss contribution from well-classified (easy) examples and focuses training on hard, misclassified examples. The Focal Gamma parameter controls the strength of this effect. | Use when you have a significant class imbalance or when the model quickly learns easy classes but struggles with hard ones. In practice, most balanced coffee datasets do not benefit noticeably from Focal Loss over standard Cross Entropy. |
| Triplet Loss | A metric learning loss that operates on triplets of images: an anchor, a positive (same class), and a negative (different class). The model learns to produce embeddings where same-class images are close together and different-class images are far apart. Unlike Cross Entropy, which optimizes for correct label assignment, Triplet Loss optimizes the structure of the embedding space. | Use when you need high-quality feature embeddings — for example, when you plan to use the Feature Extraction screen for similarity analysis, or when deploying the model in a system that relies on nearest-neighbor classification rather than direct softmax prediction. |
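Triplet Loss in particular reduces to a hinge condition on embedding distances: the positive must sit closer to the anchor than the negative by at least a margin. Below is a sketch with squared Euclidean distance (the margin value is illustrative; frameworks typically ship this ready-made, e.g. PyTorch's `TripletMarginLoss`):

```python
def triplet_loss(anchor, positive, negative, margin=1.0):
    """Hinge on embedding distances: zero loss once the negative is at
    least `margin` farther from the anchor than the positive."""
    d = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b))
    return max(0.0, d(anchor, positive) - d(anchor, negative) + margin)

# Same-class pair close, different-class pair far: the margin is satisfied.
ok = triplet_loss([0.0, 0.0], [0.1, 0.0], [3.0, 0.0])
# Negative nearly as close as the positive: a positive loss pushes them apart.
bad = triplet_loss([0.0, 0.0], [1.0, 0.0], [1.1, 0.0])
```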

Cross Entropy Loss is faster to converge and simpler to tune. Triplet Loss can produce better embeddings but requires careful batch construction (hard negative mining) and typically needs larger datasets. Focal Loss is a specialized tool for imbalanced scenarios. For most Csmart workflows, Cross Entropy Loss is the recommended choice.

Focal Gamma

When Focal Loss is selected, the Focal Gamma parameter becomes available. This controls how aggressively the loss function down-weights easy examples.

| Aspect | Value |
| --- | --- |
| Range | 0 to 5 |
| Value of 0 | Equivalent to standard Cross Entropy (no modulation) |
| Default | 2.0 |

Higher gamma values increase the focus on hard examples. At gamma = 2 (the original paper's recommendation), an example classified with 90% confidence contributes 100× less to the loss than it would under standard Cross Entropy, while an uncertain example classified at 50% confidence is down-weighted by only 4×. Values above 3 are rarely needed and can cause training instability.
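The modulating factor is a one-line change to Cross Entropy, shown here per example (a sketch; gamma = 0 recovers plain Cross Entropy exactly):

```python
import math

def focal_loss(p_correct, gamma=2.0):
    """Focal loss for one example: (1 - p)^gamma * cross_entropy(p),
    where p is the predicted probability of the true class."""
    return (1 - p_correct) ** gamma * -math.log(p_correct)

easy = focal_loss(0.9)  # well-classified: modulated by (0.1)^2 = 0.01
hard = focal_loss(0.5)  # uncertain: modulated by only (0.5)^2 = 0.25
```

At gamma = 2, the 90%-confidence example keeps just 1% of its Cross Entropy loss, which is the 100× down-weighting the original paper cites.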

Class Weighting

Class weighting adjusts the loss function to compensate for imbalanced class distributions. When enabled, classes with fewer training samples receive a higher weight, preventing the model from being biased toward majority classes.

| Option | How It Works | When to Use |
| --- | --- | --- |
| None | All classes contribute equally to the loss. | Default. Use when your dataset is reasonably balanced or when you have already balanced it through your dataset split. |
| Inverse Frequency | Each class weight is proportional to 1 / sample_count. Classes with fewer images receive higher weight. Weights are normalized so they sum to the number of classes. | Use when you have mild to moderate class imbalance (e.g., 2:1 ratio between largest and smallest class). Simple and interpretable. |
| Effective Number | Uses the effective number of samples formula (Cui et al., 2019) with beta = 0.999. This approach accounts for data overlap — as sample count grows, each additional sample provides diminishing information. Produces smoother weights than inverse frequency for large datasets. | Use when you have severe class imbalance (e.g., 10:1 ratio or more) and inverse frequency produces weights that are too extreme. |
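Both schemes reduce to a short computation over per-class sample counts, sketched here with the normalization described in the table (weights scaled to sum to the number of classes; function names are illustrative):

```python
def normalize(raw):
    """Scale weights so they sum to the number of classes."""
    k = len(raw)
    total = sum(raw)
    return [w * k / total for w in raw]

def inverse_frequency(counts):
    """Weight proportional to 1 / sample_count."""
    return normalize([1.0 / n for n in counts])

def effective_number(counts, beta=0.999):
    """Cui et al. (2019): weight proportional to (1 - beta) / (1 - beta**n)."""
    return normalize([(1 - beta) / (1 - beta ** n) for n in counts])

# 10:1 imbalance: effective-number weights are visibly less extreme.
inv = inverse_frequency([1000, 100])
eff = effective_number([1000, 100])
```

For the 10:1 example, inverse frequency weights the minority class 10× more than the majority class, while the effective-number scheme softens that ratio, which is exactly why it is recommended for severe imbalance.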

Class weighting is compatible with Cross Entropy Loss and Focal Loss but not with Triplet Loss (which operates on embedding distances rather than class probabilities). If your dataset is balanced, class weighting has no meaningful effect — the computed weights will be approximately equal for all classes.

Label Smoothing

A regularization technique introduced by Szegedy et al. (2016) in the Inception architecture that prevents the model from becoming overconfident in its predictions.

How it works: In standard training, the target label for a "Defect_Black" image is a one-hot vector: 100% probability on the correct class, 0% on everything else. With label smoothing, the target becomes slightly softer — for example, 95% on the correct class and the remaining 5% spread evenly across all other classes. This small change has a significant effect on training dynamics: the model no longer tries to drive its logits to extreme values, which improves calibration (the model's confidence scores become more meaningful) and reduces overfitting.
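The softened target described above (1 - ε on the true class, with the remainder shared by the other classes) is a one-liner. Note this follows the description in this document; some implementations instead spread ε over all K classes, including the correct one:

```python
def smooth_labels(correct_idx, num_classes, eps=0.1):
    """One-hot target softened: 1 - eps on the true class,
    eps shared evenly among the remaining classes."""
    off = eps / (num_classes - 1)
    return [1 - eps if i == correct_idx else off
            for i in range(num_classes)]

# 5-class target for class 0 with eps=0.05:
# the 95% / 5%-spread example from the text above.
target = smooth_labels(0, 5, eps=0.05)
```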

| Aspect | Value |
| --- | --- |
| Range | 0 to 0.5 (step 0.05) |
| Value of 0 | No smoothing — hard labels (standard behavior) |
| Recommended starting point | 0.1 |
| Typical effective range | 0.05–0.15 |

When to increase label smoothing: If the model is overfitting (high training accuracy, lower validation accuracy), or if the model produces overconfident predictions (very high softmax scores that do not reflect actual reliability).

When to decrease or disable: If validation accuracy drops when smoothing is enabled, or if you have a very clean dataset with no label noise. On small datasets, even mild smoothing can reduce the model's ability to learn fine-grained class distinctions.

Label smoothing is especially useful in coffee classification where visual boundaries between classes can be subjective — for example, distinguishing a mildly sour bean from a normal bean. Smoothing acknowledges this inherent uncertainty in the labels.

Saving Changes

After modifying any settings, click Save at the bottom of the screen. A toast notification confirms the save succeeded. If you navigate away with unsaved changes, a confirmation dialog asks whether to save or discard.

All settings are written to the project's config.yaml and take effect the next time you start training.
