Train AI Model
Training a classification model using PyTorch with real-time monitoring
The Training screen is where you train a deep learning classification model on your prepared dataset. Studio Desktop supports multiple model architectures and training strategies, with real-time progress monitoring and automatic metric tracking.
Training is powered by PyTorch Lightning under the hood. You configure the key parameters through the interface, and the application handles the rest — launching the Python training script, streaming logs, plotting metrics, and saving checkpoints.
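The orchestration described above can be sketched in a few lines: launch the training script as a subprocess and stream its output line by line to a log view. This is an illustrative sketch, not the application's actual code; the script arguments and callback are hypothetical.

```python
import subprocess
import sys

def stream_training_logs(script_args, on_line):
    """Launch a Python training script as a subprocess and forward
    each stdout line to a callback (e.g. a log monitor widget)."""
    proc = subprocess.Popen(
        [sys.executable, *script_args],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge stderr so errors appear in the same stream
        text=True,
    )
    for line in proc.stdout:
        on_line(line.rstrip("\n"))
    return proc.wait()  # exit code of the training script
```

A caller might invoke it as `stream_training_logs(["train.py", "--epochs", "50"], log_view.append)` (file and flag names hypothetical).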
Prerequisites
An open project with a split dataset (at minimum, train/ and val/ subfolders with class-organized images).
The Python runtime installed (from Hardware Settings).
An NVIDIA GPU is strongly recommended. CPU training is supported but significantly slower.
Hardware and model settings configured (see Hardware Settings and Model Settings).
Configuration
Split Dataset Folder
The path to your split dataset is loaded automatically from the project configuration. Click Select to change it. The application displays the total image count and class distribution badges below the field.
Starting Weights
Choose how the model is initialized before training begins:
Pretrained (ImageNet): starts from publicly available weights pre-trained on ImageNet. Best for a first training run on a new dataset, with no prior checkpoints.
Transfer from Checkpoint: uses the feature extraction layers from a previous checkpoint and resets the classification head. Best for adapting a model trained on one coffee type to another.
Resume from Checkpoint: loads the entire model state from a previous checkpoint and continues training. Best for extending a training session that was stopped early.
When Transfer or Resume is selected, an additional field appears to select the .ckpt checkpoint file. The application validates that the checkpoint's architecture matches the selected base model.
If the checkpoint's architecture does not match the selected base model, a red warning appears. You must either change the base model to match or select a compatible checkpoint.
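The validation step can be approximated as: read the architecture name stored with the checkpoint's hyperparameters and compare it against the selected base model. A minimal sketch; the `architecture` key is an assumed name, and the real hparams layout may differ.

```python
def check_checkpoint_compatibility(ckpt_hparams: dict, selected_model: str):
    """Return (ok, message) for a checkpoint/base-model pairing.

    `ckpt_hparams` stands in for the contents of the checkpoint's
    hparams.yaml; the 'architecture' key is an assumption."""
    arch = ckpt_hparams.get("architecture")
    if arch is None:
        return False, "Checkpoint does not record its architecture."
    if arch != selected_model:
        return False, f"Checkpoint was trained with {arch}, not {selected_model}."
    return True, "Architecture matches."
```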
Base Architecture
Select the neural network architecture from the dropdown. Available models are organized by computational cost:
Lightweight models (faster training, lower VRAM usage):
ResNet-18, ResNet-50
ConvNeXt Tiny, ConvNeXt Small
Standard models (balanced performance):
MaxViT Tiny
ConvNeXt Base
Heavy models (highest accuracy potential, requires more VRAM):
MaxViT Small, MaxViT Base
ViT Base, ViT Large
SegFormer B3, SegFormer B5
Fused Network (custom multi-backbone architecture)
An orange warning appears when selecting heavy models, advising that they require significant GPU memory. If your GPU has less than 8 GB VRAM, stick with lightweight or standard models.
Training Mode
Lightweight: freezes the backbone layers and trains only the classification head. Best for quick experiments, small datasets, or fine-tuning from a strong checkpoint.
Full Training: trains all model layers end-to-end. Best accuracy; recommended when you have a large dataset and sufficient compute.
Lightweight mode trains significantly faster and uses less GPU memory. It is a good starting point to verify your dataset before committing to a full training run.
The combination of Resume from Checkpoint and Lightweight mode is not supported. If you select Resume, the application automatically switches to Full Training.
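In PyTorch terms, lightweight mode amounts to disabling gradients on the backbone. A minimal sketch of that selection logic, assuming the head's parameters are named under a `classifier.` prefix (the actual prefix is model-dependent):

```python
def apply_training_mode(named_parameters, mode="lightweight", head_prefix="classifier."):
    """Set requires_grad per parameter: everything trains in 'full' mode,
    only the classification head trains in 'lightweight' mode.

    `named_parameters` is an iterable of (name, param) pairs, as yielded
    by torch.nn.Module.named_parameters()."""
    for name, param in named_parameters:
        param.requires_grad = (mode == "full") or name.startswith(head_prefix)
```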
Patience (Early Stopping)
The Patience field controls how many epochs training continues without improvement before stopping automatically. Enter an integer, or leave it as .inf to disable early stopping (the model then trains for the full configured number of epochs).
Click Save after changing the value.
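The behaviour matches standard early stopping on validation loss. A self-contained sketch, in which `patience=math.inf` reproduces the .inf setting:

```python
import math

class EarlyStopper:
    """Stop training after `patience` consecutive epochs without
    improvement in validation loss; math.inf disables early stopping."""

    def __init__(self, patience=math.inf, min_delta=0.0):
        self.patience = patience
        self.min_delta = min_delta
        self.best = math.inf
        self.bad_epochs = 0

    def should_stop(self, val_loss):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss   # improvement: reset the counter
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1   # no improvement this epoch
        return self.bad_epochs >= self.patience
```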
Checkpoint Directory
The path where training outputs are saved. Each training session creates a new version_XX subfolder containing:
The model checkpoint file (.ckpt)
Hyperparameters (hparams.yaml)
Training metrics (training_metrics.json)
Running Training
Verify all configuration fields are set correctly.
Click Run Training.
The application validates your settings, locks the sidebar, and launches the Python training script.
During Training
The log monitor streams real-time output including epoch progress, loss values, and learning rate.
The training chart plots training and validation loss curves as epochs complete.
The status indicator shows "Running" with a progress bar.
Training can take anywhere from minutes (lightweight mode, small dataset) to hours (full training, large dataset, heavy architecture).
Stopping Training
Click Stop Training at any time. The application sends a graceful termination signal to the training process, allowing PyTorch to save the current state. The last completed epoch's checkpoint is preserved.
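On the script side, a graceful stop amounts to catching the termination signal, saving state, and exiting after the current epoch. A minimal sketch with a hypothetical `save_checkpoint` callback; it is not the application's actual shutdown code.

```python
import signal

def install_graceful_stop(save_checkpoint):
    """Install a SIGTERM handler that saves the current model state and
    asks the training loop to stop after the current epoch.

    `save_checkpoint` is a hypothetical callback that persists the model."""
    state = {"stop_requested": False}

    def handler(signum, frame):
        state["stop_requested"] = True
        save_checkpoint()  # persist state as soon as the signal arrives

    signal.signal(signal.SIGTERM, handler)
    return state  # the training loop polls state["stop_requested"] each epoch
```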
After Training
When training completes or is stopped:
The status changes to Completed (green) or Cancelled (yellow).
Training metrics are saved to training_metrics.json in the version folder.
Training hours are added to your cumulative working hours counter.
The checkpoint is available for testing, feature extraction, and export.
Troubleshooting
Training fails immediately: the Python runtime is not installed or is outdated. Install or update the runtime from Hardware Settings.
CUDA out of memory: the model is too large for your GPU's VRAM. Switch to a lighter architecture, use Lightweight mode, or reduce the image resolution in Model Settings.
Loss is not decreasing: the learning rate is too high or too low, or the dataset has issues. Adjust hyperparameters in Model Settings and verify dataset quality.
Architecture mismatch warning: the checkpoint was trained with a different model. Select the matching architecture or choose a different checkpoint.
Training is very slow: no GPU was detected, or CPU mode is in use. Check GPU availability in Hardware Settings and verify your NVIDIA drivers.