Train AI Model

Training a classification model using PyTorch with real-time monitoring

The Training screen is where you train a deep learning classification model on your prepared dataset. Studio Desktop supports multiple model architectures and training strategies, with real-time progress monitoring and automatic metric tracking.

Training is powered by PyTorch Lightning under the hood. You configure the key parameters through the interface, and the application handles the rest — launching the Python training script, streaming logs, plotting metrics, and saving checkpoints.

Prerequisites

  • An open project with a split dataset (at minimum, train/ and val/ subfolders with class-organized images).

  • The Python runtime installed (from Hardware Settings).

  • An NVIDIA GPU is strongly recommended. CPU training is supported but significantly slower.

  • Hardware and model settings configured (see Hardware Settings and Model Settings).
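Before launching a run, the expected dataset layout can be sanity-checked with a short script. This is an illustrative sketch, not part of Studio Desktop; the set of image extensions it counts is an assumption.

```python
from pathlib import Path

def check_split_dataset(root: str) -> dict:
    """Verify that a split dataset has train/ and val/ subfolders with
    class-organized image subdirectories, and count images per class."""
    root_dir = Path(root)
    image_exts = {".jpg", ".jpeg", ".png"}  # assumption: typical formats
    counts = {}
    for split in ("train", "val"):
        split_dir = root_dir / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing required subfolder: {split_dir}")
        counts[split] = {
            cls.name: sum(1 for f in cls.iterdir() if f.suffix.lower() in image_exts)
            for cls in split_dir.iterdir() if cls.is_dir()
        }
    return counts
```

Running this before training surfaces missing folders or empty classes early, before the Python training script does.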

Configuration

Split Dataset Folder

The path to your split dataset is loaded automatically from the project configuration. Click Select to change it. The application displays the total image count and class distribution badges below the field.

Starting Weights

Choose how the model is initialized before training begins:

| Option | Description | When to Use |
| --- | --- | --- |
| Pretrained (ImageNet) | Starts from publicly available weights pre-trained on ImageNet | First training on a new dataset, no prior checkpoints |
| Transfer from Checkpoint | Uses the feature-extraction layers from a previous checkpoint and resets the classification head | Adapting a model trained on one coffee type to another |
| Resume from Checkpoint | Loads the entire model state from a previous checkpoint and continues training | Extending a training session that was stopped early |

When Transfer or Resume is selected, an additional field appears to select the .ckpt checkpoint file. The application validates that the checkpoint's architecture matches the selected base model.
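The two checkpoint strategies differ in which weights survive the load. The sketch below illustrates the idea on a plain state dictionary; the `classifier.` prefix for the head's parameters is an assumption, and the actual training script handles this selection internally.

```python
def select_checkpoint_weights(state_dict: dict, mode: str,
                              head_prefix: str = "classifier.") -> dict:
    """Pick which weights to load from a checkpoint.

    mode="resume"   -> keep the entire model state (training continues)
    mode="transfer" -> keep the feature extractor, drop the classification
                       head so it is re-initialized for the new classes

    `head_prefix` is an assumption about how the head's parameters are named.
    """
    if mode == "resume":
        return dict(state_dict)
    if mode == "transfer":
        return {k: v for k, v in state_dict.items()
                if not k.startswith(head_prefix)}
    raise ValueError(f"Unknown mode: {mode!r}")
```

With a real PyTorch model, the filtered dictionary would then be loaded non-strictly so the missing head weights fall back to fresh initialization.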


Base Architecture

Select the neural network architecture from the dropdown. Available models are organized by computational cost:

Lightweight models (faster training, lower VRAM usage):

  • ResNet-18, ResNet-50

  • ConvNeXt Tiny, ConvNeXt Small

Standard models (balanced performance):

  • MaxViT Tiny

  • ConvNeXt Base

Heavy models (highest accuracy potential, requires more VRAM):

  • MaxViT Small, MaxViT Base

  • ViT Base, ViT Large

  • SegFormer B3, SegFormer B5

  • Fused Network (custom multi-backbone architecture)


An orange warning appears when selecting heavy models, advising that they require significant GPU memory. If your GPU has less than 8 GB VRAM, stick with lightweight or standard models.

Training Mode

| Mode | Description | When to Use |
| --- | --- | --- |
| Lightweight | Freezes the backbone layers and trains only the classification head | Quick experiments, small datasets, or fine-tuning from a strong checkpoint |
| Full Training | Trains all model layers end-to-end | Best accuracy; recommended when you have a large dataset and sufficient compute |


Lightweight mode trains significantly faster and uses less GPU memory. It is a good starting point to verify your dataset before committing to a full training run.
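Conceptually, Lightweight mode freezes the backbone's parameters so gradients flow only through the head. A minimal PyTorch sketch, assuming the head is exposed as a `classifier` attribute (the real training script's internals may differ):

```python
import torch.nn as nn

def apply_lightweight_mode(model: nn.Module, head_name: str = "classifier") -> None:
    """Freeze every parameter, then unfreeze the classification head.

    `head_name` is an assumption about the attribute holding the head.
    """
    for p in model.parameters():
        p.requires_grad = False          # backbone stays fixed
    for p in getattr(model, head_name).parameters():
        p.requires_grad = True           # only the head receives gradients
```

Because frozen parameters need no gradient buffers or optimizer state, this is what makes Lightweight mode faster and lighter on GPU memory.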


Patience (Early Stopping)

The Patience field controls how many consecutive epochs training may continue without improvement before it stops automatically. Enter an integer, or leave the value as .inf to disable early stopping (the model then trains for the full configured number of epochs).

Click Save after changing the value.
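Since training is powered by PyTorch Lightning, early stopping presumably goes through its early-stopping callback; the patience counter it maintains works roughly like this sketch, where `math.inf` mirrors the `.inf` setting:

```python
import math

class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs.

    patience=math.inf disables early stopping entirely.
    """
    def __init__(self, patience: float = math.inf):
        self.patience = patience
        self.best = math.inf
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

For example, with patience=3 a run whose validation loss plateaus after epoch 2 stops three epochs later instead of exhausting the full epoch budget.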

Checkpoint Directory

The path where training outputs are saved. Each training session creates a new version_XX subfolder containing:

  • The model checkpoint (.ckpt) file

  • Hyperparameters (hparams.yaml)

  • Training metrics (training_metrics.json)

Running Training

  1. Verify all configuration fields are set correctly.

  2. Click Run Training.

  3. The application validates your settings, locks the sidebar, and launches the Python training script.

During Training

  • The log monitor streams real-time output including epoch progress, loss values, and learning rate.

  • The training chart plots training and validation loss curves as epochs complete.

  • The status indicator shows "Running" with a progress bar.

Training can take anywhere from minutes (lightweight mode, small dataset) to hours (full training, large dataset, heavy architecture).

Stopping Training

Click Stop Training at any time. The application sends a graceful termination signal, allowing PyTorch to save the current state. The last completed epoch's checkpoint is preserved.

After Training

When training completes or is stopped:

  • The status changes to Completed (green) or Cancelled (yellow).

  • Training metrics are saved to training_metrics.json in the version folder.

  • Training hours are added to your cumulative working hours counter.

  • The checkpoint is available for testing, feature extraction, and export.
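The version folders and metrics file can also be inspected programmatically. A sketch, assuming only the `version_XX` naming and `training_metrics.json` file name described above (the JSON schema itself is not documented here):

```python
import json
from pathlib import Path

def latest_version_dir(checkpoint_dir: str) -> Path:
    """Find the most recent version_XX folder created by a training run."""
    versions = sorted(Path(checkpoint_dir).glob("version_*"),
                      key=lambda p: int(p.name.split("_")[1]))
    if not versions:
        raise FileNotFoundError("No version_XX folders found")
    return versions[-1]

def load_metrics(checkpoint_dir: str) -> dict:
    """Load training_metrics.json from the latest version folder."""
    path = latest_version_dir(checkpoint_dir) / "training_metrics.json"
    return json.loads(path.read_text())
```

Sorting numerically (rather than lexically) matters once a project accumulates more than nine runs, since "version_10" would otherwise sort before "version_2".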

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Training fails immediately | Python runtime not installed or outdated | Install or update the runtime from Hardware Settings |
| CUDA out of memory | Model is too large for your GPU's VRAM | Switch to a lighter architecture, use Lightweight mode, or reduce the image resolution in Model Settings |
| Loss is not decreasing | Learning rate too high or too low, or dataset issues | Adjust hyperparameters in Model Settings and verify dataset quality |
| Architecture mismatch warning | Checkpoint was trained with a different model | Select the matching architecture or choose a different checkpoint |
| Training is very slow | No GPU detected, or running in CPU mode | Check GPU availability in Hardware Settings and verify NVIDIA drivers |
