Train AI Model

Training a classification model using PyTorch with real-time monitoring

The Training screen is where you train a deep learning classification model on your prepared dataset. Studio Desktop supports multiple model architectures and training strategies, with real-time progress monitoring and automatic metric tracking.

Training is powered by PyTorch Lightning under the hood. You configure the key parameters through the interface, and the application handles the rest — launching the Python training script, streaming logs, plotting metrics, and saving checkpoints.

Prerequisites

  • An open project with a split dataset (at minimum, train/ and val/ subfolders with class-organized images).

  • The Python runtime installed (from Hardware Settings).

  • An NVIDIA GPU is strongly recommended. CPU training is supported but significantly slower.

  • Hardware and model settings configured (see Hardware Settings and Model Settings).
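Before launching a run, the expected dataset layout can be sanity-checked with a short script. This is an illustrative sketch, not part of Studio Desktop; the set of image extensions it counts is an assumption.

```python
from pathlib import Path

def check_split_dataset(root: str) -> dict:
    """Verify that a split dataset has train/ and val/ subfolders with
    class-organized image subdirectories, and count images per class."""
    root_dir = Path(root)
    image_exts = {".jpg", ".jpeg", ".png"}  # assumption: typical formats
    counts = {}
    for split in ("train", "val"):
        split_dir = root_dir / split
        if not split_dir.is_dir():
            raise FileNotFoundError(f"Missing required subfolder: {split_dir}")
        counts[split] = {
            cls.name: sum(1 for f in cls.iterdir() if f.suffix.lower() in image_exts)
            for cls in split_dir.iterdir() if cls.is_dir()
        }
    return counts
```

Running this before training surfaces missing folders or empty classes early, before the Python training script does.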

Configuration

Split Dataset Folder

The path to your split dataset is loaded automatically from the project configuration. Click Select to change it. The application displays the total image count and class distribution badges below the field.

Starting Weights

Choose how the model is initialized before training begins:

| Option | Description | When to Use |
| --- | --- | --- |
| Pretrained (ImageNet) | Starts from publicly available weights pre-trained on ImageNet | First training on a new dataset, no prior checkpoints |
| Transfer from Checkpoint | Uses the feature-extraction layers from a previous checkpoint and resets the classification head | Adapting a model trained on one coffee type to another |
| Resume from Checkpoint | Loads the entire model state from a previous checkpoint and continues training | Extending a training session that was stopped early |

When Transfer or Resume is selected, an additional field appears to select the .ckpt checkpoint file. The application validates that the checkpoint's architecture matches the selected base model.
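The two checkpoint strategies differ in which weights survive the load. The sketch below illustrates the idea on a plain state dictionary; the `classifier.` prefix for the head's parameters is an assumption, and the actual training script handles this selection internally.

```python
def select_checkpoint_weights(state_dict: dict, mode: str,
                              head_prefix: str = "classifier.") -> dict:
    """Pick which weights to load from a checkpoint.

    mode="resume"   -> keep the entire model state (training continues)
    mode="transfer" -> keep the feature extractor, drop the classification
                       head so it is re-initialized for the new classes

    `head_prefix` is an assumption about how the head's parameters are named.
    """
    if mode == "resume":
        return dict(state_dict)
    if mode == "transfer":
        return {k: v for k, v in state_dict.items()
                if not k.startswith(head_prefix)}
    raise ValueError(f"Unknown mode: {mode!r}")
```

With a real PyTorch model, the filtered dictionary would then be loaded non-strictly so the missing head weights fall back to fresh initialization.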


Base Architecture

Select the neural network architecture from the dropdown. Available models are organized by computational cost:

Lightweight models (faster training, lower VRAM usage):

  • ResNet-18, ResNet-50

  • ConvNeXt Tiny, ConvNeXt Small

Standard models (balanced performance):

  • MaxViT Tiny

  • ConvNeXt Base

Heavy models (highest accuracy potential, requires more VRAM):

  • MaxViT Small, MaxViT Base

  • ViT Base, ViT Large

  • SegFormer B3, SegFormer B5

  • Fused Network (custom multi-backbone architecture)


An orange warning appears when selecting heavy models, advising that they require significant GPU memory. If your GPU has less than 8 GB VRAM, stick with lightweight or standard models.

Training Mode

| Mode | Description | When to Use |
| --- | --- | --- |
| Lightweight | Freezes the backbone layers and trains only the classification head | Quick experiments, small datasets, or fine-tuning from a strong checkpoint |
| Full Training | Trains all model layers end-to-end | Best accuracy; recommended when you have a large dataset and sufficient compute |


Lightweight mode trains significantly faster and uses less GPU memory. It is a good starting point to verify your dataset before committing to a full training run.
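Conceptually, Lightweight mode freezes the backbone's parameters so gradients flow only through the head. A minimal PyTorch sketch, assuming the head is exposed as a `classifier` attribute (the real training script's internals may differ):

```python
import torch.nn as nn

def apply_lightweight_mode(model: nn.Module, head_name: str = "classifier") -> None:
    """Freeze every parameter, then unfreeze the classification head.

    `head_name` is an assumption about the attribute holding the head.
    """
    for p in model.parameters():
        p.requires_grad = False          # backbone stays fixed
    for p in getattr(model, head_name).parameters():
        p.requires_grad = True           # only the head receives gradients
```

Because frozen parameters need no gradient buffers or optimizer state, this is what makes Lightweight mode faster and lighter on GPU memory.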


Patience (Early Stopping)

The Patience field controls how many consecutive epochs training may continue without improvement before it stops automatically. Enter an integer, or leave the value as .inf to disable early stopping (the model then trains for the full configured number of epochs).

Click Save after changing the value.
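Since training is powered by PyTorch Lightning, early stopping presumably goes through its early-stopping callback; the patience counter it maintains works roughly like this sketch, where `math.inf` mirrors the `.inf` setting:

```python
import math

class EarlyStopping:
    """Stop when validation loss hasn't improved for `patience` epochs.

    patience=math.inf disables early stopping entirely.
    """
    def __init__(self, patience: float = math.inf):
        self.patience = patience
        self.best = math.inf
        self.epochs_without_improvement = 0

    def step(self, val_loss: float) -> bool:
        """Record one epoch's validation loss; return True to stop."""
        if val_loss < self.best:
            self.best = val_loss
            self.epochs_without_improvement = 0
        else:
            self.epochs_without_improvement += 1
        return self.epochs_without_improvement >= self.patience
```

For example, with patience=3 a run whose validation loss plateaus after epoch 2 stops three epochs later instead of exhausting the full epoch budget.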

Checkpoint Directory

The path where training outputs are saved. Each training session creates a new version_XX subfolder containing:

  • The model checkpoint (.ckpt) file

  • Hyperparameters (hparams.yaml)

  • Training metrics (training_metrics.json)

Running Training

  1. Verify all configuration fields are set correctly.

  2. Click Run Training.

  3. The application validates your settings, locks the sidebar, and launches the Python training script.

During Training

  • The log monitor streams real-time output including epoch progress, loss values, and learning rate.

  • The training chart plots training and validation loss curves as epochs complete.

  • The status indicator shows "Running" with a progress bar.

Training can take anywhere from minutes (lightweight mode, small dataset) to hours (full training, large dataset, heavy architecture).

Stopping Training

Click Stop Training at any time. The application sends a graceful termination signal, allowing PyTorch to save the current state. The last completed epoch's checkpoint is preserved.

After Training

When training completes or is stopped:

  • The status changes to Completed (green) or Cancelled (yellow).

  • Training metrics are saved to training_metrics.json in the version folder.

  • Training hours are added to your cumulative working hours counter.

  • The checkpoint is available for testing, feature extraction, and export.
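The version folders and metrics file can also be inspected programmatically. A sketch, assuming only the `version_XX` naming and `training_metrics.json` file name described above (the JSON schema itself is not documented here):

```python
import json
from pathlib import Path

def latest_version_dir(checkpoint_dir: str) -> Path:
    """Find the most recent version_XX folder created by a training run."""
    versions = sorted(Path(checkpoint_dir).glob("version_*"),
                      key=lambda p: int(p.name.split("_")[1]))
    if not versions:
        raise FileNotFoundError("No version_XX folders found")
    return versions[-1]

def load_metrics(checkpoint_dir: str) -> dict:
    """Load training_metrics.json from the latest version folder."""
    path = latest_version_dir(checkpoint_dir) / "training_metrics.json"
    return json.loads(path.read_text())
```

Sorting numerically (rather than lexically) matters once a project accumulates more than nine runs, since "version_10" would otherwise sort before "version_2".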

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| Training fails immediately | Python runtime not installed or outdated | Install or update the runtime from Hardware Settings |
| CUDA out of memory | Model is too large for your GPU's VRAM | Switch to a lighter architecture, use Lightweight mode, or reduce the image resolution in Model Settings |
| Loss is not decreasing | Learning rate too high or too low, or dataset issues | Adjust hyperparameters in Model Settings and verify dataset quality |
| Architecture mismatch warning | Checkpoint was trained with a different model | Select the matching architecture or choose a different checkpoint |
| Training is very slow | No GPU detected, or running in CPU mode | Check GPU availability in Hardware Settings and verify NVIDIA drivers |
