Feature Extraction

Extracting neural embeddings for similarity analysis and model comparison

The Feature Extraction screen extracts high-dimensional feature embeddings from a trained model checkpoint and your dataset. These embeddings are then analyzed with clustering metrics and visualized as a PCA scatter plot, giving you insight into how well the model separates different classes in its internal representation space.

Feature extraction also enables model comparison: you can run it on multiple training sessions and compare their cluster quality metrics side by side.

Prerequisites

  • An open project with at least one completed training session.

  • A split dataset (the training subset is used by default).

Step-by-Step Walkthrough

1. Select a Training Session

The Training Session dropdown lists all version_XX folders in your checkpoint directory. Each entry indicates whether feature extraction has already been run for that session, based on whether results exist for it.

The application auto-selects the most recent session that has existing feature extraction results. If no session has results, the latest session is selected.
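The auto-selection rule above can be sketched in a few lines of Python. This is an illustration only, not the application's actual code: the function names are hypothetical, "most recent" is assumed to mean the highest version number, and a session is assumed to "have results" when its folder contains feature_extraction.json.

```python
import re
from pathlib import Path
from typing import Optional

RESULTS_FILE = "feature_extraction.json"  # results file name given in this guide


def version_number(folder: Path) -> int:
    """Return the numeric suffix of a version_XX folder name."""
    return int(folder.name.split("_")[1])


def pick_default_session(checkpoint_dir: Path) -> Optional[Path]:
    """Prefer the newest session with results, else the newest session."""
    sessions = sorted(
        (p for p in checkpoint_dir.iterdir()
         if p.is_dir() and re.fullmatch(r"version_\d+", p.name)),
        key=version_number,  # assumption: higher number means more recent
    )
    if not sessions:
        return None
    with_results = [s for s in sessions if (s / RESULTS_FILE).is_file()]
    # Fall back to all sessions when none of them has results yet.
    return (with_results or sessions)[-1]
```

Sorting by the numeric suffix rather than by name keeps version_10 after version_2, which a plain alphabetical sort would not.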

2. Select a Checkpoint

The Checkpoint dropdown is populated based on the selected session. It lists all .ckpt files in that version folder.

By default, the application selects the "best" checkpoint rather than last.ckpt, because the best checkpoint typically represents the model's optimal performance.
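A minimal sketch of that default, under stated assumptions: the guide does not say how the "best" checkpoint is named, so this example treats any .ckpt file other than last.ckpt as a candidate and breaks ties alphabetically, which is an arbitrary choice. The function name is hypothetical.

```python
from pathlib import Path
from typing import Optional


def pick_default_checkpoint(session_dir: Path) -> Optional[Path]:
    """Return a non-last checkpoint if one exists, otherwise last.ckpt, else None."""
    checkpoints = sorted(session_dir.glob("*.ckpt"))
    # Assumption: anything that is not last.ckpt counts as a "best" checkpoint.
    best = [c for c in checkpoints if c.name != "last.ckpt"]
    if best:
        return best[0]
    return checkpoints[0] if checkpoints else None
```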

3. Verify the Dataset Path

The dataset path is loaded automatically from your project configuration and points to your training split folder. The field is read-only and is shown for reference.

4. Extract Features

Click Extract Features to begin. The application:

  1. Writes the extraction configuration to config.yaml.

  2. Launches the Python feature extraction script.

  3. Loads the checkpoint and processes every image in the dataset.

  4. Extracts feature vectors from the model's penultimate layer.

  5. Computes clustering metrics on the embedding space.

  6. Saves results to feature_extraction.json in the session folder.

Progress is streamed to the log monitor.
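Launching a script and streaming its output line by line, as in steps 2 and onward, could look roughly like the sketch below. It is not the application's implementation: run_and_stream and its on_line callback are hypothetical names, and the callback stands in for whatever feeds the log monitor.

```python
import subprocess
from typing import Callable, List


def run_and_stream(command: List[str], on_line: Callable[[str], None]) -> int:
    """Run a command, pass each output line to on_line, and return the exit code."""
    process = subprocess.Popen(
        command,
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # merge error output into the same stream
        text=True,
        bufsize=1,  # line-buffered reads on the parent side
    )
    assert process.stdout is not None
    for line in process.stdout:
        on_line(line.rstrip("\n"))  # forward each line as soon as it arrives
    return process.wait()
```

A call such as run_and_stream([sys.executable, "extract_features.py"], print) would echo the script's progress to the console; the script name here is a placeholder, not the real file name. A non-zero return value indicates that the extraction process failed.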

Viewing Results

After extraction completes (or when selecting a session with existing results), two visualizations appear:

PCA Scatter Plot

The PCA (Principal Component Analysis) plot reduces the high-dimensional embeddings to 2D for visualization. Each point represents an image, colored by its class label. The axis labels show the percentage of variance explained by each principal component.

A well-trained model produces clusters that are clearly separated by class. Overlapping clusters suggest the model struggles to distinguish those classes.
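For readers who want to reproduce the projection outside the application, here is a minimal sketch using NumPy. The guide does not state which PCA implementation the extraction script uses, so the SVD approach and the function name pca_2d are assumptions made for illustration.

```python
import numpy as np


def pca_2d(embeddings: np.ndarray):
    """Project (n_samples, n_features) embeddings onto their first two principal components."""
    centered = embeddings - embeddings.mean(axis=0)
    # SVD of the centered data: the rows of vt are the principal directions,
    # ordered by decreasing singular value.
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    points = centered @ vt[:2].T  # 2D coordinates, one row per image
    variance = singular_values ** 2  # proportional to the variance per component
    explained_ratio = variance[:2] / variance.sum()
    return points, explained_ratio
```

The two values in explained_ratio correspond to the percentages shown on the axis labels, for example f"PC1 ({explained_ratio[0]:.1%})". When they add up to only a small share of the total variance, the 2D plot hides much of the structure, so overlap in the plot does not necessarily mean overlap in the full embedding space.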

Cluster Quality Metrics

Three metrics quantify how well the model's embeddings separate classes:

| Metric | Range | Good Value | Interpretation |
| --- | --- | --- | --- |
| Silhouette Score | -1 to 1 | Higher is better (>0.5 is good) | Measures how similar images are to their own class vs. other classes |
| Davies-Bouldin Index | 0 to ∞ | Lower is better (<1 is good) | Measures the ratio of within-class scatter to between-class separation |
| Calinski-Harabasz Index | 0 to ∞ | Higher is better | Measures the ratio of between-class dispersion to within-class dispersion |

A per-class silhouette chart is also displayed, showing which classes are well-clustered and which have overlap.
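To make the silhouette definition concrete, the sketch below computes a mean silhouette value per class using only the standard library. It illustrates the metric itself and is not the application's code: Euclidean distance is an assumption, and the function name is hypothetical. In practice, scikit-learn's silhouette_samples and silhouette_score compute the same quantities far more efficiently.

```python
from collections import defaultdict
from math import dist
from statistics import mean
from typing import Dict, List, Sequence


def silhouette_by_class(
    points: Sequence[Sequence[float]], labels: Sequence[str]
) -> Dict[str, float]:
    """Mean silhouette coefficient per class, using class labels as clusters."""
    by_class: Dict[str, List[int]] = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    if len(by_class) < 2:
        raise ValueError("silhouette needs at least two classes")

    scores: Dict[str, List[float]] = defaultdict(list)
    for i, label in enumerate(labels):
        own = [j for j in by_class[label] if j != i]
        if not own:
            scores[label].append(0.0)  # convention for single-member classes
            continue
        # a: mean distance to the other members of the same class
        a = mean(dist(points[i], points[j]) for j in own)
        # b: smallest mean distance to the members of any other class
        b = min(
            mean(dist(points[i], points[j]) for j in members)
            for other, members in by_class.items()
            if other != label
        )
        denominator = max(a, b)
        scores[label].append((b - a) / denominator if denominator else 0.0)
    return {label: mean(values) for label, values in scores.items()}
```

A class whose value is close to 1 is compact and far from its neighbours, while a value near 0 or below means its images sit as close to another class as to their own. Note that the overall silhouette score is the mean over all samples, which differs from the mean of these per-class values when class sizes are unequal.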

Model Comparison Table

The bottom-right section shows a comparison table with feature extraction results from all training sessions. Each row represents a session and includes:

  • Session name (version number)

  • Model architecture name

  • Total samples processed

  • Number of classes

  • Silhouette score

  • Davies-Bouldin index

  • Calinski-Harabasz index

Use this table to compare different training runs and identify which model produces the cleanest class separation.
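The same comparison can be approximated outside the application by reading each session's feature_extraction.json. The sketch below assumes the version_XX folder layout described in this guide; the JSON schema is not documented here, so the key "silhouette_score" and the function name are hypothetical and should be adjusted to match the real file.

```python
import json
from pathlib import Path
from typing import Any, Dict, List


def load_comparison_rows(checkpoint_dir: Path) -> List[Dict[str, Any]]:
    """Collect one row per session that has a feature_extraction.json file."""
    rows = []
    for results_path in sorted(checkpoint_dir.glob("version_*/feature_extraction.json")):
        with results_path.open(encoding="utf-8") as handle:
            results = json.load(handle)
        rows.append({"session": results_path.parent.name, **results})
    # Highest silhouette first; "silhouette_score" is a hypothetical key name.
    rows.sort(key=lambda row: row.get("silhouette_score", float("-inf")), reverse=True)
    return rows
```

Sessions without a results file are skipped, and rows missing the sort key fall to the bottom instead of raising an error.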

Configuration Persistence

The extraction parameters are stored in config.yaml under the feature_extraction section, including checkpoint path, dataset path, and output directory. Results are saved as feature_extraction.json in each session's version folder.

Troubleshooting

| Issue | Possible Cause | Solution |
| --- | --- | --- |
| No training sessions listed | No completed training runs | Complete at least one training session first |
| Extraction fails | Checkpoint is corrupted or incompatible | Try a different checkpoint from the same session |
| PCA plot shows overlapping clusters | Model does not separate classes well | Consider retraining with more data, a different architecture, or adjusted hyperparameters |
| Metrics are poor across all sessions | Dataset quality issues | Review the dataset for mislabeled images or insufficient class diversity |
