Feature Extraction
Extracting neural embeddings for similarity analysis and model comparison
The Feature Extraction screen extracts high-dimensional feature embeddings from a trained model checkpoint and your dataset. These embeddings are then analyzed with clustering metrics and visualized as a PCA scatter plot, giving you insight into how well the model separates different classes in its internal representation space.
Feature extraction also enables model comparison: you can run it on multiple training sessions and compare their cluster quality metrics side by side.
Prerequisites
An open project with at least one completed training session.
A split dataset (the training subset is used by default).
Step-by-Step Walkthrough
1. Select a Training Session
The Training Session dropdown lists all version_XX folders in your checkpoint directory. Each entry shows whether feature extraction has already been run (indicated by the presence of results).
The application auto-selects the most recent session that has existing feature extraction results. If no session has results, the latest session is selected.
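The selection rule above can be sketched in a few lines of Python. This is only an illustration, not the application's actual code: the function names are hypothetical, and it assumes that "most recent" means the highest version number and that a session "has results" when feature_extraction.json exists in its folder.

```python
import re
from pathlib import Path

RESULTS_FILE = "feature_extraction.json"  # results file named in this guide


def list_sessions(checkpoint_dir):
    """Return version_XX folders, ordered from lowest to highest version number."""
    pattern = re.compile(r"^version_(\d+)$")
    sessions = []
    for path in Path(checkpoint_dir).iterdir():
        match = pattern.match(path.name)
        if path.is_dir() and match:
            sessions.append((int(match.group(1)), path))
    sessions.sort(key=lambda item: item[0])
    return [path for _, path in sessions]


def pick_default_session(checkpoint_dir):
    """Prefer the newest session with results; otherwise fall back to the newest session."""
    sessions = list_sessions(checkpoint_dir)
    if not sessions:
        return None
    with_results = [s for s in sessions if (s / RESULTS_FILE).is_file()]
    return (with_results or sessions)[-1]
```

Sorting on the parsed integer rather than the folder name keeps version_10 after version_2, which a plain string sort would not.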
2. Select a Checkpoint
The Checkpoint dropdown is populated based on the selected session. It lists all .ckpt files in that version folder.
By default, the application selects the "best" checkpoint rather than last.ckpt, since the best checkpoint typically represents the model's optimal performance.
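A minimal sketch of this default, assuming that any .ckpt file other than last.ckpt counts as a "best" checkpoint. The function name is hypothetical, and the tie-break when several candidates exist (alphabetical order here) is an assumption; the application may use a different rule.

```python
from pathlib import Path


def pick_default_checkpoint(session_dir):
    """Return a default .ckpt path for a session folder, or None if there is none."""
    checkpoints = sorted(Path(session_dir).glob("*.ckpt"))
    if not checkpoints:
        return None
    # Assumption: every checkpoint except last.ckpt is a "best" candidate.
    candidates = [c for c in checkpoints if c.name != "last.ckpt"]
    # Fall back to last.ckpt only when nothing else is available.
    return candidates[0] if candidates else checkpoints[0]
```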
3. Verify the Dataset Path
The dataset path is loaded automatically from your project configuration and points to your training split folder. The field is read-only and is shown for reference.
4. Extract Features
Click Extract Features to begin. The application:
Writes the extraction configuration to config.yaml.
Launches the Python feature extraction script.
Loads the checkpoint and processes every image in the dataset.
Extracts feature vectors from the model's penultimate layer.
Computes clustering metrics on the embedding space.
Saves results to feature_extraction.json in the session folder.
Progress is streamed to the log monitor.
Viewing Results
After extraction completes (or when selecting a session with existing results), two visualizations appear:
PCA Scatter Plot
The PCA (Principal Component Analysis) plot reduces the high-dimensional embeddings to 2D for visualization. Each point represents an image, colored by its class label. The axis labels show the percentage of variance explained by each principal component.
A well-trained model produces clusters that are clearly separated by class. Overlapping clusters suggest the model struggles to distinguish those classes.
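To make the 2D projection and the variance percentages concrete, here is a small NumPy sketch of PCA via singular value decomposition. It is an illustration under stated assumptions, not the application's implementation: the function name pca_2d is hypothetical and the random input only stands in for real embeddings.

```python
import numpy as np


def pca_2d(embeddings):
    """Project embeddings onto their first two principal components.

    Returns the (n_samples, 2) coordinates and the fraction of total
    variance explained by each of the two components.
    """
    X = np.asarray(embeddings, dtype=np.float64)
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    _, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    coords = X_centered @ Vt[:2].T
    explained = S**2 / np.sum(S**2)
    return coords, explained[:2]


# Stand-in data: 100 samples with 16-dimensional embeddings.
rng = np.random.default_rng(0)
coords, ratio = pca_2d(rng.normal(size=(100, 16)))
x_label = f"PC1 ({ratio[0]:.1%} of variance)"
y_label = f"PC2 ({ratio[1]:.1%} of variance)"
```

When the two percentages are small, the plot shows only a thin slice of the embedding space, so classes that overlap in 2D may still be separable in the full space. The cluster metrics are the more reliable signal in that case.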
Cluster Quality Metrics
Three metrics quantify how well the model's embeddings separate classes:
Silhouette Score (range -1 to 1): higher is better; values above 0.5 are good. Measures how similar images are to their own class compared with other classes.
Davies-Bouldin Index (range 0 to ∞): lower is better; values below 1 are good. Measures the ratio of within-class scatter to between-class separation.
Calinski-Harabasz Index (range 0 to ∞): higher is better. Measures the ratio of between-class dispersion to within-class dispersion.
A per-class silhouette chart is also displayed, showing which classes are well-clustered and which have overlap.
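As an illustration of what the silhouette values mean, the following NumPy sketch computes a per-sample silhouette and averages it per class. The function names are hypothetical, Euclidean distance is an assumption, and at least two classes are required. scikit-learn provides equivalent functions in sklearn.metrics (silhouette_score, silhouette_samples, davies_bouldin_score, calinski_harabasz_score); whether the extraction script uses them is not stated here.

```python
import numpy as np


def silhouette_per_sample(X, labels):
    """Silhouette value for each sample, using Euclidean distance."""
    X = np.asarray(X, dtype=np.float64)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    # Full pairwise distance matrix; fine for a sketch, costly for large datasets.
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        n_same = int(same.sum())
        if n_same == 1:
            continue  # a class with a single sample scores 0 by convention
        # a: mean distance to the other samples of the same class
        a = dists[i, same].sum() / (n_same - 1)
        # b: mean distance to the nearest other class
        b = min(dists[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = 0.0 if denom == 0 else (b - a) / denom
    return scores


def silhouette_per_class(X, labels):
    """Mean silhouette per class, as plotted in a per-class chart."""
    labels = np.asarray(labels)
    scores = silhouette_per_sample(X, labels)
    return {c.item(): float(scores[labels == c].mean()) for c in np.unique(labels)}
```

A class whose mean value is close to 1 is compact and far from the others; a value near 0 or below points to overlap with a neighbouring class.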
Model Comparison Table
The bottom-right section shows a comparison table with feature extraction results from all training sessions. Each row represents a session and includes:
Session name (version number)
Model architecture name
Total samples processed
Number of classes
Silhouette score
Davies-Bouldin index
Calinski-Harabasz index
Use this table to compare different training runs and identify which model produces the cleanest class separation.
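The same comparison can be reproduced outside the application by reading each session's feature_extraction.json. The sketch below is hypothetical: the JSON key names in FIELDS are assumptions made for illustration and may not match the file's actual schema, so adjust them after inspecting a real results file.

```python
import json
from pathlib import Path

# Assumed key names; check an actual feature_extraction.json before relying on them.
FIELDS = [
    "model_name",
    "num_samples",
    "num_classes",
    "silhouette_score",
    "davies_bouldin",
    "calinski_harabasz",
]


def load_comparison_rows(checkpoint_dir):
    """Build one row per session that has a feature_extraction.json file."""
    rows = []
    for session in sorted(Path(checkpoint_dir).glob("version_*")):
        results = session / "feature_extraction.json"
        if not results.is_file():
            continue
        data = json.loads(results.read_text(encoding="utf-8"))
        row = {"session": session.name}
        row.update({key: data.get(key) for key in FIELDS})
        rows.append(row)
    return rows


def best_by_silhouette(rows):
    """Return the row with the highest silhouette score, or None if none have one."""
    scored = [r for r in rows if r.get("silhouette_score") is not None]
    return max(scored, key=lambda r: r["silhouette_score"]) if scored else None
```

Ranking by a single metric is a simplification. Comparing sessions is only meaningful when they were extracted from the same dataset split.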
Configuration Persistence
The extraction parameters are stored in config.yaml under the feature_extraction section, including checkpoint path, dataset path, and output directory. Results are saved as feature_extraction.json in each session's version folder.
Troubleshooting
No training sessions listed. Cause: no completed training runs. Solution: complete at least one training session first.
Extraction fails. Cause: the checkpoint is corrupted or incompatible. Solution: try a different checkpoint from the same session.
PCA plot shows overlapping clusters. Cause: the model does not separate classes well. Solution: consider retraining with more data, a different architecture, or adjusted hyperparameters.
Metrics are poor across all sessions. Cause: dataset quality issues. Solution: review the dataset for mislabeled images or insufficient class diversity.