Results

Pinned Benchmark Snapshot

This page pins the v1 benchmark snapshot to specific experiment runs; it is updated only when a new run outperforms the currently pinned ones.

Pinned Runs

Run IDs Used in This Snapshot

Baseline CNN

Run ID: 20260428_042520

Single-model ResNet50 baseline with a fixed 0.5 threshold.

Ensemble CNN (5-fold)

Run ID: ensemble_ens_20260429_b

Five member folds aggregated via mean malignancy probability.
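The mean-probability aggregation used by the ensemble can be sketched as follows. This is a minimal illustration: the array shapes, probabilities, and the 0.5 threshold here are placeholders, not values taken from the pinned run.

```python
import numpy as np

def ensemble_mean_probability(fold_probs, threshold=0.5):
    """Average per-fold malignancy probabilities, then apply a decision threshold.

    fold_probs: array of shape (n_folds, n_samples).
    Returns the mean probabilities and the binary predictions.
    """
    mean_probs = np.mean(fold_probs, axis=0)  # average over the fold axis
    return mean_probs, (mean_probs >= threshold).astype(int)

# Illustrative probabilities for 5 folds x 3 lesions
fold_probs = np.array([
    [0.90, 0.20, 0.55],
    [0.85, 0.30, 0.45],
    [0.95, 0.10, 0.60],
    [0.80, 0.25, 0.40],
    [0.88, 0.15, 0.52],
])
mean_probs, preds = ensemble_mean_probability(fold_probs)
# Only lesions whose mean probability clears the threshold are predicted malignant.
```

Averaging probabilities (rather than majority-voting hard labels) preserves each fold's confidence, which also keeps the aggregated scores usable for the ROC/PR curves below.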

Vision Transformer (ViT-B16)

Run ID: 20260423_105922

Pretrained ViT with warmup and cosine-style learning-rate decay.

Vision Transformer (ViT-L16)

Run ID: 20260430_165245

Pretrained larger ViT with warmup and cosine-style learning-rate decay.
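Both ViT runs use warmup followed by cosine-style learning-rate decay. A minimal sketch of such a schedule is below; the step counts and base rate are illustrative placeholders, not the pinned runs' actual hyperparameters.

```python
import math

def lr_at_step(step, total_steps, warmup_steps, base_lr):
    """Linear warmup to base_lr, then cosine decay toward zero."""
    if step < warmup_steps:
        # Warmup: ramp linearly from base_lr/warmup_steps up to base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Cosine phase: progress runs from 0 (end of warmup) to 1 (last step).
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1 + math.cos(math.pi * progress))

# Illustrative: 1000 total steps, 100 warmup steps, base rate 3e-4
schedule = [lr_at_step(s, 1000, 100, 3e-4) for s in range(1000)]
```

The schedule peaks at the base rate when warmup ends and decays smoothly afterward, which is the usual recipe for stabilizing early ViT fine-tuning.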

Metric Tables

Validation and External-Test Performance

ISIC Validation Metrics

| Model | Accuracy | Precision | Recall | F1 | ROC AUC | PR AUC | Confusion Matrix |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline CNN | 85.14% | 55.56% | 83.87% | 66.84% | 0.9292 | 0.8064 | TN 3555, FP 607, FN 146, TP 759 |
| Ensemble CNN (5-fold) | 83.41% | 52.78% | 67.07% | 59.08% | 0.8527 | 0.6511 | TN 3622, FP 543, FN 298, TP 607 |
| Vision Transformer (ViT-B16) | 92.40% | 79.82% | 76.91% | 78.33% | 0.9554 | 0.8713 | TN 3986, FP 176, FN 209, TP 696 |
| Vision Transformer (ViT-L16) | 93.86% | 84.45% | 80.44% | 82.40% | 0.9658 | 0.9014 | TN 4028, FP 134, FN 177, TP 728 |

HAM10000 External-Test Metrics

| Model | Accuracy | Precision | Recall | F1 | ROC AUC | PR AUC | Confusion Matrix |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Baseline CNN | 84.34% | 61.77% | 51.84% | 56.37% | 0.7965 | 0.6263 | TN 7434, FP 627, FN 941, TP 1013 |
| Ensemble CNN (5-fold) | 83.69% | 62.75% | 40.43% | 49.18% | 0.7966 | 0.5739 | TN 7592, FP 469, FN 1164, TP 790 |
| Vision Transformer (ViT-B16) | 88.96% | 91.01% | 48.16% | 62.99% | 0.8470 | 0.7059 | TN 7968, FP 93, FN 1013, TP 941 |
| Vision Transformer (ViT-L16) | 89.85% | 95.00% | 50.61% | 66.04% | 0.8355 | 0.7159 | TN 8009, FP 52, FN 965, TP 989 |
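Each row's rate metrics follow directly from its confusion-matrix cells. A quick sketch, reproducing the Baseline CNN's ISIC validation row from the counts given above:

```python
def metrics_from_confusion(tn, fp, fn, tp):
    """Derive the table's rate metrics from raw confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)            # of predicted malignant, how many were
    recall = tp / (tp + fn)               # of actual malignant, how many were caught
    f1 = 2 * precision * recall / (precision + recall)
    return accuracy, precision, recall, f1

# Baseline CNN, ISIC validation split (counts from the table above)
acc, prec, rec, f1 = metrics_from_confusion(tn=3555, fp=607, fn=146, tp=759)
# After rounding: 85.14%, 55.56%, 83.87%, 66.84%
```

The same function applied to the HAM10000 rows reproduces the external-test columns, which makes the confusion-matrix column a convenient self-check on the tables.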

Shift Analysis

Generalization Gap Summary

Generalization Gap (Validation - External Test)

| Model | Accuracy Gap | Precision Gap | Recall Gap | F1 Gap | ROC AUC Gap | PR AUC Gap |
| --- | --- | --- | --- | --- | --- | --- |
| Baseline CNN | +0.80% | -6.20% | +32.03% | +10.47% | +13.27% | +18.01% |
| Ensemble CNN (5-fold) | -0.28% | -9.97% | +26.64% | +9.90% | +5.61% | +7.72% |
| Vision Transformer (ViT-B16) | +3.45% | -11.19% | +28.75% | +15.35% | +10.84% | +16.54% |
| Vision Transformer (ViT-L16) | +4.02% | -10.55% | +29.83% | +16.36% | +13.03% | +18.55% |

Positive values indicate that a model performed better on validation than on the external test set; AUC gaps are reported in percentage points. Such drops are expected under dataset shift and are one of the benchmark's core signals.
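The gap table is simply validation minus external test, metric by metric. A small sketch using the Baseline CNN values copied from the two tables above (percentages, with AUCs scaled by 100):

```python
def generalization_gap(val, ext):
    """Per-metric gap: validation score minus external-test score."""
    return {metric: val[metric] - ext[metric] for metric in val}

# Baseline CNN values from the ISIC validation and HAM10000 tables
baseline_val = {"accuracy": 85.14, "recall": 83.87, "roc_auc": 92.92}
baseline_ext = {"accuracy": 84.34, "recall": 51.84, "roc_auc": 79.65}
gaps = generalization_gap(baseline_val, baseline_ext)
# Matches the table: accuracy +0.80, recall +32.03, ROC AUC +13.27
```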

Curves

ROC and Precision-Recall Panels

Baseline CNN Curves

ISIC Validation ROC

Validation ROC curve on the ISIC split.

ISIC Validation PR

Validation precision-recall curve on ISIC.

HAM10000 ROC

External-test ROC curve on HAM10000.

HAM10000 PR

External-test precision-recall curve on HAM10000.

Ensemble CNN (5-fold) Curves

ISIC Aggregate ROC

ROC curves of the five folds on the ISIC validation split.

ISIC Aggregate PR

Precision-recall curves of the five folds on ISIC validation.

HAM10000 Aggregate ROC

Per-fold external-test ROC curves of the ensemble on HAM10000.

HAM10000 Aggregate PR

Per-fold external-test precision-recall curves of the ensemble on HAM10000.

Vision Transformer (ViT-B16) Curves

ISIC Validation ROC

Validation ROC curve for ViT-B16 on ISIC.

ISIC Validation PR

Validation precision-recall curve for ViT-B16.

HAM10000 ROC

External-test ROC curve for ViT-B16 on HAM10000.

HAM10000 PR

External-test precision-recall curve for ViT-B16.

Vision Transformer (ViT-L16) Curves

ISIC Validation ROC

Validation ROC curve for ViT-L16 on ISIC.

ISIC Validation PR

Validation precision-recall curve for ViT-L16.

HAM10000 ROC

External-test ROC curve for ViT-L16 on HAM10000.

HAM10000 PR

External-test precision-recall curve for ViT-L16.