đŸ€ș Evaluation¶

🔍 Metrics¶

Evaluation will be performed on held-out test cases. The challenge focuses on assessing both segmentation accuracy and interaction efficiency in an interactive human–AI setting.

To capture these aspects, we evaluate performance using two complementary metrics:

  • Dice Score (DSC): Measures voxel-level overlap between predicted and ground-truth lesion segmentation.
  • Detection–Matching Metric (DMM): Measures instance-level detection performance by assessing whether predicted lesions match ground-truth lesions. This metric explicitly evaluates the ability to correctly identify individual lesions, independent of voxel overlap.

These two metrics reflect the dual objective of PET/CT lesion analysis:

  • accurately delineating tumor burden (Dice)
  • correctly detecting and separating individual lesions (DMM)
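As a concrete illustration of the voxel-level metric, the Dice score can be computed as in the minimal NumPy sketch below; the official evaluation script may differ in details such as the handling of empty masks:

```python
import numpy as np

def dice_score(pred: np.ndarray, gt: np.ndarray) -> float:
    """Voxel-level Dice overlap between a binary prediction and ground truth."""
    pred = pred.astype(bool)
    gt = gt.astype(bool)
    intersection = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    if denom == 0:
        # Assumed convention: two empty masks count as perfect agreement.
        return 1.0
    return 2.0 * intersection / denom

# Toy 1D example: 2 of 3 predicted voxels overlap the 3-voxel ground truth,
# giving Dice = 2*2 / (3+3) = 2/3.
pred = np.array([1, 1, 1, 0, 0])
gt = np.array([0, 1, 1, 1, 0])
print(dice_score(pred, gt))
```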

🔄 Interactive Evaluation¶

Metrics are evaluated iteratively over the interaction process, where models receive corrective scribbles and update their predictions step-by-step.

For each case:

  • An initial prediction is generated
  • Corrective scribbles are provided
  • The model updates its segmentation iteratively
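The steps above can be sketched as a single loop per case. Everything in this sketch is illustrative: the one-voxel scribble simulator and the "trust the correction" update are toy stand-ins, not the challenge's simulation procedure or any participant model:

```python
import numpy as np

def dice(pred, gt):
    inter = np.logical_and(pred, gt).sum()
    denom = pred.sum() + gt.sum()
    return 1.0 if denom == 0 else 2.0 * inter / denom

def simulate_scribble(pred, gt):
    """Toy scribble simulator: pick one mislabeled voxel to correct."""
    errors = np.flatnonzero(pred != gt)
    return errors[0] if errors.size else None

def run_case(pred, gt, num_steps=3):
    """Track Dice across interaction steps for one case."""
    scores = [dice(pred, gt)]                # score of the initial prediction
    for _ in range(num_steps):
        idx = simulate_scribble(pred, gt)    # corrective scribble
        if idx is not None:
            pred = pred.copy()
            pred[idx] = gt[idx]              # toy update: apply the correction
        scores.append(dice(pred, gt))        # score after this step
    return scores                            # one score per step, fed into AUC

gt = np.array([0, 1, 1, 1, 0], dtype=bool)
pred = np.array([1, 1, 0, 0, 0], dtype=bool)
print(run_case(pred, gt))
```

The per-step score list returned here is exactly the kind of trajectory that the AUC formulations in the next section summarize.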

Performance is tracked across interaction steps and summarized using area-under-the-curve (AUC) formulations:

  • AUC-Dice → measures how efficiently segmentation quality improves
  • AUC-DMM → measures how efficiently lesion detection improves

AUC is computed using the trapezoidal rule over interaction steps.
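A minimal sketch of this trapezoidal AUC, assuming unit spacing between interaction steps; normalizing by the number of intervals (so the result stays in [0, 1]) is an assumption here, not necessarily what the official script does:

```python
import numpy as np

def interaction_auc(scores):
    """Area under the score-vs-step curve via the trapezoidal rule,
    normalized by the number of interaction intervals."""
    scores = np.asarray(scores, dtype=float)
    # Trapezoidal rule with unit step spacing: average adjacent scores.
    area = np.sum((scores[1:] + scores[:-1]) / 2.0)
    return float(area) / (len(scores) - 1)

# A method that improves quickly scores higher than one that improves late,
# even though both reach the same final Dice:
print(interaction_auc([0.4, 0.8, 0.9, 0.9]))  # early improvement, higher AUC
print(interaction_auc([0.4, 0.4, 0.5, 0.9]))  # late improvement, lower AUC
```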

📊 Final Metrics¶

For each submission, the following metrics are reported:

  • AUC-Dice (higher is better)
  • AUC-DMM (higher is better)

Additionally, the following descriptive metrics are reported only for analysis:

  • Final Foreground Dice score of segmented lesions
  • Final DMM score of segmented lesions
  ‱ False positive volume (FPV): Total volume of predicted connected components that do not overlap with any positive (ground-truth) region
  ‱ False negative volume (FNV): Total volume of positive connected components in the ground truth that do not overlap with the estimated segmentation mask
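FPV and FNV can be sketched as below, assuming lesion instances are already labeled as connected components (integer label maps as produced e.g. by scipy.ndimage.label; labeling itself is omitted to keep the sketch NumPy-only). This is an illustrative reading of the definitions above, not the official implementation:

```python
import numpy as np

def false_positive_volume(pred_labels, gt, voxel_vol=1.0):
    """Total volume of predicted components with no ground-truth overlap.
    pred_labels: integer array, each predicted lesion has a unique id > 0.
    gt: binary ground-truth mask. voxel_vol: volume of one voxel."""
    fpv = 0.0
    for lesion_id in np.unique(pred_labels):
        if lesion_id == 0:          # 0 is background
            continue
        component = pred_labels == lesion_id
        if not np.any(np.logical_and(component, gt)):
            fpv += component.sum() * voxel_vol
    return fpv

def false_negative_volume(gt_labels, pred, voxel_vol=1.0):
    """Total volume of ground-truth lesions entirely missed by the prediction."""
    fnv = 0.0
    for lesion_id in np.unique(gt_labels):
        if lesion_id == 0:
            continue
        lesion = gt_labels == lesion_id
        if not np.any(np.logical_and(lesion, pred)):
            fnv += lesion.sum() * voxel_vol
    return fnv

# Predicted component 2 touches no ground truth -> counted as FPV.
pred_labels = np.array([1, 1, 0, 2, 0])
gt = np.array([1, 0, 0, 0, 0], dtype=bool)
print(false_positive_volume(pred_labels, gt))
```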

Figure: Example of the evaluation. The Dice score measures the overlap between the predicted lesion segmentation (blue) and the ground truth (red). In addition, special emphasis is placed on false negatives (i.e. entirely missed lesions) and on false positives by measuring their volumes: large false-positive volumes, such as brain or bladder uptake, result in a low score.

A Python script computing these evaluation metrics is provided at https://github.com/lab-midas/autoPETV.

⚙ Interaction Regimes¶

Each submission is evaluated under two complementary regimes on the test sets:

Category 1: Simulated Interaction¶
  • Standardized, reproducible scribbles
  • Fixed number of interaction steps
  • Enables consistent benchmarking across all methods
Category 2: Clinician-Driven Interaction¶
  • Real expert-provided scribbles
  • Variable number of interactions per case
  • Reflects realistic clinical correction workflows

📈 Ranking¶

All submissions are evaluated under both interaction regimes and are automatically eligible for both award categories.

The ranking is based on AUC-Dice (50%) and AUC-DMM (50%). For each metric, scores are first averaged across all test cases; a per-metric ranking is then computed, and the final ranking is obtained by combining both metric rankings with equal weighting.
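This rank-then-combine scheme can be sketched as follows. Tie handling is an assumption here; the official ranking procedure may resolve ties differently:

```python
def combined_ranking(auc_dice, auc_dmm):
    """Rank teams per metric (higher score = better rank), then average the
    two ranks with equal weight; the lowest combined rank wins."""
    teams = list(auc_dice)

    def ranks(metric):
        # Sort descending by score; rank 1 is best.
        order = sorted(teams, key=lambda t: -metric[t])
        return {t: i + 1 for i, t in enumerate(order)}

    r_dice, r_dmm = ranks(auc_dice), ranks(auc_dmm)
    combined = {t: 0.5 * r_dice[t] + 0.5 * r_dmm[t] for t in teams}
    return sorted(teams, key=lambda t: combined[t])

# Hypothetical case-averaged AUC values for three teams:
dice_scores = {"A": 0.82, "B": 0.78, "C": 0.80}
dmm_scores = {"A": 0.70, "B": 0.75, "C": 0.65}
print(combined_ranking(dice_scores, dmm_scores))  # → ['A', 'B', 'C']
```

Note that combining ranks, rather than averaging raw scores, keeps the two metrics on a comparable scale even if their numeric ranges differ.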