This is a review of candidate metrics for evaluating spatial cell segmentation methods against ground truth. The benchmark exposes two data structures to metrics:
- `file_spatial_solution`: `cell_labels` label image (ground truth), `cell_boundaries` shapes, table with `cell_id`, `region`, `cell_area`, `transcript_counts`
- `file_processed_prediction`: `segmentation` label image (predicted), table with `cell_id`, `region`, and `counts` / `normalized` / `normalized_log` / `normalized_log_scaled` layers
Disclaimer: This overview was generated with the assistance of GitHub Copilot and may contain inaccurate or incomplete information.
## Overview

| Priority | Metric | Rationale |
|---|---|---|
| 1 | Panoptic Quality (PQ) | Accepted standard in spatial biology; captures detection and boundary quality in one score |
| 2 | ARI on transcript assignments | Directly measures transcript-level assignment quality; complements PQ |
| 3 | F1 / Precision / Recall | Simple and interpretable; exposes over- vs under-segmentation |
| 4 | Cell-type purity | Measures biological relevance of the segmentation |
| 5 | AP@[0.5:0.95] | More thorough picture at multiple IoU thresholds |
| 6 | Cell area distribution | Diagnostic / sanity check only |
| 7 | Silhouette score | Too indirect; confounded by normalization choices |
Suggestions and corrections are very welcome, particularly regarding literature references and whether any important metrics have been missed.
## 1. Panoptic Quality (PQ)

| Property | Value |
|---|---|
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (`cell_labels` vs `segmentation`) |
| Type | Image analysis |
| Fields used | `cell_labels` (solution), `segmentation` (prediction) |
PQ = Detection Quality (DQ, called Recognition Quality in the original formulation) × Segmentation Quality (SQ). A predicted cell is matched to a ground truth cell if their IoU exceeds 0.5, which guarantees that the matching is unique. DQ is the F1 of matched pairs; SQ is the mean IoU of matched pairs. The combined score ranges from 0 to 1.
Used as a standard in instance segmentation (COCO benchmark) and increasingly in spatial biology (e.g. Greenwald et al. 2022 Nature Biotechnology, Pachitariu & Stringer 2022 Nature Methods).
Pros: Penalises both false positives and over/under-segmentation in a single interpretable score.
Cons: Requires pixel-level coordinate alignment between prediction and ground truth; sensitive to coordinate transformations.
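As a sketch, PQ can be computed directly from the two label images. The function below is a minimal NumPy implementation, assuming both images share the same pixel grid and 0 denotes background; the function name is illustrative, not part of the benchmark API:

```python
import numpy as np

def panoptic_quality(gt, pred, iou_thresh=0.5):
    """PQ between two label images; 0 denotes background."""
    gt_ids = np.unique(gt[gt != 0])
    pred_ids = np.unique(pred[pred != 0])
    matched_ious = []
    matched_pred = set()
    for g in gt_ids:
        g_mask = gt == g
        # only predicted labels that overlap this cell can exceed IoU 0.5
        for p in np.unique(pred[g_mask]):
            if p == 0 or p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            if inter / union > iou_thresh:  # IoU > 0.5 makes the match unique
                matched_ious.append(inter / union)
                matched_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    if tp == 0:
        return 0.0
    sq = float(np.mean(matched_ious))     # mean IoU of matched pairs
    dq = tp / (tp + 0.5 * fp + 0.5 * fn)  # F1 of the matching
    return dq * sq
```

Because IoU > 0.5 matches are unique, a greedy search over overlapping labels suffices; production code would typically build an overlap (contingency) table instead of per-pair masks for speed.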
## 2. Average Precision (AP) at multiple IoU thresholds

| Property | Value |
|---|---|
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (`cell_labels` vs `segmentation`) |
| Type | Image analysis |
| Fields used | `cell_labels` (solution), `segmentation` (prediction) |
Computes precision–recall across a range of IoU thresholds (e.g. 0.5–0.95 in steps of 0.05) and averages the area under each PR curve. mAP@0.5 is the most commonly reported single number.
De-facto standard in instance segmentation and used in cell segmentation benchmarks such as CellSeg (Ma et al. 2024).
Pros: More complete picture than a single IoU cutoff; separates performance at strict and lenient thresholds.
Cons: Computationally heavier; less intuitive than PQ as a single summary number.
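One common single-class variant, used e.g. by Cellpose, defines AP at threshold t as TP / (TP + FP + FN) and averages over thresholds. A sketch of that variant, assuming the matched IoUs (from a PQ-style matching at IoU > 0.5) are already available; names are illustrative:

```python
import numpy as np

def average_precision(matched_ious, n_gt, n_pred,
                      thresholds=np.arange(0.5, 1.0, 0.05)):
    """Cellpose-style AP = TP / (TP + FP + FN), averaged over IoU thresholds.

    matched_ious: IoUs of matched (gt, pred) pairs at IoU > 0.5.
    n_gt, n_pred: total numbers of ground truth and predicted cells.
    """
    matched_ious = np.asarray(matched_ious)
    aps = []
    for t in thresholds:
        tp = int((matched_ious > t).sum())  # matches surviving the stricter cutoff
        fp = n_pred - tp
        fn = n_gt - tp
        aps.append(tp / (tp + fp + fn))
    return float(np.mean(aps)), dict(zip(np.round(thresholds, 2), aps))
```

Returning the per-threshold dictionary alongside the mean makes it easy to report both AP@0.5 and the averaged AP@[0.5:0.95].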
## 3. Adjusted Rand Index (ARI) on Transcript Assignments

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Per-transcript cell assignments derived from label images |
| Type | Clustering / transcript assignment |
| Fields used | `cell_labels` + transcript point coordinates (solution) vs `segmentation` + transcripts (prediction) |
Treats transcript-to-cell assignment as a clustering problem. For each transcript, its cell ID is looked up in both the ground truth and the predicted label image, and ARI is computed between the two resulting assignment vectors. ARI = 1 indicates perfect agreement; 0 is the expected value for a random assignment.
Used in transcript-assignment benchmarks (e.g. Petukhov et al. 2022 Nature Biotechnology).
Pros: Directly measures what matters for downstream analysis (which transcripts belong to which cell); no need for pixel-level IoU.
Cons: Sensitive to how background transcripts (cell_id = 0) are handled; ignores cell shape beyond assignment.
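A minimal sketch with scikit-learn, assuming transcripts come as integer pixel coordinates on the same grid as the label images. Dropping transcripts that fall on ground truth background is one choice among several; this filtering step is exactly what the caveat above refers to:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def transcript_ari(gt_labels, pred_labels, x, y):
    """ARI between ground truth and predicted transcript-to-cell assignments.

    gt_labels, pred_labels: 2D label images on the same pixel grid (0 = background).
    x, y: integer pixel coordinates of each transcript.
    """
    gt_assign = gt_labels[y, x]      # cell ID under each transcript
    pred_assign = pred_labels[y, x]
    # drop transcripts that are background in the ground truth, since ARI
    # would otherwise treat label 0 as just another cluster
    keep = gt_assign != 0
    return adjusted_rand_score(gt_assign[keep], pred_assign[keep])
```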
## 4. F1 / Precision / Recall on Cell Detection

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Set of detected cell instances |
| Type | Detection |
| Fields used | `cell_labels` (solution), `segmentation` (prediction), IoU matching at fixed threshold |
At a fixed IoU threshold (typically 0.5), precision is the fraction of predicted cells matched to a ground truth cell; recall is the fraction of ground truth cells that were detected. F1 is their harmonic mean.
Pros: Simple and interpretable; exposes whether a method over- or under-segments.
Cons: Single threshold; does not capture boundary quality beyond the threshold. Largely subsumed by PQ (whose DQ component is equivalent).
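A sketch of the matching and the three scores, under the same assumptions as for PQ (shared pixel grid, 0 = background; function name illustrative):

```python
import numpy as np

def detection_prf(gt, pred, iou_thresh=0.5):
    """Precision, recall and F1 of cell detection at a fixed IoU threshold."""
    gt_ids = np.unique(gt[gt != 0])
    pred_ids = np.unique(pred[pred != 0])
    tp = 0
    taken = set()
    for g in gt_ids:
        g_mask = gt == g
        # intersection sizes with every predicted label under this cell
        overlaps, inters = np.unique(pred[g_mask], return_counts=True)
        for p, inter in zip(overlaps, inters):
            if p == 0 or p in taken:
                continue
            union = g_mask.sum() + (pred == p).sum() - inter
            if inter / union > iou_thresh:
                tp += 1
                taken.add(p)
                break
    precision = tp / len(pred_ids) if len(pred_ids) else 0.0
    recall = tp / len(gt_ids) if len(gt_ids) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```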
## 5. Cell-type Purity (Downstream Biological Quality)

| Property | Value |
|---|---|
| Complexity | High |
| Literature acceptance | Medium–High |
| Evaluated on | Expression matrix + scRNA-seq reference for label transfer |
| Type | Biological relevance |
| Fields used | `counts` / `normalized_log` layers (prediction), scRNA-seq reference for cell-type annotation |
After assigning cell types to predicted cells (e.g. by label transfer from the scRNA-seq reference), measure purity of the resulting clusters relative to ground truth cell-type annotations. Possible statistics include mean cell-type entropy per cluster or adjusted mutual information.
Used in txsim (Kleshchevnikov et al.) and Squidpy-based evaluations.
Pros: Captures biological meaningfulness: perfect pixel overlap can still yield mixed transcriptome profiles if boundaries are slightly off.
Cons: Requires an accurate scRNA-seq reference and a label-transfer step, both of which introduce additional noise.
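Mean per-cluster entropy, one of the statistics mentioned above, can be sketched without any framework dependencies. Inputs are a predicted cluster label and a ground truth cell-type label per cell; names are illustrative:

```python
import numpy as np

def mean_celltype_entropy(pred_cluster, true_type):
    """Mean Shannon entropy of ground truth cell types within each predicted cluster.

    0 means every cluster is pure; higher values mean more mixing.
    """
    pred_cluster = np.asarray(pred_cluster)
    true_type = np.asarray(true_type)
    entropies = []
    for c in np.unique(pred_cluster):
        _, counts = np.unique(true_type[pred_cluster == c], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())  # entropy in bits
    return float(np.mean(entropies))
```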
## 6. Silhouette Score of Expression Profiles

| Property | Value |
|---|---|
| Complexity | Low–Medium |
| Literature acceptance | Low–Medium |
| Evaluated on | Normalized expression matrix |
| Type | Data quality |
| Fields used | `normalized_log` or `normalized_log_scaled` layer, cell-type labels |
Computes the silhouette coefficient of cells in PCA space, grouped by ground truth cell type. A higher score means same-type cells cluster together, indicating clean type-specific expression profiles.
Pros: Does not require pixel-level comparison; captures transcriptomic cohesion.
Cons: Confounded by normalization and HVG selection choices; hard to attribute changes to segmentation quality alone.
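A sketch with scikit-learn, assuming a cells × genes log-normalized matrix and one ground truth type label per cell; the number of principal components is an arbitrary choice here and, as noted above, one of the confounders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def expression_silhouette(log_expr, cell_types, n_pcs=50, seed=0):
    """Silhouette of ground truth cell types in PCA space of the expression matrix."""
    # cap the number of components at what the matrix can support
    n_pcs = min(n_pcs, min(log_expr.shape) - 1)
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(log_expr)
    return float(silhouette_score(pcs, cell_types))
```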
## 7. Cell Area Distribution Similarity

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Low |
| Evaluated on | Per-cell area statistics |
| Type | Diagnostic |
| Fields used | `cell_area` (solution table), cell area from predicted `segmentation` label image |
Compares the distribution of predicted cell sizes to the ground truth distribution using e.g. Wasserstein distance or Jensen–Shannon divergence. A method that produces systematically larger or smaller cells will show a distributional mismatch.
Pros: Very fast; useful for diagnosing systematic over/under-segmentation bias.
Cons: A method can match the distribution while still misidentifying individual cells; not sufficient as a primary metric.
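A sketch with SciPy, comparing the solution's `cell_area` column against per-cell pixel counts from the predicted label image. Note that pixel counts are only proportional to physical area, so units must match or be normalized before comparing:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def area_distance(gt_areas, pred_labels):
    """Wasserstein distance between ground truth and predicted cell area distributions."""
    ids, counts = np.unique(pred_labels, return_counts=True)
    pred_areas = counts[ids != 0]  # per-cell pixel counts, background excluded
    return wasserstein_distance(gt_areas, pred_areas)
```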