This is a review of candidate metrics for evaluating spatial cell segmentation methods against ground truth. The benchmark exposes two data structures to metrics:
- `file_spatial_solution`: `cell_labels` label image (ground truth), `cell_boundaries` shapes, table with `cell_id`, `region`, `cell_area`, `transcript_counts`
- `file_processed_prediction`: `segmentation` label image (predicted), table with `cell_id`, `region`, and `counts` / `normalized` / `normalized_log` / `normalized_log_scaled` layers
Disclaimer: This overview was generated with the assistance of GitHub Copilot and may contain inaccurate or incomplete information.
## Overview

| Priority | Metric | Rationale |
|---|---|---|
| 1 | Panoptic Quality (PQ) | Accepted standard in spatial biology; captures detection and boundary quality in one score |
| 2 | ARI on transcript assignments | Directly measures transcript-level assignment quality; complements PQ |
| 3 | F1 / Precision / Recall | Simple and interpretable; exposes over- vs under-segmentation |
| 4 | Cell-type purity | Measures biological relevance of the segmentation |
| 5 | AP@[0.5:0.95] | More thorough picture at multiple IoU thresholds |
| 6 | Cell area distribution | Diagnostic / sanity check only |
| 7 | Silhouette score | Too indirect; confounded by normalization choices |
Suggestions and corrections are very welcome, particularly regarding literature references and whether any important metrics have been missed.
## 1. Panoptic Quality (PQ)

| Property | Value |
|---|---|
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (`cell_labels` vs `segmentation`) |
| Type | Image analysis |
| Fields used | `cell_labels` (solution), `segmentation` (prediction) |
PQ = Detection Quality (DQ, called Recognition Quality in the original formulation) × Segmentation Quality (SQ). A predicted cell is matched to a ground truth cell if their IoU exceeds 0.5, which guarantees that the matching is unique. DQ is the F1 of matched pairs; SQ is the mean IoU of matched pairs. The combined score ranges from 0 to 1.
Used as a standard in instance segmentation (COCO benchmark) and increasingly in spatial biology (e.g. Greenwald et al. 2022 Nature Biotechnology, Pachitariu & Stringer 2022 Nature Methods).
Pros: Penalises both false positives and over/under-segmentation in a single interpretable score.
Cons: Requires pixel-level coordinate alignment between prediction and ground truth; sensitive to coordinate transformations.
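As a sketch, PQ can be computed directly from the two label images. The function below is a minimal NumPy implementation, assuming both images share the same pixel grid and 0 denotes background; the function name is illustrative, not part of the benchmark API:

```python
import numpy as np

def panoptic_quality(gt, pred, iou_thresh=0.5):
    """PQ between two label images; 0 denotes background."""
    gt_ids = np.unique(gt[gt != 0])
    pred_ids = np.unique(pred[pred != 0])
    matched_ious = []
    matched_pred = set()
    for g in gt_ids:
        g_mask = gt == g
        # only predicted labels that overlap this cell can exceed IoU 0.5
        for p in np.unique(pred[g_mask]):
            if p == 0 or p in matched_pred:
                continue
            p_mask = pred == p
            inter = np.logical_and(g_mask, p_mask).sum()
            union = np.logical_or(g_mask, p_mask).sum()
            if inter / union > iou_thresh:  # IoU > 0.5 makes the match unique
                matched_ious.append(inter / union)
                matched_pred.add(p)
                break
    tp = len(matched_ious)
    fp = len(pred_ids) - tp
    fn = len(gt_ids) - tp
    if tp == 0:
        return 0.0
    sq = float(np.mean(matched_ious))     # mean IoU of matched pairs
    dq = tp / (tp + 0.5 * fp + 0.5 * fn)  # F1 of the matching
    return dq * sq
```

Because IoU > 0.5 matches are unique, a greedy search over overlapping labels suffices; production code would typically build an overlap (contingency) table instead of per-pair masks for speed.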
## 2. Average Precision (AP) at multiple IoU thresholds

| Property | Value |
|---|---|
| Complexity | Medium |
| Literature acceptance | High |
| Evaluated on | Label images (`cell_labels` vs `segmentation`) |
| Type | Image analysis |
| Fields used | `cell_labels` (solution), `segmentation` (prediction) |
Computes precision–recall across a range of IoU thresholds (e.g. 0.5–0.95 in steps of 0.05) and averages the area under each PR curve. mAP@0.5 is the most commonly reported single number.
De-facto standard in instance segmentation and used in cell segmentation benchmarks such as CellSeg (Ma et al. 2024).
Pros: More complete picture than a single IoU cutoff; separates performance at strict and lenient thresholds.
Cons: Computationally heavier; less intuitive than PQ as a single summary number.
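One common single-class variant, used e.g. by Cellpose, defines AP at threshold t as TP / (TP + FP + FN) and averages over thresholds. A sketch of that variant, assuming the matched IoUs (from a PQ-style matching at IoU > 0.5) are already available; names are illustrative:

```python
import numpy as np

def average_precision(matched_ious, n_gt, n_pred,
                      thresholds=np.arange(0.5, 1.0, 0.05)):
    """Cellpose-style AP = TP / (TP + FP + FN), averaged over IoU thresholds.

    matched_ious: IoUs of matched (gt, pred) pairs at IoU > 0.5.
    n_gt, n_pred: total numbers of ground truth and predicted cells.
    """
    matched_ious = np.asarray(matched_ious)
    aps = []
    for t in thresholds:
        tp = int((matched_ious > t).sum())  # matches surviving the stricter cutoff
        fp = n_pred - tp
        fn = n_gt - tp
        aps.append(tp / (tp + fp + fn))
    return float(np.mean(aps)), dict(zip(np.round(thresholds, 2), aps))
```

Returning the per-threshold dictionary alongside the mean makes it easy to report both AP@0.5 and the averaged AP@[0.5:0.95].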
## 3. Adjusted Rand Index (ARI) on Transcript Assignments

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Per-transcript cell assignments derived from label images |
| Type | Clustering / transcript assignment |
| Fields used | `cell_labels` + transcript point coordinates (solution) vs `segmentation` + transcripts (prediction) |
Treats transcript-to-cell assignment as a clustering problem. For each transcript, its cell ID is looked up in both the ground truth and the predicted label image, and ARI is computed between the two resulting assignment vectors. ARI = 1 indicates perfect agreement; 0 is the expected value for a random assignment.
Used in transcript-assignment benchmarks (e.g. Petukhov et al. 2022 Nature Biotechnology).
Pros: Directly measures what matters for downstream analysis (which transcripts belong to which cell); no need for pixel-level IoU.
Cons: Sensitive to how background transcripts (cell_id = 0) are handled; ignores cell shape beyond assignment.
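A minimal sketch with scikit-learn, assuming transcripts come as integer pixel coordinates on the same grid as the label images. Dropping transcripts that fall on ground truth background is one choice among several; this filtering step is exactly what the caveat above refers to:

```python
import numpy as np
from sklearn.metrics import adjusted_rand_score

def transcript_ari(gt_labels, pred_labels, x, y):
    """ARI between ground truth and predicted transcript-to-cell assignments.

    gt_labels, pred_labels: 2D label images on the same pixel grid (0 = background).
    x, y: integer pixel coordinates of each transcript.
    """
    gt_assign = gt_labels[y, x]      # cell ID under each transcript
    pred_assign = pred_labels[y, x]
    # drop transcripts that are background in the ground truth, since ARI
    # would otherwise treat label 0 as just another cluster
    keep = gt_assign != 0
    return adjusted_rand_score(gt_assign[keep], pred_assign[keep])
```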
## 4. F1 / Precision / Recall on Cell Detection

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Medium |
| Evaluated on | Set of detected cell instances |
| Type | Detection |
| Fields used | `cell_labels` (solution), `segmentation` (prediction), IoU matching at fixed threshold |
At a fixed IoU threshold (typically 0.5), precision is the fraction of predicted cells matched to a ground truth cell; recall is the fraction of ground truth cells that were detected. F1 is their harmonic mean.
Pros: Simple and interpretable; exposes whether a method over- or under-segments.
Cons: Single threshold; does not capture boundary quality beyond the threshold. Largely subsumed by PQ (whose DQ component is equivalent).
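A sketch of the matching and the three scores, under the same assumptions as for PQ (shared pixel grid, 0 = background; function name illustrative):

```python
import numpy as np

def detection_prf(gt, pred, iou_thresh=0.5):
    """Precision, recall and F1 of cell detection at a fixed IoU threshold."""
    gt_ids = np.unique(gt[gt != 0])
    pred_ids = np.unique(pred[pred != 0])
    tp = 0
    taken = set()
    for g in gt_ids:
        g_mask = gt == g
        # intersection sizes with every predicted label under this cell
        overlaps, inters = np.unique(pred[g_mask], return_counts=True)
        for p, inter in zip(overlaps, inters):
            if p == 0 or p in taken:
                continue
            union = g_mask.sum() + (pred == p).sum() - inter
            if inter / union > iou_thresh:
                tp += 1
                taken.add(p)
                break
    precision = tp / len(pred_ids) if len(pred_ids) else 0.0
    recall = tp / len(gt_ids) if len(gt_ids) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```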
## 5. Cell-type Purity (Downstream Biological Quality)

| Property | Value |
|---|---|
| Complexity | High |
| Literature acceptance | Medium–High |
| Evaluated on | Expression matrix + scRNA-seq reference for label transfer |
| Type | Biological relevance |
| Fields used | `counts` / `normalized_log` layers (prediction), scRNA-seq reference for cell-type annotation |
After assigning cell types to predicted cells (e.g. by label transfer from the scRNA-seq reference), measure purity of the resulting clusters relative to ground truth cell-type annotations. Possible statistics include mean cell-type entropy per cluster or adjusted mutual information.
Used in txsim (Kleshchevnikov et al.) and Squidpy-based evaluations.
Pros: Captures biological meaningfulness: perfect pixel overlap can still yield mixed transcriptome profiles if boundaries are slightly off.
Cons: Requires an accurate scRNA-seq reference and a label-transfer step, both of which introduce additional noise.
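Mean per-cluster entropy, one of the statistics mentioned above, can be sketched without any framework dependencies. Inputs are a predicted cluster label and a ground truth cell-type label per cell; names are illustrative:

```python
import numpy as np

def mean_celltype_entropy(pred_cluster, true_type):
    """Mean Shannon entropy of ground truth cell types within each predicted cluster.

    0 means every cluster is pure; higher values mean more mixing.
    """
    pred_cluster = np.asarray(pred_cluster)
    true_type = np.asarray(true_type)
    entropies = []
    for c in np.unique(pred_cluster):
        _, counts = np.unique(true_type[pred_cluster == c], return_counts=True)
        p = counts / counts.sum()
        entropies.append(-(p * np.log2(p)).sum())  # entropy in bits
    return float(np.mean(entropies))
```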
## 6. Silhouette Score of Expression Profiles

| Property | Value |
|---|---|
| Complexity | Low–Medium |
| Literature acceptance | Low–Medium |
| Evaluated on | Normalized expression matrix |
| Type | Data quality |
| Fields used | `normalized_log` or `normalized_log_scaled` layer, cell-type labels |
Computes the silhouette coefficient of cells in PCA space, grouped by ground truth cell type. A higher score means same-type cells cluster together, indicating clean type-specific expression profiles.
Pros: Does not require pixel-level comparison; captures transcriptomic cohesion.
Cons: Confounded by normalization and HVG selection choices; hard to attribute changes to segmentation quality alone.
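A sketch with scikit-learn, assuming a cells × genes log-normalized matrix and one ground truth type label per cell; the number of principal components is an arbitrary choice here and, as noted above, one of the confounders:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score

def expression_silhouette(log_expr, cell_types, n_pcs=50, seed=0):
    """Silhouette of ground truth cell types in PCA space of the expression matrix."""
    # cap the number of components at what the matrix can support
    n_pcs = min(n_pcs, min(log_expr.shape) - 1)
    pcs = PCA(n_components=n_pcs, random_state=seed).fit_transform(log_expr)
    return float(silhouette_score(pcs, cell_types))
```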
## 7. Cell Area Distribution Similarity

| Property | Value |
|---|---|
| Complexity | Low |
| Literature acceptance | Low |
| Evaluated on | Per-cell area statistics |
| Type | Diagnostic |
| Fields used | `cell_area` (solution table), cell area from predicted `segmentation` label image |
Compares the distribution of predicted cell sizes to the ground truth distribution using e.g. Wasserstein distance or Jensen–Shannon divergence. A method that produces systematically larger or smaller cells will show a distributional mismatch.
Pros: Very fast; useful for diagnosing systematic over/under-segmentation bias.
Cons: A method can match the distribution while still misidentifying individual cells; not sufficient as a primary metric.
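A sketch with SciPy, comparing the solution's `cell_area` column against per-cell pixel counts from the predicted label image. Note that pixel counts are only proportional to physical area, so units must match or be normalized before comparing:

```python
import numpy as np
from scipy.stats import wasserstein_distance

def area_distance(gt_areas, pred_labels):
    """Wasserstein distance between ground truth and predicted cell area distributions."""
    ids, counts = np.unique(pred_labels, return_counts=True)
    pred_areas = counts[ids != 0]  # per-cell pixel counts, background excluded
    return wasserstein_distance(gt_areas, pred_areas)
```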