SpatialBench

Can AI agents extract biological insight from real-world spatial data?

SpatialBench is a benchmark of 159 verifiable problems derived from practical spatial transcriptomics workflows. Each problem snapshots an analysis state immediately before a target step and pairs it with a deterministic grader that evaluates recovery of a key biological result.
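Concretely, each problem ships as a single JSON eval file bundling the question, a pointer to the snapshotted analysis state, and the grader specification. The sketch below is purely illustrative; every field name and placeholder value is an assumption rather than the released schema (the id mirrors a file that actually exists in example_evals/):

{
  "id": "xenium_kidney_spatial_cn7_composition_day14",
  "platform": "Xenium",
  "category": "Spatial Analysis",
  "prompt": "Report the cell type composition of spatial neighborhood CN7 at day 14.",
  "snapshot": "<path to the pre-step analysis state>",
  "grader": {
    "type": "DistributionComparison",
    "expected": { "<cell type>": "<proportion>" }
  }
}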

This release comprises 159 problems spanning 5 platforms and 7 task categories. We report results on the full benchmark and publicly release a representative sample, covering every platform and task category, together with the associated agent trajectories. The full problem set is withheld from public release to avoid contamination.

Key Findings

| Model | Harness | Accuracy (%) | Cost ($) |
| --- | --- | --- | --- |
| gpt-5.5 | mini-swe-agent | 57.65 | 1.1207 |
| gpt-5.4 | mini-swe-agent | 57.44 | 0.577 |
| gpt-5.5 | openai-codex | 53.67 | 3.1616 |
| claude-opus-4-6 | mini-swe-agent | 52.83 | 0.8456 |
| claude-opus-4-7 | mini-swe-agent | 52.41 | 0.9817 |
| gemini-3.1-pro-preview | mini-swe-agent | 51.57 | 0.9362 |
| claude-opus-4-7 | claude-code | 51.36 | 0.8023 |
| gpt-5.2 | mini-swe-agent | 50.10 | 0.6024 |
| grok-4.20-beta-0309-reasoning | mini-swe-agent | 45.91 | 0.1679 |
| claude-sonnet-4-6 | mini-swe-agent | 44.23 | 0.273 |
| claude-opus-4-5 | mini-swe-agent | 42.77 | 0.4624 |
| claude-sonnet-4-5 | mini-swe-agent | 41.51 | 0.2247 |
| gpt-5.1 | mini-swe-agent | 39.83 | 0.1574 |
| grok-4-1-fast-reasoning | mini-swe-agent | 33.96 | 0.0164 |
| grok-4 | mini-swe-agent | 31.87 | 0.4529 |
| gemini-2.5-pro | mini-swe-agent | 28.93 | 0.1086 |

Full results with 95% confidence intervals are in results/. Details on the implementation methodology can be found in Methods.

Benchmark Structure

159 evaluations across:

  • 5 platforms: Curio, Vizgen, Xenium, AtlasXOmics, Visium
  • 7 task categories: Dimensionality Reduction, Cell Typing, Normalization, Differential Expression, Clustering, QC, Spatial Analysis

Tasks require empirical interaction with the data; agents that rely on prior knowledge without performing the requisite analysis fail to complete many tasks correctly.
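For example, even a simple QC question has to be answered from the snapshot itself rather than recalled. A hypothetical fragment of that interaction, assuming the snapshotted state is an AnnData .h5ad file readable with scanpy (the actual snapshot format and file name here are assumptions):

import scanpy as sc

# Assumption: the snapshotted analysis state is an AnnData .h5ad file.
adata = sc.read_h5ad("analysis_state.h5ad")  # hypothetical file name

# A NumericTolerance-style question (e.g., median counts per cell) must be
# computed from the data; it cannot be answered from prior knowledge.
sc.pp.calculate_qc_metrics(adata, inplace=True)
print(adata.obs["total_counts"].median())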

Quick Start

# Install from a clone of this repository
pip install -e .

# Validate evaluation format
spatialbench validate example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json

# Run with mini-swe-agent
export ANTHROPIC_API_KEY=your_key
spatialbench run example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json --agent minisweagent --model anthropic/claude-opus-4-5

# Or run with an OpenAI model
export OPENAI_API_KEY=your_key
spatialbench run example_evals/spatial_analysis/xenium_kidney_spatial_cn7_composition_day14.json --agent minisweagent --model openai/gpt-5.5

Graders

Five grader families handle different answer types:

| Grader | Use Case |
| --- | --- |
| NumericTolerance | QC metrics, counts, expression values |
| MultipleChoice | Discrete interpretation questions |
| MarkerGenePrecisionRecall | Gene lists (P@K, R@K) |
| LabelSetJaccard | Cell type sets |
| DistributionComparison | Cell type proportions |
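For intuition, here is a minimal Python sketch of three of these families. These are not the latch-eval-tools implementations; the function names, signatures, and default thresholds are all assumptions:

def numeric_tolerance(answer: float, expected: float, rel_tol: float = 0.05) -> bool:
    # Pass if the answer is within a relative tolerance of the reference value.
    return abs(answer - expected) <= rel_tol * abs(expected)

def label_set_jaccard(answer: set[str], expected: set[str]) -> float:
    # Jaccard similarity between predicted and reference cell type sets.
    union = answer | expected
    return len(answer & expected) / len(union) if union else 1.0

def marker_gene_precision_recall(ranked: list[str], expected: set[str], k: int) -> tuple[float, float]:
    # P@K and R@K over the top-K entries of a ranked marker gene list;
    # assumes a non-empty reference set.
    hits = len(set(ranked[:k]) & expected)
    return hits / k, hits / len(expected)

In each case the released grader fixes the tolerance, threshold, or K per problem, which is what makes grading deterministic.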

See latch-eval-tools for implementations and harness setups.

Citation

@article{spatialbench2025,
  title = {SpatialBench: Can Agents Analyze Real-World Spatial Biology Data?},
  author = {Workman, Kenny and Yang, Zhen and Muralidharan, Harihara and Le, Hannah},
  year = {2025},
  url = {https://github.com/latchbio/spatialbench}
}

License

Apache 2.0