A benchmark for structured extraction from PDF documents, comprising:
- Dataset -- 35 human-validated PDF-to-JSON extraction tasks across 5 schemas and 4 domains (finance, academia, hiring, sports).
- Evaluation Suite -- A Python package that scores predicted JSON against gold JSON with per-field metrics (exact match, fuzzy, semantic/LLM-based, numeric tolerance, and more).
```bash
pip install -e .

# With dev dependencies
pip install -e ".[dev]"
```

```python
import json
from pathlib import Path

from extract_bench import ReportBuilder, ReportConfig

# Load your data
schema = json.load(open("schema.json"))
gold = json.load(open("gold.json"))
extracted = json.load(open("model_output.json"))

# Configure and build report
config = ReportConfig(
    output_dir=Path("./eval_results"),
    output_name="nvidia-10k-extract-gemini-flash",  # Identifies this experiment
)
builder = ReportBuilder(config)
report = builder.build(schema, gold, extracted)

# Save all outputs
output_path = builder.save(report)
print(f"Results saved to: {output_path}")
```

This creates `eval_results/nvidia-10k-extract-gemini-flash/` containing:
| File | Purpose |
|---|---|
| `report.json` | Machine-readable full report (for programmatic analysis) |
| `summary.txt` | Human-readable one-page summary (for quick inspection) |
| `fields.csv` | Per-field outcomes (for CSV analysis) |
| `fields.md` | Markdown table (for documentation/sharing) |
Key metrics on the report object:
```python
print(f"Overall score: {report.overall_score:.3f}")  # Item-weighted (arrays count more)
print(f"Field score: {report.field_score:.3f}")      # Flat average (each field equal)
print(f"Pass rate: {report.overall_pass_rate:.1%}")
print(f"Evaluated: {report.outcomes.total_evaluated} fields")
```

All `ReportConfig` options:

```python
config = ReportConfig(
    output_dir=Path("./outputs"),   # Where to save reports
    output_name="my-experiment",    # Subdirectory name (auto-generated if None)
    max_reasoning_length=200,       # Truncate LLM reasoning in outputs
    top_n_lowest_fields=5,          # Track the N lowest-scoring fields
    save_json=True,                 # Generate report.json
    save_text=True,                 # Generate summary.txt
    save_csv=True,                  # Generate fields.csv
    save_markdown=True,             # Generate fields.md
)
```

For running many experiments:
```python
import asyncio
import json
from pathlib import Path

from extract_bench import ReportBuilder, ReportConfig


async def evaluate_model_outputs(
    schema_path: Path,
    gold_path: Path,
    outputs_dir: Path,
    results_dir: Path,
):
    """Evaluate all model outputs in a directory."""
    schema = json.load(schema_path.open())
    gold = json.load(gold_path.open())

    results = []
    for output_file in outputs_dir.glob("*.json"):
        extracted = json.load(output_file.open())
        config = ReportConfig(
            output_dir=results_dir,
            output_name=output_file.stem,  # Use filename as experiment ID
        )
        builder = ReportBuilder(config)
        report = await builder.build_async(schema, gold, extracted)
        builder.save(report)
        results.append({
            "model": output_file.stem,
            "pass_rate": report.overall_pass_rate,
            "score": report.overall_score,
        })
    return results


# Run batch evaluation
results = asyncio.run(evaluate_model_outputs(
    schema_path=Path("schema.json"),
    gold_path=Path("gold.json"),
    outputs_dir=Path("./model_outputs"),
    results_dir=Path("./eval_results"),
))

# Print comparison
for r in sorted(results, key=lambda x: -x["score"]):
    print(f"{r['model']}: {r['pass_rate']:.1%} pass, {r['score']:.3f} avg score")
```

The report provides two complementary scores:
| Score | What it measures | When to use |
|---|---|---|
| Overall Score (item-weighted) | Each array item counts individually — an array of 10 items has 10x the weight of a scalar field | Primary score. Reflects how much of the data was extracted correctly |
| Field Score (flat average) | Each schema field counts equally regardless of type | Secondary. Useful when you care about field-level completeness |
Both scores range from 0.0 to 1.0. When a schema has no arrays, the two scores are identical.
Example: A schema with name (string), age (integer), and items (array of 10 objects). If name and age are correct but only 8/10 items match:
- Field score: (1.0 + 1.0 + 0.8) / 3 = 0.933
- Overall score: (1×1.0 + 1×1.0 + 10×0.8) / 12 = 0.833 ← reflects the 2 missed items
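The two aggregation schemes can be sketched in a few lines (illustrative only; the report computes these internally):

```python
# Per-field scores and weights from the example above: arrays are weighted
# by their item count, scalars count as 1.
field_scores = {"name": 1.0, "age": 1.0, "items": 0.8}
item_counts = {"name": 1, "age": 1, "items": 10}

# Field score: flat average over fields.
field_score = sum(field_scores.values()) / len(field_scores)

# Overall score: weighted by item count.
overall_score = (
    sum(field_scores[f] * item_counts[f] for f in field_scores)
    / sum(item_counts.values())
)

print(round(field_score, 3))    # 0.933
print(round(overall_score, 3))  # 0.833
```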
A one-page overview designed for quick inspection:
```
OVERALL RESULTS
----------------------------------------
Overall Score:  0.909 (item-weighted)
Field Score:    0.897 (flat average)
Pass Rate:      84.6%
Evaluated:      13 fields (11 passed, 2 failed)

ARRAY BREAKDOWN
----------------------------------------
languages [PASS] score=1.000
  Items: 3 matched, 0 missed, 0 spurious
  P=1.000 R=1.000 F1=1.000
workExperience [FAIL] score=0.667
  Items: 2 matched, 1 missed, 1 spurious
  P=0.667 R=0.667 F1=0.667

FAILED FIELDS (first 10)
----------------------------------------
other
  Metric: array_llm, Score: 0.000
  Reason: gold_empty_array
workExperience
  Metric: array_llm, Score: 0.667
```
The Array Breakdown section shows per-array item-level matching. For each array:
- matched — gold items correctly found in extracted
- missed — gold items not found in extracted
- spurious — extracted items not present in gold
- P/R/F1 — precision, recall, and F1 computed from these counts
The array score itself is recall-based: matched / (matched + missed).
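These statistics can be reproduced by hand from the counts in the summary above (an illustrative sketch, not the library's code):

```python
# Counts for workExperience from the sample summary: 2 matched, 1 missed,
# 1 spurious.
matched, missed, spurious = 2, 1, 1

precision = matched / (matched + spurious)
recall = matched / (matched + missed)
f1 = 2 * precision * recall / (precision + recall)
array_score = recall  # the array score is recall-based

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} score={array_score:.3f}")
# P=0.667 R=0.667 F1=0.667 score=0.667
```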
Per-field outcomes with array breakdown columns:
| Column | Description |
|---|---|
| `path` | Full JSONPath to the field |
| `normalized_path` | Human-readable path (e.g., `borrower.name`) |
| `metric_id` | Metric used for evaluation |
| `score` | Numeric score (0.0-1.0) |
| `passed` | Boolean pass/fail |
| `gold_value` | Expected value |
| `extracted_value` | Model's output value |
| `reasoning` | LLM reasoning (for semantic metrics) |
| `matched` | Array: number of matched items |
| `missed_gold` | Array: gold items missing from extracted |
| `spurious_pred` | Array: extracted items not in gold |
| `precision` | Array: matched / (matched + spurious) |
| `recall` | Array: matched / (matched + missed) |
| `f1` | Array: harmonic mean of P and R |
The array columns are empty for non-array fields.
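As a sketch, `fields.csv` can be post-processed with the stdlib `csv` module; the sample rows below are made up for illustration, not real output:

```python
import csv
import io

# Stand-in for open("eval_results/<experiment>/fields.csv"); the rows are
# hypothetical but use the column names described above.
sample = io.StringIO(
    "path,normalized_path,metric_id,score,passed\n"
    "$.name,name,string_semantic,1.0,True\n"
    "$.other,other,array_llm,0.0,False\n"
)
rows = list(csv.DictReader(sample))

# List failed fields with their metric and score.
failed = [r for r in rows if r["passed"] != "True"]
for r in failed:
    print(r["normalized_path"], r["metric_id"], r["score"])  # other array_llm 0.0
```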
Machine-readable full report including all statistics, per-field outcomes with array breakdowns (including the actual matched/missed/spurious item lists), coverage analysis, confusion matrix, and LLM call statistics. Use this for programmatic analysis.
For direct access to evaluation results without reporting:
```python
from extract_bench import StructuredEvaluator, StructuredEvaluatorConfig

evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
result = evaluator.evaluate(schema, gold, predicted)

# Raw results dict: path -> metric_id -> MetricResult
for path, metrics in result["results"].items():
    for metric_id, metric_result in metrics.items():
        print(f"{path} [{metric_id}]: passed={metric_result.passed}, score={metric_result.score}")
```

Use `evaluate_async()` for better performance with LLM-based metrics.
| Category | Metric | Description |
|---|---|---|
| String | `string_exact` | Case-sensitive exact match |
| | `string_case_insensitive` | Case-insensitive match |
| | `string_fuzzy` | Levenshtein similarity |
| | `string_semantic` | LLM-based semantic comparison (default) |
| | `string_url` | URL-normalized match (strips protocol, www, trailing slash) |
| Number | `number_exact` | Exact numeric equality |
| | `number_tolerance` | Match within tolerance (default) |
| | `integer_exact` | Exact integer equality |
| Boolean | `boolean_exact` | Exact boolean equality |
| Array | `array_llm` | LLM-based array comparison |
| General | `string_llm` | LLM judge for any comparison |
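For intuition, the Levenshtein-based similarity behind `string_fuzzy` can be sketched as below; this is a textbook edit-distance implementation, and the package's exact normalization may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Distance normalized to [0, 1]; case-insensitive, as in the preset's default."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(similarity("kitten", "sitting"), 3))  # 0.571
```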
Automatic inference:
- Fields with `"format": "uri"` and `"evaluation_config": "string_exact"` are automatically evaluated with `string_url` instead, so `http://example.com` and `example.com` are treated as equal.
- `anyOf` nodes without an explicit `evaluation_config` inherit the config from their children when all non-null branches agree on the same metric. For example, `anyOf: [{type: array, evaluation_config: "array_llm"}, {type: object, ...}]` will automatically use `array_llm` on the `anyOf` node.
Specify `evaluation_config` in schema fields to control which metric is used:

```python
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "evaluation_config": {
                "metrics": [{"metric_id": "number_tolerance", "params": {"tolerance": 0.01}}]
            }
        },
        "description": {
            "type": "string",
            "evaluation_config": "string_fuzzy"  # Use preset shorthand
        }
    }
}
```

| Preset | Description |
|---|---|
| `string_exact` | Case-sensitive exact match |
| `string_fuzzy` | Levenshtein similarity (case-insensitive by default) |
| `string_case_insensitive` | Case-insensitive match |
| `string_semantic` | LLM-based semantic similarity (default for strings) |
| `string_url` | URL-normalized match (auto-applied for `format: "uri"` fields) |
| `number_exact` | Exact numeric equality |
| `number_tolerance` | Match within tolerance (default for numbers) |
| `integer_exact` | Exact integer equality (default for integers) |
| `boolean_exact` | Exact boolean equality (default for booleans) |
| `array_llm` | LLM evaluation of entire array (default for arrays) |
| `skip` | Skip evaluation for this node |
```python
from extract_bench import global_metric_registry
from extract_bench.evaluation.metrics import BaseMetric, MetricResult


class MyCustomMetric(BaseMetric):
    metric_id = "my_custom"

    async def evaluate(self, node, config=None):
        gold = node.get_gold_value()
        extracted = node.get_extracted_value()
        return MetricResult(
            metric_id=self.metric_id,
            score=1.0,
            passed=True,
            details={"custom": "data"},
        )


global_metric_registry.register_metric(MyCustomMetric)
```

LLM-based metrics use LiteLLM. Configure your provider:
```bash
# Vertex AI (Google Cloud)
gcloud auth application-default login

# OpenAI
export OPENAI_API_KEY=sk-...

# Or copy .env.example to .env
```

Default model: `vertex_ai/gemini-2.5-flash` (or set `DEFAULT_LLM_MODEL` in `.env`).
Override per-field in schema:

```python
schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "string_semantic", "params": {"model": "openai/gpt-4o-mini"}}]
            },
        }
    },
}
```

```
extract-bench/
├── dataset/                    # Benchmark dataset (see dataset/README.md)
│   ├── {domain}/{schema}/      # e.g. finance/10k/, academic/research/
│   │   ├── *-schema.json       # JSON Schema with evaluation_config per field
│   │   └── pdf+gold/           # Source PDFs + human-validated gold JSONs
├── src/extract_bench/          # Evaluation suite
│   ├── infra/                  # Schema AST (nodes, visitors)
│   └── evaluation/
│       ├── metrics/            # Metric implementations
│       └── reporting/          # Report generation (see reporting/README.md)
```
Schema → AST → Values instantiated → Metrics evaluated async in parallel → Report generated.
If you use ExtractBench in your research, please cite:
```bibtex
@article{ferguson2026extractbench,
  title={ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction},
  author={Ferguson, Nick and Pennington, Josh and Beghian, Narek and Mohan, Aravind and Kiela, Douwe and Agrawal, Sheshansh and Nguyen, Thien Hang},
  journal={arXiv preprint arXiv:2602.12247},
  year={2026}
}
```

To run the test suite:

```bash
pip install -e ".[dev]"
pytest tests/ -v
```