A benchmark for structured extraction from PDF documents, comprising:
- Dataset -- 35 human-validated PDF-to-JSON extraction tasks across 5 schemas and 4 domains (finance, academia, hiring, sports).
- Evaluation Suite -- A Python package that scores predicted JSON against gold JSON with per-field metrics (exact match, fuzzy, semantic/LLM-based, numeric tolerance, and more).
```bash
pip install -e .

# With dev dependencies
pip install -e ".[dev]"
```

```python
import json
from pathlib import Path

from extract_bench import ReportBuilder, ReportConfig

# Load your data
schema = json.load(open("schema.json"))
gold = json.load(open("gold.json"))
extracted = json.load(open("model_output.json"))

# Configure and build report
config = ReportConfig(
    output_dir=Path("./eval_results"),
    output_name="nvidia-10k-extract-gemini-flash",  # Identifies this experiment
)
builder = ReportBuilder(config)
report = builder.build(schema, gold, extracted)

# Save all outputs
output_path = builder.save(report)
print(f"Results saved to: {output_path}")
```

This creates `eval_results/nvidia-10k-extract-gemini-flash/` containing:
| File | Purpose |
|---|---|
| `report.json` | Machine-readable full report (for programmatic analysis) |
| `summary.txt` | Human-readable one-page summary (for quick inspection) |
| `fields.csv` | Per-field outcomes (for CSV analysis) |
| `fields.md` | Markdown table (for documentation/sharing) |
Key metrics on the report object:
```python
print(f"Overall score: {report.overall_score:.3f}")  # Item-weighted (arrays count more)
print(f"Field score: {report.field_score:.3f}")      # Flat average (each field equal)
print(f"Pass rate: {report.overall_pass_rate:.1%}")
print(f"Evaluated: {report.outcomes.total_evaluated} fields")
```

All `ReportConfig` options:

```python
config = ReportConfig(
    output_dir=Path("./outputs"),   # Where to save reports
    output_name="my-experiment",    # Subdirectory name (auto-generated if None)
    max_reasoning_length=200,       # Truncate LLM reasoning in outputs
    top_n_lowest_fields=5,          # Track the N lowest-scoring fields
    save_json=True,                 # Generate report.json
    save_text=True,                 # Generate summary.txt
    save_csv=True,                  # Generate fields.csv
    save_markdown=True,             # Generate fields.md
)
```

For running many experiments:
```python
import asyncio
import json
from pathlib import Path

from extract_bench import ReportBuilder, ReportConfig


async def evaluate_model_outputs(
    schema_path: Path,
    gold_path: Path,
    outputs_dir: Path,
    results_dir: Path,
):
    """Evaluate all model outputs in a directory."""
    schema = json.load(schema_path.open())
    gold = json.load(gold_path.open())

    results = []
    for output_file in outputs_dir.glob("*.json"):
        extracted = json.load(output_file.open())
        config = ReportConfig(
            output_dir=results_dir,
            output_name=output_file.stem,  # Use filename as experiment ID
        )
        builder = ReportBuilder(config)
        report = await builder.build_async(schema, gold, extracted)
        builder.save(report)
        results.append({
            "model": output_file.stem,
            "pass_rate": report.overall_pass_rate,
            "score": report.overall_score,
        })
    return results


# Run batch evaluation
results = asyncio.run(evaluate_model_outputs(
    schema_path=Path("schema.json"),
    gold_path=Path("gold.json"),
    outputs_dir=Path("./model_outputs"),
    results_dir=Path("./eval_results"),
))

# Print comparison
for r in sorted(results, key=lambda x: -x["score"]):
    print(f"{r['model']}: {r['pass_rate']:.1%} pass, {r['score']:.3f} avg score")
```

The report provides two complementary scores:
| Score | What it measures | When to use |
|---|---|---|
| Overall Score (item-weighted) | Each array item counts individually — an array of 10 items has 10x the weight of a scalar field | Primary score. Reflects how much of the data was extracted correctly |
| Field Score (flat average) | Each schema field counts equally regardless of type | Secondary. Useful when you care about field-level completeness |
Both scores range from 0.0 to 1.0. When a schema has no arrays, the two scores are identical.
Example: A schema with name (string), age (integer), and items (array of 10 objects). If name and age are correct but only 8/10 items match:
- Field score: (1.0 + 1.0 + 0.8) / 3 = 0.933
- Overall score: (1×1.0 + 1×1.0 + 10×0.8) / 12 = 0.833 ← reflects the 2 missed items
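The two aggregation schemes can be sketched in a few lines (illustrative only; the report computes these internally):

```python
# Per-field scores and weights from the example above: arrays are weighted
# by their item count, scalars count as 1.
field_scores = {"name": 1.0, "age": 1.0, "items": 0.8}
item_counts = {"name": 1, "age": 1, "items": 10}

# Field score: flat average over fields.
field_score = sum(field_scores.values()) / len(field_scores)

# Overall score: weighted by item count.
overall_score = (
    sum(field_scores[f] * item_counts[f] for f in field_scores)
    / sum(item_counts.values())
)

print(round(field_score, 3))    # 0.933
print(round(overall_score, 3))  # 0.833
```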
A one-page overview designed for quick inspection:
```
OVERALL RESULTS
----------------------------------------
Overall Score:  0.909 (item-weighted)
Field Score:    0.897 (flat average)
Pass Rate:      84.6%
Evaluated:      13 fields (11 passed, 2 failed)

ARRAY BREAKDOWN
----------------------------------------
languages [PASS] score=1.000
  Items: 3 matched, 0 missed, 0 spurious
  P=1.000 R=1.000 F1=1.000
workExperience [FAIL] score=0.667
  Items: 2 matched, 1 missed, 1 spurious
  P=0.667 R=0.667 F1=0.667

FAILED FIELDS (first 10)
----------------------------------------
other
  Metric: array_llm, Score: 0.000
  Reason: gold_empty_array
workExperience
  Metric: array_llm, Score: 0.667
```
The Array Breakdown section shows per-array item-level matching. For each array:
- matched — gold items correctly found in extracted
- missed — gold items not found in extracted
- spurious — extracted items not present in gold
- P/R/F1 — precision, recall, and F1 computed from these counts
The array score itself is recall-based: matched / (matched + missed).
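These statistics can be reproduced by hand from the counts in the summary above (an illustrative sketch, not the library's code):

```python
# Counts for workExperience from the sample summary: 2 matched, 1 missed,
# 1 spurious.
matched, missed, spurious = 2, 1, 1

precision = matched / (matched + spurious)
recall = matched / (matched + missed)
f1 = 2 * precision * recall / (precision + recall)
array_score = recall  # the array score is recall-based

print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} score={array_score:.3f}")
# P=0.667 R=0.667 F1=0.667 score=0.667
```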
Per-field outcomes with array breakdown columns:
| Column | Description |
|---|---|
| `path` | Full JSONPath to the field |
| `normalized_path` | Human-readable path (e.g., `borrower.name`) |
| `metric_id` | Metric used for evaluation |
| `score` | Numeric score (0.0-1.0) |
| `passed` | Boolean pass/fail |
| `gold_value` | Expected value |
| `extracted_value` | Model's output value |
| `reasoning` | LLM reasoning (for semantic metrics) |
| `matched` | Array: number of matched items |
| `missed_gold` | Array: gold items missing from extracted |
| `spurious_pred` | Array: extracted items not in gold |
| `precision` | Array: matched / (matched + spurious) |
| `recall` | Array: matched / (matched + missed) |
| `f1` | Array: harmonic mean of P and R |
The array columns are empty for non-array fields.
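As a sketch, `fields.csv` can be post-processed with the stdlib `csv` module; the sample rows below are made up for illustration, not real output:

```python
import csv
import io

# Stand-in for open("eval_results/<experiment>/fields.csv"); the rows are
# hypothetical but use the column names described above.
sample = io.StringIO(
    "path,normalized_path,metric_id,score,passed\n"
    "$.name,name,string_semantic,1.0,True\n"
    "$.other,other,array_llm,0.0,False\n"
)
rows = list(csv.DictReader(sample))

# List failed fields with their metric and score.
failed = [r for r in rows if r["passed"] != "True"]
for r in failed:
    print(r["normalized_path"], r["metric_id"], r["score"])  # other array_llm 0.0
```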
Machine-readable full report including all statistics, per-field outcomes with array breakdowns (including the actual matched/missed/spurious item lists), coverage analysis, confusion matrix, and LLM call statistics. Use this for programmatic analysis.
For direct access to evaluation results without reporting:
```python
from extract_bench import StructuredEvaluator, StructuredEvaluatorConfig

evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
result = evaluator.evaluate(schema, gold, predicted)

# Raw results dict: path -> metric_id -> MetricResult
for path, metrics in result["results"].items():
    for metric_id, metric_result in metrics.items():
        print(f"{path} [{metric_id}]: passed={metric_result.passed}, score={metric_result.score}")
```

Use `evaluate_async()` for better performance with LLM-based metrics.
| Category | Metric | Description |
|---|---|---|
| String | `string_exact` | Case-sensitive exact match |
| | `string_case_insensitive` | Case-insensitive match |
| | `string_fuzzy` | Levenshtein similarity |
| | `string_semantic` | LLM-based semantic comparison (default) |
| | `string_url` | URL-normalized match (strips protocol, www, trailing slash) |
| Number | `number_exact` | Exact numeric equality |
| | `number_tolerance` | Match within tolerance (default) |
| | `integer_exact` | Exact integer equality |
| Boolean | `boolean_exact` | Exact boolean equality |
| Array | `array_llm` | LLM-based array comparison |
| General | `string_llm` | LLM judge for any comparison |
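For intuition, the Levenshtein-based similarity behind `string_fuzzy` can be sketched as below; this is a textbook edit-distance implementation, and the package's exact normalization may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,            # deletion
                            curr[j - 1] + 1,        # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def similarity(a: str, b: str) -> float:
    """Distance normalized to [0, 1]; case-insensitive, as in the preset's default."""
    a, b = a.lower(), b.lower()
    if not a and not b:
        return 1.0
    return 1 - levenshtein(a, b) / max(len(a), len(b))

print(round(similarity("kitten", "sitting"), 3))  # 0.571
```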
Automatic inference:
- Fields with `"format": "uri"` and `"evaluation_config": "string_exact"` are automatically evaluated with `string_url` instead, so `http://example.com` and `example.com` are treated as equal.
- `anyOf` nodes without an explicit `evaluation_config` inherit the config from their children when all non-null branches agree on the same metric. For example, `anyOf: [{type: array, evaluation_config: "array_llm"}, {type: object, ...}]` will automatically use `array_llm` on the `anyOf` node.
Specify `evaluation_config` in schema fields to control which metric is used:

```python
schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "evaluation_config": {
                "metrics": [{"metric_id": "number_tolerance", "params": {"tolerance": 0.01}}]
            }
        },
        "description": {
            "type": "string",
            "evaluation_config": "string_fuzzy"  # Use preset shorthand
        }
    }
}
```

| Preset | Description |
|---|---|
| `string_exact` | Case-sensitive exact match |
| `string_fuzzy` | Levenshtein similarity (case-insensitive by default) |
| `string_case_insensitive` | Case-insensitive match |
| `string_semantic` | LLM-based semantic similarity (default for strings) |
| `string_url` | URL-normalized match (auto-applied for `format: "uri"` fields) |
| `number_exact` | Exact numeric equality |
| `number_tolerance` | Match within tolerance (default for numbers) |
| `integer_exact` | Exact integer equality (default for integers) |
| `boolean_exact` | Exact boolean equality (default for booleans) |
| `array_llm` | LLM evaluation of entire array (default for arrays) |
| `skip` | Skip evaluation for this node |
```python
from extract_bench import global_metric_registry
from extract_bench.evaluation.metrics import BaseMetric, MetricResult


class MyCustomMetric(BaseMetric):
    metric_id = "my_custom"

    async def evaluate(self, node, config=None):
        gold = node.get_gold_value()
        extracted = node.get_extracted_value()
        return MetricResult(
            metric_id=self.metric_id,
            score=1.0,
            passed=True,
            details={"custom": "data"},
        )


global_metric_registry.register_metric(MyCustomMetric)
```

LLM-based metrics use LiteLLM. Configure your provider:
```bash
# Vertex AI (Google Cloud)
gcloud auth application-default login

# OpenAI
export OPENAI_API_KEY=sk-...

# Or copy .env.example to .env
```

Default model: `vertex_ai/gemini-2.5-flash` (or set `DEFAULT_LLM_MODEL` in `.env`).
Override per-field in schema:

```python
schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "string_semantic", "params": {"model": "openai/gpt-4o-mini"}}]
            },
        }
    },
}
```

```
extract-bench/
├── dataset/                    # Benchmark dataset (see dataset/README.md)
│   ├── {domain}/{schema}/      # e.g. finance/10k/, academic/research/
│   │   ├── *-schema.json       # JSON Schema with evaluation_config per field
│   │   └── pdf+gold/           # Source PDFs + human-validated gold JSONs
├── src/extract_bench/          # Evaluation suite
│   ├── infra/                  # Schema AST (nodes, visitors)
│   └── evaluation/
│       ├── metrics/            # Metric implementations
│       └── reporting/          # Report generation (see reporting/README.md)
```
Schema → AST → Values instantiated → Metrics evaluated async in parallel → Report generated.
If you use ExtractBench in your research, please cite:
```bibtex
@article{ferguson2026extractbench,
  title={ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction},
  author={Ferguson, Nick and Pennington, Josh and Beghian, Narek and Mohan, Aravind and Kiela, Douwe and Agrawal, Sheshansh and Nguyen, Thien Hang},
  journal={arXiv preprint arXiv:2602.12247},
  year={2026}
}
```

To run the test suite:

```bash
pip install -e ".[dev]"
pytest tests/ -v
```