Extract Bench


A benchmark for structured extraction from PDF documents, comprising:

  1. Dataset -- 35 human-validated PDF-to-JSON extraction tasks across 5 schemas and 4 domains (finance, academia, hiring, sports).
  2. Evaluation Suite -- A Python package that scores predicted JSON against gold JSON with per-field metrics (exact match, fuzzy, semantic/LLM-based, numeric tolerance, and more).

Table of Contents

  • Getting Started
  • Evaluation API
  • Report Output Format
  • Metrics
  • Configuration
  • Architecture
  • Citation
  • Development

Getting Started

Installation

pip install -e .

# With dev dependencies
pip install -e ".[dev]"

Quick Start

import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

# Load your data
schema = json.load(open("schema.json"))
gold = json.load(open("gold.json"))
extracted = json.load(open("model_output.json"))

# Configure and build report
config = ReportConfig(
    output_dir=Path("./eval_results"),
    output_name="nvidia-10k-extract-gemini-flash",  # Identifies this experiment
)
builder = ReportBuilder(config)
report = builder.build(schema, gold, extracted)

# Save all outputs
output_path = builder.save(report)
print(f"Results saved to: {output_path}")

This creates eval_results/nvidia-10k-extract-gemini-flash/ containing:

  File          Purpose
  report.json   Machine-readable full report (for programmatic analysis)
  summary.txt   Human-readable one-page summary (for quick inspection)
  fields.csv    Per-field outcomes (for CSV analysis)
  fields.md     Markdown table (for documentation/sharing)

Key metrics on the report object:

print(f"Overall score:  {report.overall_score:.3f}")   # Item-weighted (arrays count more)
print(f"Field score:    {report.field_score:.3f}")      # Flat average (each field equal)
print(f"Pass rate:      {report.overall_pass_rate:.1%}")
print(f"Evaluated:      {report.outcomes.total_evaluated} fields")

Evaluation API

ReportConfig Options

config = ReportConfig(
    output_dir=Path("./outputs"),      # Where to save reports
    output_name="my-experiment",       # Subdirectory name (auto-generated if None)
    max_reasoning_length=200,          # Truncate LLM reasoning in outputs
    top_n_lowest_fields=5,             # Track N lowest-scoring fields
    save_json=True,                    # Generate report.json
    save_text=True,                    # Generate summary.txt
    save_csv=True,                     # Generate fields.csv
    save_markdown=True,                # Generate fields.md
)

Batch Evaluation

For running many experiments:

import asyncio
import json
from pathlib import Path
from extract_bench import ReportBuilder, ReportConfig

async def evaluate_model_outputs(
    schema_path: Path,
    gold_path: Path,
    outputs_dir: Path,
    results_dir: Path,
):
    """Evaluate all model outputs in a directory."""
    schema = json.load(schema_path.open())
    gold = json.load(gold_path.open())

    results = []
    for output_file in outputs_dir.glob("*.json"):
        extracted = json.load(output_file.open())

        config = ReportConfig(
            output_dir=results_dir,
            output_name=output_file.stem,  # Use filename as experiment ID
        )
        builder = ReportBuilder(config)
        report = await builder.build_async(schema, gold, extracted)
        builder.save(report)

        results.append({
            "model": output_file.stem,
            "pass_rate": report.overall_pass_rate,
            "score": report.overall_score,
        })

    return results

# Run batch evaluation
results = asyncio.run(evaluate_model_outputs(
    schema_path=Path("schema.json"),
    gold_path=Path("gold.json"),
    outputs_dir=Path("./model_outputs"),
    results_dir=Path("./eval_results"),
))

# Print comparison
for r in sorted(results, key=lambda x: -x["score"]):
    print(f"{r['model']}: {r['pass_rate']:.1%} pass, {r['score']:.3f} avg score")

Report Output Format

Understanding the Scores

The report provides two complementary scores:

  • Overall Score (item-weighted): each array item counts individually, so an array of 10 items has 10x the weight of a scalar field. This is the primary score; it reflects how much of the data was extracted correctly.
  • Field Score (flat average): each schema field counts equally regardless of type. Secondary; useful when you care about field-level completeness.

Both scores range from 0.0 to 1.0. When a schema has no arrays, the two scores are identical.

Example: A schema with name (string), age (integer), and items (array of 10 objects). If name and age are correct but only 8/10 items match:

  • Field score: (1.0 + 1.0 + 0.8) / 3 = 0.933
  • Overall score: (1×1.0 + 1×1.0 + 10×0.8) / 12 = 0.833 ← reflects the 2 missed items
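
To make the weighting concrete, here is a minimal sketch of the two formulas applied to the example above (the per-field tuples are illustrative, not the library's internal representation):

# Hypothetical per-field results for the example: (field, score, item weight).
# Scalars weigh 1 item; the array's weight is its number of gold items.
fields = [("name", 1.0, 1), ("age", 1.0, 1), ("items", 0.8, 10)]

# Field score: flat average over fields.
field_score = sum(score for _, score, _ in fields) / len(fields)

# Overall score: item-weighted average.
total_items = sum(weight for _, _, weight in fields)
overall_score = sum(score * weight for _, score, weight in fields) / total_items

print(f"Field score:   {field_score:.3f}")    # 0.933
print(f"Overall score: {overall_score:.3f}")  # 0.833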

summary.txt

A one-page overview designed for quick inspection:

OVERALL RESULTS
----------------------------------------
  Overall Score:  0.909  (item-weighted)
  Field Score:    0.897  (flat average)
  Pass Rate: 84.6%
  Evaluated: 13 fields (11 passed, 2 failed)

ARRAY BREAKDOWN
----------------------------------------
  languages [PASS] score=1.000
    Items: 3 matched, 0 missed, 0 spurious
    P=1.000  R=1.000  F1=1.000
  workExperience [FAIL] score=0.667
    Items: 2 matched, 1 missed, 1 spurious
    P=0.667  R=0.667  F1=0.667

FAILED FIELDS (first 10)
----------------------------------------
  other
    Metric: array_llm, Score: 0.000
    Reason: gold_empty_array
  workExperience
    Metric: array_llm, Score: 0.667

The Array Breakdown section shows per-array item-level matching. For each array:

  • matched — gold items correctly found in extracted
  • missed — gold items not found in extracted
  • spurious — extracted items not present in gold
  • P/R/F1 — precision, recall, and F1 computed from these counts

The array score itself is recall-based: matched / (matched + missed).
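
As a sanity check, the counts shown above map to these statistics as follows (a standalone sketch; array_stats is not part of the package):

def array_stats(matched: int, missed: int, spurious: int) -> dict:
    precision = matched / (matched + spurious) if matched + spurious else 0.0
    recall = matched / (matched + missed) if matched + missed else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # The array score is recall-based: matched / (matched + missed).
    return {"precision": precision, "recall": recall, "f1": f1, "score": recall}

print(array_stats(matched=2, missed=1, spurious=1))
# precision = recall = f1 = score = 2/3, matching the workExperience entry above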

fields.csv / fields.md

Per-field outcomes with array breakdown columns:

  Column            Description
  path              Full JSONPath to the field
  normalized_path   Human-readable path (e.g., borrower.name)
  metric_id         Metric used for evaluation
  score             Numeric score (0.0-1.0)
  passed            Boolean pass/fail
  gold_value        Expected value
  extracted_value   Model's output value
  reasoning         LLM reasoning (for semantic metrics)
  matched           Array: number of matched items
  missed_gold       Array: gold items missing from extracted
  spurious_pred     Array: extracted items not in gold
  precision         Array: matched / (matched + spurious)
  recall            Array: matched / (matched + missed)
  f1                Array: harmonic mean of P and R

The array columns are empty for non-array fields.
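
To work with these outcomes programmatically, the CSV can be read with the standard library. Note that the exact serialization of booleans and numbers in fields.csv is an assumption here; the path below is the Quick Start example:

import csv
from pathlib import Path

csv_path = Path("eval_results/nvidia-10k-extract-gemini-flash/fields.csv")
with csv_path.open(newline="") as f:
    rows = list(csv.DictReader(f))

# List failed fields, lowest score first (assumes passed is written as "True"/"False").
failed = [r for r in rows if r["passed"] != "True"]
for row in sorted(failed, key=lambda r: float(r["score"])):
    print(f"{row['normalized_path']}: {row['metric_id']} score={row['score']}")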

report.json

Machine-readable full report including all statistics, per-field outcomes with array breakdowns (including the actual matched/missed/spurious item lists), coverage analysis, confusion matrix, and LLM call statistics. Use this for programmatic analysis.
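
Since the exact key layout may change between versions, a reasonable starting point is simply to load the file and explore its top-level keys (the path below is the Quick Start example):

import json
from pathlib import Path

report_path = Path("eval_results/nvidia-10k-extract-gemini-flash/report.json")
data = json.loads(report_path.read_text())
print(list(data.keys()))  # then drill into the sections you need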

Low-Level API

For direct access to evaluation results without reporting:

from extract_bench import StructuredEvaluator, StructuredEvaluatorConfig

evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
result = evaluator.evaluate(schema, gold, predicted)

# Raw results dict: path -> metric_id -> MetricResult
for path, metrics in result["results"].items():
    for metric_id, metric_result in metrics.items():
        print(f"{path} [{metric_id}]: passed={metric_result.passed}, score={metric_result.score}")

Use evaluate_async() for better performance with LLM-based metrics.
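
A minimal sketch of the async variant, assuming evaluate_async() takes the same (schema, gold, predicted) arguments as evaluate() and reusing the variables from the snippet above:

import asyncio

async def main():
    evaluator = StructuredEvaluator(StructuredEvaluatorConfig(metrics=[]))
    # Assumption: evaluate_async mirrors evaluate()'s signature.
    return await evaluator.evaluate_async(schema, gold, predicted)

result = asyncio.run(main())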

Metrics

Available Metrics

  Category   Metric                    Description
  String     string_exact              Case-sensitive exact match
  String     string_case_insensitive   Case-insensitive match
  String     string_fuzzy              Levenshtein similarity
  String     string_semantic           LLM-based semantic comparison (default)
  String     string_url                URL-normalized match (strips protocol, www, trailing slash)
  Number     number_exact              Exact numeric equality
  Number     number_tolerance          Match within tolerance (default)
  Number     integer_exact             Exact integer equality
  Boolean    boolean_exact             Exact boolean equality
  Array      array_llm                 LLM-based array comparison
  General    string_llm                LLM judge for any comparison

Automatic inference:

  • Fields with "format": "uri" and "evaluation_config": "string_exact" are automatically evaluated with string_url instead, so http://example.com and example.com are treated as equal.
  • anyOf nodes without an explicit evaluation_config inherit the config from their children when all non-null branches agree on the same metric. For example, anyOf: [{type: array, evaluation_config: "array_llm"}, {type: object, ...}] will automatically use array_llm on the anyOf node, as sketched below.
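
A concrete sketch (the publications field name is hypothetical): the anyOf node below carries no evaluation_config of its own, and its only non-null branch specifies array_llm, so the anyOf node inherits array_llm.

schema = {
    "type": "object",
    "properties": {
        "publications": {  # hypothetical optional-array field
            "anyOf": [
                {
                    "type": "array",
                    "items": {"type": "object"},
                    "evaluation_config": "array_llm",
                },
                {"type": "null"},
            ]
        }
    },
}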

Evaluation Presets

Specify evaluation_config in schema fields to control which metric is used:

schema = {
    "type": "object",
    "properties": {
        "price": {
            "type": "number",
            "evaluation_config": {
                "metrics": [{"metric_id": "number_tolerance", "params": {"tolerance": 0.01}}]
            }
        },
        "description": {
            "type": "string",
            "evaluation_config": "string_fuzzy"  # Use preset shorthand
        }
    }
}

  Preset                    Description
  string_exact              Case-sensitive exact match
  string_fuzzy              Levenshtein similarity (case-insensitive by default)
  string_case_insensitive   Case-insensitive match
  string_semantic           LLM-based semantic similarity (default for strings)
  string_url                URL-normalized match (auto-applied for format: "uri" fields)
  number_exact              Exact numeric equality
  number_tolerance          Match within tolerance (default for numbers)
  integer_exact             Exact integer equality (default for integers)
  boolean_exact             Exact boolean equality (default for booleans)
  array_llm                 LLM evaluation of entire array (default for arrays)
  skip                      Skip evaluation for this node

Custom Metrics

from extract_bench import global_metric_registry
from extract_bench.evaluation.metrics import BaseMetric, MetricResult

class MyCustomMetric(BaseMetric):
    metric_id = "my_custom"

    async def evaluate(self, node, config=None):
        # Pull the gold and extracted values for this schema node.
        gold = node.get_gold_value()
        extracted = node.get_extracted_value()
        # Compare gold and extracted here; this skeleton always passes with score 1.0.
        return MetricResult(
            metric_id=self.metric_id,
            score=1.0,
            passed=True,
            details={"custom": "data"}
        )

# Register the metric so schemas can reference it by its metric_id.
global_metric_registry.register_metric(MyCustomMetric)
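
Once registered, a schema can reference the metric by its metric_id; this sketch assumes the evaluation_config format shown earlier accepts custom IDs the same way as built-in ones:

schema = {
    "type": "object",
    "properties": {
        "ticker": {  # illustrative field
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "my_custom"}]
            },
        }
    },
}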

Configuration

Environment Setup

LLM-based metrics use LiteLLM. Configure your provider:

# Vertex AI (Google Cloud)
gcloud auth application-default login

# OpenAI
export OPENAI_API_KEY=sk-...

# Or copy .env.example to .env

LLM Model Configuration

Default model: vertex_ai/gemini-2.5-flash (or set DEFAULT_LLM_MODEL in .env).

Override per-field in schema:

schema = {
    "type": "object",
    "properties": {
        "company": {
            "type": "string",
            "evaluation_config": {
                "metrics": [{"metric_id": "string_semantic", "params": {"model": "openai/gpt-4o-mini"}}]
            },
        }
    },
}

Architecture

extract-bench/
├── dataset/                   # Benchmark dataset (see dataset/README.md)
│   ├── {domain}/{schema}/     #   e.g. finance/10k/, academic/research/
│   │   ├── *-schema.json      #   JSON Schema with evaluation_config per field
│   │   └── pdf+gold/          #   Source PDFs + human-validated gold JSONs
├── src/extract_bench/         # Evaluation suite
│   ├── infra/                 #   Schema AST (nodes, visitors)
│   └── evaluation/
│       ├── metrics/           #   Metric implementations
│       └── reporting/         #   Report generation (see reporting/README.md)

Schema → AST → Values instantiated → Metrics evaluated async in parallel → Report generated.

Citation

If you use ExtractBench in your research, please cite:

@article{ferguson2026extractbench,
  title={ExtractBench: A Benchmark and Evaluation Methodology for Complex Structured Extraction},
  author={Ferguson, Nick and Pennington, Josh and Beghian, Narek and Mohan, Aravind and Kiela, Douwe and Agrawal, Sheshansh and Nguyen, Thien Hang},
  journal={arXiv preprint arXiv:2602.12247},
  year={2026}
}

Development

pip install -e ".[dev]"
pytest tests/ -v
