A proof-of-concept pipeline for AI-assisted prosopography of the medieval Levant (Crusades era, 11th–14th centuries). Part of a collaborative research project by Jochen Burgtorf (Cal State Fullerton), Tobias Hodel (University of Bern), and Laura Morreale (Harvard / independent scholar).
Status: proof of concept. The pipeline runs end-to-end; accuracy improves substantially with a Gemini API key and grows further through iterative Human-in-the-Loop review.
Traditional crusades prosopography focuses on elites — kings, bishops, military orders. This project asks: can generative AI + Knowledge Graphs make the full population of the medieval Levant visible at scale, including refugees, artisans, women, unnamed groups, and non-Latin actors?
The pipeline implements a two-layer architecture:
| Layer | What it does |
|---|---|
| Layer 1 — LLM extraction | Reads historical texts (PDF or plain text) and extracts person-like signals: names, titles, epithets, roles, collective groups. Uses Google Gemini (gemini-2.0-flash) with a structured JSON prompt; falls back to heuristic regex NER when no API key is set. |
| Layer 2 — KG linking | Fuzzy-matches extracted mentions against a curated authority file of 126 known crusader persons (sourced from an Omeka database). Returns ranked candidates with confidence scores and flags ambiguous or multi-candidate matches. |
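When no API key is available, Layer 1's heuristic fallback can be pictured roughly as the sketch below. This is illustrative only, not the actual extract_persons_google.py logic; the title list and regex pattern are assumptions.

```python
import re

# Illustrative regex NER fallback: match runs of capitalized tokens,
# optionally preceded by a title, as a stand-in for person mentions.
# The title inventory and pattern are assumptions for this sketch.
TITLES = r"(?:King|Queen|Count|Countess|Bishop|Brother|Emir|Lord)"
PERSON = re.compile(
    rf"\b(?:{TITLES}\s+)?[A-Z][a-z]+(?:\s+(?:of|de|le|ibn)\s+[A-Z][a-z]+)*"
)

def heuristic_persons(text: str) -> list[str]:
    # Return unique person-like spans in order of first appearance.
    seen, out = set(), []
    for m in PERSON.finditer(text):
        span = m.group(0)
        if span not in seen:
            seen.add(span)
            out.append(span)
    return out
```

A pattern like this over-extracts place names and sentence-initial words, which is exactly why the pipeline treats the Gemini path plus human review as the accuracy driver.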
Results are published as a static GitHub Pages site with a Human-in-the-Loop review UI — scholars can accept, reject, or flag individual candidate links and export their decisions as JSON.
outremer/
├── data/raw/ Source texts (.pdf, .txt) to process
├── data/entity_feedback.json Auto-collected problematic entities (Gemini negative memory)
├── bib/ BibTeX output (repo copy)
├── scripts/
│ ├── run_pipeline.py Main pipeline entry point
│ ├── extract_persons_google.py Layer 1: Gemini + fallback extraction
│ └── outremer_index.json Authority file (126 crusader persons)
├── site/ Static site (deployed to GitHub Pages)
│ ├── index.html
│ ├── app.js Explorer + H-i-t-L adjudication UI
│ ├── style.css
│ ├── data/ Generated per-document JSON
│ └── bib/ Generated BibTeX (site copy)
├── .github/workflows/
│ ├── pipeline.yml Runs extraction + linking on push / nightly
│ └── pages.yml Deploys site/ to GitHub Pages
├── requirements.txt
├── requirements.lock.txt
└── README.md
git clone https://github.com/thodel/outremer.git
cd outremer
python3 -m venv .venv
source .venv/bin/activate # Windows: .venv\Scripts\activate
pip install -r requirements.txt
# For fully pinned, reproducible installs:
pip install -r requirements.lock.txt

| Variable | Where | Purpose |
|---|---|---|
| GOOGLE_API_KEY | env var or .env file | Activates Gemini extraction. Without it, the pipeline falls back to heuristic regex NER. |
| MISTRAL_API_KEY | env var or .env file | Activates Mistral OCR for image-only / scanned PDFs. Without it, scanned PDFs yield empty text. |
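The dispatch implied by this table can be sketched as follows; the function names are hypothetical and the real pipeline's internal routing may differ.

```python
import os

def pick_extraction_backend() -> str:
    # Hypothetical dispatch mirroring the documented behavior:
    # GOOGLE_API_KEY enables Gemini extraction, otherwise the
    # pipeline falls back to heuristic regex NER.
    if os.environ.get("GOOGLE_API_KEY"):
        return "gemini"
    return "heuristic-regex"

def ocr_available() -> bool:
    # Mistral OCR for scanned PDFs is only attempted when a key is set.
    return bool(os.environ.get("MISTRAL_API_KEY"))
```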
For GitHub Actions, add both under Settings → Secrets and variables → Actions:
GOOGLE_API_KEY and MISTRAL_API_KEY.
# Activate venv first
source .venv/bin/activate
# With Gemini + Mistral OCR (recommended)
export GOOGLE_API_KEY=your_gemini_key
export MISTRAL_API_KEY=your_mistral_key
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata
# With language hint for multilingual sources
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata --language la # Latin
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata --language ar # Arabic
# Supported: la, fro (Old French), ar, el (Greek), de (Middle High German)
# Without API keys (heuristic fallback, no OCR)
python scripts/run_pipeline.py --input-dir data/raw
# Sync human adjudication into feedback memory (rejects -> blocked_terms, accepts -> allow_terms)
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata \
--entity-feedback-path data/entity_feedback.json \
--review-decisions-path data/decisions.json
# All options
python scripts/run_pipeline.py --help

--entity-feedback-path (default: data/entity_feedback.json) stores noisy/non-person entities filtered from extraction results.
On later runs, frequent offenders from this file are injected into the Gemini prompt as a do-not-extract list.
--review-decisions-path (optional) imports human accept/reject decisions into this feedback store:
rejected names are added to blocked_terms and accepted names are added to allow_terms.
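The sync step amounts to a simple merge, sketched below. The mention/decision field names and the conflict rule (accepts override rejects) are assumptions for illustration, not the pipeline's actual schema.

```python
def sync_feedback(feedback: dict, decisions: list[dict]) -> dict:
    # Merge human adjudications into the feedback store:
    # rejected names -> blocked_terms, accepted names -> allow_terms.
    # The "mention"/"decision" keys are assumed shapes for this sketch.
    blocked = set(feedback.get("blocked_terms", []))
    allowed = set(feedback.get("allow_terms", []))
    for d in decisions:
        name = d["mention"].strip().lower()
        if d["decision"] == "reject":
            blocked.add(name)
        elif d["decision"] == "accept":
            allowed.add(name)
    return {
        **feedback,
        # Assumption: an explicit accept overrides an earlier reject.
        "blocked_terms": sorted(blocked - allowed),
        "allow_terms": sorted(allowed),
    }
```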
Output is written to site/data/*.json, site/bib/*.bib, and bib/*.bib.
GitHub Pages: The site is auto-deployed on every push to main. Visit https://thodel.github.io/outremer/.
Locally:
cd site
python3 -m http.server 8080
# open http://localhost:8080

H-i-t-L workflow:
- Select a document from the dropdown and click Load.
- The Extracted Persons panel lists all detected mentions with confidence scores.
- The Links panel shows candidate matches ranked by fuzzy score (green = high, yellow = medium, red = low).
- For each candidate, click ✅ Accept, ❌ Reject, or 🚩 Flag. Add an optional comment.
- Use the filter bar to focus on unreviewed or flagged items.
- Click Export decisions to download your adjudications as a JSON file.
Decisions are persisted in the browser's localStorage — they survive page refreshes and are scoped per-document.
scripts/outremer_index.json is a gold person authority file containing 126 crusader-era persons. Each entry includes:
- authority_id — unique identifier (e.g. AUTH:CR1)
- preferred_label — canonical name
- variants — alternate spellings and forms
- normalized — pre-computed lowercase/accent-stripped forms for matching
- name — parsed name components (given, toponym, regnal, epithet)
- identifiers.omeka_item_id — link to source Omeka database record
- provenance.source_files — Omeka XML source files
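Put together, an entry might look roughly like this. All values below except the AUTH:CR1 identifier pattern are invented for illustration; consult scripts/outremer_index.json for the real records.

```json
{
  "authority_id": "AUTH:CR1",
  "preferred_label": "Baldwin of Boulogne",
  "variants": ["Baldwin I", "Balduinus de Bolonia"],
  "normalized": ["baldwin of boulogne", "baldwin i", "balduinus de bolonia"],
  "name": {"given": "Baldwin", "toponym": "Boulogne", "regnal": "I", "epithet": null},
  "identifiers": {"omeka_item_id": "123"},
  "provenance": {"source_files": ["omeka_export.xml"]}
}
```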
The linker matches against all variant forms using rapidfuzz token-sort ratio (≥ 60% to appear as a candidate; ≥ 90% = "high confidence").
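In outline, the ranking could look like the sketch below. It uses Python's difflib as a dependency-free stand-in for rapidfuzz's token_sort_ratio, so scores will differ slightly from the real linker's, and the rank_candidates helper is hypothetical.

```python
from difflib import SequenceMatcher

def token_sort_ratio(a: str, b: str) -> float:
    # Sort whitespace-separated tokens so word order does not matter,
    # then compare the normalized strings (rapidfuzz does this natively,
    # with a faster edit-distance-based scorer).
    na = " ".join(sorted(a.lower().split()))
    nb = " ".join(sorted(b.lower().split()))
    return SequenceMatcher(None, na, nb).ratio() * 100

def rank_candidates(mention, authority, cutoff=60.0, high=90.0):
    # Score a mention against every variant of every authority entry;
    # keep the best variant score per person and rank descending.
    scored = []
    for entry in authority:
        best = max(token_sort_ratio(mention, v) for v in entry["variants"])
        if best >= cutoff:
            scored.append({
                "authority_id": entry["authority_id"],
                "score": round(best, 1),
                "confidence": "high" if best >= high else "candidate",
            })
    return sorted(scored, key=lambda c: -c["score"])
```

Token-sort matching is what lets inverted or reordered name forms (chronicle word order vs. authority-file word order) still land above the 60% candidate cutoff.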
| Workflow | Trigger | What it does |
|---|---|---|
| pipeline.yml | push to main, nightly 02:00 UTC, manual | Runs run_pipeline.py --genai-metadata, commits updated site/data/, site/bib/, bib/ back to main |
| pages.yml | push to main, manual | Deploys site/ to GitHub Pages |
People of the Medieval Levant is a collaborative digital humanities project exploring how generative AI and Knowledge Graphs can enable a more inclusive prosopography of the Crusades era — one that goes beyond the traditional elite focus to encompass non-Western actors, women, refugees, artisans, and unnamed collectives.
The project is led by Jochen Burgtorf (medieval history), Tobias Hodel (digital humanities / AI), and Laura Morreale (medieval cultural contact). A book chapter co-authored by Burgtorf and Hodel outlines the theoretical framework; this repository is the technical proof of concept.
The pipeline treats ambiguity as data rather than error. Mismatches between the LLM layer and the KG layer are diagnostic signals — they reveal name collisions, missing entities, or outdated assumptions. Scholarly adjudication through the H-i-t-L interface is where historical interpretation happens.