People of the Medieval Levant — Outremer PoC

A proof-of-concept pipeline for AI-assisted prosopography of the medieval Levant (Crusades era, 11th–14th centuries). Part of a collaborative research project by Jochen Burgtorf (Cal State Fullerton), Tobias Hodel (University of Bern), and Laura Morreale (Harvard / independent scholar).

Status: proof of concept. The pipeline runs end-to-end; accuracy improves substantially with a Gemini API key and grows further through iterative Human-in-the-Loop review.

What this does

Traditional crusades prosopography focuses on elites — kings, bishops, military orders. This project asks: can generative AI + Knowledge Graphs make the full population of the medieval Levant visible at scale, including refugees, artisans, women, unnamed groups, and non-Latin actors?

The pipeline implements a two-layer architecture:

Layer	What it does
Layer 1 — LLM extraction	Reads historical texts (PDF or plain text) and extracts person-like signals: names, titles, epithets, roles, collective groups. Uses Google Gemini (`gemini-2.0-flash`) with a structured JSON prompt; falls back to heuristic regex NER when no API key is set.
Layer 2 — KG linking	Fuzzy-matches extracted mentions against a curated authority file of 126 known crusader persons (sourced from an Omeka database). Returns ranked candidates with confidence scores and flags ambiguous or multi-candidate matches.

Results are published as a static GitHub Pages site with a Human-in-the-Loop review UI — scholars can accept, reject, or flag individual candidate links and export their decisions as JSON.

Repository structure

outremer/
├── data/raw/           Source texts (.pdf, .txt) to process
├── data/entity_feedback.json  Auto-collected problematic entities (Gemini negative memory)
├── bib/                BibTeX output (repo copy)
├── scripts/
│   ├── run_pipeline.py         Main pipeline entry point
│   ├── extract_persons_google.py  Layer 1: Gemini + fallback extraction
│   └── outremer_index.json     Authority file (126 crusader persons)
├── site/               Static site (deployed to GitHub Pages)
│   ├── index.html
│   ├── app.js          Explorer + H-i-t-L adjudication UI
│   ├── style.css
│   ├── data/           Generated per-document JSON
│   └── bib/            Generated BibTeX (site copy)
├── .github/workflows/
│   ├── pipeline.yml    Runs extraction + linking on push / nightly
│   └── pages.yml       Deploys site/ to GitHub Pages
├── requirements.txt
├── requirements.lock.txt
└── README.md

Setup (local)

git clone https://github.com/thodel/outremer.git
cd outremer

python3 -m venv .venv
source .venv/bin/activate        # Windows: .venv\Scripts\activate
pip install -r requirements.txt

# For fully pinned, reproducible installs:
pip install -r requirements.lock.txt

Configuration

Variable	Where	Purpose
`GOOGLE_API_KEY`	env var or `.env` file	Activates Gemini extraction. Without it, the pipeline falls back to heuristic regex NER.
`MISTRAL_API_KEY`	env var or `.env` file	Activates Mistral OCR for image-only / scanned PDFs. Without it, scanned PDFs yield empty text.

For GitHub Actions, add both under Settings → Secrets and variables → Actions: GOOGLE_API_KEY and MISTRAL_API_KEY.

Running the pipeline

# Activate venv first
source .venv/bin/activate

# With Gemini + Mistral OCR (recommended)
export GOOGLE_API_KEY=your_gemini_key
export MISTRAL_API_KEY=your_mistral_key
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata

# With language hint for multilingual sources
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata --language la   # Latin
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata --language ar   # Arabic
# Supported: la, fro (Old French), ar, el (Greek), de (Middle High German)

# Without API keys (heuristic fallback, no OCR)
python scripts/run_pipeline.py --input-dir data/raw

# Sync human adjudication into feedback memory (rejects -> blocked_terms, accepts -> allow_terms)
python scripts/run_pipeline.py --input-dir data/raw --genai-metadata \
  --entity-feedback-path data/entity_feedback.json \
  --review-decisions-path data/decisions.json

# All options
python scripts/run_pipeline.py --help

--entity-feedback-path (default: data/entity_feedback.json) stores noisy/non-person entities filtered from extraction results.
On later runs, frequent offenders from this file are injected into the Gemini prompt as a do-not-extract list. --review-decisions-path (optional) imports human accept/reject decisions into this feedback store: rejected names are added to blocked_terms and accepted names are added to allow_terms.

Output is written to site/data/*.json, site/bib/*.bib, and bib/*.bib.

Reviewing results

GitHub Pages: The site is auto-deployed on every push to main. Visit https://thodel.github.io/outremer/.

Locally:

cd site
python3 -m http.server 8080
# open http://localhost:8080

H-i-t-L workflow:

Select a document from the dropdown and click Load.
The Extracted Persons panel lists all detected mentions with confidence scores.
The Links panel shows candidate matches ranked by fuzzy score (green = high, yellow = medium, red = low).
For each candidate, click ✅ Accept, ❌ Reject, or 🚩 Flag. Add an optional comment.
Use the filter bar to focus on unreviewed or flagged items.
Click Export decisions to download your adjudications as a JSON file.

Decisions are persisted in the browser's localStorage — they survive page refreshes and are scoped per-document.

Authority file

scripts/outremer_index.json is a gold person authority file containing 126 crusader-era persons. Each entry includes:

authority_id — unique identifier (e.g. AUTH:CR1)
preferred_label — canonical name
variants — alternate spellings and forms
normalized — pre-computed lowercase/accent-stripped forms for matching
name — parsed name components (given, toponym, regnal, epithet)
identifiers.omeka_item_id — link to source Omeka database record
provenance.source_files — Omeka XML source files

The linker matches against all variant forms using rapidfuzz token-sort ratio (≥ 60% to appear as a candidate; ≥ 90% = "high confidence").

GitHub Actions

Workflow	Trigger	What it does
`pipeline.yml`	push to `main`, nightly 02:00 UTC, manual	Runs `run_pipeline.py --genai-metadata`, commits updated `site/data/`, `site/bib/`, `bib/` back to `main`
`pages.yml`	push to `main`, manual	Deploys `site/` to GitHub Pages

Project context

People of the Medieval Levant is a collaborative digital humanities project exploring how generative AI and Knowledge Graphs can enable a more inclusive prosopography of the Crusades era — one that goes beyond the traditional elite focus to encompass non-Western actors, women, refugees, artisans, and unnamed collectives.

The project is led by Jochen Burgtorf (medieval history), Tobias Hodel (digital humanities / AI), and Laura Morreale (medieval cultural contact). A book chapter co-authored by Burgtorf and Hodel outlines the theoretical framework; this repository is the technical proof of concept.

The pipeline treats ambiguity as data rather than error. Mismatches between the LLM layer and the KG layer are diagnostic signals — they reveal name collisions, missing entities, or outdated assumptions. Scholarly adjudication through the H-i-t-L interface is where historical interpretation happens.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

People of the Medieval Levant — Outremer PoC

What this does

Repository structure

Setup (local)

Configuration

Running the pipeline

Reviewing results

Authority file

GitHub Actions

Project context

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 142 Commits
.github/workflows		.github/workflows
bib		bib
data		data
docs		docs
scrapers		scrapers
scripts		scripts
site		site
.gitignore		.gitignore
README.md		README.md
requirements.lock.txt		requirements.lock.txt
requirements.txt		requirements.txt

Folders and files

Latest commit

History

Repository files navigation

People of the Medieval Levant — Outremer PoC

What this does

Repository structure

Setup (local)

Configuration

Running the pipeline

Reviewing results

Authority file

GitHub Actions

Project context

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages