poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.
The pipeline uses:
- Llama 3.1 8B (fine-tuned) for JSON structuring
- Qwen2-VL-7B for vision-based OCR of image posters
- pdfalto for layout-aware PDF text extraction
```bash
pip install poster2json
```

```bash
# Extract metadata from a poster
poster2json extract poster.pdf -o result.json

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/
```

```python
from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)
```

Output conforms to the poster-json-schema (DataCite-based):
```json
{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
```

| Requirement | Specification |
|---|---|
| GPU | NVIDIA CUDA-capable, ≥16GB VRAM |
| RAM | ≥32GB recommended |
| Python | 3.10+ |
| OS | Linux, macOS, Windows (via WSL2) |
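
The requirements above can be checked before running the pipeline. The following is a minimal preflight sketch and not part of poster2json: `check_environment` and `gpu_vram_gb` are hypothetical helper names, and PyTorch is assumed here only as a convenient way to query CUDA (the check degrades gracefully if it is not installed). The 16 GB figure is compared against decimal gigabytes, which is an approximation.

```python
import platform
import sys

# Values taken from the requirements table; helper names are illustrative only.
MIN_PYTHON = (3, 10)
MIN_VRAM_GB = 16


def gpu_vram_gb():
    """Return total memory of CUDA device 0 in decimal GB, or None if unknown."""
    try:
        import torch  # assumption: PyTorch is available to query CUDA
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_properties(0).total_memory / 1e9


def check_environment() -> dict:
    """Compare the current machine against the documented requirements."""
    vram = gpu_vram_gb()
    return {
        "python_ok": sys.version_info >= MIN_PYTHON,
        "os": platform.system(),
        "vram_gb": vram,
        "gpu_ok": vram is not None and vram >= MIN_VRAM_GB,
    }


if __name__ == "__main__":
    for key, value in check_environment().items():
        print(f"{key}: {value}")
```

A `gpu_ok` value of `False` only means the GPU could not be confirmed by this script, for example because PyTorch is not installed in the active environment.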

Validated on 10 manually annotated scientific posters:

| Metric | Score | Threshold |
|---|---|---|
| Word Capture | 0.96 | ≥0.75 |
| ROUGE-L | 0.89 | ≥0.75 |
| Number Capture | 0.93 | ≥0.75 |
| Field Proportion | 0.99 | 0.30–2.50 |

Pass Rate: 10/10 (100%)

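
To make the table concrete, here is a rough sketch of how a ROUGE-L score and the pass/fail thresholds could be applied to an extracted record. It is an illustration under assumptions, not the project's evaluation code: ROUGE-L is computed as the standard LCS-based F1 over lowercase word tokens, the helper names (`poster_text`, `rouge_l_f1`, `passes`) are hypothetical, and the exact definitions of Word Capture, Number Capture, and Field Proportion are not given here, so they are only thresholded, not computed.

```python
import re

# Thresholds copied from the table above as (minimum, maximum or None).
THRESHOLDS = {
    "word_capture": (0.75, None),
    "rouge_l": (0.75, None),
    "number_capture": (0.75, None),
    "field_proportion": (0.30, 2.50),
}


def poster_text(record: dict) -> str:
    """Join titles and section text from a poster-json-style record."""
    parts = [t.get("title", "") for t in record.get("titles", [])]
    for section in record.get("posterContent", {}).get("sections", []):
        parts.append(section.get("sectionTitle", ""))
        parts.append(section.get("sectionContent", ""))
    return " ".join(p for p in parts if p)


def tokens(text: str) -> list[str]:
    """Lowercase word tokens (assumed tokenization)."""
    return re.findall(r"\w+", text.lower())


def lcs_length(a: list[str], b: list[str]) -> int:
    """Length of the longest common subsequence, row-by-row dynamic programming."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            curr.append(prev[j - 1] + 1 if x == y else max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F1 between a reference text and an extracted text."""
    ref, cand = tokens(reference), tokens(candidate)
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


def passes(scores: dict) -> bool:
    """True if every metric falls inside its threshold range."""
    for name, (low, high) in THRESHOLDS.items():
        value = scores[name]
        if value < low or (high is not None and value > high):
            return False
    return True
```

For example, `rouge_l_f1(reference_text, poster_text(result))` would score an extracted record against a manually annotated reference, and `passes(...)` applies the same thresholds as the table to a dictionary of metric scores.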
| Document | Description |
|---|---|
| Architecture | Technical details & methodology |
| Evaluation | Validation metrics & results |
```bash
# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate   # On Linux/macOS
.venv\Scripts\activate      # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format
```

If you are on Windows and have multiple Python versions installed, you can list them and create the virtual environment with a specific one:

```bash
py -0p                  # List all installed Python versions
py -3.12 -m venv .venv  # Create the virtual environment with Python 3.12
```

MIT License - see LICENSE for details.
```bibtex
@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}
```

- FAIR Data Innovations Hub
- Meta AI for Llama 3.1
- Alibaba Cloud for Qwen2-VL
- Part of the posters.science platform

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
