poster2json

Convert scientific posters (PDF/images) to structured JSON metadata using Large Language Models.

Documentation · Changelog · Report Bug · Request Feature

Description

poster2json extracts structured metadata from scientific conference posters (PDF or image format) into machine-actionable JSON conforming to the poster-json-schema.

The pipeline uses:

Llama 3.1 8B (fine-tuned) for JSON structuring
Qwen2-VL-7B for vision-based OCR of image posters
pdfalto for layout-aware PDF text extraction

Quick Start

Installation

pip install poster2json

CLI Usage

# Extract metadata from a poster
poster2json extract poster.pdf -o result.json

# Validate extracted JSON
poster2json validate result.json

# Process multiple posters
poster2json batch ./posters/ -o ./output/

Python API

from poster2json import extract_poster, validate_poster

# Extract metadata
result = extract_poster("poster.pdf")
print(result["titles"][0]["title"])

# Validate the result
is_valid = validate_poster(result)

Output Format

Output conforms to the poster-json-schema (DataCite-based):

{
  "$schema": "https://posters.science/schema/v0.1/poster_schema.json",
  "creators": [
    {
      "name": "Garcia, Sofia",
      "givenName": "Sofia",
      "familyName": "Garcia",
      "affiliation": ["University"]
    }
  ],
  "titles": [
    { "title": "Machine Learning Approaches to Diabetic Retinopathy Detection" }
  ],
  "posterContent": {
    "sections": [
      { "sectionTitle": "Abstract", "sectionContent": "..." },
      { "sectionTitle": "Methods", "sectionContent": "..." },
      { "sectionTitle": "Results", "sectionContent": "..." }
    ]
  },
  "imageCaptions": [{ "captions": ["Figure 1.", "ROC curves showing..."] }],
  "tableCaptions": [{ "captions": ["Table 1.", "Performance metrics"] }]
}
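As a rough illustration of consuming this output, the sketch below pulls the title, creator names, and section titles out of a record shaped like the example above. The `summarize` helper is hypothetical and not part of poster2json. It uses only the field names shown in the sample and treats every field as optional, which is an assumption about the schema.

```python
import json

# Sample record mirroring the example above (section contents abbreviated).
record = {
    "creators": [
        {"name": "Garcia, Sofia", "givenName": "Sofia", "familyName": "Garcia"}
    ],
    "titles": [
        {"title": "Machine Learning Approaches to Diabetic Retinopathy Detection"}
    ],
    "posterContent": {
        "sections": [
            {"sectionTitle": "Abstract", "sectionContent": "..."},
            {"sectionTitle": "Methods", "sectionContent": "..."},
            {"sectionTitle": "Results", "sectionContent": "..."},
        ]
    },
}


def summarize(record: dict) -> dict:
    """Hypothetical helper: collect title, creator names, and section titles.

    Missing fields are tolerated (assumption: every field may be absent).
    """
    titles = [t.get("title", "") for t in record.get("titles", [])]
    creators = [c.get("name", "") for c in record.get("creators", [])]
    sections = [
        s.get("sectionTitle", "")
        for s in record.get("posterContent", {}).get("sections", [])
    ]
    return {
        "title": titles[0] if titles else None,
        "creators": creators,
        "sections": sections,
    }


summary = summarize(record)
print(json.dumps(summary, indent=2))
```

The same helper can be applied to the dictionary returned by `extract_poster`, provided the result follows the structure shown above.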

System Requirements

Requirement	Specification
GPU	NVIDIA CUDA-capable, ≥16GB VRAM
RAM	≥32GB recommended
Python	3.10+
OS	Linux, macOS, Windows (via WSL2)

Performance

Validated on 10 manually annotated scientific posters:

Metric	Score	Threshold
Word Capture	0.96	≥0.75
ROUGE-L	0.89	≥0.75
Number Capture	0.93	≥0.75
Field Proportion	0.99	0.30–2.50

Pass Rate: 10/10 (100%)
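To make the table concrete, here is a minimal sketch of how a ROUGE-L score and a threshold check could be computed. It assumes the standard LCS-based ROUGE-L F-measure over lowercase whitespace tokens and assumes that a poster passes only when all four metrics meet the thresholds listed above. The project's own evaluation code may tokenize, weight, or aggregate differently. The function names and dictionary keys are hypothetical.

```python
def lcs_length(a: list, b: list) -> int:
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference: str, candidate: str) -> float:
    """ROUGE-L F-measure on lowercase whitespace tokens (assumed tokenization)."""
    ref = reference.lower().split()
    cand = candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


# Thresholds copied from the table above; the key names are hypothetical.
MIN_THRESHOLDS = {"word_capture": 0.75, "rouge_l": 0.75, "number_capture": 0.75}
FIELD_PROPORTION_RANGE = (0.30, 2.50)


def passes_thresholds(scores: dict) -> bool:
    """Assumed pass rule: every metric must meet its threshold."""
    meets_minimums = all(scores[k] >= v for k, v in MIN_THRESHOLDS.items())
    low, high = FIELD_PROPORTION_RANGE
    return meets_minimums and low <= scores["field_proportion"] <= high


# Scores reported in the table above.
reported = {
    "word_capture": 0.96,
    "rouge_l": 0.89,
    "number_capture": 0.93,
    "field_proportion": 0.99,
}
print(passes_thresholds(reported))
print(rouge_l_f1("deep learning for retinopathy", "deep learning retinopathy"))
```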

Documentation

Document	Description
Architecture	Technical details & methodology
Evaluation	Validation metrics & results

Development Setup

# Clone the repository
git clone https://github.com/fairdataihub/poster2json.git
cd poster2json

# Create a virtual environment
python -m venv .venv

# Activate the virtual environment
source .venv/bin/activate # On Linux/macOS
.venv\Scripts\activate # On Windows

# Install poetry
pip install poetry

# Install dependencies
poetry install

# Run tests
poe test

# Format code
poe format

If you are on Windows and have multiple Python versions installed, you can list them and create the virtual environment with a specific one:

py -0p # list all python versions

py -3.12 -m venv .venv

License

MIT License - see LICENSE for details.

Citation

@software{poster2json2026,
  title = {poster2json: Scientific Poster to JSON Metadata Extraction},
  author = {O'Neill, James and Soundarajan, Sanjay and Portillo, Dorian and Patel, Bhavesh},
  year = {2026},
  url = {https://github.com/fairdataihub/poster2json},
  doi = {10.5281/zenodo.18320010}
}

Acknowledgements

FAIR Data Innovations Hub
Meta AI for Llama 3.1
Alibaba Cloud for Qwen2-VL
Part of the posters.science platform

Contributing

Contributions welcome! Please see CONTRIBUTING.md for guidelines.
