HiPerRAG for Literature-based Data Extraction on Priority Pathogens

This is a project for the 2025 NIAID BRC AI Codeathon.

Project page: https://niaid-brc-codeathons.github.io/projects/hiperrag-literature-extraction/

Project Themes

Automated Knowledge Extraction and Curation

Team

Team Lead(s)

Name: Ozan
Affiliation: Argonne National Laboratory, BV-BRC

Project Summary

This project leverages HiPerRAG—a high-performance retrieval-augmented generation system optimized for large scientific corpora—to extract and curate structured data for priority pathogens. By targeting key relationship types such as protein–protein interactions (PPIs), host–pathogen interactions, and drug–protein binding data, the project aims to produce curated, machine-readable datasets for integration with BV-BRC knowledgebases.

HiPerRAG codebase: https://github.com/ramanathanlab/distllm/tree/main
HiPerRAG paper: https://arxiv.org/abs/2505.04846

Goals and Objectives

Goal 1: Define target data types relevant to CEPI and BV-BRC (e.g., PPIs, drug–protein interactions)
Goal 2: Deploy HiPerRAG on relevant literature corpora to extract structured biological relationships
Goal 3: Generate curated datasets for 1–2 CEPI priority pathogens (e.g., Nipah, Lassa)

Approach

HiPerRAG will be configured to parse biomedical literature and extract relations using fine-tuned retrieval and extraction modules. The system’s hybrid pipeline combines dense retrieval, passage re-ranking, and LLM-based summarization to produce high-confidence knowledge graphs. The team will evaluate both fully automated and human-in-the-loop curation workflows to balance scale and accuracy.
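The retrieve, re-rank, extract flow described above can be illustrated with a minimal, self-contained sketch. It does not use the HiPerRAG or distllm API. Every function name here is hypothetical, bag-of-words cosine similarity stands in for a learned dense encoder, a cue-word overlap score stands in for a re-ranking model, and a regular expression stands in for the LLM-based extraction step. The passages are made-up placeholder sentences, not real findings.

```python
import math
import re
from collections import Counter

# Placeholder passages (invented for illustration; not real literature).
PASSAGES = [
    "ProteinA binds ReceptorX on the host cell surface.",
    "The study site was sampled during the dry season.",
    "ProteinB interacts with ProteinC in infected cells.",
    "ReceptorX expression varies across tissues.",
]

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def embed(text):
    """Toy stand-in for a dense encoder: a term-count vector."""
    return Counter(tokenize(text))

def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query, passages, k=2):
    """Stage 1: keep the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

def rerank(passages, cue_words=frozenset({"binds", "interacts"})):
    """Stage 2: toy re-ranker that promotes passages containing relation cue words."""
    return sorted(passages, key=lambda p: len(set(tokenize(p)) & cue_words), reverse=True)

# Stage 3: a regex stand-in for LLM-based extraction of (subject, predicate, object).
RELATION = re.compile(
    r"(?P<subject>[A-Z][\w-]*) (?P<predicate>binds|interacts with) (?P<object>[A-Za-z][\w-]*)"
)

def extract_relations(passage):
    """Return every (subject, predicate, object) triple matched in the passage."""
    return [(m["subject"], m["predicate"], m["object"]) for m in RELATION.finditer(passage)]

def run_pipeline(query, passages, k=2):
    """Chain the three stages and collect triples from the re-ranked passages."""
    triples = []
    for passage in rerank(retrieve(query, passages, k)):
        triples.extend(extract_relations(passage))
    return triples

if __name__ == "__main__":
    for triple in run_pipeline("Which proteins bind ReceptorX?", PASSAGES):
        print(triple)
```

In a real deployment each stage would be replaced by the corresponding model-backed component. The sketch only shows how the stages hand results to one another and where a human reviewer could inspect the re-ranked passages before extraction.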

Data and Resources Required

Resource Type	Source / Link	Description / Purpose
Data	PubMed, BV-BRC text corpora	Literature sources for entity/relation extraction
Tools / Services	HiPerRAG	RAG-based extraction framework
LLMs / AI Models	Mistral Large, GPT-4 (Rhea)	Entity normalization and summarization
Compute / Storage	Argonne HPC, BRC clusters	Parallel literature processing

Expected Outcomes / Deliverables

Curated datasets of structured biological relationships for CEPI priority pathogens
Machine-readable outputs suitable for integration into BV-BRC pipelines
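As an illustration of what a machine-readable output could look like, the sketch below serializes extracted relationships as JSON Lines. The record layout is an assumption made for this example: the field names, the relation type labels, and the curation status values are hypothetical and do not reflect a confirmed BV-BRC schema. The example values are placeholders.

```python
import json
from dataclasses import asdict, dataclass

@dataclass
class RelationRecord:
    """One extracted relationship (hypothetical layout, not a BV-BRC schema)."""
    subject: str
    predicate: str
    object: str
    relation_type: str                  # e.g. "PPI", "host-pathogen", "drug-protein"
    pathogen: str
    source_id: str                      # placeholder for a literature identifier
    evidence: str                       # sentence or passage supporting the relation
    curation_status: str = "automated"  # or "human-reviewed" after manual curation

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)

def from_jsonl(text):
    """Parse JSON Lines text back into RelationRecord objects."""
    return [RelationRecord(**json.loads(line)) for line in text.splitlines() if line.strip()]

if __name__ == "__main__":
    record = RelationRecord(
        subject="ProteinA",
        predicate="binds",
        object="ReceptorX",
        relation_type="PPI",
        pathogen="example-pathogen",
        source_id="PLACEHOLDER-ID",
        evidence="ProteinA binds ReceptorX on the host cell surface.",
    )
    print(to_jsonl([record]))
```

Keeping the supporting evidence and a curation status on every record would let downstream pipelines separate fully automated extractions from human-reviewed ones.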

Potential Impact and Next Steps

This project aims to demonstrate scalable, AI-driven literature mining for infectious disease research. Automating knowledge enrichment in this way could accelerate understanding of pathogen biology, in support of CEPI’s 100 Days Mission and BV-BRC’s informatics and data curation goals.
