NIAID-BRC-Codeathons/hiperrag-literature-extraction

HiPerRAG for Literature-based Data Extraction on Priority Pathogens

This is a project for the 2025 NIAID BRC AI Codeathon.

Project page: https://niaid-brc-codeathons.github.io/projects/hiperrag-literature-extraction/


Project Themes

  • Automated Knowledge Extraction and Curation

Team

Team Lead(s)

  • Name: Ozan
  • Affiliation: Argonne National Laboratory, BV-BRC

Project Summary

This project leverages HiPerRAG, a high-performance retrieval-augmented generation (RAG) system optimized for large scientific corpora, to extract and curate structured data for priority pathogens. By targeting key relationship types such as protein–protein interactions (PPIs), host–pathogen interactions, and drug–protein binding data, the project aims to produce curated, machine-readable datasets for integration with BV-BRC knowledgebases.


Goals and Objectives

  • Goal 1: Define target data types relevant to CEPI and BV-BRC (e.g., PPIs, drug–protein interactions)
  • Goal 2: Deploy HiPerRAG on relevant literature corpora to extract structured biological relationships
  • Goal 3: Generate curated datasets for 1–2 CEPI priority pathogens (e.g., Nipah, Lassa)

Approach

HiPerRAG will be configured to parse biomedical literature and extract relations using fine-tuned retrieval and extraction modules. The system’s hybrid pipeline combines dense retrieval, passage re-ranking, and LLM-based summarization to produce high-confidence knowledge graphs. The team will evaluate both fully automated and human-in-the-loop curation workflows to balance scale and accuracy.
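The retrieve-then-re-rank pattern described above can be illustrated with a toy sketch. This is not HiPerRAG code: every function name here (`embed`, `retrieve`, `rerank`) is hypothetical, term-count vectors stand in for learned dense embeddings, a cue-word counter stands in for a re-ranking model, and the three passages are illustrative sentences rather than corpus data. The LLM-based step is omitted.

```python
import math
import re
from collections import Counter

def embed(text):
    """Toy stand-in for a dense encoder: lowercase term counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=3):
    """First stage: keep the k passages most similar to the query."""
    q = embed(query)
    ranked = sorted(passages, key=lambda p: cosine(q, embed(p)), reverse=True)
    return ranked[:k]

# Hypothetical cue words that suggest a passage states a relation.
RELATION_CUES = {"binds", "interacts", "inhibits"}

def rerank(passages):
    """Second stage: order retrieved passages by number of relation cues."""
    def cue_score(p):
        return len(RELATION_CUES & set(embed(p)))
    return sorted(passages, key=cue_score, reverse=True)

# Illustrative passages only, not drawn from the project corpus.
PASSAGES = [
    "Nipah virus glycoprotein G binds the host receptor ephrin-B2.",
    "Nipah virus outbreaks were reported in several regions.",
    "Lassa virus is endemic in parts of West Africa.",
]

top = rerank(retrieve("Nipah virus host receptor binding", PASSAGES, k=2))
for passage in top:
    print(passage)
```

In a real deployment the two scoring functions would be replaced by an embedding model and a trained re-ranker; the two-stage structure is the only part this sketch is meant to convey.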


Data and Resources Required

Resource Type     | Source / Link               | Description / Purpose
Data              | PubMed, BV-BRC text corpora | Literature sources for entity/relation extraction
Tools / Services  | HiPerRAG                    | RAG-based extraction framework
LLMs / AI Models  | Mistral Large, GPT-4 (Rhea) | Entity normalization and summarization
Compute / Storage | Argonne HPC, BRC clusters   | Parallel literature processing

Expected Outcomes / Deliverables

  • Curated datasets of structured biological relationships for CEPI priority pathogens
  • Machine-readable outputs suitable for integration into BV-BRC pipelines
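As a rough illustration of what a machine-readable output could look like, the sketch below serializes extracted relations as JSON Lines and splits them by confidence so that low-scoring records can go to a human curator. The project does not specify an output schema: the `RelationRecord` fields, the `triage` helper, the 0.8 threshold, and the placeholder source identifiers are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class RelationRecord:
    """Hypothetical record layout; not a BV-BRC schema."""
    subject: str
    relation: str
    obj: str
    pathogen: str
    source_id: str     # placeholder for a literature identifier
    confidence: float  # extraction confidence in [0, 1]

def to_jsonl(records):
    """Serialize records as JSON Lines, one record per line."""
    return "\n".join(json.dumps(asdict(r), sort_keys=True) for r in records)

def triage(records, threshold=0.8):
    """Split records into auto-accepted and needs-human-review lists."""
    accepted = [r for r in records if r.confidence >= threshold]
    review = [r for r in records if r.confidence < threshold]
    return accepted, review

# Illustrative records with made-up confidence values.
high = RelationRecord(
    subject="Nipah virus glycoprotein G",
    relation="host_pathogen_interaction",
    obj="ephrin-B2",
    pathogen="Nipah virus",
    source_id="PMID:PLACEHOLDER",
    confidence=0.9,
)
low = RelationRecord(
    subject="protein A (placeholder)",
    relation="protein_protein_interaction",
    obj="protein B (placeholder)",
    pathogen="Lassa virus",
    source_id="PMID:PLACEHOLDER",
    confidence=0.4,
)

accepted, review = triage([high, low])
print(to_jsonl(accepted))
```

Keeping one JSON object per line lets downstream pipelines stream and filter records without loading a whole file; the threshold would need tuning against curated examples.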

Potential Impact and Next Steps

This project aims to demonstrate scalable, AI-driven literature mining for infectious disease research. Automated knowledge enrichment of this kind can accelerate understanding of pathogen biology, supporting CEPI’s 100-day mission and BV-BRC’s informatics and data curation goals.
