# GSoC Project Ideas 2025
Join us to contribute to Extralit's open-source development this summer! Here are the projects available for GSoC contributors:
- Project #1: Data Integration Pipeline for Collaborative Data Extraction
- Project #2: Enhanced AI OCR Extraction Pipeline
- Project #3: Interactive Schema Editor UI
Our potential mentors are Jonny Tran, Ph.D. and Dianne Ting.
- GitHub: https://github.com/extralit/extralit
- Docs: https://docs.extralit.ai
- Demo: https://extralit-public-demo.hf.space
- License (Apache 2.0): https://github.com/extralit/extralit/blob/main/LICENSE
- Code of Conduct: https://github.com/extralit/extralit/blob/develop/CODE_OF_CONDUCT.md
## Project #1: Data Integration Pipeline for Collaborative Data Extraction

Build a unified workflow orchestrator that manages the entire extraction process, from PDF preprocessing through RAG-enabled, LLM-powered predictions to final structured dataset creation. You'll create both the backend workflow engine and a web endpoint that lets researchers monitor extraction progress, kick off automated LLM extractions, track annotation history, and validate extracted data across multiple documents. The system will automatically update final datasets when annotations change, ensuring data consistency throughout the extraction lifecycle. This work will transform how researchers extract and manage scientific data.
Extralit currently manages extraction through individual annotation records but lacks a centralized system to coordinate the entire workflow. Users must rely on Python code to initialize records and track progress, with no easy way to automate AI workflows, monitor extraction status across multiple documents, or automatically update datasets when schemas change or new papers are added.
- Design an automated data orchestrator to keep track of extraction progress, pending tasks, and validation status
- Integrate with Elasticsearch to update the vector database for efficient Retrieval-Augmented Generation (RAG) workflows
- Improve upon an existing data ETL pipeline that consolidates both LLM and human annotations into structured, analysis-ready datasets
- Create mechanisms for automatic data record generation when schemas or document sets change
- Implement data lineage tracking to link each extracted data point back to its source annotation record
- Build dataset versioning capabilities to track changes over time
- Stretch goal: Develop a simple dashboard and UI components that display the status of data records and previews of the data output
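The orchestrator's core bookkeeping could look like the sketch below. It is plain Python with hypothetical names (task shapes, schema names); Extralit's actual record model, and an orchestration tool like Metaflow, would replace these placeholders in a real implementation:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    EXTRACTED = "extracted"
    VALIDATED = "validated"


@dataclass
class ExtractionTask:
    """One (document, schema) pair awaiting extraction and validation."""
    document_id: str
    schema_name: str
    status: Status = Status.PENDING


@dataclass
class Orchestrator:
    tasks: list = field(default_factory=list)

    def add_document(self, document_id, schemas):
        # Adding a document creates one pending task per schema.
        for schema in schemas:
            self.tasks.append(ExtractionTask(document_id, schema))

    def mark(self, document_id, schema_name, status):
        # Called when an annotation or LLM prediction changes a task's state.
        for t in self.tasks:
            if t.document_id == document_id and t.schema_name == schema_name:
                t.status = status

    def progress(self):
        # Aggregate validation status per document for the monitoring endpoint.
        summary = {}
        for t in self.tasks:
            doc = summary.setdefault(t.document_id, {})
            doc[t.status.value] = doc.get(t.status.value, 0) + 1
        return summary
```

Calling `progress()` after marking one of two tasks validated would return something like `{"doc1": {"pending": 1, "validated": 1}}`, which is the aggregation the monitoring endpoint deliverable below would expose.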
- A complete data workflow orchestration system that allows users to add/remove documents or schemas, then see that reflected in the final dataset output
- An API endpoint that aggregates over annotation records to provide the extraction progress for each document or schema
- Clear data provenance tracking linking dataset values to specific annotations
- Documentation and tutorials for using the orchestration system
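For the provenance deliverable, here is a minimal sketch of lineage tracking. The annotation record shape (`record_id`, `document_id`, `field`, `value`) is an assumption for illustration, not Extralit's actual model:

```python
def build_dataset(annotations):
    """Consolidate annotation records into dataset rows, recording provenance.

    Each annotation is assumed to look like:
    {"record_id": ..., "document_id": ..., "field": ..., "value": ...}
    """
    rows, lineage = {}, {}
    for ann in annotations:
        row = rows.setdefault(ann["document_id"], {})
        row[ann["field"]] = ann["value"]
        # Lineage key: (document, field) -> the annotation record it came from,
        # so every dataset value can be traced back to its source.
        lineage[(ann["document_id"], ann["field"])] = ann["record_id"]
    return rows, lineage
```

With this mapping in place, re-running consolidation after annotations change both refreshes the dataset and keeps the value-to-record links current.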
- Prerequisites:
- Python programming skills
- Experience with data pipeline design and ETL (e.g. Pandas, SQL)
- Familiarity with any workflow orchestration tool (e.g. Airflow, Flyte, Metaflow, Snakemake)
- Knowledge in systems design and data modeling
- Duration: 350 hours
- Complexity: High
- Project Type: Core development
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Argilla Record Annotation Documentation
- Argilla Webhooks
- Extralit Documentation
- Dataset aggregation function
- Metaflow Documentation
## Project #2: Enhanced AI OCR Extraction Pipeline

Scientific literature often contains critical data in complex tables and figures that are difficult to extract accurately. This project aims to enhance Extralit's OCR pipeline by implementing improved algorithms for table detection, document structure recognition, and content extraction from PDFs of scientific papers. The goal is to deploy a new PDF processing server with increased speed and extraction accuracy for complex table formats commonly found in life science research papers.
Extralit currently extracts tables using several Vision Transformer machine learning (ML) models, but cannot reach 100% accuracy on complex table formats, which often require additional human annotation. The pipeline can also extract body text content, but needs improvements to process hundreds of research papers quickly and at lower cost.
- Explore and experiment with Vision-Language Models (VLMs) and traditional ML approaches for structured text and table extraction
- Implement a document OCR pipeline utilizing a combination of SaaS APIs, LLMs, and Vision Transformer models to better identify and extract content in PDFs
- Create a post-processing pipeline to parse outputs from the extraction algorithm, preserving document structure and headings across a variety of document layouts
- Set up robust model serving that integrates with the existing data pipeline
- Run evaluation metrics and create performance benchmarks to measure latency, cost, and accuracy against human-annotated gold standards
- Stretch goal: Explore techniques using agentic LLMs to implement a minimum viable solution to accurately extract data points from data visualizations in figures
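The post-processing step above could be sketched as follows, assuming the upstream extractor emits Markdown (as tools like Marker do). The heading-based grouping is illustrative, not Extralit's implementation:

```python
import re


def split_sections(markdown_text):
    """Group Markdown text under its headings, preserving document structure."""
    sections, current = {}, "preamble"
    for line in markdown_text.splitlines():
        match = re.match(r"^#+\s+(.*)", line)
        if match:
            # A heading starts a new section; nesting level is ignored here.
            current = match.group(1).strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```

A real pipeline would also need to retain the heading hierarchy (e.g. that "Study sites" sits under "Methods") rather than flattening it, so downstream RAG retrieval can use section context.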
- A robust API endpoint to a PDF extraction pipeline with hallucination-free text OCR and improved table extraction accuracy on complex scientific tables and figures
- An API endpoint to enable users on the Extralit web UI to monitor progress, correct OCR outputs if needed, and integrate with Extralit's data storage stack
- Documentation and examples to reproducibly deploy the pipeline on cloud platforms
- Test suite with evaluation metrics for assessing extraction quality
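One candidate metric for the evaluation deliverable is cell-level accuracy against a human-annotated gold-standard table. This simplified sketch assumes rows are already aligned; a real benchmark would also need alignment for inserted or dropped rows:

```python
def cell_accuracy(extracted, gold):
    """Fraction of gold-standard cells reproduced exactly at the same position.

    Tables are represented as lists of rows (lists of cell strings).
    """
    total = sum(len(row) for row in gold)
    if total == 0:
        return 1.0  # empty gold table: nothing to get wrong
    correct = 0
    for ext_row, gold_row in zip(extracted, gold):
        correct += sum(e == g for e, g in zip(ext_row, gold_row))
    return correct / total
```

Reporting this alongside latency and per-document cost gives the three axes the benchmark deliverable calls for.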
- Prerequisites:
- Python software development experience
- Basic knowledge of machine learning model deployment
- Interest in scientific document and PDF parsing libraries
- Experience with GPU-accelerated data pipelines and computer vision libraries (optional but helpful)
- Duration: 175 hours
- Complexity: Medium
- Project Type: Core development
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Marker documentation: https://github.com/VikParuchuri/marker
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- Prompt engineering tools: BAML
- Table-Transformer research: paper
## Project #3: Interactive Schema Editor UI

Design and implement an intuitive, visual schema editor that allows researchers to define extraction schemas without writing code. The editor will provide a drag-and-drop interface for creating data fields, relationships, and validation rules, making it easier to change data extraction requirements. This project will significantly lower the barrier to adoption for domain experts who need to extract structured data from scientific literature.
Currently, Extralit schemas must be defined programmatically using Python or the CLI, which limits accessibility for non-technical users. The current approach also makes it difficult for users to leverage all of Extralit's extraction capabilities without deep technical knowledge of the system.
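To make the current pain point concrete, here is a sketch of the kind of programmatic schema definition the visual editor would replace. The field names and check syntax are hypothetical, loosely following the DataFrame-model style of Pandera (referenced below) rather than Extralit's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class SchemaField:
    name: str
    dtype: str
    required: bool = True
    checks: dict = field(default_factory=dict)  # e.g. {"ge": 0}


@dataclass
class ExtractionSchema:
    name: str
    fields: list

    def validate_row(self, row):
        """Return a list of violations for one extracted row."""
        errors = []
        for f in self.fields:
            if f.name not in row:
                if f.required:
                    errors.append(f"missing field: {f.name}")
                continue
            value = row[f.name]
            # Only one check kind shown; a real schema language has many.
            if "ge" in f.checks and value < f.checks["ge"]:
                errors.append(f"{f.name} must be >= {f.checks['ge']}")
        return errors
```

Writing and maintaining definitions like this is exactly the technical barrier the drag-and-drop editor is meant to remove.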
- Research and document common extraction schema patterns across 3-5 scientific domains
- Collaborate on designing and developing a user-friendly schema creation UI, guided by insights from UX research
- Implement a front-end interface using Vue.js/TypeScript or other web component frameworks
- Create interactive UI components for defining entities, attributes, and relationships, with real-time validation, schema versioning, and preview capabilities
- Integrate with Extralit's backend APIs
- Stretch goal: Add LLM-powered suggestions to auto-complete schema fields based on context
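The real-time validation above could be mirrored server-side on the JSON payload the Vue front end would send to the backend API. The payload shape and allowed types below are assumptions for illustration, not Extralit's actual contract:

```python
ALLOWED_TYPES = {"string", "integer", "number", "boolean"}


def validate_schema_payload(payload):
    """Server-side mirror of the editor's checks: return a list of problems."""
    errors = []
    seen = set()
    for f in payload.get("fields", []):
        if f["name"] in seen:
            errors.append(f"duplicate field: {f['name']}")
        seen.add(f["name"])
        if f.get("type") not in ALLOWED_TYPES:
            errors.append(f"unknown type for {f['name']}: {f.get('type')}")
    return errors
```

Keeping the same validation rules on both sides means the UI can flag problems as the user drags fields around, while the API still rejects malformed schemas from other clients.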
- An easy-to-use visual schema editor integrated within Extralit web app
- Interactive web forms supporting 90% of the schema features available in the schema definition language, including complex scientific data relationships, domain-specific constraints, and advanced data validation rules
- Comprehensive test suite ensuring reliability and correctness
- User testing results demonstrating significant efficiency improvements over code-based schema definition
- Prerequisites:
- Experience with Vue.js and/or React for building interactive web applications
- Knowledge of UI/UX design process
- Familiarity with data modeling, ideally in the scientific domain
- Duration: 175 hours
- Complexity: Medium
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Pandera documentation: https://pandera.readthedocs.io/en/stable/dataframe_models.html
- DrawDB: https://github.com/drawdb-io/drawdb
- JSON-editor: https://json-editor.github.io/json-editor/
- Extralit documentation: https://docs.extralit.ai
- Vue.js documentation: https://vuejs.org/
- Figma (for UI design): https://www.figma.com/