# GSoC Project Ideas 2025
Join us to contribute to Extralit's open-source development this summer! Here are the projects available for GSoC contributors:
- Project #1: Data Integration Pipeline for Collaborative Data Extraction
- Project #2: Enhanced AI OCR Extraction Pipeline
- Project #3: Interactive Schema Editor UI
Our potential mentors are Jonny Tran, Ph.D. and Dianne Ting.
- GitHub: https://github.com/extralit/extralit
- Docs: https://docs.extralit.ai
- Demo: https://extralit-public-demo.hf.space
- License (Apache 2.0): https://github.com/extralit/extralit/blob/main/LICENSE
- Code of Conduct: https://github.com/extralit/extralit/blob/develop/CODE_OF_CONDUCT.md
## Project #1: Data Integration Pipeline for Collaborative Data Extraction

Build a unified workflow orchestrator that manages the entire extraction process, from PDF preprocessing through RAG-enabled, LLM-powered predictions to final structured dataset creation. You'll create both the backend workflow engine and a web endpoint that lets researchers monitor extraction progress, kick off automated LLM extractions, track annotation history, and validate extracted data across multiple documents. The system will automatically update final datasets when annotations change, ensuring data consistency throughout the extraction lifecycle. This work will transform how researchers extract and manage scientific data.
Extralit currently manages extraction through individual annotation records but lacks a centralized system to coordinate the entire workflow. Users must rely on Python code to initialize records and track progress, with no easy way to automate AI workflows, monitor extraction status across multiple documents, or automatically update datasets when schemas change or new papers are added.
- Design an automated data orchestrator to keep track of extraction progress, pending tasks, and validation status
- Integrate with Elasticsearch to update the vector database for efficient Retrieval-Augmented Generation (RAG) workflows
- Improve upon an existing data ETL pipeline that consolidates both LLM and human annotations into structured, analysis-ready datasets
- Create mechanisms for automatic data record generation when schemas or document sets change
- Implement data lineage tracking to link each extracted data point back to its source annotation record
- Build dataset versioning capabilities to track changes over time
- Stretch goal: Develop a simple dashboard and UI components that display the status of data records and previews of the data output
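The orchestrator's core bookkeeping could look like the sketch below. It is plain Python with hypothetical names (task shapes, schema names); Extralit's actual record model, and an orchestration tool like Metaflow, would replace these placeholders in a real implementation:

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    EXTRACTED = "extracted"
    VALIDATED = "validated"


@dataclass
class ExtractionTask:
    """One (document, schema) pair awaiting extraction and validation."""
    document_id: str
    schema_name: str
    status: Status = Status.PENDING


@dataclass
class Orchestrator:
    tasks: list = field(default_factory=list)

    def add_document(self, document_id, schemas):
        # Adding a document creates one pending task per schema.
        for schema in schemas:
            self.tasks.append(ExtractionTask(document_id, schema))

    def mark(self, document_id, schema_name, status):
        # Called when an annotation or LLM prediction changes a task's state.
        for t in self.tasks:
            if t.document_id == document_id and t.schema_name == schema_name:
                t.status = status

    def progress(self):
        # Aggregate validation status per document for the monitoring endpoint.
        summary = {}
        for t in self.tasks:
            doc = summary.setdefault(t.document_id, {})
            doc[t.status.value] = doc.get(t.status.value, 0) + 1
        return summary
```

Calling `progress()` after marking one of two tasks validated would return something like `{"doc1": {"pending": 1, "validated": 1}}`, which is the aggregation the monitoring endpoint deliverable below would expose.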
- A complete data workflow orchestration system that allows users to add/remove documents or schemas, then see that reflected in the final dataset output
- An API endpoint that aggregates over annotation records to provide the extraction progress for each document or schema
- Clear data provenance tracking linking dataset values to specific annotations
- Documentation and tutorials for using the orchestration system
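For the provenance deliverable, here is a minimal sketch of lineage tracking. The annotation record shape (`record_id`, `document_id`, `field`, `value`) is an assumption for illustration, not Extralit's actual model:

```python
def build_dataset(annotations):
    """Consolidate annotation records into dataset rows, recording provenance.

    Each annotation is assumed to look like:
    {"record_id": ..., "document_id": ..., "field": ..., "value": ...}
    """
    rows, lineage = {}, {}
    for ann in annotations:
        row = rows.setdefault(ann["document_id"], {})
        row[ann["field"]] = ann["value"]
        # Lineage key: (document, field) -> the annotation record it came from,
        # so every dataset value can be traced back to its source.
        lineage[(ann["document_id"], ann["field"])] = ann["record_id"]
    return rows, lineage
```

With this mapping in place, re-running consolidation after annotations change both refreshes the dataset and keeps the value-to-record links current.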
- Prerequisites:
- Python programming skills
- Experience with data pipeline design and ETL (e.g. Pandas, SQL)
- Familiarity with any workflow orchestration tool (e.g. Airflow, Flyte, Metaflow, Snakemake)
- Knowledge in systems design and data modeling
- Duration: 350 hours
- Complexity: High
- Project Type: Core development
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Argilla Record Annotation Documentation
- Argilla Webhooks
- Extralit Documentation
- Dataset aggregation function
- Metaflow Documentation
## Project #2: Enhanced AI OCR Extraction Pipeline

Scientific literature often contains critical data in complex tables and figures that are difficult to extract accurately. This project aims to enhance Extralit's OCR pipeline by implementing improved algorithms for table detection, document structure recognition, and content extraction from PDFs of scientific papers. The goal is to deploy a new PDF processing server with increased speed and extraction accuracy for complex table formats commonly found in life science research papers.
Extralit currently extracts tables using several Vision Transformer machine learning (ML) models, but cannot reach 100% accuracy on complex table formats, which often require additional human annotation. The pipeline can also extract body text content, but needs improvements to process hundreds of research papers quickly and at lower cost.
- Explore and experiment with Vision-Language Models (VLMs) and traditional ML approaches for structured text and table extraction
- Implement a document OCR pipeline utilizing a combination of SaaS APIs, LLMs, and Vision Transformer models to better identify and extract content in PDFs
- Create a post-processing pipeline to parse outputs from the extraction algorithm, preserving document structure and headings across a variety of document layouts
- Set up robust model serving that integrates with the existing data pipeline
- Run evaluation metrics and create performance benchmarks to measure latency, cost, and accuracy against human-annotated gold standards
- Stretch goal: Explore techniques using agentic LLMs to implement a minimum viable solution to accurately extract data points from data visualizations in figures
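The post-processing step above could be sketched as follows, assuming the upstream extractor emits Markdown (as tools like Marker do). The heading-based grouping is illustrative, not Extralit's implementation:

```python
import re


def split_sections(markdown_text):
    """Group Markdown text under its headings, preserving document structure."""
    sections, current = {}, "preamble"
    for line in markdown_text.splitlines():
        match = re.match(r"^#+\s+(.*)", line)
        if match:
            # A heading starts a new section; nesting level is ignored here.
            current = match.group(1).strip()
            sections.setdefault(current, [])
        else:
            sections.setdefault(current, []).append(line)
    return {title: "\n".join(body).strip() for title, body in sections.items()}
```

A real pipeline would also need to retain the heading hierarchy (e.g. that "Study sites" sits under "Methods") rather than flattening it, so downstream RAG retrieval can use section context.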
- A robust API endpoint to a PDF extraction pipeline with hallucination-free text OCR and improved table extraction accuracy on complex scientific tables and figures
- An API endpoint to enable users on the Extralit web UI to monitor progress, correct OCR outputs if needed, and integrate with Extralit's data storage stack
- Documentation and examples to reproducibly deploy the pipeline on cloud platforms
- Test suite with evaluation metrics for assessing extraction quality
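One candidate metric for the evaluation deliverable is cell-level accuracy against a human-annotated gold-standard table. This simplified sketch assumes rows are already aligned; a real benchmark would also need alignment for inserted or dropped rows:

```python
def cell_accuracy(extracted, gold):
    """Fraction of gold-standard cells reproduced exactly at the same position.

    Tables are represented as lists of rows (lists of cell strings).
    """
    total = sum(len(row) for row in gold)
    if total == 0:
        return 1.0  # empty gold table: nothing to get wrong
    correct = 0
    for ext_row, gold_row in zip(extracted, gold):
        correct += sum(e == g for e, g in zip(ext_row, gold_row))
    return correct / total
```

Reporting this alongside latency and per-document cost gives the three axes the benchmark deliverable calls for.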
- Prerequisites:
- Python software development experience
- Basic knowledge of machine learning model deployment
- Interest in scientific document and PDF parsing libraries
- Experience with GPU-accelerated data pipelines and computer vision libraries (optional but helpful)
- Duration: 175 hours
- Complexity: Medium
- Project Type: Core development
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Marker documentation: https://github.com/VikParuchuri/marker
- PyMuPDF documentation: https://pymupdf.readthedocs.io/
- Prompt engineering tools: BAML
- Table-Transformer research: paper
## Project #3: Interactive Schema Editor UI

Design and implement an intuitive, visual schema editor that allows researchers to define extraction schemas without writing code. The editor will provide a drag-and-drop interface for creating data fields, relationships, and validation rules, making it easier to change data extraction requirements. This project will significantly lower the barrier to adoption for domain experts who need to extract structured data from scientific literature.
Currently, Extralit schemas must be defined programmatically using Python or the CLI, which limits accessibility for non-technical users. The current approach also makes it difficult for users to leverage all of Extralit's extraction capabilities without deep technical knowledge of the system.
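To make the current pain point concrete, here is a sketch of the kind of programmatic schema definition the visual editor would replace. The field names and check syntax are hypothetical, loosely following the DataFrame-model style of Pandera (referenced below) rather than Extralit's actual API:

```python
from dataclasses import dataclass, field


@dataclass
class SchemaField:
    name: str
    dtype: str
    required: bool = True
    checks: dict = field(default_factory=dict)  # e.g. {"ge": 0}


@dataclass
class ExtractionSchema:
    name: str
    fields: list

    def validate_row(self, row):
        """Return a list of violations for one extracted row."""
        errors = []
        for f in self.fields:
            if f.name not in row:
                if f.required:
                    errors.append(f"missing field: {f.name}")
                continue
            value = row[f.name]
            # Only one check kind shown; a real schema language has many.
            if "ge" in f.checks and value < f.checks["ge"]:
                errors.append(f"{f.name} must be >= {f.checks['ge']}")
        return errors
```

Writing and maintaining definitions like this is exactly the technical barrier the drag-and-drop editor is meant to remove.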
- Research and document common extraction schema patterns across 3-5 scientific domains
- Collaborate on designing and developing a user-friendly schema creation UI, guided by insights from UX research
- Implement a front-end interface using Vue.js/TypeScript or other web component frameworks
- Create interactive UI components for defining entities, attributes, and relationships, with real-time validation, schema versioning, and preview capabilities
- Integrate with Extralit's backend APIs
- Stretch goal: Add LLM-powered suggestions to auto-complete schema fields based on context
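The real-time validation above could be mirrored server-side on the JSON payload the Vue front end would send to the backend API. The payload shape and allowed types below are assumptions for illustration, not Extralit's actual contract:

```python
ALLOWED_TYPES = {"string", "integer", "number", "boolean"}


def validate_schema_payload(payload):
    """Server-side mirror of the editor's checks: return a list of problems."""
    errors = []
    seen = set()
    for f in payload.get("fields", []):
        if f["name"] in seen:
            errors.append(f"duplicate field: {f['name']}")
        seen.add(f["name"])
        if f.get("type") not in ALLOWED_TYPES:
            errors.append(f"unknown type for {f['name']}: {f.get('type')}")
    return errors
```

Keeping the same validation rules on both sides means the UI can flag problems as the user drags fields around, while the API still rejects malformed schemas from other clients.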
- An easy-to-use visual schema editor integrated within Extralit web app
- Interactive web forms supporting 90% of the schema features available in the schema definition language, including complex scientific data relationships, domain-specific constraints, and advanced data validation rules
- Comprehensive test suite ensuring reliability and correctness
- User testing results demonstrating significant efficiency improvements over code-based schema definition
- Prerequisites:
- Experience with Vue.js and/or React for building interactive web applications
- Knowledge of UI/UX design process
- Familiarity with data modeling, ideally in the scientific domain
- Duration: 175 hours
- Complexity: Medium
- Potential Mentor(s): Jonny Tran, Dianne Ting
- Mentor Contact Email: nhat.c.tran@gmail.com, dianneting.design@gmail.com
- Pandera documentation: https://pandera.readthedocs.io/en/stable/dataframe_models.html
- DrawDB: https://github.com/drawdb-io/drawdb
- JSON-editor: https://json-editor.github.io/json-editor/
- Extralit documentation: https://docs.extralit.ai
- Vue.js documentation: https://vuejs.org/
- Figma (for UI design): https://www.figma.com/