GDC Cohort Copilot is an AI copilot tool to assist in the curation of cohorts from the NCI GDC using natural language. We share GDC Cohort Copilot as a containerized Gradio app (also enabled as an MCP tool!) on HuggingFace Spaces available at: https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot.
The implementation of the web app can be found in the git submodule at: app (links to HuggingFace repo).
The containerized Gradio app can be run locally using docker. The docker image is available at the following URL:
registry.hf.space/uc-ctds-gdc-cohort-copilot
To run the app locally, run the following before opening http://localhost:7860 in a web browser:
docker run -it --rm -p 7860:7860 registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
- The app runs on port 7860 within the container. If this port is occupied on the host, it can be remapped on the host side using the -p flag. Refer to the docker run documentation on host/container port mapping for details.
- If serving remotely, you will need to open an SSH tunnel from your local machine to the remote host before you can open the app in your web browser:
ssh -NL 7860:localhost:7860 <user>@<remote>
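The port remapping and tunnel described above can be sketched in a few lines of Python. This is only an illustration: the helper names (`port_is_free`, `docker_run_command`, `ssh_tunnel_command`) and the fallback port 8080 are hypothetical and not part of this repository. Only the `docker run` and `ssh -NL` arguments are taken from the commands shown above.

```python
import socket
from typing import List

IMAGE = "registry.hf.space/uc-ctds-gdc-cohort-copilot:latest"
CONTAINER_PORT = 7860  # port the app listens on inside the container


def port_is_free(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if nothing accepts connections on host:port (hypothetical helper)."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 when the connection succeeds, i.e. the port is in use
        return sock.connect_ex((host, port)) != 0


def docker_run_command(host_port: int = CONTAINER_PORT) -> List[str]:
    """Build the docker run command; only the host side of -p HOST:CONTAINER changes."""
    return [
        "docker", "run", "-it", "--rm",
        "-p", f"{host_port}:{CONTAINER_PORT}",
        IMAGE, "python", "app.py",
    ]


def ssh_tunnel_command(user: str, remote: str, local_port: int = 7860,
                       remote_port: int = 7860) -> List[str]:
    """Build the ssh tunnel command; remote_port is the host port used on the remote machine."""
    return ["ssh", "-NL", f"{local_port}:localhost:{remote_port}", f"{user}@{remote}"]


# Fall back to an arbitrary alternative (8080, an assumption) if 7860 is taken.
host_port = 7860 if port_is_free(7860) else 8080
print(" ".join(docker_run_command(host_port)))
```

If the host port is remapped, the browser URL and the remote side of the tunnel change with it, for example http://localhost:8080 instead of http://localhost:7860.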
Model inference may be sped up with GPU acceleration, though it is not required:
docker run -it --rm -p 7860:7860 --runtime nvidia --gpus all registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
- GPU acceleration requires installing nvidia-container-toolkit.
- On multi-GPU hosts, the container can be restricted to specific GPUs with the --gpus flag. Please refer to the docker run documentation on accessing GPUs for additional details.
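As a rough sketch of how the `--gpus` value changes when restricting the container to specific devices, the snippet below builds the command for a chosen set of GPU indices. The helper names (`gpus_value`, `gpu_run_command`) are hypothetical. The `device=...` form follows the docker run documentation, where a multi-device list must keep its double quotes because the flag value is parsed as a comma-separated option list.

```python
import shlex
from typing import List, Optional, Sequence

IMAGE = "registry.hf.space/uc-ctds-gdc-cohort-copilot:latest"


def gpus_value(device_ids: Optional[Sequence[int]] = None) -> str:
    """Return the value for docker's --gpus flag (hypothetical helper).

    None selects every GPU ("all"); otherwise only the listed device indices.
    """
    if device_ids is None:
        return "all"
    if not device_ids:
        raise ValueError("device_ids must contain at least one GPU index")
    ids = ",".join(str(i) for i in device_ids)
    # The double quotes are part of the value, so "device=0,2" stays one option.
    return f'"device={ids}"'


def gpu_run_command(device_ids: Optional[Sequence[int]] = None) -> List[str]:
    """Build the GPU-enabled docker run command from the section above."""
    return [
        "docker", "run", "-it", "--rm", "-p", "7860:7860",
        "--runtime", "nvidia", "--gpus", gpus_value(device_ids),
        IMAGE, "python", "app.py",
    ]


# shlex.join adds the shell quoting needed to paste the command into a terminal.
print(shlex.join(gpu_run_command([0, 2])))
```

Passing the list to `subprocess.run` directly would avoid the shell quoting step altogether; the printed form is only for copying into a terminal.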
Overview of GDC Cohort Copilot implementation and user workflow:
- Implementation of GDC Cohort Copilot involves training the GDC Cohort LLM to translate a natural language query describing a cohort into the corresponding cohort filter JSON. The cohort JSONs are derived from datasets of real user-made cohorts or synthetically generated cohorts. The paired natural language queries are generated from the cohort JSONs by a frozen LLM. The final trained GDC Cohort LLM is served in a containerized Gradio app that exposes a GDC Cohort Builder-like interface running on HuggingFace Spaces.
- A user curates their desired cohort using the GDC Cohort Copilot by:
  - inputting a natural language description of the desired cohort, which is automatically passed to GDC Cohort LLM;
  - manually refining the generated cohort filter, which is automatically populated back into the interface; and
  - exporting the curated cohort to the NCI GDC.
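To make the "natural language query to cohort filter JSON" translation concrete, here is a minimal sketch of what such a pair might look like. The query text and the specific filter are illustrative assumptions, not output from GDC Cohort LLM. The nested `op`/`content` layout and the `/cases` endpoint follow the public GDC API filter format, and the helper names (`leaf_conditions`, `gdc_cases_url`, `filter_from_url`) are hypothetical. No network request is made.

```python
import json
from urllib.parse import parse_qs, urlencode, urlsplit

# Hypothetical user query and a filter it could plausibly map to (an assumption).
query = "female patients in the TCGA lung adenocarcinoma project"
cohort_filter = {
    "op": "and",
    "content": [
        {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-LUAD"]}},
        {"op": "in", "content": {"field": "cases.demographic.gender", "value": ["female"]}},
    ],
}


def leaf_conditions(node):
    """Yield (field, op, value) for every leaf condition in a nested filter."""
    content = node["content"]
    if isinstance(content, list):  # logical node ("and"/"or") wrapping sub-filters
        for child in content:
            yield from leaf_conditions(child)
    else:  # leaf node: a single condition on a single field
        yield content["field"], node["op"], content["value"]


def gdc_cases_url(filt, size=10):
    """Build (but do not send) a GDC API /cases query URL for the filter."""
    params = {"filters": json.dumps(filt), "size": size}
    return "https://api.gdc.cancer.gov/cases?" + urlencode(params)


def filter_from_url(url):
    """Recover the filter JSON from a URL built by gdc_cases_url."""
    return json.loads(parse_qs(urlsplit(url).query)["filters"][0])


for field, op, value in leaf_conditions(cohort_filter):
    print(f"{field} {op} {value}")
```

Walking the filter this way is one simple approach to reviewing a generated filter condition by condition before refining it, similar in spirit to checking the populated fields in the interface.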
In addition to the containerized application, we also include our source code for developing and evaluating GDC Cohort LLM, the generative language model powering the GDC Cohort Copilot. We share all experimental variants of GDC Cohort LLM on HuggingFace: https://huggingface.co/uc-ctds.
In order, the steps for our experiments are:
- Set up and activate the development environment
conda env create -f env.yaml
conda activate cohort
- Data Preprocessing
- Synthetic Data Generation
- Model Training and Inference
- OpenAI Comparison
- Evaluation
@article{song2025gdc,
title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons},
author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L},
journal={Bioinformatics Advances},
pages={vbaf295},
year={2025},
publisher={Oxford University Press}
}
