uc-cdis/gdc-cohort-copilot

GDC Cohort Copilot

GDC Cohort Copilot is an AI copilot tool that assists in curating cohorts from the NCI Genomic Data Commons (GDC) using natural language. We share GDC Cohort Copilot as a containerized Gradio app (also enabled as an MCP tool!) on HuggingFace Spaces, available at: https://huggingface.co/spaces/uc-ctds/GDC-Cohort-Copilot.

The implementation of the web app can be found in the git submodule at: app (links to HuggingFace repo).

Run Locally

The containerized Gradio app can be run locally using Docker. The Docker image is available at the following URL:

registry.hf.space/uc-ctds-gdc-cohort-copilot

To run the app locally, run the following command, then open http://localhost:7860 in a web browser:

docker run -it --rm -p 7860:7860 registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
  • The app runs on port 7860 within the container. If this port is occupied on the host, it can be remapped to a different host port using the -p flag. Refer to the docker run documentation on host/container port mapping for details.
  • If serving remotely, you will need to open an ssh tunnel from your local machine to the remote host before you can open the app in your web browser:
    ssh -NL 7860:localhost:7860 <user>@<remote>
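
As a minimal sketch of the port remapping described above: in the -p flag, the number before the colon is the host port and the number after it is the container port, which stays at 7860. The function name `run_copilot` and the host port 8080 are hypothetical and chosen only for illustration.

```shell
# Hypothetical wrapper (not part of the repository) around the docker run
# command above. The first argument is the host port; it defaults to 7860.
run_copilot() {
  local host_port="${1:-7860}"
  # -p <host-port>:<container-port>; the app listens on 7860 in the container
  docker run -it --rm -p "${host_port}:7860" \
    registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
}

# Example: serve on host port 8080, then open http://localhost:8080
# run_copilot 8080
```

If the app is served remotely on a remapped port, the ssh tunnel would need to forward that host port instead, for example `ssh -NL 8080:localhost:8080 <user>@<remote>`.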

GPU acceleration can speed up model inference but is not required:

docker run -it --rm -p 7860:7860 --runtime nvidia --gpus all registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
  • GPU acceleration requires installing nvidia-container-toolkit.
  • On multi-GPU hosts, use the --gpus flag to restrict the container to specific GPUs. Refer to the docker run documentation on accessing GPUs for additional details.
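
The GPU restriction mentioned above might look like the following sketch, which passes a single device index to --gpus in place of `all`. The function name `run_copilot_gpu` and the default index 0 are hypothetical; check which indices exist on your host (for example with `nvidia-smi`) before relying on one.

```shell
# Hypothetical wrapper (not part of the repository): run the app on one GPU.
# The first argument is the GPU index; it defaults to 0.
run_copilot_gpu() {
  local device="${1:-0}"
  # --gpus device=<index> exposes only that GPU to the container.
  # For several GPUs, the docker docs quote the value: --gpus '"device=0,2"'
  docker run -it --rm -p 7860:7860 --runtime nvidia --gpus "device=${device}" \
    registry.hf.space/uc-ctds-gdc-cohort-copilot:latest python app.py
}

# Example: restrict the container to the second GPU
# run_copilot_gpu 1
```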

Method

Figure: Overview of GDC Cohort Copilot implementation and user workflow:

  1. Implementation of GDC Cohort Copilot involves training GDC Cohort LLM to translate a natural language description of a cohort into the corresponding cohort filter JSON. The cohort JSONs are derived from datasets of real user-made cohorts or synthetically generated cohorts. The paired natural language queries are generated from the cohort JSONs by a frozen LLM. The final trained GDC Cohort LLM is served in a containerized Gradio app, running on HuggingFace Spaces, that exposes a GDC Cohort Builder-like interface.
  2. A user curates their desired cohort using GDC Cohort Copilot as follows:
    1. The user inputs a natural language description of the desired cohort.
    2. The description is automatically passed to GDC Cohort LLM.
    3. The generated cohort filter is automatically populated back into the interface, where the user can manually refine the cohort.
    4. The user exports the curated cohort to the NCI GDC.
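
To make the cohort filter JSON mentioned above more concrete, the sketch below prints an illustrative filter in the nested `op`/`content` style used by GDC API filters. It is an assumption about what such a filter could look like for a query like "female cases from TCGA-BRCA", not actual output from GDC Cohort LLM; the function name `example_cohort_filter` is hypothetical.

```shell
# Hypothetical helper (not part of the repository): prints an illustrative
# cohort filter in the GDC API filter style. The fields and values are
# examples only, not output from GDC Cohort LLM.
example_cohort_filter() {
  cat <<'EOF'
{
  "op": "and",
  "content": [
    {"op": "in", "content": {"field": "cases.project.project_id", "value": ["TCGA-BRCA"]}},
    {"op": "in", "content": {"field": "cases.demographic.gender", "value": ["female"]}}
  ]
}
EOF
}

# Example: print the filter
# example_cohort_filter
```

Each inner clause restricts one field to a set of allowed values, and the outer `and` requires a case to satisfy all clauses.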

GDC Cohort LLM

In addition to the containerized application, we include our source code for developing and evaluating GDC Cohort LLM, the generative language model powering GDC Cohort Copilot. We share all experimental variants of GDC Cohort LLM on HuggingFace: https://huggingface.co/uc-ctds.

In order, the steps for our experiments are:

  1. Set up and activate the development environment
    conda env create -f env.yaml
    conda activate cohort
    
  2. Data Preprocessing
  3. Synthetic Data Generation
  4. Model Training and Inference
  5. OpenAI Comparison
  6. Evaluation

Citation

@article{song2025gdc,
  title={GDC Cohort Copilot: An AI Copilot for Curating Cohorts from the Genomic Data Commons},
  author={Song, Steven and Subramanyam, Anirudh and Zhang, Zhenyu and Venkat, Aarti and Grossman, Robert L},
  journal={Bioinformatics Advances},
  pages={vbaf295},
  year={2025},
  publisher={Oxford University Press}
}
