Code for data analysis and figure generation for "Mapping MAVE data for use in human genomics applications" (Arbesfeld et al.):

- mavedb_mapping.ipynb: This notebook applies the mapping algorithm to a set of 209 examined score sets from MaveDB, successfully creating mappings for ~2.5 million variant pairs across 207 score sets.
- mapping_analysis.ipynb: This notebook computes reference sequence concordance across the generated VRS mapping pairs. The notebook also computes the number of unique pre-mapped and post-mapped variants.
- mavedb_scoreset_breakdown.ipynb: This notebook generates the summary statistics that are described in the manuscript.

A compatible Python environment can be generated using the included requirements.txt file.
First, create and activate a virtual environment of your preference. For example, using virtualenv:

    python3 -m virtualenv venv
    source venv/bin/activate

Then install all requirements in requirements.txt:

    python3 -m pip install -r requirements.txt

After executing mapping code, this directory will contain working and output data in the following locations:

├── README.md
├── analysis_files
│   ├── mappings
│   │   └── <mapping output files>
│   └── <mapping checkpoint files>
├── experiment_scoresets.txt
├── mapping_analysis.ipynb
├── mave_mapping_fig_3b.R
├── mavedb_files
│   └── <Scoreset records and metadata from MaveDB>
├── mavedb_mapping.ipynb
├── mavedb_scoreset_breakdown.ipynb
└── requirements.txt
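
To confirm that a mapping run produced the layout shown above, a short check along these lines may help. This is a minimal sketch, not part of the repository: the directory names are taken from the tree, while the helper name `summarize_outputs` and the choice to count files are assumptions made for illustration.

```python
from pathlib import Path

# Directory names come from the tree above; everything else here is hypothetical.
EXPECTED_DIRS = ("analysis_files", "analysis_files/mappings", "mavedb_files")


def summarize_outputs(root="."):
    """Map each expected directory to its file count, or None if it is missing."""
    summary = {}
    for rel in EXPECTED_DIRS:
        path = Path(root) / rel
        if path.is_dir():
            # Count regular files only; subdirectories are reported separately.
            summary[rel] = sum(1 for p in path.iterdir() if p.is_file())
        else:
            summary[rel] = None
    return summary


if __name__ == "__main__":
    for rel, count in summarize_outputs().items():
        status = "missing" if count is None else f"{count} file(s)"
        print(f"{rel}: {status}")
```

Run from the repository root, this prints one line per directory. A "missing" entry most likely means the mapping notebook has not been executed yet.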