This repository annotates GEO datasets using DrugBank and Cellosaurus labels. We then download all of the data and assign sex labels.
The code directory is set up as follows:
- GEO metadata is downloaded using 00_get_list_all_gse.R
  - output (in data/01_sample_lists/):
    - <>_gse_gsm.csv
    - gse_<>.csv
    - gse_gsm_all_geo_dedup.csv
    - gse_all_geo_info.csv
- 01_cell_labeling/
  - 00_parse_cellosaurus.py
    - input: cellosaurus XML, located in data/00_db_data/cellosaurus.xml
    - output: data/00_db_data/cellosaurus.json
  - 01_process_cell_df.R
    - output: data/00_db_data/cellosaurus_df.txt
  - 02_cell_line_labeling.R
    - uses data/01_sample_lists/gse_all_geo_info.csv
    - output: data/02_labeled_data/cell_line_mapped_gse.txt
- 02_drug_labeling/
  - 00_drugbank_synonyms.py
    - input: drugbank.xml
    - output: data/00_db_data/drugbank_info.json
  - 01_process_drugbank.R
    - output: data/00_db_data/drugbank_parsed.txt
  - 02_drug_gse_labeling.R
    - uses data/01_sample_lists/gse_all_geo_info.csv
    - output: data/02_labeled_data/drugbank_mapped_gse.txt
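
Both 00_parse_cellosaurus.py and 00_drugbank_synonyms.py turn a database XML dump into a JSON file. As a rough illustration of that kind of step, and not the repository's actual code, the sketch below maps each entry's primary name to its synonyms. The tag names (entry, name, synonym), the toy XML, and the function synonyms_from_xml are hypothetical placeholders; the real Cellosaurus and DrugBank schemas are more deeply nested and would need their own element paths.

```python
import json
import xml.etree.ElementTree as ET

# Toy input. The tag names are placeholders (an assumption for illustration),
# not the real Cellosaurus or DrugBank schema.
TOY_XML = """
<database>
  <entry>
    <name>HeLa</name>
    <synonym>Hela</synonym>
    <synonym>He La</synonym>
  </entry>
  <entry>
    <name>MCF-7</name>
    <synonym>MCF7</synonym>
  </entry>
</database>
"""


def synonyms_from_xml(xml_text):
    """Return {primary name: [synonyms]} for every <entry> in the document."""
    root = ET.fromstring(xml_text.strip())
    result = {}
    for entry in root.iter("entry"):
        name = entry.findtext("name")
        if name is None:
            continue  # skip entries without a primary name
        synonyms = [s.text.strip() for s in entry.findall("synonym") if s.text]
        result[name.strip()] = synonyms
    return result


if __name__ == "__main__":
    # Serialize the mapping the same way a JSON output file would store it.
    print(json.dumps(synonyms_from_xml(TOY_XML), indent=2))
```

A flat name-to-synonyms mapping like this is easy to load from R in the later processing scripts, which is one plausible reason for the intermediate JSON step.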
Downloading relies on the NCBI Aspera client and on wrenlab software packages, which makes installation slightly complicated.
sbatch 00_download_gse_wrapper.sh ${ID} ${GSE_LIST}
This runs download_geo_chunk.sh, which runs downloadGEO.py for each individual GSE.
We do this for ID = "mouse", "rat", and "human", using the files "data/sample_lists/gse_${ID}.csv".
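
The per-organism submission described above can be sketched as a small dry-run helper. This is an illustration only: the function name build_sbatch_commands is hypothetical, and it assumes that ${GSE_LIST} is the matching data/sample_lists/gse_${ID}.csv file, which the text implies but does not state outright. The helper only builds the command lines; nothing is submitted.

```python
import shlex

ORGANISMS = ("mouse", "rat", "human")


def build_sbatch_commands(organisms=ORGANISMS, list_dir="data/sample_lists"):
    """Build one sbatch command line per organism (dry run, nothing is executed)."""
    commands = []
    for org in organisms:
        # Assumed value of ${GSE_LIST} for this organism.
        gse_list = f"{list_dir}/gse_{org}.csv"
        args = ["sbatch", "00_download_gse_wrapper.sh", org, gse_list]
        # shlex.join quotes arguments where needed so the line is shell-safe.
        commands.append(shlex.join(args))
    return commands


if __name__ == "__main__":
    for cmd in build_sbatch_commands():
        print(cmd)
```

Printing the commands first makes it easy to check the list paths before handing the same lines to the scheduler.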
// TODO: download GPL scripts
This directory processes downloaded GSEs using exprsex based on a reference list of files.
- 00_convert_to_mat.sh
- 01_label_mat.sh
- 02_combine_mat.sh
- 03_train_test_divide.R
- 04_run_meta.sh
- 05_train_test_mat.sh
Required files:
- GEOMetadb (update the GEOMetadb path in utils to point to it)
- drugbank and cellosaurus XML in the 00_db_data directory
- jake_stopwords file