This repository contains a reproducible pipeline to prioritise high-confidence male fertility candidate genes by integrating:
- GTEx v11 tissue RNA-seq (testis specificity)
- Human sperm RNA-seq (GSE40181 TPM)
- ClinVar gene/variant evidence
- Human Phenotype Ontology (HPO) male infertility phenotypes
- Proteomics evidence (internal data) (optional)
- Functional annotation (MyGene.info; optional, requires internet)
The goal is to identify genes that are testis-enriched, retained/expressed in sperm, and supported by clinical and/or phenotype evidence.
All tabular outputs are tab-separated (TSV).
You will need the following inputs available locally (downloaded by download_data.sh or manually where noted):
- Gene TPM GCT (RNASeQC) file (e.g.
GTEx_Analysis_*_gene_tpm.gct) - SampleAttributesDS metadata file
hgnc_complete_set.txt(for mapping Ensembl and Entrez IDs to approved HGNC symbols)
- TPM matrix file (gzipped TSV)
tab_delimited/variant_summary.txt.gztab_delimited/gene_condition_source_id(uncompressed)
- HPO ontology files and
genes_to_phenotype.txt(exact filenames depend on your HPO download)
- Proteomics Excel, containing a gene symbol column and sample presence columns
A typical run consists of:
- Download inputs
- Build ClinVar male infertility tiers (HPO + ClinVar)
- Identify GTEx testis-specific genes and intersect with sperm expression
- Add annotations (ClinVar / HPO / proteomics / MyGene)
- Build final summary tables and plots
This pipeline combines Bash and Python scripts
- Python ≥ 3.9 (tested with Python 3.9)
The following Python packages are required:
- pandas
- numpy
- requests
- openpyxl (for proteomics Excel files)
- tqdm (optional but useful for progress bars if added later)
pip install pandas numpy requests openpyxl
or conda:
conda create -n fertility_pipeline python=3.9 pandas numpy requests openpyxl
conda activate fertility_pipelinehttps://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc=GSE40181 https://www.gtexportal.org/home/downloads/adult-gtex/overview
This is driven by find_testis_expression.sh and calls the scripts below in order.
- Script:
scripts/gtex_testis_specificity_logged.py - What it does:
- Loads GTEx gene TPM from a GCT file
- Maps samples to tissues using SampleAttributesDS
- Computes per-gene tissue median TPM values
- Calculates tissue-specificity metrics per gene, including:
- tau (0–1; higher = more tissue-specific)
- log2 enrichment of testis median TPM vs maximum non-testis median TPM
- fraction of testis samples where TPM passes a minimum threshold
- top tissues by median TPM
Tau reference: https://pubmed.ncbi.nlm.nih.gov/15388519/
- Key outputs:
- Ranked per-gene table (TSV)
- Gene × tissue median TPM matrix (TSV)
- Script:
scripts/add_hgnc_symbols.py - What it does:
- Reads the ranked GTEx output (indexed by Ensembl gene IDs)
- Strips Ensembl version suffixes (e.g.
.1) - Joins against HGNC complete set to add a
hgnc_symbolcolumn
- Key output:
- GTEx ranked table with HGNC symbols (TSV)
wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2019/11/PXD014618/DBsearchpsm.csv
- Script:
RNAseq/annotate_testis_genes_with_clinvar.py - What it does:
- Adds ClinVar gene-condition associations from
gene_condition_source_id - Optionally parses
variant_summary.txt.gzand adds per-gene variant counts by clinical significance - Normalises gene symbols for robust joins
- Adds ClinVar gene-condition associations from
- Key output:
- GTEx ranked table with ClinVar annotations (TSV)
- Script:
RNAseq/map_sperm_entrez_to_hgnc_and_intersect.py - What it does:
- Loads sperm TPM matrix (Entrez GeneID-based)
- Uses HGNC complete set to map Entrez IDs → HGNC symbols
- Computes sperm evidence columns such as:
sperm_present_any(TPM > threshold in at least one sample)sperm_present_fracsperm_tpm_mean,sperm_tpm_median
- Merges sperm evidence onto the GTEx ranked table via HGNC symbol
- Key outputs:
- Mapped sperm TPM table (TSV)
- GTEx ranked table annotated with sperm evidence (TSV)
- Script:
RNAseq/extract_high_confidence_testis_sperm_genes.py - What it does:
- Filters the GTEx+sperm annotated table using configurable thresholds, typically:
min_tau(default 0.95)min_testis_tpm(default 5)min_sperm_tpm(default 0.1)
- Filters the GTEx+sperm annotated table using configurable thresholds, typically:
- Key output:
- High-confidence gene set (TSV)
- Script:
RNAseq/add_functional_annotation_mygene.py - What it does:
- Queries MyGene.info in batches using HGNC symbols
- Adds fields such as gene name, summary, gene type, GO terms
- Supports a JSONL cache to avoid repeated queries
- Key output:
- Functionally annotated GTEx+sperm table (TSV)
This is driven by identify_genes_to_infertility.sh and creates a ClinVar evidence set focused on male infertility phenotypes.
- Script:
extract_male_infertility_genes_hpo_ontology.py - What it does:
- Traverses HPO ontology to collect genes associated with male infertility phenotypes
- Can require “male” context and include disease terms depending on flags
- Key outputs:
- HPO-derived gene summary tables (TSV)
- Script:
extract_gene_list_from_hpo_outputs.py - What it does:
- Extracts unique gene symbols and Entrez IDs
- Writes:
male_infertility_gene_symbols.tsvmale_infertility_entrez_ids.tsvmale_infertility_gene_list.tsv(combined)
- Key outputs:
- Gene list TSVs for filtering ClinVar
- Script:
filter_clinvar_variant_summary_by_gene_list.py - What it does:
- Streams
variant_summary.txt.gzand keeps only rows whoseGeneSymbolmatches the HPO-derived gene list - Writes counts by ClinicalSignificance and ReviewStatus
- Streams
- Key outputs:
- Filtered ClinVar subset (TSV)
- Summary counts (TSV)
- Script:
filter_clinvar_filtered_by_phenotype.py - What it does:
- Filters ClinVar rows by keyword matching against
PhenotypeList - Uses include and exclude keyword sets to avoid female-only contexts
- Produces both full and minimal meeting-ready outputs
- Filters ClinVar rows by keyword matching against
- Key outputs:
male_infertility_clinvar_variants_by_phenotype.tsvmale_infertility_clinvar_variants_by_phenotype_minimal.tsv- Summary counts TSV
- Script:
collapse_clinvar_variants_to_best.py - What it does:
- Groups by
AlleleID(default) and selects one “best” record per variant using a rank:- Prefer GRCh38 over GRCh37
- Prefer stronger ReviewStatus
- Prefer more clinically relevant ClinicalSignificance
- Groups by
- Key outputs:
- Best-per-variant TSV
- Best-per-variant pathogenic-only TSV
- Counts summary TSV
- Script:
filter_clinvar_best_high_confidence.py - What it does:
- Filters to high-confidence review categories, such as:
- practice guideline
- reviewed by expert panel
- criteria provided, multiple submitters, no conflicts
- Creates:
- all high-confidence rows
- high-confidence + pathogenic/likely pathogenic subset
- Writes summary counts
- Filters to high-confidence review categories, such as:
- Key outputs:
male_infertility_clinvar_variants_high_confidence.tsvmale_infertility_clinvar_variants_high_confidence_pathogenic.tsv- Counts TSV
- Script:
make_clinvar_high_confidence_report.py - What it does:
- Writes a boss-ready report table from the high-confidence pathogenic subset
- Writes per-gene summaries with phenotype strings and review statuses
- Key outputs:
- Concise report TSV
- Per-gene summary TSV
- Script:
overlap_testis_specific_with_clinvar_tiers.py - What it does:
- Intersects:
- high-confidence testis+sperm gene set
- ClinVar male infertility genes (best-per-variant tables)
- Assigns tiers based on ReviewStatus and ClinicalSignificance (e.g. high-confidence pathogenic vs other)
- Writes:
- overlap table with per-gene evidence counts
- testis-only gene list
- ClinVar-only gene list
- optional long-form evidence table
- Intersects:
These can be applied to the GTEx+sperm table before extracting final gene sets.
- Script:
add_hpo_annotation.py - What it does:
- Joins HPO gene-to-phenotype associations onto the gene table
- Summarises:
- total HPO term count
- reproductive-related term count (regex-based)
- lists of HPO terms, IDs, and disease IDs
- Script:
add_proteomics_annotation.py - What it does:
- Reads a proteomics Excel export and summarises evidence per gene
- Supports filtering by:
- FDR confidence (default “High”)
- species (default “Homo sapiens”)
- minimum unique peptides
- Adds presence across “Found in Sample:” columns and accession summaries
- Script:
add_functional_annotation_mygene.py - What it does:
- Adds names, summaries, gene types, GO terms via MyGene.info
- Uses batching, retry logic, and an optional JSONL cache
At the end of the pipeline you should have:
- GTEx testis specificity ranked table (per-gene tau and enrichment metrics)
- GTEx ranked table with HGNC symbols
- GTEx ranked table annotated with:
- ClinVar gene-condition summaries and optional variant counts
- sperm TPM evidence columns (presence, mean/median)
- optional HPO term summaries
- optional proteomics evidence
- optional functional annotation
high_confidence_testis_sperm_genes.tsv- (optionally)
high_confidence_testis_sperm_genes_function.tsv(if you re-filter after MyGene annotation)
- Best-per-variant and high-confidence subsets for male infertility phenotype-associated variants
- Gene-level summaries of high-confidence pathogenic variants
- Intersection between the testis+sperm gene set and ClinVar male infertility tiers
- Testis-only and ClinVar-only gene lists
- Tiered overlap tables with per-gene evidence counts
- All outputs are TSV (tab-separated).
- Scripts accept named CLI arguments; thresholds are configurable.
- Several scripts write logs to
logs/when--log_pathis provided. - ClinVar filtering steps are designed to stream large inputs efficiently (no need to fully decompress variant_summary).
download_data.shidentify_genes_to_infertility.shfind_testis_expression.shmk_final_results.sh
If you run things manually, follow the step order described above.
See LICENSE for repository licensing terms.