Jari-van-Diermen/Spiny_mice_manuscript

Spiny mice manuscript: positive selection pipeline

Overview

This (episodic) positive selection pipeline was used in the Spiny mice manuscript. The pipeline uses the programs aBSREL and MEME from HyPhy, an open-source software package for comparative sequence analysis using evolutionary models (https://stevenweaver.github.io/hyphy-site/). The pipeline requires a high-performance computing (HPC) cluster. All required software is provided as Docker containers on Docker Hub. These are pulled from Docker Hub as singularity containers, which are subsequently used during execution of the pipeline.

Furthermore, a complementary positive selection analysis using PAML codeml was performed on the positively selected A. cahirinus genes that were identified by HyPhy aBSREL. This repository also contains the scripts and details for this codeml analysis.

Software / docker containers

Several software packages are required to run the positive selection pipeline. Most of them are pre-installed in docker containers, which are publicly available on Docker Hub. These containers should be pulled as singularity containers so they can be run on an HPC where you do not have root access. Thus, singularity must be installed on the HPC.

Other dependencies are:

  • awk
  • mapfile
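
Both dependencies can be verified up front. A minimal sanity check (note that mapfile is a bash builtin, not an external program, so the pipeline's scripts must run under bash):

```shell
# Verify the non-container dependencies before submitting any jobs.
# 'mapfile' is a bash builtin, so it is checked with 'type' rather
# than 'command -v'.
command -v awk >/dev/null 2>&1 || { echo "awk not found" >&2; exit 1; }
bash -c 'type mapfile' >/dev/null 2>&1 || { echo "bash mapfile builtin not available" >&2; exit 1; }
echo "dependencies OK"
```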

Documentation

The scripts used in the pipeline steps below are described in more detail in additional README.md files in the docs directory. Each README.md file is named after the script it explains.

Details aBSREL and MEME analysis

The pipeline can broadly be divided into three parts:

Multiple sequence alignment (MSA) preprocessing

This section generates a number of files required for subsequent steps and performs two major preprocessing steps:

  • Update alignments: The MSA files used for the analysis were generated by Kirilenko et al. (2023), and are accessible here. These MSA files represent human reference genes (hg38), aligned with their TOGA-predicted orthologs in other mammalian species. The exact procedure for the creation of these MSA files can also be found here.

    After the creation of TOGA and these MSA files, a higher-quality nanopore-based A. cahirinus genome assembly became available (Miller, 2023). A. cahirinus orthologs of the human (hg38) reference genes were identified by TOGA and made available here. To utilize the A. cahirinus orthologs from this new assembly, they were selected and added to the existing MSA files, which is the main purpose of the Update_alignments.sh script.

  • Filter identifiers: The MSA files retrieved from the TOGA database contain orthologs from 488 mammalian species, whose genome assemblies vary in quality and in the number of orthologs identified. We therefore used a filtering procedure to remove the lower-quality genome assemblies and the assemblies in which fewer orthologs were identified. Lastly, this preprocessing step reduced the number of species represented in the MSA files using Treemmer (Menardo et al., 2018), which allows for the reduction of phylogenetic data with minimal loss of biodiversity.

aBSREL analysis overview

This section of the pipeline uses the preprocessed MSA files and phylogenetic trees to perform gene-based (episodic) diversifying selection analyses. Every preprocessed MSA file was analyzed by HyPhy aBSREL. For each MSA file, we selected the A. cahirinus branch, the M. musculus branch, the human branch and 12 other, random branches as foreground, each of which was separately analyzed for (episodic) diversifying selection by aBSREL. This resulted in lists of (episodically) diversifying genes for both the A. cahirinus and M. musculus lineages.

This section can be divided into two parts:

  • aBSREL preprocessing: Before the MSA files are analyzed by aBSREL, each file is additionally preprocessed to make it compatible with aBSREL. This includes selecting a genome assembly for species represented by multiple genome assemblies, and dealing with one-2-many orthologs. Furthermore, poorly aligned codons and sequences were removed from the MSA files, and phylogenetic trees were created that exactly match the resulting processed MSA files.

  • aBSREL analysis: The aBSREL preprocessed MSA files and their corresponding phylogenetic trees are analyzed by aBSREL in a parallel fashion.

This section was executed by the scripts in directories scripts/230721_HyPhy_aBSREL_analysis to scripts/230721_HyPhy_aBSREL_analysis_17, which each analyzed a different batch of 1000 MSA files.

MEME analysis overview

This section of the pipeline uses the preprocessed MSA files and phylogenetic trees to perform codon-based (episodic) diversifying selection analyses. Here, we only attempted to identify (episodic) diversifying codons in genes where aBSREL found significant diversifying selection in the A. cahirinus branch.

Similarly to the aBSREL analysis, this section can be divided in two parts:

  • MEME preprocessing: The additional preprocessing steps for the MEME analysis were almost identical to the aBSREL preprocessing. The only difference is in how codons are filtered from the MSA files, so that the codon coordinates in the MEME output correspond to the codon coordinates of the human sequences.
  • MEME analysis: The MEME-preprocessed MSA files and their corresponding phylogenetic trees are analyzed by MEME in a parallel fashion.

Pipeline overview

[Simplified schematic of the pipeline]

More detailed schemas for the MSA preprocessing and aBSREL analysis are available in the schemas directory.

Run the diversifying selection pipeline

Download singularity images

To run the pipeline, we need to pull the required singularity containers from Docker Hub.

Connect to an HPC running the SLURM workload manager. Clone this repository at the desired location (preferably a project folder, as your HOME directory might not have the space needed to run the pipeline) and cd into the singularity directory.

git clone git@github.com:Suirotras/Spiny_mice_manuscript.git
# CHANGE LATER WHEN NAME FOR REPOSITORY IS CHOSEN
cd Spiny_mice_manuscript
# Create and 'cd' into singularity directory
mkdir -p singularity
cd singularity

Pull the docker images as singularity .sif files.

# singularity image for preprocessing steps
singularity build java_gnuparallel_ubuntu.sif docker://jarivdiermen/java_gnuparallel_ubuntu:latest
# singularity image for aBSREL and MEME analysis
singularity build hyphy_ubuntu.sif docker://jarivdiermen/hyphy_ubuntu:latest

Note

Alternatively, it is also possible to build the docker images yourself, using the dockerfiles provided in this repository, and then convert them to singularity containers. However, pulling the singularity containers from Docker Hub guarantees the exact same execution environment.
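
A sketch of that route, assuming a machine where you have Docker available plus root or fakeroot rights; the dockerfile path is illustrative, as is the image tag:

```shell
# Build the image locally from the repository's dockerfile (path assumed),
# then convert it straight from the local Docker daemon into a .sif file.
docker build -t jarivdiermen/hyphy_ubuntu:latest -f path/to/hyphy_ubuntu/Dockerfile .
singularity build hyphy_ubuntu.sif docker-daemon://jarivdiermen/hyphy_ubuntu:latest
```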

MSA preprocessing

Start by moving to the outputs directory:

# CHANGE LATER WHEN NAME FOR REPOSITORY IS CHOSEN
cd Spiny_mice_manuscript/outputs

In the first section of MSA preprocessing, called update alignments, the A. cahirinus orthologs are added to their corresponding MSA files. This is orchestrated by the 230704_update_alignments step. Enter the following commands to execute it.

Note

Update_alignments.sh could take days to run to completion.

# Create output directory
mkdir -p 230704_update_alignments
cd 230704_update_alignments
sbatch -A [YOUR PROJECT ID] ../../scripts/230704_update_alignments/Update_alignments.sh

The next section of MSA preprocessing is called filter identifiers. Here we filter the MSA files acquired in the update alignments section to reduce their size.

However, we first run the step called 230802_assembly_counter. This creates a file with genome assembly counts, where the count represents how many MSA files contain a sequence from each genome assembly. This file is required for the filter identifiers step.

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 230802_assembly_counter
cd 230802_assembly_counter
sbatch -A [YOUR PROJECT ID] ../../scripts/230802_assembly_counter/main.sh

Next, we execute the filter identifiers step.

Note

Filter_identifiers_script.sh expects the HPC node to have at least 20 cores. If your HPC has fewer cores per node, change this in Filter_identifiers_script.sh by editing the #SBATCH -n 20 line.
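
For example, lowering the request to 16 cores is a one-line sed edit. Sketched below on a temporary stand-in file; point it at the real scripts/230718_filter_identifiers/Filter_identifiers_script.sh instead:

```shell
# Demonstrated on a temporary stand-in; replace "$script" with the path to
# the real Filter_identifiers_script.sh to apply the change.
script=$(mktemp)
printf '#!/bin/bash\n#SBATCH -n 20\n' > "$script"
sed -i 's/^#SBATCH -n 20$/#SBATCH -n 16/' "$script"
grep '^#SBATCH' "$script"    # shows: #SBATCH -n 16
rm -f "$script"
```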

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 230718_filter_identifiers
cd 230718_filter_identifiers
sbatch -A [YOUR PROJECT ID] ../../scripts/230718_filter_identifiers/Filter_identifiers_script.sh

The last step of the MSA preprocessing is called multiple_codon_alignment_remover. Here, we remove MSA files without A. cahirinus sequences (i.e. without sequences from the HLacoCah2 assembly). This step is not strictly necessary, as MSA files without A. cahirinus sequences should not have passed the filter identifiers step; it serves to remove any erroneous MSA files from the analysis.

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 230721_multiple_codon_alignment_remover
cd 230721_multiple_codon_alignment_remover
sbatch -A [YOUR PROJECT ID] ../../scripts/230721_multiple_codon_alignment_remover/multiple_codon_alignment_remover.sh

aBSREL analysis

The MSA files generated by the MSA preprocessing step can be used for HyPhy aBSREL analysis, which analyzes each MSA file for significant signs of (episodic) diversifying selection in the A. cahirinus and M. musculus branches (i.e. lineages). Each MSA file undergoes additional processing to make it aBSREL-compatible, after which each MSA file and its corresponding phylogenetic tree are used for aBSREL analysis.

MSA preprocessing generated separate batches, each containing 1000 MSA files (with the exception of the last batch, which only represents the remainder). These batches were separately analyzed for manageability.

The batches can be analyzed by aBSREL with the following commands:

Note

230721_HyPhy_aBSREL_analysis is analyzed in batches. For a limited analysis, fewer batches can be analyzed.

### Analyze batch 1
# move back to the outputs directory
cd Spiny_mice_manuscript/outputs
# Create and move into directory
mkdir -p 230721_HyPhy_aBSREL_analysis
cd 230721_HyPhy_aBSREL_analysis
# start aBSREL analysis
bash ../../scripts/230721_HyPhy_aBSREL_analysis/HyPhy_aBSREL_initialize.sh "[YOUR PROJECT ID]"

### Analyze batch 2
# move back to the outputs directory
cd Spiny_mice_manuscript/outputs
# Create and move into directory
mkdir -p 230721_HyPhy_aBSREL_analysis_2
cd 230721_HyPhy_aBSREL_analysis_2
# start aBSREL analysis
bash ../../scripts/230721_HyPhy_aBSREL_analysis_2/HyPhy_aBSREL_initialize.sh "[YOUR PROJECT ID]"

###-----------------------------------------------###
###-- Continue until last batch (i.e. batch 17) --###
###-----------------------------------------------###

### Analyze batch 16
# move back to the outputs directory
cd Spiny_mice_manuscript/outputs
# Create and move into directory
mkdir -p 230721_HyPhy_aBSREL_analysis_16
cd 230721_HyPhy_aBSREL_analysis_16
# start aBSREL analysis
bash ../../scripts/230721_HyPhy_aBSREL_analysis_16/HyPhy_aBSREL_initialize.sh "[YOUR PROJECT ID]"

### Analyze batch 17
# move back to the outputs directory
cd Spiny_mice_manuscript/outputs
# Create and move into directory
mkdir -p 230721_HyPhy_aBSREL_analysis_17
cd 230721_HyPhy_aBSREL_analysis_17
# start aBSREL analysis
bash ../../scripts/230721_HyPhy_aBSREL_analysis_17/HyPhy_aBSREL_initialize.sh "[YOUR PROJECT ID]"
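
The 17 near-identical invocations above can also be driven by a single loop. A sketch, run from the outputs directory; it is a dry run (the submission command is only echoed), so drop the echo to actually start each batch:

```shell
# Batch 1 has no numeric suffix; batches 2-17 append "_$i", matching the
# directory names used above.
for i in $(seq 1 17); do
  if [ "$i" -gt 1 ]; then
    dir="230721_HyPhy_aBSREL_analysis_$i"
  else
    dir="230721_HyPhy_aBSREL_analysis"
  fi
  mkdir -p "$dir"
  ( cd "$dir" && echo bash "../../scripts/$dir/HyPhy_aBSREL_initialize.sh" "[YOUR PROJECT ID]" )
done
```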

Every MSA file analyzed by aBSREL produces an output JSON file. The next step combines the output information of all aBSREL-analyzed MSA files into a single RData file. This step is called 230906_CombineResults_HyPhy_aBSREL.

Note

CombineResults_HyPhy_aBSREL.sh expects the HPC node to have at least 19 cores. If your HPC has fewer cores per node, change this in CombineResults_HyPhy_aBSREL.sh by editing the #SBATCH -n 19 line.

Note

CombineResults_HyPhy_aBSREL.sh combines all batches into a single RData file. If a limited aBSREL analysis was performed, for example only batches 1 and 2 were analyzed, this should be indicated in the CombineResults_HyPhy_aBSREL.sh script.

The script contains an array named dirs. If only batches 1 and 2 were analyzed, change the array to the following so that only those batches are combined into the RData output file.

dirs=(
 "../230721_HyPhy_aBSREL_analysis" \
 "../230721_HyPhy_aBSREL_analysis_2" \
)

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 230906_CombineResults_HyPhy_aBSREL
cd 230906_CombineResults_HyPhy_aBSREL
sbatch -A [YOUR PROJECT ID] ../../scripts/230906_CombineResults_HyPhy_aBSREL/CombineResults_HyPhy_aBSREL.sh

The aBSREL results are combined in a file called aBSREL_analysis_pval_count.RData.

Next, we generate (episodic) diversifying selection genelists for both A. cahirinus (assembly: HLacoCah2) and M. musculus (assembly: mm39). This is done by 240112_get_orthology_genelists and 240116_get_orthology_genelists_mus respectively.

Generate the A. cahirinus genelists.

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 240112_get_orthology_genelists
cd 240112_get_orthology_genelists
sbatch -A [YOUR PROJECT ID] ../../scripts/240112_get_orthology_genelists/get_orthology_genelists.sh

This generates the following data:

  • aBSREL_genes_ortholog_info.tsv: a tab-delimited file that lists the genes, their orthology relationship with the human transcript, and a functional prediction for that ortholog. The orthology relationship and the functional prediction were generated by TOGA and simply taken from their online database (https://genome.senckenberg.de//download/TOGA/human_hg38_reference/).

  • Three types of positively selected genelists

    • All A. cahirinus-significant fdr-corrected genes
    • Only the A. cahirinus-significant fdr-corrected genes that have a one-2-one orthology relationship between the A. cahirinus and human genome assemblies.
    • Only the A. cahirinus-significant fdr-corrected genes that fulfill both conditions:
      • The gene has a one-2-one orthology relationship between the A. cahirinus and human genome assemblies.
      • The gene was predicted by TOGA to be intact or partially intact in A. cahirinus (i.e. not predicted as uncertain loss).

Now we generate the M. musculus genelist.

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 240116_get_orthology_genelists_mus
cd 240116_get_orthology_genelists_mus
sbatch -A [YOUR PROJECT ID] ../../scripts/240116_get_orthology_genelists_mus/get_orthology_genelists_mus.sh

This generates the following data:

  • aBSREL_genes_ortholog_info_mus.tsv: a tab-delimited file that lists the genes, their orthology relationship with the human transcript, and a functional prediction for that ortholog.

  • Three types of positively selected genelists

    • All M. musculus-significant fdr-corrected genes
    • Only the M. musculus-significant fdr-corrected genes that have a one-2-one orthology relationship between the M. musculus and human genome assemblies.
    • Only the M. musculus-significant fdr-corrected genes that fulfill both conditions:
      • The gene has a one-2-one orthology relationship between the M. musculus and human genome assemblies.
      • The gene was predicted by TOGA to be intact or partially intact in M. musculus (i.e. not predicted as uncertain loss).

MEME analysis

Just like the aBSREL analysis, the MSA files generated by the MSA preprocessing step can be used for HyPhy MEME analysis. We limit the MEME analysis to the MSA files where aBSREL identified (episodic) diversifying selection in the A. cahirinus lineage (i.e. branch). HyPhy MEME analyzes these MSA files for diversifying sites (i.e. codons).

Similar to the aBSREL analysis, each MSA file undergoes additional processing to make it MEME-compatible, after which each MSA file and a corresponding phylogenetic tree are used for MEME analysis.

The MSA files can be analyzed by MEME with the following commands:

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create and move into directory
mkdir -p 230926_MEME_analysis
cd 230926_MEME_analysis

# start MEME analysis
bash ../../scripts/230926_MEME_analysis/HyPhy_MEME_initialize.sh "[YOUR PROJECT ID]"

This script automatically detects which MSA files showed significant diversifying selection in the A. cahirinus branch (it counts the number of lines in ../240112_get_orthology_genelists/genelists/aBSREL_human_transcript_ids_sign.txt).
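
That detection amounts to a line count over the genelist; an illustrative sketch (the real logic lives inside HyPhy_MEME_initialize.sh, and the file below is a stand-in for the actual genelist, with made-up transcript IDs):

```shell
# Stand-in for ../240112_get_orthology_genelists/genelists/aBSREL_human_transcript_ids_sign.txt
genelist=$(mktemp)
printf 'ENST00000000001\nENST00000000002\n' > "$genelist"   # made-up transcript IDs
n_genes=$(wc -l < "$genelist")                              # one transcript per line
echo "MEME will analyze $n_genes MSA files"
rm -f "$genelist"
```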

Similarly to the aBSREL analysis, we next combine the MEME results in a single RData file, as the numerous JSON files are unwieldy. This is done by the CombineResults_HyPhy_MEME step.

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 231002_CombineResults_HyPhy_MEME
cd 231002_CombineResults_HyPhy_MEME
sbatch -A [YOUR PROJECT ID] ../../scripts/231002_CombineResults_HyPhy_MEME/CombineResults_HyPhy_MEME.sh

The combining procedure generates two RData files:

  • MEME_results_REF_seq.RData: a dataframe with the MSAs added as columns. For every analyzed MSA file, the columns contain the preprocessed MSA sequences (under the column REFERENCE_sequence) and the original unprocessed MSA sequences (under the column original_sequence).

  • MEME_results_sign_sites.RData: a dataframe with a variety of information about the HyPhy MEME analyses performed on the MSA files. Its columns contain information on the maximum likelihood estimates, estimated model parameters, posterior probabilities, diversifying sites and some additional information.

Two additional steps extend the information already available in the RData files produced by CombineResults_HyPhy_MEME. These steps are called get_EBF_values and get_subsitutions.

  • get_EBF_values calculates the empirical Bayes factor (EBF) for the event of diversifying selection (more information about this factor, and its unavoidable limitations, can be found in Murrell et al. (2012)). The EBF gives an indication, albeit a low-powered one, of which branches (i.e. evolutionary lineages) contributed to the diversifying selection signal at a specific site. This information is added as an additional column.

  • get_subsitutions generates additional columns in MEME_results_sign_sites.RData that display information about the substitutions that occurred along each branch.
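
As a reminder of what the EBF measures: it is the posterior odds of the selection event divided by its prior odds. A sketch of that arithmetic with made-up probabilities (not values from this analysis):

```shell
awk 'BEGIN {
  post  = 0.9   # hypothetical posterior probability of selection at this branch/site
  prior = 0.5   # hypothetical prior probability of the same event
  ebf = (post / (1 - post)) / (prior / (1 - prior))
  printf "EBF = %.1f\n", ebf     # prints: EBF = 9.0
}'
```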

Generate the EBF values:

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 231017_get_EBF_values
cd 231017_get_EBF_values
sbatch -A [YOUR PROJECT ID] ../../scripts/231017_get_EBF_values/get_EBF_values.sh

This generates the following files:

  • MEME_EBF_values.RData: the dataframe with calculated EBF values (i.e. MEME_results_sign_sites.RData with EBF information).

  • MEME_EBF_values_simple.RData: a simplified version of MEME_EBF_values.RData with fewer columns.

Generate the information about the substitutions:

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 231017_get_subsitutions
cd 231017_get_subsitutions
sbatch -A [YOUR PROJECT ID] ../../scripts/231017_get_subsitutions/get_substitutions.sh

This generates the following files:

  • MEME_substitutions.RData: the dataframe with added substitution columns (i.e. MEME_results_sign_sites.RData with substitution information).

  • MEME_substitutions_small.RData: a simplified version of MEME_substitutions.RData with fewer columns.

Details codeml analysis

The codeml analysis uses components of the aBSREL analysis pipeline as input:

  • The preprocessed MSA files that are generated in the MSA preprocessing step of the aBSREL analysis pipeline are used as input files for the codeml analysis.

  • The positively selected A. cahirinus genelist, generated by the aBSREL analysis pipeline, is used to select which MSA files to analyze with codeml.

Thus, the aBSREL analysis pipeline has to be executed before the codeml analysis.

codeml analysis overview

The codeml analysis is divided into two parts:

  • codeml analysis: This step contains scripts to run codeml, which is part of PAML (Phylogenetic Analysis by Maximum Likelihood). Only MSA files with positively selected A. cahirinus genes (as determined by aBSREL) are analyzed.

    First, a branch-site model of positive selection (branch-site model A) is estimated for the MSA files; we call this the alternative model. Second, a modified branch-site model of positive selection is fitted to the same MSA files; we call this the null model. In the null model, the dN/dS (omega) rate of the foreground branches (in our case the A. cahirinus branch) is fixed at one (ω = 1). This allows us to perform a likelihood ratio test by comparing the log-likelihoods of the alternative and null models for each gene. When the alternative model is significantly more likely than the null model, this indicates that the foreground branches (here Acomys cahirinus) were affected by positive selection.

    In addition to the MSA preprocessing already performed, additional preprocessing is done to make the alignments compatible with codeml. This is identical to the preprocessing performed to make the MSA files compatible with aBSREL, with a single change: the MSA files are converted from FASTA to PHYLIP format, as codeml expects MSA files in PHYLIP format.

  • codeml likelihood ratio test: After the codeml estimation of the alternative and null models, the log-likelihoods of both models are compared in a likelihood ratio test (LRT). This LRT, followed by an FDR correction, generates the p-values that determine whether a gene is considered positively selected in the A. cahirinus evolutionary lineage.
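
The LRT arithmetic itself is simple: twice the difference in log-likelihoods, compared against a chi-square distribution. A sketch with made-up lnL values (the actual test, including degrees of freedom and the FDR correction, is implemented in the codeml LRT step):

```shell
awk 'BEGIN {
  lnL_alt  = -10234.6   # hypothetical log-likelihood, branch-site model A
  lnL_null = -10239.8   # hypothetical log-likelihood, null model (foreground omega fixed at 1)
  D = 2 * (lnL_alt - lnL_null)
  printf "LRT statistic D = %.1f (0.05 critical value for chi-square, 1 df: 3.84)\n", D
}'
```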

Run the codeml positive selection analysis

Download codeml singularity images

To run the codeml analysis, we need to pull the required singularity containers from Docker Hub.

cd into the singularity directory to pull the required containers.

# CHANGE LATER WHEN NAME FOR REPOSITORY IS CHOSEN
cd Spiny_mice_manuscript/singularity

# singularity image for the LRT step, same image as was used in the aBSREL pipeline
singularity build java_gnuparallel_ubuntu.sif docker://jarivdiermen/java_gnuparallel_ubuntu:latest
# singularity image with PAML for codeml analysis
singularity build paml_ubuntu.sif docker://jarivdiermen/paml_ubuntu:latest

codeml analysis

The MSA files generated by the MSA preprocessing step of the aBSREL pipeline can be used for codeml analysis, which analyzes the MSA files for significant signs of (episodic) diversifying selection in the A. cahirinus branch (i.e. lineage). Each MSA file undergoes additional processing to make it codeml-compatible, after which each MSA file and its corresponding phylogenetic tree are used for codeml analysis. The additional processing is almost identical to the processing that makes the MSA files aBSREL-compatible, with the exception that the files are converted to PHYLIP format.

We limit the codeml analysis to the MSA files where aBSREL identified (episodic) diversifying selection in the A. cahirinus lineage (i.e. branch).

The MSA files can be analyzed by codeml with the following commands:

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create and move into directory
mkdir -p 231122_codeml_analysis
cd 231122_codeml_analysis

# start codeml analysis
bash ../../scripts/231122_codeml_analysis/PAML_codeml_initialize.sh "[YOUR PROJECT ID]"

codeml LRT

Next, the log-likelihoods of the alternative and null models are retrieved from the codeml results. This is achieved with the following commands:

# move back to the outputs directory
cd Spiny_mice_manuscript/outputs

# Create output directory
mkdir -p 231129_codeml_LRT
cd 231129_codeml_LRT
sbatch -A [YOUR PROJECT ID] ../../scripts/231129_codeml_LRT/codeml_LRT.sh

This produces an RData file, lnl_values.RData, containing a dataframe that describes the LRT results for each analyzed MSA file. More details about the contents of this dataframe can be found in README.codeml_LRT.md.

Details data visualization and exploration

Additional data exploration, data processing and data visualization was performed locally in a conda environment named Acomys_comp_genomics. This conda environment was containerized in an identically named Docker image that is available on Docker Hub.

This Docker image can be retrieved as a singularity image using the following command:

# build the singularity image with the 'Acomys_comp_genomics' conda environment
singularity build --fakeroot --force acomys_conda_environments.sif docker://jarivdiermen/acomys_conda_environments:latest

Scripts that processed the data and produced the visualizations were then executed using the following command:

singularity exec -B {host_dir}:{container_dir},{.cache_dir} --no-home acomys_conda_environments.sif bash -c "source /home/jari/miniconda3/bin/activate Acomys_comp_genomics && 'COMMAND'"

Where:

  • -B is the option that mounts host directories into the singularity container. Here, the following directories are mounted:

    • host_dir is the directory where the necessary scripts and/or files are located on the host file system,
    • container_dir is the directory in the singularity container where the scripts and/or files from host_dir will be located,
    • .cache_dir is the cache directory that the singularity container may use during script execution, as some scripts may write temporary files to cache directories.
  • --no-home is an option to stop singularity from mounting the complete home directory to the singularity container. This is necessary because mounting the complete home directory would hide the miniconda installation and the installed conda environments.

One example of how this singularity image can be used is shown below. In this example, an R script is executed using the Acomys_comp_genomics conda environment.

singularity exec -B /home/jari/Documents/Github/Repo_minor_research_project:/home/jari/Documents/Github/Repo_minor_research_project,/home/jari/.cache --no-home /home/jari/Documents/Github/containerize-conda-envs_docker/singularity_containers/acomys_conda_environments.sif bash -c "source /home/jari/miniconda3/bin/activate Acomys_comp_genomics && R -f map_substitutions_to_uniprot.R"

Retrieval of protein domains from the Proteins REST API

In this study, we used Uniprot functional sequence annotations (e.g. protein domains, motifs) for all aBSREL-identified positively selected genes, retrieved from the Proteins REST API. More specifically, we used a Go program called Progo, which efficiently retrieves the protein domains for a list of gene IDs (https://github.com/addityea/progo/tree/v1.0.0). Progo uses the features service of the Proteins REST API to retrieve the annotations from the Uniprot Knowledgebase (UniprotKB) (Nightingale et al., 2017).

The specific Go binary used in this study is located at Spiny_mice_manuscript/scripts/get_protein_features/progo_linux_amd64. Furthermore, we use a bash wrapper to more easily write the output for different Uniprot annotations to separate tsv files.

How Uniprot functional annotations were retrieved

Start by moving to the directory containing the Go binary:

# CHANGE LATER WHEN NAME FOR REPOSITORY IS CHOSEN
cd Spiny_mice_manuscript/scripts/get_protein_features/

Next we retrieve the Uniprot annotations using the following command:

bash get_protein_features.sh -i /path/to/input/text/file/with/gene_symbols.txt -f "BINDING,CARBOHYD,COILED,CROSSLNK,DISULFID,DNA_BIND,DOMAIN,LIPID,MOD_RES,MOTIF,REGION,REPEAT,SITE,TRANSIT,ZN_FING,CHAIN,SIGNAL" -n "No"

Where:

  • -i: represents the file with the gene symbols for which we want to retrieve annotations
  • -f: The type of annotations that we want to retrieve from the features service of the Proteins REST API.
  • -n: Whether the gene symbol text file has a header. Should be "No" or "Yes".

This produces output directories in the current working directory (e.g. progo_output_DOMAIN or progo_output_MOTIF) containing one file per gene symbol (e.g. ADGRE1_features.csv or BRD8_features.csv). The files look like the example below:

Gene	Start	End	DOMAIN
ADGRE1	31	79	EGF-like 1
ADGRE1	80	131	EGF-like 2; calcium-binding
ADGRE1	132	171	EGF-like 3; calcium-binding
ADGRE1	172	220	EGF-like 4; calcium-binding
ADGRE1	221	267	EGF-like 5; calcium-binding
ADGRE1	268	316	EGF-like 6; calcium-binding
ADGRE1	547	596	GPS

With four columns:

  • Gene: The gene identifier.
  • Start: The start amino acid position, relative to the canonical sequence of the Uniprot Swissprot accession.
  • End: The end amino acid position, relative to the same sequence.
  • DOMAIN: the sequence feature indicated by the Start and End coordinates.

About

Pipeline for running the Acomys cahirinus positive selection analysis using HyPhy aBSREL, HyPhy MEME and PAML codeml
