A SEDA-Compi pipeline to study gene evolution using getorf/BLAST and ProSplign-ProCompart.
A Docker image is available for this pipeline in this Docker Hub repository.
- Pipeline overview
- Using the
gene-evolutionpipeline image in Linux - Running the pipeline with sample data
- Advanced options
- Troubleshooting
- Publications
The image below shows an overview of the pipeline. It is important to note that there are two alternative ways of starting the pipeline and reach the two main branches:
- Green: Through
extract_filesstarting with a set of downloaded files located at the input directory as explained below. - Blue: Through
download_genomes, starting with a list of accessions located at the input file especified below. This alternative is aimed at reducing disk space as genomes are removed from disk after being analyzed bygetorf_and_blastandprosplign_procompart.
Also, note that the ProSplign-ProCompart branch (starting either in prosplign_procompart or prosplign-procompart), is executed by default but it can be disbled as explained before.
To perform an analysis, users must first:
- Initialize a working directory with the files required by the pipeline.
- Add the input data to be analyzed.
- Configure the pipeline parameters.
This section provides a comprehensive guide on how to perform these steps and describes the tools and scripts included in the pipeline image to do so easily.
To start a new analysis, the first step is to build the directory tree in your local file system. This directory tree will be referred to as the working directory, and its structure is recognized and used by the pipeline during the analysis. As an alternative to start, next subsection explains how to download a sample dataset to start with an analysis-ready working directory.
To build the working directory, adapt the first line of the following code and run it:
WORKING_DIRECTORY=/path/to/the/working-directory
mkdir -p ${WORKING_DIRECTORY}
docker run --rm \
-v ${WORKING_DIRECTORY}:${WORKING_DIRECTORY} \
-u "$(id -u)":"$(id -g)" \
--entrypoint=/bin/bash \
pegi3s/gene-evolution \
init_working_dir.sh ${WORKING_DIRECTORY}After running this command, the selected working directory should have the following structure:
sample-working-dir/
├── compi.params
├── input/
├── params/
├── query
├── query1
├── pipeline.png
├── README.txt
└── run.sh
Where:
compi.paramscontains the Compi parameters.inputis a directory to put the input data.paramsis a directory that contains the default SEDA operations' parameters.queryandquery1are the two FASTA files used as input in the pipeline.pipeline.pngcontains the pipeline graph.README.txtcontains the next steps you need to follow to run the analysis.run.shis the script to run the analysis.
As explained previously, there are two ways of starting the pipeline:
- Through
extract_filesstarting with a set of downloaded files that must be provided atinput/extract_files/. These are the CGA/CGF zipped files as they are downloaded from NCBI. - Through
download_genomes, starting with a list of accessions that must be provided atinput/accessions_list.txt. See controlling genome downloads for more details.
The query FASTA file is used for BLAST and the query1 FASTA file is used for ProSplign-ProCompart. They are referenced in params/blast_1.cliParams and params/prosplign-procompart.cliParams respectively.
Note that headers in the query must follow the format >Gene Description, for instance:
>FOXD2 XP_016006825_2
>D1B XP_015992634_2
>mesogenin1 XP_016009398_2
If not, the pipeline will fail to produce the correct outputs.
It is possible to test the pipeline using our sample data available here. Download this ZIP file and decompress it in your local file system. Then, to execute the pipeline using Docker, run the following command changing the /path/to/seda-compi-gene-evolution/ to the path where you have the decompressed data.
./run.sh /path/to/seda-compi-gene-evolutionPipeline results will be created in a directory called results inside the working directory and intermediate files will be stored at output.
The pipeline execution can be customized (e.g. setting the maximum number of parallel tasks, partial executions, and so on) by providing an additional parameter to the run.sh script. Below are some examples:
./run.sh /path/to/working-directory "--single-task extract-files --num-tasks 2"./run.sh /path/to/working-directory "--from getorf --until blast_merge"The ProSplign-ProCompart branch is executed by default but the execution fo its associated tasks can be ommited by including the skip_prosplign_procompart flag in the compi.params file.
As explained earlier, the download_genomes task is designed to save disk space by removing genomes after they are analyzed by getorf_and_blast and prosplign_procompart. The task starts with a list of accessions located at input/accessions_list.txt.
If there are X accessions in the list, Compi will launch up to X instances of download_genomes (one per accession). Thus, the pipeline must be run with --num-tasks X, otherwise not all accessions will be processed.
Then, the pipeline parameter max_tasks controls how many accessions can be downloaded and processed simultaneously by the subsequent tasks (getorf_and_blast and prosplign_procompart).
If max_tasks = 1 (default), all remaining tasks will be paused at download_genomes—before any download begins—until another genome finishes processing and is removed by remove_genomes. In this way, at most max_tasks genomes are stored at once, preventing excessive disk usage. To change it, set it explicitly in the compi.params file.
| Problem | Possible cause |
|---|---|
download_genomes tasks do not all start |
--num-tasks lower than number of accessions in input/accessions_list.txt |
check_failed_downloads reports failed accessions |
Network/transient NCBI issue or accession deprecated |
results/gene_sequences contains a single FASTA file |
The query file headers do not follow the requirements (>Gene Description) |
results/gene_sequences contains filenames that do not correspond to a gene name (e.g. mesogenin1_GCA_004027595.1_DinBra_v1_BIUU_genomic.fna.sequences.fasta instead of mesogenin1.fasta) |
There was an error retrieving the taxonomic information at rename-ncbi_1, try re-running the pipeline from that point (with --from rename-ncbi_1 --until label_reference) after removing the affected output directories (cd /path/to/working_dir && rm -rf output/rename-ncbi_1 output/replace_1 output/rename-header-multipart_1 output/blast_merge results) |
- C. P. Vieira; H. López-Fernández; M. Reboiro-Jato; J. Vieira (2025) GULO evolution in bats: a genomic approach. 19th International Conference on Practical Applications of Computational Biology & Bioinformatics: PACBB 2025. Lille, France. 25 - June
