VirStrain

VirStrain is an RNA virus strain-level identification tool for short-read sequencing data.

Overview

VirStrain supports:

Strain identification from single-end and paired-end short reads
Strain identification from assembled contigs
Construction of custom VirStrain databases
Use of pre-built public databases for common viral species

Contact

Email: heruiliao2-c@my.cityu.edu.hk
Recommended version: v1.18
Legacy note: v1.14 fixed some bugs, but did not include virstrain_contig or virstrain_merge

Changelog

2026 updates

2026-04-05

v1.18: VirStrain supports Python >=3.8 now!

2024 updates

2024-05-28

v1.17: Added the -v parameter to display version information
Available in the GitHub version only

2024-03-11

v1.17: Synced all changes to both GitHub and Conda

2024-02-27

Tem_Vs files are now named randomly in the GitHub version
Added links for downloading pre-built databases

2023 updates

2023-10-12

v1.14: Fixed a bug in v1.13 related to handling gzipped FASTQ files

2023-09-05

Added a new function for contig-based viral strain identification
Supports comprehensive identification across 45,619 strains from 28 viral species

2022 updates

2022-12-20

v1.13: Fixed a database generation bug present in v1.12 of the Bioconda release

2022-12-16

The VirStrain web server extension, StrainDetect, is now online:
https://strain.ee.cityu.edu.hk

2022-11-10

Added parameter -s to sort the most likely strain by site matches

2022-03-23

Fixed a Perl script bug related to header name handling

2022-02-08

Added an alternative method for downloading databases from Figshare

2022-02-05

v1.12: VirStrain can now accept gzipped FASTQ input files

2021 updates

2021-11

Added downloadable databases for two DNA viruses used in the paper:
- HBV
- HCMV
Added a larger SARS-CoV-2 database
See Supplementary Section 1.1 of the paper

Requirements

Dependencies

Python >= 3.8
- Recommended: 3.10.19
- Should also work on Python versions newer than 3.11
Perl
Python packages:
- biopython >=1.78
- matplotlib-base >=3.1
- networkx >=2.4
- numpy >=1.17
- pandas >=1.0
- plotly >=5.0
Bowtie2
Required for VirStrain version >= v1.18
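
For a manual installation, the dependency list above can be checked before the first run. The following is a minimal sketch, not part of VirStrain: the helper names (`version_tuple`, `meets_minimum`, `missing_tools`, `check_environment`) are hypothetical, and it assumes the PyPI distribution name `matplotlib` in place of the Conda package `matplotlib-base` and that the Bowtie2 executable is called `bowtie2`.

```python
import shutil
import sys
from importlib import metadata

# Minimum versions copied from the dependency list above.
# Assumption: "matplotlib-base" is the Conda name; pip installs it as "matplotlib".
MINIMUMS = {
    "biopython": "1.78",
    "matplotlib": "3.1",
    "networkx": "2.4",
    "numpy": "1.17",
    "pandas": "1.0",
    "plotly": "5.0",
}


def version_tuple(text):
    """Turn '1.26.4' into (1, 26, 4); non-numeric suffixes such as 'rc1' are ignored."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """Compare two dotted version strings numerically."""
    return version_tuple(installed) >= version_tuple(minimum)


def missing_tools(which=shutil.which):
    """Return the external programs from the list above that are not on PATH."""
    return [tool for tool in ("perl", "bowtie2") if which(tool) is None]


def check_environment():
    """Return a list of human-readable problems (empty if nothing was found)."""
    problems = []
    if sys.version_info < (3, 8):
        problems.append("Python >= 3.8 is required")
    for name, minimum in MINIMUMS.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if not meets_minimum(installed, minimum):
            problems.append(f"{name} {installed} is older than {minimum}")
    problems += [f"{tool} not found on PATH" for tool in missing_tools()]
    return problems
```

Calling `check_environment()` and printing each returned line gives a quick overview of what still needs to be installed.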

If you use Conda, you can install required packages automatically with:

sh install_package.sh

If you install VirStrain via Bioconda or pip, you can ignore manual dependency installation.

Installation

Supported platform: Linux / Ubuntu

Option 1: Install with Bioconda

Once Bioconda is configured:

# Create new conda environment and install VirStrain
conda create -n virstrain_env -c conda-forge -c bioconda virstrain=1.18 -y
# or mamba create -n virstrain_env -c conda-forge -c bioconda virstrain=1.18 -y

# Activate the environment
conda activate virstrain_env

chmod 755 bin/jellyfish-linux

Optional: you can also install VirStrain directly into an existing conda environment (this may cause dependency conflicts)

conda install -c bioconda virstrain
# or: mamba install -c bioconda virstrain

Option 2: Install with pip

pip install virstrain==1.18
chmod 755 bin/jellyfish-linux

Option 3: Manual installation

Make sure all dependencies are installed first.

git clone https://github.com/liaoherui/VirStrain.git
cd VirStrain
chmod 755 bin/jellyfish-linux
rm VirStrain_DB.tar.gz

Command mapping

If you installed VirStrain via Bioconda or pip, use the following command names:

Source install command	Bioconda / pip command
`python VirStrain.py -h`	`virstrain -h`
`python VirStrain_build.py -h`	`virstrain_build -h`
`python VirStrain_contig.py -h`	`virstrain_contig -h`
`python VirStrain_contigDB_merge.py -h`	`virstrain_merge -h`
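
Scripts that wrap VirStrain may need to work with both install styles. The sketch below shows one way to pick the right command prefix from the mapping above. `resolve_command` is a hypothetical helper, and the fallback assumes the current directory is a source checkout.

```python
import shutil
import sys

# Mapping taken from the table above: source script -> installed command.
COMMANDS = {
    "VirStrain.py": "virstrain",
    "VirStrain_build.py": "virstrain_build",
    "VirStrain_contig.py": "virstrain_contig",
    "VirStrain_contigDB_merge.py": "virstrain_merge",
}


def resolve_command(script, which=shutil.which):
    """Return the argv prefix for a VirStrain script.

    Prefers the Bioconda / pip entry point when it is on PATH; otherwise
    falls back to running the script with the current Python interpreter.
    """
    installed = COMMANDS[script]
    if which(installed):
        return [installed]
    return [sys.executable, script]
```

For example, `resolve_command("VirStrain.py") + ["-h"]` yields an argument list that can be passed to `subprocess.run`.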

Databases

Download the default reference database

After cloning the repository:

cd VirStrain
sh download.sh

Alternative download method

cd VirStrain
wget -qO- "https://figshare.com/ndownloader/files/34002479" | tar -zx

You may also download the database manually from Google Drive or Figshare and extract it with:

tar -zxvf <downloaded_file>
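
Where `wget` or `tar` are unavailable, the same download-and-extract step can be approximated with the Python standard library. This is a sketch under stated assumptions: the URL is the Figshare link from the command above, `download` and `extract_database` are hypothetical helper names, and the path check is an extra precaution that is not part of the original instructions.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL copied from the wget command above.
FIGSHARE_URL = "https://figshare.com/ndownloader/files/34002479"


def download(url, target):
    """Stream url to the file target and return its path."""
    target = Path(target)
    with urllib.request.urlopen(url) as response, open(target, "wb") as handle:
        while True:
            chunk = response.read(1 << 20)  # read 1 MiB at a time
            if not chunk:
                break
            handle.write(chunk)
    return target


def extract_database(archive, dest="."):
    """Extract a .tar.gz archive into dest, refusing members that would escape it."""
    dest = Path(dest).resolve()
    with tarfile.open(archive, "r:gz") as tar:
        for member in tar.getmembers():
            member_path = (dest / member.name).resolve()
            if member_path != dest and dest not in member_path.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        tar.extractall(dest)
    return dest
```

A typical call would be `extract_database(download(FIGSHARE_URL, "VirStrain_DB.tar.gz"))`, run from inside the VirStrain directory.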

If all download methods fail, please contact the author by email.

Additional downloadable databases

DNA virus databases

sh download_dna.sh

Includes databases for:

HBV
HCMV

Larger SARS-CoV-2 database

sh download_scov2_big.sh

Contig-level database

sh download_contig_db.sh

Pre-built database downloads

If the download scripts fail, pre-built databases are also available via Google Drive.

Name	Description	Download
`VirStrain_DB.tar.gz`	Databases containing SCOV2, H1N1, and HIV strains used in the paper	Google Drive
`SCOV2_newBig.tar.gz`	Expanded database containing additional SCOV2 strains	Google Drive
`VirStrain_DNA_DB.tar.gz`	Databases containing HBV and HCMV strains	Google Drive
`VirStrain_contig_DB.tar.gz`	Contig-level database	Google Drive

Usage

If you installed VirStrain via Bioconda or pip, replace script-based commands with the corresponding installed commands shown above.

1) Identify RNA virus strains from short reads

Single-end reads

python VirStrain.py -i Test_Data/MT451123_1.fq -d VirStrain_DB/SCOV2 -o MT451123_SE_Test

Paired-end reads

python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

High-mutation viruses such as HIV

Use the -m option.

Single-end:

python VirStrain.py -i <Read1> -d VirStrain_DB/HIV -o <Output_dir> -m

Paired-end:

python VirStrain.py -i <Read1> -p <Read2> -d VirStrain_DB/HIV -o <Output_dir> -m
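
When many samples need to be processed, the four command variants above can be generated from one function. The sketch below only uses flags documented in this README (`-i`, `-p`, `-d`, `-o`, `-m`); `build_virstrain_cmd` and `run_virstrain` are hypothetical helper names, and the `python VirStrain.py` prefix assumes a source install.

```python
import subprocess


def build_virstrain_cmd(reads1, database, output_dir, reads2=None, high_mutation=False):
    """Assemble the argument list for VirStrain.py."""
    cmd = ["python", "VirStrain.py", "-i", reads1]
    if reads2 is not None:
        cmd += ["-p", reads2]  # second FASTQ file for paired-end data
    cmd += ["-d", database, "-o", output_dir]
    if high_mutation:
        cmd.append("-m")  # for high-mutation viruses such as HIV
    return cmd


def run_virstrain(cmd):
    """Run the command and raise CalledProcessError on a non-zero exit code."""
    subprocess.run(cmd, check=True)
```

For instance, `build_virstrain_cmd("Test_Data/MT451123_1.fq", "VirStrain_DB/SCOV2", "MT451123_PE_Test", reads2="Test_Data/MT451123_2.fq")` reproduces the paired-end example as a list that `run_virstrain` can execute.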

2) Identify viral strains from assembled contigs

python VirStrain_contig.py -i <Input_Contig_fasta> -d VirStrain_contig_DB -o VirStrain_Contig_Res

Convert read-based databases into a contig database

python VirStrain_contigDB_merge.py -i VirStrain_DB/SCOV2,VirStrain_DB/H1N1 -o VirStrain_contig_DB_merge

3) Build a custom VirStrain database

Basic usage:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>

Important header naming rule

Characters , and | are not allowed in sequence headers in <Input_MSA>.

Examples:

Not allowed: >Strain_A, 2022
Not allowed: >Strain_A|2022
Allowed: >Strain_A_2022
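
The header rule can be checked before building a database. The following sketch scans FASTA lines and reports headers that contain either forbidden character. `find_bad_headers` is a hypothetical helper, not part of VirStrain.

```python
# Characters that the rule above forbids in sequence headers.
FORBIDDEN = (",", "|")


def find_bad_headers(lines):
    """Yield (line_number, header) for FASTA headers containing ',' or '|'."""
    for number, line in enumerate(lines, start=1):
        if line.startswith(">") and any(ch in line for ch in FORBIDDEN):
            yield number, line.rstrip("\n")
```

Because the function accepts any iterable of lines, an open file handle for the MSA can be passed in directly; an empty result means no header violates the rule.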

Manual covering for small datasets or large viral genomes

For small strain collections (<1000 strains) or viruses with large genomes such as HCMV, you can use the manual covering function with -s to retain more useful sites.

Example:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4

General guidance:

0.2–0.6 is usually a reasonable range for -s
With very few strains (for example, 3 strains), a larger value such as -s 0.8 may also work

Restrict SNV site range

If you only want to use SNV sites from position x to y, use -r.

Example:

python VirStrain_build.py -i <Input_MSA> -d <Database_Dir> -s 0.4 -r 500-1000
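
The `-s` and `-r` values can be validated before a long build starts. The sketch below is illustrative only: `parse_site_range` and `build_db_cmd` are hypothetical helpers, and the accepted interval for `-s` (greater than 0, at most 1) is an assumption inferred from the option help, where 1 means all useful sites.

```python
def parse_site_range(text):
    """Parse an 'x-y' string such as '500-1000' into (start, end)."""
    start_text, sep, end_text = text.partition("-")
    if not sep or not start_text.isdigit() or not end_text.isdigit():
        raise ValueError(f"expected 'start-end', got {text!r}")
    start, end = int(start_text), int(end_text)
    if start > end:
        raise ValueError(f"start {start} is greater than end {end}")
    return start, end


def build_db_cmd(msa, database_dir, sites_cutoff=None, site_range=None):
    """Assemble the argument list for VirStrain_build.py."""
    cmd = ["python", "VirStrain_build.py", "-i", msa, "-d", database_dir]
    if sites_cutoff is not None:
        # Assumption: -s is a fraction in (0, 1], since 1 means "all useful sites".
        if not 0 < sites_cutoff <= 1:
            raise ValueError("sites_cutoff should be greater than 0 and at most 1")
        cmd += ["-s", str(sites_cutoff)]
    if site_range is not None:
        start, end = parse_site_range(site_range)
        cmd += ["-r", f"{start}-{end}"]
    return cmd
```

The returned list can be passed to `subprocess.run(cmd, check=True)`.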

Input format note

The input MSA must have the same format as an alignment generated by MAFFT:

https://mafft.cbrc.jp/alignment/software/
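
MAFFT writes alignments as FASTA by default, with gaps shown as `-` and every sequence padded to the same length. A quick sanity check along those lines is sketched below. It is not VirStrain's own validation, and `read_fasta` and `alignment_lengths` are hypothetical helper names.

```python
def read_fasta(lines):
    """Return a list of (header, sequence) pairs from FASTA-formatted lines."""
    records, header, chunks = [], None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequences may be wrapped over several lines
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


def alignment_lengths(records):
    """Return the set of distinct sequence lengths; an alignment has exactly one."""
    return {len(seq) for _, seq in records}
```

If `alignment_lengths(read_fasta(handle))` contains more than one value, the file is probably an unaligned FASTA rather than an MSA.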

Full command-line options

VirStrain.py — short-read strain identification

Default k-mer size: 25

VirStrain - An RNA virus strain-level identification tool for short reads.

Example:
python VirStrain.py -i Test_Data/MT451123_1.fq -p Test_Data/MT451123_2.fq -d VirStrain_DB/SCOV2 -o MT451123_PE_Test

required arguments:
    -i, --input_reads             Input FASTQ data
    -d, --database_dir            Path to VirStrain database

optional arguments:
    -h, --help                    Show help message and exit
    -o, --output_dir              Output directory (default: ./VirStrain_Out)
    -p, --input_reads2            Input FASTQ data for paired-end reads
    -c, --site_filter_cutoff      Site filtering cutoff used when calculating Vscore (default: 0.05)
    -s, --rank_by_sites           If set to 1, sort the most likely strain by site matches (default: 0)
    -f, --turn_off_figures        If set to 1, do not generate figures (default: 0)
    -m, --high_mutation_virus     Use for high mutation rate viruses such as HIV

VirStrain_build.py — custom database construction

Default k-mer size: 25

VirStrain - An RNA virus strain-level identification tool for short reads.

Example:
python VirStrain_build.py -i <Input_MSA> -d <Database_Dir>

required arguments:
    -i, --input_msa               Input MSA file (must match MAFFT output format)

optional arguments:
    -d, --database_dir            Output directory for the constructed database (default: ./VirStrain_DB)
    -c, --dash_cutoff             Dash cutoff for each MSA column (default: 0)
    -s, --sites_cutoff            Cutoff for manual-covering function
                                  (e.g. 1 = all useful sites; 0.8 = 80% of useful sites)
    -r, --sites_rcutoff           Site range cutoff for covering algorithm
                                  (e.g. 3-500 means only SNV sites from positions 3 to 500 are considered)

Output format

VirStrain generates two primary outputs:

A text report
- Contains identified strains, depth, site coverage, and related metrics
An interactive HTML report
- Displays depth and site uniqueness information visually

You can find an example output in the MT451123_Sim_PE folder in this repository.

Report sections

Header	Description
Most Possible strain*	The most likely strain detected by VirStrain. These are the strains with the highest Vscore in the first iteration.
Other Possible strains*	Additional possible strains detected by VirStrain. These are identified in later iterations. Based on the authors’ experiments, 10 mutations can be strong evidence for additional possible strains.
`Highest Map Strains`	The strain with the maximum `Covered SNV site / Total SNV site` in the first iteration. Provided for reference.
`Top 10 Score Strains`	The top 10 strains ranked by Vscore in the first iteration. This can help identify low-abundance strains highly similar to high-abundance strains.

Headers marked with * contain the main identification results.

Report columns

Column	Description
`Strain_ID`	NCBI accession number or other public database identifier for the identified strain
`Cls_info`	Cluster information for the identified strain, e.g. `Cluster2830_2` means cluster `Cluster2830` with size `2`
`SubCls_info`	Sub-cluster information
`Vscore`	Score generated by the VirStrain algorithm
`Total_Map_Rate`	Covered sites out of total sites in the first iteration
`Valid_Map_Rate`	Covered sites out of total sites in the remaining iterations
`Strain_depth`	Predicted sequencing depth for the identified strain
`Strain_info`	Metadata for the identified strain, such as region and subtype
`SNV_freq`	SNV frequency across all sites
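
The `Cls_info` value combines a cluster name and a cluster size, so downstream scripts usually need to split it. A minimal sketch, assuming the size is always the part after the last underscore as in the `Cluster2830_2` example (`parse_cls_info` is a hypothetical helper):

```python
def parse_cls_info(value):
    """Split a Cls_info value such as 'Cluster2830_2' into (name, size)."""
    name, sep, size = value.rpartition("_")
    if not sep or not name or not size.isdigit():
        raise ValueError(f"unexpected Cls_info value: {value!r}")
    return name, int(size)
```

With the example from the table, `parse_cls_info("Cluster2830_2")` returns `("Cluster2830", 2)`.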

Citation

If you use VirStrain, please cite:

Liao, H., Cai, D. & Sun, Y. VirStrain: a strain identification tool for RNA viruses. Genome Biology 23, 38 (2022). https://doi.org/10.1186/s13059-022-02609-x
