Recopops is a command line interface tool that generates SNP configurations and phylogenetic trees using simulation of coalescent models with structured populations - Recopops stands for Recombination with population structure.
What you need to do at the present time after downloading the repository:
pip install /path/to/recopopswhere /path/to/recopops is the path of the directory containing the setup.py file of the recopops package.
If you wish to install the package in such a way that changes to the source code are immediately reflected without needing to reinstall the package, use editable mode:
pip install -e /path/to/recopopsYou need the following python packages at the moment:
numpy, biopython, tqdm, networkx
This is an example command to generate a single phylogeny of 10 strains:
recopops phylo -n 10This is an example to generate 100 phylogenies with 10 strains:
recopops phyloens -n 10 -r 100This is an example command to generate a SNP configuration for 10 strains and 1000 nucleotides:
recopops snps -n 10 -l 1000To generate a single phylogeny given a specific coalescent rate matrix, you can use the following command:
recopops phylo -f inputs/prob_matrices/coa_prob_matrix_from_dist_pwr2_ecoli_data_nostrain32_noclones.yml -u 0.0002 -v 0 -s 567 -f : yaml file of the coalescent matrix, which is a symmetrix nxn matrix where n is the number of strains in the sample; each element of the matrix is the coalescent rate between two strains -u : mutation rate, in units of 1/2N generation where N is mean to represent the full population size of the standard Wright-Fisher model that people use to derive the Kingman coalescent -v : fluctuation parameter of the mutation rate, ie, if different from zero each branch has different mutation rates with mean equal to u -s : random seed, default is None in which you generate a different phylogeny even for a given matrix
To generate quickly a single phylogeny, it's often convenient to just give the number of strain in input:
recopops phylo -n 10 -u 0.0002 -v 0 -s 567In this case n is the number of the strain in the same and then generates a coalescent matrix taken from a uniform distribution (NB: uniform, not constant) between 0 and 2 and then executes the structured coalescent model as before.
The output of the phylogeny command is an ascii tree on the terminal and a .txt file with the newick string corresponding to the generate phylogeny. In this case, the command should be:
You might also want to generate single phylogeny given a specific coalescent probability matrix. Remember that the coalescent probability is just the coalescent rate divided by the sum of all the rates, that is, the probability is equal to the relative rate.
recopops phylo -t p -f inputs/coa_prob_matrix_from_dist_pwr2_ecoli_data_nostrain32_noclones.yaml -u 0.0002 -v 0 -s 567 -t : the type of the coalescent matrix. It can take two values, 'p' or 'r', which stand for probability and rate. By default, recopops assume that the matrix is a rate matrix, so there is no need to specify this parameter if you are using a rate matrix.
Note that if you only specify the relative rate, the sum of all rates is an additional parameter to be defined. In this case, Recopops sets the sum of all rates to the total number of pairs. This is motivated by the fact that if the probability is constant, then if the sum of the rates is equal to the number of pairs, we obtain a Kingman coalescent with the typical scale. Note that if you don't set the sum to the total number of pairs, the net result is really just to stretch the Kingman coalescent. In some sense, it is equivalent to a change of units. Therefore, fixing it to the number of pairs is more so that the numerical values of observables like the height and length of the trees agree with the ones reported in standard textbooks.
To study the statistical properties of the process generating the phylogenies, it will be necessary to generate many phylogenies to study the properties of the distribution. To do that, you can use the 'recopops phyloens' like this:
recopops phyloens -f inputs/prob_matrices/coa_prob_matrix_from_dist_pwr10_ecoli_data_nostrain32_noclones.yaml -u 0.0002 -v 0 -r 100recopops phyloens -n 10 -u 0.0002 -v 0 -r 100The only new parameters is -r, which is the number of phylogenies to be generated.
The output is a .txt file, which contains all the newick strings corresponding to the generated phylogenies.
As before, you might also want to generate single phylogeny given a specific coalescent probability matrix and you do in the same way:
recopops phyloens -t p -f inputs/coa_prob_matrix_from_dist_pwr10_ecoli_data_nostrain32_noclones.yaml -u 0.0002 -v 0 -r 100
This is the most important simulation performed by Recopops as it allows comparison of different phylogeny ensembles with data. Consequently, we delve deeper into this functionality.
To generate a snp-configuration given a specific coalescence matrix, you can use the following command:
recopops snps -f inputs/rate_matrices/coa_rate_matrix_nostrain50_disttype_e_mean1.000000.yaml -u 0.0002 -v 0 -l 2593107Note the options:
-f : yaml file of the coalescent rate matrix, which is a symmetrix nxn rate matrix where n is the number of strains in the sample; each element of the matrix is the coalescent rate between two strains
-u : mutation rate, in units of 1/2N generation where N is mean to represent the full population size of the standard Wright-Fisher model that people use to derive the Kingman coalescent
-v : fluctuation parameter of the mutation rate, ie, if different from zero each branch has different mutation rates with mean equal to u
-l : genome length, or more precisely core genome length
To generate quickly a snp-configuration, it's often convenient not to use a file but rather just give the number of stains input:
recopops snps -n 10 -u 0.0002 -v 0 -l 2593107n is the number of strain in the sample, all the other parameters have the same meaning. In this case, the program first generates a coalescent rate matrix taken from a uniform distributino on the coalescent probability and then executes the structure coalescent model as before.
Importantly, you can also generate a snp-configuration given a specific coalescent probability matric. This is a matrix of the relative rates. To do so, you'd run a command like the following:
recopops snps -t p -f inputs/prob_matrices/coa_prob_matrix_from_dist_pwr2_ecoli_data_nostrain32_noclones.yaml -u 0.0002 -v 0 -l 2593107Note the new option -t, which stands for type. t takes either 'p' or 'r' respectively for probabilty or rate. By default, it is a rate.
The advantage of using probability matrices is that you can choose an overall scale of the rate separately with the option rs:
recopops snps -t p -f inputs/prob_matrices/coa_prob_matrix_from_dist_pwr10_ecoli_data_nostrain32_noclones.yaml -rs=7 -u 0.0002 -v 0 -l 2593107rs is the sum of all the rates. By default, this is set to the total number of pairs when you generate SNP configuration with a probability matrix. This is done because if the probability matrix is a constant matrix (ie, all equal elements), then you are generating Kingman tree and the natural unit of the rate sum is total number of pairs. For our purposes, it is convenient to keep this parameter separate because you can use to set other overall scales of the simulation like average total tree length, which in turn it's related to quantities like the fraction of polymorphic sites.
The output of the snp configuration consists of two files:
(1) a file containing all the bi-allelic SNPs, which looks like this:
00000000000000000000000000000001
00000000000011001010100110000001
00000000000000000010000000000000
00010000000000000000000000000000
00000011000000010001000000001000
00000000000000000000000100000000
0 indicates the ancestral state, while 1 indicates the mutated state in the course of evolutionary history. A row is a locus with a SNP in the simulated genomes, a column is a specific genome/strain.
(2) a summary file, which describes the number of fraction of columns w/o snps, bi-allelic snps, tri-allelic snps, etcetera. The file looks like this:
0 2339855 0.9023364635551098
1 239381 0.09231435494177448
2 13290 0.005125125958936519
3 557 0.00021480023770712123
4 23 8.869668702448453e-06
5 1 3.8563776967167183e-07
Recopops also has an integrated custom library to generate random rate and probability matrices from different distributions.
To generate random rate matrices without correlation between rate, you should run a command like:
recopops randrate-uncorr -n 10 -u 10which generates a random rate matrice for 10 strain from a uniform distribution rates between 0 and 1.
To generate random rate matrices with correlation between rate, you should run a command like:
recopops randrate-corre.g.
recopops randrate-corr -n 10 -d um --stretch_par 2To generate random probability matrices without correlation between rate, you should use a command like:
recopops randrate-uncorr -n 10 -d uTo generate random probability matrices with correlation between rates, you should use a command like:
recopops randprob-corre.g.
recopops randprob-corr -n 10 -d um --stretch_par 2You can find examples of output and downstream analysis with jupyter notebooks in the folder 'example_output'.
If you use this project in your work, please credit the project and add a link to this repository.