Standalone scripts for post-processing, QC, and visualization of ARG tree sequences (.ts, .trees, .tsz).
conda env create -f environment.yml
conda activate argtestCore dependencies are in environment.yml (numpy, matplotlib, tskit, tszip, msprime).
One reasonable post-processing workflow for ARG tree sequences in this repo is:
- Find low rec regions Because regions of low recombination are more affected by linked selection, for analyses assuming the neutrality of the ARG it may be a good idea to remove low recombination regions ahead of time. Find windows in the genetic map in the bottom
Xpercentile ofcM/Mbusing scripts/hapmap_low_rec_mask.py. This turns a HapMap-style recombination map plus a.faiinto per-chromosome BED masks for very low-recombination regions. - Find regions of poor alignment Find windows of
sizekb where more thanX% of bp are masked using scripts/find_low_access_regions.py. This inspects the inferred mutation map for a tree sequence and writes low-accessibility windows to a BED file. - Find windows with aberrant mutational load All samples in a tree should have the same number of derived mutations, since all have the same distance from root. In windows of
numberSNPs, identify individuals withX% more or fewer derived mutations than the window median using scripts/mutload_summary.py. This reports per-window outliers, can write mutation-masked BED intervals when too many individuals are outliers, and highlights outlier individuals in red in the HTML summary. - Remove affected regions For each chromosome, combine the BED files from steps 1-3 (.low_rec.mask.bed, low_access.ws.accbp.bed, and <ts_stem>_mutation_masked.bed), then remove those genomic regions from a directory of tree sequences with scripts/trim_regions.py. This script applies a supplied BED mask and writes trimmed tree sequences.
- Remove affected samples In many cases, only a few samples within a window will be problematic. They could have evidence of introgression (identified using e.g. TRACE) or odd patterns of derived mutations (see step 3). Using a bedfile specifying regions and individuals, prune those individuals from the trees with scripts/trim_samples.py.
- Validate Run the validation plots with scripts/validation_plots_from_ts.py to get a sense of the cleaned ARG. This gives a compact set of QC plots for mutational load, diversity, Tajima's D, and related summaries. Compare these to the same plots run on the original treesequences.
- If satisfied, merge chromosomes for each replicate for downstream analysis using scripts/merge_treefiles_by_replicate.py. This concatenates chromosome-specific tree files into one combined tree sequence per replicate.
Builds one BED per chromosome marking the bottom --rec-fraction of HapMap recombination-rate intervals, ranked by Rate(cM/Mb) separately within each chromosome.
The HapMap input should follow the format described in the msprime docs: https://tskit.dev/msprime/docs/stable/api.html#msprime.RateMap.read_hapmap.
Behavior:
- requires
--hapmapand--fai - uses the first row's
Rate(cM/Mb)from position0up to that row'sPosition(bp) - uses each row's
Rate(cM/Mb)from its position up to the next position on the same chromosome - uses the last row's
Rate(cM/Mb)from its position to the chromosome end from the.fai - writes
resultsas one BED per chromosome named<chr>.low_rec.mask.bedin--out-dir(or the current directory by default)
Example:
python scripts/hapmap_low_rec_mask.py \
--hapmap maize.hapmap.tsv \
--fai maize.fa.fai \
--rec-fraction 0.1 \
--out-dir low_rec_masksMain options:
--hapmap--fai--rec-fraction--out-dir
Finds low-accessibility windows from the inferred mutation map for a representative tree sequence and writes those windows as a BED file.
Behavior:
- infers the mutation map (
*.mut_rate.p) from the first tree sequence in--ts-dir - computes accessible bp in windows of
--window-size - writes windows with accessible bp below
--cutoff-bpto a BED file - writes to
--outor to<ts-dir>/low_access.ws<window>.accbp<cutoff>.bedby default
Inputs:
- directory of tree sequence files
- inferred mutation-rate map file(s)
Key outputs:
- one BED file of low-accessibility windows
Example:
python scripts/find_low_access_regions.py \
--ts-dir /path/to/trees \
--window-size 50000 \
--cutoff-bp 2500Main options:
--ts-dir--window-size--cutoff-bp--pattern--out
Builds an HTML summary of per-individual mutational load and writes outlier windows as BED when windowing is enabled. When --window-size or --snp-window is used, each individual is compared to that window's median mutational load and is marked as an outlier if its load is greater than (1 + cutoff) * median or less than (1 - cutoff) * median; windows with median zero are skipped. In the per-window plots, outlier individuals are shown with red bars. If --fraction is provided, windows with an outlier fraction greater than that threshold are written to results/<ts_stem>_mutation_masked.bed and are excluded from results/<ts_stem>_outliers.bed.
Inputs:
- one tree sequence file
Key outputs:
results/<name>.htmlresults/<ts_stem>_outliers.bed(if--window-sizeor--snp-windowis used; excludes windows written to the mutation-masked BED when--fractionis set)results/<ts_stem>_mutation_masked.bed(if--fractionis used)logs/<name>.log
Example:
python scripts/mutload_summary.py example_data/maize.tsz --window-size 50000 --cutoff 0.25 --out mutload.htmlExample with fixed-variant windows:
python scripts/mutload_summary.py example_data/maize.tsz --snp-window 500 --cutoff 0.25 --out mutload_snp.htmlMain options:
--window-size--snp-window--cutoff--fraction--out--suffix-to-strip
Applies a BED mask to a directory of tree sequences and writes trimmed tree files with compacted coordinates.
Inputs:
- directory of tree sequences
- one BED file of regions to remove
Key outputs:
- trimmed tree sequences in output directory
- one summary log (
trim_regions_log.txtby default)
Example:
python scripts/trim_regions.py \
--ts-dir /path/to/trees \
--remove low_access.bed \
--pattern "*.tsz" \
--out-dir /path/to/trimmedMain options:
--ts-dir--remove--out-dir--pattern--log
Removes selected individuals either genome-wide (--individuals) or over BED intervals (--remove).
For --remove, you can use one BED for all samples or multiple BED files. Each BED line should contain chrom start end sample_id in column 4, and column 4 may also list multiple comma-separated sample IDs to apply the same interval to several individuals. If column 4 is omitted, the BED filename stem is used as the sample name.
Inputs:
- one tree sequence file
- optional BED(s) with per-individual intervals
Key output:
- trimmed tree sequence (
--out, orresults/<ts_stem>_trimmed.tsz)
Example:
python scripts/trim_samples.py example_data/maize.tsz --individuals B73,Mo17 --out results/maize_trimmed.tszExample BED lines:
chr1 1000 5000 A
chr1 8000 9000 B,C
Main options:
--individuals--remove--out--suffix-to-strip
Generates SINGER-style validation/diagnostic plots (excluding coalescence/Ne curves) directly from a set of TS replicates.
Plots produced:
mutational-load.pngmutational-load-trace.pngdiversity-scatter.pngdiversity-skyline.pngdiversity-trace.pngtajima-d-scatter.pngtajima-d-skyline.pngtajima-d-trace.pngfrequency-spectrum.png- optional observed-vs-sim density plots (when
--simTSV is provided):diversity-density-vs-sim.pngtajima-d-density-vs-sim.png
summary.txt
Notes:
- branch diversity is scaled by
--mutation-ratefor site-vs-branch comparison - trace plots are branch-only MCMC outcomes
--simcan read the simulation TSV fromscripts/coalescence_ne_plots_from_ts.py --simand compare observed vs simulated windowedpiand Tajima's D using overlapping semi-transparent density plots
Example:
python scripts/validation_plots_from_ts.py \
--ts-dir ~/crud/collapsed \
--pattern "*.tsz" \
--window-size 100000 \
--mutation-rate 3.3e-8 \
--burnin-frac 0.5 \
--out-dir results/validation_plotsExample with simulation comparison:
python scripts/validation_plots_from_ts.py \
--ts-dir ~/crud/collapsed \
--pattern "*.tsz" \
--window-size 100000 \
--mutation-rate 3.3e-8 \
--burnin-frac 0.5 \
--sim results/coalescence_ne_plots/sim-window-stats.tsv \
--out-dir results/validation_plotsMain options:
--ts-dir--pattern--mutation-rate--burnin-frac--window-size--folded--sim--out-dir--prefix
Merges chromosome-specific tree sequence files by replicate. Input files must be named like <base>.<chromosome>.<replicate> with suffix .tree, .trees, or .tsz.
Behavior:
- groups files by
<base>and<replicate> - sorts chromosomes in natural order within each replicate group
- concatenates all chromosomes for a replicate into one combined tree sequence
- writes outputs named
<base>.combined.<replicate><suffix> - writes to
--out-diror to<ts-dir>/combinedby default - preserves the suffix of the first file in each group unless
--out-suffixis provided
Inputs:
- directory of chromosome-specific tree sequence files
Key outputs:
- one combined tree sequence per replicate
Example:
python scripts/merge_treefiles_by_replicate.py \
--ts-dir /path/to/treefilesMain options:
--ts-dir--out-dir--pattern--out-suffix
Generates pair coalescence and effective population size plots from a set of TS replicates using explicit time bins.
Plots produced:
pair-coalescence-pdf.pngpair-coalescence-rates.pngeffective-pop-size.png(Ne = 1 / (2 * coal_rate))- optional simulation summary table:
sim-window-stats.tsv(when--sim > 0) summary.txt
Notes:
- time bins come from
--time-bins-file(explicit bin edges) --time-adjustrescales plotted x-axis values by dividing time by a factor--yearadds a red dashed vertical marker to the Ne plot- optional
--sim Xmode converts the inferred piecewise-constant Ne trajectory into a one-deme Demes model and runsXcoalescent simulations - simulations are
1 Mb, use the same sample size as the observed TS, and use--mufor both mutation and recombination rates - simulated nucleotide diversity and Tajima's D are computed in
50 Kbwindows and written as a TSV
Example:
python scripts/coalescence_ne_plots_from_ts.py \
--ts-dir ~/crud/collapsed \
--pattern "*.tsz" \
--time-bins-file /path/to/time_bins.txt \
--burnin-frac 0.5 \
--time-adjust 6.19476 \
--year 534 \
--log-rates \
--out-dir results/coalescence_ne_plotsExample with simulations:
python scripts/coalescence_ne_plots_from_ts.py \
--ts-dir ~/crud/collapsed \
--pattern "*.tsz" \
--time-bins-file /path/to/time_bins.txt \
--burnin-frac 0.5 \
--time-adjust 6.19476 \
--year 534 \
--sim 100 \
--mu 3.3e-8 \
--sim-outfile results/coalescence_ne_plots/sim-window-stats.tsv \
--out-dir results/coalescence_ne_plots \
--log-ratesMain options:
--ts-dir--pattern--time-bins-file--burnin-frac--tail-cutoff--time-adjust--year--sim--mu--sim-outfile--log-rates--out-dir--prefix
Renders one tree index from each of two tree sequences into a single HTML comparison.
Example:
python scripts/compare_trees_html.py a.tsz 9 b.tsz 9 --out tree_compare.htmlRenders all trees from two tree sequences as a top/bottom-row scrollable HTML gallery.
Example:
python scripts/trees_gallery_html.py a.tsz b.tsz --out trees_gallery.htmlscripts/argtest_common.py contains shared tree-sequence helpers used by multiple scripts:
- TS I/O (
load_ts,dump_ts) - mutational load/stat helpers
- trimming and masking helpers
Use this module for internal script imports.
- Generated
logs/andresults/are git-ignored. .DS_Storeis git-ignored.