This repository contains the code and workflow definitions used for single-nucleus ATAC-seq analyses of chromatin accessibility, nucleosome positioning, and downstream QTL interpretation in human skeletal muscle.
The repository centers on a set of Snakemake workflows in smk/ plus one Nextflow workflow for LDSC analyses in nextflow/.
This repository accompanies:
Wang X, Robertson C.C. et al. bioRxiv (2025)
The main analyses in this repository identify and interpret:
nfrQTLs: QTLs for nucleosome-free regionsnucQTLs: QTLs for nucleosomal regionscaQTLs: QTLs for accessibility peaks derived from MACS2 summits and narrowpeaks
Downstream workflows then connect those QTLs to:
- NFR versus nucleosomal colocalization
- causal inference testing
- GWAS colocalization
- eQTL overlap
- motif enrichment
- multi-track locus plots
- LDSC annotation enrichment
muscle_nfr_nucQTL/
├── bin/ # R and Python analysis scripts
├── env/ # Conda environments
├── nextflow/ # LDSC workflow
├── run_snakemake/ # Helper scripts to launch Snakemake
├── smk/ # Snakemake workflows and YAML configs
└── README.md
This codebase was developed with:
R >= 4.2.3Python >= 3.8Snakemake >= 7.30.1Nextflow >= 24.10.6
External tools used across workflows include:
- QTLtools
- samtools
- bedtools
- MACS2 / MACS3
- UCSC utilities such as
bedGraphToBigWig - deepTools
- HOMER
- featureCounts
Conda environment files are provided in env/ and smk/envs/, but some workflows also depend on cluster modules, Singularity, and external reference files specified in the YAML configs.
Every workflow is configured through files in smk/. Before launching a pipeline, inspect:
smk/<workflow>.yamlsmk/<workflow>_cluster.yaml- the corresponding
smk/<workflow>.smk
In particular, confirm:
- input BAM and VCF paths
- sample metadata paths
- chromosome size and genome reference paths
- output directories
- cluster account, time, memory, and CPU settings
- trait-specific metadata for GWAS workflows
Many configs contain absolute or environment-specific paths. They are not plug-and-play across systems.
All Snakemake workflows are launched from run_snakemake/.
From the repository root:
cd run_snakemake
bash dryrun.sh <workflow>
bash run.sh <workflow>Example:
cd run_snakemake
bash dryrun.sh peakcall
bash run.sh peakcallThis launches:
- Snakefile:
smk/<workflow>.smk - config:
smk/<workflow>.yaml - cluster config:
smk/<workflow>_cluster.yaml
Useful helpers:
bash dryrun.sh <workflow>: inspect the DAG before submitting jobsbash run.sh <workflow>: submit the workflow to Slurmbash unlock.sh <workflow>: unlock a working directory after interrupted runs
The repository has many workflows, but not all are meant to be run in a single linear chain. The safest way to think about it is:
- Run the core NFR and nucleosome pipeline first.
- Run downstream interpretation workflows from those outputs.
- Run optional caQTL, GWAS, motif, plotting, or LDSC workflows only if you need those analyses.
Run these first if you want to reproduce the main NFR and nucQTL analyses.
| Step | Workflow | Run when | Purpose |
|---|---|---|---|
| 1 | peakcall |
first | Calls peaks and produces peak-level signal tracks used downstream |
| 2 | sample_prep |
after peakcall inputs are available |
Splits BAMs into NFR and nucleosomal fractions and prepares sample-level tracks |
| 3 | QTLprep |
after peakcall and sample_prep |
Builds peak SAF/BED files, counts reads, tidies counts, filters features, normalizes counts, and writes QTL BED inputs |
| 4 | QTLscan |
after QTLprep |
Computes covariates and runs the main QTL scan for NFR and nucleosomal phenotypes |
| 5 | coloc_qtls |
after QTLscan |
Fine-maps QTLs with SuSiE and tests colocalization between NFR and nucleosomal QTL signals |
| 6 | cit |
after coloc_qtls |
Runs causal inference testing on colocalized NFR/nucQTL pairs |
If you only need the main manuscript-style analysis, the order is:
peakcall
-> sample_prep
-> QTLprep
-> QTLscan
-> coloc_qtls
-> cit
peakcall- calls NFR or nucleosomal peak-related files and signal tracks
- provides peak definitions used later by
QTLprep
sample_prep- derives NFR and nucleosomal read partitions from sample BAMs
- generates sample-level BAM and bigWig-style inputs used for counts and visualization
QTLprep- converts peaks to SAF/BED
- counts reads with
featureCounts - tidies and filters raw counts
- normalizes counts with
bin/normalization.R - writes QTL phenotype BED files
QTLscan- computes genotype and phenotype covariates
- compresses/indexes phenotype BED files
- runs
QTLtools cis - merges and summarizes QTL scan results
coloc_qtls- corrects phenotype BEDs
- runs SuSiE fine-mapping for NFR and nuc traits
- performs coloc between NFR and nucleosomal signals
- writes coloc summary tables and regression summaries
cit- reads the coloc summaries
- filters by
PPH4 - runs CIT and FDR summarization
Use these only after the required upstream outputs exist.
| Workflow | Run after | Purpose |
|---|---|---|
peak_compare |
peakcall |
Compares MACS2 peak calls to nucleoATAC-derived regions and computes summit distances |
| Workflow | Run after | Purpose |
|---|---|---|
snps_inpeak |
coloc_qtls |
Extracts and annotates variants within colocalized peak sets |
layer_plot |
snps_inpeak |
Builds motif summaries and layered regional plots for selected GWAS traits |
plot_multitrack |
GWAS coloc outputs plus QTL SuSiE outputs | Generates locus-level multi-track plots for selected GWAS peaks |
| Workflow | Run after | Purpose |
|---|---|---|
eqtl_overlap |
coloc_qtls recommended |
Summarizes eQTL overlap with QTL signals and fits enrichment models |
This workflow relies on QTL overlap outputs and references the NFR SuSiE directory in its logistic modeling step, so it is best treated as downstream of the main QTL discovery and fine-mapping steps.
| Workflow | Run after | Purpose |
|---|---|---|
motif_homer |
coloc_qtls |
HOMER enrichment comparing colocated versus background NFR QTL peaks |
motif_meme |
QTLscan, coloc_qtls, and optionally cit |
MEME/SEA analyses on QTL peaks, coloc-derived sets, SNP-aware FASTA sets, and CIT-derived sets |
motif_meme pulls from several result directories, so the exact prerequisites depend on which target block is active in rule all.
These workflows are separate branches for different GWAS inputs. Use the one that matches your summary statistics source.
| Workflow | Run after | Purpose |
|---|---|---|
coloc_gwas |
QTLscan and QTL SuSiE outputs for the selected region |
General GWAS colocalization workflow using lead variants and SuSiE |
coloc_gwas_ukbb |
before UKBB muscle-specific coloc workflows | Lifts over UKBB GWAS to hg38, identifies lead variants, and fine-maps GWAS loci |
coloc_gwas_nonukbb |
QTLscan and QTL SuSiE outputs for NFR |
Colocalization against a predefined set of non-UKBB GWAS traits |
coloc_gwas_muscle |
trait-specific | Muscle disease or muscle-focused GWAS coloc workflow |
coloc_gwas_height |
trait-specific | Height GWAS coloc workflow |
coloc_gwas_grip_strength |
trait-specific | Grip strength GWAS coloc workflow |
coloc_gwas_metrate |
trait-specific | Metabolic-rate GWAS coloc workflow |
coloc_gwas_mv |
trait-specific | Multivariate GWAS coloc workflow |
Practical rule:
- run the appropriate
coloc_gwas*workflow only after the relevant QTL phenotype, QTL scan, and QTL SuSiE outputs already exist - check the active
rule allsection in the target Snakefile, because several of these workflows have been repointed to different output generations over time
These workflows are separate from the core NFR/nucQTL chain.
| Workflow | Run after | Purpose |
|---|---|---|
caQTL |
peakcall and sample-level BAM availability |
Builds accessibility peaks from MACS2 summits, quantifies them, runs caQTL scans, and fine-maps caQTLs |
caQTL_np |
peakcall and sample-level BAM availability |
Alternate caQTL workflow using MACS2 narrowPeak definitions |
ca_partition |
caQTL plus NFR/nucQTL peak BEDs |
Computes overlap partitions between caQTL, nfrQTL, and nucQTL peak sets |
ca_partition_np |
caQTL_np plus NFR/nucQTL peak BEDs |
Partition analysis for the narrowPeak-based caQTL branch |
Recommended order for the accessibility branch:
peakcall
-> caQTL or caQTL_np
-> ca_partition or ca_partition_np
| Workflow | Run after | Purpose |
|---|---|---|
cit_match |
coloc_qtls or CIT-related inputs |
Matching-based CIT follow-up workflow |
cit_downsample |
cit inputs plus downsampling setup |
Downsampled CIT sensitivity analysis |
These are follow-up or sensitivity-analysis workflows rather than first-pass pipelines.
| Workflow | Run after | Purpose |
|---|---|---|
MR_split |
normalized counts, sample metadata, coloc peak list, and genotype resources are prepared | Splits samples, reruns QTL scans on split datasets, harmonizes exposure/outcome effects, and runs Steiger directionality analyses |
MR_split is self-contained but expects many explicit config entries such as:
- raw NFR counts
- raw nucleosomal counts
- sample info
- peak list to keep
- VCF and eigenvector files
- output root
Treat this as a specialized downstream analysis, not part of the default run order.
cd run_snakemake
bash dryrun.sh peakcall
bash run.sh peakcall
bash dryrun.sh sample_prep
bash run.sh sample_prep
bash dryrun.sh QTLprep
bash run.sh QTLprep
bash dryrun.sh QTLscan
bash run.sh QTLscan
bash dryrun.sh coloc_qtls
bash run.sh coloc_qtls
bash dryrun.sh cit
bash run.sh citAfter the core QTL workflows are complete, choose the relevant GWAS branch:
cd run_snakemake
bash dryrun.sh coloc_gwas
bash run.sh coloc_gwasor:
cd run_snakemake
bash dryrun.sh coloc_gwas_nonukbb
bash run.sh coloc_gwas_nonukbbor one of the other specialized coloc_gwas* workflows.
After coloc_qtls and any required upstream QTL outputs:
cd run_snakemake
bash dryrun.sh motif_homer
bash run.sh motif_homer
bash dryrun.sh motif_meme
bash run.sh motif_memeUse this quick guide if you are unsure what to run.
- If you want the main discovery pipeline: run
peakcall -> sample_prep -> QTLprep -> QTLscan -> coloc_qtls -> cit - If you want NFR versus nucleosomal peak comparison only: run
peakcall, thenpeak_compare - If you want GWAS colocalization: finish QTL discovery first, then run the appropriate
coloc_gwas*workflow - If you want eQTL overlap: finish QTL discovery and coloc/fine-mapping first, then run
eqtl_overlap - If you want motif enrichment on colocated signals: finish
coloc_qtls, then runmotif_homerand/ormotif_meme - If you want accessibility QTLs from MACS2 summits: run
caQTLorcaQTL_np, then the matching partition workflow - If you want Steiger or split-sample directionality analyses: run
MR_split
The nextflow/ldsc_allgwas_consensus-cluster/ directory contains a separate Nextflow workflow for LDSC annotation enrichment analyses.
Entry point:
nextflow/ldsc_allgwas_consensus-cluster/scripts/main.nf
This workflow:
- builds LDSC annotation files from BED annotations
- calculates LD scores
- runs single-annotation enrichment against GWAS summary statistics
- runs joint annotation enrichment if provided with joint annotation files
This is independent of the main Snakemake execution chain and should be treated as a separate downstream analysis.
- Many paths in this repository are hard-coded to a specific computing environment.
- Some
rule alltargets have been changed over time and often point to the latest analysis generation only. - Several Snakefiles contain historical or commented-out targets. Always inspect the active
rule allsection before launching a workflow. - Some raw data remain controlled access, so not every analysis can be rerun from public inputs alone.
If you use this repository or adapt these workflows, please cite:
Wang, X., Robertson, C. C., Varshney, A., Manickam, N., Orchard, P., Laakso, M., Tuomilehto, J., Lakka, T. A., Mohlke, K. L., Boehnke, M., Scott, L. J., Koistinen, H. A., Collins, F. S., & Parker, S. C. J. (2025). Genetic integration with cell-specific nucleosome positioning resolves causal relationships underlying chromatin accessibility profiles (p. 2025.09.09.674883). bioRxiv. https://doi.org/10.1101/2025.09.09.674883
- Alice Wang:
xiaoouw [at] umich.edu - Cassie Robertson:
ccrober [at] umich.edu - Stephen Parker:
scjp [at] umich.edu