Single-Nucleus Muscle NFR and Nucleosome QTL Analysis

This repository contains the code and workflow definitions used for single-nucleus ATAC-seq analyses of chromatin accessibility, nucleosome positioning, and downstream QTL interpretation in human skeletal muscle.

The repository centers on a set of Snakemake workflows in smk/ plus one Nextflow workflow for LDSC analyses in nextflow/.

Paper

This repository accompanies:

Wang X, Robertson C.C. et al. bioRxiv (2025)

What This Repository Does

The main analyses in this repository identify and interpret:

nfrQTLs: QTLs for nucleosome-free regions
nucQTLs: QTLs for nucleosomal regions
caQTLs: QTLs for accessibility peaks derived from MACS2 summits and narrowpeaks

Downstream workflows then connect those QTLs to:

NFR versus nucleosomal colocalization
causal inference testing
GWAS colocalization
eQTL overlap
motif enrichment
multi-track locus plots
LDSC annotation enrichment

Repository Layout

muscle_nfr_nucQTL/
├── bin/                    # R and Python analysis scripts
├── env/                    # Conda environments
├── nextflow/               # LDSC workflow
├── run_snakemake/          # Helper scripts to launch Snakemake
├── smk/                    # Snakemake workflows and YAML configs
└── README.md

Requirements

This codebase was developed with:

R >= 4.2.3
Python >= 3.8
Snakemake >= 7.30.1
Nextflow >= 24.10.6

External tools used across workflows include:

QTLtools
samtools
bedtools
MACS2 / MACS3
UCSC utilities such as bedGraphToBigWig
deepTools
HOMER
featureCounts

Conda environment files are provided in env/ and smk/envs/, but some workflows also depend on cluster modules, Singularity, and external reference files specified in the YAML configs.

Before Running Anything

Every workflow is configured through files in smk/. Before launching a pipeline, inspect:

smk/<workflow>.yaml
smk/<workflow>_cluster.yaml
the corresponding smk/<workflow>.smk

In particular, confirm:

input BAM and VCF paths
sample metadata paths
chromosome size and genome reference paths
output directories
cluster account, time, memory, and CPU settings
trait-specific metadata for GWAS workflows

Many configs contain absolute or environment-specific paths. They are not plug-and-play across systems.

How To Run A Workflow

All Snakemake workflows are launched from run_snakemake/.

From the repository root:

cd run_snakemake
bash dryrun.sh <workflow>
bash run.sh <workflow>

Example:

cd run_snakemake
bash dryrun.sh peakcall
bash run.sh peakcall

This launches:

Snakefile: smk/<workflow>.smk
config: smk/<workflow>.yaml
cluster config: smk/<workflow>_cluster.yaml

Useful helpers:

bash dryrun.sh <workflow>: inspect the DAG before submitting jobs
bash run.sh <workflow>: submit the workflow to Slurm
bash unlock.sh <workflow>: unlock a working directory after interrupted runs

Recommended Execution Order

The repository has many workflows, but not all are meant to be run in a single linear chain. The safest way to think about it is:

Run the core NFR and nucleosome pipeline first.
Run downstream interpretation workflows from those outputs.
Run optional caQTL, GWAS, motif, plotting, or LDSC workflows only if you need those analyses.

Core Workflow Order

Run these first if you want to reproduce the main NFR and nucQTL analyses.

Step	Workflow	Run when	Purpose
1	`peakcall`	first	Calls peaks and produces peak-level signal tracks used downstream
2	`sample_prep`	after `peakcall` inputs are available	Splits BAMs into NFR and nucleosomal fractions and prepares sample-level tracks
3	`QTLprep`	after `peakcall` and `sample_prep`	Builds peak SAF/BED files, counts reads, tidies counts, filters features, normalizes counts, and writes QTL BED inputs
4	`QTLscan`	after `QTLprep`	Computes covariates and runs the main QTL scan for NFR and nucleosomal phenotypes
5	`coloc_qtls`	after `QTLscan`	Fine-maps QTLs with SuSiE and tests colocalization between NFR and nucleosomal QTL signals
6	`cit`	after `coloc_qtls`	Runs causal inference testing on colocalized NFR/nucQTL pairs

Core Dependency Summary

If you only need the main manuscript-style analysis, the order is:

peakcall
  -> sample_prep
  -> QTLprep
  -> QTLscan
  -> coloc_qtls
  -> cit

What Each Core Step Produces

peakcall
- calls NFR or nucleosomal peak-related files and signal tracks
- provides peak definitions used later by QTLprep
sample_prep
- derives NFR and nucleosomal read partitions from sample BAMs
- generates sample-level BAM and bigWig-style inputs used for counts and visualization
QTLprep
- converts peaks to SAF/BED
- counts reads with featureCounts
- tidies and filters raw counts
- normalizes counts with bin/normalization.R
- writes QTL phenotype BED files
QTLscan
- computes genotype and phenotype covariates
- compresses/indexes phenotype BED files
- runs QTLtools cis
- merges and summarizes QTL scan results
coloc_qtls
- corrects phenotype BEDs
- runs SuSiE fine-mapping for NFR and nuc traits
- performs coloc between NFR and nucleosomal signals
- writes coloc summary tables and regression summaries
cit
- reads the coloc summaries
- filters by PPH4
- runs CIT and FDR summarization

Optional Workflow Branches

Use these only after the required upstream outputs exist.

Peak Comparison and Peak-Level Summaries

Workflow	Run after	Purpose
`peak_compare`	`peakcall`	Compares MACS2 peak calls to nucleoATAC-derived regions and computes summit distances

Variant-in-Peak and Layered Plotting

Workflow	Run after	Purpose
`snps_inpeak`	`coloc_qtls`	Extracts and annotates variants within colocalized peak sets
`layer_plot`	`snps_inpeak`	Builds motif summaries and layered regional plots for selected GWAS traits
`plot_multitrack`	GWAS coloc outputs plus QTL SuSiE outputs	Generates locus-level multi-track plots for selected GWAS peaks

eQTL Overlap

Workflow	Run after	Purpose
`eqtl_overlap`	`coloc_qtls` recommended	Summarizes eQTL overlap with QTL signals and fits enrichment models

This workflow relies on QTL overlap outputs and references the NFR SuSiE directory in its logistic modeling step, so it is best treated as downstream of the main QTL discovery and fine-mapping steps.

Motif Analysis

Workflow	Run after	Purpose
`motif_homer`	`coloc_qtls`	HOMER enrichment comparing colocated versus background NFR QTL peaks
`motif_meme`	`QTLscan`, `coloc_qtls`, and optionally `cit`	MEME/SEA analyses on QTL peaks, coloc-derived sets, SNP-aware FASTA sets, and CIT-derived sets

motif_meme pulls from several result directories, so the exact prerequisites depend on which target block is active in rule all.

GWAS Colocalization

These workflows are separate branches for different GWAS inputs. Use the one that matches your summary statistics source.

Workflow	Run after	Purpose
`coloc_gwas`	`QTLscan` and QTL SuSiE outputs for the selected region	General GWAS colocalization workflow using lead variants and SuSiE
`coloc_gwas_ukbb`	before UKBB muscle-specific coloc workflows	Lifts over UKBB GWAS to hg38, identifies lead variants, and fine-maps GWAS loci
`coloc_gwas_nonukbb`	`QTLscan` and QTL SuSiE outputs for NFR	Colocalization against a predefined set of non-UKBB GWAS traits
`coloc_gwas_muscle`	trait-specific	Muscle disease or muscle-focused GWAS coloc workflow
`coloc_gwas_height`	trait-specific	Height GWAS coloc workflow
`coloc_gwas_grip_strength`	trait-specific	Grip strength GWAS coloc workflow
`coloc_gwas_metrate`	trait-specific	Metabolic-rate GWAS coloc workflow
`coloc_gwas_mv`	trait-specific	Multivariate GWAS coloc workflow

Practical rule:

run the appropriate coloc_gwas* workflow only after the relevant QTL phenotype, QTL scan, and QTL SuSiE outputs already exist
check the active rule all section in the target Snakefile, because several of these workflows have been repointed to different output generations over time

caQTL Branch

These workflows are separate from the core NFR/nucQTL chain.

Workflow	Run after	Purpose
`caQTL`	`peakcall` and sample-level BAM availability	Builds accessibility peaks from MACS2 summits, quantifies them, runs caQTL scans, and fine-maps caQTLs
`caQTL_np`	`peakcall` and sample-level BAM availability	Alternate caQTL workflow using MACS2 narrowPeak definitions
`ca_partition`	`caQTL` plus NFR/nucQTL peak BEDs	Computes overlap partitions between caQTL, nfrQTL, and nucQTL peak sets
`ca_partition_np`	`caQTL_np` plus NFR/nucQTL peak BEDs	Partition analysis for the narrowPeak-based caQTL branch

Recommended order for the accessibility branch:

peakcall
  -> caQTL or caQTL_np
  -> ca_partition or ca_partition_np

CIT Matching and Downsampling Variants

Workflow	Run after	Purpose
`cit_match`	`coloc_qtls` or CIT-related inputs	Matching-based CIT follow-up workflow
`cit_downsample`	`cit` inputs plus downsampling setup	Downsampled CIT sensitivity analysis

These are follow-up or sensitivity-analysis workflows rather than first-pass pipelines.

MR / Steiger Split Workflow

Workflow	Run after	Purpose
`MR_split`	normalized counts, sample metadata, coloc peak list, and genotype resources are prepared	Splits samples, reruns QTL scans on split datasets, harmonizes exposure/outcome effects, and runs Steiger directionality analyses

MR_split is self-contained but expects many explicit config entries such as:

raw NFR counts
raw nucleosomal counts
sample info
peak list to keep
VCF and eigenvector files
output root

Treat this as a specialized downstream analysis, not part of the default run order.

Minimal Run Recipes

Reproduce The Main NFR/NucQTL Analysis

cd run_snakemake
bash dryrun.sh peakcall
bash run.sh peakcall

bash dryrun.sh sample_prep
bash run.sh sample_prep

bash dryrun.sh QTLprep
bash run.sh QTLprep

bash dryrun.sh QTLscan
bash run.sh QTLscan

bash dryrun.sh coloc_qtls
bash run.sh coloc_qtls

bash dryrun.sh cit
bash run.sh cit

Add GWAS Colocalization

After the core QTL workflows are complete, choose the relevant GWAS branch:

cd run_snakemake
bash dryrun.sh coloc_gwas
bash run.sh coloc_gwas

or:

cd run_snakemake
bash dryrun.sh coloc_gwas_nonukbb
bash run.sh coloc_gwas_nonukbb

or one of the other specialized coloc_gwas* workflows.

Add Motif Analysis

After coloc_qtls and any required upstream QTL outputs:

cd run_snakemake
bash dryrun.sh motif_homer
bash run.sh motif_homer

bash dryrun.sh motif_meme
bash run.sh motif_meme

Workflow Selection Guide

Use this quick guide if you are unsure what to run.

If you want the main discovery pipeline: run peakcall -> sample_prep -> QTLprep -> QTLscan -> coloc_qtls -> cit
If you want NFR versus nucleosomal peak comparison only: run peakcall, then peak_compare
If you want GWAS colocalization: finish QTL discovery first, then run the appropriate coloc_gwas* workflow
If you want eQTL overlap: finish QTL discovery and coloc/fine-mapping first, then run eqtl_overlap
If you want motif enrichment on colocated signals: finish coloc_qtls, then run motif_homer and/or motif_meme
If you want accessibility QTLs from MACS2 summits: run caQTL or caQTL_np, then the matching partition workflow
If you want Steiger or split-sample directionality analyses: run MR_split

Nextflow Workflow: LDSC

The nextflow/ldsc_allgwas_consensus-cluster/ directory contains a separate Nextflow workflow for LDSC annotation enrichment analyses.

Entry point:

nextflow/ldsc_allgwas_consensus-cluster/scripts/main.nf

This workflow:

builds LDSC annotation files from BED annotations
calculates LD scores
runs single-annotation enrichment against GWAS summary statistics
runs joint annotation enrichment if provided with joint annotation files

This is independent of the main Snakemake execution chain and should be treated as a separate downstream analysis.

Important Notes

Many paths in this repository are hard-coded to a specific computing environment.
Some rule all targets have been changed over time and often point to the latest analysis generation only.
Several Snakefiles contain historical or commented-out targets. Always inspect the active rule all section before launching a workflow.
Some raw data remain controlled access, so not every analysis can be rerun from public inputs alone.

Citation

If you use this repository or adapt these workflows, please cite:

Wang, X., Robertson, C. C., Varshney, A., Manickam, N., Orchard, P., Laakso, M., Tuomilehto, J., Lakka, T. A., Mohlke, K. L., Boehnke, M., Scott, L. J., Koistinen, H. A., Collins, F. S., & Parker, S. C. J. (2025). Genetic integration with cell-specific nucleosome positioning resolves causal relationships underlying chromatin accessibility profiles (p. 2025.09.09.674883). bioRxiv. https://doi.org/10.1101/2025.09.09.674883

Contact

Alice Wang: xiaoouw [at] umich.edu
Cassie Robertson: ccrober [at] umich.edu
Stephen Parker: scjp [at] umich.edu

Name		Name	Last commit message	Last commit date
Latest commit History 7 Commits
bin		bin
env		env
nextflow		nextflow
run_snakemake		run_snakemake
smk		smk
README.md		README.md

Folders and files

Latest commit

History

Repository files navigation

Single-Nucleus Muscle NFR and Nucleosome QTL Analysis

Paper

What This Repository Does

Repository Layout

Requirements

Before Running Anything

How To Run A Workflow

Recommended Execution Order

Core Workflow Order

Core Dependency Summary

What Each Core Step Produces

Optional Workflow Branches

Peak Comparison and Peak-Level Summaries

Variant-in-Peak and Layered Plotting

eQTL Overlap

Motif Analysis

GWAS Colocalization

caQTL Branch

CIT Matching and Downsampling Variants

MR / Steiger Split Workflow

Minimal Run Recipes

Reproduce The Main NFR/NucQTL Analysis

Add GWAS Colocalization

Add Motif Analysis

Workflow Selection Guide

Nextflow Workflow: LDSC

Important Notes

Citation

Contact

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages