Skip to content

Feature/mouse support#9

Merged
shafighi merged 4 commits into
mainfrom
feature/mouse-support
Jun 2, 2026
Merged

Feature/mouse support#9
shafighi merged 4 commits into
mainfrom
feature/mouse-support

Conversation

@shafighi
Copy link
Copy Markdown
Collaborator

@shafighi shafighi commented Jun 2, 2026

No description provided.

shafighi added 4 commits May 21, 2026 14:38
- readFilter(): add GRCm38 branch loading mm10-blacklist.v2.bed so
  centromere/low-mappability bins are excluded for mouse data
- readData(): compute coverage pData column for Mouse using 2.7 Gb
  genome size; also store genome in experimentData for Mouse
- Extend extendedBlacklisting condition to apply for both Human and Mouse
Mouse chromosomes are telocentric: the centromere sits at position 0 of
every chromosome, followed by unassembled satellite sequence. The UCSC
mm10 gap track documents a formal centromere N-gap from 0 to 3,000,000 bp
on all 21 chromosomes.

The existing ENCODE mm10-blacklist.v2.bed already covers chr2-4, chr6,
chr9-chr19, and chrY from position 0. Five chromosomes were unprotected
near the start (chr1, chr5, chr7, chr8, chrX).

Changes:
- Add data/blacklisting/mm10_pericentromeric_exclude.bed covering
  0-3,000,000 bp on all chromosomes, based on the UCSC centromere gap
- readFilter() for GRCm38 now merges this file with mm10-blacklist.v2.bed
  before applying findOverlaps, so both sources contribute to bin exclusion

The BED file uses base-pair coordinates so it is bin-size-agnostic: the
findOverlaps logic in readFilter() handles any bin size correctly.
This also covers the chrX PAR region (0-700,000 bp) for male samples.
Segmental duplications are near-identical copy-pasted regions of the
genome. Reads from these regions cannot be unambiguously placed, making
read counts unreliable. Bailey et al. 2008 (Nature Genetics) showed ~47%
of apparent mouse CNVs overlap segmental duplications and are likely
false positives. Mouse chr7, chr12, chr14, and chrX are particularly
affected.

The ENCODE mm10-blacklist.v2.bed does not cover these regions, so a
separate mask is needed.

Changes:
- Add data/blacklisting/mm10_segdup.bed (UCSC mm10 segmental duplication
  track, 180,459 regions)
- readFilter() for GRCm38 now merges all three sources: ENCODE blacklist,
  pericentromeric filter, and segdup mask
- Random/unplaced contigs in the segdup file (chrUn_*, chr*_random, chrM)
  are filtered to only chromosomes present in the QDNAseq.mm10 bin set
  before building the GRanges object
The FSU RepliChip BigWig files used for mouse replication timing
correction were in mm9 coordinates but were being averaged against mm10
GRanges bins without an explicit liftover — a silent coordinate mismatch.

Fix: add liftover_mm9_to_mm10.sh which downloads the same 33 FSU RepliChip
WaveSignal files from UCSC and lifts them to mm10 using the official
mm9->mm10 chain (>97% of the genome retained). The R script is updated to
read from repliChip_data_mm10/ instead of repliChip_data/.

After running the shell script, re-run replicationInfo.R and
replicationInfo_per_bin.R to regenerate replicationInfo_mouse.RDS and
replicationTiming_per_binSize_mouse.RDS with correct mm10 coordinates.

No changes to the averaging or binning logic — only the input file path
changes. Same 33 cell types, same WaveSignal format.
Copilot AI review requested due to automatic review settings June 2, 2026 08:53
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Adds mouse (GRCm38/mm10) support to the read-counting and blacklist-filtering pipeline, and switches the mouse replication-timing inputs from mm9 to mm10 via a new liftover script.

Changes:

  • In R/core.R, extend readData to compute coverage/genome metadata for species == "Mouse", broaden the extended-blacklisting gate to include mouse, and add a GRCm38 branch in readFilter that combines an ENCODE mm10 blacklist, a new pericentromeric BED, and a segdup BED.
  • Add data/blacklisting/mm10_pericentromeric_exclude.bed (3 Mb pericentromeric exclusion per autosome plus chrX/Y).
  • Update data/replicationTiming/replicationInfo.R and README to consume an mm10 RepliChip directory, and add liftover_mm9_to_mm10.sh that downloads mm9 BigWigs and lifts them to mm10.

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
R/core.R Adds Mouse branches for coverage, experimentData, and a GRCm38 blacklist path in readFilter.
data/blacklisting/mm10_pericentromeric_exclude.bed New 3 Mb per-chromosome pericentromeric exclusion regions for mm10.
data/replicationTiming/replicationInfo.R Repoints mouse RepliChip path to the new mm10 directory and updates source comments.
data/replicationTiming/README Documents that mouse data is now mm10 (lifted from mm9).
data/replicationTiming/liftover_mm9_to_mm10.sh New script to download mm9 RepliChip BigWigs and lift them over to mm10.
Comments suppressed due to low confidence (1)

R/core.R:779

  • The first if block (when binSize %in% c(200, 2000, 5000) && genome == "GRCh37") sets extendedFilter from the pre-readFilter fData(readCounts)[["use"]], but then immediately falls through to readCounts = readFilter(...) followed by an unconditional extendedFilter = Biobase::fData(readCounts)[["use"]] reassignment on line 776, which overwrites the value just computed. This was already dead-looking code before this PR, but extending the surrounding condition to Mouse is a good moment to clarify intent: either the GRCh37 custom-bin branch should return/skip the subsequent readFilter call, or the redundant assignment on line 773 should be removed. As written, custom-bin GRCh37 mouse/human runs behave identically to the default path.
  if(extendedBlacklisting & (species == "Human" | species == "Mouse")){
    if(binSize < 10) stop("blacklisting doesn't support binsizes smaller than 10kb")
    # this is because our blacklisted regions have a resolution of 10kb
    if(binSize %in% c(200, 2000, 5000) && genome == "GRCh37"){
      # custom bins
      extendedFilter = Biobase::fData(readCounts)[["use"]]
    }
    readCounts = readFilter(readCounts, genome=genome)
    extendedFilter = Biobase::fData(readCounts)[["use"]]
  }else{
    extendedFilter = Biobase::fData(readCounts)[["use"]] | base::startsWith(rownames(readCounts), "X:") | base::startsWith(rownames(readCounts), "Y:")
  }

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +26 to +85
echo "=== Downloading liftover chain and chrom sizes ==="
if [ ! -f "$CHAIN" ]; then
curl -s "https://hgdownload.soe.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz" \
| gunzip > "$CHAIN"
fi
if [ ! -f "$CHROM_SIZES" ]; then
curl -s "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes" \
-o "$CHROM_SIZES"
fi

files=(
"wgEncodeFsuRepliChipCh12FWaveSignalRep1"
"wgEncodeFsuRepliChipCh12FWaveSignalRep2"
"wgEncodeFsuRepliChipEpisc5MWaveSignalRep1"
"wgEncodeFsuRepliChipEpisc5MWaveSignalRep2"
"wgEncodeFsuRepliChipEpisc7FWaveSignalRep1"
"wgEncodeFsuRepliChipEpisc7FWaveSignalRep2"
"wgEncodeFsuRepliChipEs46cMDifff6dWaveSignalRep1"
"wgEncodeFsuRepliChipEs46cMWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MDiffe3dWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MDiffe3dWaveSignalRep2"
"wgEncodeFsuRepliChipEsd3MDiffe6dWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MDiffe6dWaveSignalRep2"
"wgEncodeFsuRepliChipEsd3MDiffe9dWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MDiffe9dWaveSignalRep2"
"wgEncodeFsuRepliChipEsd3MDiffg3dWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MDiffg3dWaveSignalRep2"
"wgEncodeFsuRepliChipEsd3MWaveSignalRep1"
"wgEncodeFsuRepliChipEsd3MWaveSignalRep2"
"wgEncodeFsuRepliChipEsem5sUDiffhsoxmWaveSignalRep1"
"wgEncodeFsuRepliChipEsem5sUDiffhsoxmWaveSignalRep2"
"wgEncodeFsuRepliChipEsem5sUDiffhsoxpWaveSignalRep1"
"wgEncodeFsuRepliChipEsem5sUDiffhsoxpWaveSignalRep2"
"wgEncodeFsuRepliChipEstt2MDifff9dWaveSignalRep1"
"wgEncodeFsuRepliChipEstt2MDifff9dWaveSignalRep2"
"wgEncodeFsuRepliChipEstt2MWaveSignalRep1"
"wgEncodeFsuRepliChipEstt2MWaveSignalRep2"
"wgEncodeFsuRepliChipJ185aUWaveSignalRep1"
"wgEncodeFsuRepliChipJ185aUWaveSignalRep2"
"wgEncodeFsuRepliChipL1210FWaveSignalRep1"
"wgEncodeFsuRepliChipL1210FWaveSignalRep2"
"wgEncodeFsuRepliChipMefMWaveSignalRep1"
"wgEncodeFsuRepliChipMelMWaveSignalRep1"
"wgEncodeFsuRepliChipMelMWaveSignalRep2"
)

for f in "${files[@]}"; do
out="${MM10_DIR}/${f}.bigWig"
if [ -f "$out" ]; then
echo "Skipping $f (already lifted)"
continue
fi

echo "=== Processing $f ==="

mm9_bw="${MM9_DIR}/${f}.bigWig"
if [ ! -f "$mm9_bw" ]; then
echo " Downloading mm9 BigWig..."
curl -s "${BASE_URL}/${f}.bigWig" -o "$mm9_bw"
fi
Comment on lines +98 to +99
sort -k1,1 -k2,2n "${MM10_DIR}/${f}_unsorted.bedGraph" \
> "${MM10_DIR}/${f}_sorted.bedGraph"
@shafighi shafighi merged commit dfa9a37 into main Jun 2, 2026
1 check passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants