Feature/mouse support#9
Merged
Merged
Conversation
- readFilter(): add GRCm38 branch loading mm10-blacklist.v2.bed so centromere/low-mappability bins are excluded for mouse data - readData(): compute coverage pData column for Mouse using 2.7 Gb genome size; also store genome in experimentData for Mouse - Extend extendedBlacklisting condition to apply for both Human and Mouse
Mouse chromosomes are telocentric: the centromere sits at position 0 of every chromosome, followed by unassembled satellite sequence. The UCSC mm10 gap track documents a formal centromere N-gap from 0 to 3,000,000 bp on all 21 chromosomes. The existing ENCODE mm10-blacklist.v2.bed already covers chr2-4, chr6, chr9-chr19, and chrY from position 0. Five chromosomes were unprotected near the start (chr1, chr5, chr7, chr8, chrX). Changes: - Add data/blacklisting/mm10_pericentromeric_exclude.bed covering 0-3,000,000 bp on all chromosomes, based on the UCSC centromere gap - readFilter() for GRCm38 now merges this file with mm10-blacklist.v2.bed before applying findOverlaps, so both sources contribute to bin exclusion The BED file uses base-pair coordinates so it is bin-size-agnostic: the findOverlaps logic in readFilter() handles any bin size correctly. This also covers the chrX PAR region (0-700,000 bp) for male samples.
Segmental duplications are near-identical copy-pasted regions of the genome. Reads from these regions cannot be unambiguously placed, making read counts unreliable. Bailey et al. 2008 (Nature Genetics) showed ~47% of apparent mouse CNVs overlap segmental duplications and are likely false positives. Mouse chr7, chr12, chr14, and chrX are particularly affected. The ENCODE mm10-blacklist.v2.bed does not cover these regions, so a separate mask is needed. Changes: - Add data/blacklisting/mm10_segdup.bed (UCSC mm10 segmental duplication track, 180,459 regions) - readFilter() for GRCm38 now merges all three sources: ENCODE blacklist, pericentromeric filter, and segdup mask - Random/unplaced contigs in the segdup file (chrUn_*, chr*_random, chrM) are filtered to only chromosomes present in the QDNAseq.mm10 bin set before building the GRanges object
The FSU RepliChip BigWig files used for mouse replication timing correction were in mm9 coordinates but were being averaged against mm10 GRanges bins without an explicit liftover — a silent coordinate mismatch. Fix: add liftover_mm9_to_mm10.sh which downloads the same 33 FSU RepliChip WaveSignal files from UCSC and lifts them to mm10 using the official mm9->mm10 chain (>97% of the genome retained). The R script is updated to read from repliChip_data_mm10/ instead of repliChip_data/. After running the shell script, re-run replicationInfo.R and replicationInfo_per_bin.R to regenerate replicationInfo_mouse.RDS and replicationTiming_per_binSize_mouse.RDS with correct mm10 coordinates. No changes to the averaging or binning logic — only the input file path changes. Same 33 cell types, same WaveSignal format.
There was a problem hiding this comment.
Pull request overview
Adds mouse (GRCm38/mm10) support to the read-counting and blacklist-filtering pipeline, and switches the mouse replication-timing inputs from mm9 to mm10 via a new liftover script.
Changes:
- In
R/core.R, extendreadDatato compute coverage/genome metadata forspecies == "Mouse", broaden the extended-blacklisting gate to include mouse, and add aGRCm38branch inreadFilterthat combines an ENCODE mm10 blacklist, a new pericentromeric BED, and a segdup BED. - Add
data/blacklisting/mm10_pericentromeric_exclude.bed(3 Mb pericentromeric exclusion per autosome plus chrX/Y). - Update
data/replicationTiming/replicationInfo.Rand README to consume an mm10 RepliChip directory, and addliftover_mm9_to_mm10.shthat downloads mm9 BigWigs and lifts them to mm10.
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| R/core.R | Adds Mouse branches for coverage, experimentData, and a GRCm38 blacklist path in readFilter. |
| data/blacklisting/mm10_pericentromeric_exclude.bed | New 3 Mb per-chromosome pericentromeric exclusion regions for mm10. |
| data/replicationTiming/replicationInfo.R | Repoints mouse RepliChip path to the new mm10 directory and updates source comments. |
| data/replicationTiming/README | Documents that mouse data is now mm10 (lifted from mm9). |
| data/replicationTiming/liftover_mm9_to_mm10.sh | New script to download mm9 RepliChip BigWigs and lift them over to mm10. |
Comments suppressed due to low confidence (1)
R/core.R:779
- The first
ifblock (whenbinSize %in% c(200, 2000, 5000) && genome == "GRCh37") setsextendedFilterfrom the pre-readFilterfData(readCounts)[["use"]], but then immediately falls through toreadCounts = readFilter(...)followed by an unconditionalextendedFilter = Biobase::fData(readCounts)[["use"]]reassignment on line 776, which overwrites the value just computed. This was already dead-looking code before this PR, but extending the surrounding condition toMouseis a good moment to clarify intent: either the GRCh37 custom-bin branch shouldreturn/skip the subsequentreadFiltercall, or the redundant assignment on line 773 should be removed. As written, custom-bin GRCh37 mouse/human runs behave identically to the default path.
if(extendedBlacklisting & (species == "Human" | species == "Mouse")){
if(binSize < 10) stop("blacklisting doesn't support binsizes smaller than 10kb")
# this is because our blacklisted regions have a resolution of 10kb
if(binSize %in% c(200, 2000, 5000) && genome == "GRCh37"){
# custom bins
extendedFilter = Biobase::fData(readCounts)[["use"]]
}
readCounts = readFilter(readCounts, genome=genome)
extendedFilter = Biobase::fData(readCounts)[["use"]]
}else{
extendedFilter = Biobase::fData(readCounts)[["use"]] | base::startsWith(rownames(readCounts), "X:") | base::startsWith(rownames(readCounts), "Y:")
}
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+26
to
+85
| echo "=== Downloading liftover chain and chrom sizes ===" | ||
| if [ ! -f "$CHAIN" ]; then | ||
| curl -s "https://hgdownload.soe.ucsc.edu/goldenPath/mm9/liftOver/mm9ToMm10.over.chain.gz" \ | ||
| | gunzip > "$CHAIN" | ||
| fi | ||
| if [ ! -f "$CHROM_SIZES" ]; then | ||
| curl -s "https://hgdownload.soe.ucsc.edu/goldenPath/mm10/bigZips/mm10.chrom.sizes" \ | ||
| -o "$CHROM_SIZES" | ||
| fi | ||
|
|
||
| files=( | ||
| "wgEncodeFsuRepliChipCh12FWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipCh12FWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEpisc5MWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEpisc5MWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEpisc7FWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEpisc7FWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEs46cMDifff6dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEs46cMWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe3dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe3dWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe6dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe6dWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe9dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MDiffe9dWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsd3MDiffg3dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MDiffg3dWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsd3MWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsd3MWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsem5sUDiffhsoxmWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsem5sUDiffhsoxmWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEsem5sUDiffhsoxpWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEsem5sUDiffhsoxpWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEstt2MDifff9dWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEstt2MDifff9dWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipEstt2MWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipEstt2MWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipJ185aUWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipJ185aUWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipL1210FWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipL1210FWaveSignalRep2" | ||
| "wgEncodeFsuRepliChipMefMWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipMelMWaveSignalRep1" | ||
| "wgEncodeFsuRepliChipMelMWaveSignalRep2" | ||
| ) | ||
|
|
||
| for f in "${files[@]}"; do | ||
| out="${MM10_DIR}/${f}.bigWig" | ||
| if [ -f "$out" ]; then | ||
| echo "Skipping $f (already lifted)" | ||
| continue | ||
| fi | ||
|
|
||
| echo "=== Processing $f ===" | ||
|
|
||
| mm9_bw="${MM9_DIR}/${f}.bigWig" | ||
| if [ ! -f "$mm9_bw" ]; then | ||
| echo " Downloading mm9 BigWig..." | ||
| curl -s "${BASE_URL}/${f}.bigWig" -o "$mm9_bw" | ||
| fi |
Comment on lines
+98
to
+99
| sort -k1,1 -k2,2n "${MM10_DIR}/${f}_unsorted.bedGraph" \ | ||
| > "${MM10_DIR}/${f}_sorted.bedGraph" |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.