Skip to content

Guard MannKendall against segments with <3 non-NA values#10

Open
shafighi wants to merge 1 commit into
mainfrom
fix/cellcycle-mannkendall-nas
Open

Guard MannKendall against segments with <3 non-NA values#10
shafighi wants to merge 1 commit into
mainfrom
fix/cellcycle-mannkendall-nas

Conversation

@shafighi
Copy link
Copy Markdown
Collaborator

@shafighi shafighi commented Jun 2, 2026

The two cellcycle classification functions (cellcycleMetadata, cellcycle_gctest) loop over genome segments and call Kendall::MannKendall(co[idx]) on each segment's copy-number values. The existing gates skip a segment if it has fewer than 20 bins (segment_size_cutoff) or if all values are zero, but neither gate accounts for NA values in co.

With the new mouse blacklist + pericentromeric + segmental duplication masks, segments often clear the size gate but have most of their bins NA-filtered. The non-zero gate also misfires because all(co == 0) returns NA (not FALSE) when co contains NAs, so the gate's if-condition errors -- which the outer tryCatch in run_scAbsolute.R catches as a hard failure, producing an empty .rds.

Reproducibly observed on SLX-27548 mouse cells: A1_2 (65M reads) fails with stop("length(x)<3") fired by Kendall internals when most of a segment's bins are NA.

Fix:

  1. all(co == 0) -> all(co == 0, na.rm=TRUE) so NA-only and zero+NA segments are skipped instead of erroring.
  2. New gate: sum(!is.na(co)) < 3 returns NA for that segment instead of feeding a degenerate vector to MannKendall.

Same fix applied to both cellcycleMetadata (lines ~62-66) and cellcycle_gctest (lines ~192-196).

The two cellcycle classification functions (cellcycleMetadata,
cellcycle_gctest) loop over genome segments and call
Kendall::MannKendall(co[idx]) on each segment's copy-number values.
The existing gates skip a segment if it has fewer than 20 bins
(segment_size_cutoff) or if all values are zero, but neither gate
accounts for NA values in co.

With the new mouse blacklist + pericentromeric + segmental duplication
masks, segments often clear the size gate but have most of their bins
NA-filtered. The non-zero gate also misfires because all(co == 0)
returns NA (not FALSE) when co contains NAs, so the gate's if-condition
errors -- which the outer tryCatch in run_scAbsolute.R catches as a
hard failure, producing an empty .rds.

Reproducibly observed on SLX-27548 mouse cells: A1_2 (65M reads) fails
with stop("length(x)<3") fired by Kendall internals when most of a
segment's bins are NA.

Fix:
  1. all(co == 0) -> all(co == 0, na.rm=TRUE) so NA-only and zero+NA
     segments are skipped instead of erroring.
  2. New gate: sum(!is.na(co)) < 3 returns NA for that segment instead
     of feeding a degenerate vector to MannKendall.

Same fix applied to both cellcycleMetadata (lines ~62-66) and
cellcycle_gctest (lines ~192-196).
Copilot AI review requested due to automatic review settings June 2, 2026 22:22
Copy link
Copy Markdown

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR hardens the cell-cycle classification logic in cellcycleMetadata() and cellcycle_gctest() against genome segments where aggressive filtering leaves too few non-NA observations to safely run Kendall::MannKendall().

Changes:

  • Make the “all zeros” gate NA-safe via all(co == 0, na.rm=TRUE) in cellcycleMetadata().
  • Add guards to skip segments with fewer than 3 usable (non-NA) observations before calling Kendall::MannKendall().
  • Add explanatory comments describing why the new guard is needed in filtered mouse data.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread R/cellcycle.R
Comment on lines +67 to 82
# Mann-Kendall (and the partial-MK / partial-spearman calls below) all
# need >=3 jointly-non-NA values. With aggressive mouse filtering
# (blacklist + pericentromeric + segdup) and mm10-liftover gaps in
# the replication-timing track, a long segment can clear the size
# cutoff but still have <3 bins that are non-NA across co, time and
# covar simultaneously. base::sort(time, ...) drops NAs from time
# silently, so the resulting idx is shorter than the segment and
# co[idx] feeds a degenerate vector into Kendall::MannKendall,
# triggering stop("length(x)<3") inside the package.
jointly_valid = !is.na(co) & !is.na(time) & !is.na(covar)
if(sum(jointly_valid) < 3){
return(list("pval"=NA, "tau"=NA, "tau_corrected"=NA, "cor_corrected"=NA))
}
idx = base::sort(time, index.return=TRUE)$ix
# ggplot(data=dplyr::tibble(co=co[idx], covar=covar[idx], time=time[idx])) + geom_point(aes(x=time, y=co, color=covar))
trend.kendall_test = Kendall::MannKendall(co[idx])
Comment thread R/cellcycle.R
Comment on lines +202 to 209
# See cellcycleMetadata above for the same guard. base::sort(time, ...)
# silently drops NA time values, so the segment can clear the size
# gate but still feed a <3-length vector to Kendall::MannKendall.
if(sum(!is.na(co) & !is.na(time)) < 3){
return(list("pval"=NA, "tau"=NA))
}
idx = base::sort(time, index.return=TRUE)$ix
trend.kendall_test = Kendall::MannKendall(co[idx])
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants