feat(preprocessing): add optimal shifted CLR (PFlog1pPF) normalization #4160
rwbaber wants to merge 8 commits into
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## main #4160 +/- ##
==========================================
+ Coverage 79.61% 79.73% +0.11%
==========================================
Files 120 120
Lines 12786 12853 +67
==========================================
+ Hits 10180 10248 +68
+ Misses 2606 2605 -1
@rwbaber thanks for the PR! We started a discussion over in the reproducibility repo: pachterlab/BHGP_2022#1. It's not immediately clear what the default should be, but I will definitely have a look at this implementation!
ilan-gold left a comment
Please make sure this works with dask (sparse and dense) as well. Thanks!
# mean cell depth. This implicitly sets the count-scale pseudocount to the
# optimal variance-stabilizing value y0 = 1 / (4 * alpha), and overrides
# any value passed as `target_sum`.
target_sum = 4.0 * alpha * scale
We should probably provide an option for fitting alpha from the data via curve_fit. I think this could take the form of an "auto" string option to the alpha argument.
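For readers following the snippet under review, the pseudocount claim in its comment can be checked with one line of algebra. The notation here is an assumption introduced for illustration: $x_{cg}$ is the raw count of gene $g$ in cell $c$, $n_c$ is that cell's depth, and $s$ is the mean cell depth (`scale` in the snippet). With `target_sum = 4 * alpha * scale`, proportional fitting followed by `log1p` gives

$$
\log\left(1 + \frac{4\alpha s}{n_c}\,x_{cg}\right)
= \log(4\alpha) + \log\left(\frac{s}{n_c}\,x_{cg} + \frac{1}{4\alpha}\right).
$$

The second term is a log transform of the depth-normalized count $s\,x_{cg}/n_c$ with pseudocount $y_0 = 1/(4\alpha)$, which is the value named in the code comment. The first term, $\log(4\alpha)$, is the same for every gene, so a per-cell centering step of the kind a CLR applies would remove it.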
c: float = 1.0,
target_sum: float | None = None,
alpha: float | None = None,
scale: float | None = None,
I'm not sure it's going to make sense to expose all of these to the users. It seems like the only two profiles that we should care about initially are c = 1 and c = 1 / 4 alpha s, or do I have that wrong?
That was my reading of the paper, but as you can see from the attached issue, it's not immediately clear.
Address maintainer review on scverse#4160:
- Drop c (redundant with target_sum) and scale (internal mean depth).
- alpha now accepts 'auto' via closed-form OLS overdispersion estimation matching runorm.
- Raise ValueError on non-positive alpha instead of clamping.
- Cast integer input to float64 instead of float32.
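To make the "closed-form OLS overdispersion estimation" item concrete, here is a minimal sketch of one such estimator. It assumes a negative-binomial-style mean-variance relation, var = mean + alpha * mean**2, and fits alpha by least squares through the origin. The function names are hypothetical, and this is not confirmed to be what the PR or runorm actually computes.

```python
import numpy as np


def estimate_alpha_ols(gene_means, gene_vars):
    """Hypothetical helper: fit (var - mean) = alpha * mean**2 by OLS through the origin."""
    mu = np.asarray(gene_means, dtype=np.float64)
    var = np.asarray(gene_vars, dtype=np.float64)
    x = mu**2      # regressor: squared per-gene mean
    y = var - mu   # response: variance in excess of the Poisson expectation
    denom = np.sum(x * x)
    if denom == 0:
        raise ValueError("all gene means are zero; alpha cannot be estimated")
    # Closed-form slope of a no-intercept regression: sum(x * y) / sum(x * x).
    alpha = float(np.sum(x * y) / denom)
    if alpha <= 0:
        # Mirrors the commit's stated choice to raise rather than clamp.
        raise ValueError(f"estimated alpha must be positive, got {alpha}")
    return alpha


def estimate_alpha_from_counts(counts):
    """Hypothetical wrapper: per-gene moments from a dense cells-by-genes matrix."""
    counts = np.asarray(counts, dtype=np.float64)
    return estimate_alpha_ols(counts.mean(axis=0), counts.var(axis=0, ddof=1))
```

Raising on a non-positive estimate means underdispersed data surfaces as an error instead of silently producing a clamped pseudocount.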
@ilan-gold thanks for the review! I had missed the issue you opened but have now caught up on it. I just pushed my changes again for review, though I'm not sure if this is the optimal implementation yet. What I changed based on your comments:
Let me know what you think!
- Switch overdispersion estimation to dask-aware mean_var
- Use map_blocks with gene rechunking for cell centering
- Expand test suites to cover dense and sparse Dask ARRAY_TYPES
@rwbaber can you remove the LLM bloat?
tests/test_normalization.py.

Summary & Paper Context
This PR introduces `sc.pp.normalize_clr` to implement the optimal shifted Centered Log-Ratio (CLR) transform.

The paper demonstrates that the shifted CLR (conceptually known as the $\text{PFlog1pPF}$ family) satisfies four essential preprocessing criteria, unlike most other common normalization methods:
sctransform.

See the publication for extensive validation benchmarks across hundreds of datasets.
Implementation Details & Files Changed
- `src/scanpy/preprocessing/_normalization.py`: … `normalize_clr` function and a private `_normalize_clr_helper` backend engine. … (`alpha`) data overrides to automatically calibrate target scale factors via the delta method.
- `src/scanpy/preprocessing/__init__.py`: … `normalize_clr` inside the preprocessing namespace and appended it to the global `__all__` list.
- `docs/references.bib`: … `Booeshaghi2022` BibTeX entry.
- `docs/api/preprocessing.md`
- `docs/release-notes/4160.feat.md`
- `tests/test_normalization.py`: … (`ARRAY_TYPES_MEM`) for sparse-dense equivalence, delta-method overrides, empty cell handling, and hyperplane invariants.
- `tests/test_package_structure.py`: … `sc.pp.normalize_clr` into the `copy_sigs` chained assignment exemption list. This explicitly informs the global `test_sig_conventions` runner that the function is permitted to return a `dict[str, np.ndarray]` when `inplace=False`, resolving the administrative signature enforcement failure.
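
As a rough illustration of the transform and two of the invariants the tests are said to cover (empty cell handling and the hyperplane constraint), here is a dense NumPy sketch. It assumes the pipeline is proportional fitting to `4 * alpha * mean_depth`, then `log1p`, then per-cell mean centering. The name `shifted_clr_sketch` is hypothetical and this is not the PR's `normalize_clr` code, which also handles sparse and Dask inputs.

```python
import numpy as np


def shifted_clr_sketch(counts, alpha):
    """Hypothetical dense sketch of a shifted CLR on a cells-by-genes matrix."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    # Cast up front so integer input does not truncate or lose precision.
    counts = np.asarray(counts, dtype=np.float64)
    depth = counts.sum(axis=1, keepdims=True)   # per-cell total counts
    scale = depth.mean()                        # mean cell depth (assumed definition)
    target_sum = 4.0 * alpha * scale
    # Guard empty cells: their counts are all zero, so any nonzero divisor works.
    safe_depth = np.where(depth > 0, depth, 1.0)
    logged = np.log1p(counts * target_sum / safe_depth)
    # Centering places every cell on the hyperplane where values sum to zero.
    return logged - logged.mean(axis=1, keepdims=True)


x = np.array([[1, 2, 3], [0, 0, 0], [10, 0, 5]])
out = shifted_clr_sketch(x, alpha=0.5)
print(np.allclose(out.sum(axis=1), 0.0))  # each row sums to zero up to rounding
```

An all-zero cell maps to an all-zero row here, which is one reasonable convention; the PR's actual behavior for empty cells is not shown in this thread.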