feat(preprocessing): add optimal shifted CLR (PFlog1pPF) normalization by rwbaber · Pull Request #4160 · scverse/scanpy

rwbaber · 2026-06-14T17:27:22Z

Closes #
Tests included or not required because: 34 comprehensive parametrized unit tests covering memory layouts, alpha calibration, and zero-depth guards have been appended to tests/test_normalization.py.
Release notes not necessary because: (A release note fragment has been provided).

Summary & Paper Context

This PR introduces sc.pp.normalize_clr to implement the optimal shifted Centered Log-Ratio transform.

Reference Publication: This implementation is based directly on the paper "Depth normalization for single-cell genomics count data" (Booeshaghi et al., 2022, 2026).
bioRxiv Link: https://www.biorxiv.org/content/10.1101/2022.05.06.490859v3
Code Attribution: The core mathematical processing engine and memory-saving array optimizations are adapted directly from the Pachter Lab reference repository: https://github.com/pachterlab/bhgp_2022

The paper demonstrates that the shifted CLR (conceptually known as the $\text{PFlog1pPF}$ family) satisfies four essential preprocessing criteria, unlike most other common normalization methods:

Variance Stabilization: Balances gene variance independent of expression magnitude.
Scale (Sequencing Depth) Invariance: Erases technical read-depth variations across cells, preventing cells from clustering based on library size.
Strict Rank Monotonicity: Preserves the exact relative abundance rankings of genes within a cell, avoiding the artifacts introduced by regression-based methods like sctransform.
Perturbation Additivity: Fully compatible with linear downstream dimensionality reduction (PCA) because changes act additively in log-space.

See publication for extensive validation benchmarks across hundreds of datasets.
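As a rough illustration of the transform described above, here is a minimal dense NumPy sketch. It is not the code from this PR: it assumes the shifted CLR has the form log1p(target_sum * x / depth) followed by subtracting each cell's mean over genes. The function name `shifted_clr_dense` and the fallback of using the mean depth as `target_sum` are hypothetical choices made for the example.

```python
import numpy as np


def shifted_clr_dense(counts, target_sum=None):
    """Hypothetical dense sketch: log1p of depth-scaled counts, centered per cell."""
    counts = np.asarray(counts, dtype=np.float64)
    depth = counts.sum(axis=1, keepdims=True)
    if target_sum is None:
        # Assumption for this example: fall back to the mean cell depth.
        target_sum = depth.mean()
    # Guard zero-depth cells so they divide by 1 and stay all-zero.
    safe_depth = np.where(depth > 0, depth, 1.0)
    logged = np.log1p(target_sum * counts / safe_depth)
    # Centering: subtract each cell's mean over genes.
    return logged - logged.mean(axis=1, keepdims=True)


# Row 1 is row 0 sequenced twice as deeply; row 2 is an unrelated cell.
counts = np.array([[4, 0, 1, 5], [8, 0, 2, 10], [0, 3, 3, 0]])
out = shifted_clr_dense(counts, target_sum=10.0)
```

In this toy input the first two rows map to the same output (depth invariance), every output row sums to zero (centering), and the within-cell ordering of genes is unchanged (rank monotonicity).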

Implementation Details & Files Changed

src/scanpy/preprocessing/_normalization.py
- Implemented the public normalize_clr function and a private _normalize_clr_helper backend engine.
- Utilizes the author's sparse "offset trick" to isolate log operations exclusively to non-zero indices, preventing premature matrix densification and keeping memory usage minimal.
- Supports negative-binomial overdispersion (alpha) data overrides to automatically calibrate target scale factors via the delta method.
src/scanpy/preprocessing/__init__.py
- Registered normalize_clr inside the preprocessing namespace and appended it to the global __all__ list.
docs/references.bib
- Appended the cleaned, structured, and properly aligned Booeshaghi2022 BibTeX entry.
docs/api/preprocessing.md
- Registered the function for inclusion in the Sphinx-generated documentation summary table.
docs/release-notes/4160.feat.md
- Created the required markdown fragment documenting this change for the next release.
tests/test_normalization.py
- Added 9 test functions (yielding 34 parametrized cases) evaluating standard in-memory arrays (ARRAY_TYPES_MEM) for sparse-dense equivalence, delta-method overrides, empty cell handling, and hyperplane invariants.
tests/test_package_structure.py
- Registered sc.pp.normalize_clr into the copy_sigs chained assignment exemption list. This explicitly informs the global test_sig_conventions runner that the function is permitted to return a dict[str, np.ndarray] when inplace=False, resolving the administrative signature enforcement failure.
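The sparse "offset trick" mentioned for _normalization.py can be sketched as follows. This is an illustrative reconstruction under stated assumptions, not the PR's `_normalize_clr_helper`: because log1p(0) = 0, the log only needs to touch stored non-zero entries, and the per-cell mean can be kept as a separate offset vector until a dense result is actually required. The function name `shifted_clr_sparse` and its return convention are hypothetical.

```python
import numpy as np
import scipy.sparse as sp


def shifted_clr_sparse(counts, target_sum):
    """Hypothetical sketch: return (CSR of log1p values, per-cell offset).

    The dense centered result is values.toarray() - offset[:, None].
    """
    counts = sp.csr_matrix(counts, dtype=np.float64)
    n_genes = counts.shape[1]
    depth = np.asarray(counts.sum(axis=1)).ravel()
    # Zero-depth cells divide by 1 and stay all-zero.
    safe_depth = np.where(depth > 0, depth, 1.0)
    # Row-wise scaling via a diagonal matrix keeps the result sparse.
    scaled = (sp.diags(target_sum / safe_depth) @ counts).tocsr()
    # log1p only on stored non-zeros; implicit zeros stay zero.
    scaled.data = np.log1p(scaled.data)
    # Per-cell mean over all genes, counting implicit zeros in the denominator.
    offset = np.asarray(scaled.sum(axis=1)).ravel() / n_genes
    return scaled, offset


dense_counts = np.array([[4, 0, 1, 5], [0, 0, 0, 0], [0, 3, 3, 0]])
values, offset = shifted_clr_sparse(dense_counts, target_sum=10.0)
```

Subtracting the offset from every entry is the step that densifies the matrix, so in this sketch the memory saving applies only up to that final subtraction.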

codecov · 2026-06-14T17:33:28Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.73%. Comparing base (2ae768e) to head (b284927).
⚠️ Report is 1 commits behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #4160      +/-   ##
==========================================
+ Coverage   79.61%   79.73%   +0.11%     
==========================================
  Files         120      120              
  Lines       12786    12853      +67     
==========================================
+ Hits        10180    10248      +68     
+ Misses       2606     2605       -1

Flag	Coverage Δ
hatch-test.low-vers	`78.98% <100.00%> (+0.14%)`	⬆️
hatch-test.pre	`79.58% <100.00%> (+0.10%)`	⬆️

Files with missing lines	Coverage Δ
src/scanpy/preprocessing/__init__.py	`100.00% <100.00%> (ø)`
src/scanpy/preprocessing/_normalization.py	`97.18% <100.00%> (+2.51%)`	⬆️

... and 1 file with indirect coverage changes

ilan-gold · 2026-06-15T09:46:47Z

@rwbaber thanks for the PR! We started a discussion over in the reproducibility repo: pachterlab/BHGP_2022#1

It's not immediately clear what the default should be, but I will definitely have a look at this implementation!

ilan-gold

Please make sure this works with dask (sparse and dense) as well. Thanks!

ilan-gold · 2026-06-15T09:52:00Z

+        # mean cell depth. This implicitly sets the count-scale pseudocount to the
+        # optimal variance-stabilizing value y0 = 1 / (4 * alpha), and overrides
+        # any value passed as `target_sum`.
+        target_sum = 4.0 * alpha * scale


We should probably provide an option for fitting alpha from the data via curve_fit. I think this could take the form of an "auto" string option to the alpha argument.
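The claim in the diff comment above can be checked with a short derivation. It assumes the transform applies log1p to $K x / d$ for a count $x$ in a cell of depth $d$, with $K$ standing for target_sum and $s$ for the mean cell depth (`scale` in the diff); this notation is introduced here for illustration and is not taken from the PR.

$$
\log\!\left(1 + \frac{K x}{d}\right) = \log\!\left(x + \frac{d}{K}\right) - \log\frac{d}{K},
\qquad
K = 4\alpha s,\ d \approx s \;\Longrightarrow\; \frac{d}{K} \approx \frac{1}{4\alpha} = y_0 .
$$

So for a cell near the mean depth, setting target_sum to $4\alpha s$ amounts to adding a pseudocount of about $1/(4\alpha)$ on the raw count scale, up to a per-cell additive constant.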

ilan-gold · 2026-06-15T10:09:01Z

+    c: float = 1.0,
+    target_sum: float | None = None,
+    alpha: float | None = None,
+    scale: float | None = None,


I'm not sure it's going to make sense to expose all of these to users. It seems like the only two profiles we should care about initially are c=1 and c= 1 / 4 alpha s, or do I have that wrong?

That was my reading of the paper, but as you can see from the linked issue, it's not immediately clear.

Address maintainer review on scverse#4160:
- Drop c (redundant with target_sum) and scale (internal mean depth).
- alpha now accepts 'auto' via closed-form OLS overdispersion estimation matching runorm.
- Raise ValueError on non-positive alpha instead of clamping.
- Cast integer input to float64 instead of float32.

rwbaber · 2026-06-15T14:42:19Z

@ilan-gold thanks for the review! I had missed the issue you opened but now caught up on it. I just pushed my changes again for review, though I'm not sure if this is the optimal implementation yet.

What I changed based on your comments:

Precision Cast (float64): I originally mirrored normalize_total's int --> float32 cast, although that code also carries the comment # TODO: Check if float64 should be used. Since the centering step always forces the CLR matrix into a dense representation, any memory-saving rationale doesn't apply here anyway, so I switched to np.float64.
Implementing alpha="auto" via Closed-Form OLS: I had a look at their runorm repo, where this is implemented with an exact closed-form ordinary least squares (OLS) estimator: $\alpha = \sum_g (\text{Var}_g - \mu_g) \cdot \mu_g^2 / \sum_g \mu_g^4$. So instead of scipy.optimize.curve_fit, this uses pure NumPy reductions and explicitly raises a ValueError if the data is underdispersed ($\alpha \le 0$), matching their error handling.
Parameter Simplification (c and scale removed): I think you are right. The runorm implementation also omits user-facing c and scale parameters and just applies log1p directly.
If I'm not mistaken, the per-cell centering step algebraically cancels the log-shift scalar ($\log c$), which is constant across genes within a cell, so the shape of the transformation is governed entirely by the ratio $K/c$ (i.e., c simply rescales K). Keeping $c = 1.0$ fixed internally and adjusting target_sum ($K$) should therefore span the same mathematical space. Also, scale ($s$) can just be computed as the mean cell depth under the hood. I've refactored the function accordingly.
Test Coverage: I updated the test suite to drop the legacy parameters, verified the new closed-form alpha="auto" estimate, and added assertions that non-positive overdispersion raises the appropriate errors.
Dask Array Support: normalize_clr now supports dense and sparse Dask arrays (mapping the transform over row chunks via map_blocks after rechunking along the gene axis for centering), with equivalence tests expanded across all ARRAY_TYPES.
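The closed-form estimator quoted in the list above can be sketched in plain NumPy. This is an illustrative version under assumptions, not the code from this PR or from runorm: the function names are hypothetical, and the use of the sample variance (`ddof=1`) is a guess, since the thread does not say which variance convention is used. The second helper applies the target_sum = 4 * alpha * scale rule from the reviewed diff.

```python
import numpy as np


def estimate_alpha_ols(counts):
    """Hypothetical sketch of the OLS fit of Var_g = mu_g + alpha * mu_g**2."""
    counts = np.asarray(counts, dtype=np.float64)
    mu = counts.mean(axis=0)          # per-gene mean
    var = counts.var(axis=0, ddof=1)  # per-gene variance (ddof=1 is an assumption)
    denom = np.sum(mu**4)
    if denom == 0:
        raise ValueError("All genes have zero mean; alpha cannot be estimated.")
    alpha = np.sum((var - mu) * mu**2) / denom
    if alpha <= 0:
        raise ValueError(f"Estimated alpha={alpha} is not positive (underdispersed data).")
    return float(alpha)


def target_sum_from_alpha(alpha, mean_depth):
    """target_sum = 4 * alpha * mean depth, as in the reviewed diff hunk."""
    if alpha <= 0:
        raise ValueError("alpha must be positive.")
    return 4.0 * alpha * mean_depth


# One gene observed as 0 and 4: mean 2, sample variance 8,
# so alpha = (8 - 2) * 2**2 / 2**4 = 1.5.
alpha = estimate_alpha_ols(np.array([[0], [4]]))
target_sum = target_sum_from_alpha(alpha, mean_depth=100.0)  # 600.0
```

Raising on non-positive estimates rather than clamping mirrors the behaviour described in the review thread; a constant matrix, for example, has zero variance and would trigger the error.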

Let me know what you think!

- Switch overdispersion estimation to dask-aware mean_var
- Use map_blocks with gene rechunking for cell centering
- Expand test suites to cover dense and sparse Dask ARRAY_TYPES

Intron7 · 2026-06-16T06:10:13Z

@rwbaber can you remove the llm bloat?

rwbaber added 2 commits June 14, 2026 18:02

feat(preprocessing): add optimal shifted CLR (PFlog1pPF) normalization

6c253cb

docs: update release notes fragment with PR number

eec2659

rwbaber added 3 commits June 14, 2026 19:21

test: register normalize_clr in copy_sigs signature conventions

32fb68b

refactor(pp): unify normalize_clr math variables and clean up bibliog…

6bd44b0

…raphy

docs: polish normalize_clr docstring wording

dd9aca0

ilan-gold self-requested a review June 15, 2026 09:46

ilan-gold reviewed Jun 15, 2026

rwbaber added 2 commits June 15, 2026 21:37

feat(pp): support dask arrays in normalize_clr

6eee954

test(pp): cover alpha="auto" zero-mean overdispersion error

b284927

rwbaber wants to merge 8 commits into scverse:main from rwbaber:feature/normalize_clr

3 participants
