feat(preprocessing): add optimal shifted CLR (PFlog1pPF) normalization #4160

Open
rwbaber wants to merge 8 commits into scverse:main from rwbaber:feature/normalize_clr

Conversation

rwbaber commented Jun 14, 2026
Contributor
  • Closes #
  • Tests included or not required because: 34 comprehensive parametrized unit tests covering memory layouts, alpha calibration, and zero-depth guards have been appended to tests/test_normalization.py.
  • Release notes not necessary because: (A release note fragment has been provided).

Summary & Paper Context

This PR introduces sc.pp.normalize_clr to implement the optimal shifted Centered Log-Ratio transform.

The paper (Booeshaghi2022 in docs/references.bib) reports that the shifted CLR (the $\text{PFlog1pPF}$ family) satisfies four preprocessing criteria, unlike most other common normalization methods:

  • Variance Stabilization: Balances gene variance independent of expression magnitude.
  • Scale (Sequencing Depth) Invariance: Erases technical read-depth variations across cells, preventing cells from clustering based on library size.
  • Strict Rank Monotonicity: Preserves the exact relative abundance rankings of genes within a cell, avoiding the artifacts introduced by regression-based methods like sctransform.
  • Perturbation Additivity: Fully compatible with linear downstream dimensionality reduction (PCA) because changes act additively in log-space.

See publication for extensive validation benchmarks across hundreds of datasets.
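
For readers unfamiliar with the transform, here is a minimal dense NumPy sketch of a shifted CLR: proportional fitting to a target sum, `log1p`, then per-cell centering. The function name `shifted_clr`, the fallback to the mean depth as the target sum, and the zero-depth guard are illustrative assumptions, not the code in this PR.

```python
from typing import Optional

import numpy as np


def shifted_clr(counts: np.ndarray, target_sum: Optional[float] = None) -> np.ndarray:
    """Illustrative shifted CLR on a dense cells x genes count array."""
    counts = np.asarray(counts, dtype=np.float64)
    depth = counts.sum(axis=1, keepdims=True)
    if target_sum is None:
        # Assumption for this sketch: fall back to the mean cell depth.
        target_sum = float(depth.mean())
    # Zero-depth cells keep all-zero rows instead of dividing by zero.
    safe_depth = np.where(depth > 0, depth, 1.0)
    # Proportional fitting (PF) followed by log1p.
    logged = np.log1p(counts / safe_depth * target_sum)
    # Centering: subtract each cell's mean over all genes.
    return logged - logged.mean(axis=1, keepdims=True)
```

With a fixed `target_sum`, two cells that have the same gene proportions but different depths map to the same row, every row sums to zero, and the within-cell ordering of genes is unchanged because each step is monotone within a cell.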


Implementation Details & Files Changed

  1. src/scanpy/preprocessing/_normalization.py
    • Implemented the public normalize_clr function and a private _normalize_clr_helper backend.
    • Uses the author's sparse "offset trick": log operations are applied only to non-zero entries, which avoids premature densification and keeps memory usage low.
    • Supports a negative-binomial overdispersion (alpha) override that calibrates the target scale factor via the delta method.
  2. src/scanpy/preprocessing/__init__.py
    • Registered normalize_clr inside the preprocessing namespace and appended it to the global __all__ list.
  3. docs/references.bib
    • Added the Booeshaghi2022 BibTeX entry.
  4. docs/api/preprocessing.md
    • Registered the function for inclusion in the Sphinx-generated documentation summary table.
  5. docs/release-notes/4160.feat.md
    • Created the required markdown fragment documenting this change for the next release.
  6. tests/test_normalization.py
    • Added 9 test functions (yielding 34 parametrized cases) evaluating standard in-memory arrays (ARRAY_TYPES_MEM) for sparse-dense equivalence, delta-method overrides, empty cell handling, and hyperplane invariants.
  7. tests/test_package_structure.py
    • Added sc.pp.normalize_clr to the copy_sigs chained-assignment exemption list, so that test_sig_conventions allows the function to return a dict[str, np.ndarray] when inplace=False.
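
The sparse "offset trick" mentioned under item 1 can be sketched as follows. This is one way such a trick could work, assuming a SciPy CSR input; `clr_sparse` is a hypothetical name and the PR's actual helper may differ. Because `log1p(0) = 0`, only stored entries need the log, and the per-cell mean is the sum of the stored log values divided by the number of genes. Densification is deferred to the final centering step.

```python
import numpy as np
from scipy import sparse


def clr_sparse(counts, target_sum: float) -> np.ndarray:
    """Illustrative shifted CLR that applies log1p to stored entries only."""
    counts = sparse.csr_matrix(counts, dtype=np.float64)
    n_genes = counts.shape[1]
    depth = np.asarray(counts.sum(axis=1)).ravel()
    safe_depth = np.where(depth > 0, depth, 1.0)  # guard zero-depth cells
    # Expand the per-cell scale factor to one value per stored entry.
    nnz_per_row = np.diff(counts.indptr)
    scale = np.repeat(target_sum / safe_depth, nnz_per_row)
    logged = counts.copy()
    logged.data = np.log1p(counts.data * scale)
    # Implicit zeros contribute log1p(0) = 0, so the row mean is the
    # sum of stored values divided by the total number of genes.
    offset = np.asarray(logged.sum(axis=1)).ravel() / n_genes
    # Centering turns zeros into -offset, so the result is dense.
    dense = logged.toarray()
    dense -= offset[:, None]
    return dense
```

The output agrees with the dense computation `log1p(X / depth * target_sum)` followed by row centering, which is what a sparse-dense equivalence test would check.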

codecov Bot commented Jun 14, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 79.73%. Comparing base (2ae768e) to head (b284927).
⚠️ Report is 1 commit behind head on main.
✅ All tests successful. No failed tests found.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #4160      +/-   ##
==========================================
+ Coverage   79.61%   79.73%   +0.11%     
==========================================
  Files         120      120              
  Lines       12786    12853      +67     
==========================================
+ Hits        10180    10248      +68     
+ Misses       2606     2605       -1     
Flag Coverage Δ
hatch-test.low-vers 78.98% <100.00%> (+0.14%) ⬆️
hatch-test.pre 79.58% <100.00%> (+0.10%) ⬆️

Files with missing lines Coverage Δ
src/scanpy/preprocessing/__init__.py 100.00% <100.00%> (ø)
src/scanpy/preprocessing/_normalization.py 97.18% <100.00%> (+2.51%) ⬆️

... and 1 file with indirect coverage changes

@ilan-gold

Contributor

@rwbaber thanks for the PR! We started a discussion over in the reproducibility repo: pachterlab/BHGP_2022#1

It's not immediately clear what the default should be, but I will definitely have a look at this implementation!

ilan-gold self-requested a review June 15, 2026 09:46

ilan-gold left a comment

Contributor

Please make sure this works with dask (sparse and dense) as well. Thanks!

Comment thread src/scanpy/preprocessing/_normalization.py Outdated
# mean cell depth. This implicitly sets the count-scale pseudocount to the
# optimal variance-stabilizing value y0 = 1 / (4 * alpha), and overrides
# any value passed as `target_sum`.
target_sum = 4.0 * alpha * scale

Contributor

We should probably provide an option for fitting alpha from the data via curve_fit. I think this could take the form of an "auto" string option to the alpha argument.

Comment on lines +405 to +408
c: float = 1.0,
target_sum: float | None = None,
alpha: float | None = None,
scale: float | None = None,

Contributor

I'm not sure it's going to make sense to expose all of these to users. It seems like the only two profiles that we should care about initially are c = 1 and c = 1 / 4 alpha s, or do I have that wrong?

That was my reading of the paper, but as you can see from the attached issue, it's not immediately clear.

Address maintainer review on scverse#4160:
- Drop c (redundant with target_sum) and scale (internal mean depth).
- alpha now accepts 'auto' via closed-form OLS overdispersion estimation matching runorm.
- Raise ValueError on non-positive alpha instead of clamping.
- Cast integer input to float64 instead of float32.

rwbaber commented Jun 15, 2026

Contributor Author

@ilan-gold thanks for the review! I had missed the issue you opened but now caught up on it. I just pushed my changes again for review, though I'm not sure if this is the optimal implementation yet.

What I changed based on your comments:

  1. Precision Cast (float64): I originally mirrored normalize_total's int --> float32 cast, though there it also says # TODO: Check if float64 should be used. And since the centering step always forces the CLR matrix into a dense representation, any memory-saving rationale doesn't apply here anyway. I switched it to np.float64.
  2. Implementing alpha="auto" via Closed-Form OLS: I had a look at their runorm repo, and they implement it with an exact closed-form ordinary least squares (OLS) estimator: $\alpha = \frac{\sum_g (\text{Var}_g - \mu_g)\,\mu_g^2}{\sum_g \mu_g^4}$. So instead of scipy.optimize.curve_fit, this PR uses pure NumPy reductions and explicitly raises a ValueError if the data is underdispersed ($\alpha \le 0$), matching their error handling.
  3. Parameter Simplification (c and scale removed): I think you are right. Looking at the runorm library implementation, a user-facing c or scale parameter is also omitted, and log1p is just applied directly.
    If I'm not mistaken, the per-cell centering step algebraically cancels out the per-gene-constant log-shift scalar ($\log c$), and the entire shape of the transformation is governed by the ratio $K/c$ (i.e., c simply rescales K). So keeping $c = 1.0$ fixed internally and adjusting target_sum ($K$) should span the same mathematical space. Also, scale ($s$) can just be computed as the mean cell depth under the hood. I’ve refactored the function to mirror this.
  4. Test Coverage: I updated the test suites to drop the legacy parameters, verified that alpha="auto" matches the closed-form estimate, and added assertions that non-positive overdispersions raise the appropriate errors.
  5. Dask Array Support: normalize_clr now fully supports dense and sparse Dask arrays (mapping the transform over row chunks via map_blocks after a gene rechunk for centering), with equivalence tests expanded across all ARRAY_TYPES.
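
To make the estimator in item 2 concrete, here is a minimal NumPy sketch of the closed-form fit: a regression through the origin of $\text{Var}_g - \mu_g$ on $\mu_g^2$, followed by the `target_sum = 4 * alpha * mean depth` calibration from the reviewed code. The function names and the use of the sample variance (`ddof=1`) are assumptions for illustration; neither the PR nor runorm is quoted here.

```python
import numpy as np


def estimate_alpha_ols(counts: np.ndarray) -> float:
    """Illustrative closed-form OLS estimate of NB overdispersion."""
    counts = np.asarray(counts, dtype=np.float64)
    mu = counts.mean(axis=0)
    var = counts.var(axis=0, ddof=1)  # assumption: sample variance
    denom = np.sum(mu**4)
    if denom == 0:
        raise ValueError("Cannot estimate alpha from all-zero data.")
    # Regress (Var_g - mu_g) on mu_g^2 through the origin.
    alpha = float(np.sum((var - mu) * mu**2) / denom)
    if alpha <= 0:
        raise ValueError(f"Data look underdispersed (alpha={alpha:.4g}).")
    return alpha


def target_sum_from_alpha(counts: np.ndarray, alpha: float) -> float:
    """Target sum 4 * alpha * s, with s taken as the mean cell depth."""
    mean_depth = float(np.asarray(counts).sum(axis=1).mean())
    return 4.0 * alpha * mean_depth
```

On counts simulated from a negative binomial with variance $\mu + \alpha\mu^2$, the estimate should land close to the simulated $\alpha$. A constant count matrix has zero variance, so it produces a negative estimate and triggers the `ValueError`.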

Let me know what you think!

rwbaber added 2 commits June 15, 2026 21:37
- Switch overdispersion estimation to dask-aware mean_var

- Use map_blocks with gene rechunking for cell centering

- Expand test suites to cover dense and sparse Dask ARRAY_TYPES

Intron7 commented Jun 16, 2026

Member

@rwbaber can you remove the llm bloat?
