Use np.int8 for indicator matrix in generateMatrix (8x memory reduction, closes #28) #29

Open
chlee-tabin wants to merge 1 commit into UcarLab:main from chlee-tabin:memory/int8-indicator-matrix

Conversation

@chlee-tabin

Closes #28.

Summary

One-line fix: generateMatrix allocates the peak × cell indicator matrix with np.zeros((n_peaks, n_cells)), which defaults to float64 (8 bytes per entry), but the matrix only ever stores binary 0/1 values. Changing the dtype to np.int8 cuts peak memory 8x with no algorithmic change.

-    matrix = np.zeros((len(unionoverlaps), len(cellids)))
+    matrix = np.zeros((len(unionoverlaps), len(cellids)), dtype=np.int8)

The matrix is only ever written to via matrix[oi, celliddict[cellid]] = 1. All downstream operations (np.sum(matrix, axis=1) in inferRepeats, etc.) work unchanged on int8 — sums return int64 by default.
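A minimal standalone sketch of the claim (shapes are illustrative, not AMULET's real dimensions): the default float64 allocation costs 8 bytes per entry versus 1 byte for int8, and binary writes plus axis sums behave identically, with NumPy upcasting integer sums automatically.

```python
import numpy as np

n_peaks, n_cells = 1000, 500

# Default allocation: float64, 8 bytes per entry
dense = np.zeros((n_peaks, n_cells))
assert dense.dtype == np.float64
print(dense.nbytes)      # 4000000 bytes

# int8: 1 byte per entry -> 8x smaller for the same shape
indicator = np.zeros((n_peaks, n_cells), dtype=np.int8)
print(indicator.nbytes)  # 500000 bytes

# Binary writes and axis sums behave identically
indicator[3, 42] = 1
per_peak = np.sum(indicator, axis=1)
assert np.issubdtype(per_peak.dtype, np.integer)  # upcast to platform int
assert per_peak[3] == 1
```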

Why this matters

Reproduction context (duck Anas platyrhynchos snATAC multiome, 12 cellranger-arc 2.1.0 libraries, 4 K – 35 K cells/lib, median ~25,000 unique HQ fragments/cell):

| Library    | n_cells | Default (float64)   | dtype=np.int8 |
|------------|---------|---------------------|---------------|
| ARC08      | 29,523  | OOM at peak >240 GB | 34 GB         |
| PILOTARC4A | 34,622  | OOM at peak >246 GB | 20 GB         |

(SLURM MaxRSS from sacct; HMS-O2 short partition, 257 GB node cap.)

Without this patch AMULET is unrunnable on these libraries on standard cluster hardware. This affects 10x Chromium scATAC/multiome users with 25K+ cells/library; sci-ATAC-seq3 and similar low-coverage assays (~4K reads/cell, e.g. Qiu et al. 2026) don't trigger it because per-cell fragment counts are 5-10x lower.

Verification

Doublet rates produced from int8 vs float64 matrices are identical, confirmed on the 4 small libraries (ARC0S-duck, PILOTARC3, PILOTARC4B, ARC04) where float64 happens to fit in memory. The final 12-library doublet-rate range is 3.42–7.76% at FDR q < 0.01.
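The equivalence is easy to check synthetically, independent of AMULET: writing the same sequence of 1s into a float64 and an int8 matrix yields bit-identical contents and identical axis sums (the shapes and random writes below are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
n_peaks, n_cells = 200, 100

# Simulate the same sequence of indicator writes into both dtypes
rows = rng.integers(0, n_peaks, size=5000)
cols = rng.integers(0, n_cells, size=5000)
m64 = np.zeros((n_peaks, n_cells))
m8 = np.zeros((n_peaks, n_cells), dtype=np.int8)
m64[rows, cols] = 1
m8[rows, cols] = 1

# Identical contents and identical per-cell sums despite different dtypes
assert np.array_equal(m64, m8)
assert np.array_equal(m64.sum(axis=0), m8.sum(axis=0))
```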

Follow-ups

Long-term, scipy.sparse.csr_matrix would be even better — the indicator matrix is >99% zeros for typical scATAC data, so sparse would drop memory another order of magnitude. That's a bigger refactor though. This int8 change is a strictly-necessary first step that unblocks AMULET on modern high-coverage 10x data with a one-line patch.
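For reference, the sparse refactor might look something like the following sketch (this is not AMULET code; the shapes and writes are illustrative): build in LIL format, which supports cheap incremental writes, then convert to CSR for the downstream sums.

```python
import numpy as np
from scipy import sparse

n_peaks, n_cells = 1000, 500

# LIL is efficient for incremental assignment; CSR for arithmetic
m = sparse.lil_matrix((n_peaks, n_cells), dtype=np.int8)
m[3, 42] = 1
m[7, 42] = 1
csr = m.tocsr()

# Same axis-sum interface as the dense matrix
per_cell = np.asarray(csr.sum(axis=0)).ravel()
assert per_cell[42] == 2

# Only nonzeros are stored (data + indices + indptr arrays)
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)
```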

Note on coexistence with #27

This PR is orthogonal to #27 (which fixes np.object → object for NumPy ≥1.24). Both fixes are needed for AMULET to run cleanly on a modern conda env; they touch disjoint lines and can be merged in either order.


The peak x cell matrix in generateMatrix() only ever stores 0/1
binary indicators (set via matrix[oi, celliddict[cellid]] = 1 in
the inner loop), but np.zeros defaults to float64 -- 8 bytes per
cell unnecessarily.  Switching to dtype=np.int8 cuts peak memory
8x with zero algorithmic change; downstream np.sum() operations
work unchanged (they return int64 by default).

Measured impact on duck (Anas platyrhynchos) snATAC multiome:

| Library    | n_cells | float64       | int8      |
|------------|---------|---------------|-----------|
| ARC08      | 29,523  | OOM at >240G  | 34G peak  |
| PILOTARC4A | 34,622  | OOM at >246G  | 20G peak  |

Without this, AMULET is unrunnable on 10x multiome libraries
>=25K cells on standard 256G cluster nodes.  Doublet rates with
the int8 patch match what AMULET produces on the smaller libs
where float64 fits (3.42-7.76% range, q<0.01, n=12 ARC libs).

Sci-ATAC-seq3 and similar low-coverage assays don't hit this
because per-cell fragment counts are ~5-10x lower.

Long-term, scipy.sparse.csr_matrix would be even better (the
indicator is >99% zeros for typical scATAC data), but the int8
change is a strictly-necessary first step that unblocks modern
high-coverage 10x users without any other code change.


Development

Successfully merging this pull request may close these issues.

OOM in generateMatrix: dense float64 matrix wastes 8x memory on high-coverage data (e.g. 10x multiome)
