Use np.int8 for indicator matrix in generateMatrix (8x memory reduction, closes #28) #29

Open
chlee-tabin wants to merge 1 commit into UcarLab:main from chlee-tabin:memory/int8-indicator-matrix

Conversation

@chlee-tabin

Closes #28.

Summary

One-line fix: generateMatrix allocates the peak × cell indicator matrix with np.zeros((n_peaks, n_cells)), which defaults to float64 (8 bytes per entry), but the matrix only ever stores binary 0/1 values. Changing the dtype to np.int8 cuts peak memory 8x with no algorithmic change.

-    matrix = np.zeros((len(unionoverlaps), len(cellids)))
+    matrix = np.zeros((len(unionoverlaps), len(cellids)), dtype=np.int8)

The matrix is only ever written to via matrix[oi, celliddict[cellid]] = 1. All downstream operations (np.sum(matrix, axis=1) in inferRepeats, etc.) work unchanged on int8 — sums return int64 by default.
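A minimal standalone sketch of the claim (shapes are illustrative, not AMULET's real dimensions): the default float64 allocation costs 8 bytes per entry versus 1 byte for int8, and binary writes plus axis sums behave identically, with NumPy upcasting integer sums automatically.

```python
import numpy as np

n_peaks, n_cells = 1000, 500

# Default allocation: float64, 8 bytes per entry
dense = np.zeros((n_peaks, n_cells))
assert dense.dtype == np.float64
print(dense.nbytes)      # 4000000 bytes

# int8: 1 byte per entry -> 8x smaller for the same shape
indicator = np.zeros((n_peaks, n_cells), dtype=np.int8)
print(indicator.nbytes)  # 500000 bytes

# Binary writes and axis sums behave identically
indicator[3, 42] = 1
per_peak = np.sum(indicator, axis=1)
assert np.issubdtype(per_peak.dtype, np.integer)  # upcast to platform int
assert per_peak[3] == 1
```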

Why this matters

Reproduction context (duck Anas platyrhynchos snATAC multiome, 12 cellranger-arc 2.1.0 libraries, 4 K – 35 K cells/lib, median ~25,000 unique HQ fragments/cell):

| Library    | n_cells | Default (float64)   | dtype=np.int8 |
|------------|---------|---------------------|---------------|
| ARC08      | 29,523  | OOM at peak >240 GB | 34 GB         |
| PILOTARC4A | 34,622  | OOM at peak >246 GB | 20 GB         |

(SLURM MaxRSS from sacct; HMS-O2 short partition, 257 GB node cap.)

Without this patch AMULET is unrunnable on these libraries on standard cluster hardware. This affects 10x Chromium scATAC/multiome users with 25K+ cells/library; sci-ATAC-seq3 and similar low-coverage assays (~4K reads/cell, e.g. Qiu et al. 2026) don't trigger it because per-cell fragment counts are 5-10x lower.

Verification

Doublet rates produced from int8 vs float64 matrices are identical, confirmed on the 4 small libraries (ARC0S-duck, PILOTARC3, PILOTARC4B, ARC04) where float64 happens to fit in memory. The final 12-library doublet-rate range is 3.42–7.76% at FDR q < 0.01.
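The equivalence is easy to check synthetically, independent of AMULET: writing the same sequence of 1s into a float64 and an int8 matrix yields bit-identical contents and identical axis sums (the shapes and random writes below are illustrative only).

```python
import numpy as np

rng = np.random.default_rng(0)
n_peaks, n_cells = 200, 100

# Simulate the same sequence of indicator writes into both dtypes
rows = rng.integers(0, n_peaks, size=5000)
cols = rng.integers(0, n_cells, size=5000)
m64 = np.zeros((n_peaks, n_cells))
m8 = np.zeros((n_peaks, n_cells), dtype=np.int8)
m64[rows, cols] = 1
m8[rows, cols] = 1

# Identical contents and identical per-cell sums despite different dtypes
assert np.array_equal(m64, m8)
assert np.array_equal(m64.sum(axis=0), m8.sum(axis=0))
```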

Follow-ups

Long-term, scipy.sparse.csr_matrix would be even better — the indicator matrix is >99% zeros for typical scATAC data, so sparse would drop memory another order of magnitude. That's a bigger refactor though. This int8 change is a strictly-necessary first step that unblocks AMULET on modern high-coverage 10x data with a one-line patch.
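For reference, the sparse refactor might look something like the following sketch (this is not AMULET code; the shapes and writes are illustrative): build in LIL format, which supports cheap incremental writes, then convert to CSR for the downstream sums.

```python
import numpy as np
from scipy import sparse

n_peaks, n_cells = 1000, 500

# LIL is efficient for incremental assignment; CSR for arithmetic
m = sparse.lil_matrix((n_peaks, n_cells), dtype=np.int8)
m[3, 42] = 1
m[7, 42] = 1
csr = m.tocsr()

# Same axis-sum interface as the dense matrix
per_cell = np.asarray(csr.sum(axis=0)).ravel()
assert per_cell[42] == 2

# Only nonzeros are stored (data + indices + indptr arrays)
print(csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes)
```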

Note on coexistence with #27

This PR is orthogonal to #27 (which fixes np.object → object for NumPy ≥1.24). Both fixes are needed for AMULET to run cleanly on a modern conda env; they touch disjoint lines and can be merged in either order.


The peak x cell matrix in generateMatrix() only ever stores 0/1
binary indicators (set via matrix[oi, celliddict[cellid]] = 1 in
the inner loop), but np.zeros defaults to float64 -- 8 bytes per
cell unnecessarily.  Switching to dtype=np.int8 cuts peak memory
8x with zero algorithmic change; downstream np.sum() operations
work unchanged (they return int64 by default).

Measured impact on duck (Anas platyrhynchos) snATAC multiome:

| Library    | n_cells | float64       | int8      |
|------------|---------|---------------|-----------|
| ARC08      | 29,523  | OOM at >240G  | 34G peak  |
| PILOTARC4A | 34,622  | OOM at >246G  | 20G peak  |

Without this, AMULET is unrunnable on 10x multiome libraries
>=25K cells on standard 256G cluster nodes.  Doublet rates with
the int8 patch match what AMULET produces on the smaller libs
where float64 fits (3.42-7.76% range, q<0.01, n=12 ARC libs).

Sci-ATAC-seq3 and similar low-coverage assays don't hit this
because per-cell fragment counts are ~5-10x lower.

Long-term, scipy.sparse.csr_matrix would be even better (the
indicator is >99% zeros for typical scATAC data), but the int8
change is a strictly-necessary first step that unblocks modern
high-coverage 10x users without any other code change.


Development

Successfully merging this pull request may close these issues.

OOM in generateMatrix: dense float64 matrix wastes 8x memory on high-coverage data (e.g. 10x multiome)
