From 60b5666a6840adc08239e9283e5cc60d262bccdb Mon Sep 17 00:00:00 2001
From: chlee-tabin
Date: Wed, 13 May 2026 05:25:25 -0400
Subject: [PATCH] Use int8 dtype for indicator matrix in generateMatrix (closes #28)

The cell x peak matrix in generateMatrix() only ever stores 0/1 binary
indicators (set via matrix[oi, celliddict[cellid]] = 1 in the inner
loop), but np.zeros defaults to float64, which spends 8 bytes on every
matrix element. Switching to dtype=np.int8 makes the matrix 8x smaller
with zero algorithmic change; downstream np.sum() operations work
unchanged (they accumulate in the platform default integer, int64 on
64-bit Linux, so the counts cannot overflow int8).

Measured impact on duck (Anas platyrhynchos) snATAC multiome:

| Library    | n_cells | float64       | int8      |
|------------|---------|---------------|-----------|
| ARC08      | 29,523  | OOM at >240G  | 34G peak  |
| PILOTARC4A | 34,622  | OOM at >246G  | 20G peak  |

Without this, AMULET is unrunnable on 10x multiome libraries with >=25K
cells on standard 256G cluster nodes. Doublet rates with the int8 patch
match what AMULET produces on the smaller libs where float64 fits
(3.42-7.76% range, q<0.01, n=12 ARC libs).

Sci-ATAC-seq3 and similar low-coverage assays don't hit this because
per-cell fragment counts are ~5-10x lower.

Long-term, scipy.sparse.csr_matrix would be even better (the indicator
is >99% zeros for typical scATAC data), but the int8 change is a
strictly necessary first step that unblocks modern high-coverage 10x
users without any other code change.
---
 AMULET.py | 6 +++++-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/AMULET.py b/AMULET.py
index 35b25ee..1f682ba 100644
--- a/AMULET.py
+++ b/AMULET.py
@@ -44,7 +44,11 @@ def generateMatrix(data, cellids, unionoverlaps):
     regioninfo = dict()
-    matrix = np.zeros((len(unionoverlaps), len(cellids)))
+    # Indicator matrix only ever stores 0/1 (peak-in-cell), so int8 is
+    # sufficient and 8x smaller than the float64 default. This is the
+    # binding memory constraint for high-per-cell-coverage assays
+    # (e.g. 10x Chromium multiome at 25K+ cells/library). See #28.
+    matrix = np.zeros((len(unionoverlaps), len(cellids)), dtype=np.int8)
     for i in range(len(data)):
         curchr = data[i,0]
         curstart = data[i,1]
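
The 8x figure in the commit message follows from element size alone, and the claim that np.sum() keeps working can be checked on a toy array. The sketch below is illustrative only: the helper name indicator_nbytes and the 1,000,000 x 30,000 shape are assumptions made for the example, not values taken from AMULET or from the libraries measured above.

```python
import numpy as np


def indicator_nbytes(n_regions, n_cells, dtype):
    """Bytes needed for a dense (n_regions x n_cells) array of the given dtype."""
    return n_regions * n_cells * np.dtype(dtype).itemsize


# Hypothetical shape, chosen only to show the scale of the difference.
n_regions, n_cells = 1_000_000, 30_000
float64_gib = indicator_nbytes(n_regions, n_cells, np.float64) / 2**30
int8_gib = indicator_nbytes(n_regions, n_cells, np.int8) / 2**30
print(f"float64: {float64_gib:.1f} GiB, int8: {int8_gib:.1f} GiB")

# Toy int8 indicator matrix: rows are regions, columns are cells.
small = np.zeros((4, 3), dtype=np.int8)
small[0, 1] = 1
small[2, 1] = 1
small[3, 0] = 1

# Summing an int8 array accumulates in NumPy's default integer type,
# so per-cell counts are not limited to the int8 range (-128..127).
per_cell = small.sum(axis=0)
print(per_cell, per_cell.dtype)

# 200 would not fit in int8, but the sum is still computed correctly.
many_ones = np.ones(200, dtype=np.int8)
print(many_ones.sum())
```

Only the storage dtype changes; any reduction over the matrix is widened automatically, which is why the patch does not need to touch the downstream code.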