Describe the bug
Indexing problems emerge with scDataLoader pre-processing for scprint-2 pretraining
To Reproduce
Steps to reproduce the behavior:
- Follow the dev install instructions in scPRINT-2 repository, as in the readme
- Try following the official guideline for building a dataset from cellxgene.
- It stops with: 'Series' object has no attribute 'nonzero'.
Expected behavior
Dataset processing should complete without errors.
Possible issue
- I think the problem originates from anndata indexing conflicting with scipy >=1.15.
- That release removed the implicit relaxations for indexing sparse arrays (or anndata's SparseDataset) with boolean pandas Series. See _index.py@_validate_indices() on scipy/scipy@66ec333.
- The project (scDataLoader itself) does not declare any scipy version requirement in pyproject.toml.
- scPRINT-2 requires scipy>=1.11.0 but sets no upper bound on the version.
- This problem likely went undetected in your CI pipeline, because it uses uv to install scDataLoader.
- In the uv.lock file, scipy is pinned to 1.12, which still has the relaxations.
- This causes an AttributeError on line 872 of scDataLoader/scdataloader/preprocess.py (lines 869 to 877 in c2db375):
      cluster_means = pd.DataFrame(
          np.array(
              [
                  adata.X[adata.obs[NEWOBS] == i].mean(axis=0)
                  for i in adata.obs[NEWOBS].unique()
              ]
          )[:, 0, :],
          index=adata.obs[NEWOBS].unique(),
      )
- Or more simply, like this gist.
Suggested fix
- I think it's better not to index any AnnData.X with a pandas Series.
- If needed, I suggest wrapping the mask and calling ().to_numpy() on it there, e.g. (adata.obs[NEWOBS] == i).to_numpy().
- Alternatively, you can add an upper bound on the scipy version, such as scipy < 1.15, but this is not future-proof.
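The first suggestion can be sketched as follows. This is only a minimal illustration, not the actual scDataLoader code: the helper name `cluster_means`, the toy matrix `X`, and the `labels` Series are assumptions standing in for the snippet above, `adata.X`, and `adata.obs[NEWOBS]`. It converts each boolean pandas Series to a NumPy array before indexing, so it does not rely on the relaxed indexing behaviour of older scipy versions.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def cluster_means(X, labels: pd.Series) -> pd.DataFrame:
    """Mean per cluster, indexing X only with NumPy boolean masks.

    Hypothetical helper for illustration; not part of scDataLoader.
    """
    clusters = labels.unique()
    rows = []
    for cluster in clusters:
        # Convert the boolean pandas Series to a plain NumPy array
        # before using it to index the sparse matrix.
        mask = (labels == cluster).to_numpy()
        # mean(axis=0) may return a 1 x n_genes np.matrix or a 1-D array
        # depending on the sparse type; flatten it either way.
        rows.append(np.asarray(X[mask].mean(axis=0)).ravel())
    return pd.DataFrame(np.vstack(rows), index=clusters)


# Toy stand-ins for adata.X and adata.obs[NEWOBS] (assumed data).
X = sparse.csr_matrix(np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]]))
labels = pd.Series(["a", "a", "b"], index=["c1", "c2", "c3"])
means = cluster_means(X, labels)
```

Flattening each row with `np.asarray(...).ravel()` also removes the need for the `[:, 0, :]` slice in the original snippet, although whether that fits the surrounding code in preprocess.py would need to be checked.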
Desktop (please complete the following information):
- OS:
$ cat /etc/os-release
NAME="Rocky Linux"
VERSION="9.6 (Blue Onyx)"
ID="rocky"
ID_LIKE="rhel centos fedora"
VERSION_ID="9.6"
PLATFORM_ID="platform:el9"
PRETTY_NAME="Rocky Linux 9.6 (Blue Onyx)"
ANSI_COLOR="0;32"
LOGO="fedora-logo-icon"
CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
HOME_URL="https://rockylinux.org/"
VENDOR_NAME="RESF"
VENDOR_URL="https://resf.org/"
BUG_REPORT_URL="https://bugs.rockylinux.org/"
SUPPORT_END="2032-05-31"
ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
ROCKY_SUPPORT_PRODUCT_VERSION="9.6"
REDHAT_SUPPORT_PRODUCT="Rocky Linux"
REDHAT_SUPPORT_PRODUCT_VERSION="9.6"
- Version:
$ git -C scDataLoader rev-parse --short HEAD
c2db375
$ git -C scPRINT-2 rev-parse --short HEAD
a885b49
$ git -C benGRN rev-parse --short HEAD
06c16f0
$ git -C GRnnData rev-parse --short HEAD
05d1101
Additional context