scDataLoader pre-processing for scprint-2 pretraining

**Describe the bug**
scDataLoader preprocessing for scPRINT-2 pretraining fails with an indexing error.

**To Reproduce**
Steps to reproduce the behavior:
1. Follow the dev install instructions in the scPRINT-2 repository, as described in the [readme](https://github.com/cantinilab/scPRINT-2/blob/a885b49d358c96b3cefa0e358c41fc8a80d91054/README.md?plain=1#L512-L526).
2. Follow the official [guideline](https://www.jkobject.com/scDataLoader/#gathering-a-pre-training-database) for building a dataset from cellxgene.
3. Preprocessing stops with `'Series' object has no attribute 'nonzero'`.

**Expected behavior**
Dataset preprocessing should complete without errors.

**Possible issue**
- I think the problem comes from the way the sparse matrix is indexed, which conflicts with `scipy>=1.15`.
  - That release removed the implicit leniency that allowed indexing sparse arrays (or anndata's `SparseDataset`) with a boolean pandas `Series`.
    - See `_validate_indices()` in `_index.py` in https://github.com/scipy/scipy/commit/66ec33370c889e9c9acba320354b624ee258ee22
  - scDataLoader itself declares no scipy version requirement in `pyproject.toml`.
    - scPRINT-2 requires `scipy>=1.11.0` but sets no upper bound.
  - The problem probably went unnoticed in your CI pipeline because it uses uv to install scDataLoader.
    - The `uv.lock` file pins scipy to 1.12, which still has the lenient behavior.
- This causes an `AttributeError` on line 872:
https://github.com/jkobject/scDataLoader/blob/c2db375099e0d19f11c1eb1327a16b6b5918943a/scdataloader/preprocess.py#L869-L877
  - A simpler reproduction is in this [gist](https://gist.github.com/Stfort52/d3a19d932536c2f4e7708cacc397a8bf).
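
The failure mode described above can be probed without AnnData. The following is a minimal sketch, not the linked gist: it indexes a small `scipy.sparse.csr_matrix` with a boolean pandas `Series` and reports whether the installed scipy accepts it. The helper name `series_mask_supported` and the toy data are made up for illustration, and the result depends on the installed scipy version, so no expected output is shown.

```python
import numpy as np
import pandas as pd
import scipy
from scipy import sparse


def series_mask_supported(matrix, mask):
    """Return True if `matrix[mask]` works with a boolean pandas Series.

    Hypothetical helper, for illustration only.
    """
    try:
        matrix[mask]
    except Exception as exc:  # this report saw an AttributeError at this point
        print(f"scipy {scipy.__version__}: {type(exc).__name__}: {exc}")
        return False
    return True


# Toy stand-ins for `adata.X` and one column of `adata.obs`.
X = sparse.csr_matrix(np.array([[1, 0, 2], [0, 0, 4], [3, 0, 0], [5, 2, 0]]))
labels = pd.Series(["a", "b", "a", "b"])
series_mask = labels == "a"

series_ok = series_mask_supported(X, series_mask)
# Converting the mask to a NumPy array avoids passing a Series to scipy.
numpy_rows = X[series_mask.to_numpy()]
print("Series mask accepted:", series_ok)
print("rows selected via NumPy mask:", numpy_rows.shape[0])
```

If the `Series` mask is rejected while the NumPy mask works, that would be consistent with the explanation above.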

**Suggested fix**
- I think it's better not to index `AnnData.X` with a pandas `Series` at all.
  - If a boolean mask is needed, I suggest converting it first with `.to_numpy()`, e.g. `(adata.obs[NEWOBS] == i).to_numpy()`.
- Alternatively, you could add an upper bound on the scipy version, such as `scipy<1.15`, but this is not future-proof.
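
As a rough sketch of the first suggestion, the cluster-mean computation could look like the following. This is an illustration on toy data, not a tested patch for `preprocess.py`: `X` and `labels` stand in for `adata.X` and `adata.obs[NEWOBS]`, and the `np.asarray(...).ravel()` call is an added assumption so that the code handles both the `np.matrix` and the plain array results that scipy's sparse `.mean()` can return.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins for `adata.X` and `adata.obs[NEWOBS]` (illustrative data only).
X = sparse.csr_matrix(
    np.array([[1, 0, 2], [0, 0, 4], [3, 0, 0], [5, 2, 0]], dtype=float)
)
labels = pd.Series(["a", "b", "a", "b"])

clusters = labels.unique()
rows = []
for cluster in clusters:
    # Convert the boolean Series to a NumPy array before indexing the matrix.
    mask = (labels == cluster).to_numpy()
    # Flatten the per-cluster mean to a 1-D vector of length n_genes.
    rows.append(np.asarray(X[mask].mean(axis=0)).ravel())

cluster_means = pd.DataFrame(np.vstack(rows), index=clusters)
print(cluster_means)
```

Only the mask conversion is the fix suggested here; the rest mirrors the structure of the original expression.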

**Desktop (please complete the following information):**
- OS:
  ```
  $ cat /etc/os-release
  NAME="Rocky Linux"
  VERSION="9.6 (Blue Onyx)"
  ID="rocky"
  ID_LIKE="rhel centos fedora"
  VERSION_ID="9.6"
  PLATFORM_ID="platform:el9"
  PRETTY_NAME="Rocky Linux 9.6 (Blue Onyx)"
  ANSI_COLOR="0;32"
  LOGO="fedora-logo-icon"
  CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
  HOME_URL="https://rockylinux.org/"
  VENDOR_NAME="RESF"
  VENDOR_URL="https://resf.org/"
  BUG_REPORT_URL="https://bugs.rockylinux.org/"
  SUPPORT_END="2032-05-31"
  ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
  ROCKY_SUPPORT_PRODUCT_VERSION="9.6"
  REDHAT_SUPPORT_PRODUCT="Rocky Linux"
  REDHAT_SUPPORT_PRODUCT_VERSION="9.6"
  ```

- Version:
  ```
  $ git -C scDataLoader rev-parse --short HEAD
  c2db375
  $ git -C scPRINT-2 rev-parse --short HEAD
  a885b49
  $ git -C benGRN rev-parse --short HEAD
  06c16f0 
  $ git -C GRnnData rev-parse --short HEAD 
  05d1101
  ```

**Additional context**

The snippet from `scdataloader/preprocess.py` referenced above (lines 869-877 at commit c2db375):

	cluster_means = pd.DataFrame(
	    np.array(
	        [
	            adata.X[adata.obs[NEWOBS] == i].mean(axis=0)
	            for i in adata.obs[NEWOBS].unique()
	        ]
	    )[:, 0, :],
	    index=adata.obs[NEWOBS].unique(),
	)
