scDataLoader pre-processing for scprint-2 pretraining #32

@Stfort52

Describe the bug
scDataLoader's pre-processing for scPRINT-2 pretraining fails with an indexing error.

To Reproduce
Steps to reproduce the behavior:

  1. Follow the dev install instructions in the scPRINT-2 repository README.
  2. Follow the official guideline for building a dataset from cellxgene.
  3. Processing stops with 'Series' object has no attribute 'nonzero'.

Expected behavior
Dataset processing should complete without errors.

Possible issue

  • I think the problem comes from anndata indexing conflicting with scipy >= 1.15.
    • scipy 1.15 removed the implicit relaxations that allowed indexing sparse arrays (or anndata's SparseDataset) with a boolean pandas Series.
    • scDataLoader itself declares no scipy version requirement in pyproject.toml.
      • scPRINT-2 requires scipy>=1.11.0 but sets no upper bound.
    • The problem likely went unnoticed in your CI pipeline because it uses uv to install scDataLoader.
      • The uv.lock file pins scipy to 1.12, which still has the relaxations.
  • This causes an AttributeError on line 872:
    cluster_means = pd.DataFrame(
        np.array(
            [
                adata.X[adata.obs[NEWOBS] == i].mean(axis=0)
                for i in adata.obs[NEWOBS].unique()
            ]
        )[:, 0, :],
        index=adata.obs[NEWOBS].unique(),
    )
    • Or more simply, like this gist.
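
To check whether a given environment falls in the range described above, a small diagnostic like the following may help. It is only a sketch: the helper name `scipy_at_least_1_15` is hypothetical, and the 1.15 threshold is taken from this report rather than verified independently.

```python
import re

import scipy


def scipy_at_least_1_15(version_string: str) -> bool:
    """Return True if the version is 1.15 or newer (threshold assumed from this report)."""
    match = re.match(r"(\d+)\.(\d+)", version_string)
    if match is None:
        raise ValueError(f"unrecognized version string: {version_string!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    return (major, minor) >= (1, 15)


# Report the installed scipy version and whether it is in the suspected range.
print(scipy.__version__, scipy_at_least_1_15(scipy.__version__))
```

Running this in both the uv-locked CI environment and a fresh install should show whether the two resolve scipy differently.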

Suggested fix

  • I think it's better not to index AnnData.X with a pandas Series at all.
    • Where a mask is needed, I suggest converting it with ().to_numpy() first.
  • Alternatively, you could add an upper bound on the scipy version (scipy < 1.15), but this is not future-proof.
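
As a rough illustration of the first suggestion, here is a minimal sketch on toy data. The matrix `X` and the Series `labels` are hypothetical stand-ins for `adata.X` and `adata.obs[NEWOBS]`; this is not the actual scDataLoader code, and a backed anndata SparseDataset may behave differently from the in-memory scipy matrix used here.

```python
import numpy as np
import pandas as pd
from scipy import sparse

# Toy stand-ins (assumptions): X plays the role of adata.X,
# labels plays the role of adata.obs[NEWOBS].
X = sparse.csr_matrix(
    np.array([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0], [5.0, 6.0]])
)
labels = pd.Series(["a", "b", "a", "b"], name="cluster")

groups = labels.unique()
rows = []
for g in groups:
    # Convert the boolean Series to a plain NumPy mask before indexing,
    # so the sparse matrix never sees a pandas object.
    mask = (labels == g).to_numpy()
    # mean(axis=0) may return a 1 x n matrix or a 1-D array depending on
    # the sparse container; flatten either to a 1-D vector.
    rows.append(np.asarray(X[mask].mean(axis=0)).ravel())

cluster_means = pd.DataFrame(np.vstack(rows), index=groups)
print(cluster_means)
```

Flattening each per-cluster mean also removes the need for the `[:, 0, :]` slice used in the original snippet.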

Desktop (please complete the following information):

  • OS:

    $ cat /etc/os-release
    NAME="Rocky Linux"
    VERSION="9.6 (Blue Onyx)"
    ID="rocky"
    ID_LIKE="rhel centos fedora"
    VERSION_ID="9.6"
    PLATFORM_ID="platform:el9"
    PRETTY_NAME="Rocky Linux 9.6 (Blue Onyx)"
    ANSI_COLOR="0;32"
    LOGO="fedora-logo-icon"
    CPE_NAME="cpe:/o:rocky:rocky:9::baseos"
    HOME_URL="https://rockylinux.org/"
    VENDOR_NAME="RESF"
    VENDOR_URL="https://resf.org/"
    BUG_REPORT_URL="https://bugs.rockylinux.org/"
    SUPPORT_END="2032-05-31"
    ROCKY_SUPPORT_PRODUCT="Rocky-Linux-9"
    ROCKY_SUPPORT_PRODUCT_VERSION="9.6"
    REDHAT_SUPPORT_PRODUCT="Rocky Linux"
    REDHAT_SUPPORT_PRODUCT_VERSION="9.6"
    
  • Version:

    $ git -C scDataLoader rev-parse --short HEAD
    c2db375
    $ git -C scPRINT-2 rev-parse --short HEAD
    a885b49
    $ git -C benGRN rev-parse --short HEAD
    06c16f0 
    $ git -C GRnnData rev-parse --short HEAD 
    05d1101
    
