Commit 4316e8b

committed
finishing the doc
1 parent 0e0abdf commit 4316e8b

16 files changed

Lines changed: 1074 additions & 7872 deletions

README.md

Lines changed: 114 additions & 22 deletions
@@ -10,9 +10,10 @@
1010
[![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
1111
[![DOI](https://zenodo.org/badge/731248665.svg)](https://doi.org/10.5281/zenodo.10573143)
1212

13-
<img src="scdataloader.png" width="600">
13+
<img src="./docs/scdataloader.png" width="600">
1414

15-
This single cell pytorch dataloader / lighting datamodule is designed to be used with:
15+
This single cell pytorch dataloader / lightning datamodule is designed to be used
16+
with:
1617

1718
- [lamindb](https://lamin.ai/)
1819

@@ -24,11 +25,13 @@ and:
2425
It allows you to:
2526

2627
1. load thousands of datasets containing millions of cells in a few seconds.
27-
2. preprocess the data per dataset and download it locally (normalization, filtering, etc.)
28+
2. preprocess the data per dataset and download it locally (normalization,
29+
filtering, etc.)
2830
3. create a more complex single cell dataset
2931
4. extend it to your need
3032

31-
built on top of `lamindb` and the `.mapped()` function by Sergei: https://github.com/Koncopd
33+
built on top of `lamindb` and the `.mapped()` function by Sergei:
34+
https://github.com/Koncopd
3235

3336
```
3437
Portions of the mapped.py file are derived from Lamin Labs
@@ -39,11 +42,17 @@ Please see https://github.com/laminlabs/lamindb/blob/main/lamindb/core/_mapped_c
3942
for the original implementation
4043
```
4144

42-
The package has been designed together with the [scPRINT paper](https://doi.org/10.1101/2024.07.29.605556) and [model](https://github.com/cantinilab/scPRINT).
45+
The package has been designed together with the
46+
[scPRINT paper](https://doi.org/10.1101/2024.07.29.605556) and
47+
[model](https://github.com/cantinilab/scPRINT).
4348

4449
## More
4550

46-
I needed to create this Data Loader for my PhD project. I am using it to load & preprocess thousands of datasets containing millions of cells in a few seconds. I believed that individuals employing AI for single-cell RNA sequencing and other sequencing datasets would eagerly utilize and desire such a tool, which presently does not exist.
51+
I needed to create this Data Loader for my PhD project. I am using it to load &
52+
preprocess thousands of datasets containing millions of cells in a few seconds.
53+
I believed that individuals employing AI for single-cell RNA sequencing and
54+
other sequencing datasets would eagerly utilize and desire such a tool, which
55+
presently does not exist.
4756

4857
![scdataloader.drawio.png](docs/scdataloader.drawio.png)
4958

@@ -57,12 +66,14 @@ pip install scDataLoader[dev] # for dev dependencies
5766
lamin init --storage ./testdb --name test --schema bionty
5867
```
5968

60-
if you start with lamin and had to do a `lamin init`, you will also need to populate your ontologies. This is because scPRINT is using ontologies to define its cell types, diseases, sexes, ethnicities, etc.
69+
if you start with lamin and had to do a `lamin init`, you will also need to
70+
populate your ontologies. This is because scPRINT is using ontologies to define
71+
its cell types, diseases, sexes, ethnicities, etc.
6172

6273
you can do it manually or with our function:
6374

6475
```python
65-
from scdataloader.utils import populate_my_ontology
76+
from scdataloader.utils import populate_my_ontology, _adding_scbasecamp_genes
6677

6778
populate_my_ontology() #to populate everything (recommended) (can take 2-10mns)
6879

@@ -76,11 +87,14 @@ organisms: List[str] = ["NCBITaxon:10090", "NCBITaxon:9606"],
7687
diseases = None,
7788
dev_stages = None,
7889
)
90+
# if you want to load the gene names and species for the arc scbasecount species, also add this:
91+
_adding_scbasecamp_genes()
7992
```
8093

8194
### Dev install
8295

83-
If you want to use the latest version of scDataLoader and work on the code yourself use `git clone` and `pip -e` instead of `pip install`.
96+
If you want to use the latest version of scDataLoader and work on the code
97+
yourself, use `git clone` and `pip install -e` instead of `pip install`.
8498

8599
```bash
86100
git clone https://github.com/jkobject/scDataLoader.git
@@ -119,6 +133,12 @@ datamodule = DataModule(
119133
)
120134
```
121135

136+
see the notebooks in [docs](https://www.jkobject.com/scDataLoader/) to learn
137+
more
138+
139+
1. [load a dataset](https://www.jkobject.com/scDataLoader/notebooks/1_download_and_preprocess/)
140+
2. [create a dataset](https://www.jkobject.com/scDataLoader/notebooks/2_create_dataloader/)
141+
122142
### lightning-free usage (Dataset+Collator+DataLoader)
123143

124144
```python
@@ -169,7 +189,17 @@ for batch in tqdm(dataloader):
169189
)
170190
```
171191

172-
### Usage on all of cellxgene
192+
## Gathering a pre-training database
193+
194+
Here I will explain how to gather and preprocess all of cellxgene (scPRINT-1
195+
pretraining database) with scDataLoader, and the scPRINT-2 corpus (scPRINT-2
196+
pretraining database).
197+
198+
### Getting all of cellxgene
199+
200+
Here is an example of how to download and preprocess all of cellxgene with
201+
scDataLoader as a script (a notebook version is also available in
202+
[./notebooks/update_lamin_or_cellxgene.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/update_lamin_or_cellxgene.ipynb)).
173203

174204
```python
175205
# initialize a local lamin database
@@ -184,11 +214,25 @@ DESCRIPTION='preprocessed by scDataLoader'
184214
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
185215
cx_dataset, len(cx_dataset.artifacts.all())
186216

217+
# (OPTIONAL) if you want to do your preprocessing on a slurm cluster without internet connections,
218+
# you can first do this:
219+
load_dataset_local(
220+
cx_dataset,
221+
download_folder="/my_download_folder",
222+
name="cached-cellxgene-census",
223+
description="all of it to preprocess",
224+
)
187225

226+
# preprocessing
188227
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
189228

190229
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
191230

231+
```
232+
233+
After this you can use the preprocessed dataset with the DataModule below.
234+
235+
```python
192236
# create dataloaders
193237
from scdataloader import DataModule
194238
import tqdm
@@ -210,27 +254,52 @@ for i in tqdm.tqdm(datamodule.train_dataloader()):
210254

211255
# with lightning:
212256
# Trainer(model, datamodule)
257+
```
258+
259+
You can use the command line to preprocess a large database of datasets like
260+
here for cellxgene. This allows parallelizing and easier usage.
213261

262+
```bash
263+
scdataloader --instance "laminlabs/cellxgene" --name "cellxgene-census" --version "2023-12-15" --description "preprocessed for scprint" --new_name "scprint main" --start_at 10 >> scdataloader.out
214264
```
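
If you prefer to launch the command from Python (for example, to submit several runs from a script), the same call can be assembled as an argument list. This is a minimal sketch: it reuses only the flags shown in the command above, and `build_preprocess_command` is a hypothetical helper, not part of scDataLoader.

```python
import shlex
from typing import List


def build_preprocess_command(
    instance: str,
    name: str,
    version: str,
    description: str,
    new_name: str,
    start_at: int,
) -> List[str]:
    """Assemble the scdataloader CLI call as an argv list (hypothetical helper)."""
    # Only the flags that appear in the README command are used here.
    return [
        "scdataloader",
        "--instance", instance,
        "--name", name,
        "--version", version,
        "--description", description,
        "--new_name", new_name,
        "--start_at", str(start_at),
    ]


cmd = build_preprocess_command(
    instance="laminlabs/cellxgene",
    name="cellxgene-census",
    version="2023-12-15",
    description="preprocessed for scprint",
    new_name="scprint main",
    start_at=10,
)

# Print a shell-safe version of the command for inspection.
print(shlex.join(cmd))

# To actually run it (requires scdataloader to be installed):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list rather than a single string avoids quoting problems with values that contain spaces, such as the description and the new name.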
215265

216-
see the notebooks in [docs](https://www.jkobject.com/scDataLoader/):
266+
### Getting the rest of the scPRINT-2 corpus
217267

218-
1. [load a dataset](https://www.jkobject.com/scDataLoader/notebooks/1_download_and_preprocess/)
219-
2. [create a dataset](https://www.jkobject.com/scDataLoader/notebooks/2_create_dataloader/)
268+
By now, using the command / scripts above you should be able to get all of
269+
cellxgene (and preprocess it). laminlabs now also hosts the rest of the
270+
scPRINT-2 corpus in `laminlabs/arc-virtual-cell-atlas` and they can be
271+
downloaded and preprocessed the same way as cellxgene above. Be aware, however,
272+
that these datasets come without metadata.
220273

221-
### command line preprocessing
274+
You can have a look at my notebooks:
275+
[./notebooks/adding_tahoe.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/adding_tahoe.ipynb)
276+
and
277+
[./notebooks/adding_scbasecount.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/adding_scbasecount.ipynb)
278+
where I create some remapping to retrieve metadata that can be used by
279+
scdataloader and lamindb from these datasets.
222280

223-
You can use the command line to preprocess a large database of datasets like here for cellxgene. this allows parallelizing and easier usage.
281+
If you do not have access for some reason to these datasets, please contact
282+
laminlabs. Another solution is to download them from the original sources
283+
and add them one by one to your instance, then do the same preprocessing, but
284+
this time use `your_account/your_instance` instead of
285+
`laminlabs/arc-virtual-cell-atlas`.
224286

225-
```bash
226-
scdataloader --instance "laminlabs/cellxgene" --name "cellxgene-census" --version "2023-12-15" --description "preprocessed for scprint" --new_name "scprint main" --start_at 10 >> scdataloader.out
227-
```
287+
This is actually what I did in my own instance to create the full scPRINT-2
288+
corpus and you can see some of it in the notebooks above.
289+
290+
### Getting even more
291+
292+
They also host a perturbation atlas in `laminlabs/pertdata` that can be
293+
downloaded the same way.
228294

229-
### command line usage
295+
### command line usage to train a model
230296

231297
The main way to use
232298

233-
> please refer to the [scPRINT documentation](https://www.jkobject.com/scPRINT/) and [lightning documentation](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli_intermediate.html) for more information on command line usage
299+
> please refer to the [scPRINT documentation](https://www.jkobject.com/scPRINT/)
300+
> and
301+
> [lightning documentation](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli_intermediate.html)
302+
> for more information on command line usage
234303
235304
## FAQ
236305

@@ -253,13 +322,36 @@ from scdataloader import utils
253322
utils.populate_ontologies() # this might take from 5-20mins
254323
```
255324

325+
### how to move my lamin instance to another folder?
326+
327+
you cannot just move your folder from one place to another because lamin is
328+
using absolute paths. You need to do 3 things:
329+
330+
1. move your folder to the new place
331+
2. update your lamin config file (usually in `~/.lamin/my_env.yml`) to point to
332+
the new place
333+
3. update the absolute paths in your lamin database. You can do it like this:
334+
335+
```python
336+
from pathlib import Path
import lamindb as ln
337+
ln.Storage.df()
338+
# view what is your current storage id (in my case it was GZgLW1TQ)
339+
ln.Storage.filter(uid="GZgLW1TQ").update(
340+
root=Path("your_new_locations").as_posix().rstrip("/")
341+
)
342+
```
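
Before writing the new root into the database, it can help to check the string you are about to store. The sketch below only covers the path-normalization part of step 3 and does not touch lamindb; `normalized_root` is a hypothetical helper, and the absolute-path check is an added assumption based on the note above that lamin uses absolute paths.

```python
from pathlib import Path


def normalized_root(new_location: str) -> str:
    """Return an absolute POSIX-style path without a trailing slash."""
    path = Path(new_location).expanduser()
    # Assumption: the stored root must be absolute, so relative paths are rejected.
    if not path.is_absolute():
        raise ValueError(f"expected an absolute path, got: {new_location!r}")
    # Same normalization as in the snippet above.
    return path.as_posix().rstrip("/")


print(normalized_root("/new/place/testdb/"))
```

The returned string is what would be passed as `root=` in the update call.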
343+
256344
## Development
257345

258-
Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
346+
Read the
347+
[CONTRIBUTING.md](https://github.com/jkobject/scdataloader/blob/main/CONTRIBUTING.md)
348+
file.
259349

260350
## License
261351

262-
This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
352+
This project is licensed under the MIT License - see the
353+
[LICENSE](https://github.com/jkobject/scdataloader/blob/main/LICENSE) file for
354+
details.
263355

264356
## Acknowledgments
265357
