jkobject
diff --git a/README.md b/README.md
Lines changed: 114 additions & 22 deletions
@@ -10,9 +10,10 @@
 [![Code style: black](https://img.shields.io/badge/code%20style-black-000000.svg)](https://github.com/psf/black)
 [![DOI](https://zenodo.org/badge/731248665.svg)](https://doi.org/10.5281/zenodo.10573143)
 
-<img src="scdataloader.png" width="600">
+<img src="./docs/scdataloader.png" width="600">
 
-This single cell pytorch dataloader / lighting datamodule is designed to be used with:
+This single cell pytorch dataloader / lightning datamodule is designed to be used
+with:
 
 - [lamindb](https://lamin.ai/)
 
@@ -24,11 +25,13 @@ and:
 It allows you to:
 
 1. load thousands of datasets containing millions of cells in a few seconds.
-2. preprocess the data per dataset and download it locally (normalization, filtering, etc.)
+2. preprocess the data per dataset and download it locally (normalization,
+   filtering, etc.)
 3. create a more complex single cell dataset
 4. extend it to your need
 
-built on top of `lamindb` and the `.mapped()` function by Sergei: https://github.com/Koncopd
+built on top of `lamindb` and the `.mapped()` function by Sergei:
+https://github.com/Koncopd
 
 ```
 Portions of the mapped.py file are derived from Lamin Labs
@@ -39,11 +42,17 @@ Please see https://github.com/laminlabs/lamindb/blob/main/lamindb/core/_mapped_c
 for the original implementation
 ```
 
-The package has been designed together with the [scPRINT paper](https://doi.org/10.1101/2024.07.29.605556) and [model](https://github.com/cantinilab/scPRINT).
+The package has been designed together with the
+[scPRINT paper](https://doi.org/10.1101/2024.07.29.605556) and
+[model](https://github.com/cantinilab/scPRINT).
 
 ## More
 
-I needed to create this Data Loader for my PhD project. I am using it to load & preprocess thousands of datasets containing millions of cells in a few seconds. I believed that individuals employing AI for single-cell RNA sequencing and other sequencing datasets would eagerly utilize and desire such a tool, which presently does not exist.
+I needed to create this Data Loader for my PhD project. I am using it to load &
+preprocess thousands of datasets containing millions of cells in a few seconds.
+I believed that individuals employing AI for single-cell RNA sequencing and
+other sequencing datasets would eagerly utilize and desire such a tool, which
+presently does not exist.
 
 ![scdataloader.drawio.png](docs/scdataloader.drawio.png)
 
@@ -57,12 +66,14 @@ pip install scDataLoader[dev] # for dev dependencies
 lamin init --storage ./testdb --name test --schema bionty
 ```
 
-if you start with lamin and had to do a `lamin init`, you will also need to populate your ontologies. This is because scPRINT is using ontologies to define its cell types, diseases, sexes, ethnicities, etc.
+If you are new to lamin and had to run `lamin init`, you will also need to
+populate your ontologies. This is because scPRINT uses ontologies to define
+its cell types, diseases, sexes, ethnicities, etc.
 
 you can do it manually or with our function:
 
 ```python
-from scdataloader.utils import populate_my_ontology
+from scdataloader.utils import populate_my_ontology, _adding_scbasecamp_genes
 
 populate_my_ontology() #to populate everything (recommended) (can take 2-10mns)
 
@@ -76,11 +87,14 @@ organisms: List[str] = ["NCBITaxon:10090", "NCBITaxon:9606"],
     diseases = None,
     dev_stages = None,
 )
+# if you want to load the gene names and species for the arc scbasecount species, also add this:
+_adding_scbasecamp_genes()
 ```
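
The organisms passed to `populate_my_ontology` are ontology term IDs of the form `PREFIX:LOCAL_ID` (`NCBITaxon:9606` is human, `NCBITaxon:10090` is mouse). Since populating can take several minutes, a quick format check beforehand can catch typos. The sketch below is only an illustration: `looks_like_ontology_id` and `check_ontology_ids` are hypothetical helpers, not part of scDataLoader, and the regular expression is an assumed approximation of the ID format.

```python
import re
from typing import Iterable, List

# Assumed shape of an ontology term ID: a prefix, a colon, a local identifier.
_ONTOLOGY_ID = re.compile(r"[A-Za-z][A-Za-z0-9_]*:[A-Za-z0-9_]+")


def looks_like_ontology_id(term: str) -> bool:
    """Return True if `term` has the PREFIX:LOCAL_ID shape."""
    return _ONTOLOGY_ID.fullmatch(term) is not None


def check_ontology_ids(terms: Iterable[str]) -> List[str]:
    """Return the terms that do NOT look like ontology IDs."""
    return [t for t in terms if not looks_like_ontology_id(t)]


# Hypothetical usage: check the IDs before the slow populate step.
organisms = ["NCBITaxon:10090", "NCBITaxon:9606"]
malformed = check_ontology_ids(organisms)
print("malformed IDs:", malformed)
```

This only checks the shape of each ID; it does not verify that the term exists in the corresponding ontology.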
 
 ### Dev install
 
-If you want to use the latest version of scDataLoader and work on the code yourself use `git clone` and `pip -e` instead of `pip install`.
+If you want to use the latest version of scDataLoader and work on the code
+yourself, use `git clone` and `pip install -e` instead of `pip install`.
 
 ```bash
 git clone https://github.com/jkobject/scDataLoader.git
@@ -119,6 +133,12 @@ datamodule = DataModule(
 )
 ```
 
+See the notebooks in [docs](https://www.jkobject.com/scDataLoader/) to learn
+more:
+
+1. [load a dataset](https://www.jkobject.com/scDataLoader/notebooks/1_download_and_preprocess/)
+2. [create a dataset](https://www.jkobject.com/scDataLoader/notebooks/2_create_dataloader/)
+
 ### lightning-free usage (Dataset+Collator+DataLoader)
 
 ```python
@@ -169,7 +189,17 @@ for batch in tqdm(dataloader):
     )
 ```
 
-### Usage on all of cellxgene
+## Gathering a pre-training database
+
+Here I will explain how to gather and preprocess all of cellxgene (scPRINT-1
+pretraining database) with scDataLoader, and the scPRINT-2 corpus (scPRINT-2
+pretraining database).
+
+### Getting all of cellxgene
+
+Here is an example of how to download and preprocess all of cellxgene with
+scDataLoader as a script (a notebook version is also available in
+[./notebooks/update_lamin_or_cellxgene.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/update_lamin_or_cellxgene.ipynb)).
 
 ```python
 # initialize a local lamin database
@@ -184,11 +214,25 @@ DESCRIPTION='preprocessed by scDataLoader'
 cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
 cx_dataset, len(cx_dataset.artifacts.all())
 
+# (OPTIONAL) if you want to do your preprocessing on a slurm cluster without an internet connection,
+# you can first do this:
+load_dataset_local(
+    cx_dataset,
+    download_folder="/my_download_folder",
+    name="cached-cellxgene-census",
+    description="all of it to preprocess",
+)
 
+# preprocessing
 do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
 
 preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
 
+```
+
+After this you can use the preprocessed dataset with the DataModule below.
+
+```python
 # create dataloaders
 from scdataloader import DataModule
 import tqdm
@@ -210,27 +254,52 @@ for i in tqdm.tqdm(datamodule.train_dataloader()):
 
 # with lightning:
 # Trainer(model, datamodule)
+```
+
+You can use the command line to preprocess a large database of datasets, as
+shown here for cellxgene. This allows parallelizing and easier usage.
 
+```bash
+scdataloader --instance "laminlabs/cellxgene" --name "cellxgene-census" --version "2023-12-15" --description "preprocessed for scprint" --new_name "scprint main" --start_at 10 >> scdataloader.out
 ```
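
The command above can also be assembled programmatically, for example to launch several runs with different `--start_at` values. This is a minimal sketch using only the flags that appear in the command above. `build_scdataloader_cmd` is a hypothetical helper, not part of scDataLoader, and whether other flags exist or which ones are required is not confirmed here.

```python
import shlex
from typing import List, Optional


def build_scdataloader_cmd(
    instance: str,
    name: str,
    version: str,
    description: str,
    new_name: str,
    start_at: Optional[int] = None,
) -> List[str]:
    """Build the argument list for the `scdataloader` command shown above."""
    cmd = [
        "scdataloader",
        "--instance", instance,
        "--name", name,
        "--version", version,
        "--description", description,
        "--new_name", new_name,
    ]
    # Only add --start_at when a value is given.
    if start_at is not None:
        cmd += ["--start_at", str(start_at)]
    return cmd


cmd = build_scdataloader_cmd(
    instance="laminlabs/cellxgene",
    name="cellxgene-census",
    version="2023-12-15",
    description="preprocessed for scprint",
    new_name="scprint main",
    start_at=10,
)
# Print a shell-safe version of the command (arguments with spaces are quoted).
print(shlex.join(cmd))
```

The list can be passed to `subprocess.run(cmd, check=True)` to execute it; passing a list avoids shell quoting issues with arguments that contain spaces.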
 
-see the notebooks in [docs](https://www.jkobject.com/scDataLoader/):
+### Getting the rest of the scPRINT-2 corpus
 
-1. [load a dataset](https://www.jkobject.com/scDataLoader/notebooks/1_download_and_preprocess/)
-2. [create a dataset](https://www.jkobject.com/scDataLoader/notebooks/2_create_dataloader/)
+By now, using the command / scripts above, you should be able to get all of
+cellxgene (and preprocess it). laminlabs now also hosts the rest of the
+scPRINT-2 corpus in `laminlabs/arc-virtual-cell-atlas`, and it can be
+downloaded and preprocessed the same way as cellxgene above. Be careful,
+however: there is no metadata for these datasets.
 
-### command line preprocessing
+You can have a look at my notebooks:
+[./notebooks/adding_tahoe.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/adding_tahoe.ipynb)
+and
+[./notebooks/adding_scbasecount.ipynb](https://github.com/jkobject/scdataloader/blob/main/notebooks/adding_scbasecount.ipynb)
+where I create some remapping to retrieve metadata that can be used by
+scdataloader and lamindb from these datasets.
 
-You can use the command line to preprocess a large database of datasets like here for cellxgene. this allows parallelizing and easier usage.
+If for some reason you do not have access to these datasets, please contact
+laminlabs. Another solution is to download them from the original sources,
+add them one by one to your instance, and then do the same preprocessing,
+this time using `your_account/your_instance` instead of
+`laminlabs/arc-virtual-cell-atlas`.
 
-```bash
-scdataloader --instance "laminlabs/cellxgene" --name "cellxgene-census" --version "2023-12-15" --description "preprocessed for scprint" --new_name "scprint main" --start_at 10 >> scdataloader.out
-```
+This is actually what I did in my own instance to create the full scPRINT-2
+corpus and you can see some of it in the notebooks above.
+
+### Getting even more
+
+laminlabs also hosts a perturbation atlas in `laminlabs/pertdata` that can be
+downloaded the same way.
 
-### command line usage
+### command line usage to train a model
 
 The main way to use
 
-> please refer to the [scPRINT documentation](https://www.jkobject.com/scPRINT/) and [lightning documentation](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli_intermediate.html) for more information on command line usage
+> please refer to the [scPRINT documentation](https://www.jkobject.com/scPRINT/)
+> and
+> [lightning documentation](https://lightning.ai/docs/pytorch/stable/cli/lightning_cli_intermediate.html)
+> for more information on command line usage
 
 ## FAQ
 
@@ -253,13 +322,36 @@ from scdataloader import utils
 utils.populate_ontologies() # this might take from 5-20mins
 ```
 
+### how to move my lamin instance to another folder?
+
+You cannot just move your folder from one place to another, because lamin
+uses absolute paths. You need to do 3 things:
+
+1. move your folder to the new place
+2. update your lamin config file (usually in `~/.lamin/my_env.yml`) to point to
+   the new place
+3. update the absolute paths in your lamin database. You can do it like this:
+
+```python
+from pathlib import Path
+
+import lamindb as ln
+
+# list your storage locations to find the uid of the one you moved
+ln.Storage.df()
+# replace the uid below with your own (GZgLW1TI is only an example)
+ln.Storage.filter(uid="GZgLW1TI").update(
+    root=Path("your_new_locations").as_posix().rstrip("/")
+)
+```
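
To see why step 3 is needed, the sketch below shows how the root string is normalized and how an absolute path under the old root maps to the new one. It uses only the standard library and does not touch the lamin database. `normalize_root` and `relocate` are hypothetical helpers written for illustration, and the example paths are made up.

```python
from pathlib import PurePosixPath


def normalize_root(path: str) -> str:
    """Return the path in POSIX form without a trailing slash."""
    return PurePosixPath(path).as_posix().rstrip("/")


def relocate(path: str, old_root: str, new_root: str) -> str:
    """Map an absolute path under `old_root` to the same file under `new_root`.

    Raises ValueError if `path` is not located under `old_root`.
    """
    relative = PurePosixPath(path).relative_to(PurePosixPath(old_root))
    return (PurePosixPath(new_root) / relative).as_posix()


# Made-up example paths, for illustration only.
old_root = "/old/place/testdb"
new_root = normalize_root("/new/place/testdb/")
moved = relocate("/old/place/testdb/.lamindb/file.h5ad", old_root, new_root)
print(new_root)
print(moved)
```

Any absolute path that still points to the old root stops resolving once the folder is moved, which is why the stored root has to be updated as well.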
+
 ## Development
 
-Read the [CONTRIBUTING.md](CONTRIBUTING.md) file.
+Read the
+[CONTRIBUTING.md](https://github.com/jkobject/scdataloader/blob/main/CONTRIBUTING.md)
+file.
 
 ## License
 
-This project is licensed under the MIT License - see the [LICENSE](LICENSE) file for details.
+This project is licensed under the MIT License - see the
+[LICENSE](https://github.com/jkobject/scdataloader/blob/main/LICENSE) file for
+details.
 
 ## Acknowledgments