In this section we talk about data for your scientific Python package: when you would need it, and how you can access it and provide it to your users.
:class: note
Some material adapted from:
https://www.dampfkraft.com/code/distributing-large-files-with-pypi.html
https://learn.scientific-python.org/development/patterns/data-files/
First we describe when and why you might need data. Basically there are two cases: for examples, and for tests. We'll talk through both in the next couple of sections.
It's very common for scientific Python packages to need data that helps their users understand how the library is to be used. Often the package provides functionality to access this data, either by loading it from inside the source code, or by downloading it from a remote host. In fact, the latter approach is so common that libraries have been developed just to "fetch" data, like pooch. We will show you how to use both methods for providing access to data below, but first we present some examples.
- movingpandas: https://movingpandas.github.io/movingpandas-website/2-analysis-examples/bird-migration.html
- scikit-image: https://github.com/scikit-image/scikit-image/tree/main/skimage/data
- scikit-learn: https://github.com/scikit-learn/scikit-learn/tree/main/sklearn/datasets/data
It is common to design your code and tests in such a way that you can quickly test on fake data, ranging from something as simple as a NumPy array of zeros, to something much more complex like a test suite that "mocks" data for a specific domain. This lets you make sure the core logic of your code works, without needing real data. At the end of the day though, you do want to make sure your code works on real data, especially if it is scientific code that may work with very specific data formats. That's why you will often want at least a small amount of real-world test data. A good rule of thumb is to have a handful of small files, say no more than 10 files that are a maximum of 50 MB each. Anything more than that you will probably want to store on-line and download, for reasons we describe in the next section.
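As a rough illustration of testing on fake data, the sketch below checks a small function against an all-zeros input and a simple ramp. The function `rescale` and the test name are hypothetical, and a plain Python list stands in for a NumPy array of zeros so that the example has no dependencies.

```python
def rescale(values):
    """Min-max rescale values to the range [0, 1].

    Constant input (for example, all zeros) maps to all zeros
    instead of dividing by zero.
    """
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


def test_rescale_on_fake_data():
    # Fake data: a list of zeros exercises the constant-input edge case.
    zeros = [0.0] * 8
    assert rescale(zeros) == zeros

    # Fake data: a short ramp with a known, hand-checkable answer.
    ramp = [0, 1, 2, 3, 4]
    assert rescale(ramp) == [0.0, 0.25, 0.5, 0.75, 1.0]
```

Tests like this run in milliseconds and need no files on disk, which is why they pair well with a small number of slower tests on real data.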
Below we will introduce places you can store data on-line, and show you tools you can use to download that data. We suggest you prefer this approach when possible. The main reason is that there are limits on file and project sizes for forges, like GitHub and GitLab, and for package indexes--most importantly, PyPI. Especially with scientific datasets, which can be quite large, we want to be good citizens of the ecosystem and not place unnecessary demands on the common infrastructure.
Forges for hosting source code have maximum sizes for both files and projects. For example, on GitHub, a single file cannot be more than 100 MB. You would be surprised how quickly you can make a CSV file this big! You also want to avoid committing larger binary files (like images or audio) to a version control system like git, because it is hard to go back and remove them later, and because they make the project slower to clone. More importantly, they make it slower for potential contributors to clone your project!
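One way to catch oversized files before they are committed is to scan the working tree for them. The sketch below is a minimal, standard-library example. The function name `find_large_files` is hypothetical, and the default threshold is an assumption based on the 100 MB figure mentioned above.

```python
from pathlib import Path

# Assumption: treat "100 MB" as 100 * 1024 * 1024 bytes.
LIMIT_BYTES = 100 * 1024 * 1024


def find_large_files(root, limit_bytes=LIMIT_BYTES):
    """Return (relative path, size in bytes) for files larger than limit_bytes.

    Anything inside a ``.git`` directory is skipped. Results are sorted
    from largest to smallest.
    """
    root = Path(root)
    hits = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit_bytes:
                hits.append((relative, size))
    return sorted(hits, key=lambda item: item[1], reverse=True)
```

Calling `find_large_files(".")` from the root of a repository lists candidates to move to external storage. A lower `limit_bytes` can be passed to flag files well before they reach the hard limit.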
The Python Package Index (PyPI) places a limit on the size of individual files uploaded--where a "file" is either an sdist or a wheel--and also a limit on the total size of the project (the sum of all the "files"). These limits are not documented as far as we can tell, but most estimates are around 100 MB per file and 1 GB for the total project. Files this large place a real strain on the resources supporting PyPI, as discussed here. For this reason, as a good citizen in the Python ecosystem you should do everything you can to minimize your impact. Don't worry, we're here to help you do that! You can request increases for both file size and project size (see here and here) but we strongly suggest you read about other options here first.
Alright, we're strongly suggesting you don't try to cram your data into your code--where should you store it? Here we provide several options.
As stated above, there are cases where relatively small datasets can be included in a package. If this data consists of examples for usage, then you would likely put it inside your source code so that it will be included in the sdist and wheel. If the data is meant only for tests, and you have a separate test directory (as we suggest), then you can keep the data in that directory alongside the tests.
- strengths and weaknesses
  - Strengths
    - easy to access
    - can be very doable for smaller files, e.g. text files used in bioinformatics
  - Weaknesses
    - maximum file sizes on forges like GitHub and on PyPI
    - you want to avoid adding these files to your version control history (git) and draining the resources of PyPI
- examples:
  - pyOpenSci:
  - core scientific-python packages
- strengths and weaknesses
  - strengths: free, guaranteed lifetime of dataset, often appropriate for pyOpenSci packages
  - weaknesses: may be hard to automate for data that changes frequently
- examples
  - Zenodo
  - OSF
  - FigShare
  - Dryad (paid?)
- strengths and weaknesses
  - strengths: robust; tooling exists to more easily automate updating of packages, but this requires more technical know-how
  - weaknesses: not free
- examples
  - AWS
  - Google Cloud
  - Linode
:class: tip
Did you know that tools exist that let you track changes to datasets in the same way version control systems like git
let you track changes to code? Although you don't strictly need data versioning to include data with your package,
you probably want to be aware that such tools exist if you are reading this section.
Such tools could be particularly important if your package focuses mainly on providing access to datasets.
Within science, tools have been developed to provide distributed access to datasets. These tools
- general: DataLad https://www.datalad.org/
- related tools that are used for data engineering and industry (maybe breakout?): Git-LFS, DVC, Pachyderm
:class: tip
It's important to be aware of field-specific standards:
- astronomy
- neuroscience: DANDI, NWB
Many pyOpenSci tools exist to address these standards, or to provide interoperability in fields where such standards don't exist.
See also: FAIR data
Last but definitely not least, it's important to understand how you and your users can access data.
If you have included data files in your source code, then you can provide access to these through importlib.resources in the standard library (distributed on PyPI as the importlib-resources backport).
link to PyCon talk w/Barry Warsaw code snippet example here mention python-3.9 backport
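A minimal sketch of reading a packaged data file with `importlib.resources.files`, which is in the standard library from Python 3.9 (on older versions, the `importlib_resources` backport provides the same function). The package name `demo_pkg`, the file `data/example.csv`, and the helper functions are all hypothetical. The example builds a throwaway package in a temporary directory only so that it can run on its own.

```python
import sys
import tempfile
from importlib.resources import files  # Python 3.9+
from pathlib import Path


def make_demo_package(root: Path) -> None:
    """Create a tiny importable package that ships one data file."""
    pkg = root / "demo_pkg"
    (pkg / "data").mkdir(parents=True)
    (pkg / "__init__.py").write_text("")
    (pkg / "data" / "example.csv").write_text("x,y\n1,2\n3,4\n")


def load_example_text(package: str = "demo_pkg") -> str:
    """Read the packaged CSV without hard-coding a filesystem path."""
    resource = files(package) / "data" / "example.csv"
    return resource.read_text(encoding="utf-8")


# Build the demo package and make it importable for this example only.
tmp_root = Path(tempfile.mkdtemp())
make_demo_package(tmp_root)
sys.path.insert(0, str(tmp_root))

print(load_example_text())
```

In a real package only something like `load_example_text` is needed. The data file must also be included in the sdist and wheel by your build backend, otherwise it will be missing once the package is installed.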
- examples:
- pyOpenSci packages:
- core scientific Python packages:
pooch: https://github.com/fatiando/pooch code snippet example of using pooch
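Tools like pooch automate a simple pattern: download a file once into a local cache, verify it against a known checksum, and reuse the cached copy afterwards. In pooch this is roughly what `pooch.retrieve(url, known_hash=...)` does. The sketch below is not pooch itself but a standard-library illustration of the same idea, with hypothetical function names, and with a local `file://` URL standing in for a remote host so that it runs without network access.

```python
import hashlib
import tempfile
import urllib.request
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def fetch(url: str, known_hash: str, cache_dir: Path) -> Path:
    """Download url into cache_dir unless a verified copy is already there."""
    cache_dir.mkdir(parents=True, exist_ok=True)
    target = cache_dir / url.rsplit("/", 1)[-1]
    if target.exists() and sha256_of(target) == known_hash:
        return target  # cached copy is intact, no download needed
    with urllib.request.urlopen(url) as response:
        target.write_bytes(response.read())
    if sha256_of(target) != known_hash:
        target.unlink()
        raise ValueError(f"checksum mismatch for {url}")
    return target


# Demo: a local file plays the role of the remote dataset.
workdir = Path(tempfile.mkdtemp())
remote = workdir / "remote" / "example.csv"
remote.parent.mkdir()
remote.write_bytes(b"x,y\n1,2\n")
remote_url = remote.as_uri()
expected_hash = sha256_of(remote)

local = fetch(remote_url, expected_hash, workdir / "cache")
print(local.read_text())
```

A dedicated library adds much more on top of this, such as platform-appropriate cache locations, file registries, and post-download processing, so for real packages it is usually preferable to a hand-rolled helper.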
Many of the same tools apply. You can download test data as a set-up step in your CI. You can also write pytest fixtures that give your tests access to the test data.
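As a sketch of the fixture approach, the example below exposes a test-data directory to tests and skips them when the data has not been downloaded. The environment variable `MYPKG_TEST_DATA`, the default directory `tests/data`, and the file name `example.csv` are placeholders, not conventions from any particular project.

```python
import os
from pathlib import Path

import pytest

# Hypothetical names: adapt these to your own project layout.
DATA_DIR_ENV_VAR = "MYPKG_TEST_DATA"
DEFAULT_DATA_DIR = Path("tests") / "data"


def resolve_data_dir() -> Path:
    """Return the test-data directory; CI can override it via an environment variable."""
    return Path(os.environ.get(DATA_DIR_ENV_VAR, str(DEFAULT_DATA_DIR)))


@pytest.fixture(scope="session")
def data_dir() -> Path:
    """Provide the data directory, skipping dependent tests if it is missing."""
    path = resolve_data_dir()
    if not path.is_dir():
        pytest.skip(f"test data not found at {path}; run the download step first")
    return path


def test_example_file_is_present(data_dir):
    # 'example.csv' is a placeholder for one of your real test files.
    assert (data_dir / "example.csv").is_file()
```

With this layout, a CI job can download the data to any location and point the tests at it by setting the environment variable, while contributors who have no data locally see skipped tests rather than failures.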