Commit dd1c8b2

Merge pull request #23 from quantifyearth/mwd-snakemake
Move to snakemake
2 parents 5187a34 + c61bc31

26 files changed

Lines changed: 1194 additions & 350 deletions

.github/workflows/python-package.yml

Lines changed: 6 additions & 6 deletions

```diff
@@ -12,7 +12,7 @@ on:
 jobs:
   build:
     runs-on: ubuntu-latest
-    container: ghcr.io/osgeo/gdal:ubuntu-small-3.11.4
+    container: ghcr.io/osgeo/gdal:ubuntu-small-3.12.1
     strategy:
       fail-fast: false
       matrix:
@@ -22,7 +22,7 @@ jobs:
       - name: Install system
         run: |
           apt-get update -qqy
-          apt-get install -y git python3-pip libpq5 libpq-dev r-base libtirpc-dev shellcheck
+          apt-get install -y git python3-pip libpq5 libpq-dev r-base libtirpc-dev
       - uses: actions/checkout@v4
         with:
           submodules: 'true'
@@ -35,8 +35,9 @@ jobs:
      - name: Install dependencies
        run: |
          python -m pip install --upgrade pip
-         python -m pip install gdal[numpy]==3.11.4
+         python -m pip install gdal[numpy]==3.12.1
          python -m pip install -r requirements.txt
+         python -m pip install snakefmt

      - name: Lint with pylint
        run: python3 -m pylint utils prepare_layers prepare_species threats
@@ -47,6 +48,5 @@ jobs:
      - name: Tests
        run: python3 -m pytest ./tests

-     - name: Script checks
-       run: |
-         shellcheck ./scripts/run.sh
+     - name: Snakemake format check
+       run: snakefmt --check workflow/
```

.gitignore

Lines changed: 2 additions & 0 deletions

```diff
@@ -3,6 +3,8 @@ __pycache__/
 *.py[cod]
 *$py.class

+.snakemake/
+
 # C extensions
 *.so

```

Dockerfile

Lines changed: 22 additions & 14 deletions

```diff
@@ -1,45 +1,40 @@
+# Build stage for reclaimer (used to download from Zenodo)
 FROM golang:latest AS reclaimerbuild
 RUN git clone https://github.com/quantifyearth/reclaimer.git
 WORKDIR /go/reclaimer
 RUN go mod tidy
 RUN go build

-FROM golang:latest AS littlejohnbuild
-RUN git clone https://github.com/quantifyearth/littlejohn.git
-WORKDIR /go/littlejohn
-RUN go mod tidy
-RUN go build
-
-FROM ghcr.io/osgeo/gdal:ubuntu-small-3.11.4
+FROM ghcr.io/osgeo/gdal:ubuntu-small-3.12.1

 RUN apt-get update -qqy && \
     apt-get install -qy \
         git \
         cmake \
         python3-pip \
-        shellcheck \
         r-base \
         libpq-dev \
         libtirpc-dev \
     && rm -rf /var/lib/apt/lists/* \
     && rm -rf /var/cache/apt/*

 COPY --from=reclaimerbuild /go/reclaimer/reclaimer /bin/reclaimer
-COPY --from=littlejohnbuild /go/littlejohn/littlejohn /bin/littlejohn

 RUN rm /usr/lib/python3.*/EXTERNALLY-MANAGED
-RUN pip install gdal[numpy]==3.11.4
+RUN pip install gdal[numpy]==3.12.1

 COPY requirements.txt /tmp/
 RUN pip install -r /tmp/requirements.txt

+# Snakemake linting/formatting tools
+RUN pip install snakefmt
+
 RUN mkdir /root/R
 ENV R_LIBS_USER=/root/R
 RUN Rscript -e 'install.packages(c("lme4","lmerTest","emmeans"), repos="https://cloud.r-project.org")'

 COPY ./ /root/star
 WORKDIR /root/star
-RUN chmod 755 ./scripts/run.sh

 # We create a DATADIR - this should be mapped at container creation
 # time to a volume somewhere else
@@ -53,6 +48,19 @@ ENV VIRTUAL_ENV=/usr
 ENV PYTHONPATH=/root/star

 RUN python3 -m pytest ./tests
-RUN python3 -m pylint prepare_layers prepare_species utils tests
-RUN python3 -m mypy prepare_layers prepare_species utils tests
-RUN shellcheck ./scripts/run.sh
+RUN python3 -m pylint prepare_layers prepare_species threats utils tests
+RUN python3 -m mypy prepare_layers prepare_species threats utils tests
+
+# Snakemake validation
+RUN snakefmt --check workflow/
+# RUN snakemake --snakefile workflow/Snakefile --lint
+
+# Copy and set up entrypoint script
+COPY docker-entrypoint.sh /usr/local/bin/
+RUN chmod +x /usr/local/bin/docker-entrypoint.sh
+
+# Default command runs the full Snakemake pipeline
+# Use --cores to specify parallelism, e.g.: docker run ... --cores 8
+# Logs are written to $DATADIR/logs/ and .snakemake/ metadata is stored in $DATADIR/
+ENTRYPOINT ["/usr/local/bin/docker-entrypoint.sh"]
+CMD ["--cores", "4", "all"]
```

README.md

Lines changed: 35 additions & 5 deletions

````diff
@@ -2,8 +2,6 @@

 An implementation of the threat based [STAR biodiversity metric by Muir et al](https://www.nature.com/articles/s41559-021-01432-0) (also known as STAR(t)).

-See [method.md](method.md) for a description of the methodology, or `scripts/run.sh` for how to execute the pipeline.
-
 ## Checking out the code

 The code is available on github, and can be checked out from there:
@@ -23,7 +21,6 @@ There are some additional inputs required to run the pipeline, which should be p

 The script also assumes you have a Postgres database with the IUCN Redlist database in it.

-
 ## Species data acquisition

 There are two scripts for getting the species data from the Redlist. For those in the IUCN with access to the database version of the redlist, use `extract_species_data_psql.py`.
@@ -34,6 +31,20 @@ For those outside the IUCN, there is a script called `extract_species_data_redli

 There are two ways to run the pipeline. The easiest way is to use Docker if you have it available to you, as it will manage all the dependencies for you. But you can check out and run it locally if you want to also, but it requires a little more effort.

+Either way, the pipeline itself is run using [Snakemake](https://snakemake.readthedocs.io/en/stable/), a tool designed to run data-science pipelines made up of many different scripts and sources of information. Snakemake tracks dependencies, making it easier to re-run the pipeline: only the parts that depend on what changed will be rerun. However, in STAR the initial processing of raster layers is very slow, so we've configured Snakemake to never regenerate those unless the generated rasters have been deleted manually.
+
+Because sometimes you do not need to run the whole pipeline for a specific job, the Snakemake workflow has multiple targets you can invoke:
+
+* prepaer: Generate the necessary input rasters for the STAR pipeline.
+* species_data: Extract species data into GeoJSON files from the Redlist database.
+* aohs: Just generate the species AOHs and summary CSV.
+* validation: Run model validation.
+* occurrence_validation: Run occurrence validation - this can be VERY SLOW as it fetches occurrence data from GBIF.
+* threats: Generate the STAR(t) raster layers.
+* all: Do everything except occurrence validation.
+
+There is a configuration file in `config/config.yaml` that is used to set experimental parameters, such as which taxa to run the pipeline for.
+
 ### Running with Docker

 There is included a docker file, which is based on the GDAL container image, which is set up to install everything ready to use. You can build that using:
@@ -42,15 +53,21 @@ There is included a docker file, which is based on the GDAL container image, whi
 $ docker buildx build -t star .
 ```

+Note that depending on how many CPU cores you provide, you will probably need to give Docker more memory than the out-of-the-box setting (which is a few GB). We recommend giving it as much as you can.
+
 You can then invoke the run script using this. You should map an external folder into the container as a place to store the intermediary data and final results, and you should provide details about the Postgres instance with the IUCN redlist:

 ```shell
 $ docker run --rm -v /some/local/dir:/data \
+    -p 5432:5432 \
     -e DB_HOST=localhost \
     -e DB_NAME=iucnredlist \
     -e DB_PASSWORD=supersecretpassword \
     -e DB_USER=postgres \
-    star ./scripts/run.sh
+    -e GBIF_USERNAME=myusername \
+    -e GBIF_PASSWORD=mypassword \
+    -e GBIF_EMAIL=myemail \
+    star --cores 8 all
 ```

 ### Running without Docker
@@ -61,7 +78,6 @@ If you prefer not to use Docker, you will need:
 * GDAL
 * R (required for validation)
 * [Reclaimer](https://github.com/quantifyearth/reclaimer/) - a Go tool for fetching data from Zenodo
-* [Littlejohn](https://github.com/quantifyearth/littlejohn/) - a Go tool for running scripts in parallel

 If you are using macOS please note that the default Python install that Apple ships is now several years out of date (Python 3.9, released Oct 2020) and you'll need to install a more recent version (for example, using [homebrew](https://brew.sh)).

@@ -91,6 +107,20 @@ export DB_PASSWORD=supersecretpassword
 export DB_USER=postgres
 ```

+If on macOS then you can set the following extra flag to use GPU acceleration:
+
+```shell
+export YIRGACHEFFE_BACKEND=MLX
+```
+
+For occurrence validation you will need a GBIF account and have to set the details as follows:
+
+```shell
+export GBIF_USERNAME=myusername
+export GBIF_PASSWORD=mypassword
+export GBIF_EMAIL=myemail
+```
+
 Once you have all that you can then run the pipeline:

 ```shell
````

config/config.yaml

Lines changed: 37 additions & 0 deletions

```diff
@@ -0,0 +1,37 @@
+# STAR Pipeline Configuration
+# ===========================
+
+# Taxonomic classes to process
+taxa:
+  - AMPHIBIA
+  - AVES
+  - MAMMALIA
+  - REPTILIA
+
+# Scenario for habitat layers (for future expansion)
+scenario: current
+
+# Projection used throughout the pipeline
+projection: "ESRI:54009"
+
+# Scale for habitat processing (meters)
+habitat_scale: 992.292720200000133
+
+# Input data files (relative to datadir)
+# These are expected to be pre-downloaded in the Zenodo subfolder
+inputs:
+  zenodo_mask: "Zenodo/CGLS100Inland_withGADMIslands.tif"
+  zenodo_elevation_max: "Zenodo/FABDEM_1km_max_patched.tif"
+  zenodo_elevation_min: "Zenodo/FABDEM_1km_min_patched.tif"
+  zenodo_islands: "Zenodo/MissingLandcover_1km_cover.tif"
+  crosswalk_source: "data/crosswalk_bin_T.csv"
+
+# Optional input files (pipeline will check if these exist)
+optional_inputs:
+  birdlife_elevations: "BL_Species_Elevations_2023.csv"
+  species_excludes: "SpeciesList_generalisedRangePolygons.csv"
+
+# Zenodo configuration for downloading raw habitat
+zenodo:
+  habitat_id: 3939050
+  habitat_filename: "PROBAV_LC100_global_v3.0.1_2019-nrt_Discrete-Classification-map_EPSG-4326.tif"
```
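As a sketch of how these settings flow into the pipeline: Snakemake parses `config/config.yaml` into a plain `config` dict available to the workflow and its scripts. The dict literal below just mirrors a subset of the file above, and the `aohs_*` names are hypothetical, purely to show the access pattern:

```python
# Stand-in for the dict Snakemake builds from config/config.yaml.
config = {
    "taxa": ["AMPHIBIA", "AVES", "MAMMALIA", "REPTILIA"],
    "projection": "ESRI:54009",
    "habitat_scale": 992.292720200000133,
    "inputs": {"zenodo_mask": "Zenodo/CGLS100Inland_withGADMIslands.tif"},
}

def mask_path(datadir: str) -> str:
    # Input paths in the config are documented as relative to the data dir.
    return f"{datadir}/{config['inputs']['zenodo_mask']}"

# e.g. expand per-taxon work items from the taxa list (names are made up)
per_taxon_targets = [f"aohs_{taxon.lower()}" for taxon in config["taxa"]]
```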

docker-entrypoint.sh

Lines changed: 20 additions & 0 deletions

```diff
@@ -0,0 +1,20 @@
+#!/bin/bash
+set -e
+
+# Ensure logs directory exists
+mkdir -p "${DATADIR}/logs"
+
+# Change to DATADIR so .snakemake/ metadata is stored there
+cd "${DATADIR}"
+
+# Generate timestamped log filename
+LOG_FILE="${DATADIR}/logs/snakemake_$(date +%Y%m%d_%H%M%S).log"
+
+echo "Snakemake logs will be written to: ${LOG_FILE}"
+
+# Run snakemake with all passed arguments, capturing output to log file
+exec snakemake \
+    --snakefile /root/star/workflow/Snakefile \
+    --scheduler greedy \
+    "$@" \
+    2>&1 | tee "${LOG_FILE}"
```

prepare_layers/convert_crosswalk.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -2,6 +2,7 @@
 from pathlib import Path

 import pandas as pd
+from snakemake_argparse_bridge import snakemake_compatible

 # Take from https://www.iucnredlist.org/resources/habitat-classification-scheme
 IUCN_HABITAT_CODES = {
@@ -57,6 +58,10 @@ def convert_crosswalk(
     df = pd.DataFrame(res, columns=["code", "value"])
     df.to_csv(output_path, index=False)

+@snakemake_compatible(mapping={
+    "original_path": "input.original",
+    "output_path": "output.crosswalk",
+})
 def main() -> None:
     parser = argparse.ArgumentParser(description="Convert IUCN crosswalk to minimal common format.")
     parser.add_argument(
```
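The `snakemake_compatible` decorator added here does double duty: under plain CLI use, `main()` parses `sys.argv` as before, while under Snakemake the rule's `input`/`output` attributes supply the equivalent arguments. A minimal sketch of how such a bridge *could* work — this is a guess at the mechanism, not the library's actual implementation, and the dashed flag names are an assumption:

```python
import sys
from types import SimpleNamespace

def snakemake_compatible(mapping):
    # Hypothetical re-implementation: if a global `snakemake` object exists
    # (Snakemake injects one into scripts it runs), resolve each dotted path
    # like "input.original" against it and rewrite sys.argv so an unchanged
    # argparse-based main() picks the values up.
    def decorator(func):
        def wrapper():
            smk = globals().get("snakemake")
            if smk is not None:
                argv = [sys.argv[0]]
                for arg_name, dotted in mapping.items():
                    section, attr = dotted.split(".")
                    value = getattr(getattr(smk, section), attr)
                    argv.append(f"--{arg_name.replace('_', '-')}")
                    argv.append(str(value))
                sys.argv = argv
            return func()
        return wrapper
    return decorator

# Simulate running under Snakemake with a stand-in object.
snakemake = SimpleNamespace(
    input=SimpleNamespace(original="crosswalk_bin_T.csv"),
    output=SimpleNamespace(crosswalk="crosswalk.csv"),
)

@snakemake_compatible(mapping={
    "original_path": "input.original",
    "output_path": "output.crosswalk",
})
def fake_main():
    return list(sys.argv)

rewritten = fake_main()
```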

prepare_layers/remove_nans_from_mask.py

Lines changed: 5 additions & 0 deletions

```diff
@@ -3,6 +3,7 @@
 from pathlib import Path

 import yirgacheffe as yg
+from snakemake_argparse_bridge import snakemake_compatible

 def remove_nans_from_mask(
     input_path: Path,
@@ -13,6 +14,10 @@ def remove_nans_from_mask(
     converted = layer.nan_to_num()
     converted.to_geotiff(output_path)

+@snakemake_compatible(mapping={
+    "original_path": "input.original",
+    "output_path": "output.mask",
+})
 def main() -> None:
     parser = argparse.ArgumentParser(description="Convert NaNs to zeros in mask layers")
     parser.add_argument(
```
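For readers unfamiliar with the `nan_to_num` operation used by this script, it replaces NaN pixels in the mask with a fill value (zero). Conceptually, over a flat list of pixel values rather than a raster layer, it is just:

```python
import math

def nan_to_num(values: list[float], fill: float = 0.0) -> list[float]:
    # Conceptual equivalent of the layer-level nan_to_num call above:
    # replace every NaN with the fill value, leave other pixels alone.
    return [fill if math.isnan(v) else v for v in values]
```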

prepare_species/apply_birdlife_data.py

Lines changed: 25 additions & 1 deletion

```diff
@@ -1,10 +1,12 @@
 import argparse
 import math
+import os
 from pathlib import Path

 import aoh
 import geopandas as gpd
 import pandas as pd
+from snakemake_argparse_bridge import snakemake_compatible

 # Columns from current BirdLife data overrides:
 # SIS ID
@@ -24,6 +26,7 @@
 def apply_birdlife_data(
     geojson_directory_path: Path,
     overrides_path: Path,
+    sentinel_path: Path | None,
 ) -> None:
     overrides = pd.read_csv(overrides_path, encoding="latin1")

@@ -51,6 +54,18 @@ def apply_birdlife_data(
         res = gpd.GeoDataFrame(data.to_frame().transpose(), crs=species_info.crs, geometry="geometry")
         res.to_file(path, driver="GeoJSON")

+    # This script modifies the GeoJSON files, but snakemake needs one
+    # output to say when this is done, so if we're in snakemake mode we touch a sentinel file to
+    # let it know we've done. One day this should be another decorator.
+    if sentinel_path is not None:
+        os.makedirs(sentinel_path.parent, exist_ok=True)
+        sentinel_path.touch()
+
+@snakemake_compatible(mapping={
+    "geojson_directory_path": "params.geojson_dir",
+    "overrides": "input.overrides",
+    "sentinel_path": "output.sentinel",
+})
 def main() -> None:
     parser = argparse.ArgumentParser(description="Process agregate species data to per-species-file.")
     parser.add_argument(
@@ -67,11 +82,20 @@ def main() -> None:
         required=True,
         dest="overrides",
     )
+    parser.add_argument(
+        '--sentinel',
+        type=Path,
+        help='Generate a sentinel file on completion for snakemake to track',
+        required=False,
+        default=None,
+        dest='sentinel_path',
+    )
     args = parser.parse_args()

     apply_birdlife_data(
         args.geojson_directory_path,
-        args.overrides
+        args.overrides,
+        args.sentinel_path,
     )

 if __name__ == "__main__":
```

prepare_species/common.py

Lines changed: 2 additions & 2 deletions

```diff
@@ -161,9 +161,9 @@ def process_systems(
     return systems

 def process_threats(
-    threat_data: list[tuple[int, str, str]],
+    threat_data: list[tuple[str, str, str]],
     report: SpeciesReport,
-) -> list[tuple[int, int]]:
+) -> list[tuple[str, int]]:
     cleaned_threats = []
     for code, scope, severity in threat_data:
         if scope is None or scope.lower() == "unknown":
```
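The `int` to `str` type change reflects that IUCN threat codes are dotted, hierarchical identifiers (e.g. `2.3.1`), which cannot round-trip through `int` without losing structure. A simplified stand-in for this kind of filtering, assuming unknown-scope records are dropped and using a made-up severity weighting (the repo's actual scoring is richer):

```python
def clean_threats(threat_data: list[tuple[str, str, str]]) -> list[tuple[str, int]]:
    # Each record is (code, scope, severity); codes stay as strings because
    # "2.3.1" has no faithful integer representation.
    cleaned: list[tuple[str, int]] = []
    for code, scope, severity in threat_data:
        if scope is None or scope.lower() == "unknown":
            continue
        # Hypothetical weighting, for illustration only.
        weight = 2 if severity and severity.lower() == "rapid declines" else 1
        cleaned.append((code, weight))
    return cleaned

result = clean_threats([
    ("2.3.1", "Majority (50-90%)", "Rapid declines"),  # dotted code, kept
    ("11.1", "Unknown", "Slow, significant declines"),  # unknown scope, dropped
])
```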
