Commit 2eb56ff

documentation and finishing for release

1 parent 1e42766 commit 2eb56ff

File tree

4 files changed: +322 -26 lines


.gitignore

Lines changed: 0 additions & 4 deletions

```diff
@@ -14,12 +14,8 @@ data/*
 
 build/
 validation_playground.py
-docs/*.md
 
 playground_results.yml
-src/DataBUS/neotomaUploader/tests/
-src/DataBUS/neotomaHelpers/tests/
-
 
 # Eventually Add:
 .github/workflows/release.yml
```

docs/reference.md

Lines changed: 1 addition & 21 deletions

```diff
@@ -27,7 +27,7 @@ Core classes representing the fundamental data models used throughout the DataBUS
 ::: DataBUS.UThSeries
 ::: DataBUS.Variable
 
-## DataBUS Validator / Uploader
+## DataBUS
 
 Validation and insertion modules for the `neotomaValidator` package. Each function validates the
 corresponding Neotoma entity and, when a populated `databus` dict is supplied, also inserts the
@@ -65,34 +65,14 @@ template parsing, and transaction management.
 ### Parameter Extraction
 
 ::: DataBUS.neotomaHelpers.pull_params
-::: DataBUS.neotomaHelpers.pull_required
-::: DataBUS.neotomaHelpers.pull_overwrite
 
 ### Template & File Utilities
 
-::: DataBUS.neotomaHelpers.template_to_dict
 ::: DataBUS.neotomaHelpers.read_csv
 ::: DataBUS.neotomaHelpers.check_file
 ::: DataBUS.neotomaHelpers.hash_file
 ::: DataBUS.neotomaHelpers.excel_to_yaml
 
-### Database & Contact Helpers
-
-::: DataBUS.neotomaHelpers.get_contacts
-::: DataBUS.neotomaHelpers.utils
-
-### Transaction Management
-
-::: DataBUS.neotomaHelpers.safe_step
-
 ### Logging
 
 ::: DataBUS.neotomaHelpers.logging_dict
-
-### CLI
-
-::: DataBUS.neotomaHelpers.parse_arguments
-
-### Speleothem Reference Inserts
-
-::: DataBUS.neotomaHelpers.speleothem_reference_inserts
```

docs/tutorials.md (new file)

Lines changed: 309 additions & 0 deletions
# Tutorials

This section provides step-by-step guides to help you get started with DataBUS. The tutorials are grounded in the example files included in this repository (`data/data_example.csv` and `data/template_example.yml`) and in the reference workflow script `databus_example.py`.

---

## Tutorial 1: Setting Up Your Environment

### Prerequisites

- Python 3.11+ and [`uv`](https://docs.astral.sh/uv/) installed
- Access to a Neotoma database (test or production)
- A `.env` file with your database connection string (see `.env_example`)

### Installing DataBUS

Clone the repository and install dependencies with `uv`:

```bash
git clone https://github.com/NeotomaDB/DataBUS.git
cd DataBUS
uv sync --extra dev
```

### Configuring the Database Connection

DataBUS reads database credentials from a `.env` file. Copy the provided example and fill in your connection details:

```bash
cp .env_example .env
```

The `.env` file should contain a `PGDB_TANK` key with a JSON-encoded connection string:

```
PGDB_TANK={"host": "your_host", "dbname": "neotoma", "user": "your_user", "password": "your_password", "port": 5432}
```

**Important:** The `.env` file contains sensitive database credentials and is listed in `.gitignore` — it will never be committed to the repository. Do not share or commit this file.

This is loaded automatically in your script via:

```python
import json
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()
connection = json.loads(os.getenv("PGDB_TANK"))
conn = psycopg2.connect(**connection, connect_timeout=5)
```
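A malformed `PGDB_TANK` value fails only when `psycopg2.connect` runs, so it can help to sanity-check it first. The sketch below uses only the standard library; `check_pgdb_tank` is a hypothetical helper (not part of DataBUS), and the required keys are taken from the example `.env` entry above:

```python
import json

# Keys psycopg2.connect() will need, per the .env_example entry above.
REQUIRED_KEYS = {"host", "dbname", "user", "password", "port"}

def check_pgdb_tank(raw: str) -> list[str]:
    """Return a list of problems; an empty list means the value looks usable."""
    try:
        conn_info = json.loads(raw)
    except json.JSONDecodeError as err:
        return [f"not valid JSON: {err}"]
    if not isinstance(conn_info, dict):
        return ["not a JSON object"]
    missing = REQUIRED_KEYS - conn_info.keys()
    return [f"missing key: {key}" for key in sorted(missing)]
```

Running this against your `.env` value before the first validation pass turns a cryptic connection error into a direct message about what is wrong.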
---

## Tutorial 2: Understanding the Input Files

DataBUS takes two inputs for every upload: a **CSV data file** and a **YAML template** that maps your CSV columns to Neotoma database fields. The repository includes a fully annotated example of each.

### The CSV Data File (`data/data_example.csv`)

Each row in the CSV represents one sample. Site-level metadata (name, coordinates, collection unit info) is repeated across all rows, while depth-varying fields like `Depth`, `Thickness`, and proxy values change per row.

The example file contains three rows for a fictional "Example Lake" site, with pollen counts for *Quercus*, *Betula*, and *Pinus* at depths 0.5, 1.5, and 2.5 cm:

```
SiteName, Latitude, Longitude, Altitude, ..., Depth, Thickness, ..., TaxonName, Value, ...
Example Lake, 45.1234, -90.5678, 300, ..., 0.5, 1.0, ..., Quercus, 45, ...
Example Lake, 45.1234, -90.5678, 300, ..., 1.5, 1.0, ..., Betula, 78, ...
Example Lake, 45.1234, -90.5678, 300, ..., 2.5, 1.0, ..., Pinus, 120, ...
```

The full column set covers the complete upload hierarchy: site → geopolitical units → collection unit → analysis units → chronology → chron controls → geochronology → sample ages → data values → contacts → publications.
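Because site-level metadata must repeat identically on every row, copy-paste drift (a stray space, a changed capital) is a common source of validation failures. A small stdlib pre-flight check can catch it early; this is a sketch, and the column names listed are assumptions based on the example file above:

```python
import csv
import io

# Columns expected to be constant across all rows of one file.
SITE_LEVEL = ["SiteName", "Latitude", "Longitude", "Altitude"]

def constant_column_check(csv_text: str) -> dict[str, int]:
    """Map each site-level column to its count of distinct values (1 is good)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return {col: len({row[col].strip() for row in rows}) for col in SITE_LEVEL}
```

Any column reporting more than one distinct value is worth inspecting before running DataBUS.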
### The YAML Template (`data/template_example.yml`)

The template declares how each CSV column maps to a Neotoma database field. Key concepts:

**`rowwise: false`** — this field is constant across all rows (site-level data). DataBUS checks that the column holds a single unique value across all rows.

**`rowwise: true`** — this field varies per sample row (depths, measurements).

**`required: true/false`** — whether the field must be present and non-null for validation to pass.

**`chronologyname`** — groups multiple columns into a single age model. All fields sharing the same `chronologyname` value are treated as part of the same chronology.

**`taxonname`** — links a data column to a specific Neotoma taxon. Every column mapped to `ndb.data.value` must have a `taxonname`, and its corresponding units column must carry the same name. DataBUS does **not** create new taxa — if a taxon is not already in Neotoma, validation will fail and the taxon must be added via Tilia or directly in the database (the stewards can also help you create upload scripts). The same restriction applies to:

- Dataset Types
- Constituent Databases
- Publications
- Contacts
- Variable Units
- Variable Elements
- Variable Contexts
- Variables

DataBUS does not create any of these records, because only database stewards are responsible for that process.
A minimal site block looks like this:

```yaml
- column: SiteName
  neotoma: ndb.sites.sitename
  required: true
  rowwise: false
  type: string

- column: Latitude
  neotoma: ndb.sites.geog.latitude
  required: true
  rowwise: false
  type: float

- column: Longitude
  neotoma: ndb.sites.geog.longitude
  required: true
  rowwise: false
  type: float
```

A data variable pair (value + units) looks like this:

```yaml
- column: Unsupported.210Pb
  neotoma: ndb.data.value
  taxonname: Excess 210Pb # must exist in Neotoma
  rowwise: true
  type: float
  unitcolumn: Unsupported.210Pb.Units

- column: Unsupported.210Pb.Units
  neotoma: ndb.variables.variableunitsid
  taxonname: Excess 210Pb # must match the data column above
  rowwise: true
  type: string
```

See `data/template_example.yml` for the complete annotated template covering all supported sections.
---

## Tutorial 3: The Two-Pass Workflow

DataBUS is designed to be run **twice**: first to validate your data without modifying the database, then to upload once everything passes. This prevents partial or corrupt submissions.

### Pass 1 — Validate Only

```bash
uv run databus_example.py \
  --data data/ \
  --template data/template_example.yml \
  --logs data/logs/ \
  --upload False # defaults to False if omitted
```

This runs all validation steps and writes a `.valid.log` file for each CSV file in `data/`. No data is written to the database.

### Pass 2 — Upload

```bash
uv run databus_example.py \
  --data data/ \
  --template data/template_example.yml \
  --logs data/logs/ \
  --upload True
```

This runs validation again and, only if **every** step passes, commits the data. If any step fails, the transaction is rolled back and nothing is written.
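A string-valued `--upload True/False` flag like the one above is easy to get wrong in `argparse`, because a bare `type=bool` treats any non-empty string (including `"False"`) as true. One way such a flag might be parsed, shown here as a sketch rather than how `databus_example.py` necessarily implements it:

```python
import argparse

def str2bool(value: str) -> bool:
    """Interpret common true/false spellings; reject anything else."""
    lowered = value.strip().lower()
    if lowered in {"true", "t", "yes", "1"}:
        return True
    if lowered in {"false", "f", "no", "0"}:
        return False
    raise argparse.ArgumentTypeError(f"expected True or False, got {value!r}")

parser = argparse.ArgumentParser()
parser.add_argument("--upload", type=str2bool, default=False,
                    help="Validate only (False, default) or validate and upload (True).")
```

With this converter, `--upload False` really does leave the database untouched, matching the two-pass behavior described above.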
---

## Tutorial 4: What Happens Inside the Script

The `databus_example.py` script walks through the full set of validation steps for each CSV file. Understanding its structure helps you adapt it for your own dataset. Not every step needs to run for every data type: speleothem data may require the `valid_speleothem.py` step, for example, while a pollen record does not need it.

### File Integrity Check

Before any validation, DataBUS checks whether the file has already been processed (via hash) and whether it exists in the expected location:

```python
hashcheck = nh.hash_file(filename)
filecheck = nh.check_file(filename, validation_files="data/")
```

If both checks fail, the file is skipped entirely.
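The hash check is what lets DataBUS recognize files it has already seen. Conceptually it can be as simple as comparing a file's SHA-256 digest against the digests of previously processed files; this is a sketch of the idea, not necessarily how `nh.hash_file` is implemented:

```python
import hashlib

def file_digest(path: str) -> str:
    """SHA-256 hex digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(8192), b""):
            digest.update(chunk)
    return digest.hexdigest()

def is_new_file(path: str, seen_digests: set[str]) -> bool:
    """True if this exact content has not been processed before."""
    return file_digest(path) not in seen_digests
```

Hashing the content rather than the filename means a renamed copy of an already uploaded file is still detected as a duplicate.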
### Validation Steps

Each step uses `nh.safe_step()`, which wraps the validator in error handling and logs the result. Results are collected in a `databus` dict that is passed forward to subsequent steps, so later steps can reference IDs produced by earlier ones.

The steps run in this order:

1. **Sites** — validates site name and coordinates
2. **Geopolitical Units** — country, state/province, county
3. **Collection Units** — core handle, collection type, depositional environment
4. **Analysis Units** — depth and thickness per sample row
5. **Datasets** — dataset name and type
6. **Geochron Datasets** — geochronological dataset metadata
7. **Chronologies** — age model name, type, and bounds
8. **Chron Controls** — individual age-depth control points
9. **Geochron** — individual radiometric dates
10. **Geochron Control** — links geochron dates to chron controls
11. **Contacts** — PI, collector, processor, analyst
12. **Database** — contributing database link
13. **Samples** — sample records per analysis unit
14. **Sample Ages** — assigned ages per sample
15. **Data** — proxy measurements (wide format, one column per variable)
16. **Publications** — DOI and citation

### Commit or Rollback

After all steps, DataBUS checks whether every step passed **and** the file hash was clean:

```python
all_true = all(databus[key].validAll for key in databus)
all_true = all_true and hashcheck

if args.upload:
    if all_true:
        databus["finalize"] = nv.insert_final(cur, databus=databus)
        conn.commit()
    else:
        conn.rollback()
```

The `insert_final` call inserts the record into the `datasetsubmissions` table, marking the upload as complete.

### Reading the Log

Each file produces a `<filename>.valid.log`. Each step appends its messages to the log, so you can trace exactly where validation failed. Messages use `✓`, `✗`, and `?` symbols for pass, fail, and informational messages respectively.
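When many files are validated at once, a one-line tally per log saves scrolling. A small sketch that counts the pass/fail/info markers described above (it assumes each message line begins with its marker, which may not hold for every log layout):

```python
from collections import Counter

MARKS = {"✓": "pass", "✗": "fail", "?": "info"}

def summarize_log(text: str) -> dict[str, int]:
    """Tally pass/fail/info messages in the text of one .valid.log."""
    counts: Counter[str] = Counter()
    for line in text.splitlines():
        stripped = line.lstrip()
        for mark, label in MARKS.items():
            if stripped.startswith(mark):
                counts[label] += 1
    return dict(counts)
```

A nonzero `fail` count tells you immediately which files need attention before the upload pass.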
---

## Tutorial 5: Adapting the Template for Your Dataset

The `template_example.yml` is a universal template covering all supported fields. For your own dataset you will almost always use only a subset of it.

### Step 1 — Start from the example

Copy `data/template_example.yml` as a starting point. Remove sections that do not apply to your data type (e.g., remove the U-Th geochronology block if you are uploading pollen data).

### Step 2 — Set the required constants

Fill in `datasettypeid` and `databasename` — these are dataset-level constants with no corresponding CSV column:

```yaml
- column: datasettypeid
  neotoma: ndb.datasettypes.datasettypeid
  required: true
  value: pollen # e.g. "Lead 210", "speleothem", "ostracode surface sample"

- column: databasename
  neotoma: ndb.datasetdatabases.databasename
  required: true
  value: Neotoma Paleoecology Database
```

### Step 3 — Define your data variables

Replace the placeholder `MyVariable1` / `MyVariable2` entries with your actual proxy columns. Each variable needs a value column and a units column, both sharing the same `taxonname`:

```yaml
- column: Quercus
  neotoma: ndb.data.value
  taxonname: Quercus # must exist in Neotoma taxa table
  required: false
  rowwise: true
  type: float
  unitcolumn: Quercus.Units

- column: Quercus.Units
  neotoma: ndb.variables.variableunitsid
  taxonname: Quercus
  required: false
  rowwise: true
  type: string
```

### Step 4 — Match column names exactly

The `column:` value in the YAML must match the CSV header **exactly** (case-sensitive). A mismatch will cause that field to be silently ignored during validation.
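Because mismatches are silent, it can pay to diff the template's column names against the CSV header before the first validation pass. The sketch below collects `column:` entries with a naive regex scan (rather than a YAML parser, to keep it dependency-free); both helpers are hypothetical, and entries that carry a fixed `value:` (like `datasettypeid`) have no CSV column and should be excluded before comparing:

```python
import re

def template_columns(template_text: str) -> set[str]:
    """Naively collect `- column:` values from the raw template text."""
    return set(re.findall(r"^\s*-\s*column:\s*(\S+)", template_text, flags=re.M))

def header_mismatches(template_text: str, csv_header: list[str]) -> set[str]:
    """Template columns with no exactly matching (case-sensitive) CSV header."""
    return template_columns(template_text) - set(csv_header)
```

Any name this returns would be silently ignored by validation, so an empty set is the goal before running DataBUS.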
---

## Dataset-Specific Examples

The DataBUS ecosystem includes several repositories that demonstrate the full workflow for specific proxy types. These are the best reference when adapting DataBUS for your own data:

### SISAL — Speleothem Isotope Records

[github.com/NeotomaDB/DataBUS_SISAL](https://github.com/NeotomaDB/DataBUS_SISAL)

A complete example for uploading speleothem stable isotope records (δ¹⁸O, δ¹³C) from the SISAL database. Includes examples of multiple chronologies within a single dataset and the U-Th geochronology workflow.

### Lead-210 Dating

[github.com/NeotomaDB/DataBUS_Pb210](https://github.com/NeotomaDB/DataBUS_Pb210)

Demonstrates uploading ²¹⁰Pb and ¹³⁷Cs radiometric data for recent sediment cores. Shows how to define wide-format activity columns with matching units columns, and how to link the `X210Pb` chronology model.

### Ostracode Surface Samples

[github.com/NeotomaDB/DataBUS_Ostracode](https://github.com/NeotomaDB/DataBUS_Ostracode)

An example for faunal surface sample data. Useful as a reference for datasets without depth-based chronologies, where the focus is on taxonomic counts mapped to sample locations.

---

## Next Steps

- Browse the [Reference](reference.md) for full documentation of all DataBUS classes and validator functions.
- Check the [How-To Guides](how-to-guide.md) for task-oriented recipes (coming soon).
- If your taxon or variable is not in Neotoma, contact the database stewards or use Tilia to add it before running DataBUS.

mkdocs.yml

Lines changed: 12 additions & 1 deletion

```diff
@@ -7,7 +7,18 @@ theme:
   locale: en
 
 plugins:
-  - mkdocstrings
+  - mkdocstrings:
+      handlers:
+        python:
+          options:
+            show_source: false
+            members_order: source
+            filters:
+              - "!^_"
+            show_root_heading: true
+            show_root_full_path: false
+            merge_init_into_class: true
+            docstring_section_style: spacy
 
 nav:
   - DataBUS Docs: index.md
```
