# Tutorials

This section provides step-by-step guides to help you get started with DataBUS. The tutorials are grounded in the example files included in this repository (`data/data_example.csv` and `data/template_example.yml`) and in the reference workflow script `databus_example.py`.

---

## Tutorial 1: Setting Up Your Environment

### Prerequisites

- Python 3.11+ and [`uv`](https://docs.astral.sh/uv/) installed
- Access to a Neotoma database (test or production)
- A `.env` file with your database connection string (see `.env_example`)

### Installing DataBUS

Clone the repository and install dependencies with `uv`:

```bash
git clone https://github.com/NeotomaDB/DataBUS.git
cd DataBUS
uv sync --extra dev
```

### Configuring the Database Connection

DataBUS reads database credentials from a `.env` file. Copy the provided example and fill in your connection details:

```bash
cp .env_example .env
```

The `.env` file should contain a `PGDB_TANK` key with a JSON-encoded connection string:

```
PGDB_TANK={"host": "your_host", "dbname": "neotoma", "user": "your_user", "password": "your_password", "port": 5432}
```

**Important:** The `.env` file contains sensitive database credentials and is listed in `.gitignore`, so it will never be committed to the repository. Do not share or commit this file.

This is loaded automatically in your script via:

```python
import json
import os

import psycopg2
from dotenv import load_dotenv

load_dotenv()
connection = json.loads(os.getenv("PGDB_TANK"))
conn = psycopg2.connect(**connection, connect_timeout=5)
```
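
Before handing the parsed credentials to `psycopg2`, it can help to verify that the JSON parses and contains the expected keys, so a malformed `.env` fails with a clear message rather than a connection error. A minimal sketch (the environment variable is set inline here to simulate what `load_dotenv()` would provide; the key names mirror the `PGDB_TANK` example above):

```python
import json
import os

# Simulate the variable that load_dotenv() would read from .env (example values).
os.environ["PGDB_TANK"] = (
    '{"host": "your_host", "dbname": "neotoma", "user": "your_user", '
    '"password": "your_password", "port": 5432}'
)

raw = os.getenv("PGDB_TANK")
if raw is None:
    raise RuntimeError("PGDB_TANK is not set; check your .env file.")

connection = json.loads(raw)

# Fail early if any key psycopg2.connect() will need is missing.
expected = {"host", "dbname", "user", "password", "port"}
missing = expected - connection.keys()
if missing:
    raise RuntimeError(f"PGDB_TANK is missing keys: {sorted(missing)}")
```
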

---

## Tutorial 2: Understanding the Input Files

DataBUS takes two inputs for every upload: a **CSV data file** and a **YAML template** that maps your CSV columns to Neotoma database fields. The repository includes a fully annotated example of each.

### The CSV Data File (`data/data_example.csv`)

Each row in the CSV represents one sample. Site-level metadata (name, coordinates, collection unit info) is repeated across all rows, while depth-varying fields like `Depth`, `Thickness`, and proxy values change per row.

The example file contains three rows for a fictional "Example Lake" site, with pollen counts for *Quercus*, *Betula*, and *Pinus* at depths 0.5, 1.5, and 2.5 cm:

```
SiteName, Latitude, Longitude, Altitude, ..., Depth, Thickness, ..., TaxonName, Value, ...
Example Lake, 45.1234, -90.5678, 300, ..., 0.5, 1.0, ..., Quercus, 45, ...
Example Lake, 45.1234, -90.5678, 300, ..., 1.5, 1.0, ..., Betula, 78, ...
Example Lake, 45.1234, -90.5678, 300, ..., 2.5, 1.0, ..., Pinus, 120, ...
```

The full column set covers the complete upload hierarchy: site → geopolitical units → collection unit → analysis units → chronology → chron controls → geochronology → sample ages → data values → contacts → publications.
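
A quick way to sanity-check this layout is to confirm that the site-level columns really are constant across rows while the depth columns vary. A stdlib-only sketch (the inline CSV is a stand-in for a subset of `data/data_example.csv`, which would be read the same way):

```python
import csv
import io

# Inline stand-in for data/data_example.csv (a subset of its columns).
csv_text = """SiteName,Latitude,Longitude,Altitude,Depth,Thickness,TaxonName,Value
Example Lake,45.1234,-90.5678,300,0.5,1.0,Quercus,45
Example Lake,45.1234,-90.5678,300,1.5,1.0,Betula,78
Example Lake,45.1234,-90.5678,300,2.5,1.0,Pinus,120
"""
rows = list(csv.DictReader(io.StringIO(csv_text)))

# Site-level columns should hold exactly one unique value (rowwise: false).
site_cols = ["SiteName", "Latitude", "Longitude", "Altitude"]
constant = {c: len({r[c] for r in rows}) == 1 for c in site_cols}

# Depth-varying columns change per sample row (rowwise: true).
depths = [float(r["Depth"]) for r in rows]
```
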

### The YAML Template (`data/template_example.yml`)

The template declares how each CSV column maps to a Neotoma database field. Key concepts:

**`rowwise: false`** — this field is constant across all rows (site-level data). DataBUS checks that the column contains only one unique value across all rows.

**`rowwise: true`** — this field varies per sample row (depths, measurements).

**`required: true/false`** — whether the field must be present and non-null for validation to pass.

**`chronologyname`** — groups multiple columns into a single age model. All fields sharing the same `chronologyname` value are treated as part of the same chronology.

**`taxonname`** — links a data column to a specific Neotoma taxon. Every column mapped to `ndb.data.value` must have a `taxonname`, and its corresponding units column must carry the same name. DataBUS does **not** create new taxa — if a taxon is not already in Neotoma, validation will fail, and the taxon must be added via Tilia or directly in the database (the DataBUS team can also help create upload scripts for this). The same restriction applies to:
- Dataset Types
- Constituent Databases
- Publications
- Contacts
- Variable Units
- Variable Elements
- Variable Contexts
- Variables

DataBUS deliberately does not create these records; only database stewards are responsible for that process.
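
The `rowwise` and `chronologyname` rules above can be sketched in a few lines. The template entries here are written as plain Python dicts purely for brevity — they have the same shape that `template_example.yml` items would have once parsed, and the column names are hypothetical:

```python
from collections import defaultdict

# Hypothetical template entries (same shape as parsed template_example.yml items).
template = [
    {"column": "SiteName", "rowwise": False, "required": True},
    {"column": "Depth", "rowwise": True, "required": True},
    {"column": "Age", "rowwise": True, "chronologyname": "Default 210Pb"},
    {"column": "AgeError", "rowwise": True, "chronologyname": "Default 210Pb"},
]

# Fields sharing a chronologyname belong to the same age model.
chronologies = defaultdict(list)
for entry in template:
    if "chronologyname" in entry:
        chronologies[entry["chronologyname"]].append(entry["column"])

# rowwise: false marks site-level fields that must be constant across rows.
site_level = [e["column"] for e in template if not e["rowwise"]]
```
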

A minimal site block looks like this:

```yaml
- column: SiteName
  neotoma: ndb.sites.sitename
  required: true
  rowwise: false
  type: string

- column: Latitude
  neotoma: ndb.sites.geog.latitude
  required: true
  rowwise: false
  type: float

- column: Longitude
  neotoma: ndb.sites.geog.longitude
  required: true
  rowwise: false
  type: float
```

A data variable pair (value + units) looks like this:

```yaml
- column: Unsupported.210Pb
  neotoma: ndb.data.value
  taxonname: Excess 210Pb # must exist in Neotoma
  rowwise: true
  type: float
  unitcolumn: Unsupported.210Pb.Units

- column: Unsupported.210Pb.Units
  neotoma: ndb.variables.variableunitsid
  taxonname: Excess 210Pb # must match the data column above
  rowwise: true
  type: string
```

See `data/template_example.yml` for the complete annotated template covering all supported sections.

---

## Tutorial 3: The Two-Pass Workflow

DataBUS is designed to be run **twice**: first to validate your data without modifying the database, then to upload once everything passes. This prevents partial or corrupt submissions.

### Pass 1 — Validate Only

```bash
uv run databus_example.py \
  --data data/ \
  --template data/template_example.yml \
  --logs data/logs/ \
  --upload False # defaults to False if omitted
```

This runs all validation steps and writes a `.valid.log` file for each CSV file in `data/`. No data is written to the database.

### Pass 2 — Upload

```bash
uv run databus_example.py \
  --data data/ \
  --template data/template_example.yml \
  --logs data/logs/ \
  --upload True
```

This runs validation again and, only if **every** step passes, commits the data. If any step fails, the transaction is rolled back and nothing is written.
---

## Tutorial 4: What Happens Inside the Script

The `databus_example.py` script walks through a series of validation steps for each CSV file. Understanding its structure helps you adapt it for your own dataset. Not every step needs to run for every data type: for example, speleothem data may require the `valid_speleothem.py` step, which is not needed for a pollen record.

### File Integrity Check

Before any validation, DataBUS checks whether the file has already been processed (via hash) and whether it exists in the expected location:

```python
hashcheck = nh.hash_file(filename)
filecheck = nh.check_file(filename, validation_files="data/")
```

If both checks fail, the file is skipped entirely.
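
The repository's `nh.hash_file` implementation isn't reproduced here, but a content-hash check of this kind typically boils down to hashing the file bytes and comparing the digest against those of already-processed files. A hypothetical sketch (the helper name `file_digest` and the `seen_digests` set are illustrative, not the real API):

```python
import hashlib
import tempfile
from pathlib import Path


def file_digest(path: str) -> str:
    """Return the SHA-256 hex digest of a file's contents."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


# Demo: an unchanged file hashes identically, so a re-run can be skipped.
with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False) as f:
    f.write("SiteName,Depth\nExample Lake,0.5\n")
    path = f.name

seen_digests = {file_digest(path)}  # digests of already-processed files
already_processed = file_digest(path) in seen_digests
```
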

### Validation Steps

Each step uses `nh.safe_step()`, which wraps the validator in error handling and logs the result. Results are collected in a `databus` dict that is passed forward to subsequent steps, so later steps can reference IDs produced by earlier ones.

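The exact signature of `nh.safe_step` lives in the repository, but the pattern it implements — run a validator, trap exceptions, record pass/fail in the shared dict — can be sketched like this (the `StepResult` class and field names are illustrative, not the real API; only the `validAll` attribute mirrors the attribute used later in the script):

```python
from dataclasses import dataclass, field


@dataclass
class StepResult:
    name: str
    validAll: bool = False
    messages: list = field(default_factory=list)


def safe_step(name, validator, databus):
    """Run a validator, catching errors so one failure doesn't halt the run."""
    result = StepResult(name=name)
    try:
        result.validAll = validator(databus)
        result.messages.append(f"✓ {name}" if result.validAll else f"✗ {name}")
    except Exception as exc:
        result.messages.append(f"✗ {name}: {exc}")
    databus[name] = result  # later steps can read earlier results here
    return result


# Each step stores its result in the shared databus dict.
databus = {}
safe_step("sites", lambda d: True, databus)
safe_step("contacts", lambda d: 1 / 0, databus)  # raises; caught and logged
```
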
The steps run in this order:

1. **Sites** — validates site name and coordinates
2. **Geopolitical Units** — country, state/province, county
3. **Collection Units** — core handle, collection type, depositional environment
4. **Analysis Units** — depth and thickness per sample row
5. **Datasets** — dataset name and type
6. **Geochron Datasets** — geochronological dataset metadata
7. **Chronologies** — age model name, type, and bounds
8. **Chron Controls** — individual age-depth control points
9. **Geochron** — individual radiometric dates
10. **Geochron Control** — links geochron dates to chron controls
11. **Contacts** — PI, collector, processor, analyst
12. **Database** — contributing database link
13. **Samples** — sample records per analysis unit
14. **Sample Ages** — assigned ages per sample
15. **Data** — proxy measurements (wide format, one column per variable)
16. **Publications** — DOI and citation

### Commit or Rollback

After all steps, DataBUS checks whether every step passed **and** the file hash was clean:

```python
all_true = all(databus[key].validAll for key in databus)
all_true = all_true and hashcheck

if args.upload:
    if all_true:
        databus["finalize"] = nv.insert_final(cur, databus=databus)
        conn.commit()
    else:
        conn.rollback()
```

The `insert_final` call inserts the record into the `datasetsubmissions` table, marking the upload as complete.

### Reading the Log

Each file produces a `<filename>.valid.log`. Each step appends its messages to the log, so you can trace exactly where validation failed. Messages use `✓`, `✗`, and `?` symbols for pass, fail, and informational messages respectively.
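
Because every step appends `✓`/`✗`/`?` lines, a quick tally of the log tells you at a glance whether a file passed. A small sketch (the log text here is invented for illustration; real messages will differ):

```python
import io

# Invented example of a .valid.log excerpt.
log_text = """\
✓ Sites: name and coordinates valid
✓ Collection Units: handle found
✗ Data: taxon not found in Neotoma
? Publications: no DOI provided, skipping
"""

counts = {"pass": 0, "fail": 0, "info": 0}
for line in io.StringIO(log_text):
    if line.startswith("✓"):
        counts["pass"] += 1
    elif line.startswith("✗"):
        counts["fail"] += 1
    elif line.startswith("?"):
        counts["info"] += 1

file_passed = counts["fail"] == 0
```
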
| 227 | + |
| 228 | +--- |
| 229 | + |
| 230 | +## Tutorial 5: Adapting the Template for Your Dataset |
| 231 | + |
| 232 | +The `template_example.yml` is a universal template covering all supported fields. For your own dataset you will almost always use only a subset of it. |
| 233 | + |
| 234 | +### Step 1 — Start from the example |
| 235 | + |
| 236 | +Copy `data/template_example.yml` as a starting point. Remove sections that do not apply to your data type (e.g., remove the U-Th geochronology block if you are uploading pollen data). |
| 237 | + |
| 238 | +### Step 2 — Set the required constants |
| 239 | + |
| 240 | +Fill in `datasettypeid` and `databasename` — these are dataset-level constants with no corresponding CSV column: |
| 241 | + |
| 242 | +```yaml |
| 243 | +- column: datasettypeid |
| 244 | + neotoma: ndb.datasettypes.datasettypeid |
| 245 | + required: true |
| 246 | + value: pollen # e.g. "Lead 210", "speleothem", "ostracode surface sample" |
| 247 | +
|
| 248 | +- column: databasename |
| 249 | + neotoma: ndb.datasetdatabases.databasename |
| 250 | + required: true |
| 251 | + value: Neotoma Paleoecology Database |
| 252 | +``` |

### Step 3 — Define your data variables

Replace the placeholder `MyVariable1` / `MyVariable2` entries with your actual proxy columns. Each variable needs a value column and a units column, both sharing the same `taxonname`:

```yaml
- column: Quercus
  neotoma: ndb.data.value
  taxonname: Quercus # must exist in Neotoma taxa table
  required: false
  rowwise: true
  type: float
  unitcolumn: Quercus.Units

- column: Quercus.Units
  neotoma: ndb.variables.variableunitsid
  taxonname: Quercus
  required: false
  rowwise: true
  type: string
```

### Step 4 — Match column names exactly

The `column:` value in the YAML must match the CSV header **exactly** (case-sensitive). A mismatch will cause that field to be silently ignored during validation.

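Because a mismatched header is silently ignored, it is worth diffing the CSV headers against the template's `column:` values before running validation. A stdlib-only sketch (the CSV text and column names are invented; note the deliberate `latitude`/`Latitude` case mismatch):

```python
import csv
import io

# Stand-ins for your real CSV and template.
csv_text = "SiteName,latitude,Longitude\nExample Lake,45.1,-90.5\n"
template_columns = {"SiteName", "Latitude", "Longitude"}

headers = set(next(csv.reader(io.StringIO(csv_text))))

missing_from_csv = template_columns - headers  # template fields with no data
unmapped_in_csv = headers - template_columns   # CSV columns DataBUS will ignore
```

Running this before each upload turns a silent validation gap into an explicit, fixable diff.
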
---

## Dataset-Specific Examples

The DataBUS ecosystem includes several repositories that demonstrate the full workflow for specific proxy types. These are the best reference when adapting DataBUS for your own data:

### SISAL — Speleothem Isotope Records

[github.com/NeotomaDB/DataBUS_SISAL](https://github.com/NeotomaDB/DataBUS_SISAL)

A complete example for uploading speleothem stable isotope records (δ¹⁸O, δ¹³C) from the SISAL database. Includes examples of multiple chronologies within a single dataset and the U-Th geochronology workflow.

### Lead-210 Dating

[github.com/NeotomaDB/DataBUS_Pb210](https://github.com/NeotomaDB/DataBUS_Pb210)

Demonstrates uploading ²¹⁰Pb and ¹³⁷Cs radiometric data for recent sediment cores. Shows how to define wide-format activity columns with matching units columns, and how to link the `X210Pb` chronology model.

### Ostracode Surface Samples

[github.com/NeotomaDB/DataBUS_Ostracode](https://github.com/NeotomaDB/DataBUS_Ostracode)

An example for faunal surface sample data. Useful as a reference for datasets without depth-based chronologies, where the focus is on taxonomic counts mapped to sample locations.

---

## Next Steps

- Browse the [Reference](reference.md) for full documentation of all DataBUS classes and validator functions.
- Check the [How-To Guides](how-to-guide.md) for task-oriented recipes (coming soon).
- If your taxon or variable is not in Neotoma, contact the database stewards or use Tilia to add it before running DataBUS.