Global guidance for agentic coding agents working in this repo.
Sub-folder AGENTS.md files override these rules for their specific context.
modern-data-platform composes free, open-source Modern Data Stack tools into a self-contained, versioned, production-ready, and LLM-friendly data warehouse.
The intended workflow is daily or hourly batch refresh (not real-time streaming).
| Layer | Tool | Status |
|---|---|---|
| Environment | uv | Active |
| Database | DuckDB | Active |
| Transformation | SQLMesh | Active |
| Orchestration | Dagster | Planned |
| Ingestion | dlt | Planned |
| Storage layer | DuckLake | Planned |
| Reporting | Evidence | Planned |
- Python >=3.13 (`pyproject.toml`), managed with `uv`.
```
root/
├── AGENTS.md             # This file — global rules and conventions
├── .agents/              # Agent config, SKILL and Persona definitions
│   ├── data-engineer.md
│   └── analyst.md
├── analytics/            # Analytics / ML module (future)
│   └── __init__.py
├── ingestion/            # dlt pipelines (future)
├── reporting/            # Dashboards (future)
├── transformation/       # SQLMesh pipeline (DuckDB backend)
│   ├── config.yaml
│   ├── models/
│   │   ├── a_raw/        # Raw layer: ingestion models
│   │   ├── b_staging/    # Staging layer: cleaning & normalization
│   │   └── c_marts/      # Marts layer: business-ready tables
│   ├── seeds/            # CSV seed files for reference data
│   ├── audits/
│   ├── macros/
│   └── tests/
├── data/                 # Local DuckDB database (git-ignored)
├── pyproject.toml        # Python deps, ruff config, build
└── uv.lock
```
```sh
uv sync                                          # Create venv and install/sync dependencies
uv venv --clear && uv sync                       # Recreate venv from scratch
cd transformation && sqlmesh plan --auto-apply   # Plan + auto-apply changes
cd transformation && sqlmesh run                 # Run pipeline
cd transformation && sqlmesh plan -m <model_name> --auto-apply   # Single model
duckdb -ui data/db.duckdb                        # Open DuckDB UI
uv run ruff format .                             # Format
uv run ruff check --fix .                        # Lint + auto-fix
cd transformation && sqlmesh lint                # SQLMesh lint
uv run ruff check . && cd transformation && sqlmesh lint   # ruff + SQLMesh lint together
```

Model naming convention: `a_raw.*`, `b_staging.*`, `c_marts.*`. Lint rules are configured in `transformation/config.yaml`. There is no Python unit test suite yet (`transformation/tests/.gitkeep`).
- Prefer small, reviewable PRs: isolate behavior changes.
- Keep side effects explicit: separate data fetching, transformation, and persistence.
- Do not write to `data/` at import time; do it in functions/entrypoints.
- Prefer explicit imports over `import *`.
- Use `collections.abc` for typing iterables.
- Use explicit `from typing import X` imports. Do not use `import typing as t`.
- Use type hints for public functions and non-trivial helpers.
- Prefer `collections.abc` types: `Iterator`, `Iterable`, `Sequence`, etc.
- Use `typing.Any` only at integration boundaries.
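For instance, a helper typed per these rules might look like this (the function name and purpose are illustrative, not part of the repo):

```python
from collections.abc import Iterable, Iterator


def dedupe_keys(rows: Iterable[dict[str, object]], key: str) -> Iterator[str]:
    """Yield each distinct string value of `key`, preserving first-seen order."""
    seen: set[str] = set()
    for row in rows:
        value = row.get(key)
        if isinstance(value, str) and value not in seen:
            seen.add(value)
            yield value
```

Note the `collections.abc` types in the signature and the absence of `typing.Any`: the broad `object` value type is narrowed explicitly with `isinstance`.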
- SQLMesh models: `a_raw.*`, `b_staging.*`, `c_marts.*`.
- Column names: `snake_case`.
- Avoid bare `except:`; catch `Exception` at boundaries only.
- Prefer structured logging (the `logging` module) over `print` for libraries.
- Include enough context for debugging (what operation failed, exception message).
- Re-raise if failure should abort the run.
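A minimal sketch of this boundary pattern (`fetch_rows` and `load` are hypothetical stand-ins, not functions in this repo):

```python
import logging

logger = logging.getLogger(__name__)


def fetch_rows(source: str) -> list[dict]:
    # Hypothetical fetch step; raises on bad input.
    if not source:
        raise ValueError("source must be non-empty")
    return [{"source": source, "n": 1}]


def load(source: str) -> list[dict]:
    try:
        return fetch_rows(source)
    except Exception:
        # Catch broadly only at this boundary; log which operation failed
        # with a full traceback, then re-raise so the run aborts.
        logger.exception("loading rows from %r failed", source)
        raise
```

`logger.exception(...)` records the message at ERROR level with the traceback attached, which satisfies the "enough context for debugging" rule without swallowing the error.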
- Prefer Polars for DataFrame operations in Python models.
- Keep column naming consistent: `snake_case`.
- Keep `SELECT` lists explicit; avoid `SELECT *` in staging and marts.
- Use consistent casing for SQL keywords (match surrounding code).
- Prefer CTEs for multi-step transforms.
Conventions for Python-based SQLMesh models (`transformation/models/`):
- Column types in the `@model` decorator: use lowercase (`"date"`, `"text"`, `"float"`).
- Reference data (seeds): static reference data lives in CSV seed files under `transformation/seeds/`. Python models read seed tables at runtime via `context.resolve_table()` + `context.fetchdf()`. Do not hardcode reference data inline in model files.
- Returning empty DataFrames: Python models must not return an empty DataFrame. If your model could produce one, yield conditionally or return an empty generator instead:

  ```python
  if df.is_empty():
      yield from ()
  else:
      yield df
  ```
- Error handling: catch `Exception` at the top level, log with `logger.exception(...)` for full traceback context, then re-raise so the pipeline fails visibly on persistent errors.
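The empty-DataFrame rule above can be exercised in isolation with a stand-in frame object (`FakeFrame` only mimics Polars' `is_empty()` for illustration; it is not the real API):

```python
class FakeFrame:
    """Stand-in mimicking Polars' DataFrame.is_empty(), for illustration only."""

    def __init__(self, rows: list[dict]):
        self.rows = rows

    def is_empty(self) -> bool:
        return not self.rows


def execute(df: FakeFrame):
    # Yield the frame only when it has rows; an empty run yields nothing at all,
    # so the caller iterating the model never receives an empty frame.
    if df.is_empty():
        yield from ()
    else:
        yield df
```

With an empty input, `list(execute(...))` is `[]` rather than a list containing an empty frame.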
- Local DB path: `data/db.duckdb` (see `transformation/config.yaml`).
- `data/` and `logs/` are local artifacts, git-ignored.
Data Analyst Best Practices:
- Query from `c_marts.*` tables whenever possible; avoid querying raw or staging layers.
- Document assumptions and methodology in reports.
- Prefer explicit `SELECT` lists over `SELECT *`.