diff --git a/.claude/skills/bdd-features/SKILL.md b/.claude/skills/bdd-features/SKILL.md new file mode 100644 index 0000000..8d0bb12 --- /dev/null +++ b/.claude/skills/bdd-features/SKILL.md @@ -0,0 +1,105 @@ +--- +name: bdd-features +description: "This skill should be used when the user asks to \"write Gherkin\", \"create feature files\", \"generate BDD scenarios\", \"write acceptance tests in Gherkin\", \"create Behave features\", \"write Given When Then tests\", \"BDD test cases for my pipeline\", \"Gherkin for Unity Catalog\", or wants to translate requirements into Gherkin feature files for Databricks." +user-invocable: true +--- + +# BDD Features — Gherkin Feature File Generation + +Generate well-structured Gherkin `.feature` files for Databricks workloads. Translate requirements, user stories, or existing code into behavior specifications using Given/When/Then syntax. + +## When to use + +- Translating requirements or user stories into Gherkin acceptance criteria +- Creating feature files for Databricks pipelines, catalog permissions, jobs, or Apps +- Writing regression tests in Gherkin for existing functionality +- Generating Scenario Outlines for data-driven testing + +## Process + +### 1. Identify the test subject + +Determine what to test. Read the relevant code or ask the user: + +- A Lakeflow SDP pipeline definition → pipeline behavior tests +- Unity Catalog grants/policies → permission verification tests +- A FastAPI Databricks App → API endpoint tests +- A notebook or job → execution and output validation tests +- SQL transformations → data quality and correctness tests + +### 2. Write the feature file + +Place feature files in the appropriate subdirectory under `features/`: + +``` +features/ +├── catalog/permissions.feature +├── pipelines/events_pipeline.feature +├── apps/api_endpoints.feature +├── jobs/etl_notebook.feature +└── sql/data_quality.feature +``` + +**Structure every feature file with:** + +1. **Tags** — `@domain`, `@smoke`/`@regression`/`@integration`, optional `@slow` or `@wip` +2. **Feature header** — name + As a / I want / So that narrative +3. **Background** — shared Given steps (workspace connection, test schema) +4. **Scenarios** — one behavior per scenario, descriptive names + +Refer to `references/gherkin-patterns.md` for Databricks-specific Gherkin patterns covering: +- Pipeline lifecycle (full refresh, incremental, failure handling) +- Unity Catalog grants, column masks, row filters +- App endpoint testing with SSO headers +- Job/notebook execution and output validation +- SQL data quality assertions +- Scenario Outlines for parameterized testing + +### 3. Gherkin writing principles + +**Declarative, not imperative.** Describe *what* the system should do, not *how* to click buttons: + +```gherkin +# Good — declarative +When I grant SELECT on "catalog.schema.table" to group "readers" +Then the group "readers" should have SELECT permission + +# Bad — imperative +When I open the Catalog Explorer +And I click on the table "catalog.schema.table" +And I click "Permissions" +And I click "Grant" +And I select "SELECT" +And I type "readers" in the group field +And I click "Save" +``` + +**One behavior per scenario.** If a scenario tests two independent things, split it. + +**Use Backgrounds for shared setup.** Avoid repeating connection/schema steps across scenarios. + +**Scenario Outlines for data variations.** When the same behavior is tested with different inputs, use Examples tables instead of duplicating scenarios. 
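+
+For instance, one outline can replace several near-identical existence scenarios — a minimal sketch reusing the table-existence steps from the pattern library (table names are illustrative):
+
+```gherkin
+Scenario Outline: Core tables exist after a pipeline run
+  Then the <table_type> "<table_name>" should exist
+
+  Examples:
+    | table_type        | table_name        |
+    | streaming table   | bronze_events     |
+    | materialized view | silver_events_agg |
+```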
+
+**Tag strategically:**
+- `@smoke` — fast, critical-path tests (< 30 seconds each)
+- `@regression` — thorough coverage (minutes)
+- `@integration` — needs live workspace (skip in unit test CI)
+- `@slow` — pipeline tests, job executions (> 2 minutes)
+
+**CRITICAL — Curly braces break step matching.** Behave uses the `parse` library for step matching, which treats `{anything}` in step text as a capture group, not a literal. Never write `{test_schema}.table_name` in feature files — it will fail to match step definitions. Instead, use short table names (`"customers"`) and resolve the schema in step code.
+
+**Trailing colons matter.** When a step has an attached data table or docstring, the `:` at the end of the Gherkin line IS part of the step text. The step pattern must include it: `@given('a table "{name}" with data:')` — not `with data` (no colon).
+
+### 4. Validate step coverage
+
+After writing features, check that step definitions exist for all steps:
+
+```bash
+uv run behave --dry-run
+```
+
+Any undefined steps will be reported with suggested snippets. Hand those to the `bdd-steps` skill for implementation.
+
+## Additional resources
+
+- **`references/gherkin-patterns.md`** — Complete Databricks Gherkin pattern library with examples for every domain
diff --git a/.claude/skills/bdd-features/references/gherkin-patterns.md b/.claude/skills/bdd-features/references/gherkin-patterns.md
new file mode 100644
index 0000000..1913252
--- /dev/null
+++ b/.claude/skills/bdd-features/references/gherkin-patterns.md
@@ -0,0 +1,446 @@
+# Gherkin Patterns for Databricks
+
+Reusable Gherkin patterns for common Databricks testing scenarios. Copy and adapt these to feature files.
+
+> **WARNING: Curly braces in step text break Behave's `parse` matcher.**
+>
+> Behave uses Python's `parse` library for step matching. Any `{...}` in step text
+> is interpreted as a capture group. Writing `{test_schema}.customers` in a step line
+> will **silently fail to match** your step definition.
+>
+> **The correct pattern:**
+> - Step text uses **short table names in quotes**: `"customers"`, `"orders"`
+> - SQL inside **docstrings** (triple-quoted blocks) can safely use `{schema}` because
+>   docstrings are accessed via `context.text`, not step matching
+> - Step definitions prepend `context.test_schema + "."` internally to build the FQN
+>
+> ```python
+> # WRONG - step text with curly braces
+> @given('a table "{test_schema}.customers" exists')  # BROKEN - parse eats {test_schema}
+>
+> # RIGHT - short name in step text, FQN built in the step body
+> @given('a managed table "{table_name}" exists')
+> def step_impl(context, table_name):
+>     fqn = f"{context.test_schema}.{table_name}"
+>     # ... use fqn
+> ```
+>
+> **Docstring SQL pattern** (safe because `context.text` is just a string):
+> ```python
+> @when('I execute SQL:')
+> def step_impl(context):
+>     sql = context.text.replace("{schema}", context.test_schema)
+>     # ... execute sql
+> ```
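+
+The feature-file side of the same pattern, as a minimal sketch (it relies on the short-name and docstring steps defined in the step library later in this document):
+
+```gherkin
+Scenario: Query a table without curly braces in step text
+  Given a managed table "customers" exists
+  When I execute SQL:
+    """sql
+    SELECT COUNT(*) AS cnt FROM {schema}.customers
+    """
+  Then the result should have 1 rows
+  And the first row column "cnt" should be "0"
+```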
+
+## Common Background
+
+Most Databricks feature files share this Background:
+
+```gherkin
+Background:
+  Given a Databricks workspace connection is established
+  And a test schema is provisioned
+```
+
+---
+
+## Unity Catalog
+
+### Table permissions
+
+```gherkin
+@catalog @permissions
+Feature: Unity Catalog table permissions
+  As a data engineer
+  I want to verify table-level permissions
+  So that sensitive data is properly protected
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Grant SELECT to a group
+    Given a managed table "customers" exists
+    When I execute SQL:
+      """sql
+      GRANT SELECT ON TABLE {schema}.customers TO `data_readers`
+      """
+    And I execute SQL:
+      """sql
+      SHOW GRANTS ON TABLE {schema}.customers
+      """
+    Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "data_readers"
+
+  Scenario Outline: Verify multiple privilege types
+    Given a managed table "sales" exists
+    When I execute SQL:
+      """sql
+      GRANT <privilege> ON TABLE {schema}.sales TO `<group>`
+      """
+    And I execute SQL:
+      """sql
+      SHOW GRANTS ON TABLE {schema}.sales
+      """
+    Then the result should contain a row where "ActionType" is "<privilege>" and "Principal" is "<group>"
+
+    Examples:
+      | privilege | group        |
+      | SELECT    | data_readers |
+      | MODIFY    | data_writers |
+```
+
+### Column masks
+
+```gherkin
+@catalog @security
+Feature: Column-level security
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Mask PII columns for analysts
+    Given a managed table "customers" with columns:
+      | column_name | data_type | contains_pii |
+      | id          | BIGINT    | false        |
+      | name        | STRING    | true         |
+      | email       | STRING    | true         |
+      | region      | STRING    | false        |
+    And a column mask function "mask_pii" is applied to "name" and "email" on "customers"
+    When I query "customers" as group "analysts"
+    Then columns "name" and "email" should return masked values
+    But columns "id" and "region" should return actual values
+```
+
+### Row filters
+
+```gherkin
+@catalog @security
+Feature: Row-level security
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Row filter restricts by region
+    Given a managed table "regional_sales" with data:
+      | region | revenue | quarter |
+      | APAC   | 50000   | Q1      |
+      | EMEA   | 75000   | Q1      |
+      | AMER   | 100000  | Q1      |
+    And a row filter on "regional_sales" restricts "apac_analysts" to region "APAC"
+    When I query "regional_sales" as group "apac_analysts"
+    Then I should only see rows where "region" is "APAC"
+    And the result should have 1 rows
+```
+
+---
+
+## Lakeflow Spark Declarative Pipelines
+
+### Pipeline lifecycle
+
+```gherkin
+@pipeline @lakeflow
+Feature: Events pipeline processing
+  As a data engineer
+  I want to verify the events pipeline processes data correctly
+  So that downstream consumers get accurate aggregations
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  @integration @slow
+  Scenario: Full refresh produces expected tables
+    Given a pipeline "events_pipeline" exists targeting the test schema
+    When I trigger a full refresh of the pipeline
+    Then the pipeline update should succeed within 600 seconds
+    And the streaming table "bronze_events" should exist
+    And the materialized view "silver_events_agg" should exist
+    And the table "silver_events_agg" should have more than 0 rows
+
+  @integration
+  Scenario: 
Incremental refresh picks up new data + Given the pipeline "events_pipeline" has completed a full refresh + When I insert test records into the source + And I trigger an incremental refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the new records should appear in "bronze_events" + + Scenario: Pipeline handles empty source gracefully + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table is empty + When I trigger a full refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the streaming table "bronze_events" should have 0 rows +``` + +### Pipeline failure handling + +```gherkin + Scenario: Pipeline surfaces schema mismatch errors + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table has an unexpected column "extra_col" of type "BINARY" + When I trigger a full refresh of the pipeline + Then the pipeline update should fail + And the pipeline error should mention schema +``` + +--- + +## Jobs and Notebooks + +### Notebook execution + +```gherkin +@jobs @notebook +Feature: Customer ETL notebook + As a data engineer + I want to verify the ETL notebook produces correct output + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @integration @slow + Scenario: Dedup notebook removes duplicates + Given a managed table "raw_customers" with data: + | customer_id | name | email | updated_at | + | 1 | Alice | alice@example.com | 2024-01-01T00:00:00 | + | 1 | Alice B. | alice@example.com | 2024-06-01T00:00:00 | + | 2 | Bob | bob@example.com | 2024-03-15T00:00:00 | + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | raw_customers | + | target_table | clean_customers| + Then the job should complete with status "SUCCESS" within 300 seconds + And the table "clean_customers" should have 2 rows + And the table "clean_customers" should contain a row where "customer_id" is "1" and "name" is "Alice B." 
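+
+  # Illustrative companion scenario (same step vocabulary as above; assumes
+  # re-running the dedup notebook over its own output is a no-op)
+  Scenario: Dedup notebook is idempotent on re-run
+    Given the table "clean_customers" has been loaded
+    When I run the notebook "/Repos/team/etl/customer_dedup" with parameters:
+      | key          | value           |
+      | source_table | clean_customers |
+      | target_table | clean_customers |
+    Then the job should complete with status "SUCCESS" within 300 seconds
+    And the table "clean_customers" should have 2 rows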
+ + Scenario: Notebook fails gracefully on missing source + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | nonexistent | + | target_table | output | + Then the job should complete with status "FAILED" within 120 seconds +``` + +--- + +## Databricks Apps (FastAPI) + +### API endpoint testing + +```gherkin +@app @fastapi +Feature: Databricks App API + As a user + I want the app endpoints to work correctly + + Background: + Given the app is running at the configured base URL + And the test user is "testuser@databricks.com" + + @smoke + Scenario: Health check + When I GET "/health" + Then the response status should be 200 + And the response JSON should contain "status" with value "healthy" + + Scenario: Authenticated user can list resources + When I GET "/api/dashboards" with auth headers + Then the response status should be 200 + And the response should be a JSON list + + Scenario: Unauthenticated request is rejected + When I GET "/api/dashboards" without auth headers + Then the response status should be 401 + + Scenario: POST creates a resource + When I POST "/api/items" with auth headers and body: + """json + {"name": "Test Item", "description": "Created by BDD test"} + """ + Then the response status should be 201 + And the response JSON should contain "name" with value "Test Item" +``` + +### App deployment testing + +```gherkin +@app @deployment @slow +Feature: App deployment lifecycle + Scenario: Deploy and verify app is running + Given a bundle project at the repository root + When I deploy using Asset Bundles with target "dev" + Then the deployment should succeed + And the app should reach "RUNNING" state within 120 seconds + And the app health endpoint should return 200 +``` + +--- + +## SQL Data Quality + +### Row counts and data validation + +```gherkin +@sql @data-quality +Feature: Data quality checks + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @smoke + Scenario: Table is not empty + Given the table "orders" has been loaded + Then the table "orders" should have more than 0 rows + + Scenario: No duplicate primary keys + Given the table "orders" has been loaded + When I execute SQL: + """sql + SELECT order_id, COUNT(*) as cnt + FROM {schema}.orders + GROUP BY order_id + HAVING COUNT(*) > 1 + """ + Then the result should have 0 rows + + Scenario: Foreign key integrity + Given the tables "orders" and "customers" have been loaded + When I execute SQL: + """sql + SELECT o.customer_id + FROM {schema}.orders o + LEFT JOIN {schema}.customers c ON o.customer_id = c.customer_id + WHERE c.customer_id IS NULL + """ + Then the result should have 0 rows + + Scenario: No null values in required columns + When I execute SQL: + """sql + SELECT COUNT(*) as null_count + FROM {schema}.orders + WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL + """ + Then the first row column "null_count" should be "0" + + Scenario: Verify GRANT was applied via SQL + Given a managed table "products" exists + When I execute SQL: + """sql + GRANT SELECT ON TABLE {schema}.products TO `reporting_team` + """ + And I execute SQL: + """sql + SHOW GRANTS ON TABLE {schema}.products + """ + Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "reporting_team" +``` + +--- + +## Asset Bundles Deployment + +```gherkin +@deployment @dabs +Feature: Bundle lifecycle + @smoke + Scenario: Bundle validates successfully + When I run "databricks bundle 
validate" with target "dev" + Then the command should exit with code 0 + + @integration @slow + Scenario: Deploy and destroy lifecycle + When I run "databricks bundle deploy" with target "dev" + Then the command should exit with code 0 + When I run "databricks bundle destroy" with target "dev" and auto-approve + Then the command should exit with code 0 +``` + +--- + +## Scenario Outline patterns + +Use Scenario Outlines for testing multiple variations of the same behavior. + +Note: table names in the Examples table are short names (no schema prefix). The step +definition prepends `context.test_schema` to build the fully-qualified name. + +```gherkin + Scenario Outline: Verify table existence after pipeline run + Then the "" should exist + + Examples: Streaming tables + | table_type | table_name | + | streaming table | bronze_events | + | streaming table | bronze_transactions| + + Examples: Materialized views + | table_type | table_name | + | materialized view | silver_events_agg| + | materialized view | gold_summary | +``` + +--- + +## Steps with data tables and docstrings + +Steps that accept a data table or docstring **must** end with a trailing colon. The colon +is part of the step text that Behave matches against your `@given`/`@when`/`@then` decorator. + +```gherkin +# CORRECT - colon before data table +Given a managed table "customers" with data: + | id | name | region | + | 1 | Alice | APAC | + | 2 | Bob | EMEA | + +# CORRECT - colon before docstring +When I execute SQL: + """sql + SELECT * FROM {schema}.customers + """ + +# WRONG - missing colon, Behave will not match the step +Given a managed table "customers" with data + | id | name | region | +``` + +--- + +## SHOW GRANTS column names + +`SHOW GRANTS` returns PascalCase column names. Use these exact names when asserting +on grant results: + +| Column | Description | +|--------------|------------------------------------------------| +| `Principal` | The user, group, or service principal | +| `ActionType` | The privilege (SELECT, MODIFY, ALL PRIVILEGES) | +| `ObjectType` | TABLE, SCHEMA, CATALOG, etc. | +| `ObjectKey` | The fully-qualified object name | + +--- + +## Tag strategy + +| Tag | Purpose | Typical runtime | +|-----|---------|----------------| +| `@smoke` | Critical path, must always pass | < 30s per scenario | +| `@regression` | Full coverage | Minutes | +| `@integration` | Needs live workspace | Varies | +| `@slow` | Pipeline/job execution | > 2 min | +| `@wip` | Work in progress, skip by default | N/A | +| `@skip` | Explicitly disabled | N/A | +| `@catalog` | Unity Catalog tests | Varies | +| `@pipeline` | Lakeflow SDP tests | Minutes | +| `@jobs` | Job/notebook tests | Minutes | +| `@app` | Databricks Apps tests | Seconds | +| `@sql` | SQL/data quality tests | Seconds | +| `@deployment` | DABs lifecycle tests | Minutes | diff --git a/.claude/skills/bdd-run/SKILL.md b/.claude/skills/bdd-run/SKILL.md new file mode 100644 index 0000000..f8f242e --- /dev/null +++ b/.claude/skills/bdd-run/SKILL.md @@ -0,0 +1,145 @@ +--- +name: bdd-run +description: "This skill should be used when the user asks to \"run BDD tests\", \"execute Behave\", \"run Gherkin tests\", \"run my feature files\", \"behave test results\", \"run smoke tests\", \"BDD test report\", or needs to execute Behave test suites with specific options like tag filtering, parallel execution, or CI reporting." 
+user-invocable: true +--- + +# BDD Run — Execute and Report Behave Tests + +Execute Behave test suites with tag filtering, parallel execution, output formatting, and CI integration. Diagnose failures and suggest fixes. + +## When to use + +- Running the full BDD test suite or a subset by tags +- Getting JUnit/JSON reports for CI pipelines +- Re-running only failed scenarios +- Running tests in parallel for speed +- Diagnosing and triaging test failures + +## Process + +### 1. Pre-flight checks + +Before running tests, verify the environment: + +```bash +# Verify Behave is installed +uv run behave --version + +# Verify Databricks auth +uv run python -c "from databricks.sdk import WorkspaceClient; print(WorkspaceClient().current_user.me().user_name)" + +# Dry run to check step coverage +uv run behave --dry-run +``` + +If any undefined steps are found, report them and suggest using the `bdd-steps` skill. + +### 2. Execute tests + +**Run by tag (most common):** + +```bash +# Smoke tests only +uv run behave --tags="@smoke" --format=pretty + +# All except slow and WIP +uv run behave --tags="not @slow and not @wip" + +# Specific domain +uv run behave --tags="@catalog" +uv run behave --tags="@pipeline" + +# Boolean combinations +uv run behave --tags="(@catalog or @pipeline) and @smoke" +``` + +**Run specific feature file or directory:** + +```bash +uv run behave features/catalog/permissions.feature +uv run behave features/pipelines/ +``` + +**Run by scenario name:** + +```bash +uv run behave --name "Grant SELECT on a table" +``` + +**Pass runtime configuration:** + +```bash +uv run behave -D warehouse_id=abc123 -D catalog=my_catalog -D environment=dev +``` + +### 3. Output and reporting + +**For local development:** + +```bash +uv run behave --format=pretty --show-timings +``` + +**For CI pipelines (JUnit XML):** + +```bash +uv run behave --junit --junit-directory=reports/behave/ --format=progress +``` + +**JSON output for programmatic analysis:** + +```bash +uv run behave --format=json --outfile=reports/results.json --format=progress +``` + +**Multiple formatters simultaneously:** + +```bash +uv run behave --format=pretty --format=json --outfile=reports/results.json +``` + +### 4. Re-run failed tests + +Configure rerun file output, then re-run only failures: + +```bash +# First run captures failures +uv run behave --format=rerun --outfile=reports/rerun.txt --format=pretty + +# Re-run only failed scenarios +uv run behave @reports/rerun.txt +``` + +### 5. Parallel execution + +Behave has no built-in parallelism. Use `behavex` for parallel feature execution: + +```bash +uv run behavex --parallel-processes 4 --parallel-scheme feature +``` + +Each parallel worker needs its own test schema to avoid cross-contamination. The `environment.py` template from `bdd-scaffold` handles this by using timestamped schema names with worker ID suffixes. + +### 6. Failure diagnosis + +When tests fail, read the output and categorize: + +| Failure type | Symptom | Action | +|-------------|---------|--------| +| Undefined step | `NotImplementedError` or "undefined" in output | Generate step with `bdd-steps` | +| Auth failure | `PermissionDenied`, 401/403 | Check `databricks auth profiles` | +| Timeout | `TimeoutError` in polling steps | Increase timeout parameter or check resource state | +| Data mismatch | Assertion error with expected vs. 
actual | Check test data setup or query logic |
+| Schema not found | `SCHEMA_NOT_FOUND` | Verify `before_all` created the ephemeral schema |
+| Warehouse stopped | `WAREHOUSE_NOT_RUNNING` | Start warehouse or use `@fixture.sql_warehouse` tag hook |
+
+### 7. Makefile integration
+
+If a Makefile exists, prefer `make` targets:
+
+```bash
+make bdd          # Full suite
+make bdd-smoke    # Smoke tests
+make bdd-report   # JUnit for CI
+```
diff --git a/.claude/skills/bdd-scaffold/SKILL.md b/.claude/skills/bdd-scaffold/SKILL.md
new file mode 100644
index 0000000..d24c5ca
--- /dev/null
+++ b/.claude/skills/bdd-scaffold/SKILL.md
@@ -0,0 +1,114 @@
+---
+name: bdd-scaffold
+description: "This skill should be used when the user asks to \"set up BDD\", \"create a Behave project\", \"scaffold BDD tests\", \"initialize Behave\", \"add BDD to my project\", \"set up Gherkin testing\", \"create test structure for Behave\", or mentions setting up behavior-driven development testing. Generates a complete Behave project structure wired to Databricks SDK."
+user-invocable: true
+---
+
+# BDD Scaffold — Behave + Databricks Project Setup
+
+Generate a complete Python Behave project structure pre-wired with Databricks SDK integration, including `environment.py` hooks, test isolation via ephemeral schemas, and `behave.ini` configuration.
+
+## When to use
+
+- Starting a new BDD test suite for a Databricks project
+- Adding Behave-based acceptance tests to an existing repo
+- Setting up integration testing against Unity Catalog, pipelines, jobs, or Apps
+
+## Process
+
+### 1. Detect project context
+
+Identify the project root and existing tooling:
+
+```bash
+git rev-parse --show-toplevel
+```
+
+Check for existing test infrastructure: `pyproject.toml`, `Makefile`, `behave.ini`, `features/` directory. If a `features/` directory already exists, confirm before overwriting.
+
+### 2. Determine test domains
+
+Ask (or infer from the codebase) which Databricks domains to scaffold step files for:
+
+| Domain | Step file | When |
+|--------|-----------|------|
+| Unity Catalog | `catalog_steps.py` | Tables, schemas, grants, row filters, column masks |
+| Pipelines | `pipeline_steps.py` | Lakeflow SDP, streaming tables, materialized views |
+| Jobs | `job_steps.py` | Notebook runs, workflow tasks, job clusters |
+| Apps | `app_steps.py` | FastAPI endpoints, SSO headers, deployment |
+| SQL | `sql_steps.py` | Statement execution, warehouse queries, data validation |
+
+Always generate `common_steps.py` (shared workspace connection, row counting, table existence checks).
+
+### 3. Generate the directory structure
+
+```
+features/
+├── environment.py          # Databricks SDK setup, ephemeral schema lifecycle
+├── steps/
+│   ├── common_steps.py     # Shared steps (always generated)
+│   └── <domain>_steps.py   # Per-domain (based on step 2)
+├── catalog/                # Feature file directories (one per domain)
+├── pipelines/
+├── jobs/
+├── apps/
+└── sql/
+behave.ini
+Makefile                    # (append BDD targets if Makefile exists)
+```
+
+Refer to `references/environment-template.md` for the full `environment.py` template with:
+- `before_all`: WorkspaceClient init, warehouse auto-discovery, ephemeral schema creation
+- `after_all`: Schema cascade drop
+- `before_scenario` / `after_scenario`: Per-scenario resource tracking and cleanup
+- Tag-based hooks for `@wip`, `@skip`, `@slow`
+
+Refer to `references/behave-config.md` for `behave.ini` and `pyproject.toml` configuration.
+
+### 4. 
Add dependencies + +If `pyproject.toml` exists and uses `uv`: + +```bash +uv add --group test behave databricks-sdk httpx +``` + +If no `pyproject.toml`, create a minimal one with test dependencies. + +### 5. Add Makefile targets + +Append these targets (or create a Makefile if none exists): + +```makefile +.PHONY: bdd bdd-smoke bdd-report + +bdd: + uv run behave --format=pretty + +bdd-smoke: + uv run behave --tags="@smoke" --format=pretty + +bdd-report: + uv run behave --junit --junit-directory=reports/ --format=progress +``` + +### 6. Verify scaffold + +Run `behave --dry-run` to confirm step discovery works and there are no import errors: + +```bash +uv run behave --dry-run +``` + +Report the generated structure and next steps to the user. + +## Key design decisions + +- **Ephemeral schemas** — each test run creates a timestamped schema (`behave_test_YYYYMMDD_HHMMSS`) and drops it in `after_all`. Prevents cross-run contamination. +- **`-D` userdata** for parameterization — warehouse IDs, catalog names, and targets are passed via CLI args, never hardcoded. +- **Step files are globally scoped** in Behave — all files in `steps/` are imported regardless of which feature runs. Name step patterns carefully to avoid collisions. + +## Additional resources + +- **`references/environment-template.md`** — Full annotated environment.py template +- **`references/behave-config.md`** — behave.ini and pyproject.toml configuration reference diff --git a/.claude/skills/bdd-scaffold/references/behave-config.md b/.claude/skills/bdd-scaffold/references/behave-config.md new file mode 100644 index 0000000..d994f51 --- /dev/null +++ b/.claude/skills/bdd-scaffold/references/behave-config.md @@ -0,0 +1,134 @@ +# Behave Configuration Reference + +## behave.ini + +Standard Behave configuration file. Place at project root. + +```ini +[behave] +# Output +default_format = pretty +show_timings = true +color = true + +# Default tag filter — skip WIP and explicitly skipped tests +default_tags = not @wip and not @skip + +# Logging +logging_level = INFO +logging_format = %(asctime)s %(levelname)-8s %(name)s: %(message)s + +# Capture control +stdout_capture = true +log_capture = true + +# JUnit output (enable in CI) +junit = false +junit_directory = reports/ + +# Feature paths +paths = features/ + +[behave.userdata] +# Override with -D key=value on CLI +warehouse_id = auto +catalog = main +environment = dev +``` + +## pyproject.toml + +Alternative configuration via pyproject.toml (Behave reads `[tool.behave]`): + +**IMPORTANT:** In `pyproject.toml`, `default_tags` must be a **list**, not a string. 
The `behave.ini` parser accepts a plain string, but the TOML parser is stricter:
+
+```toml
+[tool.behave]
+default_format = "pretty"
+show_timings = true
+default_tags = ["not @wip and not @skip"]  # MUST be a list in pyproject.toml
+junit = false
+junit_directory = "reports/"
+logging_level = "INFO"
+
+[tool.behave.userdata]
+warehouse_id = "auto"
+catalog = "main"
+environment = "dev"
+```
+
+## Dependencies
+
+Add to `pyproject.toml` as PEP 735 dependency groups — `[dependency-groups]` is the table that `uv add --group` writes to:
+
+```toml
+[dependency-groups]
+test = [
+    "behave>=1.2.6",
+    "databricks-sdk>=0.40.0",
+    "httpx>=0.27.0",
+]
+
+# Or for parallel execution
+test-parallel = [
+    "behave>=1.2.6",
+    "behavex>=3.0",
+    "databricks-sdk>=0.40.0",
+    "httpx>=0.27.0",
+]
+```
+
+With `uv`:
+
+```bash
+uv add --group test behave databricks-sdk httpx
+```
+
+## Makefile targets
+
+```makefile
+.PHONY: bdd bdd-smoke bdd-report bdd-rerun bdd-parallel bdd-dry-run
+
+bdd:
+	uv run behave --format=pretty --show-timings
+
+bdd-smoke:
+	uv run behave --tags="@smoke" --format=pretty
+
+bdd-report:
+	uv run behave --junit --junit-directory=reports/behave/ --format=progress
+
+bdd-rerun:
+	uv run behave @reports/rerun.txt
+
+bdd-parallel:
+	uv run behavex --parallel-processes 4 --parallel-scheme feature
+
+bdd-dry-run:
+	uv run behave --dry-run
+```
+
+## CI integration (GitHub Actions example)
+
+```yaml
+- name: Run BDD tests
+  env:
+    DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }}
+    DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }}
+    DATABRICKS_WAREHOUSE_ID: ${{ secrets.WAREHOUSE_ID }}
+    TEST_CATALOG: ci_test
+  run: |
+    uv run behave \
+      --tags="not @slow" \
+      --junit --junit-directory=reports/behave/ \
+      --format=progress \
+      -D catalog=$TEST_CATALOG \
+      -D warehouse_id=$DATABRICKS_WAREHOUSE_ID
+
+- name: Upload test results
+  if: always()
+  uses: actions/upload-artifact@v4
+  with:
+    name: behave-results
+    path: reports/behave/
+```
diff --git a/.claude/skills/bdd-scaffold/references/environment-template.md b/.claude/skills/bdd-scaffold/references/environment-template.md
new file mode 100644
index 0000000..2a7dc1b
--- /dev/null
+++ b/.claude/skills/bdd-scaffold/references/environment-template.md
@@ -0,0 +1,195 @@
+# environment.py Template — Databricks + Behave
+
+Complete annotated template for `features/environment.py`. Copy and adapt to the target project.
+
+## Full template
+
+```python
+"""Behave environment hooks — Databricks SDK integration.
+
+Sets up workspace connection, ephemeral test schema, and per-scenario cleanup.
+"""
+from __future__ import annotations
+
+import logging
+import os
+from datetime import datetime
+
+from behave.model import Feature, Scenario, Step
+from behave.runner import Context
+
+logger = logging.getLogger("behave.databricks")
+
+
+# ─── Session-level hooks ──────────────────────────────────────────
+
+def before_all(context: Context) -> None:
+    """Initialize Databricks clients and create ephemeral test schema."""
+    from databricks.sdk import WorkspaceClient
+
+    context.workspace = WorkspaceClient()
+
+    # Fix host URL — some profiles include ?o= which breaks SDK API paths.
+    # The CLI handles this transparently but the SDK does not.
+    if context.workspace.config.host and "?" 
in context.workspace.config.host: + clean_host = context.workspace.config.host.split("?")[0].rstrip("/") + profile = os.environ.get("DATABRICKS_CONFIG_PROFILE") + context.workspace = WorkspaceClient(profile=profile, host=clean_host) + + # Verify auth + me = context.workspace.current_user.me() + context.current_user = me.user_name + logger.info("Authenticated as: %s", context.current_user) + + # Warehouse — from -D userdata, env var, or auto-discover + userdata = context.config.userdata + context.warehouse_id = ( + userdata.get("warehouse_id") + or os.environ.get("DATABRICKS_WAREHOUSE_ID") + or _discover_warehouse(context.workspace) + ) + logger.info("Using warehouse: %s", context.warehouse_id) + + # Catalog — from -D userdata or env var + context.test_catalog = userdata.get("catalog", os.environ.get("TEST_CATALOG", "main")) + + # Create ephemeral schema (timestamped for isolation) + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + worker = os.environ.get("BEHAVE_WORKER_ID", "0") + context.test_schema = f"{context.test_catalog}.behave_test_{ts}_w{worker}" + + _execute_sql(context, f"CREATE SCHEMA IF NOT EXISTS {context.test_schema}") + logger.info("Created test schema: %s", context.test_schema) + + +def after_all(context: Context) -> None: + """Drop ephemeral test schema.""" + if hasattr(context, "test_schema"): + try: + _execute_sql(context, f"DROP SCHEMA IF EXISTS {context.test_schema} CASCADE") + logger.info("Dropped test schema: %s", context.test_schema) + except Exception as e: + logger.warning("Failed to drop test schema %s: %s", context.test_schema, e) + + +# ─── Feature-level hooks ──────────────────────────────────────── + +def before_feature(context: Context, feature: Feature) -> None: + """Log feature start. Skip if tagged @skip.""" + logger.info("▶ Feature: %s", feature.name) + if "skip" in feature.tags: + feature.skip("Marked with @skip") + + +def after_feature(context: Context, feature: Feature) -> None: + logger.info("◀ Feature: %s [%s]", feature.name, feature.status) + + +# ─── Scenario-level hooks ─────────────────────────────────────── + +def before_scenario(context: Context, scenario: Scenario) -> None: + """Initialize per-scenario state. 
Skip @wip scenarios.""" + logger.info(" ▶ Scenario: %s", scenario.name) + if "wip" in scenario.tags: + scenario.skip("Work in progress") + return + # Track resources created during this scenario for cleanup + context.scenario_cleanup_sql = [] + + +def after_scenario(context: Context, scenario: Scenario) -> None: + """Clean up scenario-specific resources.""" + for sql in getattr(context, "scenario_cleanup_sql", []): + try: + _execute_sql(context, sql) + except Exception as e: + logger.warning("Cleanup SQL failed: %s — %s", sql, e) + if scenario.status == "failed": + logger.error(" ✗ FAILED: %s", scenario.name) + else: + logger.info(" ◀ Scenario: %s [%s]", scenario.name, scenario.status) + + +# ─── Step-level hooks ─────────────────────────────────────────── + +def before_step(context: Context, step: Step) -> None: + context._step_start = datetime.now() + + +def after_step(context: Context, step: Step) -> None: + elapsed = (datetime.now() - context._step_start).total_seconds() + if elapsed > 10: + logger.warning(" Slow step (%.1fs): %s %s", elapsed, step.keyword, step.name) + if step.status == "failed": + logger.error(" ✗ %s %s\n %s", step.keyword, step.name, step.error_message) + + +# ─── Tag-based hooks ──────────────────────────────────────────── + +def before_tag(context, tag: str) -> None: + """Ensure resources for tagged scenarios.""" + if tag == "fixture.sql_warehouse": + _ensure_warehouse_running(context) + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str) -> object: + """Execute a SQL statement via the Statement Execution API.""" + return context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + + +def _discover_warehouse(workspace) -> str: + """Find the first available SQL warehouse.""" + from databricks.sdk.service.sql import State + + warehouses = list(workspace.warehouses.list()) + # Prefer running warehouses + for wh in warehouses: + if wh.state == State.RUNNING: + return wh.id + if warehouses: + return warehouses[0].id + raise RuntimeError( + "No SQL warehouses found. Pass warehouse_id via -D warehouse_id= " + "or set DATABRICKS_WAREHOUSE_ID." + ) + + +def _ensure_warehouse_running(context: Context) -> None: + """Start warehouse if stopped. Used by @fixture.sql_warehouse tag.""" + from databricks.sdk.service.sql import State + + wh = context.workspace.warehouses.get(context.warehouse_id) + if wh.state != State.RUNNING: + logger.info("Starting warehouse %s...", context.warehouse_id) + context.workspace.warehouses.start(context.warehouse_id) + context.workspace.warehouses.wait_get_warehouse_running(context.warehouse_id) + logger.info("Warehouse %s is running.", context.warehouse_id) +``` + +## Context object layering + +Behave's `context` has scoped layers. Data set at different levels has different lifetimes: + +| Set in | Lifetime | Example | +|--------|----------|---------| +| `before_all` | Entire run | `context.workspace`, `context.test_schema` | +| `before_feature` | Current feature | `context.feature_data` | +| `before_scenario` / steps | Current scenario | `context.query_result`, `context.scenario_cleanup_sql` | + +At the end of each scenario, the scenario layer is popped — anything set during steps is gone. Root-level data persists across everything. + +## Parallel execution isolation + +When using `behavex` for parallel execution, each worker needs its own schema. The template uses `BEHAVE_WORKER_ID` from the environment. 
Set it in the parallel runner config or wrapper script: + +```bash +# Example wrapper for behavex +export BEHAVE_WORKER_ID=$WORKER_INDEX +behave "$@" +``` diff --git a/.claude/skills/bdd-scaffold/test-suite/.gitignore b/.claude/skills/bdd-scaffold/test-suite/.gitignore new file mode 100644 index 0000000..744fea5 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/.gitignore @@ -0,0 +1,4 @@ +.venv/ +__pycache__/ +reports/ +*.pyc diff --git a/.claude/skills/bdd-scaffold/test-suite/behave.ini b/.claude/skills/bdd-scaffold/test-suite/behave.ini new file mode 100644 index 0000000..a3c4cb0 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/behave.ini @@ -0,0 +1,13 @@ +[behave] +default_format = pretty +show_timings = true +color = true +default_tags = not @wip and not @skip +logging_level = INFO +stdout_capture = false +log_capture = false +paths = features/ + +[behave.userdata] +warehouse_id = auto +catalog = main diff --git a/.claude/skills/bdd-scaffold/test-suite/features/catalog/schema_operations.feature b/.claude/skills/bdd-scaffold/test-suite/features/catalog/schema_operations.feature new file mode 100644 index 0000000..45bacf9 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/features/catalog/schema_operations.feature @@ -0,0 +1,38 @@ +@catalog @smoke +Feature: Unity Catalog schema and table operations + As a data engineer + I want to verify Unity Catalog operations work correctly + So that I can manage my data assets with confidence + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + Scenario: Ephemeral test schema was created + Then the test schema should exist in Unity Catalog + + Scenario: Create tables and list them + Given a managed table "table_alpha" exists + And a managed table "table_beta" exists + When I list tables in the test schema + Then the table list should include "table_alpha" + And the table list should include "table_beta" + + Scenario: Table with data is queryable via SQL + Given a managed table "products" with data: + | product_id | name | price | + | 1 | Widget | 9.99 | + | 2 | Gadget | 19.99 | + | 3 | Doohickey | 4.99 | + Then the managed table "products" should have 3 rows + When I execute a query on the test schema: + """ + SELECT name, price FROM {schema}.products WHERE CAST(price AS DOUBLE) > 10.0 + """ + Then the query result should have 1 rows + And the first result column "name" should be "Gadget" + + Scenario: Grant SELECT permission on a table + Given a managed table "grant_test" exists + When I grant SELECT on managed table "grant_test" to group "users" + Then the group "users" should have SELECT on managed table "grant_test" diff --git a/.claude/skills/bdd-scaffold/test-suite/features/environment.py b/.claude/skills/bdd-scaffold/test-suite/features/environment.py new file mode 100644 index 0000000..bd09e07 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/features/environment.py @@ -0,0 +1,131 @@ +"""Behave environment hooks — Databricks SDK integration. + +Tested against azure-east workspace. 
+""" +from __future__ import annotations + +import logging +import os +from datetime import datetime + +from behave.model import Feature, Scenario, Step +from behave.runner import Context + +logger = logging.getLogger("behave.databricks") + + +def before_all(context: Context) -> None: + """Initialize Databricks clients and create ephemeral test schema.""" + from databricks.sdk import WorkspaceClient + + # Use profile from env or default + profile = os.environ.get("DATABRICKS_CONFIG_PROFILE", "azure-east") + context.workspace = WorkspaceClient(profile=profile) + + # Fix host URL — some profiles include ?o= which breaks SDK API paths + if context.workspace.config.host and "?" in context.workspace.config.host: + clean_host = context.workspace.config.host.split("?")[0].rstrip("/") + context.workspace = WorkspaceClient(profile=profile, host=clean_host) + + me = context.workspace.current_user.me() + context.current_user = me.user_name + logger.info("Authenticated as: %s", context.current_user) + + # Warehouse — from -D userdata, env var, or auto-discover + userdata = context.config.userdata + wh_id = userdata.get("warehouse_id", "auto") + if wh_id == "auto": + wh_id = os.environ.get("DATABRICKS_WAREHOUSE_ID") or _discover_warehouse( + context.workspace + ) + context.warehouse_id = wh_id + logger.info("Using warehouse: %s", context.warehouse_id) + + # Catalog + context.test_catalog = userdata.get("catalog", "main") + + # Create ephemeral schema + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + context.test_schema = f"{context.test_catalog}.behave_test_{ts}" + + _execute_sql(context, f"CREATE SCHEMA IF NOT EXISTS {context.test_schema}") + logger.info("Created test schema: %s", context.test_schema) + + +def after_all(context: Context) -> None: + """Drop ephemeral test schema.""" + if hasattr(context, "test_schema"): + try: + _execute_sql( + context, f"DROP SCHEMA IF EXISTS {context.test_schema} CASCADE" + ) + logger.info("Dropped test schema: %s", context.test_schema) + except Exception as e: + logger.warning("Failed to drop test schema %s: %s", context.test_schema, e) + + +def before_feature(context: Context, feature: Feature) -> None: + logger.info("▶ Feature: %s", feature.name) + if "skip" in feature.tags: + feature.skip("Marked with @skip") + + +def after_feature(context: Context, feature: Feature) -> None: + logger.info("◀ Feature: %s [%s]", feature.name, feature.status) + + +def before_scenario(context: Context, scenario: Scenario) -> None: + logger.info(" ▶ Scenario: %s", scenario.name) + if "wip" in scenario.tags: + scenario.skip("Work in progress") + return + context.scenario_cleanup_sql = [] + + +def after_scenario(context: Context, scenario: Scenario) -> None: + for sql in getattr(context, "scenario_cleanup_sql", []): + try: + _execute_sql(context, sql) + except Exception as e: + logger.warning("Cleanup SQL failed: %s — %s", sql, e) + if scenario.status == "failed": + logger.error(" ✗ FAILED: %s", scenario.name) + + +def before_step(context: Context, step: Step) -> None: + context._step_start = datetime.now() + + +def after_step(context: Context, step: Step) -> None: + elapsed = (datetime.now() - context._step_start).total_seconds() + if elapsed > 10: + logger.warning(" Slow step (%.1fs): %s %s", elapsed, step.keyword, step.name) + if step.status == "failed": + logger.error( + " ✗ %s %s\n %s", step.keyword, step.name, step.error_message + ) + + +# ─── Helpers ──────────────────────────────────────────────────── + + +def _execute_sql(context: Context, sql: str) -> object: + """Execute a 
SQL statement via the Statement Execution API.""" + return context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + + +def _discover_warehouse(workspace) -> str: + """Find the first available SQL warehouse.""" + from databricks.sdk.service.sql import State + + warehouses = list(workspace.warehouses.list()) + for wh in warehouses: + if wh.state == State.RUNNING: + return wh.id + if warehouses: + return warehouses[0].id + raise RuntimeError("No SQL warehouses found") diff --git a/.claude/skills/bdd-scaffold/test-suite/features/sql/data_operations.feature b/.claude/skills/bdd-scaffold/test-suite/features/sql/data_operations.feature new file mode 100644 index 0000000..8e02c68 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/features/sql/data_operations.feature @@ -0,0 +1,48 @@ +@sql @smoke +Feature: SQL data operations via Databricks + As a data engineer + I want to verify SQL operations work correctly against the warehouse + So that I can trust my data transformations + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + Scenario: Create a table and verify it exists + Given a managed table "smoke_test" exists + Then the managed table "smoke_test" should exist + + Scenario: Insert and count rows + Given a managed table "customers" with data: + | customer_id | name | email | + | 1 | Alice | alice@example.com | + | 2 | Bob | bob@example.com | + | 3 | Charlie | charlie@example.com | + Then the managed table "customers" should have 3 rows + + Scenario: Aggregate query returns correct results + Given a managed table "orders" with data: + | order_id | customer_id | amount | + | 101 | 1 | 50 | + | 102 | 1 | 75 | + | 103 | 2 | 100 | + When I execute a query on the test schema: + """ + SELECT customer_id, COUNT(*) as order_count + FROM {schema}.orders + GROUP BY customer_id + HAVING COUNT(*) > 1 + """ + Then the query result should have 1 rows + And the first result column "customer_id" should be "1" + + Scenario: Query with no matching rows returns zero + Given a managed table "statuses" with data: + | id | status | + | 1 | active | + | 2 | active | + When I execute a query on the test schema: + """ + SELECT * FROM {schema}.statuses WHERE status = 'inactive' + """ + Then the query result should have 0 rows diff --git a/.claude/skills/bdd-scaffold/test-suite/features/steps/catalog_steps.py b/.claude/skills/bdd-scaffold/test-suite/features/steps/catalog_steps.py new file mode 100644 index 0000000..da966ae --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/features/steps/catalog_steps.py @@ -0,0 +1,84 @@ +"""Step definitions for Unity Catalog operations. + +Table names in Gherkin are short ("customers"). +Schema resolution happens in _fqn() via context.test_schema. 
+""" +from __future__ import annotations + +from behave import when, then +from behave.runner import Context + + +@when("I list tables in the test schema") +def step_list_tables(context: Context) -> None: + catalog, schema = context.test_schema.split(".", 1) + context.table_list = list( + context.workspace.tables.list(catalog_name=catalog, schema_name=schema) + ) + + +@then("the table list should include {table_count:d} tables") +def step_table_count(context: Context, table_count: int) -> None: + actual = len(context.table_list) + assert actual == table_count, ( + f"Expected {table_count} tables, got {actual}: " + f"{[t.name for t in context.table_list]}" + ) + + +@then('the table list should include "{table_name}"') +def step_table_in_list(context: Context, table_name: str) -> None: + names = [t.name for t in context.table_list] + assert table_name in names, f"Table '{table_name}' not in list: {names}" + + +@when('I grant {privilege} on managed table "{table_name}" to group "{group}"') +def step_grant_privilege( + context: Context, privilege: str, table_name: str, group: str +) -> None: + """Grant privilege using SQL — more stable across SDK versions than the grants API.""" + from databricks.sdk.service.sql import StatementState + + fqn = f"{context.test_schema}.{table_name}" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=f"GRANT {privilege} ON TABLE {fqn} TO `{group}`", + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"GRANT failed: {result.status.error}" + ) + + +@then('the group "{group}" should have {privilege} on managed table "{table_name}"') +def step_verify_permission( + context: Context, group: str, privilege: str, table_name: str +) -> None: + """Verify privilege using SHOW GRANTS — stable across SDK versions.""" + from databricks.sdk.service.sql import StatementState + + fqn = f"{context.test_schema}.{table_name}" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=f"SHOW GRANTS ON TABLE {fqn}", + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SHOW GRANTS failed: {result.status.error}" + ) + # Parse result: columns are Principal, ActionType, ObjectType, ObjectKey (PascalCase) + rows = result.result.data_array or [] + columns = [c.name for c in result.manifest.schema.columns] + # Handle case variation — normalize to lowercase for lookup + col_lower = [c.lower() for c in columns] + principal_idx = col_lower.index("principal") + action_idx = col_lower.index("actiontype") + + found = any( + row[principal_idx] == group and row[action_idx] == privilege + for row in rows + ) + assert found, ( + f"Expected {group} to have {privilege} on {fqn}. " + f"Grants found: {[(r[principal_idx], r[action_idx]) for r in rows]}" + ) diff --git a/.claude/skills/bdd-scaffold/test-suite/features/steps/common_steps.py b/.claude/skills/bdd-scaffold/test-suite/features/steps/common_steps.py new file mode 100644 index 0000000..0350761 --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/features/steps/common_steps.py @@ -0,0 +1,130 @@ +"""Shared step definitions for Databricks BDD tests. + +Design principle: Gherkin uses short table names ("customers"). +Step definitions prepend context.test_schema internally. +This avoids {curly_brace} conflicts with Behave's parse library. 
+""" +from __future__ import annotations + +from behave import given, when, then, step +from behave.runner import Context +from databricks.sdk.service.sql import StatementState + + +# ─── Connection and setup ──────────────────────────────────────── + + +@given("a Databricks workspace connection is established") +def step_workspace_connection(context: Context) -> None: + assert hasattr(context, "workspace"), "No workspace client — check environment.py" + assert hasattr(context, "warehouse_id"), "No warehouse_id — check environment.py" + + +@given("a test schema is provisioned") +def step_test_schema(context: Context) -> None: + assert hasattr(context, "test_schema"), "No test_schema — check environment.py" + + +# ─── Table creation ────────────────────────────────────────────── + + +@given('a managed table "{table_name}" exists') +def step_ensure_table(context: Context, table_name: str) -> None: + fqn = _fqn(context, table_name) + _sql(context, f"CREATE TABLE IF NOT EXISTS {fqn} (id BIGINT)") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + +@given('a managed table "{table_name}" with data:') +def step_create_with_data(context: Context, table_name: str) -> None: + fqn = _fqn(context, table_name) + headers = context.table.headings + rows = context.table.rows + + col_defs = ", ".join(f"`{h}` STRING" for h in headers) + _sql(context, f"CREATE OR REPLACE TABLE {fqn} ({col_defs})") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + for row in rows: + values = ", ".join(f"'{cell}'" for cell in row) + _sql(context, f"INSERT INTO {fqn} VALUES ({values})") + + +# ─── SQL execution ─────────────────────────────────────────────── + + +@when("I execute a query on the test schema:") +def step_execute_sql(context: Context) -> None: + """Execute SQL from docstring. 
Use {schema} as placeholder for test schema.""" + sql = context.text.replace("{schema}", context.test_schema) + context.query_result = _sql(context, sql) + + +# ─── Table assertions ──────────────────────────────────────────── + + +@then('the managed table "{table_name}" should exist') +def step_table_exists(context: Context, table_name: str) -> None: + fqn = _fqn(context, table_name) + try: + context.workspace.tables.get(fqn) + except Exception as e: + raise AssertionError(f"Table {fqn} does not exist: {e}") + + +@then('the managed table "{table_name}" should have {expected:d} rows') +def step_row_count(context: Context, table_name: str, expected: int) -> None: + fqn = _fqn(context, table_name) + result = _sql(context, f"SELECT COUNT(*) AS cnt FROM {fqn}") + actual = int(result.result.data_array[0][0]) + assert actual == expected, f"Expected {expected} rows in {table_name}, got {actual}" + + +@then("the test schema should exist in Unity Catalog") +def step_schema_exists(context: Context) -> None: + try: + context.workspace.schemas.get(context.test_schema) + except Exception as e: + raise AssertionError(f"Schema {context.test_schema} does not exist: {e}") + + +# ─── Query result assertions ──────────────────────────────────── + + +@then("the query result should have {expected:d} rows") +def step_result_row_count(context: Context, expected: int) -> None: + rows = context.query_result.result.data_array or [] + actual = len(rows) + assert actual == expected, f"Expected {expected} result rows, got {actual}" + + +@then('the first result column "{col}" should be "{value}"') +def step_first_result_value(context: Context, col: str, value: str) -> None: + result = context.query_result + columns = [c.name for c in result.manifest.schema.columns] + assert col in columns, f"Column '{col}' not in result: {columns}" + col_idx = columns.index(col) + actual = result.result.data_array[0][col_idx] + assert str(actual) == value, f"Expected {col}='{value}', got '{actual}'" + + +# ─── Helpers ───────────────────────────────────────────────────── + + +def _fqn(context: Context, table_name: str) -> str: + """Build fully-qualified table name from short name.""" + return f"{context.test_schema}.{table_name}" + + +def _sql(context: Context, sql: str): + """Execute SQL and assert success.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed ({result.status.state}): {result.status.error}\n" + f"Statement: {sql[:200]}" + ) + return result diff --git a/.claude/skills/bdd-scaffold/test-suite/pyproject.toml b/.claude/skills/bdd-scaffold/test-suite/pyproject.toml new file mode 100644 index 0000000..1ecbbdb --- /dev/null +++ b/.claude/skills/bdd-scaffold/test-suite/pyproject.toml @@ -0,0 +1,19 @@ +[project] +name = "bdd-test-suite" +version = "0.1.0" +requires-python = ">=3.11" +dependencies = [ + "behave>=1.2.6", + "databricks-sdk>=0.40.0", + "httpx>=0.27.0", +] + +[tool.behave] +default_format = "pretty" +show_timings = true +default_tags = ["not @wip and not @skip"] +logging_level = "INFO" + +[tool.behave.userdata] +warehouse_id = "auto" +catalog = "main" diff --git a/.claude/skills/bdd-steps/SKILL.md b/.claude/skills/bdd-steps/SKILL.md new file mode 100644 index 0000000..a14a12b --- /dev/null +++ b/.claude/skills/bdd-steps/SKILL.md @@ -0,0 +1,109 @@ +--- +name: bdd-steps +description: "This skill should be used when the user asks to \"write step 
definitions\", \"implement BDD steps\", \"generate step code\", \"create Behave steps\", \"implement Given When Then\", \"write Python steps for Gherkin\", \"step definitions for Databricks\", or needs to create Python step implementations for existing Gherkin feature files." +user-invocable: true +--- + +# BDD Steps — Python Step Definition Generation + +Generate Python step definitions for Behave that implement Gherkin steps using the Databricks SDK. Read existing `.feature` files, identify undefined steps, and produce well-typed implementations. + +## When to use + +- Implementing step definitions for new or existing feature files +- Adding Databricks SDK calls to step implementations +- Refactoring step definitions for reusability across features + +## Process + +### 1. Identify undefined steps + +Read the target feature files, then run a dry-run to find undefined steps: + +```bash +uv run behave --dry-run features/.feature 2>&1 +``` + +Behave prints suggested snippets for each undefined step. Use these as the starting point. + +### 2. Write step definitions + +Place step files in `features/steps/` organized by domain: + +| File | Domain | Key SDK imports | +|------|--------|----------------| +| `common_steps.py` | Shared utilities | `WorkspaceClient`, `StatementState` | +| `catalog_steps.py` | Unity Catalog | `catalog.PermissionsChange`, `catalog.Privilege`, `catalog.SecurableType` | +| `pipeline_steps.py` | Lakeflow SDP | `pipelines.PipelineStateInfo` | +| `job_steps.py` | Jobs/Notebooks | `jobs.SubmitTask`, `jobs.NotebookTask`, `jobs.RunLifeCycleState` | +| `app_steps.py` | Databricks Apps | `httpx.Client` for HTTP assertions | +| `sql_steps.py` | SQL/Data quality | `sql.StatementState`, `sql.Disposition` | + +**Step definition structure:** + +```python +from __future__ import annotations + +from behave import given, when, then +from behave.runner import Context + + +@given('a descriptive step pattern with "{parameter}"') +def step_impl(context: Context, parameter: str) -> None: + """Docstring explaining what this step does.""" + # Implementation using context.workspace (set in environment.py) + ... +``` + +Refer to `references/step-library.md` for a comprehensive library of reusable Databricks step definitions covering: +- Workspace connection and SQL execution +- Table/schema existence and row count assertions +- Grant and permission verification +- Pipeline triggering and status polling +- Job submission and completion waiting +- HTTP endpoint testing with SSO header simulation + +### 3. Step writing principles + +**Use `context` for state passing.** Store results in `context` attributes so downstream `Then` steps can assert on them: + +```python +@when('I execute a query on "{table_name}"') +def step_execute(context: Context, table_name: str) -> None: + context.query_result = context.workspace.statement_execution.execute_statement(...) + +@then('the result should have {count:d} rows') +def step_check_rows(context: Context, count: int) -> None: + actual = len(context.query_result.result.data_array or []) + assert actual == count, f"Expected {count}, got {actual}" +``` + +**Type all parameters.** Use Behave's parse types (`{name:d}` for int, `{name:f}` for float) or register custom types. + +**Assertion messages must be diagnostic.** Always include expected vs. actual values: + +```python +assert actual == expected, f"Expected {expected}, got {actual}" +``` + +**Substitute `{test_schema}` references.** Feature files may use `{test_schema}` as a placeholder. 
Step definitions should resolve it from `context.test_schema`: + +```python +table_fqn = table_name.replace("{test_schema}", context.test_schema) +``` + +**Poll with timeout for async operations.** Jobs, pipelines, and app deployments need polling loops with configurable timeouts. + +### 4. Validate steps compile + +After writing, verify all steps resolve: + +```bash +uv run behave --dry-run +``` + +Zero undefined steps = ready to run. + +## Additional resources + +- **`references/step-library.md`** — Complete reusable step definition library for all Databricks domains diff --git a/.claude/skills/bdd-steps/references/step-library.md b/.claude/skills/bdd-steps/references/step-library.md new file mode 100644 index 0000000..11ddf76 --- /dev/null +++ b/.claude/skills/bdd-steps/references/step-library.md @@ -0,0 +1,660 @@ +# Reusable Step Definition Library + +Complete library of Databricks step definitions for Behave. Organized by domain. Copy relevant sections into `features/steps/` files. + +**Proven patterns used throughout:** + +- Step patterns use **short names** (e.g., `"{table_name}"`), never `{test_schema}.table` in the pattern +- Step code builds FQN internally: `fqn = f"{context.test_schema}.{table_name}"` +- SQL in docstrings uses `{schema}` placeholder, replaced via `context.text.replace("{schema}", context.test_schema)` +- Steps with data tables have a **trailing colon** in the decorator: `@given('... with data:')` +- Grants use **SQL**, not the SDK grants API (which breaks on recent SDK versions) +- Integer parameters use Behave's built-in `{count:d}` format, not custom type parsers + +--- + +## Common Steps (`common_steps.py`) + +Always include these. They provide workspace connection, SQL execution, and basic assertions. + +```python +"""Shared step definitions for Databricks BDD tests.""" +from __future__ import annotations + +import os +from datetime import datetime + +from behave import given, then, step +from behave.runner import Context +from databricks.sdk.service.sql import StatementState + + +# ─── Connection and setup steps ───────────────────────────────── + +@given("a Databricks workspace connection is established") +def step_workspace_connection(context: Context) -> None: + """Initialize workspace client. Usually handled by environment.py.""" + if not hasattr(context, "workspace"): + from databricks.sdk import WorkspaceClient + context.workspace = WorkspaceClient() + me = context.workspace.current_user.me() + context.current_user = me.user_name + + +@given("a test schema is provisioned") +def step_test_schema(context: Context) -> None: + """Verify test schema exists. Usually handled by environment.py.""" + assert hasattr(context, "test_schema"), ( + "No test_schema on context — check environment.py before_all" + ) + + +# ─── SQL execution steps ──────────────────────────────────────── + +@step("I execute the following SQL") +def step_execute_sql_docstring(context: Context) -> None: + """Execute SQL from a docstring (triple-quoted text in feature file). + + In feature files, use {schema} as the placeholder: + When I execute the following SQL + \"\"\" + SELECT * FROM {schema}.customers + \"\"\" + """ + sql = context.text.replace("{schema}", context.test_schema) + context.query_result = _execute_sql(context, sql) + + +@step('I execute SQL "{sql}"') +def step_execute_sql_inline(context: Context, sql: str) -> None: + """Execute inline SQL. 
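Intended for short, single-statement queries; prefer the docstring step above for multi-line SQL.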
The {schema} placeholder is replaced automatically."""
+    sql = sql.replace("{schema}", context.test_schema)
+    context.query_result = _execute_sql(context, sql)
+
+
+# ─── Table existence and row count assertions ───────────────────
+
+@then('the table "{table_name}" should exist')
+def step_table_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        context.workspace.tables.get(fqn)
+    except Exception as e:
+        raise AssertionError(f"Table {fqn} does not exist: {e}")
+
+
+@then('the streaming table "{table_name}" should exist')
+def step_streaming_table_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        info = context.workspace.tables.get(fqn)
+        assert info.table_type is not None, f"{fqn} exists but has no table_type"
+    except Exception as e:
+        raise AssertionError(f"Streaming table {fqn} does not exist: {e}")
+
+
+@then('the materialized view "{table_name}" should exist')
+def step_mv_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        context.workspace.tables.get(fqn)
+    except Exception as e:
+        raise AssertionError(f"Materialized view {fqn} does not exist: {e}")
+
+
+@then('the table "{table_name}" should have {expected:d} rows')
+def step_exact_row_count(context: Context, table_name: str, expected: int) -> None:
+    actual = _count_rows(context, table_name)
+    assert actual == expected, f"Expected {expected} rows in {table_name}, got {actual}"
+
+
+@then('the table "{table_name}" should have more than {expected:d} rows')
+def step_min_row_count(context: Context, table_name: str, expected: int) -> None:
+    actual = _count_rows(context, table_name)
+    assert actual > expected, f"Expected more than {expected} rows in {table_name}, got {actual}"
+
+
+# Note: a feature line like "should have 0 rows" is already matched by the
+# {expected:d} step above (parse reads "0" as an integer), so a separate
+# literal-zero step would never be reached and is intentionally omitted.
+
+
+# ─── Query result assertions ────────────────────────────────────
+
+@then("the result should have {expected:d} rows")
+def step_result_row_count(context: Context, expected: int) -> None:
+    rows = context.query_result.result.data_array or []
+    actual = len(rows)
+    assert actual == expected, f"Expected {expected} rows, got {actual}"
+
+
+@then("the result should have more than {expected:d} rows")
+def step_result_min_rows(context: Context, expected: int) -> None:
+    rows = context.query_result.result.data_array or []
+    actual = len(rows)
+    assert actual > expected, f"Expected more than {expected} rows, got {actual}"
+
+
+@then('the first row column "{col}" should be "{value}"')
+def step_first_row_value(context: Context, col: str, value: str) -> None:
+    result = context.query_result
+    columns = [c.name for c in result.manifest.schema.columns]
+    assert col in columns, f"Column '{col}' not in result: {columns}"
+    col_idx = columns.index(col)
+    actual = result.result.data_array[0][col_idx]
+    assert str(actual) == value, f"Expected {col}={value}, got {actual}"
+
+
+# ─── Data setup steps ───────────────────────────────────────────
+
+@given('the table "{table_name}" has been loaded')
+def step_table_loaded(context: Context, table_name: str) -> None:
+    """Assert table exists and is not empty."""
+    fqn = f"{context.test_schema}.{table_name}"
+    count = _count_rows(context, table_name)
+    assert count > 0, f"Table {fqn} exists but is empty"
+
+
+@given('a managed table "{table_name}" exists')
+def step_ensure_table_exists(context:
Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + context.workspace.tables.get(fqn) + except Exception: + # Create a minimal table + _execute_sql(context, f"CREATE TABLE IF NOT EXISTS {fqn} (id BIGINT)") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + +@given('a managed table "{table_name}" with data:') +def step_create_table_with_data(context: Context, table_name: str) -> None: + """Create a table and populate from the Gherkin data table. + + The trailing colon in the decorator is required — Behave matches it + as part of the step text when a data table follows. + + Example feature file usage: + Given a managed table "customers" with data: + | id | name | region | + | 1 | Acme | APAC | + | 2 | Contoso | EMEA | + """ + fqn = f"{context.test_schema}.{table_name}" + headers = context.table.headings + rows = context.table.rows + + # Infer types (simple heuristic — all STRING) + col_defs = ", ".join(f"{h} STRING" for h in headers) + _execute_sql(context, f"CREATE OR REPLACE TABLE {fqn} ({col_defs})") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + # Insert rows + for row in rows: + values = ", ".join(f"'{cell}'" for cell in row) + _execute_sql(context, f"INSERT INTO {fqn} VALUES ({values})") + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str): + """Execute SQL and return result.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed: {result.status.error}\nStatement: {sql[:200]}" + ) + return result + + +def _count_rows(context: Context, table_name: str) -> int: + """Count rows in a table.""" + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SELECT COUNT(*) AS cnt FROM {fqn}") + return int(result.result.data_array[0][0]) +``` + +--- + +## Catalog Steps (`catalog_steps.py`) + +Uses SQL for grants instead of the SDK grants API. The SDK's `grants.update(securable_type=SecurableType.TABLE, ...)` fails with `SECURABLETYPE.TABLE is not a valid securable type` on recent SDK versions. + +```python +"""Step definitions for Unity Catalog permissions and security. + +Uses SQL for all grant operations. The SDK grants API is unreliable — +SecurableType.TABLE fails on recent databricks-sdk versions. +""" +from __future__ import annotations + +from behave import when, then +from behave.runner import Context +from databricks.sdk.service.sql import StatementState + + +@when('I grant {privilege} on table "{table_name}" to group "{group}"') +def step_grant(context: Context, privilege: str, table_name: str, group: str) -> None: + """Grant a privilege on a table using SQL. 
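The group name is wrapped in backticks so group names containing dots or hyphens parse correctly.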
+
+    Example feature file usage:
+        When I grant SELECT on table "customers" to group "analysts"
+    """
+    fqn = f"{context.test_schema}.{table_name}"
+    _execute_sql(context, f"GRANT {privilege} ON TABLE {fqn} TO `{group}`")
+
+
+@when('I revoke {privilege} on table "{table_name}" from group "{group}"')
+def step_revoke(context: Context, privilege: str, table_name: str, group: str) -> None:
+    """Revoke a privilege on a table using SQL."""
+    fqn = f"{context.test_schema}.{table_name}"
+    _execute_sql(context, f"REVOKE {privilege} ON TABLE {fqn} FROM `{group}`")
+
+
+@then('the group "{group}" should have {privilege} permission on "{table_name}"')
+def step_verify_grant(
+    context: Context, group: str, privilege: str, table_name: str
+) -> None:
+    """Verify a grant exists using SHOW GRANTS.
+
+    SHOW GRANTS returns PascalCase columns: Principal, ActionType, ObjectType, ObjectKey.
+    """
+    fqn = f"{context.test_schema}.{table_name}"
+    result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}")
+    columns = [c.name for c in result.manifest.schema.columns]
+    principal_idx = columns.index("Principal")
+    action_idx = columns.index("ActionType")
+
+    found_privs = []
+    for row in result.result.data_array or []:
+        if row[principal_idx] == group:
+            found_privs.append(row[action_idx])
+
+    assert privilege in found_privs, (
+        f"Expected {group} to have {privilege} on {fqn}, "
+        f"found: {found_privs}"
+    )
+
+
+@then('the group "{group}" should not have {privilege} permission on "{table_name}"')
+def step_verify_no_grant(
+    context: Context, group: str, privilege: str, table_name: str
+) -> None:
+    """Verify a grant does NOT exist using SHOW GRANTS."""
+    fqn = f"{context.test_schema}.{table_name}"
+    result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}")
+    columns = [c.name for c in result.manifest.schema.columns]
+    principal_idx = columns.index("Principal")
+    action_idx = columns.index("ActionType")
+
+    found_privs = []
+    for row in result.result.data_array or []:
+        if row[principal_idx] == group:
+            found_privs.append(row[action_idx])
+
+    assert privilege not in found_privs, (
+        f"Expected {group} NOT to have {privilege} on {fqn}, "
+        f"but found: {found_privs}"
+    )
+
+
+def _execute_sql(context: Context, sql: str):
+    """Execute SQL and return result."""
+    result = context.workspace.statement_execution.execute_statement(
+        warehouse_id=context.warehouse_id,
+        statement=sql,
+        wait_timeout="30s",
+    )
+    assert result.status.state == StatementState.SUCCEEDED, (
+        f"SQL failed: {result.status.error}\nStatement: {sql[:200]}"
+    )
+    return result
+```
+
+---
+
+## Pipeline Steps (`pipeline_steps.py`)
+
+```python
+"""Step definitions for Lakeflow Spark Declarative Pipelines."""
+from __future__ import annotations
+
+import time
+
+from behave import given, when, then
+from behave.runner import Context
+
+
+@given('a pipeline "{name}" exists targeting "{schema}"')
+def step_pipeline_exists(context: Context, name: str, schema: str) -> None:
+    pipelines = list(
+        context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"')
+    )
+    if pipelines:
+        context.pipeline_id = pipelines[0].pipeline_id
+    else:
+        result = context.workspace.pipelines.create(
+            name=name,
+            target=schema,
+            catalog=context.test_catalog,
+            channel="CURRENT",
+        )
+        context.pipeline_id = result.pipeline_id
+        # Track the new pipeline on its own cleanup list; scenario_cleanup_sql
+        # holds SQL strings, so appending None there would break the cleanup
+        # hook. Assumes after_scenario in environment.py deletes any pipelines
+        # listed here.
+        context.scenario_cleanup_pipelines = getattr(
+            context, "scenario_cleanup_pipelines", []
+        )
+        context.scenario_cleanup_pipelines.append(result.pipeline_id)
+
+
+@given('the pipeline "{name}" has completed a full refresh')
+def step_pipeline_refreshed(context: Context, name: str) -> None:
+    """Ensure pipeline exists and has been refreshed at least once."""
+    pipelines = list(
+        context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"')
+    )
+    assert pipelines, f"Pipeline '{name}' not found"
+    context.pipeline_id = pipelines[0].pipeline_id
+    # Check latest update status
+    detail = context.workspace.pipelines.get(context.pipeline_id)
+    assert detail.latest_updates, f"Pipeline '{name}' has never been run"
+
+
+@when("I trigger a full refresh of the pipeline")
+def step_full_refresh(context: Context) -> None:
+    response = context.workspace.pipelines.start_update(
+        pipeline_id=context.pipeline_id,
+        full_refresh=True,
+    )
+    context.update_id = response.update_id
+
+
+@when("I trigger an incremental refresh of the pipeline")
+def step_incremental_refresh(context: Context) -> None:
+    response = context.workspace.pipelines.start_update(
+        pipeline_id=context.pipeline_id,
+        full_refresh=False,
+    )
+    context.update_id = response.update_id
+
+
+@then("the pipeline update should succeed within {timeout:d} seconds")
+def step_pipeline_success(context: Context, timeout: int) -> None:
+    _wait_for_pipeline(context, timeout, expect_success=True)
+
+
+@then("the pipeline update should fail")
+def step_pipeline_fail(context: Context) -> None:
+    _wait_for_pipeline(context, timeout=300, expect_success=False)
+
+
+@then('the pipeline error should mention {keyword}')
+def step_pipeline_error_contains(context: Context, keyword: str) -> None:
+    events = list(context.workspace.pipelines.list_pipeline_events(
+        pipeline_id=context.pipeline_id,
+        max_results=10,
+    ))
+    # e.level is an EventLevel enum, not a string — compare by its value.
+    error_messages = " ".join(
+        str(e.message) for e in events
+        if getattr(e.level, "value", e.level) == "ERROR"
+    )
+    assert keyword.lower() in error_messages.lower(), (
+        f"Expected pipeline error to mention '{keyword}', "
+        f"but errors were: {error_messages[:500]}"
+    )
+
+
+def _wait_for_pipeline(
+    context: Context, timeout: int, expect_success: bool
+) -> None:
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        update = context.workspace.pipelines.get_update(
+            pipeline_id=context.pipeline_id,
+            update_id=context.update_id,
+        )
+        # The SDK returns an UpdateInfoState enum; compare by string value so
+        # the check works whether an enum or a raw string comes back.
+        state = getattr(update.update.state, "value", update.update.state)
+        if state == "COMPLETED":
+            if expect_success:
+                return
+            raise AssertionError("Expected pipeline to fail, but it succeeded")
+        if state in ("FAILED", "CANCELED"):
+            if not expect_success:
+                return
+            raise AssertionError(
+                f"Pipeline update {state}. Check update {context.update_id}"
+            )
+        time.sleep(15)
+    raise TimeoutError(f"Pipeline did not complete within {timeout}s")
+```
+
+---
+
+## Job Steps (`job_steps.py`)
+
+```python
+"""Step definitions for Databricks Jobs and notebook runs."""
+from __future__ import annotations
+
+import time
+
+from behave import when, then
+from behave.runner import Context
+from databricks.sdk.service.jobs import (
+    NotebookTask,
+    RunLifeCycleState,
+    SubmitTask,
+)
+
+
+@when('I run the notebook "{path}" with parameters:')
+def step_run_notebook(context: Context, path: str) -> None:
+    """Run a notebook with parameters from a Gherkin data table.
+
+    The trailing colon is required when a data table follows.
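+    Table values may use the {schema} placeholder; it is replaced with
+    context.test_schema before the run is submitted.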
+ + Example feature file usage: + When I run the notebook "/Workspace/tests/etl" with parameters: + | key | value | + | schema | my_schema | + | mode | full | + """ + params = {} + for row in context.table: + value = row["value"].replace("{schema}", context.test_schema) + params[row["key"]] = value + + run = context.workspace.jobs.submit( + run_name=f"behave-{context.scenario.name[:50]}", + tasks=[ + SubmitTask( + task_key="main", + notebook_task=NotebookTask( + notebook_path=path, + base_parameters=params, + ), + ) + ], + ) + context.run_id = run.response.run_id + + +@then('the job should complete with status "{expected}" within {timeout:d} seconds') +def step_job_status(context: Context, expected: str, timeout: int) -> None: + deadline = time.time() + timeout + while time.time() < deadline: + run = context.workspace.jobs.get_run(context.run_id) + state = run.state + if state.life_cycle_state in ( + RunLifeCycleState.TERMINATED, + RunLifeCycleState.INTERNAL_ERROR, + RunLifeCycleState.SKIPPED, + ): + break + time.sleep(10) + else: + raise TimeoutError(f"Run {context.run_id} did not complete within {timeout}s") + + actual = state.result_state.value if state.result_state else "UNKNOWN" + assert actual == expected, ( + f"Expected {expected}, got {actual}. Message: {state.state_message}" + ) +``` + +--- + +## App Steps (`app_steps.py`) + +```python +"""Step definitions for Databricks Apps (FastAPI) testing.""" +from __future__ import annotations + +import subprocess +import os + +import httpx +from behave import given, when, then +from behave.runner import Context + + +@given('the app is running at "{base_url}"') +def step_app_running(context: Context, base_url: str) -> None: + context.app_client = httpx.Client(base_url=base_url, timeout=10) + + +@given('the test user is "{email}"') +def step_test_user(context: Context, email: str) -> None: + context.auth_headers = { + "X-Forwarded-Email": email, + "X-Forwarded-User": email.split("@")[0], + } + + +@when('I GET "{path}"') +def step_get(context: Context, path: str) -> None: + context.response = context.app_client.get(path) + + +@when('I GET "{path}" with auth headers') +def step_get_auth(context: Context, path: str) -> None: + context.response = context.app_client.get(path, headers=context.auth_headers) + + +@when('I GET "{path}" without auth headers') +def step_get_no_auth(context: Context, path: str) -> None: + context.response = context.app_client.get(path) + + +@when('I POST "{path}" with auth headers and body') +def step_post_auth(context: Context, path: str) -> None: + """POST with JSON body from a docstring. 
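The body is read from context.text and parsed as JSON before the request is sent.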
+ + Example feature file usage: + When I POST "/api/items" with auth headers and body + \"\"\" + {"name": "test-item", "value": 42} + \"\"\" + """ + import json + body = json.loads(context.text) + context.response = context.app_client.post( + path, json=body, headers=context.auth_headers, + ) + + +@then("the response status should be {code:d}") +def step_status_code(context: Context, code: int) -> None: + assert context.response.status_code == code, ( + f"Expected {code}, got {context.response.status_code}: " + f"{context.response.text[:200]}" + ) + + +@then('the response JSON should contain "{key}" with value "{value}"') +def step_json_value(context: Context, key: str, value: str) -> None: + data = context.response.json() + assert key in data, f"Key '{key}' not in response: {list(data.keys())}" + assert str(data[key]) == value, f"Expected {key}='{value}', got '{data[key]}'" + + +@then("the response should be a JSON list") +def step_json_list(context: Context) -> None: + data = context.response.json() + assert isinstance(data, list), f"Expected list, got {type(data).__name__}" + + +# ─── Deployment steps ──────────────────────────────────────────── + +@when('I deploy using Asset Bundles with target "{target}"') +def step_deploy_bundle(context: Context, target: str) -> None: + result = subprocess.run( + ["databricks", "bundle", "deploy", "--target", target], + capture_output=True, + text=True, + env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"}, + timeout=300, + ) + context.deploy_result = result + + +@then("the deployment should succeed") +def step_deploy_success(context: Context) -> None: + r = context.deploy_result + assert r.returncode == 0, ( + f"Deploy failed (rc={r.returncode}):\n{r.stderr[:500]}" + ) +``` + +--- + +## Shell Command Steps (reusable) + +```python +"""Step definitions for running CLI commands (DABs, databricks CLI).""" +from __future__ import annotations + +import os +import subprocess + +from behave import when, then +from behave.runner import Context + + +@when('I run "{command}" with target "{target}"') +def step_run_command(context: Context, command: str, target: str) -> None: + full_cmd = f"{command} --target {target}" + context.cmd_result = subprocess.run( + full_cmd.split(), + capture_output=True, + text=True, + env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"}, + timeout=300, + ) + + +@when('I run "{command}" with target "{target}" and auto-approve') +def step_run_command_approve(context: Context, command: str, target: str) -> None: + full_cmd = f"{command} --target {target} --auto-approve" + context.cmd_result = subprocess.run( + full_cmd.split(), + capture_output=True, + text=True, + env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"}, + timeout=300, + ) + + +@then("the command should exit with code {code:d}") +def step_exit_code(context: Context, code: int) -> None: + actual = context.cmd_result.returncode + assert actual == code, ( + f"Expected exit code {code}, got {actual}.\n" + f"stdout: {context.cmd_result.stdout[:300]}\n" + f"stderr: {context.cmd_result.stderr[:300]}" + ) + + +@then("the command should succeed") +def step_command_success(context: Context) -> None: + assert context.cmd_result.returncode == 0, ( + f"Command failed (rc={context.cmd_result.returncode}):\n" + f"{context.cmd_result.stderr[:500]}" + ) +``` diff --git a/.claude/skills/databricks-aibi-dashboards/SKILL.md b/.claude/skills/databricks-aibi-dashboards/SKILL.md deleted file mode 100644 index 41dbeec..0000000 --- 
a/.claude/skills/databricks-aibi-dashboards/SKILL.md +++ /dev/null @@ -1,923 +0,0 @@ ---- -name: databricks-aibi-dashboards -description: "Create Databricks AI/BI dashboards. CRITICAL: You MUST test ALL SQL queries via execute_sql BEFORE deploying. Follow guidelines strictly." ---- - -# AI/BI Dashboard Skill - -Create Databricks AI/BI dashboards (formerly Lakeview dashboards). **Follow these guidelines strictly.** - -## CRITICAL: MANDATORY VALIDATION WORKFLOW - -**You MUST follow this workflow exactly. Skipping validation causes broken dashboards.** - -``` -┌─────────────────────────────────────────────────────────────────────┐ -│ STEP 1: Get table schemas via get_table_details(catalog, schema) │ -├─────────────────────────────────────────────────────────────────────┤ -│ STEP 2: Write SQL queries for each dataset │ -├─────────────────────────────────────────────────────────────────────┤ -│ STEP 3: TEST EVERY QUERY via execute_sql() ← DO NOT SKIP! │ -│ - If query fails, FIX IT before proceeding │ -│ - Verify column names match what widgets will reference │ -│ - Verify data types are correct (dates, numbers, strings) │ -├─────────────────────────────────────────────────────────────────────┤ -│ STEP 4: Build dashboard JSON using ONLY verified queries │ -├─────────────────────────────────────────────────────────────────────┤ -│ STEP 5: Deploy via create_or_update_dashboard() │ -└─────────────────────────────────────────────────────────────────────┘ -``` - -**WARNING: If you deploy without testing queries, widgets WILL show "Invalid widget definition" errors!** - -## Available MCP Tools - -| Tool | Description | -|------|-------------| -| `get_table_details` | **STEP 1**: Get table schemas for designing queries | -| `execute_sql` | **STEP 3**: Test SQL queries - MANDATORY before deployment! | -| `get_best_warehouse` | Get available warehouse ID | -| `create_or_update_dashboard` | **STEP 5**: Deploy dashboard JSON (only after validation!) | -| `get_dashboard` | Get dashboard details by ID | -| `list_dashboards` | List dashboards in workspace | -| `trash_dashboard` | Move dashboard to trash | -| `publish_dashboard` | Publish dashboard for viewers | -| `unpublish_dashboard` | Unpublish a dashboard | - ---- - -## Implementation Guidelines - -### 1) DATASET ARCHITECTURE (STRICT) - -- **One dataset per domain** (e.g., orders, customers, products) -- **Exactly ONE valid SQL query per dataset** (no multiple queries separated by `;`) -- Always use **fully-qualified table names**: `catalog.schema.table_name` -- SELECT must include all dimensions needed by widgets and all derived columns via `AS` aliases -- Put ALL business logic (CASE/WHEN, COALESCE, ratios) into the dataset SELECT with explicit aliases -- **Contract rule**: Every widget `fieldName` must exactly match a dataset column or alias - -### 2) WIDGET FIELD EXPRESSIONS - -> **CRITICAL: Field Name Matching Rule** -> The `name` in `query.fields` MUST exactly match the `fieldName` in `encodings`. -> If they don't match, the widget shows "no selected fields to visualize" error! 
- -**Correct pattern for aggregations:** -```json -// In query.fields: -{"name": "sum(spend)", "expression": "SUM(`spend`)"} - -// In encodings (must match!): -{"fieldName": "sum(spend)", "displayName": "Total Spend"} -``` - -**WRONG - names don't match:** -```json -// In query.fields: -{"name": "spend", "expression": "SUM(`spend`)"} // name is "spend" - -// In encodings: -{"fieldName": "sum(spend)", ...} // ERROR: "sum(spend)" ≠ "spend" -``` - -Allowed expressions in widget queries (you CANNOT use CAST or other SQL in expressions): - -**For numbers:** -```json -{"name": "sum(revenue)", "expression": "SUM(`revenue`)"} -{"name": "avg(price)", "expression": "AVG(`price`)"} -{"name": "count(orders)", "expression": "COUNT(`order_id`)"} -{"name": "countdistinct(customers)", "expression": "COUNT(DISTINCT `customer_id`)"} -{"name": "min(date)", "expression": "MIN(`order_date`)"} -{"name": "max(date)", "expression": "MAX(`order_date`)"} -``` - -**For dates** (use daily for timeseries, weekly/monthly for grouped comparisons): -```json -{"name": "daily(date)", "expression": "DATE_TRUNC(\"DAY\", `date`)"} -{"name": "weekly(date)", "expression": "DATE_TRUNC(\"WEEK\", `date`)"} -{"name": "monthly(date)", "expression": "DATE_TRUNC(\"MONTH\", `date`)"} -``` - -**Simple field reference** (for pre-aggregated data): -```json -{"name": "category", "expression": "`category`"} -``` - -If you need conditional logic or multi-field formulas, compute a derived column in the dataset SQL first. - -### 3) SPARK SQL PATTERNS - -- Date math: `date_sub(current_date(), N)` for days, `add_months(current_date(), -N)` for months -- Date truncation: `DATE_TRUNC('DAY'|'WEEK'|'MONTH'|'QUARTER'|'YEAR', column)` -- **AVOID** `INTERVAL` syntax - use functions instead - -### 4) LAYOUT (6-Column Grid, NO GAPS) - -Each widget has a position: `{"x": 0, "y": 0, "width": 2, "height": 4}` - -**CRITICAL**: Each row must fill width=6 exactly. No gaps allowed. - -**Recommended widget sizes:** - -| Widget Type | Width | Height | Notes | -|-------------|-------|--------|-------| -| Text header | 6 | 1 | Full width; use SEPARATE widgets for title and subtitle | -| Counter/KPI | 2 | **3-4** | **NEVER height=2** - too cramped! | -| Line/Bar chart | 3 | **5-6** | Pair side-by-side to fill row | -| Pie chart | 3 | **5-6** | Needs space for legend | -| Full-width chart | 6 | 5-7 | For detailed time series | -| Table | 6 | 5-8 | Full width for readability | - -**Standard dashboard structure:** -```text -y=0: Title (w=6, h=1) - Dashboard title (use separate widget!) -y=1: Subtitle (w=6, h=1) - Description (use separate widget!) -y=2: KPIs (w=2 each, h=3) - 3 key metrics side-by-side -y=5: Section header (w=6, h=1) - "Trends" or similar -y=6: Charts (w=3 each, h=5) - Two charts side-by-side -y=11: Section header (w=6, h=1) - "Details" -y=12: Table (w=6, h=6) - Detailed data -``` - -### 5) CARDINALITY & READABILITY (CRITICAL) - -**Dashboard readability depends on limiting distinct values:** - -| Dimension Type | Max Values | Examples | -|----------------|------------|----------| -| Chart color/groups | **3-8** | 4 regions, 5 product lines, 3 tiers | -| Filters | 4-10 | 8 countries, 5 channels | -| High cardinality | **Table only** | customer_id, order_id, SKU | - -**Before creating any chart with color/grouping:** -1. Check column cardinality (use `get_table_details` to see distinct values) -2. If >10 distinct values, aggregate to higher level OR use TOP-N + "Other" bucket -3. 
For high-cardinality dimensions, use a table widget instead of a chart - -### 6) WIDGET SPECIFICATIONS - -**Widget Naming Convention (CRITICAL):** -- `widget.name`: alphanumeric + hyphens + underscores ONLY (no spaces, parentheses, colons) -- `frame.title`: human-readable name (any characters allowed) -- `widget.queries[0].name`: always use `"main_query"` - -**CRITICAL VERSION REQUIREMENTS:** - -| Widget Type | Version | -|-------------|---------| -| counter | 2 | -| table | 2 | -| filter-multi-select | 2 | -| filter-single-select | 2 | -| filter-date-range-picker | 2 | -| bar | 3 | -| line | 3 | -| pie | 3 | -| text | N/A (no spec block) | - ---- - -**Text (Headers/Descriptions):** -- **CRITICAL: Text widgets do NOT use a spec block!** -- Use `multilineTextboxSpec` directly on the widget -- Supports markdown: `#`, `##`, `###`, `**bold**`, `*italic*` -- **CRITICAL: Multiple items in the `lines` array are concatenated on a single line, NOT displayed as separate lines!** -- For title + subtitle, use **separate text widgets** at different y positions - -```json -// CORRECT: Separate widgets for title and subtitle -{ - "widget": { - "name": "title", - "multilineTextboxSpec": { - "lines": ["## Dashboard Title"] - } - }, - "position": {"x": 0, "y": 0, "width": 6, "height": 1} -}, -{ - "widget": { - "name": "subtitle", - "multilineTextboxSpec": { - "lines": ["Description text here"] - } - }, - "position": {"x": 0, "y": 1, "width": 6, "height": 1} -} - -// WRONG: Multiple lines concatenate into one line! -{ - "widget": { - "name": "title-widget", - "multilineTextboxSpec": { - "lines": ["## Dashboard Title", "Description text here"] // Becomes "## Dashboard TitleDescription text here" - } - }, - "position": {"x": 0, "y": 0, "width": 6, "height": 2} -} -``` - ---- - -**Counter (KPI):** -- `version`: **2** (NOT 3!) -- `widgetType`: "counter" -- **Percent values must be 0-1** in the data (not 0-100) - -**Two patterns for counters:** - -**Pattern 1: Pre-aggregated dataset (1 row, no filters)** -- Dataset returns exactly 1 row -- Use `"disaggregated": true` and simple field reference -- Field `name` matches dataset column directly - -```json -{ - "widget": { - "name": "total-revenue", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "summary_ds", - "fields": [{"name": "revenue", "expression": "`revenue`"}], - "disaggregated": true - } - }], - "spec": { - "version": 2, - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "revenue", "displayName": "Total Revenue"} - }, - "frame": {"showTitle": true, "title": "Total Revenue"} - } - }, - "position": {"x": 0, "y": 0, "width": 2, "height": 3} -} -``` - -**Pattern 2: Aggregating widget (multi-row dataset, supports filters)** -- Dataset returns multiple rows (e.g., grouped by a filter dimension) -- Use `"disaggregated": false` and aggregation expression -- **CRITICAL**: Field `name` MUST match `fieldName` exactly (e.g., `"sum(spend)"`) - -```json -{ - "widget": { - "name": "total-spend", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "by_category", - "fields": [{"name": "sum(spend)", "expression": "SUM(`spend`)"}], - "disaggregated": false - } - }], - "spec": { - "version": 2, - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "sum(spend)", "displayName": "Total Spend"} - }, - "frame": {"showTitle": true, "title": "Total Spend"} - } - }, - "position": {"x": 0, "y": 0, "width": 2, "height": 3} -} -``` - ---- - -**Table:** -- `version`: **2** (NOT 1 or 3!) 
-- `widgetType`: "table" -- **Columns only need `fieldName` and `displayName`** - no other properties! -- Use `"disaggregated": true` for raw rows - -```json -{ - "widget": { - "name": "details-table", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "details_ds", - "fields": [ - {"name": "name", "expression": "`name`"}, - {"name": "value", "expression": "`value`"} - ], - "disaggregated": true - } - }], - "spec": { - "version": 2, - "widgetType": "table", - "encodings": { - "columns": [ - {"fieldName": "name", "displayName": "Name"}, - {"fieldName": "value", "displayName": "Value"} - ] - }, - "frame": {"showTitle": true, "title": "Details"} - } - }, - "position": {"x": 0, "y": 0, "width": 6, "height": 6} -} -``` - ---- - -**Line / Bar Charts:** -- `version`: **3** -- `widgetType`: "line" or "bar" -- Use `x`, `y`, optional `color` encodings -- `scale.type`: `"temporal"` (dates), `"quantitative"` (numbers), `"categorical"` (strings) -- Use `"disaggregated": true` with pre-aggregated dataset data - -**Multiple Lines - Two Approaches:** - -1. **Multi-Y Fields** (different metrics on same chart): -```json -"y": { - "scale": {"type": "quantitative"}, - "fields": [ - {"fieldName": "sum(orders)", "displayName": "Orders"}, - {"fieldName": "sum(returns)", "displayName": "Returns"} - ] -} -``` - -2. **Color Grouping** (same metric split by dimension): -```json -"y": {"fieldName": "sum(revenue)", "scale": {"type": "quantitative"}}, -"color": {"fieldName": "region", "scale": {"type": "categorical"}, "displayName": "Region"} -``` - -**Bar Chart Modes:** -- **Stacked** (default): No `mark` field - bars stack on top of each other -- **Grouped**: Add `"mark": {"layout": "group"}` - bars side-by-side for comparison - -**Pie Chart:** -- `version`: **3** -- `widgetType`: "pie" -- `angle`: quantitative aggregate -- `color`: categorical dimension -- Limit to 3-8 categories for readability - -### 7) FILTERS (Global vs Page-Level) - -> **CRITICAL**: Filter widgets use DIFFERENT widget types than charts! -> - Valid types: `filter-multi-select`, `filter-single-select`, `filter-date-range-picker` -> - **DO NOT** use `widgetType: "filter"` - this does not exist and will cause errors -> - Filters use `spec.version: 2` -> - **ALWAYS include `frame` with `showTitle: true`** for filter widgets - -**Filter widget types:** -- `filter-date-range-picker`: for DATE/TIMESTAMP fields -- `filter-single-select`: categorical with single selection -- `filter-multi-select`: categorical with multiple selections - ---- - -#### Global Filters vs Page-Level Filters - -| Type | Placement | Scope | Use Case | -|------|-----------|-------|----------| -| **Global Filter** | Dedicated page with `"pageType": "PAGE_TYPE_GLOBAL_FILTERS"` | Affects ALL pages that have datasets with the filter field | Cross-dashboard filtering (e.g., date range, campaign) | -| **Page-Level Filter** | Regular page with `"pageType": "PAGE_TYPE_CANVAS"` | Affects ONLY widgets on that same page | Page-specific filtering (e.g., platform filter on breakdown page only) | - -**Key Insight**: A filter only affects datasets that contain the filter field. To have a filter affect only specific pages: -1. Include the filter dimension in datasets for pages that should be filtered -2. Exclude the filter dimension from datasets for pages that should NOT be filtered - ---- - -#### Filter Widget Structure - -> **CRITICAL**: Do NOT use `associative_filter_predicate_group` - it causes SQL errors! -> Use a simple field expression instead. 
- -```json -{ - "widget": { - "name": "filter_region", - "queries": [{ - "name": "ds_data_region", - "query": { - "datasetName": "ds_data", - "fields": [ - {"name": "region", "expression": "`region`"} - ], - "disaggregated": false - } - }], - "spec": { - "version": 2, - "widgetType": "filter-multi-select", - "encodings": { - "fields": [{ - "fieldName": "region", - "displayName": "Region", - "queryName": "ds_data_region" - }] - }, - "frame": {"showTitle": true, "title": "Region"} - } - }, - "position": {"x": 0, "y": 0, "width": 2, "height": 2} -} -``` - ---- - -#### Global Filter Example - -Place on a dedicated filter page: - -```json -{ - "name": "filters", - "displayName": "Filters", - "pageType": "PAGE_TYPE_GLOBAL_FILTERS", - "layout": [ - { - "widget": { - "name": "filter_campaign", - "queries": [{ - "name": "ds_campaign", - "query": { - "datasetName": "overview", - "fields": [{"name": "campaign_name", "expression": "`campaign_name`"}], - "disaggregated": false - } - }], - "spec": { - "version": 2, - "widgetType": "filter-multi-select", - "encodings": { - "fields": [{ - "fieldName": "campaign_name", - "displayName": "Campaign", - "queryName": "ds_campaign" - }] - }, - "frame": {"showTitle": true, "title": "Campaign"} - } - }, - "position": {"x": 0, "y": 0, "width": 2, "height": 2} - } - ] -} -``` - ---- - -#### Page-Level Filter Example - -Place directly on a canvas page (affects only that page): - -```json -{ - "name": "platform_breakdown", - "displayName": "Platform Breakdown", - "pageType": "PAGE_TYPE_CANVAS", - "layout": [ - { - "widget": { - "name": "page-title", - "multilineTextboxSpec": {"lines": ["## Platform Breakdown"]} - }, - "position": {"x": 0, "y": 0, "width": 4, "height": 1} - }, - { - "widget": { - "name": "filter_platform", - "queries": [{ - "name": "ds_platform", - "query": { - "datasetName": "platform_data", - "fields": [{"name": "platform", "expression": "`platform`"}], - "disaggregated": false - } - }], - "spec": { - "version": 2, - "widgetType": "filter-multi-select", - "encodings": { - "fields": [{ - "fieldName": "platform", - "displayName": "Platform", - "queryName": "ds_platform" - }] - }, - "frame": {"showTitle": true, "title": "Platform"} - } - }, - "position": {"x": 4, "y": 0, "width": 2, "height": 2} - } - // ... other widgets on this page - ] -} -``` - ---- - -**Filter Layout Guidelines:** -- Global filters: Position on dedicated filter page, stack vertically at `x=0` -- Page-level filters: Position in header area of page (e.g., top-right corner) -- Typical sizing: `width: 2, height: 2` - -### 8) QUALITY CHECKLIST - -Before deploying, verify: -1. All widget names use only alphanumeric + hyphens + underscores -2. All rows sum to width=6 with no gaps -3. KPIs use height 3-4, charts use height 5-6 -4. Chart dimensions have ≤8 distinct values -5. All widget fieldNames match dataset columns exactly -6. **Field `name` in query.fields matches `fieldName` in encodings exactly** (e.g., both `"sum(spend)"`) -7. Counter datasets: use `disaggregated: true` for 1-row datasets, `disaggregated: false` with aggregation for multi-row -8. Percent values are 0-1 (not 0-100) -9. SQL uses Spark syntax (date_sub, not INTERVAL) -10. 
**All SQL queries tested via `execute_sql` and return expected data** - ---- - -## Complete Example - -```python -import json - -# Step 1: Check table schema -table_info = get_table_details(catalog="samples", schema="nyctaxi") - -# Step 2: Test queries -execute_sql("SELECT COUNT(*) as trips, AVG(fare_amount) as avg_fare, AVG(trip_distance) as avg_distance FROM samples.nyctaxi.trips") -execute_sql(""" - SELECT pickup_zip, COUNT(*) as trip_count - FROM samples.nyctaxi.trips - GROUP BY pickup_zip - ORDER BY trip_count DESC - LIMIT 10 -""") - -# Step 3: Build dashboard JSON -dashboard = { - "datasets": [ - { - "name": "summary", - "displayName": "Summary Stats", - "queryLines": [ - "SELECT COUNT(*) as trips, AVG(fare_amount) as avg_fare, ", - "AVG(trip_distance) as avg_distance ", - "FROM samples.nyctaxi.trips " - ] - }, - { - "name": "by_zip", - "displayName": "Trips by ZIP", - "queryLines": [ - "SELECT pickup_zip, COUNT(*) as trip_count ", - "FROM samples.nyctaxi.trips ", - "GROUP BY pickup_zip ", - "ORDER BY trip_count DESC ", - "LIMIT 10 " - ] - } - ], - "pages": [{ - "name": "overview", - "displayName": "NYC Taxi Overview", - "pageType": "PAGE_TYPE_CANVAS", - "layout": [ - # Text header - NO spec block! Use SEPARATE widgets for title and subtitle! - { - "widget": { - "name": "title", - "multilineTextboxSpec": { - "lines": ["## NYC Taxi Dashboard"] - } - }, - "position": {"x": 0, "y": 0, "width": 6, "height": 1} - }, - { - "widget": { - "name": "subtitle", - "multilineTextboxSpec": { - "lines": ["Trip statistics and analysis"] - } - }, - "position": {"x": 0, "y": 1, "width": 6, "height": 1} - }, - # Counter - version 2, width 2! - { - "widget": { - "name": "total-trips", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "summary", - "fields": [{"name": "trips", "expression": "`trips`"}], - "disaggregated": True - } - }], - "spec": { - "version": 2, - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "trips", "displayName": "Total Trips"} - }, - "frame": {"title": "Total Trips", "showTitle": True} - } - }, - "position": {"x": 0, "y": 2, "width": 2, "height": 3} - }, - { - "widget": { - "name": "avg-fare", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "summary", - "fields": [{"name": "avg_fare", "expression": "`avg_fare`"}], - "disaggregated": True - } - }], - "spec": { - "version": 2, - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "avg_fare", "displayName": "Avg Fare"} - }, - "frame": {"title": "Average Fare", "showTitle": True} - } - }, - "position": {"x": 2, "y": 2, "width": 2, "height": 3} - }, - { - "widget": { - "name": "total-distance", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "summary", - "fields": [{"name": "avg_distance", "expression": "`avg_distance`"}], - "disaggregated": True - } - }], - "spec": { - "version": 2, - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "avg_distance", "displayName": "Avg Distance"} - }, - "frame": {"title": "Average Distance", "showTitle": True} - } - }, - "position": {"x": 4, "y": 2, "width": 2, "height": 3} - }, - # Bar chart - version 3 - { - "widget": { - "name": "trips-by-zip", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "by_zip", - "fields": [ - {"name": "pickup_zip", "expression": "`pickup_zip`"}, - {"name": "trip_count", "expression": "`trip_count`"} - ], - "disaggregated": True - } - }], - "spec": { - "version": 3, - "widgetType": "bar", - "encodings": { - "x": {"fieldName": "pickup_zip", 
"scale": {"type": "categorical"}, "displayName": "ZIP"}, - "y": {"fieldName": "trip_count", "scale": {"type": "quantitative"}, "displayName": "Trips"} - }, - "frame": {"title": "Trips by Pickup ZIP", "showTitle": True} - } - }, - "position": {"x": 0, "y": 5, "width": 6, "height": 5} - }, - # Table - version 2, minimal column props! - { - "widget": { - "name": "zip-table", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "by_zip", - "fields": [ - {"name": "pickup_zip", "expression": "`pickup_zip`"}, - {"name": "trip_count", "expression": "`trip_count`"} - ], - "disaggregated": True - } - }], - "spec": { - "version": 2, - "widgetType": "table", - "encodings": { - "columns": [ - {"fieldName": "pickup_zip", "displayName": "ZIP Code"}, - {"fieldName": "trip_count", "displayName": "Trip Count"} - ] - }, - "frame": {"title": "Top ZIP Codes", "showTitle": True} - } - }, - "position": {"x": 0, "y": 10, "width": 6, "height": 5} - } - ] - }] -} - -# Step 4: Deploy -result = create_or_update_dashboard( - display_name="NYC Taxi Dashboard", - parent_path="/Workspace/Users/me/dashboards", - serialized_dashboard=json.dumps(dashboard), - warehouse_id=get_best_warehouse(), -) -print(result["url"]) -``` - -## Complete Example with Filters - -```python -import json - -# Dashboard with a global filter for region -dashboard_with_filters = { - "datasets": [ - { - "name": "sales", - "displayName": "Sales Data", - "queryLines": [ - "SELECT region, SUM(revenue) as total_revenue ", - "FROM catalog.schema.sales ", - "GROUP BY region" - ] - } - ], - "pages": [ - { - "name": "overview", - "displayName": "Sales Overview", - "pageType": "PAGE_TYPE_CANVAS", - "layout": [ - { - "widget": { - "name": "total-revenue", - "queries": [{ - "name": "main_query", - "query": { - "datasetName": "sales", - "fields": [{"name": "total_revenue", "expression": "`total_revenue`"}], - "disaggregated": True - } - }], - "spec": { - "version": 2, # Version 2 for counters! - "widgetType": "counter", - "encodings": { - "value": {"fieldName": "total_revenue", "displayName": "Total Revenue"} - }, - "frame": {"title": "Total Revenue", "showTitle": True} - } - }, - "position": {"x": 0, "y": 0, "width": 6, "height": 3} - } - ] - }, - { - "name": "filters", - "displayName": "Filters", - "pageType": "PAGE_TYPE_GLOBAL_FILTERS", # Required for global filter page! - "layout": [ - { - "widget": { - "name": "filter_region", - "queries": [{ - "name": "ds_sales_region", - "query": { - "datasetName": "sales", - "fields": [ - {"name": "region", "expression": "`region`"} - # DO NOT use associative_filter_predicate_group - causes SQL errors! - ], - "disaggregated": False # False for filters! - } - }], - "spec": { - "version": 2, # Version 2 for filters! - "widgetType": "filter-multi-select", # NOT "filter"! - "encodings": { - "fields": [{ - "fieldName": "region", - "displayName": "Region", - "queryName": "ds_sales_region" # Must match query name! - }] - }, - "frame": {"showTitle": True, "title": "Region"} # Always show title! 
- } - }, - "position": {"x": 0, "y": 0, "width": 2, "height": 2} - } - ] - } - ] -} - -# Deploy with filters -result = create_or_update_dashboard( - display_name="Sales Dashboard with Filters", - parent_path="/Workspace/Users/me/dashboards", - serialized_dashboard=json.dumps(dashboard_with_filters), - warehouse_id=get_best_warehouse(), -) -print(result["url"]) -``` - -## Troubleshooting - -### Widget shows "no selected fields to visualize" - -**This is a field name mismatch error.** The `name` in `query.fields` must exactly match the `fieldName` in `encodings`. - -**Fix:** Ensure names match exactly: -```json -// WRONG - names don't match -"fields": [{"name": "spend", "expression": "SUM(`spend`)"}] -"encodings": {"value": {"fieldName": "sum(spend)", ...}} // ERROR! - -// CORRECT - names match -"fields": [{"name": "sum(spend)", "expression": "SUM(`spend`)"}] -"encodings": {"value": {"fieldName": "sum(spend)", ...}} // OK! -``` - -### Widget shows "Invalid widget definition" - -**Check version numbers:** -- Counters: `version: 2` -- Tables: `version: 2` -- Filters: `version: 2` -- Bar/Line/Pie charts: `version: 3` - -**Text widget errors:** -- Text widgets must NOT have a `spec` block -- Use `multilineTextboxSpec` directly on the widget object -- Do NOT use `widgetType: "text"` - this is invalid - -**Table widget errors:** -- Use `version: 2` (NOT 1 or 3) -- Column objects only need `fieldName` and `displayName` -- Do NOT add `type`, `numberFormat`, or other column properties - -**Counter widget errors:** -- Use `version: 2` (NOT 3) -- Ensure dataset returns exactly 1 row - -### Dashboard shows empty widgets -- Run the dataset SQL query directly to check data exists -- Verify column aliases match widget field expressions -- Check `disaggregated` flag (should be `true` for pre-aggregated data) - -### Layout has gaps -- Ensure each row sums to width=6 -- Check that y positions don't skip values - -### Filter shows "Invalid widget definition" -- Check `widgetType` is one of: `filter-multi-select`, `filter-single-select`, `filter-date-range-picker` -- **DO NOT** use `widgetType: "filter"` - this is invalid -- Verify `spec.version` is `2` -- Ensure `queryName` in encodings matches the query `name` -- Confirm `disaggregated: false` in filter queries -- Ensure `frame` with `showTitle: true` is included - -### Filter not affecting expected pages -- **Global filters** (on `PAGE_TYPE_GLOBAL_FILTERS` page) affect all datasets containing the filter field -- **Page-level filters** (on `PAGE_TYPE_CANVAS` page) only affect widgets on that same page -- A filter only works on datasets that include the filter dimension column - -### Filter shows "UNRESOLVED_COLUMN" error for `associative_filter_predicate_group` -- **DO NOT** use `COUNT_IF(\`associative_filter_predicate_group\`)` in filter queries -- This internal expression causes SQL errors when the dashboard executes queries -- Use a simple field expression instead: `{"name": "field", "expression": "\`field\`"}` - -### Text widget shows title and description on same line -- Multiple items in the `lines` array are **concatenated**, not displayed on separate lines -- Use **separate text widgets** for title and subtitle at different y positions -- Example: title at y=0 with height=1, subtitle at y=1 with height=1 - -## Related Skills - -- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - for querying the underlying data and system tables -- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - 
for building the data pipelines that feed dashboards -- **[databricks-jobs](../databricks-jobs/SKILL.md)** - for scheduling dashboard data refreshes diff --git a/.claude/skills/databricks-app-python/6-mcp-approach.md b/.claude/skills/databricks-app-python/6-mcp-approach.md deleted file mode 100644 index 23ffb67..0000000 --- a/.claude/skills/databricks-app-python/6-mcp-approach.md +++ /dev/null @@ -1,94 +0,0 @@ -# MCP Tools for App Lifecycle - -Use MCP tools to create, deploy, and manage Databricks Apps programmatically. This mirrors the CLI workflow but can be invoked by AI agents. - ---- - -## Workflow - -### Step 1: Write App Files Locally - -Create your app files in a local folder: - -``` -my_app/ -├── app.py # Main application -├── models.py # Pydantic models -├── backend.py # Data access layer -├── requirements.txt # Additional dependencies -└── app.yaml # Databricks Apps configuration -``` - -### Step 2: Upload to Workspace - -```python -# MCP Tool: upload_folder -upload_folder( - local_folder="/path/to/my_app", - workspace_folder="/Workspace/Users/user@example.com/my_app" -) -``` - -### Step 3: Create App - -```python -# MCP Tool: create_app -result = create_app( - name="my-dashboard", - description="Customer analytics dashboard" -) -# Returns: {"name": "my-dashboard", "url": "https://..."} -``` - -### Step 4: Deploy - -```python -# MCP Tool: deploy_app -result = deploy_app( - app_name="my-dashboard", - source_code_path="/Workspace/Users/user@example.com/my_app" -) -# Returns: {"deployment_id": "...", "status": "PENDING", ...} -``` - -### Step 5: Verify - -```python -# MCP Tool: get_app -app = get_app(name="my-dashboard") -# Returns: {"name": "...", "url": "...", "status": "RUNNING", ...} - -# MCP Tool: get_app_logs -logs = get_app_logs(app_name="my-dashboard") -# Returns: {"logs": "...", ...} -``` - -### Step 6: Iterate - -1. Fix issues in local files -2. Re-upload with `upload_folder` -3. Re-deploy with `deploy_app` -4. Check `get_app_logs` for errors -5. Repeat until app is healthy - ---- - -## Quick Reference: MCP Tools - -| Tool | Description | -|------|-------------| -| **`create_app`** | Create a new Databricks App | -| **`get_app`** | Get app details and status | -| **`list_apps`** | List all apps in the workspace | -| **`deploy_app`** | Deploy app from workspace source path | -| **`delete_app`** | Delete an app | -| **`get_app_logs`** | Get app deployment and runtime logs | -| **`upload_folder`** | Upload local folder to workspace (shared tool) | - ---- - -## Notes - -- Add resources (SQL warehouse, Lakebase, etc.) via the Databricks Apps UI after creating the app -- MCP tools use the service principal's permissions — ensure it has access to required resources -- For manual deployment, see [4-deployment.md](4-deployment.md) diff --git a/.claude/skills/databricks-config/SKILL.md b/.claude/skills/databricks-config/SKILL.md deleted file mode 100644 index 2053f15..0000000 --- a/.claude/skills/databricks-config/SKILL.md +++ /dev/null @@ -1,81 +0,0 @@ ---- -name: databricks-config -description: Configure Databricks profile and authenticate for Databricks Connect, Databricks CLI, and Databricks SDK. ---- - -Configure the Databricks profile in ~/.databrickscfg for use with Databricks Connect. 
- -**Usage:** `/databricks-config [profile_name|workspace_host]` - -Examples: -- `/databricks-config` - Configure DEFAULT profile (interactive) -- `/databricks-config DEFAULT` - Configure DEFAULT profile -- `/databricks-config my-workspace` - Configure profile named "my-workspace" -- `/databricks-config https://adb-1234567890123456.7.azuredatabricks.net/` - Configure using workspace host URL - -## Task - -1. Determine the profile and host: - - If a parameter is provided and it starts with `https://`, treat it as a workspace host: - - Extract profile name from the host (e.g., `adb-1234567890123456.7.azuredatabricks.net` → `adb-1234567890123456`, `my-company-dev.cloud.databricks.com` → `my-company-dev`) - - Use this as the profile name and configure it with the provided host - - If a parameter is provided and it doesn't start with `https://`, treat it as a profile name - - If no parameter is provided, ask the user which profile they want to configure (default: DEFAULT) - -2. Run `databricks auth login -p ` with the determined profile name - - If a workspace host was provided, add `--host ` to the command - - This ensures authentication is completed and the profile works -3. Check if the profile exists in ~/.databrickscfg -4. Ask the user to choose ONE of the following compute options: - - **Cluster ID**: Provide a specific cluster ID for an interactive/all-purpose cluster - - **Serverless**: Use serverless compute (sets `serverless_compute_id = auto`) -5. Update the profile in ~/.databrickscfg with the selected configuration -6. Verify the configuration by displaying the updated profile section - -## Important Notes - -- Use the AskUserQuestion tool to present the compute options as a choice -- Only add ONE of: `cluster_id` OR `serverless_compute_id` (never both) -- For serverless, set `serverless_compute_id = auto` (not just `serverless = true`) -- Preserve all existing settings in the profile (host, auth_type, etc.) 
-- Format the configuration file consistently with proper spacing -- The `databricks auth login` command will open a browser for OAuth authentication -- **SECURITY: NEVER print token values in plain text** - - When displaying configuration, redact any `token` field values (e.g., `token = [REDACTED]`) - - Inform the user they can view the full configuration at `~/.databrickscfg` - - This applies to any output showing the profile configuration - -## Example Configurations - -**With Cluster ID:** -``` -[DEFAULT] -host = https://adb-123456789.11.azuredatabricks.net/ -cluster_id = 1217-064531-c9c3ngyn -auth_type = databricks-cli -``` - -**With Serverless:** -``` -[DEFAULT] -host = https://adb-123456789.11.azuredatabricks.net/ -serverless_compute_id = auto -auth_type = databricks-cli -``` - -**With Token (display as redacted):** -``` -[DEFAULT] -host = https://adb-123456789.11.azuredatabricks.net/ -token = [REDACTED] -cluster_id = 1217-064531-c9c3ngyn - -View full configuration at: ~/.databrickscfg -``` - -## Related Skills - -- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - uses profiles configured by this skill -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - references workspace profiles for deployment targets -- **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - apps that connect via configured profiles -- **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps using configured profiles diff --git a/.claude/skills/databricks-genie/SKILL.md b/.claude/skills/databricks-genie/SKILL.md deleted file mode 100644 index 3f08628..0000000 --- a/.claude/skills/databricks-genie/SKILL.md +++ /dev/null @@ -1,128 +0,0 @@ ---- -name: databricks-genie -description: "Create and query Databricks Genie Spaces for natural language SQL exploration. Use when building Genie Spaces or asking questions via the Genie Conversation API." ---- - -# Databricks Genie - -Create and query Databricks Genie Spaces - natural language interfaces for SQL-based data exploration. - -## Overview - -Genie Spaces allow users to ask natural language questions about structured data in Unity Catalog. The system translates questions into SQL queries, executes them on a SQL warehouse, and presents results conversationally. - -## When to Use This Skill - -Use this skill when: -- Creating a new Genie Space for data exploration -- Adding sample questions to guide users -- Connecting Unity Catalog tables to a conversational interface -- Asking questions to a Genie Space programmatically (Conversation API) - -## MCP Tools - -### Space Management - -| Tool | Purpose | -|------|---------| -| `list_genie` | List all Genie Spaces accessible to you | -| `create_or_update_genie` | Create or update a Genie Space | -| `get_genie` | Get Genie Space details | -| `delete_genie` | Delete a Genie Space | - -### Conversation API - -| Tool | Purpose | -|------|---------| -| `ask_genie` | Ask a question to a Genie Space, get SQL + results | -| `ask_genie_followup` | Ask follow-up question in existing conversation | - -### Supporting Tools - -| Tool | Purpose | -|------|---------| -| `get_table_details` | Inspect table schemas before creating a space | -| `execute_sql` | Test SQL queries directly | - -## Quick Start - -### 1. Inspect Your Tables - -Before creating a Genie Space, understand your data: - -```python -get_table_details( - catalog="my_catalog", - schema="sales", - table_stat_level="SIMPLE" -) -``` - -### 2. 
Create the Genie Space - -```python -create_or_update_genie( - display_name="Sales Analytics", - table_identifiers=[ - "my_catalog.sales.customers", - "my_catalog.sales.orders" - ], - description="Explore sales data with natural language", - sample_questions=[ - "What were total sales last month?", - "Who are our top 10 customers?" - ] -) -``` - -### 3. Ask Questions (Conversation API) - -```python -ask_genie( - space_id="your_space_id", - question="What were total sales last month?" -) -# Returns: SQL, columns, data, row_count -``` - -## Workflow - -``` -1. Inspect tables → get_table_details -2. Create space → create_or_update_genie -3. Query space → ask_genie (or test in Databricks UI) -4. Curate (optional) → Use Databricks UI to add instructions -``` - -## Reference Files - -- [spaces.md](spaces.md) - Creating and managing Genie Spaces -- [conversation.md](conversation.md) - Asking questions via the Conversation API - -## Prerequisites - -Before creating a Genie Space: - -1. **Tables in Unity Catalog** - Bronze/silver/gold tables with the data -2. **SQL Warehouse** - A warehouse to execute queries (auto-detected if not specified) - -### Creating Tables - -Use these skills in sequence: -1. `databricks-synthetic-data-generation` - Generate raw parquet files -2. `databricks-spark-declarative-pipelines` - Create bronze/silver/gold tables - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **No warehouse available** | Create a SQL warehouse or provide `warehouse_id` explicitly | -| **Poor query generation** | Add instructions and sample questions that reference actual column names | -| **Slow queries** | Ensure warehouse is running; use OPTIMIZE on tables | - -## Related Skills - -- **[databricks-agent-bricks](../databricks-agent-bricks/SKILL.md)** - Use Genie Spaces as agents inside Supervisor Agents -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - Generate raw parquet data to populate tables for Genie -- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - Build bronze/silver/gold tables consumed by Genie Spaces -- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - Manage the catalogs, schemas, and tables Genie queries diff --git a/.claude/skills/databricks-genie/spaces.md b/.claude/skills/databricks-genie/spaces.md deleted file mode 100644 index 8549d6b..0000000 --- a/.claude/skills/databricks-genie/spaces.md +++ /dev/null @@ -1,203 +0,0 @@ -# Creating Genie Spaces - -This guide covers creating and managing Genie Spaces for SQL-based data exploration. - -## What is a Genie Space? - -A Genie Space connects to Unity Catalog tables and translates natural language questions into SQL queries. The system: - -1. **Understands** the table schemas and relationships -2. **Generates** SQL queries from natural language -3. **Executes** queries on a SQL warehouse -4. **Presents** results in a conversational format - -## Creation Workflow - -### Step 1: Inspect Table Schemas (Required) - -**Before creating a Genie Space, you MUST inspect the table schemas** to understand what data is available: - -```python -get_table_details( - catalog="my_catalog", - schema="sales", - table_stat_level="SIMPLE" -) -``` - -This returns: -- Table names and row counts -- Column names and data types -- Sample values and cardinality -- Null counts and statistics - -### Step 2: Analyze and Plan - -Based on the schema information: - -1. 
**Select relevant tables** - Choose tables that support the user's use case -2. **Identify key columns** - Note date columns, metrics, dimensions, and foreign keys -3. **Understand relationships** - How do tables join together? -4. **Plan sample questions** - What questions can this data answer? - -### Step 3: Create the Genie Space - -Create the space with content tailored to the actual data: - -```python -create_or_update_genie( - display_name="Sales Analytics", - table_identifiers=[ - "my_catalog.sales.customers", - "my_catalog.sales.orders", - "my_catalog.sales.products" - ], - description="""Explore retail sales data with three related tables: -- customers: Customer demographics including region, segment, and signup date -- orders: Transaction history with order_date, total_amount, and status -- products: Product catalog with category, price, and inventory - -Tables join on customer_id and product_id.""", - sample_questions=[ - "What were total sales last month?", - "Who are our top 10 customers by total_amount?", - "How many orders were placed in Q4 by region?", - "What's the average order value by customer segment?", - "Which product categories have the highest revenue?", - "Show me customers who haven't ordered in 90 days" - ] -) -``` - -## Why This Workflow Matters - -**Sample questions that reference actual column names** help Genie: -- Learn the vocabulary of your data -- Generate more accurate SQL queries -- Provide better autocomplete suggestions - -**A description that explains table relationships** helps Genie: -- Understand how to join tables correctly -- Know which table contains which information -- Provide more relevant answers - -## Auto-Detection of Warehouse - -When `warehouse_id` is not specified, the tool: - -1. Lists all SQL warehouses in the workspace -2. Prioritizes by: - - **Running** warehouses first (already available) - - **Starting** warehouses second - - **Smaller sizes** preferred (cost-efficient) -3. Returns an error if no warehouses exist - -To use a specific warehouse, provide the `warehouse_id` explicitly. - -## Table Selection - -Choose tables carefully for best results: - -| Layer | Recommended | Why | -|-------|-------------|-----| -| Bronze | No | Raw data, may have quality issues | -| Silver | Yes | Cleaned and validated | -| Gold | Yes | Aggregated, optimized for analytics | - -### Tips for Table Selection - -- **Include related tables**: If users ask about customers and orders, include both -- **Use descriptive column names**: `customer_name` is better than `cust_nm` -- **Add table comments**: Genie uses metadata to understand the data - -## Sample Questions - -Sample questions help users understand what they can ask: - -**Good sample questions:** -- "What were total sales last month?" -- "Who are our top 10 customers by revenue?" -- "How many orders were placed in Q4?" -- "What's the average order value by region?" - -These appear in the Genie UI to guide users. - -## Best Practices - -### Table Design for Genie - -1. **Descriptive names**: Use `customer_lifetime_value` not `clv` -2. **Add comments**: `COMMENT ON TABLE sales.customers IS 'Customer master data'` -3. **Primary keys**: Define relationships clearly -4. **Date columns**: Include proper date/timestamp columns for time-based queries - -### Description and Context - -Provide context in the description: - -``` -Explore retail sales data from our e-commerce platform. 
Includes: -- Customers: demographics, segments, and account status -- Orders: transaction history with amounts and dates -- Products: catalog with categories and pricing - -Time range: Last 6 months of data -``` - -### Sample Questions - -Write sample questions that: -- Cover common use cases -- Demonstrate the data's capabilities -- Use natural language (not SQL terms) - -## Updating a Genie Space - -To update an existing space: - -1. **Add/remove tables**: Call `create_or_update_genie` with updated `table_identifiers` -2. **Update questions**: Include new `sample_questions` -3. **Change warehouse**: Provide a different `warehouse_id` - -The tool finds the existing space by name and updates it. - -## Example End-to-End Workflow - -1. **Generate synthetic data** using `databricks-synthetic-data-generation` skill: - - Creates parquet files in `/Volumes/catalog/schema/raw_data/` - -2. **Create tables** using `databricks-spark-declarative-pipelines` skill: - - Creates `catalog.schema.bronze_*` → `catalog.schema.silver_*` → `catalog.schema.gold_*` - -3. **Inspect the tables**: - ```python - get_table_details(catalog="catalog", schema="schema") - ``` - -4. **Create the Genie Space**: - - `display_name`: "My Data Explorer" - - `table_identifiers`: `["catalog.schema.silver_customers", "catalog.schema.silver_orders"]` - -5. **Add sample questions** based on actual column names - -6. **Test** in the Databricks UI - -## Troubleshooting - -### No warehouse available - -- Create a SQL warehouse in the Databricks workspace -- Or provide a specific `warehouse_id` - -### Queries are slow - -- Ensure the warehouse is running (not stopped) -- Consider using a larger warehouse size -- Check if tables are optimized (OPTIMIZE, Z-ORDER) - -### Poor query generation - -- Use descriptive column names -- Add table and column comments -- Include sample questions that demonstrate the vocabulary -- Add instructions via the Databricks Genie UI diff --git a/.claude/skills/databricks-lakebase-provisioned/reverse-etl.md b/.claude/skills/databricks-lakebase-provisioned/reverse-etl.md deleted file mode 100644 index 9bf17bd..0000000 --- a/.claude/skills/databricks-lakebase-provisioned/reverse-etl.md +++ /dev/null @@ -1,226 +0,0 @@ -# Reverse ETL with Lakebase - -## Overview - -Reverse ETL allows you to sync data from Unity Catalog Delta tables into Lakebase Provisioned as PostgreSQL tables. This enables OLTP access patterns on data processed in the Lakehouse. 
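-
-Once a table is synced, applications can read it over the standard PostgreSQL wire protocol. Below is a minimal sketch using `psycopg2`; the host, database, and credentials are hypothetical placeholders, so substitute your instance's connection details:
-
-```python
-import psycopg2  # assumes psycopg2-binary is installed
-
-# Hypothetical connection details for a Lakebase Provisioned instance
-conn = psycopg2.connect(
-    host="my-lakebase-instance.example.cloud.databricks.com",
-    dbname="databricks_postgres",
-    user="me@example.com",
-    password="<oauth-token-or-password>",
-    sslmode="require",
-)
-
-with conn.cursor() as cur:
-    # Low-latency point lookup against a synced table
-    cur.execute("SELECT * FROM customers WHERE customer_id = %s", ("C-1001",))
-    print(cur.fetchone())
-
-conn.close()
-```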
- -## Creating Synced Tables - -### Using Python SDK - -```python -from databricks.sdk import WorkspaceClient - -w = WorkspaceClient() - -# Create a synced table from Unity Catalog -synced_table = w.database.create_synced_table( - instance_name="my-lakebase-instance", - source_table_name="catalog.schema.source_table", - target_table_name="target_table", - sync_mode="FULL", # FULL or INCREMENTAL -) - -print(f"Synced table created: {synced_table.target_table_name}") -``` - -### Using SQL - -```sql --- Create synced table via SQL -CREATE SYNCED TABLE my_lakebase.target_table -FROM catalog.schema.source_table -USING LAKEBASE INSTANCE 'my-lakebase-instance'; -``` - -### Using CLI - -```bash -databricks database create-synced-table \ - --instance-name my-lakebase-instance \ - --source-table-name catalog.schema.source_table \ - --target-table-name target_table \ - --sync-mode FULL -``` - -## Sync Modes - -### Full Sync - -Complete replacement of target table on each sync: - -```python -synced_table = w.database.create_synced_table( - instance_name="my-lakebase-instance", - source_table_name="catalog.schema.customers", - target_table_name="customers", - sync_mode="FULL" -) -``` - -**Use when:** -- Source table is small-medium size -- Need complete consistency with source -- Incremental changes are complex to track - -### Incremental Sync - -Only sync changed rows (requires change tracking): - -```python -synced_table = w.database.create_synced_table( - instance_name="my-lakebase-instance", - source_table_name="catalog.schema.events", - target_table_name="events", - sync_mode="INCREMENTAL", - incremental_column="updated_at" # Column to track changes -) -``` - -**Use when:** -- Source table is large -- Have reliable change tracking column -- Minimize sync time and resource usage - -## Managing Synced Tables - -### List Synced Tables - -```python -synced_tables = w.database.list_synced_tables( - instance_name="my-lakebase-instance" -) -for table in synced_tables: - print(f"{table.target_table_name}: {table.sync_status}") -``` - -### Trigger Manual Sync - -```python -w.database.sync_table( - instance_name="my-lakebase-instance", - table_name="customers" -) -``` - -### Delete Synced Table - -```python -w.database.delete_synced_table( - instance_name="my-lakebase-instance", - table_name="customers" -) -``` - -## Scheduling Syncs - -### Using Databricks Jobs - -```python -from databricks.sdk import WorkspaceClient -from databricks.sdk.service.jobs import Task, NotebookTask, CronSchedule - -w = WorkspaceClient() - -# Create job to sync tables on schedule -job = w.jobs.create( - name="Lakebase Sync Job", - tasks=[ - Task( - task_key="sync_customers", - notebook_task=NotebookTask( - notebook_path="/Repos/sync/sync_customers" - ) - ) - ], - schedule=CronSchedule( - quartz_cron_expression="0 0 * * * ?", # Every hour - timezone_id="UTC" - ) -) -``` - -### Sync Notebook Example - -```python -# Databricks notebook: sync_customers - -from databricks.sdk import WorkspaceClient - -w = WorkspaceClient() - -# Trigger sync for specific tables -tables_to_sync = ["customers", "orders", "products"] - -for table in tables_to_sync: - try: - w.database.sync_table( - instance_name="my-lakebase-instance", - table_name=table - ) - print(f"Synced: {table}") - except Exception as e: - print(f"Failed to sync {table}: {e}") -``` - -## Use Cases - -### 1. 
Product Catalog for Web App - -```python -# Sync product data for e-commerce app -w.database.create_synced_table( - instance_name="ecommerce-db", - source_table_name="gold.products.catalog", - target_table_name="products", - sync_mode="FULL" -) - -# Application queries PostgreSQL directly -# with low-latency point lookups -``` - -### 2. User Profiles for Authentication - -```python -# Sync user profiles for auth service -w.database.create_synced_table( - instance_name="auth-db", - source_table_name="gold.users.profiles", - target_table_name="user_profiles", - sync_mode="INCREMENTAL", - incremental_column="last_modified" -) -``` - -### 3. Feature Store for Real-time ML - -```python -# Sync features for online serving -w.database.create_synced_table( - instance_name="feature-store-db", - source_table_name="ml.features.user_features", - target_table_name="user_features", - sync_mode="INCREMENTAL", - incremental_column="computed_at" -) - -# ML model queries features with low latency -``` - -## Best Practices - -1. **Choose appropriate sync mode**: Use FULL for small tables, INCREMENTAL for large tables with change tracking -2. **Schedule during low-traffic periods**: Heavy syncs can impact both source and target -3. **Monitor sync status**: Check for failures and latency -4. **Index target tables**: Create appropriate indexes in PostgreSQL for query patterns -5. **Handle schema changes**: Synced tables need updates when source schema changes - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **Sync takes too long** | Switch to INCREMENTAL mode; add indexes on source | -| **Schema mismatch** | Drop and recreate synced table after source schema changes | -| **Sync fails with timeout** | Increase sync timeout; reduce batch size | -| **Target table locked** | Avoid DDL on target during sync operations | diff --git a/.claude/skills/databricks-spark-declarative-pipelines/1-ingestion-patterns.md b/.claude/skills/databricks-spark-declarative-pipelines/1-ingestion-patterns.md deleted file mode 100644 index 2f60202..0000000 --- a/.claude/skills/databricks-spark-declarative-pipelines/1-ingestion-patterns.md +++ /dev/null @@ -1,513 +0,0 @@ -# Data Ingestion Patterns for SDP - -Covers data ingestion patterns for Spark Declarative Pipelines including Auto Loader for cloud storage and streaming sources like Kafka and Event Hub. - -**Language Support**: SQL (primary), Python via modern `pyspark.pipelines` API. See [5-python-api.md](5-python-api.md) for Python syntax. - ---- - -## Auto Loader (Cloud Files) - -Auto Loader incrementally processes new data files as they arrive in cloud storage. In a streaming table query you **must use the `STREAM` keyword with `read_files`**; `read_files` then leverages Auto Loader. See [read_files — Usage in streaming tables](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#usage-in-streaming-tables). - -### Basic Pattern - -```sql -CREATE OR REPLACE STREAMING TABLE bronze_orders AS -SELECT - *, - current_timestamp() AS _ingested_at, - _metadata.file_path AS source_file, - _metadata.file_modification_time AS file_timestamp -FROM STREAM read_files( - '/mnt/raw/orders/', - format => 'json', - schemaHints => 'order_id STRING, amount DECIMAL(10,2)' -); -``` - -### Bronze feeding AUTO CDC - -If the bronze table feeds a downstream **AUTO CDC** flow (e.g. `FROM stream(bronze_orders_cdc)`), use **`FROM STREAM read_files(...)`** so the source is streaming. 
Otherwise you may get: *"Cannot create a streaming table append once flow from a batch query."* Same requirement as above: in a streaming table query you must use the `STREAM` keyword with `read_files`. - -```sql -CREATE OR REPLACE STREAMING TABLE bronze_orders_cdc AS -SELECT ..., - current_timestamp() AS _ingested_at, - _metadata.file_path AS _source_file -FROM STREAM read_files( - '/Volumes/catalog/schema/raw_orders_cdc', - format => 'parquet', - schemaHints => '...' -); -``` - -### Schema Evolution - -```sql -CREATE OR REPLACE STREAMING TABLE bronze_customers AS -SELECT - *, - current_timestamp() AS _ingested_at -FROM STREAM read_files( - '/mnt/raw/customers/', - format => 'json', - schemaHints => 'customer_id STRING, email STRING', - mode => 'PERMISSIVE' -- Handles schema changes gracefully -); -``` - -### File Formats - -**JSON**: -```sql -FROM read_files( - 's3://bucket/data/', - format => 'json', - schemaHints => 'id STRING, timestamp TIMESTAMP' -) -``` - -**CSV**: -```sql -FROM read_files( - '/mnt/raw/data/', - format => 'csv', - schemaHints => 'id STRING, name STRING, amount DECIMAL(10,2)', - header => true, - delimiter => ',' -) -``` - -**Parquet** (schema auto-inferred): -```sql -FROM read_files( - 'abfss://container@storage.dfs.core.windows.net/data/', - format => 'parquet' -) -``` - -**Avro**: -```sql -FROM read_files( - '/mnt/raw/events/', - format => 'avro', - schemaHints => 'event_id STRING, event_time TIMESTAMP' -) -``` - -### Schema Inference - -**Explicit hints** (recommended for production): -```sql -FROM read_files( - '/mnt/raw/sales/', - format => 'json', - schemaHints => 'sale_id STRING, customer_id STRING, amount DECIMAL(10,2), sale_date DATE' -) -``` - -**Partial hints** (infer remaining columns): -```sql -FROM read_files( - '/mnt/raw/data/', - format => 'json', - schemaHints => 'id STRING, critical_field DECIMAL(10,2)' -- Others auto-inferred -) -``` - -Add this to the pipeline configuration in `resources/*_etl.pipeline.yml`: -```yaml -configuration: - bronze_schema: ${var.bronze_schema} - silver_schema: ${var.silver_schema} - gold_schema: ${var.gold_schema} - schema_location_base: ${var.schema_location_base} -``` - -And define variables in `databricks.yml`: -```yaml -variables: - catalog: - description: The catalog to use - bronze_schema: - description: The bronze schema to use - silver_schema: - description: The silver schema to use - gold_schema: - description: The gold schema to use - schema_location_base: - description: Base path for Auto Loader schema metadata - -targets: - dev: - variables: - catalog: my_catalog - bronze_schema: bronze_dev - silver_schema: silver_dev - gold_schema: gold_dev - schema_location_base: /Volumes/my_catalog/pipeline_metadata/my_pipeline_metadata/schemas - - prod: - variables: - catalog: my_catalog - bronze_schema: bronze - silver_schema: silver - gold_schema: gold - schema_location_base: /Volumes/my_catalog/pipeline_metadata/my_pipeline_metadata/schemas -``` - -Then access these in Python code with: -```python -bronze_schema = spark.conf.get("bronze_schema") -silver_schema = spark.conf.get("silver_schema") -gold_schema = spark.conf.get("gold_schema") -schema_location_base = spark.conf.get("schema_location_base") -``` - - - -### Rescue Data and Quarantine - -Handle malformed records with `_rescued_data`: - -```sql --- Flag records with parsing errors -CREATE OR REPLACE STREAMING TABLE bronze_events AS -SELECT - *, - current_timestamp() AS _ingested_at, - CASE WHEN _rescued_data IS NOT NULL THEN TRUE ELSE FALSE END AS 
has_parsing_errors
-FROM STREAM read_files(
-  '/mnt/raw/events/',
-  format => 'json',
-  schemaHints => 'event_id STRING, event_time TIMESTAMP'
-);
-
--- Quarantine for investigation
-CREATE OR REPLACE STREAMING TABLE bronze_events_quarantine AS
-SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NOT NULL;
-
--- Clean data for downstream
-CREATE OR REPLACE STREAMING TABLE silver_events_clean AS
-SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NULL;
-```
-
----
-
-## Streaming Sources (Kafka, Event Hub, Kinesis)
-
-### Kafka Source
-
-```sql
-CREATE OR REPLACE STREAMING TABLE bronze_kafka_events AS
-SELECT
-  CAST(key AS STRING) AS event_key,
-  CAST(value AS STRING) AS event_value,
-  topic,
-  partition,
-  offset,
-  timestamp AS kafka_timestamp,
-  current_timestamp() AS _ingested_at
-FROM read_stream(
-  format => 'kafka',
-  kafka.bootstrap.servers => '${kafka_brokers}',
-  subscribe => 'events-topic',
-  startingOffsets => 'latest', -- or 'earliest'
-  kafka.security.protocol => 'SASL_SSL',
-  kafka.sasl.mechanism => 'PLAIN',
-  kafka.sasl.jaas.config => 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="${kafka_username}" password="${kafka_password}";'
-);
-```
-
-### Kafka with Multiple Topics
-
-```sql
-FROM read_stream(
-  format => 'kafka',
-  kafka.bootstrap.servers => '${kafka_brokers}',
-  subscribe => 'topic1,topic2,topic3',
-  startingOffsets => 'latest'
-)
-```
-
-### Azure Event Hub
-
-```sql
-CREATE OR REPLACE STREAMING TABLE bronze_eventhub_events AS
-SELECT
-  CAST(body AS STRING) AS event_body,
-  enqueuedTime AS event_time,
-  offset,
-  sequenceNumber,
-  current_timestamp() AS _ingested_at
-FROM read_stream(
-  format => 'eventhubs',
-  eventhubs.connectionString => '${eventhub_connection_string}',
-  eventhubs.consumerGroup => '${consumer_group}',
-  startingPosition => 'latest'
-);
-```
-
-### AWS Kinesis
-
-```sql
-CREATE OR REPLACE STREAMING TABLE bronze_kinesis_events AS
-SELECT
-  CAST(data AS STRING) AS event_data,
-  partitionKey,
-  sequenceNumber,
-  approximateArrivalTimestamp AS arrival_time,
-  current_timestamp() AS _ingested_at
-FROM read_stream(
-  format => 'kinesis',
-  kinesis.streamName => '${stream_name}',
-  kinesis.region => '${aws_region}',
-  kinesis.startingPosition => 'LATEST'
-);
-```
-
-### Parse JSON from Streaming Sources
-
-```sql
--- Parse JSON from Kafka value
-CREATE OR REPLACE STREAMING TABLE silver_kafka_parsed AS
-SELECT
-  from_json(
-    event_value,
-    'event_id STRING, event_type STRING, user_id STRING, timestamp TIMESTAMP, properties MAP<STRING, STRING>'
-  ) AS event_data,
-  kafka_timestamp,
-  _ingested_at
-FROM STREAM bronze_kafka_events;
-
--- Flatten parsed JSON
-CREATE OR REPLACE STREAMING TABLE silver_kafka_flattened AS
-SELECT
-  event_data.event_id,
-  event_data.event_type,
-  event_data.user_id,
-  event_data.timestamp AS event_timestamp,
-  event_data.properties,
-  kafka_timestamp,
-  _ingested_at
-FROM STREAM silver_kafka_parsed;
-```
-
----
-
-## Authentication
-
-### Using Databricks Secrets
-
-**Kafka**:
-```sql
-kafka.sasl.jaas.config => 'kafkashaded.org.apache.kafka.common.security.plain.PlainLoginModule required username="{{secrets/kafka/username}}" password="{{secrets/kafka/password}}";'
-```
-
-**Event Hub**:
-```sql
-eventhubs.connectionString => '{{secrets/eventhub/connection-string}}'
-```
-
-### Using Pipeline Variables
-
-Reference variables in SQL:
-```sql
-kafka.bootstrap.servers => '${kafka_brokers}'
-```
-
-Define in pipeline configuration:
-```yaml
-variables:
-  kafka_brokers:
-    default: 
"broker1:9092,broker2:9092" -``` - ---- - -## Key Patterns - -### 1. Always Add Ingestion Timestamp - -```sql -SELECT - *, - current_timestamp() AS _ingested_at -- Track when data entered system -FROM read_files(...) -``` - -### 2. Include File Metadata for Debugging - -```sql -SELECT - *, - _metadata.file_path AS source_file, - _metadata.file_modification_time AS file_timestamp, - _metadata.file_size AS file_size -FROM read_files(...) -``` - -### 3. Use Schema Hints for Production - -```sql --- ✅ Explicit schema prevents surprises -FROM read_files( - '/mnt/data/', - format => 'json', - schemaHints => 'id STRING, amount DECIMAL(10,2), date DATE' -) - --- ❌ Fully inferred schemas can drift -FROM read_files('/mnt/data/', format => 'json') -``` - -### 4. Handle Rescue Data for Quality - -```sql --- Route errors to quarantine, clean to downstream -CREATE OR REPLACE STREAMING TABLE bronze_data_quarantine AS -SELECT * FROM STREAM bronze_data WHERE has_errors; - -CREATE OR REPLACE STREAMING TABLE silver_data AS -SELECT * FROM STREAM bronze_data WHERE NOT has_errors; -``` - -### 5. Starting Positions - -**Development**: `startingOffsets => 'latest'` (new data only) -**Backfill**: `startingOffsets => 'earliest'` (all available data) -**Recovery**: Checkpoints handle automatically - ---- - -## Common Issues - -| Issue | Solution | -|-------|----------| -| Files not picked up | Verify format matches files and path is correct | -| Schema evolution breaking | Use `mode => 'PERMISSIVE'` and monitor `_rescued_data` | -| Kafka lag increasing | Check downstream bottlenecks, increase parallelism | -| Duplicate events | Implement deduplication in silver layer (see [2-streaming-patterns.md](2-streaming-patterns.md)) | -| Parsing errors | Use rescue data pattern to quarantine malformed records | - ---- - -## Python API Examples - -For Python, use modern `pyspark.pipelines` API. See [5-python-api.md](5-python-api.md) for complete guidance. - -**IMPORTANT for Python**: When using `spark.readStream.format("cloudFiles")` for cloud storage ingestion, you **must specify a `cloudFiles.schemaLocation`** for Auto Loader schema metadata. - -### Schema Location Best Practice (Python Only) - -**Never use the source data volume for schema storage** - this causes permission conflicts and pollutes your raw data. - -#### Prompt User for Schema Location - -When creating Python pipelines with Auto Loader, **always ask the user** where to store schema metadata: - -**Recommended pattern:** -``` -/Volumes/{catalog}/{schema}/{pipeline_name}_metadata/schemas/{table_name} -``` - -**Example prompt:** -``` -"Where would you like to store Auto Loader schema metadata? - -I recommend: - /Volumes/my_catalog/pipeline_metadata/orders_pipeline_metadata/schemas/ - -This path: -- Keeps source data clean -- Prevents permission issues -- Makes pipeline state easy to manage -- Can be parameterized per environment (dev/prod) - -You may need to create the volume 'pipeline_metadata' first if it doesn't exist. - -Would you like to use this path?" 
-```
-
-### Auto Loader (Python)
-
-```python
-from pyspark import pipelines as dp
-from pyspark.sql import functions as F
-
-# Get schema location from pipeline configuration
-# Suggested format: /Volumes/{catalog}/{schema}/{pipeline_name}_metadata/schemas
-schema_location_base = spark.conf.get("schema_location_base")
-
-@dp.table(name="bronze_orders", cluster_by=["order_date"])
-def bronze_orders():
-    return (
-        spark.readStream
-        .format("cloudFiles")
-        .option("cloudFiles.format", "json")
-        .option("cloudFiles.schemaLocation", f"{schema_location_base}/bronze_orders")
-        .option("cloudFiles.inferColumnTypes", "true")
-        .load("/Volumes/catalog/schema/raw/orders/")
-        .withColumn("_ingested_at", F.current_timestamp())
-        .withColumn("_source_file", F.col("_metadata.file_path"))
-    )
-```
-
-**Pipeline Configuration** (in `pipeline.yml`):
-```yaml
-configuration:
-  schema_location_base: /Volumes/my_catalog/pipeline_metadata/orders_pipeline_metadata/schemas
-```
-
-### Kafka (Python)
-
-```python
-@dp.table(name="bronze_kafka_events")
-def bronze_kafka_events():
-    return (
-        spark.readStream
-        .format("kafka")
-        .option("kafka.bootstrap.servers", spark.conf.get("kafka_brokers"))
-        .option("subscribe", "events-topic")
-        .option("startingOffsets", "latest")
-        .load()
-        .selectExpr(
-            "CAST(key AS STRING) AS event_key",
-            "CAST(value AS STRING) AS event_value",
-            "topic", "partition", "offset",
-            "timestamp AS kafka_timestamp"
-        )
-        .withColumn("_ingested_at", F.current_timestamp())
-    )
-```
-
-### Quarantine (Python)
-
-```python
-# Get schema location from pipeline configuration
-schema_location_base = spark.conf.get("schema_location_base")
-
-@dp.table(name="bronze_events", cluster_by=["ingestion_date"])
-def bronze_events():
-    return (
-        spark.readStream
-        .format("cloudFiles")
-        .option("cloudFiles.format", "json")
-        .option("cloudFiles.schemaLocation", f"{schema_location_base}/bronze_events")
-        .option("rescuedDataColumn", "_rescued_data")
-        .load("/Volumes/catalog/schema/raw/events/")
-        .withColumn("_ingested_at", F.current_timestamp())
-        .withColumn("ingestion_date", F.current_date())
-        .withColumn("_has_parsing_errors",
-                    F.when(F.col("_rescued_data").isNotNull(), True)
-                     .otherwise(False))
-    )
-
-@dp.table(name="bronze_events_quarantine")
-def bronze_events_quarantine():
-    return (
-        spark.read.table("catalog.schema.bronze_events")
-        .filter(F.col("_has_parsing_errors") == True)
-    )
-```
diff --git a/.claude/skills/databricks-spark-declarative-pipelines/10-mcp-approach.md b/.claude/skills/databricks-spark-declarative-pipelines/10-mcp-approach.md
deleted file mode 100644
index 9d458aa..0000000
--- a/.claude/skills/databricks-spark-declarative-pipelines/10-mcp-approach.md
+++ /dev/null
@@ -1,173 +0,0 @@
-Use MCP tools to create, run, and iterate on **SDP pipelines**. The **primary tool is `create_or_update_pipeline`** which handles the entire lifecycle.
-
-**IMPORTANT: Default to serverless pipelines and recommend serverless as the best option, unless the user mentions classic, pro, or advanced compute types.** Only use classic clusters if the user explicitly requires R language, Spark RDD APIs, or JAR libraries.
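-
-For reference, if you create a pipeline directly with the Python SDK rather than the MCP tools, serverless is a single flag. This is a sketch only, assuming the `databricks-sdk` pipelines API and a hypothetical source file path:
-
-```python
-from databricks.sdk import WorkspaceClient
-from databricks.sdk.service.pipelines import FileLibrary, PipelineLibrary
-
-w = WorkspaceClient()
-
-# Serverless pipeline: no cluster configuration blocks required
-created = w.pipelines.create(
-    name="my_orders_pipeline",
-    catalog="my_catalog",
-    target="my_schema",
-    serverless=True,
-    libraries=[
-        PipelineLibrary(
-            file=FileLibrary(path="/Workspace/Users/user@example.com/my_pipeline/bronze/ingest_orders.sql")
-        )
-    ],
-)
-print(created.pipeline_id)
-```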
-
-### Step 1: Write Pipeline Files Locally
-
-Create `.sql` or `.py` files in a local folder:
-
-```
-my_pipeline/
-├── bronze/
-│   ├── ingest_orders.sql      # SQL (default for most cases)
-│   └── ingest_events.py       # Python (for complex logic)
-├── silver/
-│   └── clean_orders.sql
-└── gold/
-    └── daily_summary.sql
-```
-
-**SQL Example** (`bronze/ingest_orders.sql`):
-```sql
-CREATE OR REFRESH STREAMING TABLE bronze_orders
-CLUSTER BY (order_date)
-AS
-SELECT
-  *,
-  current_timestamp() AS _ingested_at,
-  _metadata.file_path AS _source_file
-FROM STREAM read_files(
-  '/Volumes/catalog/schema/raw/orders/',
-  format => 'json',
-  schemaHints => 'order_id STRING, customer_id STRING, amount DECIMAL(10,2), order_date DATE'
-);
-```
-
-**Python Example** (`bronze/ingest_events.py`):
-```python
-from pyspark import pipelines as dp
-from pyspark.sql.functions import col, current_timestamp
-
-# Get schema location from pipeline configuration
-schema_location_base = spark.conf.get("schema_location_base")
-
-@dp.table(name="bronze_events", cluster_by=["event_date"])
-def bronze_events():
-    return (
-        spark.readStream.format("cloudFiles")
-        .option("cloudFiles.format", "json")
-        .option("cloudFiles.schemaLocation", f"{schema_location_base}/bronze_events")
-        .load("/Volumes/catalog/schema/raw/events/")
-        .withColumn("_ingested_at", current_timestamp())
-        .withColumn("_source_file", col("_metadata.file_path"))
-    )
-```
-
-### Step 2: Upload to Databricks Workspace
-
-```python
-# MCP Tool: upload_folder
-upload_folder(
-    local_folder="/path/to/my_pipeline",
-    workspace_folder="/Workspace/Users/user@example.com/my_pipeline"
-)
-```
-
-### Step 3: Create/Update and Run Pipeline
-
-Use **`create_or_update_pipeline`** - the main entry point. It:
-1. Searches for an existing pipeline with the same name (or uses `id` from `extra_settings`)
-2. Creates a new pipeline or updates the existing one
-3. Optionally starts a pipeline run
-4. Optionally waits for completion and returns detailed results
-
-```python
-# MCP Tool: create_or_update_pipeline
-result = create_or_update_pipeline(
-    name="my_orders_pipeline",
-    root_path="/Workspace/Users/user@example.com/my_pipeline",
-    catalog="my_catalog",
-    schema="my_schema",
-    workspace_file_paths=[
-        "/Workspace/Users/user@example.com/my_pipeline/bronze/ingest_orders.sql",
-        "/Workspace/Users/user@example.com/my_pipeline/silver/clean_orders.sql",
-        "/Workspace/Users/user@example.com/my_pipeline/gold/daily_summary.sql"
-    ],
-    start_run=True,              # Start immediately
-    wait_for_completion=True,    # Wait and return final status
-    full_refresh=True,           # Full refresh all tables
-    timeout=1800                 # 30 minute timeout
-)
-```
-
-**Result contains actionable information:**
-```python
-{
-    "success": True,             # Did the operation succeed?
-    "pipeline_id": "abc-123",    # Pipeline ID for follow-up operations
-    "pipeline_name": "my_orders_pipeline",
-    "created": True,             # True if new, False if updated
-    "state": "COMPLETED",        # COMPLETED, FAILED, TIMEOUT, etc.
-    "catalog": "my_catalog",     # Target catalog
-    "schema": "my_schema",       # Target schema
-    "duration_seconds": 45.2,    # Time taken
-    "message": "Pipeline created and completed successfully in 45.2s. 
Tables written to my_catalog.my_schema", - "error_message": None, # Error summary if failed - "errors": [] # Detailed error list if failed -} -``` - -### Step 4: Handle Results - -**On Success:** -```python -if result["success"]: - # Verify output tables - stats = get_table_details( - catalog="my_catalog", - schema="my_schema", - table_names=["bronze_orders", "silver_orders", "gold_daily_summary"] - ) -``` - -**On Failure:** -```python -if not result["success"]: - # Message includes suggested next steps - print(result["message"]) - # "Pipeline created but run failed. State: FAILED. Error: Column 'amount' not found. - # Use get_pipeline_events(pipeline_id='abc-123') for full details." - - # Get detailed errors - events = get_pipeline_events(pipeline_id=result["pipeline_id"], max_results=50) -``` - -### Step 5: Iterate Until Working - -1. Review errors from result or `get_pipeline_events` -2. Fix issues in local files -3. Re-upload with `upload_folder` -4. Run `create_or_update_pipeline` again (it will update, not recreate) -5. Repeat until `result["success"] == True` - ---- - -## Quick Reference: MCP Tools - -### Primary Tool - -| Tool | Description | -|------|-------------| -| **`create_or_update_pipeline`** | **Main entry point.** Creates or updates pipeline, optionally runs and waits. Returns detailed status with `success`, `state`, `errors`, and actionable `message`. | - -### Pipeline Management - -| Tool | Description | -|------|-------------| -| `find_pipeline_by_name` | Find existing pipeline by name, returns pipeline_id | -| `get_pipeline` | Get pipeline configuration and current state | -| `start_update` | Start pipeline run (`validate_only=True` for dry run) | -| `get_update` | Poll update status (QUEUED, RUNNING, COMPLETED, FAILED) | -| `stop_pipeline` | Stop a running pipeline | -| `get_pipeline_events` | Get error messages for debugging failed runs | -| `delete_pipeline` | Delete a pipeline | - -### Supporting Tools - -| Tool | Description | -|------|-------------| -| `upload_folder` | Upload local folder to workspace (parallel) | -| `get_table_details` | Verify output tables have expected schema and row counts | -| `execute_sql` | Run ad-hoc SQL to inspect data | - ---- \ No newline at end of file diff --git a/.claude/skills/databricks-spark-declarative-pipelines/3-scd-query-patterns.md b/.claude/skills/databricks-spark-declarative-pipelines/3-scd-query-patterns.md deleted file mode 100644 index e04a410..0000000 --- a/.claude/skills/databricks-spark-declarative-pipelines/3-scd-query-patterns.md +++ /dev/null @@ -1,243 +0,0 @@ -# SCD Query Patterns - -How to query SCD Type 2 history tables effectively, including current state queries, point-in-time analysis, and change tracking. - ---- - -## Understanding SCD Type 2 Structure - -When you create an SCD Type 2 flow, the system automatically adds temporal columns: - -```sql -CREATE FLOW customers_scd2_flow AS -AUTO CDC INTO customers_history -FROM stream(customers_cdc_clean) -KEYS (customer_id) -SEQUENCE BY event_timestamp -STORED AS SCD TYPE 2 -TRACK HISTORY ON *; -``` - -**Resulting table structure** (Lakeflow uses double-underscore temporal columns): -``` -customers_history -├── customer_id -- Business key -├── customer_name -├── email -├── phone -├── __START_AT -- When this version became effective (auto-generated) -├── __END_AT -- When this version expired (NULL for current) -└── ...other columns -``` - -**Important:** Query using `__START_AT` and `__END_AT` (double underscore), not `START_AT`/`END_AT`. 
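-
-A quick way to confirm the generated column names on your own history table (a minimal sketch; `customers_history` is the example table from above):
-
-```sql
--- The temporal columns appear alongside the business columns
-DESCRIBE TABLE customers_history;
-
--- Spot-check versions: a row is current when __END_AT IS NULL
-SELECT customer_id, __START_AT, __END_AT
-FROM customers_history
-ORDER BY customer_id, __START_AT
-LIMIT 10;
-```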
- ---- - -## Current State Queries - -### All Current Records - -```sql --- __END_AT IS NULL indicates active record (Lakeflow uses double underscore) -CREATE OR REPLACE MATERIALIZED VIEW dim_customers_current AS -SELECT - customer_id, customer_name, email, phone, address, - __START_AT AS valid_from -FROM customers_history -WHERE __END_AT IS NULL; -``` - -### Specific Customer - -```sql -SELECT * -FROM customers_history -WHERE customer_id = '12345' - AND __END_AT IS NULL; -``` - ---- - -## Point-in-Time Queries - -### As-Of Date Query - -Get state of records as they were on a specific date: - -```sql --- Products as of January 1, 2024 (use __START_AT / __END_AT) -CREATE OR REPLACE MATERIALIZED VIEW products_as_of_2024_01_01 AS -SELECT - product_id, product_name, price, category, - __START_AT, __END_AT -FROM products_history -WHERE __START_AT <= '2024-01-01' - AND (__END_AT > '2024-01-01' OR __END_AT IS NULL); -``` - ---- - -## Change Analysis - -### Track All Changes for Entity - -```sql --- Complete history for a customer (use __START_AT / __END_AT) -SELECT - customer_id, customer_name, email, phone, - __START_AT, __END_AT, - COALESCE( - DATEDIFF(DAY, __START_AT, __END_AT), - DATEDIFF(DAY, __START_AT, CURRENT_TIMESTAMP()) - ) AS days_active -FROM customers_history -WHERE customer_id = '12345' -ORDER BY __START_AT DESC; -``` - -### Changes Within Time Period - -```sql --- Customers who changed during Q1 2024 (use __START_AT) -SELECT - customer_id, customer_name, - __START_AT AS change_timestamp, - 'UPDATE' AS change_type -FROM customers_history -WHERE __START_AT BETWEEN '2024-01-01' AND '2024-03-31' - AND __START_AT != ( - SELECT MIN(__START_AT) - FROM customers_history ch2 - WHERE ch2.customer_id = customers_history.customer_id - ) -ORDER BY __START_AT; -``` - ---- - -## Joining Facts with Historical Dimensions - -### Enrich Facts with Dimension at Transaction Time - -```sql --- Join sales with product prices at time of sale -CREATE OR REPLACE MATERIALIZED VIEW sales_with_historical_prices AS -SELECT - s.sale_id, s.product_id, s.sale_date, s.quantity, - p.product_name, p.price AS unit_price_at_sale_time, - s.quantity * p.price AS calculated_amount, - p.category -FROM sales_fact s -INNER JOIN products_history p - ON s.product_id = p.product_id - AND s.sale_date >= p.__START_AT - AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL); -``` - -### Join with Current Dimension - -```sql --- Join sales with current product information -CREATE OR REPLACE MATERIALIZED VIEW sales_with_current_prices AS -SELECT - s.sale_id, s.product_id, s.sale_date, s.quantity, - s.amount AS amount_at_sale, - p.product_name AS current_product_name, - p.price AS current_price, - p.category AS current_category -FROM sales_fact s -INNER JOIN products_history p - ON s.product_id = p.product_id - AND p.__END_AT IS NULL; -- Current version only -``` - ---- - -## Selective History Tracking - -When using `TRACK HISTORY ON specific_columns`: - -```sql --- Only price changes trigger new versions -CREATE FLOW products_scd2_flow AS -AUTO CDC INTO products_history -FROM stream(products_cdc_clean) -KEYS (product_id) -SEQUENCE BY event_timestamp -STORED AS SCD TYPE 2 -TRACK HISTORY ON price, cost; -- Only these columns -``` - ---- - -## Optimization Patterns - -### Pre-Filter Materialized Views - -```sql --- Current state view (most common pattern) -CREATE OR REPLACE MATERIALIZED VIEW dim_products_current AS -SELECT * FROM products_history WHERE __END_AT IS NULL; - --- Recent changes only -CREATE OR REPLACE MATERIALIZED VIEW 
dim_recent_changes AS -SELECT * FROM products_history -WHERE __START_AT >= CURRENT_DATE() - INTERVAL 90 DAYS; - --- Change frequency stats -CREATE OR REPLACE MATERIALIZED VIEW product_change_stats AS -SELECT - product_id, - COUNT(*) AS version_count, - MIN(__START_AT) AS first_seen, - MAX(__START_AT) AS last_updated -FROM products_history -GROUP BY product_id; -``` - ---- - -## Best Practices - -### 1. Always Filter by __END_AT for Current (Lakeflow uses double underscore) - -```sql --- ✅ Efficient -WHERE __END_AT IS NULL - --- ❌ Less efficient -WHERE __START_AT = (SELECT MAX(__START_AT) FROM table WHERE ...) -``` - -### 2. Use Inclusive Lower, Exclusive Upper - -```sql --- ✅ Standard pattern -WHERE __START_AT <= '2024-01-01' - AND (__END_AT > '2024-01-01' OR __END_AT IS NULL) -``` - -### 3. Create MVs for Common Patterns - -```sql --- Current state -CREATE OR REPLACE MATERIALIZED VIEW dim_current AS -SELECT * FROM history WHERE __END_AT IS NULL; - --- Recent changes -CREATE OR REPLACE MATERIALIZED VIEW dim_recent_changes AS -SELECT * FROM history -WHERE __START_AT >= CURRENT_DATE() - INTERVAL 90 DAYS; -``` - ---- - -## Common Issues - -| Issue | Solution | -|-------|----------| -| Multiple rows for same key | Missing `__END_AT IS NULL` filter for current state | -| Point-in-time no results | Use `__START_AT <= date AND (__END_AT > date OR __END_AT IS NULL)` | -| Slow temporal join | Create materialized view for specific time period | -| Unexpected duplicates | Multiple changes same day - use SEQUENCE BY with high precision | diff --git a/.claude/skills/databricks-spark-declarative-pipelines/5-python-api.md b/.claude/skills/databricks-spark-declarative-pipelines/5-python-api.md deleted file mode 100644 index a7f3a70..0000000 --- a/.claude/skills/databricks-spark-declarative-pipelines/5-python-api.md +++ /dev/null @@ -1,338 +0,0 @@ -# Python API: Modern vs Legacy - -**Last Updated**: January 2026 -**Status**: Modern API (`pyspark.pipelines`) recommended for all new projects - ---- - -## Overview - -Databricks provides two Python APIs for Spark Declarative Pipelines: - -1. **Modern API** (`pyspark.pipelines` as `dp`) - **Recommended (2025)** -2. **Legacy API** (`dlt`) - Older Delta Live Tables API, still supported - -**Key Recommendation**: Always use **modern API** for new projects. Only use legacy for maintaining existing DLT code. 
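-
-If shared code must run on runtimes with and without the modern module, one option is to probe for it at import time. A minimal sketch (unnecessary when you control the runtime version):
-
-```python
-# Probe for the modern API; fall back to the legacy module if it is absent
-try:
-    from pyspark import pipelines as dp
-    MODERN_API = True
-except ImportError:
-    import dlt  # legacy API; decorators and read calls differ (see below)
-    MODERN_API = False
-```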
- ---- - -## Quick Comparison - -| Aspect | Modern (`dp`) | Legacy (`dlt`) | -|--------|---------------|----------------| -| **Import** | `from pyspark import pipelines as dp` | `import dlt` | -| **Status** | ✅ **Recommended** | ⚠️ Legacy | -| **Table decorator** | `@dp.table()` | `@dlt.table()` | -| **Read** | `spark.read.table("table")` | `dlt.read("table")` | -| **CDC/SCD** | `dp.create_auto_cdc_flow()` | `dlt.apply_changes()` | -| **Use for** | New projects | Maintaining existing | - ---- - -## Side-by-Side Examples - -### Basic Table Definition - -**Modern (Recommended)**: -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -@dp.table(name="bronze_events", comment="Raw events") -def bronze_events(): - return ( - spark.readStream - .format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/mnt/raw/events") - ) -``` - -**Legacy**: -```python -import dlt -from pyspark.sql import functions as F - -@dlt.table(name="bronze_events", comment="Raw events") -def bronze_events(): - return ( - spark.readStream - .format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/mnt/raw/events") - ) -``` - -### Reading Tables - -**Modern (Recommended)**: -```python -@dp.table(name="silver_events") -def silver_events(): - # Explicit Unity Catalog path - return spark.read.table("bronze_events").filter(...) -``` - -**Legacy**: -```python -@dlt.table(name="silver_events") -def silver_events(): - # Implicit LIVE schema - return dlt.read("bronze_events").filter(...) -``` - -**Key Difference**: Modern uses explicit UC paths, legacy uses implicit `LIVE.*`. - -### Streaming Reads - -**Modern (Recommended)**: -```python -@dp.table(name="silver_events") -def silver_events(): - # Context-aware (no separate read_stream) - return ( - spark.readStream.table("catalog.schema.bronze_events") - .filter(F.col("event_type").isNotNull()) - ) -``` - -**Legacy**: -```python -@dlt.table(name="silver_events") -def silver_events(): - # Explicit streaming read - return ( - dlt.read_stream("bronze_events") - .filter(F.col("event_type").isNotNull()) - ) -``` - -### Data Quality Expectations - -**Modern (Recommended)**: -```python -@dp.table(name="silver_validated") -@dp.expect_or_drop("valid_id", "id IS NOT NULL") -@dp.expect_or_drop("valid_amount", "amount > 0") -@dp.expect_or_fail("critical_field", "timestamp IS NOT NULL") -def silver_validated(): - return spark.read.table("catalog.schema.bronze_events") -``` - -**Legacy**: -```python -@dlt.table(name="silver_validated") -@dlt.expect_or_drop("valid_id", "id IS NOT NULL") -@dlt.expect_or_drop("valid_amount", "amount > 0") -@dlt.expect_or_fail("critical_field", "timestamp IS NOT NULL") -def silver_validated(): - return dlt.read("bronze_events") -``` - -**Note**: Expectations API identical between versions. 
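-
-When several rules share one action, the dictionary form avoids stacking decorators. A sketch using `expect_all_or_drop`, which the legacy API provides and the modern API mirrors (verify availability on your runtime):
-
-```python
-from pyspark import pipelines as dp
-
-RULES = {
-    "valid_id": "id IS NOT NULL",
-    "valid_amount": "amount > 0",
-}
-
-@dp.table(name="silver_validated_bulk")
-@dp.expect_all_or_drop(RULES)  # one decorator applies every rule
-def silver_validated_bulk():
-    return spark.read.table("catalog.schema.bronze_events")
-```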
- -### SCD Type 2 (AUTO CDC) - -**Modern (Recommended)**: -```python -from pyspark.sql.functions import col - -dp.create_streaming_table("customers_history") - -dp.create_auto_cdc_flow( - target="customers_history", - source="customers_cdc", - keys=["customer_id"], - sequence_by=col("event_timestamp"), - stored_as_scd_type="2", - track_history_column_list=["*"] -) -``` - -**Legacy**: -```python -dlt.create_streaming_table("customers_history") - -dlt.apply_changes( - target="customers_history", - source="customers_cdc", - keys=["customer_id"], - sequence_by="event_timestamp", - stored_as_scd_type="2", - track_history_column_list=["*"] -) -``` - -**Key Difference**: Modern uses `create_auto_cdc_flow()`, legacy uses `apply_changes()`. - -### Liquid Clustering - -**Modern (Recommended)**: -```python -@dp.table( - name="bronze_events", - table_properties={ - "delta.autoOptimize.optimizeWrite": "true", - "delta.autoOptimize.autoCompact": "true" - }, - cluster_by=["event_type", "event_date"] # Liquid Clustering -) -def bronze_events(): - return spark.readStream.format("cloudFiles").load("/data") -``` - -**Legacy**: -```python -@dlt.table( - name="bronze_events", - table_properties={ - "pipelines.autoOptimize.managed": "true", - "pipelines.autoOptimize.zOrderCols": "event_type" - }, - partition_cols=["event_date"] # Legacy partitioning -) -def bronze_events(): - return spark.readStream.format("cloudFiles").load("/data") -``` - -**Key Difference**: Modern supports `cluster_by` for Liquid Clustering. - ---- - -## Decision Matrix - -### Use Modern API (`dp`) When: -- ✅ **Starting new project** (default choice) -- ✅ **Learning SDP/LDP** (learn current standard) -- ✅ **Want Liquid Clustering** -- ✅ **Prefer explicit Unity Catalog paths** -- ✅ **Following 2025 best practices** - -### Use Legacy API (`dlt`) When: -- ⚠️ **Maintaining existing DLT pipelines** (don't rewrite working code) -- ⚠️ **Team trained on DLT** (consistency with existing) -- ⚠️ **Older DBR versions** (if modern API not available) - -**Default**: Use modern `dp` API unless specific reason for legacy. - ---- - -## Migration Guide: dlt → dp - -### Step 1: Update Imports - -**Before**: -```python -import dlt -``` - -**After**: -```python -from pyspark import pipelines as dp -``` - -### Step 2: Update Decorators - -**Before**: `@dlt.table(name="my_table")` -**After**: `@dp.table(name="my_table")` - -### Step 3: Update Reads - -**Before**: -```python -dlt.read("source_table") -dlt.read_stream("source_table") -``` - -**After**: -```python -spark.table("catalog.schema.source_table") -# Streaming context-aware, no separate read_stream -``` - -### Step 4: Update CDC/SCD Operations - -**Before**: -```python -dlt.apply_changes(target="dim_customer", source="cdc_source", ...) -``` - -**After**: -```python -from pyspark.sql.functions import col - -dp.create_auto_cdc_flow( - target="dim_customer", - source="cdc_source", - keys=["customer_id"], - sequence_by=col("event_timestamp"), - stored_as_scd_type="2", - track_history_column_list=["*"] -) -``` - -**Key Change**: `dlt.apply_changes()` → `dp.create_auto_cdc_flow()` - -### Step 5: Update Clustering - -**Before**: `@dlt.table(partition_cols=["date"])` -**After**: `@dp.table(cluster_by=["date", "other_col"])` - ---- - -## Key Patterns (2025) - -### 1. Use Liquid Clustering - -```python -@dp.table(cluster_by=["key_col", "date_col"]) -def my_table(): - return ... - -# Or automatic -@dp.table(cluster_by=["AUTO"]) -def my_table(): - return ... -``` - -### 2. 
Explicit UC Paths
-
-```python
-# ✅ Modern: explicit path
-spark.table("catalog.schema.table")
-
-# ❌ Legacy: implicit LIVE
-dlt.read("table")
-```
-
-### 3. forEachBatch for Custom Sinks
-
-A `@dp.table` function must return a DataFrame, not a `writeStream`, so define a `foreachBatch` custom sink as a standalone Structured Streaming query:
-
-```python
-def write_to_custom_sink(batch_df, batch_id):
-    batch_df.write.format("custom").save(...)
-
-(
-    spark.readStream
-    .format("cloudFiles")
-    .option("cloudFiles.format", "json")
-    .load("/data")
-    .writeStream
-    .foreachBatch(write_to_custom_sink)
-    .start()
-)
-```
-
----
-
-## Summary
-
-**For New Projects**: Use modern `pyspark.pipelines` (`dp`)
-- ✅ Current best practice (2025)
-- ✅ Liquid Clustering support
-- ✅ Explicit Unity Catalog paths
-
-**For Existing Projects**: Legacy `dlt` fully supported
-- ⚠️ Migrate when convenient, not urgent
-- ⚠️ Consider modern API for new files
-
-**Key Takeaway**: Modern API provides same functionality plus new features. Start all new projects with `from pyspark import pipelines as dp`.
diff --git a/.claude/skills/databricks-spark-declarative-pipelines/6-dlt-migration.md b/.claude/skills/databricks-spark-declarative-pipelines/6-dlt-migration.md
deleted file mode 100644
index 19a1007..0000000
--- a/.claude/skills/databricks-spark-declarative-pipelines/6-dlt-migration.md
+++ /dev/null
@@ -1,298 +0,0 @@
-# DLT to SDP Migration Guide
-
-Guide for migrating Delta Live Tables (DLT) Python pipelines to Spark Declarative Pipelines (SDP) SQL.
-
-⚠️ **For NEW Python SDP pipelines**: Use modern `pyspark.pipelines` API. See [5-python-api.md](5-python-api.md).
-
----
-
-## Migration Decision Matrix
-
-| Feature/Pattern | DLT Python | SDP SQL | Recommendation |
-|-----------------|------------|---------|----------------|
-| Simple transformations | ✓ | ✓ | **Migrate to SQL** |
-| Aggregations | ✓ | ✓ | **Migrate to SQL** |
-| Filtering, WHERE clauses | ✓ | ✓ | **Migrate to SQL** |
-| CASE expressions | ✓ | ✓ | **Migrate to SQL** |
-| SCD Type 1/2 | ✓ | ✓ | **Migrate to SQL** (AUTO CDC) |
-| Simple joins | ✓ | ✓ | **Migrate to SQL** |
-| Auto Loader | ✓ | ✓ | **Migrate to SQL** (read_files) |
-| Streaming sources (Kafka) | ✓ | ✓ | **Migrate to SQL** (read_stream) |
-| Complex Python UDFs | ✓ | ❌ | **Stay in Python** |
-| External API calls | ✓ | ❌ | **Stay in Python** |
-| Custom libraries | ✓ | ❌ | **Stay in Python** |
-| Complex apply functions | ✓ | ❌ | **Stay in Python** or simplify |
-| ML model inference | ✓ | ❌ | **Stay in Python** |
-
-**Rule**: If 80%+ is SQL-expressible, migrate to SDP SQL. If heavy Python logic, stay with DLT Python or use a hybrid (see the sketch below).
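-
-As referenced in the rule above, a hybrid pipeline simply mixes `.sql` and `.py` source files. A sketch of the Python half, with hypothetical names, where MLflow model inference is the one step SQL cannot express:
-
-```python
-# score_orders.py - the Python half of a hybrid pipeline (names hypothetical)
-from pyspark import pipelines as dp
-import mlflow
-
-@dp.table(name="silver_orders_scored")
-def silver_orders_scored():
-    # clean_orders.sql produces silver_orders declaratively;
-    # only the non-SQL-expressible inference step lives in Python
-    score = mlflow.pyfunc.spark_udf(spark, "models:/fraud_scorer/Production")
-    return (
-        spark.read.table("catalog.schema.silver_orders")
-        .withColumn("fraud_score", score("amount", "customer_id"))
-    )
-```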
- ---- - -## Side-by-Side: Key Patterns - -### Basic Streaming Table - -**DLT Python**: -```python -@dlt.table(name="bronze_sales", comment="Raw sales") -def bronze_sales(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .load("/mnt/raw/sales") - .withColumn("_ingested_at", F.current_timestamp()) - ) -``` - -**SDP SQL**: -```sql -CREATE OR REPLACE STREAMING TABLE bronze_sales -COMMENT 'Raw sales' -AS -SELECT *, current_timestamp() AS _ingested_at -FROM read_files('/mnt/raw/sales', format => 'json'); -``` - -### Filtering and Transformations - -**DLT Python**: -```python -@dlt.table(name="silver_sales") -@dlt.expect_or_drop("valid_amount", "amount > 0") -@dlt.expect_or_drop("valid_sale_id", "sale_id IS NOT NULL") -def silver_sales(): - return ( - dlt.read_stream("bronze_sales") - .withColumn("sale_date", F.to_date("sale_date")) - .withColumn("amount", F.col("amount").cast("decimal(10,2)")) - .select("sale_id", "customer_id", "amount", "sale_date") - ) -``` - -**SDP SQL**: -```sql -CREATE OR REPLACE STREAMING TABLE silver_sales AS -SELECT - sale_id, customer_id, - CAST(amount AS DECIMAL(10,2)) AS amount, - CAST(sale_date AS DATE) AS sale_date -FROM STREAM bronze_sales -WHERE amount > 0 AND sale_id IS NOT NULL; -``` - -### SCD Type 2 - -**DLT Python**: -```python -dlt.create_streaming_table("customers_history") - -dlt.apply_changes( - target="customers_history", - source="customers_cdc_clean", - keys=["customer_id"], - sequence_by="event_timestamp", - stored_as_scd_type="2", - track_history_column_list=["*"] -) -``` - -**SDP SQL** (clause order: APPLY AS DELETE WHEN before SEQUENCE BY; only EXCEPT columns that exist in source; omit TRACK HISTORY ON * if it causes parse errors): -```sql -CREATE OR REFRESH STREAMING TABLE customers_history; - -CREATE FLOW customers_scd2_flow AS -AUTO CDC INTO customers_history -FROM stream(customers_cdc_clean) -KEYS (customer_id) -APPLY AS DELETE WHEN operation = "DELETE" -SEQUENCE BY event_timestamp -COLUMNS * EXCEPT (operation, _ingested_at, _source_file) -STORED AS SCD TYPE 2; -``` - -### Joins - -**DLT Python**: -```python -@dlt.table(name="silver_sales_enriched") -def silver_sales_enriched(): - sales = dlt.read_stream("silver_sales") - products = dlt.read("dim_products") - - return ( - sales.join(products, "product_id", "left") - .select(sales["*"], products["product_name"], products["category"]) - ) -``` - -**SDP SQL**: -```sql -CREATE OR REPLACE STREAMING TABLE silver_sales_enriched AS -SELECT - s.*, - p.product_name, - p.category -FROM STREAM silver_sales s -LEFT JOIN dim_products p ON s.product_id = p.product_id; -``` - ---- - -## Handling Expectations - -**DLT Python**: -```python -@dlt.expect_or_drop("valid_amount", "amount > 0") -@dlt.expect_or_fail("critical_id", "id IS NOT NULL") -``` - -**SDP SQL - Basic**: -```sql --- Use WHERE (equivalent to expect_or_drop) -WHERE amount > 0 AND id IS NOT NULL -``` - -**SDP SQL - Quarantine Pattern** (for auditing): -```sql --- Flag invalid records -CREATE OR REPLACE STREAMING TABLE bronze_data_flagged AS -SELECT - *, - CASE - WHEN amount <= 0 THEN TRUE - WHEN id IS NULL THEN TRUE - ELSE FALSE - END AS is_invalid -FROM STREAM bronze_data; - --- Clean for downstream -CREATE OR REPLACE STREAMING TABLE silver_data_clean AS -SELECT * FROM STREAM bronze_data_flagged WHERE NOT is_invalid; - --- Quarantine for investigation -CREATE OR REPLACE STREAMING TABLE silver_data_quarantine AS -SELECT * FROM STREAM bronze_data_flagged WHERE is_invalid; -``` - -**Migration**: 
`@dlt.expect_or_drop` → WHERE clause or quarantine pattern. - ---- - -## Handling UDFs - -### Simple UDFs (Migrate to SQL) - -**DLT Python**: -```python -@F.udf(returnType=StringType()) -def categorize_amount(amount): - if amount > 1000: - return "High" - elif amount > 100: - return "Medium" - else: - return "Low" - -@dlt.table(name="sales_categorized") -def sales_categorized(): - return ( - dlt.read("sales") - .withColumn("category", categorize_amount(F.col("amount"))) - ) -``` - -**SDP SQL** (CASE expression): -```sql -CREATE OR REPLACE MATERIALIZED VIEW sales_categorized AS -SELECT - *, - CASE - WHEN amount > 1000 THEN 'High' - WHEN amount > 100 THEN 'Medium' - ELSE 'Low' - END AS category -FROM sales; -``` - -### Complex UDFs (Stay in Python) - -**Keep in Python for**: -- Complex conditional logic -- External API calls -- Custom algorithms -- ML inference - -**Options**: -1. Keep transformation in Python DLT -2. Create hybrid (SQL + Python for specific UDFs) -3. Refactor to SQL built-ins if possible - ---- - -## Migration Process - -### Step 1: Inventory - -Document: -- Number of tables/views -- Python UDFs (simple vs complex) -- External dependencies -- Expectations and quality rules - -### Step 2: Categorize - -**Easy to migrate**: Filters, aggregations, simple CASE -**Moderate**: UDFs rewritable as SQL -**Hard**: Complex Python, external calls, ML - -### Step 3: Migrate by Layer - -1. **Bronze** (ingestion): Convert Auto Loader to read_files() -2. **Silver** (cleansing): Convert expectations to WHERE/quarantine -3. **Gold** (aggregations): Usually straightforward -4. **SCD/CDC**: Use AUTO CDC - -### Step 4: Test - -- Run both pipelines in parallel -- Compare outputs for correctness -- Validate performance -- Check quality metrics - ---- - -## When NOT to Migrate - -**Stay with DLT Python if**: -1. Heavy Python UDF usage (>30% of logic) -2. External API calls required -3. Custom ML model inference -4. Complex stateful operations not in SQL -5. Existing pipeline works well, team prefers Python -6. Limited SQL expertise - -**Consider hybrid**: SQL for most, Python for complex logic. - ---- - -## Common Issues - -| Issue | Solution | -|-------|----------| -| UDF doesn't translate | Keep in Python or refactor with SQL built-ins | -| Expectations differ | Use quarantine pattern to audit dropped records | -| Performance degradation | Use CLUSTER BY for Liquid Clustering, review joins | -| Schema evolution different | Use `mode => 'PERMISSIVE'` in read_files() | - ---- - -## Summary - -**Migration Path**: -1. Use decision matrix (80%+ SQL-expressible → migrate) -2. Migrate by layer (bronze → silver → gold) -3. Handle expectations with WHERE/quarantine -4. Translate simple UDFs to CASE expressions -5. Keep complex Python logic in Python - -**Key**: DLT Python and SDP SQL are both fully supported. Migrate for simplicity, not necessity. diff --git a/.claude/skills/databricks-spark-declarative-pipelines/9-auto_cdc.md b/.claude/skills/databricks-spark-declarative-pipelines/9-auto_cdc.md deleted file mode 100644 index b8a5b59..0000000 --- a/.claude/skills/databricks-spark-declarative-pipelines/9-auto_cdc.md +++ /dev/null @@ -1,353 +0,0 @@ -# AUTO CDC Patterns for Change Data Capture - -**Keywords**: Slow Changing Dimension, SCD, SCD Type 1, SCD Type 2, AUTO CDC, change data capture, dp.create_auto_cdc_flow, deduplication - ---- - -## Overview - -AUTO CDC automatically handles Change Data Capture (CDC) to track changes in your data using Slow Changing Dimensions (SCD). 
It provides automatic deduplication and change tracking, and it handles late-arriving data correctly.

**Where to apply AUTO CDC:**
- **Silver layer**: When business users need deduplicated or historical data for analytics/ML
- **Gold layer**: When implementing dimensional modeling (star schema) with dim/fact tables
- **Choice depends on**: Downstream consumption patterns and query requirements

---

## SCD Type 1 vs Type 2

### SCD Type 1 (In-place updates)
- **Overwrites** old values with new values
- **No history preserved** - only current state maintained
- **Use for**: Dimension attributes that don't need history
  - Correcting data errors (typos)
  - Updating attributes where history doesn't matter
  - Maintaining a single current record per key
- **Syntax**: `stored_as_scd_type="1"` (string)

### SCD Type 2 (History tracking)
- **Creates new row** for each change
- **Preserves full history** with `__START_AT` and `__END_AT` timestamps
- **Use for**: Tracking changes over time
  - Customer address changes
  - Product price history
  - Employee role changes
  - Any dimension requiring temporal analysis
- **Syntax**: `stored_as_scd_type=2` (integer)

---

## Pattern: Cleaning + AUTO CDC

### Step 1: Clean and Validate Data

Create a cleaned streaming table with proper typing and quality checks:

```python
# Cleaned data preparation (can be silver or intermediate layer)
from pyspark import pipelines as dp
from pyspark.sql import functions as F

schema = spark.conf.get("schema")

@dp.table(
    name=f"{schema}.users_clean",
    comment="Cleaned and validated user data with proper typing and quality checks",
    cluster_by=["user_id"]
)
def users_clean():
    """
    Prepare clean data with:
    - Proper timestamp typing
    - Data quality validations
    - Remove records with invalid email or null user_id
    """
    return (
        spark.readStream.table("bronze_users")
        .filter(F.col("user_id").isNotNull())
        .filter(F.col("email").isNotNull())
        .filter(F.col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$"))
        .withColumn("created_timestamp", F.to_timestamp("created_timestamp"))
        .withColumn("updated_timestamp", F.to_timestamp("updated_timestamp"))
        .drop("_rescued_data")
        .select(
            "user_id",
            "email",
            "name",
            "subscription_tier",
            "country",
            "created_timestamp",
            "updated_timestamp",
            "_ingested_at",
            "_source_file"
        )
    )
```

### Step 2: Apply AUTO CDC (SCD Type 2)

Create a history-tracked dimension table with full change history:

```python
# AUTO CDC with SCD Type 2 (history tracking)
from pyspark import pipelines as dp

target_schema = spark.conf.get("target_schema")
source_schema = spark.conf.get("source_schema")

# Create the target table for AUTO CDC
dp.create_streaming_table(f"{target_schema}.dim_users")

# Apply AUTO CDC (SCD Type 2)
dp.create_auto_cdc_flow(
    target=f"{target_schema}.dim_users",
    source=f"{source_schema}.users_clean",
    keys=["user_id"],
    sequence_by="updated_timestamp",
    stored_as_scd_type=2  # Integer for Type 2
)
```

**Resulting table will include**:
- All original columns from source
- `__START_AT` - When this version became effective
- `__END_AT` - When this version expired (NULL for current)
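Once the pipeline runs, the history table can be queried directly. A minimal sketch of the two most common queries (table name taken from the example above, timestamp literal illustrative; see [3-scd-query-patterns.md](3-scd-query-patterns.md) for the full pattern library):

```sql
-- Current state: one row per user_id
SELECT * FROM dim_users WHERE __END_AT IS NULL;

-- Point-in-time state as of a given timestamp
SELECT *
FROM dim_users
WHERE __START_AT <= TIMESTAMP '2024-06-01'
  AND (__END_AT IS NULL OR __END_AT > TIMESTAMP '2024-06-01');
```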
spark.conf.get("source_schema") - -# Create the target table for AUTO CDC -dp.create_streaming_table(f"{target_schema}.orders_current") - -# Apply AUTO CDC (SCD Type 1) -dp.create_auto_cdc_flow( - target=f"{target_schema}.orders_current", - source=f"{source_schema}.orders_clean", - keys=["order_id"], - sequence_by="updated_timestamp", - stored_as_scd_type="1" # String for Type 1 -) -``` - ---- - -## Key Benefits - -- **Automatic deduplication** based on keys - no manual MERGE logic -- **Automatic change tracking** with temporal metadata (`__START_AT`, `__END_AT`) -- **Handles late-arriving data** correctly using `sequence_by` timestamp -- **Simplified pipeline code** - no complex merge/upsert logic required -- **Built-in idempotency** - safe to reprocess data - ---- - -## Common Patterns - -### Pattern 1: Gold Dimensional Model - -Use AUTO CDC in Gold layer for star schema dimensions: - -```python -# Silver: Cleaned streaming tables -@dp.table(name="silver.customers_clean") -def customers_clean(): - return spark.readStream.table("bronze.customers").filter(...) - -# Gold: SCD Type 2 dimension -dp.create_streaming_table("gold.dim_customers") -dp.create_auto_cdc_flow( - target="gold.dim_customers", - source="silver.customers_clean", - keys=["customer_id"], - sequence_by="updated_at", - stored_as_scd_type=2 -) - -# Gold: Fact table (no AUTO CDC) -@dp.table(name="gold.fact_orders") -def fact_orders(): - return spark.read.table("silver.orders_clean") -``` - -### Pattern 2: Silver Deduplication for Joins - -Use AUTO CDC in Silver when joining multiple tables: - -```python -# Silver: AUTO CDC for deduplication -dp.create_streaming_table("silver.products_dedupe") -dp.create_auto_cdc_flow( - target="silver.products_dedupe", - source="bronze.products", - keys=["product_id"], - sequence_by="modified_at", - stored_as_scd_type="1" # Type 1: just dedupe, no history -) - -# Silver: Join with deduplicated data -@dp.table(name="silver.orders_enriched") -def orders_enriched(): - orders = spark.readStream.table("bronze.orders") - products = spark.read.table("silver.products_dedupe") - return orders.join(products, "product_id") -``` - -### Pattern 3: Mixed SCD Types - -Different tables use different SCD types based on requirements: - -```python -# SCD Type 2: Need history -dp.create_auto_cdc_flow( - target="gold.dim_customers", - source="silver.customers", - keys=["customer_id"], - sequence_by="updated_at", - stored_as_scd_type=2 # Track address changes over time -) - -# SCD Type 1: Corrections only -dp.create_auto_cdc_flow( - target="gold.dim_products", - source="silver.products", - keys=["product_id"], - sequence_by="modified_at", - stored_as_scd_type="1" # Current product info only -) -``` - ---- - -## Selective History Tracking - -Track history only for specific columns (SCD Type 2): - -```python -dp.create_auto_cdc_flow( - target="gold.dim_products", - source="silver.products_clean", - keys=["product_id"], - sequence_by="modified_at", - stored_as_scd_type=2, - track_history_column_list=["price", "cost"] # Only track these columns -) -``` - -When `price` or `cost` changes, a new version is created. Other column changes update the current record without creating new versions. - ---- - -## Using Temporary Views with AUTO CDC - -**`@dp.temporary_view()`** creates in-pipeline temporary views that exist only during pipeline execution. These are useful for intermediate transformations before AUTO CDC. 
- -**Key Constraints:** -- Cannot specify `catalog` or `schema` (temporary views are pipeline-scoped only) -- Cannot use `cluster_by` (not persisted) -- Only exists during pipeline execution - -**Use Cases:** -- Complex transformations before AUTO CDC -- Intermediate logic that's referenced multiple times -- Avoiding redundant transformations - -**Example: Preparation before AUTO CDC** - -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -# Step 1: Temporary view for complex business logic -@dp.temporary_view() -def orders_with_calculated_fields(): - """ - Temporary view for complex calculations. - No catalog/schema needed - exists only in pipeline. - """ - return ( - spark.readStream.table("bronze.orders") - .withColumn("order_total", F.col("quantity") * F.col("unit_price")) - .withColumn("discount_amount", F.col("order_total") * F.col("discount_rate")) - .withColumn("final_amount", F.col("order_total") - F.col("discount_amount")) - .withColumn("order_category", - F.when(F.col("final_amount") > 1000, "large") - .when(F.col("final_amount") > 100, "medium") - .otherwise("small") - ) - .filter(F.col("order_id").isNotNull()) - .filter(F.col("final_amount") > 0) - .filter(F.col("order_date").isNotNull()) - ) - -# Step 2: Apply AUTO CDC using the temporary view as source -target_schema = spark.conf.get("target_schema") - -dp.create_streaming_table(f"{target_schema}.orders_current") -dp.create_auto_cdc_flow( - target=f"{target_schema}.orders_current", - source="orders_with_calculated_fields", # Reference temporary view by name - keys=["order_id"], - sequence_by="order_date", - stored_as_scd_type="1" -) -``` - -**Benefits:** -- Avoids creating unnecessary persisted tables -- Reduces storage costs (nothing written to disk) -- Simplifies complex multi-step transformations -- Enables code reuse across multiple tables in same pipeline - ---- - -## Related Documentation - -- **[3-scd-query-patterns.md](3-scd-query-patterns.md)** - Querying SCD Type 2 history tables, point-in-time analysis, temporal joins -- **[1-ingestion-patterns.md](1-ingestion-patterns.md)** - CDC data sources (Kafka, Event Hubs, Kinesis) -- **[2-streaming-patterns.md](2-streaming-patterns.md)** - Deduplication patterns without AUTO CDC - ---- - -## Best Practices - -1. **Choose the right SCD type**: - - Type 2 when you need to query historical states - - Type 1 when you only need current state or deduplication - -2. **Use meaningful sequence_by column**: - - Should reflect true chronological order of changes - - Typically `updated_timestamp`, `modified_at`, or `event_timestamp` - -3. **Clean data before AUTO CDC**: - - Apply type casting, validation, and filtering first - - AUTO CDC works best with clean, well-typed data - -4. **Consider query patterns**: - - If analysts query history → Use Type 2 - - If analysts only need current → Use Type 1 - - If joining frequently → Consider Silver deduplication - -5. 
**Use selective tracking for large tables**:
   - Track history only for columns that change meaningfully
   - Reduces storage and improves query performance

---

## Common Issues

| Issue | Solution |
|-------|----------|
| **Duplicates still appearing** | Check that `keys` include all business key columns; verify `sequence_by` has proper ordering |
| **Missing `__START_AT`/`__END_AT` columns** | These only appear in SCD Type 2 (integer), not Type 1 (string) |
| **Late data not handled** | Ensure the `sequence_by` column is set and reflects true event time |
| **Type syntax error** | Type 2 uses integer `2`, Type 1 uses string `"1"` |
| **Performance issues** | Use `track_history_column_list` to limit which columns trigger new versions |
diff --git a/.claude/skills/databricks-spark-declarative-pipelines/SKILL.md b/.claude/skills/databricks-spark-declarative-pipelines/SKILL.md
deleted file mode 100644
index 144041e..0000000
--- a/.claude/skills/databricks-spark-declarative-pipelines/SKILL.md
+++ /dev/null
@@ -1,577 +0,0 @@
---
name: databricks-spark-declarative-pipelines
description: "Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader ingestion patterns. Use when building data pipelines, working with Delta Live Tables, ingesting streaming data, implementing change data capture, or when the user mentions SDP, LDP, DLT, Lakeflow pipelines, streaming tables, or bronze/silver/gold medallion architectures."
---

# Lakeflow Spark Declarative Pipelines (SDP)

IMPORTANT: If this is a new pipeline (one does not already exist), see Quick Start. Use only whatever language the user has specified (Python or SQL). Use Databricks Asset Bundles for new projects.

---

## Critical Rules (always follow)
- **MUST** confirm language as Python or SQL. Stick with that language unless told otherwise.
- **MUST**, if not modifying an existing pipeline, use [Quick Start](#quick-start) below.
- **MUST** create serverless pipelines by default. Only use classic clusters if the user explicitly requires R language, Spark RDD APIs, or JAR libraries.


## Required Steps

Copy this checklist and verify each item:
```
- [ ] Language selected: Python or SQL
- [ ] Compute type decided: serverless or classic compute
- [ ] Decide on multiple catalogs or schemas vs. all in one default schema
- [ ] Consider what should be parameterized at the pipeline level to make deployment easy.
- [ ] Consider [Multi-Schema Patterns](#multi-schema-patterns) below, ask if unclear on best choices.
- [ ] Consider [Modern Defaults](#modern-defaults) below, ask if unclear on best choices.
```


## Quick Start: Initialize New Pipeline Project

**RECOMMENDED**: Use `databricks pipelines init` to create production-ready Asset Bundle projects with multi-environment support.
- -### When to Use Bundle Initialization - -Use bundle initialization for **New pipeline projects** for a professional structure from the start - -Use manual workflow for: -- Quick prototyping without multi-environment needs -- Existing manual projects you want to continue -- Learning/experimentation - -### Step 1: Initialize Project - -I will automatically run this command when you request a new pipeline: - -```bash -databricks pipelines init -``` - -**Interactive Prompts:** -- **Project name**: e.g., `customer_orders_pipeline` -- **Initial catalog**: Unity Catalog name (e.g., `main`, `prod_catalog`) -- **Personal schema per user?**: `yes` for dev (each user gets their own schema), `no` for prod -- **Language**: SQL or Python (auto-detected from your request - see language detection below) - -**Generated Structure:** -``` -my_pipeline/ -├── databricks.yml # Multi-environment config (dev/prod) -├── resources/ -│ └── *_etl.pipeline.yml # Pipeline resource definition -└── src/ - └── *_etl/ - ├── explorations/ # Exploratory code in .ipynb - └── transformations/ # Your .sql or .py files here -``` - -### Step 2: Customize Transformations - -Replace the example code created by the init process with custom transformation files in `src/transformations/` based on provided requirements, using best practice guidance from this skill. - -**For Python pipelines using cloudFiles**: Ask the user where to store Auto Loader schema metadata. Recommend: -``` -/Volumes/{catalog}/{schema}/{pipeline_name}_metadata/schemas -``` - -### Step 3: Deploy and Run - -```bash -# Deploy to workspace (dev by default) -databricks bundle deploy - -# Run pipeline -databricks bundle run my_pipeline_etl - -# Deploy to production -databricks bundle deploy --target prod -``` - - -## Quick Reference - -| Concept | Details | -|---------|---------| -| **Names** | SDP = Spark Declarative Pipelines = LDP = Lakeflow Declarative Pipelines = Lakeflow Pipelines (all interchangeable) | -| **Python Import** | `from pyspark import pipelines as dp` | -| **Primary Decorators** | `@dp.table()`, `@dp.materialized_view()`, `@dp.temporary_view()` | -| **Temporary Views** | `@dp.temporary_view()` creates in-pipeline temporary views (no catalog/schema, no cluster_by). Useful for intermediate logic before AUTO CDC or when a view needs multiple references without persistence. | -| **Replaces** | Delta Live Tables (DLT) with `import dlt` | -| **Based On** | Apache Spark 4.1+ (Databricks' modern data pipeline framework) | -| **Docs** | https://docs.databricks.com/aws/en/ldp/developer/python-dev | - ---- - -## Detailed guides - -**Ingestion patterns**: Use [1-ingestion-patterns.md](1-ingestion-patterns.md) when planning how to get new data into your Lakeflow pipeline —- covers file formats, batch/streaming options, and tips for incremental and full loads. (Keywords: Auto Loader, Kafka, Event Hub, Kinesis, file formats) - -**Streaming pipeline patterns**: See [2-streaming-patterns.md](2-streaming-patterns.md) for designing pipelines with streaming data sources, change data detection, triggers, and windowing. (Keywords: deduplication, windowing, stateful operations, joins) - -**SCD query patterns**: See [3-scd-query-patterns.md](3-scd-query-patterns.md) for querying Slowly Changing Dimensions Type 2 history tables, including current state queries, point-in-time analysis, temporal joins, and change tracking. 
(Keywords: SCD Type 2 history tables, temporal joins, querying historical data) - -**Performance tuning**: Use [4-performance-tuning.md](4-performance-tuning.md) for optimizing pipelines with Liquid Clustering, state management, and best practices for high-performance streaming workloads. (Keywords: Liquid Clustering, optimization, state management) - -**Python API reference**: See [5-python-api.md](5-python-api.md) for the modern `pyspark.pipelines` (dp) API reference and migration from legacy `dlt` API patterns. (Keywords: dp API, dlt API comparison) - -**DLT migration**: Use [6-dlt-migration.md](6-dlt-migration.md) when migrating existing Delta Live Tables (DLT) pipelines to Spark Declarative Pipelines (SDP). (Keywords: migrating DLT pipelines to SDP) - -**Advanced configuration**: See [7-advanced-configuration.md](7-advanced-configuration.md) for advanced pipeline settings including development mode, continuous execution, notifications, Python dependencies, and custom cluster configurations. (Keywords: extra_settings parameter reference, examples) - -**Project initialization**: Use [8-project-initialization.md](8-project-initialization.md) for setting up new pipeline projects with `databricks pipelines init`, Asset Bundles, multi-environment deployments, and language detection logic. (Keywords: databricks pipelines init, Asset Bundles, language detection, migration guides) - -**AUTO CDC patterns**: Use [9-auto_cdc.md](9-auto_cdc.md) for implementing Change Data Capture with AUTO CDC, including Slow Changing Dimensions (SCD Type 1 and Type 2) for tracking changes and deduplication. (Keywords: AUTO CDC, Slow Changing Dimension, SCD, SCD Type 1, SCD Type 2, change data capture, deduplication) - ---- - -## Workflow - -1. Determine the task type: - - **Setting up new project?** → Read [8-project-initialization.md](8-project-initialization.md) first - **Creating new pipeline?** → Read [1-ingestion-patterns.md](1-ingestion-patterns.md) - **Creating stream table?** → Read [2-streaming-patterns.md](2-streaming-patterns.md) - **Querying SCD history tables?** → Read [3-scd-query-patterns.md](3-scd-query-patterns.md) - **Implementing AUTO CDC or SCD?** → Read [9-auto_cdc.md](9-auto_cdc.md) - **Performance issues?** → Read [4-performance-tuning.md](4-performance-tuning.md) - **Using Python API?** → Read [5-python-api.md](5-python-api.md) - **Migrating from DLT?** → Read [6-dlt-migration.md](6-dlt-migration.md) - **Advanced configuration?** → Read [7-advanced-configuration.md](7-advanced-configuration.md) - **Validating?** → Read [validation-checklist.md](validation-checklist.md) - -2. Follow the instructions in the relevant guide - -3. 
Repeat for next task type ---- - -## Official Documentation - -- **[Lakeflow Spark Declarative Pipelines Overview](https://docs.databricks.com/aws/en/ldp/)** - Main documentation hub -- **[SQL Language Reference](https://docs.databricks.com/aws/en/ldp/developer/sql-dev)** - SQL syntax for streaming tables and materialized views -- **[Python Language Reference](https://docs.databricks.com/aws/en/ldp/developer/python-ref)** - `pyspark.pipelines` API -- **[Loading Data](https://docs.databricks.com/aws/en/ldp/load)** - Auto Loader, Kafka, Kinesis ingestion -- **[Change Data Capture (CDC)](https://docs.databricks.com/aws/en/ldp/cdc)** - AUTO CDC, SCD Type 1/2 - - -### Medallion Architecture Pattern - **Bronze Layer (Raw)** - - Raw data ingested from sources in original format - - Minimal transformations (append-only, add metadata like `_ingested_at`, `_source_file`) - - Single source of truth preserving data lineage - - **Silver Layer (Validated)** - - Cleaned and validated data. - - Might deduplicate here with auto_cdc, but often wait until the final step for auto_cdc if possible. - - Business logic applied (type casting, quality checks, filtering invalid records) - - Enterprise view of key business entities - - Enables self-service analytics and ML - - **Gold Layer (Business-Ready)** - - Aggregated, denormalized, project-specific tables - - Optimized for consumption (reporting, dashboards, BI tools) - - Fewer joins, read-optimized data models - - Kimball star schema tables - dim_, fact_ - - Deduplication often happens here via Slow Changing Dimensions (SCD), using auto_cdc. Sometimes that will happen upstream in silver instead, such as when joining multiple tables or business users plan to query the table from silver. - - **Typical Flow (Can vary)** - Bronze: read_files() or spark.readStream.format("cloudFiles") → streaming table - Silver: read bronze → filter/clean/validate → streaming table - Gold: read silver → aggregate/denormalize → auto_cdc or materialized view - - Sources: - - https://www.databricks.com/glossary/medallion-architecture - - https://docs.databricks.com/aws/en/lakehouse/medallion - - https://www.databricks.com/blog/2022/06/24/data-warehousing-modeling-techniques-and-their-implementation-on-the-databricks-lakehouse-platform.html - -**For medallion architecture** (bronze/silver/gold), two approaches work: -- **Flat with naming** (template default): `bronze_*.sql`, `silver_*.sql`, `gold_*.sql` -- **Subdirectories**: `bronze/orders.sql`, `silver/cleaned.sql`, `gold/summary.sql` - -Both work with the `transformations/**` glob pattern. Choose based on preference. - -See **[8-project-initialization.md](8-project-initialization.md)** for complete details on bundle initialization, migration, and troubleshooting. 
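Putting the layers together, here is a minimal SQL sketch of the typical flow described above (paths, table names, and types are illustrative):

```sql
-- Bronze: streaming ingest with Auto Loader via read_files()
CREATE OR REFRESH STREAMING TABLE bronze_orders AS
SELECT *, current_timestamp() AS _ingested_at, _metadata.file_path AS _source_file
FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/orders/', format => 'json');

-- Silver: filter/clean/validate
CREATE OR REFRESH STREAMING TABLE silver_orders AS
SELECT order_id, customer_id,
       CAST(amount AS DECIMAL(10,2)) AS amount,
       CAST(order_date AS DATE) AS order_date
FROM STREAM bronze_orders
WHERE order_id IS NOT NULL AND amount > 0;

-- Gold: aggregate/denormalize for consumption
CREATE OR REFRESH MATERIALIZED VIEW gold_daily_revenue AS
SELECT order_date, COUNT(*) AS order_count, SUM(amount) AS revenue
FROM silver_orders
GROUP BY order_date;
```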
- ---- -## General SDP development guidance -### Step 1: Write Pipeline Files Locally - -Create `.sql` or `.py` files in a local folder: - -``` -my_pipeline/ -├── bronze/ -│ ├── ingest_orders.sql # SQL (default for most cases) -│ └── ingest_events.py # Python (for complex logic) -├── silver/ -│ └── clean_orders.sql -└── gold/ - └── daily_summary.sql -``` - -**SQL Example** (`bronze/ingest_orders.sql`): -```sql -CREATE OR REFRESH STREAMING TABLE bronze_orders -CLUSTER BY (order_date) -AS -SELECT - *, - current_timestamp() AS _ingested_at, - _metadata.file_path AS _source_file -FROM read_files( - '/Volumes/catalog/schema/raw/orders/', - format => 'json', - schemaHints => 'order_id STRING, customer_id STRING, amount DECIMAL(10,2), order_date DATE' -); -``` - -**Python Example** (`bronze/ingest_events.py`): -```python -from pyspark import pipelines as dp -from pyspark.sql.functions import col, current_timestamp - -# Get schema location from pipeline configuration -schema_location_base = spark.conf.get("schema_location_base") - -@dp.table(name="bronze_events", cluster_by=["event_date"]) -def bronze_events(): - return ( - spark.readStream.format("cloudFiles") - .option("cloudFiles.format", "json") - .option("cloudFiles.schemaLocation", f"{schema_location_base}/bronze_events") - .load("/Volumes/catalog/schema/raw/events/") - .withColumn("_ingested_at", current_timestamp()) - .withColumn("_source_file", col("_metadata.file_path")) - ) -``` - -**IMPORTANT for Python Pipelines**: When using `spark.readStream.format("cloudFiles")` for cloud storage ingestion, with schema inference (no schema specified), you **must specify a schema location**. - -**Always ask the user** where to store Auto Loader schema metadata. Recommend: -``` -/Volumes/{catalog}/{schema}/{pipeline_name}_metadata/schemas -``` - -Example: `/Volumes/my_catalog/pipeline_metadata/orders_pipeline_metadata/schemas` - -**Never use the source data volume** - this causes permission conflicts. The schema location should be configured in the pipeline settings and accessed via `spark.conf.get("schema_location_base")`. - -**Language Selection:** - -**CRITICAL RULE**: If the user explicitly mentions "Python" in their request (e.g., "Python Spark Declarative Pipeline", "Python SDP", "use Python"), **ALWAYS use Python without asking**. The same applies to SQL - if they say "SQL pipeline", use SQL. - -- **Explicit language request**: User says "Python" → Use Python. User says "SQL" → Use SQL. **Do not ask for clarification.** -- **Auto-detection** (only when no explicit language mentioned): - - **SQL indicators**: "sql files", "simple transformations", "aggregations", "materialized view", "CREATE OR REFRESH" - - **Python indicators**: ".py files", "UDF", "complex logic", "ML inference", "external API", "@dp.table", "pandas", "decorator" -- **Prompt for clarification** only when language intent is truly ambiguous (no explicit mention, mixed signals) -- **Default to SQL** only when ambiguous AND no Python indicators present - -See **[8-project-initialization.md](8-project-initialization.md)** for detailed language detection logic. - - -## Option 1: Pipelines with DABs: -Use asset bundles and pipeline CLI. -See [Quick Start](#quick-start) and **[8-project-initialization.md](8-project-initialization.md)** for complete details. - -## Option 2: Manual Workflow (Advanced) - -For rapid prototyping, experimentation, or when you prefer direct control without Asset Bundles, use the manual workflow with MCP tools. 
Use MCP tools to create, run, and iterate on **serverless SDP pipelines**. The **primary tool is `create_or_update_pipeline`**, which handles the entire lifecycle.

**IMPORTANT: Always create serverless pipelines (default).** Only use classic clusters if the user explicitly asks for classic, pro, or advanced compute, or requires R language, Spark RDD APIs, or JAR libraries.

See **[10-mcp-approach.md](10-mcp-approach.md)** for a detailed guide.


## Best Practices (2026)

### Project Structure
- **Default to `databricks pipelines init`** for new projects (creates Asset Bundle)
- **Use Asset Bundles** for multi-environment deployments (dev/staging/prod)
- **Manual structure only** for quick prototypes or legacy migration
- **Medallion architecture**: Two approaches work with Asset Bundles:
  - **Flat structure** (template default): `bronze_*.sql`, `silver_*.sql`, `gold_*.sql` in `transformations/`
  - **Subdirectories**: `transformations/bronze/`, `transformations/silver/`, `transformations/gold/`
  - Both work with the `transformations/**` glob pattern - choose based on team preference
- See **[8-project-initialization.md](8-project-initialization.md)** for project setup details

### Minimal pipeline config pointers
- Define parameters in your pipeline's configuration and access them in code with `spark.conf.get("key")`.
- In Databricks Asset Bundles, set these under `resources.pipelines.<pipeline_name>.configuration`; validate with `databricks bundle validate`.

### Modern Defaults
- **CLUSTER BY** (Liquid Clustering), not PARTITION BY - see [4-performance-tuning.md](4-performance-tuning.md)
- **Raw `.sql`/`.py` files**, not notebooks
- **Serverless compute ONLY** - Do not use classic clusters unless explicitly required
- **Unity Catalog** (required for serverless)
- **read_files()** when using SQL for cloud storage ingestion - see [1-ingestion-patterns.md](1-ingestion-patterns.md)

### Multi-Schema Patterns

**Default: Single target schema per pipeline.** Each pipeline has one target `catalog` and `schema` where all tables are written.


#### Option 1: Single Pipeline, Single Schema with Prefixes (Recommended)

Use one schema with table name prefixes to distinguish layers:

```python
# All tables write to: catalog.schema.bronze_*, silver_*, gold_*
@dp.table(name="bronze_orders")   # → catalog.schema.bronze_orders
@dp.table(name="silver_orders")   # → catalog.schema.silver_orders
@dp.table(name="gold_summary")    # → catalog.schema.gold_summary
```

**Advantages:**
- Simpler configuration (one pipeline)
- All tables in one schema for easy discovery

#### Option 2: Parameterized Catalogs and Schemas

Use variables to specify a separate catalog and/or schema for different steps.

Below are Python SDP examples that source variables from pipeline configs via `spark.conf.get`, and use the default catalog/schema for bronze.

##### Same catalog, separate schemas; bronze uses pipeline defaults
- Set your pipeline's default catalog and default schema to the bronze layer (for example, catalog=my_catalog, schema=bronze). When you omit catalog/schema in code, reads/writes go to these defaults.
- Use pipeline parameters for the other schemas and any source schema/path, retrieved in code with `spark.conf.get(...)`.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Pull variables from pipeline configuration parameters
silver_schema = spark.conf.get("silver_schema")    # e.g., "silver"
gold_schema = spark.conf.get("gold_schema")        # e.g., "gold"
landing_schema = spark.conf.get("landing_schema")  # e.g., "landing"

# Bronze → uses default catalog/schema (set to bronze in pipeline settings)
@dp.table(name="orders_bronze")
def orders_bronze():
    # Read from another schema in the same default catalog
    return spark.readStream.table(f"{landing_schema}.orders_raw")

# Silver → same catalog, schema from parameter
@dp.table(name=f"{silver_schema}.orders_clean")
def orders_clean():
    return (spark.read.table("orders_bronze")  # unqualified = default catalog/schema
            .filter(col("order_id").isNotNull()))

# Gold → same catalog, schema from parameter
@dp.materialized_view(name=f"{gold_schema}.orders_by_date")
def orders_by_date():
    return (spark.read.table(f"{silver_schema}.orders_clean")
            .groupBy("order_date")
            .count().withColumnRenamed("count", "order_count"))
```
- Using unqualified names for bronze ensures it lands in the pipeline's default catalog/schema; silver/gold are explicitly schema-qualified within the same catalog.

---

##### Custom catalog/schema per layer; bronze still uses pipeline defaults
- Keep bronze in the pipeline defaults (default catalog/schema set to your bronze layer). For silver/gold, use fully-qualified names with catalog and schema variables from pipeline configuration.

```python
from pyspark import pipelines as dp
from pyspark.sql.functions import col

# Pull variables from pipeline configuration parameters
silver_catalog = spark.conf.get("silver_catalog")    # e.g., "my_catalog"
silver_schema = spark.conf.get("silver_schema")      # e.g., "silver"
gold_catalog = spark.conf.get("gold_catalog")        # e.g., "my_catalog"
gold_schema = spark.conf.get("gold_schema")          # e.g., "gold"
landing_catalog = spark.conf.get("landing_catalog")  # optional, if source is in another catalog
landing_schema = spark.conf.get("landing_schema")

# Bronze → uses default catalog/schema (set to bronze)
@dp.table(name="orders_bronze")
def orders_bronze():
    # If source is in a specified catalog/schema:
    return spark.readStream.table(f"{landing_catalog}.{landing_schema}.orders_raw")

# Silver → custom catalog + schema via parameters
@dp.table(name=f"{silver_catalog}.{silver_schema}.orders_clean")
def orders_clean():
    # Read bronze by its unqualified name (defaults), or fully qualify if preferred
    return (spark.read.table("orders_bronze")
            .filter(col("order_id").isNotNull()))

# Gold → custom catalog + schema via parameters
@dp.materialized_view(name=f"{gold_catalog}.{gold_schema}.orders_by_date")
def orders_by_date():
    return (spark.read.table(f"{silver_catalog}.{silver_schema}.orders_clean")
            .groupBy("order_date")
            .count().withColumnRenamed("count", "order_count"))
```
- Multipart names in the decorator's name argument let you publish to explicit catalog.schema targets within one pipeline.
- Unqualified reads/writes use the pipeline defaults; use fully-qualified names when crossing catalogs or when you need explicit namespace control.

---


**Note:** The `@dp.table()` decorator does not currently support separate `schema=` or `catalog=` parameters. The `name` parameter is a string containing `catalog.schema.table_name`; it can omit the catalog and/or schema to use the pipeline's configured default target schema.

### Reading Tables in Python

**Modern SDP Best Practice:**
- Use `spark.read.table()` for batch reads
- Use `spark.readStream.table()` for streaming reads
- Don't use `dp.read()` or `dp.read_stream()` (old syntax, no longer documented)
- Don't use `dlt.read()` or `dlt.read_stream()` (legacy DLT API)

**Key Point:** SDP automatically tracks table dependencies from standard Spark DataFrame operations. No special read APIs are needed.

#### Three-Tier Identifier Resolution

SDP supports three levels of table name qualification:

| Level | Syntax | When to Use |
|-------|--------|-------------|
| **Unqualified** | `spark.read.table("my_table")` | Reading tables within the same pipeline's target catalog/schema (recommended) |
| **Partially-qualified** | `spark.read.table("other_schema.my_table")` | Reading from a different schema in the same catalog |
| **Fully-qualified** | `spark.read.table("other_catalog.other_schema.my_table")` | Reading from external catalogs/schemas |

#### Option 1: Unqualified Names (Recommended for Pipeline Tables)

**Best practice for tables within the same pipeline.** SDP resolves unqualified names to the pipeline's configured target catalog and schema. This makes code portable across environments (dev/prod).

```python
@dp.table(name="silver_clean")
def silver_clean():
    # Reads from pipeline's target catalog/schema (e.g., dev_catalog.dev_schema.bronze_raw)
    return (
        spark.read.table("bronze_raw")
        .filter(F.col("valid") == True)
    )

@dp.table(name="silver_events")
def silver_events():
    # Streaming read from same pipeline's bronze_events table
    return (
        spark.readStream.table("bronze_events")
        .withColumn("processed_at", F.current_timestamp())
    )
```

#### Option 2: Pipeline Parameters (For External Sources)

**Use `spark.conf.get()` to parameterize external catalog/schema references.** Define parameters in pipeline configuration, then reference them at the module level.
- -```python -from pyspark import pipelines as dp -from pyspark.sql import functions as F - -# Get parameterized values at module level (evaluated once at pipeline start) -source_catalog = spark.conf.get("source_catalog") -source_schema = spark.conf.get("source_schema", "sales") # with default - -@dp.table(name="transaction_summary") -def transaction_summary(): - return ( - spark.read.table(f"{source_catalog}.{source_schema}.transactions") - .groupBy("account_id") - .agg( - F.count("txn_id").alias("txn_count"), - F.sum("txn_amount").alias("account_revenue") - ) - ) -``` - -**Configure parameters in pipeline settings:** -- **Asset Bundles**: Add to `pipeline.yml` under `configuration:` -- **Manual/MCP**: Pass via `extra_settings.configuration` dict - -```yaml -# In resources/my_pipeline.pipeline.yml -configuration: - source_catalog: "shared_catalog" - source_schema: "sales" -``` - -#### Option 3: Fully-Qualified Names (For Fixed External References) - -Use when referencing specific external tables that don't change across environments: - -```python -@dp.table(name="enriched_orders") -def enriched_orders(): - # Pipeline-internal table (unqualified) - orders = spark.read.table("bronze_orders") - - # External reference table (fully-qualified) - products = spark.read.table("shared_catalog.reference.products") - - return orders.join(products, "product_id") -``` - -#### Choosing the Right Approach - -| Scenario | Recommended Approach | -|----------|---------------------| -| Reading tables created in same pipeline | **Unqualified names** - portable, uses target catalog/schema | -| Reading from external source that varies by environment | **Pipeline parameters** - configurable per deployment | -| Reading from shared/reference tables with fixed location | **Fully-qualified names** - explicit and clear | -| Mixed pipeline (some internal, some external) | **Combine approaches** - unqualified for internal, parameters for external | - ---- - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **Empty output tables** | Use `get_table_details` to verify, check upstream sources | -| **Pipeline stuck INITIALIZING** | Normal for serverless, wait a few minutes | -| **"Column not found"** | Check `schemaHints` match actual data | -| **Streaming reads fail** | For file ingestion in a streaming table, you must use the `STREAM` keyword with `read_files`: `FROM STREAM read_files(...)`. For table streams use `FROM stream(table)`. See [read_files — Usage in streaming tables](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#usage-in-streaming-tables). | -| **Timeout during run** | Increase `timeout`, or use `wait_for_completion=False` and poll with `get_update` | -| **MV doesn't refresh** | Enable row tracking on source tables | -| **SCD2: query column not found** | Lakeflow uses `__START_AT` and `__END_AT` (double underscore), not `START_AT`/`END_AT`. Use `WHERE __END_AT IS NULL` for current rows. See [3-scd-patterns.md](3-scd-patterns.md). | -| **AUTO CDC parse error at APPLY/SEQUENCE** | Put `APPLY AS DELETE WHEN` **before** `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source (omit `_rescued_data` unless bronze uses rescue data). Omit `TRACK HISTORY ON *` if it causes "end of input" errors; default is equivalent. See [2-streaming-patterns.md](2-streaming-patterns.md). 
| -| **"Cannot create streaming table from batch query"** | In a streaming table query, use `FROM STREAM read_files(...)` so `read_files` leverages Auto Loader; `FROM read_files(...)` alone is batch. See [1-ingestion-patterns.md](1-ingestion-patterns.md) and [read_files — Usage in streaming tables](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#usage-in-streaming-tables). | - -**For detailed errors**, the `result["message"]` from `create_or_update_pipeline` includes suggested next steps. Use `get_pipeline_events(pipeline_id=...)` for full stack traces. - ---- - -## Advanced Pipeline Configuration - -For advanced configuration options (development mode, continuous pipelines, custom clusters, notifications, Python dependencies, etc.), see **[7-advanced-configuration.md](7-advanced-configuration.md)**. - ---- - -## Platform Constraints - -### Serverless Pipeline Requirements (Default) -| Requirement | Details | -|-------------|---------| -| **Unity Catalog** | Required - serverless pipelines always use UC | -| **Workspace Region** | Must be in serverless-enabled region | -| **Serverless Terms** | Must accept serverless terms of use | -| **CDC Features** | Requires serverless (or Pro/Advanced with classic clusters) | - -### Serverless Limitations (When Classic Clusters Required) -| Limitation | Workaround | -|------------|-----------| -| **R language** | Not supported - use classic clusters if required | -| **Spark RDD APIs** | Not supported - use classic clusters if required | -| **JAR libraries** | Not supported - use classic clusters if required | -| **Maven coordinates** | Not supported - use classic clusters if required | -| **DBFS root access** | Limited - must use Unity Catalog external locations | -| **Global temp views** | Not supported | - -### General Constraints -| Constraint | Details | -|------------|---------| -| **Schema Evolution** | Streaming tables require full refresh for incompatible changes | -| **SQL Limitations** | PIVOT clause unsupported | -| **Sinks** | Python only, streaming only, append flows only | - -**Default to serverless** unless user explicitly requires R, RDD APIs, or JAR libraries. - -## Related Skills - -- **[databricks-jobs](../databricks-jobs/SKILL.md)** - for orchestrating and scheduling pipeline runs -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - for multi-environment deployment of pipeline projects -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - for generating test data to feed into pipelines -- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - for catalog/schema/volume management and governance diff --git a/.claude/skills/databricks-synthetic-data-generation/SKILL.md b/.claude/skills/databricks-synthetic-data-generation/SKILL.md deleted file mode 100644 index ce2a17c..0000000 --- a/.claude/skills/databricks-synthetic-data-generation/SKILL.md +++ /dev/null @@ -1,660 +0,0 @@ ---- -name: databricks-synthetic-data-generation -description: "Generate realistic synthetic data using Faker and Spark, with non-linear distributions, integrity constraints, and save to Databricks. Use when creating test data, demo datasets, or synthetic tables." ---- - -# Synthetic Data Generation - -Generate realistic, story-driven synthetic data for Databricks using Python with Faker and Spark. 
- -## Common Libraries - -These libraries are useful for generating realistic synthetic data: - -- **faker**: Generates realistic names, addresses, emails, companies, dates, etc. -- **holidays**: Provides country-specific holiday calendars for realistic date patterns - -These are typically NOT pre-installed on Databricks. Install them using `execute_databricks_command` tool: -- `code`: "%pip install faker holidays" - -Save the returned `cluster_id` and `context_id` for subsequent calls. - -## Workflow - -1. **Write Python code to a local file** in the project (e.g., `scripts/generate_data.py`) -2. **Execute on Databricks** using the `run_python_file_on_databricks` MCP tool -3. **If execution fails**: Edit the local file to fix the error, then re-execute -4. **Reuse the context** for follow-up executions by passing the returned `cluster_id` and `context_id` - -**Always work with local files first, then execute.** This makes debugging easier - you can see and edit the code. - -### Context Reuse Pattern - -The first execution auto-selects a running cluster and creates an execution context. **Reuse this context for follow-up calls** - it's much faster (~1s vs ~15s) and shares variables/imports: - -**First execution** - use `run_python_file_on_databricks` tool: -- `file_path`: "scripts/generate_data.py" - -Returns: `{ success, output, error, cluster_id, context_id, ... }` - -Save `cluster_id` and `context_id` for follow-up calls. - -**If execution fails:** -1. Read the error from the result -2. Edit the local Python file to fix the issue -3. Re-execute with same context using `run_python_file_on_databricks` tool: - - `file_path`: "scripts/generate_data.py" - - `cluster_id`: "" - - `context_id`: "" - -**Follow-up executions** reuse the context (faster, shares state): -- `file_path`: "scripts/validate_data.py" -- `cluster_id`: "" -- `context_id`: "" - -### Handling Failures - -When execution fails: -1. Read the error from the result -2. **Edit the local Python file** to fix the issue -3. Re-execute using the same `cluster_id` and `context_id` (faster, keeps installed libraries) -4. If the context is corrupted, omit `context_id` to create a fresh one - -### Installing Libraries - -Databricks provides Spark, pandas, numpy, and common data libraries by default. **Only install a library if you get an import error.** - -Use `execute_databricks_command` tool: -- `code`: "%pip install faker" -- `cluster_id`: "" -- `context_id`: "" - -The library is immediately available in the same context. - -**Note:** Keeping the same `context_id` means installed libraries persist across calls. - -## Storage Destination - -### Ask for Schema Name - -By default, use the `ai_dev_kit` catalog. Ask the user which schema to use: - -> "I'll save the data to `ai_dev_kit.`. What schema name would you like to use? (You can also specify a different catalog if needed.)" - -If the user provides just a schema name, use `ai_dev_kit.{schema}`. If they provide `catalog.schema`, use that instead. - -### Create Infrastructure in the Script - -Always create the catalog, schema, and volume **inside the Python script** using `spark.sql()`. Do NOT make separate MCP SQL calls - it's much slower. - -The `spark` variable is available by default on Databricks clusters. 
- -```python -# ============================================================================= -# CREATE INFRASTRUCTURE (inside the Python script) -# ============================================================================= -spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}") -spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}") -spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data") -``` - -### Save to Volume as Raw Data (Never Tables) - -**Always save data to a Volume as parquet files, never directly to tables** (unless the user explicitly requests tables). This is the input for the downstream Spark Declarative Pipeline (SDP) that will handle bronze/silver/gold layers. - -```python -VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data" - -# Save as parquet files (raw data) -spark.createDataFrame(customers_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/customers") -spark.createDataFrame(orders_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/orders") -spark.createDataFrame(tickets_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/tickets") -``` - -## Raw Data Only - No Pre-Aggregated Fields (Unless Instructed Otherwise) - -**By default, generate raw, transactional data only.** Do not create fields that represent sums, totals, averages, or counts. - -- One row = one event/transaction/record -- No columns like `total_orders`, `sum_revenue`, `avg_csat`, `order_count` -- Each row has its own individual values, not rollups - -**Why?** A Spark Declarative Pipeline (SDP) will typically be built after data generation to: -- Ingest raw data (bronze layer) -- Clean and validate (silver layer) -- Aggregate and compute metrics (gold layer) - -The synthetic data is the **source** for this pipeline. Aggregations happen downstream. - -**Note:** If the user specifically requests aggregated fields or summary tables, follow their instructions. - -```python -# GOOD - Raw transactional data -# Customer table: one row per customer, no aggregated fields -customers_data.append({ - "customer_id": cid, - "name": fake.company(), - "tier": "Enterprise", - "region": "North", -}) - -# Order table: one row per order -orders_data.append({ - "order_id": f"ORD-{i:06d}", - "customer_id": cid, - "amount": 150.00, # This order's amount - "order_date": "2024-10-15", -}) - -# BAD - Don't add pre-aggregated fields -# customers_data.append({ -# "customer_id": cid, -# "total_orders": 47, # NO - this is an aggregation -# "total_revenue": 12500.00, # NO - this is a sum -# "avg_order_value": 265.95, # NO - this is an average -# }) -``` - -## Temporality and Data Volume - -### Date Range: Last 6 Months from Today - -**Always generate data for the last ~6 months ending at the current date.** This ensures: -- Data feels current and relevant for demos -- Recent patterns are visible in dashboards -- Downstream aggregations (daily/weekly/monthly) have enough history - -```python -from datetime import datetime, timedelta - -# Dynamic date range - last 6 months from today -END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) -START_DATE = END_DATE - timedelta(days=180) - -# Place special events within this range (e.g., incident 3 weeks ago) -INCIDENT_END = END_DATE - timedelta(days=21) -INCIDENT_START = INCIDENT_END - timedelta(days=10) -``` - -### Data Volume for Aggregation - -Generate enough data so patterns remain visible after downstream aggregation (SDP pipelines often aggregate by day/week/region/category). 
Rules of thumb: - -| Grain | Minimum Records | Rationale | -|-------|-----------------|-----------| -| Daily time series | 50-100/day | See trends after weekly rollup | -| Per category | 500+ per category | Statistical significance | -| Per customer | 5-20 events/customer | Enough for customer-level analysis | -| Total rows | 10K-50K minimum | Patterns survive GROUP BY | - -```python -# Example: 8000 tickets over 180 days = ~44/day average -# After weekly aggregation: ~310 records per week per category -# After monthly by region: still enough to see patterns -N_TICKETS = 8000 -N_CUSTOMERS = 2500 # Each has ~3 tickets on average -N_ORDERS = 25000 # ~10 orders per customer average -``` - -## Script Structure - -Always structure scripts with configuration variables at the top: - -```python -"""Generate synthetic data for [use case].""" -import numpy as np -import pandas as pd -from datetime import datetime, timedelta -from faker import Faker -import holidays -from pyspark.sql import SparkSession - -# ============================================================================= -# CONFIGURATION - Edit these values -# ============================================================================= -CATALOG = "my_catalog" -SCHEMA = "my_schema" -VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data" - -# Data sizes - enough for aggregation patterns to survive -N_CUSTOMERS = 2500 -N_ORDERS = 25000 -N_TICKETS = 8000 - -# Date range - last 6 months from today -END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) -START_DATE = END_DATE - timedelta(days=180) - -# Special events (within the date range) -INCIDENT_END = END_DATE - timedelta(days=21) -INCIDENT_START = INCIDENT_END - timedelta(days=10) - -# Holiday calendar for realistic patterns -US_HOLIDAYS = holidays.US(years=[START_DATE.year, END_DATE.year]) - -# Reproducibility -SEED = 42 - -# ============================================================================= -# SETUP -# ============================================================================= -np.random.seed(SEED) -Faker.seed(SEED) -fake = Faker() -spark = SparkSession.builder.getOrCreate() - -# ... rest of script -``` - -## Key Principles - -### 1. Use Pandas for Generation, Spark for Saving - -Generate data with pandas (faster, easier), convert to Spark for saving: - -```python -import pandas as pd - -# Generate with pandas -customers_pdf = pd.DataFrame({ - "customer_id": [f"CUST-{i:05d}" for i in range(N_CUSTOMERS)], - "name": [fake.company() for _ in range(N_CUSTOMERS)], - "tier": np.random.choice(['Free', 'Pro', 'Enterprise'], N_CUSTOMERS, p=[0.6, 0.3, 0.1]), - "region": np.random.choice(['North', 'South', 'East', 'West'], N_CUSTOMERS, p=[0.4, 0.25, 0.2, 0.15]), - "created_at": [fake.date_between(start_date='-2y', end_date='-6m') for _ in range(N_CUSTOMERS)], -}) - -# Convert to Spark and save -customers_df = spark.createDataFrame(customers_pdf) -customers_df.write.mode("overwrite").parquet(f"{VOLUME_PATH}/customers") -``` - -### 2. Iterate on DataFrames for Referential Integrity - -Generate master tables first, then iterate on them to create related tables with matching IDs: - -```python -# 1. Generate customers (master table) -customers_pdf = pd.DataFrame({ - "customer_id": [f"CUST-{i:05d}" for i in range(N_CUSTOMERS)], - "tier": np.random.choice(['Free', 'Pro', 'Enterprise'], N_CUSTOMERS, p=[0.6, 0.3, 0.1]), - # ... -}) - -# 2. 
Create lookup for foreign key generation -customer_ids = customers_pdf["customer_id"].tolist() -customer_tier_map = dict(zip(customers_pdf["customer_id"], customers_pdf["tier"])) - -# Weight by tier - Enterprise customers generate more orders -tier_weights = customers_pdf["tier"].map({'Enterprise': 5.0, 'Pro': 2.0, 'Free': 1.0}) -customer_weights = (tier_weights / tier_weights.sum()).tolist() - -# 3. Generate orders with valid foreign keys and tier-based logic -orders_data = [] -for i in range(N_ORDERS): - cid = np.random.choice(customer_ids, p=customer_weights) - tier = customer_tier_map[cid] - - # Amount depends on tier - if tier == 'Enterprise': - amount = np.random.lognormal(7, 0.8) - elif tier == 'Pro': - amount = np.random.lognormal(5, 0.7) - else: - amount = np.random.lognormal(3.5, 0.6) - - orders_data.append({ - "order_id": f"ORD-{i:06d}", - "customer_id": cid, - "amount": round(amount, 2), - "order_date": fake.date_between(start_date=START_DATE, end_date=END_DATE), - }) - -orders_pdf = pd.DataFrame(orders_data) - -# 4. Generate tickets that reference both customers and orders -order_ids = orders_pdf["order_id"].tolist() -tickets_data = [] -for i in range(N_TICKETS): - cid = np.random.choice(customer_ids, p=customer_weights) - oid = np.random.choice(order_ids) # Or None for general inquiry - - tickets_data.append({ - "ticket_id": f"TKT-{i:06d}", - "customer_id": cid, - "order_id": oid if np.random.random() > 0.3 else None, - # ... - }) - -tickets_pdf = pd.DataFrame(tickets_data) -``` - -### 3. Non-Linear Distributions - -**Never use uniform distributions** - real data is rarely uniform: - -```python -# BAD - Uniform (unrealistic) -prices = np.random.uniform(10, 1000, size=N_ORDERS) - -# GOOD - Log-normal (realistic for prices, salaries, order amounts) -prices = np.random.lognormal(mean=4.5, sigma=0.8, size=N_ORDERS) - -# GOOD - Pareto/power law (popularity, wealth, page views) -popularity = (np.random.pareto(a=2.5, size=N_PRODUCTS) + 1) * 10 - -# GOOD - Exponential (time between events, resolution time) -resolution_hours = np.random.exponential(scale=24, size=N_TICKETS) - -# GOOD - Weighted categorical -regions = np.random.choice( - ['North', 'South', 'East', 'West'], - size=N_CUSTOMERS, - p=[0.40, 0.25, 0.20, 0.15] -) -``` - -### 4. Time-Based Patterns - -Add weekday/weekend effects, holidays, seasonality, and event spikes: - -```python -import holidays - -# Load holiday calendar -US_HOLIDAYS = holidays.US(years=[START_DATE.year, END_DATE.year]) - -def get_daily_multiplier(date): - """Calculate volume multiplier for a given date.""" - multiplier = 1.0 - - # Weekend drop - if date.weekday() >= 5: - multiplier *= 0.6 - - # Holiday drop (even lower than weekends) - if date in US_HOLIDAYS: - multiplier *= 0.3 - - # Q4 seasonality (higher in Oct-Dec) - multiplier *= 1 + 0.15 * (date.month - 6) / 6 - - # Incident spike - if INCIDENT_START <= date <= INCIDENT_END: - multiplier *= 3.0 - - # Random noise - multiplier *= np.random.normal(1, 0.1) - - return max(0.1, multiplier) - -# Distribute tickets across dates with realistic patterns -date_range = pd.date_range(START_DATE, END_DATE, freq='D') -daily_volumes = [int(BASE_DAILY_TICKETS * get_daily_multiplier(d)) for d in date_range] -``` - -### 5. 
Row Coherence - -Attributes within a row should correlate logically: - -```python -def generate_ticket(customer_id, tier, date): - """Generate a coherent ticket where attributes correlate.""" - - # Priority correlates with tier - if tier == 'Enterprise': - priority = np.random.choice(['Critical', 'High', 'Medium'], p=[0.3, 0.5, 0.2]) - else: - priority = np.random.choice(['Critical', 'High', 'Medium', 'Low'], p=[0.05, 0.2, 0.45, 0.3]) - - # Resolution time correlates with priority - resolution_scale = {'Critical': 4, 'High': 12, 'Medium': 36, 'Low': 72} - resolution_hours = np.random.exponential(scale=resolution_scale[priority]) - - # CSAT correlates with resolution time - if resolution_hours < 4: - csat = np.random.choice([4, 5], p=[0.3, 0.7]) - elif resolution_hours < 24: - csat = np.random.choice([3, 4, 5], p=[0.2, 0.5, 0.3]) - else: - csat = np.random.choice([1, 2, 3, 4], p=[0.1, 0.3, 0.4, 0.2]) - - return { - "customer_id": customer_id, - "priority": priority, - "resolution_hours": round(resolution_hours, 1), - "csat_score": csat, - "created_at": date, - } -``` - -## Complete Example - -Save as `scripts/generate_data.py`: - -```python -"""Generate synthetic customer, order, and ticket data.""" -import numpy as np -import pandas as pd -from datetime import datetime, timedelta -from faker import Faker -import holidays -from pyspark.sql import SparkSession - -# ============================================================================= -# CONFIGURATION -# ============================================================================= -CATALOG = "my_catalog" -SCHEMA = "my_schema" -VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data" - -N_CUSTOMERS = 2500 -N_ORDERS = 25000 -N_TICKETS = 8000 - -# Date range - last 6 months from today -END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) -START_DATE = END_DATE - timedelta(days=180) - -# Special events (within the date range) -INCIDENT_END = END_DATE - timedelta(days=21) -INCIDENT_START = INCIDENT_END - timedelta(days=10) - -# Holiday calendar -US_HOLIDAYS = holidays.US(years=[START_DATE.year, END_DATE.year]) - -SEED = 42 - -# ============================================================================= -# SETUP -# ============================================================================= -np.random.seed(SEED) -Faker.seed(SEED) -fake = Faker() -spark = SparkSession.builder.getOrCreate() - -# ============================================================================= -# CREATE INFRASTRUCTURE -# ============================================================================= -print(f"Creating catalog/schema/volume if needed...") -spark.sql(f"CREATE CATALOG IF NOT EXISTS {CATALOG}") -spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}") -spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data") - -print(f"Generating: {N_CUSTOMERS:,} customers, {N_ORDERS:,} orders, {N_TICKETS:,} tickets") - -# ============================================================================= -# 1. 
CUSTOMERS (Master Table) -# ============================================================================= -print("Generating customers...") - -customers_pdf = pd.DataFrame({ - "customer_id": [f"CUST-{i:05d}" for i in range(N_CUSTOMERS)], - "name": [fake.company() for _ in range(N_CUSTOMERS)], - "tier": np.random.choice(['Free', 'Pro', 'Enterprise'], N_CUSTOMERS, p=[0.6, 0.3, 0.1]), - "region": np.random.choice(['North', 'South', 'East', 'West'], N_CUSTOMERS, p=[0.4, 0.25, 0.2, 0.15]), -}) - -# ARR correlates with tier -customers_pdf["arr"] = customers_pdf["tier"].apply( - lambda t: round(np.random.lognormal(11, 0.5), 2) if t == 'Enterprise' - else round(np.random.lognormal(8, 0.6), 2) if t == 'Pro' else 0 -) - -# Lookups for foreign keys -customer_ids = customers_pdf["customer_id"].tolist() -customer_tier_map = dict(zip(customers_pdf["customer_id"], customers_pdf["tier"])) -tier_weights = customers_pdf["tier"].map({'Enterprise': 5.0, 'Pro': 2.0, 'Free': 1.0}) -customer_weights = (tier_weights / tier_weights.sum()).tolist() - -print(f" Created {len(customers_pdf):,} customers") - -# ============================================================================= -# 2. ORDERS (References Customers) -# ============================================================================= -print("Generating orders...") - -orders_data = [] -for i in range(N_ORDERS): - cid = np.random.choice(customer_ids, p=customer_weights) - tier = customer_tier_map[cid] - amount = np.random.lognormal(7 if tier == 'Enterprise' else 5 if tier == 'Pro' else 3.5, 0.7) - - orders_data.append({ - "order_id": f"ORD-{i:06d}", - "customer_id": cid, - "amount": round(amount, 2), - "status": np.random.choice(['completed', 'pending', 'cancelled'], p=[0.85, 0.10, 0.05]), - "order_date": fake.date_between(start_date=START_DATE, end_date=END_DATE), - }) - -orders_pdf = pd.DataFrame(orders_data) -print(f" Created {len(orders_pdf):,} orders") - -# ============================================================================= -# 3. 
TICKETS (References Customers, with incident spike) -# ============================================================================= -print("Generating tickets...") - -def get_daily_volume(date, base=25): - vol = base * (0.6 if date.weekday() >= 5 else 1.0) - if date in US_HOLIDAYS: - vol *= 0.3 # Even lower on holidays - if INCIDENT_START <= date <= INCIDENT_END: - vol *= 3.0 - return int(vol * np.random.normal(1, 0.15)) - -# Distribute tickets across dates -tickets_data = [] -ticket_idx = 0 -for day in pd.date_range(START_DATE, END_DATE): - daily_count = get_daily_volume(day.to_pydatetime()) - is_incident = INCIDENT_START <= day.to_pydatetime() <= INCIDENT_END - - for _ in range(daily_count): - if ticket_idx >= N_TICKETS: - break - - cid = np.random.choice(customer_ids, p=customer_weights) - tier = customer_tier_map[cid] - - # Category - Auth dominates during incident - if is_incident: - category = np.random.choice(['Auth', 'Network', 'Billing', 'Account'], p=[0.65, 0.15, 0.1, 0.1]) - else: - category = np.random.choice(['Auth', 'Network', 'Billing', 'Account'], p=[0.25, 0.30, 0.25, 0.20]) - - # Priority correlates with tier - priority = np.random.choice(['Critical', 'High', 'Medium'], p=[0.3, 0.5, 0.2]) if tier == 'Enterprise' \ - else np.random.choice(['Critical', 'High', 'Medium', 'Low'], p=[0.05, 0.2, 0.45, 0.3]) - - # Resolution time correlates with priority - res_scale = {'Critical': 4, 'High': 12, 'Medium': 36, 'Low': 72} - resolution = np.random.exponential(scale=res_scale[priority]) - - # CSAT degrades during incident for Auth - if is_incident and category == 'Auth': - csat = np.random.choice([1, 2, 3, 4, 5], p=[0.15, 0.25, 0.35, 0.2, 0.05]) - else: - csat = 5 if resolution < 4 else (4 if resolution < 12 else np.random.choice([2, 3, 4], p=[0.2, 0.5, 0.3])) - - tickets_data.append({ - "ticket_id": f"TKT-{ticket_idx:06d}", - "customer_id": cid, - "category": category, - "priority": priority, - "resolution_hours": round(resolution, 1), - "csat_score": csat, - "created_at": day.strftime("%Y-%m-%d"), - }) - ticket_idx += 1 - - if ticket_idx >= N_TICKETS: - break - -tickets_pdf = pd.DataFrame(tickets_data) -print(f" Created {len(tickets_pdf):,} tickets") - -# ============================================================================= -# 4. SAVE TO VOLUME -# ============================================================================= -print(f"\nSaving to {VOLUME_PATH}...") - -spark.createDataFrame(customers_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/customers") -spark.createDataFrame(orders_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/orders") -spark.createDataFrame(tickets_pdf).write.mode("overwrite").parquet(f"{VOLUME_PATH}/tickets") - -print("Done!") - -# ============================================================================= -# 5. 
VALIDATION -# ============================================================================= -print("\n=== VALIDATION ===") -print(f"Tier distribution: {customers_pdf['tier'].value_counts(normalize=True).to_dict()}") -print(f"Avg order by tier: {orders_pdf.merge(customers_pdf[['customer_id', 'tier']]).groupby('tier')['amount'].mean().to_dict()}") - -incident_tickets = tickets_pdf[tickets_pdf['created_at'].between( - INCIDENT_START.strftime("%Y-%m-%d"), INCIDENT_END.strftime("%Y-%m-%d") -)] -print(f"Incident period tickets: {len(incident_tickets):,} ({len(incident_tickets)/len(tickets_pdf)*100:.1f}%)") -print(f"Incident Auth %: {(incident_tickets['category'] == 'Auth').mean()*100:.1f}%") -``` - -Execute using `run_python_file_on_databricks` tool: -- `file_path`: "scripts/generate_data.py" - -If it fails, edit the file and re-run with the same `cluster_id` and `context_id`. - -### Validate Generated Data - -After successful execution, use `get_volume_folder_details` tool to verify the generated data: -- `volume_path`: "my_catalog/my_schema/raw_data/customers" -- `format`: "parquet" -- `table_stat_level`: "SIMPLE" - -This returns schema, row counts, and column statistics to confirm the data was written correctly. - -## Best Practices - -1. **Ask for schema**: Default to `ai_dev_kit` catalog, ask user for schema name -2. **Create infrastructure**: Use `CREATE CATALOG/SCHEMA/VOLUME IF NOT EXISTS` -3. **Raw data only**: No `total_x`, `sum_x`, `avg_x` fields - SDP pipeline computes those -4. **Save to Volume, not tables**: Write parquet to `/Volumes/{catalog}/{schema}/raw_data/` -5. **Configuration at top**: All sizes, dates, and paths as variables -6. **Dynamic dates**: Use `datetime.now() - timedelta(days=180)` for last 6 months -7. **Pandas for generation**: Faster and easier than Spark for row-by-row logic -8. **Master tables first**: Generate customers, then orders reference customer_ids -9. **Weighted sampling**: Enterprise customers generate more activity -10. **Distributions**: Log-normal for values, exponential for times, weighted categorical -11. **Time patterns**: Weekday/weekend, holidays, seasonality, event spikes -12. **Row coherence**: Priority affects resolution time affects CSAT -13. **Volume for aggregation**: 10K-50K rows minimum so patterns survive GROUP BY -14. **Always use files**: Write to local file, execute, edit if error, re-execute -15. **Context reuse**: Pass `cluster_id` and `context_id` for faster iterations -16. **Libraries**: Install `faker` and `holidays` first; most others are pre-installed - -## Related Skills - -- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - for building bronze/silver/gold pipelines on top of generated data -- **[databricks-aibi-dashboards](../databricks-aibi-dashboards/SKILL.md)** - for visualizing the generated data in dashboards -- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - for managing catalogs, schemas, and volumes where data is stored diff --git a/.claude/skills/databricks-unstructured-pdf-generation/SKILL.md b/.claude/skills/databricks-unstructured-pdf-generation/SKILL.md deleted file mode 100644 index 7666f21..0000000 --- a/.claude/skills/databricks-unstructured-pdf-generation/SKILL.md +++ /dev/null @@ -1,194 +0,0 @@ ---- -name: databricks-unstructured-pdf-generation -description: "Generate synthetic PDF documents for RAG and unstructured data use cases. Use when creating test PDFs, demo documents, or evaluation datasets for retrieval systems." 
---
-
-# Unstructured PDF Generation
-
-Generate realistic synthetic PDF documents using an LLM for RAG (Retrieval-Augmented Generation) and unstructured data use cases.
-
-## Overview
-
-This skill uses the `generate_pdf_documents` MCP tool to create professional PDF documents with:
-- LLM-generated content based on your description
-- Accompanying JSON files with questions and evaluation guidelines (for RAG testing)
-- Automatic upload to Unity Catalog Volumes
-
-## Quick Start
-
-Use the `generate_pdf_documents` MCP tool:
-- `catalog`: "my_catalog"
-- `schema`: "my_schema"
-- `description`: "Technical documentation for a cloud infrastructure platform including setup guides, troubleshooting procedures, and API references."
-- `count`: 10
-
-This generates 10 PDF documents and saves them to `/Volumes/my_catalog/my_schema/raw_data/pdf_documents/` (using default volume and folder).
-
-### With Custom Location
-
-Use the `generate_pdf_documents` MCP tool:
-- `catalog`: "my_catalog"
-- `schema`: "my_schema"
-- `description`: "HR policy documents..."
-- `count`: 10
-- `volume`: "custom_volume"
-- `folder`: "hr_policies"
-- `overwrite_folder`: true
-
-## Parameters
-
-| Parameter | Type | Required | Default | Description |
-|-----------|------|----------|---------|-------------|
-| `catalog` | string | Yes | - | Unity Catalog name |
-| `schema` | string | Yes | - | Schema name |
-| `description` | string | Yes | - | Detailed description of what PDFs should contain |
-| `count` | int | Yes | - | Number of PDFs to generate |
-| `volume` | string | No | `raw_data` | Volume name (created if not exists) |
-| `folder` | string | No | `pdf_documents` | Folder within volume for output files |
-| `doc_size` | string | No | `MEDIUM` | Document size: `SMALL` (~1 page), `MEDIUM` (~5 pages), `LARGE` (~10+ pages) |
-| `overwrite_folder` | bool | No | `false` | If true, deletes existing folder contents first |
-
-### Document Size Guide
-
-- **SMALL**: ~1 page, concise content. Best for quick demos or testing.
-- **MEDIUM**: ~4-6 pages, comprehensive coverage. Good balance for most use cases.
-- **LARGE**: ~10+ pages, exhaustive documentation. Use for thorough RAG evaluation.
-
-## Output Files
-
-For each document, the tool creates two files:
-
-1. **PDF file** (`.pdf`): The generated document
-2. **JSON file** (`.json`): Metadata for RAG evaluation
-
-### JSON Structure
-
-```json
-{
-  "title": "API Authentication Guide",
-  "category": "Technical",
-  "pdf_path": "/Volumes/catalog/schema/volume/folder/doc_001.pdf",
-  "question": "What authentication methods are supported by the API?",
-  "guideline": "Answer should mention OAuth 2.0, API keys, and JWT tokens with their use cases."
-}
-```
-
-## Common Patterns
-
-### Pattern 1: HR Policy Documents
-
-Use the `generate_pdf_documents` MCP tool:
-- `catalog`: "ai_dev_kit"
-- `schema`: "hr_demo"
-- `description`: "HR policy documents for a technology company including employee handbook, leave policies, performance review procedures, benefits guide, and workplace conduct guidelines."
-- `count`: 15
-- `folder`: "hr_policies"
-- `overwrite_folder`: true
-
-### Pattern 2: Technical Documentation
-
-Use the `generate_pdf_documents` MCP tool:
-- `catalog`: "ai_dev_kit"
-- `schema`: "tech_docs"
-- `description`: "Technical documentation for a SaaS analytics platform including installation guides, API references, troubleshooting procedures, security best practices, and integration tutorials."
-- `count`: 20 -- `folder`: "product_docs" -- `overwrite_folder`: true - -### Pattern 3: Financial Reports - -Use the `generate_pdf_documents` MCP tool: -- `catalog`: "ai_dev_kit" -- `schema`: "finance_demo" -- `description`: "Financial documents for a retail company including quarterly reports, expense policies, budget guidelines, and audit procedures." -- `count`: 12 -- `folder`: "reports" -- `overwrite_folder`: true - -### Pattern 4: Training Materials - -Use the `generate_pdf_documents` MCP tool: -- `catalog`: "ai_dev_kit" -- `schema`: "training" -- `description`: "Training materials for new software developers including onboarding guides, coding standards, code review procedures, and deployment workflows." -- `count`: 8 -- `folder`: "courses" -- `overwrite_folder`: true - -## Workflow - -1. **Ask for destination**: Default to `ai_dev_kit` catalog, ask user for schema name -2. **Get description**: Ask what kind of documents they need -3. **Generate PDFs**: Call `generate_pdf_documents` MCP tool with appropriate parameters -4. **Verify output**: Check the volume path for generated files - -## Best Practices - -1. **Detailed descriptions**: The more specific your description, the better the generated content - - BAD: "Generate some HR documents" - - GOOD: "HR policy documents for a technology company including employee handbook covering remote work policies, leave policies with PTO and sick leave details, performance review procedures with quarterly and annual cycles, and workplace conduct guidelines" - -2. **Appropriate count**: - - For demos: 5-10 documents - - For RAG testing: 15-30 documents - - For comprehensive evaluation: 50+ documents - -3. **Folder organization**: Use descriptive folder names that indicate content type - - `hr_policies/` - - `technical_docs/` - - `training_materials/` - -4. **Use overwrite_folder**: Set to `true` when regenerating to ensure clean state - -## Integration with RAG Pipelines - -The generated JSON files are designed for RAG evaluation: - -1. **Ingest PDFs**: Use the PDF files as source documents for your vector database -2. **Test retrieval**: Use the `question` field to query your RAG system -3. 
**Evaluate answers**: Use the `guideline` field to assess if the RAG response is correct - -Example evaluation workflow: -```python -# Load questions from JSON files -questions = load_json_files(f"/Volumes/{catalog}/{schema}/{volume}/{folder}/*.json") - -for q in questions: - # Query RAG system - response = rag_system.query(q["question"]) - - # Evaluate using guideline - is_correct = evaluate_response(response, q["guideline"]) -``` - -## Environment Configuration - -The tool requires LLM configuration via environment variables: - -```bash -# Databricks Foundation Models (default) -LLM_PROVIDER=DATABRICKS -DATABRICKS_MODEL=databricks-meta-llama-3-3-70b-instruct - -# Or Azure OpenAI -LLM_PROVIDER=AZURE -AZURE_OPENAI_ENDPOINT=https://your-resource.cognitiveservices.azure.com/ -AZURE_OPENAI_API_KEY=your-api-key -AZURE_OPENAI_DEPLOYMENT=gpt-4o -``` - -## Common Issues - -| Issue | Solution | -|-------|----------| -| **"No LLM endpoint configured"** | Set `DATABRICKS_MODEL` or `AZURE_OPENAI_DEPLOYMENT` environment variable | -| **"Volume does not exist"** | The tool creates volumes automatically; ensure you have CREATE VOLUME permission | -| **"PDF generation timeout"** | Reduce `count` or check LLM endpoint availability | -| **Low quality content** | Provide more detailed `description` with specific topics and document types | - -## Related Skills - -- **[databricks-agent-bricks](../databricks-agent-bricks/SKILL.md)** - Create Knowledge Assistants that ingest the generated PDFs -- **[databricks-vector-search](../databricks-vector-search/SKILL.md)** - Index generated documents for semantic search and RAG -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - Generate structured tabular data (complement to unstructured PDFs) -- **[databricks-mlflow-evaluation](../databricks-mlflow-evaluation/SKILL.md)** - Evaluate RAG systems using the generated question/guideline pairs diff --git a/.claude/skills/spark-python-data-source/SKILL.md b/.claude/skills/spark-python-data-source/SKILL.md deleted file mode 100644 index 898b9d2..0000000 --- a/.claude/skills/spark-python-data-source/SKILL.md +++ /dev/null @@ -1,311 +0,0 @@ ---- -name: spark-python-data-source -description: Use when building custom Spark data source connectors for external systems (databases, APIs, message queues), implementing batch/streaming readers/writers, or creating data source plugins for systems without native Spark support. Triggers - "build Spark data source", "create Spark connector", "implement Spark reader/writer", "connect Spark to [system]", "streaming data source" ---- - -# spark-python-data-source - -Build custom Python data sources for Apache Spark 4.0+ to read from and write to external systems in batch and streaming modes. - -## When to use - -Use when building Spark connectors for external systems that lack native support: -- External databases, APIs, message queues -- Custom file formats or protocols -- Real-time streaming data sources -- Systems requiring specialized authentication or protocols - -Triggers: "build Spark data source", "create Spark connector", "implement Spark reader/writer", "connect Spark to [system]", "streaming data source" - -## Instructions - -You are an experienced Spark developer building custom Python data sources following the PySpark DataSource API. Follow these principles and patterns: - -### Core Architecture - -Each data source follows a flat, single-level inheritance structure: - -1. 
**DataSource class** - Entry point returning readers/writers -2. **Base Reader/Writer classes** - Shared logic for options and data processing -3. **Batch classes** - Inherit from base + `DataSourceReader`/`DataSourceWriter` -4. **Stream classes** - Inherit from base + `DataSourceStreamReader`/`DataSourceStreamWriter` - -### Critical Design Principles - -**SIMPLE over CLEVER** - These are non-negotiable: - -✅ REQUIRED: -- Flat single-level inheritance only -- Direct implementations, no abstractions -- Explicit imports, explicit control flow -- Standard library first, minimal dependencies -- Simple classes with single responsibilities - -❌ FORBIDDEN: -- Abstract base classes or complex inheritance -- Factory patterns or dependency injection -- Decorators for cross-cutting concerns -- Complex configuration classes -- Async/await (unless absolutely necessary) -- Connection pooling or caching (unless critical) -- Generic "framework" code -- Premature optimization - -### Implementation Pattern - -```python -from pyspark.sql.datasource import ( - DataSource, DataSourceReader, DataSourceWriter, - DataSourceStreamReader, DataSourceStreamWriter -) - -# 1. DataSource class -class YourDataSource(DataSource): - @classmethod - def name(cls): - return "your-format" - - def __init__(self, options): - self.options = options - - def schema(self): - return self._infer_or_return_schema() - - def reader(self, schema): - return YourBatchReader(self.options, schema) - - def streamReader(self, schema): - return YourStreamReader(self.options, schema) - - def writer(self, schema, overwrite): - return YourBatchWriter(self.options, schema) - - def streamWriter(self, schema, overwrite): - return YourStreamWriter(self.options, schema) - -# 2. Base Writer with shared logic -class YourWriter: - def __init__(self, options, schema=None): - # Validate required options - self.url = options.get("url") - assert self.url, "url is required" - self.batch_size = int(options.get("batch_size", "50")) - self.schema = schema - - def write(self, iterator): - # Import libraries here for partition execution - import requests - from pyspark import TaskContext - - context = TaskContext.get() - partition_id = context.partitionId() - - msgs = [] - cnt = 0 - - for row in iterator: - cnt += 1 - msgs.append(row.asDict()) - - if len(msgs) >= self.batch_size: - self._send_batch(msgs) - msgs = [] - - if msgs: - self._send_batch(msgs) - - return SimpleCommitMessage(partition_id=partition_id, count=cnt) - - def _send_batch(self, msgs): - # Implement send logic - pass - -# 3. Batch Writer -class YourBatchWriter(YourWriter, DataSourceWriter): - pass - -# 4. Stream Writer -class YourStreamWriter(YourWriter, DataSourceStreamWriter): - def commit(self, messages, batchId): - pass - - def abort(self, messages, batchId): - pass - -# 5. Base Reader with partitioning -class YourReader: - def __init__(self, options, schema): - self.url = options.get("url") - assert self.url, "url is required" - self.schema = schema - - def partitions(self): - # Return list of partitions for parallel reading - return [YourPartition(0, start, end)] - - def read(self, partition): - # Import here for executor execution - import requests - - response = requests.get(f"{self.url}?start={partition.start}") - for item in response.json(): - yield tuple(item.values()) - -# 6. Batch Reader -class YourBatchReader(YourReader, DataSourceReader): - pass - -# 7. 
Stream Reader -class YourStreamReader(YourReader, DataSourceStreamReader): - def initialOffset(self): - return {"offset": "0"} - - def latestOffset(self): - return {"offset": str(self._get_latest())} - - def partitions(self, start, end): - return [YourPartition(0, start["offset"], end["offset"])] - - def commit(self, end): - pass -``` - -### Project Setup - -```bash -# Create project -poetry new your-datasource -cd your-datasource -poetry add pyspark pytest pytest-spark - -# Development commands - CRITICAL: Always use 'poetry run' -poetry run pytest # Run tests -poetry run ruff check src/ # Lint -poetry run ruff format src/ # Format -poetry build # Build wheel -``` - -### Registration and Usage - -```python -# Register -from your_package import YourDataSource -spark.dataSource.register(YourDataSource) - -# Batch read -df = spark.read.format("your-format").option("url", "...").load() - -# Batch write -df.write.format("your-format").option("url", "...").save() - -# Streaming read -df = spark.readStream.format("your-format").option("url", "...").load() - -# Streaming write -df.writeStream.format("your-format").option("url", "...").start() -``` - -### Key Implementation Decisions - -**Partitioning Strategy**: Choose based on data source characteristics -- Time-based: For APIs with temporal data (see [partitioning-patterns.md](references/partitioning-patterns.md)) -- Token-range: For distributed databases (see [partitioning-patterns.md](references/partitioning-patterns.md)) -- ID-range: For paginated APIs - -**Authentication**: Support multiple methods in priority order -- Databricks Unity Catalog credentials -- Cloud default credentials (managed identity) -- Explicit credentials (service principal, API key, username/password) -- See [authentication-patterns.md](references/authentication-patterns.md) - -**Type Conversion**: Map between Spark and external types -- Handle nulls, timestamps, UUIDs, collections -- See [type-conversion.md](references/type-conversion.md) - -**Streaming Offsets**: Design for exactly-once semantics -- JSON-serializable offset class -- Non-overlapping partition boundaries -- See [streaming-patterns.md](references/streaming-patterns.md) - -**Error Handling**: Implement retries and resilience -- Exponential backoff for retryable errors -- Circuit breakers for cascading failures -- See [error-handling.md](references/error-handling.md) - -### Testing Approach - -```python -import pytest -from unittest.mock import patch, Mock - -@pytest.fixture -def spark(): - from pyspark.sql import SparkSession - return SparkSession.builder.master("local[2]").getOrCreate() - -def test_data_source_name(): - assert YourDataSource.name() == "your-format" - -def test_writer_sends_data(spark): - with patch('requests.post') as mock_post: - mock_post.return_value = Mock(status_code=200) - - df = spark.createDataFrame([(1, "test")], ["id", "value"]) - df.write.format("your-format").option("url", "http://api").save() - - assert mock_post.called -``` - -### Code Review Checklist - -Before implementing, ask: -1. Is this the simplest way to solve this problem? -2. Would a new developer understand this immediately? -3. Am I adding abstraction for real needs vs hypothetical flexibility? -4. Can I solve this with standard library? -5. Does this follow the established flat pattern? 
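
The writer and reader sketches above return a `SimpleCommitMessage` and build `YourPartition` objects without ever defining them. A minimal sketch of those two missing pieces, following the `WriterCommitMessage`/`InputPartition` base classes from `pyspark.sql.datasource` (the field names mirror this snippet, not a library API):

```python
from dataclasses import dataclass

from pyspark.sql.datasource import InputPartition, WriterCommitMessage


@dataclass
class SimpleCommitMessage(WriterCommitMessage):
    # Returned by write() on each partition; Spark hands the collected
    # messages to commit()/abort() on the driver.
    partition_id: int
    count: int


class YourPartition(InputPartition):
    # Must be picklable; it is serialized and shipped to executors.
    def __init__(self, index, start, end):
        self.index = index
        self.start = start
        self.end = end
```
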
- -### Common Mistakes to Avoid - -- Creating abstract base classes for "reusability" -- Adding configuration frameworks or dependency injection -- Premature optimization before measuring performance -- Complex error handling hierarchies -- Importing heavy libraries at module level (import in methods) -- Using `python` command directly (always use `poetry run`) - -### Reference Implementations - -Study these for real-world patterns: -- [cyber-spark-data-connectors](https://github.com/alexott/cyber-spark-data-connectors) - Sentinel, Splunk, REST -- [spark-cassandra-data-source](https://github.com/alexott/spark-cassandra-data-source) - Token-range partitioning -- [pyspark-hubspot](https://github.com/dgomez04/pyspark-hubspot) - REST API pagination -- [pyspark-mqtt](https://github.com/databricks-industry-solutions/python-data-sources/tree/main/mqtt) - Streaming with TLS - -## Usage - -``` -Create a Spark data source for reading from MongoDB with sharding support -Build a streaming connector for RabbitMQ with at-least-once delivery -Implement a batch writer for Snowflake with staged uploads -Write a data source for REST API with OAuth2 authentication and pagination -``` - -## Related - -- databricks-testing: Test data sources on Databricks clusters -- databricks-spark-declarative-pipelines: Use custom sources in DLT pipelines -- python-dev: Python development best practices - -## References - -- [partitioning-patterns.md](references/partitioning-patterns.md) - Parallel reading strategies -- [authentication-patterns.md](references/authentication-patterns.md) - Multi-method auth implementations -- [type-conversion.md](references/type-conversion.md) - Bidirectional type mapping -- [streaming-patterns.md](references/streaming-patterns.md) - Offset management and watermarking -- [error-handling.md](references/error-handling.md) - Retries, circuit breakers, resilience -- [testing-patterns.md](references/testing-patterns.md) - Unit and integration testing -- [production-patterns.md](references/production-patterns.md) - Observability, security, validation -- [Official Databricks Documentation](https://docs.databricks.com/aws/en/pyspark/datasources) -- [Apache Spark Python DataSource Tutorial](https://spark.apache.org/docs/latest/api/python/tutorial/sql/python_data_source.html) -- [awesome-python-datasources](https://github.com/allisonwang-db/awesome-python-datasources) - directory of available implementations. 
diff --git a/.config/opencode/opencode.json b/.config/opencode/opencode.json new file mode 100644 index 0000000..b9121ab --- /dev/null +++ b/.config/opencode/opencode.json @@ -0,0 +1,64 @@ +{ + "$schema": "https://opencode.ai/config.json", + "provider": { + "databricks": { + "npm": "@ai-sdk/openai-compatible", + "name": "Databricks Model Serving (via content-filter proxy)", + "options": { + "baseURL": "http://127.0.0.1:4000", + "apiKey": "{env:DATABRICKS_TOKEN}" + }, + "models": { + "databricks-claude-opus-4-7": { + "name": "Claude Opus 4.7 (Databricks)", + "limit": { + "context": 200000, + "output": 16384 + } + }, + "databricks-claude-sonnet-4-6": { + "name": "Claude Sonnet 4.6 (Databricks)", + "limit": { + "context": 200000, + "output": 8192 + } + }, + "databricks-gemini-2-5-flash": { + "name": "Gemini 2.5 Flash (Databricks)", + "limit": { + "context": 1000000, + "output": 8192 + } + }, + "databricks-gemini-2-5-pro": { + "name": "Gemini 2.5 Pro (Databricks)", + "limit": { + "context": 1000000, + "output": 8192 + } + }, + "databricks-gemini-3-1-pro": { + "name": "Gemini 3.1 Pro (Databricks)", + "limit": { + "context": 1000000, + "output": 8192 + } + } + } + } + }, + "mcp": { + "deepwiki": { + "type": "remote", + "url": "https://mcp.deepwiki.com/mcp", + "enabled": true, + "oauth": false + }, + "exa": { + "type": "remote", + "url": "https://mcp.exa.ai/mcp", + "enabled": true + } + }, + "model": "databricks/databricks-claude-opus-4-7" +} \ No newline at end of file diff --git a/.databricksignore b/.databricksignore new file mode 100644 index 0000000..5274c8d --- /dev/null +++ b/.databricksignore @@ -0,0 +1,6 @@ +.venv/ +vendor/ +node_modules/ +__pycache__/ +*.pyc +.git/ diff --git a/.syncignore b/.syncignore new file mode 100644 index 0000000..4cad30a --- /dev/null +++ b/.syncignore @@ -0,0 +1,32 @@ +# Large directories that should never be uploaded to workspace +vendor/ +.venv/ +venv/ + +# Build artifacts and caches +__pycache__/ +*.pyc +*.pyo +.ruff_cache/ +uv.lock + +# Local config and secrets +.env +.databricks/ +.claude/ +.git/ + +# Test artifacts +evidence/ +.relentless_logs/ +_gates.json +_validation_report.json + +# Git worktrees +.worktrees/ + +# Uploads +uploads/ + +# Node modules (if any) +node_modules/ diff --git a/CLAUDE.md b/CLAUDE.md index fd2674d..a880803 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -92,4 +92,6 @@ Real-time terminal I/O over **WebSocket** (Flask-SocketIO) with automatic **HTTP - Development workflow skills from [obra/superpowers](https://github.com/obra/superpowers) # things to remember -Remember to never move .git folder to the workspace if you're running workspace import. \ No newline at end of file +Remember to never move .git folder to the workspace if you're running workspace import. + +Fork-specific runtime directives (uv, library version floors, Unity Catalog name, terminal-editor pointer) are injected into `~/.claude/CLAUDE.md` by `setup_claude.py` at app startup. See the `coda-fork-directives` block there — that's the source of truth for per-fork conventions, and it travels with every CODA spawned from this repo. \ No newline at end of file diff --git a/README.md b/README.md index 50ef022..09447c7 100644 --- a/README.md +++ b/README.md @@ -282,7 +282,7 @@ This template repo opens that vision up for every Databricks user — no IDE set |----------|----------|-------------| | `DATABRICKS_TOKEN` | No | Optional. If not set, the app prompts for a token on first session. 
Auto-rotated every 10 minutes | | `HOME` | Yes | Set to `/app/python/source_code` in app.yaml | -| `ANTHROPIC_MODEL` | No | Claude model name (default: `databricks-claude-opus-4-6`) | +| `ANTHROPIC_MODEL` | No | Claude model name (default: `databricks-claude-opus-4-7`) | | `CODEX_MODEL` | No | Codex model name (default: `databricks-gpt-5-3-codex`) | | `GEMINI_MODEL` | No | Gemini model name (default: `databricks-gemini-3-1-pro`) | | `DATABRICKS_GATEWAY_HOST` | No | AI Gateway URL override. Auto-discovered from `DATABRICKS_WORKSPACE_ID` if unset | diff --git a/app.py b/app.py index 526adce..2702add 100644 --- a/app.py +++ b/app.py @@ -103,6 +103,7 @@ def handle_sigterm(signum, frame): "steps": [ {"id": "git", "label": "Configuring git identity", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "micro", "label": "Installing micro editor", "status": "pending", "started_at": None, "completed_at": None, "error": None}, + {"id": "editors", "label": "Detecting available editors", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "gh", "label": "Installing GitHub CLI", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "dbcli", "label": "Upgrading Databricks CLI", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "proxy", "label": "Starting content-filter proxy", "status": "pending", "started_at": None, "completed_at": None, "error": None}, @@ -112,6 +113,7 @@ def handle_sigterm(signum, frame): {"id": "gemini", "label": "Configuring Gemini CLI", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "databricks", "label": "Setting up Databricks CLI", "status": "pending", "started_at": None, "completed_at": None, "error": None}, {"id": "mlflow", "label": "Enabling MLflow tracing", "status": "pending", "started_at": None, "completed_at": None, "error": None}, + {"id": "projects", "label": "Setting up workshop projects", "status": "pending", "started_at": None, "completed_at": None, "error": None}, ] } @@ -197,6 +199,18 @@ def _setup_git_config(): f.write("\n".join(lines) + "\n") logger.info(f"Git config written to {gitconfig_path}") + # Configure gh as the git credential helper (if gh is available). + # NOTE: gh must already be authenticated (via `gh auth login` or GH_TOKEN env var) + # for the credential helper to work. Without auth, git operations to GitHub will fail. + try: + subprocess.run( + ["gh", "auth", "setup-git"], + capture_output=True, timeout=10, + ) + logger.info("gh auth setup-git configured") + except (FileNotFoundError, subprocess.TimeoutExpired): + logger.debug("gh not available, skipping credential helper setup") + # Write post-commit hook for workspace sync (works from any CLI: Claude, Gemini, OpenCode, etc.) 
# Only syncs repos inside ~/projects/ — skips the app source and any other repos post_commit = os.path.join(hooks_dir, "post-commit") @@ -237,10 +251,83 @@ def _setup_git_config(): os.chmod(post_commit, 0o755) logger.info(f"Post-commit hook written to {post_commit}") + # Write `wsync` command to ~/.local/bin for manual workspace sync + local_bin = os.path.join(home, ".local", "bin") + os.makedirs(local_bin, exist_ok=True) + wsync_path = os.path.join(local_bin, "wsync") + with open(wsync_path, "w") as f: + f.write('#!/bin/bash\n') + f.write('# Manual sync to Databricks Workspace\n') + f.write('REPO_ROOT="$(git rev-parse --show-toplevel 2>/dev/null)"\n') + f.write('if [ -z "$REPO_ROOT" ]; then\n') + f.write(' echo "Error: not inside a git repo"\n') + f.write(' exit 1\n') + f.write('fi\n') + f.write('APP_DIR="/app/python/source_code"\n') + f.write('SYNC_SCRIPT="$APP_DIR/sync_to_workspace.py"\n') + f.write('if [ ! -f "$SYNC_SCRIPT" ]; then\n') + f.write(' echo "Error: sync script not found"\n') + f.write(' exit 1\n') + f.write('fi\n') + f.write('echo "Syncing $REPO_ROOT to Databricks Workspace..."\n') + f.write('uv run --project "$APP_DIR" python "$SYNC_SCRIPT" "$REPO_ROOT"\n') + os.chmod(wsync_path, 0o755) + logger.info(f"wsync command written to {wsync_path}") + # Reinit app source git to remove template origin (Databricks Apps only) _reinit_app_git() +def _setup_embedded_projects(): + """Copy embedded project templates from app source into ~/projects/ and git-init them. + + Projects are bundled under /projects// at deploy time. + Each is copied to ~/projects// (if not already present) and initialized + as a standalone git repo so commits trigger workspace sync via post-commit hook. + """ + import shutil + + app_dir = os.path.dirname(os.path.abspath(__file__)) + embedded_dir = os.path.join(app_dir, "projects") + if not os.path.isdir(embedded_dir): + return + + home = os.environ.get("HOME", "/app/python/source_code") + if not home or home == "/": + home = "/app/python/source_code" + projects_dir = os.path.join(home, "projects") + os.makedirs(projects_dir, exist_ok=True) + + for name in os.listdir(embedded_dir): + src = os.path.join(embedded_dir, name) + if not os.path.isdir(src): + continue + dest = os.path.join(projects_dir, name) + if os.path.exists(dest): + logger.info(f"Project already exists, skipping: {dest}") + continue + + shutil.copytree(src, dest) + # Initialize as a git repo so post-commit hooks work + subprocess.run(["git", "init"], cwd=dest, capture_output=True) + subprocess.run(["git", "add", "."], cwd=dest, capture_output=True) + subprocess.run( + ["git", "commit", "-m", "Initial workshop project"], + cwd=dest, capture_output=True, + ) + logger.info(f"Embedded project initialized: {dest}") + + +def _run_projects_step(): + """Run embedded project setup as a tracked setup step.""" + _update_step("projects", status="running", started_at=time.time()) + try: + _setup_embedded_projects() + _update_step("projects", status="complete", completed_at=time.time()) + except Exception as e: + _update_step("projects", status="error", completed_at=time.time(), error=str(e)) + + def _reinit_app_git(): """On Databricks Apps, reinit git to remove template origin remote.""" app_dir = os.path.dirname(os.path.abspath(__file__)) @@ -291,11 +378,25 @@ def _configure_all_cli_auth(token): anthropic_base_url = f"{databricks_host}/serving-endpoints/anthropic" settings = { + "theme": "dark", + "permissions": { + "defaultMode": "auto", + "allow": [ + "Bash(databricks *)", + "Bash(uv *)", + "Bash(git *)", + 
"Bash(make *)", + "Bash(python *)", + "Bash(pytest *)", + "Bash(ruff *)", + "Bash(wsync)", + ], + }, "env": { - "ANTHROPIC_MODEL": os.environ.get("ANTHROPIC_MODEL", "databricks-claude-opus-4-6"), + "ANTHROPIC_MODEL": os.environ.get("ANTHROPIC_MODEL", "databricks-claude-opus-4-7"), "ANTHROPIC_BASE_URL": anthropic_base_url, "ANTHROPIC_AUTH_TOKEN": token, - "ANTHROPIC_DEFAULT_OPUS_MODEL": "databricks-claude-opus-4-6", + "ANTHROPIC_DEFAULT_OPUS_MODEL": "databricks-claude-opus-4-7", "ANTHROPIC_DEFAULT_SONNET_MODEL": "databricks-claude-sonnet-4-6", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "databricks-claude-haiku-4-5", "ANTHROPIC_CUSTOM_HEADERS": "x-databricks-use-coding-agent-mode: true", @@ -352,6 +453,17 @@ def run_setup(): _run_step("micro", ["bash", "-c", "mkdir -p ~/.local/bin && bash install_micro.sh && mv micro ~/.local/bin/ 2>/dev/null || true"]) + # Probe which terminal editors are actually available in this container. + # Writes a human-readable report to ~/.local/share/coda/editors.txt so + # users (and Claude) can discover what to reach for from the terminal. + _run_step("editors", ["bash", "-c", + "mkdir -p ~/.local/share/coda && " + "{ echo 'Available terminal editors (detected at app startup):'; " + " for ed in micro nano vim vi emacs ed pico joe mcedit; do " + " p=$(command -v \"$ed\" 2>/dev/null) && echo \" $ed -> $p\"; " + " done; } > ~/.local/share/coda/editors.txt && " + "cat ~/.local/share/coda/editors.txt"]) + _run_step("gh", ["bash", "install_gh.sh"]) # --- Upgrade Databricks CLI (runtime image ships an older version) --- @@ -372,11 +484,13 @@ def run_setup(): ("mlflow", ["uv", "run", "python", "setup_mlflow.py"]), ] - with ThreadPoolExecutor(max_workers=len(parallel_steps)) as executor: + with ThreadPoolExecutor(max_workers=len(parallel_steps) + 1) as executor: futures = [ executor.submit(_run_step, step_id, command) for step_id, command in parallel_steps ] + # Embedded projects (copy + git init) — runs in parallel with agent setup + futures.append(executor.submit(_run_projects_step)) wait(futures) with setup_lock: @@ -386,24 +500,52 @@ def run_setup(): def get_token_owner(): - """Get the owner email. Priority: Apps API (app.creator) > PAT (current_user.me). + """Get the owner email. - Uses the auto-provisioned SP to call the Apps API — no PAT needed for - owner resolution. Falls back to PAT-based lookup for backward compat. + Priority: APP_OWNER_EMAIL env var > app description > app.creator > PAT. + The spawner sets owner:{email} in the app description when creating apps on + behalf of users, so the child app knows its owner without requiring a PAT. + + The Apps API call retries with backoff because the app's auto-provisioned SP + credentials may not be ready for OAuth token exchange immediately at boot. """ from databricks.sdk import WorkspaceClient - # 1. Try Apps API via SP credentials (no PAT needed) + # 0. Explicit owner from deployer (env var) + explicit_owner = os.environ.get("APP_OWNER_EMAIL", "").strip().lower() + if explicit_owner: + logger.info(f"Owner resolved from APP_OWNER_EMAIL: {explicit_owner}") + return explicit_owner + + # 1. 
Try Apps API via SP credentials (no PAT needed) — retry for SP propagation app_name = os.environ.get("DATABRICKS_APP_NAME") if app_name: - try: - w = WorkspaceClient() # auto-detects SP credentials - app = w.apps.get(name=app_name) - owner = (app.creator or "").lower() - logger.info(f"Owner resolved from app.creator: {owner}") - return owner - except Exception as e: - logger.warning(f"Could not resolve owner via Apps API: {e}") + max_retries = 6 + base_delay = 5.0 + for attempt in range(max_retries): + try: + w = WorkspaceClient() # auto-detects SP credentials + app_info = w.apps.get(name=app_name) + + # Spawner sets owner in description as "owner:{email}" + desc = getattr(app_info, "description", "") or "" + if desc.startswith("owner:"): + owner = desc.split(":", 1)[1].strip().lower() + logger.info(f"Owner resolved from app description: {owner}") + return owner + + owner = (app_info.creator or "").lower() + logger.info(f"Owner resolved from app.creator: {owner}") + return owner + except Exception as e: + delay = min(base_delay * (2**attempt), 60) + logger.warning( + f"Apps API call failed (attempt {attempt + 1}/{max_retries}): {e}" + f" — retrying in {delay:.0f}s" + ) + if attempt < max_retries - 1: + time.sleep(delay) + logger.error(f"Could not resolve owner via Apps API after {max_retries} attempts") # 2. Fallback: PAT-based resolution try: diff --git a/app.yaml b/app.yaml index 5596e08..f90619f 100644 --- a/app.yaml +++ b/app.yaml @@ -5,7 +5,7 @@ env: - name: HOME value: /app/python/source_code - name: ANTHROPIC_MODEL - value: databricks-claude-opus-4-6 + value: databricks-claude-opus-4-7 - name: GEMINI_MODEL value: databricks-gemini-3-1-pro - name: CODEX_MODEL @@ -14,3 +14,5 @@ env: value: 0 - name: MAX_CONCURRENT_SESSIONS value: "5" + - name: MLFLOW_CLAUDE_TRACING_ENABLED + value: "true" diff --git a/app.yaml.template b/app.yaml.template index c29f3a6..3786be8 100644 --- a/app.yaml.template +++ b/app.yaml.template @@ -7,7 +7,7 @@ env: - name: DATABRICKS_TOKEN valueFrom: DATABRICKS_TOKEN - name: ANTHROPIC_MODEL - value: databricks-claude-opus-4-6 + value: databricks-claude-opus-4-7 - name: GEMINI_MODEL value: databricks-gemini-3-1-pro #OPTIONAL: Use the new Databricks AI Gateway if you have access (recommended), otherwise it will default to the older endpoint diff --git a/claude_brain_sync.py b/claude_brain_sync.py new file mode 100644 index 0000000..2d17c33 --- /dev/null +++ b/claude_brain_sync.py @@ -0,0 +1,177 @@ +#!/usr/bin/env python +"""Sync Claude Code's auto-memory ("brain") to/from Databricks Workspace. + +The "brain" is the set of memory files Claude Code maintains at +`~/.claude/projects/{slug}/memory/`, one slug per working directory. +They accumulate user/project/feedback/reference memories that make +future sessions smarter. + +Ephemeral Databricks App compute means these files vanish when the +app restarts unless we persist them. This script syncs them to the +user's workspace so they survive redeploys and restarts. 
+ +Usage: + python claude_brain_sync.py push # local -> workspace ([DEFAULT]) + python claude_brain_sync.py pull # workspace -> local ([DEFAULT]) + python claude_brain_sync.py push --profile daveok # use a named profile + python claude_brain_sync.py # push (default) +""" +from __future__ import annotations + +import argparse +import configparser +import os +import subprocess +import sys +from pathlib import Path + +try: + from databricks.sdk import WorkspaceClient +except ImportError: + print("databricks-sdk not available, skipping brain sync", file=sys.stderr) + sys.exit(0) + + +CLAUDE_PROJECTS = Path.home() / ".claude" / "projects" +WORKSPACE_SUBPATH = ".coda/claude-brain/projects" + + +def _read_databrickscfg(profile: str = "DEFAULT") -> tuple[str | None, str | None]: + cfg = Path.home() / ".databrickscfg" + if not cfg.exists(): + return None, None + p = configparser.ConfigParser() + p.read(cfg) + if profile not in p and profile != "DEFAULT": + return None, None + return ( + p.get(profile, "host", fallback=None), + p.get(profile, "token", fallback=None), + ) + + +def _workspace_client(profile: str | None) -> WorkspaceClient: + """Build a WorkspaceClient. Named profiles delegate auth to the SDK + so OAuth (`auth_type = databricks-cli`) works for local testing; + the default path reads [DEFAULT] explicitly for the production PAT flow.""" + if profile: + return WorkspaceClient(profile=profile) + host, token = _read_databrickscfg("DEFAULT") + if not host or not token: + raise RuntimeError("~/.databrickscfg [DEFAULT] missing host or token") + return WorkspaceClient(host=host, token=token, auth_type="pat") + + +def _user_email(profile: str | None) -> str: + return _workspace_client(profile).current_user.me().user_name + + +def _sync_env() -> dict[str, str]: + """Env for databricks CLI. Strip OAuth M2M vars so CLI falls through to + the profile config. 
Profile selection is passed via --profile CLI flag.""" + env = os.environ.copy() + for var in ("DATABRICKS_CLIENT_ID", "DATABRICKS_CLIENT_SECRET", + "DATABRICKS_HOST", "DATABRICKS_TOKEN"): + env.pop(var, None) + return env + + +def _profile_args(profile: str | None) -> list[str]: + return ["--profile", profile] if profile else [] + + +def _memory_dirs() -> list[Path]: + """Return memory dirs that actually contain files worth syncing.""" + if not CLAUDE_PROJECTS.exists(): + return [] + dirs = [] + for project_dir in CLAUDE_PROJECTS.iterdir(): + if not project_dir.is_dir(): + continue + memory = project_dir / "memory" + if memory.exists() and any(memory.iterdir()): + dirs.append(memory) + return dirs + + +def push(profile: str | None = None) -> int: + """Push each project's memory dir to workspace.""" + dirs = _memory_dirs() + if not dirs: + print("brain-sync: no memory dirs to push") + return 0 + + try: + email = _user_email(profile) + except Exception as e: + print(f"brain-sync: could not resolve user email: {e}", file=sys.stderr) + return 1 + + env = _sync_env() + profile_flags = _profile_args(profile) + failures = 0 + for memory_dir in dirs: + project_slug = memory_dir.parent.name + remote = f"/Workspace/Users/{email}/{WORKSPACE_SUBPATH}/{project_slug}/memory" + result = subprocess.run( + ["databricks", "sync", str(memory_dir), remote, "--watch=false"] + profile_flags, + capture_output=True, text=True, env=env, + ) + if result.returncode == 0: + print(f"brain-sync push: {project_slug}") + else: + print(f"brain-sync push FAILED for {project_slug}: {result.stderr.strip()}", + file=sys.stderr) + failures += 1 + return 0 if failures == 0 else 1 + + +def pull(profile: str | None = None) -> int: + """Pull brain from workspace into ~/.claude/projects/. + + Uses databricks workspace export-dir because `databricks sync` is + local->remote only. 
+ """ + try: + email = _user_email(profile) + except Exception as e: + print(f"brain-sync: could not resolve user email: {e}", file=sys.stderr) + return 1 + + env = _sync_env() + profile_flags = _profile_args(profile) + remote_root = f"/Workspace/Users/{email}/{WORKSPACE_SUBPATH}" + + check = subprocess.run( + ["databricks", "workspace", "list", remote_root] + profile_flags, + capture_output=True, text=True, env=env, + ) + if check.returncode != 0: + print(f"brain-sync pull: no remote brain yet at {remote_root}") + return 0 + + CLAUDE_PROJECTS.mkdir(parents=True, exist_ok=True) + result = subprocess.run( + ["databricks", "workspace", "export-dir", + remote_root, str(CLAUDE_PROJECTS), "--overwrite"] + profile_flags, + capture_output=True, text=True, env=env, + ) + if result.returncode == 0: + print(f"brain-sync pull: restored from {remote_root}") + return 0 + print(f"brain-sync pull FAILED: {result.stderr.strip()}", file=sys.stderr) + return 1 + + +def main() -> int: + parser = argparse.ArgumentParser(description=__doc__.splitlines()[0]) + parser.add_argument("direction", nargs="?", default="push", choices=["push", "pull"]) + parser.add_argument("--profile", help="databricks CLI profile name (default: [DEFAULT])") + args = parser.parse_args() + if args.direction == "push": + return push(args.profile) + return pull(args.profile) + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/coda-marketplace/.claude-plugin/marketplace.json b/coda-marketplace/.claude-plugin/marketplace.json new file mode 100644 index 0000000..21fab40 --- /dev/null +++ b/coda-marketplace/.claude-plugin/marketplace.json @@ -0,0 +1,50 @@ +{ + "name": "coda", + "owner": { + "name": "Databricks Field Engineering", + "email": "field-eng@databricks.com" + }, + "metadata": { + "description": "CODA-bundled Claude Code plugins — ship with every CODA deployment", + "version": "0.1.0" + }, + "plugins": [ + { + "name": "coda-essentials", + "source": "./plugins/coda-essentials", + "description": "Subagents, hooks, slash commands, and session lifecycle tooling bundled with every CODA instance. 
Includes the TDD subagent workflow (prd-writer, test-generator, implementer, build-feature), session-start git context loader, memory staleness checker, crystallization nudge, and the /til slash command.", + "version": "0.1.0", + "author": { + "name": "Databricks Field Engineering" + }, + "category": "productivity", + "keywords": [ + "coda", + "databricks", + "workshop", + "tdd", + "memory", + "hooks" + ] + }, + { + "name": "coda-databricks-skills", + "source": "./plugins/coda-databricks-skills", + "description": "Databricks platform skills synced from databricks-solutions/ai-dev-kit: Agent Bricks, AI/BI Dashboards, AI Functions, Databricks App (Python), BDD Testing, Bundles, Config, DBSQL, Docs, Execution Compute, Genie, Iceberg, Jobs, Lakebase (Autoscale + Provisioned), Metric Views, MLflow Evaluation, Model Serving, Python SDK, Spark SDP, Structured Streaming, Synthetic Data Gen, Unity Catalog, Unstructured PDF Generation, Vector Search, Zerobus Ingest, and Spark Python Data Source.", + "version": "0.1.0", + "author": { + "name": "Databricks Field Engineering", + "url": "https://github.com/databricks-solutions/ai-dev-kit" + }, + "category": "platform", + "keywords": [ + "databricks", + "ai-dev-kit", + "spark", + "unity-catalog", + "mlflow", + "lakebase" + ] + } + ] +} diff --git a/coda-marketplace/plugins/coda-databricks-skills/.claude-plugin/plugin.json b/coda-marketplace/plugins/coda-databricks-skills/.claude-plugin/plugin.json new file mode 100644 index 0000000..5424de6 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/.claude-plugin/plugin.json @@ -0,0 +1,18 @@ +{ + "name": "coda-databricks-skills", + "description": "Databricks platform skills (Agent Bricks, AI/BI, AI Functions, App Python, BDD Testing, Bundles, Config, DBSQL, Docs, Execution Compute, Genie, Iceberg, Jobs, Lakebase, MLflow Eval, Model Serving, Metric Views, Python SDK, Spark SDP, Structured Streaming, Synthetic Data Gen, Unity Catalog, Unstructured PDF, Vector Search, Zerobus Ingest, Spark Python Data Source) — synced from databricks-solutions/ai-dev-kit.", + "version": "0.1.0", + "author": { + "name": "Databricks Field Engineering", + "url": "https://github.com/databricks-solutions/ai-dev-kit" + }, + "keywords": [ + "databricks", + "ai-dev-kit", + "spark", + "unity-catalog", + "mlflow", + "lakebase" + ], + "skills": "./skills/" +} diff --git a/.claude/skills/databricks-agent-bricks/1-knowledge-assistants.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/1-knowledge-assistants.md similarity index 100% rename from .claude/skills/databricks-agent-bricks/1-knowledge-assistants.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/1-knowledge-assistants.md diff --git a/.claude/skills/databricks-agent-bricks/2-supervisor-agents.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/2-supervisor-agents.md similarity index 100% rename from .claude/skills/databricks-agent-bricks/2-supervisor-agents.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/2-supervisor-agents.md diff --git a/.claude/skills/databricks-agent-bricks/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/SKILL.md similarity index 94% rename from .claude/skills/databricks-agent-bricks/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/SKILL.md index 4aff7ac..026f204 100644 --- 
a/.claude/skills/databricks-agent-bricks/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-agent-bricks/SKILL.md @@ -28,7 +28,7 @@ Before creating Agent Bricks, ensure you have the required data: ### For Genie Spaces - **See the `databricks-genie` skill** for comprehensive Genie Space guidance - Tables in Unity Catalog with the data to explore -- Generate raw data using the `databricks-synthetic-data-generation` skill +- Generate raw data using the `databricks-synthetic-data-gen` skill - Create tables using the `databricks-spark-declarative-pipelines` skill ### For Supervisor Agents @@ -67,18 +67,19 @@ Actions: **For comprehensive Genie guidance, use the `databricks-genie` skill.** -Basic tools available: - -- `create_or_update_genie` - Create or update a Genie Space -- `get_genie` - Get Genie Space details -- `delete_genie` - Delete a Genie Space +Use `manage_genie` with actions: +- `create_or_update` - Create or update a Genie Space +- `get` - Get Genie Space details +- `list` - List all Genie Spaces +- `delete` - Delete a Genie Space +- `export` / `import` - For migration See `databricks-genie` skill for: - Table inspection workflow - Sample question best practices - Curation (instructions, certified queries) -**IMPORTANT**: There is NO system table for Genie spaces (e.g., `system.ai.genie_spaces` does not exist). To find a Genie space by name, use the `find_genie_by_name` tool. +**IMPORTANT**: There is NO system table for Genie spaces (e.g., `system.ai.genie_spaces` does not exist). Use `manage_genie(action="list")` to find spaces. ### Supervisor Agent Tool @@ -119,7 +120,7 @@ Before creating Agent Bricks, generate the required source data: **For Genie (SQL exploration)**: ``` -1. Use `databricks-synthetic-data-generation` skill to create raw parquet data +1. Use `databricks-synthetic-data-gen` skill to create raw parquet data 2. Use `databricks-spark-declarative-pipelines` skill to create bronze/silver/gold tables ``` @@ -199,7 +200,7 @@ manage_mas( - **[databricks-genie](../databricks-genie/SKILL.md)** - Comprehensive Genie Space creation, curation, and Conversation API guidance - **[databricks-unstructured-pdf-generation](../databricks-unstructured-pdf-generation/SKILL.md)** - Generate synthetic PDFs to feed into Knowledge Assistants -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - Create raw data for Genie Space tables +- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** - Create raw data for Genie Space tables - **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - Build bronze/silver/gold tables consumed by Genie Spaces - **[databricks-model-serving](../databricks-model-serving/SKILL.md)** - Deploy custom agent endpoints used as MAS agents - **[databricks-vector-search](../databricks-vector-search/SKILL.md)** - Build vector indexes for RAG applications paired with KAs diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/1-task-functions.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/1-task-functions.md new file mode 100644 index 0000000..a94159e --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/1-task-functions.md @@ -0,0 +1,385 @@ +# Task-Specific AI Functions — Full Reference + +These functions require no model endpoint selection. They call pre-configured Foundation Model APIs optimized for each task. 
All require DBR 15.1+ (15.4 ML LTS for batch); `ai_parse_document` requires DBR 17.1+.
+
+---
+
+## `ai_analyze_sentiment`
+
+**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_analyze_sentiment
+
+Returns one of: `positive`, `negative`, `neutral`, `mixed`, or `NULL`.
+
+```sql
+SELECT ai_analyze_sentiment(review_text) AS sentiment
+FROM customer_reviews;
+```
+
+```python
+from pyspark.sql.functions import expr
+df = spark.table("customer_reviews")
+df.withColumn("sentiment", expr("ai_analyze_sentiment(review_text)")).display()
+```
+
+---
+
+## `ai_classify`
+
+**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_classify
+
+**Syntax:** `ai_classify(content, labels [, options])`
+- `content`: VARIANT | STRING — raw text, or VARIANT from `ai_parse_document` / `ai_extract`
+- `labels`: STRING — JSON labels definition:
+  - Simple array: `'["urgent", "not_urgent", "spam"]'`
+  - With descriptions: `'{"billing_error": "Payment, invoice, or refund issues", "product_defect": "Any malfunction or bug"}'` (descriptions up to 1000 chars each)
+  - 2–500 labels, each 1–100 characters
+- `options`: optional MAP<STRING, STRING>:
+  - `instructions`: task context to improve accuracy (max 20,000 chars)
+  - `multilabel`: `"true"` to return multiple matching labels (default `"false"`)
+
+Returns VARIANT. Returns `NULL` if content is `NULL`.
+
+```sql
+-- simple labels
+SELECT ticket_text,
+       ai_classify(ticket_text, '["urgent", "not urgent", "spam"]') AS priority
+FROM support_tickets;
+-- {"response": ["urgent"], "error_message": null}
+
+-- labels with descriptions
+SELECT ticket_text,
+       ai_classify(
+         ticket_text,
+         '{"billing_error": "Payment, invoice, or refund issues",
+           "product_defect": "Any malfunction, bug, or breakage",
+           "account_issue": "Login failures, password resets"}',
+         MAP('instructions', 'Customer support tickets for a SaaS product')
+       ) AS category
+FROM support_tickets;
+```
+
+```python
+from pyspark.sql.functions import expr
+df = spark.table("support_tickets")
+df.withColumn(
+    "priority",
+    expr("ai_classify(ticket_text, '[\"urgent\", \"not urgent\", \"spam\"]')")
+).display()
+```
+
+**Tips:**
+- Use label descriptions for ambiguous categories — they significantly improve accuracy
+- `multilabel: "true"` enables multi-label classification without running multiple calls
+- Up to 500 labels supported
+
+---
+
+## `ai_extract`
+
+**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_extract
+
+**Syntax:** `ai_extract(content, schema [, options])`
+- `content`: VARIANT | STRING — raw text, or VARIANT from `ai_parse_document`
+- `schema`: STRING — JSON schema definition:
+  - Simple (field names only): `'["invoice_id", "vendor_name", "total_amount"]'`
+  - Advanced (with types and descriptions):
+    ```json
+    {
+      "invoice_id": {"type": "string"},
+      "total_amount": {"type": "number"},
+      "currency": {"type": "enum", "labels": ["USD", "EUR", "GBP"]},
+      "line_items": {"type": "array", "items": {"type": "object", "properties": {...}}}
+    }
+    ```
+  - Supported types: `string`, `integer`, `number`, `boolean`, `enum`
+  - Max 128 fields, 7 nesting levels, 500 enum values
+- `options`: optional MAP<STRING, STRING>:
+  - `instructions`: task context to improve extraction quality (max 20,000 chars)
+
+Returns VARIANT `{"response": {...}, "error_message": null}`. Returns `NULL` if content is `NULL`.
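+
+One thing the examples below don't show is pulling typed fields back out of that VARIANT. A hedged sketch using the `:` path operator with `::` casts (the table and column names here are illustrative assumptions, not part of this reference):
+
+```python
+from pyspark.sql.functions import expr
+
+# Hypothetical source table with an invoice_text STRING column
+df = spark.table("finance.invoices_raw")
+df = df.withColumn(
+    "extracted",
+    expr("""ai_extract(invoice_text, '["invoice_id", "vendor_name", "total_amount"]')"""),
+)
+# VARIANT path extraction; cast each field to a concrete type
+df.selectExpr(
+    "extracted:response.invoice_id::STRING AS invoice_id",
+    "extracted:response.total_amount::DOUBLE AS total_amount",
+).display()
+```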
+ +```sql +-- simple schema +SELECT ai_extract( + 'Invoice #12345 from Acme Corp for $1,250.00', + '["invoice_id", "vendor_name", "total_amount"]' +) AS extracted; +-- {"response": {"invoice_id": "12345", "vendor_name": "Acme Corp", ...}, "error_message": null} + +-- composable with ai_parse_document +WITH parsed AS ( + SELECT ai_parse_document(content, MAP('version', '2.0')) AS parsed + FROM READ_FILES('/Volumes/finance/invoices/', format => 'binaryFile') +) +SELECT ai_extract( + parsed, + '["invoice_id", "vendor_name", "total_amount"]', + MAP('instructions', 'These are vendor invoices.') +) AS invoice_data +FROM parsed; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("messages") +df = df.withColumn( + "entities", + expr("ai_extract(message, '[\"person\", \"location\", \"date\"]')") +) +df.display() +``` + +--- + +## `ai_fix_grammar` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_fix_grammar + +**Syntax:** `ai_fix_grammar(content)` — Returns corrected STRING. + +Optimized for English. Useful for cleaning user-generated content before downstream processing. + +```sql +SELECT ai_fix_grammar(user_comment) AS corrected FROM user_feedback; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("user_feedback") +df.withColumn("corrected", expr("ai_fix_grammar(user_comment)")).display() +``` + +--- + +## `ai_gen` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_gen + +**Syntax:** `ai_gen(prompt)` — Returns a generated STRING. + +Use for free-form text generation where the output format doesn't need to be structured. For structured JSON output, use `ai_query` with `responseFormat`. + +```sql +SELECT product_name, + ai_gen(CONCAT('Write a one-sentence marketing tagline for: ', product_name)) AS tagline +FROM products; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("products") +df.withColumn( + "tagline", + expr("ai_gen(concat('Write a one-sentence marketing tagline for: ', product_name))") +).display() +``` + +--- + +## `ai_mask` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_mask + +**Syntax:** `ai_mask(content, labels)` +- `content`: STRING — text with sensitive data +- `labels`: ARRAY\ — entity types to redact + +Returns text with identified entities replaced by `[MASKED]`. + +Common label values: `'person'`, `'email'`, `'phone'`, `'address'`, `'ssn'`, `'credit_card'` + +```sql +SELECT ai_mask( + message_body, + ARRAY('person', 'email', 'phone', 'address') +) AS message_safe +FROM customer_messages; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("customer_messages") +df.withColumn( + "message_safe", + expr("ai_mask(message_body, array('person', 'email', 'phone'))") +).write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe") +``` + +--- + +## `ai_similarity` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_similarity + +**Syntax:** `ai_similarity(expr1, expr2)` — Returns a FLOAT between 0.0 and 1.0. + +Use for fuzzy deduplication, search result ranking, or item matching across datasets. 
+ +```sql +-- Deduplicate company names (similarity > 0.85 = likely duplicate) +SELECT a.id, b.id, a.name, b.name, + ai_similarity(a.name, b.name) AS score +FROM companies a +JOIN companies b ON a.id < b.id +WHERE ai_similarity(a.name, b.name) > 0.85 +ORDER BY score DESC; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("product_search") +df.withColumn( + "match_score", + expr("ai_similarity(search_query, product_title)") +).orderBy("match_score", ascending=False).display() +``` + +--- + +## `ai_summarize` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_summarize + +**Syntax:** `ai_summarize(content [, max_words])` +- `content`: STRING — text to summarize +- `max_words`: INTEGER (optional) — word limit; default 50; use `0` for uncapped + +```sql +-- Default (50 words) +SELECT ai_summarize(article_body) AS summary FROM news_articles; + +-- Custom word limit +SELECT ai_summarize(article_body, 20) AS brief FROM news_articles; +SELECT ai_summarize(article_body, 0) AS full FROM news_articles; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("news_articles") +df.withColumn("summary", expr("ai_summarize(article_body, 30)")).display() +``` + +--- + +## `ai_translate` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_translate + +**Syntax:** `ai_translate(content, to_lang)` +- `content`: STRING — source text +- `to_lang`: STRING — target language code + +**Supported languages:** `en`, `de`, `fr`, `it`, `pt`, `hi`, `es`, `th` + +For unsupported languages, use `ai_query` with a multilingual model endpoint. + +```sql +-- Single language +SELECT ai_translate(product_description, 'es') AS description_es FROM products; + +-- Multi-language fanout +SELECT + description, + ai_translate(description, 'fr') AS description_fr, + ai_translate(description, 'de') AS description_de +FROM products; +``` + +```python +from pyspark.sql.functions import expr +df = spark.table("products") +df.withColumn( + "description_es", + expr("ai_translate(product_description, 'es')") +).display() +``` + +--- + +## `ai_parse_document` + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_parse_document + +**Requires:** DBR 17.1+ + +**Syntax:** `ai_parse_document(content [, options])` +- `content`: BINARY — document content loaded from `read_files()` or `spark.read.format("binaryFile")` +- `options`: MAP\ (optional) — parsing configuration + +**Supported formats:** PDF, JPG/JPEG, PNG, DOCX, PPTX + +Returns a VARIANT with pages, elements (text paragraphs, tables, figures, headers, footers), bounding boxes, and error metadata. + +**Options:** + +| Key | Values | Description | +|-----|--------|-------------| +| `version` | `'2.0'` | Output schema version | +| `imageOutputPath` | Volume path | Save rendered page images | +| `descriptionElementTypes` | `''`, `'figure'`, `'*'` | AI-generated descriptions (default: `'*'` for all) | + +**Output schema:** + +``` +document +├── pages[] -- page id, image_uri +└── elements[] -- extracted content + ├── type -- "text", "table", "figure", etc. 
+ ├── content -- extracted text + ├── bbox -- bounding box coordinates + └── description -- AI-generated description +metadata -- file info, schema version +error_status[] -- errors per page (if any) +``` + +```sql +-- Parse and extract text blocks +SELECT + path, + parsed:pages[*].elements[*].content AS text_blocks, + parsed:error AS parse_error +FROM ( + SELECT path, ai_parse_document(content) AS parsed + FROM read_files('/Volumes/catalog/schema/landing/docs/', format => 'binaryFile') +); + +-- Parse with options (image output + descriptions) +SELECT ai_parse_document( + content, + map( + 'version', '2.0', + 'imageOutputPath', '/Volumes/catalog/schema/volume/images/', + 'descriptionElementTypes', '*' + ) +) AS parsed +FROM read_files('/Volumes/catalog/schema/volume/invoices/', format => 'binaryFile'); +``` + +```python +from pyspark.sql.functions import expr + +df = ( + spark.read.format("binaryFile") + .load("/Volumes/catalog/schema/landing/docs/") + .withColumn("parsed", expr("ai_parse_document(content)")) + .selectExpr( + "path", + "parsed:pages[*].elements[*].content AS text_blocks", + "parsed:error AS parse_error", + ) + .filter("parse_error IS NULL") +) + +# Chain with task-specific functions on the extracted text +df = ( + df.withColumn("summary", expr("ai_summarize(text_blocks, 50)")) + .withColumn("entities", expr("ai_extract(text_blocks, array('date', 'amount', 'vendor'))")) + .withColumn("category", expr("ai_classify(text_blocks, array('invoice', 'contract', 'report'))")) +) +df.display() +``` + +**Limitations:** +- Processing is slow for dense or low-resolution documents +- Suboptimal for non-Latin alphabets and digitally signed PDFs +- Custom models not supported — always uses the built-in parsing model diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/2-ai-query.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/2-ai-query.md new file mode 100644 index 0000000..60d860f --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/2-ai-query.md @@ -0,0 +1,223 @@ +# `ai_query` — Full Reference + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_query + +> Use `ai_query` only when no task-specific function fits. See the function selection table in [SKILL.md](SKILL.md). + +## When to Use `ai_query` + +- Output schema has **nested arrays or deeply nested STRUCTs** (e.g., `itens: [{codigo, descricao, qtde}]`) +- Calling a **custom Model Serving endpoint** (your own fine-tuned model) +- **Multimodal input** — passing binary image files via `files =>` +- **Cross-document reasoning** — prompt includes content from multiple sources +- Need **sampling parameters** (`temperature`, `max_tokens`) control + +## Syntax + +```sql +ai_query( + endpoint, + request + [, returnType => ddl_schema] + [, failOnError => boolean] + [, modelParameters => named_struct(...)] + [, responseFormat => json_string] + [, files => binary_column] +) +``` + +## Parameters + +| Parameter | Type | Runtime | Description | +|---|---|---|---| +| `endpoint` | STRING literal | — | Foundation Model name or custom endpoint name. Never guess — use exact names from the [model serving docs](https://docs.databricks.com/aws/en/machine-learning/foundation-models/supported-models.html). 
| `request` | STRING or STRUCT | — | Prompt string for chat models; STRUCT for custom ML endpoints |
+| `returnType` | DDL schema (optional) | 15.2+ | Structures the parsed response like `from_json` |
+| `failOnError` | BOOLEAN (optional, default `true`) | 15.3+ | If `false`, returns STRUCT `{response, error}` instead of raising on failure |
+| `modelParameters` | STRUCT (optional) | 15.3+ | Sampling params: `temperature`, `max_tokens`, `top_p`, etc. |
+| `responseFormat` | JSON string (optional) | 15.4+ | Forces structured JSON output: `'{"type":"json_object"}'` |
+| `files` | binary column (optional) | — | Pass binary images directly (JPEG/PNG) — no upload step needed |
+
+## Foundation Model Names (Do Not Guess)
+
+| Use case | Endpoint name |
+|---|---|
+| General reasoning / extraction | `databricks-claude-sonnet-4` |
+| Fast / cheap tasks | `databricks-meta-llama-3-1-8b-instruct` |
+| Large context / complex | `databricks-meta-llama-3-3-70b-instruct` |
+| Multimodal (vision + text) | `databricks-llama-4-maverick` |
+| Embeddings | `databricks-gte-large-en` |
+
+## Patterns
+
+### Basic — single prompt
+
+```sql
+SELECT ai_query(
+  'databricks-meta-llama-3-3-70b-instruct',
+  'Describe Databricks SQL in 30 words.'
+) AS response;
+```
+
+### Applied to a table column
+
+```sql
+SELECT ticket_id,
+       ai_query(
+         'databricks-meta-llama-3-3-70b-instruct',
+         CONCAT('Summarize in one sentence: ', ticket_body)
+       ) AS summary
+FROM support_tickets;
+```
+
+### Structured JSON output (`responseFormat`)
+
+Preferred over `returnType` for chat models (requires Runtime 15.4+):
+
+```sql
+SELECT ai_query(
+  'databricks-claude-sonnet-4',
+  CONCAT('Extract invoice fields as JSON. Fields: numero, fornecedor, total, ',
+         'itens:[{codigo, descricao, qtde, vlrUnit}]. Input: ', text_blocks),
+  responseFormat => '{"type":"json_object"}',
+  failOnError => false
+) AS ai_response
+FROM parsed_documents;
+```
+
+Then parse with `from_json`:
+
+```python
+from pyspark.sql.functions import from_json, col
+
+df = df.withColumn(
+    "invoice",
+    from_json(
+        col("ai_response.response"),
+        "STRUCT<numero STRING, fornecedor STRING, total DOUBLE, "
+        "itens ARRAY<STRUCT<codigo STRING, descricao STRING, qtde DOUBLE, vlrUnit DOUBLE>>>"
+    )
+)
+# Access fields
+df.select("invoice.numero", "invoice.total", "invoice.itens").display()
+```
+
+### With `failOnError` (always use in batch pipelines)
+
+```sql
+SELECT
+  id,
+  ai_response.response,
+  ai_response.error
+FROM (
+  SELECT id,
+         ai_query(
+           'databricks-claude-sonnet-4',
+           CONCAT('Classify: ', text),
+           failOnError => false
+         ) AS ai_response
+  FROM documents
+)
+-- Route errors to a separate table downstream
+```
+
+### With `modelParameters` (control sampling)
+
+```sql
+SELECT ai_query(
+  'databricks-meta-llama-3-3-70b-instruct',
+  CONCAT('Extract entities from: ', text),
+  failOnError => false,
+  modelParameters => named_struct('temperature', CAST(0.0 AS DOUBLE), 'max_tokens', 500)
+) AS result
+FROM documents;
+```
+
+### Multimodal — image files (`files =>`)
+
+No file upload step needed.
Pass the binary column directly: + +```sql +SELECT + path, + ai_query( + 'databricks-llama-4-maverick', + 'Describe what is in this image in detail.', + files => content + ) AS description +FROM read_files('/Volumes/catalog/schema/images/', format => 'binaryFile'); +``` + +```python +from pyspark.sql.functions import expr + +df = ( + spark.read.format("binaryFile") + .load("/Volumes/catalog/schema/images/") + .withColumn("description", expr(""" + ai_query( + 'databricks-llama-4-maverick', + 'Describe the contents of this image.', + files => content + ) + """)) +) +``` + +### As a reusable SQL UDF + +```sql +CREATE FUNCTION catalog.schema.extract_invoice(text STRING) +RETURNS STRING +RETURN ai_query( + 'databricks-claude-sonnet-4', + CONCAT('Extract invoice JSON from: ', text), + responseFormat => '{"type":"json_object"}' +); + +SELECT extract_invoice(document_text) FROM raw_documents; +``` + +### PySpark with `expr` + +```python +from pyspark.sql.functions import expr + +df = spark.table("documents") +df = df.withColumn("result", expr(""" + ai_query( + 'databricks-claude-sonnet-4', + concat('Extract structured data from: ', content), + responseFormat => '{"type":"json_object"}', + failOnError => false + ) +""")) +``` + +## Error Handling Pattern for Batch Pipelines + +Always use `failOnError => false` in batch jobs. Write errors to a sidecar table: + +```python +import dlt +from pyspark.sql.functions import expr, col + +@dlt.table(comment="AI extraction results") +def extracted(): + return ( + dlt.read("raw") + .withColumn("ai_response", expr(""" + ai_query('databricks-claude-sonnet-4', prompt, + responseFormat => '{"type":"json_object"}', + failOnError => false) + """)) + ) + +@dlt.table(comment="Rows that failed AI extraction") +def extraction_errors(): + return ( + dlt.read("extracted") + .filter(col("ai_response.error").isNotNull()) + .select("id", "prompt", col("ai_response.error").alias("error")) + ) +``` diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/3-ai-forecast.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/3-ai-forecast.md new file mode 100644 index 0000000..9c1f9b1 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/3-ai-forecast.md @@ -0,0 +1,162 @@ +# `ai_forecast` — Full Reference + +**Docs:** https://docs.databricks.com/aws/en/sql/language-manual/functions/ai_forecast + +> `ai_forecast` is a **table-valued function** — it returns a table of rows, not a scalar. Call it with `SELECT * FROM ai_forecast(...)`. + +## Requirements + +- **Pro or Serverless SQL warehouse** — not available on Classic or Starter +- Input data must have a DATE or TIMESTAMP time column and at least one numeric value column + +## Syntax + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE(...) 
or query, + horizon => 'YYYY-MM-DD' or TIMESTAMP, + time_col => 'column_name', + value_col => 'column_name', + [group_col => 'column_name'], + [prediction_interval_width => 0.95] +) +``` + +## Parameters + +| Parameter | Type | Description | +|---|---|---| +| `observed` | TABLE reference or subquery | Training data with time + value columns | +| `horizon` | DATE, TIMESTAMP, or STRING | End date/time for the forecast period | +| `time_col` | STRING | Name of the DATE or TIMESTAMP column in `observed` | +| `value_col` | STRING | One or more numeric columns to forecast (up to 100 per group) | +| `group_col` | STRING (optional) | Column to partition forecasts by — produces one forecast series per group value | +| `prediction_interval_width` | DOUBLE (optional, default 0.95) | Confidence interval width between 0 and 1 | + +## Output Columns + +For each `value_col` named `metric`, the output includes: + +| Column | Type | Description | +|---|---|---| +| time_col | DATE or TIMESTAMP | The forecast timestamp (same type as input) | +| `metric_forecast` | DOUBLE | Point forecast | +| `metric_upper` | DOUBLE | Upper confidence bound | +| `metric_lower` | DOUBLE | Lower confidence bound | +| group_col | original type | Present when `group_col` is specified | + +## Patterns + +### Single Metric Forecast + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE(SELECT order_date, revenue FROM daily_revenue), + horizon => '2026-12-31', + time_col => 'order_date', + value_col => 'revenue' +); +-- Returns: order_date, revenue_forecast, revenue_upper, revenue_lower +``` + +### Multi-Group Forecast + +Produces one forecast series per distinct value of `group_col`: + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE(SELECT date, region, sales FROM regional_sales), + horizon => '2026-12-31', + time_col => 'date', + value_col => 'sales', + group_col => 'region' +); +-- Returns: date, region, sales_forecast, sales_upper, sales_lower +-- One row per date per region +``` + +### Multiple Value Columns + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE(SELECT date, units, revenue FROM daily_kpis), + horizon => '2026-06-30', + time_col => 'date', + value_col => 'units,revenue' -- comma-separated +); +-- Returns: date, units_forecast, units_upper, units_lower, +-- revenue_forecast, revenue_upper, revenue_lower +``` + +### Custom Confidence Interval + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE(SELECT ts, sensor_value FROM iot_readings), + horizon => '2026-03-31', + time_col => 'ts', + value_col => 'sensor_value', + prediction_interval_width => 0.80 -- narrower interval = less conservative +); +``` + +### Filtering Input Data (Subquery) + +```sql +SELECT * +FROM ai_forecast( + observed => TABLE( + SELECT date, sales + FROM daily_sales + WHERE region = 'BR' AND date >= '2024-01-01' + ), + horizon => '2026-12-31', + time_col => 'date', + value_col => 'sales' +); +``` + +### PySpark — Use `spark.sql()` + +`ai_forecast` is a table-valued function and must be called through `spark.sql()`: + +```python +result = spark.sql(""" + SELECT * + FROM ai_forecast( + observed => TABLE(SELECT date, sales FROM catalog.schema.daily_sales), + horizon => '2026-12-31', + time_col => 'date', + value_col => 'sales' + ) +""") +result.display() +``` + +### Save Forecast to Delta Table + +```python +result = spark.sql(""" + SELECT * + FROM ai_forecast( + observed => TABLE(SELECT date, region, revenue FROM catalog.schema.sales), + horizon => '2026-12-31', + time_col => 'date', + value_col => 'revenue', + group_col => 
'region' + ) +""") +result.write.format("delta").mode("overwrite").saveAsTable("catalog.schema.revenue_forecast") +``` + +## Notes + +- The underlying model is a **prophet-like piecewise linear + seasonality model** — suitable for business time series with trend and weekly/yearly seasonality +- Handles "any number of groups" but up to **100 metrics per group** +- Output time column preserves the input type (DATE stays DATE, TIMESTAMP stays TIMESTAMP) +- Value columns are always cast to DOUBLE in output regardless of input type diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/4-document-processing-pipeline.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/4-document-processing-pipeline.md new file mode 100644 index 0000000..cb8afbd --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/4-document-processing-pipeline.md @@ -0,0 +1,470 @@ +# Document Processing Pipeline with AI Functions + +End-to-end patterns for building batch document processing pipelines using AI Functions in a Lakeflow Declarative Pipeline (DLT). Covers function selection, `config.yml` centralization, error handling, and guidance on near-real-time variants with DSPy or LangChain. + +> For workflow migration context (e.g., migrating from n8n, LangChain, or other orchestration tools), see the companion skill `n8n-to-databricks`. + +--- + +## Function Selection for Document Pipelines + +When processing documents with AI Functions, apply this order of preference for each stage: + +| Stage | Preferred function | Use `ai_query` when... | +|---|---|---| +| Parse binary docs (PDF, DOCX, images) | `ai_parse_document` | Need image-level reasoning | +| Extract flat fields from text | `ai_extract` | Schema has nested arrays | +| Classify document type or status | `ai_classify` | More than 20 categories | +| Score item similarity / matching | `ai_similarity` | Need cross-document reasoning | +| Summarize long sections | `ai_summarize` | — | +| Extract nested JSON (e.g. line items) | `ai_query` with `responseFormat` | (This is the intended use case) | + +--- + +## Centralized Configuration (`config.yml`) + +**Always centralize model names, volume paths, and prompts in a `config.yml`.** This makes model swaps a one-line change and keeps pipeline code free of hardcoded strings. + +```yaml +# config.yml +models: + default: "databricks-claude-sonnet-4" + mini: "databricks-meta-llama-3-1-8b-instruct" + vision: "databricks-llama-4-maverick" + +catalog: + name: "my_catalog" + schema: "document_processing" + +volumes: + input: "/Volumes/my_catalog/document_processing/landing/" + tmp: "/Volumes/my_catalog/document_processing/tmp/" + +output_tables: + results: "my_catalog.document_processing.processed_docs" + errors: "my_catalog.document_processing.processing_errors" + +prompts: + extract_invoice: | + Extract invoice fields and return ONLY valid JSON. + Fields: invoice_number, vendor_name, vendor_tax_id (digits only), + issue_date (dd/mm/yyyy), total_amount (numeric), + line_items: [{item_code, description, quantity, unit_price, total}]. + Return null for missing fields. + + classify_doc: | + Classify this document into exactly one category. 
+```
+
+```python
+# config_loader.py
+import yaml
+
+def load_config(path: str = "config.yml") -> dict:
+    with open(path) as f:
+        return yaml.safe_load(f)
+
+CFG = load_config()
+ENDPOINT = CFG["models"]["default"]
+ENDPOINT_MINI = CFG["models"]["mini"]
+VOLUME_INPUT = CFG["volumes"]["input"]
+PROMPT_INV = CFG["prompts"]["extract_invoice"]
+```
+
+---
+
+## Batch Pipeline — Lakeflow Declarative Pipeline
+
+Each logical step in your document workflow maps to a `@dlt.table` stage. Data flows through Delta tables between stages.
+
+```
+[Landing Volume] → Stage 1: ai_parse_document
+                 → Stage 2: ai_classify (document type)
+                 → Stage 3: ai_extract (flat fields) + ai_query (nested JSON)
+                 → Stage 4: ai_similarity (item matching)
+                 → Stage 5: Final Delta output table
+```
+
+### `pipeline.py`
+
+```python
+import dlt
+import yaml
+from pyspark.sql.functions import expr, col, from_json
+
+CFG = yaml.safe_load(open("/Workspace/path/to/config.yml"))
+ENDPOINT = CFG["models"]["default"]
+VOL_IN = CFG["volumes"]["input"]
+PROMPT = CFG["prompts"]["extract_invoice"]
+
+
+# ── Stage 1: Parse binary documents ──────────────────────────────────────────
+# Preferred: ai_parse_document — no model selection, no ai_query needed
+
+@dlt.table(comment="Parsed document text from all file types in the landing volume")
+def raw_parsed():
+    return (
+        spark.read.format("binaryFile").load(VOL_IN)
+        .withColumn("parsed", expr("ai_parse_document(content)"))
+        .selectExpr(
+            "path",
+            "parsed:pages[*].elements[*].content AS text_blocks",
+            "parsed:error AS parse_error",
+        )
+        .filter("parse_error IS NULL")
+    )
+
+
+# ── Stage 2: Classify document type ──────────────────────────────────────────
+# Preferred: ai_classify — cheap, no endpoint selection
+
+@dlt.table(comment="Document type classification")
+def classified_docs():
+    return (
+        dlt.read("raw_parsed")
+        .withColumn(
+            "doc_type",
+            expr("ai_classify(text_blocks, array('invoice', 'purchase_order', 'receipt', 'contract', 'other'))")
+        )
+    )
+
+
+# ── Stage 3a: Flat field extraction ──────────────────────────────────────────
+# Preferred: ai_extract for flat fields (vendor, date, total)
+
+@dlt.table(comment="Flat header fields extracted from documents")
+def extracted_flat():
+    return (
+        dlt.read("classified_docs")
+        .filter("doc_type = 'invoice'")
+        .withColumn(
+            "header",
+            expr("ai_extract(text_blocks, array('invoice_number', 'vendor_name', 'issue_date', 'total_amount', 'tax_id'))")
+        )
+        .select("path", "doc_type", "text_blocks", col("header"))
+    )
+
+
+# ── Stage 3b: Nested JSON extraction (last resort: ai_query) ─────────────────
+# Use ai_query only because line_items is a nested array — ai_extract can't handle it
+
+@dlt.table(comment="Nested line items extracted — ai_query used for array schema only")
+def extracted_line_items():
+    return (
+        dlt.read("extracted_flat")
+        .withColumn(
+            "ai_response",
+            expr(f"""
+                ai_query(
+                    '{ENDPOINT}',
+                    concat('{PROMPT.strip()}', '\\n\\nDocument text:\\n', LEFT(text_blocks, 6000)),
+                    responseFormat => '{{"type":"json_object"}}',
+                    failOnError => false
+                )
+            """)
+        )
+        .withColumn(
+            "line_items",
+            from_json(
+                col("ai_response.response"),
+                "STRUCT<line_items ARRAY<STRUCT<item_code STRING, description STRING, quantity DOUBLE, unit_price DOUBLE, total DOUBLE>>>"
+            )
+        )
+        .select("path", "doc_type", "header", "line_items", col("ai_response.error").alias("extraction_error"))
+    )
+
+
+# ── Stage 4: Similarity matching ─────────────────────────────────────────────
+# Preferred: ai_similarity for fuzzy matching between extracted fields
+
+@dlt.table(comment="Vendor name similarity vs reference master data")
+def vendor_matched():
+    extracted = dlt.read("extracted_line_items")
+    # Join against a reference vendor table for fuzzy matching
+    vendors = spark.table("my_catalog.document_processing.vendor_master").select("vendor_id", "vendor_name")
+
+    return (
+        extracted.crossJoin(vendors)
+        .withColumn(
+            "name_similarity",
+            expr("ai_similarity(header.vendor_name, vendor_name)")
+        )
+        .filter("name_similarity > 0.80")
+        .orderBy("name_similarity", ascending=False)
+    )
+
+
+# ── Stage 5: Final output + error sidecar ────────────────────────────────────
+
+@dlt.table(
+    comment="Final processed documents ready for downstream consumption",
+    table_properties={"delta.enableChangeDataFeed": "true"},
+)
+def processed_docs():
+    return (
+        dlt.read("extracted_line_items")
+        .filter("extraction_error IS NULL")
+        .selectExpr(
+            "path",
+            "doc_type",
+            "header.invoice_number",
+            "header.vendor_name",
+            "header.issue_date",
+            "header.total_amount",
+            "line_items.line_items AS items",
+        )
+    )
+
+
+@dlt.table(comment="Rows that failed at any extraction stage — review and reprocess")
+def processing_errors():
+    return (
+        dlt.read("extracted_line_items")
+        .filter("extraction_error IS NOT NULL")
+        .select("path", "doc_type", col("extraction_error").alias("error"))
+    )
+```
+
+---
+
+## Custom RAG Pipeline — Parse → Chunk → Index → Query
+
+When the goal is retrieval-augmented generation rather than field extraction, use this pipeline to parse documents, chunk them into a Delta table, and index with Vector Search.
+
+### Step 1 — Parse and Chunk into a Delta Table
+
+`ai_parse_document` returns a VARIANT. Use `variant_get` with an explicit `ARRAY<VARIANT>` cast before calling `explode`, since `explode()` does not accept raw VARIANT values.
+
+```sql
+CREATE OR REPLACE TABLE catalog.schema.parsed_chunks AS
+WITH parsed AS (
+  SELECT
+    path,
+    ai_parse_document(content) AS doc
+  FROM read_files('/Volumes/catalog/schema/volume/docs/', format => 'binaryFile')
+),
+elements AS (
+  SELECT
+    path,
+    explode(variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')) AS element
+  FROM parsed
+)
+SELECT
+  md5(concat(path, variant_get(element, '$.content', 'STRING'))) AS chunk_id,
+  path AS source_path,
+  variant_get(element, '$.content', 'STRING') AS content,
+  variant_get(element, '$.type', 'STRING') AS element_type,
+  current_timestamp() AS parsed_at
+FROM elements
+WHERE variant_get(element, '$.content', 'STRING') IS NOT NULL
+  AND length(trim(variant_get(element, '$.content', 'STRING'))) > 10;
+```
+
+### Step 1a (Production) — Incremental Parsing with Structured Streaming
+
+For production pipelines where new documents arrive over time, use Structured Streaming with checkpoints for exactly-once processing. Each run processes only new files, then stops with `trigger(availableNow=True)`.
+
+See the official bundle example:
+[databricks/bundle-examples/contrib/job_with_ai_parse_document](https://github.com/databricks/bundle-examples/tree/main/contrib/job_with_ai_parse_document)
+
+**Stage 1 — Parse raw documents (streaming):**
+
+```python
+from pyspark.sql.functions import col, current_timestamp, expr
+
+files_df = (
+    spark.readStream.format("binaryFile")
+    .option("pathGlobFilter", "*.{pdf,jpg,jpeg,png}")
+    .option("recursiveFileLookup", "true")
+    .load("/Volumes/catalog/schema/volume/docs/")
+)
+
+parsed_df = (
+    files_df
+    .repartition(8, expr("crc32(path) % 8"))
+    .withColumn("parsed", expr("""
+        ai_parse_document(content, map(
+            'version', '2.0',
+            'descriptionElementTypes', '*'
+        ))
+    """))
+    .withColumn("parsed_at", current_timestamp())
+    .select("path", "parsed", "parsed_at")
+)
+
+(
+    parsed_df.writeStream.format("delta")
+    .outputMode("append")
+    .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/01_parse")
+    .option("mergeSchema", "true")
+    .trigger(availableNow=True)
+    .toTable("catalog.schema.parsed_documents_raw")
+)
+```
+
+**Stage 2 — Extract text from parsed VARIANT (streaming):**
+
+Uses `transform()` to extract element content from the VARIANT array, and `try_cast` for safe access. Error rows are preserved but flagged.
+
+```python
+from pyspark.sql.functions import col, concat_ws, expr, lit, when
+
+parsed_stream = spark.readStream.format("delta").table("catalog.schema.parsed_documents_raw")
+
+text_df = (
+    parsed_stream
+    .withColumn("text",
+        when(
+            expr("try_cast(parsed:error_status AS STRING)").isNotNull(), lit(None)
+        ).otherwise(
+            concat_ws("\n\n", expr("""
+                transform(
+                    try_cast(parsed:document:elements AS ARRAY<VARIANT>),
+                    element -> try_cast(element:content AS STRING)
+                )
+            """))
+        )
+    )
+    .withColumn("error_status", expr("try_cast(parsed:error_status AS STRING)"))
+    .select("path", "text", "error_status", "parsed_at")
+)
+
+(
+    text_df.writeStream.format("delta")
+    .outputMode("append")
+    .option("checkpointLocation", "/Volumes/catalog/schema/checkpoints/02_text")
+    .option("mergeSchema", "true")
+    .trigger(availableNow=True)
+    .toTable("catalog.schema.parsed_documents_text")
+)
+```
+
+Key techniques:
+- **`repartition` by file hash** — parallelizes `ai_parse_document` across workers
+- **`trigger(availableNow=True)`** — processes all pending files then stops (batch-like)
+- **Checkpoints** — exactly-once guarantee; no re-parsing on re-runs
+- **`transform()` + `try_cast`** — safer than `explode` + `variant_get` for text extraction
+- **Separate stages with independent checkpoints** — parse and text extraction can fail/retry independently
+
+### Step 1b — Enable Change Data Feed
+
+Required for Vector Search Delta Sync:
+
+```sql
+ALTER TABLE catalog.schema.parsed_chunks
+SET TBLPROPERTIES (delta.enableChangeDataFeed = true);
+```
+
+### Step 2 — Create a Vector Search Index and Query It
+
+Use the **[databricks-vector-search](../databricks-vector-search/SKILL.md)** skill to create a Delta Sync index on the chunked table and query it. Ensure CDF is enabled first (Step 1b above).
+
+### RAG-Specific Issues
+
+| Issue | Solution |
+|-------|----------|
+| `explode()` fails with VARIANT | `explode()` requires ARRAY, not VARIANT. Use `variant_get(doc, '$.document.elements', 'ARRAY<VARIANT>')` to cast before exploding |
+| Short/noisy chunks | Filter with `length(trim(...)) > 10` — parsing produces tiny fragments (page numbers, headers) that pollute the index |
+| Re-parsing unchanged documents | Use Structured Streaming with checkpoints — see Step 1a above |
+| Region not supported | US/EU regions only, or enable cross-geography routing |
+
+---
+
+## Near-Real-Time Variant — DSPy + MLflow Agent
+
+When the pipeline must respond in seconds (triggered by a user action, API call, or form submission), use DSPy with an MLflow ChatAgent instead of a DLT pipeline.
+
+**When to use DSPy vs LangChain:**
+
+| Scenario | Stack |
+|---|---|
+| Fixed pipeline steps, well-defined I/O, want prompt optimization | **DSPy** |
+| Needs tool-calling, memory, or multi-agent coordination | **LangChain LCEL** + MLflow ChatAgent |
+| Single LLM call, simple task | Direct AI Function or `ai_query` in a notebook |
+
+### DSPy Signatures (replace LangChain agent system prompts)
+
+```python
+# pip install dspy-ai mlflow databricks-sdk
+import dspy, yaml
+
+CFG = yaml.safe_load(open("config.yml"))
+lm = dspy.LM(
+    model=f"databricks/{CFG['models']['default']}",
+    api_base="https://<your-workspace-url>/serving-endpoints",
+    api_key=dbutils.secrets.get("scope", "databricks-token"),
+)
+dspy.configure(lm=lm)
+
+
+class ExtractInvoiceHeader(dspy.Signature):
+    """Extract invoice header fields from document text."""
+    document_text: str = dspy.InputField(desc="Raw text from the document")
+    invoice_number: str = dspy.OutputField(desc="Invoice number, or null")
+    vendor_name: str = dspy.OutputField(desc="Vendor/supplier name, or null")
+    issue_date: str = dspy.OutputField(desc="Date as dd/mm/yyyy, or null")
+    total_amount: float = dspy.OutputField(desc="Total amount as float, or null")
+
+
+class ClassifyDocument(dspy.Signature):
+    """Classify a document into one of the provided categories."""
+    document_text: str = dspy.InputField()
+    category: str = dspy.OutputField(
+        desc="One of: invoice, purchase_order, receipt, contract, other"
+    )
+
+
+class DocumentPipeline(dspy.Module):
+    def __init__(self):
+        super().__init__()
+        self.classify = dspy.Predict(ClassifyDocument)
+        self.extract = dspy.Predict(ExtractInvoiceHeader)
+
+    def forward(self, document_text: str):
+        doc_type = self.classify(document_text=document_text).category
+        if doc_type == "invoice":
+            header = self.extract(document_text=document_text)
+            return {"doc_type": doc_type, "header": header.__dict__}
+        return {"doc_type": doc_type, "header": None}
+
+
+pipeline = DocumentPipeline()
+```
+
+### Wrap and Register with MLflow
+
+```python
+import mlflow, json
+
+class DSPyDocumentAgent(mlflow.pyfunc.PythonModel):
+    def load_context(self, context):
+        import dspy, yaml
+        cfg = yaml.safe_load(open(context.artifacts["config"]))
+        lm = dspy.LM(model=f"databricks/{cfg['models']['default']}")
+        dspy.configure(lm=lm)
+        self.pipeline = DocumentPipeline()
+
+    def predict(self, context, model_input):
+        text = model_input.iloc[0]["document_text"]
+        return json.dumps(self.pipeline(document_text=text), ensure_ascii=False)
+
+mlflow.set_registry_uri("databricks-uc")
+with mlflow.start_run():
+    mlflow.pyfunc.log_model(
+        artifact_path="document_agent",
+        python_model=DSPyDocumentAgent(),
+        artifacts={"config": "config.yml"},
+        registered_model_name="my_catalog.document_processing.document_agent",
+    )
+```
+
+---
+
+## Tips
+
+1. **Parse first, enrich second** — always run `ai_parse_document` as the first stage.
Feed its text output to task-specific functions; never pass raw binary to `ai_query`. +2. **Flat fields → `ai_extract`; nested arrays → `ai_query`** — this is the clearest decision boundary. +3. **`failOnError => false` is mandatory in batch** — write errors to a sidecar `_errors` table rather than crashing the pipeline. +4. **Truncate before sending to `ai_query`** — use `LEFT(text, 6000)` or chunk long documents to stay within context window limits. +5. **Prompts belong in `config.yml`** — never hardcode prompt strings in pipeline code. A prompt change should be a config change, not a code change. +6. **DSPy for agents** — when migrating from LangChain agent-based tools, DSPy typed `Signature` classes give you structured I/O contracts, testability, and optional prompt compilation/optimization. diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/SKILL.md new file mode 100644 index 0000000..19897d8 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-ai-functions/SKILL.md @@ -0,0 +1,195 @@ +--- +name: databricks-ai-functions +description: "Use Databricks built-in AI Functions (ai_classify, ai_extract, ai_summarize, ai_mask, ai_translate, ai_fix_grammar, ai_gen, ai_analyze_sentiment, ai_similarity, ai_parse_document, ai_query, ai_forecast) to add AI capabilities directly to SQL and PySpark pipelines without managing model endpoints. Also covers document parsing and building custom RAG pipelines (parse → chunk → index → query)." +--- + +# Databricks AI Functions + +> **Official Docs:** https://docs.databricks.com/aws/en/large-language-models/ai-functions +> Individual function reference: https://docs.databricks.com/aws/en/sql/language-manual/functions/ + +## Overview + +Databricks AI Functions are built-in SQL and PySpark functions that call Foundation Model APIs directly from your data pipelines — no model endpoint setup, no API keys, no boilerplate. They operate on table columns as naturally as `UPPER()` or `LENGTH()`, and are optimized for batch inference at scale. + +There are three categories: + +| Category | Functions | Use when | +|---|---|---| +| **Task-specific** | `ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`, `ai_parse_document` | The task is well-defined — prefer these always | +| **General-purpose** | `ai_query` | Complex nested JSON, custom endpoints, multimodal — **last resort only** | +| **Table-valued** | `ai_forecast` | Time series forecasting | + +**Function selection rule — always prefer a task-specific function over `ai_query`:** + +| Task | Use this | Fall back to `ai_query` when... 
| +|---|---|---| +| Sentiment scoring | `ai_analyze_sentiment` | Never | +| Fixed-label routing | `ai_classify` (2–500 labels; add descriptions for accuracy) | Never | +| Entity / field extraction | `ai_extract` | Never | +| Summarization | `ai_summarize` | Never — use `max_words=0` for uncapped | +| Grammar correction | `ai_fix_grammar` | Never | +| Translation | `ai_translate` | Target language not in the supported list | +| PII redaction | `ai_mask` | Never | +| Free-form generation | `ai_gen` | Need structured JSON output | +| Semantic similarity | `ai_similarity` | Never | +| PDF / document parsing | `ai_parse_document` | Need image-level reasoning | +| Complex JSON / reasoning | — | **This is the intended use case for `ai_query`** | + +## Prerequisites + +- Databricks SQL warehouse (**not Classic**) or cluster with DBR **15.1+** +- DBR **15.4 ML LTS** recommended for batch workloads +- DBR **17.1+** required for `ai_parse_document` +- `ai_forecast` requires a **Pro or Serverless** SQL warehouse +- Workspace in a supported AWS/Azure region for batch AI inference +- Models run under Apache 2.0 or LLAMA 3.3 Community License — customers are responsible for compliance + +## Quick Start + +Classify, extract, and score sentiment from a text column in a single query: + +```sql +SELECT + ticket_id, + ticket_text, + ai_classify(ticket_text, ARRAY('urgent', 'not urgent', 'spam')) AS priority, + ai_extract(ticket_text, ARRAY('product', 'error_code', 'date')) AS entities, + ai_analyze_sentiment(ticket_text) AS sentiment +FROM support_tickets; +``` + +```python +from pyspark.sql.functions import expr + +df = spark.table("support_tickets") +df = ( + df.withColumn("priority", expr("ai_classify(ticket_text, array('urgent', 'not urgent', 'spam'))")) + .withColumn("entities", expr("ai_extract(ticket_text, array('product', 'error_code', 'date'))")) + .withColumn("sentiment", expr("ai_analyze_sentiment(ticket_text)")) +) +# Access nested STRUCT fields from ai_extract +df.select("ticket_id", "priority", "sentiment", + "entities.product", "entities.error_code", "entities.date").display() +``` + +## Common Patterns + +### Pattern 1: Text Analysis Pipeline + +Chain multiple task-specific functions to enrich a text column in one pass: + +```sql +SELECT + id, + content, + ai_analyze_sentiment(content) AS sentiment, + ai_summarize(content, 30) AS summary, + ai_classify(content, + ARRAY('technical', 'billing', 'other')) AS category, + ai_fix_grammar(content) AS content_clean +FROM raw_feedback; +``` + +### Pattern 2: PII Redaction Before Storage + +```python +from pyspark.sql.functions import expr + +df_clean = ( + spark.table("raw_messages") + .withColumn( + "message_safe", + expr("ai_mask(message, array('person', 'email', 'phone', 'address'))") + ) +) +df_clean.write.format("delta").mode("append").saveAsTable("catalog.schema.messages_safe") +``` + +### Pattern 3: Document Ingestion from a Unity Catalog Volume + +Parse PDFs/Office docs, then enrich with task-specific functions: + +```python +from pyspark.sql.functions import expr + +df = ( + spark.read.format("binaryFile") + .load("/Volumes/catalog/schema/landing/documents/") + .withColumn("parsed", expr("ai_parse_document(content)")) + .selectExpr("path", + "parsed:pages[*].elements[*].content AS text_blocks", + "parsed:error AS parse_error") + .filter("parse_error IS NULL") + .withColumn("summary", expr("ai_summarize(text_blocks, 50)")) + .withColumn("entities", expr("ai_extract(text_blocks, array('date', 'amount', 'vendor'))")) +) +``` + +### Pattern 4: 
Semantic Matching / Deduplication
+
+```sql
+-- Find near-duplicate company names
+SELECT a.id, b.id, ai_similarity(a.name, b.name) AS score
+FROM companies a
+JOIN companies b ON a.id < b.id
+WHERE ai_similarity(a.name, b.name) > 0.85;
+```
+
+### Pattern 5: Complex JSON Extraction with `ai_query` (last resort)
+
+Use only when the output schema has nested arrays or requires multi-step reasoning that no task-specific function handles:
+
+```python
+from pyspark.sql.functions import expr, from_json, col
+
+df = (
+    spark.table("parsed_documents")
+    .withColumn("ai_response", expr("""
+        ai_query(
+            'databricks-claude-sonnet-4',
+            concat('Extract invoice as JSON with nested itens array: ', text_blocks),
+            responseFormat => '{"type":"json_object"}',
+            failOnError => false
+        )
+    """))
+    .withColumn("invoice", from_json(
+        col("ai_response.response"),
+        "STRUCT<numero STRING, fornecedor STRING, total DOUBLE, "
+        "itens ARRAY<STRUCT<codigo STRING, descricao STRING, qtde DOUBLE, vlrUnit DOUBLE>>>"
+    ))
+)
+```
+
+### Pattern 6: Time Series Forecasting
+
+```sql
+SELECT *
+FROM ai_forecast(
+  observed => TABLE(SELECT date, sales FROM daily_sales),
+  horizon => '2026-12-31',
+  time_col => 'date',
+  value_col => 'sales'
+);
+-- Returns: date, sales_forecast, sales_upper, sales_lower
+```
+
+## Reference Files
+
+- [1-task-functions.md](1-task-functions.md) — Full syntax, parameters, SQL + PySpark examples for all 9 task-specific functions (`ai_analyze_sentiment`, `ai_classify`, `ai_extract`, `ai_fix_grammar`, `ai_gen`, `ai_mask`, `ai_similarity`, `ai_summarize`, `ai_translate`) and `ai_parse_document`
+- [2-ai-query.md](2-ai-query.md) — `ai_query` complete reference: all parameters, structured output with `responseFormat`, multimodal `files =>`, UDF patterns, and error handling
+- [3-ai-forecast.md](3-ai-forecast.md) — `ai_forecast` parameters, single-metric, multi-group, multi-metric, and confidence interval patterns
+- [4-document-processing-pipeline.md](4-document-processing-pipeline.md) — End-to-end batch document processing pipeline using AI Functions in a Lakeflow Declarative Pipeline; includes `config.yml` centralization, function selection logic, custom RAG pipeline (parse → chunk → Vector Search), and DSPy/LangChain guidance for near-real-time variants
+
+## Common Issues
+
+| Issue | Solution |
+|---|---|
+| `ai_parse_document` not found | Requires DBR **17.1+**. Check cluster runtime. |
+| `ai_forecast` fails | Requires **Pro or Serverless** SQL warehouse — not available on Classic or Starter. |
+| All functions return NULL | Input column is NULL. Filter with `WHERE col IS NOT NULL` before calling. |
+| `ai_translate` fails for a language | Supported: English, German, French, Italian, Portuguese, Hindi, Spanish, Thai. Use `ai_query` with a multilingual model for others. |
+| `ai_classify` returns unexpected labels | Use clear, mutually exclusive label names. Fewer labels (2–5) produces more reliable results. |
+| `ai_query` raises on some rows in a batch job | Add `failOnError => false` — returns a STRUCT with `.response` and `.error` instead of raising. |
+| Batch job runs slowly | Use DBR **15.4 ML LTS** cluster (not serverless or interactive) for optimized batch inference throughput. |
+| Want to swap models without editing pipeline code | Store all model names and prompts in `config.yml` — see [4-document-processing-pipeline.md](4-document-processing-pipeline.md) for the pattern.
| diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/1-widget-specifications.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/1-widget-specifications.md new file mode 100644 index 0000000..d8e03c1 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/1-widget-specifications.md @@ -0,0 +1,341 @@ +# Widget Specifications + +Core widget types for AI/BI dashboards. For advanced visualizations (area, scatter, choropleth map, combo), see [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md). + +## Widget Naming and Display + +- `widget.name`: alphanumeric + hyphens + underscores ONLY (max 60 characters) +- `frame.title`: human-readable title (any characters allowed) +- `frame.showTitle`: always set to `true` so users understand the widget +- `displayName`: use in encodings to label axes/values clearly (e.g., "Revenue ($)", "Growth Rate (%)") +- `widget.queries[].name`: use `"main_query"` for chart/counter/table widgets. Filter widgets with multiple queries can use descriptive names (see [3-filters.md](3-filters.md)) + +**Always format values appropriately** - use `format` for currency, percentages, and large numbers (see [Axis Formatting](#axis-formatting)). + +## Version Requirements + +| Widget Type | Version | File | +|-------------|---------|------| +| text | N/A | this file | +| counter | 2 | this file | +| table | 2 | this file | +| bar | 3 | this file | +| line | 3 | this file | +| pie | 3 | this file | +| area | 3 | [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md) | +| scatter | 3 | [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md) | +| combo | 1 | [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md) | +| choropleth-map | 1 | [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md) | +| filter-* | 2 | [3-filters.md](3-filters.md) | + +--- + +## Text (Headers/Descriptions) + +- **CRITICAL: Text widgets do NOT use a spec block** - use `multilineTextboxSpec` directly +- Supports markdown: `#`, `##`, `###`, `**bold**`, `*italic*` +- **CRITICAL: Multiple items in the `lines` array are concatenated on a single line, NOT displayed as separate lines!** +- For title + subtitle, use **separate text widgets** at different y positions + +```json +// CORRECT: Separate widgets for title and subtitle +{ + "widget": { + "name": "title", + "multilineTextboxSpec": {"lines": ["## Dashboard Title"]} + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 1} +}, +{ + "widget": { + "name": "subtitle", + "multilineTextboxSpec": {"lines": ["Description text here"]} + }, + "position": {"x": 0, "y": 1, "width": 6, "height": 1} +} + +// WRONG: Multiple lines concatenate into one line! +{ + "widget": { + "name": "title-widget", + "multilineTextboxSpec": { + "lines": ["## Dashboard Title", "Description text here"] // Becomes "## Dashboard TitleDescription text here" + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 2} +} +``` + +--- + +## Counter (KPI) + +- `version`: **2** (NOT 3!) 
+- `widgetType`: "counter" +- Percent values must be 0-1 in the data (not 0-100) + +### Number Formatting + +```json +"encodings": { + "value": { + "fieldName": "revenue", + "displayName": "Total Revenue", + "format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact", + "decimalPlaces": {"type": "max", "places": 2} + } + } +} +``` + +Format types: `number`, `number-currency`, `number-percent` + +### Counter Patterns + +**Pre-aggregated dataset (1 row)** - use `disaggregated: true`: +```json +{ + "widget": { + "name": "total-revenue", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "summary_ds", + "fields": [{"name": "revenue", "expression": "`revenue`"}], + "disaggregated": true + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "revenue", "displayName": "Total Revenue"} + }, + "frame": {"showTitle": true, "title": "Total Revenue"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 3} +} +``` + +**Multi-row dataset with aggregation (supports filters)** - use `disaggregated: false`: +- Dataset returns multiple rows (e.g., grouped by a filter dimension) +- Use `"disaggregated": false` and aggregation expression +- **CRITICAL**: Field `name` MUST match `fieldName` exactly (e.g., `"sum(spend)"`) + +```json +{ + "widget": { + "name": "total-spend", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "by_category", + "fields": [{"name": "sum(spend)", "expression": "SUM(`spend`)"}], + "disaggregated": false + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "sum(spend)", "displayName": "Total Spend"} + }, + "frame": {"showTitle": true, "title": "Total Spend"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 3} +} +``` + +--- + +## Table + +- `version`: **2** (NOT 1 or 3!) +- `widgetType`: "table" +- **Columns only need `fieldName` and `displayName`** - no other properties required +- Use `"disaggregated": true` for raw rows +- Default sort: use `ORDER BY` in dataset SQL + +```json +{ + "widget": { + "name": "details-table", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "details_ds", + "fields": [ + {"name": "name", "expression": "`name`"}, + {"name": "value", "expression": "`value`"} + ], + "disaggregated": true + } + }], + "spec": { + "version": 2, + "widgetType": "table", + "encodings": { + "columns": [ + {"fieldName": "name", "displayName": "Name"}, + {"fieldName": "value", "displayName": "Value"} + ] + }, + "frame": {"showTitle": true, "title": "Details"} + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 6} +} +``` + +--- + +## Line / Bar Charts + +- `version`: **3** +- `widgetType`: "line" or "bar" +- Use `x`, `y`, optional `color` encodings +- `scale.type`: `"temporal"` (dates), `"quantitative"` (numbers), `"categorical"` (strings) +- Use `"disaggregated": true` with pre-aggregated dataset data + +**Multiple series - two approaches:** + +1. **Multi-Y Fields** (different metrics): +```json +"y": { + "scale": {"type": "quantitative"}, + "fields": [ + {"fieldName": "sum(orders)", "displayName": "Orders"}, + {"fieldName": "sum(returns)", "displayName": "Returns"} + ] +} +``` + +2. 
**Color Grouping** (same metric split by dimension): +```json +"y": {"fieldName": "sum(revenue)", "scale": {"type": "quantitative"}}, +"color": {"fieldName": "region", "scale": {"type": "categorical"}} +``` + +### Bar Chart Modes + +| Mode | Configuration | +|------|---------------| +| Stacked (default) | No `mark` field | +| Grouped | `"mark": {"layout": "group"}` | + +### Horizontal Bar Chart + +Swap `x` and `y` - put quantitative on `x`, categorical/temporal on `y`: +```json +"encodings": { + "x": {"scale": {"type": "quantitative"}, "fields": [...]}, + "y": {"fieldName": "category", "scale": {"type": "categorical"}} +} +``` + +### Color Scale + +> **CRITICAL**: For bar/line/pie, color scale ONLY supports `type` and `sort`. +> Do NOT use `scheme`, `colorRamp`, or `mappings` (only for choropleth-map). + +--- + +## Pie Chart + +- `version`: **3** +- `widgetType`: "pie" +- `angle`: quantitative field +- `color`: categorical dimension +- **Limit to 3-8 categories for readability** + +```json +"spec": { + "version": 3, + "widgetType": "pie", + "encodings": { + "angle": {"fieldName": "revenue", "scale": {"type": "quantitative"}}, + "color": {"fieldName": "category", "scale": {"type": "categorical"}} + } +} +``` + +--- + +## Axis Formatting + +Add `format` to any encoding to display values appropriately: + +| Data Type | Format Type | Example | +|-----------|-------------|---------| +| Currency | `number-currency` | $1.2M | +| Percentage | `number-percent` | 45.2% (data must be 0-1, not 0-100) | +| Large numbers | `number` with `abbreviation` | 1.5K, 2.3M | + +```json +"value": { + "fieldName": "revenue", + "displayName": "Revenue", + "format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact", + "decimalPlaces": {"type": "max", "places": 2} + } +} +``` + +**Options:** +- `abbreviation`: `"compact"` (K/M/B) or omit for full numbers +- `decimalPlaces`: `{"type": "max", "places": N}` or `{"type": "fixed", "places": N}` + +--- + +## Dataset Parameters + +Use `:param` syntax in SQL for dynamic filtering: + +```json +{ + "name": "revenue_by_category", + "queryLines": ["SELECT ... WHERE returns_usd > :threshold GROUP BY category"], + "parameters": [{ + "keyword": "threshold", + "dataType": "INTEGER", + "defaultSelection": {} + }] +} +``` + +**Parameter types:** +- Single value: `"dataType": "INTEGER"` / `"DECIMAL"` / `"STRING"` +- Multi-select: Add `"complexType": "MULTI"` +- Range: `"dataType": "DATE", "complexType": "RANGE"` - use `:param.min` / `:param.max` + +--- + +## Widget Field Expressions + +Allowed in `query.fields` (no CAST or complex SQL): + +```json +// Aggregations +{"name": "sum(revenue)", "expression": "SUM(`revenue`)"} +{"name": "avg(price)", "expression": "AVG(`price`)"} +{"name": "count(id)", "expression": "COUNT(`id`)"} +{"name": "countdistinct(id)", "expression": "COUNT(DISTINCT `id`)"} + +// Date truncation +{"name": "daily(date)", "expression": "DATE_TRUNC(\"DAY\", `date`)"} +{"name": "weekly(date)", "expression": "DATE_TRUNC(\"WEEK\", `date`)"} +{"name": "monthly(date)", "expression": "DATE_TRUNC(\"MONTH\", `date`)"} + +// Simple reference +{"name": "category", "expression": "`category`"} +``` + +For conditional logic, compute in dataset SQL instead. 
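+
+For example, a conditional bucket that widget field expressions cannot produce can be computed once in the dataset SQL and then referenced like any other column. A minimal sketch (the `orders` table, columns, and thresholds are illustrative, not taken from the examples above):
+
+```sql
+-- Goes in the dataset's queryLines; widgets then use the plain field
+-- expression {"name": "size_bucket", "expression": "`size_bucket`"}
+SELECT
+  order_id,
+  amount,
+  CASE
+    WHEN amount >= 1000 THEN 'large'
+    WHEN amount >= 100 THEN 'medium'
+    ELSE 'small'
+  END AS size_bucket
+FROM catalog.schema.orders
+```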
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/2-advanced-widget-specifications.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/2-advanced-widget-specifications.md new file mode 100644 index 0000000..707cc1a --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/2-advanced-widget-specifications.md @@ -0,0 +1,177 @@ +# Advanced Widget Specifications + +Advanced visualization types for AI/BI dashboards. For core widgets (text, counter, table, bar, line, pie), see [1-widget-specifications.md](1-widget-specifications.md). + +--- + +## Area Chart + +- `version`: **3** +- `widgetType`: "area" +- Same structure as line chart - useful for showing cumulative values or emphasizing volume + +```json +"spec": { + "version": 3, + "widgetType": "area", + "encodings": { + "x": {"fieldName": "week_start", "scale": {"type": "temporal"}}, + "y": { + "scale": {"type": "quantitative"}, + "fields": [ + {"fieldName": "revenue_usd", "displayName": "Revenue"}, + {"fieldName": "returns_usd", "displayName": "Returns"} + ] + } + } +} +``` + +--- + +## Scatter Plot / Bubble Chart + +- `version`: **3** +- `widgetType`: "scatter" +- `x`, `y`: quantitative or temporal +- `size`: optional quantitative field for bubble size +- `color`: optional categorical or quantitative for grouping + +```json +"spec": { + "version": 3, + "widgetType": "scatter", + "encodings": { + "x": {"fieldName": "return_date", "scale": {"type": "temporal"}}, + "y": {"fieldName": "daily_returns", "scale": {"type": "quantitative"}}, + "size": {"fieldName": "count(*)", "scale": {"type": "quantitative"}}, + "color": {"fieldName": "category", "scale": {"type": "categorical"}} + } +} +``` + +--- + +## Combo Chart (Bar + Line) + +Combines bar and line visualizations on the same chart - useful for showing related metrics with different scales. + +- `version`: **1** +- `widgetType`: "combo" +- `y.primary`: bar chart fields +- `y.secondary`: line chart fields + +```json +{ + "widget": { + "name": "revenue-and-growth", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "metrics_ds", + "fields": [ + {"name": "daily(date)", "expression": "DATE_TRUNC(\"DAY\", `date`)"}, + {"name": "sum(revenue)", "expression": "SUM(`revenue`)"}, + {"name": "avg(growth_rate)", "expression": "AVG(`growth_rate`)"} + ], + "disaggregated": false + } + }], + "spec": { + "version": 1, + "widgetType": "combo", + "encodings": { + "x": {"fieldName": "daily(date)", "scale": {"type": "temporal"}}, + "y": { + "scale": {"type": "quantitative"}, + "primary": { + "fields": [{"fieldName": "sum(revenue)", "displayName": "Revenue ($)"}] + }, + "secondary": { + "fields": [{"fieldName": "avg(growth_rate)", "displayName": "Growth Rate"}] + } + }, + "label": {"show": false} + }, + "frame": {"title": "Revenue & Growth Rate", "showTitle": true} + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 5} +} +``` + +--- + +## Choropleth Map + +Displays geographic regions colored by aggregate values. Requires a field with geographic names (state names, country names, etc.). 
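+
+If the source table stores state codes rather than full names, derive the display name in the dataset SQL before binding it to the widget. A minimal sketch (the `dim_states` lookup table and column names are illustrative):
+
+```sql
+SELECT s.state_name, SUM(o.revenue) AS revenue
+FROM catalog.schema.orders o
+JOIN catalog.schema.dim_states s ON o.state_code = s.state_code
+GROUP BY s.state_name
+```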
+ +- `version`: **1** +- `widgetType`: "choropleth-map" +- `region`: defines the geographic area mapping +- `color`: quantitative field for coloring regions + +```json +"spec": { + "version": 1, + "widgetType": "choropleth-map", + "encodings": { + "region": { + "regionType": "mapbox-v4-admin", + "admin0": { + "type": "value", + "value": "United States", + "geographicRole": "admin0-name" + }, + "admin1": { + "fieldName": "state_name", + "type": "field", + "geographicRole": "admin1-name" + } + }, + "color": { + "fieldName": "sum(revenue)", + "scale": {"type": "quantitative"} + } + } +} +``` + +### Region Configuration + +**Region levels:** +- `admin0`: Country level - use `"type": "value"` with fixed country name +- `admin1`: State/Province level - use `"type": "field"` with your data column +- `admin2`: County/District level + +**Geographic roles:** +- `admin0-name`, `admin1-name`, `admin2-name` - match by name +- `admin0-iso`, `admin1-iso` - match by ISO code + +**Supported countries for admin1:** United States, Japan (prefectures), and others. + +### Color Scale for Maps + +> **Note**: Unlike other charts, choropleth-map supports additional color scale properties: +> - `scheme`: color scheme name (e.g., "YlGnBu") +> - `colorRamp`: custom color gradient +> - `mappings`: explicit value-to-color mappings + +--- + +## Other Visualization Types + +The following visualization types are available in Databricks AI/BI dashboards but are less commonly used. Refer to [Databricks documentation](https://docs.databricks.com/aws/en/visualizations/visualization-types) for details: + +| Widget Type | Description | +|-------------|-------------| +| heatmap | Color intensity grid for numerical data | +| histogram | Frequency distribution with configurable bins | +| funnel | Stage-based metric analysis | +| sankey | Flow visualization between value sets | +| box | Distribution summary with quartiles | +| marker-map | Latitude/longitude point markers | +| pivot | Drag-and-drop aggregation table | +| word-cloud | Word frequency visualization | +| sunburst | Hierarchical data in concentric circles | +| cohort | Group outcome analysis over time | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-examples.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-examples.md new file mode 100644 index 0000000..fe128d6 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-examples.md @@ -0,0 +1,305 @@ +# Complete Dashboard Examples + +Production-ready templates you can adapt for your use case. 
+ +## Basic Dashboard (NYC Taxi) + +```python +import json + +# Step 1: Check table schema +table_info = get_table_stats_and_schema(catalog="samples", schema="nyctaxi") + +# Step 2: Test queries +execute_sql("SELECT COUNT(*) as trips, AVG(fare_amount) as avg_fare, AVG(trip_distance) as avg_distance FROM samples.nyctaxi.trips") +execute_sql(""" + SELECT pickup_zip, COUNT(*) as trip_count + FROM samples.nyctaxi.trips + GROUP BY pickup_zip + ORDER BY trip_count DESC + LIMIT 10 +""") + +# Step 3: Build dashboard JSON +dashboard = { + "datasets": [ + { + "name": "summary", + "displayName": "Summary Stats", + "queryLines": [ + "SELECT COUNT(*) as trips, AVG(fare_amount) as avg_fare, ", + "AVG(trip_distance) as avg_distance ", + "FROM samples.nyctaxi.trips " + ] + }, + { + "name": "by_zip", + "displayName": "Trips by ZIP", + "queryLines": [ + "SELECT pickup_zip, COUNT(*) as trip_count ", + "FROM samples.nyctaxi.trips ", + "GROUP BY pickup_zip ", + "ORDER BY trip_count DESC ", + "LIMIT 10 " + ] + } + ], + "pages": [{ + "name": "overview", + "displayName": "NYC Taxi Overview", + "pageType": "PAGE_TYPE_CANVAS", + "layout": [ + # Text header - NO spec block! Use SEPARATE widgets for title and subtitle! + { + "widget": { + "name": "title", + "multilineTextboxSpec": { + "lines": ["## NYC Taxi Dashboard"] + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 1} + }, + { + "widget": { + "name": "subtitle", + "multilineTextboxSpec": { + "lines": ["Trip statistics and analysis"] + } + }, + "position": {"x": 0, "y": 1, "width": 6, "height": 1} + }, + # Counter - version 2, width 2! + { + "widget": { + "name": "total-trips", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "summary", + "fields": [{"name": "trips", "expression": "`trips`"}], + "disaggregated": True + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "trips", "displayName": "Total Trips"} + }, + "frame": {"title": "Total Trips", "showTitle": True} + } + }, + "position": {"x": 0, "y": 2, "width": 2, "height": 3} + }, + { + "widget": { + "name": "avg-fare", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "summary", + "fields": [{"name": "avg_fare", "expression": "`avg_fare`"}], + "disaggregated": True + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "avg_fare", "displayName": "Avg Fare"} + }, + "frame": {"title": "Average Fare", "showTitle": True} + } + }, + "position": {"x": 2, "y": 2, "width": 2, "height": 3} + }, + { + "widget": { + "name": "total-distance", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "summary", + "fields": [{"name": "avg_distance", "expression": "`avg_distance`"}], + "disaggregated": True + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "avg_distance", "displayName": "Avg Distance"} + }, + "frame": {"title": "Average Distance", "showTitle": True} + } + }, + "position": {"x": 4, "y": 2, "width": 2, "height": 3} + }, + # Bar chart - version 3 + { + "widget": { + "name": "trips-by-zip", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "by_zip", + "fields": [ + {"name": "pickup_zip", "expression": "`pickup_zip`"}, + {"name": "trip_count", "expression": "`trip_count`"} + ], + "disaggregated": True + } + }], + "spec": { + "version": 3, + "widgetType": "bar", + "encodings": { + "x": {"fieldName": "pickup_zip", "scale": {"type": "categorical"}, "displayName": "ZIP"}, + 
"y": {"fieldName": "trip_count", "scale": {"type": "quantitative"}, "displayName": "Trips"} + }, + "frame": {"title": "Trips by Pickup ZIP", "showTitle": True} + } + }, + "position": {"x": 0, "y": 5, "width": 6, "height": 5} + }, + # Table - version 2, minimal column props! + { + "widget": { + "name": "zip-table", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "by_zip", + "fields": [ + {"name": "pickup_zip", "expression": "`pickup_zip`"}, + {"name": "trip_count", "expression": "`trip_count`"} + ], + "disaggregated": True + } + }], + "spec": { + "version": 2, + "widgetType": "table", + "encodings": { + "columns": [ + {"fieldName": "pickup_zip", "displayName": "ZIP Code"}, + {"fieldName": "trip_count", "displayName": "Trip Count"} + ] + }, + "frame": {"title": "Top ZIP Codes", "showTitle": True} + } + }, + "position": {"x": 0, "y": 10, "width": 6, "height": 5} + } + ] + }] +} + +# Step 4: Deploy +result = manage_dashboard( + action="create_or_update", + display_name="NYC Taxi Dashboard", + parent_path="/Workspace/Users/me/dashboards", + serialized_dashboard=json.dumps(dashboard), + warehouse_id=manage_warehouse(action="get_best"), +) +print(result["url"]) +``` + +## Dashboard with Global Filters + +```python +import json + +# Dashboard with a global filter for region +dashboard_with_filters = { + "datasets": [ + { + "name": "sales", + "displayName": "Sales Data", + "queryLines": [ + "SELECT region, SUM(revenue) as total_revenue ", + "FROM catalog.schema.sales ", + "GROUP BY region" + ] + } + ], + "pages": [ + { + "name": "overview", + "displayName": "Sales Overview", + "pageType": "PAGE_TYPE_CANVAS", + "layout": [ + { + "widget": { + "name": "total-revenue", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "sales", + "fields": [{"name": "total_revenue", "expression": "`total_revenue`"}], + "disaggregated": True + } + }], + "spec": { + "version": 2, # Version 2 for counters! + "widgetType": "counter", + "encodings": { + "value": {"fieldName": "total_revenue", "displayName": "Total Revenue"} + }, + "frame": {"title": "Total Revenue", "showTitle": True} + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 3} + } + ] + }, + { + "name": "filters", + "displayName": "Filters", + "pageType": "PAGE_TYPE_GLOBAL_FILTERS", # Required for global filter page! + "layout": [ + { + "widget": { + "name": "filter_region", + "queries": [{ + "name": "ds_sales_region", + "query": { + "datasetName": "sales", + "fields": [ + {"name": "region", "expression": "`region`"} + # DO NOT use associative_filter_predicate_group - causes SQL errors! + ], + "disaggregated": False # False for filters! + } + }], + "spec": { + "version": 2, # Version 2 for filters! + "widgetType": "filter-multi-select", # NOT "filter"! + "encodings": { + "fields": [{ + "fieldName": "region", + "displayName": "Region", + "queryName": "ds_sales_region" # Must match query name! + }] + }, + "frame": {"showTitle": True, "title": "Region"} # Always show title! 
+ } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} + } + ] + } + ] +} + +# Deploy with filters +result = manage_dashboard( + action="create_or_update", + display_name="Sales Dashboard with Filters", + parent_path="/Workspace/Users/me/dashboards", + serialized_dashboard=json.dumps(dashboard_with_filters), + warehouse_id=manage_warehouse(action="get_best"), +) +print(result["url"]) +``` diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-filters.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-filters.md new file mode 100644 index 0000000..f1c5508 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/3-filters.md @@ -0,0 +1,240 @@ +# Filters (Global vs Page-Level) + +> **CRITICAL**: Filter widgets use DIFFERENT widget types than charts! +> - Valid types: `filter-multi-select`, `filter-single-select`, `filter-date-range-picker` +> - **DO NOT** use `widgetType: "filter"` - this does not exist and will cause errors +> - Filters use `spec.version: 2` +> - **ALWAYS include `frame` with `showTitle: true`** for filter widgets + +**Filter widget types:** +- `filter-date-range-picker`: for DATE/TIMESTAMP fields (date range selection) +- `filter-single-select`: categorical with single selection +- `filter-multi-select`: categorical with multiple selections (preferred for drill-down) + +> **Performance note**: Global filters automatically apply `WHERE` clauses to dataset queries at runtime. You don't need to pre-filter data in your SQL - the dashboard engine handles this efficiently. + +--- + +## Global Filters vs Page-Level Filters + +| Type | Placement | Scope | Use Case | +|------|-----------|-------|----------| +| **Global Filter** | Dedicated page with `"pageType": "PAGE_TYPE_GLOBAL_FILTERS"` | Affects ALL pages that have datasets with the filter field | Cross-dashboard filtering (e.g., date range, campaign) | +| **Page-Level Filter** | Regular page with `"pageType": "PAGE_TYPE_CANVAS"` | Affects ONLY widgets on that same page | Page-specific filtering (e.g., platform filter on breakdown page only) | + +**Key Insight**: A filter only affects datasets that contain the filter field. To have a filter affect only specific pages: +1. Include the filter dimension in datasets for pages that should be filtered +2. Exclude the filter dimension from datasets for pages that should NOT be filtered + +--- + +## Filter Widget Structure + +> **CRITICAL**: Do NOT use `associative_filter_predicate_group` - it causes SQL errors! +> Use a simple field expression instead. + +```json +{ + "widget": { + "name": "filter_region", + "queries": [{ + "name": "ds_data_region", // Query name - must match queryName in encodings! + "query": { + "datasetName": "ds_data", + "fields": [ + {"name": "region", "expression": "`region`"} + ], + "disaggregated": false // CRITICAL: Always false for filters! + } + }], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": { + "fields": [{ + "fieldName": "region", + "displayName": "Region", + "queryName": "ds_data_region" // Must match queries[].name above! 
+ }] + }, + "frame": {"showTitle": true, "title": "Region"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} +} +``` + +--- + +## Global Filter Example + +Place on a dedicated filter page: + +```json +{ + "name": "filters", + "displayName": "Filters", + "pageType": "PAGE_TYPE_GLOBAL_FILTERS", + "layout": [ + { + "widget": { + "name": "filter_campaign", + "queries": [{ + "name": "ds_campaign", + "query": { + "datasetName": "overview", + "fields": [{"name": "campaign_name", "expression": "`campaign_name`"}], + "disaggregated": false + } + }], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": { + "fields": [{ + "fieldName": "campaign_name", + "displayName": "Campaign", + "queryName": "ds_campaign" + }] + }, + "frame": {"showTitle": true, "title": "Campaign"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} + } + ] +} +``` + +--- + +## Page-Level Filter Example + +Place filter widget directly on a `PAGE_TYPE_CANVAS` page (same widget structure as global filter, but only affects that page): + +```json +{ + "name": "platform_breakdown", + "displayName": "Platform Breakdown", + "pageType": "PAGE_TYPE_CANVAS", + "layout": [ + {"widget": {...}, "position": {...}}, + { + "widget": { + "name": "filter_platform", + "queries": [{"name": "ds_platform", "query": {"datasetName": "platform_data", "fields": [{"name": "platform", "expression": "`platform`"}], "disaggregated": false}}], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": {"fields": [{"fieldName": "platform", "displayName": "Platform", "queryName": "ds_platform"}]}, + "frame": {"showTitle": true, "title": "Platform"} + } + }, + "position": {"x": 4, "y": 0, "width": 2, "height": 2} + } + ] +} +``` + +--- + +## Date Range Filtering + +> **Best Practice**: Most dashboards should include a date range filter. However, metrics that are not based on a time range (like "MRR" or "All-Time Total") should NOT be date-filtered - omit them from the filter's queries. 
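+ +For example, to keep an all-time KPI out of a date range filter, bind the filter only to the time-based datasets (a minimal sketch; dataset and field names are hypothetical): + +```python +# Only "weekly_trend" is bound to the date filter; "alltime_totals" has no +# query entry here, so widgets built on it ignore the date range selection +date_filter_queries = [ + { + "name": "q_weekly", + "query": { + "datasetName": "weekly_trend", + "fields": [{"name": "week_start", "expression": "`week_start`"}], + "disaggregated": False + } + } +] +```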
+ +**Two binding approaches** (can be combined in one filter): +- **Field-based**: Bind to a date column in SELECT → filter auto-applies `IN_RANGE()` +- **Parameter-based**: Use `:param.min`/`:param.max` in WHERE clause for pre-aggregation filtering + +```json +// Dataset with parameter (for aggregated queries) +{ + "name": "revenue_by_category", + "queryLines": [ + "SELECT category, SUM(revenue) as revenue FROM catalog.schema.orders ", + "WHERE order_date BETWEEN :date_range.min AND :date_range.max ", + "GROUP BY category" + ], + "parameters": [{ + "keyword": "date_range", "dataType": "DATE", "complexType": "RANGE", + "defaultSelection": {"range": {"dataType": "DATE", "min": {"value": "now-12M/M"}, "max": {"value": "now/M"}}} + }] +} + +// Filter widget binding to both field and parameter +{ + "widget": { + "name": "date_range_filter", + "queries": [ + {"name": "q_trend", "query": {"datasetName": "weekly_trend", "fields": [{"name": "week_start", "expression": "`week_start`"}], "disaggregated": false}}, + {"name": "q_category", "query": {"datasetName": "revenue_by_category", "parameters": [{"name": "date_range", "keyword": "date_range"}], "disaggregated": false}} + ], + "spec": { + "version": 2, + "widgetType": "filter-date-range-picker", + "encodings": { + "fields": [ + {"fieldName": "week_start", "queryName": "q_trend"}, + {"parameterName": "date_range", "queryName": "q_category"} + ] + }, + "frame": {"showTitle": true, "title": "Date Range"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} +} +``` + +--- + +## Multi-Dataset Filters + +When a filter should affect multiple datasets (e.g., "Region" filter for both sales and customers data), add multiple queries - one per dataset: + +```json +{ + "widget": { + "name": "filter_region", + "queries": [ + { + "name": "sales_region", + "query": { + "datasetName": "sales", + "fields": [{"name": "region", "expression": "`region`"}], + "disaggregated": false + } + }, + { + "name": "customers_region", + "query": { + "datasetName": "customers", + "fields": [{"name": "region", "expression": "`region`"}], + "disaggregated": false + } + } + ], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": { + "fields": [ + {"fieldName": "region", "displayName": "Region (Sales)", "queryName": "sales_region"}, + {"fieldName": "region", "displayName": "Region (Customers)", "queryName": "customers_region"} + ] + }, + "frame": {"showTitle": true, "title": "Region"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} +} +``` + +Each `queryName` in `encodings.fields` binds the filter to that specific dataset. Datasets not bound will not be filtered. + +--- + +## Filter Layout Guidelines + +- Global filters: Position on dedicated filter page, stack vertically at `x=0` +- Page-level filters: Position in header area of page (e.g., top-right corner) +- Typical sizing: `width: 2, height: 2` diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/4-examples.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/4-examples.md new file mode 100644 index 0000000..8c2d015 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/4-examples.md @@ -0,0 +1,496 @@ +# Complete Dashboard Example + +This is a **reference example** to understand the JSON structure and layout patterns. **Always adapt to what the user requests** - use their tables, metrics, and visualizations. 
This example demonstrates the correct syntax; your dashboard should reflect the user's actual requirements. + +## Key Patterns (Read First) + +### 1. Page Types (Required) +- `PAGE_TYPE_CANVAS` - Main content page with widgets +- `PAGE_TYPE_GLOBAL_FILTERS` - Dedicated filter page that affects all canvas pages + +### 2. Widget Versions (Critical!) +| Widget Type | Version | +|-------------|---------| +| `counter`, `table` | **2** | +| `bar`, `line`, `area`, `pie` | **3** | +| `filter-*` | **2** | + +### 3. KPI Counter with Currency Formatting +```json +"format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact", + "decimalPlaces": {"type": "max", "places": 1} +} +``` + +### 4. Filter Binding to Multiple Datasets +Each filter query binds the filter to one dataset. Add multiple queries to filter multiple datasets: +```json +"queries": [ + {"name": "ds1_region", "query": {"datasetName": "dataset1", ...}}, + {"name": "ds2_region", "query": {"datasetName": "dataset2", ...}} +] +``` + +### 5. Layout Grid (6 columns) +``` +y=0: Header with title + description (w=6, h=2) +y=2: KPI(w=2,h=3) | KPI(w=2,h=3) | KPI(w=2,h=3) ← fills 6 +y=5: Section header (w=6, h=1) +y=6: Area chart (w=6, h=5) +y=11: Section header (w=6, h=1) +y=12: Pie(w=2,h=5) | Bar chart(w=4,h=5) ← fills 6 +``` + +Use `\n\n` in text widget lines array to create line breaks within a single widget. + +--- + +## Full Dashboard: Sales Analytics + +This example shows a complete dashboard with: +- Title and subtitle text widgets +- 3 KPI counters with currency/number formatting +- Area chart for time series trends +- Pie chart for category breakdown +- Bar chart with color grouping by region +- Data table for detailed records +- Global filters (date range, region, category) + +```json +{ + "datasets": [ + { + "name": "ds_daily_sales", + "displayName": "Daily Sales", + "queryLines": [ + "SELECT sale_date, region, department, total_orders, total_units, total_revenue, total_cost, profit_margin ", + "FROM catalog.schema.gold_daily_sales ", + "ORDER BY sale_date" + ] + }, + { + "name": "ds_products", + "displayName": "Product Performance", + "queryLines": [ + "SELECT product_id, product_name, department, region, units_sold, revenue, cost, profit ", + "FROM catalog.schema.gold_product_performance" + ] + } + ], + "pages": [ + { + "name": "sales_overview", + "displayName": "Sales Overview", + "pageType": "PAGE_TYPE_CANVAS", + "layout": [ + { + "widget": { + "name": "header", + "multilineTextboxSpec": { + "lines": ["# Sales Dashboard\n\nMonitor daily sales, revenue, and profit margins across regions and departments."] + } + }, + "position": {"x": 0, "y": 0, "width": 6, "height": 2} + }, + { + "widget": { + "name": "kpi_revenue", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "sum(total_revenue)", "expression": "SUM(`total_revenue`)"}], + "disaggregated": false + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": { + "fieldName": "sum(total_revenue)", + "displayName": "Total Revenue", + "format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact", + "decimalPlaces": {"type": "max", "places": 1} + } + } + }, + "frame": {"title": "Total Revenue", "showTitle": true, "description": "For the selected period", "showDescription": true} + } + }, + "position": {"x": 0, "y": 2, "width": 2, "height": 3} + }, + { + "widget": { + "name": "kpi_orders", + "queries": [{ + "name": "main_query", + 
"query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "sum(total_orders)", "expression": "SUM(`total_orders`)"}], + "disaggregated": false + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": { + "fieldName": "sum(total_orders)", + "displayName": "Total Orders", + "format": { + "type": "number", + "abbreviation": "compact", + "decimalPlaces": {"type": "max", "places": 0} + } + } + }, + "frame": {"title": "Total Orders", "showTitle": true, "description": "For the selected period", "showDescription": true} + } + }, + "position": {"x": 2, "y": 2, "width": 2, "height": 3} + }, + { + "widget": { + "name": "kpi_profit", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "avg(profit_margin)", "expression": "AVG(`profit_margin`)"}], + "disaggregated": false + } + }], + "spec": { + "version": 2, + "widgetType": "counter", + "encodings": { + "value": { + "fieldName": "avg(profit_margin)", + "displayName": "Avg Profit Margin", + "format": { + "type": "number-percent", + "decimalPlaces": {"type": "max", "places": 1} + } + } + }, + "frame": {"title": "Profit Margin", "showTitle": true, "description": "Average for period", "showDescription": true} + } + }, + "position": {"x": 4, "y": 2, "width": 2, "height": 3} + }, + { + "widget": { + "name": "section_trends", + "multilineTextboxSpec": { + "lines": ["## Revenue Trend"] + } + }, + "position": {"x": 0, "y": 5, "width": 6, "height": 1} + }, + { + "widget": { + "name": "chart_revenue_trend", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "ds_daily_sales", + "fields": [ + {"name": "sale_date", "expression": "`sale_date`"}, + {"name": "sum(total_revenue)", "expression": "SUM(`total_revenue`)"} + ], + "disaggregated": false + } + }], + "spec": { + "version": 3, + "widgetType": "area", + "encodings": { + "x": { + "fieldName": "sale_date", + "scale": {"type": "temporal"}, + "axis": {"title": "Date"}, + "displayName": "Date" + }, + "y": { + "fieldName": "sum(total_revenue)", + "scale": {"type": "quantitative"}, + "format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact" + }, + "axis": {"title": "Revenue ($)"}, + "displayName": "Revenue ($)" + } + }, + "frame": { + "title": "Daily Revenue", + "showTitle": true, + "description": "Track daily revenue trends" + } + } + }, + "position": {"x": 0, "y": 6, "width": 6, "height": 5} + }, + { + "widget": { + "name": "section_breakdown", + "multilineTextboxSpec": { + "lines": ["## Breakdown"] + } + }, + "position": {"x": 0, "y": 11, "width": 6, "height": 1} + }, + { + "widget": { + "name": "chart_by_department", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "ds_daily_sales", + "fields": [ + {"name": "department", "expression": "`department`"}, + {"name": "sum(total_revenue)", "expression": "SUM(`total_revenue`)"} + ], + "disaggregated": false + } + }], + "spec": { + "version": 3, + "widgetType": "pie", + "encodings": { + "angle": { + "fieldName": "sum(total_revenue)", + "scale": {"type": "quantitative"}, + "displayName": "Revenue" + }, + "color": { + "fieldName": "department", + "scale": {"type": "categorical"}, + "displayName": "Department" + }, + "label": {"show": true} + }, + "frame": {"title": "Revenue by Department", "showTitle": true} + } + }, + "position": {"x": 0, "y": 12, "width": 2, "height": 5} + }, + { + "widget": { + "name": "chart_by_region", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": 
"ds_daily_sales", + "fields": [ + {"name": "sale_date", "expression": "`sale_date`"}, + {"name": "region", "expression": "`region`"}, + {"name": "sum(total_revenue)", "expression": "SUM(`total_revenue`)"} + ], + "disaggregated": false + } + }], + "spec": { + "version": 3, + "widgetType": "bar", + "encodings": { + "x": { + "fieldName": "sale_date", + "scale": {"type": "temporal"}, + "axis": {"title": "Date"}, + "displayName": "Date" + }, + "y": { + "fieldName": "sum(total_revenue)", + "scale": {"type": "quantitative"}, + "format": { + "type": "number-currency", + "currencyCode": "USD", + "abbreviation": "compact" + }, + "axis": {"title": "Revenue ($)"}, + "displayName": "Revenue ($)" + }, + "color": { + "fieldName": "region", + "scale": {"type": "categorical"}, + "displayName": "Region" + } + }, + "frame": {"title": "Revenue by Region", "showTitle": true} + } + }, + "position": {"x": 2, "y": 12, "width": 4, "height": 5} + }, + { + "widget": { + "name": "section_products", + "multilineTextboxSpec": { + "lines": ["## Top Products"] + } + }, + "position": {"x": 0, "y": 17, "width": 6, "height": 1} + }, + { + "widget": { + "name": "table_products", + "queries": [{ + "name": "main_query", + "query": { + "datasetName": "ds_products", + "fields": [ + {"name": "product_name", "expression": "`product_name`"}, + {"name": "department", "expression": "`department`"}, + {"name": "units_sold", "expression": "`units_sold`"}, + {"name": "revenue", "expression": "`revenue`"}, + {"name": "profit", "expression": "`profit`"} + ], + "disaggregated": true + } + }], + "spec": { + "version": 2, + "widgetType": "table", + "encodings": { + "columns": [ + {"fieldName": "product_name", "displayName": "Product"}, + {"fieldName": "department", "displayName": "Department"}, + {"fieldName": "units_sold", "displayName": "Units Sold"}, + {"fieldName": "revenue", "displayName": "Revenue ($)"}, + {"fieldName": "profit", "displayName": "Profit ($)"} + ] + }, + "frame": { + "title": "Product Performance", + "showTitle": true, + "description": "Top products by revenue" + } + } + }, + "position": {"x": 0, "y": 18, "width": 6, "height": 6} + } + ] + }, + { + "name": "global_filters", + "displayName": "Filters", + "pageType": "PAGE_TYPE_GLOBAL_FILTERS", + "layout": [ + { + "widget": { + "name": "filter_date_range", + "queries": [ + { + "name": "ds_sales_date", + "query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "sale_date", "expression": "`sale_date`"}], + "disaggregated": false + } + } + ], + "spec": { + "version": 2, + "widgetType": "filter-date-range-picker", + "encodings": { + "fields": [ + {"fieldName": "sale_date", "displayName": "Date", "queryName": "ds_sales_date"} + ] + }, + "selection": { + "defaultSelection": { + "range": { + "dataType": "DATE", + "min": {"value": "now/y"}, + "max": {"value": "now/y"} + } + } + }, + "frame": {"showTitle": true, "title": "Date Range"} + } + }, + "position": {"x": 0, "y": 0, "width": 2, "height": 2} + }, + { + "widget": { + "name": "filter_region", + "queries": [ + { + "name": "ds_sales_region", + "query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "region", "expression": "`region`"}], + "disaggregated": false + } + }, + { + "name": "ds_products_region", + "query": { + "datasetName": "ds_products", + "fields": [{"name": "region", "expression": "`region`"}], + "disaggregated": false + } + } + ], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": { + "fields": [ + {"fieldName": "region", "displayName": "Region", "queryName": 
"ds_sales_region"}, + {"fieldName": "region", "displayName": "Region", "queryName": "ds_products_region"} + ] + }, + "frame": {"showTitle": true, "title": "Region"} + } + }, + "position": {"x": 2, "y": 0, "width": 2, "height": 2} + }, + { + "widget": { + "name": "filter_department", + "queries": [ + { + "name": "ds_sales_dept", + "query": { + "datasetName": "ds_daily_sales", + "fields": [{"name": "department", "expression": "`department`"}], + "disaggregated": false + } + }, + { + "name": "ds_products_dept", + "query": { + "datasetName": "ds_products", + "fields": [{"name": "department", "expression": "`department`"}], + "disaggregated": false + } + } + ], + "spec": { + "version": 2, + "widgetType": "filter-multi-select", + "encodings": { + "fields": [ + {"fieldName": "department", "displayName": "Department", "queryName": "ds_sales_dept"}, + {"fieldName": "department", "displayName": "Department", "queryName": "ds_products_dept"} + ] + }, + "frame": {"showTitle": true, "title": "Department"} + } + }, + "position": {"x": 4, "y": 0, "width": 2, "height": 2} + } + ] + } + ] +} +``` diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/5-troubleshooting.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/5-troubleshooting.md new file mode 100644 index 0000000..8c99d9e --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/5-troubleshooting.md @@ -0,0 +1,100 @@ +# Troubleshooting + +Common errors and fixes for AI/BI dashboards. + +## Structural Errors (JSON Parse Failures) + +These errors occur when the JSON structure is wrong: + +| Error | Cause | Fix | +|-------|-------|-----| +| "failed to parse serialized dashboard" | Wrong JSON structure | Check: `queryLines` is array (not `"query": "string"`), widgets inline in `layout[].widget`, `pageType` on every page | +| "no selected fields to visualize" | `fields[].name` ≠ `encodings.fieldName` | Names must match exactly (e.g., both `"sum(spend)"`) | +| Widgets in wrong location | Used separate `"widgets"` array | Widgets must be INLINE: `layout[]: {widget: {...}, position: {...}}` | +| Missing page content | Omitted `pageType` | Add `"pageType": "PAGE_TYPE_CANVAS"` or `"PAGE_TYPE_GLOBAL_FILTERS"` | + +--- + +## Widget shows "no selected fields to visualize" + +**This is a field name mismatch error.** The `name` in `query.fields` must exactly match the `fieldName` in `encodings`. + +**Fix:** Ensure names match exactly: +```json +// WRONG - names don't match +"fields": [{"name": "spend", "expression": "SUM(`spend`)"}] +"encodings": {"value": {"fieldName": "sum(spend)", ...}} // ERROR! + +// CORRECT - names match +"fields": [{"name": "sum(spend)", "expression": "SUM(`spend`)"}] +"encodings": {"value": {"fieldName": "sum(spend)", ...}} // OK! +``` + +## Widget shows "Invalid widget definition" + +**Check version numbers:** +- Counters: `version: 2` (NOT 3!) +- Tables: `version: 2` (NOT 1 or 3!) 
+- Filters: `version: 2` +- Bar/Line/Pie/Area/Scatter charts: `version: 3` +- Combo/Choropleth-map: `version: 1` + +**Text widget errors:** +- Text widgets must NOT have a `spec` block +- Use `multilineTextboxSpec` directly on the widget object +- Do NOT use `widgetType: "text"` - this is invalid + +**Table widget errors:** +- Use `version: 2` (NOT 1 or 3) +- Column objects only need `fieldName` and `displayName` +- Do NOT add `type`, `numberFormat`, or other column properties + +**Counter widget errors:** +- Use `version: 2` (NOT 3) +- Ensure dataset returns exactly 1 row for `disaggregated: true` + +## Dashboard shows empty widgets + +- Run the dataset SQL query directly to check data exists +- Verify column aliases match widget field expressions +- Check `disaggregated` flag: + - `true` for pre-aggregated data (1 row) + - `false` when widget performs aggregation (multi-row) + +## Layout has gaps + +- Ensure each row sums to width=6 +- Check that y positions don't skip values + +## Filter shows "Invalid widget definition" + +- Check `widgetType` is one of: `filter-multi-select`, `filter-single-select`, `filter-date-range-picker` +- **DO NOT** use `widgetType: "filter"` - this is invalid +- Verify `spec.version` is `2` +- Ensure `queryName` in encodings matches the query `name` +- Confirm `disaggregated: false` in filter queries +- Ensure `frame` with `showTitle: true` is included + +## Filter not affecting expected pages + +- **Global filters** (on `PAGE_TYPE_GLOBAL_FILTERS` page) affect all datasets containing the filter field +- **Page-level filters** (on `PAGE_TYPE_CANVAS` page) only affect widgets on that same page +- A filter only works on datasets that include the filter dimension column + +## Filter shows "UNRESOLVED_COLUMN" error for `associative_filter_predicate_group` + +- **DO NOT** use `COUNT_IF(\`associative_filter_predicate_group\`)` in filter queries +- This internal expression causes SQL errors when the dashboard executes queries +- Use a simple field expression instead: `{"name": "field", "expression": "\`field\`"}` + +## Text widget shows title and description on same line + +- Multiple items in the `lines` array are **concatenated**, not displayed on separate lines +- Use **separate text widgets** for title and subtitle at different y positions +- Example: title at y=0 with height=1, subtitle at y=1 with height=1 + +## Chart unreadable (too many categories) + +- Use TOP-N + "Other" bucketing in dataset SQL +- Aggregate to a higher level (region instead of store) +- Use a table widget instead of a chart for high-cardinality data diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/SKILL.md new file mode 100644 index 0000000..99cff12 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-aibi-dashboards/SKILL.md @@ -0,0 +1,213 @@ +--- +name: databricks-aibi-dashboards +description: "Create Databricks AI/BI dashboards. Use when creating, updating, or deploying Lakeview dashboards. CRITICAL: You MUST test ALL SQL queries via execute_sql BEFORE deploying. Follow guidelines strictly." +--- + +# AI/BI Dashboard Skill + +Create Databricks AI/BI dashboards (formerly Lakeview dashboards). **Follow these guidelines strictly.** + +## CRITICAL: MANDATORY VALIDATION WORKFLOW + +**You MUST follow this workflow exactly. 
Skipping validation causes broken dashboards.** + +``` +┌─────────────────────────────────────────────────────────────────────┐ +│ STEP 1: Get table schemas via get_table_stats_and_schema(catalog, schema) │ +├─────────────────────────────────────────────────────────────────────┤ +│ STEP 2: Write SQL queries for each dataset │ +├─────────────────────────────────────────────────────────────────────┤ +│ STEP 3: TEST EVERY QUERY via execute_sql() ← DO NOT SKIP! │ +│ - If query fails, FIX IT before proceeding │ +│ - Verify column names match what widgets will reference │ +│ - Verify data types are correct (dates, numbers, strings) │ +├─────────────────────────────────────────────────────────────────────┤ +│ STEP 4: Build dashboard JSON using ONLY verified queries │ +├─────────────────────────────────────────────────────────────────────┤ +│ STEP 5: Deploy via manage_dashboard(action="create_or_update") │ +└─────────────────────────────────────────────────────────────────────┘ +``` + +**WARNING: If you deploy without testing queries, widgets WILL show "Invalid widget definition" errors!** + +## Available MCP Tools + +| Tool | Description | +|------|-------------| +| `get_table_stats_and_schema` | **STEP 1**: Get table schemas for designing queries | +| `execute_sql` | **STEP 3**: Test SQL queries - MANDATORY before deployment! | +| `manage_warehouse` (action="get_best") | Get available warehouse ID | +| `manage_dashboard` | **STEP 5**: Dashboard lifecycle management (see actions below) | + +### manage_dashboard Actions + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Deploy dashboard JSON (only after validation!) | display_name, parent_path, serialized_dashboard, warehouse_id | +| `get` | Get dashboard details by ID | dashboard_id | +| `list` | List all dashboards | (none) | +| `delete` | Move dashboard to trash | dashboard_id | +| `publish` | Publish a dashboard | dashboard_id, warehouse_id | +| `unpublish` | Unpublish a dashboard | dashboard_id | + +**Example usage:** +```python +# Create/update dashboard +manage_dashboard( + action="create_or_update", + display_name="Sales Dashboard", + parent_path="/Workspace/Users/me/dashboards", + serialized_dashboard=dashboard_json, + warehouse_id="abc123", + publish=True # auto-publish after create +) + +# Get dashboard details +manage_dashboard(action="get", dashboard_id="dashboard_123") + +# List all dashboards +manage_dashboard(action="list") +``` + +## Reference Files + +| What are you building? 
| Reference | +|------------------------|-----------| +| Any widget (text, counter, table, chart) | [1-widget-specifications.md](1-widget-specifications.md) | +| Advanced charts (area, scatter, combo, maps) | [2-advanced-widget-specifications.md](2-advanced-widget-specifications.md) | +| Dashboard with filters (global or page-level) | [3-filters.md](3-filters.md) | +| Need a complete working template to adapt | [3-examples.md](3-examples.md) and [4-examples.md](4-examples.md) | +| Debugging a broken dashboard | [5-troubleshooting.md](5-troubleshooting.md) | + +--- + +## Implementation Guidelines + +### 1) DATASET ARCHITECTURE + +- **One dataset per domain** (e.g., orders, customers, products) +- **Exactly ONE valid SQL query per dataset** (no multiple queries separated by `;`) +- Always use **fully-qualified table names**: `catalog.schema.table_name` +- SELECT must include all dimensions needed by widgets and all derived columns via `AS` aliases +- Put ALL business logic (CASE/WHEN, COALESCE, ratios) into the dataset SELECT with explicit aliases +- **Contract rule**: Every widget `fieldName` must exactly match a dataset column or alias + +### 2) WIDGET FIELD EXPRESSIONS + +> **CRITICAL: Field Name Matching Rule** +> The `name` in `query.fields` MUST exactly match the `fieldName` in `encodings`. +> If they don't match, the widget shows "no selected fields to visualize" error! + +**Correct pattern for aggregations:** +```json +// In query.fields: +{"name": "sum(spend)", "expression": "SUM(`spend`)"} + +// In encodings (must match!): +{"fieldName": "sum(spend)", "displayName": "Total Spend"} +``` + +**WRONG - names don't match:** +```json +// In query.fields: +{"name": "spend", "expression": "SUM(`spend`)"} // name is "spend" + +// In encodings: +{"fieldName": "sum(spend)", ...} // ERROR: "sum(spend)" ≠ "spend" +``` + +Allowed expressions in widget queries (you CANNOT use CAST or other SQL in expressions): + +**For numbers:** +```json +{"name": "sum(revenue)", "expression": "SUM(`revenue`)"} +{"name": "avg(price)", "expression": "AVG(`price`)"} +{"name": "count(orders)", "expression": "COUNT(`order_id`)"} +{"name": "countdistinct(customers)", "expression": "COUNT(DISTINCT `customer_id`)"} +{"name": "min(date)", "expression": "MIN(`order_date`)"} +{"name": "max(date)", "expression": "MAX(`order_date`)"} +``` + +**For dates** (use daily for timeseries, weekly/monthly for grouped comparisons): +```json +{"name": "daily(date)", "expression": "DATE_TRUNC(\"DAY\", `date`)"} +{"name": "weekly(date)", "expression": "DATE_TRUNC(\"WEEK\", `date`)"} +{"name": "monthly(date)", "expression": "DATE_TRUNC(\"MONTH\", `date`)"} +``` + +**Simple field reference** (for pre-aggregated data): +```json +{"name": "category", "expression": "`category`"} +``` + +If you need conditional logic or multi-field formulas, compute a derived column in the dataset SQL first. + +### 3) SPARK SQL PATTERNS + +- Date math: `date_sub(current_date(), N)` for days, `add_months(current_date(), -N)` for months +- Date truncation: `DATE_TRUNC('DAY'|'WEEK'|'MONTH'|'QUARTER'|'YEAR', column)` +- **AVOID** `INTERVAL` syntax - use functions instead + +### 4) LAYOUT (6-Column Grid, NO GAPS) + +Each widget has a position: `{"x": 0, "y": 0, "width": 2, "height": 4}` + +**CRITICAL**: Each row must fill width=6 exactly. No gaps allowed. + +**Recommended widget sizes:** + +| Widget Type | Width | Height | Notes | +|-------------|-------|--------|-------| +| Text header | 6 | 1 | Full width; use SEPARATE widgets for title and subtitle | +| Counter/KPI | 2 | **3-4** | **NEVER height=2** - too cramped! 
| +| Line/Bar chart | 3 | **5-6** | Pair side-by-side to fill row | +| Pie chart | 3 | **5-6** | Needs space for legend | +| Full-width chart | 6 | 5-7 | For detailed time series | +| Table | 6 | 5-8 | Full width for readability | + +**Standard dashboard structure:** +```text +y=0: Title (w=6, h=1) - Dashboard title (use separate widget!) +y=1: Subtitle (w=6, h=1) - Description (use separate widget!) +y=2: KPIs (w=2 each, h=3) - 3 key metrics side-by-side +y=5: Section header (w=6, h=1) - "Trends" or similar +y=6: Charts (w=3 each, h=5) - Two charts side-by-side +y=11: Section header (w=6, h=1) - "Details" +y=12: Table (w=6, h=6) - Detailed data +``` + +### 5) CARDINALITY & READABILITY (CRITICAL) + +**Dashboard readability depends on limiting distinct values:** + +| Dimension Type | Max Values | Examples | +|----------------|------------|----------| +| Chart color/groups | **3-8** | 4 regions, 5 product lines, 3 tiers | +| Filters | 4-10 | 8 countries, 5 channels | +| High cardinality | **Table only** | customer_id, order_id, SKU | + +**Before creating any chart with color/grouping:** +1. Check column cardinality (use `get_table_stats_and_schema` to see distinct values) +2. If >10 distinct values, aggregate to higher level OR use TOP-N + "Other" bucket +3. For high-cardinality dimensions, use a table widget instead of a chart + +### 6) QUALITY CHECKLIST + +Before deploying, verify: +1. All widget names use only alphanumeric + hyphens + underscores +2. All rows sum to width=6 with no gaps +3. KPIs use height 3-4, charts use height 5-6 +4. Chart dimensions have ≤8 distinct values +5. All widget fieldNames match dataset columns exactly +6. **Field `name` in query.fields matches `fieldName` in encodings exactly** (e.g., both `"sum(spend)"`) +7. Counter datasets: use `disaggregated: true` for 1-row datasets, `disaggregated: false` with aggregation for multi-row +8. Percent values are 0-1 (not 0-100) +9. SQL uses Spark syntax (date_sub, not INTERVAL) +10. 
**All SQL queries tested via `execute_sql` and return expected data** + +--- + +## Related Skills + +- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - for querying the underlying data and system tables +- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - for building the data pipelines that feed dashboards +- **[databricks-jobs](../databricks-jobs/SKILL.md)** - for scheduling dashboard data refreshes diff --git a/.claude/skills/databricks-app-python/1-authorization.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/1-authorization.md similarity index 100% rename from .claude/skills/databricks-app-python/1-authorization.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/1-authorization.md diff --git a/.claude/skills/databricks-app-python/2-app-resources.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/2-app-resources.md similarity index 100% rename from .claude/skills/databricks-app-python/2-app-resources.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/2-app-resources.md diff --git a/.claude/skills/databricks-app-python/3-frameworks.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/3-frameworks.md similarity index 92% rename from .claude/skills/databricks-app-python/3-frameworks.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/3-frameworks.md index cb1ef87..b8e76c8 100644 --- a/.claude/skills/databricks-app-python/3-frameworks.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/3-frameworks.md @@ -25,7 +25,7 @@ app = dash.Dash( |--------|-------| | Pre-installed version | 2.18.1 | | app.yaml command | `["python", "app.py"]` | -| Default port | 8050 (set `DATABRICKS_APP_PORT=8080` or use `app.run(port=8080)`) | +| Default port | 8050 — override in code: `app.run(port=int(os.environ.get("DATABRICKS_APP_PORT", 8000)))` | | Auth header | `request.headers.get('x-forwarded-access-token')` (Flask under the hood) | **Databricks tips**: @@ -84,6 +84,7 @@ def get_connection(): **Critical**: Use `gr.Request` parameter to access auth headers. 
```python +import os import gradio as gr import requests from databricks.sdk.core import Config @@ -102,14 +103,15 @@ def predict(message, request: gr.Request): return resp.json()["predictions"][0] demo = gr.Interface(fn=predict, inputs="text", outputs="text") -demo.launch(server_name="0.0.0.0", server_port=8080) +port = int(os.environ.get("DATABRICKS_APP_PORT", 8000)) +demo.launch(server_name="0.0.0.0", server_port=port) ``` | Detail | Value | |--------|-------| | Pre-installed version | 4.44.0 | | app.yaml command | `["python", "app.py"]` | -| Default port | 7860 (override with `server_port=8080` or `GRADIO_SERVER_PORT=8080`) | +| Default port | 7860 — override in code: `server_port=int(os.environ.get("DATABRICKS_APP_PORT", 8000))` | | Auth header | `request.headers.get('x-forwarded-access-token')` via `gr.Request` | **Databricks tips**: @@ -150,7 +152,7 @@ def get_data(): | Detail | Value | |--------|-------| | Pre-installed version | 3.0.3 | -| app.yaml command | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8080"]` | +| app.yaml command | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8000"]` | | Auth header | `request.headers.get('x-forwarded-access-token')` | **Databricks tips**: @@ -190,7 +192,7 @@ async def get_data(request: Request): | Detail | Value | |--------|-------| | Pre-installed version | 0.115.0 | -| app.yaml command | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]` | +| app.yaml command | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]` | | Auth header | `request.headers.get('x-forwarded-access-token')` via `Request` | **Databricks tips**: @@ -241,6 +243,6 @@ class State(rx.State): - All frameworks are **pre-installed** — no need to add them to `requirements.txt` - Add only additional packages your app needs to `requirements.txt` - SDK `Config()` auto-detects credentials from injected environment variables -- Databricks Apps expects apps to listen on **port 8080** (configure your framework accordingly) +- Apps must bind to `DATABRICKS_APP_PORT` env var (defaults to 8000). Streamlit is auto-configured by the runtime; for other frameworks, read the env var in code or hardcode 8000 in `app.yaml` command. 
**Never use 8080** - For framework-specific deployment commands, see [4-deployment.md](4-deployment.md) - For authorization integration, see [1-authorization.md](1-authorization.md) diff --git a/.claude/skills/databricks-app-python/4-deployment.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/4-deployment.md similarity index 95% rename from .claude/skills/databricks-app-python/4-deployment.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/4-deployment.md index 688f1f2..b318bbd 100644 --- a/.claude/skills/databricks-app-python/4-deployment.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/4-deployment.md @@ -31,8 +31,8 @@ env: | Dash | `["python", "app.py"]` | | Streamlit | `["streamlit", "run", "app.py"]` | | Gradio | `["python", "app.py"]` | -| Flask | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8080"]` | -| FastAPI | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]` | +| Flask | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8000"]` | +| FastAPI | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]` | | Reflex | `["reflex", "run", "--env", "prod"]` | ### Step 2: Create and Deploy @@ -103,7 +103,7 @@ databricks bundle run -t prod **Key difference from other resources**: environment variables go in `src/app/app.yaml`, not `databricks.yml`. -For complete DABs guidance, use the **databricks-asset-bundles** skill. +For complete DABs guidance, use the **databricks-bundles** skill. --- diff --git a/.claude/skills/databricks-app-python/5-lakebase.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/5-lakebase.md similarity index 100% rename from .claude/skills/databricks-app-python/5-lakebase.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/5-lakebase.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/6-mcp-approach.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/6-mcp-approach.md new file mode 100644 index 0000000..943c49b --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/6-mcp-approach.md @@ -0,0 +1,79 @@ +# MCP Tools for App Lifecycle + +Use MCP tools to create, deploy, and manage Databricks Apps programmatically. This mirrors the CLI workflow but can be invoked by AI agents. 
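+ +A minimal end-to-end sketch of that loop (paths, app name, and printed fields are illustrative - adapt them to your workspace): + +```python +# Upload local sources, then create/deploy the app and check its health +manage_workspace_files( + action="upload", + local_path="/path/to/my_app", + workspace_path="/Workspace/Users/me/my_app" +) +app = manage_app( + action="create_or_update", + name="my-app", + source_code_path="/Workspace/Users/me/my_app" +) +status = manage_app(action="get", name="my-app", include_logs=True) +print(app["url"], status["status"]) +```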
+ +--- + +## manage_app - App Lifecycle Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Idempotent create, deploys if source_code_path provided | name | +| `get` | Get app details (with optional logs) | name | +| `list` | List all apps | (none, optional name_contains filter) | +| `delete` | Delete an app | name | + +--- + +## Workflow + +### Step 1: Write App Files Locally + +Create your app files in a local folder: + +``` +my_app/ +├── app.py # Main application +├── models.py # Pydantic models +├── backend.py # Data access layer +├── requirements.txt # Additional dependencies +└── app.yaml # Databricks Apps configuration +``` + +### Step 2: Upload to Workspace + +```python +# MCP Tool: manage_workspace_files +manage_workspace_files( + action="upload", + local_path="/path/to/my_app", + workspace_path="/Workspace/Users/user@example.com/my_app" +) +``` + +### Step 3: Create and Deploy App + +```python +# MCP Tool: manage_app (creates if needed + deploys) +result = manage_app( + action="create_or_update", + name="my-dashboard", + description="Customer analytics dashboard", + source_code_path="/Workspace/Users/user@example.com/my_app" +) +# Returns: {"name": "my-dashboard", "url": "...", "created": True, "deployment": {...}} +``` + +### Step 4: Verify + +```python +# MCP Tool: manage_app (get with logs) +app = manage_app(action="get", name="my-dashboard", include_logs=True) +# Returns: {"name": "...", "url": "...", "status": "RUNNING", "logs": "...", ...} +``` + +### Step 5: Iterate + +1. Fix issues in local files +2. Re-upload with `manage_workspace_files(action="upload", ...)` +3. Re-deploy with `manage_app(action="create_or_update", ...)` (will update existing + deploy) +4. Check `manage_app(action="get", name=..., include_logs=True)` for errors +5. Repeat until app is healthy + +--- + +## Notes + +- Add resources (SQL warehouse, Lakebase, etc.) via the Databricks Apps UI after creating the app +- MCP tools use the service principal's permissions — ensure it has access to required resources +- For manual deployment, see [4-deployment.md](4-deployment.md) diff --git a/.claude/skills/databricks-app-python/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/SKILL.md similarity index 89% rename from .claude/skills/databricks-app-python/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/SKILL.md index eb62551..777d337 100644 --- a/.claude/skills/databricks-app-python/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/SKILL.md @@ -1,6 +1,6 @@ --- name: databricks-app-python -description: "Builds Python-based Databricks applications using Dash, Streamlit, Gradio, Flask, FastAPI, or Reflex. Handles OAuth authorization (app and user auth), app resources, SQL warehouse and Lakebase connectivity, model serving integration, and deployment. Use when building Python web apps, dashboards, ML demos, or REST APIs for Databricks, or when the user mentions Streamlit, Dash, Gradio, Flask, FastAPI, Reflex, or Databricks app." +description: "Builds Python-based Databricks applications using Dash, Streamlit, Gradio, Flask, FastAPI, or Reflex. Handles OAuth authorization (app and user auth), app resources, SQL warehouse and Lakebase connectivity, model serving integration, foundation model APIs, LLM integration, and deployment. 
Use when building Python web apps, dashboards, ML demos, or REST APIs for Databricks, or when the user mentions Streamlit, Dash, Gradio, Flask, FastAPI, Reflex, or Databricks app." --- # Databricks Python Application @@ -38,8 +38,8 @@ Copy this checklist and verify each item: | **Dash** | Production dashboards, BI tools, complex interactivity | `["python", "app.py"]` | | **Streamlit** | Rapid prototyping, data science apps, internal tools | `["streamlit", "run", "app.py"]` | | **Gradio** | ML demos, model interfaces, chat UIs | `["python", "app.py"]` | -| **Flask** | Custom REST APIs, lightweight apps, webhooks | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8080"]` | -| **FastAPI** | Async APIs, auto-generated OpenAPI docs | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8080"]` | +| **Flask** | Custom REST APIs, lightweight apps, webhooks | `["gunicorn", "app:app", "-w", "4", "-b", "0.0.0.0:8000"]` | +| **FastAPI** | Async APIs, auto-generated OpenAPI docs | `["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]` | | **Reflex** | Full-stack Python apps without JavaScript | `["reflex", "run", "--env", "prod"]` | **Default**: Recommend **Streamlit** for prototypes, **Dash** for production dashboards, **FastAPI** for APIs, **Gradio** for ML demos. @@ -74,6 +74,8 @@ Copy this checklist and verify each item: **MCP tools**: Use [6-mcp-approach.md](6-mcp-approach.md) for managing app lifecycle via MCP tools — covers creating, deploying, monitoring, and deleting apps programmatically. (Keywords: MCP, create app, deploy app, app logs) +**Foundation Models**: See [examples/llm_config.py](examples/llm_config.py) for calling Databricks foundation model APIs — covers OAuth M2M auth, OpenAI-compatible client wiring, and token caching. (Keywords: foundation model, LLM, OpenAI client, chat completions) + --- ## Workflow @@ -86,6 +88,7 @@ Copy this checklist and verify each item: **Using Lakebase (PostgreSQL)?** → Read [5-lakebase.md](5-lakebase.md) **Deploying to Databricks?** → Read [4-deployment.md](4-deployment.md) **Using MCP tools?** → Read [6-mcp-approach.md](6-mcp-approach.md) + **Calling foundation model/LLM APIs?** → See [examples/llm_config.py](examples/llm_config.py) 2. Follow the instructions in the relevant guide 3. For full code examples, browse https://apps-cookbook.dev/ @@ -170,7 +173,7 @@ class EntityIn(BaseModel): | **Resource not accessible** | Add resource via UI, verify SP has permissions, use `valueFrom` in app.yaml | | **Import error on deploy** | Add missing packages to `requirements.txt` (pre-installed packages don't need listing) | | **Lakebase app crashes on start** | `psycopg2`/`asyncpg` are NOT pre-installed — MUST add to `requirements.txt` | -| **Port conflict** | Databricks Apps expects port 8080; configure your framework accordingly | +| **Port conflict** | Apps must bind to `DATABRICKS_APP_PORT` env var (defaults to 8000). Never use 8080. 
Streamlit is auto-configured; for others, read the env var in code or use 8000 in app.yaml command | +| **Streamlit: set_page_config error** | `st.set_page_config()` must be the first Streamlit command | | **Dash: unstyled layout** | Add `dash-bootstrap-components`; use `dbc.themes.BOOTSTRAP` | | **Slow queries** | Use Lakebase for transactional/low-latency; SQL warehouse for analytical queries | @@ -202,7 +205,7 @@ class EntityIn(BaseModel): ## Related Skills - **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps with FastAPI + React -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - deploying apps via DABs +- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps via DABs - **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - backend SDK integration - **[databricks-lakebase-provisioned](../databricks-lakebase-provisioned/SKILL.md)** - adding persistent PostgreSQL state - **[databricks-model-serving](../databricks-model-serving/SKILL.md)** - serving ML models for app integration diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-minimal-chat.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-minimal-chat.py new file mode 100644 index 0000000..db9a35d --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-minimal-chat.py @@ -0,0 +1,182 @@ +""" +Minimal Databricks Foundation Model Chat App + +A complete, deployable Streamlit app demonstrating Foundation Model API integration +in Databricks Apps. This is a working example extracted from databricksters-check-and-pub. + +Features: +- Validated dual-mode auth (OAuth M2M in Apps, PAT for local dev) +- OpenAI SDK wired to Databricks serving endpoints +- Token caching with expiry check +- Multi-turn chat with conversation history +- Viewer identity display +- Latency tracking + +Local Development: + export DATABRICKS_TOKEN="dapi..." + export DATABRICKS_SERVING_BASE_URL="https:///serving-endpoints" + export DATABRICKS_MODEL="" # See databricks-model-serving + streamlit run fm-minimal-chat.py + +Databricks Apps Deployment: + 1. Create app.yaml: + command: ["streamlit", "run", "fm-minimal-chat.py"] + env: + - name: DATABRICKS_SERVING_BASE_URL + value: "https:///serving-endpoints" + - name: DATABRICKS_MODEL + value: "" # See databricks-model-serving + + 2. Create requirements.txt: + streamlit>=1.38,<2.0 + openai>=1.30,<2.0 + requests>=2.31,<3.0 # Needed for endpoint validation and OAuth fallback + + 3. Deploy: + databricks apps create foundation-chat --source-code-path . + + 4. 
Add service principal via UI for OAuth M2M auth +""" + +import time +from typing import Dict, List, Optional, Tuple + +import streamlit as st +from openai import OpenAI + +from llm_config import create_foundation_model_client, get_model_name + + +def _get_forwarded_headers() -> Dict[str, str]: + try: + return dict(getattr(st, "context").headers) + except Exception: + return {} + + +def get_viewer_identity() -> Tuple[Optional[str], Optional[str]]: + headers = _get_forwarded_headers() + email = headers.get("X-Forwarded-Email") or headers.get("x-forwarded-email") + token = headers.get("X-Forwarded-Access-Token") or headers.get( + "x-forwarded-access-token" + ) + return email, token + + +# ============================================================================= +# LLM Helper +# ============================================================================= +def llm_chat( + client: OpenAI, + *, + model: str, + messages: List[Dict[str, str]], + max_tokens: int = 1000, + temperature: float = 0.7, +) -> Tuple[str, int]: + """Call foundation model and return (response, latency_ms).""" + t0 = time.perf_counter() + resp = client.chat.completions.create( + model=model, + messages=messages, + max_tokens=max_tokens, + temperature=temperature, + ) + elapsed_ms = int((time.perf_counter() - t0) * 1000) + content = resp.choices[0].message.content or "" + return content, elapsed_ms + + +# ============================================================================= +# Streamlit App +# ============================================================================= +def main(): + st.set_page_config( + page_title="Databricks Foundation Model Chat", + page_icon="💬", + layout="centered", + ) + + st.title("💬 Foundation Model Chat") + st.caption("Powered by Databricks Apps") + + # Sidebar: viewer identity + viewer_email, _ = get_viewer_identity() + if viewer_email: + st.sidebar.success(f"Logged in as: {viewer_email}") + else: + st.sidebar.info("Local dev mode (no viewer identity)") + + # Sidebar: model config + with st.sidebar: + st.subheader("Configuration") + st.code(f"Model: {get_model_name()}", language=None) + + if st.button("🗑️ Clear Chat History"): + st.session_state.messages = [] + st.rerun() + + with st.expander("ℹ️ About"): + st.markdown( + """ + This app demonstrates calling Databricks Foundation Model APIs + from a Streamlit app using: + - Shared dual-mode auth (PAT + OAuth M2M) + - Shared OpenAI client wiring + - Viewer identity extraction + """ + ) + + # Initialize chat history + if "messages" not in st.session_state: + st.session_state.messages = [] + + # Display chat history + for message in st.session_state.messages: + with st.chat_message(message["role"]): + st.markdown(message["content"]) + if message.get("latency_ms"): + st.caption(f"⏱️ {message['latency_ms']}ms") + + # Chat input + if prompt := st.chat_input("Ask me anything..."): + # Add user message to chat history + st.session_state.messages.append({"role": "user", "content": prompt}) + with st.chat_message("user"): + st.markdown(prompt) + + # Generate assistant response + with st.chat_message("assistant"): + with st.spinner("Thinking..."): + try: + client = create_foundation_model_client(cache=st.session_state) + + # Call foundation model + response, latency_ms = llm_chat( + client, + model=get_model_name(), + messages=st.session_state.messages, + max_tokens=1000, + temperature=0.7, + ) + + # Display response + st.markdown(response) + st.caption(f"⏱️ {latency_ms}ms") + + # Add to chat history + st.session_state.messages.append( + { + 
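+                            # NOTE: latency_ms is an app-side extra field; strip it
+                            # from messages before replaying history if the serving
+                            # endpoint rejects unknown message fields.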
"role": "assistant", + "content": response, + "latency_ms": latency_ms, + } + ) + + except Exception as e: + st.error(f"Error calling foundation model: {e}") + st.session_state.messages.pop() # Remove failed user message + + +if __name__ == "__main__": + main() diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-parallel-calls.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-parallel-calls.py new file mode 100644 index 0000000..53cc6a2 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-parallel-calls.py @@ -0,0 +1,265 @@ +""" +Parallel Foundation Model Calls + +This example demonstrates how to make multiple foundation model API calls in parallel +for improved performance. It uses the same bounded job-runner pattern as the +production Databricks App, but keeps the example generic enough to reuse in +other review, extraction, or scoring workflows. + +Use cases: +- Document evaluation with multiple independent checks +- Batch processing of independent prompts +- Multi-aspect analysis of the same content +- A/B testing different prompts + +Performance impact: +- Serial: 5 calls × 2s each = 10s total +- Parallel (max_workers=5): ~2s to 3s total depending on endpoint overhead + +Configuration: +- LLM_MAX_CONCURRENCY env var controls parallelism (positive integer, default: 5) +- Balance between throughput and rate limits +- DATABRICKS_MODEL must be set to a valid serving endpoint name +""" + +import time +from typing import Any, Callable, Dict, List, Tuple + +from openai import OpenAI + +from llm_config import ( + create_foundation_model_client, + get_model_name, + run_jobs_parallel, +) + + +# ============================================================================= +# LLM Call Helper +# ============================================================================= +def llm_call( + client: OpenAI, + prompt: str, + model: str | None = None, + max_tokens: int = 500, +) -> Tuple[str, int]: + """Make a single LLM call and return (response, latency_ms).""" + t0 = time.perf_counter() + resp = client.chat.completions.create( + model=model or get_model_name(), + messages=[{"role": "user", "content": prompt}], + max_tokens=max_tokens, + temperature=0.2, + ) + elapsed_ms = int((time.perf_counter() - t0) * 1000) + content = resp.choices[0].message.content or "" + return content, elapsed_ms + + +# ============================================================================= +# Example: Generic Technical Document Checks +# ============================================================================= +def check_structure(client: OpenAI, text: str) -> Dict[str, Any]: + """Check if a technical document has clear section structure.""" + prompt = f"""Evaluate the structure of this technical document. Does it have clear section headings and a logical progression? + +DOCUMENT: +{text[:2000]} + +Answer with: PASS or FAIL, then brief explanation.""" + + response, latency_ms = llm_call(client, prompt) + passed = "PASS" in response.upper().split("\n")[0] + + return { + "check": "structure", + "passed": passed, + "response": response, + "latency_ms": latency_ms, + } + + +def check_summary(client: OpenAI, text: str) -> Dict[str, Any]: + """Check if content has a concise executive summary near the top.""" + prompt = f"""Does this technical document start with a concise summary or key takeaways section in the first 10 percent? 
+ +DOCUMENT: +{text[:2000]} + +Answer with: PASS or FAIL, then brief explanation.""" + + response, latency_ms = llm_call(client, prompt) + passed = "PASS" in response.upper().split("\n")[0] + + return { + "check": "summary", + "passed": passed, + "response": response, + "latency_ms": latency_ms, + } + + +def check_examples(client: OpenAI, text: str) -> Dict[str, Any]: + """Check if content includes concrete examples.""" + prompt = f"""Does this technical document include concrete examples, code, or step-by-step guidance readers can adapt? + +DOCUMENT: +{text[:2000]} + +Answer with: PASS or FAIL, then brief explanation.""" + + response, latency_ms = llm_call(client, prompt) + passed = "PASS" in response.upper().split("\n")[0] + + return { + "check": "examples", + "passed": passed, + "response": response, + "latency_ms": latency_ms, + } + + +def check_troubleshooting(client: OpenAI, text: str) -> Dict[str, Any]: + """Check if content covers troubleshooting or failure modes.""" + prompt = f"""Does this technical document include troubleshooting guidance, failure modes, or common pitfalls? + +DOCUMENT: +{text[:2000]} + +Answer with: PASS or FAIL, then brief explanation.""" + + response, latency_ms = llm_call(client, prompt) + passed = "PASS" in response.upper().split("\n")[0] + + return { + "check": "troubleshooting", + "passed": passed, + "response": response, + "latency_ms": latency_ms, + } + + +def check_audience_fit(client: OpenAI, text: str) -> Dict[str, Any]: + """Check if content matches a technical practitioner audience.""" + prompt = f"""Does this technical document appear written for practitioners, with the right level of specificity and useful context? + +DOCUMENT: +{text[:2000]} + +Answer with: PASS or FAIL, then brief explanation.""" + + response, latency_ms = llm_call(client, prompt) + passed = "PASS" in response.upper().split("\n")[0] + + return { + "check": "audience_fit", + "passed": passed, + "response": response, + "latency_ms": latency_ms, + } + + +# ============================================================================= +# Example Usage: Parallel Execution +# ============================================================================= +if __name__ == "__main__": + # Sample technical document + sample_text = """ + Summary: This guide shows how to deploy a Databricks App in three steps. + + ## Introduction + Databricks Apps provides a way to deploy web applications... + + ## Step 1: Create Your App + First, create an app.py file... + + ## Step 2: Configure app.yaml + Next, set up your configuration... + + ## Step 3: Deploy + Finally, deploy using the CLI... + """ + + client = create_foundation_model_client() + + print("Making 5 parallel LLM calls...") + print(f"Model: {get_model_name()}\n") + + # Define independent parallel jobs + jobs = { + "structure": (check_structure, (client, sample_text), {}), + "summary": (check_summary, (client, sample_text), {}), + "examples": (check_examples, (client, sample_text), {}), + "troubleshooting": (check_troubleshooting, (client, sample_text), {}), + "audience_fit": (check_audience_fit, (client, sample_text), {}), + } + + # Execute in parallel using the shared bounded job runner. 
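+    # run_jobs_parallel caps the worker pool via LLM_MAX_CONCURRENCY (default 5)
+    # and returns (results, errors): failed jobs map to None in `results` and
+    # append a "name: ExceptionType: message" string to `errors`.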
+ start = time.perf_counter() + results, errors = run_jobs_parallel(jobs) + total_time = time.perf_counter() - start + + # Display results + print("=" * 60) + print(f"Completed in {total_time:.2f}s (parallel execution)") + print("=" * 60) + + if errors: + print("\nErrors encountered:") + for error in errors: + print(f" ❌ {error}") + + print("\nResults:") + for job_name, result in results.items(): + if result: + status = "✅ PASS" if result["passed"] else "❌ FAIL" + print(f"\n{job_name.upper()}: {status}") + print(f" Latency: {result['latency_ms']}ms") + print(f" Response: {result['response'][:150]}...") + else: + print(f"\n{job_name.upper()}: ❌ FAILED (see errors above)") + + # Calculate time saved + total_latency = sum(r["latency_ms"] for r in results.values() if r) + time_saved = (total_latency / 1000) - total_time + print(f"\n{'='*60}") + print(f"Time saved vs serial execution: {time_saved:.2f}s") + print(f"Speedup: {(total_latency/1000) / total_time:.1f}×") + print(f"{'='*60}") + + +# ============================================================================= +# Production Best Practices +# ============================================================================= +""" +Best practices from databricksters-check-and-pub: + +1. Configurable concurrency + - Use LLM_MAX_CONCURRENCY env var (default: 5 in the production app) + - Balance throughput vs rate limits + - Too high = rate limit errors + - Too low = underutilized resources + +2. Error handling + - Capture exceptions per job + - Return None for failed jobs + - Collect error messages for debugging + - Continue execution even if some jobs fail + +3. Bounded execution + - Only parallelize independent checks + - Cap concurrency with an env var rather than firing unlimited requests + - Keep the job contract simple: name -> (callable, args, kwargs) + +4. When to use parallel calls + - Multiple independent evaluations of same content + - Batch processing multiple documents + - A/B testing different prompts + - Multi-aspect analysis + +5. When NOT to use parallel calls + - Dependent/sequential operations + - Single evaluation needed + - Rate limits are very strict + - Debugging (use serial for easier troubleshooting) +""" diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-structured-outputs.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-structured-outputs.py new file mode 100644 index 0000000..90fe6d2 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/fm-structured-outputs.py @@ -0,0 +1,337 @@ +""" +Structured Outputs and Robust Response Parsing + +Production patterns for getting structured data (JSON) from foundation models. +Extracted from databricksters-check-and-pub production app. + +Key patterns: +1. Robust JSON parsing (handles code fences, smart quotes, malformed JSON) +2. Retry logic on parse failure with stricter prompts +3. Content normalization (handles various response formats) +4. temperature=0.0 for deterministic structured outputs +5. Streamlit caching for expensive API calls +6. Consistent timeout handling + +Use cases: +- Content evaluation/scoring +- Data extraction from text +- Classification tasks +- Compliance checking +- Any task requiring structured model output + +Set `DATABRICKS_MODEL` to a valid serving endpoint name before running. 
+""" + +import json +import re +import time +from typing import Any, Dict, List, Tuple + +import streamlit as st +from openai import OpenAI + +from llm_config import create_foundation_model_client, get_model_name + + +# ============================================================================= +# Pattern 1: Content Normalization +# ============================================================================= +def _content_to_text(content: Any) -> str: + """Normalize model message content to a string. + + Handles various content types returned by foundation models: + - str: return as-is + - bytes: decode to UTF-8 + - list: extract text from content parts (handles multi-modal responses) + + This is critical for handling different response formats consistently. + """ + if isinstance(content, str): + return content + + if isinstance(content, (bytes, bytearray)): + return content.decode("utf-8", errors="replace") + + if isinstance(content, list): + parts: List[str] = [] + for item in content: + if isinstance(item, str): + parts.append(item) + elif isinstance(item, dict): + # Handle content part objects + if "text" in item and isinstance(item["text"], str): + parts.append(item["text"]) + elif "content" in item and isinstance(item["content"], str): + parts.append(item["content"]) + return "".join(parts) + + return str(content) + + +# ============================================================================= +# Pattern 2: Robust JSON Parsing +# ============================================================================= +def _parse_json_object(response_text: str) -> Dict[str, Any]: + """Best-effort parse of a JSON object from a model response. + + Handles common failure modes: + 1. Model wraps JSON in markdown code fences (```json ... ```) + 2. Model uses smart/curly quotes instead of straight quotes + 3. Model includes extra text before/after JSON + 4. Model returns malformed JSON + + This is THE critical pattern for production structured outputs. + """ + text = (response_text or "").strip() + if not text: + raise ValueError("Empty model response (expected JSON object)") + + # Strip markdown code fences if present + if text.startswith("```"): + text = re.sub(r"^```[a-zA-Z]*\n", "", text) + text = re.sub(r"```$", "", text).strip() + + # Try direct parse first + try: + obj = json.loads(text) + if isinstance(obj, dict): + return obj + except Exception: + pass + + # Extract first {...} block (handles extra text around JSON) + start = text.find("{") + end = text.rfind("}") + if start != -1 and end != -1 and end > start: + candidate = text[start : end + 1] + else: + candidate = text + + # Normalize smart quotes (common LLM formatting issue) + candidate = ( + candidate.replace("\u201c", '"') # Left double quote + .replace("\u201d", '"') # Right double quote + .replace("\u2018", "'") # Left single quote + .replace("\u2019", "'") # Right single quote + ) + + # Final parse attempt + obj = json.loads(candidate) + if not isinstance(obj, dict): + raise ValueError("Model did not return a JSON object") + return obj + + +# ============================================================================= +# Pattern 3: Structured LLM Call with Retry +# ============================================================================= +def llm_structured_call( + client: OpenAI, + system_prompt: str, + user_prompt: str, + model: str | None = None, +) -> Tuple[Dict[str, Any], int]: + """Call foundation model for structured output with retry on parse failure. 
+ + Returns: + (parsed_json_dict, latency_ms) + + Critical pattern: + - Use temperature=0.0 for deterministic structured outputs + - If JSON parse fails, retry with stricter instructions + - Combine latencies from both attempts + """ + # First attempt + t0 = time.perf_counter() + response = client.chat.completions.create( + model=model or get_model_name(), + messages=[ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": user_prompt}, + ], + max_tokens=2000, + temperature=0.0, # Deterministic for structured outputs + ) + elapsed_ms = int((time.perf_counter() - t0) * 1000) + + content = _content_to_text(response.choices[0].message.content) + + # Try to parse response + try: + return _parse_json_object(content), elapsed_ms + except Exception as e: + # Retry with stricter prompt + print(f"Parse failed (attempt 1): {e}. Retrying with stricter prompt...") + + t0_retry = time.perf_counter() + retry_response = client.chat.completions.create( + model=model or get_model_name(), + messages=[ + {"role": "system", "content": system_prompt}, + {"role": "user", "content": "Return ONLY minified JSON object. Strings must be JSON-escaped. No extra text."}, + {"role": "user", "content": user_prompt}, + ], + max_tokens=2000, + temperature=0.0, + ) + retry_elapsed_ms = int((time.perf_counter() - t0_retry) * 1000) + + retry_content = _content_to_text(retry_response.choices[0].message.content) + return _parse_json_object(retry_content), elapsed_ms + retry_elapsed_ms + + +# ============================================================================= +# Pattern 4: Caching Expensive Calls (Streamlit) +# ============================================================================= +@st.cache_data(ttl=60 * 60) # Cache for 1 hour +def cached_structured_call( + prompt: str, + model: str | None = None, +) -> Dict[str, Any]: + """Cache expensive structured LLM calls. + + Use @st.cache_data with TTL for: + - Expensive/slow API calls + - Calls with same inputs (idempotent) + - Data that doesn't need real-time freshness + + TTL examples: + - 60 * 10 = 10 minutes (frequently changing data) + - 60 * 60 = 1 hour (moderate freshness) + - 60 * 60 * 24 = 24 hours (stable data) + """ + client = create_foundation_model_client() + system = "You are a data extraction assistant. Return ONLY valid JSON." + result, _ = llm_structured_call(client, system, prompt, model or get_model_name()) + return result + + +# ============================================================================= +# Example: Content Quality Evaluation +# ============================================================================= +def evaluate_content_quality( + client: OpenAI, text: str +) -> Tuple[Dict[str, Any], int]: + """Evaluate content quality with structured output.""" + + system_prompt = """You are a content quality evaluator. +You must return ONLY valid JSON that exactly matches the schema below. +No commentary. No markdown. 
No explanations.""" + + user_prompt = f"""Evaluate this content and return JSON with this exact schema: +{{ + "overall_score": 0-100, + "readability": "poor"|"fair"|"good"|"excellent", + "has_clear_structure": true|false, + "has_actionable_takeaways": true|false, + "strengths": ["string", "string"], + "weaknesses": ["string", "string"], + "suggestions": ["string", "string"] +}} + +Content to evaluate: +{text[:2000]} +""" + + return llm_structured_call(client, system_prompt, user_prompt) + + +# ============================================================================= +# Example: Entity Extraction +# ============================================================================= +def extract_entities(client: OpenAI, text: str) -> Tuple[Dict[str, Any], int]: + """Extract structured entities from text.""" + + system_prompt = """You are an entity extraction system. +Return ONLY valid JSON. Do not include explanations.""" + + user_prompt = f"""Extract entities from this text and return JSON: +{{ + "people": ["name1", "name2"], + "organizations": ["org1", "org2"], + "technologies": ["tech1", "tech2"], + "key_concepts": ["concept1", "concept2"] +}} + +Text: +{text[:2000]} +""" + + return llm_structured_call(client, system_prompt, user_prompt) + + +# ============================================================================= +# Example Usage +# ============================================================================= +if __name__ == "__main__": + sample_text = """ + Databricks Lakehouse Platform combines data warehousing and AI with open + data formats like Delta Lake. Apache Spark and MLflow are key components. + Jane Smith, VP of Engineering at Acme Corp, recently shared their migration story. + """ + + client = create_foundation_model_client() + + print("=" * 60) + print("Example 1: Content Quality Evaluation") + print("=" * 60) + try: + quality_data, latency_ms = evaluate_content_quality(client, sample_text) + print(f"✓ Completed in {latency_ms}ms") + print(json.dumps(quality_data, indent=2)) + except Exception as e: + print(f"❌ Error: {e}") + + print("\n" + "=" * 60) + print("Example 2: Entity Extraction") + print("=" * 60) + try: + entity_data, latency_ms = extract_entities(client, sample_text) + print(f"✓ Completed in {latency_ms}ms") + print(json.dumps(entity_data, indent=2)) + except Exception as e: + print(f"❌ Error: {e}") + + +# ============================================================================= +# Production Best Practices Summary +# ============================================================================= +""" +Key takeaways from databricksters-check-and-pub: + +1. Content Normalization (_content_to_text) + - Handle str, bytes, list content types + - Essential for multi-modal or varying response formats + +2. Robust JSON Parsing (_parse_json_object) + - Strip markdown code fences (```json) + - Normalize smart quotes + - Extract {...} from surrounding text + - This ONE function prevents 90% of parsing errors in production + +3. Retry on Parse Failure + - If first attempt fails to parse, retry with stricter prompt + - Add latencies together for accurate tracking + - Shows user total cost, not just successful attempt + +4. Temperature Settings + - Use temperature=0.0 for structured outputs (deterministic) + - Use temperature=0.2-0.7 for creative/generative tasks + - Compliance checks = 0.0, content generation = 0.7 + +5. Caching with TTL + - Use @st.cache_data(ttl=...) 
for expensive calls + - Choose TTL based on data freshness needs + - Dramatically improves app responsiveness + +6. Timeouts + - Set timeout=30 on all HTTP requests + - Prevents hanging connections + - Provides better error messages to users + +7. System Prompts for Structure + - Clearly state: "Return ONLY valid JSON" + - Provide exact schema in prompt + - Use examples when needed + - Be explicit about constraints +""" diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/llm_config.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/llm_config.py new file mode 100644 index 0000000..6c4d550 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-app-python/examples/llm_config.py @@ -0,0 +1,353 @@ +import concurrent.futures +import os +import threading +import time +from collections.abc import MutableMapping as MutableMappingABC +from dataclasses import dataclass +from typing import Any, Callable, Dict, MutableMapping, Tuple +from urllib.parse import urlsplit + +from openai import OpenAI + +CACHE_KEY = "dbx_oauth" +VALIDATION_TTL_SECONDS = 300 + + +class DatabricksLLMConfigError(RuntimeError): + """Raised when Databricks LLM configuration is invalid.""" + + +@dataclass(frozen=True) +class DatabricksLLMConfig: + serving_base_url: str + workspace_host: str + model: str + auth_mode: str + + +_token_lock = threading.Lock() +_token_cache: Dict[str, Any] = {} +_validation_cache: Dict[Tuple[str, str], int] = {} + + +def _requests_module(): + import requests + + return requests + + +def _normalize_host(raw_host: str) -> str: + host = (raw_host or "").strip().rstrip("/") + if not host: + raise DatabricksLLMConfigError("Databricks workspace host is empty.") + if not host.startswith(("http://", "https://")): + host = "https://" + host + parts = urlsplit(host) + if not parts.scheme or not parts.netloc: + raise DatabricksLLMConfigError(f"Invalid Databricks workspace host: {raw_host!r}") + return f"{parts.scheme}://{parts.netloc}" + + +def _normalize_serving_base_url(raw_url: str) -> str: + value = (raw_url or "").strip() + if not value: + raise DatabricksLLMConfigError( + "DATABRICKS_SERVING_BASE_URL must be set to https:///serving-endpoints." + ) + if not value.startswith(("http://", "https://")): + value = "https://" + value + parts = urlsplit(value) + if not parts.scheme or not parts.netloc: + raise DatabricksLLMConfigError(f"Invalid DATABRICKS_SERVING_BASE_URL: {raw_url!r}") + path = parts.path.rstrip("/") + if path != "/serving-endpoints": + raise DatabricksLLMConfigError( + "DATABRICKS_SERVING_BASE_URL must end with /serving-endpoints for the target workspace." + ) + return f"{parts.scheme}://{parts.netloc}/serving-endpoints" + + +def get_databricks_llm_config() -> DatabricksLLMConfig: + serving_base_url = _normalize_serving_base_url( + os.environ.get("DATABRICKS_SERVING_BASE_URL", "") + ) + workspace_host = serving_base_url[: -len("/serving-endpoints")] + + configured_host = os.environ.get("DATABRICKS_HOST", "").strip() + if configured_host: + normalized_host = _normalize_host(configured_host) + if normalized_host != workspace_host: + raise DatabricksLLMConfigError( + "DATABRICKS_HOST must match the workspace host in DATABRICKS_SERVING_BASE_URL." + ) + + model = os.environ.get("DATABRICKS_MODEL", "").strip() + if not model: + raise DatabricksLLMConfigError( + "DATABRICKS_MODEL must be set to a serving endpoint available in the workspace." 
+ ) + + client_id = os.environ.get("DATABRICKS_CLIENT_ID", "").strip() + client_secret = os.environ.get("DATABRICKS_CLIENT_SECRET", "").strip() + token = os.environ.get("DATABRICKS_TOKEN", "").strip() + + if client_id and client_secret: + auth_mode = "oauth-m2m" + elif token: + auth_mode = "pat" + else: + raise DatabricksLLMConfigError( + "No Databricks auth configured. Set DATABRICKS_CLIENT_ID and " + "DATABRICKS_CLIENT_SECRET, or provide DATABRICKS_TOKEN." + ) + + return DatabricksLLMConfig( + serving_base_url=serving_base_url, + workspace_host=workspace_host, + model=model, + auth_mode=auth_mode, + ) + + +def get_serving_base_url() -> str: + return get_databricks_llm_config().serving_base_url + + +def get_model_name() -> str: + return get_databricks_llm_config().model + + +def _is_token_fresh(cache: MutableMapping[str, Any] | Dict[str, Any]) -> bool: + return bool( + cache.get("access_token") + and int(cache.get("expires_at", 0)) > int(time.time()) + 30 + ) + + +def _write_token_cache( + access_token: str, + expires_at: int, + config: DatabricksLLMConfig, + cache: MutableMapping[str, Any] | None = None, +) -> None: + token_record = { + "access_token": access_token, + "expires_at": expires_at, + "workspace_host": config.workspace_host, + "auth_mode": config.auth_mode, + "client_id": os.environ.get("DATABRICKS_CLIENT_ID", "").strip(), + } + _token_cache.clear() + _token_cache.update(token_record) + if cache is not None: + cache[CACHE_KEY] = dict(token_record) + + +def _token_cache_matches( + cache: MutableMapping[str, Any] | Dict[str, Any], + config: DatabricksLLMConfig, +) -> bool: + return bool( + cache.get("workspace_host") == config.workspace_host + and cache.get("auth_mode") == config.auth_mode + and cache.get("client_id", "") == os.environ.get("DATABRICKS_CLIENT_ID", "").strip() + ) + + +def get_databricks_bearer_token( + cache: MutableMapping[str, Any] | None = None, +) -> str: + config = get_databricks_llm_config() + + if config.auth_mode == "pat": + return os.environ["DATABRICKS_TOKEN"].strip() + + if cache: + cached = cache.get(CACHE_KEY, {}) + if ( + isinstance(cached, MutableMappingABC) + and _token_cache_matches(cached, config) + and _is_token_fresh(cached) + ): + _write_token_cache( + str(cached["access_token"]), + int(cached["expires_at"]), + config, + cache=cache, + ) + return str(cached["access_token"]) + + if _token_cache_matches(_token_cache, config) and _is_token_fresh(_token_cache): + access_token = str(_token_cache["access_token"]) + expires_at = int(_token_cache["expires_at"]) + _write_token_cache(access_token, expires_at, config, cache=cache) + return access_token + + with _token_lock: + if _token_cache_matches(_token_cache, config) and _is_token_fresh(_token_cache): + access_token = str(_token_cache["access_token"]) + expires_at = int(_token_cache["expires_at"]) + _write_token_cache(access_token, expires_at, config, cache=cache) + return access_token + + requests = _requests_module() + try: + response = requests.post( + f"{config.workspace_host}/oidc/v1/token", + headers={"Content-Type": "application/x-www-form-urlencoded"}, + data={"grant_type": "client_credentials", "scope": "all-apis"}, + auth=( + os.environ["DATABRICKS_CLIENT_ID"].strip(), + os.environ["DATABRICKS_CLIENT_SECRET"].strip(), + ), + timeout=30, + ) + except Exception as exc: + raise DatabricksLLMConfigError( + f"Could not reach Databricks OAuth token endpoint for " + f"{config.workspace_host}: {type(exc).__name__}: {str(exc)[:200]}" + ) from exc + if response.status_code >= 400: + raise 
DatabricksLLMConfigError( + f"Failed Databricks OAuth authentication for {config.workspace_host} " + f"(HTTP {response.status_code}). Check the service principal credentials " + "for that workspace." + ) + + payload = response.json() + access_token = payload.get("access_token") + expires_in = int(payload.get("expires_in", 300)) + if not access_token: + raise DatabricksLLMConfigError( + f"Token endpoint response is missing access_token: {payload}" + ) + + expires_at = int(time.time()) + expires_in + _write_token_cache(str(access_token), expires_at, config, cache=cache) + return str(access_token) + + +def validate_databricks_llm_config( + cache: MutableMapping[str, Any] | None = None, +) -> DatabricksLLMConfig: + config = get_databricks_llm_config() + cache_key = (config.serving_base_url, config.model) + + cached_expiry = _validation_cache.get(cache_key, 0) + if cached_expiry > int(time.time()): + return config + + requests = _requests_module() + token = get_databricks_bearer_token(cache=cache) + headers = {"Authorization": f"Bearer {token}"} + endpoint_url = f"{config.workspace_host}/api/2.0/serving-endpoints/{config.model}" + try: + response = requests.get(endpoint_url, headers=headers, timeout=30) + except Exception as exc: + raise DatabricksLLMConfigError( + f"Could not validate DATABRICKS_MODEL={config.model!r} in workspace " + f"{config.workspace_host}: {type(exc).__name__}: {str(exc)[:200]}" + ) from exc + + if response.status_code == 404: + try: + list_response = requests.get( + f"{config.workspace_host}/api/2.0/serving-endpoints", + headers=headers, + timeout=30, + ) + except Exception: + list_response = None + available: list[str] = [] + if list_response is not None and list_response.status_code < 400: + try: + payload = list_response.json() + available = sorted( + endpoint.get("name", "").strip() + for endpoint in payload.get("endpoints", []) + if endpoint.get("name", "").strip() + ) + except Exception: + available = [] + available_text = ", ".join(available[:10]) if available else "no endpoints were returned" + raise DatabricksLLMConfigError( + f"DATABRICKS_MODEL={config.model!r} was not found in workspace " + f"{config.workspace_host}. Available endpoints include: {available_text}." + ) + + if response.status_code >= 400: + raise DatabricksLLMConfigError( + f"Failed to validate DATABRICKS_MODEL={config.model!r} in workspace " + f"{config.workspace_host} (HTTP {response.status_code}). 
" + f"Response: {response.text[:300]}" + ) + + _validation_cache[cache_key] = int(time.time()) + VALIDATION_TTL_SECONDS + return config + + +def build_openai_client( + *, + validate: bool = True, + cache: MutableMapping[str, Any] | None = None, +) -> OpenAI: + config = ( + validate_databricks_llm_config(cache=cache) + if validate + else get_databricks_llm_config() + ) + token = get_databricks_bearer_token(cache=cache) + return OpenAI(api_key=token, base_url=config.serving_base_url) + + +def create_foundation_model_client( + cache: MutableMapping[str, Any] | None = None, +) -> OpenAI: + return build_openai_client(validate=True, cache=cache) + + +def resolve_bearer_token(cache: MutableMapping[str, Any] | None = None) -> str: + return get_databricks_bearer_token(cache=cache) + + +def run_jobs_parallel( + jobs: Dict[str, Tuple[Callable[..., Any], Tuple[Any, ...], Dict[str, Any]]], + max_workers: int | None = None, +) -> Tuple[Dict[str, Any], list[str]]: + """Run independent jobs in parallel and collect per-job failures.""" + if max_workers is None: + raw_worker_count = os.environ.get("LLM_MAX_CONCURRENCY", "5") + try: + worker_count = int(raw_worker_count) + except ValueError as exc: + raise DatabricksLLMConfigError( + "LLM_MAX_CONCURRENCY must be a positive integer." + ) from exc + else: + worker_count = max_workers + + if worker_count < 1: + raise DatabricksLLMConfigError( + "LLM_MAX_CONCURRENCY must be a positive integer." + ) + + results: Dict[str, Any] = {} + errors: list[str] = [] + + def _call(fn: Callable[..., Any], args: Tuple[Any, ...], kwargs: Dict[str, Any]) -> Any: + return fn(*args, **kwargs) + + with concurrent.futures.ThreadPoolExecutor(max_workers=worker_count) as executor: + futures = { + executor.submit(_call, fn, args, kwargs): name + for name, (fn, args, kwargs) in jobs.items() + } + concurrent.futures.wait(list(futures.keys())) + for future, name in [(future, futures[future]) for future in futures]: + try: + results[name] = future.result() + except Exception as exc: + errors.append(f"{name}: {type(exc).__name__}: {str(exc)[:200]}") + results[name] = None + + return results, errors diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/SKILL.md new file mode 100644 index 0000000..ee76e50 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/SKILL.md @@ -0,0 +1,336 @@ +--- +name: databricks-bdd-testing +description: "BDD testing with Python Behave and Databricks. Use when the user asks to set up BDD, create Gherkin feature files, write step definitions, scaffold a Behave project, run BDD tests, or test pipelines, Unity Catalog, jobs, or Apps using behavior-driven development." +--- + +# BDD testing with Python Behave + +Set up and run Behavior-Driven Development test suites against Databricks using Python Behave (Gherkin). Generate feature files, step definitions, and test harnesses that call real Unity Catalog functions via the Statement Execution API. + +## When to use + +- User asks to "set up BDD", "scaffold Behave", "create Gherkin tests", or "add BDD to my project" +- User wants to test pipelines, Unity Catalog permissions, jobs, Apps, or SQL functions +- User has existing SQL rule functions and wants automated test coverage +- User asks to "write Given/When/Then tests" or "generate feature files" + +## Quick start + +### 1. 
Scaffold a Behave project
+
+```bash
+uv add --group test behave databricks-sdk httpx
+```
+
+Generate this directory structure:
+
+```
+features/
+├── environment.py          # Databricks SDK setup, ephemeral schema lifecycle
+├── steps/
+│   ├── common_steps.py     # Shared: workspace connection, SQL execution, row counts
+│   └── <domain>_steps.py   # Per-domain step implementations
+├── catalog/                # Feature files by domain
+├── pipelines/
+├── jobs/
+└── sql/
+behave.ini
+```
+
+### 2. Write a feature file
+
+```gherkin
+@compliance @smoke
+Feature: Back-to-Back Promotion Compliance
+  As a compliance officer
+  I need to ensure products have a 4-week cooling period between promotions
+  So that we comply with ACCC pricing guidelines
+
+  Rule: Products must have a minimum 4-week gap between promotions
+
+  Scenario: Product promoted in consecutive weeks violates cooling period
+    Given a product was promoted in weeks 1, 2
+    When I check for back-to-back promotions
+    Then the result should be "FAILED"
+
+  Scenario: Product with 5-week gap is compliant
+    Given a product was promoted in weeks 1, 6
+    When I check for back-to-back promotions
+    Then the result should be "PASSED"
+
+  Scenario Outline: Promotion gap validation
+    Given a product was promoted in weeks <weeks>
+    When I check for back-to-back promotions
+    Then the result should be "<expected>"
+
+    Examples: Various gaps
+      | weeks    | expected |
+      | 1, 2     | FAILED   |
+      | 1, 5     | FAILED   |
+      | 1, 6     | PASSED   |
+      | 1, 6, 11 | PASSED   |
+```
+
+### 3. Implement step definitions
+
+Step definitions call UC functions via the Statement Execution API:
+
+```python
+from __future__ import annotations
+
+from behave import given, when, then
+from behave.runner import Context
+
+
+@given("a product was promoted in weeks {weeks}")
+def step_promo_weeks(context: Context, weeks: str) -> None:
+    context.promo_weeks = [int(w.strip()) for w in weeks.split(",")]
+
+
+@when("I check for back-to-back promotions")
+def step_check_b2b(context: Context) -> None:
+    weeks = sorted(context.promo_weeks)
+    if not weeks:
+        context.result = "PASSED"
+        return
+
+    last = weeks[-1]
+    prev_promos = [False, False, False, False]
+    for w in weeks[:-1]:
+        gap = last - w
+        if 1 <= gap <= 4:
+            prev_promos[gap - 1] = True
+
+    args = ", ".join(["TRUE"] + [str(p).upper() for p in prev_promos])
+    violation = call_rule(f"check_back_to_back_promo({args})")
+    context.result = "FAILED" if violation else "PASSED"
+
+
+@then('the result should be "{expected}"')
+def step_result_is(context: Context, expected: str) -> None:
+    assert context.result == expected, f"Expected '{expected}' but got '{context.result}'"
+```
+
+### 4. The test harness: `call_rule()`
+
+The core pattern: call real UC functions via the Statement Execution API. No local PySpark needed.
+
+```python
+import os
+
+from databricks.sdk import WorkspaceClient
+
+def call_rule(expr: str):
+    """Execute a SQL expression against the warehouse and return the scalar result."""
+    ws = WorkspaceClient()
+    warehouse_id = os.environ.get("DATABRICKS_WAREHOUSE_ID")
+
+    # Auto-qualify unqualified function names; catalog/schema are resolved from
+    # the test configuration (e.g., the ephemeral schema created in environment.py)
+    if "." not in expr.split("(")[0]:
+        func_name = expr.split("(")[0].strip()
+        expr = expr.replace(func_name, f"{catalog}.{schema}.{func_name}", 1)
+
+    sql = f"SELECT {expr} AS result"
+    response = ws.statement_execution.execute_statement(
+        warehouse_id=warehouse_id,
+        statement=sql,
+        wait_timeout="30s",
+    )
+    raw = response.result.data_array[0][0]
+    return _coerce(raw)  # Convert "true"->True, "false"->False, numeric->int/float
+```
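+
+The harness assumes `_coerce`, a small helper that converts the Statement Execution
+API's string cells back to Python values (the API returns scalar results as strings).
+It is not shown in this skill; a minimal sketch could be:
+
+```python
+def _coerce(raw):
+    """Best-effort conversion of a string cell to None/bool/int/float, else str."""
+    if raw is None:
+        return None
+    lowered = str(raw).strip().lower()
+    if lowered in ("true", "false"):
+        return lowered == "true"
+    for cast in (int, float):
+        try:
+            return cast(raw)
+        except (TypeError, ValueError):
+            continue
+    return raw
+```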
+### 5. Run tests
+
+```bash
+# All tests
+uv run behave --format=pretty
+
+# Smoke tests only
+uv run behave --tags="@smoke" --format=pretty
+
+# Specific feature
+uv run behave features/catalog/permissions.feature
+
+# Dry run (validate step coverage)
+uv run behave --dry-run
+
+# JUnit output for CI
+uv run behave --junit --junit-directory=reports/ --format=progress
+```
+
+## Common patterns
+
+### Pattern 1: Testing Unity Catalog SQL functions
+
+SQL functions are the single source of truth. The same function runs in BDD tests and in the production pipeline.
+
+```sql
+-- sql/rules/check_back_to_back_promo.sql
+CREATE OR REPLACE FUNCTION check_back_to_back_promo(
+    is_promoted BOOLEAN,
+    prev_promo_week_1 BOOLEAN,
+    prev_promo_week_2 BOOLEAN,
+    prev_promo_week_3 BOOLEAN,
+    prev_promo_week_4 BOOLEAN
+)
+RETURNS BOOLEAN
+RETURN
+    is_promoted AND (
+        COALESCE(prev_promo_week_1, FALSE) OR
+        COALESCE(prev_promo_week_2, FALSE) OR
+        COALESCE(prev_promo_week_3, FALSE) OR
+        COALESCE(prev_promo_week_4, FALSE)
+    );
+```
+
+The production pipeline calls the same function:
+
+```sql
+CREATE OR REFRESH MATERIALIZED VIEW compliance_results AS
+WITH timeline_with_lags AS (
+    SELECT *,
+        LAG(is_promoted, 1) OVER w AS prev_promo_1,
+        LAG(is_promoted, 2) OVER w AS prev_promo_2,
+        LAG(is_promoted, 3) OVER w AS prev_promo_3,
+        LAG(is_promoted, 4) OVER w AS prev_promo_4
+    FROM silver_timeline
+    WINDOW w AS (PARTITION BY product_id, location_id ORDER BY week_start)
+)
+SELECT
+    check_back_to_back_promo(
+        t.is_promoted, t.prev_promo_1, t.prev_promo_2,
+        t.prev_promo_3, t.prev_promo_4
+    ) AS b2b_violation
+FROM timeline_with_lags t;
+```
+
+### Pattern 2: Ephemeral test schemas
+
+Each test run creates an isolated schema, preventing cross-run contamination:
+
+```python
+# environment.py
+from datetime import datetime
+
+from databricks.sdk import WorkspaceClient
+
+def before_all(context):
+    ws = WorkspaceClient()
+    context.workspace = ws
+    # context.warehouse_id and context.catalog are resolved earlier
+    # from env vars or behave userdata
+    ts = datetime.now().strftime("%Y%m%d_%H%M%S")
+    context.test_schema = f"behave_test_{ts}"
+    ws.statement_execution.execute_statement(
+        warehouse_id=context.warehouse_id,
+        statement=f"CREATE SCHEMA IF NOT EXISTS {context.catalog}.{context.test_schema}",
+        wait_timeout="30s",
+    )
+
+def after_all(context):
+    context.workspace.statement_execution.execute_statement(
+        warehouse_id=context.warehouse_id,
+        statement=f"DROP SCHEMA IF EXISTS {context.catalog}.{context.test_schema} CASCADE",
+        wait_timeout="30s",
+    )
+```
+
+### Pattern 3: Scenario Outlines for data-driven testing
+
+```gherkin
+Scenario Outline: Established price boundary coverage
+  Given the promotion status is "<promoted>"
+  And the regular price is $<current> with prior weeks $<w1>, $<w2>, $<w3>, $<w4>
+  When I check the established price rule
+  Then the result should be "<expected>"
+
+  Examples: Various price histories
+    | promoted | current | w1    | w2    | w3    | w4    | expected |
+    | yes      | 10.00   | 10.00 | 10.00 | 10.00 | 10.00 | PASSED   |
+    | yes      | 10.00   | 10.00 | 10.00 | 10.00 | 9.99  | FAILED   |
+    | no       | 10.00   | 5.00  | 6.00  | 7.00  | 8.00  | PASSED   |
+```
+
+### Pattern 4: Pipeline integration tests
+
+```gherkin
+@integration @slow
+Feature: Pipeline end-to-end verification
+  Verify compliance rules through Bronze -> Silver -> Gold.
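+  # Needs a live workspace and a deployed pipeline, hence the @integration and @slow tags.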
+ + Scenario: Single promotion with gap passes end-to-end + Given a pipeline workspace connection + And events for product "PIPE-001" with a 5-week gap between promotions + When I push the events to the pipeline + And I wait for Gold results + Then the compliance status should be "PASSED" +``` + +### Pattern 5: Grant and permission testing + +```gherkin +@catalog @smoke +Feature: Unity Catalog permissions + Scenario: Grant SELECT on a table + Given a table "customers" in the test schema + When I grant SELECT on "customers" to group "readers" + Then the group "readers" should have SELECT on "customers" +``` + +## Gherkin writing rules + +**Declarative, not imperative.** Describe what the system should do, not UI clicks. + +**One behavior per scenario.** Split scenarios that test multiple independent things. + +**CRITICAL: Curly braces break step matching.** Behave's `parse` library treats `{anything}` as a capture group. Never use `{schema}.table` in feature text. Use short names like `"customers"` and resolve the schema in step code. + +**Trailing colons for data tables.** When a step has a data table, the `:` is part of the step text. Pattern must be `@given('a table with data:')` not `@given('a table with data')`. + +**Tag strategy:** + +| Tag | Purpose | Typical runtime | +|-----|---------|----------------| +| `@smoke` | Critical path, fast | < 30s each | +| `@regression` | Thorough coverage | Minutes | +| `@integration` | Needs live workspace | Minutes | +| `@slow` | Pipeline/job execution | > 2 min | +| `@wip` | Work in progress, skip in CI | N/A | + +## Makefile targets + +```makefile +.PHONY: bdd bdd-smoke bdd-report + +bdd: + uv run behave --format=pretty + +bdd-smoke: + uv run behave --tags="@smoke" --format=pretty + +bdd-report: + uv run behave --junit --junit-directory=reports/ --format=progress +``` + +## Prerequisites + +- Python 3.10+ +- `uv` for package management +- `databricks-sdk` and `behave` (`uv add --group test behave databricks-sdk`) +- Authenticated Databricks CLI profile or environment variables +- A SQL warehouse (auto-discovered if not specified) + +## Reference files + +- [gherkin-patterns.md](references/gherkin-patterns.md) — Databricks-specific Gherkin patterns for UC, pipelines, jobs, Apps, SQL +- [step-library.md](references/step-library.md) — Reusable step definitions for all Databricks domains +- [environment-template.md](references/environment-template.md) — Complete environment.py with Databricks hooks + +## Common issues + +| Issue | Solution | +|-------|----------| +| **Undefined step** | Run `uv run behave --dry-run` to find unmatched steps | +| **Auth failure (401/403)** | Check `databricks auth profiles` or env vars | +| **WAREHOUSE_NOT_RUNNING** | Start the SQL warehouse or use auto-start | +| **SCHEMA_NOT_FOUND** | Verify `before_all` created the ephemeral schema | +| **Step match collision** | Behave imports all steps globally; use unique patterns | +| **Curly brace parse error** | Don't use `{schema}` in feature files; resolve in step code | + +## External resources + +- [Public plugin repo](https://github.com/dgokeeffe/databricks-bdd-tools) — Full Claude Code plugin with four skills +- [The Foundation of Modern DataOps](https://medium.com/dbsql-sme-engineering/the-foundation-of-modern-dataops-with-databricks-68e36f5d72e8) — DataOps testing principles diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/environment-template.md 
b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/environment-template.md new file mode 100644 index 0000000..2a7dc1b --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/environment-template.md @@ -0,0 +1,195 @@ +# environment.py Template — Databricks + Behave + +Complete annotated template for `features/environment.py`. Copy and adapt to the target project. + +## Full template + +```python +"""Behave environment hooks — Databricks SDK integration. + +Sets up workspace connection, ephemeral test schema, and per-scenario cleanup. +""" +from __future__ import annotations + +import logging +import os +from datetime import datetime + +from behave.model import Feature, Scenario, Step +from behave.runner import Context + +logger = logging.getLogger("behave.databricks") + + +# ─── Session-level hooks ──────────────────────────────────────── + +def before_all(context: Context) -> None: + """Initialize Databricks clients and create ephemeral test schema.""" + from databricks.sdk import WorkspaceClient + + context.workspace = WorkspaceClient() + + # Fix host URL — some profiles include ?o= which breaks SDK API paths. + # The CLI handles this transparently but the SDK does not. + if context.workspace.config.host and "?" in context.workspace.config.host: + clean_host = context.workspace.config.host.split("?")[0].rstrip("/") + profile = os.environ.get("DATABRICKS_CONFIG_PROFILE") + context.workspace = WorkspaceClient(profile=profile, host=clean_host) + + # Verify auth + me = context.workspace.current_user.me() + context.current_user = me.user_name + logger.info("Authenticated as: %s", context.current_user) + + # Warehouse — from -D userdata, env var, or auto-discover + userdata = context.config.userdata + context.warehouse_id = ( + userdata.get("warehouse_id") + or os.environ.get("DATABRICKS_WAREHOUSE_ID") + or _discover_warehouse(context.workspace) + ) + logger.info("Using warehouse: %s", context.warehouse_id) + + # Catalog — from -D userdata or env var + context.test_catalog = userdata.get("catalog", os.environ.get("TEST_CATALOG", "main")) + + # Create ephemeral schema (timestamped for isolation) + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + worker = os.environ.get("BEHAVE_WORKER_ID", "0") + context.test_schema = f"{context.test_catalog}.behave_test_{ts}_w{worker}" + + _execute_sql(context, f"CREATE SCHEMA IF NOT EXISTS {context.test_schema}") + logger.info("Created test schema: %s", context.test_schema) + + +def after_all(context: Context) -> None: + """Drop ephemeral test schema.""" + if hasattr(context, "test_schema"): + try: + _execute_sql(context, f"DROP SCHEMA IF EXISTS {context.test_schema} CASCADE") + logger.info("Dropped test schema: %s", context.test_schema) + except Exception as e: + logger.warning("Failed to drop test schema %s: %s", context.test_schema, e) + + +# ─── Feature-level hooks ──────────────────────────────────────── + +def before_feature(context: Context, feature: Feature) -> None: + """Log feature start. Skip if tagged @skip.""" + logger.info("▶ Feature: %s", feature.name) + if "skip" in feature.tags: + feature.skip("Marked with @skip") + + +def after_feature(context: Context, feature: Feature) -> None: + logger.info("◀ Feature: %s [%s]", feature.name, feature.status) + + +# ─── Scenario-level hooks ─────────────────────────────────────── + +def before_scenario(context: Context, scenario: Scenario) -> None: + """Initialize per-scenario state. 
Skip @wip scenarios.""" + logger.info(" ▶ Scenario: %s", scenario.name) + if "wip" in scenario.tags: + scenario.skip("Work in progress") + return + # Track resources created during this scenario for cleanup + context.scenario_cleanup_sql = [] + + +def after_scenario(context: Context, scenario: Scenario) -> None: + """Clean up scenario-specific resources.""" + for sql in getattr(context, "scenario_cleanup_sql", []): + try: + _execute_sql(context, sql) + except Exception as e: + logger.warning("Cleanup SQL failed: %s — %s", sql, e) + if scenario.status == "failed": + logger.error(" ✗ FAILED: %s", scenario.name) + else: + logger.info(" ◀ Scenario: %s [%s]", scenario.name, scenario.status) + + +# ─── Step-level hooks ─────────────────────────────────────────── + +def before_step(context: Context, step: Step) -> None: + context._step_start = datetime.now() + + +def after_step(context: Context, step: Step) -> None: + elapsed = (datetime.now() - context._step_start).total_seconds() + if elapsed > 10: + logger.warning(" Slow step (%.1fs): %s %s", elapsed, step.keyword, step.name) + if step.status == "failed": + logger.error(" ✗ %s %s\n %s", step.keyword, step.name, step.error_message) + + +# ─── Tag-based hooks ──────────────────────────────────────────── + +def before_tag(context, tag: str) -> None: + """Ensure resources for tagged scenarios.""" + if tag == "fixture.sql_warehouse": + _ensure_warehouse_running(context) + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str) -> object: + """Execute a SQL statement via the Statement Execution API.""" + return context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + + +def _discover_warehouse(workspace) -> str: + """Find the first available SQL warehouse.""" + from databricks.sdk.service.sql import State + + warehouses = list(workspace.warehouses.list()) + # Prefer running warehouses + for wh in warehouses: + if wh.state == State.RUNNING: + return wh.id + if warehouses: + return warehouses[0].id + raise RuntimeError( + "No SQL warehouses found. Pass warehouse_id via -D warehouse_id= " + "or set DATABRICKS_WAREHOUSE_ID." + ) + + +def _ensure_warehouse_running(context: Context) -> None: + """Start warehouse if stopped. Used by @fixture.sql_warehouse tag.""" + from databricks.sdk.service.sql import State + + wh = context.workspace.warehouses.get(context.warehouse_id) + if wh.state != State.RUNNING: + logger.info("Starting warehouse %s...", context.warehouse_id) + context.workspace.warehouses.start(context.warehouse_id) + context.workspace.warehouses.wait_get_warehouse_running(context.warehouse_id) + logger.info("Warehouse %s is running.", context.warehouse_id) +``` + +## Context object layering + +Behave's `context` has scoped layers. Data set at different levels has different lifetimes: + +| Set in | Lifetime | Example | +|--------|----------|---------| +| `before_all` | Entire run | `context.workspace`, `context.test_schema` | +| `before_feature` | Current feature | `context.feature_data` | +| `before_scenario` / steps | Current scenario | `context.query_result`, `context.scenario_cleanup_sql` | + +At the end of each scenario, the scenario layer is popped — anything set during steps is gone. Root-level data persists across everything. + +## Parallel execution isolation + +When using `behavex` for parallel execution, each worker needs its own schema. The template uses `BEHAVE_WORKER_ID` from the environment. 
Set it in the parallel runner config or wrapper script: + +```bash +# Example wrapper for behavex +export BEHAVE_WORKER_ID=$WORKER_INDEX +behave "$@" +``` diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/gherkin-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/gherkin-patterns.md new file mode 100644 index 0000000..1913252 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/gherkin-patterns.md @@ -0,0 +1,446 @@ +# Gherkin Patterns for Databricks + +Reusable Gherkin patterns for common Databricks testing scenarios. Copy and adapt these to feature files. + +> **WARNING: Curly braces in step text break Behave's `parse` matcher.** +> +> Behave uses Python's `parse` library for step matching. Any `{...}` in step text +> is interpreted as a capture group. Writing `{test_schema}.customers` in a step line +> will **silently fail to match** your step definition. +> +> **The correct pattern:** +> - Step text uses **short table names in quotes**: `"customers"`, `"orders"` +> - SQL inside **docstrings** (triple-quoted blocks) can safely use `{schema}` because +> docstrings are accessed via `context.text`, not step matching +> - Step definitions prepend `context.test_schema + "."` internally to build the FQN +> +> ```python +> # WRONG - step text with curly braces +> @given('a table "{test_schema}.customers" exists') # BROKEN - parse eats {test_schema} +> +> # RIGHT - short name in step text, FQN built in the step body +> @given('a managed table "{table_name}" exists') +> def step_impl(context, table_name): +> fqn = f"{context.test_schema}.{table_name}" +> # ... use fqn +> ``` +> +> **Docstring SQL pattern** (safe because `context.text` is just a string): +> ```python +> @when('I execute SQL:') +> def step_impl(context): +> sql = context.text.replace("{schema}", context.test_schema) +> # ... 
execute sql
+> ```
+
+## Common Background
+
+Most Databricks feature files share this Background:
+
+```gherkin
+Background:
+  Given a Databricks workspace connection is established
+  And a test schema is provisioned
+```
+
+---
+
+## Unity Catalog
+
+### Table permissions
+
+```gherkin
+@catalog @permissions
+Feature: Unity Catalog table permissions
+  As a data engineer
+  I want to verify table-level permissions
+  So that sensitive data is properly protected
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Grant SELECT to a group
+    Given a managed table "customers" exists
+    When I execute SQL:
+      """sql
+      GRANT SELECT ON TABLE {schema}.customers TO `data_readers`
+      """
+    And I execute SQL:
+      """sql
+      SHOW GRANTS ON TABLE {schema}.customers
+      """
+    Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "data_readers"
+
+  Scenario Outline: Verify multiple privilege types
+    Given a managed table "sales" exists
+    When I execute SQL:
+      """sql
+      GRANT <privilege> ON TABLE {schema}.sales TO `<group>`
+      """
+    And I execute SQL:
+      """sql
+      SHOW GRANTS ON TABLE {schema}.sales
+      """
+    Then the result should contain a row where "ActionType" is "<privilege>" and "Principal" is "<group>"
+
+    Examples:
+      | privilege | group        |
+      | SELECT    | data_readers |
+      | MODIFY    | data_writers |
+```
+
+### Column masks
+
+```gherkin
+@catalog @security
+Feature: Column-level security
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Mask PII columns for analysts
+    Given a managed table "customers" with columns:
+      | column_name | data_type | contains_pii |
+      | id          | BIGINT    | false        |
+      | name        | STRING    | true         |
+      | email       | STRING    | true         |
+      | region      | STRING    | false        |
+    And a column mask function "mask_pii" is applied to "name" and "email" on "customers"
+    When I query "customers" as group "analysts"
+    Then columns "name" and "email" should return masked values
+    But columns "id" and "region" should return actual values
+```
+
+### Row filters
+
+```gherkin
+@catalog @security
+Feature: Row-level security
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  Scenario: Row filter restricts by region
+    Given a managed table "regional_sales" with data:
+      | region | revenue | quarter |
+      | APAC   | 50000   | Q1      |
+      | EMEA   | 75000   | Q1      |
+      | AMER   | 100000  | Q1      |
+    And a row filter on "regional_sales" restricts "apac_analysts" to region "APAC"
+    When I query "regional_sales" as group "apac_analysts"
+    Then I should only see rows where "region" is "APAC"
+    And the result should have 1 row
+```
+
+---
+
+## Lakeflow Spark Declarative Pipelines
+
+### Pipeline lifecycle
+
+```gherkin
+@pipeline @lakeflow
+Feature: Events pipeline processing
+  As a data engineer
+  I want to verify the events pipeline processes data correctly
+  So that downstream consumers get accurate aggregations
+
+  Background:
+    Given a Databricks workspace connection is established
+    And a test schema is provisioned
+
+  @integration @slow
+  Scenario: Full refresh produces expected tables
+    Given a pipeline "events_pipeline" exists targeting the test schema
+    When I trigger a full refresh of the pipeline
+    Then the pipeline update should succeed within 600 seconds
+    And the streaming table "bronze_events" should exist
+    And the materialized view "silver_events_agg" should exist
+    And the table "silver_events_agg" should have more than 0 rows
+
+  @integration
+  Scenario:
Incremental refresh picks up new data + Given the pipeline "events_pipeline" has completed a full refresh + When I insert test records into the source + And I trigger an incremental refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the new records should appear in "bronze_events" + + Scenario: Pipeline handles empty source gracefully + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table is empty + When I trigger a full refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the streaming table "bronze_events" should have 0 rows +``` + +### Pipeline failure handling + +```gherkin + Scenario: Pipeline surfaces schema mismatch errors + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table has an unexpected column "extra_col" of type "BINARY" + When I trigger a full refresh of the pipeline + Then the pipeline update should fail + And the pipeline error should mention schema +``` + +--- + +## Jobs and Notebooks + +### Notebook execution + +```gherkin +@jobs @notebook +Feature: Customer ETL notebook + As a data engineer + I want to verify the ETL notebook produces correct output + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @integration @slow + Scenario: Dedup notebook removes duplicates + Given a managed table "raw_customers" with data: + | customer_id | name | email | updated_at | + | 1 | Alice | alice@example.com | 2024-01-01T00:00:00 | + | 1 | Alice B. | alice@example.com | 2024-06-01T00:00:00 | + | 2 | Bob | bob@example.com | 2024-03-15T00:00:00 | + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | raw_customers | + | target_table | clean_customers| + Then the job should complete with status "SUCCESS" within 300 seconds + And the table "clean_customers" should have 2 rows + And the table "clean_customers" should contain a row where "customer_id" is "1" and "name" is "Alice B." 
+ + Scenario: Notebook fails gracefully on missing source + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | nonexistent | + | target_table | output | + Then the job should complete with status "FAILED" within 120 seconds +``` + +--- + +## Databricks Apps (FastAPI) + +### API endpoint testing + +```gherkin +@app @fastapi +Feature: Databricks App API + As a user + I want the app endpoints to work correctly + + Background: + Given the app is running at the configured base URL + And the test user is "testuser@databricks.com" + + @smoke + Scenario: Health check + When I GET "/health" + Then the response status should be 200 + And the response JSON should contain "status" with value "healthy" + + Scenario: Authenticated user can list resources + When I GET "/api/dashboards" with auth headers + Then the response status should be 200 + And the response should be a JSON list + + Scenario: Unauthenticated request is rejected + When I GET "/api/dashboards" without auth headers + Then the response status should be 401 + + Scenario: POST creates a resource + When I POST "/api/items" with auth headers and body: + """json + {"name": "Test Item", "description": "Created by BDD test"} + """ + Then the response status should be 201 + And the response JSON should contain "name" with value "Test Item" +``` + +### App deployment testing + +```gherkin +@app @deployment @slow +Feature: App deployment lifecycle + Scenario: Deploy and verify app is running + Given a bundle project at the repository root + When I deploy using Asset Bundles with target "dev" + Then the deployment should succeed + And the app should reach "RUNNING" state within 120 seconds + And the app health endpoint should return 200 +``` + +--- + +## SQL Data Quality + +### Row counts and data validation + +```gherkin +@sql @data-quality +Feature: Data quality checks + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @smoke + Scenario: Table is not empty + Given the table "orders" has been loaded + Then the table "orders" should have more than 0 rows + + Scenario: No duplicate primary keys + Given the table "orders" has been loaded + When I execute SQL: + """sql + SELECT order_id, COUNT(*) as cnt + FROM {schema}.orders + GROUP BY order_id + HAVING COUNT(*) > 1 + """ + Then the result should have 0 rows + + Scenario: Foreign key integrity + Given the tables "orders" and "customers" have been loaded + When I execute SQL: + """sql + SELECT o.customer_id + FROM {schema}.orders o + LEFT JOIN {schema}.customers c ON o.customer_id = c.customer_id + WHERE c.customer_id IS NULL + """ + Then the result should have 0 rows + + Scenario: No null values in required columns + When I execute SQL: + """sql + SELECT COUNT(*) as null_count + FROM {schema}.orders + WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL + """ + Then the first row column "null_count" should be "0" + + Scenario: Verify GRANT was applied via SQL + Given a managed table "products" exists + When I execute SQL: + """sql + GRANT SELECT ON TABLE {schema}.products TO `reporting_team` + """ + And I execute SQL: + """sql + SHOW GRANTS ON TABLE {schema}.products + """ + Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "reporting_team" +``` + +--- + +## Asset Bundles Deployment + +```gherkin +@deployment @dabs +Feature: Bundle lifecycle + @smoke + Scenario: Bundle validates successfully + When I run "databricks bundle 
validate" with target "dev" + Then the command should exit with code 0 + + @integration @slow + Scenario: Deploy and destroy lifecycle + When I run "databricks bundle deploy" with target "dev" + Then the command should exit with code 0 + When I run "databricks bundle destroy" with target "dev" and auto-approve + Then the command should exit with code 0 +``` + +--- + +## Scenario Outline patterns + +Use Scenario Outlines for testing multiple variations of the same behavior. + +Note: table names in the Examples table are short names (no schema prefix). The step +definition prepends `context.test_schema` to build the fully-qualified name. + +```gherkin + Scenario Outline: Verify table existence after pipeline run + Then the "" should exist + + Examples: Streaming tables + | table_type | table_name | + | streaming table | bronze_events | + | streaming table | bronze_transactions| + + Examples: Materialized views + | table_type | table_name | + | materialized view | silver_events_agg| + | materialized view | gold_summary | +``` + +--- + +## Steps with data tables and docstrings + +Steps that accept a data table or docstring **must** end with a trailing colon. The colon +is part of the step text that Behave matches against your `@given`/`@when`/`@then` decorator. + +```gherkin +# CORRECT - colon before data table +Given a managed table "customers" with data: + | id | name | region | + | 1 | Alice | APAC | + | 2 | Bob | EMEA | + +# CORRECT - colon before docstring +When I execute SQL: + """sql + SELECT * FROM {schema}.customers + """ + +# WRONG - missing colon, Behave will not match the step +Given a managed table "customers" with data + | id | name | region | +``` + +--- + +## SHOW GRANTS column names + +`SHOW GRANTS` returns PascalCase column names. Use these exact names when asserting +on grant results: + +| Column | Description | +|--------------|------------------------------------------------| +| `Principal` | The user, group, or service principal | +| `ActionType` | The privilege (SELECT, MODIFY, ALL PRIVILEGES) | +| `ObjectType` | TABLE, SCHEMA, CATALOG, etc. | +| `ObjectKey` | The fully-qualified object name | + +--- + +## Tag strategy + +| Tag | Purpose | Typical runtime | +|-----|---------|----------------| +| `@smoke` | Critical path, must always pass | < 30s per scenario | +| `@regression` | Full coverage | Minutes | +| `@integration` | Needs live workspace | Varies | +| `@slow` | Pipeline/job execution | > 2 min | +| `@wip` | Work in progress, skip by default | N/A | +| `@skip` | Explicitly disabled | N/A | +| `@catalog` | Unity Catalog tests | Varies | +| `@pipeline` | Lakeflow SDP tests | Minutes | +| `@jobs` | Job/notebook tests | Minutes | +| `@app` | Databricks Apps tests | Seconds | +| `@sql` | SQL/data quality tests | Seconds | +| `@deployment` | DABs lifecycle tests | Minutes | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/step-library.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/step-library.md new file mode 100644 index 0000000..11ddf76 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bdd-testing/references/step-library.md @@ -0,0 +1,660 @@ +# Reusable Step Definition Library + +Complete library of Databricks step definitions for Behave. Organized by domain. Copy relevant sections into `features/steps/` files. 
**Proven patterns used throughout:**

- Step patterns use **short names** (e.g., `"{table_name}"`), never `{test_schema}.table` in the pattern
- Step code builds FQN internally: `fqn = f"{context.test_schema}.{table_name}"`
- SQL in docstrings uses `{schema}` placeholder, replaced via `context.text.replace("{schema}", context.test_schema)`
- Steps with data tables have a **trailing colon** in the decorator: `@given('... with data:')`
- Grants use **SQL**, not the SDK grants API (which breaks on recent SDK versions)
- Integer parameters use Behave's built-in `{count:d}` format, not custom type parsers

---

## Common Steps (`common_steps.py`)

Always include these. They provide workspace connection, SQL execution, and basic assertions.

```python
"""Shared step definitions for Databricks BDD tests."""
from __future__ import annotations

from behave import given, then, step
from behave.runner import Context
from databricks.sdk.service.sql import StatementState


# ─── Connection and setup steps ─────────────────────────────────

@given("a Databricks workspace connection is established")
def step_workspace_connection(context: Context) -> None:
    """Initialize workspace client. Usually handled by environment.py."""
    if not hasattr(context, "workspace"):
        from databricks.sdk import WorkspaceClient
        context.workspace = WorkspaceClient()
        me = context.workspace.current_user.me()
        context.current_user = me.user_name


@given("a test schema is provisioned")
def step_test_schema(context: Context) -> None:
    """Verify test schema exists. Usually handled by environment.py."""
    assert hasattr(context, "test_schema"), (
        "No test_schema on context — check environment.py before_all"
    )


# ─── SQL execution steps ────────────────────────────────────────

@step("I execute SQL:")
def step_execute_sql_docstring(context: Context) -> None:
    """Execute SQL from a docstring (triple-quoted text in feature file).

    The trailing colon is part of the step text and matches the feature-file
    convention used throughout this skill. Use {schema} as the placeholder:
        When I execute SQL:
            \"\"\"
            SELECT * FROM {schema}.customers
            \"\"\"
    """
    sql = context.text.replace("{schema}", context.test_schema)
    context.query_result = _execute_sql(context, sql)


@step('I execute SQL "{sql}"')
def step_execute_sql_inline(context: Context, sql: str) -> None:
    """Execute inline SQL. 
The {schema} placeholder is replaced automatically.""" + sql = sql.replace("{schema}", context.test_schema) + context.query_result = _execute_sql(context, sql) + + +# ─── Table existence and row count assertions ─────────────────── + +@then('the table "{table_name}" should exist') +def step_table_exists(context: Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + context.workspace.tables.get(fqn) + except Exception as e: + raise AssertionError(f"Table {fqn} does not exist: {e}") + + +@then('the streaming table "{table_name}" should exist') +def step_streaming_table_exists(context: Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + info = context.workspace.tables.get(fqn) + assert info.table_type is not None, f"{fqn} exists but has no table_type" + except Exception as e: + raise AssertionError(f"Streaming table {fqn} does not exist: {e}") + + +@then('the materialized view "{table_name}" should exist') +def step_mv_exists(context: Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + context.workspace.tables.get(fqn) + except Exception as e: + raise AssertionError(f"Materialized view {fqn} does not exist: {e}") + + +@then('the table "{table_name}" should have {expected:d} rows') +def step_exact_row_count(context: Context, table_name: str, expected: int) -> None: + actual = _count_rows(context, table_name) + assert actual == expected, f"Expected {expected} rows in {table_name}, got {actual}" + + +@then('the table "{table_name}" should have more than {expected:d} rows') +def step_min_row_count(context: Context, table_name: str, expected: int) -> None: + actual = _count_rows(context, table_name) + assert actual > expected, f"Expected more than {expected} rows in {table_name}, got {actual}" + + +@then('the table "{table_name}" should have 0 rows') +def step_empty_table(context: Context, table_name: str) -> None: + actual = _count_rows(context, table_name) + assert actual == 0, f"Expected 0 rows in {table_name}, got {actual}" + + +# ─── Query result assertions ──────────────────────────────────── + +@then("the result should have {expected:d} rows") +def step_result_row_count(context: Context, expected: int) -> None: + rows = context.query_result.result.data_array or [] + actual = len(rows) + assert actual == expected, f"Expected {expected} rows, got {actual}" + + +@then("the result should have more than {expected:d} rows") +def step_result_min_rows(context: Context, expected: int) -> None: + rows = context.query_result.result.data_array or [] + actual = len(rows) + assert actual > expected, f"Expected more than {expected} rows, got {actual}" + + +@then('the first row column "{col}" should be "{value}"') +def step_first_row_value(context: Context, col: str, value: str) -> None: + result = context.query_result + columns = [c.name for c in result.manifest.schema.columns] + col_idx = columns.index(col) + actual = result.result.data_array[0][col_idx] + assert str(actual) == value, f"Expected {col}={value}, got {actual}" + + +# ─── Data setup steps ─────────────────────────────────────────── + +@given('the table "{table_name}" has been loaded') +def step_table_loaded(context: Context, table_name: str) -> None: + """Assert table exists and is not empty.""" + fqn = f"{context.test_schema}.{table_name}" + count = _count_rows(context, table_name) + assert count > 0, f"Table {fqn} exists but is empty" + + +@given('a managed table "{table_name}" exists') +def step_ensure_table_exists(context: 
Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + context.workspace.tables.get(fqn) + except Exception: + # Create a minimal table + _execute_sql(context, f"CREATE TABLE IF NOT EXISTS {fqn} (id BIGINT)") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + +@given('a managed table "{table_name}" with data:') +def step_create_table_with_data(context: Context, table_name: str) -> None: + """Create a table and populate from the Gherkin data table. + + The trailing colon in the decorator is required — Behave matches it + as part of the step text when a data table follows. + + Example feature file usage: + Given a managed table "customers" with data: + | id | name | region | + | 1 | Acme | APAC | + | 2 | Contoso | EMEA | + """ + fqn = f"{context.test_schema}.{table_name}" + headers = context.table.headings + rows = context.table.rows + + # Infer types (simple heuristic — all STRING) + col_defs = ", ".join(f"{h} STRING" for h in headers) + _execute_sql(context, f"CREATE OR REPLACE TABLE {fqn} ({col_defs})") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + # Insert rows + for row in rows: + values = ", ".join(f"'{cell}'" for cell in row) + _execute_sql(context, f"INSERT INTO {fqn} VALUES ({values})") + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str): + """Execute SQL and return result.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed: {result.status.error}\nStatement: {sql[:200]}" + ) + return result + + +def _count_rows(context: Context, table_name: str) -> int: + """Count rows in a table.""" + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SELECT COUNT(*) AS cnt FROM {fqn}") + return int(result.result.data_array[0][0]) +``` + +--- + +## Catalog Steps (`catalog_steps.py`) + +Uses SQL for grants instead of the SDK grants API. The SDK's `grants.update(securable_type=SecurableType.TABLE, ...)` fails with `SECURABLETYPE.TABLE is not a valid securable type` on recent SDK versions. + +```python +"""Step definitions for Unity Catalog permissions and security. + +Uses SQL for all grant operations. The SDK grants API is unreliable — +SecurableType.TABLE fails on recent databricks-sdk versions. +""" +from __future__ import annotations + +from behave import when, then +from behave.runner import Context +from databricks.sdk.service.sql import StatementState + + +@when('I grant {privilege} on table "{table_name}" to group "{group}"') +def step_grant(context: Context, privilege: str, table_name: str, group: str) -> None: + """Grant a privilege on a table using SQL. 
+ + Example feature file usage: + When I grant SELECT on table "customers" to group "analysts" + """ + fqn = f"{context.test_schema}.{table_name}" + _execute_sql(context, f"GRANT {privilege} ON TABLE {fqn} TO `{group}`") + + +@when('I revoke {privilege} on table "{table_name}" from group "{group}"') +def step_revoke(context: Context, privilege: str, table_name: str, group: str) -> None: + """Revoke a privilege on a table using SQL.""" + fqn = f"{context.test_schema}.{table_name}" + _execute_sql(context, f"REVOKE {privilege} ON TABLE {fqn} FROM `{group}`") + + +@then('the group "{group}" should have {privilege} permission on "{table_name}"') +def step_verify_grant( + context: Context, group: str, privilege: str, table_name: str +) -> None: + """Verify a grant exists using SHOW GRANTS. + + SHOW GRANTS returns PascalCase columns: Principal, ActionType, ObjectType, ObjectKey. + """ + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}") + columns = [c.name for c in result.manifest.schema.columns] + principal_idx = columns.index("Principal") + action_idx = columns.index("ActionType") + + found_privs = [] + for row in result.result.data_array or []: + if row[principal_idx] == group: + found_privs.append(row[action_idx]) + + assert privilege in found_privs, ( + f"Expected {group} to have {privilege} on {fqn}, " + f"found: {found_privs}" + ) + + +@then('the group "{group}" should not have {privilege} permission on "{table_name}"') +def step_verify_no_grant( + context: Context, group: str, privilege: str, table_name: str +) -> None: + """Verify a grant does NOT exist using SHOW GRANTS.""" + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}") + columns = [c.name for c in result.manifest.schema.columns] + principal_idx = columns.index("Principal") + action_idx = columns.index("ActionType") + + found_privs = [] + for row in result.result.data_array or []: + if row[principal_idx] == group: + found_privs.append(row[action_idx]) + + assert privilege not in found_privs, ( + f"Expected {group} NOT to have {privilege} on {fqn}, " + f"but found: {found_privs}" + ) + + +def _execute_sql(context: Context, sql: str): + """Execute SQL and return result.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed: {result.status.error}\nStatement: {sql[:200]}" + ) + return result +``` + +--- + +## Pipeline Steps (`pipeline_steps.py`) + +```python +"""Step definitions for Lakeflow Spark Declarative Pipelines.""" +from __future__ import annotations + +import time + +from behave import given, when, then +from behave.runner import Context + + +@given('a pipeline "{name}" exists targeting "{schema}"') +def step_pipeline_exists(context: Context, name: str, schema: str) -> None: + pipelines = list( + context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"') + ) + if pipelines: + context.pipeline_id = pipelines[0].pipeline_id + else: + result = context.workspace.pipelines.create( + name=name, + target=schema, + catalog=context.test_catalog, + channel="CURRENT", + ) + context.pipeline_id = result.pipeline_id + context.scenario_cleanup_sql.append(None) # Mark for pipeline cleanup + + +@given('the pipeline "{name}" has completed a full refresh') +def step_pipeline_refreshed(context: Context, name: str) -> None: + """Ensure pipeline exists 
and has been refreshed at least once."""
    pipelines = list(
        context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"')
    )
    assert pipelines, f"Pipeline '{name}' not found"
    context.pipeline_id = pipelines[0].pipeline_id
    # Check latest update status
    detail = context.workspace.pipelines.get(context.pipeline_id)
    assert detail.latest_updates, f"Pipeline '{name}' has never been run"


@when("I trigger a full refresh of the pipeline")
def step_full_refresh(context: Context) -> None:
    response = context.workspace.pipelines.start_update(
        pipeline_id=context.pipeline_id,
        full_refresh=True,
    )
    context.update_id = response.update_id


@when("I trigger an incremental refresh of the pipeline")
def step_incremental_refresh(context: Context) -> None:
    response = context.workspace.pipelines.start_update(
        pipeline_id=context.pipeline_id,
        full_refresh=False,
    )
    context.update_id = response.update_id


@then("the pipeline update should succeed within {timeout:d} seconds")
def step_pipeline_success(context: Context, timeout: int) -> None:
    _wait_for_pipeline(context, timeout, expect_success=True)


@then("the pipeline update should fail")
def step_pipeline_fail(context: Context) -> None:
    _wait_for_pipeline(context, timeout=300, expect_success=False)


@then('the pipeline error should mention {keyword}')
def step_pipeline_error_contains(context: Context, keyword: str) -> None:
    events = list(context.workspace.pipelines.list_pipeline_events(
        pipeline_id=context.pipeline_id,
        max_results=10,
    ))
    # e.level is an enum, so compare its string value rather than the enum itself
    error_messages = " ".join(
        str(e.message)
        for e in events
        if e.level and getattr(e.level, "value", str(e.level)) == "ERROR"
    )
    assert keyword.lower() in error_messages.lower(), (
        f"Expected pipeline error to mention '{keyword}', "
        f"but errors were: {error_messages[:500]}"
    )


def _wait_for_pipeline(
    context: Context, timeout: int, expect_success: bool
) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        update = context.workspace.pipelines.get_update(
            pipeline_id=context.pipeline_id,
            update_id=context.update_id,
        )
        # update.update.state is an enum; compare by value, not raw string
        state = update.update.state
        state_name = state.value if state else None
        if state_name == "COMPLETED":
            if expect_success:
                return
            raise AssertionError("Expected pipeline to fail, but it succeeded")
        if state_name in ("FAILED", "CANCELED"):
            if not expect_success:
                return
            raise AssertionError(
                f"Pipeline update {state_name}. Check update {context.update_id}"
            )
        time.sleep(15)
    raise TimeoutError(f"Pipeline did not complete within {timeout}s")
```

---

## Job Steps (`job_steps.py`)

```python
"""Step definitions for Databricks Jobs and notebook runs."""
from __future__ import annotations

import time

from behave import when, then
from behave.runner import Context
from databricks.sdk.service.jobs import (
    NotebookTask,
    RunLifeCycleState,
    SubmitTask,
)


@when('I run the notebook "{path}" with parameters:')
def step_run_notebook(context: Context, path: str) -> None:
    """Run a notebook with parameters from a Gherkin data table.

    The trailing colon is required when a data table follows. 
    Example feature file usage:
        When I run the notebook "/Workspace/tests/etl" with parameters:
            | key    | value     |
            | schema | my_schema |
            | mode   | full      |
    """
    params = {}
    for row in context.table:
        value = row["value"].replace("{schema}", context.test_schema)
        params[row["key"]] = value

    run = context.workspace.jobs.submit(
        run_name=f"behave-{context.scenario.name[:50]}",
        tasks=[
            SubmitTask(
                task_key="main",
                notebook_task=NotebookTask(
                    notebook_path=path,
                    base_parameters=params,
                ),
            )
        ],
    )
    context.run_id = run.response.run_id


@then('the job should complete with status "{expected}" within {timeout:d} seconds')
def step_job_status(context: Context, expected: str, timeout: int) -> None:
    deadline = time.time() + timeout
    while time.time() < deadline:
        run = context.workspace.jobs.get_run(context.run_id)
        state = run.state
        if state.life_cycle_state in (
            RunLifeCycleState.TERMINATED,
            RunLifeCycleState.INTERNAL_ERROR,
            RunLifeCycleState.SKIPPED,
        ):
            break
        time.sleep(10)
    else:
        raise TimeoutError(f"Run {context.run_id} did not complete within {timeout}s")

    actual = state.result_state.value if state.result_state else "UNKNOWN"
    assert actual == expected, (
        f"Expected {expected}, got {actual}. Message: {state.state_message}"
    )
```

---

## App Steps (`app_steps.py`)

```python
"""Step definitions for Databricks Apps (FastAPI) testing."""
from __future__ import annotations

import subprocess
import os

import httpx
from behave import given, when, then
from behave.runner import Context


@given('the app is running at "{base_url}"')
def step_app_running(context: Context, base_url: str) -> None:
    context.app_client = httpx.Client(base_url=base_url, timeout=10)


@given('the test user is "{email}"')
def step_test_user(context: Context, email: str) -> None:
    context.auth_headers = {
        "X-Forwarded-Email": email,
        "X-Forwarded-User": email.split("@")[0],
    }


@when('I GET "{path}"')
def step_get(context: Context, path: str) -> None:
    context.response = context.app_client.get(path)


@when('I GET "{path}" with auth headers')
def step_get_auth(context: Context, path: str) -> None:
    context.response = context.app_client.get(path, headers=context.auth_headers)


@when('I GET "{path}" without auth headers')
def step_get_no_auth(context: Context, path: str) -> None:
    context.response = context.app_client.get(path)


@when('I POST "{path}" with auth headers and body:')
def step_post_auth(context: Context, path: str) -> None:
    """POST with JSON body from a docstring. The trailing colon is required. 
    Example feature file usage:
        When I POST "/api/items" with auth headers and body:
            \"\"\"
            {"name": "test-item", "value": 42}
            \"\"\"
    """
    import json
    body = json.loads(context.text)
    context.response = context.app_client.post(
        path, json=body, headers=context.auth_headers,
    )


@then("the response status should be {code:d}")
def step_status_code(context: Context, code: int) -> None:
    assert context.response.status_code == code, (
        f"Expected {code}, got {context.response.status_code}: "
        f"{context.response.text[:200]}"
    )


@then('the response JSON should contain "{key}" with value "{value}"')
def step_json_value(context: Context, key: str, value: str) -> None:
    data = context.response.json()
    assert key in data, f"Key '{key}' not in response: {list(data.keys())}"
    assert str(data[key]) == value, f"Expected {key}='{value}', got '{data[key]}'"


@then("the response should be a JSON list")
def step_json_list(context: Context) -> None:
    data = context.response.json()
    assert isinstance(data, list), f"Expected list, got {type(data).__name__}"


# ─── Deployment steps ────────────────────────────────────────────

@when('I deploy using Asset Bundles with target "{target}"')
def step_deploy_bundle(context: Context, target: str) -> None:
    result = subprocess.run(
        ["databricks", "bundle", "deploy", "--target", target],
        capture_output=True,
        text=True,
        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
        timeout=300,
    )
    context.deploy_result = result


@then("the deployment should succeed")
def step_deploy_success(context: Context) -> None:
    r = context.deploy_result
    assert r.returncode == 0, (
        f"Deploy failed (rc={r.returncode}):\n{r.stderr[:500]}"
    )
```

---

## Shell Command Steps (reusable)

```python
"""Step definitions for running CLI commands (DABs, databricks CLI)."""
from __future__ import annotations

import os
import subprocess

from behave import when, then
from behave.runner import Context


@when('I run "{command}" with target "{target}"')
def step_run_command(context: Context, command: str, target: str) -> None:
    full_cmd = f"{command} --target {target}"
    context.cmd_result = subprocess.run(
        full_cmd.split(),
        capture_output=True,
        text=True,
        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
        timeout=300,
    )


@when('I run "{command}" with target "{target}" and auto-approve')
def step_run_command_approve(context: Context, command: str, target: str) -> None:
    full_cmd = f"{command} --target {target} --auto-approve"
    context.cmd_result = subprocess.run(
        full_cmd.split(),
        capture_output=True,
        text=True,
        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
        timeout=300,
    )


@then("the command should exit with code {code:d}")
def step_exit_code(context: Context, code: int) -> None:
    actual = context.cmd_result.returncode
    assert actual == code, (
        f"Expected exit code {code}, got {actual}.\n"
        f"stdout: {context.cmd_result.stdout[:300]}\n"
        f"stderr: {context.cmd_result.stderr[:300]}"
    )


@then("the command should succeed")
def step_command_success(context: Context) -> None:
    assert context.cmd_result.returncode == 0, (
        f"Command failed (rc={context.cmd_result.returncode}):\n"
        f"{context.cmd_result.stderr[:500]}"
    )
```

diff --git a/.claude/skills/databricks-asset-bundles/SDP_guidance.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/SDP_guidance.md
similarity index 100%
rename from 
.claude/skills/databricks-asset-bundles/SDP_guidance.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/SDP_guidance.md diff --git a/.claude/skills/databricks-asset-bundles/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/SKILL.md similarity index 95% rename from .claude/skills/databricks-asset-bundles/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/SKILL.md index 4253e8e..5b01051 100644 --- a/.claude/skills/databricks-asset-bundles/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/SKILL.md @@ -1,9 +1,9 @@ --- -name: databricks-asset-bundles -description: "Create and configure Databricks Asset Bundles (DABs) with best practices for multi-environment deployments. Use when working with: (1) Creating new DAB projects, (2) Adding resources (dashboards, pipelines, jobs, alerts), (3) Configuring multi-environment deployments, (4) Setting up permissions, (5) Deploying or running bundle resources" +name: databricks-bundles +description: "Create and configure Declarative Automation Bundles (formerly Asset Bundles) with best practices for multi-environment deployments (CICD). Use when working with: (1) Creating new DAB projects, (2) Adding resources (dashboards, pipelines, jobs, alerts), (3) Configuring multi-environment deployments, (4) Setting up permissions, (5) Deploying or running bundle resources" --- -# Databricks Asset Bundle (DABs) Writer +# DABs Writer ## Overview Create DABs for multi-environment deployment (dev/staging/prod). @@ -317,7 +317,7 @@ databricks bundle destroy -t prod --auto-approve ## Resources -- [Databricks Asset Bundles Documentation](https://docs.databricks.com/dev-tools/bundles/) +- [DABs Documentation](https://docs.databricks.com/dev-tools/bundles/) - [Bundle Resources Reference](https://docs.databricks.com/dev-tools/bundles/resources) - [Bundle Configuration Reference](https://docs.databricks.com/dev-tools/bundles/settings) - [Supported Resource Types](https://docs.databricks.com/aws/en/dev-tools/bundles/resources#resource-types) diff --git a/.claude/skills/databricks-asset-bundles/alerts_guidance.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/alerts_guidance.md similarity index 100% rename from .claude/skills/databricks-asset-bundles/alerts_guidance.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-bundles/alerts_guidance.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-config/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-config/SKILL.md new file mode 100644 index 0000000..118713d --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-config/SKILL.md @@ -0,0 +1,22 @@ +--- +name: databricks-config +description: "Manage Databricks workspace connections: check current workspace, switch profiles, list available workspaces, or authenticate to a new workspace. Use when the user mentions \"switch workspace\", \"which workspace\", \"current profile\", \"databrickscfg\", \"connect to workspace\", or \"databricks auth\"." +--- + +Use the `manage_workspace` MCP tool for all workspace operations. Do NOT edit `~/.databrickscfg`, use Bash, or use the Databricks CLI. + +## Steps + +1. Call `ToolSearch` with query `select:mcp__databricks__manage_workspace` to load the tool. + +2. 
Map user intent to action:
   - status / which workspace / current → `action="status"`
   - list / available workspaces → `action="list"`
   - switch to X → call `list` first to find the profile name, then `action="switch", profile="<profile-name>"` (or `host="<workspace-url>"` if a URL was given)
   - login / connect / authenticate → `action="login", host="<workspace-url>"`

3. Call `mcp__databricks__manage_workspace` with the action and any parameters.

4. Present the result. For `status`/`switch`/`login`: show host, profile, username. For `list`: formatted table with the active profile marked.

> **Note:** The switch is session-scoped — it resets on MCP server restart. For permanent profile setup, use `databricks auth login -p <profile>` and update `~/.databrickscfg` with `cluster_id` or `serverless_compute_id = auto`.

diff --git a/.claude/skills/databricks-dbsql/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/SKILL.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/SKILL.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/SKILL.md
diff --git a/.claude/skills/databricks-dbsql/ai-functions.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/ai-functions.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/ai-functions.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/ai-functions.md
diff --git a/.claude/skills/databricks-dbsql/best-practices.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/best-practices.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/best-practices.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/best-practices.md
diff --git a/.claude/skills/databricks-dbsql/geospatial-collations.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/geospatial-collations.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/geospatial-collations.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/geospatial-collations.md
diff --git a/.claude/skills/databricks-dbsql/materialized-views-pipes.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/materialized-views-pipes.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/materialized-views-pipes.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/materialized-views-pipes.md
diff --git a/.claude/skills/databricks-dbsql/sql-scripting.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/sql-scripting.md
similarity index 100%
rename from .claude/skills/databricks-dbsql/sql-scripting.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-dbsql/sql-scripting.md
diff --git a/.claude/skills/databricks-docs/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-docs/SKILL.md
similarity index 80%
rename from .claude/skills/databricks-docs/SKILL.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-docs/SKILL.md
index 54bb157..ceca11e 100644
--- a/.claude/skills/databricks-docs/SKILL.md
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-docs/SKILL.md
@@ -1,6 +1,6 @@
---
name: databricks-docs
-description: "Databricks documentation reference. Use as a lookup resource alongside other skills and MCP tools for comprehensive guidance."
+description: "Databricks documentation reference via llms.txt index. 
Use when other skills do not cover a topic, looking up unfamiliar Databricks features, or needing authoritative docs on APIs, configurations, or platform capabilities." --- # Databricks Documentation Reference @@ -16,7 +16,7 @@ This is a **reference skill**, not an action skill. Use it to: - Find detailed information to inform how you use MCP tools - Discover features and capabilities you may not know about -**Always prefer using MCP tools for actions** (execute_sql, create_or_update_pipeline, etc.) and **load specific skills for workflows** (databricks-python-sdk, databricks-spark-declarative-pipelines, etc.). Use this skill when you need reference documentation. +**Always prefer using MCP tools for actions** (execute_sql, manage_pipeline, etc.) and **load specific skills for workflows** (databricks-python-sdk, databricks-spark-declarative-pipelines, etc.). Use this skill when you need reference documentation. ## How to Use @@ -47,7 +47,7 @@ The llms.txt file is organized by category: 1. Load `databricks-spark-declarative-pipelines` skill for workflow patterns 2. Use this skill to fetch docs if you need clarification on specific DLT features -3. Use `create_or_update_pipeline` MCP tool to actually create the pipeline +3. Use `manage_pipeline(action="create_or_update")` MCP tool to actually create the pipeline **Scenario:** User asks about an unfamiliar Databricks feature diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/SKILL.md new file mode 100644 index 0000000..c351838 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/SKILL.md @@ -0,0 +1,82 @@ +--- +name: databricks-execution-compute +description: >- + Execute code and manage compute on Databricks. Use this skill when the user + mentions: "run code", "execute", "run on databricks", "serverless", "no + cluster", "run python", "run scala", "run sql", "run R", "run file", "push + and run", "notebook run", "batch script", "model training", "run script on + cluster", "create cluster", "new cluster", "resize cluster", "modify cluster", + "delete cluster", "terminate cluster", "create warehouse", "new warehouse", + "resize warehouse", "delete warehouse", "node types", "runtime versions", + "DBR versions", "spin up compute", "provision cluster". +--- + +# Databricks Execution & Compute + +Run code on Databricks. Three execution modes—choose based on workload. + +## Execution Mode Decision Matrix + +| Aspect | [Databricks Connect](references/1-databricks-connect.md) ⭐ | [Serverless Job](references/2-serverless-job.md) | [Interactive Cluster](references/3-interactive-cluster.md) | +|--------|-------------------|----------------|---------------------| +| **Use for** | Spark code (ETL, data gen) | Heavy processing (ML) | State across tool calls, Scala/R | +| **Startup** | Instant | ~25-50s cold start | ~5min if stopped | +| **State** | Within Python process | None | Via context_id | +| **Languages** | Python (PySpark) | Python, SQL | Python, Scala, SQL, R | +| **Dependencies** | `withDependencies()` | CLI with environments spec | Install on cluster | + +### Decision Flow + +``` +Spark-based code? → Databricks Connect (fastest) + └─ Python 3.12 missing? → Install it + databricks-connect + └─ Install fails? → Ask user (don't auto-switch modes) + +Heavy/long-running (ML)? → Serverless Job (independent) +Need state across calls? 
→ Interactive Cluster (list and ask which one to use)
Scala/R? → Interactive Cluster (list and ask which one to use)
```


## How to Run Code

**Read the reference file for your chosen mode before proceeding.**

### Databricks Connect (no MCP tool, run locally) → [reference](references/1-databricks-connect.md)

```bash
python my_spark_script.py
```

### Serverless Job → [reference](references/2-serverless-job.md)

```python
execute_code(file_path="/path/to/script.py")
```

### Interactive Cluster → [reference](references/3-interactive-cluster.md)

```python
# Check for running clusters first (or use the one instructed)
list_compute(resource="clusters")
# Ask the user which one to use

# Run code, reuse context_id for follow-up MCP call
result = execute_code(code="...", compute_type="cluster", cluster_id="...")
execute_code(code="...", context_id=result["context_id"], cluster_id=result["cluster_id"])
```

## MCP Tools

| Tool | For | Purpose |
|------|-----|---------|
| `execute_code` | Serverless, Interactive | Run code remotely |
| `list_compute` | Interactive | List clusters, check status, auto-select running cluster |
| `manage_cluster` | Interactive | Create, start, terminate, delete. **COSTLY:** `start` takes 3-8 min; ask the user first |
| `manage_sql_warehouse` | SQL | Create, modify, delete SQL warehouses |

## Related Skills

- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** — Data generation using Spark + Faker
- **[databricks-jobs](../databricks-jobs/SKILL.md)** — Production job orchestration
- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse and AI functions

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/1-databricks-connect.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/1-databricks-connect.md
new file mode 100644
index 0000000..838d2a7
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/1-databricks-connect.md
@@ -0,0 +1,72 @@
# Databricks Connect (Recommended Default)

**Use when:** Running Spark code locally that executes on Databricks serverless compute. This is the fastest, cleanest approach for data generation, ETL, and any Spark workload.

## Why Databricks Connect First?

- **Instant iteration** — Edit file, re-run immediately
- **Local debugging** — IDE debugger, breakpoints work
- **No cold start** — Session stays warm across executions
- **Clean dependencies** — `withDependencies()` installs packages on remote compute

## Requirements

- **Python 3.12** (databricks-connect >= 16.4 requires it)
- **databricks-connect >= 16.4** package
- **~/.databrickscfg** with serverless config

## Setup

**Python 3.12 required.** If it is not available, install it (e.g., with uv). If the install fails, ask the user; don't auto-switch modes. 
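A minimal setup sketch, assuming `uv` is available (any venv tool works; the `.venv` path is arbitrary):

```bash
# Create a Python 3.12 venv and install Databricks Connect into it
uv venv .venv --python 3.12
source .venv/bin/activate
uv pip install "databricks-connect>=16.4"
```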
Use the default profile; if one is not set up, you can add it to `~/.databrickscfg` (never overwrite an existing file without consent):
```ini
[DEFAULT]
host = https://your-workspace.cloud.databricks.com/
serverless_compute_id = auto
auth_type = databricks-cli
```

## Usage Pattern

```python
from databricks.connect import DatabricksSession, DatabricksEnv

# Declare dependencies installed on serverless compute
# CRITICAL: Include ALL packages used inside UDFs (pandas/numpy are there by default)
env = DatabricksEnv().withDependencies("faker", "holidays")

spark = (
    DatabricksSession.builder
    .profile("my-workspace")  # optional: run on a specific profile from ~/.databrickscfg instead of default
    .withEnvironment(env)
    .serverless(True)
    .getOrCreate()
)

# Spark code now executes on Databricks serverless
df = spark.range(1000)...
df.write.mode('overwrite').saveAsTable("catalog.schema.table")
```

## Common Issues

| Issue | Solution |
|-------|----------|
| `Python 3.12 required` | Create a venv with the correct Python version |
| `DatabricksEnv not found` | Upgrade to databricks-connect >= 16.4 |
| `serverless_compute_id` error | Add `serverless_compute_id = auto` to ~/.databrickscfg |
| `ModuleNotFoundError` inside UDF | Add the package to `withDependencies()` |
| `PERSIST TABLE not supported` | Don't use `.cache()` or `.persist()` with serverless |
| `broadcast` is used | Don't broadcast small DataFrames over Spark Connect; use a plain Python list or a regular join instead |

## When NOT to Use

Switch to **[Serverless Job](2-serverless-job.md)** when:
- One-off execution
- Heavy ML training that shouldn't depend on the local machine staying connected
- Non-Spark Python code (pure sklearn, pytorch, etc.)

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- Need state across multiple separate MCP tool calls
- Need Scala or R support

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/2-serverless-job.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/2-serverless-job.md
new file mode 100644
index 0000000..4be8801
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/2-serverless-job.md
@@ -0,0 +1,76 @@
# Serverless Job Execution

**Use when:** Running intensive Python code remotely (ML training, heavy processing) that doesn't need Spark, or when code shouldn't depend on the local machine staying connected. 
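For orientation, the shape of a script suited to this mode, as a sketch only (the metric and path are made up; `dbutils` is injected by the Databricks runtime, and the exit-payload convention is covered under Output Handling below):

```python
# train.py: a self-contained serverless job script (illustrative names)
import json


def train() -> dict:
    # Heavy, non-interactive work goes here; nothing depends on the local machine
    return {"accuracy": 0.95, "model_path": "/Volumes/main/models/run1"}


results = train()
# print() output is not reliably captured; return structured results instead
dbutils.notebook.exit(json.dumps(results))  # noqa: F821 (dbutils comes from the runtime)
```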
## When to Choose Serverless Job

- ML model training (runs independently of local machine)
- Heavy non-Spark Python processing
- Code that takes > 5 minutes (local connection can drop)
- Production/scheduled runs

## Trade-offs

| Pro | Con |
|-----|-----|
| No cluster to manage | ~25-50s cold start each invocation |
| Up to 30 min timeout | No state preserved between calls |
| Independent execution | print() unreliable; use `dbutils.notebook.exit()` |

## Executing Code

### Prefer Running from a Local File (edit the local file, then run it)

```python
execute_code(
    file_path="/local/path/to/train_model.py",
    compute_type="serverless"
)
```

## Jobs with Custom Dependencies

Use `job_extra_params` to install pip packages:

```python
execute_code(
    file_path="/path/to/train.py",
    job_extra_params={
        "environments": [{
            "environment_key": "ml_env",
            "spec": {"client": "4", "dependencies": ["scikit-learn", "pandas", "mlflow"]}
        }]
    }
)
```

**CRITICAL:** Use `"client": "4"` in the spec. `"client": "1"` won't install dependencies.

## Output Handling

```python
# ❌ BAD - print() may not be captured
print("Training complete!")

# ✅ GOOD - Use dbutils.notebook.exit()
import json
results = {"accuracy": 0.95, "model_path": "/Volumes/..."}
dbutils.notebook.exit(json.dumps(results))
```

## Common Issues

| Issue | Solution |
|-------|----------|
| print() output missing | Use `dbutils.notebook.exit()` |
| `ModuleNotFoundError` | Add to environments spec with `"client": "4"` |
| Job times out | Max is 1800s; split into smaller tasks |

## When NOT to Use

Switch to **[Databricks Connect](1-databricks-connect.md)** when:
- Iterating on Spark code and want instant feedback
- Need local debugging with breakpoints

Switch to **[Interactive Cluster](3-interactive-cluster.md)** when:
- Need state across multiple MCP tool calls
- Need Scala or R support

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/3-interactive-cluster.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/3-interactive-cluster.md
new file mode 100644
index 0000000..aa73ea9
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-execution-compute/references/3-interactive-cluster.md
@@ -0,0 +1,140 @@
# Interactive Cluster Execution

**Use when:** You have an existing running cluster and need to preserve state across multiple MCP tool calls, or need Scala/R support.

## When to Choose Interactive Cluster

- Multiple sequential commands where variables must persist
- Scala or R code (serverless only supports Python/SQL)
- Existing running cluster available

## Trade-offs

| Pro | Con |
|-----|-----|
| State persists via `context_id` | Cluster startup ~5 min if not running |
| Near-instant follow-up commands | Costs money while running |
| Scala/R/SQL support | Must manage cluster lifecycle |

## Critical: Never Start a Cluster Without Asking

**Starting a cluster takes 3-8 minutes and costs money.** Always check first:

```python
list_compute(resource="clusters")
```

If no cluster is running, ask the user:
> "No running cluster. Options:
> 1. Start 'my-dev-cluster' (~5 min startup, costs money)
> 2. Use serverless (instant, no setup)
> Which do you prefer?" 
## Basic Usage

### First Command: Creates Context

```python
result = execute_code(
    code="import pandas as pd\ndf = pd.DataFrame({'a': [1, 2, 3]})",
    compute_type="cluster",
    cluster_id="1234-567890-abcdef"
)
# result contains context_id for reuse
```

### Follow-up Commands: Reuse Context

```python
# Variables from first command still available
execute_code(
    code="print(df.shape)",  # df exists
    context_id=result["context_id"],
    cluster_id=result["cluster_id"]
)
```

### Auto-Select Best Running Cluster

```python
best_cluster = list_compute(resource="clusters", auto_select=True)
execute_code(
    code="spark.range(100).show()",
    compute_type="cluster",
    cluster_id=best_cluster["cluster_id"]
)
```

## Language Support

```python
execute_code(code='println("Hello")', compute_type="cluster", language="scala")
execute_code(code="SELECT * FROM table LIMIT 10", compute_type="cluster", language="sql")
execute_code(code='print("Hello")', compute_type="cluster", language="r")
```

## Installing Libraries

Install pip packages directly in the execution context (pandas/numpy are there by default):

```python
# Install a library; keep the second line unindented so it parses as top-level code
execute_code(
    code="""%pip install faker
dbutils.library.restartPython()""",  # Restart Python to pick up new packages (if needed)
    compute_type="cluster",
    cluster_id="...",
    context_id="..."
)
```

## Context Lifecycle

**Keep alive (default):** Context persists until cluster terminates.

**Destroy when done:**
```python
execute_code(
    code="print('Done!')",
    compute_type="cluster",
    destroy_context_on_completion=True
)
```

## Handling No Running Cluster

When no cluster is running, `execute_code` returns:
```json
{
  "success": false,
  "error": "No running cluster available",
  "startable_clusters": [{"cluster_id": "...", "cluster_name": "...", "state": "TERMINATED"}],
  "suggestions": ["Start a terminated cluster", "Use serverless instead"]
}
```

### Starting a Cluster (With User Approval Only)

```python
manage_cluster(action="start", cluster_id="1234-567890-abcdef")
# Poll until the cluster is RUNNING (wait ~20s between checks)
list_compute(resource="clusters", cluster_id="1234-567890-abcdef")
```

## Common Issues

| Issue | Solution |
|-------|----------|
| "No running cluster" | Ask user to start or use serverless |
| Context not found | Context expired; create new one |
| Library not found | `%pip install <package>` then if needed `dbutils.library.restartPython()` |

## When NOT to Use

Switch to **[Databricks Connect](1-databricks-connect.md)** when:
- Developing Spark code with local debugging
- Want instant iteration without cluster concerns

Switch to **[Serverless Job](2-serverless-job.md)** when:
- No cluster running and user doesn't want to wait
- One-off execution without state needs

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/SKILL.md
new file mode 100644
index 0000000..8233247
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/SKILL.md
@@ -0,0 +1,200 @@
---
name: databricks-genie
description: "Create and query Databricks Genie Spaces for natural language SQL exploration. Use when building Genie Spaces, exporting and importing Genie Spaces, migrating Genie Spaces between workspaces or environments, or asking questions via the Genie Conversation API." 
+--- + +# Databricks Genie + +Create, manage, and query Databricks Genie Spaces - natural language interfaces for SQL-based data exploration. + +## Overview + +Genie Spaces allow users to ask natural language questions about structured data in Unity Catalog. The system translates questions into SQL queries, executes them on a SQL warehouse, and presents results conversationally. + +## When to Use This Skill + +Use this skill when: +- Creating a new Genie Space for data exploration +- Adding sample questions to guide users +- Connecting Unity Catalog tables to a conversational interface +- Asking questions to a Genie Space programmatically (Conversation API) +- Exporting a Genie Space configuration (serialized_space) for backup or migration +- Importing / cloning a Genie Space from a serialized payload +- Migrating a Genie Space between workspaces or environments (dev → staging → prod) + - Only supports catalog remapping where catalog names differ across environments + - Not supported for schema and/or table names that differ across environments + - Not including migration of tables between environments (only migration of Genie Spaces) + +## MCP Tools + +| Tool | Purpose | +|------|---------| +| `manage_genie` | Create, get, list, delete, export, and import Genie Spaces | +| `ask_genie` | Ask natural language questions to a Genie Space | +| `get_table_stats_and_schema` | Inspect table schemas before creating a space | +| `execute_sql` | Test SQL queries directly | + +### manage_genie - Space Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Idempotent create/update a space | display_name, table_identifiers (or serialized_space) | +| `get` | Get space details | space_id | +| `list` | List all spaces | (none) | +| `delete` | Delete a space | space_id | +| `export` | Export space config for migration/backup | space_id | +| `import` | Import space from serialized config | warehouse_id, serialized_space | + +**Example tool calls:** +``` +# MCP Tool: manage_genie +# Create a new space +manage_genie( + action="create_or_update", + display_name="Sales Analytics", + table_identifiers=["catalog.schema.customers", "catalog.schema.orders"], + description="Explore sales data with natural language", + sample_questions=["What were total sales last month?"] +) + +# MCP Tool: manage_genie +# Get space details with full config +manage_genie(action="get", space_id="space_123", include_serialized_space=True) + +# MCP Tool: manage_genie +# List all spaces +manage_genie(action="list") + +# MCP Tool: manage_genie +# Export for migration +exported = manage_genie(action="export", space_id="space_123") + +# MCP Tool: manage_genie +# Import to new workspace +manage_genie( + action="import", + warehouse_id="warehouse_456", + serialized_space=exported["serialized_space"], + title="Sales Analytics (Prod)" +) +``` + +### ask_genie - Conversation API (Query) + +Ask natural language questions to a Genie Space. Pass `conversation_id` for follow-up questions. + +``` +# MCP Tool: ask_genie +# Start a new conversation +result = ask_genie( + space_id="space_123", + question="What were total sales last month?" +) +# Returns: {question, conversation_id, message_id, status, sql, columns, data, row_count} + +# MCP Tool: ask_genie +# Follow-up question in same conversation +result = ask_genie( + space_id="space_123", + question="Break that down by region", + conversation_id=result["conversation_id"] +) +``` + +## Quick Start + +### 1. 
Inspect Your Tables + +Before creating a Genie Space, understand your data: + +``` +# MCP Tool: get_table_stats_and_schema +get_table_stats_and_schema( + catalog="my_catalog", + schema="sales", + table_stat_level="SIMPLE" +) +``` + +### 2. Create the Genie Space + +``` +# MCP Tool: manage_genie +manage_genie( + action="create_or_update", + display_name="Sales Analytics", + table_identifiers=[ + "my_catalog.sales.customers", + "my_catalog.sales.orders" + ], + description="Explore sales data with natural language", + sample_questions=[ + "What were total sales last month?", + "Who are our top 10 customers?" + ] +) +``` + +### 3. Ask Questions (Conversation API) + +``` +# MCP Tool: ask_genie +ask_genie( + space_id="your_space_id", + question="What were total sales last month?" +) +# Returns: SQL, columns, data, row_count +``` + +### 4. Export & Import (Clone / Migrate) + +Export a space (preserves all tables, instructions, SQL examples, and layout): + +``` +# MCP Tool: manage_genie +exported = manage_genie(action="export", space_id="your_space_id") +# exported["serialized_space"] contains the full config +``` + +Clone to a new space (same catalog): + +``` +# MCP Tool: manage_genie +manage_genie( + action="import", + warehouse_id=exported["warehouse_id"], + serialized_space=exported["serialized_space"], + title=exported["title"], # override title; omit to keep original + description=exported["description"], +) +``` + +> **Cross-workspace migration:** Each MCP server is workspace-scoped. Configure one server entry per workspace profile in your IDE's MCP config, then `manage_genie(action="export")` from the source server and `manage_genie(action="import")` via the target server. See [spaces.md §Migration](spaces.md#migrating-across-workspaces-with-catalog-remapping) for the full workflow. + +## Reference Files + +- [spaces.md](spaces.md) - Creating and managing Genie Spaces +- [conversation.md](conversation.md) - Asking questions via the Conversation API + +## Prerequisites + +Before creating a Genie Space: + +1. **Tables in Unity Catalog** - Bronze/silver/gold tables with the data +2. **SQL Warehouse** - A warehouse to execute queries (auto-detected if not specified) + +### Creating Tables + +Use these skills in sequence: +1. `databricks-synthetic-data-gen` - Generate raw parquet files +2. `databricks-spark-declarative-pipelines` - Create bronze/silver/gold tables + +## Common Issues + +See [spaces.md §Troubleshooting](spaces.md#troubleshooting) for a full list of issues and solutions. 
+## Related Skills + +- **[databricks-agent-bricks](../databricks-agent-bricks/SKILL.md)** - Use Genie Spaces as agents inside Supervisor Agents +- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** - Generate raw parquet data to populate tables for Genie +- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - Build bronze/silver/gold tables consumed by Genie Spaces +- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - Manage the catalogs, schemas, and tables Genie queries diff --git a/.claude/skills/databricks-genie/conversation.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/conversation.md similarity index 90% rename from .claude/skills/databricks-genie/conversation.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/conversation.md index 149cafa..e4320e8 100644 --- a/.claude/skills/databricks-genie/conversation.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/conversation.md @@ -31,8 +31,7 @@ The `ask_genie` tool allows you to programmatically send questions to a Genie Sp | Tool | Purpose | |------|---------| -| `ask_genie` | Ask a question, start new conversation | -| `ask_genie_followup` | Ask follow-up in existing conversation | +| `ask_genie` | Ask a question or follow-up (`conversation_id` optional) | ## Basic Usage @@ -71,10 +70,10 @@ result = ask_genie( ) # Follow-up (uses context from first question) -ask_genie_followup( +ask_genie( space_id="01abc123...", - conversation_id=result["conversation_id"], - question="Break that down by region" + question="Break that down by region", + conversation_id=result["conversation_id"] ) ``` @@ -148,7 +147,7 @@ Claude: User: "I just created a Genie Space for HR data. Can you test it?" Claude: -1. Gets the space_id from the user or recent create_or_update_genie result +1. Gets the space_id from the user or recent manage_genie(action="create_or_update") result 2. Calls ask_genie with test questions: - "How many employees do we have?" - "What is the average salary by department?" @@ -163,9 +162,9 @@ User: "Use my analytics Genie to explore sales trends" Claude: 1. ask_genie(space_id, "What were total sales by month this year?") 2. User: "Which month had the highest growth?" -3. ask_genie_followup(space_id, conv_id, "Which month had the highest growth?") +3. ask_genie(space_id, "Which month had the highest growth?", conversation_id=conv_id) 4. User: "What products drove that growth?" -5. ask_genie_followup(space_id, conv_id, "What products drove that growth?") +5. 
ask_genie(space_id, "What products drove that growth?", conversation_id=conv_id) ``` ## Best Practices @@ -181,8 +180,8 @@ result2 = ask_genie(space_id, "How many employees do we have?") # New conversat # Good: Follow-up for related question result1 = ask_genie(space_id, "What were sales last month?") -result2 = ask_genie_followup(space_id, result1["conversation_id"], - "Break that down by product") # Related follow-up +result2 = ask_genie(space_id, "Break that down by product", + conversation_id=result1["conversation_id"]) # Related follow-up ``` ### Handle Clarification Requests @@ -219,7 +218,7 @@ ask_genie(space_id, "Calculate customer lifetime value for all customers", - Verify the `space_id` is correct - Check you have access to the space -- Use `get_genie(space_id)` to verify it exists +- Use `manage_genie(action="get", space_id=...)` to verify it exists ### "Query timed out" diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/spaces.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/spaces.md new file mode 100644 index 0000000..ff8acb6 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-genie/spaces.md @@ -0,0 +1,395 @@ +# Creating Genie Spaces + +This guide covers creating and managing Genie Spaces for SQL-based data exploration. + +## What is a Genie Space? + +A Genie Space connects to Unity Catalog tables and translates natural language questions into SQL — understanding schemas, generating queries, executing them on a SQL warehouse, and presenting results conversationally. + +## Creation Workflow + +### Step 1: Inspect Table Schemas (Required) + +**Before creating a Genie Space, you MUST inspect the table schemas** to understand what data is available: + +```python +get_table_stats_and_schema( + catalog="my_catalog", + schema="sales", + table_stat_level="SIMPLE" +) +``` + +This returns: +- Table names and row counts +- Column names and data types +- Sample values and cardinality +- Null counts and statistics + +### Step 2: Analyze and Plan + +Based on the schema information: + +1. **Select relevant tables** - Choose tables that support the user's use case +2. **Identify key columns** - Note date columns, metrics, dimensions, and foreign keys +3. **Understand relationships** - How do tables join together? +4. **Plan sample questions** - What questions can this data answer? 
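
To sanity-check a planned join before wiring the tables into a space, the `execute_sql` MCP tool can run a quick orphan-key query. A minimal sketch — the `query` parameter name and the sales tables are illustrative assumptions:

```python
# Hypothetical join check — counts orders with no matching customer
execute_sql(
    query="""
        SELECT COUNT(*) AS orphan_orders
        FROM my_catalog.sales.orders o
        LEFT JOIN my_catalog.sales.customers c
          ON o.customer_id = c.customer_id
        WHERE c.customer_id IS NULL
    """
)
```

A zero count confirms the relationship is safe to describe in the space's description.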
+ +### Step 3: Create the Genie Space + +Create the space with content tailored to the actual data: + +```python +manage_genie( + action="create_or_update", + display_name="Sales Analytics", + table_identifiers=[ + "my_catalog.sales.customers", + "my_catalog.sales.orders", + "my_catalog.sales.products" + ], + description="""Explore retail sales data with three related tables: +- customers: Customer demographics including region, segment, and signup date +- orders: Transaction history with order_date, total_amount, and status +- products: Product catalog with category, price, and inventory + +Tables join on customer_id and product_id.""", + sample_questions=[ + "What were total sales last month?", + "Who are our top 10 customers by total_amount?", + "How many orders were placed in Q4 by region?", + "What's the average order value by customer segment?", + "Which product categories have the highest revenue?", + "Show me customers who haven't ordered in 90 days" + ] +) +``` + +## Why This Workflow Matters + +**Sample questions that reference actual column names** help Genie: +- Learn the vocabulary of your data +- Generate more accurate SQL queries +- Provide better autocomplete suggestions + +**A description that explains table relationships** helps Genie: +- Understand how to join tables correctly +- Know which table contains which information +- Provide more relevant answers + +## Auto-Detection of Warehouse + +When `warehouse_id` is not specified, the tool: + +1. Lists all SQL warehouses in the workspace +2. Prioritizes by: + - **Running** warehouses first (already available) + - **Starting** warehouses second + - **Smaller sizes** preferred (cost-efficient) +3. Returns an error if no warehouses exist + +To use a specific warehouse, provide the `warehouse_id` explicitly. + +## Table Selection + +Choose tables carefully for best results: + +| Layer | Recommended | Why | +|-------|-------------|-----| +| Bronze | No | Raw data, may have quality issues | +| Silver | Yes | Cleaned and validated | +| Gold | Yes | Aggregated, optimized for analytics | + +### Tips for Table Selection + +- **Include related tables**: If users ask about customers and orders, include both +- **Use descriptive column names**: `customer_name` is better than `cust_nm` +- **Add table comments**: Genie uses metadata to understand the data + +## Sample Questions + +Sample questions help users understand what they can ask: + +**Good sample questions:** +- "What were total sales last month?" +- "Who are our top 10 customers by revenue?" +- "How many orders were placed in Q4?" +- "What's the average order value by region?" + +These appear in the Genie UI to guide users. + +## Best Practices + +### Table Design for Genie + +1. **Descriptive names**: Use `customer_lifetime_value` not `clv` +2. **Add comments**: `COMMENT ON TABLE sales.customers IS 'Customer master data'` +3. **Primary keys**: Define relationships clearly +4. **Date columns**: Include proper date/timestamp columns for time-based queries + +### Description and Context + +Provide context in the description: + +``` +Explore retail sales data from our e-commerce platform. 
Includes:
- Customers: demographics, segments, and account status
- Orders: transaction history with amounts and dates
- Products: catalog with categories and pricing

Time range: Last 6 months of data
```

### Sample Questions

Write sample questions that:
- Cover common use cases
- Demonstrate the data's capabilities
- Use natural language (not SQL terms)

## Updating a Genie Space

`manage_genie(action="create_or_update")` handles both create and update automatically. There are two ways it locates an existing space to update:

- **By `space_id`** (explicit, preferred): pass `space_id=` to target a specific space.
- **By `display_name`** (implicit fallback): if `space_id` is omitted, the tool searches for a space with a matching name and updates it if found; otherwise it creates a new one.

### Simple field updates (tables, questions, warehouse)

To update metadata without a serialized config:

```python
manage_genie(
    action="create_or_update",
    display_name="Sales Analytics",
    space_id="01abc123...",          # omit to match by name instead
    table_identifiers=[              # updated table list
        "my_catalog.sales.customers",
        "my_catalog.sales.orders",
        "my_catalog.sales.products",
    ],
    sample_questions=[               # updated sample questions
        "What were total sales last month?",
        "Who are our top 10 customers by revenue?",
    ],
    warehouse_id="abc123def456",     # omit to keep current / auto-detect
    description="Updated description.",
)
```

### Full config update via `serialized_space`

To push a complete serialized configuration to an existing space (the payload carries the full table metadata and preserves all instructions, SQL examples, join specs, etc.):

```python
manage_genie(
    action="create_or_update",
    display_name="Sales Analytics",      # overrides title embedded in serialized_space
    table_identifiers=[],                # ignored when serialized_space is provided
    space_id="01abc123...",              # target space to overwrite
    warehouse_id="abc123def456",         # overrides warehouse embedded in serialized_space
    description="Updated description.",  # overrides description embedded in serialized_space; omit to keep the one in the payload
    serialized_space=remapped_config,    # JSON string from manage_genie(action="export") (after catalog remap if needed)
)
```

> **Note:** When `serialized_space` is provided, `table_identifiers` and `sample_questions` are ignored — the full config comes from the serialized payload. However, `display_name`, `warehouse_id`, and `description` are still applied as top-level overrides on top of the serialized payload. Omit any of them to keep the values embedded in `serialized_space`.

## Export, Import & Migration

`manage_genie(action="export")` returns a dictionary with five top-level keys:

| Key | Description |
|-----|-------------|
| `space_id` | ID of the exported space |
| `title` | Display name of the space |
| `description` | Description of the space |
| `warehouse_id` | SQL warehouse associated with the space (workspace-specific — do **not** reuse across workspaces) |
| `serialized_space` | JSON-encoded string with the full space configuration (see below) |

This envelope enables cloning, backup, and cross-workspace migration. Use `manage_genie(action="export")` and `manage_genie(action="import")` for all export/import operations — no direct REST calls needed.

### What is `serialized_space`?

`serialized_space` is a JSON string (version 2) embedded inside the export envelope.
Its top-level keys are: + +| Key | Contents | +|-----|----------| +| `version` | Schema version (currently `2`) | +| `config` | Space-level config: `sample_questions` shown in the UI | +| `data_sources` | `tables` array — each entry has a fully-qualified `identifier` (`catalog.schema.table`) and optional `column_configs` (format assistance, entity matching per column) | +| `instructions` | `example_question_sqls` (certified Q&A pairs), `join_specs` (join relationships between tables), `sql_snippets` (`filters` and `measures` with display names and usage instructions) | +| `benchmarks` | Evaluation Q&A pairs used to measure space quality | + +Catalog names appear **everywhere** inside `serialized_space` — in `data_sources.tables[].identifier`, SQL strings in `example_question_sqls`, `join_specs`, and `sql_snippets`. A single `.replace(src_catalog, tgt_catalog)` on the whole string is sufficient for catalog remapping. + +Minimum structure: +```json +{"version": 2, "data_sources": {"tables": [{"identifier": "catalog.schema.table"}]}} +``` + +### Exporting a Space + +Use `manage_genie(action="export")` to export the full configuration (requires CAN EDIT permission): + +```python +exported = manage_genie(action="export", space_id="01abc123...") +# Returns: +# { +# "space_id": "01abc123...", +# "title": "Sales Analytics", +# "description": "Explore sales data...", +# "warehouse_id": "abc123def456", +# "serialized_space": "{\"version\":2,\"data_sources\":{...},\"instructions\":{...}}" +# } +``` + +You can also get `serialized_space` inline via `manage_genie(action="get")`: + +```python +details = manage_genie(action="get", space_id="01abc123...", include_serialized_space=True) +serialized = details["serialized_space"] +``` + +### Cloning a Space (Same Workspace) + +```python +# Step 1: Export the source space +source = manage_genie(action="export", space_id="01abc123...") + +# Step 2: Import as a new space +manage_genie( + action="import", + warehouse_id=source["warehouse_id"], + serialized_space=source["serialized_space"], + title=source["title"], # override title; omit to keep original + description=source["description"], +) +# Returns: {"space_id": "01def456...", "title": "Sales Analytics (Dev Copy)", "operation": "imported"} +``` + +### Migrating Across Workspaces with Catalog Remapping + +When migrating between environments (e.g. prod → dev), Unity Catalog names are often different. The `serialized_space` string contains the source catalog name **everywhere** — in table identifiers, SQL queries, join specs, and filter snippets. You must remap it before importing. + +**Agent workflow (3 steps):** + +**Step 1 — Export from source workspace:** +```python +exported = manage_genie(action="export", space_id="01f106e1239d14b28d6ab46f9c15e540") +# exported keys: warehouse_id, title, description, serialized_space +# exported["serialized_space"] contains all references to source catalog +``` + +**Step 2 — Remap catalog name in `serialized_space`:** + +The agent does this as an inline string substitution between the two MCP calls: +```python +modified_serialized = exported["serialized_space"].replace( + "source_catalog_name", # e.g. "healthverity_claims_sample_patient_dataset" + "target_catalog_name" # e.g. "healthverity_claims_sample_patient_dataset_dev" +) +``` +This replaces all occurrences — table identifiers, SQL FROM clauses, join specs, and filter snippets. 
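
If the space references tables from more than one catalog, apply the same substitution once per catalog. A minimal sketch — the `catalog_map` names are hypothetical, and plain string replacement assumes no catalog name is a substring of another identifier:

```python
# Hypothetical source → target catalog names
catalog_map = {
    "sales_prod": "sales_dev",
    "reference_prod": "reference_dev",
}

modified_serialized = exported["serialized_space"]
for src_catalog, tgt_catalog in catalog_map.items():
    modified_serialized = modified_serialized.replace(src_catalog, tgt_catalog)
```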
+ +**Step 3 — Import to target workspace:** +```python +manage_genie( + action="import", + warehouse_id="", # from manage_warehouse(action="list") on target + serialized_space=modified_serialized, + title=exported["title"], + description=exported["description"] +) +``` + +### Batch Migration of Multiple Spaces + +To migrate several spaces at once, loop through space IDs. The agent exports, remaps the catalog, then imports each: + +``` +For each space_id in [id1, id2, id3]: + 1. exported = manage_genie(action="export", space_id=space_id) + 2. modified = exported["serialized_space"].replace(src_catalog, tgt_catalog) + 3. result = manage_genie(action="import", warehouse_id=wh_id, serialized_space=modified, title=exported["title"], description=exported["description"]) + 4. record result["space_id"] for updating databricks.yml +``` + +After migration, update `databricks.yml` with the new dev `space_id` values under the `dev` target's `genie_space_ids` variable. + +### Updating an Existing Space with New Config + +To push a serialized config to an already-existing space (rather than creating a new one), use `manage_genie(action="create_or_update")` with `space_id=` and `serialized_space=`. The export → remap → push pattern is identical to the migration steps above; just replace `manage_genie(action="import")` with `manage_genie(action="create_or_update", space_id=TARGET_SPACE_ID, ...)` as the final call. + +### Permissions Required + +| Operation | Required Permission | +|-----------|-------------------| +| `manage_genie(action="export")` / `manage_genie(action="get", include_serialized_space=True)` | CAN EDIT on source space | +| `manage_genie(action="import")` | Can create items in target workspace folder | +| `manage_genie(action="create_or_update")` with `serialized_space` (update) | CAN EDIT on target space | + +## Example End-to-End Workflow + +1. **Generate synthetic data** using `databricks-synthetic-data-gen` skill: + - Creates parquet files in `/Volumes/catalog/schema/raw_data/` + +2. **Create tables** using `databricks-spark-declarative-pipelines` skill: + - Creates `catalog.schema.bronze_*` → `catalog.schema.silver_*` → `catalog.schema.gold_*` + +3. **Inspect the tables**: + ```python + get_table_stats_and_schema(catalog="catalog", schema="schema") + ``` + +4. **Create the Genie Space**: + - `display_name`: "My Data Explorer" + - `table_identifiers`: `["catalog.schema.silver_customers", "catalog.schema.silver_orders"]` + +5. **Add sample questions** based on actual column names + +6. **Test** in the Databricks UI + +## Troubleshooting + +### No warehouse available + +- Create a SQL warehouse in the Databricks workspace +- Or provide a specific `warehouse_id` + +### Queries are slow + +- Ensure the warehouse is running (not stopped) +- Consider using a larger warehouse size +- Check if tables are optimized (OPTIMIZE, Z-ORDER) + +### Poor query generation + +- Use descriptive column names +- Add table and column comments +- Include sample questions that demonstrate the vocabulary +- Add instructions via the Databricks Genie UI + +### `manage_genie(action="export")` returns empty `serialized_space` + +Requires at least **CAN EDIT** permission on the space. + +### `manage_genie(action="import")` fails with permission error + +Ensure you have CREATE privileges in the target workspace folder. + +### Tables not found after migration + +Catalog name was not remapped — replace the source catalog name in `serialized_space` before calling `manage_genie(action="import")`. 
The catalog appears in table identifiers, SQL FROM clauses, join specs, and filter snippets; a single `.replace(src_catalog, tgt_catalog)` on the whole string covers all occurrences.

### `manage_genie` lands in the wrong workspace

Each MCP server is workspace-scoped. Set up two named MCP server entries (one per profile) in your IDE's MCP config instead of switching a single server's profile mid-session.

### MCP server doesn't pick up profile change

The MCP process reads `DATABRICKS_CONFIG_PROFILE` once at startup — editing the config file requires an IDE reload to take effect.

### `manage_genie(action="import")` fails with JSON parse error

The `serialized_space` string may contain multi-line SQL arrays with `\n` escape sequences. Flatten SQL arrays to single-line strings before passing to avoid double-escaping issues.
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/1-managed-iceberg-tables.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/1-managed-iceberg-tables.md
new file mode 100644
index 0000000..a0f3f06
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/1-managed-iceberg-tables.md
@@ -0,0 +1,262 @@
+# Managed Iceberg Tables

Managed Iceberg tables are native Apache Iceberg tables created and stored within Unity Catalog. They support full read/write operations in Databricks and are accessible to external engines via the UC Iceberg REST Catalog (IRC) endpoint.

**Requirements**: Unity Catalog, DBR 16.4 LTS+ (Managed Iceberg v2), DBR 17.3+ (Managed Iceberg v3 Beta)

---

## Creating Tables

### Basic DDL

```sql
-- Create an empty Iceberg table (no clustering)
CREATE TABLE my_catalog.my_schema.events (
  event_id BIGINT,
  event_type STRING,
  event_date DATE,
  payload STRING
)
USING ICEBERG;
```

### Create Table As Select (CTAS)

```sql
-- Create from existing data (no clustering)
CREATE TABLE my_catalog.my_schema.events_archive
USING ICEBERG
AS SELECT * FROM my_catalog.my_schema.events
WHERE event_date < '2025-01-01';
```

### Liquid Clustering

Managed Iceberg tables use **Liquid Clustering** for data layout optimization. Both `PARTITIONED BY` and `CLUSTER BY` produce a Liquid Clustered table — **no traditional Hive-style partitions are created**. Unity Catalog interprets the partition clause as clustering keys.

| Syntax | Engines that can issue the DDL | Reads via IRC | Iceberg partition fields visible to external engines | DV/row-tracking handling |
|--------|--------------------------------|---------------|------------------------------------------------------|--------------------------|
| `PARTITIONED BY (col)` | DBR + EMR, OSS Spark, Trino, Flink | Yes | Yes — UC exposes Iceberg partition fields corresponding to clustering keys; external engines can prune | **Auto-handled** |
| `CLUSTER BY (col)` | DBR only | Yes | Yes — same; UC maintains Iceberg partition spec from clustering keys regardless of DDL used | Manual on v2, auto on v3 |

> **Both syntaxes produce the same Iceberg metadata for external engines.** UC maintains an Iceberg partition spec (partition fields corresponding to the clustering keys) that external engines read via IRC. This is Iceberg-style partitioning — not legacy Hive-style directory partitions. External engines see a partitioned Iceberg table and benefit from partition pruning. Internally, UC uses those partition fields as liquid clustering keys.
+ +> **`PARTITIONED BY` limitation**: Only plain column references are supported. Expression transforms (`bucket()`, `years()`, `months()`, `days()`, `hours()`) are **not** supported and will error. + +> **`CLUSTER BY` on Iceberg v2**: requires explicitly setting `'delta.enableDeletionVectors' = false` and `'delta.enableRowTracking' = false`, otherwise you get: `[MANAGED_ICEBERG_ATTEMPTED_TO_ENABLE_CLUSTERING_WITHOUT_DISABLING_DVS_OR_ROW_TRACKING]` + +**`PARTITIONED BY` — recommended for cross-platform** (auto-handles all required properties): + +```sql +-- Single column (v2 or v3 — no TBLPROPERTIES needed) +CREATE TABLE orders ( + order_id BIGINT, + order_date DATE +) +USING ICEBERG +PARTITIONED BY (order_date); + +-- Multi-column +CREATE TABLE orders ( + order_id BIGINT, + region STRING, + order_date DATE +) +USING ICEBERG +PARTITIONED BY (region, order_date); +``` + +**`CLUSTER BY` on Iceberg v2** (DBR-only; must disable DVs and row tracking manually): + +```sql +-- Single column clustering (v2) +CREATE TABLE orders ( + order_id BIGINT, + order_date DATE +) +USING ICEBERG +TBLPROPERTIES ( + 'delta.enableDeletionVectors' = false, + 'delta.enableRowTracking' = false +) +CLUSTER BY (order_date); +``` + +**`CLUSTER BY` on Iceberg v3** (no extra TBLPROPERTIES needed): + +```sql +CREATE TABLE orders ( + order_id BIGINT, + order_date DATE +) +USING ICEBERG +TBLPROPERTIES ('format-version' = '3') +CLUSTER BY (order_date); +``` + +--- + +## DML Operations + +Managed Iceberg tables support all standard DML operations: + +```sql +-- INSERT +INSERT INTO my_catalog.my_schema.events +VALUES (1, 'click', '2025-06-01', '{"page": "home"}'); + +-- INSERT from query +INSERT INTO my_catalog.my_schema.events +SELECT * FROM staging_events WHERE event_date = current_date(); + +-- UPDATE +UPDATE my_catalog.my_schema.events +SET event_type = 'page_view' +WHERE event_id = 1; + +-- DELETE +DELETE FROM my_catalog.my_schema.events +WHERE event_date < '2024-01-01'; + +-- MERGE (upsert) +MERGE INTO my_catalog.my_schema.events AS target +USING staging_events AS source +ON target.event_id = source.event_id +WHEN MATCHED THEN UPDATE SET * +WHEN NOT MATCHED THEN INSERT *; +``` + +--- + +## Time Travel + +Query historical snapshots using timestamp or snapshot ID: + +```sql +-- Query by timestamp +SELECT * FROM my_catalog.my_schema.events TIMESTAMP AS OF '2025-06-01T00:00:00Z'; + +-- Query by snapshot ID +SELECT * FROM my_catalog.my_schema.events VERSION AS OF 1234567890; + +-- Only for external engines: View snapshot history +SELECT * FROM my_catalog.my_schema.events.snapshots; +``` + +--- + +## Predictive Optimization + +Predictive Optimization is **recommended** for managed Iceberg tables — it is not auto-enabled and must be turned on explicitly. Once enabled, it automatically runs: + +- **Compaction** — consolidates small files +- **Vacuum** — removes expired snapshots and orphan files +- **Statistics collection** — keeps column statistics up to date for query optimization + +Enable at the catalog or schema level. 
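
A minimal sketch of turning it on, reusing the example catalog and schema from this file (the standard `ALTER CATALOG` / `ALTER SCHEMA` Predictive Optimization DDL):

```sql
-- Enable Predictive Optimization for the whole catalog
ALTER CATALOG my_catalog ENABLE PREDICTIVE OPTIMIZATION;

-- Or for a single schema
ALTER SCHEMA my_catalog.my_schema ENABLE PREDICTIVE OPTIMIZATION;
```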
Manual operations are still available if needed: + +```sql +-- Manual compaction +OPTIMIZE my_catalog.my_schema.events; + +-- Manual vacuum +VACUUM my_catalog.my_schema.events; + +-- Manual statistics collection +ANALYZE TABLE my_catalog.my_schema.events COMPUTE STATISTICS FOR ALL COLUMNS; +``` + +--- + +## Iceberg v3 (Beta) + +**Requires**: DBR 17.3+ + +Iceberg v3 introduces new capabilities on top of v2: + +| Feature | Description | +|---------|-------------| +| **Deletion Vectors** | Row-level deletes without rewriting data files — faster UPDATE/DELETE/MERGE | +| **VARIANT Type** | Semi-structured data column (like Delta's VARIANT) | +| **Row Lineage** | Track row-level provenance across transformations | + +### Creating an Iceberg v3 Table + +```sql +CREATE TABLE my_catalog.my_schema.events_v3 ( + event_id BIGINT, + event_date DATE, + data VARIANT +) +USING ICEBERG +TBLPROPERTIES ('format-version' = '3') +CLUSTER BY (event_date); +``` + +### Important Notes + +- **Cannot downgrade**: Once a table is upgraded to v3, it cannot be downgraded back to v2 +- **External engine compatibility**: External engines must use Iceberg library 1.9.0+ to read v3 tables +- **Deletion vectors**: Enabled by default on v3 tables. External readers must support deletion vectors +- **Beta status**: Iceberg v3 is in Beta — not recommended for production workloads yet + +### Upgrading an Existing Table to v3 + +```sql +ALTER TABLE my_catalog.my_schema.events +SET TBLPROPERTIES ('format-version' = '3'); +``` + +> **Warning**: This is irreversible. Test with non-production data first. + +--- + +## Limitations + +| Limitation | Details | +|------------|---------| +| **No Vector Search** | Vector Search indexes are not supported on Iceberg tables | +| **No Change Data Feed (CDF)** | CDF is a Delta-only feature; use Delta + UniForm if CDF is required | +| **Parquet only** | Iceberg tables on Databricks use Parquet as the underlying file format | +| **No shallow clone** | `SHALLOW CLONE` is not supported; use `DEEP CLONE` or CTAS | +| **`PARTITIONED BY` maps to Liquid Clustering** | `PARTITIONED BY` is supported and recommended for cross-platform scenarios — it maps to Liquid Clustering, not traditional partitions. Only plain column references work; expression transforms (`bucket()`, `years()`, etc.) are not supported. 
| +| **No Structured Streaming sink** | Cannot use `writeStream` to write to Iceberg tables directly; use `INSERT INTO` or `MERGE` in batch or SDP | +| **Compression** | Default compression is `zstd`; older readers may need `snappy` — set `write.parquet.compression-codec` if needed | +| **Do not set metadata path** | Never set `write.metadata.path` or `write.metadata.previous-versions-max` | +| **Do not install Iceberg library** | DBR includes built-in support; installing an Iceberg JAR causes conflicts | + +--- + +## Converting From Other Formats + +### Delta to Iceberg (via DEEP CLONE) + +```sql +CREATE TABLE my_catalog.my_schema.events_iceberg +USING ICEBERG +DEEP CLONE my_catalog.my_schema.events_delta; +``` + +### Foreign Iceberg to Managed Iceberg + +```sql +-- With Liquid Clustering (v2 — must disable DVs and row tracking) +CREATE TABLE my_catalog.my_schema.events_managed +USING ICEBERG +TBLPROPERTIES ( + 'delta.enableDeletionVectors' = false, + 'delta.enableRowTracking' = false +) +CLUSTER BY (event_date) +AS SELECT * FROM foreign_catalog.foreign_schema.events; + +-- With Liquid Clustering (v3 — no extra TBLPROPERTIES needed) +CREATE TABLE my_catalog.my_schema.events_managed +USING ICEBERG +TBLPROPERTIES ('format-version' = '3') +CLUSTER BY (event_date) +AS SELECT * FROM foreign_catalog.foreign_schema.events; +``` + + diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/2-uniform-and-compatibility.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/2-uniform-and-compatibility.md new file mode 100644 index 0000000..8437a72 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/2-uniform-and-compatibility.md @@ -0,0 +1,207 @@ +# UniForm and Compatibility Mode + +UniForm and Compatibility Mode make Delta tables readable as Iceberg by external engines — without converting to a native Iceberg table. Data is written as Delta, but Iceberg metadata is generated automatically so external tools (Snowflake, PyIceberg, Spark, Trino) can read via UC IRC endpoint. + +--- + +## External Iceberg Reads (fka UniForm) (GA) + +**Requirements**: Unity Catalog, DBR 14.3+, column mapping enabled, deletion vectors disabled, the Delta table must have a minReaderVersion >= 2 and minWriterVersion >= 7, both managed and external tables supported. + +UniForm adds automatic Iceberg metadata generation to regular Delta tables. The table remains Delta internally but is readable as Iceberg externally. 
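
A quick way to verify the protocol prerequisites above before enabling UniForm is `DESCRIBE DETAIL`, which reports a Delta table's current protocol versions — shown here with the `customers` table from the examples below:

```sql
DESCRIBE DETAIL my_catalog.my_schema.customers;
-- Check the minReaderVersion (>= 2) and minWriterVersion (>= 7) columns in the result
```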
+ +### Enabling UniForm on a New Table + +```sql +CREATE TABLE my_catalog.my_schema.customers ( + customer_id BIGINT, + name STRING, + region STRING, + updated_at TIMESTAMP +) +TBLPROPERTIES ( + 'delta.columnMapping.mode' = 'name', + 'delta.enableIcebergCompatV2' = 'true', + 'delta.universalFormat.enabledFormats' = 'iceberg' +); +``` + +### Enabling UniForm on an Existing Table + +```sql +ALTER TABLE my_catalog.my_schema.customers +SET TBLPROPERTIES ( + 'delta.columnMapping.mode' = 'name', + 'delta.enableIcebergCompatV2' = 'true', + 'delta.universalFormat.enabledFormats' = 'iceberg' +); +``` + +### Requirements and Prerequisites + +UniForm requires the following properties to be set explicitly: + +| Requirement | Details | +|-------------|---------| +| **Unity Catalog** | Table must be registered in UC | +| **DBR 14.3+** | Minimum runtime version | +| **Deletion vectors disabled** | Set `delta.enableDeletionVectors = false` before enabling UniForm | +| **No column mapping conflicts** | If table uses `id` mode, migrate to `name` mode first | + +If deletion vectors are currently enabled: + +```sql +-- Disable deletion vectors first +ALTER TABLE my_catalog.my_schema.customers +SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'false'); + +-- Rewrite to remove existing deletion vectors +REORG TABLE my_catalog.my_schema.customers +APPLY (PURGE); + +-- Then enable UniForm +ALTER TABLE my_catalog.my_schema.customers +SET TBLPROPERTIES ( + 'delta.columnMapping.mode' = 'name', + 'delta.enableIcebergCompatV2' = 'true', + 'delta.universalFormat.enabledFormats' = 'iceberg' +); +``` + +### Async Metadata Generation + +Iceberg metadata is generated **asynchronously** after each Delta transaction. There is a brief delay (typically seconds, occasionally minutes for large transactions) before external engines see the latest data. + +### Checking UniForm Status + +> See [Check Iceberg metadata generation status](https://docs.databricks.com/aws/en/delta/uniform#check-iceberg-metadata-generation-status) for full details. + + +### Disabling UniForm + +```sql +ALTER TABLE my_catalog.my_schema.customers +UNSET TBLPROPERTIES ('delta.universalFormat.enabledFormats'); +``` + +--- + +## Compatibility Mode + +**Requirements**: Unity Catalog, DBR 16.1+, SDP pipeline + +Compatibility Mode extends UniForm to **streaming tables (STs)** and **materialized views (MVs)** created by Spark Declarative Pipelines (SDP) or DBSQL. Regular UniForm does not work on STs/MVs — Compatibility Mode is the only option. + +**How it works**: When you enable Compatibility Mode, Databricks creates a separate, read-only **"compatibility version"** of the object at the external location you specify (`delta.universalFormat.compatibility.location`). This is a full copy of the data in Iceberg-compatible format — not a pointer to the original Delta data. After the initial full copy, subsequent metadata and data generation is **incremental** (only new/changed data is synced to the external location). + +> **Storage cost consideration**: Because Compatibility Mode writes a separate copy of the data to the external location, you incur additional cloud storage costs proportional to the size of the table. Factor this in when enabling Compatibility Mode on large tables. 
+ +### Enabling Compatibility Mode + +Compatibility Mode is configured via table properties: + +**SQL Example (streaming table)**: + +```sql +CREATE OR REFRESH STREAMING TABLE my_events +TBLPROPERTIES ( + 'delta.universalFormat.enabledFormats' = 'compatibility', + 'delta.universalFormat.compatibility.location' = '' +) +AS SELECT * FROM STREAM read_files('/Volumes/catalog/schema/raw/events/'); +``` + +**SQL Example (materialized view)**: + +```sql +CREATE OR REFRESH MATERIALIZED VIEW daily_summary +TBLPROPERTIES ( + 'delta.universalFormat.enabledFormats' = 'compatibility', + 'delta.universalFormat.compatibility.location' = '' +) +AS SELECT event_date, COUNT(*) AS event_count +FROM my_events +GROUP BY event_date; +``` + +**Python Example**: + +```python +from pyspark import pipelines as dp + +@dp.table( + name="my_events", + table_properties={ + "delta.universalFormat.enabledFormats": "compatibility", + "delta.universalFormat.compatibility.location": "", + }, +) +def my_events(): + return ( + spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/catalog/schema/raw/events/") + ) +``` + +### Considerations for Compatibility Mode + +| Consideration | Details | +|---------------|---------| +| **External location** | `delta.universalFormat.compatibility.location` must point to a configured external location for the Iceberg metadata output path | +| **SDP pipeline only** | Only works with streaming tables and MVs defined in SDP pipelines | +| **Initial generation time** | First metadata generation can take up to 1 hour for large tables | +| **Unity Catalog** | Required | +| **DBR 16.1+** | Minimum runtime for the SDP pipeline | + +### Refresh Mechanics + +Compatibility Mode metadata can be refreshed manually or controlled via the `delta.universalFormat.compatibility.targetRefreshInterval` property: + +```sql +CREATE OR REFRESH STREAMING TABLE my_events +TBLPROPERTIES ( + 'delta.universalFormat.enabledFormats' = 'compatibility', + 'delta.universalFormat.compatibility.location' = '', + 'delta.universalFormat.compatibility.targetRefreshInterval' = '0 MINUTES' +) +AS SELECT * FROM STREAM read_files('/Volumes/catalog/schema/raw/events/'); +``` + +| Interval value | Behavior | +|----------------|----------| +| `0 MINUTES` | Checks for changes after every commit and triggers a refresh if needed — default for streaming tables and MVs | +| `1 HOUR` | Default for non-SDP tables; refreshes at most once per hour | +| Values below `1 HOUR` (e.g. `30 MINUTES`) | Not recommended — won't make refreshes more frequent than once per hour | + +Metadata can also be triggered manually: + +```sql +REFRESH TABLE my_catalog.my_schema.my_events; +``` + +### Future Modes + +A more efficient mode for streaming tables and materialized views is expected in a future release. + +--- + +## Decision Table: Which Approach? 
| Criteria | Managed Iceberg | UniForm | Compatibility Mode |
|----------|:-:|:-:|:-:|
| **Full Iceberg read/write** | Yes | Read-only (as Iceberg) | Read-only (as Iceberg) |
| **Works with Delta features (CDF)** | No | Partial* | Partial* |
| **Streaming tables / MVs** | No | No | Yes |
| **External engine write via IRC** | Yes | No | No |
| **Existing Delta investment** | Requires migration | No migration | No migration |
| **Predictive Optimization** | Auto-enabled | Auto-enabled (Delta) | Auto-enabled (Delta) |
| **DBR requirement** | 16.1+ | 14.3+ | 16.1+ |

*Iceberg has no Change Data Feed, so CDF-dependent features — streaming tables, materialized views, data classification, vector search, data profiling — are not supported. For tables synced to Lakebase, only snapshot mode is supported.

### When to Choose Each

- **Managed Iceberg**: You want a native Iceberg table with full read/write from both Databricks and external engines. You don't need Delta-specific features (e.g., CDF).
- **UniForm**: You have existing Delta tables and want to make them readable as Iceberg by external engines without migrating. You want to keep Delta features internally.
- **Compatibility Mode**: You have streaming tables or materialized views that need to be readable as Iceberg by external engines.
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/3-iceberg-rest-catalog.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/3-iceberg-rest-catalog.md
new file mode 100644
index 0000000..e7cf571
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/3-iceberg-rest-catalog.md
@@ -0,0 +1,107 @@
+# Iceberg REST Catalog (IRC)

The Iceberg REST Catalog (IRC) is a REST API endpoint that lets external engines read and write Databricks-managed Iceberg data using the standard Apache Iceberg REST Catalog protocol. External tools connect to the IRC endpoint, authenticate, and receive vended credentials for direct cloud storage access.

**Endpoint**: `https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest`

> **Legacy endpoint warning**: The older `/api/2.1/unity-catalog/iceberg` endpoint is in maintenance mode and should not be used for new integrations. It was the original read-only endpoint documented for UniForm. All new integrations — both UniForm (Delta with Iceberg reads) and managed Iceberg tables — must use `/api/2.1/unity-catalog/iceberg-rest`.

**Requirements**: Unity Catalog, external data access enabled on the workspace, DBR 16.1+

---

## Prerequisites

### 1. Enable External Data Access

External data access must be enabled for your workspace. This is typically configured by a workspace admin.

### 2. Network Access to the IRC Endpoint

External engines must reach the Databricks workspace over HTTPS (port 443). If the workspace has **IP access lists** enabled, the CIDR range(s) of the Iceberg client must be explicitly allowed — otherwise connections will fail regardless of correct credentials or grants.

Check and manage IP access lists:
- Admin console: **Settings → Security → IP access list**
- REST API: `GET /api/2.0/ip-access-lists` to inspect, `POST /api/2.0/ip-access-lists` to add ranges

> **Common symptom**: Connections time out or return `403 Forbidden` even with valid credentials and correct grants. IP access list misconfiguration is a frequent root cause — check this before debugging auth.

### 3.
Grant EXTERNAL USE SCHEMA

The connecting principal (user or service principal) must have the `EXTERNAL USE SCHEMA` grant on each schema they want to access:

```sql
-- Grant to a user
GRANT EXTERNAL USE SCHEMA ON SCHEMA my_catalog.my_schema TO `user@example.com`;

-- Grant to a service principal
GRANT EXTERNAL USE SCHEMA ON SCHEMA my_catalog.my_schema TO `my-service-principal`;

-- Grant to a group
GRANT EXTERNAL USE SCHEMA ON SCHEMA my_catalog.my_schema TO `data-engineers`;
```

> **Important**: `EXTERNAL USE SCHEMA` is separate from `SELECT` or `MODIFY` grants. A user needs both data permissions AND the external use grant.

---

## Authentication

### Personal Access Token (PAT)

```
Authorization: Bearer <token>
```

### OAuth (M2M)

For service-to-service authentication, use OAuth with a service principal:

1. Create a service principal in the Databricks account
2. Generate an OAuth secret
3. Use the OAuth token endpoint to get an access token
4. Pass the access token as a Bearer token

---

## Read/Write Capability Matrix

| Table Type | IRC Read | IRC Write |
|------------|:-:|:-:|
| Managed Iceberg (`USING ICEBERG`) | Yes | Yes |
| Delta + UniForm | Yes | No |
| Delta + Compatibility Mode | Yes | No |
| Foreign Iceberg Table | No | No |

> **Key insight**: Only managed Iceberg tables support writes via IRC. UniForm and Compatibility Mode tables are read-only because the underlying format is Delta.

---

## Credential Vending

When an external engine connects via IRC, Databricks **vends temporary cloud credentials** (short-lived STS tokens for AWS, SAS tokens for Azure) so the engine can read/write data files directly in cloud storage. This is transparent to the client — the IRC protocol handles it automatically.

Benefits:
- No need to configure cloud credentials in the external engine
- Credentials are scoped to the specific table and operation
- Credentials automatically expire (typically 1 hour)

---

## Common Configuration Reference

| Parameter | Value |
|-----------|-------|
| **Catalog type** | `rest` |
| **URI** | `https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest` |
| **Warehouse** | Unity Catalog catalog name (e.g., `my_catalog`) |
| **Token** | Databricks PAT or OAuth access token |
| **Credential vending** | Automatic (handled by the REST protocol) |

---

## Related

- [4-snowflake-interop.md](4-snowflake-interop.md) — Snowflake reading Databricks via catalog integration (uses IRC)
- [5-external-engine-interop.md](5-external-engine-interop.md) — Per-engine connection configs: PyIceberg, OSS Spark, EMR, Flink, Kafka Connect, DuckDB, Trino
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/4-snowflake-interop.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/4-snowflake-interop.md
new file mode 100644
index 0000000..2f9d953
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/4-snowflake-interop.md
@@ -0,0 +1,349 @@
+# Snowflake Interoperability

Databricks and Snowflake can share Iceberg data bidirectionally. This file covers both directions: Snowflake reading Databricks-managed tables, and Databricks reading Snowflake-managed Iceberg tables.

**Cloud scope**: AWS-primary examples. Azure/GCS differences noted where relevant.
+
---

## Direction 1: Snowflake Reading Databricks

Snowflake can read Databricks-managed Iceberg tables (managed Iceberg + UniForm + Compatibility Mode) through a **Catalog Integration** that connects to the Databricks Iceberg REST Catalog (IRC).

### Step 1: Create a Catalog Integration in Snowflake

`ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS` is required on AWS for Snowflake to receive temporary STS credentials from the Databricks IRC. Without it, Snowflake cannot access the underlying Parquet files.

**PAT / Bearer token**:

```sql
-- In Snowflake
CREATE OR REPLACE CATALOG INTEGRATION databricks_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_schema'  -- UC schema (default namespace)
  REST_CONFIG = (
    CATALOG_URI = 'https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest'
    WAREHOUSE = '<uc-catalog-name>'  -- UC catalog name
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = BEARER
    BEARER_TOKEN = '<token>'
  )
  REFRESH_INTERVAL_SECONDS = 300
  ENABLED = TRUE;
```

**OAuth (recommended for production)**:

```sql
CREATE OR REPLACE CATALOG INTEGRATION databricks_catalog_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'my_schema'
  REST_CONFIG = (
    CATALOG_URI = 'https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest'
    WAREHOUSE = '<uc-catalog-name>'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client-id>'
    OAUTH_CLIENT_SECRET = '<client-secret>'
    OAUTH_TOKEN_URI = 'https://<workspace-url>/oidc/v1/token'
    OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
  )
  REFRESH_INTERVAL_SECONDS = 300
  ENABLED = TRUE;
```

> **Grant on the Databricks side**: The principal used for authentication needs these privileges in Unity Catalog:
> - `USE CATALOG` on the catalog
> - `USE SCHEMA` on the schema
> - `EXTERNAL USE SCHEMA` on the schema — this is the key privilege that enables external engines to access tables via IRC
> - `SELECT` on the target tables (or schema/catalog for broader access)
>
> Missing `EXTERNAL USE SCHEMA` causes a `Failed to retrieve credentials` error in Snowflake.

### Step 2: External Volume (Azure/GCS Only)

On **AWS with vended credentials**, no external volume is needed — Databricks IRC vends temporary STS credentials automatically.

On **Azure** or **GCS**, you must create an external volume in Snowflake because vended credentials are not supported for those clouds:

```sql
-- Azure example (in Snowflake)
CREATE OR REPLACE EXTERNAL VOLUME databricks_ext_vol
  STORAGE_LOCATIONS = (
    (
      NAME = 'azure_location'
      STORAGE_BASE_URL = 'azure://myaccount.blob.core.windows.net/my-container/iceberg/'
      AZURE_TENANT_ID = '<tenant-id>'
    )
  );
```

### Step 3: Expose Tables in Snowflake

Two approaches are available. **Linked catalog** is preferred — it exposes all tables in the namespace at once and updates automatically.
**Option A: Linked Catalog Database (preferred)**

```sql
-- Verify namespaces are visible (should return your UC schemas)
SELECT SYSTEM$LIST_NAMESPACES_FROM_CATALOG('databricks_catalog_int', '', 0);

-- Create a linked catalog database exposing all tables in the namespace
CREATE DATABASE my_snowflake_db
  LINKED_CATALOG = (
    CATALOG = 'databricks_catalog_int',
    ALLOWED_NAMESPACES = ('my_schema')  -- UC schema
  );

-- Check link health (executionState should be "RUNNING" with empty failureDetails)
SELECT SYSTEM$CATALOG_LINK_STATUS('my_snowflake_db');

-- Query
SELECT * FROM my_snowflake_db."my_schema"."my_table"
WHERE event_date >= '2025-01-01';
```

**Option B: Individual Table Reference (legacy)**

```sql
-- AWS (vended creds — no EXTERNAL_VOLUME needed)
CREATE ICEBERG TABLE my_snowflake_db.my_schema.events
  CATALOG = 'databricks_catalog_int'
  CATALOG_TABLE_NAME = 'events';

-- Azure/GCS (EXTERNAL_VOLUME required)
CREATE ICEBERG TABLE my_snowflake_db.my_schema.events
  CATALOG = 'databricks_catalog_int'
  CATALOG_TABLE_NAME = 'events'
  EXTERNAL_VOLUME = 'databricks_ext_vol';

-- Query
SELECT * FROM my_snowflake_db.my_schema.events
WHERE event_date >= '2025-01-01';
```

### Key Gotchas

#### Workspace IP Access Lists Must Allow Snowflake Egress IPs

If the Databricks workspace has **IP access lists** enabled, Snowflake's outbound NAT IPs must be added to the allowlist. Snowflake connects to the Databricks IRC endpoint (`/api/2.1/unity-catalog/iceberg-rest`) over HTTPS (port 443), and a blocked IP produces connection timeouts or `403` errors that can look like auth failures.

> **Diagnosis tip**: If the catalog integration shows `ENABLED = TRUE` but `SYSTEM$CATALOG_LINK_STATUS` returns a connection error (not a credentials error), IP access lists are the first thing to check.

#### REFRESH_INTERVAL_SECONDS Is Per-Integration, Not Per-Table

The `REFRESH_INTERVAL_SECONDS` setting on the catalog integration controls how often Snowflake polls the Databricks IRC for metadata changes. This applies to **all tables** using that integration — you cannot set different refresh intervals per table.

- Lower values = fresher data but more API calls
- Default: 300 seconds (5 minutes)
- Minimum: 60 seconds

#### 1000-Commit Limit

For Iceberg tables created from Delta files in object storage, Snowflake processes at most 1000 Delta commit files per refresh — whether triggered by `CREATE/ALTER ICEBERG TABLE … REFRESH` or by an automatic refresh. If a table has accumulated more than 1000 commit files since the last checkpoint, run additional refreshes; each one continues where the previous refresh stopped. The limit applies only to Delta commit files written after the latest Delta checkpoint, and it does not cap how many commits the catalog integration can ultimately synchronize across multiple refreshes.
+ +**Assumption**: A Snowflake-managed Iceberg table already exists, created with `CATALOG = 'SNOWFLAKE'` pointing to an external volume: + +```sql +-- In Snowflake — prerequisite table +CREATE ICEBERG TABLE sensor_readings ( + device_id INT, + device_value STRING +) + CATALOG = 'SNOWFLAKE' + EXTERNAL_VOLUME = 'ICEBERG_SHARED_VOL' + BASE_LOCATION = 'sensor_readings/'; + +INSERT INTO sensor_readings VALUES (1, 'value01'), (2, 'value02'); + +SELECT * FROM sensor_readings; +``` + +`CATALOG = 'SNOWFLAKE'` means Snowflake manages the Iceberg metadata. The data files land in the external volume at the `BASE_LOCATION` sub-path. The steps below set up Databricks to read this table. + +### Step 1: Find Snowflake External Volume Path + +Before setting up the Databricks side, run this in Snowflake to get the S3/ADLS/GCS path where Snowflake stores its Iceberg data. You'll need this path for Steps 2 and 4. + +```sql +-- In Snowflake +DESCRIBE EXTERNAL VOLUME ; +-- Note the STORAGE_BASE_URL value (e.g. s3://my-bucket/snowflake-iceberg/) +``` + +### Step 2: Create a Storage Credential + +Create a storage credential for the cloud storage where Snowflake stores its Iceberg data. Assuming that the IAM role already exists. Follow the documentation for details (https://docs.databricks.com/aws/en/connect/unity-catalog/cloud-storage/s3/s3-external-location-manual) + +```bash +# In Databricks CLI (AWS example) +databricks storage-credentials create snowflake_storage_cred \ + --aws-iam-role-arn "arn:aws:iam::123456789012:role/snowflake-data-access" +``` + +### Step 3: Create an External Location + +The external location must point to the **root** of the bucket (not a sub-path), so that all Snowflake external volume paths fall under it. + +> **Fallback mode**: You do not need this external-location fallback enabled to read Snowflake‑created Iceberg tables via catalog federation. It only affects how storage credentials are resolved for paths, not whether Snowflake Iceberg federation works. + +```sql +-- In Databricks (URL should be the bucket root, not a sub-path) +CREATE EXTERNAL LOCATION snowflake_data +URL 's3://snowflake-iceberg-bucket/' +WITH (CREDENTIAL snowflake_storage_cred); +``` + +### Step 4: Create a Snowflake Connection + +```sql +-- In Databricks +CREATE CONNECTION snowflake_conn +TYPE SNOWFLAKE +OPTIONS ( + 'host' = '.snowflakecomputing.com', + 'user' = '', + 'password' = '', + 'sfWarehouse' = '' +); +``` + +### Step 5: Create a Foreign Catalog + +Two mandatory fields beyond `database`: + +- **`authorized_paths`**: The path(s) where Snowflake stores Iceberg table files — from `STORAGE_BASE_URL` in `DESCRIBE EXTERNAL VOLUME`. Databricks can only read Iceberg tables whose data falls under these paths. +- **`storage_root`**: Where Databricks stores catalog metadata for Iceberg reads. Must point to an existing external location. This is required — the foreign catalog creation will fail without it. + +```sql +-- In Databricks +CREATE FOREIGN CATALOG snowflake_iceberg +USING CONNECTION snowflake_conn +OPTIONS ( + 'catalog' = '', + 'authorized_paths' = 's3://snowflake-iceberg-bucket/snowflake-iceberg/', + 'storage_root' = 's3://snowflake-iceberg-bucket/uc-metadata/' +); +``` + +> **UI workflow note**: The Databricks connection wizard (Catalog Explorer → Add connection → Snowflake) will prompt for authorized paths and storage location in the form and create the foreign catalog automatically. The SQL above is the equivalent DDL. 
### Step 6: Refresh, Verify, and Query

```sql
-- Refresh to discover tables
REFRESH FOREIGN CATALOG snowflake_iceberg;

-- Verify provider type before querying at scale:
-- Provider = Iceberg → Databricks reads directly from cloud storage (cheap)
-- Provider = Snowflake → double compute via JDBC (Snowflake + Databricks)
DESCRIBE EXTENDED snowflake_iceberg.my_schema.my_table;

-- Query
SELECT * FROM snowflake_iceberg.my_schema.my_table
WHERE created_at >= '2025-01-01';
```

### Compute Cost Matrix

| Snowflake Table Type | Databricks Read | Compute Cost |
|---------------------|:-:|---|
| **Snowflake Iceberg table** | Yes | Databricks compute only (reads data files directly from cloud storage) |
| **Snowflake native table** | Yes (via federation) | Double compute — Snowflake runs the query, Databricks processes the result |

> **Key insight**: Snowflake Iceberg tables are more cost-efficient to read from Databricks because Databricks reads the Parquet files directly. Native Snowflake tables require Snowflake to run the scan.

---

## Full AWS Example: Snowflake Reading Databricks

```sql
-- ========================================
-- DATABRICKS SIDE (run in Databricks)
-- ========================================

-- 1. Create a managed Iceberg table (v2 — disable DVs and row tracking for CLUSTER BY)
CREATE TABLE main.sales.orders (
  order_id BIGINT,
  customer_id BIGINT,
  amount DECIMAL(10,2),
  order_date DATE
)
USING ICEBERG
TBLPROPERTIES (
  'delta.enableDeletionVectors' = false,
  'delta.enableRowTracking' = false
)
CLUSTER BY (order_date);

-- 2. Grant external access to the service principal used in Snowflake catalog integration
GRANT EXTERNAL USE SCHEMA ON SCHEMA main.sales TO `snowflake-service-principal`;

-- ========================================
-- SNOWFLAKE SIDE (run in Snowflake)
-- ========================================

-- 3. Create catalog integration (ACCESS_DELEGATION_MODE required for vended creds on AWS)
CREATE OR REPLACE CATALOG INTEGRATION databricks_int
  CATALOG_SOURCE = ICEBERG_REST
  TABLE_FORMAT = ICEBERG
  CATALOG_NAMESPACE = 'sales'
  REST_CONFIG = (
    CATALOG_URI = 'https://my-workspace.cloud.databricks.com/api/2.1/unity-catalog/iceberg-rest'
    WAREHOUSE = 'main'
    ACCESS_DELEGATION_MODE = VENDED_CREDENTIALS
  )
  REST_AUTHENTICATION = (
    TYPE = OAUTH
    OAUTH_CLIENT_ID = '<client-id>'
    OAUTH_CLIENT_SECRET = '<client-secret>'
    OAUTH_TOKEN_URI = 'https://my-workspace.cloud.databricks.com/oidc/v1/token'
    OAUTH_ALLOWED_SCOPES = ('all-apis', 'sql')
  )
  REFRESH_INTERVAL_SECONDS = 300
  ENABLED = TRUE;

-- 4. Verify schemas are visible
SELECT SYSTEM$LIST_NAMESPACES_FROM_CATALOG('databricks_int', '', 0);

-- 5. Create linked catalog database (exposes all tables in the namespace)
CREATE DATABASE analytics
  LINKED_CATALOG = (
    CATALOG = 'databricks_int',
    ALLOWED_NAMESPACES = ('sales')
  );

-- 6. Check link health
SELECT SYSTEM$CATALOG_LINK_STATUS('analytics');

-- 7.
Query (schema and table names are case-sensitive)
SELECT order_date, SUM(amount) AS daily_revenue
FROM analytics."sales"."orders"
GROUP BY order_date
ORDER BY order_date DESC;
```

---

## Related

- [3-iceberg-rest-catalog.md](3-iceberg-rest-catalog.md) — IRC endpoint details and authentication

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/5-external-engine-interop.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/5-external-engine-interop.md
new file mode 100644
index 0000000..ecafcbe
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/5-external-engine-interop.md
@@ -0,0 +1,206 @@
+# External Engine Interoperability

This file covers connecting external engines to Databricks via the Iceberg REST Catalog (IRC). Each engine section includes the minimum configuration needed to read (and where supported, write) Databricks-managed Iceberg tables.

**Prerequisites for all engines**:
- Databricks workspace with external data access enabled
- `EXTERNAL USE SCHEMA` granted on target schemas
- PAT or OAuth (service principal) credentials with the required permissions
- **Network access**: The client must reach the Databricks workspace on HTTPS (port 443). If workspace **IP access lists** are enabled, add the client's egress CIDR to the allowlist — this is a common setup issue that blocks connectivity even when credentials and grants are correct.

See [3-iceberg-rest-catalog.md](3-iceberg-rest-catalog.md) for IRC endpoint details.

---

## PyIceberg

PyIceberg is a Python library for reading and writing Iceberg tables without Spark.

### Installation

Upgrade both packages explicitly — the bundled `pyarrow` v15 is too old and causes write errors. Also install `adlfs` for Azure storage access:

```bash
pip install --upgrade "pyiceberg>=0.9,<0.10" "pyarrow>=17,<20"
pip install adlfs
```

For non-Databricks environments:

```bash
pip install "pyiceberg[pyarrow]>=0.9"
```

### Connect to Catalog

The `warehouse` parameter pins the catalog, so all subsequent table identifiers use `<schema>.<table>` (not `<catalog>.<schema>.<table>`):

```python
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest",
    warehouse="<uc-catalog-name>",  # Unity Catalog catalog name
    token="<PAT-or-OAuth-token>",
)
```
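As a quick smoke test, list the namespaces the credential can see. A minimal sketch reusing the `catalog` object from above:

```python
# List the schemas (namespaces) visible in the pinned catalog; an empty list
# usually means a wrong warehouse name or a missing EXTERNAL USE SCHEMA grant
print(catalog.list_namespaces())
```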
### Read Table

```python
# Load table — identifier is <schema>.<table> because 'warehouse' pins the UC catalog
tbl = catalog.load_table("<schema>.<table>")

# Inspect schema and current snapshot
print(tbl)                     # schema, partitioning, snapshot summary
print(tbl.current_snapshot())  # snapshot metadata

# Read sample rows
df = tbl.scan(limit=10).to_pandas()
print(df.head())

# Pushdown filter (SQL-style filter strings are supported)
df = tbl.scan(
    row_filter="event_date >= '2025-01-01'",
    limit=1000,
).to_pandas()

# Read as Arrow
arrow_table = tbl.scan().to_arrow()
```

### Append Data

```python
import pyarrow as pa
from pyiceberg.catalog import load_catalog

catalog = load_catalog(
    "uc",
    uri="https://<workspace-url>/api/2.1/unity-catalog/iceberg-rest",
    warehouse="<uc-catalog-name>",
    token="<PAT-or-OAuth-token>",
)

tbl = catalog.load_table("<schema>.<table>")

# Schema must match the Iceberg table schema exactly — use explicit Arrow types
# PyArrow defaults to int64; if the Iceberg table uses int (32-bit), cast explicitly
arrow_schema = pa.schema([
    pa.field("id", pa.int32()),
    pa.field("name", pa.string()),
    pa.field("qty", pa.int32()),
])

rows = [
    {"id": 1, "name": "foo", "qty": 10},
    {"id": 2, "name": "bar", "qty": 20},
]
arrow_tbl = pa.Table.from_pylist(rows, schema=arrow_schema)

tbl.append(arrow_tbl)

# Verify
print("Current snapshot:", tbl.current_snapshot())
```

---

## OSS Apache Spark

> **CRITICAL**: Only configure this **outside** Databricks Runtime. Inside DBR, use the built-in Iceberg support — do NOT install the Iceberg library.

### Dependencies

Two JARs are required: the Spark runtime and a cloud-specific bundle for object storage access. Choose the bundle matching your Databricks metastore's cloud:

| Cloud | Bundle |
|-------|--------|
| AWS | `org.apache.iceberg:iceberg-aws-bundle:<version>` |
| Azure | `org.apache.iceberg:iceberg-azure-bundle:<version>` |
| GCP | `org.apache.iceberg:iceberg-gcp-bundle:<version>` |

### Spark Session Configuration

The Databricks docs recommend OAuth2 (service principal) for external Spark connections. Set `rest.auth.type=oauth2` and provide the OAuth2 server URI, credential, and scope:

```python
from pyspark.sql import SparkSession

WORKSPACE_URL = "https://<workspace-host>"
UC_CATALOG_NAME = "<uc-catalog-name>"
OAUTH_CLIENT_ID = "<service-principal-client-id>"
OAUTH_CLIENT_SECRET = "<service-principal-secret>"
CATALOG_ALIAS = "uc"  # arbitrary name used to reference this catalog in Spark SQL
ICEBERG_VER = "1.7.1"

RUNTIME = f"org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:{ICEBERG_VER}"
CLOUD_BUNDLE = f"org.apache.iceberg:iceberg-aws-bundle:{ICEBERG_VER}"  # or azure/gcp-bundle

spark = (
    SparkSession.builder
    .appName("uc-iceberg")
    .config("spark.jars.packages", f"{RUNTIME},{CLOUD_BUNDLE}")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}",
            "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.type", "rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.rest.auth.type", "oauth2")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.uri",
            f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.oauth2-server-uri",
            f"{WORKSPACE_URL}/oidc/v1/token")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.credential",
            f"{OAUTH_CLIENT_ID}:{OAUTH_CLIENT_SECRET}")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.scope", "all-apis")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.warehouse", UC_CATALOG_NAME)
    .getOrCreate()
)

# List schemas
spark.sql(f"SHOW NAMESPACES IN {CATALOG_ALIAS}").show(truncate=False)

# Query
spark.sql(f"SELECT * FROM {CATALOG_ALIAS}.<schema>.<table>").show()

# Write (managed Iceberg tables only)
df.writeTo(f"{CATALOG_ALIAS}.<schema>.<table>").append()
```

### Spark SQL

```sql
-- List schemas
SHOW NAMESPACES IN uc;

-- Query
SELECT * FROM uc.<schema>.<table>;

-- Insert
INSERT INTO uc.<schema>.<table> VALUES (1, 'foo', 10);
```
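For quick local experiments, a static bearer token can stand in for the OAuth2 service-principal settings above. A sketch, assuming the workspace accepts PAT authentication on the IRC endpoint; `token` is a standard Iceberg REST catalog property, and the variables reuse the session-configuration block:

```python
# Same catalog wiring as above, but with a single static token (e.g. a Databricks
# PAT) in place of rest.auth.type / oauth2-server-uri / credential / scope
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("uc-iceberg-pat")
    .config("spark.jars.packages", f"{RUNTIME},{CLOUD_BUNDLE}")
    .config("spark.sql.extensions",
            "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}",
            "org.apache.iceberg.spark.SparkCatalog")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.type", "rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.uri",
            f"{WORKSPACE_URL}/api/2.1/unity-catalog/iceberg-rest")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.token", "<PAT>")
    .config(f"spark.sql.catalog.{CATALOG_ALIAS}.warehouse", UC_CATALOG_NAME)
    .getOrCreate()
)
```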
---

## Troubleshooting

| Issue | Solution |
|-------|----------|
| **Connection timeout or `403 Forbidden` with valid credentials** | Workspace IP access list is blocking the client — add the client's egress CIDR to the allowlist (admin console: **Settings → Security → IP access list**) |
| **`403 Forbidden`** | Check `EXTERNAL USE SCHEMA` grant and token validity |
| **`Table not found`** | Verify the `warehouse` config matches the UC catalog name; check schema and table names |
| **Class conflict in DBR** | You installed an Iceberg library in Databricks Runtime — remove it; DBR has built-in support |
| **Credential vending failure** | Ensure external data access is enabled on the workspace |
| **Slow reads** | Check if the table needs compaction (`OPTIMIZE`); large numbers of small files degrade performance |
| **v3 table incompatibility** | Upgrade to Iceberg library 1.9.0+ for v3 support; older versions cannot read v3 tables |
| **PyArrow schema mismatch** | Cast to explicit types (e.g., `pa.int32()`) when the Iceberg table schema uses 32-bit integers |
| **PyIceberg write error on serverless** | Upgrade pyarrow (`>=17`) and install `adlfs` — the bundled pyarrow v15 is incompatible |

---

## Related

- [3-iceberg-rest-catalog.md](3-iceberg-rest-catalog.md) — IRC endpoint details, auth, credential vending
- [4-snowflake-interop.md](4-snowflake-interop.md) — Snowflake-specific integration

diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/SKILL.md
new file mode 100644
index 0000000..3c8a1cb
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-iceberg/SKILL.md
@@ -0,0 +1,148 @@
+---
+name: databricks-iceberg
+description: "Apache Iceberg tables on Databricks — Managed Iceberg tables, External Iceberg Reads (fka Uniform), Compatibility Mode, Iceberg REST Catalog (IRC), Iceberg v3, Snowflake interop, PyIceberg, OSS Spark, external engine access and credential vending. Use when creating Iceberg tables, enabling External Iceberg Reads (uniform) on Delta tables (including Streaming Tables and Materialized Views via compatibility mode), configuring external engines to read Databricks tables via Unity Catalog IRC, or integrating with a Snowflake catalog to read Foreign Iceberg tables."
+---
+
+# Apache Iceberg on Databricks
+
Databricks provides multiple ways to work with Apache Iceberg: native managed Iceberg tables, UniForm for Delta-to-Iceberg interoperability, and the Iceberg REST Catalog (IRC) for external engine access.
---

## Critical Rules (always follow)

- **MUST** use Unity Catalog — all Iceberg features require UC-enabled workspaces
- **MUST NOT** install an Iceberg library into Databricks Runtime (DBR includes built-in Iceberg support; adding a library causes version conflicts)
- **MUST NOT** set `write.metadata.path` or `write.metadata.previous-versions-max` — Databricks manages metadata locations automatically; overriding causes corruption
- **MUST** determine which Iceberg pattern fits the use case before writing code — see the [When to Use](#when-to-use) section below
- **MUST** know that both `PARTITIONED BY` and `CLUSTER BY` produce the same Iceberg metadata for external engines. UC maintains an Iceberg partition spec with partition fields corresponding to the clustering keys, so external engines reading via IRC see a partitioned Iceberg table (not Hive-style, but proper Iceberg partition fields) and can prune on those fields; internally, UC uses the same fields as liquid clustering keys. The only differences between the two syntaxes:
  - `PARTITIONED BY` is standard Iceberg DDL (any engine can create the table), while `CLUSTER BY` is DBR-only DDL
  - `PARTITIONED BY` **auto-handles** DV/row-tracking properties, while `CLUSTER BY` requires manual TBLPROPERTIES on v2
- **MUST NOT** use expression-based partition transforms (`bucket()`, `years()`, `months()`, `days()`, `hours()`) with `PARTITIONED BY` on managed Iceberg tables — only plain column references are supported; expression transforms cause errors
- **MUST** disable deletion vectors and row tracking when using `CLUSTER BY` on Iceberg v2 tables — set `'delta.enableDeletionVectors' = false` and `'delta.enableRowTracking' = false` in TBLPROPERTIES (Iceberg v3 handles this automatically; `PARTITIONED BY` handles this automatically on both v2 and v3)

---

## Key Concepts

| Concept | Summary |
|---------|---------|
| **Managed Iceberg Table** | Native Iceberg table created with `USING ICEBERG` — full read/write in Databricks and via external Iceberg engines |
| **External Iceberg Reads (Uniform)** | Delta table that auto-generates Iceberg metadata — read as Iceberg externally, write as Delta internally |
| **Compatibility Mode** | UniForm variant for streaming tables and materialized views in SDP pipelines |
| **Iceberg REST Catalog (IRC)** | Unity Catalog's built-in REST endpoint implementing the Iceberg REST Catalog spec — lets external engines (Spark, PyIceberg, Snowflake) access UC-managed Iceberg data |
| **Iceberg v3** | Next-gen format (Beta, DBR 17.3+) — deletion vectors, VARIANT type, row lineage |

---

## Quick Start

### Create a Managed Iceberg Table

```sql
-- No clustering
CREATE TABLE my_catalog.my_schema.events
USING ICEBERG
AS SELECT * FROM raw_events;

-- PARTITIONED BY (recommended for cross-platform): standard Iceberg syntax, works on EMR/OSS Spark/Trino/Flink
-- auto-disables DVs and row tracking — no TBLPROPERTIES needed on v2 or v3
CREATE TABLE my_catalog.my_schema.events
USING ICEBERG
PARTITIONED BY (event_date)
AS SELECT * FROM raw_events;

-- CLUSTER BY on Iceberg v2 (DBR-only syntax): must manually disable DVs and row tracking
CREATE TABLE my_catalog.my_schema.events
USING ICEBERG
TBLPROPERTIES (
  'delta.enableDeletionVectors' = false,
  'delta.enableRowTracking' = false
)
CLUSTER BY (event_date)
AS SELECT * FROM raw_events;

-- CLUSTER BY on Iceberg v3 (DBR-only syntax): no TBLPROPERTIES needed
CREATE TABLE my_catalog.my_schema.events
USING ICEBERG
TBLPROPERTIES ('format-version' = '3')
CLUSTER BY (event_date)
AS SELECT * FROM raw_events;
```

### Enable UniForm on an Existing Delta Table

```sql
ALTER TABLE my_catalog.my_schema.customers
SET TBLPROPERTIES (
  'delta.columnMapping.mode' = 'name',
  'delta.enableIcebergCompatV2' = 'true',
  'delta.universalFormat.enabledFormats' = 'iceberg'
);
```

---

## Read/Write Capability Matrix

| Table Type | Databricks Read | Databricks Write | External IRC Read | External IRC Write |
|------------|:-:|:-:|:-:|:-:|
| Managed Iceberg (`USING ICEBERG`) | Yes | Yes | Yes | Yes |
| Delta + UniForm | Yes (as Delta) | Yes (as Delta) | Yes (as Iceberg) | No |
| Delta + Compatibility Mode | Yes (as Delta) | Yes | Yes (as Iceberg) | No |

---

## Reference Files

| File | Summary | Keywords |
|------|---------|----------|
| [1-managed-iceberg-tables.md](1-managed-iceberg-tables.md) | Creating and managing native Iceberg tables — DDL, DML, Liquid Clustering, Predictive Optimization, Iceberg v3, limitations | CREATE TABLE USING ICEBERG, CTAS, MERGE, time travel, deletion vectors, VARIANT |
| [2-uniform-and-compatibility.md](2-uniform-and-compatibility.md) | Making Delta tables readable as Iceberg — UniForm for regular tables, Compatibility Mode for streaming tables and MVs | UniForm, universalFormat, Compatibility Mode, streaming tables, materialized views, SDP |
| [3-iceberg-rest-catalog.md](3-iceberg-rest-catalog.md) | Exposing Databricks tables to external engines via the IRC endpoint — auth, credential vending, IP access lists | IRC, REST Catalog, credential vending, EXTERNAL USE SCHEMA, PAT, OAuth |
| [4-snowflake-interop.md](4-snowflake-interop.md) | Bidirectional Snowflake-Databricks integration — catalog integration, foreign catalogs, vended credentials | Snowflake, catalog integration, external volume, vended credentials, REFRESH_INTERVAL_SECONDS |
| [5-external-engine-interop.md](5-external-engine-interop.md) | Connecting PyIceberg, OSS Spark, AWS EMR, Apache Flink, and Kafka Connect via IRC | PyIceberg, OSS Spark, EMR, Flink, Kafka Connect, pyiceberg.yaml |

---

## When to Use

- **Creating a new Iceberg table** → [1-managed-iceberg-tables.md](1-managed-iceberg-tables.md)
- **Making an existing Delta table readable as Iceberg** → [2-uniform-and-compatibility.md](2-uniform-and-compatibility.md)
- **Making a streaming table or MV readable as Iceberg** → [2-uniform-and-compatibility.md](2-uniform-and-compatibility.md) (Compatibility Mode section)
- **Choosing between Managed Iceberg vs UniForm vs Compatibility Mode** → decision table in [2-uniform-and-compatibility.md](2-uniform-and-compatibility.md)
- **Exposing Databricks tables to external engines via REST API** → [3-iceberg-rest-catalog.md](3-iceberg-rest-catalog.md)
- **Integrating Databricks with Snowflake (either direction)** → [4-snowflake-interop.md](4-snowflake-interop.md)
- **Connecting PyIceberg, OSS Spark, Flink, EMR, or Kafka** → [5-external-engine-interop.md](5-external-engine-interop.md)

---

## Common Issues

| Issue | Solution |
|-------|----------|
| **No Change Data Feed (CDF)** | CDF is not supported on managed Iceberg tables. Use Delta + UniForm if you need CDF. |
| **UniForm async delay** | Iceberg metadata generation is asynchronous. After a write, there may be a brief delay before external engines see the latest data. Check status with `DESCRIBE EXTENDED table_name`. 
| +| **Compression codec change** | Managed Iceberg tables use `zstd` compression by default (not `snappy`). Older Iceberg readers that don't support zstd will fail. Verify reader compatibility or set `write.parquet.compression-codec` to `snappy`. | +| **Snowflake 1000-commit limit** | Snowflake's Iceberg catalog integration can only see the last 1000 Iceberg commits. High-frequency writers must compact metadata or Snowflake will lose visibility of older data. | +| **Deletion vectors with UniForm** | UniForm requires deletion vectors to be disabled (`delta.enableDeletionVectors = false`). If your table has deletion vectors enabled, disable them before enabling UniForm. | +| **No shallow clone for Iceberg** | `SHALLOW CLONE` is not supported for Iceberg tables. Use `DEEP CLONE` or `CREATE TABLE ... AS SELECT` instead. | +| **Version mismatch with external engines** | Ensure external engines use an Iceberg library version compatible with the format version of your tables. Iceberg v3 tables require Iceberg library 1.9.0+. | + +--- + +## Related Skills + +- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** — catalog/schema management, governance, system tables +- **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** — SDP pipelines (streaming tables, materialized views with Compatibility Mode) +- **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** — Python SDK and REST API for Databricks operations +- **[databricks-dbsql](../databricks-dbsql/SKILL.md)** — SQL warehouse features, query patterns + +--- + +## Resources + +- **[Iceberg Overview](https://docs.databricks.com/aws/en/iceberg/)** — main hub for Iceberg on Databricks +- **[UniForm](https://docs.databricks.com/aws/en/delta/uniform.html)** — Delta Universal Format +- **[Iceberg REST Catalog](https://docs.databricks.com/aws/en/external-access/iceberg)** — IRC endpoint and external engine access +- **[Compatibility Mode](https://docs.databricks.com/aws/en/external-access/compatibility-mode)** — UniForm for streaming tables and MVs +- **[Iceberg v3](https://docs.databricks.com/aws/en/iceberg/iceberg-v3)** — next-gen format features (Beta) +- **[Foreign Tables](https://docs.databricks.com/aws/en/query-data/foreign-tables.html)** — reading external catalog data diff --git a/.claude/skills/databricks-jobs/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/SKILL.md similarity index 98% rename from .claude/skills/databricks-jobs/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/SKILL.md index 2f0f8c7..0f60a24 100644 --- a/.claude/skills/databricks-jobs/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/SKILL.md @@ -326,7 +326,7 @@ resources: ## Related Skills -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - Deploy jobs via Databricks Asset Bundles +- **[databricks-bundles](../databricks-bundles/SKILL.md)** - Deploy jobs via Databricks Asset Bundles - **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - Configure pipelines triggered by jobs ## Resources diff --git a/.claude/skills/databricks-jobs/examples.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/examples.md similarity index 100% rename from .claude/skills/databricks-jobs/examples.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/examples.md diff --git 
a/.claude/skills/databricks-jobs/notifications-monitoring.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/notifications-monitoring.md similarity index 100% rename from .claude/skills/databricks-jobs/notifications-monitoring.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/notifications-monitoring.md diff --git a/.claude/skills/databricks-jobs/task-types.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/task-types.md similarity index 100% rename from .claude/skills/databricks-jobs/task-types.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/task-types.md diff --git a/.claude/skills/databricks-jobs/triggers-schedules.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/triggers-schedules.md similarity index 100% rename from .claude/skills/databricks-jobs/triggers-schedules.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-jobs/triggers-schedules.md diff --git a/.claude/skills/databricks-lakebase-autoscale/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/SKILL.md similarity index 81% rename from .claude/skills/databricks-lakebase-autoscale/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/SKILL.md index 50ba1df..f471765 100644 --- a/.claude/skills/databricks-lakebase-autoscale/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/SKILL.md @@ -1,6 +1,6 @@ --- name: databricks-lakebase-autoscale -description: "Patterns and best practices for using Lakebase Autoscaling (next-gen managed PostgreSQL) with autoscaling, branching, scale-to-zero, and instant restore." +description: "Patterns and best practices for Lakebase Autoscaling (next-gen managed PostgreSQL). Use when creating or managing Lakebase Autoscaling projects, configuring autoscaling compute or scale-to-zero, working with database branching for dev/test workflows, implementing reverse ETL via synced tables, or connecting applications to Lakebase with OAuth credentials." --- # Lakebase Autoscaling @@ -173,26 +173,66 @@ w.postgres.update_endpoint( The following MCP tools are available for managing Lakebase infrastructure. Use `type="autoscale"` for Lakebase Autoscaling. -### Database (Project) Management +### manage_lakebase_database - Project Management -| Tool | Description | -|------|-------------| -| `create_or_update_lakebase_database` | Create or update a database. Finds by name, creates if new, updates if existing. Use `type="autoscale"`, `display_name`, `pg_version` params. A new project auto-creates a production branch, default compute, and databricks_postgres database. | -| `get_lakebase_database` | Get database details (including branches and endpoints) or list all. Pass `name` to get one, omit to list all. Use `type="autoscale"` to filter. | -| `delete_lakebase_database` | Delete a project and all its branches, computes, and data. Use `type="autoscale"`. 
| +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Create or update a project | name | +| `get` | Get project details (includes branches/endpoints) | name | +| `list` | List all projects | (none, optional type filter) | +| `delete` | Delete project and all branches/computes/data | name | -### Branch Management +**Example usage:** +```python +# Create an autoscale project +manage_lakebase_database( + action="create_or_update", + name="my-app", + type="autoscale", + display_name="My Application", + pg_version="17" +) + +# Get project with branches +manage_lakebase_database(action="get", name="my-app", type="autoscale") + +# Delete project +manage_lakebase_database(action="delete", name="my-app", type="autoscale") +``` + +### manage_lakebase_branch - Branch Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Create/update branch with compute endpoint | project_name, branch_id | +| `delete` | Delete branch and endpoints | name (full branch name) | -| Tool | Description | -|------|-------------| -| `create_or_update_lakebase_branch` | Create or update a branch with its compute endpoint. Params: `project_name`, `branch_id`, `source_branch`, `ttl_seconds`, `is_protected`, plus compute params (`autoscaling_limit_min_cu`, `autoscaling_limit_max_cu`, `scale_to_zero_seconds`). | -| `delete_lakebase_branch` | Delete a branch and its compute endpoints. | +**Example usage:** +```python +# Create a dev branch with 7-day TTL +manage_lakebase_branch( + action="create_or_update", + project_name="my-app", + branch_id="development", + source_branch="production", + ttl_seconds=604800, # 7 days + autoscaling_limit_min_cu=0.5, + autoscaling_limit_max_cu=4.0, + scale_to_zero_seconds=300 +) + +# Delete branch +manage_lakebase_branch(action="delete", name="projects/my-app/branches/development") +``` -### Credentials +### generate_lakebase_credential - OAuth Tokens -| Tool | Description | -|------|-------------| -| `generate_lakebase_credential` | Generate OAuth token for PostgreSQL connections (1-hour expiry). Pass `endpoint` resource name for autoscale. | +Generate OAuth token (~1hr) for PostgreSQL connections. Use as password with `sslmode=require`. 
+ +```python +# For autoscale endpoints +generate_lakebase_credential(endpoint="projects/my-app/branches/production/endpoints/ep-primary") +``` ## Reference Files @@ -290,5 +330,5 @@ These features are NOT yet supported in Lakebase Autoscaling: - **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps that can use Lakebase for persistence - **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps with Lakebase backend - **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - SDK used for project management and token generation -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - deploying apps with Lakebase resources +- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps with Lakebase resources - **[databricks-jobs](../databricks-jobs/SKILL.md)** - scheduling reverse ETL sync jobs diff --git a/.claude/skills/databricks-lakebase-autoscale/branches.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/branches.md similarity index 100% rename from .claude/skills/databricks-lakebase-autoscale/branches.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/branches.md diff --git a/.claude/skills/databricks-lakebase-autoscale/computes.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/computes.md similarity index 100% rename from .claude/skills/databricks-lakebase-autoscale/computes.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/computes.md diff --git a/.claude/skills/databricks-lakebase-autoscale/connection-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/connection-patterns.md similarity index 100% rename from .claude/skills/databricks-lakebase-autoscale/connection-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/connection-patterns.md diff --git a/.claude/skills/databricks-lakebase-autoscale/projects.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/projects.md similarity index 100% rename from .claude/skills/databricks-lakebase-autoscale/projects.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/projects.md diff --git a/.claude/skills/databricks-lakebase-autoscale/reverse-etl.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/reverse-etl.md similarity index 100% rename from .claude/skills/databricks-lakebase-autoscale/reverse-etl.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-autoscale/reverse-etl.md diff --git a/.claude/skills/databricks-lakebase-provisioned/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/SKILL.md similarity index 79% rename from .claude/skills/databricks-lakebase-provisioned/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/SKILL.md index b2b404a..7548219 100644 --- a/.claude/skills/databricks-lakebase-provisioned/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/SKILL.md @@ -1,6 +1,6 @@ --- name: databricks-lakebase-provisioned -description: "Patterns and best practices for using Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads." 
+description: "Patterns and best practices for Lakebase Provisioned (Databricks managed PostgreSQL) for OLTP workloads. Use when creating Lakebase instances, connecting applications or Databricks Apps to PostgreSQL, implementing reverse ETL via synced tables, storing agent or chat memory, or configuring OAuth authentication for Lakebase." --- # Lakebase Provisioned @@ -225,21 +225,65 @@ mlflow.langchain.log_model( The following MCP tools are available for managing Lakebase infrastructure. Use `type="provisioned"` for Lakebase Provisioned. -### Database Management +### manage_lakebase_database - Database Management -| Tool | Description | -|------|-------------| -| `create_or_update_lakebase_database` | Create or update a database. Finds by name, creates if new, updates if existing. Use `type="provisioned"`, `capacity` (CU_1-CU_8), `stopped` params. | -| `get_lakebase_database` | Get database details or list all. Pass `name` to get one, omit to list all. Use `type="provisioned"` to filter. | -| `delete_lakebase_database` | Delete a database and its resources. Use `type="provisioned"`, `force=True` to cascade. | -| `generate_lakebase_credential` | Generate OAuth token for PostgreSQL connections (1-hour expiry). Pass `instance_names` for provisioned. | +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Create or update a database | name | +| `get` | Get database details | name | +| `list` | List all databases | (none, optional type filter) | +| `delete` | Delete database and resources | name | -### Reverse ETL (Catalog + Synced Tables) +**Example usage:** +```python +# Create a provisioned database +manage_lakebase_database( + action="create_or_update", + name="my-lakebase-instance", + type="provisioned", + capacity="CU_1" +) + +# Get database details +manage_lakebase_database(action="get", name="my-lakebase-instance", type="provisioned") + +# List all databases +manage_lakebase_database(action="list") + +# Delete with cascade +manage_lakebase_database(action="delete", name="my-lakebase-instance", type="provisioned", force=True) +``` + +### manage_lakebase_sync - Reverse ETL + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Set up reverse ETL from Delta to Lakebase | instance_name, source_table_name, target_table_name | +| `delete` | Remove synced table (and optionally catalog) | table_name | -| Tool | Description | -|------|-------------| -| `create_or_update_lakebase_sync` | Set up reverse ETL: ensures UC catalog registration exists, then creates a synced table from Delta to Lakebase. Params: `instance_name`, `source_table_name`, `target_table_name`, `scheduling_policy` ("TRIGGERED"/"SNAPSHOT"/"CONTINUOUS"). | -| `delete_lakebase_sync` | Remove a synced table and optionally its UC catalog registration. | +**Example usage:** +```python +# Set up reverse ETL +manage_lakebase_sync( + action="create_or_update", + instance_name="my-lakebase-instance", + source_table_name="catalog.schema.delta_table", + target_table_name="lakebase_catalog.schema.postgres_table", + scheduling_policy="TRIGGERED" # or SNAPSHOT, CONTINUOUS +) + +# Delete synced table +manage_lakebase_sync(action="delete", table_name="lakebase_catalog.schema.postgres_table") +``` + +### generate_lakebase_credential - OAuth Tokens + +Generate OAuth token (~1hr) for PostgreSQL connections. Use as password with `sslmode=require`. 
+ +```python +# For provisioned instances +generate_lakebase_credential(instance_names=["my-lakebase-instance"]) +``` ## Reference Files @@ -304,5 +348,5 @@ databricks database start-database-instance --name my-lakebase-instance - **[databricks-app-apx](../databricks-app-apx/SKILL.md)** - full-stack apps that can use Lakebase for persistence - **[databricks-app-python](../databricks-app-python/SKILL.md)** - Python apps with Lakebase backend - **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - SDK used for instance management and token generation -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - deploying apps with Lakebase resources +- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying apps with Lakebase resources - **[databricks-jobs](../databricks-jobs/SKILL.md)** - scheduling reverse ETL sync jobs diff --git a/.claude/skills/databricks-lakebase-provisioned/connection-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/connection-patterns.md similarity index 100% rename from .claude/skills/databricks-lakebase-provisioned/connection-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/connection-patterns.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/reverse-etl.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/reverse-etl.md new file mode 100644 index 0000000..5b5caef --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-lakebase-provisioned/reverse-etl.md @@ -0,0 +1,171 @@ +# Reverse ETL with Lakebase Provisioned + +## Overview + +Reverse ETL allows you to sync data from Unity Catalog Delta tables into Lakebase Provisioned as PostgreSQL tables. This enables OLTP access patterns on data processed in the Lakehouse. 
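The payoff is ordinary PostgreSQL access to Lakehouse-curated data. A minimal point-lookup sketch with `psycopg2`, in which the host, database, user, and table names are hypothetical and the password is an OAuth token from `generate_lakebase_credential`:

```python
import psycopg2

# Hypothetical connection details: the host comes from the Lakebase instance,
# the password is an OAuth token (~1h expiry) used with sslmode=require
conn = psycopg2.connect(
    host="instance-abc123.database.cloud.databricks.com",
    dbname="databricks_postgres",
    user="my-app-service-principal",
    password="<oauth-token>",
    sslmode="require",
)
with conn.cursor() as cur:
    cur.execute(
        "SELECT * FROM your_schema.your_table WHERE user_id = %s",
        ("u-123",),
    )
    print(cur.fetchone())
conn.close()
```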
## Sync Modes

| Mode | Description | Best For | Notes |
|------|-------------|----------|-------|
| **Snapshot** | One-time full copy | Initial setup, small tables | ~10x more efficient than incremental syncs when >10% of the data changes |
| **Triggered** | Scheduled updates on demand | Dashboards updated hourly/daily | Requires CDF on source table |
| **Continuous** | Real-time streaming (seconds of latency) | Live applications | Highest cost, minimum 15s intervals, requires CDF |

**Note:** Triggered and Continuous modes require Change Data Feed (CDF) enabled on the source table:

```sql
ALTER TABLE your_catalog.your_schema.your_table
SET TBLPROPERTIES (delta.enableChangeDataFeed = true)
```

## Creating Synced Tables

### Using Python SDK

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.database import (
    SyncedDatabaseTable,
    SyncedTableSpec,
    SyncedTableSchedulingPolicy,
)

w = WorkspaceClient()

# Create a synced table from Unity Catalog to Lakebase Provisioned
synced_table = w.database.create_synced_database_table(
    SyncedDatabaseTable(
        name="lakebase_catalog.schema.synced_table",
        database_instance_name="my-lakebase-instance",
        spec=SyncedTableSpec(
            source_table_full_name="analytics.gold.user_profiles",
            primary_key_columns=["user_id"],
            scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED,
        ),
    )
)
print(f"Created synced table: {synced_table.name}")
```

**Key parameters:**

| Parameter | Description |
|-----------|-------------|
| `name` | Fully qualified target table name (catalog.schema.table) |
| `database_instance_name` | Lakebase Provisioned instance name |
| `source_table_full_name` | Fully qualified source Delta table (catalog.schema.table) |
| `primary_key_columns` | List of primary key columns from the source table |
| `scheduling_policy` | `SNAPSHOT`, `TRIGGERED`, or `CONTINUOUS` |

### Using CLI

```bash
databricks database create-synced-database-table \
  --json '{
    "name": "lakebase_catalog.schema.synced_table",
    "database_instance_name": "my-lakebase-instance",
    "spec": {
      "source_table_full_name": "analytics.gold.user_profiles",
      "primary_key_columns": ["user_id"],
      "scheduling_policy": "TRIGGERED"
    }
  }'
```

**Note:** There is no SQL syntax for creating synced tables. Use the Python SDK, CLI, or Catalog Explorer UI.

## Checking Synced Table Status

```python
status = w.database.get_synced_database_table(name="lakebase_catalog.schema.synced_table")
print(f"State: {status.data_synchronization_status.detailed_state}")
print(f"Message: {status.data_synchronization_status.message}")
```

## Deleting a Synced Table

Delete from both Unity Catalog and Postgres:

1. **Unity Catalog:** Delete via Catalog Explorer or SDK
2. **Postgres:** Drop the table to free storage

```python
# Delete the synced table via SDK
w.database.delete_synced_database_table(name="lakebase_catalog.schema.synced_table")
```

```sql
-- Drop the Postgres table to free storage (run while connected to the instance's database)
DROP TABLE your_schema.your_table;
```

## Use Cases

### 1. 
Product Catalog for Web App + +```python +w.database.create_synced_database_table( + SyncedDatabaseTable( + name="ecommerce_catalog.public.products", + database_instance_name="ecommerce-db", + spec=SyncedTableSpec( + source_table_full_name="gold.products.catalog", + primary_key_columns=["product_id"], + scheduling_policy=SyncedTableSchedulingPolicy.TRIGGERED, + ), + ) +) +# Application queries PostgreSQL directly with low-latency point lookups +``` + +### 2. User Profiles for Authentication + +```python +w.database.create_synced_database_table( + SyncedDatabaseTable( + name="auth_catalog.public.user_profiles", + database_instance_name="auth-db", + spec=SyncedTableSpec( + source_table_full_name="gold.users.profiles", + primary_key_columns=["user_id"], + scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS, + ), + ) +) +``` + +### 3. Feature Store for Real-time ML + +```python +w.database.create_synced_database_table( + SyncedDatabaseTable( + name="ml_catalog.public.user_features", + database_instance_name="feature-store-db", + spec=SyncedTableSpec( + source_table_full_name="ml.features.user_features", + primary_key_columns=["user_id"], + scheduling_policy=SyncedTableSchedulingPolicy.CONTINUOUS, + ), + ) +) +# ML model queries features with low latency +``` + +## Best Practices + +1. **Enable CDF** on source tables before creating Triggered or Continuous synced tables +2. **Choose appropriate sync mode**: Snapshot for small tables or one-time loads, Triggered for hourly/daily refreshes, Continuous for real-time +3. **Monitor sync status**: Check for failures and latency via Catalog Explorer or `get_synced_database_table()` +4. **Index target tables**: Create appropriate indexes in PostgreSQL for your query patterns +5. **Handle schema changes**: Only additive changes (e.g., adding columns) are supported for Triggered/Continuous modes +6. 
**Account for connection limits**: Each synced table uses up to 16 connections + +## Common Issues + +| Issue | Solution | +|-------|----------| +| **Sync fails with CDF error** | Enable Change Data Feed on source table before using Triggered or Continuous mode | +| **Schema mismatch** | Only additive schema changes are supported; for breaking changes, delete and recreate the synced table | +| **Sync takes too long** | Switch to Triggered mode for scheduled updates; use Snapshot for initial bulk loads | +| **Target table locked** | Avoid DDL on target during sync operations | diff --git a/.claude/skills/databricks-metric-views/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/SKILL.md similarity index 95% rename from .claude/skills/databricks-metric-views/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/SKILL.md index d3f5834..bddc74a 100644 --- a/.claude/skills/databricks-metric-views/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/SKILL.md @@ -25,6 +25,19 @@ Use this skill when: ## Quick Start +### Inspect Source Table Schema + +Before creating a metric view, call `get_table_stats_and_schema` to understand available columns for dimensions and measures: + +``` +get_table_stats_and_schema( + catalog="catalog", + schema="schema", + table_names=["orders"], + table_stat_level="SIMPLE" # Use "DETAILED" for cardinality, min/max, histograms +) +``` + ### Create a Metric View ```sql diff --git a/.claude/skills/databricks-metric-views/patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/patterns.md similarity index 100% rename from .claude/skills/databricks-metric-views/patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/patterns.md diff --git a/.claude/skills/databricks-metric-views/yaml-reference.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/yaml-reference.md similarity index 100% rename from .claude/skills/databricks-metric-views/yaml-reference.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-metric-views/yaml-reference.md diff --git a/.claude/skills/databricks-mlflow-evaluation/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/SKILL.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/SKILL.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/CRITICAL-interfaces.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/CRITICAL-interfaces.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/CRITICAL-interfaces.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/CRITICAL-interfaces.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/GOTCHAS.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/GOTCHAS.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/GOTCHAS.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/GOTCHAS.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-context-optimization.md 
b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-context-optimization.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-context-optimization.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-context-optimization.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-datasets.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-datasets.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-datasets.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-datasets.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-evaluation.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-evaluation.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-evaluation.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-evaluation.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-judge-alignment.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-judge-alignment.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-judge-alignment.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-judge-alignment.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-prompt-optimization.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-prompt-optimization.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-prompt-optimization.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-prompt-optimization.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-scorers.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-scorers.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-scorers.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-scorers.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-trace-analysis.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-trace-analysis.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-trace-analysis.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-trace-analysis.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/patterns-trace-ingestion.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-trace-ingestion.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/patterns-trace-ingestion.md rename to 
coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/patterns-trace-ingestion.md diff --git a/.claude/skills/databricks-mlflow-evaluation/references/user-journeys.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/user-journeys.md similarity index 100% rename from .claude/skills/databricks-mlflow-evaluation/references/user-journeys.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-mlflow-evaluation/references/user-journeys.md diff --git a/.claude/skills/databricks-model-serving/1-classical-ml.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/1-classical-ml.md similarity index 99% rename from .claude/skills/databricks-model-serving/1-classical-ml.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/1-classical-ml.md index 0d7d5ac..4b973e0 100644 --- a/.claude/skills/databricks-model-serving/1-classical-ml.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/1-classical-ml.md @@ -143,7 +143,8 @@ endpoint = w.serving_endpoints.create_and_wait( ### Via MCP Tool ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="diabetes-predictor", dataframe_records=[ {"age": 45, "bmi": 25.3, "bp": 120, "s1": 200} diff --git a/.claude/skills/databricks-model-serving/2-custom-pyfunc.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/2-custom-pyfunc.md similarity index 98% rename from .claude/skills/databricks-model-serving/2-custom-pyfunc.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/2-custom-pyfunc.md index afd6e18..b7dbad3 100644 --- a/.claude/skills/databricks-model-serving/2-custom-pyfunc.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/2-custom-pyfunc.md @@ -189,7 +189,8 @@ endpoint = client.create_endpoint( ## Query Custom Model ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="custom-model-endpoint", dataframe_records=[ {"age": 25, "income": 50000, "category": "A"} @@ -200,7 +201,8 @@ query_serving_endpoint( Or with inputs format: ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="custom-model-endpoint", inputs={"age": 25, "income": 50000, "category": "A"} ) diff --git a/.claude/skills/databricks-model-serving/3-genai-agents.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/3-genai-agents.md similarity index 98% rename from .claude/skills/databricks-model-serving/3-genai-agents.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/3-genai-agents.md index 6f2c779..4061dba 100644 --- a/.claude/skills/databricks-model-serving/3-genai-agents.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/3-genai-agents.md @@ -224,7 +224,7 @@ for event in AGENT.predict_stream(request): Run via MCP: ``` -run_python_file_on_databricks(file_path="./my_agent/test_agent.py") +execute_code(file_path="./my_agent/test_agent.py") ``` ## Logging the Agent @@ -275,7 +275,8 @@ agents.deploy( ## Query Deployed Agent ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="my-agent-endpoint", messages=[{"role": "user", "content": "What is Databricks?"}], max_tokens=500 diff --git a/.claude/skills/databricks-model-serving/4-tools-integration.md 
b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/4-tools-integration.md similarity index 100% rename from .claude/skills/databricks-model-serving/4-tools-integration.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/4-tools-integration.md diff --git a/.claude/skills/databricks-model-serving/5-development-testing.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/5-development-testing.md similarity index 85% rename from .claude/skills/databricks-model-serving/5-development-testing.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/5-development-testing.md index cbc4f76..2a3806c 100644 --- a/.claude/skills/databricks-model-serving/5-development-testing.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/5-development-testing.md @@ -13,17 +13,17 @@ MCP-based workflow for developing and testing agents on Databricks. ▼ ┌─────────────────────────────────────────────────────────────┐ │ Step 2: Upload to workspace │ -│ → upload_folder MCP tool │ +│ → manage_workspace_files MCP tool │ └─────────────────────────────────────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Step 3: Install packages │ -│ → execute_databricks_command MCP tool │ +│ → execute_code MCP tool │ └─────────────────────────────────────────────────────────────┘ ▼ ┌─────────────────────────────────────────────────────────────┐ │ Step 4: Test agent (iterate) │ -│ → run_python_file_on_databricks MCP tool │ +│ → execute_code MCP tool (with file_path) │ │ → If error: fix locally, re-upload, re-run │ └─────────────────────────────────────────────────────────────┘ ``` @@ -85,12 +85,13 @@ print("Response:", result.model_dump(exclude_none=True)) ## Step 2: Upload to Workspace -Use the `upload_folder` MCP tool: +Use the `manage_workspace_files` MCP tool: ``` -upload_folder( - local_folder="./my_agent", - workspace_folder="/Workspace/Users/you@company.com/my_agent" +manage_workspace_files( + action="upload", + local_path="./my_agent", + workspace_path="/Workspace/Users/you@company.com/my_agent" ) ``` @@ -98,10 +99,10 @@ This uploads all files in parallel. ## Step 3: Install Packages -Use `execute_databricks_command` to install dependencies: +Use `execute_code` to install dependencies: ``` -execute_databricks_command( +execute_code( code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic" ) ``` @@ -111,7 +112,7 @@ execute_databricks_command( ### Follow-up Commands (Reuse Context) ``` -execute_databricks_command( +execute_code( code="dbutils.library.restartPython()", cluster_id="", context_id="" @@ -120,10 +121,10 @@ execute_databricks_command( ## Step 4: Test the Agent -Use `run_python_file_on_databricks`: +Use `execute_code` with `file_path`: ``` -run_python_file_on_databricks( +execute_code( file_path="./my_agent/test_agent.py", cluster_id="", context_id="" @@ -134,8 +135,8 @@ run_python_file_on_databricks( 1. Read the error from the output 2. Fix the local file (`agent.py` or `test_agent.py`) -3. Re-upload: `upload_folder(...)` -4. Re-run: `run_python_file_on_databricks(...)` +3. Re-upload: `manage_workspace_files(action="upload", ...)` +4. 
Re-run: `execute_code(file_path=...)` ### Iteration Tips @@ -148,7 +149,7 @@ run_python_file_on_databricks( ### Check if packages are installed ``` -execute_databricks_command( +execute_code( code="import mlflow; print(mlflow.__version__)", cluster_id="", context_id="" @@ -158,7 +159,7 @@ execute_databricks_command( ### List available endpoints ``` -execute_databricks_command( +execute_code( code=""" from databricks.sdk import WorkspaceClient w = WorkspaceClient() @@ -173,7 +174,7 @@ for ep in list(w.serving_endpoints.list())[:10]: ### Test LLM endpoint directly ``` -execute_databricks_command( +execute_code( code=""" from databricks_langchain import ChatDatabricks llm = ChatDatabricks(endpoint="databricks-meta-llama-3-3-70b-instruct") @@ -189,11 +190,11 @@ print(response.content) | Step | MCP Tool | Purpose | |------|----------|---------| -| Upload files | `upload_folder` | Sync local files to workspace | -| Install packages | `execute_databricks_command` | Set up dependencies | -| Restart Python | `execute_databricks_command` | Apply package changes | -| Test agent | `run_python_file_on_databricks` | Run test script | -| Debug | `execute_databricks_command` | Quick checks | +| Upload files | `manage_workspace_files` (action="upload") | Sync local files to workspace | +| Install packages | `execute_code` | Set up dependencies | +| Restart Python | `execute_code` | Apply package changes | +| Test agent | `execute_code` (with `file_path`) | Run test script | +| Debug | `execute_code` | Quick checks | ## Next Steps diff --git a/.claude/skills/databricks-model-serving/6-logging-registration.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/6-logging-registration.md similarity index 98% rename from .claude/skills/databricks-model-serving/6-logging-registration.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/6-logging-registration.md index f2344af..cd68735 100644 --- a/.claude/skills/databricks-model-serving/6-logging-registration.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/6-logging-registration.md @@ -63,7 +63,7 @@ print(f"Registered: {uc_model_info.name} version {uc_model_info.version}") Run via MCP: ``` -run_python_file_on_databricks(file_path="./my_agent/log_model.py") +execute_code(file_path="./my_agent/log_model.py") ``` ## Resources for Auto Authentication diff --git a/.claude/skills/databricks-model-serving/7-deployment.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/7-deployment.md similarity index 93% rename from .claude/skills/databricks-model-serving/7-deployment.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/7-deployment.md index c2def49..666cb16 100644 --- a/.claude/skills/databricks-model-serving/7-deployment.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/7-deployment.md @@ -90,7 +90,7 @@ manage_job_runs(action="get", run_id="") Or check endpoint directly: ``` -get_serving_endpoint_status(name="") +manage_serving_endpoint(action="get", name="") ``` ## Classical ML Deployment @@ -172,7 +172,7 @@ deployment = agents.deploy( Endpoints created via `agents.deploy()` appear under **Serving** in the Databricks UI. If you don't see your endpoint: 1. **Check the filter** - The Serving page defaults to "Owned by me". If the deployment ran as a service principal (e.g., via a job), switch to "All" to see it. -2. 
**Verify via API** - Use `list_serving_endpoints()` or `get_serving_endpoint_status(name="...")` to confirm the endpoint exists and check its state. +2. **Verify via API** - Use `manage_serving_endpoint(action="list")` or `manage_serving_endpoint(action="get", name="...")` to confirm the endpoint exists and check its state. 3. **Check the name** - The auto-generated name may not be what you expect. Print `deployment.endpoint_name` in the deploy script or check the job run output. ### Deployment Script with Explicit Naming @@ -263,16 +263,16 @@ client.update_endpoint( | Step | MCP Tool | Waits? | |------|----------|--------| -| Upload deploy script | `upload_folder` | Yes | +| Upload deploy script | `manage_workspace_files` (action="upload") | Yes | | Create job (one-time) | `manage_jobs` (action="create") | Yes | | Run deployment | `manage_job_runs` (action="run_now") | **No** - returns immediately | | Check job status | `manage_job_runs` (action="get") | Yes | -| Check endpoint status | `get_serving_endpoint_status` | Yes | +| Check endpoint status | `manage_serving_endpoint` (action="get") | Yes | ## After Deployment Once endpoint is READY: -1. **Test with MCP**: `query_serving_endpoint(name="...", messages=[...])` +1. **Test with MCP**: `manage_serving_endpoint(action="query", name="...", messages=[...])` 2. **Share with team**: Endpoint URL in Databricks UI 3. **Integrate in apps**: Use REST API or SDK diff --git a/.claude/skills/databricks-model-serving/8-querying-endpoints.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/8-querying-endpoints.md similarity index 96% rename from .claude/skills/databricks-model-serving/8-querying-endpoints.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/8-querying-endpoints.md index 9c655a1..4dfa2f9 100644 --- a/.claude/skills/databricks-model-serving/8-querying-endpoints.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/8-querying-endpoints.md @@ -11,7 +11,7 @@ Send requests to deployed Model Serving endpoints. 
Before querying, verify the endpoint is ready: ``` -get_serving_endpoint_status(name="my-agent-endpoint") +manage_serving_endpoint(action="get", name="my-agent-endpoint") ``` Response: @@ -28,7 +28,8 @@ Response: ### Query Chat/Agent Endpoint ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="my-agent-endpoint", messages=[ {"role": "user", "content": "What is Databricks?"} @@ -61,7 +62,8 @@ Response: ### Query ML Model Endpoint ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="sklearn-classifier", dataframe_records=[ {"age": 25, "income": 50000, "credit_score": 720}, @@ -80,7 +82,7 @@ Response: ### List All Endpoints ``` -list_serving_endpoints(limit=20) +manage_serving_endpoint(action="list", limit=20) ``` ## Python SDK diff --git a/.claude/skills/databricks-model-serving/9-package-requirements.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/9-package-requirements.md similarity index 97% rename from .claude/skills/databricks-model-serving/9-package-requirements.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/9-package-requirements.md index f78a112..f9ceb7a 100644 --- a/.claude/skills/databricks-model-serving/9-package-requirements.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/9-package-requirements.md @@ -139,10 +139,10 @@ export DATABRICKS_CONFIG_PROFILE="your-profile" ## Installing Packages via MCP -Use `execute_databricks_command`: +Use `execute_code`: ``` -execute_databricks_command( +execute_code( code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic" ) ``` @@ -150,7 +150,7 @@ execute_databricks_command( Then restart Python: ``` -execute_databricks_command( +execute_code( code="dbutils.library.restartPython()", cluster_id="", context_id="" @@ -174,7 +174,7 @@ for pkg in packages: Via MCP: ``` -execute_databricks_command( +execute_code( code=""" import pkg_resources for pkg in ['mlflow', 'langchain', 'langgraph', 'pydantic', 'databricks-langchain']: diff --git a/.claude/skills/databricks-model-serving/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/SKILL.md similarity index 84% rename from .claude/skills/databricks-model-serving/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/SKILL.md index de566f4..7416029 100644 --- a/.claude/skills/databricks-model-serving/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-model-serving/SKILL.md @@ -29,8 +29,7 @@ ALWAYS use exact endpoint names from this table. NEVER guess or abbreviate. 
| Endpoint Name | Provider | Notes | |--------------|----------|-------| -| `databricks-gpt-5-3-codex` | OpenAI | Latest GPT Codex, 400K context | -| `databricks-gpt-5-2` | OpenAI | GPT 5.2, 400K context | +| `databricks-gpt-5-2` | OpenAI | Latest GPT, 400K context | | `databricks-gpt-5-1` | OpenAI | Instant + Thinking modes | | `databricks-gpt-5-1-codex-max` | OpenAI | Code-specialized (high perf) | | `databricks-gpt-5-1-codex-mini` | OpenAI | Code-specialized (cost-opt) | @@ -102,7 +101,7 @@ dbutils.library.restartPython() Or via MCP: ``` -execute_databricks_command(code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic") +execute_code(code="%pip install -U mlflow==3.6.0 databricks-langchain langgraph==0.3.4 databricks-agents pydantic") ``` ### Step 2: Create Agent File @@ -112,16 +111,17 @@ Create `agent.py` locally with `ResponsesAgent` pattern (see [3-genai-agents.md] ### Step 3: Upload to Workspace ``` -upload_folder( - local_folder="./my_agent", - workspace_folder="/Workspace/Users/you@company.com/my_agent" +manage_workspace_files( + action="upload", + local_path="./my_agent", + workspace_path="/Workspace/Users/you@company.com/my_agent" ) ``` ### Step 4: Test Agent ``` -run_python_file_on_databricks( +execute_code( file_path="./my_agent/test_agent.py", cluster_id="" ) @@ -130,7 +130,7 @@ run_python_file_on_databricks( ### Step 5: Log Model ``` -run_python_file_on_databricks( +execute_code( file_path="./my_agent/log_model.py", cluster_id="" ) @@ -143,7 +143,8 @@ See [7-deployment.md](7-deployment.md) for job-based deployment that doesn't tim ### Step 7: Query Endpoint ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="my-agent-endpoint", messages=[{"role": "user", "content": "Hello!"}] ) @@ -181,9 +182,8 @@ Then deploy via UI or SDK. See [1-classical-ml.md](1-classical-ml.md). | Tool | Purpose | |------|---------| -| `upload_folder` | Upload agent files to workspace | -| `run_python_file_on_databricks` | Test agent, log model | -| `execute_databricks_command` | Install packages, quick tests | +| `manage_workspace_files` (action="upload") | Upload agent files to workspace | +| `execute_code` | Install packages, test agent, log model | ### Deployment @@ -193,13 +193,37 @@ Then deploy via UI or SDK. See [1-classical-ml.md](1-classical-ml.md). 
| `manage_job_runs` (action="run_now") | Kick off deployment (async) | | `manage_job_runs` (action="get") | Check deployment job status | -### Querying +### manage_serving_endpoint - Querying -| Tool | Purpose | -|------|---------| -| `get_serving_endpoint_status` | Check if endpoint is READY | -| `query_serving_endpoint` | Send requests to endpoint | -| `list_serving_endpoints` | List all endpoints | +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `get` | Check endpoint status (READY/NOT_READY/NOT_FOUND) | name | +| `list` | List all endpoints | (none, optional limit) | +| `query` | Send requests to endpoint | name + one of: messages, inputs, dataframe_records | + +**Example usage:** +```python +# Check endpoint status +manage_serving_endpoint(action="get", name="my-agent-endpoint") + +# List all endpoints +manage_serving_endpoint(action="list") + +# Query a chat/agent endpoint +manage_serving_endpoint( + action="query", + name="my-agent-endpoint", + messages=[{"role": "user", "content": "Hello!"}], + max_tokens=500 +) + +# Query a traditional ML endpoint +manage_serving_endpoint( + action="query", + name="sklearn-classifier", + dataframe_records=[{"age": 25, "income": 50000, "credit_score": 720}] +) +``` --- @@ -208,7 +232,7 @@ Then deploy via UI or SDK. See [1-classical-ml.md](1-classical-ml.md). ### Check Endpoint Status After Deployment ``` -get_serving_endpoint_status(name="my-agent-endpoint") +manage_serving_endpoint(action="get", name="my-agent-endpoint") ``` Returns: @@ -223,7 +247,8 @@ Returns: ### Query a Chat/Agent Endpoint ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="my-agent-endpoint", messages=[ {"role": "user", "content": "What is Databricks?"} @@ -235,7 +260,8 @@ query_serving_endpoint( ### Query a Traditional ML Endpoint ``` -query_serving_endpoint( +manage_serving_endpoint( + action="query", name="sklearn-classifier", dataframe_records=[ {"age": 25, "income": 50000, "credit_score": 720} @@ -250,7 +276,7 @@ query_serving_endpoint( | Issue | Solution | |-------|----------| | **Invalid output format** | Use `self.create_text_output_item(text, id)` - NOT raw dicts! | -| **Endpoint NOT_READY** | Deployment takes ~15 min. Use `get_serving_endpoint_status` to poll. | +| **Endpoint NOT_READY** | Deployment takes ~15 min. Use `manage_serving_endpoint(action="get")` to poll. 
| | **Package not found** | Specify exact versions in `pip_requirements` when logging model | | **Tool timeout** | Use job-based deployment, not synchronous calls | | **Auth error on endpoint** | Ensure `resources` specified in `log_model` for auto passthrough | diff --git a/.claude/skills/databricks-python-sdk/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/SKILL.md similarity index 99% rename from .claude/skills/databricks-python-sdk/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/SKILL.md index 1365666..eaf7cd6 100644 --- a/.claude/skills/databricks-python-sdk/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/SKILL.md @@ -617,7 +617,7 @@ If I'm unsure about a method, I should: ## Related Skills - **[databricks-config](../databricks-config/SKILL.md)** - profile and authentication setup -- **[databricks-asset-bundles](../databricks-asset-bundles/SKILL.md)** - deploying resources via DABs +- **[databricks-bundles](../databricks-bundles/SKILL.md)** - deploying resources via DABs - **[databricks-jobs](../databricks-jobs/SKILL.md)** - job orchestration patterns - **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - catalog governance - **[databricks-model-serving](../databricks-model-serving/SKILL.md)** - serving endpoint management diff --git a/.claude/skills/databricks-python-sdk/doc-index.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/doc-index.md similarity index 100% rename from .claude/skills/databricks-python-sdk/doc-index.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/doc-index.md diff --git a/.claude/skills/databricks-python-sdk/examples/1-authentication.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/1-authentication.py similarity index 100% rename from .claude/skills/databricks-python-sdk/examples/1-authentication.py rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/1-authentication.py diff --git a/.claude/skills/databricks-python-sdk/examples/2-clusters-and-jobs.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/2-clusters-and-jobs.py similarity index 100% rename from .claude/skills/databricks-python-sdk/examples/2-clusters-and-jobs.py rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/2-clusters-and-jobs.py diff --git a/.claude/skills/databricks-python-sdk/examples/3-sql-and-warehouses.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/3-sql-and-warehouses.py similarity index 100% rename from .claude/skills/databricks-python-sdk/examples/3-sql-and-warehouses.py rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/3-sql-and-warehouses.py diff --git a/.claude/skills/databricks-python-sdk/examples/4-unity-catalog.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/4-unity-catalog.py similarity index 100% rename from .claude/skills/databricks-python-sdk/examples/4-unity-catalog.py rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/4-unity-catalog.py diff --git a/.claude/skills/databricks-python-sdk/examples/5-serving-and-vector-search.py 
b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/5-serving-and-vector-search.py
similarity index 100%
rename from .claude/skills/databricks-python-sdk/examples/5-serving-and-vector-search.py
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-python-sdk/examples/5-serving-and-vector-search.py
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/SKILL.md
new file mode 100644
index 0000000..a1bdd7c
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/SKILL.md
@@ -0,0 +1,389 @@
+---
+name: databricks-spark-declarative-pipelines
+description: "Creates, configures, and updates Databricks Lakeflow Spark Declarative Pipelines (SDP/LDP) using serverless compute. Handles data ingestion with streaming tables, materialized views, CDC, SCD Type 2, and Auto Loader ingestion patterns. Use when building data pipelines, working with Delta Live Tables, ingesting streaming data, implementing change data capture, or when the user mentions SDP, LDP, DLT, Lakeflow pipelines, streaming tables, or bronze/silver/gold medallion architectures."
+---
+
+# Lakeflow Spark Declarative Pipelines (SDP)
+
+---
+
+## Critical Rules (always follow)
+
+### Syntax: CREATE OR REFRESH (not CREATE OR REPLACE)
+- **MUST** use `CREATE OR REFRESH` for SDP objects:
+  - `CREATE OR REFRESH STREAMING TABLE` - for streaming tables
+  - `CREATE OR REFRESH MATERIALIZED VIEW` - for materialized views
+- **NEVER** use `CREATE OR REPLACE` - that is standard SQL syntax, not SDP syntax
+
+### Simplicity First
+- **MUST** create the minimal number of tables to solve the task
+- Simplicity first: prefer a single pipeline even for multi-schema setups - use fully qualified names (`catalog.schema.table`)
+- When asked to "create a silver table" or "create a gold table", create **ONE table** - not a multi-layer pipeline
+- Don't add intermediate tables, staging tables, or helper views unless explicitly requested
+- A silver transformation = 1 streaming table reading from bronze
+- A gold aggregation = 1 materialized view reading from silver
+- Create bronze→silver→gold chains when the user asks for a "pipeline", a "medallion architecture", or full/detailed ingestion. Otherwise keep it simple - don't over-engineer.
+
+### Language Selection
+- **MUST** know the language (Python or SQL). For simple tasks (a single table or a straightforward pipeline), pick SQL. For complex pipelines with parameterized configuration, or if the user mentions Python-related items, pick Python. If in doubt, ask the user. Stick with that language unless told otherwise.
+
+| User Says | Action |
+|-----------|--------|
+| "Python pipeline", "Python SDP", "use Python", "udf", "pandas", "ml inference", "pyspark" | **User wants Python** |
+| "SQL pipeline", "SQL files", "use SQL" | **User wants SQL** |
+| "Create a simple pipeline", "create a table", "an aggregation" | **Pick SQL as it's simple** |
+
+### Other Rules
+- **MUST** create serverless pipelines by default. Only use classic clusters if the user explicitly requires R language, Spark RDD APIs, or JAR libraries.
+- **MUST** choose the right workflow based on context (see below).
+- When the user provides a table schema and asks for code, respond directly with the code. Don't ask clarifying questions if the request is clear.
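+
+Taken together, the rules above mean a "create a silver table" request yields exactly one `CREATE OR REFRESH` statement - a minimal sketch with hypothetical table names:
+
+```sql
+-- One silver streaming table reading from bronze; no helper views or staging tables
+CREATE OR REFRESH STREAMING TABLE silver_orders
+AS SELECT order_id, customer_id, CAST(amount AS DECIMAL(10,2)) AS amount
+FROM STREAM(bronze_orders)
+WHERE order_id IS NOT NULL;
+```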
+
+## Tools
+- List files in volume: `databricks fs ls dbfs:/Volumes/{catalog}/{schema}/{volume}/{path} --profile {PROFILE}`
+- Query data: `databricks experimental aitools tools query --profile {PROFILE} --warehouse abc123 "SELECT 1 FROM catalog.schema.table"`
+- Discover schema: `databricks experimental aitools tools discover-schema --profile {PROFILE} catalog.schema.table1 catalog.schema.table2`
+- Pipelines CLI: `databricks pipelines init|deploy|run|logs|stop` or use `databricks pipelines --help` for more options
+
+## Choose Your Workflow
+
+**First, determine which workflow to use:**
+
+### Option A: Standalone New Pipeline Project (use `databricks pipelines init`)
+
+Use this when the user wants to **create a new, standalone SDP project** that will have its own DAB:
+- User asks: "Create a new pipeline", "Build me an SDP", "Set up a new data pipeline"
+- No existing `databricks.yml` in the workspace
+- The pipeline IS the project (not part of a larger demo/app)
+
+Use the `databricks pipelines` CLI:
+```bash
+databricks pipelines init --output-dir . --config-file init-config.json
+```
+
+**Example init-config.json:**
+```json
+{
+  "project_name": "customer_pipeline",
+  "initial_catalog": "prod_catalog",
+  "use_personal_schema": "no",
+  "initial_language": "sql"
+}
+```
+
+→ See [1-project-initialization.md](references/1-project-initialization.md)
+
+### Option B: Pipeline within Existing Bundle (edit the bundle)
+
+Use this when the pipeline is **part of an existing DAB project**:
+- There's already a `databricks.yml` file in the project
+- User is adding a pipeline to an existing app/demo
+
+→ See [1-project-initialization.md](references/1-project-initialization.md) for adding pipelines to existing bundles
+
+### Option C: Rapid Iteration with MCP Tools (no bundle management)
+
+Use this when you need to **quickly create, test, and iterate** on a pipeline without managing bundle files:
+- User wants to "just run a pipeline and see if it works"
+- Part of a larger demo where the bundle is managed separately, or the DAB will be created at the end because you want to test the project quickly first
+- Prototyping or experimenting with pipeline logic
+- User explicitly asks to use MCP tools
+
+→ See [2-mcp-approach.md](references/2-mcp-approach.md) for the MCP-based workflow
+
+---
+
+## Required Checklist
+
+Before writing pipeline code, make sure you have:
+```
+- [ ] Language selected: Python or SQL
+- [ ] Read the syntax basics: **SQL**: Always Read [sql/1-syntax-basics.md](references/sql/1-syntax-basics.md), **Python**: Always Read [python/1-syntax-basics.md](references/python/1-syntax-basics.md)
+- [ ] Workflow chosen: Standalone DAB / Existing DAB / MCP iteration
+- [ ] Compute type: serverless (default) or classic
+- [ ] Schema strategy: single schema with prefixes vs. multi-schema
+- [ ] Consider [Multi-Schema Patterns](#multi-schema-patterns) and [Modern Defaults](#modern-defaults)
+```
+
+**Then read additional guides based on what the pipeline needs, when you need them:**
+| If the pipeline needs... 
| Read | +|--------------------------|------| +| File ingestion (Auto Loader, JSON, CSV, Parquet) | `references/sql/2-ingestion.md` or `references/python/2-ingestion.md` | +| Kafka, Event Hub, or Kinesis streaming | `references/sql/2-ingestion.md` or `references/python/2-ingestion.md` | +| Deduplication, windowed aggregations, joins | `references/sql/3-streaming-patterns.md` or `references/python/3-streaming-patterns.md` | +| CDC, SCD Type 1/2, or history tracking | `references/sql/4-cdc-patterns.md` or `references/python/4-cdc-patterns.md` | +| Performance tuning, Liquid Clustering | `references/sql/5-performance.md` or `references/python/5-performance.md` | + +--- + +## Quick Reference + +| Concept | Details | +|---------|---------| +| **Names** | SDP = Spark Declarative Pipelines = LDP = Lakeflow Declarative Pipelines (all interchangeable) | +| **SQL Syntax** | `CREATE OR REFRESH STREAMING TABLE`, `CREATE OR REFRESH MATERIALIZED VIEW` | +| **Python Import** | `from pyspark import pipelines as dp` | +| **Primary Decorators** | `@dp.table()`, `@dp.materialized_view()`, `@dp.temporary_view()` | + +### Legacy APIs (Do NOT Use) + +| Legacy | Modern Replacement | +|--------|-------------------| +| `import dlt` | `from pyspark import pipelines as dp` | +| `dlt.apply_changes()` | `dp.create_auto_cdc_flow()` | +| `dlt.read()` / `dlt.read_stream()` | `spark.read` / `spark.readStream` | +| `CREATE LIVE XXX` | `CREATE OR REFRESH STREAMING TABLE\|MATERIALIZED VIEW` | +| `PARTITION BY` + `ZORDER` | `CLUSTER BY` (Liquid Clustering) | +| `input_file_name()` | `_metadata.file_path` | +| `target` parameter | `schema` parameter | + +### Streaming Table vs Materialized View + +| Use Case | Type | Pattern | +|----------|------|---------| +| Windowed aggregations (tumbling, sliding, session) | Streaming Table | `FROM stream(source)` + `GROUP BY window()` | +| Full-table aggregations (totals, daily counts) | Materialized View | `FROM source` (no stream wrapper) | +| CDC / SCD Type 2 | Streaming Table | `AUTO CDC INTO` or `dp.create_auto_cdc_flow()` | + +Use streaming tables for windowed aggregations to enable incremental processing. Use materialized views for simple aggregations that recompute fully on each refresh. 
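+
+A minimal sketch of the two shapes (source names are hypothetical; see [sql/3-streaming-patterns.md](references/sql/3-streaming-patterns.md) for the full patterns):
+
+```sql
+-- Windowed aggregation → streaming table, processed incrementally from STREAM()
+CREATE OR REFRESH STREAMING TABLE events_per_10min
+AS SELECT window(event_time, '10 minutes') AS win, count(*) AS events
+FROM STREAM(events) WATERMARK event_time DELAY OF INTERVAL 10 MINUTES
+GROUP BY window(event_time, '10 minutes');
+
+-- Full-table aggregation → materialized view, recomputed on each refresh
+CREATE OR REFRESH MATERIALIZED VIEW events_daily
+AS SELECT date(event_time) AS day, count(*) AS events
+FROM events
+GROUP BY date(event_time);
+```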
+ +--- + +## Task-Based Routing + +After choosing your workflow (see [Choose Your Workflow](#choose-your-workflow)), determine the specific task: + +**Choose documentation by language:** + +### SQL Documentation +| Task | Guide | +|------|-------| +| **SQL syntax basics** | [sql/1-syntax-basics.md](references/sql/1-syntax-basics.md) | +| **Data ingestion (Auto Loader, Kafka)** | [sql/2-ingestion.md](references/sql/2-ingestion.md) | +| **Streaming patterns (deduplication, windows)** | [sql/3-streaming-patterns.md](references/sql/3-streaming-patterns.md) | +| **CDC patterns (AUTO CDC, SCD, queries)** | [sql/4-cdc-patterns.md](references/sql/4-cdc-patterns.md) | +| **Performance tuning** | [sql/5-performance.md](references/sql/5-performance.md) | + +### Python Documentation +| Task | Guide | +|------|-------| +| **Python syntax basics** | [python/1-syntax-basics.md](references/python/1-syntax-basics.md) | +| **Data ingestion (Auto Loader, Kafka)** | [python/2-ingestion.md](references/python/2-ingestion.md) | +| **Streaming patterns (deduplication, windows)** | [python/3-streaming-patterns.md](references/python/3-streaming-patterns.md) | +| **CDC patterns (AUTO CDC, SCD, queries)** | [python/4-cdc-patterns.md](references/python/4-cdc-patterns.md) | +| **Performance tuning** | [python/5-performance.md](references/python/5-performance.md) | + +### General Documentation +| Task | Guide | +|------|-------| +| **Setting up standalone pipeline project** | [1-project-initialization.md](references/1-project-initialization.md) | +| **Rapid iteration with MCP tools** | [2-mcp-approach.md](references/2-mcp-approach.md) | +| **Advanced configuration** | [3-advanced-configuration.md](references/3-advanced-configuration.md) | +| **Migrating from DLT** | [4-dlt-migration.md](references/4-dlt-migration.md) | + +--- + +## Official Documentation + +- **[Lakeflow Spark Declarative Pipelines Overview](https://docs.databricks.com/aws/en/ldp/)** - Main documentation hub +- **[SQL Language Reference](https://docs.databricks.com/aws/en/ldp/developer/sql-dev)** - SQL syntax for streaming tables and materialized views +- **[Python Language Reference](https://docs.databricks.com/aws/en/ldp/developer/python-ref)** - `pyspark.pipelines` API +- **[Loading Data](https://docs.databricks.com/aws/en/ldp/load)** - Auto Loader, Kafka, Kinesis ingestion +- **[Change Data Capture (CDC)](https://docs.databricks.com/aws/en/ldp/cdc)** - AUTO CDC, SCD Type 1/2 + + +### Medallion Architecture + +| Layer | SDP Pattern | Common Practices | +|-------|-------------|------------------| +| **Bronze** | `STREAM read_files()` → streaming table | Often adds `_metadata.file_path`, `_ingested_at`. Minimal transforms, append-only. | +| **Silver** | `stream(bronze)` → streaming table | Clean/validate, type casting, quality filters. Prefer `DECIMAL(p,s)` for money. Dedup can happen here or gold. | +| **Gold** | `AUTO CDC INTO` or materialized view | Aggregated, denormalized. SCD/dedup often via `AUTO CDC`. Star schema typically uses `dim_*`/`fact_*`. | + +#### Gold Layer: Preserve Key Dimensions + +When aggregating data in gold tables, **keep the main business dimensions** to enable flexible analysis. Over-aggregating loses information that analysts may need later. + +**Guidance based on context:** +- **If a dashboard is mentioned**: Include all dimensions that appear as filters. Dashboard filters only work if the underlying data has those columns. 
+- **If analysis by dimension is mentioned** (e.g., "analyze by store", "breakdown by department"): Include those dimensions in the aggregation.
+- **If no specific instructions**: Default to keeping key business dimensions (location, department, product line, customer segment, time period) rather than aggregating them away. This preserves flexibility for future analysis.
+
+**Rule of thumb**: If users might want to slice the data by a dimension, include it in the gold table. It's easier to aggregate further in queries than to recover lost dimensions.
+
+**For medallion architecture** (bronze/silver/gold), two approaches work:
+- **Flat with naming** (template default): `bronze_*.sql`, `silver_*.sql`, `gold_*.sql`
+- **Subdirectories**: `bronze/orders.sql`, `silver/cleaned.sql`, `gold/summary.sql`
+
+Both work with the `transformations/**` glob pattern. Choose based on preference or the existing layout.
+
+See **[1-project-initialization.md](references/1-project-initialization.md)** for complete details on bundle initialization, migration, and troubleshooting.
+
+---
+## General SDP development guidance
+
+**SQL Example:**
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_orders
+CLUSTER BY (order_date)
+AS SELECT *, current_timestamp() AS _ingested_at
+FROM STREAM read_files('/Volumes/catalog/schema/raw/orders/', format => 'json');
+```
+
+**Python Example:**
+```python
+from pyspark import pipelines as dp
+
+@dp.table(name="bronze_events", cluster_by=["event_date"])
+def bronze_events():
+    return spark.readStream.format("cloudFiles").option("cloudFiles.format", "json").load("/Volumes/...")
+```
+
+For detailed syntax, see [sql/1-syntax-basics.md](references/sql/1-syntax-basics.md) or [python/1-syntax-basics.md](references/python/1-syntax-basics.md).
+
+## Best Practices (2026)
+
+### Project Structure
+- **Standalone pipeline projects**: Use `databricks pipelines init` for an Asset Bundle with multi-environment support
+- **Pipeline in existing bundle**: Add to `resources/*.pipeline.yml`
+- **Rapid iteration/prototyping**: Use MCP tools, formalize in a bundle later
+- See **[1-project-initialization.md](references/1-project-initialization.md)** for project setup details
+
+### Minimal pipeline config pointers
+- Define parameters in your pipeline's configuration and access them in code with `spark.conf.get("key")`.
+- In Databricks Asset Bundles, set these under `resources.pipelines.<pipeline-name>.configuration`; validate with `databricks bundle validate`.
+
+### Modern Defaults
+- **Always use raw `.sql`/`.py` files for transformation files** - NO notebooks in your pipeline. Pipeline code must be plain files.
+- **Databricks notebook source for explorations** - Use `# Databricks notebook source` format with `# COMMAND ----------` separators for ad-hoc queries. See [examples/exploration_notebook.py](scripts/exploration_notebook.py).
+- **Serverless compute** - Do not use classic clusters unless explicitly required (R, RDD APIs, JAR libraries)
+- **Unity Catalog** (required for serverless)
+- **CLUSTER BY** (Liquid Clustering), not PARTITION BY with ZORDER - see [sql/5-performance.md](references/sql/5-performance.md) or [python/5-performance.md](references/python/5-performance.md)
+- **read_files()** for SQL cloud storage ingestion - always consume a folder, not a single file - see [sql/2-ingestion.md](references/sql/2-ingestion.md)
+
+### Multi-Schema Patterns
+
+**Preferred: One pipeline writing to multiple schemas** using fully qualified table names (`catalog.schema.table`).
This keeps dependencies clear and is simpler to manage than multiple pipelines. + +- **Python**: `@dp.table(name="catalog.bronze_schema.orders")` +- **SQL**: `CREATE OR REFRESH STREAMING TABLE catalog.silver_schema.orders_clean AS ...` + +For detailed examples, see **[3-advanced-configuration.md](references/3-advanced-configuration.md#multi-schema-patterns)**. + +**Fallback**: If all tables must be in the same schema, use name prefixes (`bronze_*`, `silver_*`, `gold_*`). + +--- + +## Post-Run Validation (Required) + +After running a pipeline (via DAB or MCP), you **MUST** validate both the execution status AND the actual data. + +### Step 1: Check Pipeline Execution Status + +**From MCP (`manage_pipeline(action="run")` or `manage_pipeline(action="create_or_update")`):** +- Check `result["success"]` and `result["state"]` +- If failed, check `result["message"]` and `result["errors"]` for details + +**From DAB (`databricks bundle run`):** +- Check the command output for success/failure +- Use `manage_pipeline(action="get", pipeline_id=...)` to get detailed status and recent events + +### Step 2: Validate Output Data + +Even if the pipeline reports SUCCESS, you **MUST** verify the data is correct: + +``` +# MCP Tool: get_table_stats_and_schema - validates schema, row counts, and stats +get_table_stats_and_schema( + catalog="my_catalog", + schema="my_schema", + table_names=["bronze_*", "silver_*", "gold_*"] # Use glob patterns +) +``` + +**Check for:** +- Empty tables (row_count = 0) - indicates ingestion or filtering issues +- Unexpected row counts - joins may have exploded or filtered too much +- Missing columns - schema mismatch or transformation errors +- NULL values in key columns - data quality issues + +### Step 3: Debug Data Issues + +If validation reveals problems, trace upstream to find the root cause: + +1. **Start from the problematic table** - identify what's wrong (empty, wrong counts, bad data) +2. **Check its source table** - use `get_table_stats_and_schema` on the upstream table +3. **Trace back to bronze** - continue until you find where the issue originates +4. **Common causes:** + - Bronze empty → source files missing or path incorrect + - Silver empty → filter too aggressive or join condition wrong + - Gold wrong counts → aggregation logic error or duplicate keys + - Data mismatch → type casting issues or NULL handling + +5. **Fix the SQL/Python code**, re-upload, and re-run the pipeline + +**Do NOT use `execute_sql` with COUNT queries for validation** - `get_table_stats_and_schema` is faster and returns more information in a single call. + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| **Empty output tables** | Use `get_table_stats_and_schema` to check upstream sources. Verify source files exist and paths are correct. | +| **Pipeline stuck INITIALIZING** | Normal for serverless, wait a few minutes | +| **"Column not found"** | Check `schemaHints` match actual data | +| **Streaming reads fail** | For file ingestion in a streaming table, you must use the `STREAM` keyword with `read_files`: `FROM STREAM read_files(...)`. For table streams use `FROM stream(table)`. See [read_files — Usage in streaming tables](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#usage-in-streaming-tables). 
| +| **Timeout during run** | Increase `timeout`, or use `wait_for_completion=False` and check status with `manage_pipeline(action="get")` | +| **MV doesn't refresh** | Enable row tracking on source tables | +| **SCD2: query column not found** | Lakeflow uses `__START_AT` and `__END_AT` (double underscore), not `START_AT`/`END_AT`. Use `WHERE __END_AT IS NULL` for current rows. See [sql/4-cdc-patterns.md](references/sql/4-cdc-patterns.md). | +| **AUTO CDC parse error at APPLY/SEQUENCE** | Put `APPLY AS DELETE WHEN` **before** `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source (omit `_rescued_data` unless bronze uses rescue data). Omit `TRACK HISTORY ON *` if it causes "end of input" errors; default is equivalent. See [sql/4-cdc-patterns.md](references/sql/4-cdc-patterns.md). | +| **"Cannot create streaming table from batch query"** | In a streaming table query, use `FROM STREAM read_files(...)` so `read_files` leverages Auto Loader; `FROM read_files(...)` alone is batch. See [sql/2-ingestion.md](references/sql/2-ingestion.md) and [read_files — Usage in streaming tables](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#usage-in-streaming-tables). | + +**For detailed errors**, the `result["message"]` from `manage_pipeline(action="create_or_update")` includes suggested next steps. Use `manage_pipeline(action="get", pipeline_id=...)` which includes recent events and error details. + +--- + +## Advanced Pipeline Configuration + +For advanced configuration options (development mode, continuous pipelines, custom clusters, notifications, Python dependencies, etc.), see **[3-advanced-configuration.md](references/3-advanced-configuration.md)**. + +--- + +## Platform Constraints + +### Serverless Pipeline Requirements (Default) +| Requirement | Details | +|-------------|---------| +| **Unity Catalog** | Required - serverless pipelines always use UC | +| **Workspace Region** | Must be in serverless-enabled region | +| **Serverless Terms** | Must accept serverless terms of use | +| **CDC Features** | Requires serverless (or Pro/Advanced with classic clusters) | + +### Serverless Limitations (When Classic Clusters Required) +| Limitation | Workaround | +|------------|-----------| +| **R language** | Not supported - use classic clusters if required | +| **Spark RDD APIs** | Not supported - use classic clusters if required | +| **JAR libraries** | Not supported - use classic clusters if required | +| **Maven coordinates** | Not supported - use classic clusters if required | +| **DBFS root access** | Limited - must use Unity Catalog external locations | +| **Global temp views** | Not supported | + +### General Constraints +| Constraint | Details | +|------------|---------| +| **Schema Evolution** | Streaming tables require full refresh for incompatible changes | +| **SQL Limitations** | PIVOT clause unsupported | +| **Sinks** | Python only, streaming only, append flows only | + +**Default to serverless** unless user explicitly requires R, RDD APIs, or JAR libraries. 
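+
+When one of these requirements does force classic compute, the settings take roughly the following shape - a sketch with placeholder values; see [3-advanced-configuration.md](references/3-advanced-configuration.md) for complete examples:
+
+```json
+{
+  "serverless": false,
+  "clusters": [
+    {"label": "default", "node_type_id": "i3.xlarge", "num_workers": 2}
+  ]
+}
+```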
+
+## Related Skills
+
+- **[databricks-jobs](../databricks-jobs/SKILL.md)** - for orchestrating and scheduling pipeline runs
+- **[databricks-bundles](../databricks-bundles/SKILL.md)** - for multi-environment deployment of pipeline projects
+- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** - for generating test data to feed into pipelines
+- **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - for catalog/schema/volume management and governance
diff --git a/.claude/skills/databricks-spark-declarative-pipelines/8-project-initialization.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/1-project-initialization.md
similarity index 59%
rename from .claude/skills/databricks-spark-declarative-pipelines/8-project-initialization.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/1-project-initialization.md
index 0f272db..fbab69b 100644
--- a/.claude/skills/databricks-spark-declarative-pipelines/8-project-initialization.md
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/1-project-initialization.md
@@ -1,19 +1,16 @@
-# Project Initialization with databricks pipelines init
+# Project Initialization
-## Overview
+Two approaches for creating SDP pipelines with Databricks Asset Bundles (DABs):
+- **Option A**: Standalone new project using `databricks pipelines init`
+- **Option B**: Adding a pipeline to an existing bundle
-The `databricks pipelines init` command scaffolds a complete Databricks Asset Bundle project for Lakeflow Spark Declarative Pipelines, providing a production-ready structure with multi-environment support, pipeline configuration, and sample transformation files.
+---
-**Benefits of Asset Bundles:**
-- Multi-environment deployments (dev/staging/prod)
-- Infrastructure as code with `databricks.yml`
-- Built-in CI/CD integration
-- Version control for pipeline configuration
-- Automated deployment workflows
+## Option A: Standalone New Pipeline Project
----
+Use `databricks pipelines init` to scaffold a complete DAB project with multi-environment support, pipeline configuration, and sample transformation files.
-## Command Reference
+### Command Reference
### Interactive Mode
@@ -241,185 +238,84 @@ databricks pipelines start-update --pipeline-id
---
-## Language Detection (for Claude)
-
-When a user requests a new Lakeflow pipeline, Claude should detect the appropriate language from keywords in the prompt.
-
-### CRITICAL: Explicit Language Requests
-
-**If the user explicitly mentions a language, use it without asking:**
-
-| User Says | Action |
|-----------|--------|
-| "Python pipeline", "Python SDP", "use Python" | **Use Python immediately** |
-| "SQL pipeline", "SQL files", "use SQL" | **Use SQL immediately** |
-| "Python Spark Declarative Pipeline" | **Use Python immediately** |
-**DO NOT ask for clarification when the user explicitly states a language.** This is the most common mistake - ignoring an explicit language request.
+
+For bronze/silver/gold organization, two file structure approaches work with Databricks Asset Bundles (DABs):
-### SQL Indicators (Default Choice When Ambiguous)
+### Option 1: Flat Structure with Prefixes (Recommended)
-**Keywords:**
-- "sql files", ".sql"
-- "simple", "basic", "straightforward"
-- "aggregations", "joins", "transformations"
-- "materialized view", "CREATE OR REFRESH"
-- "SELECT", "GROUP BY", "WHERE"
+```
+transformations/
+├── bronze_orders.sql
+├── bronze_events.sql
+├── silver_orders.sql
+├── silver_events.sql
+├── gold_daily_metrics.sql
+└── gold_summary.sql
```
-**Context:**
-- User mentions only data transformations without complex logic
-- Request focuses on filtering, joining, aggregating data
-- No mention of custom functions or external integrations
-- **No explicit mention of "Python"**
+### Option 2: Subdirectories by Layer
-**Default Behavior**: Prefer SQL only when ambiguous AND no Python indicators present
+```
+transformations/
+├── bronze/
+│ └── orders.sql
+├── silver/
+│ └── orders.sql
+└── gold/
+ └── daily_metrics.sql
+```
-### Python Indicators
+Both work with the `transformations/**` glob pattern. Choose based on team preference.
-**Keywords:**
-- "Python", "python files", ".py", "@dp.table"
-- "UDF", "user-defined function", "custom function"
-- "complex logic", "complex transformations"
-- "ML", "machine learning", "inference", "model"
-- "API", "external API", "REST", "HTTP"
-- "pandas", "numpy", "pyspark"
-- "decorator", "pyspark.pipelines"
+For syntax examples, see:
+- **[sql/1-syntax-basics.md](sql/1-syntax-basics.md)** - SQL table definitions
+- **[python/1-syntax-basics.md](python/1-syntax-basics.md)** - Python decorators
+- **[sql/2-ingestion.md](sql/2-ingestion.md)** - Bronze layer ingestion patterns
-**Context:**
-- User needs custom data processing beyond SQL capabilities
-- Request mentions integrating with external services
-- Task requires ML model inference or scoring
-- Dynamic schema or path generation needed
+---
-### Ambiguous Cases (Ask User)
+## Option B: Adding a Pipeline to an Existing Bundle
-**Only ask when ALL conditions are met:**
-- User did NOT explicitly mention "Python" or "SQL"
-- Mixed signals present (some SQL keywords, some Python keywords)
-- OR no clear indicators either way
+If you already have a `databricks.yml` for a larger project (e.g., an app with jobs, dashboards, etc.) and want to add a pipeline:
-**Response:**
-```
-I can create this pipeline using either SQL or Python:
+### Step 1: Create Pipeline Resource File
-- **SQL**: Best for transformations, aggregations, joins (simpler, faster to develop)
-- **Python**: Best for custom logic, UDFs, ML inference, external APIs
+Create `resources/my_pipeline.pipeline.yml`:
-Which would you prefer?
+```yaml
+resources:
+ pipelines:
+ my_pipeline:
+ name: my_pipeline
+ catalog: ${var.catalog}
+ schema: ${var.schema}
+ serverless: true
+ libraries:
+ - file:
+ path: ../src/pipelines/my_pipeline/
```
----
-
-## Medallion Architecture
-
-For bronze/silver/gold organization, Asset Bundles support two approaches. Both work with the `transformations/**` glob pattern in pipeline configuration.
+### Step 2: Add Pipeline Source Files -### Option 1: Flat Structure with Naming (Template Default, SQL Example) +Create your pipeline transformation files: ``` -transformations/ -├── bronze_raw_orders.sql # Raw data ingestion -├── bronze_raw_events.sql -├── bronze_raw_customers.sql -├── silver_cleaned_orders.sql # Cleaned and validated -├── silver_joined_data.sql -├── silver_customer_profiles.sql -├── gold_daily_metrics.sql # Business aggregations -├── gold_customer_summary.sql -└── gold_revenue_analysis.sql +src/pipelines/my_pipeline/ +├── bronze_ingest.sql +├── silver_clean.sql +└── gold_summary.sql ``` -**Advantages:** -- Matches the official `databricks pipelines init` template structure -- All files visible at one level -- Simple file listing and discovery -- Clear naming provides logical organization - -### Option 2: Subdirectories by Layer, SQL Example +### Step 3: Deploy -``` -transformations/ -├── bronze/ -│ ├── raw_orders.sql -│ ├── raw_events.sql -│ └── raw_customers.sql -├── silver/ -│ ├── cleaned_orders.sql -│ ├── joined_data.sql -│ └── customer_profiles.sql -└── gold/ - ├── daily_metrics.sql - ├── customer_summary.sql - └── revenue_analysis.sql -``` - -**Advantages:** -- Physical separation of layers -- Familiar structure for teams using manual workflow -- Easier to navigate large projects with many files -- Works with `transformations/**` glob pattern - -**Both approaches are technically valid** - the `**` in the glob pattern matches files recursively. Choose based on team preference and project size. - -### Example Bronze Layer (SQL) - -```sql --- File: bronze_raw_orders.sql -CREATE OR REFRESH STREAMING TABLE bronze_raw_orders -CLUSTER BY (order_date) -COMMENT "Raw order data ingested from cloud storage" -AS -SELECT - *, - current_timestamp() AS _ingested_at, - _metadata.file_path AS _source_file -FROM read_files( - '/Volumes/main/raw_data/orders/', - format => 'json', - schemaHints => 'order_id STRING, customer_id STRING, amount DECIMAL(10,2), order_date DATE' -); -``` - -### Example Silver Layer (SQL) - -```sql --- File: silver_cleaned_orders.sql -CREATE OR REFRESH MATERIALIZED VIEW silver_cleaned_orders -CLUSTER BY (order_date) -COMMENT "Cleaned and validated orders with customer enrichment" -AS -SELECT - o.order_id, - o.customer_id, - o.amount, - o.order_date, - c.customer_name, - c.customer_segment -FROM LIVE.bronze_raw_orders o -INNER JOIN LIVE.bronze_raw_customers c - ON o.customer_id = c.customer_id -WHERE o.amount > 0 -- Remove invalid orders - AND o.order_date >= '2020-01-01'; -``` - -### Example Gold Layer (SQL) - -```sql --- File: gold_daily_metrics.sql -CREATE OR REFRESH MATERIALIZED VIEW gold_daily_metrics -CLUSTER BY (metric_date) -COMMENT "Daily business metrics for reporting" -AS -SELECT - order_date AS metric_date, - COUNT(DISTINCT customer_id) AS unique_customers, - COUNT(*) AS total_orders, - SUM(amount) AS total_revenue, - AVG(amount) AS avg_order_value -FROM LIVE.silver_cleaned_orders -GROUP BY order_date; +```bash +databricks bundle deploy +databricks bundle run my_pipeline ``` +That's it - the pipeline is now part of your existing bundle and shares the same targets/variables. + --- ## Migration from Manual Structure @@ -609,7 +505,7 @@ For advanced pipeline configuration options beyond the bundle initialization: - **Custom notifications**: Email or webhook alerts - **Non-serverless clusters**: When serverless limitations apply -See [7-advanced-configuration.md](7-advanced-configuration.md) for detailed examples. 
+See [3-advanced-configuration.md](3-advanced-configuration.md) for detailed examples. --- @@ -668,45 +564,22 @@ resources: ## Best Practices -### Project Organization - -1. **Use descriptive file names**: `bronze_orders_raw.sql` not just `orders.sql` -2. **Choose structure approach**: - - **Flat with prefixes**: `bronze_*`, `silver_*`, `gold_*` (template default) - - **Subdirectories**: `bronze/`, `silver/`, `gold/` folders (also valid) - - Both work with `transformations/**` glob pattern -3. **One table per file**: Each file defines a single table or view -4. **Be consistent**: Pick one approach and use it throughout the project - -### Configuration Management - -1. **Use variables**: Parameterize catalog and schema names -2. **Separate environments**: Define dev/staging/prod targets -3. **Version control**: Track `databricks.yml` and pipeline configs in git -4. **Sensitive data**: Use secrets, not hardcoded values - -### Development Workflow - -1. **Start with dev**: Always test in development environment first -2. **Validate locally**: Run `databricks bundle validate` before deploy -3. **Incremental changes**: Deploy and test small changes frequently -4. **Use explorations**: Ad-hoc notebooks for data exploration - -### Deployment Strategy +1. **One table per file** - Each `.sql` or `.py` file defines a single table/view +2. **Use variables** - Parameterize catalog and schema names for environment portability +3. **Sensitive data** - Use secrets (`{{secrets/scope/key}}`), not hardcoded values +4. **Test in dev first** - Run `databricks bundle validate` before deploy +5. **Version control** - Track `databricks.yml` and pipeline configs in git -1. **CI/CD integration**: Automate deployments with GitHub Actions, GitLab CI -2. **Approval gates**: Require approval for production deployments -3. **Rollback plan**: Keep previous bundle versions for quick rollback -4. **Monitor pipelines**: Set up notifications for failures +For technical best practices (Liquid Clustering, serverless, etc.), see **[SKILL.md](SKILL.md#best-practices-2026)**. 
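+
+A minimal sketch of points 2-3 in `databricks.yml` (names and the secret scope are illustrative):
+
+```yaml
+variables:
+  catalog:
+    default: dev_catalog
+
+resources:
+  pipelines:
+    my_pipeline:
+      catalog: ${var.catalog}
+      configuration:
+        api_token: "{{secrets/my_scope/api_token}}"
+```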
--- ## References -- **[SKILL.md](SKILL.md)** - Main development workflow and MCP tools -- **[Databricks Asset Bundles Documentation](https://docs.databricks.com/dev-tools/bundles/)** - Official bundle reference +- **[SKILL.md](../SKILL.md)** - Main development workflow and MCP tools +- **[Declarative Automation Bundles (DABs) Documentation](https://docs.databricks.com/dev-tools/bundles/)** - Official bundle reference - **[Pipeline Configuration Reference](https://docs.databricks.com/aws/en/ldp/configure-pipeline)** - Pipeline settings - **[Databricks CLI Reference](https://docs.databricks.com/dev-tools/cli/)** - CLI commands and options -- **[1-ingestion-patterns.md](1-ingestion-patterns.md)** - Data ingestion patterns -- **[2-streaming-patterns.md](2-streaming-patterns.md)** - Streaming transformations -- **[7-advanced-configuration.md](7-advanced-configuration.md)** - Advanced pipeline settings +- **[sql/2-ingestion.md](sql/2-ingestion.md)** or **[python/2-ingestion.md](python/2-ingestion.md)** - Data ingestion patterns +- **[sql/3-streaming-patterns.md](sql/3-streaming-patterns.md)** or **[python/3-streaming-patterns.md](python/3-streaming-patterns.md)** - Streaming transformations +- **[3-advanced-configuration.md](3-advanced-configuration.md)** - Advanced pipeline settings diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/2-mcp-approach.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/2-mcp-approach.md new file mode 100644 index 0000000..87e0ed7 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/2-mcp-approach.md @@ -0,0 +1,163 @@ +Use MCP tools to create, run, and iterate on **SDP pipelines**. The **primary tool is `manage_pipeline`** which handles the entire lifecycle. + +**IMPORTANT: Default to serverless pipelines.** Only use classic clusters if user explicitly requires R language, Spark RDD APIs, or JAR libraries. + +### Step 1: Write Pipeline Files Locally + +Create `.sql` or `.py` files in a local folder. 
For syntax examples, see: +- [sql/1-syntax-basics.md](sql/1-syntax-basics.md) for SQL syntax +- [python/1-syntax-basics.md](python/1-syntax-basics.md) for Python syntax + +### Step 2: Upload to Databricks Workspace + +``` +# MCP Tool: manage_workspace_files +manage_workspace_files( + action="upload", + local_path="/path/to/my_pipeline", + workspace_path="/Workspace/Users/user@example.com/my_pipeline" +) +``` + +### Step 3: Create/Update and Run Pipeline + +Use **`manage_pipeline`** with `action="create_or_update"` to manage the resource: + +``` +# MCP Tool: manage_pipeline +manage_pipeline( + action="create_or_update", + name="my_orders_pipeline", + root_path="/Workspace/Users/user@example.com/my_pipeline", + catalog="my_catalog", + schema="my_schema", + workspace_file_paths=[ + "/Workspace/Users/user@example.com/my_pipeline/bronze/ingest_orders.sql", + "/Workspace/Users/user@example.com/my_pipeline/silver/clean_orders.sql", + "/Workspace/Users/user@example.com/my_pipeline/gold/daily_summary.sql" + ], + start_run=True, # Automatically run after create/update + wait_for_completion=True, # Wait for run to finish + full_refresh=True # Reprocess all data +) +``` + +**Result contains actionable information:** +```json +{ + "success": true, + "pipeline_id": "abc-123", + "pipeline_name": "my_orders_pipeline", + "created": true, + "state": "COMPLETED", + "catalog": "my_catalog", + "schema": "my_schema", + "duration_seconds": 45.2, + "message": "Pipeline created and completed successfully in 45.2s. Tables written to my_catalog.my_schema", + "error_message": null, + "errors": [] +} +``` + +### Alternative: Run Pipeline Separately + +If you want to run an existing pipeline or control the run separately: + +``` +# MCP Tool: manage_pipeline_run +manage_pipeline_run( + action="start", + pipeline_id="", + full_refresh=True, + wait=True, # Wait for completion + timeout=1800 # 30 minute timeout +) +``` + +### Step 4: Validate Results + +**On Success** - Use `get_table_stats_and_schema` to verify tables (NOT manual SQL COUNT queries): +``` +# MCP Tool: get_table_stats_and_schema +get_table_stats_and_schema( + catalog="my_catalog", + schema="my_schema", + table_names=["bronze_orders", "silver_orders", "gold_daily_summary"] +) +# Returns schema, row counts, and column stats for all tables in one call +``` + +**On Failure** - Check `run_result["message"]` for suggested next steps, then get detailed errors: +``` +# MCP Tool: manage_pipeline +manage_pipeline(action="get", pipeline_id="") +# Returns pipeline details enriched with recent events and error messages + +# Or get events/logs directly: +# MCP Tool: manage_pipeline_run +manage_pipeline_run( + action="get_events", + pipeline_id="", + event_log_level="ERROR", # ERROR, WARN, or INFO + max_results=10 +) +``` + +### Step 5: Iterate Until Working + +1. Review errors from run result or `manage_pipeline(action="get")` +2. Fix issues in local files +3. Re-upload with `manage_workspace_files(action="upload")` +4. Run `manage_pipeline(action="create_or_update", start_run=True)` again (it will update, not recreate) +5. 
Repeat until `result["success"] == True` + +--- + +## Quick Reference: MCP Tools + +### manage_pipeline - Pipeline Lifecycle + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create` | Create new pipeline | name, root_path, catalog, schema, workspace_file_paths | +| `create_or_update` | **Main entry point.** Idempotent create/update, optionally run | name, root_path, catalog, schema, workspace_file_paths | +| `get` | Get pipeline details by ID | pipeline_id | +| `update` | Update pipeline config | pipeline_id + fields to change | +| `delete` | Delete a pipeline | pipeline_id | +| `find_by_name` | Find pipeline by name | name | + +**create_or_update options:** +- `start_run=True`: Automatically run after create/update +- `wait_for_completion=True`: Block until run finishes +- `full_refresh=True`: Reprocess all data (default) +- `timeout=1800`: Max wait time in seconds + +### manage_pipeline_run - Run Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `start` | Start pipeline update | pipeline_id | +| `get` | Get run status | pipeline_id, update_id | +| `stop` | Stop running pipeline | pipeline_id | +| `get_events` | Get events/logs for debugging | pipeline_id | + +**start options:** +- `wait=True`: Block until complete (default) +- `full_refresh=True`: Reprocess all data +- `validate_only=True`: Dry run without writing data +- `refresh_selection=["table1", "table2"]`: Refresh specific tables only + +**get_events options:** +- `event_log_level`: "ERROR", "WARN" (default), "INFO" +- `max_results`: Number of events (default 5) +- `update_id`: Filter to specific run + +### Supporting Tools + +| Tool | Description | +|------|-------------| +| `manage_workspace_files(action="upload")` | Upload files/folders to workspace | +| `get_table_stats_and_schema` | **Use this to validate tables** - returns schema, row counts, and stats in one call | +| `execute_sql` | Run ad-hoc SQL to inspect actual data content (not for row counts) | + +--- diff --git a/.claude/skills/databricks-spark-declarative-pipelines/7-advanced-configuration.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/3-advanced-configuration.md similarity index 76% rename from .claude/skills/databricks-spark-declarative-pipelines/7-advanced-configuration.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/3-advanced-configuration.md index a6c8ecf..b637f46 100644 --- a/.claude/skills/databricks-spark-declarative-pipelines/7-advanced-configuration.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/3-advanced-configuration.md @@ -142,7 +142,7 @@ Install pip dependencies for serverless pipelines: | Field | Type | Description | |-------|------|-------------| -| `kind` | str | `"BUNDLE"` (Databricks Asset Bundles) or `"DEFAULT"` | +| `kind` | str | `"BUNDLE"` (DABs) or `"DEFAULT"` | | `metadata_file_path` | str | Path to deployment metadata file | ### Edition Comparison @@ -159,7 +159,7 @@ Install pip dependencies for serverless pipelines: ### Development Mode Pipeline -Use `create_or_update_pipeline` tool with: +Use `manage_pipeline(action="create_or_update")` tool with: - `name`: "my_dev_pipeline" - `root_path`: "/Workspace/Users/user@example.com/my_pipeline" - `catalog`: "dev_catalog" @@ -176,7 +176,7 @@ Use `create_or_update_pipeline` tool with: ### 
Non-Serverless with Dedicated Cluster -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "serverless": false, @@ -193,7 +193,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Continuous Streaming Pipeline -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "continuous": true, @@ -205,7 +205,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Using Instance Pool -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "serverless": false, @@ -220,7 +220,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Custom Event Log Location -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "event_log": { @@ -233,7 +233,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Pipeline with Email Notifications -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "notifications": [{ @@ -245,7 +245,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Production Pipeline with Autoscaling -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "serverless": false, @@ -274,7 +274,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Run as Service Principal -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "run_as": { @@ -285,7 +285,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Continuous Pipeline with Restart Window -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "continuous": true, @@ -299,7 +299,7 @@ Use `create_or_update_pipeline` tool with `extra_settings`: ### Serverless with Python Dependencies -Use `create_or_update_pipeline` tool with `extra_settings`: +Use `manage_pipeline(action="create_or_update")` tool with `extra_settings`: ```json { "serverless": true, @@ -348,3 +348,77 @@ You can copy pipeline settings from the Databricks UI (Pipeline Settings > JSON) ``` **Note**: Explicit tool parameters (`name`, `root_path`, `catalog`, `schema`, `workspace_file_paths`) always take precedence over values in `extra_settings`. + +--- + +## Multi-Schema Patterns + +**Recommended: One pipeline writing to multiple schemas** using fully qualified table names. This is simpler than creating multiple pipelines and keeps all dependencies in one place. + +For simple cases where all tables go to the same schema, use name prefixes (`bronze_*`, `silver_*`, `gold_*`). 
+ +### Option 1: Same Catalog, Separate Schemas + +Set pipeline defaults to bronze, use parameters for silver/gold: + +```python +from pyspark import pipelines as dp +from pyspark.sql.functions import col + +# Pull variables from pipeline configuration +silver_schema = spark.conf.get("silver_schema") # e.g., "silver" +gold_schema = spark.conf.get("gold_schema") # e.g., "gold" +landing_schema = spark.conf.get("landing_schema") # e.g., "landing" + +# Bronze → uses default catalog/schema (set to bronze in pipeline settings) +@dp.table(name="orders_bronze") +def orders_bronze(): + return spark.readStream.table(f"{landing_schema}.orders_raw") + +# Silver → same catalog, schema from parameter +@dp.table(name=f"{silver_schema}.orders_clean") +def orders_clean(): + return spark.read.table("orders_bronze").filter(col("order_id").isNotNull()) + +# Gold → same catalog, schema from parameter +@dp.materialized_view(name=f"{gold_schema}.orders_by_date") +def orders_by_date(): + return (spark.read.table(f"{silver_schema}.orders_clean") + .groupBy("order_date").count()) +``` + +### Option 2: Custom Catalog/Schema Per Layer + +For cross-catalog scenarios: + +```python +from pyspark import pipelines as dp +from pyspark.sql.functions import col + +# Pull variables from pipeline configuration +silver_catalog = spark.conf.get("silver_catalog") +silver_schema = spark.conf.get("silver_schema") +gold_catalog = spark.conf.get("gold_catalog") +gold_schema = spark.conf.get("gold_schema") + +# Bronze → uses pipeline defaults +@dp.table(name="orders_bronze") +def orders_bronze(): + return spark.readStream.format("cloudFiles").load("/Volumes/...") + +# Silver → custom catalog + schema +@dp.table(name=f"{silver_catalog}.{silver_schema}.orders_clean") +def orders_clean(): + return spark.read.table("orders_bronze").filter(col("order_id").isNotNull()) + +# Gold → custom catalog + schema +@dp.materialized_view(name=f"{gold_catalog}.{gold_schema}.orders_by_date") +def orders_by_date(): + return (spark.read.table(f"{silver_catalog}.{silver_schema}.orders_clean") + .groupBy("order_date").count()) +``` + +**Key points:** +- Multipart names in `@dp.table(name=...)` let you publish to explicit catalog.schema targets +- Unqualified names use pipeline defaults +- Use fully-qualified names when crossing catalogs diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/4-dlt-migration.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/4-dlt-migration.md new file mode 100644 index 0000000..dbde0d9 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/4-dlt-migration.md @@ -0,0 +1,447 @@ +# Migration Guide: DLT to SDP + +Guide for migrating from Delta Live Tables (DLT) to Spark Declarative Pipelines (SDP). + +**Two migration paths:** +1. **DLT Python → SDP Python** (dlt → dp): Same language, new API +2. **DLT Python → SDP SQL**: Change language for simpler pipelines + +--- + +## Migration Path 1: DLT Python → SDP Python (dlt → dp) + +Use this when staying with Python but moving to the modern `pyspark.pipelines` API. 
+ +### Quick Reference + +| Aspect | Legacy (`dlt`) | Modern (`dp`) | +|--------|---------------|----------------| +| **Import** | `import dlt` | `from pyspark import pipelines as dp` | +| **Table decorator** | `@dlt.table()` | `@dp.table()` | +| **Read table** | `dlt.read("table")` | `spark.read.table("table")` | +| **Read stream** | `dlt.read_stream("table")` | `spark.readStream.table("table")` | +| **CDC/SCD** | `dlt.apply_changes()` | `dp.create_auto_cdc_flow()` | +| **Clustering** | `partition_cols=["date"]` | `cluster_by=["date", "col2"]` | + +### Step-by-Step Migration + +#### Step 1: Update Imports + +```python +# Before +import dlt + +# After +from pyspark import pipelines as dp +``` + +#### Step 2: Update Decorators + +```python +# Before +@dlt.table(name="my_table") + +# After +@dp.table(name="my_table") +``` + +#### Step 3: Update Table Reads + +```python +# Before +@dlt.table(name="silver_events") +def silver_events(): + return dlt.read("bronze_events").filter(...) + +# After +@dp.table(name="silver_events") +def silver_events(): + return spark.read.table("bronze_events").filter(...) +``` + +```python +# Before (streaming) +@dlt.table(name="silver_events") +def silver_events(): + return dlt.read_stream("bronze_events").filter(...) + +# After (streaming) +@dp.table(name="silver_events") +def silver_events(): + return spark.readStream.table("bronze_events").filter(...) +``` + +#### Step 4: Update Expectations + +```python +# Before +@dlt.table(name="silver") +@dlt.expect_or_drop("valid_id", "id IS NOT NULL") + +# After (identical syntax, just change dlt → dp) +@dp.table(name="silver") +@dp.expect_or_drop("valid_id", "id IS NOT NULL") +``` + +#### Step 5: Update CDC/SCD Operations + +```python +# Before +dlt.create_streaming_table("customers_history") +dlt.apply_changes( + target="customers_history", + source="customers_cdc", + keys=["customer_id"], + sequence_by="event_timestamp", + stored_as_scd_type="2" +) + +# After +from pyspark.sql.functions import col + +dp.create_streaming_table("customers_history") +dp.create_auto_cdc_flow( + target="customers_history", + source="customers_cdc", + keys=["customer_id"], + sequence_by=col("event_timestamp"), # Note: use col() + stored_as_scd_type=2 # Note: integer, not string +) +``` + +**Key differences:** +- `apply_changes()` → `create_auto_cdc_flow()` +- `sequence_by` takes a Column object (`col("...")`) not a string +- `stored_as_scd_type` is integer `2` for Type 2, string `"1"` for Type 1 + +#### Step 6: Update Clustering (Partitioning → Liquid Clustering) + +```python +# Before (legacy partitioning) +@dlt.table( + name="bronze_events", + partition_cols=["event_date"], + table_properties={"pipelines.autoOptimize.zOrderCols": "event_type"} +) + +# After (Liquid Clustering) +@dp.table( + name="bronze_events", + cluster_by=["event_date", "event_type"] +) +``` + +### Complete Before/After Example + +**Before (DLT):** +```python +import dlt +from pyspark.sql import functions as F + +@dlt.table(name="bronze_orders", partition_cols=["order_date"]) +def bronze_orders(): + return spark.readStream.format("cloudFiles").load("/data/orders") + +@dlt.table(name="silver_orders") +@dlt.expect_or_drop("valid_amount", "amount > 0") +def silver_orders(): + return dlt.read_stream("bronze_orders").filter(F.col("status") == "completed") + +dlt.create_streaming_table("dim_customers") +dlt.apply_changes( + target="dim_customers", + source="customers_cdc", + keys=["customer_id"], + sequence_by="updated_at", + stored_as_scd_type="2" +) +``` + +**After (SDP):** 
+```python +from pyspark import pipelines as dp +from pyspark.sql import functions as F + +@dp.table(name="bronze_orders", cluster_by=["order_date"]) +def bronze_orders(): + return spark.readStream.format("cloudFiles").load("/data/orders") + +@dp.table(name="silver_orders") +@dp.expect_or_drop("valid_amount", "amount > 0") +def silver_orders(): + return spark.readStream.table("bronze_orders").filter(F.col("status") == "completed") + +dp.create_streaming_table("dim_customers") +dp.create_auto_cdc_flow( + target="dim_customers", + source="customers_cdc", + keys=["customer_id"], + sequence_by=F.col("updated_at"), + stored_as_scd_type=2 +) +``` + +--- + +## Migration Path 2: DLT Python → SDP SQL + +Use this when simplifying pipelines by converting to SQL. + +### Decision Matrix + +| Feature/Pattern | DLT Python | SDP SQL | Recommendation | +|-----------------|------------|---------|----------------| +| Simple transformations | ✓ | ✓ | **Migrate to SQL** | +| Aggregations | ✓ | ✓ | **Migrate to SQL** | +| Filtering, WHERE clauses | ✓ | ✓ | **Migrate to SQL** | +| CASE expressions | ✓ | ✓ | **Migrate to SQL** | +| SCD Type 1/2 | ✓ | ✓ | **Migrate to SQL** (AUTO CDC) | +| Simple joins | ✓ | ✓ | **Migrate to SQL** | +| Auto Loader | ✓ | ✓ | **Migrate to SQL** (read_files) | +| Streaming sources (Kafka) | ✓ | ✓ | **Migrate to SQL** (read_kafka) | +| Complex Python UDFs | ✓ | ❌ | **Stay in Python** | +| External API calls | ✓ | ❌ | **Stay in Python** | +| Custom libraries | ✓ | ❌ | **Stay in Python** | +| ML model inference | ✓ | ❌ | **Stay in Python** | + +**Rule**: If 80%+ is SQL-expressible, migrate to SDP SQL. If heavy Python logic, stay with Python (use modern `dp` API). + +### Side-by-Side Conversions + +#### Basic Streaming Table + +**DLT Python:** +```python +@dlt.table(name="bronze_sales", comment="Raw sales") +def bronze_sales(): + return ( + spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/my_catalog/my_schema/raw/sales") + .withColumn("_ingested_at", F.current_timestamp()) + ) +``` + +**SDP SQL:** +```sql +CREATE OR REFRESH STREAMING TABLE bronze_sales +COMMENT 'Raw sales' +AS +SELECT *, current_timestamp() AS _ingested_at +FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/sales', format => 'json'); +``` + +#### Filtering and Transformations + +**DLT Python:** +```python +@dlt.table(name="silver_sales") +@dlt.expect_or_drop("valid_amount", "amount > 0") +@dlt.expect_or_drop("valid_sale_id", "sale_id IS NOT NULL") +def silver_sales(): + return ( + dlt.read_stream("bronze_sales") + .withColumn("sale_date", F.to_date("sale_date")) + .withColumn("amount", F.col("amount").cast("decimal(10,2)")) + .select("sale_id", "customer_id", "amount", "sale_date") + ) +``` + +**SDP SQL:** +```sql +CREATE OR REFRESH STREAMING TABLE silver_sales AS +SELECT + sale_id, customer_id, + CAST(amount AS DECIMAL(10,2)) AS amount, + CAST(sale_date AS DATE) AS sale_date +FROM STREAM bronze_sales +WHERE amount > 0 AND sale_id IS NOT NULL; +``` + +#### SCD Type 2 + +**DLT Python:** +```python +dlt.create_streaming_table("customers_history") + +dlt.apply_changes( + target="customers_history", + source="customers_cdc_clean", + keys=["customer_id"], + sequence_by="event_timestamp", + stored_as_scd_type="2", + track_history_column_list=["*"] +) +``` + +**SDP SQL:** +```sql +CREATE OR REFRESH STREAMING TABLE customers_history; + +CREATE FLOW customers_scd2_flow AS +AUTO CDC INTO customers_history +FROM stream(customers_cdc_clean) +KEYS (customer_id) +APPLY AS DELETE 
WHEN operation = "DELETE" +SEQUENCE BY event_timestamp +COLUMNS * EXCEPT (operation, _ingested_at, _source_file) +STORED AS SCD TYPE 2; +``` + +**Note:** In SQL, put `APPLY AS DELETE WHEN` before `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source. + +#### Joins + +**DLT Python:** +```python +@dlt.table(name="silver_sales_enriched") +def silver_sales_enriched(): + sales = dlt.read_stream("silver_sales") + products = dlt.read("dim_products") + return sales.join(products, "product_id", "left") +``` + +**SDP SQL:** +```sql +CREATE OR REFRESH STREAMING TABLE silver_sales_enriched AS +SELECT s.*, p.product_name, p.category +FROM STREAM silver_sales s +LEFT JOIN dim_products p ON s.product_id = p.product_id; +``` + +### Handling Expectations + +**DLT Python:** +```python +@dlt.expect_or_drop("valid_amount", "amount > 0") +@dlt.expect_or_fail("critical_id", "id IS NOT NULL") +``` + +**SDP SQL - Basic** (equivalent to expect_or_drop): +```sql +WHERE amount > 0 AND id IS NOT NULL +``` + +**SDP SQL - Quarantine Pattern** (for auditing dropped records): +```sql +-- Flag invalid records +CREATE OR REFRESH STREAMING TABLE bronze_data_flagged AS +SELECT *, + CASE WHEN amount <= 0 OR id IS NULL THEN TRUE ELSE FALSE END AS is_invalid +FROM STREAM bronze_data; + +-- Clean for downstream +CREATE OR REFRESH STREAMING TABLE silver_data_clean AS +SELECT * FROM STREAM bronze_data_flagged WHERE NOT is_invalid; + +-- Quarantine for investigation +CREATE OR REFRESH STREAMING TABLE silver_data_quarantine AS +SELECT * FROM STREAM bronze_data_flagged WHERE is_invalid; +``` + +### Handling UDFs + +#### Simple UDFs → SQL CASE + +**DLT Python:** +```python +@F.udf(returnType=StringType()) +def categorize_amount(amount): + if amount > 1000: return "High" + elif amount > 100: return "Medium" + else: return "Low" + +@dlt.table(name="sales_categorized") +def sales_categorized(): + return dlt.read("sales").withColumn("category", categorize_amount(F.col("amount"))) +``` + +**SDP SQL:** +```sql +CREATE OR REFRESH MATERIALIZED VIEW sales_categorized AS +SELECT *, + CASE + WHEN amount > 1000 THEN 'High' + WHEN amount > 100 THEN 'Medium' + ELSE 'Low' + END AS category +FROM sales; +``` + +#### Complex UDFs → Stay in Python + +Keep in Python if: +- Complex conditional logic +- External API calls +- Custom algorithms +- ML inference + +Use modern `dp` API instead of `dlt`. + +--- + +## Migration Process + +### Step 1: Inventory + +Document: +- Number of tables/views +- Python UDFs (simple vs complex) +- External dependencies +- Expectations and quality rules + +### Step 2: Choose Path + +- **80%+ SQL-expressible** → Migrate to SDP SQL +- **Heavy Python logic** → Migrate to SDP Python (`dp` API) +- **Mixed** → Hybrid (SQL for most, Python for complex) + +### Step 3: Migrate by Layer + +1. **Bronze** (ingestion): `cloudFiles` → `read_files()` or keep `cloudFiles` with `dp` +2. **Silver** (cleansing): `dlt.expect*` → WHERE clause or `dp.expect*` +3. **Gold** (aggregations): Usually straightforward +4. **SCD/CDC**: `apply_changes` → AUTO CDC or `create_auto_cdc_flow` + +### Step 4: Test + +- Run both pipelines in parallel +- Compare outputs for correctness +- Validate performance +- Check quality metrics + +--- + +## When NOT to Migrate + +**Stay with current approach if:** +1. Pipeline works well and team is comfortable +2. Heavy Python UDF usage (>30% of logic) +3. External API calls required +4. Custom ML model inference +5. Complex stateful operations not expressible in SQL +6. 
Limited time/resources for migration + +**Key**: DLT and SDP are both fully supported. Migrate for simplicity or new features, not necessity. + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| `sequence_by` type error | Use `col("column")` not string in `dp.create_auto_cdc_flow()` | +| UDF doesn't translate | Keep in Python or refactor with SQL built-ins | +| Expectations differ | Use quarantine pattern to audit dropped records | +| Performance degradation | Use `CLUSTER BY` for Liquid Clustering | +| Schema evolution different | Use `mode => 'PERMISSIVE'` in `read_files()` | +| AUTO CDC parse error | Put `APPLY AS DELETE WHEN` before `SEQUENCE BY` | + +--- + +## Related Documentation + +- **[python/1-syntax-basics.md](python/1-syntax-basics.md)** - Modern `dp` API reference +- **[python/4-cdc-patterns.md](python/4-cdc-patterns.md)** - Python CDC patterns +- **[sql/4-cdc-patterns.md](sql/4-cdc-patterns.md)** - SQL CDC patterns +- **[SKILL.md](../SKILL.md)** - Main skill entry point diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/1-syntax-basics.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/1-syntax-basics.md new file mode 100644 index 0000000..9d00cde --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/1-syntax-basics.md @@ -0,0 +1,321 @@ +# Python Syntax Basics + +Core Python syntax for Spark Declarative Pipelines (SDP) using the modern `pyspark.pipelines` API. + +**Import**: `from pyspark import pipelines as dp` + +--- + +## Decorators + +### `@dp.table()` + +Creates a streaming table or batch table. + +```python +from pyspark import pipelines as dp +from pyspark.sql import functions as F + +@dp.table( + name="bronze_events", # Table name (can be fully qualified: catalog.schema.table) + comment="Raw event data", # Optional description + cluster_by=["event_type", "date"], # Liquid Clustering columns (recommended) + table_properties={ # Delta table properties + "delta.autoOptimize.optimizeWrite": "true", + "delta.autoOptimize.autoCompact": "true" + }, + schema="col1 STRING, col2 INT", # Optional explicit schema + path="/path/to/external/location" # Optional external location +) +def bronze_events(): + return ( + spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/catalog/schema/raw/events/") + .withColumn("_ingested_at", F.current_timestamp()) + .withColumn("_source_file", F.col("_metadata.file_path")) + ) +``` + +**Parameters:** +| Parameter | Type | Description | +|-----------|------|-------------| +| `name` | str | Table name. Can be unqualified (`my_table`), schema-qualified (`schema.table`), or fully qualified (`catalog.schema.table`). | +| `comment` | str | Table description | +| `cluster_by` | list | Columns for Liquid Clustering. Use `["AUTO"]` for automatic selection. | +| `table_properties` | dict | Delta table properties | +| `schema` | str/StructType | Explicit schema (optional, usually inferred) | +| `path` | str | External storage location (optional) | + +**Streaming vs Batch:** +- Return `spark.readStream...` for streaming table +- Return `spark.read...` for batch table + +### `@dp.materialized_view()` + +Creates a materialized view (batch, incrementally refreshed). 
+ +```python +@dp.materialized_view( + name="gold_daily_summary", + comment="Daily aggregated metrics", + cluster_by=["report_date"] +) +def gold_daily_summary(): + return ( + spark.read.table("silver_orders") + .groupBy("report_date") + .agg(F.sum("amount").alias("total_amount")) + ) +``` + +**Parameters:** Same as `@dp.table()`. + +### `@dp.temporary_view()` + +Creates a pipeline-scoped temporary view (not persisted, exists only during pipeline execution). + +```python +@dp.temporary_view() +def orders_with_calculations(): + """Intermediate view for complex logic before AUTO CDC.""" + return ( + spark.readStream.table("bronze_orders") + .withColumn("total", F.col("quantity") * F.col("price")) + .filter(F.col("total") > 0) + ) +``` + +**Constraints:** +- Cannot specify `catalog` or `schema` (pipeline-scoped only) +- Cannot use `cluster_by` (not persisted) +- Useful for intermediate transformations before AUTO CDC + +--- + +## Expectation Decorators (Data Quality) + +```python +@dp.table(name="silver_validated") +@dp.expect("valid_id", "id IS NOT NULL") # Warn only, keep all rows +@dp.expect_or_drop("valid_amount", "amount > 0") # Drop invalid rows +@dp.expect_or_fail("critical_field", "timestamp IS NOT NULL") # Fail pipeline if violated +def silver_validated(): + return spark.read.table("bronze_events") +``` + +| Decorator | Behavior | +|-----------|----------| +| `@dp.expect(name, condition)` | Log warning, keep all rows | +| `@dp.expect_or_drop(name, condition)` | Drop rows that violate | +| `@dp.expect_or_fail(name, condition)` | Fail pipeline if any row violates | + +--- + +## Functions + +### `dp.create_streaming_table()` + +Creates an empty streaming table (typically used before `create_auto_cdc_flow`). + +```python +dp.create_streaming_table( + name="customers_history", + comment="SCD Type 2 customer dimension" +) +``` + +### `dp.create_auto_cdc_flow()` + +Creates a Change Data Capture flow for SCD Type 1 or Type 2. + +```python +from pyspark.sql.functions import col + +dp.create_streaming_table("dim_customers") + +dp.create_auto_cdc_flow( + target="dim_customers", + source="customers_cdc_clean", + keys=["customer_id"], + sequence_by=col("event_timestamp"), # Note: use col(), not string + stored_as_scd_type=2, # Integer for Type 2 + apply_as_deletes=col("operation") == "DELETE", # Optional + except_column_list=["operation", "_ingested_at"], # Columns to exclude + track_history_column_list=["price", "status"] # Type 2: only track these +) +``` + +**Parameters:** +| Parameter | Type | Description | +|-----------|------|-------------| +| `target` | str | Target table name | +| `source` | str | Source table/view name | +| `keys` | list | Primary key columns | +| `sequence_by` | Column | Column for ordering changes (**use `col()`**) | +| `stored_as_scd_type` | int/str | `2` for Type 2 (history), `"1"` for Type 1 (overwrite) | +| `apply_as_deletes` | Column | Condition identifying delete operations | +| `apply_as_truncates` | Column | Condition identifying truncate operations | +| `except_column_list` | list | Columns to exclude from target | +| `track_history_column_list` | list | Type 2 only: columns that trigger new versions | + +**Important:** `stored_as_scd_type` is integer `2` for Type 2, string `"1"` for Type 1. + +### `dp.create_auto_cdc_from_snapshot_flow()` + +Creates CDC from periodic snapshots (compares consecutive snapshots to detect changes). 
+ +```python +dp.create_streaming_table("dim_products") + +dp.create_auto_cdc_from_snapshot_flow( + target="dim_products", + source="products_snapshot", + keys=["product_id"], + stored_as_scd_type=2 +) +``` + +### `dp.append_flow()` + +Appends data from a source to a target table. + +```python +dp.create_streaming_table("events_archive") + +dp.append_flow( + target="events_archive", + source="old_events_source" +) +``` + +### `dp.create_sink()` + +Creates a custom sink for streaming data. + +```python +def write_to_kafka(batch_df, batch_id): + batch_df.write.format("kafka").option("topic", "output").save() + +dp.create_sink( + name="kafka_sink", + sink_fn=write_to_kafka +) +``` + +--- + +## Reading Data + +**Use standard Spark APIs** - SDP automatically tracks dependencies: + +```python +# Batch read (for materialized views or batch tables) +df = spark.read.table("catalog.schema.source_table") + +# Streaming read (for streaming tables) +df = spark.readStream.table("catalog.schema.source_table") + +# Unqualified name (uses pipeline's default catalog/schema) +df = spark.read.table("source_table") + +# Read from file with Auto Loader (schema location managed automatically in SDP) +df = spark.readStream.format("cloudFiles") \ + .option("cloudFiles.format", "json") \ + .load("/Volumes/catalog/schema/raw/data/") +``` + +**Do NOT use:** +- `dp.read()` or `dp.read_stream()` - not part of modern API +- `dlt.read()` or `dlt.read_stream()` - legacy API +- `dlt.apply_changes()` - legacy API; use `dp.create_auto_cdc_flow()` instead +- `import dlt` - legacy module; use `from pyspark import pipelines as dp` + +--- + +## Table Name Resolution + +| Level | Example | When to Use | +|-------|---------|-------------| +| Unqualified | `spark.read.table("my_table")` | Tables in same pipeline (recommended) | +| Schema-qualified | `spark.read.table("other_schema.my_table")` | Different schema, same catalog | +| Fully-qualified | `spark.read.table("other_catalog.schema.table")` | External catalogs | + +**Best practice:** Use unqualified names for pipeline-internal tables. + +### Multi-Schema Pattern (One Pipeline) + +Write to multiple schemas from a single pipeline using fully qualified names: + +```python +from pyspark import pipelines as dp + +# Bronze → writes to bronze schema +@dp.table(name="my_catalog.bronze.raw_orders") +def bronze_orders(): + return spark.readStream.format("cloudFiles") \ + .option("cloudFiles.format", "json") \ + .load("/Volumes/my_catalog/raw/orders/") + +# Silver → writes to silver schema, reads from bronze +@dp.table(name="my_catalog.silver.clean_orders") +def silver_orders(): + return spark.readStream.table("my_catalog.bronze.raw_orders") \ + .filter("order_id IS NOT NULL") +``` + +--- + +## Pipeline Parameters + +Access configuration values set in pipeline settings: + +```python +# Get parameter value +catalog = spark.conf.get("target_catalog") +schema = spark.conf.get("target_schema") + +# With default +env = spark.conf.get("environment", "dev") + +@dp.table(name=f"{catalog}.{schema}.my_table") +def my_table(): + return spark.readStream.format("cloudFiles") \ + .option("cloudFiles.format", "json") \ + .load("/Volumes/...") +``` + +--- + +## Prohibited Operations + +**Do NOT include these in dataset definitions:** + +```python +# These cause unexpected behavior +@dp.table(name="bad_example") +def bad_example(): + df = spark.read.table("source") + df.collect() # No collect() + df.count() # No count() + df.toPandas() # No toPandas() + df.save(...) # No save() + df.saveAsTable(...) 
# No saveAsTable() + return df +``` + +Dataset functions should only contain code to define the transformation, not execute actions. + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| `sequence_by` type error | Use `col("column")` not string in `create_auto_cdc_flow()` | +| SCD type syntax error | Type 2 uses integer `2`, Type 1 uses string `"1"` | +| Table not found | Check catalog/schema qualification or pipeline default settings | +| Parameter not resolved | Use `spark.conf.get("param_name")` | +| Actions in definition | Remove `collect()`, `count()`, `save()` from table functions | +| Using legacy `dlt` API | Replace `import dlt` with `from pyspark import pipelines as dp` | +| Using `input_file_name()` | Use `F.col("_metadata.file_path")` | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/2-ingestion.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/2-ingestion.md new file mode 100644 index 0000000..06ddad2 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/2-ingestion.md @@ -0,0 +1,150 @@ +# Python Data Ingestion + +Data ingestion patterns using the modern `pyspark.pipelines` API. + +**Official Documentation:** +- [Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options) +- [Structured Streaming + Kafka](https://docs.databricks.com/aws/en/structured-streaming/kafka) + +--- + +## Auto Loader (Cloud Files) + +Auto Loader incrementally processes new files. In SDP pipelines, schema location and checkpoints are managed automatically. + +### Basic Pattern + +```python +from pyspark import pipelines as dp +from pyspark.sql import functions as F + +@dp.table(name="bronze_orders", cluster_by=["order_date"]) +def bronze_orders(): + return ( + spark.readStream + .format("cloudFiles") + .option("cloudFiles.format", "json") + .option("cloudFiles.inferColumnTypes", "true") + .load("/Volumes/my_catalog/my_schema/raw/orders/") + .withColumn("_ingested_at", F.current_timestamp()) + .withColumn("_source_file", F.col("_metadata.file_path")) + ) +``` + +**Key options:** +- `cloudFiles.format`: `json`, `csv`, `parquet`, `avro`, `text`, `binaryFile` +- `cloudFiles.inferColumnTypes`: Infer types (default strings) +- `cloudFiles.schemaHints`: Hint specific column types + +### Rescue Data (Quarantine Pattern) + +```python +@dp.table(name="bronze_events", cluster_by=["ingestion_date"]) +def bronze_events(): + return ( + spark.readStream + .format("cloudFiles") + .option("cloudFiles.format", "json") + .option("rescuedDataColumn", "_rescued_data") + .load("/Volumes/catalog/schema/raw/events/") + .withColumn("_ingested_at", F.current_timestamp()) + .withColumn("_has_errors", F.col("_rescued_data").isNotNull()) + ) + +@dp.table(name="bronze_quarantine") +def bronze_quarantine(): + return spark.readStream.table("bronze_events").filter("_has_errors = true") + +@dp.table(name="silver_clean") +def silver_clean(): + return spark.readStream.table("bronze_events").filter("_has_errors = false") +``` + +--- + +## Streaming Sources + +### Kafka + +```python +@dp.table(name="bronze_kafka_events") +def bronze_kafka_events(): + kafka_brokers = spark.conf.get("kafka_brokers") + return ( + spark.readStream + .format("kafka") + .option("kafka.bootstrap.servers", kafka_brokers) + .option("subscribe", "events-topic") + 
.option("startingOffsets", "latest") + .load() + .selectExpr( + "CAST(key AS STRING) AS event_key", + "CAST(value AS STRING) AS event_value", + "topic", "partition", "offset", + "timestamp AS kafka_timestamp" + ) + .withColumn("_ingested_at", F.current_timestamp()) + ) +``` + +### Parse JSON from Kafka + +```python +from pyspark.sql.types import StructType, StructField, StringType, TimestampType + +event_schema = StructType([ + StructField("event_id", StringType()), + StructField("event_type", StringType()), + StructField("timestamp", TimestampType()) +]) + +@dp.table(name="silver_events") +def silver_events(): + return ( + spark.readStream.table("bronze_kafka_events") + .withColumn("data", F.from_json("event_value", event_schema)) + .select("data.*", "kafka_timestamp", "_ingested_at") + ) +``` + +--- + +## Authentication + +### Databricks Secrets + +```python +username = dbutils.secrets.get(scope="kafka", key="username") +password = dbutils.secrets.get(scope="kafka", key="password") +``` + +### Pipeline Parameters + +```python +kafka_brokers = spark.conf.get("kafka_brokers") +input_path = spark.conf.get("input_path") +``` + +--- + +## Best Practices + +1. **Add ingestion metadata:** +```python +.withColumn("_ingested_at", F.current_timestamp()) +.withColumn("_source_file", F.col("_metadata.file_path")) +``` + +2. **Handle rescue data** - route malformed records to quarantine + +3. **Use pipeline parameters** for paths and connection strings + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| Files not picked up | Verify path and format match actual files | +| Schema evolution breaking | Use `rescuedDataColumn` and monitor `_rescued_data` | +| Kafka lag increasing | Check downstream bottlenecks | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/3-streaming-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/3-streaming-patterns.md new file mode 100644 index 0000000..44fd619 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/3-streaming-patterns.md @@ -0,0 +1,382 @@ +# Python Streaming Patterns + +Streaming-specific patterns including deduplication, windowed aggregations, late-arriving data handling, and stateful operations. 
+
+**Import**: `from pyspark import pipelines as dp`
+
+---
+
+## Deduplication Patterns
+
+### By Key
+
+```python
+from pyspark import pipelines as dp
+from pyspark.sql import functions as F
+
+@dp.table(name="silver_events_dedup", cluster_by=["event_date"])
+def silver_events_dedup():
+    """Deduplicate by event_id, keeping the first occurrence."""
+    # Non-time-based window functions such as row_number() are not
+    # supported on streaming DataFrames; dropDuplicates is the
+    # streaming-safe way to keep one row per key.
+    return (
+        spark.readStream.table("bronze_events")
+        .dropDuplicates(["event_id"])
+    )
+```
+
+### With Time Window
+
+Deduplicate within a time window to handle late arrivals:
+
+```python
+@dp.table(name="silver_events_dedup")
+def silver_events_dedup():
+    return (
+        spark.readStream.table("bronze_events")
+        .groupBy(
+            "event_id", "user_id", "event_type", "event_timestamp",
+            F.window("event_timestamp", "1 hour")
+        )
+        .agg(F.min("_ingested_at").alias("first_seen_at"))
+    )
+```
+
+### Composite Key
+
+```python
+@dp.table(name="silver_transactions_dedup")
+def silver_transactions_dedup():
+    return (
+        spark.readStream.table("bronze_transactions")
+        .groupBy("transaction_id", "customer_id", "amount", "transaction_timestamp")
+        .agg(F.min("_ingested_at").alias("_ingested_at"))
+    )
+```
+
+---
+
+## Windowed Aggregations
+
+### Tumbling Windows
+
+Non-overlapping fixed-size windows:
+
+```python
+@dp.table(name="silver_sensor_5min", cluster_by=["sensor_id"])
+def silver_sensor_5min():
+    """5-minute tumbling window aggregations."""
+    return (
+        spark.readStream.table("bronze_sensor_events")
+        .groupBy(
+            F.col("sensor_id"),
+            F.window("event_timestamp", "5 minutes")
+        )
+        .agg(
+            F.avg("temperature").alias("avg_temperature"),
+            F.min("temperature").alias("min_temperature"),
+            F.max("temperature").alias("max_temperature"),
+            F.count("*").alias("event_count")
+        )
+    )
+```
+
+### Multiple Window Sizes
+
+```python
+# 1-minute for real-time monitoring
+@dp.table(name="gold_sensor_1min")
+def gold_sensor_1min():
+    return (
+        spark.readStream.table("silver_sensor_data")
+        .groupBy(
+            "sensor_id",
+            F.window("event_timestamp", "1 minute")
+        )
+        .agg(
+            F.avg("value").alias("avg_value"),
+            F.count("*").alias("event_count")
+        )
+        .select(
+            "sensor_id",
+            F.col("window.start").alias("window_start"),
+            F.col("window.end").alias("window_end"),
+            "avg_value",
+            "event_count"
+        )
+    )
+
+# 1-hour for trend analysis
+@dp.table(name="gold_sensor_1hour")
+def gold_sensor_1hour():
+    return (
+        spark.readStream.table("silver_sensor_data")
+        .groupBy(
+            "sensor_id",
+            F.window("event_timestamp", "1 hour")
+        )
+        .agg(
+            F.avg("value").alias("avg_value"),
+            F.stddev("value").alias("stddev_value")
+        )
+    )
+```
+
+### Session Windows
+
+Group events into sessions based on inactivity gaps:
+
+```python
+@dp.table(name="silver_user_sessions")
+def silver_user_sessions():
+    """Group user events into sessions with 30-minute inactivity timeout."""
+    return (
+        spark.readStream.table("bronze_user_events")
+        .groupBy(
+            F.col("user_id"),
+            F.session_window("event_timestamp", "30 minutes")
+        )
+        .agg(
+            F.min("event_timestamp").alias("session_start"),
+            F.max("event_timestamp").alias("session_end"),
+            F.count("*").alias("event_count"),
+            F.collect_list("event_type").alias("event_sequence")
+        )
+    )
+```
+
+---
+
+## Late-Arriving Data
+
+### Event-Time vs Processing-Time
+
+Always use the event timestamp for business logic:
+
+```python
+@dp.table(name="gold_daily_orders")
+def gold_daily_orders():
+    return (
+
spark.readStream.table("silver_orders") + .groupBy(F.to_date("order_timestamp").alias("order_date")) # Event time + .agg( + F.count("*").alias("order_count"), + F.sum("amount").alias("total_amount") + ) + ) +``` + +**Keep processing time for debugging:** +```python +.select( + "order_id", "order_timestamp", # Event time (business logic) + "customer_id", "amount", + "_ingested_at" # Processing time (debugging only) +) +``` + +--- + +## Joins + +### Stream-to-Static Joins + +Enrich streaming data with dimension tables: + +```python +@dp.table(name="silver_sales_enriched", cluster_by=["product_id"]) +def silver_sales_enriched(): + """Enrich streaming sales with static product dimension.""" + sales = spark.readStream.table("bronze_sales") + products = spark.read.table("dim_products") + return ( + sales.join(products, "product_id", "left") + .select( + "sale_id", "product_id", "quantity", "sale_timestamp", + "product_name", "category", "price" + ) + .withColumn("total_amount", F.col("quantity") * F.col("price")) + ) +``` + +### Stream-to-Stream Joins + +```python +@dp.table(name="silver_orders_with_payments") +def silver_orders_with_payments(): + """Join orders with payments within 1-hour window.""" + orders = spark.readStream.table("bronze_orders") + payments = spark.readStream.table("bronze_payments") + + return ( + orders.join( + payments, + (orders.order_id == payments.order_id) & + (payments.payment_timestamp >= orders.order_timestamp) & + (payments.payment_timestamp <= orders.order_timestamp + F.expr("INTERVAL 1 HOUR")), + "inner" + ) + .select( + orders.order_id, + orders.customer_id, + orders.order_timestamp, + orders.amount.alias("order_amount"), + payments.payment_id, + payments.payment_timestamp, + payments.amount.alias("payment_amount") + ) + ) +``` + +**Important:** Use time bounds in join condition to limit state retention. 
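+
+To also bound state against late data, watermarks can be added on both sides. A minimal sketch assuming the same `bronze_orders`/`bronze_payments` sources as above (the watermark thresholds and table name are illustrative):
+
+```python
+from pyspark import pipelines as dp
+from pyspark.sql import functions as F
+
+@dp.table(name="silver_orders_with_payments_bounded")
+def silver_orders_with_payments_bounded():
+    """Stream-stream join whose state is bounded by watermarks."""
+    # Join state older than the watermark threshold can be discarded by Spark.
+    orders = (spark.readStream.table("bronze_orders")
+              .withWatermark("order_timestamp", "2 hours"))
+    payments = (spark.readStream.table("bronze_payments")
+                .withWatermark("payment_timestamp", "3 hours"))
+    return orders.join(
+        payments,
+        (orders.order_id == payments.order_id) &
+        (payments.payment_timestamp >= orders.order_timestamp) &
+        (payments.payment_timestamp <= orders.order_timestamp + F.expr("INTERVAL 1 HOUR")),
+        "inner"
+    )
+```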
+
+---
+
+## Incremental Aggregations
+
+### Running Totals
+
+```python
+@dp.table(name="silver_customer_running_totals")
+def silver_customer_running_totals():
+    return (
+        spark.readStream.table("bronze_transactions")
+        .groupBy("customer_id")
+        .agg(
+            F.sum("amount").alias("total_spent"),
+            F.count("*").alias("transaction_count"),
+            F.max("transaction_timestamp").alias("last_transaction_at")
+        )
+    )
+```
+
+---
+
+## Anomaly Detection
+
+### Outlier Detection with Rolling Statistics
+
+Row-range window functions are not supported on streaming DataFrames, so compute rolling statistics in a batch materialized view:
+
+```python
+from pyspark.sql.window import Window
+
+@dp.materialized_view(name="silver_sensor_with_anomalies")
+def silver_sensor_with_anomalies():
+    # Rolling stats over the last 100 readings per sensor (batch semantics).
+    window_spec = Window.partitionBy("sensor_id").orderBy("event_timestamp").rowsBetween(-100, 0)
+
+    return (
+        spark.read.table("bronze_sensor_events")
+        .withColumn("rolling_avg", F.avg("temperature").over(window_spec))
+        .withColumn("rolling_stddev", F.stddev("temperature").over(window_spec))
+        .withColumn("anomaly_flag",
+            F.when(F.col("temperature") > F.col("rolling_avg") + (3 * F.col("rolling_stddev")), "HIGH_OUTLIER")
+            .when(F.col("temperature") < F.col("rolling_avg") - (3 * F.col("rolling_stddev")), "LOW_OUTLIER")
+            .otherwise("NORMAL")
+        )
+    )
+
+@dp.materialized_view(name="silver_sensor_anomalies")
+def silver_sensor_anomalies():
+    return (
+        spark.read.table("silver_sensor_with_anomalies")
+        .filter(F.col("anomaly_flag").isin("HIGH_OUTLIER", "LOW_OUTLIER"))
+    )
+```
+
+### Threshold-Based Filtering
+
+Stateless threshold filters, by contrast, run directly on the stream:
+
+```python
+@dp.table(name="silver_high_value_transactions")
+def silver_high_value_transactions():
+    return (
+        spark.readStream.table("bronze_transactions")
+        .filter(F.col("amount") > 10000)
+    )
+```
+
+---
+
+## Monitoring Lag
+
+```python
+@dp.table(name="monitoring_lag")
+def monitoring_lag():
+    return (
+        spark.readStream.table("bronze_kafka_events")
+        .groupBy(F.window("kafka_timestamp", "1 minute"))
+        .agg(
+            F.lit("kafka_events").alias("source"),
+            F.max("kafka_timestamp").alias("max_event_timestamp"),
+            F.current_timestamp().alias("processing_timestamp")
+        )
+        .withColumn("lag_seconds",
+            F.unix_timestamp("processing_timestamp") - F.unix_timestamp("max_event_timestamp")
+        )
+    )
+```
+
+---
+
+## Best Practices
+
+### 1. Use Event Timestamps
+
+```python
+# Correct: Event timestamp for logic
+.groupBy(F.date_trunc("hour", "event_timestamp"))
+
+# Avoid: Processing timestamp
+# .groupBy(F.date_trunc("hour", "_ingested_at"))
+```
+
+### 2. Window Size Selection
+
+- **1-5 minutes**: Real-time monitoring
+- **15-60 minutes**: Operational dashboards
+- **1-24 hours**: Analytical reports
+
+### 3. State Management
+
+Higher cardinality = more state:
+
+```python
+# High state: 1M users x 10K products x 100M sessions
+.groupBy("user_id", "product_id", "session_id")
+
+# Lower state: 1M users x 100 categories x days
+.groupBy("user_id", "product_category", F.to_date("event_time"))
+```
+
+Use time windows to bound state retention.
+
+### 4. Deduplicate Early
+
+Apply at the bronze → silver transition:
+
+```python
+# Bronze: Accept duplicates
+@dp.table(name="bronze_events")
+def bronze_events():
+    return spark.readStream.format("cloudFiles")...
+
+# Silver: Deduplicate immediately
+@dp.table(name="silver_events")
+def silver_events():
+    return spark.readStream.table("bronze_events").dropDuplicates(["event_id"])
+
+# Gold: Work with clean data
+@dp.table(name="gold_metrics")
+def gold_metrics():
+    return spark.readStream.table("silver_events")...
+```
+
+---
+
+## Common Issues
+
+| Issue | Solution |
+|-------|----------|
+| High memory with windows | Use larger windows, reduce group-by cardinality |
+| Duplicate events in output | Add explicit deduplication by unique key |
+| Missing late-arriving events | Increase window size or use longer retention |
+| Stream-to-stream join empty | Verify join conditions and time bounds |
+| State growth over time | Add time windows, reduce cardinality, materialize intermediates |
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/4-cdc-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/4-cdc-patterns.md
new file mode 100644
index 0000000..9e05370
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/4-cdc-patterns.md
@@ -0,0 +1,449 @@
+# Python CDC Patterns (AUTO CDC & SCD)
+
+Change Data Capture patterns using AUTO CDC for SCD Type 1 and Type 2, plus querying SCD history tables.
+
+**Import**: `from pyspark import pipelines as dp`
+
+---
+
+## Overview
+
+AUTO CDC applies Change Data Capture feeds to target tables as Slowly Changing Dimensions (SCD). It provides automatic deduplication and change tracking, and handles late-arriving data correctly.
+
+**Where to apply AUTO CDC:**
+- **Silver layer**: When business users need deduplicated or historical data
+- **Gold layer**: When implementing dimensional modeling (star schema)
+
+---
+
+## SCD Type 1 vs Type 2
+
+### SCD Type 1 (In-place updates)
+- **Overwrites** old values with new values
+- **No history preserved** - only current state
+- **Use for**: Error corrections, attributes where history doesn't matter
+- **Syntax**: `stored_as_scd_type="1"` (string)
+
+### SCD Type 2 (History tracking)
+- **Creates new row** for each change
+- **Preserves full history** with `__START_AT` and `__END_AT` timestamps
+- **Use for**: Tracking changes over time (addresses, prices, roles)
+- **Syntax**: `stored_as_scd_type=2` (integer)
+
+**Important:** Type 2 uses integer `2`, Type 1 uses string `"1"`.
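+
+For intuition, here is how one customer whose address changed looks in a Type 2 target (illustrative rows):
+
+| customer_id | address | __START_AT | __END_AT |
+|-------------|-----------|------------|------------|
+| 42 | 12 Oak St | 2024-01-01 | 2024-06-15 |
+| 42 | 9 Elm Ave | 2024-06-15 | NULL |
+
+A Type 1 target would hold only the current row (`9 Elm Ave`), without the `__START_AT`/`__END_AT` columns.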
+ +--- + +## Creating AUTO CDC Flows + +### SCD Type 2 + +```python +from pyspark import pipelines as dp +from pyspark.sql.functions import col + +target_schema = spark.conf.get("target_schema") +source_schema = spark.conf.get("source_schema") + +# Step 1: Create target table +dp.create_streaming_table(f"{target_schema}.dim_customers") + +# Step 2: Create AUTO CDC flow +dp.create_auto_cdc_flow( + target=f"{target_schema}.dim_customers", + source=f"{source_schema}.customers_cdc_clean", + keys=["customer_id"], + sequence_by=col("event_timestamp"), # Note: use col(), not string + stored_as_scd_type=2, # Integer for Type 2 + apply_as_deletes=col("operation") == "DELETE", + except_column_list=["operation", "_ingested_at", "_source_file"] +) +``` + +### SCD Type 1 + +```python +dp.create_streaming_table(f"{target_schema}.orders_current") + +dp.create_auto_cdc_flow( + target=f"{target_schema}.orders_current", + source=f"{source_schema}.orders_clean", + keys=["order_id"], + sequence_by=col("updated_timestamp"), + stored_as_scd_type="1" # String for Type 1 +) +``` + +### Selective History Tracking + +Track history only when specific columns change: + +```python +dp.create_auto_cdc_flow( + target="gold.dim_products", + source="silver.products_clean", + keys=["product_id"], + sequence_by=col("modified_at"), + stored_as_scd_type=2, + track_history_column_list=["price", "cost"] # Only track these columns +) +``` + +When `price` or `cost` changes, a new version is created. Other column changes update the current record without new versions. + +--- + +## Complete Pattern: Clean + AUTO CDC + +### Step 1: Clean and Validate Source Data + +```python +from pyspark import pipelines as dp +from pyspark.sql import functions as F + +schema = spark.conf.get("schema") + +@dp.table( + name=f"{schema}.users_clean", + comment="Cleaned and validated user data", + cluster_by=["user_id"] +) +def users_clean(): + """ + Clean data with proper typing and quality checks. + """ + return ( + spark.readStream.table("bronze_users") + .filter(F.col("user_id").isNotNull()) + .filter(F.col("email").isNotNull()) + .filter(F.col("email").rlike(r"^[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}$")) + .withColumn("created_timestamp", F.to_timestamp("created_timestamp")) + .withColumn("updated_timestamp", F.to_timestamp("updated_timestamp")) + .drop("_rescued_data") + .select( + "user_id", "email", "name", "subscription_tier", "country", + "created_timestamp", "updated_timestamp", + "_ingested_at", "_source_file" + ) + ) +``` + +### Step 2: Apply AUTO CDC + +```python +from pyspark.sql.functions import col + +target_schema = spark.conf.get("target_schema") +source_schema = spark.conf.get("source_schema") + +dp.create_streaming_table(f"{target_schema}.dim_users") + +dp.create_auto_cdc_flow( + target=f"{target_schema}.dim_users", + source=f"{source_schema}.users_clean", + keys=["user_id"], + sequence_by=col("updated_timestamp"), + stored_as_scd_type=2, + except_column_list=["_ingested_at", "_source_file"] +) +``` + +--- + +## Using Temporary Views with AUTO CDC + +`@dp.temporary_view()` creates in-pipeline temporary views useful for intermediate transformations before AUTO CDC. 
+
+**Key Constraints:**
+- Cannot specify `catalog` or `schema` (pipeline-scoped only)
+- Cannot use `cluster_by` (not persisted)
+- Only exists during pipeline execution
+
+```python
+from pyspark import pipelines as dp
+from pyspark.sql import functions as F
+from pyspark.sql.functions import col
+
+# Step 1: Temporary view for complex business logic
+@dp.temporary_view()
+def orders_with_calculated_fields():
+    """
+    Temporary view for complex calculations.
+    No catalog/schema needed - exists only in pipeline.
+    """
+    return (
+        spark.readStream.table("bronze.orders")
+        .withColumn("order_total", F.col("quantity") * F.col("unit_price"))
+        .withColumn("discount_amount", F.col("order_total") * F.col("discount_rate"))
+        .withColumn("final_amount", F.col("order_total") - F.col("discount_amount"))
+        .withColumn("order_category",
+            F.when(F.col("final_amount") > 1000, "large")
+            .when(F.col("final_amount") > 100, "medium")
+            .otherwise("small")
+        )
+        .filter(F.col("order_id").isNotNull())
+        .filter(F.col("final_amount") > 0)
+    )
+
+# Step 2: Apply AUTO CDC using the temporary view as source
+target_schema = spark.conf.get("target_schema")
+
+dp.create_streaming_table(f"{target_schema}.orders_current")
+dp.create_auto_cdc_flow(
+    target=f"{target_schema}.orders_current",
+    source="orders_with_calculated_fields",  # Reference temporary view by name
+    keys=["order_id"],
+    sequence_by=col("order_date"),
+    stored_as_scd_type="1"
+)
+```
+
+---
+
+## Querying SCD Type 2 Tables
+
+SCD Type 2 tables include temporal columns:
+- `__START_AT` - When this version became effective
+- `__END_AT` - When this version expired (NULL for current)
+
+### Current State
+
+```python
+@dp.materialized_view(name="dim_customers_current")
+def dim_customers_current():
+    """All current records."""
+    return (
+        spark.read.table("dim_customers")
+        .filter(F.col("__END_AT").isNull())
+        .select(
+            "customer_id", "customer_name", "email", "phone", "address",
+            F.col("__START_AT").alias("valid_from")
+        )
+    )
+```
+
+### Point-in-Time Queries
+
+Get state as of a specific date:
+
+```python
+@dp.materialized_view(name="products_as_of_date")
+def products_as_of_date():
+    """Products as of January 1, 2024."""
+    as_of_date = "2024-01-01"
+    return (
+        spark.read.table("products_history")
+        .filter(F.col("__START_AT") <= as_of_date)
+        .filter(
+            (F.col("__END_AT") > as_of_date) |
+            F.col("__END_AT").isNull()
+        )
+    )
+```
+
+### Change Analysis
+
+Track all changes for an entity:
+
+```python
+def get_customer_history(customer_id: str):
+    """Get complete history for a customer."""
+    return (
+        spark.read.table("dim_customers")
+        .filter(F.col("customer_id") == customer_id)
+        .withColumn("days_active",
+            F.coalesce(
+                F.datediff("__END_AT", "__START_AT"),
+                F.datediff(F.current_timestamp(), "__START_AT")
+            )
+        )
+        .orderBy(F.col("__START_AT").desc())
+    )
+```
+
+---
+
+## Joining Facts with Historical Dimensions
+
+### At Transaction Time
+
+```python
+@dp.materialized_view(name="sales_with_historical_prices")
+def sales_with_historical_prices():
+    """Join sales with product prices at time of sale."""
+    sales = spark.read.table("sales_fact")
+    products = spark.read.table("products_history")
+
+    return (
+        sales.join(
+            products,
+            (sales.product_id == products.product_id) &
+            (sales.sale_date >= products.__START_AT) &
+            ((sales.sale_date < products.__END_AT) | products.__END_AT.isNull()),
+            "inner"
+        )
+        .select(
+            sales.sale_id,
+            sales.product_id,
+            sales.sale_date,
+            sales.quantity,
+            products.product_name,
+            products.price.alias("unit_price_at_sale_time"),
+
(sales.quantity * products.price).alias("calculated_amount"), + products.category + ) + ) +``` + +### With Current Dimension + +```python +@dp.materialized_view(name="sales_with_current_prices") +def sales_with_current_prices(): + """Join sales with current product information.""" + sales = spark.read.table("sales_fact") + products_current = spark.read.table("products_history").filter(F.col("__END_AT").isNull()) + + return ( + sales.join(products_current, "product_id", "inner") + .select( + "sale_id", "product_id", "sale_date", "quantity", + sales.amount.alias("amount_at_sale"), + products_current.product_name.alias("current_product_name"), + products_current.price.alias("current_price") + ) + ) +``` + +--- + +## Common Patterns + +### Pattern 1: Gold Dimensional Model + +```python +# Silver: Cleaned streaming tables +@dp.table(name="silver.customers_clean") +def customers_clean(): + return spark.readStream.table("bronze.customers").filter(...) + +# Gold: SCD Type 2 dimension +dp.create_streaming_table("gold.dim_customers") +dp.create_auto_cdc_flow( + target="gold.dim_customers", + source="silver.customers_clean", + keys=["customer_id"], + sequence_by=col("updated_at"), + stored_as_scd_type=2 +) + +# Gold: Fact table (no AUTO CDC) +@dp.table(name="gold.fact_orders") +def fact_orders(): + return spark.read.table("silver.orders_clean") +``` + +### Pattern 2: Silver Deduplication for Joins + +```python +# Silver: AUTO CDC for deduplication +dp.create_streaming_table("silver.products_dedupe") +dp.create_auto_cdc_flow( + target="silver.products_dedupe", + source="bronze.products", + keys=["product_id"], + sequence_by=col("modified_at"), + stored_as_scd_type="1" # Type 1: just dedupe, no history +) + +# Silver: Join with deduplicated data +@dp.table(name="silver.orders_enriched") +def orders_enriched(): + orders = spark.readStream.table("bronze.orders") + products = spark.read.table("silver.products_dedupe") + return orders.join(products, "product_id") +``` + +### Pattern 3: Mixed SCD Types + +```python +# SCD Type 2: Need history +dp.create_auto_cdc_flow( + target="gold.dim_customers", + source="silver.customers", + keys=["customer_id"], + sequence_by=col("updated_at"), + stored_as_scd_type=2 # Track address changes over time +) + +# SCD Type 1: Corrections only +dp.create_auto_cdc_flow( + target="gold.dim_products", + source="silver.products", + keys=["product_id"], + sequence_by=col("modified_at"), + stored_as_scd_type="1" # Current product info only +) +``` + +--- + +## Best Practices + +### 1. Clean Data Before AUTO CDC + +Apply type casting, validation, and filtering first: + +```python +@dp.table(name="users_clean") +def users_clean(): + return ( + spark.readStream.table("bronze_users") + .filter(F.col("user_id").isNotNull()) + .filter(F.col("email").isNotNull()) + .withColumn("updated_at", F.to_timestamp("updated_at")) + ) + +# Then apply AUTO CDC +dp.create_auto_cdc_flow( + target="dim_users", + source="users_clean", + keys=["user_id"], + sequence_by=col("updated_at"), + stored_as_scd_type=2 +) +``` + +### 2. Use col() for sequence_by + +```python +# Correct +sequence_by=col("event_timestamp") + +# Wrong - causes error +# sequence_by="event_timestamp" +``` + +### 3. Choose the Right SCD Type + +- **Type 2** (`stored_as_scd_type=2`): Need to query historical states +- **Type 1** (`stored_as_scd_type="1"`): Only need current state or deduplication + +### 4. 
Use meaningful sequence_by column + +Should reflect true chronological order of changes: +- `updated_timestamp` +- `modified_at` +- `event_timestamp` + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| `sequence_by` type error | Use `col("column")` not string | +| SCD type syntax error | Type 2 uses integer `2`, Type 1 uses string `"1"` | +| Duplicates still appearing | Check `keys` include all business key columns | +| Missing `__START_AT`/`__END_AT` | These only appear in SCD Type 2, not Type 1 | +| Late data not handled | Ensure `sequence_by` reflects true event time | +| Performance issues | Use `track_history_column_list` to limit version triggers | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/5-performance.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/5-performance.md new file mode 100644 index 0000000..0cdcc94 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/python/5-performance.md @@ -0,0 +1,423 @@ +# Python Performance Tuning + +Performance optimization strategies including Liquid Clustering, materialized view refresh, state management, and compute configuration. + +**Import**: `from pyspark import pipelines as dp` + +--- + +## Liquid Clustering (Recommended) + +Liquid Clustering is the recommended approach for data layout optimization. It replaces manual partitioning and Z-ORDER. + +### Benefits + +- **Adaptive**: Adjusts to data distribution changes +- **Multi-dimensional**: Clusters on multiple columns simultaneously +- **Automatic file sizing**: Maintains optimal file sizes +- **Self-optimizing**: Reduces manual OPTIMIZE commands + +### Basic Syntax + +```python +from pyspark import pipelines as dp + +@dp.table(cluster_by=["event_type", "event_date"]) +def bronze_events(): + return spark.readStream.format("cloudFiles").load("/data") +``` + +### Automatic Key Selection + +```python +@dp.table(cluster_by=["AUTO"]) +def bronze_events(): + return spark.readStream.format("cloudFiles").load("/data") +``` + +**When to use AUTO**: Learning phase, unknown access patterns, prototyping +**When to define manually**: Well-known query patterns, production workloads + +--- + +## Cluster Key Selection by Layer + +### Bronze Layer + +Cluster by event type + date: + +```python +@dp.table( + name="bronze_events", + cluster_by=["event_type", "ingestion_date"], + table_properties={"delta.autoOptimize.optimizeWrite": "true"} +) +def bronze_events(): + return ( + spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/my_catalog/my_schema/raw/events/") + .withColumn("_ingested_at", F.current_timestamp()) + .withColumn("ingestion_date", F.current_date()) + ) +``` + +**Why**: Bronze filtered by event type for processing and by date for incremental loads. + +### Silver Layer + +Cluster by primary key + business dimension: + +```python +@dp.table( + name="silver_orders", + cluster_by=["customer_id", "order_date"] +) +def silver_orders(): + return ( + spark.readStream.table("bronze_orders") + .withColumn("order_date", F.to_date("order_timestamp")) + .select("order_id", "customer_id", "product_id", "amount", "order_date") + ) +``` + +**Why**: Entity lookups (by ID) and time-range queries (by date). 
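+
+A query that filters on the leading cluster keys benefits directly from this layout; a minimal sketch (the filter values are illustrative):
+
+```python
+from pyspark.sql import functions as F
+
+# Both predicates hit the silver_orders cluster keys (customer_id, order_date),
+# so Liquid Clustering can skip most files during the scan.
+recent_customer_orders = (
+    spark.read.table("silver_orders")
+    .filter(F.col("customer_id") == "C-1001")
+    .filter(F.col("order_date") >= "2024-01-01")
+)
+```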
+ +### Gold Layer + +Cluster by aggregation dimensions: + +```python +@dp.materialized_view( + name="gold_sales_summary", + cluster_by=["product_category", "year_month"] +) +def gold_sales_summary(): + return ( + spark.read.table("silver_orders") + .withColumn("year_month", F.date_format("order_date", "yyyy-MM")) + .groupBy("product_category", "year_month") + .agg( + F.sum("amount").alias("total_sales"), + F.count("*").alias("transaction_count"), + F.avg("amount").alias("avg_order_value") + ) + ) +``` + +**Why**: Dashboard filters (category, region, time period). + +### Selection Guidelines + +| Layer | Good Keys | Rationale | +|-------|-----------|-----------| +| **Bronze** | event_type, ingestion_date | Filter by type; date for incremental | +| **Silver** | primary_key, business_date | Entity lookups + time ranges | +| **Gold** | aggregation_dimensions | Dashboard filters | + +**Best practices:** +- First key: Most selective filter (e.g., customer_id) +- Second key: Next common filter (e.g., date) +- Order matters: Most selective first +- Limit to 4 keys: Diminishing returns beyond 4 +- **Use `["AUTO"]` if unsure** + +--- + +## Table Properties + +### Auto-Optimize + +```python +@dp.table( + name="bronze_events", + table_properties={ + "delta.autoOptimize.optimizeWrite": "true", + "delta.autoOptimize.autoCompact": "true" + } +) +def bronze_events(): + return spark.readStream.format("cloudFiles").load(...) +``` + +### Change Data Feed + +```python +@dp.table( + name="silver_customers", + table_properties={"delta.enableChangeDataFeed": "true"} +) +def silver_customers(): + return spark.readStream.table("bronze_customers") +``` + +**Use when**: Downstream systems need efficient change tracking. + +### Retention Periods + +```python +@dp.table( + name="bronze_high_volume", + table_properties={ + "delta.logRetentionDuration": "7 days", + "delta.deletedFileRetentionDuration": "7 days" + } +) +def bronze_high_volume(): + return spark.readStream.format("cloudFiles").load(...) +``` + +**Use for**: High-volume tables to reduce storage costs. + +--- + +## State Management for Streaming + +### Understand State Growth + +Higher cardinality = more state: + +```python +# High state: 1M users x 10K products x 100M sessions - Massive state! 
+.groupBy("user_id", "product_id", "session_id") +``` + +### Reduce State Size + +**Strategy 1: Reduce cardinality** + +```python +@dp.table(name="user_category_stats") +def user_category_stats(): + return ( + spark.readStream.table("bronze_events") + .groupBy( + "user_id", + "product_category", # 100 categories (not 10K products) + F.to_date("event_time").alias("event_date") + ) + .agg(F.count("*").alias("events")) + ) +``` + +**Strategy 2: Use time windows** + +```python +@dp.table(name="user_hourly_stats") +def user_hourly_stats(): + return ( + spark.readStream.table("bronze_events") + .groupBy( + "user_id", + F.window("event_time", "1 hour") + ) + .agg(F.count("*").alias("events")) + ) +``` + +**Strategy 3: Materialize intermediates** + +```python +# Streaming aggregation (maintains state) +@dp.table(name="user_daily_stats") +def user_daily_stats(): + return ( + spark.readStream.table("bronze_events") + .groupBy("user_id", F.to_date("event_time").alias("event_date")) + .agg(F.count("*").alias("event_count")) + ) + +# Batch aggregation (no streaming state) +@dp.materialized_view(name="user_monthly_stats") +def user_monthly_stats(): + return ( + spark.read.table("user_daily_stats") + .groupBy("user_id", F.date_trunc("month", "event_date").alias("month")) + .agg(F.sum("event_count").alias("total_events")) + ) +``` + +--- + +## Join Optimization + +### Stream-to-Static (Efficient) + +```python +@dp.table(name="sales_enriched") +def sales_enriched(): + """Small static dimension, large streaming fact.""" + sales = spark.readStream.table("bronze_sales") + products = spark.read.table("dim_products") # Small, broadcast + + return ( + sales.join(products, "product_id", "left") + .select("sale_id", "product_id", "amount", "product_name", "category") + ) +``` + +**Best practice**: Keep static dimensions small (<10K rows) for broadcast. + +### Stream-to-Stream (Stateful) + +```python +@dp.table(name="orders_with_payments") +def orders_with_payments(): + """Time bounds limit state retention.""" + orders = spark.readStream.table("bronze_orders") + payments = spark.readStream.table("bronze_payments") + + return orders.join( + payments, + (orders.order_id == payments.order_id) & + (payments.payment_time >= orders.order_time) & + (payments.payment_time <= orders.order_time + F.expr("INTERVAL 1 HOUR")), + "inner" + ) +``` + +--- + +## Query Optimization + +### Filter Early + +```python +# Filter at source +@dp.table(name="silver_recent") +def silver_recent(): + return ( + spark.readStream.table("bronze_events") + .filter(F.col("event_date") >= F.current_date() - 7) + ) + +# Avoid filtering late in separate table +# @dp.table(name="silver_all") +# def silver_all(): return spark.readStream.table("bronze_events") +# @dp.materialized_view(name="gold_recent") +# def gold_recent(): return spark.read.table("silver_all").filter(...) 
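+# Filtering in a separate downstream table forces silver_all to materialize and
+# maintain the full event history before gold_recent discards most of it;
+# pushing the filter to the source table avoids that storage and compute cost.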
+``` + +### Select Specific Columns + +```python +# Only needed columns +.select("customer_id", "order_date", "amount") + +# Avoid SELECT * +# .select("*") +``` + +--- + +## Pre-Aggregation + +```python +@dp.materialized_view(name="orders_monthly") +def orders_monthly(): + """Pre-aggregate for fast queries.""" + return ( + spark.read.table("large_orders_table") + .groupBy( + "customer_id", + F.year("order_date").alias("year"), + F.month("order_date").alias("month") + ) + .agg(F.sum("amount").alias("total")) + ) + +# Query the MV directly - much faster than querying large_orders_table +``` + +--- + +## Compute Configuration + +### Serverless vs Classic + +| Aspect | Serverless | Classic | +|--------|-----------|---------| +| Startup | Fast (seconds) | Slower (minutes) | +| Scaling | Automatic, instant | Manual/autoscaling | +| Cost | Pay-per-use | Pay for cluster time | +| Best for | Variable workloads, dev/test | Steady workloads | + +### Serverless (Recommended) + +Enable at pipeline level: + +```yaml +execution_mode: continuous # or triggered +serverless: true +``` + +**Advantages**: No cluster management, instant scaling, lower cost for bursty workloads. + +--- + +## Complete Example + +```python +from pyspark import pipelines as dp +from pyspark.sql import functions as F + +# Bronze: Optimized ingestion +@dp.table( + name="bronze_orders", + cluster_by=["order_date"], + table_properties={ + "delta.autoOptimize.optimizeWrite": "true", + "delta.autoOptimize.autoCompact": "true" + } +) +def bronze_orders(): + return ( + spark.readStream.format("cloudFiles") + .option("cloudFiles.format", "json") + .load("/Volumes/my_catalog/my_schema/raw/orders/") + .withColumn("_ingested_at", F.current_timestamp()) + .withColumn("order_date", F.to_date("order_timestamp")) + ) + +# Silver: Efficient clustering for joins +@dp.table( + name="silver_orders", + cluster_by=["customer_id", "order_date"] +) +@dp.expect_or_drop("valid_amount", "amount > 0") +def silver_orders(): + return ( + spark.readStream.table("bronze_orders") + .filter(F.col("order_date") >= F.current_date() - 90) # Filter early + .withColumn("amount", F.col("amount").cast("decimal(10,2)")) # DECIMAL for monetary + .select("order_id", "customer_id", "amount", "order_date") # Select specific + ) + +# Gold: Pre-aggregated for dashboards +@dp.materialized_view( + name="gold_daily_revenue", + cluster_by=["order_date"] +) +def gold_daily_revenue(): + return ( + spark.read.table("silver_orders") + .groupBy("order_date") + .agg( + F.sum("amount").alias("total_revenue"), + F.count("order_id").alias("order_count"), + F.countDistinct("customer_id").alias("unique_customers") + ) + ) +``` + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| Pipeline running slowly | Check clustering, state size, join patterns | +| High memory usage | Unbounded state - add time windows, reduce cardinality | +| Many small files | Enable auto-optimize table properties | +| Expensive queries on large tables | Add clustering, create filtered MVs | +| MV refresh slow | Enable row tracking on source | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/1-syntax-basics.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/1-syntax-basics.md new file mode 100644 index 0000000..54e45df --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/1-syntax-basics.md 
@@ -0,0 +1,243 @@
+# SQL Syntax Basics
+
+Core SQL syntax for Spark Declarative Pipelines (SDP).
+
+---
+
+## Table Types
+
+### Streaming Table
+
+Processes data incrementally. Use for continuous ingestion and transformations.
+
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_events
+COMMENT 'Raw event data'
+CLUSTER BY (event_type, event_date)
+TBLPROPERTIES (
+  'delta.autoOptimize.optimizeWrite' = 'true',
+  'delta.autoOptimize.autoCompact' = 'true'
+)
+AS
+SELECT
+  *,
+  current_timestamp() AS _ingested_at,
+  _metadata.file_path AS _source_file
+FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/events/', format => 'json');
+```
+
+**Key points:**
+- Use `STREAM` keyword with source for incremental processing
+- `CLUSTER BY` enables Liquid Clustering (recommended over PARTITION BY)
+- The query is evaluated incrementally as a stream
+
+### Materialized View
+
+Batch table with automatic incremental refresh.
+
+```sql
+CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
+COMMENT 'Daily aggregated metrics'
+CLUSTER BY (report_date)
+AS
+SELECT
+  report_date,
+  SUM(amount) AS total_amount,
+  COUNT(*) AS transaction_count
+FROM silver_orders
+GROUP BY report_date;
+```
+
+**Key points:**
+- No `STREAM` keyword - reads batch
+- Automatically refreshes incrementally when source changes
+- Use for aggregations and reporting tables
+
+### View (Persisted)
+
+A regular view published to Unity Catalog. Unlike materialized views, it doesn't store data - the query runs each time the view is accessed.
+
+```sql
+CREATE VIEW taxi_raw AS
+SELECT * FROM read_files('/Volumes/catalog/schema/raw/taxi/');
+
+CREATE VIEW active_customers AS
+SELECT customer_id, name, email
+FROM dim_customers
+WHERE status = 'active';
+```
+
+**Key points:**
+- Persisted in Unity Catalog (visible outside pipeline)
+- No data storage - query executes on access
+- Cannot use streaming queries or constraints
+- Requires Unity Catalog pipeline with default publishing mode
+
+**Documentation:** [CREATE VIEW reference](https://docs.databricks.com/aws/en/ldp/developer/ldp-sql-ref-create-view)
+
+### Temporary View
+
+Pipeline-scoped view, not persisted. Useful for intermediate transformations.
+
+```sql
+CREATE TEMPORARY VIEW orders_with_calculations AS
+SELECT
+  *,
+  quantity * price AS total,
+  quantity * price * discount_rate AS discount_amount
+FROM STREAM bronze_orders
+WHERE quantity > 0;
+```
+
+**Key points:**
+- Exists only during pipeline execution
+- No storage cost
+- Not visible outside pipeline
+- Useful before AUTO CDC flows
+
+### Choosing Between View Types
+
+| Type | Persisted | Stores Data | Streaming | Use Case |
+|------|-----------|-------------|-----------|----------|
+| **Materialized View** | Yes | Yes | No | Aggregations, reporting tables |
+| **View** | Yes | No | No | Simple transformations, external access |
+| **Temporary View** | No | No | Yes | Intermediate steps, before AUTO CDC |
+
+---
+
+## Data Quality (Expectations)
+
+**Documentation:** [Expectations](https://docs.databricks.com/aws/en/ldp/expectations)
+
+### Constraint Syntax
+
+```sql
+CREATE OR REFRESH STREAMING TABLE silver_orders (
+  CONSTRAINT valid_amount EXPECT (amount > 0) ON VIOLATION DROP ROW,
+  CONSTRAINT valid_customer EXPECT (customer_id IS NOT NULL) ON VIOLATION DROP ROW,
+  CONSTRAINT critical_field EXPECT (order_id IS NOT NULL) ON VIOLATION FAIL UPDATE
+)
+AS
+SELECT * FROM STREAM bronze_orders;
+```
+
+| Violation Action | Behavior |
+|-----------------|----------|
+| `ON VIOLATION DROP ROW` | Drop rows that violate |
+| `ON VIOLATION FAIL UPDATE` | Fail pipeline if any row violates |
+| (no action) | Log warning, keep all rows |
+
+### WHERE Clause Alternative
+
+For simple filtering without tracking:
+
+```sql
+CREATE OR REFRESH STREAMING TABLE silver_orders AS
+SELECT * FROM STREAM bronze_orders
+WHERE amount > 0 AND customer_id IS NOT NULL;
+```
+
+---
+
+## Liquid Clustering
+
+Use `CLUSTER BY` instead of legacy `PARTITION BY`. See **[5-performance.md](5-performance.md#liquid-clustering-recommended)** for detailed guidance on key selection by layer.
+
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_events
+CLUSTER BY (event_type, event_date)
+AS SELECT ...;
+```
+
+---
+
+## Table Properties
+
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_events
+TBLPROPERTIES (
+  'delta.autoOptimize.optimizeWrite' = 'true',      -- Optimize file sizes on write
+  'delta.autoOptimize.autoCompact' = 'true',        -- Automatic compaction
+  'delta.enableChangeDataFeed' = 'true',            -- Enable CDF for downstream
+  'delta.logRetentionDuration' = '7 days',          -- Log retention
+  'delta.deletedFileRetentionDuration' = '7 days'   -- Deleted file retention
+)
+AS SELECT ...;
+```
+
+---
+
+## Refresh Scheduling (Materialized Views)
+
+```sql
+-- Near-real-time
+CREATE OR REFRESH MATERIALIZED VIEW gold_live_metrics
+REFRESH EVERY 5 MINUTES
+AS SELECT ...;
+
+-- Daily
+CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary
+REFRESH EVERY 1 DAY
+AS SELECT ...;
+```
+
+---
+
+## Table Name Resolution
+
+| Level | Example | When to Use |
+|-------|---------|-------------|
+| Unqualified | `FROM bronze_orders` | Tables in same pipeline (recommended) |
+| Schema-qualified | `FROM other_schema.orders` | Different schema, same catalog |
+| Fully-qualified | `FROM other_catalog.schema.orders` | External catalogs |
+
+**Best practice:** Use unqualified names for pipeline-internal tables.
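+
+For instance, a pipeline-internal read stays unqualified while an external dimension is fully qualified (a minimal sketch; the external catalog and table are hypothetical):
+
+```sql
+CREATE OR REFRESH MATERIALIZED VIEW gold_order_metrics AS
+SELECT o.order_date, COUNT(*) AS order_count
+FROM silver_orders o                        -- same pipeline: unqualified
+LEFT JOIN ref_catalog.common.dim_calendar c -- external catalog: fully qualified
+  ON o.order_date = c.calendar_date
+GROUP BY o.order_date;
+```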
+ +### Multi-Schema Pattern (One Pipeline) + +Write to multiple schemas from a single pipeline using fully qualified names: + +```sql +-- bronze_orders.sql → writes to bronze schema +CREATE OR REFRESH STREAMING TABLE my_catalog.bronze.raw_orders +AS SELECT *, current_timestamp() AS _ingested_at +FROM STREAM read_files('/Volumes/my_catalog/raw/orders/', format => 'json'); + +-- silver_orders.sql → writes to silver schema, reads from bronze +CREATE OR REFRESH STREAMING TABLE my_catalog.silver.clean_orders +AS SELECT * FROM STREAM my_catalog.bronze.raw_orders +WHERE order_id IS NOT NULL; +``` + +--- + +## Pipeline Parameters + +Reference configuration values in SQL: + +```sql +-- In SQL, use ${variable_name} syntax +CREATE OR REFRESH STREAMING TABLE bronze_orders AS +SELECT * FROM STREAM read_files( + '${input_path}/orders/', + format => 'json' +); +``` + +Define in pipeline configuration (YAML): +```yaml +configuration: + input_path: /Volumes/my_catalog/my_schema/raw +``` + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| Missing `STREAM` keyword | Use `FROM STREAM table_name` for streaming tables | +| Constraint syntax error | Use `CONSTRAINT name EXPECT (condition)` | +| Cluster key not working | Verify column exists, limit to 4 keys | +| Parameter not resolved | Check `${var}` syntax and pipeline configuration | +| Using legacy `LIVE` keyword | Use `CREATE OR REFRESH STREAMING TABLE` \| `MATERIALIZED VIEW`, not `CREATE LIVE TABLE` \| `STREAMING LIVE TABLE` | +| Using `input_file_name()` | Use `_metadata.file_path` | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/2-ingestion.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/2-ingestion.md new file mode 100644 index 0000000..61f98f6 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/2-ingestion.md @@ -0,0 +1,161 @@ +# SQL Data Ingestion + +Data ingestion patterns for cloud storage and streaming sources. + +**Official Documentation:** +- [read_files function reference](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files) +- [Auto Loader options](https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/options) + +--- + +## Auto Loader (Cloud Files) + +Auto Loader incrementally processes new files. Use `STREAM read_files()` in streaming table queries. 
+
+### Basic Pattern
+
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_orders AS
+SELECT
+  *,
+  current_timestamp() AS _ingested_at,
+  _metadata.file_path AS _source_file
+FROM STREAM read_files(
+  '/Volumes/my_catalog/my_schema/raw/orders/',
+  format => 'json',
+  schemaHints => 'order_id STRING, amount DECIMAL(10,2)'
+);
+```
+
+**Key points:**
+- Use `FROM STREAM read_files(...)` for streaming tables (not `FROM read_files(...)` which is batch)
+- `format` supports: `json`, `csv`, `parquet`, `avro`, `text`, `binaryFile`
+- `schemaHints` recommended for production to prevent schema drift
+- `_metadata` provides file path, modification time, size
+
+### Schema Handling
+
+```sql
+-- Explicit hints (recommended for production)
+FROM STREAM read_files(
+  '/Volumes/catalog/schema/raw/',
+  format => 'json',
+  schemaHints => 'id STRING, amount DECIMAL(10,2), date DATE'
+)
+
+-- Schema evolution with rescue data
+FROM STREAM read_files(
+  '/Volumes/catalog/schema/raw/',
+  format => 'json',
+  schemaHints => 'id STRING',
+  mode => 'PERMISSIVE'
+)
+```
+
+### Rescue Data (Quarantine Pattern)
+
+Handle malformed records:
+
+```sql
+-- Flag records with parsing errors
+CREATE OR REFRESH STREAMING TABLE bronze_events AS
+SELECT
+  *,
+  current_timestamp() AS _ingested_at,
+  CASE WHEN _rescued_data IS NOT NULL THEN TRUE ELSE FALSE END AS _has_errors
+FROM STREAM read_files('/Volumes/catalog/schema/raw/events/', format => 'json');
+
+-- Quarantine bad records
+CREATE OR REFRESH STREAMING TABLE bronze_quarantine AS
+SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NOT NULL;
+
+-- Clean records for downstream
+CREATE OR REFRESH STREAMING TABLE silver_clean AS
+SELECT * FROM STREAM bronze_events WHERE _rescued_data IS NULL;
+```
+
+---
+
+## Streaming Sources
+
+### Kafka
+
+```sql
+CREATE OR REFRESH STREAMING TABLE bronze_kafka_events AS
+SELECT
+  CAST(key AS STRING) AS event_key,
+  CAST(value AS STRING) AS event_value,
+  topic, partition, offset,
+  timestamp AS kafka_timestamp,
+  current_timestamp() AS _ingested_at
+FROM STREAM read_kafka(
+  bootstrapServers => '${kafka_brokers}',
+  subscribe => 'events-topic',
+  startingOffsets => 'latest'
+);
+```
+
+**Documentation:** [read_kafka function](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_kafka)
+
+### Parse JSON from Kafka
+
+```sql
+CREATE OR REFRESH STREAMING TABLE silver_events AS
+SELECT
+  from_json(event_value, 'event_id STRING, event_type STRING, timestamp TIMESTAMP') AS data,
+  kafka_timestamp, _ingested_at
+FROM STREAM bronze_kafka_events;
+```
+
+---
+
+## Authentication
+
+### Databricks Secrets
+
+```sql
+-- Kafka
+`kafka.sasl.jaas.config` => '...username="{{secrets/kafka/username}}" password="{{secrets/kafka/password}}";'
+
+-- Event Hub
+`eventhubs.connectionString` => '{{secrets/eventhub/connection-string}}'
+```
+
+### Pipeline Variables
+
+```sql
+-- Reference in SQL
+FROM STREAM read_files('${input_path}/orders/', format => 'json')
+```
+
+Define in pipeline configuration:
+```yaml
+configuration:
+  input_path: /Volumes/my_catalog/my_schema/raw
+```
+
+---
+
+## Best Practices
+
+1. **Always add ingestion metadata:**
+```sql
+SELECT *, current_timestamp() AS _ingested_at, _metadata.file_path AS _source_file
+```
+
+2. **Use schemaHints for production** - prevents unexpected schema changes
+
+3. **Handle rescue data** - route malformed records to quarantine table
+
+4.
**Use STREAM keyword** - `FROM STREAM read_files(...)` for streaming tables + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| Files not picked up | Verify path and format match actual files | +| "Cannot create streaming table from batch query" | Use `FROM STREAM read_files(...)` not `FROM read_files(...)` | +| Schema evolution breaking | Use `mode => 'PERMISSIVE'` and monitor `_rescued_data` | +| Kafka lag increasing | Check downstream bottlenecks | diff --git a/.claude/skills/databricks-spark-declarative-pipelines/2-streaming-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/3-streaming-patterns.md similarity index 70% rename from .claude/skills/databricks-spark-declarative-pipelines/2-streaming-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/3-streaming-patterns.md index c1ec63b..fc42702 100644 --- a/.claude/skills/databricks-spark-declarative-pipelines/2-streaming-patterns.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/3-streaming-patterns.md @@ -1,4 +1,4 @@ -# Streaming Patterns for SDP +# SQL Streaming Patterns Streaming-specific patterns including deduplication, windowed aggregations, late-arriving data handling, and stateful operations. @@ -10,12 +10,12 @@ Streaming-specific patterns including deduplication, windowed aggregations, late ```sql -- Bronze: Ingest all (may contain duplicates) -CREATE OR REPLACE STREAMING TABLE bronze_events AS +CREATE OR REFRESH STREAMING TABLE bronze_events AS SELECT *, current_timestamp() AS _ingested_at -FROM read_stream(...); +FROM STREAM read_files(...); -- Silver: Deduplicate by event_id -CREATE OR REPLACE STREAMING TABLE silver_events_dedup AS +CREATE OR REFRESH STREAMING TABLE silver_events_dedup AS SELECT event_id, user_id, event_type, event_timestamp, _ingested_at FROM ( @@ -32,21 +32,21 @@ WHERE rn = 1; Deduplicate within time window to handle late arrivals: ```sql -CREATE OR REPLACE STREAMING TABLE silver_events_dedup AS +CREATE OR REFRESH STREAMING TABLE silver_events_dedup AS SELECT event_id, user_id, event_type, event_timestamp, MIN(_ingested_at) AS first_seen_at FROM STREAM bronze_events GROUP BY event_id, user_id, event_type, event_timestamp, - window(event_timestamp, '1 hour') -- Deduplicate within 1-hour windows + window(event_timestamp, '1 hour') HAVING COUNT(*) >= 1; ``` ### Composite Key ```sql -CREATE OR REPLACE STREAMING TABLE silver_transactions_dedup AS +CREATE OR REFRESH STREAMING TABLE silver_transactions_dedup AS SELECT transaction_id, customer_id, amount, transaction_timestamp, MIN(_ingested_at) AS _ingested_at @@ -60,9 +60,11 @@ GROUP BY transaction_id, customer_id, amount, transaction_timestamp; ### Tumbling Windows +Non-overlapping fixed-size windows: + ```sql --- 5-minute non-overlapping windows -CREATE OR REPLACE STREAMING TABLE silver_sensor_5min AS +-- 5-minute windows +CREATE OR REFRESH STREAMING TABLE silver_sensor_5min AS SELECT sensor_id, window(event_timestamp, '5 minutes') AS time_window, @@ -78,7 +80,7 @@ GROUP BY sensor_id, window(event_timestamp, '5 minutes'); ```sql -- 1-minute for real-time monitoring -CREATE OR REPLACE STREAMING TABLE gold_sensor_1min AS +CREATE OR REFRESH STREAMING TABLE gold_sensor_1min AS SELECT sensor_id, window(event_timestamp, '1 minute').start AS window_start, @@ -89,7 +91,7 @@ FROM STREAM silver_sensor_data GROUP BY sensor_id, 
window(event_timestamp, '1 minute'); -- 1-hour for trend analysis -CREATE OR REPLACE STREAMING TABLE gold_sensor_1hour AS +CREATE OR REFRESH STREAMING TABLE gold_sensor_1hour AS SELECT sensor_id, window(event_timestamp, '1 hour').start AS window_start, @@ -99,25 +101,35 @@ FROM STREAM silver_sensor_data GROUP BY sensor_id, window(event_timestamp, '1 hour'); ``` +### Session Windows + +Group events into sessions based on inactivity gaps: + +```sql +-- 30-minute inactivity timeout +CREATE OR REFRESH STREAMING TABLE silver_user_sessions AS +SELECT + user_id, + session_window(event_timestamp, '30 minutes') AS session, + MIN(event_timestamp) AS session_start, + MAX(event_timestamp) AS session_end, + COUNT(*) AS event_count, + COLLECT_LIST(event_type) AS event_sequence +FROM STREAM bronze_user_events +GROUP BY user_id, session_window(event_timestamp, '30 minutes'); +``` + --- ## Late-Arriving Data ### Event-Time vs Processing-Time -Always use event timestamp for business logic, not ingestion timestamp: +Always use event timestamp for business logic: ```sql --- ✅ Use event timestamp -CREATE OR REPLACE STREAMING TABLE silver_orders AS -SELECT - order_id, order_timestamp, -- Event time from source - customer_id, amount, - _ingested_at -- Processing time (debugging only) -FROM STREAM bronze_orders; - --- Group by event time -CREATE OR REPLACE STREAMING TABLE gold_daily_orders AS +-- Use event timestamp for aggregations +CREATE OR REFRESH STREAMING TABLE gold_daily_orders AS SELECT CAST(order_timestamp AS DATE) AS order_date, -- Event time COUNT(*) AS order_count, @@ -126,32 +138,23 @@ FROM STREAM silver_orders GROUP BY CAST(order_timestamp AS DATE); ``` -### Handling Out-of-Order with SCD2 - -Use SEQUENCE BY with event timestamp. **Clause order matters**: put `APPLY AS DELETE WHEN` before `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that actually exist in the source (omit `_rescued_data` unless the bronze table uses rescue data). Omit `TRACK HISTORY ON *` if it causes parse errors; the default is equivalent. - +**Keep processing time for debugging:** ```sql -CREATE OR REFRESH STREAMING TABLE silver_customers_history; - -CREATE FLOW customers_scd2_flow AS -AUTO CDC INTO silver_customers_history -FROM stream(bronze_customer_cdc) -KEYS (customer_id) -APPLY AS DELETE WHEN operation = "DELETE" -SEQUENCE BY event_timestamp -- Handles out-of-order -COLUMNS * EXCEPT (operation, _ingested_at, _source_file) -STORED AS SCD TYPE 2; +SELECT + order_id, order_timestamp, -- Event time (business logic) + customer_id, amount, + _ingested_at -- Processing time (debugging only) +FROM STREAM bronze_orders; ``` --- -## Stateful Operations +## Joins ### Stream-to-Stream Joins ```sql --- Join two streaming sources -CREATE OR REPLACE STREAMING TABLE silver_orders_with_payments AS +CREATE OR REFRESH STREAMING TABLE silver_orders_with_payments AS SELECT o.order_id, o.customer_id, o.order_timestamp, o.amount AS order_amount, p.payment_id, p.payment_timestamp, p.payment_method, p.amount AS payment_amount @@ -161,17 +164,19 @@ INNER JOIN STREAM bronze_payments p AND p.payment_timestamp BETWEEN o.order_timestamp AND o.order_timestamp + INTERVAL 1 HOUR; ``` +**Important:** Use time bounds in join condition to limit state retention. 
+ ### Stream-to-Static Joins Enrich streaming data with dimension tables: ```sql --- Static dimension (changes infrequently) +-- Static dimension CREATE OR REPLACE TABLE dim_products AS SELECT * FROM catalog.schema.products; -- Stream-to-static join -CREATE OR REPLACE STREAMING TABLE silver_sales_enriched AS +CREATE OR REFRESH STREAMING TABLE silver_sales_enriched AS SELECT s.sale_id, s.product_id, s.quantity, s.sale_timestamp, p.product_name, p.category, p.price, @@ -180,11 +185,14 @@ FROM STREAM bronze_sales s LEFT JOIN dim_products p ON s.product_id = p.product_id; ``` -### Incremental Aggregations +--- + +## Incremental Aggregations + +### Running Totals ```sql --- Running totals by customer (stateful) -CREATE OR REPLACE STREAMING TABLE silver_customer_running_totals AS +CREATE OR REFRESH STREAMING TABLE silver_customer_running_totals AS SELECT customer_id, SUM(amount) AS total_spent, @@ -196,32 +204,12 @@ GROUP BY customer_id; --- -## Session Windows - -Group events into sessions based on inactivity gaps: - -```sql --- 30-minute inactivity timeout -CREATE OR REPLACE STREAMING TABLE silver_user_sessions AS -SELECT - user_id, - session_window(event_timestamp, '30 minutes') AS session, - MIN(event_timestamp) AS session_start, - MAX(event_timestamp) AS session_end, - COUNT(*) AS event_count, - COLLECT_LIST(event_type) AS event_sequence -FROM STREAM bronze_user_events -GROUP BY user_id, session_window(event_timestamp, '30 minutes'); -``` - ---- - ## Anomaly Detection ### Real-Time Outlier Detection ```sql -CREATE OR REPLACE STREAMING TABLE silver_sensor_with_anomalies AS +CREATE OR REFRESH STREAMING TABLE silver_sensor_with_anomalies AS SELECT sensor_id, event_timestamp, temperature, AVG(temperature) OVER ( @@ -240,7 +228,7 @@ SELECT FROM STREAM bronze_sensor_events; -- Route anomalies for alerting -CREATE OR REPLACE STREAMING TABLE silver_sensor_anomalies AS +CREATE OR REFRESH STREAMING TABLE silver_sensor_anomalies AS SELECT * FROM STREAM silver_sensor_with_anomalies WHERE anomaly_flag IN ('HIGH_OUTLIER', 'LOW_OUTLIER'); @@ -249,7 +237,7 @@ WHERE anomaly_flag IN ('HIGH_OUTLIER', 'LOW_OUTLIER'); ### Threshold-Based Filtering ```sql -CREATE OR REPLACE STREAMING TABLE silver_high_value_transactions AS +CREATE OR REFRESH STREAMING TABLE silver_high_value_transactions AS SELECT transaction_id, customer_id, amount, transaction_timestamp FROM STREAM bronze_transactions WHERE amount > 10000; @@ -257,38 +245,51 @@ WHERE amount > 10000; --- +## Monitoring Lag + +```sql +CREATE OR REFRESH STREAMING TABLE monitoring_lag AS +SELECT + 'kafka_events' AS source, + MAX(kafka_timestamp) AS max_event_timestamp, + current_timestamp() AS processing_timestamp, + (unix_timestamp(current_timestamp()) - unix_timestamp(MAX(kafka_timestamp))) AS lag_seconds +FROM STREAM bronze_kafka_events +GROUP BY window(kafka_timestamp, '1 minute'); +``` + +--- + ## Execution Modes Configure at pipeline level (not in SQL): -**Continuous** (real-time, sub-second latency): ```yaml +# Continuous (real-time, sub-second latency) execution_mode: continuous serverless: true -``` -**Triggered** (scheduled, cost-optimized): -```yaml +# Triggered (scheduled, cost-optimized) execution_mode: triggered schedule: "0 * * * *" # Hourly ``` -**When to use**: +**When to use:** - **Continuous**: Real-time dashboards, alerting, sub-minute SLAs - **Triggered**: Daily/hourly reports, batch processing --- -## Key Patterns +## Best Practices ### 1. 
Use Event Timestamps

```sql
--- ✅ Event timestamp for logic
+-- Correct: Event timestamp for logic
GROUP BY date_trunc('hour', event_timestamp)

--- ❌ Processing timestamp
-GROUP BY date_trunc('hour', _ingested_at)
+-- Avoid: Processing timestamp
+-- GROUP BY date_trunc('hour', _ingested_at)
```

### 2. Window Size Selection
@@ -302,10 +303,10 @@ GROUP BY date_trunc('hour', _ingested_at)

Higher cardinality = more state:

```sql
--- High state: 1M users × 10K products × 100M sessions
+-- High state: 1M users x 10K products x 100M sessions
GROUP BY user_id, product_id, session_id

--- Lower state: 1M users × 100 categories × days
+-- Lower state: 1M users x 100 categories x days
GROUP BY user_id, product_category, DATE(event_time)
```

@@ -317,32 +318,19 @@ Apply at bronze → silver transition:

```sql
-- Bronze: Accept duplicates
-CREATE OR REPLACE STREAMING TABLE bronze_events AS
-SELECT * FROM read_stream(...);
+CREATE OR REFRESH STREAMING TABLE bronze_events AS
+SELECT * FROM STREAM read_files(...);

-- Silver: Deduplicate immediately
-CREATE OR REPLACE STREAMING TABLE silver_events AS
+CREATE OR REFRESH STREAMING TABLE silver_events AS
SELECT DISTINCT event_id, event_type, event_timestamp, user_id
FROM STREAM bronze_events;

-- Gold: Work with clean data
-CREATE OR REPLACE STREAMING TABLE gold_metrics AS
+CREATE OR REFRESH STREAMING TABLE gold_metrics AS
SELECT ... FROM STREAM silver_events;
```

-### 5. Monitor Lag
-
-```sql
-CREATE OR REPLACE STREAMING TABLE monitoring_lag AS
-SELECT
-  'kafka_events' AS source,
-  MAX(kafka_timestamp) AS max_event_timestamp,
-  current_timestamp() AS processing_timestamp,
-  (unix_timestamp(current_timestamp()) - unix_timestamp(MAX(kafka_timestamp))) AS lag_seconds
-FROM STREAM bronze_kafka_events
-GROUP BY window(kafka_timestamp, '1 minute');
-```
-
---

## Common Issues
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/4-cdc-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/4-cdc-patterns.md
new file mode 100644
index 0000000..d9977c2
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/4-cdc-patterns.md
@@ -0,0 +1,323 @@
+# SQL CDC Patterns (AUTO CDC & SCD)
+
+Change Data Capture patterns using AUTO CDC for SCD Type 1 and Type 2, plus querying SCD history tables.
+
+---
+
+## Overview
+
+AUTO CDC automatically handles Change Data Capture, tracking changes as Slowly Changing Dimensions (SCD). It provides automatic deduplication, change tracking, and correct handling of late-arriving data.
+ +**Where to apply AUTO CDC:** +- **Silver layer**: When business users need deduplicated or historical data +- **Gold layer**: When implementing dimensional modeling (star schema) + +--- + +## SCD Type 1 vs Type 2 + +### SCD Type 1 (In-place updates) +- **Overwrites** old values with new values +- **No history preserved** - only current state +- **Use for**: Error corrections, attributes where history doesn't matter +- **Syntax**: `STORED AS SCD TYPE 1` + +### SCD Type 2 (History tracking) +- **Creates new row** for each change +- **Preserves full history** with `__START_AT` and `__END_AT` timestamps +- **Use for**: Tracking changes over time (addresses, prices, roles) +- **Syntax**: `STORED AS SCD TYPE 2` + +--- + +## Creating AUTO CDC Flows + +### SCD Type 2 + +```sql +-- Step 1: Create target table +CREATE OR REFRESH STREAMING TABLE dim_customers; + +-- Step 2: Create AUTO CDC flow +CREATE FLOW customers_scd2_flow AS +AUTO CDC INTO dim_customers +FROM stream(customers_cdc_clean) +KEYS (customer_id) +APPLY AS DELETE WHEN operation = "DELETE" +SEQUENCE BY event_timestamp +COLUMNS * EXCEPT (operation, _ingested_at, _source_file) +STORED AS SCD TYPE 2; +``` + +**Important:** Put `APPLY AS DELETE WHEN` before `SEQUENCE BY`. Only list columns in `COLUMNS * EXCEPT (...)` that exist in the source. + +### SCD Type 1 + +```sql +-- Step 1: Create target table +CREATE OR REFRESH STREAMING TABLE orders_current; + +-- Step 2: Create AUTO CDC flow +CREATE FLOW orders_scd1_flow AS +AUTO CDC INTO orders_current +FROM stream(orders_clean) +KEYS (order_id) +SEQUENCE BY updated_timestamp +COLUMNS * EXCEPT (_ingested_at) +STORED AS SCD TYPE 1; +``` + +### Selective History Tracking + +Track history only when specific columns change: + +```sql +CREATE FLOW products_scd2_flow AS +AUTO CDC INTO products_history +FROM stream(products_clean) +KEYS (product_id) +SEQUENCE BY modified_at +COLUMNS * EXCEPT (operation) +STORED AS SCD TYPE 2 +TRACK HISTORY ON price, cost; +``` + +When `price` or `cost` changes, a new version is created. Other column changes update the current record without new versions. 
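+
+For example, after a single price change the table holds two versions of the product, bounded by the `__START_AT`/`__END_AT` columns described in the next section (a hypothetical inspection query; the product ID is a placeholder):
+
+```sql
+SELECT product_id, price, cost, __START_AT, __END_AT
+FROM products_history
+WHERE product_id = 'P-1001'
+ORDER BY __START_AT;
+```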
+ +--- + +## Complete Pattern: Clean + AUTO CDC + +### Step 1: Clean and Validate Source Data + +```sql +CREATE OR REFRESH STREAMING TABLE customers_cdc_clean AS +SELECT + customer_id, + customer_name, + email, + phone, + address, + CAST(updated_at AS TIMESTAMP) AS event_timestamp, + operation +FROM STREAM bronze_customers_cdc +WHERE customer_id IS NOT NULL + AND email IS NOT NULL; +``` + +### Step 2: Apply AUTO CDC + +```sql +CREATE OR REFRESH STREAMING TABLE dim_customers; + +CREATE FLOW customers_scd2_flow AS +AUTO CDC INTO dim_customers +FROM stream(customers_cdc_clean) +KEYS (customer_id) +APPLY AS DELETE WHEN operation = "DELETE" +SEQUENCE BY event_timestamp +COLUMNS * EXCEPT (operation) +STORED AS SCD TYPE 2; +``` + +--- + +## Querying SCD Type 2 Tables + +SCD Type 2 tables include temporal columns: +- `__START_AT` - When this version became effective +- `__END_AT` - When this version expired (NULL for current) + +### Current State + +```sql +-- All current records +CREATE OR REFRESH MATERIALIZED VIEW dim_customers_current AS +SELECT + customer_id, customer_name, email, phone, address, + __START_AT AS valid_from +FROM dim_customers +WHERE __END_AT IS NULL; + +-- Specific customer +SELECT * +FROM dim_customers +WHERE customer_id = '12345' + AND __END_AT IS NULL; +``` + +### Point-in-Time Queries + +Get state as of a specific date: + +```sql +-- Products as of January 1, 2024 +CREATE OR REFRESH MATERIALIZED VIEW products_as_of_2024_01_01 AS +SELECT + product_id, product_name, price, category, + __START_AT, __END_AT +FROM products_history +WHERE __START_AT <= '2024-01-01' + AND (__END_AT > '2024-01-01' OR __END_AT IS NULL); +``` + +### Change Analysis + +Track all changes for an entity: + +```sql +SELECT + customer_id, customer_name, email, phone, + __START_AT, __END_AT, + COALESCE( + DATEDIFF(DAY, __START_AT, __END_AT), + DATEDIFF(DAY, __START_AT, CURRENT_TIMESTAMP()) + ) AS days_active +FROM dim_customers +WHERE customer_id = '12345' +ORDER BY __START_AT DESC; +``` + +Changes within a time period: + +```sql +-- Customers who changed during Q1 2024 +SELECT + customer_id, customer_name, + __START_AT AS change_timestamp, + 'UPDATE' AS change_type +FROM dim_customers +WHERE __START_AT BETWEEN '2024-01-01' AND '2024-03-31' + AND __START_AT != ( + SELECT MIN(__START_AT) + FROM dim_customers ch2 + WHERE ch2.customer_id = dim_customers.customer_id + ) +ORDER BY __START_AT; +``` + +--- + +## Joining Facts with Historical Dimensions + +### At Transaction Time + +```sql +-- Join sales with product prices at time of sale +CREATE OR REFRESH MATERIALIZED VIEW sales_with_historical_prices AS +SELECT + s.sale_id, s.product_id, s.sale_date, s.quantity, + p.product_name, p.price AS unit_price_at_sale_time, + s.quantity * p.price AS calculated_amount, + p.category +FROM sales_fact s +INNER JOIN products_history p + ON s.product_id = p.product_id + AND s.sale_date >= p.__START_AT + AND (s.sale_date < p.__END_AT OR p.__END_AT IS NULL); +``` + +### With Current Dimension + +```sql +CREATE OR REFRESH MATERIALIZED VIEW sales_with_current_prices AS +SELECT + s.sale_id, s.product_id, s.sale_date, s.quantity, + s.amount AS amount_at_sale, + p.product_name AS current_product_name, + p.price AS current_price +FROM sales_fact s +INNER JOIN products_history p + ON s.product_id = p.product_id + AND p.__END_AT IS NULL; +``` + +--- + +## Optimization Patterns + +### Pre-Filter Materialized Views + +```sql +-- Current state view (most common pattern) +CREATE OR REFRESH MATERIALIZED VIEW dim_products_current AS +SELECT 
* FROM products_history WHERE __END_AT IS NULL; + +-- Recent changes only +CREATE OR REFRESH MATERIALIZED VIEW dim_recent_changes AS +SELECT * FROM products_history +WHERE __START_AT >= CURRENT_DATE() - INTERVAL 90 DAYS; + +-- Change frequency stats +CREATE OR REFRESH MATERIALIZED VIEW product_change_stats AS +SELECT + product_id, + COUNT(*) AS version_count, + MIN(__START_AT) AS first_seen, + MAX(__START_AT) AS last_updated +FROM products_history +GROUP BY product_id; +``` + +--- + +## Best Practices + +### 1. Filter by __END_AT for Current + +```sql +-- Efficient +WHERE __END_AT IS NULL + +-- Less efficient +WHERE __START_AT = (SELECT MAX(__START_AT) FROM table WHERE ...) +``` + +### 2. Use Inclusive Lower, Exclusive Upper + +```sql +WHERE __START_AT <= '2024-01-01' + AND (__END_AT > '2024-01-01' OR __END_AT IS NULL) +``` + +### 3. Clean Data Before AUTO CDC + +Apply type casting, validation, and filtering first: + +```sql +-- Clean source +CREATE OR REFRESH STREAMING TABLE users_clean AS +SELECT + user_id, + TRIM(email) AS email, + CAST(updated_at AS TIMESTAMP) AS updated_timestamp +FROM STREAM bronze_users +WHERE user_id IS NOT NULL AND email IS NOT NULL; + +-- Then apply AUTO CDC +CREATE FLOW users_scd2_flow AS +AUTO CDC INTO dim_users +FROM stream(users_clean) +KEYS (user_id) +SEQUENCE BY updated_timestamp +STORED AS SCD TYPE 2; +``` + +### 4. Choose the Right SCD Type + +- **Type 2**: Need to query historical states +- **Type 1**: Only need current state or deduplication + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| Multiple rows for same key | Missing `__END_AT IS NULL` filter for current state | +| Point-in-time no results | Use `__START_AT <= date AND (__END_AT > date OR __END_AT IS NULL)` | +| Slow temporal join | Create materialized view for specific time period | +| Unexpected duplicates | Multiple changes same day - use SEQUENCE BY with high precision | +| Parse error on AUTO CDC | Put `APPLY AS DELETE WHEN` before `SEQUENCE BY` | +| Columns not in target | Only list existing columns in `COLUMNS * EXCEPT (...)` | +| Type syntax error | Use `SCD TYPE 1` or `SCD TYPE 2` (not quoted) | diff --git a/.claude/skills/databricks-spark-declarative-pipelines/4-performance-tuning.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/5-performance.md similarity index 72% rename from .claude/skills/databricks-spark-declarative-pipelines/4-performance-tuning.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/5-performance.md index bd1c1dc..aa9ffaf 100644 --- a/.claude/skills/databricks-spark-declarative-pipelines/4-performance-tuning.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/references/sql/5-performance.md @@ -1,14 +1,14 @@ -# Performance Tuning for SDP +# SQL Performance Tuning -Performance optimization strategies including **Liquid Clustering** (modern approach), materialized view refresh, state management, and compute configuration. +Performance optimization strategies including Liquid Clustering, materialized view refresh, state management, and compute configuration. --- ## Liquid Clustering (Recommended) -**Liquid Clustering** is the recommended approach for data layout optimization. It replaces manual `PARTITION BY` and `Z-ORDER`. +Liquid Clustering is the recommended approach for data layout optimization. It replaces manual `PARTITION BY` and `Z-ORDER`. 
-### What is Liquid Clustering? +### Benefits - **Adaptive**: Adjusts to data distribution changes - **Multi-dimensional**: Clusters on multiple columns simultaneously @@ -17,32 +17,22 @@ Performance optimization strategies including **Liquid Clustering** (modern appr ### Basic Syntax -**SQL**: ```sql -CREATE OR REPLACE STREAMING TABLE bronze_events +CREATE OR REFRESH STREAMING TABLE bronze_events CLUSTER BY (event_type, event_date) AS SELECT *, current_timestamp() AS _ingested_at, CAST(current_date() AS DATE) AS event_date -FROM read_files('/mnt/raw/events/', format => 'json'); +FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/events/', format => 'json'); ``` -**Python**: -```python -from pyspark import pipelines as dp - -@dp.table(cluster_by=["event_type", "event_date"]) -def bronze_events(): - return spark.readStream.format("cloudFiles").load("/data") -``` - -### Automatic Cluster Key Selection +### Automatic Key Selection ```sql -- Let Databricks choose based on query patterns -CREATE OR REPLACE STREAMING TABLE bronze_events +CREATE OR REFRESH STREAMING TABLE bronze_events CLUSTER BY (AUTO) AS SELECT ...; ``` @@ -59,7 +49,7 @@ AS SELECT ...; Cluster by event type + date: ```sql -CREATE OR REPLACE STREAMING TABLE bronze_events +CREATE OR REFRESH STREAMING TABLE bronze_events CLUSTER BY (event_type, ingestion_date) TBLPROPERTIES ('delta.autoOptimize.optimizeWrite' = 'true') AS @@ -67,7 +57,7 @@ SELECT *, current_timestamp() AS _ingested_at, CAST(current_date() AS DATE) AS ingestion_date -FROM read_files('/mnt/raw/events/', format => 'json'); +FROM STREAM read_files('/Volumes/my_catalog/my_schema/raw/events/', format => 'json'); ``` **Why**: Bronze filtered by event type for processing and by date for incremental loads. @@ -77,11 +67,12 @@ FROM read_files('/mnt/raw/events/', format => 'json'); Cluster by primary key + business dimension: ```sql -CREATE OR REPLACE STREAMING TABLE silver_orders +CREATE OR REFRESH STREAMING TABLE silver_orders CLUSTER BY (customer_id, order_date) AS SELECT - order_id, customer_id, product_id, amount, + order_id, customer_id, product_id, + CAST(amount AS DECIMAL(10,2)) AS amount, -- DECIMAL for monetary values CAST(order_timestamp AS DATE) AS order_date, order_timestamp FROM STREAM bronze_orders; @@ -94,7 +85,7 @@ FROM STREAM bronze_orders; Cluster by aggregation dimensions: ```sql -CREATE OR REPLACE MATERIALIZED VIEW gold_sales_summary +CREATE OR REFRESH MATERIALIZED VIEW gold_sales_summary CLUSTER BY (product_category, year_month) AS SELECT @@ -117,7 +108,7 @@ GROUP BY product_category, DATE_FORMAT(order_date, 'yyyy-MM'); | **Silver** | primary_key, business_date | Entity lookups + time ranges | | **Gold** | aggregation_dimensions | Dashboard filters | -**Best practices**: +**Best practices:** - First key: Most selective filter (e.g., customer_id) - Second key: Next common filter (e.g., date) - Order matters: Most selective first @@ -131,7 +122,7 @@ GROUP BY product_category, DATE_FORMAT(order_date, 'yyyy-MM'); ### Before (Legacy) ```sql -CREATE OR REPLACE STREAMING TABLE events +CREATE OR REFRESH STREAMING TABLE events PARTITIONED BY (date DATE) TBLPROPERTIES ('pipelines.autoOptimize.zOrderCols' = 'user_id,event_type') AS SELECT ...; @@ -139,10 +130,10 @@ AS SELECT ...; **Issues**: Fixed keys, small file problem, skewed distribution, manual OPTIMIZE required. 
-### After (Modern with Liquid Clustering) +### After (Modern) ```sql -CREATE OR REPLACE STREAMING TABLE events +CREATE OR REFRESH STREAMING TABLE events CLUSTER BY (date, user_id, event_type) AS SELECT ...; ``` @@ -157,8 +148,6 @@ AS SELECT ...; 3. **Compatibility**: Older Delta Lake versions (< DBR 13.3) 4. **Existing large tables**: Migration cost outweighs benefits -**Otherwise, prefer Liquid Clustering.** - --- ## Table Properties @@ -166,20 +155,18 @@ AS SELECT ...; ### Auto-Optimize ```sql -CREATE OR REPLACE STREAMING TABLE bronze_events +CREATE OR REFRESH STREAMING TABLE bronze_events TBLPROPERTIES ( 'delta.autoOptimize.optimizeWrite' = 'true', 'delta.autoOptimize.autoCompact' = 'true' ) -AS SELECT * FROM read_files(...); +AS SELECT * FROM STREAM read_files(...); ``` -**Benefits**: Reduces small files, improves reads, automatic compaction. - ### Change Data Feed ```sql -CREATE OR REPLACE STREAMING TABLE silver_customers +CREATE OR REFRESH STREAMING TABLE silver_customers TBLPROPERTIES ('delta.enableChangeDataFeed' = 'true') AS SELECT * FROM STREAM bronze_customers; ``` @@ -189,12 +176,12 @@ AS SELECT * FROM STREAM bronze_customers; ### Retention Periods ```sql -CREATE OR REPLACE STREAMING TABLE bronze_high_volume +CREATE OR REFRESH STREAMING TABLE bronze_high_volume TBLPROPERTIES ( 'delta.logRetentionDuration' = '7 days', 'delta.deletedFileRetentionDuration' = '7 days' ) -AS SELECT * FROM read_files(...); +AS SELECT * FROM STREAM read_files(...); ``` **Use for**: High-volume tables to reduce storage costs. @@ -206,8 +193,8 @@ AS SELECT * FROM read_files(...); ### Refresh Frequency ```sql --- Near-real-time (frequent) -CREATE OR REPLACE MATERIALIZED VIEW gold_live_metrics +-- Near-real-time +CREATE OR REFRESH MATERIALIZED VIEW gold_live_metrics REFRESH EVERY 5 MINUTES AS SELECT @@ -217,8 +204,8 @@ SELECT FROM silver_metrics GROUP BY metric_name; --- Daily reports (scheduled) -CREATE OR REPLACE MATERIALIZED VIEW gold_daily_summary +-- Daily reports +CREATE OR REFRESH MATERIALIZED VIEW gold_daily_summary REFRESH EVERY 1 DAY AS SELECT report_date, SUM(amount) AS total_amount @@ -226,13 +213,12 @@ FROM silver_sales GROUP BY report_date; ``` -### Incremental Refresh (Automatic) +### Incremental Refresh Materialized views auto-use incremental refresh when possible: ```sql --- Refreshes incrementally if source has row tracking -CREATE OR REPLACE MATERIALIZED VIEW gold_aggregates AS +CREATE OR REFRESH MATERIALIZED VIEW gold_aggregates AS SELECT product_id, SUM(quantity) AS total_quantity, @@ -246,8 +232,8 @@ GROUP BY product_id; ### Pre-Aggregation ```sql --- Instead of querying large table repeatedly -CREATE OR REPLACE MATERIALIZED VIEW orders_monthly AS +-- Create pre-aggregated MV for fast queries +CREATE OR REFRESH MATERIALIZED VIEW orders_monthly AS SELECT customer_id, YEAR(order_date) AS year, @@ -282,7 +268,6 @@ GROUP BY user_id, product_id, session_id; -- Massive state! 
**Strategy 1: Reduce cardinality** ```sql --- Aggregate at higher level SELECT user_id, product_category, -- 100 categories (not 10K products) @@ -295,7 +280,6 @@ GROUP BY user_id, product_category, DATE(event_time); **Strategy 2: Use time windows** ```sql --- Bounded state with windows SELECT user_id, window(event_time, '1 hour') AS time_window, @@ -308,7 +292,7 @@ GROUP BY user_id, window(event_time, '1 hour'); ```sql -- Streaming aggregation (maintains state) -CREATE OR REPLACE STREAMING TABLE user_daily_stats AS +CREATE OR REFRESH STREAMING TABLE user_daily_stats AS SELECT user_id, DATE(event_time) AS event_date, @@ -317,7 +301,7 @@ FROM STREAM bronze_events GROUP BY user_id, DATE(event_time); -- Batch aggregation (no streaming state) -CREATE OR REPLACE MATERIALIZED VIEW user_monthly_stats AS +CREATE OR REFRESH MATERIALIZED VIEW user_monthly_stats AS SELECT user_id, DATE_TRUNC('month', event_date) AS month, @@ -334,10 +318,10 @@ GROUP BY user_id, DATE_TRUNC('month', event_date); ```sql -- Small static dimension, large streaming fact -CREATE OR REPLACE STREAMING TABLE sales_enriched AS +CREATE OR REFRESH STREAMING TABLE sales_enriched AS SELECT s.sale_id, s.product_id, s.amount, - p.product_name, p.category -- From small static table + p.product_name, p.category FROM STREAM bronze_sales s LEFT JOIN dim_products p ON s.product_id = p.product_id; ``` @@ -348,7 +332,7 @@ LEFT JOIN dim_products p ON s.product_id = p.product_id; ```sql -- Time bounds limit state retention -CREATE OR REPLACE STREAMING TABLE orders_with_payments AS +CREATE OR REFRESH STREAMING TABLE orders_with_payments AS SELECT o.order_id, o.amount AS order_amount, p.payment_id, p.amount AS payment_amount @@ -358,32 +342,6 @@ INNER JOIN STREAM bronze_payments p AND p.payment_time BETWEEN o.order_time AND o.order_time + INTERVAL 1 HOUR; ``` -**Optimization**: Use time bounds in join condition. - ---- - -## Compute Configuration - -### Serverless vs Classic - -| Aspect | Serverless | Classic | -|--------|-----------|---------| -| Startup | Fast (seconds) | Slower (minutes) | -| Scaling | Automatic, instant | Manual/autoscaling | -| Cost | Pay-per-use | Pay for cluster time | -| Best for | Variable workloads, dev/test | Steady workloads | - -### Serverless (Recommended) - -Enable at pipeline level: - -```yaml -execution_mode: continuous # or triggered -serverless: true -``` - -**Advantages**: No cluster management, instant scaling, lower cost for bursty workloads. 
- --- ## Query Optimization @@ -391,47 +349,53 @@ serverless: true ### Filter Early ```sql --- ✅ Filter at source -CREATE OR REPLACE STREAMING TABLE silver_recent AS +-- Filter at source +CREATE OR REFRESH STREAMING TABLE silver_recent AS SELECT * FROM STREAM bronze_events WHERE event_date >= CURRENT_DATE() - INTERVAL 7 DAYS; --- ❌ Filter late -CREATE OR REPLACE STREAMING TABLE silver_all AS -SELECT * FROM STREAM bronze_events; - -CREATE OR REPLACE MATERIALIZED VIEW gold_recent AS -SELECT * FROM silver_all -WHERE event_date >= CURRENT_DATE() - INTERVAL 7 DAYS; +-- Avoid filtering late +-- CREATE OR REFRESH STREAMING TABLE silver_all AS SELECT * FROM STREAM bronze_events; +-- CREATE OR REFRESH MATERIALIZED VIEW gold_recent AS SELECT * FROM silver_all WHERE ...; ``` ### Select Specific Columns ```sql --- ❌ Reads all columns -SELECT * FROM large_table; - --- ✅ Only needed columns +-- Only needed columns SELECT customer_id, order_date, amount FROM large_table; + +-- Avoid SELECT * +-- SELECT * FROM large_table; ``` -### Use GROUP BY Over DISTINCT +--- -```sql --- ❌ Expensive on high-cardinality -SELECT DISTINCT transaction_id FROM huge_table; +## Compute Configuration + +### Serverless vs Classic + +| Aspect | Serverless | Classic | +|--------|-----------|---------| +| Startup | Fast (seconds) | Slower (minutes) | +| Scaling | Automatic, instant | Manual/autoscaling | +| Cost | Pay-per-use | Pay for cluster time | +| Best for | Variable workloads, dev/test | Steady workloads | + +### Serverless (Recommended) --- ✅ Better -SELECT transaction_id, COUNT(*) FROM huge_table GROUP BY transaction_id; +Enable at pipeline level: + +```yaml +execution_mode: continuous # or triggered +serverless: true ``` --- ## Monitoring -Track key metrics: - ```sql -- Data freshness SELECT @@ -455,7 +419,7 @@ GROUP BY table_name; | Issue | Solution | |-------|----------| -| Pipeline running slowly | Check partitioning, state size, join patterns | +| Pipeline running slowly | Check clustering, state size, join patterns | | High memory usage | Unbounded state - add time windows, reduce cardinality | | Many small files | Enable auto-optimize, run OPTIMIZE command | | Expensive queries on large tables | Add clustering, create filtered MVs | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/scripts/exploration_notebook.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/scripts/exploration_notebook.py new file mode 100644 index 0000000..f3f6785 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-declarative-pipelines/scripts/exploration_notebook.py @@ -0,0 +1,81 @@ +# Databricks notebook source +# MAGIC %md +# MAGIC # Data Exploration Notebook +# MAGIC +# MAGIC Explore raw data in Volumes before building pipeline transformations. +# MAGIC +# MAGIC **Note:** Pipeline transformations should use raw `.sql` or `.py` files, NOT notebooks. + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## 1. Explore Raw Files in Volume +# MAGIC +# MAGIC Query raw parquet/json files directly to understand the data structure. 
+ +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Preview raw orders data +# MAGIC SELECT * FROM parquet.`/Volumes/my_catalog/my_schema/raw/orders/` LIMIT 100 + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Check schema and sample values +# MAGIC DESCRIBE SELECT * FROM parquet.`/Volumes/my_catalog/my_schema/raw/orders/` + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Data quality: nulls, distinct values, date range +# MAGIC SELECT +# MAGIC COUNT(*) AS total_rows, +# MAGIC COUNT(order_id) AS non_null_order_id, +# MAGIC COUNT(DISTINCT customer_id) AS unique_customers, +# MAGIC MIN(order_date) AS min_date, +# MAGIC MAX(order_date) AS max_date +# MAGIC FROM parquet.`/Volumes/my_catalog/my_schema/raw/orders/` + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## 2. Explore Another Raw Source + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Preview raw customers data +# MAGIC SELECT * FROM parquet.`/Volumes/my_catalog/my_schema/raw/customers/` LIMIT 100 + +# COMMAND ---------- + +# MAGIC %md +# MAGIC ## 3. Join Raw Data for Exploration +# MAGIC +# MAGIC Test joins before building the pipeline. + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Join orders with customers to validate keys +# MAGIC SELECT +# MAGIC o.order_id, +# MAGIC o.order_date, +# MAGIC o.amount, +# MAGIC c.customer_name, +# MAGIC c.email +# MAGIC FROM parquet.`/Volumes/my_catalog/my_schema/raw/orders/` o +# MAGIC LEFT JOIN parquet.`/Volumes/my_catalog/my_schema/raw/customers/` c +# MAGIC ON o.customer_id = c.customer_id +# MAGIC LIMIT 100 + +# COMMAND ---------- + +# MAGIC %sql +# MAGIC -- Check for orphan orders (no matching customer) +# MAGIC SELECT COUNT(*) AS orphan_orders +# MAGIC FROM parquet.`/Volumes/my_catalog/my_schema/raw/orders/` o +# MAGIC LEFT JOIN parquet.`/Volumes/my_catalog/my_schema/raw/customers/` c +# MAGIC ON o.customer_id = c.customer_id +# MAGIC WHERE c.customer_id IS NULL diff --git a/.claude/skills/databricks-spark-structured-streaming/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/SKILL.md similarity index 85% rename from .claude/skills/databricks-spark-structured-streaming/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/SKILL.md index b1f5930..ddb52a0 100644 --- a/.claude/skills/databricks-spark-structured-streaming/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/SKILL.md @@ -1,6 +1,6 @@ --- name: databricks-spark-structured-streaming -description: Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, implementing real-time data processing, handling stateful operations, or optimizing streaming performance. +description: "Comprehensive guide to Spark Structured Streaming for production workloads. Use when building streaming pipelines, working with Kafka ingestion, implementing Real-Time Mode (RTM), configuring triggers (processingTime, availableNow), handling stateful operations with watermarks, optimizing checkpoints, performing stream-stream or stream-static joins, writing to multiple sinks, or tuning streaming cost and performance." 
--- # Spark Structured Streaming diff --git a/.claude/skills/databricks-spark-structured-streaming/checkpoint-best-practices.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/checkpoint-best-practices.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/checkpoint-best-practices.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/checkpoint-best-practices.md diff --git a/.claude/skills/databricks-spark-structured-streaming/kafka-streaming.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/kafka-streaming.md similarity index 95% rename from .claude/skills/databricks-spark-structured-streaming/kafka-streaming.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/kafka-streaming.md index 83630e8..9731434 100644 --- a/.claude/skills/databricks-spark-structured-streaming/kafka-streaming.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/kafka-streaming.md @@ -140,30 +140,30 @@ df_bronze.writeStream \ ### Pattern 3: Real-Time Mode (Sub-Second Latency) -Use RTM for < 800ms latency requirements: +Use RTM for sub-second (as low as 5ms) latency requirements. Requires DBR 16.4 LTS+: ```python -# Real-time trigger (Databricks 13.3+) +# Real-time trigger (DBR 16.4 LTS+) +# Requirements: dedicated cluster, no autoscaling, no Photon, outputMode("update") +# Spark config on cluster: spark.databricks.streaming.realTimeMode.enabled = true query = (enriched_df .select(col("key"), col("value")) .writeStream .format("kafka") .option("kafka.bootstrap.servers", brokers) .option("topic", "output-events") - .trigger(realTime=True) # Enable RTM + .outputMode("update") # RTM only supports update mode + .trigger(realTime="5 minutes") # PySpark requires specifying the checkpoint interval .option("checkpointLocation", checkpoint_path) .start() ) -# RTM Cluster Requirements -spark.conf.set("spark.databricks.photon.enabled", "true") -spark.conf.set("spark.sql.streaming.stateStore.providerClass", - "com.databricks.sql.streaming.state.RocksDBStateProvider") - # When to use RTM: -# - Latency < 800ms required -# - Photon enabled -# - Fixed-size cluster (no autoscaling) +# - Sub-second latency required (achieves as low as 5ms E2E) +# - Photon must be DISABLED (not supported with RTM) +# - Autoscaling must be DISABLED +# - Dedicated (single-user) cluster only +# - forEachBatch is NOT supported in RTM ``` ### Pattern 4: Event Enrichment (Kafka to Kafka with Delta) diff --git a/.claude/skills/databricks-spark-structured-streaming/merge-operations.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/merge-operations.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/merge-operations.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/merge-operations.md diff --git a/.claude/skills/databricks-spark-structured-streaming/multi-sink-writes.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/multi-sink-writes.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/multi-sink-writes.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/multi-sink-writes.md diff --git 
a/.claude/skills/databricks-spark-structured-streaming/stateful-operations.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stateful-operations.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/stateful-operations.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stateful-operations.md diff --git a/.claude/skills/databricks-spark-structured-streaming/stream-static-joins.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stream-static-joins.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/stream-static-joins.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stream-static-joins.md diff --git a/.claude/skills/databricks-spark-structured-streaming/stream-stream-joins.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stream-stream-joins.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/stream-stream-joins.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/stream-stream-joins.md diff --git a/.claude/skills/databricks-spark-structured-streaming/streaming-best-practices.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/streaming-best-practices.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/streaming-best-practices.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/streaming-best-practices.md diff --git a/.claude/skills/databricks-spark-structured-streaming/trigger-and-cost-optimization.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/trigger-and-cost-optimization.md similarity index 100% rename from .claude/skills/databricks-spark-structured-streaming/trigger-and-cost-optimization.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-spark-structured-streaming/trigger-and-cost-optimization.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/SKILL.md new file mode 100644 index 0000000..c046e48 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/SKILL.md @@ -0,0 +1,261 @@ +--- +name: databricks-synthetic-data-gen +description: "Generate realistic synthetic data using Spark + Faker (strongly recommended). Supports serverless execution, multiple output formats (Parquet/JSON/CSV/Delta), and scales from thousands to millions of rows. For small datasets (<10K rows), can optionally generate locally and upload to volumes. Use when user mentions 'synthetic data', 'test data', 'generate data', 'demo dataset', 'Faker', or 'sample data'." +--- + +> Catalog and schema are **always user-supplied** — never default to any value. If the user hasn't provided them, ask. For any UC write, **always create the schema if it doesn't exist** before writing data. + +# Databricks Synthetic Data Generation + +Generate realistic, story-driven synthetic data for Databricks using **Spark + Faker + Pandas UDFs** (strongly recommended). 
+
+## Data Must Tell a Business Story
+
+Synthetic data should demonstrate how Databricks helps solve real business problems.
+
+**The pattern:** Something goes wrong → business impact ($) → analyze root cause → identify affected customers → fix and prevent.
+
+**Key principles:**
+- **Problem → Impact → Analysis → Solution** — Include an incident, anomaly, or issue that causes measurable business impact. The data lets you find the root cause and act on it.
+- **Industry-relevant but simple** — Use domain terms (e.g., "SLA breach", "churn", "stockout") but keep the schema easy to understand. A few tables, clear relationships.
+- **Business metrics with $ impact** — Revenue, MRR, cost, conversion rate. Every story needs a dollar sign to show why it matters.
+- **Tables explain each other** — Ticket spike? Incident table shows the outage. Revenue drop? Churn table shows who left and why. All data connects.
+- **Actionable insights** — Data should answer: What happened? Who's affected? How much did it cost? How do we prevent it?
+
+**Why no flat distributions:** Uniform data has no story — no spikes, no anomalies, no cohorts, no 20/80, no skew, nothing to investigate. It can't show Databricks' value for root cause analysis.
+
+## References
+
+| When | Guide |
+|------|-------|
+| User mentions **ML model training** or complex time patterns | [references/1-data-patterns.md](references/1-data-patterns.md) — ML-ready data, time multipliers, row coherence |
+| Errors during generation | [references/2-troubleshooting.md](references/2-troubleshooting.md) — Fixing common issues |
+
+## Critical Rules
+
+1. **Data tells a story** — Something goes wrong, impacts $, can be analyzed and fixed. Show Databricks value.
+2. **All data serves the story** — Every table and column must be coherent and usable in dashboards or ML models. No orphan data, no random noise — if it doesn't help explain an incident, populate a future dashboard, or feed a prediction, don't generate it.
+3. **Industry terms, simple schema** — Use domain-specific vocabulary but keep it easy to understand (few tables, clear relationships)
+4. **Never uniform distributions** — Skewed categories, log-normal amounts, 80/20 patterns. Flat = no story = useless
+5. **Enough data for trends** — ~100K+ rows for main tables so patterns survive aggregation
+6. **Ask for catalog/schema** — Never default, always confirm before generating
+7. **Present plan for approval** — Show tables, distributions, assumptions before writing code
+8. **Master tables first** — Generate parent tables, write to Delta, then create children with valid FKs
+9. **Use Spark + Faker + Pandas UDFs** — Scalable, parallel. Polars only if user explicitly wants local + <30K rows
+10. **Use Databricks Connect Serverless by default to generate data** — Update databricks-connect on Python 3.12 if required (avoid `execute_code` unless instructed not to use Databricks Connect)
+11. **No `.cache()` or `.persist()`** — Not supported on serverless. Write to Delta, read back for joins
+12. **No Python loops or `.collect()`** — Use Spark parallelism. No driver-side iteration, avoid Pandas↔Spark conversions
+
+## Generation Planning Workflow
+
+**Before generating any code, you MUST present a plan for user approval.**
+
+### ⚠️ MUST DO: Confirm Catalog Before Proceeding
+
+**You MUST explicitly ask the user which catalog to use.** Do not assume or proceed without confirmation.
+
+Example prompt to user:
+> "Which Unity Catalog should I use for this data?"
+ +When presenting your plan, always show the selected catalog prominently: +``` +📍 Output Location: catalog_name.schema_name + Volume: /Volumes/catalog_name/schema_name/raw_data/ +``` + +This makes it easy for the user to spot and correct if needed. + +### Step 1: Gather Requirements + +Ask the user about: +- **Catalog/Schema** — Which catalog to use? +- **Domain** — E-commerce, support tickets, IoT, financial? (Use industry terms) + +**If user doesn't specify a story:** Propose one. Don't generate bland data — suggest an incident, anomaly, or trend that shows Databricks value (e.g., "I'll include a system outage that causes ticket spike and churn — this lets you demo root cause analysis"). + +### Step 2: Present Plan with Story + +Show a clear specification with **the business story and your assumptions surfaced**: + +``` +📍 Output Location: {user_catalog}.support_demo + Volume: /Volumes/{user_catalog}/support_demo/raw_data/ + +📖 Story: A payment system outage causes support ticket spike. Resolution times + degrade, enterprise customers churn, revenue drops $2.3M. With Databricks we + identify the root cause, affected customers, and prevent future impact. +``` + +| Table | Description | Rows | Key Assumptions | +|-------|-------------|------|-----------------| +| customers | Customer profiles with tier, MRR | 10,000 | Enterprise 10% but 60% of revenue | +| tickets | Support tickets with priority, resolution_time | 80,000 | Spike during outage, SLA breaches | +| incidents | System events (outages, deployments) | 50 | Payment outage mid-month | +| churn_events | Customer cancellations with reason | 500 | Spike after poor support experience | + +**Business metrics:** +- `customers.mrr` — Revenue at risk ($) +- `tickets.resolution_hours` — SLA performance +- `churn_events.lost_mrr` — Churn impact ($) + +**The story this data tells:** +- Incident table shows payment outage on March 15 +- Tickets spike 5x during outage, resolution time degrades from 4h → 18h +- Enterprise customers with SLA breaches churn 3 weeks later +- Total impact: $2.3M lost MRR, traceable to one incident +- **Databricks value:** Root cause analysis, identify at-risk customers, build alerting + +**Ask user**: "Does this story work? Any adjustments?" 
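+
+Once generated, the story should be directly verifiable. A minimal illustration of the kind of root-cause query the sample plan above enables; table and column names here (`tickets`, `created_at`) are hypothetical, taken from that plan, and `USER_CATALOG` stands in for the user-approved catalog:
+
+```python
+# Did tickets spike around the March 15 payment outage? (hypothetical schema)
+spark.sql(f"""
+    SELECT date_trunc('DAY', created_at) AS day, COUNT(*) AS tickets
+    FROM {USER_CATALOG}.support_demo.tickets
+    GROUP BY day
+    ORDER BY day
+""").show()
+```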
+ +### Step 3: Ask About Data Features + +- [x] Skew (non-uniform distributions) - **Enabled by default** +- [x] Joins (referential integrity) - **Enabled by default** +- [ ] Bad data injection (for data quality testing) +- [ ] Multi-language text +- [ ] Incremental mode (append instead of overwrite) + +### Pre-Generation Checklist + +- [ ] **Catalog confirmed** - User explicitly approved which catalog to use +- [ ] Output location shown prominently in plan (easy to spot/change) +- [ ] Table specification shown and approved +- [ ] Assumptions about distributions confirmed +- [ ] User confirmed compute preference (Databricks Connect on serverless recommended) +- [ ] Data features selected + +**Do NOT proceed to code generation until user approves the plan, including the catalog.** + +### Post-Generation Checklist + +After generating data, use `get_volume_folder_details` to validate the output matches requirements: +- Row counts match the plan +- Schema matches expected columns and types +- Data distributions look reasonable (check column stats) + +## Use Databricks Connect Spark + Faker Pattern + +```python +from databricks.connect import DatabricksSession, DatabricksEnv +from pyspark.sql import functions as F +from pyspark.sql.types import StringType +import pandas as pd + +# Setup serverless with dependencies (MUST list all libs used in UDFs) +env = DatabricksEnv().withDependencies("faker", "holidays") +spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate() + +# Pandas UDF pattern - import lib INSIDE the function +@F.pandas_udf(StringType()) +def fake_name(ids: pd.Series) -> pd.Series: + from faker import Faker # Import inside UDF + fake = Faker() + return pd.Series([fake.name() for _ in range(len(ids))]) + +# Generate with spark.range, apply UDFs +customers_df = spark.range(0, 10000, numPartitions=16).select( + F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"), + fake_name(F.col("id")).alias("name"), +) + +# Write to Volume as Parquet (default for raw data) +# Path is a folder with table name: /Volumes/catalog/schema/raw_data/customers/ +spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}") +spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data") +customers_df.write.mode("overwrite").parquet(f"/Volumes/{CATALOG}/{SCHEMA}/raw_data/customers") +``` + +**Partitions by scale:** `spark.range(N, numPartitions=P)` +- <100K rows: 8 partitions +- 100K-500K: 16 partitions +- 500K-1M: 32 partitions +- 1M+: 64+ partitions + +**Output formats:** +- **Parquet to Volume** (default): `df.write.parquet("/Volumes/.../raw_data/table")` — raw data for pipelines +- **Delta Table**: `df.write.saveAsTable("catalog.schema.table")` — if user wants queryable tables +- **JSON/CSV**: small dimension tables, replicate legacy systems + +## Performance Rules + +Generated scripts must be highly performant. 
**Never** do these:
+
+| Anti-Pattern | Why It's Slow | Do This Instead |
+|--------------|---------------|-----------------|
+| Python loops on driver | Single-threaded, no parallelism | Use `spark.range()` + Spark operations |
+| `.collect()` then iterate | Brings all data to driver memory | Keep data in Spark, use DataFrame ops |
+| Pandas → Spark → Pandas | Serialization overhead, defeats distribution | Stay in Spark, use `pandas_udf` only for UDFs |
+| Read/write temp files | Unnecessary I/O | Chain DataFrame transformations |
+| Scalar UDFs | Row-by-row processing | Use `pandas_udf` for batch processing |
+
+**Good pattern:** `spark.range()` → Spark transforms → `pandas_udf` for Faker → write directly
+
+## Common Patterns
+
+### Weighted Categories (never uniform)
+Chain thresholds against a single random column; each separate unseeded `F.rand()` call is an independent draw and would break the intended 60/30/10 split:
+```python
+df = df.withColumn("u", F.rand(42)).withColumn(
+    "tier", F.when(F.col("u") < 0.6, "Free").when(F.col("u") < 0.9, "Pro").otherwise("Enterprise")
+).drop("u")
+```
+
+### Log-Normal Amounts (in a pandas UDF)
+Use `np.random.lognormal(mean, sigma)` — always positive, long tail:
+- Enterprise: `lognormal(7.5, 0.8)` → ~$1800 median
+- Pro: `lognormal(5.5, 0.7)` → ~$245 median
+- Free: `lognormal(4.0, 0.6)` → ~$55 median
+
+### Date Range (Last 6 Months)
+```python
+END_DATE = datetime.now()
+START_DATE = END_DATE - timedelta(days=180)
+```
+
+### Infrastructure (always create in script)
+```python
+spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
+spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
+```
+
+### Referential Integrity (FK pattern)
+Write master table to Delta first, then read back for FK joins (no `.cache()` on serverless):
+```python
+# 1. Write master table
+customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
+
+# 2. Read back for FK lookup
+customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_idx", "customer_id")
+
+# 3. Generate child table with valid FKs via join
+orders_df = spark.range(N_ORDERS).select(
+    (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx")
+)
+orders_with_fk = orders_df.join(customer_lookup, on="customer_idx")
+```
+
+## Setup
+
+Requires Python 3.12 and databricks-connect>=16.4. Use `uv`:
+
+```bash
+uv pip install "databricks-connect>=16.4,<17.4" faker numpy pandas holidays
+```
+
+## Related Skills
+
+- **databricks-unity-catalog** — Managing catalogs, schemas, and volumes
+- **databricks-bundles** — DABs for production deployment
+
+## Common Issues
+
+| Issue | Solution |
+|-------|----------|
+| `ImportError: cannot import name 'DatabricksEnv'` | Upgrade: `uv pip install "databricks-connect>=16.4"` |
+| Python 3.11 instead of 3.12 | Python 3.12 required. Use `uv` to create an env with the correct version |
+| `ModuleNotFoundError: faker` | Add to `withDependencies()`, import inside UDF |
+| Faker UDF is slow | Use `pandas_udf` for batch processing |
+| Out of memory | Increase `numPartitions` in `spark.range()` |
+| Referential integrity errors | Write master table to Delta first, read back for FK joins |
+| `PERSIST TABLE is not supported on serverless` | **NEVER use `.cache()` or `.persist()` with serverless** - write to Delta table first, then read back |
+| `F.window` vs `Window` confusion | Use `from pyspark.sql.window import Window` for `row_number()`, `rank()`, etc. `F.window` is for time-based tumbling/sliding windows, not analytical ranking. |
+| Broadcast variables not supported | **NEVER use `spark.sparkContext.broadcast()` with serverless** |
+
+See [references/2-troubleshooting.md](references/2-troubleshooting.md) for full troubleshooting guide.
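+
+**Skewed FK assignment (80/20).** A minimal sketch of applying rule 4's 80/20 pattern when assigning foreign keys, assuming the `customer_lookup` table from the FK pattern above; the exponent is the only tuning knob (`ln 0.2 / ln 0.8 ≈ 7.2` routes ~80% of orders to the lowest ~20% of `customer_idx` values):
+
+```python
+orders_df = spark.range(0, N_ORDERS).select(
+    # u**7.2 of a uniform u piles mass near 0, so low customer_idx values
+    # (the "heavy" customers) are sampled far more often than high ones
+    F.floor(F.pow(F.rand(seed=42), F.lit(7.2)) * N_CUSTOMERS).cast("int").alias("customer_idx")
+)
+orders_with_fk = orders_df.join(customer_lookup, on="customer_idx")
+```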
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/1-data-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/1-data-patterns.md
new file mode 100644
index 0000000..eba6491
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/1-data-patterns.md
@@ -0,0 +1,146 @@
+# Data Patterns Guide
+
+Creating realistic synthetic data that tells a story.
+
+> **Note:** This guide provides principles and simplified examples. Actual implementations should be more sophisticated — use domain-specific distributions, realistic business rules, and correlations that reflect the user's actual use case. Ask clarifying questions to understand the business context before generating.
+
+## Core Principles
+
+### 1. Data Must Be Interesting
+
+Synthetic data should reveal patterns humans can see in dashboards and ML models can learn from:
+
+- **Visible trends** — Revenue growth, seasonal spikes, degradation over time
+- **Actionable segments** — Clear differences between customer tiers, regions, product categories
+- **Anomalies to detect** — Fraud patterns, equipment failures, churn signals
+- **Correlations to discover** — Higher tier = more spend, faster resolution = better CSAT
+
+**Anti-pattern:** Uniform random data with no story — useless for demos and ML.
+
+### 2. Non-Uniform Distributions
+
+Real data is never uniformly distributed. Use appropriate distributions:
+
+| Distribution | When to Use | Examples |
+|--------------|-------------|----------|
+| **Log-normal** | Monetary values, sizes | Order amounts, salaries, file sizes |
+| **Pareto (80/20)** | Popularity, wealth | 20% of customers = 80% of revenue |
+| **Exponential** | Time between events | Support resolution time, session duration |
+| **Weighted categorical** | Skewed categories | Status (70% complete, 5% failed), tiers |
+
+```python
+# Log-normal for amounts (long tail, always positive)
+amount = np.random.lognormal(mean=5.5, sigma=0.8)  # ~$245 median
+
+# Pareto for power-law (few large, many small)
+value = (np.random.pareto(a=1.5) + 1) * base_value
+
+# Exponential for time-to-event
+hours = np.random.exponential(scale=24)  # avg 24h, skewed right
+```
+
+### 3. Row Coherence
+
+Attributes within a row must make business sense together. For example, generate correlated attributes in a single UDF:
+
+| If This... | Then This... |
+|------------|--------------|
+| Enterprise tier | Higher order amounts, more activity, priority support |
+| Critical priority | Faster resolution, more interactions |
+| Older equipment | Higher failure rate, more anomalies |
+| Large transaction + unusual hour | Higher fraud probability |
+| Fast resolution | Higher CSAT score |
+
+```python
+@F.pandas_udf("priority string, resolution_hours double, csat int")
+def generate_coherent_ticket(tiers: pd.Series) -> pd.DataFrame:
+    """All attributes correlate logically within each row."""
+    results = []
+    for tier in tiers:
+        # Priority depends on tier
+        priority = "Critical" if tier == "Enterprise" and np.random.random() < 0.3 else "Medium"
+        # Resolution depends on priority
+        resolution = float(np.random.exponential(4 if priority == "Critical" else 36))
+        # CSAT depends on resolution
+        csat = 5 if resolution < 4 else (3 if resolution < 24 else 2)
+        results.append({"priority": priority, "resolution_hours": resolution, "csat": csat})
+    return pd.DataFrame(results)
+```
+
+### 4.
The 80/20 Rule + +Apply power-law distributions where appropriate: + +- **20% of customers** generate 80% of orders/revenue +- **20% of products** account for 80% of sales +- **20% of support agents** handle 80% of tickets + +Implementation: Use weighted sampling when assigning FKs, not uniform random. + +### 5. Time-Based Patterns + +Most data has temporal patterns: + +- **Weekday vs weekend** — B2B drops on weekends, B2C peaks +- **Business hours** — Support tickets cluster 9am-5pm +- **Seasonality** — Q4 retail spike, summer travel peak +- **Trends** — Growth over time, degradation curves + +```python +def get_volume_multiplier(date): + multiplier = 1.0 + if date.weekday() >= 5: multiplier *= 0.6 # Weekend drop + if date.month in [11, 12]: multiplier *= 1.5 # Holiday spike + return multiplier +``` + +### 6. ML-Ready Data + +If data will train ML models, ensure: + +- **Signal exists** — The patterns you want the model to learn are present +- **Noise is realistic** — Not too clean (overfitting) or too noisy (unlearnable) +- **Class balance** — Fraud at 0.1-1%, not 50/50 (unrealistic) +- **Temporal validity** — Train/test split respects time (no future leakage) + +## Referential Integrity + +Generate master tables first, write to Delta, then join for FKs: + +```python +# 1. Generate and write master table +customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers") + +# 2. Read back for FK joins (NOT cache - unsupported on serverless) +customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers") + +# 3. Generate child table with valid FKs via join +orders_df = spark.range(N_ORDERS).select( + (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx") +) +orders_with_fk = orders_df.join(customer_lookup, on="customer_idx") +``` + +## Data Volume + +Generate enough rows so patterns survive aggregation: + +| Analysis Type | Minimum Rows | Rationale | +|---------------|--------------|-----------| +| Daily dashboard | 50-100/day | Trends visible after weekly rollup | +| Category comparison | 500+ per category | Statistical significance | +| ML training | 10K-100K+ | Enough signal for model learning | +| Customer-level | 5-20 events/customer | Individual patterns visible | + +**Rule of thumb:** If you'll GROUP BY a column, ensure each group has 100+ rows. + +--- + +## Remember + +These are guiding principles, not templates. Real implementations should: +- Reflect the user's specific business domain and terminology +- Use realistic parameter values (research typical ranges for the industry) +- Include edge cases relevant to the use case (returns, cancellations, failures) +- Have more complex correlations than shown in examples above +- **Never use flat/uniform distributions** — categories, tiers, regions, statuses should always be skewed (e.g., 60/30/10 not 33/33/33) diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/2-troubleshooting.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/2-troubleshooting.md new file mode 100644 index 0000000..420b350 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/references/2-troubleshooting.md @@ -0,0 +1,324 @@ +# Troubleshooting Guide + +Common issues and solutions for synthetic data generation. + +## Environment Issues + +### ModuleNotFoundError: faker (or other library) + +**Problem:** Dependencies not available in execution environment. 
+ +**Solutions by execution mode:** + +| Mode | Solution | +|------|----------| +| **DB Connect 16.4+** | Use `DatabricksEnv().withDependencies("faker", "pandas", ...)` | +| **Older DB Connect with Serverless** | Create job with `environments` parameter | +| **Databricks Runtime** | Use Databricks CLI to install `faker holidays` | +| **Classic cluster** | Use Databricks CLI to install libraries. `databricks libraries install --json '{"cluster_id": "", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}'` | + +```python +# For DB Connect 16.4+ +from databricks.connect import DatabricksSession, DatabricksEnv + +env = DatabricksEnv().withDependencies("faker", "pandas", "numpy", "holidays") +spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate() +``` + +### DatabricksEnv not found + +**Problem:** Using older databricks-connect version. + +**Solution:** Upgrade to 16.4+ or use job-based approach: + +```bash +# Upgrade (prefer uv, fall back to pip) +uv pip install "databricks-connect>=16.4,<17.4" +# or: pip install "databricks-connect>=16.4,<17.4" + +# Or use job with environments parameter instead +``` + +### serverless_compute_id error + +**Problem:** Missing serverless configuration. + +**Solution:** Add to `~/.databrickscfg`: + +```ini +[DEFAULT] +host = https://your-workspace.cloud.databricks.com/ +serverless_compute_id = auto +auth_type = databricks-cli +``` + +--- + +## Execution Issues + +### CRITICAL: cache() and persist() NOT supported on serverless + +**Problem:** Using `.cache()` or `.persist()` on serverless compute fails with: +``` +AnalysisException: [NOT_SUPPORTED_WITH_SERVERLESS] PERSIST TABLE is not supported on serverless compute. +``` + +**Why this happens:** Serverless compute does not support caching DataFrames in memory. This is a fundamental limitation of the serverless architecture. + +**Solution:** Write master tables to Delta first, then read them back for FK joins: + +```python +# BAD - will fail on serverless +customers_df = spark.range(0, N_CUSTOMERS)... +customers_df.cache() # ❌ FAILS: "PERSIST TABLE is not supported on serverless compute" + +# GOOD - write to Delta, then read back +customers_df = spark.range(0, N_CUSTOMERS)... +customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers") +customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers") # ✓ Read from Delta +``` + +**Best practice for referential integrity:** +1. Generate master table (e.g., customers) +2. Write to Delta table +3. Read back for FK lookup joins +4. Generate child tables (e.g., orders, tickets) with valid FKs +5. Write child tables to Delta + +--- + +### Serverless job fails to start + +**Possible causes:** +1. Workspace doesn't have serverless enabled +2. Unity Catalog permissions missing +3. Invalid environment configuration + +**Solutions:** +```python +# Verify serverless is available +# Try creating a simple job first to test + +# Check Unity Catalog permissions +spark.sql("SELECT current_catalog(), current_schema()") +``` + +### Classic cluster startup slow (3-8 minutes) + +**Problem:** Clusters take time to start. + +**Solution:** Switch to serverless: + +```python +# Instead of: +# spark = DatabricksSession.builder.clusterId("xxx").getOrCreate() + +# Use: +spark = DatabricksSession.builder.serverless(True).getOrCreate() +``` + +### "Either base environment or version must be provided" + +**Problem:** Missing `client` in job environment spec. 
+
+**Solution:** Add `"client": "4"` to the spec:
+
+```python
+{
+    "environments": [{
+        "environment_key": "datagen_env",
+        "spec": {
+            "client": "4",  # Required!
+            "dependencies": ["faker", "numpy", "pandas"]
+        }
+    }]
+}
+```
+
+---
+
+## Data Generation Issues
+
+### AttributeError: 'function' object has no attribute 'partitionBy'
+
+**Problem:** Using `F.window` instead of `Window` for analytical window functions.
+
+```python
+# WRONG - F.window is for time-based tumbling/sliding windows (streaming)
+window_spec = F.window.partitionBy("account_id").orderBy("contact_id")
+# Error: AttributeError: 'function' object has no attribute 'partitionBy'
+
+# CORRECT - Window is for analytical window specifications
+from pyspark.sql.window import Window
+window_spec = Window.partitionBy("account_id").orderBy("contact_id")
+```
+
+**When to use Window:** For analytical functions like `row_number()`, `rank()`, `lead()`, `lag()`:
+
+```python
+from pyspark.sql.window import Window
+
+# Mark first contact per account as primary
+window_spec = Window.partitionBy("account_id").orderBy("contact_id")
+contacts_df = contacts_df.withColumn(
+    "is_primary",
+    F.row_number().over(window_spec) == 1
+)
+```
+
+---
+
+### Faker UDF is slow
+
+**Problem:** Single-row UDFs don't parallelize well.
+
+**Solution:** Use `pandas_udf` for batch processing:
+
+```python
+# SLOW - scalar UDF
+@F.udf(returnType=StringType())
+def slow_fake_name():
+    return Faker().name()
+
+# FAST - pandas UDF (batch processing)
+@F.pandas_udf(StringType())
+def fast_fake_name(ids: pd.Series) -> pd.Series:
+    fake = Faker()
+    return pd.Series([fake.name() for _ in range(len(ids))])
+```
+
+### Out of memory with large data
+
+**Problem:** Not enough partitions for data size.
+
+**Solution:** Increase partitions:
+
+```python
+# For large datasets (1M+ rows)
+customers_df = spark.range(0, N_CUSTOMERS, numPartitions=64)  # Increase from default
+```
+
+| Data Size | Recommended Partitions |
+|-----------|----------------------|
+| < 100K | 8 |
+| 100K - 500K | 16 |
+| 500K - 1M | 32 |
+| 1M+ | 64+ |
+
+### Context corrupted on classic cluster
+
+**Problem:** Stale execution context.
+
+**Solution:** Create fresh context (omit context_id), reinstall libraries:
+
+```python
+# Don't reuse context_id if you see strange errors
+# Let it create a new context
+```
+
+### Referential integrity violations
+
+**Problem:** Foreign keys reference non-existent parent records.
+
+**Solution:** Write master table to Delta first, then read back for FK joins:
+
+```python
+# 1. Generate and WRITE master table (do NOT use cache with serverless!)
+customers_df = spark.range(0, N_CUSTOMERS)...
+customers_df.write.mode("overwrite").saveAsTable(f"{CATALOG}.{SCHEMA}.customers")
+
+# 2. Read back for FK lookups (include the dense customer_idx join key)
+customer_lookup = spark.table(f"{CATALOG}.{SCHEMA}.customers").select("customer_idx", "customer_id", "tier")
+
+# 3. Generate child table with valid FKs via a derived join key
+orders_df = spark.range(0, N_ORDERS).select(
+    (F.abs(F.hash(F.col("id"))) % N_CUSTOMERS).alias("customer_idx")
+)
+orders_with_fk = orders_df.join(customer_lookup, on="customer_idx", how="left")
+```
+
+> **WARNING:** Do NOT use `.cache()` or `.persist()` with serverless compute. See the dedicated section above.
+
+---
+
+## Data Quality Issues
+
+### Uniform distributions (unrealistic)
+
+**Problem:** All customers have similar order counts, amounts are evenly distributed.
+
+**Solution:** Use skewed, non-uniform distributions:
+
+```python
+# BAD - uniform
+amounts = np.random.uniform(10, 1000, N)
+
+# GOOD - log-normal (realistic)
+amounts = np.random.lognormal(mean=5, sigma=0.8, size=N)
+```
+
+### Missing time-based patterns
+
+**Problem:** Data doesn't reflect weekday/weekend or seasonal patterns.
+
+**Solution:** Add multipliers:
+
+```python
+import holidays
+
+US_HOLIDAYS = holidays.US(years=[2024, 2025])
+
+def get_multiplier(date):
+    mult = 1.0
+    if date.weekday() >= 5:  # Weekend
+        mult *= 0.6
+    if date in US_HOLIDAYS:
+        mult *= 0.3
+    return mult
+```
+
+### Incoherent row attributes
+
+**Problem:** Enterprise customer has low-value orders, critical ticket has slow resolution.
+
+**Solution:** Correlate attributes:
+
+```python
+# Priority based on tier
+if tier == 'Enterprise':
+    priority = np.random.choice(['Critical', 'High'], p=[0.4, 0.6])
+else:
+    priority = np.random.choice(['Medium', 'Low'], p=[0.6, 0.4])
+
+# Resolution based on priority
+resolution_scale = {'Critical': 4, 'High': 12, 'Medium': 36, 'Low': 72}
+resolution_hours = np.random.exponential(scale=resolution_scale[priority])
+```
+
+---
+
+## Validation Steps
+
+After generation, verify your data:
+
+```python
+# 1. Check row counts
+print(f"Customers: {customers_df.count():,}")
+print(f"Orders: {orders_df.count():,}")
+
+# 2. Verify distributions
+customers_df.groupBy("tier").count().show()
+orders_df.describe("amount").show()
+
+# 3. Check referential integrity
+orphans = orders_df.join(
+    customers_df,
+    orders_df.customer_id == customers_df.customer_id,
+    "left_anti"
+)
+print(f"Orphan orders: {orphans.count()}")
+
+# 4. Verify date range
+orders_df.select(F.min("order_date"), F.max("order_date")).show()
+```
diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/scripts/generate_synthetic_data.py b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/scripts/generate_synthetic_data.py
new file mode 100644
index 0000000..b9f953f
--- /dev/null
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-synthetic-data-gen/scripts/generate_synthetic_data.py
@@ -0,0 +1,390 @@
+"""Generate synthetic data using Spark + Faker + Pandas UDFs.
+ +This is the recommended approach for ALL data generation tasks: +- Scales from thousands to millions of rows +- Parallel execution via Spark +- Direct write to Unity Catalog +- Works with serverless and classic compute + +Auto-detects environment and uses: +- DatabricksEnv with managed dependencies if databricks-connect >= 16.4 (local) +- Standard session if running on Databricks Runtime or older databricks-connect +""" +import sys +import os +from pyspark.sql import functions as F +from pyspark.sql.window import Window +from pyspark.sql.types import StringType, DoubleType, StructType, StructField, IntegerType +import numpy as np +import pandas as pd +from datetime import datetime, timedelta + +# ============================================================================= +# CONFIGURATION +# ============================================================================= +# Compute - Serverless strongly recommended +USE_SERVERLESS = True # Set to False and provide CLUSTER_ID for classic compute +CLUSTER_ID = None # Only used if USE_SERVERLESS=False + +# Storage - Update these for your environment +CATALOG = "" # REQUIRED: replace with your catalog +SCHEMA = "" # REQUIRED: replace with your schema +VOLUME_PATH = f"/Volumes/{CATALOG}/{SCHEMA}/raw_data" + +# Data sizes +N_CUSTOMERS = 10_000 +N_ORDERS = 50_000 +PARTITIONS = 16 # Adjust: 8 for <100K, 32 for 1M+ + +# Date range - last 6 months from today +END_DATE = datetime.now().replace(hour=0, minute=0, second=0, microsecond=0) +START_DATE = END_DATE - timedelta(days=180) + +# Write mode - "overwrite" for one-time, "append" for incremental +WRITE_MODE = "overwrite" + +# Bad data injection for testing data quality rules +INJECT_BAD_DATA = False # Set to True to inject bad data +BAD_DATA_CONFIG = { + "null_rate": 0.02, # 2% nulls in required fields + "outlier_rate": 0.01, # 1% impossible values + "orphan_fk_rate": 0.01, # 1% orphan foreign keys +} + +# Reproducibility +SEED = 42 + +# Tier distribution: Free 60%, Pro 30%, Enterprise 10% +TIER_PROBS = [0.6, 0.3, 0.1] + +# Region distribution +REGION_PROBS = [0.4, 0.25, 0.2, 0.15] + +# ============================================================================= +# ENVIRONMENT DETECTION AND SESSION CREATION +# ============================================================================= + +def is_databricks_runtime(): + """Check if running on Databricks Runtime vs locally.""" + return "DATABRICKS_RUNTIME_VERSION" in os.environ + +def get_databricks_connect_version(): + """Get databricks-connect version as (major, minor) tuple or None.""" + try: + import importlib.metadata + version_str = importlib.metadata.version('databricks-connect') + parts = version_str.split('.') + return (int(parts[0]), int(parts[1])) + except Exception: + return None + +# Detect environment +on_runtime = is_databricks_runtime() +db_version = get_databricks_connect_version() + +print("=" * 80) +print("ENVIRONMENT DETECTION") +print("=" * 80) +print(f"Running on Databricks Runtime: {on_runtime}") +if db_version: + print(f"databricks-connect version: {db_version[0]}.{db_version[1]}") +else: + print("databricks-connect: not available") + +# Use DatabricksEnv with managed dependencies if: +# - Running locally (not on Databricks Runtime) +# - databricks-connect >= 16.4 +use_managed_deps = (not on_runtime) and db_version and db_version >= (16, 4) + +if use_managed_deps: + print("Using DatabricksEnv with managed dependencies") + print("=" * 80) + from databricks.connect import DatabricksSession, DatabricksEnv + + env = 
DatabricksEnv().withDependencies("faker", "pandas", "numpy", "holidays") + + if USE_SERVERLESS: + spark = DatabricksSession.builder.withEnvironment(env).serverless(True).getOrCreate() + print("Connected to serverless compute with managed dependencies!") + else: + if not CLUSTER_ID: + raise ValueError("CLUSTER_ID must be set when USE_SERVERLESS=False") + spark = DatabricksSession.builder.withEnvironment(env).clusterId(CLUSTER_ID).getOrCreate() + print(f"Connected to cluster with managed dependencies!") +else: + print("Using standard session (dependencies must be pre-installed)") + print("=" * 80) + + # Check that UDF dependencies are available + print("\nChecking UDF dependencies...") + missing_deps = [] + + try: + from faker import Faker + print(" faker: OK") + except ImportError: + missing_deps.append("faker") + print(" faker: MISSING") + + try: + import pandas as pd + print(" pandas: OK") + except ImportError: + missing_deps.append("pandas") + print(" pandas: MISSING") + + if missing_deps: + print("\n" + "=" * 80) + print("ERROR: Missing dependencies for UDFs") + print("=" * 80) + print(f"Missing: {', '.join(missing_deps)}") + if on_runtime: + print('\nSolution: Install libraries via Databricks CLI:') + print(' databricks libraries install --json \'{"cluster_id": "", "libraries": [{"pypi": {"package": "faker"}}, {"pypi": {"package": "holidays"}}]}\'') + else: + print("\nSolution: Upgrade to databricks-connect >= 16.4 for managed deps") + print(" Or create a job with environment settings") + print("=" * 80) + sys.exit(1) + + print("\nAll dependencies available") + print("=" * 80) + + from databricks.connect import DatabricksSession + + if USE_SERVERLESS: + spark = DatabricksSession.builder.serverless(True).getOrCreate() + print("Connected to serverless compute") + else: + if not CLUSTER_ID: + raise ValueError("CLUSTER_ID must be set when USE_SERVERLESS=False") + spark = DatabricksSession.builder.clusterId(CLUSTER_ID).getOrCreate() + print(f"Connected to cluster ") + +# Import Faker for UDF definitions +from faker import Faker + +# ============================================================================= +# DEFINE PANDAS UDFs FOR FAKER DATA +# ============================================================================= + +@F.pandas_udf(StringType()) +def fake_name(ids: pd.Series) -> pd.Series: + """Generate realistic person names.""" + fake = Faker() + Faker.seed(SEED) + return pd.Series([fake.name() for _ in range(len(ids))]) + +@F.pandas_udf(StringType()) +def fake_company(ids: pd.Series) -> pd.Series: + """Generate realistic company names.""" + fake = Faker() + Faker.seed(SEED) + return pd.Series([fake.company() for _ in range(len(ids))]) + +@F.pandas_udf(StringType()) +def fake_address(ids: pd.Series) -> pd.Series: + """Generate realistic addresses.""" + fake = Faker() + Faker.seed(SEED) + return pd.Series([fake.address().replace('\n', ', ') for _ in range(len(ids))]) + +@F.pandas_udf(StringType()) +def fake_email(names: pd.Series) -> pd.Series: + """Generate email based on name.""" + emails = [] + for name in names: + if name: + domain = name.lower().replace(" ", ".").replace(",", "")[:20] + emails.append(f"{domain}@example.com") + else: + emails.append("unknown@example.com") + return pd.Series(emails) + +@F.pandas_udf(DoubleType()) +def generate_lognormal_amount(tiers: pd.Series) -> pd.Series: + """Generate amount based on tier using log-normal distribution.""" + np.random.seed(SEED) + amounts = [] + for tier in tiers: + if tier == "Enterprise": + 
amounts.append(float(np.random.lognormal(mean=7.5, sigma=0.8)))  # ~$1800 median
+        elif tier == "Pro":
+            amounts.append(float(np.random.lognormal(mean=5.5, sigma=0.7)))  # ~$245 median
+        else:
+            amounts.append(float(np.random.lognormal(mean=4.0, sigma=0.6)))  # ~$55 median
+    return pd.Series(amounts)
+
+# =============================================================================
+# CREATE INFRASTRUCTURE
+# =============================================================================
+print("\nCreating infrastructure...")
+spark.sql(f"CREATE SCHEMA IF NOT EXISTS {CATALOG}.{SCHEMA}")
+spark.sql(f"CREATE VOLUME IF NOT EXISTS {CATALOG}.{SCHEMA}.raw_data")
+print(f"Infrastructure ready: {VOLUME_PATH}")
+
+# =============================================================================
+# GENERATE CUSTOMERS (Master Table)
+# =============================================================================
+print(f"\nGenerating {N_CUSTOMERS:,} customers...")
+
+# NOTE: reusing one seed within a when-chain makes every F.rand(seed) in the chain
+# evaluate to the same per-row draw (so cumulative thresholds work); distinct seeds
+# across columns keep tier, region, and created_at independent of each other.
+customers_df = (
+    spark.range(0, N_CUSTOMERS, numPartitions=PARTITIONS)
+    .select(
+        F.concat(F.lit("CUST-"), F.lpad(F.col("id").cast("string"), 5, "0")).alias("customer_id"),
+        fake_name(F.col("id")).alias("name"),
+        fake_company(F.col("id")).alias("company"),
+        fake_address(F.col("id")).alias("address"),
+        # Tier distribution: Free 60%, Pro 30%, Enterprise 10%
+        F.when(F.rand(SEED) < TIER_PROBS[0], "Free")
+        .when(F.rand(SEED) < TIER_PROBS[0] + TIER_PROBS[1], "Pro")
+        .otherwise("Enterprise").alias("tier"),
+        # Region distribution (different seed than tier so the columns are uncorrelated)
+        F.when(F.rand(SEED + 1) < REGION_PROBS[0], "North")
+        .when(F.rand(SEED + 1) < REGION_PROBS[0] + REGION_PROBS[1], "South")
+        .when(F.rand(SEED + 1) < REGION_PROBS[0] + REGION_PROBS[1] + REGION_PROBS[2], "East")
+        .otherwise("West").alias("region"),
+        # Created date (within last 2 years before start date)
+        F.date_sub(F.lit(START_DATE.date()), (F.rand(SEED + 2) * 730).cast("int")).alias("created_at"),
+    )
+)
+
+# Add tier-based ARR and email
+customers_df = (
+    customers_df
+    .withColumn("arr", F.round(generate_lognormal_amount(F.col("tier")), 2))
+    .withColumn("email", fake_email(F.col("name")))
+)
+
+# Save customers
+customers_df.write.mode(WRITE_MODE).parquet(f"{VOLUME_PATH}/customers")
+print(f"  Saved customers to {VOLUME_PATH}/customers")
+
+# Show tier distribution
+print("\n  Tier distribution:")
+customers_df.groupBy("tier").count().orderBy("tier").show()
+
+# =============================================================================
+# GENERATE ORDERS (Child Table with Referential Integrity)
+# =============================================================================
+print(f"\nGenerating {N_ORDERS:,} orders with referential integrity...")
+
+# Write customer lookup to temp Delta table (no .cache() on serverless!)
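+# The lookup is given a dense 0..N-1 customer_idx below (via row_number), so the
+# hash-modulo customer_idx generated for each order always finds a matching customer.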
+customers_tmp_table = f"{CATALOG}.{SCHEMA}._tmp_customers_lookup"
+customers_df.select("customer_id", "tier").write.mode("overwrite").saveAsTable(customers_tmp_table)
+customer_lookup = spark.table(customers_tmp_table)
+
+# Generate orders base (distinct rand seeds keep status and order_date independent)
+orders_df = (
+    spark.range(0, N_ORDERS, numPartitions=PARTITIONS)
+    .select(
+        F.concat(F.lit("ORD-"), F.lpad(F.col("id").cast("string"), 6, "0")).alias("order_id"),
+        # Generate customer_idx for FK join (hash-based distribution)
+        (F.abs(F.hash(F.col("id"), F.lit(SEED))) % N_CUSTOMERS).alias("customer_idx"),
+        # Order status (one seed per chain → one draw per row, cumulative thresholds)
+        F.when(F.rand(SEED + 3) < 0.65, "delivered")
+        .when(F.rand(SEED + 3) < 0.80, "shipped")
+        .when(F.rand(SEED + 3) < 0.90, "processing")
+        .when(F.rand(SEED + 3) < 0.95, "pending")
+        .otherwise("cancelled").alias("status"),
+        # Order date within date range
+        F.date_add(F.lit(START_DATE.date()), (F.rand(SEED + 4) * 180).cast("int")).alias("order_date"),
+    )
+)
+
+# Add customer_idx to lookup for join
+customer_lookup_with_idx = customer_lookup.withColumn(
+    "customer_idx",
+    (F.row_number().over(Window.orderBy(F.monotonically_increasing_id())) - 1).cast("int")
+)
+
+# Join to get customer_id and tier as foreign key
+orders_with_fk = (
+    orders_df
+    .join(customer_lookup_with_idx, on="customer_idx", how="left")
+    .drop("customer_idx")
+)
+
+# Add tier-based amount
+orders_with_fk = orders_with_fk.withColumn(
+    "amount",
+    F.round(generate_lognormal_amount(F.col("tier")), 2)
+)
+
+# =============================================================================
+# INJECT BAD DATA (OPTIONAL)
+# =============================================================================
+if INJECT_BAD_DATA:
+    print("\nInjecting bad data for quality testing...")
+
+    # Calculate counts
+    null_count = int(N_ORDERS * BAD_DATA_CONFIG["null_rate"])
+    outlier_count = int(N_ORDERS * BAD_DATA_CONFIG["outlier_rate"])
+    orphan_count = int(N_ORDERS * BAD_DATA_CONFIG["orphan_fk_rate"])
+
+    # Add bad data flags
+    orders_with_fk = orders_with_fk.withColumn(
+        "row_num",
+        F.row_number().over(Window.orderBy(F.monotonically_increasing_id()))
+    )
+
+    # Inject nulls in customer_id for first null_count rows
+    orders_with_fk = orders_with_fk.withColumn(
+        "customer_id",
+        F.when(F.col("row_num") <= null_count, None).otherwise(F.col("customer_id"))
+    )
+
+    # Inject negative amounts for next outlier_count rows
+    orders_with_fk = orders_with_fk.withColumn(
+        "amount",
+        F.when(
+            (F.col("row_num") > null_count) & (F.col("row_num") <= null_count + outlier_count),
+            F.lit(-999.99)
+        ).otherwise(F.col("amount"))
+    )
+
+    # Inject orphan FKs for next orphan_count rows
+    orders_with_fk = orders_with_fk.withColumn(
+        "customer_id",
+        F.when(
+            (F.col("row_num") > null_count + outlier_count) &
+            (F.col("row_num") <= null_count + outlier_count + orphan_count),
+            F.lit("CUST-NONEXISTENT")
+        ).otherwise(F.col("customer_id"))
+    )
+
+    orders_with_fk = orders_with_fk.drop("row_num")
+
+    print(f"  Injected {null_count} null customer_ids")
+    print(f"  Injected {outlier_count} negative amounts")
+    print(f"  Injected {orphan_count} orphan foreign keys")
+
+# Drop tier column (not needed in final output)
+orders_final = orders_with_fk.drop("tier")
+
+# Save orders
+orders_final.write.mode(WRITE_MODE).parquet(f"{VOLUME_PATH}/orders")
+print(f"  Saved orders to {VOLUME_PATH}/orders")
+
+# Show status distribution
+print("\n  Status distribution:")
+orders_final.groupBy("status").count().orderBy("status").show()
+
+# =============================================================================
+# CLEANUP AND
SUMMARY +# ============================================================================= +spark.sql(f"DROP TABLE IF EXISTS {customers_tmp_table}") + +print("\n" + "=" * 80) +print("GENERATION COMPLETE") +print("=" * 80) +print(f"Catalog: {CATALOG}") +print(f"Schema: {SCHEMA}") +print(f"Volume: {VOLUME_PATH}") +print(f"\nGenerated data:") +print(f" - customers: {N_CUSTOMERS:,} rows") +print(f" - orders: {N_ORDERS:,} rows") +if INJECT_BAD_DATA: + print(f" - Bad data injected: nulls, outliers, orphan FKs") +print(f"\nDate range: {START_DATE.date()} to {END_DATE.date()}") +print("=" * 80) diff --git a/.claude/skills/databricks-unity-catalog/5-system-tables.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/5-system-tables.md similarity index 100% rename from .claude/skills/databricks-unity-catalog/5-system-tables.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/5-system-tables.md diff --git a/.claude/skills/databricks-unity-catalog/6-volumes.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/6-volumes.md similarity index 87% rename from .claude/skills/databricks-unity-catalog/6-volumes.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/6-volumes.md index 1eae49a..497b609 100644 --- a/.claude/skills/databricks-unity-catalog/6-volumes.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/6-volumes.md @@ -39,69 +39,16 @@ All volume operations use the path format: ## MCP Tools -### List Files in Volume - -```python -# List files and directories -list_volume_files( - volume_path="/Volumes/main/default/my_volume/data/" -) -# Returns: [{"name": "file.csv", "path": "...", "is_directory": false, "file_size": 1024, "last_modified": "..."}] -``` - -### Upload File to Volume - -```python -# Upload a local file -upload_to_volume( - local_path="/tmp/data.csv", - volume_path="/Volumes/main/default/my_volume/data.csv", - overwrite=True -) -# Returns: {"local_path": "...", "volume_path": "...", "success": true} -``` - -### Download File from Volume - -```python -# Download to local path -download_from_volume( - volume_path="/Volumes/main/default/my_volume/data.csv", - local_path="/tmp/downloaded.csv", - overwrite=True -) -# Returns: {"volume_path": "...", "local_path": "...", "success": true} -``` - -### Create Directory - -```python -# Create directory (creates parents like mkdir -p) -create_volume_directory( - volume_path="/Volumes/main/default/my_volume/data/2024/01" -) -# Returns: {"volume_path": "...", "success": true} -``` - -### Delete File - -```python -# Delete a file -delete_volume_file( - volume_path="/Volumes/main/default/my_volume/old_data.csv" -) -# Returns: {"volume_path": "...", "success": true} -``` - -### Get File Info - -```python -# Get file metadata -get_volume_file_info( - volume_path="/Volumes/main/default/my_volume/data.csv" -) -# Returns: {"name": "data.csv", "file_size": 1024, "last_modified": "...", "success": true} -``` +| Tool | Usage | +|------|-------| +| `list_volume_files` | `list_volume_files(volume_path="/Volumes/catalog/schema/volume/path/")` | +| `get_volume_folder_details` | `get_volume_folder_details(volume_path="catalog/schema/volume/path", format="parquet")` - schema, row counts, stats | +| `upload_to_volume` | `upload_to_volume(local_path="/tmp/data/*", volume_path="/Volumes/.../dest")` - supports files, folders, globs | +| `download_from_volume` | 
`download_from_volume(volume_path="/Volumes/.../file.csv", local_path="/tmp/file.csv")` | +| `create_volume_directory` | `create_volume_directory(volume_path="/Volumes/.../new_folder")` - creates parents like `mkdir -p` | +| `delete_volume_file` | `delete_volume_file(volume_path="/Volumes/.../file.csv")` | +| `delete_volume_directory` | `delete_volume_directory(volume_path="/Volumes/.../folder")` - directory must be empty | +| `get_volume_file_info` | `get_volume_file_info(volume_path="/Volumes/.../file.csv")` - returns size, modified date | --- diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/7-data-profiling.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/7-data-profiling.md new file mode 100644 index 0000000..23a2b62 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/7-data-profiling.md @@ -0,0 +1,309 @@ +# Data Profiling (formerly Lakehouse Monitoring) + +Comprehensive reference for Data Profiling: create quality monitors on Unity Catalog tables to track data profiles, detect drift, and monitor ML model performance. + +## Overview + +Data profiling automatically computes statistical profiles and drift metrics for tables over time. When you create a monitor, Databricks generates two output Delta tables (profile metrics + drift metrics) and an optional dashboard. + +| Component | Description | +|-----------|-------------| +| **Monitor** | Configuration attached to a UC table | +| **Profile Metrics Table** | Summary statistics computed per column | +| **Drift Metrics Table** | Statistical drift compared to baseline or previous time window | +| **Dashboard** | Auto-generated visualization of metrics | + +### Requirements + +- Unity Catalog enabled workspace +- Databricks SQL access +- Privileges: `USE CATALOG`, `USE SCHEMA`, `SELECT`, and `MANAGE` on the table +- Only Delta tables supported (managed, external, views, materialized views, streaming tables) + +--- + +## Profile Types + +| Type | Use Case | Key Params | Limitations | +|------|----------|------------|-------------| +| **Snapshot** | General-purpose tables without time column | None required | Max 4TB table size | +| **TimeSeries** | Tables with a timestamp column | `timestamp_column`, `granularities` | Last 30 days only | +| **InferenceLog** | ML model monitoring | `timestamp_column`, `granularities`, `model_id_column`, `problem_type`, `prediction_column` | Last 30 days only | + +### Granularities (for TimeSeries and InferenceLog) + +Supported `AggregationGranularity` values: `AGGREGATION_GRANULARITY_5_MINUTES`, `AGGREGATION_GRANULARITY_30_MINUTES`, `AGGREGATION_GRANULARITY_1_HOUR`, `AGGREGATION_GRANULARITY_1_DAY`, `AGGREGATION_GRANULARITY_1_WEEK` – `AGGREGATION_GRANULARITY_4_WEEKS`, `AGGREGATION_GRANULARITY_1_MONTH`, `AGGREGATION_GRANULARITY_1_YEAR` + +--- + +## MCP Tools + +Use the `manage_uc_monitors` tool for all monitor operations: + +| Action | Description | +|--------|-------------| +| `create` | Create a quality monitor on a table | +| `get` | Get monitor details and status | +| `run_refresh` | Trigger a metric refresh | +| `list_refreshes` | List refresh history | +| `delete` | Delete the monitor (assets are not deleted) | + +### Create a Monitor + +> **Note:** The MCP tool currently only creates **snapshot** monitors. For TimeSeries or InferenceLog monitors, use the Python SDK directly (see below). 
+ +```python +manage_uc_monitors( + action="create", + table_name="catalog.schema.my_table", + output_schema_name="catalog.schema", +) +``` + +### Get Monitor Status + +```python +manage_uc_monitors( + action="get", + table_name="catalog.schema.my_table", +) +``` + +### Trigger a Refresh + +```python +manage_uc_monitors( + action="run_refresh", + table_name="catalog.schema.my_table", +) +``` + +### Delete a Monitor + +```python +manage_uc_monitors( + action="delete", + table_name="catalog.schema.my_table", +) +``` + +--- + +## Python SDK Examples + +**Doc:** https://databricks-sdk-py.readthedocs.io/en/stable/workspace/dataquality/data_quality.html + +The new SDK provides full control over all profile types via `w.data_quality`. + +### Create Snapshot Monitor + +```python +from databricks.sdk import WorkspaceClient +from databricks.sdk.service.dataquality import ( + Monitor, DataProfilingConfig, SnapshotConfig, +) + +w = WorkspaceClient() + +# Look up UUIDs — the new API uses object_id and output_schema_id (both UUIDs) +table_info = w.tables.get("catalog.schema.my_table") +schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}") + +monitor = w.data_quality.create_monitor( + monitor=Monitor( + object_type="table", + object_id=table_info.table_id, + data_profiling_config=DataProfilingConfig( + assets_dir="/Workspace/Users/user@example.com/monitoring/my_table", + output_schema_id=schema_info.schema_id, + snapshot=SnapshotConfig(), + ), + ), +) +print(f"Monitor status: {monitor.data_profiling_config.status}") +``` + +### Create TimeSeries Monitor + +```python +from databricks.sdk.service.dataquality import ( + Monitor, DataProfilingConfig, TimeSeriesConfig, AggregationGranularity, +) + +table_info = w.tables.get("catalog.schema.events") +schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}") + +monitor = w.data_quality.create_monitor( + monitor=Monitor( + object_type="table", + object_id=table_info.table_id, + data_profiling_config=DataProfilingConfig( + assets_dir="/Workspace/Users/user@example.com/monitoring/events", + output_schema_id=schema_info.schema_id, + time_series=TimeSeriesConfig( + timestamp_column="event_timestamp", + granularities=[AggregationGranularity.AGGREGATION_GRANULARITY_1_DAY], + ), + ), + ), +) +``` + +### Create InferenceLog Monitor + +```python +from databricks.sdk.service.dataquality import ( + Monitor, DataProfilingConfig, InferenceLogConfig, + AggregationGranularity, InferenceProblemType, +) + +table_info = w.tables.get("catalog.schema.model_predictions") +schema_info = w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}") + +monitor = w.data_quality.create_monitor( + monitor=Monitor( + object_type="table", + object_id=table_info.table_id, + data_profiling_config=DataProfilingConfig( + assets_dir="/Workspace/Users/user@example.com/monitoring/predictions", + output_schema_id=schema_info.schema_id, + inference_log=InferenceLogConfig( + timestamp_column="prediction_timestamp", + granularities=[AggregationGranularity.AGGREGATION_GRANULARITY_1_HOUR], + model_id_column="model_version", + problem_type=InferenceProblemType.INFERENCE_PROBLEM_TYPE_CLASSIFICATION, + prediction_column="prediction", + label_column="label", + ), + ), + ), +) +``` + +### Schedule a Monitor + +```python +from databricks.sdk.service.dataquality import ( + Monitor, DataProfilingConfig, SnapshotConfig, CronSchedule, +) + +table_info = w.tables.get("catalog.schema.my_table") +schema_info = 
w.schemas.get(f"{table_info.catalog_name}.{table_info.schema_name}") + +monitor = w.data_quality.create_monitor( + monitor=Monitor( + object_type="table", + object_id=table_info.table_id, + data_profiling_config=DataProfilingConfig( + assets_dir="/Workspace/Users/user@example.com/monitoring/my_table", + output_schema_id=schema_info.schema_id, + snapshot=SnapshotConfig(), + schedule=CronSchedule( + quartz_cron_expression="0 0 12 * * ?", # Daily at noon + timezone_id="UTC", + ), + ), + ), +) +``` + +### Get, Refresh, and Delete + +```python +# Get monitor details +monitor = w.data_quality.get_monitor( + object_type="table", + object_id=table_info.table_id, +) + +# Trigger refresh +from databricks.sdk.service.dataquality import Refresh + +refresh = w.data_quality.create_refresh( + object_type="table", + object_id=table_info.table_id, + refresh=Refresh( + object_type="table", + object_id=table_info.table_id, + ), +) + +# Delete monitor (does not delete output tables or dashboard) +w.data_quality.delete_monitor( + object_type="table", + object_id=table_info.table_id, +) +``` + +--- + +## Anomaly Detection + +Anomaly detection is enabled at the **schema level**, not per table. Once enabled, Databricks automatically scans all tables in the schema at the same frequency they are updated. + +```python +from databricks.sdk.service.dataquality import Monitor, AnomalyDetectionConfig + +schema_info = w.schemas.get("catalog.schema") + +monitor = w.data_quality.create_monitor( + monitor=Monitor( + object_type="schema", + object_id=schema_info.schema_id, + anomaly_detection_config=AnomalyDetectionConfig(), + ), +) +``` + +> **Note:** Anomaly detection requires `MANAGE SCHEMA` or `MANAGE CATALOG` privileges and serverless compute enabled on the workspace. + +--- + +## Output Tables + +When a monitor is created, two metric tables are generated in the specified output schema: + +| Table | Naming Convention | Contents | +|-------|-------------------|----------| +| **Profile Metrics** | `{table_name}_profile_metrics` | Per-column statistics (nulls, min, max, mean, distinct count, etc.) | +| **Drift Metrics** | `{table_name}_drift_metrics` | Statistical tests comparing current vs. baseline or previous window | + +### Query Output Tables + +```sql +-- View latest profile metrics +SELECT * +FROM catalog.schema.my_table_profile_metrics +ORDER BY window_end DESC +LIMIT 100; + +-- View latest drift metrics +SELECT * +FROM catalog.schema.my_table_drift_metrics +ORDER BY window_end DESC +LIMIT 100; +``` + +--- + +## Common Issues + +| Issue | Cause | Solution | +|-------|-------|----------| +| `FEATURE_NOT_ENABLED` | Data profiling not enabled on workspace | Contact workspace admin to enable the feature | +| `PERMISSION_DENIED` | Missing `MANAGE` privilege on the table | Grant `MANAGE` on the table to your user/group | +| Monitor refresh stuck in `PENDING` | No SQL warehouse available | Ensure a SQL warehouse is running or set `warehouse_id` | +| Profile metrics table empty | Refresh has not completed yet | Check refresh state with `list_refreshes`; wait for `SUCCESS` | +| Snapshot monitor on large table fails | Table exceeds 4TB limit | Switch to TimeSeries profile type instead | +| TimeSeries shows limited data | Only processes last 30 days | Expected behavior; contact account team to adjust | + +--- + +> **Note:** Data profiling was formerly known as Lakehouse Monitoring. The legacy SDK accessor +> `w.lakehouse_monitors` and the MCP tool `manage_uc_monitors` still use the previous API. 
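+
+For example, to check refresh state when metrics look stale (per the Common Issues table above), the `list_refreshes` action from the MCP tool can be used; a minimal sketch, and the exact response shape may vary:
+
+```python
+# List refresh history; inspect the latest refresh's state (e.g. PENDING vs SUCCESS)
+manage_uc_monitors(
+    action="list_refreshes",
+    table_name="catalog.schema.my_table",
+)
+```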
+ +## Resources + +- [Data Quality Monitoring Documentation](https://docs.databricks.com/aws/en/data-quality-monitoring/) +- [Data Quality SDK Reference](https://databricks-sdk-py.readthedocs.io/en/stable/workspace/dataquality/data_quality.html) +- [Legacy Lakehouse Monitors SDK Reference](https://databricks-sdk-py.readthedocs.io/en/stable/workspace/catalog/lakehouse_monitors.html) diff --git a/.claude/skills/databricks-unity-catalog/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/SKILL.md similarity index 78% rename from .claude/skills/databricks-unity-catalog/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/SKILL.md index 9b77fed..2e3d05f 100644 --- a/.claude/skills/databricks-unity-catalog/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unity-catalog/SKILL.md @@ -17,6 +17,7 @@ Use this skill when: - Tracking **compute resources** (cluster usage, warehouse metrics) - Reviewing **job execution** (run history, success rates, failures) - Analyzing **query performance** (slow queries, warehouse utilization) +- Profiling **data quality** (data profiling, drift detection, metric tables) ## Reference Files @@ -24,30 +25,19 @@ Use this skill when: |-------|------|-------------| | System Tables | [5-system-tables.md](5-system-tables.md) | Lineage, audit, billing, compute, jobs, query history | | Volumes | [6-volumes.md](6-volumes.md) | Volume file operations, permissions, best practices | +| Data Profiling | [7-data-profiling.md](7-data-profiling.md) | Data profiling, drift detection, profile metrics | ## Quick Start ### Volume File Operations (MCP Tools) -```python -# List files in a volume -list_volume_files(volume_path="/Volumes/catalog/schema/volume/folder/") - -# Upload file to volume -upload_to_volume( - local_path="/tmp/data.csv", - volume_path="/Volumes/catalog/schema/volume/data.csv" -) - -# Download file from volume -download_from_volume( - volume_path="/Volumes/catalog/schema/volume/data.csv", - local_path="/tmp/downloaded.csv" -) - -# Create directory -create_volume_directory(volume_path="/Volumes/catalog/schema/volume/new_folder") -``` +| Tool | Usage | +|------|-------| +| `list_volume_files` | `list_volume_files(volume_path="/Volumes/catalog/schema/volume/path/")` | +| `get_volume_folder_details` | `get_volume_folder_details(volume_path="catalog/schema/volume/path", format="parquet")` - schema, row counts, stats | +| `upload_to_volume` | `upload_to_volume(local_path="/tmp/data/*", volume_path="/Volumes/.../dest")` | +| `download_from_volume` | `download_from_volume(volume_path="/Volumes/.../file.csv", local_path="/tmp/file.csv")` | +| `create_volume_directory` | `create_volume_directory(volume_path="/Volumes/.../new_folder")` | ### Enable System Tables Access @@ -108,7 +98,7 @@ mcp__databricks__execute_sql( - **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - for pipelines that write to Unity Catalog tables - **[databricks-jobs](../databricks-jobs/SKILL.md)** - for job execution data visible in system tables -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - for generating data stored in Unity Catalog Volumes +- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** - for generating data stored in Unity Catalog Volumes - **[databricks-aibi-dashboards](../databricks-aibi-dashboards/SKILL.md)** - for building dashboards on top of Unity Catalog 
data ## Resources diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unstructured-pdf-generation/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unstructured-pdf-generation/SKILL.md new file mode 100644 index 0000000..92322fd --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-unstructured-pdf-generation/SKILL.md @@ -0,0 +1,337 @@ +--- +name: databricks-unstructured-pdf-generation +description: "Generate PDF documents from HTML and upload to Unity Catalog volumes. Use for creating test PDFs, demo documents, reports, or evaluation datasets." +--- + +# PDF Generation from HTML + +Convert HTML content to PDF documents and upload them to Unity Catalog Volumes. + +## Overview + +The `generate_and_upload_pdf` MCP tool converts HTML to PDF and uploads to a Unity Catalog Volume. You (the LLM) generate the HTML content, and the tool handles conversion and upload. + +## Tool Signature + +``` +generate_and_upload_pdf( + html_content: str, # Complete HTML document + filename: str, # PDF filename (e.g., "report.pdf") + catalog: str, # Unity Catalog name + schema: str, # Schema name + volume: str = "raw_data", # Volume name (default: "raw_data") + folder: str = None, # Optional subfolder +) +``` + +**Returns:** +```json +{ + "success": true, + "volume_path": "/Volumes/catalog/schema/volume/filename.pdf", + "error": null +} +``` + +## Quick Start + +Generate a simple PDF: + +``` +generate_and_upload_pdf( + html_content=''' + + + + + +

+<!DOCTYPE html>
+<html>
+<body>
+<h1>Quarterly Report Q1 2024</h1>
+
+<h2>Executive Summary</h2>
+
+<p>Revenue increased 15% year-over-year...</p>
+</body>
+</html>
+
+''',
+    filename="q1_report.pdf",
+    catalog="my_catalog",
+    schema="my_schema"
+)
+```
+
+## Performance: Generate Multiple PDFs in Parallel
+
+**IMPORTANT**: PDF generation and upload can take 2-5 seconds per document. When generating multiple PDFs, **call the tool in parallel** to maximize throughput.
+
+### Example: Generate 5 PDFs in Parallel
+
+Make 5 simultaneous `generate_and_upload_pdf` calls:
+
+```
+# Call 1
+generate_and_upload_pdf(
+    html_content="...Employee Handbook content...",
+    filename="employee_handbook.pdf",
+    catalog="hr_catalog", schema="policies", folder="2024"
+)
+
+# Call 2 (parallel)
+generate_and_upload_pdf(
+    html_content="...Leave Policy content...",
+    filename="leave_policy.pdf",
+    catalog="hr_catalog", schema="policies", folder="2024"
+)
+
+# Call 3 (parallel)
+generate_and_upload_pdf(
+    html_content="...Code of Conduct content...",
+    filename="code_of_conduct.pdf",
+    catalog="hr_catalog", schema="policies", folder="2024"
+)
+
+# Call 4 (parallel)
+generate_and_upload_pdf(
+    html_content="...Benefits Guide content...",
+    filename="benefits_guide.pdf",
+    catalog="hr_catalog", schema="policies", folder="2024"
+)
+
+# Call 5 (parallel)
+generate_and_upload_pdf(
+    html_content="...Remote Work Policy content...",
+    filename="remote_work_policy.pdf",
+    catalog="hr_catalog", schema="policies", folder="2024"
+)
+```
+
+By calling these in parallel (not sequentially), 5 PDFs that would take 15-25 seconds sequentially complete in 3-5 seconds total.
+
+## HTML Best Practices
+
+### Use Complete HTML5 Structure
+
+Always include the full HTML structure:
+
+```html
+<!DOCTYPE html>
+<html>
+<head>
+<meta charset="utf-8">
+<style>
+  /* page styles */
+</style>
+</head>
+<body>
+  <!-- document content -->
+</body>
+</html>
+```
+
+### CSS Features Supported
+
+PlutoPrint supports modern CSS3:
+- Flexbox and Grid layouts
+- CSS variables (`--var-name`)
+- Web fonts (system fonts recommended)
+- Colors, backgrounds, borders
+- Tables with styling
+
+### CSS to Avoid
+
+- Animations and transitions (static PDF)
+- Interactive elements (forms, hover effects)
+- External resources (images via URL) - use embedded base64 if needed
+
+### Professional Document Template
+
+```html
+<!DOCTYPE html>
+<html>
+<head>
+<style>
+  body { font-family: sans-serif; color: #333; }
+  table { border-collapse: collapse; width: 100%; }
+  th, td { border: 1px solid #999; padding: 8px; }
+  .callout { background: #f0f0f0; padding: 12px; margin: 16px 0; }
+</style>
+</head>
+<body>
+<h1>Document Title</h1>
+
+<h2>Section 1</h2>
+
+<p>Content here...</p>
+
+<div class="callout">
+<strong>Important:</strong> Key information highlighted here.
+</div>
+
+<h2>Data Table</h2>
+
+<table>
+<tr><th>Column 1</th><th>Column 2</th><th>Column 3</th></tr>
+<tr><td>Data</td><td>Data</td><td>Data</td></tr>
+</table>
+</body>
+</html>
+ + + + +``` + +## Common Patterns + +### Pattern 1: Technical Documentation + +Generate API documentation, user guides, or technical specs: + +``` +generate_and_upload_pdf( + html_content=''' + + + +

+<html>
+<body>
+<h1>API Reference</h1>
+
+<div class="endpoint">
+<code>GET /api/v1/users</code>
+<p>Returns a list of all users.</p>
+</div>
+
+<h2>Request Headers</h2>
+
+<pre>
+Authorization: Bearer {token}
+Content-Type: application/json
+</pre>
+</body>
+</html>
+ +''', + filename="api_reference.pdf", + catalog="docs_catalog", + schema="api_docs" +) +``` + +### Pattern 2: Business Reports + +``` +generate_and_upload_pdf( + html_content=''' + + + +

+<html>
+<body>
+<h1>Q1 2024 Performance Report</h1>
+
+<div class="metric">
+<div class="value">$2.4M</div>
+<div class="label">Revenue</div>
+</div>
+
+<div class="metric">
+<div class="value">+15%</div>
+<div class="label">Growth</div>
+</div>
+</body>
+</html>
+
+ +''', + filename="q1_2024_report.pdf", + catalog="finance", + schema="reports", + folder="quarterly" +) +``` + +### Pattern 3: HR Policies + +``` +generate_and_upload_pdf( + html_content=''' + + + +

+<html>
+<body>
+<h1>Employee Leave Policy</h1>
+
+<p class="effective-date">Effective: January 1, 2024</p>
+
+<h2>1. Annual Leave</h2>
+
+<p>All full-time employees are entitled to 20 days of paid annual leave per calendar year.</p>
+
+<div class="note">
+<strong>Note:</strong> Leave requests must be submitted at least 2 weeks in advance.
+</div>
+</body>
+</html>
+ +''', + filename="leave_policy.pdf", + catalog="hr_catalog", + schema="policies" +) +``` + +## Workflow for Multiple Documents + +When asked to generate multiple PDFs: + +1. **Plan the documents**: Determine titles, content structure for each +2. **Generate HTML for each**: Create complete HTML documents +3. **Call tool in parallel**: Make multiple simultaneous `generate_and_upload_pdf` calls +4. **Report results**: Summarize successful uploads and any errors + +## Prerequisites + +- Unity Catalog schema must exist +- Volume must exist (default: `raw_data`) +- User must have WRITE permission on the volume + +## Troubleshooting + +| Issue | Solution | +|-------|----------| +| "Volume does not exist" | Create the volume first or use an existing one | +| "Schema does not exist" | Create the schema or check the name | +| PDF looks wrong | Check HTML/CSS syntax, use supported CSS features | +| Slow generation | Call multiple PDFs in parallel, not sequentially | diff --git a/.claude/skills/databricks-vector-search/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/SKILL.md similarity index 56% rename from .claude/skills/databricks-vector-search/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/SKILL.md index 276cab3..72068ec 100644 --- a/.claude/skills/databricks-vector-search/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/SKILL.md @@ -31,8 +31,8 @@ Databricks Vector Search provides managed vector similarity search with automati | Type | Latency | Capacity | Cost | Best For | |------|---------|----------|------|----------| -| **Standard** | ~50-100ms | 320M vectors (768 dim) | Higher | Real-time, low-latency | -| **Storage-Optimized** | ~250ms | 1B+ vectors (768 dim) | 7x lower | Large-scale, cost-sensitive | +| **Standard** | 20-50ms | 320M vectors (768 dim) | Higher | Real-time, low-latency | +| **Storage-Optimized** | 300-500ms | 1B+ vectors (768 dim) | 7x lower | Large-scale, cost-sensitive | ## Index Types @@ -184,13 +184,15 @@ results = w.vector_search_indexes.query_index( ### Hybrid Search (Semantic + Keyword) +Hybrid search combines vector similarity (ANN) with BM25 keyword scoring. Use it when queries contain exact terms that must match — SKUs, error codes, proper nouns, or technical terminology — where pure semantic search might miss keyword-specific results. See [search-modes.md](search-modes.md) for detailed guidance on choosing between ANN and hybrid search. 
+ ```python # Combines vector similarity with keyword matching results = w.vector_search_indexes.query_index( index_name="catalog.schema.my_index", columns=["id", "content"], - query_text="machine learning algorithms", - query_type="hybrid", # Enable hybrid search + query_text="SPARK-12345 executor memory error", + query_type="HYBRID", num_results=10 ) ``` @@ -212,20 +214,26 @@ results = w.vector_search_indexes.query_index( ### Storage-Optimized Filters (SQL-like) +Storage-Optimized endpoints use SQL-like filter syntax via the `databricks-vectorsearch` package's `filters` parameter (accepts a string): + ```python -# filter_string uses SQL-like syntax -results = w.vector_search_indexes.query_index( - index_name="catalog.schema.my_index", - columns=["id", "content"], +from databricks.vector_search.client import VectorSearchClient + +vsc = VectorSearchClient() +index = vsc.get_index(endpoint_name="my-storage-endpoint", index_name="catalog.schema.my_index") + +# SQL-like filter syntax for storage-optimized endpoints +results = index.similarity_search( query_text="machine learning", + columns=["id", "content"], num_results=10, - filter_string="category = 'ai' AND status IN ('active', 'pending')" + filters="category = 'ai' AND status IN ('active', 'pending')" ) # More filter examples -filter_string="price > 100 AND price < 500" -filter_string="department LIKE 'eng%'" -filter_string="created_at >= '2024-01-01'" +# filters="price > 100 AND price < 500" +# filters="department LIKE 'eng%'" +# filters="created_at >= '2024-01-01'" ``` ### Trigger Index Sync @@ -249,7 +257,12 @@ scan_result = w.vector_search_indexes.scan_index( ## Reference Files -- [index-types.md](index-types.md) - Detailed comparison of index types and creation patterns +| Topic | File | Description | +|-------|------|-------------| +| Index Types | [index-types.md](index-types.md) | Detailed comparison of Delta Sync (managed/self-managed) vs Direct Access | +| End-to-End RAG | [end-to-end-rag.md](end-to-end-rag.md) | Complete walkthrough: source table → endpoint → index → query → agent integration | +| Search Modes | [search-modes.md](search-modes.md) | When to use semantic (ANN) vs hybrid search, decision guide | +| Operations | [troubleshooting-and-operations.md](troubleshooting-and-operations.md) | Monitoring, cost optimization, capacity planning, migration | ## CLI Quick Reference @@ -285,19 +298,20 @@ databricks vector-search indexes delete-index \ |-------|----------| | **Index sync slow** | Use Storage-Optimized endpoints (20x faster indexing) | | **Query latency high** | Use Standard endpoint for <100ms latency | -| **filters_json not working** | Storage-Optimized uses `filter_string` (SQL syntax) | +| **filters_json not working** | Storage-Optimized uses SQL-like string filters via `databricks-vectorsearch` package's `filters` parameter | | **Embedding dimension mismatch** | Ensure query and index dimensions match | | **Index not updating** | Check pipeline_type; use sync_index() for TRIGGERED | | **Out of capacity** | Upgrade to Storage-Optimized (1B+ vectors) | +| **`query_vector` truncated by MCP tool** | MCP tool calls serialize arrays as JSON and can truncate large vectors (e.g. 1024-dim). 
Use `query_text` instead (for managed embedding indexes), or use the Databricks SDK/CLI to pass raw vectors | ## Embedding Models Databricks provides built-in embedding models: -| Model | Dimensions | Use Case | -|-------|------------|----------| -| `databricks-gte-large-en` | 1024 | English text, high quality | -| `databricks-bge-large-en` | 1024 | English text, general | +| Model | Dimensions | Context Window | Use Case | +|-------|------------|----------------|----------| +| `databricks-gte-large-en` | 1024 | 8192 tokens | English text, high quality | +| `databricks-bge-large-en` | 1024 | 512 tokens | English text, general purpose | ```python # Use with managed embeddings @@ -311,42 +325,118 @@ embedding_source_columns=[ ## MCP Tools -The following MCP tools are available for managing Vector Search infrastructure. These are **management tools** for creating and configuring endpoints/indexes. For agent-runtime querying, use the Databricks managed Vector Search MCP server or `VectorSearchRetrieverTool`. +The following MCP tools are available for managing Vector Search infrastructure. For a full end-to-end walkthrough, see [end-to-end-rag.md](end-to-end-rag.md). + +### manage_vs_endpoint - Endpoint Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Create endpoint (STANDARD or STORAGE_OPTIMIZED). Idempotent | name | +| `get` | Get endpoint details | name | +| `list` | List all endpoints | (none) | +| `delete` | Delete endpoint (indexes must be deleted first) | name | + +```python +# Create or update an endpoint +result = manage_vs_endpoint(action="create_or_update", name="my-vs-endpoint", endpoint_type="STANDARD") +# Returns {"name": "my-vs-endpoint", "endpoint_type": "STANDARD", "created": True} + +# List all endpoints +endpoints = manage_vs_endpoint(action="list") + +# Get specific endpoint +endpoint = manage_vs_endpoint(action="get", name="my-vs-endpoint") +``` + +### manage_vs_index - Index Management + +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `create_or_update` | Create index. Idempotent, auto-triggers sync for DELTA_SYNC | name, endpoint_name, primary_key | +| `get` | Get index details | name | +| `list` | List indexes. Optional endpoint_name filter | (none) | +| `delete` | Delete index | name | + +```python +# Create a Delta Sync index with managed embeddings +result = manage_vs_index( + action="create_or_update", + name="catalog.schema.my_index", + endpoint_name="my-vs-endpoint", + primary_key="id", + index_type="DELTA_SYNC", + delta_sync_index_spec={ + "source_table": "catalog.schema.docs", + "embedding_source_columns": [{"name": "content", "embedding_model_endpoint_name": "databricks-gte-large-en"}], + "pipeline_type": "TRIGGERED" + } +) + +# Get a specific index +index = manage_vs_index(action="get", name="catalog.schema.my_index") + +# List all indexes on an endpoint +indexes = manage_vs_index(action="list", endpoint_name="my-vs-endpoint") + +# List all indexes across all endpoints +all_indexes = manage_vs_index(action="list") +``` + +### query_vs_index - Query (Hot Path) + +Query index with `query_text`, `query_vector`, or hybrid (`query_type="HYBRID"`). Prefer `query_text` over `query_vector` — MCP tool calls can truncate large embedding arrays (1024-dim). 
+ +```python +# Query an index +results = query_vs_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_text="machine learning best practices", + num_results=5 +) -### Endpoint Management +# Hybrid search (combines vector + keyword) +results = query_vs_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_text="SPARK-12345 memory error", + query_type="HYBRID", + num_results=10 +) +``` -| Tool | Description | -|------|-------------| -| `create_vs_endpoint` | Create a Vector Search endpoint (STANDARD or STORAGE_OPTIMIZED) | -| `get_vs_endpoint` | Get endpoint status and details | -| `list_vs_endpoints` | List all endpoints in the workspace | -| `delete_vs_endpoint` | Delete an endpoint (indexes must be deleted first) | +### manage_vs_data - Data Operations -### Index Management +| Action | Description | Required Params | +|--------|-------------|-----------------| +| `upsert` | Insert/update records | index_name, inputs_json | +| `delete` | Delete by primary key | index_name, primary_keys | +| `scan` | Scan index contents | index_name | +| `sync` | Trigger sync for TRIGGERED indexes | index_name | -| Tool | Description | -|------|-------------| -| `create_vs_index` | Create a Delta Sync or Direct Access index | -| `get_vs_index` | Get index status and configuration | -| `list_vs_indexes` | List all indexes on an endpoint | -| `delete_vs_index` | Delete an index | -| `sync_vs_index` | Trigger sync for TRIGGERED pipeline indexes | +```python +# Upsert data into a Direct Access index +manage_vs_data( + action="upsert", + index_name="catalog.schema.my_index", + inputs_json=[{"id": "doc1", "content": "...", "embedding": [0.1, 0.2, ...]}] +) -### Query and Data +# Trigger manual sync for a TRIGGERED pipeline index +manage_vs_data(action="sync", index_name="catalog.schema.my_index") -| Tool | Description | -|------|-------------| -| `query_vs_index` | Query index with text, vector, or hybrid search (for testing) | -| `upsert_vs_data` | Upsert vectors into a Direct Access index | -| `delete_vs_data` | Delete vectors from a Direct Access index | -| `scan_vs_index` | Scan/export index entries (for debugging) | +# Scan index contents +manage_vs_data(action="scan", index_name="catalog.schema.my_index", num_results=100) +``` ## Notes -- **Storage-Optimized is newer** - Better for most use cases unless you need <100ms latency -- **Delta Sync recommended** - Easier than Direct Access for most scenarios -- **Hybrid search** - Available for both Delta Sync and Direct Access indexes -- **Management vs runtime** - MCP tools above handle lifecycle management; for agent tool-calling at runtime, use the Databricks managed Vector Search MCP server +- **Storage-Optimized is newer** — better for most use cases unless you need <100ms latency +- **Delta Sync recommended** — easier than Direct Access for most scenarios +- **Hybrid search** — available for both Delta Sync and Direct Access indexes +- **`columns_to_sync` matters** — only synced columns are available in query results; include all columns you need +- **Filter syntax differs by endpoint** — Standard uses dict-format filters, Storage-Optimized uses SQL-like string filters. 
Use the `databricks-vectorsearch` package's `filters` parameter which accepts both formats +- **Management vs runtime** — MCP tools above handle lifecycle management; for agent tool-calling at runtime, use `VectorSearchRetrieverTool` or the Databricks managed Vector Search MCP server ## Related Skills diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/end-to-end-rag.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/end-to-end-rag.md new file mode 100644 index 0000000..a3808d1 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/end-to-end-rag.md @@ -0,0 +1,241 @@ +# End-to-End RAG with Vector Search + +Build a complete Retrieval-Augmented Generation pipeline: prepare documents, create a vector index, query it, and wire it into an agent. + +## MCP Tools Used + +| Tool | Step | +|------|------| +| `execute_sql` | Create source table, insert documents | +| `manage_vs_endpoint(action="create")` | Create compute endpoint | +| `manage_vs_index(action="create")` | Create Delta Sync index with managed embeddings | +| `manage_vs_index(action="sync")` | Trigger index sync | +| `manage_vs_index(action="get")` | Check index status | +| `query_vs_index` | Test similarity search | + +--- + +## Step 1: Prepare Source Table + +The source Delta table needs a primary key column and a text column to embed. + +```sql +CREATE TABLE IF NOT EXISTS catalog.schema.knowledge_base ( + doc_id STRING, + title STRING, + content STRING, + category STRING, + updated_at TIMESTAMP DEFAULT current_timestamp() +); + +INSERT INTO catalog.schema.knowledge_base VALUES +('doc-001', 'Getting Started', 'Databricks is a unified analytics platform...', 'overview', current_timestamp()), +('doc-002', 'Unity Catalog', 'Unity Catalog provides centralized governance...', 'governance', current_timestamp()), +('doc-003', 'Delta Lake', 'Delta Lake is an open-source storage layer...', 'storage', current_timestamp()); +``` + +Or via MCP: + +```python +execute_sql(sql_query=""" + CREATE TABLE IF NOT EXISTS catalog.schema.knowledge_base ( + doc_id STRING, + title STRING, + content STRING, + category STRING, + updated_at TIMESTAMP DEFAULT current_timestamp() + ) +""") +``` + +## Step 2: Create Vector Search Endpoint + +```python +manage_vs_endpoint( + action="create", + name="my-rag-endpoint", + endpoint_type="STORAGE_OPTIMIZED" +) +``` + +Endpoint creation is asynchronous. 
Check status: + +```python +manage_vs_endpoint(action="get", name="my-rag-endpoint") +# Wait for state: "ONLINE" +``` + +## Step 3: Create Delta Sync Index + +```python +manage_vs_index( + action="create", + name="catalog.schema.knowledge_base_index", + endpoint_name="my-rag-endpoint", + primary_key="doc_id", + index_type="DELTA_SYNC", + delta_sync_index_spec={ + "source_table": "catalog.schema.knowledge_base", + "embedding_source_columns": [ + { + "name": "content", + "embedding_model_endpoint_name": "databricks-gte-large-en" + } + ], + "pipeline_type": "TRIGGERED", + "columns_to_sync": ["doc_id", "title", "content", "category"] + } +) +``` + +Key decisions: +- **`embedding_source_columns`**: Databricks computes embeddings automatically from the `content` column +- **`pipeline_type`**: `TRIGGERED` for manual sync (cheaper), `CONTINUOUS` for auto-sync on table changes +- **`columns_to_sync`**: Only sync columns you need in query results (reduces storage and improves performance) + +## Step 4: Sync and Verify + +```python +# Trigger initial sync +manage_vs_index(action="sync", index_name="catalog.schema.knowledge_base_index") + +# Check status +manage_vs_index(action="get", index_name="catalog.schema.knowledge_base_index") +# Wait for state: "ONLINE" +``` + +## Step 5: Query the Index + +```python +# Semantic search +query_vs_index( + index_name="catalog.schema.knowledge_base_index", + columns=["doc_id", "title", "content", "category"], + query_text="How do I govern my data?", + num_results=3 +) +``` + +### With Filters + +The filter syntax depends on the endpoint type used when creating the index. + +```python +# Storage-Optimized endpoint (used in this walkthrough): SQL-like filter syntax +query_vs_index( + index_name="catalog.schema.knowledge_base_index", + columns=["doc_id", "title", "content"], + query_text="How do I govern my data?", + num_results=3, + filters="category = 'governance'" +) + +# Standard endpoint (if you created a Standard endpoint instead): JSON filters_json +query_vs_index( + index_name="catalog.schema.my_standard_index", + columns=["doc_id", "title", "content"], + query_text="How do I govern my data?", + num_results=3, + filters_json='{"category": "governance"}' +) +``` + +### Hybrid Search (Vector + Keyword) + +```python +query_vs_index( + index_name="catalog.schema.knowledge_base_index", + columns=["doc_id", "title", "content"], + query_text="Delta Lake ACID transactions", + num_results=5, + query_type="HYBRID" +) +``` + +--- + +## Step 6: Use in an Agent + +### As a Tool in a ChatAgent + +Use `VectorSearchRetrieverTool` to wire the index into an agent deployed on Model Serving: + +```python +from databricks.agents import ChatAgent +from databricks.agents.tools import VectorSearchRetrieverTool +from databricks.sdk import WorkspaceClient + +# Define the retriever tool +retriever_tool = VectorSearchRetrieverTool( + index_name="catalog.schema.knowledge_base_index", + columns=["doc_id", "title", "content"], + num_results=3, +) + +class RAGAgent(ChatAgent): + def __init__(self): + self.w = WorkspaceClient() + + def predict(self, messages, context=None): + query = messages[-1].content + + results = self.w.vector_search_indexes.query_index( + index_name="catalog.schema.knowledge_base_index", + columns=["title", "content"], + query_text=query, + num_results=3, + ) + + context_docs = "\n\n".join( + f"**{row[0]}**: {row[1]}" + for row in results.result.data_array + ) + + response = self.w.serving_endpoints.query( + name="databricks-meta-llama-3-3-70b-instruct", + messages=[ 
+ {"role": "system", "content": f"Answer using this context:\n{context_docs}"}, + {"role": "user", "content": query}, + ], + ) + + return {"content": response.choices[0].message.content} +``` + +--- + +## Updating the Index + +### Add New Documents + +```sql +INSERT INTO catalog.schema.knowledge_base VALUES +('doc-004', 'MLflow', 'MLflow is an open-source platform for ML lifecycle...', 'ml', current_timestamp()); +``` + +Then sync: + +```python +manage_vs_index(action="sync", index_name="catalog.schema.knowledge_base_index") +``` + +### Delete Documents + +```sql +DELETE FROM catalog.schema.knowledge_base WHERE doc_id = 'doc-001'; +``` + +Then sync — the index automatically handles deletions via Delta change data feed. + +--- + +## Common Issues + +| Issue | Solution | +|-------|----------| +| **Index stuck in PROVISIONING** | Endpoint may still be creating. Check `manage_vs_endpoint(action="get")` first | +| **Query returns no results** | Index may not be synced yet. Run `manage_vs_index(action="sync")` and wait for ONLINE state | +| **"Column not found in index"** | Column must be in `columns_to_sync`. Recreate index with the column included | +| **Embeddings not computed** | Ensure `embedding_model_endpoint_name` is a valid serving endpoint | +| **Stale results after table update** | For TRIGGERED pipelines, you must call `manage_vs_index(action="sync")` manually | +| **Filter not working** | Standard endpoints use dict-format filters (`filters_json`), Storage-Optimized use SQL-like string filters (`filters`) | diff --git a/.claude/skills/databricks-vector-search/index-types.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/index-types.md similarity index 100% rename from .claude/skills/databricks-vector-search/index-types.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/index-types.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/search-modes.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/search-modes.md new file mode 100644 index 0000000..58092af --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/search-modes.md @@ -0,0 +1,142 @@ +# Vector Search Modes + +Databricks Vector Search supports three search modes: **ANN** (semantic, default), **HYBRID** (semantic + keyword), and **FULL_TEXT** (keyword only, beta). ANN and HYBRID work with Delta Sync and Direct Access indexes. + +## Semantic Search (ANN) + +ANN (Approximate Nearest Neighbor) is the default search mode. It finds documents by vector similarity — matching the *meaning* of your query against stored embeddings. + +### When to use + +- Conceptual or meaning-based queries ("How do I handle errors in my pipeline?") +- Paraphrased input where exact terms may not appear in the documents +- Multilingual scenarios where query and document languages may differ +- General-purpose RAG retrieval + +### Example + +```python +# ANN is the default — no query_type parameter needed +results = w.vector_search_indexes.query_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_text="How do I handle errors in my pipeline?", + num_results=5 +) +``` + +## Hybrid Search + +Hybrid search combines vector similarity (ANN) with BM25 keyword scoring. It retrieves documents that are both semantically similar *and* contain matching keywords, then merges the results. 
+ +### When to use + +- Queries containing exact terms that must appear: SKUs, product codes, error codes, acronyms +- Proper nouns — company names, people, specific technologies +- Technical documentation where terminology precision matters +- Mixed-intent queries combining concepts with specific terms + +### Example + +```python +results = w.vector_search_indexes.query_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_text="SPARK-12345 executor memory error", + query_type="HYBRID", + num_results=10 +) +``` + +## Decision Guide + +| Mode | Best for | Trade-off | Choose when | +|------|----------|-----------|-------------| +| **ANN** (default) | Conceptual queries, paraphrases, meaning-based search | Fastest; may miss exact keyword matches | You want documents *about* a topic regardless of exact wording | +| **HYBRID** | Exact terms, codes, proper nouns, mixed-intent queries | ~2x resource usage vs ANN; max 200 results | Your queries contain specific identifiers or technical terms that must appear in results | +| **FULL_TEXT** (beta) | Pure keyword search without vector embeddings | No semantic understanding; max 200 results | You need keyword matching only, without vector similarity | + +**Start with ANN.** Switch to HYBRID if you notice relevant documents being missed because they don't share vocabulary with the query. + +## Combining Search Modes with Filters + +Both search modes support filters. The filter syntax depends on your endpoint type: + +- **Standard endpoints** → `filters` as dict (or `filters_json` as JSON string via `databricks-sdk`) +- **Storage-Optimized endpoints** → `filters` as SQL-like string (via `databricks-vectorsearch` package) + +### Standard endpoint with hybrid search + +```python +results = w.vector_search_indexes.query_index( + index_name="catalog.schema.my_index", + columns=["id", "content", "category"], + query_text="SPARK-12345 executor memory error", + query_type="HYBRID", + num_results=10, + filters_json='{"category": "troubleshooting", "status": ["open", "in_progress"]}' +) +``` + +### Storage-Optimized endpoint with hybrid search + +```python +from databricks.vector_search.client import VectorSearchClient + +vsc = VectorSearchClient() +index = vsc.get_index(endpoint_name="my-storage-endpoint", index_name="catalog.schema.my_index") + +results = index.similarity_search( + query_text="SPARK-12345 executor memory error", + columns=["id", "content", "category"], + query_type="hybrid", + num_results=10, + filters="category = 'troubleshooting' AND status IN ('open', 'in_progress')" +) +``` + +## Using with Pre-Computed Embeddings + +If you compute embeddings yourself, use `query_vector` instead of `query_text` for ANN search: + +```python +# ANN with pre-computed embedding (default) +results = w.vector_search_indexes.query_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_vector=[0.1, 0.2, 0.3, ...], # Your embedding vector + num_results=10 +) +``` + +For **hybrid search with self-managed embeddings** (indexes without an associated model endpoint), you must provide **both** `query_vector` and `query_text`. 
The vector is used for the ANN component and the text for the BM25 keyword component: + +```python +# HYBRID with self-managed embeddings — requires both vector AND text +results = w.vector_search_indexes.query_index( + index_name="catalog.schema.my_index", + columns=["id", "content"], + query_vector=[0.1, 0.2, 0.3, ...], # For ANN similarity + query_text="executor memory error", # For BM25 keyword matching + query_type="HYBRID", + num_results=10 +) +``` + +**Notes:** +- For **ANN** queries: provide either `query_text` or `query_vector`, not both. +- For **HYBRID** queries on **managed embedding indexes**: provide only `query_text` (the system handles both components). +- For **HYBRID** queries on **self-managed indexes without a model endpoint**: provide both `query_vector` and `query_text`. +- When using `query_text` alone, the index must have an associated embedding model (managed embeddings or `embedding_model_endpoint_name` on a Direct Access index). + +## Parameter Reference + +| Parameter | Type | Package | Description | +|-----------|------|---------|-------------| +| `query_text` | `str` | Both | Text query — requires embedding model on the index | +| `query_vector` | `list[float]` | Both | Pre-computed embedding vector | +| `query_type` | `str` | Both | `"ANN"` (default) or `"HYBRID"` or `"FULL_TEXT"` (beta) | +| `columns` | `list[str]` | Both | Column names to return in results | +| `num_results` | `int` | Both | Number of results (default: 10 in `databricks-sdk`, 5 in `databricks-vectorsearch`) | +| `filters_json` | `str` | `databricks-sdk` | JSON dict filter string (Standard endpoints) | +| `filters` | `str` or `dict` | `databricks-vectorsearch` | Dict for Standard, SQL-like string for Storage-Optimized | diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/troubleshooting-and-operations.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/troubleshooting-and-operations.md new file mode 100644 index 0000000..7dc4b8c --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-vector-search/troubleshooting-and-operations.md @@ -0,0 +1,177 @@ +# Vector Search Troubleshooting & Operations + +Operational guidance for monitoring, cost optimization, capacity planning, and migration of Databricks Vector Search resources. + +## Monitoring Endpoint Status + +Use `manage_vs_endpoint(action="get")` (MCP tool) or `w.vector_search_endpoints.get_endpoint()` (SDK) to check endpoint health. + +### Endpoint fields + +| Field | Description | +|-------|-------------| +| `state` | `ONLINE`, `PROVISIONING`, `OFFLINE`, `YELLOW_STATE`, `RED_STATE`, `DELETED` | +| `message` | Human-readable status or error message | +| `endpoint_type` | `STANDARD` or `STORAGE_OPTIMIZED` | +| `num_indexes` | Number of indexes hosted on this endpoint | +| `creation_timestamp` | When the endpoint was created | +| `last_updated_timestamp` | When the endpoint was last modified | + +### Example + +```python +endpoint = w.vector_search_endpoints.get_endpoint(endpoint_name="my-endpoint") +print(f"State: {endpoint.endpoint_status.state.value}") +print(f"Indexes: {endpoint.num_indexes}") +``` + +**What to do per state:** +- `PROVISIONING` → Wait. Endpoint creation is asynchronous and can take several minutes. +- `ONLINE` → Ready to serve queries and host indexes. +- `OFFLINE` → Check the `message` field for error details. May require recreation. +- `YELLOW_STATE` → Endpoint is degraded but still serving. 
Investigate the `message` field. +- `RED_STATE` → Endpoint is unhealthy. Check `message` for details; may need support intervention. + +## Monitoring Index Status + +Use `manage_vs_index(action="get")` (MCP tool) or `w.vector_search_indexes.get_index()` (SDK) to check index health. + +### Index fields + +| Field | Description | +|-------|-------------| +| `status.ready` | Boolean — `True` when ready for queries, `False` when provisioning/syncing | +| `status.message` | Status details or error information | +| `status.index_url` | URL to access the index in the Databricks UI | +| `status.indexed_row_count` | Number of rows currently indexed | +| `delta_sync_index_spec.pipeline_id` | DLT pipeline ID (Delta Sync indexes only) — useful for debugging sync issues | +| `index_type` | `DELTA_SYNC` or `DIRECT_ACCESS` | + +### Example + +```python +index = w.vector_search_indexes.get_index(index_name="catalog.schema.my_index") +if index.status.ready: + print("Index is ONLINE") +else: + print(f"Index is NOT_READY: {index.status.message}") +``` + +## Pipeline Type Trade-offs + +Delta Sync indexes use a DLT pipeline to sync data from the source Delta table. The pipeline type determines sync behavior: + +| Pipeline Type | Behavior | Cost | Best for | +|---------------|----------|------|----------| +| **TRIGGERED** | Manual sync via `manage_vs_index(action="sync")` | Lower — runs only when triggered | Batch updates, periodic refreshes, cost-sensitive workloads | +| **CONTINUOUS** | Auto-syncs on source table changes | Higher — always running | Real-time freshness, applications needing up-to-date results | + +### Triggering a sync + +```python +# For TRIGGERED pipelines only +w.vector_search_indexes.sync_index(index_name="catalog.schema.my_index") +# Check sync progress with get_index() +``` + +**Tip:** CONTINUOUS pipelines cannot be synced manually — they sync automatically. Calling `sync_index()` on a CONTINUOUS index will raise an error. + +## Cost Optimization + +### Endpoint type selection + +| Factor | Standard | Storage-Optimized | +|--------|----------|-------------------| +| Query latency | 20-50ms | 300-500ms | +| Cost | Higher | ~7x lower | +| Max capacity | 320M vectors (768 dim) | 1B+ vectors (768 dim) | +| Indexing speed | Slower | 20x faster | + +**Recommendation:** Start with Storage-Optimized unless you need sub-100ms latency. It handles most RAG workloads well. + +### Reducing storage costs + +- Use `columns_to_sync` to limit which columns are synced to the index. Only synced columns are available in query results, so include only what you need. +- Choose TRIGGERED pipelines for batch workloads to avoid continuous compute costs. 
+ +```python +# Only sync the columns you actually need in query results +delta_sync_index_spec={ + "source_table": "catalog.schema.documents", + "embedding_source_columns": [ + {"name": "content", "embedding_model_endpoint_name": "databricks-gte-large-en"} + ], + "pipeline_type": "TRIGGERED", + "columns_to_sync": ["id", "content", "title"] # Exclude large unused columns +} +``` + +## Capacity Planning + +| Endpoint Type | Max Vectors (768 dim) | Guidance | +|---------------|----------------------|----------| +| Standard | ~320M | Suitable for most production workloads under 300M documents | +| Storage-Optimized | 1B+ | Large-scale corpora, enterprise knowledge bases | + +**Estimating needs:** +- One document typically maps to one vector (or multiple if chunked) +- If chunking at ~512 tokens, expect 2-5 vectors per page of text +- Monitor `num_indexes` on your endpoint to understand utilization + +## Migration Patterns + +### Changing endpoint type + +Endpoints are **immutable after creation** — you cannot change the type (Standard ↔ Storage-Optimized) of an existing endpoint. To migrate: + +1. **Create a new endpoint** with the desired type +2. **Recreate indexes** on the new endpoint pointing to the same source tables +3. **Wait for sync** to complete (check index state) +4. **Update applications** to query the new index names +5. **Delete old indexes**, then delete the old endpoint + +```python +# Step 1: Create new endpoint +w.vector_search_endpoints.create_endpoint( + name="my-endpoint-storage-optimized", + endpoint_type="STORAGE_OPTIMIZED" +) + +# Step 2: Recreate index on new endpoint (same source table) +w.vector_search_indexes.create_index( + name="catalog.schema.my_index_v2", + endpoint_name="my-endpoint-storage-optimized", + primary_key="id", + index_type="DELTA_SYNC", + delta_sync_index_spec={ + "source_table": "catalog.schema.documents", + "embedding_source_columns": [ + {"name": "content", "embedding_model_endpoint_name": "databricks-gte-large-en"} + ], + "pipeline_type": "TRIGGERED" + } +) + +# Step 3: Trigger sync and wait for ONLINE state +w.vector_search_indexes.sync_index(index_name="catalog.schema.my_index_v2") + +# Step 4: Update your application to use "catalog.schema.my_index_v2" +# Step 5: Clean up old resources +w.vector_search_indexes.delete_index(index_name="catalog.schema.my_index") +w.vector_search_endpoints.delete_endpoint(endpoint_name="my-endpoint") +``` + +## Expanded Troubleshooting + +| Issue | Likely Cause | Solution | +|-------|-------------|----------| +| **Index stuck in NOT_READY** | Sync pipeline failed or source table issue | Check `message` field via `manage_vs_index(action="get")`. Inspect the DLT pipeline using `pipeline_id`. | +| **Embedding dimension mismatch** | Query vector dimensions ≠ index dimensions | Ensure your embedding model output matches the `embedding_dimension` in the index spec. | +| **Permission errors on create** | Missing Unity Catalog privileges | User needs `CREATE TABLE` on the schema and `USE CATALOG`/`USE SCHEMA` privileges. | +| **Index returns NOT_FOUND** | Wrong name format or index deleted | Index names must be fully qualified: `catalog.schema.index_name`. | +| **Sync not running (TRIGGERED)** | Sync not triggered after source update | Call `manage_vs_index(action="sync")` or `w.vector_search_indexes.sync_index()` after updating source data. | +| **Endpoint NOT_FOUND** | Endpoint name typo or deleted | List all endpoints with `manage_vs_endpoint(action="list")` to verify available endpoints. 
|
| **Query returns empty results** | Index not yet synced, or filters too restrictive | Check index state is ONLINE. Verify `columns_to_sync` includes queried columns. Test without filters first. |
| **filters_json has no effect** | Using wrong filter syntax for endpoint type | Standard endpoints use dict-format filters (`filters_json` in SDK, `filters` as dict in `databricks-vectorsearch`). Storage-Optimized endpoints use SQL-like string filters (`filters` as str in `databricks-vectorsearch`). |
| **Quota or capacity errors** | Too many indexes or vectors | Check `num_indexes` on endpoint. Consider Storage-Optimized for higher capacity. |
| **Upsert fails on Delta Sync** | Cannot upsert to Delta Sync indexes | Upsert/delete operations only work on Direct Access indexes. Delta Sync indexes update via their source table. |
diff --git a/.claude/skills/databricks-zerobus-ingest/1-setup-and-authentication.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/1-setup-and-authentication.md
similarity index 89%
rename from .claude/skills/databricks-zerobus-ingest/1-setup-and-authentication.md
rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/1-setup-and-authentication.md
index 10b07f6..31dfd1b 100644
--- a/.claude/skills/databricks-zerobus-ingest/1-setup-and-authentication.md
+++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/1-setup-and-authentication.md
@@ -76,6 +76,8 @@ GRANT MODIFY, SELECT ON TABLE my_catalog.my_schema.my_events TO `<service-principal-id>`;

```bash
-pip install databricks-zerobus-ingest-sdk
+pip install databricks-zerobus-ingest-sdk>=1.0.0
```

Or with a virtual environment:

```bash
-uv pip install databricks-zerobus-ingest-sdk
+uv pip install databricks-zerobus-ingest-sdk>=1.0.0
```
+
+**Note:** The Zerobus SDK cannot be pip-installed on Databricks serverless compute. Use classic compute clusters, or use the [Zerobus REST API](https://docs.databricks.com/aws/en/ingestion/zerobus-rest-api) (Beta) for notebook-based ingestion without the SDK.
+ ### Java (8+) Maven: diff --git a/.claude/skills/databricks-zerobus-ingest/2-python-client.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/2-python-client.md similarity index 77% rename from .claude/skills/databricks-zerobus-ingest/2-python-client.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/2-python-client.md index ac95cd4..64c6f8b 100644 --- a/.claude/skills/databricks-zerobus-ingest/2-python-client.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/2-python-client.md @@ -11,12 +11,14 @@ Python SDK patterns for Zerobus Ingest: synchronous and asynchronous APIs, JSON from zerobus.sdk.sync import ZerobusSdk # Asynchronous API (equivalent capabilities) -from zerobus.sdk.asyncio import ZerobusSdk as AsyncZerobusSdk +from zerobus.sdk.aio import ZerobusSdk as AsyncZerobusSdk # Shared types (used by both sync and async) from zerobus.sdk.shared import ( RecordType, - IngestRecordResponse, + AckCallback, + ZerobusException, + NonRetriableException, StreamConfigurationOptions, TableProperties, ) @@ -49,8 +51,8 @@ stream = sdk.create_stream(client_id, client_secret, table_props, options) try: for i in range(100): record = {"device_name": f"sensor-{i}", "temp": 22, "humidity": 55} - ack = stream.ingest_record(record) - ack.wait_for_ack() # Block until durably written + offset = stream.ingest_record_offset(record) + stream.wait_for_offset(offset) # Block until durably written finally: stream.close() ``` --> @@ -90,8 +92,8 @@ try: temp=22, humidity=55, ) - ack = stream.ingest_record(record) - ack.wait_for_ack() + offset = stream.ingest_record_offset(record) + stream.wait_for_offset(offset) finally: stream.close() ``` @@ -100,17 +102,21 @@ finally: ## ACK Callback (Asynchronous Acknowledgment) -Instead of blocking on each ACK, register a callback for background durability confirmation: +Instead of blocking on each ACK, register an `AckCallback` subclass for background durability confirmation: ```python -from zerobus.sdk.shared import IngestRecordResponse, StreamConfigurationOptions, RecordType +from zerobus.sdk.shared import AckCallback, StreamConfigurationOptions, RecordType -def on_ack(response: IngestRecordResponse) -> None: - print(f"Durable up to offset: {response.durability_ack_up_to_offset}") +class MyAckHandler(AckCallback): + def on_ack(self, offset: int) -> None: + print(f"Durable up to offset: {offset}") + + def on_error(self, offset: int, message: str) -> None: + print(f"Error at offset {offset}: {message}") options = StreamConfigurationOptions( record_type=RecordType.JSON, - ack_callback=on_ack, + ack_callback=MyAckHandler(), ) # Create stream with callback @@ -119,7 +125,7 @@ stream = sdk.create_stream(client_id, client_secret, table_props, options) try: for i in range(1000): record = {"device_name": f"sensor-{i}", "temp": 22, "humidity": 55} - stream.ingest_record(record) # Non-blocking, ACKs arrive via callback + stream.ingest_record_nowait(record) # Fire-and-forget, ACKs arrive via callback stream.flush() # Ensure all buffered records are sent finally: stream.close() @@ -135,12 +141,12 @@ A production-ready wrapper with retry logic, reconnection, and both JSON and Pro import os import time import logging -from typing import Optional, Callable +from typing import Optional from zerobus.sdk.sync import ZerobusSdk from zerobus.sdk.shared import ( RecordType, - IngestRecordResponse, + AckCallback, StreamConfigurationOptions, TableProperties, ) @@ -159,7 +165,7 @@ 
class ZerobusClient: client_id: str, client_secret: str, record_type: RecordType = RecordType.JSON, - ack_callback: Optional[Callable[[IngestRecordResponse], None]] = None, + ack_callback: Optional[AckCallback] = None, proto_descriptor=None, ): self.server_endpoint = server_endpoint @@ -199,8 +205,8 @@ class ZerobusClient: try: if self.stream is None: self.init_stream() - ack = self.stream.ingest_record(payload) - ack.wait_for_ack() + offset = self.stream.ingest_record_offset(payload) + self.stream.wait_for_offset(offset) return True except Exception as e: err = str(e).lower() @@ -275,7 +281,7 @@ The SDK provides an equivalent async API for use with `asyncio`: ```python import asyncio -from zerobus.sdk.asyncio import ZerobusSdk as AsyncZerobusSdk +from zerobus.sdk.aio import ZerobusSdk as AsyncZerobusSdk from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties @@ -289,8 +295,8 @@ async def ingest_async(): try: for i in range(100): record = {"device_name": f"sensor-{i}", "temp": 22, "humidity": 55} - ack = await stream.ingest_record(record) - await ack.wait_for_ack() + offset = await stream.ingest_record_offset(record) + await stream.wait_for_offset(offset) finally: await stream.close() @@ -304,7 +310,7 @@ asyncio.run(ingest_async()) ## Batch Pattern -For higher throughput, send records without blocking on each ACK and flush at the end: +For higher throughput, use `ingest_record_nowait` (fire-and-forget) or batch methods, and flush at the end: ```python with ZerobusClient( @@ -314,10 +320,39 @@ with ZerobusClient( client_id=os.environ["DATABRICKS_CLIENT_ID"], client_secret=os.environ["DATABRICKS_CLIENT_SECRET"], record_type=RecordType.JSON, - ack_callback=lambda resp: None, # Discard individual ACKs ) as client: for i in range(10_000): record = {"device_name": f"sensor-{i}", "temp": 22, "humidity": 55} - client.stream.ingest_record(record) # Non-blocking + client.stream.ingest_record_nowait(record) # Fire-and-forget # flush() and close() called automatically by context manager ``` + +For true batch ingestion, use the batch variants: + +```python +records = [ + {"device_name": f"sensor-{i}", "temp": 22, "humidity": 55} + for i in range(10_000) +] +# Fire-and-forget batch +stream.ingest_records_nowait(records) +stream.flush() + +# Or with offset tracking +offset = stream.ingest_records_offset(records) +stream.wait_for_offset(offset) +``` + +--- + +## Ingestion Method Comparison + +| Method | Returns | Blocks? 
| Best For | +|--------|---------|---------|----------| +| `ingest_record_offset(record)` | offset | No (enqueues) | Single record with durability tracking | +| `ingest_record_nowait(record)` | None | No | Max single-record throughput | +| `ingest_records_offset(records)` | last offset | No (enqueues) | Batch with durability tracking | +| `ingest_records_nowait(records)` | None | No | Max batch throughput | +| `wait_for_offset(offset)` | None | Yes (until ACK) | Durability confirmation | +| `flush()` | None | Yes (until sent) | Ensure all buffered records are sent | +| `ingest_record(record)` | RecordAcknowledgment | No | Primary method in SDK v1.1.0+; pass `json.dumps(record)` for JSON | diff --git a/.claude/skills/databricks-zerobus-ingest/3-multilanguage-clients.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/3-multilanguage-clients.md similarity index 90% rename from .claude/skills/databricks-zerobus-ingest/3-multilanguage-clients.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/3-multilanguage-clients.md index 217398c..4eba101 100644 --- a/.claude/skills/databricks-zerobus-ingest/3-multilanguage-clients.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/3-multilanguage-clients.md @@ -51,7 +51,8 @@ public class ZerobusProducer { .setTemp(22) .setHumidity(55) .build(); - stream.ingestRecord(record).join(); + long offset = stream.ingestRecordOffset(record); + stream.waitForOffset(offset); } } finally { stream.close(); @@ -126,12 +127,12 @@ func main() { record := fmt.Sprintf( `{"device_name": "sensor-%d", "temp": 22, "humidity": 55}`, i, ) - ack, err := stream.IngestRecord(record) + offset, err := stream.IngestRecordOffset(record) if err != nil { log.Printf("Ingest failed for record %d: %v", i, err) continue } - ack.Await() + stream.WaitForOffset(offset) } stream.Flush() @@ -187,7 +188,8 @@ const stream = await sdk.createStream( try { for (let i = 0; i < 100; i++) { const record = { device_name: `sensor-${i}`, temp: 22, humidity: 55 }; - await stream.ingestRecord(record); + const offset = await stream.ingestRecordOffset(record); + await stream.waitForOffset(offset); } await stream.flush(); } finally { @@ -207,7 +209,8 @@ async function ingestWithRetry( ): Promise { for (let attempt = 0; attempt < maxRetries; attempt++) { try { - await stream.ingestRecord(record); + const offset = await stream.ingestRecordOffset(record); + await stream.waitForOffset(offset); return true; } catch (error) { console.warn(`Attempt ${attempt + 1}/${maxRetries} failed:`, error); @@ -268,8 +271,8 @@ async fn main() -> Result<(), Box> { r#"{{"device_name": "sensor-{}", "temp": 22, "humidity": 55}}"#, i ); - let ack = stream.ingest_record(record.into_bytes()).await?; - ack.await?; + let offset = stream.ingest_record_offset(record.into_bytes()).await?; + stream.wait_for_offset(offset).await?; } stream.close().await?; @@ -296,8 +299,8 @@ let mut stream = sdk // Ingest serialized protobuf bytes let record_bytes = my_proto_message.encode_to_vec(); -let ack = stream.ingest_record(record_bytes).await?; -ack.await?; +let offset = stream.ingest_record_offset(record_bytes).await?; +stream.wait_for_offset(offset).await?; ``` --- @@ -310,5 +313,5 @@ ack.await?; | Package | `databricks-zerobus-ingest-sdk` | `com.databricks:zerobus-ingest-sdk` | `github.com/databricks/zerobus-sdk-go` | `@databricks/zerobus-ingest-sdk` | `databricks-zerobus-ingest-sdk` | | Default serialization | JSON | Protobuf | JSON | 
JSON | JSON | | Async API | Yes (separate module) | CompletableFuture | Goroutines | Native async/await | Tokio async/await | -| ACK pattern | `ack.wait_for_ack()` or callback | `.join()` | `ack.Await()` | Implicit in `await` | `ack.await?` | +| ACK pattern | `wait_for_offset(offset)` or `AckCallback` | `waitForOffset(offset)` | `WaitForOffset(offset)` | `await waitForOffset(offset)` | `wait_for_offset(offset).await?` | | Proto generation | `python -m zerobus.tools.generate_proto` | JAR CLI tool | External `protoc` | External `protoc` | External `protoc` | diff --git a/.claude/skills/databricks-zerobus-ingest/4-protobuf-schema.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/4-protobuf-schema.md similarity index 100% rename from .claude/skills/databricks-zerobus-ingest/4-protobuf-schema.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/4-protobuf-schema.md diff --git a/.claude/skills/databricks-zerobus-ingest/5-operations-and-limits.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/5-operations-and-limits.md similarity index 87% rename from .claude/skills/databricks-zerobus-ingest/5-operations-and-limits.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/5-operations-and-limits.md index 7b8cb2b..004774d 100644 --- a/.claude/skills/databricks-zerobus-ingest/5-operations-and-limits.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/5-operations-and-limits.md @@ -12,40 +12,44 @@ Every ingested record returns a durability acknowledgment. An ACK indicates that | Strategy | When to Use | Trade-off | |----------|-------------|-----------| -| **Sync block per record** | Low-volume, strict ordering | Simplest; lower throughput | -| **ACK callback** | High-volume producers | Higher throughput; more complex | -| **Periodic flush** | Batch-oriented workloads | Best throughput; eventual consistency | +| **`ingest_record_offset` + `wait_for_offset`** | Low-volume, strict ordering | Simplest; lower throughput | +| **`ingest_record_nowait` + `AckCallback`** | High-volume producers | Higher throughput; more complex | +| **`ingest_record_nowait` + periodic `flush`** | Batch-oriented workloads | Best throughput; eventual consistency | ### Sync Block (Python) ```python -ack = stream.ingest_record(record) -ack.wait_for_ack() # Blocks until durable +offset = stream.ingest_record_offset(record) +stream.wait_for_offset(offset) # Blocks until durable ``` ### ACK Callback (Python) ```python -from zerobus.sdk.shared import IngestRecordResponse +from zerobus.sdk.shared import AckCallback -last_acked_offset = 0 +class MyAckHandler(AckCallback): + def __init__(self): + self.last_acked_offset = 0 -def on_ack(response: IngestRecordResponse) -> None: - global last_acked_offset - last_acked_offset = response.durability_ack_up_to_offset + def on_ack(self, offset: int) -> None: + self.last_acked_offset = offset + + def on_error(self, offset: int, message: str) -> None: + print(f"Error at offset {offset}: {message}") options = StreamConfigurationOptions( record_type=RecordType.JSON, - ack_callback=on_ack, + ack_callback=MyAckHandler(), ) ``` ### Flush-Based ```python -# Send many records without blocking +# Send many records without blocking (fire-and-forget) for record in batch: - stream.ingest_record(record) + stream.ingest_record_nowait(record) # Flush ensures all buffered records are sent stream.flush() @@ -89,8 +93,8 @@ def 
ingest_with_retry(stream_factory, record, max_retries=5): for attempt in range(max_retries): try: - ack = stream.ingest_record(record) - ack.wait_for_ack() + offset = stream.ingest_record_offset(record) + stream.wait_for_offset(offset) return stream # Return the (possibly new) stream except Exception as e: err = str(e).lower() diff --git a/.claude/skills/databricks-zerobus-ingest/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/SKILL.md similarity index 83% rename from .claude/skills/databricks-zerobus-ingest/SKILL.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/SKILL.md index d29dc00..22f90c5 100644 --- a/.claude/skills/databricks-zerobus-ingest/SKILL.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/databricks-zerobus-ingest/SKILL.md @@ -7,7 +7,7 @@ description: "Build Zerobus Ingest clients for near real-time data ingestion int Build clients that ingest data directly into Databricks Delta tables via the Zerobus gRPC API. -**Status:** Public Preview (currently free; Databricks plans to introduce charges in the future) +**Status:** GA (Generally Available since February 2026; billed under Lakeflow Jobs Serverless SKU) **Documentation:** - [Zerobus Overview](https://docs.databricks.com/aws/en/ingestion/zerobus-overview) @@ -37,7 +37,7 @@ Zerobus Ingest is a serverless connector that enables direct, record-by-record d | Schema generation from UC table | Any | Protobuf | [4-protobuf-schema.md](4-protobuf-schema.md) | | Retry / reconnection logic | Any | Any | [5-operations-and-limits.md](5-operations-and-limits.md) | -If not speficfied, default to python. +If not specified, default to Python. --- @@ -46,9 +46,9 @@ If not speficfied, default to python. These libraries are essential for ZeroBus data ingestion: - **databricks-sdk>=0.85.0**: Databricks workspace client for authentication and metadata -- **databricks-zerobus-ingest-sdk>=0.2.0**: ZeroBus SDK for high-performance streaming ingestion +- **databricks-zerobus-ingest-sdk>=1.0.0**: ZeroBus SDK for high-performance streaming ingestion - **grpcio-tools** -These are typically NOT pre-installed on Databricks. Install them using `execute_databricks_command` tool: +These are typically NOT pre-installed on Databricks. Install them using `execute_code` tool: - `code`: "%pip install databricks-sdk>=VERSION databricks-zerobus-ingest-sdk>=VERSION" Save the returned `cluster_id` and `context_id` for subsequent calls. @@ -85,6 +85,7 @@ See [1-setup-and-authentication.md](1-setup-and-authentication.md) for complete ## Minimal Python Example (JSON) ```python +import json from zerobus.sdk.sync import ZerobusSdk from zerobus.sdk.shared import RecordType, StreamConfigurationOptions, TableProperties @@ -95,8 +96,8 @@ table_props = TableProperties(table_name) stream = sdk.create_stream(client_id, client_secret, table_props, options) try: record = {"device_name": "sensor-1", "temp": 22, "humidity": 55} - ack = stream.ingest_record(record) - ack.wait_for_ack() + stream.ingest_record(json.dumps(record)) + stream.flush() finally: stream.close() ``` @@ -115,22 +116,24 @@ finally: --- -You must always follow all the steps in the Workslfow +You must always follow all the steps in the Workflow ## Workflow 0. **Display the plan of your execution** 1. **Determine the type of client** -2. **Get schema** Always use 4-protobuf-schema.md. Execute using the `run_python_file_on_databricks` MCP tool -3. 
**Write Python code to a local file follow the instructions in the relevant guide to ingest with zerobus** in the project (e.g., `scripts/zerobus_ingest.py`). -4. **Execute on Databricks** using the `run_python_file_on_databricks` MCP tool +2. **Get schema** Always use 4-protobuf-schema.md. Execute using the `execute_code` MCP tool +3. **Write Python code to a local file, following the instructions in the relevant guide, to ingest with Zerobus** in the project (e.g., `scripts/zerobus_ingest.py`). +4. **Execute on Databricks** using the `execute_code` MCP tool (with `file_path` parameter) 5. **If execution fails**: Edit the local file to fix the error, then re-execute 6. **Reuse the context** for follow-up executions by passing the returned `cluster_id` and `context_id` --- ## Important -- Never install local packages +- Never install local packages - Always validate MCP server requirement before execution +- **Serverless limitation**: The Zerobus SDK cannot be pip-installed on serverless compute. Use classic compute clusters, or use the [Zerobus REST API](https://docs.databricks.com/aws/en/ingestion/zerobus-rest-api) (Beta) for notebook-based ingestion without the SDK. +- **Explicit table grants**: Service principals need explicit `MODIFY` and `SELECT` grants on the target table. Schema-level inherited permissions may not be sufficient for the `authorization_details` OAuth flow. --- @@ -138,7 +141,7 @@ The first execution auto-selects a running cluster and creates an execution context. **Reuse this context for follow-up calls** - it's much faster (~1s vs ~15s) and shares variables/imports: -**First execution** - use `run_python_file_on_databricks` tool: +**First execution** - use `execute_code` tool: - `file_path`: "scripts/zerobus_ingest.py" Returns: `{ success, output, error, cluster_id, context_id, ... }` Save `cluster_id` and `context_id` for follow-up calls. **If execution fails:** 1. Read the error from the result 2. Edit the local Python file to fix the issue -3. Re-execute with same context using `run_python_file_on_databricks` tool: +3. Re-execute with same context using `execute_code` tool: - `file_path`: "scripts/zerobus_ingest.py" - `cluster_id`: "<cluster_id>" - `context_id`: "<context_id>" @@ -172,8 +175,8 @@ When execution fails: Databricks provides Spark, pandas, numpy, and common data libraries by default. **Only install a library if you get an import error.** -Use `execute_databricks_command` tool: -- `code`: "%pip install databricks-zerobus-ingest-sdk>=0.2.0" +Use `execute_code` tool: +- `code`: "%pip install databricks-zerobus-ingest-sdk>=1.0.0" - `cluster_id`: "<cluster_id>" - `context_id`: "<context_id>" @@ -193,7 +196,7 @@ The timestamp generation must use microseconds for Databricks. - **gRPC + Protobuf**: Zerobus uses gRPC as its transport protocol. Any application that can communicate via gRPC and construct Protobuf messages can produce to Zerobus. - **JSON or Protobuf serialization**: JSON for quick starts; Protobuf for type safety, forward compatibility, and performance. - **At-least-once delivery**: The connector provides at-least-once guarantees. Design consumers to handle duplicates. -- **Durability ACKs**: Each ingested record returns an ACK confirming durable write. ACKs indicate all records up to that offset have been durably written. +- **Durability ACKs**: Each ingested record returns a `RecordAcknowledgment`. Use `flush()` to ensure all buffered records are durably written, or use `wait_for_offset(offset)` for offset-based tracking. 
- **No table management**: Zerobus does not create or alter tables. You must pre-create your target table and manage schema evolution yourself. - **Single-AZ durability**: The service runs in a single availability zone. Plan for potential zone outages. @@ -210,6 +213,8 @@ The timestamp generation must use microseconds for Databricks. | **Throughput limits hit** | Max 100 MB/s and 15,000 rows/s per stream. Open multiple streams or contact Databricks. | | **Region not supported** | Check supported regions in [5-operations-and-limits.md](5-operations-and-limits.md). | | **Table not found** | Ensure table is a managed Delta table in a supported region with correct three-part name. | +| **SDK install fails on serverless** | The Zerobus SDK cannot be pip-installed on serverless compute. Use classic compute clusters or the REST API (Beta) from notebooks. | +| **Error 4024 / authorization_details** | Service principal lacks explicit table-level grants. Grant `MODIFY` and `SELECT` directly on the target table — schema-level inherited grants may be insufficient. | --- @@ -218,7 +223,7 @@ The timestamp generation must use microseconds for Databricks. - **[databricks-python-sdk](../databricks-python-sdk/SKILL.md)** - General SDK patterns and WorkspaceClient for table/schema management - **[databricks-spark-declarative-pipelines](../databricks-spark-declarative-pipelines/SKILL.md)** - Downstream pipeline processing of ingested data - **[databricks-unity-catalog](../databricks-unity-catalog/SKILL.md)** - Managing catalogs, schemas, and tables that Zerobus writes to -- **[databricks-synthetic-data-generation](../databricks-synthetic-data-generation/SKILL.md)** - Generate test data to feed into Zerobus producers +- **[databricks-synthetic-data-gen](../databricks-synthetic-data-gen/SKILL.md)** - Generate test data to feed into Zerobus producers - **[databricks-config](../databricks-config/SKILL.md)** - Profile and authentication setup ## Resources diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/SKILL.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/SKILL.md new file mode 100644 index 0000000..4f90c60 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/SKILL.md @@ -0,0 +1,157 @@ +--- +name: spark-python-data-source +description: Build custom Python data sources for Apache Spark using the PySpark DataSource API — batch and streaming readers/writers for external systems. Use this skill whenever someone wants to connect Spark to an external system (database, API, message queue, custom protocol), build a Spark connector or plugin in Python, implement a DataSourceReader or DataSourceWriter, pull data from or push data to a system via Spark, or work with the PySpark DataSource API in any way. Even if they just say "read from X in Spark" or "write DataFrame to Y" and there's no native connector, this skill applies. +--- + +# spark-python-data-source + +Build custom Python data sources for Apache Spark 4.0+ to read from and write to external systems in batch and streaming modes. + +## Instructions + +You are an experienced Spark developer building custom Python data sources using the PySpark DataSource API. Follow these principles and patterns. + +### Core Architecture + +Each data source follows a flat, single-level inheritance structure: + +1. **DataSource class** — entry point that returns readers/writers +2. 
**Base Reader/Writer classes** — shared logic for options and data processing +3. **Batch classes** — inherit from base + `DataSourceReader`/`DataSourceWriter` +4. **Stream classes** — inherit from base + `DataSourceStreamReader`/`DataSourceStreamWriter` + +See [implementation-template.md](references/implementation-template.md) for the full annotated skeleton covering all four modes (batch read/write, stream read/write). + +### Spark-Specific Design Constraints + +These are specific to the PySpark DataSource API and its driver/executor architecture — general Python best practices (clean code, minimal dependencies, no premature abstraction) still apply but aren't repeated here. + +**Flat single-level inheritance only.** PySpark serializes reader/writer instances to ship them to executors. Complex inheritance hierarchies and abstract base classes break serialization and make cross-process debugging painful. Use one shared base class mixed with the PySpark interface (e.g., `class YourBatchWriter(YourWriter, DataSourceWriter)`). + +**Import third-party libraries inside executor methods.** The `read()` and `write()` methods run on remote executor processes that don't share the driver's Python environment. Top-level imports from the driver won't be available on executors — always import libraries like `requests` or database drivers inside the methods that run on workers. + +**Minimize dependencies.** Every package you add must be installed on all executor nodes in the cluster, not just the driver. Prefer the standard library; when external packages are needed, keep them few and well-known. + +**No async/await** unless the external system's SDK is async-only. The PySpark DataSource API is synchronous, so async adds complexity with no benefit. + +### Project Setup + +Create a Python project using a packaging tool such as `uv`, `poetry`, or `hatch`. 
Examples use `uv` (substitute your tool of choice): + +```bash +uv init your-datasource +cd your-datasource +uv add pyspark pytest pytest-spark +``` + +``` +your-datasource/ +├── pyproject.toml +├── src/ +│ └── your_datasource/ +│ ├── __init__.py +│ └── datasource.py +└── tests/ + ├── conftest.py + └── test_datasource.py +``` + +Run all commands through the packaging tool so they execute within the correct virtual environment: + +```bash +uv run pytest # Run tests +uv run ruff check src/ # Lint +uv run ruff format src/ # Format +uv build # Build wheel +``` + +### Key Implementation Decisions + +**Partitioning Strategy** — choose based on data source characteristics: +- Time-based: for APIs with temporal data +- Token-range: for distributed databases +- ID-range: for paginated APIs +- See [partitioning-patterns.md](references/partitioning-patterns.md) for implementations of each strategy + +**Authentication** — support multiple methods in priority order: +- Databricks Unity Catalog credentials +- Cloud default credentials (managed identity) +- Explicit credentials (service principal, API key, username/password) +- See [authentication-patterns.md](references/authentication-patterns.md) for patterns with fallback chains + +**Type Conversion** — map between Spark and external types: +- Handle nulls, timestamps, UUIDs, collections +- See [type-conversion.md](references/type-conversion.md) for bidirectional mapping tables and helpers + +**Streaming Offsets** — design for exactly-once semantics: +- JSON-serializable offset class +- Non-overlapping partition boundaries +- See [streaming-patterns.md](references/streaming-patterns.md) for offset tracking and watermark patterns + +**Error Handling** — implement retries and resilience: +- Exponential backoff for transient failures (network, rate limits) +- Circuit breakers for cascading failures +- See [error-handling.md](references/error-handling.md) for retry decorators and failure classification + +### Testing + +```python +import pytest +from unittest.mock import patch, Mock + +@pytest.fixture +def spark(): + from pyspark.sql import SparkSession + return SparkSession.builder.master("local[2]").getOrCreate() + +def test_data_source_name(): + assert YourDataSource.name() == "your-format" + +def test_writer_sends_data(spark): + with patch('requests.post') as mock_post: + mock_post.return_value = Mock(status_code=200) + + df = spark.createDataFrame([(1, "test")], ["id", "value"]) + df.write.format("your-format").option("url", "http://api").save() + + assert mock_post.called +``` + +See [testing-patterns.md](references/testing-patterns.md) for unit/integration test patterns, fixtures, and running tests. 
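+To make the streaming-offset guidance above concrete, here is a minimal sketch for an ID-ordered source. This is illustrative, not part of the PySpark API: `SensorOffset`, `last_id`, `tile`, and the step size are assumed names for this example.
+
+```python
+from dataclasses import asdict, dataclass
+
+@dataclass
+class SensorOffset:
+    """JSON-serializable offset; a plain dict round-trips through the checkpoint."""
+    last_id: int
+
+    def to_json(self) -> dict:
+        return asdict(self)
+
+    @staticmethod
+    def from_json(d: dict) -> "SensorOffset":
+        return SensorOffset(last_id=int(d["last_id"]))
+
+def tile(start: SensorOffset, end: SensorOffset, step: int = 1000):
+    """Split a micro-batch range into non-overlapping ID ranges.
+
+    Each yielded (lo, hi) covers ids with lo < id <= hi, so replaying a
+    failed batch reads exactly the same rows and no range is read twice.
+    """
+    lo = start.last_id
+    while lo < end.last_id:
+        hi = min(lo + step, end.last_id)
+        yield (lo, hi)
+        lo = hi
+```
+
+Returning plain dicts from `initialOffset()`/`latestOffset()` keeps checkpoint contents human-readable when debugging a stuck stream.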
+ +### Reference Implementations + +Study these for real-world patterns: +- [cyber-spark-data-connectors](https://github.com/alexott/cyber-spark-data-connectors) — Sentinel, Splunk, REST +- [spark-cassandra-data-source](https://github.com/alexott/spark-cassandra-data-source) — Token-range partitioning +- [pyspark-hubspot](https://github.com/dgomez04/pyspark-hubspot) — REST API pagination +- [pyspark-mqtt](https://github.com/databricks-industry-solutions/python-data-sources/tree/main/mqtt) — Streaming with TLS + +## Example Prompts + +``` +Create a Spark data source for reading from MongoDB with sharding support +Build a streaming connector for RabbitMQ with at-least-once delivery +Implement a batch writer for Snowflake with staged uploads +Write a data source for REST API with OAuth2 authentication and pagination +``` + +## Related + +- databricks-testing: Test data sources on Databricks clusters +- databricks-spark-declarative-pipelines: Use custom sources in DLT pipelines +- python-dev: Python development best practices + +## References + +- [implementation-template.md](references/implementation-template.md) — Full annotated skeleton; read when starting a new data source +- [partitioning-patterns.md](references/partitioning-patterns.md) — Read when the source supports parallel reads and you need to split work across executors +- [authentication-patterns.md](references/authentication-patterns.md) — Read when the external system requires credentials or tokens +- [type-conversion.md](references/type-conversion.md) — Read when mapping between Spark types and the external system's type system +- [streaming-patterns.md](references/streaming-patterns.md) — Read when implementing `DataSourceStreamReader` or `DataSourceStreamWriter` +- [error-handling.md](references/error-handling.md) — Read when adding retry logic or handling transient failures +- [testing-patterns.md](references/testing-patterns.md) — Read when writing tests; covers unit, integration, and performance testing +- [production-patterns.md](references/production-patterns.md) — Read when hardening for production: observability, security, input validation +- [Official Databricks Documentation](https://docs.databricks.com/aws/en/pyspark/datasources) +- [Apache Spark Python DataSource Tutorial](https://spark.apache.org/docs/latest/api/python/tutorial/sql/python_data_source.html) +- [awesome-python-datasources](https://github.com/allisonwang-db/awesome-python-datasources) — Directory of community implementations diff --git a/.claude/skills/spark-python-data-source/references/authentication-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/authentication-patterns.md similarity index 100% rename from .claude/skills/spark-python-data-source/references/authentication-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/authentication-patterns.md diff --git a/.claude/skills/spark-python-data-source/references/error-handling.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/error-handling.md similarity index 100% rename from .claude/skills/spark-python-data-source/references/error-handling.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/error-handling.md diff --git a/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/implementation-template.md 
b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/implementation-template.md new file mode 100644 index 0000000..045fe94 --- /dev/null +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/implementation-template.md @@ -0,0 +1,151 @@ +# Implementation Template + +Full skeleton for a Python data source covering all four modes: batch read, batch write, stream read, stream write. Adapt to your needs — most connectors only implement a subset. + +```python +from dataclasses import dataclass + +from pyspark.sql.datasource import ( + DataSource, DataSourceReader, DataSourceWriter, + DataSourceStreamReader, DataSourceStreamWriter, + WriterCommitMessage, +) + +# Commit message returned from write(); a plain dataclass stays picklable +# so Spark can ship it from executors back to the driver for commit()/abort(). +@dataclass +class SimpleCommitMessage(WriterCommitMessage): + partition_id: int + count: int + +# 1. DataSource class — entry point that returns readers/writers +class YourDataSource(DataSource): + @classmethod + def name(cls): + return "your-format" + + def __init__(self, options): + self.options = options + + def schema(self): + return self._infer_or_return_schema() + + def reader(self, schema): + return YourBatchReader(self.options, schema) + + def streamReader(self, schema): + return YourStreamReader(self.options, schema) + + def writer(self, schema, overwrite): + return YourBatchWriter(self.options, schema) + + def streamWriter(self, schema, overwrite): + return YourStreamWriter(self.options, schema) + +# 2. Base Writer — shared logic for batch and stream writing +# Plain class (not a DataSourceWriter yet) so batch/stream +# subclasses can mix it in with the right PySpark base. +class YourWriter: + def __init__(self, options, schema=None): + self.url = options.get("url") + assert self.url, "url is required" + self.batch_size = int(options.get("batch_size", "50")) + self.schema = schema + + def write(self, iterator): + # Import here — this runs on executors, not the driver. + # Executor processes don't share the driver's module state. + import requests + from pyspark import TaskContext + + context = TaskContext.get() + partition_id = context.partitionId() + + msgs = [] + cnt = 0 + + for row in iterator: + cnt += 1 + msgs.append(row.asDict()) + + if len(msgs) >= self.batch_size: + self._send_batch(msgs) + msgs = [] + + if msgs: + self._send_batch(msgs) + + return SimpleCommitMessage(partition_id=partition_id, count=cnt) + + def _send_batch(self, msgs): + # Implement send logic + pass + +# 3. Batch Writer — inherits shared logic + PySpark interface +class YourBatchWriter(YourWriter, DataSourceWriter): + pass + +# 4. Stream Writer — adds commit/abort for micro-batch semantics +class YourStreamWriter(YourWriter, DataSourceStreamWriter): + def commit(self, messages, batchId): + pass + + def abort(self, messages, batchId): + pass + +# 5. Base Reader — shared logic for batch and stream reading +class YourReader: + def __init__(self, options, schema): + self.url = options.get("url") + assert self.url, "url is required" + self.schema = schema + + def partitions(self): + # Placeholder: define YourPartition (an InputPartition) and real bounds + return [YourPartition(0, start, end)] + + def read(self, partition): + # Import here — runs on executors + import requests + + response = requests.get(f"{self.url}?start={partition.start}") + for item in response.json(): + yield tuple(item.values()) + +# 6. Batch Reader +class YourBatchReader(YourReader, DataSourceReader): + pass + +# 7. 
Stream Reader — adds offset tracking for incremental reads +class YourStreamReader(YourReader, DataSourceStreamReader): + def initialOffset(self): + return {"offset": "0"} + + def latestOffset(self): + return {"offset": str(self._get_latest())} + + def partitions(self, start, end): + return [YourPartition(0, start["offset"], end["offset"])] + + def commit(self, end): + pass +``` + +## Registration and Usage + +```python +# Register +from your_package import YourDataSource +spark.dataSource.register(YourDataSource) + +# Batch read +df = spark.read.format("your-format").option("url", "...").load() + +# Batch write +df.write.format("your-format").option("url", "...").save() + +# Streaming read +df = spark.readStream.format("your-format").option("url", "...").load() + +# Streaming write +df.writeStream.format("your-format").option("url", "...").start() +``` diff --git a/.claude/skills/spark-python-data-source/references/partitioning-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/partitioning-patterns.md similarity index 100% rename from .claude/skills/spark-python-data-source/references/partitioning-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/partitioning-patterns.md diff --git a/.claude/skills/spark-python-data-source/references/production-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/production-patterns.md similarity index 78% rename from .claude/skills/spark-python-data-source/references/production-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/production-patterns.md index 71928ca..6dfbd8a 100644 --- a/.claude/skills/spark-python-data-source/references/production-patterns.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/production-patterns.md @@ -146,146 +146,55 @@ class LoggingWriter: ## Security Validation -Input validation and sanitization: +Input validation and sanitization for production data sources: ```python import re import ipaddress class SecureDataSource: - """Data source with security validation.""" - - # Sensitive keys that should never be logged - SENSITIVE_KEYS = { - "password", "api_key", "client_secret", "token", - "access_token", "refresh_token", "bearer_token" - } + """Data source with input validation.""" def __init__(self, options): - # Validate and sanitize options self._validate_options(options) self.options = options - # Create sanitized version for logging - self._safe_options = self._sanitize_for_logging(options) - def _validate_options(self, options): - """Comprehensive option validation.""" - # Validate required options + """Validate options at system boundary.""" required = ["host", "database", "table"] missing = [opt for opt in required if opt not in options] if missing: raise ValueError(f"Missing required options: {', '.join(missing)}") - # Validate host (IP or hostname) self._validate_host(options["host"]) - # Validate port range if "port" in options: port = int(options["port"]) if port < 1 or port > 65535: raise ValueError(f"Port must be 1-65535, got {port}") - # Validate table name (prevent SQL injection) self._validate_identifier(options["table"], "table") - # Validate numeric options - if "batch_size" in options: - batch_size = int(options["batch_size"]) - if batch_size < 1 or batch_size > 10000: - raise ValueError(f"batch_size must be 1-10000, got 
{batch_size}") - def _validate_host(self, host): """Validate host is valid IP or hostname.""" try: - # Try as IP address ipaddress.ip_address(host) return except ValueError: pass - - # Validate as hostname if not re.match(r'^[a-zA-Z0-9][a-zA-Z0-9-\.]*[a-zA-Z0-9]$', host): raise ValueError(f"Invalid host format: {host}") def _validate_identifier(self, identifier, name): - """Validate SQL identifier (table, column name).""" - # Prevent SQL injection + """Validate SQL identifier to prevent injection.""" if not re.match(r'^[a-zA-Z_][a-zA-Z0-9_]*$', identifier): raise ValueError( f"Invalid {name} identifier: {identifier}. " - f"Must contain only letters, numbers, and underscores, " - f"and start with a letter or underscore." + f"Must contain only letters, numbers, and underscores." ) - - def _sanitize_for_logging(self, options): - """Mask sensitive values for logging.""" - safe = {} - for key, value in options.items(): - if key.lower() in self.SENSITIVE_KEYS: - safe[key] = "***REDACTED***" - else: - safe[key] = value - return safe - - def __repr__(self): - return f"SecureDataSource({self._safe_options})" ``` -## Secrets Management - -Load credentials from secure storage: - -```python -def load_secrets_from_databricks(scope, keys): - """Load secrets from Databricks secrets.""" - try: - from pyspark.dbutils import DBUtils - from pyspark.sql import SparkSession - - spark = SparkSession.getActiveSession() - if not spark: - raise ValueError("No active Spark session") - - dbutils = DBUtils(spark) - secrets = {} - - for key in keys: - try: - secrets[key] = dbutils.secrets.get(scope=scope, key=key) - except Exception as e: - raise ValueError(f"Failed to load secret '{key}' from scope '{scope}': {e}") - - return secrets - - except Exception as e: - raise ValueError(f"Failed to access Databricks secrets: {e}") - -class SecureCredentialLoader: - """Load credentials securely.""" - - @staticmethod - def load_credentials(options): - """Load credentials from secure storage.""" - # Priority 1: Databricks secrets - if "secret_scope" in options: - secret_keys = [ - "username", "password", "api_key", "client_secret" - ] - secrets = load_secrets_from_databricks( - options["secret_scope"], - secret_keys - ) - options.update(secrets) - - # Priority 2: Environment variables - elif options.get("use_env_vars", "false").lower() == "true": - import os - options["username"] = os.environ.get("DB_USERNAME") - options["password"] = os.environ.get("DB_PASSWORD") - - return options -``` +For credential sanitization in logs and secrets management, see [authentication-patterns.md](authentication-patterns.md) — the "Security Best Practices" and "Use Secrets Management" sections. ## Configuration Validation diff --git a/.claude/skills/spark-python-data-source/references/streaming-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/streaming-patterns.md similarity index 98% rename from .claude/skills/spark-python-data-source/references/streaming-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/streaming-patterns.md index 6f00ddd..66b9e8e 100644 --- a/.claude/skills/spark-python-data-source/references/streaming-patterns.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/streaming-patterns.md @@ -396,5 +396,5 @@ class MonitoredStreamReader(DataSourceStreamReader): 3. **State Management**: Store offsets in Spark checkpoints 4. 
**Watermarking**: Support event-time processing for late data 5. **Monitoring**: Track batch progress and lag metrics -6. **Error Handling**: Implement retry logic for transient failures +6. **Error Handling**: Streaming writers are especially susceptible to transient failures (network blips, rate limits) since they run continuously. Use retry with exponential backoff from [error-handling.md](error-handling.md) in your `write()` methods. 7. **Backpressure**: Respect rate limits with appropriate partition sizing diff --git a/.claude/skills/spark-python-data-source/references/testing-patterns.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/testing-patterns.md similarity index 97% rename from .claude/skills/spark-python-data-source/references/testing-patterns.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/testing-patterns.md index 96e2e28..1b4aeb2 100644 --- a/.claude/skills/spark-python-data-source/references/testing-patterns.md +++ b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/testing-patterns.md @@ -415,25 +415,27 @@ tests/ ## Running Tests +Run tests through your packaging tool (e.g., `uv run`, `poetry run`, `hatch run`). Examples use `uv`: + ```bash # Run all tests -poetry run pytest +uv run pytest # Run specific test file -poetry run pytest tests/unit/test_writer.py +uv run pytest tests/unit/test_writer.py # Run specific test -poetry run pytest tests/unit/test_writer.py::test_writer_sends_batch +uv run pytest tests/unit/test_writer.py::test_writer_sends_batch # Run with coverage -poetry run pytest --cov=your_package --cov-report=html +uv run pytest --cov=your_package --cov-report=html # Run only unit tests -poetry run pytest tests/unit/ +uv run pytest tests/unit/ # Run with verbose output -poetry run pytest -v +uv run pytest -v # Run with print statements -poetry run pytest -s +uv run pytest -s ``` diff --git a/.claude/skills/spark-python-data-source/references/type-conversion.md b/coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/type-conversion.md similarity index 100% rename from .claude/skills/spark-python-data-source/references/type-conversion.md rename to coda-marketplace/plugins/coda-databricks-skills/skills/spark-python-data-source/references/type-conversion.md diff --git a/coda-marketplace/plugins/coda-essentials/.claude-plugin/plugin.json b/coda-marketplace/plugins/coda-essentials/.claude-plugin/plugin.json new file mode 100644 index 0000000..756bd57 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/.claude-plugin/plugin.json @@ -0,0 +1,18 @@ +{ + "name": "coda-essentials", + "description": "Subagents, hooks, slash commands, and session lifecycle tooling bundled with every CODA instance.", + "version": "0.1.0", + "author": { + "name": "Databricks Field Engineering" + }, + "keywords": [ + "coda", + "databricks", + "workshop", + "tdd", + "memory", + "hooks" + ], + "agents": "./agents/", + "commands": "./commands/" +} diff --git a/agents/build-feature.md b/coda-marketplace/plugins/coda-essentials/agents/build-feature.md similarity index 100% rename from agents/build-feature.md rename to coda-marketplace/plugins/coda-essentials/agents/build-feature.md diff --git a/agents/implementer.md b/coda-marketplace/plugins/coda-essentials/agents/implementer.md similarity index 100% rename from agents/implementer.md rename to coda-marketplace/plugins/coda-essentials/agents/implementer.md diff 
--git a/agents/prd-writer.md b/coda-marketplace/plugins/coda-essentials/agents/prd-writer.md similarity index 100% rename from agents/prd-writer.md rename to coda-marketplace/plugins/coda-essentials/agents/prd-writer.md diff --git a/agents/test-generator.md b/coda-marketplace/plugins/coda-essentials/agents/test-generator.md similarity index 100% rename from agents/test-generator.md rename to coda-marketplace/plugins/coda-essentials/agents/test-generator.md diff --git a/coda-marketplace/plugins/coda-essentials/commands/cache-stats.md b/coda-marketplace/plugins/coda-essentials/commands/cache-stats.md new file mode 100644 index 0000000..8e3fcc7 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/commands/cache-stats.md @@ -0,0 +1,52 @@ +--- +description: "Report prompt-cache hit rate + token savings for recent Claude Code sessions (reads MLflow traces)" +--- + +Analyse prompt-cache performance for this user's recent Claude Code sessions +in CODA. Traces are captured by `setup_mlflow.py` when MLflow tracing is +enabled; they include per-request token usage from Anthropic, which is +what reveals caching. + +## Steps + +1. **Check tracing is on.** + Read `os.environ.get("MLFLOW_CLAUDE_TRACING_ENABLED", "")`. If it's empty, + `"0"`, or `"false"` (case-insensitive), tell the user tracing is off and + stop — suggest they re-run setup with `MLFLOW_CLAUDE_TRACING_ENABLED=true` + or flip it in `app.yaml`. + +2. **Resolve the experiment path.** + The setup logs to `/Users/{email}/{app_name}` where: + - `email` = `APP_OWNER_EMAIL` env var, or `databricks current-user me` + - `app_name` = `DATABRICKS_APP_NAME` env var, or the basename of `$HOME` + +3. **Query recent traces.** Use `mlflow` (already installed in CODA) to list + the last ~50 traces in that experiment. Anthropic / Claude Code traces + carry per-call token usage on the root span outputs (or `info.tags`): + - `input_tokens` — uncached input + - `cache_read_input_tokens` — served from cache + - `cache_creation_input_tokens` — written to cache + - `output_tokens` + + Sum each across all traces. + +4. **Report a compact summary.** Include: + - **Hit rate** = `cache_read / (cache_read + input_tokens)`, as a % + - **Cached tokens served** (with the cost context that cache-read ≈ 10% + of base input price) + - **Totals**: input / cache_read / cache_creation / output + - **Estimated $ saved vs uncached** — assume Claude Opus pricing unless + `ANTHROPIC_MODEL` env var says otherwise: + `saved ≈ cache_read_tokens × (input_price − cache_read_price) / 1e6` + (Opus: input $15/MTok, cache_read $1.50/MTok → $13.50 saved per M + cache_read tokens.) + +5. **If hit rate < 50%, diagnose.** Likely causes in order: + - Prefix < 1024 tokens (Databricks passthrough minimum — won't cache) + - Sessions spaced > 5 min apart (ephemeral TTL expired) + - System prompt changed between calls (non-deterministic skill loading, + varying `CLAUDE.md` content, or model/route switch) + - Tracing only captures a subset of calls (check `MLFLOW_TRACE_SAMPLING`) + +Keep the output tight — 10-15 lines, not a report. This is observability, +not a presentation. 
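+A minimal sketch of the arithmetic in steps 3-4, assuming the four usage
+counters surface as numeric trace tags; adapt the field access if your
+traces carry them on the root span outputs instead:
+
+```python
+import os
+import mlflow
+
+# Resolve the experiment path as in step 2 (env-var fallbacks assumed here)
+email = os.environ.get("APP_OWNER_EMAIL", "me@example.com")
+app_name = os.environ.get("DATABRICKS_APP_NAME", os.path.basename(os.environ["HOME"]))
+mlflow.set_experiment(f"/Users/{email}/{app_name}")
+
+traces = mlflow.search_traces(max_results=50)  # pandas DataFrame of recent traces
+fields = ["input_tokens", "cache_read_input_tokens",
+          "cache_creation_input_tokens", "output_tokens"]
+totals = {f: 0 for f in fields}
+for tags in traces["tags"]:
+    for f in fields:
+        totals[f] += int(tags.get(f, 0) or 0)
+
+denom = totals["cache_read_input_tokens"] + totals["input_tokens"]
+hit_rate = totals["cache_read_input_tokens"] / denom if denom else 0.0
+# Opus list prices: input $15/MTok, cache read $1.50/MTok
+saved = totals["cache_read_input_tokens"] * (15.00 - 1.50) / 1e6
+print(f"Hit rate {hit_rate:.1%}; ~${saved:,.2f} saved vs uncached")
+```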
diff --git a/coda-marketplace/plugins/coda-essentials/commands/til.md b/coda-marketplace/plugins/coda-essentials/commands/til.md new file mode 100644 index 0000000..9523e2a --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/commands/til.md @@ -0,0 +1,28 @@ +--- +description: "Capture what I learned today — fight cognitive passivity" +--- + +Review what we worked on in this session and extract what I should have learned. + +Create a brief TIL (Today I Learned) entry: + +### Concepts +- What Databricks/Python/React/infrastructure concepts came up? +- Which ones were new to me (based on what I asked about or seemed unfamiliar with)? +- Link to the relevant docs or source code + +### Decisions +- What key decisions were made and why? +- What were the alternatives we rejected? + +### Sharp edges +- What non-obvious gotchas did we encounter? +- What would I need to remember if I did this again without AI? + +### Could I do this solo? +Be honest. Based on this session: +- Which parts could I now do independently? +- Which parts would I still need AI for? +- What should I study to close that gap? + +Output as a compact markdown block I can save or paste into my notes. diff --git a/coda-marketplace/plugins/coda-essentials/hooks/check-memory-staleness.py b/coda-marketplace/plugins/coda-essentials/hooks/check-memory-staleness.py new file mode 100644 index 0000000..8c7d37f --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/hooks/check-memory-staleness.py @@ -0,0 +1,134 @@ +"""SessionStart hook: warn about stale Claude Code memory files. + +Scans ~/.claude/projects/*/memory/ and reports entries whose frontmatter +`last_verified` is missing or older than the threshold (default 30 days). + +Exit 0 = clean, exit 1 = stale memories found (warnings on stdout). 
+""" +from __future__ import annotations + +import argparse +import re +import sys +from datetime import date, timedelta +from pathlib import Path + +CLAUDE_DIR = Path.home() / ".claude" +PROJECTS_DIR = CLAUDE_DIR / "projects" +DEFAULT_STALE_DAYS = 30 + +FRONTMATTER_RE = re.compile(r"^---\s*\n(.*?)\n---", re.DOTALL) +LAST_VERIFIED_RE = re.compile(r"^last_verified:\s*(\d{4}-\d{2}-\d{2})", re.MULTILINE) +NAME_RE = re.compile(r"^name:\s*(.+)", re.MULTILINE) +TYPE_RE = re.compile(r"^type:\s*(.+)", re.MULTILINE) + + +def cwd_to_project_slug(cwd: str) -> str: + return re.sub(r"[/.]", "-", cwd) + + +def slug_to_readable(slug: str) -> str: + home = str(Path.home()) + home_slug = re.sub(r"[/.]", "-", home) + if slug.startswith(home_slug): + return "~" + slug[len(home_slug):].replace("-", "/") + return slug.lstrip("-").replace("-", "/") + + +def parse_memory(path: Path) -> dict | None: + try: + text = path.read_text() + except OSError: + return None + m = FRONTMATTER_RE.search(text) + if not m: + return None + fm = m.group(1) + name = NAME_RE.search(fm) + verified = LAST_VERIFIED_RE.search(fm) + type_ = TYPE_RE.search(fm) + return { + "path": path, + "name": name.group(1).strip() if name else path.stem, + "type": type_.group(1).strip() if type_ else "unknown", + "last_verified": verified.group(1) if verified else None, + } + + +def check_staleness(threshold_days: int, project_slug: str | None) -> list[dict]: + if not PROJECTS_DIR.exists(): + return [] + today = date.today() + threshold = today - timedelta(days=threshold_days) + stale: list[dict] = [] + dirs = [PROJECTS_DIR / project_slug / "memory"] if project_slug \ + else sorted(PROJECTS_DIR.glob("*/memory")) + for memory_dir in dirs: + if not memory_dir.exists(): + continue + proj = memory_dir.parent.name + for md in sorted(memory_dir.glob("*.md")): + if md.name == "MEMORY.md": + continue + info = parse_memory(md) + if info is None: + continue + if info["last_verified"] is None: + stale.append({ + "project": proj, "name": info["name"], "type": info["type"], + "reason": "missing last_verified", "file": str(md), + }) + continue + try: + vdate = date.fromisoformat(info["last_verified"]) + except ValueError: + stale.append({ + "project": proj, "name": info["name"], "type": info["type"], + "reason": f"invalid date: {info['last_verified']}", + "file": str(md), + }) + continue + if vdate < threshold: + age = (today - vdate).days + stale.append({ + "project": proj, "name": info["name"], "type": info["type"], + "reason": f"{age}d since verified ({info['last_verified']})", + "file": str(md), + }) + return stale + + +def main() -> int: + parser = argparse.ArgumentParser() + parser.add_argument("--cwd") + parser.add_argument("--days", type=int, default=DEFAULT_STALE_DAYS) + parser.add_argument("--all", action="store_true") + args = parser.parse_args() + + slug = cwd_to_project_slug(args.cwd) if args.cwd and not args.all else None + stale = check_staleness(args.days, slug) + if not stale: + return 0 + + by_proj: dict[str, list[dict]] = {} + for e in stale: + by_proj.setdefault(e["project"], []).append(e) + + total = len(stale) + if slug: + lines = [f"Stale memories ({total}) in {slug_to_readable(slug)}:"] + for e in stale: + lines.append(f" - [{e['type']}] {e['name']}: {e['reason']}") + else: + lines = [f"Stale memories: {total} across {len(by_proj)} project(s)"] + for proj, entries in by_proj.items(): + lines.append(f" {slug_to_readable(proj)}: {len(entries)} stale") + for e in entries: + lines.append(f" - [{e['type']}] {e['name']}: {e['reason']}") + 
lines.append("\nUpdate `last_verified` to today's date after reviewing each memory.") + print("\n".join(lines)) + return 1 + + +if __name__ == "__main__": + sys.exit(main()) diff --git a/coda-marketplace/plugins/coda-essentials/hooks/memory-stamp-verified.sh b/coda-marketplace/plugins/coda-essentials/hooks/memory-stamp-verified.sh new file mode 100644 index 0000000..7334789 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/hooks/memory-stamp-verified.sh @@ -0,0 +1,30 @@ +#!/usr/bin/env bash +# PostToolUse (Edit|Write) hook: stamp `last_verified: YYYY-MM-DD` on memory +# files after they are edited. Operates only on Claude Code auto-memory files +# under ~/.claude/projects/*/memory/*.md (excluding the MEMORY.md index). +# +# Uses GNU sed syntax (Linux). Do not use BSD sed forms (`sed -i ''`) here. + +set -euo pipefail + +filepath="${CLAUDE_FILE_PATH:-}" + +[[ -n "$filepath" ]] || exit 0 +[[ "$filepath" == *"/.claude/projects/"*"/memory/"* ]] || exit 0 +[[ "$filepath" == *.md ]] || exit 0 +[[ "$(basename "$filepath")" != "MEMORY.md" ]] || exit 0 +[[ -f "$filepath" ]] || exit 0 + +today=$(date +%Y-%m-%d) + +head -1 "$filepath" | grep -q '^---' || exit 0 + +if grep -q '^last_verified:' "$filepath"; then + sed -i "s/^last_verified:.*$/last_verified: $today/" "$filepath" +else + awk -v stamp="last_verified: $today" ' + /^---$/ { count++ } + count == 2 && inserted == 0 { print stamp; inserted = 1 } + { print } + ' "$filepath" > "${filepath}.tmp" && mv "${filepath}.tmp" "$filepath" +fi diff --git a/coda-marketplace/plugins/coda-essentials/hooks/mlflow-trace-stop.sh b/coda-marketplace/plugins/coda-essentials/hooks/mlflow-trace-stop.sh new file mode 100644 index 0000000..5917b81 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/hooks/mlflow-trace-stop.sh @@ -0,0 +1,31 @@ +#!/usr/bin/env bash +# Stop hook: flush the Claude Code session transcript to an MLflow trace. +# +# Claude Code pipes the hook-event JSON to our stdin. We capture that +# synchronously (fast, bounded to one read) then background the actual +# flush with stdin redirected from the captured file. This way: +# +# - the wrapper returns in <1s, unblocking the Stop chain +# (crystallize-nudge, brain-push, /til) +# - a hard `timeout 30` caps the backgrounded handler so a stall in +# transcript processing can't hold memory/CPU indefinitely +# - stop_hook_handler() actually receives its hook-event JSON, which +# naive `nohup ... & disown` would have redirected to /dev/null + +set -euo pipefail + +APP_DIR="/app/python/source_code" +LOG="$HOME/.mlflow-hook.log" +STDIN_FILE="$(mktemp -t mlflow-hook.XXXXXX)" + +# Synchronous: read Claude Code's hook-event JSON from stdin. +cat > "$STDIN_FILE" + +# Async: run the handler in the background with the captured stdin. +# The subshell cleans up the temp file after timeout/handler exits. +nohup bash -c " + timeout 30 uv run --project '$APP_DIR' python -c \ + 'from mlflow.claude_code.hooks import stop_hook_handler; stop_hook_handler()' \ + < '$STDIN_FILE' + rm -f '$STDIN_FILE' +" >> "$LOG" 2>&1 & disown diff --git a/coda-marketplace/plugins/coda-essentials/hooks/push-brain-to-workspace.sh b/coda-marketplace/plugins/coda-essentials/hooks/push-brain-to-workspace.sh new file mode 100644 index 0000000..e86d2f3 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/hooks/push-brain-to-workspace.sh @@ -0,0 +1,14 @@ +#!/usr/bin/env bash +# Stop hook: push Claude Code's auto-memory to Databricks Workspace so it +# survives app redeployment. 
Fire-and-forget: runs in background, never blocks. + +set -euo pipefail + +APP_DIR="/app/python/source_code" +SYNC_SCRIPT="$APP_DIR/claude_brain_sync.py" +LOG="$HOME/.brain-sync.log" + +[ -f "$SYNC_SCRIPT" ] || exit 0 + +nohup uv run --project "$APP_DIR" python "$SYNC_SCRIPT" push \ + >> "$LOG" 2>&1 & disown diff --git a/coda-marketplace/plugins/coda-essentials/hooks/session-context-loader.sh b/coda-marketplace/plugins/coda-essentials/hooks/session-context-loader.sh new file mode 100644 index 0000000..0fd7133 --- /dev/null +++ b/coda-marketplace/plugins/coda-essentials/hooks/session-context-loader.sh @@ -0,0 +1,70 @@ +#!/usr/bin/env bash +# SessionStart hook: inject recent git activity into context so Claude +# knows what was happening last session. + +set -euo pipefail + +git rev-parse --git-dir >/dev/null 2>&1 || exit 0 + +branch=$(git branch --show-current 2>/dev/null || echo "detached") +repo_name=$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || basename "$PWD") + +recent_commits=$(git log --all --since="7 days ago" \ + --format="%h %ad %an: %s" --date=relative --max-count=10 2>/dev/null || true) + +author=$(git config user.name 2>/dev/null || echo "") +last_own_commit="" +if [ -n "$author" ]; then + last_own_commit=$(git log --author="$author" --format="%ad" \ + --date=relative --max-count=1 2>/dev/null || true) +fi + +status_summary=$(git status --short 2>/dev/null | head -15 || true) +status_count=$(git status --short 2>/dev/null | wc -l | tr -d ' ') + +active_branches=$(git for-each-ref --sort=-committerdate \ + --format='%(refname:short) (%(committerdate:relative))' \ + refs/heads/ --count=5 2>/dev/null || true) + +open_prs="" +if command -v gh >/dev/null 2>&1; then + open_prs=$(gh pr list --author="@me" --state=open \ + --json number,title,headRefName \ + --jq '.[] | "#\(.number) [\(.headRefName)] \(.title)"' \ + 2>/dev/null | head -5 || true) +fi + +ctx="Session context for ${repo_name} (branch: ${branch})" +[ -n "$last_own_commit" ] && ctx="${ctx} +Your last commit: ${last_own_commit}" +[ -n "$recent_commits" ] && ctx="${ctx} + +Recent commits (7d): +${recent_commits}" +if [ -n "$status_summary" ]; then + if [ "$status_count" -gt 15 ]; then + ctx="${ctx} + +Uncommitted changes (${status_count} files, showing first 15): +${status_summary}" + else + ctx="${ctx} + +Uncommitted changes: +${status_summary}" + fi +fi +[ -n "$active_branches" ] && ctx="${ctx} + +Active branches: +${active_branches}" +[ -n "$open_prs" ] && ctx="${ctx} + +Open PRs: +${open_prs}" + +json_ctx=$(printf '%s' "$ctx" | python3 -c 'import sys, json; print(json.dumps(sys.stdin.read()))') + +cat <=1 recent commit OR 3+ changed files. 
+ +set -euo pipefail + +MIN_COMMITS=1 +MIN_CHANGED_FILES=3 +SINCE="2 hours ago" + +git rev-parse --git-dir >/dev/null 2>&1 || exit 0 + +author=$(git config user.name 2>/dev/null || echo "") +[ -n "$author" ] || exit 0 + +commit_count=$(git log --author="$author" --since="$SINCE" --oneline 2>/dev/null | wc -l | tr -d ' ') +changed_files=$(git diff --name-only HEAD 2>/dev/null | wc -l | tr -d ' ') +staged_files=$(git diff --cached --name-only 2>/dev/null | wc -l | tr -d ' ') +total_changed=$((changed_files + staged_files)) + +if [ "$commit_count" -ge "$MIN_COMMITS" ] || [ "$total_changed" -ge "$MIN_CHANGED_FILES" ]; then + summary="" + if [ "$commit_count" -gt 0 ]; then + summary="${commit_count} commit(s) this session" + fi + if [ "$total_changed" -gt 0 ]; then + if [ -n "$summary" ]; then + summary="$summary, ${total_changed} uncommitted file(s)" + else + summary="${total_changed} uncommitted changed file(s)" + fi + fi + + cat < \ |----------|----------|-------------| -| `DATABRICKS_TOKEN` | No | Optional. If not set, the app prompts for a token on first session. Auto-rotated every 10 minutes | +| `DATABRICKS_TOKEN` | No | Optional. If not set, the app prompts for a token on first session. Auto-rotated every 3 hours | | `HOME` | Yes | Set to `/app/python/source_code` in app.yaml | -| `ANTHROPIC_MODEL` | No | Claude model name (default: `databricks-claude-opus-4-6`) | +| `ANTHROPIC_MODEL` | No | Claude model name (default: `databricks-claude-opus-4-7`) | | `CODEX_MODEL` | No | Codex model name (default: `databricks-gpt-5-3-codex`) | | `GEMINI_MODEL` | No | Gemini model name (default: `databricks-gemini-3-1-pro`) | | `DATABRICKS_GATEWAY_HOST` | No | AI Gateway URL override. Auto-discovered from `DATABRICKS_WORKSPACE_ID` if unset. Falls back to direct model serving if neither is available | diff --git a/pat_rotator.py b/pat_rotator.py index 28e0319..a27407d 100644 --- a/pat_rotator.py +++ b/pat_rotator.py @@ -1,6 +1,6 @@ """Auto-rotate short-lived PATs in the background. -Mints a new 15-minute PAT every 10 minutes, writes to ~/.databrickscfg +Mints a new 4-hour PAT every 3 hours, writes to ~/.databrickscfg (immediate CLI/SDK use), and revokes the old PAT. Rotation only runs while active sessions exist. If the app restarts, the interactive PAT prompt re-provisions credentials on next session. Fixes #81. @@ -18,8 +18,8 @@ logger = logging.getLogger(__name__) -DEFAULT_TOKEN_LIFETIME = 900 # 15 minutes -DEFAULT_ROTATION_INTERVAL = 600 # 10 minutes +DEFAULT_TOKEN_LIFETIME = 14400 # 4 hours +DEFAULT_ROTATION_INTERVAL = 10800 # 3 hours class PATRotator: diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-features/SKILL.md b/projects/coles-vibe-workshop/.claude/commands/bdd-features/SKILL.md new file mode 100644 index 0000000..8d0bb12 --- /dev/null +++ b/projects/coles-vibe-workshop/.claude/commands/bdd-features/SKILL.md @@ -0,0 +1,105 @@ +--- +name: bdd-features +description: "This skill should be used when the user asks to \"write Gherkin\", \"create feature files\", \"generate BDD scenarios\", \"write acceptance tests in Gherkin\", \"create Behave features\", \"write Given When Then tests\", \"BDD test cases for my pipeline\", \"Gherkin for Unity Catalog\", or wants to translate requirements into Gherkin feature files for Databricks." +user-invocable: true +--- + +# BDD Features — Gherkin Feature File Generation + +Generate well-structured Gherkin `.feature` files for Databricks workloads. Translate requirements, user stories, or existing code into behavior specifications using Given/When/Then syntax. 
+ +## When to use + +- Translating requirements or user stories into Gherkin acceptance criteria +- Creating feature files for Databricks pipelines, catalog permissions, jobs, or Apps +- Writing regression tests in Gherkin for existing functionality +- Generating Scenario Outlines for data-driven testing + +## Process + +### 1. Identify the test subject + +Determine what to test. Read the relevant code or ask the user: + +- A Lakeflow SDP pipeline definition → pipeline behavior tests +- Unity Catalog grants/policies → permission verification tests +- A FastAPI Databricks App → API endpoint tests +- A notebook or job → execution and output validation tests +- SQL transformations → data quality and correctness tests + +### 2. Write the feature file + +Place feature files in the appropriate subdirectory under `features/`: + +``` +features/ +├── catalog/permissions.feature +├── pipelines/events_pipeline.feature +├── apps/api_endpoints.feature +├── jobs/etl_notebook.feature +└── sql/data_quality.feature +``` + +**Structure every feature file with:** + +1. **Tags** — `@domain`, `@smoke`/`@regression`/`@integration`, optional `@slow` or `@wip` +2. **Feature header** — name + As a / I want / So that narrative +3. **Background** — shared Given steps (workspace connection, test schema) +4. **Scenarios** — one behavior per scenario, descriptive names + +Refer to `references/gherkin-patterns.md` for Databricks-specific Gherkin patterns covering: +- Pipeline lifecycle (full refresh, incremental, failure handling) +- Unity Catalog grants, column masks, row filters +- App endpoint testing with SSO headers +- Job/notebook execution and output validation +- SQL data quality assertions +- Scenario Outlines for parameterized testing + +### 3. Gherkin writing principles + +**Declarative, not imperative.** Describe *what* the system should do, not *how* to click buttons: + +```gherkin +# Good — declarative +When I grant SELECT on "catalog.schema.table" to group "readers" +Then the group "readers" should have SELECT permission + +# Bad — imperative +When I open the Catalog Explorer +And I click on the table "catalog.schema.table" +And I click "Permissions" +And I click "Grant" +And I select "SELECT" +And I type "readers" in the group field +And I click "Save" +``` + +**One behavior per scenario.** If a scenario tests two independent things, split it. + +**Use Backgrounds for shared setup.** Avoid repeating connection/schema steps across scenarios. + +**Scenario Outlines for data variations.** When the same behavior is tested with different inputs, use Examples tables instead of duplicating scenarios. + +**Tag strategically:** +- `@smoke` — fast, critical-path tests (< 30 seconds each) +- `@regression` — thorough coverage (minutes) +- `@integration` — needs live workspace (skip in unit test CI) +- `@slow` — pipeline tests, job executions (> 2 minutes) + +**CRITICAL — Curly braces break step matching.** Behave uses the `parse` library for step matching. `{anything}` in feature file text is interpreted as a capture group, not a literal. Never use `{test_schema}.table_name` in feature files — it will fail to match step definitions. Instead, use short table names (`"customers"`) and resolve the schema in step code. + +**Trailing colons matter.** When a step has an attached data table or docstring, the `:` at the end of the Gherkin line IS part of the step text. The step pattern must include it: `@given('a table "{name}" with data:')` — not `with data` (no colon). + +### 4. 
Validate step coverage + +After writing features, check that step definitions exist for all steps: + +```bash +uv run behave --dry-run +``` + +Any undefined steps will be reported with suggested snippets. Hand those to the `bdd-steps` skill for implementation. + +## Additional resources + +- **`references/gherkin-patterns.md`** — Complete Databricks Gherkin pattern library with examples for every domain diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-features/references/gherkin-patterns.md b/projects/coles-vibe-workshop/.claude/commands/bdd-features/references/gherkin-patterns.md new file mode 100644 index 0000000..1913252 --- /dev/null +++ b/projects/coles-vibe-workshop/.claude/commands/bdd-features/references/gherkin-patterns.md @@ -0,0 +1,446 @@ +# Gherkin Patterns for Databricks + +Reusable Gherkin patterns for common Databricks testing scenarios. Copy and adapt these to feature files. + +> **WARNING: Curly braces in step text break Behave's `parse` matcher.** +> +> Behave uses Python's `parse` library for step matching. Any `{...}` in step text +> is interpreted as a capture group. Writing `{test_schema}.customers` in a step line +> will **silently fail to match** your step definition. +> +> **The correct pattern:** +> - Step text uses **short table names in quotes**: `"customers"`, `"orders"` +> - SQL inside **docstrings** (triple-quoted blocks) can safely use `{schema}` because +> docstrings are accessed via `context.text`, not step matching +> - Step definitions prepend `context.test_schema + "."` internally to build the FQN +> +> ```python +> # WRONG - step text with curly braces +> @given('a table "{test_schema}.customers" exists') # BROKEN - parse eats {test_schema} +> +> # RIGHT - short name in step text, FQN built in the step body +> @given('a managed table "{table_name}" exists') +> def step_impl(context, table_name): +> fqn = f"{context.test_schema}.{table_name}" +> # ... use fqn +> ``` +> +> **Docstring SQL pattern** (safe because `context.text` is just a string): +> ```python +> @when('I execute SQL:') +> def step_impl(context): +> sql = context.text.replace("{schema}", context.test_schema) +> # ... 
execute sql +> ``` + +## Common Background + +Most Databricks feature files share this Background: + +```gherkin +Background: + Given a Databricks workspace connection is established + And a test schema is provisioned +``` + +--- + +## Unity Catalog + +### Table permissions + +```gherkin +@catalog @permissions +Feature: Unity Catalog table permissions + As a data engineer + I want to verify table-level permissions + So that sensitive data is properly protected + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + Scenario: Grant SELECT to a group + Given a managed table "customers" exists + When I execute SQL: + """sql + GRANT SELECT ON TABLE {schema}.customers TO `data_readers` + """ + And I execute SQL: + """sql + SHOW GRANTS ON TABLE {schema}.customers + """ + Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "data_readers" + + Scenario Outline: Verify multiple privilege types + Given a managed table "sales" exists + When I execute SQL: + """sql + GRANT <privilege> ON TABLE {schema}.sales TO `<group>` + """ + And I execute SQL: + """sql + SHOW GRANTS ON TABLE {schema}.sales + """ + Then the result should contain a row where "ActionType" is "<privilege>" and "Principal" is "<group>" + + Examples: + | privilege | group | + | SELECT | data_readers | + | MODIFY | data_writers | +``` + +### Column masks + +```gherkin +@catalog @security +Feature: Column-level security + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + Scenario: Mask PII columns for analysts + Given a managed table "customers" with columns: + | column_name | data_type | contains_pii | + | id | BIGINT | false | + | name | STRING | true | + | email | STRING | true | + | region | STRING | false | + And a column mask function "mask_pii" is applied to "name" and "email" on "customers" + When I query "customers" as group "analysts" + Then columns "name" and "email" should return masked values + But columns "id" and "region" should return actual values +``` + +### Row filters + +```gherkin +@catalog @security +Feature: Row-level security + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + Scenario: Row filter restricts by region + Given a managed table "regional_sales" with data: + | region | revenue | quarter | + | APAC | 50000 | Q1 | + | EMEA | 75000 | Q1 | + | AMER | 100000 | Q1 | + And a row filter on "regional_sales" restricts "apac_analysts" to region "APAC" + When I query "regional_sales" as group "apac_analysts" + Then I should only see rows where "region" is "APAC" + And the result should have 1 row +``` + +--- + +## Lakeflow Spark Declarative Pipelines + +### Pipeline lifecycle + +```gherkin +@pipeline @lakeflow +Feature: Events pipeline processing + As a data engineer + I want to verify the events pipeline processes data correctly + So that downstream consumers get accurate aggregations + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @integration @slow + Scenario: Full refresh produces expected tables + Given a pipeline "events_pipeline" exists targeting the test schema + When I trigger a full refresh of the pipeline + Then the pipeline update should succeed within 600 seconds + And the streaming table "bronze_events" should exist + And the materialized view "silver_events_agg" should exist + And the table "silver_events_agg" should have more than 0 rows + + @integration + Scenario: 
Incremental refresh picks up new data + Given the pipeline "events_pipeline" has completed a full refresh + When I insert test records into the source + And I trigger an incremental refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the new records should appear in "bronze_events" + + Scenario: Pipeline handles empty source gracefully + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table is empty + When I trigger a full refresh of the pipeline + Then the pipeline update should succeed within 300 seconds + And the streaming table "bronze_events" should have 0 rows +``` + +### Pipeline failure handling + +```gherkin + Scenario: Pipeline surfaces schema mismatch errors + Given a pipeline "events_pipeline" exists targeting the test schema + And the source table has an unexpected column "extra_col" of type "BINARY" + When I trigger a full refresh of the pipeline + Then the pipeline update should fail + And the pipeline error should mention schema +``` + +--- + +## Jobs and Notebooks + +### Notebook execution + +```gherkin +@jobs @notebook +Feature: Customer ETL notebook + As a data engineer + I want to verify the ETL notebook produces correct output + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @integration @slow + Scenario: Dedup notebook removes duplicates + Given a managed table "raw_customers" with data: + | customer_id | name | email | updated_at | + | 1 | Alice | alice@example.com | 2024-01-01T00:00:00 | + | 1 | Alice B. | alice@example.com | 2024-06-01T00:00:00 | + | 2 | Bob | bob@example.com | 2024-03-15T00:00:00 | + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | raw_customers | + | target_table | clean_customers| + Then the job should complete with status "SUCCESS" within 300 seconds + And the table "clean_customers" should have 2 rows + And the table "clean_customers" should contain a row where "customer_id" is "1" and "name" is "Alice B." 
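
  # Hypothetical addition, not part of the original pattern set: a second
  # pass of the dedup notebook over unchanged input should be idempotent.
  @integration @slow
  Scenario: Dedup notebook is idempotent across re-runs
    Given a managed table "raw_customers" with data:
      | customer_id | name  | email             | updated_at          |
      | 1           | Alice | alice@example.com | 2024-01-01T00:00:00 |
      | 2           | Bob   | bob@example.com   | 2024-03-15T00:00:00 |
    When I run the notebook "/Repos/team/etl/customer_dedup" with parameters:
      | key          | value           |
      | source_table | raw_customers   |
      | target_table | clean_customers |
    And I run the notebook "/Repos/team/etl/customer_dedup" with parameters:
      | key          | value           |
      | source_table | raw_customers   |
      | target_table | clean_customers |
    Then the job should complete with status "SUCCESS" within 300 seconds
    And the table "clean_customers" should have 2 rows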
+ + Scenario: Notebook fails gracefully on missing source + When I run the notebook "/Repos/team/etl/customer_dedup" with parameters: + | key | value | + | source_table | nonexistent | + | target_table | output | + Then the job should complete with status "FAILED" within 120 seconds +``` + +--- + +## Databricks Apps (FastAPI) + +### API endpoint testing + +```gherkin +@app @fastapi +Feature: Databricks App API + As a user + I want the app endpoints to work correctly + + Background: + Given the app is running at the configured base URL + And the test user is "testuser@databricks.com" + + @smoke + Scenario: Health check + When I GET "/health" + Then the response status should be 200 + And the response JSON should contain "status" with value "healthy" + + Scenario: Authenticated user can list resources + When I GET "/api/dashboards" with auth headers + Then the response status should be 200 + And the response should be a JSON list + + Scenario: Unauthenticated request is rejected + When I GET "/api/dashboards" without auth headers + Then the response status should be 401 + + Scenario: POST creates a resource + When I POST "/api/items" with auth headers and body: + """json + {"name": "Test Item", "description": "Created by BDD test"} + """ + Then the response status should be 201 + And the response JSON should contain "name" with value "Test Item" +``` + +### App deployment testing + +```gherkin +@app @deployment @slow +Feature: App deployment lifecycle + Scenario: Deploy and verify app is running + Given a bundle project at the repository root + When I deploy using Asset Bundles with target "dev" + Then the deployment should succeed + And the app should reach "RUNNING" state within 120 seconds + And the app health endpoint should return 200 +``` + +--- + +## SQL Data Quality + +### Row counts and data validation + +```gherkin +@sql @data-quality +Feature: Data quality checks + + Background: + Given a Databricks workspace connection is established + And a test schema is provisioned + + @smoke + Scenario: Table is not empty + Given the table "orders" has been loaded + Then the table "orders" should have more than 0 rows + + Scenario: No duplicate primary keys + Given the table "orders" has been loaded + When I execute SQL: + """sql + SELECT order_id, COUNT(*) as cnt + FROM {schema}.orders + GROUP BY order_id + HAVING COUNT(*) > 1 + """ + Then the result should have 0 rows + + Scenario: Foreign key integrity + Given the tables "orders" and "customers" have been loaded + When I execute SQL: + """sql + SELECT o.customer_id + FROM {schema}.orders o + LEFT JOIN {schema}.customers c ON o.customer_id = c.customer_id + WHERE c.customer_id IS NULL + """ + Then the result should have 0 rows + + Scenario: No null values in required columns + When I execute SQL: + """sql + SELECT COUNT(*) as null_count + FROM {schema}.orders + WHERE order_id IS NULL OR customer_id IS NULL OR order_date IS NULL + """ + Then the first row column "null_count" should be "0" + + Scenario: Verify GRANT was applied via SQL + Given a managed table "products" exists + When I execute SQL: + """sql + GRANT SELECT ON TABLE {schema}.products TO `reporting_team` + """ + And I execute SQL: + """sql + SHOW GRANTS ON TABLE {schema}.products + """ + Then the result should contain a row where "ActionType" is "SELECT" and "Principal" is "reporting_team" +``` + +--- + +## Asset Bundles Deployment + +```gherkin +@deployment @dabs +Feature: Bundle lifecycle + @smoke + Scenario: Bundle validates successfully + When I run "databricks bundle 
validate" with target "dev"
    Then the command should exit with code 0

  @integration @slow
  Scenario: Deploy and destroy lifecycle
    When I run "databricks bundle deploy" with target "dev"
    Then the command should exit with code 0
    When I run "databricks bundle destroy" with target "dev" and auto-approve
    Then the command should exit with code 0
```

---

## Scenario Outline patterns

Use Scenario Outlines for testing multiple variations of the same behavior.

Note: table names in the Examples table are short names (no schema prefix). The step
definition prepends `context.test_schema` to build the fully-qualified name.

```gherkin
  Scenario Outline: Verify table existence after pipeline run
    Then the <table_type> "<table_name>" should exist

    Examples: Streaming tables
      | table_type      | table_name          |
      | streaming table | bronze_events       |
      | streaming table | bronze_transactions |

    Examples: Materialized views
      | table_type        | table_name        |
      | materialized view | silver_events_agg |
      | materialized view | gold_summary      |
```

---

## Steps with data tables and docstrings

Steps that accept a data table or docstring **must** end with a trailing colon. The colon
is part of the step text that Behave matches against your `@given`/`@when`/`@then` decorator.

```gherkin
# CORRECT - colon before data table
Given a managed table "customers" with data:
  | id | name  | region |
  | 1  | Alice | APAC   |
  | 2  | Bob   | EMEA   |

# CORRECT - colon before docstring
When I execute SQL:
  """sql
  SELECT * FROM {schema}.customers
  """

# WRONG - missing colon, Behave will not match the step
Given a managed table "customers" with data
  | id | name | region |
```

---

## SHOW GRANTS column names

`SHOW GRANTS` returns PascalCase column names. Use these exact names when asserting
on grant results:

| Column       | Description                                    |
|--------------|------------------------------------------------|
| `Principal`  | The user, group, or service principal          |
| `ActionType` | The privilege (SELECT, MODIFY, ALL PRIVILEGES) |
| `ObjectType` | TABLE, SCHEMA, CATALOG, etc.                   |
| `ObjectKey`  | The fully-qualified object name                |

---

## Tag strategy

| Tag | Purpose | Typical runtime |
|-----|---------|----------------|
| `@smoke` | Critical path, must always pass | < 30s per scenario |
| `@regression` | Full coverage | Minutes |
| `@integration` | Needs live workspace | Varies |
| `@slow` | Pipeline/job execution | > 2 min |
| `@wip` | Work in progress, skip by default | N/A |
| `@skip` | Explicitly disabled | N/A |
| `@catalog` | Unity Catalog tests | Varies |
| `@pipeline` | Lakeflow SDP tests | Minutes |
| `@jobs` | Job/notebook tests | Minutes |
| `@app` | Databricks Apps tests | Seconds |
| `@sql` | SQL/data quality tests | Seconds |
| `@deployment` | DABs lifecycle tests | Minutes |
diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-run/SKILL.md b/projects/coles-vibe-workshop/.claude/commands/bdd-run/SKILL.md
new file mode 100644
index 0000000..f8f242e
--- /dev/null
+++ b/projects/coles-vibe-workshop/.claude/commands/bdd-run/SKILL.md
@@ -0,0 +1,145 @@
+---
+name: bdd-run
+description: "This skill should be used when the user asks to \"run BDD tests\", \"execute Behave\", \"run Gherkin tests\", \"run my feature files\", \"behave test results\", \"run smoke tests\", \"BDD test report\", or needs to execute Behave test suites with specific options like tag filtering, parallel execution, or CI reporting."
+user-invocable: true +--- + +# BDD Run — Execute and Report Behave Tests + +Execute Behave test suites with tag filtering, parallel execution, output formatting, and CI integration. Diagnose failures and suggest fixes. + +## When to use + +- Running the full BDD test suite or a subset by tags +- Getting JUnit/JSON reports for CI pipelines +- Re-running only failed scenarios +- Running tests in parallel for speed +- Diagnosing and triaging test failures + +## Process + +### 1. Pre-flight checks + +Before running tests, verify the environment: + +```bash +# Verify Behave is installed +uv run behave --version + +# Verify Databricks auth +uv run python -c "from databricks.sdk import WorkspaceClient; print(WorkspaceClient().current_user.me().user_name)" + +# Dry run to check step coverage +uv run behave --dry-run +``` + +If any undefined steps are found, report them and suggest using the `bdd-steps` skill. + +### 2. Execute tests + +**Run by tag (most common):** + +```bash +# Smoke tests only +uv run behave --tags="@smoke" --format=pretty + +# All except slow and WIP +uv run behave --tags="not @slow and not @wip" + +# Specific domain +uv run behave --tags="@catalog" +uv run behave --tags="@pipeline" + +# Boolean combinations +uv run behave --tags="(@catalog or @pipeline) and @smoke" +``` + +**Run specific feature file or directory:** + +```bash +uv run behave features/catalog/permissions.feature +uv run behave features/pipelines/ +``` + +**Run by scenario name:** + +```bash +uv run behave --name "Grant SELECT on a table" +``` + +**Pass runtime configuration:** + +```bash +uv run behave -D warehouse_id=abc123 -D catalog=my_catalog -D environment=dev +``` + +### 3. Output and reporting + +**For local development:** + +```bash +uv run behave --format=pretty --show-timings +``` + +**For CI pipelines (JUnit XML):** + +```bash +uv run behave --junit --junit-directory=reports/behave/ --format=progress +``` + +**JSON output for programmatic analysis:** + +```bash +uv run behave --format=json --outfile=reports/results.json --format=progress +``` + +**Multiple formatters simultaneously:** + +```bash +uv run behave --format=pretty --format=json --outfile=reports/results.json +``` + +### 4. Re-run failed tests + +Configure rerun file output, then re-run only failures: + +```bash +# First run captures failures +uv run behave --format=rerun --outfile=reports/rerun.txt --format=pretty + +# Re-run only failed scenarios +uv run behave @reports/rerun.txt +``` + +### 5. Parallel execution + +Behave has no built-in parallelism. Use `behavex` for parallel feature execution: + +```bash +uv run behavex --parallel-processes 4 --parallel-scheme feature +``` + +Each parallel worker needs its own test schema to avoid cross-contamination. The `environment.py` template from `bdd-scaffold` handles this by using timestamped schema names with worker ID suffixes. + +### 6. Failure diagnosis + +When tests fail, read the output and categorize: + +| Failure type | Symptom | Action | +|-------------|---------|--------| +| Undefined step | `NotImplementedError` or "undefined" in output | Generate step with `bdd-steps` | +| Auth failure | `PermissionDenied`, 401/403 | Check `databricks auth profiles` | +| Timeout | `TimeoutError` in polling steps | Increase timeout parameter or check resource state | +| Data mismatch | Assertion error with expected vs. 
actual | Check test data setup or query logic |
| Schema not found | `SCHEMA_NOT_FOUND` | Verify `before_all` created the ephemeral schema |
| Warehouse stopped | `WAREHOUSE_NOT_RUNNING` | Start warehouse or use `@fixture.sql_warehouse` tag hook |

### 7. Makefile integration

If a Makefile exists, prefer `make` targets:

```bash
make bdd          # Full suite
make bdd-smoke    # Smoke tests
make bdd-report   # JUnit for CI
```
diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/SKILL.md b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/SKILL.md
new file mode 100644
index 0000000..d24c5ca
--- /dev/null
+++ b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/SKILL.md
@@ -0,0 +1,114 @@
+---
+name: bdd-scaffold
+description: "This skill should be used when the user asks to \"set up BDD\", \"create a Behave project\", \"scaffold BDD tests\", \"initialize Behave\", \"add BDD to my project\", \"set up Gherkin testing\", \"create test structure for Behave\", or mentions setting up behavior-driven development testing. Generates a complete Behave project structure wired to Databricks SDK."
+user-invocable: true
+---
+
+# BDD Scaffold — Behave + Databricks Project Setup
+
+Generate a complete Python Behave project structure pre-wired with Databricks SDK integration, including `environment.py` hooks, test isolation via ephemeral schemas, and `behave.ini` configuration.
+
+## When to use
+
+- Starting a new BDD test suite for a Databricks project
+- Adding Behave-based acceptance tests to an existing repo
+- Setting up integration testing against Unity Catalog, pipelines, jobs, or Apps
+
+## Process
+
+### 1. Detect project context
+
+Identify the project root and existing tooling:
+
+```bash
+git rev-parse --show-toplevel
+```
+
+Check for existing test infrastructure: `pyproject.toml`, `Makefile`, `behave.ini`, `features/` directory. If a `features/` directory already exists, confirm before overwriting.
+
+### 2. Determine test domains
+
+Ask (or infer from the codebase) which Databricks domains to scaffold step files for:
+
+| Domain | Step file | When |
+|--------|-----------|------|
+| Unity Catalog | `catalog_steps.py` | Tables, schemas, grants, row filters, column masks |
+| Pipelines | `pipeline_steps.py` | Lakeflow SDP, streaming tables, materialized views |
+| Jobs | `job_steps.py` | Notebook runs, workflow tasks, job clusters |
+| Apps | `app_steps.py` | FastAPI endpoints, SSO headers, deployment |
+| SQL | `sql_steps.py` | Statement execution, warehouse queries, data validation |
+
+Always generate `common_steps.py` (shared workspace connection, row counting, table existence checks).
+
+### 3. Generate the directory structure
+
+```
+features/
+├── environment.py          # Databricks SDK setup, ephemeral schema lifecycle
+├── steps/
+│   ├── common_steps.py     # Shared steps (always generated)
+│   └── <domain>_steps.py   # Per-domain (based on step 2)
+├── catalog/                # Feature file directories (one per domain)
+├── pipelines/
+├── jobs/
+├── apps/
+└── sql/
+behave.ini
+Makefile                    # (append BDD targets if Makefile exists)
+```
+
+Refer to `references/environment-template.md` for the full `environment.py` template with:
+- `before_all`: WorkspaceClient init, warehouse auto-discovery, ephemeral schema creation
+- `after_all`: Schema cascade drop
+- `before_scenario` / `after_scenario`: Per-scenario resource tracking and cleanup
+- Tag-based hooks for `@wip`, `@skip`, `@slow`
+
+Refer to `references/behave-config.md` for `behave.ini` and `pyproject.toml` configuration.
+
+### 4. 
Add dependencies + +If `pyproject.toml` exists and uses `uv`: + +```bash +uv add --group test behave databricks-sdk httpx +``` + +If no `pyproject.toml`, create a minimal one with test dependencies. + +### 5. Add Makefile targets + +Append these targets (or create a Makefile if none exists): + +```makefile +.PHONY: bdd bdd-smoke bdd-report + +bdd: + uv run behave --format=pretty + +bdd-smoke: + uv run behave --tags="@smoke" --format=pretty + +bdd-report: + uv run behave --junit --junit-directory=reports/ --format=progress +``` + +### 6. Verify scaffold + +Run `behave --dry-run` to confirm step discovery works and there are no import errors: + +```bash +uv run behave --dry-run +``` + +Report the generated structure and next steps to the user. + +## Key design decisions + +- **Ephemeral schemas** — each test run creates a timestamped schema (`behave_test_YYYYMMDD_HHMMSS`) and drops it in `after_all`. Prevents cross-run contamination. +- **`-D` userdata** for parameterization — warehouse IDs, catalog names, and targets are passed via CLI args, never hardcoded. +- **Step files are globally scoped** in Behave — all files in `steps/` are imported regardless of which feature runs. Name step patterns carefully to avoid collisions. + +## Additional resources + +- **`references/environment-template.md`** — Full annotated environment.py template +- **`references/behave-config.md`** — behave.ini and pyproject.toml configuration reference diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/behave-config.md b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/behave-config.md new file mode 100644 index 0000000..d994f51 --- /dev/null +++ b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/behave-config.md @@ -0,0 +1,134 @@ +# Behave Configuration Reference + +## behave.ini + +Standard Behave configuration file. Place at project root. + +```ini +[behave] +# Output +default_format = pretty +show_timings = true +color = true + +# Default tag filter — skip WIP and explicitly skipped tests +default_tags = not @wip and not @skip + +# Logging +logging_level = INFO +logging_format = %(asctime)s %(levelname)-8s %(name)s: %(message)s + +# Capture control +stdout_capture = true +log_capture = true + +# JUnit output (enable in CI) +junit = false +junit_directory = reports/ + +# Feature paths +paths = features/ + +[behave.userdata] +# Override with -D key=value on CLI +warehouse_id = auto +catalog = main +environment = dev +``` + +## pyproject.toml + +Alternative configuration via pyproject.toml (Behave reads `[tool.behave]`): + +**IMPORTANT:** In `pyproject.toml`, `default_tags` must be a **list**, not a string. 
The `behave.ini` parser accepts a plain string, but the TOML parser is stricter: + +```toml +[tool.behave] +default_format = "pretty" +show_timings = true +default_tags = ["not @wip and not @skip"] # MUST be a list in pyproject.toml +junit = false +junit_directory = "reports/" +logging_level = "INFO" + +[tool.behave.userdata] +warehouse_id = "auto" +catalog = "main" +environment = "dev" +``` + +## Dependencies + +Add to `pyproject.toml`: + +```toml +[project.optional-dependencies] +test = [ + "behave>=1.2.6", + "databricks-sdk>=0.40.0", + "httpx>=0.27.0", +] + +# Or for parallel execution +test-parallel = [ + "behave>=1.2.6", + "behavex>=3.0", + "databricks-sdk>=0.40.0", + "httpx>=0.27.0", +] +``` + +With `uv`: + +```bash +uv add --group test behave databricks-sdk httpx +``` + +## Makefile targets + +```makefile +.PHONY: bdd bdd-smoke bdd-report bdd-rerun bdd-parallel bdd-dry-run + +bdd: + uv run behave --format=pretty --show-timings + +bdd-smoke: + uv run behave --tags="@smoke" --format=pretty + +bdd-report: + uv run behave --junit --junit-directory=reports/behave/ --format=progress + +bdd-rerun: + uv run behave @reports/rerun.txt + +bdd-parallel: + uv run behavex --parallel-processes 4 --parallel-scheme feature + +bdd-dry-run: + uv run behave --dry-run +``` + +## CI integration (GitHub Actions example) + +```yaml +- name: Run BDD tests + env: + DATABRICKS_HOST: ${{ secrets.DATABRICKS_HOST }} + DATABRICKS_TOKEN: ${{ secrets.DATABRICKS_TOKEN }} + DATABRICKS_WAREHOUSE_ID: ${{ secrets.WAREHOUSE_ID }} + TEST_CATALOG: ci_test + run: | + uv run behave \ + --tags="not @slow" \ + --junit --junit-directory=reports/behave/ \ + --format=progress \ + -D catalog=$TEST_CATALOG \ + -D warehouse_id=$DATABRICKS_WAREHOUSE_ID + +- name: Upload test results + if: always() + uses: actions/upload-artifact@v4 + with: + name: behave-results + path: reports/behave/ +``` diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/environment-template.md b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/environment-template.md new file mode 100644 index 0000000..2a7dc1b --- /dev/null +++ b/projects/coles-vibe-workshop/.claude/commands/bdd-scaffold/references/environment-template.md @@ -0,0 +1,195 @@ +# environment.py Template — Databricks + Behave + +Complete annotated template for `features/environment.py`. Copy and adapt to the target project. + +## Full template + +```python +"""Behave environment hooks — Databricks SDK integration. + +Sets up workspace connection, ephemeral test schema, and per-scenario cleanup. +""" +from __future__ import annotations + +import logging +import os +from datetime import datetime + +from behave.model import Feature, Scenario, Step +from behave.runner import Context + +logger = logging.getLogger("behave.databricks") + + +# ─── Session-level hooks ──────────────────────────────────────── + +def before_all(context: Context) -> None: + """Initialize Databricks clients and create ephemeral test schema.""" + from databricks.sdk import WorkspaceClient + + context.workspace = WorkspaceClient() + + # Fix host URL — some profiles include ?o= which breaks SDK API paths. + # The CLI handles this transparently but the SDK does not. + if context.workspace.config.host and "?" 
in context.workspace.config.host: + clean_host = context.workspace.config.host.split("?")[0].rstrip("/") + profile = os.environ.get("DATABRICKS_CONFIG_PROFILE") + context.workspace = WorkspaceClient(profile=profile, host=clean_host) + + # Verify auth + me = context.workspace.current_user.me() + context.current_user = me.user_name + logger.info("Authenticated as: %s", context.current_user) + + # Warehouse — from -D userdata, env var, or auto-discover + userdata = context.config.userdata + context.warehouse_id = ( + userdata.get("warehouse_id") + or os.environ.get("DATABRICKS_WAREHOUSE_ID") + or _discover_warehouse(context.workspace) + ) + logger.info("Using warehouse: %s", context.warehouse_id) + + # Catalog — from -D userdata or env var + context.test_catalog = userdata.get("catalog", os.environ.get("TEST_CATALOG", "main")) + + # Create ephemeral schema (timestamped for isolation) + ts = datetime.now().strftime("%Y%m%d_%H%M%S") + worker = os.environ.get("BEHAVE_WORKER_ID", "0") + context.test_schema = f"{context.test_catalog}.behave_test_{ts}_w{worker}" + + _execute_sql(context, f"CREATE SCHEMA IF NOT EXISTS {context.test_schema}") + logger.info("Created test schema: %s", context.test_schema) + + +def after_all(context: Context) -> None: + """Drop ephemeral test schema.""" + if hasattr(context, "test_schema"): + try: + _execute_sql(context, f"DROP SCHEMA IF EXISTS {context.test_schema} CASCADE") + logger.info("Dropped test schema: %s", context.test_schema) + except Exception as e: + logger.warning("Failed to drop test schema %s: %s", context.test_schema, e) + + +# ─── Feature-level hooks ──────────────────────────────────────── + +def before_feature(context: Context, feature: Feature) -> None: + """Log feature start. Skip if tagged @skip.""" + logger.info("▶ Feature: %s", feature.name) + if "skip" in feature.tags: + feature.skip("Marked with @skip") + + +def after_feature(context: Context, feature: Feature) -> None: + logger.info("◀ Feature: %s [%s]", feature.name, feature.status) + + +# ─── Scenario-level hooks ─────────────────────────────────────── + +def before_scenario(context: Context, scenario: Scenario) -> None: + """Initialize per-scenario state. 
Skip @wip scenarios.""" + logger.info(" ▶ Scenario: %s", scenario.name) + if "wip" in scenario.tags: + scenario.skip("Work in progress") + return + # Track resources created during this scenario for cleanup + context.scenario_cleanup_sql = [] + + +def after_scenario(context: Context, scenario: Scenario) -> None: + """Clean up scenario-specific resources.""" + for sql in getattr(context, "scenario_cleanup_sql", []): + try: + _execute_sql(context, sql) + except Exception as e: + logger.warning("Cleanup SQL failed: %s — %s", sql, e) + if scenario.status == "failed": + logger.error(" ✗ FAILED: %s", scenario.name) + else: + logger.info(" ◀ Scenario: %s [%s]", scenario.name, scenario.status) + + +# ─── Step-level hooks ─────────────────────────────────────────── + +def before_step(context: Context, step: Step) -> None: + context._step_start = datetime.now() + + +def after_step(context: Context, step: Step) -> None: + elapsed = (datetime.now() - context._step_start).total_seconds() + if elapsed > 10: + logger.warning(" Slow step (%.1fs): %s %s", elapsed, step.keyword, step.name) + if step.status == "failed": + logger.error(" ✗ %s %s\n %s", step.keyword, step.name, step.error_message) + + +# ─── Tag-based hooks ──────────────────────────────────────────── + +def before_tag(context, tag: str) -> None: + """Ensure resources for tagged scenarios.""" + if tag == "fixture.sql_warehouse": + _ensure_warehouse_running(context) + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str) -> object: + """Execute a SQL statement via the Statement Execution API.""" + return context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + + +def _discover_warehouse(workspace) -> str: + """Find the first available SQL warehouse.""" + from databricks.sdk.service.sql import State + + warehouses = list(workspace.warehouses.list()) + # Prefer running warehouses + for wh in warehouses: + if wh.state == State.RUNNING: + return wh.id + if warehouses: + return warehouses[0].id + raise RuntimeError( + "No SQL warehouses found. Pass warehouse_id via -D warehouse_id= " + "or set DATABRICKS_WAREHOUSE_ID." + ) + + +def _ensure_warehouse_running(context: Context) -> None: + """Start warehouse if stopped. Used by @fixture.sql_warehouse tag.""" + from databricks.sdk.service.sql import State + + wh = context.workspace.warehouses.get(context.warehouse_id) + if wh.state != State.RUNNING: + logger.info("Starting warehouse %s...", context.warehouse_id) + context.workspace.warehouses.start(context.warehouse_id) + context.workspace.warehouses.wait_get_warehouse_running(context.warehouse_id) + logger.info("Warehouse %s is running.", context.warehouse_id) +``` + +## Context object layering + +Behave's `context` has scoped layers. Data set at different levels has different lifetimes: + +| Set in | Lifetime | Example | +|--------|----------|---------| +| `before_all` | Entire run | `context.workspace`, `context.test_schema` | +| `before_feature` | Current feature | `context.feature_data` | +| `before_scenario` / steps | Current scenario | `context.query_result`, `context.scenario_cleanup_sql` | + +At the end of each scenario, the scenario layer is popped — anything set during steps is gone. Root-level data persists across everything. + +## Parallel execution isolation + +When using `behavex` for parallel execution, each worker needs its own schema. The template uses `BEHAVE_WORKER_ID` from the environment. 
Set it in the parallel runner config or wrapper script:

```bash
# Example wrapper for behavex
export BEHAVE_WORKER_ID=$WORKER_INDEX
behave "$@"
```
diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-steps/SKILL.md b/projects/coles-vibe-workshop/.claude/commands/bdd-steps/SKILL.md
new file mode 100644
index 0000000..a14a12b
--- /dev/null
+++ b/projects/coles-vibe-workshop/.claude/commands/bdd-steps/SKILL.md
@@ -0,0 +1,109 @@
+---
+name: bdd-steps
+description: "This skill should be used when the user asks to \"write step definitions\", \"implement BDD steps\", \"generate step code\", \"create Behave steps\", \"implement Given When Then\", \"write Python steps for Gherkin\", \"step definitions for Databricks\", or needs to create Python step implementations for existing Gherkin feature files."
+user-invocable: true
+---
+
+# BDD Steps — Python Step Definition Generation
+
+Generate Python step definitions for Behave that implement Gherkin steps using the Databricks SDK. Read existing `.feature` files, identify undefined steps, and produce well-typed implementations.
+
+## When to use
+
+- Implementing step definitions for new or existing feature files
+- Adding Databricks SDK calls to step implementations
+- Refactoring step definitions for reusability across features
+
+## Process
+
+### 1. Identify undefined steps
+
+Read the target feature files, then run a dry-run to find undefined steps:
+
+```bash
+uv run behave --dry-run features/<name>.feature 2>&1
+```
+
+Behave prints suggested snippets for each undefined step. Use these as the starting point.
+
+### 2. Write step definitions
+
+Place step files in `features/steps/` organized by domain:
+
+| File | Domain | Key SDK imports |
+|------|--------|----------------|
+| `common_steps.py` | Shared utilities | `WorkspaceClient`, `StatementState` |
+| `catalog_steps.py` | Unity Catalog | `catalog.PermissionsChange`, `catalog.Privilege`, `catalog.SecurableType` |
+| `pipeline_steps.py` | Lakeflow SDP | `pipelines.PipelineStateInfo` |
+| `job_steps.py` | Jobs/Notebooks | `jobs.SubmitTask`, `jobs.NotebookTask`, `jobs.RunLifeCycleState` |
+| `app_steps.py` | Databricks Apps | `httpx.Client` for HTTP assertions |
+| `sql_steps.py` | SQL/Data quality | `sql.StatementState`, `sql.Disposition` |
+
+**Step definition structure:**
+
+```python
+from __future__ import annotations
+
+from behave import given, when, then
+from behave.runner import Context
+
+
+@given('a descriptive step pattern with "{parameter}"')
+def step_impl(context: Context, parameter: str) -> None:
+    """Docstring explaining what this step does."""
+    # Implementation using context.workspace (set in environment.py)
+    ...
+```
+
+Refer to `references/step-library.md` for a comprehensive library of reusable Databricks step definitions covering:
+- Workspace connection and SQL execution
+- Table/schema existence and row count assertions
+- Grant and permission verification
+- Pipeline triggering and status polling
+- Job submission and completion waiting
+- HTTP endpoint testing with SSO header simulation
+
+### 3. Step writing principles
+
+**Use `context` for state passing.** Store results in `context` attributes so downstream `Then` steps can assert on them:
+
+```python
+@when('I execute a query on "{table_name}"')
+def step_execute(context: Context, table_name: str) -> None:
+    context.query_result = context.workspace.statement_execution.execute_statement(...)
+
+@then('the result should have {count:d} rows')
+def step_check_rows(context: Context, count: int) -> None:
+    actual = len(context.query_result.result.data_array or [])
+    assert actual == count, f"Expected {count}, got {actual}"
+```
+
+**Type all parameters.** Use Behave's parse types (`{name:d}` for int, `{name:f}` for float) or register custom types (sketched below).
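+
+Where the built-in types are not expressive enough, custom parse types can be registered. A minimal sketch (the `Privilege` type name and its value list are illustrative, not part of the original library):
+
+```python
+import parse
+from behave import register_type
+
+
+@parse.with_pattern(r"SELECT|MODIFY|ALL PRIVILEGES")
+def parse_privilege(text: str) -> str:
+    """Match only known privilege keywords in step text."""
+    return text
+
+
+# Enables patterns like: @when('I grant {priv:Privilege} on table "{t}"')
+register_type(Privilege=parse_privilege)
+```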
+
+**Assertion messages must be diagnostic.** Always include expected vs. actual values:
+
+```python
+assert actual == expected, f"Expected {expected}, got {actual}"
+```
+
+**Substitute the `{schema}` placeholder.** Docstrings and data-table cells may use `{schema}` as a placeholder (step text must not, since the `parse` matcher would treat it as a capture group). Step definitions resolve it from `context.test_schema`:
+
+```python
+sql = context.text.replace("{schema}", context.test_schema)
+```
+
+**Poll with timeout for async operations.** Jobs, pipelines, and app deployments need polling loops with configurable timeouts.
+
+### 4. Validate steps compile
+
+After writing, verify all steps resolve:
+
+```bash
+uv run behave --dry-run
+```
+
+Zero undefined steps = ready to run.
+
+## Additional resources
+
+- **`references/step-library.md`** — Complete reusable step definition library for all Databricks domains
diff --git a/projects/coles-vibe-workshop/.claude/commands/bdd-steps/references/step-library.md b/projects/coles-vibe-workshop/.claude/commands/bdd-steps/references/step-library.md
new file mode 100644
index 0000000..11ddf76
--- /dev/null
+++ b/projects/coles-vibe-workshop/.claude/commands/bdd-steps/references/step-library.md
@@ -0,0 +1,660 @@
+# Reusable Step Definition Library
+
+Complete library of Databricks step definitions for Behave. Organized by domain. Copy relevant sections into `features/steps/` files.
+
+**Proven patterns used throughout:**
+
+- Step patterns use **short names** (e.g., `"{table_name}"`), never `{test_schema}.table` in the pattern
+- Step code builds FQN internally: `fqn = f"{context.test_schema}.{table_name}"`
+- SQL in docstrings uses the `{schema}` placeholder, replaced via `context.text.replace("{schema}", context.test_schema)`
+- Steps with data tables have a **trailing colon** in the decorator: `@given('... with data:')`
+- Grants use **SQL**, not the SDK grants API (which breaks on recent SDK versions)
+- Integer parameters use Behave's built-in `{count:d}` format, not custom type parsers
+
+---
+
+## Common Steps (`common_steps.py`)
+
+Always include these. They provide workspace connection, SQL execution, and basic assertions.
+
+```python
+"""Shared step definitions for Databricks BDD tests."""
+from __future__ import annotations
+
+import os
+from datetime import datetime
+
+from behave import given, then, step
+from behave.runner import Context
+from databricks.sdk.service.sql import StatementState
+
+
+# ─── Connection and setup steps ─────────────────────────────────
+
+@given("a Databricks workspace connection is established")
+def step_workspace_connection(context: Context) -> None:
+    """Initialize workspace client. Usually handled by environment.py."""
+    if not hasattr(context, "workspace"):
+        from databricks.sdk import WorkspaceClient
+        context.workspace = WorkspaceClient()
+        me = context.workspace.current_user.me()
+        context.current_user = me.user_name
+
+
+@given("a test schema is provisioned")
+def step_test_schema(context: Context) -> None:
+    """Verify test schema exists. 
Usually handled by environment.py."""
+    assert hasattr(context, "test_schema"), (
+        "No test_schema on context — check environment.py before_all"
+    )
+
+
+# ─── SQL execution steps ────────────────────────────────────────
+
+@step('I execute SQL:')
+def step_execute_sql_docstring(context: Context) -> None:
+    """Execute SQL from a docstring (triple-quoted text in feature file).
+
+    The trailing colon is part of the matched step text. In feature files,
+    use {schema} as the placeholder:
+        When I execute SQL:
+        \"\"\"
+        SELECT * FROM {schema}.customers
+        \"\"\"
+    """
+    sql = context.text.replace("{schema}", context.test_schema)
+    context.query_result = _execute_sql(context, sql)
+
+
+@step('I execute SQL "{sql}"')
+def step_execute_sql_inline(context: Context, sql: str) -> None:
+    """Execute inline SQL. The {schema} placeholder is replaced automatically."""
+    sql = sql.replace("{schema}", context.test_schema)
+    context.query_result = _execute_sql(context, sql)
+
+
+# ─── Table existence and row count assertions ───────────────────
+
+@then('the table "{table_name}" should exist')
+def step_table_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        context.workspace.tables.get(fqn)
+    except Exception as e:
+        raise AssertionError(f"Table {fqn} does not exist: {e}")
+
+
+@then('the streaming table "{table_name}" should exist')
+def step_streaming_table_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        info = context.workspace.tables.get(fqn)
+        assert info.table_type is not None, f"{fqn} exists but has no table_type"
+    except Exception as e:
+        raise AssertionError(f"Streaming table {fqn} does not exist: {e}")
+
+
+@then('the materialized view "{table_name}" should exist')
+def step_mv_exists(context: Context, table_name: str) -> None:
+    fqn = f"{context.test_schema}.{table_name}"
+    try:
+        context.workspace.tables.get(fqn)
+    except Exception as e:
+        raise AssertionError(f"Materialized view {fqn} does not exist: {e}")
+
+
+@then('the table "{table_name}" should have {expected:d} rows')
+def step_exact_row_count(context: Context, table_name: str, expected: int) -> None:
+    actual = _count_rows(context, table_name)
+    assert actual == expected, f"Expected {expected} rows in {table_name}, got {actual}"
+
+
+@then('the table "{table_name}" should have more than {expected:d} rows')
+def step_min_row_count(context: Context, table_name: str, expected: int) -> None:
+    actual = _count_rows(context, table_name)
+    assert actual > expected, f"Expected more than {expected} rows in {table_name}, got {actual}"
+
+
+@then('the table "{table_name}" should have 0 rows')
+def step_empty_table(context: Context, table_name: str) -> None:
+    actual = _count_rows(context, table_name)
+    assert actual == 0, f"Expected 0 rows in {table_name}, got {actual}"
+
+
+# ─── Query result assertions ────────────────────────────────────
+
+@then("the result should have {expected:d} rows")
+def step_result_row_count(context: Context, expected: int) -> None:
+    rows = context.query_result.result.data_array or []
+    actual = len(rows)
+    assert actual == expected, f"Expected {expected} rows, got {actual}"
+
+
+@then("the result should have more than {expected:d} rows")
+def step_result_min_rows(context: Context, expected: int) -> None:
+    rows = context.query_result.result.data_array or []
+    actual = len(rows)
+    assert actual > expected, f"Expected more than {expected} rows, got {actual}"
+
+
+@then('the first row column "{col}" should be "{value}"')
+def 
step_first_row_value(context: Context, col: str, value: str) -> None: + result = context.query_result + columns = [c.name for c in result.manifest.schema.columns] + col_idx = columns.index(col) + actual = result.result.data_array[0][col_idx] + assert str(actual) == value, f"Expected {col}={value}, got {actual}" + + +# ─── Data setup steps ─────────────────────────────────────────── + +@given('the table "{table_name}" has been loaded') +def step_table_loaded(context: Context, table_name: str) -> None: + """Assert table exists and is not empty.""" + fqn = f"{context.test_schema}.{table_name}" + count = _count_rows(context, table_name) + assert count > 0, f"Table {fqn} exists but is empty" + + +@given('a managed table "{table_name}" exists') +def step_ensure_table_exists(context: Context, table_name: str) -> None: + fqn = f"{context.test_schema}.{table_name}" + try: + context.workspace.tables.get(fqn) + except Exception: + # Create a minimal table + _execute_sql(context, f"CREATE TABLE IF NOT EXISTS {fqn} (id BIGINT)") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + +@given('a managed table "{table_name}" with data:') +def step_create_table_with_data(context: Context, table_name: str) -> None: + """Create a table and populate from the Gherkin data table. + + The trailing colon in the decorator is required — Behave matches it + as part of the step text when a data table follows. + + Example feature file usage: + Given a managed table "customers" with data: + | id | name | region | + | 1 | Acme | APAC | + | 2 | Contoso | EMEA | + """ + fqn = f"{context.test_schema}.{table_name}" + headers = context.table.headings + rows = context.table.rows + + # Infer types (simple heuristic — all STRING) + col_defs = ", ".join(f"{h} STRING" for h in headers) + _execute_sql(context, f"CREATE OR REPLACE TABLE {fqn} ({col_defs})") + context.scenario_cleanup_sql.append(f"DROP TABLE IF EXISTS {fqn}") + + # Insert rows + for row in rows: + values = ", ".join(f"'{cell}'" for cell in row) + _execute_sql(context, f"INSERT INTO {fqn} VALUES ({values})") + + +# ─── Helpers ──────────────────────────────────────────────────── + +def _execute_sql(context: Context, sql: str): + """Execute SQL and return result.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed: {result.status.error}\nStatement: {sql[:200]}" + ) + return result + + +def _count_rows(context: Context, table_name: str) -> int: + """Count rows in a table.""" + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SELECT COUNT(*) AS cnt FROM {fqn}") + return int(result.result.data_array[0][0]) +``` + +--- + +## Catalog Steps (`catalog_steps.py`) + +Uses SQL for grants instead of the SDK grants API. The SDK's `grants.update(securable_type=SecurableType.TABLE, ...)` fails with `SECURABLETYPE.TABLE is not a valid securable type` on recent SDK versions. + +```python +"""Step definitions for Unity Catalog permissions and security. + +Uses SQL for all grant operations. The SDK grants API is unreliable — +SecurableType.TABLE fails on recent databricks-sdk versions. 
+""" +from __future__ import annotations + +from behave import when, then +from behave.runner import Context +from databricks.sdk.service.sql import StatementState + + +@when('I grant {privilege} on table "{table_name}" to group "{group}"') +def step_grant(context: Context, privilege: str, table_name: str, group: str) -> None: + """Grant a privilege on a table using SQL. + + Example feature file usage: + When I grant SELECT on table "customers" to group "analysts" + """ + fqn = f"{context.test_schema}.{table_name}" + _execute_sql(context, f"GRANT {privilege} ON TABLE {fqn} TO `{group}`") + + +@when('I revoke {privilege} on table "{table_name}" from group "{group}"') +def step_revoke(context: Context, privilege: str, table_name: str, group: str) -> None: + """Revoke a privilege on a table using SQL.""" + fqn = f"{context.test_schema}.{table_name}" + _execute_sql(context, f"REVOKE {privilege} ON TABLE {fqn} FROM `{group}`") + + +@then('the group "{group}" should have {privilege} permission on "{table_name}"') +def step_verify_grant( + context: Context, group: str, privilege: str, table_name: str +) -> None: + """Verify a grant exists using SHOW GRANTS. + + SHOW GRANTS returns PascalCase columns: Principal, ActionType, ObjectType, ObjectKey. + """ + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}") + columns = [c.name for c in result.manifest.schema.columns] + principal_idx = columns.index("Principal") + action_idx = columns.index("ActionType") + + found_privs = [] + for row in result.result.data_array or []: + if row[principal_idx] == group: + found_privs.append(row[action_idx]) + + assert privilege in found_privs, ( + f"Expected {group} to have {privilege} on {fqn}, " + f"found: {found_privs}" + ) + + +@then('the group "{group}" should not have {privilege} permission on "{table_name}"') +def step_verify_no_grant( + context: Context, group: str, privilege: str, table_name: str +) -> None: + """Verify a grant does NOT exist using SHOW GRANTS.""" + fqn = f"{context.test_schema}.{table_name}" + result = _execute_sql(context, f"SHOW GRANTS ON TABLE {fqn}") + columns = [c.name for c in result.manifest.schema.columns] + principal_idx = columns.index("Principal") + action_idx = columns.index("ActionType") + + found_privs = [] + for row in result.result.data_array or []: + if row[principal_idx] == group: + found_privs.append(row[action_idx]) + + assert privilege not in found_privs, ( + f"Expected {group} NOT to have {privilege} on {fqn}, " + f"but found: {found_privs}" + ) + + +def _execute_sql(context: Context, sql: str): + """Execute SQL and return result.""" + result = context.workspace.statement_execution.execute_statement( + warehouse_id=context.warehouse_id, + statement=sql, + wait_timeout="30s", + ) + assert result.status.state == StatementState.SUCCEEDED, ( + f"SQL failed: {result.status.error}\nStatement: {sql[:200]}" + ) + return result +``` + +--- + +## Pipeline Steps (`pipeline_steps.py`) + +```python +"""Step definitions for Lakeflow Spark Declarative Pipelines.""" +from __future__ import annotations + +import time + +from behave import given, when, then +from behave.runner import Context + + +@given('a pipeline "{name}" exists targeting "{schema}"') +def step_pipeline_exists(context: Context, name: str, schema: str) -> None: + pipelines = list( + context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"') + ) + if pipelines: + context.pipeline_id = pipelines[0].pipeline_id + else: + result = 
context.workspace.pipelines.create(
+            name=name,
+            target=schema,
+            catalog=context.test_catalog,
+            channel="CURRENT",
+        )
+        context.pipeline_id = result.pipeline_id
+        # NOTE: scenario_cleanup_sql holds SQL strings only, so a created
+        # pipeline is not dropped here; clean it up separately if needed.
+
+
+@given('the pipeline "{name}" has completed a full refresh')
+def step_pipeline_refreshed(context: Context, name: str) -> None:
+    """Ensure pipeline exists and has been refreshed at least once."""
+    pipelines = list(
+        context.workspace.pipelines.list_pipelines(filter=f'name LIKE "{name}"')
+    )
+    assert pipelines, f"Pipeline '{name}' not found"
+    context.pipeline_id = pipelines[0].pipeline_id
+    # Check latest update status
+    detail = context.workspace.pipelines.get(context.pipeline_id)
+    assert detail.latest_updates, f"Pipeline '{name}' has never been run"
+
+
+@when("I trigger a full refresh of the pipeline")
+def step_full_refresh(context: Context) -> None:
+    response = context.workspace.pipelines.start_update(
+        pipeline_id=context.pipeline_id,
+        full_refresh=True,
+    )
+    context.update_id = response.update_id
+
+
+@when("I trigger an incremental refresh of the pipeline")
+def step_incremental_refresh(context: Context) -> None:
+    response = context.workspace.pipelines.start_update(
+        pipeline_id=context.pipeline_id,
+        full_refresh=False,
+    )
+    context.update_id = response.update_id
+
+
+@then("the pipeline update should succeed within {timeout:d} seconds")
+def step_pipeline_success(context: Context, timeout: int) -> None:
+    _wait_for_pipeline(context, timeout, expect_success=True)
+
+
+@then("the pipeline update should fail")
+def step_pipeline_fail(context: Context) -> None:
+    _wait_for_pipeline(context, timeout=300, expect_success=False)
+
+
+@then('the pipeline error should mention {keyword}')
+def step_pipeline_error_contains(context: Context, keyword: str) -> None:
+    from databricks.sdk.service.pipelines import EventLevel
+
+    events = list(context.workspace.pipelines.list_pipeline_events(
+        pipeline_id=context.pipeline_id,
+        max_results=10,
+    ))
+    # Compare against the EventLevel enum; a bare string never equals it.
+    error_messages = " ".join(
+        str(e.message) for e in events if e.level == EventLevel.ERROR
+    )
+    assert keyword.lower() in error_messages.lower(), (
+        f"Expected pipeline error to mention '{keyword}', "
+        f"but errors were: {error_messages[:500]}"
+    )
+
+
+def _wait_for_pipeline(
+    context: Context, timeout: int, expect_success: bool
+) -> None:
+    from databricks.sdk.service.pipelines import UpdateInfoState
+
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        update = context.workspace.pipelines.get_update(
+            pipeline_id=context.pipeline_id,
+            update_id=context.update_id,
+        )
+        state = update.update.state
+        # The SDK returns UpdateInfoState enum members, not plain strings.
+        if state == UpdateInfoState.COMPLETED:
+            if expect_success:
+                return
+            raise AssertionError("Expected pipeline to fail, but it succeeded")
+        if state in (UpdateInfoState.FAILED, UpdateInfoState.CANCELED):
+            if not expect_success:
+                return
+            raise AssertionError(
+                f"Pipeline update {state.value}. Check update {context.update_id}"
+            )
+        time.sleep(15)
+    raise TimeoutError(f"Pipeline did not complete within {timeout}s")
+```
+
+---
+
+## Job Steps (`job_steps.py`)
+
+```python
+"""Step definitions for Databricks Jobs and notebook runs."""
+from __future__ import annotations
+
+import time
+
+from behave import when, then
+from behave.runner import Context
+from databricks.sdk.service.jobs import (
+    NotebookTask,
+    RunLifeCycleState,
+    SubmitTask,
+)
+
+
+@when('I run the notebook "{path}" with parameters:')
+def step_run_notebook(context: Context, path: str) -> None:
+    """Run a notebook with parameters from a Gherkin data table.
+
+    The trailing colon is required when a data table follows. 
+
+    Example feature file usage:
+        When I run the notebook "/Workspace/tests/etl" with parameters:
+            | key    | value     |
+            | schema | my_schema |
+            | mode   | full      |
+    """
+    params = {}
+    for row in context.table:
+        value = row["value"].replace("{schema}", context.test_schema)
+        params[row["key"]] = value
+
+    run = context.workspace.jobs.submit(
+        run_name=f"behave-{context.scenario.name[:50]}",
+        tasks=[
+            SubmitTask(
+                task_key="main",
+                notebook_task=NotebookTask(
+                    notebook_path=path,
+                    base_parameters=params,
+                ),
+            )
+        ],
+    )
+    context.run_id = run.response.run_id
+
+
+@then('the job should complete with status "{expected}" within {timeout:d} seconds')
+def step_job_status(context: Context, expected: str, timeout: int) -> None:
+    deadline = time.time() + timeout
+    while time.time() < deadline:
+        run = context.workspace.jobs.get_run(context.run_id)
+        state = run.state
+        if state.life_cycle_state in (
+            RunLifeCycleState.TERMINATED,
+            RunLifeCycleState.INTERNAL_ERROR,
+            RunLifeCycleState.SKIPPED,
+        ):
+            break
+        time.sleep(10)
+    else:
+        raise TimeoutError(f"Run {context.run_id} did not complete within {timeout}s")
+
+    actual = state.result_state.value if state.result_state else "UNKNOWN"
+    assert actual == expected, (
+        f"Expected {expected}, got {actual}. Message: {state.state_message}"
+    )
+```
+
+---
+
+## App Steps (`app_steps.py`)
+
+```python
+"""Step definitions for Databricks Apps (FastAPI) testing."""
+from __future__ import annotations
+
+import subprocess
+import os
+
+import httpx
+from behave import given, when, then
+from behave.runner import Context
+
+
+@given('the app is running at "{base_url}"')
+def step_app_running(context: Context, base_url: str) -> None:
+    context.app_client = httpx.Client(base_url=base_url, timeout=10)
+
+
+@given('the test user is "{email}"')
+def step_test_user(context: Context, email: str) -> None:
+    context.auth_headers = {
+        "X-Forwarded-Email": email,
+        "X-Forwarded-User": email.split("@")[0],
+    }
+
+
+@when('I GET "{path}"')
+def step_get(context: Context, path: str) -> None:
+    context.response = context.app_client.get(path)
+
+
+@when('I GET "{path}" with auth headers')
+def step_get_auth(context: Context, path: str) -> None:
+    context.response = context.app_client.get(path, headers=context.auth_headers)
+
+
+@when('I GET "{path}" without auth headers')
+def step_get_no_auth(context: Context, path: str) -> None:
+    context.response = context.app_client.get(path)
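+
+
+# NOTE: the X-Forwarded-Email / X-Forwarded-User headers set by step_test_user
+# above stand in for the identity headers a Databricks App receives behind SSO;
+# omitting them simulates an unauthenticated caller.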
+
+
+@when('I POST "{path}" with auth headers and body:')
+def step_post_auth(context: Context, path: str) -> None:
+    """POST with JSON body from a docstring.
+
+    The trailing colon is required when a docstring follows.
+
+    Example feature file usage:
+        When I POST "/api/items" with auth headers and body:
+            \"\"\"
+            {"name": "test-item", "value": 42}
+            \"\"\"
+    """
+    import json
+    body = json.loads(context.text)
+    context.response = context.app_client.post(
+        path, json=body, headers=context.auth_headers,
+    )
+
+
+@then("the response status should be {code:d}")
+def step_status_code(context: Context, code: int) -> None:
+    assert context.response.status_code == code, (
+        f"Expected {code}, got {context.response.status_code}: "
+        f"{context.response.text[:200]}"
+    )
+
+
+@then('the response JSON should contain "{key}" with value "{value}"')
+def step_json_value(context: Context, key: str, value: str) -> None:
+    data = context.response.json()
+    assert key in data, f"Key '{key}' not in response: {list(data.keys())}"
+    assert str(data[key]) == value, f"Expected {key}='{value}', got '{data[key]}'"
+
+
+@then("the response should be a JSON list")
+def step_json_list(context: Context) -> None:
+    data = context.response.json()
+    assert isinstance(data, list), f"Expected list, got {type(data).__name__}"
+
+
+# ─── Deployment steps ───────────────────────────────────────────
+
+@when('I deploy using Asset Bundles with target "{target}"')
+def step_deploy_bundle(context: Context, target: str) -> None:
+    result = subprocess.run(
+        ["databricks", "bundle", "deploy", "--target", target],
+        capture_output=True,
+        text=True,
+        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
+        timeout=300,
+    )
+    context.deploy_result = result
+
+
+@then("the deployment should succeed")
+def step_deploy_success(context: Context) -> None:
+    r = context.deploy_result
+    assert r.returncode == 0, (
+        f"Deploy failed (rc={r.returncode}):\n{r.stderr[:500]}"
+    )
+```
+
+---
+
+## Shell Command Steps (reusable)
+
+```python
+"""Step definitions for running CLI commands (DABs, databricks CLI)."""
+from __future__ import annotations
+
+import os
+import subprocess
+
+from behave import when, then
+from behave.runner import Context
+
+
+@when('I run "{command}" with target "{target}"')
+def step_run_command(context: Context, command: str, target: str) -> None:
+    full_cmd = f"{command} --target {target}"
+    context.cmd_result = subprocess.run(
+        full_cmd.split(),
+        capture_output=True,
+        text=True,
+        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
+        timeout=300,
+    )
+
+
+@when('I run "{command}" with target "{target}" and auto-approve')
+def step_run_command_approve(context: Context, command: str, target: str) -> None:
+    full_cmd = f"{command} --target {target} --auto-approve"
+    context.cmd_result = subprocess.run(
+        full_cmd.split(),
+        capture_output=True,
+        text=True,
+        env={**dict(os.environ), "DATABRICKS_BUNDLE_ENGINE": "direct"},
+        timeout=300,
+    )
+
+
+@then("the command should exit with code {code:d}")
+def step_exit_code(context: Context, code: int) -> None:
+    actual = context.cmd_result.returncode
+    assert actual == code, (
+        f"Expected exit code {code}, got {actual}.\n"
+        f"stdout: {context.cmd_result.stdout[:300]}\n"
+        f"stderr: {context.cmd_result.stderr[:300]}"
+    )
+
+
+@then("the command should succeed")
+def step_command_success(context: Context) -> None:
+    assert context.cmd_result.returncode == 0, (
+        f"Command failed (rc={context.cmd_result.returncode}):\n"
+        f"{context.cmd_result.stderr[:500]}"
+    )
+```
diff --git a/projects/coles-vibe-workshop/.claude/commands/setup-track-env/SKILL.md b/projects/coles-vibe-workshop/.claude/commands/setup-track-env/SKILL.md
new file mode 100644
index 0000000..951f6af
--- /dev/null
+++ b/projects/coles-vibe-workshop/.claude/commands/setup-track-env/SKILL.md
@@ -0,0 +1,144 @@
+---
+name: setup-track-env
+description: "Use when a team needs to set up their Python virtual environment for a workshop track (de, ds, or analyst). Creates an isolated uv venv under .venvs/ with all required dependencies pre-installed."
+user-invocable: true
+---
+
+# Setup Track Environment
+
+Create a pre-configured Python virtual environment for a workshop track using uv.
+
+## Arguments
+
+One required argument — the track name:
+
+| Argument | Track | Key packages |
+|----------|-------|-------------|
+| `de` | Data Engineering | pyspark 4.1, delta-spark 4.1, beautifulsoup4, lxml |
+| `ds` | Data Science | pyspark 4.1, mlflow 3.11, scikit-learn 1.8, xgboost 3.2 |
+| `analyst` | Analyst | fastapi, uvicorn, sse-starlette, httpx |
+
+All tracks also get: pytest, ruff, databricks-sdk, databricks-sql-connector.
+
+## Process
+
+### 1. Validate the argument
+
+If the argument is missing or invalid, print the table above and stop.
+
+### 2. Ensure .venvs/ is gitignored
+
+```bash
+cd /app/python/source_code/projects/coles-vibe-workshop && grep -qxF '.venvs/' .gitignore || echo '.venvs/' >> .gitignore
+```
+
+### 3. Create the virtual environment
+
+```bash
+cd /app/python/source_code/projects/coles-vibe-workshop && /usr/local/bin/uv venv --python /usr/bin/python3.11 --clear .venvs/<track>
+```
+
+If Python 3.11 is not found, fall back to `python3`:
+
+```bash
+cd /app/python/source_code/projects/coles-vibe-workshop && /usr/local/bin/uv venv --python python3 --clear .venvs/<track>
+```
+
+### 4. Install dependencies
+
+```bash
+cd /app/python/source_code/projects/coles-vibe-workshop && /usr/local/bin/uv pip install -r requirements/<track>.txt --python .venvs/<track>/bin/python
+```
+
+If installation fails with a connection/network error, print:
+
+```
+Package installation failed — PyPI may be unreachable.
+
+Ask the facilitator to allowlist these domains:
+    pypi.org
+    files.pythonhosted.org
+
+To diagnose: curl -I https://pypi.org
+```
+
+### 5. Verify imports
+
+Run the appropriate verification for the track:
+
+**DE:**
+```bash
+.venvs/de/bin/python -c "
+import pyspark; print(f'pyspark {pyspark.__version__}')
+import pytest; print(f'pytest {pytest.__version__}')
+import bs4; print('beautifulsoup4 OK')
+import lxml; print('lxml OK')
+print('--- DE environment ready ---')
+"
+```
+
+Note: `databricks-declarative-pipelines` is NOT on public PyPI — it is only available on Databricks cluster runtime. The `@dp.table` / `@dp.expect` decorators will work when the pipeline runs on a cluster. For local testing, mock or skip those imports.
+
+**DS:**
+```bash
+.venvs/ds/bin/python -c "
+import pyspark; print(f'pyspark {pyspark.__version__}')
+import mlflow; print(f'mlflow {mlflow.__version__}')
+import sklearn; print(f'scikit-learn {sklearn.__version__}')
+import xgboost; print(f'xgboost {xgboost.__version__}')
+import pytest; print(f'pytest {pytest.__version__}')
+print('--- DS environment ready ---')
+"
+```
+
+**Analyst:**
+```bash
+.venvs/analyst/bin/python -c "
+import fastapi; print(f'fastapi {fastapi.__version__}')
+import uvicorn; print(f'uvicorn {uvicorn.__version__}')
+import sse_starlette; print('sse-starlette OK')
+import httpx; print(f'httpx {httpx.__version__}')
+import pytest; print(f'pytest {pytest.__version__}')
+print('--- Analyst environment ready ---')
+"
+```
+
+### 6. 
+### 6. Check Java (DE and DS only)
+
+If the track is `de` or `ds`, check for Java since PySpark requires it:
+
+```bash
+java -version 2>&1 || echo "WARNING: Java not found — see note below."
+```
+
+### 7. Print activation instructions
+
+```
+Environment created: .venvs/<track>/
+
+Activate it:
+  source .venvs/<track>/bin/activate
+
+Run tests:
+  .venvs/<track>/bin/python -m pytest tests/ -x --no-header -q
+
+Or use directly without activating:
+  .venvs/<track>/bin/python your_script.py
+```
+
+## Important: PySpark in this environment
+
+This Databricks App environment does **not** have Java installed. PySpark imports fine (`import pyspark` works), but `SparkSession.builder.getOrCreate()` will fail with `JAVA_GATEWAY_EXITED`.
+
+**This is expected and OK for the workshop.** The DE and DS tracks write PySpark code that runs on Databricks clusters (via Lakeflow pipelines or notebooks), not locally. Teams should:
+- Write and test pipeline logic using `pytest` with mocked Spark objects
+- Deploy to Databricks clusters for actual Spark execution via `databricks bundle deploy`
+- Use `databricks-sql-connector` for local data access when needed
+
+Similarly, `databricks-declarative-pipelines` (`@dp.table`, `@dp.expect`) is only available on Databricks cluster runtime, not on public PyPI.
+
+## Notes
+
+- Running this skill again for the same track is safe — it recreates the venv cleanly.
+- Teams can have multiple track venvs simultaneously.
+- Requirements files are in `requirements/` and are version-controlled.
diff --git a/projects/coles-vibe-workshop/.gitignore b/projects/coles-vibe-workshop/.gitignore
new file mode 100644
index 0000000..2b4f227
--- /dev/null
+++ b/projects/coles-vibe-workshop/.gitignore
@@ -0,0 +1,7 @@
+node_modules/
+package.json
+package-lock.json
+*.pdf
+_state.md
+export-pdf.js
+.venvs/
diff --git a/projects/coles-vibe-workshop/CLAUDE.md b/projects/coles-vibe-workshop/CLAUDE.md
new file mode 100644
index 0000000..5f163ed
--- /dev/null
+++ b/projects/coles-vibe-workshop/CLAUDE.md
@@ -0,0 +1,92 @@
+# CLAUDE.md
+
+This file provides guidance to Claude Code (claude.ai/code) when working with code in this repository.
+
+## What This Is
+
+Workshop materials for a 6.5-hour Coles Group vibe coding hackathon (9:30 AM – 4:00 PM). Teams of 2-3 build a Grocery Intelligence Platform using AI coding agents on Databricks. Three parallel tracks: Data Engineering, Data Science, and Analyst.
+
+## Build & Export Commands
+
+```bash
+# Generate shareable PDFs from track HTML workbooks (requires Node.js)
+npm install puppeteer pdf-lib --silent
+cp ~/.claude/plugins/cache/fe-vibe/fe-html-slides/*/skills/html-slides/resources/export-pdf.js .
+node export-pdf.js track-common.html    # → track-common.pdf
+node export-pdf.js track-de.html        # → track-de.pdf
+node export-pdf.js track-ds.html        # → track-ds.pdf
+node export-pdf.js track-analyst.html   # → track-analyst.pdf
+
+# Run reference implementation tests (requires PySpark)
+cd reference-implementation && uv run pytest tests/test_pipeline.py -x --no-header -q
+cd reference-implementation && uv run pytest tests/test_quality.py -x --no-header -q
+cd reference-implementation && uv run pytest tests/test_app.py -x --no-header -q
+```
+
+## Workshop Schedule (9:30–16:00)
+
+- **9:30** Welcome + Icebreaker (15 min)
+- **9:45** Theory: Vibe Coding, CLAUDE.md, TDD (45 min)
+- **10:30** Break (15 min)
+- **10:45** Lab 0: Guided Hands-On — ALL together (45 min)
+- **11:30** Theory: Skills, MCP, Genie/AI-BI (20 min)
+- **11:50** Track Briefing — teams choose DE / DS / Analyst (10 min)
+- **12:00** Lab 1 — track-specific (60 min)
+- **13:00** Show & Tell (15 min)
+- **13:15** Lunch (45 min)
+- **14:00** Lab 2 — track-specific (60 min)
+- **15:00** Team Demos (30 min)
+- **15:30** Takeaways + Close (15 min)
+
+## Six Files Must Stay In Sync
+
+When changing workshop content, update all six:
+1. `slides.html` — the deck attendees see
+2. `slides-facilitator.html` — same deck with speaker notes (JS-injected from `data-notes` attributes)
+3. `FACILITATOR-SCRIPT.md` — dot-point cue cards per slide
+4. `WORKSHOP-PLAN.md` — master plan with session details and timing
+5. `VIBE-CODING-GUIDE.md` — markdown version of common track theory content
+6. `track-common.html` — HTML workbook version of the same content
+
+The three track HTMLs (`track-de.html`, `track-ds.html`, `track-analyst.html`) pull content from the corresponding lab markdown files (`LAB-1-DE.md`, `LAB-2-DE.md`, etc.) and must stay aligned with them.
+
+## Slide Deck Conventions
+
+Single HTML files with no build step. Open directly in a browser.
+
+**Slide structure:** Each slide is a `<section class="slide">` with `aria-label` (title) and `data-notes` (speaker notes). Follow the per-slide HTML comment convention used in the files.
+
+**CSS variables:** Databricks brand in `:root` — `var(--db-lava)` #FF3621, `var(--db-navy)` #1B3139, `var(--db-teal)` / `var(--db-green)` #00A972, `var(--db-oat)` #F9F7F4. Coles red `var(--coles-red)` #E01A22 used sparingly.
+
+**Code blocks:** `<pre>` blocks with `<span>` classes for syntax highlighting. `white-space: pre` — preserve indentation in HTML source.
+
+**PDF export:** Uses Puppeteer to screenshot each `.slide` section at 1920×1080 @2x, then pdf-lib stitches into PDF. This is why all exportable HTML files must use `<section class="slide">` elements.
+
+## Track Workbook Conventions
+
+The `track-*.html` files are slide-format workbooks (not presentation decks). Each uses the same `.slide` structure for PDF export compatibility but styled as document pages with:
+- `.cover-slide` — dark themed title page
+- `.divider-slide` — section dividers with timing badges
+- `.page-slide` — white content pages with `.page-header`, `.page-content`, `.page-footer`
+
+Track accent colors: DE = lava red (#FF3621), DS = purple (#7c3aed), Analyst = cyan (#0891b2), Common = green (#00A972).
+
+## Three-Track System
+
+After shared theory and Lab 0, teams choose one track:
+
+| Track | Badge Color | Lab 1 | Lab 2 | Lab Files |
+|-------|------------|-------|-------|-----------|
+| Data Engineering | Red | Lakeflow pipeline (Bronze→Silver→Gold) | Data quality, FSANZ source, scheduling | `LAB-1-DE.md`, `LAB-2-DE.md` |
+| Data Science | Purple | Feature engineering, MLflow experiments | Model training, serving, prediction app | `LAB-1-DS.md`, `LAB-2-DS.md` |
+| Analyst | Cyan | Genie spaces, AI/BI dashboards | FastAPI web app with embedded dashboards | `LAB-1-ANALYST.md`, `LAB-2-ANALYST.md` |
+
+The original generic lab files (`LAB-1-DATA-PIPELINE.md`, `LAB-2-APP-GENIE-DASHBOARD.md`) are kept for reference but the track-specific versions are canonical.
+
+## Reference Implementation
+
+`reference-implementation/` contains the complete DE track solution built following Anthropic best practices (CLAUDE.md → tests → implementation). It serves as the facilitator answer key and teaching artifact. 55 tests across three files: `test_pipeline.py` (bronze/silver/gold), `test_app.py` (FastAPI), `test_quality.py` (data validation). Pipeline uses `import databricks.declarative_pipelines as dp` with `@dp.table` decorators and `@dp.expect` / `@dp.expect_or_fail` quality rules.
+
+## Content Alignment
+
+The workshop references a working implementation at `~/Repos/coles-genie-demo` — a Databricks Asset Bundle with Lakeflow pipelines ingesting ABS Retail Trade, ABS CPI Food, FSANZ Food Recalls, and ACCC Grocery PDFs. Lab instructions, slide content, and pipeline flow diagrams must stay aligned with that repo's actual data sources, table names, and schemas.
diff --git a/projects/coles-vibe-workshop/LAB-0-GETTING-STARTED.md b/projects/coles-vibe-workshop/LAB-0-GETTING-STARTED.md
new file mode 100644
index 0000000..1ff6522
--- /dev/null
+++ b/projects/coles-vibe-workshop/LAB-0-GETTING-STARTED.md
@@ -0,0 +1,99 @@
+# Lab 0: Getting Started (All Tracks)
+
+**Duration:** 10 minutes
+**Goal:** Set up your project, explore the data, and choose your track
+
+---
+
+## Step 1: Open Your Terminal
+
+Your Coding Agents app gives you a browser-based terminal with Claude Code pre-installed. Open it in your browser now.
+
+---
+
+## Step 2: Set Up Your Project (Rule #1: Just Say What You Want)
+
+Don't write config files by hand — have a conversation. Tell Claude about your project and it will create everything.
+
+1. Create your project directory:
+   ```bash
+   mkdir -p grocery-intelligence && cd grocery-intelligence
+   ```
+
+2. Tell Claude about your project — just say it:
+
+   ```
+   I'm building a grocery intelligence platform on Databricks.
+   Tech stack: PySpark, Lakeflow Declarative Pipelines, FastAPI + React, DABs.
+   Data sources: ABS SDMX APIs, FSANZ web scraping, ACCC PDF ingestion via UC Volumes.
+   Unity Catalog namespace: workshop_vibe_coding.<your_schema>.
+   Set up the project and create a CLAUDE.md.
+   ```
+
+   Replace `<your_schema>` with your assigned value (e.g., `team_01`).
+
+3. 
Review what Claude created — want to change something? Just say it: + - "Add a rule that we always use PySpark, never pandas" + - "Add our team angle: Retail Performance" + - "Set up the test directory with conftest.py" + +4. Copy your track's test stubs: + + | Track | Command | + |-------|---------| + | **Data Engineering** | `cp ~/starter-kit/test_pipeline.py tests/` | + | **Data Science** | `cp ~/starter-kit/test_features.py tests/` | + | **Analyst** | *(minimal tests — most work is UI-based)* | + +--- + +## Step 3: Verify Your Environment + +```bash +# Check Claude Code is working +claude --version + +# Check Databricks CLI +databricks auth status + +# Check your Unity Catalog access +databricks catalogs list | grep workshop +``` + +If any of these fail, ask the facilitator for help. + +--- + +## Step 4: Explore the Data + +All tracks use the same gold tables. Paste this into Claude Code: + +``` +Query these tables and show me 5 rows from each: +- workshop_vibe_coding.checkpoints.retail_summary +- workshop_vibe_coding.checkpoints.food_inflation_yoy + +Tell me: what columns are available, what date range is covered, and which states are included. +``` + +This gives you a feel for the data regardless of your track. + +--- + +## Step 5: Choose Your Track + +| Track | You'll Build | Best For | +|-------|-------------|----------| +| **Data Engineering** | Lakeflow pipeline (Bronze→Silver→Gold), DABs deployment, data quality | Teams who want to build the data foundation | +| **Data Science** | Feature engineering, MLflow experiments, model training + serving | Teams who want to build ML models from the data | +| **Analyst** | Genie spaces, AI/BI dashboards, FastAPI web app | Teams who want to build interfaces for business users | + +All tracks connect to the same data — you're building different layers of the same platform. + +--- + +## Now Go To Your Track + +- **Data Engineering** → `LAB-1-DE.md` +- **Data Science** → `LAB-1-DS.md` +- **Analyst** → `LAB-1-ANALYST.md` diff --git a/projects/coles-vibe-workshop/LAB-1-ANALYST.md b/projects/coles-vibe-workshop/LAB-1-ANALYST.md new file mode 100644 index 0000000..3445abf --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-1-ANALYST.md @@ -0,0 +1,201 @@ +# Lab 1: Genie Spaces & AI/BI Dashboards (Analyst Track) + +**Duration:** 55 minutes +**Goal:** Create Genie spaces, build AI/BI dashboards, and tune with metadata +**Team Size:** 2–3 people + +> Complete `LAB-0-GETTING-STARTED.md` first, then return here. + +--- + +## The Mission + +Your gold tables are pre-loaded (from checkpoints). Build natural language interfaces that let business users query this data without writing code. + +**Most of this lab is in the Databricks UI.** The terminal is used for adding metadata and writing SQL. + +--- + +## Phase 1: Create Genie Space + Add Metadata (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Databricks UI):** Create Genie space with gold tables and instructions +> - **Person B (Databricks UI):** Navigate to AI/BI Dashboards, start first visualization +> - **Person C (Terminal):** Add column descriptions to gold tables in Unity Catalog +> +> *Teams of 2: Person A does UI tasks, Person B does Terminal + UI.* + +### 1.1 Create your Genie space (Person A) + +In the Databricks workspace UI: + +1. Click **Genie** in the left sidebar +2. Click **New Genie Space** +3. 
Configure: + - **Name:** "Grocery Intelligence — TEAM_NAME" + - **SQL Warehouse:** Select the workshop warehouse + - **Tables:** Add: + - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary` + - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy` + - **General Instructions:** Paste this: + ``` + This data contains Australian retail trade and food price data. + States are Australian states (New South Wales, Victoria, Queensland, etc.). + Turnover is in millions of AUD. + CPI index values are relative to a base period. + YoY growth and change percentages show year-over-year comparisons. + ``` + +### 1.2 Add column descriptions (Person C) + +Paste this into Claude Code: + +``` +Add column comments to our gold tables for better Genie and AI/BI accuracy: + +For workshop_vibe_coding.TEAM_SCHEMA.retail_summary: +- Table comment: "Monthly retail turnover summary by Australian state and industry with rolling averages and YoY growth" +- state: "Australian state name (New South Wales, Victoria, Queensland, etc.)" +- industry: "Retail industry category (Food retailing, Department stores, etc.)" +- month: "Date of the monthly observation (first of month)" +- turnover_millions: "Monthly retail turnover in millions of AUD" +- turnover_3m_avg: "3-month rolling average of turnover in millions AUD" +- turnover_12m_avg: "12-month rolling average of turnover in millions AUD" +- yoy_growth_pct: "Year-over-year turnover growth as a percentage" + +For workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: +- Table comment: "Quarterly food price inflation by Australian state with YoY CPI changes" +- state: "Australian state name" +- quarter: "Calendar quarter" +- cpi_index: "Consumer Price Index value (base period = 100)" +- yoy_change_pct: "Year-over-year CPI change as a percentage (positive = inflation)" + +Use ALTER TABLE ... SET TBLPROPERTIES for table comments. +Use ALTER TABLE ... ALTER COLUMN ... COMMENT for column comments. +``` + +> **Starter Kit:** Steps in `starter-kit/prompts/analyst/01-setup-genie.md` and `analyst/02-tune-genie.md` + +--- + +## Phase 2: Build AI/BI Dashboard (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Databricks UI):** Create 4-5 dashboard visualizations using NL prompts +> - **Person B (Databricks UI):** Test Genie with challenging questions, refine instructions +> - **Person C (Terminal):** Write SQL queries for complex visualizations the NL can't generate well +> +> *Teams of 2: Person A does dashboards, Person B does Genie + SQL.* + +### 2.1 Create dashboard (Person A) + +In the Databricks UI: + +1. Navigate to **Dashboards** → **Create Dashboard** → **AI/BI Dashboard** +2. Connect to your gold tables +3. Use these natural language prompts: + +``` +Show monthly food retail turnover by state as a line chart for the last 2 years +``` + +``` +Create a bar chart comparing year-over-year retail growth by state for the latest month +``` + +``` +Show a heatmap of food inflation by state and quarter +``` + +``` +Display the top 5 states by average monthly turnover as a horizontal bar chart +``` + +``` +Show a trend line of national food price inflation over time +``` + +4. Arrange into a clean layout +5. Title: "Grocery Intelligence Dashboard — TEAM_NAME" + +### 2.2 Test Genie (Person B) + +Try these questions and note which ones Genie gets right: + +- "Which state had the highest food retail turnover last month?" 
+- "Show me the year-over-year food price inflation trend for Victoria" +- "Compare retail growth across all states for the last 12 months" +- "What's the average monthly turnover for department stores in NSW?" +- "Which industry has the fastest growing turnover nationally?" + +If Genie gets a question wrong, refine the General Instructions. + +> **Starter Kit:** Dashboard steps in `starter-kit/prompts/analyst/03-build-dashboard.md` + +> **Stuck?** Grab **Checkpoint AN-1B**: dashboard with 3 pre-built visualizations. + +--- + +## Phase 3: Polish + Advanced Tuning (15 min) + +> **Team Tasks for This Phase** +> - **Person A:** Refine Genie instructions — add example queries, clarify ambiguous terms +> - **Person B:** Polish dashboard layout, add filters, publish +> - **Person C:** Test Genie with 10 sample questions, document accuracy rate +> +> *Teams of 2: Split between Genie tuning and dashboard polishing.* + +### 3.1 Tune Genie accuracy + +Based on testing, update General Instructions with: +- Example questions and expected SQL patterns +- Clarifications for ambiguous terms (e.g., "last month" = most recent month in data) +- Specific column mappings for common questions + +### 3.2 Publish dashboard + +1. Click **Publish** on your dashboard +2. Click **Share** → get the URL for your demo +3. Optionally click **Embed** → copy iframe code (for Lab 2 app integration) + +--- + +## Phase 4: Verify + Prepare (5 min) + +- Test 5 sample questions against Genie — aim for 80%+ accuracy +- Review dashboard — all visualizations render, filters work +- Prepare for Show & Tell: What did you build? What was Genie best/worst at? + +--- + +## Troubleshooting + +| Problem | Solution | +|---------|----------| +| **Can't find Genie in sidebar** | Ask facilitator — Genie may need to be enabled | +| **"No permission" for Genie** | Need CREATE GENIE SPACE permission. Ask facilitator. | +| **Genie gives wrong SQL** | Add column descriptions and example queries to instructions | +| **AI/BI dashboard slow** | Check SQL warehouse is running (Compute → SQL Warehouses) | +| **Dashboard viz doesn't match** | Rephrase the NL prompt or write SQL directly | +| **Column descriptions not showing** | Run ALTER TABLE with correct syntax (see starter-kit prompt) | +| **Running out of time** | Grab Checkpoint AN-1B (dashboard) or AN-1C (complete solution) | + +--- + +## Success Criteria + +- [ ] Genie space created with gold tables and instructions +- [ ] Gold tables have column descriptions in Unity Catalog +- [ ] AI/BI dashboard has at least 4 visualizations +- [ ] Genie answers 7/10 test questions correctly +- [ ] Dashboard published with clean layout +- [ ] Ready for Show & Tell + +--- + +## Reflection Questions (for Show & Tell) + +1. How did column descriptions affect Genie's accuracy? +2. Which questions was Genie best/worst at answering? +3. How does this compare to building a custom query interface? +4. What would you do differently with more time? 
diff --git a/projects/coles-vibe-workshop/LAB-1-DATA-PIPELINE.md b/projects/coles-vibe-workshop/LAB-1-DATA-PIPELINE.md new file mode 100644 index 0000000..ccc6168 --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-1-DATA-PIPELINE.md @@ -0,0 +1,348 @@ +# Lab 1: Build Your Data Pipeline + +**Duration:** 55 minutes +**Goal:** Build, test, and deploy a Lakeflow Declarative Pipeline using an AI coding agent +**Team Size:** 2–3 people + +--- + +## The Mission + +Your team needs a data pipeline that ingests public Australian retail data, transforms it through a medallion architecture, and produces analytics-ready gold tables. You'll use **TDD** to guide the agent and **Lakeflow Declarative Pipelines** to define your tables. + +**You will NOT write this code yourself.** You will direct an AI agent to build it. + +--- + +## Your Data Sources + +| Source | API Endpoint | What It Contains | +|--------|-------------|-----------------| +| **ABS Retail Trade** | `https://data.api.abs.gov.au/data/ABS,RT,1.0.0/...` | Monthly retail turnover by state & industry since 2010 | +| **ABS Consumer Price Index** | `https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/...` | Quarterly food price indices by state since 2010 | + +Both APIs return CSV data via the SDMX standard. The agent will handle the API calls and parsing. + +--- + +## Getting Started (2 minutes) + +Before you begin, set up your project: + +1. Copy the starter CLAUDE.md to your project root: + ```bash + cp starter-kit/CLAUDE.md ./CLAUDE.md + ``` + +2. Edit CLAUDE.md — replace `TEAM_NAME` and `TEAM_SCHEMA` with your team's values + +3. Copy test fixtures: + ```bash + mkdir -p tests + cp starter-kit/conftest.py tests/ + ``` + +4. Copy the config template: + ```bash + cp starter-kit/databricks.yml.template ./databricks.yml + ``` + Replace `TEAM_NAME` and `TEAM_SCHEMA` placeholders. + +> All prompts for this lab are in `starter-kit/prompts/01-05`. Each is exact copy-paste — no typing required. + +--- + +## Phase 1: Write Tests First (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Run data exploration prompt from `starter-kit/prompts/01-explore-data.md` +> - **Person B (Terminal):** Copy `starter-kit/CLAUDE.md` to project root, customize team name/schema +> - **Person C (Databricks UI):** Open workspace, verify Unity Catalog schema exists, check checkpoint tables are accessible +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +> **Remember:** Tests are your spec. Write them BEFORE any implementation. + +### 1.1 Explore the data + +Ask the agent: + +``` +Fetch a sample of the ABS Retail Trade API: +https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03 + +Show me the columns, data types, and a few sample rows. +Also fetch a sample of the ABS CPI Food API: +https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4 + +Show me the same for this one. +``` + +Understanding the raw data structure is critical before writing tests. + +### 1.2 Write your pipeline tests + +Ask the agent: + +``` +Create pytest tests for a Lakeflow Declarative Pipeline with these transformations: + +1. test_bronze_retail_trade: + - Raw CSV data is ingested with all original columns + - Non-null TIME_PERIOD and OBS_VALUE columns + - Test: given sample CSV rows, bronze table has correct schema + +2. 
test_silver_retail_turnover: + - REGION codes decoded to state names (1=NSW, 2=VIC, 3=QLD, etc.) + - INDUSTRY codes decoded to readable names (20=Food retailing, etc.) + - TIME_PERIOD parsed to proper date column + - OBS_VALUE renamed to turnover_millions + - Test: given bronze rows with codes, silver rows have readable names + +3. test_gold_retail_summary: + - Adds 3-month and 12-month rolling averages + - Adds year-over-year growth percentage + - Test: given 24 months of silver data, gold has correct rolling averages + +4. test_bronze_cpi_food: + - Raw CPI CSV data ingested with all columns + - Non-null TIME_PERIOD and OBS_VALUE + - Test: correct schema + +5. test_silver_food_price_index: + - REGION codes decoded to state names + - INDEX codes decoded (10001=All groups CPI, 20001=Food and non-alcoholic beverages) + - OBS_VALUE renamed to cpi_index + - Test: codes correctly decoded + +6. test_gold_food_inflation: + - Calculates year-over-year CPI change percentage + - Test: given 8 quarters of data, YoY change is correct + +Write ONLY the tests. Do NOT implement the functions yet. +Use PySpark test fixtures with small DataFrames (5-10 rows each). +``` + +### 1.3 Review the tests + +Read through the generated tests: +- Do they capture your transformation logic? +- Are the test data realistic? +- Are edge cases covered (nulls, missing periods)? + +Edit or ask the agent to adjust before moving on. + +> **Starter Kit:** Copy-paste prompts are in `starter-kit/prompts/01-explore-data.md` and `starter-kit/prompts/02-write-tests.md` + +--- + +## Phase 2: Build Bronze Layer (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build retail trade bronze table using `starter-kit/prompts/03-build-bronze.md` +> - **Person B (Terminal):** Build CPI food bronze table (same prompt covers both) +> - **Person C (Databricks UI):** Monitor Unity Catalog for new tables appearing, prepare checkpoint fallback if APIs fail +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 2.1 Create the pipeline structure + +Ask the agent: + +``` +Create a Lakeflow Declarative Pipeline project with: +- src/bronze/abs_retail_trade.py - Ingest ABS Retail Trade API to bronze table +- src/bronze/abs_cpi_food.py - Ingest ABS CPI Food API to bronze table +- src/silver/ (empty for now) +- src/gold/ (empty for now) +- resources/pipeline.yml - Lakeflow pipeline definition +- databricks.yml - DABs deployment config +- tests/ (our tests from Phase 1) + +For bronze tables, use @dp.table decorator with data quality expectations: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") + +Use spark.read.csv() to fetch from the API URLs. +Unity Catalog target: workshop_vibe_coding. +``` + +### 2.2 Run the bronze tests + +``` +Run the bronze tests. Fix any failures. +``` + +Watch the agent iterate: read test output → fix code → re-run → repeat until green. + +> **Stuck?** If the API calls are failing (network issues, parsing errors), grab +> **Checkpoint 1A**: pre-loaded bronze tables are already in your schema. +> Tell the agent: "Use the pre-loaded tables in workshop_vibe_coding.checkpoints +> instead of calling the API. Copy them to our schema." 
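+
+For orientation before the agent starts, a bronze definition in the shape this prompt asks for might look like the sketch below. It is illustrative, not the checked-in solution: it assumes the `dp` decorators accept a `comment` keyword (mirroring the older `dlt` API named in Troubleshooting), and `spark` is the session the pipeline runtime provides.
+
+```python
+# src/bronze/abs_retail_trade.py (sketch)
+import databricks.declarative_pipelines as dp
+
+# Sample URL from Phase 1.1; widen the period range for a real run.
+ABS_RETAIL_CSV = (
+    "https://data.api.abs.gov.au/data/ABS,RT,1.0.0/"
+    "M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M"
+    "?format=csv&startPeriod=2024-01&endPeriod=2024-03"
+)
+
+
+@dp.table(comment="Raw ABS Retail Trade observations")
+@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL")
+@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL")
+def bronze_abs_retail_trade():
+    # `spark` is predefined on the pipeline's cluster runtime. If reading the
+    # API URL fails, fall back to the pre-loaded Checkpoint 1A tables instead.
+    return spark.read.csv(ABS_RETAIL_CSV, header=True, inferSchema=True)
+```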
+ +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/03-build-bronze.md` + +--- + +## Phase 3: Build Silver + Gold (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build silver retail_turnover and gold retail_summary +> - **Person B (Terminal):** Build silver food_price_index and gold food_inflation +> - **Person C (Databricks UI):** Monitor test output, review gold table data as it appears, prepare icebreaker answers +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 3.1 Build silver transformations + +``` +Implement the silver layer to make the silver tests pass: + +1. src/silver/retail_turnover.py + - @dp.table that reads from bronze retail trade + - Decode REGION codes to state names: 1=New South Wales, 2=Victoria, + 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, + 7=Northern Territory, 8=Australian Capital Territory + - Decode INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal, + 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, + 45=Household goods retailing + - Parse TIME_PERIOD to date, extract month/year/quarter + - Rename OBS_VALUE to turnover_millions + +2. src/silver/food_price_index.py + - @dp.table that reads from bronze CPI + - Decode REGION and INDEX codes + - Rename OBS_VALUE to cpi_index + +Run the silver tests after implementation. +``` + +### 3.2 Build gold materialized views + +``` +Implement the gold layer to make the gold tests pass: + +1. src/gold/retail_summary.py + - @dp.materialized_view + - Join with silver retail_turnover + - Add 3-month rolling average (turnover_3m_avg) + - Add 12-month rolling average (turnover_12m_avg) + - Add year-over-year growth percentage (yoy_growth_pct) + +2. src/gold/food_inflation.py + - @dp.materialized_view + - Calculate year-over-year CPI change percentage (yoy_change_pct) + +Run ALL tests (bronze + silver + gold). Everything should be green. +``` + +> **Stuck at 40 minutes?** Grab **Checkpoint 1B**: silver and gold tables +> pre-loaded in your schema. This ensures you have data for Lab 2. + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/04-build-silver-gold.md` + +### 3.3 Verify your data + +``` +Query the gold tables and show me: +1. Top 5 states by food retail turnover (latest month) +2. Year-over-year food price inflation by state (latest quarter) +3. 
The state with the highest retail growth rate
+```
+
+**This is where you check your ice breaker predictions!**
+
+---
+
+## Phase 4: Deploy with DABs (5 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Run `databricks bundle validate` and `databricks bundle deploy`
+> - **Person B (Databricks UI):** Verify pipeline appears in Workflows tab, tables visible in Unity Catalog
+> - **Person C:** Query gold tables to check icebreaker prediction answers
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+### 4.1 Create the pipeline definition
+
+```
+Create resources/pipeline.yml that defines a Lakeflow Declarative Pipeline:
+- Pipeline name: grocery-intelligence-<team_name>
+- Serverless: true
+- Libraries: all our src/ notebooks
+- Catalog: workshop_vibe_coding
+- Schema: <team_schema>
+
+And update databricks.yml with:
+- Dev target using our workshop catalog/schema
+- The pipeline resource
+```
+
+### 4.2 Validate and deploy
+
+```
+Validate the bundle: databricks bundle validate
+Deploy to dev: databricks bundle deploy -t dev
+Run the pipeline: databricks bundle run grocery-intelligence-<team_name> -t dev
+```
+
+### 4.3 Verify in the workspace
+
+Open the Databricks workspace UI and confirm:
+- [ ] Pipeline appears in the Workflows tab
+- [ ] Tables are visible in Unity Catalog
+- [ ] Data quality expectations are passing
+
+> **Stuck?** Grab **Checkpoint 1C**: complete pipeline code and databricks.yml.
+
+> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/05-deploy-pipeline.md`. Config template at `starter-kit/databricks.yml.template`.
+
+---
+
+## Pro Tips
+
+> **Steering the agent effectively:**
+>
+> - If the agent uses pandas instead of PySpark, say **"use PySpark, not pandas"** and add it to CLAUDE.md
+> - If the agent writes implementation before you have tests, say **"stop — write the tests first"**
+> - If the agent goes off track, say **"stop, let's go back to the failing tests"**
+> - Use **"show me the test output"** to see exactly what's failing
+> - Say **"explain what this function does"** to verify the agent's logic
+> - **Rotate the driver** every 20 min so everyone gets hands-on time
+> - Use skills: try **`/commit`** to commit your work with a good message
+
+---
+
+## Success Criteria
+
+- [ ] Tests written BEFORE implementation
+- [ ] All tests pass (bronze, silver, gold)
+- [ ] Pipeline uses `@dp.table` and `@dp.materialized_view` decorators
+- [ ] Data quality expectations with `@dp.expect()`
+- [ ] Gold tables have rolling averages and YoY metrics
+- [ ] Deployed as a Lakeflow pipeline via DABs
+- [ ] Can answer the ice breaker prediction questions from the data!
+
+---
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| **ABS API returns errors or times out** | The APIs can be slow. Grab **Checkpoint 1A** (pre-loaded bronze tables) and skip the API ingestion. |
+| **Agent uses pandas instead of PySpark** | Add to CLAUDE.md: "Always use PySpark, never pandas". Remind the agent explicitly. |
+| **Tests fail with SparkSession errors** | Ensure conftest.py creates a local SparkSession: `SparkSession.builder.master("local[*]").getOrCreate()` |
+| **@dp.table decorator not found** | Make sure the import is: `import dlt` (for older runtimes) or `import databricks.declarative_pipelines as dp` |
+| **Can't write to Unity Catalog** | Check your schema: `workshop_vibe_coding.<team_schema>`. Run `databricks auth status` to verify access. |
+| **Pipeline deploy fails** | Validate first: `databricks bundle validate`. Check databricks.yml for syntax errors. |
+| **Agent rewrites working code** | Say: "Don't change functions that are already passing tests. Only fix the failing ones." |
+| **Running out of time** | Grab the next checkpoint. No shame — the goal is to have data for Lab 2 and a demo at the end. |
+
+---
+
+## Reflection Questions (for Show & Tell)
+
+1. How much code did you write vs. the agent?
+2. Where did TDD help the most?
+3. What was your most interesting data insight?
+4. Were your ice breaker predictions right?
diff --git a/projects/coles-vibe-workshop/LAB-1-DE.md b/projects/coles-vibe-workshop/LAB-1-DE.md
new file mode 100644
index 0000000..29f047c
--- /dev/null
+++ b/projects/coles-vibe-workshop/LAB-1-DE.md
@@ -0,0 +1,326 @@
+# Lab 1: Build Your Data Pipeline (Data Engineering Track)
+
+**Duration:** 55 minutes
+**Goal:** Build, test, and deploy a Lakeflow Declarative Pipeline using an AI coding agent
+**Team Size:** 2–3 people
+
+> Complete `LAB-0-GETTING-STARTED.md` first, then return here.
+
+---
+
+## The Mission
+
+Your team needs a data pipeline that ingests public Australian retail data, transforms it through a medallion architecture, and produces analytics-ready gold tables. You'll use **TDD** to guide the agent and **Lakeflow Declarative Pipelines** to define your tables.
+
+**You will NOT write this code yourself.** You will direct an AI agent to build it.
+
+---
+
+## Your Data Sources
+
+| Source | API Endpoint | What It Contains |
+|--------|-------------|-----------------|
+| **ABS Retail Trade** | `https://data.api.abs.gov.au/data/ABS,RT,1.0.0/...` | Monthly retail turnover by state & industry since 2010 |
+| **ABS Consumer Price Index** | `https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/...` | Quarterly food price indices by state since 2010 |
+
+Both APIs return CSV data via the SDMX standard. The agent will handle the API calls and parsing.
+
+---
+
+## Phase 1: Write Tests First (15 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Run data exploration prompt from `starter-kit/prompts/01-explore-data.md`
+> - **Person B (Terminal):** Copy `starter-kit/CLAUDE.md` to project root, customize team name/schema
+> - **Person C (Databricks UI):** Open workspace, verify Unity Catalog schema exists, check checkpoint tables are accessible
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+> **Remember:** Tests are your spec. Write them BEFORE any implementation.
+
+### 1.1 Explore the data
+
+Ask the agent:
+
+```
+Fetch a sample of the ABS Retail Trade API:
+https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03
+
+Show me the columns, data types, and a few sample rows.
+Also fetch a sample of the ABS CPI Food API:
+https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4
+
+Show me the same for this one.
+```
+
+Understanding the raw data structure is critical before writing tests.
+
+### 1.2 Write your pipeline tests
+
+Ask the agent:
+
+```
+Create pytest tests for a Lakeflow Declarative Pipeline with these transformations:
+
+1. test_bronze_retail_trade:
+   - Raw CSV data is ingested with all original columns
+   - Non-null TIME_PERIOD and OBS_VALUE columns
+   - Test: given sample CSV rows, bronze table has correct schema
+
+2. test_silver_retail_turnover:
+   - REGION codes decoded to state names (1=NSW, 2=VIC, 3=QLD, etc.)
+ - INDUSTRY codes decoded to readable names (20=Food retailing, etc.) + - TIME_PERIOD parsed to proper date column + - OBS_VALUE renamed to turnover_millions + - Test: given bronze rows with codes, silver rows have readable names + +3. test_gold_retail_summary: + - Adds 3-month and 12-month rolling averages + - Adds year-over-year growth percentage + - Test: given 24 months of silver data, gold has correct rolling averages + +4. test_bronze_cpi_food: + - Raw CPI CSV data ingested with all columns + - Non-null TIME_PERIOD and OBS_VALUE + - Test: correct schema + +5. test_silver_food_price_index: + - REGION codes decoded to state names + - INDEX codes decoded (10001=All groups CPI, 20001=Food and non-alcoholic beverages) + - OBS_VALUE renamed to cpi_index + - Test: codes correctly decoded + +6. test_gold_food_inflation: + - Calculates year-over-year CPI change percentage + - Test: given 8 quarters of data, YoY change is correct + +Write ONLY the tests. Do NOT implement the functions yet. +Use PySpark test fixtures with small DataFrames (5-10 rows each). +``` + +### 1.3 Review the tests + +Read through the generated tests: +- Do they capture your transformation logic? +- Are the test data realistic? +- Are edge cases covered (nulls, missing periods)? + +Edit or ask the agent to adjust before moving on. + +> **Starter Kit:** Copy-paste prompts are in `starter-kit/prompts/01-explore-data.md` and `starter-kit/prompts/02-write-tests.md` + +--- + +## Phase 2: Build Bronze Layer (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build retail trade bronze table using `starter-kit/prompts/03-build-bronze.md` +> - **Person B (Terminal):** Build CPI food bronze table (same prompt covers both) +> - **Person C (Databricks UI):** Monitor Unity Catalog for new tables appearing, prepare checkpoint fallback if APIs fail +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 2.1 Create the pipeline structure + +Ask the agent: + +``` +Create a Lakeflow Declarative Pipeline project with: +- src/bronze/abs_retail_trade.py - Ingest ABS Retail Trade API to bronze table +- src/bronze/abs_cpi_food.py - Ingest ABS CPI Food API to bronze table +- src/silver/ (empty for now) +- src/gold/ (empty for now) +- resources/pipeline.yml - Lakeflow pipeline definition +- databricks.yml - DABs deployment config +- tests/ (our tests from Phase 1) + +For bronze tables, use @dp.table decorator with data quality expectations: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") + +Use spark.read.csv() to fetch from the API URLs. +Unity Catalog target: workshop_vibe_coding. +``` + +### 2.2 Run the bronze tests + +``` +Run the bronze tests. Fix any failures. +``` + +Watch the agent iterate: read test output → fix code → re-run → repeat until green. + +> **Stuck?** If the API calls are failing (network issues, parsing errors), grab +> **Checkpoint 1A**: pre-loaded bronze tables are already in your schema. +> Tell the agent: "Use the pre-loaded tables in workshop_vibe_coding.checkpoints +> instead of calling the API. Copy them to our schema." 
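+
+To sanity-check what the agent generates, a bronze test in Phase 1's shape might look like this sketch (illustrative only; it assumes a local Java-backed Spark, per the conftest.py guidance in Troubleshooting below):
+
+```python
+# tests/test_pipeline.py (sketch): your generated tests will be more thorough.
+import pytest
+from pyspark.sql import SparkSession
+
+
+@pytest.fixture(scope="session")
+def spark():
+    # Mirrors the conftest.py guidance in Troubleshooting.
+    return SparkSession.builder.master("local[*]").getOrCreate()
+
+
+def test_bronze_retail_trade_schema(spark):
+    # Small in-memory sample standing in for raw ABS CSV rows.
+    sample = spark.createDataFrame(
+        [("2024-01", 1234.5, "1", "20")],
+        ["TIME_PERIOD", "OBS_VALUE", "REGION", "INDUSTRY"],
+    )
+    # Bronze keeps the raw columns and must not contain null key fields.
+    assert {"TIME_PERIOD", "OBS_VALUE"} <= set(sample.columns)
+    assert sample.filter("TIME_PERIOD IS NULL OR OBS_VALUE IS NULL").count() == 0
+```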
+ +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/03-build-bronze.md` + +--- + +## Phase 3: Build Silver + Gold (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build silver retail_turnover and gold retail_summary +> - **Person B (Terminal):** Build silver food_price_index and gold food_inflation +> - **Person C (Databricks UI):** Monitor test output, review gold table data as it appears, prepare icebreaker answers +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 3.1 Build silver transformations + +``` +Implement the silver layer to make the silver tests pass: + +1. src/silver/retail_turnover.py + - @dp.table that reads from bronze retail trade + - Decode REGION codes to state names: 1=New South Wales, 2=Victoria, + 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, + 7=Northern Territory, 8=Australian Capital Territory + - Decode INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal, + 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, + 45=Household goods retailing + - Parse TIME_PERIOD to date, extract month/year/quarter + - Rename OBS_VALUE to turnover_millions + +2. src/silver/food_price_index.py + - @dp.table that reads from bronze CPI + - Decode REGION and INDEX codes + - Rename OBS_VALUE to cpi_index + +Run the silver tests after implementation. +``` + +### 3.2 Build gold materialized views + +``` +Implement the gold layer to make the gold tests pass: + +1. src/gold/retail_summary.py + - @dp.materialized_view + - Join with silver retail_turnover + - Add 3-month rolling average (turnover_3m_avg) + - Add 12-month rolling average (turnover_12m_avg) + - Add year-over-year growth percentage (yoy_growth_pct) + +2. src/gold/food_inflation.py + - @dp.materialized_view + - Calculate year-over-year CPI change percentage (yoy_change_pct) + +Run ALL tests (bronze + silver + gold). Everything should be green. +``` + +> **Stuck at 40 minutes?** Grab **Checkpoint 1B**: silver and gold tables +> pre-loaded in your schema. This ensures you have data for Lab 2. + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/04-build-silver-gold.md` + +### 3.3 Verify your data + +``` +Query the gold tables and show me: +1. Top 5 states by food retail turnover (latest month) +2. Year-over-year food price inflation by state (latest quarter) +3. 
The state with the highest retail growth rate
+```
+
+**This is where you check your ice breaker predictions!**
+
+---
+
+## Phase 4: Deploy with DABs (5 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Run `databricks bundle validate` and `databricks bundle deploy`
+> - **Person B (Databricks UI):** Verify pipeline appears in Workflows tab, tables visible in Unity Catalog
+> - **Person C:** Query gold tables to check icebreaker prediction answers
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+### 4.1 Create the pipeline definition
+
+```
+Create resources/pipeline.yml that defines a Lakeflow Declarative Pipeline:
+- Pipeline name: grocery-intelligence-<team_name>
+- Serverless: true
+- Libraries: all our src/ notebooks
+- Catalog: workshop_vibe_coding
+- Schema: <team_schema>
+
+And update databricks.yml with:
+- Dev target using our workshop catalog/schema
+- The pipeline resource
+```
+
+### 4.2 Validate and deploy
+
+```
+Validate the bundle: databricks bundle validate
+Deploy to dev: databricks bundle deploy -t dev
+Run the pipeline: databricks bundle run grocery-intelligence-<team_name> -t dev
+```
+
+### 4.3 Verify in the workspace
+
+Open the Databricks workspace UI and confirm:
+- [ ] Pipeline appears in the Workflows tab
+- [ ] Tables are visible in Unity Catalog
+- [ ] Data quality expectations are passing
+
+> **Stuck?** Grab **Checkpoint 1C**: complete pipeline code and databricks.yml.
+
+> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/05-deploy-pipeline.md`. Config template at `starter-kit/databricks.yml.template`.
+
+---
+
+## Pro Tips
+
+> **Steering the agent effectively:**
+>
+> - If the agent uses pandas instead of PySpark, say **"use PySpark, not pandas"** and add it to CLAUDE.md
+> - If the agent writes implementation before you have tests, say **"stop — write the tests first"**
+> - If the agent goes off track, say **"stop, let's go back to the failing tests"**
+> - Use **"show me the test output"** to see exactly what's failing
+> - Say **"explain what this function does"** to verify the agent's logic
+> - **Rotate the driver** every 20 min so everyone gets hands-on time
+> - Use skills: try **`/commit`** to commit your work with a good message
+
+---
+
+## Success Criteria
+
+- [ ] Tests written BEFORE implementation
+- [ ] All tests pass (bronze, silver, gold)
+- [ ] Pipeline uses `@dp.table` and `@dp.materialized_view` decorators
+- [ ] Data quality expectations with `@dp.expect()`
+- [ ] Gold tables have rolling averages and YoY metrics
+- [ ] Deployed as a Lakeflow pipeline via DABs
+- [ ] Can answer the ice breaker prediction questions from the data!
+
+---
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| **ABS API returns errors or times out** | The APIs can be slow. Grab **Checkpoint 1A** (pre-loaded bronze tables) and skip the API ingestion. |
+| **Agent uses pandas instead of PySpark** | Add to CLAUDE.md: "Always use PySpark, never pandas". Remind the agent explicitly. |
+| **Tests fail with SparkSession errors** | Ensure conftest.py creates a local SparkSession: `SparkSession.builder.master("local[*]").getOrCreate()` |
+| **@dp.table decorator not found** | Make sure the import is: `import dlt` (for older runtimes) or `import databricks.declarative_pipelines as dp` |
+| **Can't write to Unity Catalog** | Check your schema: `workshop_vibe_coding.<team_schema>`. Run `databricks auth status` to verify access. |
+| **Pipeline deploy fails** | Validate first: `databricks bundle validate`. 
Check databricks.yml for syntax errors. | +| **Agent rewrites working code** | Say: "Don't change functions that are already passing tests. Only fix the failing ones." | +| **Running out of time** | Grab the next checkpoint. No shame — the goal is to have data for Lab 2 and a demo at the end. | + +--- + +## Reflection Questions (for Show & Tell) + +1. How much code did you write vs. the agent? +2. Where did TDD help the most? +3. What was your most interesting data insight? +4. Were your ice breaker predictions right? diff --git a/projects/coles-vibe-workshop/LAB-1-DS.md b/projects/coles-vibe-workshop/LAB-1-DS.md new file mode 100644 index 0000000..019bef0 --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-1-DS.md @@ -0,0 +1,195 @@ +# Lab 1: Feature Engineering & MLflow (Data Science Track) + +**Duration:** 55 minutes +**Goal:** Build a feature engineering pipeline from gold tables and track experiments with MLflow +**Team Size:** 2–3 people + +> Complete `LAB-0-GETTING-STARTED.md` first, then return here. + +--- + +## The Mission + +Your gold tables are pre-loaded (from checkpoints). Build a feature engineering pipeline that creates lag features, seasonal indicators, and growth rates — then track everything with MLflow to prepare for model training in Lab 2. + +**You will NOT write this code yourself.** You will direct an AI agent to build it. + +--- + +## Phase 1: Explore Data + Write Tests (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Query gold tables, understand distributions and patterns +> - **Person B (Terminal):** Write pytest tests for feature engineering functions +> - **Person C (Databricks UI):** Create MLflow experiment in workspace, verify tracking works +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 1.1 Explore the gold tables + +Paste this into Claude Code: + +``` +Query these tables and show me a comprehensive analysis: + +1. workshop_vibe_coding.TEAM_SCHEMA.retail_summary: + - Row count, date range, distinct states, distinct industries + - Summary statistics for turnover_millions (min, max, mean, stddev) + - Top 5 state-industry combinations by average turnover + +2. workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: + - Row count, date range, distinct states + - Summary statistics for yoy_change_pct + - States with highest and lowest inflation + +Show me the results as tables. +``` + +### 1.2 Write feature engineering tests + +``` +Create pytest tests for feature engineering in tests/test_features.py. +Use the fixtures from tests/conftest.py. + +Write these tests: + +1. test_create_lag_features: + - Given 24 months of data for one state/industry + - Creates turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + - lag_1m equals previous month's value + - First 12 rows have null lag_12m (expected) + +2. test_create_seasonal_features: + - Adds month_of_year (1-12), quarter (1-4), is_december (boolean), is_q4 (boolean) + - is_december is True only for month 12 + +3. test_create_growth_features: + - Adds turnover_mom_growth and turnover_yoy_growth + - MoM growth = (current - previous) / previous * 100 + - YoY growth = (current - 12m_ago) / 12m_ago * 100 + +4. test_feature_table_schema: + - Output has all expected columns + - Key columns (state, industry, month, turnover_millions) have no nulls + +Write ONLY the tests. Do NOT implement the functions yet. +Use PySpark test fixtures with small DataFrames. 
+``` + +> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/ds/01-explore-gold.md` and `ds/02-feature-engineering.md`. Test stubs at `starter-kit/test_features.py`. + +--- + +## Phase 2: Build Feature Engineering Pipeline (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build lag and seasonal feature functions +> - **Person B (Terminal):** Build growth rate features and combine into feature table +> - **Person C (Databricks UI):** Run EDA — distribution plots, correlation analysis, trend identification +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 2.1 Build features + +``` +Create a feature engineering pipeline that reads from our gold tables and produces a feature table. + +1. Lag features from retail_summary: + - turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + - Use PySpark Window functions partitioned by state + industry, ordered by month + +2. Seasonal features: + - month_of_year (1-12), quarter (1-4), is_december (boolean), is_q4 (boolean) + - Extract from the month date column + +3. Growth rate features: + - turnover_mom_growth: month-over-month growth percentage + - turnover_yoy_growth: year-over-year growth percentage + - cpi_yoy_change: join with food_inflation_yoy on state + quarter + +4. Write the combined feature table to: + workshop_vibe_coding.TEAM_SCHEMA.retail_features + +Run tests after implementation. Handle nulls in lag features (first N rows will be null — filter them out in the final table). +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/ds/02-feature-engineering.md` + +> **Stuck at 35 minutes?** Grab **Checkpoint DS-1B**: pre-built feature table in your schema. + +--- + +## Phase 3: MLflow Experiment Tracking (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Log feature engineering run to MLflow with parameters, metrics, artifacts +> - **Person B (Terminal):** Create and log visualizations (correlation heatmap, trend plots) +> - **Person C (Databricks UI):** Review experiment in MLflow UI, compare runs, tag experiment +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 3.1 Track experiments + +``` +Set up MLflow experiment tracking for our feature engineering: + +1. Create an MLflow experiment named "grocery-features-TEAM_NAME" + +2. Log a run with: + - Parameters: number of features, date range, number of states + - Metrics: feature table row count, null percentage per feature + - Tags: team_name, track="data_science", phase="feature_engineering" + - Artifacts: save a feature summary CSV showing stats per state + +3. Create and log visualizations: + - A correlation heatmap of the numeric features (save as PNG) + - A time series plot of turnover trends for top 3 states (save as PNG) + +Use mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact(). +Show me the MLflow experiment URL when done. +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/ds/03-mlflow-experiment.md` + +--- + +## Phase 4: Verify + Prepare (5 min) + +- Verify feature table in Unity Catalog browser +- Review MLflow experiment UI — runs, metrics, artifacts +- Prepare for Show & Tell: What features did you create? What patterns did you find? 
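+
+For reference, the Phase 3 logging calls have roughly this shape. A minimal sketch, not checkpoint code: the experiment path, run name, and the `features_df`/`feature_cols` arguments are illustrative stand-ins for your own feature table and column list.
+
+```python
+# log_features.py (sketch)
+import mlflow
+
+mlflow.set_experiment("/Users/<you>/grocery-features-TEAM_NAME")  # adjust to your workspace path
+
+
+def log_feature_run(features_df, feature_cols, summary_csv="feature_summary.csv"):
+    """Log one feature-engineering run; features_df is the PySpark feature table."""
+    with mlflow.start_run(run_name="feature_engineering"):
+        mlflow.set_tags({"track": "data_science", "phase": "feature_engineering"})
+        mlflow.log_param("num_features", len(feature_cols))
+        mlflow.log_metric("feature_row_count", features_df.count())
+        mlflow.log_artifact(summary_csv)  # per-state stats CSV written beforehand
+```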
+ +--- + +## Troubleshooting + +| Problem | Solution | +|---------|----------| +| **MLflow tracking URI errors** | Check `DATABRICKS_HOST` env var is set: `echo $DATABRICKS_HOST` | +| **Feature table write permission** | Check UC schema access: `workshop_vibe_coding.TEAM_SCHEMA` | +| **Window function errors** | Verify orderBy uses a date column and partitionBy uses state + industry | +| **Pandas vs PySpark confusion** | Start with PySpark; collect to pandas only for small viz DataFrames | +| **Null values in lag features** | Expected for first N rows. Filter with `.dropna()` in the final table | +| **Agent uses pandas for features** | Say: "Use PySpark Window functions, not pandas. Check CLAUDE.md." | +| **MLflow experiment not found** | Set experiment explicitly: `mlflow.set_experiment("/Users/.../grocery-features")` | +| **Running out of time** | Grab Checkpoint DS-1B (feature table) or DS-1C (with MLflow experiment) | + +--- + +## Success Criteria + +- [ ] Feature table has lag, seasonal, and growth features +- [ ] All feature engineering tests pass +- [ ] MLflow experiment logged with parameters, metrics, artifacts +- [ ] At least one visualization logged as artifact +- [ ] Feature table accessible in Unity Catalog +- [ ] Ready for Show & Tell + +--- + +## Reflection Questions (for Show & Tell) + +1. Which features do you think will be most predictive for forecasting? +2. How did MLflow help organize your experimentation? +3. What additional data sources would improve your features? +4. Were there any surprising patterns in the data? diff --git a/projects/coles-vibe-workshop/LAB-2-ANALYST.md b/projects/coles-vibe-workshop/LAB-2-ANALYST.md new file mode 100644 index 0000000..812fad6 --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-2-ANALYST.md @@ -0,0 +1,539 @@ +# Lab 2: Build Your App & Dashboard (Analyst Track) + +**Duration:** 55 minutes +**Goal:** Build a web app with AI features, create a Genie space, and set up an AI/BI dashboard +**Team Size:** 2–3 people + +> Complete `LAB-0-GETTING-STARTED.md` and `LAB-1-ANALYST.md` first. + +--- + +## The Mission + +Your pipeline is producing analytics-ready gold tables. Now put that data to work. **Pick your tier:** + +### Tier 1: Quick — Embed & Ship (~20 min) +FastAPI backend + **embedded AI/BI dashboard via iframe**. Publish your dashboard, drop the embed URL into your app. Polished result, minimal frontend code. + +### Tier 2: Medium — Custom Charts (~35 min) +FastAPI + React with **Recharts** or **Observable Plot**. Query gold tables via API, render interactive visualisations in the browser. + +### Tier 3: Stretch — Full Platform (~55 min) +Full React app with custom viz + **embedded dashboard** + Genie space + NL query feature. The whole enchilada. + +**All tiers also include:** +- **A Genie space** — natural language Q&A for business users (2 min to set up!) +- **An AI/BI dashboard** — auto-generated visualisations (can be embedded in your app) + +All connect to the same gold tables from Lab 1. 
**You will direct the AI agent to build everything.**
+
+---
+
+## Architecture
+
+```
+┌───────────────┐      HTTP      ┌──────────────────────┐
+│               │  (htmx calls)  │                      │
+│    Browser    │───────────────▶│   FastAPI Backend    │
+│  (Tailwind    │◀───────────────│      (app.py)        │
+│   + htmx)     │  HTML / JSON   │                      │
+│               │                │  GET  /api/metrics   │
+└───────────────┘                │  POST /api/ask       │
+                                 │  GET  /              │
+                                 └──────┬───────┬───────┘
+                                        │       │
+                                  SQL   │       │  LLM
+                                queries │       │  prompt
+                                        ▼       ▼
+                                 ┌───────────┐ ┌──────────┐
+                                 │    SQL    │ │Databricks│
+                                 │ Warehouse │ │Foundation│
+                                 │           │ │Model API │
+                                 └───────────┘ └──────────┘
+
+   ┌─────────────────┐    ┌─────────────────────┐
+   │   Genie Space   │    │  AI/BI Dashboard    │
+   │  (NL queries    │    │  (Auto-generated    │
+   │   on gold data) │    │   visualizations)   │
+   └─────────────────┘    └─────────────────────┘
+```
+
+---
+
+## Phase 1: Write PRD + Tests (10 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Run PRD prompt from `starter-kit/prompts/06-write-prd.md`
+> - **Person B (Terminal):** Run API test prompt from `starter-kit/prompts/07-write-app-tests.md`
+> - **Person C (Databricks UI):** Start creating Genie space NOW (follow `starter-kit/prompts/10-setup-genie.md`) — this is a UI task that doesn't need the terminal
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/06-write-prd.md` and `07-write-app-tests.md`. Genie setup steps in `10-setup-genie.md`.
+
+### 1.1 Write your app PRD
+
+Tell the agent:
+
+```
+Create a new project called "grocery-app" with this PRD:
+
+## Grocery Intelligence App
+
+### Overview
+A web application that displays retail analytics from our gold tables
+and allows natural language querying of the data.
+
+### User Stories
+1. As a business user, I want to see retail turnover by state in a
+   clean dashboard with filters.
+2. As an analyst, I want to ask questions in plain English like
+   "which state had the highest food retail growth last year?"
+3. As an executive, I want to see food inflation trends at a glance.
+
+### Technical Requirements
+- Backend: FastAPI (Python)
+- Frontend: HTML + Tailwind CSS + htmx (no npm/node required)
+- Data: workshop_vibe_coding.<your_schema>.retail_summary
+  and workshop_vibe_coding.<your_schema>.food_inflation_yoy
+- AI feature: Natural language to SQL using Foundation Model API
+- Deployment: Databricks Apps
+
+### API Endpoints
+- GET /api/metrics?state=X&start_date=Y&end_date=Z
+- POST /api/ask {"question": "which state has highest growth?"}
+- GET /health → {"status": "ok"}
+- GET / (serves the frontend)
+
+Also create a CLAUDE.md with:
+- Use FastAPI with Pydantic models
+- Use databricks-sql-connector for database access
+- Frontend uses htmx for dynamic updates (no JS frameworks)
+- Write tests for all API endpoints using pytest + httpx
+- All SQL queries must be parameterized (no string concatenation)
+```
+
+### 1.2 Write API tests
+
+```
+Write pytest tests for the FastAPI backend:
+
+1. test_health: GET /health returns 200 with {"status": "ok"}
+
+2. test_get_metrics:
+   - Returns 200 with valid state and date range
+   - Returns list of records with: state, industry, month,
+     turnover_millions, yoy_growth_pct
+   - Returns 400 for invalid date format
+   - Returns empty list for non-existent state
+
+3. test_ask_question:
+   - Returns 200 with a valid question
+   - Response has: answer (string), sql_query (string)
+   - Returns 400 for empty question
+
+Write ONLY the tests. Do NOT implement yet.
+Use httpx AsyncClient with ASGITransport for testing.
+
+### 1.3 Create your Genie space (Person C)
+
+Genie is Databricks' natural language Q&A interface. Business users type questions in plain English; Genie generates the SQL and returns results with visualizations. No code required.
+
+In the Databricks workspace UI (not the terminal):
+
+1. Navigate to **Genie** in the left sidebar
+2. Click **New Genie Space**
+3. Configure:
+   - **Name:** "Grocery Intelligence — Team [your_team]"
+   - **SQL Warehouse:** Select the workshop warehouse
+   - **Tables:** Add your gold tables:
+     - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary`
+     - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy`
+   - **Instructions** (optional but recommended): Add context like
+     "This data contains Australian retail trade and food price data.
+     States are Australian states. Turnover is in millions AUD."
+
+> **Stuck?** Grab **Checkpoint 2B**: step-by-step Genie setup instructions with
+> recommended table descriptions and sample questions.
+
+---
+
+## Phase 2: Build Backend + Frontend (25 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Build FastAPI backend using `starter-kit/prompts/08-build-backend.md`
+> - **Person B (Terminal):** Build frontend using `starter-kit/prompts/09-build-frontend.md`
+> - **Person C (Databricks UI):** Create AI/BI dashboard — this is another UI task. Navigate to Dashboards → Create → AI/BI Dashboard. Use gold tables.
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/08-build-backend.md` and `09-build-frontend.md`.
+
+### 2.1 Implement the backend
+
+```
+Implement the FastAPI backend to pass all tests.
+
+For /api/metrics:
+- Query the retail_summary gold table with optional filters
+- Use databricks-sql-connector with parameterized queries
+- Return results as JSON
+
+For /api/ask:
+- Send the user's question to the Foundation Model API with
+  the table schema as context
+- The LLM generates a SQL query
+- Execute the SQL and return results with the generated query
+
+Connection details from environment variables:
+- DATABRICKS_HOST, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
+
+Run tests after implementation.
+```
+
+### 2.2 Build the frontend
+
+**Choose your approach based on your tier:**
+
+#### Option A: Embedded AI/BI Dashboard (Tier 1 — Quick)
+
+```
+Build a frontend in static/index.html with:
+
+1. Header: "Grocery Intelligence Platform" with your team name
+2. An iframe embedding our published AI/BI dashboard (I'll give you the URL)
+3. "Ask AI" section: text input + response area
+4. Use Tailwind CSS from CDN
+5. Clean, professional styling (dark header, white cards)
+```
+
+To get the embed URL:
+1. Person C creates and publishes the AI/BI dashboard during this phase (see 2.4 below)
+2. Click **Share** → **Embed** → copy the iframe code
+3. Paste the `<iframe>` snippet into your frontend.
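+
+Once you have the URL, the embed section of `static/index.html` might look like this — a sketch with a placeholder `src`; paste the actual embed URL you copied in step 2:
+
+```html
+<!-- Embedded AI/BI dashboard — replace src with the URL from Share → Embed -->
+<section class="bg-white rounded-lg shadow p-4">
+  <iframe
+    src="https://YOUR-WORKSPACE.cloud.databricks.com/embed/dashboardsv3/YOUR-DASHBOARD-ID"
+    width="100%"
+    height="600"
+    frameborder="0"
+  ></iframe>
+</section>
+```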
+
+> **Note:** Embedded dashboards display in light mode only. Users need Databricks credentials to view (or use service principal embedding for external users).
+
+### 3.4 Verify everything works
+
+- [ ] App loads in browser with dashboard
+- [ ] Filters work (state, date range)
+- [ ] "Ask AI" returns meaningful answers
+- [ ] Genie space answers natural language questions
+- [ ] AI/BI dashboard shows your visualizations
+
+> **Stuck?** Grab **Checkpoint 2B** (Genie setup), **Checkpoint 2C** (dashboard SQL),
+> or **Checkpoint 2D** (complete solution).
+
+---
+
+## Phase 4: Demo Prep (5 min)
+
+> **Team Tasks for This Phase**
+> - **All:** Prepare demo script, decide who presents what (pipeline, app, Genie, dashboard)
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+### 4.1 Prepare your demo
+
+You have 3 minutes to show:
+1. Your pipeline (quick: show the DAG or table list)
+2. Your app (show it running, use the AI feature)
+3. Your Genie space (ask it a question live)
+4. Your dashboard (show the key visualizations)
+5. One thing that surprised you
+
+> **Stuck?** Grab **Checkpoint 2D**: complete solution for reference.
+
+---
+
+## Using MCP During This Lab
+
+MCP servers are available to help you build faster:
+
+**Databricks Docs MCP** — search official documentation:
+```
+Search the Databricks docs for how to create a Genie space programmatically.
+```
+
+**Using skills** — speed up common tasks:
+```
+/commit — commit with a good message
+/test — run and fix failing tests
+```
+
+**AI Dev Kit skills** — pre-built Databricks patterns:
+```
+Use the databricks skills to help scaffold the app deployment.
+```
+
+---
+
+## Pro Tips
+
+> **Steering the agent effectively:**
+>
+> - **Divide and conquer:** One person works on the app, another sets up Genie, another does the dashboard
+> - If htmx isn't working, check that the htmx `<script>` tag is loaded in `<head>`
+> - For the AI query feature, include the table schema in the LLM system prompt
+> - If the agent writes raw SQL concatenation, say **"parameterize all queries to prevent injection"**
+> - Use **`/commit`** regularly to save your progress
+> - If the frontend looks wrong, say **"take a screenshot"** or describe what you see
+> - **Don't over-engineer** — a working app is better than a perfect app that's not done
+
+---
+
+## Success Criteria
+
+- [ ] FastAPI backend with tested endpoints
+- [ ] HTML frontend with Tailwind + htmx
+- [ ] AI-powered natural language query feature
+- [ ] Genie space created and answering questions
+- [ ] AI/BI dashboard with at least 3 visualizations
+- [ ] App deployed to Databricks Apps
+- [ ] All tests passing
+- [ ] Ready for 3-minute demo!
+
+---
+
+## Bonus Challenge: Build an MCP Server for Retail Analytics
+
+**Goal:** Wrap your Retail Analytics App as an MCP server so any AI agent (Claude Code, Cursor, ChatGPT) can query retail trends, food prices, and state comparisons through the MCP protocol.
+
+**Why this matters:** You've been _using_ MCP servers all day. Now you'll _build_ one — the same pattern Coles could use to expose internal data to every agent in the organisation.
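+
+The sections below spell out all four tools. As a preview, a single tool with the FastMCP library might look like this — a sketch only; the import path, environment variable names, and table name are assumptions, not the solution:
+
+```python
+# mcp_server.py — sketch of one MCP tool; adapt names to your schema.
+import os
+
+from databricks import sql
+from fastmcp import FastMCP
+
+mcp = FastMCP("retail-analytics")
+
+
+@mcp.tool()
+def get_retail_turnover(state: str, months: int = 12) -> list[dict]:
+    """Monthly retail turnover for a state, most recent months first."""
+    with sql.connect(
+        server_hostname=os.environ["DATABRICKS_HOST"],  # bare hostname assumed
+        http_path=os.environ["DATABRICKS_HTTP_PATH"],
+        access_token=os.environ["DATABRICKS_TOKEN"],
+    ) as conn, conn.cursor() as cursor:
+        cursor.execute(
+            # Assumes default catalog/schema on the warehouse; otherwise
+            # use the full Unity Catalog name for the gold table.
+            "SELECT state, industry, month, turnover_millions "
+            "FROM retail_summary WHERE state = :state "
+            "ORDER BY month DESC "
+            f"LIMIT {int(months)}",  # int() keeps the inline LIMIT safe
+            {"state": state},
+        )
+        cols = [c[0] for c in cursor.description]
+        return [dict(zip(cols, row)) for row in cursor.fetchall()]
+
+
+if __name__ == "__main__":
+    mcp.run()
+```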
+
+### What to build
+
+Your MCP server should expose these tools:
+
+| Tool | Description | Example call |
+|------|-------------|-------------|
+| `get_retail_turnover` | Monthly retail turnover by state and industry | `get_retail_turnover(state="NSW", months=12)` |
+| `get_food_inflation` | Year-over-year food CPI changes by category | `get_food_inflation(category="Dairy", since="2020-01")` |
+| `compare_states` | Side-by-side comparison of two states | `compare_states(state_a="VIC", state_b="QLD")` |
+| `get_top_insights` | Auto-generated summary of notable trends | `get_top_insights(limit=5)` |
+
+### How to build it
+
+Ask Claude to help you:
+
+```
+Build an MCP server that wraps our Retail Analytics API.
+Use the FastMCP Python library. Expose 4 tools:
+- get_retail_turnover: query retail_summary gold table
+- get_food_inflation: query food_inflation_yoy gold table
+- compare_states: compare two states side-by-side
+- get_top_insights: return the most notable trends
+
+Connect to Unity Catalog via databricks-sql-connector.
+Return structured JSON from each tool.
+```
+
+### Test it
+
+```bash
+# Run locally first
+uv run python mcp_server.py
+
+# Test with MCP Inspector
+npx @modelcontextprotocol/inspector python mcp_server.py
+
+# Or connect from Claude Code settings:
+# "mcpServers": { "retail": { "command": "python", "args": ["mcp_server.py"] } }
+```
+
+### Deploy to Databricks Apps (stretch)
+
+```bash
+databricks apps deploy --name "retail-mcp-${TEAM}" --source-code-path ./mcp-server/
+```
+
+Now any agent in the org can query Coles retail data via MCP — no custom integration code.
+
+---
+
+## Other Bonus Challenges (if time permits)
+
+1. **Add charts:** Use Chart.js (CDN) to add a revenue trend line chart to your app
+2. **Add FSANZ data:** If you included food recalls in your pipeline, add a recalls feed to the app
+3. **Custom skill:** Create a `/deploy` skill that bundles validate + deploy in one command
+4. **Cross-team Genie:** Add another team's gold tables to your Genie space for richer queries
+
+---
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| **htmx not loading** | Check the htmx `<script>` tag is in `<head>`. Check browser DevTools console. |
+| **CORS errors** | Add `CORSMiddleware` to FastAPI with `allow_origins=["*"]`. The agent sometimes forgets this. |
+| **AI generates invalid SQL** | Add the full table schema + column descriptions to the LLM system prompt. Include 2-3 example queries. |
+| **Can't create Genie space** | Check permissions: you need CREATE GENIE SPACE on the catalog. Ask David for help. |
+| **Dashboard queries are slow** | Gold tables are materialized views — they should be fast. Check your SQL warehouse is running. |
+| **App deploys but shows blank page** | Check static files are mounted: `app.mount("/static", StaticFiles(directory="static"))`. Check Databricks App logs. |
+| **databricks-sql-connector errors** | Ensure it's in requirements.txt. Check DATABRICKS_HOST, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN env vars. |
+| **Running out of time** | Prioritize: working app > Genie > dashboard. Grab checkpoints for what you can't finish. |
+
+---
+
+## Reflection Questions (for Demo)
+
+1. How did the PRD guide the agent's decisions?
+2. How does Genie compare to your custom AI query feature?
+3. What would you need to add to make this production-ready?
+4. Which approach (app vs. Genie vs. AI/BI dashboard) is most useful for your team?
diff --git a/projects/coles-vibe-workshop/LAB-2-APP-GENIE-DASHBOARD.md b/projects/coles-vibe-workshop/LAB-2-APP-GENIE-DASHBOARD.md new file mode 100644 index 0000000..db78fb3 --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-2-APP-GENIE-DASHBOARD.md @@ -0,0 +1,555 @@ +# Lab 2: Build Your App, Genie Space & Dashboard + +**Duration:** 55 minutes +**Goal:** Build a web app with AI features, create a Genie space, and set up an AI/BI dashboard +**Team Size:** 2–3 people + +--- + +## The Mission + +Your pipeline is producing analytics-ready gold tables. Now put that data to work. **Pick your tier:** + +### Tier 1: Quick — Embed & Ship (~20 min) +FastAPI backend + **embedded AI/BI dashboard via iframe**. Publish your dashboard, drop the embed URL into your app. Polished result, minimal frontend code. + +### Tier 2: Medium — Custom Charts (~35 min) +FastAPI + React with **Recharts** or **Observable Plot**. Query gold tables via API, render interactive visualisations in the browser. + +### Tier 3: Stretch — Full Platform (~55 min) +Full React app with custom viz + **embedded dashboard** + Genie space + NL query feature. The whole enchilada. + +**All tiers also include:** +- **A Genie space** — natural language Q&A for business users (2 min to set up!) +- **An AI/BI dashboard** — auto-generated visualisations (can be embedded in your app) + +All connect to the same gold tables from Lab 1. **You will direct the AI agent to build everything.** + +--- + +## Getting Started (2 minutes) + +1. Create the app directory: + ```bash + mkdir -p app/static + ``` + +2. Copy the app config template: + ```bash + cp starter-kit/app.yaml.template app/app.yaml + ``` + Replace `REPLACE_WITH_SQL_WAREHOUSE_PATH` with your SQL warehouse path. + +3. Your CLAUDE.md from Lab 1 already has the project context. The PRD prompt will add app-specific rules. + +> All prompts for this lab are in `starter-kit/prompts/06-11`. Each is exact copy-paste. +> +> **Key insight:** Person C can start the Genie space and AI/BI dashboard immediately in the Databricks UI while Persons A and B work in terminals. This saves significant time. + +--- + +## Architecture + +``` +┌───────────────┐ HTTP ┌──────────────────────┐ +│ │ (htmx calls) │ │ +│ Browser │───────────────▶│ FastAPI Backend │ +│ (Tailwind │◀───────────────│ (app.py) │ +│ + htmx) │ HTML / JSON │ │ +│ │ │ GET /api/metrics │ +└───────────────┘ │ POST /api/ask │ + │ GET / │ + └──────┬───────┬───────┘ + │ │ + SQL │ │ LLM + queries │ │ prompt + ▼ ▼ + ┌────────┐ ┌──────────┐ + │SQL │ │Databricks│ + │Warehou-│ │Foundation│ + │se │ │Model API │ + └────────┘ └──────────┘ + + ┌─────────────────┐ ┌─────────────────────┐ + │ Genie Space │ │ AI/BI Dashboard │ + │ (NL queries │ │ (Auto-generated │ + │ on gold data) │ │ visualizations) │ + └─────────────────┘ └─────────────────────┘ +``` + +--- + +## Phase 1: Write PRD + Tests (10 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Run PRD prompt from `starter-kit/prompts/06-write-prd.md` +> - **Person B (Terminal):** Run API test prompt from `starter-kit/prompts/07-write-app-tests.md` +> - **Person C (Databricks UI):** Start creating Genie space NOW (follow `starter-kit/prompts/10-setup-genie.md`) — this is a UI task that doesn't need the terminal +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/06-write-prd.md` and `07-write-app-tests.md`. Genie setup steps in `10-setup-genie.md`. 
+
+### 1.1 Write your app PRD
+
+Tell the agent:
+
+```
+Create a new project called "grocery-app" with this PRD:
+
+## Grocery Intelligence App
+
+### Overview
+A web application that displays retail analytics from our gold tables
+and allows natural language querying of the data.
+
+### User Stories
+1. As a business user, I want to see retail turnover by state in a
+   clean dashboard with filters.
+2. As an analyst, I want to ask questions in plain English like
+   "which state had the highest food retail growth last year?"
+3. As an executive, I want to see food inflation trends at a glance.
+
+### Technical Requirements
+- Backend: FastAPI (Python)
+- Frontend: HTML + Tailwind CSS + htmx (no npm/node required)
+- Data: workshop_vibe_coding.TEAM_SCHEMA.retail_summary
+  and workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy
+- AI feature: Natural language to SQL using Foundation Model API
+- Deployment: Databricks Apps
+
+### API Endpoints
+- GET /api/metrics?state=X&start_date=Y&end_date=Z
+- POST /api/ask {"question": "which state has highest growth?"}
+- GET /health → {"status": "ok"}
+- GET / (serves the frontend)
+
+Also create a CLAUDE.md with:
+- Use FastAPI with Pydantic models
+- Use databricks-sql-connector for database access
+- Frontend uses htmx for dynamic updates (no JS frameworks)
+- Write tests for all API endpoints using pytest + httpx
+- All SQL queries must be parameterized (no string concatenation)
+```
+
+### 1.2 Write API tests
+
+```
+Write pytest tests for the FastAPI backend:
+
+1. test_health: GET /health returns 200 with {"status": "ok"}
+
+2. test_get_metrics:
+   - Returns 200 with valid state and date range
+   - Returns list of records with: state, industry, month,
+     turnover_millions, yoy_growth_pct
+   - Returns 400 for invalid date format
+   - Returns empty list for non-existent state
+
+3. test_ask_question:
+   - Returns 200 with a valid question
+   - Response has: answer (string), sql_query (string)
+   - Returns 400 for empty question
+
+Write ONLY the tests. Do NOT implement yet.
+Use httpx AsyncClient with ASGITransport for testing.
+```
+
+### 1.3 Create your Genie space (Person C)
+
+Genie is Databricks' natural language Q&A interface. Business users type questions in plain English; Genie generates the SQL and returns results with visualizations. No code required.
+
+In the Databricks workspace UI (not the terminal):
+
+1. Navigate to **Genie** in the left sidebar
+2. Click **New Genie Space**
+3. Configure:
+   - **Name:** "Grocery Intelligence — Team [your_team]"
+   - **SQL Warehouse:** Select the workshop warehouse
+   - **Tables:** Add your gold tables:
+     - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary`
+     - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy`
+   - **Instructions** (optional but recommended): Add context like
+     "This data contains Australian retail trade and food price data.
+     States are Australian states. Turnover is in millions AUD."
+
+> **Stuck?** Grab **Checkpoint 2B**: step-by-step Genie setup instructions with
+> recommended table descriptions and sample questions.
+
+---
+
+## Phase 2: Build Backend + Frontend (25 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Build FastAPI backend using `starter-kit/prompts/08-build-backend.md`
+> - **Person B (Terminal):** Build frontend using `starter-kit/prompts/09-build-frontend.md`
+> - **Person C (Databricks UI):** Create AI/BI dashboard — this is another UI task. Navigate to Dashboards → Create → AI/BI Dashboard. Use gold tables.
+>
+> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.*
+
+> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/08-build-backend.md` and `09-build-frontend.md`.
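+
+The PRD and CLAUDE.md both demand parameterized SQL. When you review the backend the agent builds in 2.1 below, the `/api/metrics` query should look roughly like this — a sketch under assumptions: `databricks-sql-connector` named parameters, your `TEAM_SCHEMA` substituted in, and illustrative filter logic:
+
+```python
+# Sketch of the query pattern behind GET /api/metrics — never build SQL
+# by concatenating user input into the string.
+import os
+
+from databricks import sql
+
+
+def get_metrics(state: str, start_date: str, end_date: str) -> list[dict]:
+    with sql.connect(
+        server_hostname=os.environ["DATABRICKS_HOST"],
+        http_path=os.environ["DATABRICKS_HTTP_PATH"],
+        access_token=os.environ["DATABRICKS_TOKEN"],
+    ) as conn, conn.cursor() as cursor:
+        cursor.execute(
+            """
+            SELECT state, industry, month, turnover_millions, yoy_growth_pct
+            FROM workshop_vibe_coding.TEAM_SCHEMA.retail_summary
+            WHERE state = :state AND month BETWEEN :start_date AND :end_date
+            """,  # replace TEAM_SCHEMA with your schema
+            {"state": state, "start_date": start_date, "end_date": end_date},
+        )
+        cols = [c[0] for c in cursor.description]
+        return [dict(zip(cols, row)) for row in cursor.fetchall()]
+```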
+
+### 2.1 Implement the backend
+
+```
+Implement the FastAPI backend to pass all tests.
+
+For /api/metrics:
+- Query the retail_summary gold table with optional filters
+- Use databricks-sql-connector with parameterized queries
+- Return results as JSON
+
+For /api/ask:
+- Send the user's question to the Foundation Model API with
+  the table schema as context
+- The LLM generates a SQL query
+- Execute the SQL and return results with the generated query
+
+Connection details from environment variables:
+- DATABRICKS_HOST, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN
+
+Run tests after implementation.
+```
+
+### 2.2 Build the frontend
+
+**Choose your approach based on your tier:**
+
+#### Option A: Embedded AI/BI Dashboard (Tier 1 — Quick)
+
+```
+Build a frontend in static/index.html with:
+
+1. Header: "Grocery Intelligence Platform" with your team name
+2. An iframe embedding our published AI/BI dashboard (I'll give you the URL)
+3. "Ask AI" section: text input + response area
+4. Use Tailwind CSS from CDN
+5. Clean, professional styling (dark header, white cards)
+```
+
+To get the embed URL:
+1. Person C creates and publishes the AI/BI dashboard during this phase (see 2.4 below)
+2. Click **Share** → **Embed** → copy the iframe code
+3. Paste the `<iframe>` snippet into your frontend.
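+
+In `static/index.html`, the embed might look like the following — a sketch with a placeholder `src`; swap in the real embed URL you copied:
+
+```html
+<!-- Dashboard embed — replace src with the URL from Share → Embed -->
+<iframe
+  src="https://YOUR-WORKSPACE.cloud.databricks.com/embed/dashboardsv3/YOUR-DASHBOARD-ID"
+  width="100%" height="600" frameborder="0">
+</iframe>
+```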
+
+---
+
+## Pro Tips
+
+> **Steering the agent effectively:**
+>
+> - **Divide and conquer:** One person works on the app, another sets up Genie, another does the dashboard
+> - If htmx isn't working, check that the htmx `<script>` tag is loaded in `<head>`
+> - For the AI query feature, include the table schema in the LLM system prompt
+> - If the agent writes raw SQL concatenation, say **"parameterize all queries to prevent injection"**
+> - Use **`/commit`** regularly to save your progress
+> - If the frontend looks wrong, say **"take a screenshot"** or describe what you see
+> - **Don't over-engineer** — a working app is better than a perfect app that's not done
+
+---
+
+## Success Criteria
+
+- [ ] FastAPI backend with tested endpoints
+- [ ] HTML frontend with Tailwind + htmx
+- [ ] AI-powered natural language query feature
+- [ ] Genie space created and answering questions
+- [ ] AI/BI dashboard with at least 3 visualizations
+- [ ] App deployed to Databricks Apps
+- [ ] All tests passing
+- [ ] Ready for 3-minute demo!
+
+---
+
+## Bonus Challenge: Build an MCP Server for Retail Analytics
+
+**Goal:** Wrap your Retail Analytics App as an MCP server so any AI agent (Claude Code, Cursor, ChatGPT) can query retail trends, food prices, and state comparisons through the MCP protocol.
+
+**Why this matters:** You've been _using_ MCP servers all day. Now you'll _build_ one — the same pattern Coles could use to expose internal data to every agent in the organisation.
+
+### What to build
+
+Your MCP server should expose these tools:
+
+| Tool | Description | Example call |
+|------|-------------|-------------|
+| `get_retail_turnover` | Monthly retail turnover by state and industry | `get_retail_turnover(state="NSW", months=12)` |
+| `get_food_inflation` | Year-over-year food CPI changes by category | `get_food_inflation(category="Dairy", since="2020-01")` |
+| `compare_states` | Side-by-side comparison of two states | `compare_states(state_a="VIC", state_b="QLD")` |
+| `get_top_insights` | Auto-generated summary of notable trends | `get_top_insights(limit=5)` |
+
+### How to build it
+
+Ask Claude to help you:
+
+```
+Build an MCP server that wraps our Retail Analytics API.
+Use the FastMCP Python library. Expose 4 tools:
+- get_retail_turnover: query retail_summary gold table
+- get_food_inflation: query food_inflation_yoy gold table
+- compare_states: compare two states side-by-side
+- get_top_insights: return the most notable trends
+
+Connect to Unity Catalog via databricks-sql-connector.
+Return structured JSON from each tool.
+```
+
+### Test it
+
+```bash
+# Run locally first
+uv run python mcp_server.py
+
+# Test with MCP Inspector
+npx @modelcontextprotocol/inspector python mcp_server.py
+
+# Or connect from Claude Code settings:
+# "mcpServers": { "retail": { "command": "python", "args": ["mcp_server.py"] } }
+```
+
+### Deploy to Databricks Apps (stretch)
+
+```bash
+databricks apps deploy --name "retail-mcp-${TEAM}" --source-code-path ./mcp-server/
+```
+
+Now any agent in the org can query Coles retail data via MCP — no custom integration code.
+
+---
+
+## Other Bonus Challenges (if time permits)
+
+1. **Add charts:** Use Chart.js (CDN) to add a revenue trend line chart to your app
+2. **Add FSANZ data:** If you included food recalls in your pipeline, add a recalls feed to the app
+3. **Custom skill:** Create a `/deploy` skill that bundles validate + deploy in one command
+4. **Cross-team Genie:** Add another team's gold tables to your Genie space for richer queries
+
+---
+
+## Troubleshooting
+
+| Problem | Solution |
+|---------|----------|
+| **htmx not loading** | Check the htmx `<script>` tag is in `<head>`. Check browser DevTools console. |
+| **CORS errors** | Add `CORSMiddleware` to FastAPI with `allow_origins=["*"]`. The agent sometimes forgets this. |
+| **AI generates invalid SQL** | Add the full table schema + column descriptions to the LLM system prompt. Include 2-3 example queries. |
+| **Can't create Genie space** | Check permissions: you need CREATE GENIE SPACE on the catalog. Ask David for help. |
+| **Dashboard queries are slow** | Gold tables are materialized views — they should be fast. Check your SQL warehouse is running. |
+| **App deploys but shows blank page** | Check static files are mounted: `app.mount("/static", StaticFiles(directory="static"))`. Check Databricks App logs. |
+| **databricks-sql-connector errors** | Ensure it's in requirements.txt. Check DATABRICKS_HOST, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN env vars. |
+| **Running out of time** | Prioritize: working app > Genie > dashboard. Grab checkpoints for what you can't finish. |
+
+---
+
+## Reflection Questions (for Demo)
+
+1. How did the PRD guide the agent's decisions?
+2. How does Genie compare to your custom AI query feature?
+3. What would you need to add to make this production-ready?
+4. Which approach (app vs. Genie vs. AI/BI dashboard) is most useful for your team?
diff --git a/projects/coles-vibe-workshop/LAB-2-DE.md b/projects/coles-vibe-workshop/LAB-2-DE.md
new file mode 100644
index 0000000..9b8ea9e
--- /dev/null
+++ b/projects/coles-vibe-workshop/LAB-2-DE.md
@@ -0,0 +1,164 @@
+# Lab 2: Data Quality, New Sources & Scheduling (Data Engineering Track)
+
+**Duration:** 55 minutes
+**Goal:** Add data sources, implement data quality rules, and set up pipeline scheduling
+**Team Size:** 2–3 people
+
+> Complete `LAB-0-GETTING-STARTED.md` and `LAB-1-DE.md` first.
+
+---
+
+## The Mission
+
+Your pipeline handles ABS retail and CPI data. Now make it production-grade: add a new data source (FSANZ food recalls), implement data quality rules across all tables, and set up automated scheduling.
+
+---
+
+## Phase 1: Add FSANZ Food Recalls (20 min)
+
+> **Team Tasks for This Phase**
+> - **Person A (Terminal):** Write tests for FSANZ ingestion (schema, non-nulls, date parsing)
+> - **Person B (Terminal):** Build bronze + silver tables for FSANZ food recalls
+> - **Person C (Databricks UI):** Monitor pipeline, verify new tables appear in Unity Catalog
+>
+> *Teams of 2: Person A writes tests, Person B implements + monitors.*
+
+### 1.1 Write tests + build tables
+
+```
+Add a new data source — FSANZ food recalls:
+
+1. Write tests first:
+   - test_bronze_food_recalls_schema: has columns (product, category, issue, date, state, url)
+   - test_bronze_food_recalls_not_null: product and date are never null
+   - test_silver_food_recalls_dates: date strings parsed to proper DATE type
+   - test_silver_food_recalls_states: state names normalized to match our state list
+
+2. Build the tables:
+   - src/bronze/fsanz_food_recalls.py: @dp.table ingesting FSANZ data
+   - src/silver/food_recalls.py: @dp.table with cleaned dates, normalized states
+   - Data source: https://www.foodstandards.gov.au/food-recalls/recalls
+   - If website is blocked, read from: workshop_vibe_coding.checkpoints.fsanz_food_recalls
+
+3. Run tests after implementation.
+``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/de/05-add-data-sources.md` + +> **Stuck?** Grab **Checkpoint DE-2A**: FSANZ bronze + silver tables pre-loaded. + +--- + +## Phase 2: Data Quality Rules (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Add data quality expectations across all bronze/silver/gold tables +> - **Person B (Terminal):** Build cross-source gold view joining retail + CPI + recalls +> - **Person C (Databricks UI):** Verify expectations appear in pipeline UI, check quality metrics +> +> *Teams of 2: Person A does quality rules, Person B does cross-source view + monitoring.* + +### 2.1 Add quality expectations + +``` +Add data quality expectations across all pipeline tables: + +Bronze tables: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") +- @dp.expect("valid_date_range", "TIME_PERIOD >= '2010-01'") + +Silver tables: +- @dp.expect_or_fail("valid_state", "state IN ('New South Wales','Victoria','Queensland','South Australia','Western Australia','Tasmania','Northern Territory','Australian Capital Territory')") +- @dp.expect("valid_turnover", "turnover_millions > 0") + +Gold tables: +- @dp.expect("valid_yoy", "yoy_growth_pct BETWEEN -100 AND 500") +- @dp.expect("valid_rolling_avg", "turnover_3m_avg > 0") + +Run all tests to verify nothing breaks. +``` + +### 2.2 Build cross-source gold view + +``` +Create a cross-source analysis view: +- src/gold/grocery_insights.py: @dp.materialized_view +- Joins retail_summary + food_inflation_yoy + food_recalls (if available) +- Columns: state, month, turnover_millions, yoy_growth_pct, cpi_yoy_change, recall_count +- Join retail (monthly) with CPI (quarterly) on state + quarter +- Left join recalls on state + month (recall_count may be 0) +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/de/06-data-quality.md` + +--- + +## Phase 3: Scheduling + Deploy (10 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Add cron scheduling, validate, and deploy +> - **Person B (Databricks UI):** Verify pipeline schedule in Workflows tab +> - **Person C:** Test the full pipeline end-to-end +> +> *Teams of 2: Person A deploys, Person B verifies.* + +### 3.1 Add scheduling + +``` +Add cron scheduling to our pipeline: + +1. Update databricks.yml — add trigger: + trigger: + cron: + quartz_cron_expression: "0 0 6 * * ?" + timezone_id: "Australia/Sydney" + +2. Validate: databricks bundle validate +3. Deploy: databricks bundle deploy -t dev +4. Show me the pipeline schedule. +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/de/07-scheduling.md` + +--- + +## Phase 4: Verify + Prepare (5 min) + +- Verify full pipeline with all sources in the Workflows UI +- Check data quality expectations are visible and passing +- Prepare demo: Show pipeline DAG, quality metrics, cross-source view, schedule + +--- + +## Troubleshooting + +| Problem | Solution | +|---------|----------| +| **FSANZ website blocked** | Use checkpoint: `workshop_vibe_coding.checkpoints.fsanz_food_recalls` | +| **Web scraping errors** | Try RSS feed: `https://www.foodstandards.gov.au/rss/recalls` | +| **@dp.expect_or_fail stops pipeline** | Use `@dp.expect` (warn) first, upgrade after verifying | +| **Cross-source join duplicates** | Join on state + date range, not exact date (CPI is quarterly) | +| **Cron not triggering** | Check timezone_id and Quartz format. `?` is required for day-of-week. 
| +| **Pipeline too slow** | Use `spark.conf.set("spark.sql.shuffle.partitions", "4")` for workshop | +| **Running out of time** | Grab Checkpoint DE-2C (complete pipeline) or DE-2D (full solution) | + +--- + +## Success Criteria + +- [ ] FSANZ food recalls ingested (bronze + silver) +- [ ] Data quality expectations on all tables +- [ ] Cross-source gold view joining retail + CPI + recalls +- [ ] Pipeline scheduled with cron trigger +- [ ] All tests pass including new data source +- [ ] Ready for 3-minute demo + +--- + +## Reflection Questions (for Demo) + +1. How do data quality expectations change your confidence in the pipeline? +2. What challenges came with adding a new data source? +3. How would you monitor this pipeline in production? +4. What's the most interesting insight from the cross-source view? diff --git a/projects/coles-vibe-workshop/LAB-2-DS.md b/projects/coles-vibe-workshop/LAB-2-DS.md new file mode 100644 index 0000000..95f8430 --- /dev/null +++ b/projects/coles-vibe-workshop/LAB-2-DS.md @@ -0,0 +1,181 @@ +# Lab 2: Model Training, Serving & App (Data Science Track) + +**Duration:** 55 minutes +**Goal:** Train a forecasting model, register in MLflow, serve via Model Serving, build a prediction app +**Team Size:** 2–3 people + +> Complete `LAB-0-GETTING-STARTED.md` and `LAB-1-DS.md` first. + +--- + +## The Mission + +Your feature table is ready. Train a model to predict retail turnover, register it in MLflow, serve it as an endpoint, and build a simple web app that calls it. + +--- + +## Phase 1: Train Model + Write Tests (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Write tests for model training (input schema, positive predictions, R² > 0.5) +> - **Person B (Terminal):** Implement training script — read features, train model, log to MLflow +> - **Person C (Databricks UI):** Verify Model Serving permissions, check Model Registry is accessible +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 1.1 Write model tests + +``` +Write pytest tests for model training in tests/test_model.py: + +1. test_model_predictions_positive: all predictions are positive (turnover can't be negative) +2. test_model_r2_score: R² > 0.5 on test set +3. test_model_logged_to_mlflow: run has model artifact, R², MAE, RMSE metrics + +Write ONLY tests. Do NOT implement yet. +``` + +### 1.2 Train the model + +``` +Train a retail turnover forecasting model: + +1. Read feature table: workshop_vibe_coding.TEAM_SCHEMA.retail_features +2. Target: turnover_millions +3. Features: all lag, seasonal, and growth columns +4. Split: 80% train / 20% test (split by date, not random) +5. Try both RandomForestRegressor and XGBRegressor +6. Use mlflow.sklearn.autolog() or mlflow.xgboost.autolog() +7. Log both models, compare R², MAE, RMSE +8. Print which model performed better + +Run tests after training. +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/ds/04-train-model.md` + +--- + +## Phase 2: Register + Serve (20 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Register best model in MLflow Model Registry +> - **Person B (Terminal):** Create Model Serving endpoint, test with sample request +> - **Person C (Terminal):** Write tests for serving endpoint response schema +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 2.1 Register the model + +``` +Register the best model from our experiment: + +1. Find the best run (highest R²) from our MLflow experiment +2. 
Register as: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model +3. Add description: "Retail turnover forecasting model for Australian states" +4. Set alias "production" on the latest version +``` + +### 2.2 Create serving endpoint + +``` +Create a Model Serving endpoint: + +1. Name: grocery-forecast-TEAM_NAME +2. Model: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model (production alias) +3. Serverless endpoint +4. Wait for it to be ready (may take 5-10 minutes) +5. Test with a sample request: + {"dataframe_records": [{"turnover_lag_1m": 4500, "turnover_lag_3m": 4400, "turnover_lag_12m": 4200, "month_of_year": 3, "quarter": 1, "is_december": false, "is_q4": false, "turnover_mom_growth": 2.3, "turnover_yoy_growth": 7.1}]} + +Show me the prediction response. +``` + +> **Starter Kit:** Copy-paste prompts in `starter-kit/prompts/ds/05-register-model.md` and `ds/06-serve-model.md` + +> **Stuck at 25 minutes?** Grab **Checkpoint DS-2B**: pre-registered model + working endpoint. + +--- + +## Phase 3: Build Prediction App (15 min) + +> **Team Tasks for This Phase** +> - **Person A (Terminal):** Build FastAPI backend with `/predict` endpoint calling Model Serving +> - **Person B (Terminal):** Build HTML + Tailwind frontend with prediction form +> - **Person C (Databricks UI):** Test end-to-end flow, verify predictions make sense +> +> *Teams of 2: Person A takes Terminal tasks, Person B takes Terminal + UI tasks.* + +### 3.1 Build the app + +``` +Build a prediction web app: + +1. FastAPI backend (app/app.py): + - GET /health → {"status": "ok"} + - POST /predict: + Accepts: {"state": "New South Wales", "industry": "Food retailing", "month": "2024-06"} + Looks up latest features for that state/industry from the feature table + Calls our Model Serving endpoint + Returns: {"predicted_turnover": 4650.2, "state": "New South Wales", "industry": "Food retailing"} + - GET / → serves the frontend + +2. Frontend (app/static/index.html): + - Tailwind CSS + htmx (CDN, no build step) + - Header: "Grocery Forecast — TEAM_NAME" + - Form: dropdowns for State, Industry, Month + - Submit → calls POST /predict → shows result card + - Keep it simple and clean + +3. Create app/app.yaml and app/requirements.txt +4. Deploy: databricks apps deploy --name grocery-forecast-TEAM_NAME --source-code-path ./app/ +``` + +> **Starter Kit:** Copy-paste prompt in `starter-kit/prompts/ds/07-build-app.md` + +--- + +## Phase 4: Demo Prep (5 min) + +You have 3 minutes to show: +1. Your feature table (show in UC browser) +2. Your MLflow experiment (show runs, metrics, artifacts) +3. Your model (show registry + serving endpoint) +4. Your app (make a prediction live) +5. One thing that surprised you + +--- + +## Troubleshooting + +| Problem | Solution | +|---------|----------| +| **Model training OOM** | Reduce feature count or sample size. Collect to pandas for small datasets. | +| **Model Serving 404** | Endpoint takes 5-10 min to provision. Check status in UI or `databricks serving-endpoints get`. | +| **Model Serving auth error** | Check `DATABRICKS_TOKEN` env var is set. | +| **App can't reach endpoint** | Use workspace-internal URL, not external. | +| **Low R² score** | Try XGBoost instead of RandomForest, or add more features. | +| **Model Registry permission error** | Check CREATE MODEL permission on catalog. Ask facilitator. 
| +| **mlflow.register_model fails** | Use full UC path: `models:/workshop_vibe_coding.TEAM_SCHEMA.model_name` | +| **Running out of time** | Grab checkpoint DS-2C (complete app) or DS-2D (complete solution). | + +--- + +## Success Criteria + +- [ ] Model trained with R² > 0.5 +- [ ] Model registered in Unity Catalog Model Registry +- [ ] Model Serving endpoint responding to requests +- [ ] Prediction app deployed to Databricks Apps +- [ ] End-to-end flow: form → API → Model Serving → response +- [ ] All tests passing +- [ ] Ready for 3-minute demo + +--- + +## Reflection Questions (for Demo) + +1. How accurate were your model's predictions? +2. What was the hardest part — training, serving, or building the app? +3. How would you improve the model for production use? +4. How does MLflow help with model lifecycle management? diff --git a/projects/coles-vibe-workshop/VIBE-CODING-GUIDE.md b/projects/coles-vibe-workshop/VIBE-CODING-GUIDE.md new file mode 100644 index 0000000..e3939a9 --- /dev/null +++ b/projects/coles-vibe-workshop/VIBE-CODING-GUIDE.md @@ -0,0 +1,887 @@ +# Vibe Coding Best Practices + +**Coles x Databricks Vibe Coding Workshop -- 9:30 AM - 4:00 PM** +A Guide to Agentic Software Development with Claude + +--- + +# Block A: How to Think + +## 1. What is Vibe Coding? + +> "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists." +> +> -- Andrej Karpathy, February 2025 + +### Traditional vs Agentic Development + +| | Traditional Development | Agentic Development | +|---|---|---| +| **Who writes code** | Human writes every line of code | Human specifies intent via tests, specs, and CLAUDE.md | +| **Implementation** | Human handles all implementation details | Agent implements code to match the specification | +| **Debugging** | Human debugs, refactors, and iterates manually | Agent self-corrects by reading test failures | +| **Speed bottleneck** | Speed limited by typing and cognitive load | Speed limited by quality of direction, not typing | + +### Rule #1: Just Say What You Want + +Everything else builds on this: **you literally type what you want and it happens.** + +- **Want a project?** Tell Claude what you're building — it creates the CLAUDE.md, project structure, and config +- **Want behavior to change?** Say "from now on, do X" — it updates the rules +- **See a technique you like?** Paste it in — Claude reads it and adapts +- **Everything is markdown** — CLAUDE.md, skills, hooks, all of it + +That's **agentic engineering**. You shape the harness through conversation, not by hand-writing config files. You don't "write" a CLAUDE.md. You have a conversation that produces one. + +The progression of agentic engineering: + +1. **Say it** — have a conversation, get what you want +2. **Curate it** — save the good stuff as markdown files (CLAUDE.md, skills, hooks) so you don't repeat yourself +3. **Wire up tools** — increasingly just instructions to CLI commands, not heavyweight MCP servers + +### The "Brilliant New Employee" + +Think of Claude as a **brilliant but brand-new employee** who just joined your team today: + +1. **Deep Technical Skills** -- Knows Python, PySpark, SQL, FastAPI, React -- but has zero context on *your* norms or architecture decisions. +2. **Excels With Clear Direction** -- Given a precise spec, produces excellent code. Given a vague request, produces plausible-looking code that misses the mark. +3. **Needs Explicit Context** -- "Always PySpark, never pandas." 
"Snake_case for all columns." "Tests before code." Won't infer your team's standards. + +**Your job shifts from WRITING code to DIRECTING an exceptionally capable engineer. Invest time in specs upfront, not in writing code.** + +--- + +## 2. The Medallion Architecture -- What We're Building + +``` +┌──────────────────┐ ┌──────────────────┐ ┌──────────────────┐ +│ Bronze │ │ Silver │ │ Gold │ +│ Raw Ingestion │ ──> │ Cleaned & Enriched│ ──> │ Analytics-Ready │ +└──────────────────┘ └──────────────────┘ └──────────────────┘ +``` + +### Bronze (Raw) + +- Ingest data as-is from APIs and files +- No transformations applied +- Preserves original column names and types +- Acts as an immutable audit trail +- Use `@dp.table` with data quality expectations + +### Silver (Cleaned) + +- Decode codes to readable names +- Handle nulls and invalid rows +- Standardize date formats and types +- Rename columns to snake_case +- Use `@dp.table` reading from bronze + +### Gold (Aggregated) + +- Roll up and aggregate metrics +- Join across data sources +- Calculate KPIs (YoY growth, rolling averages) +- Ready for Genie and AI/BI dashboards +- Use `@dp.materialized_view` + +> **Why medallion?** Separation of concerns. Each layer has one job. Bugs are easy to trace. Silver and gold can be rebuilt from bronze at any time. + +--- + +## 3. CLAUDE.md -- Your Team's Operating Manual + +CLAUDE.md is a **persistent instruction file** that encodes your team's standards. The agent reads it automatically at the start of every session -- no need to repeat yourself. + +### Three Scope Levels + +| Level | Path | Purpose | +|---|---|---| +| **User-Level** | `~/.claude/CLAUDE.md` | Personal preferences, coding style, editor habits. Applies to all your projects. | +| **Repo-Level** | `./CLAUDE.md` | Team standards, tech stack, architecture decisions. Overrides user-level. Checked into git. | +| **Module-Level** | `./src/CLAUDE.md` | Module-specific rules (e.g., "all files here use `@dp.table`"). Overrides repo-level. | + +### Why It Works + +- **Self-correcting:** Agent re-reads it each session -- no drift +- **Searchable:** Agent can reference specific sections on demand +- **Maintainable:** One file to update, entire team benefits +- **Scoped:** Different rules for different parts of the codebase + +--- + +## 4. CLAUDE.md -- What to Include + +### Template Structure + +```markdown +# CLAUDE.md -- Grocery Intelligence Platform + +## Team +- Team Name: TEAM_NAME +- Schema: workshop_vibe_coding.TEAM_SCHEMA + +## Project +A data platform that ingests Australian retail +and food price data through a medallion architecture. + +## Tech Stack +- Data processing: PySpark (never pandas) +- Pipeline: Lakeflow Declarative Pipelines +- Web backend: FastAPI with Pydantic models +- Deployment: Databricks Asset Bundles + +## Data Standards +- Architecture: Bronze -> Silver -> Gold +- Date columns: YYYY-MM-DD, stored as DATE +- Naming: snake_case for all tables/columns + +## Rules +- Always use PySpark, never pandas +- Write tests BEFORE implementation +- Keep solutions minimal + +## Project Structure +(directory tree) + +## Data Sources +(table with API endpoints) + +## Code Mappings +(region codes, industry codes) +``` + +### Tips for Effective CLAUDE.md + +- **Keep it lean:** Aim for ~50 lines. CLAUDE.md consumes context tokens -- every line counts. +- **Be specific:** "Always PySpark, never pandas" beats "Use appropriate tools." Concrete rules get followed. 
+- **Include code mappings:** Region codes, industry codes, enum values -- anything the agent needs for lookups. + +> **Warning:** Don't dump everything in. CLAUDE.md is not documentation. It's a set of standing orders. If a rule isn't referenced often, it belongs in a separate doc the agent can read on demand. + +### What NOT to Include + +- Long explanations or tutorials +- Full API documentation (link to it instead) +- Implementation details that change frequently +- Anything already in your test suite + +--- + +## 5. TDD + Agents -- The Loop & Why It Works + +> **Rule #1 connection:** You say what should happen. The agent writes the test. Because it's Given/When/Then, you can read it back and verify it captured your intent. The agent writes the code AND the tests -- your job is to check that the tests match what you actually wanted. + +``` +STEP 1 STEP 2 STEP 3 STEP 4 +Human writes -> Agent implements -> Run & iterate -> Human reviews +the test code to pass (agent self- & accepts + corrects) +``` + +**The Test IS the Spec.** Unlike prose specs, a test is executable, unambiguous, and either passes or fails. + +### Why This Works + +1. **Unambiguous Specifications** -- A test says exactly what the code must do. No room for "I thought you meant..." The agent reads the assertion and knows the target. +2. **Self-Correcting Loop** -- The agent reads failure messages, understands what went wrong, and fixes automatically. No waiting for human review on each iteration. +3. **Guardrails** -- Existing passing tests prevent the agent from rewriting working code. New code must pass new tests *without* breaking old ones. +4. **Ratchet Effect** -- Each passing test constrains the next implementation. Quality only goes up, never down. Every green test is permanent progress. + +> **If you take one thing from today:** write the tests first. Always. + +--- + +## 6. Validation Patterns -- Proving the Work is Done + +### 6.1 The Core Principle + +Anthropic's #1 best practice: **"Include tests, screenshots, or expected outputs so Claude can check itself. This is the single highest-leverage thing you can do."** + +There is a critical difference between an agent that *thinks* it did the work and one that *proves* it did. Every prompt you write should include a way for the agent to verify its own output. If there is no verification step, the agent is guessing -- and you will not know until much later. + +### 6.2 Separate Tests from Implementation + +Never ask Claude to write tests AND code in the same prompt. When both are generated together, the agent writes tests that match the implementation rather than the requirements. The tests become a mirror of the code, not an independent check. + +**Do this -- two separate prompts:** + +```text +Prompt 1: +"Write tests for a function that decodes ABS region codes to state names. +Test: 1→NSW, 2→VIC, 99→ValueError. Do NOT implement the function." + +Prompt 2: +"Now implement decode_region() to make these tests pass. +Run pytest after implementation." +``` + +**Not this -- one combined prompt:** + +```text +"Write a function that decodes region codes and write tests for it." +``` + +When the test exists before the code, it acts as a genuine constraint. When both are written together, the test becomes a rubber stamp. + +### 6.3 Data Quality Expectations (Declarative Validation) + +For Lakeflow pipelines, embed validation directly in the table definition using `@dp.expect`. These checks run automatically every time the pipeline executes -- not a separate manual step. 
+ +```python +import dlt as dp + +@dp.table +@dp.expect("turnover_not_null", "turnover IS NOT NULL") +@dp.expect_or_fail("no_negative_values", "turnover >= 0") +@dp.expect("valid_state_code", "state_code BETWEEN 1 AND 8") +def bronze_retail_trade(): + return ( + spark.read.format("json") + .load("/Volumes/workshop/raw/abs_retail/") + ) +``` + +- `@dp.expect` logs violations but lets rows through (monitoring). +- `@dp.expect_or_fail` halts the pipeline on violation (hard stop). + +These are validation patterns baked into the pipeline itself. The agent does not need to remember to run checks -- they execute every time. + +### 6.4 API Schema Contracts + +Tests that validate the exact response structure, not just "returns 200". Every field, every type, every boundary. + +```python +def test_metrics_schema(client): + """Gold metrics endpoint returns correct structure and value ranges.""" + response = client.get("/api/metrics") + assert response.status_code == 200 + + data = response.json() + assert isinstance(data, list) + assert len(data) > 0 + + for record in data: + # Exact field set -- no extra, no missing + assert set(record.keys()) == {"state", "turnover", "yoy_growth"} + + # Type checks + assert isinstance(record["state"], str) + assert isinstance(record["turnover"], (int, float)) + assert isinstance(record["yoy_growth"], (int, float)) + + # Boundary checks -- catch nonsense values + assert record["state"] in ("NSW", "VIC", "QLD", "SA", "WA", "TAS", "NT", "ACT") + assert 0 < record["turnover"] < 50000 + assert -100 < record["yoy_growth"] < 500 +``` + +This test does not just check that the endpoint works. It proves the data contract is correct. If the agent adds a field, removes a field, or returns a value outside the expected range, this test fails immediately. + +### 6.5 Regression Testing (The Ratchet) + +After every new test passes, run the **full** test suite: + +```bash +pytest tests/ -x --no-header -q +``` + +This proves new code did not break existing functionality. The `-x` flag stops at the first failure -- you want to know immediately. + +Each green test is permanent progress. The suite is a ratchet: quality only goes up, never down. If the agent adds a new silver transformation and breaks a bronze test, the full suite catches it before you move on. + +### 6.6 Negative Case Testing + +Prove error handling works. Happy-path tests are necessary but insufficient -- you need to verify the system fails gracefully. + +```python +def test_invalid_date_returns_error(client): + """Invalid date format returns 400/422, not a server crash.""" + response = client.get("/api/metrics?start_date=2024/13/45") + assert response.status_code in (400, 422) + assert "error" in response.json() + + +def test_unknown_state_raises(spark): + """Decoding an invalid state code raises ValueError.""" + from src.silver.transforms import decode_region + import pytest + + with pytest.raises(ValueError, match="Unknown region code"): + decode_region(99) + + +def test_empty_dataframe_handled(spark): + """Silver transform handles empty input without crashing.""" + empty_df = spark.createDataFrame([], "date STRING, turnover DOUBLE, state STRING") + result = clean_retail_data(empty_df) + assert result.count() == 0 +``` + +Negative tests are how you prove the agent built something robust, not just something that works on the happy path. + +### 6.7 Diff-Based Validation + +After changes, ask the agent to show `git diff` and verify only the intended files changed. 
This catches hidden side effects -- the agent reformatting a file it should not have touched, or silently modifying a passing test.
+
+```text
+"Show me git diff. Only src/silver/transforms.py should have changed.
+If any other files were modified, revert them."
+```
+
+This is especially important after the agent has been running for several iterations. Context window pressure can cause it to make broader changes than intended.
+
+### 6.8 The Prompt Checklist
+
+Before every prompt, ask yourself these four questions:
+
+| Question | If "No"... |
+|---|---|
+| Can Claude run something to prove this worked? (test, command, query) | Add a verification step: "Run pytest after" or "Query the table and show row count" |
+| Is the success criterion binary? (pass/fail, not "looks good") | Rewrite with a concrete assertion: exact count, specific value, schema match |
+| Can it run in under 30 seconds? | Break into smaller pieces. Long-running validation wastes context on waiting. |
+| Is it separate from the implementation? (not self-grading) | Write the test first in a separate prompt. Never let the agent grade its own work. |
+
+If the answer to any of these is "no," stop and add a verification step before sending the prompt.
+
+---
+
+## 7. Context Windows -- Your Agent's RAM
+
+### What Are Tokens?
+
+A **token** is the basic unit of text for an LLM -- roughly 3/4 of a word. Everything the agent reads and writes is measured in tokens.
+
+| Content | Approximate Tokens |
+|---|---|
+| Your CLAUDE.md (~50 lines) | ~500 tokens |
+| 200-line Python file | ~2,500 tokens |
+| Claude's context window | 200,000 tokens |
+
+Context window = RAM. When it fills up, older context gets evicted and the agent may forget earlier decisions.
+
+### Four Strategies
+
+1. **Keep CLAUDE.md Lean** -- ~50 lines. Every line consumes tokens every session. Put verbose docs elsewhere.
+2. **Be Specific** -- "Fix the test in test_pipeline.py::test_silver" not "write some code." Target files, not vague asks.
+3. **Plan Before Building** -- Use `/plan` mode. Alignment upfront prevents expensive rework that wastes context.
+4. **New Sessions for New Tasks** -- Context is RAM. When switching tasks, start fresh with `/clear` or a new session.
+
+---
+
+# Block B: Lab 0 -- Hands-On Together
+
+## 8. Writing Your CLAUDE.md (10 min)
+
+Open your Coding Agents terminal and paste the prompt below. Replace `TEAM_SCHEMA` with your assigned schema name.
+
+```text
+Create a CLAUDE.md for a grocery intelligence platform.
+
+Tech stack: PySpark, Lakeflow Declarative Pipelines, FastAPI + React,
+Databricks Asset Bundles (DABs).
+
+Data sources: ABS SDMX APIs, FSANZ web scraping, ACCC PDF ingestion
+via UC Volumes.
+
+Unity Catalog namespace: workshop_vibe_coding.TEAM_SCHEMA.
+
+Include:
+- Team name and schema
+- Tech stack with explicit constraints (PySpark not pandas)
+- Data standards (medallion architecture, date formats, naming)
+- Rules (TDD, minimal solutions, don't change passing tests)
+- Project structure (src/bronze, src/silver, src/gold, tests/, app/)
+- Data sources table with API endpoints
+- Code mappings for region and industry codes
+```
+
+**This is the most important 10 minutes of the workshop -- everything else builds on this.**
+
+> **After generating:** Read through the CLAUDE.md and edit anything that doesn't match your team's preferences. Add your team's chosen angle (Retail Performance, Food Inflation, etc.) to the Project section.
+
+---
+
+## 9. Writing Your First Test (10 min)
+
+### Given -- When -- Then
+
+```python
+def test_bronze_ingest_retail(spark):
+    # GIVEN: Raw ABS retail trade data
+    raw_data = spark.createDataFrame([
+        ("2024-01-15", 1000, "NSW"),
+        ("2024-01-16", None, "VIC"),
+        ("invalid", 2000, "QLD"),
+        ("2024-01-17", 1500, "NSW"),
+        ("2024-01-18", 900, "VIC"),
+        ("2024-01-19", 1100, "QLD"),
+        ("2024-01-20", 1300, "NSW"),
+        ("2024-01-21", None, "SA"),
+        ("2024-01-22", 800, "WA"),
+        ("2024-01-23", 1200, "TAS"),
+    ], ["date", "price", "state"])
+
+    # WHEN: Clean function applied
+    result = clean_transactions(raw_data)
+
+    # THEN: Invalid rows removed (two null prices, one unparseable date)
+    assert result.count() == 7
+    assert result.filter("price IS NULL").count() == 0
+```
+
+### Key Principles
+
+- **Concrete Values** -- Use real data: actual state codes, valid dates, realistic dollar amounts. Not mocks or abstract placeholders.
+- **Small Datasets** -- 5-10 rows per test. Enough to cover happy path + edge cases. Easy to reason about.
+- **Descriptive Names** -- `test_bronze_ingest_retail` not `test_transform`. The name tells the agent what to build.
+- **Multiple Assertions** -- Check row count, column names, specific values, and null handling. Each assertion is a constraint.
+
+---
+
+## 10. Building Bronze Ingest (20 min)
+
+Now let the agent build the implementation to pass your test. Paste the prompt below into Claude Code.
+
+```text
+Read CLAUDE.md and tests/test_pipeline.py. Implement the bronze
+ingest function in src/bronze/ to make the failing test pass.
+
+Rules:
+- Use PySpark, not pandas
+- Use @dp.table decorator for Lakeflow Declarative Pipelines
+- Read data from the ABS Retail Trade SDMX API
+- Store raw data with original column names
+- Add _ingested_at timestamp column
+- Run pytest tests/test_pipeline.py -k "bronze" -x after implementation
+```
+
+### What to Watch For
+
+- **Does the agent read CLAUDE.md first?** If not, tell it: "Read CLAUDE.md first, then try again."
+- **Does it use PySpark?** If it reaches for pandas, steer it back.
+- **Does it run the test?** The agent should run pytest automatically and iterate until green.
+
+### Expected Outcome
+
+After ~15 minutes you should see:
+
+- [ ] `src/bronze/ingest.py` -- Bronze ingest function
+- [ ] `tests/test_pipeline.py` -- Passing bronze test
+- [ ] pytest output showing green
+
+> **Stuck?** Tell the agent: "Read the test failure message carefully and fix only the failing assertion."
+
+---
+
+## 11. Lab 0 Checkpoint (5 min)
+
+**Before moving on, verify your team has all three pieces in place:**
+
+### 1. CLAUDE.md
+
+- [ ] Team name and schema present
+- [ ] Tech stack with PySpark constraint
+- [ ] Data standards section
+- [ ] Rules section (TDD, minimal)
+- [ ] Project structure defined
+
+### 2. Passing Test
+
+- [ ] Test uses Given-When-Then structure
+- [ ] Concrete values (real state codes, dates)
+- [ ] Multiple assertions (count, nulls)
+- [ ] Descriptive test name
+- [ ] `pytest -k "bronze" -x` passes
+
+### 3. Bronze Ingest
+
+- [ ] Uses PySpark (not pandas)
+- [ ] Has @dp.table decorator
+- [ ] Reads from data source
+- [ ] Adds _ingested_at timestamp
+- [ ] All bronze tests green
+
+> **Falling behind?** No shame in using checkpoints. Tell the agent: "Copy the checkpoint tables from `workshop_vibe_coding.checkpoints` into my schema `workshop_vibe_coding.TEAM_SCHEMA`"
+
+**You've just completed the full TDD cycle: spec (CLAUDE.md) -> test -> implementation -> green. This is the pattern for the rest of the day.**
+
+---
+
+# Block C: Tools for the Labs
+
+## 12.
---

## 11. Lab 0 Checkpoint (5 min)

**Before moving on, verify your team has all three pieces in place:**

### 1. CLAUDE.md

- [ ] Team name and schema present
- [ ] Tech stack with PySpark constraint
- [ ] Data standards section
- [ ] Rules section (TDD, minimal)
- [ ] Project structure defined

### 2. Passing Test

- [ ] Test uses Given-When-Then structure
- [ ] Concrete values (real state codes, dates)
- [ ] Multiple assertions (count, nulls)
- [ ] Descriptive test name
- [ ] `pytest -k "bronze" -x` passes

### 3. Bronze Ingest

- [ ] Uses PySpark (not pandas)
- [ ] Has @dp.table decorator
- [ ] Reads from data source
- [ ] Adds _ingested_at timestamp
- [ ] All bronze tests green

> **Falling behind?** No shame in using checkpoints. Tell the agent: "Copy the checkpoint tables from `workshop_vibe_coding.checkpoints` into my schema `workshop_vibe_coding.<your_schema>`"

**You've just completed the full TDD cycle: spec (CLAUDE.md) -> test -> implementation -> green. This is the pattern for the rest of the day.**

---

# Block C: Tools for the Labs

## 12. Skills -- Slash Commands

Skills are **reusable agent capabilities** triggered by slash commands. They encode domain knowledge and multi-step workflows into a single invocation.

### How Skills Work

A skill is a Markdown file that defines a multi-step workflow. When you type `/commit`, the agent:

1. Reads git status and diff
2. Analyzes changes and drafts a commit message
3. Stages files and creates the commit
4. Runs git status to verify success

### Built-In Examples

```bash
# Common skills
/commit # Smart git commit with message
/review-pr # Review a pull request
/test # Run tests intelligently

# Custom skills you can create
/deploy-dab # Validate + deploy DABs bundle
/check-data # Query tables, verify counts
```

### Why Skills Matter

- **Automate Repetitive Patterns** -- Workflows you do 10x/day become one command. No re-explaining each time.
- **Encode Team Conventions** -- Your team's commit message format, deploy steps, and review checklist -- codified once, used by everyone.
- **Reduce Context Usage** -- A skill runs a pre-defined workflow without you needing to type out multi-step instructions each time.

> **Workshop tip:** You can create custom skills during the labs. Think about which multi-step workflows you repeat most often.

---

## 13. MCP -- "USB-C for AI"

**Model Context Protocol (MCP)** is a standard protocol for connecting AI agents to external tools and data sources. One protocol, every tool connects -- like USB-C for AI.

### Three Types of MCP Servers

**Managed (Built-In)**
Zero config -- pre-integrated with Databricks:
- Unity Catalog tables & volumes
- Vector Search indexes
- Genie spaces (NL -> SQL)
- DBSQL warehouses

**External (via Proxies)**
Community & vendor integrations:
- GitHub (PRs, issues, repos)
- Slack (messages, channels)
- Glean (internal search)
- JIRA (tickets, sprints)
- Databricks Docs

**Custom (Org-Specific)**
Build your own for internal tools:
- Wrap internal REST APIs
- Data quality workflows
- Monitoring & alerting
- Host on Databricks Apps

> **Key benefit:** Agents access ANY tool through a standard protocol. Credentials stay secure in Unity Catalog -- the agent never sees raw tokens or passwords.

---

## 14. Extending Claude Code for Your Organisation

Out of the box, Claude Code is a general-purpose agent. The real power comes from **customising it for your team's specific tools, data, and workflows**. Three mechanisms make this possible.

### Custom Skills -- Your Team's Playbooks

Skills are Markdown files that encode multi-step workflows. Any team can create them:

```markdown
# /deploy-pipeline
1. Run `databricks bundle validate`
2. Fix any validation errors
3. Run `databricks bundle deploy -t dev`
4. Run the pipeline and verify row counts
5. Commit with message "deploy: pipeline updated"
```

Save this as a skill, and every team member gets a one-command deploy workflow. No tribal knowledge required.

**Where to start:** Identify the 3 workflows your team repeats most often. Write them as skills. Share them in a team repo.

> **How powerful is a markdown skill?** [deathbyclawd.com](https://deathbyclawd.com/) tracks which SaaS products can be replaced by a single markdown skill.

### Unity Catalog MCP Servers -- Connecting to Your Data

Databricks provides **managed MCP servers** that connect Claude Code directly to your data platform:

| MCP Server | What It Does |
|---|---|
| **Unity Catalog** | Browse tables, schemas, and volumes. 
Agent can read metadata without SQL. | +| **Genie Spaces** | Natural language queries on your gold tables, powered by your Genie configuration. | +| **SQL Warehouses** | Execute SQL queries directly. Agent reads results and iterates. | +| **Vector Search** | Semantic search over documents and embeddings stored in UC. | + +These run inside Databricks -- credentials are managed by Unity Catalog, not stored in config files. The agent never sees raw tokens. + +**Custom MCP servers** can wrap any REST API your org uses. Host them on Databricks Apps and register them in Unity Catalog for secure, governed access. + +### Plugin Marketplaces -- Scaling Across Teams + +As your organisation builds skills, agents, and MCP integrations, you need a way to **share and discover** them. A plugin marketplace solves this: + +- **Plugins** group related skills and agents (e.g., a "Data Quality" plugin with skills for profiling, validation, and monitoring) +- **A marketplace** lets teams browse, install, and contribute plugins +- **Versioning** ensures updates don't break existing workflows + +This is how you go from one team using Claude Code effectively to an entire organisation benefiting from shared automation. The pattern: start with 3-5 skills for your team, package them as a plugin, publish to a shared marketplace. + +> **The key insight:** Claude Code is not a fixed product -- it's an extensible platform. The teams that invest in customisation get compounding returns as every new skill makes the agent more capable for everyone. + +--- + +## 15. Genie & AI/BI Dashboards + +### Genie: Natural Language on Your Data + +Business users ask questions in plain English. Genie generates SQL, runs it, and returns results with visualizations. + +```sql +-- User asks: +-- "Which state had the highest food retail turnover in 2024?" + +-- Genie generates: +SELECT state, + SUM(turnover_millions) AS total +FROM gold.retail_summary +WHERE year = 2024 +GROUP BY state +ORDER BY total DESC +LIMIT 1 +``` + +**How to set up:** Point Genie at your gold tables, add column descriptions, and provide example queries in the instructions. + +### AI/BI Dashboards + +Auto-generated dashboards that understand your data. Describe a visualization in natural language and it generates the chart. + +- "Show monthly revenue by state as a line chart" +- "Compare food CPI across states as a bar chart" +- "Display year-over-year growth as a heatmap" + +**How they complement each other:** +- **Dashboards** = recurring views, standard KPIs, shared with stakeholders +- **Genie** = ad-hoc questions, exploration, self-serve analytics + +Both feed from the same **gold tables** -- the output of your data pipeline. Good gold tables = good Genie + dashboards. + +--- + +# Block D: Reference + +## 16. Anthropic's Two Core Practices + +### #1: Give Claude a Way to Verify Its Work + +The agent needs a **feedback loop** -- a way to check if its output is correct without asking you. + +- **Tests:** pytest, unit tests, integration tests +- **Screenshots:** visual verification of UI changes +- **Expected outputs:** "this query should return 42" +- **Diff comparisons:** "output should match this template" + +> **Self-correcting:** Agent reads failure, fixes, re-runs. Scales without human review bottleneck for each iteration. + +### #2: Explore First, Plan, Then Code + +Don't ask the agent to implement immediately. Follow a three-phase approach: + +1. **Explore:** Read relevant code, understand existing patterns, check what's already there +2. 
**Plan:** Use `/plan` mode or `ultrathink` to align on approach before writing code
3. **Code:** Implement with full context -- agent knows what exists, what to change, what to preserve

> **Why this works:** Prevents over-engineering and hallucinations. The agent builds on what exists rather than speculating about what might exist.

---

## 17. Steering Your Agent -- Pitfalls & Phrases

### Common Pitfalls & Fixes

**Overengineering**
- **Symptom:** Asked for one function, got an entire module with abstract base classes.
- **Fix:** "Keep it minimal. One function, not a framework. No extra features."

**Hallucinations**
- **Symptom:** Agent used a column name that doesn't exist in the data.
- **Fix:** "Never speculate about code you have not opened. Read the file first."

**Going Off-Rails**
- **Symptom:** Looked away for 5 minutes, agent rewrote half the project.
- **Fix:** Check in every 2-3 tool calls. "Don't change functions that already pass tests."

### Steering Phrases Cheatsheet

| When the agent... | Say this |
|---|---|
| **Uses pandas** | "Rewrite using PySpark. We never use pandas." |
| **Ignores CLAUDE.md** | "Read CLAUDE.md first, then try again." |
| **Rewrites working code** | "Don't change functions that already pass tests." |
| **Writes code before tests** | "Stop. Write the tests first, then implement." |
| **Too complex** | "Simplify. I just need [specific thing]." |
| **Stuck in a loop** | "Stop. Let's try a different approach." |
| **Speculates** | "Read the file first. Don't guess." |
| **Hasn't planned** | "Stop. Use /plan first. Interview me about what I need." |
| **Claims it works** | "Prove it. Show me the git diff. Grill me on these changes." |
| **First attempt is mediocre** | "Knowing everything you know now, scrap this and implement the elegant solution." |

**Healthy cadence:** Agent makes 2-3 tool calls -> You review -> Steer if needed -> Continue

### Commit as Checkpoints

Commit every 15-20 minutes during the labs. Commits are your safety net:

- **Before big changes:** `/commit` to save what works
- **If the agent goes off-rails:** Press `Esc` twice to cancel, then `git checkout` to revert (see the sketch below)
- **Why it matters:** You can always get back to a known-good state. Don't let 30 minutes of work ride on a single uncommitted session.
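When you do need to unwind a bad run, plain git is enough. A minimal recovery sequence (file path and commit reference are illustrative):

```bash
# See what the agent actually touched
git status
git diff --stat

# Throw away unwanted edits to specific files
git checkout -- src/silver/transforms.py

# Or return to the last known-good commit entirely
git log --oneline -5
git reset --hard <last-good-commit>
```

`git reset --hard` discards all uncommitted work, which is exactly why the 15-20 minute commit cadence matters.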
---

## 18. Workshop Quick Reference

### Key Commands

```bash
# Check environment
databricks auth status
claude --version

# Run tests (always use -x for fail-fast)
pytest tests/ -x
pytest tests/test_pipeline.py -k "bronze" -x
pytest tests/test_pipeline.py -k "silver" -x
pytest tests/test_app.py -x

# Deploy with Databricks Asset Bundles
databricks bundle validate
databricks bundle deploy -t dev
databricks bundle run grocery-intelligence-TEAM -t dev

# Deploy web app
cd app && databricks apps deploy \
  --name grocery-app-TEAM \
  --source-code-path ./
```

### Key Files

| File | Purpose |
|---|---|
| `CLAUDE.md` | Agent instructions (team standards) |
| `tests/conftest.py` | PySpark test fixtures |
| `src/bronze/` | Raw data ingestion |
| `src/silver/` | Cleaned transformations |
| `src/gold/` | Aggregated analytics |
| `app/app.py` | FastAPI web application |
| `databricks.yml` | DABs deployment config |

### Workshop Schedule

| Time | Session |
|---|---|
| 9:30 AM | Block A: How to Think (theory) |
| 10:45 AM | Lab 0: Guided Hands-On (all together) |
| 11:30 AM | Block C: Tools for the Labs (theory) |
| 12:00 PM | Lab 1: Track-specific lab |
| 1:00 PM | Lunch break |
| 2:00 PM | Lab 2: Track-specific lab |
| 3:30 PM | Demos & wrap-up |

### Checkpoint Recovery

No shame in using checkpoints -- the goal is a working demo!

Tell the agent:

```text
Copy the checkpoint tables from workshop_vibe_coding.checkpoints
into my schema workshop_vibe_coding.<your_schema>
```

---

## 19. Advanced -- Self-Improving CLAUDE.md

Your CLAUDE.md is static by default. You write it once, the agent follows it, and mistakes repeat across sessions. This technique turns it into a **living document** that gets smarter every time you use it.

### The Reflection Prompt

When Claude makes a mistake, say:

```text
Reflect on this mistake. Abstract and generalize the learning.
Write it to CLAUDE.md.
```

That one sentence triggers the agent to:

1. **Reflect** -- Analyse what went wrong while full context is still in memory
2. **Abstract** -- Extract the general pattern, not the specific instance
3. **Generalize** -- Create a reusable rule for similar future situations
4. **Document** -- Write it to CLAUDE.md following the meta-rules below

### The Session Review Prompt

At the end of a session, ask:

```text
Review this session. Did you make any mistakes I should capture?
Did I ask you to do anything three or more times that you should
do automatically? Draft the rules and add them to CLAUDE.md.
```

The agent identifies patterns like *"You corrected me on PySpark three times"* and drafts a standing rule. Over time, corrections you give once become permanent.

### Meta-Rules -- Teaching the Agent to Write Good Rules

Add this section to your CLAUDE.md to control how rules get written:

```markdown
## META -- Maintaining This Document

When adding rules to this file:
1. Use absolute directives -- Start with "NEVER" or "ALWAYS"
2. Lead with why -- Explain the problem before the solution (1-3 bullets max)
3. Be concrete -- Include actual commands or code, not vague guidance
4. One rule per entry -- No compound rules that mix concerns
5. Keep it under 80 lines total -- If it's longer, move detail to a separate doc
```

**Why meta-rules matter:** Without them, the agent writes verbose paragraphs with examples and caveats. With them, every rule is terse and actionable. CLAUDE.md stays lean.
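Applied to this workshop's stack, a captured rule might look like this (hypothetical entry):

```markdown
- NEVER use pandas for transformations; ALWAYS use PySpark.
  - Why: pandas collects data onto the driver and breaks at our table sizes.
  - Do: df.groupBy("state").agg(sum("turnover_millions")), not df.toPandas().
```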
+ +### Anti-Bloat: When NOT to Add a Rule + +Not every correction deserves a permanent rule. Before adding to CLAUDE.md, ask: + +| Question | If "No"... | +|---|---| +| Will this mistake happen again without the rule? | Don't add it -- one-off fixes don't need rules | +| Is this specific to my project, not general knowledge? | Don't add it -- the agent already knows general best practices | +| Does it affect more than one file or function? | Don't add it -- local fixes stay local | + +### The Compounding Effect + +- **Session 1** -- Agent makes basic mistakes. You capture 3 rules (5 seconds each). +- **Session 2** -- Those mistakes vanish. New, more sophisticated ones surface. +- **Session 3** -- You're discussing architecture instead of fighting over import order. + +Each rule is permanent progress. Like your test suite, CLAUDE.md is a ratchet -- quality only goes up. + +> **Monday morning action #2:** After your first real session with Claude Code, run the session review prompt. Capture the top 3 corrections as rules. Within a week, your CLAUDE.md will be more valuable than any style guide your team has written. + +--- + +## 20. Key Takeaways + +1. **Write clear specs** -- CLAUDE.md, tests, and PRDs define what "done" looks like. The agent can only be as good as your specification. +2. **Use tests as executable specs (TDD)** -- Tests are unambiguous. They pass or they fail. No interpretation required. Write them first, always. +3. **Give agents verification mechanisms** -- Tests, screenshots, expected outputs. The agent needs a feedback loop to self-correct without human intervention. +4. **Manage context windows** -- Keep CLAUDE.md lean (~50 lines), be specific with requests, use new sessions for new tasks. +5. **Steer early and often** -- Review every 2-3 tool calls. Short feedback loops produce better results than long unsupervised runs. +6. **Make CLAUDE.md self-improving** -- Capture mistakes as rules. Run session reviews. Your agent gets smarter every day without extra effort. + +**Think of yourself as a director, not a typist.** + +> **Monday morning action:** Create a CLAUDE.md for your team's main repository. Start with tech stack, coding standards, and 5 rules. After your first real session, run the reflection prompt and capture what you learned. Iterate from there. + +> **The discipline transfers:** These practices work with Claude Code, Cursor, Windsurf, GitHub Copilot, and any agentic coding tool. The discipline is the differentiator, not the tool. diff --git a/projects/coles-vibe-workshop/demo-pipeline/.gitignore b/projects/coles-vibe-workshop/demo-pipeline/.gitignore new file mode 100644 index 0000000..2377731 --- /dev/null +++ b/projects/coles-vibe-workshop/demo-pipeline/.gitignore @@ -0,0 +1,18 @@ +# Databricks bundle state +.databricks/ + +# Python +__pycache__/ +*.pyc +*.pyo +.venv/ +.uv/ +uv.lock + +# IDE +.idea/ +.vscode/ +*.swp +dist/ +*.egg-info/ +build/ diff --git a/projects/coles-vibe-workshop/demo-pipeline/CLAUDE.md b/projects/coles-vibe-workshop/demo-pipeline/CLAUDE.md new file mode 100644 index 0000000..ffa75d4 --- /dev/null +++ b/projects/coles-vibe-workshop/demo-pipeline/CLAUDE.md @@ -0,0 +1,92 @@ +# CLAUDE.md — Grocery Intelligence Demo Pipeline + +## Project + +Demo pipeline for the Coles Vibe Coding Workshop opening. Ingests Australian retail and food price data from the ABS, transforms through bronze/silver/gold, and serves analytics-ready tables for the demo app. 
+
+This is the **facilitator's pre-built pipeline** — deployed before the workshop so gold tables are ready for the opening demo. Teams build their own version during Lab 1.
+
+## Tech Stack
+
+- **Pipeline framework:** Lakeflow Declarative Pipelines (`databricks.declarative_pipelines` with `dlt` fallback)
+- **Data processing:** PySpark (never pandas)
+- **Data quality:** `@dp.expect()`, `@dp.expect_or_fail()`
+- **Deployment:** Databricks Asset Bundles with `DATABRICKS_BUNDLE_ENGINE=direct`
+- **Testing:** Python Behave (BDD) with Gherkin feature files
+
+## Data Standards
+
+- **Catalog:** `workshop_vibe_coding`
+- **Schema:** `demo`
+- **Architecture:** Bronze (raw from UC volumes) → Silver (decoded, typed) → Gold (aggregated)
+- **Column naming:** snake_case
+- **Date columns:** DATE type
+- **Nulls:** Bronze may contain nulls; Silver filters them; Gold has zero nulls
+
+## Data Sources
+
+| Source | Location | Format | What It Contains |
+|--------|----------|--------|-----------------|
+| ABS Retail Trade | `/Volumes/workshop_vibe_coding/demo/raw_data/abs_retail_trade.csv` | CSV | Monthly retail turnover by state & industry since 2010 |
+| ABS CPI | `/Volumes/workshop_vibe_coding/demo/raw_data/abs_cpi_food.csv` | CSV | Quarterly CPI indices by state since 2010 |
+
+## Code Mappings
+
+### Region Codes (INT)
+
+| Code | State |
+|------|-------|
+| 1 | New South Wales |
+| 2 | Victoria |
+| 3 | Queensland |
+| 4 | South Australia |
+| 5 | Western Australia |
+| 6 | Tasmania |
+| 7 | Northern Territory |
+| 8 | Australian Capital Territory |
+
+### Industry Codes (INT, Retail Trade only)
+
+| Code | Industry |
+|------|----------|
+| 20 | Food retailing |
+| 41 | Clothing, footwear and personal accessories |
+| 42 | Department stores |
+| 43 | Other retailing |
+| 44 | Cafes, restaurants and takeaway |
+| 45 | Household goods retailing |
+
+## Tables
+
+| Table | Layer | Rows | Description |
+|-------|-------|------|-------------|
+| `bronze_abs_retail_trade` | Bronze | ~8,900 | Raw ABS retail data |
+| `bronze_abs_cpi_food` | Bronze | ~500 | Raw ABS CPI data |
+| `silver_retail_turnover` | Silver | ~8,200 | Decoded states, industries, dates |
+| `silver_food_price_index` | Silver | ~500 | Decoded states, quarterly CPI |
+| `gold_retail_summary` | Gold | ~1,500 | Monthly turnover + 3m/12m avg + YoY growth |
+| `gold_food_inflation_yoy` | Gold | ~480 | Quarterly CPI inflation by state |
+
+## Rules
+
+- PySpark only, never pandas
+- Use `@dp.expect` / `@dp.expect_or_fail` for data quality
+- One transformation per file
+- Import dlt with fallback: `try: import databricks.declarative_pipelines as dp / except: import dlt as dp`
+- Read raw data from UC volumes, not external APIs (serverless compute blocks external HTTPS)
+
+## Commands
+
+```bash
+# Validate
+DATABRICKS_BUNDLE_ENGINE=direct databricks bundle validate
+
+# Deploy
+DATABRICKS_BUNDLE_ENGINE=direct databricks bundle deploy
+
+# Run pipeline
+DATABRICKS_BUNDLE_ENGINE=direct databricks bundle run grocery_pipeline
+
+# Run BDD tests
+uv run behave
+```
diff --git a/projects/coles-vibe-workshop/demo-pipeline/Makefile b/projects/coles-vibe-workshop/demo-pipeline/Makefile
new file mode 100644
index 0000000..add0187
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/Makefile
@@ -0,0 +1,35 @@
+.PHONY: help build validate deploy run test test-all lint format clean gates
+
+help: ## Show this help
+	@grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf "\033[36m%-15s\033[0m %s\n", $$1, $$2}'
+
+build: ## Build the grocery_intelligence wheel
+	uv build
+
+validate: ## Validate the DABs bundle
+	DATABRICKS_BUNDLE_ENGINE=direct databricks bundle validate
+
+deploy: validate ## Validate + deploy pipeline to Databricks
+	DATABRICKS_BUNDLE_ENGINE=direct databricks bundle deploy
+
+run: ## Run the pipeline
+	DATABRICKS_BUNDLE_ENGINE=direct databricks bundle run grocery_pipeline
+
+test: ## Run BDD smoke tests against deployed tables
+	uv run behave --tags="@smoke"
+
+test-all: ## Run all BDD tests
+	uv run behave
+
+lint: ## Lint Python sources
+	uv run ruff check src/ features/
+	uv run ruff format --check src/ features/
+
+format: ## Auto-format Python sources
+	uv run ruff format src/ features/
+
+clean: ## Remove build artifacts
+	rm -rf .databricks/ dist/ build/ *.egg-info/ __pycache__/
+	find . -type d -name __pycache__ -exec rm -rf {} + 2>/dev/null || true
+
+gates: lint test ## Run all quality gates before shipping
diff --git a/projects/coles-vibe-workshop/demo-pipeline/README.md b/projects/coles-vibe-workshop/demo-pipeline/README.md
new file mode 100644
index 0000000..7ca9dfb
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/README.md
@@ -0,0 +1,43 @@
+# Grocery Intelligence Demo Pipeline
+
+Pre-built Lakeflow SDP pipeline for the Coles Vibe Coding Workshop opening demo. Ingests Australian Bureau of Statistics retail trade and CPI data through a medallion architecture.
+
+## Tables
+
+| Table | Layer | Description |
+|-------|-------|-------------|
+| `bronze_abs_retail_trade` | Bronze | Raw ABS monthly retail turnover |
+| `bronze_abs_cpi_food` | Bronze | Raw ABS quarterly CPI indices |
+| `silver_retail_turnover` | Silver | Decoded states, industries, parsed dates |
+| `silver_food_price_index` | Silver | Decoded states, quarterly CPI |
+| `gold_retail_summary` | Gold | Monthly turnover + rolling avg + YoY growth |
+| `gold_food_inflation_yoy` | Gold | Quarterly inflation rate by state |
+
+## Quick start
+
+```bash
+make deploy   # validate + deploy to Databricks
+make run      # trigger pipeline refresh
+make test     # BDD tests against deployed tables
+```
+
+## Project structure
+
+```
+demo-pipeline/
+├── CLAUDE.md              # Agent instructions
+├── databricks.yml         # Bundle config
+├── Makefile               # Command interface
+├── pyproject.toml         # Python config + test deps
+├── resources/
+│   └── pipeline.yml       # Pipeline + job definitions
+├── src/
+│   ├── bronze/            # Raw ingestion from UC volumes
+│   ├── silver/            # Decode, type, clean
+│   └── gold/              # Aggregate, rolling avg, YoY
+└── features/
+    ├── environment.py     # Behave hooks (workspace connection)
+    ├── pipeline.feature   # 9 Gherkin scenarios
+    └── steps/
+        └── pipeline_steps.py  # Step definitions
+```
diff --git a/projects/coles-vibe-workshop/demo-pipeline/databricks.yml b/projects/coles-vibe-workshop/demo-pipeline/databricks.yml
new file mode 100644
index 0000000..d206c25
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/databricks.yml
@@ -0,0 +1,27 @@
+bundle:
+  name: grocery-intelligence-demo
+
+variables:
+  team_name:
+    default: demo
+  schema:
+    default: demo
+
+include:
+  - resources/*.yml
+
+workspace:
+  profile: daveok
+
+# NOTE: The grocery_intelligence package under src/grocery_intelligence/
+# provides the canonical mappings and decode helpers. Pipeline notebooks
+# inline these because SDP's deprecated whl: field and glob: path resolution
+# prevent clean cross-module imports. For production, consider installing
+# the package via %pip in a setup notebook.
+
+targets:
+  dev:
+    default: true
+    variables:
+      team_name: demo
+      schema: demo
diff --git a/projects/coles-vibe-workshop/demo-pipeline/features/environment.py b/projects/coles-vibe-workshop/demo-pipeline/features/environment.py
new file mode 100644
index 0000000..9ed139c
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/features/environment.py
@@ -0,0 +1,30 @@
+"""Behave environment hooks — Databricks workspace connection."""
+
+from __future__ import annotations
+
+from databricks.sdk import WorkspaceClient
+
+
+def before_all(context):
+    """Connect to Databricks workspace."""
+    context.workspace = WorkspaceClient()
+    context.cursor = None
+
+    # Auto-discover a running SQL warehouse
+    warehouses = list(context.workspace.warehouses.list())
+    running = [w for w in warehouses if w.state and w.state.value == "RUNNING"]
+    if not running:
+        raise RuntimeError("No running SQL warehouse found")
+    context.warehouse_id = running[0].id
+
+
+def before_scenario(context, scenario):
+    """Reset per-scenario query result state."""
+    context.result = None
+    context.result_columns = []
+    context.result_rows = []
+
+
+def after_all(context):
+    """Clean up."""
+    pass
diff --git a/projects/coles-vibe-workshop/demo-pipeline/features/pipeline.feature b/projects/coles-vibe-workshop/demo-pipeline/features/pipeline.feature
new file mode 100644
index 0000000..ee8c869
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/features/pipeline.feature
@@ -0,0 +1,61 @@
+@pipeline @smoke
+Feature: Grocery Intelligence Pipeline
+  As a data engineer
+  I want a medallion pipeline that ingests ABS retail and CPI data
+  So that gold tables are ready for the workshop demo app
+
+  Background:
+    Given a connection to the Databricks workspace
+    And the catalog "workshop_vibe_coding" exists
+    And the schema "demo" exists
+
+  Scenario: Bronze retail table has data from ABS
+    When I query "workshop_vibe_coding.demo.bronze_abs_retail_trade"
+    Then the table should have more than 8000 rows
+    And the table should have columns "REGION, INDUSTRY, TIME_PERIOD, OBS_VALUE"
+
+  Scenario: Bronze CPI table has data from ABS
+    When I query "workshop_vibe_coding.demo.bronze_abs_cpi_food"
+    Then the table should have more than 400 rows
+    And the table should have columns "REGION, INDEX, TIME_PERIOD, OBS_VALUE"
+
+  Scenario: Silver retail turnover has decoded states
+    When I query "workshop_vibe_coding.demo.silver_retail_turnover"
+    Then the table should have more than 7000 rows
+    And the column "state" should contain "New South Wales"
+    And the column "state" should not contain "1"
+    And there should be 8 distinct values in "state"
+
+  Scenario: Silver retail turnover has no nulls
+    When I query "workshop_vibe_coding.demo.silver_retail_turnover"
+    Then the column "date" should have zero nulls
+    And the column "turnover_millions" should have zero nulls
+    And all values in "turnover_millions" should be greater than 0
+
+  Scenario: Silver food price index has quarterly dates
+    When I query "workshop_vibe_coding.demo.silver_food_price_index"
+    Then the table should have more than 400 rows
+    And the column "state" should contain "Victoria"
+    And the column "cpi_index" should have zero nulls
+
+  Scenario: Gold retail summary has rolling averages and YoY growth
+    When I query "workshop_vibe_coding.demo.gold_retail_summary"
+    Then the table should have more than 1000 rows
+    And the table should have columns "state, date, total_turnover, turnover_3m_avg, turnover_12m_avg, yoy_growth_pct"
+    And there should be 8 distinct values in "state"
+    And the column "turnover_3m_avg" should have zero nulls
+
+  Scenario: Gold retail summary date range spans 2010 to 2025
+    When I query "workshop_vibe_coding.demo.gold_retail_summary"
+    Then the minimum value in "date" should be before "2010-06-01"
+    And the maximum value in "date" should be after "2025-01-01"
+
+  Scenario: Gold food inflation has YoY change percentages
+    When I query "workshop_vibe_coding.demo.gold_food_inflation_yoy"
+    Then the table should have more than 400 rows
+    And the table should have columns "state, date, cpi_index, yoy_change_pct"
+    And all values in "yoy_change_pct" should be between -50 and 100
+
+  Scenario: NSW has the highest total retail turnover
+    When I run SQL "SELECT state, SUM(total_turnover) as total FROM workshop_vibe_coding.demo.gold_retail_summary GROUP BY state ORDER BY total DESC LIMIT 1"
+    Then the first row "state" should be "New South Wales"
diff --git a/projects/coles-vibe-workshop/demo-pipeline/features/steps/pipeline_steps.py b/projects/coles-vibe-workshop/demo-pipeline/features/steps/pipeline_steps.py
new file mode 100644
index 0000000..f12ffab
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/features/steps/pipeline_steps.py
@@ -0,0 +1,155 @@
+"""Step definitions for pipeline BDD tests."""
+
+from __future__ import annotations
+
+import time
+
+from behave import given, when, then
+from behave.runner import Context
+from databricks.sdk.service.sql import StatementState
+
+
+def _execute_sql(context: Context, sql: str) -> None:
+    """Execute SQL via Statement Execution API and store results."""
+    resp = context.workspace.statement_execution.execute_statement(
+        warehouse_id=context.warehouse_id,
+        statement=sql,
+        wait_timeout="30s",
+    )
+    # Poll if needed (up to ~30s on top of the initial 30s server-side wait)
+    retries = 0
+    while resp.status and resp.status.state in (StatementState.PENDING, StatementState.RUNNING):
+        time.sleep(1)
+        resp = context.workspace.statement_execution.get_statement(resp.statement_id)
+        retries += 1
+        if retries > 30:
+            raise TimeoutError(f"SQL statement still running after ~60s: {sql[:80]}")
+
+    if resp.status and resp.status.state == StatementState.FAILED:
+        error = resp.status.error
+        raise RuntimeError(f"SQL failed: {error.message if error else 'unknown'}")
+
+    context.result_columns = [c.name for c in resp.manifest.schema.columns] if resp.manifest else []
+    context.result_rows = resp.result.data_array if resp.result and resp.result.data_array else []
+
+
+# ---------------------------------------------------------------------------
+# GIVEN steps
+# ---------------------------------------------------------------------------
+
+@given('a connection to the Databricks workspace')
+def step_workspace_connection(context: Context) -> None:
+    assert context.workspace is not None, "WorkspaceClient not initialized"
+    assert context.warehouse_id, "No SQL warehouse found"
+
+
+@given('the catalog "{catalog}" exists')
+def step_catalog_exists(context: Context, catalog: str) -> None:
+    _execute_sql(context, f"SELECT 1 FROM system.information_schema.catalogs WHERE catalog_name = '{catalog}'")
+    assert len(context.result_rows) > 0, f"Catalog '{catalog}' does not exist"
+
+
+@given('the schema "{schema}" exists')
+def step_schema_exists(context: Context, schema: str) -> None:
+    _execute_sql(context, f"SHOW SCHEMAS IN workshop_vibe_coding LIKE '{schema}'")
+    assert len(context.result_rows) > 0, f"Schema '{schema}' does not exist"
+
+
+# ---------------------------------------------------------------------------
+# WHEN steps
+# 
--------------------------------------------------------------------------- + +@when('I query "{table}"') +def step_query_table(context: Context, table: str) -> None: + _execute_sql(context, f"SELECT * FROM {table}") + + +@when('I run SQL "{sql}"') +def step_run_sql(context: Context, sql: str) -> None: + _execute_sql(context, sql) + + +# --------------------------------------------------------------------------- +# THEN steps +# --------------------------------------------------------------------------- + +@then('the table should have more than {count:d} rows') +def step_row_count_more_than(context: Context, count: int) -> None: + actual = len(context.result_rows) + assert actual > count, f"Expected more than {count} rows, got {actual}" + + +@then('the table should have columns "{columns}"') +def step_has_columns(context: Context, columns: str) -> None: + expected = [c.strip() for c in columns.split(",")] + for col in expected: + assert col in context.result_columns, ( + f"Column '{col}' not found. Available: {context.result_columns}" + ) + + +@then('the column "{column}" should contain "{value}"') +def step_column_contains(context: Context, column: str, value: str) -> None: + idx = context.result_columns.index(column) + values = [row[idx] for row in context.result_rows] + assert value in values, f"'{value}' not found in column '{column}'. Sample: {values[:5]}" + + +@then('the column "{column}" should not contain "{value}"') +def step_column_not_contains(context: Context, column: str, value: str) -> None: + idx = context.result_columns.index(column) + values = [row[idx] for row in context.result_rows] + assert value not in values, f"'{value}' unexpectedly found in column '{column}'" + + +@then('there should be {count:d} distinct values in "{column}"') +def step_distinct_count(context: Context, count: int, column: str) -> None: + idx = context.result_columns.index(column) + distinct = set(row[idx] for row in context.result_rows) + assert len(distinct) == count, f"Expected {count} distinct values in '{column}', got {len(distinct)}: {distinct}" + + +@then('the column "{column}" should have zero nulls') +def step_no_nulls(context: Context, column: str) -> None: + idx = context.result_columns.index(column) + nulls = sum(1 for row in context.result_rows if row[idx] is None) + assert nulls == 0, f"Column '{column}' has {nulls} null values" + + +@then('all values in "{column}" should be greater than {threshold:d}') +def step_all_greater_than(context: Context, column: str, threshold: int) -> None: + idx = context.result_columns.index(column) + violations = [row[idx] for row in context.result_rows if row[idx] is not None and float(row[idx]) <= threshold] + assert len(violations) == 0, f"{len(violations)} values in '{column}' <= {threshold}. Sample: {violations[:5]}" + + +@then('all values in "{column}" should be between {low:d} and {high:d}') +def step_all_between(context: Context, column: str, low: int, high: int) -> None: + idx = context.result_columns.index(column) + violations = [ + row[idx] for row in context.result_rows + if row[idx] is not None and not (low <= float(row[idx]) <= high) + ] + assert len(violations) == 0, f"{len(violations)} values in '{column}' outside [{low}, {high}]. 
Sample: {violations[:5]}" + + +@then('the minimum value in "{column}" should be before "{date}"') +def step_min_before(context: Context, column: str, date: str) -> None: + idx = context.result_columns.index(column) + dates = sorted(row[idx] for row in context.result_rows if row[idx] is not None) + assert dates[0] < date, f"Minimum date in '{column}' is {dates[0]}, expected before {date}" + + +@then('the maximum value in "{column}" should be after "{date}"') +def step_max_after(context: Context, column: str, date: str) -> None: + idx = context.result_columns.index(column) + dates = sorted(row[idx] for row in context.result_rows if row[idx] is not None) + assert dates[-1] > date, f"Maximum date in '{column}' is {dates[-1]}, expected after {date}" + + +@then('the first row "{column}" should be "{value}"') +def step_first_row_value(context: Context, column: str, value: str) -> None: + assert len(context.result_rows) > 0, "Query returned no rows" + idx = context.result_columns.index(column) + actual = context.result_rows[0][idx] + assert actual == value, f"First row '{column}' is '{actual}', expected '{value}'" diff --git a/projects/coles-vibe-workshop/demo-pipeline/pyproject.toml b/projects/coles-vibe-workshop/demo-pipeline/pyproject.toml new file mode 100644 index 0000000..037e389 --- /dev/null +++ b/projects/coles-vibe-workshop/demo-pipeline/pyproject.toml @@ -0,0 +1,22 @@ +[project] +name = "grocery-intelligence" +version = "0.1.0" +description = "Shared utilities for the Grocery Intelligence demo pipeline" +requires-python = ">=3.10" +dependencies = [] + +[build-system] +requires = ["setuptools>=68.0", "wheel"] +build-backend = "setuptools.build_meta" + +[tool.setuptools.packages.find] +where = ["src"] +include = ["grocery_intelligence*"] + +[tool.ruff] +line-length = 100 +target-version = "py310" + +[tool.behave] +paths = ["features"] +format = ["pretty"] diff --git a/projects/coles-vibe-workshop/demo-pipeline/resources/pipeline.yml b/projects/coles-vibe-workshop/demo-pipeline/resources/pipeline.yml new file mode 100644 index 0000000..5e45409 --- /dev/null +++ b/projects/coles-vibe-workshop/demo-pipeline/resources/pipeline.yml @@ -0,0 +1,32 @@ +resources: + pipelines: + grocery_pipeline: + name: grocery-intelligence-${var.team_name} + serverless: true + catalog: workshop_vibe_coding + schema: ${var.schema} + development: true + libraries: + - notebook: + path: ../src/bronze/abs_retail_trade.py + - notebook: + path: ../src/bronze/abs_cpi_food.py + - notebook: + path: ../src/silver/retail_turnover.py + - notebook: + path: ../src/silver/food_price_index.py + - notebook: + path: ../src/gold/retail_summary.py + - notebook: + path: ../src/gold/food_inflation_yoy.py + + jobs: + grocery_daily_refresh: + name: grocery-refresh-${var.team_name} + tasks: + - task_key: run_pipeline + pipeline_task: + pipeline_id: ${resources.pipelines.grocery_pipeline.id} + schedule: + quartz_cron_expression: "0 0 6 * * ?" 
+        timezone_id: "Australia/Sydney"
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_cpi_food.py b/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_cpi_food.py
new file mode 100644
index 0000000..32502bd
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_cpi_food.py
@@ -0,0 +1,23 @@
+# Databricks notebook source
+"""Bronze: ABS Consumer Price Index (Food) ingestion."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import current_timestamp
+
+
+@dp.table(comment="Raw ABS CPI data — quarterly food price indices by state since 2010")
+@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL")
+@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL")
+def bronze_abs_cpi_food():
+    spark = SparkSession.getActiveSession()
+    return (
+        spark.read.csv(
+            "/Volumes/workshop_vibe_coding/demo/raw_data/abs_cpi_food.csv",
+            header=True, inferSchema=True,
+        )
+        .withColumn("_ingested_at", current_timestamp())
+    )
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_retail_trade.py b/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_retail_trade.py
new file mode 100644
index 0000000..096aa16
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/bronze/abs_retail_trade.py
@@ -0,0 +1,23 @@
+# Databricks notebook source
+"""Bronze: ABS Retail Trade ingestion."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import current_timestamp
+
+
+@dp.table(comment="Raw ABS Retail Trade data — monthly turnover by state and industry since 2010")
+@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL")
+@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL")
+def bronze_abs_retail_trade():
+    spark = SparkSession.getActiveSession()
+    return (
+        spark.read.csv(
+            "/Volumes/workshop_vibe_coding/demo/raw_data/abs_retail_trade.csv",
+            header=True, inferSchema=True,
+        )
+        .withColumn("_ingested_at", current_timestamp())
+    )
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/gold/food_inflation_yoy.py b/projects/coles-vibe-workshop/demo-pipeline/src/gold/food_inflation_yoy.py
new file mode 100644
index 0000000..9be5f83
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/gold/food_inflation_yoy.py
@@ -0,0 +1,34 @@
+# Databricks notebook source
+"""Gold: Food Inflation YoY — quarterly CPI change by state."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import col, lag, round as round_
+from pyspark.sql.window import Window
+
+
+@dp.table(comment="Year-over-year CPI inflation rate by state and quarter")
+@dp.expect("valid_yoy_change", "yoy_change_pct BETWEEN -50 AND 100")
+def gold_food_inflation_yoy():
+    spark = SparkSession.getActiveSession()
+    df = spark.read.table("LIVE.silver_food_price_index")
+
+    wyoy = Window.partitionBy("state").orderBy("date")
+
+    return (
+        df.withColumn("prev_year_index", lag("cpi_index", 4).over(wyoy))
+        .withColumn(
+            "yoy_change_pct",
+            round_(
+                (col("cpi_index") - col("prev_year_index"))
+                / col("prev_year_index") * 100,
+                2,
+            ),
+        )
+        .filter(col("prev_year_index").isNotNull())
+        .select("state", "category", "date", "year", "quarter", "cpi_index", "yoy_change_pct")
+        .orderBy("state", "date")
+    )
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/gold/retail_summary.py b/projects/coles-vibe-workshop/demo-pipeline/src/gold/retail_summary.py
new file mode 100644
index 0000000..6e9a18d
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/gold/retail_summary.py
@@ -0,0 +1,44 @@
+# Databricks notebook source
+"""Gold: Retail Summary — monthly turnover with rolling averages and YoY growth."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import col, sum as sum_, avg, lag, round as round_
+from pyspark.sql.window import Window
+
+
+@dp.table(comment="Monthly retail summary by state with rolling averages and YoY growth")
+@dp.expect("valid_yoy", "yoy_growth_pct BETWEEN -100 AND 500")
+@dp.expect("valid_rolling_avg", "turnover_3m_avg > 0")
+def gold_retail_summary():
+    spark = SparkSession.getActiveSession()
+    df = spark.read.table("LIVE.silver_retail_turnover")
+
+    monthly = (
+        df.groupBy("state", "date", "year", "month")
+        .agg(sum_("turnover_millions").alias("total_turnover"))
+    )
+
+    w3 = Window.partitionBy("state").orderBy("date").rowsBetween(-2, 0)
+    w12 = Window.partitionBy("state").orderBy("date").rowsBetween(-11, 0)
+    wyoy = Window.partitionBy("state").orderBy("date")
+
+    return (
+        monthly
+        .withColumn("turnover_3m_avg", round_(avg("total_turnover").over(w3), 2))
+        .withColumn("turnover_12m_avg", round_(avg("total_turnover").over(w12), 2))
+        .withColumn("prev_year_turnover", lag("total_turnover", 12).over(wyoy))
+        .withColumn(
+            "yoy_growth_pct",
+            round_(
+                (col("total_turnover") - col("prev_year_turnover"))
+                / col("prev_year_turnover") * 100,
+                2,
+            ),
+        )
+        .drop("prev_year_turnover")
+        .orderBy("state", "date")
+    )
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/__init__.py b/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/__init__.py
new file mode 100644
index 0000000..9457a72
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/__init__.py
@@ -0,0 +1,17 @@
+"""Grocery Intelligence — shared utilities for pipeline transformations."""
+
+from grocery_intelligence.mappings import (
+    INDUSTRY_DECODE,
+    INDEX_DECODE,
+    REGION_DECODE,
+    VOLUME_PATH,
+    decode_column,
+)
+
+__all__ = [
+    "REGION_DECODE",
+    "INDUSTRY_DECODE",
+    "INDEX_DECODE",
+    "VOLUME_PATH",
+    "decode_column",
+]
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/mappings.py b/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/mappings.py
new file mode 100644
index 0000000..38d74cf
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/grocery_intelligence/mappings.py
@@ -0,0 +1,44 @@
+"""Shared code mappings for ABS data decoding.
+
+Used by silver layer transforms to decode integer codes from ABS SDMX API
+into human-readable names.
+"""
+
+from __future__ import annotations
+
+REGION_DECODE: dict[int, str] = {
+    1: "New South Wales",
+    2: "Victoria",
+    3: "Queensland",
+    4: "South Australia",
+    5: "Western Australia",
+    6: "Tasmania",
+    7: "Northern Territory",
+    8: "Australian Capital Territory",
+}
+
+INDUSTRY_DECODE: dict[int, str] = {
+    20: "Food retailing",
+    41: "Clothing, footwear and personal accessories",
+    42: "Department stores",
+    43: "Other retailing",
+    44: "Cafes, restaurants and takeaway",
+    45: "Household goods retailing",
+}
+
+INDEX_DECODE: dict[int, str] = {
+    10001: "All groups CPI",
+    20001: "Food and non-alcoholic beverages",
+}
+
+VOLUME_PATH = "/Volumes/workshop_vibe_coding/demo/raw_data"
+
+
+def decode_column(column_name: str, mapping: dict[int, str]):
+    """Build a PySpark when().otherwise() chain for integer-to-string decoding."""
+    from pyspark.sql.functions import col, when, lit
+
+    expr = lit("Unknown")
+    for code, name in mapping.items():
+        expr = when(col(column_name) == code, lit(name)).otherwise(expr)
+    return expr
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/silver/food_price_index.py b/projects/coles-vibe-workshop/demo-pipeline/src/silver/food_price_index.py
new file mode 100644
index 0000000..c5ff6df
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/silver/food_price_index.py
@@ -0,0 +1,60 @@
+# Databricks notebook source
+"""Silver: Food Price Index — decode regions/indices, parse quarterly dates."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import col, when, lit, to_date, concat, year, quarter as quarter_fn
+
+REGION_DECODE = {
+    1: "New South Wales", 2: "Victoria", 3: "Queensland", 4: "South Australia",
+    5: "Western Australia", 6: "Tasmania", 7: "Northern Territory", 8: "Australian Capital Territory",
+}
+INDEX_DECODE = {10001: "All groups CPI", 20001: "Food and non-alcoholic beverages"}
+
+
+def _decode(column, mapping):
+    expr = lit("Unknown")
+    for code, name in mapping.items():
+        expr = when(col(column) == code, lit(name)).otherwise(expr)
+    return expr
+
+
+@dp.table(comment="Cleaned CPI food price index with decoded regions and categories")
+@dp.expect_or_fail("valid_date", "date IS NOT NULL")
+@dp.expect(
+    "valid_state",
+    "state IN ('New South Wales','Victoria','Queensland','South Australia',"
+    "'Western Australia','Tasmania','Northern Territory','Australian Capital Territory')",
+)
+@dp.expect("positive_index", "cpi_index > 0")
+def silver_food_price_index():
+    spark = SparkSession.getActiveSession()
+    df = spark.read.table("LIVE.bronze_abs_cpi_food")
+
+    return (
+        df.withColumn("state", _decode("REGION", REGION_DECODE))
+        .withColumn("category", _decode("INDEX", INDEX_DECODE))
+        .withColumn(
+            "date",
+            to_date(
+                concat(
+                    col("TIME_PERIOD").substr(1, 4), lit("-"),
+                    when(col("TIME_PERIOD").contains("Q1"), lit("01"))
+                    .when(col("TIME_PERIOD").contains("Q2"), lit("04"))
+                    .when(col("TIME_PERIOD").contains("Q3"), lit("07"))
+                    .when(col("TIME_PERIOD").contains("Q4"), lit("10")),
+                    lit("-01"),
+                ),
+                "yyyy-MM-dd",
+            ),
+        )
+        .withColumn("year", year("date"))
+        .withColumn("quarter", quarter_fn("date"))
+        .withColumnRenamed("OBS_VALUE", "cpi_index")
+        .select("state", "category", "date", "year", "quarter", "cpi_index")
+        .filter(col("date").isNotNull())
+        .filter(col("cpi_index").isNotNull())
+    )
diff --git a/projects/coles-vibe-workshop/demo-pipeline/src/silver/retail_turnover.py b/projects/coles-vibe-workshop/demo-pipeline/src/silver/retail_turnover.py
new file mode 100644
index 0000000..21ba67c
--- /dev/null
+++ b/projects/coles-vibe-workshop/demo-pipeline/src/silver/retail_turnover.py
@@ -0,0 +1,53 @@
+# Databricks notebook source
+"""Silver: Retail Turnover — decode regions/industries, cast dates."""
+
+try:
+    import databricks.declarative_pipelines as dp
+except ModuleNotFoundError:
+    import dlt as dp
+from pyspark.sql import SparkSession
+from pyspark.sql.functions import col, month, year, quarter, when, lit
+
+REGION_DECODE = {
+    1: "New South Wales", 2: "Victoria", 3: "Queensland", 4: "South Australia",
+    5: "Western Australia", 6: "Tasmania", 7: "Northern Territory", 8: "Australian Capital Territory",
+}
+INDUSTRY_DECODE = {
+    20: "Food retailing", 41: "Clothing, footwear and personal accessories",
+    42: "Department stores", 43: "Other retailing",
+    44: "Cafes, restaurants and takeaway", 45: "Household goods retailing",
+}
+
+
+def _decode(column, mapping):
+    expr = lit("Unknown")
+    for code, name in mapping.items():
+        expr = when(col(column) == code, lit(name)).otherwise(expr)
+    return expr
+
+
+@dp.table(comment="Cleaned retail turnover with decoded regions and industries")
+@dp.expect_or_fail("valid_date", "date IS NOT NULL")
+@dp.expect(
+    "valid_state",
+    "state IN ('New South Wales','Victoria','Queensland','South Australia',"
+    "'Western Australia','Tasmania','Northern Territory','Australian Capital Territory')",
+)
+@dp.expect("positive_turnover", "turnover_millions > 0")
+def silver_retail_turnover():
+    spark = SparkSession.getActiveSession()
+    df = spark.read.table("LIVE.bronze_abs_retail_trade")
+
+    return (
+        df.withColumn("state", _decode("REGION", REGION_DECODE))
+        .withColumn("industry", _decode("INDUSTRY", INDUSTRY_DECODE))
+        .withColumn("date", col("TIME_PERIOD").cast("date"))
+        .withColumn("year", year("date"))
+        .withColumn("month", month("date"))
+        .withColumn("quarter", quarter("date"))
+        .withColumnRenamed("OBS_VALUE", "turnover_millions")
+        .select("state", "industry", "date", "year", "month", "quarter", "turnover_millions")
+        .filter(col("date").isNotNull())
+        .filter(col("turnover_millions").isNotNull())
+        .filter(col("turnover_millions") > 0)
+    )
diff --git a/projects/coles-vibe-workshop/images/databricks-platform-stack.png b/projects/coles-vibe-workshop/images/databricks-platform-stack.png
new file mode 100644
index 0000000..6318cd6
Binary files /dev/null and b/projects/coles-vibe-workshop/images/databricks-platform-stack.png differ
diff --git a/projects/coles-vibe-workshop/images/mcp-databricks-architecture-detail.png b/projects/coles-vibe-workshop/images/mcp-databricks-architecture-detail.png
new file mode 100644
index 0000000..f6f6d10
Binary files /dev/null and b/projects/coles-vibe-workshop/images/mcp-databricks-architecture-detail.png differ
diff --git a/projects/coles-vibe-workshop/images/mcp-databricks-architecture-overview.png b/projects/coles-vibe-workshop/images/mcp-databricks-architecture-overview.png
new file mode 100644
index 0000000..ccbf8dd
Binary files /dev/null and b/projects/coles-vibe-workshop/images/mcp-databricks-architecture-overview.png differ
diff --git a/projects/coles-vibe-workshop/quick-reference.html b/projects/coles-vibe-workshop/quick-reference.html
new file mode 100644
index 0000000..9164f17
--- /dev/null
+++ b/projects/coles-vibe-workshop/quick-reference.html
@@ -0,0 +1,615 @@
Vibe Coding Workshop - Quick Reference Card
## Vibe Coding Workshop Quick Reference

Coles Group - Agentic Software Development with Databricks

### The TDD + Agent Workflow

1. Write Tests → 2. Agent Implements → 3. Run Tests → 4. Iterate → 5. Review & Deploy

**Key insight:** Tests are unambiguous specs the agent can verify against. They prevent the agent from "going off the rails."

### Claude Code Commands

| Command | Description |
|---|---|
| `claude` | Start interactive session |
| `claude --version` | Check installed version |
| `claude --help` | Show all CLI options |
| `claude mcp list` | List MCP server connections |
| `/help` | Help within a session |
| `/clear` | Reset conversation context |
| `/compact` | Compress context (save tokens) |
| `/cost` | Show token usage & cost |

### Terminal Shortcuts

| Shortcut | Action |
|---|---|
| Esc | Cancel current generation |
| Ctrl+C | Interrupt / exit |
| Shift+Enter | Multi-line input |
| Up/Down | Scroll through history |
| Tab | Accept autocomplete |

**Tip:** Use Shift+Enter to write multi-line prompts before sending.

### Common Prompt Patterns

**Setup:**

```text
Create a new Python project called "store-pipeline" with:
- A src/ directory for pipeline code
- A tests/ directory for pytest tests
- A CLAUDE.md with these rules: ...
```

**Tests First:**

```text
Write pytest tests for a function clean_transactions(df):
- Remove rows where amount is null or negative
- Test: given 10 rows with 2 invalid, output has 8
Write ONLY the tests. Do NOT implement yet.
```

**Implement:**

```text
Implement all functions to make the tests pass.
Run the tests after each function to verify.
```

**Deploy:**

```text
Deploy using: databricks apps deploy --name my-app
```

### Tips for Effective Prompting

**Be Specific**
- Name exact functions, tables, columns
- Provide example inputs/outputs
- State constraints explicitly

**Set Boundaries**
- "Write ONLY tests, do NOT implement"
- Specify tech stack upfront
- Declare table/schema names

**Iterate Wisely**
- Start small: one function, one test
- Add edge cases after green tests
- Use /compact for long sessions

## Databricks Quick Reference

Unity Catalog, SQL, Apps, DABs, AI Gateway

### Unity Catalog Hierarchy

```text
catalog.schema.table

# Workshop setup:
workshop_vibe_coding            # catalog
  └ raw_data                    # shared schema
  │   └ store_transactions      # bronze table
  │   └ products                # reference table
  └ <your_schema>               # your schema
      └ daily_store_metrics     # your gold table
```

### Key SQL Commands

```sql
-- Explore data
DESCRIBE workshop_vibe_coding.raw_data.store_transactions;
SELECT * FROM ... LIMIT 10;

-- Create objects
CREATE CATALOG IF NOT EXISTS my_catalog;
CREATE SCHEMA IF NOT EXISTS catalog.schema;

-- Permissions
GRANT SELECT, MODIFY ON SCHEMA catalog.schema
  TO `user@domain.com`;

-- Idempotent writes (upsert)
MERGE INTO target USING source
  ON target.id = source.id
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *;
```

### Databricks Asset Bundles (DABs)

```yaml
# databricks.yml
bundle:
  name: daily-store-metrics

resources:
  jobs:
    daily_store_metrics:
      name: "Daily Store Metrics"
      schedule:
        quartz_cron_expression: "0 0 6 * * ?"
        timezone_id: "Australia/Melbourne"
      tasks:
        - task_key: run_pipeline
          notebook_task:
            notebook_path: ./src/pipeline.py

targets:
  dev:
    default: true
    workspace:
      host: https://<workspace>.databricks.com
  prod:
    workspace:
      host: https://<workspace>.databricks.com
```

```bash
databricks bundle validate               # check config
databricks bundle deploy -t dev          # deploy
databricks bundle run -t dev daily_store_metrics
```

### AI Gateway

Routes all LLM calls through Databricks for model switching, rate limiting, cost tracking, and audit logging.

- The Coding Agents App uses AI Gateway for all Claude Code calls
- MLflow auto-traces every agent session
- Foundation Model API for app AI features (NL-to-SQL)
- Supports: Claude, GPT-4o, DBRX, Llama, Mixtral

**In Lab 2:** The /api/ask endpoint uses Foundation Model API via AI Gateway to convert natural language to SQL.

### Workshop Quick Facts

**Catalog & Data**
- Catalog: workshop_vibe_coding
- Raw data: raw_data.store_transactions
- Products: raw_data.products
- Your schema: <your_username>

**Lab 1: Data Pipeline**
- PySpark transformations
- clean → enrich → aggregate
- Write gold: daily_store_metrics
- Deploy as scheduled DABs job

**Lab 2: Full-Stack App**
- FastAPI + HTML/Tailwind/htmx
- Dashboard with filters
- NL query via Foundation Model
- Deploy as Databricks App
diff --git a/projects/coles-vibe-workshop/quiz-app/app.py b/projects/coles-vibe-workshop/quiz-app/app.py
new file mode 100644
index 0000000..ab62dfc
--- /dev/null
+++ b/projects/coles-vibe-workshop/quiz-app/app.py
@@ -0,0 +1,422 @@
+"""Grocery Data Quiz — Interactive Databricks App for the Coles Vibe Coding Workshop."""
+
+from __future__ import annotations
+
+import asyncio
+import os
+import re
+import time
+from enum import Enum
+from typing import Any
+
+from fastapi import FastAPI, HTTPException, Request
+from fastapi.responses import HTMLResponse
+from fastapi.staticfiles import StaticFiles
+from pydantic import BaseModel
+from sse_starlette.sse import EventSourceResponse
+
+app = FastAPI(title="Grocery Data Quiz")
+
+
+# Team names are rendered in the shared host/player UI. Restrict to a safe
+# alphabet so a malicious attendee cannot submit markup that other players'
+# browsers execute when the leaderboard re-renders.
+TEAM_NAME_RE = re.compile(r"^[A-Za-z0-9 _-]{1,30}$")
+
+
+@app.middleware("http")
+async def _security_headers(request: Request, call_next):
+    response = await call_next(request)
+    # Defense-in-depth: restrict resource origins. Note script-src keeps
+    # 'unsafe-inline' (the page uses inline JS), so TEAM_NAME_RE above remains the real XSS guard.
+    response.headers.setdefault(
+        "Content-Security-Policy",
+        "default-src 'self'; script-src 'self' 'unsafe-inline'; "
+        "style-src 'self' 'unsafe-inline'; img-src 'self' data:; "
+        "connect-src 'self'",
+    )
+    response.headers.setdefault("X-Content-Type-Options", "nosniff")
+    response.headers.setdefault("Referrer-Policy", "strict-origin-when-cross-origin")
+    return response
+
+TIMER_SECONDS = int(os.getenv("QUIZ_TIMER_SECONDS", "30"))
+
+
+# ---------------------------------------------------------------------------
+# Data models
+# ---------------------------------------------------------------------------
+
+class Phase(str, Enum):
+    LOBBY = "lobby"
+    QUESTION = "question"
+    REVEALED = "revealed"
+    RESULTS = "results"
+
+
+class Question(BaseModel):
+    category: str
+    question: str
+    options: list[str]
+    correct: int
+    explanation: str
+
+
+class Team(BaseModel):
+    name: str
+    score: int = 0
+    answers: list[int | None] = []  # index of chosen option per question, None if unanswered
+    times: list[float | None] = []  # seconds remaining when answered
+
+
+QUESTIONS: list[Question] = [
+    # ------------------------------------------------------------------
+    # ROUND 1: GROCERY & RETAIL (estimation / gut-feel)
+    # ------------------------------------------------------------------
+    Question(
+        category="ESTIMATION",
+        question="Australians spend ~$15B/month on food retail. What's the approximate per-capita weekly grocery spend?",
+        options=["$90", "$170", "$230", "$310"],
+        correct=2,
+        explanation="$15B/month ÷ 26M people ÷ 4.3 weeks ≈ $134 per person — but that's individuals. Per household (~10M) it's ~$350/week. Per capita including eating out: ~$230.",
+    ),
+    Question(
+        category="DATA PUZZLE",
+        question="NSW has the highest food retail turnover. But which state has the highest turnover per capita?",
+        options=["ACT", "NSW", "Western Australia", "Northern Territory"],
+        correct=0,
+        explanation="ACT punches above its weight — high average incomes + no nearby regional alternatives = highest per-capita food spend. NSW wins on raw volume but not per-person.",
+    ),
+    Question(
+        category="GOTCHA",
+        question="You run a pipeline that ingests ABS retail data monthly. March 2024 shows a 40% spike in NSW food turnover. 
+ + + diff --git a/projects/coles-vibe-workshop/quiz-app/app.py b/projects/coles-vibe-workshop/quiz-app/app.py new file mode 100644 index 0000000..ab62dfc --- /dev/null +++ b/projects/coles-vibe-workshop/quiz-app/app.py @@ -0,0 +1,422 @@ +"""Grocery Data Quiz — Interactive Databricks App for the Coles Vibe Coding Workshop.""" + +from __future__ import annotations + +import asyncio +import os +import re +import time +from enum import Enum +from typing import Any + +from fastapi import FastAPI, HTTPException, Request +from fastapi.responses import HTMLResponse +from fastapi.staticfiles import StaticFiles +from pydantic import BaseModel +from sse_starlette.sse import EventSourceResponse + +app = FastAPI(title="Grocery Data Quiz") + + +# Team names are rendered in the shared host/player UI. Restrict to a safe +# alphabet so a malicious attendee cannot submit markup that other players' +# browsers execute when the leaderboard re-renders. +TEAM_NAME_RE = re.compile(r"^[A-Za-z0-9 _-]{1,30}$") + + +@app.middleware("http") +async def _security_headers(request: Request, call_next): + response = await call_next(request) + # Defense-in-depth: even if a future change reintroduces an innerHTML + # interpolation without escaping, inline script execution is blocked. + response.headers.setdefault( + "Content-Security-Policy", + "default-src 'self'; script-src 'self' 'unsafe-inline'; " + "style-src 'self' 'unsafe-inline'; img-src 'self' data:; " + "connect-src 'self'", + ) + response.headers.setdefault("X-Content-Type-Options", "nosniff") + response.headers.setdefault("Referrer-Policy", "strict-origin-when-cross-origin") + return response + +TIMER_SECONDS = int(os.getenv("QUIZ_TIMER_SECONDS", "30")) + + +# --------------------------------------------------------------------------- +# Data models +# --------------------------------------------------------------------------- + +class Phase(str, Enum): + LOBBY = "lobby" + QUESTION = "question" + REVEALED = "revealed" + RESULTS = "results" + + +class Question(BaseModel): + category: str + question: str + options: list[str] + correct: int + explanation: str + + +class Team(BaseModel): + name: str + score: int = 0 + answers: list[int | None] = [] # index of chosen option per question, None if unanswered + times: list[float | None] = [] # seconds remaining when answered + + +QUESTIONS: list[Question] = [ + # ------------------------------------------------------------------ + # ROUND 1: GROCERY & RETAIL (estimation / gut-feel) + # ------------------------------------------------------------------ + Question( + category="ESTIMATION", + question="Australians spend ~$15B/month on food retail. What's the approximate per-capita weekly grocery spend?", + options=["$90", "$170", "$230", "$310"], + correct=2, + explanation="$15B/month ÷ 26M people ÷ 4.3 weeks ≈ $134 per person — but that's individuals. Per household (~10M) it's ~$350/week. Per capita including eating out: ~$230.", + ), + Question( + category="DATA PUZZLE", + question="NSW has the highest food retail turnover. But which state has the highest turnover per capita?", + options=["ACT", "NSW", "Western Australia", "Northern Territory"], + correct=0, + explanation="ACT punches above its weight — high average incomes + no nearby regional alternatives = highest per-capita food spend. NSW wins on raw volume but not per-person.", + ), + Question( + category="GOTCHA", + question="You run a pipeline that ingests ABS retail data monthly. March 2024 shows a 40% spike in NSW food turnover. 
What's the most likely cause?", + options=[ + "A genuine demand surge", + "Easter fell in March that year", + "ABS revised historical data", + "Seasonal adjustment wasn't applied", + ], + correct=3, + explanation="ABS publishes both seasonally adjusted AND original series. A 40% spike in original data is normal — Easter timing shifts, school holidays, etc. Always check which series you're ingesting.", + ), + # ------------------------------------------------------------------ + # ROUND 2: AI & SYCOPHANCY (tricky / counterintuitive) + # ------------------------------------------------------------------ + Question( + category="AI TRAPS", + question="Stanford tested 11 LLMs (2026): when the user was 100% wrong, how often did the AI still agree?", + options=["12% of the time", "28% of the time", "51% of the time", "73% of the time"], + correct=2, + explanation="51% — worse than a coin flip. RLHF optimises for user satisfaction, not truth. This is why structural verification (BDD, schema contracts) matters more than asking 'did this work?'", + ), + Question( + category="AGENT PITFALLS", + question="You ask an AI agent to 'build a data pipeline.' It produces 12 files, a custom logging framework, and an abstract base class hierarchy. What went wrong?", + options=[ + "The model is too powerful", + "You didn't constrain scope in CLAUDE.md", + "The context window was too large", + "Nothing — that's good engineering", + ], + correct=1, + explanation="Overeagerness is the #1 agent pitfall. Without constraints like 'minimal solution, no abstractions unless asked,' agents default to over-engineering. One line in CLAUDE.md prevents this.", + ), + Question( + category="CRITICAL THINKING", + question="Your agent says 'I've verified the pipeline works correctly.' What should you do?", + options=[ + "Trust it — it ran the code", + "Ask it to show the git diff and test output", + "Ask it to argue why the pipeline might be WRONG", + "Both B and C", + ], + correct=3, + explanation="Never trust claims — demand proof (git diff, test results). Then apply the Karpathy Test: ask the model to argue the opposite. If it can demolish its own work, the work wasn't solid.", + ), + # ------------------------------------------------------------------ + # ROUND 3: DATA ENGINEERING (technical / applied) + # ------------------------------------------------------------------ + Question( + category="DATA ENGINEERING", + question="Your Bronze table has 500M rows. You add CLUSTER BY (date, store_id) to the Gold table. What does this actually do on serverless?", + options=[ + "Creates partitioned folders on disk", + "Sorts data within files for faster predicate pushdown", + "Creates an index like a traditional database", + "Nothing — CLUSTER BY is ignored on serverless", + ], + correct=1, + explanation="CLUSTER BY uses liquid clustering — it colocates data within files (Z-ordering) so queries with predicates on those columns skip irrelevant file groups. It's NOT partitioning.", + ), + Question( + category="PIPELINE GOTCHA", + question="Your streaming table reads from Auto Loader. You deploy, it works. Next day: 0 new rows. The source files are there. What's the most likely cause?", + options=[ + "The checkpoint was corrupted", + "The source path changed and Auto Loader's checkpoint tracks the OLD path", + "Serverless compute timed out", + "Unity Catalog permissions were revoked", + ], + correct=1, + explanation="Auto Loader checkpoints are path-specific. 
If someone moved the source files or changed the path in config, the checkpoint still watches the old location. Classic 'silent zero rows' bug.", + ), + Question( + category="DATA QUALITY", + question="You add @dp.expect_or_fail('valid_amount', 'amount > 0') to your Silver table. 3 rows fail. What happens?", + options=[ + "The 3 rows are dropped, rest succeeds", + "The entire pipeline update fails", + "The 3 rows are quarantined to an error table", + "A warning is logged but all rows pass through", + ], + correct=1, + explanation="expect_or_fail is strict — ANY row failing the constraint causes the entire update to fail. Use @dp.expect (warn only) or @dp.expect_or_drop (filter) for softer handling.", + ), + # ------------------------------------------------------------------ + # ROUND 4: YOUR OWN PLATFORM (Coles-specific) + # ------------------------------------------------------------------ + Question( + category="YOUR PLATFORM", + question="How many active tables does Coles have under Unity Catalog management right now?", + options=["~3,000", "~10,000", "~30,000", "~75,000"], + correct=2, + explanation="30,000+ active tables under UC governance as of March 2026. UC replaced Amundsen + Apache Atlas + JanusGraph as Coles' data governance layer.", + ), + Question( + category="YOUR PLATFORM", + question="What percentage of Coles' Databricks compute (DBUs) currently runs on Unity Catalog?", + options=["32%", "55%", "71%", "94%"], + correct=2, + explanation="71% of DBU consumption is on UC, against a target of 80%. The gap is mostly ETL workloads still on Classic compute — migrating those to UC-native pipelines is an active initiative.", + ), + # ------------------------------------------------------------------ + # ROUND 5: ESTIMATION & DEBATE (fun / no clear right answer) + # ------------------------------------------------------------------ + Question( + category="ESTIMATION", + question="How many unique products does a typical Coles supermarket carry?", + options=["~8,000", "~20,000", "~35,000", "~55,000"], + correct=1, + explanation="A standard Coles carries ~20,000-25,000 SKUs. A Coles Local might have ~8,000. Costco carries ~4,000. For context, a single ABS category like 'bread & cereals' covers thousands of these.", + ), + Question( + category="DEBATE", + question="An AI agent writes code 10x faster than a human. Measured studies show AI-authored code has 1.7x more major issues. Net effect on team velocity?", + options=[ + "Massive improvement — speed outweighs bugs", + "Roughly break-even after accounting for review and fixes", + "Net negative — bugs compound faster than speed gains", + "Depends entirely on the review process", + ], + correct=3, + explanation="This is the real answer. Speed without verification is just faster bugs. With BDD gates, CLAUDE.md constraints, and code review — the 10x speed compounds positively. Without them, it compounds negatively.", + ), + Question( + category="WILDCARD", + question="In 2024, Australia recalled 847 food products (FSANZ data). Which category had the most recalls?", + options=[ + "Dairy & eggs", + "Undeclared allergens in processed foods", + "Meat & poultry contamination", + "Foreign objects in bakery products", + ], + correct=1, + explanation="Undeclared allergens dominate food recalls (~40%). 
It's a labelling and supply chain data problem — exactly the kind of thing a well-governed data pipeline can help detect.", + ), +] + + +# --------------------------------------------------------------------------- +# In-memory quiz state +# --------------------------------------------------------------------------- + +class QuizState: + def __init__(self) -> None: + self.phase: Phase = Phase.LOBBY + self.current_question: int = 0 + self.teams: dict[str, Team] = {} + self.question_start_time: float = 0.0 + self._version: int = 0 # bumped on every mutation for SSE change detection + + def snapshot(self) -> dict[str, Any]: + """Return JSON-serialisable state for clients.""" + q = QUESTIONS[self.current_question] if self.current_question < len(QUESTIONS) else None + elapsed = time.time() - self.question_start_time if self.phase == Phase.QUESTION else 0 + time_left = max(0, TIMER_SECONDS - int(elapsed)) + + # Leaderboard sorted by score desc, then by fastest average answer time + leaderboard = sorted( + self.teams.values(), + key=lambda t: (-t.score, -sum(x or 0 for x in t.times)), + ) + + result: dict[str, Any] = { + "phase": self.phase.value, + "currentQuestion": self.current_question, + "totalQuestions": len(QUESTIONS), + "timeLeft": time_left, + "timerSeconds": TIMER_SECONDS, + "version": self._version, + "leaderboard": [ + {"name": t.name, "score": t.score, "answered": len([a for a in t.answers if a is not None])} + for t in leaderboard + ], + "teamCount": len(self.teams), + } + + if q: + result["question"] = { + "category": q.category, + "text": q.question, + "options": q.options, + } + if self.phase == Phase.REVEALED: + result["question"]["correct"] = q.correct + result["question"]["explanation"] = q.explanation + + return result + + def bump(self) -> None: + self._version += 1 + + +quiz = QuizState() + + +# --------------------------------------------------------------------------- +# API routes +# --------------------------------------------------------------------------- + +class JoinRequest(BaseModel): + name: str + + +class AnswerRequest(BaseModel): + team: str + answer: int + + +@app.post("/api/teams") +def join_team(req: JoinRequest) -> dict[str, Any]: + name = req.name.strip()[:30] + if not name: + raise HTTPException(400, "Team name required") + if not TEAM_NAME_RE.match(name): + raise HTTPException( + 400, + "Team name must be 1-30 characters of letters, digits, spaces, " + "hyphens, or underscores.", + ) + if name not in quiz.teams: + quiz.teams[name] = Team(name=name) + quiz.bump() + return {"ok": True, "name": name} + + +@app.get("/api/state") +def get_state() -> dict[str, Any]: + return quiz.snapshot() + + +@app.post("/api/answer") +def submit_answer(req: AnswerRequest) -> dict[str, Any]: + if quiz.phase != Phase.QUESTION: + raise HTTPException(400, "Not accepting answers right now") + + team = quiz.teams.get(req.team) + if not team: + raise HTTPException(404, "Team not found") + + qi = quiz.current_question + # Pad answers list if needed + while len(team.answers) <= qi: + team.answers.append(None) + team.times.append(None) + + if team.answers[qi] is not None: + return {"ok": True, "already_answered": True} + + elapsed = time.time() - quiz.question_start_time + time_remaining = max(0, TIMER_SECONDS - elapsed) + + team.answers[qi] = req.answer + team.times[qi] = time_remaining + + if req.answer == QUESTIONS[qi].correct: + # Score: base 100 + up to 100 bonus for speed + speed_bonus = int(100 * (time_remaining / TIMER_SECONDS)) + team.score += 100 + speed_bonus + + 
quiz.bump() + return {"ok": True, "already_answered": False} + + +@app.post("/api/control/{action}") +def control(action: str) -> dict[str, Any]: + if action == "start": + quiz.phase = Phase.QUESTION + quiz.current_question = 0 + quiz.question_start_time = time.time() + # Reset all teams + for t in quiz.teams.values(): + t.score = 0 + t.answers = [] + t.times = [] + quiz.bump() + elif action == "next": + if quiz.phase == Phase.REVEALED: + quiz.current_question += 1 + if quiz.current_question >= len(QUESTIONS): + quiz.phase = Phase.RESULTS + else: + quiz.phase = Phase.QUESTION + quiz.question_start_time = time.time() + quiz.bump() + elif action == "reveal": + if quiz.phase == Phase.QUESTION: + quiz.phase = Phase.REVEALED + quiz.bump() + elif action == "reset": + quiz.phase = Phase.LOBBY + quiz.current_question = 0 + quiz.teams.clear() + quiz.bump() + else: + raise HTTPException(400, f"Unknown action: {action}") + + return quiz.snapshot() + + +# --------------------------------------------------------------------------- +# SSE — real-time event stream +# --------------------------------------------------------------------------- + +@app.get("/api/events") +async def events(request: Request) -> EventSourceResponse: + last_version = -1 + + async def generate(): + nonlocal last_version + while True: + if await request.is_disconnected(): + break + if quiz._version != last_version: + last_version = quiz._version + yield {"event": "state", "data": str(quiz.snapshot()).replace("'", '"')} + await asyncio.sleep(0.5) + + return EventSourceResponse(generate()) + + +# --------------------------------------------------------------------------- +# Static files + SPA fallback +# --------------------------------------------------------------------------- + +app.mount("/static", StaticFiles(directory="static"), name="static") + + +@app.get("/", response_class=HTMLResponse) +@app.get("/play", response_class=HTMLResponse) +@app.get("/host", response_class=HTMLResponse) +async def spa(request: Request) -> HTMLResponse: + with open("static/index.html") as f: + return HTMLResponse(f.read()) diff --git a/projects/coles-vibe-workshop/quiz-app/app.yaml b/projects/coles-vibe-workshop/quiz-app/app.yaml new file mode 100644 index 0000000..8f99e4e --- /dev/null +++ b/projects/coles-vibe-workshop/quiz-app/app.yaml @@ -0,0 +1,9 @@ +command: + - uvicorn + - app:app + - --host=0.0.0.0 + - --port=8000 + +env: + - name: QUIZ_TIMER_SECONDS + value: "30" diff --git a/projects/coles-vibe-workshop/quiz-app/pyproject.toml b/projects/coles-vibe-workshop/quiz-app/pyproject.toml new file mode 100644 index 0000000..229d0a8 --- /dev/null +++ b/projects/coles-vibe-workshop/quiz-app/pyproject.toml @@ -0,0 +1,14 @@ +[project] +name = "grocery-quiz" +version = "0.1.0" +description = "Interactive workshop quiz app for the Coles Vibe Coding Workshop" +requires-python = ">=3.10" +dependencies = [ + "fastapi>=0.115", + "uvicorn[standard]>=0.32", + "sse-starlette>=2.0", +] + +[tool.ruff] +line-length = 100 +target-version = "py310" diff --git a/projects/coles-vibe-workshop/quiz-app/static/index.html b/projects/coles-vibe-workshop/quiz-app/static/index.html new file mode 100644 index 0000000..0ad5071 --- /dev/null +++ b/projects/coles-vibe-workshop/quiz-app/static/index.html @@ -0,0 +1,559 @@ + + + + + + Grocery Data Quiz + + + + + + + + +
[static/index.html body (559 lines of SPA markup) was stripped during extraction. Recoverable UI strings: a join screen with "🛒", "Grocery Data Quiz", and "Enter your team name to join"; a player header showing the team name and "0 pts"; and a 30-second countdown timer ("30") that appears on both the player and host views.]
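Note on `app.py`'s `/api/events` handler above: serialising the snapshot with `str(quiz.snapshot()).replace("'", '"')` breaks as soon as a value contains an apostrophe (several question explanations do, e.g. "What's the most likely cause?") or a Python literal such as `True` or `None`. A minimal corrected sketch of the handler: the version-polling loop is unchanged, only the payload serialisation moves to `json.dumps`.

```python
import asyncio
import json

from fastapi import Request
from sse_starlette.sse import EventSourceResponse


@app.get("/api/events")
async def events(request: Request) -> EventSourceResponse:
    last_version = -1

    async def generate():
        nonlocal last_version
        while True:
            if await request.is_disconnected():
                break
            if quiz._version != last_version:
                last_version = quiz._version
                # json.dumps handles embedded quotes, booleans, and None
                # correctly; str(...).replace("'", '"') does not.
                yield {"event": "state", "data": json.dumps(quiz.snapshot())}
            await asyncio.sleep(0.5)

    return EventSourceResponse(generate())
```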
diff --git a/projects/coles-vibe-workshop/quiz.html b/projects/coles-vibe-workshop/quiz.html new file mode 100644 index 0000000..7504de8 --- /dev/null +++ b/projects/coles-vibe-workshop/quiz.html @@ -0,0 +1,593 @@
[quiz.html body (593 lines of standalone single-page markup) was stripped during extraction. Recoverable UI strings: a title screen ("🛒 🍎 📊 🤖 🛒", "The Great Grocery Data Quiz", "How well do you know Australian retail data, AI, and the Databricks platform?", "Press Space or Enter to start"); a question view ("QUESTION 1 of 10", category badge "AUSTRALIAN RETAIL", a 30-second timer, "Loading..."); and a results view ("🎉", "Final Score", "0/10", "Nice work!").]
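Before leaving the quiz app: the scoring rule in `submit_answer` (base 100 points plus up to 100 speed bonus, scaled linearly by time remaining) is easy to sanity-check in isolation. A small worked sketch (`score_answer` is an illustrative extraction, not a function in the app):

```python
TIMER_SECONDS = 30


def score_answer(correct: bool, time_remaining: float) -> int:
    """Mirror of the scoring logic in submit_answer: base 100 + speed bonus."""
    if not correct:
        return 0
    speed_bonus = int(100 * (time_remaining / TIMER_SECONDS))
    return 100 + speed_bonus


# An instant correct answer earns the full 200; answering at the buzzer
# earns the base 100; a wrong answer earns nothing.
assert score_answer(True, 30.0) == 200
assert score_answer(True, 0.0) == 100
assert score_answer(False, 30.0) == 0
```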
+ + + + + + diff --git a/projects/coles-vibe-workshop/reference-implementation/CLAUDE.md b/projects/coles-vibe-workshop/reference-implementation/CLAUDE.md new file mode 100644 index 0000000..44fb4f8 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/CLAUDE.md @@ -0,0 +1,150 @@ +# CLAUDE.md — Grocery Intelligence Platform (Reference Implementation) + +## Team +- **Team Name:** reference +- **Schema:** workshop_vibe_coding.reference + +## Project +A production-grade data platform that ingests Australian retail and food price data from the ABS, transforms it through a medallion architecture (Bronze/Silver/Gold), and serves analytics via a FastAPI + htmx web app with natural language querying. This is the gold-standard reference implementation demonstrating Anthropic best practices: specs first, then tests, then code. + +## Tech Stack +- **Data processing:** PySpark (never pandas) +- **Pipeline framework:** Lakeflow Declarative Pipelines (`import databricks.declarative_pipelines as dp`) +- **Pipeline decorators:** `@dp.table` for streaming/batch tables, `@dp.materialized_view` for aggregation views +- **Data quality:** `@dp.expect("name", "SQL_EXPRESSION")`, `@dp.expect_or_fail()`, `@dp.expect_all()` +- **Web backend:** FastAPI with Pydantic models +- **Web frontend:** HTML + Tailwind CSS (CDN) + htmx (CDN) — no npm/node required +- **Database access:** `databricks-sql-connector` with parameterized queries only +- **Deployment:** Databricks Asset Bundles (`databricks bundle deploy`) +- **Testing:** pytest with PySpark test fixtures + +## Data Standards +- **Catalog:** `workshop_vibe_coding` +- **Schema:** `reference` +- **Architecture:** Bronze (raw ingestion) -> Silver (cleaned, decoded, typed) -> Gold (aggregated, analytics-ready) +- **Column naming:** snake_case for all table and column names +- **Date columns:** stored as DATE type (not string, not timestamp) +- **Currency/numeric columns:** DECIMAL(15,2) for monetary values, DOUBLE for indices +- **Nulls:** Bronze may contain nulls; Silver must filter or reject nulls; Gold must have zero nulls + +## Rules +- **TDD always:** Write tests BEFORE implementation. Tests are the spec. +- **PySpark not pandas:** Always use PySpark for data processing. Never use pandas. +- **Parameterized queries only:** Never use string concatenation for SQL. Use parameterized queries. +- **Minimal solutions:** Don't over-engineer. Write the simplest code that makes the tests pass. +- **Don't change passing tests:** Never modify a function that already has passing tests unless explicitly asked. +- **One function per file:** Each bronze/silver/gold transformation lives in its own file. +- **Given-When-Then:** Structure all tests with Given (setup), When (action), Then (assertions). +- **Small test DataFrames:** Use 5-10 rows per fixture. Don't mock the database. 
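To make the Given-When-Then and small-DataFrame rules above concrete, here is a minimal sketch of the test shape they imply. It reuses the `spark` and `sample_retail_csv` fixtures defined in `tests/conftest.py` below; the pure function `transform_retail_turnover` is illustrative only, since the actual silver code wraps its logic in a `@dp.table`-decorated function.

```python
# tests/test_pipeline.py (shape sketch only)
def test_silver_decodes_regions_and_dates(spark, sample_retail_csv):
    # Given: 6 raw ABS rows with numeric REGION codes and string TIME_PERIODs
    bronze_df = sample_retail_csv

    # When: the silver transformation runs (illustrative function name)
    silver_df = transform_retail_turnover(bronze_df)

    # Then: REGION "1" is decoded and every date parsed successfully
    states = {row["state"] for row in silver_df.collect()}
    assert "New South Wales" in states
    assert silver_df.filter(silver_df.date.isNull()).count() == 0
```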
+ +## Project Structure +``` +reference-implementation/ +├── CLAUDE.md # This file — project spec and rules +├── databricks.yml # DABs deployment config +├── resources/ +│ └── pipeline.yml # Lakeflow pipeline definition +├── src/ +│ ├── bronze/ +│ │ ├── abs_retail_trade.py # Ingest ABS Retail Trade API -> bronze +│ │ ├── abs_cpi_food.py # Ingest ABS CPI Food API -> bronze +│ │ └── fsanz_food_recalls.py # Ingest FSANZ food recalls -> bronze +│ ├── silver/ +│ │ ├── retail_turnover.py # Decode regions/industries, parse dates +│ │ ├── food_price_index.py # Decode indices/regions, rename columns +│ │ └── food_recalls.py # Clean dates, normalize states +│ └── gold/ +│ ├── retail_summary.py # Rolling averages, YoY growth +│ ├── food_inflation.py # YoY CPI change percentage +│ └── grocery_insights.py # Cross-source join: retail + CPI + recalls +├── tests/ +│ ├── conftest.py # PySpark fixtures, sample DataFrames +│ ├── test_pipeline.py # Bronze/Silver/Gold transformation tests +│ ├── test_app.py # FastAPI endpoint tests +│ └── test_quality.py # Data quality expectation tests +└── app/ + ├── app.py # FastAPI backend + ├── app.yaml # Databricks Apps config + ├── requirements.txt # Python dependencies + └── static/ + └── index.html # htmx + Tailwind frontend +``` + +## Data Sources + +| Source | API Endpoint | Format | Frequency | What It Contains | +|--------|-------------|--------|-----------|-----------------| +| **ABS Retail Trade** | `https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv` | CSV (SDMX) | Monthly | Retail turnover by state & industry since 2010 | +| **ABS Consumer Price Index** | `https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv` | CSV (SDMX) | Quarterly | Food price indices by state since 2010 | +| **FSANZ Food Recalls** | `https://www.foodstandards.gov.au/food-recalls/recalls` | HTML/RSS | Ad-hoc | Australian food recall notices | + +### ABS SDMX API Notes +- Add `&startPeriod=YYYY-MM&endPeriod=YYYY-MM` to filter date ranges +- Retail Trade uses monthly periods (`2024-01`), CPI uses quarterly (`2024-Q1`) +- Both return CSV with headers when `format=csv` is specified +- Rate-limited; cache responses where possible + +## Code Mappings (Silver Layer) + +### Region Codes (ABS REGION field) +| Code | State | +|------|-------| +| 1 | New South Wales | +| 2 | Victoria | +| 3 | Queensland | +| 4 | South Australia | +| 5 | Western Australia | +| 6 | Tasmania | +| 7 | Northern Territory | +| 8 | Australian Capital Territory | + +### Industry Codes (ABS INDUSTRY field — Retail Trade) +| Code | Industry | +|------|----------| +| 20 | Food retailing | +| 41 | Clothing, footwear and personal accessory retailing | +| 42 | Department stores | +| 43 | Other retailing | +| 44 | Cafes, restaurants and takeaway food services | +| 45 | Household goods retailing | + +### CPI Index Codes (ABS INDEX field — Consumer Price Index) +| Code | Index | +|------|-------| +| 10001 | All groups CPI | +| 20001 | Food and non-alcoholic beverages | + +## Table Schemas + +### Bronze Tables (raw ingestion, original column names preserved) + +**bronze_retail_trade:** +`DATAFLOW (STRING), FREQ (STRING), MEASURE (STRING), INDUSTRY (STRING), REGION (STRING), TIME_PERIOD (STRING), OBS_VALUE (DOUBLE), _ingested_at (TIMESTAMP)` + +**bronze_cpi_food:** +`DATAFLOW (STRING), FREQ (STRING), MEASURE (STRING), INDEX (STRING), REGION (STRING), TIME_PERIOD (STRING), OBS_VALUE (DOUBLE), _ingested_at (TIMESTAMP)` + +**bronze_food_recalls:** 
+`product (STRING), category (STRING), issue (STRING), date (STRING), state (STRING), url (STRING), _ingested_at (TIMESTAMP)` + +### Silver Tables (cleaned, decoded, typed) + +**silver_retail_turnover:** +`date (DATE), state (STRING), industry (STRING), turnover_millions (DECIMAL(15,2)), month (INT), year (INT), quarter (INT)` + +**silver_food_price_index:** +`date (DATE), state (STRING), index_name (STRING), cpi_index (DOUBLE), quarter (INT), year (INT)` + +**silver_food_recalls:** +`date (DATE), state (STRING), product (STRING), category (STRING), issue (STRING), url (STRING)` + +### Gold Tables (aggregated, analytics-ready) + +**gold_retail_summary:** +`date (DATE), state (STRING), industry (STRING), turnover_millions (DECIMAL(15,2)), turnover_3m_avg (DECIMAL(15,2)), turnover_12m_avg (DECIMAL(15,2)), yoy_growth_pct (DOUBLE)` + +**gold_food_inflation:** +`date (DATE), state (STRING), index_name (STRING), cpi_index (DOUBLE), yoy_change_pct (DOUBLE)` + +**gold_grocery_insights:** +`state (STRING), month (DATE), turnover_millions (DECIMAL(15,2)), yoy_growth_pct (DOUBLE), cpi_yoy_change (DOUBLE), recall_count (INT)` diff --git a/projects/coles-vibe-workshop/reference-implementation/README.md b/projects/coles-vibe-workshop/reference-implementation/README.md new file mode 100644 index 0000000..c634fb1 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/README.md @@ -0,0 +1,64 @@ +# Grocery Intelligence Platform — Reference Implementation + +Reference implementation for the Coles Vibe Coding Workshop. This is the +"gold standard" solution that participants work towards during the hackathon. + +Built following Anthropic best practices: **CLAUDE.md -> Tests -> Implementation**. + +## File Structure + +``` +reference-implementation/ +├── databricks.yml # DABs bundle config (includes resources/) +├── resources/ +│ └── pipeline.yml # Lakeflow pipeline + daily job definitions +├── src/ +│ ├── bronze/ # Raw ingestion (ABS retail, ABS CPI, FSANZ recalls) +│ ├── silver/ # Cleaned & decoded tables +│ └── gold/ # Aggregated analytics tables +├── tests/ # pytest suite (PySpark fixtures, no mocks) +├── app/ +│ ├── app.py # FastAPI application +│ ├── app.yaml # Databricks Apps config +│ ├── requirements.txt # Python dependencies +│ └── static/ +│ └── index.html # Single-page dashboard (Tailwind + htmx) +└── README.md +``` + +## Run Tests + +```bash +cd reference-implementation +pytest tests/ -x +``` + +## Deploy + +```bash +# Validate the bundle configuration +databricks bundle validate -t dev + +# Deploy pipeline, job, and app +databricks bundle deploy -t dev +``` + +## App Endpoints + +| Method | Path | Description | +|--------|----------------|------------------------------------------| +| GET | `/health` | Health check | +| GET | `/` | Dashboard UI | +| GET | `/api/metrics` | KPIs: top states, trend, YoY, inflation | +| GET | `/api/recalls` | Recent food recalls | +| POST | `/api/ask` | Natural language query (Foundation Model) | + +## Environment Variables + +The app expects these environment variables (set automatically by Databricks Apps): + +- `DATABRICKS_HOST` — Workspace hostname +- `DATABRICKS_TOKEN` — PAT or OAuth token +- `SQL_WAREHOUSE_ID` — SQL warehouse for queries +- `CATALOG` — Unity Catalog name (default: `workshop_vibe_coding`) +- `SCHEMA` — Schema name (default: `reference`) diff --git a/projects/coles-vibe-workshop/reference-implementation/app/app.py b/projects/coles-vibe-workshop/reference-implementation/app/app.py new file mode 100644 index 0000000..449d7a2 --- 
/dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/app/app.py @@ -0,0 +1,208 @@ +"""Grocery Intelligence Platform — FastAPI application. + +Serves a dashboard for Australian retail trade and food price analytics. +Queries gold-layer tables in Unity Catalog via databricks-sql-connector. +""" + +import json +import os +from contextlib import asynccontextmanager + +from fastapi import FastAPI, HTTPException, Request +from fastapi.middleware.cors import CORSMiddleware +from fastapi.responses import FileResponse, JSONResponse +from fastapi.staticfiles import StaticFiles +from databricks import sql as databricks_sql + + +# --------------------------------------------------------------------------- +# Configuration +# --------------------------------------------------------------------------- + +DATABRICKS_HOST = os.environ.get("DATABRICKS_HOST", "") +DATABRICKS_TOKEN = os.environ.get("DATABRICKS_TOKEN", "") +SQL_WAREHOUSE_ID = os.environ.get("SQL_WAREHOUSE_ID", "") +CATALOG = os.environ.get("CATALOG", "workshop_vibe_coding") +SCHEMA = os.environ.get("SCHEMA", "reference") + + +# --------------------------------------------------------------------------- +# Database helpers +# --------------------------------------------------------------------------- + +def _get_connection(): + """Return a new databricks-sql-connector connection.""" + return databricks_sql.connect( + server_hostname=DATABRICKS_HOST, + http_path=f"/sql/1.0/warehouses/{SQL_WAREHOUSE_ID}", + access_token=DATABRICKS_TOKEN, + ) + + +def _query(sql: str, params: dict | None = None) -> list[dict]: + """Execute *sql* and return rows as a list of dicts.""" + with _get_connection() as conn: + with conn.cursor() as cursor: + cursor.execute(sql, params) + columns = [desc[0] for desc in cursor.description] + return [dict(zip(columns, row)) for row in cursor.fetchall()] + + +# --------------------------------------------------------------------------- +# App +# --------------------------------------------------------------------------- + +@asynccontextmanager +async def lifespan(app: FastAPI): + yield + + +app = FastAPI(title="Grocery Intelligence Platform", lifespan=lifespan) + +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"], +) + +# Mount static files (index.html, etc.) 
+STATIC_DIR = os.path.join(os.path.dirname(__file__), "static") +app.mount("/static", StaticFiles(directory=STATIC_DIR), name="static") + + +# --------------------------------------------------------------------------- +# Routes +# --------------------------------------------------------------------------- + +@app.get("/health") +async def health(): + return {"status": "healthy"} + + +@app.get("/") +async def root(): + return FileResponse(os.path.join(STATIC_DIR, "index.html")) + + +@app.get("/api/metrics") +async def metrics(): + """Return headline KPIs from gold tables.""" + try: + top_states = _query(f""" + SELECT state, ROUND(SUM(turnover_millions), 1) AS total_turnover + FROM {CATALOG}.{SCHEMA}.retail_summary + WHERE month >= ADD_MONTHS(CURRENT_DATE(), -12) + GROUP BY state + ORDER BY total_turnover DESC + LIMIT 5 + """) + + monthly_trend = _query(f""" + SELECT month, ROUND(SUM(turnover_millions), 1) AS national_turnover + FROM {CATALOG}.{SCHEMA}.retail_summary + WHERE month >= ADD_MONTHS(CURRENT_DATE(), -12) + GROUP BY month + ORDER BY month + """) + + yoy_row = _query(f""" + SELECT ROUND(AVG(yoy_growth_pct), 2) AS avg_yoy_growth + FROM {CATALOG}.{SCHEMA}.retail_summary + WHERE month = ( + SELECT MAX(month) FROM {CATALOG}.{SCHEMA}.retail_summary + ) + """) + + inflation_row = _query(f""" + SELECT ROUND(AVG(yoy_change_pct), 2) AS avg_food_inflation + FROM {CATALOG}.{SCHEMA}.food_inflation_yoy + WHERE quarter = ( + SELECT MAX(quarter) FROM {CATALOG}.{SCHEMA}.food_inflation_yoy + ) + """) + + return { + "top_states": top_states, + "monthly_trend": [ + {"month": str(r["month"]), "national_turnover": r["national_turnover"]} + for r in monthly_trend + ], + "yoy_growth": yoy_row[0]["avg_yoy_growth"] if yoy_row else None, + "food_inflation": inflation_row[0]["avg_food_inflation"] if inflation_row else None, + } + except Exception as exc: + raise HTTPException(status_code=500, detail=str(exc)) + + +@app.get("/api/recalls") +async def recalls(): + """Return recent food recalls from gold table.""" + try: + rows = _query(f""" + SELECT product, category, issue, date, state, url + FROM {CATALOG}.{SCHEMA}.food_recalls_clean + ORDER BY date DESC + LIMIT 20 + """) + return [ + {k: str(v) if v is not None else None for k, v in row.items()} + for row in rows + ] + except Exception as exc: + raise HTTPException(status_code=500, detail=str(exc)) + + +@app.post("/api/ask") +async def ask(request: Request): + """Accept a natural language question, generate SQL via Foundation Model API, execute it.""" + body = await request.json() + question = body.get("question", "").strip() + if not question: + raise HTTPException(status_code=400, detail="question is required") + + # Build a prompt that constrains the model to generate valid SQL + system_prompt = f"""You are a SQL assistant for Australian grocery retail data. +Available tables in catalog `{CATALOG}`, schema `{SCHEMA}`: + +1. retail_summary (state STRING, industry STRING, month DATE, turnover_millions DOUBLE, + turnover_3m_avg DOUBLE, turnover_12m_avg DOUBLE, yoy_growth_pct DOUBLE) +2. food_inflation_yoy (state STRING, quarter DATE, cpi_index DOUBLE, yoy_change_pct DOUBLE) + +Rules: +- Return ONLY a single SQL SELECT statement, nothing else. +- Use fully qualified table names: {CATALOG}.{SCHEMA}.. +- Never use DROP, DELETE, INSERT, UPDATE, or ALTER. 
+""" + + try: + # Use the Foundation Model API via databricks-sql-connector + sql_gen_query = f""" + SELECT ai_query( + 'databricks-dbrx-instruct', + CONCAT('{system_prompt.replace("'", "''")}', '\\nQuestion: ', :question) + ) AS generated_sql + """ + result = _query(sql_gen_query, {"question": question}) + generated_sql = result[0]["generated_sql"].strip() if result else "" + + # Basic safety check + forbidden = ["DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "GRANT", "REVOKE"] + if any(kw in generated_sql.upper() for kw in forbidden): + raise HTTPException(status_code=400, detail="Generated SQL contains forbidden operations") + + # Execute the generated SQL + rows = _query(generated_sql) + return { + "question": question, + "sql": generated_sql, + "results": [ + {k: str(v) if v is not None else None for k, v in row.items()} + for row in rows + ], + } + except HTTPException: + raise + except Exception as exc: + raise HTTPException(status_code=500, detail=str(exc)) diff --git a/projects/coles-vibe-workshop/reference-implementation/app/app.yaml b/projects/coles-vibe-workshop/reference-implementation/app/app.yaml new file mode 100644 index 0000000..e3cf1ca --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/app/app.yaml @@ -0,0 +1,7 @@ +command: + - uvicorn + - app:app + - --host + - 0.0.0.0 + - --port + - "8000" diff --git a/projects/coles-vibe-workshop/reference-implementation/app/requirements.txt b/projects/coles-vibe-workshop/reference-implementation/app/requirements.txt new file mode 100644 index 0000000..4cfb4c9 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/app/requirements.txt @@ -0,0 +1,3 @@ +fastapi +uvicorn +databricks-sql-connector diff --git a/projects/coles-vibe-workshop/reference-implementation/app/static/index.html b/projects/coles-vibe-workshop/reference-implementation/app/static/index.html new file mode 100644 index 0000000..3147036 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/app/static/index.html @@ -0,0 +1,290 @@ + + + + + + Grocery Intelligence Platform + + + + + + + +
[app/static/index.html body (290 lines of Tailwind + htmx markup) was stripped during extraction. Recoverable UI strings: a header ("Grocery Intelligence Platform", "Australian Retail Trade & Food Price Analytics") with Databricks and Coles badges; a "Key Metrics" grid of four KPI cards, each initialised to "--": "Total Turnover (12m)" in AUD millions, "YoY Growth" for the latest month, "Food Inflation" latest quarter YoY, and "Food Recalls" recent count; a "National Turnover — Monthly Trend" chart panel ("Loading..."); an "Ask a Question" panel ("Ask in plain English — the platform generates SQL and queries your data."); a "Recent Food Recalls" table with Date / Product / Category / Issue / State columns; and the footer "Grocery Intelligence Platform — Coles Vibe Coding Workshop Reference Implementation".]
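One caveat on the `/api/ask` guard in `app/app.py` above: `any(kw in generated_sql.upper() for kw in forbidden)` is a substring test, so a harmless identifier like `updated_at` trips the `UPDATE` check. A slightly tighter sketch using word-boundary matching; it is still a denylist rather than a real SQL parser, so treat it as defence-in-depth only:

```python
import re

FORBIDDEN = ["DROP", "DELETE", "INSERT", "UPDATE", "ALTER", "GRANT", "REVOKE"]
_FORBIDDEN_RE = re.compile(r"\b(" + "|".join(FORBIDDEN) + r")\b", re.IGNORECASE)


def is_sql_allowed(generated_sql: str) -> bool:
    """Reject statements containing forbidden keywords as whole words."""
    return _FORBIDDEN_RE.search(generated_sql) is None


# Word boundaries stop `updated_at` from false-positiving on UPDATE,
# while a genuine UPDATE statement is still rejected.
assert is_sql_allowed("SELECT updated_at FROM t")
assert not is_sql_allowed("UPDATE t SET x = 1")
```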
+ + + + diff --git a/projects/coles-vibe-workshop/reference-implementation/databricks.yml b/projects/coles-vibe-workshop/reference-implementation/databricks.yml new file mode 100644 index 0000000..f04319e --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/databricks.yml @@ -0,0 +1,20 @@ +bundle: + name: grocery-intelligence-reference + +variables: + team_name: + default: reference + description: Team name used for resource naming + schema: + default: reference + description: Target schema within the workshop catalog + +include: + - resources/*.yml + +targets: + dev: + default: true + variables: + team_name: reference + schema: reference diff --git a/projects/coles-vibe-workshop/reference-implementation/resources/pipeline.yml b/projects/coles-vibe-workshop/reference-implementation/resources/pipeline.yml new file mode 100644 index 0000000..35b0a54 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/resources/pipeline.yml @@ -0,0 +1,44 @@ +resources: + pipelines: + grocery_pipeline: + name: grocery-intelligence-${var.team_name} + description: > + Lakeflow Declarative Pipeline for the Grocery Intelligence Platform. + Ingests ABS retail trade, ABS CPI, and FSANZ food recalls through + bronze -> silver -> gold medallion architecture. + serverless: true + catalog: workshop_vibe_coding + schema: ${var.schema} + development: true + libraries: + - notebook: + path: ../src/bronze/abs_retail_trade.py + - notebook: + path: ../src/bronze/abs_cpi_food.py + - notebook: + path: ../src/bronze/fsanz_food_recalls.py + - notebook: + path: ../src/silver/retail_turnover.py + - notebook: + path: ../src/silver/food_price_index.py + - notebook: + path: ../src/silver/food_recalls.py + - notebook: + path: ../src/gold/retail_summary.py + - notebook: + path: ../src/gold/food_inflation.py + - notebook: + path: ../src/gold/grocery_insights.py + + jobs: + grocery_daily_refresh: + name: grocery-daily-refresh-${var.team_name} + description: Daily refresh of the Grocery Intelligence Pipeline + trigger: + cron: + quartz_cron_expression: "0 0 6 * * ?" + timezone_id: "Australia/Sydney" + tasks: + - task_key: run_pipeline + pipeline_task: + pipeline_id: ${resources.pipelines.grocery_pipeline.id} diff --git a/projects/coles-vibe-workshop/reference-implementation/src/__init__.py b/projects/coles-vibe-workshop/reference-implementation/src/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/coles-vibe-workshop/reference-implementation/src/bronze/__init__.py b/projects/coles-vibe-workshop/reference-implementation/src/bronze/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_cpi_food.py b/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_cpi_food.py new file mode 100644 index 0000000..7d12d4a --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_cpi_food.py @@ -0,0 +1,43 @@ +"""Bronze layer: ABS Consumer Price Index (Food) ingestion. + +Ingests raw CSV data from the ABS SDMX API for quarterly CPI food +indices by state. No transformations — raw columns preserved. 
+""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import current_timestamp + + +@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") +@dp.table( + comment="Raw ABS CPI Food data from SDMX API", +) +def bronze_abs_cpi_food(): + """Ingest ABS CPI Food CSV from the SDMX API. + + Fetches quarterly food price indices by state since 2010. + Falls back to checkpoint table if the API is unavailable. + """ + spark = SparkSession.getActiveSession() + + api_url = ( + "https://api.data.abs.gov.au/data/ABS,CPI,2.0.0/" + "1.10001+20001.10.1+2+3+4+5+6+7+8.Q" + "?format=csv&startPeriod=2010-Q1" + ) + + try: + df = ( + spark.read.csv(api_url, header=True, inferSchema=True) + .withColumn("_ingested_at", current_timestamp()) + ) + if df.first() is None: + raise ValueError("Empty response from ABS CPI API") + return df + except Exception: + return ( + spark.read.table("workshop_vibe_coding.checkpoints.abs_cpi_food") + .withColumn("_ingested_at", current_timestamp()) + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_retail_trade.py b/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_retail_trade.py new file mode 100644 index 0000000..513d1a9 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/bronze/abs_retail_trade.py @@ -0,0 +1,45 @@ +"""Bronze layer: ABS Retail Trade ingestion. + +Ingests raw CSV data from the ABS SDMX API for monthly retail turnover +by state and industry. No transformations — raw columns preserved. +""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import current_timestamp + + +@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") +@dp.table( + comment="Raw ABS Retail Trade data from SDMX API", +) +def bronze_abs_retail_trade(): + """Ingest ABS Retail Trade CSV from the SDMX API. + + Fetches monthly retail turnover by state and industry since 2010. + Falls back to checkpoint table if the API is unavailable. + """ + spark = SparkSession.getActiveSession() + + api_url = ( + "https://api.data.abs.gov.au/data/ABS,RT,1.0.0/" + "M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M" + "?format=csv&startPeriod=2010-01" + ) + + try: + df = ( + spark.read.csv(api_url, header=True, inferSchema=True) + .withColumn("_ingested_at", current_timestamp()) + ) + # Verify we got data + if df.first() is None: + raise ValueError("Empty response from ABS Retail Trade API") + return df + except Exception: + # Fallback: read from pre-loaded checkpoint table + return ( + spark.read.table("workshop_vibe_coding.checkpoints.abs_retail_trade") + .withColumn("_ingested_at", current_timestamp()) + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/bronze/fsanz_food_recalls.py b/projects/coles-vibe-workshop/reference-implementation/src/bronze/fsanz_food_recalls.py new file mode 100644 index 0000000..968dd77 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/bronze/fsanz_food_recalls.py @@ -0,0 +1,48 @@ +"""Bronze layer: FSANZ Food Recalls ingestion. + +Ingests food recall data from the FSANZ website or a checkpoint table. +Raw columns preserved with no transformations. 
+""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import current_timestamp + + +@dp.expect("product_name_not_null", "product_name IS NOT NULL") +@dp.expect("recall_date_not_null", "recall_date IS NOT NULL") +@dp.table( + comment="Raw FSANZ food recall data", +) +def bronze_fsanz_food_recalls(): + """Ingest FSANZ food recalls. + + Attempts to read from the FSANZ website. Falls back to a checkpoint + table if the site is unreachable or blocked. + """ + spark = SparkSession.getActiveSession() + + # The FSANZ website requires web scraping which is unreliable in a + # pipeline context. Read from the pre-loaded checkpoint table. + # URL for reference: https://www.foodstandards.gov.au/consumer/recalls + try: + df = ( + spark.read.table("workshop_vibe_coding.checkpoints.fsanz_food_recalls") + .withColumn("_ingested_at", current_timestamp()) + ) + return df + except Exception: + # Create empty DataFrame with expected schema as last resort + from pyspark.sql.types import StructType, StructField, StringType + + schema = StructType([ + StructField("recall_date", StringType(), True), + StructField("product_name", StringType(), True), + StructField("hazard", StringType(), True), + StructField("company", StringType(), True), + StructField("states_affected", StringType(), True), + ]) + return ( + spark.createDataFrame([], schema) + .withColumn("_ingested_at", current_timestamp()) + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/gold/__init__.py b/projects/coles-vibe-workshop/reference-implementation/src/gold/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/coles-vibe-workshop/reference-implementation/src/gold/food_inflation_yoy.py b/projects/coles-vibe-workshop/reference-implementation/src/gold/food_inflation_yoy.py new file mode 100644 index 0000000..34bf92f --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/gold/food_inflation_yoy.py @@ -0,0 +1,62 @@ +"""Gold layer: Food Inflation Year-over-Year. + +Calculates year-over-year inflation rates from the silver food price +index, aggregated by state and quarter. +""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import ( + col, + lag, + round as round_, +) +from pyspark.sql.window import Window + + +@dp.expect("valid_yoy_change", "yoy_change_pct BETWEEN -50 AND 100") +@dp.table( + comment="Year-over-year food inflation rate by state and quarter", +) +def gold_food_inflation_yoy(): + """Calculate year-over-year CPI change percentage for food. + + Compares each quarter's CPI index to the same quarter one year prior + to derive the annual inflation rate by state. 
+ """ + spark = SparkSession.getActiveSession() + + df = spark.read.table("LIVE.silver_food_price_index") + + # Window: partition by state, order by date, look back 4 quarters + window_yoy = ( + Window.partitionBy("state") + .orderBy("date") + ) + + return ( + df.withColumn( + "prev_year_index", + lag("cpi_index", 4).over(window_yoy), + ) + .withColumn( + "yoy_change_pct", + round_( + (col("cpi_index") - col("prev_year_index")) + / col("prev_year_index") + * 100, + 2, + ), + ) + .filter(col("prev_year_index").isNotNull()) + .select( + "state", + "category", + "date", + "year", + "quarter", + "cpi_index", + "yoy_change_pct", + ) + .orderBy("state", "date") + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/gold/grocery_insights.py b/projects/coles-vibe-workshop/reference-implementation/src/gold/grocery_insights.py new file mode 100644 index 0000000..48f783c --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/gold/grocery_insights.py @@ -0,0 +1,100 @@ +"""Gold layer: Grocery Insights cross-source view. + +Joins retail turnover summary with food inflation data and food recall +counts to produce a single analytics-ready table. +""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import ( + col, + count, + coalesce, + lit, + trunc, + to_date, + month as month_fn, + year as year_fn, +) + + +@dp.expect("has_state", "state IS NOT NULL") +@dp.expect("has_date", "month IS NOT NULL") +@dp.table( + comment="Cross-source grocery insights joining retail, inflation, and recalls", +) +def gold_grocery_insights(): + """Join retail summary, food inflation, and food recalls into one view. + + - Retail data is monthly; CPI data is quarterly. Joins on state + quarter. + - Food recalls are left-joined by state + month (recall_count may be 0). 
+ """ + spark = SparkSession.getActiveSession() + + retail = spark.read.table("LIVE.gold_retail_summary") + inflation = spark.read.table("LIVE.gold_food_inflation_yoy") + + # Prepare retail: add quarter for join + retail_q = ( + retail + .withColumn("month", trunc("date", "month")) + .withColumn("join_quarter", trunc("date", "quarter")) + ) + + # Prepare inflation: add quarter key for join + inflation_q = ( + inflation + .withColumn("join_quarter", trunc("date", "quarter")) + .select( + col("state").alias("inf_state"), + col("join_quarter").alias("inf_quarter"), + "yoy_change_pct", + ) + .dropDuplicates(["inf_state", "inf_quarter"]) + ) + + # Join retail with inflation on state + quarter + joined = ( + retail_q.join( + inflation_q, + (retail_q["state"] == inflation_q["inf_state"]) + & (retail_q["join_quarter"] == inflation_q["inf_quarter"]), + "left", + ) + ) + + # Prepare food recalls — count by state and month + try: + recalls = spark.read.table("LIVE.bronze_fsanz_food_recalls") + recalls_monthly = ( + recalls + .withColumn("recall_month", trunc(to_date("recall_date"), "month")) + .groupBy( + col("states_affected").alias("recall_state"), + "recall_month", + ) + .agg(count("*").alias("recall_count")) + ) + + result = ( + joined.join( + recalls_monthly, + (joined["state"] == recalls_monthly["recall_state"]) + & (joined["month"] == recalls_monthly["recall_month"]), + "left", + ) + .withColumn("recall_count", coalesce("recall_count", lit(0))) + ) + except Exception: + # FSANZ data may not be available — continue without it + result = joined.withColumn("recall_count", lit(0)) + + return result.select( + "state", + "month", + "total_turnover", + "turnover_3m_avg", + "yoy_growth_pct", + col("yoy_change_pct").alias("cpi_yoy_change"), + "recall_count", + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/gold/retail_summary.py b/projects/coles-vibe-workshop/reference-implementation/src/gold/retail_summary.py new file mode 100644 index 0000000..157fa89 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/gold/retail_summary.py @@ -0,0 +1,80 @@ +"""Gold layer: Retail Summary aggregation. + +Aggregates silver retail turnover by state and month, then calculates +rolling averages and year-over-year growth rates. +""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import ( + col, + sum as sum_, + avg, + lag, + round as round_, +) +from pyspark.sql.window import Window + + +@dp.expect("valid_yoy", "yoy_growth_pct BETWEEN -100 AND 500") +@dp.expect("valid_rolling_avg", "turnover_3m_avg > 0") +@dp.table( + comment="Monthly retail summary by state with rolling averages and YoY growth", +) +def gold_retail_summary(): + """Aggregate retail turnover by state and month. 
+ + Calculates: + - total_turnover: sum of turnover across industries per state/month + - turnover_3m_avg: 3-month rolling average + - turnover_12m_avg: 12-month rolling average + - yoy_growth_pct: year-over-year growth percentage + """ + spark = SparkSession.getActiveSession() + + df = spark.read.table("LIVE.silver_retail_turnover") + + # Aggregate by state and date (monthly) + monthly = ( + df.groupBy("state", "date", "year", "month") + .agg(sum_("turnover_millions").alias("total_turnover")) + ) + + # Window for rolling averages — partition by state, order by date + window_3m = ( + Window.partitionBy("state") + .orderBy("date") + .rowsBetween(-2, 0) + ) + window_12m = ( + Window.partitionBy("state") + .orderBy("date") + .rowsBetween(-11, 0) + ) + + # Window for YoY — look back 12 months + window_yoy = ( + Window.partitionBy("state") + .orderBy("date") + ) + + return ( + monthly + .withColumn("turnover_3m_avg", round_(avg("total_turnover").over(window_3m), 2)) + .withColumn("turnover_12m_avg", round_(avg("total_turnover").over(window_12m), 2)) + .withColumn( + "prev_year_turnover", + lag("total_turnover", 12).over(window_yoy), + ) + .withColumn( + "yoy_growth_pct", + round_( + (col("total_turnover") - col("prev_year_turnover")) + / col("prev_year_turnover") + * 100, + 2, + ), + ) + .drop("prev_year_turnover") + .orderBy("state", "date") + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/silver/__init__.py b/projects/coles-vibe-workshop/reference-implementation/src/silver/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/coles-vibe-workshop/reference-implementation/src/silver/food_price_index.py b/projects/coles-vibe-workshop/reference-implementation/src/silver/food_price_index.py new file mode 100644 index 0000000..b953b02 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/silver/food_price_index.py @@ -0,0 +1,107 @@ +"""Silver layer: Food Price Index transformation. + +Decodes series codes, parses dates, and standardizes column names +from the bronze ABS CPI Food table. +""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import ( + col, + when, + lit, + regexp_replace, + to_date, + concat, + year, + quarter as quarter_fn, +) + + +REGION_DECODE = { + "1": "New South Wales", + "2": "Victoria", + "3": "Queensland", + "4": "South Australia", + "5": "Western Australia", + "6": "Tasmania", + "7": "Northern Territory", + "8": "Australian Capital Territory", +} + +INDEX_DECODE = { + "10001": "All groups CPI", + "20001": "Food and non-alcoholic beverages", +} + + +@dp.expect_or_fail("valid_date", "date IS NOT NULL") +@dp.expect( + "valid_state", + "state IN ('New South Wales','Victoria','Queensland','South Australia'," + "'Western Australia','Tasmania','Northern Territory','Australian Capital Territory')", +) +@dp.expect("positive_index", "cpi_index > 0") +@dp.table( + comment="Cleaned CPI food price index with decoded regions and categories", +) +def silver_food_price_index(): + """Transform bronze CPI food data into analytics-ready silver table. 
+ + - Decodes REGION codes to state names + - Decodes INDEX_CODE to readable category names + - Parses TIME_PERIOD (quarterly) to a proper date column + - Renames OBS_VALUE to cpi_index + - Filters to food-related categories only + """ + spark = SparkSession.getActiveSession() + + # Build decode expressions + region_expr = col("REGION").cast("string") + for code, name in REGION_DECODE.items(): + region_expr = when( + col("REGION").cast("string") == code, lit(name) + ).otherwise(region_expr) + + index_expr = col("INDEX").cast("string") + for code, name in INDEX_DECODE.items(): + index_expr = when( + col("INDEX").cast("string") == code, lit(name) + ).otherwise(index_expr) + + df = spark.read.table("LIVE.bronze_abs_cpi_food") + + return ( + df.withColumn("state", region_expr) + .withColumn("category", index_expr) + # Parse quarterly TIME_PERIOD (e.g. "2024-Q1") to date + .withColumn( + "date", + to_date( + concat( + col("TIME_PERIOD").substr(1, 4), + lit("-"), + when(col("TIME_PERIOD").contains("Q1"), lit("01")) + .when(col("TIME_PERIOD").contains("Q2"), lit("04")) + .when(col("TIME_PERIOD").contains("Q3"), lit("07")) + .when(col("TIME_PERIOD").contains("Q4"), lit("10")), + lit("-01"), + ), + "yyyy-MM-dd", + ), + ) + .withColumn("year", year("date")) + .withColumn("quarter", quarter_fn("date")) + .withColumnRenamed("OBS_VALUE", "cpi_index") + .select( + "state", + "category", + "date", + "year", + "quarter", + "cpi_index", + ) + .filter(col("date").isNotNull()) + .filter(col("cpi_index").isNotNull()) + .filter(col("category") == "Food and non-alcoholic beverages") + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/src/silver/retail_turnover.py b/projects/coles-vibe-workshop/reference-implementation/src/silver/retail_turnover.py new file mode 100644 index 0000000..48706fc --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/src/silver/retail_turnover.py @@ -0,0 +1,107 @@ +"""Silver layer: Retail Turnover transformation. + +Decodes region and industry codes, parses dates, and renames columns +from the bronze ABS Retail Trade table. 
+""" + +import databricks.declarative_pipelines as dp +from pyspark.sql import SparkSession +from pyspark.sql.functions import ( + col, + to_date, + month, + year, + quarter, + when, + concat, + lit, +) + + +REGION_DECODE = { + "1": "New South Wales", + "2": "Victoria", + "3": "Queensland", + "4": "South Australia", + "5": "Western Australia", + "6": "Tasmania", + "7": "Northern Territory", + "8": "Australian Capital Territory", +} + +INDUSTRY_DECODE = { + "20": "Food retailing", + "41": "Clothing, footwear and personal accessories", + "42": "Department stores", + "43": "Other retailing", + "44": "Cafes, restaurants and takeaway", + "45": "Household goods retailing", +} + + +def _build_when_chain(col_name, mapping): + """Build a chained when() expression from a dictionary mapping.""" + expr = None + for code, name in mapping.items(): + condition = when(col(col_name).cast("string") == code, lit(name)) + expr = condition if expr is None else expr.when( + col(col_name).cast("string") == code, lit(name) + ) + return expr.otherwise(col(col_name).cast("string")) + + +@dp.expect_or_fail("valid_date", "date IS NOT NULL") +@dp.expect( + "valid_state", + "state IN ('New South Wales','Victoria','Queensland','South Australia'," + "'Western Australia','Tasmania','Northern Territory','Australian Capital Territory')", +) +@dp.expect("positive_turnover", "turnover_millions > 0") +@dp.table( + comment="Cleaned retail turnover with decoded regions and industries", +) +def silver_retail_turnover(): + """Transform bronze retail trade data into analytics-ready silver table. + + - Decodes REGION codes to state names + - Decodes INDUSTRY codes to readable industry names + - Parses TIME_PERIOD to a proper date column + - Renames OBS_VALUE to turnover_millions + """ + spark = SparkSession.getActiveSession() + + # Build decode expressions + region_expr = col("REGION").cast("string") + for code, name in REGION_DECODE.items(): + region_expr = when( + col("REGION").cast("string") == code, lit(name) + ).otherwise(region_expr) + + industry_expr = col("INDUSTRY").cast("string") + for code, name in INDUSTRY_DECODE.items(): + industry_expr = when( + col("INDUSTRY").cast("string") == code, lit(name) + ).otherwise(industry_expr) + + df = spark.read.table("LIVE.bronze_abs_retail_trade") + + return ( + df.withColumn("state", region_expr) + .withColumn("industry", industry_expr) + .withColumn("date", to_date(concat(col("TIME_PERIOD"), lit("-01")), "yyyy-MM-dd")) + .withColumn("year", year("date")) + .withColumn("month", month("date")) + .withColumn("quarter", quarter("date")) + .withColumnRenamed("OBS_VALUE", "turnover_millions") + .select( + "state", + "industry", + "date", + "year", + "month", + "quarter", + "turnover_millions", + ) + .filter(col("date").isNotNull()) + .filter(col("turnover_millions").isNotNull()) + ) diff --git a/projects/coles-vibe-workshop/reference-implementation/tests/__init__.py b/projects/coles-vibe-workshop/reference-implementation/tests/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/projects/coles-vibe-workshop/reference-implementation/tests/conftest.py b/projects/coles-vibe-workshop/reference-implementation/tests/conftest.py new file mode 100644 index 0000000..cd987f2 --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/tests/conftest.py @@ -0,0 +1,270 @@ +""" +Shared pytest fixtures for the Grocery Intelligence Platform reference implementation. 
+ +Provides a SparkSession and sample DataFrames matching the ABS API schemas +for bronze, silver, and gold layer testing. +""" + +import os +import tempfile +from datetime import date +from decimal import Decimal + +import pytest +from pyspark.sql import SparkSession +from pyspark.sql.types import ( + DateType, + DecimalType, + DoubleType, + IntegerType, + StringType, + StructField, + StructType, + TimestampType, +) + + +@pytest.fixture(scope="session") +def spark(tmp_path_factory): + """Create a local SparkSession for testing. Shared across all tests.""" + warehouse_dir = str(tmp_path_factory.mktemp("spark-warehouse")) + session = ( + SparkSession.builder.master("local[*]") + .appName("grocery-intelligence-tests") + .config("spark.sql.warehouse.dir", warehouse_dir) + .config("spark.sql.shuffle.partitions", "4") + .config("spark.driver.bindAddress", "127.0.0.1") + .config("spark.ui.enabled", "false") + .getOrCreate() + ) + yield session + session.stop() + + +# ── Bronze Fixtures ────────────────────────────────────────────────── + + +@pytest.fixture +def sample_retail_csv(spark): + """ + Sample ABS Retail Trade data matching the raw API CSV schema. + 6 rows covering 3 states, 2 industries, 2 months. + + REGION codes: 1=NSW, 2=VIC, 3=QLD + INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal + """ + schema = StructType( + [ + StructField("DATAFLOW", StringType()), + StructField("FREQ", StringType()), + StructField("MEASURE", StringType()), + StructField("INDUSTRY", StringType()), + StructField("REGION", StringType()), + StructField("TIME_PERIOD", StringType()), + StructField("OBS_VALUE", DoubleType()), + ] + ) + data = [ + ("ABS:RT", "M", "M1", "20", "1", "2024-01", 4500.0), + ("ABS:RT", "M", "M1", "20", "2", "2024-01", 3800.0), + ("ABS:RT", "M", "M1", "20", "3", "2024-01", 2900.0), + ("ABS:RT", "M", "M1", "41", "1", "2024-01", 1200.0), + ("ABS:RT", "M", "M1", "20", "1", "2024-02", 4600.0), + ("ABS:RT", "M", "M1", "20", "2", "2024-02", 3900.0), + ] + return spark.createDataFrame(data, schema) + + +@pytest.fixture +def sample_retail_csv_with_nulls(spark): + """ + Sample ABS Retail Trade data containing null values. + Used to test null filtering in silver layer. + """ + schema = StructType( + [ + StructField("DATAFLOW", StringType()), + StructField("FREQ", StringType()), + StructField("MEASURE", StringType()), + StructField("INDUSTRY", StringType()), + StructField("REGION", StringType()), + StructField("TIME_PERIOD", StringType()), + StructField("OBS_VALUE", DoubleType()), + ] + ) + data = [ + ("ABS:RT", "M", "M1", "20", "1", "2024-01", 4500.0), + ("ABS:RT", "M", "M1", "20", "2", "2024-01", None), + ("ABS:RT", "M", "M1", "20", "3", None, 2900.0), + ("ABS:RT", "M", "M1", "20", "1", "2024-02", 4600.0), + ("ABS:RT", "M", "M1", "41", "1", "2024-01", 1200.0), + ] + return spark.createDataFrame(data, schema) + + +@pytest.fixture +def sample_cpi_csv(spark): + """ + Sample ABS CPI Food data matching the raw API CSV schema. + 5 rows covering 2 states, 2 index types, 2 quarters. 
+ + INDEX codes: 10001=All groups CPI, 20001=Food and non-alcoholic beverages + REGION codes: 1=NSW, 2=VIC + """ + schema = StructType( + [ + StructField("DATAFLOW", StringType()), + StructField("FREQ", StringType()), + StructField("MEASURE", StringType()), + StructField("INDEX", StringType()), + StructField("REGION", StringType()), + StructField("TIME_PERIOD", StringType()), + StructField("OBS_VALUE", DoubleType()), + ] + ) + data = [ + ("ABS:CPI", "Q", "1", "10001", "1", "2024-Q1", 136.4), + ("ABS:CPI", "Q", "1", "20001", "1", "2024-Q1", 142.8), + ("ABS:CPI", "Q", "1", "10001", "2", "2024-Q1", 134.9), + ("ABS:CPI", "Q", "1", "20001", "2", "2024-Q1", 140.2), + ("ABS:CPI", "Q", "1", "10001", "1", "2024-Q2", 137.1), + ] + return spark.createDataFrame(data, schema) + + +# ── Silver Fixtures ────────────────────────────────────────────────── + + +@pytest.fixture +def sample_silver_retail(spark): + """ + Silver-layer retail turnover data: decoded states, industries, proper dates. + 24 rows covering 2 years of monthly data for NSW Food retailing. + Used for gold-layer aggregation tests. + """ + schema = StructType( + [ + StructField("date", DateType()), + StructField("state", StringType()), + StructField("industry", StringType()), + StructField("turnover_millions", DoubleType()), + StructField("month", IntegerType()), + StructField("year", IntegerType()), + StructField("quarter", IntegerType()), + ] + ) + # 24 months of NSW Food retailing data: Jan 2023 - Dec 2024 + # Turnover grows ~2% month-on-month with seasonal variation + data = [ + (date(2023, 1, 1), "New South Wales", "Food retailing", 4200.0, 1, 2023, 1), + (date(2023, 2, 1), "New South Wales", "Food retailing", 4100.0, 2, 2023, 1), + (date(2023, 3, 1), "New South Wales", "Food retailing", 4300.0, 3, 2023, 1), + (date(2023, 4, 1), "New South Wales", "Food retailing", 4250.0, 4, 2023, 2), + (date(2023, 5, 1), "New South Wales", "Food retailing", 4150.0, 5, 2023, 2), + (date(2023, 6, 1), "New South Wales", "Food retailing", 4350.0, 6, 2023, 2), + (date(2023, 7, 1), "New South Wales", "Food retailing", 4400.0, 7, 2023, 3), + (date(2023, 8, 1), "New South Wales", "Food retailing", 4300.0, 8, 2023, 3), + (date(2023, 9, 1), "New South Wales", "Food retailing", 4500.0, 9, 2023, 3), + (date(2023, 10, 1), "New South Wales", "Food retailing", 4450.0, 10, 2023, 4), + (date(2023, 11, 1), "New South Wales", "Food retailing", 4550.0, 11, 2023, 4), + (date(2023, 12, 1), "New South Wales", "Food retailing", 4800.0, 12, 2023, 4), + (date(2024, 1, 1), "New South Wales", "Food retailing", 4500.0, 1, 2024, 1), + (date(2024, 2, 1), "New South Wales", "Food retailing", 4400.0, 2, 2024, 1), + (date(2024, 3, 1), "New South Wales", "Food retailing", 4600.0, 3, 2024, 1), + (date(2024, 4, 1), "New South Wales", "Food retailing", 4550.0, 4, 2024, 2), + (date(2024, 5, 1), "New South Wales", "Food retailing", 4450.0, 5, 2024, 2), + (date(2024, 6, 1), "New South Wales", "Food retailing", 4650.0, 6, 2024, 2), + (date(2024, 7, 1), "New South Wales", "Food retailing", 4700.0, 7, 2024, 3), + (date(2024, 8, 1), "New South Wales", "Food retailing", 4600.0, 8, 2024, 3), + (date(2024, 9, 1), "New South Wales", "Food retailing", 4800.0, 9, 2024, 3), + (date(2024, 10, 1), "New South Wales", "Food retailing", 4750.0, 10, 2024, 4), + (date(2024, 11, 1), "New South Wales", "Food retailing", 4850.0, 11, 2024, 4), + (date(2024, 12, 1), "New South Wales", "Food retailing", 5100.0, 12, 2024, 4), + ] + return spark.createDataFrame(data, schema) + + +@pytest.fixture +def 
sample_silver_cpi(spark): + """ + Silver-layer CPI food price index data: decoded states, indices, proper dates. + 8 quarters covering 2 years for NSW, Food and non-alcoholic beverages. + Used for gold-layer inflation calculation tests. + """ + schema = StructType( + [ + StructField("date", DateType()), + StructField("state", StringType()), + StructField("index_name", StringType()), + StructField("cpi_index", DoubleType()), + StructField("quarter", IntegerType()), + StructField("year", IntegerType()), + ] + ) + # 8 quarters of NSW Food CPI data: Q1 2023 - Q4 2024 + # CPI rises ~1-2% per quarter (food inflation) + data = [ + (date(2023, 1, 1), "New South Wales", "Food and non-alcoholic beverages", 130.0, 1, 2023), + (date(2023, 4, 1), "New South Wales", "Food and non-alcoholic beverages", 132.5, 2, 2023), + (date(2023, 7, 1), "New South Wales", "Food and non-alcoholic beverages", 134.8, 3, 2023), + (date(2023, 10, 1), "New South Wales", "Food and non-alcoholic beverages", 136.2, 4, 2023), + (date(2024, 1, 1), "New South Wales", "Food and non-alcoholic beverages", 137.8, 1, 2024), + (date(2024, 4, 1), "New South Wales", "Food and non-alcoholic beverages", 139.5, 2, 2024), + (date(2024, 7, 1), "New South Wales", "Food and non-alcoholic beverages", 141.0, 3, 2024), + (date(2024, 10, 1), "New South Wales", "Food and non-alcoholic beverages", 142.3, 4, 2024), + ] + return spark.createDataFrame(data, schema) + + +# ── Gold Fixtures ──────────────────────────────────────────────────── + + +@pytest.fixture +def sample_gold_retail(spark): + """ + Gold-layer retail summary with rolling averages and YoY growth. + Used for data quality tests. + """ + schema = StructType( + [ + StructField("date", DateType()), + StructField("state", StringType()), + StructField("industry", StringType()), + StructField("turnover_millions", DoubleType()), + StructField("turnover_3m_avg", DoubleType()), + StructField("turnover_12m_avg", DoubleType()), + StructField("yoy_growth_pct", DoubleType()), + ] + ) + data = [ + (date(2024, 1, 1), "New South Wales", "Food retailing", 4500.0, 4483.3, 4362.5, 7.14), + (date(2024, 2, 1), "New South Wales", "Food retailing", 4400.0, 4566.7, 4370.8, 7.32), + (date(2024, 3, 1), "New South Wales", "Food retailing", 4600.0, 4500.0, 4391.7, 6.98), + (date(2024, 1, 1), "Victoria", "Food retailing", 3800.0, 3766.7, 3675.0, 5.56), + (date(2024, 2, 1), "Victoria", "Food retailing", 3700.0, 3800.0, 3683.3, 5.71), + ] + return spark.createDataFrame(data, schema) + + +@pytest.fixture +def sample_gold_cpi(spark): + """ + Gold-layer food inflation with YoY change percentage. + Used for data quality tests. 
+ """ + schema = StructType( + [ + StructField("date", DateType()), + StructField("state", StringType()), + StructField("index_name", StringType()), + StructField("cpi_index", DoubleType()), + StructField("yoy_change_pct", DoubleType()), + ] + ) + data = [ + (date(2024, 1, 1), "New South Wales", "Food and non-alcoholic beverages", 137.8, 6.0), + (date(2024, 4, 1), "New South Wales", "Food and non-alcoholic beverages", 139.5, 5.28), + (date(2024, 7, 1), "New South Wales", "Food and non-alcoholic beverages", 141.0, 4.60), + (date(2024, 10, 1), "New South Wales", "Food and non-alcoholic beverages", 142.3, 4.48), + ] + return spark.createDataFrame(data, schema) diff --git a/projects/coles-vibe-workshop/reference-implementation/tests/test_app.py b/projects/coles-vibe-workshop/reference-implementation/tests/test_app.py new file mode 100644 index 0000000..50cffff --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/tests/test_app.py @@ -0,0 +1,232 @@ +""" +FastAPI application tests for the Grocery Intelligence Platform. + +These tests define the SPEC for the web app endpoints. +Written BEFORE implementation — the agent builds code to make these pass. + +Uses httpx AsyncClient with ASGITransport for async testing. +""" + +import pytest +from httpx import ASGITransport, AsyncClient + + +# ═══════════════════════════════════════════════════════════════════════ +# HEALTH ENDPOINT +# ═══════════════════════════════════════════════════════════════════════ + + +class TestHealthEndpoint: + """Tests for the /health endpoint.""" + + @pytest.mark.asyncio + async def test_health_endpoint_returns_200(self): + """GET /health returns 200 with status ok.""" + # Given: the FastAPI app is running + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request the health endpoint + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/health") + + # Then: response is 200 with expected JSON + assert response.status_code == 200 + data = response.json() + assert data["status"] == "ok" + + @pytest.mark.asyncio + async def test_health_endpoint_json_content_type(self): + """GET /health returns application/json content type.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request health + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/health") + + # Then: content type is JSON + assert "application/json" in response.headers["content-type"] + + +# ═══════════════════════════════════════════════════════════════════════ +# METRICS ENDPOINT +# ═══════════════════════════════════════════════════════════════════════ + + +class TestMetricsEndpoint: + """Tests for the /api/metrics endpoint.""" + + @pytest.mark.asyncio + async def test_metrics_endpoint_returns_json(self): + """GET /api/metrics returns 200 with a list of metric records.""" + # Given: the FastAPI app with a database connection + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request metrics + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/api/metrics") + + # Then: response is 200 with a JSON list + assert response.status_code == 200 + data = response.json() + assert isinstance(data, list), "Metrics should return a list" + + @pytest.mark.asyncio + async def test_metrics_endpoint_record_keys(self): + """Each metric record has required keys: state, 
industry, month, turnover_millions, yoy_growth_pct.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request metrics + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/api/metrics") + + # Then: each record has expected keys + data = response.json() + if len(data) > 0: + required_keys = {"state", "industry", "month", "turnover_millions", "yoy_growth_pct"} + first_record = data[0] + assert required_keys.issubset( + set(first_record.keys()) + ), f"Missing keys: {required_keys - set(first_record.keys())}" + + @pytest.mark.asyncio + async def test_metrics_endpoint_state_filter(self): + """GET /api/metrics?state=New+South+Wales filters results to NSW only.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request metrics filtered by state + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/api/metrics", params={"state": "New South Wales"}) + + # Then: all records are for NSW + assert response.status_code == 200 + data = response.json() + for record in data: + assert record["state"] == "New South Wales", ( + f"Expected NSW only, got {record['state']}" + ) + + @pytest.mark.asyncio + async def test_metrics_endpoint_invalid_date_returns_error(self): + """GET /api/metrics with invalid date format returns 400 or 422.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we request metrics with an invalid date + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.get("/api/metrics", params={"start_date": "not-a-date"}) + + # Then: error response + assert response.status_code in (400, 422), ( + f"Invalid date should return 400 or 422, got {response.status_code}" + ) + + +# ═══════════════════════════════════════════════════════════════════════ +# ASK ENDPOINT +# ═══════════════════════════════════════════════════════════════════════ + + +class TestAskEndpoint: + """Tests for the /api/ask natural language query endpoint.""" + + @pytest.mark.asyncio + async def test_ask_endpoint_returns_response(self): + """POST /api/ask with a valid question returns answer and sql_query.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we ask a question about the data + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.post( + "/api/ask", + json={"question": "Which state has the highest retail turnover?"}, + ) + + # Then: response has answer and sql_query + assert response.status_code == 200 + data = response.json() + assert "answer" in data, "Response should contain 'answer' key" + assert "sql_query" in data, "Response should contain 'sql_query' key" + assert isinstance(data["answer"], str), "Answer should be a string" + assert len(data["answer"]) > 0, "Answer should not be empty" + + @pytest.mark.asyncio + async def test_ask_endpoint_empty_question_returns_error(self): + """POST /api/ask with empty question returns 400 or 422.""" + # Given: the FastAPI app + from app.app import app + + transport = ASGITransport(app=app) + + # When: we send an empty question + async with AsyncClient(transport=transport, base_url="http://test") as client: + response = await client.post("/api/ask", json={"question": ""}) + + # Then: error response + assert response.status_code in 
(400, 422), (
+            f"Empty question should return 400 or 422, got {response.status_code}"
+        )
+
+    @pytest.mark.asyncio
+    async def test_ask_endpoint_missing_question_returns_error(self):
+        """POST /api/ask without question field returns 422."""
+        # Given: the FastAPI app
+        from app.app import app
+
+        transport = ASGITransport(app=app)
+
+        # When: we send a request missing the question field
+        async with AsyncClient(transport=transport, base_url="http://test") as client:
+            response = await client.post("/api/ask", json={})
+
+        # Then: validation error
+        assert response.status_code == 422, (
+            f"Missing question should return 422, got {response.status_code}"
+        )
+
+
+# ═══════════════════════════════════════════════════════════════════════
+# STATIC FILES
+# ═══════════════════════════════════════════════════════════════════════
+
+
+class TestStaticFiles:
+    """Tests for serving the frontend HTML."""
+
+    @pytest.mark.asyncio
+    async def test_root_returns_html(self):
+        """GET / returns HTML content (the frontend)."""
+        # Given: the FastAPI app serving static files
+        from app.app import app
+
+        transport = ASGITransport(app=app)
+
+        # When: we request the root
+        async with AsyncClient(transport=transport, base_url="http://test") as client:
+            response = await client.get("/")
+
+        # Then: returns HTML
+        assert response.status_code == 200
+        assert "text/html" in response.headers.get("content-type", ""), (
+            "Root should serve HTML content"
+        )
+        assert "<html" in response.text.lower(), (
+            "Root should serve the frontend markup"
+        )
diff --git a/projects/coles-vibe-workshop/reference-implementation/tests/test_transforms.py b/projects/coles-vibe-workshop/reference-implementation/tests/test_transforms.py
new file mode 100644
--- /dev/null
+++ b/projects/coles-vibe-workshop/reference-implementation/tests/test_transforms.py
+"""
+Transform tests for the Grocery Intelligence Platform.
+
+These tests define the SPEC for the silver and gold transforms. They run
+the transform functions against the conftest fixtures and assert on
+schemas and concrete Australian data values.
+"""
+
+from datetime import date
+
+from pyspark.sql import functions as F
+from pyspark.sql.types import DoubleType
+
+
+# ── Gold: Retail Summary ──────────────────────────────────────────────
+
+
+class TestGoldRetailSummary:
+    """Tests for gold-layer retail summary aggregation."""
+
+    def test_gold_retail_summary_has_2024_rows(self, spark, sample_silver_retail):
+        """Gold table contains 2024 rows once a prior year of history exists."""
+        # Given: 24 months of silver retail data
+        silver_df = sample_silver_retail
+
+        # When: gold aggregation is applied
+        from src.gold.retail_summary import transform_gold_retail_summary
+
+        result = transform_gold_retail_summary(silver_df)
+
+        # Then: 2024 rows are present
+        result_2024 = result.filter(F.year("date") == 2024)
+        assert result_2024.count() > 0, "Should have 2024 data with YoY metrics"
+
+    def test_gold_retail_summary_has_growth(self, spark, sample_silver_retail):
+        """Gold table has yoy_growth_pct column with numeric values."""
+        # Given: 24 months of silver retail data
+        silver_df = sample_silver_retail
+
+        # When: gold aggregation is applied
+        from src.gold.retail_summary import transform_gold_retail_summary
+
+        result = transform_gold_retail_summary(silver_df)
+
+        # Then: yoy_growth_pct exists and is numeric (DoubleType)
+        assert "yoy_growth_pct" in result.columns
+        growth_field = [f for f in result.schema.fields if f.name == "yoy_growth_pct"][0]
+        assert isinstance(
+            growth_field.dataType, DoubleType
+        ), f"yoy_growth_pct should be DoubleType, got {growth_field.dataType}"
+
+        # And: Jan 2024 NSW growth = (4500 - 4200) / 4200 * 100 = 7.14%
+        jan_2024 = result.filter(F.col("date") == date(2024, 1, 1)).first()
+        if jan_2024 is not None:
+            assert jan_2024.yoy_growth_pct is not None, "YoY growth should not be null"
+            assert abs(jan_2024.yoy_growth_pct - 7.14) < 0.5, (
+                f"Jan 2024 YoY growth should be ~7.14%, got {jan_2024.yoy_growth_pct}"
+            )
+
+    def test_gold_retail_summary_rolling_averages(self, spark, sample_silver_retail):
+        """Gold table has correct 3-month and 12-month rolling averages."""
+        # Given: 24 months of silver retail data
+        silver_df = sample_silver_retail
+
+        # When: gold aggregation is applied
+        from src.gold.retail_summary import transform_gold_retail_summary
+
+        result = transform_gold_retail_summary(silver_df)
+
+        # Then: rolling averages exist
+        assert "turnover_3m_avg" in result.columns
+        assert "turnover_12m_avg" in result.columns
+
+        # Mar 2024: 3-month avg = (4500 + 4400 + 4600) / 3 = 4500.0
+        mar_2024 = result.filter(F.col("date") == date(2024, 3, 1)).first()
+        if mar_2024 is not None:
+            assert mar_2024.turnover_3m_avg is not None
+            assert abs(mar_2024.turnover_3m_avg - 4500.0) < 10.0, (
+                f"Mar 2024 3m avg should be ~4500.0, got {mar_2024.turnover_3m_avg}"
+            )
+
+    def test_gold_retail_summary_no_nulls_in_2024(self, spark, sample_silver_retail):
+        """Gold table 
has no null values for months with sufficient history.""" + # Given: 24 months of silver data (2023 + 2024) + silver_df = sample_silver_retail + + # When: gold aggregation is applied + from src.gold.retail_summary import transform_gold_retail_summary + + result = transform_gold_retail_summary(silver_df) + + # Then: 2024 rows should have no null growth or rolling averages + result_2024 = result.filter(F.year("date") == 2024) + null_growth = result_2024.filter(F.col("yoy_growth_pct").isNull()).count() + null_3m = result_2024.filter(F.col("turnover_3m_avg").isNull()).count() + assert null_growth == 0, "2024 rows should have no null YoY growth" + assert null_3m == 0, "2024 rows should have no null 3-month averages" + + +# ── Gold: Food Inflation ───────────────────────────────────────────── + + +class TestGoldFoodInflation: + """Tests for gold-layer food inflation YoY calculation.""" + + def test_gold_food_inflation_yoy(self, spark, sample_silver_cpi): + """Gold table calculates year-over-year CPI change percentage correctly.""" + # Given: 8 quarters of silver CPI data (2023-2024) + silver_df = sample_silver_cpi + + # When: gold inflation calculation is applied + from src.gold.food_inflation import transform_gold_food_inflation + + result = transform_gold_food_inflation(silver_df) + + # Then: has yoy_change_pct column + assert "yoy_change_pct" in result.columns + + # Q1 2024 YoY = (137.8 - 130.0) / 130.0 * 100 = 6.0% + q1_2024 = result.filter( + (F.col("date") == date(2024, 1, 1)) + & (F.col("state") == "New South Wales") + ).first() + assert q1_2024 is not None, "Should have Q1 2024 inflation data" + assert abs(q1_2024.yoy_change_pct - 6.0) < 0.1, ( + f"Q1 2024 YoY change should be ~6.0%, got {q1_2024.yoy_change_pct}" + ) + + def test_gold_food_inflation_schema(self, spark, sample_silver_cpi): + """Gold food inflation table has expected columns.""" + # Given: silver CPI data + silver_df = sample_silver_cpi + + # When: gold inflation calculation is applied + from src.gold.food_inflation import transform_gold_food_inflation + + result = transform_gold_food_inflation(silver_df) + + # Then: expected columns present + expected_columns = {"date", "state", "index_name", "cpi_index", "yoy_change_pct"} + assert set(result.columns) == expected_columns + + def test_gold_food_inflation_only_2024_has_yoy(self, spark, sample_silver_cpi): + """YoY change is only calculated when prior year data exists.""" + # Given: silver CPI data starting Q1 2023 + silver_df = sample_silver_cpi + + # When: gold inflation calculation is applied + from src.gold.food_inflation import transform_gold_food_inflation + + result = transform_gold_food_inflation(silver_df) + + # Then: 2024 quarters should have YoY values (have 2023 for comparison) + result_2024 = result.filter(F.year("date") == 2024) + null_yoy = result_2024.filter(F.col("yoy_change_pct").isNull()).count() + assert null_yoy == 0, "2024 quarters should all have YoY values" + + def test_gold_food_inflation_q4_calculation(self, spark, sample_silver_cpi): + """Q4 2024 YoY correctly compares with Q4 2023.""" + # Given: silver CPI data with Q4 2023 = 136.2 and Q4 2024 = 142.3 + silver_df = sample_silver_cpi + + # When: gold inflation calculation is applied + from src.gold.food_inflation import transform_gold_food_inflation + + result = transform_gold_food_inflation(silver_df) + + # Then: Q4 2024 YoY = (142.3 - 136.2) / 136.2 * 100 = 4.48% + q4_2024 = result.filter( + (F.col("date") == date(2024, 10, 1)) + & (F.col("state") == "New South Wales") + ).first() + assert 
q4_2024 is not None, "Should have Q4 2024 data" + assert abs(q4_2024.yoy_change_pct - 4.48) < 0.1, ( + f"Q4 2024 YoY should be ~4.48%, got {q4_2024.yoy_change_pct}" + ) + + +# ── Gold: Cross-Source Join ────────────────────────────────────────── + + +class TestGoldCrossSourceJoin: + """Tests for gold-layer cross-source analysis joining retail + CPI.""" + + def test_gold_cross_source_join(self, spark, sample_silver_retail, sample_silver_cpi): + """Gold grocery insights joins retail turnover with CPI inflation.""" + # Given: silver retail data (monthly) and silver CPI data (quarterly) + retail_df = sample_silver_retail + cpi_df = sample_silver_cpi + + # When: cross-source join is applied + from src.gold.grocery_insights import transform_gold_grocery_insights + + result = transform_gold_grocery_insights(retail_df, cpi_df) + + # Then: result has columns from both sources + assert "turnover_millions" in result.columns, "Should have retail turnover" + assert "cpi_yoy_change" in result.columns or "cpi_index" in result.columns, ( + "Should have CPI data" + ) + assert "state" in result.columns, "Should have state column" + + # And: NSW data is present (exists in both sources) + nsw_rows = result.filter(F.col("state") == "New South Wales").count() + assert nsw_rows > 0, "Should have NSW data from join" + + def test_gold_cross_source_join_columns(self, spark, sample_silver_retail, sample_silver_cpi): + """Cross-source join has expected output schema.""" + # Given: silver data from both sources + retail_df = sample_silver_retail + cpi_df = sample_silver_cpi + + # When: cross-source join is applied + from src.gold.grocery_insights import transform_gold_grocery_insights + + result = transform_gold_grocery_insights(retail_df, cpi_df) + + # Then: expected columns present + required_columns = {"state", "month", "turnover_millions", "yoy_growth_pct", "cpi_yoy_change"} + assert required_columns.issubset( + set(result.columns) + ), f"Missing columns: {required_columns - set(result.columns)}" diff --git a/projects/coles-vibe-workshop/reference-implementation/tests/test_quality.py b/projects/coles-vibe-workshop/reference-implementation/tests/test_quality.py new file mode 100644 index 0000000..2b2e86e --- /dev/null +++ b/projects/coles-vibe-workshop/reference-implementation/tests/test_quality.py @@ -0,0 +1,364 @@ +""" +Data quality tests for the Grocery Intelligence Platform. + +These tests verify data quality expectations that would be enforced +by @dp.expect() in the Lakeflow pipeline. They run against the +PySpark fixtures to validate the quality rules independently. + +Structure: Given-When-Then with concrete Australian data values. 
+""" + +from datetime import date + +from pyspark.sql import functions as F +from pyspark.sql.types import ( + DateType, + DoubleType, + IntegerType, + StringType, + StructField, + StructType, +) + + +# ═══════════════════════════════════════════════════════════════════════ +# RETAIL TURNOVER QUALITY +# ═══════════════════════════════════════════════════════════════════════ + + +class TestRetailTurnoverQuality: + """Data quality rules for retail turnover data.""" + + def test_retail_turnover_positive(self, spark, sample_gold_retail): + """All turnover values must be strictly positive (> 0).""" + # Given: gold retail summary data with Australian state turnover figures + df = sample_gold_retail + + # When: we check for non-positive turnover values + non_positive = df.filter(F.col("turnover_millions") <= 0).count() + + # Then: no rows have zero or negative turnover + assert non_positive == 0, ( + f"Found {non_positive} rows with non-positive turnover. " + "All retail turnover figures should be > 0" + ) + + def test_retail_turnover_reasonable_range(self, spark, sample_gold_retail): + """Turnover values fall within a reasonable range for Australian retail (0-50000 million AUD).""" + # Given: gold retail summary data + df = sample_gold_retail + + # When: we check for out-of-range values + out_of_range = df.filter( + (F.col("turnover_millions") < 0) | (F.col("turnover_millions") > 50000) + ).count() + + # Then: all values are within expected bounds + assert out_of_range == 0, ( + f"Found {out_of_range} rows with turnover outside 0-50000M range. " + "Australian monthly state retail turnover should be within this range." + ) + + def test_retail_rolling_averages_positive(self, spark, sample_gold_retail): + """Rolling averages must be positive when turnover is positive.""" + # Given: gold retail summary with rolling averages + df = sample_gold_retail + + # When: we check rolling average values + non_positive_3m = df.filter( + (F.col("turnover_3m_avg").isNotNull()) & (F.col("turnover_3m_avg") <= 0) + ).count() + non_positive_12m = df.filter( + (F.col("turnover_12m_avg").isNotNull()) & (F.col("turnover_12m_avg") <= 0) + ).count() + + # Then: rolling averages are positive where they exist + assert non_positive_3m == 0, "3-month rolling average should be positive" + assert non_positive_12m == 0, "12-month rolling average should be positive" + + def test_retail_yoy_growth_reasonable(self, spark, sample_gold_retail): + """YoY growth percentage is within a reasonable range (-100% to 500%).""" + # Given: gold retail summary with YoY growth + df = sample_gold_retail + + # When: we check for extreme growth values + extreme_growth = df.filter( + (F.col("yoy_growth_pct").isNotNull()) + & ((F.col("yoy_growth_pct") < -100) | (F.col("yoy_growth_pct") > 500)) + ).count() + + # Then: no extreme values + assert extreme_growth == 0, ( + f"Found {extreme_growth} rows with YoY growth outside -100% to 500% range" + ) + + +# ═══════════════════════════════════════════════════════════════════════ +# DATE QUALITY +# ═══════════════════════════════════════════════════════════════════════ + + +class TestDateQuality: + """Data quality rules for date columns across all tables.""" + + def test_dates_in_range_retail(self, spark, sample_gold_retail): + """All retail dates fall between 2010 and 2026.""" + # Given: gold retail data with dates + df = sample_gold_retail + + # When: we check date ranges + out_of_range = df.filter( + (F.col("date") < date(2010, 1, 1)) | (F.col("date") > date(2026, 12, 31)) + ).count() + + # Then: all 
dates are within expected range + assert out_of_range == 0, ( + f"Found {out_of_range} rows with dates outside 2010-2026 range. " + "ABS data should only contain dates from 2010 onwards." + ) + + def test_dates_in_range_cpi(self, spark, sample_gold_cpi): + """All CPI dates fall between 2010 and 2026.""" + # Given: gold CPI data with dates + df = sample_gold_cpi + + # When: we check date ranges + out_of_range = df.filter( + (F.col("date") < date(2010, 1, 1)) | (F.col("date") > date(2026, 12, 31)) + ).count() + + # Then: all dates are within expected range + assert out_of_range == 0, ( + f"Found {out_of_range} rows with dates outside 2010-2026 range" + ) + + def test_dates_are_date_type_retail(self, spark, sample_gold_retail): + """Date column in retail data is proper DateType, not a string.""" + # Given: gold retail data + df = sample_gold_retail + + # When: we inspect the date field type + date_field = [f for f in df.schema.fields if f.name == "date"][0] + + # Then: it's DateType + assert isinstance(date_field.dataType, DateType), ( + f"date column should be DateType, got {date_field.dataType}" + ) + + def test_dates_are_date_type_cpi(self, spark, sample_gold_cpi): + """Date column in CPI data is proper DateType, not a string.""" + # Given: gold CPI data + df = sample_gold_cpi + + # When: we inspect the date field type + date_field = [f for f in df.schema.fields if f.name == "date"][0] + + # Then: it's DateType + assert isinstance(date_field.dataType, DateType), ( + f"date column should be DateType, got {date_field.dataType}" + ) + + def test_no_null_dates(self, spark, sample_gold_retail): + """No null date values in gold tables.""" + # Given: gold retail data + df = sample_gold_retail + + # When: we check for null dates + null_dates = df.filter(F.col("date").isNull()).count() + + # Then: no nulls + assert null_dates == 0, f"Found {null_dates} null date values" + + +# ═══════════════════════════════════════════════════════════════════════ +# DUPLICATE DETECTION +# ═══════════════════════════════════════════════════════════════════════ + + +class TestDuplicateDetection: + """Data quality rules for detecting duplicate rows.""" + + def test_no_duplicate_rows_retail(self, spark, sample_gold_retail): + """No exact duplicate rows in gold retail summary.""" + # Given: gold retail data + df = sample_gold_retail + total_count = df.count() + + # When: we remove duplicates and count + distinct_count = df.distinct().count() + + # Then: counts match (no duplicates) + assert total_count == distinct_count, ( + f"Found {total_count - distinct_count} duplicate rows in retail data" + ) + + def test_no_duplicate_rows_cpi(self, spark, sample_gold_cpi): + """No exact duplicate rows in gold CPI data.""" + # Given: gold CPI data + df = sample_gold_cpi + total_count = df.count() + + # When: we remove duplicates and count + distinct_count = df.distinct().count() + + # Then: counts match + assert total_count == distinct_count, ( + f"Found {total_count - distinct_count} duplicate rows in CPI data" + ) + + def test_no_duplicate_state_date_combinations_retail(self, spark, sample_gold_retail): + """No duplicate (state, industry, date) combinations in retail data.""" + # Given: gold retail data + df = sample_gold_retail + total_count = df.count() + + # When: we count distinct combinations of key columns + distinct_keys = df.select("state", "industry", "date").distinct().count() + + # Then: key combinations are unique + assert total_count == distinct_keys, ( + f"Found {total_count - distinct_keys} duplicate (state, 
industry, date) combinations" + ) + + def test_no_duplicate_state_date_combinations_cpi(self, spark, sample_gold_cpi): + """No duplicate (state, index_name, date) combinations in CPI data.""" + # Given: gold CPI data + df = sample_gold_cpi + total_count = df.count() + + # When: we count distinct combinations of key columns + distinct_keys = df.select("state", "index_name", "date").distinct().count() + + # Then: key combinations are unique + assert total_count == distinct_keys, ( + f"Found {total_count - distinct_keys} duplicate (state, index_name, date) combinations" + ) + + +# ═══════════════════════════════════════════════════════════════════════ +# REGION CODE VALIDATION +# ═══════════════════════════════════════════════════════════════════════ + + +class TestRegionCodeValidation: + """Data quality rules for region/state validation.""" + + VALID_STATES = { + "New South Wales", + "Victoria", + "Queensland", + "South Australia", + "Western Australia", + "Tasmania", + "Northern Territory", + "Australian Capital Territory", + } + + def test_region_codes_valid_retail(self, spark, sample_gold_retail): + """All state values in retail data are valid Australian states.""" + # Given: gold retail data with decoded state names + df = sample_gold_retail + + # When: we collect all distinct states + states = {row.state for row in df.select("state").distinct().collect()} + + # Then: all states are in the valid set + invalid_states = states - self.VALID_STATES + assert len(invalid_states) == 0, ( + f"Found invalid state names: {invalid_states}. " + f"Valid states: {self.VALID_STATES}" + ) + + def test_region_codes_valid_cpi(self, spark, sample_gold_cpi): + """All state values in CPI data are valid Australian states.""" + # Given: gold CPI data with decoded state names + df = sample_gold_cpi + + # When: we collect all distinct states + states = {row.state for row in df.select("state").distinct().collect()} + + # Then: all states are in the valid set + invalid_states = states - self.VALID_STATES + assert len(invalid_states) == 0, ( + f"Found invalid state names: {invalid_states}" + ) + + def test_no_numeric_region_codes_retail(self, spark, sample_gold_retail): + """No numeric region codes remain in gold retail data (all should be decoded).""" + # Given: gold retail data + df = sample_gold_retail + + # When: we check for numeric-looking state values + states = {row.state for row in df.select("state").distinct().collect()} + + # Then: no numeric strings + numeric_states = {s for s in states if s.isdigit()} + assert len(numeric_states) == 0, ( + f"Found undecoded numeric region codes: {numeric_states}. " + "All region codes should be decoded to state names." 
+ ) + + def test_no_null_states(self, spark, sample_gold_retail): + """No null state values in gold data.""" + # Given: gold retail data + df = sample_gold_retail + + # When: we check for null states + null_states = df.filter(F.col("state").isNull()).count() + + # Then: no nulls + assert null_states == 0, f"Found {null_states} null state values" + + +# ═══════════════════════════════════════════════════════════════════════ +# CPI INDEX QUALITY +# ═══════════════════════════════════════════════════════════════════════ + + +class TestCpiIndexQuality: + """Data quality rules for CPI index values.""" + + def test_cpi_index_positive(self, spark, sample_gold_cpi): + """All CPI index values must be positive.""" + # Given: gold CPI data + df = sample_gold_cpi + + # When: we check for non-positive CPI values + non_positive = df.filter(F.col("cpi_index") <= 0).count() + + # Then: all CPI values are positive + assert non_positive == 0, ( + f"Found {non_positive} rows with non-positive CPI index values" + ) + + def test_cpi_index_reasonable_range(self, spark, sample_gold_cpi): + """CPI index values are within a reasonable range (50-300).""" + # Given: gold CPI data (Australian CPI is typically 100-200 range) + df = sample_gold_cpi + + # When: we check for out-of-range values + out_of_range = df.filter( + (F.col("cpi_index") < 50) | (F.col("cpi_index") > 300) + ).count() + + # Then: all values are within expected bounds + assert out_of_range == 0, ( + f"Found {out_of_range} rows with CPI index outside 50-300 range. " + "Australian CPI index should be within this range." + ) + + def test_cpi_yoy_change_reasonable(self, spark, sample_gold_cpi): + """YoY CPI change percentage is within a reasonable range (-20% to 30%).""" + # Given: gold CPI data with YoY change + df = sample_gold_cpi + + # When: we check for extreme inflation/deflation + extreme_change = df.filter( + (F.col("yoy_change_pct").isNotNull()) + & ((F.col("yoy_change_pct") < -20) | (F.col("yoy_change_pct") > 30)) + ).count() + + # Then: no extreme values (Australian food inflation stays within -20% to 30%) + assert extreme_change == 0, ( + f"Found {extreme_change} rows with YoY CPI change outside -20% to 30% range" + ) diff --git a/projects/coles-vibe-workshop/requirements/analyst.txt b/projects/coles-vibe-workshop/requirements/analyst.txt new file mode 100644 index 0000000..2b57628 --- /dev/null +++ b/projects/coles-vibe-workshop/requirements/analyst.txt @@ -0,0 +1,7 @@ +# Analyst track +-r shared.txt + +fastapi>=0.115 +uvicorn[standard]>=0.32 +sse-starlette>=2.0 +httpx>=0.28 diff --git a/projects/coles-vibe-workshop/requirements/de.txt b/projects/coles-vibe-workshop/requirements/de.txt new file mode 100644 index 0000000..9a7edee --- /dev/null +++ b/projects/coles-vibe-workshop/requirements/de.txt @@ -0,0 +1,14 @@ +# Data Engineering track +-r shared.txt + +pyspark>=4.1 +delta-spark>=4.1 +beautifulsoup4>=4.12 +lxml>=5.0 +httpx>=0.28 +fastapi>=0.100 +uvicorn[standard]>=0.20 + +# NOTE: databricks-declarative-pipelines is only available on Databricks +# cluster runtime, not on public PyPI. The @dp.table / @dp.expect decorators +# will work when your pipeline runs on a Databricks cluster. 
diff --git a/projects/coles-vibe-workshop/requirements/ds.txt b/projects/coles-vibe-workshop/requirements/ds.txt new file mode 100644 index 0000000..567ff2a --- /dev/null +++ b/projects/coles-vibe-workshop/requirements/ds.txt @@ -0,0 +1,7 @@ +# Data Science track +-r shared.txt + +pyspark>=4.1 +mlflow>=3.11 +scikit-learn>=1.8 +xgboost>=3.2 diff --git a/projects/coles-vibe-workshop/requirements/shared.txt b/projects/coles-vibe-workshop/requirements/shared.txt new file mode 100644 index 0000000..9750ff4 --- /dev/null +++ b/projects/coles-vibe-workshop/requirements/shared.txt @@ -0,0 +1,7 @@ +# Shared dependencies for all workshop tracks +pytest>=8.0 +pytest-asyncio>=1.0 +behave>=1.2 +ruff>=0.15 +databricks-sdk>=0.102 +databricks-sql-connector>=3.4 diff --git a/projects/coles-vibe-workshop/slides.html b/projects/coles-vibe-workshop/slides.html new file mode 100644 index 0000000..f35f694 --- /dev/null +++ b/projects/coles-vibe-workshop/slides.html @@ -0,0 +1,3095 @@ + + + + + + Vibe Coding Workshop — Databricks + + + + + + +
+ +
+
N for notes
+
+
+

Speaker Notes

+ +
+
+
+ + +
+
+ +
+
+
+
+ +
+ + + +
+ +

Vibe Coding

+

The Great Grocery Data Challenge
Agentic Software Development with Databricks

+
+
+ Coles Group + Data & AI Engineering Team · 6.5-Hour Hackathon +
+

David O'Keeffe · Solutions Architect, Databricks

+
+
+ + +
+
+ Before We Start +

I Built This for Coles

+
+

Before any theory -- this is a real system, close to production-ready, running on Databricks.

+ +
+
+
4.5M
+

Flybuys Members

+

13 personalized offers every week

+
+
+
857
+

Commits

+

37K lines, 4 product lines, 340+ tests

+
+
+
7
+

Weeks, Spare Time

+

One engineer, AI-assisted, nights & weekends

+
+
+
R→Py
+

Full Migration

+

Azure Batch VMs to native Databricks

+
+
+ +
+

Your Weekly Specials

+

I'm not a data scientist. I knew almost nothing about this system. The agent helped me learn the domain, port the R code, and solve the hard problems: trillion-row tables, memory overruns, distributed LightGBM on Ray.

+
+
+ +
+ + +
+
+ 9:30 AM · Ice Breaker +

Grocery Data Predictions

+

How well does your team know Australian retail? No phones -- just gut instinct.

+
+

Write your team's predictions on your card. Answers revealed in Show & Tell using the pipeline you build.

+ +
+
+ Q1 +

Which Australian state has the highest monthly food retail turnover?

+
+ Click to reveal + New South Wales -- ~$3.5B/month in food retailing. +
+
+
+ Q2 +

By what percentage have Australian food prices (CPI) increased since January 2020?

+
+ Click to reveal + Approximately 25-30%. The biggest spike hit in 2022-23. +
+
+
+ Q3 +

How much does the average Australian household spend on groceries per week?

+
+ Click to reveal + Around $200-$220/week -- up from ~$160 pre-pandemic. +
+
+
+
+
+ Q4 +

What month of the year do Australians spend the most on retail?

+
+ Click to reveal + December -- Christmas shopping drives a massive spike across all categories. +
+
+
+ Q5 +

Which food category has seen the largest price increase since 2020 -- dairy, meat, fruit, or bread & cereals?

+
+ Click to reveal + Dairy & eggs -- up over 30% since 2020, outpacing all other food groups. +
+
+
+
+ +
+ + +
+
+ THE GREAT GROCERY DATA CHALLENGE +

Today's Agenda

+
+ +
+
+

Morning: Foundations & Lab 1

+
+
+ 9:30 AM +
Welcome & Icebreaker
Form teams, Grocery Data Predictions
+
+
+ 9:45 AM +
Theory: Vibe Coding, CLAUDE.md, BDD
Specs, testing, context windows
+
+
+ 10:30 AM +
Break
15 min
+
+
+ 10:45 AM +
Lab 0: Guided Hands-On
CLAUDE.md, first test, bronze ingest -- everyone together
+
+
+ 11:30 AM +
Skills, MCP, Genie + AI/BI
Practical tools for the labs
+
+
+ 11:50 AM +
Track Briefing
Choose DE, DS, or Analyst
+
+
+ 12:00 PM +
Lab 1
60 min track-specific build
+
+
+
+
+

Afternoon: Lab 2 & Demos

+
+
+ 1:00 PM +
Show & Tell
Quick demos, prediction reveal
+
+
+ 1:15 PM +
Lunch
45 min
+
+
+ 2:00 PM +
Lab 2
60 min track-specific build
+
+
+ 3:00 PM +
Demos
Team presentations + voting
+
+
+ 3:30 PM +
Takeaways + Close
Reflect, next steps
+
+
+ 4:00 PM +
End
Go build something great
+
+
+
+
+
+ Coming up: APJ Building Intelligent Apps Hackathon — $68K in prizes, build period starts 1 May. We'll talk about it at the end. +
+
+ +
+ + +
+
+ SETTING THE SCENE +

Databricks Today

+
+

This isn't the Databricks you onboarded with. The platform now covers everything from raw ingestion to serving AI apps.

+ +
+ Databricks platform stack showing layers from open formats to applications +
+ +
+
+

MCP Servers

+

Connect any AI agent to every layer of the platform

+
+
+

Proprietary Model Serving

+

Anthropic, OpenAI, Google -- served through AI Gateway with guardrails

+
+
+ +
+

Today you'll use Lakeflow, Unity Catalog, Genie, AI/BI, Custom Apps, MCP, and Model Serving -- all in one workshop.

+
+
+ +
+ + +
+
+ 9:45 AM +

The Paradigm Shift

+

From writing code to directing AI agents

+
+
+
+ + +
+
+ THE PARADIGM SHIFT +

What is Vibe Coding?

+
+ +
+

"There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists."

+

Andrej Karpathy, February 2025

+
+ +
+
+

Traditional Development

+
    +
  • Human writes every line
  • +
  • Slow iteration cycles
  • +
  • Knowledge bottlenecks
  • +
  • Context switching overhead
  • +
+
+
+

Agentic Development

+
    +
  • Human directs, agent implements
  • +
  • Build-test cycles in seconds
  • +
  • One CLAUDE.md rule fixes every future output
  • +
  • 10x amplification, not replacement
  • +
+
+
+ +
+

Rule #1: Just Say What You Want

+

You literally type what you want and it happens.

+
+
+ +
+ + +
+
+ YOUR ENVIRONMENT +

Coding Agents Databricks App

+
+

A browser terminal running AI coding agents. Nothing to install -- it runs on Databricks compute.

+ +
+
+ Browser + xterm.js terminal +
+
+
+ Databricks App + Flask + Gunicorn +
+
+
+ AI Agents + Claude Code / OpenCode +
+
+
+ AI Gateway + Route · Limit · Trace +
+
+ +
+ 39 Pre-built Skills + MCP Servers + MLflow Tracing + Unity Catalog +
+
+ +
+ + +
+
+ OPEN SOURCE +

CODA: Coding Agents on Databricks Apps

+
+

The platform you're using today. Browser-based AI agents with zero local setup, governed by Unity Catalog, traced by MLflow.

+ +
+
+
+
+

Claude Code

+

39 Databricks skills + 2 MCP servers

+
+
+

Codex

+

OpenAI's agent

+
+
+

Gemini CLI

+

Google's agent

+
+
+

OpenCode

+

Open-source, any provider

+
+
+
+ +
+

Why This Matters

+
+
+ Auth +

Paste a PAT once. Auto-rotates every 10 minutes.

+
+
+ Trace +

Every agent session logged to MLflow.

+
+
+ Govern +

Unity Catalog controls what each agent can read and write.

+
+
+ Cost +

AI Gateway handles rate limiting and spend tracking per team.

+
+
+
+
+ +

+ github.com/datasciencemonkey/coding-agents-databricks-apps +

+
+ +
+ + +
+
+ THE DESTINATION +

What We're Building Today

+
+

By 4pm today, your team will have built all four of these

+ +
+
+ 1 +

Data Pipeline

+

Lakeflow pipeline ingesting ABS retail & food price data through Bronze → Silver → Gold medallion architecture

+
+
+ 2 +

Web Application

+

FastAPI + Tailwind app with dashboards, filters, and an AI-powered "Ask a Question" feature

+
+
+ 3 +

Genie Space

+

Natural language Q&A -- business users type questions in English, get instant answers from your data

+
+
+ 4 +

AI/BI Dashboard

+

Auto-generated visualizations from your gold tables -- describe what you want in plain language

+
+
+ +
+
Pipeline
+
+
App
+
+
Genie
+
+
Dashboard
+ = Grocery Intelligence Platform +
+
+ +
+ + +
+
+ THE CHALLENGE +

Today's Challenge: Grocery Intelligence Platform

+
+

One platform, two labs. The CLAUDE.md you create next will guide you through both.

+ +
+
+

Four Real Data Sources

+
    +
  • ABS Retail Trade -- monthly turnover by state & industry
  • +
  • ABS CPI Food -- quarterly food price indices
  • +
  • FSANZ Food Recalls -- safety recall notices
  • +
  • ACCC Grocery Reports -- supermarket inquiry PDFs
  • +
+
+
+

Tech Stack

+
    +
  • PySpark + Lakeflow Declarative Pipelines
  • +
  • Unity Catalog -- workshop_vibe_coding.<team>
  • +
  • UC Volumes for raw files (PDFs, JSON)
  • +
  • FastAPI + React for the app
  • +
  • Genie + AI/BI Dashboards
  • +
  • DABs for deployment
  • +
+
+
+

What You'll Build

+
    +
  • Lab 1: Medallion pipeline -- APIs, scraping & PDFs → Gold
  • +
  • Lab 2: FastAPI + React app, Genie space & AI/BI dashboard
  • +
+
+
+
+ +
+ + +
+
+ 9:45 AM +

Specs & BDD

+

PRDs, CLAUDE.md, and behavior-driven agentic development

+
+
+
+ + +
+
+ SPECIFICATIONS +

Why Specs Matter More with AI

+
+

Garbage in, garbage out -- except now the garbage arrives in 30 seconds instead of 3 days.

+ +
+

Mental model: Think of Claude as a brilliant but new employee who lacks context on your norms and workflows.

+
+ +
+
+

A Good PRD Includes

+
    +
  • Clear acceptance criteria
  • +
  • Constraints & non-functional requirements
  • +
  • Data contracts and interfaces
  • +
  • Example inputs & expected outputs
  • +
+
+
+

CLAUDE.md Encodes

+
    +
  • Coding standards & patterns
  • +
  • Architectural decisions
  • +
  • Testing requirements
  • +
  • Tool and framework preferences
  • +
+
+
+ +
+ Repository-level + Project-level + User-level + Evolve over time +
+
+ +
+ + + +
+
+ SPECIFICATIONS +

CLAUDE.md in Action

+
+
+
+

One file. The agent reads it before every task. Add a rule here and it sticks.

+
    +
  • Repository-level: team standards, architecture
  • +
  • Project-level: specific patterns for this app
  • +
  • User-level: personal preferences
  • +
+
+

Exercise: Write a CLAUDE.md for your team's Grocery Intelligence Platform. 15 minutes.

+
+
+
+# CLAUDE.md + +## Tech Stack +- Use PySpark for all data processing +- All functions must have type hints +- Tables use Unity Catalog: + workshop_vibe_coding.<schema> + +## Data Standards +- Date columns: DATE, YYYY-MM-DD +- Currency: DECIMAL(12,2) +- All tables: processing_timestamp +- Gold tables: partitioned by date + +## Testing +- pytest with PySpark fixtures +- Small DataFrames (5-10 rows) +- Assert schema, counts, values +
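+
+What the Testing rules above look like in practice -- a minimal sketch, where decode_regions is a hypothetical transform standing in for your real code:
+
+def test_decode_regions_maps_codes(spark):
+    # Small fixture per the rules above: a few rows, known values
+    df = spark.createDataFrame([("1",), ("2",)], ["REGION"])
+
+    result = decode_regions(df)  # decode_regions is hypothetical
+
+    # Assert schema, counts, values
+    assert "state" in result.columns
+    assert result.count() == 2
+    assert result.filter(result.state == "New South Wales").count() == 1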
+
+
+ +
+ + + + +
+
+ SPECIFICATIONS +

CLAUDE.md: Three Scope Levels

+
+

Instructions cascade from global to project-specific. More specific rules override general ones.

+
+
+
+ User-Level +
~/.claude/CLAUDE.md
+
+
+

Personal Preferences

+

Your coding style, preferred tools, git conventions. "Always use uv run for Python." "Use imperative mood in commits."

+
+
+
+
+
+ Repo-Level +
./CLAUDE.md
+
+
+

Team Standards

+

Architecture decisions, testing patterns, dependencies. "Use PySpark for all data processing." "All tables in Unity Catalog."

+
+
+
+
+
+ Project-Level +
./src/CLAUDE.md
+
+
+

Module-Specific Rules

+

Component-specific patterns and constraints. "All API endpoints return JSON." "Use Pydantic models for validation."

+
+
+
+
+ +
+ + + + +
+
+ RULE #1 · EXERCISE · 15 MINUTES +

Just Say What You Want

+
+
+
+

The Insight

+

Everything else in this workshop builds on one idea:

+
+

You literally type what you want and it happens.

+
+
    +
  • Want a project? Tell Claude what you're building
  • +
  • Want behavior to change? Say “from now on, do X”
  • +
  • See a technique you like? Paste it in
  • +
  • Everything is markdown — CLAUDE.md, skills, hooks
  • +
+

That's agentic engineering. It works in three steps:

+
    +
  1. Say it — have a conversation, get what you want
  2. +
  3. Curate it — save the good stuff as markdown files (CLAUDE.md, skills, hooks)
  4. +
  5. Wire up tools — increasingly just instructions to CLI commands, not heavyweight servers
  6. +
+
+
+

Try It Now (15 min)

+
    +
  1. Open your Coding Agents terminal as a team
  2. +
  3. Don't write a file — have a conversation:
  4. +
+
+

Just say: “I'm building a grocery intelligence platform on Databricks. Tech stack: PySpark, Lakeflow Declarative Pipelines, FastAPI + React, DABs. Data sources: ABS SDMX APIs, FSANZ web scraping, ACCC PDF ingestion via UC Volumes. Unity Catalog namespace: workshop_vibe_coding.<team_schema>. Set up the project and create a CLAUDE.md.”

+
+
    +
  1. Watch Claude create your project, CLAUDE.md, and structure
  2. +
  3. Want to change something? Just say it. That's the whole point.
  4. +
+
+

Discussion: What did Claude get right? What did you correct by just saying so? How is this different from writing config by hand?

+
+
+
+
+ +
+ + + + +
+
+ BEHAVIOR-DRIVEN DEVELOPMENT +

The BDD + Agent Workflow

+
+

Rule #1 applies here too: say what the system should do in plain English. The agent writes the test code. You can read it because it's Given/When/Then. When a scenario fails, the agent reads the error and fixes itself.

+
+
+
01
+

Human Writes Features

+

Gherkin Given/When/Then — the spec in plain English. Use /bdd-features

+
+
+
02
+

Agent Implements Steps

+

Agent reads features, writes Python step definitions. Use /bdd-steps

+
+
+
03
+

Run & Iterate

+

Agent runs behave, sees failures, fixes. Use /bdd-run

+
+
+
04
+

Human Reviews

+

Add scenarios. Agent handles them. Ship it.

+
+
+
+ +
+ + + + +
+
+ BEST PRACTICES +

Why BDD Works So Well with Agents

+
+
+
+
🎯
+

Agent Writes, You Read

+

The agent writes the test code. But it's Given/When/Then, so you can actually read it and check if it captured what you meant.

+
+
+
🔄
+

Self-Correcting Loop

+

The agent runs behave, reads the traceback, fixes its step definitions, reruns. You watch.

+
+
+
🛡️
+

Acceptance Boundary

+

The agent can’t say “it works” when 3 of 8 scenarios are red. The proof is structural, not verbal.

+
+
+
+

Pro tips: Start with one feature file. Use Background: for shared setup. Write declarative steps ("When I grant SELECT") not imperative ("When I click the button").

+
+ +
+
+ + + + +
+
+ BEHAVIOR-DRIVEN DEVELOPMENT +

Writing Gherkin That Guides the Agent

+
+

You describe what should happen. The agent writes the test. Because it's Gherkin, you can read it back and verify it's right.

+
+
+
+@pipeline @smoke +Feature: Transaction Data Cleaning + As a data engineer + I want invalid transactions removed + So that downstream analytics are accurate + + Scenario: Remove null and negative amounts + Given a raw transactions table with 10 rows + And 2 rows have null amounts + And 1 row has a negative amount + When I run the clean_transactions pipeline + Then the cleaned table should have 7 rows + And no row should have a null amount + And no row should have amount <= 0 + And the schema should include "transaction_amount" +
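+
+A sketch of the step definitions the agent might write for three of the steps above, using behave's parse matchers -- make_raw_transactions and clean_transactions are hypothetical helpers, and context.spark is assumed to come from environment.py:
+
+from behave import given, when, then
+
+@given("a raw transactions table with {count:d} rows")
+def step_impl(context, count):
+    # Build a small fixture DataFrame (make_raw_transactions is hypothetical)
+    context.raw_df = make_raw_transactions(context.spark, rows=count)
+
+@when("I run the clean_transactions pipeline")
+def step_impl(context):
+    context.result = clean_transactions(context.raw_df)  # hypothetical transform
+
+@then("the cleaned table should have {count:d} rows")
+def step_impl(context, count):
+    assert context.result.count() == count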
+
+
+

Key Principles

+
+ Declarative, Not Imperative +

Describe what the system does, not how to click buttons.

+
+
+ Concrete Values +

Use specific data — "10 rows, 2 invalid" not "some rows."

+
+
+ One Behavior Per Scenario +

If it tests two things, split it. Scenario names are documentation.

+
+
+ Use Backgrounds +

Shared setup goes in Background: — don't repeat connection steps.

+
+
+
+ +
+
+ + + + +
+
+ VERIFICATION PATTERNS +

Every Prompt Must Answer: How Will Claude Prove This Worked?

+
+
+
+
+
+

Six Verification Patterns

+

Don’t ask “did this work?” — make the agent prove it with gates that either pass or fail.

+
    +
  • BDD gates — write Gherkin features first, agent must pass all scenarios
  • +
  • Separate prompts — Prompt 1 writes tests, Prompt 2 implements
  • +
  • @dp.expect — data quality constraints that fail the pipeline
  • +
  • Schema contracts — StructType definitions as machine-readable specs (see the sketch below)
  • +
  • Full suite regression — run ALL tests after every change
  • +
  • Negative testing — test what should fail, not just what should pass
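+
+Two of these gates as code -- a minimal sketch using the @dp.expect pattern from the labs plus a StructType contract; the table name and columns are illustrative:
+
+import databricks.declarative_pipelines as dp
+from pyspark.sql.types import DoubleType, StringType, StructField, StructType
+
+# Schema contract: a machine-readable spec the agent must satisfy
+GOLD_RETAIL_SCHEMA = StructType([
+    StructField("state", StringType()),
+    StructField("turnover_millions", DoubleType()),
+])
+
+@dp.table(comment="Gold retail summary")
+@dp.expect("positive_turnover", "turnover_millions > 0")  # quality gate: violations are surfaced
+def gold_retail_summary():
+    ...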
  • +
+
+
+
+
+
🧭
+

Explore First, Then Plan, Then Code

+

Separate research and planning from implementation to avoid solving the wrong problem fast.

+
    +
  • Use /plan mode to align on approach first
  • +
  • Let the agent read the codebase before writing
  • +
  • Cheap to course-correct in planning, expensive in code
  • +
+
+
+
+
+

The difference: an agent that thinks it did the work vs one that proves it did. — Anthropic, Claude Code Best Practices

+
+ +
+
+ + + + +
+
+ CRITICAL INSIGHT +

Why AI Agrees With You (Even When You're Wrong)

+
+
+ +
+
+

The Stanford Study (Science, March 2026)

+

Tested 11 production LLMs against 2,000+ real advice prompts.

+
+
+
49%
+
more agreement
than humans
+
+
+
51%
+
agreement when user
is 100% wrong
+
+
+
+
+

The Karpathy Experiment

+

Spent 4 hours refining an argument with an LLM. Was genuinely convinced. Then asked the model to argue the opposite — it demolished his argument completely.

+

"The model was never reasoning toward truth. It was generating the most compelling version of whatever position it was asked to defend."

+
+
+ +
+

Your Defenses

+
+
+

The Karpathy Test

+

Before trusting any AI analysis, ask it to argue the opposite position. If it demolishes its own argument, the original wasn’t reliable.

+
+
+

"Wait a minute..."

+

Two words that measurably improve critical evaluation. Prefixing with doubt triggers more rigorous analysis.

+
+
+

BDD — Structural Verification

+

Tests don’t rely on claims. Code passes or it doesn’t. AI-authored code has 1.7× more issues and 2.7× more security vulnerabilities (Code Rabbit, 2026).

+
+
+

Separate Prompts

+

Prompt 1 writes tests, Prompt 2 implements. If the same prompt does both, it writes tests that match its implementation, not your requirements.

+
+
+
+
+
+

This is not a bug to patch. RLHF training selected for agreement. The question: are you using AI to extend your thinking, or replace it? — François Chollet

+
+ +
+
+ + + + +
+
+ CONTEXT MANAGEMENT +

What Are Tokens?

+
+

Every interaction with the agent is measured in tokens. Understanding them helps you work faster.

+
+
+

Tokens = the currency of AI

+

A token is roughly 3/4 of a word or ~4 characters. Not exactly words, not exactly characters — somewhere in between.

+
+
+ This sentence is about 8 tokens. + ~8 tokens +
+
+ A 200-line Python file + ~2-3K tokens +
+
+ A full README + 5 source files + ~15-20K tokens +
+
+ Harry Potter and the Philosopher’s Stone + ~110K tokens +
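+
+The ~4-characters-per-token rule as quick arithmetic -- a rough heuristic, not a real tokenizer:
+
+def estimate_tokens(text: str) -> int:
+    # Rule of thumb from above: ~4 characters per token
+    return len(text) // 4
+
+estimate_tokens("This sentence is about 8 tokens.")  # 32 chars -> ~8 tokens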
+
+
+
+

Two limits to know

+
+
+ 200K + Context window +
+

How much the agent can “see” at once — your conversation, files read, tool results, CLAUDE.md. When it fills up, older content gets compacted (forgotten).

+
+
+
+ ~1M/min + Rate limit (shared) +
+

Tokens per minute through AI Gateway — shared across all teams. If everyone reads large files at once, requests queue up. Keep requests focused.

+
+
+

Practical tip: Don’t ask the agent to “read everything in this directory.” Be specific. Smaller, focused requests = faster responses for everyone.

+
+
+
+ +
+
+ + + + +
+
+ CONTEXT MANAGEMENT +

Managing Context Windows

+
+

The agent can “see” about 200K tokens at once. Fill that up with junk and it forgets your earlier instructions.

+ + +
+
+

Context Window — 200K tokens

+

~a medium novel

+
+ +
+ +
+ CLAUDE.md +
+
+ File reads & tool results +
+
+ Conversation turns +
+
+ Agent responses +
+ +
+ COMPACTION +
+
+
+ 0K + 50K + 100K + 150K + 200K +
+

When the bar fills → auto-compaction fires → the agent forgets earlier details

+
+ + +
+
+

Keep CLAUDE.md lean

+

50 lines, not 500. It’s loaded every turn. Rules and patterns, not docs.

+
+
+

Be specific with requests

+

Don’t ask the agent to “read everything.” Targeted requests = less context consumed.

+
+
+

Plan before building

+

/plan aligns on approach before burning context on code.

+
+
+

Start new sessions for new tasks

+

Fresh session = fresh context window. Don’t reuse a bloated session for a new task.

+
+
+
+

Think of it like RAM — manage it or the OS starts swapping.

+
+ +
+
+ + + + +
+
+ LIVE DEMO +

Live Demo: BDD in Action

+
+

Watch the full cycle: write failing tests, let the agent implement, iterate to green.

+
+
+
1
+

Write 3 Failing Tests

+

clean_data, join_tables, aggregate

+
+
+
+
2
+

Agent Implements

+

Watch it read tests, write code

+
+
+
+
3
+

Tests Fail, Agent Fixes

+

Reads errors, adjusts code

+
+
+
+
4
+

All Green

+

Review the generated code

+
+
+
+

What to Watch For

+
+
+

Agent Reads Tests

+

It reads your test expectations before writing any code

+
+
+

Self-Corrects

+

When tests fail, it reads the error and adjusts its approach

+
+
+

Converges

+

Within 2-5 iterations, all tests pass without manual intervention

+
+
+
+ +
+
+ + + + +
+
+ Lab 0 · 10:45 AM · 45 Minutes +

Lab 0: Guided Hands-On

+

Initialize your project, write your first test, build bronze ingest — everyone together

+
+
+
+
1
+

Just Say It

+

Tell Claude, it initializes

+
+
+
2
+

First Test

+

BDD with the agent

+
+
+
3
+

Bronze Ingest

+

First pipeline layer

+
+
+
+
+ + + + +
+
+ LAB PREPARATION +

Practical Tips for the Labs

+
+

These five things trip up every team on their first day with agents. Learn them now, not mid-lab.

+
+

Default workflow: Start every non-trivial task with /plan. Even better — ask Claude to interview you about requirements before it builds anything.

+
+
+
+
⚠️
+
+

Watch for Overengineering

+

Claude tends to create extra files, add unnecessary abstractions, and build in flexibility you didn’t ask for.

+
+# Add to your CLAUDE.md: +- Keep solutions minimal. Do not add features, abstractions, or files beyond what is requested. +
+
+
+
+
🔍
+
+

Prevent Hallucinations

+

Never trust claims about code Claude hasn’t read. Make it investigate before answering.

+
+# Add to your CLAUDE.md: +- Never speculate about code you have not opened. Read relevant files BEFORE answering. +
+
+
+
+
🎯
+
+

Course-Correct Early and Often

+

Don’t let the agent run for 10 minutes unchecked. Check in every 2–3 tool calls. If it’s heading the wrong direction, say “stop, let’s rethink this approach”.

+
+
+
+
💪
+
+

Challenge Claude to Prove Its Work

+

After implementation, demand proof: “show me the git diff”, “prove to me this works”, “grill me on these changes”. Then iterate: “knowing everything you know now, scrap this and implement the elegant solution.”

+
+
+
+
💾
+
+

Commit as Checkpoints

+

Every 15–20 minutes, use /commit. Commits are your safety net — if the agent goes off-rails, press Esc twice to cancel, then rewind with git.

+
+
+
+ +
+
+ + + +
+
+ 11:30 AM +

Skills, MCP & Genie

+

Practical tools for the labs

+
+
+
+ + + + +
+
+ TOOLING +

Practical Tools for the Labs

+
+
+
+

Skills

+

Curated markdown files — just instructions, not code

+
+
+# Built-in skills +/commit → smart commit messages +/test → run and fix tests +/review → code review + +# Databricks skills +/deploy-dab → validate + deploy bundle +
+
+

Rule #1 in action: Say something 3×? Curate it into a skill. It's just a markdown file.

+
+
+

Tools

+

Increasingly just instructions pointing the agent at CLI commands, not heavyweight servers

+
+
+# MCP server (structured): +"Search the Databricks docs + for how to create a + Genie space" + +# CLI instruction (lightweight): +"Run: databricks catalogs list + to check available catalogs" +
+
+

MCP: structured tool access
CLI instructions: just tell it which command to run

+
+
+
+

How powerful is a markdown skill? deathbyclawd.com tracks which SaaS products can be replaced by one.

+
+ +
+
+ + + + +
+
+ PLATFORM CAPABILITIES +

MCP Servers on Databricks

+
+

Three flavours — all secured through Unity Catalog

+
+
+ Built-in +

Managed

+

Databricks-provided servers

+
    +
  • UC functions, tables & volumes
  • Vector Search indexes
  • Genie spaces for NL → SQL
  • Zero config — works out of the box
+
+/api/2.0/mcp/functions/{catalog}/{schema} +
+
+
+ Proxy +

External

+

Third-party services via Databricks

+
    +
  • GitHub, Slack, Glean, Jira …
  • Install from Marketplace or custom HTTP
  • UC Connections manage auth tokens
  • No credentials exposed to clients
+
+/api/2.0/mcp/external/{connection_name} +
+
+
+ Build Your Own +

Custom

+

Organisation-specific tools

+
    +
  • Host on Databricks Apps
  • Wrap internal APIs & workflows
  • OAuth authentication required
  • Full control over tool surface
+
+Databricks Apps → your-mcp-server:8000 +
+
+
+
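For the curious, a rough sketch of calling a managed server directly, assuming a workspace token and the standard MCP JSON-RPC envelope (the catalog and schema values are placeholders):

```python
import os

import requests

host = os.environ["DATABRICKS_HOST"]    # e.g. https://<workspace>.cloud.databricks.com
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/api/2.0/mcp/functions/workshop_vibe_coding/team_01",  # placeholder catalog/schema
    headers={"Authorization": f"Bearer {token}"},
    json={"jsonrpc": "2.0", "id": 1, "method": "tools/list"},
)
print(resp.json())  # UC functions in that schema, exposed as MCP tools
```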

+ Clients: Claude (Connectors) • Cursor • Windsurf • ChatGPT • Replit • MCP Inspector +

+
+ Coles + In Lab 2, you’ll connect your app to Genie via a managed MCP server — zero custom glue code. +
+ +
+
+ + + + +
+
+ LIVE DEMO +

How Databricks Uses Claude

+
+

A look inside how we’ve configured AI-assisted development at Databricks

+
+ +
+
+ +
+
+ 📋 +

CLAUDE.md

+ Rules +
+

Company-wide coding standards, approved patterns, forbidden actions. The AI reads these before every task.

+
+ +
+
+ +

Hooks

+ Guardrails +
+

Shell commands that fire on tool events. Post-edit: auto-format & lint Python with ruff. On stop: verify your work. Not AI — deterministic guardrails.

+
+ +
+
+ 🧩 +

Skills + MCP Servers

+ Actions +
+

Custom slash commands for Databricks workflows. MCP servers connect to UC, Genie, internal services — all via the protocol.

+
+
+
+ +
+

What the configuration looks like

+
+// .claude/settings.json +{ + "hooks": { + "PostToolUse": [{ + "matcher": "Edit|Write", + "command": "ruff format $FILE && ruff check --fix $FILE" + }], + "Stop": [{ + "command": "~/.claude/verify-hint.sh" + }] + }, + "mcpServers": { + "databricks": { + "command": "uvx databricks-mcp" + }, + "slack": { ... }, + "jira": { ... } + } +} +
+ +
+ Coles + You can replicate this — a CLAUDE.md with your standards, hooks for your CI/CD, skills for your data patterns. +
+
+
+ +
+
+ + + + +
+
+ INTELLIGENCE LAYER +

Genie + AI/BI Dashboards

+
+
+
+

Genie Spaces

+
+

Natural language → SQL on Unity Catalog tables

+
    +
  • Create a Genie space, point at your gold tables
  • Ask questions in plain English
  • End users type English, not SQL
  • No tickets to the data team for simple questions
+
+
+# Ask Genie: +"Which Australian state had the +highest food retail turnover +in 2024?" + +# Genie generates SQL, runs it, +# returns results + visualization +
+
+
+

AI/BI Dashboards

+
+

Auto-generated visualizations from your data

+
    +
  • Point at Unity Catalog tables
  • Describe what you want in natural language
  • Get interactive, auto-updating dashboards
  • Complement Genie: dashboards for recurring views, Genie for ad-hoc
+
+
+# Create dashboard: +"Show monthly retail turnover +by state as a line chart, +with food CPI overlay" + +# Connected to your gold tables +
+
+
+
+ Genie + AI/BI + Unity Catalog + Natural Language +
+ +
+
+ + + + +
+
+ Lab 1 · 12:00 PM · 60 Minutes +

The Great Grocery
Data Challenge — Phase 1

+

Track-specific build with real Australian data

+
+
+
+ + + + +
+
+ LAB SESSION +

Choose Your Track

+
+

All teams start with LAB-0-GETTING-STARTED.md (10 min), then fork into your track

+
+
+

Data Engineering

+

Build the data foundation

+
    +
  • Lakeflow pipeline
  • Bronze → Silver → Gold
  • Data quality expectations
  • DABs deployment
+
LAB-1-DE.md
+
+
+

Data Science

+

Build ML models from the data

+
    +
  • Feature engineering
  • MLflow experiment tracking
  • Correlation analysis
  • EDA visualizations
+
LAB-1-DS.md
+
+
+

Analyst

+

Build interfaces for business users

+
    +
  • Genie spaces (NL queries)
  • AI/BI dashboards
  • Column metadata tuning
  • Dashboard publishing
+
LAB-1-ANALYST.md
+
+
+
+
+ Checkpoints at every phase — pre-loaded data in workshop_vibe_coding.checkpoints. Nobody gets stuck. +
+
+ +
+
+ + + + +
+
+ MID-DAY CHECK-IN +

Show & Tell + Prediction Reveal

+
+
+
+

Show & Tell

+
+
    +
  • 3 teams volunteer to show their pipeline DAG (90 sec each)
  • Share one interesting insight from your Gold data
  • What worked well? Where did BDD help?
  • Where did the agent go off-rails?
+
+
+

Discussion: Where did BDD help the agent stay on track? Where did it go off-rails?

+
+
+
+

Prediction Reveal

+
+

Query your Gold tables live to answer the ice breaker:

+
    +
  • Q1: Highest food retail turnover by state? (query sketch below)
  • Q2: Food CPI increase since 2020?
  • Q3: Average household grocery spend/week?
  • Q4: Peak retail spending month?
  • Q5: Fastest-rising food category?
+
+
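For Q1, a query sketch against the Lab 1 gold tables (assumes retail_summary keeps a turnover_millions column; substitute your TEAM_SCHEMA):

```python
# In a Databricks notebook, `spark` is predefined.
spark.sql("""
    SELECT state, SUM(turnover_millions) AS total_turnover
    FROM workshop_vibe_coding.team_01.retail_summary  -- your TEAM_SCHEMA here
    WHERE industry = 'Food retailing'
    GROUP BY state
    ORDER BY total_turnover DESC
    LIMIT 1
""").show()
```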

Score the prediction cards — most accurate team gets bragging rights!

+
+
+
+
+ +
+
+ + + + +
+
+ Lab 2 · 2:00 PM · 60 Minutes +

The Great Grocery
Data Challenge — Phase 2

+

Build an app, connect Genie, create AI/BI dashboards

+
+
+
+ + + + +
+
+ LABS +

Continue Your Track

+
+

Same track as Lab 1 — pick up where you left off

+
+
+

Data Engineering

+

Make it production-grade

+
    +
  • Add FSANZ food recalls source
  • Data quality expectations
  • Cross-source gold view
  • Cron scheduling
+
LAB-2-DE.md
+
+
+

Data Science

+

Train, serve, and ship

+
    +
  • Train forecasting model
  • Register in MLflow
  • Model Serving endpoint
  • Prediction web app
+
LAB-2-DS.md
+
+
+

Analyst

+

Build the app layer

+
    +
  • FastAPI + Tailwind app
  • Embedded AI/BI dashboards
  • NL query feature
  • Deploy to Databricks Apps
+
LAB-2-ANALYST.md
+
+
+
+
+ Get as far as you can — checkpoints at every phase. 60 minutes. +
+
+ +
+
+ + + + +
+
+ 3:00 PM · 30 Minutes +

Demo Time!

+

Each team presents their platform (3 min each)

+
+
+
+

Pipeline Quality

+

Does the data flow work? Expectations met? Bronze → Silver → Gold complete?

+
+
+

App Polish

+

UI quality, user experience, features. Does it look and feel like a real product?

+
+
+

Genie & AI/BI

+

Natural language query working? Dashboard useful? Can a business user self-serve?

+
+
+

Creativity

+

Unique features, clever use of agents, surprises. Did the team go beyond the brief?

+
+
+
+

Vote for the best platform! Winning team gets eternal bragging rights. Then 5-min retro: What worked? What would you do differently? What will you use on Monday?

+
+
+
+ + + + +
+
+ TAKEAWAYS +

What to Remember

+
+
+
+
1
+

Specs Do the Heavy Lifting

+

20 minutes writing a CLAUDE.md saves you hours of correcting bad output. The agent only knows what you tell it.

+
+
+
2
+

BDD + Agents = Deterministic

+

When the Gherkin scenario is red, the agent can’t bluff. It either fixes the code or it doesn’t.

+
+
+
3
+

Subagents & MCP Extend Reach

+

One agent hits a wall on complex tasks. Subagents split the work. MCP wires them into Slack, JIRA, Databricks, whatever.

+
+
+
4
+

Start Small, Iterate

+

Pick one real task this week. Write a CLAUDE.md, write a feature file, let the agent build it. That’s your proof of concept.

+
+
+ +
+
+ + + + +
+
+ NEXT STEPS +

Next Steps

+
+
+
+

Immediate

+
    +
  • Share learnings with your wider team
  • Set up a team CLAUDE.md with your coding standards
  • Try it on a real task — start with something small this week
+

Coming Soon

+
    +
  • Broader team rollout — full team workshop
  • Shared skill libraries for common Coles patterns
  • Genie spaces for your real production data
+
+
+

Your Champions

+
+
+
🧑‍💻
+
+

Farbod & Swee Hoe

+

Internal champions — your go-to for day-to-day questions

+
+
+
+
🏢
+
+

David O’Keeffe

+

Databricks SA — available for follow-up support

+
+
+
+
+

Measure success: Developer velocity, code quality, and developer satisfaction.

+
+
+
+ +
+
+ + + + +
+
+ WHAT'S NEXT +

Building Intelligent Apps Hackathon

+
+

APJ's biggest hackathon of the year. Same tech stack you learned today.

+
+
+

Key dates

+
+ Registration +

Open now

+
+
+ Build period +

1 May – 22 May 2026

+
+
+ Winners announced +

17 June 2026

+
+
+
+

Prizes

+
+
$700
+

USD in Databricks credits to get started

+
+
+
$68K
+

USD in total prizes

+
+
+

Certificates for all teams who submit

+
+
+
+
+

Genie · Lakebase · Agent Bricks · Apps

+ Register now → +
+
+ +
+ + + + +
+
+
+ + + + + + + + + + + + + + + + + + + +
+

Thank You

+

You built a data platform in 6 hours.
Imagine what your team does with a full week.

+
+
+ david.okeeffe@databricks.com +
+
+
+ + + + +
+
+ Appendix +

Appendix

+

Reference slides — available if time allows or questions arise

+
+
+
+ + + + +
+
+ APPENDIX +

Subagents, Skills, Hooks & Plugins

+
+
+
+

Subagents

+

Parallel workers, isolated context

+
    +
  • Spawn agents for parallel tasks
  • Each has its own context window
  • Main agent orchestrates
+
+# In Lab 2: +"Spawn a subagent to build +the frontend while I work +on the backend API." +
+
+
+

Skills

+

Slash commands, domain knowledge

+
    +
  • Built-in: /commit, /review
  • Custom: /deploy-pipeline
  • Encode expert workflows
+
+# AI Dev Kit skills: +/deploy-dab +/create-pipeline +/run-data-quality +
+
+
+

Hooks

+

Event-driven guardrails

+
    +
  • PostToolUse: auto-format on save
  • Stop: verify before you walk away
  • Deterministic — not AI
+
+# .claude/settings.json +"hooks": { + "PostToolUse": [{ + "matcher": "Edit|Write", + "command": "ruff format && ruff check --fix" + }] +} +
+
+
+

Plugins

+

Packaged for teams

+
    +
  • Skills + agents + hooks bundled
  • Distributable and versioned
  • Composable toolkit
+
+# Plugin = everything packaged +skills/ +agents/ +hooks/ +manifest.json +
+
+
+ +
+
+ + + + +
+
+ APPENDIX +

Skills in Action: BDD Skill Chain

+
+

Four Databricks BDD skills automate the entire behavior-driven workflow — from project scaffold to test execution.

+
+
+

The Chain

+
+
+ /bdd-scaffold +

Creates Behave project structure, environment.py hooks, ephemeral schema isolation

+
+
+
+ /bdd-features +

Translates requirements into Gherkin feature files with Given/When/Then scenarios

+
+
+
+ /bdd-steps +

Implements Python step definitions using Databricks SDK

+
+
+
+ /bdd-run +

Executes Behave with tag filtering, parallel runs, and CI reporting

+
+
+
+
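A sketch of the kind of step definition /bdd-steps produces, assuming environment.py has set context.test_schema for the ephemeral schema:

```python
from behave import then
from databricks.sdk import WorkspaceClient


@then('the table "{table_name}" should exist')
def step_table_exists(context, table_name):
    w = WorkspaceClient()  # auth resolved from the environment
    fqn = f"{context.test_schema}.{table_name}"  # build the fully qualified name
    w.tables.get(full_name=fqn)  # raises if the table does not exist
```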
+

Try It in the Labs

+
+# Set up BDD in your project: +/bdd-scaffold + +# Write Gherkin from requirements: +/bdd-features +# → features/pipelines/bronze_ingest.feature + +# Generate step definitions: +/bdd-steps +# → features/steps/pipeline_steps.py + +# Run your scenarios: +/bdd-run +# → behave --tags="@pipeline @smoke" +
+
+

Key insight: Gherkin features are plain English anyone can read — analysts, PMs, engineers. They define “done” before any implementation exists.

+
+
+
+ +
+
+ + + + +
+
+ APPENDIX +

Subagents vs Agent Teams

+
+

The right question isn’t “should I use multiple agents?” — it’s “what kind of coordination does this task need?”

+
+
+

Subagents: Fire-and-Forget

+

Like delegating focused questions to researchers — they come back with distilled findings.

+
    +
  • Own isolated context window
  • One job, then result returns to parent
  • Can’t talk to each other — by design
  • Key benefit: compression — vast exploration distilled to clean signal
+
+
+

Agent Teams: Ongoing Coordination

+

Like assembling a team that works in the same room — they persist, communicate, and coordinate.

+
    +
  • Long-running instances with shared state
  • Peer-to-peer messaging between teammates
  • Shared task list with dependencies (blockedBy, sketched below)
  • Key benefit: negotiation — discoveries in one thread change another
+
+
+
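A toy sketch of the shared-task-list idea; the data shape is illustrative, not an actual Claude Code API:

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    id: str
    description: str
    blocked_by: list[str] = field(default_factory=list)  # "blockedBy" dependencies
    done: bool = False


def ready(tasks: list[Task]) -> list[Task]:
    """Tasks that are not done and whose dependencies are all complete."""
    done = {t.id for t in tasks if t.done}
    return [t for t in tasks if not t.done and set(t.blocked_by) <= done]


tasks = [Task("api", "Build backend API"), Task("ui", "Build frontend", blocked_by=["api"])]
print([t.id for t in ready(tasks)])  # ['api'] until the API task is done, then ['ui']
```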
+

The #1 design principle: Start with a single agent. Push it until it breaks. That failure point tells you exactly what to add next.

+
+ +
+
+ + + + +
+
+ APPENDIX +

What is MCP?

+
+

Model Context Protocol — the USB-C of AI agents

+
+
+

Before MCP: N × M custom integrations

+
+
+
+
Claude
+
GPT
+
Gemini
+
+
+
+
GitHub
+
Slack
+
Jira
+
+
+

9 custom connectors for just 3 × 3

+
+
+
+

After MCP: N + M standardised connections

+
+
+
+
Claude
+
GPT
+
Gemini
+
+
MCP
+
+
GitHub
+
Slack
+
Jira
+
+
+

6 connections — build once, use everywhere

+
+
+
+ +
+
+ + + + +
+
+ APPENDIX +

How MCP Works

+
+
+ +
+

Client ↔ Server Architecture

+ MCP client-server architecture showing agents connecting to MCP servers via JSON-RPC +

Clients (agents) ↔ JSON-RPC ↔ Servers (tools)

+
+ +
+

Why This Matters

+
+

Build once, connect everywhere

+

A GitHub MCP server works with Claude, Cursor, ChatGPT — any client that speaks the protocol.

+
+
+

Structured, not ad-hoc

+

Tools describe their capabilities in a schema. Agents discover what’s available and call tools via JSON-RPC.

+
+
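Under the hood these are plain JSON-RPC 2.0 messages. Illustrative shapes (the tool name and arguments here are hypothetical):

```python
list_tools = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

call_tool = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "run_sql",                   # hypothetical tool name
        "arguments": {"query": "SELECT 1"},  # validated against the tool's schema
    },
}
```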
+

For Coles: one protocol, all your tools

+

Wrap Unity Catalog, internal APIs, and Genie behind MCP servers — every agent in the org gets access without custom glue code.

+
+
+

MCP = tool connectivity  •  Skills = procedural knowledge  •  Agent = orchestrator

+
+
+
+ +
+
+ + + + +
+
+ APPENDIX +

MCP Architecture on Databricks

+
+
+ Databricks MCP Architecture — showing Databricks-served and externally-served agents connecting to Custom, Managed, and 3rd party MCP servers, with AI Gateway, MLflow Tracing, and Model Serving +
+

+ Databricks-served agents (left) and external agents like Claude Code (right) connect to the same MCP servers — secured by Unity Catalog and AI Gateway. +

+ +
+
+ + + + +
+
+ APPENDIX +

AI Dev Kit

+
+

The complete toolkit for AI-driven development on Databricks

+
+ +
+
📚
+

Skills

+

Knowledge layer

+
    +
  • 25+ skill packs
  • Pipelines, DABs, Unity Catalog
  • Genie, AI/BI, MLflow
  • Markdown docs AI reads before acting
+
+ +
+
🔌
+

MCP Server

+

Action layer

+
    +
  • 50+ tools via MCP protocol
  • SQL, clusters, jobs, apps
  • Genie, dashboards, MLflow
  • Works with any MCP client
+
+ +
+
⚙️
+

Tools Core

+

Python library

+
    +
  • High-level Databricks functions
  • Shared across MCP & agents
  • pip install databricks-tools
  • Extend with custom tools
+
+ +
+
💻
+

Builder App

+

Web interface

+
    +
  • Chat-driven Databricks agent
  • Runs as a Databricks App
  • Browser-based — no install
  • What you’re using today!
+
+
+ +
+
+
You type a prompt
+
+
Agent reads Skills
+
+
Calls MCP tools
+
+
Databricks executes
+
+
+ +
+ Coles + Fork the AI Dev Kit and add custom skills for your team — /run-data-quality, /deploy-pipeline, /check-lineage +
+ +
+
+ + + + diff --git a/projects/coles-vibe-workshop/starter-kit/CLAUDE-analyst.md b/projects/coles-vibe-workshop/starter-kit/CLAUDE-analyst.md new file mode 100644 index 0000000..d7e0f3e --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/CLAUDE-analyst.md @@ -0,0 +1,29 @@ +## Track: Analyst + +### Data Source +- Read from gold tables: `workshop_vibe_coding.TEAM_SCHEMA.retail_summary` and `food_inflation_yoy` +- These are pre-loaded in checkpoints — no need to build the pipeline + +### Genie Spaces +- Create via Databricks UI: Genie → New Genie Space +- Add gold tables and write clear general instructions +- Column descriptions in Unity Catalog improve Genie accuracy significantly +- Test with varied natural language questions + +### AI/BI Dashboards +- Create via Databricks UI: Dashboards → Create → AI/BI Dashboard +- Use natural language prompts to describe each visualization +- Arrange into a clean layout with title and filters +- Publish and get embed URL for app integration + +### Web Application +- Backend: FastAPI with Pydantic models +- Frontend: HTML + Tailwind CSS (CDN) + htmx (CDN) — no npm/node required +- Database: databricks-sql-connector with parameterized queries only +- AI feature: Foundation Model API for natural language to SQL +- Deployment: Databricks Apps with `app.yaml` + +### Embedding Dashboards +- Published dashboards can be embedded via iframe +- `` +- Users need Databricks credentials to view embedded dashboards diff --git a/projects/coles-vibe-workshop/starter-kit/CLAUDE-de.md b/projects/coles-vibe-workshop/starter-kit/CLAUDE-de.md new file mode 100644 index 0000000..9317974 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/CLAUDE-de.md @@ -0,0 +1,22 @@ +## Track: Data Engineering + +### Pipeline Framework +- Use Lakeflow Declarative Pipelines: `import databricks.declarative_pipelines as dp` +- `@dp.table` for streaming/batch tables +- `@dp.materialized_view` for aggregation views +- Data quality: `@dp.expect("name", "SQL_EXPRESSION")`, `@dp.expect_or_fail()`, `@dp.expect_all()` + +### Pipeline File Structure +- One function per file in src/bronze/, src/silver/, src/gold/ +- Each file is a notebook that Lakeflow runs independently +- Bronze reads from APIs/files, Silver decodes/cleans, Gold aggregates + +### Deployment +- Databricks Asset Bundles: `databricks.yml` + `resources/pipeline.yml` +- Always validate before deploying: `databricks bundle validate` +- Target: serverless Lakeflow pipeline + +### Data Sources +- ABS Retail Trade API: `https://data.api.abs.gov.au/data/ABS,RT,1.0.0/...` +- ABS CPI Food API: `https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/...` +- Both return CSV via SDMX format with `spark.read.csv(url, header=True, inferSchema=True)` diff --git a/projects/coles-vibe-workshop/starter-kit/CLAUDE-ds.md b/projects/coles-vibe-workshop/starter-kit/CLAUDE-ds.md new file mode 100644 index 0000000..5ddc3e8 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/CLAUDE-ds.md @@ -0,0 +1,29 @@ +## Track: Data Science + +### Data Source +- Read from gold tables: `workshop_vibe_coding.TEAM_SCHEMA.retail_summary` and `food_inflation_yoy` +- These are pre-loaded in checkpoints — no need to build the pipeline + +### Feature Engineering +- Use PySpark for all feature transformations +- Create lag features (1, 3, 6, 12 month lags) using Window functions +- Create seasonal indicators (month_of_year, quarter, is_december) +- Create growth rate features (MoM, YoY) +- Output a feature table in Unity Catalog + +### MLflow +- Track experiments 
with `mlflow.start_run()` +- Log parameters, metrics, and artifacts +- Use `mlflow.sklearn` or `mlflow.xgboost` autologging where possible +- Register best model in Unity Catalog: `mlflow.register_model()` + +### Model Serving +- Use Databricks Model Serving (serverless) +- Endpoint accepts JSON input, returns predictions +- Test with `databricks api post /serving-endpoints/{name}/invocations` + +### ML Libraries +- scikit-learn for baseline models +- XGBoost for boosted tree models +- pandas is OK for small feature DataFrames after collecting from Spark +- Always start with PySpark for data loading and feature engineering diff --git a/projects/coles-vibe-workshop/starter-kit/CLAUDE.md b/projects/coles-vibe-workshop/starter-kit/CLAUDE.md new file mode 100644 index 0000000..d0a832a --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/CLAUDE.md @@ -0,0 +1,76 @@ +# CLAUDE.md — Grocery Intelligence Platform + +## Team +- **Team Name:** TEAM_NAME +- **Schema:** workshop_vibe_coding.TEAM_SCHEMA + +Replace `TEAM_NAME` and `TEAM_SCHEMA` above with your assigned values (e.g., `team_01`). + +## Project +A data platform that ingests Australian retail and food price data, transforms it through a medallion architecture, and serves analytics via a web app with natural language querying. + +## Tech Stack +- **Data processing:** PySpark (never pandas) +- **Pipeline framework:** Lakeflow Declarative Pipelines (`import databricks.declarative_pipelines as dp`) +- **Pipeline decorators:** `@dp.table` for tables, `@dp.materialized_view` for views +- **Data quality:** `@dp.expect("name", "SQL_EXPRESSION")` on pipeline tables +- **Web backend:** FastAPI with Pydantic models +- **Web frontend:** HTML + Tailwind CSS (CDN) + htmx (CDN) — no npm/node required +- **Database access:** `databricks-sql-connector` with parameterized queries only +- **Deployment:** Databricks Asset Bundles (`databricks bundle deploy`) +- **Testing:** pytest with PySpark test fixtures + +## Data Standards +- **Catalog:** `workshop_vibe_coding` +- **Schema:** `TEAM_SCHEMA` +- **Architecture:** Bronze (raw) → Silver (cleaned/decoded) → Gold (aggregated) +- **Date columns:** `YYYY-MM-DD` format, stored as DATE type +- **Naming:** snake_case for all table and column names + +## Rules +- Always use PySpark, never pandas +- Always use parameterized SQL queries — never string concatenation +- Write tests BEFORE implementation +- Use small test DataFrames (5-10 rows) in pytest fixtures — don't mock the database +- Keep solutions minimal — don't over-engineer +- Don't change functions that are already passing tests +- One function per file in the pipeline (bronze, silver, gold layers) + +## Project Structure +``` +├── CLAUDE.md +├── databricks.yml +├── src/ +│ ├── bronze/ +│ │ ├── abs_retail_trade.py +│ │ └── abs_cpi_food.py +│ ├── silver/ +│ │ ├── retail_turnover.py +│ │ └── food_price_index.py +│ └── gold/ +│ ├── retail_summary.py +│ └── food_inflation.py +├── tests/ +│ ├── conftest.py +│ ├── test_pipeline.py +│ └── test_app.py +└── app/ + ├── app.py + ├── app.yaml + ├── requirements.txt + └── static/ + └── index.html +``` + +## Data Sources +| Source | API | Format | +|--------|-----|--------| +| ABS Retail Trade | `https://data.api.abs.gov.au/data/ABS,RT,1.0.0/...` | CSV (SDMX) | +| ABS Consumer Price Index | `https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/...` | CSV (SDMX) | + +## Code Mappings (for silver layer) +**Regions:** 1=New South Wales, 2=Victoria, 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, 7=Northern 
Territory, 8=Australian Capital Territory + +**Industries:** 20=Food retailing, 41=Clothing/footwear/personal, 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, 45=Household goods retailing + +**CPI Index:** 10001=All groups CPI, 20001=Food and non-alcoholic beverages diff --git a/projects/coles-vibe-workshop/starter-kit/README.md b/projects/coles-vibe-workshop/starter-kit/README.md new file mode 100644 index 0000000..a4db9ca --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/README.md @@ -0,0 +1,77 @@ +# Starter Kit + +Everything you need to get started. Follow these steps in order. + +## Track Selection + +Before setup, your team picks one of three tracks: + +| Track | Focus | Key Files | +|-------|-------|-----------| +| **Data Engineering (DE)** | Lakeflow pipeline (Bronze→Silver→Gold) | `CLAUDE-de.md`, `test_pipeline.py`, `prompts/de/` | +| **Data Science (DS)** | Feature engineering, MLflow, model serving | `CLAUDE-ds.md`, `test_features.py`, `test_model.py`, `prompts/ds/` | +| **Analyst** | Genie spaces, AI/BI dashboards, FastAPI app | `CLAUDE-analyst.md`, `test_app.py`, `prompts/analyst/` | + +## Setup (5 minutes) + +1. **Copy CLAUDE.md** to your project root: + ```bash + cp starter-kit/CLAUDE.md ./CLAUDE.md + ``` + Then edit `CLAUDE.md` and replace `TEAM_SCHEMA` with your team schema (e.g., `team_01`) + +2. **Append your track's CLAUDE extension:** + ```bash + # Data Engineering track + cat starter-kit/CLAUDE-de.md >> ./CLAUDE.md + + # Data Science track + cat starter-kit/CLAUDE-ds.md >> ./CLAUDE.md + + # Analyst track + cat starter-kit/CLAUDE-analyst.md >> ./CLAUDE.md + ``` + +3. **Copy test files** to your tests directory: + ```bash + mkdir -p tests + cp starter-kit/conftest.py tests/ + + # Data Engineering track + cp starter-kit/test_pipeline.py tests/ + + # Data Science track + cp starter-kit/test_features.py tests/ + cp starter-kit/test_model.py tests/ + + # Analyst track + cp starter-kit/test_app.py tests/ + ``` + +4. **Follow the prompts** in your track's folder — they're numbered in order: + - DE track: `starter-kit/prompts/de/` (pipeline phases) + - DS track: `starter-kit/prompts/ds/` (features → training → serving) + - Analyst track: `starter-kit/prompts/analyst/` (Genie → dashboard → app) + - Each prompt is exact copy-paste into Claude Code + +5. 
**If stuck**, check `starter-kit/cheatsheet.md` for quick fixes + +## What's in Here + +| File | What it is | +|------|-----------| +| `CLAUDE.md` | Shared project instructions for the AI agent — drop into project root | +| `CLAUDE-de.md` | DE track extension — Lakeflow, medallion architecture rules | +| `CLAUDE-ds.md` | DS track extension — MLflow, feature engineering, model serving rules | +| `CLAUDE-analyst.md` | Analyst track extension — Genie, AI/BI, FastAPI + htmx rules | +| `conftest.py` | pytest fixtures with SparkSession and sample data | +| `test_pipeline.py` | DE track test stubs (pipeline tests) | +| `test_features.py` | DS track test stubs (feature engineering tests) | +| `test_model.py` | DS track test stubs (model training/serving tests) | +| `test_app.py` | Analyst track test stubs (API + app tests) | +| `databricks.yml.template` | Databricks Asset Bundle config | +| `app.yaml.template` | Databricks Apps deployment config | +| `cheatsheet.md` | Quick fixes for common problems (all tracks) | +| `prompts/de/` | Exact copy-paste prompts for DE track | +| `prompts/ds/` | Exact copy-paste prompts for DS track | +| `prompts/analyst/` | Exact copy-paste prompts for Analyst track | diff --git a/projects/coles-vibe-workshop/starter-kit/app.yaml.template b/projects/coles-vibe-workshop/starter-kit/app.yaml.template new file mode 100644 index 0000000..8d01998 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/app.yaml.template @@ -0,0 +1,16 @@ +# Databricks Apps deployment configuration +# Replace REPLACE_WITH_SQL_WAREHOUSE_PATH with your SQL warehouse HTTP path + +command: + - uvicorn + - app:app + - --host + - 0.0.0.0 + - --port + - "8000" + +env: + - name: DATABRICKS_HOST + valueFrom: "{databricks_host}" + - name: DATABRICKS_HTTP_PATH + value: "REPLACE_WITH_SQL_WAREHOUSE_PATH" diff --git a/projects/coles-vibe-workshop/starter-kit/cheatsheet.md b/projects/coles-vibe-workshop/starter-kit/cheatsheet.md new file mode 100644 index 0000000..5d91c51 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/cheatsheet.md @@ -0,0 +1,124 @@ +# Cheatsheet — Quick Fixes + +## Common Problems + +| Problem | Fix | +|---------|-----| +| **Agent uses pandas** | Add to CLAUDE.md: `Always use PySpark, never pandas`. Then tell the agent: "Rewrite this using PySpark." | +| **SparkSession errors in tests** | Check `tests/conftest.py` has `SparkSession.builder.master("local[*]")` | +| **ABS API timeout or network error** | Use checkpoint tables instead. Tell the agent: `Read from workshop_vibe_coding.checkpoints. instead of calling the API` | +| **`@dp.table` not found** | Use `import databricks.declarative_pipelines as dp`, NOT `import dlt` | +| **CORS errors in browser** | Add to `app.py`: `app.add_middleware(CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"])` | +| **Agent rewrites working code** | Say: "Don't change functions that already pass tests. Only fix the failing ones." | +| **Can't write to Unity Catalog** | Check your schema name: `workshop_vibe_coding.team_XX`. Run `databricks auth status` to verify access. | +| **htmx not loading** | Add to ``: `` | +| **databricks-sql-connector errors** | Run `pip install databricks-sql-connector`. Check env vars: `echo $DATABRICKS_HOST` | +| **Pipeline deploy fails** | Run `databricks bundle validate` first. Check `databricks.yml` syntax. | +| **Agent goes off track** | Say "stop" then give a specific instruction. Don't let it keep going. | +| **Agent generates too much code** | Say "Keep it minimal. 
Just make the failing test pass." | +| **Running out of time** | Grab the next checkpoint. No shame — the goal is to have a working demo! | + +## Useful Commands + +```bash +# Check your Databricks connection +databricks auth status + +# Run specific tests +pytest tests/test_pipeline.py -k "bronze" -x +pytest tests/test_pipeline.py -k "silver" -x +pytest tests/test_pipeline.py -k "gold" -x +pytest tests/test_app.py -x + +# Run all tests +pytest tests/ -x + +# Validate DABs bundle +databricks bundle validate + +# Deploy pipeline +databricks bundle deploy -t dev +databricks bundle run grocery-intelligence-TEAM_NAME -t dev + +# Deploy app +cd app && databricks apps deploy --name grocery-app-TEAM_NAME --source-code-path ./ + +# Start app locally for testing +cd app && uvicorn app:app --reload --port 8000 +``` + +## Checkpoint Recovery + +If you're stuck and need to catch up: + +```sql +-- Checkpoint 1A: Bronze tables +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.abs_retail_trade_bronze + AS SELECT * FROM workshop_vibe_coding.checkpoints.abs_retail_trade_bronze; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.abs_cpi_food_bronze + AS SELECT * FROM workshop_vibe_coding.checkpoints.abs_cpi_food_bronze; + +-- Checkpoint 1B: Silver + Gold tables +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.retail_turnover + AS SELECT * FROM workshop_vibe_coding.checkpoints.retail_turnover; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.food_price_index + AS SELECT * FROM workshop_vibe_coding.checkpoints.food_price_index; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.retail_summary + AS SELECT * FROM workshop_vibe_coding.checkpoints.retail_summary; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy + AS SELECT * FROM workshop_vibe_coding.checkpoints.food_inflation_yoy; +``` + +## Data Science Track + +| Problem | Fix | +|---------|-----| +| **MLflow tracking URI error** | Check `DATABRICKS_HOST` env var: `echo $DATABRICKS_HOST` | +| **MLflow experiment not found** | Set explicitly: `mlflow.set_experiment("/Users/.../name")` | +| **Feature table write error** | Check UC schema: `workshop_vibe_coding.TEAM_SCHEMA` | +| **Window function errors** | Verify `orderBy("month")` and `partitionBy("state", "industry")` | +| **XGBoost not installed** | `pip install xgboost` | +| **Model Serving 404** | Endpoint takes 5-10 min to provision. Check status in UI. | +| **Model Serving auth error** | Check `DATABRICKS_TOKEN` env var | +| **Low R² score** | Try XGBoost, add more features, or check for data leakage | + +### DS Useful Commands + +```bash +# MLflow +mlflow experiments list +mlflow runs list --experiment-id + +# Model Registry +databricks unity-catalog models list --catalog workshop_vibe_coding --schema TEAM_SCHEMA + +# Model Serving +databricks serving-endpoints list +databricks serving-endpoints get grocery-forecast-TEAM_NAME +``` + +## Analyst Track + +| Problem | Fix | +|---------|-----| +| **Can't find Genie in sidebar** | Ask facilitator — may need to be enabled | +| **Genie permission error** | Need CREATE GENIE SPACE on catalog | +| **Genie gives wrong SQL** | Add column descriptions + example queries to instructions | +| **Dashboard viz doesn't match** | Rephrase NL prompt or write SQL directly | +| **Column comments not showing** | Use `ALTER TABLE t ALTER COLUMN c COMMENT 'desc'` | +| **Dashboard slow** | Check SQL warehouse is running | +| **Embedded dashboard blank** | Users need Databricks credentials to view | + +## Steering Tips + +| When the agent... 
| Say this | +|-------------------|---------| +| Writes too much code | "Keep it simple. One function, minimal code." | +| Ignores your CLAUDE.md | "Read CLAUDE.md first, then try again." | +| Gets stuck in a loop | "Stop. Let's take a different approach. [describe what you want]" | +| Makes something overly complex | "Simplify this. I just need [specific thing]." | +| Writes code before tests | "Stop. Write the tests first, then implement." | diff --git a/projects/coles-vibe-workshop/starter-kit/conftest.py b/projects/coles-vibe-workshop/starter-kit/conftest.py new file mode 100644 index 0000000..e5bdcf2 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/conftest.py @@ -0,0 +1,75 @@ +""" +Shared pytest fixtures for workshop tests. +Provides a SparkSession and sample DataFrames matching the ABS API schemas. +""" + +import pytest +from pyspark.sql import SparkSession +from pyspark.sql.types import StructType, StructField, StringType, DoubleType + + +@pytest.fixture(scope="session") +def spark(): + """Create a local SparkSession for testing. Shared across all tests.""" + return ( + SparkSession.builder + .master("local[*]") + .appName("workshop-tests") + .getOrCreate() + ) + + +@pytest.fixture +def sample_retail_csv(spark): + """ + Sample ABS Retail Trade data matching the bronze table schema. + Columns mirror what the API returns in CSV format. + + REGION codes: 1=NSW, 2=VIC, 3=QLD + INDUSTRY codes: 20=Food retailing, 41=Clothing + """ + schema = StructType([ + StructField("DATAFLOW", StringType()), + StructField("FREQ", StringType()), + StructField("MEASURE", StringType()), + StructField("INDUSTRY", StringType()), + StructField("REGION", StringType()), + StructField("TIME_PERIOD", StringType()), + StructField("OBS_VALUE", DoubleType()), + ]) + data = [ + ("ABS:RT", "M", "M1", "20", "1", "2024-01", 4500.0), + ("ABS:RT", "M", "M1", "20", "2", "2024-01", 3800.0), + ("ABS:RT", "M", "M1", "20", "3", "2024-01", 2900.0), + ("ABS:RT", "M", "M1", "41", "1", "2024-01", 1200.0), + ("ABS:RT", "M", "M1", "20", "1", "2024-02", 4600.0), + ("ABS:RT", "M", "M1", "20", "2", "2024-02", 3900.0), + ] + return spark.createDataFrame(data, schema) + + +@pytest.fixture +def sample_cpi_csv(spark): + """ + Sample ABS CPI Food data matching the bronze table schema. 
+ + INDEX codes: 10001=All groups CPI, 20001=Food and non-alcoholic beverages + REGION codes: 1=NSW, 2=VIC + """ + schema = StructType([ + StructField("DATAFLOW", StringType()), + StructField("FREQ", StringType()), + StructField("MEASURE", StringType()), + StructField("INDEX", StringType()), + StructField("REGION", StringType()), + StructField("TIME_PERIOD", StringType()), + StructField("OBS_VALUE", DoubleType()), + ]) + data = [ + ("ABS:CPI", "Q", "1", "10001", "1", "2024-Q1", 136.4), + ("ABS:CPI", "Q", "1", "20001", "1", "2024-Q1", 142.8), + ("ABS:CPI", "Q", "1", "10001", "2", "2024-Q1", 134.9), + ("ABS:CPI", "Q", "1", "20001", "2", "2024-Q1", 140.2), + ("ABS:CPI", "Q", "1", "10001", "1", "2024-Q2", 137.1), + ] + return spark.createDataFrame(data, schema) diff --git a/projects/coles-vibe-workshop/starter-kit/databricks.yml.template b/projects/coles-vibe-workshop/starter-kit/databricks.yml.template new file mode 100644 index 0000000..d6d6ee0 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/databricks.yml.template @@ -0,0 +1,30 @@ +# Databricks Asset Bundle configuration +# Replace TEAM_NAME and TEAM_SCHEMA with your team's values + +bundle: + name: grocery-intelligence-TEAM_NAME + +resources: + pipelines: + grocery_pipeline: + name: grocery-intelligence-TEAM_NAME + serverless: true + catalog: workshop_vibe_coding + schema: TEAM_SCHEMA + libraries: + - notebook: + path: src/bronze/abs_retail_trade.py + - notebook: + path: src/bronze/abs_cpi_food.py + - notebook: + path: src/silver/retail_turnover.py + - notebook: + path: src/silver/food_price_index.py + - notebook: + path: src/gold/retail_summary.py + - notebook: + path: src/gold/food_inflation.py + +targets: + dev: + default: true diff --git a/projects/coles-vibe-workshop/starter-kit/features/de_pipeline.feature b/projects/coles-vibe-workshop/starter-kit/features/de_pipeline.feature new file mode 100644 index 0000000..3977543 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/features/de_pipeline.feature @@ -0,0 +1,23 @@ +@local @smoke +Feature: Fast local validation + As a workshop team + I want to validate my transformation logic locally + So that I catch bugs before deploying to Databricks + + Scenario: Region codes decode correctly + Given I have region code "1" + Then the decoded state should be "New South Wales" + + Scenario: Industry codes decode correctly + Given I have industry code "20" + Then the decoded industry should be "Food retailing" + + Scenario: Monthly time periods parse correctly + Given I have time period "2024-01" + Then the parsed year should be 2024 + And the parsed month should be 1 + + Scenario: Quarterly time periods parse correctly + Given I have time period "2024-Q3" + Then the parsed year should be 2024 + And the parsed month should be 7 diff --git a/projects/coles-vibe-workshop/starter-kit/features/ds_features.feature b/projects/coles-vibe-workshop/starter-kit/features/ds_features.feature new file mode 100644 index 0000000..d8df840 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/features/ds_features.feature @@ -0,0 +1,48 @@ +@ds @local +Feature: Feature Engineering for Retail Forecasting + As a data scientist + I need lag, seasonal, and growth features + So that my model can learn temporal patterns in retail turnover + + Background: + Given a time series of monthly retail turnover data + And the data covers at least 24 months + + @lag + Scenario: Lag features capture past values + When I create lag features with windows 1, 3, 6, and 12 months + Then lag_1m equals the 
previous month value + And lag_12m equals the same month last year + And the first 12 rows have null lag_12m + + @seasonal + Scenario: Seasonal indicators decompose dates + When I extract seasonal features from the date column + Then month_of_year ranges from 1 to 12 + And quarter ranges from 1 to 4 + And is_december is true only for month 12 + And is_q4 is true only for months 10, 11, 12 + + @growth + Scenario: Growth rates measure change + Given turnover this month is 4600 and last month was 4500 + And turnover 12 months ago was 4200 + When I calculate growth features + Then MoM growth is approximately 2.22 percent + And YoY growth is approximately 9.52 percent + + @schema + Scenario: Feature table has all required columns + When I assemble a complete feature row + Then the row contains state, industry, month, turnover_millions + And the row contains turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + And the row contains month_of_year, quarter, is_december, is_q4 + And the row contains turnover_mom_growth, turnover_yoy_growth, cpi_yoy_change + + @nulls + Scenario: Key columns never have nulls + When I assemble a feature row with valid inputs + Then state is not null + And industry is not null + And month is not null + And turnover_millions is not null diff --git a/projects/coles-vibe-workshop/starter-kit/features/steps/de_steps.py b/projects/coles-vibe-workshop/starter-kit/features/steps/de_steps.py new file mode 100644 index 0000000..7705c01 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/features/steps/de_steps.py @@ -0,0 +1,50 @@ +from behave import given, then + +REGIONS = { + "1": "New South Wales", "2": "Victoria", "3": "Queensland", + "4": "South Australia", "5": "Western Australia", "6": "Tasmania", + "7": "Northern Territory", "8": "Australian Capital Territory", +} +INDUSTRIES = { + "20": "Food retailing", "41": "Clothing, footwear and personal accessories", + "42": "Department stores", "43": "Other retailing", + "44": "Cafes, restaurants and takeaway", "45": "Household goods retailing", +} + +@given('I have region code "{code}"') +def step_region_code(context, code): + context.region_code = code + +@then('the decoded state should be "{state}"') +def step_decoded_state(context, state): + actual = REGIONS.get(context.region_code, "Unknown") + assert actual == state, f"Expected '{state}', got '{actual}'" + +@given('I have industry code "{code}"') +def step_industry_code(context, code): + context.industry_code = code + +@then('the decoded industry should be "{name}"') +def step_decoded_industry(context, name): + actual = INDUSTRIES.get(context.industry_code, "Unknown") + assert actual == name, f"Expected '{name}', got '{actual}'" + +@given('I have time period "{tp}"') +def step_time_period(context, tp): + context.time_period = tp + if "-Q" in tp: + year, q = tp.split("-Q") + context.parsed_year = int(year) + context.parsed_month = (int(q) - 1) * 3 + 1 + else: + parts = tp.split("-") + context.parsed_year = int(parts[0]) + context.parsed_month = int(parts[1]) + +@then('the parsed year should be {year:d}') +def step_parsed_year(context, year): + assert context.parsed_year == year + +@then('the parsed month should be {month:d}') +def step_parsed_month(context, month): + assert context.parsed_month == month diff --git a/projects/coles-vibe-workshop/starter-kit/features/steps/ds_steps.py b/projects/coles-vibe-workshop/starter-kit/features/steps/ds_steps.py new file mode 100644 index 0000000..394394a --- /dev/null +++ 
b/projects/coles-vibe-workshop/starter-kit/features/steps/ds_steps.py @@ -0,0 +1,189 @@ +""" +BDD step definitions for DS feature engineering. +Runs locally without Spark — tests the pure Python logic. +""" +from datetime import date +from behave import given, when, then +import pytest + + +# ── Import feature functions from test_features_local ───────────── +# In real usage, teams would import from their src/ module. +# Here we inline the functions to keep the BDD test self-contained. + +def create_lag_features(values, lag): + return [None if i < lag else values[i - lag] for i in range(len(values))] + + +def extract_seasonal_features(d): + month = d.month + quarter = (month - 1) // 3 + 1 + return { + "month_of_year": month, + "quarter": quarter, + "is_december": month == 12, + "is_q4": quarter == 4, + } + + +def calc_mom_growth(current, previous): + if previous is None or previous == 0: + return None + return ((current - previous) / previous) * 100 + + +def calc_yoy_growth(current, year_ago): + if year_ago is None or year_ago == 0: + return None + return ((current - year_ago) / year_ago) * 100 + + +def build_feature_row(state, industry, month, turnover, lag_values, cpi_yoy_change=None): + seasonal = extract_seasonal_features(month) + return { + "state": state, "industry": industry, "month": month, + "turnover_millions": turnover, + "turnover_lag_1m": lag_values.get("lag_1m"), + "turnover_lag_3m": lag_values.get("lag_3m"), + "turnover_lag_6m": lag_values.get("lag_6m"), + "turnover_lag_12m": lag_values.get("lag_12m"), + **seasonal, + "turnover_mom_growth": calc_mom_growth(turnover, lag_values.get("lag_1m")), + "turnover_yoy_growth": calc_yoy_growth(turnover, lag_values.get("lag_12m")), + "cpi_yoy_change": cpi_yoy_change, + } + + +# ── Background Steps ───────────────────────────────────────────── + +@given("a time series of monthly retail turnover data") +def step_create_time_series(context): + context.values = [4000 + i * 25 for i in range(24)] + + +@given("the data covers at least 24 months") +def step_verify_length(context): + assert len(context.values) >= 24 + + +# ── Lag Feature Steps ──────────────────────────────────────────── + +@when("I create lag features with windows 1, 3, 6, and 12 months") +def step_create_lags(context): + context.lag_1 = create_lag_features(context.values, 1) + context.lag_3 = create_lag_features(context.values, 3) + context.lag_6 = create_lag_features(context.values, 6) + context.lag_12 = create_lag_features(context.values, 12) + + +@then("lag_1m equals the previous month value") +def step_check_lag_1(context): + for i in range(1, len(context.values)): + assert context.lag_1[i] == context.values[i - 1] + + +@then("lag_12m equals the same month last year") +def step_check_lag_12(context): + for i in range(12, len(context.values)): + assert context.lag_12[i] == context.values[i - 12] + + +@then("the first 12 rows have null lag_12m") +def step_check_lag_12_nulls(context): + for i in range(12): + assert context.lag_12[i] is None + + +# ── Seasonal Feature Steps ─────────────────────────────────────── + +@when("I extract seasonal features from the date column") +def step_extract_seasonal(context): + context.seasonal = [ + extract_seasonal_features(date(2024, m, 1)) for m in range(1, 13) + ] + + +@then("month_of_year ranges from 1 to 12") +def step_check_month_range(context): + months = [s["month_of_year"] for s in context.seasonal] + assert months == list(range(1, 13)) + + +@then("quarter ranges from 1 to 4") +def step_check_quarter_range(context): + quarters = 
sorted(set(s["quarter"] for s in context.seasonal)) + assert quarters == [1, 2, 3, 4] + + +@then("is_december is true only for month 12") +def step_check_december(context): + for s in context.seasonal: + assert s["is_december"] == (s["month_of_year"] == 12) + + +@then("is_q4 is true only for months 10, 11, 12") +def step_check_q4(context): + for s in context.seasonal: + assert s["is_q4"] == (s["month_of_year"] >= 10) + + +# ── Growth Feature Steps ───────────────────────────────────────── + +@given("turnover this month is {current:g} and last month was {previous:g}") +def step_set_turnover(context, current, previous): + context.current = current + context.previous = previous + + +@given("turnover 12 months ago was {year_ago:g}") +def step_set_year_ago(context, year_ago): + context.year_ago = year_ago + + +@when("I calculate growth features") +def step_calc_growth(context): + context.mom = calc_mom_growth(context.current, context.previous) + context.yoy = calc_yoy_growth(context.current, context.year_ago) + + +@then("MoM growth is approximately {expected:g} percent") +def step_check_mom(context, expected): + assert abs(context.mom - expected) < 0.1 + + +@then("YoY growth is approximately {expected:g} percent") +def step_check_yoy(context, expected): + assert abs(context.yoy - expected) < 0.1 + + +# ── Schema Steps ───────────────────────────────────────────────── + +@when("I assemble a complete feature row") +def step_assemble_row(context): + context.row = build_feature_row( + "New South Wales", "Food retailing", date(2024, 6, 1), 4600.0, + {"lag_1m": 4500, "lag_3m": 4400, "lag_6m": 4300, "lag_12m": 4200}, + cpi_yoy_change=3.5, + ) + + +@then("the row contains {columns}") +def step_check_columns(context, columns): + cols = [c.strip() for c in columns.split(",")] + for col in cols: + assert col in context.row, f"Missing column: {col}" + + +# ── Null Check Steps ───────────────────────────────────────────── + +@when("I assemble a feature row with valid inputs") +def step_assemble_valid_row(context): + context.row = build_feature_row( + "Victoria", "Food retailing", date(2024, 3, 1), 3900.0, + {"lag_1m": 3800}, + ) + + +@then("{column} is not null") +def step_check_not_null(context, column): + assert context.row[column] is not None diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/00-explore-data.md b/projects/coles-vibe-workshop/starter-kit/prompts/00-explore-data.md new file mode 100644 index 0000000..15cef34 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/00-explore-data.md @@ -0,0 +1,34 @@ +## Step 1: Explore the Data + +Before writing any code, understand what the raw data looks like. + +### Prompt + +Paste this into Claude Code: + +``` +Fetch a sample of the ABS Retail Trade API and show me the columns, data types, and 3 sample rows: +https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03 + +Also fetch the ABS CPI Food API and show me the same: +https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4 + +For each API, show me: +1. All column names and their data types +2. 3 sample rows +3. What the coded values mean (REGION, INDUSTRY, INDEX columns) +``` + +### Expected Result + +You should see a table of columns for each API. 
Key columns: +- **Retail Trade:** DATAFLOW, FREQ, MEASURE, INDUSTRY, REGION, TIME_PERIOD, OBS_VALUE +- **CPI Food:** DATAFLOW, FREQ, MEASURE, INDEX, REGION, TIME_PERIOD, OBS_VALUE + +The REGION, INDUSTRY, and INDEX columns contain numeric codes (e.g., "1" for NSW). + +### If It Doesn't Work + +- **API timeout:** The ABS APIs can be slow. Wait 30 seconds and try again. +- **Network error:** Check you have internet access from the terminal. Try `curl -I https://data.api.abs.gov.au`. +- **Still failing:** Skip this step — you can see sample data in `starter-kit/conftest.py`. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/01-explore-data.md b/projects/coles-vibe-workshop/starter-kit/prompts/01-explore-data.md new file mode 100644 index 0000000..15cef34 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/01-explore-data.md @@ -0,0 +1,34 @@ +## Step 1: Explore the Data + +Before writing any code, understand what the raw data looks like. + +### Prompt + +Paste this into Claude Code: + +``` +Fetch a sample of the ABS Retail Trade API and show me the columns, data types, and 3 sample rows: +https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03 + +Also fetch the ABS CPI Food API and show me the same: +https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4 + +For each API, show me: +1. All column names and their data types +2. 3 sample rows +3. What the coded values mean (REGION, INDUSTRY, INDEX columns) +``` + +### Expected Result + +You should see a table of columns for each API. Key columns: +- **Retail Trade:** DATAFLOW, FREQ, MEASURE, INDUSTRY, REGION, TIME_PERIOD, OBS_VALUE +- **CPI Food:** DATAFLOW, FREQ, MEASURE, INDEX, REGION, TIME_PERIOD, OBS_VALUE + +The REGION, INDUSTRY, and INDEX columns contain numeric codes (e.g., "1" for NSW). + +### If It Doesn't Work + +- **API timeout:** The ABS APIs can be slow. Wait 30 seconds and try again. +- **Network error:** Check you have internet access from the terminal. Try `curl -I https://data.api.abs.gov.au`. +- **Still failing:** Skip this step — you can see sample data in `starter-kit/conftest.py`. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/02-write-tests.md b/projects/coles-vibe-workshop/starter-kit/prompts/02-write-tests.md new file mode 100644 index 0000000..4923c1f --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/02-write-tests.md @@ -0,0 +1,67 @@ +## Step 2: Write Pipeline Tests + +Tests are your spec. Write them BEFORE any implementation. + +### Prompt + +Paste this into Claude Code: + +``` +Create pytest tests for a Lakeflow Declarative Pipeline in tests/test_pipeline.py. +Use the fixtures from tests/conftest.py (spark, sample_retail_csv, sample_cpi_csv). + +Write these tests: + +1. test_bronze_retail_trade_schema: + - Raw CSV data has all original columns: DATAFLOW, FREQ, MEASURE, INDUSTRY, REGION, TIME_PERIOD, OBS_VALUE + - Test: given sample_retail_csv, assert correct columns exist + +2. test_bronze_retail_trade_not_null: + - TIME_PERIOD and OBS_VALUE are never null + - Test: assert no nulls in these columns + +3. 
test_silver_retail_turnover_decodes_regions: + - REGION codes decoded to state names (1=New South Wales, 2=Victoria, 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, 7=Northern Territory, 8=Australian Capital Territory) + - Test: given bronze rows with code "1", silver rows have "New South Wales" + +4. test_silver_retail_turnover_decodes_industries: + - INDUSTRY codes decoded (20=Food retailing, 41=Clothing/footwear/personal, 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, 45=Household goods retailing) + - Test: given bronze rows with code "20", silver rows have "Food retailing" + +5. test_silver_retail_turnover_parses_dates: + - TIME_PERIOD string "2024-01" parsed to proper date column + - Test: assert date type and correct value + +6. test_gold_retail_summary_rolling_averages: + - Adds 3-month and 12-month rolling averages + - Test: given 24 months of silver data, verify rolling averages are correct + +7. test_gold_retail_summary_yoy_growth: + - Adds year-over-year growth percentage + - Test: given 24 months of data, verify YoY growth = (current - same_month_last_year) / same_month_last_year * 100 + +8. test_bronze_cpi_schema: + - CPI data has columns: DATAFLOW, FREQ, MEASURE, INDEX, REGION, TIME_PERIOD, OBS_VALUE + +9. test_silver_food_price_index_decodes: + - INDEX codes decoded (10001=All groups CPI, 20001=Food and non-alcoholic beverages) + - REGION codes decoded to state names + +10. test_gold_food_inflation_yoy: + - Calculates year-over-year CPI change percentage + - Test: given 8 quarters of data, verify YoY change is correct + +Write ONLY the tests. Do NOT implement any pipeline functions yet. +Use PySpark test fixtures with small DataFrames (5-10 rows each). +Import transformation functions from src/ modules (they don't exist yet — that's fine). +``` + +### Expected Result + +A `tests/test_pipeline.py` file with 10 test functions. All tests should FAIL when you run them (because the implementation doesn't exist yet). That's correct — this is TDD. + +### If It Doesn't Work + +- **Agent writes implementation too:** Say "Stop. Delete the implementation. I only want the tests." +- **Agent uses pandas:** Say "Use PySpark, not pandas. Check CLAUDE.md." +- **Import errors:** That's expected — the src/ modules don't exist yet. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/03-build-bronze.md b/projects/coles-vibe-workshop/starter-kit/prompts/03-build-bronze.md new file mode 100644 index 0000000..bf75be0 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/03-build-bronze.md @@ -0,0 +1,44 @@ +## Step 3: Build Bronze Layer + +Bronze = raw data ingestion. No transformations, just get the data in. + +### Prompt + +Paste this into Claude Code: + +``` +Create the bronze layer for our Lakeflow Declarative Pipeline: + +1. src/bronze/abs_retail_trade.py + - Use @dp.table decorator (import databricks.declarative_pipelines as dp) + - Ingest ABS Retail Trade API CSV: https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2010-01 + - Use spark.read.csv() with header=True and inferSchema=True + - Add data quality expectations: + @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") + @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") + +2. 
src/bronze/abs_cpi_food.py + - Same pattern with @dp.table + - Ingest ABS CPI Food API CSV: https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2010-Q1 + - Same data quality expectations + +Unity Catalog target: workshop_vibe_coding.TEAM_SCHEMA + +Then run the bronze tests: pytest tests/test_pipeline.py -k "bronze" -x +Fix any failures. +``` + +### Expected Result + +Two files in `src/bronze/`, each with a `@dp.table` decorated function. Bronze tests should pass. + +### If It Doesn't Work + +- **API timeout:** The ABS APIs can be slow from some networks. Try once more. +- **Still failing:** Use checkpoint data instead. Tell the agent: + ``` + The API is not accessible. Instead, read from the checkpoint tables: + - spark.read.table("workshop_vibe_coding.checkpoints.abs_retail_trade_bronze") + - spark.read.table("workshop_vibe_coding.checkpoints.abs_cpi_food_bronze") + ``` +- **@dp.table not found:** Make sure the import is `import databricks.declarative_pipelines as dp`, not `import dlt`. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/04-build-silver-gold.md b/projects/coles-vibe-workshop/starter-kit/prompts/04-build-silver-gold.md new file mode 100644 index 0000000..5a40c2b --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/04-build-silver-gold.md @@ -0,0 +1,56 @@ +## Step 4: Build Silver + Gold Layers + +Silver = cleaned and decoded. Gold = aggregated and analytics-ready. + +### Prompt + +Paste this into Claude Code: + +``` +Build the silver and gold layers to make all remaining tests pass. + +SILVER LAYER: + +1. src/silver/retail_turnover.py + - @dp.table that reads from the bronze retail trade table + - Decode REGION codes: 1=New South Wales, 2=Victoria, 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, 7=Northern Territory, 8=Australian Capital Territory + - Decode INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal, 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, 45=Household goods retailing + - Parse TIME_PERIOD "2024-01" to a proper date column + - Rename OBS_VALUE to turnover_millions + +2. src/silver/food_price_index.py + - @dp.table that reads from the bronze CPI table + - Decode REGION codes (same as above) + - Decode INDEX codes: 10001=All groups CPI, 20001=Food and non-alcoholic beverages + - Rename OBS_VALUE to cpi_index + +GOLD LAYER: + +3. src/gold/retail_summary.py + - @dp.materialized_view reading from silver retail_turnover + - Add turnover_3m_avg: 3-month rolling average per state/industry + - Add turnover_12m_avg: 12-month rolling average per state/industry + - Add yoy_growth_pct: year-over-year growth percentage + +4. src/gold/food_inflation.py + - @dp.materialized_view reading from silver food_price_index + - Add yoy_change_pct: year-over-year CPI change percentage + +Run ALL tests: pytest tests/test_pipeline.py -x +Fix any failures until everything is green. +``` + +### Expected Result + +Four new files in `src/silver/` and `src/gold/`. All 10 pipeline tests pass. + +### If It Doesn't Work + +- **Agent uses pandas:** Say "Use PySpark window functions, not pandas. Check CLAUDE.md." +- **Rolling averages wrong:** Ensure the window is ordered by date and partitioned by state + industry. +- **Running out of time:** Grab Checkpoint 1B — pre-loaded silver and gold tables in your schema. + ``` + Copy checkpoint tables to our schema. 
Run: + CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.retail_summary AS SELECT * FROM workshop_vibe_coding.checkpoints.retail_summary; + CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy AS SELECT * FROM workshop_vibe_coding.checkpoints.food_inflation_yoy; + ``` diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/05-deploy-pipeline.md b/projects/coles-vibe-workshop/starter-kit/prompts/05-deploy-pipeline.md new file mode 100644 index 0000000..b598fba --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/05-deploy-pipeline.md @@ -0,0 +1,43 @@ +## Step 5: Deploy Pipeline with DABs + +Package and deploy your pipeline as a Databricks Asset Bundle. + +### Prompt + +Paste this into Claude Code: + +``` +Set up Databricks Asset Bundles deployment for our pipeline. + +1. Update databricks.yml with: + - Bundle name: grocery-intelligence-TEAM_NAME + - Pipeline resource pointing to all our src/ notebooks + - Serverless: true + - Catalog: workshop_vibe_coding + - Schema: TEAM_SCHEMA + - Dev target as default + +2. Validate the bundle: + databricks bundle validate + +3. Deploy to dev: + databricks bundle deploy -t dev + +4. Run the pipeline: + databricks bundle run grocery-intelligence-TEAM_NAME -t dev + +Show me the output of each command. +``` + +### Expected Result + +- `databricks bundle validate` shows no errors +- `databricks bundle deploy` deploys successfully +- Pipeline starts running in the Databricks workspace + +### If It Doesn't Work + +- **Validate fails:** Check `databricks.yml` syntax. Compare with `starter-kit/databricks.yml.template`. +- **Auth errors:** Run `databricks auth status` to check your token is valid. +- **Pipeline fails on run:** Open the pipeline in the Databricks UI (Workflows tab) to see detailed error logs. +- **Can't find notebooks:** Make sure `src/` paths in databricks.yml match your actual file locations. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/06-write-prd.md b/projects/coles-vibe-workshop/starter-kit/prompts/06-write-prd.md new file mode 100644 index 0000000..37e6601 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/06-write-prd.md @@ -0,0 +1,52 @@ +## Step 6: Write the App PRD + +Start Lab 2 by giving the agent a clear product spec. + +### Prompt + +Paste this into Claude Code: + +``` +Create a new directory called "app" and add a PRD as app/README.md: + +## Grocery Intelligence App + +### Overview +A web application that displays retail analytics from our gold tables +and lets users ask questions in plain English. + +### User Stories +1. As a business user, I want to see retail turnover by state so I can compare performance. +2. As an analyst, I want to ask questions like "which state had the highest food retail growth?" and get answers. +3. As an executive, I want to see food inflation trends at a glance. 
+ +### Technical Requirements +- Backend: FastAPI (Python) +- Frontend: HTML + Tailwind CSS (CDN) + htmx (CDN) — no npm or node required +- Data: workshop_vibe_coding.TEAM_SCHEMA.retail_summary and food_inflation_yoy +- AI feature: Natural language to SQL using Databricks Foundation Model API +- Deployment: Databricks Apps + +### API Endpoints +- GET /health → {"status": "ok"} +- GET /api/metrics?state=X&start_date=Y&end_date=Z → list of records +- POST /api/ask {"question": "..."} → {"answer": "...", "sql_query": "..."} +- GET / → serves the frontend HTML + +### Tech Constraints +- Use databricks-sql-connector for all database queries +- All SQL must be parameterized (no string concatenation) +- Frontend uses Tailwind CSS from CDN and htmx from CDN — no build step +- Use Pydantic models for request/response validation + +Also update CLAUDE.md to add these app-specific rules. +``` + +### Expected Result + +An `app/README.md` with the PRD and updated `CLAUDE.md` with app-specific instructions. + +### If It Doesn't Work + +- **Agent creates too much:** Say "Just the PRD and CLAUDE.md update. Don't build anything yet." +- **Wrong directory:** Make sure the agent creates files inside the `app/` directory. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/07-write-app-tests.md b/projects/coles-vibe-workshop/starter-kit/prompts/07-write-app-tests.md new file mode 100644 index 0000000..cd542b4 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/07-write-app-tests.md @@ -0,0 +1,50 @@ +## Step 7: Write API Tests + +Define what the backend should do before building it. + +### Prompt + +Paste this into Claude Code: + +``` +Write pytest tests for the FastAPI backend in tests/test_app.py. +Use httpx AsyncClient with ASGITransport for testing. + +Tests to write: + +1. test_health_endpoint: + - GET /health returns 200 + - Response body is {"status": "ok"} + +2. test_get_metrics_valid: + - GET /api/metrics returns 200 + - Response is a list of records + - Each record has keys: state, industry, month, turnover_millions, yoy_growth_pct + +3. test_get_metrics_with_state_filter: + - GET /api/metrics?state=New%20South%20Wales returns 200 + - All records in response have state == "New South Wales" + +4. test_get_metrics_invalid_date: + - GET /api/metrics?start_date=not-a-date returns 400 or 422 + +5. test_ask_question_valid: + - POST /api/ask with {"question": "Which state has the highest turnover?"} + - Returns 200 + - Response has "answer" (string) and "sql_query" (string) + +6. test_ask_question_empty: + - POST /api/ask with {"question": ""} returns 400 or 422 + +Write ONLY the tests. Do NOT implement the app yet. +Install httpx if needed: pip install httpx pytest-asyncio +``` + +### Expected Result + +A `tests/test_app.py` with 6 async test functions. All tests fail (no app exists yet). + +### If It Doesn't Work + +- **Import errors:** Expected — `app.py` doesn't exist yet. The tests define the contract. +- **Agent builds the app:** Say "Stop. Delete the implementation. Tests only." diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/08-build-backend.md b/projects/coles-vibe-workshop/starter-kit/prompts/08-build-backend.md new file mode 100644 index 0000000..bba3d24 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/08-build-backend.md @@ -0,0 +1,54 @@ +## Step 8: Build the Backend + +Implement FastAPI to make all API tests pass. 
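+
+For orientation, here is roughly the shape the metrics endpoint should converge on. This is a minimal sketch, not the required implementation: the table name follows this kit's convention, and the `:state` named-parameter style assumes a recent version of databricks-sql-connector.
+
+```python
+# Hedged sketch of GET /api/metrics. Illustrative only; the agent's version will differ.
+import os
+from typing import Optional
+
+from databricks import sql
+from fastapi import FastAPI
+
+app = FastAPI()
+
+
+@app.get("/api/metrics")
+def get_metrics(state: Optional[str] = None):
+    query = "SELECT * FROM workshop_vibe_coding.TEAM_SCHEMA.retail_summary"
+    params = {}
+    if state:
+        query += " WHERE state = :state"  # parameterized, never string concatenation
+        params["state"] = state
+    with sql.connect(
+        server_hostname=os.environ["DATABRICKS_HOST"].removeprefix("https://"),
+        http_path=os.environ["DATABRICKS_HTTP_PATH"],
+        access_token=os.environ["DATABRICKS_TOKEN"],
+    ) as conn, conn.cursor() as cur:
+        cur.execute(query, params)
+        cols = [c[0] for c in cur.description]
+        return [dict(zip(cols, row)) for row in cur.fetchall()]
+```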
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Implement the FastAPI backend in app/app.py to pass all tests in tests/test_app.py.
+
+For GET /api/metrics:
+- Query workshop_vibe_coding.TEAM_SCHEMA.retail_summary
+- Support optional query params: state, start_date, end_date
+- Use databricks-sql-connector with parameterized queries
+- Return list of records as JSON
+
+For POST /api/ask:
+- Take {"question": "..."} in request body
+- Send the question to the Foundation Model API with the table schema as context
+- The LLM generates a SQL query
+- Execute the SQL query against our gold tables
+- Return {"answer": "...", "sql_query": "..."}
+- Use the Databricks SDK: from databricks.sdk import WorkspaceClient
+
+For GET /health:
+- Return {"status": "ok"}
+
+Connection config from environment variables:
+- DATABRICKS_HOST (workspace URL)
+- DATABRICKS_HTTP_PATH (SQL warehouse path)
+- DATABRICKS_TOKEN (PAT token — already set in your environment)
+
+Create app/requirements.txt with all dependencies:
+- fastapi
+- uvicorn
+- databricks-sql-connector
+- databricks-sdk
+- pydantic
+
+Run tests after implementation: pytest tests/test_app.py -x
+Fix any failures.
+```
+
+### Expected Result
+
+An `app/app.py` with three endpoints. All 6 API tests pass.
+
+### If It Doesn't Work
+
+- **databricks-sql-connector import error:** Run `pip install databricks-sql-connector`
+- **Auth errors:** Check `echo $DATABRICKS_HOST` and `echo $DATABRICKS_TOKEN` are set
+- **SQL errors:** Make sure table names match: `workshop_vibe_coding.TEAM_SCHEMA.retail_summary`
+- **Foundation Model API errors:** Check the AI Gateway endpoint. Ask the facilitator for the correct URL.
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/09-build-frontend.md b/projects/coles-vibe-workshop/starter-kit/prompts/09-build-frontend.md
new file mode 100644
index 0000000..48119d2
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/09-build-frontend.md
@@ -0,0 +1,52 @@
+## Step 9: Build the Frontend
+
+HTML + Tailwind + htmx — no build step required.
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Build the frontend in app/static/index.html:
+
+1. Include these CDN scripts in <head>:
+   - Tailwind CSS: <script src="https://cdn.tailwindcss.com"></script>
+   - htmx: <script src="https://unpkg.com/htmx.org"></script>
+
+2. Layout:
+   - Dark header bar: "Grocery Intelligence Platform — TEAM_NAME"
+   - Three metric cards at the top: Total Turnover, Average Growth %, Top State
+     (fetch from GET /api/metrics on page load)
+   - Filter bar: State dropdown, date range pickers
+   - Data table showing metrics (updated via htmx when filters change)
+   - "Ask AI" section at the bottom:
+     - Text input for questions
+     - Submit button
+     - Response area showing the answer and the generated SQL
+
+3. Use htmx for all dynamic updates:
+   - hx-get="/api/metrics" for the data table
+   - hx-post="/api/ask" for the AI question
+   - hx-trigger="change" on filters to auto-refresh
+
+4. Style with Tailwind:
+   - Clean, professional look
+   - White cards with subtle shadows
+   - Responsive layout
+
+Also update app/app.py to:
+- Mount static files: app.mount("/static", StaticFiles(directory="static"))
+- Serve index.html at GET /
+- Add CORSMiddleware with allow_origins=["*"]
+```
+
+### Expected Result
+
+A single `app/static/index.html` file. When you run `uvicorn app.app:app`, the page loads with the dashboard.
+
+### If It Doesn't Work
+
+- **Blank page:** Check browser DevTools console (F12) for errors. Usually a missing script tag.
+- **htmx not working:** Make sure the script tag is in `<head>`, not `<body>`.
+- **CORS errors:** Add CORSMiddleware to FastAPI (the agent sometimes forgets this).
+- **Static files not serving:** Check `app.mount("/static", StaticFiles(directory="static"))` in app.py.
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/10-setup-genie.md b/projects/coles-vibe-workshop/starter-kit/prompts/10-setup-genie.md
new file mode 100644
index 0000000..0f34148
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/10-setup-genie.md
@@ -0,0 +1,46 @@
+## Step 10: Create a Genie Space
+
+This is done in the Databricks UI (not the terminal). Genie lets business users ask questions in plain English.
+
+### Steps
+
+1. Open your Databricks workspace in the browser
+2. Click **Genie** in the left sidebar
+3. Click **New Genie Space**
+4. Configure:
+   - **Name:** `Grocery Intelligence — TEAM_NAME`
+   - **SQL Warehouse:** Select the workshop SQL warehouse
+   - **Tables:** Click "Add tables" and add:
+     - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary`
+     - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy`
+   - **General Instructions** (paste this):
+     ```
+     This data contains Australian retail trade and food price data.
+     States are Australian states (New South Wales, Victoria, Queensland, etc.).
+     Turnover is in millions of AUD.
+     CPI index values are relative to a base period.
+     YoY growth and change percentages show year-over-year comparisons.
+     ```
+5. Click **Save**
+
+### Test Your Genie Space
+
+Try these questions:
+
+```
+Which state had the highest food retail turnover last month?
+```
+
+```
+Show me the year-over-year food price inflation trend for Victoria.
+```
+
+```
+Compare retail growth across all states for the last 12 months.
+```
+
+### If It Doesn't Work
+
+- **Can't find Genie in sidebar:** Ask the facilitator — Genie may need to be enabled for your workspace.
+- **"No permission" error:** You need CREATE GENIE SPACE permission on the catalog. Ask the facilitator.
+- **Wrong answers:** Add column descriptions to your tables in Unity Catalog. Richer metadata = better Genie answers.
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/11-deploy-app.md b/projects/coles-vibe-workshop/starter-kit/prompts/11-deploy-app.md
new file mode 100644
index 0000000..bc00de7
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/11-deploy-app.md
@@ -0,0 +1,49 @@
+## Step 11: Deploy Your App
+
+Package and deploy to Databricks Apps.
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Prepare the app for deployment to Databricks Apps:
+
+1. Make sure app/requirements.txt has all dependencies:
+   fastapi
+   uvicorn
+   databricks-sql-connector
+   databricks-sdk
+   pydantic
+
+2. Create app/app.yaml with:
+   command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+
+3. Deploy to Databricks Apps:
+   cd app
+   databricks apps deploy --name grocery-app-TEAM_NAME --source-code-path ./
+
+4. Show me the app URL when deployment completes.
+
+5. Test the deployed app:
+   curl <app-url>/health
+```
+
+### Expected Result
+
+The app is deployed and accessible at a URL like `https://<workspace>.databricks.com/apps/grocery-app-TEAM_NAME`. The health endpoint returns `{"status": "ok"}`.
+
+### If It Doesn't Work
+
+- **Deploy fails:** Check `app.yaml` syntax. Compare with `starter-kit/app.yaml.template`.
+- **App starts but shows errors:** Check app logs in the Databricks UI under Apps.
+- **Can't connect to database:** Ensure `DATABRICKS_HOST` and `DATABRICKS_HTTP_PATH` are in `app.yaml` env section.
+- **502 error after deploy:** The app may still be starting.
Wait 30 seconds and refresh. + +### Prepare Your Demo + +You have 3 minutes to show: +1. Your pipeline (show the DAG or table list in Databricks UI) +2. Your app (load it, use the AI feature) +3. Your Genie space (ask a question live) +4. One thing that surprised you diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/01-setup-genie.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/01-setup-genie.md new file mode 100644 index 0000000..0f34148 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/01-setup-genie.md @@ -0,0 +1,46 @@ +## Step 10: Create a Genie Space + +This is done in the Databricks UI (not the terminal). Genie lets business users ask questions in plain English. + +### Steps + +1. Open your Databricks workspace in the browser +2. Click **Genie** in the left sidebar +3. Click **New Genie Space** +4. Configure: + - **Name:** `Grocery Intelligence — TEAM_NAME` + - **SQL Warehouse:** Select the workshop SQL warehouse + - **Tables:** Click "Add tables" and add: + - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary` + - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy` + - **General Instructions** (paste this): + ``` + This data contains Australian retail trade and food price data. + States are Australian states (New South Wales, Victoria, Queensland, etc.). + Turnover is in millions of AUD. + CPI index values are relative to a base period. + YoY growth and change percentages show year-over-year comparisons. + ``` +5. Click **Save** + +### Test Your Genie Space + +Try these questions: + +``` +Which state had the highest food retail turnover last month? +``` + +``` +Show me the year-over-year food price inflation trend for Victoria. +``` + +``` +Compare retail growth across all states for the last 12 months. +``` + +### If It Doesn't Work + +- **Can't find Genie in sidebar:** Ask the facilitator — Genie may need to be enabled for your workspace. +- **"No permission" error:** You need CREATE GENIE SPACE permission on the catalog. Ask the facilitator. +- **Wrong answers:** Add column descriptions to your tables in Unity Catalog. Richer metadata = better Genie answers. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/02-tune-genie.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/02-tune-genie.md new file mode 100644 index 0000000..d49c0a6 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/02-tune-genie.md @@ -0,0 +1,40 @@ +## Step 2: Tune Genie with Metadata + +Adding column descriptions and table comments dramatically improves Genie accuracy. 
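+
+For orientation, the statements the agent should generate reduce to a loop like the sketch below. It assumes a notebook where `spark` is already defined; substitute TEAM_SCHEMA and extend the dictionary to every column on both tables.
+
+```python
+# Hedged sketch. The full prompt below covers every column; this shows the pattern.
+table = "workshop_vibe_coding.TEAM_SCHEMA.retail_summary"
+column_comments = {
+    "state": "Australian state name (New South Wales, Victoria, Queensland, etc.)",
+    "turnover_millions": "Monthly retail turnover in millions of AUD",
+    "yoy_growth_pct": "Year-over-year turnover growth as a percentage",
+}
+
+# Table comment via TBLPROPERTIES, as this step specifies.
+spark.sql(
+    f"ALTER TABLE {table} SET TBLPROPERTIES ('comment' = "
+    "'Monthly retail turnover summary by Australian state and industry')"
+)
+
+# One ALTER COLUMN ... COMMENT statement per column.
+for column, comment in column_comments.items():
+    spark.sql(f"ALTER TABLE {table} ALTER COLUMN {column} COMMENT '{comment}'")
+```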
+ +### Prompt + +Paste this into Claude Code: + +``` +Add column comments to our gold tables for better Genie and AI/BI accuracy: + +For workshop_vibe_coding.TEAM_SCHEMA.retail_summary: +- Table comment: "Monthly retail turnover summary by Australian state and industry with rolling averages and YoY growth" +- state: "Australian state name (New South Wales, Victoria, Queensland, etc.)" +- industry: "Retail industry category (Food retailing, Department stores, etc.)" +- month: "Date of the monthly observation (first of month)" +- turnover_millions: "Monthly retail turnover in millions of AUD" +- turnover_3m_avg: "3-month rolling average of turnover in millions AUD" +- turnover_12m_avg: "12-month rolling average of turnover in millions AUD" +- yoy_growth_pct: "Year-over-year turnover growth as a percentage" + +For workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: +- Table comment: "Quarterly food price inflation by Australian state with YoY CPI changes" +- state: "Australian state name" +- quarter: "Calendar quarter (e.g., 2024-Q1)" +- cpi_index: "Consumer Price Index value (base period = 100)" +- yoy_change_pct: "Year-over-year CPI change as a percentage (positive = inflation)" + +Use ALTER TABLE ... SET TBLPROPERTIES for table comments. +Use ALTER TABLE ... ALTER COLUMN ... COMMENT for column comments. +``` + +### Expected Result + +Both tables have descriptive comments visible in Unity Catalog. Genie should now give better answers. + +### If It Doesn't Work + +- **Permission denied:** You need ALTER permission on the tables. Ask the facilitator. +- **ALTER COLUMN syntax error:** Use: `ALTER TABLE t ALTER COLUMN c COMMENT 'description'` diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/03-build-dashboard.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/03-build-dashboard.md new file mode 100644 index 0000000..7b04cee --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/03-build-dashboard.md @@ -0,0 +1,48 @@ +## Step 3: Build AI/BI Dashboard + +This is done in the Databricks UI, not the terminal. + +### Steps + +1. Navigate to **Dashboards** in the left sidebar +2. Click **Create Dashboard** → **AI/BI Dashboard** +3. Connect to your gold tables: + - `workshop_vibe_coding.TEAM_SCHEMA.retail_summary` + - `workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy` + +4. Create visualizations using natural language: + +``` +Show monthly food retail turnover by state as a line chart for the last 2 years +``` + +``` +Create a bar chart comparing year-over-year retail growth by state for the latest month +``` + +``` +Show a heatmap of food inflation by state and quarter +``` + +``` +Display the top 5 states by average monthly turnover as a horizontal bar chart +``` + +``` +Show a trend line of national food price inflation over time +``` + +5. Arrange visualizations into a clean layout +6. Add a title: "Grocery Intelligence Dashboard — TEAM_NAME" +7. Click **Publish** + +### Expected Result + +A published dashboard with 4-5 visualizations accessible via URL. + +### If It Doesn't Work + +- **Dashboard viz doesn't match prompt:** Try rephrasing. Be specific about chart type and time range. +- **"No data" message:** Check the SQL warehouse is running (Compute → SQL Warehouses). +- **Can't connect to tables:** Verify table names match exactly including schema. +- **Want custom SQL:** Click the SQL icon on any viz to write your own query. 
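+
+If a natural-language viz won't converge, it is often quicker to validate the query in a notebook first, then paste it into the viz's SQL editor. A sketch for the "top 5 states" chart, with column names assuming the Lab 1 gold schema:
+
+```python
+# Sanity-check the aggregation before wiring it into a dashboard viz.
+spark.sql("""
+    SELECT state, AVG(turnover_millions) AS avg_monthly_turnover
+    FROM workshop_vibe_coding.TEAM_SCHEMA.retail_summary
+    GROUP BY state
+    ORDER BY avg_monthly_turnover DESC
+    LIMIT 5
+""").show()
+```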
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/04-write-prd.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/04-write-prd.md new file mode 100644 index 0000000..37e6601 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/04-write-prd.md @@ -0,0 +1,52 @@ +## Step 6: Write the App PRD + +Start Lab 2 by giving the agent a clear product spec. + +### Prompt + +Paste this into Claude Code: + +``` +Create a new directory called "app" and add a PRD as app/README.md: + +## Grocery Intelligence App + +### Overview +A web application that displays retail analytics from our gold tables +and lets users ask questions in plain English. + +### User Stories +1. As a business user, I want to see retail turnover by state so I can compare performance. +2. As an analyst, I want to ask questions like "which state had the highest food retail growth?" and get answers. +3. As an executive, I want to see food inflation trends at a glance. + +### Technical Requirements +- Backend: FastAPI (Python) +- Frontend: HTML + Tailwind CSS (CDN) + htmx (CDN) — no npm or node required +- Data: workshop_vibe_coding.TEAM_SCHEMA.retail_summary and food_inflation_yoy +- AI feature: Natural language to SQL using Databricks Foundation Model API +- Deployment: Databricks Apps + +### API Endpoints +- GET /health → {"status": "ok"} +- GET /api/metrics?state=X&start_date=Y&end_date=Z → list of records +- POST /api/ask {"question": "..."} → {"answer": "...", "sql_query": "..."} +- GET / → serves the frontend HTML + +### Tech Constraints +- Use databricks-sql-connector for all database queries +- All SQL must be parameterized (no string concatenation) +- Frontend uses Tailwind CSS from CDN and htmx from CDN — no build step +- Use Pydantic models for request/response validation + +Also update CLAUDE.md to add these app-specific rules. +``` + +### Expected Result + +An `app/README.md` with the PRD and updated `CLAUDE.md` with app-specific instructions. + +### If It Doesn't Work + +- **Agent creates too much:** Say "Just the PRD and CLAUDE.md update. Don't build anything yet." +- **Wrong directory:** Make sure the agent creates files inside the `app/` directory. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/05-write-app-tests.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/05-write-app-tests.md new file mode 100644 index 0000000..cd542b4 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/05-write-app-tests.md @@ -0,0 +1,50 @@ +## Step 7: Write API Tests + +Define what the backend should do before building it. + +### Prompt + +Paste this into Claude Code: + +``` +Write pytest tests for the FastAPI backend in tests/test_app.py. +Use httpx AsyncClient with ASGITransport for testing. + +Tests to write: + +1. test_health_endpoint: + - GET /health returns 200 + - Response body is {"status": "ok"} + +2. test_get_metrics_valid: + - GET /api/metrics returns 200 + - Response is a list of records + - Each record has keys: state, industry, month, turnover_millions, yoy_growth_pct + +3. test_get_metrics_with_state_filter: + - GET /api/metrics?state=New%20South%20Wales returns 200 + - All records in response have state == "New South Wales" + +4. test_get_metrics_invalid_date: + - GET /api/metrics?start_date=not-a-date returns 400 or 422 + +5. test_ask_question_valid: + - POST /api/ask with {"question": "Which state has the highest turnover?"} + - Returns 200 + - Response has "answer" (string) and "sql_query" (string) + +6. 
test_ask_question_empty:
+   - POST /api/ask with {"question": ""} returns 400 or 422
+
+Write ONLY the tests. Do NOT implement the app yet.
+Install httpx if needed: pip install httpx pytest-asyncio
+```
+
+### Expected Result
+
+A `tests/test_app.py` with 6 async test functions. All tests fail (no app exists yet).
+
+### If It Doesn't Work
+
+- **Import errors:** Expected — `app.py` doesn't exist yet. The tests define the contract.
+- **Agent builds the app:** Say "Stop. Delete the implementation. Tests only."
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/06-build-backend.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/06-build-backend.md
new file mode 100644
index 0000000..bba3d24
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/06-build-backend.md
@@ -0,0 +1,54 @@
+## Step 8: Build the Backend
+
+Implement FastAPI to make all API tests pass.
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Implement the FastAPI backend in app/app.py to pass all tests in tests/test_app.py.
+
+For GET /api/metrics:
+- Query workshop_vibe_coding.TEAM_SCHEMA.retail_summary
+- Support optional query params: state, start_date, end_date
+- Use databricks-sql-connector with parameterized queries
+- Return list of records as JSON
+
+For POST /api/ask:
+- Take {"question": "..."} in request body
+- Send the question to the Foundation Model API with the table schema as context
+- The LLM generates a SQL query
+- Execute the SQL query against our gold tables
+- Return {"answer": "...", "sql_query": "..."}
+- Use the Databricks SDK: from databricks.sdk import WorkspaceClient
+
+For GET /health:
+- Return {"status": "ok"}
+
+Connection config from environment variables:
+- DATABRICKS_HOST (workspace URL)
+- DATABRICKS_HTTP_PATH (SQL warehouse path)
+- DATABRICKS_TOKEN (PAT token — already set in your environment)
+
+Create app/requirements.txt with all dependencies:
+- fastapi
+- uvicorn
+- databricks-sql-connector
+- databricks-sdk
+- pydantic
+
+Run tests after implementation: pytest tests/test_app.py -x
+Fix any failures.
+```
+
+### Expected Result
+
+An `app/app.py` with three endpoints. All 6 API tests pass.
+
+### If It Doesn't Work
+
+- **databricks-sql-connector import error:** Run `pip install databricks-sql-connector`
+- **Auth errors:** Check `echo $DATABRICKS_HOST` and `echo $DATABRICKS_TOKEN` are set
+- **SQL errors:** Make sure table names match: `workshop_vibe_coding.TEAM_SCHEMA.retail_summary`
+- **Foundation Model API errors:** Check the AI Gateway endpoint. Ask the facilitator for the correct URL.
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/07-build-frontend.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/07-build-frontend.md
new file mode 100644
index 0000000..48119d2
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/07-build-frontend.md
@@ -0,0 +1,52 @@
+## Step 9: Build the Frontend
+
+HTML + Tailwind + htmx — no build step required.
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Build the frontend in app/static/index.html:
+
+1. Include these CDN scripts in <head>:
+   - Tailwind CSS: <script src="https://cdn.tailwindcss.com"></script>
+   - htmx: <script src="https://unpkg.com/htmx.org"></script>
+
+2. Layout:
+   - Dark header bar: "Grocery Intelligence Platform — TEAM_NAME"
+   - Three metric cards at the top: Total Turnover, Average Growth %, Top State
+     (fetch from GET /api/metrics on page load)
+   - Filter bar: State dropdown, date range pickers
+   - Data table showing metrics (updated via htmx when filters change)
+   - "Ask AI" section at the bottom:
+     - Text input for questions
+     - Submit button
+     - Response area showing the answer and the generated SQL
+
+3. Use htmx for all dynamic updates:
+   - hx-get="/api/metrics" for the data table
+   - hx-post="/api/ask" for the AI question
+   - hx-trigger="change" on filters to auto-refresh
+
+4. Style with Tailwind:
+   - Clean, professional look
+   - White cards with subtle shadows
+   - Responsive layout
+
+Also update app/app.py to:
+- Mount static files: app.mount("/static", StaticFiles(directory="static"))
+- Serve index.html at GET /
+- Add CORSMiddleware with allow_origins=["*"]
+```
+
+### Expected Result
+
+A single `app/static/index.html` file. When you run `uvicorn app.app:app`, the page loads with the dashboard.
+
+### If It Doesn't Work
+
+- **Blank page:** Check browser DevTools console (F12) for errors. Usually a missing script tag.
+- **htmx not working:** Make sure the script tag is in `<head>`, not `<body>`.
+- **CORS errors:** Add CORSMiddleware to FastAPI (the agent sometimes forgets this).
+- **Static files not serving:** Check `app.mount("/static", StaticFiles(directory="static"))` in app.py.
diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/analyst/08-deploy-app.md b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/08-deploy-app.md
new file mode 100644
index 0000000..bc00de7
--- /dev/null
+++ b/projects/coles-vibe-workshop/starter-kit/prompts/analyst/08-deploy-app.md
@@ -0,0 +1,49 @@
+## Step 11: Deploy Your App
+
+Package and deploy to Databricks Apps.
+
+### Prompt
+
+Paste this into Claude Code:
+
+```
+Prepare the app for deployment to Databricks Apps:
+
+1. Make sure app/requirements.txt has all dependencies:
+   fastapi
+   uvicorn
+   databricks-sql-connector
+   databricks-sdk
+   pydantic
+
+2. Create app/app.yaml with:
+   command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"]
+
+3. Deploy to Databricks Apps:
+   cd app
+   databricks apps deploy --name grocery-app-TEAM_NAME --source-code-path ./
+
+4. Show me the app URL when deployment completes.
+
+5. Test the deployed app:
+   curl <app-url>/health
+```
+
+### Expected Result
+
+The app is deployed and accessible at a URL like `https://<workspace>.databricks.com/apps/grocery-app-TEAM_NAME`. The health endpoint returns `{"status": "ok"}`.
+
+### If It Doesn't Work
+
+- **Deploy fails:** Check `app.yaml` syntax. Compare with `starter-kit/app.yaml.template`.
+- **App starts but shows errors:** Check app logs in the Databricks UI under Apps.
+- **Can't connect to database:** Ensure `DATABRICKS_HOST` and `DATABRICKS_HTTP_PATH` are in `app.yaml` env section.
+- **502 error after deploy:** The app may still be starting. Wait 30 seconds and refresh.
+
+### Prepare Your Demo
+
+You have 3 minutes to show:
+1. Your pipeline (show the DAG or table list in Databricks UI)
+2. Your app (load it, use the AI feature)
+3. Your Genie space (ask a question live)
+4.
One thing that surprised you diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/01-explore-data.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/01-explore-data.md new file mode 100644 index 0000000..15cef34 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/01-explore-data.md @@ -0,0 +1,34 @@ +## Step 1: Explore the Data + +Before writing any code, understand what the raw data looks like. + +### Prompt + +Paste this into Claude Code: + +``` +Fetch a sample of the ABS Retail Trade API and show me the columns, data types, and 3 sample rows: +https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03 + +Also fetch the ABS CPI Food API and show me the same: +https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4 + +For each API, show me: +1. All column names and their data types +2. 3 sample rows +3. What the coded values mean (REGION, INDUSTRY, INDEX columns) +``` + +### Expected Result + +You should see a table of columns for each API. Key columns: +- **Retail Trade:** DATAFLOW, FREQ, MEASURE, INDUSTRY, REGION, TIME_PERIOD, OBS_VALUE +- **CPI Food:** DATAFLOW, FREQ, MEASURE, INDEX, REGION, TIME_PERIOD, OBS_VALUE + +The REGION, INDUSTRY, and INDEX columns contain numeric codes (e.g., "1" for NSW). + +### If It Doesn't Work + +- **API timeout:** The ABS APIs can be slow. Wait 30 seconds and try again. +- **Network error:** Check you have internet access from the terminal. Try `curl -I https://data.api.abs.gov.au`. +- **Still failing:** Skip this step — you can see sample data in `starter-kit/conftest.py`. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/02-write-tests.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/02-write-tests.md new file mode 100644 index 0000000..4923c1f --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/02-write-tests.md @@ -0,0 +1,67 @@ +## Step 2: Write Pipeline Tests + +Tests are your spec. Write them BEFORE any implementation. + +### Prompt + +Paste this into Claude Code: + +``` +Create pytest tests for a Lakeflow Declarative Pipeline in tests/test_pipeline.py. +Use the fixtures from tests/conftest.py (spark, sample_retail_csv, sample_cpi_csv). + +Write these tests: + +1. test_bronze_retail_trade_schema: + - Raw CSV data has all original columns: DATAFLOW, FREQ, MEASURE, INDUSTRY, REGION, TIME_PERIOD, OBS_VALUE + - Test: given sample_retail_csv, assert correct columns exist + +2. test_bronze_retail_trade_not_null: + - TIME_PERIOD and OBS_VALUE are never null + - Test: assert no nulls in these columns + +3. test_silver_retail_turnover_decodes_regions: + - REGION codes decoded to state names (1=New South Wales, 2=Victoria, 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, 7=Northern Territory, 8=Australian Capital Territory) + - Test: given bronze rows with code "1", silver rows have "New South Wales" + +4. test_silver_retail_turnover_decodes_industries: + - INDUSTRY codes decoded (20=Food retailing, 41=Clothing/footwear/personal, 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, 45=Household goods retailing) + - Test: given bronze rows with code "20", silver rows have "Food retailing" + +5. test_silver_retail_turnover_parses_dates: + - TIME_PERIOD string "2024-01" parsed to proper date column + - Test: assert date type and correct value + +6. 
test_gold_retail_summary_rolling_averages: + - Adds 3-month and 12-month rolling averages + - Test: given 24 months of silver data, verify rolling averages are correct + +7. test_gold_retail_summary_yoy_growth: + - Adds year-over-year growth percentage + - Test: given 24 months of data, verify YoY growth = (current - same_month_last_year) / same_month_last_year * 100 + +8. test_bronze_cpi_schema: + - CPI data has columns: DATAFLOW, FREQ, MEASURE, INDEX, REGION, TIME_PERIOD, OBS_VALUE + +9. test_silver_food_price_index_decodes: + - INDEX codes decoded (10001=All groups CPI, 20001=Food and non-alcoholic beverages) + - REGION codes decoded to state names + +10. test_gold_food_inflation_yoy: + - Calculates year-over-year CPI change percentage + - Test: given 8 quarters of data, verify YoY change is correct + +Write ONLY the tests. Do NOT implement any pipeline functions yet. +Use PySpark test fixtures with small DataFrames (5-10 rows each). +Import transformation functions from src/ modules (they don't exist yet — that's fine). +``` + +### Expected Result + +A `tests/test_pipeline.py` file with 10 test functions. All tests should FAIL when you run them (because the implementation doesn't exist yet). That's correct — this is TDD. + +### If It Doesn't Work + +- **Agent writes implementation too:** Say "Stop. Delete the implementation. I only want the tests." +- **Agent uses pandas:** Say "Use PySpark, not pandas. Check CLAUDE.md." +- **Import errors:** That's expected — the src/ modules don't exist yet. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/03-build-bronze.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/03-build-bronze.md new file mode 100644 index 0000000..bf75be0 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/03-build-bronze.md @@ -0,0 +1,44 @@ +## Step 3: Build Bronze Layer + +Bronze = raw data ingestion. No transformations, just get the data in. + +### Prompt + +Paste this into Claude Code: + +``` +Create the bronze layer for our Lakeflow Declarative Pipeline: + +1. src/bronze/abs_retail_trade.py + - Use @dp.table decorator (import databricks.declarative_pipelines as dp) + - Ingest ABS Retail Trade API CSV: https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2010-01 + - Use spark.read.csv() with header=True and inferSchema=True + - Add data quality expectations: + @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") + @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") + +2. src/bronze/abs_cpi_food.py + - Same pattern with @dp.table + - Ingest ABS CPI Food API CSV: https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2010-Q1 + - Same data quality expectations + +Unity Catalog target: workshop_vibe_coding.TEAM_SCHEMA + +Then run the bronze tests: pytest tests/test_pipeline.py -k "bronze" -x +Fix any failures. +``` + +### Expected Result + +Two files in `src/bronze/`, each with a `@dp.table` decorated function. Bronze tests should pass. + +### If It Doesn't Work + +- **API timeout:** The ABS APIs can be slow from some networks. Try once more. +- **Still failing:** Use checkpoint data instead. Tell the agent: + ``` + The API is not accessible. 
Instead, read from the checkpoint tables: + - spark.read.table("workshop_vibe_coding.checkpoints.abs_retail_trade_bronze") + - spark.read.table("workshop_vibe_coding.checkpoints.abs_cpi_food_bronze") + ``` +- **@dp.table not found:** Make sure the import is `import databricks.declarative_pipelines as dp`, not `import dlt`. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/04-build-silver-gold.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/04-build-silver-gold.md new file mode 100644 index 0000000..5a40c2b --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/04-build-silver-gold.md @@ -0,0 +1,56 @@ +## Step 4: Build Silver + Gold Layers + +Silver = cleaned and decoded. Gold = aggregated and analytics-ready. + +### Prompt + +Paste this into Claude Code: + +``` +Build the silver and gold layers to make all remaining tests pass. + +SILVER LAYER: + +1. src/silver/retail_turnover.py + - @dp.table that reads from the bronze retail trade table + - Decode REGION codes: 1=New South Wales, 2=Victoria, 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, 7=Northern Territory, 8=Australian Capital Territory + - Decode INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal, 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, 45=Household goods retailing + - Parse TIME_PERIOD "2024-01" to a proper date column + - Rename OBS_VALUE to turnover_millions + +2. src/silver/food_price_index.py + - @dp.table that reads from the bronze CPI table + - Decode REGION codes (same as above) + - Decode INDEX codes: 10001=All groups CPI, 20001=Food and non-alcoholic beverages + - Rename OBS_VALUE to cpi_index + +GOLD LAYER: + +3. src/gold/retail_summary.py + - @dp.materialized_view reading from silver retail_turnover + - Add turnover_3m_avg: 3-month rolling average per state/industry + - Add turnover_12m_avg: 12-month rolling average per state/industry + - Add yoy_growth_pct: year-over-year growth percentage + +4. src/gold/food_inflation.py + - @dp.materialized_view reading from silver food_price_index + - Add yoy_change_pct: year-over-year CPI change percentage + +Run ALL tests: pytest tests/test_pipeline.py -x +Fix any failures until everything is green. +``` + +### Expected Result + +Four new files in `src/silver/` and `src/gold/`. All 10 pipeline tests pass. + +### If It Doesn't Work + +- **Agent uses pandas:** Say "Use PySpark window functions, not pandas. Check CLAUDE.md." +- **Rolling averages wrong:** Ensure the window is ordered by date and partitioned by state + industry. +- **Running out of time:** Grab Checkpoint 1B — pre-loaded silver and gold tables in your schema. + ``` + Copy checkpoint tables to our schema. Run: + CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.retail_summary AS SELECT * FROM workshop_vibe_coding.checkpoints.retail_summary; + CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy AS SELECT * FROM workshop_vibe_coding.checkpoints.food_inflation_yoy; + ``` diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/05-add-data-sources.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/05-add-data-sources.md new file mode 100644 index 0000000..10d9e82 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/05-add-data-sources.md @@ -0,0 +1,31 @@ +## Step 5: Add FSANZ Food Recalls Data Source + +### Prompt + +``` +Add a new data source to our pipeline — FSANZ food recalls. + +1. 
Write tests first: + - test_bronze_food_recalls_schema: has columns (product, category, issue, date, state, url) + - test_bronze_food_recalls_not_null: product and date are never null + - test_silver_food_recalls_dates: date strings parsed to proper DATE type + - test_silver_food_recalls_states: state names normalized to match our existing state list + +2. Build bronze + silver tables: + - src/bronze/fsanz_food_recalls.py: @dp.table ingesting from FSANZ + - src/silver/food_recalls.py: @dp.table with cleaned dates, normalized states, categorized issues + - Data source: https://www.foodstandards.gov.au/food-recalls/recalls + - If website is blocked, read from: workshop_vibe_coding.checkpoints.fsanz_food_recalls + +3. Run tests after implementation. +``` + +### Expected Result + +Two new pipeline files and passing tests for the FSANZ data source. + +### If It Doesn't Work + +- **Website blocked:** Use checkpoint data: `spark.read.table("workshop_vibe_coding.checkpoints.fsanz_food_recalls")` +- **Scraping errors:** Try the RSS feed: `https://www.foodstandards.gov.au/rss/recalls` +- **State names don't match:** Map to the same state names used in retail_summary diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/06-data-quality.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/06-data-quality.md new file mode 100644 index 0000000..1c956da --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/06-data-quality.md @@ -0,0 +1,37 @@ +## Step 6: Add Data Quality Rules + +### Prompt + +``` +Add comprehensive data quality expectations across all pipeline tables: + +Bronze tables: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") +- @dp.expect("valid_date_range", "TIME_PERIOD >= '2010-01'") + +Silver tables: +- @dp.expect_or_fail("valid_state", "state IN ('New South Wales','Victoria','Queensland','South Australia','Western Australia','Tasmania','Northern Territory','Australian Capital Territory')") +- @dp.expect("valid_turnover", "turnover_millions > 0") +- @dp.expect("no_unknown_industry", "industry != 'Unknown'") + +Gold tables: +- @dp.expect("valid_yoy", "yoy_growth_pct BETWEEN -100 AND 500") +- @dp.expect("valid_rolling_avg", "turnover_3m_avg > 0") + +Also create a gold-level cross-source view: +- src/gold/grocery_insights.py: @dp.materialized_view +- Joins retail_summary + food_inflation_yoy + food_recalls (if available) +- Columns: state, month, turnover_millions, yoy_growth_pct, cpi_yoy_change, recall_count + +Run all tests to verify nothing breaks. +``` + +### Expected Result + +All existing tables have data quality expectations. A new cross-source gold view is created. + +### If It Doesn't Work + +- **@dp.expect_or_fail stops pipeline:** Start with `@dp.expect` (warn only), then upgrade after verifying data +- **Cross-source join duplicates:** Join on state + date range, not exact date (CPI is quarterly, retail is monthly) diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/de/07-scheduling.md b/projects/coles-vibe-workshop/starter-kit/prompts/de/07-scheduling.md new file mode 100644 index 0000000..9170480 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/de/07-scheduling.md @@ -0,0 +1,27 @@ +## Step 7: Add Pipeline Scheduling + +### Prompt + +``` +Set up automated scheduling for our pipeline: + +1. Update databricks.yml to add a cron trigger: + trigger: + cron: + quartz_cron_expression: "0 0 6 * * ?" + timezone_id: "Australia/Sydney" + +2. 
Validate the bundle: databricks bundle validate +3. Deploy: databricks bundle deploy -t dev +4. Show me the pipeline schedule in the Workflows UI. +``` + +### Expected Result + +Pipeline is scheduled to run at 6 AM Sydney time daily. Visible in the Workflows tab. + +### If It Doesn't Work + +- **Cron syntax error:** Use Quartz format, not standard cron. The `?` is required for day-of-week. +- **Timezone not found:** Use `Australia/Sydney`, not `AEST`. +- **Pipeline runs but fails:** Check individual table errors in the pipeline UI. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/01-explore-gold.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/01-explore-gold.md new file mode 100644 index 0000000..1ed5421 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/01-explore-gold.md @@ -0,0 +1,31 @@ +## Step 1: Explore Gold Tables + +Understand the data you'll be building features from. + +### Prompt + +``` +Query these tables and show me a comprehensive analysis: + +1. workshop_vibe_coding.TEAM_SCHEMA.retail_summary: + - Row count, date range, distinct states, distinct industries + - Summary statistics for turnover_millions (min, max, mean, stddev) + - Top 5 state-industry combinations by average turnover + - Any null values or data quality issues + +2. workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: + - Row count, date range, distinct states + - Summary statistics for yoy_change_pct + - States with highest and lowest inflation + +Show me the results as tables. +``` + +### Expected Result + +Summary tables showing data distributions, ranges, and any quality issues. + +### If It Doesn't Work + +- **Table not found:** Check schema name. Try `workshop_vibe_coding.checkpoints.retail_summary` instead. +- **Permission error:** Run `databricks auth status` to verify your token. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/02-feature-engineering.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/02-feature-engineering.md new file mode 100644 index 0000000..1b7f78c --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/02-feature-engineering.md @@ -0,0 +1,39 @@ +## Step 2: Build Feature Engineering Pipeline + +Write tests first, then build the features. + +### Prompt + +``` +Create a feature engineering pipeline that reads from our gold tables and produces a feature table. + +Write tests first in tests/test_features.py, then implement: + +1. Lag features from retail_summary: + - turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + - Use PySpark Window functions partitioned by state + industry, ordered by month + +2. Seasonal features: + - month_of_year (1-12), quarter (1-4), is_december (boolean), is_q4 (boolean) + - Extract from the month date column + +3. Growth rate features: + - turnover_mom_growth: month-over-month growth percentage + - turnover_yoy_growth: year-over-year growth percentage (use lag_12m) + - cpi_yoy_change: join with food_inflation_yoy on state + quarter + +4. Write the combined feature table to: + workshop_vibe_coding.TEAM_SCHEMA.retail_features + +Run tests after implementation. Handle nulls in lag features (first N rows will be null — that's expected, filter them out in the final table). +``` + +### Expected Result + +A feature table in Unity Catalog with lag, seasonal, and growth columns. All tests pass. 
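+
+For reference, the lag, seasonal, and growth features all hang off a single window per state/industry series. A minimal sketch, assuming a `silver_df` with columns (state, industry, month, turnover_millions); the names are illustrative:
+
+```python
+from pyspark.sql import Window
+from pyspark.sql import functions as F
+
+# One window per state/industry series, ordered by month.
+w = Window.partitionBy("state", "industry").orderBy("month")
+
+features = (
+    silver_df
+    .withColumn("turnover_lag_1m", F.lag("turnover_millions", 1).over(w))
+    .withColumn("turnover_lag_12m", F.lag("turnover_millions", 12).over(w))
+    .withColumn("month_of_year", F.month("month"))
+    .withColumn("quarter", F.quarter("month"))
+    .withColumn("is_december", F.month("month") == 12)
+    .withColumn(
+        "turnover_yoy_growth",
+        (F.col("turnover_millions") - F.col("turnover_lag_12m"))
+        / F.col("turnover_lag_12m") * 100,
+    )
+    # The first 12 rows of each series have null lag_12m; filter them as the step suggests.
+    .filter(F.col("turnover_lag_12m").isNotNull())
+)
+```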
+ +### If It Doesn't Work + +- **Window function errors:** Verify `orderBy("month")` and `partitionBy("state", "industry")` +- **Agent uses pandas:** Say "Use PySpark Window functions. Check CLAUDE.md." +- **Null lag values:** Expected for first N rows. Use `.filter(col("turnover_lag_12m").isNotNull())` diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/03-mlflow-experiment.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/03-mlflow-experiment.md new file mode 100644 index 0000000..c977cc4 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/03-mlflow-experiment.md @@ -0,0 +1,32 @@ +## Step 3: Track Experiments with MLflow + +### Prompt + +``` +Set up MLflow experiment tracking for our feature engineering: + +1. Create an MLflow experiment named "grocery-features-TEAM_NAME" + +2. Log a run with: + - Parameters: number of features, date range, number of states + - Metrics: feature table row count, null percentage per feature, feature correlation stats + - Tags: team_name, track="data_science", phase="feature_engineering" + - Artifacts: save a feature summary CSV showing feature stats per state + +3. Create and log a visualization: + - A correlation heatmap of the numeric features (save as PNG) + - A time series plot of turnover trends for top 3 states (save as PNG) + +Use mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact(). +Show me the MLflow experiment URL when done. +``` + +### Expected Result + +An MLflow experiment with logged parameters, metrics, and artifact visualizations. + +### If It Doesn't Work + +- **MLflow not found:** Run `pip install mlflow` +- **Experiment URL not showing:** Set explicitly: `mlflow.set_experiment("/Users/.../grocery-features-TEAM_NAME")` +- **Visualization errors:** Use matplotlib. `pip install matplotlib seaborn` if needed. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/04-train-model.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/04-train-model.md new file mode 100644 index 0000000..5385a11 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/04-train-model.md @@ -0,0 +1,34 @@ +## Step 4: Train the Model + +### Prompt + +``` +Train a retail turnover forecasting model: + +1. Write tests first: + - test_model_predictions_positive: all predictions are positive + - test_model_r2_score: R² > 0.5 on test set + - test_model_logged_to_mlflow: run has model artifact and metrics + +2. Implement training: + - Read feature table: workshop_vibe_coding.TEAM_SCHEMA.retail_features + - Target: turnover_millions + - Features: all lag, seasonal, and growth columns + - Split: 80% train / 20% test (split by date, not random) + - Try both RandomForestRegressor and XGBRegressor + - Use mlflow.sklearn.autolog() or mlflow.xgboost.autolog() + - Log both models, compare R², MAE, RMSE + - Print which model performed better + +Run tests after training. +``` + +### Expected Result + +Two models logged to MLflow. One should have R² > 0.5. Tests pass. 
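+
+For reference, the date-ordered split and autologging reduce to something like the sketch below, assuming the feature table has been collected to a pandas DataFrame `df` (names illustrative):
+
+```python
+import mlflow
+from sklearn.ensemble import RandomForestRegressor
+from sklearn.metrics import r2_score
+
+mlflow.sklearn.autolog()  # captures params, metrics, and the model artifact
+
+df = df.sort_values("month")
+split = int(len(df) * 0.8)  # split by date order, not randomly
+train, test = df.iloc[:split], df.iloc[split:]
+feature_cols = [c for c in df.columns if c not in ("state", "industry", "month", "turnover_millions")]
+
+with mlflow.start_run(run_name="random_forest"):
+    model = RandomForestRegressor(n_estimators=200, random_state=42)
+    model.fit(train[feature_cols], train["turnover_millions"])
+    preds = model.predict(test[feature_cols])
+    print("R²:", r2_score(test["turnover_millions"], preds))
+```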
+ +### If It Doesn't Work + +- **XGBoost not installed:** `pip install xgboost` +- **Low R² score:** Try adding more features or using a different split ratio +- **OOM error:** Collect to pandas for training (sklearn needs pandas/numpy, not Spark) diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/05-register-model.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/05-register-model.md new file mode 100644 index 0000000..82d4875 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/05-register-model.md @@ -0,0 +1,26 @@ +## Step 5: Register Model in MLflow + +### Prompt + +``` +Register the best model from our experiment: + +1. Find the best run (highest R²) from our MLflow experiment +2. Register it as a Unity Catalog model: + - Model name: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model + - Use mlflow.register_model() with the UC path +3. Add a model description: "Retail turnover forecasting model for Australian states" +4. Set an alias "production" on the latest version + +Show me the model in the Model Registry UI. +``` + +### Expected Result + +Model registered in Unity Catalog with "production" alias set. + +### If It Doesn't Work + +- **Permission error:** Need CREATE MODEL on catalog. Ask facilitator. +- **Path format:** Use `models:/workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model` +- **Version not found:** Check that the run has a logged model artifact first. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/06-serve-model.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/06-serve-model.md new file mode 100644 index 0000000..5ae7e9b --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/06-serve-model.md @@ -0,0 +1,31 @@ +## Step 6: Create Model Serving Endpoint + +### Prompt + +``` +Create a Model Serving endpoint for our registered model: + +1. Create a serverless endpoint: + - Name: grocery-forecast-TEAM_NAME + - Model: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model + - Version: the one with alias "production" + - Use the Databricks SDK or CLI + +2. Wait for the endpoint to be ready (may take 5-10 minutes) + +3. Test with a sample request: + POST /serving-endpoints/grocery-forecast-TEAM_NAME/invocations + Body: {"dataframe_records": [{"turnover_lag_1m": 4500, "turnover_lag_3m": 4400, "turnover_lag_12m": 4200, "month_of_year": 3, "quarter": 1, "is_december": false, "is_q4": false, "turnover_mom_growth": 2.3, "turnover_yoy_growth": 7.1}]} + +Show me the prediction response. +``` + +### Expected Result + +Endpoint status is "READY" and returns a prediction for the sample input. + +### If It Doesn't Work + +- **Endpoint not ready:** Wait 5-10 minutes. Check status: `databricks serving-endpoints get grocery-forecast-TEAM_NAME` +- **Permission error:** Need CREATE_SERVING_ENDPOINT permission. Ask facilitator. +- **Invalid input:** Ensure feature names match exactly what the model was trained on. diff --git a/projects/coles-vibe-workshop/starter-kit/prompts/ds/07-build-app.md b/projects/coles-vibe-workshop/starter-kit/prompts/ds/07-build-app.md new file mode 100644 index 0000000..378ffb2 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/prompts/ds/07-build-app.md @@ -0,0 +1,45 @@ +## Step 7: Build Prediction App + +### Prompt + +``` +Build a simple prediction web app: + +1. 
FastAPI backend (app/app.py): + - GET /health → {"status": "ok"} + - POST /predict: + - Accepts: {"state": "New South Wales", "industry": "Food retailing", "month": "2024-06"} + - Looks up latest features for that state/industry from the feature table + - Calls our Model Serving endpoint with the features + - Returns: {"predicted_turnover": 4650.2, "state": "New South Wales", "industry": "Food retailing", "month": "2024-06"} + - GET / → serves the frontend + +2. HTML frontend (app/static/index.html): + - Tailwind CSS + htmx (CDN, no build step) + - Dark header: "Grocery Forecast — TEAM_NAME" + - Form with dropdowns: State, Industry, Month + - Submit button that calls POST /predict via htmx + - Result card showing the prediction + - Simple and clean — don't over-engineer + +3. Create app/app.yaml: + command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] + +4. Create app/requirements.txt: + fastapi, uvicorn, databricks-sdk, databricks-sql-connector, pydantic + +5. Deploy: databricks apps deploy --name grocery-forecast-TEAM_NAME --source-code-path ./app/ + +Show me the deployed app URL. +``` + +### Expected Result + +A deployed app where you select state/industry/month and get a turnover prediction. + +### If It Doesn't Work + +- **App can't reach endpoint:** Use the workspace-internal serving URL, not external +- **CORS errors:** Add CORSMiddleware to FastAPI +- **502 after deploy:** App may still be starting. Wait 30 seconds. +- **Feature lookup fails:** Check feature table name and that features exist for the selected state/industry diff --git a/projects/coles-vibe-workshop/starter-kit/test_app.py b/projects/coles-vibe-workshop/starter-kit/test_app.py new file mode 100644 index 0000000..7a4cd24 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_app.py @@ -0,0 +1,77 @@ +""" +API test stubs for Lab 2. +These define WHAT the FastAPI backend should do. The agent implements the code to make them pass. +Copy to tests/test_app.py before starting Lab 2. 
+ +NOTE: Install httpx first: pip install httpx +""" + +import pytest + +# TODO: Uncomment these imports once app.py exists +# from httpx import AsyncClient, ASGITransport +# from app import app + + +# ── Health ──────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_health_endpoint(): + """GET /health returns 200 with {"status": "ok"}.""" + # TODO: Create AsyncClient with ASGITransport(app=app) + # TODO: GET /health + # TODO: Assert status 200 + # TODO: Assert response JSON == {"status": "ok"} + pass + + +# ── Metrics ─────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_get_metrics_valid(): + """GET /api/metrics returns a list of records.""" + # TODO: GET /api/metrics + # TODO: Assert status 200 + # TODO: Assert response is a list + # TODO: Assert each record has keys: state, industry, month, turnover_millions, yoy_growth_pct + pass + + +@pytest.mark.asyncio +async def test_get_metrics_with_state_filter(): + """GET /api/metrics?state=NSW filters results to NSW only.""" + # TODO: GET /api/metrics?state=New South Wales + # TODO: Assert status 200 + # TODO: Assert all records have state == "New South Wales" + pass + + +@pytest.mark.asyncio +async def test_get_metrics_invalid_date(): + """GET /api/metrics with invalid date format returns 400.""" + # TODO: GET /api/metrics?start_date=not-a-date + # TODO: Assert status 400 or 422 + pass + + +# ── Ask AI ──────────────────────────────────────────────────────────── + + +@pytest.mark.asyncio +async def test_ask_question_valid(): + """POST /api/ask with a valid question returns answer and sql_query.""" + # TODO: POST /api/ask with {"question": "Which state has the highest turnover?"} + # TODO: Assert status 200 + # TODO: Assert response has "answer" key (string) + # TODO: Assert response has "sql_query" key (string) + pass + + +@pytest.mark.asyncio +async def test_ask_question_empty(): + """POST /api/ask with empty question returns 400.""" + # TODO: POST /api/ask with {"question": ""} + # TODO: Assert status 400 or 422 + pass diff --git a/projects/coles-vibe-workshop/starter-kit/test_features.py b/projects/coles-vibe-workshop/starter-kit/test_features.py new file mode 100644 index 0000000..bb830a6 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_features.py @@ -0,0 +1,67 @@ +""" +Feature engineering test stubs for Data Science track Lab 1. +Copy to tests/test_features.py before starting. 
+""" + + +# ── Lag Features ────────────────────────────────────────────────────── + + +def test_create_lag_features(spark, sample_retail_csv): + """Lag features (1, 3, 6, 12 month) are created correctly.""" + # TODO: Create 24 months of data for one state/industry + # TODO: Apply lag feature function + # TODO: Assert turnover_lag_1m equals previous month's value + # TODO: Assert turnover_lag_12m equals same month last year + # TODO: Assert first 12 rows have null lag_12m (expected) + pass + + +# ── Seasonal Features ───────────────────────────────────────────────── + + +def test_create_seasonal_features(spark, sample_retail_csv): + """Seasonal indicators extracted correctly from date column.""" + # TODO: Apply seasonal feature function + # TODO: Assert month_of_year is 1-12 + # TODO: Assert quarter is 1-4 + # TODO: Assert is_december is True only for month 12 + # TODO: Assert is_q4 is True only for months 10-12 + pass + + +# ── Growth Features ─────────────────────────────────────────────────── + + +def test_create_growth_features(spark, sample_retail_csv): + """MoM and YoY growth rates computed correctly.""" + # TODO: Create 24 months of data with known values + # TODO: Apply growth feature function + # TODO: Assert turnover_mom_growth = (current - previous) / previous * 100 + # TODO: Assert turnover_yoy_growth = (current - 12m_ago) / 12m_ago * 100 + pass + + +# ── Feature Table Schema ───────────────────────────────────────────── + + +def test_feature_table_schema(spark): + """Feature table has all expected columns.""" + expected_columns = { + "state", "industry", "month", + "turnover_millions", + "turnover_lag_1m", "turnover_lag_3m", "turnover_lag_6m", "turnover_lag_12m", + "month_of_year", "quarter", "is_december", "is_q4", + "turnover_mom_growth", "turnover_yoy_growth", "cpi_yoy_change", + } + # TODO: Read or create feature table + # TODO: Assert all expected columns exist + pass + + +def test_feature_table_no_key_nulls(spark): + """Key columns (state, industry, month, turnover_millions) have no nulls.""" + # TODO: Read feature table + # TODO: Assert state, industry, month, turnover_millions have zero nulls + # TODO: Note: lag features CAN have nulls in early rows + pass diff --git a/projects/coles-vibe-workshop/starter-kit/test_features_local.py b/projects/coles-vibe-workshop/starter-kit/test_features_local.py new file mode 100644 index 0000000..66bb2dc --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_features_local.py @@ -0,0 +1,288 @@ +""" +Fast local tests for DS track — feature engineering logic. +Runs WITHOUT Spark or Java. Tests the MATH, not the DataFrame wiring. + +Pattern: extract feature logic into pure Python functions, +test them in sub-second, then wire into PySpark Window functions on the cluster. + +Run: uv run pytest tests/test_features_local.py -x --no-header -q +""" +import pytest +from datetime import date + + +# ── Pure Python feature functions (teams implement these) ───────── + + +def create_lag_features(values: list[float], lag: int) -> list[float | None]: + """ + Create lag features from a time-ordered list of values. + Returns a list where element i = values[i - lag], or None if i < lag. + """ + result = [] + for i in range(len(values)): + if i < lag: + result.append(None) + else: + result.append(values[i - lag]) + return result + + +def extract_seasonal_features(d: date) -> dict: + """ + Extract seasonal indicators from a date. + Returns month_of_year, quarter, is_december, is_q4. 
+ """ + month = d.month + quarter = (month - 1) // 3 + 1 + return { + "month_of_year": month, + "quarter": quarter, + "is_december": month == 12, + "is_q4": quarter == 4, + } + + +def calc_mom_growth(current: float, previous: float) -> float | None: + """Month-over-month growth percentage.""" + if previous is None or previous == 0: + return None + return ((current - previous) / previous) * 100 + + +def calc_yoy_growth(current: float, year_ago: float) -> float | None: + """Year-over-year growth percentage.""" + if year_ago is None or year_ago == 0: + return None + return ((current - year_ago) / year_ago) * 100 + + +def build_feature_row( + state: str, + industry: str, + month: date, + turnover: float, + lag_values: dict[str, float | None], + cpi_yoy_change: float | None = None, +) -> dict: + """ + Assemble a single feature row combining all feature groups. + This is what the PySpark pipeline produces per row. + """ + seasonal = extract_seasonal_features(month) + mom_growth = calc_mom_growth(turnover, lag_values.get("lag_1m")) + yoy_growth = calc_yoy_growth(turnover, lag_values.get("lag_12m")) + + return { + "state": state, + "industry": industry, + "month": month, + "turnover_millions": turnover, + "turnover_lag_1m": lag_values.get("lag_1m"), + "turnover_lag_3m": lag_values.get("lag_3m"), + "turnover_lag_6m": lag_values.get("lag_6m"), + "turnover_lag_12m": lag_values.get("lag_12m"), + **seasonal, + "turnover_mom_growth": mom_growth, + "turnover_yoy_growth": yoy_growth, + "cpi_yoy_change": cpi_yoy_change, + } + + +# ── Tests: Lag Features ────────────────────────────────────────── + + +class TestLagFeatures: + """Tests for create_lag_features — pure list operations.""" + + def test_lag_1_month(self): + values = [100, 110, 120, 130, 140] + lags = create_lag_features(values, lag=1) + assert lags == [None, 100, 110, 120, 130] + + def test_lag_3_month(self): + values = [100, 110, 120, 130, 140] + lags = create_lag_features(values, lag=3) + assert lags == [None, None, None, 100, 110] + + def test_lag_12_month(self): + # 24 months of data: lag_12m should be None for first 12 + values = [1000 + i * 50 for i in range(24)] + lags = create_lag_features(values, lag=12) + # First 12 are None + assert all(l is None for l in lags[:12]) + # From index 12 onward, lag matches values 12 months back + assert lags[12] == values[0] # 1000 + assert lags[13] == values[1] # 1050 + assert lags[23] == values[11] # 1550 + + def test_lag_preserves_length(self): + values = list(range(100)) + for lag in [1, 3, 6, 12]: + result = create_lag_features(values, lag) + assert len(result) == len(values) + + def test_lag_empty_list(self): + assert create_lag_features([], lag=1) == [] + + +# ── Tests: Seasonal Features ───────────────────────────────────── + + +class TestSeasonalFeatures: + """Tests for extract_seasonal_features — date decomposition.""" + + def test_january(self): + result = extract_seasonal_features(date(2024, 1, 1)) + assert result["month_of_year"] == 1 + assert result["quarter"] == 1 + assert result["is_december"] is False + assert result["is_q4"] is False + + def test_december(self): + result = extract_seasonal_features(date(2024, 12, 1)) + assert result["month_of_year"] == 12 + assert result["quarter"] == 4 + assert result["is_december"] is True + assert result["is_q4"] is True + + def test_q4_months(self): + """October, November, December are Q4.""" + for month in [10, 11, 12]: + result = extract_seasonal_features(date(2024, month, 1)) + assert result["is_q4"] is True, f"Month {month} should be Q4" + assert 
result["quarter"] == 4 + + def test_non_q4_months(self): + """Months 1-9 are not Q4.""" + for month in range(1, 10): + result = extract_seasonal_features(date(2024, month, 1)) + assert result["is_q4"] is False, f"Month {month} should not be Q4" + + def test_all_quarters(self): + expected = {1: 1, 2: 1, 3: 1, 4: 2, 5: 2, 6: 2, + 7: 3, 8: 3, 9: 3, 10: 4, 11: 4, 12: 4} + for month, quarter in expected.items(): + result = extract_seasonal_features(date(2024, month, 1)) + assert result["quarter"] == quarter, f"Month {month} → Q{quarter}" + + def test_only_december_is_december(self): + for month in range(1, 13): + result = extract_seasonal_features(date(2024, month, 1)) + assert result["is_december"] == (month == 12) + + +# ── Tests: Growth Features ─────────────────────────────────────── + + +class TestGrowthFeatures: + """Tests for MoM and YoY growth calculations.""" + + def test_mom_positive_growth(self): + # $4600M this month, $4500M last month → 2.22% growth + assert calc_mom_growth(4600, 4500) == pytest.approx(2.2222, rel=1e-3) + + def test_mom_negative_growth(self): + assert calc_mom_growth(4400, 4500) == pytest.approx(-2.2222, rel=1e-3) + + def test_mom_no_previous(self): + """First month has no previous — returns None.""" + assert calc_mom_growth(4500, None) is None + + def test_mom_zero_previous(self): + assert calc_mom_growth(4500, 0) is None + + def test_yoy_positive_growth(self): + # $4500M now, $4200M a year ago → 7.14% growth + assert calc_yoy_growth(4500, 4200) == pytest.approx(7.1429, rel=1e-3) + + def test_yoy_negative_growth(self): + assert calc_yoy_growth(4000, 4200) == pytest.approx(-4.7619, rel=1e-3) + + def test_yoy_no_year_ago(self): + assert calc_yoy_growth(4500, None) is None + + def test_realistic_food_retail_nsw(self): + """Real-ish numbers: NSW Food retailing ~$4.5B/month.""" + jan_2024 = 4520.0 + jan_2023 = 4200.0 + dec_2023 = 4850.0 # December spike + # YoY + assert calc_yoy_growth(jan_2024, jan_2023) == pytest.approx(7.619, rel=1e-2) + # MoM (Jan drops after December spike) + assert calc_mom_growth(jan_2024, dec_2023) == pytest.approx(-6.804, rel=1e-2) + + +# ── Tests: Feature Row Assembly ────────────────────────────────── + + +class TestFeatureRowAssembly: + """Tests for the complete feature row — all feature groups combined.""" + + def test_complete_row_schema(self): + row = build_feature_row( + state="New South Wales", + industry="Food retailing", + month=date(2024, 6, 1), + turnover=4600.0, + lag_values={"lag_1m": 4500, "lag_3m": 4400, "lag_6m": 4300, "lag_12m": 4200}, + cpi_yoy_change=3.5, + ) + expected_keys = { + "state", "industry", "month", "turnover_millions", + "turnover_lag_1m", "turnover_lag_3m", "turnover_lag_6m", "turnover_lag_12m", + "month_of_year", "quarter", "is_december", "is_q4", + "turnover_mom_growth", "turnover_yoy_growth", "cpi_yoy_change", + } + assert set(row.keys()) == expected_keys + + def test_key_columns_not_null(self): + row = build_feature_row( + state="Victoria", + industry="Food retailing", + month=date(2024, 3, 1), + turnover=3900.0, + lag_values={"lag_1m": 3800}, + ) + assert row["state"] is not None + assert row["industry"] is not None + assert row["month"] is not None + assert row["turnover_millions"] is not None + + def test_lag_nulls_propagate(self): + """Early rows missing lag values → growth features are None.""" + row = build_feature_row( + state="Queensland", + industry="Food retailing", + month=date(2023, 1, 1), + turnover=2900.0, + lag_values={}, # No lag data (first month) + ) + assert row["turnover_lag_1m"] is None 
+ assert row["turnover_lag_12m"] is None + assert row["turnover_mom_growth"] is None + assert row["turnover_yoy_growth"] is None + + def test_seasonal_features_in_row(self): + row = build_feature_row( + state="New South Wales", + industry="Food retailing", + month=date(2024, 12, 1), + turnover=5200.0, + lag_values={"lag_1m": 4800, "lag_12m": 4900}, + ) + assert row["month_of_year"] == 12 + assert row["quarter"] == 4 + assert row["is_december"] is True + assert row["is_q4"] is True + + def test_growth_calculations_in_row(self): + row = build_feature_row( + state="New South Wales", + industry="Food retailing", + month=date(2024, 2, 1), + turnover=4600.0, + lag_values={"lag_1m": 4500, "lag_12m": 4200}, + ) + assert row["turnover_mom_growth"] == pytest.approx(2.2222, rel=1e-3) + assert row["turnover_yoy_growth"] == pytest.approx(9.5238, rel=1e-3) diff --git a/projects/coles-vibe-workshop/starter-kit/test_model.py b/projects/coles-vibe-workshop/starter-kit/test_model.py new file mode 100644 index 0000000..6f9e383 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_model.py @@ -0,0 +1,48 @@ +""" +Model training and serving test stubs for Data Science track Lab 2. +Copy to tests/test_model.py before starting. +""" + + +# ── Model Training ──────────────────────────────────────────────────── + + +def test_model_predictions_positive(spark): + """All predictions are positive numbers (turnover can't be negative).""" + # TODO: Load model, run predictions on test set + # TODO: Assert all predictions > 0 + pass + + +def test_model_r2_score(spark): + """Model achieves R² > 0.5 on test set.""" + # TODO: Load model, predict on held-out test data + # TODO: Calculate R² score + # TODO: Assert R² > 0.5 + pass + + +def test_model_logged_to_mlflow(): + """Training run logged model artifact to MLflow.""" + # TODO: Query MLflow for latest run in our experiment + # TODO: Assert run has a logged model artifact + # TODO: Assert run has R², MAE, RMSE metrics logged + pass + + +# ── Model Serving ───────────────────────────────────────────────────── + + +def test_serving_endpoint_health(): + """Model Serving endpoint responds to health check.""" + # TODO: Query serving endpoint status + # TODO: Assert endpoint state is "READY" + pass + + +def test_serving_endpoint_prediction(): + """Model Serving endpoint returns valid predictions.""" + # TODO: Send sample feature vector to endpoint + # TODO: Assert response has predictions key + # TODO: Assert prediction is a positive number + pass diff --git a/projects/coles-vibe-workshop/starter-kit/test_model_local.py b/projects/coles-vibe-workshop/starter-kit/test_model_local.py new file mode 100644 index 0000000..53c5b77 --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_model_local.py @@ -0,0 +1,309 @@ +""" +Fast local tests for DS track — model training & evaluation logic. +Runs WITHOUT Spark, Databricks, or MLflow server. +Uses sklearn and xgboost directly on small numpy/pandas data. + +Pattern: test model behavior on synthetic data locally, +then run on real feature table when on the cluster. + +Run: uv run pytest tests/test_model_local.py -x --no-header -q +""" +import pytest +import numpy as np +from datetime import date + + +# ── Synthetic feature data (mimics retail_features table) ───────── + + +def make_synthetic_features(n_months: int = 36, seed: int = 42) -> tuple: + """ + Generate synthetic retail turnover data with known patterns. + Returns (X, y, dates) — numpy arrays ready for sklearn. 
+ + Patterns embedded: + - Seasonal: December spike, January dip + - Trend: 2% annual growth + - Noise: small random variation + """ + rng = np.random.default_rng(seed) + dates = [] + features = [] + targets = [] + + base_turnover = 4000.0 + for i in range(n_months): + month_num = (i % 12) + 1 + year_offset = i // 12 + + # Base with trend + turnover = base_turnover * (1 + 0.02 * year_offset) + + # Seasonal pattern + if month_num == 12: + turnover *= 1.15 # December spike + elif month_num == 1: + turnover *= 0.92 # January dip + elif month_num in [6, 7]: + turnover *= 0.95 # Winter dip (Australia) + + # Add noise + turnover += rng.normal(0, 50) + + # Build feature vector + lag_1m = targets[-1] if i > 0 else None + lag_3m = targets[-3] if i >= 3 else None + lag_6m = targets[-6] if i >= 6 else None + lag_12m = targets[-12] if i >= 12 else None + quarter = (month_num - 1) // 3 + 1 + + if lag_1m is not None and lag_12m is not None: + mom_growth = ((turnover - lag_1m) / lag_1m) * 100 + yoy_growth = ((turnover - lag_12m) / lag_12m) * 100 + features.append([ + lag_1m, lag_3m or 0, lag_6m or 0, lag_12m, + month_num, quarter, + 1.0 if month_num == 12 else 0.0, + 1.0 if quarter == 4 else 0.0, + mom_growth, yoy_growth, + ]) + targets.append(turnover) + dates.append(date(2022 + year_offset, month_num, 1)) + else: + targets.append(turnover) + + X = np.array(features) + y = np.array(targets[len(targets) - len(features):]) + return X, y, dates + + +FEATURE_NAMES = [ + "turnover_lag_1m", "turnover_lag_3m", "turnover_lag_6m", "turnover_lag_12m", + "month_of_year", "quarter", "is_december", "is_q4", + "turnover_mom_growth", "turnover_yoy_growth", +] + + +# ── Fixtures ───────────────────────────────────────────────────── + + +@pytest.fixture(scope="module") +def synthetic_data(): + """36 months of synthetic retail data.""" + return make_synthetic_features(n_months=36) + + +@pytest.fixture(scope="module") +def train_test_split(synthetic_data): + """80/20 chronological split (not random — time series!).""" + X, y, dates = synthetic_data + split_idx = int(len(X) * 0.8) + return { + "X_train": X[:split_idx], + "y_train": y[:split_idx], + "X_test": X[split_idx:], + "y_test": y[split_idx:], + "dates_test": dates[split_idx:], + } + + +# ── Tests: Data Preparation ────────────────────────────────────── + + +class TestDataPreparation: + """Verify synthetic data has the right shape for training.""" + + def test_feature_count(self, synthetic_data): + X, y, _ = synthetic_data + assert X.shape[1] == len(FEATURE_NAMES) + + def test_no_nans_in_features(self, synthetic_data): + X, y, _ = synthetic_data + assert not np.any(np.isnan(X)), "Features should have no NaN after filtering" + + def test_target_all_positive(self, synthetic_data): + _, y, _ = synthetic_data + assert np.all(y > 0), "Turnover must be positive" + + def test_chronological_split(self, train_test_split): + """Train set is before test set — no data leakage.""" + data = train_test_split + assert len(data["X_train"]) > len(data["X_test"]) + # Test dates should be after training period + assert data["dates_test"][0] > date(2023, 1, 1) + + +# ── Tests: RandomForest Model ──────────────────────────────────── + + +class TestRandomForestModel: + """Test sklearn RandomForestRegressor on synthetic data.""" + + @pytest.fixture(scope="class") + def rf_model(self, train_test_split): + from sklearn.ensemble import RandomForestRegressor + model = RandomForestRegressor(n_estimators=50, random_state=42) + model.fit(train_test_split["X_train"], train_test_split["y_train"]) + 
return model + + def test_predictions_positive(self, rf_model, train_test_split): + """All predictions must be positive (turnover can't be negative).""" + predictions = rf_model.predict(train_test_split["X_test"]) + assert np.all(predictions > 0), "RandomForest produced negative predictions" + + def test_r2_score_above_threshold(self, rf_model, train_test_split): + """R² > 0.5 on test set.""" + from sklearn.metrics import r2_score + data = train_test_split + predictions = rf_model.predict(data["X_test"]) + r2 = r2_score(data["y_test"], predictions) + assert r2 > 0.5, f"R² = {r2:.3f}, expected > 0.5" + + def test_predictions_in_reasonable_range(self, rf_model, train_test_split): + """Predictions should be within realistic turnover range.""" + predictions = rf_model.predict(train_test_split["X_test"]) + assert np.all(predictions > 1000), "Predictions too low for retail turnover" + assert np.all(predictions < 10000), "Predictions unrealistically high" + + def test_feature_importances_exist(self, rf_model): + """RandomForest provides feature importances.""" + importances = rf_model.feature_importances_ + assert len(importances) == len(FEATURE_NAMES) + assert np.sum(importances) == pytest.approx(1.0, abs=0.01) + + +# ── Tests: XGBoost Model ───────────────────────────────────────── + + +class TestXGBoostModel: + """Test XGBRegressor on synthetic data.""" + + @pytest.fixture(scope="class") + def xgb_model(self, train_test_split): + from xgboost import XGBRegressor + model = XGBRegressor(n_estimators=50, max_depth=4, random_state=42) + model.fit(train_test_split["X_train"], train_test_split["y_train"]) + return model + + def test_predictions_positive(self, xgb_model, train_test_split): + """All predictions must be positive.""" + predictions = xgb_model.predict(train_test_split["X_test"]) + assert np.all(predictions > 0), "XGBoost produced negative predictions" + + def test_r2_score_above_threshold(self, xgb_model, train_test_split): + """R² > 0.5 on test set.""" + from sklearn.metrics import r2_score + data = train_test_split + predictions = xgb_model.predict(data["X_test"]) + r2 = r2_score(data["y_test"], predictions) + assert r2 > 0.5, f"R² = {r2:.3f}, expected > 0.5" + + def test_mae_reasonable(self, xgb_model, train_test_split): + """Mean Absolute Error should be < 10% of mean turnover.""" + from sklearn.metrics import mean_absolute_error + data = train_test_split + predictions = xgb_model.predict(data["X_test"]) + mae = mean_absolute_error(data["y_test"], predictions) + mean_turnover = np.mean(data["y_test"]) + assert mae < mean_turnover * 0.1, f"MAE={mae:.1f} > 10% of mean={mean_turnover:.1f}" + + +# ── Tests: Model Comparison ────────────────────────────────────── + + +class TestModelComparison: + """Compare RandomForest vs XGBoost — the pattern teams follow in Lab 2.""" + + @pytest.fixture(scope="class") + def both_models(self, train_test_split): + from sklearn.ensemble import RandomForestRegressor + from xgboost import XGBRegressor + from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error + + data = train_test_split + + rf = RandomForestRegressor(n_estimators=50, random_state=42) + rf.fit(data["X_train"], data["y_train"]) + rf_pred = rf.predict(data["X_test"]) + + xgb = XGBRegressor(n_estimators=50, max_depth=4, random_state=42) + xgb.fit(data["X_train"], data["y_train"]) + xgb_pred = xgb.predict(data["X_test"]) + + return { + "rf": { + "model": rf, + "predictions": rf_pred, + "r2": r2_score(data["y_test"], rf_pred), + "mae": mean_absolute_error(data["y_test"], 
rf_pred), + "rmse": np.sqrt(mean_squared_error(data["y_test"], rf_pred)), + }, + "xgb": { + "model": xgb, + "predictions": xgb_pred, + "r2": r2_score(data["y_test"], xgb_pred), + "mae": mean_absolute_error(data["y_test"], xgb_pred), + "rmse": np.sqrt(mean_squared_error(data["y_test"], xgb_pred)), + }, + } + + def test_both_models_beat_baseline(self, both_models): + """Both models should beat R² = 0 (predicting the mean).""" + assert both_models["rf"]["r2"] > 0, "RandomForest worse than mean predictor" + assert both_models["xgb"]["r2"] > 0, "XGBoost worse than mean predictor" + + def test_metrics_are_logged(self, both_models): + """Both models produce the three metrics we log to MLflow.""" + for name in ["rf", "xgb"]: + metrics = both_models[name] + assert "r2" in metrics + assert "mae" in metrics + assert "rmse" in metrics + assert metrics["rmse"] >= metrics["mae"], "RMSE should be >= MAE" + + +# ── Tests: MLflow Logging Pattern (no server needed) ───────────── + + +class TestMLflowLoggingPattern: + """ + Test that MLflow API calls work locally (uses local file store). + Teams use the same pattern on the cluster with a Databricks-backed store. + """ + + def test_mlflow_log_metrics(self, train_test_split): + """Verify we can log the metrics MLflow expects.""" + import mlflow + from sklearn.ensemble import RandomForestRegressor + from sklearn.metrics import r2_score, mean_absolute_error + + data = train_test_split + model = RandomForestRegressor(n_estimators=10, random_state=42) + model.fit(data["X_train"], data["y_train"]) + preds = model.predict(data["X_test"]) + + with mlflow.start_run(run_name="local-test-run"): + mlflow.log_param("model_type", "RandomForest") + mlflow.log_param("n_estimators", 10) + mlflow.log_metric("r2", r2_score(data["y_test"], preds)) + mlflow.log_metric("mae", mean_absolute_error(data["y_test"], preds)) + mlflow.sklearn.log_model(model, "model") + + run = mlflow.active_run() + assert run is not None + assert run.info.run_name == "local-test-run" + + def test_mlflow_autolog(self, train_test_split): + """Verify sklearn autolog works locally.""" + import mlflow + from sklearn.ensemble import RandomForestRegressor + + mlflow.sklearn.autolog(log_models=False) + data = train_test_split + + with mlflow.start_run(run_name="autolog-test"): + model = RandomForestRegressor(n_estimators=10, random_state=42) + model.fit(data["X_train"], data["y_train"]) + model.score(data["X_test"], data["y_test"]) + + mlflow.sklearn.autolog(disable=True) diff --git a/projects/coles-vibe-workshop/starter-kit/test_pipeline.py b/projects/coles-vibe-workshop/starter-kit/test_pipeline.py new file mode 100644 index 0000000..66786ed --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_pipeline.py @@ -0,0 +1,106 @@ +""" +Pipeline test stubs for Lab 1. +These define WHAT the pipeline should do. The agent implements the code to make them pass. +Copy to tests/test_pipeline.py before starting Lab 1. 
+""" + + +# ── Bronze: Retail Trade ────────────────────────────────────────────── + + +def test_bronze_retail_trade_schema(spark, sample_retail_csv): + """Bronze retail table has all expected columns from the API.""" + expected_columns = {"DATAFLOW", "FREQ", "MEASURE", "INDUSTRY", "REGION", "TIME_PERIOD", "OBS_VALUE"} + # TODO: Pass sample_retail_csv through bronze ingestion + # TODO: Assert output columns match expected_columns + pass + + +def test_bronze_retail_trade_not_null(spark, sample_retail_csv): + """TIME_PERIOD and OBS_VALUE are never null in bronze table.""" + # TODO: Pass sample_retail_csv through bronze ingestion + # TODO: Assert no nulls in TIME_PERIOD column + # TODO: Assert no nulls in OBS_VALUE column + pass + + +# ── Silver: Retail Turnover ─────────────────────────────────────────── + + +def test_silver_retail_turnover_decodes_regions(spark, sample_retail_csv): + """REGION code '1' becomes 'New South Wales', '2' becomes 'Victoria', etc.""" + # TODO: Pass bronze data through silver transformation + # TODO: Assert REGION "1" is decoded to "New South Wales" + # TODO: Assert REGION "2" is decoded to "Victoria" + # TODO: Assert REGION "3" is decoded to "Queensland" + pass + + +def test_silver_retail_turnover_decodes_industries(spark, sample_retail_csv): + """INDUSTRY code '20' becomes 'Food retailing', '41' becomes 'Clothing/footwear/personal'.""" + # TODO: Pass bronze data through silver transformation + # TODO: Assert INDUSTRY "20" is decoded to "Food retailing" + # TODO: Assert INDUSTRY "41" is decoded to "Clothing/footwear/personal" + pass + + +def test_silver_retail_turnover_parses_dates(spark, sample_retail_csv): + """TIME_PERIOD string '2024-01' is parsed to a proper date column.""" + # TODO: Pass bronze data through silver transformation + # TODO: Assert TIME_PERIOD is converted to DateType + # TODO: Assert "2024-01" becomes date(2024, 1, 1) + pass + + +# ── Gold: Retail Summary ───────────────────────────────────────────── + + +def test_gold_retail_summary_rolling_averages(spark): + """Gold table has 3-month and 12-month rolling averages.""" + # TODO: Create 24 months of silver-like data for one state/industry + # TODO: Pass through gold transformation + # TODO: Assert turnover_3m_avg is average of last 3 months + # TODO: Assert turnover_12m_avg is average of last 12 months + pass + + +def test_gold_retail_summary_yoy_growth(spark): + """Gold table has year-over-year growth percentage.""" + # TODO: Create 24 months of silver-like data + # TODO: Pass through gold transformation + # TODO: Assert yoy_growth_pct = (current - same_month_last_year) / same_month_last_year * 100 + pass + + +# ── Bronze: CPI Food ───────────────────────────────────────────────── + + +def test_bronze_cpi_schema(spark, sample_cpi_csv): + """Bronze CPI table has all expected columns.""" + expected_columns = {"DATAFLOW", "FREQ", "MEASURE", "INDEX", "REGION", "TIME_PERIOD", "OBS_VALUE"} + # TODO: Pass sample_cpi_csv through bronze ingestion + # TODO: Assert output columns match expected_columns + pass + + +# ── Silver: Food Price Index ───────────────────────────────────────── + + +def test_silver_food_price_index_decodes(spark, sample_cpi_csv): + """INDEX code '10001' becomes 'All groups CPI', '20001' becomes 'Food and non-alcoholic beverages'.""" + # TODO: Pass bronze CPI data through silver transformation + # TODO: Assert INDEX "10001" decoded to "All groups CPI" + # TODO: Assert INDEX "20001" decoded to "Food and non-alcoholic beverages" + # TODO: Assert REGION codes decoded to state names + 
pass + + +# ── Gold: Food Inflation ───────────────────────────────────────────── + + +def test_gold_food_inflation_yoy(spark): + """Gold table has year-over-year CPI change percentage.""" + # TODO: Create 8 quarters of silver-like CPI data + # TODO: Pass through gold transformation + # TODO: Assert yoy_change_pct = (current_quarter - same_quarter_last_year) / same_quarter_last_year * 100 + pass diff --git a/projects/coles-vibe-workshop/starter-kit/test_pipeline_local.py b/projects/coles-vibe-workshop/starter-kit/test_pipeline_local.py new file mode 100644 index 0000000..c0f8ecf --- /dev/null +++ b/projects/coles-vibe-workshop/starter-kit/test_pipeline_local.py @@ -0,0 +1,100 @@ +""" +Fast local tests that run WITHOUT Spark or Java. +Test transformation LOGIC separately from Spark execution. +""" +import pytest + + +# ── Pattern: Extract transformation logic into pure Python functions ── + +def decode_region(code: str) -> str: + """Decode ABS region code to state name.""" + mapping = { + "1": "New South Wales", "2": "Victoria", "3": "Queensland", + "4": "South Australia", "5": "Western Australia", "6": "Tasmania", + "7": "Northern Territory", "8": "Australian Capital Territory", + } + return mapping.get(str(code), f"Unknown ({code})") + + +def decode_industry(code: str) -> str: + """Decode ABS industry code to name.""" + mapping = { + "20": "Food retailing", "41": "Clothing, footwear and personal accessories", + "42": "Department stores", "43": "Other retailing", + "44": "Cafes, restaurants and takeaway", "45": "Household goods retailing", + } + return mapping.get(str(code), f"Unknown ({code})") + + +def parse_time_period(tp: str) -> tuple: + """Parse ABS TIME_PERIOD to (year, month, day).""" + if "-Q" in tp: + year, q = tp.split("-Q") + month = (int(q) - 1) * 3 + 1 + return (int(year), month, 1) + else: + parts = tp.split("-") + return (int(parts[0]), int(parts[1]), 1) + + +def calc_yoy_growth(current: float, previous: float) -> float: + """Calculate year-over-year growth percentage.""" + if previous == 0: + return 0.0 + return ((current - previous) / previous) * 100 + + +# ── Tests: Pure Python, no Spark, sub-second execution ────────── + +class TestDecodeRegion: + def test_nsw(self): + assert decode_region("1") == "New South Wales" + + def test_vic(self): + assert decode_region("2") == "Victoria" + + def test_all_states(self): + for code in range(1, 9): + result = decode_region(str(code)) + assert "Unknown" not in result, f"Code {code} not mapped" + + def test_unknown_code(self): + assert "Unknown" in decode_region("99") + + +class TestDecodeIndustry: + def test_food_retailing(self): + assert decode_industry("20") == "Food retailing" + + def test_clothing(self): + assert "Clothing" in decode_industry("41") + + def test_unknown(self): + assert "Unknown" in decode_industry("999") + + +class TestTimePeriodParsing: + def test_monthly(self): + assert parse_time_period("2024-01") == (2024, 1, 1) + + def test_quarterly(self): + assert parse_time_period("2024-Q1") == (2024, 1, 1) + assert parse_time_period("2024-Q2") == (2024, 4, 1) + assert parse_time_period("2024-Q3") == (2024, 7, 1) + assert parse_time_period("2024-Q4") == (2024, 10, 1) + + +class TestYoYGrowth: + def test_positive_growth(self): + assert calc_yoy_growth(110, 100) == pytest.approx(10.0) + + def test_negative_growth(self): + assert calc_yoy_growth(90, 100) == pytest.approx(-10.0) + + def test_zero_previous(self): + assert calc_yoy_growth(100, 0) == 0.0 + + def test_realistic_retail(self): + # Jan 2024: $4500M, Jan 2023: $4200M 
→ 7.14% growth + assert calc_yoy_growth(4500, 4200) == pytest.approx(7.142857, rel=1e-3) diff --git a/projects/coles-vibe-workshop/track-analyst.html b/projects/coles-vibe-workshop/track-analyst.html new file mode 100644 index 0000000..2a42397 --- /dev/null +++ b/projects/coles-vibe-workshop/track-analyst.html @@ -0,0 +1,1612 @@ + + + + + +Analyst Track — Coles Vibe Coding Workshop + + + + + + + +
+ + + +
+
+
+
+
+
+
+
+
Analyst Track

Analyst Track

+

Coles × Databricks Vibe Coding Workshop

+
Genie Spaces · AI/BI Dashboards · Apps
+
Coles Group • Databricks • Anthropic
+
+ + +
+ +
+ Post Lab 0 5 min + +
You already completed Lab 0 (guided hands-on) with the whole group. You should have a working CLAUDE.md, a passing bronze test, and a verified Databricks connection. If not, ask the facilitator for help before continuing.
+ +

Step 1: Add Analyst Track Context

+

Append the Analyst track additions to your existing CLAUDE.md:

+
cat ~/starter-kit/CLAUDE-analyst.md >> CLAUDE.md
+ +

Step 2: Verify Your Environment

+
# Confirm your bronze test still passes +uv run pytest tests/ -x --no-header -q + +# Check your Unity Catalog access +databricks catalogs list | grep workshop
+ +
If any of these fail, ask the facilitator for help immediately.
+
+ +
+ + +
+ +
+ Lab 0 + +

Step 4: Explore the Data

+

All tracks use the same gold tables. Paste this into Claude Code:

+
Query these tables and show me 5 rows from each: +- workshop_vibe_coding.checkpoints.retail_summary +- workshop_vibe_coding.checkpoints.food_inflation_yoy + +Tell me: what columns are available, what date range is covered, +and which states are included.
+ +

What You're Working With

+
+
+

retail_summary

+
    +
  • state — Australian state name
  • +
  • industry — Retail category
  • +
  • month — Monthly observation date
  • +
  • turnover_millions — Monthly turnover ($M AUD)
  • +
  • turnover_3m_avg / turnover_12m_avg
  • +
  • yoy_growth_pct — Year-over-year growth %
  • +
+
+
+

food_inflation_yoy

+
    +
  • state — Australian state name
  • +
  • quarter — Calendar quarter
  • +
  • cpi_index — CPI value (base = 100)
  • +
  • yoy_change_pct — YoY CPI change %
  • +
+
+
+ +

Your Track: Analyst

+

You will build natural language interfaces that let business users query this data without writing code — Genie spaces, AI/BI dashboards, and a FastAPI web app.

+ +
Analyst track note: Most Lab 1 work is in the Databricks UI, not the terminal. The terminal is used for adding metadata and writing SQL.
+
+ +
+ + +
+

Lab 1

+

Genie & AI/BI Dashboards

+

Create natural language interfaces for business users

+
12:00 PM / 60 Minutes
+
+ + +
+ +
+ Phase 1 15 min + +
+
+

1.1 Create your Genie Space (UI)

+

In the Databricks workspace UI:

+
1

Click Genie in the left sidebar

+
2

Click New Genie Space

+
3

Configure:

+
    +
  • Name: "Grocery Intelligence — TEAM_NAME"
  • +
  • SQL Warehouse: Select the workshop warehouse
  • +
  • Tables: Add both gold tables: +
    workshop_vibe_coding.TEAM_SCHEMA.retail_summary +
    workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy
  • +
+

General Instructions (paste into Genie):

+
This data contains Australian retail trade and food price data. +States are Australian states (New South Wales, Victoria, Queensland, etc.). +Turnover is in millions of AUD. +CPI index values are relative to a base period. +YoY growth and change percentages show year-over-year comparisons.
+
+
+

1.2 Add Column Descriptions (Terminal)

+

Paste this into Claude Code:

+
Add column comments to our gold tables for better Genie and AI/BI accuracy: + +For workshop_vibe_coding.TEAM_SCHEMA.retail_summary: +- Table comment: "Monthly retail turnover summary by Australian + state and industry with rolling averages and YoY growth" +- state: "Australian state name (NSW, VIC, QLD, etc.)" +- industry: "Retail industry category" +- month: "Date of the monthly observation" +- turnover_millions: "Monthly retail turnover in millions AUD" +- turnover_3m_avg: "3-month rolling average" +- turnover_12m_avg: "12-month rolling average" +- yoy_growth_pct: "Year-over-year turnover growth %" + +For workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: +- Table comment: "Quarterly food price inflation by state" +- state: "Australian state name" +- quarter: "Calendar quarter" +- cpi_index: "Consumer Price Index (base = 100)" +- yoy_change_pct: "YoY CPI change % (positive = inflation)" + +Use ALTER TABLE ... SET TBLPROPERTIES for table comments. +Use ALTER TABLE ... ALTER COLUMN ... COMMENT for columns.
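If you would rather apply the comments yourself than prompt the agent, a minimal notebook sketch looks like this (assumes an active `spark` session; the table name and comment strings are illustrative, so substitute your TEAM_SCHEMA):

```python
# Sketch: apply column comments with Spark SQL from a notebook.
# Assumes `spark` is available; table and column names are placeholders.
TABLE = "workshop_vibe_coding.TEAM_SCHEMA.retail_summary"

COLUMN_COMMENTS = {
    "state": "Australian state name (NSW, VIC, QLD, etc.)",
    "industry": "Retail industry category",
    "month": "Date of the monthly observation",
    "turnover_millions": "Monthly retail turnover in millions AUD",
    "yoy_growth_pct": "Year-over-year turnover growth %",
}

for column, comment in COLUMN_COMMENTS.items():
    safe = comment.replace("'", "''")  # escape quotes so the generated SQL stays valid
    spark.sql(f"ALTER TABLE {TABLE} ALTER COLUMN {column} COMMENT '{safe}'")
```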
+
+
+
+ +
+ + +
+ +
+ Phase 2 20 min + +
+
+

2.1 Create Dashboard (Person A — UI)

+
1

Navigate to DashboardsCreate DashboardAI/BI Dashboard

+
2

Connect to your gold tables

+
3

Use these natural language prompts:

+ +
Show monthly food retail turnover by state as a line chart for the last 2 years
+
Create a bar chart comparing year-over-year retail growth by state for the latest month
+
Show a heatmap of food inflation by state and quarter
+
Display the top 5 states by average monthly turnover as a horizontal bar chart
+
Show a trend line of national food price inflation over time
+ +
4

Arrange into a clean layout. Title: "Grocery Intelligence Dashboard — TEAM_NAME"

+
+
+

2.2 Test Genie (Person B — UI)

+

Try these questions and note which ones Genie gets right:

+
    +
  • "Which state had the highest food retail turnover last month?"
  • +
  • "Show me the YoY food price inflation trend for Victoria"
  • +
  • "Compare retail growth across all states for the last 12 months"
  • +
  • "What's the average monthly turnover for department stores in NSW?"
  • +
  • "Which industry has the fastest growing turnover nationally?"
  • +
+

If Genie gets a question wrong, refine the General Instructions.

+ +
Tip: The richer your column descriptions and instructions, the more accurate Genie becomes. This is the single biggest lever for quality.
+ +
Stuck? Grab Checkpoint AN-1B: dashboard with 3 pre-built visualizations.
+
+
+
+ +
+ + +
+ +
+ Phase 3 15 min + +
+
+

3.1 Tune Genie Accuracy

+

Based on testing, update General Instructions with:

+
    +
  • Example questions and expected SQL patterns
  • +
  • Clarifications for ambiguous terms (e.g. "last month" = most recent month in data)
  • +
  • Column mappings for common questions
  • +
+

Example instruction additions:

+
When users ask about "food retail", filter industry = 'Food retailing'. +When users ask about "last month", use the MAX(month) from the data. +For state comparisons, include all 8 states/territories. +Turnover values are already in millions — do not multiply.
+
+
+

3.2 Publish Dashboard

+
1

Click Publish on your dashboard

+
2

Click Share → get the URL for your demo

+
3

Click Embed → copy iframe code (for Lab 2 app integration)

+ +
Save the embed URL! You will use it in Lab 2 to embed the dashboard inside your web app.
+ +

3.3 Document Accuracy

+

Test Genie with 10 sample questions. For each, record:

+
    +
  • Did Genie answer correctly? (Yes / No / Partial)
  • +
  • Did you need to refine instructions?
  • +
  • What SQL did Genie generate?
  • +
+

Target: 80%+ accuracy on straightforward questions.

+
+
+
+ +
+ + +
+ +
+ Phase 4 5 min + +

Verification Checklist

+
+
+
    +
  • Genie space created with both gold tables and instructions
  • +
  • Gold tables have column descriptions in Unity Catalog
  • +
  • AI/BI dashboard has at least 4 visualizations
  • +
  • Genie answers 7/10 test questions correctly
  • +
  • Dashboard published with clean layout
  • +
  • Ready for Show & Tell
  • +
+
+
+

Show & Tell Prep

+

You have 3 minutes to present. Cover:

+
    +
  1. Live Genie demo — ask it a question
  2. +
  3. Dashboard walkthrough — key visualizations
  4. +
  5. What improved Genie accuracy the most?
  6. +
  7. Which questions was Genie best/worst at?
  8. +
+ +

Reflection Questions

+
    +
  • How did column descriptions affect Genie accuracy?
  • +
  • How does Genie compare to building a custom query interface?
  • +
  • What would you do differently with more time?
  • +
+
+
+ +
Running out of time? Grab Checkpoint AN-1C for a complete Lab 1 solution.
+
+ +
+ + +
+

Lab 2

+

Build Your App

+

FastAPI backend · HTML + Tailwind + htmx · AI-powered queries

+
2:00 PM / 60 Minutes
+
+ + +
+ +
+

Choose an ambition level based on your team's pace. All tiers include a Genie space and an AI/BI dashboard.

+ +
+
+
Tier 1
+

Quick — Embed & Ship

+

~20 minutes

+

FastAPI backend + embedded AI/BI dashboard via iframe. Publish your dashboard, drop the embed URL into your app.

+
    +
  • Minimal frontend code
  • +
  • Polished result fast
  • +
  • "Ask AI" text input
  • +
  • HTML + Tailwind CSS
  • +
+
+
+
Tier 2
+

Medium — Custom Charts

+

~35 minutes

+

FastAPI + Recharts or Observable Plot. Query gold tables via API, render interactive visualizations.

+
    +
  • Custom chart components
  • +
  • Filter bar (state, dates)
  • +
  • Metric cards
  • +
  • "Ask AI" with results viz
  • +
+
+
+
Tier 3
+

Stretch — Full Platform

+

~60 minutes

+

Full React app with custom viz + embedded dashboard + Genie + NL query feature. The whole enchilada.

+
    +
  • Tabbed navigation
  • +
  • Dashboard + Explorer + AI tabs
  • +
  • Sparklines + metric cards
  • +
  • Responsive Tailwind layout
  • +
+
+
+ +
┌───────────────┐      HTTP      ┌──────────────────────┐
│ Browser       │  (htmx calls)  │ FastAPI Backend      │
│ (Tailwind     │───────────────▶│ (app.py)             │      ┌──────────┐   ┌──────────────┐
│  + htmx)      │◀───────────────│ GET /api/metrics     │─────▶│SQL       │   │Foundation    │
└───────────────┘  HTML / JSON   │ POST /api/ask        │      │Warehouse │   │Model API     │
                                 │ GET /                │      └──────────┘   └──────────────┘
                                 └──────────────────────┘
+
+ +
+ + +
+ +
+ Phase 1 10 min + +
+
+

1.1 Write Your App PRD

+

Paste this into Claude Code:

+
Create a new project called "grocery-app" with this PRD: + +## Grocery Intelligence App + +### Overview +A web application that displays retail analytics from our gold tables +and allows natural language querying of the data. + +### User Stories +1. As a business user, I want to see retail turnover by state in a + clean dashboard with filters. +2. As an analyst, I want to ask questions in plain English like + "which state had the highest food retail growth last year?" +3. As an executive, I want to see food inflation trends at a glance. + +### Technical Requirements +- Backend: FastAPI (Python) +- Frontend: HTML + Tailwind CSS + htmx (no npm/node required) +- Data: workshop_vibe_coding.<team_schema>.retail_summary + and workshop_vibe_coding.<team_schema>.food_inflation_yoy +- AI feature: Natural language to SQL using Foundation Model API +- Deployment: Databricks Apps + +### API Endpoints +- GET /api/metrics?state=X&start_date=Y&end_date=Z +- POST /api/ask {"question": "which state has highest growth?"} +- GET /health → {"status": "ok"} +- GET / (serves the frontend)
+
+
+

1.2 Write API Tests

+

Paste this into Claude Code:

+
Write pytest tests for the FastAPI backend: + +1. test_health: GET /health returns 200 with {"status": "ok"} + +2. test_get_metrics: + - Returns 200 with valid state and date range + - Returns list of records with: state, industry, month, + turnover_millions, yoy_growth_pct + - Returns 400 for invalid date format + - Returns empty list for non-existent state + +3. test_ask_question: + - Returns 200 with a valid question + - Response has: answer (string), sql_query (string) + - Returns 400 for empty question + +Write ONLY the tests. Do NOT implement yet. +Use httpx AsyncClient with ASGITransport for testing.
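For reference, the httpx pattern those tests should follow looks roughly like this (a sketch; it assumes app.py exposes a FastAPI instance named `app`, as the PRD specifies):

```python
import pytest
from httpx import ASGITransport, AsyncClient

from app import app  # assumes app.py exposes the FastAPI instance


@pytest.mark.asyncio
async def test_health():
    # ASGITransport calls the app in-process, so no server is needed
    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        response = await client.get("/health")
    assert response.status_code == 200
    assert response.json() == {"status": "ok"}
```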
+ +

1.3 Create Genie Space (Person C — UI)

+

If not already done in Lab 1, follow the Genie Space setup from Phase 1 of Lab 1.

+ +
Team split: Person A → PRD prompt, Person B → test prompt, Person C → Genie space (UI task).
+
+
+
+ +
+ + +
+ +
+ Phase 2 25 min + +

2.1 Implement the Backend

+

Paste into Claude Code:

+
Implement the FastAPI backend to pass all tests. + +For /api/metrics: +- Query the retail_summary gold table with optional filters +- Use databricks-sql-connector with parameterized queries +- Return results as JSON + +For /api/ask: +- Send the user's question to the Foundation Model API with the table schema as context +- The LLM generates a SQL query +- Execute the SQL and return results with the generated query + +Connection details from environment variables: +- DATABRICKS_HOST, DATABRICKS_HTTP_PATH, DATABRICKS_TOKEN + +Run tests after implementation.
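To make the prompt concrete, one possible shape for the two data paths is sketched below. This is not the checked-in solution: the named-parameter style assumes a recent databricks-sql-connector, and the Foundation Model endpoint name is an assumption, so substitute whatever pay-per-token model your workspace exposes.

```python
import os

from databricks import sql  # databricks-sql-connector
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import ChatMessage, ChatMessageRole

GOLD_TABLE = "workshop_vibe_coding.TEAM_SCHEMA.retail_summary"  # placeholder schema


def fetch_metrics(state: str | None = None) -> list[dict]:
    """Backing query for /api/metrics: parameterized, never f-strings user input."""
    query = f"""
        SELECT state, industry, month, turnover_millions, yoy_growth_pct
        FROM {GOLD_TABLE}
        WHERE (:state IS NULL OR state = :state)
        ORDER BY month
    """
    with sql.connect(
        server_hostname=os.environ["DATABRICKS_HOST"],  # bare hostname, no https://
        http_path=os.environ["DATABRICKS_HTTP_PATH"],
        access_token=os.environ["DATABRICKS_TOKEN"],
    ) as conn, conn.cursor() as cursor:
        cursor.execute(query, {"state": state})  # named parameters, not string concat
        cols = [c[0] for c in cursor.description]
        return [dict(zip(cols, row)) for row in cursor.fetchall()]


def question_to_sql(question: str) -> str:
    """First half of /api/ask: have a Foundation Model draft SQL, schema in context."""
    w = WorkspaceClient()
    response = w.serving_endpoints.query(
        name="databricks-meta-llama-3-3-70b-instruct",  # assumed endpoint; check your workspace
        messages=[
            ChatMessage(
                role=ChatMessageRole.SYSTEM,
                content=(
                    f"Reply with a single SQL query against {GOLD_TABLE} "
                    "(columns: state, industry, month, turnover_millions, yoy_growth_pct)."
                ),
            ),
            ChatMessage(role=ChatMessageRole.USER, content=question),
        ],
    )
    return response.choices[0].message.content
```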
+ +

2.3 Wire It Together

+
Create app.py that: +1. Serves FastAPI with CORS middleware +2. Mounts static files from static/ directory +3. Serves index.html at root +4. Has proper error handling + +Create app.yaml for Databricks Apps: + command: ["uvicorn", "app:app", "--host", "0.0.0.0", "--port", "8000"] + +Create requirements.txt with all dependencies. Run all tests one final time.
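A minimal sketch of that wiring (the generated app.py will add error handling and the real endpoints):

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware
from fastapi.responses import FileResponse
from fastapi.staticfiles import StaticFiles

app = FastAPI(title="Grocery Intelligence")

# Wide-open CORS is fine for a workshop app; tighten it for production
app.add_middleware(
    CORSMiddleware, allow_origins=["*"], allow_methods=["*"], allow_headers=["*"]
)

# Serve the htmx frontend from static/
app.mount("/static", StaticFiles(directory="static"), name="static")


@app.get("/health")
def health() -> dict:
    return {"status": "ok"}


@app.get("/")
def index() -> FileResponse:
    return FileResponse("static/index.html")
```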
+ +
Stuck at 25 minutes? Grab Checkpoint 2A: a working app skeleton with health endpoint, database connection, and basic structure.
+
+ +
+ + +
+ +
+

2.2 Build the Frontend — Choose Your Approach

+ +
+
+

Tier 1: Embedded Dashboard

+
Build a frontend in static/index.html with: + +1. Header: "Grocery Intelligence Platform" + with your team name +2. An iframe embedding our published AI/BI + dashboard (I'll give you the URL) +3. "Ask AI" section: text input + response area +4. Use Tailwind CSS from CDN +5. Clean, professional styling + (dark header, white cards)
+
+
+

Tier 2: Custom Charts

+
Build a React frontend with: + +1. Header: "Grocery Intelligence Platform" +2. Filter bar: state dropdown, date pickers +3. Recharts line chart: monthly turnover +4. Recharts bar chart: YoY growth by state +5. Metric cards: Total Turnover, Avg Growth +6. "Ask AI" section +7. Tailwind CSS styling +8. Fetch from FastAPI endpoints
+
+
+

Tier 3: Full Platform

+
Build a React frontend with: + +1. Header + navigation tabs +2. Dashboard tab: charts + embedded + AI/BI iframe +3. Explorer tab: filter bar + sortable + data table +4. Ask AI tab: NL query + SQL display +5. Metric cards with sparklines +6. Responsive Tailwind CSS layout
+
+
+ +

2.4 Create AI/BI Dashboard (Person C — UI)

+

Navigate to DashboardsCreate DashboardAI/BI Dashboard. Connect to gold tables. Use the NL prompts from Lab 1 Phase 2.

+ +
Observable Plot alternative: Use <script src="https://cdn.jsdelivr.net/npm/@observablehq/plot"> for D3-based charts without React.
+
+ +
+ + +
+ +
+ Phase 3 15 min + +
+
+

3.1 Deploy Your App

+
databricks apps deploy \ + --name grocery-app-<team_name> \ + --source-code-path ./
+ +

3.2 Embed the Dashboard

+

Get the embed URL from Person C, then add to your HTML:

+
<div style="width:100%;height:600px;"> + <iframe + src="YOUR_EMBED_URL_HERE" + width="100%" + height="100%" + frameborder="0" + ></iframe> +</div>
+ +
Note: Embedded dashboards display in light mode only. Users need Databricks credentials to view.
+
+
+

3.3 Test Genie Space

+

Test with these questions:

+
Which state had the highest food retail turnover last month?
+
Show me the YoY food price inflation trend for Victoria.
+
Compare retail growth across all states for the last 12 months.
+ +
Tip: If Genie generates incorrect SQL, add table descriptions and column comments in Unity Catalog. The richer the metadata, the better Genie performs.
+ +

3.4 Verify Everything Works

+
    +
  • App loads in browser with dashboard
  • +
  • Filters work (state, date range)
  • +
  • "Ask AI" returns meaningful answers
  • +
  • Genie space answers NL questions
  • +
  • AI/BI dashboard shows visualizations
  • +
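To smoke-test the API side of this checklist from the terminal, a short script works (a sketch; assumes the app is running locally via `uvicorn app:app --reload --port 8000` and that `requests` is installed):

```python
import requests

BASE = "http://localhost:8000"  # local uvicorn; swap in the deployed app URL if reachable

# Health endpoint answers instantly
assert requests.get(f"{BASE}/health", timeout=10).json() == {"status": "ok"}

# Metrics endpoint returns a list of records
records = requests.get(f"{BASE}/api/metrics", timeout=60).json()
assert isinstance(records, list)

# Ask AI returns an answer plus the SQL it generated
reply = requests.post(
    f"{BASE}/api/ask",
    json={"question": "Which state has the highest turnover?"},
    timeout=120,
).json()
assert "answer" in reply and "sql_query" in reply

print("Smoke checks passed")
```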
+
+
+
+ +
+ + +
+ +
+ Phase 4 5 min + +

Your 3-Minute Demo

+

Prepare a demo script covering these points:

+ +
+
+
1

Your pipeline — quick: show the DAG or table list

+
2

Your app — show it running, use the AI "Ask" feature live

+
3

Your Genie space — ask it a question live

+
4

Your dashboard — show the key visualizations

+
5

One thing that surprised you

+ +
Pro tip: Decide who presents what. Split across team members so everyone gets a moment.
+
+
+

Reflection Questions

+
    +
  1. How did the PRD guide the agent's decisions?
  2. +
  3. How does Genie compare to your custom AI query feature?
  4. +
  5. What would you need to add to make this production-ready?
  6. +
  7. Which approach (app vs. Genie vs. AI/BI dashboard) is most useful for your team?
  8. +
+ +
Stuck? Grab Checkpoint 2D for a complete reference solution.
+ +

Bonus Challenges (if time permits)

+
    +
  • Add Chart.js charts to your app
  • +
  • Build an MCP server wrapping your API
  • +
  • Create a /deploy skill for validate + deploy
  • +
  • Add another team's gold tables to Genie
  • +
+
+
+
+ +
+ + +
+ +
+
+
+

Genie & Dashboard Issues

+ + + + + + + + + +
ProblemFix
Can't find Genie in sidebarAsk facilitator — may need to be enabled for your workspace
Genie permission errorNeed CREATE GENIE SPACE permission on the catalog
Genie gives wrong SQLAdd column descriptions + example queries to General Instructions
Dashboard viz doesn't matchRephrase the NL prompt or write SQL directly
Column comments not showingUse ALTER TABLE t ALTER COLUMN c COMMENT 'desc'
Dashboard slowCheck SQL warehouse is running (Compute → SQL Warehouses)
Embedded dashboard blankUsers need Databricks credentials to view
+
+
+

App & API Issues

+ + + + + + + + +
ProblemFix
CORS errorsAdd CORSMiddleware with allow_origins=["*"]
htmx not loadingAdd to <head>: <script src="https://unpkg.com/htmx.org@2.0.4">
AI generates invalid SQLAdd full table schema + column descriptions to LLM system prompt
App deploys, blank pageCheck static files mounted: app.mount("/static", ...)
databricks-sql-connector errorsCheck requirements.txt and env vars
Running out of timePrioritize: working app > Genie > dashboard. Grab checkpoints!
+
+
+
+ +
+ + +
+ +
+
+
+

Steering the Agent

+ + + + + + + + +
When the agent...Say this
Writes too much code"Keep it simple. One function, minimal code."
Ignores your CLAUDE.md"Read CLAUDE.md first, then try again."
Gets stuck in a loop"Stop. Let's try a different approach."
Makes it overly complex"Simplify. I just need [specific thing]."
Writes code before tests"Stop. Write the tests first, then implement."
Rewrites working code"Don't change functions that pass tests."
+ +

Useful Commands

+
# Check your Databricks connection +databricks auth status + +# Run app tests +pytest tests/test_app.py -x + +# Start app locally for testing +cd app && uvicorn app:app --reload --port 8000 + +# Deploy app +databricks apps deploy \ + --name grocery-app-TEAM_NAME \ + --source-code-path ./
+
+
+

Checkpoint Recovery

+

If you are stuck and need to catch up:

+
-- Ensure gold tables exist in your schema
CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.retail_summary
  AS SELECT * FROM workshop_vibe_coding.checkpoints.retail_summary;

CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy
  AS SELECT * FROM workshop_vibe_coding.checkpoints.food_inflation_yoy;
+ +

MCP & Skills

+

Speed up common tasks:

+
# Commit with a good message +/commit + +# Run and fix failing tests +/test + +# Search Databricks docs +Search the Databricks docs for how to +create a Genie space programmatically.
+
+
+
+ +
+ + +
+ +
+
+
+
+

Lab 1: Genie & AI/BI Dashboards

+
    +
  • Genie space created with both gold tables and instructions
  • +
  • Gold tables have column descriptions in Unity Catalog
  • +
  • AI/BI dashboard has at least 4 visualizations
  • +
  • Genie answers 7/10 test questions correctly
  • +
  • Dashboard published with clean layout
  • +
  • Ready for Show & Tell
  • +
+
+
+
+
+

Lab 2: Build Your App

+
    +
  • FastAPI backend with tested endpoints
  • +
  • HTML frontend with Tailwind + htmx
  • +
  • AI-powered natural language query feature
  • +
  • Genie space created and answering questions
  • +
  • AI/BI dashboard with at least 3 visualizations
  • +
  • App deployed to Databricks Apps
  • +
  • All tests passing
  • +
  • Ready for 3-minute demo!
  • +
+
+
+
+ +
+
+ Remember: A working demo is better than a perfect app that is not done. Use checkpoints if you are running behind. The goal is to experience building with AI coding agents, not to finish every feature. +
+
+
+ +
+ + + + + diff --git a/projects/coles-vibe-workshop/track-common.html b/projects/coles-vibe-workshop/track-common.html new file mode 100644 index 0000000..80d84a6 --- /dev/null +++ b/projects/coles-vibe-workshop/track-common.html @@ -0,0 +1,2072 @@ + + + + + +Vibe Coding Best Practices — Coles Vibe Coding Workshop + + + + + + + +
+ + + + + + + + + +
+
+
+
+
+
+
+
+
Best Practices

Vibe Coding Best Practices

+

Coles × Databricks Vibe Coding Workshop — 9:30 AM – 4:00 PM

+

A Guide to Agentic Software Development with Claude

+
+ +
+
+ "There's a new kind of coding I call 'vibe coding', where you fully give in to the vibes, embrace exponentials, and forget that the code even exists." + — Andrej Karpathy, February 2025 +
+ +
+
+

Traditional vs Agentic Development

+
+

Traditional Development

+

Human writes every line of code, handles all implementation details, debugs and iterates manually. Speed limited by typing and cognitive load.

+
+
+

Agentic Development

+

Human specifies intent via tests, specs, and CLAUDE.md. Agent implements, self-corrects by reading test failures. Speed limited by quality of direction, not typing.

+
+
+

Rule #1: Just Say What You Want

+

You literally type what you want and it happens. Don't write config files by hand. Have a conversation instead. Want behavior to change? Say how. See a technique? Paste it in. Everything is just markdown: CLAUDE.md, skills, hooks. The progression: say it → curate it → wire up tools (often just CLI instructions).

+
+
+
+

The "Brilliant New Employee"

+

Think of Claude as a brilliant but brand-new employee who just joined your team today:

+

1 Deep Technical Skills

+

Knows Python, PySpark, SQL, FastAPI, React — but has zero context on your norms or architecture decisions.

+

2 Excels With Clear Direction

+

Given a precise spec, produces excellent code. Given a vague request, produces plausible-looking code that misses the mark.

+

3 Needs Explicit Context

+

"Always PySpark, never pandas." "Snake_case for all columns." "Tests before code." Won't infer your team's standards.

+
+
+ +
Your job shifts from WRITING code to DIRECTING an exceptionally capable engineer. Invest time in specs upfront, not in writing code.
+
+ +
+ +
+
+
Bronze
Raw Ingestion
+ +
Silver
Cleaned & Enriched
+ +
Gold
Analytics-Ready
+
+ +
+
+

Bronze (Raw)

+
    +
  • Ingest data as-is from APIs and files
  • +
  • No transformations applied
  • +
  • Preserves original column names and types
  • +
  • Acts as an immutable audit trail
  • +
  • Use @dp.table with data quality expectations
  • +
+
+
+

Silver (Cleaned)

+
    +
  • Decode codes to readable names
  • +
  • Handle nulls and invalid rows
  • +
  • Standardize date formats and types
  • +
  • Rename columns to snake_case
  • +
  • Use @dp.table reading from bronze
  • +
+
+
+

Gold (Aggregated)

+
    +
  • Roll up and aggregate metrics
  • +
  • Join across data sources
  • +
  • Calculate KPIs (YoY growth, rolling averages)
  • +
  • Ready for Genie and AI/BI dashboards
  • +
  • Use @dp.materialized_view
  • +
+
+
+ +
Why medallion? Separation of concerns. Each layer has one job. Bugs are easy to trace. Silver and gold can be rebuilt from bronze at any time.
+
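To make the layer responsibilities concrete, here is a minimal sketch of one table per layer. Names are illustrative, the import path follows the troubleshooting tables later in this deck, and the upstream-read call is assumed to mirror the familiar dlt API:

import databricks.declarative_pipelines as dp  # import path per the troubleshooting table
from pyspark.sql import functions as F

@dp.table                                       # Bronze: ingest as-is, warn-only quality check
@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL")
def abs_retail_trade_bronze():
    return spark.read.csv(ABS_API_URL, header=True)  # ABS_API_URL is illustrative; spark exists in the pipeline runtime

@dp.table                                       # Silver: decode and rename, no aggregation
def retail_turnover():
    return (dp.read("abs_retail_trade_bronze")  # assumed to mirror the dlt read API; adjust to your runtime
            .withColumnRenamed("OBS_VALUE", "turnover_millions"))

@dp.materialized_view                           # Gold: aggregated, analytics-ready
def retail_summary():
    return (dp.read("retail_turnover")
            .groupBy("state")
            .agg(F.sum("turnover_millions").alias("total_turnover")))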
+ +
+ + + + +
+ +
+

CLAUDE.md is a persistent instruction file that encodes your team's standards. The agent reads it automatically at the start of every session — no need to repeat yourself. You don't write it by hand — tell Claude about your project and it creates the CLAUDE.md for you.

+ +

Three Scope Levels

+
+
+

1. User-Level

+

~/.claude/CLAUDE.md

+

Personal preferences, coding style, editor habits. Applies to all your projects.

+
+
+

2. Repo-Level

+

./CLAUDE.md

+

Team standards, tech stack, architecture decisions. Overrides user-level. Checked into git.

+
+
+

3. Module-Level

+

./src/CLAUDE.md

+

Module-specific rules (e.g., "all files here use @dp.table"). Overrides repo-level.

+
+
+ +

Why It Works

+
+
    +
  • Self-correcting: Agent re-reads it each session — no drift
  • +
  • Searchable: Agent can reference specific sections on demand
  • +
+
    +
  • Maintainable: One file to update, entire team benefits
  • +
  • Scoped: Different rules for different parts of the codebase
  • +
+
+
+ +
+ + + + +
+ +
+
+
+

Template Structure

+
# CLAUDE.md — Grocery Intelligence Platform + +## Team +- Team Name: TEAM_NAME +- Schema: workshop_vibe_coding.TEAM_SCHEMA + +## Project +A data platform that ingests Australian retail +and food price data through a medallion architecture. + +## Tech Stack +- Data processing: PySpark (never pandas) +- Pipeline: Lakeflow Declarative Pipelines +- Web backend: FastAPI with Pydantic models +- Deployment: Databricks Asset Bundles + +## Data Standards +- Architecture: Bronze → Silver → Gold +- Date columns: YYYY-MM-DD, stored as DATE +- Naming: snake_case for all tables/columns + +## Rules +- Always use PySpark, never pandas +- Write tests BEFORE implementation +- Keep solutions minimal + +## Project Structure +(directory tree) + +## Data Sources +(table with API endpoints) + +## Code Mappings +(region codes, industry codes)
+
+
+

Tips for Effective CLAUDE.md

+
Keep it lean: Aim for ~50 lines. CLAUDE.md consumes context tokens — every line counts.
+
Be specific: "Always PySpark, never pandas" beats "Use appropriate tools." Concrete rules get followed.
+
Include code mappings: Region codes, industry codes, enum values — anything the agent needs for lookups.
+ +
Don't dump everything in. CLAUDE.md is not documentation. It's a set of standing orders. If a rule isn't referenced often, it belongs in a separate doc the agent can read on demand.
+ +

What NOT to Include

+
    +
  • Long explanations or tutorials
  • +
  • Full API documentation (link to it instead)
  • +
  • Implementation details that change frequently
  • +
  • Anything already in your test suite
  • +
+
+
+
+ +
+ + + + +
+ +
+
+
STEP 1: Human writes the test
+ +
STEP 2: Agent implements code to pass
+ +
STEP 3: Run & iterate (agent self-corrects)
+ +
STEP 4: Human reviews & accepts
+
+ +
The Test IS the Spec. Unlike prose specs, a test is executable, unambiguous, and either passes or fails.
+ +
+
+

Why This Works

+
+

1 Agent Writes, You Read

+

The agent writes the test code, but because it's Given/When/Then, you can read it back. You verify the test captured what you meant. No room for "I thought you meant..."

+
+
+

2 Self-Correcting Loop

+

The agent reads failure messages, understands what went wrong, and fixes automatically. No waiting for human review on each iteration.

+
+
+
+
+

3 Guardrails

+

Existing passing tests prevent the agent from rewriting working code. New code must pass new tests without breaking old ones.

+
+
+

4 Ratchet Effect

+

Each passing test constrains the next implementation. Quality only goes up, never down. Every green test is permanent progress.

+
+
If you take one thing from today: write the tests first. Always.
+
+
+ +
Tip: Always use SEPARATE prompts for tests and implementation. If the same prompt writes both, Claude may write tests that match its code rather than your requirements.
+
+ +
+ + + + +
+ +
+
+
+

What Are Tokens?

+

A token is the basic unit of text for an LLM — roughly 3/4 of a word. Everything the agent reads and writes is measured in tokens.

+
+
+ 200-line Python file +
+ ~2,500 tokens +
+
+ Your CLAUDE.md (~50 lines) +
+ ~500 tokens +
+
+ Claude's context window +
+ 200,000 tokens +
+
+
Context window = RAM. When it fills up, older context gets evicted and the agent may forget earlier decisions.
+
+
+

Four Strategies

+
+

1 Keep CLAUDE.md Lean

+

~50 lines. Every line consumes tokens every session. Put verbose docs elsewhere.

+
+
+

2 Be Specific

+

"Fix the test in test_pipeline.py::test_silver" not "write some code." Target files, not vague asks.

+
+
+

3 Plan Before Building

+

Use /plan mode. Alignment upfront prevents expensive rework that wastes context.

+
+
+

4 New Sessions for New Tasks

+

Context is RAM. When switching tasks, start fresh with /clear or a new session.

+
+
+
+
+ +
+ + + + + + + + +
+
0
+

Lab 0: Guided Hands-On

+

Initialize your project · Write your first test · Build bronze ingest

+
10:45 AM / 45 Minutes
+
+ + + + +
+ +
+

Don't write a file — have a conversation. Open your Coding Agents terminal and tell Claude about your project. Replace <team_schema> with your assigned schema name.

+ +
I'm building a grocery intelligence platform on Databricks. + +Tech stack: PySpark, Lakeflow Declarative Pipelines, FastAPI + React, +Databricks Asset Bundles (DABs). + +Data sources: ABS SDMX APIs, FSANZ web scraping, ACCC PDF ingestion +via UC Volumes. + +Unity Catalog namespace: workshop_vibe_coding.<team_schema>. + +Set up the project and create a CLAUDE.md.
+ +
This is Rule #1 of vibe coding: you literally type what you want and it happens. You don't hand-write config files — you engineer the harness through conversation.
+ +
After Claude responds: Want to change something? Just say it. “Add a rule that we always use PySpark, never pandas.” “Add our team angle: Retail Performance.” That's agentic engineering — you direct, it implements.
+
+ +
+ + + + +
+ +
+
+
+

Given — When — Then

+
def test_bronze_ingest_retail(spark): + # GIVEN: Raw ABS retail trade data + raw_data = spark.createDataFrame([ + ("2024-01-15", 1000, "NSW"), + ("2024-01-16", None, "VIC"), + ("invalid", 2000, "QLD"), + ("2024-01-17", 1500, "NSW"), + ("2024-01-18", 900, "VIC"), + ("2024-01-19", 1100, "QLD"), + ("2024-01-20", 1300, "NSW"), + ("2024-01-21", None, "SA"), + ("2024-01-22", 800, "WA"), + ("2024-01-23", 1200, "TAS"), + ], ["date", "price", "state"]) + + # WHEN: Clean function applied + result = clean_transactions(raw_data) + + # THEN: Invalid rows removed (2 null prices, 1 bad date) + assert result.count() == 7 + assert result.filter( + "price IS NULL" + ).count() == 0
+
+
+

Key Principles

+
+

Concrete Values

+

Use real data: actual state codes, valid dates, realistic dollar amounts. Not mocks or abstract placeholders.

+
+
+

Small Datasets

+

5-10 rows per test. Enough to cover happy path + edge cases. Easy to reason about.

+
+
+

Descriptive Names

+

test_bronze_ingest_retail not test_transform. The name tells the agent what to build.

+
+
+

Multiple Assertions

+

Check row count, column names, specific values, and null handling. Each assertion is a constraint.

+
+
+
+
+ +
+ + + + +
+ +
+

Now let the agent build the implementation to pass your test. Paste the prompt below into Claude Code.

+ +
Read CLAUDE.md and tests/test_pipeline.py. Implement the bronze +ingest function in src/bronze/ to make the failing test pass. + +Rules: +- Use PySpark, not pandas +- Use @dp.table decorator for Lakeflow Declarative Pipelines +- Read data from the ABS Retail Trade SDMX API +- Store raw data with original column names +- Add _ingested_at timestamp column +- Run pytest tests/test_pipeline.py -k "bronze" -x after implementation
+ +
+
+

What to Watch For

+
    +
  • Does the agent read CLAUDE.md first? If not, tell it: "Read CLAUDE.md first, then try again."
  • +
  • Does it use PySpark? If it reaches for pandas, steer it back.
  • +
  • Does it run the test? The agent should run pytest automatically and iterate until green. If not, tell it: "Run pytest tests/test_pipeline.py -k bronze -x and show me the output."
  • +
+
+
+

Expected Outcome

+
+

After ~15 minutes you should see:

+
    +
  • src/bronze/ingest.py — Bronze ingest function
  • +
  • tests/test_pipeline.py — Passing bronze test
  • +
  • pytest output showing green
  • +
+
+
Stuck? Tell the agent: "Read the test failure message carefully and fix only the failing assertion."
+
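For reference, the implementation the agent converges on usually has this shape. This is a sketch only: the URL is truncated, and note that plain spark.read.csv cannot fetch HTTP URLs directly, so the agent may download the CSV with requests first.

import databricks.declarative_pipelines as dp
from pyspark.sql import functions as F

ABS_RETAIL_URL = "https://data.api.abs.gov.au/data/ABS,RT,1.0.0/..."  # truncated; use the full SDMX query

@dp.table
@dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL")
def abs_retail_trade_bronze():
    # Raw columns preserved as-is, plus an audit timestamp
    return (spark.read.csv(ABS_RETAIL_URL, header=True)   # per the lab prompt; the agent may fetch via requests first
            .withColumn("_ingested_at", F.current_timestamp()))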
+
+
+ +
+ + + + +
+ +
+
+ Before moving on, verify your team has all three pieces in place: +
+ +
+
+

1. CLAUDE.md

+
    +
  • Team name and schema present
  • +
  • Tech stack with PySpark constraint
  • +
  • Data standards section
  • +
  • Rules section (BDD, minimal)
  • +
  • Project structure defined
  • +
  • Verify: Agent can read it back and list the rules
  • +
+
+
+

2. Passing Test

+
    +
  • Test uses Given-When-Then structure
  • +
  • Concrete values (real state codes, dates)
  • +
  • Multiple assertions (count, nulls)
  • +
  • Descriptive test name
  • +
  • pytest -k "bronze" -x passes
  • +
  • Verify: pytest output shows green — not just "I think it passes"
  • +
+
+
+

3. Bronze Ingest

+
    +
  • Uses PySpark (not pandas)
  • +
  • Has @dp.table decorator
  • +
  • Reads from data source
  • +
  • Adds _ingested_at timestamp
  • +
  • All bronze tests green
  • +
  • Verify: Run databricks bundle validate to verify pipeline config
  • +
+
+
+ +
Falling behind? No shame in using checkpoints. Tell the agent: "Copy the checkpoint tables from workshop_vibe_coding.checkpoints into my schema workshop_vibe_coding.<team_schema>"
+ +
You've just completed the full test-first cycle: spec (CLAUDE.md) → failing test → implementation → green. This is the pattern for the rest of the day.
+
+ +
+ + + + + + + + +
+

Skills, MCP & Beyond

+

Extending agents from code generators to system operators

+
Reference for Lab 1 (12:00 PM) & Lab 2 (2:00 PM)
+
+ + + + +
+ +
+

Skills are curated markdown files triggered by slash commands. Remember Rule #1: when you say something useful, save it. A skill is just that — instructions you don't want to repeat, saved as markdown.

+ +
+
+

How Skills Work

+

A skill is a Markdown file that defines a multi-step workflow. When you type /commit, the agent:

+
    +
  1. Reads git status and diff
  +
  2. Analyzes changes and drafts a commit message
  +
  3. Stages files and creates the commit
  +
  4. Runs git status to verify success
  +
+ +

Built-In Examples

+
# Common skills +/commit # Smart git commit with message +/review-pr # Review a pull request +/test # Run tests intelligently + +# Custom skills you can create +/deploy-dab # Validate + deploy DABs bundle +/check-data # Query tables, verify counts
+
+
+

Why Skills Matter

+
+

Automate Repetitive Patterns

+

Workflows you do 10x/day become one command. No re-explaining each time.

+
+
+

Encode Team Conventions

+

Your team's commit message format, deploy steps, and review checklist — codified once, used by everyone.

+
+
+

Reduce Context Usage

+

A skill runs a pre-defined workflow without you needing to type out multi-step instructions each time.

+
+ +
Workshop tip: You can create custom skills during the labs. Think about which multi-step workflows you repeat most often.
+
How powerful can a markdown skill be? deathbyclawd.com tracks which SaaS products can be replaced by one.
+
+
+
+ +
+ + + + +
+ +
+

Model Context Protocol (MCP) is a standard protocol for connecting AI agents to external tools and data sources. One protocol, every tool connects — like USB-C for AI.

+ +
+
Claude Agent
+
+
MCP Protocol
+
+
Any Tool
+
+ +

Three Types of MCP Servers

+
+
+
🏠
+

Managed (Built-In)

+
Zero config — pre-integrated with Databricks
+
    +
  • Unity Catalog tables & volumes
  • +
  • Vector Search indexes
  • +
  • Genie spaces (NL → SQL)
  • +
  • DBSQL warehouses
  • +
+
+
+
🔌
+

External (via Proxies)

+
Community & vendor integrations
+
    +
  • GitHub (PRs, issues, repos)
  • +
  • Slack (messages, channels)
  • +
  • Glean (internal search)
  • +
  • JIRA (tickets, sprints)
  • +
  • Databricks Docs
  • +
+
+
+
🔧
+

Custom (Org-Specific)

+
Build your own for internal tools
+
    +
  • Wrap internal REST APIs
  • +
  • Data quality workflows
  • +
  • Monitoring & alerting
  • +
  • Host on Databricks Apps
  • +
+
+
+ +
Key benefit: Agents access ANY tool through a standard protocol. Credentials stay secure in Unity Catalog — the agent never sees raw tokens or passwords.
+
+ +
+ + + + +
+ +
+
+
+

Genie: Natural Language on Your Data

+

Business users ask questions in plain English. Genie generates SQL, runs it, and returns results with visualizations.

+
# User asks: +"Which state had the highest food retail + turnover in 2024?" + +# Genie generates: +SELECT state, + SUM(turnover_millions) AS total +FROM gold.retail_summary +WHERE year = 2024 +GROUP BY state +ORDER BY total DESC +LIMIT 1
+

How to set up: Point Genie at your gold tables, add column descriptions, and provide example queries in the instructions.

+
+
+

AI/BI Dashboards

+

Auto-generated dashboards that understand your data. Describe a visualization in natural language and it generates the chart.

+
    +
  • "Show monthly revenue by state as a line chart"
  • +
  • "Compare food CPI across states as a bar chart"
  • +
  • "Display year-over-year growth as a heatmap"
  • +
+
How they complement each other:
+ Dashboards = recurring views, standard KPIs, shared with stakeholders
+ Genie = ad-hoc questions, exploration, self-serve analytics
+

Both feed from the same gold tables — the output of your data pipeline. Good gold tables = good Genie + dashboards.

+
+
+
+ +
+ + + + + + + + +
+ +
+
+
+

The Core Principle

+
+ "Include tests, screenshots, or expected outputs so Claude can check itself." + — Anthropic +
+

There's a difference between an agent that thinks it did the work and one that proves it did. "I implemented the function" is not proof. A green test suite is.

+
Every prompt should answer: "How will the agent verify this worked?"
+
+
+

Six Validation Patterns

+
    +
  1. BDD Gate — scenarios fail → implement steps → behave passes (binary proof)
  +
  2. Separate prompts — write tests in one prompt, implement in another (prevents gaming)
  +
  3. @dp.expect — data quality expectations built into pipeline code (automatic)
  +
  4. Schema contracts — test exact response keys, types, and value ranges, not just status 200 (patterns 3 and 4 are sketched below)
  +
  5. Full suite regression — pytest tests/ -x after every change (the ratchet)
  +
  6. Negative testing — prove error handling works with invalid inputs
  +
+
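Patterns 3 and 4 are the quickest wins. A minimal sketch of each, with illustrative names:

# Pattern 3: quality expectations enforced by the pipeline itself
import databricks.declarative_pipelines as dp

@dp.table
@dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL")   # warn and track
@dp.expect_or_fail("valid_state", "state IS NOT NULL")   # hard-fail the update
def silver_table(): ...

# Pattern 4: schema contract; assert keys, types, and ranges, not just status 200
def test_predict_contract(client):  # client: FastAPI TestClient fixture (illustrative)
    resp = client.post("/predict", json={"state": "New South Wales",
                                         "industry": "Food retailing", "month": "2024-06"})
    assert resp.status_code == 200
    body = resp.json()
    assert set(body) == {"predicted_turnover", "state", "industry"}
    assert body["predicted_turnover"] > 0  # turnover can't be negative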
+
+
+ +
+ + + + +
+ +
+
+
+

Common Pitfalls & Fixes

+
+

Overengineering

+

Symptom: Asked for one function, got an entire module with abstract base classes.

+

Fix: "Keep it minimal. One function, not a framework. No extra features."

+
+
+

Hallucinations

+

Symptom: Agent used a column name that doesn't exist in the data.

+

Fix: "Never speculate about code you have not opened. Read the file first."

+
+
+

Going Off-Rails

+

Symptom: Looked away for 5 minutes, agent rewrote half the project.

+

Fix: Check in every 2-3 tool calls. "Don't change functions that already pass tests."

+
+
+
+

Steering Phrases Cheatsheet

+ + + + + + + + + + + + +
When the agent...Say this
Uses pandas"Rewrite using PySpark. We never use pandas."
Ignores CLAUDE.md"Read CLAUDE.md first, then try again."
Rewrites working code"Don't change functions that already pass tests."
Writes code before tests"Stop. Write the tests first, then implement."
Too complex"Simplify. I just need [specific thing]."
Stuck in a loop"Stop. Let's try a different approach."
Speculates"Read the file first. Don't guess."
Hasn't planned"Stop. Use /plan first. Interview me about what I need."
Claims it works"Prove it. Show me the git diff. Grill me on these changes."
First attempt is mediocre"Knowing everything you know now, scrap this and implement the elegant solution."
+ +
Healthy cadence: Agent makes 2-3 tool calls → You review → Steer if needed → Continue
+ +
Commit as checkpoints: /commit every 15-20 minutes. Commits are your safety net. Esc-Esc cancels a runaway agent; git checkout reverts to your last good state.
+
+
+
+ +
+ + + + +
+ +
+
+
+

Key Commands

+
# Check environment +databricks auth status +claude --version + +# Run tests (always use -x for fail-fast) +pytest tests/ -x +pytest tests/test_pipeline.py -k "bronze" -x +pytest tests/test_pipeline.py -k "silver" -x +pytest tests/test_app.py -x + +# Deploy with Databricks Asset Bundles +databricks bundle validate +databricks bundle deploy -t dev +databricks bundle run grocery-intelligence-TEAM -t dev + +# Deploy web app +cd app && databricks apps deploy \ + --name grocery-app-TEAM \ + --source-code-path ./
+
+
+

Key Files

+ + + + + + + + + +
FilePurpose
CLAUDE.mdAgent instructions (team standards)
tests/conftest.pyPySpark test fixtures
src/bronze/Raw data ingestion
src/silver/Cleaned transformations
src/gold/Aggregated analytics
app/app.pyFastAPI web application
databricks.ymlDABs deployment config
+ +

Workshop Schedule

+ + + + + + + + + +
TimeSession
9:30 AMBlock A: How to Think (theory)
10:45 AMLab 0: Guided Hands-On (all together)
11:30 AMBlock C: Tools for the Labs (theory)
12:00 PMLab 1: Track-specific lab
1:00 PMLunch break
2:00 PMLab 2: Track-specific lab
3:30 PMDemos & wrap-up
+
+
+
+ +
+ + + + +
+ +
+
+

Success Criteria for Agentic Development

+
    +
  1. Write clear specs — CLAUDE.md, tests, and PRDs define what "done" looks like. The agent can only be as good as your specification.
  +
  2. Use Gherkin as executable specs (BDD) — Scenarios are unambiguous. They pass or they fail. No interpretation required. Write them first, always.
  +
  3. Every prompt needs a verification step — a test to run, a command to execute, or an output to check. If Claude can't prove it worked, it didn't.
  +
  4. Manage context windows — Keep CLAUDE.md lean (~50 lines), be specific with requests, use new sessions for new tasks.
  +
  5. Steer early and often — Review every 2-3 tool calls. Short feedback loops produce better results than long unsupervised runs.
  +
+
+ +
Think of yourself as a director, not a typist.
+ +
+
Monday morning action: Create a CLAUDE.md for your team's main repository. Start with tech stack, coding standards, and 5 rules. Iterate from there.
+
The discipline transfers: These practices work with Claude Code, Cursor, Windsurf, GitHub Copilot, and any agentic coding tool. The discipline is the differentiator, not the tool.
+
+
+ +
+ + + + diff --git a/projects/coles-vibe-workshop/track-de.html b/projects/coles-vibe-workshop/track-de.html new file mode 100644 index 0000000..f297802 --- /dev/null +++ b/projects/coles-vibe-workshop/track-de.html @@ -0,0 +1,1703 @@ + + + + + +Data Engineering Track — Coles Vibe Coding Workshop + + + + + + + +
+ + + + + +
+
+
+
+
+
+
+
+
Data Engineering
+ + + + + + + + + + + + + + + + + +

Data Engineering Track

+

Coles × Databricks Vibe Coding Workshop

+

Build a Lakeflow Declarative Pipeline with BDD

+
+ + + + +
+ +
+
You've already completed Lab 0 — your CLAUDE.md and bronze test are ready. Your project is set up, environment is verified, and you've seen the agent pass its first test. Now extend into your track.
+ +

Step 1: Add DE-Specific Instructions

+

Append the Data Engineering track context to your existing CLAUDE.md:

+
cd grocery-intelligence + +# Append DE-specific instructions +cat ~/starter-kit/CLAUDE-de.md >> CLAUDE.md + +# Copy additional test stubs for full pipeline +cp ~/starter-kit/test_pipeline.py tests/
+ +

Step 2: Explore the Data

+

All tracks use the same data. Paste this into Claude Code to get a feel for it:

+
Query these tables and show me 5 rows from each: +- workshop_vibe_coding.checkpoints.retail_summary +- workshop_vibe_coding.checkpoints.food_inflation_yoy + +Tell me: what columns are available, what date range is covered, and which states are included.
+
+ +
+ + + + +
+ +
+

Build a data pipeline that ingests public Australian retail data, transforms it through a medallion architecture, and produces analytics-ready gold tables. You will direct an AI agent to build it using BDD.

+ +
+
Bronze
Raw API data
+ +
Silver
Cleaned & decoded
+ +
Gold
Aggregated analytics
+
+ +

Data Sources

+ + + + +
SourceAPI EndpointWhat It Contains
ABS Retail Tradedata.api.abs.gov.au/data/ABS,RT,1.0.0/...Monthly retail turnover by state & industry since 2010
ABS Consumer Price Indexdata.api.abs.gov.au/data/ABS,CPI,2.0.0/...Quarterly food price indices by state since 2010
+ +

Project Structure

+
+
grocery-intelligence/ +├── CLAUDE.md +├── databricks.yml +├── src/ +│ ├── bronze/ +│ │ ├── abs_retail_trade.py +│ │ └── abs_cpi_food.py +│ ├── silver/ +│ │ ├── retail_turnover.py +│ │ └── food_price_index.py +│ └── gold/ +│ ├── retail_summary.py +│ └── food_inflation.py +├── resources/ +│ └── pipeline.yml +└── tests/ + ├── conftest.py + └── test_pipeline.py
+
+

Code Mappings (Silver Layer)

+

Regions: 1=NSW, 2=VIC, 3=QLD, 4=SA, 5=WA, 6=TAS, 7=NT, 8=ACT

+

Industries: 20=Food retailing, 41=Clothing, 42=Department stores, 43=Other, 44=Cafes/restaurants, 45=Household goods

+

CPI Index: 10001=All groups CPI, 20001=Food & non-alcoholic beverages

+
Key rule: Always use PySpark, never pandas. Add this to your CLAUDE.md if the agent forgets.
+
+
+
+ +
+ + + + +
+
01
+

Lab 1: Build Your Pipeline

+

BDD → Bronze → Silver → Gold → Deploy

+ 12:00 PM / 60 minutes +
+ + + + +
+ +
+ Step 1.1 — Explore the Data + +
+ Team Tasks: + Person A (Terminal) — Run data exploration prompt below  |  + Person B (Terminal) — Set up CLAUDE.md with team name/schema  |  + Person C (Databricks UI) — Verify Unity Catalog schema, check checkpoint tables +
+ +

Look at the raw data before writing tests — you need to know the column names and types. Paste this into Claude Code:

+ +
Fetch a sample of the ABS Retail Trade API: +https://data.api.abs.gov.au/data/ABS,RT,1.0.0/M1.20+41+42+43+44+45.20.1+2+3+4+5+6+7+8.M?format=csv&startPeriod=2024-01&endPeriod=2024-03 + +Show me the columns, data types, and a few sample rows. +Also fetch a sample of the ABS CPI Food API: +https://data.api.abs.gov.au/data/ABS,CPI,2.0.0/1.10001+20001.10.1+2+3+4+5+6+7+8.Q?format=csv&startPeriod=2024-Q1&endPeriod=2024-Q4 + +Show me the same for this one.
+ +
Why explore first? The agent needs to see the actual column names, data types, and value formats before you can write meaningful tests. This 2-minute step saves 10 minutes of debugging later.
+
+ +
+ + + + +
+ +
+ Step 1.2 — Write Pipeline Tests +

Tell the agent to write tests covering every layer. Do NOT implement anything yet.

+ +
Create pytest tests for a Lakeflow Declarative Pipeline with these transformations: + +1. test_bronze_retail_trade: + - Raw CSV data is ingested with all original columns + - Non-null TIME_PERIOD and OBS_VALUE columns + - Test: given sample CSV rows, bronze table has correct schema + +2. test_silver_retail_turnover: + - REGION codes decoded to state names (1=NSW, 2=VIC, 3=QLD, etc.) + - INDUSTRY codes decoded to readable names (20=Food retailing, etc.) + - TIME_PERIOD parsed to proper date column + - OBS_VALUE renamed to turnover_millions + - Test: given bronze rows with codes, silver rows have readable names + +3. test_gold_retail_summary: + - Adds 3-month and 12-month rolling averages + - Adds year-over-year growth percentage + - Test: given 24 months of silver data, gold has correct rolling averages + +4. test_bronze_cpi_food: + - Raw CPI CSV data ingested with all columns + - Non-null TIME_PERIOD and OBS_VALUE + - Test: correct schema + +5. test_silver_food_price_index: + - REGION codes decoded to state names + - INDEX codes decoded (10001=All groups CPI, 20001=Food and non-alcoholic beverages) + - OBS_VALUE renamed to cpi_index + - Test: codes correctly decoded + +6. test_gold_food_inflation: + - Calculates year-over-year CPI change percentage + - Test: given 8 quarters of data, YoY change is correct + +Write ONLY the tests. Do NOT implement the functions yet. +Use PySpark test fixtures with small DataFrames (5-10 rows each).
+
+ +
+ + + + +
+ +
+ Step 1.3 — Review Before Moving On + +

Read through the generated tests and verify:

+
    +
  • Do they capture your transformation logic?
  • +
  • Are the test data realistic (real state codes, valid date formats)?
  • +
  • Are edge cases covered (nulls, missing periods)?
  • +
+ +
Edit or ask the agent to adjust before moving on. Tests are your spec — the agent will use them as its target.
+ +

What Good Tests Look Like

+
+
+

Bronze Test Example

+
def test_bronze_retail_trade(spark): + # Given: raw CSV rows + raw = spark.createDataFrame([ + ("M1", "20", "1", + "2024-01", 1234.5), + ], ["MEASURE", "INDUSTRY", + "REGION", "TIME_PERIOD", + "OBS_VALUE"]) + + # Then: schema matches expected + assert "TIME_PERIOD" in raw.columns + assert "OBS_VALUE" in raw.columns + assert raw.filter( + "TIME_PERIOD IS NULL" + ).count() == 0
+
+
+

Silver Test Example

+
def test_silver_retail_turnover(spark): + # Given: bronze with codes + bronze = spark.createDataFrame([ + ("1", "20", + "2024-01", 1234.5), + ], ["REGION", "INDUSTRY", + "TIME_PERIOD", "OBS_VALUE"]) + + # When: transform applied + result = transform_retail(bronze) + + # Then: codes decoded + row = result.first() + assert row["state"] == \ + "New South Wales" + assert row["industry"] == \ + "Food retailing" + assert "turnover_millions" \ + in result.columns
+
+
+
+ +
+ + + + +
+ +
+
+ Person A: Build retail trade bronze table  |  + Person B: Build CPI food bronze table  |  + Person C: Monitor Unity Catalog for new tables, prepare checkpoint fallback +
+ + Step 2.1 — Create Pipeline Structure + +
Create a Lakeflow Declarative Pipeline project with: +- src/bronze/abs_retail_trade.py - Ingest ABS Retail Trade API to bronze table +- src/bronze/abs_cpi_food.py - Ingest ABS CPI Food API to bronze table +- src/silver/ (empty for now) +- src/gold/ (empty for now) +- resources/pipeline.yml - Lakeflow pipeline definition +- databricks.yml - DABs deployment config +- tests/ (our tests from Phase 1) + +For bronze tables, use @dp.table decorator with data quality expectations: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") + +Use spark.read.csv() to fetch from the API URLs. +Unity Catalog target: workshop_vibe_coding.<team_schema>
+ + Step 2.2 — Run Bronze Tests +
# Tell the agent: +Run the bronze tests. Fix any failures.
+

Watch the agent iterate: read test output → fix code → re-run → repeat until green.

+ +
Stuck? If API calls are failing, grab Checkpoint 1A: pre-loaded bronze tables. Tell the agent: "Use the pre-loaded tables in workshop_vibe_coding.checkpoints instead of calling the API. Copy them to our schema."
+
+ +
+ + + + +
+ +
+
+ Person A: Build silver retail_turnover + gold retail_summary  |  + Person B: Build silver food_price_index + gold food_inflation  |  + Person C: Monitor tests, review gold data as it appears +
+ + Step 3.1 — Build Silver Transformations + +
Implement the silver layer to make the silver tests pass: + +1. src/silver/retail_turnover.py + - @dp.table that reads from bronze retail trade + - Decode REGION codes to state names: 1=New South Wales, 2=Victoria, + 3=Queensland, 4=South Australia, 5=Western Australia, 6=Tasmania, + 7=Northern Territory, 8=Australian Capital Territory + - Decode INDUSTRY codes: 20=Food retailing, 41=Clothing/footwear/personal, + 42=Department stores, 43=Other retailing, 44=Cafes/restaurants/takeaway, + 45=Household goods retailing + - Parse TIME_PERIOD to date, extract month/year/quarter + - Rename OBS_VALUE to turnover_millions + +2. src/silver/food_price_index.py + - @dp.table that reads from bronze CPI + - Decode REGION and INDEX codes + - Rename OBS_VALUE to cpi_index + +Run the silver tests after implementation.
+
+ +
+ + + + +
+ +
+ Step 3.2 — Build Gold Materialized Views + +
Implement the gold layer to make the gold tests pass: + +1. src/gold/retail_summary.py + - @dp.materialized_view + - Join with silver retail_turnover + - Add 3-month rolling average (turnover_3m_avg) + - Add 12-month rolling average (turnover_12m_avg) + - Add year-over-year growth percentage (yoy_growth_pct) + +2. src/gold/food_inflation.py + - @dp.materialized_view + - Calculate year-over-year CPI change percentage (yoy_change_pct) + +Run ALL tests (bronze + silver + gold). Everything should be green.
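One plausible shape for the quarterly YoY calculation; a sketch only, since the agent may structure it differently and the column names here are illustrative:

from pyspark.sql import Window
from pyspark.sql import functions as F

# CPI is quarterly, so "one year ago" is 4 rows back within each state/index series
w = Window.partitionBy("state", "index_name").orderBy("period")  # period: quarter date, illustrative

food_inflation = silver_food_price_index.withColumn(
    "cpi_lag_4q", F.lag("cpi_index", 4).over(w)
).withColumn(
    "yoy_change_pct",
    (F.col("cpi_index") - F.col("cpi_lag_4q")) / F.col("cpi_lag_4q") * 100)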
+ + Step 3.3 — Verify Your Data +
Query the gold tables and show me: +1. Top 5 states by food retail turnover (latest month) +2. Year-over-year food price inflation by state (latest quarter) +3. The state with the highest retail growth rate
+ +

This is where you check your ice breaker predictions!

+ +
Stuck at 40 minutes? Grab Checkpoint 1B: silver and gold tables pre-loaded in your schema. This ensures you have data for Lab 2.
+
+ +
+ + + + +
+ +
+
+ Person A: Run validate and deploy  |  + Person B: Verify pipeline in Workflows tab  |  + Person C: Query gold tables to check icebreaker predictions +
+ +
+
+ Step 4.1 — Create Pipeline Definition +
Create resources/pipeline.yml that defines a Lakeflow Declarative Pipeline: +- Pipeline name: grocery-intelligence-<team_name> +- Serverless: true +- Libraries: all our src/ notebooks +- Catalog: workshop_vibe_coding +- Schema: <team_schema> + +And update databricks.yml with: +- Dev target using our workshop catalog/schema +- The pipeline resource
+
+
+ Step 4.2 — Validate & Deploy +
Validate the bundle: databricks bundle validate +Deploy to dev: databricks bundle deploy -t dev +Run the pipeline: databricks bundle run grocery-intelligence-<team_name> -t dev
+ + Step 4.3 — Verify in Workspace +
    +
  • Pipeline appears in the Workflows tab
  • +
  • Tables are visible in Unity Catalog
  • +
  • Data quality expectations are passing
  • +
+ +
Stuck? Grab Checkpoint 1C: complete pipeline code and databricks.yml.
+
+
+
+ +
+ + + + +
+
02
+

Lab 2: Extend & Harden

+

New Data Sources • Data Quality • Scheduling

+ 2:00 PM / 60 minutes +
+ + + + +
+ +
+
+ Person A: Write tests for FSANZ ingestion  |  + Person B: Build bronze + silver tables  |  + Person C: Monitor pipeline, verify new tables in Unity Catalog +
+ + Step 1.1 — Write Tests + Build Tables + +
Add a new data source — FSANZ food recalls: + +1. Write tests first: + - test_bronze_food_recalls_schema: has columns (product, category, issue, date, state, url) + - test_bronze_food_recalls_not_null: product and date are never null + - test_silver_food_recalls_dates: date strings parsed to proper DATE type + - test_silver_food_recalls_states: state names normalized to match our state list + +2. Build the tables: + - src/bronze/fsanz_food_recalls.py: @dp.table ingesting FSANZ data + - src/silver/food_recalls.py: @dp.table with cleaned dates, normalized states + - Data source: https://www.foodstandards.gov.au/food-recalls/recalls + - If website is blocked, read from: workshop_vibe_coding.checkpoints.fsanz_food_recalls + +3. Run tests after implementation.
+ +
Stuck? Grab Checkpoint DE-2A: FSANZ bronze + silver tables pre-loaded. If the website is blocked, use the checkpoint table as a source.
+
+ +
+ + + + +
+ +
+
+ Person A: Add data quality expectations across all tables  |  + Person B: Build cross-source gold view  |  + Person C: Verify expectations in pipeline UI, check quality metrics +
+ + Step 2.1 — Add Quality Expectations + +
Add data quality expectations across all pipeline tables: + +Bronze tables: +- @dp.expect("valid_time_period", "TIME_PERIOD IS NOT NULL") +- @dp.expect("valid_obs_value", "OBS_VALUE IS NOT NULL") +- @dp.expect("valid_date_range", "TIME_PERIOD >= '2010-01'") + +Silver tables: +- @dp.expect_or_fail("valid_state", "state IN ('New South Wales','Victoria','Queensland','South Australia','Western Australia','Tasmania','Northern Territory','Australian Capital Territory')") +- @dp.expect("valid_turnover", "turnover_millions > 0") + +Gold tables: +- @dp.expect("valid_yoy", "yoy_growth_pct BETWEEN -100 AND 500") +- @dp.expect("valid_rolling_avg", "turnover_3m_avg > 0") + +Run all tests to verify nothing breaks.
+ +
Tip: Use @dp.expect (warn only) first. Upgrade to @dp.expect_or_fail (hard fail) once you are confident the data is clean.
+
+ +
+ + + + +
+ +
+ Step 2.2 — Build Cross-Source Analysis View + +
Create a cross-source analysis view: +- src/gold/grocery_insights.py: @dp.materialized_view +- Joins retail_summary + food_inflation_yoy + food_recalls (if available) +- Columns: state, month, turnover_millions, yoy_growth_pct, cpi_yoy_change, recall_count +- Join retail (monthly) with CPI (quarterly) on state + quarter +- Left join recalls on state + month (recall_count may be 0)
+ +

Expected Pipeline DAG

+
+
+
abs_retail_trade
+ +
retail_turnover
+ +
retail_summary
+ +
+
+
abs_cpi_food
+ +
food_price_index
+ +
food_inflation
+ +
grocery_insights
+
+
+
fsanz_food_recalls
+ +
food_recalls
+ +
+
+ +
Join strategy: CPI is quarterly, retail is monthly. Join on state + quarter (derive quarter from month). Left join recalls since recall_count may be 0 for most state/month combos.
+
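A sketch of that join logic in PySpark. The DataFrame and column names are illustrative; adapt them to your actual schemas:

from pyspark.sql import functions as F

# Derive a quarter key on the monthly retail side, then join to quarterly CPI
retail_q = retail_summary.withColumn(
    "quarter_key", F.concat_ws("-", F.year("month"), F.quarter("month")))

cpi_q = food_inflation.withColumn(
    "quarter_key", F.concat_ws("-", F.year("period"), F.quarter("period")))  # period: quarter date, illustrative

grocery_insights = (
    retail_q.join(cpi_q.select("state", "quarter_key", "cpi_yoy_change"),
                  ["state", "quarter_key"])
            .join(recall_counts, ["state", "month"], "left")   # left join: most months have no recalls
            .fillna(0, subset=["recall_count"]))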
+ +
+ + + + +
+ +
+
+ Person A: Add cron scheduling, validate, deploy  |  + Person B: Verify pipeline schedule in Workflows tab  |  + Person C: Test full pipeline end-to-end +
+ +
+
+ Step 3.1 — Add Scheduling +
Add cron scheduling to our pipeline: + +1. Update databricks.yml — add trigger: + trigger: + cron: + quartz_cron_expression: "0 0 6 * * ?" + timezone_id: "Australia/Sydney" + +2. Validate: databricks bundle validate +3. Deploy: databricks bundle deploy -t dev +4. Show me the pipeline schedule.
+
+
+ Phase 4 — Verify + Prepare (5 min) +
    +
  • Full pipeline runs with all three sources
  • +
  • Data quality expectations visible and passing in UI
  • +
  • Cross-source grocery_insights view has data
  • +
  • Pipeline schedule is configured
  • +
+ +

Prepare Your Demo

+

For Show & Tell, be ready to show:

+
    +
  1. Pipeline DAG in Databricks Workflows UI
  +
  2. Data quality metrics and expectations
  +
  3. Cross-source gold view query results
  +
  4. Your scheduling configuration
  +
  5. An interesting data insight you discovered
  +
+ +
Running out of time? Grab Checkpoint DE-2C (complete pipeline) or DE-2D (full solution).
+
+
+
+ +
+ + + + +
+ +
+
+
+

Useful Commands

+
# Check Databricks connection +databricks auth status + +# Run specific test groups +pytest tests/test_pipeline.py -k "bronze" -x +pytest tests/test_pipeline.py -k "silver" -x +pytest tests/test_pipeline.py -k "gold" -x + +# Run all tests +pytest tests/ -x + +# Validate DABs bundle +databricks bundle validate + +# Deploy pipeline +databricks bundle deploy -t dev +databricks bundle run \ + grocery-intelligence-TEAM_NAME -t dev
+
+
+

Common Problems

+ + + + + + + + + +
ProblemFix
Agent uses pandasAdd to CLAUDE.md: Always use PySpark, never pandas
SparkSession errorsCheck conftest.py has SparkSession.builder.master("local[*]")
ABS API timeoutUse checkpoint tables instead of API
@dp.table not foundUse import databricks.declarative_pipelines as dp
Can't write to UCCheck schema: workshop_vibe_coding.<team>
Deploy failsRun databricks bundle validate first
Agent rewrites working codeSay: "Don't change passing tests"
+
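The conftest fix from the table above usually amounts to a session-scoped fixture like this (a minimal sketch):

# tests/conftest.py
import pytest
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark():
    # Local-mode Spark so tests run without a cluster
    session = (SparkSession.builder
               .master("local[*]")
               .appName("grocery-intelligence-tests")
               .getOrCreate())
    yield session
    session.stop()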
+
+
+ +
+ + + + +
+ +
+
+
+

Checkpoint Recovery

+

If you are stuck and need to catch up, copy from checkpoint tables:

+
-- Checkpoint 1A: Bronze tables +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA + .abs_retail_trade_bronze + AS SELECT * FROM + workshop_vibe_coding.checkpoints + .abs_retail_trade_bronze; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA + .abs_cpi_food_bronze + AS SELECT * FROM + workshop_vibe_coding.checkpoints + .abs_cpi_food_bronze; + +-- Checkpoint 1B: Silver + Gold +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA + .retail_turnover + AS SELECT * FROM + workshop_vibe_coding.checkpoints + .retail_turnover; + +CREATE TABLE workshop_vibe_coding.TEAM_SCHEMA + .retail_summary + AS SELECT * FROM + workshop_vibe_coding.checkpoints + .retail_summary; + +-- (same pattern for food_price_index, +-- food_inflation_yoy)
+
+
+

Steering the Agent

+ + + + + + + + +
When the agent...Say this
Writes too much code"Keep it simple. One function, minimal code."
Ignores CLAUDE.md"Read CLAUDE.md first, then try again."
Gets stuck in a loop"Stop. Let's take a different approach."
Makes it overly complex"Simplify. I just need [specific thing]."
Writes code before tests"Stop. Write the tests first."
Goes off track"Stop, let's go back to the failing tests."
+ +

Pro Tips

+
    +
  • Rotate the driver every 20 min so everyone gets hands-on time
  • +
  • Use /commit skill to commit with good messages
  • +
  • Say "show me the test output" to see exactly what is failing
  • +
  • Say "explain what this function does" to verify agent logic
  • +
  • If agent uses pandas: say "use PySpark, not pandas" and add it to CLAUDE.md
  • +
+
+
+
+ +
+ + + + +
+ +
+
+
+
+

Lab 1 Success Criteria

+
    +
  • Tests written BEFORE implementation
  • +
  • All tests pass (bronze, silver, gold)
  • +
  • Pipeline uses @dp.table and @dp.materialized_view
  • +
  • Data quality expectations with @dp.expect()
  • +
  • Gold tables have rolling averages and YoY metrics
  • +
  • Deployed as a Lakeflow pipeline via DABs
  • +
  • Can answer ice breaker predictions from the data
  • +
+
+ +
+

Lab 2 Success Criteria

+
    +
  • FSANZ food recalls ingested (bronze + silver)
  • +
  • Data quality expectations on all tables
  • +
  • Cross-source gold view joining retail + CPI + recalls
  • +
  • Pipeline scheduled with cron trigger
  • +
  • All tests pass including new data source
  • +
  • Ready for 3-minute demo
  • +
+
+
+
+

Show & Tell: What to Demo (3 min)

+
    +
  1. Pipeline DAG — Show the full pipeline graph in Databricks Workflows with all three data sources flowing through bronze → silver → gold
  +
  2. Data Quality — Show expectations passing in the pipeline UI. Highlight any @dp.expect_or_fail rules
  +
  3. Cross-Source Insights — Query grocery_insights and show how retail turnover, CPI inflation, and food recalls connect
  +
  4. A Data Insight — Share the most interesting thing you found (e.g., which state has highest inflation, retail growth trends)
  +
+ +

Reflection Questions

+
    +
  1. How much code did you write vs. the agent?
  +
  2. Where did BDD help the most?
  +
  3. What was your most interesting data insight?
  +
  4. How do data quality expectations change your confidence?
  +
  5. Were your ice breaker predictions right?
  +
+
+
+
+ +
+ + + + diff --git a/projects/coles-vibe-workshop/track-ds.html b/projects/coles-vibe-workshop/track-ds.html new file mode 100644 index 0000000..a174ffc --- /dev/null +++ b/projects/coles-vibe-workshop/track-ds.html @@ -0,0 +1,1689 @@ + + + + + +Data Science Track — Coles Vibe Coding Workshop + + + + + + + +
+ + + + + +
+
+
+
+
+
+
+
+
Data Science Track
+ + + + + + + + + + + + + + + + + +

Feature Engineering,
MLflow & Model Serving

+

Coles × Databricks Vibe Coding Workshop

+

Build ML models from Australian retail data — directed entirely by AI agents

+
+ COLES GROUP + + DATABRICKS +
+
+ + + + +
+ +
+ Post Lab 0 / 5 min + +
+ You already completed Lab 0 (guided hands-on) — you have a working CLAUDE.md, your environment is verified, and your bronze test passes. This slide gets you from there to the DS track. +
+ +
+
+

1. Add DS Track Instructions

+

From your grocery-intelligence project directory:

+ +

Append DS track context to CLAUDE.md:

+
cat ~/starter-kit/CLAUDE-ds.md >> CLAUDE.md
+ +

Copy DS test fixtures + stubs:

+
cp ~/starter-kit/conftest.py tests/ +cp ~/starter-kit/test_features.py tests/
+ +
+ Don't skip this: The DS-specific CLAUDE.md section tells the agent to use PySpark (not pandas) and sets up MLflow conventions. +
+
+ +
+

2. Explore the Gold Tables

+

All tracks use the same gold tables. Paste this into Claude Code:

+
Paste into Claude Code
+
Query these tables and show me 5 rows from each: +- workshop_vibe_coding.checkpoints.retail_summary +- workshop_vibe_coding.checkpoints.food_inflation_yoy + +Tell me: what columns are available, what date +range is covered, and which states are included.
+ +
+ Tip: Understanding the data shape now saves time in Phase 1. Note which columns you'll use for features. +
+
+
+
+ +
+ + + + +
+
Lab 1
+

Features & Experiments

+

Build a feature engineering pipeline from gold tables and track experiments with MLflow

+
12:00 PM / 60 Minutes • 4 phases
+
+ + + + +
+ +
+ Phase 1 / 15 min + +
+
+

Person A (Terminal)

+

Query gold tables, understand distributions and patterns

+
+
+

Person B (Terminal)

+

Write pytest tests for feature engineering functions

+
+
+

Person C (Databricks UI)

+

Create MLflow experiment, verify tracking works

+
+
+ +

1.1 Explore the Gold Tables

+
Paste into Claude Code
+
Query these tables and show me a comprehensive analysis: + +1. workshop_vibe_coding.TEAM_SCHEMA.retail_summary: + - Row count, date range, distinct states, distinct industries + - Summary statistics for turnover_millions (min, max, mean, stddev) + - Top 5 state-industry combinations by average turnover + +2. workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy: + - Row count, date range, distinct states + - Summary statistics for yoy_change_pct + - States with highest and lowest inflation + +Show me the results as tables.
+
+ +
+ + + + +
+ +
+ Phase 1 / 15 min + +

1.2 Write Tests First (BDD)

+
Paste into Claude Code
+
Create pytest tests for feature engineering in tests/test_features.py. +Use the fixtures from tests/conftest.py. + +Write these tests: + +1. test_create_lag_features: + - Given 24 months of data for one state/industry + - Creates turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + - lag_1m equals previous month's value + - First 12 rows have null lag_12m (expected) + +2. test_create_seasonal_features: + - Adds month_of_year (1-12), quarter (1-4), is_december (boolean), is_q4 (boolean) + - is_december is True only for month 12 + +3. test_create_growth_features: + - Adds turnover_mom_growth and turnover_yoy_growth + - MoM growth = (current - previous) / previous * 100 + - YoY growth = (current - 12m_ago) / 12m_ago * 100 + +4. test_feature_table_schema: + - Output has all expected columns + - Key columns (state, industry, month, turnover_millions) have no nulls + +Write ONLY the tests. Do NOT implement the functions yet. +Use PySpark test fixtures with small DataFrames.
+ +
+ Starter Kit: Test stubs are pre-loaded at tests/test_features.py. Prompt files at starter-kit/prompts/ds/01-explore-gold.md and ds/02-feature-engineering.md. +
+
+ +
+ + + + +
+ +
+ Phase 2 / 20 min + +
+
+

Person A (Terminal)

+

Build lag and seasonal feature functions

+
+
+

Person B (Terminal)

+

Build growth rate features, combine into feature table

+
+
+

Person C (Databricks UI)

+

Run EDA: distributions, correlations, trends

+
+
+ +

2.1 Build the Feature Pipeline

+
Paste into Claude Code
+
Create a feature engineering pipeline that reads from our gold tables +and produces a feature table. + +1. Lag features from retail_summary: + - turnover_lag_1m, turnover_lag_3m, turnover_lag_6m, turnover_lag_12m + - Use PySpark Window functions partitioned by state + industry, + ordered by month + +2. Seasonal features: + - month_of_year (1-12), quarter (1-4), is_december (boolean), + is_q4 (boolean) + - Extract from the month date column + +3. Growth rate features: + - turnover_mom_growth: month-over-month growth percentage + - turnover_yoy_growth: year-over-year growth percentage + - cpi_yoy_change: join with food_inflation_yoy on state + quarter + +4. Write the combined feature table to: + workshop_vibe_coding.TEAM_SCHEMA.retail_features + +Run tests after implementation. Handle nulls in lag features +(first N rows will be null — filter them out in the final table).
+
+ +
+ + + + +
+ +
+
+
+

Lag Features (PySpark Window)

+
from pyspark.sql import Window +from pyspark.sql import functions as F + +window = Window.partitionBy("state", "industry") \ + .orderBy("month") + +df = df.withColumn( + "turnover_lag_1m", + F.lag("turnover_millions", 1).over(window) +).withColumn( + "turnover_lag_3m", + F.lag("turnover_millions", 3).over(window) +).withColumn( + "turnover_lag_12m", + F.lag("turnover_millions", 12).over(window) +)
+ +

Seasonal Features

+
df = df.withColumn( + "month_of_year", F.month("month") +).withColumn( + "quarter", F.quarter("month") +).withColumn( + "is_december", + F.month("month") == 12 +).withColumn( + "is_q4", + F.quarter("month") == 4 +)
+
+ +
+

Growth Features

+
# Month-over-month growth +df = df.withColumn( + "turnover_mom_growth", + (F.col("turnover_millions") + - F.lag("turnover_millions", 1) + .over(window)) + / F.lag("turnover_millions", 1) + .over(window) * 100 +) + +# Year-over-year growth +df = df.withColumn( + "turnover_yoy_growth", + (F.col("turnover_millions") + - F.lag("turnover_millions", 12) + .over(window)) + / F.lag("turnover_millions", 12) + .over(window) * 100 +)
+ +
+ Common mistake: The agent may try to use pandas. If it does, say: "Use PySpark Window functions, not pandas. Check CLAUDE.md." +
+ +
+ Stuck at 35 minutes? Grab Checkpoint DS-1B: pre-built feature table in your schema. +
+
+
+
+ +
+ + + + +
+ +
+ Phase 3 / 15 min + +
+
+

Person A (Terminal)

+

Log feature engineering run to MLflow with params, metrics, artifacts

+
+
+

Person B (Terminal)

+

Create and log visualizations (correlation heatmap, trend plots)

+
+
+

Person C (Databricks UI)

+

Review experiment in MLflow UI, compare runs, tag experiment

+
+
+ +

3.1 Track Your Experiments

+
Paste into Claude Code
+
Set up MLflow experiment tracking for our feature engineering: + +1. Create an MLflow experiment named "grocery-features-TEAM_NAME" + +2. Log a run with: + - Parameters: number of features, date range, number of states + - Metrics: feature table row count, null percentage per feature + - Tags: team_name, track="data_science", phase="feature_engineering" + - Artifacts: save a feature summary CSV showing stats per state + +3. Create and log visualizations: + - A correlation heatmap of the numeric features (save as PNG) + - A time series plot of turnover trends for top 3 states (save as PNG) + +Use mlflow.log_param(), mlflow.log_metric(), mlflow.log_artifact(). +Show me the MLflow experiment URL when done.
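The MLflow calls behind that prompt are a handful of one-liners. A condensed sketch; the experiment path, counts, and file name are illustrative:

import mlflow

mlflow.set_experiment("/Users/you@example.com/grocery-features-TEAM_NAME")  # illustrative path

with mlflow.start_run(run_name="feature_engineering") as run:
    mlflow.set_tags({"team_name": "TEAM_NAME", "track": "data_science",
                     "phase": "feature_engineering"})
    mlflow.log_param("n_features", 11)
    mlflow.log_metric("row_count", features_df.count())   # features_df: the Spark feature table
    mlflow.log_artifact("feature_summary.csv")            # CSV written to disk beforehand
    print(f"Run ID: {run.info.run_id}")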
+ +
+ Starter Kit: Prompt file at starter-kit/prompts/ds/03-mlflow-experiment.md +
+
+ +
+ + + + +
+ +
+ Phase 4 / 5 min + +
+
+

Verification Checklist

+
+

Before moving on, confirm:

+
    +
  • Feature table has lag, seasonal, and growth features
  • +
  • All feature engineering tests pass
  • +
  • MLflow experiment logged with parameters, metrics, artifacts
  • +
  • At least one visualization logged as artifact
  • +
  • Feature table accessible in Unity Catalog
  • +
+
+ +
+ Running behind? Grab Checkpoint DS-1C (feature table + MLflow experiment) and move to Lab 2. +
+
+ +
+

Show & Tell Prep

+

Prepare to share with the group:

+
    +
  1. Feature table — show in Unity Catalog browser
  +
  2. MLflow experiment — show runs, metrics, artifacts
  +
  3. Key insight — a pattern or surprise in the data
  +
+ +

Reflection Questions

+
    +
  • Which features do you think will be most predictive for forecasting?
  • +
  • How did MLflow help organize your experimentation?
  • +
  • What additional data sources would improve your features?
  • +
  • Were there any surprising patterns in the data?
  • +
+
+
+
+ +
+ + + + +
+
Lab 2
+

Train, Serve & Deploy

+

Train a forecasting model, register in MLflow, serve as an endpoint, and build a prediction app

+
2:00 PM / 60 Minutes • 4 phases
+
+ + + + +
+ +
+ Phase 1 / 15 min + +
+
+

Person A (Terminal)

+

Write tests: input schema, positive predictions, R² > 0.5

+
+
+

Person B (Terminal)

+

Implement training script: read features, train, log to MLflow

+
+
+

Person C (Databricks UI)

+

Verify Model Serving permissions, check Model Registry

+
+
+ +

1.1 Write Model Tests

+
Paste into Claude Code
+
Write pytest tests for model training in tests/test_model.py: + +1. test_model_predictions_positive: + all predictions are positive (turnover can't be negative) +2. test_model_r2_score: + R² > 0.5 on test set +3. test_model_logged_to_mlflow: + run has model artifact, R², MAE, RMSE metrics + +Write ONLY tests. Do NOT implement yet.
+
+ +
+ + + + +
+ +
+

1.2 Train a Retail Turnover Forecasting Model

+
Paste into Claude Code
+
Train a retail turnover forecasting model: + +1. Read feature table: workshop_vibe_coding.TEAM_SCHEMA.retail_features +2. Target: turnover_millions +3. Features: all lag, seasonal, and growth columns +4. Split: 80% train / 20% test (split by date, not random) +5. Try both RandomForestRegressor and XGBRegressor +6. Use mlflow.sklearn.autolog() or mlflow.xgboost.autolog() +7. Log both models, compare R², MAE, RMSE +8. Print which model performed better + +Run tests after training.
+ +
+
+

Key Concepts

+
    +
  • Time-based split: Train on older data, test on recent — never random for time series (see the sketch below)
  • +
  • autolog(): Automatically captures parameters, metrics, and model artifacts
  • +
  • Two models: RandomForest vs XGBoost — compare and pick the best
  • +
+
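A minimal sketch of the time-based split plus autolog, assuming an illustrative cutoff date. Features come from Spark, but model fitting happens on the driver, so converting to pandas at this step is expected:

import mlflow
from sklearn.ensemble import RandomForestRegressor

CUTOFF = "2023-06-01"  # illustrative: train on older months, test on newer ones

train = features_df.filter(f"month < '{CUTOFF}'").toPandas()
test = features_df.filter(f"month >= '{CUTOFF}'").toPandas()

feature_cols = ["turnover_lag_1m", "turnover_lag_3m", "turnover_lag_12m",
                "month_of_year", "quarter", "is_december", "is_q4",
                "turnover_mom_growth", "turnover_yoy_growth"]

mlflow.sklearn.autolog()  # captures params, metrics, and the model artifact
with mlflow.start_run(run_name="random_forest"):
    model = RandomForestRegressor(n_estimators=200, random_state=42)
    model.fit(train[feature_cols], train["turnover_millions"])
    mlflow.log_metric("test_r2", model.score(test[feature_cols], test["turnover_millions"]))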
+
+
+ Tip: If R² is low, try adding more features or tuning hyperparameters. XGBoost often outperforms RandomForest on tabular data. +
+
+ Starter Kit: Prompt file at starter-kit/prompts/ds/04-train-model.md +
+
+
+
+ +
+ + + + +
+ +
+ Phase 2 / 20 min + +
+
+

Person A (Terminal)

+

Register best model in MLflow Model Registry

+
+
+

Person B (Terminal)

+

Create Model Serving endpoint, test with sample request

+
+
+

Person C (Terminal)

+

Write tests for serving endpoint response schema

+
+
+ +

2.1 Register the Best Model

+
Paste into Claude Code
+
Register the best model from our experiment: + +1. Find the best run (highest R²) from our MLflow experiment +2. Register as: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model +3. Add description: "Retail turnover forecasting model for Australian states" +4. Set alias "production" on the latest version
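Under the hood this maps to a few registry calls. A sketch; best_run_id is whatever run won on R²:

import mlflow
from mlflow import MlflowClient

mlflow.set_registry_uri("databricks-uc")  # register into Unity Catalog

model_name = "workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model"
mv = mlflow.register_model(f"runs:/{best_run_id}/model", model_name)

client = MlflowClient()
client.update_registered_model(
    model_name, description="Retail turnover forecasting model for Australian states")
client.set_registered_model_alias(model_name, "production", mv.version)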
+
+ +
+ + + + +
+ +
+

2.2 Create a Model Serving Endpoint

+
Paste into Claude Code
+
Create a Model Serving endpoint: + +1. Name: grocery-forecast-TEAM_NAME +2. Model: workshop_vibe_coding.TEAM_SCHEMA.retail_forecast_model + (production alias) +3. Serverless endpoint +4. Wait for it to be ready (may take 5-10 minutes) +5. Test with a sample request: + {"dataframe_records": [{"turnover_lag_1m": 4500, + "turnover_lag_3m": 4400, "turnover_lag_12m": 4200, + "month_of_year": 3, "quarter": 1, "is_december": false, + "is_q4": false, "turnover_mom_growth": 2.3, + "turnover_yoy_growth": 7.1}]} + +Show me the prediction response.
+ +
+
+
+ Endpoint provisioning: Model Serving endpoints take 5-10 minutes to become ready. Start this early and work on something else while waiting. +
+
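Once the endpoint is ready you can also call it directly, outside the agent. A sketch using the standard invocations path, with host and token taken from the environment:

import os
import requests

resp = requests.post(
    f"{os.environ['DATABRICKS_HOST']}/serving-endpoints/grocery-forecast-TEAM_NAME/invocations",
    headers={"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"},
    json={"dataframe_records": [{
        "turnover_lag_1m": 4500, "turnover_lag_3m": 4400, "turnover_lag_12m": 4200,
        "month_of_year": 3, "quarter": 1, "is_december": False, "is_q4": False,
        "turnover_mom_growth": 2.3, "turnover_yoy_growth": 7.1}]},
)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}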
+
+
+ Stuck at 25 minutes? Grab Checkpoint DS-2B: pre-registered model + working endpoint. +
+
+ Starter Kit: Prompts at starter-kit/prompts/ds/05-register-model.md and ds/06-serve-model.md +
+
+
+
+ +
+ + + + +
+ +
+ Phase 3 / 15 min + +
+
+

Person A (Terminal)

+

Build FastAPI backend with /predict endpoint

+
+
+

Person B (Terminal)

+

Build HTML + Tailwind frontend with prediction form

+
+
+

Person C (Databricks UI)

+

Test end-to-end flow, verify predictions

+
+
+ +

3.1 Build the Prediction Web App

+
Paste into Claude Code
+
Build a prediction web app: + +1. FastAPI backend (app/app.py): + - GET /health -> {"status": "ok"} + - POST /predict: + Accepts: {"state": "New South Wales", + "industry": "Food retailing", "month": "2024-06"} + Looks up latest features for that state/industry + Calls our Model Serving endpoint + Returns: {"predicted_turnover": 4650.2, + "state": "New South Wales", + "industry": "Food retailing"} + - GET / -> serves the frontend + +2. Frontend (app/static/index.html): + - Tailwind CSS + htmx (CDN, no build step) + - Header: "Grocery Forecast — TEAM_NAME" + - Form: dropdowns for State, Industry, Month + - Submit -> calls POST /predict -> shows result card + +3. Create app/app.yaml and app/requirements.txt +4. Deploy: + databricks apps deploy \ + --name grocery-forecast-TEAM_NAME \ + --source-code-path ./app/
+
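The core of the backend is the /predict handler. A compressed sketch; the two helpers are illustrative stand-ins for the feature lookup and the serving call:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()

class PredictRequest(BaseModel):
    state: str
    industry: str
    month: str

@app.get("/health")
def health():
    return {"status": "ok"}

@app.post("/predict")
def predict(req: PredictRequest):
    features = lookup_latest_features(req.state, req.industry)  # illustrative: reads the feature table
    predicted = call_serving_endpoint(features)                 # illustrative: POSTs to Model Serving
    return {"predicted_turnover": predicted,
            "state": req.state, "industry": req.industry}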
+ +
+ + + + +
+ +
+ Phase 4 / 5 min + +
+
+

Your 3-Minute Demo

+

Show the group your end-to-end ML pipeline:

+
    +
  1. Feature table — show in Unity Catalog browser
  2. +
  3. MLflow experiment — show runs, metrics, artifacts
  4. +
  5. Model — show registry + serving endpoint
  6. +
  7. App — make a prediction live
  8. +
  9. One thing that surprised you
  10. +
+ +
+

Success Criteria

+
    +
  • Model trained with R² > 0.5
  • +
  • Model registered in Unity Catalog Model Registry
  • +
  • Model Serving endpoint responding to requests
  • +
  • Prediction app deployed to Databricks Apps
  • +
  • End-to-end: form → API → Model Serving → response
  • +
  • All tests passing
  • +
+
+
+ +
+

Reflection Questions

+
+
    +
  1. How accurate were your model's predictions?
  +
  2. What was the hardest part — training, serving, or building the app?
  +
  3. How would you improve the model for production use?
  +
  4. How does MLflow help with model lifecycle management?
  +
+
+ +
+ Running out of time? Grab Checkpoint DS-2C (complete app) or DS-2D (complete solution). +
+
+
+
+ +
+ + + + +
+ +
+ + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + + +
ProblemFix
Agent uses pandasAdd to CLAUDE.md: Always use PySpark, never pandas. Say: "Rewrite using PySpark."
MLflow tracking URI errorCheck DATABRICKS_HOST env var: echo $DATABRICKS_HOST
MLflow experiment not foundSet explicitly: mlflow.set_experiment("/Users/.../name")
Window function errorsVerify orderBy("month") and partitionBy("state", "industry")
Feature table write errorCheck UC schema: workshop_vibe_coding.TEAM_SCHEMA
SparkSession errors in testsCheck tests/conftest.py has SparkSession.builder.master("local[*]")
XGBoost not installedpip install xgboost
Low R² scoreTry XGBoost, add more features, or check for data leakage
Model Serving 404Endpoint takes 5-10 min to provision. Check status in Databricks UI.
Model Serving auth errorCheck DATABRICKS_TOKEN env var is set
Model Registry permission errorCheck CREATE MODEL permission on catalog. Ask facilitator.
mlflow.register_model failsUse full UC path: models:/workshop_vibe_coding.TEAM_SCHEMA.model_name
+
+ +
+ + + + +
+ +
+
+
+

Useful Commands

+
# Run specific tests +pytest tests/test_features.py -x +pytest tests/test_model.py -x + +# Run all tests +pytest tests/ -x + +# MLflow +mlflow experiments list +mlflow runs list --experiment-id <id> + +# Model Registry +databricks unity-catalog models list \ + --catalog workshop_vibe_coding \ + --schema TEAM_SCHEMA + +# Model Serving +databricks serving-endpoints list +databricks serving-endpoints get \ + grocery-forecast-TEAM_NAME + +# Deploy app +cd app && databricks apps deploy \ + --name grocery-forecast-TEAM_NAME \ + --source-code-path ./ + +# Start app locally +cd app && uvicorn app:app --reload --port 8000
+
+ +
+

Checkpoint Recovery

+

If you're stuck and need to catch up:

+
-- Gold tables (if needed) +CREATE TABLE + workshop_vibe_coding.TEAM_SCHEMA.retail_summary + AS SELECT * FROM + workshop_vibe_coding.checkpoints.retail_summary; + +CREATE TABLE + workshop_vibe_coding.TEAM_SCHEMA.food_inflation_yoy + AS SELECT * FROM + workshop_vibe_coding.checkpoints.food_inflation_yoy;
+ +

Agent Steering Tips

+ + + + + + + + + + + + + + + + + + + + + + +
When the agent...Say this
Writes too much code"Keep it simple. One function, minimal code."
Ignores your CLAUDE.md"Read CLAUDE.md first, then try again."
Gets stuck in a loop"Stop. Let's take a different approach."
Writes code before tests"Stop. Write the tests first, then implement."
+
+
+
+ +
+ + + + +
+ +
+
+
+

Lab 1: Features & Experiments

• Feature table has lag, seasonal, and growth features (see the sketch below)
• All feature engineering tests pass
• MLflow experiment logged with parameters, metrics, artifacts
• At least one visualization logged as artifact
• Feature table accessible in Unity Catalog
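A minimal PySpark sketch of those three feature families, assuming a gold table keyed by `state`, `industry`, and a date column `month`, with a `sales` value column (the value column name is a guess; use the workshop's actual schema):

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

# `spark` is the ambient Databricks SparkSession, or the conftest fixture above.
# Table name matches the checkpoint SQL; swap TEAM_SCHEMA for your team schema.
df = spark.table("workshop_vibe_coding.TEAM_SCHEMA.retail_summary")

w = Window.partitionBy("state", "industry").orderBy("month")

features = (
    df
    # Lag features: previous month and same month last year
    .withColumn("sales_lag_1", F.lag("sales", 1).over(w))
    .withColumn("sales_lag_12", F.lag("sales", 12).over(w))
    # Seasonal feature: calendar month extracted from the date column
    .withColumn("month_of_year", F.month("month"))
    # Growth feature: year-over-year change
    .withColumn(
        "yoy_growth",
        (F.col("sales") - F.col("sales_lag_12")) / F.col("sales_lag_12"),
    )
)
```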

What You Built

Gold Tables
  → Feature Engineering (lag, seasonal, growth)
  → Feature Table in Unity Catalog
  → MLflow Experiment with tracked runs, metrics & visualizations
  → Trained Model (RandomForest / XGBoost)
  → Model Registry
  → Model Serving Endpoint
  → Prediction Web App
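The experiment-tracking step in that chain looks roughly like this sketch; the experiment path, params, and metric values are placeholders:

```python
import mlflow

# Placeholder experiment path; see the "MLflow experiment not found" fix above.
mlflow.set_experiment("/Users/you@example.com/grocery-forecast")

with mlflow.start_run(run_name="rf-baseline"):
    mlflow.log_param("model_type", "RandomForestRegressor")
    mlflow.log_param("n_estimators", 200)
    mlflow.log_metric("r2", 0.57)  # hypothetical held-out score
    # Any saved plot satisfies the "visualization logged as artifact" criterion
    mlflow.log_artifact("plots/forecast_vs_actual.png")
```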

Lab 2: Train, Serve & Deploy

• Model trained with R² > 0.5
• Model registered in Unity Catalog Model Registry
• Model Serving endpoint responding to requests
• Prediction app deployed to Databricks Apps
• End-to-end: form → API → Model Serving → response (see the query sketch below)
• All tests passing
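For the end-to-end criterion, the app-to-endpoint hop is a plain HTTPS call to the serving endpoint's invocations URL; a sketch, with the endpoint name and feature payload as placeholders:

```python
import os
import requests

# Assumes DATABRICKS_HOST includes the https:// scheme and DATABRICKS_TOKEN
# is set, per the troubleshooting table above.
host = os.environ["DATABRICKS_HOST"].rstrip("/")
token = os.environ["DATABRICKS_TOKEN"]

resp = requests.post(
    f"{host}/serving-endpoints/grocery-forecast-TEAM_NAME/invocations",
    headers={"Authorization": f"Bearer {token}"},
    json={
        "dataframe_records": [
            # Hypothetical feature row; match your feature table's columns
            {"state": "VIC", "industry": "Supermarkets",
             "month_of_year": 7, "sales_lag_1": 1234.5, "sales_lag_12": 1100.0}
        ]
    },
    timeout=30,
)
resp.raise_for_status()
print(resp.json())  # {"predictions": [...]}
```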

Key Takeaways

• AI agents can build ML pipelines when given clear specs
• BDD keeps the agent focused and verifiable
• MLflow tracks the full experiment lifecycle
• Unity Catalog provides governance for models + data
• Model Serving makes deployment a single API call
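"A single API call" concretely: with the databricks-sdk this repo already depends on, creating an endpoint from a UC-registered model is roughly the following sketch (endpoint name, model name, and version are placeholders):

```python
from databricks.sdk import WorkspaceClient
from databricks.sdk.service.serving import (
    EndpointCoreConfigInput,
    ServedEntityInput,
)

w = WorkspaceClient()  # picks up DATABRICKS_HOST / DATABRICKS_TOKEN

w.serving_endpoints.create(
    name="grocery-forecast-TEAM_NAME",  # placeholder endpoint name
    config=EndpointCoreConfigInput(
        served_entities=[
            ServedEntityInput(
                # Full three-level UC model name, per the registry fix above
                entity_name="workshop_vibe_coding.TEAM_SCHEMA.grocery_forecast",
                entity_version="1",
                workload_size="Small",
                scale_to_zero_enabled=True,
            )
        ]
    ),
)
```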
+ + + + diff --git a/pyproject.toml b/pyproject.toml index 4f27f61..7638287 100644 --- a/pyproject.toml +++ b/pyproject.toml @@ -9,7 +9,8 @@ dependencies = [ "simple-websocket>=1.0", "claude-agent-sdk", "databricks-sdk>=0.20.0", - "mlflow-skinny==3.10.1", + "mlflow-skinny==3.11.1", + "mlflow-tracing==3.11.1", "requests", "cryptography>=46.0.6", ] diff --git a/requirements.lock b/requirements.lock index 9cae879..a056033 100644 --- a/requirements.lock +++ b/requirements.lock @@ -48,6 +48,7 @@ cachetools==7.0.5 \ # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing certifi==2026.2.25 \ --hash=sha256:027692e4402ad994f1c42e52a4997a9763c646b73e4096e4d5d6db8af1d6f0fa \ --hash=sha256:e887ab5cee78ea814d3472169153c2d12cd43b14bd03329a39a9c6e2e80bfba7 @@ -360,6 +361,7 @@ databricks-sdk==0.102.0 \ # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing fastapi==0.135.3 \ --hash=sha256:9b0f590c813acd13d0ab43dd8494138eb58e484bfac405db1f3187cfc5810d98 \ --hash=sha256:bd6d7caf1a2bdd8d676843cdcd2287729572a1ef524fc4d65c17ae002a1be654 @@ -561,9 +563,13 @@ mcp==1.26.0 \ # via # -r requirements.txt # claude-agent-sdk -mlflow-skinny==3.10.1 \ - --hash=sha256:3d1c5c30245b6e7065b492b09dd47be7528e0a14c4266b782fe58f9bcd1e0be0 \ - --hash=sha256:df1dd507d8ddadf53bfab2423c76cdcafc235cd1a46921a06d1a6b4dd04b023c +mlflow-skinny==3.11.1 \ + --hash=sha256:82ffd5f6980320b4ac19f741e7a754faa1d01707e632b002ea68e04fd25a0535 \ + --hash=sha256:86ce63491349f6713afc8a4ef0bf77a8314d0e79e03753cb150d6c860a0b0475 + # via -r requirements.txt +mlflow-tracing==3.11.1 \ + --hash=sha256:cb63cee16385d081467ec5bee4807fe1af59ddfdf04be4c79e7a7813b1002193 \ + --hash=sha256:fa82df64dacf8293b714ae666440fe7c1902c6470c024df389bb91e9de3106d9 # via -r requirements.txt opentelemetry-api==1.40.0 \ --hash=sha256:159be641c0b04d11e9ecd576906462773eb97ae1b657730f0ecf64d32071569f \ @@ -571,6 +577,7 @@ opentelemetry-api==1.40.0 \ # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing # opentelemetry-sdk # opentelemetry-semantic-conventions opentelemetry-proto==1.40.0 \ @@ -579,12 +586,14 @@ opentelemetry-proto==1.40.0 \ # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing opentelemetry-sdk==1.40.0 \ --hash=sha256:18e9f5ec20d859d268c7cb3c5198c8d105d073714db3de50b593b8c1345a48f2 \ --hash=sha256:787d2154a71f4b3d81f20524a8ce061b7db667d24e46753f32a7bc48f1c1f3f1 # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing opentelemetry-semantic-conventions==0.61b0 \ --hash=sha256:072f65473c5d7c6dc0355b27d6c9d1a679d63b6d4b4b16a9773062cb7e31192a \ --hash=sha256:fa530a96be229795f8cef353739b618148b0fe2b4b3f005e60e262926c4d38e2 @@ -597,6 +606,7 @@ packaging==26.0 \ # via # -r requirements.txt # mlflow-skinny + # mlflow-tracing protobuf==6.33.6 \ --hash=sha256:0cd27b587afca21b7cfa59a74dcbd48a50f0a6400cfb59391340ad729d91d326 \ --hash=sha256:77179e006c476e69bf8e8ce866640091ec42e1beb80b213c3900006ecfba6901 \ @@ -612,6 +622,7 @@ protobuf==6.33.6 \ # -r requirements.txt # databricks-sdk # mlflow-skinny + # mlflow-tracing # opentelemetry-proto pyasn1==0.6.3 \ --hash=sha256:697a8ecd6d98891189184ca1fa05d1bb00e2f84b5977c481452050549c8a72cf \ @@ -639,6 +650,7 @@ pydantic==2.12.5 \ # fastapi # mcp # mlflow-skinny + # mlflow-tracing # pydantic-settings pydantic-core==2.41.5 \ --hash=sha256:0177272f88ab8312479336e1d777f6b124537d47f2123f89cb37e0accea97f90 \ diff --git a/requirements.txt b/requirements.txt index 46f445a..3e462e9 100644 --- a/requirements.txt +++ b/requirements.txt @@ -22,7 +22,9 @@ blinker==1.9.0 # flask # flask-socketio 
cachetools==7.0.5 - # via mlflow-skinny + # via + # mlflow-skinny + # mlflow-tracing certifi==2026.2.25 # via # httpcore @@ -51,6 +53,7 @@ databricks-sdk==0.102.0 # via # coda (pyproject.toml) # mlflow-skinny + # mlflow-tracing fastapi==0.135.3 # via mlflow-skinny flask==3.1.3 @@ -102,25 +105,35 @@ markupsafe==3.0.3 # werkzeug mcp==1.26.0 # via claude-agent-sdk -mlflow-skinny==3.10.1 +mlflow-skinny==3.11.1 + # via coda (pyproject.toml) +mlflow-tracing==3.11.1 # via coda (pyproject.toml) opentelemetry-api==1.40.0 # via # mlflow-skinny + # mlflow-tracing # opentelemetry-sdk # opentelemetry-semantic-conventions opentelemetry-proto==1.40.0 - # via mlflow-skinny + # via + # mlflow-skinny + # mlflow-tracing opentelemetry-sdk==1.40.0 - # via mlflow-skinny + # via + # mlflow-skinny + # mlflow-tracing opentelemetry-semantic-conventions==0.61b0 # via opentelemetry-sdk packaging==26.0 - # via mlflow-skinny + # via + # mlflow-skinny + # mlflow-tracing protobuf==6.33.6 # via # databricks-sdk # mlflow-skinny + # mlflow-tracing # opentelemetry-proto pyasn1==0.6.3 # via pyasn1-modules @@ -133,6 +146,7 @@ pydantic==2.12.5 # fastapi # mcp # mlflow-skinny + # mlflow-tracing # pydantic-settings pydantic-core==2.41.5 # via pydantic diff --git a/setup_claude.py b/setup_claude.py index 725ad4d..7cd9ea2 100644 --- a/setup_claude.py +++ b/setup_claude.py @@ -1,4 +1,5 @@ import os +import sys import json import shutil import subprocess @@ -16,6 +17,117 @@ claude_dir = home / ".claude" claude_dir.mkdir(exist_ok=True) +# The coda-marketplace bundled with the CODA source is registered via +# extraKnownMarketplaces in settings.json below. Claude Code auto-discovers +# agents/ and commands/ inside enabled plugins, so we only need to: +# 1. ensure hook scripts are executable (git doesn't preserve +x reliably) +# 2. know the hooks/ path so we can wire hooks into settings.json +marketplace_dir = Path(__file__).parent / "coda-marketplace" +plugin_dir = marketplace_dir / "plugins" / "coda-essentials" +hooks_dir = plugin_dir / "hooks" +if hooks_dir.exists(): + for hook in hooks_dir.iterdir(): + if hook.is_file(): + os.chmod(hook, 0o755) + print(f"coda-essentials hooks ready: {hooks_dir}") + +# Register the bundled marketplace with Claude Code's plugin system. Just +# listing the marketplace in settings.json's extraKnownMarketplaces and the +# plugin in enabledPlugins is NOT enough — Claude Code also requires state +# files under ~/.claude/plugins/ (known_marketplaces.json + installed_plugins.json) +# before plugin content (skills, commands, agents, hooks) is actually loaded. +# For a directory-source marketplace the "installLocation" is the source path, +# no copy needed. We write these here so fresh CODA instances get /cache-stats, +# /til, and the marketplace skills available on first Claude Code session. +import datetime as _dt +plugins_state_dir = home / ".claude" / "plugins" +plugins_state_dir.mkdir(exist_ok=True) +cache_root = plugins_state_dir / "cache" / "coda" +cache_root.mkdir(parents=True, exist_ok=True) + +# Stage each plugin into ~/.claude/plugins/cache////. +# Claude Code requires this layout — even directory-source marketplaces get +# their plugins copied into a versioned cache path, and `installPath` in +# installed_plugins.json must point at the cache, not at the source. +# Verified by inspecting a working fe-vibe install where the marketplace +# source lives at ~/Repos/vibe-ebc-fix but plugin installPath is +# ~/.claude/plugins/cache/fe-vibe/fe-html-slides/1.1.4. 
+PLUGIN_VERSION = "0.1.0" +plugin_cache_paths = {} +for pname in ("coda-essentials", "coda-databricks-skills"): + src_p = marketplace_dir / "plugins" / pname + dst_p = cache_root / pname / PLUGIN_VERSION + if dst_p.exists(): + shutil.rmtree(dst_p) + shutil.copytree(src_p, dst_p) + plugin_cache_paths[pname] = dst_p + print(f"Staged plugin {pname} -> {dst_p}") + +# Re-point hooks_dir at the cached coda-essentials so settings.json hooks +# reference the copy Claude Code actually loads, not the source tree. +# (Source and cache have identical contents; this keeps the hook path +# consistent with the plugin loader's view of the filesystem.) +hooks_dir = plugin_cache_paths["coda-essentials"] / "hooks" +if hooks_dir.exists(): + for hook in hooks_dir.iterdir(): + if hook.is_file(): + os.chmod(hook, 0o755) + +_now = _dt.datetime.now(_dt.timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.000Z") + +(plugins_state_dir / "known_marketplaces.json").write_text(json.dumps({ + "coda": { + "source": {"source": "directory", "path": str(marketplace_dir)}, + "installLocation": str(marketplace_dir), + "lastUpdated": _now, + } +}, indent=2)) + +(plugins_state_dir / "installed_plugins.json").write_text(json.dumps({ + "version": 2, + "plugins": { + "coda-essentials@coda": [{ + "scope": "user", + "installPath": str(plugin_cache_paths["coda-essentials"]), + "version": PLUGIN_VERSION, + "installedAt": _now, + "lastUpdated": _now, + }], + "coda-databricks-skills@coda": [{ + "scope": "user", + "installPath": str(plugin_cache_paths["coda-databricks-skills"]), + "version": PLUGIN_VERSION, + "installedAt": _now, + "lastUpdated": _now, + }], + }, +}, indent=2)) +print(f"Registered coda marketplace + plugins in {plugins_state_dir}") + +# Defence-in-depth: also copy commands/agents into ~/.claude/commands/ +# and ~/.claude/agents/ at the user level. Claude Code's plugin loader +# on the Databricks Apps runtime didn't surface plugin-bundled commands +# on first attempt; user-level paths are the canonical fallback and +# are always scanned regardless of plugin state. Running both keeps the +# marketplace as the source of truth for content while guaranteeing the +# slash commands + subagents actually work. +user_commands_dir = claude_dir / "commands" +user_commands_dir.mkdir(exist_ok=True) +user_agents_dir = claude_dir / "agents" +user_agents_dir.mkdir(exist_ok=True) + +for src_commands in [plugin_cache_paths["coda-essentials"] / "commands"]: + if src_commands.exists(): + for f in src_commands.glob("*.md"): + shutil.copy2(str(f), str(user_commands_dir / f.name)) +print(f"User-level commands synced: {sorted(p.name for p in user_commands_dir.glob('*.md'))}") + +for src_agents in [plugin_cache_paths["coda-essentials"] / "agents"]: + if src_agents.exists(): + for f in src_agents.glob("*.md"): + shutil.copy2(str(f), str(user_agents_dir / f.name)) +print(f"User-level agents synced: {sorted(p.name for p in user_agents_dir.glob('*.md'))}") + # 1. 
Write settings.json for Databricks model serving (requires DATABRICKS_TOKEN) token = os.environ.get("DATABRICKS_TOKEN", "").strip() if token: @@ -30,35 +142,192 @@ print(f"Using Databricks Host: {databricks_host}") settings = { + "theme": "dark", + "outputStyle": "Explanatory", + "extraKnownMarketplaces": { + "coda": { + "source": { + "source": "directory", + "path": str(marketplace_dir), + }, + }, + }, + "enabledPlugins": { + "coda-essentials@coda": True, + "coda-databricks-skills@coda": True, + }, + "permissions": { + "defaultMode": "auto", + "allow": [ + "Bash(databricks *)", + "Bash(uv *)", + "Bash(git *)", + "Bash(make *)", + "Bash(python *)", + "Bash(pytest *)", + "Bash(ruff *)", + "Bash(wsync)", + "Bash(databricks sync * /Workspace/Shared/apps/coding-agents*)", + "Bash(databricks workspace import /Workspace/Shared/apps/coding-agents/*)", + "Bash(databricks workspace import-dir * /Workspace/Shared/apps/coding-agents*)", + ], + "deny": [ + # Process kills that would take down the gunicorn worker (single-worker app) + "Bash(pkill *)", + "Bash(pkill)", + "Bash(killall *)", + "Bash(fuser -k *)", + "Bash(kill 1)", + "Bash(kill -9 1)", + "Bash(kill -- -1)", + # Catastrophic filesystem deletion (would wipe app source / home) + "Bash(rm -rf /)", + "Bash(rm -rf /*)", + "Bash(rm -rf /app*)", + "Bash(rm -rf ~)", + "Bash(rm -rf ~/*)", + "Bash(rm -rf $HOME)", + "Bash(rm -rf $HOME/*)", + # Credential/config destruction (breaks auth + PAT rotator) + "Bash(rm ~/.databrickscfg*)", + "Bash(rm -rf ~/.claude*)", + # Shared Workspace paths that other apps depend on + "Bash(rm -rf /Workspace*)", + "Bash(databricks workspace delete /Workspace/Shared*)", + "Bash(databricks workspace delete-dir /Workspace/Shared*)", + # Don't delete other users' coda apps + "Bash(databricks apps delete *)", + # System-level destructive + "Bash(shutdown *)", + "Bash(reboot *)", + "Bash(halt *)", + "Bash(mkfs *)", + "Bash(dd if=* of=/dev/*)", + "Bash(chmod -R * /app*)", + "Bash(chown -R * /app*)", + ], + }, "env": { - "ANTHROPIC_MODEL": os.environ.get("ANTHROPIC_MODEL", "databricks-claude-opus-4-6"), + "ANTHROPIC_MODEL": os.environ.get("ANTHROPIC_MODEL", "databricks-claude-opus-4-7"), "ANTHROPIC_BASE_URL": anthropic_base_url, "ANTHROPIC_AUTH_TOKEN": token, - "ANTHROPIC_DEFAULT_OPUS_MODEL": "databricks-claude-opus-4-6", + "ANTHROPIC_DEFAULT_OPUS_MODEL": "databricks-claude-opus-4-7", "ANTHROPIC_DEFAULT_SONNET_MODEL": "databricks-claude-sonnet-4-6", "ANTHROPIC_DEFAULT_HAIKU_MODEL": "databricks-claude-haiku-4-5", "ANTHROPIC_CUSTOM_HEADERS": "x-databricks-use-coding-agent-mode: true", "CLAUDE_CODE_DISABLE_EXPERIMENTAL_BETAS": "1", - } + }, + "hooks": { + "SessionStart": [{ + "matcher": "", + "hooks": [ + {"type": "command", + "command": f"python3 {hooks_dir}/check-memory-staleness.py --cwd \"$PWD\"", + "timeout": 10}, + {"type": "command", + "command": f"bash {hooks_dir}/session-context-loader.sh", + "timeout": 15}, + ], + }], + "PostToolUse": [{ + "matcher": "Edit|Write", + "hooks": [{ + "type": "command", + "command": f"bash {hooks_dir}/memory-stamp-verified.sh", + "timeout": 5, + }], + }], + "Stop": [{ + "matcher": "", + "hooks": [ + {"type": "command", + "command": f"bash {hooks_dir}/session-crystallize-nudge.sh", + "timeout": 10}, + {"type": "command", + "command": f"bash {hooks_dir}/push-brain-to-workspace.sh", + "timeout": 5}, + ], + }], + }, } settings_path = claude_dir / "settings.json" settings_path.write_text(json.dumps(settings, indent=2)) print(f"Claude configured: {settings_path}") + + # 1b. 
Secure-egress network detection. If docs.databricks.com is blocked + # (common in enterprise Azure workspaces with a restrictive outbound + # allowlist), append a note to ~/.claude/CLAUDE.md telling agents to + # substitute learn.microsoft.com/en-us/azure/databricks/ — Microsoft + # Learn mirrors the Azure Databricks docs one-to-one and is usually + # allowlisted by default. Idempotent via a marker comment. + import urllib.request # stdlib, no extra deps + _egress_marker = "" + _egress_note = ( + f"\n{_egress_marker}\n" + "## Documentation fallback — secure-egress workspace\n" + "`docs.databricks.com` is blocked from this environment. " + "When looking up Databricks docs, rewrite URLs:\n" + "- `docs.databricks.com/azure/en/X` → `learn.microsoft.com/en-us/azure/databricks/X`\n" + "- `docs.databricks.com/aws/en/X` → `learn.microsoft.com/en-us/azure/databricks/X`\n" + "Microsoft Learn mirrors the Azure Databricks docs one-to-one and is usually reachable.\n" + ) + try: + urllib.request.urlopen("https://docs.databricks.com/", timeout=3) + print("docs.databricks.com reachable — no egress fallback needed") + except Exception as _e: + print(f"docs.databricks.com unreachable ({type(_e).__name__}) — installing learn.microsoft.com fallback note") + _claude_md = claude_dir / "CLAUDE.md" + _existing = _claude_md.read_text() if _claude_md.exists() else "" + if _egress_marker not in _existing: + with open(_claude_md, "a") as _f: + _f.write(_egress_note) + print(f"Appended egress fallback note to {_claude_md}") + + # 1c. Fork-specific directives. These are conventions that apply to every + # CODA spawned from this repo — package manager, required library floors, + # the working Unity Catalog, and the terminal-editor pointer. Mirrors the + # egress-fallback pattern above: idempotent via a marker comment, appended + # to ~/.claude/CLAUDE.md so Claude sees them in every session regardless + # of cwd. Update this block when fork-wide conventions change. + _fork_marker = "" + _fork_note = ( + f"\n{_fork_marker}\n" + "## CODA fork directives (always active)\n\n" + "### Python packaging\n" + "Always use `uv` for Python work. Never `pip install` directly — " + "`uv add` for new deps, `uv sync` to install, `uv run` to execute.\n\n" + "### Required library version floors\n" + "When scaffolding or updating a `pyproject.toml`, pin at least:\n" + "- `mlflow >= 3.11`\n" + "- `databricks-sdk >= 0.100.0`\n" + "Bump older pins rather than matching them. Do not downgrade.\n\n" + "### Unity Catalog\n" + "The working catalog in this environment is `edp_aisandbox_aisandbox_dev`. " + "Place new schemas, tables, volumes, and pipelines under this catalog " + "unless the user explicitly names another. " + "Example: `edp_aisandbox_aisandbox_dev.my_schema.my_table`.\n\n" + "### Terminal editors\n" + "`micro` is pre-installed at `~/.local/bin/micro` (Ctrl-S save, Ctrl-Q quit, " + "mouse support, no modal editing — safe default to recommend). " + "For other editors, check `~/.local/share/coda/editors.txt` — generated " + "at app startup, lists every editor detected via `command -v`. 
" + "If a user asks for vim/emacs and the file shows they're missing, say so " + "rather than guessing.\n" + ) + _claude_md = claude_dir / "CLAUDE.md" + _existing = _claude_md.read_text() if _claude_md.exists() else "" + if _fork_marker not in _existing: + with open(_claude_md, "a") as _f: + _f.write(_fork_note) + print(f"Appended fork directives to {_claude_md}") + else: + print(f"Fork directives already present in {_claude_md}") else: print("No DATABRICKS_TOKEN — skipping settings.json (will be configured after PAT setup)") # 2. Write ~/.claude.json with onboarding skip AND MCP servers -mcp_servers = { - "deepwiki": { - "type": "http", - "url": "https://mcp.deepwiki.com/mcp" - }, - "exa": { - "type": "http", - "url": "https://mcp.exa.ai/mcp" - } -} +mcp_servers = {} # Auto-configure team-memory MCP if URL is provided team_memory_url = os.environ.get("TEAM_MEMORY_MCP_URL", "").strip().rstrip("/") @@ -69,6 +338,14 @@ } print(f"Team memory MCP configured: {team_memory_url}/mcp") +# Public-internet MCPs (deepwiki, exa) are opt-in: they live on the open +# internet and won't work in air-gapped or secure-egress deployments. Set +# ENABLE_PUBLIC_MCPS=true only when you know the runtime can reach them. +if os.environ.get("ENABLE_PUBLIC_MCPS", "").strip().lower() in ("1", "true", "yes"): + mcp_servers["deepwiki"] = {"type": "http", "url": "https://mcp.deepwiki.com/mcp"} + mcp_servers["exa"] = {"type": "http", "url": "https://mcp.exa.ai/mcp"} + print("Public MCPs enabled (ENABLE_PUBLIC_MCPS=true): deepwiki, exa") + claude_json = { "hasCompletedOnboarding": True, "mcpServers": mcp_servers @@ -95,21 +372,8 @@ else: print(f"CLI install warning: {result.stderr}") -# 4. Copy subagent definitions to ~/.claude/agents/ -# These enable TDD workflow: prd-writer → test-generator → implementer → build-feature -agents_src = Path(__file__).parent / "agents" -agents_dst = claude_dir / "agents" -agents_dst.mkdir(exist_ok=True) - -if agents_src.exists(): - copied = [] - for agent_file in agents_src.glob("*.md"): - shutil.copy2(str(agent_file), str(agents_dst / agent_file.name)) - copied.append(agent_file.name) - if copied: - print(f"Subagents installed: {', '.join(copied)}") -else: - print("No agents directory found, skipping subagent setup") +# 4. Subagents are discovered automatically from coda-essentials plugin +# (no manual copy step needed — the plugin's agents/ dir is scanned by Claude Code). # 5. Create projects directory projects_dir = home / "projects" @@ -119,3 +383,22 @@ # 5. Git identity and hooks are now configured by app.py's _setup_git_config() # (runs directly in Python before setup_claude.py, writes ~/.gitconfig and ~/.githooks/) print("Git identity and hooks: configured by app.py (skipping here)") + +# 6. Restore Claude Code auto-memory ("brain") from workspace if present. +# This makes accumulated memories survive app redeployment. Best-effort — +# failures are logged but don't break startup. 
+if token: + brain_sync = Path(__file__).parent / "claude_brain_sync.py" + if brain_sync.exists(): + try: + result = subprocess.run( + [sys.executable, str(brain_sync), "pull"], + capture_output=True, text=True, timeout=60, + env={**os.environ, "HOME": str(home)}, + ) + if result.stdout: + print(result.stdout.strip()) + if result.returncode != 0 and result.stderr: + print(f"brain-sync pull warning: {result.stderr.strip()}") + except Exception as e: + print(f"brain-sync pull skipped: {e}") diff --git a/setup_mlflow.py b/setup_mlflow.py index 9d305d3..7be8fd6 100644 --- a/setup_mlflow.py +++ b/setup_mlflow.py @@ -1,8 +1,8 @@ """Configure MLflow tracing for Claude Code sessions. -Merges MLflow env vars and a Stop hook into ~/.claude/settings.json so that -every Claude Code session automatically logs traces to a Databricks MLflow -experiment at /Users/{app_owner}/{app_name}. +Merges MLflow env vars into ~/.claude/settings.json. Tracing is disabled +by default — the Stop hook stalls on transcript processing. Set +MLFLOW_CLAUDE_TRACING_ENABLED=true to opt in once the hook is reliable. """ import os @@ -31,34 +31,54 @@ experiment_name = f"/Users/{app_owner}/{app_name}" -# Merge MLflow env vars +# Tracing disabled by default — stop hook stalls on transcript processing +tracing_enabled = os.environ.get("MLFLOW_CLAUDE_TRACING_ENABLED", "false").lower() == "true" + +# Merge MLflow env vars (always set so tracing can be toggled at runtime) settings.setdefault("env", {}) -settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = "false" +settings["env"]["MLFLOW_CLAUDE_TRACING_ENABLED"] = str(tracing_enabled).lower() settings["env"]["MLFLOW_TRACKING_URI"] = "databricks" settings["env"]["MLFLOW_EXPERIMENT_NAME"] = experiment_name # Override container-level OTEL endpoint so MLflow uses its native MlflowV3SpanExporter # instead of sending traces to a non-existent localhost:4314 OTLP collector settings["env"]["OTEL_EXPORTER_OTLP_ENDPOINT"] = "" -# Add Stop hook (processes full transcript at session end) -# Use `uv run python` so mlflow resolves correctly regardless of venv paths -python_cmd = "uv run python" -mlflow_hook = { - "hooks": [ - { - "type": "command", - "command": f"{python_cmd} -c \"from mlflow.claude_code.hooks import stop_hook_handler; stop_hook_handler()\"" - } - ] -} +# Only register the Stop hook when explicitly enabled +if tracing_enabled: + app_dir = os.path.dirname(os.path.abspath(__file__)) + # Delegate to a proper hook script that backgrounds the handler via + # `nohup timeout 30 ... & disown`. This: + # 1. unblocks the Stop chain immediately (brain-push, /til, etc.) + # 2. caps the backgrounded flush at 30s so a stuck handler can't + # eat memory/CPU forever — one dropped trace beats a leaked + # transcript processor + hook_script = os.path.join( + app_dir, + "coda-marketplace", "plugins", "coda-essentials", "hooks", + "mlflow-trace-stop.sh", + ) + os.chmod(hook_script, 0o755) + mlflow_hook = { + "hooks": [ + { + "type": "command", + "command": f"bash {hook_script}", + # The wrapper script backgrounds the work and returns in <1s, + # so this outer timeout is belt-and-braces only. 
+ "timeout": 5, + } + ] + } -existing_hooks = settings.get("hooks", {}) -stop_hooks = existing_hooks.get("Stop", []) -stop_hooks.append(mlflow_hook) -existing_hooks["Stop"] = stop_hooks -settings["hooks"] = existing_hooks + existing_hooks = settings.get("hooks", {}) + stop_hooks = existing_hooks.get("Stop", []) + stop_hooks.append(mlflow_hook) + existing_hooks["Stop"] = stop_hooks + settings["hooks"] = existing_hooks + print(f"MLflow tracing enabled: experiment={experiment_name}") +else: + print("MLflow tracing disabled (set MLFLOW_CLAUDE_TRACING_ENABLED=true to enable)") settings_path.write_text(json.dumps(settings, indent=2)) -print(f"MLflow tracing enabled: experiment={experiment_name}") print(f" Tracking URI: databricks") print(f" Settings updated: {settings_path}") diff --git a/setup_opencode.py b/setup_opencode.py index a0ef9c7..f090221 100644 --- a/setup_opencode.py +++ b/setup_opencode.py @@ -106,8 +106,8 @@ "apiKey": "{env:DATABRICKS_TOKEN}" }, "models": { - "databricks-claude-opus-4-6": { - "name": "Claude Opus 4.6 (Databricks)", + "databricks-claude-opus-4-7": { + "name": "Claude Opus 4.7 (Databricks)", "limit": { "context": 200000, "output": 16384 @@ -198,8 +198,8 @@ "apiKey": "{env:DATABRICKS_TOKEN}" }, "models": { - "databricks-claude-opus-4-6": { - "name": "Claude Opus 4.6 (Databricks)", + "databricks-claude-opus-4-7": { + "name": "Claude Opus 4.7 (Databricks)", "limit": { "context": 200000, "output": 16384 diff --git a/spawner/.gitignore b/spawner/.gitignore new file mode 100644 index 0000000..91c53ec --- /dev/null +++ b/spawner/.gitignore @@ -0,0 +1,8 @@ +__pycache__/ +*.pyc +*.pyo +.venv/ +.env +.databricks/ +uv.lock +pyproject.toml diff --git a/spawner/Makefile b/spawner/Makefile new file mode 100644 index 0000000..465b533 --- /dev/null +++ b/spawner/Makefile @@ -0,0 +1,88 @@ +# Makefile for deploying Coding Agents Spawner to Databricks Apps +# +# Usage: +# make deploy PROFILE=daveok # full deploy (first time) +# make redeploy PROFILE=daveok # sync + deploy only +# make status PROFILE=daveok +# make logs PROFILE=daveok + +PROFILE ?= DEFAULT +APP_NAME := coding-agents-spawner +TEMPLATE_SRC := /Workspace/Shared/apps/coding-agents + +USER_EMAIL = $(shell databricks current-user me --profile $(PROFILE) --output json 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('userName',''))") +WORKSPACE_PATH = /Workspace/Users/$(USER_EMAIL)/apps/$(APP_NAME) +HOST = $(shell databricks auth env --profile $(PROFILE) 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin)['env']['DATABRICKS_HOST'])") + +.PHONY: help run deploy redeploy create-app sync-template sync deploy-app status logs test + +run: ## Wait for app to be running and print URL + @echo "==> Waiting for '$(APP_NAME)' to be running..." + @for i in 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; do \ + STATE=$$(databricks apps get $(APP_NAME) --profile $(PROFILE) --output json 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('app_status',{}).get('state',''))"); \ + if [ "$$STATE" = "RUNNING" ]; then \ + echo ""; \ + echo "App is RUNNING!"; \ + databricks apps get $(APP_NAME) --profile $(PROFILE) --output json 2>/dev/null | python3 -c "import sys,json; print('URL:', json.load(sys.stdin).get('url','(unknown)'))"; \ + exit 0; \ + fi; \ + echo " State: $$STATE (waiting...)"; \ + sleep 10; \ + done; \ + echo " Timed out waiting for app to reach RUNNING state." + +test: ## Run spawner gate tests (AC-1 through AC-12) + @echo "==> Running spawner gate tests..." + cd .. 
&& uv run pytest tests/gates/test_spawner_*.py -v --tb=short + +help: ## Show this help + @grep -E '^[a-zA-Z_-]+:.*?## .*$$' $(MAKEFILE_LIST) | awk 'BEGIN {FS = ":.*?## "}; {printf " \033[36m%-18s\033[0m %s\n", $$1, $$2}' + +deploy: create-app sync-template sync deploy-app run ## Full deploy: create app, sync template + spawner, deploy + run + +redeploy: sync deploy-app run ## Redeploy: sync spawner + deploy + +create-app: ## Create the spawner app (idempotent) + @echo "==> Checking if app '$(APP_NAME)' exists..." + @if databricks apps get $(APP_NAME) --profile $(PROFILE) >/dev/null 2>&1; then \ + echo " App '$(APP_NAME)' already exists, skipping create."; \ + else \ + echo " Creating app '$(APP_NAME)'..."; \ + databricks apps create $(APP_NAME) --profile $(PROFILE); \ + fi + +sync-template: ## Sync coding-agents source to shared template path + @echo "==> Syncing coding-agents template to $(TEMPLATE_SRC)..." + @databricks workspace mkdirs /Workspace/Shared/apps --profile $(PROFILE) 2>/dev/null || true + @cd .. && databricks sync . $(TEMPLATE_SRC) --watch=false --exclude-from .syncignore --profile $(PROFILE) + @# Override app.yaml with template defaults (no pre-loaded token — users paste PAT in CODA) + @echo " Uploading template app.yaml..." + @# Resolve team-memory-mcp app URL (if deployed) + $(eval TEAM_MEMORY_URL := $(shell databricks apps get team-memory-mcp --profile $(PROFILE) --output json 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('url',''))" 2>/dev/null)) + @printf 'command:\n - gunicorn\n - app:app\nenv:\n - name: HOME\n value: /app/python/source_code\n - name: ANTHROPIC_MODEL\n value: databricks-claude-opus-4-7\n - name: GEMINI_MODEL\n value: databricks-gemini-3-1-pro\n - name: CODEX_MODEL\n value: databricks-gpt-5-2\n - name: CLAUDE_CODE_DISABLE_AUTO_MEMORY\n value: 0\n' > /tmp/_coda_template_app.yaml + @if [ -n "$(TEAM_MEMORY_URL)" ]; then \ + printf ' - name: TEAM_MEMORY_MCP_URL\n value: %s\n' "$(TEAM_MEMORY_URL)" >> /tmp/_coda_template_app.yaml; \ + echo " Team memory MCP URL: $(TEAM_MEMORY_URL)"; \ + else \ + echo " Team memory MCP: not deployed (skipping)"; \ + fi + @databricks workspace import $(TEMPLATE_SRC)/app.yaml --file /tmp/_coda_template_app.yaml --format AUTO --overwrite --profile $(PROFILE) + @rm -f /tmp/_coda_template_app.yaml + @echo " Template synced." + +sync: ## Sync spawner source to workspace + @echo "==> Syncing spawner to $(WORKSPACE_PATH)..." + @databricks sync . $(WORKSPACE_PATH) --watch=false --exclude-from ../.syncignore --profile $(PROFILE) + +deploy-app: ## Deploy the spawner app + @echo "==> Deploying '$(APP_NAME)'..." + @databricks apps deploy $(APP_NAME) --source-code-path $(WORKSPACE_PATH) --profile $(PROFILE) --no-wait + @echo "" + @echo "App URL:" + @databricks apps get $(APP_NAME) --profile $(PROFILE) --output json 2>/dev/null | python3 -c "import sys,json; print(json.load(sys.stdin).get('url','(pending)'))" + +status: ## Check spawner app status + @databricks apps get $(APP_NAME) --profile $(PROFILE) + +logs: ## Tail spawner app logs + @databricks apps logs $(APP_NAME) --profile $(PROFILE) diff --git a/spawner/README.md b/spawner/README.md new file mode 100644 index 0000000..c9566e9 --- /dev/null +++ b/spawner/README.md @@ -0,0 +1,111 @@ +# Coding Agents Spawner + +One-click provisioning of individual [coding-agents](../) Databricks Apps for any developer in your workspace. + +## How It Works + +A developer visits the spawner UI, pastes their Databricks PAT, and clicks **Deploy**. The spawner: + +1. 
**Resolves identity** — calls SCIM `/Me` with the user's PAT to get their email +2. **Stores the PAT** — creates a secret scope `coding-agents-{user}-secrets` and stores the PAT with a unique UUID key (uses admin token for privileged scope operations) +3. **Creates the app** — `POST /api/2.0/apps` with the user's PAT so they own it; the secret resource (`DATABRICKS_TOKEN`) is included in the creation call +4. **Grants SP access** — gives the app's service principal READ on the secret scope +5. **Deploys** — deploys from the shared template at `/Workspace/Shared/apps/coding-agents` + +The spawned app is named `coding-agents-{username}` (derived from email), e.g., `coding-agents-david-okeeffe`. + +## Architecture + +``` +┌─────────────────────┐ ┌──────────────────────────┐ +│ Spawner App │ │ Shared Template │ +│ (this app) │ │ /Workspace/Shared/apps/ │ +│ │ deploy │ coding-agents/ │ +│ - Admin PAT (env) ├────────►│ - app.py │ +│ - Provisioning API │ │ - app.yaml │ +│ - Spawned apps list│ │ - requirements.txt │ +└─────────────────────┘ └──────────────────────────┘ + │ + │ creates per-user + ▼ +┌─────────────────────────────┐ +│ coding-agents-{user} │ +│ - Owned by user │ +│ - DATABRICKS_TOKEN = PAT │ +│ - Deployed from template │ +└─────────────────────────────┘ +``` + +### Token Model + +| Token | Stored in | Used for | +|-------|-----------|----------| +| **Admin PAT** | `coding-agents-spawner-secrets/admin-token` | Secret scope creation, ACLs, deployment | +| **User PAT** | `coding-agents-{user}-secrets/{uuid}` | App creation (ownership), runtime `DATABRICKS_TOKEN` | + +The admin PAT requires **workspace admin** privileges (for secret scope creation and ACL management). + +The user PAT should have **all access** scopes since Claude Code uses it for model serving, workspace operations, Unity Catalog, clusters, etc. + +## Prerequisites + +- Databricks CLI configured with a profile (`databricks configure --profile `) +- Workspace admin access (for the admin PAT) +- Shared template synced to `/Workspace/Shared/apps/coding-agents` + +## Deploy + +### First time + +```bash +cd spawner +make deploy PROFILE=daveok ADMIN_PAT=dapi... +``` + +This will: +- Create the `coding-agents-spawner` app +- Create secret scope and store the admin PAT +- Sync the coding-agents template to the shared workspace path +- Sync the spawner source and deploy +- Wait for the app to be RUNNING and print the URL + +If you omit `ADMIN_PAT`, it will prompt interactively. + +### Subsequent deploys + +```bash +make redeploy PROFILE=daveok +``` + +Syncs source and redeploys (skips secret setup and template sync). 
+ +### Other targets + +```bash +make status PROFILE=daveok # Check app status +make logs PROFILE=daveok # Tail app logs +make sync-template PROFILE=daveok # Re-sync shared template +make clean PROFILE=daveok # Remove secret scope (destructive) +make help # Show all targets +``` + +## API Endpoints + +| Endpoint | Method | Description | +|----------|--------|-------------| +| `/` | GET | Spawner UI | +| `/health` | GET | Health check | +| `/api/status` | GET | Check if current user has a deployed app | +| `/api/apps` | GET | List all spawned coding-agents apps | +| `/api/provision` | POST | Provision a new app (body: `{"pat": "dapi..."}`) | + +## Files + +``` +spawner/ +├── app.py # Flask app with provisioning logic +├── app.yaml # Databricks App config (exposes ADMIN_TOKEN env) +├── requirements.txt # flask, gunicorn, requests +├── Makefile # Deploy/manage targets +└── README.md # This file +``` diff --git a/spawner/__init__.py b/spawner/__init__.py new file mode 100644 index 0000000..e69de29 diff --git a/spawner/app.py b/spawner/app.py new file mode 100644 index 0000000..8e23592 --- /dev/null +++ b/spawner/app.py @@ -0,0 +1,923 @@ +"""Coding Agents Spawner App -- zero-PAT one-click provisioning for any developer. + +Auth: Uses the app's own service principal (auto-provisioned by Databricks Apps) +via OAuth M2M. The SP must be a workspace admin to create apps and manage +service principal access. Users no longer need to provide a PAT to the spawner — +they authenticate via SSO (X-Forwarded-Email) and paste their PAT later +directly in their CODA instance. + +Owner resolution: The spawner sets `owner:{email}` in the app description field. +CODA's get_token_owner() reads this to determine who owns the app. +""" + +import hashlib +import os +import threading +import time + +import requests +from flask import Flask, jsonify, request + +app = Flask(__name__, static_folder="static") + +_raw_host = os.environ.get("DATABRICKS_HOST", "") +DATABRICKS_HOST = ( + _raw_host if _raw_host.startswith("https://") else f"https://{_raw_host}" +).rstrip("/") + +# OAuth M2M: use the app's own service principal credentials +# These are auto-injected by Databricks Apps runtime. +_SP_CLIENT_ID = os.environ.get("DATABRICKS_CLIENT_ID", "") +_SP_CLIENT_SECRET = os.environ.get("DATABRICKS_CLIENT_SECRET", "") + +# Token cache +_oauth_token = None +_oauth_token_expiry = 0 +_oauth_lock = threading.Lock() + + +def get_admin_token() -> str: + """Get an OAuth token using the app's service principal credentials. + + Tokens are cached and refreshed 60s before expiry. + Falls back to ADMIN_TOKEN env var if SP credentials are not available. 
+ """ + global _oauth_token, _oauth_token_expiry + + # Fallback to legacy ADMIN_TOKEN if SP creds not available + if not _SP_CLIENT_ID or not _SP_CLIENT_SECRET: + legacy = os.environ.get("ADMIN_TOKEN", "").strip() + if legacy: + return legacy + raise RuntimeError( + "No SP credentials (DATABRICKS_CLIENT_ID/SECRET) or ADMIN_TOKEN configured" + ) + + with _oauth_lock: + if _oauth_token and time.time() < _oauth_token_expiry - 60: + return _oauth_token + + resp = requests.post( + f"{DATABRICKS_HOST}/oidc/v1/token", + data={ + "grant_type": "client_credentials", + "client_id": _SP_CLIENT_ID, + "client_secret": _SP_CLIENT_SECRET, + "scope": "all-apis", + }, + ) + resp.raise_for_status() + data = resp.json() + _oauth_token = data["access_token"] + _oauth_token_expiry = time.time() + data.get("expires_in", 3600) + return _oauth_token + + +# In-memory provision progress, keyed by app_name +_provision_jobs: dict[str, dict] = {} +_provision_lock = threading.Lock() + +MAX_APP_NAME_LENGTH = 30 + +# UC Volume for offline Python wheel installation +WHEELS_VOLUME_CATALOG = os.environ.get("WHEELS_VOLUME_CATALOG", "main") +WHEELS_VOLUME_SCHEMA = os.environ.get("WHEELS_VOLUME_SCHEMA", "coda") +WHEELS_VOLUME_NAME = os.environ.get("WHEELS_VOLUME_NAME", "coda-wheels") + + +def app_name_from_email(email: str) -> str: + """Derive app name from user email: david.okeeffe@company.com -> coda-david-okeeffe. + + Databricks app names are limited to 30 characters. Uses 'coda-' prefix + to maximise space for the username slug (25 chars available). + If still too long, truncates with a hash suffix for uniqueness. + """ + prefix = "coda-" + username = email.split("@")[0] + slug = username.replace(".", "-").replace("_", "-").lower() + full_name = f"{prefix}{slug}" + + if len(full_name) <= MAX_APP_NAME_LENGTH: + return full_name + + hash_suffix = hashlib.sha256(slug.encode()).hexdigest()[:6] + max_slug_len = MAX_APP_NAME_LENGTH - len(prefix) - len(hash_suffix) - 1 + truncated_slug = slug[:max_slug_len].rstrip("-") + return f"{prefix}{truncated_slug}-{hash_suffix}" + + +def _ensure_owner_description( + host: str, admin_token: str, app_name: str, owner_email: str +) -> None: + """PATCH the app description to owner:{email} on re-provision. + + When an app already exists (409 on create), the description may be missing + or stale. This ensures the child app can always resolve its owner. 
+ """ + resp = requests.patch( + f"{host}/api/2.0/apps/{app_name}", + headers={"Authorization": f"Bearer {admin_token}"}, + json={"description": f"owner:{owner_email}"}, + ) + if resp.ok: + print(f" Updated app description to owner:{owner_email}") + else: + print(f" Warning: could not update description ({resp.status_code}): {resp.text[:200]}") + + +def grant_user_app_permissions( + host: str, admin_token: str, app_name: str, owner_email: str +) -> None: + """Grant CAN_MANAGE on the app to the owner so they appear as the app owner.""" + resp = requests.patch( + f"{host}/api/2.0/permissions/apps/{app_name}", + headers={"Authorization": f"Bearer {admin_token}"}, + json={ + "access_control_list": [ + {"user_name": owner_email, "permission_level": "CAN_MANAGE"} + ] + }, + ) + if resp.ok: + print(f" Granted CAN_MANAGE on {app_name} to {owner_email}") + else: + print( + f" Warning: could not grant permissions ({resp.status_code}): " + f"{resp.text[:200]}" + ) + + +def _request_with_backoff( + method: str, + url: str, + max_retries: int = 6, + base_delay: float = 5.0, + **kwargs, +) -> requests.Response: + """Make an HTTP request with exponential backoff on 429 responses.""" + for attempt in range(max_retries): + resp = requests.request(method, url, **kwargs) + if resp.status_code != 429: + return resp + delay = min(base_delay * (2 ** attempt), 60) + print(f" Rate limited ({url.split('/')[-1]}), retrying in {delay:.0f}s " + f"(attempt {attempt + 1}/{max_retries})") + time.sleep(delay) + return resp + + +# Limit concurrent API-heavy provisioning threads to avoid 429s +_provision_semaphore = threading.Semaphore(3) + + +def create_app(host: str, admin_token: str, app_name: str, owner_email: str) -> dict: + """Create the Databricks App via POST /api/2.0/apps. + + The app is created with the admin SP token. Owner identity is stored in the + description field as 'owner:{email}' so CODA's get_token_owner() can resolve + it without requiring the user's PAT. 
+ """ + resp = _request_with_backoff( + "POST", + f"{host}/api/2.0/apps", + headers={"Authorization": f"Bearer {admin_token}"}, + json={ + "name": app_name, + "description": f"owner:{owner_email}", + }, + ) + # 409 means app already exists -- ensure description has owner:{email} + if resp.status_code == 409: + _ensure_owner_description(host, admin_token, app_name, owner_email) + return check_existing_app(host, admin_token, app_name) + resp.raise_for_status() + return resp.json() + + +def wait_for_compute_active( + host: str, + oauth_token: str, + app_name: str, + timeout: int = 180, + interval: int = 10, +) -> None: + """Poll until compute_status reaches ACTIVE (required before first deploy).""" + headers = {"Authorization": f"Bearer {oauth_token}"} + elapsed = 0 + while elapsed < timeout: + resp = requests.get(f"{host}/api/2.0/apps/{app_name}", headers=headers) + if resp.ok: + compute = resp.json().get("compute_status", {}).get("state", "") + if compute == "ACTIVE": + return + time.sleep(interval) + elapsed += interval + raise RuntimeError( + f"Timed out waiting for compute to become ACTIVE after {timeout}s" + ) + + +def deploy_app( + host: str, + oauth_token: str, + app_name: str, + source_code_path: str, +) -> dict: + """Deploy the app via POST /api/2.0/apps/{name}/deployments.""" + resp = requests.post( + f"{host}/api/2.0/apps/{app_name}/deployments", + headers={"Authorization": f"Bearer {oauth_token}"}, + json={"source_code_path": source_code_path}, + ) + if not resp.ok: + raise RuntimeError(f"{resp.status_code} from deploy API: {resp.text}") + return resp.json() + + +def _grant_with_retry( + host: str, + headers: dict, + securable_type: str, + full_name: str, + privileges: list[str], + sp_name: str, + max_retries: int = 6, + base_delay: float = 5.0, +) -> bool: + """Grant UC permissions, retrying if the principal doesn't exist yet. + + After app creation the service principal can take 10-30s to propagate into + Unity Catalog's identity store. Retries with exponential backoff (5s, 10s, + 20s, 40s, 60s, 60s) for up to ~3 minutes. + """ + for attempt in range(max_retries): + resp = requests.patch( + f"{host}/api/2.1/unity-catalog/permissions/{securable_type}/{full_name}", + headers=headers, + json={"changes": [{"add": privileges, "principal": sp_name}]}, + ) + if resp.ok: + print(f" Granted {privileges} on {securable_type} {full_name} to {sp_name}") + return True + + # Retry only on PRINCIPAL_DOES_NOT_EXIST — the SP hasn't propagated yet + if resp.status_code == 404 and "PRINCIPAL_DOES_NOT_EXIST" in resp.text: + delay = min(base_delay * (2**attempt), 60) + print( + f" SP not propagated yet, retrying in {delay:.0f}s " + f"(attempt {attempt + 1}/{max_retries})..." + ) + time.sleep(delay) + continue + + # Any other error is not retryable + print(f" Warning: grant failed ({resp.status_code}): {resp.text[:200]}") + return False + + print(f" Error: SP {sp_name} not found after {max_retries} retries, skipping grant") + return False + + +def grant_sp_volume_access(host: str, auth_token: str, app_result: dict) -> None: + """Grant the app's SP read access to the coda-wheels UC Volume. + + Uses the Unity Catalog permissions API to grant USE CATALOG, USE SCHEMA, + and READ_VOLUME to the child app's service principal. 
+ """ + sp_name = app_result.get("service_principal_name", "") + if not sp_name: + return + + catalog = WHEELS_VOLUME_CATALOG + schema = WHEELS_VOLUME_SCHEMA + volume = WHEELS_VOLUME_NAME + headers = { + "Authorization": f"Bearer {auth_token}", + "Content-Type": "application/json", + } + + grants = [ + ("catalog", catalog, ["USE_CATALOG"]), + ("schema", f"{catalog}.{schema}", ["USE_SCHEMA"]), + ("volume", f"{catalog}.{schema}.{volume}", ["READ_VOLUME"]), + ] + + for securable_type, full_name, privileges in grants: + _grant_with_retry(host, headers, securable_type, full_name, privileges, sp_name) + + +def list_spawned_apps(host: str, oauth_token: str) -> list: + """List all coding-agents apps (excluding the spawner itself).""" + resp = requests.get( + f"{host}/api/2.0/apps", + headers={"Authorization": f"Bearer {oauth_token}"}, + ) + resp.raise_for_status() + all_apps = resp.json().get("apps", []) + + result = [] + seen_names = set() + for a in all_apps: + name = a["name"] + if name == "coding-agents-spawner": + continue + if not (name.startswith("coding-agents-") or name.startswith("coda-")): + continue + seen_names.add(name) + job = _provision_jobs.get(name) + if job and job["status"] == "in_progress": + last_step = job["steps"][-1] if job["steps"] else {} + state = f"PROVISIONING: {last_step.get('message', '...')}" + else: + compute = a.get("compute_status", {}).get("state", "") + deploy = a.get("active_deployment", {}).get("status", {}).get("state", "") + if compute == "ACTIVE" and deploy == "SUCCEEDED": + state = "RUNNING" + elif deploy == "IN_PROGRESS": + state = "DEPLOYING" + elif compute == "ACTIVE": + state = "DEPLOYED" + elif not a.get("active_deployment"): + state = "NOT DEPLOYED" + else: + state = compute or "UNKNOWN" + result.append( + { + "name": name, + "url": a.get("url", ""), + "creator": a.get("creator", ""), + "state": state, + "compute": a.get("compute_status", {}).get("state", "UNKNOWN"), + "created": a.get("create_time", ""), + } + ) + + # Include in-flight jobs not yet in the API + for name, job in _provision_jobs.items(): + if name not in seen_names and job["status"] == "in_progress": + last_step = job["steps"][-1] if job["steps"] else {} + result.append( + { + "name": name, + "url": "", + "creator": job.get("email", ""), + "state": f"PROVISIONING: {last_step.get('message', '...')}", + "compute": "PENDING", + "created": "", + } + ) + + return result + + +def check_existing_app(host: str, oauth_token: str, app_name: str) -> dict: + """Check if an app already exists.""" + resp = requests.get( + f"{host}/api/2.0/apps/{app_name}", + headers={"Authorization": f"Bearer {oauth_token}"}, + ) + if resp.status_code == 200: + data = resp.json() + return { + "deployed": True, + "app_name": app_name, + "app_url": data.get("url", ""), + "state": data.get("app_status", {}).get("state", "UNKNOWN"), + "service_principal_id": data.get("service_principal_id"), + "service_principal_client_id": data.get("service_principal_client_id"), + "service_principal_name": data.get("service_principal_name"), + } + return {"deployed": False} + + +def find_existing_app_for_email(host: str, oauth_token: str, email: str) -> dict: + """Check if user already has an app under any known prefix (coda- or coding-agents-).""" + username = email.split("@")[0] + slug = username.replace(".", "-").replace("_", "-").lower() + for prefix in ("coda-", "coding-agents-"): + candidate = f"{prefix}{slug}" + result = check_existing_app(host, oauth_token, candidate) + if result.get("deployed"): + return result + return 
{"deployed": False} + + +def _update_job(app_name: str, **kwargs): + """Thread-safe update of a provision job's state.""" + with _provision_lock: + if app_name in _provision_jobs: + _provision_jobs[app_name].update(kwargs) + + +def _add_step(app_name: str, step: int, status: str, message: str): + """Thread-safe append of a step to a provision job.""" + entry = {"step": step, "status": status, "message": message} + with _provision_lock: + if app_name in _provision_jobs: + _provision_jobs[app_name]["steps"].append(entry) + + +def provision_app_async(host: str, admin_token: str, email: str, app_name: str): + """Run provisioning in a background thread, updating _provision_jobs as it goes. + + Zero-PAT flow: uses the admin SP token for all operations. The user's email + (from SSO) is stored in the app description for owner resolution. + Uses a semaphore to limit concurrent API calls and avoid 429s. + """ + source_code_path = "/Workspace/Shared/apps/coding-agents" + + _add_step(app_name, 0, "queued", "Waiting for slot...") + _provision_semaphore.acquire() + try: + # Step 1: Create app (with owner in description) + _add_step(app_name, 1, "creating_app", f"Creating app '{app_name}'...") + app_result = create_app(host, admin_token, app_name, email) + sp_client_id = app_result.get("service_principal_client_id", "") + + # Step 2: Grant user CAN_MANAGE + SP access to UC Volume + _add_step( + app_name, 2, "granting_access", "Granting permissions..." + ) + grant_user_app_permissions(host, admin_token, app_name, email) + if sp_client_id: + grant_sp_volume_access(host, admin_token, app_result) + + # Step 3: Wait for compute + _add_step( + app_name, + 3, + "waiting_for_compute", + "Waiting for compute to be ready (60-90s)...", + ) + wait_for_compute_active(host, admin_token, app_name) + + # Step 4: Deploy + _add_step(app_name, 4, "deploying", "Deploying app...") + deploy_app(host, admin_token, app_name, source_code_path) + + # Step 5: Wait for app to be running + _add_step(app_name, 5, "starting", "Waiting for app to start...") + _wait_for_app_running(host, admin_token, app_name) + + app_url = app_result.get("url", app_result.get("app_url", "")) + _add_step(app_name, 6, "complete", "App is running!") + _update_job(app_name, status="complete", app_url=app_url) + + except Exception as exc: + _add_step(app_name, -1, "error", str(exc)) + _update_job(app_name, status="error", error=str(exc)) + finally: + _provision_semaphore.release() + + +def _wait_for_app_running( + host: str, token: str, app_name: str, timeout: int = 300, interval: int = 10 +): + """Poll until app_status reaches RUNNING.""" + headers = {"Authorization": f"Bearer {token}"} + elapsed = 0 + while elapsed < timeout: + resp = requests.get(f"{host}/api/2.0/apps/{app_name}", headers=headers) + if resp.ok: + state = resp.json().get("app_status", {}).get("state", "") + if state == "RUNNING": + return + time.sleep(interval) + elapsed += interval + raise RuntimeError(f"Timed out waiting for app to reach RUNNING after {timeout}s") + + +# --- Flask Routes --- + + +@app.route("/") +def index(): + """Serve the spawner UI with user context injected via data attributes.""" + import html as html_mod + + email = (request.headers.get("X-Forwarded-Email") or "unknown").lower() + app_name = app_name_from_email(email) if email != "unknown" else "coding-agents-you" + + index_path = os.path.join(os.path.dirname(__file__), "static", "index.html") + with open(index_path) as f: + page = f.read() + + page = page.replace( + "", + f'', + ) + return page + + 
+@app.route("/health") +def health(): + """Health check endpoint.""" + return jsonify({"status": "ok"}) + + +@app.route("/api/status") +def api_status(): + """Check if user already has a deployed instance.""" + email = (request.headers.get("X-Forwarded-Email") or "").lower() + host = DATABRICKS_HOST + + app_name = app_name_from_email(email) + result = check_existing_app(host, get_admin_token(), app_name) + return jsonify(result) + + +@app.route("/api/apps") +def api_list_apps(): + """List all spawned coding-agents apps (with in-flight provision status merged).""" + host = DATABRICKS_HOST + try: + admin_token = get_admin_token() + except RuntimeError as e: + return jsonify({"error": str(e)}), 500 + apps = list_spawned_apps(host, admin_token) + return jsonify({"apps": apps}) + + +@app.route("/api/provision", methods=["POST"]) +def api_provision(): + """Start provisioning in background. No PAT required — uses SSO identity. + + Returns immediately with app_name to poll via /api/provision-status/. + """ + host = DATABRICKS_HOST + + try: + admin_token = get_admin_token() + except RuntimeError as e: + return jsonify({"success": False, "error": str(e)}), 500 + + # Identity: use email from POST body (admin provisioning for another user), + # fall back to SSO header (self-provisioning) + body = request.get_json(silent=True) or {} + email = (body.get("email") or request.headers.get("X-Forwarded-Email", "")).strip().lower() + if not email: + return jsonify({"success": False, "error": "No user identity provided"}), 400 + + app_name = app_name_from_email(email) + + # Check if already running under any prefix (coda- or coding-agents-) + existing = find_existing_app_for_email(host, admin_token, email) + if existing.get("deployed") and existing.get("state") == "RUNNING": + return jsonify( + { + "success": True, + "app_name": existing.get("app_name", app_name), + "app_url": existing.get("app_url", ""), + "already_running": True, + } + ) + + # Check if already provisioning + with _provision_lock: + existing_job = _provision_jobs.get(app_name) + if existing_job and existing_job["status"] == "in_progress": + return jsonify( + {"success": True, "app_name": app_name, "already_in_progress": True} + ) + + _provision_jobs[app_name] = { + "steps": [ + { + "step": 0, + "status": "starting", + "message": f"Provisioning for {email}...", + } + ], + "status": "in_progress", + "app_url": "", + "app_name": app_name, + "email": email, + } + + thread = threading.Thread( + target=provision_app_async, + args=(host, admin_token, email, app_name), + daemon=True, + ) + thread.start() + + return jsonify({"success": True, "app_name": app_name}) + + +@app.route("/api/provision-bulk", methods=["POST"]) +def api_provision_bulk(): + """Provision apps for multiple users in parallel. + + Accepts {"emails": ["a@example.com", "b@example.com", ...]}. + Kicks off a background thread per user (reuses the single-user flow) + and returns the list of app names to poll via /api/provision-status/. 
+ """ + try: + admin_token = get_admin_token() + except RuntimeError as e: + return jsonify({"success": False, "error": str(e)}), 500 + + body = request.get_json(silent=True) or {} + emails = body.get("emails", []) + if not emails or not isinstance(emails, list): + return jsonify({"success": False, "error": "Provide a list of emails"}), 400 + + host = DATABRICKS_HOST + results = [] + + for email in emails: + email = email.strip().lower() + if not email: + continue + app_name = app_name_from_email(email) + + # Skip if already running under any prefix, or already provisioning + existing = find_existing_app_for_email(host, admin_token, email) + if existing.get("deployed") and existing.get("state") == "RUNNING": + results.append({"email": email, "app_name": existing.get("app_name", app_name), "status": "already_running"}) + continue + + with _provision_lock: + existing_job = _provision_jobs.get(app_name) + if existing_job and existing_job["status"] == "in_progress": + results.append({"email": email, "app_name": app_name, "status": "already_in_progress"}) + continue + + _provision_jobs[app_name] = { + "steps": [{"step": 0, "status": "starting", "message": f"Provisioning for {email}..."}], + "status": "in_progress", + "app_url": "", + "app_name": app_name, + "email": email, + } + + thread = threading.Thread( + target=provision_app_async, + args=(host, admin_token, email, app_name), + daemon=True, + ) + thread.start() + results.append({"email": email, "app_name": app_name, "status": "started"}) + + return jsonify({"success": True, "apps": results}) + + +@app.route("/api/provision-status/") +def api_provision_status(app_name): + """Poll endpoint for provision progress.""" + with _provision_lock: + job = _provision_jobs.get(app_name) + if not job: + return jsonify({"found": False}) + return jsonify({"found": True, **job}) + + +# In-memory redeploy-all job tracker +_redeploy_job: dict | None = None +_redeploy_lock = threading.Lock() + + +def _deploy_with_backoff( + host: str, + headers: dict, + app_name: str, + source_code_path: str, + max_retries: int = 5, + base_delay: float = 2.0, +) -> requests.Response: + """Deploy a single app with exponential backoff on 429 (rate limit) responses.""" + for attempt in range(max_retries): + resp = requests.post( + f"{host}/api/2.0/apps/{app_name}/deployments", + headers=headers, + json={"source_code_path": source_code_path}, + ) + if resp.status_code != 429: + return resp + delay = min(base_delay * (2**attempt), 60) + print(f" Rate limited deploying {app_name}, retrying in {delay:.0f}s " + f"(attempt {attempt + 1}/{max_retries})") + time.sleep(delay) + return resp # Return last 429 response if all retries exhausted + + +# Max concurrent deploy requests to avoid hitting API rate limits +_REDEPLOY_BATCH_SIZE = 3 +_REDEPLOY_BATCH_DELAY = 2.0 # seconds between batches + + +def redeploy_all_apps(host: str, admin_token: str): + """Redeploy all coding-agents-* apps from the shared template. + + Deploys in batches of 3 with a delay between batches to avoid API + rate limiting (429). Individual deploys retry with exponential backoff. 
+
+
+def redeploy_all_apps(host: str, admin_token: str):
+    """Redeploy all coding-agents-* apps from the shared template.
+
+    Deploys in batches of 3 with a delay between batches to avoid API
+    rate limiting (429). Individual deploys retry with exponential backoff.
+    """
+    global _redeploy_job
+    source_code_path = "/Workspace/Shared/apps/coding-agents"
+    headers = {"Authorization": f"Bearer {admin_token}"}
+
+    try:
+        resp = requests.get(f"{host}/api/2.0/apps", headers=headers)
+        resp.raise_for_status()
+        all_apps = resp.json().get("apps", [])
+        targets = [
+            a
+            for a in all_apps
+            if a["name"].startswith("coding-agents-")
+            and a["name"] != "coding-agents-spawner"
+        ]
+
+        with _redeploy_lock:
+            _redeploy_job["total"] = len(targets)
+            _redeploy_job["apps"] = [
+                {"name": a["name"], "status": "pending"} for a in targets
+            ]
+
+        for i, a in enumerate(targets):
+            name = a["name"]
+            with _redeploy_lock:
+                _redeploy_job["apps"][i]["status"] = "deploying"
+                _redeploy_job["completed"] = i
+
+            try:
+                deploy_resp = _deploy_with_backoff(
+                    host, headers, name, source_code_path
+                )
+                if deploy_resp.ok:
+                    with _redeploy_lock:
+                        _redeploy_job["apps"][i]["status"] = "deployed"
+                else:
+                    with _redeploy_lock:
+                        _redeploy_job["apps"][i]["status"] = "error"
+                        _redeploy_job["apps"][i]["error"] = deploy_resp.text[:200]
+            except Exception as exc:
+                with _redeploy_lock:
+                    _redeploy_job["apps"][i]["status"] = "error"
+                    _redeploy_job["apps"][i]["error"] = str(exc)[:200]
+
+            # Pause between batches to stay under rate limits.
+            if (i + 1) % _REDEPLOY_BATCH_SIZE == 0 and i + 1 < len(targets):
+                time.sleep(_REDEPLOY_BATCH_DELAY)
+
+        with _redeploy_lock:
+            _redeploy_job["completed"] = len(targets)
+            _redeploy_job["status"] = "complete"
+
+    except Exception as exc:
+        with _redeploy_lock:
+            _redeploy_job["status"] = "error"
+            _redeploy_job["error"] = str(exc)
+
+
+@app.route("/api/redeploy-all", methods=["POST"])
+def api_redeploy_all():
+    """Trigger redeployment of all spawned coding-agents apps from the shared template."""
+    global _redeploy_job
+
+    try:
+        admin_token = get_admin_token()
+    except RuntimeError as e:
+        return jsonify({"error": str(e)}), 500
+
+    with _redeploy_lock:
+        if _redeploy_job and _redeploy_job.get("status") == "in_progress":
+            return jsonify({"error": "Redeploy already in progress"}), 409
+
+        _redeploy_job = {
+            "status": "in_progress",
+            "total": 0,
+            "completed": 0,
+            "apps": [],
+            "error": None,
+            "started_at": time.time(),
+        }
+
+    thread = threading.Thread(
+        target=redeploy_all_apps,
+        args=(DATABRICKS_HOST, admin_token),
+        daemon=True,
+    )
+    thread.start()
+    return jsonify({"success": True})
+
+
+@app.route("/api/redeploy-all/status")
+def api_redeploy_all_status():
+    """Poll endpoint for redeploy-all progress."""
+    with _redeploy_lock:
+        if not _redeploy_job:
+            return jsonify({"active": False})
+        return jsonify({"active": True, **_redeploy_job})
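+
+# For reference (values illustrative), a mid-run /api/redeploy-all/status
+# response looks like:
+#   {"active": true, "status": "in_progress", "total": 12, "completed": 4,
+#    "apps": [{"name": "coding-agents-alice", "status": "deployed"}, ...],
+#    "error": null, "started_at": 1700000000.0}
+# where "coding-agents-alice" is a hypothetical app name.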
+
+
+# In-memory stop-all job tracker
+_stop_job: dict | None = None
+_stop_lock = threading.Lock()
+
+_STOP_BATCH_SIZE = 3
+_STOP_BATCH_DELAY = 2.0  # seconds between batches
+
+
+def stop_all_apps(host: str, admin_token: str):
+    """Stop all coding-agents / coda- apps (excluding the spawner itself)."""
+    global _stop_job
+    headers = {"Authorization": f"Bearer {admin_token}"}
+
+    try:
+        resp = requests.get(f"{host}/api/2.0/apps", headers=headers)
+        resp.raise_for_status()
+        all_apps = resp.json().get("apps", [])
+        targets = [
+            a
+            for a in all_apps
+            if (a["name"].startswith("coding-agents-") or a["name"].startswith("coda-"))
+            and a["name"] != "coding-agents-spawner"
+            and a["name"] != "coding-agent-spawner"
+            and a.get("compute_status", {}).get("state") == "ACTIVE"
+        ]
+
+        with _stop_lock:
+            _stop_job["total"] = len(targets)
+            _stop_job["apps"] = [
+                {"name": a["name"], "status": "pending"} for a in targets
+            ]
+
+        for i, a in enumerate(targets):
+            name = a["name"]
+            with _stop_lock:
+                _stop_job["apps"][i]["status"] = "stopping"
+                _stop_job["completed"] = i
+
+            try:
+                stop_resp = requests.post(
+                    f"{host}/api/2.0/apps/{name}/stop",
+                    headers=headers,
+                )
+                if stop_resp.ok:
+                    with _stop_lock:
+                        _stop_job["apps"][i]["status"] = "stopped"
+                else:
+                    with _stop_lock:
+                        _stop_job["apps"][i]["status"] = "error"
+                        _stop_job["apps"][i]["error"] = stop_resp.text[:200]
+            except Exception as exc:
+                with _stop_lock:
+                    _stop_job["apps"][i]["status"] = "error"
+                    _stop_job["apps"][i]["error"] = str(exc)[:200]
+
+            # Pause between batches to stay under rate limits.
+            if (i + 1) % _STOP_BATCH_SIZE == 0 and i + 1 < len(targets):
+                time.sleep(_STOP_BATCH_DELAY)
+
+        with _stop_lock:
+            _stop_job["completed"] = len(targets)
+            _stop_job["status"] = "complete"
+
+    except Exception as exc:
+        with _stop_lock:
+            _stop_job["status"] = "error"
+            _stop_job["error"] = str(exc)
+
+
+@app.route("/api/stop-all", methods=["POST"])
+def api_stop_all():
+    """Stop all spawned coding-agents / coda- apps."""
+    global _stop_job
+
+    try:
+        admin_token = get_admin_token()
+    except RuntimeError as e:
+        return jsonify({"error": str(e)}), 500
+
+    with _stop_lock:
+        if _stop_job and _stop_job.get("status") == "in_progress":
+            return jsonify({"error": "Stop-all already in progress"}), 409
+
+        _stop_job = {
+            "status": "in_progress",
+            "total": 0,
+            "completed": 0,
+            "apps": [],
+            "error": None,
+            "started_at": time.time(),
+        }
+
+    thread = threading.Thread(
+        target=stop_all_apps,
+        args=(DATABRICKS_HOST, admin_token),
+        daemon=True,
+    )
+    thread.start()
+    return jsonify({"success": True})
+
+
+@app.route("/api/stop-all/status")
+def api_stop_all_status():
+    """Poll endpoint for stop-all progress."""
+    with _stop_lock:
+        if not _stop_job:
+            return jsonify({"active": False})
+        return jsonify({"active": True, **_stop_job})
+
+
+if __name__ == "__main__":
+    app.run(host="0.0.0.0", port=8001)
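+
+# Note: running this module directly starts Flask's dev server on port 8001;
+# the deployed app is served by gunicorn instead, as configured in spawner/app.yaml.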
diff --git a/spawner/app.yaml b/spawner/app.yaml
new file mode 100644
index 0000000..d13d5ae
--- /dev/null
+++ b/spawner/app.yaml
@@ -0,0 +1,7 @@
+command:
+  - gunicorn
+  - --timeout
+  - "300"
+  - app:app
+  # Auth: spawner uses its own SP via OAuth M2M.
+  # Grant the spawner's SP workspace admin permissions.
diff --git a/spawner/requirements.txt b/spawner/requirements.txt
new file mode 100644
index 0000000..6262523
--- /dev/null
+++ b/spawner/requirements.txt
@@ -0,0 +1,3 @@
+flask>=3.0.3
+gunicorn>=23.0.0
+requests>=2.31.0
diff --git a/spawner/static/index.html b/spawner/static/index.html
new file mode 100644
index 0000000..d43fd0c
--- /dev/null
+++ b/spawner/static/index.html
@@ -0,0 +1,683 @@
+<!-- Full markup (683 lines) not recoverable from this copy of the diff;
+     the surviving text content is summarized here.
+     Page: "Coding Agents | Databricks", header brand "Coding Agents".
+     Hero: "Deploy Your Coding Agent" with the tagline "Get a personal
+     AI-powered coding environment on Databricks. One click and we'll
+     provision everything for you."
+     Card 1 (Deploy / One-click): email input with a live app-name preview
+     ("App will be named ...") and the hint "No tokens needed - change the
+     email to provision for another user."
+     Card 2 (Bulk Deploy / Paste list): textarea with the hint "Paste
+     attendee emails from Outlook, Teams, or a spreadsheet. Names and angle
+     brackets are stripped automatically."
+     Card 3 (Spawned Apps): app list with a "Loading..." placeholder.
+     Plus inline CSS and the JavaScript that drives the /api endpoints. -->
diff --git a/static/index.html b/static/index.html
index d27965e..b5190a4 100644
--- a/static/index.html
+++ b/static/index.html
@@ -9,7 +9,7 @@