diff --git a/README.md b/README.md index 030ee52..3eeaa37 100644 --- a/README.md +++ b/README.md @@ -29,27 +29,21 @@ A **Databricks-native document intelligence + agent** stack: parse PDFs once wit [2], regulation [3]…" ``` +For motivation, architecture diagrams, the Spec-Kit + Claude Code build workflow, and the chicken-egg deploy-ordering story, see [**`docs/design.md`**](./docs/design.md). For day-2 ops, see [**`docs/runbook.md`**](./docs/runbook.md). + --- ## Table of contents -- [Why this exists](#why-this-exists) - [Features](#features) - [Readiness levels](#readiness-levels) - [Prerequisites](#prerequisites) - - [Software](#software) - - [Databricks workspace](#databricks-workspace) - - [Free trial signup](#free-trial-signup) - [Getting started](#getting-started) -- [Architecture](#architecture) -- [How it's built — three pillars](#how-its-built--three-pillars) -- [Deploy ordering: foundation → consumers](#deploy-ordering-foundation--consumers) - [CLEARS quality gate](#clears-quality-gate) - [Configuration](#configuration) - [Testing & validation](#testing--validation) - [Deployment](#deployment) - [Repo layout](#repo-layout) -- [What you can learn from this repo](#what-you-can-learn-from-this-repo) - [Limitations](#limitations) - [Contributing](#contributing) - [Security](#security) @@ -58,14 +52,6 @@ A **Databricks-native document intelligence + agent** stack: parse PDFs once wit --- -## Why this exists - -Databricks shipped a lot of new generative-AI surface area in 2025–2026: `ai_parse_document`, Mosaic AI Vector Search, the Agent Framework, AI Gateway, Lakebase, Databricks Apps. Tutorials show each piece in isolation; nobody shows them wired together with **eval gates, governance, and reproducible deploys** the way you'd actually ship to analysts. - -This repo is that worked example. Drop a PDF into a governed UC volume; ten minutes later, an analyst can ask cited questions in plain English with end-to-end audit. The whole stack is described declaratively as one **Databricks Asset Bundle (DAB)** plus a small bootstrap script. DAB manages catalog/schema/volume, pipeline, jobs, the Vector Search **endpoint**, the Lakebase instance, the serving endpoint, the monitor, the app, and the dashboard; the Vector Search **index** itself is created and synced by `jobs/index_refresh/sync_index.py` (DAB doesn't yet manage indexes as a resource type), and the agent model version is registered by `agent/log_and_register.py`. The bootstrap script orchestrates them in the right order. - -It also demonstrates a development workflow: **Spec-Kit** for spec-driven design, **Claude Code** with Databricks skill bundles for AI-assisted implementation, six **non-negotiable constitution principles** that gate every plan. See [How it's built](#how-its-built--three-pillars). - ## Features - **End-to-end document intelligence pipeline** — Auto Loader ingest → `ai_parse_document` → section explosion → `ai_classify` + `ai_extract` → 5-dim quality rubric → Vector Search Delta-Sync index (the endpoint is DAB-managed; the index is created/synced by `jobs/index_refresh/sync_index.py`). SQL-only pipeline (Lakeflow Spark Declarative Pipelines). @@ -181,7 +167,7 @@ DOCINTEL_WAREHOUSE_ID= \ ./scripts/bootstrap-dev.sh ``` -The script handles the chicken-egg ordering automatically — see [Deploy ordering](#deploy-ordering-foundation--consumers). +The script handles the chicken-egg ordering automatically — see [`docs/design.md` § Deploy ordering](./docs/design.md#deploy-ordering-foundation--consumers). ### 5. 
Run the eval gate @@ -228,323 +214,9 @@ For a guided 30-minute tour, see [`specs/001-doc-intel-10k/quickstart.md`](./spe --- -## Architecture - -### Two halves: an offline pipeline, and an online agent - -``` - ╔═══════════════════════════════════════════════════════════════════╗ - ║ pipelines/sql/ (one SQL file per tier) ║ - ╚═══════════════════════════════════════════════════════════════════╝ - - raw_filings/ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐ - ACME_10K.pdf ──▶ │ bronze_filings │──▶│ silver_parsed_ │──▶│ gold_filing_ │ - BETA_10K.pdf │ (raw bytes, │ │ filings (parsed │ │ sections (one │ - GAMMA_10K.pdf │ filename, │ │ VARIANT — │ │ row per parsed │ - │ ingested_at) │ │ ai_parse_ │ │ $.sections[*]; │ - │ │ │ document) │ │ fallback to │ - │ >50MB rejects: │ │ │ │ full_document │ - │ bronze_filings │ │ Status: ok / │ │ if absent) │ - │ _rejected │ │ partial / error │ │ │ - └─────────────────┘ └─────────────────┘ │ gold_filing_kpis │ - 01_bronze.sql 02_silver_parse │ (typed columns: │ - .sql │ segment_revenue │ - │ ARRAY, │ - │ top_risks │ - │ ARRAY) │ - └──────────────────┘ - 03_gold_classify - _extract.sql - │ - ▼ - ┌──────────────────┐ - │ gold_filing_ │ - │ quality │ - │ (5-dim rubric: │ - │ parse, layout, │ - │ ocr, sections, │ - │ kpi → 0-30) │ - └──────────────────┘ - 04_gold_quality.sql -``` - -**Key idea — "parse once, extract many":** PDFs are expensive to parse. Silver runs `ai_parse_document` exactly once per file and stores the structured result as a `VARIANT`. Everything downstream — classification, KPI extraction, summarization, quality scoring — reads the parsed output, never the raw bytes. This is a non-negotiable constitution principle. - -**Triggering**: prod runs the pipeline in `continuous: true` mode so Auto Loader (`read_files`) reacts to new PDFs in the volume automatically. Dev overrides to `continuous: false` to avoid a 24/7 cluster during smoke iterations. See `resources/foundation/doc_intel.pipeline.yml` and the dev override block in `databricks.yml`. - -### Vector Search bridges data and agent - -``` - gold_filing_sections ┌─────────────────────────┐ - (governed Delta table) ─────▶ │ Mosaic AI Vector │ - │ Search Index │ - Filter: embed_eligible=true │ (Delta-Sync — auto- │ - Embed column: "summary" │ refreshes when Gold │ - │ updates) │ - └─────────────────────────┘ - - Why "summary" not the raw text? - ───────────────────────────── - Embedding a 50-page 10-K verbatim is noisy. We embed an LLM-written - summary instead — tighter, more searchable. Constitution principle IV: - "Quality before retrieval." -``` - -**Ownership note**: DAB manages the Vector Search **endpoint** (`resources/consumers/filings_index.yml`) and the index-refresh **job** (`resources/consumers/index_refresh.job.yml`). The **index** itself isn't yet a DAB-managed resource type as of CLI 0.298 — `jobs/index_refresh/sync_index.py` creates the Delta-Sync index on first run and triggers a sync on subsequent runs. That's why the bootstrap script's stage-2 deploy creates the endpoint + job, and the job's first execution materializes the actual index. - -### Agent has two paths, one endpoint - -``` - User question - │ - ▼ - ┌────────────────────────────────────────────┐ - │ AnalystAgent.predict() │ - │ ───────────────────── │ - │ contains "compare" / "vs" / │ - │ "between" + ≥2 company names? │ - └────────────┬─────────────────┬─────────────┘ - │ no │ yes - ▼ ▼ - ┌──────────────────────┐ ┌──────────────────────┐ - │ Single-filing path │ │ Supervisor path │ - │ │ │ │ - │ 1. 
Hybrid search │ │ For each company: │ - │ (keyword + vec) │ │ ▸ run analyst path │ - │ 2. Re-rank → top 5 │ │ ▸ pull KPIs from │ - │ 3. LLM generates │ │ gold_filing_kpis │ - │ answer w/ [1] [2] │ │ Format markdown │ - │ citations │ │ table with cites. │ - └──────────────────────┘ └──────────────────────┘ - │ │ - └────────┬────────┘ - ▼ - ┌──────────────────────┐ - │ Response JSON: │ - │ answer │ - │ citations[] │ - │ grounded: bool │ - │ latency_ms │ - └──────────────────────┘ -``` - -The agent is an `mlflow.pyfunc` model registered in Unity Catalog and served behind an **AI Gateway** (rate limiting per-user, usage tracking, inference-table audit). Identity passthrough is implemented at the *App layer* when the workspace has Databricks Apps user-token passthrough enabled: the Streamlit app extracts the user's `x-forwarded-access-token` header and constructs a user-scoped `WorkspaceClient`. The served model is OBO-ready via MLflow `auth_policy` and Model Serving user credentials. If app-level passthrough is not enabled, the app falls back to service-principal auth and the repo must be treated as a reference/dev deployment, not a production row-level-security deployment. See [`SECURITY.md`](./SECURITY.md) and `app/README.md`. - -### Runtime stack - -``` - ┌──────────────────────────────────────────────────────────────────┐ - │ │ - │ Databricks App (Streamlit) ← user interacts here │ - │ app/app.py │ - │ │ - │ ┌────────────────┐ ┌──────────────────┐ │ - │ │ Chat input box │ │ Citation chips │ │ - │ │ Thumbs up/down │ │ Markdown tables │ │ - │ └────────┬───────┘ └─────┬────────────┘ │ - │ │ │ │ - └──────────────│─────────────────│─────────────────────────────────┘ - │ │ - │ query │ feedback writes - ▼ ▼ - ┌────────────────────────┐ ┌────────────────────────┐ - │ Model Serving endpoint │ │ Lakebase Postgres │ - │ "analyst-agent-dev" │ │ ───────────────── │ - │ (CPU, scales to 0) │ │ conversation_history │ - │ │ │ query_logs │ - │ + AI Gateway: │ │ feedback │ - │ rate limit │ │ │ - │ (per-user key) │ │ (Postgres for tiny │ - │ inference-table │ │ per-turn writes — │ - │ audit │ │ Delta isn't great │ - │ usage tracking │ │ at row-by-row) │ - └────────────────────────┘ └────────────────────────┘ - - OBO (user identity end-to-end, when enabled): - ────────────────────────────── - App reads `x-forwarded-access-token` from the request, builds - `WorkspaceClient(token=...)`, calls the serving endpoint with the - user's identity. The agent-side MLflow auth policy and Model Serving - OBO credentials let downstream calls run as the user. If the app-side - feature is unavailable, the bootstrap script prints an explicit warning - and the deployment remains reference/dev only. -``` - -**Why Postgres for state?** Delta tables are great for analytics but bad at "insert one tiny row per chat turn at high frequency." Lakebase is Databricks's managed Postgres — same governance, right tool for the job. - ---- - -## How it's built — three pillars - -This repo is a worked example of combining three things that, together, change how you ship Databricks projects. - -### Pillar 1 — Spec-Kit (spec-driven development) - -[Spec-Kit](https://github.com/github/spec-kit) is a workflow that forces you to write — and *clarify* — a specification before writing code. 
Each phase is a slash-command in Claude Code that produces a checked-in artifact: - -``` - /speckit-specify → specs//spec.md What & why (no how) - │ - ▼ - /speckit-clarify → appended Q&A in spec.md Resolve ambiguity - │ - ▼ - /speckit-plan → specs//plan.md Tech stack + structure - │ + research.md, data-model.md, - │ contracts/, quickstart.md - ▼ - /speckit-tasks → specs//tasks.md Dependency-ordered tasks - │ - ▼ - /speckit-analyze → cross-artifact consistency check - │ - ▼ - /speckit-implement → the actual code -``` - -`.specify/extensions.yml` auto-commits at each phase boundary so the trail is clean. `.specify/memory/constitution.md` defines six **non-negotiable principles** every plan must respect: - -| # | Principle | What it means | -|---|---|---| -| I | **Unity Catalog source of truth** | Every table, volume, model, index, endpoint lives under `.` — no DBFS, no workspace-local resources | -| II | **Parse once, extract many** | `ai_parse_document` runs once at Silver → VARIANT; everything downstream reads the parsed output | -| III | **Declarative over imperative** | SDP SQL pipelines, Lakeflow Jobs, DAB resources — no production notebooks | -| IV | **Quality before retrieval** | 5-dim rubric scores every section; only ≥22/30 reach the index. Embed `summary`, not raw text | -| V | **Eval-gated agents** | MLflow CLEARS scores must clear thresholds before any deploy is considered complete | -| VI | **Reproducible deploys** | `databricks bundle deploy -t ` recreates the entire stack; `dev` and `prod` parity enforced | - -When you read `specs/001-doc-intel-10k/plan.md` you'll see a "Constitution Check" gate that maps each design decision back to the principle it satisfies. When you read `specs/001-doc-intel-10k/tasks.md` you'll see how each task derives from the plan, and how user-stories (P1, P2, P3) are independently demoable. - -### Pillar 2 — Databricks Asset Bundles + the Claude Code skill suite - -[**Databricks Asset Bundles**](https://docs.databricks.com/aws/en/dev-tools/bundles/) (DABs) describe most of the workspace state as YAML. One root `databricks.yml` declares variables and targets (`dev`, `prod`); `resources/**/*.yml` declares each resource (pipeline, jobs, Vector Search endpoint, index-refresh job, serving endpoint, app, monitor, dashboard, Lakebase instance + catalog). `databricks bundle deploy -t dev` reconciles workspace state to YAML. The two non-DAB-managed pieces — the Vector Search **index** itself and the registered **model version** — are produced at runtime by `jobs/index_refresh/sync_index.py` and `agent/log_and_register.py` respectively, which the bootstrap script orchestrates. - -This repo was built with Databricks-specific Claude Code skill bundles. Those bundles are distributed by Databricks via the CLI / Claude Code plugin channel and **are not vendored in this open-source tree** — install them locally if you have access, or reference the canonical Databricks docs (mapping in [`CONTRIBUTING.md`](./CONTRIBUTING.md)). 
- -| Skill bundle | What it provides | Canonical docs | -|---|---|---| -| **databricks-core** | Auth, profiles, data exploration, bundle basics | [docs](https://docs.databricks.com/aws/en/dev-tools/cli/) | -| **databricks-dabs** | DAB structure, validation, deploy workflow, target separation | [docs](https://docs.databricks.com/aws/en/dev-tools/bundles/) | -| **databricks-pipelines** | Lakeflow Spark Declarative Pipelines (`ai_parse_document`, `ai_classify`, `ai_extract`, `APPLY CHANGES INTO`) | [docs](https://docs.databricks.com/aws/en/dlt/) | -| **databricks-jobs** | Lakeflow Jobs with retries, schedules, table-update / file-arrival triggers | [docs](https://docs.databricks.com/aws/en/jobs/) | -| **databricks-apps** | Databricks Apps (Streamlit), App resource bindings | [docs](https://docs.databricks.com/aws/en/dev-tools/databricks-apps/) | -| **databricks-lakebase** | Lakebase Postgres instances, branches, computes, endpoint provisioning | [docs](https://docs.databricks.com/aws/en/oltp/) | -| **databricks-model-serving** | Model Serving endpoints, AI Gateway, served entities, scaling config | [docs](https://docs.databricks.com/aws/en/machine-learning/model-serving/) | - -Skills are loaded by Claude Code on demand. When you ask Claude to "wire up Vector Search," it should read the Databricks pipeline/model-serving guidance *before* writing YAML, so the output reflects current Databricks API shapes — not stale training data. - -### Pillar 3 — Claude Code as the implementation surface - -Spec-Kit produces the specs. The Databricks skills provide platform expertise. **Claude Code orchestrates both**: every phase artifact and every code file in this repo was authored by prompting Claude Code with the spec/plan/tasks as context. - -The workflow looks like: - -1. `/speckit-specify` → Claude writes spec.md from a natural-language description, you iterate via `/speckit-clarify` until ambiguity is resolved. -2. `/speckit-plan` → Claude consults the constitution + Databricks skills, drafts plan.md with research decisions and architecture. -3. `/speckit-tasks` → Claude generates a dependency-ordered task list grouped by user story (P1, P2, P3). -4. `/speckit-implement` → Claude writes the actual SQL/Python/YAML, one task at a time, committing per task. -5. Operational loops: when the deploy hits unexpected issues (it always does), Claude reads the runbook, fixes the issue, updates the runbook, commits. - -The "AI-driven" part isn't "the AI did it for you" — it's "the AI carries the boring parts (boilerplate YAML, retry-loop scripts, dependency analysis) so you focus on the actually-hard parts (what the spec should say, what the constitution should require)." - ---- - -## Deploy ordering: foundation → consumers - -DABs deploy *everything in one shot*. But our resources have a chicken-and-egg problem on a fresh workspace: - -``` - ┌────────────────────────────────────────────────┐ - │ What "bundle deploy" tries to create: │ - │ │ - │ ▸ Pipeline ────┐ │ - │ ▸ Tables ────┼──── all need each other │ - │ ▸ Vector idx ───┤ │ - │ ▸ Model ───┤ Monitor wants the │ - │ ▸ Endpoint ────┤ KPI table to exist │ - │ ▸ App ───┤ BEFORE it can attach │ - │ ▸ Monitor ────┘ │ - │ ▸ Lakebase ──── │ - └────────────────────────────────────────────────┘ - - Endpoint needs a registered model version. - Model version needs the model logged. - Model logging needs the agent code. - Monitor needs the table populated. - Table needs the pipeline to run. - - ▶ Single `bundle deploy` → 4+ errors on a fresh workspace. 
-``` - -The fix is a **staged deploy** orchestrated by `scripts/bootstrap-dev.sh`. Resources are split into two directories by data dependency: - -``` - resources/ - ├── foundation/ ← no data deps — deploy first - │ ├── catalog.yml (schema + volume + grants) - │ ├── doc_intel.pipeline.yml - │ ├── retention.job.yml - │ └── lakebase_instance.yml - │ - └── consumers/ ← need foundation to be RUNNING and producing data - ├── agent.serving.yml (needs registered model version) - ├── kpi_drift.yml (needs gold_filing_kpis table) - ├── filings_index.yml (VS endpoint) - ├── index_refresh.job.yml (needs source table) - ├── analyst.app.yml (needs Lakebase + agent endpoint) - ├── usage.dashboard.yml - └── lakebase_catalog.yml (needs instance AVAILABLE) -``` - -**The bootstrap script auto-detects which mode to run** by checking whether the agent serving endpoint already has a populated config: - -``` - does analyst-agent-${target} have served entities? - │ - no ◀───────┴───────▶ yes - │ │ - ▼ ▼ - ┌──────────────────┐ ┌──────────────────┐ - │ FIRST-DEPLOY │ │ STEADY-STATE │ - │ (staged) │ │ (full deploy) │ - ├──────────────────┤ ├──────────────────┤ - │ 1. temp-rename │ │ 1. bundle deploy │ - │ consumers/* │ │ (full bundle) │ - │ .yml.skip │ │ │ - │ 2. bundle deploy │ │ 2. refresh data: │ - │ (foundation) │ │ upload, run │ - │ 3. produce data: │ │ pipeline, │ - │ upload, run, │ │ register new │ - │ register │ │ model version │ - │ model │ │ + repoint │ - │ 4. wait Lakebase │ │ serving in- │ - │ AVAILABLE │ │ place │ - │ 5. restore yamls │ │ │ - │ 6. bundle deploy │ │ │ - │ (full bundle) │ │ │ - └────────┬─────────┘ └────────┬─────────┘ - │ │ - └───────────┬───────────┘ - ▼ - ┌──────────────────────────┐ - │ Common to both: │ - │ • bundle run analyst_app│ - │ • UC grants chain │ - │ • smoke check │ - └──────────────────────────┘ -``` - -**Why two modes?** DAB tracks resource state; if you run the temp-rename trick against an *existing* deployment, DAB sees the consumer YAMLs as removed and plans to **delete** the serving endpoint, app, monitor, etc. Safe-ish on a fresh workspace; destructive in steady-state. The script detects mode and does the right thing. - -CI (`.github/workflows/deploy.yml`) assumes steady-state — the first-ever bring-up of a workspace must be done locally with `./scripts/bootstrap-dev.sh`. After that, every push to `main` runs the steady-state path: full `bundle deploy` → refresh data → repoint serving endpoint → grants → CLEARS gate. - -Full breakdown in [`docs/runbook.md`](./docs/runbook.md). - ---- - ## CLEARS quality gate -Before any deploy reaches production, an evaluation must pass. This is constitution principle V — eval-gated agents. +Before any deploy reaches production, an evaluation must pass (constitution principle V — eval-gated agents). 
``` evals/dataset.jsonl (30 questions: 20 single-filing P2 + 10 cross-company P3) @@ -652,93 +324,21 @@ For day-2 ops (rolling agent versions, debugging low quality scores, inspecting ``` databricks/ ├── databricks.yml # Bundle root — variables + dev/prod targets -├── README.md # This file -├── CLAUDE.md # Runtime guidance for Claude Code sessions -├── CONTRIBUTING.md # Contribution guidelines -├── SECURITY.md # Identity modes, OBO, grants -├── PRODUCTION_READINESS.md # Reference / Pilot / Production checklists -├── VALIDATION.md # Validation procedure with expected outputs -├── REAL_10K_PILOT.md # Real EDGAR pilot guidance -├── LICENSE # MIT -│ -├── pipelines/sql/ # Lakeflow SDP — Bronze → Silver → Gold (SQL) -│ ├── 01_bronze.sql # Auto Loader BINARYFILE ingest + size filter -│ ├── 02_silver_parse.sql # ai_parse_document → VARIANT -│ ├── 03_gold_classify_extract.sql # ai_classify + ai_extract → typed KPIs -│ └── 04_gold_quality.sql # 5-dim rubric → embed_eligible filter -│ -├── agent/ # Mosaic AI Agent Framework -│ ├── analyst_agent.py # mlflow.pyfunc model + routing -│ ├── retrieval.py # Hybrid search + re-rank + OBO VS client -│ ├── supervisor.py # Cross-company fan-out -│ ├── tools.py # UC Function tool over gold_filing_kpis -│ ├── _obo.py # On-behalf-of credentials helpers -│ ├── log_and_register.py # Register + auth_policy + alias -│ └── tests/ # pytest unit tests -│ -├── app/ # Streamlit App on Databricks Apps -│ ├── app.py # Chat UI + citations + thumbs feedback + OBO -│ ├── lakebase_client.py # psycopg writes to query_logs / feedback -│ ├── app.yaml # App runtime config (port, CORS, XSRF) -│ └── README.md # App-specific runtime + local-dev notes -│ -├── evals/ # MLflow CLEARS eval gate -│ ├── dataset.jsonl # 30 hand-authored questions (P2 + P3) -│ └── clears_eval.py # mlflow.evaluate(model_type="databricks-agent") -│ -├── jobs/ # Lakeflow Jobs Python tasks -│ ├── retention/prune_volume.py # 90-day raw PDF cleanup -│ └── index_refresh/sync_index.py # Vector Search SYNC INDEX -│ -├── resources/ # DAB resources, split by data dependency -│ ├── foundation/ # Stage 1 — no data deps -│ └── consumers/ # Stage 2 — depend on foundation data -│ -├── scripts/ # Operational scripts -│ ├── bootstrap-dev.sh # Fresh-workspace bring-up (staged deploy) -│ └── wait_for_kpis.py # Poll helper used by bootstrap + CI -│ -├── samples/ # Synthetic 10-Ks for smoke tests + eval -│ ├── synthesize.py # Reproducible PDF generator -│ ├── ACME_10K_2024.pdf -│ ├── BETA_10K_2024.pdf -│ ├── GAMMA_10K_2024.pdf -│ └── garbage_10K_2024.pdf # SC-006 negative test (low quality) -│ -├── specs/ # Spec-Kit artifacts -│ └── 001-doc-intel-10k/ -│ ├── spec.md # What & why -│ ├── plan.md # Tech stack + Constitution Check -│ ├── tasks.md # Dependency-ordered implementation tasks -│ ├── research.md # Decision log -│ ├── data-model.md # Entity → table mapping -│ ├── quickstart.md # 30-min deploy walkthrough -│ └── contracts/ # JSON schemas for KPIs + agent I/O -│ -├── docs/ -│ └── runbook.md # Day-2 ops + bring-up workflow -│ -├── .specify/ # Spec-Kit machinery (constitution, hooks) -│ ├── memory/constitution.md # Six non-negotiable principles -│ └── extensions.yml # Auto-commit hooks per phase -│ -└── .github/workflows/ - └── deploy.yml # PR validate; main → steady-state deploy + CLEARS gate - # (first-ever bring-up must be done locally via bootstrap-dev.sh) +├── pipelines/sql/ # Lakeflow SDP — Bronze → Silver → Gold (SQL only) +├── agent/ # Mosaic AI Agent Framework — pyfunc, retrieval, OBO +├── app/ # Streamlit on Databricks 
Apps + Lakebase client +├── evals/ # MLflow CLEARS eval gate (dataset + runner) +├── jobs/ # Lakeflow Jobs (retention, index refresh) +├── resources/foundation/ # DAB resources with no data deps +├── resources/consumers/ # DAB resources that depend on foundation data +├── scripts/ # bootstrap-dev.sh + helpers +├── samples/ # Synthetic 10-K PDFs (regenerable) +├── specs/001-doc-intel-10k/ # Spec-Kit artifacts (spec, plan, tasks, etc.) +├── docs/ # design.md (this repo's "why") + runbook.md (day-2 ops) +└── .specify/ # Spec-Kit machinery (constitution, hooks) ``` ---- - -## What you can learn from this repo - -- **How to wire `ai_parse_document` into Lakeflow SDP** — pattern for streaming-tables + `STREAM(...)` views + `APPLY CHANGES INTO` keyed on filename. -- **How to score document quality before retrieval** — five 0–6 dimensions in SQL, threshold filter on the index source. -- **How to log a Mosaic AI agent to UC** — `mlflow.pyfunc` with both inputs *and* outputs in the signature (UC requirement), `AnyType` for variable-shape fields, `auth_policy` + `resources` for OBO. -- **How to ground an agent with citations** — hybrid Vector Search → re-rank → top-k → LLM with explicit "cite sources [1] [2]" prompt. -- **How to handle DAB deploy ordering** — chicken-egg dependencies between heterogeneous resources, solved with a 5-step bootstrap rather than `depends_on` (which DAB doesn't reliably honor across resource types). -- **How to gate deploys on MLflow eval** — `mlflow.evaluate(model_type="databricks-agent")` with documented metric keys, per-axis thresholds, exit-code gate in CI. -- **How to do end-to-end OBO** — `ModelServingUserCredentials` from `databricks_ai_bridge`, `CredentialStrategy.MODEL_SERVING_USER_CREDENTIALS` for Vector Search, MLflow `auth_policy` with `model-serving` + `vector-search` user scopes, App-side `user_api_scopes` declaration. -- **How Spec-Kit + Claude Code + Databricks skills compose** — every artifact in `specs/` and `pipelines/` and `agent/` was generated through that loop. +Top-level docs: [`CLAUDE.md`](./CLAUDE.md) (runtime guidance for Claude Code), [`CONTRIBUTING.md`](./CONTRIBUTING.md), [`SECURITY.md`](./SECURITY.md), [`PRODUCTION_READINESS.md`](./PRODUCTION_READINESS.md), [`VALIDATION.md`](./VALIDATION.md), [`REAL_10K_PILOT.md`](./REAL_10K_PILOT.md), [`LICENSE`](./LICENSE). --- @@ -780,7 +380,5 @@ Released under the [**MIT License**](./LICENSE) — Copyright (c) 2026 Sathish K - [**Spec-Kit**](https://github.com/github/spec-kit) — spec-driven development workflow for AI coding agents. - [**Claude Code**](https://claude.com/claude-code) — Anthropic's CLI for AI-assisted development. -- [**Anthropic Skills**](https://github.com/anthropics/skills) — general-purpose Claude Code skill bundles. -- [**Databricks Lakehouse + Mosaic AI**](https://www.databricks.com/) — Unity Catalog, Lakeflow Spark Declarative Pipelines, Mosaic AI Vector Search, Agent Framework, Model Serving, AI Gateway, Databricks Apps, Lakebase, Lakehouse Monitoring. - -The 10-K analyst pattern is inspired by Databricks's own reference architecture for governed agent applications. +- [**Agent Skills**](https://github.com/anthropics/skills) — general-purpose Claude Code skill bundles. +- [**Databricks**](https://www.databricks.com/) — Unity Catalog, Lakeflow Spark Declarative Pipelines, Mosaic AI Vector Search, Agent Framework, Model Serving, AI Gateway, Databricks Apps, Lakebase, Lakehouse Monitoring. 
diff --git a/docs/_social_preview.py b/docs/_social_preview.py index f9a46a9..4bc55e1 100644 --- a/docs/_social_preview.py +++ b/docs/_social_preview.py @@ -26,17 +26,30 @@ ACCENT = "#FF3621" # Databricks orange LINE = "#252D3F" # subtle separator -# Arial bundles ship on macOS, support a wide glyph set including arrows, -# and have explicit Regular/Bold/Black files (no .ttc index guessing). -FONT_REG = "/System/Library/Fonts/Supplemental/Arial.ttf" -FONT_BOLD = "/System/Library/Fonts/Supplemental/Arial Bold.ttf" -FONT_BLACK = "/System/Library/Fonts/Supplemental/Arial Black.ttf" +# Prefer macOS Arial for local generation, but fall back to Liberation Sans in +# Linux devcontainers. +FONT_CANDIDATES = { + "regular": [ + "/System/Library/Fonts/Supplemental/Arial.ttf", + "/usr/share/fonts/truetype/liberation2/LiberationSans-Regular.ttf", + ], + "bold": [ + "/System/Library/Fonts/Supplemental/Arial Bold.ttf", + "/usr/share/fonts/truetype/liberation2/LiberationSans-Bold.ttf", + ], + "black": [ + "/System/Library/Fonts/Supplemental/Arial Black.ttf", + "/usr/share/fonts/truetype/liberation2/LiberationSans-Bold.ttf", + ], +} OUT = Path(__file__).parent / "social-preview.png" def font(size: int, weight: str = "regular") -> ImageFont.FreeTypeFont: - path = {"regular": FONT_REG, "bold": FONT_BOLD, "black": FONT_BLACK}[weight] + path = next((p for p in FONT_CANDIDATES[weight] if Path(p).exists()), None) + if path is None: + raise FileNotFoundError(f"No usable font found for weight={weight!r}") return ImageFont.truetype(path, size) @@ -70,7 +83,7 @@ def main() -> None: # One-line architecture summary, near bottom. ASCII arrows guarantee # glyph coverage across any future font swap. arch_f = font(22, "bold") - arch_text = "ai_parse_document -> typed KPIs -> Vector Search -> cited agent on Mosaic AI" + arch_text = "ai_parse_document -> typed KPIs -> Vector Search -> eval-gated cited agent" d.text((margin, H - margin - 80), arch_text, font=arch_f, fill=FG) # Separator + footer. diff --git a/docs/design.md b/docs/design.md new file mode 100644 index 0000000..f405159 --- /dev/null +++ b/docs/design.md @@ -0,0 +1,355 @@ +# Design — Databricks Document Intelligence Agent + +This document covers the *why*, the architecture, and the build workflow behind the repo. For setup and day-to-day use, see [`README.md`](../README.md). For day-2 ops, see [`runbook.md`](./runbook.md). 

+## Table of contents
+
+- [Why this exists](#why-this-exists)
+- [Architecture](#architecture)
+  - [Two halves: an offline pipeline, and an online agent](#two-halves-an-offline-pipeline-and-an-online-agent)
+  - [Vector Search bridges data and agent](#vector-search-bridges-data-and-agent)
+  - [Agent has two paths, one endpoint](#agent-has-two-paths-one-endpoint)
+  - [Runtime stack](#runtime-stack)
+- [How it's built — three pillars](#how-its-built--three-pillars)
+  - [Pillar 1 — Spec-Kit](#pillar-1--spec-kit-spec-driven-development)
+  - [Pillar 2 — Databricks Asset Bundles + the Claude Code skill suite](#pillar-2--databricks-asset-bundles--the-claude-code-skill-suite)
+  - [Pillar 3 — Claude Code as the implementation surface](#pillar-3--claude-code-as-the-implementation-surface)
+- [Deploy ordering: foundation → consumers](#deploy-ordering-foundation--consumers)
+- [What you can learn from this repo](#what-you-can-learn-from-this-repo)
+
+---
+
+## Why this exists
+
+Databricks shipped a lot of new generative-AI surface area in 2025–2026: `ai_parse_document`, Mosaic AI Vector Search, the Agent Framework, AI Gateway, Lakebase, Databricks Apps. Tutorials show each piece in isolation; few show them wired together with **eval gates, governance, and reproducible deploys** the way you'd actually ship to analysts.
+
+This repo is that worked example. Drop a PDF into a governed UC volume; ten minutes later, an analyst can ask cited questions in plain English with end-to-end audit. The whole stack is described declaratively as one **Databricks Asset Bundle (DAB)** plus a small bootstrap script. DAB manages catalog/schema/volume, pipeline, jobs, the Vector Search **endpoint**, the Lakebase instance, the serving endpoint, the monitor, the app, and the dashboard; the Vector Search **index** itself is created and synced by `jobs/index_refresh/sync_index.py` (DAB doesn't yet manage indexes as a resource type; a sketch follows below), and the agent model version is registered by `agent/log_and_register.py`. The bootstrap script orchestrates them in the right order.
+
+It also demonstrates a development workflow: **Spec-Kit** for spec-driven design, **Claude Code** with Databricks skill bundles for AI-assisted implementation, and six **non-negotiable constitution principles** that gate every plan. See [How it's built](#how-its-built--three-pillars).
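
+
+As a concrete sketch of that split, here is roughly what `jobs/index_refresh/sync_index.py` amounts to, assuming the `databricks-vectorsearch` SDK (the endpoint, index, key, and embedding-model names are illustrative):
+
+```python
+from databricks.vector_search.client import VectorSearchClient
+
+# Illustrative names: the real values come from bundle variables. The endpoint
+# is DAB-managed; this script owns the index itself.
+ENDPOINT = "filings-vs-endpoint"
+INDEX = "main.doc_intel.filings_sections_index"
+
+client = VectorSearchClient()
+try:
+    # Steady state: the index already exists, so trigger a Delta-Sync refresh.
+    client.get_index(endpoint_name=ENDPOINT, index_name=INDEX).sync()
+except Exception:
+    # First run: materialize the index from the quality-gated Gold table.
+    client.create_delta_sync_index(
+        endpoint_name=ENDPOINT,
+        index_name=INDEX,
+        source_table_name="main.doc_intel.gold_filing_sections",
+        pipeline_type="TRIGGERED",
+        primary_key="section_id",  # assumed key column
+        embedding_source_column="summary",
+        embedding_model_endpoint_name="databricks-gte-large-en",  # assumed
+    )
+```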
+ +--- + +## Architecture + +### Two halves: an offline pipeline, and an online agent + +``` + ╔═══════════════════════════════════════════════════════════════════╗ + ║ pipelines/sql/ (one SQL file per tier) ║ + ╚═══════════════════════════════════════════════════════════════════╝ + + raw_filings/ ┌─────────────────┐ ┌─────────────────┐ ┌──────────────────┐ + ACME_10K.pdf ──▶ │ bronze_filings │──▶│ silver_parsed_ │──▶│ gold_filing_ │ + BETA_10K.pdf │ (raw bytes, │ │ filings (parsed │ │ sections (one │ + GAMMA_10K.pdf │ filename, │ │ VARIANT — │ │ row per parsed │ + │ ingested_at) │ │ ai_parse_ │ │ $.sections[*]; │ + │ │ │ document) │ │ fallback to │ + │ >50MB rejects: │ │ │ │ full_document │ + │ bronze_filings │ │ Status: ok / │ │ if absent) │ + │ _rejected │ │ partial / error │ │ │ + └─────────────────┘ └─────────────────┘ │ gold_filing_kpis │ + 01_bronze.sql 02_silver_parse │ (typed columns: │ + .sql │ segment_revenue │ + │ ARRAY, │ + │ top_risks │ + │ ARRAY) │ + └──────────────────┘ + 03_gold_classify + _extract.sql + │ + ▼ + ┌──────────────────┐ + │ gold_filing_ │ + │ quality │ + │ (5-dim rubric: │ + │ parse, layout, │ + │ ocr, sections, │ + │ kpi → 0-30) │ + └──────────────────┘ + 04_gold_quality.sql +``` + +**Key idea — "parse once, extract many":** PDFs are expensive to parse. Silver runs `ai_parse_document` exactly once per file and stores the structured result as a `VARIANT`. Everything downstream — classification, KPI extraction, summarization, quality scoring — reads the parsed output, never the raw bytes. This is a non-negotiable constitution principle. + +**Triggering**: prod runs the pipeline in `continuous: true` mode so Auto Loader (`read_files`) reacts to new PDFs in the volume automatically. Dev overrides to `continuous: false` to avoid a 24/7 cluster during smoke iterations. See `resources/foundation/doc_intel.pipeline.yml` and the dev override block in `databricks.yml`. + +### Vector Search bridges data and agent + +``` + gold_filing_sections ┌─────────────────────────┐ + (governed Delta table) ─────▶ │ Mosaic AI Vector │ + │ Search Index │ + Filter: embed_eligible=true │ (Delta-Sync — auto- │ + Embed column: "summary" │ refreshes when Gold │ + │ updates) │ + └─────────────────────────┘ + + Why "summary" not the raw text? + ───────────────────────────── + Embedding a 50-page 10-K verbatim is noisy. We embed an LLM-written + summary instead — tighter, more searchable. Constitution principle IV: + "Quality before retrieval." +``` + +**Ownership note**: DAB manages the Vector Search **endpoint** (`resources/consumers/filings_index.yml`) and the index-refresh **job** (`resources/consumers/index_refresh.job.yml`). The **index** itself isn't yet a DAB-managed resource type as of CLI 0.298 — `jobs/index_refresh/sync_index.py` creates the Delta-Sync index on first run and triggers a sync on subsequent runs. That's why the bootstrap script's stage-2 deploy creates the endpoint + job, and the job's first execution materializes the actual index. + +### Agent has two paths, one endpoint + +``` + User question + │ + ▼ + ┌────────────────────────────────────────────┐ + │ AnalystAgent.predict() │ + │ ───────────────────── │ + │ contains "compare" / "vs" / │ + │ "between" + ≥2 company names? │ + └────────────┬─────────────────┬─────────────┘ + │ no │ yes + ▼ ▼ + ┌──────────────────────┐ ┌──────────────────────┐ + │ Single-filing path │ │ Supervisor path │ + │ │ │ │ + │ 1. Hybrid search │ │ For each company: │ + │ (keyword + vec) │ │ ▸ run analyst path │ + │ 2. 
Re-rank → top 5 │ │ ▸ pull KPIs from │ + │ 3. LLM generates │ │ gold_filing_kpis │ + │ answer w/ [1] [2] │ │ Format markdown │ + │ citations │ │ table with cites. │ + └──────────────────────┘ └──────────────────────┘ + │ │ + └────────┬────────┘ + ▼ + ┌──────────────────────┐ + │ Response JSON: │ + │ answer │ + │ citations[] │ + │ grounded: bool │ + │ latency_ms │ + └──────────────────────┘ +``` + +The agent is an `mlflow.pyfunc` model registered in Unity Catalog and served behind an **AI Gateway** (rate limiting per-user, usage tracking, inference-table audit). Identity passthrough is implemented at the *App layer* when the workspace has Databricks Apps user-token passthrough enabled: the Streamlit app extracts the user's `x-forwarded-access-token` header and constructs a user-scoped `WorkspaceClient`. The served model is OBO-ready via MLflow `auth_policy` and Model Serving user credentials. If app-level passthrough is not enabled, the app falls back to service-principal auth and the repo must be treated as a reference/dev deployment, not a production row-level-security deployment. See [`../SECURITY.md`](../SECURITY.md) and [`../app/README.md`](../app/README.md). + +### Runtime stack + +``` + ┌──────────────────────────────────────────────────────────────────┐ + │ │ + │ Databricks App (Streamlit) ← user interacts here │ + │ app/app.py │ + │ │ + │ ┌────────────────┐ ┌──────────────────┐ │ + │ │ Chat input box │ │ Citation chips │ │ + │ │ Thumbs up/down │ │ Markdown tables │ │ + │ └────────┬───────┘ └─────┬────────────┘ │ + │ │ │ │ + └──────────────│─────────────────│─────────────────────────────────┘ + │ │ + │ query │ feedback writes + ▼ ▼ + ┌────────────────────────┐ ┌────────────────────────┐ + │ Model Serving endpoint │ │ Lakebase Postgres │ + │ "analyst-agent-dev" │ │ ───────────────── │ + │ (CPU, scales to 0) │ │ conversation_history │ + │ │ │ query_logs │ + │ + AI Gateway: │ │ feedback │ + │ rate limit │ │ │ + │ (per-user key) │ │ (Postgres for tiny │ + │ inference-table │ │ per-turn writes — │ + │ audit │ │ Delta isn't great │ + │ usage tracking │ │ at row-by-row) │ + └────────────────────────┘ └────────────────────────┘ + + OBO (user identity end-to-end, when enabled): + ────────────────────────────── + App reads `x-forwarded-access-token` from the request, builds + `WorkspaceClient(token=...)`, calls the serving endpoint with the + user's identity. The agent-side MLflow auth policy and Model Serving + OBO credentials let downstream calls run as the user. If the app-side + feature is unavailable, the bootstrap script prints an explicit warning + and the deployment remains reference/dev only. +``` + +**Why Postgres for state?** Delta tables are great for analytics but bad at "insert one tiny row per chat turn at high frequency." Lakebase is Databricks's managed Postgres — same governance, right tool for the job. + +--- + +## How it's built — three pillars + +This repo combines three things: Spec-Kit for spec-driven design, Databricks Asset Bundles + Claude Code skill bundles for declarative platform work, and Claude Code as the implementation surface. + +### Pillar 1 — Spec-Kit (spec-driven development) + +[Spec-Kit](https://github.com/github/spec-kit) is a workflow that forces you to write — and *clarify* — a specification before writing code. 
Each phase is a slash-command in Claude Code that produces a checked-in artifact:
+
+```
+  /speckit-specify    →  specs/<feature>/spec.md     What & why (no how)
+         │
+         ▼
+  /speckit-clarify    →  appended Q&A in spec.md     Resolve ambiguity
+         │
+         ▼
+  /speckit-plan       →  specs/<feature>/plan.md     Tech stack + structure
+         │               + research.md, data-model.md,
+         │                 contracts/, quickstart.md
+         ▼
+  /speckit-tasks      →  specs/<feature>/tasks.md    Dependency-ordered tasks
+         │
+         ▼
+  /speckit-analyze    →  cross-artifact consistency check
+         │
+         ▼
+  /speckit-implement  →  the actual code
+```
+
+`.specify/extensions.yml` auto-commits at each phase boundary so the trail is clean. `.specify/memory/constitution.md` defines six **non-negotiable principles** every plan must respect:
+
+| # | Principle | What it means |
+|---|---|---|
+| I | **Unity Catalog source of truth** | Every table, volume, model, index, endpoint lives under `<catalog>.<schema>` — no DBFS, no workspace-local resources |
+| II | **Parse once, extract many** | `ai_parse_document` runs once at Silver → VARIANT; everything downstream reads the parsed output |
+| III | **Declarative over imperative** | SDP SQL pipelines, Lakeflow Jobs, DAB resources — no production notebooks |
+| IV | **Quality before retrieval** | 5-dim rubric scores every section; only ≥22/30 reach the index. Embed `summary`, not raw text |
+| V | **Eval-gated agents** | MLflow CLEARS scores must clear thresholds before any deploy is considered complete |
+| VI | **Reproducible deploys** | `databricks bundle deploy -t <target>` recreates the entire stack; `dev` and `prod` parity enforced |
+
+When you read `specs/001-doc-intel-10k/plan.md` you'll see a "Constitution Check" gate that maps each design decision back to the principle it satisfies. When you read `specs/001-doc-intel-10k/tasks.md` you'll see how each task derives from the plan, and how user stories (P1, P2, P3) are independently demoable.
+
+### Pillar 2 — Databricks Asset Bundles + the Claude Code skill suite
+
+[**Databricks Asset Bundles**](https://docs.databricks.com/aws/en/dev-tools/bundles/) (DABs) describe most of the workspace state as YAML. One root `databricks.yml` declares variables and targets (`dev`, `prod`); `resources/**/*.yml` declares each resource (pipeline, jobs, Vector Search endpoint, index-refresh job, serving endpoint, app, monitor, dashboard, Lakebase instance + catalog). `databricks bundle deploy -t dev` reconciles workspace state to YAML. The two non-DAB-managed pieces — the Vector Search **index** itself and the registered **model version** — are produced at runtime by `jobs/index_refresh/sync_index.py` and `agent/log_and_register.py` respectively, which the bootstrap script orchestrates.
+
+This repo was built with Databricks-specific Claude Code skill bundles. Those bundles are distributed by Databricks via the CLI / Claude Code plugin channel and **are not vendored in this open-source tree** — install them locally if you have access, or reference the canonical Databricks docs (mapping in [`../CONTRIBUTING.md`](../CONTRIBUTING.md)).
+
+| Skill bundle | What it provides | Canonical docs |
+|---|---|---|
+| **databricks-core** | Auth, profiles, data exploration, bundle basics | [docs](https://docs.databricks.com/aws/en/dev-tools/cli/) |
+| **databricks-dabs** | DAB structure, validation, deploy workflow, target separation | [docs](https://docs.databricks.com/aws/en/dev-tools/bundles/) |
+| **databricks-pipelines** | Lakeflow Spark Declarative Pipelines (`ai_parse_document`, `ai_classify`, `ai_extract`, `APPLY CHANGES INTO`) | [docs](https://docs.databricks.com/aws/en/dlt/) |
+| **databricks-jobs** | Lakeflow Jobs with retries, schedules, table-update / file-arrival triggers | [docs](https://docs.databricks.com/aws/en/jobs/) |
+| **databricks-apps** | Databricks Apps (Streamlit), App resource bindings | [docs](https://docs.databricks.com/aws/en/dev-tools/databricks-apps/) |
+| **databricks-lakebase** | Lakebase Postgres instances, branches, computes, endpoint provisioning | [docs](https://docs.databricks.com/aws/en/oltp/) |
+| **databricks-model-serving** | Model Serving endpoints, AI Gateway, served entities, scaling config | [docs](https://docs.databricks.com/aws/en/machine-learning/model-serving/) |
+
+Skills are loaded by Claude Code on demand. When you ask Claude to "wire up Vector Search," it should read the Databricks pipeline/model-serving guidance *before* writing YAML, so the output reflects current Databricks API shapes — not stale training data.
+
+### Pillar 3 — Claude Code as the implementation surface
+
+Spec-Kit produces the specs. The Databricks skills provide platform expertise. **Claude Code orchestrates both**: every phase artifact and every code file in this repo was authored by prompting Claude Code with the spec/plan/tasks as context.
+
+The workflow looks like:
+
+1. `/speckit-specify` → Claude writes spec.md from a natural-language description; you iterate via `/speckit-clarify` until ambiguity is resolved.
+2. `/speckit-plan` → Claude consults the constitution + Databricks skills, drafts plan.md with research decisions and architecture.
+3. `/speckit-tasks` → Claude generates a dependency-ordered task list grouped by user story (P1, P2, P3).
+4. `/speckit-implement` → Claude writes the actual SQL/Python/YAML, one task at a time, committing per task.
+5. Operational loops: when the deploy hits unexpected issues (it always does), Claude reads the runbook, fixes the issue, updates the runbook, commits.
+
+AI-driven here means Claude carries the boring parts (boilerplate YAML, retry-loop scripts, dependency analysis) so you spend time on what the spec should say and what the constitution should require.
+
+---
+
+## Deploy ordering: foundation → consumers
+
+DABs deploy *everything in one shot*. But our resources have a chicken-and-egg problem on a fresh workspace:
+
+```
+   ┌────────────────────────────────────────────────┐
+   │  What "bundle deploy" tries to create:         │
+   │                                                │
+   │   ▸ Pipeline   ────┐                           │
+   │   ▸ Tables     ────┼──── all need each other   │
+   │   ▸ Vector idx ────┤                           │
+   │   ▸ Model      ────┤  Monitor wants the        │
+   │   ▸ Endpoint   ────┤  KPI table to exist       │
+   │   ▸ App        ────┤  BEFORE it can attach     │
+   │   ▸ Monitor    ────┘                           │
+   │   ▸ Lakebase   ────                            │
+   └────────────────────────────────────────────────┘
+
+   Endpoint needs a registered model version.
+   Model version needs the model logged.
+   Model logging needs the agent code.
+   Monitor needs the table populated.
+   Table needs the pipeline to run.
+
+   ▶ Single `bundle deploy` → 4+ errors on a fresh workspace.
+
+``` + +The fix is a **staged deploy** orchestrated by `scripts/bootstrap-dev.sh`. Resources are split into two directories by data dependency: + +``` + resources/ + ├── foundation/ ← no data deps — deploy first + │ ├── catalog.yml (schema + volume + grants) + │ ├── doc_intel.pipeline.yml + │ ├── retention.job.yml + │ └── lakebase_instance.yml + │ + └── consumers/ ← need foundation to be RUNNING and producing data + ├── agent.serving.yml (needs registered model version) + ├── kpi_drift.yml (needs gold_filing_kpis table) + ├── filings_index.yml (VS endpoint) + ├── index_refresh.job.yml (needs source table) + ├── analyst.app.yml (needs Lakebase + agent endpoint) + ├── usage.dashboard.yml + └── lakebase_catalog.yml (needs instance AVAILABLE) +``` + +**The bootstrap script auto-detects which mode to run** by checking whether the agent serving endpoint already has a populated config: + +``` + does analyst-agent-${target} have served entities? + │ + no ◀───────┴───────▶ yes + │ │ + ▼ ▼ + ┌──────────────────┐ ┌──────────────────┐ + │ FIRST-DEPLOY │ │ STEADY-STATE │ + │ (staged) │ │ (full deploy) │ + ├──────────────────┤ ├──────────────────┤ + │ 1. temp-rename │ │ 1. bundle deploy │ + │ consumers/* │ │ (full bundle) │ + │ .yml.skip │ │ │ + │ 2. bundle deploy │ │ 2. refresh data: │ + │ (foundation) │ │ upload, run │ + │ 3. produce data: │ │ pipeline, │ + │ upload, run, │ │ register new │ + │ register │ │ model version │ + │ model │ │ + repoint │ + │ 4. wait Lakebase │ │ serving in- │ + │ AVAILABLE │ │ place │ + │ 5. restore yamls │ │ │ + │ 6. bundle deploy │ │ │ + │ (full bundle) │ │ │ + └────────┬─────────┘ └────────┬─────────┘ + │ │ + └───────────┬───────────┘ + ▼ + ┌──────────────────────────┐ + │ Common to both: │ + │ • bundle run analyst_app│ + │ • UC grants chain │ + │ • smoke check │ + └──────────────────────────┘ +``` + +**Why two modes?** DAB tracks resource state; if you run the temp-rename trick against an *existing* deployment, DAB sees the consumer YAMLs as removed and plans to **delete** the serving endpoint, app, monitor, etc. Safe-ish on a fresh workspace; destructive in steady-state. The script detects mode and does the right thing. + +CI (`.github/workflows/deploy.yml`) assumes steady-state — the first-ever bring-up of a workspace must be done locally with `./scripts/bootstrap-dev.sh`. After that, every push to `main` runs the steady-state path: full `bundle deploy` → refresh data → repoint serving endpoint → grants → CLEARS gate. + +For the per-step procedure and known failure modes, see [`runbook.md` § Known deploy ordering gaps](./runbook.md#known-deploy-ordering-gaps-discovered-in-the-2026-04-24-smoke-test). + +--- + +## What you can learn from this repo + +- **Wiring `ai_parse_document` into Lakeflow SDP** — pattern for streaming-tables + `STREAM(...)` views + `APPLY CHANGES INTO` keyed on filename. +- **Scoring document quality before retrieval** — five 0–6 dimensions in SQL, threshold filter on the index source. +- **Logging a Mosaic AI agent to UC** — `mlflow.pyfunc` with both inputs *and* outputs in the signature (UC requirement), `AnyType` for variable-shape fields, `auth_policy` + `resources` for OBO. +- **Grounding an agent with citations** — hybrid Vector Search → re-rank → top-k → LLM with explicit "cite sources [1] [2]" prompt. +- **Handling DAB deploy ordering** — chicken-egg dependencies between heterogeneous resources, solved with a 5-step bootstrap rather than `depends_on` (which DAB doesn't reliably honor across resource types). 
+- **Gating deploys on MLflow eval** — `mlflow.evaluate(model_type="databricks-agent")` with documented metric keys, per-axis thresholds, exit-code gate in CI. +- **End-to-end OBO** — `ModelServingUserCredentials` from `databricks_ai_bridge`, `CredentialStrategy.MODEL_SERVING_USER_CREDENTIALS` for Vector Search, MLflow `auth_policy` with `model-serving` + `vector-search` user scopes, App-side `user_api_scopes` declaration. +- **Spec-Kit + Claude Code + Databricks skills composing** — every artifact in `specs/` and `pipelines/` and `agent/` was generated through that loop. diff --git a/docs/social-preview.png b/docs/social-preview.png index 735c2e7..d02501b 100644 Binary files a/docs/social-preview.png and b/docs/social-preview.png differ diff --git a/specs/001-doc-intel-10k/plan.md b/specs/001-doc-intel-10k/plan.md index d387a3a..8b6e801 100644 --- a/specs/001-doc-intel-10k/plan.md +++ b/specs/001-doc-intel-10k/plan.md @@ -124,7 +124,7 @@ Output: [research.md](./research.md). Decisions captured: | Idempotency | `APPLY CHANGES INTO` keyed on `filename` for Silver and Gold | SDP native CDC, deterministic on re-upload, no Python helper | Hand-rolled MERGE (rejected: more code paths); content hash key (deferred — filename is sufficient for v1) | | Quality rubric | 5 dimensions × 0–6 scale; threshold ≥ 22/30; computed via `ai_query` calls in `04_gold_quality.sql` | Mirrors Reffy's 31-point pattern; SQL-native means no Python helper; explicit dimensions help debug rejections | Single `extraction_confidence` (rejected: no debuggability); 3-dim avg (rejected: too coarse) | | Vector Search index | Delta-Sync index over `gold_filing_sections` filtered by `embed_eligible`; embed `summary` column | Managed sync, no manual refresh; embeds curated content per principle IV | Direct Vector Index (rejected: no managed sync); embedding raw `parsed.text_full` (rejected: noise) | -| Retrieval strategy | Hybrid (keyword + semantic) top-25 → re-rank → top-5 | Reffy pattern; re-rank improves relevance materially; CPU re-rank stays in budget | Pure semantic (rejected: misses exact filings/years); re-rank against top-100 (rejected: latency budget) | +| Retrieval strategy | Hybrid (keyword + semantic) top-25 → re-rank → top-5 | Reffy pattern; re-rank tightens top-5 ordering; CPU re-rank stays in budget | Pure semantic (rejected: misses exact filings/years); re-rank against top-100 (rejected: latency budget) | | Agent framework | Mosaic AI Agent Framework via `databricks-agents` SDK + MLflow `pyfunc` | First-class Knowledge Assistant + Supervisor primitives; logged + registered in UC | LangGraph standalone (rejected: more glue, no UC registration story) | | Serving | CPU instance behind AI Gateway; identity passthrough on | Cost-first per Reffy; Gateway gives audit + rate limit + on-behalf-of | GPU (rejected: not needed at scale of pilot); raw endpoint (rejected: no governance layer) | | State store | Lakebase Postgres (managed) | Native to platform, low-latency reads/writes, fits Reffy pattern; integrates with Apps | Delta tables (rejected: write throughput on small turn-level updates); external Postgres (rejected: governance gap) | diff --git a/specs/001-doc-intel-10k/quickstart.md b/specs/001-doc-intel-10k/quickstart.md index 4447677..e3152de 100644 --- a/specs/001-doc-intel-10k/quickstart.md +++ b/specs/001-doc-intel-10k/quickstart.md @@ -1,6 +1,6 @@ # Quickstart: Deploy and Test the 10-K Analyst -Goal: from a clean clone, stand up the entire stack on the Databricks `dev` target and verify P1, P2, P3 acceptance 
scenarios in under 30 minutes. +Goal: from a clean clone, stand up the entire stack on the Databricks `dev` target and verify P1, P2, P3 acceptance scenarios in 15–25 minutes. ## Prerequisites diff --git a/specs/001-doc-intel-10k/research.md b/specs/001-doc-intel-10k/research.md index 3ac6e85..ef9d4c6 100644 --- a/specs/001-doc-intel-10k/research.md +++ b/specs/001-doc-intel-10k/research.md @@ -55,7 +55,7 @@ tunable as a bundle parameter. **Rationale**: Reffy reports keyword-only sub-2s but reasoning needs LLM generation. Hybrid keyword + semantic retrieval to top-25, then a Mosaic AI re-ranker (CPU) trim to top-5, keeps single-filing P95 ≤ 8s achievable -on CPU serving while improving relevance materially. Bigger windows blow +on CPU serving while improving top-5 ordering qualitatively. Bigger windows blow the latency budget; pure semantic misses exact ticker/year matches in financial filings.