feat: semantic layer compiler (Slice 1) #22
Open
kosminus wants to merge 1 commit into
Attacks the cold-start problem: point QueryWise at an operational DB
with an empty semantic layer and get reviewable draft objects with
evidence and confidence scores.
- Engine (app/semantic_compiler/): self-contained collectors
(catalog, pg_stats/CHECK/enums, view definitions, pg_stat_statements)
+ deterministic inference: join paths without FKs (naming +
value-overlap probe + log co-occurrence), value dictionaries,
view/log metric extraction, dead tables, tenant scoping, PII,
fan-out warnings. LLM pass names/describes only — never invents.
- Staging review flow: compilation_runs/compilation_findings
(migration 013); findings become semantic objects only on accept,
landing as status='draft'; policies created disabled. Accepted
findings are name-keyed and rematerialize after re-introspection.
- API (/connections/{id}/compilation/*) + Compiler page (progress,
findings grouped by kind with evidence, bulk accept/dismiss).
- opsdb fixture: hostile operational schema + pg_stat_statements
workload script; eval harness scores recovery of the IFRS 9 seed
(relationships 5/5 @ 100% precision with FKs hidden).
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
What
Adds a semantic layer compiler: introspect an operational database (schema + column statistics + view definitions + query logs), run deterministic inference, and propose draft semantic-layer objects — inferred join paths, metrics, value dictionaries, glossary entities, and refusal boundaries (PII masking, tenant row filters, dead tables, fan-out warnings) — each with evidence and a confidence score, reviewed through a new Compiler page before anything is created.
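As a rough illustration of the "nothing is created until reviewed" flow described above, a staged finding could be modelled as below. This is only a sketch: `Finding`, `FindingStatus`, and `accept` are hypothetical names chosen for the example, not the classes or functions in this PR.

```python
from dataclasses import dataclass, field
from enum import Enum


class FindingStatus(Enum):
    PENDING = "pending"
    ACCEPTED = "accepted"
    DISMISSED = "dismissed"


@dataclass
class Finding:
    """A proposed semantic object, staged for review (hypothetical shape)."""
    kind: str                  # e.g. "relationship", "metric", "dictionary"
    name: str
    confidence: float          # 0.0 .. 1.0
    evidence: list = field(default_factory=list)
    status: FindingStatus = FindingStatus.PENDING


def accept(finding: Finding) -> dict:
    """Turn a pending finding into a draft object; nothing exists before this call."""
    if finding.status is not FindingStatus.PENDING:
        raise ValueError(f"finding {finding.name!r} was already reviewed")
    finding.status = FindingStatus.ACCEPTED
    # Accepted objects start as drafts so they still go through certification.
    return {"kind": finding.kind, "name": finding.name, "status": "draft"}


finding = Finding(
    kind="relationship",
    name="orders.customer_id -> customers.id",
    confidence=0.81,
    evidence=["naming match", "value overlap", "log co-occurrence"],
)
draft = accept(finding)
```

The point of the shape is that the accept step is the only path from a finding to a semantic object, and its output is always a draft.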
Why
QueryWise's hardest cold-start problem is building the semantic layer: a new customer connects an operational DB and faces an empty glossary, metric set, and dictionary. Operational schemas are hostile in ways warehouses aren't (no declared FKs, int-coded statuses, soft deletes, tenant columns), and that hostility is exactly the signal a compiler can mine. Crucially, the generated guardrails (fan-out warnings, PII masking, dead tables) guard against the most common classes of silently-wrong text-to-SQL answers.
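To make the fan-out case concrete, here is a small self-contained sketch of 1:N parent-measure double-counting. It uses an in-memory SQLite database purely for illustration; the `orders` and `order_items` tables are invented for the example and are not part of the opsdb fixture.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL);
    CREATE TABLE order_items (id INTEGER PRIMARY KEY, order_id INTEGER, sku TEXT);
    INSERT INTO orders VALUES (1, 100.0), (2, 50.0);
    INSERT INTO order_items VALUES (1, 1, 'a'), (2, 1, 'b'), (3, 1, 'c'), (4, 2, 'd');
    """
)

# Naive: joining the parent to its children repeats each order once per item,
# so the parent-level measure (orders.total) is summed multiple times.
naive_total = conn.execute(
    "SELECT SUM(o.total) FROM orders o JOIN order_items i ON i.order_id = o.id"
).fetchone()[0]

# Safe: aggregate the parent measure without multiplying rows through the join.
safe_total = conn.execute(
    "SELECT SUM(total) FROM orders WHERE id IN (SELECT order_id FROM order_items)"
).fetchone()[0]

print(naive_total, safe_total)  # 350.0 150.0
```

Order 1 has three items, so the naive join counts its total three times (300 + 50) while the true revenue is 150. Both queries run without error, which is why this class of mistake tends to go unnoticed.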
Changes
Engine (backend/app/semantic_compiler/; self-contained, no FastAPI/ORM imports, standalone-CLI extractable)
- Collectors: catalog, pg_stats/CHECK IN-lists/enums/unique indexes, pg_get_viewdef, pg_stat_statements; each degrades to empty when unavailable, and the run records which sources answered
- sqlmeta.py: sqlglot analysis (join pairs, aggregates, GROUP BY, WHERE) with graceful degradation, mirroring lineage_service
- Inference: join paths without FKs (naming + value-overlap probe + log co-occurrence), value dictionaries (most_common_vals, handling negative n_distinct), view→metric extraction (aggregates/dimensions/canonical filters), recurring log aggregates, dead tables, tenant scoping (call-weighted log confirmation required), PII (name + sampled value shape), fan-out warnings (1:N parent-measure double-counting)

LLM annotation (app/llm/agents/semantic_annotator.py): names/describes only; output merges onto naming fields, structurally unable to invent tables/joins/values; runs complete without a provider

Staging review flow (migration 013, compilation_service.py)
- compilation_runs/compilation_findings tables; findings become real semantic objects only on explicit accept (draft metrics/glossary feed the query-pipeline context builder today, so unreviewed output stays out)
- Accepted objects land as status='draft' for normal certification; data policies created disabled; fan-out guidance becomes a knowledge document (injected into SQL prompts via existing RAG)
- Accepted findings are name-keyed and rematerialize after re-introspection (introspect_and_cache wipes the schema cache, cascading to inferred relationships and dictionary entries); cached_relationships gains origin/confidence/cardinality/evidence
- semantic_compilation with in-memory progress, registered for both in-process and arq backends

API + frontend
- /connections/{id}/compilation/runs (+ get), /compilation/findings (+ /accept, /dismiss, /bulk)

Test fixture + eval
- opsdb: hostile operational schema in the sample-db container (no FKs, tenant_id, soft deletes, lookup tables, business-logic views, dead customers_bak); pg_stat_statements now preloaded; run_ops_workload.py populates query logs. Init scripts apply on a fresh volume (docker compose down -v)
- eval_compiler_ifrs9.py: scores recovery of the IFRS 9 seed metadata with declared FKs hidden: relationships 5/5 recall @ 100% precision, dictionary 79% recall / 89% precision, glossary table-coverage 10/10, confidence calibrated (0.81 correct vs 0.60 incorrect)

Verification
Live run against opsdb (LLM off): all 9 inferred joins correct with zero spurious edges; dictionaries with labels probed from lookup tables; tenant row-filter draft; dead-table and fan-out findings. Accept flows verified for every kind, including rematerialization after re-introspect. LLM-annotated run produced grounded names ("Customer Lifetime Value") without touching structure.

🤖 Generated with Claude Code
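For reviewers unfamiliar with the negative n_distinct case mentioned under Engine: in pg_stats, a positive n_distinct is an absolute estimate of distinct values, while a negative value is the negated ratio of distinct values to rows (so -1 means every row is distinct). A minimal sketch of normalising it follows; the function names and the `max_values` threshold are assumptions for illustration, not the code in this PR.

```python
def estimated_distinct(n_distinct: float, row_count: float) -> float:
    """Convert pg_stats.n_distinct into an absolute distinct-value estimate.

    n_distinct > 0: already an absolute estimate.
    n_distinct < 0: negated fraction of rows that are distinct, so scale by rows.
    n_distinct == 0: passed through as 0 and treated as "unknown" by callers.
    """
    if n_distinct < 0:
        return -n_distinct * row_count
    return n_distinct


def is_dictionary_candidate(
    n_distinct: float, row_count: float, max_values: int = 50
) -> bool:
    """Hypothetical rule: low-cardinality columns are value-dictionary candidates."""
    return 0 < estimated_distinct(n_distinct, row_count) <= max_values


# A status column with 5 codes in a 10,000-row table qualifies;
# a unique column (n_distinct = -1) does not.
status_ok = is_dictionary_candidate(5, 10_000)
unique_ok = is_dictionary_candidate(-1.0, 10_000)
```

Without the sign handling, a unique column reporting -1 would look like it has a single distinct value and could be mistaken for a dictionary.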