
docs(skill/databricks-metric-views): add Update section + UDF gotcha #528

Draft

jacksandom wants to merge 1 commit into experimental from feat/metric-views-update-section

Conversation

@jacksandom
Collaborator

Summary

Two additive doc fixes for non-MCP metric view authoring (matches the experimental branch's CLI-only posture):

  1. New "Update an Existing Metric View" subsection under SQL Operations. Metric views don't support ALTER VIEW … ADD MEASURE — the only path is CREATE OR REPLACE VIEW with the complete updated YAML. The worked example annotates each line ← unchanged, repeated verbatim / ← new to make the full-replacement requirement visually obvious. Cross-links to SHOW CREATE TABLE so an agent fetches the current YAML before editing.

  2. New row in Common Issues: Python UDFs are not supported in measure expressions. Workaround: use built-in SQL aggregates (SUM, COUNT, AVG) or SQL UDFs; for custom logic, push the transformation into the source table or wrap it as a UC-registered SQL UDF (see the second sketch after this list).
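
A minimal SQL sketch of the full-replacement pattern from item 1. The catalog/schema, columns, and YAML values below are illustrative assumptions, not the skill's actual worked example:

```sql
-- Hypothetical names throughout (main.sales.*, order_amount, order_id).
-- Step 1: fetch the current definition so no existing measure gets dropped.
SHOW CREATE TABLE main.sales.orders_metrics;

-- Step 2: re-issue CREATE OR REPLACE VIEW with the complete YAML, existing measures included.
CREATE OR REPLACE VIEW main.sales.orders_metrics
WITH METRICS
LANGUAGE YAML
AS $$
version: 0.1
source: main.sales.orders
dimensions:
  - name: order_date              # ← unchanged, repeated verbatim
    expr: order_date
measures:
  - name: total_revenue           # ← unchanged, repeated verbatim
    expr: SUM(order_amount)
  - name: average_order_value     # ← new
    expr: SUM(order_amount) / COUNT(DISTINCT order_id)
$$;
```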
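
And a hedged sketch of the SQL UDF workaround from item 2 (function name, schema, and columns are assumptions for illustration only):

```sql
-- Custom logic that would otherwise want a Python UDF, registered as a UC SQL UDF instead.
CREATE OR REPLACE FUNCTION main.sales.net_amount(amount DOUBLE, discount DOUBLE)
RETURNS DOUBLE
RETURN amount - COALESCE(discount, 0.0);

-- The measure expression then stays inside a built-in aggregate, e.g. in the view's YAML:
--   measures:
--     - name: net_revenue
--       expr: SUM(main.sales.net_amount(order_amount, discount))
```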

Targets two known footguns surfaced by the metric-views ground_truth.yaml:

  • metric-views_alter_010 — "Add a new measure 'Average Order Value' to my existing orders_metrics metric view"
  • metric-views_udf_not_supported_021 — "Can I use a Python UDF inside a metric view measure expression?"

Evaluation

Ran stf compare (L3 static + L5 output) against origin/experimental. Agent model: claude-sonnet-4-6 via Anthropic OAuth (had to unset llm.ai_gateway_host because the fevm-jss-sandbox workspace's /anthropic endpoint returns 404, breaking the gateway agent path; L3 judges still ran fine via the gateway).

| Metric | A (this PR) | B (experimental) | Δ |
| --- | --- | --- | --- |
| Composite | 0.641 | 0.617 | +0.024 |
| L3 static | 0.8753 | 0.8752 | +0.0001 |
| L5 output | 0.407 | 0.359 | +0.047 |
| Feedback pass rate | 166 / 250 (66%) | 146 / 249 (59%) | +8 pts |
| Judge winner | TIE (confidence 0.20) | | |

Both target test cases improved:

| Test case | A pass | B pass | Δ |
| --- | --- | --- | --- |
| metric-views_alter_010 | 4/8 (50%) | 2/8 (25%) | +25 pts |
| metric-views_udf_not_supported_021 | 9/10 (90%) | 7/10 (70%) | +20 pts |

alter_010's expected response includes the MCP manage_metric_views(action="alter", …) call, which is intentionally not added on experimental (CLI-only posture). The fix still scores higher because it teaches the agent to keep all existing measures when updating — the full-replacement requirement.

Other notable L5 swings (probably mostly noise — non-target cases vary across runs):

  • +62 pts on metric-views_window_rolling_avg_018
  • +100 pts on metric-views_yaml_spec_005
  • +27 pts on metric-views_conversational_support_tickets_020
  • −88 pts on metric-views_filtered_measure_013 (regression, but not a target — likely noise)
  • −86 pts on metric-views_star_schema_006 (same)

L3 judge winner was TIE because the judge sampled a single test case where both artifacts were near-identical. The aggregated pass-rate across all 250 feedback rows is what shows the small but consistent positive lift.

Caveats — independent bugs surfaced during the eval (worth filing separately)

  1. Install-banner contamination of agent responses. Every L5 agent's with_response / without_response starts with the "Databricks AI Dev Kit — update available!" banner. The SessionStart install-banner hook is firing inside agent subprocesses and getting captured as part of the agent's response. This explains why composite scores stay well below 0.7 on both branches: half of every captured response is install-banner garbage. The +0.05 L5 lift still shows through this noise.

  2. Gateway routing broken on fevm-jss-sandbox. llm.ai_gateway_host: https://fevm-jss-sandbox.cloud.databricks.com causes L5 agent subprocesses to inject ANTHROPIC_BASE_URL=<host>/anthropic, which 404s and surfaces as the misleading "model may not exist or you may not have access to it" error on every agent run. L3 judges still work via the gateway. Workaround: unset ai_gateway_host and pass --agent-model claude-sonnet-4-6 to stf compare.

  3. skill-evaluator SKILL.md doesn't document stf compare. The reference table lists evaluate (full pyramid) and audit (L3 only) but not the dedicated A/B command. Discovered only because @jacksandom flagged it.

How to reproduce

# Worktree at experimental for the B side
git worktree add /tmp/mv-experimental-worktree origin/experimental

# Unset gateway (or this errors on fevm-jss-sandbox; see caveat 2)
# yq -i '.llm.ai_gateway_host = null' ~/.skillforge/config.yaml

MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=false stf compare \
  databricks-skills/databricks-metric-views \
  /tmp/mv-experimental-worktree/databricks-skills/databricks-metric-views \
  --levels static,output \
  --agent-model claude-sonnet-4-6 \
  --timeout 600 \
  --output /tmp/mv-compare-verdict.json

Test plan

  • stf lint databricks-skills/databricks-metric-views clean
  • stf compare against origin/experimental shows positive L5 lift
  • Both target test cases improved (alter_010 +25 pts, udf_not_supported_021 +20 pts)
  • No MCP-tool guidance added (respects experimental's CLI-only posture)
  • Mirror under .claude/skills/ synced locally (gitignored, not in commit)

Two additive doc fixes for non-MCP metric view authoring on experimental:

1. New "Update an Existing Metric View" subsection under SQL Operations.
   Metric views don't support ALTER VIEW ... ADD MEASURE — the only path
   is CREATE OR REPLACE VIEW with the complete updated YAML. Includes a
   worked example (adding Average Order Value to orders_metrics) with
   line-by-line ← unchanged / ← new annotations so the full-replacement
   requirement is visually obvious.

2. New row in Common Issues: Python UDFs are not supported in measure
   expressions (use built-in SQL aggregates or SQL UDFs; for custom
   logic push into the source or wrap as a SQL UDF in UC).

stf compare A/B vs origin/experimental (L3 static + L5 output,
agent-model claude-sonnet-4-6 via Anthropic OAuth):

  Composite: A=0.641 vs B=0.617  (+0.024)
  L3 static: A=0.8753 vs B=0.8752  (≈0)
  L5 output: A=0.407  vs B=0.359  (+0.047)
  Feedback pass rate: 166/250 vs 146/249  (66% vs 59%, +8 pts)

Both targeted test cases improved:
  metric-views_alter_010:           4/8 vs 2/8  (+25 pts)
  metric-views_udf_not_supported_021: 9/10 vs 7/10 (+20 pts)

Judge verdict: TIE (low confidence 0.20) — judge sampled a single test
case where artifacts were near-identical. Aggregated pass-rate across all
250 feedback rows shows the small but consistent positive lift above.

Composite below 0.7 gate on both branches is due to (a) an install-banner
hook polluting agent subprocess responses (separate bug worth filing) and
(b) ground_truth scaffold staleness, not the fix itself.

Co-authored-by: Isaac