
docs(skill/databricks-metric-views): add Update section + UDF gotcha #528

Draft

jacksandom wants to merge 1 commit into experimental from feat/metric-views-update-section

Conversation

@jacksandom
Collaborator

Summary

Two additive doc fixes for non-MCP metric view authoring (matches the experimental branch's CLI-only posture):

  1. New "Update an Existing Metric View" subsection under SQL Operations. Metric views don't support ALTER VIEW … ADD MEASURE — the only path is CREATE OR REPLACE VIEW with the complete updated YAML. The worked example annotates each line ← unchanged, repeated verbatim / ← new to make the full-replacement requirement visually obvious. Cross-links to SHOW CREATE TABLE so an agent fetches the current YAML before editing.

  2. New row in Common Issues: Python UDFs are not supported in measure expressions. Workaround: use built-in SQL aggregates (SUM, COUNT, AVG) or SQL UDFs; for custom logic, push the transformation into the source table or wrap it as a UC-registered SQL UDF (see the second sketch after this list).
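
A minimal SQL sketch of the full-replacement pattern from item 1. The catalog/schema, columns, and YAML values below are illustrative assumptions, not the skill's actual worked example:

```sql
-- Hypothetical names throughout (main.sales.*, order_amount, order_id).
-- Step 1: fetch the current definition so no existing measure gets dropped.
SHOW CREATE TABLE main.sales.orders_metrics;

-- Step 2: re-issue CREATE OR REPLACE VIEW with the complete YAML, existing measures included.
CREATE OR REPLACE VIEW main.sales.orders_metrics
WITH METRICS
LANGUAGE YAML
AS $$
version: 0.1
source: main.sales.orders
dimensions:
  - name: order_date              # ← unchanged, repeated verbatim
    expr: order_date
measures:
  - name: total_revenue           # ← unchanged, repeated verbatim
    expr: SUM(order_amount)
  - name: average_order_value     # ← new
    expr: SUM(order_amount) / COUNT(DISTINCT order_id)
$$;
```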
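
And a hedged sketch of the SQL UDF workaround from item 2 (function name, schema, and columns are assumptions for illustration only):

```sql
-- Custom logic that would otherwise want a Python UDF, registered as a UC SQL UDF instead.
CREATE OR REPLACE FUNCTION main.sales.net_amount(amount DOUBLE, discount DOUBLE)
RETURNS DOUBLE
RETURN amount - COALESCE(discount, 0.0);

-- The measure expression then stays inside a built-in aggregate, e.g. in the view's YAML:
--   measures:
--     - name: net_revenue
--       expr: SUM(main.sales.net_amount(order_amount, discount))
```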

Targets two known footguns surfaced by the metric-views ground_truth.yaml:

  • metric-views_alter_010 — "Add a new measure 'Average Order Value' to my existing orders_metrics metric view"
  • metric-views_udf_not_supported_021 — "Can I use a Python UDF inside a metric view measure expression?"

Evaluation

Ran stf compare (L3 static + L5 output) against origin/experimental. Agent model: claude-sonnet-4-6 via Anthropic OAuth (had to unset llm.ai_gateway_host because the fevm-jss-sandbox workspace's /anthropic endpoint returns 404, breaking the gateway agent path; L3 judges still ran fine via the gateway).

| Metric | A (this PR) | B (experimental) | Δ |
| --- | --- | --- | --- |
| Composite | 0.641 | 0.617 | +0.024 |
| L3 static | 0.8753 | 0.8752 | +0.0001 |
| L5 output | 0.407 | 0.359 | +0.047 |
| Feedback pass rate | 166 / 250 (66%) | 146 / 249 (59%) | +8 pts |
| Judge winner | TIE (confidence 0.20) | | |

Both target test cases improved:

| Test case | A pass | B pass | Δ |
| --- | --- | --- | --- |
| metric-views_alter_010 | 4/8 (50%) | 2/8 (25%) | +25 pts |
| metric-views_udf_not_supported_021 | 9/10 (90%) | 7/10 (70%) | +20 pts |

alter_010's expected response includes the MCP manage_metric_views(action="alter", …) call, which is intentionally not added on experimental (CLI-only posture). The fix still scores higher because it teaches the agent to keep all existing measures when updating — the full-replacement requirement.

Other notable L5 swings (probably mostly noise — non-target cases vary across runs):

  • +62 pts on metric-views_window_rolling_avg_018
  • +100 pts on metric-views_yaml_spec_005
  • +27 pts on metric-views_conversational_support_tickets_020
  • −88 pts on metric-views_filtered_measure_013 (regression, but not a target — likely noise)
  • −86 pts on metric-views_star_schema_006 (same)

L3 judge winner was TIE because the judge sampled a single test case where both artifacts were near-identical. The aggregated pass-rate across all 250 feedback rows is what shows the small but consistent positive lift.

Caveats — independent bugs surfaced during the eval (worth filing separately)

  1. Install-banner contamination of agent responses. Every L5 agent's with_response / without_response starts with the "Databricks AI Dev Kit — update available!" banner. The SessionStart install-banner hook is firing inside agent subprocesses and getting captured as part of the agent's response. This explains why composite scores stay well below 0.7 on both branches: half of every captured response is install-banner garbage. The +0.05 L5 lift still shows through this noise.

  2. Gateway routing broken on fevm-jss-sandbox. llm.ai_gateway_host: https://fevm-jss-sandbox.cloud.databricks.com causes L5 agent subprocesses to inject ANTHROPIC_BASE_URL=<host>/anthropic, which 404s and surfaces as the misleading "model may not exist or you may not have access to it" error on every agent run. L3 judges still work via the gateway. Workaround: unset ai_gateway_host and pass --agent-model claude-sonnet-4-6 to stf compare.

  3. skill-evaluator SKILL.md doesn't document stf compare. The reference table lists evaluate (full pyramid) and audit (L3 only) but not the dedicated A/B command. Discovered only because @jacksandom flagged it.

How to reproduce

# Worktree at experimental for the B side
git worktree add /tmp/mv-experimental-worktree origin/experimental

# Unset gateway (or this errors on fevm-jss-sandbox; see caveat 2)
# yq -i '.llm.ai_gateway_host = null' ~/.skillforge/config.yaml

MLFLOW_ENABLE_ASYNC_TRACE_LOGGING=false stf compare \
  databricks-skills/databricks-metric-views \
  /tmp/mv-experimental-worktree/databricks-skills/databricks-metric-views \
  --levels static,output \
  --agent-model claude-sonnet-4-6 \
  --timeout 600 \
  --output /tmp/mv-compare-verdict.json

Test plan

  • stf lint databricks-skills/databricks-metric-views clean
  • stf compare against origin/experimental shows positive L5 lift
  • Both target test cases improved (alter_010 +25 pts, udf_not_supported_021 +20 pts)
  • No MCP-tool guidance added (respects experimental's CLI-only posture)
  • Mirror under .claude/skills/ synced locally (gitignored, not in commit)

Two additive doc fixes for non-MCP metric view authoring on experimental:

1. New "Update an Existing Metric View" subsection under SQL Operations.
   Metric views don't support ALTER VIEW ... ADD MEASURE — the only path
   is CREATE OR REPLACE VIEW with the complete updated YAML. Includes a
   worked example (adding Average Order Value to orders_metrics) with
   line-by-line ← unchanged / ← new annotations so the full-replacement
   requirement is visually obvious.

2. New row in Common Issues: Python UDFs are not supported in measure
   expressions (use built-in SQL aggregates or SQL UDFs; for custom
   logic push into the source or wrap as a SQL UDF in UC).

stf compare A/B vs origin/experimental (L3 static + L5 output,
agent-model claude-sonnet-4-6 via Anthropic OAuth):

  Composite: A=0.641 vs B=0.617  (+0.024)
  L3 static: A=0.8753 vs B=0.8752  (≈0)
  L5 output: A=0.407  vs B=0.359  (+0.047)
  Feedback pass rate: 166/250 vs 146/249  (66% vs 59%, +8 pts)

Both targeted test cases improved:
  metric-views_alter_010:           4/8 vs 2/8  (+25 pts)
  metric-views_udf_not_supported_021: 9/10 vs 7/10 (+20 pts)

Judge verdict: TIE (low confidence 0.20) — judge sampled a single test
case where artifacts were near-identical. Aggregated pass-rate across all
250 feedback rows shows the small but consistent positive lift above.

Composite below 0.7 gate on both branches is due to (a) an install-banner
hook polluting agent subprocess responses (separate bug worth filing) and
(b) ground_truth scaffold staleness, not the fix itself.

Co-authored-by: Isaac