Leverage Deployment Stacks for idempotent operations #44

Open: Copilot wants to merge 15 commits into main from copilot/leverage-deployment-stacks-idempotency

Copilot AI commented May 5, 2026

The destroy workflow deletes a single resource group but leaves behind soft-deleted resources (Key Vault with purge protection), subscription-scoped resources, and multi-RG deployments — making destroy + redeploy non-idempotent.

Changes

Deploy workflow (git-ape-deploy.exampleyml)

  • Default deploy method changed from az deployment sub create to az stack sub create --action-on-unmanage deleteAll
  • New "Capture managed resources" step walks stack resources or deployment operations post-deploy
  • state.json now populated with stackId, deployMethod, managedResources[], resourceGroups[], subscriptions[], externalReferences[]
  • metadata.json gains deployMethod and resourceGroups[] on commit

Destroy workflow (git-ape-destroy.exampleyml)

  • Stack path: az stack sub delete --action-on-unmanage deleteAll when stackId present — single command covers all managed resources regardless of scope
  • Fallback path: preserved legacy sub-scoped resource cleanup → az group delete for pre-stack deployments
  • Soft-delete purge loop: iterates managedResources[].softDeletable, purges non-protected Key Vaults, marks purge-protected as retained-soft-deleted
  • Deployment history cleanup: az deployment sub delete removes the subscription-scope deployment record so history does not accumulate toward the 800-deployments-per-scope limit
  • New terminal statuses: partially-destroyed, retained-soft-deleted
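
The path selection and purge handling listed above can be sketched as two small shell helpers. This is only an illustration under stated assumptions, not the workflow's actual code: the function names `select_destroy_strategy` and `purge_action` are hypothetical, and the values are passed in as plain strings instead of being read from state.json.

```shell
#!/bin/sh
# Hypothetical helpers illustrating the destroy decisions described above.

# Pick the destroy path from the recorded stackId.
# An empty value or the literal "null" (what `jq -r` prints for a missing
# field) is treated as a pre-stack deployment.
select_destroy_strategy() {
    case "$1" in
        ""|null) echo "fallback" ;;  # legacy cleanup, then az group delete
        *)       echo "stack" ;;     # az stack sub delete
    esac
}

# Decide what to do with one managed resource after deletion.
#   $1 = softDeletable (true/false), $2 = purgeProtected (true/false)
purge_action() {
    if [ "$1" != "true" ]; then
        echo "none"                   # nothing lingers after delete
    elif [ "$2" = "true" ]; then
        echo "retained-soft-deleted"  # cannot be purged; record the status
    else
        echo "purge"                  # soft-deleted copy can be purged
    fi
}

select_destroy_strategy "example-stack-id"
purge_action true true
```

Keeping the decisions in pure functions like these makes them easy to check without touching Azure; the real workflow would still need to run the corresponding az commands.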

State schema (website/docs/deployment/state.md)

  • Documented extended state.json schema with destroy strategy selection logic
  • Updated lifecycle diagram with new terminal states
  • Added deployMethod field to metadata.json spec

Example state.json (post-deploy)

{
  "stackId": "/subscriptions/.../providers/Microsoft.Resources/deploymentStacks/deploy-20260218-143022",
  "deployMethod": "stack",
  "managedResources": [
    {
      "id": "/subscriptions/.../Microsoft.KeyVault/vaults/kv-api-dev-eus",
      "type": "Microsoft.KeyVault/vaults",
      "scope": "resourceGroup",
      "softDeletable": true,
      "purgeProtected": true
    }
  ],
  "resourceGroups": ["rg-api-dev-eastus"],
  "subscriptions": ["00000000-..."],
  "externalReferences": []
}

This implements Phase 1 (schema + state capture) and Phase 2 (Deployment Stacks integration) from the issue. Phase 3 (extract destroy to standalone script) and Phase 4 (fixture validation) are deferred.

Copilot AI linked an issue May 5, 2026 that may be closed by this pull request
…tate schema

- Deploy workflow: use `az stack sub create` with `--action-on-unmanage deleteAll`
  as the default deployment method, with `az deployment sub create` as fallback
- Deploy workflow: add managed resources capture step after deploy that walks
  deployment operations or stack resources to populate state.managedResources[]
- Destroy workflow: use `az stack sub delete` when stackId is present in state,
  covering multi-RG, sub-scope, and MG-scope resources uniformly
- Destroy workflow: add soft-delete purge loop for Key Vault and Cognitive Services
- Destroy workflow: add deployment history cleanup step
- Destroy workflow: support new terminal statuses: `partially-destroyed` and
  `retained-soft-deleted`
- State schema: extend state.json with stackId, deployMethod, managedResources[],
  resourceGroups[], subscriptions[], externalReferences[]
- Metadata schema: add deployMethod and resourceGroups[] fields
- Documentation: update deployment state docs with new schema, statuses, and
  destroy strategy selection logic
- Regenerate workflow documentation pages

Agent-Logs-Url: https://github.com/Azure/git-ape/sessions/d2d1da54-9a38-41ef-9254-b5f585eab10e

Co-authored-by: arnaudlh <20535201+arnaudlh@users.noreply.github.com>
Copilot AI changed the title from "[WIP] Leverage deployment stacks for idempotency in destroy flow" to "Leverage Deployment Stacks for idempotent destroy" May 5, 2026
Copilot AI requested a review from arnaudlh May 5, 2026 02:27
- introduce azure-stack-deploy and azure-stack-destroy skills (bash + pwsh)
- destroy: fast async mode (default) polls resource groups, --wait for sync
- align workflows + agents + docs with new skills
- bump plugin to 0.1.0

🚀 - Generated by Copilot

github-actions Bot commented May 6, 2026

⚠️ Documentation Staleness Warning

Source files (agents, skills, workflows, or config) changed in this PR, but the generated documentation is out of date.

Changed docs that need regeneration:

  • website/docs/agents/azure-resource-deployer.md
  • website/docs/agents/azure-template-generator.md
  • website/docs/reference/marketplace.md
  • website/docs/reference/plugin-json.md
  • website/docs/skills/azure-stack-deploy.md
  • website/docs/skills/azure-stack-destroy.md
  • website/docs/skills/overview.md
  • website/docs/skills/prereq-check.md
  • website/docs/workflows/daily-repo-status-lock.md
  • website/docs/workflows/git-ape-deploy.md
  • website/docs/workflows/issue-triage-agent-lock.md
  • website/docs/workflows/pr-validation.md
  • website/docs/workflows/waza-agent-evals.md
  • website/docs/workflows/waza-evals.md

To fix: Run the following command and commit the results:

node scripts/generate-docs.js

This is an advisory check — it does not block the PR.

Copilot AI left a comment

Pull request overview

Updates Git-Ape’s deploy/destroy flows to use Azure Deployment Stacks as the primary lifecycle primitive, aiming to make destroy + redeploy idempotent across resource groups, subscription-scope resources, and soft-deletable services. It also introduces local “stack deploy/destroy” skills and extends the persisted deployment state schema to record stack identity and managed resources for deterministic teardown.

Changes:

  • Switch deploy from az deployment sub create to az stack sub create (with state capture of managed resources / RGs).
  • Switch destroy to prefer az stack sub delete --action-on-unmanage deleteAll, add soft-delete purge sweep + subscription deployment-history cleanup.
  • Add /azure-stack-deploy and /azure-stack-destroy skills (bash + PowerShell) and update docs/agent guidance + version bumps.
Summary per file:

| File | Description |
| --- | --- |
| website/docs/workflows/git-ape-destroy.md | Documents new stack-first destroy workflow, purge sweep, and new terminal statuses. |
| website/docs/workflows/git-ape-deploy.md | Documents stack-first deploy workflow and managed-resource/state capture. |
| website/docs/skills/overview.md | Adds “General Skills” entries for stack deploy/destroy. |
| website/docs/skills/azure-stack-destroy.md | New docs page for stack-based destroy skill. |
| website/docs/skills/azure-stack-deploy.md | New docs page for stack-based deploy skill. |
| website/docs/deployment/state.md | Extends state/metadata schema docs and lifecycle diagram for stack + new destroy outcomes. |
| website/docs/agents/git-ape.md | Updates agent guidance to prefer stacks and soft-delete purge on destroy. |
| website/docs/agents/azure-template-generator.md | Updates generated-agent guidance to prefer stacks (fallback to legacy deployment). |
| website/docs/agents/azure-resource-deployer.md | Updates deployer guidance to use stack validate/create and verify extended state.json. |
| plugin.json | Bumps plugin version to 0.1.0. |
| .github/workflows/git-ape-destroy.exampleyml | Implements stack delete path, purge sweep, deployment history cleanup, and new statuses. |
| .github/workflows/git-ape-deploy.exampleyml | Implements stack validate/create and writes extended state.json + metadata updates. |
| .github/skills/azure-stack-destroy/SKILL.md | Adds new user-invocable destroy skill spec and usage. |
| .github/skills/azure-stack-destroy/scripts/destroy-stack.sh | Adds local bash destroy implementation (stack delete + purge + state updates). |
| .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 | Adds local PowerShell destroy implementation (stack delete + purge + state updates). |
| .github/skills/azure-stack-deploy/SKILL.md | Adds new user-invocable deploy skill spec and usage. |
| .github/skills/azure-stack-deploy/scripts/deploy-stack.sh | Adds local bash deploy implementation (stack create + managed resource capture + state writes). |
| .github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 | Adds local PowerShell deploy implementation (stack create + managed resource capture + state writes). |
| .github/scripts/deployment-manager.sh | Re-scopes manager script to inventory-only and points deploy/destroy to new skills. |
| .github/plugin/marketplace.json | Bumps marketplace metadata version to 0.1.0. |
| .github/copilot-instructions.md | Updates guidance to use stack deploy/destroy skills in local + CI flows. |
| .github/agents/git-ape.agent.md | Mirrors website agent docs: stacks preferred + purge sweep guidance. |
| .github/agents/azure-template-generator.agent.md | Mirrors website generator docs: stacks preferred + fallback guidance. |
| .github/agents/azure-resource-deployer.agent.md | Mirrors website deployer docs: stack validate/create + extended state verification. |

Copilot's findings

Comments suppressed due to low confidence (2)

.github/skills/azure-stack-deploy/scripts/deploy-stack.sh:233

  • RESOURCE_GROUPS is derived via jq capture("/resourceGroups/(?<rg>[^/]+)") over every managedResources[].id, which will error on subscription-scoped resource IDs (no /resourceGroups/). This can make the deploy skill fail while writing state.json even though the deployment itself succeeded; filter to RG-scoped IDs or use a non-throwing match (capture(...)?/try).

RESOURCE_GROUPS=$(echo "$MANAGED_RESOURCES" | jq -c '[.[].id | capture("/resourceGroups/(?<rg>[^/]+)") | .rg] | unique')
[[ "$(echo "$RESOURCE_GROUPS" | jq 'length')" == "0" && -n "$RG_NAME" ]] && RESOURCE_GROUPS="[\"$RG_NAME\"]"
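
The guard this finding asks for (ignore IDs that have no /resourceGroups/ segment) can also be expressed without jq. The sketch below is an illustration only: `extract_resource_groups` is a hypothetical helper, it assumes the resource IDs arrive one per line on stdin, and it matches the segment name case-sensitively.

```shell
#!/bin/sh
# Hypothetical helper: print the unique resource group names found in a
# list of resource IDs (one per line on stdin). IDs without a
# /resourceGroups/ segment, such as subscription-scoped resources, are
# skipped instead of being treated as errors.
extract_resource_groups() {
    while IFS= read -r id; do
        case "$id" in
            */resourceGroups/*)
                rest=${id#*/resourceGroups/}   # drop through "/resourceGroups/"
                printf '%s\n' "${rest%%/*}"    # keep up to the next "/"
                ;;
        esac
    done | sort -u
}

printf '%s\n' \
    "/subscriptions/sub-id/resourceGroups/rg-demo/providers/Microsoft.KeyVault/vaults/kv-demo" \
    "/subscriptions/sub-id/providers/Microsoft.Authorization/roleAssignments/ra-demo" \
    | extract_resource_groups
```

The sample IDs are made up for the demonstration; only the first one contributes a resource group name.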

.github/skills/azure-stack-destroy/scripts/destroy-stack.sh:298

  • grep -oE '(?<=locations/)[^/]+' uses a PCRE lookbehind, but -E (ERE) doesn’t support lookbehind. This will fail to extract the Cognitive Services location and silently skip purge. Use grep -oP or another non-lookbehind parsing approach.
            "Microsoft.CognitiveServices/accounts")
                if [[ "$PURGE_PROTECTED" != "true" ]]; then
                    LOC=$(echo "$RES_ID" | grep -oE '(?<=locations/)[^/]+' || echo "")
                    if [[ -n "$LOC" ]]; then
                        az cognitiveservices account purge --name "$RES_NAME" --location "$LOC" \
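
As a sketch of the "non-lookbehind parsing approach" mentioned in this finding, plain shell parameter expansion can pull out the segment after locations/ without depending on `grep -P`, which is not available in every grep implementation. The helper name `extract_location` and the sample resource ID are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: print the path segment that follows "/locations/"
# in a resource ID, or an empty line when the ID has no such segment.
extract_location() {
    case "$1" in
        */locations/*)
            rest=${1#*/locations/}        # drop everything through "/locations/"
            printf '%s\n' "${rest%%/*}"   # keep up to the next "/"
            ;;
        *)
            printf '\n'
            ;;
    esac
}

# Made-up ID used only to exercise the helper.
extract_location "/subscriptions/sub-id/providers/Microsoft.CognitiveServices/locations/eastus/resourceGroups/rg-demo/deletedAccounts/acct-demo"
```

An empty result still means "skip the purge", so the caller would want to log that case instead of letting it pass silently.
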
  • Files reviewed: 24/24 changed files
  • Comments generated: 10

Comment on lines +252 to +259
# Determine deploy method: prefer deployment stacks (idempotent destroy)
# Fall back to az deployment sub create if stacks are unavailable
DEPLOY_METHOD="stack"
# Verbose output goes to a temp file so it does not contaminate the
# JSON that downstream jq calls need to parse.
VERBOSE_LOG=$(mktemp)
trap 'rm -f "$VERBOSE_LOG"' EXIT
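
The idea in the comments above (keep diagnostics out of the JSON that jq will parse) can be sketched as follows. This is a minimal illustration that assumes the CLI writes its verbose diagnostics to stderr; `fake_cli` is a hypothetical stand-in for the real command so the example runs without Azure access.

```shell
#!/bin/sh
# Hypothetical stand-in for a CLI that logs to stderr and prints JSON to stdout.
fake_cli() {
    echo "DEBUG: starting request" >&2
    echo '{"id":"example"}'
}

VERBOSE_LOG=$(mktemp)
trap 'rm -f "$VERBOSE_LOG"' EXIT

# Only stdout lands in the variable; stderr is redirected to the log file.
DEPLOY_OUTPUT=$(fake_cli 2>"$VERBOSE_LOG")

# For contrast: merging the streams mixes log lines into the "JSON".
MIXED_OUTPUT=$(fake_cli 2>&1)

echo "clean: $DEPLOY_OUTPUT"
echo "mixed: $MIXED_OUTPUT"
```

On failure the log file can be printed before the trap removes it, so no diagnostic detail is lost by keeping the streams apart.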

Comment thread .github/workflows/git-ape-deploy.exampleyml Outdated
Comment on lines +383 to +388
done

MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \
--arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" \
'. + [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": false}]')
done
Comment thread .github/skills/azure-stack-deploy/scripts/deploy-stack.sh Outdated
Comment on lines +200 to +205
# If the bg process already failed, surface it early
if ! kill -0 "$STACK_BG_PID" 2>/dev/null; then
wait "$STACK_BG_PID" 2>/dev/null || true
BG_EXIT=$?
if [[ $BG_EXIT -ne 0 ]]; then
EXISTS=$(az group exists --name "$RG" 2>/dev/null || echo "true")
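
One detail in the excerpt above is worth illustrating: when `wait` is followed by `|| true`, the `$?` read on the next line is always 0, so a failed background job cannot be detected that way. The sketch below shows one way to keep the real status. It is an illustration only, and `capture_bg_exit` is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: store the exit status of background job $1 in BG_EXIT.
# The "|| BG_EXIT=$?" form records a non-zero status without aborting
# scripts that run under "set -e".
capture_bg_exit() {
    BG_EXIT=0
    wait "$1" 2>/dev/null || BG_EXIT=$?
}

(exit 3) &
capture_bg_exit "$!"
echo "captured exit status: $BG_EXIT"

# For contrast: "|| true" replaces the status before it is read.
(exit 3) &
wait "$!" 2>/dev/null || true
MASKED_EXIT=$?
echo "masked exit status: $MASKED_EXIT"
```

The helper has to run in the current shell (not inside `$(...)`), because a subshell cannot wait on its parent's background jobs.
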
Comment thread .github/skills/azure-stack-destroy/scripts/destroy-stack.sh Outdated
Comment thread .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 Outdated
Comment on lines +317 to +334
# Determine deploy method: prefer deployment stacks (idempotent destroy)
# Fall back to az deployment sub create if stacks are unavailable
DEPLOY_METHOD="stack"

if [[ "$DEPLOY_METHOD" == "stack" ]]; then
  DEPLOY_OUTPUT=$(az stack sub create \
    --name "$DEPLOYMENT_ID" \
    --location "$LOCATION" \
    --template-file "$DEPLOY_DIR/template.json" \
    --parameters @"$DEPLOY_DIR/parameters.json" \
    --action-on-unmanage deleteAll \
    --deny-settings-mode none \
    --description "Git-Ape deployment $DEPLOYMENT_ID" \
    --tags "managedBy=git-ape" "deploymentId=$DEPLOYMENT_ID" \
    --yes \
    --verbose \
    --output json 2>&1)
else
Comment thread website/docs/deployment/state.md Outdated
Comment thread website/docs/deployment/state.md Outdated
arnaudlh and others added 7 commits May 25, 2026 04:46
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

github-actions Bot commented May 26, 2026

🧪 Waza skill evals (advisory)

🔁 Full matrix run. project-wide config change (.waza.yaml, manifest, or workflow file) → full matrix

Ran 8 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg waza-results-* artifacts.

Legend: Models flagged baseline: true in .github/evals/manifest.yaml (currently: gpt-5.4) run with --baseline (A/B mode) to cap quota. All other models run standard. Judge model is fixed at claude-sonnet-4.6 across all legs.

📊 Token comparison vs main (advisory)
{
  "baseRef": "main",
  "headRef": "WORKING",
  "threshold": 10,
  "passed": true,
  "timestamp": "2026-05-26T08:19:32.889718074Z",
  "summary": {
    "totalBefore": 0,
    "totalAfter": 37876,
    "totalDiff": 37876,
    "percentChange": 100,
    "filesAdded": 15,
    "filesRemoved": 0,
    "filesModified": 0,
    "filesIncreased": 15,
    "filesDecreased": 0
  },
  "files": [
    {
      "file": ".github/skills/azure-cost-estimator/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3227,
        "characters": 11926,
        "lines": 344
      },
      "diff": 3227,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-deployment-preflight/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1444,
        "characters": 6267,
        "lines": 211
      },
      "diff": 1444,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-drift-detector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 3179,
        "characters": 13149,
        "lines": 460
      },
      "diff": 3179,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-integration-tester/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1559,
        "characters": 6793,
        "lines": 247
      },
      "diff": 1559,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-naming-research/SKILL.md",
      "before": null,
      "after": {
        "tokens": 486,
        "characters": 2108,
        "lines": 44
      },
      "diff": 486,
      "percentChange": 100,
      "status": "added",
      "limit": 500
    },
    {
      "file": ".github/skills/azure-policy-advisor/SKILL.md",
      "before": null,
      "after": {
        "tokens": 6233,
        "characters": 26754,
        "lines": 642
      },
      "diff": 6233,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-availability/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2409,
        "characters": 9867,
        "lines": 307
      },
      "diff": 2409,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-resource-visualizer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1490,
        "characters": 6165,
        "lines": 191
      },
      "diff": 1490,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-rest-api-reference/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1827,
        "characters": 8416,
        "lines": 199
      },
      "diff": 1827,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-role-selector/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1276,
        "characters": 5627,
        "lines": 161
      },
      "diff": 1276,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-security-analyzer/SKILL.md",
      "before": null,
      "after": {
        "tokens": 5322,
        "characters": 21405,
        "lines": 450
      },
      "diff": 5322,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-deploy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 1912,
        "characters": 7525,
        "lines": 159
      },
      "diff": 1912,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/azure-stack-destroy/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2644,
        "characters": 10670,
        "lines": 180
      },
      "diff": 2644,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/git-ape-onboarding/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2730,
        "characters": 11072,
        "lines": 270
      },
      "diff": 2730,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    },
    {
      "file": ".github/skills/prereq-check/SKILL.md",
      "before": null,
      "after": {
        "tokens": 2138,
        "characters": 8019,
        "lines": 147
      },
      "diff": 2138,
      "percentChange": 100,
      "status": "added",
      "limit": 500,
      "overLimit": true
    }
  ]
}

Skill: prereq-check

📈 Score (per model) + Suggestions/Recommendations
Model: claude-opus-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-opus-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 2m11.786s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 1.00 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6

Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml

Model: claude-sonnet-4.6

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 2m39.795s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 1.00 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6

Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-sonnet-4.6.junit.xml

Model: gpt-5-codex

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [3/4] Positive — "command not found" failure
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [4/4] Positive — "What do I need to install?"
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [1/4] Negative — Editing an ARM template
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [2/4] Negative — Azure service concept question

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.00 | Duration: 301ms

  • Tests: 4 total, 0 passed, 4 failed, 0 errors
  • Success Rate: 0.0%
  • Score Range: 0.00 - 0.00 (σ=0.0000)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.00 -
Negative — Azure service concept question 0.00 -
Positive — "command not found" failure 0.00 -
Positive — "What do I need to install?" 0.00 -

Failed Task Details

Negative — Editing an ARM template

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Negative — Azure service concept question

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "command not found" failure

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):

Positive — "What do I need to install?"

Run 1/3 (error):

Run 2/3 (error):

Run 3/3 (error):


Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5-codex

Results saved to: .waza-results/prereq-check-gpt-5-codex.json

Model: gpt-5.4 *(baseline — A/B mode)*

Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure

════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"

════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: +50.0 percentage points

Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 67% → 100% (+33pp)
• Positive — "What do I need to install?" [IMPROVED] 67% → 100% (+33pp)

Verdict: Skills have POSITIVE IMPACT (improved 2/4 tasks)
════════════════════════════════════════════════════════════════

🧪 Waza Eval Results

Status: ✅ Passed | Score: 0.79 | Duration: 2m10.625s

  • Tests: 4 total, 4 passed, 0 failed, 0 errors
  • Success Rate: 100.0%
  • Score Range: 0.57 - 1.00 (σ=0.2074)

Task Results

Task Score Status Graders
Negative — Editing an ARM template 0.57 budget, trigger_relevance_negative
Negative — Azure service concept question 0.60 budget, trigger_relevance_negative
Positive — "command not found" failure 1.00 answer_quality, budget, trigger_relevance_positive
Positive — "What do I need to install?" 1.00 answer_quality, budget, trigger_relevance_positive

Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4

Results saved to: .waza-results/prereq-check-gpt-5.4.json
JUnit XML saved to: .waza-results/prereq-check-gpt-5.4.junit.xml

🔢 Tokens (count + profile)

📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
   ⚠️  token count 2138 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious, steps are numbered with clear actions and references, the Quick Reference table surfaces critical facts at a glance, and the Always/Never constraint lists eliminate ambiguity. No meaningful improvements needed.
completeness       █████  Covers tool checks, version minimums, auth sessions, platform variance (macOS/Linux/Windows), error handling for 8 distinct failure modes, edge cases like user-reported vs terminal-detected missing tools, and a defined handoff. Nothing material is missing.
trigger_precision  ████░  USE FOR is exhaustive with concrete error-string examples (e.g., 'az: command not found'), which is excellent. DO NOT USE FOR is a catch-all ('Anything else') — adding 1-2 explicit anti-examples (e.g., 'do not use for Azure resource validation, use azure-validate') would sharpen negative routing slightly.
scope_coverage     █████  Boundaries are crystal clear: read-only, four specific tools, two auth sessions, three platforms, no auto-installation, no chaining. The Related Skills section correctly offloads adjacent concerns (onboarding, deployment-time checks) without bleeding scope.
anti_patterns      █████  Avoids all major anti-patterns: no vague instructions, no conflicting directives (Rules section pre-empts contradiction between terminal findings and user reports), error handling is explicit with cause/fix pairs, and prescriptiveness is calibrated — steps are specific enough to be actionable without over-constraining implementation.
────────────────────────────────────────────
Overall: 4.8/5.0

Exceptionally well-structured skill. It is one of the cleaner examples of a narrowly scoped, operationally complete skill document — clear rules, thorough error handling, explicit platform branching, and a well-defined verdict taxonomy. The only minor gap is that the DO NOT USE FOR clause relies on a catch-all rather than named exclusions, which could cause occasional mis-routing in ambiguous cases.

✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/prereq-check/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: prereq-check

📋 Compliance Score: Medium-High
   ⚠️  Good, but could be improved. Missing routing clarity.

   Issues found:
   ❌  SKILL.md is 2138 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 4/4 valid
   ✅  All links valid.

📊 Token Budget: 2138 / 500 tokens
   ❌  Exceeds limit by 1638 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  4 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 1 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ❌  [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-deploy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] waiting for session.idle: context deadline exceeded

✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
[ERROR] waiting for session.idle: context deadline exceeded

✗ [4/5] Positive — Local deploy of an existing deployment artifact

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.75 | Duration: 2m26.67s

  • Tests: 5 total, 2 passed, 3 failed, 0 errors
  • Success Rate: 40.0%
  • Score Range: 0.60 - 0.86 (σ=0.1167)
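
The summary figures can be roughly reproduced from the per-task scores listed under Task Results. The sketch below assumes the overall score is the mean of the task scores and that σ is a population standard deviation; neither is confirmed by the output, and the inputs are the two-decimal values as printed, so σ only approximates the reported 0.1167.

```python
from statistics import fmean, pstdev

# Per-task scores as printed (rounded to two decimals) under Task Results.
scores = [0.86, 0.60, 0.82, 0.61, 0.85]
passed, total = 2, 5

overall = fmean(scores)          # assumption: overall score = mean of task scores
sigma = pstdev(scores)           # assumption: population standard deviation
success_rate = passed / total    # 2 of 5 tasks passed

print(f"score={overall:.2f} success={success_rate:.1%} sigma~{sigma:.3f}")
```

The gap between a 0.75 score and a 40% success rate follows from this: a task can earn partial credit from its graders and still count as failed.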

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.86 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ✗ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.61 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✓ | answer_quality, budget, trigger_relevance_positive |

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)

Negative — What-if preview / preflight validation

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
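
The grader messages imply a single 0.50 cutoff that is read in opposite directions for positive and negative tasks. A minimal sketch of that rule, with the function name `trigger_relevance_passes` being hypothetical:

```python
THRESHOLD = 0.50  # cutoff shown in the grader messages

def trigger_relevance_passes(score: float, expect_trigger: bool) -> bool:
    """Positive tasks pass at or above the threshold; negative tasks pass below it."""
    aligned = score >= THRESHOLD
    return aligned == expect_trigger

print(trigger_relevance_passes(0.71, expect_trigger=False))  # destroy prompt, negative task -> False
print(trigger_relevance_passes(0.83, expect_trigger=True))   # local deploy prompt, positive task -> True
```

Read this way, both negative failures above mean the skill description still scores as a match for destroy and what-if prompts, not that the agent misbehaved at runtime.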

Positive — Local deploy of an existing deployment artifact

Run 1/2 (error):

  • answer_quality (0.00): fail: The assistant never delivered a response to the user. All tool calls failed with "unexpected user permission response" and no final answer was produced. None of the four required criteria were met:
  1. az stack sub create — not mentioned
  2. --action-on-unmanage deleteAll — not mentioned
  3. Reference to deploy-stack.sh or deploy-stack.ps1 — not mentioned
  4. state.json (schemaVersion 1.0) — not mentioned
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Run 2/2 (error):

  • answer_quality (0.00): fail: The assistant invoked the skill and loaded the context, but then got stuck in permission/tool errors and never delivered a response to the user. None of the four required criteria appeared in the user-facing output:
  1. az stack sub create — never mentioned
  2. --action-on-unmanage deleteAll — never mentioned
  3. deploy-stack.sh / deploy-stack.ps1 script reference — never mentioned
  4. state.json (schemaVersion 1.0) — never mentioned
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)

Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json

Model: gpt-5-codex

Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [1/5] Negative — Destroying / tearing down an existing deployment
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [5/5] Positive — Re-deploy after template edit
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [4/5] Positive — Local deploy of an existing deployment artifact
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [3/5] Negative — What-if preview / preflight validation
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.00 | Duration: 1.065s

  • Tests: 5 total, 0 passed, 5 failed, 0 errors
  • Success Rate: 0.0%
  • Score Range: 0.00 - 0.00 (σ=0.0000)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Destroying / tearing down an existing deployment | 0.00 | ✗ | - |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.00 | ✗ | - |
| Negative — What-if preview / preflight validation | 0.00 | ✗ | - |
| Positive — Local deploy of an existing deployment artifact | 0.00 | ✗ | - |
| Positive — Re-deploy after template edit | 0.00 | ✗ | - |

Failed Task Details

Negative — Destroying / tearing down an existing deployment

Run 1/2 (error):

Run 2/2 (error):

Negative — Off-topic prompt (Linux kernel scheduling)

Run 1/2 (error):

Run 2/2 (error):

Negative — What-if preview / preflight validation

Run 1/2 (error):

Run 2/2 (error):

Positive — Local deploy of an existing deployment artifact

Run 1/2 (error):

Run 2/2 (error):

Positive — Re-deploy after template edit

Run 1/2 (error):

Run 2/2 (error):


Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5-codex

Results saved to: .waza-results/azure-stack-deploy-gpt-5-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
   ⚠️  token count 1912 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious from the description and header. Steps are logically ordered (locate → run → inspect → report), script internals are enumerated in sequence, and output examples make expected behavior concrete. The dual bash/PowerShell coverage adds slight cognitive overhead but is justified and well-separated.
completeness       █████  Prerequisites table, argument reference, full state.json schema with example, soft-deletable resource type list, failure modes table, and post-run messaging requirements are all present. The fallback path, race-condition edge case, and idempotency behavior are explicitly documented — very few gaps remain.
trigger_precision  ████░  'When to Use' and 'Do NOT use for' sections are well-defined with concrete anti-cases (destroy, preflight, template authoring). The only minor gap is that there is no guidance for partially failed or rolled-back stacks that may already have a stack ID — it's ambiguous whether this skill or a destroy+redeploy flow is correct in that state.
scope_coverage     █████  Boundaries are tight and explicit: subscription-scoped stacks only, Git-Ape artifact layout assumed, state.json schema is versioned and cross-referenced to docs. The skill neither overreaches into template generation nor undersells by omitting the fallback path and multi-RG lifecycle ownership nuance.
anti_patterns      ████░  No vague instructions or conflicting directives detected. Error handling is thorough (per-operation failure dump, early exit on missing auth/template). The one minor issue is that the post-run messaging section ('What to tell the user') is prescriptive in a way that may cause the agent to mechanically recite output rather than adapt to context — a note that this is a minimum floor rather than a rigid script would improve it.
────────────────────────────────────────────
Overall: 4.6/5.0

An exceptionally well-structured skill with versioned schema, dual-runtime support, explicit fallback semantics, and thorough failure-mode coverage. Minor deductions for the ambiguous re-deploy-after-partial-failure scenario and the overly rigid post-run messaging prescription. Production-ready with minimal revision needed.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-deploy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-deploy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 1912 tokens (hard limit 500)

📐 Spec Compliance: 8/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility

📎 Links: 0/8 valid
   ⚠️  8 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
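
A "link escapes skill directory" finding can be understood as a relative link whose normalised path climbs above the directory that holds SKILL.md. The sketch below illustrates that idea only; `escapes_skill_dir` and the second sample path are hypothetical, and waza's actual rule may differ.

```python
import posixpath

def escapes_skill_dir(link: str) -> bool:
    """True when a link, normalised relative to the skill directory, leaves that directory."""
    normalised = posixpath.normpath(link)
    return (
        normalised == ".."
        or normalised.startswith("../")
        or posixpath.isabs(normalised)
    )

# First link is from the report; the second is a hypothetical in-directory reference.
for link in ("../azure-stack-destroy/SKILL.md", "references/state-schema.md"):
    print(link, escapes_skill_dir(link))
```

Every flagged link above starts with `../`, so under this reading the sibling-skill and docs cross-references would all be reported regardless of whether their targets exist.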

📊 Token Budget: 1912 / 500 tokens
   ❌  Exceeds limit by 1412 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Description density is optimal for cross-model use
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips

Skill: azure-stack-destroy

📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
[ERROR] waiting for session.idle: context deadline exceeded

✗ [4/5] Positive — Clean up the deployment stack
✗ [5/5] Positive — Local destroy of a Git-Ape deployment

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.77 | Duration: 2m45.646s

  • Tests: 5 total, 1 passed, 4 failed, 0 errors
  • Success Rate: 20.0%
  • Score Range: 0.60 - 0.87 (σ=0.0903)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Deploying a new stack (opposite operation) | 0.81 | ✗ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ✗ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✓ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.79 | ✗ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ✗ | answer_quality, budget, trigger_relevance_positive |

⚠️ Flaky Tasks

The following tasks showed inconsistent results across runs:

  • Positive — Clean up the deployment stack: 50% pass rate, score=0.79±0.17
  • Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17
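
A 50% pass rate over two runs means one run passed and one did not. A small sketch of how such a task might be classified as flaky, with the helper names `pass_rate` and `is_flaky` being hypothetical:

```python
def pass_rate(run_outcomes: list[bool]) -> float:
    """Fraction of runs that passed."""
    return sum(run_outcomes) / len(run_outcomes)

def is_flaky(run_outcomes: list[bool]) -> bool:
    """A task is inconsistent when some runs pass and others fail."""
    return 0.0 < pass_rate(run_outcomes) < 1.0

print(pass_rate([True, False]), is_flaky([True, False]))  # one pass, one fail -> 0.5 True
print(is_flaky([False, False]))                           # consistently failing -> False
```

This also explains why the negative tasks are absent from the flaky list: they failed in both runs with identical scores.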

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Run 2/2 (failed):

  • budget (1.00): All behavior checks passed
  • trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)

Positive — Clean up the deployment stack

Run 2/2 (failed):

  • answer_quality (0.00): fail: Criterion 1 not met: The response recommends running the destroy script but never explicitly tells the user NOT to use raw az group delete, nor explains that az group delete would miss soft-delete cleanup and multi-RG resources. Criteria 2, 3, and 4 are all satisfied.
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)

Positive — Local destroy of a Git-Ape deployment

Run 1/2 (error):

  • answer_quality (0.00): fail: Criteria 2, 3, and 4 not met: The assistant invoked the azure-stack-destroy skill (criteria 1 ✅), but its response then devolved entirely into failed file-lookup tool calls with no user-facing explanation. It never:
  • Explained that state.json under .azure/deployments/deploy-20260506-001/ is the source of truth for stackId, managedResources, softDeletable, purgeProtected (criteria 2 ❌)
  • Named or described the az stack sub delete --action-on-unmanage deleteAll command or its semantics (criteria 3 ❌)
  • Mentioned the soft-delete purge sweep, az keyvault purge / az keyvault list-deleted, or explained that non-purge-protected Key Vaults would be purged so the name is immediately reusable (criteria 4 ❌)
  • budget (1.00): All behavior checks passed
  • trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
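
The criteria the judge lists map onto a small decision procedure driven by state.json. The sketch below builds the az CLI command lines without running them. It assumes `resourceGroups` holds plain names and that each `managedResources` entry carries `name`, `type`, `softDeletable`, and `purgeProtected`; those field shapes, the function name `plan_destroy`, and the sample IDs are assumptions rather than the skill's actual script, and the fallback path is simplified to resource-group deletion.

```python
def plan_destroy(state: dict) -> list[list[str]]:
    """Build (but do not execute) az CLI commands for a destroy from a state.json dict."""
    commands: list[list[str]] = []
    if state.get("stackId"):
        # Stack path: one command covers every managed resource.
        commands.append(["az", "stack", "sub", "delete", "--id", state["stackId"],
                         "--action-on-unmanage", "deleteAll", "--yes"])
    else:
        # Fallback path for pre-stack deployments (simplified): delete each resource group.
        for rg in state.get("resourceGroups", []):
            commands.append(["az", "group", "delete", "--name", rg, "--yes"])
    # Soft-delete sweep: purge Key Vaults that are not purge-protected.
    for res in state.get("managedResources", []):
        is_vault = res.get("type", "").lower() == "microsoft.keyvault/vaults"
        if is_vault and res.get("softDeletable") and not res.get("purgeProtected"):
            commands.append(["az", "keyvault", "purge", "--name", res["name"]])
    return commands

# Hypothetical state.json content for illustration only.
state = {
    "stackId": "/subscriptions/00000000-0000-0000-0000-000000000000"
               "/providers/Microsoft.Resources/deploymentStacks/example-stack",
    "managedResources": [
        {"name": "kv-open", "type": "Microsoft.KeyVault/vaults",
         "softDeletable": True, "purgeProtected": False},
        {"name": "kv-locked", "type": "Microsoft.KeyVault/vaults",
         "softDeletable": True, "purgeProtected": True},
    ],
}
for cmd in plan_destroy(state):
    print(" ".join(cmd))
```

In this sketch the purge-protected vault produces no command, which corresponds to the retained-soft-deleted status described in the PR summary.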

Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6

Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json

Model: gpt-5-codex

Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [1/5] Negative — Deploying a new stack (opposite operation)
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [5/5] Positive — Local destroy of a Git-Ape deployment
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [4/5] Positive — Clean up the deployment stack
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [2/5] Negative — Deleting a non-Git-Ape resource group
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.

✗ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)

🧪 Waza Eval Results

Status: ❌ Failed | Score: 0.00 | Duration: 436ms

  • Tests: 5 total, 0 passed, 5 failed, 0 errors
  • Success Rate: 0.0%
  • Score Range: 0.00 - 0.00 (σ=0.0000)

Task Results

| Task | Score | Status | Graders |
| --- | --- | --- | --- |
| Negative — Deploying a new stack (opposite operation) | 0.00 | ✗ | - |
| Negative — Deleting a non-Git-Ape resource group | 0.00 | ✗ | - |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.00 | ✗ | - |
| Positive — Clean up the deployment stack | 0.00 | ✗ | - |
| Positive — Local destroy of a Git-Ape deployment | 0.00 | ✗ | - |

Failed Task Details

Negative — Deploying a new stack (opposite operation)

Run 1/2 (error):

Run 2/2 (error):

Negative — Deleting a non-Git-Ape resource group

Run 1/2 (error):

Run 2/2 (error):

Negative — Off-topic prompt (Linux kernel scheduling)

Run 1/2 (error):

Run 2/2 (error):

Positive — Clean up the deployment stack

Run 1/2 (error):

Run 2/2 (error):

Positive — Local destroy of a Git-Ape deployment

Run 1/2 (error):

Run 2/2 (error):


Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5-codex

Results saved to: .waza-results/azure-stack-destroy-gpt-5-codex.json

🔢 Tokens (count + profile)

📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
   ⚠️  token count 2644 exceeds 1000

🎯 Quality (5-dim table)

DIMENSION          SCORE  FEEDBACK
────────────────────────────────────────────
clarity            █████  Purpose is immediately obvious from the description, steps are logically ordered with clear headings, and code examples (bash and PowerShell) remove ambiguity. The fast-vs-sync mode table is a standout clarity element.
completeness       █████  Exceptional coverage: prerequisites table, procedure with numbered steps, all flag arguments, 5 terminal status codes, 4 failure modes with recovery guidance, soft-delete purge sweep details, and state.json output schema. Edge cases like purge-protected resources and the no-stackId fallback path are explicitly addressed.
trigger_precision  ████░  USE FOR and DO NOT USE FOR are specific and actionable with concrete phrasing examples. However, the separate 'When to Use' section directly below USE FOR duplicates several triggers verbatim, adding noise without value — consolidate or remove it.
scope_coverage     █████  Boundaries are explicit on all axes: stack-only (no surgical per-resource delete), Git-Ape deployments only, state.json required (hard abort if missing), and the 'no non-Azure / non-Git-Ape' exclusion is unambiguous. The CI-parity guarantee is clearly scoped.
anti_patterns      ████░  Avoids vague instructions, conflicting directives, and missing error handling. The one anti-pattern is the redundant 'When to Use' section that re-states USE FOR content — this creates a maintenance burden and potential drift. Additionally, 'Prefer this over raw az group delete' is nested under USE FOR when it logically belongs in a rationale or DO NOT USE FOR section.
────────────────────────────────────────────
Overall: 4.6/5.0

A high-quality skill document that sets a strong example for operational runbooks. It is thorough, precise, and well-structured with excellent failure-mode coverage and clear CI-parity guarantees. The only meaningful improvement is eliminating the redundant 'When to Use' section that duplicates USE FOR content, and reorganizing the 'Prefer this over az group delete' rationale into a more logical location.
✅ Check (compliance summary)

ℹ️ waza check expects eval.yaml colocated with SKILL.md. This repo separates them into .github/evals/azure-stack-destroy/eval.yaml, so the "Evaluation Suite: Not Found" line below is a false negative — the eval actually ran (see the Score section above).

🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

Skill: azure-stack-destroy

📋 Compliance Score: Low
   ❌  Needs significant improvement. Description too short or missing triggers.

   Issues found:
   ❌  SKILL.md is 2644 tokens (hard limit 500)

📐 Spec Compliance: 7/9 checks passed
   ❌  Does not fully meet agentskills.io specification.
   ❌  [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
     📎  agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
   ❌  [spec-security] Security risks detected: description contains XML angle brackets
     📎  XML angle brackets and reserved prefixes pose injection and naming conflict risks
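
One plausible reading of the [spec-security] finding is that the description contains a tag-shaped placeholder such as an argument name in angle brackets. The sketch below tests for that shape; the pattern, the function name `has_xml_angle_brackets`, and both sample descriptions are assumptions, and the real check may simply reject any `<` or `>` character.

```python
import re

# Matches anything shaped like an XML/HTML tag, e.g. <name> or </tag>.
TAG_PATTERN = re.compile(r"</?[A-Za-z][^<>]*>")

def has_xml_angle_brackets(description: str) -> bool:
    """True when the description contains a tag-like <...> sequence."""
    return TAG_PATTERN.search(description) is not None

# Hypothetical descriptions, not the skill's actual text.
print(has_xml_angle_brackets("Destroy the stack for <deployment-id>"))       # True
print(has_xml_angle_brackets("Destroy the stack for a given deployment ID")) # False
```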

📎 Links: 0/4 valid
   ⚠️  4 link issue(s) found.
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
   ❌  [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory

📊 Token Budget: 2644 / 500 tokens
   ❌  Exceeds limit by 2144 tokens. Consider reducing content.

🧪 Evaluation Suite: Found
   ✅  eval.yaml detected. Run 'waza run eval.yaml' to test.

📐 Schema Validation: Passed
   ✅  eval.yaml schema valid
   ✅  5 task file(s) validated

💡 Advisory Checks
   ✅  [module-count] Found 0 reference module(s)
   ❌  [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
   ✅  [negative-delta-risk] No negative delta risk patterns detected
   ✅  [procedural-content] Description contains procedural language
   ✅  [over-specificity] No over-specificity patterns detected
   ✅  [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
   ❌  [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
   ✅  [progressive-disclosure] Content structure supports progressive disclosure
   ✅  [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

⚠️  Your skill needs some work before submission.

🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━

To improve your skill:

1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips

Contributor

github-actions Bot commented May 26, 2026

🤖 Waza agent evals (advisory)

ℹ️ No agents evaluated. changed agent(s) have no eval directory: azure-resource-deployer azure-template-generator git-ape

Ran 0 agent evals against claude-sonnet-4.6. Each eval consumes ~5 premium Copilot requests; results are non-blocking — investigate failures via the workflow logs and the per-agent waza-agent-results-* artifacts.

How this works: This workflow auto-syncs the canonical .github/agents/<name>.agent.md into the sibling mirror inside .github/evals/agents/<name>/ before each run, so the score below reflects the version of the agent in this PR — not whatever was committed when the eval was first wired up.
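
The sync step described above can be pictured as a copy from the canonical agent file into its eval mirror, skipped when no eval directory exists (the case reported for this PR). This is a sketch under assumptions: the function name `sync_agent` is hypothetical, and the mirror is assumed to keep the same file name as the canonical file.

```python
import shutil
from pathlib import Path
from typing import Optional

def sync_agent(repo_root: Path, name: str) -> Optional[Path]:
    """Copy .github/agents/<name>.agent.md into .github/evals/agents/<name>/.

    Returns the mirrored path, or None when the agent has no eval directory.
    """
    source = repo_root / ".github" / "agents" / f"{name}.agent.md"
    mirror_dir = repo_root / ".github" / "evals" / "agents" / name
    if not mirror_dir.is_dir():
        return None  # no eval directory: nothing to evaluate for this agent
    # Assumption: the mirror uses the same file name as the canonical file.
    return Path(shutil.copy2(source, mirror_dir / source.name))
```

Called once per changed agent before the eval run, a helper like this would keep the evaluated copy identical to the version in the PR branch.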

📊 Agent file token comparison vs main (advisory)

No .agent.md files changed vs main (or token-compare returned no entries).

No agents in scope for this PR.

arnaudlh added 4 commits May 26, 2026 16:11
- thin out azure-resource-deployer: delegate preflight, verify, rollback to existing skills
- thin out azure-template-generator: add Step 0 lookups (rest-api-reference + naming-research)
- replace inline RBAC GUIDs and per-resource checklists with skill invocations
- collapse hardening checklist into 5 non-negotiable identity patterns + analyzer deferral

🔧 - Generated by Copilot
- destroy: front-load WHEN, add USE FOR / DO NOT USE FOR, harden state.json prerequisite
- destroy: list extra soft-deletable resource types in purgeResults note
- deploy: clarify stack-create flags and state.json schema references

🔧 - Generated by Copilot
- register both skills in evals manifest at expanded tier
- add 5-task eval for azure-stack-destroy (positive-local, positive-stack, negative-deploy, negative-non-gitape, negative-off-topic)
- add eval suite for azure-stack-deploy

🧪 - Generated by Copilot


Development

Successfully merging this pull request may close these issues.

Leverage Deployment Stacks for idempotency

3 participants