Leverage Deployment Stacks for idempotent operations #44
…tate schema
- Deploy workflow: use `az stack sub create` with `--action-on-unmanage deleteAll` as the default deployment method, with `az deployment sub create` as fallback
- Deploy workflow: add managed resources capture step after deploy that walks deployment operations or stack resources to populate state.managedResources[]
- Destroy workflow: use `az stack sub delete` when stackId is present in state, covering multi-RG, sub-scope, and MG-scope resources uniformly
- Destroy workflow: add soft-delete purge loop for Key Vault and Cognitive Services
- Destroy workflow: add deployment history cleanup step
- Destroy workflow: support new terminal statuses: `partially-destroyed` and `retained-soft-deleted`
- State schema: extend state.json with stackId, deployMethod, managedResources[], resourceGroups[], subscriptions[], externalReferences[]
- Metadata schema: add deployMethod and resourceGroups[] fields
- Documentation: update deployment state docs with new schema, statuses, and destroy strategy selection logic
- Regenerate workflow documentation pages

Agent-Logs-Url: https://github.com/Azure/git-ape/sessions/d2d1da54-9a38-41ef-9254-b5f585eab10e
Co-authored-by: arnaudlh <20535201+arnaudlh@users.noreply.github.com>
- introduce azure-stack-deploy and azure-stack-destroy skills (bash + pwsh)
- destroy: fast async mode (default) polls resource groups, --wait for sync
- align workflows + agents + docs with new skills
- bump plugin to 0.1.0

🚀 - Generated by Copilot
🔧 - Generated by Copilot
Pull request overview
Updates Git-Ape’s deploy/destroy flows to use Azure Deployment Stacks as the primary lifecycle primitive, aiming to make destroy + redeploy idempotent across resource groups, subscription-scope resources, and soft-deletable services. It also introduces local “stack deploy/destroy” skills and extends the persisted deployment state schema to record stack identity and managed resources for deterministic teardown.
Changes:
- Switch deploy from `az deployment sub create` to `az stack sub create` (with state capture of managed resources / RGs).
- Switch destroy to prefer `az stack sub delete --action-on-unmanage deleteAll`, add soft-delete purge sweep + subscription deployment-history cleanup.
- Add `/azure-stack-deploy` and `/azure-stack-destroy` skills (bash + PowerShell) and update docs/agent guidance + version bumps.
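The destroy-path selection described in the commit message (use `az stack sub delete` when a `stackId` is present in state) can be sketched roughly as follows. This is an illustration only: the function name `choose_destroy_method`, the `legacy` label for the non-stack path, and the sample stack ID are assumptions, not names taken from the PR.

```bash
#!/usr/bin/env bash
# Sketch only: pick a destroy path from the stackId recorded in state.
# "legacy" is a placeholder label for whatever non-stack path applies.
choose_destroy_method() {
  local stack_id="${1:-}"
  if [[ -n "$stack_id" ]]; then
    echo "stack"    # the workflow would call: az stack sub delete ...
  else
    echo "legacy"   # no stackId recorded, so the stack path is not available
  fi
}

# Hypothetical inputs: a made-up stack ID, then an empty value.
choose_destroy_method "/subscriptions/0000/providers/Microsoft.Resources/deploymentStacks/demo"
choose_destroy_method ""
```

Keeping the decision in one small function makes it straightforward to test without touching Azure.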
Show a summary per file
| File | Description |
|---|---|
| website/docs/workflows/git-ape-destroy.md | Documents new stack-first destroy workflow, purge sweep, and new terminal statuses. |
| website/docs/workflows/git-ape-deploy.md | Documents stack-first deploy workflow and managed-resource/state capture. |
| website/docs/skills/overview.md | Adds “General Skills” entries for stack deploy/destroy. |
| website/docs/skills/azure-stack-destroy.md | New docs page for stack-based destroy skill. |
| website/docs/skills/azure-stack-deploy.md | New docs page for stack-based deploy skill. |
| website/docs/deployment/state.md | Extends state/metadata schema docs and lifecycle diagram for stack + new destroy outcomes. |
| website/docs/agents/git-ape.md | Updates agent guidance to prefer stacks and soft-delete purge on destroy. |
| website/docs/agents/azure-template-generator.md | Updates generated-agent guidance to prefer stacks (fallback to legacy deployment). |
| website/docs/agents/azure-resource-deployer.md | Updates deployer guidance to use stack validate/create and verify extended state.json. |
| plugin.json | Bumps plugin version to 0.1.0. |
| .github/workflows/git-ape-destroy.exampleyml | Implements stack delete path, purge sweep, deployment history cleanup, and new statuses. |
| .github/workflows/git-ape-deploy.exampleyml | Implements stack validate/create and writes extended state.json + metadata updates. |
| .github/skills/azure-stack-destroy/SKILL.md | Adds new user-invocable destroy skill spec and usage. |
| .github/skills/azure-stack-destroy/scripts/destroy-stack.sh | Adds local bash destroy implementation (stack delete + purge + state updates). |
| .github/skills/azure-stack-destroy/scripts/destroy-stack.ps1 | Adds local PowerShell destroy implementation (stack delete + purge + state updates). |
| .github/skills/azure-stack-deploy/SKILL.md | Adds new user-invocable deploy skill spec and usage. |
| .github/skills/azure-stack-deploy/scripts/deploy-stack.sh | Adds local bash deploy implementation (stack create + managed resource capture + state writes). |
| .github/skills/azure-stack-deploy/scripts/deploy-stack.ps1 | Adds local PowerShell deploy implementation (stack create + managed resource capture + state writes). |
| .github/scripts/deployment-manager.sh | Re-scopes manager script to inventory-only and points deploy/destroy to new skills. |
| .github/plugin/marketplace.json | Bumps marketplace metadata version to 0.1.0. |
| .github/copilot-instructions.md | Updates guidance to use stack deploy/destroy skills in local + CI flows. |
| .github/agents/git-ape.agent.md | Mirrors website agent docs: stacks preferred + purge sweep guidance. |
| .github/agents/azure-template-generator.agent.md | Mirrors website generator docs: stacks preferred + fallback guidance. |
| .github/agents/azure-resource-deployer.agent.md | Mirrors website deployer docs: stack validate/create + extended state verification. |
Copilot's findings
Comments suppressed due to low confidence (2)
.github/skills/azure-stack-deploy/scripts/deploy-stack.sh:233
`RESOURCE_GROUPS` is derived via `jq capture("/resourceGroups/(?<rg>[^/]+)")` over every `managedResources[].id`, which will error on subscription-scoped resource IDs (no `/resourceGroups/`). This can make the deploy skill fail while writing `state.json` even though the deployment itself succeeded; filter to RG-scoped IDs or use a non-throwing match (`capture(...)?` / `try`).
RESOURCE_GROUPS=$(echo "$MANAGED_RESOURCES" | jq -c '[.[].id | capture("/resourceGroups/(?<rg>[^/]+)") | .rg] | unique')
[[ "$(echo "$RESOURCE_GROUPS" | jq 'length')" == "0" && -n "$RG_NAME" ]] && RESOURCE_GROUPS="[\"$RG_NAME\"]"
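One way to make the "filter to RG-scoped IDs" suggestion explicit, independent of how `jq` treats a non-matching `capture`, is to do the extraction in bash itself. The sketch below is not the PR's code: the helper name `extract_resource_groups` and all sample resource IDs are made up for illustration.

```bash
#!/usr/bin/env bash
# Illustrative IDs only; the second one is subscription-scoped (no /resourceGroups/).
ids=(
  "/subscriptions/0000/resourceGroups/rg-app/providers/Microsoft.Storage/storageAccounts/st1"
  "/subscriptions/0000/providers/Microsoft.Authorization/roleAssignments/ra1"
  "/subscriptions/0000/resourceGroups/rg-data/providers/Microsoft.KeyVault/vaults/kv1"
  "/subscriptions/0000/resourceGroups/rg-app/providers/Microsoft.Web/sites/web1"
)

# Hypothetical helper: print the unique resource group names found in the IDs.
extract_resource_groups() {
  local id
  for id in "$@"; do
    # IDs without a /resourceGroups/ segment simply do not match and are skipped.
    if [[ "$id" =~ /resourceGroups/([^/]+) ]]; then
      printf '%s\n' "${BASH_REMATCH[1]}"
    fi
  done | sort -u
}

RESOURCE_GROUPS=$(extract_resource_groups "${ids[@]}")
printf '%s\n' "$RESOURCE_GROUPS"
```

With these sample inputs the subscription-scoped role assignment contributes nothing, and the duplicate `rg-app` is collapsed by `sort -u`.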
.github/skills/azure-stack-destroy/scripts/destroy-stack.sh:298
`grep -oE '(?<=locations/)[^/]+'` uses a PCRE lookbehind, but `-E` (ERE) doesn’t support lookbehind. This will fail to extract the Cognitive Services location and silently skip purge. Use `grep -oP` or another non-lookbehind parsing approach.
"Microsoft.CognitiveServices/accounts")
if [[ "$PURGE_PROTECTED" != "true" ]]; then
LOC=$(echo "$RES_ID" | grep -oE '(?<=locations/)[^/]+' || echo "")
if [[ -n "$LOC" ]]; then
az cognitiveservices account purge --name "$RES_NAME" --location "$LOC" \
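As one possible "non-lookbehind parsing approach", the location segment can be pulled out with bash's built-in regex matching, which avoids depending on `grep -P` being available. This is a sketch under assumptions: the helper name `extract_location` and the sample resource ID are hypothetical, and the ID shape is only meant to show a `/locations/<name>/` segment.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print the segment following /locations/, or nothing if absent.
extract_location() {
  local res_id="$1"
  if [[ "$res_id" =~ /locations/([^/]+) ]]; then
    printf '%s\n' "${BASH_REMATCH[1]}"
  fi
}

# Made-up resource ID for illustration.
RES_ID="/subscriptions/0000/providers/Microsoft.CognitiveServices/locations/eastus/resourceGroups/rg-app/deletedAccounts/ai1"
LOC=$(extract_location "$RES_ID")
echo "location=${LOC}"
```

An ID with no `/locations/` segment yields an empty string, so the existing `[[ -n "$LOC" ]]` guard keeps working unchanged.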
- Files reviewed: 24/24 changed files
- Comments generated: 10
    # Determine deploy method: prefer deployment stacks (idempotent destroy)
    # Fall back to az deployment sub create if stacks are unavailable
    DEPLOY_METHOD="stack"
    # Verbose output goes to a temp file so it does not contaminate the
    # JSON that downstream jq calls need to parse.
    VERBOSE_LOG=$(mktemp)
    trap 'rm -f "$VERBOSE_LOG"' EXIT
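The comments in this fragment describe routing verbose output to a temp file so it cannot contaminate JSON that is parsed downstream. A minimal sketch of that idea follows, using a hypothetical `fake_cli` function in place of the real command; the function and its output are invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical stand-in for a CLI that prints JSON on stdout and progress on stderr.
fake_cli() {
  echo "INFO: starting" >&2
  echo '{"id":"example-stack"}'
}

VERBOSE_LOG=$(mktemp)
trap 'rm -f "$VERBOSE_LOG"' EXIT

# With 2>&1 both streams are captured, so the value is no longer pure JSON.
MERGED=$(fake_cli 2>&1)

# Sending stderr to the temp file keeps the captured value parseable.
CLEAN=$(fake_cli 2>"$VERBOSE_LOG")

echo "clean output: $CLEAN"
echo "log file contents: $(cat "$VERBOSE_LOG")"
```

The log file is still available for diagnostics if the command fails, and the `trap` removes it when the script exits.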
    done

    MANAGED_RESOURCES=$(echo "$MANAGED_RESOURCES" | jq --arg id "$RES_ID" --arg type "$RES_TYPE" \
      --arg scope "$RES_SCOPE" --argjson sd "$IS_SOFT_DELETABLE" \
      '. + [{"id": $id, "type": $type, "scope": $scope, "softDeletable": $sd, "purgeProtected": false}]')
    done
    # If the bg process already failed, surface it early
    if ! kill -0 "$STACK_BG_PID" 2>/dev/null; then
    wait "$STACK_BG_PID" 2>/dev/null || true
    BG_EXIT=$?
    if [[ $BG_EXIT -ne 0 ]]; then

    EXISTS=$(az group exists --name "$RG" 2>/dev/null || echo "true")
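A note on the `wait ... || true` followed by `BG_EXIT=$?` pattern in the fragment above: in bash, `|| true` replaces a failing status with 0, so the variable read on the next line cannot be non-zero. The sketch below demonstrates this with throwaway subshells standing in for the background stack operation; the variable names are illustrative.

```bash
#!/usr/bin/env bash
# Stand-in background job that exits with status 3.
( exit 3 ) &
pid_a=$!
# `|| true` swallows the failure, so $? is 0 on the next line.
wait "$pid_a" || true
masked_exit=$?

# Second stand-in job, same exit status.
( exit 3 ) &
pid_b=$!
# Capture the status on the failure branch instead; this is also safe under `set -e`.
real_exit=0
wait "$pid_b" || real_exit=$?

echo "masked=$masked_exit real=$real_exit"   # prints: masked=0 real=3
```

The second form keeps the "do not abort the script" behaviour while still letting a later `-ne 0` check see the failure.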
    # Determine deploy method: prefer deployment stacks (idempotent destroy)
    # Fall back to az deployment sub create if stacks are unavailable
    DEPLOY_METHOD="stack"

    if [[ "$DEPLOY_METHOD" == "stack" ]]; then
      DEPLOY_OUTPUT=$(az stack sub create \
        --name "$DEPLOYMENT_ID" \
        --location "$LOCATION" \
        --template-file "$DEPLOY_DIR/template.json" \
        --parameters @"$DEPLOY_DIR/parameters.json" \
        --action-on-unmanage deleteAll \
        --deny-settings-mode none \
        --description "Git-Ape deployment $DEPLOYMENT_ID" \
        --tags "managedBy=git-ape" "deploymentId=$DEPLOYMENT_ID" \
        --yes \
        --verbose \
        --output json 2>&1)
    else
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
…oyment-stacks-idempotency
🧪 Waza skill evals (advisory)
Ran 8 matrix legs in parallel (skills × models). Results are non-blocking — investigate failures via the workflow logs and the per-leg
📊 Token comparison vs
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-opus-4.6
Results saved to: .waza-results/prereq-check-claude-opus-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-opus-4.6.junit.xml
Model: claude-sonnet-4.6
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.79 | Duration: 2m39.795s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.57 - 1.00 (σ=0.2074)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: claude-sonnet-4.6
Results saved to: .waza-results/prereq-check-claude-sonnet-4.6.json
JUnit XML saved to: .waza-results/prereq-check-claude-sonnet-4.6.junit.xml
Model: gpt-5-codex
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [3/4] Positive — "command not found" failure
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [4/4] Positive — "What do I need to install?"
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [1/4] Negative — Editing an ARM template
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [2/4] Negative — Azure service concept question
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.00 | Duration: 301ms
- Tests: 4 total, 0 passed, 4 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.00 - 0.00 (σ=0.0000)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.00 | ❌ | - |
| Negative — Azure service concept question | 0.00 | ❌ | - |
| Positive — "command not found" failure | 0.00 | ❌ | - |
| Positive — "What do I need to install?" | 0.00 | ❌ | - |
Failed Task Details
Negative — Editing an ARM template
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Negative — Azure service concept question
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Positive — "command not found" failure
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Positive — "What do I need to install?"
Run 1/3 (error):
Run 2/3 (error):
Run 3/3 (error):
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5-codex
Results saved to: .waza-results/prereq-check-gpt-5-codex.json
Model: gpt-5.4 *(baseline — A/B mode)*
Running benchmark: prereq-check-eval
Skill: prereq-check
Engine: copilot-sdk
Model: gpt-5.4
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
════════════════════════════════════════════════════════════════
PASS 1: Skills-Enabled Run
════════════════════════════════════════════════════════════════
✓ [1/4] Negative — Editing an ARM template
✓ [2/4] Negative — Azure service concept question
✓ [4/4] Positive — "What do I need to install?"
✓ [3/4] Positive — "command not found" failure
════════════════════════════════════════════════════════════════
PASS 2: Skills Baseline (skills stripped)
════════════════════════════════════════════════════════════════
✓ [2/4] Negative — Azure service concept question
✓ [1/4] Negative — Editing an ARM template
✗ [3/4] Positive — "command not found" failure
✗ [4/4] Positive — "What do I need to install?"
════════════════════════════════════════════════════════════════
SKILL IMPACT ANALYSIS
════════════════════════════════════════════════════════════════
Overall Performance Delta:
With Skills: 100.0% (4/4 tasks passed)
Without Skills: 50.0% (2/4 tasks passed)
Impact: +50.0 percentage points
Per-Task Breakdown:
• Negative — Editing an ARM template [NEUTRAL] 100% → 100% (+0pp)
• Negative — Azure service concept question [NEUTRAL] 100% → 100% (+0pp)
• Positive — "command not found" failure [IMPROVED] 67% → 100% (+33pp)
• Positive — "What do I need to install?" [IMPROVED] 67% → 100% (+33pp)
Verdict: Skills have POSITIVE IMPACT (improved 2/4 tasks)
════════════════════════════════════════════════════════════════
🧪 Waza Eval Results
Status: ✅ Passed | Score: 0.79 | Duration: 2m10.625s
- Tests: 4 total, 4 passed, 0 failed, 0 errors
- Success Rate: 100.0%
- Score Range: 0.57 - 1.00 (σ=0.2074)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Editing an ARM template | 0.57 | ✅ | budget, trigger_relevance_negative |
| Negative — Azure service concept question | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — "command not found" failure | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
| Positive — "What do I need to install?" | 1.00 | ✅ | answer_quality, budget, trigger_relevance_positive |
Benchmark: prereq-check-eval | Skill: prereq-check | Model: gpt-5.4
Results saved to: .waza-results/prereq-check-gpt-5.4.json
JUnit XML saved to: .waza-results/prereq-check-gpt-5.4.junit.xml
🔢 Tokens (count + profile)
📊 prereq-check: 2,138 tokens (detailed ✓), 10 sections, 2 code blocks
⚠️ token count 2138 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious, steps are numbered with clear actions and references, the Quick Reference table surfaces critical facts at a glance, and the Always/Never constraint lists eliminate ambiguity. No meaningful improvements needed.
completeness █████ Covers tool checks, version minimums, auth sessions, platform variance (macOS/Linux/Windows), error handling for 8 distinct failure modes, edge cases like user-reported vs terminal-detected missing tools, and a defined handoff. Nothing material is missing.
trigger_precision ████░ USE FOR is exhaustive with concrete error-string examples (e.g., 'az: command not found'), which is excellent. DO NOT USE FOR is a catch-all ('Anything else') — adding 1-2 explicit anti-examples (e.g., 'do not use for Azure resource validation, use azure-validate') would sharpen negative routing slightly.
scope_coverage █████ Boundaries are crystal clear: read-only, four specific tools, two auth sessions, three platforms, no auto-installation, no chaining. The Related Skills section correctly offloads adjacent concerns (onboarding, deployment-time checks) without bleeding scope.
anti_patterns █████ Avoids all major anti-patterns: no vague instructions, no conflicting directives (Rules section pre-empts contradiction between terminal findings and user reports), error handling is explicit with cause/fix pairs, and prescriptiveness is calibrated — steps are specific enough to be actionable without over-constraining implementation.
────────────────────────────────────────────
Overall: 4.8/5.0
Exceptionally well-structured skill. It is one of the cleaner examples of a narrowly scoped, operationally complete skill document — clear rules, thorough error handling, explicit platform branching, and a well-defined verdict taxonomy. The only minor gap is that the DO NOT USE FOR clause relies on a catch-all rather than named exclusions, which could cause occasional mis-routing in ambiguous cases.
✅ Check (compliance summary) (59 lines)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/prereq-check/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: prereq-check
📋 Compliance Score: Medium-High
⚠️ Good, but could be improved. Missing routing clarity.
Issues found:
❌ SKILL.md is 2138 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 4/4 valid
✅ All links valid.
📊 Token Budget: 2138 / 500 tokens
❌ Exceeds limit by 1638 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 4 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 1 reference module(s)
❌ [complexity] Complexity: comprehensive (2138 tokens, 1 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
❌ [cross-model-density] Advisory 16: word count is 122 (>60 may reduce cross-model effectiveness)
❌ [body-structure] Advisory 17: body structure quality — no examples section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
2. Run 'waza dev' for interactive compliance improvement
3. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
4. Reduce SKILL.md by 1638 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-deploy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
[ERROR] waiting for session.idle: context deadline exceeded
✓ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
✗ [1/5] Negative — Destroying / tearing down an existing deployment
✗ [3/5] Negative — What-if preview / preflight validation
✓ [5/5] Positive — Re-deploy after template edit
[ERROR] waiting for session.idle: context deadline exceeded
✗ [4/5] Positive — Local deploy of an existing deployment artifact
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.75 | Duration: 2m26.67s
- Tests: 5 total, 2 passed, 3 failed, 0 errors
- Success Rate: 40.0%
- Score Range: 0.60 - 0.86 (σ=0.1167)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.86 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Negative — What-if preview / preflight validation | 0.82 | ❌ | budget, trigger_relevance_negative |
| Positive — Local deploy of an existing deployment artifact | 0.61 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Re-deploy after template edit | 0.85 | ✅ | answer_quality, budget, trigger_relevance_positive |
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.71): Prompt appears trigger-aligned unexpectedly (score 0.71 >= 0.50)
Negative — What-if preview / preflight validation
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.65): Prompt appears trigger-aligned unexpectedly (score 0.65 >= 0.50)
Positive — Local deploy of an existing deployment artifact
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: The assistant never delivered a response to the user. All tool calls failed with "unexpected user permission response" and no final answer was produced. None of the four required criteria were met:
  - `az stack sub create`: not mentioned
  - `--action-on-unmanage deleteAll`: not mentioned
  - Reference to `deploy-stack.sh` or `deploy-stack.ps1`: not mentioned
  - `state.json` (schemaVersion 1.0): not mentioned
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Run 2/2 (error):
- ❌ answer_quality (0.00): fail: The assistant invoked the skill and loaded the context, but then got stuck in permission/tool errors and never delivered a response to the user. None of the four required criteria appeared in the user-facing output:
  - `az stack sub create`: never mentioned
  - `--action-on-unmanage deleteAll`: never mentioned
  - `deploy-stack.sh` / `deploy-stack.ps1` script reference: never mentioned
  - `state.json` (schemaVersion 1.0): never mentioned
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.83): Prompt is trigger-aligned (score 0.83 >= 0.50)
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-deploy-claude-sonnet-4.6.json
Model: gpt-5-codex
Running benchmark: azure-stack-deploy-eval
Skill: azure-stack-deploy
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [1/5] Negative — Destroying / tearing down an existing deployment
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [5/5] Positive — Re-deploy after template edit
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [4/5] Positive — Local deploy of an existing deployment artifact
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [3/5] Negative — What-if preview / preflight validation
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [2/5] Negative — Off-topic prompt (Linux kernel scheduling)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.00 | Duration: 1.065s
- Tests: 5 total, 0 passed, 5 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.00 - 0.00 (σ=0.0000)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Destroying / tearing down an existing deployment | 0.00 | ❌ | - |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.00 | ❌ | - |
| Negative — What-if preview / preflight validation | 0.00 | ❌ | - |
| Positive — Local deploy of an existing deployment artifact | 0.00 | ❌ | - |
| Positive — Re-deploy after template edit | 0.00 | ❌ | - |
Failed Task Details
Negative — Destroying / tearing down an existing deployment
Run 1/2 (error):
Run 2/2 (error):
Negative — Off-topic prompt (Linux kernel scheduling)
Run 1/2 (error):
Run 2/2 (error):
Negative — What-if preview / preflight validation
Run 1/2 (error):
Run 2/2 (error):
Positive — Local deploy of an existing deployment artifact
Run 1/2 (error):
Run 2/2 (error):
Positive — Re-deploy after template edit
Run 1/2 (error):
Run 2/2 (error):
Benchmark: azure-stack-deploy-eval | Skill: azure-stack-deploy | Model: gpt-5-codex
Results saved to: .waza-results/azure-stack-deploy-gpt-5-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-deploy: 1,912 tokens (detailed ✓), 13 sections, 5 code blocks
⚠️ token count 1912 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the description and header. Steps are logically ordered (locate → run → inspect → report), script internals are enumerated in sequence, and output examples make expected behavior concrete. The dual bash/PowerShell coverage adds slight cognitive overhead but is justified and well-separated.
completeness █████ Prerequisites table, argument reference, full state.json schema with example, soft-deletable resource type list, failure modes table, and post-run messaging requirements are all present. The fallback path, race-condition edge case, and idempotency behavior are explicitly documented — very few gaps remain.
trigger_precision ████░ 'When to Use' and 'Do NOT use for' sections are well-defined with concrete anti-cases (destroy, preflight, template authoring). The only minor gap is that there is no guidance for partially failed or rolled-back stacks that may already have a stack ID — it's ambiguous whether this skill or a destroy+redeploy flow is correct in that state.
scope_coverage █████ Boundaries are tight and explicit: subscription-scoped stacks only, Git-Ape artifact layout assumed, state.json schema is versioned and cross-referenced to docs. The skill neither overreaches into template generation nor undersells by omitting the fallback path and multi-RG lifecycle ownership nuance.
anti_patterns ████░ No vague instructions or conflicting directives detected. Error handling is thorough (per-operation failure dump, early exit on missing auth/template). The one minor issue is that the post-run messaging section ('What to tell the user') is prescriptive in a way that may cause the agent to mechanically recite output rather than adapt to context — a note that this is a minimum floor rather than a rigid script would improve it.
────────────────────────────────────────────
Overall: 4.6/5.0
An exceptionally well-structured skill with versioned schema, dual-runtime support, explicit fallback semantics, and thorough failure-mode coverage. Minor deductions for the ambiguous re-deploy-after-partial-failure scenario and the overly rigid post-run messaging prescription. Production-ready with minimal revision needed.
✅ Check (compliance summary) (70 lines)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-stack-deploy/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-deploy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 1912 tokens (hard limit 500)
📐 Spec Compliance: 8/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
📎 Links: 0/8 valid
⚠️ 8 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../../../website/docs/deployment/state.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-destroy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-deployment-preflight/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-security-analyzer/SKILL.md: link escapes skill directory
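The "link escapes skill directory" rule can be approximated by normalizing each relative link and checking whether it still points above the directory that holds `SKILL.md`. A minimal sketch under that assumption; `escapes_skill_dir` is a hypothetical helper, not waza's actual check, and the last two links are invented for contrast.

```python
import posixpath


def escapes_skill_dir(link: str) -> bool:
    """True when a relative link resolves outside the directory holding SKILL.md."""
    normalized = posixpath.normpath(link)
    return (
        normalized == ".."
        or normalized.startswith("../")
        or posixpath.isabs(normalized)
    )


links = [
    "../azure-stack-destroy/SKILL.md",            # from the report above
    "../../../website/docs/deployment/state.md",  # from the report above
    "references/state-schema.md",                 # hypothetical in-directory link
    "./scripts/../notes.md",                      # hypothetical, normalizes to notes.md
]

for link in links:
    print(link, escapes_skill_dir(link))
```

Normalizing first matters: a link such as `./scripts/../notes.md` contains `..` but never leaves the skill directory, so a plain substring test would flag it incorrectly.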
📊 Token Budget: 1912 / 500 tokens
❌ Exceeds limit by 1412 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (1912 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Description density is optimal for cross-model use
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 10 signal(s) detected (10 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix 8 link(s) that escape the skill directory
7. Reduce SKILL.md by 1412 tokens. Run 'waza tokens suggest' for optimization tips
Skill: azure-stack-destroy
📈 Score (per model) + Suggestions/Recommendations
Model: claude-sonnet-4.6
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: claude-sonnet-4.6
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
✗ [1/5] Negative — Deploying a new stack (opposite operation)
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
✓ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
[ERROR] waiting for session.idle: context deadline exceeded
✗ [4/5] Positive — Clean up the deployment stack
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.77 | Duration: 2m45.646s
- Tests: 5 total, 1 passed, 4 failed, 0 errors
- Success Rate: 20.0%
- Score Range: 0.60 - 0.87 (σ=0.0903)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.81 | ❌ | budget, trigger_relevance_negative |
| Negative — Deleting a non-Git-Ape resource group | 0.87 | ❌ | budget, trigger_relevance_negative |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.60 | ✅ | budget, trigger_relevance_negative |
| Positive — Clean up the deployment stack | 0.79 | ❌ | answer_quality, budget, trigger_relevance_positive |
| Positive — Local destroy of a Git-Ape deployment | 0.80 | ❌ | answer_quality, budget, trigger_relevance_positive |
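The summary figures can be cross-checked against the table. The sketch below assumes the overall score is the plain mean of the per-task scores, which is consistent with the numbers shown but is not confirmed by the report; σ is left out because the report does not say how it is computed.

```python
from statistics import mean

# (task score, passed) pairs copied from the Task Results table above.
results = [
    (0.81, False),  # Negative: deploying a new stack
    (0.87, False),  # Negative: deleting a non-Git-Ape resource group
    (0.60, True),   # Negative: off-topic prompt
    (0.79, False),  # Positive: clean up the deployment stack
    (0.80, False),  # Positive: local destroy of a Git-Ape deployment
]

scores = [score for score, _ in results]
overall = round(mean(scores), 2)
success_rate = 100 * sum(passed for _, passed in results) / len(results)

print(overall, min(scores), max(scores), success_rate)
```

Under that assumption the score and the success rate measure different things: a run can average 0.77 while only one of five tasks clears every grader.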
⚠️ Flaky Tasks
The following tasks showed inconsistent results across runs:
- Positive — Clean up the deployment stack: 50% pass rate, score=0.79±0.17
- Positive — Local destroy of a Git-Ape deployment: 50% pass rate, score=0.80±0.17
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.62): Prompt appears trigger-aligned unexpectedly (score 0.62 >= 0.50)
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Run 2/2 (failed):
- ✅ budget (1.00): All behavior checks passed
- ❌ trigger_relevance_negative (0.73): Prompt appears trigger-aligned unexpectedly (score 0.73 >= 0.50)
Positive — Clean up the deployment stack
Run 2/2 (failed):
- ❌ answer_quality (0.00): fail: Criterion 1 not met: The response recommends running the destroy script but never explicitly tells the user NOT to use raw `az group delete`, nor explains that `az group delete` would miss soft-delete cleanup and multi-RG resources. Criteria 2, 3, and 4 are all satisfied.
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.88): Prompt is trigger-aligned (score 0.88 >= 0.50)
Positive — Local destroy of a Git-Ape deployment
Run 1/2 (error):
- ❌ answer_quality (0.00): fail: Criteria 2, 3, and 4 not met: The assistant invoked the azure-stack-destroy skill (criterion 1 ✅), but its response then devolved entirely into failed file-lookup tool calls with no user-facing explanation. It never:
- Explained that `state.json` under `.azure/deployments/deploy-20260506-001/` is the source of truth for `stackId`, `managedResources`, `softDeletable`, `purgeProtected` (criterion 2 ❌)
- Named or described the `az stack sub delete --action-on-unmanage deleteAll` command or its semantics (criterion 3 ❌)
- Mentioned the soft-delete purge sweep, `az keyvault purge` / `az keyvault list-deleted`, or explained that non-purge-protected Key Vaults would be purged so the name is immediately reusable (criterion 4 ❌)
- ✅ budget (1.00): All behavior checks passed
- ✅ trigger_relevance_positive (0.89): Prompt is trigger-aligned (score 0.89 >= 0.50)
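The trigger graders above appear to share one threshold and differ only in the expected side of it. A minimal sketch of that reading, inferred from the "score X >= 0.50" messages rather than from waza's source; the function name is hypothetical.

```python
THRESHOLD = 0.50  # taken from the "score X >= 0.50" messages above


def trigger_grader_passes(score: float, expect_trigger: bool) -> bool:
    """Pass when the prompt lands on the expected side of the threshold.

    Positive tasks expect the prompt to be trigger-aligned (score >= 0.50);
    negative tasks expect it not to be (score < 0.50).
    """
    aligned = score >= THRESHOLD
    return aligned == expect_trigger


# Scores copied from the failed-task details above.
print(trigger_grader_passes(0.62, expect_trigger=False))  # negative: deploying a new stack
print(trigger_grader_passes(0.73, expect_trigger=False))  # negative: non-Git-Ape resource group
print(trigger_grader_passes(0.88, expect_trigger=True))   # positive: clean up the stack
print(trigger_grader_passes(0.89, expect_trigger=True))   # positive: local destroy
```

Read this way, the two negative failures are about the skill description overlapping with adjacent operations (deploying, deleting foreign resource groups), not about the agent's behavior, since `budget` passed in every run.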
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: claude-sonnet-4.6
Results saved to: .waza-results/azure-stack-destroy-claude-sonnet-4.6.json
Model: gpt-5-codex
Running benchmark: azure-stack-destroy-eval
Skill: azure-stack-destroy
Engine: copilot-sdk
Model: gpt-5-codex
Judge Model: claude-sonnet-4.6
Parallel: 4 workers
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [1/5] Negative — Deploying a new stack (opposite operation)
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [5/5] Positive — Local destroy of a Git-Ape deployment
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [4/5] Positive — Clean up the deployment stack
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [2/5] Negative — Deleting a non-Git-Ape resource group
[ERROR] failed to create session: failed to create session: JSON-RPC Error -32603: Request session.create failed with message: Model "gpt-5-codex" is not available.
✗ [3/5] Negative — Off-topic prompt (Linux kernel scheduling)
🧪 Waza Eval Results
Status: ❌ Failed | Score: 0.00 | Duration: 436ms
- Tests: 5 total, 0 passed, 5 failed, 0 errors
- Success Rate: 0.0%
- Score Range: 0.00 - 0.00 (σ=0.0000)
Task Results
| Task | Score | Status | Graders |
|---|---|---|---|
| Negative — Deploying a new stack (opposite operation) | 0.00 | ❌ | - |
| Negative — Deleting a non-Git-Ape resource group | 0.00 | ❌ | - |
| Negative — Off-topic prompt (Linux kernel scheduling) | 0.00 | ❌ | - |
| Positive — Clean up the deployment stack | 0.00 | ❌ | - |
| Positive — Local destroy of a Git-Ape deployment | 0.00 | ❌ | - |
Failed Task Details
Negative — Deploying a new stack (opposite operation)
Run 1/2 (error):
Run 2/2 (error):
Negative — Deleting a non-Git-Ape resource group
Run 1/2 (error):
Run 2/2 (error):
Negative — Off-topic prompt (Linux kernel scheduling)
Run 1/2 (error):
Run 2/2 (error):
Positive — Clean up the deployment stack
Run 1/2 (error):
Run 2/2 (error):
Positive — Local destroy of a Git-Ape deployment
Run 1/2 (error):
Run 2/2 (error):
Benchmark: azure-stack-destroy-eval | Skill: azure-stack-destroy | Model: gpt-5-codex
Results saved to: .waza-results/azure-stack-destroy-gpt-5-codex.json
🔢 Tokens (count + profile)
📊 azure-stack-destroy: 2,644 tokens (detailed ✓), 14 sections, 7 code blocks
⚠️ token count 2644 exceeds 1000
🎯 Quality (5-dim table)
DIMENSION SCORE FEEDBACK
────────────────────────────────────────────
clarity █████ Purpose is immediately obvious from the description, steps are logically ordered with clear headings, and code examples (bash and PowerShell) remove ambiguity. The fast-vs-sync mode table is a standout clarity element.
completeness █████ Exceptional coverage: prerequisites table, procedure with numbered steps, all flag arguments, 5 terminal status codes, 4 failure modes with recovery guidance, soft-delete purge sweep details, and state.json output schema. Edge cases like purge-protected resources and the no-stackId fallback path are explicitly addressed.
trigger_precision ████░ USE FOR and DO NOT USE FOR are specific and actionable with concrete phrasing examples. However, the separate 'When to Use' section directly below USE FOR duplicates several triggers verbatim, adding noise without value — consolidate or remove it.
scope_coverage █████ Boundaries are explicit on all axes: stack-only (no surgical per-resource delete), Git-Ape deployments only, state.json required (hard abort if missing), and the 'no non-Azure / non-Git-Ape' exclusion is unambiguous. The CI-parity guarantee is clearly scoped.
anti_patterns ████░ Avoids vague instructions, conflicting directives, and missing error handling. The one anti-pattern is the redundant 'When to Use' section that re-states USE FOR content — this creates a maintenance burden and potential drift. Additionally, 'Prefer this over raw az group delete' is nested under USE FOR when it logically belongs in a rationale or DO NOT USE FOR section.
────────────────────────────────────────────
Overall: 4.6/5.0
A high-quality skill document that sets a strong example for operational runbooks. It is thorough, precise, and well-structured with excellent failure-mode coverage and clear CI-parity guarantees. The only meaningful improvement is eliminating the redundant 'When to Use' section that duplicates USE FOR content, and reorganizing the 'Prefer this over az group delete' rationale into a more logical location.
✅ Check (compliance summary)
ℹ️ `waza check` expects `eval.yaml` colocated with `SKILL.md`. This repo separates them into `.github/evals/azure-stack-destroy/eval.yaml`, so the "Evaluation Suite: Not Found" line below is a false negative; the eval actually ran (see the Score section above).
🔍 Skill Readiness Check
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Skill: azure-stack-destroy
📋 Compliance Score: Low
❌ Needs significant improvement. Description too short or missing triggers.
Issues found:
❌ SKILL.md is 2644 tokens (hard limit 500)
📐 Spec Compliance: 7/9 checks passed
❌ Does not fully meet agentskills.io specification.
❌ [spec-allowed-fields] Unknown frontmatter fields: argument-hint, user-invocable
📎 agentskills.io spec allows: name, description, license, allowed-tools, metadata, compatibility
❌ [spec-security] Security risks detected: description contains XML angle brackets
📎 XML angle brackets and reserved prefixes pose injection and naming conflict risks
📎 Links: 0/4 valid
⚠️ 4 link issue(s) found.
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-stack-deploy/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-drift-detector/SKILL.md: link escapes skill directory
❌ [SKILL.md] → ../azure-resource-visualizer/SKILL.md: link escapes skill directory
📊 Token Budget: 2644 / 500 tokens
❌ Exceeds limit by 2144 tokens. Consider reducing content.
🧪 Evaluation Suite: Found
✅ eval.yaml detected. Run 'waza run eval.yaml' to test.
📐 Schema Validation: Passed
✅ eval.yaml schema valid
✅ 5 task file(s) validated
💡 Advisory Checks
✅ [module-count] Found 0 reference module(s)
❌ [complexity] Complexity: comprehensive (2644 tokens, 0 modules)
✅ [negative-delta-risk] No negative delta risk patterns detected
✅ [procedural-content] Description contains procedural language
✅ [over-specificity] No over-specificity patterns detected
✅ [cross-model-density] Advisory 16: first sentence doesn't lead with action verb (reduces clarity)
❌ [body-structure] Advisory 17: body structure quality — no examples section found; no error handling or troubleshooting section found
✅ [progressive-disclosure] Content structure supports progressive disclosure
✅ [scope-reduction] Capability scope: 8 signal(s) detected (8 level-2 heading(s), 2 numbered procedure(s))
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
📈 Overall Readiness
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
⚠️ Your skill needs some work before submission.
🎯 Next Steps
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
To improve your skill:
1. Add a 'USE FOR:' section with 3-5 trigger phrases that activate the skill
2. Add a 'DO NOT USE FOR:' section to clarify when NOT to use this skill
3. Add routing clarity (e.g., **UTILITY SKILL**, INVOKES:, FOR SINGLE OPERATIONS:)
4. Run 'waza dev' for interactive compliance improvement
5. Fix spec violation [spec-allowed-fields]: Unknown frontmatter fields: argument-hint, user-invocable
6. Fix spec violation [spec-security]: Security risks detected: description contains XML angle brackets
7. Fix 4 link(s) that escape the skill directory
8. Reduce SKILL.md by 2144 tokens. Run 'waza tokens suggest' for optimization tips
🤖 Waza agent evals (advisory)
Ran 0 agent evals against
📊 Agent file token comparison vs
- thin out azure-resource-deployer: delegate preflight, verify, rollback to existing skills
- thin out azure-template-generator: add Step 0 lookups (rest-api-reference + naming-research)
- replace inline RBAC GUIDs and per-resource checklists with skill invocations
- collapse hardening checklist into 5 non-negotiable identity patterns + analyzer deferral
🔧 - Generated by Copilot

- destroy: front-load WHEN, add USE FOR / DO NOT USE FOR, harden state.json prerequisite
- destroy: list extra soft-deletable resource types in purgeResults note
- deploy: clarify stack-create flags and state.json schema references
🔧 - Generated by Copilot

- register both skills in evals manifest at expanded tier
- add 5-task eval for azure-stack-destroy (positive-local, positive-stack, negative-deploy, negative-non-gitape, negative-off-topic)
- add eval suite for azure-stack-deploy
🧪 - Generated by Copilot
The destroy workflow deletes a single resource group but leaves behind soft-deleted resources (Key Vault with purge protection), subscription-scoped resources, and multi-RG deployments — making destroy + redeploy non-idempotent.
Changes

Deploy workflow (`git-ape-deploy.exampleyml`)
- Switches from `az deployment sub create` to `az stack sub create --action-on-unmanage deleteAll`
- `state.json` now populated with `stackId`, `deployMethod`, `managedResources[]`, `resourceGroups[]`, `subscriptions[]`, `externalReferences[]`
- `metadata.json` gains `deployMethod` and `resourceGroups[]` on commit

Destroy workflow (`git-ape-destroy.exampleyml`)
- Uses `az stack sub delete --action-on-unmanage deleteAll` when `stackId` is present; a single command covers all managed resources regardless of scope
- Falls back to `az group delete` for pre-stack deployments
- Soft-delete purge loop reads `managedResources[].softDeletable`, purges non-protected Key Vaults, and marks purge-protected ones as `retained-soft-deleted`
- Deployment history cleanup runs `az deployment sub delete` to prevent 800/scope accumulation
- New terminal statuses: `partially-destroyed`, `retained-soft-deleted`

State schema (`website/docs/deployment/state.md`)
- Documents the extended `state.json` schema with destroy strategy selection logic
- Adds the `deployMethod` field to the `metadata.json` spec

Example state.json (post-deploy)

    {
      "stackId": "/subscriptions/.../providers/Microsoft.Resources/deploymentStacks/deploy-20260218-143022",
      "deployMethod": "stack",
      "managedResources": [
        {
          "id": "/subscriptions/.../Microsoft.KeyVault/vaults/kv-api-dev-eus",
          "type": "Microsoft.KeyVault/vaults",
          "scope": "resourceGroup",
          "softDeletable": true,
          "purgeProtected": true
        }
      ],
      "resourceGroups": ["rg-api-dev-eastus"],
      "subscriptions": ["00000000-..."],
      "externalReferences": []
    }

This implements Phase 1 (schema + state capture) and Phase 2 (Deployment Stacks integration) from the issue. Phase 3 (extract destroy to standalone script) and Phase 4 (fixture validation) are deferred.
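The destroy strategy selection described above can be sketched as a pure function over `state.json`. This is an illustration under stated assumptions, not the workflow's actual code: the function name `plan_destroy` and the strategy labels `stack-delete` and `resource-group-delete` are hypothetical, while the field names come from the example state. The elided resource IDs are kept exactly as they appear in the PR description.

```python
def plan_destroy(state: dict) -> dict:
    """Pick a destroy strategy and split soft-deletable resources by purge protection."""
    # A recorded stackId means the deployment was created as a Deployment Stack;
    # without it, fall back to deleting resource groups (pre-stack deployments).
    strategy = "stack-delete" if state.get("stackId") else "resource-group-delete"

    purge, retain = [], []
    for resource in state.get("managedResources", []):
        if not resource.get("softDeletable"):
            continue
        # Purge-protected resources cannot be purged and stay soft-deleted.
        target = retain if resource.get("purgeProtected") else purge
        target.append(resource["id"])

    return {"strategy": strategy, "purge": purge, "retain": retain}


# State taken from the example above (elided values left as-is).
state = {
    "stackId": "/subscriptions/.../providers/Microsoft.Resources/deploymentStacks/deploy-20260218-143022",
    "deployMethod": "stack",
    "managedResources": [
        {
            "id": "/subscriptions/.../Microsoft.KeyVault/vaults/kv-api-dev-eus",
            "type": "Microsoft.KeyVault/vaults",
            "scope": "resourceGroup",
            "softDeletable": True,
            "purgeProtected": True,
        }
    ],
    "resourceGroups": ["rg-api-dev-eastus"],
}

plan = plan_destroy(state)
print(plan)
```

For the example state the Key Vault lands in `retain`, which corresponds to the `retained-soft-deleted` terminal status: the stack delete succeeds, but the vault name stays reserved until the retention period expires.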