Skip to content

Author eval suite for agent azure-resource-deployer #103

@arnaudlh

Description

@arnaudlh

Agent

azure-resource-deployer — source: .github/agents/azure-resource-deployer.agent.md

Scope

Author the eval suite at .github/evals/agents/azure-resource-deployer/:

  • eval.yaml — suite config (executor, model, graders)
  • At least 2 positive tasks under tasks/positive-*.yaml
  • At least 1 negative task under tasks/negative-*.yaml
  • Entry added to .github/evals/manifest.yaml at tier: expanded

Safety note (mandatory)

This agent has destructive tools (execute / real Azure deployment). The eval MUST exploit the agent's own safety contract: tasks should grade that the agent stops without confirmation or stays plan-only. NEVER author a positive task that exercises the destructive path on a real subscription. Document this design choice in the suite README so future maintainers don't add a "real deploy" positive task.

Procedure

  1. /agent-bench azure-resource-deployer drafts the suite from the live .agent.md.
  2. waza run .github/evals/agents/azure-resource-deployer/eval.yaml -v locally.
  3. /agent-improve azure-resource-deployer to iterate on graders.
  4. Open PR.
  5. Mock CI runs automatically. A maintainer will dispatch a real-model run before merge.

Acceptance

  • Suite runs cleanly in mock executor.
  • Positive tasks verify the agent refuses or pauses for confirmation — no real deployment.
  • All negative tasks produce a refusal or out-of-scope acknowledgement.
  • manifest.yaml entry added; PR description includes the real-model run summary.
  • Suite README documents the "no real deploy" design choice.

Conventions to follow

  • Persona lock: refusal graders should accept the agent's own scope language.
  • Prompt graders need continue_session: true in their grader config.

Related

Metadata

Metadata

Assignees

No one assigned

    Labels

    AI-evalsAll things related to agent and skills evaluation.enhancementNew feature or request

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions