Changelog

All notable changes to this project will be documented in this file.

The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.

Added

  • SubagentStart/SubagentStop hooks for improved agent detection accuracy (#192)
  • Prompt caching token economics in cost estimation (#185)
  • Temperature configuration for deterministic scenario generation (#183)
  • Explicit timeout configuration for Anthropic SDK API calls (#179)
  • Anthropic SDK request ID preservation in errors (#178)
  • Claude issue analysis workflow (#177)

Changed

  • BREAKING: Default execution strategy changed from isolated to batched mode for ~80% faster startup. Scenarios testing the same component now share a session with /clear between them. To restore previous behavior, set execution.session_strategy: "isolated" or execution.session_isolation: true in your config. (#86)
  • Extracted shared tool capture hook logic into reusable utility (#191)
  • Updated model pricing and default model selections (#181)
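
For readers who want the pre-change behavior, the breaking change above can be reverted with a config fragment along these lines. This is a sketch only: the nesting is inferred from the dotted key paths in the entry, and the file name `config.yaml` is taken from the 0.1.0 notes, so check both against your own setup.

```yaml
# config.yaml (file name assumed from the 0.1.0 rename entry)
execution:
  # Restore one session per scenario instead of the new batched default.
  session_strategy: "isolated"
  # Alternative named in the changelog entry; either key is described
  # as sufficient, so it is left commented out here.
  # session_isolation: true
```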

Fixed

  • PostToolUse/PostToolUseFailure hooks added to batch execution mode (#189)
  • System prompt inclusion in batch evaluation requests (#182)
  • System prompts included in token counting for cost estimation (#180)
  • Label events no longer cancel issue analysis workflow

0.2.0 - 2026-01-10

Added

  • MCP server evaluation with tool detection via mcp__<server>__<tool> pattern (#63)
  • Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
  • E2E integration tests with real Claude Agent SDK (#68)
  • ReDoS protection for custom sanitization patterns (#66)
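
The `mcp__<server>__<tool>` naming pattern mentioned above can be illustrated with a small parser. This is a minimal sketch, not the project's implementation: `parseMcpToolName`, `McpToolRef`, and `MCP_TOOL_PATTERN` are hypothetical names, and it assumes the server name ends at the first double underscore after the `mcp__` prefix.

```typescript
// Hypothetical helper for illustration; not taken from the project's source.
interface McpToolRef {
  server: string;
  tool: string;
}

// Matches mcp__<server>__<tool>. The lazy first group means the server name
// ends at the first "__" after the prefix (an assumption of this sketch).
const MCP_TOOL_PATTERN = /^mcp__(.+?)__(.+)$/;

function parseMcpToolName(name: string): McpToolRef | null {
  const match = MCP_TOOL_PATTERN.exec(name);
  if (match === null) {
    return null; // Not an MCP-style tool name.
  }
  const server = match[1];
  const tool = match[2];
  if (server === undefined || tool === undefined) {
    return null;
  }
  return { server, tool };
}
```

A name without the prefix, such as a built-in tool, yields `null`, so a caller can use the result to tell MCP tool invocations apart from other tool calls.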

Changed

  • Modernized CI workflows with updated action versions (#64, #65)
  • Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
  • Improved README and CLAUDE.md documentation (#69)

Fixed

  • CI not failing on codecov errors for Dependabot PRs
  • CLI --version now reads from package.json instead of hardcoded value

0.1.0 - 2026-01-02

Added

  • Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
  • Support for skills, agents, and commands evaluation
  • Programmatic detection via tool capture parsing
  • LLM judge for quality assessment with multi-sampling
  • Resume capability with state checkpointing
  • Cost estimation before execution (dry-run mode)
  • Multiple output formats (JSON, YAML, JUnit XML, TAP)
  • Semantic variation testing for trigger robustness
  • Rate limiter for API call protection (#32)
  • Symlink resolution for plugin path validation (#33)
  • PII filtering for verbose transcript logging (#34)
  • Custom sanitization regex pattern validation (#46)
  • Comprehensive test suite with 943 tests and 93%+ coverage

Changed

  • Tuning configuration extracted from hardcoded values (#26)
  • Renamed seed.yaml to config.yaml for clarity (#25)

Fixed

  • Correct Anthropic structured output API usage in LLM judge (#9)
  • Variance propagation from runJudgment to metrics (#30)
  • Centralized logger and pricing utilities (#43)