# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- SubagentStart/SubagentStop hooks for improved agent detection accuracy (#192)
- Prompt caching token economics in cost estimation (#185)
- Temperature configuration for deterministic scenario generation (#183)
- Explicit timeout configuration for Anthropic SDK API calls (#179)
- Anthropic SDK request ID preservation in errors (#178)
- Claude issue analysis workflow (#177)
### Changed

- BREAKING: Default execution strategy changed from isolated to batched mode for ~80% faster startup. Scenarios testing the same component now share a session with `/clear` between them. To restore the previous behavior, set `execution.session_strategy: "isolated"` or `execution.session_isolation: true` in your config. (#86)
- Extracted shared tool capture hook logic into a reusable utility (#191)
- Updated model pricing and default model selections (#181)
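
For the breaking change above, the restore-isolated-mode migration can be sketched as a config fragment. The key names come from the changelog entry itself; the surrounding file structure is an assumption:

```yaml
# Sketch: restore the pre-change isolated execution strategy.
# Key names are from the changelog entry; file layout is assumed.
execution:
  session_strategy: "isolated"
  # or, equivalently:
  # session_isolation: true
```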
### Fixed

- PostToolUse/PostToolUseFailure hooks added to batch execution mode (#189)
- System prompt inclusion in batch evaluation requests (#182)
- System prompts included in token counting for cost estimation (#180)
- Label events no longer cancel issue analysis workflow
## [0.2.0] - 2026-01-10

### Added
- MCP server evaluation with tool detection via the `mcp__<server>__<tool>` pattern (#63)
- Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
- E2E integration tests with real Claude Agent SDK (#68)
- ReDoS protection for custom sanitization patterns (#66)
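
The `mcp__<server>__<tool>` detection above can be illustrated with a minimal sketch (hypothetical helper, not the project's actual implementation):

```typescript
// Hypothetical sketch: splitting tool names that follow the
// mcp__<server>__<tool> naming convention into their parts.
// The lazy quantifier lets server names themselves contain single underscores.
const MCP_TOOL = /^mcp__(.+?)__(.+)$/;

function parseMcpTool(name: string): { server: string; tool: string } | null {
  const match = MCP_TOOL.exec(name);
  // Non-matching names (e.g. built-in tools like "Read") yield null.
  return match ? { server: match[1], tool: match[2] } : null;
}

// Example: a GitHub MCP server exposing a create_issue tool.
console.log(parseMcpTool("mcp__github__create_issue")); // { server: "github", tool: "create_issue" }
console.log(parseMcpTool("Read")); // null
```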
### Changed

- Modernized CI workflows with updated action versions (#64, #65)
- Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
- Improved README and CLAUDE.md documentation (#69)
### Fixed

- CI no longer fails on codecov errors for Dependabot PRs
- CLI `--version` now reads from package.json instead of a hardcoded value
## [0.1.0] - 2026-01-02

### Added
- Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
- Support for skills, agents, and commands evaluation
- Programmatic detection via tool capture parsing
- LLM judge for quality assessment with multi-sampling
- Resume capability with state checkpointing
- Cost estimation before execution (dry-run mode)
- Multiple output formats (JSON, YAML, JUnit XML, TAP)
- Semantic variation testing for trigger robustness
- Rate limiter for API call protection (#32)
- Symlink resolution for plugin path validation (#33)
- PII filtering for verbose transcript logging (#34)
- Custom sanitization regex pattern validation (#46)
- Comprehensive test suite with 943 tests and 93%+ coverage
### Changed

- Tuning configuration extracted from hardcoded values (#26)
- Renamed seed.yaml to config.yaml for clarity (#25)
### Fixed

- Correct Anthropic structured output API usage in LLM judge (#9)
- Variance propagation from runJudgment to metrics (#30)
- Centralized logger and pricing utilities (#43)