# Changelog

All notable changes to this project will be documented in this file.

The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.1.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [Unreleased]

### Added

- SubagentStart/SubagentStop hooks for improved agent detection accuracy (#192)
- Prompt caching token economics in cost estimation (#185)
- Temperature configuration for deterministic scenario generation (#183)
- Explicit timeout configuration for Anthropic SDK API calls (#179)
- Anthropic SDK request ID preservation in errors (#178)
- Claude issue analysis workflow (#177)
### Changed

- BREAKING: Default execution strategy changed from isolated to batched mode for ~80% faster startup. Scenarios testing the same component now share a session with `/clear` between them. To restore the previous behavior, set `execution.session_strategy: "isolated"` or `execution.session_isolation: true` in your config. (#86)
- Extracted shared tool capture hook logic into a reusable utility (#191)
- Updated model pricing and default model selections (#181)
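
For the breaking change above, the restore-isolated-mode migration can be sketched as a config fragment. The key names come from the changelog entry itself; the surrounding file structure is an assumption:

```yaml
# Sketch: restore the pre-change isolated execution strategy.
# Key names are from the changelog entry; file layout is assumed.
execution:
  session_strategy: "isolated"
  # or, equivalently:
  # session_isolation: true
```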
### Fixed

- PostToolUse/PostToolUseFailure hooks added to batch execution mode (#189)
- System prompt inclusion in batch evaluation requests (#182)
- System prompts included in token counting for cost estimation (#180)
- Label events no longer cancel issue analysis workflow
## [0.2.0] - 2026-01-10

### Added
- MCP server evaluation with tool detection via the `mcp__<server>__<tool>` pattern (#63)
- Hooks evaluation with SDKHookResponseMessage event detection (#58, #49)
- E2E integration tests with real Claude Agent SDK (#68)
- ReDoS protection for custom sanitization patterns (#66)
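
The `mcp__<server>__<tool>` detection above can be illustrated with a minimal sketch (hypothetical helper, not the project's actual implementation):

```typescript
// Hypothetical sketch: splitting tool names that follow the
// mcp__<server>__<tool> naming convention into their parts.
// The lazy quantifier lets server names themselves contain single underscores.
const MCP_TOOL = /^mcp__(.+?)__(.+)$/;

function parseMcpTool(name: string): { server: string; tool: string } | null {
  const match = MCP_TOOL.exec(name);
  // Non-matching names (e.g. built-in tools like "Read") yield null.
  return match ? { server: match[1], tool: match[2] } : null;
}

// Example: a GitHub MCP server exposing a create_issue tool.
console.log(parseMcpTool("mcp__github__create_issue")); // { server: "github", tool: "create_issue" }
console.log(parseMcpTool("Read")); // null
```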
### Changed

- Modernized CI workflows with updated action versions (#64, #65)
- Updated dependencies: zod 4.3.5, glob 13.0.0 (#54, #55)
- Improved README and CLAUDE.md documentation (#69)
### Fixed

- CI no longer fails on codecov errors for Dependabot PRs
- CLI `--version` now reads from package.json instead of a hardcoded value
## [0.1.0] - 2026-01-02

### Added
- Initial 4-stage evaluation pipeline (Analysis → Generation → Execution → Evaluation)
- Support for skills, agents, and commands evaluation
- Programmatic detection via tool capture parsing
- LLM judge for quality assessment with multi-sampling
- Resume capability with state checkpointing
- Cost estimation before execution (dry-run mode)
- Multiple output formats (JSON, YAML, JUnit XML, TAP)
- Semantic variation testing for trigger robustness
- Rate limiter for API call protection (#32)
- Symlink resolution for plugin path validation (#33)
- PII filtering for verbose transcript logging (#34)
- Custom sanitization regex pattern validation (#46)
- Comprehensive test suite with 943 tests and 93%+ coverage
### Changed

- Tuning configuration extracted from hardcoded values (#26)
- Renamed seed.yaml to config.yaml for clarity (#25)
### Fixed

- Correct Anthropic structured output API usage in LLM judge (#9)
- Variance propagation from runJudgment to metrics (#30)
- Centralized logger and pricing utilities (#43)