feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring #140

nanookclaw wants to merge 1 commit into agentcontrol:main
Adds a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

## Motivation

Follows from the discussion in agentcontrol#118 (temporal behavioral drift). The maintainer (lan17) asked for a standalone implementation; the package was built at https://github.com/nanookclaw/agent-control-drift-evaluator. This PR integrates it into the contrib ecosystem so it can be installed directly alongside other Agent Control evaluators.

## What it does

- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline (first M observations)
- Returns matched=True when the recent average drops below the baseline by more than the configured threshold
- Stores history as local JSON — no external API or service required

## Design decisions grounded in empirical research

Two findings from published longitudinal work (DOI: 10.5281/zenodo.19028012) shaped the implementation:

1. **min_observations ≥ 5**: Drift signals are noisy below 5 observations. The default min_observations=5 prevents early false positives.
2. **Non-monotonic degradation**: Agents can drift and recover without intervention. The evaluator tracks the window, not just a cumulative average, so it detects current state rather than all-time performance.

Both patterns were independently validated by a second production deployment (NexusGuard fleet, v0.5.36, 48 tests merged).
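The baseline-vs-window comparison described above can be sketched in a few lines of plain Python. Function and parameter names here are illustrative assumptions based on this description, not the PR's actual API:

```python
def detect_drift(scores, baseline_size=5, window_size=5,
                 drift_threshold=0.10, min_observations=5):
    """Illustrative sketch of the baseline-vs-recent-window comparison.

    `scores` is the per-interaction history (each in 0.0-1.0); names and
    defaults mirror the PR description but are assumptions, not its API.
    """
    if len(scores) < max(min_observations, baseline_size + window_size):
        return {"matched": False, "reason": "insufficient_data"}
    baseline_avg = sum(scores[:baseline_size]) / baseline_size
    recent_avg = sum(scores[-window_size:]) / window_size
    drift_magnitude = baseline_avg - recent_avg  # positive = degradation
    return {"matched": drift_magnitude >= drift_threshold,
            "drift_magnitude": drift_magnitude}
```

For example, a history of `[1.0] * 5 + [0.5] * 5` yields a drift magnitude of 0.5 and triggers, while three observations alone report insufficient data.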
## Package structure

Follows the galileo contrib pattern:

```
evaluators/contrib/drift/
├── pyproject.toml    # agent-control-evaluator-drift
├── Makefile          # test / lint / typecheck / build
├── README.md
└── src/
    └── agent_control_evaluator_drift/
        └── drift/
            ├── config.py     # DriftEvaluatorConfig (Pydantic)
            └── evaluator.py  # DriftEvaluator (@register_evaluator)
```

## Tests

31 tests covering:

- Config validation (bounds, window vs baseline, on_error)
- Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundary conditions)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON, directory creation)
- Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed error handling, metadata completeness)

Relates to: agentcontrol#118
lan17 left a comment:
Thanks for putting this together. I like the direction overall, and the package structure is easy to follow. I left a few comments on things I think we should tighten up before merging.
```python
recent_avg = sum(recent_scores) / len(recent_scores)
drift_magnitude = baseline_avg - recent_avg  # positive = drop

matched = drift_magnitude >= drift_threshold
```
Nice catch to compare recent vs baseline here. One thing I think will bite us is raw float precision on the exact-threshold boundary. For the added test case with a 1.0 baseline, 0.9 recent window, and 0.10 threshold, this ends up as 0.099999..., so the evaluator returns False even though the documented behavior says >= should trigger. That also lines up with why test_exactly_at_threshold_triggers is currently failing. I think we should compare with a small tolerance or round before the threshold check.
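For illustration, the failure mode and one possible fix using `math.isclose` (the tolerance value here is only an example):

```python
import math

baseline_avg, recent_avg, drift_threshold = 1.0, 0.9, 0.10
drift_magnitude = baseline_avg - recent_avg  # 0.09999999999999998 in IEEE 754

# Raw comparison misses the documented >= boundary:
assert not (drift_magnitude >= drift_threshold)

# Comparing with a small tolerance triggers at the boundary as documented:
matched = (drift_magnitude >= drift_threshold
           or math.isclose(drift_magnitude, drift_threshold, rel_tol=1e-9))
assert matched
```

Rounding the magnitude (e.g. to 9 decimal places) before the comparison would work as well; either way the boundary test becomes deterministic.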
```diff
@@ -0,0 +1,37 @@
+[project]
```
Happy to see this added as a standalone contrib package. I think we still need the repo-level release wiring though. Right now semantic-release, scripts/build.py, test-extras, and the release workflow still only know about the Galileo contrib package, so I do not think this one will actually get versioned, tested, and published from this repo yet.
```python
# Persist updated history
try:
    _save_history(history_path, scores)
```
I think this needs some synchronization around the history update path. Right now each call does load, append, and overwrite with no lock, so if two workers hit the same agent at once, the last writer wins and we drop observations. I was able to reproduce that with a small multiprocess harness. Since the drift result depends on having a complete history, this feels worth fixing before merge.
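One possible shape for that fix, assuming a POSIX host: hold an exclusive `fcntl` lock on a sidecar lock file across the whole read-modify-write cycle. The helper names below are illustrative, not the PR's actual code:

```python
import fcntl
import json
from pathlib import Path

def append_score(history_path: Path, score: float) -> list:
    """Illustrative sketch: serialize load -> append -> save under an
    exclusive advisory lock so concurrent workers cannot drop each
    other's observations. POSIX-only; Windows would need msvcrt.locking
    or a portable locking library instead of fcntl.
    """
    history_path.parent.mkdir(parents=True, exist_ok=True)
    lock_path = history_path.with_suffix(".lock")
    with open(lock_path, "w") as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks until acquired
        try:
            if history_path.exists():
                scores = json.loads(history_path.read_text())
            else:
                scores = []
            scores.append(score)
            history_path.write_text(json.dumps(scores))
            return scores
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)
```

Writing to a temp file and atomically renaming over the history file would additionally protect readers from seeing a partially written JSON document.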
## Summary

Adds `drift.temporal` — a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

This follows from the discussion in #118, where @lan17 suggested implementing the evaluator in a standalone repository first. I built it (nanookclaw/agent-control-drift-evaluator, 2 ⭐) and am now integrating it into the contrib ecosystem using the galileo pattern.
## The Gap
Built-in evaluators answer: "Is this response OK right now?"
They don't answer: "Is this agent getting worse over time?"
This evaluator fills that gap.
## How It Works

- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline (first M observations)
- Returns `matched=True` when the recent average drops below the baseline by more than the configured threshold
- Stores history as local JSON — no external API or service required

## Usage
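The original usage snippet did not survive extraction. As a stand-in, here is a hypothetical configuration collecting the parameters this PR describes; every field name is an assumption inferred from the PR text, not the verified `DriftEvaluatorConfig` schema:

```python
# Every key below is an assumption inferred from the PR description,
# not the actual DriftEvaluatorConfig schema.
drift_config = {
    "evaluator": "drift.temporal",  # entry-point name from the checklist
    "baseline_size": 5,             # first M observations form the baseline
    "window_size": 5,               # last N observations form the recent window
    "drift_threshold": 0.10,        # flag when baseline_avg - recent_avg >= this
    "min_observations": 5,          # below this, never flag (noisy signal)
    "on_error": "fail_open",        # or "fail_closed"; mirrors galileo behavior
}
```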
## Design Decisions (Empirically Grounded)
Two findings from published longitudinal research (DOI: 10.5281/zenodo.19028012) shaped the defaults:
1. **min_observations ≥ 5** — Drift signals are noisy below 5 observations. The default prevents early false positives.
2. **Non-monotonic degradation** — Agents can drift and recover without intervention. The evaluator tracks the window, not just a cumulative average, so it detects current state rather than all-time performance.

Both patterns were independently replicated by a production deployment (NexusGuard fleet, v0.5.36, 48 passing tests).
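The non-monotonic point can be made concrete with invented numbers: after a temporary dip and a full recovery, a cumulative average still flags drift while the sliding window does not.

```python
scores = ([1.0] * 5     # healthy baseline
          + [0.5] * 3   # temporary degradation
          + [1.0] * 5)  # recovered without intervention

baseline_avg = sum(scores[:5]) / 5          # 1.0
cumulative_avg = sum(scores) / len(scores)  # ~0.885
window_avg = sum(scores[-5:]) / 5           # 1.0

# All-time comparison keeps flagging drift long after recovery:
assert baseline_avg - cumulative_avg >= 0.10
# The windowed comparison reflects current state: no drift.
assert not (baseline_avg - window_avg >= 0.10)
```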
## Package Structure

Follows the galileo contrib pattern exactly (directory tree shown in the commit message above).
## Tests

31 tests covering:

- Config validation (bounds, window vs baseline, `on_error` modes)
- Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundary conditions)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON, directory creation)
- Full evaluator integration (persistence across instances, independent `agent_id` tracking, score clamping, fail-open/closed, metadata completeness)

## Checklist
- `requires_api_key = False`
- `pyproject.toml` entry point registered: `drift.temporal = agent_control_evaluator_drift.drift:DriftEvaluator`
- `on_error` fail-open/closed behavior consistent with galileo evaluator

Relates to: #118