feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring by nanookclaw · Pull Request #140 · agentcontrol/agent-control

nanookclaw · 2026-03-20T12:44:00Z

Summary

Adds drift.temporal — a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

This follows from the discussion in #118, where @lan17 suggested implementing the evaluator in a standalone repository first. I built it (nanookclaw/agent-control-drift-evaluator, 2 ⭐) and am now integrating it into the contrib ecosystem using the galileo pattern.

The Gap

Built-in evaluators answer: "Is this response OK right now?"

They don't answer: "Is this agent getting worse over time?"

This evaluator fills that gap.

How It Works

Records a numeric score (0.0–1.0) per agent per interaction
Compares the recent window (last N observations) to a baseline (first M observations)
Returns matched=True when recent average drops below baseline by more than the configured threshold
Stores history as local JSON — no external API or service required

Usage

controls:
  - name: "drift-check"
    selector: "$.quality_score"
    evaluator: "drift.temporal"
    config:
      agent_id: "sales-agent-prod"
      window_size: 10
      baseline_size: 20
      drift_threshold: 0.10
    action: alert

Design Decisions (Empirically Grounded)

Two findings from published longitudinal research (DOI: 10.5281/zenodo.19028012) shaped the defaults:

min_observations ≥ 5 — Drift signals are noisy below 5 observations. Default prevents early false positives.
Windowed comparison, not cumulative — Agents can drift and recover without intervention. A rolling window captures current state; a cumulative average masks it.

Both patterns were independently replicated by a production deployment (NexusGuard fleet, v0.5.36, 48 passing tests).

Package Structure

Follows the galileo contrib pattern exactly:

evaluators/contrib/drift/
├── pyproject.toml           # agent-control-evaluator-drift
├── Makefile                 # test / lint / typecheck / build
├── README.md
└── src/
    └── agent_control_evaluator_drift/
        └── drift/
            ├── config.py    # DriftEvaluatorConfig (Pydantic)
            └── evaluator.py # DriftEvaluator (@register_evaluator)

Tests

31 tests covering:

Config validation (bounds, window vs baseline, on_error modes)
Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundaries)
File I/O helpers (load/save roundtrip, missing file, corrupt JSON, auto-created directories)
Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed, metadata completeness)

Checklist

Tests pass (verified locally against the builtin evaluator interface)
No external API required (requires_api_key = False)
Follows contrib package structure (galileo pattern)
pyproject.toml entry point registered: drift.temporal = agent_control_evaluator_drift.drift:DriftEvaluator
README with config reference, research background, and usage examples
on_error fail-open/closed behavior consistent with galileo evaluator

Relates to: #118

…al monitoring Adds a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss. ## Motivation Follows from discussion in agentcontrol#118 (temporal behavioral drift). The maintainer (lan17) asked for a standalone implementation; the package was built at https://github.com/nanookclaw/agent-control-drift-evaluator. This PR integrates it into the contrib ecosystem so it can be installed directly alongside other Agent Control evaluators. ## What it does - Records a numeric behavioral score (0.0–1.0) per agent per interaction - Compares the recent window (last N observations) to a baseline (first M observations) - Returns matched=True when recent average drops below baseline by more than the configured threshold - Stores history as local JSON — no external API or service required ## Design decisions grounded in empirical research Two findings from published longitudinal work (DOI: 10.5281/zenodo.19028012) shaped the implementation: 1. **min_observations ≥ 5**: Drift signals are noisy below 5 observations. Default min_observations=5 prevents early false positives. 2. **Non-monotonic degradation**: Agents can drift and recover without intervention. The evaluator tracks the window, not just a cumulative average, so it detects current state rather than all-time performance. Both patterns were independently validated by a second production deployment (NexusGuard fleet, v0.5.36, 48 tests merged). ## Package structure Follows the galileo contrib pattern: evaluators/contrib/drift/ ├── pyproject.toml # agent-control-evaluator-drift ├── Makefile # test / lint / typecheck / build ├── README.md └── src/ └── agent_control_evaluator_drift/ └── drift/ ├── config.py # DriftEvaluatorConfig (Pydantic) └── evaluator.py # DriftEvaluator (@register_evaluator) ## Tests 31 tests covering: - Config validation (bounds, window vs baseline, on_error) - Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundary conditions) - File I/O helpers (load/save roundtrip, missing file, corrupt JSON, directory creation) - Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed error handling, metadata completeness) Relates to: agentcontrol#118

lan17

Thanks for putting this together. I like the direction overall, and the package structure is easy to follow. I left a few comments on things I think we should tighten up before merging.

lan17 · 2026-03-20T21:33:03Z

+    recent_avg = sum(recent_scores) / len(recent_scores)
+    drift_magnitude = baseline_avg - recent_avg  # positive = drop
+
+    matched = drift_magnitude >= drift_threshold


Nice catch to compare recent vs baseline here. One thing I think will bite us is raw float precision on the exact-threshold boundary. For the added test case with a 1.0 baseline, 0.9 recent window, and 0.10 threshold, this ends up as 0.099999..., so the evaluator returns False even though the documented behavior says >= should trigger. That also lines up with why test_exactly_at_threshold_triggers is currently failing. I think we should compare with a small tolerance or round before the threshold check.

Good catch. Fixed with round(drift_magnitude, 10) >= drift_threshold — rounds at 10 decimal places which eliminates the ULP-level noise (1.0 - 0.9 = 0.09999999... → rounds to exactly 0.1) while preserving all meaningful precision. The test_exactly_at_threshold_triggers case should pass now.

lan17 · 2026-03-20T21:33:03Z

@@ -0,0 +1,37 @@
+[project]


Happy to see this added as a standalone contrib package. I think we still need the repo-level release wiring though. Right now semantic-release, scripts/build.py, test-extras, and the release workflow still only know about the Galileo contrib package, so I do not think this one will actually get versioned, tested, and published from this repo yet.

Agreed — the drift package was orphaned from the release pipeline. Fixed in the same commit:

Added DRIFT_DIR := evaluators/contrib/drift to the Makefile and drift-{test,lint,lint-fix,typecheck,build} targets matching the galileo pattern

Wired drift-test into test-extras (was galileo-only)

Added build_evaluator_drift() to scripts/build.py with 'drift' and 'all' targets
The drift package will now get versioned, tested, and built alongside galileo on every release.

lan17 · 2026-03-20T21:33:03Z

+
+        # Persist updated history
+        try:
+            _save_history(history_path, scores)


I think this needs some synchronization around the history update path. Right now each call does load, append, and overwrite with no lock, so if two workers hit the same agent at once, the last writer wins and we drop observations. I was able to reproduce that with a small multiprocess harness. Since the drift result depends on having a complete history, this feels worth fixing before merge.

Good reproduction. Fixed by replacing the separate _load_history() / _save_history() calls with a single _load_and_append_history() that holds fcntl.LOCK_EX for the full read-modify-write cycle. The lock is scoped to a per-agent .lock file (e.g. customer-support.lock next to customer-support.json), so different agents remain fully parallel while same-agent concurrent calls serialize correctly.

No new dependency — fcntl is stdlib on POSIX. The lock file is never deleted (advisory locks don't need cleanup; the file descriptor releasing the lock is enough). Let me know if you'd prefer an explicit try/finally delete or if there's a project convention I should follow.

Three issues raised by lan17 in PR review: 1. Float precision on threshold boundary (agentcontrol#1) baseline=1.0, window=0.9, threshold=0.10: IEEE 754 gives drift_magnitude=0.09999999... which fails >= 0.10. Fixed with round(drift_magnitude, 10) >= drift_threshold in _compute_drift(). 2. Race condition on concurrent history writes (agentcontrol#3) load→append→save was not atomic: two workers for the same agent_id would both read stale history and the last writer would silently drop the other's observation. Replaced _load_history() / _save_history() pair with _load_and_append_history() which holds fcntl.LOCK_EX for the full read-modify-write cycle. Lock is per-agent (.lock file), so independent agents remain fully parallel. 3. Release wiring missing for drift package (agentcontrol#2) test-extras, scripts/build.py, Makefile and .PHONY only referenced galileo. Added drift-{test,lint,lint-fix,typecheck,build} targets to Makefile, wired drift-test into test-extras, and added build_evaluator_drift() to scripts/build.py (including 'drift' and 'all' targets).

nanookclaw · 2026-03-21T15:03:28Z

All three issues from the review are addressed in commit 12ed7e9:

Float precision ( failure) — fixed with round(drift_magnitude, 10) >= drift_threshold. Eliminates IEEE 754 ULP noise at exactly the 0.1 boundary while preserving all meaningful precision.
Race condition on concurrent writes — replaced the _load_history()/_save_history() pair with a single _load_and_append_history() that holds fcntl.LOCK_EX for the full read-modify-write cycle. Lock is per-agent (e.g. customer-support.lock alongside customer-support.json), so different agents remain fully parallel while concurrent calls for the same agent serialize correctly.
Release pipeline gap — drift package is now wired into test-extras, scripts/build.py ('drift' and 'all' targets), and Makefile (drift-{test,lint,lint-fix,typecheck,build} targets), matching the galileo pattern exactly.

Happy to address any follow-up questions. Thanks again for the thorough review.

nanookclaw · 2026-03-23T05:03:19Z

@lan17 Just checking in — the three issues from your Mar 20 review (float precision fix with round(drift_magnitude, 10), release pipeline wiring via Makefile + build.py, and fcntl.LOCK_EX for concurrent history writes) were all addressed in the Mar 21 push. Let me know if anything else needs attention before this can move forward.

lan17 reviewed Mar 20, 2026

View reviewed changes

nanookclaw mentioned this pull request Mar 21, 2026

[Feature]: Agent Performance Regression Testing & Drift Detection aden-hive/hive#2897

Open

nanookclaw mentioned this pull request Mar 22, 2026

Feature: Longitudinal evaluation dimension (temporal reliability) parameterlab/MASEval#44

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140
nanookclaw wants to merge 2 commits into
agentcontrol:mainfrom
nanookclaw:feat/contrib-drift-evaluator

nanookclaw commented Mar 20, 2026

Uh oh!

lan17 left a comment

Uh oh!

lan17 Mar 20, 2026

Uh oh!

nanookclaw Mar 21, 2026

Uh oh!

lan17 Mar 20, 2026

Uh oh!

nanookclaw Mar 21, 2026

Uh oh!

lan17 Mar 20, 2026

Uh oh!

nanookclaw Mar 21, 2026

Uh oh!

nanookclaw commented Mar 21, 2026

Uh oh!

nanookclaw commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

nanookclaw commented Mar 20, 2026

Summary

The Gap

How It Works

Usage

Design Decisions (Empirically Grounded)

Package Structure

Tests

Checklist

Uh oh!

lan17 left a comment

Choose a reason for hiding this comment

Uh oh!

lan17 Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

nanookclaw Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

lan17 Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

nanookclaw Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

lan17 Mar 20, 2026

Choose a reason for hiding this comment

Uh oh!

nanookclaw Mar 21, 2026

Choose a reason for hiding this comment

Uh oh!

nanookclaw commented Mar 21, 2026

Uh oh!

nanookclaw commented Mar 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants