Skip to content

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140

Open
nanookclaw wants to merge 1 commit intoagentcontrol:mainfrom
nanookclaw:feat/contrib-drift-evaluator
Open

feat(contrib): add drift.temporal evaluator for longitudinal behavioral monitoring#140
nanookclaw wants to merge 1 commit intoagentcontrol:mainfrom
nanookclaw:feat/contrib-drift-evaluator

Conversation

@nanookclaw
Copy link

Summary

Adds drift.temporal — a new contrib evaluator that detects gradual behavioral degradation patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

This follows from the discussion in #118, where @lan17 suggested implementing the evaluator in a standalone repository first. I built it (nanookclaw/agent-control-drift-evaluator, 2 ⭐) and am now integrating it into the contrib ecosystem using the galileo pattern.

The Gap

Built-in evaluators answer: "Is this response OK right now?"

They don't answer: "Is this agent getting worse over time?"

This evaluator fills that gap.

How It Works

  1. Records a numeric score (0.0–1.0) per agent per interaction
  2. Compares the recent window (last N observations) to a baseline (first M observations)
  3. Returns matched=True when recent average drops below baseline by more than the configured threshold
  4. Stores history as local JSON — no external API or service required

Usage

controls:
  - name: "drift-check"
    selector: "$.quality_score"
    evaluator: "drift.temporal"
    config:
      agent_id: "sales-agent-prod"
      window_size: 10
      baseline_size: 20
      drift_threshold: 0.10
    action: alert

Design Decisions (Empirically Grounded)

Two findings from published longitudinal research (DOI: 10.5281/zenodo.19028012) shaped the defaults:

  1. min_observations ≥ 5 — Drift signals are noisy below 5 observations. Default prevents early false positives.
  2. Windowed comparison, not cumulative — Agents can drift and recover without intervention. A rolling window captures current state; a cumulative average masks it.

Both patterns were independently replicated by a production deployment (NexusGuard fleet, v0.5.36, 48 passing tests).

Package Structure

Follows the galileo contrib pattern exactly:

evaluators/contrib/drift/
├── pyproject.toml           # agent-control-evaluator-drift
├── Makefile                 # test / lint / typecheck / build
├── README.md
└── src/
    └── agent_control_evaluator_drift/
        └── drift/
            ├── config.py    # DriftEvaluatorConfig (Pydantic)
            └── evaluator.py # DriftEvaluator (@register_evaluator)

Tests

31 tests covering:

  • Config validation (bounds, window vs baseline, on_error modes)
  • Core drift computation (insufficient data, baseline building, stable, drift detected, threshold boundaries)
  • File I/O helpers (load/save roundtrip, missing file, corrupt JSON, auto-created directories)
  • Full evaluator integration (persistence across instances, independent agent_id tracking, score clamping, fail-open/closed, metadata completeness)

Checklist

  • Tests pass (verified locally against the builtin evaluator interface)
  • No external API required (requires_api_key = False)
  • Follows contrib package structure (galileo pattern)
  • pyproject.toml entry point registered: drift.temporal = agent_control_evaluator_drift.drift:DriftEvaluator
  • README with config reference, research background, and usage examples
  • on_error fail-open/closed behavior consistent with galileo evaluator

Relates to: #118

…al monitoring

Adds a new contrib evaluator that detects gradual behavioral degradation
patterns that point-in-time evaluators (regex, list, SQL, JSON) miss.

## Motivation

Follows from discussion in agentcontrol#118 (temporal behavioral drift). The maintainer
(lan17) asked for a standalone implementation; the package was built at
https://github.com/nanookclaw/agent-control-drift-evaluator. This PR
integrates it into the contrib ecosystem so it can be installed directly
alongside other Agent Control evaluators.

## What it does

- Records a numeric behavioral score (0.0–1.0) per agent per interaction
- Compares the recent window (last N observations) to a baseline
  (first M observations)
- Returns matched=True when recent average drops below baseline by
  more than the configured threshold
- Stores history as local JSON — no external API or service required

## Design decisions grounded in empirical research

Two findings from published longitudinal work (DOI: 10.5281/zenodo.19028012)
shaped the implementation:

1. **min_observations ≥ 5**: Drift signals are noisy below 5 observations.
   Default min_observations=5 prevents early false positives.

2. **Non-monotonic degradation**: Agents can drift and recover without
   intervention. The evaluator tracks the window, not just a cumulative
   average, so it detects current state rather than all-time performance.

Both patterns were independently validated by a second production deployment
(NexusGuard fleet, v0.5.36, 48 tests merged).

## Package structure

Follows the galileo contrib pattern:

  evaluators/contrib/drift/
  ├── pyproject.toml           # agent-control-evaluator-drift
  ├── Makefile                 # test / lint / typecheck / build
  ├── README.md
  └── src/
      └── agent_control_evaluator_drift/
          └── drift/
              ├── config.py    # DriftEvaluatorConfig (Pydantic)
              └── evaluator.py # DriftEvaluator (@register_evaluator)

## Tests

31 tests covering:
- Config validation (bounds, window vs baseline, on_error)
- Core drift computation (insufficient data, baseline building, stable,
  drift detected, threshold boundary conditions)
- File I/O helpers (load/save roundtrip, missing file, corrupt JSON,
  directory creation)
- Full evaluator integration (persistence across instances, independent
  agent_id tracking, score clamping, fail-open/closed error handling,
  metadata completeness)

Relates to: agentcontrol#118
Copy link
Contributor

@lan17 lan17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for putting this together. I like the direction overall, and the package structure is easy to follow. I left a few comments on things I think we should tighten up before merging.

recent_avg = sum(recent_scores) / len(recent_scores)
drift_magnitude = baseline_avg - recent_avg # positive = drop

matched = drift_magnitude >= drift_threshold
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch to compare recent vs baseline here. One thing I think will bite us is raw float precision on the exact-threshold boundary. For the added test case with a 1.0 baseline, 0.9 recent window, and 0.10 threshold, this ends up as 0.099999..., so the evaluator returns False even though the documented behavior says >= should trigger. That also lines up with why test_exactly_at_threshold_triggers is currently failing. I think we should compare with a small tolerance or round before the threshold check.

@@ -0,0 +1,37 @@
[project]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Happy to see this added as a standalone contrib package. I think we still need the repo-level release wiring though. Right now semantic-release, scripts/build.py, test-extras, and the release workflow still only know about the Galileo contrib package, so I do not think this one will actually get versioned, tested, and published from this repo yet.


# Persist updated history
try:
_save_history(history_path, scores)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this needs some synchronization around the history update path. Right now each call does load, append, and overwrite with no lock, so if two workers hit the same agent at once, the last writer wins and we drop observations. I was able to reproduce that with a small multiprocess harness. Since the drift result depends on having a complete history, this feels worth fixing before merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants