Skip to content

Enforce the v1 env contract with semgrep and framework support#1387

Open
willccbb wants to merge 1 commit into
mainfrom
codex/enforce-v1-env-contract
Open

Enforce the v1 env contract with semgrep and framework support#1387
willccbb wants to merge 1 commit into
mainfrom
codex/enforce-v1-env-contract

Conversation

@willccbb
Copy link
Copy Markdown
Member

@willccbb willccbb commented May 15, 2026

Summary

  • Added semgrep rules to mechanically enforce the v1 golden path for loaders, configs, transcripts, signal handlers, and generated AGENTS files.
  • Tightened v1 framework support so subclass config identity is preserved, EnvConfig uses typed child configs, and transcripts are typed as list[vf.Message].
  • Updated affected example environments to follow the contract and removed cookbook smells like signal factories, ad hoc env-var judge wiring, and constructor-level objects= / bindings= usage.
  • Documented the v1 env shape in docs/environments.md and projected the guidance into the repo and lab AGENTS files.

Testing

  • pre-commit run --all-files passed.
  • ty check verifiers passed.
  • Focused pytest coverage for the updated v1 config and environment loaders passed.

Note

High Risk
High risk because it changes core v1 Taskset/Harness/EnvConfig construction and updates many environment loaders to a stricter contract, which can break downstream environment configs and judge/tool wiring.

Overview
Enforces a stricter v1 environment “golden path” by adding Semgrep policy rules that ban loader/signal anti-patterns (handler factories, objects=/bindings= in environment constructors, v0 parsers in v1 modules, env-var reads in rewards/updates) and require load_environment(config: *EnvConfig) -> vf.Env to be a one-line return vf.Env(...).

Refactors multiple example environments to comply: moves config-dependent behavior into typed Taskset/Harness/Toolset classes, renames env-specific config fields to avoid overriding base defaults (e.g. max_turnsturn_limit in several harness configs), removes group-returning BFCL loader behavior, and standardizes judge rewards to use state.get_endpoint_config(api="chat") with optional per-task judge_model.

Tightens framework support and generated-doc workflows: updates v1 Taskset/Harness to preserve concrete config instances without config_type indirection, removes **kwargs passthrough patterns in packaged harnesses, exports Message/Messages from verifiers, and changes scripts/sync.py + pre-commit to fail when generated AGENTS.md/CLAUDE.md projections are out of date while also documenting the v1 env shape in docs/environments.md (propagated to lab/repo guides).

Reviewed by Cursor Bugbot for commit 3f6b37f. Bugbot is set up for automated code reviews on this repo. Configure here.

Note

Enforce v1 env contract across harnesses, tasksets, and environments with semgrep rules

  • Adds semgrep rules in verifiers.yml to enforce v1 environment patterns: no **kwargs in Taskset/Harness subclass __init__, no inline objects passed to constructors, no env var reads in reward handlers, and correct load_environment signature shape.
  • Replaces **kwargs: Unpack[HarnessKwargs] with explicit keyword parameters across all harness subclasses (MiniSWEAgent, OpenCode, Pi, RLM, Terminus2) and removes the HarnessKwargs TypedDict.
  • Removes config_type class attribute from Harness, Taskset, and several environment subclasses; config resolution is now hard-wired to the base HarnessConfig/TasksetConfig via from_config.
  • Renames max_turns to turn_limit in harness configs across multiple environments (hello_self_judge_v1, hello_group_reward_v1, hello_parallel_sandbox_v1, langchain_deep_agents_wikispeedia) and maps it to max_turns at construction time.
  • Moves reward/tool dependencies (e.g. TagExtractor, child harness, python session) from injected object bindings to module-level singletons or state-based access, and introduces Toolset/Harness subclasses to own that wiring.
  • Risk: Subclasses that relied on config_type overrides or **kwargs forwarding will now get base config types or raise TypeError; state_transcript now returns the same list object rather than a copy, altering mutation semantics.

Macroscope summarized 3f6b37f.

Comment on lines +779 to +780
def load_environment(config: Tau2EnvConfig) -> vf.Env:
return vf.Env(taskset=load_taskset(config=config.taskset))
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low tau2_bench_v1/tau2_bench_v1.py:779

The refactored load_environment no longer passes harness configuration to vf.Env(), so any harness settings in Tau2EnvConfig are silently ignored. The old implementation extracted config.harness and passed harness=vf.Harness(config=harness_config) to the constructor. Consider restoring the harness handling or documenting that Tau2EnvConfig.harness is intentionally unsupported.

 def load_environment(config: Tau2EnvConfig) -> vf.Env:
-    return vf.Env(taskset=load_taskset(config=config.taskset))
+    harness_config = (
+        None if config.harness is None else vf.HarnessConfig(config.harness)
+    )
+    return vf.Env(
+        taskset=load_taskset(config=config.taskset),
+        harness=vf.Harness(config=harness_config),
+    )
Also found in 1 other location(s)

environments/wiki_search/wiki_search_v1.py:265

Parameters judge_base_url and judge_api_key_var are accepted by load_taskset but are now completely unused after the refactoring. Callers like load_v1_environment pass these parameters expecting them to configure the judge client, but the new judge_reward_func ignores them and instead uses state.get_endpoint_config(api="chat"). This silently breaks the ability to use a different API endpoint for the judge model than for the main chat completions.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file environments/tau2_bench_v1/tau2_bench_v1.py around lines 779-780:

The refactored `load_environment` no longer passes harness configuration to `vf.Env()`, so any `harness` settings in `Tau2EnvConfig` are silently ignored. The old implementation extracted `config.harness` and passed `harness=vf.Harness(config=harness_config)` to the constructor. Consider restoring the harness handling or documenting that `Tau2EnvConfig.harness` is intentionally unsupported.

Evidence trail:
environments/tau2_bench_v1/tau2_bench_v1.py lines 779-780 (REVIEWED_COMMIT): new `load_environment` omits harness; git_diff MERGE_BASE..REVIEWED_COMMIT showing old lines 793-806 that handled harness; verifiers/v1/config.py lines 327-329: `EnvConfig` has `harness: HarnessConfig` field; verifiers/v1/env.py lines 18-23: `Env.__init__` defaults `harness=None` and creates bare `Harness()`.

Also found in 1 other location(s):
- environments/wiki_search/wiki_search_v1.py:265 -- Parameters `judge_base_url` and `judge_api_key_var` are accepted by `load_taskset` but are now completely unused after the refactoring. Callers like `load_v1_environment` pass these parameters expecting them to configure the judge client, but the new `judge_reward_func` ignores them and instead uses `state.get_endpoint_config(api="chat")`. This silently breaks the ability to use a different API endpoint for the judge model than for the main chat completions.

Comment thread environments/wiki_search/wiki_search_v1.py
Comment thread docs/environments.md
@macroscopeapp
Copy link
Copy Markdown

macroscopeapp Bot commented May 15, 2026

Approvability

Verdict: Needs human review

This PR refactors core v1 framework patterns and 15+ environments with changes to config handling, dependency injection, and judge reward behavior. Multiple unresolved review comments identify potential bugs including silently ignored harness config, unused parameters, and config mutation issues that warrant human verification.

You can customize Macroscope's approvability policy. Learn more.

@willccbb willccbb force-pushed the codex/enforce-v1-env-contract branch from 31841a1 to 3f6b37f Compare May 16, 2026 17:41
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 3 potential issues.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.

tool_name = id_to_name.get(msg.tool_call_id)
if tool_name is None:
extra = msg.get("name")
tool_name = extra if isinstance(extra, str) else None
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Metric relies on tool message name attribute availability

Medium Severity

The refactored invalid_link_rate now relies on getattr(msg, "name", None) to identify click_link tool messages. The old code built a lookup table from assistant tool_calls to determine tool names, which was more robust. Standard ToolMessage objects created by the framework's base program don't carry a name attribute, so this metric may silently return 0.0 for completions where tool messages lack the name field.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.

Comment thread verifiers/v1/user.py
]
if isinstance(completion, list):
return list(cast(list[PromptMessage], completion))
return cast(list[PromptMessage], completion)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing list copy allows state mutation via transcript

Low Severity

Changing return list(cast(..., completion)) to return cast(..., completion) removes a defensive copy. Other branches in resolve_transcript return new lists, but this branch now returns a direct reference to state["completion"]. Any downstream mutation of the returned transcript would silently corrupt the state's completion data.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.

Comment thread verifiers/v1/harness.py
@classmethod
def config_schema(cls) -> str:
return cls.config_type.schema_text()
return HarnessConfig.schema_text()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Subclass config_schema always returns base schema text

Low Severity

Harness.config_schema() and Taskset.config_schema() now hardcode HarnessConfig.schema_text() and TasksetConfig.schema_text() respectively, instead of using the subclass's config type. Any subclass calling MyHarness.config_schema() or MyTaskset.config_schema() will get the base schema, missing custom config fields. The cls parameter is effectively ignored.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.

Comment thread verifiers/v1/taskset.py
@@ -156,7 +140,7 @@ def __init__(

@classmethod
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low v1/taskset.py:141

Taskset.config_schema() now unconditionally returns TasksetConfig.schema_text(), so subclasses like HarborTaskset that rely on the former config_type class variable now return the wrong schema (the base TasksetConfig schema) instead of their own config type's schema. If this hardcoding is intentional, consider documenting why subclasses can no longer customize their schema output.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file verifiers/v1/taskset.py around line 141:

`Taskset.config_schema()` now unconditionally returns `TasksetConfig.schema_text()`, so subclasses like `HarborTaskset` that rely on the former `config_type` class variable now return the wrong schema (the base `TasksetConfig` schema) instead of their own config type's schema. If this hardcoding is intentional, consider documenting why subclasses can no longer customize their schema output.

Evidence trail:
Pre-PR harbor.py at commit 1fd65e1a (line 73): `config_type = HarborTasksetConfig` on `HarborTaskset`. Pre-PR taskset.py (visible in diff): `config_schema()` used `cls.config_type.schema_text()`. Post-PR taskset.py line 142-143: `config_schema()` hardcodes `TasksetConfig.schema_text()`. Post-PR harbor.py lines 53-68: `HarborTasksetConfig` extends `TasksetConfig` with many additional fields (dataset, task_names, docker_image, etc.). Post-PR harbor.py line 71+: `config_type` removed from `HarborTaskset`. git_diff MERGE_BASE..REVIEWED_COMMIT on verifiers/v1/taskset.py and verifiers/v1/packages/tasksets/harbor.py confirms coordinated removal.

Comment thread .semgrep/verifiers.yml
- /environments/**/tasks/**/*.py
pattern: vf.Env(..., harness=vf.Harness(config=config.harness), ...)

- id: verifiers-v1-no-env-var-reads-in-signals
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low .semgrep/verifiers.yml:346

The verifiers-v1-no-env-var-reads-in-signals rule only matches async def handlers, so synchronous def handlers that call os.getenv(...) or access os.environ[...] bypass the check. Add def $FUNC(...) patterns for @vf.reward, @vf.reward(...), @vf.update, and @vf.update(...) to catch sync handlers too.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file .semgrep/verifiers.yml around line 346:

The `verifiers-v1-no-env-var-reads-in-signals` rule only matches `async def` handlers, so synchronous `def` handlers that call `os.getenv(...)` or access `os.environ[...]` bypass the check. Add `def $FUNC(...)` patterns for `@vf.reward`, `@vf.reward(...)`, `@vf.update`, and `@vf.update(...)` to catch sync handlers too.

Comment thread verifiers/v1/harness.py
Comment on lines +95 to 101
self.config = (
config
if isinstance(config, HarnessConfig)
else HarnessConfig.from_config(config)
)
if max_turns is not None:
self.config.max_turns = max_turns
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low v1/harness.py:95

When a HarnessConfig instance is passed directly to the constructor, lines 100-101 mutate self.config.max_turns in-place. Since the same config object can be reused across multiple Harness instances, this causes later mutations to leak back to earlier harnesses. Previously, from_config was always called which created a new instance. Consider copying the config before mutation, or document if sharing state between harnesses is intentional.

-        self.config = (
-            config
-            if isinstance(config, HarnessConfig)
-            else HarnessConfig.from_config(config)
-        )
+        self.config = HarnessConfig.from_config(config)
         if max_turns is not None:
             self.config.max_turns = max_turns
Also found in 1 other location(s)

verifiers/v1/user.py:28

The change from return list(cast(list[PromptMessage], completion)) to return cast(list[PromptMessage], completion) removes the defensive copy of the completion list. This introduces inconsistent mutation behavior: when both prompt and completion are lists (lines 23-26), a new list is returned via spread syntax, but when only completion is a list (line 28), the original list from state is now returned directly. If any caller modifies the returned list, it will mutate the original state['completion'] in the second case but not the first, leading to subtle data corruption bugs that depend on whether prompt was also present.

🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file verifiers/v1/harness.py around lines 95-101:

When a `HarnessConfig` instance is passed directly to the constructor, lines 100-101 mutate `self.config.max_turns` in-place. Since the same config object can be reused across multiple `Harness` instances, this causes later mutations to leak back to earlier harnesses. Previously, `from_config` was always called which created a new instance. Consider copying the config before mutation, or document if sharing state between harnesses is intentional.

Evidence trail:
verifiers/v1/harness.py lines 95-101 (REVIEWED_COMMIT): direct assignment of config without copy, then mutation of max_turns. verifiers/v1/config.py line 67-68: `class Config(BaseModel)` with `ConfigDict(arbitrary_types_allowed=True, extra="forbid")` — no frozen=True. verifiers/v1/config.py lines 91-93: `from_config` creates a new instance via `cls(...)`. git_diff MERGE_BASE..REVIEWED_COMMIT verifiers/v1/harness.py: previous code was `self.config = type(self).config_type.from_config(config)` which always created a new instance.

Also found in 1 other location(s):
- verifiers/v1/user.py:28 -- The change from `return list(cast(list[PromptMessage], completion))` to `return cast(list[PromptMessage], completion)` removes the defensive copy of the `completion` list. This introduces inconsistent mutation behavior: when both `prompt` and `completion` are lists (lines 23-26), a new list is returned via spread syntax, but when only `completion` is a list (line 28), the original list from `state` is now returned directly. If any caller modifies the returned list, it will mutate the original `state['completion']` in the second case but not the first, leading to subtle data corruption bugs that depend on whether `prompt` was also present.

Comment thread .semgrep/verifiers.yml
Comment on lines +293 to +297
- pattern: |
def load_environment(...):
$STMT
return $RET

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Low .semgrep/verifiers.yml:293

The pattern in rule verifiers-v1-load-environment-no-local-config-wiring uses $STMT as a single-statement metavariable, so it only matches functions with exactly one statement before the return. Functions violating the rule with two or more statements before return produce false negatives and pass undetected. Consider using $...STMT (ellipsis metavariable) to match any number of statements.

-      - pattern: |
-          def load_environment(...):
-              $STMT
-              return $RET
+      - pattern: |
-          def load_environment(...):
+          def load_environment(...):
-              $STMT
+              $...STMT
               return $RET
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file .semgrep/verifiers.yml around lines 293-297:

The `pattern` in rule `verifiers-v1-load-environment-no-local-config-wiring` uses `$STMT` as a single-statement metavariable, so it only matches functions with exactly one statement before the `return`. Functions violating the rule with two or more statements before `return` produce false negatives and pass undetected. Consider using `$...STMT` (ellipsis metavariable) to match any number of statements.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant