Enforce the v1 env contract with semgrep and framework support#1387
Enforce the v1 env contract with semgrep and framework support#1387willccbb wants to merge 1 commit into
Conversation
| def load_environment(config: Tau2EnvConfig) -> vf.Env: | ||
| return vf.Env(taskset=load_taskset(config=config.taskset)) |
There was a problem hiding this comment.
🟢 Low tau2_bench_v1/tau2_bench_v1.py:779
The refactored load_environment no longer passes harness configuration to vf.Env(), so any harness settings in Tau2EnvConfig are silently ignored. The old implementation extracted config.harness and passed harness=vf.Harness(config=harness_config) to the constructor. Consider restoring the harness handling or documenting that Tau2EnvConfig.harness is intentionally unsupported.
def load_environment(config: Tau2EnvConfig) -> vf.Env:
- return vf.Env(taskset=load_taskset(config=config.taskset))
+ harness_config = (
+ None if config.harness is None else vf.HarnessConfig(config.harness)
+ )
+ return vf.Env(
+ taskset=load_taskset(config=config.taskset),
+ harness=vf.Harness(config=harness_config),
+ )Also found in 1 other location(s)
environments/wiki_search/wiki_search_v1.py:265
Parameters
judge_base_urlandjudge_api_key_varare accepted byload_tasksetbut are now completely unused after the refactoring. Callers likeload_v1_environmentpass these parameters expecting them to configure the judge client, but the newjudge_reward_funcignores them and instead usesstate.get_endpoint_config(api="chat"). This silently breaks the ability to use a different API endpoint for the judge model than for the main chat completions.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file environments/tau2_bench_v1/tau2_bench_v1.py around lines 779-780:
The refactored `load_environment` no longer passes harness configuration to `vf.Env()`, so any `harness` settings in `Tau2EnvConfig` are silently ignored. The old implementation extracted `config.harness` and passed `harness=vf.Harness(config=harness_config)` to the constructor. Consider restoring the harness handling or documenting that `Tau2EnvConfig.harness` is intentionally unsupported.
Evidence trail:
environments/tau2_bench_v1/tau2_bench_v1.py lines 779-780 (REVIEWED_COMMIT): new `load_environment` omits harness; git_diff MERGE_BASE..REVIEWED_COMMIT showing old lines 793-806 that handled harness; verifiers/v1/config.py lines 327-329: `EnvConfig` has `harness: HarnessConfig` field; verifiers/v1/env.py lines 18-23: `Env.__init__` defaults `harness=None` and creates bare `Harness()`.
Also found in 1 other location(s):
- environments/wiki_search/wiki_search_v1.py:265 -- Parameters `judge_base_url` and `judge_api_key_var` are accepted by `load_taskset` but are now completely unused after the refactoring. Callers like `load_v1_environment` pass these parameters expecting them to configure the judge client, but the new `judge_reward_func` ignores them and instead uses `state.get_endpoint_config(api="chat")`. This silently breaks the ability to use a different API endpoint for the judge model than for the main chat completions.
ApprovabilityVerdict: Needs human review This PR refactors core v1 framework patterns and 15+ environments with changes to config handling, dependency injection, and judge reward behavior. Multiple unresolved review comments identify potential bugs including silently ignored harness config, unused parameters, and config mutation issues that warrant human verification. You can customize Macroscope's approvability policy. Learn more. |
31841a1 to
3f6b37f
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 3 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.
| tool_name = id_to_name.get(msg.tool_call_id) | ||
| if tool_name is None: | ||
| extra = msg.get("name") | ||
| tool_name = extra if isinstance(extra, str) else None |
There was a problem hiding this comment.
Metric relies on tool message name attribute availability
Medium Severity
The refactored invalid_link_rate now relies on getattr(msg, "name", None) to identify click_link tool messages. The old code built a lookup table from assistant tool_calls to determine tool names, which was more robust. Standard ToolMessage objects created by the framework's base program don't carry a name attribute, so this metric may silently return 0.0 for completions where tool messages lack the name field.
Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.
| ] | ||
| if isinstance(completion, list): | ||
| return list(cast(list[PromptMessage], completion)) | ||
| return cast(list[PromptMessage], completion) |
There was a problem hiding this comment.
Removing list copy allows state mutation via transcript
Low Severity
Changing return list(cast(..., completion)) to return cast(..., completion) removes a defensive copy. Other branches in resolve_transcript return new lists, but this branch now returns a direct reference to state["completion"]. Any downstream mutation of the returned transcript would silently corrupt the state's completion data.
Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.
| @classmethod | ||
| def config_schema(cls) -> str: | ||
| return cls.config_type.schema_text() | ||
| return HarnessConfig.schema_text() |
There was a problem hiding this comment.
Subclass config_schema always returns base schema text
Low Severity
Harness.config_schema() and Taskset.config_schema() now hardcode HarnessConfig.schema_text() and TasksetConfig.schema_text() respectively, instead of using the subclass's config type. Any subclass calling MyHarness.config_schema() or MyTaskset.config_schema() will get the base schema, missing custom config fields. The cls parameter is effectively ignored.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 3f6b37f. Configure here.
| @@ -156,7 +140,7 @@ def __init__( | |||
|
|
|||
| @classmethod | |||
There was a problem hiding this comment.
🟢 Low v1/taskset.py:141
Taskset.config_schema() now unconditionally returns TasksetConfig.schema_text(), so subclasses like HarborTaskset that rely on the former config_type class variable now return the wrong schema (the base TasksetConfig schema) instead of their own config type's schema. If this hardcoding is intentional, consider documenting why subclasses can no longer customize their schema output.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file verifiers/v1/taskset.py around line 141:
`Taskset.config_schema()` now unconditionally returns `TasksetConfig.schema_text()`, so subclasses like `HarborTaskset` that rely on the former `config_type` class variable now return the wrong schema (the base `TasksetConfig` schema) instead of their own config type's schema. If this hardcoding is intentional, consider documenting why subclasses can no longer customize their schema output.
Evidence trail:
Pre-PR harbor.py at commit 1fd65e1a (line 73): `config_type = HarborTasksetConfig` on `HarborTaskset`. Pre-PR taskset.py (visible in diff): `config_schema()` used `cls.config_type.schema_text()`. Post-PR taskset.py line 142-143: `config_schema()` hardcodes `TasksetConfig.schema_text()`. Post-PR harbor.py lines 53-68: `HarborTasksetConfig` extends `TasksetConfig` with many additional fields (dataset, task_names, docker_image, etc.). Post-PR harbor.py line 71+: `config_type` removed from `HarborTaskset`. git_diff MERGE_BASE..REVIEWED_COMMIT on verifiers/v1/taskset.py and verifiers/v1/packages/tasksets/harbor.py confirms coordinated removal.
| - /environments/**/tasks/**/*.py | ||
| pattern: vf.Env(..., harness=vf.Harness(config=config.harness), ...) | ||
|
|
||
| - id: verifiers-v1-no-env-var-reads-in-signals |
There was a problem hiding this comment.
🟢 Low .semgrep/verifiers.yml:346
The verifiers-v1-no-env-var-reads-in-signals rule only matches async def handlers, so synchronous def handlers that call os.getenv(...) or access os.environ[...] bypass the check. Add def $FUNC(...) patterns for @vf.reward, @vf.reward(...), @vf.update, and @vf.update(...) to catch sync handlers too.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file .semgrep/verifiers.yml around line 346:
The `verifiers-v1-no-env-var-reads-in-signals` rule only matches `async def` handlers, so synchronous `def` handlers that call `os.getenv(...)` or access `os.environ[...]` bypass the check. Add `def $FUNC(...)` patterns for `@vf.reward`, `@vf.reward(...)`, `@vf.update`, and `@vf.update(...)` to catch sync handlers too.
| self.config = ( | ||
| config | ||
| if isinstance(config, HarnessConfig) | ||
| else HarnessConfig.from_config(config) | ||
| ) | ||
| if max_turns is not None: | ||
| self.config.max_turns = max_turns |
There was a problem hiding this comment.
🟢 Low v1/harness.py:95
When a HarnessConfig instance is passed directly to the constructor, lines 100-101 mutate self.config.max_turns in-place. Since the same config object can be reused across multiple Harness instances, this causes later mutations to leak back to earlier harnesses. Previously, from_config was always called which created a new instance. Consider copying the config before mutation, or document if sharing state between harnesses is intentional.
- self.config = (
- config
- if isinstance(config, HarnessConfig)
- else HarnessConfig.from_config(config)
- )
+ self.config = HarnessConfig.from_config(config)
if max_turns is not None:
self.config.max_turns = max_turnsAlso found in 1 other location(s)
verifiers/v1/user.py:28
The change from
return list(cast(list[PromptMessage], completion))toreturn cast(list[PromptMessage], completion)removes the defensive copy of thecompletionlist. This introduces inconsistent mutation behavior: when bothpromptandcompletionare lists (lines 23-26), a new list is returned via spread syntax, but when onlycompletionis a list (line 28), the original list fromstateis now returned directly. If any caller modifies the returned list, it will mutate the originalstate['completion']in the second case but not the first, leading to subtle data corruption bugs that depend on whetherpromptwas also present.
🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file verifiers/v1/harness.py around lines 95-101:
When a `HarnessConfig` instance is passed directly to the constructor, lines 100-101 mutate `self.config.max_turns` in-place. Since the same config object can be reused across multiple `Harness` instances, this causes later mutations to leak back to earlier harnesses. Previously, `from_config` was always called which created a new instance. Consider copying the config before mutation, or document if sharing state between harnesses is intentional.
Evidence trail:
verifiers/v1/harness.py lines 95-101 (REVIEWED_COMMIT): direct assignment of config without copy, then mutation of max_turns. verifiers/v1/config.py line 67-68: `class Config(BaseModel)` with `ConfigDict(arbitrary_types_allowed=True, extra="forbid")` — no frozen=True. verifiers/v1/config.py lines 91-93: `from_config` creates a new instance via `cls(...)`. git_diff MERGE_BASE..REVIEWED_COMMIT verifiers/v1/harness.py: previous code was `self.config = type(self).config_type.from_config(config)` which always created a new instance.
Also found in 1 other location(s):
- verifiers/v1/user.py:28 -- The change from `return list(cast(list[PromptMessage], completion))` to `return cast(list[PromptMessage], completion)` removes the defensive copy of the `completion` list. This introduces inconsistent mutation behavior: when both `prompt` and `completion` are lists (lines 23-26), a new list is returned via spread syntax, but when only `completion` is a list (line 28), the original list from `state` is now returned directly. If any caller modifies the returned list, it will mutate the original `state['completion']` in the second case but not the first, leading to subtle data corruption bugs that depend on whether `prompt` was also present.
| - pattern: | | ||
| def load_environment(...): | ||
| $STMT | ||
| return $RET | ||
|
|
There was a problem hiding this comment.
🟢 Low .semgrep/verifiers.yml:293
The pattern in rule verifiers-v1-load-environment-no-local-config-wiring uses $STMT as a single-statement metavariable, so it only matches functions with exactly one statement before the return. Functions violating the rule with two or more statements before return produce false negatives and pass undetected. Consider using $...STMT (ellipsis metavariable) to match any number of statements.
- - pattern: |
- def load_environment(...):
- $STMT
- return $RET
+ - pattern: |
- def load_environment(...):
+ def load_environment(...):
- $STMT
+ $...STMT
return $RET🚀 Reply "fix it for me" or copy this AI Prompt for your agent:
In file .semgrep/verifiers.yml around lines 293-297:
The `pattern` in rule `verifiers-v1-load-environment-no-local-config-wiring` uses `$STMT` as a single-statement metavariable, so it only matches functions with exactly one statement before the `return`. Functions violating the rule with two or more statements before `return` produce false negatives and pass undetected. Consider using `$...STMT` (ellipsis metavariable) to match any number of statements.


Summary
EnvConfiguses typed child configs, and transcripts are typed aslist[vf.Message].objects=/bindings=usage.docs/environments.mdand projected the guidance into the repo and lab AGENTS files.Testing
pre-commit run --all-filespassed.ty check verifierspassed.Note
High Risk
High risk because it changes core v1
Taskset/Harness/EnvConfigconstruction and updates many environment loaders to a stricter contract, which can break downstream environment configs and judge/tool wiring.Overview
Enforces a stricter v1 environment “golden path” by adding Semgrep policy rules that ban loader/signal anti-patterns (handler factories,
objects=/bindings=in environment constructors, v0 parsers in v1 modules, env-var reads in rewards/updates) and requireload_environment(config: *EnvConfig) -> vf.Envto be a one-linereturn vf.Env(...).Refactors multiple example environments to comply: moves config-dependent behavior into typed
Taskset/Harness/Toolsetclasses, renames env-specific config fields to avoid overriding base defaults (e.g.max_turns→turn_limitin several harness configs), removes group-returning BFCL loader behavior, and standardizes judge rewards to usestate.get_endpoint_config(api="chat")with optional per-taskjudge_model.Tightens framework support and generated-doc workflows: updates v1
Taskset/Harnessto preserve concrete config instances withoutconfig_typeindirection, removes**kwargspassthrough patterns in packaged harnesses, exportsMessage/Messagesfromverifiers, and changesscripts/sync.py+ pre-commit to fail when generatedAGENTS.md/CLAUDE.mdprojections are out of date while also documenting the v1 env shape indocs/environments.md(propagated to lab/repo guides).Reviewed by Cursor Bugbot for commit 3f6b37f. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Enforce v1 env contract across harnesses, tasksets, and environments with semgrep rules
**kwargsinTaskset/Harnesssubclass__init__, no inline objects passed to constructors, no env var reads in reward handlers, and correctload_environmentsignature shape.**kwargs: Unpack[HarnessKwargs]with explicit keyword parameters across all harness subclasses (MiniSWEAgent,OpenCode,Pi,RLM,Terminus2) and removes theHarnessKwargsTypedDict.config_typeclass attribute fromHarness,Taskset, and several environment subclasses; config resolution is now hard-wired to the baseHarnessConfig/TasksetConfigviafrom_config.max_turnstoturn_limitin harness configs across multiple environments (hello_self_judge_v1,hello_group_reward_v1,hello_parallel_sandbox_v1,langchain_deep_agents_wikispeedia) and maps it tomax_turnsat construction time.TagExtractor, child harness, python session) from injected object bindings to module-level singletons orstate-based access, and introducesToolset/Harnesssubclasses to own that wiring.config_typeoverrides or**kwargsforwarding will now get base config types or raiseTypeError;state_transcriptnow returns the same list object rather than a copy, altering mutation semantics.Macroscope summarized 3f6b37f.