Rework v1 harness and taskset config classes #1392
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 78e218e570
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Approvability verdict: Needs human review. Diff is too large for automated approval analysis. A human reviewer should evaluate this PR.
Reviewed commit: 6fbba3e059
Reviewed commit: 19d5ef6884
Reviewed commit: 00f39d0f0e
Reviewed commit: f8f8f1c3b8
Reviewed commit: ec0d15d9dc
Reviewed commit: 9684c24c48
Reviewed commit: d5cef6e1bc
Reviewed commit: 41ce81c5cd
Reviewed commit: 33be5ca1ac
Reviewed commit: ce2dc6162c
    extra_config_specs: list[str] | None = None
    install_python: bool = True
    system_prompt: PromptInput | None = None
    sandbox: SandboxConfig | None = SandboxConfig()
Preserve command harness sandbox defaults
When these command harness configs are constructed with defaults, CommandHarness.sandbox_value() only falls back to True when config.sandbox is None; this SandboxConfig() value is instead merged over DEFAULT_COMMAND_SANDBOX, so MiniSWEAgent/Pi/Terminus2 default runs now inherit the generic sandbox timeout_minutes=60 instead of the command harness default 120. Long-running agent tasks that previously had the packaged 2-hour sandbox budget can be terminated after 1 hour unless users explicitly override the sandbox.
i think this is fine, sandbox > harness
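The fallback difference raised in this thread can be illustrated with a small sketch. It is a simplification under stated assumptions: the dataclass, the dictionary, and the `sandbox_value` function below are stand-ins written for this example, not the verifiers implementation. Only the 60 and 120 minute values and the distinction between `None` and `SandboxConfig()` come from the review comment.

```python
from __future__ import annotations

from dataclasses import asdict, dataclass


@dataclass
class SandboxConfig:
    """Stand-in for the generic sandbox config (assumed shape)."""

    timeout_minutes: int = 60  # generic sandbox default named in the comment


# Stand-in for the packaged command harness default named in the comment.
DEFAULT_COMMAND_SANDBOX = {"timeout_minutes": 120}


def sandbox_value(sandbox: SandboxConfig | None) -> dict:
    """Resolve the effective sandbox settings for a command harness."""
    if sandbox is None:
        # Nothing configured: the command harness default applies unchanged.
        return dict(DEFAULT_COMMAND_SANDBOX)
    # A config object is merged over the command default, so every field it
    # carries wins, including fields the user never touched.
    return {**DEFAULT_COMMAND_SANDBOX, **asdict(sandbox)}


with_none_default = sandbox_value(None)
with_object_default = sandbox_value(SandboxConfig())
```

With a `None` default the command harness budget survives; with a `SandboxConfig()` default the generic value replaces it, which is the trade-off the reply accepts.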
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit eddd017. Configure here.
    def load_train_rows(num_train_examples: int):
        return load_rows("train", num_train_examples)
Source functions require parameters but have no defaults
Medium Severity
load_train_rows(num_train_examples: int) and load_eval_rows(num_eval_examples: int) have required parameters with no defaults. When used as _default_source / _default_eval_source, they're called via rows_from_source which injects matching config fields as kwargs. This works when the config has those fields. However, the parameter num_train_examples has no default, so if the config field name ever diverges or the source is called outside the config injection path, it will raise a TypeError. The old code used lambdas that closed over config values directly, making the coupling explicit. This applies identically to both dspy_rlm.py and openai_agents_env.py.
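The coupling described above can be sketched in a few lines. The `rows_from_source` function here is a simplified stand-in written for illustration (it is not the library's code), and the two config classes are hypothetical. It only shows why a required parameter fails once the config field name stops matching.

```python
import inspect
from dataclasses import asdict, dataclass


def rows_from_source(source, config):
    """Call `source` with the config fields whose names match its parameters."""
    fields = asdict(config)
    params = inspect.signature(source).parameters
    kwargs = {name: fields[name] for name in params if name in fields}
    return source(**kwargs)


def load_train_rows(num_train_examples: int):
    # Required parameter: there is no default to fall back on.
    return [{"index": i} for i in range(num_train_examples)]


@dataclass
class MatchingConfig:
    num_train_examples: int = 3  # same name as the source parameter


@dataclass
class RenamedConfig:
    train_examples: int = 3  # name has diverged from the source parameter


rows = rows_from_source(load_train_rows, MatchingConfig())

try:
    rows_from_source(load_train_rows, RenamedConfig())
    error = None
except TypeError as exc:
    # No kwarg was injected, so the required parameter is missing.
    error = exc
```

Giving the parameter a default, or validating the match up front, would turn the late `TypeError` into either a working call or an earlier, clearer failure.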


Summary
- declare explicit child config defaults (taskset: MyTasksetConfig = MyTasksetConfig(), harness: MyHarnessConfig = MyHarnessConfig())
- use config=None instead of constructing config objects in function signatures
- bind config types through generics (Taskset[MyTasksetConfig], Harness[MyHarnessConfig]) instead of a public config_type hook
- add class-level defaults (_default_rewards, _default_program, _default_toolsets, etc.) so users do not have to write source/reward/program paths in every config
- add_* mutation APIs now share one internal path
- add a CommandHarness base class, so OpenCode/MiniSWEAgent/Pi/Terminus2 share command-program and sandbox wiring
- add vf.Env.config(...), vf.Env.loader(...), and vf.Env.from_config(...) for the common typed env wiring paths, and use them in examples and simple v1 envs
- EnvConfig child config types at loader boundaries
Concrete env shape
Config classes expose the knobs users change; taskset/harness classes own the implementation.
vf.Env.config(...) builds the typed envelope from those classes, and vf.Env.loader(...) gives the package its load_environment entrypoint.
For hand-written loaders, keep config defaults fresh and route through the typed envelope:
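A minimal sketch of that loader shape follows. It uses plain dataclasses as stand-ins so that it runs without verifiers installed. MyTasksetConfig, MyHarnessConfig, MyEnvConfig, the Env class, and every field name are illustrative assumptions, and a real loader would route through vf.Env.from_config(...) rather than the stand-in constructor.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class MyTasksetConfig:
    num_train_examples: int = 100  # illustrative knob


@dataclass
class MyHarnessConfig:
    max_turns: int = 10  # illustrative knob


@dataclass
class MyEnvConfig:
    # default_factory builds fresh child configs for every envelope; this is
    # how the dataclass stand-in spells "explicit child defaults".
    taskset: MyTasksetConfig = field(default_factory=MyTasksetConfig)
    harness: MyHarnessConfig = field(default_factory=MyHarnessConfig)


@dataclass
class Env:
    """Stand-in for vf.Env; it only holds the resolved config."""

    config: MyEnvConfig


def load_environment(config: MyEnvConfig | None = None) -> Env:
    # No config object is constructed in the signature, so no instance is
    # shared between calls; a fresh envelope is built on demand.
    if config is None:
        config = MyEnvConfig()
    return Env(config=config)


first = load_environment()
second = load_environment()
first.config.taskset.num_train_examples = 5  # mutate one environment only
```

Because the default is created inside the function body, changing the first environment's config leaves the second one at its defaults.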
Building configs in Python
TOML override shape
Users only write the fields they want to override. The environment implementation remains in Python.
Hosted training/RL uses the same child sections:
Review follow-up
- v1=True wrapper branches through typed v1 env configs instead of old mirrored kwargs
- show/hide strings as single tool names
- sys.modules where import-ref config defaults need them during tests
- vf.Env.from_config(...)/vf.Env.loader(...) where construction is standard
- config=None with an updated v1 semgrep policy
- RuntimeOwnerMixin
- CommandHarness
Validation
- uv run --frozen ruff check --fix .
- uv run --frozen ruff format
- uv run --frozen pytest tests/test_v1_config_extension.py tests/test_v1_harbor_cli.py tests/test_v1_runtime_lifecycle.py tests/test_v1_taskset_bindings.py tests/test_v1_bfcl.py tests/test_mcp_search_env.py tests/test_langchain_deep_agents_wikispeedia.py tests/test_v1_rlm_swe.py (225 passed)
- uv run --frozen pre-commit run --all-files
Notes
- add_reward(...), class defaults such as _default_rewards, and methods such as rows() may still use live Python objects because they are not serialized config.
- pyproject.toml and uv.lock are intentionally untouched by the latest refactor pass.
Note
Medium Risk
Medium risk because it changes the public v1 environment loading/constructor surface (config defaults, loader signatures, and class-based defaults), which can break downstream environments that still pass kwargs or build config objects in function signatures.
Overview
Standardizes the v1 authoring/loading pattern around typed config envelopes: environments now define TasksetConfig/HarnessConfig/EnvConfig defaults explicitly, implement tasksets/harnesses as Taskset[Config] and Harness[Config], and construct envs via vf.Env.from_config(...) or vf.Env.loader(...) rather than bespoke load_taskset/load_harness wrappers.
Updates multiple bundled environments (e.g., alphabet_sort_v1, math_python_v1, wiki_search_v1, reverse_text_v1, BFCL, MCP search, nested harness, etc.) to move runtime defaults onto class attributes (_default_source, _default_rewards, _default_program, _default_toolsets, etc.), tighten config validation/derivation (including rejecting unknown v1 wrapper kwargs), and adjust reward/toolset wiring to be config-serializable.
Docs/examples/tests are rewritten to match the new API (including Env.config usage and explicit nested config defaults), Semgrep adds a new rule forbidding config-object construction in function signatures, and the project adds prime-pydantic-config[toml] as a dependency.
Reviewed by Cursor Bugbot for commit eddd017.
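To make "config-serializable" concrete, here is a hedged sketch of the idea: config values are plain data, and anything callable is named by an import reference instead of being passed as a live object. The check_serializable and resolve_ref helpers are hypothetical, and the 'module:attribute' reference format is an assumption made for this example. The PR's own validator may differ in name and behavior.

```python
import importlib
import os
from typing import Any


def check_serializable(value: Any, path: str = "config") -> None:
    """Raise TypeError if `value` contains a callable or path-like object."""
    if callable(value) or isinstance(value, os.PathLike):
        raise TypeError(f"{path} must be plain data, got {type(value).__name__}")
    if isinstance(value, dict):
        for key, item in value.items():
            check_serializable(item, f"{path}.{key}")
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            check_serializable(item, f"{path}[{index}]")


def resolve_ref(ref: str) -> Any:
    """Import the object named by a 'module:attribute' string (assumed format)."""
    module_name, _, attribute = ref.partition(":")
    return getattr(importlib.import_module(module_name), attribute)


# A string reference passes validation and can be resolved later.
check_serializable({"rewards": ["math:sqrt"]})
reward = resolve_ref("math:sqrt")

# A live function object is rejected.
try:
    check_serializable({"rewards": [len]})
    rejected = False
except TypeError:
    rejected = True
```

Keeping configs to strings, numbers, lists, and mappings is what lets the same values travel through TOML files and Python objects without special cases.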
Note
Require config objects for Taskset and Harness construction in v1
- Taskset and Harness constructors now accept only a single typed config object (TasksetConfig/HarnessConfig); all previous per-argument overrides are removed.
- CallableConfig and SignalConfig for declaring callables and scoring entries declaratively in config, replacing direct callable/object injection.
- TasksetConfig/HarnessConfig/EnvConfig subclasses with defaults, and their loader functions now accept typed config objects.
- Harnesses (OpenCode, MiniSWEAgent, Pi, RLM, Terminus2) are refactored to use _configure_runtime for post-construction runtime setup instead of constructor kwargs.
- validate_serializable_value; callables and PathLike objects are rejected at construction time and must be provided as import-ref strings.
- Taskset, Harness, or harness subclasses with keyword arguments will break; configs containing callables or Path objects will raise TypeError at validation time.
Changes since #1392 opened
- MiniSWEAgent, OpenCode, Pi, RLM, and Terminus2 harnesses to defer program override handling [00be9b2]
- base_harness_config utility function to verifiers.v1.packages.harnesses.command module [00be9b2]
- prepare_typed_env_config utility to use from_config instead of model_validate for config coercion [00be9b2]
- Taskset and Harness base classes to generic classes [2c0691b]
- Harness.__init_subclass__ and Taskset.__init_subclass__ to extract generic type arguments from __orig_bases__, inherit _config_cls from base classes with defaults to HarnessConfig and TasksetConfig respectively, and validate that the resolved config class is a subclass of the appropriate base config type, raising TypeError if validation fails [25857e6]
- Harness.__init__ and Taskset.__init__ constructors to always coerce config through type(self)._config_cls.from_config(config), removing the previous heuristic that would switch to type(config) when _config_cls was the base config class [25857e6]
- make_taskset and make_harness test helpers to normalize input config through TasksetConfig.from_config and HarnessConfig.from_config respectively, build data dictionaries from base_config.model_dump(exclude_none=True) overlaid with runtime values, and instantiate using type(base_config).from_config(config_value(data)) instead of dynamically selecting config classes and calling model_validate [25857e6]
- config_data utility helper from test infrastructure [25857e6]
- verifiers.v1.harness.Harness.__init__ method [0d70fe0]
- toolset.Toolset.__init__ constructor to resolve config objects into actual instances [58105f6]
- BaseConfig class with Pydantic extra field validation [58105f6]
- EnvTasksetConfig field to EnvTaskset class attribute [58105f6]
- pydantic-config dependency specification [58105f6]
- lcs_reward_func to use module-level constant [58105f6]
- vf.Harness to ReplayHarness [fa30096]
- WikiSearchTasksetConfig initialization to use direct field assignment instead of explicit toolsets mapping, and added conditional logic to wiki_search_v1.load_taskset to only register default 'wiki' toolset when toolsets field is not present in config [fdab488]
- BaseConfig class implementation with BaseConfig imported from pydantic_config package in verifiers v1 config module [fdab488]
- pydantic-config dependency from version-constrained package to Git repository source [fdab488]
- Harness.__init__ and Taskset.__init__ constructors to honor non-None default user values specified on config models [c418244]
- ConfigBound[ConfigT] base class and refactored Taskset and Harness to inherit from it [a0210fd]
- Env.from_config classmethod for standardized environment construction from config and class references [a0210fd]
- load_taskset and load_harness functions from all environment modules and reimplemented load_environment to delegate to Env.from_config [a0210fd]
- Env.from_config construction pattern [a0210fd]
- Env.from_config construction behavior [a0210fd]
- rows method [f157d46]
- Env.from_config behavior [f157d46]
- Env.config and Env.loader to the Env class in verifiers.v1.env module [dba9f3f]
- Env.from_config classmethod to accept mapping-based configs and widened taskset and harness type signatures [dba9f3f]
- vf.Env.loader factory [dba9f3f]
- EnvConfig classes using vf.Env.config factory in environment modules [dba9f3f]
- Taskset and Harness subclasses across multiple environment modules [dba9f3f]
- config_model_mapping and omit_none from verifiers.v1 modules [dba9f3f]
- Harness.__init__ and Taskset.__init__ to prioritize config class defaults over class-level defaults [3d82213]
- wiki_search_v1 from task-level to factory-level by introducing judge_reward_factory and updating load_taskset [3d82213]
- test_wiki_search_v1_default_and_explicit_toolsets to verify default reward presence and task row structure [3d82213]
- model_dump calls in environments.bfcl_v3.load_environment to pass exclude_none=True instead of exclude_unset=True when converting base_taskset_config and base_harness_config to dictionaries for validation into BFCLTasksetConfig and BFCLHarnessConfig [ec0d15d]
- verifiers.v1.harness.Harness.__init__ to accept an optional config parameter instead of instantiating a default HarnessConfig at function definition time [ec0d15d]
- RuntimeOwnerMixin class in verifiers.v1.utils.runtime_owner_utils and refactored Taskset and Harness initialization to use mixin-based configuration [11dfe5d]
- Optional[Config] | None instead of constructing config objects as default values, and updated Env.from_config and Env.loader to accept env_config type parameter [11dfe5d]
- add_metric, add_reward, add_advantage, add_toolset, add_stop, add_setup, add_update, and add_cleanup methods from Taskset and Harness classes [11dfe5d]
- explicit_config_data and resolved_config_data functions to verifiers.v1.utils.config_utils and updated config data extraction logic [11dfe5d]
- CommandHarness base class in verifiers.v1.packages.harnesses.command and refactored specific command-based harnesses to extend it [11dfe5d]
- _configure_from_config hooks in environment-specific tasksets and harnesses to add default toolsets and rewards when not explicitly provided in config [11dfe5d]
- HarborTaskset to read runtime values directly from self.config instead of mirrored instance attributes [11dfe5d]
- Optional config parameters defaulting to None and explicit env_config parameter in vf.Env.from_config [11dfe5d]
- verifiers-v1-loaders-require-config Semgrep rule with verifiers-v1-no-config-object-defaults rule in .semgrep/verifiers.yml [11dfe5d]
- CallableConfigEntry type alias and updated references to use CallableEntry directly [11dfe5d]
- tests/test_v1_harbor_cli.py to reference renamed default constants from verifiers.v1.packages.harnesses.configs [11dfe5d]
- load_taskset loader functions across environments and updated callers to directly instantiate taskset classes [4e0caee]
- Harness and Taskset classes to RuntimeOwnerMixin [4e0caee]
- CommandHarness to remove hook methods and change runtime configuration [4e0caee]
- config_data and model_config_data wrapper functions with their explicit counterparts [4e0caee]
- RLM harness initialization and build script to use direct constant references [4e0caee]
- mcp_search_env [4e0caee]
- CallableConfigEntry type alias and its dependency [4e0caee]
- verifiers.v1.env.Env.config classmethod to validate that at least one configuration type can be inferred or provided [b38f1f2]
- _config_cls attributes [b38f1f2]
- task_names property to HarborTaskset class [9684c24]
- cpu_cores property to HarborTaskset class [9684c24]
- Env.config classmethod to conditionally require taskset and harness fields based on whether their respective config classes declare required fields [d5cef6e]
- test_env_config_allows_required_child_configs to verify conditional requirement behavior for nested config fields [d5cef6e]
- bfcl_v3.load_environment to use explicit_config_data() instead of model_dump(exclude_none=True) [2580825]
- rewards field is not in model_fields_set and validates resolved reward name [2580825]
- pydantic-config to prime-pydantic-config package dependency [b61b2cf]
- vf.Harness construction to use vf.HarnessConfig passed via a 'config' parameter instead of passing program and sandbox parameters directly [41ce81c]
- TasksetConfig examples to declare objects and bindings for answer extractor [41ce81c]
- Toolset with object factory and binding for a child Harness [41ce81c]
- vf.Taskset and vf.Harness [33be5ca]
- system_prompt in MathPythonTasksetConfig from harness.pip_install_packages [33be5ca]
- MathPython v1 environment system_prompt derivation behavior [33be5ca]
- write parameter in Toolset.__init__ must be a boolean value, raising a TypeError with the message 'Toolset write must be a boolean.' when a non-boolean value is provided [4a0b0eb]
- verifiers.v1.toolset.toolset_from_mapping [4d82131]
- LifecycleConfig class and RuntimeOwnerMixin mixin [ce2dc61]
- vf.Taskset and vf.Harness classes with generic parameterized classes vf.Taskset[Config] and vf.Harness[Config] and introduced typed config subclasses pattern requiring TasksetConfig, HarnessConfig, and MyEnvConfig subclasses bound via class definitions like class MyTaskset(vf.Taskset[MyTasksetConfig]) [ffdb48f]
- load_taskset and load_harness loader functions with load_environment(config: MyEnvConfig | None = None) -> vf.Env entrypoint pattern using vf.Env.from_config(...) or vf.Env.loader(...) for environment construction [ffdb48f]
- TasksetConfig.objects and TasksetConfig.bindings pattern for shared extractor and factory import references in config classes [ffdb48f]
- config.taskset and config.harness [ffdb48f]
- MyEnvConfig parameter in the example function from a default instance to an optional parameter, and modified the vf.Env.from_config call to accept env_config=MyEnvConfig as an explicit argument alongside the existing taskset=MyTaskset argument [eddd017]
Macroscope summarized a1c64f8.
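The generic binding described in the [25857e6] entry (reading the type argument from __orig_bases__, inheriting _config_cls, and raising TypeError for a wrong config type) can be sketched with the standard typing module alone. The classes below are stand-ins written for this example, not the verifiers source, and MyTasksetConfig and the subclass names are hypothetical.

```python
from typing import Generic, TypeVar, get_args, get_origin


class TasksetConfig:
    """Stand-in for the base taskset config type."""


class MyTasksetConfig(TasksetConfig):
    """Hypothetical environment-specific config."""


ConfigT = TypeVar("ConfigT", bound=TasksetConfig)


class Taskset(Generic[ConfigT]):
    _config_cls: type = TasksetConfig  # default when no type argument is bound

    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Inspect only the bases written on this class; a subclass without a
        # type argument keeps the _config_cls it inherits.
        for base in cls.__dict__.get("__orig_bases__", ()):
            origin = get_origin(base)
            if not (isinstance(origin, type) and issubclass(origin, Taskset)):
                continue
            (config_cls,) = get_args(base)
            if isinstance(config_cls, TypeVar):
                continue  # still generic; nothing to bind yet
            if not (isinstance(config_cls, type) and issubclass(config_cls, TasksetConfig)):
                raise TypeError(f"{cls.__name__}: config type must subclass TasksetConfig")
            cls._config_cls = config_cls


class MyTaskset(Taskset[MyTasksetConfig]):
    pass


class ChildTaskset(MyTaskset):  # no type argument: inherits MyTasksetConfig
    pass


class PlainTaskset(Taskset):  # unparameterized: falls back to TasksetConfig
    pass


try:
    class BadTaskset(Taskset[int]):  # int is not a TasksetConfig subclass
        pass
    bad_error = None
except TypeError as exc:
    bad_error = exc
```

Binding through the class definition means the config type is checked once, when the class is created, instead of on every construction.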