Yardstick config updates #14
Open
ericwindmill wants to merge 8 commits into main from yardstick-config-updates
- 4e75faf feat: Introduce a dedicated `yaml_config.md` for detailed configurati… (ericwindmill)
- 3cce708 updates in flight (ericwindmill)
- b9af6a2 rename func (ericwindmill)
- 28fba88 adds task level fields and updates parser (ericwindmill)
- fe24d91 feat: allow configurable sandbox and SDK channel mappings in dataset … (ericwindmill)
- 2eb1104 feat: Introduce tag-based filtering, refined task function references… (ericwindmill)
- 757acb7 feat: Add variant filtering and propagate image prefix and job task a… (ericwindmill)
- 9c52fd1 feat: Generalize SDK channel to 'branch', consolidate sandbox configu… (ericwindmill)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,105 @@ | ||
| # Changelog | ||
|
|
||
| ## Unreleased | ||
|
|
||
| ### New | ||
|
|
||
| - **`Job.description`.** Optional human-readable description field on Job. | ||
|
|
||
| - **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files. | ||
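As a rough illustration of the prefixing described in the `Job.imagePrefix` entry, the sketch below joins a registry prefix and an image name. The helper name `resolve_image`, the slash-joining rule, and the example registry URL are assumptions made for illustration; the actual resolver may behave differently.

```python
from typing import Optional


def resolve_image(image: str, image_prefix: Optional[str] = None) -> str:
    """Prepend a registry prefix to an image name (hypothetical helper)."""
    if not image_prefix:
        # No prefix configured: use the image name as-is (e.g. a local image).
        return image
    # Assumption: prefix and image name are joined with exactly one slash.
    return f"{image_prefix.rstrip('/')}/{image.lstrip('/')}"


if __name__ == "__main__":
    print(resolve_image("flutter-sandbox"))
    print(resolve_image("flutter-sandbox", "us-docker.pkg.dev/my-project/evals/"))
```

With this shape, the same job YAML could target local images or a remote registry by changing only the prefix.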
|
|
||
| - **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at three levels: | ||
| - `Job.taskFilters` / `Job.task_filters` — select tasks by metadata tags | ||
| - `Job.sampleFilters` / `Job.sample_filters` — select samples by metadata tags | ||
| - `variant_filters` on task YAML — restrict which variants apply to a task (supplements `allowed_variants`) | ||
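The changelog does not spell out the matching rules for `TagFilter`, so the following is only a minimal sketch of one plausible interpretation: an item matches when it carries every tag in `include_tags` and none of the tags in `exclude_tags`. The `matches` method and the use of a plain dataclass (rather than the package's own model class) are assumptions.

```python
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass
class TagFilter:
    """Illustrative stand-in for the TagFilter model; semantics are assumed."""

    include_tags: List[str] = field(default_factory=list)
    exclude_tags: List[str] = field(default_factory=list)

    def matches(self, tags: Iterable[str]) -> bool:
        tag_set = set(tags)
        # Assumption: any excluded tag rejects the item outright.
        if tag_set & set(self.exclude_tags):
            return False
        # Assumption: all included tags must be present (empty list matches all).
        return set(self.include_tags) <= tag_set


if __name__ == "__main__":
    task_filter = TagFilter(include_tags=["flutter"], exclude_tags=["slow"])
    print(task_filter.matches(["flutter", "ui"]))
    print(task_filter.matches(["flutter", "slow"]))
```

The same filter object could then be applied to task metadata, sample metadata, or variant names, which is consistent with the three levels listed above.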
|
|
||
| - **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks. | ||
|
Review comment (Collaborator, Author): This refers to |
||
|
|
||
| - **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. Previously only available as a job-level override via `JobTask`. | ||
|
|
||
| - **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration. | ||
|
|
||
| - **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks. | ||
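For readers unfamiliar with the `module.path:function_name` convention, here is a minimal sketch of how such a reference can be resolved with the standard library. The function name `load_task_function` is hypothetical, and the error handling is an assumption rather than the runner's documented behavior.

```python
import importlib
from typing import Callable


def load_task_function(ref: str) -> Callable:
    """Resolve a 'module.path:function_name' reference to a callable."""
    module_path, sep, func_name = ref.partition(":")
    if not sep or not module_path or not func_name:
        raise ValueError(f"expected 'module.path:function_name', got {ref!r}")
    module = importlib.import_module(module_path)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{ref!r} does not refer to a callable")
    return func


if __name__ == "__main__":
    # Uses a standard-library function purely as a stand-in for a task function.
    join = load_task_function("os.path:join")
    print(join("datasets", "task.yaml"))
```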
|
|
||
| ### Breaking Changes | ||
|
|
||
| - **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep. | ||
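If eval-set JSON written before the rename still needs to be read, a one-off key migration along these lines may help. This is a sketch only: `migrate_task_json` is a hypothetical helper, not part of either package, and it assumes the task is a flat JSON object carrying the old `"task_func"` key.

```python
from typing import Any, Dict


def migrate_task_json(task: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of a task dict with 'task_func' renamed to 'func'."""
    migrated = dict(task)
    # Only rename when the new key is absent, so already-migrated data is untouched.
    if "task_func" in migrated and "func" not in migrated:
        migrated["func"] = migrated.pop("task_func")
    return migrated


if __name__ == "__main__":
    old = {"id": "fix_counter", "task_func": "flutter_bug_fix"}
    print(migrate_task_json(old))
```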
|
|
||
| - **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations. | ||
|
|
||
| - **Workspace resolution uses native Inspect fields.** The `workspace` YAML key remains as parser-level sugar but resolves into Inspect AI's native `Sample.files` and `Sample.setup` fields. The `Sample.setup` command is no longer hardcoded to `cd /workspace && flutter pub get`; it is configurable or omitted for non-Flutter tasks. | ||
|
|
||
| ### Documentation | ||
|
|
||
| - Updated `docs/reference/yaml_config.md` with all new fields and updated descriptions. | ||
| - `docs/guides/config.md` update is pending and will follow the implementation. | ||
|
|
||
| ## 11 March, 2025 | ||
|
|
||
| ### New | ||
|
|
||
| - **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency required — can be installed standalone by any team that needs to parse eval config YAML. | ||
|
|
||
| ### Breaking Changes | ||
|
|
||
| - **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package. | ||
|
|
||
| - **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed from its original name for consistency with the Dart package. | ||
|
|
||
| ## 28 February, 2025 | ||
|
|
||
| ### New | ||
|
|
||
| - **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`. | ||
|
|
||
| - **Dual-mode eval runner.** The Python runner now supports two invocation modes: | ||
| - `run-evals --json ./eval_set.json` — consume a JSON manifest produced by the Dart CLI | ||
| - `run-evals --task <name> --model <model>` — run a single task directly from CLI arguments | ||
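The two invocation modes can be pictured with a small `argparse` sketch. Only the `--json`, `--task`, and `--model` flags come from the changelog; the function names, the validation rules, and the returned tuples are assumptions made for illustration and do not describe the runner's real entry point.

```python
import argparse
from typing import List, Tuple


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run-evals")
    parser.add_argument("--json", dest="json_path", help="path to an EvalSet JSON manifest")
    parser.add_argument("--task", help="task name for direct mode")
    parser.add_argument("--model", help="model name for direct mode")
    return parser


def parse_mode(argv: List[str]) -> Tuple[str, object]:
    """Decide which mode was requested (assumed rules, for illustration)."""
    parser = build_parser()
    args = parser.parse_args(argv)
    if args.json_path:
        if args.task or args.model:
            parser.error("--json cannot be combined with --task/--model")
        return ("json", args.json_path)
    if args.task and args.model:
        return ("direct", (args.task, args.model))
    # parser.error prints usage and exits with status 2.
    parser.error("provide either --json or both --task and --model")


if __name__ == "__main__":
    print(parse_mode(["--json", "./eval_set.json"]))
    print(parse_mode(["--task", "code_gen", "--model", "example-model"]))
```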
|
|
||
| - **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`. | ||
|
|
||
| - **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure. | ||
|
|
||
| ### Breaking Changes | ||
|
|
||
| - **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically). | ||
|
|
||
| - **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model). | ||
|
|
||
| - **Removed legacy Python config parsing.** The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package. | ||
|
|
||
| - **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models. | ||
|
|
||
| - **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`. | ||
|
|
||
| ## 25 February, 2025 | ||
|
|
||
| ### Breaking Changes | ||
|
|
||
| - **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`. | ||
|
|
||
| **Before (list format):** | ||
| ```yaml | ||
| variants: | ||
| - baseline | ||
| - { mcp_servers: [dart] } | ||
| ``` | ||
|
|
||
| **After (named map format):** | ||
| ```yaml | ||
| # job.yaml | ||
| variants: | ||
| baseline: {} | ||
| mcp_only: { mcp_servers: [dart] } | ||
| context_only: { context_files: [./context_files/flutter.md] } | ||
| full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] } | ||
| ``` | ||
|
|
||
| ```yaml | ||
| # task.yaml (optional — omit to accept all job variants) | ||
| allowed_variants: [baseline, mcp_only] | ||
| ``` | ||
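To make the interaction between job-level variants and task-level `allowed_variants` concrete, the sketch below filters a named variant map by an optional whitelist. The helper name `applicable_variants` is hypothetical, and raising on unknown variant names is an assumption; the real resolver may ignore or report them differently.

```python
from typing import Any, Dict, List, Optional


def applicable_variants(
    job_variants: Dict[str, Dict[str, Any]],
    allowed_variants: Optional[List[str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return the job variants that a task accepts."""
    if allowed_variants is None:
        # Omitting allowed_variants means the task accepts every job variant.
        return dict(job_variants)
    unknown = [name for name in allowed_variants if name not in job_variants]
    if unknown:
        # Assumption: referencing an undefined variant is treated as an error.
        raise ValueError(f"unknown variants: {unknown}")
    return {name: cfg for name, cfg in job_variants.items() if name in allowed_variants}


if __name__ == "__main__":
    job_variants = {
        "baseline": {},
        "mcp_only": {"mcp_servers": ["dart"]},
        "context_only": {"context_files": ["./context_files/flutter.md"]},
        "full": {"context_files": ["./context_files/flutter.md"], "mcp_servers": ["dart"]},
    }
    print(list(applicable_variants(job_variants, ["baseline", "mcp_only"])))
```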
|
|
||
| - **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants. | ||
|
|
||
| - **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists. | ||
Both `imagePrefix` and `image_prefix` are listed because the change is needed in both the Dart and Python packages.