105 changes: 105 additions & 0 deletions CHANGELOG.md
@@ -0,0 +1,105 @@
# Changelog

## Unreleased

### New

- **`Job.description`.** Optional human-readable description field on Job.

- **`Job.imagePrefix` / `Job.image_prefix`.** Registry URL prefix prepended to image names during sandbox resolution. Enables switching between local images and remote registries (e.g. Artifact Registry on GKE) without duplicating job YAML files.

Review comment (collaborator, PR author): The field is listed as both `imagePrefix` and `image_prefix` because the change is needed in both the Dart and Python packages.


- **Tag-based filtering.** New `TagFilter` model with `include_tags` and `exclude_tags`, used at three levels:
  - `Job.taskFilters` / `Job.task_filters`: select tasks by metadata tags
  - `Job.sampleFilters` / `Job.sample_filters`: select samples by metadata tags
  - `variant_filters` on task YAML: restrict which variants apply to a task (supplements `allowed_variants`)

- **`JobTask.args`.** Per-task argument overrides. Allows a job to pass task-specific arguments (e.g. `base_url`, `dataset_path`) to individual tasks.

Review comment (collaborator, PR author): This refers to `tasks.<task>.args` in `job.yaml`; `JobTask` is the underlying model.


- **`Task.systemMessage` / `Task.system_message`.** System prompt override at the task level. Previously only available as a job-level override via `JobTask`.

- **`Task.sandboxParameters` / `Task.sandbox_parameters`.** Pass-through dictionary for sandbox plugin configuration.

- **`module:task` syntax.** Task function references can now use `module.path:function_name` format for Python tasks.
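
As a rough illustration of the `module.path:function_name` format described above, the following sketch splits a reference on the colon and imports the module with `importlib`. The helper name `resolve_task_func` is hypothetical and the real resolver may behave differently; a standard-library function stands in for a real task module.

```python
import importlib
from typing import Callable


def resolve_task_func(ref: str) -> Callable:
    """Resolve a 'module.path:function_name' reference to a callable.

    Hypothetical helper for illustration only; not the package's actual API.
    """
    module_path, sep, func_name = ref.partition(":")
    if not sep or not module_path or not func_name:
        raise ValueError(f"expected 'module.path:function_name', got {ref!r}")
    module = importlib.import_module(module_path)
    func = getattr(module, func_name)
    if not callable(func):
        raise TypeError(f"{ref!r} does not refer to a callable")
    return func


# Standard-library stand-in for a real task module.
join = resolve_task_func("os.path:join")
```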

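The tag-based filtering entry in the list above names the `TagFilter` fields but not their matching rules. The sketch below shows one plausible reading, under stated assumptions: an item passes if it carries at least one of `include_tags` (when that list is non-empty) and none of `exclude_tags`. The `matches` method, the dataclass form, and the sample tags are illustrative and not taken from the packages.

```python
from dataclasses import dataclass, field


@dataclass
class TagFilter:
    # Field names come from the changelog; the matching semantics are assumed.
    include_tags: list[str] = field(default_factory=list)
    exclude_tags: list[str] = field(default_factory=list)

    def matches(self, tags: set[str]) -> bool:
        # Assumption: an empty include list means "no include constraint".
        if self.include_tags and not tags & set(self.include_tags):
            return False
        # Assumption: carrying any excluded tag rejects the item.
        return not tags & set(self.exclude_tags)


# Hypothetical task names mapped to hypothetical metadata tags.
tasks = {
    "code_gen": {"flutter", "generation"},
    "bug_fix": {"flutter", "slow"},
    "mcp_tool": {"mcp"},
}
task_filter = TagFilter(include_tags=["flutter"], exclude_tags=["slow"])
selected = [name for name, tags in tasks.items() if task_filter.matches(tags)]
```

Under these assumptions, `selected` contains only `code_gen`: `bug_fix` is excluded by the `slow` tag and `mcp_tool` lacks the `flutter` tag.
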
### Breaking Changes

- **`Task.taskFunc` → `Task.func`.** Renamed model field to match the YAML key name. JSON serialization key changes from `"task_func"` to `"func"`. Both Dart and Python packages must update in lockstep.

- **Sandbox registry is now configurable.** The hardcoded `kSandboxRegistry` and `kSdkChannels` maps are extracted from `eval_set_resolver.dart` and made data-driven, allowing non-Flutter projects to define their own sandbox configurations.

- **Workspace resolution uses native Inspect fields.** The `workspace` YAML key remains as parser-level sugar but resolves into Inspect AI's native `Sample.files` and `Sample.setup` fields. The `Sample.setup` command is no longer hardcoded to `cd /workspace && flutter pub get`; it is configurable or omitted for non-Flutter tasks.
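
For previously written JSON, the `"task_func"` to `"func"` key change noted above could be handled with a small migration along these lines. This is a sketch only: the helper name `migrate_task_json` and the `"name"` key in the sample payload are hypothetical, and the changelog does not say the packages ship such a helper.

```python
import json


def migrate_task_json(task: dict) -> dict:
    """Return a copy of a serialized task with 'task_func' renamed to 'func'."""
    migrated = dict(task)
    if "task_func" in migrated:
        # Assumption: if both keys are present, the new "func" value wins.
        old_value = migrated.pop("task_func")
        migrated.setdefault("func", old_value)
    return migrated


# Hypothetical serialized task using the old key.
old = json.loads('{"name": "demo_task", "task_func": "flutter_code_gen"}')
new = migrate_task_json(old)
```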

### Documentation

- Updated `docs/reference/yaml_config.md` with all new fields and updated descriptions.
- `docs/guides/config.md` update is pending and will follow the implementation.

## 11 March, 2025

### New

- **`dataset_config_python` package.** Python port of the Dart config package (`dataset_config_dart`), providing full parity for YAML parsing, resolution, and JSON output. Includes Pydantic models for `Job`, `Task`, `Sample`, `EvalSet`, `Variant`, `Dataset`, and `ContextFile`. Exposes `resolve()` and `write_eval_sets()` as the public API. No Dart SDK or Inspect AI dependency required — can be installed standalone by any team that needs to parse eval config YAML.

### Breaking Changes

- **Renamed `dataset_config` → `dataset_config_dart`.** The Dart config package was renamed for clarity alongside the new Python package.

- **Renamed `dash_evals_config` → `dataset_config_python`.** The Python config package was renamed from its original name for consistency with the Dart package.

## 28 February, 2025

### New

- **`eval_config` Dart package.** New package with a layered Parser → Resolver → Writer architecture that converts dataset YAML into EvalSet JSON for the Python runner. Provides `ConfigResolver` facade plus direct access to `YamlParser`, `JsonParser`, `EvalSetResolver`, and `EvalSetWriter`.

- **Dual-mode eval runner.** The Python runner now supports two invocation modes:
  - `run-evals --json ./eval_set.json`: consume a JSON manifest produced by the Dart CLI
  - `run-evals --task <name> --model <model>`: run a single task directly from CLI arguments

- **Generalized task functions.** Task implementations are now language-agnostic by default. Flutter-specific tasks (`flutter_bug_fix`, `flutter_code_gen`) are thin wrappers around the generic `bug_fix` and `code_gen` tasks. New tasks: `analyze_codebase`, `mcp_tool`, `skill_test`.

- **New Dart domain models.** `EvalSet`, `Task`, `Sample`, `Variant`, and `TaskInfo` models in the `models` package map directly to the Inspect AI evaluation structure.
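
The two invocation modes of the dual-mode runner listed above could be distinguished roughly as follows. This is a sketch using `argparse`, not the runner's actual code: the function names, the mode labels `"manifest"` and `"direct"`, and the rules that `--json` excludes the other flags and that direct mode needs both `--task` and `--model` are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="run-evals")
    parser.add_argument("--json", dest="json_path", help="path to an EvalSet JSON manifest")
    parser.add_argument("--task", help="name of a single task to run")
    parser.add_argument("--model", help="model to run the task against")
    return parser


def parse_mode(argv: list[str]) -> tuple[str, argparse.Namespace]:
    """Return ("manifest" | "direct", parsed args); exit with an error otherwise."""
    parser = build_parser()
    args = parser.parse_args(argv)
    # Assumption: the two modes are mutually exclusive.
    if args.json_path and (args.task or args.model):
        parser.error("--json cannot be combined with --task/--model")
    if args.json_path:
        return "manifest", args
    # Assumption: direct mode requires both flags.
    if args.task and args.model:
        return "direct", args
    parser.error("pass either --json PATH or both --task NAME and --model MODEL")


mode, args = parse_mode(["--json", "./eval_set.json"])
```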

### Breaking Changes

- **Removed Python `registries.py`.** Task/model/sandbox registries are removed. Task functions are now discovered dynamically via `importlib` (short names like `"flutter_code_gen"` resolve automatically).

- **Removed `TaskConfig` and `SampleConfig`.** Replaced by `ParsedTask` (intermediate parsing type in `eval_config`) and `Sample` (Inspect AI domain model).

- **Removed legacy Python config parsing.** The `config/parsers/` directory, `load_yaml` utility, and associated model definitions have been removed from `eval_runner`. Configuration is now handled by the Dart `eval_config` package.

- **Models package reorganized.** Report-app models (used by the Flutter results viewer) moved to `models/lib/src/report_app/`. The top-level `models/lib/src/` now contains inspect-domain models.

- **Dataset utilities moved.** `DatasetReader`, `filesystem_utils`, and discovery helpers moved from `eval_config` to `eval_cli`.

## 25 February, 2025

### Breaking Changes

- **Variant format changed from list to named map.** Job YAML files now define variants as a named map instead of a list. Tasks can optionally restrict applicable variants via `allowed_variants` in their `task.yaml`.

**Before (list format):**
```yaml
variants:
  - baseline
  - { mcp_servers: [dart] }
```

**After (named map format):**
```yaml
# job.yaml
variants:
  baseline: {}
  mcp_only: { mcp_servers: [dart] }
  context_only: { context_files: [./context_files/flutter.md] }
  full: { context_files: [./context_files/flutter.md], mcp_servers: [dart] }
```

```yaml
# task.yaml (optional — omit to accept all job variants)
allowed_variants: [baseline, mcp_only]
```

- **Removed `DEFAULT_VARIANTS` registry.** Variants are no longer defined globally in `registries.py`. Each job file defines its own variants.

- **Removed `variants` from `JobTask`.** Per-task variant overrides (`job.tasks.<id>.variants`) are replaced by task-level `allowed_variants` whitelists.
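
The interaction between the job-level named map and a task's `allowed_variants` described above can be sketched as a simple filter. The function name `applicable_variants` is hypothetical, and the choice to silently ignore allowed names that the job does not define is an assumption; the real resolver may report them instead.

```python
from typing import Optional


def applicable_variants(
    job_variants: dict[str, dict],
    allowed_variants: Optional[list[str]] = None,
) -> dict[str, dict]:
    """Return the job variants that apply to one task."""
    if allowed_variants is None:
        # Omitting allowed_variants accepts every job variant.
        return dict(job_variants)
    # Assumption: allowed names missing from the job are ignored, not errors.
    allowed = set(allowed_variants)
    return {name: cfg for name, cfg in job_variants.items() if name in allowed}


# Mirrors the job.yaml and task.yaml examples from this entry.
job_variants = {
    "baseline": {},
    "mcp_only": {"mcp_servers": ["dart"]},
    "context_only": {"context_files": ["./context_files/flutter.md"]},
    "full": {"context_files": ["./context_files/flutter.md"], "mcp_servers": ["dart"]},
}
selected = applicable_variants(job_variants, ["baseline", "mcp_only"])
```

Here `selected` keeps only the `baseline` and `mcp_only` entries, in the order the job defines them.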