Merged
5 changes: 5 additions & 0 deletions .firebaserc
@@ -0,0 +1,5 @@
{
  "projects": {
    "default": "dash-evals"
  }
}
10 changes: 10 additions & 0 deletions .gemini/config.yaml
@@ -0,0 +1,10 @@
# Minimize verbosity.
have_fun: false
code_review:
  # For now, use the default of MEDIUM for testing. Based on desired verbosity,
  # we can change this to LOW or HIGH in the future.
  comment_severity_threshold: MEDIUM
  pull_request_opened:
    summary: true
    include_drafts: false
ignore_patterns:
90 changes: 90 additions & 0 deletions .gemini/styleguide.md
@@ -0,0 +1,90 @@
# dash_evals Style Guide

This style guide outlines the coding conventions and contribution requirements for the dash_evals repository.

---

## Documentation Requirements

All changes that affect user-facing behavior, configuration, or APIs **must** be documented in the `docs/` directory:

- **New features**: Add documentation explaining the feature and how to use it
- **CLI changes**: Update `docs/dataset_yaml_schema.md` (CLI Usage section)
- **Configuration changes**: Update `docs/dataset_yaml_schema.md`
- **Workflow changes**: Update `docs/contributing_guide.md`
- **Architecture changes**: Update `docs/repository_structure.md`

When reviewing PRs, check that:
1. Any new CLI flags or options are documented
2. New configuration fields are documented with type, description, and examples
3. User-facing error messages are clear and actionable

---

## Python Style Guide

This project follows the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html).

### Key Points

- **Formatting**: Use `ruff format` for automatic formatting
- **Linting**: Use `ruff check` and `pylint`
- **Line length**: 100 characters maximum
- **Docstrings**: Use Google-style docstrings with Args, Returns, and Raises sections
- **Type hints**: Required for all public functions and methods
- **Imports**: Use absolute imports, grouped by standard library / third-party / local

### Docstring Example

```python
def parse(dataset_path: Path, jobs: list[str] | None = None) -> list[EvalSetConfig]:
    """Parse dataset directory into resolved EvalSetConfig(s).

    Args:
        dataset_path: Path to dataset directory containing dataset.yaml.
        jobs: Optional list of job names or paths. Uses default_job if not specified.

    Returns:
        List of EvalSetConfig objects ready to pass to inspect_ai.eval_set().

    Raises:
        FileNotFoundError: If dataset or job file not found.
    """
```
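
As a further illustration, the sketch below applies the key points above to a complete function: grouped absolute imports, type hints on the public signature, lines under 100 characters, and a Google-style docstring whose Raises section matches the code. `find_sample_files` is a hypothetical helper invented for this example and is not part of dash_evals.

```python
"""Hypothetical module illustrating the conventions above; not part of dash_evals."""

# Standard library imports come first; third-party and local groups would follow.
from pathlib import Path


def find_sample_files(dataset_path: Path, suffix: str = ".yaml") -> list[Path]:
    """Collect sample files under a dataset directory.

    Args:
        dataset_path: Directory to search recursively.
        suffix: File extension to match, including the leading dot.

    Returns:
        Sorted list of matching file paths.

    Raises:
        FileNotFoundError: If dataset_path is not an existing directory.
    """
    if not dataset_path.is_dir():
        raise FileNotFoundError(f"Dataset directory not found: {dataset_path}")
    return sorted(p for p in dataset_path.rglob(f"*{suffix}") if p.is_file())
```

Note that the error message names the offending path, which keeps it in line with the review check that user-facing errors be clear and actionable.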

---

## Dart Style Guide

This project follows the [Effective Dart Style Guide](https://dart.dev/effective-dart/style).

Code should follow the relevant style guide and use the correct
auto-formatter for each language, as described in
[the repository contributing guide's Style section](https://github.com/flutter/packages/blob/main/CONTRIBUTING.md#style).

### Best Practices

- Code should follow the guidance and principles described in
[the flutter/packages contribution guide](https://github.com/flutter/flutter/blob/master/docs/ecosystem/contributing/README.md).
- Code should be tested. Changes to plugin packages, which include code written
in C, C++, Java, Kotlin, Objective-C, or Swift, should have appropriate tests
as described in [the plugin test guidance](https://github.com/flutter/flutter/blob/master/docs/ecosystem/testing/Plugin-Tests.md).
- PR descriptions should include the Pre-Review Checklist from
[the PR template](https://github.com/flutter/packages/blob/main/.github/PULL_REQUEST_TEMPLATE.md),
with all of the steps completed.

### Review Agent Guidelines

When providing a summary, the review agent must adhere to the following principles:
- **Be Objective:** Focus on a neutral, descriptive summary of the changes. Avoid subjective value judgments
like "good," "bad," "positive," or "negative." The goal is to report what the code does, not to evaluate it.
- **Use Code as the Source of Truth:** Base all summaries on the code diff. Do not trust or rephrase the PR
description, which may be outdated or inaccurate. A summary must reflect the actual changes in the code.
- **Be Concise:** Generate summaries that are brief and to the point. Focus on the most significant changes,
and avoid unnecessary details or verbose explanations. This ensures the feedback is easy to scan and understand.

### YAML Configuration Files

- Use 2-space indentation
- Include comments explaining non-obvious fields
- Use explicit `path:` references for file paths (e.g., `- path: samples/file.yaml`)
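
The indentation rule above can be checked mechanically. The following is a minimal sketch, assuming a purely line-based check is acceptable: `indentation_violations` is a hypothetical helper (not an existing dash_evals or devals function), and it does not parse YAML, so it would also flag content inside block scalars that legitimately uses other indentation.

```python
def indentation_violations(yaml_text: str, width: int = 2) -> list[int]:
    """Return 1-based line numbers whose indentation breaks the 2-space rule.

    Args:
        yaml_text: Raw contents of a YAML file.
        width: Expected indentation step in spaces.

    Returns:
        Line numbers whose leading whitespace contains a tab or is not a
        multiple of width. Blank lines are skipped.
    """
    violations: list[int] = []
    for number, line in enumerate(yaml_text.splitlines(), start=1):
        if not line.strip():
            continue  # Blank lines carry no indentation information.
        leading = line[: len(line) - len(line.lstrip())]
        if "\t" in leading or len(leading) % width != 0:
            violations.append(number)
    return violations


# A fragment that follows all three conventions: 2-space indentation,
# an explanatory comment, and an explicit `path:` reference.
GOOD = """\
# Samples are listed as explicit path references.
samples:
  - path: samples/file.yaml
"""

# The same fragment with a 3-space indent on the second line.
BAD = "samples:\n   - path: samples/file.yaml\n"

print(indentation_violations(GOOD))
print(indentation_violations(BAD))
```

The first call reports no violations and the second reports line 2, where the list item is indented by three spaces.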
4 changes: 2 additions & 2 deletions .github/workflows/docs.yml
@@ -70,6 +70,6 @@ jobs:
with:
repoToken: ${{ secrets.GITHUB_TOKEN }}
firebaseServiceAccount: ${{ secrets.FIREBASE_SERVICE_ACCOUNT }}
-projectId: evals
-target: evals-docs
+projectId: dash-evals
+target: dash-evals-docs
channelId: live
1 change: 1 addition & 0 deletions .gitignore
@@ -233,3 +233,4 @@ app.*.map.json
/android/app/debug
/android/app/profile
/android/app/release
+.firebase/
4 changes: 2 additions & 2 deletions docs/contributing/repository_structure.md
@@ -21,7 +21,7 @@ evals/

## dataset/

-Contains all evaluation data, configurations, and resources. See the [Configuration Overview](./config/about.md) for detailed file formats.
+Contains all evaluation data, configurations, and resources. See the [Configuration Overview](../reference/configuration_reference.md) for detailed file formats.

| Path | Description |
|------|-------------|
@@ -81,7 +81,7 @@ dash_evals/

### devals_cli/ (devals)

-Dart CLI for creating and managing evaluation tasks and jobs. See the [CLI documentation](./cli.md) for full command reference.
+Dart CLI for creating and managing evaluation tasks and jobs. See the [CLI documentation](../reference/cli.md) for full command reference.

```
devals_cli/
8 changes: 2 additions & 6 deletions docs/guides/quick_start.md
@@ -14,16 +14,12 @@ You'll also need an API key for at least one model provider (`GOOGLE_API_KEY`, `
## 1. Install the packages

```bash
git clone https://github.com/flutter/evals.git
pip install -e <path-to-evals>/packages/dash_evals
dart pub global activate devals --source path <path-to-evals>/packages/devals_cli


## TODO: Integrate in the new repo. This is wrong for this repo
git clone https://github.com/flutter/evals.git && cd evals
python3 -m venv .venv
source .venv/bin/activate
pip install -e "packages/dash_evals[dev]"
pip install -e "packages/dataset_config_python[dev]"
dart pub global activate devals --source path packages/devals_cli
```

This installs two things:
8 changes: 4 additions & 4 deletions docs/guides/tutorial.md
@@ -169,7 +169,7 @@ samples:
| `tests.path` | Path to test files the scorer runs against the generated code. |

> [!NOTE]
-> See [Tasks](config/tasks.md) and [Samples](config/samples.md) for the
+> See [Tasks](../reference/configuration_reference.md#task-files) and [Samples](../reference/configuration_reference.md#sample-files) for the
> complete field reference.

---
@@ -215,7 +215,7 @@ That's the minimal job — it will:
> with_context:
> context_files: [./context_files/dart_docs.md]
> ```
-> See [Configuration Overview](config/about.md#variants) for details.
+> See [Configuration Overview](../reference/configuration_reference.md#variants) for details.

---

@@ -281,7 +281,7 @@ devals view path/to/logs
Now that you've run your first custom evaluation, here are some things to try:

- **Add more samples** to your task: `devals create sample`
-- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../packages/dash_evals.md).
+- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../contributing/packages/dash_evals.md).
- **Add variants** to test how context files or MCP tools affect performance. See [Variants](config/about.md#variants).
- **Run multiple models** by adding more entries to the `models` list in your job file
-- **Read the config reference** for [Jobs](config/jobs.md), [Tasks](config/tasks.md), and [Samples](config/samples.md)
+- **Read the config reference** for [Jobs](../reference/configuration_reference.md#job-files), [Tasks](../reference/configuration_reference.md#task-files), and [Samples](../reference/configuration_reference.md#sample-files)
2 changes: 1 addition & 1 deletion docs/reference/glossary.md
@@ -68,6 +68,6 @@ Key terminology for understanding the evals framework.

---

-See the [Configuration Overview](./config/about.md) for detailed configuration file documentation.
+See the [Configuration Reference](./configuration_reference.md) for detailed configuration file documentation.

[Learn more about Inspect AI](https://inspect.aisi.org.uk/)
4 changes: 2 additions & 2 deletions firebase.json
@@ -1,6 +1,6 @@
{
"hosting": {
-"site": "evals-docs",
+"site": "dash-evals-docs",
"public": "docs/_build/html",
"ignore": [
"firebase.json",
@@ -9,4 +9,4 @@
"**/dart_docs/"
]
}
-}
+}