diff --git a/.firebaserc b/.firebaserc
new file mode 100644
index 0000000..f389136
--- /dev/null
+++ b/.firebaserc
@@ -0,0 +1,5 @@
+{
+  "projects": {
+    "default": "dash-evals"
+  }
+}
diff --git a/.gemini/config.yaml b/.gemini/config.yaml
new file mode 100644
index 0000000..5919d6e
--- /dev/null
+++ b/.gemini/config.yaml
@@ -0,0 +1,10 @@
+# Minimize verbosity.
+have_fun: false
+code_review:
+  # For now, use the default of MEDIUM for testing. Based on desired verbosity,
+  # we can change this to LOW or HIGH in the future.
+  comment_severity_threshold: MEDIUM
+  pull_request_opened:
+    summary: true
+    include_drafts: false
+ignore_patterns:
\ No newline at end of file
diff --git a/.gemini/styleguide.md b/.gemini/styleguide.md
new file mode 100644
index 0000000..d488e19
--- /dev/null
+++ b/.gemini/styleguide.md
@@ -0,0 +1,90 @@
+# dash_evals Style Guide
+
+This style guide outlines the coding conventions and contribution requirements for the dash_evals repository.
+
+---
+
+## Documentation Requirements
+
+All changes that affect user-facing behavior, configuration, or APIs **must** be documented in the `docs/` directory:
+
+- **New features**: Add documentation explaining the feature and how to use it
+- **CLI changes**: Update `docs/dataset_yaml_schema.md` (CLI Usage section)
+- **Configuration changes**: Update `docs/dataset_yaml_schema.md`
+- **Workflow changes**: Update `docs/contributing_guide.md`
+- **Architecture changes**: Update `docs/repository_structure.md`
+
+When reviewing PRs, check that:
+1. Any new CLI flags or options are documented
+2. New configuration fields are documented with type, description, and examples
+3. User-facing error messages are clear and actionable
+
+---
+
+## Python Style Guide
+
+This project follows the [Google Python Style Guide](https://google.github.io/styleguide/pyguide.html).
+
+### Key Points
+
+- **Formatting**: Use `ruff format` for automatic formatting
+- **Linting**: Use `ruff check` and `pylint`
+- **Line length**: 100 characters maximum
+- **Docstrings**: Use Google-style docstrings with Args, Returns, and Raises sections
+- **Type hints**: Required for all public functions and methods
+- **Imports**: Use absolute imports, grouped by standard library / third-party / local
+
+### Docstring Example
+
+```python
+def parse(dataset_path: Path, jobs: list[str] | None = None) -> list[EvalSetConfig]:
+    """Parse dataset directory into resolved EvalSetConfig(s).
+
+    Args:
+        dataset_path: Path to dataset directory containing dataset.yaml.
+        jobs: Optional list of job names or paths. Uses default_job if not specified.
+
+    Returns:
+        List of EvalSetConfig objects ready to pass to inspect_ai.eval_set().
+
+    Raises:
+        FileNotFoundError: If dataset or job file not found.
+    """
+```
+
+---
+
+## Dart Style Guide
+
+This project follows the [Effective Dart Style Guide](https://dart.dev/effective-dart/style).
+
+Code should follow the relevant style guides, and use the correct
+auto-formatter, for each language, as described in
+[the repository contributing guide's Style section](https://github.com/flutter/packages/blob/main/CONTRIBUTING.md#style).
+
+### Best Practices
+
+- Code should follow the guidance and principles described in
+  [the flutter/packages contribution guide](https://github.com/flutter/flutter/blob/master/docs/ecosystem/contributing/README.md).
+- Code should be tested. Changes to plugin packages, which include code written
+  in C, C++, Java, Kotlin, Objective-C, or Swift, should have appropriate tests
+  as described in [the plugin test guidance](https://github.com/flutter/flutter/blob/master/docs/ecosystem/testing/Plugin-Tests.md).
+- PR descriptions should include the Pre-Review Checklist from
+  [the PR template](https://github.com/flutter/packages/blob/main/.github/PULL_REQUEST_TEMPLATE.md),
+  with all of the steps completed.
+
+### Review Agent Guidelines
+
+When providing a summary, the review agent must adhere to the following principles:
+- **Be Objective:** Focus on a neutral, descriptive summary of the changes. Avoid subjective value judgments
+  like "good," "bad," "positive," or "negative." The goal is to report what the code does, not to evaluate it.
+- **Use Code as the Source of Truth:** Base all summaries on the code diff. Do not trust or rephrase the PR
+  description, which may be outdated or inaccurate. A summary must reflect the actual changes in the code.
+- **Be Concise:** Generate summaries that are brief and to the point. Focus on the most significant changes,
+  and avoid unnecessary details or verbose explanations. This ensures the feedback is easy to scan and understand.
+
+### YAML Configuration Files
+
+- Use 2-space indentation
+- Include comments explaining non-obvious fields
+- Use explicit `path:` references for file paths (e.g., `- path: samples/file.yaml`)
diff --git a/.github/workflows/docs.yml b/.github/workflows/docs.yml
index 87adcc3..97f8ba8 100644
--- a/.github/workflows/docs.yml
+++ b/.github/workflows/docs.yml
@@ -70,6 +70,6 @@ jobs:
         with:
           repoToken: ${{ secrets.GITHUB_TOKEN }}
           firebaseServiceAccount: ${{ secrets.FIREBASE_SERVICE_ACCOUNT }}
-          projectId: evals
-          target: evals-docs
+          projectId: dash-evals
+          target: dash-evals-docs
           channelId: live
diff --git a/.gitignore b/.gitignore
index 87eb4c9..2f29eb9 100644
--- a/.gitignore
+++ b/.gitignore
@@ -233,3 +233,4 @@ app.*.map.json
 /android/app/debug
 /android/app/profile
 /android/app/release
+.firebase/
diff --git a/docs/contributing/repository_structure.md b/docs/contributing/repository_structure.md
index 4ed859f..e4a0730 100644
--- a/docs/contributing/repository_structure.md
+++ b/docs/contributing/repository_structure.md
@@ -21,7 +21,7 @@ evals/
 
 ## dataset/
 
-Contains all evaluation data, configurations, and resources. See the [Configuration Overview](./config/about.md) for detailed file formats.
+Contains all evaluation data, configurations, and resources. See the [Configuration Overview](../reference/configuration_reference.md) for detailed file formats.
 
 | Path | Description |
 |------|-------------|
@@ -81,7 +81,7 @@ dash_evals/
 
 ### devals_cli/ (devals)
 
-Dart CLI for creating and managing evaluation tasks and jobs. See the [CLI documentation](./cli.md) for full command reference.
+Dart CLI for creating and managing evaluation tasks and jobs. See the [CLI documentation](../reference/cli.md) for full command reference.
 
 ```
 devals_cli/
diff --git a/docs/guides/quick_start.md b/docs/guides/quick_start.md
index dd70a26..ed93d0e 100644
--- a/docs/guides/quick_start.md
+++ b/docs/guides/quick_start.md
@@ -14,16 +14,12 @@ You'll also need an API key for at least one model provider (`GOOGLE_API_KEY`, `
 
 ## 1. Install the packages
 
 ```bash
-git clone https://github.com/flutter/evals.git
-pip install -e /packages/dash_evals
-dart pub global activate devals --source path /packages/devals_cli
-
-
-## TODO: Integrate in the new repo. This is wrong for this repo
+git clone https://github.com/flutter/evals.git && cd evals
 python3 -m venv .venv
 source .venv/bin/activate
 pip install -e "packages/dash_evals[dev]"
 pip install -e "packages/dataset_config_python[dev]"
+dart pub global activate devals --source path packages/devals_cli
 ```
 
 This installs two things:
diff --git a/docs/guides/tutorial.md b/docs/guides/tutorial.md
index 5776963..fcf8b19 100644
--- a/docs/guides/tutorial.md
+++ b/docs/guides/tutorial.md
@@ -169,7 +169,7 @@ samples:
 | `tests.path` | Path to test files the scorer runs against the generated code. |
 
 > [!NOTE]
-> See [Tasks](config/tasks.md) and [Samples](config/samples.md) for the
+> See [Tasks](../reference/configuration_reference.md#task-files) and [Samples](../reference/configuration_reference.md#sample-files) for the
 > complete field reference.
 
 ---
@@ -215,7 +215,7 @@ That's the minimal job — it will:
 >     with_context:
 >       context_files: [./context_files/dart_docs.md]
 > ```
-> See [Configuration Overview](config/about.md#variants) for details.
+> See [Configuration Overview](../reference/configuration_reference.md#variants) for details.
 
 ---
 
@@ -281,7 +281,7 @@ devals view path/to/logs
 Now that you've run your first custom evaluation, here are some things to try:
 
 - **Add more samples** to your task: `devals create sample`
-- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../packages/dash_evals.md).
+- **Try different task types** — `question_answer`, `bug_fix`, or `flutter_code_gen`. See [all available task functions](../contributing/packages/dash_evals.md).
 - **Add variants** to test how context files or MCP tools affect performance. See [Variants](config/about.md#variants).
 - **Run multiple models** by adding more entries to the `models` list in your job file
-- **Read the config reference** for [Jobs](config/jobs.md), [Tasks](config/tasks.md), and [Samples](config/samples.md)
\ No newline at end of file
+- **Read the config reference** for [Jobs](../reference/configuration_reference.md#job-files), [Tasks](../reference/configuration_reference.md#task-files), and [Samples](../reference/configuration_reference.md#sample-files)
\ No newline at end of file
diff --git a/docs/reference/glossary.md b/docs/reference/glossary.md
index 8960400..7fdb5ad 100644
--- a/docs/reference/glossary.md
+++ b/docs/reference/glossary.md
@@ -68,6 +68,6 @@ Key terminology for understanding the evals framework.
 
 ---
 
-See the [Configuration Overview](./config/about.md) for detailed configuration file documentation.
+See the [Configuration Reference](./configuration_reference.md) for detailed configuration file documentation.
 
 [Learn more about Inspect AI](https://inspect.aisi.org.uk/)
diff --git a/firebase.json b/firebase.json
index 6fa6b63..b1b3874 100644
--- a/firebase.json
+++ b/firebase.json
@@ -1,6 +1,6 @@
 {
   "hosting": {
-    "site": "evals-docs",
+    "site": "dash-evals-docs",
     "public": "docs/_build/html",
     "ignore": [
       "firebase.json",
@@ -9,4 +9,4 @@
       "**/dart_docs/"
     ]
   }
-}
+}
\ No newline at end of file