Fixing inconsistent error traces across failure modes.#61
Merged
Conversation
Coverage reportClick to see where and how coverage changed
This report was generated by python-coverage-comment-action |
||||||||||||||||||||||||
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
Benchmark.run()documents a stable report schema, but it wasn't actually stable. Reports built on the setup-failure and setup-timeout early returns, and on theexcept Exceptionfallback in_run_parallel, were missing theusageandtaskkeys (and had emptytraces/config), so a row that failed in setup looked structurally different from a row that succeeded, breaking any consumer that indexesreport["task"]/report["usage"]on every row. Separately, in parallel runs thefail_on_setup_error/fail_on_task_error/fail_on_evaluation_errorflags were effectively no-ops: when_execute_task_repetitionre-raised,_run_parallelswallowed the exception into a degraded report and kept going, instead of aborting the run the way a sequential run does.This PR makes every report carry the same keys regardless of outcome, and makes parallel fail-fast behave like sequential fail-fast.
What changed
Benchmark._build_report(...): the single constructor for repetition reports. Every report (success, setup failure, setup timeout, agent/environment/user/unknown execution error, evaluation failure, or an unexpected worker failure in parallel mode) now always containstask_id,repeat_idx,status,error,traces,config,usage,eval,task.traces/configdefault to{}when unavailable;errorisNoneonly forSUCCESSand is otherwise always populated (a generic placeholder is synthesised as a backstop if a caller ever omits it)._run_parallelnow re-raises (and shuts the pool down, cancelling queued work) when a worker future raises and anyfail_on_*flag is set — aborting the run like_run_sequential. Only genuinely unexpected worker failures (nofail_on_*set) are turned into a degraded-but-full-schemaUNKNOWN_EXECUTION_ERRORreport so the rest of the batch continues.run()docstring (Returns/Raises) to describe the full schema and the fail-fast behaviour.Tests
New
tests/test_core/test_benchmark/test_report_schema.py:{success, setup_failure, setup_timeout, execution_failure, evaluation_failure} × {sequential, parallel}, plus a mixed-outcome run and an "erroralways populated whenstatus != SUCCESS" check.fail_on_setup_error/fail_on_task_error/fail_on_evaluation_error.fail_on_*flag is set.11 of these were red before the fix.
just allis green (ruff format + check,ty check maseval tests, full default + offline suites — 2631 passed, 1 skipped, 2 xfailed).Type of Change
Checklist
Contribution
Documentation
docs/(if applicable)Changelog
CHANGELOG.mdunder[Unreleased]sectionArchitecture (if applicable)
maseval/core/do NOT import frommaseval/interface/Additional Notes
Behaviour change worth flagging for reviewers: under
num_workers > 1, afail_on_*flag now aborts the run (re-raising the original exception) instead of being silently absorbed into a report. This matches the documented intent and the sequential path; anyone who was (perhaps unknowingly) relying on the old swallow-and-continue behaviour in parallel mode should set the flag toFalseand inspect reportstatusinstead.Adding keys to the report dict is additive and shouldn't break consumers that look up keys by name; consumers that asserted on the exact key set will see the two new keys (
usage,task) on previously-degraded rows.