Fixing inconsistent error traces across failure modes. by cemde · Pull Request #61 · parameterlab/MASEval

cemde · 2026-05-12T10:10:05Z

Description

Benchmark.run() documents a stable report schema, but it wasn't actually stable. Reports built on the setup-failure and setup-timeout early returns, and on the except Exception fallback in _run_parallel, were missing the usage and task keys (and had empty traces/config), so a row that failed in setup looked structurally different from a row that succeeded, breaking any consumer that indexes report["task"] / report["usage"] on every row. Separately, in parallel runs the fail_on_setup_error / fail_on_task_error / fail_on_evaluation_error flags were effectively no-ops: when _execute_task_repetition re-raised, _run_parallel swallowed the exception into a degraded report and kept going, instead of aborting the run the way a sequential run does.

This PR makes every report carry the same keys regardless of outcome, and makes parallel fail-fast behave like sequential fail-fast.

What changed

New Benchmark._build_report(...): the single constructor for repetition reports. Every report (success, setup failure, setup timeout, agent/environment/user/unknown execution error, evaluation failure, or an unexpected worker failure in parallel mode) now always contains task_id, repeat_idx, status, error, traces, config, usage, eval, task. traces/config default to {} when unavailable; error is None only for SUCCESS and is otherwise always populated (a generic placeholder is synthesised as a backstop if a caller ever omits it).
_run_parallel now re-raises (and shuts the pool down, cancelling queued work) when a worker future raises and any fail_on_* flag is set — aborting the run like _run_sequential. Only genuinely unexpected worker failures (no fail_on_* set) are turned into a degraded-but-full-schema UNKNOWN_EXECUTION_ERROR report so the rest of the batch continues.
Updated the run() docstring (Returns/Raises) to describe the full schema and the fail-fast behaviour.

Tests

New tests/test_core/test_benchmark/test_report_schema.py:

Schema invariance over {success, setup_failure, setup_timeout, execution_failure, evaluation_failure} × {sequential, parallel}, plus a mixed-outcome run and an "error always populated when status != SUCCESS" check.
Parallel runs abort on each of fail_on_setup_error / fail_on_task_error / fail_on_evaluation_error.
An unexpected worker failure yields a full-schema degraded report (graceful mode) and propagates when a fail_on_* flag is set.

11 of these were red before the fix. just all is green (ruff format + check, ty check maseval tests, full default + offline suites — 2631 passed, 1 skipped, 2 xfailed).

Type of Change

Bug fix (non-breaking change that fixes an issue)
New feature (non-breaking change that adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Documentation update
Code quality improvement (refactoring, formatting, etc.)

Checklist

Contribution

I have read the CONTRIBUTING.md guide.
Commits follow "How to write a good git commit message"

Documentation

Added/updated docstrings for new/modified functions as instructed CONTRIBUTING.md
Updated relevant documentation in docs/ (if applicable)
Tag github issue with this PR (if applicable)

Changelog

Added entry to CHANGELOG.md under [Unreleased] section
- Use Added section for new features
- Use Changed section for modifications to existing functionality
- Use Fixed section for bug fixes
- Use Removed section for deprecated/removed features
OR this is a documentation-only change (no changelog needed)

Architecture (if applicable)

Core/Interface separation: Changes in maseval/core/ do NOT import from maseval/interface/
Dependencies: New core dependencies added sparingly; framework integrations go to optional dependencies

Additional Notes

Behaviour change worth flagging for reviewers: under num_workers > 1, a fail_on_* flag now aborts the run (re-raising the original exception) instead of being silently absorbed into a report. This matches the documented intent and the sequential path; anyone who was (perhaps unknowingly) relying on the old swallow-and-continue behaviour in parallel mode should set the flag to False and inspect report status instead.

Adding keys to the report dict is additive and shouldn't break consumers that look up keys by name; consumers that asserted on the exact key set will see the two new keys (usage, task) on previously-degraded rows.

github-actions · 2026-05-12T10:16:54Z

Coverage report

Click to see where and how coverage changed

File	Statements	Missing	Coverage	Coverage (new stmts)	Lines missing
maseval/core
benchmark.py					559
Project Total

_{This report was generated by python-coverage-comment-action}

cemde added 2 commits May 12, 2026 12:04

fix implemented

706bc19

added changelog

89aca00

cemde merged commit 1c52f23 into main May 12, 2026
26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fixing inconsistent error traces across failure modes.#61

Fixing inconsistent error traces across failure modes.#61
cemde merged 2 commits into
mainfrom
fix-exception-reporting

cemde commented May 12, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented May 12, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

cemde commented May 12, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

What changed

Tests

Type of Change

Checklist

Contribution

Documentation

Changelog

Architecture (if applicable)

Additional Notes

Uh oh!

github-actions Bot commented May 12, 2026

Coverage report

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

cemde commented May 12, 2026 •

edited

Loading