Skip to content

feat: Allow failing toys#2128

Open
lukasheinrich wants to merge 6 commits intomainfrom
toy_handling_test
Open

feat: Allow failing toys#2128
lukasheinrich wants to merge 6 commits intomainfrom
toy_handling_test

Conversation

@lukasheinrich
Copy link
Copy Markdown
Contributor

@lukasheinrich lukasheinrich commented Mar 3, 2023

Pull Request Description

Resolves #1427.

When running toy-based hypothesis tests with ToyCalculator, any single failed fit (raised as FailedMinimization) would abort the entire toy loop. For large ntoys runs this is extremely frustrating — all progress is lost with no information about which toys failed or why.

This PR adds two things:

  1. ToyResult dataclass — a frozen public dataclass that captures per-hypothesis diagnostics from the toy loop, stored on the calculator as toy_results after distributions() returns:

    • successful: list[float] which is a list of test statistic values for toys where the fit converged
    • failed: list[tuple[int, Tensor, FailedMinimization]] which is a list of (toy_index, sample, exception) for each failed toy, giving callers full access to the pseudo-data and underlying fit result via exception.result
  2. failure_threshold parameter on ToyCalculator which replaces the draft boolean skip_failing_toys with a float (or None):

    • None (default): any FailedMinimization propagates immediately, preserving existing behaviour
    • [0.0, 1.0]: fraction of toys allowed to fail before raising; e.g. failure_threshold=0.1 permits up to 10% failures
    • When failures occur within the threshold, a WARNING is logged with the count and fraction
    • failure_threshold passes through hypotest() kwargs to the calculator automatically

The internal toy loop is also refactored into a private _collect_toy_teststats() method to eliminate the duplicated signal/background logic.

Checklist Before Requesting Reviewer

  • Tests are passing
  • "WIP" removed from the title of the pull request
  • Selected an Assignee for the PR to be responsible for the log summary

Before Merging

For the PR Assignees:

  • Summarize commit messages into a comprehensive review of the PR
* Replace the draft skip_failing_toys boolean with a failure_threshold (float | None)
  parameter on ToyCalculator. When None (default) any FailedMinimization propagates
  immediately, preserving existing behaviour. A float value sets the maximum fraction
  of toys allowed to fail before raising.
* Add ToyResult frozen dataclass to capture per-hypothesis diagnostics: successful 
  test statistic values and failed entries as (toy_index, sample, FailedMinimization)
  tuples. Stored on the calculator as toy_results after distributions() returns,
  giving callers full access to which toys failed and the underlying fit result.
* Refactor duplicated signal/background toy loops into a single private
  _collect_toy_teststats() method. Catches FailedMinimization specifically (not bare
  except or RuntimeError), logs a WARNING with failure count and fraction, and
  checks the running failure fraction against failure_threshold after each failure.
* Add ToyResult to the public API (__all__) of pyhf.infer.calculators.
* Add tests covering: no-failure baseline (toy_results populated correctly),
  default raises on first failure, threshold allows failures within limit, threshold
  exceeded raises, warning logged on failure, and failure_threshold forwarded
  through hypotest() kwargs to ToyCalculator.

Co-Authored-By: Giordon Stark <kratsg@gmail.com>

@lukasheinrich lukasheinrich changed the title [WIP] first attempt on toy handling [WIP] feat: allow failing toys Mar 3, 2023
@matthewfeickert matthewfeickert added the feat/enhancement New feature or request label Mar 3, 2023
@matthewfeickert matthewfeickert changed the title [WIP] feat: allow failing toys feat: Allow failing toys Mar 3, 2023
@matthewfeickert matthewfeickert marked this pull request as draft March 3, 2023 16:57
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 1, 2023

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 98.29%. Comparing base (24b1013) to head (e86fbf7).

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #2128   +/-   ##
=======================================
  Coverage   98.28%   98.29%           
=======================================
  Files          65       65           
  Lines        4305     4328   +23     
  Branches      465      467    +2     
=======================================
+ Hits         4231     4254   +23     
  Misses         46       46           
  Partials       28       28           
Flag Coverage Δ
contrib 98.17% <100.00%> (+<0.01%) ⬆️
doctest 98.29% <100.00%> (+<0.01%) ⬆️
unittests-3.10 96.48% <100.00%> (+0.01%) ⬆️
unittests-3.11 96.48% <100.00%> (+0.01%) ⬆️
unittests-3.12 96.48% <100.00%> (+0.01%) ⬆️
unittests-3.13 96.48% <100.00%> (+0.01%) ⬆️
unittests-3.9 96.55% <100.00%> (+0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@matthewfeickert matthewfeickert added the API Changes the public API label Dec 13, 2023
@kratsg kratsg force-pushed the toy_handling_test branch from 803b842 to c41bb4d Compare July 5, 2024 18:28
@kratsg kratsg added the experiment/atlas Relevant to ATLAS's interests label Oct 18, 2024
@matthewfeickert
Copy link
Copy Markdown
Member

Before force-pushing this, I backed up it existing state to my fork as backup/lukas-toy_handling_test.

kratsg added a commit that referenced this pull request Apr 11, 2026
Replace the draft skip_failing_toys bool with a proper failure_threshold
(float | None) that controls what fraction of toys are allowed to fail
before raising. Any FailedMinimization exceptions are collected per
hypothesis into a frozen ToyResult dataclass (toy_index, sample,
exception) stored on the calculator as toy_results after distributions()
returns, giving users full visibility into which toys failed and why.
When threshold=None (default) the first failure propagates immediately,
preserving existing behaviour.

Closes #2128

Co-Authored-By: kratsg <kratsg@gmail.com>
@kratsg kratsg self-assigned this Apr 11, 2026
@kratsg kratsg requested a review from matthewfeickert April 11, 2026 04:56
@kratsg kratsg marked this pull request as ready for review April 11, 2026 04:58
kratsg added a commit that referenced this pull request Apr 11, 2026
Replace the draft skip_failing_toys bool with a proper failure_threshold
(float | None) that controls what fraction of toys are allowed to fail
before raising. Any FailedMinimization exceptions are collected per
hypothesis into a frozen ToyResult dataclass (toy_index, sample,
exception) stored on the calculator as toy_results after distributions()
returns, giving users full visibility into which toys failed and why.
When threshold=None (default) the first failure propagates immediately,
preserving existing behaviour.

Closes #2128

Co-Authored-By: kratsg <kratsg@gmail.com>
@kratsg kratsg force-pushed the toy_handling_test branch from 10d20e2 to 649a020 Compare April 11, 2026 04:58
kratsg added a commit that referenced this pull request Apr 11, 2026
Replace the draft skip_failing_toys bool with a proper failure_threshold
(float | None) that controls what fraction of toys are allowed to fail
before raising. Any FailedMinimization exceptions are collected per
hypothesis into a frozen ToyResult dataclass (toy_index, sample,
exception) stored on the calculator as toy_results after distributions()
returns, giving users full visibility into which toys failed and why.
When threshold=None (default) the first failure propagates immediately,
preserving existing behaviour.

Closes #2128

Co-Authored-By: kratsg <kratsg@gmail.com>
@kratsg kratsg force-pushed the toy_handling_test branch from 0ef4806 to 0713922 Compare April 11, 2026 06:23
kratsg added 3 commits April 11, 2026 15:49
Replace the draft skip_failing_toys bool with a proper failure_threshold
(float | None) that controls what fraction of toys are allowed to fail
before raising. Any FailedMinimization exceptions are collected per
hypothesis into a frozen ToyResult dataclass (toy_index, sample,
exception) stored on the calculator as toy_results after distributions()
returns, giving users full visibility into which toys failed and why.
When threshold=None (default) the first failure propagates immediately,
preserving existing behaviour.

Closes #2128

Co-Authored-By: kratsg <kratsg@gmail.com>
@kratsg kratsg force-pushed the toy_handling_test branch from 0713922 to e86fbf7 Compare April 11, 2026 13:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

API Changes the public API experiment/atlas Relevant to ATLAS's interests feat/enhancement New feature or request

Projects

Status: In progress

Development

Successfully merging this pull request may close these issues.

Allow toys to fail

3 participants