Skip to content

TEST: De-flake timing tests with lower-bound assertions#846

Open
mmcky wants to merge 3 commits into
mainfrom
fix-flaky-timing-tests
Open

TEST: De-flake timing tests with lower-bound assertions#846
mmcky wants to merge 3 commits into
mainfrom
fix-flaky-timing-tests

Conversation

@mmcky

@mmcky mmcky commented Jun 28, 2026

Copy link
Copy Markdown
Contributor

Summary

Fixes the flaky timing tests that have been intermittently reddening CI on green code (twice on main, once on #845). This is the second of the two root causes found while investigating #842's CI failure — the first (Coveralls 422 race + fail-fast cascade) is addressed in #845.

Root cause

The whole test_timing.py suite verifies the timing utilities (tic/tac/toc, loop_timer, Timer, timeit) by sleeping for a known duration and asserting the measured time stays within an upper bound:

assert_allclose(measured, expected_sleep, atol=0.05, rtol=2)

That is the wrong invariant for wall-clock timing. time.sleep(t) only guarantees the process is suspended for at least t seconds; a busy or descheduled CI runner can make the measured time arbitrarily larger. The tightest case, test_timeit_lambda_function (expected ≈ 0.05 s, ceiling ≈ 0.20 s), is the one that keeps tripping — e.g. measured 0.207 s vs ceiling 0.20 s on a loaded macOS runner.

Fix

Replace the ~14 upper-bound assertions with a lower-bound check, assert_at_least(measured, expected), which asserts the utilities do not under-report elapsed time (measured >= expected * 0.9). Scheduler overshoot — the actual source of the flakiness — is no longer treated as a failure, while a genuinely broken timer (returning zero / far too little) is still caught.

Kept as-is Why
assert result['minimum'] <= result['average'] <= result['maximum'] A relationship invariant, not wall-clock dependent.
assert_(average_time >= average_of_best) Same — relationship between aggregates.
test_timeit_single_run exact-equality asserts Compare fields to each other, not to absolute time.

The now-unused assert_allclose import and the dead rtol/atol locals are removed (keeps the CI flake8 --select F401 check green).

Verification

  • flake8 --select F401,F405,E231 quantecon/util/tests/test_timing.py — clean.
  • pytest quantecon/util/tests/test_timing.py27 passed, run repeatedly with stable results.

Relationship to the other PRs

🤖 Generated with Claude Code

test_timing.py asserted that wall-clock time.sleep() durations stayed
*under* an upper bound (assert_allclose with atol/rtol). That is
inherently flaky on shared CI runners: time.sleep(t) only guarantees the
process is suspended for at least t, and a descheduled runner can make
the measured time arbitrarily larger. test_timeit_lambda_function (the
tightest case, expecting ~0.05s) failed three times recently on slow
macOS/ubuntu runners.

Replace the ~14 upper-bound checks with a lower-bound assertion
(assert_at_least): the meaningful invariant is that the timing utilities
do not under-report elapsed time; scheduler overshoot is no longer a
failure. Relationship checks (min <= avg <= max, average_time >=
average_of_best) are unchanged.

See the CI flakiness discussion in #845.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 28, 2026 02:31
@mmcky mmcky added the tests label Jun 28, 2026

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR de-flakes the timing-related test suite in quantecon.util by replacing wall-clock upper-bound assertions (which can fail on busy CI runners due to scheduler oversleep) with a consistent lower-bound assertion that primarily guards against timers under-reporting elapsed time.

Changes:

  • Remove assert_allclose-based upper-bound checks for tic/tac/toc, loop_timer, Timer, and timeit results.
  • Introduce assert_at_least(measured, expected) with a module-level TIMING_LOWER_BOUND to enforce measured >= expected * 0.9.
  • Clean up now-unused imports and locals (assert_allclose, rtol, atol) to keep unused-import linting green.

Comment thread quantecon/util/tests/test_timing.py Outdated
@coveralls

coveralls commented Jun 28, 2026

Copy link
Copy Markdown

Coverage Status

coverage: 93.012% (+0.004%) from 93.008% — fix-flaky-timing-tests into main

mmcky and others added 2 commits June 28, 2026 12:43
Include the computed lower bound and the factor in the failure message,
with floats formatted to 4 dp, so a future CI trip is easy to interpret.
Diagnostics only; no behaviour change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The lower-bound conversion was over-applied to test_timer_units and
test_timeit_different_units. Those tests exist to verify Timer.elapsed
stays in seconds regardless of the display unit; a pure lower bound would
pass a unit-leak regression (elapsed stored in ms/µs is ~1000x/1e6x too
large), defeating their purpose.

Add assert_in_seconds: a bounded check (expected*0.9 <= measured <
expected*100) whose generous ceiling always trips on a unit leak but
never on scheduler overshoot (a few x at most). Point the two unit tests
at it; the generic timing tests keep the lower-bound-only assert_at_least.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@mmcky

mmcky commented Jun 28, 2026

Copy link
Copy Markdown
Contributor Author

Note for reviewers — what this PR changes and a fidelity check

This started as a pure de-flake but, after review, also strengthens the suite. Summary of the three commits and the reasoning, plus an assertion-by-assertion fidelity audit so it's clear nothing was silently weakened.

What changed

Commit Change
7d9e6b6 Replace the wall-clock upper-bound assertions (assert_allclose(measured, expected, atol=0.05, rtol=2)) with lower-bound checks. time.sleep(t) only guarantees at least t; a busy CI runner can oversleep arbitrarily, so an upper bound on measured duration is flaky by construction (it tripped on test_timeit_lambda_function 3× recently).
809e24c Make the assert_at_least failure message report the computed bound and factor, floats to 4 dp (Copilot suggestion).
9af5760 Restore a bounded check (assert_in_seconds) for the two unit tests, which exist to verify Timer.elapsed stays in seconds regardless of the display unit — a pure lower bound would miss a unit-leak regression.

The two helpers

Helper Used by Asserts
assert_at_least(x, e) generic timing tests (tic/tac/toc, loop_timer, Timer.elapsed, timeit runs/average) x >= 0.9·e — the timer does not under-report
assert_in_seconds(x, e) test_timer_units, test_timeit_different_units 0.9·e <= x < 100·eelapsed is in seconds, not ms (×1000) / µs (×1e6)

The ×100 ceiling sits ~25× above the worst overshoot actually observed on CI and 10× below a millisecond leak — so it catches a unit regression with no realistic flake risk.

Fidelity audit — no loss, net gain

The originals used rtol=2, so assert_allclose(x, e, atol=0.05, rtol=2) resolves to x ∈ [−(e+0.05), 3e+0.05]. The lower edge is negative, i.e. the originals were effectively upper-bound-only with no meaningful lower bound.

  • Newly caught (the originals silently passed these): a timer returning 0 / not accumulating — assert_allclose(0, 0.1, …) passes but 0 >= 0.09 fails; and toc() returning the last interval (~0.1) instead of the cumulative total (~0.3) — now fails the >= 0.27 lower bound.
  • Retained: the unit-invariance guard — the one place an upper bound genuinely mattered — still trips on a ms/µs leak, and now also has a lower bound.
  • Unchanged: every relationship/structural/contract assertion — average_time >= average_of_best, minimum <= average <= maximum, the single-run equalities, all is not None / len == N / isinstance checks, the ValueError/validation tests, and the whole TestGlobalPrecision class.
  • Only removed: loose wall-clock upper bounds on default-unit timing, which targeted an implausible bug class and were themselves the flake source.

Pre-existing caveat (not introduced here): test_timeit_lambda_function doesn't strongly verify the 0.5 multiplier (sleeping 0.1 instead of 0.05 passes both the old and new assertion). Unchanged by this PR; noting it for completeness.

Net: the suite is no longer flaky and catches more realistic regressions than before.

🤖 Generated with Claude Code

@oyamad oyamad left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mmcky Thanks! Looks good to me. The comment from ChatGPT below is just optional (not blocking).

Comment on lines +38 to +41
Lower bound: the timer did not under-report. Upper bound: a ceiling no real
``expected``-second sleep can reach, but which a unit leak into
``Timer.elapsed`` (ms/µs) always trips -- the invariant the unit tests
guard.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A comment from ChatGPT:

Non-blocking nit: the comment/docstring could avoid saying “no real sleep can reach” because sleep overshoot is theoretically unbounded; “not expected under normal CI scheduling” would be more precise. Not worth blocking the PR.

@oyamad oyamad added the ready label Jun 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants