TEST: De-flake timing tests with lower-bound assertions#846
Conversation
test_timing.py asserted that wall-clock time.sleep() durations stayed *under* an upper bound (assert_allclose with atol/rtol). That is inherently flaky on shared CI runners: time.sleep(t) only guarantees the process is suspended for at least t, and a descheduled runner can make the measured time arbitrarily larger. test_timeit_lambda_function (the tightest case, expecting ~0.05s) failed three times recently on slow macOS/ubuntu runners. Replace the ~14 upper-bound checks with a lower-bound assertion (assert_at_least): the meaningful invariant is that the timing utilities do not under-report elapsed time; scheduler overshoot is no longer a failure. Relationship checks (min <= avg <= max, average_time >= average_of_best) are unchanged. See the CI flakiness discussion in #845. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR de-flakes the timing-related test suite in quantecon.util by replacing wall-clock upper-bound assertions (which can fail on busy CI runners due to scheduler oversleep) with a consistent lower-bound assertion that primarily guards against timers under-reporting elapsed time.
Changes:
- Remove
assert_allclose-based upper-bound checks fortic/tac/toc,loop_timer,Timer, andtimeitresults. - Introduce
assert_at_least(measured, expected)with a module-levelTIMING_LOWER_BOUNDto enforcemeasured >= expected * 0.9. - Clean up now-unused imports and locals (
assert_allclose,rtol,atol) to keep unused-import linting green.
Include the computed lower bound and the factor in the failure message, with floats formatted to 4 dp, so a future CI trip is easy to interpret. Diagnostics only; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The lower-bound conversion was over-applied to test_timer_units and test_timeit_different_units. Those tests exist to verify Timer.elapsed stays in seconds regardless of the display unit; a pure lower bound would pass a unit-leak regression (elapsed stored in ms/µs is ~1000x/1e6x too large), defeating their purpose. Add assert_in_seconds: a bounded check (expected*0.9 <= measured < expected*100) whose generous ceiling always trips on a unit leak but never on scheduler overshoot (a few x at most). Point the two unit tests at it; the generic timing tests keep the lower-bound-only assert_at_least. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Note for reviewers — what this PR changes and a fidelity checkThis started as a pure de-flake but, after review, also strengthens the suite. Summary of the three commits and the reasoning, plus an assertion-by-assertion fidelity audit so it's clear nothing was silently weakened. What changed
The two helpers
The Fidelity audit — no loss, net gainThe originals used
Pre-existing caveat (not introduced here): Net: the suite is no longer flaky and catches more realistic regressions than before. 🤖 Generated with Claude Code |
| Lower bound: the timer did not under-report. Upper bound: a ceiling no real | ||
| ``expected``-second sleep can reach, but which a unit leak into | ||
| ``Timer.elapsed`` (ms/µs) always trips -- the invariant the unit tests | ||
| guard. |
There was a problem hiding this comment.
A comment from ChatGPT:
Non-blocking nit: the comment/docstring could avoid saying “no real sleep can reach” because sleep overshoot is theoretically unbounded; “not expected under normal CI scheduling” would be more precise. Not worth blocking the PR.
Summary
Fixes the flaky timing tests that have been intermittently reddening CI on green code (twice on
main, once on #845). This is the second of the two root causes found while investigating #842's CI failure — the first (Coveralls 422 race +fail-fastcascade) is addressed in #845.Root cause
The whole
test_timing.pysuite verifies the timing utilities (tic/tac/toc,loop_timer,Timer,timeit) by sleeping for a known duration and asserting the measured time stays within an upper bound:That is the wrong invariant for wall-clock timing.
time.sleep(t)only guarantees the process is suspended for at leasttseconds; a busy or descheduled CI runner can make the measured time arbitrarily larger. The tightest case,test_timeit_lambda_function(expected ≈ 0.05 s, ceiling ≈ 0.20 s), is the one that keeps tripping — e.g. measured0.207 svs ceiling0.20 son a loaded macOS runner.Fix
Replace the ~14 upper-bound assertions with a lower-bound check,
assert_at_least(measured, expected), which asserts the utilities do not under-report elapsed time (measured >= expected * 0.9). Scheduler overshoot — the actual source of the flakiness — is no longer treated as a failure, while a genuinely broken timer (returning zero / far too little) is still caught.assert result['minimum'] <= result['average'] <= result['maximum']assert_(average_time >= average_of_best)test_timeit_single_runexact-equality assertsThe now-unused
assert_allcloseimport and the deadrtol/atollocals are removed (keeps the CIflake8 --select F401check green).Verification
flake8 --select F401,F405,E231 quantecon/util/tests/test_timing.py— clean.pytest quantecon/util/tests/test_timing.py— 27 passed, run repeatedly with stable results.Relationship to the other PRs
fail-fast: false) and removes the Coveralls race; this PR removes the flake itself. Together they should make CI reliably green.🤖 Generated with Claude Code