TEST: De-flake timing tests with lower-bound assertions by mmcky · Pull Request #846 · QuantEcon/QuantEcon.py

mmcky · 2026-06-28T02:31:57Z

Summary

Fixes the flaky timing tests that have been intermittently reddening CI on green code (twice on main, once on #845). This is the second of the two root causes found while investigating #842's CI failure — the first (Coveralls 422 race + fail-fast cascade) is addressed in #845.

Root cause

The whole test_timing.py suite verifies the timing utilities (tic/tac/toc, loop_timer, Timer, timeit) by sleeping for a known duration and asserting the measured time stays within an upper bound:

assert_allclose(measured, expected_sleep, atol=0.05, rtol=2)

That is the wrong invariant for wall-clock timing. time.sleep(t) only guarantees the process is suspended for at least t seconds; a busy or descheduled CI runner can make the measured time arbitrarily larger. The tightest case, test_timeit_lambda_function (expected ≈ 0.05 s, ceiling ≈ 0.20 s), is the one that keeps tripping — e.g. measured 0.207 s vs ceiling 0.20 s on a loaded macOS runner.

Fix

Replace the ~14 upper-bound assertions with a lower-bound check, assert_at_least(measured, expected), which asserts the utilities do not under-report elapsed time (measured >= expected * 0.9). Scheduler overshoot — the actual source of the flakiness — is no longer treated as a failure, while a genuinely broken timer (returning zero / far too little) is still caught.

Kept as-is	Why
`assert result['minimum'] <= result['average'] <= result['maximum']`	A relationship invariant, not wall-clock dependent.
`assert_(average_time >= average_of_best)`	Same — relationship between aggregates.
`test_timeit_single_run` exact-equality asserts	Compare fields to each other, not to absolute time.

The now-unused assert_allclose import and the dead rtol/atol locals are removed (keeps the CI flake8 --select F401 check green).

Verification

flake8 --select F401,F405,E231 quantecon/util/tests/test_timing.py — clean.
pytest quantecon/util/tests/test_timing.py — 27 passed, run repeatedly with stable results.

Relationship to the other PRs

CI: De-flake coverage upload and stop fail-fast cancelling the matrix #845 stops a single flake from cancelling the whole matrix (fail-fast: false) and removes the Coveralls race; this PR removes the flake itself. Together they should make CI reliably green.
BUG: Fix DLE.compute_sequence under NumPy >= 2.4 (#839) #842 (the DLE / NumPy 2.4 fix that started this) is functionally green; it just needs CI: De-flake coverage upload and stop fail-fast cancelling the matrix #845's hardened workflow once that merges.

🤖 Generated with Claude Code

test_timing.py asserted that wall-clock time.sleep() durations stayed *under* an upper bound (assert_allclose with atol/rtol). That is inherently flaky on shared CI runners: time.sleep(t) only guarantees the process is suspended for at least t, and a descheduled runner can make the measured time arbitrarily larger. test_timeit_lambda_function (the tightest case, expecting ~0.05s) failed three times recently on slow macOS/ubuntu runners. Replace the ~14 upper-bound checks with a lower-bound assertion (assert_at_least): the meaningful invariant is that the timing utilities do not under-report elapsed time; scheduler overshoot is no longer a failure. Relationship checks (min <= avg <= max, average_time >= average_of_best) are unchanged. See the CI flakiness discussion in #845. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

Copilot

Pull request overview

This PR de-flakes the timing-related test suite in quantecon.util by replacing wall-clock upper-bound assertions (which can fail on busy CI runners due to scheduler oversleep) with a consistent lower-bound assertion that primarily guards against timers under-reporting elapsed time.

Changes:

Remove assert_allclose-based upper-bound checks for tic/tac/toc, loop_timer, Timer, and timeit results.
Introduce assert_at_least(measured, expected) with a module-level TIMING_LOWER_BOUND to enforce measured >= expected * 0.9.
Clean up now-unused imports and locals (assert_allclose, rtol, atol) to keep unused-import linting green.

coveralls · 2026-06-28T02:36:30Z

coverage: 93.012% (+0.004%) from 93.008% — fix-flaky-timing-tests into main

Include the computed lower bound and the factor in the failure message, with floats formatted to 4 dp, so a future CI trip is easy to interpret. Diagnostics only; no behaviour change. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

The lower-bound conversion was over-applied to test_timer_units and test_timeit_different_units. Those tests exist to verify Timer.elapsed stays in seconds regardless of the display unit; a pure lower bound would pass a unit-leak regression (elapsed stored in ms/µs is ~1000x/1e6x too large), defeating their purpose. Add assert_in_seconds: a bounded check (expected*0.9 <= measured < expected*100) whose generous ceiling always trips on a unit leak but never on scheduler overshoot (a few x at most). Point the two unit tests at it; the generic timing tests keep the lower-bound-only assert_at_least. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

mmcky · 2026-06-28T03:33:17Z

Note for reviewers — what this PR changes and a fidelity check

This started as a pure de-flake but, after review, also strengthens the suite. Summary of the three commits and the reasoning, plus an assertion-by-assertion fidelity audit so it's clear nothing was silently weakened.

What changed

Commit	Change
`7d9e6b6`	Replace the wall-clock upper-bound assertions (`assert_allclose(measured, expected, atol=0.05, rtol=2)`) with lower-bound checks. `time.sleep(t)` only guarantees at least `t`; a busy CI runner can oversleep arbitrarily, so an upper bound on measured duration is flaky by construction (it tripped on `test_timeit_lambda_function` 3× recently).
`809e24c`	Make the `assert_at_least` failure message report the computed bound and factor, floats to 4 dp (Copilot suggestion).
`9af5760`	Restore a bounded check (`assert_in_seconds`) for the two unit tests, which exist to verify `Timer.elapsed` stays in seconds regardless of the display `unit` — a pure lower bound would miss a unit-leak regression.

The two helpers

Helper	Used by	Asserts
`assert_at_least(x, e)`	generic timing tests (tic/tac/toc, `loop_timer`, `Timer.elapsed`, `timeit` runs/average)	`x >= 0.9·e` — the timer does not under-report
`assert_in_seconds(x, e)`	`test_timer_units`, `test_timeit_different_units`	`0.9·e <= x < 100·e` — `elapsed` is in seconds, not ms (×1000) / µs (×1e6)

The ×100 ceiling sits ~25× above the worst overshoot actually observed on CI and 10× below a millisecond leak — so it catches a unit regression with no realistic flake risk.

Fidelity audit — no loss, net gain

The originals used rtol=2, so assert_allclose(x, e, atol=0.05, rtol=2) resolves to x ∈ [−(e+0.05), 3e+0.05]. The lower edge is negative, i.e. the originals were effectively upper-bound-only with no meaningful lower bound.

Newly caught (the originals silently passed these): a timer returning 0 / not accumulating — assert_allclose(0, 0.1, …) passes but 0 >= 0.09 fails; and toc() returning the last interval (~0.1) instead of the cumulative total (~0.3) — now fails the >= 0.27 lower bound.
Retained: the unit-invariance guard — the one place an upper bound genuinely mattered — still trips on a ms/µs leak, and now also has a lower bound.
Unchanged: every relationship/structural/contract assertion — average_time >= average_of_best, minimum <= average <= maximum, the single-run equalities, all is not None / len == N / isinstance checks, the ValueError/validation tests, and the whole TestGlobalPrecision class.
Only removed: loose wall-clock upper bounds on default-unit timing, which targeted an implausible bug class and were themselves the flake source.

Pre-existing caveat (not introduced here): test_timeit_lambda_function doesn't strongly verify the 0.5 multiplier (sleeping 0.1 instead of 0.05 passes both the old and new assertion). Unchanged by this PR; noting it for completeness.

Net: the suite is no longer flaky and catches more realistic regressions than before.

🤖 Generated with Claude Code

oyamad

@mmcky Thanks! Looks good to me. The comment from ChatGPT below is just optional (not blocking).

oyamad · 2026-06-28T04:44:02Z

+    Lower bound: the timer did not under-report. Upper bound: a ceiling no real
+    ``expected``-second sleep can reach, but which a unit leak into
+    ``Timer.elapsed`` (ms/µs) always trips -- the invariant the unit tests
+    guard.


A comment from ChatGPT:

Non-blocking nit: the comment/docstring could avoid saying “no real sleep can reach” because sleep overshoot is theoretically unbounded; “not expected under normal CI scheduling” would be more precise. Not worth blocking the PR.

Copilot AI review requested due to automatic review settings June 28, 2026 02:31

mmcky added the tests label Jun 28, 2026

Copilot started reviewing on behalf of mmcky June 28, 2026 02:32 View session

Copilot AI reviewed Jun 28, 2026

View reviewed changes

Comment thread quantecon/util/tests/test_timing.py Outdated

mmcky and others added 2 commits June 28, 2026 12:43

oyamad approved these changes Jun 28, 2026

View reviewed changes

oyamad added the ready label Jun 28, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Uh oh!

TEST: De-flake timing tests with lower-bound assertions#846

TEST: De-flake timing tests with lower-bound assertions#846
mmcky wants to merge 3 commits into
mainfrom
fix-flaky-timing-tests

mmcky commented Jun 28, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Uh oh!

coveralls commented Jun 28, 2026 •

edited

Loading

Uh oh!

mmcky commented Jun 28, 2026

Uh oh!

oyamad left a comment

Uh oh!

oyamad Jun 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Uh oh!

Uh oh!

Conversation

mmcky commented Jun 28, 2026

Summary

Root cause

Fix

Verification

Relationship to the other PRs

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

coveralls commented Jun 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

mmcky commented Jun 28, 2026

Note for reviewers — what this PR changes and a fidelity check

What changed

The two helpers

Fidelity audit — no loss, net gain

Uh oh!

oyamad left a comment

Choose a reason for hiding this comment

Uh oh!

oyamad Jun 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

coveralls commented Jun 28, 2026 •

edited

Loading