CI: De-flake coverage upload and stop fail-fast cancelling the matrix #845

Open: mmcky wants to merge 3 commits into main from ci-harden-coveralls-failfast

Conversation

mmcky (Contributor) commented Jun 28, 2026

Summary

CI has been going red on green code. Investigating the failure on #842 (and the matching failures on recent main pushes) turned up two independent, pre-existing infra problems — neither is a real test or code failure. This PR hardens the workflow against both.

Problem 1 — Coveralls 422 race

The Coveralls step runs on all three Linux matrix jobs (if: runner.os == 'Linux') and each uploads to the same Coveralls build with no parallel coordination. Whichever Linux job reports after the build has been closed gets:

422 Can't add a job to a build that is already closed. Build … is closed.

On the #842 run, ubuntu-3.14 finished first and closed the build; ubuntu-3.13 reported a minute later → 422 → job failed. Its pytest step had already passed. This is timing-dependent and intermittent — earlier runs on the same branch were green.
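For context, the "parallel coordination" that the workflow lacked is something the Coveralls action supports through its `parallel`, `flag-name`, and `parallel-finished` inputs. The sketch below illustrates that alternative under stated assumptions; it is not what this PR does. The job names (`test`, `coveralls-finish`), the OS list, the flag naming scheme, and the elided test steps are hypothetical.

```yaml
jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]  # assumed OS list
        python-version: ["3.12", "3.13", "3.14"]
    steps:
      # ... checkout, Python setup, and the test run that writes coverage.lcov ...
      - name: Coveralls (parallel upload)
        uses: coverallsapp/github-action@v2
        if: runner.os == 'Linux'
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          file: coverage.lcov
          format: lcov
          parallel: true                               # keep the build open for more jobs
          flag-name: py-${{ matrix.python-version }}   # hypothetical naming scheme

  coveralls-finish:
    needs: test           # runs only after every matrix job has reported
    if: ${{ always() }}
    runs-on: ubuntu-latest
    steps:
      - name: Close the Coveralls build
        uses: coverallsapp/github-action@v2
        with:
          parallel-finished: true
```

With this layout no single upload closes the build; the separate finishing job does, so the order in which the Linux jobs complete no longer matters. The trade-off is an extra job and more configuration than uploading from a single job.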

Problem 2 — fail-fast cascade

The matrix had no fail-fast key, so it defaulted to true. The moment one job failed (the 422 above, or the flaky timing test below), GitHub cancelled every other in-progress job — so a single flake painted the whole 9-job matrix red, with most jobs showing cancelled rather than their real result.

Fix

| Change | Effect |
| --- | --- |
| Upload coverage from a single Linux job (matrix.python-version == '3.13') | Only one job reports to Coveralls, so there is no build-already-closed race. |
| continue-on-error: true on the Coveralls step | A Coveralls outage records a warning instead of failing the job / blocking a merge. |
| fail-fast: false on the matrix | Jobs run to completion independently; one flaky job no longer cancels the others, and the dashboard shows each job's true result. |

Coverage is unaffected in practice — this is a pure-Python library, so line coverage does not vary across OS/Python, and one uploader is sufficient.
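Taken together, the three changes might look roughly like the following workflow. This is a minimal sketch rather than the actual ci.yml: the workflow name, triggers, OS list, action versions for checkout and Python setup, and the test command are assumptions; only the `fail-fast`, `if`, `continue-on-error`, and Coveralls inputs come from this PR.

```yaml
name: CI                      # hypothetical workflow name
on: [push, pull_request]      # assumed triggers

jobs:
  test:
    runs-on: ${{ matrix.os }}
    strategy:
      fail-fast: false        # let every matrix job run to completion
      matrix:
        os: [ubuntu-latest, macos-latest, windows-latest]   # assumed OS list
        python-version: ["3.12", "3.13", "3.14"]
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: ${{ matrix.python-version }}
      - name: Run tests       # hypothetical commands; the real ones are not shown in the PR
        run: |
          pip install -e . pytest pytest-cov
          pytest --cov --cov-report=lcov:coverage.lcov
      - name: Coveralls
        uses: coverallsapp/github-action@v2
        # single uploader: only the Linux / Python 3.13 job reports coverage
        if: runner.os == 'Linux' && matrix.python-version == '3.13'
        continue-on-error: true   # an upload failure is reported but does not fail the job
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          path-to-lcov: coverage.lcov
```

Note that `continue-on-error` is set on the step, not on the job, so a failing pytest step still fails the job as before.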

Note: a separate flaky test (not fixed here)

While diagnosing this I found that recent main failures were caused by a different flake: a wall-clock timing assertion, quantecon/util/tests/test_timing.py::TestTimer::test_timeit_lambda_function, which overshoots on a slow runner (ACTUAL 0.207 vs DESIRED 0.05, rtol=2). With fail-fast: false that flake will no longer cancel the rest of the matrix, but the test itself can still go red on an unlucky runner. Making that test more robust, or removing its wall-clock dependence, is left as a follow-up so this PR stays focused on the workflow.
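The reported numbers look like the output of a NumPy `assert_allclose`-style relative tolerance check. That is an assumption, since the test body is not shown here. Under that assumption, and with a zero absolute tolerance, the check and the failing case work out as:

```latex
\begin{align*}
  |\text{actual} - \text{desired}| &\le \text{rtol} \cdot |\text{desired}| \\
  |0.207 - 0.05| = 0.157 &> 2 \times 0.05 = 0.10
\end{align*}
```

On that reading, the test passes only while the measured time stays at or below 0.05 + 0.10 = 0.15 seconds, which leaves little headroom on a loaded runner.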

🤖 Generated with Claude Code

Two unrelated infra issues have been turning CI red on green code:

1. The Coveralls step runs on all three Linux matrix jobs and uploads to
   the same build with no parallel coordination. Whichever Linux job
   reports after the build is closed gets "422 Build is already closed"
   and fails. Restrict the upload to a single Linux job (3.13) and mark
   it continue-on-error so a Coveralls outage never blocks a merge.

2. The matrix used the default fail-fast: true, so any single job failure
   (the 422 above, or a flaky wall-clock timing test) cancelled every
   other in-progress job and painted the whole matrix red. Set
   fail-fast: false so jobs run to completion independently.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 28, 2026 02:12
@mmcky mmcky added the CI label Jun 28, 2026

Copilot AI left a comment


Pull request overview

This PR hardens the GitHub Actions CI workflow by preventing flaky/non-test infrastructure failures (Coveralls upload race, and default matrix fail-fast behavior) from incorrectly turning the overall CI status red.

Changes:

  • Set the test job’s matrix strategy to fail-fast: false so one failure doesn’t cancel the rest of the matrix.
  • Restrict Coveralls uploads to a single Linux/Python job and mark the upload step continue-on-error: true to avoid intermittent 422/upload outages failing CI.

Comment thread .github/workflows/ci.yml Outdated
Comment thread .github/workflows/ci.yml Outdated
coveralls commented Jun 28, 2026

Coverage Status

coverage: 93.008% (remained the same), ci-harden-coveralls-failfast into main

mmcky and others added 2 commits June 28, 2026 12:40
- Quote the matrix python-version entries as strings ("3.12"/"3.13"/
  "3.14"). Bare YAML numbers are brittle (3.10 would parse as 3.1) and
  make the ``matrix.python-version == '3.13'`` Coveralls gate unreliable.
- Pin coverallsapp/github-action to @v2 instead of @master, matching the
  major-tag pinning used by the other actions in this file, for
  reproducibility and supply-chain safety.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
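The quoting point in the commit above can be seen in a small YAML fragment. The key names `unquoted` and `quoted` are hypothetical and exist only for illustration.

```yaml
# Unquoted scalars that look like numbers are resolved as floats,
# so 3.10 is read as the number 3.1 and the trailing zero is lost.
unquoted: [3.10, 3.13]

# Quoted scalars stay strings, so "3.10" and "3.13" survive exactly as written.
quoted: ["3.10", "3.13"]
```

With quoted entries, both sides of `matrix.python-version == '3.13'` are strings, so the gate does not depend on how a numeric value is parsed or converted.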
Tighten the explanatory comments added in this PR to one or two lines
each, and drop the comment on the coverallsapp version pin (self-evident
from @v2 vs @master). No functional change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
mmcky (Contributor, Author) commented Jun 28, 2026

@oyamad this should fix the flaky coveralls issue we are having. I'll merge this once green but let me know if you have any issues.

Comment thread .github/workflows/ci.yml
Comment on lines 59 to 67
           - name: Coveralls
    -        uses: coverallsapp/github-action@master
    -        if: runner.os == 'Linux'
    +        uses: coverallsapp/github-action@v2
    +        # upload from one Linux job only (avoids the Coveralls parallel-build 422
    +        # race); continue-on-error so an upload hiccup can't fail CI
    +        if: runner.os == 'Linux' && matrix.python-version == '3.13'
    +        continue-on-error: true
             with:
               github-token: ${{ secrets.GITHUB_TOKEN }}
               path-to-lcov: coverage.lcov

Member


ChatGPT says path-to-lcov is deprecated.
It suggested the following:

      - name: Coveralls
        uses: coverallsapp/github-action@v2
        if: runner.os == 'Linux' && matrix.python-version == '3.13'
        with:
          github-token: ${{ secrets.GITHUB_TOKEN }}
          file: coverage.lcov
          format: lcov
          fail-on-error: false
