fix(core): retry Chromium download with idle timeout (PER-9192) #2271

Open
pranavz28 wants to merge 1 commit into percy:master from pranavz28:fix/PER-9192-chromium-download-retry

@pranavz28

Contributor

Fixes PER-9192

Problem

Customers' CI builds intermittently fail with the Chromium download frozen near the end (e.g. Downloading Chromium 1300309 [=========== ] 167.3MB/178.1MB 93%). The build is created but uploads zero snapshots and is later reaped as manually_failed_build. It works locally because Chromium is already cached there; each CI runner starts cold and re-downloads the full ~178 MB, exposing it to network flakiness.

The Chromium artifact itself is fine on Google's bucket (HTTP 200, correct content-length, Range/resume supported) — the failure is a transient interruption of the connection that the CLI does not recover from.

Root cause

In packages/core/src/install.js, download() performed a single https.get piped straight to a file:

  • no socket/idle timeout → if data stops arriving but the socket isn't reset, the promise hangs until the whole CI job times out (the "stuck at 93%" symptom),
  • no retry → one blip is fatal, even though an immediate retry almost always succeeds,
  • no resume → a rerun re-downloads from scratch.

Fix

  • Extract the per-attempt download into fetchToFile() with an idle socket timeout (request.setTimeout → request.destroy) so a stalled connection fails fast instead of hanging.
  • Wrap it in a retry loop that removes the partial archive between attempts.
  • Both are tunable via environment variables: PERCY_CHROMIUM_DOWNLOAD_TIMEOUT (idle timeout, default 30000 ms) and PERCY_CHROMIUM_DOWNLOAD_RETRIES (default 3).
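
As a rough illustration of the approach in the list above, here is a minimal sketch. It is not the code from this PR: the helper names envInt and downloadWithRetry, the error messages, and the injectable attempt option are hypothetical, and the sketch assumes PERCY_CHROMIUM_DOWNLOAD_RETRIES counts total attempts, which the description does not specify.

```javascript
const fs = require('fs');
const https = require('https');
const { pipeline } = require('stream');

// Hypothetical helper: read a positive integer from the environment,
// falling back to a default when the variable is unset or invalid.
function envInt(name, fallback) {
  const value = parseInt(process.env[name], 10);
  return Number.isInteger(value) && value > 0 ? value : fallback;
}

// One download attempt. Rejects when the socket stays idle for `timeout` ms.
function fetchToFile(url, file, timeout) {
  return new Promise((resolve, reject) => {
    const request = https.get(url, response => {
      if (response.statusCode !== 200) {
        response.resume(); // drain the body so the socket is released
        reject(new Error(`Download failed: ${response.statusCode} - ${url}`));
        return;
      }
      // pipeline() destroys both streams if either side errors or closes early
      pipeline(response, fs.createWriteStream(file), error => {
        if (error) reject(error);
        else resolve();
      });
    });

    // The timeout callback only signals idleness; the request has to be
    // destroyed explicitly for the stalled attempt to fail.
    request.setTimeout(timeout, () => {
      request.destroy(new Error(`Download stalled for ${timeout}ms - ${url}`));
    });
    request.on('error', reject);
  });
}

// Retry wrapper (assumption: `retries` is the total number of attempts).
async function downloadWithRetry(url, file, {
  timeout = envInt('PERCY_CHROMIUM_DOWNLOAD_TIMEOUT', 30000),
  retries = envInt('PERCY_CHROMIUM_DOWNLOAD_RETRIES', 3),
  attempt = fetchToFile // injectable so the loop can be exercised offline
} = {}) {
  let lastError;
  for (let i = 1; i <= retries; i++) {
    try {
      return await attempt(url, file, timeout);
    } catch (error) {
      lastError = error;
      // Remove the partial archive so the next attempt starts clean
      await fs.promises.rm(file, { force: true });
    }
  }
  throw lastError;
}
```

In this sketch a permanent failure such as a 404 is retried like any other error and rejects only after the last attempt, which matches the behaviour described for the PR.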

This converts an indefinite hang into bounded, self-healing attempts. Behaviour is unchanged on the happy path and for permanent failures (e.g. a 404 still ultimately rejects, after exhausting retries).

Testing

  • packages/client/test/helpers.js: taught the shared MockRequest helper to model setTimeout/destroy.
  • packages/core/test/unit/install.test.js: added two specs, "retries the download when an attempt fails" and "gives up after exhausting download retries".
  • The three specs exercising the changed download() path ("handles failed downloads" plus the two new ones) pass.
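
To show what modelling setTimeout/destroy in a request mock can look like, here is a small sketch. The FakeRequest class, its stall() trigger, and the rejectOnStall function are hypothetical and are not the MockRequest helper from packages/client/test/helpers.js. Unlike a real Node.js request, this fake emits its events synchronously.

```javascript
const { EventEmitter } = require('events');

// Hypothetical stand-in for a client request in unit tests.
class FakeRequest extends EventEmitter {
  constructor() {
    super();
    this.destroyed = false;
    this.idleTimeout = null;
  }

  // Like request.setTimeout(): records the delay and registers the
  // callback as a one-time 'timeout' listener. No real timer is started.
  setTimeout(ms, callback) {
    this.idleTimeout = ms;
    if (callback) this.once('timeout', callback);
    return this;
  }

  // Like request.destroy(): marks the request destroyed and surfaces
  // the error, if any, through the 'error' event.
  destroy(error) {
    if (this.destroyed) return this;
    this.destroyed = true;
    if (error) this.emit('error', error);
    this.emit('close');
    return this;
  }

  // Test-only trigger that simulates the socket going idle.
  stall() {
    this.emit('timeout');
  }
}

// Code under test: fail the pending operation when the request stalls.
function rejectOnStall(request, timeout) {
  return new Promise((resolve, reject) => {
    request.on('error', reject);
    request.setTimeout(timeout, () => {
      request.destroy(new Error(`stalled for ${timeout}ms`));
    });
  });
}
```

A spec can then call stall() on the fake and assert that the returned promise rejects, without waiting for a real timer.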

Note for reviewers

A momentary network glitch in CI currently kills the whole build. After this change it retries and recovers. The recommended customer-side mitigation (pre-installing a browser + PERCY_BROWSER_EXECUTABLE) still applies for fully air-gapped runners; this just makes the default path resilient.

🤖 Generated with Claude Code

The Chromium downloader performed a single https.get piped to a file with
no socket timeout, no retry, and no resume. A transient network stall in
CI (data stops arriving but the socket is never reset) left the download
frozen indefinitely — the classic "stuck at 93%" symptom — until the CI
job itself timed out, failing the build with zero snapshots.

Extract the per-attempt download into fetchToFile() with an idle socket
timeout (request.setTimeout -> destroy) so a stalled connection fails fast,
and wrap it in a retry loop that removes the partial archive between
attempts. Both are env-tunable via PERCY_CHROMIUM_DOWNLOAD_TIMEOUT (default
30s) and PERCY_CHROMIUM_DOWNLOAD_RETRIES (default 3).

Adds setTimeout/destroy to the shared MockRequest test helper and tests
covering retry-then-succeed and retry-exhaustion.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
pranavz28 requested a review from a team as a code owner on June 9, 2026, 10:39