64 changes: 64 additions & 0 deletions incidents/2026-03-08.md
@@ -0,0 +1,64 @@
# 2026-03-08 Incident Report

- Incident Commander: @ryanaslett
- Severity Level: P1

For a brief period, the macOS installer package (`.pkg`) for Node.js v22.22.1 was served as an outdated build whose SHA256 checksum did not match the published one, because the rclone upload step failed during a Jenkins job re-run. Despite having a different hash, this file was generated and signed legitimately by the Node.js CI and was safe to run.

## Timeline

- **2026-03-08 17:14 UTC**: Start of impact. First Jenkins build completed successfully, uploading `node-v22.22.1.pkg` (SHA256: `1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) to R2.

- **2026-03-08 21:00 UTC**: Second Jenkins build completed, uploading the corrected `node-v22.22.1.pkg` (SHA256: `ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`) to direct.nodejs.org. The rclone step to R2 failed, leaving R2 (which serves most users via the release worker) with the outdated file.

- **2026-03-08 22:04 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) created.

- **2026-03-08 23:52 UTC**: Initial report forwarded to [OpenJS Slack](https://openjs-foundation.slack.com/archives/C09EXEEHFKP/p1773013976217429), investigation began.

- **2026-03-09 00:12 UTC**: Initial report of incident [nodejs/release-cloudflare-worker#878](https://github.com/nodejs/release-cloudflare-worker/issues/878) acknowledged.

- **2026-03-09 00:33 UTC**: Team confirmed both files were legitimately signed by Apple at different times (17:14 and 21:00 UTC).

- **2026-03-09 00:41 UTC**: Root cause identified - Jenkins job re-run uploaded to www but failed to sync to R2, causing version mismatch.

- **2026-03-09 01:25 UTC**: Corrected macOS installer package (`.pkg`) promoted.

- **2026-03-09 01:29 UTC**: Cache purged. Impact resolved.

## Impact

Users downloading the macOS installer package from `https://nodejs.org/dist/v22.22.1/node-v22.22.1.pkg` received a file whose SHA256 checksum (`1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448`) did not match the checksum published in [`SHASUMS256.txt`](https://nodejs.org/dist/latest-v22.x/SHASUMS256.txt) (`ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd`).

Both files were legitimately signed by the Node.js Foundation Apple Developer account, but represented different build artifacts from separate Jenkins runs. The file served from direct.nodejs.org was correct, but Cloudflare R2 (serving most users via the release worker) contained the outdated version.
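
The mismatch described above is the kind a downloader can detect locally. The following is a minimal sketch, not part of the Node.js tooling: it assumes `SHASUMS256.txt` uses the usual `<hex digest>  <file name>` layout with one entry per line, and the helper names (`parse_shasums`, `sha256_of`, `matches_published`) are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import Dict


def parse_shasums(text: str) -> Dict[str, str]:
    """Map file name -> expected hex digest.

    Assumes each line looks like "<hex digest>  <file name>".
    Lines that do not split into exactly two fields are ignored.
    """
    expected = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2:
            digest, name = parts
            expected[name] = digest.lower()
    return expected


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so a large installer is never fully in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def matches_published(path: Path, shasums_text: str) -> bool:
    """True only if the file is listed and its digest equals the listed one."""
    expected = parse_shasums(shasums_text).get(path.name)
    return expected is not None and sha256_of(path) == expected
```

During the incident, a check like this against the file served from `nodejs.org` would have returned `False`, while the same check against the file on direct.nodejs.org would have returned `True`.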

## Root Cause

A workflow issue in the Jenkins release process allowed files to become out of sync between direct.nodejs.org (www) and the R2 bucket.

The release process works as follows:
1. Jenkins builds the macOS package and signs it
2. The package is copied to direct.nodejs.org via `scp`
3. Jenkins SSHs into direct and uses `rclone` to copy the file from www to R2 dist-staging

During the v22.22.1 release:
1. The first Jenkins job (17:14 UTC) completed successfully, uploading the initial signed package to both direct and R2
2. The job was re-run, producing a new signed package at 21:00 UTC
3. The second run successfully copied the new package to direct
4. The `rclone` step to R2 failed with `kex_exchange_identification: Connection closed by remote host`
5. The Jenkins job marked the build as failed but did not roll back the direct upload

This left `direct.` with the correct file (matching SHASUMS256.txt) while R2 served the outdated file, creating a checksum mismatch for most users.
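
The failure mode in the steps above is a two-destination upload with no rollback. The sketch below illustrates one way to avoid leaving the destinations out of sync. It is an in-memory model only: plain dictionaries stand in for direct.nodejs.org and the R2 bucket, and `publish_to_both`, `copy`, and `broken_copy` are hypothetical names, not part of the Jenkins job.

```python
from typing import Callable, Dict

# A dictionary stands in for a remote destination (assumption for illustration).
Store = Dict[str, bytes]


def publish_to_both(name: str, data: bytes, origin: Store, mirror: Store,
                    sync: Callable[[Store, str, bytes], None]) -> None:
    """Write to the origin, then sync to the mirror.

    If the sync raises, restore the origin to its previous state so both
    destinations keep serving the same artifact, then re-raise.
    """
    had_previous = name in origin
    previous = origin.get(name)
    origin[name] = data
    try:
        sync(mirror, name, data)
    except Exception:
        if had_previous:
            origin[name] = previous
        else:
            del origin[name]
        raise


def copy(store: Store, name: str, data: bytes) -> None:
    """A sync step that succeeds."""
    store[name] = data


def broken_copy(store: Store, name: str, data: bytes) -> None:
    """A sync step that fails the way the re-run did."""
    raise ConnectionError("Connection closed by remote host")
```

With `broken_copy`, the origin is restored to the first build and the error still propagates, so the job fails without leaving the two destinations serving different files. Whether rolling back or retrying the sync is preferable depends on details of the release process that this sketch does not model.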

## Fix

The immediate fix was to manually sync the correct file from direct.nodejs.org to the R2 dist-staging bucket using `rclone copyto`.

## Follow-up Work

- Improve Jenkins workflow to prevent partial uploads when rclone fails
- Either roll back www uploads if R2 sync fails, or upload to both destinations atomically
- Add verification step to compare checksums between www and R2 before marking build as complete
- Add monitoring/alerting for checksum mismatches between distribution sources
- Investigate why the rclone SSH connection failed mid-release
- Consider adding checksum verification as part of the promotion workflow
- Add better logging/auditing for release builds to track which artifacts were uploaded where and when
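
As an illustration of the verification and monitoring items above, here is a minimal sketch of a checksum comparison between two distribution sources. It assumes the digests for each source have already been collected into dictionaries keyed by file name. How they are collected is out of scope, and `find_mismatches` is a hypothetical helper.

```python
from typing import Dict, Optional, Tuple


def find_mismatches(
    origin: Dict[str, str], mirror: Dict[str, str]
) -> Dict[str, Tuple[Optional[str], Optional[str]]]:
    """Return {name: (origin_digest, mirror_digest)} for every file whose
    digest differs between the two sources or that exists in only one."""
    problems = {}
    for name in sorted(set(origin) | set(mirror)):
        a, b = origin.get(name), mirror.get(name)
        if a != b:
            problems[name] = (a, b)
    return problems


# The two digests observed during this incident.
origin = {"node-v22.22.1.pkg": "ac8cb570db59cb399be96978c194f6c4fc91ffcf11a197ebd5461083c0cf1dfd"}
mirror = {"node-v22.22.1.pkg": "1fbe9cd7e9fdce6cf150bbe59cb97a426434f7fb217135d10124a62bfb697448"}
mismatches = find_mismatches(origin, mirror)
```

A build or monitoring job could fail or alert whenever the returned dictionary is non-empty.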