fix(ci): make redis_setexz a safe no-op when redis is unavailable #24520
Draft
AztecBot wants to merge 1 commit into
Problem
The nightly barretenberg debug build failed immediately, before it could even request an EC2 instance (failing run).
Root cause
`ci3/bootstrap_ec2` writes an initial log entry to redis at the very start of a run by piping it into `redis_setexz`, which compressed its stdin with `gzip` and piped the result into `redis_cli`.
When redis is unavailable, `redis_cli` (in `ci3/source_redis`) is a no-op that returns without reading its stdin. `gzip` then writes into a pipe whose read end is already closed and dies with `SIGPIPE` / "Broken pipe". Because ci3 runs under `set -euo pipefail`, that broken-pipe failure propagates out of the pipeline and kills the whole script before any instance is launched.
This is the direct cause of the observed failure: the run happened on the `aztec-claude` mirror, whose environment has no AWS creds and no `BUILD_INSTANCE_SSH_KEY` (all empty in the job env), so the redis tunnel is never opened and `CI_REDIS_AVAILABLE=0`. `source_redis` intentionally degrades gracefully in that case ("Log and test cache will be disabled"), but the very next `redis_setexz` call defeats that graceful degradation.
The same latent bug would take down the real nightly (or any CI run) any time the redis tunnel fails to establish, since `bootstrap_ec2:24`, `cache_log`'s `publish_log`, and `run_test_cmd` all call `redis_setexz` unguarded, relying on it being a safe no-op. (`denoise` already works around it by guarding on `CI_REDIS_AVAILABLE` before calling it.)
Fix
Make `redis_setexz` itself a proper no-op when redis is unavailable, mirroring the existing `redis_cli` guard, and still drain stdin so the upstream pipeline producer doesn't get `SIGPIPE`. This fixes all unguarded callers at once and keeps the redis-available path unchanged.
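The PR's actual diff is not reproduced on this page, so the following is only a minimal sketch of what such a guard could look like, not the merged code. It assumes `CI_REDIS_AVAILABLE` is `1` when the tunnel is up, that `redis_setexz` takes a key and a TTL in seconds, and that the redis-available path stores the gzipped payload with `redis-cli -x SETEX`. The `redis_cli` function here is a hypothetical stand-in for the real one in `ci3/source_redis`.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Assumption: 1 means the redis tunnel is open, anything else means it is not.
CI_REDIS_AVAILABLE=${CI_REDIS_AVAILABLE:-0}

# Hypothetical stand-in for redis_cli in ci3/source_redis:
# a no-op that does not read stdin when redis is unavailable.
redis_cli() {
  if [ "$CI_REDIS_AVAILABLE" = "1" ]; then
    redis-cli "$@"
  fi
}

# Sketch of the guarded helper. $1 is the key, $2 the TTL in seconds.
redis_setexz() {
  if [ "$CI_REDIS_AVAILABLE" != "1" ]; then
    # Drain stdin so the upstream producer never writes into a closed pipe.
    cat >/dev/null
    return 0
  fi
  # Redis-available path: compress stdin and store it under the key with a TTL.
  gzip -c | redis_cli -x SETEX "$1" "$2" >/dev/null
}

# With redis unavailable, the pipeline succeeds and the script keeps going.
echo "CI booting..." | redis_setexz k 300
echo "script continues"
```

Draining with `cat >/dev/null` rather than returning immediately is the point of the design: the guard lives in one place, and every unguarded caller that pipes into `redis_setexz` is protected without being changed.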
Verification
Reproduced the failure under `set -euo pipefail` with `CI_REDIS_AVAILABLE=0`:
Before the fix: `echo "CI booting..." | redis_setexz k 300` → pipeline dies with exit 141 (`SIGPIPE`) / "gzip: stdout: Broken pipe".
After the fix: the same command returns `0` and the script continues; the redis-available path (stub `redis_cli` consuming stdin) also returns `0`.
Notes
The scheduled nightly is guarded to only run on `AztecProtocol/aztec-packages` (`barretenberg-nightly-debug-build.yml` line 14); the failing run came from the `aztec-claude` mirror's older copy of that workflow that predates the guard, which is why it ran there at all. That mirror will self-correct on its next upstream sync. This PR fixes the underlying robustness bug that the mirror's credential-less environment exposed.
Created by claudebox · group: `slackbot`