-
Notifications
You must be signed in to change notification settings - Fork 133
ci: harden Phoenix CI against OOM, stale NFS handles, and transient SLURM errors #1304
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
18ff901
9acb375
ab10616
381dc7e
7f467b1
e864803
ff18b55
b55a21e
38fca24
084f31a
ecd2b22
5e67dec
7ea9da5
defdd7c
65e2642
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,15 @@ | ||
| #!/bin/bash | ||
| # Provides clean_build(): renames build/ aside and deletes it in the background. | ||
| # mv is a metadata-only operation that succeeds even with stale NFS file handles, | ||
| # unlike rm -rf which fails on ESTALE. The background delete is best-effort and | ||
| # scoped to this job's PID to avoid races with concurrent matrix jobs. | ||
| # | ||
| # Usage: source .github/scripts/clean-build.sh | ||
| # clean_build | ||
|
|
||
| clean_build() { | ||
| # Clean up leftover stale directories from previous runs before adding a new one. | ||
| rm -rf build.stale.* 2>/dev/null || true | ||
| mv build "build.stale.$$" 2>/dev/null || true | ||
| rm -rf "build.stale.$$" 2>/dev/null & disown | ||
| } | ||
| Original file line number | Diff line number | Diff line change | ||||||||
|---|---|---|---|---|---|---|---|---|---|---|
| @@ -0,0 +1,40 @@ | ||||||||||
| #!/bin/bash | ||||||||||
| # Provides retry_sbatch(): submits a job script string via sbatch with retries. | ||||||||||
| # Only retries on known transient SLURM/infrastructure errors (socket timeouts, | ||||||||||
| # connection failures). Hard failures (bad account, invalid partition, QOS | ||||||||||
| # violations) are not retried. | ||||||||||
| # | ||||||||||
| # Usage: source .github/scripts/retry-sbatch.sh | ||||||||||
| # job_id=$(retry_sbatch "$script_contents") | ||||||||||
|
|
||||||||||
| retry_sbatch() { | ||||||||||
| local script_contents="$1" | ||||||||||
| local max_attempts=3 | ||||||||||
| local attempt=1 | ||||||||||
| local submit_output job_id last_output="" | ||||||||||
|
|
||||||||||
| while [ $attempt -le $max_attempts ]; do | ||||||||||
| echo "sbatch attempt $attempt of $max_attempts..." >&2 | ||||||||||
| submit_output=$(printf '%s\n' "$script_contents" | sbatch 2>&1) || true | ||||||||||
| job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$') | ||||||||||
sbryngelson marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||
| if [ -n "$job_id" ]; then | ||||||||||
|
Comment on lines
+19
to
+20
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. 🧩 Analysis chain🏁 Script executed: #!/bin/bash
set -u
echo "Checking current extraction pattern in retry-sbatch.sh:"
rg -n "job_id=.*Submitted batch job" .github/scripts/retry-sbatch.sh
echo
echo "Reproducing strict-mode early-exit behavior:"
output="$(
bash -lc '
set -euo pipefail
submit_output="Socket timed out on send/recv operation"
job_id=$(echo "$submit_output" | grep -oE "Submitted batch job ([0-9]+)" | grep -oE "[0-9]+$")
echo "RETRY_LOGIC_REACHED"
' 2>&1
)"
status=$?
printf 'exit_status=%s\n' "$status"
printf '%s\n' "$output"
echo
echo "Expected: non-zero exit_status and no RETRY_LOGIC_REACHED marker."Repository: MFlowCode/MFC Length of output: 444 🏁 Script executed: cat -n .github/scripts/retry-sbatch.shRepository: MFlowCode/MFC Length of output: 1834 🏁 Script executed: rg -n "retry.sbatch|retry_sbatch" --type sh --type bashRepository: MFlowCode/MFC Length of output: 85 🏁 Script executed: rg -n "retry.sbatch|retry_sbatch"Repository: MFlowCode/MFC Length of output: 39 🏁 Script executed: rg -n "retry-sbatch"Repository: MFlowCode/MFC Length of output: 39 🏁 Script executed: git ls-files | head -20
find .github -type f -name "*.sh" | head -20Repository: MFlowCode/MFC Length of output: 1190 🏁 Script executed: rg -l "retry-sbatch|retry_sbatch" .github/Repository: MFlowCode/MFC Length of output: 124 🏁 Script executed: cat -n .github/scripts/submit-slurm-job.shRepository: MFlowCode/MFC Length of output: 6880 Guard job-id extraction to prevent early exit under strict mode. At line 19, the Proposed fix- job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
+ job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true)📝 Committable suggestion
Suggested change
|
||||||||||
| echo "$job_id" | ||||||||||
| return 0 | ||||||||||
sbryngelson marked this conversation as resolved.
Show resolved
Hide resolved
|
||||||||||
| fi | ||||||||||
| last_output="$submit_output" | ||||||||||
| echo "sbatch failed: $submit_output" >&2 | ||||||||||
| if ! echo "$submit_output" | grep -qiE "timed out|connection refused|connection reset|temporarily unavailable"; then | ||||||||||
| echo "Non-transient sbatch failure — not retrying." >&2 | ||||||||||
| return 1 | ||||||||||
| fi | ||||||||||
| if [ $attempt -lt $max_attempts ]; then | ||||||||||
| echo "Transient error — retrying in 30s..." >&2 | ||||||||||
| sleep 30 | ||||||||||
| fi | ||||||||||
| attempt=$((attempt + 1)) | ||||||||||
| done | ||||||||||
|
|
||||||||||
| echo "sbatch failed after $max_attempts attempts. Last error: $last_output" >&2 | ||||||||||
| return 1 | ||||||||||
| } | ||||||||||
|
|
||||||||||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Scope stale cleanup to this job’s PID.
Line 12 currently deletes
build.stale.*, which is broader than the PID-scoped behavior described in this script and can reintroduce cross-job cleanup races in shared workspaces. Keep deletion scoped tobuild.stale.$$.Proposed fix
clean_build() { - # Clean up leftover stale directories from previous runs before adding a new one. - rm -rf build.stale.* 2>/dev/null || true + # Clean only this process namespace to avoid cross-job races. + rm -rf "build.stale.$$" 2>/dev/null || true mv build "build.stale.$$" 2>/dev/null || true rm -rf "build.stale.$$" 2>/dev/null & disown }