Merged
15 commits
15 changes: 15 additions & 0 deletions .github/scripts/clean-build.sh
@@ -0,0 +1,15 @@
#!/bin/bash
# Provides clean_build(): renames build/ aside and deletes it in the background.
# mv is a metadata-only operation that succeeds even with stale NFS file handles,
# unlike rm -rf which fails on ESTALE. The background delete is best-effort and
# scoped to this job's PID to avoid races with concurrent matrix jobs.
#
# Usage: source .github/scripts/clean-build.sh
# clean_build

clean_build() {
# Clean up leftover stale directories from previous runs before adding a new one.
rm -rf build.stale.* 2>/dev/null || true
mv build "build.stale.$$" 2>/dev/null || true
rm -rf "build.stale.$$" 2>/dev/null & disown
Comment on lines +12 to +14
Contributor
⚠️ Potential issue | 🟠 Major

Scope stale cleanup to this job’s PID.

Line 12 currently deletes build.stale.*, which is broader than the PID-scoped behavior described in this script and can reintroduce cross-job cleanup races in shared workspaces. Keep deletion scoped to build.stale.$$.

Proposed fix
 clean_build() {
-    # Clean up leftover stale directories from previous runs before adding a new one.
-    rm -rf build.stale.* 2>/dev/null || true
+    # Clean only this process namespace to avoid cross-job races.
+    rm -rf "build.stale.$$" 2>/dev/null || true
     mv build "build.stale.$$" 2>/dev/null || true
     rm -rf "build.stale.$$" 2>/dev/null & disown
 }

}
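As a sanity check of the pattern, the function can be exercised in a scratch directory. A minimal sketch (the paths and the marker variable are illustrative, and it uses the PID-scoped cleanup variant suggested in the review rather than the glob):

```shell
#!/bin/bash
# Demonstrates the rename-then-background-delete pattern: mv takes the
# directory out of the way instantly, so callers never see a half-deleted
# build/, while the slow recursive rm proceeds in the background.
set -u
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p build/obj && touch build/obj/a.o

clean_build() {
    rm -rf "build.stale.$$" 2>/dev/null || true  # PID-scoped variant
    mv build "build.stale.$$" 2>/dev/null || true
    rm -rf "build.stale.$$" 2>/dev/null & disown
}

clean_build
# build/ is gone immediately, even if the background delete is still running.
if [ ! -d build ]; then result=cleared; else result=present; fi
echo "$result"
cd /tmp && rm -rf "$workdir" 2>/dev/null || true
```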
3 changes: 2 additions & 1 deletion .github/scripts/prebuild-case-optimization.sh
@@ -22,7 +22,8 @@ case "$cluster" in
*) echo "ERROR: Unknown cluster '$cluster'"; exit 1 ;;
esac

-rm -rf build
+source .github/scripts/clean-build.sh
+clean_build

. ./mfc.sh load -c "$flag" -m g

40 changes: 40 additions & 0 deletions .github/scripts/retry-sbatch.sh
@@ -0,0 +1,40 @@
#!/bin/bash
# Provides retry_sbatch(): submits a job script string via sbatch with retries.
# Only retries on known transient SLURM/infrastructure errors (socket timeouts,
# connection failures). Hard failures (bad account, invalid partition, QOS
# violations) are not retried.
#
# Usage: source .github/scripts/retry-sbatch.sh
# job_id=$(retry_sbatch "$script_contents")

retry_sbatch() {
local script_contents="$1"
local max_attempts=3
local attempt=1
local submit_output job_id last_output=""

while [ $attempt -le $max_attempts ]; do
echo "sbatch attempt $attempt of $max_attempts..." >&2
submit_output=$(printf '%s\n' "$script_contents" | sbatch 2>&1) || true
job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
if [ -n "$job_id" ]; then
Comment on lines +19 to +20
Contributor
⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -u

echo "Checking current extraction pattern in retry-sbatch.sh:"
rg -n "job_id=.*Submitted batch job" .github/scripts/retry-sbatch.sh

echo
echo "Reproducing strict-mode early-exit behavior:"
output="$(
  bash -lc '
    set -euo pipefail
    submit_output="Socket timed out on send/recv operation"
    job_id=$(echo "$submit_output" | grep -oE "Submitted batch job ([0-9]+)" | grep -oE "[0-9]+$")
    echo "RETRY_LOGIC_REACHED"
  ' 2>&1
)"
status=$?
printf 'exit_status=%s\n' "$status"
printf '%s\n' "$output"

echo
echo "Expected: non-zero exit_status and no RETRY_LOGIC_REACHED marker."

Repository: MFlowCode/MFC

Guard job-id extraction to prevent early exit under strict mode.

At line 19, the grep pipeline in the command substitution can fail without matching and return non-zero. Since the caller script (.github/scripts/submit-slurm-job.sh) uses set -euo pipefail and sources this function, the unguarded pipeline causes the script to exit before the retry logic at lines 20–35 executes, breaking the entire retry mechanism.

Proposed fix
-        job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$')
+        job_id=$(echo "$submit_output" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true)
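The difference the guard makes can be checked in isolation. A sketch with hard-coded sbatch outputs (no SLURM required; the messages mirror the ones discussed above):

```shell
#!/bin/bash
set -euo pipefail

# Guarded extraction: an unmatched grep yields an empty string instead of
# aborting the caller under set -e.
extract_job_id() {
    echo "$1" | grep -oE 'Submitted batch job ([0-9]+)' | grep -oE '[0-9]+$' || true
}

ok=$(extract_job_id "Submitted batch job 12345")
bad=$(extract_job_id "Socket timed out on send/recv operation")
echo "ok=${ok} bad=${bad}"   # ok=12345 bad=
```

Without the trailing `|| true`, the second call would terminate the script before any retry logic could run.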
echo "$job_id"
return 0
fi
last_output="$submit_output"
echo "sbatch failed: $submit_output" >&2
if ! echo "$submit_output" | grep -qiE "timed out|connection refused|connection reset|temporarily unavailable"; then
echo "Non-transient sbatch failure — not retrying." >&2
return 1
fi
if [ $attempt -lt $max_attempts ]; then
echo "Transient error — retrying in 30s..." >&2
sleep 30
fi
attempt=$((attempt + 1))
done

echo "sbatch failed after $max_attempts attempts. Last error: $last_output" >&2
return 1
}

19 changes: 8 additions & 11 deletions .github/scripts/submit-slurm-job.sh
@@ -85,9 +85,9 @@ if [ "$device" = "cpu" ]; then
case "$cluster" in
phoenix)
sbatch_device_opts="\
-#SBATCH -p cpu-small
-#SBATCH --ntasks-per-node=24
-#SBATCH --mem-per-cpu=2G"
+#SBATCH -p cpu-small,cpu-medium,cpu-large
+#SBATCH --ntasks-per-node=12
+#SBATCH --mem-per-cpu=8G"
;;
frontier|frontier_amd)
sbatch_device_opts="\
@@ -161,8 +161,9 @@ rm -f "$output_file"
# --- Module load mode (short form) ---
module_mode=$([ "$device" = "gpu" ] && echo "g" || echo "c")

-# --- Submit ---
-submit_output=$(sbatch <<EOT
+# --- Submit (with retries for transient SLURM errors) ---
+source "${SCRIPT_DIR}/retry-sbatch.sh"
+_sbatch_script=$(cat <<EOT
#!/bin/bash
#SBATCH -J ${job_prefix}-${job_slug}
#SBATCH --account=${account}
Expand Down Expand Up @@ -192,12 +193,8 @@ $sbatch_script_contents
EOT
)

-job_id=$(echo "$submit_output" | grep -oE '[0-9]+')
-if [ -z "$job_id" ]; then
-    echo "ERROR: Failed to submit job. sbatch output:"
-    echo "$submit_output"
-    exit 1
-fi
+job_id=$(retry_sbatch "$_sbatch_script")
+unset _sbatch_script

echo "Submitted batch job $job_id"
echo "$job_id" > "$id_file"
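The reason for switching from `sbatch <<EOT` to a captured variable is that a heredoc feeds a command's stdin exactly once, while a variable can be resubmitted on every retry attempt. The capture-then-pipe pattern in isolation (the `consume` stub stands in for sbatch, and `<<'EOT'` is quoted here so nothing expands, whereas the real script uses an unquoted `EOT` so `${account}` and friends expand at capture time):

```shell
#!/bin/bash
set -u

# Capture the batch script once into a variable.
_script=$(cat <<'EOT'
#!/bin/bash
echo "line one"
echo "line two"
EOT
)

# The same bytes can now be piped to the consumer any number of times.
consume() { wc -l | tr -d ' '; }
first=$(printf '%s\n' "$_script" | consume)
second=$(printf '%s\n' "$_script" | consume)
echo "$first $second"   # 3 3
```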
3 changes: 2 additions & 1 deletion .github/workflows/common/bench.sh
@@ -25,7 +25,8 @@ fi
# Phoenix builds inside SLURM; Frontier pre-builds via build.sh on the login node.
# Phoenix: always nuke stale builds (heterogeneous compute nodes → ISA mismatch risk).
if [ "$job_cluster" = "phoenix" ]; then
-rm -rf build
+source .github/scripts/clean-build.sh
+clean_build
fi

if [ ! -d "build" ]; then
16 changes: 15 additions & 1 deletion .github/workflows/common/test.sh
@@ -8,12 +8,26 @@ set -euo pipefail
source .github/scripts/gpu-opts.sh
build_opts="$gpu_opts"

+# --- Phoenix TMPDIR setup ---
+# Phoenix compute nodes have a small /tmp. With 8 parallel test threads each
+# spawning MPI processes, it fills up and ORTE session dir creation fails.
+# Redirect TMPDIR to project storage, same as bench.sh.
+if [ "$job_cluster" = "phoenix" ]; then
+    tmpbuild=/storage/project/r-sbryngelson3-0/sbryngelson3/mytmp_build
+    currentdir=$tmpbuild/run-$(( RANDOM % 9000 ))
+    mkdir -p $tmpbuild
+    mkdir -p $currentdir
+    export TMPDIR=$currentdir
+    trap 'rm -rf "$currentdir" || true' EXIT
+fi

# --- Build (if not pre-built on login node) ---
-# Phoenix builds inside SLURM; Frontier pre-builds via build.sh on the login node.
+# Phoenix builds inside SLURM on heterogeneous compute nodes — always start fresh
+# to avoid SIGILL from stale binaries compiled on a different microarchitecture.
if [ "$job_cluster" = "phoenix" ]; then
-rm -rf build
+source .github/scripts/clean-build.sh
+clean_build
fi

if [ ! -d "build" ]; then
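One caveat on the TMPDIR setup in test.sh: `run-$(( RANDOM % 9000 ))` can collide when several matrix jobs start at once, since the 9000-name space is small. `mktemp -d` is the usual collision-free alternative; a sketch (the base path is illustrative, not the Phoenix project path):

```shell
#!/bin/bash
set -u

# Per-run TMPDIR under a roomy base directory; mktemp -d guarantees a unique
# name even when many jobs race, unlike a RANDOM-derived suffix.
tmpbase=${TMPBASE:-/tmp}/mytmp_build
mkdir -p "$tmpbase"
rundir=$(mktemp -d "$tmpbase/run-XXXXXX")
export TMPDIR=$rundir
trap 'rm -rf "$rundir" || true' EXIT

# Anything honoring TMPDIR (mktemp, MPI session dirs, ...) now lands here.
scratch=$(mktemp)
case "$scratch" in
    "$rundir"/*) placement=inside ;;
    *)           placement=outside ;;
esac
echo "$placement"   # inside
```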
3 changes: 2 additions & 1 deletion .github/workflows/frontier/build.sh
@@ -20,7 +20,8 @@ build_opts="$gpu_opts"

. ./mfc.sh load -c $compiler_flag -m $([ "$job_device" = "gpu" ] && echo "g" || echo "c")

-rm -rf build
+source .github/scripts/clean-build.sh
+clean_build

source .github/scripts/retry-build.sh
if [ "$run_bench" == "bench" ]; then