Closed
Changes from 5 commits
38 commits
e779ed4
Switch Phoenix GPU jobs to H200 nodes for faster scheduling
sbryngelson Feb 10, 2026
9cf00d3
Fix bash segfault in monitor_slurm_job.sh from fractional read timeout
sbryngelson Feb 12, 2026
7faf2d6
Merge branch 'master' into ci-fixes
sbryngelson Feb 12, 2026
a59db02
Restore pull_request_review trigger for benchmark workflow
sbryngelson Feb 12, 2026
2efc61e
Auto-retry sporadic test failures in CI
sbryngelson Feb 12, 2026
0658bd3
Preserve exit code for catastrophic test failures
sbryngelson Feb 12, 2026
c6b6f81
Harden SLURM monitor: robust state checks, orphan cleanup
sbryngelson Feb 13, 2026
a82959e
Use parsable sacct flags for robust state parsing
sbryngelson Feb 13, 2026
8022969
Guard squeue/sacct pipelines against set -euo pipefail
sbryngelson Feb 13, 2026
88d19ce
Retry delete_directory on Lustre ENOTEMPTY race
sbryngelson Feb 13, 2026
05d28f3
Remove stale failed_uuids.txt before test run
sbryngelson Feb 13, 2026
9eed0c6
Split benchmark concurrency group by event type
sbryngelson Feb 13, 2026
02c658d
Merge branch 'master' into ci-fixes
sbryngelson Feb 14, 2026
edefc01
Revert Phoenix test jobs to multi-partition GPU scheduling
sbryngelson Feb 15, 2026
2e15ab6
Fix doc lint for generated pages and hyphenated page IDs
sbryngelson Feb 15, 2026
0267756
Merge branch 'master' into ci-fixes
sbryngelson Feb 15, 2026
dfc524c
Add Lustre-safe workspace cleanup for self-hosted runners
sbryngelson Feb 15, 2026
ece1951
Revert Phoenix benchmark jobs to L40S GPU scheduling
sbryngelson Feb 15, 2026
a553a75
Improve Lustre-safe workspace cleanup with dotglob and nullglob
sbryngelson Feb 15, 2026
a149886
Auto-requeue SLURM jobs on preemption
sbryngelson Feb 15, 2026
ed8abd5
Merge branch 'master' into ci-fixes
sbryngelson Feb 15, 2026
273cced
Remove aggressive workspace cleanup from test jobs
sbryngelson Feb 16, 2026
c2e6543
Propagate exit code from test retry command
sbryngelson Feb 16, 2026
c0b1cd1
Restore default checkout clean for test jobs; tune PR reviewer
sbryngelson Feb 16, 2026
7a764d5
Remove aggressive workspace cleanup from bench jobs
sbryngelson Feb 16, 2026
b5dfa1f
Add test sharding for Frontier CI; switch to batch/hackathon partition
sbryngelson Feb 16, 2026
475caa3
Validate --shard argument format and bounds
sbryngelson Feb 16, 2026
ddd95ac
Use nick-fields/retry for Frontier builds; reduce -j to 4
sbryngelson Feb 16, 2026
197813a
Add set -e to Frontier build scripts for fail-fast behavior
sbryngelson Feb 16, 2026
73072bf
Add required timeout_minutes to nick-fields/retry steps
sbryngelson Feb 16, 2026
4f39d1b
Add on_retry_command to clean build state between retry attempts
sbryngelson Feb 17, 2026
eed3af5
Fix bench build: use PIDs instead of job specs for wait
sbryngelson Feb 17, 2026
0f580e8
Disable Frontier AMD bench: IBM case hits zero-size GPU transfer bug
sbryngelson Feb 17, 2026
4ff4167
Restore Frontier AMD bench entry
sbryngelson Feb 17, 2026
0cff03b
Merge branch 'master' into ci-fixes
sbryngelson Feb 19, 2026
cf4a9d0
Add fail-fast: false to self-hosted test matrix
sbryngelson Feb 19, 2026
9031124
Clean stale build artifacts at start of every CI job
sbryngelson Feb 19, 2026
cca9fa8
Revert fail-fast: false on self-hosted test matrix
sbryngelson Feb 19, 2026
4 changes: 2 additions & 2 deletions .github/scripts/monitor_slurm_job.sh
@@ -64,7 +64,7 @@ while true; do
# Try to read from tail output (non-blocking via timeout)
# Read multiple lines if available to avoid falling behind
lines_read=0
- while IFS= read -r -t 0.1 line <&3 2>/dev/null; do
+ while IFS= read -r -t 1 line <&3 2>/dev/null; do
echo "$line"
lines_read=$((lines_read + 1))
last_heartbeat=$(date +%s)
@@ -115,7 +115,7 @@ done
# Drain any remaining output from tail after job completes
echo "Draining remaining output..."
drain_count=0
- while IFS= read -r -t 0.5 line <&3 2>/dev/null; do
+ while IFS= read -r -t 1 line <&3 2>/dev/null; do
echo "$line"
drain_count=$((drain_count + 1))
# Safety limit to avoid infinite loop
65 changes: 7 additions & 58 deletions .github/workflows/bench.yml
@@ -1,85 +1,35 @@
name: 'Benchmark'

on:
- # Trigger when Test Suite completes (no polling needed)
- workflow_run:
- workflows: ["Test Suite"]
- types: [completed]
+ pull_request:
+ pull_request_review:
+ types: [submitted]
workflow_dispatch:

concurrency:
- group: ${{ github.workflow }}-${{ github.event.workflow_run.head_branch || github.ref }}
+ group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true

jobs:
file-changes:
name: Detect File Changes
- # Only run if Test Suite passed (or manual dispatch)
- if: github.event_name == 'workflow_dispatch' || github.event.workflow_run.conclusion == 'success'
runs-on: 'ubuntu-latest'
outputs:
checkall: ${{ steps.changes.outputs.checkall }}
- pr_number: ${{ steps.pr-info.outputs.pr_number }}
- pr_approved: ${{ steps.pr-info.outputs.approved }}
- pr_author: ${{ steps.pr-info.outputs.author }}
steps:
- name: Clone
uses: actions/checkout@v4
- with:
- ref: ${{ github.event.workflow_run.head_sha || github.sha }}

- name: Detect Changes
uses: dorny/paths-filter@v3
id: changes
with:
filters: ".github/file-filter.yml"

- - name: Get PR Info
- id: pr-info
- env:
- GH_TOKEN: ${{ github.token }}
- run: |
- if [ "${{ github.event_name }}" = "workflow_dispatch" ]; then
- echo "pr_number=" >> $GITHUB_OUTPUT
- echo "approved=true" >> $GITHUB_OUTPUT
- echo "author=${{ github.actor }}" >> $GITHUB_OUTPUT
- else
- # Get PR number from workflow_run
- PR_NUMBER="${{ github.event.workflow_run.pull_requests[0].number }}"
- if [ -n "$PR_NUMBER" ]; then
- echo "pr_number=$PR_NUMBER" >> $GITHUB_OUTPUT
-
- # Fetch actual PR author from API (workflow_run.actor is the re-runner, not PR author)
- PR_AUTHOR=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER --jq '.user.login')
- echo "author=$PR_AUTHOR" >> $GITHUB_OUTPUT
-
- # Check if PR is approved
- APPROVED=$(gh api repos/${{ github.repository }}/pulls/$PR_NUMBER/reviews \
- --jq '[.[] | select(.state == "APPROVED")] | length')
- if [ "$APPROVED" -gt 0 ]; then
- echo "approved=true" >> $GITHUB_OUTPUT
- else
- echo "approved=false" >> $GITHUB_OUTPUT
- fi
- else
- echo "pr_number=" >> $GITHUB_OUTPUT
- echo "approved=false" >> $GITHUB_OUTPUT
- echo "author=" >> $GITHUB_OUTPUT
- fi
- fi

self:
name: "${{ matrix.name }} (${{ matrix.device }}${{ matrix.interface != 'none' && format('-{0}', matrix.interface) || '' }})"
- if: >
- github.repository == 'MFlowCode/MFC' &&
- needs.file-changes.outputs.checkall == 'true' &&
- (
- github.event_name == 'workflow_dispatch' ||
- needs.file-changes.outputs.pr_approved == 'true' ||
- needs.file-changes.outputs.pr_author == 'sbryngelson' ||
- needs.file-changes.outputs.pr_author == 'wilfonba'
- )
- needs: [file-changes]
+ if: ${{ github.repository=='MFlowCode/MFC' && needs.file-changes.outputs.checkall=='true' && ((github.event_name=='pull_request_review' && github.event.review.state=='approved') || (github.event_name=='pull_request' && (github.event.pull_request.user.login=='sbryngelson' || github.event.pull_request.user.login=='wilfonba')) || github.event_name=='workflow_dispatch') }}
+ needs: file-changes
strategy:
fail-fast: false
matrix:
@@ -143,7 +93,6 @@ jobs:
- name: Clone - PR
uses: actions/checkout@v4
with:
- ref: ${{ github.event.workflow_run.head_sha || github.sha }}
path: pr

- name: Clone - Master
@@ -155,7 +104,7 @@ jobs:

- name: Setup & Build
if: matrix.build_script != ''
- run: |
+ run: |
(cd pr && ${{ matrix.build_script }}) &
(cd master && ${{ matrix.build_script }}) &
wait %1 && wait %2
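
The new job-level `if:` expression for the `self` job in `bench.yml` packs the whole benchmark gate into one line. As a reading aid, here is a minimal Python sketch of the same predicate. The function name `should_run_bench`, its parameters, and the `TRUSTED_AUTHORS` constant are hypothetical and exist only for illustration; the workflow itself evaluates a GitHub Actions expression, not Python.

```python
# Hypothetical constant: the two logins named in the workflow expression.
TRUSTED_AUTHORS = {"sbryngelson", "wilfonba"}


def should_run_bench(repository, checkall, event_name,
                     review_state=None, pr_author=None):
    """Mirror of the `self` job gate, under the assumptions stated above."""
    # Both preconditions must hold regardless of the triggering event.
    if repository != "MFlowCode/MFC" or checkall != "true":
        return False
    if event_name == "workflow_dispatch":
        return True
    if event_name == "pull_request_review":
        return review_state == "approved"
    if event_name == "pull_request":
        return pr_author in TRUSTED_AUTHORS
    return False


print(should_run_bench("MFlowCode/MFC", "true", "pull_request",
                       pr_author="someone-else"))
print(should_run_bench("MFlowCode/MFC", "true", "pull_request_review",
                       review_state="approved"))
```

Under this reading, an approving review triggers the benchmark for any pull request, while a plain `pull_request` event does so only for the two listed authors.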
5 changes: 2 additions & 3 deletions .github/workflows/phoenix/submit-bench.sh
@@ -20,9 +20,8 @@ sbatch_cpu_opts="\
"

sbatch_gpu_opts="\
- #SBATCH -CL40S
- #SBATCH --ntasks-per-node=4 # Number of cores per node required
- #SBATCH -G2\
+ #SBATCH --gres=gpu:H200:2
+ #SBATCH --ntasks-per-node=8 # Number of cores per node required\
"

if [ "$2" = "cpu" ]; then
5 changes: 2 additions & 3 deletions .github/workflows/phoenix/submit.sh
@@ -23,9 +23,8 @@ sbatch_cpu_opts="\
"

sbatch_gpu_opts="\
- #SBATCH -p gpu-v100,gpu-a100,gpu-h100,gpu-l40s
- #SBATCH --ntasks-per-node=4 # Number of cores per node required
- #SBATCH -G2\
+ #SBATCH --gres=gpu:H200:2
+ #SBATCH --ntasks-per-node=8 # Number of cores per node required\
"

if [ "$2" = "cpu" ]; then
19 changes: 17 additions & 2 deletions .github/workflows/test.yml
@@ -134,8 +134,23 @@ jobs:
TEST_ALL: ${{ matrix.mpi == 'mpi' && '--test-all' || '' }}

- name: Test
- run: |
- /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) $TEST_ALL $TEST_PCT
+ run: |
+ /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) $TEST_ALL $TEST_PCT || true
+
+ # Retry only if a small number of tests failed (sporadic failures)
+ if [ -f tests/failed_uuids.txt ]; then
+ NUM_FAILED=$(wc -l < tests/failed_uuids.txt)
+ if [ "$NUM_FAILED" -le 5 ]; then
+ FAILED=$(cat tests/failed_uuids.txt | tr '\n' ' ')
+ echo ""
+ echo "=== Retrying $NUM_FAILED failed test(s): $FAILED ==="
+ echo ""
+ /bin/bash mfc.sh test -v --max-attempts 3 -j $(nproc) --only $FAILED $TEST_ALL
+ else
+ echo "Too many failures ($NUM_FAILED) to retry — likely a real issue."
+ exit 1
+ fi
+ fi
env:
TEST_ALL: ${{ matrix.mpi == 'mpi' && '--test-all' || '' }}
TEST_PCT: ${{ matrix.debug == 'debug' && '-% 20' || '' }}
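
The retry logic added to the `Test` step amounts to a three-way decision on `tests/failed_uuids.txt`. A minimal Python sketch of that decision follows, assuming the file holds one UUID per line as written by `test.py`. `plan_retry` and `RETRY_LIMIT` are hypothetical names, and `None` stands in for a missing file.

```python
RETRY_LIMIT = 5  # mirrors the `-le 5` threshold in the workflow step


def plan_retry(failed_file_text):
    """Decide what CI does after the first test run.

    failed_file_text: contents of failed_uuids.txt, or None when the file
    does not exist (hypothetical interface, for illustration only).
    """
    if failed_file_text is None:
        # No file: the shell step skips the retry block entirely.
        return ("pass", [])
    # One UUID per line is assumed, so tokens and lines coincide.
    uuids = failed_file_text.split()
    if len(uuids) <= RETRY_LIMIT:
        return ("retry", uuids)
    return ("fail", uuids)


print(plan_retry(None))                      # ('pass', [])
print(plan_retry("A1B2C3D4\nE5F6A7B8\n"))    # ('retry', ['A1B2C3D4', 'E5F6A7B8'])
six = "\n".join(f"UUID{i:04d}" for i in range(6)) + "\n"
print(plan_retry(six)[0])                    # fail
```

Because the first invocation ends in `|| true`, a run that crashes before writing the file falls into the "pass" branch. The later commit "Preserve exit code for catastrophic test failures" in the commit list appears to target that gap.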
9 changes: 9 additions & 0 deletions toolchain/mfc/test/test.py
@@ -206,6 +206,15 @@ def test():
     # Build the summary report
     _print_test_summary(nPASS, nFAIL, nSKIP, minutes, seconds, failed_tests, skipped_cases)

+    # Write failed UUIDs to file for CI retry logic
+    failed_uuids_path = os.path.join(common.MFC_TEST_DIR, "failed_uuids.txt")
+    if failed_tests:
+        with open(failed_uuids_path, "w") as f:
+            for test_info in failed_tests:
+                f.write(test_info['uuid'] + "\n")
+    elif os.path.exists(failed_uuids_path):
+        os.remove(failed_uuids_path)
+
     exit(nFAIL)
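
The write-or-remove behaviour added to `test()` can be exercised in isolation. The sketch below restates it as a standalone function run against a temporary directory. `record_failed_uuids` and the sample UUIDs are hypothetical, and the real code uses `common.MFC_TEST_DIR` rather than a parameter.

```python
import os
import tempfile


def record_failed_uuids(test_dir, failed_tests):
    """Write one UUID per line, or remove a stale file when nothing failed."""
    path = os.path.join(test_dir, "failed_uuids.txt")
    if failed_tests:
        with open(path, "w") as f:
            for test_info in failed_tests:
                f.write(test_info["uuid"] + "\n")
    elif os.path.exists(path):
        # A clean run must not leave behind failures from an earlier run.
        os.remove(path)
    return path


with tempfile.TemporaryDirectory() as tmp:
    p = record_failed_uuids(tmp, [{"uuid": "AAAA0001"}, {"uuid": "BBBB0002"}])
    with open(p) as f:
        print(f.read().split())   # ['AAAA0001', 'BBBB0002']
    record_failed_uuids(tmp, [])
    print(os.path.exists(p))      # False
```

Removing the file on a clean run matters for the CI retry step, which treats the file's existence as the signal that something failed.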

