This change will fix configuration issues on HiperGator #1112
```
@@ -8,10 +8,11 @@
#SBATCH --job-name="${name}"
#SBATCH --output="${name}.out"
#SBATCH --time=${walltime}
#SBATCH --cpus-per-task=7
% if gpu_enabled:
# Note: For GPU jobs, we explicitly request 1 GPU and 3 CPUs per task.
# CPU-only jobs rely on the cluster's default cpus-per-task setting.
```
Copilot (AI) commented on Jan 15, 2026:
50GB memory per CPU is extremely high (150GB total for 3 CPUs per task). This could severely limit job scheduling on the cluster. Verify this is the intended memory requirement and not a typo (perhaps 50GB total or 5GB per CPU was intended). Most GPU codes require much less CPU memory unless doing significant host-side preprocessing.
Suggested change:

```
-#SBATCH --mem-per-cpu=50GB
+#SBATCH --mem-per-cpu=5GB
```
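If it helps to verify the requirement empirically, here is a minimal sketch using standard Slurm accounting tools, assuming they are available on HiperGator; `<jobid>` is a placeholder for a representative GPU run:

```sh
# Compare peak resident memory (MaxRSS) against the requested memory (ReqMem)
# for a finished representative job before settling on --mem-per-cpu.
sacct -j <jobid> --format=JobID,JobName,MaxRSS,ReqMem,Elapsed

# seff prints a human-readable CPU/memory efficiency summary where installed.
seff <jobid>
```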
Suggestion: Add an else block to the gpu_enabled check to set --cpus-per-task=7 for CPU-only jobs, restoring the intended behavior. [general, importance: 8]
Existing code:

```
% if gpu_enabled:
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=3
#SBATCH --gpu-bind=closest
#SBATCH --mem-per-cpu=50GB
% endif
```

Suggested change:

```
% if gpu_enabled:
#SBATCH --gpus-per-task=1
#SBATCH --cpus-per-task=3
#SBATCH --mem-per-cpu=50GB
% else:
#SBATCH --cpus-per-task=7
% endif
```
Copilot (AI) commented on Jan 15, 2026:
The hardcoded absolute path to mpirun creates a tight coupling to a specific NVHPC version (25.9) and installation location. This path is used unconditionally for both GPU and CPU modes, but the CPU configuration in toolchain/modules uses gcc/openmpi which would have a different mpirun path. Consider either: (1) using a conditional path based on gpu_enabled to use the appropriate MPI launcher for each mode, or (2) relying on the PATH environment variable set by the module system (like other cluster templates do) by simply using mpirun.
Suggested change:

```
-/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
+mpirun -np ${nodes*tasks_per_node} \
```
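A minimal sketch of option (2), letting the module system put the right `mpirun` on `PATH` for each mode. The module names here are assumptions based on the review's mention of NVHPC 25.9 and the gcc/openmpi CPU stack; the actual names in toolchain/modules may differ:

```sh
# GPU mode: NVHPC's bundled OpenMPI should provide mpirun on PATH.
module load nvhpc/25.9
command -v mpirun

# CPU mode: the gcc/openmpi stack provides its own mpirun.
module load gcc openmpi
command -v mpirun
```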
P2: The hardcoded NVHPC mpirun is used even in CPU mode, mismatching the loaded OpenMPI stack and risking a missing binary or MPI runtime failures for CPU MPI jobs.
Prompt for AI agents:

```
Check if this issue is valid — if so, understand the root cause and fix it. At toolchain/templates/hipergator.mako, line 52:
<comment>Hardcoded NVHPC mpirun is used even in CPU mode, mismatching the loaded OpenMPI stack and risking missing binary or MPI runtime failures for CPU MPI jobs.</comment>
<file context>
@@ -48,7 +49,7 @@ echo
 % else:
     (set -x; ${profiler} \
-        mpirun -np ${nodes*tasks_per_node} \
+        /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
         --bind-to none \
         "${target.get_install_binpath(case)}")
</file context>
```
Suggested change:

```
-/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
+${'/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun' if gpu_enabled else 'mpirun'} -np ${nodes*tasks_per_node} \
```
Suggestion: The absolute MPI launcher path contains a duplicated version segment ("25.9" appears twice), which very likely makes the path incorrect and the mpirun binary unavailable at runtime; fix the path to the correct single-version location so the launcher exists on the nodes. [possible bug]
Severity Level: Critical 🚨
- ❌ MPI jobs fail to start on affected nodes.
- ⚠️ Distributed test runs do not execute.
- ⚠️ Affects template-driven MPI launches in CI and local runs.

Existing code:

```
% else:
    (set -x; ${profiler} \
-        mpirun -np ${nodes*tasks_per_node} \
+        /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
         --bind-to none \
         "${target.get_install_binpath(case)}")
% endif
```

Suggested change:

```
        /apps/compilers/nvhpc/25.9/Linux_x86_64/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
```
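Before choosing either path, a quick sanity check on a HiperGator login or compute node can confirm which location actually exists; both candidate paths below are the ones quoted in this review:

```sh
# Path emitted by the PR (contains the duplicated "25.9" segment):
ls -l /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun

# Single-version path proposed in the suggestion above:
ls -l /apps/compilers/nvhpc/25.9/Linux_x86_64/comm_libs/mpi/bin/mpirun
```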
Steps of Reproduction ✅

1. Trigger an MPI run path by rendering toolchain/templates/hipergator.mako with a target where mpi==True. The template enters the else branch shown at lines 50-55 and emits a command containing the absolute mpirun path at line 52 ("/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun").
2. Submit the generated job or run the script on a compute node so the templated command executes. This is the normal execution path for distributed runs using this template (the for-loop at lines 45-60 iterates over targets and takes this branch when mpi is enabled).
3. When the node shell attempts to execute the absolute path, the duplicated "25.9" segment makes the path incorrect on nodes where the real NVHPC installation path does not contain that duplicated segment. The shell prints "No such file or directory" and the MPI launch fails immediately.
4. Observe the job failing to start distributed processes; the failure is reproducible with any MPI-targeted job using this template (mpi==True) because the template emits the incorrect absolute path at toolchain/templates/hipergator.mako:52.

Prompt for AI Agent 🤖
This is a comment left during a code review.
**Path:** toolchain/templates/hipergator.mako
**Line:** 50:55
**Comment:**
*Possible Bug: The absolute MPI launcher path contains a duplicated version segment ("25.9" appears twice) which very likely makes the path incorrect and the mpirun binary unavailable at runtime; fix the path to the correct single-version location so the launcher exists on the nodes.
Validate the correctness of the flagged issue. If correct, how can I resolve it? If you propose a fix, implement it and keep it concise.
HiperGator GPU configuration deviates from all other clusters by explicitly setting MPI wrapper paths instead of base NVHPC compilers.

All other GPU clusters (Bridges2, Ascent, Wombat, Expanse, Phoenix, Delta, Oscar, Nautilus) use CC=nvc CXX=nvc++ FC=nvfortran, while HiperGator sets these to explicit MPI wrapper paths. This is inconsistent and problematic (the compiler handling expects nvfortran, not a wrapper). If MPI wrappers are necessary for HiperGator's build, consider using the wrapper commands (mpicc, mpicxx, mpifort) instead of hardcoded paths, aligning with the pattern used by other clusters.
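As a rough illustration of the pattern used by the other clusters, here is a hypothetical sketch of what the h-gpu compiler entries could look like; the exact syntax of toolchain/modules is assumed, not verified:

```sh
# Base NVHPC compilers, matching the pattern on Bridges2, Delta, Phoenix, etc.
h-gpu CC=nvc CXX=nvc++ FC=nvfortran

# If MPI wrappers really are required on HiperGator, prefer the wrapper
# command names over hardcoded absolute paths:
# h-gpu CC=mpicc CXX=mpicxx FC=mpifort
```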
🌐 Web query: NVIDIA B200 GPU CUDA compute capability

💡 Result: The NVIDIA B200 (Blackwell, SM100) has CUDA compute capability 10.0 (target arches sm100a / sm100f); it requires CUDA toolkit support beginning with CUDA 12.8. [1][2]
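If needed, the reported compute capability can be confirmed directly on a B200 node; `compute_cap` is a standard nvidia-smi query field on recent drivers:

```sh
# Expected output on a B200 node: "NVIDIA B200, 10.0"
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader
```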
Add MFC_CUDA_CC=100 for B200 GPU compute capability targeting.

All other GPU-enabled clusters specify MFC_CUDA_CC to target specific GPU architectures (e.g., line 41 for Phoenix: MFC_CUDA_CC=70,75,80,89,90; line 53 for Delta: MFC_CUDA_CC=80,86). The B200 GPU has CUDA compute capability 10.0 (SM100 architecture) and requires CUDA 12.8 or later; the h-gpu configuration should include MFC_CUDA_CC=100 for consistent GPU targeting.
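A hypothetical one-line addition to the h-gpu entry in toolchain/modules, mirroring the MFC_CUDA_CC entries this review cites for Phoenix and Delta (exact file syntax assumed):

```sh
# B200 / Blackwell is SM100 (compute capability 10.0) and needs CUDA >= 12.8.
h-gpu MFC_CUDA_CC=100
```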