16 changes: 8 additions & 8 deletions toolchain/modules
@@ -83,11 +83,11 @@ san-all cmake python
san-gpu nvhpc cuda cray-mpich

h hipergator
-h-gpu nvhpc/25.9
-h-gpu CUDA_HOME="/apps/compilers/cuda/12.8.1"
-h-all HPC_OMPI_DIR="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7"
-h-all HPC_OMPI_BIN="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin"
-h-all OMPI_MCA_pml=ob1 OMPI_MCA_coll_hcoll_enable=0
-h-gpu PATH="/apps/mpi/cuda/12.8.1/nvhpc/25.3/openmpi/5.0.7/bin:${PATH}"
-h-all LD_LIBRARY_PATH=/apps/compilers/cuda/12.8.1/lib64:$LD_LIBRARY_PATH
-h-gpu MFC_CUDA_CC=100 NVHPC_CUDA_HOME="/apps/compilers/cuda/12.8.1"
+h-all python/3.12
+h-cpu gcc/14.2 openmpi/5.0.7
+h-gpu UCX_NET_DEVICES="mlx5_4:1,mlx5_7:1,mlx5_8:1,mlx5_9:1,mlx5_10:1,mlx5_13:1,mlx5_14:1,mlx5_15:1"
+h-gpu cuda/12.9.1 nvhpc/25.9 openmpi/5.0.7
+h-gpu CC=/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpicc
+h-gpu CXX=/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpicxx
+h-gpu FC=/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpifort
+h-gpu NVCOMPILER_COMM_LIBS_HOME=/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/12.9
Comment on lines +88 to +93
⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

# First, locate the toolchain/modules file
fd --type f "modules" toolchain/

🏁 Script executed:

# Get the full context around lines 88-93 and see other cluster configurations
cat -n toolchain/modules | head -100

🏁 Script executed:

# Check how CC/CXX/FC are used in the build system
rg "(\$CC|\$CXX|\$FC|CC=|CXX=|FC=)" --type cmake -A 2 | head -50

🏁 Script executed:

# Check the HiperGator stanzas in toolchain/modules for how CC/CXX/FC are set
cat toolchain/modules | grep -A 10 -B 5 "hipergator\|h-gpu"

🏁 Script executed:

# Search for HiperGator-specific build configuration or comments
rg -i "hipergator|h-gpu|mpicc|mpifort" . -A 2 -B 2 --max-count 20

🏁 Script executed:

# Check if there are any cmake files that handle compiler selection
fd -e cmake | head -5

🏁 Script executed:

# Look at CMakeLists.txt to understand compiler handling
head -100 CMakeLists.txt


The HiperGator GPU configuration deviates from all other clusters by explicitly setting MPI wrapper paths instead of the base NVHPC compilers.

All other GPU clusters (Bridges2, Ascent, Wombat, Expanse, Phoenix, Delta, Oscar, Nautilus) use CC=nvc CXX=nvc++ FC=nvfortran, while HiperGator points these variables at explicit MPI wrapper paths. This is inconsistent and problematic:

  1. GPU compilation expectation: learning notes indicate GPU code should be compiled with nvfortran, not a wrapper.
  2. Build system assumptions: CMakeLists.txt sets the compiler via environment variables and detects the compiler ID; MPI wrapper paths may not be properly identified as NVHPC.
  3. Portability: MPI wrappers pinned to a specific installation path are less portable than wrapper commands resolved through PATH.

If MPI wrappers are necessary for HiperGator's build, consider using the wrapper commands (mpicc, mpicxx, mpifort) instead of hardcoded paths, aligning with the pattern used by other clusters.
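
A minimal sketch of the first option (base NVHPC compilers), assuming the remaining added h-gpu lines from the hunk above stay as they are; whether NVCOMPILER_COMM_LIBS_HOME should keep the path shown in the diff is a question for the PR author:

h-gpu cuda/12.9.1 nvhpc/25.9 openmpi/5.0.7
h-gpu CC=nvc CXX=nvc++ FC=nvfortran
h-gpu NVCOMPILER_COMM_LIBS_HOME=/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/12.9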

🤖 Prompt for AI Agents
In `@toolchain/modules` around lines 88-93, update the HiperGator GPU module env
vars so they match the other GPU clusters: replace the hardcoded MPI wrapper paths
in CC, CXX, and FC with either the NVHPC compiler commands (CC=nvc, CXX=nvc++,
FC=nvfortran) so the build uses the NVHPC compilers directly, or, if MPI wrappers
are required, the generic wrapper commands (mpicc, mpicxx, mpifort) instead of
absolute paths; also ensure NVCOMPILER_COMM_LIBS_HOME stays pointed at the correct
NVHPC comm_libs location so the MPI libraries are found by the toolchain.

⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🌐 Web query:

NVIDIA B200 GPU CUDA compute capability

💡 Result:

The NVIDIA B200 (Blackwell SM100) has CUDA compute capability 10.0 (target arches sm100a / sm100f); it requires CUDA toolkit support beginning with CUDA 12.8. [1][2]
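
If login access to a B200 node is available, the reported value can also be checked directly; this sketch assumes the node's nvidia-smi is recent enough to expose the compute_cap query field:

# Prints the compute capability for each visible GPU (expected: 10.0 on B200)
nvidia-smi --query-gpu=name,compute_cap --format=csv,noheader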


🏁 Script executed:

# Search for MFC_CUDA_CC in the toolchain/modules file
grep -n "MFC_CUDA_CC" toolchain/modules | head -20

Add MFC_CUDA_CC=100 for B200 GPU compute capability targeting.

All other GPU-enabled clusters specify MFC_CUDA_CC to target specific GPU architectures (e.g., Phoenix on line 41: MFC_CUDA_CC=70,75,80,89,90; Delta on line 53: MFC_CUDA_CC=80,86). The B200 GPU has CUDA compute capability 10.0 (SM100 architecture) and requires CUDA 12.8 or later, so the h-gpu configuration should include MFC_CUDA_CC=100 for consistent GPU targeting.
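
For reference, the deleted h-gpu line in the hunk above already carried MFC_CUDA_CC=100, so restoring it is a one-line addition in the same format as the existing h-gpu entries (shown here on its own line; it could equally be appended to one of the existing h-gpu lines):

h-gpu MFC_CUDA_CC=100

On the nvcc side, compute capability 10.0 corresponds to the sm_100 target that, per the web query result above, became available with CUDA 12.8.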

🤖 Prompt for AI Agents
In `@toolchain/modules` around lines 88-93, add the environment variable
MFC_CUDA_CC=100 to the h-gpu stanza so the configuration targets SM100 (CUDA
compute capability 10.0); update the block that defines h-gpu (the lines setting
UCX_NET_DEVICES, cuda/12.9.1, nvhpc/25.9, openmpi/5.0.7 and the
CC/CXX/FC/NVCOMPILER_COMM_LIBS_HOME vars) to include MFC_CUDA_CC=100 alongside
those existing environment definitions.

5 changes: 3 additions & 2 deletions toolchain/templates/hipergator.mako
@@ -8,10 +8,11 @@
#SBATCH --job-name="${name}"
#SBATCH --output="${name}.out"
#SBATCH --time=${walltime}
-#SBATCH --cpus-per-task=7
% if gpu_enabled:
Copilot AI Jan 15, 2026

The cpus-per-task setting is now only applied to GPU jobs (inside the % if gpu_enabled: block), but CPU jobs have no cpus-per-task specification. This asymmetry could lead to suboptimal CPU-only job configurations. Consider whether CPU jobs should also have an explicit cpus-per-task setting, or document why GPU jobs specifically need 3 CPUs per task while CPU jobs use the default.

Suggested change
-% if gpu_enabled:
+% if gpu_enabled:
+# Note: For GPU jobs, we explicitly request 1 GPU and 3 CPUs per task.
+# CPU-only jobs rely on the cluster's default cpus-per-task setting.

#SBATCH --gpus-per-task=1
+#SBATCH --cpus-per-task=3
#SBATCH --gpu-bind=closest
+#SBATCH --mem-per-cpu=50GB
Copilot AI Jan 15, 2026

50GB memory per CPU is extremely high (150GB total for 3 CPUs per task). This could severely limit job scheduling on the cluster. Verify this is the intended memory requirement and not a typo (perhaps 50GB total or 5GB per CPU was intended). Most GPU codes require much less CPU memory unless doing significant host-side preprocessing.

Suggested change
-#SBATCH --mem-per-cpu=50GB
+#SBATCH --mem-per-cpu=5GB
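
For context, a sketch of how the two readings mentioned above could be expressed as Slurm directives; the values are illustrative only, and --mem-per-gpu assumes the cluster schedules GPUs as consumable resources:

# Reading 1: about 5 GB per allocated CPU (15 GB for a task with 3 CPUs)
#SBATCH --mem-per-cpu=5GB
# Reading 2: about 50 GB for the whole task, expressed per GPU (1 GPU per task)
#SBATCH --mem-per-gpu=50GB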

% endif
Comment on lines 11 to 16

Suggestion: Add an else block to the gpu_enabled check to set --cpus-per-task=7 for CPU-only jobs, restoring the intended behavior. [general, importance: 8]

Suggested change
-% if gpu_enabled:
-#SBATCH --gpus-per-task=1
-#SBATCH --cpus-per-task=3
-#SBATCH --gpu-bind=closest
-#SBATCH --mem-per-cpu=50GB
-% endif
+% if gpu_enabled:
+#SBATCH --gpus-per-task=1
+#SBATCH --cpus-per-task=3
+#SBATCH --mem-per-cpu=50GB
+% else:
+#SBATCH --cpus-per-task=7
+% endif

% if account:
#SBATCH --account=${account}
@@ -48,7 +49,7 @@ echo
(set -x; ${profiler} "${target.get_install_binpath(case)}")
% else:
(set -x; ${profiler} \
-mpirun -np ${nodes*tasks_per_node} \
+/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
Copilot AI Jan 15, 2026

The hardcoded absolute path to mpirun creates a tight coupling to a specific NVHPC version (25.9) and installation location. This path is used unconditionally for both GPU and CPU modes, but the CPU configuration in toolchain/modules uses gcc/openmpi which would have a different mpirun path. Consider either: (1) using a conditional path based on gpu_enabled to use the appropriate MPI launcher for each mode, or (2) relying on the PATH environment variable set by the module system (like other cluster templates do) by simply using mpirun.

Suggested change
-/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
+mpirun -np ${nodes*tasks_per_node} \

P2: The hardcoded NVHPC mpirun is used even in CPU mode, mismatching the loaded OpenMPI stack and risking a missing binary or MPI runtime failures for CPU MPI jobs.

Prompt for AI agents
Check if this issue is valid — if so, understand the root cause and fix it. At toolchain/templates/hipergator.mako, line 52:

<comment>Hardcoded NVHPC mpirun is used even in CPU mode, mismatching the loaded OpenMPI stack and risking missing binary or MPI runtime failures for CPU MPI jobs.</comment>

<file context>
@@ -48,7 +49,7 @@ echo
     % else:
         (set -x; ${profiler}    \
-            mpirun -np ${nodes*tasks_per_node}            \
+            /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node}            \
                    --bind-to none                         \
                    "${target.get_install_binpath(case)}")
</file context>
Suggested change
-/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
+${'/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun' if gpu_enabled else 'mpirun'} -np ${nodes*tasks_per_node} \

--bind-to none \
"${target.get_install_binpath(case)}")
% endif
Comment on lines 50 to 55
Suggestion: The absolute MPI launcher path contains a duplicated version segment ("25.9" appears twice) which very likely makes the path incorrect and the mpirun binary unavailable at runtime; fix the path to the correct single-version location so the launcher exists on the nodes. [possible bug]

Severity Level: Critical 🚨
- ❌ MPI jobs fail to start on affected nodes.
- ⚠️ Distributed test runs do not execute.
- ⚠️ Affects template-driven MPI launches in CI and local runs.
Suggested change
% else:
    (set -x; ${profiler} \
-       mpirun -np ${nodes*tasks_per_node} \
+       /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
            --bind-to none \
            "${target.get_install_binpath(case)}")
% endif
Proposed replacement for the mpirun line:
/apps/compilers/nvhpc/25.9/Linux_x86_64/comm_libs/mpi/bin/mpirun -np ${nodes*tasks_per_node} \
Steps of Reproduction ✅
1. Trigger an MPI run path by rendering toolchain/templates/hipergator.mako with a target
where mpi==True. The template enters the else branch shown at lines 50-55 and emits a
command containing the absolute mpirun path at line 52
("/apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun").

2. Submit the generated job or run the script on a compute node so the template-expressed
command executes. This is the normal execution path for distributed runs using this
template (the for-loop at lines 45-60 iterates targets and executes this branch when mpi
is enabled).

3. When the node shell attempts to execute the absolute path, the duplicated "25.9"
segment makes the path incorrect on nodes where the real NVHPC installation path does not
contain that duplicated segment. The shell prints "No such file or directory" and the MPI
launch fails immediately.

4. Observe the job failing to start distributed processes; the failure is reproducible by running any MPI-targeted job using this template (mpi==True) because the template emits the incorrect absolute path at toolchain/templates/hipergator.mako:52. A quick path-existence check is sketched after this list.
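
A hedged way to check step 3 without submitting a job is to test both candidate paths directly on a HiperGator node (both paths are copied from the diff and the suggestion above):

for p in \
  /apps/compilers/nvhpc/25.9/Linux_x86_64/25.9/comm_libs/mpi/bin/mpirun \
  /apps/compilers/nvhpc/25.9/Linux_x86_64/comm_libs/mpi/bin/mpirun
do
  # Report whether each candidate mpirun exists and is executable
  if [ -x "$p" ]; then echo "found:   $p"; else echo "missing: $p"; fi
done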
Prompt for AI Agent 🤖
This is a comment left during a code review.

**Path:** toolchain/templates/hipergator.mako
**Line:** 50:55
**Comment:**
Possible Bug: The absolute MPI launcher path contains a duplicated version segment ("25.9" appears twice), which very likely makes the path incorrect and the mpirun binary unavailable at runtime; fix the path to the correct single-version location so the launcher exists on the nodes.

Validate the correctness of the flagged issue. If correct, how can I resolve this? If you propose a fix, implement it and please keep it concise.
