Fix GPU example, compiler matrix, and AMD flang consistency (#1256)

sbryngelson · claude · web-flow · commit b5288674e8c4 · 2026-02-23T20:50:07.000-05:00
Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
diff --git a/.claude/rules/common-pitfalls.md b/.claude/rules/common-pitfalls.md
@@ -36,10 +36,10 @@
 - Boundary condition symmetry requirements must be maintained
 
 ## Compiler-Specific Issues
-- Code must compile on gfortran, nvfortran, Cray ftn, and Intel ifx
+- CI-gated compilers (must always pass): gfortran, nvfortran, Cray ftn, and Intel ifx
+- AMD flang is additionally supported for `--gpu mp` builds but not in the CI matrix
 - Each compiler has different strictness levels and warning behavior
 - Fypp macros must expand correctly for both GPU and CPU builds
-- GPU builds only work with nvfortran, Cray ftn, and AMD flang
 
 ## Test System
 - Tests are generated **programmatically** in `toolchain/mfc/test/cases.py`, not standalone files
diff --git a/.claude/rules/gpu-and-mpi.md b/.claude/rules/gpu-and-mpi.md
@@ -38,20 +38,27 @@ Inline macros (use `$:` prefix):
 - `$:GPU_WAIT()` — Synchronization barrier.
 
 Block macros (use `#:call`/`#:endcall`):
-- `GPU_PARALLEL(...)` — GPU parallel region wrapping a code block.
+- `GPU_PARALLEL(...)` — GPU parallel region (used for scalar reductions like `maxval`/`minval`).
 - `GPU_DATA(copy=..., create=..., ...)` — Scoped data region.
 - `GPU_HOST_DATA(use_device_addr=[...])` — Host code with device pointers.
 
-Block macro usage:
+Typical GPU loop pattern (used 750+ times in the codebase):
 ```
-#:call GPU_PARALLEL(copyin='[var1]', copyout='[var2]')
-  $:GPU_LOOP(collapse=N)
-  do k = 0, n; do j = 0, m
-    ! loop body
-  end do; end do
-#:endcall GPU_PARALLEL
+$:GPU_PARALLEL_LOOP(private='[i,j,k,l]', collapse=3)
+do l = idwbuff(3)%beg, idwbuff(3)%end
+    do k = idwbuff(2)%beg, idwbuff(2)%end
+        do j = idwbuff(1)%beg, idwbuff(1)%end
+            ! loop body
+        end do
+    end do
+end do
+$:END_GPU_PARALLEL_LOOP()
 ```
 
+WARNING: Do NOT use `GPU_PARALLEL` wrapping `GPU_LOOP` for spatial loops. `GPU_LOOP`
+emits empty directives on Cray and AMD compilers, causing silent serial execution.
+Use `GPU_PARALLEL_LOOP` / `END_GPU_PARALLEL_LOOP` for all parallel spatial loops.
+
 NEVER write raw `!$acc` or `!$omp` directives. Always use `GPU_*` Fypp macros.
 The precheck source lint will catch raw directives and fail.
 
@@ -67,13 +74,17 @@ The precheck source lint will catch raw directives and fail.
 - These compile only for Cray (`_CRAYFTN`); other compilers skip them
 
 ### Compiler-Backend Matrix
-| Compiler        | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP) | CPU-only |
-|-----------------|----------------------|---------------------|----------|
-| GNU gfortran    | No                   | No                  | Yes      |
-| NVIDIA nvfortran| Yes (primary)        | Yes                 | Yes      |
-| Cray ftn (CCE)  | Yes                  | Yes (primary)       | Yes      |
-| Intel ifx       | No                   | No                  | Yes      |
-| AMD flang       | No                   | Yes                 | Yes      |
+
+CI-gated compilers (must always pass): gfortran, nvfortran, Cray ftn, Intel ifx.
+AMD flang is additionally supported for GPU builds but not in the CI matrix.
+
+| Compiler        | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP)    | CPU-only |
+|-----------------|----------------------|------------------------|----------|
+| GNU gfortran    | No                   | Experimental (AMD GCN) | Yes      |
+| NVIDIA nvfortran| Yes (primary)        | Yes                    | Yes      |
+| Cray ftn (CCE)  | Yes                  | Yes (primary)          | Yes      |
+| Intel ifx       | No                   | Experimental (SPIR64)  | Yes      |
+| AMD flang       | No                   | Yes                    | Yes      |
 
 ## Preprocessor Defines (`#ifdef` / `#ifndef`)
 
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -3,7 +3,8 @@
 MFC is an exascale multi-physics CFD solver written in modern Fortran 2008+ with Fypp
 preprocessing. It has three executables (pre_process, simulation, post_process), a Python
 toolchain for building/running/testing, and supports GPU acceleration via OpenACC and
-OpenMP target offload. It must compile with gfortran, nvfortran, Cray ftn, and Intel ifx.
+OpenMP target offload. It must compile with gfortran, nvfortran, Cray ftn, and Intel ifx (CI-gated).
+AMD flang is additionally supported for OpenMP target offload GPU builds.
 
 ## Commands
 
@@ -167,4 +168,4 @@ When reviewing PRs, prioritize in this order:
 4. MPI correctness (halo exchange, buffer sizing, GPU_UPDATE calls)
 5. GPU code (GPU_* Fypp macros only, no raw pragmas)
 6. Physics consistency (pressure formula matches model_eqns)
-7. Compiler portability (all four compilers)
+7. Compiler portability (4 CI-gated compilers + AMD flang for GPU)