# GPU and MPI Patterns

## GPU Offloading Architecture

Only `src/simulation/` is GPU-accelerated. `pre_process` and `post_process` run on the CPU only.

MFC uses a **backend-agnostic GPU abstraction** via Fypp macros. The same source code
compiles to either OpenACC or OpenMP target offload depending on the build flag:

- `./mfc.sh build --gpu acc` → OpenACC backend (NVIDIA nvfortran, Cray ftn)
- `./mfc.sh build --gpu mp` → OpenMP target offload backend (Cray ftn, AMD flang)
- `./mfc.sh build` (no `--gpu`) → CPU-only; GPU macros expand to plain Fortran

### Macro Layers (in `src/common/include/`)
- `parallel_macros.fpp` — **Use these.** Generic `GPU_*` macros that dispatch to the
  correct backend based on the `MFC_OpenACC` / `MFC_OpenMP` compile definitions.
- `acc_macros.fpp` — OpenACC-specific `ACC_*` implementations (do not call directly)
- `omp_macros.fpp` — OpenMP target offload `OMP_*` implementations (do not call directly)
  - OMP macros generate **compiler-specific** directives: NVIDIA uses `target teams loop`,
    Cray uses `target teams distribute parallel do simd`, and AMD uses
    `target teams distribute parallel do`
- `shared_parallel_macros.fpp` — Shared helpers (collapse, private, and reduction clause generators)
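
As a rough sketch of this dispatch (the authoritative expansions live in the macro files above; the directives below are illustrative, not copied from them):

```
! Source (.fpp):
$:GPU_PARALLEL_LOOP(collapse=2)

! With --gpu acc, this expands to an OpenACC directive, roughly:
!$acc parallel loop collapse(2)

! With --gpu mp on Cray ftn, roughly:
!$omp target teams distribute parallel do simd collapse(2)

! In a CPU-only build the macro expands to nothing, and the loops
! compile as plain Fortran.
```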

### Key GPU Macros (always use the `GPU_*` prefix)

Inline macros (use the `$:` prefix):
- `$:GPU_PARALLEL_LOOP(collapse=N, private=[...], reduction=[...], reductionOp='+')` —
  Parallel loop over GPU threads. The most common GPU macro.
- `$:END_GPU_PARALLEL_LOOP()` — Required closing for `GPU_PARALLEL_LOOP`.
- `$:GPU_LOOP(collapse=N, ...)` — Inner loop within a GPU parallel region.
- `$:GPU_ENTER_DATA(create=[...])` — Allocate device memory (unscoped).
- `$:GPU_EXIT_DATA(delete=[...])` — Free device memory.
- `$:GPU_UPDATE(host=[...])` — Copy device → host (before an MPI send).
- `$:GPU_UPDATE(device=[...])` — Copy host → device (after an MPI receive).
- `$:GPU_ROUTINE(parallelism='[seq]')` — Mark a routine for device compilation.
- `$:GPU_DECLARE(create=[...])` — Declare device-resident data.
- `$:GPU_ATOMIC(atomic='update')` — Atomic operation on the device.
- `$:GPU_WAIT()` — Synchronization barrier.
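
A typical inline-macro pattern, sketched (the argument spelling follows the signatures above; `tmp`, `err_max`, and `q_prim` are illustrative names, not taken from the source):

```
$:GPU_PARALLEL_LOOP(collapse=2, private='[tmp]', reduction='[[err_max]]', reductionOp='MAX')
do k = 0, n
    do j = 0, m
        tmp = q_prim(j, k)
        err_max = max(err_max, abs(tmp))
    end do
end do
$:END_GPU_PARALLEL_LOOP()
```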

Block macros (use `#:call`/`#:endcall`):
- `GPU_PARALLEL(...)` — GPU parallel region wrapping a code block.
- `GPU_DATA(copy=..., create=..., ...)` — Scoped data region.
- `GPU_HOST_DATA(use_device_addr=[...])` — Host code with device pointers.

Block macro usage:
```
#:call GPU_PARALLEL(copyin='[var1]', copyout='[var2]')
    $:GPU_LOOP(collapse=2)
    do k = 0, n; do j = 0, m
        ! loop body
    end do; end do
#:endcall GPU_PARALLEL
```

NEVER write raw `!$acc` or `!$omp` directives; always use the `GPU_*` Fypp macros.
The precheck source lint will catch raw directives and fail.

### Memory Management Macros (from macros.fpp)
- `@:ALLOCATE(var1, var2, ...)` — Fortran `allocate` + `GPU_ENTER_DATA(create=...)`
- `@:DEALLOCATE(var1, var2, ...)` — `GPU_EXIT_DATA(delete=...)` + Fortran `deallocate`
- `@:PREFER_GPU(var1, var2, ...)` — NVIDIA unified-memory page placement hint
- Every `@:ALLOCATE` MUST have a matching `@:DEALLOCATE` in finalization
- Conditional allocation MUST be paired with conditional deallocation
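
The pairing rule in practice, as a minimal sketch (the names `q_tmp`, `qbmm`, and `mom_sp` are illustrative, not taken from the source):

```
! In the module's s_initialize routine:
@:ALLOCATE(q_tmp(0:m, 0:n, 0:p))   ! host allocate + GPU_ENTER_DATA(create=...)
if (qbmm) then
    @:ALLOCATE(mom_sp(1:6))
end if

! In the matching s_finalize routine, mirror every condition:
if (qbmm) then
    @:DEALLOCATE(mom_sp)
end if
@:DEALLOCATE(q_tmp)                ! GPU_EXIT_DATA(delete=...) + host deallocate
```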

### GPU Field Setup (Cray-specific, from macros.fpp)
- `@:ACC_SETUP_VFs(...)` / `@:ACC_SETUP_SFs(...)` — GPU pointer setup for vector/scalar fields
- These compile only for Cray (`_CRAYFTN`); other compilers skip them

### Compiler-Backend Matrix
| Compiler         | `--gpu acc` (OpenACC) | `--gpu mp` (OpenMP) | CPU-only |
|------------------|-----------------------|---------------------|----------|
| GNU gfortran     | No                    | No                  | Yes      |
| NVIDIA nvfortran | Yes (primary)         | Yes                 | Yes      |
| Cray ftn (CCE)   | Yes                   | Yes (primary)       | Yes      |
| Intel ifx        | No                    | No                  | Yes      |
| AMD flang        | No                    | Yes                 | Yes      |

## Preprocessor Defines (`#ifdef` / `#ifndef`)

Raw `#ifdef` / `#ifndef` preprocessor guards are **normal and expected** in MFC.
They are NOT the same as raw `!$acc`/`!$omp` directives (which are forbidden).

Use `#ifdef` for feature, target, compiler, and library gating:

### Feature gating
- `MFC_MPI` — MPI-enabled build (`--mpi` flag, default ON)
- `MFC_OpenACC` — OpenACC GPU backend (`--gpu acc`)
- `MFC_OpenMP` — OpenMP target offload backend (`--gpu mp`)
- `MFC_GPU` — Any GPU build (either OpenACC or OpenMP)
- `MFC_DEBUG` — Debug build (`--debug`)
- `MFC_SINGLE_PRECISION` — Single-precision mode (`--single`)
- `MFC_MIXED_PRECISION` — Mixed-precision mode (`--mixed`)

### Target gating (for code in `src/common/` shared across executables)
- `MFC_PRE_PROCESS` — Only in pre_process builds
- `MFC_SIMULATION` — Only in simulation builds
- `MFC_POST_PROCESS` — Only in post_process builds

### Compiler gating (for compiler-specific workarounds)
- `_CRAYFTN` — Cray Fortran compiler
- `__NVCOMPILER_GPU_UNIFIED_MEM` — NVIDIA unified memory (GH200 / `--unified`)
- `__PGI` — Legacy PGI/NVIDIA compiler
- `__INTEL_COMPILER` — Intel compiler
- `FRONTIER_UNIFIED` — Frontier HPC unified memory

### Library-specific code
- FFTW (`m_fftw.fpp`) uses heavy `#ifdef` gating for `MFC_GPU` and `__PGI`
- CUDA Fortran (the `cudafor` module) is gated behind `__NVCOMPILER_GPU_UNIFIED_MEM`
- SILO/HDF5 interfaces may have conditional paths

When adding new `#ifdef` blocks, always provide an `#else` branch (or a clean
fall-through to `#endif`) so the code compiles in all configurations: CPU-only,
GPU-ACC, GPU-OMP, with and without MPI.
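
For instance, a guard that stays compilable in every configuration might look like this (a sketch; `mpi_p` is assumed to be the build's precision-matched MPI datatype constant):

```
#ifdef MFC_MPI
    call MPI_ALLREDUCE(loc_sum, glb_sum, 1, mpi_p, MPI_SUM, MPI_COMM_WORLD, ierr)
#else
    glb_sum = loc_sum   ! serial build: the local value is already global
#endif
```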

## MPI

### Halo Exchange
- Pack/unpack offset calculations are error-prone — verify them carefully
- Buffer sizing depends on dimensionality and the QBMM state
- GPU coherence: always `GPU_UPDATE(host=...)` before an MPI send and
  `GPU_UPDATE(device=...)` after an MPI receive
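
The coherence pattern around an exchange, sketched (buffer names are illustrative; `mpi_p` is assumed to be the precision-matched MPI datatype):

```
$:GPU_UPDATE(host='[q_send_buff]')     ! device → host before the send
call MPI_SENDRECV(q_send_buff, buff_size, mpi_p, dst, tag, &
                  q_recv_buff, buff_size, mpi_p, src, tag, &
                  MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
$:GPU_UPDATE(device='[q_recv_buff]')   ! host → device after the receive
```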

### Error Handling
- Use `call s_mpi_abort()` for fatal errors, never `stop` or `error stop`
- MPI must be finalized before program exit
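
A fatal-error check then reads (a sketch; the condition and message are illustrative):

```
if (buff_size < 1) then
    call s_mpi_abort('Invalid buffer size in halo exchange. Exiting.')
end if
```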