| 1 | +# RingKernel H100 GPU VM Session Guide |
| 2 | + |
| 3 | +> **For the Claude Code instance running on the Azure H100 VM.** |
| 4 | +> Read this before starting any work. |
| 5 | + |
| 6 | +## Context |
| 7 | + |
| 8 | +RingKernel is a GPU-native persistent actor model framework for Rust. A comprehensive production readiness initiative is underway. Phases 0-4 are complete (crash safety, H100/B200 build support, observability, testing), apart from the GPU-dependent items listed in the tasks below. This VM session handles that GPU-dependent work. |
| 9 | + |
| 10 | +## Key Documents |
| 11 | + |
| 12 | +Read these first: |
| 13 | +1. **`CLAUDE.md`** — Project architecture, build commands, API patterns, gotchas |
| 14 | +2. **`docs/superpowers/specs/2026-03-22-production-readiness-design.md`** — Master spec with checklist (check what's done vs pending) |
| 15 | +3. **`docs/superpowers/plans/2026-03-22-production-readiness-phase2.md`** — Phase 2 plan (benchmarks) |
| 16 | +4. **`docs/superpowers/plans/2026-03-22-production-readiness-phases3-8.md`** — Phases 5-8 plans |
| 17 | +5. **`docs/benchmarks/h100-b200-baseline.md`** — Benchmark template to fill in |
| 18 | +6. **`docs/19-cuda-wishlist-persistent-actors.md`** — CUDA wishlist driving Phase 5-6 |
| 19 | + |
| 20 | +## What's Already Done (Phases 0-4) |
| 21 | + |
| 22 | +- **Crash safety**: Zero bare `.unwrap()` in production, typed error enums everywhere, `clippy::unwrap_used` lint |
| 23 | +- **H100/B200 build**: Architecture auto-detection (`RINGKERNEL_CUDA_ARCH`), cudarc 0.19.3, Blackwell preset, libcu++ atomics default |
| 24 | +- **Observability**: `println!` calls migrated to `tracing` |
| 25 | +- **Testing**: Compile-fail tests, feature matrix CI, unsafe audit for CUDA crate |
| 26 | +- **1,447 tests pass, 0 failures** |
| 27 | + |
| 28 | +## Your Tasks (Priority Order) |
| 29 | + |
| 30 | +### Task 1: Validate GPU Environment |
| 31 | +```bash |
| 32 | +nvidia-smi |
| 33 | +nvcc --version |
| 34 | +cargo build --workspace --features cuda --release |
| 35 | +cargo test --workspace --release |
| 36 | +``` |
| 37 | +Verify that the 97 previously ignored GPU tests now pass. If they are still reported as ignored, rerun with `-- --ignored` as in Task 3. |
| 38 | + |
| 39 | +### Task 2: Run Baseline Benchmarks (Phase 2.4) |
| 40 | +Fill in `docs/benchmarks/h100-b200-baseline.md` with actual numbers: |
| 41 | +```bash |
| 42 | +# Lock clocks first |
| 43 | +sudo nvidia-smi -lgc $(nvidia-smi --query-gpu=clocks.max.graphics --format=csv,noheader,nounits) |
| 44 | +sudo nvidia-smi -c EXCLUSIVE_PROCESS |
| 45 | + |
| 46 | +# Criterion benchmarks (cargo bench takes no --release flag; its bench profile is already optimized) |
| 47 | +cargo bench --package ringkernel |
| 48 | + |
| 49 | +# Application benchmarks |
| 50 | +cargo run -p ringkernel-wavesim3d --bin wavesim3d-benchmark --release --features cuda-codegen |
| 51 | +cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-codegen |
| 52 | +cargo run -p ringkernel-procint --bin procint-benchmark --release |
| 53 | +``` |
| 54 | +Run each benchmark three times and report the median. |
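
For recording the three runs, a small helper like the one below could compute the reported value. It is a minimal sketch: the function name `median` and the choice of `f64` samples (for example wall-clock milliseconds) are assumptions, not part of the RingKernel codebase.

```rust
/// Median of a set of benchmark samples (for example three timings in ms).
/// Returns `None` for an empty slice or if any sample is NaN.
/// Hypothetical helper for illustration; not a RingKernel API.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|s| s.is_nan()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // `total_cmp` is a total order over f64, so the sort cannot panic.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        // Odd count: the middle element is the median.
        Some(sorted[mid])
    } else {
        // Even count: average the two middle elements.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

With three runs the median simply discards the fastest and slowest sample, which makes the reported number less sensitive to a single noisy run than a mean would be.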
| 55 | + |
| 56 | +### Task 3: Validate persistent.rs Runtime Atomics (Phase 2.3.3) |
| 57 | +Test the libcu++ atomics in the persistent kernel runtime: |
| 58 | +```bash |
| 59 | +cargo test -p ringkernel-cuda --features "cuda,cooperative" --release -- --ignored |
| 60 | +``` |
| 61 | +If any atomics-related tests fail, check `crates/ringkernel-cuda/src/persistent.rs`. |
| 62 | + |
| 63 | +### Task 4: Implement Phase 5 — Hopper Features |
| 64 | +This is the main development work. See spec items 5.1-5.5. |
| 65 | + |
| 66 | +**Priority order:** |
| 67 | + |
| 68 | +#### 5.1 Thread Block Clusters |
| 69 | +- Add cluster launch attribute (`CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`) to kernel launch config |
| 70 | +- Create `ClusterGroup` abstraction wrapping cooperative groups cluster API |
| 71 | +- Add cluster size to `LaunchOptions` (default 1; 8 is the largest portable size) |
| 72 | +- Key file: `crates/ringkernel-cuda/src/cooperative.rs` |
| 73 | +- cudarc sys bindings: `cudarc::driver::sys::CUlaunchAttributeID_enum::CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION` |
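
The cluster-size rules above (default 1, at most 8 for portability) can be sketched as host-side validation before any launch attribute is set. Everything named here is hypothetical: `ClusterOptions`, `ClusterError`, and `MAX_PORTABLE_CLUSTER_SIZE` are illustrative stand-ins, and the real `LaunchOptions` may be shaped differently. The divisibility check reflects the CUDA requirement that the grid size be a multiple of the cluster size.

```rust
/// Largest cluster size that is portable across devices supporting clusters.
const MAX_PORTABLE_CLUSTER_SIZE: u32 = 8;

/// Hypothetical stand-in for the cluster-related part of `LaunchOptions`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ClusterOptions {
    cluster_size: u32,
}

impl Default for ClusterOptions {
    /// A cluster size of 1 corresponds to an ordinary, non-clustered launch.
    fn default() -> Self {
        Self { cluster_size: 1 }
    }
}

/// Hypothetical typed error for invalid cluster configurations.
#[derive(Debug, PartialEq, Eq)]
enum ClusterError {
    Zero,
    NotPortable(u32),
    GridNotDivisible { grid_blocks: u32, cluster_size: u32 },
}

impl ClusterOptions {
    /// Validate a requested cluster size against the grid size (in blocks).
    fn new(cluster_size: u32, grid_blocks: u32) -> Result<Self, ClusterError> {
        if cluster_size == 0 {
            return Err(ClusterError::Zero);
        }
        if cluster_size > MAX_PORTABLE_CLUSTER_SIZE {
            return Err(ClusterError::NotPortable(cluster_size));
        }
        // The grid must be an integer number of clusters.
        if grid_blocks % cluster_size != 0 {
            return Err(ClusterError::GridNotDivisible { grid_blocks, cluster_size });
        }
        Ok(Self { cluster_size })
    }
}
```

Rejecting bad values on the host keeps the failure in a typed error rather than in a driver launch error, which matches the crash-safety conventions listed under Phases 0-4.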
| 74 | + |
| 75 | +#### 5.2 Distributed Shared Memory (DSMEM) |
| 76 | +- Add DSMEM message queue for intra-cluster actors |
| 77 | +- `map_shared_rank(ptr, block_rank)` wrapper |
| 78 | +- DSMEM-backed K2K channel (claimed to be about 7x faster than global memory; confirm with the benchmark below) |
| 79 | +- Benchmark: DSMEM K2K vs current global memory K2K |
| 80 | +- Fallback to global memory on pre-Hopper |
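
The pre-Hopper fallback in the list above could be decided once on the host, as in this sketch. The names `K2kBackend` and `select_k2k_backend` are hypothetical, and the rule that DSMEM applies only when both actors share a cluster is an assumption drawn from the "intra-cluster actors" bullet. Hopper corresponds to compute capability 9.0.

```rust
/// Hypothetical transport choice for a kernel-to-kernel (K2K) channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum K2kBackend {
    /// Distributed shared memory within a thread block cluster.
    Dsmem,
    /// Existing global-memory path, usable on every architecture.
    GlobalMemory,
}

/// Pick a backend from the device's compute capability `(major, minor)`
/// and whether both endpoints live in the same cluster.
fn select_k2k_backend(compute_capability: (u32, u32), same_cluster: bool) -> K2kBackend {
    // Tuples compare lexicographically, so (8, 9) < (9, 0) <= (10, 0).
    let hopper_or_newer = compute_capability >= (9, 0);
    if hopper_or_newer && same_cluster {
        K2kBackend::Dsmem
    } else {
        K2kBackend::GlobalMemory
    }
}
```

Keeping the decision in one function means the DSMEM-versus-global benchmark can force either path simply by passing a different argument.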
| 81 | + |
| 82 | +#### 5.3 TMA Async Copy |
| 83 | +- TMA 1D bulk async wrapper (no tensor map needed) |
| 84 | +- Single-thread message payload transfer |
| 85 | +- Barrier-based completion tracking |
| 86 | +- Multicast to cluster |
| 87 | + |
| 88 | +#### 5.4 Split-Phase Barriers |
| 89 | +- `barrier_arrive()` / `barrier_wait(token)` for producer-consumer actors |
| 90 | +- Replace polling-based message detection |
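
To make the arrive/wait token semantics concrete, here is a host-side model of a split-phase barrier built on a single atomic counter. It is only an analogy under stated assumptions: `SplitBarrier` and `Token` are hypothetical names, and the spin in `wait` stands in for the hardware barrier wait that the device-side version would use in place of polling.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Host-side model of a split-phase barrier (illustrative, not a RingKernel API).
struct SplitBarrier {
    /// Number of arrivals needed to complete one phase.
    expected: u64,
    /// Total arrivals since creation; it only ever increases.
    arrivals: AtomicU64,
}

/// Token naming the phase an arrival belongs to.
struct Token(u64);

impl SplitBarrier {
    fn new(expected: u64) -> Self {
        assert!(expected > 0, "a barrier needs at least one participant");
        Self { expected, arrivals: AtomicU64::new(0) }
    }

    /// Signal arrival without blocking and return a token for the current phase.
    fn arrive(&self) -> Token {
        let previous = self.arrivals.fetch_add(1, Ordering::AcqRel);
        Token(previous / self.expected)
    }

    /// Non-blocking check: has the phase named by `token` completed?
    fn try_wait(&self, token: &Token) -> bool {
        self.arrivals.load(Ordering::Acquire) >= (token.0 + 1) * self.expected
    }

    /// Block until the phase named by `token` completes.
    /// The spin loop is a host stand-in for a hardware barrier wait.
    fn wait(&self, token: Token) {
        while !self.try_wait(&token) {
            std::hint::spin_loop();
        }
    }
}
```

The point of the split is that a producer can call `arrive()`, continue with independent work, and only call `wait(token)` when it actually needs the other participants to have arrived.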
| 91 | + |
| 92 | +#### 5.5 Green Contexts (SM Partitioning) |
| 93 | +- `cuGreenCtxCreate` via driver API |
| 94 | +- SM reservation for persistent actors |
| 95 | +- Context isolation |
| 96 | +- Requires CUDA 12.4+ (confirm with the `nvcc --version` output from Task 1) |
| 97 | + |
| 98 | +### Task 5: NVTX Profiler Integration (Phase 3.2.1) |
| 99 | +```bash |
| 100 | +# NVTX should be available with CUDA toolkit |
| 101 | +# Check: ls /usr/local/cuda/include/nvtx3/ |
| 102 | +``` |
| 103 | +Implement real NVTX integration in `crates/ringkernel-core/src/observability.rs` (currently a stub). |
| 104 | + |
| 105 | +## Architecture Quick Reference |
| 106 | + |
| 107 | +``` |
| 108 | +ringkernel-core — Core traits, error types, runtime, shutdown handler |
| 109 | +ringkernel-cuda — CUDA backend (persistent kernels, K2K, reduction, streams) |
| 110 | +ringkernel-cuda-codegen — Rust-to-CUDA transpiler (155+ intrinsics) |
| 111 | +ringkernel-ir — Unified IR for multi-backend codegen |
| 112 | +``` |
| 113 | + |
| 114 | +Key CUDA files: |
| 115 | +- `crates/ringkernel-cuda/src/cooperative.rs` — Cooperative groups, grid sync |
| 116 | +- `crates/ringkernel-cuda/src/persistent.rs` — Persistent kernel mapped memory |
| 117 | +- `crates/ringkernel-cuda/src/reduction.rs` — Reduction with mapped buffers |
| 118 | +- `crates/ringkernel-cuda/src/driver_api.rs` — Low-level CUDA driver wrappers |
| 119 | +- `crates/ringkernel-cuda/src/launch_config/mode.rs` — GPU architecture presets |
| 120 | +- `crates/ringkernel-cuda/build.rs` — Architecture detection, cooperative kernel compilation |
| 121 | + |
| 122 | +## API Patterns (from CLAUDE.md) |
| 123 | + |
| 124 | +```rust |
| 125 | +// cudarc 0.19.3 kernel launch |
| 126 | +let module = device.inner().load_module(ptx)?; |
| 127 | +let func = module.load_function("kernel_name")?; |
| 128 | +unsafe { |
| 129 | + stream.launch_builder(&func).arg(&input).arg(&output).launch(cfg)?; |
| 130 | +} |
| 131 | + |
| 132 | +// Cooperative kernel launch |
| 133 | +use cudarc::driver::result as cuda_result; |
| 134 | +unsafe { cuda_result::launch_cooperative_kernel(func, grid, block, smem, stream, params)?; } |
| 135 | + |
| 136 | +// Cluster launch (new for Phase 5) — via sys bindings |
| 137 | +use cudarc::driver::sys::CUlaunchAttributeID_enum; |
| 138 | +// Set CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION in launch config |
| 139 | +``` |
| 140 | + |
| 141 | +## Commit Convention |
| 142 | + |
| 143 | +``` |
| 144 | +feat(cuda): description # new features |
| 145 | +refactor(crate): description # restructuring |
| 146 | +fix(crate): description # bug fixes |
| 147 | +test(crate): description # test additions |
| 148 | +docs: description # documentation |
| 149 | + |
| 150 | +# Always end with: |
| 151 | +Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> |
| 152 | +``` |
| 153 | + |
| 154 | +## Important Gotchas |
| 155 | + |
| 156 | +- Queue capacity must be a power of 2 |
| 157 | +- `block_dim_x()` returns 1 on CPU (not 0) |
| 158 | +- cudarc 0.19.3: `launch_builder(&func).arg(&x).launch(cfg)` — NOT old `func.launch()` API |
| 159 | +- Cooperative groups grid size is capped at roughly 1024 blocks (device-dependent) |
| 160 | +- `RINGKERNEL_CUDA_ARCH` env var controls build target (auto-detected by setup script) |
| 161 | +- PTX templates use sm_75 (forward-compatible minimum for cooperative groups) |
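
The power-of-2 rule for queue capacity can be checked up front, as in the sketch below. A common reason for such a constraint is that ring-buffer indices can then wrap with a bit mask instead of a modulo; whether RingKernel's queues rely on exactly this trick is an assumption, and both function names are hypothetical.

```rust
/// Accept only capacities that are a power of two (0 is rejected as well).
/// Hypothetical helper for illustration; not a RingKernel API.
fn validate_capacity(capacity: usize) -> Result<usize, String> {
    if capacity.is_power_of_two() {
        Ok(capacity)
    } else {
        Err(format!("queue capacity {capacity} is not a power of 2"))
    }
}

/// Map a monotonically increasing sequence number to a slot index.
/// For power-of-two capacities, masking is equivalent to `sequence % capacity`.
fn slot_index(sequence: usize, capacity: usize) -> usize {
    debug_assert!(capacity.is_power_of_two());
    sequence & (capacity - 1)
}
```

For example, with a capacity of 1024 the sequence number 1025 maps to slot 1, the same result as `1025 % 1024`.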