Commit 41ee786

mivertowski and claude committed
docs: add GPU VM session guide for H100 Claude Code instance
Onboarding document for the Claude Code instance running on the Azure H100 VM. Covers: what's already done (Phases 0-4), task priority order (benchmarks -> Hopper features), architecture reference, API patterns, commit conventions, and gotchas.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent a989fd5 commit 41ee786

1 file changed

Lines changed: 161 additions & 0 deletions

@@ -0,0 +1,161 @@
# RingKernel H100 GPU VM Session Guide

> **For the Claude Code instance running on the Azure H100 VM.**
> Read this before starting any work.

## Context

RingKernel is a GPU-native persistent actor model framework for Rust. A comprehensive production readiness initiative is underway. Phases 0-4 are complete (crash safety, H100/B200 build support, observability, testing). This VM session handles GPU-dependent work.

## Key Documents

Read these first:
1. **`CLAUDE.md`** — Project architecture, build commands, API patterns, gotchas
2. **`docs/superpowers/specs/2026-03-22-production-readiness-design.md`** — Master spec with checklist (check what's done vs pending)
3. **`docs/superpowers/plans/2026-03-22-production-readiness-phase2.md`** — Phase 2 plan (benchmarks)
4. **`docs/superpowers/plans/2026-03-22-production-readiness-phases3-8.md`** — Phases 5-8 plans
5. **`docs/benchmarks/h100-b200-baseline.md`** — Benchmark template to fill in
6. **`docs/19-cuda-wishlist-persistent-actors.md`** — CUDA wishlist driving Phase 5-6

## What's Already Done (Phases 0-4)

- **Crash safety**: Zero bare `.unwrap()` in production, typed error enums everywhere, `clippy::unwrap_used` lint
- **H100/B200 build**: Architecture auto-detection (`RINGKERNEL_CUDA_ARCH`), cudarc 0.19.3, Blackwell preset, libcu++ atomics default
- **Observability**: println migrated to tracing
- **Testing**: Compile-fail tests, feature matrix CI, unsafe audit for CUDA crate
- **1,447 tests pass, 0 failures**

## Your Tasks (Priority Order)

### Task 1: Validate GPU Environment
```bash
nvidia-smi
nvcc --version
cargo build --workspace --features cuda --release
cargo test --workspace --release
```
Verify the 97 previously-ignored GPU tests now pass. Tests marked `#[ignore]` only run when `-- --ignored` or `-- --include-ignored` is appended to `cargo test`, as in Task 3.

### Task 2: Run Baseline Benchmarks (Phase 2.4)
Fill in `docs/benchmarks/h100-b200-baseline.md` with actual numbers:
```bash
# Lock clocks first
sudo nvidia-smi -lgc $(nvidia-smi --query-gpu=clocks.max.graphics --format=csv,noheader,nounits)
sudo nvidia-smi -c EXCLUSIVE_PROCESS

# Criterion benchmarks (cargo bench uses the optimized bench profile; it has no --release flag)
cargo bench --package ringkernel

# Application benchmarks
cargo run -p ringkernel-wavesim3d --bin wavesim3d-benchmark --release --features cuda-codegen
cargo run -p ringkernel-txmon --bin txmon-benchmark --release --features cuda-codegen
cargo run -p ringkernel-procint --bin procint-benchmark --release
```
Run each benchmark 3 times and report the median.
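
To make "report median" unambiguous, here is a minimal sketch of the reduction. The `median` helper is hypothetical (it is not part of RingKernel) and assumes each run yields one `f64` timing in the same unit.

```rust
/// Median of benchmark timings (hypothetical helper, not a RingKernel API).
/// Returns `None` for an empty slice or if any sample is NaN.
fn median(samples: &[f64]) -> Option<f64> {
    if samples.is_empty() || samples.iter().any(|s| s.is_nan()) {
        return None;
    }
    let mut sorted = samples.to_vec();
    // total_cmp gives a total order on f64, so no unwrap is needed here.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        // Odd count (the 3-run case): the middle element.
        Some(sorted[mid])
    } else {
        // Even count: average of the two middle elements.
        Some((sorted[mid - 1] + sorted[mid]) / 2.0)
    }
}
```

With three runs the median is simply the middle value, so a single outlier run (for example, a cold cache) does not shift the reported number.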

### Task 3: Validate persistent.rs Runtime Atomics (Phase 2.3.3)
Test the libcu++ atomics in the persistent kernel runtime:
```bash
cargo test -p ringkernel-cuda --features "cuda,cooperative" --release -- --ignored
```
If any atomics-related tests fail, check `crates/ringkernel-cuda/src/persistent.rs`.

### Task 4: Implement Phase 5 — Hopper Features
This is the main development work. See spec items 5.1-5.5.

**Priority order:**

#### 5.1 Thread Block Clusters
- Add cluster launch attribute (`CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`) to kernel launch config
- Create `ClusterGroup` abstraction wrapping cooperative groups cluster API
- Add cluster size to `LaunchOptions` (default 1, up to 8 portable)
- Key file: `crates/ringkernel-cuda/src/cooperative.rs`
- cudarc sys bindings: `cudarc::driver::sys::CUlaunchAttributeID_enum::CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION`
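
A possible shape for the cluster-size field, as a sketch only. The struct layout, the `with_cluster_size` method, and the `ClusterError` enum are assumptions for illustration; the real `LaunchOptions` in the codebase may look different. The checks encode the portable limit of 8 from the list above and the CUDA requirement that the grid be a multiple of the cluster size.

```rust
/// Largest cluster size that is portable across cluster-capable devices.
const MAX_PORTABLE_CLUSTER_SIZE: u32 = 8;

/// Hypothetical launch options; the real struct may carry more fields.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct LaunchOptions {
    grid_blocks: u32,
    cluster_size: u32,
}

/// Typed errors instead of panics, in line with the crash-safety rules.
#[derive(Debug, PartialEq, Eq)]
enum ClusterError {
    ZeroClusterSize,
    NotPortable(u32),
    GridNotDivisible { grid_blocks: u32, cluster_size: u32 },
}

impl LaunchOptions {
    /// Default cluster size is 1, i.e. a plain non-cluster launch.
    fn new(grid_blocks: u32) -> Self {
        Self { grid_blocks, cluster_size: 1 }
    }

    /// Validate and set the cluster size (blocks per cluster).
    fn with_cluster_size(mut self, cluster_size: u32) -> Result<Self, ClusterError> {
        if cluster_size == 0 {
            return Err(ClusterError::ZeroClusterSize);
        }
        if cluster_size > MAX_PORTABLE_CLUSTER_SIZE {
            return Err(ClusterError::NotPortable(cluster_size));
        }
        // The grid must be an exact multiple of the cluster size.
        if self.grid_blocks % cluster_size != 0 {
            return Err(ClusterError::GridNotDivisible {
                grid_blocks: self.grid_blocks,
                cluster_size,
            });
        }
        self.cluster_size = cluster_size;
        Ok(self)
    }
}
```

Validating on the host keeps a bad configuration from surfacing later as an opaque launch failure.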

#### 5.2 Distributed Shared Memory (DSMEM)
- Add DSMEM message queue for intra-cluster actors
- `map_shared_rank(ptr, block_rank)` wrapper
- DSMEM-backed K2K channel (expected to be roughly 7x faster than global memory; confirm with the benchmark in the next item)
- Benchmark: DSMEM K2K vs current global memory K2K
- Fallback to global memory on pre-Hopper
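
The fallback rule can be sketched as a small host-side decision. `K2kTransport` and `select_k2k_transport` are hypothetical names, and treating compute capability major version 9 or higher as cluster-capable is an assumption to verify against the actual device attributes.

```rust
/// Hypothetical transport choice for a kernel-to-kernel (K2K) channel.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum K2kTransport {
    /// Distributed shared memory within one thread block cluster.
    Dsmem,
    /// Existing global-memory path; works on every architecture.
    GlobalMemory,
}

/// Hopper is compute capability 9.0 (assumption: newer majors also qualify).
const HOPPER_CC_MAJOR: u32 = 9;

/// Use DSMEM only when the device supports clusters AND both actors
/// live in the same cluster; otherwise fall back to global memory.
fn select_k2k_transport(cc_major: u32, same_cluster: bool) -> K2kTransport {
    if cc_major >= HOPPER_CC_MAJOR && same_cluster {
        K2kTransport::Dsmem
    } else {
        K2kTransport::GlobalMemory
    }
}
```

Actors in different clusters still need the global-memory channel on Hopper, so the fallback path is not only for older GPUs.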

#### 5.3 TMA Async Copy
- TMA 1D bulk async wrapper (no tensor map needed)
- Single-thread message payload transfer
- Barrier-based completion tracking
- Multicast to cluster

#### 5.4 Split-Phase Barriers
- `barrier_arrive()` / `barrier_wait(token)` for producer-consumer actors
- Replace polling-based message detection
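
The arrive/wait split is easier to reason about with a host-side model. The sketch below is a CPU analogy built on `Mutex` and `Condvar`, not device code and not the planned RingKernel API; `SplitBarrier` and `Token` are hypothetical names. It shows the contract: `arrive` never blocks and returns a token for the current phase, and `wait(token)` blocks only until that phase completes.

```rust
use std::sync::{Condvar, Mutex};

/// Names the phase in which the caller arrived.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct Token(u64);

struct State {
    phase: u64,     // number of completed phases so far
    pending: usize, // arrivals still missing in the current phase
}

/// Host-side model of a split-phase barrier (illustration only).
struct SplitBarrier {
    expected: usize,
    state: Mutex<State>,
    phase_done: Condvar,
}

impl SplitBarrier {
    fn new(expected: usize) -> Self {
        assert!(expected > 0, "barrier needs at least one participant");
        Self {
            expected,
            state: Mutex::new(State { phase: 0, pending: expected }),
            phase_done: Condvar::new(),
        }
    }

    /// Non-blocking: record the arrival and return the current phase token.
    fn arrive(&self) -> Token {
        let mut s = self.state.lock().expect("barrier mutex poisoned");
        let token = Token(s.phase);
        s.pending -= 1;
        if s.pending == 0 {
            // Last arrival completes the phase and re-arms the barrier.
            s.phase += 1;
            s.pending = self.expected;
            self.phase_done.notify_all();
        }
        token
    }

    /// Block until the phase named by `token` has completed.
    fn wait(&self, token: Token) {
        let mut s = self.state.lock().expect("barrier mutex poisoned");
        while s.phase == token.0 {
            s = self.phase_done.wait(s).expect("barrier mutex poisoned");
        }
    }
}
```

A producer can call `arrive`, continue with independent work, and call `wait` only when it needs the consumer's side to be done, which is the overlap that polling cannot give.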

#### 5.5 Green Contexts (SM Partitioning)
- `cuGreenCtxCreate` via driver API
- SM reservation for persistent actors
- Context isolation
- Requires CUDA 12.4+ (should be available on this VM)

### Task 5: NVTX Profiler Integration (Phase 3.2.1)
```bash
# NVTX should be available with CUDA toolkit
# Check: ls /usr/local/cuda/include/nvtx3/
```
Implement real NVTX integration in `crates/ringkernel-core/src/observability.rs` (currently a stub).

## Architecture Quick Reference

```
ringkernel-core — Core traits, error types, runtime, shutdown handler
ringkernel-cuda — CUDA backend (persistent kernels, K2K, reduction, streams)
ringkernel-cuda-codegen — Rust-to-CUDA transpiler (155+ intrinsics)
ringkernel-ir — Unified IR for multi-backend codegen
```

Key CUDA files:
- `crates/ringkernel-cuda/src/cooperative.rs` — Cooperative groups, grid sync
- `crates/ringkernel-cuda/src/persistent.rs` — Persistent kernel mapped memory
- `crates/ringkernel-cuda/src/reduction.rs` — Reduction with mapped buffers
- `crates/ringkernel-cuda/src/driver_api.rs` — Low-level CUDA driver wrappers
- `crates/ringkernel-cuda/src/launch_config/mode.rs` — GPU architecture presets
- `crates/ringkernel-cuda/build.rs` — Architecture detection, cooperative kernel compilation

## API Patterns (from CLAUDE.md)

```rust
// cudarc 0.19.3 kernel launch
let module = device.inner().load_module(ptx)?;
let func = module.load_function("kernel_name")?;
unsafe {
    stream.launch_builder(&func).arg(&input).arg(&output).launch(cfg)?;
}

// Cooperative kernel launch
use cudarc::driver::result as cuda_result;
unsafe { cuda_result::launch_cooperative_kernel(func, grid, block, smem, stream, params)?; }

// Cluster launch (new for Phase 5) — via sys bindings
use cudarc::driver::sys::CUlaunchAttributeID_enum;
// Set CU_LAUNCH_ATTRIBUTE_CLUSTER_DIMENSION in launch config
```

## Commit Convention

```
feat(cuda): description # new features
refactor(crate): description # restructuring
fix(crate): description # bug fixes
test(crate): description # test additions
docs: description # documentation

# Always end with:
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
```

## Important Gotchas

- Queue capacity must be a power of 2
- `block_dim_x()` returns 1 on CPU (not 0)
- cudarc 0.19.3: `launch_builder(&func).arg(&x).launch(cfg)` — NOT old `func.launch()` API
- Cooperative launch grid size is capped at roughly 1024 blocks (device-dependent; all blocks must be co-resident on the GPU)
- `RINGKERNEL_CUDA_ARCH` env var controls build target (auto-detected by setup script)
- PTX templates use sm_75 (forward-compatible minimum for cooperative groups)
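
On the first gotcha: a common reason ring queues require a power-of-2 capacity is that the slot index can then be computed with a bit mask instead of a modulo. The sketch below illustrates that idea only. `RingIndex` is a hypothetical type, and whether RingKernel's queues use exactly this scheme is an assumption.

```rust
/// Hypothetical slot calculator for a ring queue (illustration only).
struct RingIndex {
    mask: u64,
}

impl RingIndex {
    /// Rejects capacities that are zero or not a power of 2.
    fn new(capacity: u64) -> Option<Self> {
        if capacity.is_power_of_two() {
            // For capacity 2^k, the mask has the low k bits set.
            Some(Self { mask: capacity - 1 })
        } else {
            None
        }
    }

    /// Map a monotonically increasing sequence number to a slot.
    /// Equivalent to `sequence % capacity`, but a single AND.
    fn slot(&self, sequence: u64) -> u64 {
        sequence & self.mask
    }
}
```

The mask also stays correct when the sequence counter wraps around, which a modulo by a non-power-of-2 capacity would not guarantee.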