AGENTS.md -> .codex/config.toml -> subagent or skill -> script wrapper -> artifact
- read `AGENTS.md`
- use subagent `kernel-architect` if design decomposition is non-trivial
- use skills `gemm-kernel-design` and `benchmark-harness`
- use the brief in `docs/prompts/implement-gemm-kernel.md` if a reusable prompt starter helps
- read `AGENTS.md`
- use subagent `kernel-architect`
- use skills `flashattention-kernel-design` and `benchmark-harness`
- use the brief in `docs/prompts/implement-flashattention-kernel.md`
- use subagent `perf-analyst` first
- add skills `ncu-profiling` and `roofline-analysis`
- use `nsys-timeline` only for overlap, stream, or launch questions
- use the brief in `docs/prompts/optimize-kernel.md` if needed
- use subagent `env-investigator`
- add skill `cuda-env-audit`
- use the brief in `docs/prompts/investigate-environment.md`
- do not skip baseline benchmarking
- do not mix environment debugging with kernel optimization unless required
- do not make performance claims without artifacts
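
The routing and rules above can be sketched as a small lookup table plus an artifact check. This is a minimal illustration, not part of the repository. The task labels (`implement-gemm`, `optimize`, and so on), the `Route` type, the artifact file names `baseline.json` and `optimized.json`, and both helper functions are assumptions made for the example. Only the subagent, skill, and brief names come from the lists above.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List, Tuple


@dataclass(frozen=True)
class Route:
    subagent: str
    skills: Tuple[str, ...]
    brief: str
    # Skills that apply only to specific questions (e.g. overlap, stream, launch).
    situational_skills: Tuple[str, ...] = ()


# Keys are hypothetical task labels; values mirror the lists above.
ROUTES = {
    "implement-gemm": Route(
        "kernel-architect",
        ("gemm-kernel-design", "benchmark-harness"),
        "docs/prompts/implement-gemm-kernel.md",
    ),
    "implement-flashattention": Route(
        "kernel-architect",
        ("flashattention-kernel-design", "benchmark-harness"),
        "docs/prompts/implement-flashattention-kernel.md",
    ),
    "optimize": Route(
        "perf-analyst",
        ("ncu-profiling", "roofline-analysis"),
        "docs/prompts/optimize-kernel.md",
        situational_skills=("nsys-timeline",),
    ),
    "investigate-environment": Route(
        "env-investigator",
        ("cuda-env-audit",),
        "docs/prompts/investigate-environment.md",
    ),
}


def missing_artifacts(
    artifact_dir: str,
    required: Tuple[str, ...] = ("baseline.json", "optimized.json"),  # assumed names
) -> List[str]:
    """Return the required artifact files that are absent or empty."""
    root = Path(artifact_dir)
    return [
        name
        for name in required
        if not (root / name).is_file() or (root / name).stat().st_size == 0
    ]


def can_claim_speedup(artifact_dir: str) -> bool:
    """Allow a performance claim only when baseline and result artifacts both exist."""
    return not missing_artifacts(artifact_dir)
```

Keeping environment investigation as its own route, separate from `optimize`, reflects the rule about not mixing the two unless required. The artifact check encodes the other two rules: a claim is blocked until both a baseline and a measured result are on disk.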