Add hipGraph capture/replay for decode eval (gated to fp16) by aditya-dl · Pull Request #5019 · ROCm/AMDMIGraphX

aditya-dl · 2026-06-29T17:34:19Z

LLM decode re-enqueues the per-token kernel sequence one launch at a time - program::eval's generic_eval loop issues one hipExtModuleLaunchKernel per op (~50+ launches/token for a transformer decode step). The per-launch host overhead, and the GPU-clock throttling caused by the resulting dispatch bubbles, are a meaningful fraction of per-token latency on memory-light decode workloads. This PR adds the HIP-graph analog of what other backends already do (CUDA Graphs / D3D12 command-list replay): capture the single-context eval kernel loop into a hipGraph once, then replay it with a single hipGraphLaunch per subsequent eval.

The feature is opt-in (MIGRAPHX_ENABLE_HIPGRAPH, default off) and, when off, the code path is byte-identical to before. When on, it is restricted to FP16 programs - quantized (int4/fp4) decode is automatically excluded because graph capture regresses it (see "Gating" below).

Motivation

For decode, the bottleneck on discrete GPUs is not kernel compute - it is host dispatch overhead plus the clock throttle that per-launch gaps induce. Collapsing the per-token launch sequence into one captured/replayed graph removes both. This is a runtime-execution change, not a kernel change, so it benefits any model whose decode step is dispatch-bound.

Technical Details

Adds a hipGraph-backed execution path to the GPU context:
- RAII wrappers for hipGraph_t / hipGraphExec_t.
- begin_graph_capture() / end_graph_capture() / replay_graph() and a single execute(run_kernels) entry point.
- First eval captures the loop into a graph and instantiates it; subsequent evals replay it and skip the per-op loop. Any capture/instantiate failure falls back to the eager path, so correctness is never at risk.
- program::eval routes the single-context path through context::execute() and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers; the terminal stream sync stays outside the capture region).
- The type-erased context interface gains an execute() hook whose default simply runs the loop, so non-GPU targets are unaffected.
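
The capture-once, replay-afterwards flow described above can be sketched without a GPU. This is a minimal illustration, not the PR's implementation: `mock_stream`, `mock_context`, and all of their members are hypothetical stand-ins, and the "graph" is just a list of kernel names. It shows the three behaviors the description names: launches made during capture are recorded rather than executed, the first token is computed by launching the freshly captured graph, and a failed capture falls back to an eager run.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for a HIP stream. While capturing, launches are
// recorded instead of executed, mirroring stream-capture semantics.
struct mock_stream
{
    bool capturing = false;
    std::vector<std::string> recorded; // kernels recorded during capture
    std::vector<std::string> executed; // kernels that actually ran

    void launch(const std::string& kernel)
    {
        if(capturing)
            recorded.push_back(kernel);
        else
            executed.push_back(kernel);
    }
};

// Hypothetical context with an execute(run_kernels) entry point.
// Names and members are illustrative only.
struct mock_context
{
    mock_stream stream;
    std::vector<std::string> graph; // stands in for the instantiated graph
    bool has_graph     = false;
    bool capture_ok    = true; // set to false to simulate a capture failure
    int graph_launches = 0;

    void replay_graph()
    {
        assert(has_graph);
        ++graph_launches; // one launch replays every recorded kernel
        for(const auto& k : graph)
            stream.executed.push_back(k);
    }

    void execute(const std::function<void()>& run_kernels)
    {
        if(has_graph)
        {
            replay_graph(); // steady state: skip the per-op loop
            return;
        }
        stream.recorded.clear();
        stream.capturing = true;
        run_kernels(); // launches are recorded, not executed
        stream.capturing = false;
        if(capture_ok && !stream.recorded.empty())
        {
            graph     = stream.recorded;
            has_graph = true;
            replay_graph(); // compute the first token from the graph
        }
        else
        {
            run_kernels(); // eager fallback; later evals stay eager
        }
    }
};
```

In real HIP code the corresponding steps would go through hipStreamBeginCapture, hipStreamEndCapture, hipGraphInstantiate, and hipGraphLaunch; the mock only models their ordering.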

Performance (measured)

fp16 steady-state decode throughput, hipGraph off vs on, interleaved same-session A/B with md5-verified builds:

- Discrete GPUs (RDNA4 / RDNA3): ~+12% to +24% depending on model and arch.
- APU (unified memory): ~+2-3% - the win scales with how dispatch-bound the platform is; a bandwidth-limited APU has less dispatch headroom to recover, which is the expected behavior.
- Greedy decode output is byte-identical off vs on.
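
One way to read these numbers is through a simple additive latency model. This is an illustrative assumption, not something measured in this PR: let $T_{\mathrm{gpu}}$ be the GPU-bound time per token, $N$ the number of launches per token, $t_{\ell}$ the exposed host overhead per launch, and $t_{g}$ the cost of a single graph launch. The model ignores the clock-throttling effect mentioned in the description, so it can only give a rough picture.

```math
S = \frac{T_{\mathrm{gpu}} + N\,t_{\ell}}{T_{\mathrm{gpu}} + t_{g}},
\qquad
f = \frac{N\,t_{\ell}}{T_{\mathrm{gpu}} + N\,t_{\ell}},
\qquad
t_{g} \approx 0 \;\Rightarrow\; S \approx \frac{1}{1 - f},
\quad
f \approx 1 - \frac{1}{S}
```

Under that model, a throughput gain of +12% to +24% ($S = 1.12$ to $1.24$) corresponds to an exposed dispatch fraction $f$ of roughly 0.11 to 0.19, while +2-3% corresponds to roughly 0.02 to 0.03, which is consistent with the APU being less dispatch-bound.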

Safety / compatibility

- Default off. With MIGRAPHX_ENABLE_HIPGRAPH unset, behavior is byte-identical to the prior path (the single-context branch still runs the eager loop).
- Automatic eager fallback on any capture/instantiate error, on cross-compile contexts, and for non-capturable (quantized) programs.
- No new external dependencies.

Validation

- Greedy output byte-identical off vs on (fp16).
- Gate verified: quantized programs are marked non-capturable and run eager (no regression); fp16 programs capture and replay.

How to enable

MIGRAPHX_ENABLE_HIPGRAPH=1
(plus static shapes / fixed buffers, which decode with a shared KV buffer already provides).
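
The fixed-buffer requirement follows from how graph capture works: kernel arguments, including device pointers, are recorded at capture time, so a replay reuses the addresses seen during capture. As a minimal sketch of that constraint (hypothetical, and not code from this PR or from any other PR), a guard could remember the buffer addresses present at capture and allow replay only when the current call uses the same ones. `replay_guard` and its methods are invented names for illustration.

```cpp
#include <vector>

// Hypothetical guard: remembers the buffer addresses that were baked into
// a captured graph and reports whether a later call may safely replay it.
class replay_guard
{
    std::vector<const void*> captured_ptrs;
    bool captured = false;

    public:
    // Record the argument pointers in use when the graph was captured.
    void on_capture(const std::vector<const void*>& args)
    {
        captured_ptrs = args;
        captured      = true;
    }

    // Replay is valid only if a graph exists and every pointer is unchanged.
    bool can_replay(const std::vector<const void*>& args) const
    {
        return captured && args == captured_ptrs;
    }
};
```

When `can_replay` returns false, a caller would have to re-capture or run eagerly; which of the two is preferable is a design decision this sketch does not address.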

Notes for reviewers

The single-context routing in program::eval is the only core change; everything else lives in the GPU target.
Known limitation: the eligibility scan lives in fuse_mlir, so a non-default configuration with MLIR disabled would not set the gate. This is not a concern for the standard build (MLIR is on, and the int4 fusion path requires it), but noting it for completeness.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

- Added: New functionality.
- Changed: Changes to existing functionality.
- Removed: Functionality or support that has been removed. (Compared to a previous release)
- Optimized: Component performance that has been optimized or improved.
- Resolved Issues: Known issues from a previous version that have been resolved.
- Not Applicable: This PR is not to be included in the changelog.

Decode re-enqueues the per-token kernel sequence one launch at a time (program::eval generic_eval loop -> one hipExtModuleLaunchKernel per op), adding host dispatch overhead and a GPU-clock throttle from the per-launch bubbles.

Add the HIP-graph analog: capture the single-context eval kernel loop into a hipGraph once and replay it with a single hipGraphLaunch per subsequent eval. Opt-in via MIGRAPHX_ENABLE_HIPGRAPH (default off = byte-identical to the prior path).

The gpu context owns the hipGraph_t/hipGraphExec_t (RAII) and the begin/end_graph_capture / replay_graph / execute() entry points; program::eval routes the single-context path through context::execute() and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers). The type-erased context gains an execute() hook that by default just runs the loop.

Capture is gated to fp16: hipGraph capture/replay regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than eager on discrete GPUs). fuse_mlir marks a program non-capturable when it contains any quantized/low-bit op (unpack_int4, unpack_fp4, dequantizelinear, quant_dot) -- a cheap pre-lowering instruction-name scan, recorded on the gpu context -- and context::is_graph_enabled() keeps those programs on the eager path while fp16 still captures. No added compile-time or per-token cost, and none when the feature is off.
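
The gating logic described in this message can be sketched in isolation. This is a simplified illustration under stated assumptions: `is_capturable` and `graph_enabled` are hypothetical helpers, a plain list of names stands in for iterating a module's instructions, and the boolean flag stands in for however the environment variable is actually read. Only the four op names come from the description above.

```cpp
#include <algorithm>
#include <array>
#include <string>
#include <vector>

// Op names treated as low-bit by the scan, taken from the PR description.
static const std::array<std::string, 4> low_bit_ops = {
    {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};

// Hypothetical helper: true when no instruction name matches a low-bit op.
// `names` stands in for walking a module and reading each instruction's name.
bool is_capturable(const std::vector<std::string>& names)
{
    return std::none_of(names.begin(), names.end(), [](const std::string& n) {
        return std::find(low_bit_ops.begin(), low_bit_ops.end(), n) != low_bit_ops.end();
    });
}

// Hypothetical gate: capture happens only when the feature is opted in
// and the scan found no low-bit op.
bool graph_enabled(bool env_flag_set, bool capturable)
{
    return env_flag_set && capturable;
}
```

Because the scan matches on instruction names, it has to run before fusion replaces those instructions with compiled code objects, which is the ordering constraint the diff comments point out.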

github-actions · 2026-06-29T17:34:45Z

Thank you for your contribution! Since this is an external pull request, a maintainer must review the PR and add the "ok-to-test" label if it is approved for testing.

pfultz2 · 2026-06-29T17:40:43Z

So there is no reason to modify program::eval, as you can just write an op to do the execution.

Either way, see #4956, which implements hip graph and handles the case where the pointers change.

Copilot

Pull request overview

This PR introduces an opt-in HIP Graph capture/replay path for GPU execution to reduce per-token host dispatch overhead by capturing the single-context eval kernel sequence once and replaying it with a single hipGraphLaunch on subsequent evals.

Changes:

- Add HIP Graph state + capture/replay plumbing to the GPU context, exposed via a new context::execute() hook.
- Route single-context program::eval() through context::execute() and cache output arguments for the replay path.
- Add an MLIR-pass-time gate that disables capture when low-bit/quantized ops are present, plus a changelog entry for the new env var.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| `src/targets/gpu/include/migraphx/gpu/context.hpp` | Adds `MIGRAPHX_ENABLE_HIPGRAPH`, graph RAII state, capture/replay methods, and `context::execute()` implementation. |
| `src/targets/gpu/fuse_mlir.cpp` | Marks programs non-capturable when certain low-bit/quantized ops are present (gate for hipGraph path). |
| `src/program.cpp` | Routes single-context eval through `context::execute()` and caches results for replay. |
| `src/include/migraphx/context.hpp` | Extends the type-erased `context` interface with an optional `execute()` hook and a default eager implementation. |
| `CHANGELOG.md` | Documents the new `MIGRAPHX_ENABLE_HIPGRAPH` feature and gating behavior. |

+        auto status  = hipStreamEndCapture(get_stream().get(), &g);
+        if(status != hipSuccess or g == nullptr)
+            return false;
+        captured_graph      = hip_graph_ptr{g};


+            captured_graph.reset();
+            return false;
+        }
+        graph_exec = hip_graph_exec_ptr{exec};


+        // First eval: capture the loop into a graph. NOTE under hipStreamBeginCapture
+        // the kernel launches are RECORDED, not executed -- so run_kernels() here
+        // produces no output; we must launch the instantiated graph once to actually
+        // compute this first token. If capture/instantiate fails, fall back to a real
+        // eager run so the first token is still correct (and future evals stay eager
+        // since graph_exec remains null).


+    // ~2x slower than the eager path on discrete GPUs). Allowlist-by-absence: any quantized
+    // program (int4 nibble unpack, fp4, or the int8/int4 dequantize+quant_dot path that has
+    // no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.


+        static const std::array<std::string, 4> low_bit_ops = {
+            {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};
+        for(const auto& ins : mpm.get_module())
+        {
+            if(contains(low_bit_ops, ins.name()))
+            {
+                ctx->set_graph_not_capturable();
+                break;
+            }
+        }


+        contexts.front().execute([&] {
+            ret = generic_eval(*this, contexts, params, [&](auto&&, auto f) { return f(); });
+            impl->graph_cached_results = ret;
+        });
+        if(ret.empty())
+            ret = impl->graph_cached_results;


    }
+    else if(contexts.size() == 1)
+    {
+        // route the single-context eval (the EP decode path, async or not)


Internal handoff for PR ROCm#5019: thesis, per-file implementation w/ code, verification, architecture note, and the maintainer review (approach superseded by ROCm#4956; fp16/int4 gate is the likely-additive piece). Lives on the non-PR branch only; not for upstream.

TedThemistokleous · 2026-06-29T18:32:00Z

+    // regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than
+    // the eager per-op path on discrete GPUs), so a program that fuses any low-bit dequant
+    // op is marked non-capturable and runs the eager path; fp16 still captures.
+    bool is_graph_capturable() const { return graph_capturable; }


This shouldn't occur and signals a bug. We should be agnostic to what the lower-level library is doing during capture. Since MIGraphX works at a higher level than MLIR and other libraries, we should see a reduction in launch execution overall.

We do fusions to further optimize the model and keep the GPU queue full so it doesn't go idle.

TedThemistokleous · 2026-06-29T18:33:16Z

+    // no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.
+    // Scanned here -- before fusion consumes these into a code_object whose name no longer
+    // reveals them -- and recorded on the (shared) context so the hipGraph path (gated by
+    // context::is_graph_enabled) skips capture.


We want to capture all kernels, not just certain non-fused ones; otherwise you'll still get idle bubbles in the pipeline of runs.

TedThemistokleous

We need to discuss this with @pfultz2. He has already started work on this in the other draft PR he mentioned, and it seems we need to discuss the architecture of how we'd like to do this. I've already got changes in rocm/onnxruntime and am porting these to the GPU EP side.

We shouldn't be picking/dropping kernels from the capture, as that causes a bunch of other pointer dependencies/math in the GPU.

In my mind hipGraph is simple:

Input run -> capture -> Model capture

Input -> Replay Model capture -> output.

Adding any sort of tap dancing will lead to future failure, say, if we have an increasing list of "uncapturable" kernels: we don't have an infinite amount of addresses in a GPU that may or may not be running concurrent workloads or parallel instances for the given address range.

Let's pick a time next week to discuss this in detail, along with the architecture, as it seems you have requirements on how you'd like this to work, and I know Paul and I have our own in terms of MIGraphX and Onnxruntime specifically.

aditya-dl requested review from a team and causten as code owners June 29, 2026 17:34

causten requested review from TedThemistokleous, Copilot and pfultz2 June 29, 2026 18:16

Copilot started reviewing on behalf of causten June 29, 2026 18:17

Copilot AI reviewed Jun 29, 2026

TedThemistokleous reviewed Jun 29, 2026

TedThemistokleous requested changes Jun 29, 2026

aditya-dl wants to merge 1 commit into ROCm:develop from aditya-dl:amd/dev/adilohia/hipgraph-decode-capture-develop

4 participants
