Add hipGraph capture/replay for decode eval (gated to fp16) #5019

Open
aditya-dl wants to merge 1 commit into ROCm:develop from aditya-dl:amd/dev/adilohia/hipgraph-decode-capture-develop

Conversation

@aditya-dl

Contributor

LLM decode re-enqueues the per-token kernel sequence one launch at a time - program::eval's generic_eval loop issues one hipExtModuleLaunchKernel per op (~50+ launches/token for a transformer decode step). The per-launch host overhead, and the GPU-clock throttling caused by the resulting dispatch bubbles, are a meaningful fraction of per-token latency on memory-light decode workloads. This PR adds the HIP-graph analog of what other backends already do (CUDA Graphs / D3D12 command-list replay): capture the single-context eval kernel loop into a hipGraph once, then replay it with a single hipGraphLaunch per subsequent eval.

The feature is opt-in (MIGRAPHX_ENABLE_HIPGRAPH, default off) and, when off, the code path is byte-identical to before. When on, it is restricted to FP16 programs - quantized (int4/fp4) decode is automatically excluded because graph capture regresses it (see "Gating" below).

Motivation

For decode, the bottleneck on discrete GPUs is not kernel compute - it is host dispatch overhead plus the clock throttle that per-launch gaps induce. Collapsing the per-token launch sequence into one captured/replayed graph removes both. This is a runtime-execution change, not a kernel change, so it benefits any model whose decode step is dispatch-bound.

Technical Details

  • Adds a hipGraph-backed execution path to the GPU context:
    • RAII wrappers for hipGraph_t / hipGraphExec_t.
    • begin_graph_capture() / end_graph_capture() / replay_graph() and a single execute(run_kernels) entry point.
    • First eval captures the loop into a graph and instantiates it; subsequent evals replay it and skip the per-op loop. Any capture/instantiate failure falls back to the eager path, so a failed capture still produces correct output.
  • program::eval routes the single-context path through context::execute() and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers; the terminal stream sync stays outside the capture region).
  • The type-erased context interface gains an execute() hook whose default simply runs the loop, so non-GPU targets are unaffected.
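
The capture-once, replay-afterwards flow described above can be modeled without HIP. The following is a minimal sketch, not the code in this PR: `graph_context_model` and its fields are hypothetical, plain flags stand in for `hipGraph_t` / `hipGraphExec_t`, and it assumes that a failed capture is not retried on later evals.

```cpp
#include <cassert>
#include <functional>

// Hypothetical model of the GPU context's graph state machine. Plain flags
// replace hipGraph_t / hipGraphExec_t so the sketch builds without HIP.
struct graph_context_model
{
    bool graph_enabled    = true;  // stands in for the env flag being on and the program being capturable
    bool capture_succeeds = true;  // stands in for capture + instantiate returning hipSuccess
    bool has_graph_exec   = false; // stands in for a non-null hipGraphExec_t
    int recorded = 0;              // times the loop ran in recording mode
    int eager    = 0;              // times the loop ran eagerly
    int replays  = 0;              // times the instantiated graph was launched

    void execute(const std::function<void()>& run_kernels)
    {
        if(!graph_enabled)
        {
            run_kernels(); // feature off or gated: plain eager loop
            ++eager;
            return;
        }
        if(has_graph_exec)
        {
            ++replays; // replay: one graph launch, the per-op loop is skipped
            return;
        }
        // First eval: the loop runs while the stream is capturing, so the
        // launches are only recorded.
        run_kernels();
        ++recorded;
        if(capture_succeeds)
        {
            has_graph_exec = true;
            ++replays; // launch the graph once so the first eval produces output
            return;
        }
        // Capture failed: run eagerly now and stay eager (assumption: no retry).
        graph_enabled = false;
        run_kernels();
        ++eager;
    }
};
```

With a successful capture, three calls to `execute` run the loop body once (in recording mode) and launch the graph three times. With a failing capture, every eval ends in an eager run.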

Performance (measured)

fp16 steady-state decode throughput, hipGraph off vs on, interleaved same-session A/B with md5-verified builds:

  • Discrete GPUs (RDNA4 / RDNA3): ~+12% to +24% depending on model and arch.
  • APU (unified memory): ~+2-3% - the win scales with how dispatch-bound the platform is; a bandwidth-limited APU has less dispatch headroom to recover, which is the expected behavior.
  • Greedy decode output is byte-identical off vs on.

Safety / compatibility

  • Default off. With MIGRAPHX_ENABLE_HIPGRAPH unset, behavior is byte-identical to the prior path (the single-context branch still runs the eager loop).
  • Automatic eager fallback on any capture/instantiate error, on cross-compile contexts, and for non-capturable (quantized) programs.
  • No new external dependencies.

Validation

  • Greedy output byte-identical off vs on (fp16).
  • Gate verified: quantized programs are marked non-capturable and run eager (no regression); fp16 programs capture and replay.

How to enable

MIGRAPHX_ENABLE_HIPGRAPH=1
(plus static shapes / fixed buffers, which decode with a shared KV buffer already provides).
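
As an illustration of the default-off behavior, here is a small sketch of how such a flag could be read. The helper names are hypothetical, and treating only the literal value `1` as "on" is an assumption; MIGraphX's own environment-variable helpers may accept other values.

```cpp
#include <cassert>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: only the literal value "1" turns the feature on.
// An unset variable (nullptr) or any other value leaves it off.
inline bool hipgraph_flag_is_on(const char* value)
{
    return value != nullptr && std::strcmp(value, "1") == 0;
}

// Reads the variable named in this PR from the process environment.
inline bool hipgraph_enabled_from_env()
{
    return hipgraph_flag_is_on(std::getenv("MIGRAPHX_ENABLE_HIPGRAPH"));
}
```

Splitting the parsing from the `std::getenv` call keeps the decision testable without touching the process environment.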

Notes for reviewers

  • The single-context routing in program::eval is the only core change; everything else lives in the GPU target.
  • Known limitation: the eligibility scan lives in fuse_mlir, so a non-default configuration with MLIR disabled would not set the gate. This is not a concern for the standard build (MLIR is on, and the int4 fusion path requires it), but noting it for completeness.

Changelog Category

Add a CHANGELOG.md entry for any option other than Not Applicable

    • Added: New functionality.
    • Changed: Changes to existing functionality.
    • Removed: Functionality or support that has been removed. (Compared to a previous release)
    • Optimized: Component performance that has been optimized or improved.
    • Resolved Issues: Known issues from a previous version that have been resolved.
    • Not Applicable: This PR is not to be included in the changelog.

Decode re-enqueues the per-token kernel sequence one launch at a time (program::eval generic_eval loop -> one hipExtModuleLaunchKernel per op), adding host dispatch overhead and a GPU-clock throttle from the per-launch bubbles. Add the HIP-graph analog: capture the single-context eval kernel loop into a hipGraph once and replay it with a single hipGraphLaunch per subsequent eval.

Opt-in via MIGRAPHX_ENABLE_HIPGRAPH (default off = byte-identical to the prior path). The gpu context owns the hipGraph_t/hipGraphExec_t (RAII) and the begin/end_graph_capture / replay_graph / execute() entry points; program::eval routes the single-context path through context::execute() and caches the output arguments for the replay path (valid because static-shape decode reuses fixed device buffers). The type-erased context gains an execute() hook that by default just runs the loop.
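
The output caching on the replay path can be sketched as follows. This is a simplified model under stated assumptions: `buffer_handle`, `replaying_context`, and `eval_model` are hypothetical names, and an `int` stands in for an output argument that refers to a fixed device buffer.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an output argument pointing at a fixed device buffer.
using buffer_handle = int;

// Hypothetical context: once `replay` is set, execute() skips the per-op loop,
// as a graph replay would.
struct replaying_context
{
    bool replay = false;
    void execute(const std::function<void()>& run_kernels)
    {
        if(!replay)
            run_kernels();
    }
};

struct eval_model
{
    std::vector<buffer_handle> cached_results;

    std::vector<buffer_handle>
    eval(replaying_context& ctx, const std::function<std::vector<buffer_handle>()>& run_loop)
    {
        std::vector<buffer_handle> ret;
        ctx.execute([&] {
            ret            = run_loop(); // capture or eager path: the loop yields the outputs
            cached_results = ret;        // remember them for later replays
        });
        if(ret.empty())
            ret = cached_results; // replay path: the loop did not run, reuse the cached outputs
        return ret;
    }
};
```

Reusing the cached outputs is only sound when every eval writes to the same device buffers, which is the static-shape condition stated above.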

Capture is gated to fp16: hipGraph capture/replay regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than eager on discrete GPUs). fuse_mlir marks a program non-capturable when it contains any quantized/low-bit op (unpack_int4, unpack_fp4, dequantizelinear, quant_dot) -- a cheap pre-lowering instruction-name scan, recorded on the gpu context -- and context::is_graph_enabled() keeps those programs on the eager path while fp16 still captures. No added compile-time or per-token cost, and none when the feature is off.
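
A self-contained sketch of the gate, assuming the program is represented as a flat list of instruction names. The free functions below are hypothetical stand-ins for the scan in fuse_mlir and for `context::is_graph_enabled()`; only the four op names come from this PR.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <string>
#include <vector>

// Returns true when none of the instruction names is a quantized/low-bit op.
inline bool is_graph_capturable(const std::vector<std::string>& op_names)
{
    static const std::array<std::string, 4> low_bit_ops = {
        {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};
    return std::none_of(op_names.begin(), op_names.end(), [](const std::string& name) {
        return std::find(low_bit_ops.begin(), low_bit_ops.end(), name) != low_bit_ops.end();
    });
}

// Capture is used only when the env flag is on and the program passed the scan.
inline bool is_graph_enabled(bool env_flag_on, bool capturable)
{
    return env_flag_on && capturable;
}
```

The scan is a single linear pass over instruction names, consistent with the claim that the gate adds no per-token cost.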
aditya-dl requested review from a team and causten as code owners on June 29, 2026 17:34
@github-actions

Contributor

Thank you for your contribution! Since this is an external pull request, a maintainer must review the PR and add the "ok-to-test" label if it is approved for testing.

@pfultz2

pfultz2 commented Jun 29, 2026

Collaborator

There is no reason to modify program::eval, since you can just write an op to do the execution.

Either way, see #4956, which implements hip graph and handles the case where the pointers change.

Copilot AI left a comment

Contributor


Pull request overview

This PR introduces an opt-in HIP Graph capture/replay path for GPU execution to reduce per-token host dispatch overhead by capturing the single-context eval kernel sequence once and replaying it with a single hipGraphLaunch on subsequent evals.

Changes:

  • Add HIP Graph state + capture/replay plumbing to the GPU context, exposed via a new context::execute() hook.
  • Route single-context program::eval() through context::execute() and cache output arguments for the replay path.
  • Add an MLIR-pass-time gate that disables capture when low-bit/quantized ops are present, plus a changelog entry for the new env var.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| src/targets/gpu/include/migraphx/gpu/context.hpp | Adds MIGRAPHX_ENABLE_HIPGRAPH, graph RAII state, capture/replay methods, and context::execute() implementation. |
| src/targets/gpu/fuse_mlir.cpp | Marks programs non-capturable when certain low-bit/quantized ops are present (gate for hipGraph path). |
| src/program.cpp | Routes single-context eval through context::execute() and caches results for replay. |
| src/include/migraphx/context.hpp | Extends the type-erased context interface with an optional execute() hook and a default eager implementation. |
| CHANGELOG.md | Documents the new MIGRAPHX_ENABLE_HIPGRAPH feature and gating behavior. |

auto status = hipStreamEndCapture(get_stream().get(), &g);
if(status != hipSuccess or g == nullptr)
return false;
captured_graph = hip_graph_ptr{g};
captured_graph.reset();
return false;
}
graph_exec = hip_graph_exec_ptr{exec};
Comment on lines +573 to +578
// First eval: capture the loop into a graph. NOTE under hipStreamBeginCapture
// the kernel launches are RECORDED, not executed -- so run_kernels() here
// produces no output; we must launch the instantiated graph once to actually
// compute this first token. If capture/instantiate fails, fall back to a real
// eager run so the first token is still correct (and future evals stay eager
// since graph_exec remains null).
Comment on lines +1509 to +1511
// ~2x slower than the eager path on discrete GPUs). Allowlist-by-absence: any quantized
// program (int4 nibble unpack, fp4, or the int8/int4 dequantize+quant_dot path that has
// no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.
Comment on lines +1517 to +1526
static const std::array<std::string, 4> low_bit_ops = {
    {"unpack_int4", "unpack_fp4", "dequantizelinear", "quant_dot"}};
for(const auto& ins : mpm.get_module())
{
    if(contains(low_bit_ops, ins.name()))
    {
        ctx->set_graph_not_capturable();
        break;
    }
}
Comment thread src/program.cpp
Comment on lines +723 to +728
contexts.front().execute([&] {
    ret = generic_eval(*this, contexts, params, [&](auto&&, auto f) { return f(); });
    impl->graph_cached_results = ret;
});
if(ret.empty())
    ret = impl->graph_cached_results;
Comment thread src/program.cpp
}
else if(contexts.size() == 1)
{
// route the single-context eval (the EP decode path, async or not)
aditya-dl added a commit to aditya-dl/AMDMIGraphX that referenced this pull request Jun 29, 2026
Internal handoff for PR ROCm#5019: thesis, per-file implementation w/ code, verification, architecture note, and the maintainer review (approach superseded by ROCm#4956; fp16/int4 gate is the likely-additive piece). Lives on the non-PR branch only; not for upstream.
// regresses low-bit-quantized (int4/fp4) decode substantially (up to ~2x slower than
// the eager per-op path on discrete GPUs), so a program that fuses any low-bit dequant
// op is marked non-capturable and runs the eager path; fp16 still captures.
bool is_graph_capturable() const { return graph_capturable; }

Collaborator


This shouldn't occur and signals a bug. We should be agnostic to what the lower-level library is doing during capture. Since MIGraphX works at a higher level than MLIR and other libraries, we should see a reduction in launch execution overall.

We do fusions to further optimize the model and keep the GPU queue full so it doesn't go idle.

// no nibble unpack) takes the eager path; only fp16 (none of these ops) captures.
// Scanned here -- before fusion consumes these into a code_object whose name no longer
// reveals them -- and recorded on the (shared) context so the hipGraph path (gated by
// context::is_graph_enabled) skips capture.

Collaborator


We want to capture all kernels, not only certain non-fused ones; otherwise you'll still get idle bubbles in the pipeline of runs.

TedThemistokleous left a comment

Collaborator


We need to discuss this with @pfultz2. He's already started work on this in the draft PR he mentioned, and it seems like we need to discuss the architecture of how we'd like to do this. I've already got changes in rocm/onnxruntime and am porting these to the GPU EP side.

We shouldn't be picking/dropping kernels from the capture as that causes a bunch of other pointer dependencies/math in the GPU.

In my mind hipGraph is simple:

Input run -> capture -> Model capture

Input -> Replay Model capture -> output

Adding any sort of tap dancing will lead to future failure, say, if we have an increasing list of "uncapturable" kernels. We don't have an infinite number of addresses in a GPU that may or may not be running concurrent workloads or parallel instances for the given address range.

Let's pick a time next week to discuss this and the architecture in detail, as it seems like you have requirements on how you'd like this to work, and I know Paul and I have our own in terms of MIGraphX and Onnxruntime specifically.
