feat(cuda): GPU batch inverse by ColoCarletti · Pull Request #658 · yetanotherco/lambda_vm

ColoCarletti · 2026-06-10T18:53:02Z

Summary

Replaces the two remaining CPU inplace_batch_inverse sites in the prover with a fused on-device compute+invert pipeline. R3 OOD's per-eval-point barycentric_inv_denoms and R4 DEEP's (x_i − z_k) denominators now both flow through a single multi-block Hillis-Steele scan kernel and return device-resident CudaSlice<u64> handles that downstream dispatchers (try_barycentric_*_on_handle, try_deep_composition_gpu) read directly.
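For readers unfamiliar with the scan shape named above, here is a minimal sketch of a Hillis-Steele inclusive prefix-product scan. It is a sequential simulation, not the PR's kernel. It assumes a single block, uses the Goldilocks base field rather than ext3, and the names `mul` and `hillis_steele_scan` are hypothetical.

```rust
/// Goldilocks prime, p = 2^64 - 2^32 + 1.
const P: u64 = 0xFFFF_FFFF_0000_0001;

/// Multiplication mod p through a 128-bit intermediate (simple, not optimized).
fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

/// Inclusive prefix-product scan in Hillis-Steele form.
/// Each iteration of the outer loop models one synchronized step on a GPU:
/// every slot i >= stride multiplies in the value `stride` slots to its left.
fn hillis_steele_scan(input: &[u64]) -> Vec<u64> {
    let mut cur = input.to_vec();
    let mut stride = 1;
    while stride < cur.len() {
        // Double-buffer: read only from `cur`, write only to `next`, which is
        // how a kernel avoids read-after-write hazards inside one step.
        let next: Vec<u64> = (0..cur.len())
            .map(|i| if i >= stride { mul(cur[i - stride], cur[i]) } else { cur[i] })
            .collect();
        cur = next;
        stride *= 2;
    }
    cur
}
```

The scan finishes in ceil(log2(n)) steps at the cost of O(n log n) multiplications. A multi-block version additionally has to scan the per-block totals and apply them as offsets, which this sketch leaves out.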

Wall-clock time is at parity on fib_iterative_4M on a 46-core host, since the savings overlap with existing GPU work. The win is in PCIe traffic (~6 GB of redundant H2D removed per prove), peak host RSS at R4 (~288 MB removed), and the architectural primitives (Arc<CudaStream> threading, DenomSign enum, public batch_inverse_ext3_dev API) that future device-resident extensions can reuse. The change pays off proportionally on smaller hosts, larger traces, more eval points, and more tables.

Changes

crypto/math-cuda/kernels/inverse.cu: 6 kernels (sign-flagged compute_denoms_ext3 for R3 z − x vs R4 x − z, forward and reverse multi-block scans, apply-offsets passes, batch_inverse_combine_ext3).
crypto/math-cuda/src/inverse.rs: host driver with batch_inverse_ext3 (parity-test path), batch_inverse_ext3_dev (device→device), fused compute_and_invert_denoms_ext3_dev with the DenomSign enum, recursive scan driver, one ext3 Fermat inverse on host per batch.
crypto/math-cuda/src/{barycentric,deep}.rs: _with_dev_inv_denoms variants that take a buffer + offset + caller's stream and slice internally (no per-call H2D, no cross-stream race).
crypto/stark/src/gpu_lde.rs: try_compute_and_invert_inv_denoms_dev + try_inv_denoms_dev_with_stream (acquires backend + stream). Threads Option<(&CudaSlice<u64>, usize, &Arc<CudaStream>)> through the barycentric and DEEP dispatchers. New gpu_batch_invert_calls counter.
crypto/stark/src/{trace,prover}.rs: R3 OOD and R4 DEEP fast paths. CPU barycentric_inv_denoms is now lazy on the all-GPU happy path; the CPU denoms Vec in compute_deep_composition_poly_evaluations is only built when the dev-inv-denoms path returns None.
Tests: batch_inverse.rs (parity, n in {1, 2..256, 257..1024, 4096..2^16, 2^18, 2^20, 2^22}); compute_and_invert_denoms.rs (parity parametrised over both signs, R3 and R4 shapes); invert_ext3_host parity against Degree3GoldilocksExtensionField::inv; cuda_path_integration asserts the new counter fires; cuda_fallback_tests::gpu_batch_invert_fault_falls_back_to_cpu validates the CPU fallback under injected cudarc errors.
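The pieces listed above (sign-flagged denominators, forward and reverse scans, a combine pass, one Fermat inverse per batch) can be read as the following CPU sketch. This is an interpretation of the description, not code from the PR. It works in the Goldilocks base field instead of ext3, the `DenomSign` variant names are guesses, and every function name is hypothetical.

```rust
/// Goldilocks prime, p = 2^64 - 2^32 + 1. All inputs are assumed reduced mod p.
const P: u64 = 0xFFFF_FFFF_0000_0001;

fn mul(a: u64, b: u64) -> u64 {
    ((a as u128 * b as u128) % (P as u128)) as u64
}

fn sub(a: u64, b: u64) -> u64 {
    if a >= b { a - b } else { P - (b - a) }
}

/// Fermat inverse a^(p-2) by square-and-multiply. Requires a != 0.
fn inv(a: u64) -> u64 {
    let (mut result, mut base, mut exp) = (1u64, a, P - 2);
    while exp > 0 {
        if exp & 1 == 1 {
            result = mul(result, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    result
}

/// Which subtraction forms the denominator (variant names are assumptions).
#[derive(Clone, Copy)]
enum DenomSign {
    ZMinusX, // R3-style: z - x_i
    XMinusZ, // R4-style: x_i - z
}

fn compute_denoms(xs: &[u64], z: u64, sign: DenomSign) -> Vec<u64> {
    xs.iter()
        .map(|&x| match sign {
            DenomSign::ZMinusX => sub(z, x),
            DenomSign::XMinusZ => sub(x, z),
        })
        .collect()
}

/// Batch inverse of nonzero elements using a single field inversion.
fn batch_inverse(xs: &[u64]) -> Vec<u64> {
    let n = xs.len();
    if n == 0 {
        return Vec::new();
    }
    // Forward scan: prefix[i] = xs[0] * ... * xs[i].
    let mut prefix = Vec::with_capacity(n);
    let mut acc = 1u64;
    for &x in xs {
        acc = mul(acc, x);
        prefix.push(acc);
    }
    // Reverse scan: suffix[i] = xs[i] * ... * xs[n-1].
    let mut suffix = vec![1u64; n];
    acc = 1;
    for i in (0..n).rev() {
        acc = mul(acc, xs[i]);
        suffix[i] = acc;
    }
    // The only inversion in the whole batch.
    let total_inv = inv(prefix[n - 1]);
    // Combine: 1/x_i = (product of x_j for j < i) * (product for j > i) / (product of all).
    (0..n)
        .map(|i| {
            let left = if i == 0 { 1 } else { prefix[i - 1] };
            let right = if i + 1 == n { 1 } else { suffix[i + 1] };
            mul(mul(left, right), total_inv)
        })
        .collect()
}
```

Using two independent scans plus an element-wise combine, rather than the usual sequential back-substitution, keeps every pass data-parallel. That is presumably why the description lists both a forward and a reverse scan kernel.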

Fallback

Every dispatch returns None on TypeId mismatch (non-Goldilocks / non-ext3), below threshold, or any cudarc error. The caller falls through to the existing CPU inplace_batch_inverse path with no state change. Runtime-validated by the new fault-injection test.
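A minimal sketch of this fallback contract, under stated assumptions: the marker types, the `Path` enum, and both function names are hypothetical, and the GPU and CPU operations are passed in as closures so the example stays self-contained. It shows the three `None` conditions and that the GPU attempt only reads its input.

```rust
use std::any::TypeId;

// Hypothetical marker types standing in for the real field types.
struct Goldilocks;
struct OtherField;

#[derive(Debug, PartialEq)]
enum Path {
    Gpu,
    Cpu,
}

/// Attempts the GPU path. Returns None on a type mismatch, an input below the
/// threshold, or any backend error. The input is only read, never modified.
fn try_gpu<F: 'static, G>(values: &[u64], threshold: usize, gpu_op: G) -> Option<Vec<u64>>
where
    G: Fn(&[u64]) -> Result<Vec<u64>, String>,
{
    if TypeId::of::<F>() != TypeId::of::<Goldilocks>() {
        return None; // generic caller instantiated with an unsupported field
    }
    if values.len() < threshold {
        return None; // too small to be worth a kernel launch
    }
    gpu_op(values).ok() // any backend error collapses to None
}

/// Runs the GPU path when it applies, otherwise the CPU path in place.
/// Assumes `gpu_op` returns exactly `values.len()` elements.
fn dispatch<F: 'static, G, C>(values: &mut [u64], threshold: usize, gpu_op: G, cpu_op: C) -> Path
where
    G: Fn(&[u64]) -> Result<Vec<u64>, String>,
    C: Fn(&mut [u64]),
{
    match try_gpu::<F, _>(values, threshold, gpu_op) {
        Some(out) => {
            values.copy_from_slice(&out);
            Path::Gpu
        }
        None => {
            cpu_op(values);
            Path::Cpu
        }
    }
}
```

The `TypeId` comparison is a runtime check because the caller is generic over the field type; the compiler accepts any `F`, and only the dispatcher knows which instantiations have a GPU implementation.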

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

ColoCarletti · 2026-06-12T13:27:35Z

/codex

ColoCarletti · 2026-06-12T13:27:39Z

/claude

github-actions · 2026-06-12T13:30:07Z

Codex Code Review

Found 2 issues in the PR diff.

High - R4 GPU fallback can panic when device inv-denoms exist but device parts do not
crypto/stark/src/gpu_lde.rs:1199 skips validating inv_denoms_host whenever inv_denoms_dev is Some, but the (None, _) match arm later still slices inv_denoms_host[0..lde_size] at crypto/stark/src/gpu_lde.rs:1294. The caller in crypto/stark/src/prover.rs:1369 can pass Some(inv_dev) while round_2_result.gpu_composition_parts is None, so this turns a GPU fast-path miss into an out-of-bounds panic instead of falling back to CPU.
Fix by adding an explicit (None, Some(_)) path that returns None or by supporting host parts + device inv-denoms; alternatively only attempt the device-inv-denoms DEEP path when gpu_composition_parts.is_some().
Medium - Size truncation can under-launch CUDA kernels and leave uninitialized output
The new inverse path casts usize sizes to u32 before computing grid sizes, e.g. crypto/math-cuda/src/inverse.rs:182, crypto/math-cuda/src/inverse.rs:218, and crypto/math-cuda/src/inverse.rs:323. For inputs above u32::MAX, this wraps/truncates, while allocations like crypto/math-cuda/src/inverse.rs:173 are sized from the full usize. Because these buffers are allocated uninitialized under “kernel writes every slot” assumptions, an under-launch can return/read unwritten device memory.
Add explicit checked bounds before launch, e.g. reject n > u32::MAX / total > u32::MAX, and use checked_mul(3) for allocation lengths.
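One way the suggested bounds check could look, as a sketch only: `LaunchPlan`, `plan_launch`, and the `BLOCK_SIZE` value of 256 are assumptions, and the factor of 3 reflects three u64 limbs per ext3 element.

```rust
/// Hypothetical launch parameters derived from an element count.
#[derive(Debug, PartialEq)]
struct LaunchPlan {
    n: u32,           // element count as passed to the kernel
    grid: u32,        // number of blocks
    alloc_len: usize, // u64 limbs to allocate (3 per ext3 element)
}

/// Assumed block size; the real kernels may use a different value.
const BLOCK_SIZE: u32 = 256;

/// Returns None instead of silently truncating when `n` does not fit in u32
/// or when the limb count overflows usize, so the caller can fall back.
fn plan_launch(n: usize) -> Option<LaunchPlan> {
    let n_u32 = u32::try_from(n).ok()?; // rejects n > u32::MAX
    let alloc_len = n.checked_mul(3)?; // rejects overflow of the limb count
    // Ceiling division written so that it cannot overflow u32.
    let grid = n_u32 / BLOCK_SIZE + u32::from(n_u32 % BLOCK_SIZE != 0);
    Some(LaunchPlan { n: n_u32, grid, alloc_len })
}
```

With a check like this, the kernel's element count and the allocation length are derived from the same validated value, so an under-launch over an uninitialized buffer cannot happen through truncation.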

No Critical security issues found in the reviewed diff.

claude · 2026-06-12T13:38:14Z

Review: feat(cuda): GPU batch inverse

Overall: Well-structured PR. The multi-block Hillis-Steele scan, recursive block-total pass, single host Fermat inversion per batch, DenomSign abstraction, Arc<CudaStream> threading, and fallback design are all sound. Test coverage is thorough. Two issues found.

Bug (Medium) — `try_deep_composition_gpu` can panic on R2-fail + R4-success combination

See inline on gpu_lde.rs:1308. When the caller passes inv_denoms_host = &[] with inv_denoms_dev = Some(...) (the prover.rs fast path does exactly this), but parts_dev = None (R2 GPU composition failed), execution falls into the (None, _) match arm and &inv_denoms_host[0..lde_size] panics on the empty slice. The length check at line ~1199 is conditional on inv_denoms_dev.is_none(), so it is bypassed for this combination. The fix is an early-return guard on the (None, _) arm when inv_denoms_host is unpopulated.
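The guard can be sketched as follows. This models only the selection logic with plain slices; `DenomSource` and `pick_denoms` are hypothetical names, and the real function works with device handles rather than `&[u64]`.

```rust
/// Where the inverse denominators will be read from (hypothetical type).
#[derive(Debug, PartialEq)]
enum DenomSource<'a> {
    Device(&'a [u64]),
    Host(&'a [u64]),
}

/// Picks the denominator source, returning None (so the caller falls back to
/// CPU) instead of panicking when host denominators are needed but missing.
fn pick_denoms<'a>(
    parts_dev: Option<&'a [u64]>,
    inv_denoms_dev: Option<&'a [u64]>,
    inv_denoms_host: &'a [u64],
    lde_size: usize,
) -> Option<DenomSource<'a>> {
    match (parts_dev, inv_denoms_dev) {
        // Fully device-resident path: host denominators are never touched.
        (Some(_), Some(dev)) => Some(DenomSource::Device(dev)),
        // Every other combination reads host denominators. `get` yields None
        // when the slice is shorter than lde_size, where `[0..lde_size]`
        // indexing would panic.
        _ => inv_denoms_host.get(0..lde_size).map(DenomSource::Host),
    }
}
```

The reported combination, device inv-denoms present but device parts absent and an empty host slice, now produces `None` rather than an out-of-bounds panic.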

Low — Benign CUDA data race in block-total write (`inverse.cu:130`)

See inline on inverse.cu:130. When n % BLOCK_SIZE != 0, two threads write to the same block_totals slot. Both values are equal (identity padding leaves the running product unchanged) so correctness is never affected, but adding gid < n && to the condition makes the single-writer intent explicit.

Oppen · 2026-06-16T19:57:37Z

+/// num_eval_points) and R4 DEEP (n = lde_size, k_scalars = 1 +
+/// num_eval_points). Returns a device handle the caller can slice and
+/// thread into downstream dispatchers without ever D2H'ing the inverted
+/// values; on type / threshold / cudarc failure returns `None` so the


Shouldn't type failures be caught by the type system?

ColoCarletti and others added 30 commits May 6, 2026 15:12

add first cuda files

d1a0abf

fmt

79634ff

fix clippy

ac6fbb5

gpu 2nd part

2ceb3b0

feat(cuda): Round 1 GPU LDE+commit dispatch + device-resident handles

affceb1

merge main

01172f2

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

c4627e1

comments fix

01aa5e4

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

cfc5c19

Update crypto/stark/src/gpu_lde.rs

ea5696f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a8cf265

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

fb8d31f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

a79f2b5

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/stark/src/gpu_lde.rs

761a2c0

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

address reviews

e066e9d

fix review comments

7d3d0f0

Merge remote-tracking branch 'origin/main' into feat/cuda-pr2-r1-gpu-…

cf80771

…commits

address doc comment suggestions

71aba0d

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

83d91b8

fix

34cae4b

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

f076bf4

Pass replay transcript to bus-balance call in verify_vm_minimal

a2cde0f

Update crypto/math-cuda/src/device.rs

46c305b

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Merge branch 'main' into feat/cuda-pr2-r1-gpu-commits

aca3dca

Update crypto/math-cuda/src/device.rs

63d7c00

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

eb16c02

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/device.rs

66925b1

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4e6daf3

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

4cd27d9

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

Update crypto/math-cuda/src/lde.rs

5fe390f

Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>

ColoCarletti and others added 17 commits June 8, 2026 11:57

fix comments

c499ee0

refactor test

025813a

rm dead code, refactor

f41bb7b

fix

6399cf2

rm doc

b8d97d5

gpu batch inverse

6f3262d

fix

b422d71

Merge branch 'feat/cuda-pr4' of github.com:yetanotherco/lambda_vm int…

95b8025

…o feat/cuda-pr4

fallback test

b706e48

Merge branch 'main' into feat/cuda-pr4

5eae98a

fix_comments

578cb29

Merge remote-tracking branch 'origin/feat/cuda-pr4' into feat/cuda-pr…

50d2541

…5-batch-invert # Conflicts: # prover/tests/cuda_path_integration.rs

cleanup

84ae125

fmt

adbcfe2

Merge branch 'main' into feat/cuda-pr5-batch-invert

7386d0a

Merge branch 'main' into feat/cuda-pr5-batch-invert

a29b013

claude Bot reviewed Jun 12, 2026

Comment thread crypto/stark/src/gpu_lde.rs

claude Bot reviewed Jun 12, 2026

Comment thread crypto/math-cuda/kernels/inverse.cu Outdated

ColoCarletti added 3 commits June 12, 2026 10:51

address comments

1a1de35

harden inv_denoms guard, fix scan kernel race

d73219e

fix debug assert

b0c60e1

ColoCarletti added the gpu Related to GPU/CUDA development label Jun 12, 2026

Merge branch 'main' into feat/cuda-pr5-batch-invert

577c6b2

diegokingston approved these changes Jun 12, 2026

Oppen reviewed Jun 16, 2026

ColoCarletti wants to merge 80 commits into main from feat/cuda-pr5-batch-invert
