Skip to content

feat(cuda): GPU batch inverse#658

Open
ColoCarletti wants to merge 80 commits into
mainfrom
feat/cuda-pr5-batch-invert
Open

feat(cuda): GPU batch inverse#658
ColoCarletti wants to merge 80 commits into
mainfrom
feat/cuda-pr5-batch-invert

Conversation

@ColoCarletti

Copy link
Copy Markdown
Collaborator

Summary

Replaces the two remaining CPU inplace_batch_inverse sites in the prover with a fused on-device compute+invert pipeline. R3 OOD's per-eval-point barycentric_inv_denoms and R4 DEEP's (x_i − z_k) denominators now both flow through a single multi-block Hillis-Steele scan kernel and return device-resident CudaSlice<u64> handles that downstream dispatchers (try_barycentric_*_on_handle, try_deep_composition_gpu) read directly.

Wall-clock parity on fib_iterative_4M on a 46-core host (savings overlap with existing GPU work). The win is in PCIe traffic (~6 GB of redundant H2D removed per prove), peak host RSS at R4 (~288 MB removed), and the architectural primitives (Arc<CudaStream> threading, DenomSign enum, public batch_inverse_ext3_dev API) that future device-resident extensions can reuse. Pays off proportionally on smaller hosts, larger traces, more eval points, and more tables.

Changes

  • crypto/math-cuda/kernels/inverse.cu — 6 kernels: sign-flagged compute_denoms_ext3 (R3 z − x vs R4 x − z), forward and reverse multi-block scans,
    apply-offsets passes, batch_inverse_combine_ext3.
  • crypto/math-cuda/src/inverse.rs — host driver: batch_inverse_ext3 (parity-test path), batch_inverse_ext3_dev (device→device), fused
    compute_and_invert_denoms_ext3_dev with the DenomSign enum, recursive scan driver, one ext3 Fermat inverse on host per batch.
  • crypto/math-cuda/src/{barycentric,deep}.rs_with_dev_inv_denoms variants that take a buffer + offset + caller's stream and slice internally (no
    per-call H2D, no cross-stream race).
  • crypto/stark/src/gpu_lde.rstry_compute_and_invert_inv_denoms_dev + try_inv_denoms_dev_with_stream (acquires backend + stream). Threads
    Option<(&CudaSlice<u64>, usize, &Arc<CudaStream>)> through the barycentric and DEEP dispatchers. New gpu_batch_invert_calls counter.
  • crypto/stark/src/{trace,prover}.rs — R3 OOD and R4 DEEP fast paths. CPU barycentric_inv_denoms is now lazy on the all-GPU happy path; the CPU denoms
    Vec in compute_deep_composition_poly_evaluations is only built when the dev-inv-denoms path returns None.
  • Testsbatch_inverse.rs (parity, n in {1, 2..256, 257..1024, 4096..2^16, 2^18, 2^20, 2^22}); compute_and_invert_denoms.rs (parity parametrised over both
    signs, R3 and R4 shapes); invert_ext3_host parity against Degree3GoldilocksExtensionField::inv; cuda_path_integration asserts the new counter fires;
    cuda_fallback_tests::gpu_batch_invert_fault_falls_back_to_cpu validates the CPU fallback under injected cudarc errors.

Fallback

Every dispatch returns None on TypeId mismatch (non-Goldilocks / non-ext3), below threshold, or any cudarc error. The caller falls through to the existing CPU inplace_batch_inverse path with no state change. Runtime-validated by the new fault-injection test.

ColoCarletti and others added 30 commits May 6, 2026 15:12
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
Co-authored-by: Gabriel Bosio <38794644+gabrielbosio@users.noreply.github.com>
ColoCarletti and others added 17 commits June 8, 2026 11:57
…5-batch-invert

# Conflicts:
#	prover/tests/cuda_path_integration.rs
…nvert

# Conflicts:
#	crypto/math-cuda/build.rs
#	crypto/math-cuda/src/deep.rs
#	crypto/math-cuda/src/device.rs
#	crypto/math-cuda/src/lib.rs
#	crypto/stark/src/gpu_lde.rs
#	crypto/stark/src/prover.rs
#	prover/tests/cuda_fallback_tests.rs
#	prover/tests/cuda_path_integration.rs
@ColoCarletti

Copy link
Copy Markdown
Collaborator Author

/codex

@ColoCarletti

Copy link
Copy Markdown
Collaborator Author

/claude

@github-actions

Copy link
Copy Markdown

Codex Code Review

Found 2 issues in the PR diff.

  1. High - R4 GPU fallback can panic when device inv-denoms exist but device parts do not
    crypto/stark/src/gpu_lde.rs:1199 skips validating inv_denoms_host whenever inv_denoms_dev is Some, but the (None, _) match arm later still slices inv_denoms_host[0..lde_size] at crypto/stark/src/gpu_lde.rs:1294. The caller in crypto/stark/src/prover.rs:1369 can pass Some(inv_dev) while round_2_result.gpu_composition_parts is None, so this turns a GPU fast-path miss into an out-of-bounds panic instead of falling back to CPU.
    Fix by adding an explicit (None, Some(_)) path that returns None or by supporting host parts + device inv-denoms; alternatively only attempt the device-inv-denoms DEEP path when gpu_composition_parts.is_some().

  2. Medium - Size truncation can under-launch CUDA kernels and leave uninitialized output
    The new inverse path casts usize sizes to u32 before computing grid sizes, e.g. crypto/math-cuda/src/inverse.rs:182, crypto/math-cuda/src/inverse.rs:218, and crypto/math-cuda/src/inverse.rs:323. For inputs above u32::MAX, this wraps/truncates, while allocations like crypto/math-cuda/src/inverse.rs:173 are sized from the full usize. Because these buffers are allocated uninitialized under “kernel writes every slot” assumptions, an under-launch can return/read unwritten device memory.
    Add explicit checked bounds before launch, e.g. reject n > u32::MAX / total > u32::MAX, and use checked_mul(3) for allocation lengths.

No Critical security issues found in the reviewed diff.

Comment thread crypto/stark/src/gpu_lde.rs
Comment thread crypto/math-cuda/kernels/inverse.cu Outdated
@claude

claude Bot commented Jun 12, 2026

Copy link
Copy Markdown

Review: feat(cuda): GPU batch inverse

Overall: Well-structured PR. The multi-block Hillis-Steele scan, recursive block-total pass, single host Fermat inversion per batch, DenomSign abstraction, Arc<CudaStream> threading, and fallback design are all sound. Test coverage is thorough. Two issues found.


Bug (Medium) — try_deep_composition_gpu can panic on R2-fail + R4-success combination

See inline on gpu_lde.rs:1308. When the caller passes inv_denoms_host = &[] with inv_denoms_dev = Some(...) (the prover.rs fast path does exactly this), but parts_dev = None (R2 GPU composition failed), execution falls into the (None, _) match arm and &inv_denoms_host[0..lde_size] panics on the empty slice. The length check at line ~1199 is conditional on inv_denoms_dev.is_none(), so it is bypassed for this combination. The fix is an early-return guard on the (None, _) arm when inv_denoms_host is unpopulated.

Low — Benign CUDA data race in block-total write (inverse.cu:130)

See inline on inverse.cu:130. When n % BLOCK_SIZE != 0, two threads write to the same block_totals slot. Both values are equal (identity padding leaves the running product unchanged) so correctness is never affected, but adding gid < n && to the condition makes the single-writer intent explicit.

@ColoCarletti ColoCarletti added the gpu Related to GPU/CUDA development label Jun 12, 2026
/// num_eval_points) and R4 DEEP (n = lde_size, k_scalars = 1 +
/// num_eval_points). Returns a device handle the caller can slice and
/// thread into downstream dispatchers without ever D2H'ing the inverted
/// values; on type / threshold / cudarc failure returns `None` so the

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't type failures be caught by the type system?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

gpu Related to GPU/CUDA development

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants