call_kernel performance: Pool hashing and unnecessary eval_safe_simd dominate runtime #202

@hczphn

🚀 [Performance] Bottlenecks in call_kernel: Repetitive Hashing & Redundant Evaluation

Problem Description

Profiling of the call_kernel execution path reveals two significant performance bottlenecks that together account for the majority of execution time during circuit building.


1. Pool::add(kernel) rehashes entire KernelPrimitive on every call

call_kernel calls self.kernel_primitives.add(kernel), which performs a HashMap::get(v) lookup. This requires hashing the entire KernelPrimitive struct—including both ir_for_later_compilation and ir_for_calling RootCircuit IRs with all their instructions—on every invocation, even when the kernel is already registered.
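To illustrate why this lookup scales with kernel size, here is a minimal sketch that counts the bytes fed to a hasher when a derived `Hash` implementation walks a struct. The `Kernel` type, its `Vec<u64>` fields, and `CountingHasher` are hypothetical stand-ins, not the real `KernelPrimitive` or `RootCircuit` definitions.

```rust
use std::hash::{Hash, Hasher};

// Hypothetical hasher that only counts how many bytes it is asked to hash.
#[derive(Default)]
struct CountingHasher {
    bytes: usize,
}

impl Hasher for CountingHasher {
    fn write(&mut self, bytes: &[u8]) {
        self.bytes += bytes.len();
    }

    fn finish(&self) -> u64 {
        self.bytes as u64
    }
}

// Hypothetical stand-in for KernelPrimitive: each IR is modeled as a flat
// list of instruction words. The derived Hash visits every element.
#[derive(Hash)]
struct Kernel {
    ir_for_later_compilation: Vec<u64>,
    ir_for_calling: Vec<u64>,
}

// Number of bytes a single hash of `k` has to process.
fn hashed_bytes(k: &Kernel) -> usize {
    let mut h = CountingHasher::default();
    k.hash(&mut h);
    h.bytes
}
```

Under these assumptions, a kernel with 10,000 words per IR feeds at least 160,000 bytes to the hasher on each lookup, and a map lookup keyed by the struct pays that cost every time, whether or not the key is already present.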

Profiling Data

| Kernel | kernel_primitives.add() | Total call_kernel |
| --- | --- | --- |
| pre_attn_sln | 1.17s | 1.27s |
| freivalds_shared_x_qkv | 3.38s | 4.22s |

Proposed Fix

  • Short-term: Cache the kernel ID on the caller side or use pointer-based identity for fast-path lookup to avoid repeated full-struct hashing.
  • Long-term: Introduce a register_kernel method that returns a reusable ID, and a call_kernel_by_id variant that skips the Pool::add overhead.
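The long-term proposal could look roughly like the sketch below. It is only an illustration: `Kernel`, `Pool`, `Context`, and `KernelId` are simplified, hypothetical types, and the bodies of `register_kernel` and `call_kernel_by_id` are assumptions about how such an API might behave, not the project's actual implementation.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for KernelPrimitive; hashing it walks every instruction.
#[derive(Clone, PartialEq, Eq, Hash)]
struct Kernel {
    instructions: Vec<u64>,
}

// Hypothetical pool that deduplicates kernels by value.
#[derive(Default)]
struct Pool {
    ids: HashMap<Kernel, usize>,
    items: Vec<Kernel>,
}

impl Pool {
    // Hashes the whole kernel on every call, even if it is already stored.
    fn add(&mut self, kernel: &Kernel) -> usize {
        if let Some(&id) = self.ids.get(kernel) {
            return id;
        }
        let id = self.items.len();
        self.items.push(kernel.clone());
        self.ids.insert(kernel.clone(), id);
        id
    }

    // Index lookup: no hashing involved.
    fn get(&self, id: usize) -> &Kernel {
        &self.items[id]
    }
}

// Opaque handle returned by registration (hypothetical).
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct KernelId(usize);

#[derive(Default)]
struct Context {
    kernel_primitives: Pool,
    calls: Vec<KernelId>,
}

impl Context {
    // Pay the full-struct hash once, up front.
    fn register_kernel(&mut self, kernel: &Kernel) -> KernelId {
        KernelId(self.kernel_primitives.add(kernel))
    }

    // Subsequent calls look the kernel up by index and record the call.
    fn call_kernel_by_id(&mut self, id: KernelId) {
        let _kernel = self.kernel_primitives.get(id.0);
        self.calls.push(id);
    }
}
```

A caller would invoke `register_kernel` once per distinct kernel and reuse the returned `KernelId` for every later call, so the hashing cost no longer depends on how many times the kernel is called.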

2. eval_safe_simd runs unconditionally for pure-constraint kernels

call_kernel always evaluates kernel.ir_for_calling().eval_safe_simd(...) for every parallel instance. This is often unnecessary:

  1. Pure-constraint kernels: For kernels where all IO specs are inputs (no outputs), ir_for_calling has its constraints stripped and produces no meaningful output.
  2. Known outputs: In many use cases, output values are already computed externally (e.g., during model inference). call_kernel is invoked only to register the kernel call for later proving, making the SIMD re-computation redundant.

Profiling Data (After fixing issue 1)

| Kernel | eval_safe_simd | Total call_kernel |
| --- | --- | --- |
| freivalds_shared_x_qkv | 74ms | ~100ms |
| freivalds_shared_x_qkv | 579ms | ~840ms |

Proposed Fix

  • Optimization: Skip eval_safe_simd for kernels with no output specs:
    !kernel.io_specs().iter().any(|s| s.is_output)
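As a rough sketch of how that predicate could gate the evaluation, assuming simplified, hypothetical `IoSpec` and `Kernel` types (only the `is_output` field and `io_specs()` accessor named above are modeled, and the counter stands in for the real `eval_safe_simd` call):

```rust
// Hypothetical stand-in for an IO spec; only `is_output` is modeled.
struct IoSpec {
    is_output: bool,
}

// Hypothetical stand-in for the kernel type.
struct Kernel {
    io_specs: Vec<IoSpec>,
}

impl Kernel {
    fn io_specs(&self) -> &[IoSpec] {
        &self.io_specs
    }
}

// True when at least one IO spec is an output, i.e. evaluation can
// produce something the caller needs.
fn needs_eval(kernel: &Kernel) -> bool {
    kernel.io_specs().iter().any(|s| s.is_output)
}

// Simplified call path: `evals` counts how often the (placeholder)
// evaluation step runs.
fn call_kernel(kernel: &Kernel, evals: &mut usize) {
    if needs_eval(kernel) {
        *evals += 1; // placeholder for eval_safe_simd
    }
    // Recording the kernel call for later proving would happen here
    // regardless of whether evaluation ran.
}
```

This covers only the pure-constraint case. Skipping evaluation when outputs are already known externally would need an additional, explicit signal from the caller.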
