Compensated summation for cpu_statevec_anyCtrlAnyTargDenseMatr_sub (#598) - [unitaryHACK]#791
mk0dz wants to merge 1 commit
…uEST-Kit#598) The dense-matrix subroutine's inner reduction is liable to catastrophic cancellation for many target qubits. This adds an opt-in compensated path (-DQUEST_COMPENSATE_DENSEMATR_SUM=ON); base_qcomp's operators are plain IEEE arithmetic so the compensation is honoured directly. Single-CPU benchmarks (fp32/fp64/fp80): relative error improves ~25-58x at 12 targets, the benefit growing with target count; runtime cost is ~2-3.3x (compute-bound) falling to ~1.8x in the large-statevector regime. Left opt-in (off by default) per that trade-off.
Hi Mukul,
Beautiful diff, and very useful and thorough benchmarking! Super interesting that quadrupling the flops per outer iteration (by the Kahan summation) still measurably impacts runtime at 24 qubits (at which the state is ~120 - 500 MiB).
You have demonstrated compensation is only worthwhile for many target qubits (otherwise of course, the sum is too small to need compensation), and the more target qubits, the exponentially greater the total number of flops per iteration. So, there might not exist a regime where compensation of the inner loop has an accuracy benefit, where the outer loop is simultaneously memory bandwidth bound to hide the slowdown. Quite a shock to me!
We will explore further to decide if exposing this hyperparameter (whether or not to compensate) is worthwhile. In any case, you have satisfied the requirements of the unitaryHACK challenge! 🎉 Please comment on issue #598 so I can assign it to you, and award the prize.
Summary
Adds an optional compensated-summation path to improve numerical accuracy in CPU dense-matrix target evolution.
The implementation introduces a Kahan-compensated reduction in cpu_statevec_anyCtrlAnyTargDenseMatr_sub(), controlled by the compile-time flag:
-DQUEST_COMPENSATE_DENSEMATR_SUM=ON
The feature is disabled by default to preserve existing performance characteristics.
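As a rough illustration of the technique (not the code in this PR), the sketch below contrasts a naive complex inner product with a Kahan-compensated one. The names amp_t, naiveDot and kahanDot are hypothetical, and std::complex<float> is only an assumed stand-in for QuEST's amplitude type. Note that aggressive floating-point flags such as -ffast-math may let a compiler simplify the compensation term away.

```cpp
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the amplitude type; not QuEST's own typedef.
using amp_t = std::complex<float>;

// Naive reduction: accumulates row[j] * amps[j] directly into the running sum.
amp_t naiveDot(const std::vector<amp_t>& row, const std::vector<amp_t>& amps) {
    amp_t sum{0.0f, 0.0f};
    for (std::size_t j = 0; j < row.size(); ++j)
        sum += row[j] * amps[j];
    return sum;
}

// Kahan-compensated reduction: 'comp' tracks the rounding error lost by each
// addition and feeds it back into the next term. Real and imaginary parts are
// compensated independently because complex +/- act component-wise.
amp_t kahanDot(const std::vector<amp_t>& row, const std::vector<amp_t>& amps) {
    amp_t sum{0.0f, 0.0f};
    amp_t comp{0.0f, 0.0f};
    for (std::size_t j = 0; j < row.size(); ++j) {
        amp_t term = row[j] * amps[j] - comp;  // re-inject previously lost error
        amp_t next = sum + term;               // low-order bits of term may be lost here
        comp = (next - sum) - term;            // recover what was lost
        sum = next;
    }
    return sum;                                // written back once, after the loop
}
```

With 4096 terms (the row length for 12 target qubits) where one term is 1 and the rest are 1e-8, the naive float sum stays at exactly 1 because each small term falls below half a unit in the last place, whereas the compensated sum recovers their accumulated contribution.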
Implementation Notes
… cpu_qcomp variable and writes it back once after the reduction.
Files Changed
quest/src/cpu/cpu_subroutines.cpp
quest/src/cpu/CMakeLists.txt
Accuracy & Performance
Benchmarked against a __float128 reference, comparing the naive and Kahan-compensated implementations across all three floating-point precisions (fp32, fp64, fp80).
Observations
Validation
All existing dense CompMatr tests pass for both compensated and uncompensated builds, including:
applyCompMatr
Test results:
Rationale for Default-Off
The improvement is primarily beneficial for large or ill-conditioned matrices, while introducing a consistent ~2–3× runtime overhead.
Keeping the feature disabled by default preserves current performance expectations while allowing users to opt in when higher numerical accuracy is required.
Future Work
A possible alternative is pairwise summation, which offers:
O(log N) error growth
This can be evaluated separately in a future PR.
I also acknowledge Claude for guiding me through it.
/claim #598
Closes #598.