Compensated summation for cpu_statevec_anyCtrlAnyTargDenseMatr_sub (#598) - [unitaryHACK] by mk0dz · Pull Request #791 · QuEST-Kit/QuEST

mk0dz · 2026-06-12T22:39:54Z

Summary

Adds an optional compensated-summation path to improve numerical accuracy in CPU dense-matrix target evolution.

The implementation introduces a Kahan-compensated reduction in cpu_statevec_anyCtrlAnyTargDenseMatr_sub(), controlled by the compile-time flag:

-DQUEST_COMPENSATE_DENSEMATR_SUM=ON

The feature is disabled by default to preserve existing performance characteristics.

Implementation Notes

Accumulates the running sum and its Kahan compensation term in local cpu_qcomp variables and writes the result back once after the reduction.
Applies only to the CPU dense-matrix path.
This source file is not built with fast-math, so the compensation term is preserved by the compiler.
Replaces the previous TODO comment with documentation describing the behaviour and trade-offs.
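
For readers unfamiliar with the technique, here is a minimal sketch of a Kahan-compensated inner reduction next to the naive one. It is not the code in this PR: `amp_t`, `dotNaive` and `dotKahan` are hypothetical names, and `std::complex<double>` stands in for cpu_qcomp. It also assumes the file is compiled without fast-math, since otherwise the compiler may simplify the compensation term away.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the amplitude type (the PR uses cpu_qcomp).
using amp_t = std::complex<double>;

// Naive reduction: one output amplitude = sum_j row[j] * amps[j].
amp_t dotNaive(const std::vector<amp_t>& row, const std::vector<amp_t>& amps) {
    assert(row.size() == amps.size());
    amp_t sum(0.0, 0.0);
    for (std::size_t j = 0; j < row.size(); ++j)
        sum += row[j] * amps[j];
    return sum;
}

// Kahan-compensated reduction: `comp` carries the low-order bits that
// were rounded away when the previous term was added to `sum`.
amp_t dotKahan(const std::vector<amp_t>& row, const std::vector<amp_t>& amps) {
    assert(row.size() == amps.size());
    amp_t sum(0.0, 0.0);   // running sum, kept local
    amp_t comp(0.0, 0.0);  // running compensation, kept local
    for (std::size_t j = 0; j < row.size(); ++j) {
        const amp_t term = row[j] * amps[j] - comp;  // re-inject the lost bits
        const amp_t next = sum + term;               // low bits of term may be lost here
        comp = (next - sum) - term;                  // recover what was lost
        sum = next;
    }
    return sum;  // the caller writes this back once
}
```

Each added term costs four additions instead of one, and every iteration depends on the `sum` and `comp` produced by the previous one.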

Files Changed

quest/src/cpu/cpu_subroutines.cpp
quest/src/cpu/CMakeLists.txt

Accuracy & Performance

Benchmarked the naive and Kahan-compensated implementations against a __float128 reference across all three floating-point precisions.

Precision	Accuracy @ 12 Targets	Runtime Cost
fp32	~58× lower error	~3.3×
fp64	~29× lower error	~2.2× (~1.76× for 10 targets on a 24-qubit state)
fp80	~25× lower error	~2.1×
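
The measurement idea can be sketched as follows. This is an illustrative toy and not the benchmark harness used for the table: it sums n copies of one real value rather than running the dense-matrix kernel, it uses `long double` as the reference where the PR used __float128, and `relErrNaive` and `relErrKahan` are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Relative error of naively summing n copies of `value` in precision T,
// measured against a long double reference.
template <typename T>
long double relErrNaive(T value, std::size_t n) {
    assert(n > 0 && value != T(0));
    T sum = 0;
    for (std::size_t i = 0; i < n; ++i)
        sum += value;
    const long double ref = static_cast<long double>(value) * n;
    return std::fabs((static_cast<long double>(sum) - ref) / ref);
}

// Same measurement, with Kahan compensation applied to the accumulation.
template <typename T>
long double relErrKahan(T value, std::size_t n) {
    assert(n > 0 && value != T(0));
    T sum = 0;
    T comp = 0;
    for (std::size_t i = 0; i < n; ++i) {
        const T term = value - comp;
        const T next = sum + term;
        comp = (next - sum) - term;
        sum = next;
    }
    const long double ref = static_cast<long double>(value) * n;
    return std::fabs((static_cast<long double>(sum) - ref) / ref);
}
```

Calling, for example, `relErrNaive<float>(0.1f, n)` and `relErrKahan<float>(0.1f, n)` with the same n gives the two errors whose ratio corresponds to the accuracy column.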

Observations

Accuracy improvements grow with target count.
Benefits become negligible below roughly 7 targets.
The runtime cost is not hidden by memory traffic because the kernel is compute-bound on the matrix multiplication rather than memory-bandwidth limited.
The Kahan recurrence prevents efficient vectorization.
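
As background for the first observation, the standard worst-case bounds for summing n floating-point terms x_i with unit roundoff u are given below. These are textbook results and not measurements from this PR, and they cover only the additions (rounding in the products is ignored). For k target qubits the dense matrix is 2^k × 2^k, so each output amplitude is a sum of n = 2^k products, which is why the naive bound grows with target count while the Kahan bound stays nearly flat.

```latex
\begin{align*}
\text{naive:}    \quad & |E_n| \le (n-1)\,u \sum_{i=1}^{n} |x_i| + O(u^2) \\
\text{Kahan:}    \quad & |E_n| \le \bigl(2u + O(n u^2)\bigr) \sum_{i=1}^{n} |x_i| \\
\text{pairwise:} \quad & |E_n| \le \frac{u \lceil \log_2 n \rceil}{1 - u \lceil \log_2 n \rceil} \sum_{i=1}^{n} |x_i|
\end{align*}
```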

Validation

All existing dense CompMatr tests pass for both compensated and uncompensated builds, including:

applyCompMatr
Controlled variants

Test results:

93,791 assertions
4 test cases

Rationale for Default-Off

The improvement is primarily beneficial for large or ill-conditioned matrices, while introducing a consistent ~2–3× runtime overhead.

Keeping the feature disabled by default preserves current performance expectations while allowing users to opt in when higher numerical accuracy is required.

Future Work

A possible alternative is pairwise summation, which offers:

Error scaling of approximately O(log N)
Better vectorization opportunities
Potentially lower runtime overhead

This can be evaluated separately in a future PR.
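
A minimal sketch of what pairwise summation could look like, to make the proposal concrete. Nothing here is in this PR: `amp_t`, `pairwiseSum` and the cutoff `kBaseCase` are hypothetical, and `std::complex<double>` stands in for cpu_qcomp. The sketch assumes the products have already been formed into a contiguous array.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the amplitude type (the PR uses cpu_qcomp).
using amp_t = std::complex<double>;

// Pairwise (cascade) summation: split the range in half, sum each half
// recursively, then add the two partial sums.
amp_t pairwiseSum(const amp_t* terms, std::size_t n) {
    assert(terms != nullptr || n == 0);

    // Hypothetical cutoff: short runs are summed naively, which keeps the
    // recursion shallow and leaves a simple loop the compiler can unroll.
    constexpr std::size_t kBaseCase = 8;
    if (n <= kBaseCase) {
        amp_t sum(0.0, 0.0);
        for (std::size_t i = 0; i < n; ++i)
            sum += terms[i];
        return sum;
    }

    const std::size_t half = n / 2;
    return pairwiseSum(terms, half) + pairwiseSum(terms + half, n - half);
}
```

Unlike the Kahan recurrence, the two halves are independent of each other, which is where the better vectorization opportunities would come from.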

I also acknowledge Claude for guiding me through it.

/claim #598

Closes #598.

…uEST-Kit#598) The dense-matrix subroutine's inner reduction is liable to catastrophic cancellation for many target qubits. This adds an opt-in compensated path (-DQUEST_COMPENSATE_DENSEMATR_SUM=ON); base_qcomp's operators are plain IEEE arithmetic so the compensation is honoured directly. Single-CPU benchmarks (fp32/fp64/fp80): relative error improves ~25-58x at 12 targets, the benefit growing with target count; runtime cost is ~2-3.3x (compute-bound) falling to ~1.8x in the large-statevector regime. Left opt-in (off by default) per that trade-off.

TysonRayJones · 2026-06-15T18:10:55Z

Hi Mukul,

Beautiful diff, and very useful and thorough benchmarking!

Super interesting that quadrupling the flops per outer iteration (by the Kahan summation) still measurably impacts runtime at 24 qubits (at which the state is ~120 - 500 MiB). You have demonstrated compensation is only worthwhile for many target qubits (otherwise of course, the sum is too small to need compensation), and the more target qubits, the exponentially greater the total number of flops per iteration. So, there might not exist a regime where compensation of the inner loop has an accuracy benefit, where the outer loop is simultaneously memory bandwidth bound to hide the slowdown. Quite a shock to me!

We will explore further to decide if exposing this hyperparameter (whether or not to compensate) is worthwhile. In any case, you have satisfied the requirements of the unitaryHACK challenge! 🎉 Please comment on issue #598 so I can assign it to you, and award the prize.

mk0dz mentioned this pull request Jun 15, 2026

Improve applyCompMatr accuracy with compensated summation #598

Closed

This was referenced Jun 15, 2026

Fuse distributed prefix-suffix multi-SWAP (closes #595) #785

Open

ci: locate Visual Studio with vswhere in the MSVC CUDA setup #797

Open

mk0dz wants to merge 1 commit into QuEST-Kit:devel from mk0dz:feat/598-compensated-densematr-sum
