
Switch QBLAS dependency to 1.5.0 #95

Open
SwayamInSync wants to merge 6 commits into numpy:main from SwayamInSync:qblas-rewrite-integration

Conversation

@SwayamInSync
Member

@SwayamInSync SwayamInSync commented May 14, 2026

Summary

Bumps the QBLAS dependency to 1.5.0.

For users of numpy_quaddtype, nothing visible in the Python API changes. A @ B and np.matmul(A, B) continue to work exactly as before. The C-level shim's old QuadBLAS:: C++ namespace is replaced by free C functions (cblas_qdot, cblas_qgemv, cblas_qgemm, ...) inside src/csrc/quadblas_interface.cpp.

Performance impact

Measured on AMD EPYC 7V13 (Zen 3, AVX2 tier), same numpy float64 / scipy-openblas baseline (Haswell tier), same harness on both QBLAS versions. Harness: bench/bench_quad_vs_numpy.py in the QBLAS repo.

| op | size | threads | f64 ref (OpenBLAS) | QBLAS 1.0 | QBLAS 1.5 | new / old |
| --- | --- | --- | --- | --- | --- | --- |
| gemm | 128 x 128 | 16 | 73.4 GFLOPS | 0.034 GFMA/s (2167x) | 0.81 GFMA/s (84x) | 24x |
| gemm | 256 x 256 | 16 | 239 GFLOPS | 0.034 GFMA/s (7072x) | 0.83 GFMA/s (287x) | 24x |
| gemm | 512 x 512 | 16 | 356 GFLOPS | 0.243 GFMA/s (1465x) | 0.83 GFMA/s (449x) | 3.4x |
| gemm | 1024 x 1024 | 16 | 462 GFLOPS | 0.487 GFMA/s (933x) | 0.84 GFMA/s (552x) | 1.7x |
| gemm | 256 x 256 | 1 | 47.1 GFLOPS | 0.034 GFMA/s (1392x) | 0.058 GFMA/s (809x) | 1.7x |
| gemv | 2048 | 16 | 168 GFLOPS | 0.506 GFMA/s (332x) | 0.83 GFMA/s (204x) | 1.6x |
| dot | 1 M | 16 | 43.9 GFLOPS | 0.252 GFMA/s (174x) | 0.41 GFMA/s (103x) | 1.6x |

Three gains stack:

  • Uniform ~1.6 to 1.7x single-core kernel speedup from the per-ISA OBJECT-library design, replacing the old SSE2-locked header-only template path. Runtime CPUID dispatch picks AVX2 on this hardware automatically.
  • ~24x speedup at small-N multi-threaded gemm specifically. Old QBLAS shipped a fixed nc = NC_DEFAULT = 512, so a 128x128 gemm at 16 threads ran on a single thread (the others spun up, hit the barrier, and slept). New QBLAS auto-scales nc so each thread gets at least two blocks.
  • This PR also enables the fast kernel path for F-contiguous / non-row-major inputs ([BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs #89)
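The small-N fix in the second bullet can be sketched numerically. The exact QBLAS 1.5 heuristic is not shown in this PR, so `auto_nc`, `nr`, and the "at least two panels per thread" rule below are illustrative assumptions, not the real code:

```python
import math

def auto_nc(n, threads, nc_max=512, nr=4):
    # Hypothetical sketch of the QBLAS 1.5 nc auto-scaling rule:
    # shrink the column-panel width so every thread owns >= 2 panels.
    nc = n // (2 * threads) if threads > 1 else nc_max
    nc = max(nr, (nc // nr) * nr)   # keep nc a multiple of the micro-kernel width
    return min(nc, nc_max)

# Old QBLAS: fixed nc = 512, so a 128-wide gemm is a single panel and
# 15 of the 16 worker threads sleep at the barrier.
assert math.ceil(128 / 512) == 1
# Sketched rule: the panel shrinks until all 16 threads get >= 2 panels each.
assert math.ceil(128 / auto_nc(128, 16)) >= 2 * 16
```

At large N the computed panel width approaches the cache-friendly maximum again, which matches the table: the 24x win shows up only at small sizes.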

Caveat: Single-thread quad-precision gemm is still ~800x slower than f64 OpenBLAS. That gap is intrinsic to SLEEF's triple-double FMA implementation. The roadmap to bring it down to ~50 to 100x lives in docs/performance_bottlenecks.md, and none of it is in 1.5.0. The value-add of 1.5 is correctness and dispatch hygiene (proper threading, runtime SIMD selection, transpose flags), not lifting the FMA ceiling.

The new compiled QBLAS uses GCC-only flags (-march, -mavx2), POSIX-only
APIs (clock_gettime, sysconf, _SC_LEVEL2_CACHE_SIZE), and GCC built-ins
for CPUID detection, none of which build under MSVC. The legacy
header-only QBLAS happened to compile on Windows only because the
wrapper was five lines and produced no objects.

Add a disable_quadblas meson option (default false; auto-enabled on
Windows) that:

  - skips dependency('qblas', fallback: ...) entirely
  - declares an empty qblas_dep so the rest of the build proceeds
  - adds -DDISABLE_QUADBLAS to project args so quadblas_interface.cpp
    and umath/matmul.cpp take the naive-kernel branches that already
    exist behind that guard
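A minimal sketch of the option wiring, assuming the option name from this PR; the file layout, variable names, and fallback names are illustrative:

```meson
# meson_options.txt (illustrative)
option('disable_quadblas', type: 'boolean', value: false,
       description: 'Build without QBLAS; fall back to the naive matmul kernels')

# meson.build (illustrative)
no_qblas = get_option('disable_quadblas') or host_machine.system() == 'windows'
if no_qblas
  qblas_dep = declare_dependency()    # empty dep so the rest of the build proceeds
  add_project_arguments('-DDISABLE_QUADBLAS', language: ['c', 'cpp'])
else
  qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
endif
```

Meson translates `-D` to `/D` for MSVC automatically, which is what lets the Windows CI drop its `CFLAGS=/DDISABLE_QUADBLAS` hack.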

This removes the CFLAGS=/DDISABLE_QUADBLAS hack from the Windows CI
since meson now sets the preprocessor flag itself (and translates -D
to /D for MSVC). Verified on Linux: -Ddisable_quadblas=true builds
without pulling the QBLAS subproject; default still builds it.

Meson caches the first subproject() call by name. When the QBLAS
dependency was resolved first, qblas's internal subproject('sleef')
ran with no options, configuring SLEEF with FMA enabled. Our later
subproject('sleef', default_options: ['disable_fma=true']) was then
silently a no-op: meson returned the already-configured instance.

Net effect on the old-CPU CI: the SLEEF in the wheel still had the
PURECFMA scalar code path enabled, and Intel SDE trapped on the first
vfnmadd132sd from Sleef_log10q1_u10purecfma when emulating Sandy
Bridge.

Move the whole SLEEF resolution block above the qblas dependency line.
SLEEF gets configured with the right options on its first (and only)
load; QBLAS's later subproject('sleef') call returns the same
instance, so there's now exactly one SLEEF in the build with the
options we wanted.
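The reordering can be sketched like this (fallback variable names are illustrative; `sleef`, `qblas`, and `disable_fma=true` come from the PR description):

```meson
# Resolve SLEEF first, with the options we need. Meson caches
# subprojects by name, so this first configuration is the one that sticks.
sleef_dep = dependency('sleef',
                       fallback: ['sleef', 'sleef_dep'],
                       default_options: ['disable_fma=true'])

# QBLAS's internal subproject('sleef') now returns the instance above,
# so exactly one SLEEF (FMA disabled) ends up in the build.
qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
```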

quaddtype's own sources don't use OpenMP (no #pragma omp or omp_*
calls). Both quaddtype's meson.build and qblas's meson.build called
dependency('openmp'), and qblas_dep also propagated the OpenMP dep
through its 'dependencies:' list. The same OpenMP instance therefore
appeared twice in quaddtype's compile-args closure.

On Apple's clang++ that double inclusion left an orphan -Xpreprocessor
in the args stream. Meson's ninja rule appends dependency-generation
flags ('-MD -MQ -MF ...') after $ARGS, so the orphan -Xpreprocessor
paired with -MD on the next line. clang++ then passed -MD to the
preprocessor verbatim, the preprocessor rejected it as unknown, and
every C++ compile failed, sinking the macos-15 wheel build.

qblas_dep already brings OpenMP transitively when QBLAS is enabled.
When QBLAS is disabled (Windows / -Ddisable_quadblas=true) we don't
need OpenMP at all because none of quaddtype's own sources use it.
Removing the second dependency('openmp') makes the closure single-
sourced and eliminates the duplicate compile args.
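As a before/after sketch (the dependency-list variable is illustrative):

```meson
# Before: OpenMP entered the compile-args closure twice
#   openmp_dep = dependency('openmp')
#   deps = [qblas_dep, openmp_dep]

# After: single-sourced; qblas_dep propagates OpenMP transitively
# when QBLAS is enabled, and the disabled path needs no OpenMP at all.
deps = [qblas_dep]
```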
@SwayamInSync SwayamInSync added the enhancement New feature or request label May 14, 2026
@SwayamInSync SwayamInSync requested a review from ngoldbaum May 14, 2026 21:24
@SwayamInSync
Member Author

SwayamInSync commented May 14, 2026

Writing down some of my observations here.
I was microbenchmarking the FMA instruction on my machine; in brief:
VFMADD132PD (AVX2) performs 4 FMAs for the f64 dtype within 1 µop (0.5 cycles per FMA).
For quaddtype we have Sleef_fmaq4_u05avx2, which also does 4 FMAs for f128, within ~275 µops (~240 cycles per FMA).
SLEEF internally implements a "triple-double" FMA that gives the result within 0.5 ULP (as per the standard, and costly).

So I am not sure how many users would be interested, but we could implement a "double-double" FMA, which would save us ~100 µops; the result would be within 1.5 ULP (a loss of ~7 bits of accuracy).
Just leaving this here as a thought. I might actually try this in QBLAS (just for fun, but if some trade-off is okay here, I'm happy to patch it in the future).
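The double-double idea can be sketched with plain f64 error-free transformations. `two_sum` and `two_prod` are the standard Knuth/Dekker constructions; `dd_fma` and the choice of which low-order cross terms to drop are illustrative, not SLEEF's actual implementation:

```python
def two_sum(a, b):
    # Knuth's error-free transformation: a + b == s + e exactly.
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    # Dekker splitting: a == hi + lo, with hi holding the top 26 bits.
    t = 134217729.0 * a            # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    # Dekker's error-free product: a * b == p + e exactly, no hardware FMA.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dd_fma(ah, al, bh, bl, ch, cl):
    # Illustrative double-double fma(a, b, c): cheaper than triple-double
    # because the lowest-order cross terms are folded in approximately,
    # which is where the ~1.5 ULP (instead of 0.5 ULP) bound comes from.
    ph, pl = two_prod(ah, bh)
    pl += ah * bl + al * bh
    sh, se = two_sum(ph, ch)
    return two_sum(sh, se + pl + cl)

# The (value, error) pair keeps what a single double would round away:
assert two_sum(1e16, 1.0) == (1e16, 1.0)
```

Each pair of doubles carries ~107 bits of significand, versus the ~160+ bits of a triple-double, which is the accuracy/speed trade-off being proposed.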

@juntyr
Contributor

juntyr commented May 14, 2026

> Writing down some of my observations here: I was microbenchmarking the FMA instruction on my machine so in brief VFMADD132PD (AVX2) performs 4 FMA for f64 dtype within 1 µop (0.5 cycles per FMA). For quaddtype we have Sleef_fmaq4_u05avx2 which also does 4 FMA for f128 within ~275 µops (~240 cycles per FMA). SLEEF internally implements "triple-double" FMA that gives result within 0.5 ULP (as per standard and costly).
>
> So I am not sure how many users will be interested but we can implement a "double-double" FMA which will save us ~100 µops but the result will be within 1.5 ULP (loss of ~7 bits of accuracy). Just leaving this here as a thought, I might be actually trying to do this in QBLAS (just for fun purpose, but if some tradeoff is okay here then happy to patch it in future)

At least by default we should not sacrifice precision, my use case needs it

