
Switch QBLAS dependency to 1.5.0 #95

Open
SwayamInSync wants to merge 6 commits into numpy:main from SwayamInSync:qblas-rewrite-integration

Conversation

@SwayamInSync
Member

@SwayamInSync SwayamInSync commented May 14, 2026

Summary

Bumps the QBLAS dependency to 1.5.0.

For users of numpy_quaddtype, nothing visible in the Python API changes. A @ B and np.matmul(A, B) continue to work exactly as before. The C-level shim's old QuadBLAS:: C++ namespace is replaced by free C functions (cblas_qdot, cblas_qgemv, cblas_qgemm, ...) inside src/csrc/quadblas_interface.cpp.

Performance impact

Measured on AMD EPYC 7V13 (Zen 3, AVX2 tier), same numpy float64 / scipy-openblas baseline (Haswell tier), same harness on both QBLAS versions. Harness: bench/bench_quad_vs_numpy.py in the QBLAS repo.

| op | size | threads | f64 ref (OpenBLAS) | QBLAS 1.0 | QBLAS 1.5 | new / old |
| --- | --- | --- | --- | --- | --- | --- |
| gemm | 128 x 128 | 16 | 73.4 GFLOPS | 0.034 GFMA/s (2167x) | 0.81 GFMA/s (84x) | 24x |
| gemm | 256 x 256 | 16 | 239 GFLOPS | 0.034 GFMA/s (7072x) | 0.83 GFMA/s (287x) | 24x |
| gemm | 512 x 512 | 16 | 356 GFLOPS | 0.243 GFMA/s (1465x) | 0.83 GFMA/s (449x) | 3.4x |
| gemm | 1024 x 1024 | 16 | 462 GFLOPS | 0.487 GFMA/s (933x) | 0.84 GFMA/s (552x) | 1.7x |
| gemm | 256 x 256 | 1 | 47.1 GFLOPS | 0.034 GFMA/s (1392x) | 0.058 GFMA/s (809x) | 1.7x |
| gemv | 2048 | 16 | 168 GFLOPS | 0.506 GFMA/s (332x) | 0.83 GFMA/s (204x) | 1.6x |
| dot | 1 M | 16 | 43.9 GFLOPS | 0.252 GFMA/s (174x) | 0.41 GFMA/s (103x) | 1.6x |

Three gains stack:

  • Uniform ~1.6 to 1.7x single-core kernel speedup from the per-ISA OBJECT-library design, replacing the old SSE2-locked header-only template path. Runtime CPUID dispatch picks AVX2 on this hardware automatically.
  • ~24x speedup at small-N multi-threaded gemm specifically. Old QBLAS shipped a fixed nc = NC_DEFAULT = 512, so a 128x128 gemm at 16 threads ran on a single thread (the others spun up, hit the barrier, and slept). New QBLAS auto-scales nc so each thread gets at least two blocks.
  • This PR also enables the fast kernel path for F-contiguous / non-row-major inputs ([BUG] matmul produces incorrect results for F-contiguous / non-row-major inputs #89)
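The small-N fix in the second bullet can be sketched numerically. The exact QBLAS 1.5 heuristic is not shown in this PR, so `auto_nc`, `nr`, and the "at least two panels per thread" rule below are illustrative assumptions, not the real code:

```python
import math

def auto_nc(n, threads, nc_max=512, nr=4):
    # Hypothetical sketch of the QBLAS 1.5 nc auto-scaling rule:
    # shrink the column-panel width so every thread owns >= 2 panels.
    nc = n // (2 * threads) if threads > 1 else nc_max
    nc = max(nr, (nc // nr) * nr)   # keep nc a multiple of the micro-kernel width
    return min(nc, nc_max)

# Old QBLAS: fixed nc = 512, so a 128-wide gemm is a single panel and
# 15 of the 16 worker threads sleep at the barrier.
assert math.ceil(128 / 512) == 1
# Sketched rule: the panel shrinks until all 16 threads get >= 2 panels each.
assert math.ceil(128 / auto_nc(128, 16)) >= 2 * 16
```

At large N the computed panel width approaches the cache-friendly maximum again, which matches the table: the 24x win shows up only at small sizes.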

Caveat: Single-thread quad-precision gemm is still ~800x slower than f64 OpenBLAS. That gap is intrinsic to SLEEF's triple-double FMA implementation. The roadmap to bring it down to ~50 to 100x lives in docs/performance_bottlenecks.md, and none of it is in 1.5.0. The value-add of 1.5 is correctness and dispatch hygiene (proper threading, runtime SIMD selection, transpose flags), not lifting the FMA ceiling.

The new compiled QBLAS uses GCC-only flags (-march, -mavx2), POSIX-only
APIs (clock_gettime, sysconf, _SC_LEVEL2_CACHE_SIZE), and GCC built-ins
for CPUID detection, none of which build under MSVC. The legacy
header-only QBLAS happened to compile on Windows only because the
wrapper was five lines and produced no objects.

Add a disable_quadblas meson option (default false; auto-enabled on
Windows) that:

  - skips dependency('qblas', fallback: ...) entirely
  - declares an empty qblas_dep so the rest of the build proceeds
  - adds -DDISABLE_QUADBLAS to project args so quadblas_interface.cpp
    and umath/matmul.cpp take the naive-kernel branches that already
    exist behind that guard
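A minimal sketch of the option wiring, assuming the option name from this PR; the file layout, variable names, and fallback names are illustrative:

```meson
# meson_options.txt (illustrative)
option('disable_quadblas', type: 'boolean', value: false,
       description: 'Build without QBLAS; fall back to the naive matmul kernels')

# meson.build (illustrative)
no_qblas = get_option('disable_quadblas') or host_machine.system() == 'windows'
if no_qblas
  qblas_dep = declare_dependency()    # empty dep so the rest of the build proceeds
  add_project_arguments('-DDISABLE_QUADBLAS', language: ['c', 'cpp'])
else
  qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
endif
```

Meson translates `-D` to `/D` for MSVC automatically, which is what lets the Windows CI drop its `CFLAGS=/DDISABLE_QUADBLAS` hack.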

This removes the CFLAGS=/DDISABLE_QUADBLAS hack from the Windows CI
since meson now sets the preprocessor flag itself (and translates -D
to /D for MSVC). Verified on Linux: -Ddisable_quadblas=true builds
without pulling the QBLAS subproject; default still builds it.

Meson caches the first subproject() call by name. When the QBLAS
dependency was resolved first, qblas's internal subproject('sleef')
ran with no options, configuring SLEEF with FMA enabled. Our later
subproject('sleef', default_options: ['disable_fma=true']) was then
silently a no-op: meson returned the already-configured instance.

Net effect on the old-CPU CI: the SLEEF in the wheel still had the
PURECFMA scalar code path enabled, and Intel SDE trapped on the first
vfnmadd132sd from Sleef_log10q1_u10purecfma when emulating Sandy
Bridge.

Move the whole SLEEF resolution block above the qblas dependency line.
SLEEF gets configured with the right options on its first (and only)
load; QBLAS's later subproject('sleef') call returns the same
instance, so there's now exactly one SLEEF in the build with the
options we wanted.
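The reordering can be sketched like this (fallback variable names are illustrative; `sleef`, `qblas`, and `disable_fma=true` come from the PR description):

```meson
# Resolve SLEEF first, with the options we need. Meson caches
# subprojects by name, so this first configuration is the one that sticks.
sleef_dep = dependency('sleef',
                       fallback: ['sleef', 'sleef_dep'],
                       default_options: ['disable_fma=true'])

# QBLAS's internal subproject('sleef') now returns the instance above,
# so exactly one SLEEF (FMA disabled) ends up in the build.
qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
```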

quaddtype's own sources don't use OpenMP (no #pragma omp or omp_*
calls). Both quaddtype's meson.build and qblas's meson.build called
dependency('openmp'), and qblas_dep also propagated the OpenMP dep
through its 'dependencies:' list. The same OpenMP instance therefore
appeared twice in quaddtype's compile-args closure.

On Apple's clang++ that double inclusion left an orphan -Xpreprocessor
in the args stream. Meson's ninja rule appends dependency-generation
flags ('-MD -MQ -MF ...') after $ARGS, so the orphan -Xpreprocessor
paired with -MD on the next line. clang++ then passed -MD to the
preprocessor verbatim, the preprocessor rejected it as unknown, and
every C++ compile failed, sinking the macos-15 wheel build.

qblas_dep already brings OpenMP transitively when QBLAS is enabled.
When QBLAS is disabled (Windows / -Ddisable_quadblas=true) we don't
need OpenMP at all because none of quaddtype's own sources use it.
Removing the second dependency('openmp') makes the closure single-
sourced and eliminates the duplicate compile args.
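As a before/after sketch (the dependency-list variable is illustrative):

```meson
# Before: OpenMP entered the compile-args closure twice
#   openmp_dep = dependency('openmp')
#   deps = [qblas_dep, openmp_dep]

# After: single-sourced; qblas_dep propagates OpenMP transitively
# when QBLAS is enabled, and the disabled path needs no OpenMP at all.
deps = [qblas_dep]
```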
@SwayamInSync SwayamInSync added the enhancement New feature or request label May 14, 2026
@SwayamInSync SwayamInSync requested a review from ngoldbaum May 14, 2026 21:24
@SwayamInSync
Member Author

SwayamInSync commented May 14, 2026

Writing down some of my observations here.
I was microbenchmarking the FMA instruction on my machine; in brief:
VFMADD132PD (AVX2) performs 4 FMAs for the f64 dtype within 1 µop (0.5 cycles per FMA).
For quaddtype we have Sleef_fmaq4_u05avx2, which also does 4 FMAs for f128, within ~275 µops (~240 cycles per FMA).
SLEEF internally implements a "triple-double" FMA that gives the result within 0.5 ULP (as per the standard, and costly).

So I am not sure how many users would be interested, but we could implement a "double-double" FMA, which would save us ~100 µops; the result would be within 1.5 ULP (a loss of ~7 bits of accuracy).
Just leaving this here as a thought. I might actually try this in QBLAS (just for fun, but if some trade-off is okay here, I'm happy to patch it in the future).
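The double-double idea can be sketched with plain f64 error-free transformations. `two_sum` and `two_prod` are the standard Knuth/Dekker constructions; `dd_fma` and the choice of which low-order cross terms to drop are illustrative, not SLEEF's actual implementation:

```python
def two_sum(a, b):
    # Knuth's error-free transformation: a + b == s + e exactly.
    s = a + b
    bb = s - a
    e = (a - (s - bb)) + (b - bb)
    return s, e

def split(a):
    # Dekker splitting: a == hi + lo, with hi holding the top 26 bits.
    t = 134217729.0 * a            # 2**27 + 1
    hi = t - (t - a)
    return hi, a - hi

def two_prod(a, b):
    # Dekker's error-free product: a * b == p + e exactly, no hardware FMA.
    p = a * b
    ah, al = split(a)
    bh, bl = split(b)
    e = ((ah * bh - p) + ah * bl + al * bh) + al * bl
    return p, e

def dd_fma(ah, al, bh, bl, ch, cl):
    # Illustrative double-double fma(a, b, c): cheaper than triple-double
    # because the lowest-order cross terms are folded in approximately,
    # which is where the ~1.5 ULP (instead of 0.5 ULP) bound comes from.
    ph, pl = two_prod(ah, bh)
    pl += ah * bl + al * bh
    sh, se = two_sum(ph, ch)
    return two_sum(sh, se + pl + cl)

# The (value, error) pair keeps what a single double would round away:
assert two_sum(1e16, 1.0) == (1e16, 1.0)
```

Each pair of doubles carries ~107 bits of significand, versus the ~160+ bits of a triple-double, which is the accuracy/speed trade-off being proposed.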

@juntyr
Contributor

juntyr commented May 14, 2026

> Writing down some of my observations here: I was microbenchmarking the FMA instruction on my machine so in brief VFMADD132PD (AVX2) performs 4 FMA for f64 dtype within 1 µop (0.5 cycles per FMA). For quaddtype we have Sleef_fmaq4_u05avx2 which also does 4 FMA for f128 within ~275 µops (~240 cycles per FMA). SLEEF internally implements "triple-double" FMA that gives result within 0.5 ULP (as per standard and costly).
>
> So I am not sure how many users will be interested but we can implement a "double-double" FMA which will save us ~100 µops but the result will be within 1.5 ULP (loss of ~7 bits of accuracy). Just leaving this here as a thought, I might be actually trying to do this in QBLAS (just for fun purpose, but if some tradeoff is okay here then happy to patch it in future)

At least by default we should not sacrifice precision, my use case needs it

