Switch QBLAS dependency to 1.5.0 #95
Conversation
The new compiled QBLAS uses GCC-only flags (-march, -mavx2), POSIX-only
APIs (clock_gettime, sysconf, _SC_LEVEL2_CACHE_SIZE), and GCC built-ins
for CPUID detection - none of which build under MSVC. The legacy
header-only QBLAS happened to compile on Windows because the wrap was
five lines and produced no objects.
Add a disable_quadblas meson option (default false; auto-enabled on
Windows) that:
- skips dependency('qblas', fallback: ...) entirely
- declares an empty qblas_dep so the rest of the build proceeds
- adds -DDISABLE_QUADBLAS to project args so quadblas_interface.cpp
and umath/matmul.cpp take the naive-kernel branches that already
exist behind that guard
This removes the CFLAGS=/DDISABLE_QUADBLAS hack from the Windows CI
since meson now sets the preprocessor flag itself (and translates -D
to /D for MSVC). Verified on Linux: -Ddisable_quadblas=true builds
without pulling the QBLAS subproject; default still builds it.
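The option wiring described above can be sketched in meson roughly as follows. The option name matches the PR; the file layout, the fallback variable names, and the exact call sites are illustrative, not the project's actual meson.build.

```meson
# meson.options (illustrative)
option('disable_quadblas', type: 'boolean', value: false,
       description: 'Skip the QBLAS subproject and use the naive kernels')
```

```meson
# meson.build (illustrative) -- default false, auto-enabled on Windows
disable_quadblas = get_option('disable_quadblas') or \
                   host_machine.system() == 'windows'

if disable_quadblas
  # Empty stand-in so later 'dependencies: [qblas_dep]' lines still work.
  qblas_dep = declare_dependency()
  # Meson/compiler handle the -D vs /D spelling for MSVC.
  add_project_arguments('-DDISABLE_QUADBLAS', language: ['c', 'cpp'])
else
  # Only reached when QBLAS is wanted, so the subproject is never pulled
  # on a disabled build.
  qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
endif
```

With this in place the Windows CI no longer needs to inject the macro through CFLAGS; configuring with `-Ddisable_quadblas=true` is enough.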
Meson caches the first subproject() call by name. When the QBLAS
dependency was resolved first, qblas's internal subproject('sleef')
ran with no options, configuring SLEEF with FMA enabled. Our later
subproject('sleef', default_options: ['disable_fma=true']) was then
silently a no-op - meson returned the already-configured instance.
Net effect on the old-CPU CI: the SLEEF in the wheel still had the
PURECFMA scalar code path enabled, and Intel SDE trapped on the first
vfnmadd132sd from Sleef_log10q1_u10purecfma when emulating Sandy
Bridge.
Move the whole SLEEF resolution block above the qblas dependency line.
SLEEF gets configured with the right options on its first (and only)
load; QBLAS's later subproject('sleef') call returns the same
instance, so there's now exactly one SLEEF in the build with the
options we wanted.
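The reordering amounts to making our SLEEF resolution the first `subproject('sleef')` meson sees. A minimal sketch, with illustrative fallback variable names:

```meson
# Resolve SLEEF *before* QBLAS. Meson caches the first subproject() call
# by name, so this configuration -- with FMA disabled -- is the one and
# only SLEEF instance in the build.
sleef_dep = dependency('sleef',
  fallback: ['sleef', 'sleef_dep'],
  default_options: ['disable_fma=true'])

# QBLAS comes second; its internal subproject('sleef') is now a cache
# hit and returns the instance configured above, not a fresh default-
# options (FMA-enabled) one.
qblas_dep = dependency('qblas', fallback: ['qblas', 'qblas_dep'])
```

If the two lines were swapped, QBLAS's internal call would win the cache and our `default_options` would be silently ignored, which is exactly the old-CPU CI failure described above.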
quaddtype's own sources don't use OpenMP (no #pragma omp or omp_*
calls). Both quaddtype's meson.build and qblas's meson.build called
dependency('openmp'), and qblas_dep also propagated the OpenMP dep
through its 'dependencies:' list. The same OpenMP instance therefore
appeared twice in quaddtype's compile-args closure.
On Apple's clang++ that double inclusion left an orphan -Xpreprocessor
in the args stream. Meson's ninja rule appends dependency-generation
flags ('-MD -MQ -MF ...') after $ARGS, so the orphan -Xpreprocessor
paired with -MD on the next line. clang++ then passed -MD to the
preprocessor verbatim, the preprocessor rejected it as unknown, and
every C++ compile failed - sinking the macos-15 wheel build.
qblas_dep already brings OpenMP transitively when QBLAS is enabled.
When QBLAS is disabled (Windows / -Ddisable_quadblas=true) we don't
need OpenMP at all because none of quaddtype's own sources use it.
Removing the second dependency('openmp') makes the closure single-
sourced and eliminates the duplicate compile args.
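The before/after shape of the fix, sketched with illustrative variable names:

```meson
# Before (duplicated): quaddtype resolved OpenMP directly *and* received
# it again transitively through qblas_dep's 'dependencies:' list, so the
# same compile args appeared twice in the closure.
#   openmp_dep = dependency('openmp')
#   ext_deps   = [qblas_dep, openmp_dep]

# After (single-sourced): OpenMP reaches quaddtype only via qblas_dep
# when QBLAS is enabled; when QBLAS is disabled, nothing needs OpenMP.
ext_deps = [qblas_dep]
```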
Writing down some of my observations here: I am not sure how many users will be interested, but we could implement a "double-double" FMA, which would save us ~100 µops; the result would be within 1.5 ULP (a loss of ~7 bits of accuracy).
At least by default we should not sacrifice precision; my use case needs it.
Summary
Bumps the QBLAS dependency to 1.5.0.
For users of numpy_quaddtype, nothing visible in the Python API changes.
A @ B and np.matmul(A, B) continue to work exactly as before. At the C level, the shim's old QuadBLAS:: C++ namespace is replaced by free C functions (cblas_qdot, cblas_qgemv, cblas_qgemm, ...) inside src/csrc/quadblas_interface.cpp.
Performance impact
Measured on AMD EPYC 7V13 (Zen 3, AVX2 tier), same numpy float64 / scipy-openblas baseline (Haswell tier), same harness on both QBLAS versions. Harness: bench/bench_quad_vs_numpy.py in the QBLAS repo.
Three gains stack:
- Threading: nc = NC_DEFAULT = 512 meant a 128x128 gemm at 16 threads ran on a single thread (the others spun up, hit the barrier, and slept). New QBLAS auto-scales nc so each thread gets at least two blocks.
- Transpose handling: fixes matmul producing incorrect results for F-contiguous / non-row-major inputs (#89).
- Runtime SIMD selection, so the wheel picks a kernel appropriate to the host CPU instead of relying on a compile-time choice.
Caveat: single-thread quad-precision gemm is still ~800x slower than f64 OpenBLAS. That gap is intrinsic to SLEEF's triple-double FMA implementation. The roadmap to bring it down to ~50-100x lives in docs/performance_bottlenecks.md, and none of it is in 1.5.0. The value-add of 1.5.0 is correctness and dispatch hygiene (proper threading, runtime SIMD selection, transpose flags), not lifting the FMA ceiling.