Enable CPU Optimizer Support for bitsandbytes#1901
jiqing-feng wants to merge 14 commits into bitsandbytes-foundation:main
Conversation
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @matthewdouglas
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @matthewdouglas. It looks like the failed tests are because the CI node does not have AVX512. Please rerun the CI to see if I fixed this issue.
Hi @jiqing-feng, Everything seems fine on the CI for Linux now regardless of AVX512 support. But it also looks like this is causing the macOS CPU tests to hang. |
matthewdouglas
left a comment
Review: Enable CPU Optimizer Support for bitsandbytes
Thanks for the PR! This is a solid feature addition — CPU optimizer support is valuable. I've done extensive analysis of the macOS CI hang and the Zen3 SIGILL, and I have a clear path forward for both.
TL;DR
The _has_avx512 Python guard should be removed and replaced with a targeted #pragma in the C++ code. This will fix the macOS hang, restore performance on non-AVX512 x86-64, and have zero impact on AVX512 machines.
Investigating the macOS CI hang
The macOS CI tests were cancelled after the 6-hour GitHub Actions timeout. I reproduced locally on an M4 Mac and traced the cause:
With the _has_avx512 guard (PR as-is): the C++ quantize/dequantize CPU kernels are not registered on any non-AVX512 platform (ARM, Zen3, etc.), so those platforms fall back to the pure-Python default kernel. That kernel does an O(n×256) argmin which allocates a 4.3 GB temporary tensor for the dim2=4097 test cases. Benchmarked: 115 seconds per test case vs 10 seconds with the C++ kernels, an 11.5x slowdown.
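For intuition, the kind of broadcast-argmin fallback described above can be sketched in a few lines (illustrative only, not the actual bitsandbytes fallback kernel; the `(1024, 4097)` shape is one plausible instance of the dim2=4097 test cases):

```python
import torch

def argmin_quantize_fallback(A: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Nearest-codebook-entry search via broadcasting: the subtraction
    # materializes an (n, 256) float temporary before argmin reduces it away.
    diff = (A.reshape(-1, 1) - code.reshape(1, -1)).abs()
    return diff.argmin(dim=-1).to(torch.uint8)

# For a hypothetical (1024, 4097) float input, the temporary alone is
# 1024 * 4097 * 256 * 4 bytes, i.e. about 4.3 GB.
print(f"{1024 * 4097 * 256 * 4 / 1e9:.1f} GB")
```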
Looking at actual CI durations on this PR confirms the problem across all non-AVX512 runners:
| Runner | AVX512? | Test duration |
|---|---|---|
| linux-x64 icelake | Yes | 19 min |
| linux-x64 default (Zen3) | No | 2h13m |
| linux-aarch64 | No | 2h31m |
| macOS ARM | No | cancelled at 6h |
For comparison, the pre-guard CI run (before SIGILL fix) had aarch64 at 14 min and macOS at 19-26 min — comparable to icelake.
Root cause of the Zen3 SIGILL
I downloaded the x86-64 CPU binary from the failing CI run and disassembled it. The new quantize_cpu_impl function and all its helpers contain zero AVX512 zmm instructions; GCC did not auto-vectorize them with 512-bit vectors (the -mprefer-vector-width=256 flag took care of that).
However, I found one EVEX-encoded AVX512VL instruction in build_quantize_lut:
3764: 62 f1 7c 18 59 05 ... vmulps 0x9c66(%rip){1to4}, %xmm0, %xmm0
This is an embedded broadcast — GCC folded a vbroadcastss + vmulps of the 0.5f constant (from the midpoint computation 0.5f * (codebook[i] + codebook[i+1])) into a single EVEX-encoded instruction that operates on xmm registers but requires AVX512VL support. Zen3 doesn't have AVX512 at all, so this is an illegal instruction.
The reason this only affects the new PR code: the existing codebase's AVX512 intrinsics are all inside #if defined(__AVX512F__) blocks with has_avx512f() runtime guards. The new build_quantize_lut is plain scalar C++ — but the -mavx512vl compile flag (applied globally in CMakeLists.txt) allows GCC to use EVEX broadcast encoding as a micro-optimization even in scalar code.
I verified by comparing the main branch binary: the exact same set of functions contain zmm instructions in both builds, and build_quantize_lut doesn't exist on main at all. This is purely a new-code issue.
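For reference, the midpoint computation at the heart of build_quantize_lut can be sketched in Python (illustrative only; the real LUT is built once in C++ and cached, and this sketch assumes a sorted codebook, which holds for the dynamic-map codebooks):

```python
import torch

def quantize_via_midpoints(A: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Decision boundaries between adjacent (sorted) codebook entries:
    # bounds[i] = 0.5 * (code[i] + code[i+1]) -- the same 0.5f multiply
    # that GCC folded into the EVEX-encoded vmulps {1to4} shown above.
    bounds = 0.5 * (code[:-1] + code[1:])
    # searchsorted against the 255 boundaries yields the nearest index
    # without materializing an (n, 256) broadcast temporary.
    return torch.searchsorted(bounds, A.contiguous()).to(torch.uint8)
```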
Recommended fix
C++ side: Wrap the new scalar functions with a pragma to disable AVX512 codegen:

```cpp
#ifdef __GNUC__
#pragma GCC push_options
#pragma GCC target("no-avx512f")
#endif

static void build_quantize_lut(const float* codebook, unsigned char* lut) { ... }
struct LUTCache { ... };
static const unsigned char* get_global_lut(const float* code) { ... }
static inline uint16_t norm_to_lut_index(float val) { ... }

template <typename T>
void quantize_cpu_impl(float* code, const T* A, ...) { ... }
void quantize_cpu(float* code, float* A, ...) { ... }
void quantize_cpu_bf16(float* code, bf16_t* A, ...) { ... }
void quantize_cpu_fp16(float* code, fp16_t* A, ...) { ... }

#ifdef __GNUC__
#pragma GCC pop_options
#endif
```

This is safe cross-platform:
- x86-64 GCC/Clang: Prevents EVEX broadcast and any future AVX512 auto-vectorization in these functions
- aarch64 Linux: No-op (no AVX512 to disable)
- macOS ARM (Apple Clang): No-op
- MSVC: Skipped via `#ifdef __GNUC__` (MSVC doesn't pass AVX512 flags anyway; it uses `/arch:AVX2`)
Performance impact on AVX512 machines: Zero. The one vmulps {1to4} becomes a vbroadcastss + vmulps pair in build_quantize_lut, which runs once per codebook and is cached. The hot quantization loop in quantize_cpu_impl already has zero EVEX instructions.
Python side: Remove the `_has_avx512` guard and revert to `if not isinstance(lib, ErrorHandlerMockBNBNativeLibrary):` so the C++ kernels are registered on all platforms.
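A minimal sketch of that registration predicate (the mock class name is from the PR; the surrounding scaffolding is hypothetical):

```python
class ErrorHandlerMockBNBNativeLibrary:
    """Stand-in for the bitsandbytes mock used when no native library loads."""

def should_register_cpu_kernels(lib: object) -> bool:
    # After the revert: any real native library is enough. ISA-specific
    # dispatch (AVX512 vs scalar) stays behind runtime checks in the C++.
    return not isinstance(lib, ErrorHandlerMockBNBNativeLibrary)
```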
Summary
This PR enables all bitsandbytes optimizers (32-bit and 8-bit blockwise) to run on CPU. Previously optimizers were restricted to CUDA/XPU only.
Motivation
Users on CPU-only machines had to fall back to vanilla PyTorch optimizers, losing the benefits of 8-bit state compression. This PR removes that limitation.
Changes
Python CPU Kernels (`bitsandbytes/backends/cpu/ops.py`)
- `_optimizer_update_32bit_cpu` (Adam, AdEMAMix, LAMB/LARS, Lion, SGD, RMSprop)
- `_optimizer_update_8bit_blockwise_cpu` with blockwise quantization/dequantization

Optimizer Framework (`bitsandbytes/optim/optimizer.py`, `bitsandbytes/functional.py`)
- `get_state_buffer`: CPU uses regular tensors; paged optimizers fall back to non-paged with a warning
- `to_gpu`: skips CPU parameters
- `is_on_gpu`: accepts all-CPU tensor sets, rejects mixed CPU/GPU

Native C++ Kernels (`csrc/cpu_ops.cpp`, `csrc/cpu_ops.h`, `csrc/pythonInterface.cpp`)
- `quantize_cpu_bf16`/`quantize_cpu_fp16` for direct half-precision quantization

Tests (`tests/test_optim.py`)
- `no_cpu=True` filters removed; all optimizer tests now run on CPU

Example (`examples/cpu/cpu_training.py`)
- `JackFram/llama-68m` + Alpaca, supporting multiple optimizers, `--compare` mode, and HF Trainer integration

Supported Optimizers on CPU

How to Test

```shell
pytest tests/test_optim.py -x -v -k "cpu"
python examples/cpu/cpu_training.py --optimizer adamw8bit --steps 20
```
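As an offline sanity check of the blockwise 8-bit roundtrip that the optimizer states rely on, here is a pure-PyTorch sketch (block size, function names, and the linear codebook are illustrative, not the library API):

```python
import torch

BLOCK = 256  # illustrative block size

def blockwise_quant(x: torch.Tensor, code: torch.Tensor):
    # Normalize each block by its absmax, then snap to the nearest
    # codebook entry -- the blockwise scheme 8-bit optimizer states use.
    blocks = x.reshape(-1, BLOCK)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    idx = ((blocks / absmax).unsqueeze(-1) - code).abs().argmin(-1)
    return idx.to(torch.uint8), absmax

def blockwise_dequant(idx: torch.Tensor, absmax: torch.Tensor, code: torch.Tensor):
    # Look up each code index and rescale by the per-block absmax.
    return (code[idx.long()] * absmax).reshape(-1)

torch.manual_seed(0)
code = torch.linspace(-1.0, 1.0, 256)
x = torch.randn(4 * BLOCK)
q, s = blockwise_quant(x, code)
err = (blockwise_dequant(q, s, code) - x).abs().max().item()
print(f"max roundtrip error: {err:.4f}")
```

The roundtrip error stays bounded by roughly absmax / 255 per block, which is the compression/accuracy trade-off the 8-bit optimizers make.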