
Enable CPU Optimizer Support for bitsandbytes#1901

Open
jiqing-feng wants to merge 14 commits into bitsandbytes-foundation:main from jiqing-feng:cpu

Conversation

Contributor

@jiqing-feng jiqing-feng commented Mar 18, 2026

Summary

This PR enables all bitsandbytes optimizers (32-bit and 8-bit blockwise) to run on CPU. Previously optimizers were restricted to CUDA/XPU only.

Motivation

Users on CPU-only machines had to fall back to vanilla PyTorch optimizers, losing the benefits of 8-bit state compression. This PR removes that limitation.

Changes

Python CPU Kernels (bitsandbytes/backends/cpu/ops.py)

  • Implemented _optimizer_update_32bit_cpu (Adam, AdEMAMix, LAMB/LARS, Lion, SGD, RMSprop)
  • Implemented _optimizer_update_8bit_blockwise_cpu with blockwise quantization/dequantization
  • Fixed AdEMAMix m1/m2 interleaved state layout
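The blockwise quantization these kernels rely on can be sketched in plain Python. This is an illustrative simplification, not the library's actual implementation: per block, values are scaled by the block's absmax and then mapped to the nearest entry of a 256-entry codebook (here a tiny 5-entry codebook is used for clarity).

```python
def quantize_blockwise(values, code, blocksize=256):
    # Illustrative sketch of blockwise 8-bit quantization: scale each
    # block by its absmax, then pick the nearest codebook index.
    out, absmaxes = [], []
    for start in range(0, len(values), blocksize):
        block = values[start:start + blocksize]
        absmax = max(abs(v) for v in block) or 1.0
        absmaxes.append(absmax)
        for v in block:
            x = v / absmax
            # Nearest-code search; the C++ kernel in this PR replaces
            # a per-element binary search with a cached LUT.
            idx = min(range(len(code)), key=lambda i: abs(code[i] - x))
            out.append(idx)
    return out, absmaxes

def dequantize_blockwise(indices, absmaxes, code, blocksize=256):
    # Inverse: look up the code value and rescale by the block absmax.
    return [code[q] * absmaxes[i // blocksize] for i, q in enumerate(indices)]
```

Dequantization recovers values only up to the codebook resolution, which is why optimizer state (rather than weights) is the usual target for 8-bit compression.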

Optimizer Framework (bitsandbytes/optim/optimizer.py, bitsandbytes/functional.py)

  • get_state_buffer: CPU uses regular tensors; paged optimizers fall back to non-paged with warning
  • to_gpu: skips CPU parameters
  • is_on_gpu: accepts all-CPU tensor sets, rejects mixed CPU/GPU

Native C++ Kernels (csrc/cpu_ops.cpp, csrc/cpu_ops.h, csrc/pythonInterface.cpp)

  • Replaced per-element binary search with LUT-based quantization (4-slot cached LUT with content fingerprinting)
  • Added quantize_cpu_bf16 / quantize_cpu_fp16 for direct half-precision quantization
  • Fixed 8-bit blockwise kernel for bf16/fp16 inputs
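The LUT idea can be illustrated in Python. This is a simplified sketch, not the C++ implementation: bucket the normalized range [-1, 1] into 2^k cells and precompute the nearest codebook index for each cell, so the hot loop does an O(1) table lookup instead of a per-element binary search.

```python
def build_quantize_lut(code, lut_bits=8):
    # Precompute the nearest codebook index for each bucket of [-1, 1].
    # Built once per codebook (the C++ version caches 4 LUTs keyed by
    # a content fingerprint of the codebook).
    n = 1 << lut_bits
    scode = sorted(code)
    lut = []
    for b in range(n):
        x = 2.0 * b / (n - 1) - 1.0  # representative value for bucket b
        idx = min(range(len(scode)), key=lambda i: abs(scode[i] - x))
        lut.append(idx)
    return lut

def quantize_with_lut(x, lut, lut_bits=8):
    # Clamp to [-1, 1], map to a bucket, and index the LUT in O(1).
    x = max(-1.0, min(1.0, x))
    b = int((x + 1.0) * 0.5 * ((1 << lut_bits) - 1) + 0.5)
    return lut[b]
```

The trade-off is a small, bounded quantization error from bucketing, in exchange for removing the branchy search from the per-element path.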

Tests (tests/test_optim.py)

  • Removed no_cpu=True filters — all optimizer tests now run on CPU
  • Paged optimizer variants auto-skip on CPU

Example (examples/cpu/cpu_training.py)

  • End-to-end CPU training with JackFram/llama-68m + Alpaca, supporting multiple optimizers, --compare mode, and HF Trainer integration

Supported Optimizers on CPU

| Optimizer       | 32-bit | 8-bit Blockwise |
|-----------------|--------|-----------------|
| Adam / AdamW    | ✅     | ✅              |
| SGD (Momentum)  | ✅     | ✅              |
| Lion            | ✅     | ✅              |
| RMSprop         | ✅     | ✅              |
| LARS            | ✅     | ❌              |
| LAMB            | ✅     | ❌              |
| AdEMAMix        | ✅     | ✅              |

Note: Paged variants (e.g., PagedAdamW) fall back to non-paged on CPU. LARS/LAMB 8-bit blockwise is not available upstream.

How to Test

pytest tests/test_optim.py -x -v -k "cpu"
python examples/cpu/cpu_training.py --optimizer adamw8bit --steps 20

Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
@jiqing-feng jiqing-feng marked this pull request as draft March 18, 2026 02:25
@jiqing-feng jiqing-feng marked this pull request as ready for review March 18, 2026 02:41
@jiqing-feng
Contributor Author

Hi @matthewdouglas
Please review this PR. Thanks!

@matthewdouglas matthewdouglas added this to the v0.50.0 milestone Mar 18, 2026
@matthewdouglas matthewdouglas added Intel x64 CPU Optimizers Issues or feature requests relating to optimizers labels Mar 18, 2026
@github-actions

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiqing-feng
Contributor Author

Hi @matthewdouglas. The failed tests suggest the CI node does not have AVX512. Please rerun the CI to see if I have fixed this issue.

@matthewdouglas
Member

Hi @jiqing-feng,

Everything seems fine on the CI for Linux now regardless of AVX512 support. But it also looks like this is causing the macOS CPU tests to hang.

Member

@matthewdouglas matthewdouglas left a comment


Review: Enable CPU Optimizer Support for bitsandbytes

Thanks for the PR! This is a solid feature addition — CPU optimizer support is valuable. I've done extensive analysis of the macOS CI hang and the Zen3 SIGILL, and I have a clear path forward for both.

TL;DR

The _has_avx512 Python guard should be removed and replaced with a targeted #pragma in the C++ code. This will fix the macOS hang, restore performance on non-AVX512 x86-64, and have zero impact on AVX512 machines.


Investigating the macOS CI hang

The macOS CI tests were cancelled after the 6-hour GitHub Actions timeout. I reproduced locally on an M4 Mac and traced the cause:

With _has_avx512 guard (PR as-is): The C++ quantize/dequantize CPU kernels are not registered on any non-AVX512 platform (ARM, Zen3, etc.), falling back to the pure-Python default kernel. That kernel does an O(n×256) argmin which allocates a 4.3 GB temporary tensor for dim2=4097 test cases. Benchmarked: 115 seconds per test case vs 10 seconds with C++ kernels — an 11.5x slowdown.
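The 4.3 GB figure follows directly from the O(n×256) argmin: the fallback kernel materializes the distance of every element against all 256 codebook entries in float32. Assuming a tensor of roughly 4.2M elements for the dim2=4097 case (the exact test shape is an assumption here, e.g. 4097 × 1024):

```python
# Worked arithmetic for the temporary allocated by the pure-Python
# O(n * 256) argmin fallback. The 4097 x 1024 shape is illustrative.
n_elements = 4097 * 1024               # ~4.19M elements
bytes_needed = n_elements * 256 * 4    # 256 codes x 4 bytes (float32)
gigabytes = bytes_needed / 1e9         # ~4.3 GB per quantize call
```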

Looking at actual CI durations on this PR confirms the problem across all non-AVX512 runners:

Runner AVX512? Test duration
linux-x64 icelake Yes 19 min
linux-x64 default (Zen3) No 2h13m
linux-aarch64 No 2h31m
macOS ARM No cancelled at 6h

For comparison, the pre-guard CI run (before SIGILL fix) had aarch64 at 14 min and macOS at 19-26 min — comparable to icelake.


Root cause of the Zen3 SIGILL

I downloaded the x86-64 CPU binary from the failing CI run and disassembled it. The new quantize_cpu_impl function and all its helpers contain zero AVX512 zmm instructions — GCC did not auto-vectorize them with 512-bit vectors (-mprefer-vector-width=256 worked for that).

However, I found one EVEX-encoded AVX512VL instruction in build_quantize_lut:

3764: 62 f1 7c 18 59 05 ...    vmulps 0x9c66(%rip){1to4}, %xmm0, %xmm0

This is an embedded broadcast — GCC folded a vbroadcastss + vmulps of the 0.5f constant (from the midpoint computation 0.5f * (codebook[i] + codebook[i+1])) into a single EVEX-encoded instruction that operates on xmm registers but requires AVX512VL support. Zen3 doesn't have AVX512 at all, so this is an illegal instruction.

The reason this only affects the new PR code: the existing codebase's AVX512 intrinsics are all inside #if defined(__AVX512F__) blocks with has_avx512f() runtime guards. The new build_quantize_lut is plain scalar C++ — but the -mavx512vl compile flag (applied globally in CMakeLists.txt) allows GCC to use EVEX broadcast encoding as a micro-optimization even in scalar code.

I verified by comparing the main branch binary: the exact same set of functions contain zmm instructions in both builds, and build_quantize_lut doesn't exist on main at all. This is purely a new-code issue.


Recommended fix

C++ side: Wrap the new scalar functions with a pragma to disable AVX512 codegen:

#ifdef __GNUC__
#pragma GCC push_options
#pragma GCC target("no-avx512f")
#endif

static void build_quantize_lut(const float* codebook, unsigned char* lut) { ... }
struct LUTCache { ... };
static const unsigned char* get_global_lut(const float* code) { ... }
static inline uint16_t norm_to_lut_index(float val) { ... }
template <typename T>
void quantize_cpu_impl(float* code, const T* A, ...) { ... }
void quantize_cpu(float* code, float* A, ...) { ... }
void quantize_cpu_bf16(float* code, bf16_t* A, ...) { ... }
void quantize_cpu_fp16(float* code, fp16_t* A, ...) { ... }

#ifdef __GNUC__
#pragma GCC pop_options
#endif

This is safe cross-platform:

  • x86-64 GCC/Clang: Prevents EVEX broadcast and any future AVX512 auto-vectorization in these functions
  • aarch64 Linux: No-op (no AVX512 to disable)
  • macOS ARM (Apple Clang): No-op
  • MSVC: Skipped via #ifdef __GNUC__ (MSVC doesn't pass AVX512 flags anyway — it uses /arch:AVX2)

Performance impact on AVX512 machines: Zero. The one vmulps {1to4} becomes a vbroadcastss + vmulps pair in build_quantize_lut, which runs once per codebook and is cached. The hot quantization loop in quantize_cpu_impl already has zero EVEX instructions.

Python side: Remove the _has_avx512 guard — revert to if not isinstance(lib, ErrorHandlerMockBNBNativeLibrary): so the C++ kernels are registered on all platforms.
