Enable CPU Optimizer Support for bitsandbytes#1901
jiqing-feng wants to merge 14 commits into bitsandbytes-foundation:main
Conversation
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @matthewdouglas
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Signed-off-by: jiqing-feng <jiqing.feng@intel.com>
Hi @matthewdouglas. It looks like the failed tests are because the CI node does not have AVX512. Please rerun the CI to see if I fixed this issue.
Hi @jiqing-feng, Everything seems fine on the CI for Linux now regardless of AVX512 support. But it also looks like this is causing the macOS CPU tests to hang. |
matthewdouglas
left a comment
Review: Enable CPU Optimizer Support for bitsandbytes
Thanks for the PR! This is a solid feature addition — CPU optimizer support is valuable. I've done extensive analysis of the macOS CI hang and the Zen3 SIGILL, and I have a clear path forward for both.
TL;DR
The _has_avx512 Python guard should be removed and replaced with a targeted #pragma in the C++ code. This will fix the macOS hang, restore performance on non-AVX512 x86-64, and have zero impact on AVX512 machines.
Investigating the macOS CI hang
The macOS CI tests were cancelled after the 6-hour GitHub Actions timeout. I reproduced locally on an M4 Mac and traced the cause:
With the _has_avx512 guard (PR as-is): the C++ quantize/dequantize CPU kernels are not registered on any non-AVX512 platform (ARM, Zen3, etc.), so those platforms fall back to the pure-Python default kernel. That kernel does an O(n×256) argmin which allocates a 4.3 GB temporary tensor for the dim2=4097 test cases. Benchmarked: 115 seconds per test case vs 10 seconds with the C++ kernels, an 11.5x slowdown.
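For intuition, the kind of broadcast-argmin fallback described above can be sketched in a few lines (illustrative only, not the actual bitsandbytes fallback kernel; the `(1024, 4097)` shape is one plausible instance of the dim2=4097 test cases):

```python
import torch

def argmin_quantize_fallback(A: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Nearest-codebook-entry search via broadcasting: the subtraction
    # materializes an (n, 256) float temporary before argmin reduces it away.
    diff = (A.reshape(-1, 1) - code.reshape(1, -1)).abs()
    return diff.argmin(dim=-1).to(torch.uint8)

# For a hypothetical (1024, 4097) float input, the temporary alone is
# 1024 * 4097 * 256 * 4 bytes, i.e. about 4.3 GB.
print(f"{1024 * 4097 * 256 * 4 / 1e9:.1f} GB")
```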
Looking at actual CI durations on this PR confirms the problem across all non-AVX512 runners:
| Runner | AVX512? | Test duration |
|---|---|---|
| linux-x64 icelake | Yes | 19 min |
| linux-x64 default (Zen3) | No | 2h13m |
| linux-aarch64 | No | 2h31m |
| macOS ARM | No | cancelled at 6h |
For comparison, the pre-guard CI run (before SIGILL fix) had aarch64 at 14 min and macOS at 19-26 min — comparable to icelake.
Root cause of the Zen3 SIGILL
I downloaded the x86-64 CPU binary from the failing CI run and disassembled it. The new quantize_cpu_impl function and all its helpers contain zero AVX512 zmm instructions; GCC did not auto-vectorize them with 512-bit vectors (the -mprefer-vector-width=256 flag took care of that).
However, I found one EVEX-encoded AVX512VL instruction in build_quantize_lut:
3764: 62 f1 7c 18 59 05 ... vmulps 0x9c66(%rip){1to4}, %xmm0, %xmm0
This is an embedded broadcast — GCC folded a vbroadcastss + vmulps of the 0.5f constant (from the midpoint computation 0.5f * (codebook[i] + codebook[i+1])) into a single EVEX-encoded instruction that operates on xmm registers but requires AVX512VL support. Zen3 doesn't have AVX512 at all, so this is an illegal instruction.
The reason this only affects the new PR code: the existing codebase's AVX512 intrinsics are all inside #if defined(__AVX512F__) blocks with has_avx512f() runtime guards. The new build_quantize_lut is plain scalar C++ — but the -mavx512vl compile flag (applied globally in CMakeLists.txt) allows GCC to use EVEX broadcast encoding as a micro-optimization even in scalar code.
I verified by comparing the main branch binary: the exact same set of functions contain zmm instructions in both builds, and build_quantize_lut doesn't exist on main at all. This is purely a new-code issue.
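For reference, the midpoint computation at the heart of build_quantize_lut can be sketched in Python (illustrative only; the real LUT is built once in C++ and cached, and this sketch assumes a sorted codebook, which holds for the dynamic-map codebooks):

```python
import torch

def quantize_via_midpoints(A: torch.Tensor, code: torch.Tensor) -> torch.Tensor:
    # Decision boundaries between adjacent (sorted) codebook entries:
    # bounds[i] = 0.5 * (code[i] + code[i+1]) -- the same 0.5f multiply
    # that GCC folded into the EVEX-encoded vmulps {1to4} shown above.
    bounds = 0.5 * (code[:-1] + code[1:])
    # searchsorted against the 255 boundaries yields the nearest index
    # without materializing an (n, 256) broadcast temporary.
    return torch.searchsorted(bounds, A.contiguous()).to(torch.uint8)
```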
Recommended fix
C++ side: Wrap the new scalar functions with a pragma to disable AVX512 codegen:

```cpp
#ifdef __GNUC__
#pragma GCC push_options
#pragma GCC target("no-avx512f")
#endif

static void build_quantize_lut(const float* codebook, unsigned char* lut) { ... }
struct LUTCache { ... };
static const unsigned char* get_global_lut(const float* code) { ... }
static inline uint16_t norm_to_lut_index(float val) { ... }

template <typename T>
void quantize_cpu_impl(float* code, const T* A, ...) { ... }
void quantize_cpu(float* code, float* A, ...) { ... }
void quantize_cpu_bf16(float* code, bf16_t* A, ...) { ... }
void quantize_cpu_fp16(float* code, fp16_t* A, ...) { ... }

#ifdef __GNUC__
#pragma GCC pop_options
#endif
```

This is safe cross-platform:
- x86-64 GCC/Clang: Prevents EVEX broadcast and any future AVX512 auto-vectorization in these functions
- aarch64 Linux: No-op (no AVX512 to disable)
- macOS ARM (Apple Clang): No-op
- MSVC: Skipped via `#ifdef __GNUC__` (MSVC doesn't pass AVX512 flags anyway; it uses `/arch:AVX2`)
Performance impact on AVX512 machines: Zero. The one vmulps {1to4} becomes a vbroadcastss + vmulps pair in build_quantize_lut, which runs once per codebook and is cached. The hot quantization loop in quantize_cpu_impl already has zero EVEX instructions.
Python side: Remove the `_has_avx512` guard and revert to `if not isinstance(lib, ErrorHandlerMockBNBNativeLibrary):` so the C++ kernels are registered on all platforms.
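A minimal sketch of that registration predicate (the mock class name is from the PR; the surrounding scaffolding is hypothetical):

```python
class ErrorHandlerMockBNBNativeLibrary:
    """Stand-in for the bitsandbytes mock used when no native library loads."""

def should_register_cpu_kernels(lib: object) -> bool:
    # After the revert: any real native library is enough. ISA-specific
    # dispatch (AVX512 vs scalar) stays behind runtime checks in the C++.
    return not isinstance(lib, ErrorHandlerMockBNBNativeLibrary)
```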
Summary
This PR enables all bitsandbytes optimizers (32-bit and 8-bit blockwise) to run on CPU. Previously optimizers were restricted to CUDA/XPU only.
Motivation
Users on CPU-only machines had to fall back to vanilla PyTorch optimizers, losing the benefits of 8-bit state compression. This PR removes that limitation.
Changes
Python CPU Kernels (`bitsandbytes/backends/cpu/ops.py`)
- `_optimizer_update_32bit_cpu` (Adam, AdEMAMix, LAMB/LARS, Lion, SGD, RMSprop)
- `_optimizer_update_8bit_blockwise_cpu` with blockwise quantization/dequantization

Optimizer Framework (`bitsandbytes/optim/optimizer.py`, `bitsandbytes/functional.py`)
- `get_state_buffer`: CPU uses regular tensors; paged optimizers fall back to non-paged with a warning
- `to_gpu`: skips CPU parameters
- `is_on_gpu`: accepts all-CPU tensor sets, rejects mixed CPU/GPU

Native C++ Kernels (`csrc/cpu_ops.cpp`, `csrc/cpu_ops.h`, `csrc/pythonInterface.cpp`)
- `quantize_cpu_bf16`/`quantize_cpu_fp16` for direct half-precision quantization

Tests (`tests/test_optim.py`)
- `no_cpu=True` filters removed; all optimizer tests now run on CPU

Example (`examples/cpu/cpu_training.py`)
- `JackFram/llama-68m` + Alpaca, supporting multiple optimizers, `--compare` mode, and HF Trainer integration

Supported Optimizers on CPU

How to Test

```shell
pytest tests/test_optim.py -x -v -k "cpu"
python examples/cpu/cpu_training.py --optimizer adamw8bit --steps 20
```
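As an offline sanity check of the blockwise 8-bit roundtrip that the optimizer states rely on, here is a pure-PyTorch sketch (block size, function names, and the linear codebook are illustrative, not the library API):

```python
import torch

BLOCK = 256  # illustrative block size

def blockwise_quant(x: torch.Tensor, code: torch.Tensor):
    # Normalize each block by its absmax, then snap to the nearest
    # codebook entry -- the blockwise scheme 8-bit optimizer states use.
    blocks = x.reshape(-1, BLOCK)
    absmax = blocks.abs().amax(dim=1, keepdim=True).clamp_min(1e-12)
    idx = ((blocks / absmax).unsqueeze(-1) - code).abs().argmin(-1)
    return idx.to(torch.uint8), absmax

def blockwise_dequant(idx: torch.Tensor, absmax: torch.Tensor, code: torch.Tensor):
    # Look up each code index and rescale by the per-block absmax.
    return (code[idx.long()] * absmax).reshape(-1)

torch.manual_seed(0)
code = torch.linspace(-1.0, 1.0, 256)
x = torch.randn(4 * BLOCK)
q, s = blockwise_quant(x, code)
err = (blockwise_dequant(q, s, code) - x).abs().max().item()
print(f"max roundtrip error: {err:.4f}")
```

The roundtrip error stays bounded by roughly absmax / 255 per block, which is the compression/accuracy trade-off the 8-bit optimizers make.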