From abd3e0dbdf31d70b9c0af713bff9637d1241cec6 Mon Sep 17 00:00:00 2001
From: Andrew Barnes <andrew.jaguars@gmail.com>
Date: Fri, 13 Mar 2026 01:08:18 -0400
Subject: [PATCH] docs: add I2_S format reference

---
 README.md          |   2 +
 docs/i2s-format.md | 156 +++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 158 insertions(+)
 create mode 100644 docs/i2s-format.md
diff --git a/README.md b/README.md
index 3bb25596e..fd115cd28 100644
--- a/README.md
+++ b/README.md
@@ -12,6 +12,8 @@ The first release of bitnet.cpp is to support inference on CPUs. bitnet.cpp achi
 
 **Latest optimization** introduces parallel kernel implementations with configurable tiling and embedding quantization support, achieving **1.15x to 2.1x** additional speedup over the original implementation across different hardware platforms and workloads. For detailed technical information, see the [optimization guide](src/README.md).
 
+For runtime implementers who need the ternary weight layout, see [docs/i2s-format.md](docs/i2s-format.md).
+
 <img src="./assets/performance.png" alt="performance_comparison" width="800"/>
 
 
diff --git a/docs/i2s-format.md b/docs/i2s-format.md
new file mode 100644
index 000000000..fa92a3038
--- /dev/null
+++ b/docs/i2s-format.md
@@ -0,0 +1,156 @@
+# I2_S Quantization Format
+
+This note documents the `I2_S` weight layout used by `bitnet.cpp` so alternative runtimes and tooling can parse the format without reverse-engineering the kernels.
+
+The description below is derived from the packing logic in `src/ggml-bitnet-mad.cpp`, especially `quantize_i2_s()`.
+
+## What `I2_S` stores
+
+`I2_S` is the ternary weight format used by the CPU inference path.
+
+- Logical weight values are ternary: `-1`, `0`, or `+1`
+- Each logical value is encoded into 2 bits
+- A full tensor also stores one trailing `float32` scale value
+- The serialized buffer reserves 32 extra bytes so the scale region stays aligned
+
+At packing time the implementation first maps floating-point values into 2-bit symbols:
+
+- `0` means `-1`
+- `1` means `0`
+- `2` means `+1`
+
+Values whose magnitude falls below a small epsilon are encoded as zero; every other value is encoded by its sign alone, so its magnitude is not stored per element.
+
+## CPU-dependent packing granularity
+
+`QK_I2_S` is a compile-time constant in `src/ggml-bitnet-mad.cpp` whose value depends on the target CPU backend:
+
+- x86 / AVX / SSSE3 paths use `QK_I2_S = 128`
+- ARM NEON paths use `QK_I2_S = 64`
+
+The packing pattern is the same on both backends: each output byte stores four 2-bit symbols from different groups. The only difference is whether the groups are 32-wide (`128 = 4 x 32`) or 16-wide (`64 = 4 x 16`).
+
+## x86 layout (`QK_I2_S = 128`)
+
+For x86, one 32-byte block stores 128 ternary values split into 4 groups of 32.
+
+For logical index `j` inside a 128-value block:
+
+```text
+group_idx = j / 32
+group_pos = j % 32
+```
+
+The packer writes:
+
+```text
+byte_index = block_base + group_pos
+shift = 6 - 2 * group_idx
+packed_byte |= value << shift
+```
+
+So byte `group_pos` contains values from:
+
+```text
+[group_pos, 32 + group_pos, 64 + group_pos, 96 + group_pos]
+```
+
+with the bit layout:
+
+```text
+bits[7:6] -> element at group_pos + 0
+bits[5:4] -> element at group_pos + 32
+bits[3:2] -> element at group_pos + 64
+bits[1:0] -> element at group_pos + 96
+```
+
+## ARM layout (`QK_I2_S = 64`)
+
+For ARM NEON, one 16-byte block stores 64 ternary values split into 4 groups of 16.
+
+For logical index `j` inside a 64-value block:
+
+```text
+group_idx = j / 16
+group_pos = j % 16
+```
+
+The packer writes:
+
+```text
+byte_index = block_base + group_pos
+shift = 6 - 2 * group_idx
+packed_byte |= value << shift
+```
+
+So byte `group_pos` contains values from:
+
+```text
+[group_pos, 16 + group_pos, 32 + group_pos, 48 + group_pos]
+```
+
+with the same bit ordering:
+
+```text
+bits[7:6], bits[5:4], bits[3:2], bits[1:0]
+```
+
+## Scale storage
+
+After the packed 2-bit payload, `quantize_i2_s()` stores one `float32` scale (`n` is the total number of weights in the tensor):
+
+```text
+scale_ptr = (float *)((char *)packed_weights + n / 4)
+scale_ptr[0] = i2_scale
+```
+
+The function then returns:
+
+```text
+nrow * row_size / 4 + 32
+```
+
+The trailing `+ 32` reserves room for the scale and keeps the serialized tensor aligned. If you are building a parser, read `n / 4` bytes of packed payload, then one `float32` scale at byte offset `n / 4`; the remaining bytes reserved by the size calculation are alignment padding.
+
+## Practical decoding recipe
+
+To decode the logical element at flat index `k`:
+
+1. Choose the backend block size (`128` for x86, `64` for ARM NEON).
+2. Compute the block-local group and position.
+3. Read the corresponding byte.
+4. Extract the 2-bit symbol with the appropriate shift.
+5. Map the symbol back to ternary:
+   - `0 -> -1`
+   - `1 -> 0`
+   - `2 -> +1`
+
+For x86:
+
+```text
+block = k / 128
+pos = k % 128
+group = pos / 32
+lane = pos % 32
+byte_offset = block * 32 + lane
+shift = 6 - 2 * group
+```
+
+For ARM:
+
+```text
+block = k / 64
+pos = k % 64
+group = pos / 16
+lane = pos % 16
+byte_offset = block * 16 + lane
+shift = 6 - 2 * group
+```
+
+## Related source files
+
+- `src/ggml-bitnet-mad.cpp`
+- `include/ggml-bitnet.h`
+- `utils/convert-hf-to-gguf-bitnet.py`
+
+If this format changes, update this document alongside the packing implementation.
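
The decoding recipe in `docs/i2s-format.md` can be sketched in Python as follows. This is a minimal illustration of the documented layout, not code from the repository. The names `decode_i2s_element`, `read_i2s_scale`, and `pack_i2s` are hypothetical, the packer exists only to round-trip the decoder, and little-endian storage of the scale is an assumption.

```python
import struct

# Symbol-to-ternary mapping from the format note; symbol 3 is not used there.
SYMBOL_TO_TERNARY = {0: -1, 1: 0, 2: 1}


def decode_i2s_element(packed: bytes, k: int, block_size: int) -> int:
    """Return the ternary value at flat index k.

    block_size is 128 for the x86 layout or 64 for the ARM NEON layout.
    """
    group_width = block_size // 4            # 32 on x86, 16 on ARM NEON
    block, pos = divmod(k, block_size)       # which block, position inside it
    group, lane = divmod(pos, group_width)   # which 2-bit slot, which byte
    byte_offset = block * group_width + lane
    shift = 6 - 2 * group                    # group 0 sits in bits[7:6]
    symbol = (packed[byte_offset] >> shift) & 0b11
    return SYMBOL_TO_TERNARY[symbol]


def read_i2s_scale(buf: bytes, n: int) -> float:
    """Read the float32 scale stored right after the n / 4 payload bytes."""
    # Assumption: little-endian, since the writer stores a native float.
    return struct.unpack_from("<f", buf, n // 4)[0]


def pack_i2s(values, block_size: int) -> bytearray:
    """Hypothetical reference packer, used only to exercise the decoder.

    len(values) must be a multiple of block_size.
    """
    group_width = block_size // 4
    out = bytearray(len(values) // 4)
    for k, v in enumerate(values):
        block, pos = divmod(k, block_size)
        group, lane = divmod(pos, group_width)
        # v + 1 maps -1, 0, +1 onto symbols 0, 1, 2.
        out[block * group_width + lane] |= (v + 1) << (6 - 2 * group)
    return out
```

For example, packing 256 values with `pack_i2s(values, 128)`, appending `struct.pack("<f", scale)` plus 28 padding bytes, and then calling `decode_i2s_element(buf, k, 128)` for each `k` should return the original values. Passing `block_size` explicitly keeps one decoder for both layouts, since the two differ only in group width.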