
Commit 9c26a56

Commit message: bindings
Parent: 29dce4a

13 files changed: 425 additions & 59 deletions

.github/workflows/ci.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -70,12 +70,15 @@ jobs:
         run: cmake --build "$BUILD_DIR" --parallel
       - name: Test
         run: ctest --test-dir "$BUILD_DIR" --output-on-failure
-      - name: Python binding tests
+      - name: Python binding + GGUF regression tests
        if: matrix.configuration.name == 'python'
        run: |
          set -euo pipefail
          mkdir -p artifacts
+          python3 -m pip install --upgrade pip
+          python3 -m pip install torch transformers pytest
          PYTHONPATH="$BUILD_DIR" python3 tests/python/test_bindings.py 2>&1 | tee artifacts/${BUILD_DIR}-python-tests.log
+          PYTHONPATH="$BUILD_DIR" python3 -m pytest tests/python/test_gguf.py 2>&1 | tee artifacts/${BUILD_DIR}-gguf-tests.log
       - name: Collect test logs
         run: |
           mkdir -p artifacts
```
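
To rerun this step outside CI, the two invocations can be driven from a small Python script. This is a sketch: `build` stands in for whatever `$BUILD_DIR` the workflow used, and the `pip install` from the step above is assumed to have already happened.

```python
# Local stand-in for the "Python binding + GGUF regression tests" step above.
# BUILD_DIR is a placeholder for the CMake build directory holding the
# compiled extension; CI exposes it via PYTHONPATH, mirrored here.
import os
import subprocess
import sys
from pathlib import Path

BUILD_DIR = Path("build")  # assumption: local single-config build directory
env = {**os.environ, "PYTHONPATH": str(BUILD_DIR.resolve())}

bindings = subprocess.run(
    [sys.executable, "tests/python/test_bindings.py"], env=env
)
gguf_tests = subprocess.run(
    [sys.executable, "-m", "pytest", "tests/python/test_gguf.py"], env=env
)
sys.exit(bindings.returncode or gguf_tests.returncode)
```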

AGENTS.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -58,3 +58,4 @@ This file helps AI agents discover and understand how to work with this repository.
 - Rewrote `pyproject.toml` with valid TOML sections so editable installs (and `pip install -e '.[torch]'`) can parse the metadata cleanly before building the extension.
 - Restructured `README.md` into an onboarding-focused front door and added companion docs (`docs/use-cases.md`, `docs/hardware.md`, `docs/api-overview.md`, `docs/python-install.md`, `docs/torch.md`, `docs/gpu.md`, `examples/README.md`) so heavy reference material lives outside the visitor-facing overview.
 - Added optional CUDA/ROCm toggles plus a GPU dispatcher sketch (`include/t81/linalg/gemm_gpu.hpp`, `src/linalg/{gemm_cuda.cu,gemm_dispatch.cpp,gemm_rocm.cpp}`) so future teams can wire the new `where`/`clamp`/`lerp`/`addcmul` helpers into GPU kernels, introduced `t81::TensorMetadata` + Python helpers (`python/bindings.cpp`) that extract metadata from NumPy/Torch tensors, and expanded `tests/python/test_gpu_ops.py` to cover the metadata-backed bindings on both CPU and GPU paths.
+- Enhanced `tests/python/test_gguf.py` with quant-parameterized round-trip checks, metadata assertions, and a regression case for invalid quant identifiers to spotlight the GGUF helpers before future agents touch them.
```
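
For orientation, a quant-parameterized round-trip check of the kind that bullet describes might look like the sketch below. It is not the committed test file: the `write_gguf`/`read_gguf` signatures are inferred from the `t81/gguf.py` hunks later in this commit, and the `model` fixture is hypothetical (assumed to yield a small network containing `t81.nn.Linear` layers).

```python
# Sketch of a quant-parameterized round-trip test, not the committed file.
import pytest

from t81 import gguf


@pytest.mark.parametrize("quant", ["TQ1_0", "TQ2_0"])
def test_round_trip(tmp_path, model, quant):
    path = tmp_path / f"model-{quant.lower()}.gguf"
    gguf.write_gguf(model, path, quant=quant, threshold=0.05)
    payload = gguf.read_gguf(path, dequantize=True)
    assert payload  # every exported tensor decoded without raising


def test_invalid_quant_identifier(tmp_path, model):
    # write_gguf rejects unknown quant ids with ValueError (see the hunk below).
    with pytest.raises(ValueError):
        gguf.write_gguf(model, tmp_path / "bad.gguf", quant="TQ9_9")
```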

README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -110,20 +110,20 @@ Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON`
 
 ### Dequantizing for downstream runtimes
 
-Use the new `t81-dequant` helper (backed by `t81.dequantize_gguf_to_float`) to rewrite a TQ1_0/TQ2_0 bundle into float32 before handing it to stock llama.cpp, Ollama, or LM Studio builds that lack ternary support:
+Use the new `t81-dequant` helper (backed by `t81.dequantize_gguf_to_float`) to rewrite a TQ1_0 or TQ2_0 bundle into float32 before handing it to stock llama.cpp, Ollama, or LM Studio builds that lack ternary support:
 
 ```bash
 t81-dequant model-tq1.gguf model-compatible-f16.gguf
 ```
 
-That command rewrites the tensors in place while preserving the standard GGUF metadata so the resulting file works with existing loaders. Keep the original `model-tq1.gguf` around for runtimes that already understand TQ tensors, and only run `t81-dequant` when you need immediate compatibility.
+That command rewrites the tensors in place while preserving the standard GGUF metadata so the resulting file works with existing loaders. Keep the original `model-tq1.gguf`/`model-tq2.gguf` around for runtimes that already understand TQ tensors, and only run `t81-dequant` when you need immediate compatibility.
 
-For a zero-disk workaround you can also dequantize on the fly (via `t81.dequantize_gguf_to_float` or a small loader patch) before instantiating `llama_cpp.Llama`; see the docs for an example monkey patch if you want to load `model-tq1.gguf` directly without producing an intermediate copy.
+For a zero-disk workaround you can also dequantize on the fly (via `t81.dequantize_gguf_to_float` or a small loader patch) before instantiating `llama_cpp.Llama`; see the docs for an example monkey patch if you want to load `model-tq1.gguf` or `model-tq2.gguf` directly without producing an intermediate copy.
 
 
 ## GGUF v4 compliance
 
-t81’s GGUF exports already mirror the llama.cpp conventions; v4’s mandatory `gguf_header` additions are worth calling out for everybody writing their own converter:
+t81’s GGUF exports already mirror the llama.cpp conventions; the writer now aligns with llama.cpp’s block layout (32-row groups, per-group f16 scale, optional TQ2 refinement bytes) and includes v4’s mandatory `gguf_header` additions, which are worth calling out for everybody writing their own converter:
 
 - **Header bump** – write `version = 4` instead of 3 so llama.cpp accepts the file and no longer fails with “unsupported version”.
 - **Global alignment metadata** – after `tensor_count`/`kv_count` emit `alignment` (default 32, power-of-two) and `reserved` (0) before the metadata block, and compute tensor padding with `GGML_PAD(size, alignment)` so every tensor data block ends on that boundary.
````
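
A minimal sketch of those two rules for converter authors follows. The `<4sIQQ` layout matches `HEADER_STRUCT` in `t81/gguf.py`; the field widths for `alignment`/`reserved` are an assumption on top of that, and `ggml_pad` mirrors what llama.cpp's `GGML_PAD` macro computes.

```python
# Illustrative v4 header writer, not the project's implementation.
# Assumes alignment/reserved are 32-bit fields following the u64 counts.
import struct


def ggml_pad(size: int, alignment: int) -> int:
    # GGML_PAD(x, n): round x up to the next multiple of n (n a power of two).
    return (size + alignment - 1) & ~(alignment - 1)


def write_v4_header(fh, tensor_count: int, kv_count: int, alignment: int = 32) -> None:
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    fh.write(struct.pack("<4sIQQ", b"GGUF", 4, tensor_count, kv_count))  # version = 4, not 3
    fh.write(struct.pack("<II", alignment, 0))  # alignment, reserved (widths assumed)
```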

python/bindings.cpp

Lines changed: 4 additions & 0 deletions
```diff
@@ -490,6 +490,8 @@ namespace {
         throw py::value_error("buffer truncated while skipping refinement bytes");
       }
       offset += kRefinementBytes;
+      // Note: has_refinements currently just skips the reserved refinement bytes;
+      // decoding their contents is a future enhancement.
     }
     const float scale = t81::core::gguf::half_to_float(scale_bits);
     const std::size_t rows_in_group =
@@ -643,6 +645,8 @@ PYBIND11_MODULE(t81lib, module) {
       t81::linalg::detail::backend_available(t81::linalg::Backend::CUDA);
   module.attr("HAS_ROCM_BACKEND") =
       t81::linalg::detail::backend_available(t81::linalg::Backend::ROCm);
+  module.attr("TQ1_TRITS_PER_BLOCK") = t81::core::gguf::TQ1_TRITS_PER_BLOCK;
+  module.attr("TQ1_BLOCK_ROWS") = t81::core::gguf::TQ1_BLOCK_ROWS;
 
   module.def(
       "gemm_ternary",
```

t81/gguf.py

Lines changed: 61 additions & 10 deletions
```diff
@@ -26,13 +26,11 @@
 HEADER_STRUCT = struct.Struct("<4sIQQ")
 HEADER_SIZE = HEADER_STRUCT.size
 
-GGML_TYPE_TQ1_0 = 250
-GGML_TYPE_TQ2_0 = 251
+GGML_TYPE_TQ1_0 = 34
+GGML_TYPE_TQ2_0 = 35
 GGML_TYPE_F32 = 100
 GGML_TYPE_F16 = 101
 
-GGML_TYPE_F32 = 100
-
 GGUF_TYPE_UINT8 = 0
 GGUF_TYPE_INT8 = 1
 GGUF_TYPE_UINT16 = 2
@@ -47,7 +45,9 @@
 GGUF_TYPE_INT64 = 11
 GGUF_TYPE_FLOAT64 = 12
 
-GGUF_QUANT_BLOCK_ROWS = 32
+GGUF_QUANT_BLOCK_ROWS = 32  # matches the legacy TQ1_BLOCK_ROWS constant in the C++ helpers
+TQ1_TRITS_PER_BLOCK = 8
+TQ2_TRITS_PER_BYTE = 4
 
 
 @dataclass(frozen=True)
@@ -119,7 +119,7 @@ def _collect_metadata(model: PreTrainedModel, quant: str, threshold: float) -> t
         ("general.file_type", 2),
         ("general.alignment", HEADER_ALIGNMENT),
         ("general.quantized_by", "t81lib"),
-        ("general.quantization_version", 2),
+        ("general.quantization_version", 3 if quant == "TQ2_0" else 2),
         ("quantization.type", quant.lower()),
         ("quantization.block_size", GGUF_QUANT_BLOCK_ROWS),
         ("quantization.threshold", threshold),
@@ -224,21 +224,59 @@ def _float_to_half_bytes(value: float) -> bytes:
     return np.float16(value).tobytes()
 
 
+def _pack_row_tq1(row: np.ndarray, threshold: float, scale: float) -> bytes:
+    _, packed = t81lib.quantize_row_tq1_0(np.asarray(row, dtype=np.float32), threshold, scale)
+    return packed.tobytes(order="C")
+
+
+def _pack_row_tq2(row: np.ndarray, threshold: float, scale: float) -> bytes:
+    """Pack a row into four-trit bytes after thresholded normalization."""
+    if scale == 0.0:
+        normalized = np.zeros_like(row, dtype=np.float32)
+    else:
+        normalized = row.astype(np.float32, copy=False) / float(scale)
+    cols = normalized.shape[0]
+    trits = np.zeros(cols, dtype=np.uint8)
+    mask = np.abs(normalized) >= threshold
+    signs = (normalized[mask] < 0).astype(np.uint8)
+    trits[mask] = 1 + signs
+
+    padded_len = (-cols) % TQ2_TRITS_PER_BYTE
+    if padded_len:
+        padded = np.pad(trits, (0, padded_len), constant_values=0)
+    else:
+        padded = trits
+    reshaped = padded.reshape(-1, TQ2_TRITS_PER_BYTE)
+    packed = (
+        reshaped[:, 0]
+        | (reshaped[:, 1] << 2)
+        | (reshaped[:, 2] << 4)
+        | (reshaped[:, 3] << 6)
+    ).astype(np.uint8)
+    n_bytes = (cols + TQ2_TRITS_PER_BYTE - 1) // TQ2_TRITS_PER_BYTE
+    return packed[:n_bytes].tobytes()
+
+
 def _quantize_tensor(tensor: torch.Tensor, quant: str, threshold: float) -> bytes:
+    """Quantize a 2D tensor into TQ1_0 or TQ2_0 payload bytes."""
     array = tensor.cpu().to(dtype=torch.float32, copy=False).numpy()
+    if array.ndim != 2:
+        raise ValueError(f"Only 2D tensors supported, got shape {array.shape}")
     rows, cols = array.shape
     serialized = bytearray()
+    pack_row = _pack_row_tq1 if quant == "TQ1_0" else _pack_row_tq2
+    include_refinements = quant == "TQ2_0"
     for group_start in range(0, rows, GGUF_QUANT_BLOCK_ROWS):
         group = array[group_start : group_start + GGUF_QUANT_BLOCK_ROWS]
         scale = float(np.max(np.abs(group))) if group.size else 0.0
         serialized.extend(_float_to_half_bytes(scale))
-        if quant == "TQ2_0":
+        if include_refinements:
+            # Reserve 8 bytes per block for future per-block refinement data (higher-order corrections).
             serialized.extend(b"\x00" * 8)
         for row in group:
             if cols == 0:
                 continue
-            packed = t81lib.quantize_row_tq1_0(np.asarray(row, dtype=np.float32), threshold, scale)[1]
-            serialized.extend(packed.tobytes(order="C"))
+            serialized.extend(pack_row(row, threshold, scale))
     return bytes(serialized)
 
 
@@ -268,6 +306,7 @@ def write_gguf(
     quant = quant.upper()
     if quant not in {"TQ1_0", "TQ2_0"}:
         raise ValueError("quant must be one of 'TQ1_0' or 'TQ2_0'")
+    threshold = float(np.clip(threshold, 0.0, 0.9999))
    entries = _collect_linears(model)
    if not entries:
        raise ValueError("model does not contain any t81.nn.Linear layers")
@@ -376,6 +415,10 @@ def _decode_quant_tensor(
         dtype = np.float32
         data = np.frombuffer(memoryview(chunk), dtype=dtype)
         if shape:
+            expected = int(np.prod(shape))
+            if data.size < expected:
+                raise ValueError("float tensor data truncated")
+            data = data[:expected]
             data = data.reshape(shape)
         return torch.from_numpy(data)
     decoder = t81lib.dequant_tq1_0 if ggml_type == GGML_TYPE_TQ1_0 else t81lib.dequant_tq2_0
@@ -409,10 +452,16 @@ def read_gguf(
     tensor_infos = _parse_tensor_infos(buffer, tensor_infos_offset, num_tensors, alignment)
     sorted_infos = sorted(tensor_infos, key=lambda info: info.offset)
     payload: dict[str, torch.Tensor | bytes] = {}
+    prev_end = 0
     for index, info in enumerate(sorted_infos):
+        if info.offset < prev_end:
+            raise ValueError("tensor data overlaps or is out of order")
         next_offset = (
             sorted_infos[index + 1].offset if index + 1 < len(sorted_infos) else len(buffer)
         )
+        if next_offset > len(buffer):
+            raise ValueError("tensor data extends beyond file length")
+        prev_end = next_offset
         chunk = buffer[info.offset:next_offset]
         if info.ggml_type not in {GGML_TYPE_TQ1_0, GGML_TYPE_TQ2_0, GGML_TYPE_F32}:
             raise ValueError(f"unsupported tensor type {info.ggml_type}")
@@ -448,7 +497,9 @@ def dequantize_gguf(
             numpy_dtype = np.float32
         else:
             raise ValueError(f"unsupported target dtype {dtype}")
-        numpy_array = array.cpu().numpy(dtype=numpy_dtype, copy=False)
+        numpy_array = array.cpu().numpy()
+        if numpy_array.dtype != numpy_dtype:
+            numpy_array = numpy_array.astype(numpy_dtype, copy=False)
         tensor_payloads.append(
             _TensorPayload(
                 name=name,
```
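
The `_pack_row_tq2` layout above (four trits per byte, two bits each; code 0 for zero, 1 for positive, 2 for negative) has a straightforward inverse. A pure-NumPy sketch of the decode direction, for documentation only since the shipped decoder is `t81lib.dequant_tq2_0`:

```python
# Inverse of _pack_row_tq2: decode two-bit trit codes and reapply the scale.
import numpy as np

TQ2_TRITS_PER_BYTE = 4


def unpack_row_tq2(packed: bytes, cols: int, scale: float) -> np.ndarray:
    data = np.frombuffer(packed, dtype=np.uint8)
    codes = np.empty(data.size * TQ2_TRITS_PER_BYTE, dtype=np.uint8)
    for i in range(TQ2_TRITS_PER_BYTE):
        # Trit i of each byte sits at bits 2*i..2*i+1, matching the pack shifts.
        codes[i::TQ2_TRITS_PER_BYTE] = (data >> (2 * i)) & 0b11
    values = np.zeros(codes.shape, dtype=np.float32)
    values[codes == 1] = scale    # code 1: positive
    values[codes == 2] = -scale   # code 2: negative
    return values[:cols]          # drop the pad trits beyond the row width
```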

t81/scripts/t81_dequant.py

Lines changed: 117 additions & 8 deletions
```diff
@@ -6,11 +6,12 @@
 
 import argparse
 from pathlib import Path
-from typing import Iterable
+from typing import Any, Iterable, Mapping
 
 import numpy as np
+import torch
 
-from t81 import dequantize_gguf
+from t81 import gguf
 
 
 def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
@@ -34,6 +35,17 @@ def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
         default="f16",
         help="Output tensor type (f16 yields best compression, q8_0 is not implemented yet).",
     )
+    parser.add_argument(
+        "--tensor",
+        type=str,
+        help="Optional tensor name to print metadata or sample values for (defaults to first tensor).",
+    )
+    parser.add_argument(
+        "--sample",
+        type=int,
+        default=0,
+        help="When used with `--info`, print the first N dequantized values for the selected tensor.",
+    )
     parser.add_argument(
         "--info",
         action="store_true",
@@ -49,6 +61,22 @@ def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
         action="store_true",
         help="Suppress informational logging.",
     )
+    parser.add_argument(
+        "--validate",
+        action="store_true",
+        help="After writing the GGUF bundle, reload it to ensure it loads cleanly.",
+    )
+    parser.add_argument(
+        "--list-tensors",
+        action="store_true",
+        help="List all tensor names (similar to --info) and exit.",
+    )
+    parser.add_argument(
+        "--version",
+        action="version",
+        version="t81-dequant",
+        help="Show the version of the t81-dequant helper.",
+    )
     return parser.parse_args(argv)
 
 
@@ -62,7 +90,7 @@ def _target_config(target: str) -> tuple[np.dtype, int]:
         return np.float16, 101
     if target == "f32":
         return np.float32, 100
-    raise SystemExit("--target q8_0 currently not implemented")
+    raise SystemExit("--target q8_0 currently not implemented (planned in a follow-up)")
 
 
 def main(argv: Iterable[str] | None = None) -> int:
@@ -71,20 +99,101 @@ def main(argv: Iterable[str] | None = None) -> int:
         raise SystemExit(f"{args.input!s} does not exist")
 
     output_path = args.output or _default_output(args.input, args.target)
-    if args.info or not args.quiet:
-        info_msg = f"Dequantizing {args.input.name} → {output_path.name} (target={args.target})"
+    info_msg = f"Dequantizing {args.input.name} → {output_path.name} (target={args.target})"
+    if not args.quiet:
         print(info_msg)
 
     dtype, ggml_type = _target_config(args.target)
+
+    quantized_payload: Mapping[str, Any] | None = None
+    metadata: Mapping[str, Any] | None = None
+    decoded_payload: Mapping[str, torch.Tensor | bytes] | None = None
+    tensor_names_cache: list[str] | None = None
+
+    def _ensure_quantized():
+        nonlocal quantized_payload, metadata
+        if quantized_payload is None:
+            quantized_payload, metadata = gguf.read_gguf(args.input, dequantize=False, return_metadata=True)
+        return quantized_payload, metadata
+
+    def _ensure_decoded():
+        nonlocal decoded_payload
+        if decoded_payload is None:
+            decoded_payload = gguf.read_gguf(args.input, dequantize=True)
+        return decoded_payload
+
+    def _tensor_names():
+        nonlocal tensor_names_cache
+        if tensor_names_cache is None:
+            payload, _ = _ensure_quantized()
+            tensor_names_cache = sorted(payload.keys())
+        return tensor_names_cache
+
+    def _print_sample(selected_tensor: str, payload: Mapping[str, Any]) -> None:
+        decoded = _ensure_decoded()
+        sample_tensor = decoded.get(selected_tensor)
+        if isinstance(sample_tensor, torch.Tensor):
+            arr = sample_tensor.flatten()[: args.sample].cpu().numpy()
+            formatted = np.array2string(arr, threshold=10, edgeitems=5, floatmode="fixed", precision=4)
+            print(f"Sample ({selected_tensor}) [{arr.shape}]:")
+            print(formatted)
+            raw_chunk = payload.get(selected_tensor)
+            if isinstance(raw_chunk, (bytes, bytearray)) and len(raw_chunk) >= 2:
+                first_scale = float(np.frombuffer(raw_chunk[:2], dtype=np.float16)[0])
+                print(f"First block scale ≈ {first_scale:.4f}")
+        else:
+            print(f"Sample request ignored: tensor {selected_tensor!r} is not numeric")
+
     if args.info:
-        metadata, _ = gguf.read_gguf(args.input, dequantize=False, return_metadata=True)
+        payload, metadata = _ensure_quantized()
         alignment = metadata.get("general.alignment", 32)
-        print(f"Alignment={alignment}, tensors={len(metadata)} keys, target dtype={dtype}")
+        print(f"Alignment={alignment}, tensors={len(payload)} (quantized), metadata keys={len(metadata)}")
+        quant_type = metadata.get("quantization.type")
+        threshold = metadata.get("quantization.threshold", "unknown")
+        block_size = metadata.get("quantization.block_size", "unknown")
+        if quant_type is not None:
+            print(f"Quantization: {quant_type.upper()} (threshold={threshold}, block_size={block_size})")
+        for key, value in sorted(metadata.items()):
+            print(f"  {key} = {value!r}")
+        tensor_names = _tensor_names()
+        if tensor_names:
+            print("Tensors:")
+            for name in tensor_names:
+                print(f"  - {name}")
+        selected_tensor = args.tensor or (tensor_names[0] if tensor_names else None)
+        if selected_tensor and args.sample > 0:
+            _print_sample(selected_tensor, payload)
+        elif args.tensor and selected_tensor not in payload:
+            print(f"Tensor {args.tensor!r} not found in the file")
+
+    if args.list_tensors and not args.info:
+        tensor_names = _tensor_names()
+        if tensor_names:
+            print("Tensors:")
+            for name in tensor_names:
+                print(f"  - {name}")
+        else:
+            print("No tensors found in the file.")
+        return 0
+
+    if args.sample > 0 and not args.info:
+        payload, _ = _ensure_quantized()
+        tensor_names = _tensor_names()
+        selected_tensor = args.tensor or (tensor_names[0] if tensor_names else None)
+        if not selected_tensor:
+            print("No tensors available to sample.")
+        elif selected_tensor not in payload:
+            print(f"Tensor {selected_tensor!r} not found in the file")
+        else:
+            _print_sample(selected_tensor, payload)
     if args.dry_run:
         print("Dry run completed.")
         return 0
 
-    dequantize_gguf(args.input, output_path, dtype=dtype, ggml_type=ggml_type)
+    gguf.dequantize_gguf(args.input, output_path, dtype=dtype, ggml_type=ggml_type)
+    if args.validate:
+        gguf.read_gguf(output_path, dequantize=True)
     if not args.quiet:
         print("Conversion complete.")
     return 0
```
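
Because `main` accepts an explicit argv iterable, the new flags can be exercised in-process as well as from a shell. A usage sketch: `model-tq1.gguf` is a placeholder, and treating the input path as the positional argument is an assumption based on the unchanged part of the parser.

```python
# Drive the CLI in-process; flag names come from the parser additions above.
from t81.scripts.t81_dequant import main

# Print metadata plus the first eight dequantized values of the first tensor.
main(["model-tq1.gguf", "--info", "--sample", "8"])

# Convert with the default f16 target, then reload the output to confirm it parses.
main(["model-tq1.gguf", "--validate"])
```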

temp-gguf-float.gguf

544 Bytes
Binary file not shown.

temp-gguf.gguf

512 Bytes
Binary file not shown.

tests/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+"""Package marker for the test modules."""
```

tests/python/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+"""pytest package marker for the GGUF regression helpers."""
```
