
Commit 9c26a56

Commit message: bindings
Parent: 29dce4a

13 files changed: 425 additions & 59 deletions

.github/workflows/ci.yml

Lines changed: 4 additions & 1 deletion
```diff
@@ -70,12 +70,15 @@ jobs:
         run: cmake --build "$BUILD_DIR" --parallel
       - name: Test
         run: ctest --test-dir "$BUILD_DIR" --output-on-failure
-      - name: Python binding tests
+      - name: Python binding + GGUF regression tests
        if: matrix.configuration.name == 'python'
        run: |
          set -euo pipefail
          mkdir -p artifacts
+          python3 -m pip install --upgrade pip
+          python3 -m pip install torch transformers pytest
          PYTHONPATH="$BUILD_DIR" python3 tests/python/test_bindings.py 2>&1 | tee artifacts/${BUILD_DIR}-python-tests.log
+          PYTHONPATH="$BUILD_DIR" python3 -m pytest tests/python/test_gguf.py 2>&1 | tee artifacts/${BUILD_DIR}-gguf-tests.log
       - name: Collect test logs
         run: |
           mkdir -p artifacts
```
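
To rerun this step outside CI, the two invocations can be driven from a small Python script. This is a sketch: `build` stands in for whatever `$BUILD_DIR` the workflow used, and the `pip install` from the step above is assumed to have already happened.

```python
# Local stand-in for the "Python binding + GGUF regression tests" step above.
# BUILD_DIR is a placeholder for the CMake build directory holding the
# compiled extension; CI exposes it via PYTHONPATH, mirrored here.
import os
import subprocess
import sys
from pathlib import Path

BUILD_DIR = Path("build")  # assumption: local single-config build directory
env = {**os.environ, "PYTHONPATH": str(BUILD_DIR.resolve())}

bindings = subprocess.run(
    [sys.executable, "tests/python/test_bindings.py"], env=env
)
gguf_tests = subprocess.run(
    [sys.executable, "-m", "pytest", "tests/python/test_gguf.py"], env=env
)
sys.exit(bindings.returncode or gguf_tests.returncode)
```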

AGENTS.md

Lines changed: 1 addition & 0 deletions
```diff
@@ -58,3 +58,4 @@ This file helps AI agents discover and understand how to work with this repository.
 - Rewrote `pyproject.toml` with valid TOML sections so editable installs (and `pip install -e '.[torch]'`) can parse the metadata cleanly before building the extension.
 - Restructured `README.md` into an onboarding-focused front door and added companion docs (`docs/use-cases.md`, `docs/hardware.md`, `docs/api-overview.md`, `docs/python-install.md`, `docs/torch.md`, `docs/gpu.md`, `examples/README.md`) so heavy reference material lives outside the visitor-facing overview.
 - Added optional CUDA/ROCm toggles plus a GPU dispatcher sketch (`include/t81/linalg/gemm_gpu.hpp`, `src/linalg/{gemm_cuda.cu,gemm_dispatch.cpp,gemm_rocm.cpp}`) so future teams can wire the new `where`/`clamp`/`lerp`/`addcmul` helpers into GPU kernels, introduced `t81::TensorMetadata` + Python helpers (`python/bindings.cpp`) that extract metadata from NumPy/Torch tensors, and expanded `tests/python/test_gpu_ops.py` to cover the metadata-backed bindings on both CPU and GPU paths.
+- Enhanced `tests/python/test_gguf.py` with quant-parameterized round-trip checks, metadata assertions, and a regression case for invalid quant identifiers to spotlight the GGUF helpers before future agents touch them.
```
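
For orientation, a quant-parameterized round-trip check of the kind that bullet describes might look like the sketch below. It is not the committed test file: the `write_gguf`/`read_gguf` signatures are inferred from the `t81/gguf.py` hunks later in this commit, and the `model` fixture is hypothetical (assumed to yield a small network containing `t81.nn.Linear` layers).

```python
# Sketch of a quant-parameterized round-trip test, not the committed file.
import pytest

from t81 import gguf


@pytest.mark.parametrize("quant", ["TQ1_0", "TQ2_0"])
def test_round_trip(tmp_path, model, quant):
    path = tmp_path / f"model-{quant.lower()}.gguf"
    gguf.write_gguf(model, path, quant=quant, threshold=0.05)
    payload = gguf.read_gguf(path, dequantize=True)
    assert payload  # every exported tensor decoded without raising


def test_invalid_quant_identifier(tmp_path, model):
    # write_gguf rejects unknown quant ids with ValueError (see the hunk below).
    with pytest.raises(ValueError):
        gguf.write_gguf(model, tmp_path / "bad.gguf", quant="TQ9_9")
```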

README.md

Lines changed: 4 additions & 4 deletions
````diff
@@ -110,20 +110,20 @@ Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON`
 
 ### Dequantizing for downstream runtimes
 
-Use the new `t81-dequant` helper (backed by `t81.dequantize_gguf_to_float`) to rewrite a TQ1_0/TQ2_0 bundle into float32 before handing it to stock llama.cpp, Ollama, or LM Studio builds that lack ternary support:
+Use the new `t81-dequant` helper (backed by `t81.dequantize_gguf_to_float`) to rewrite a TQ1_0 or TQ2_0 bundle into float32 before handing it to stock llama.cpp, Ollama, or LM Studio builds that lack ternary support:
 
 ```bash
 t81-dequant model-tq1.gguf model-compatible-f16.gguf
 ```
 
-That command rewrites the tensors in place while preserving the standard GGUF metadata so the resulting file works with existing loaders. Keep the original `model-tq1.gguf` around for runtimes that already understand TQ tensors, and only run `t81-dequant` when you need immediate compatibility.
+That command rewrites the tensors in place while preserving the standard GGUF metadata so the resulting file works with existing loaders. Keep the original `model-tq1.gguf`/`model-tq2.gguf` around for runtimes that already understand TQ tensors, and only run `t81-dequant` when you need immediate compatibility.
 
-For a zero-disk workaround you can also dequantize on the fly (via `t81.dequantize_gguf_to_float` or a small loader patch) before instantiating `llama_cpp.Llama`; see the docs for an example monkey patch if you want to load `model-tq1.gguf` directly without producing an intermediate copy.
+For a zero-disk workaround you can also dequantize on the fly (via `t81.dequantize_gguf_to_float` or a small loader patch) before instantiating `llama_cpp.Llama`; see the docs for an example monkey patch if you want to load `model-tq1.gguf` or `model-tq2.gguf` directly without producing an intermediate copy.
 
 
 ## GGUF v4 compliance
 
-t81’s GGUF exports already mirror the llama.cpp conventions; v4’s mandatory `gguf_header` additions are worth calling out for everybody writing their own converter:
+t81’s GGUF exports already mirror the llama.cpp conventions; the writer now aligns with llama.cpp’s block layout (32-row groups, per-group f16 scale, optional TQ2 refinement bytes) and includes v4’s mandatory `gguf_header` additions, which are worth calling out for everybody writing their own converter:
 
 - **Header bump** – write `version = 4` instead of 3 so llama.cpp accepts the file and no longer fails with “unsupported version”.
 - **Global alignment metadata** – after `tensor_count`/`kv_count` emit `alignment` (default 32, power-of-two) and `reserved` (0) before the metadata block, and compute tensor padding with `GGML_PAD(size, alignment)` so every tensor data block ends on that boundary.
````
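
A minimal sketch of those two rules for converter authors follows. The `<4sIQQ` layout matches `HEADER_STRUCT` in `t81/gguf.py`; the field widths for `alignment`/`reserved` are an assumption on top of that, and `ggml_pad` mirrors what llama.cpp's `GGML_PAD` macro computes.

```python
# Illustrative v4 header writer, not the project's implementation.
# Assumes alignment/reserved are 32-bit fields following the u64 counts.
import struct


def ggml_pad(size: int, alignment: int) -> int:
    # GGML_PAD(x, n): round x up to the next multiple of n (n a power of two).
    return (size + alignment - 1) & ~(alignment - 1)


def write_v4_header(fh, tensor_count: int, kv_count: int, alignment: int = 32) -> None:
    assert alignment & (alignment - 1) == 0, "alignment must be a power of two"
    fh.write(struct.pack("<4sIQQ", b"GGUF", 4, tensor_count, kv_count))  # version = 4, not 3
    fh.write(struct.pack("<II", alignment, 0))  # alignment, reserved (widths assumed)
```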

python/bindings.cpp

Lines changed: 4 additions & 0 deletions
```diff
@@ -490,6 +490,8 @@ namespace {
         throw py::value_error("buffer truncated while skipping refinement bytes");
       }
       offset += kRefinementBytes;
+      // Note: has_refinements currently just skips the reserved refinement bytes;
+      // decoding their contents is a future enhancement.
     }
     const float scale = t81::core::gguf::half_to_float(scale_bits);
     const std::size_t rows_in_group =
@@ -643,6 +645,8 @@ PYBIND11_MODULE(t81lib, module) {
       t81::linalg::detail::backend_available(t81::linalg::Backend::CUDA);
   module.attr("HAS_ROCM_BACKEND") =
       t81::linalg::detail::backend_available(t81::linalg::Backend::ROCm);
+  module.attr("TQ1_TRITS_PER_BLOCK") = t81::core::gguf::TQ1_TRITS_PER_BLOCK;
+  module.attr("TQ1_BLOCK_ROWS") = t81::core::gguf::TQ1_BLOCK_ROWS;
 
   module.def(
       "gemm_ternary",
```

t81/gguf.py

Lines changed: 61 additions & 10 deletions
```diff
@@ -26,13 +26,11 @@
 HEADER_STRUCT = struct.Struct("<4sIQQ")
 HEADER_SIZE = HEADER_STRUCT.size
 
-GGML_TYPE_TQ1_0 = 250
-GGML_TYPE_TQ2_0 = 251
+GGML_TYPE_TQ1_0 = 34
+GGML_TYPE_TQ2_0 = 35
 GGML_TYPE_F32 = 100
 GGML_TYPE_F16 = 101
 
-GGML_TYPE_F32 = 100
-
 GGUF_TYPE_UINT8 = 0
 GGUF_TYPE_INT8 = 1
 GGUF_TYPE_UINT16 = 2
@@ -47,7 +45,9 @@
 GGUF_TYPE_INT64 = 11
 GGUF_TYPE_FLOAT64 = 12
 
-GGUF_QUANT_BLOCK_ROWS = 32
+GGUF_QUANT_BLOCK_ROWS = 32  # matches the legacy TQ1_BLOCK_ROWS constant in the C++ helpers
+TQ1_TRITS_PER_BLOCK = 8
+TQ2_TRITS_PER_BYTE = 4
 
 
 @dataclass(frozen=True)
@@ -119,7 +119,7 @@ def _collect_metadata(model: PreTrainedModel, quant: str, threshold: float) -> t
         ("general.file_type", 2),
         ("general.alignment", HEADER_ALIGNMENT),
         ("general.quantized_by", "t81lib"),
-        ("general.quantization_version", 2),
+        ("general.quantization_version", 3 if quant == "TQ2_0" else 2),
         ("quantization.type", quant.lower()),
         ("quantization.block_size", GGUF_QUANT_BLOCK_ROWS),
         ("quantization.threshold", threshold),
@@ -224,21 +224,59 @@ def _float_to_half_bytes(value: float) -> bytes:
     return np.float16(value).tobytes()
 
 
+def _pack_row_tq1(row: np.ndarray, threshold: float, scale: float) -> bytes:
+    _, packed = t81lib.quantize_row_tq1_0(np.asarray(row, dtype=np.float32), threshold, scale)
+    return packed.tobytes(order="C")
+
+
+def _pack_row_tq2(row: np.ndarray, threshold: float, scale: float) -> bytes:
+    """Pack a row into four-trit bytes after thresholded normalization."""
+    if scale == 0.0:
+        normalized = np.zeros_like(row, dtype=np.float32)
+    else:
+        normalized = row.astype(np.float32, copy=False) / float(scale)
+    cols = normalized.shape[0]
+    trits = np.zeros(cols, dtype=np.uint8)
+    mask = np.abs(normalized) >= threshold
+    signs = (normalized[mask] < 0).astype(np.uint8)
+    trits[mask] = 1 + signs
+
+    padded_len = (-cols) % TQ2_TRITS_PER_BYTE
+    if padded_len:
+        padded = np.pad(trits, (0, padded_len), constant_values=0)
+    else:
+        padded = trits
+    reshaped = padded.reshape(-1, TQ2_TRITS_PER_BYTE)
+    packed = (
+        reshaped[:, 0]
+        | (reshaped[:, 1] << 2)
+        | (reshaped[:, 2] << 4)
+        | (reshaped[:, 3] << 6)
+    ).astype(np.uint8)
+    n_bytes = (cols + TQ2_TRITS_PER_BYTE - 1) // TQ2_TRITS_PER_BYTE
+    return packed[:n_bytes].tobytes()
+
+
 def _quantize_tensor(tensor: torch.Tensor, quant: str, threshold: float) -> bytes:
+    """Quantize a 2D tensor into TQ1_0 or TQ2_0 payload bytes."""
     array = tensor.cpu().to(dtype=torch.float32, copy=False).numpy()
+    if array.ndim != 2:
+        raise ValueError(f"Only 2D tensors supported, got shape {array.shape}")
     rows, cols = array.shape
     serialized = bytearray()
+    pack_row = _pack_row_tq1 if quant == "TQ1_0" else _pack_row_tq2
+    include_refinements = quant == "TQ2_0"
     for group_start in range(0, rows, GGUF_QUANT_BLOCK_ROWS):
         group = array[group_start : group_start + GGUF_QUANT_BLOCK_ROWS]
         scale = float(np.max(np.abs(group))) if group.size else 0.0
         serialized.extend(_float_to_half_bytes(scale))
-        if quant == "TQ2_0":
+        if include_refinements:
+            # Reserve 8 bytes per block for future per-block refinement data (higher-order corrections).
             serialized.extend(b"\x00" * 8)
         for row in group:
             if cols == 0:
                 continue
-            packed = t81lib.quantize_row_tq1_0(np.asarray(row, dtype=np.float32), threshold, scale)[1]
-            serialized.extend(packed.tobytes(order="C"))
+            serialized.extend(pack_row(row, threshold, scale))
     return bytes(serialized)
 
 
@@ -268,6 +306,7 @@ def write_gguf(
     quant = quant.upper()
     if quant not in {"TQ1_0", "TQ2_0"}:
         raise ValueError("quant must be one of 'TQ1_0' or 'TQ2_0'")
+    threshold = float(np.clip(threshold, 0.0, 0.9999))
    entries = _collect_linears(model)
    if not entries:
        raise ValueError("model does not contain any t81.nn.Linear layers")
@@ -376,6 +415,10 @@ def _decode_quant_tensor(
         dtype = np.float32
         data = np.frombuffer(memoryview(chunk), dtype=dtype)
         if shape:
+            expected = int(np.prod(shape))
+            if data.size < expected:
+                raise ValueError("float tensor data truncated")
+            data = data[:expected]
             data = data.reshape(shape)
         return torch.from_numpy(data)
     decoder = t81lib.dequant_tq1_0 if ggml_type == GGML_TYPE_TQ1_0 else t81lib.dequant_tq2_0
@@ -409,10 +452,16 @@ def read_gguf(
     tensor_infos = _parse_tensor_infos(buffer, tensor_infos_offset, num_tensors, alignment)
     sorted_infos = sorted(tensor_infos, key=lambda info: info.offset)
     payload: dict[str, torch.Tensor | bytes] = {}
+    prev_end = 0
     for index, info in enumerate(sorted_infos):
+        if info.offset < prev_end:
+            raise ValueError("tensor data overlaps or is out of order")
         next_offset = (
             sorted_infos[index + 1].offset if index + 1 < len(sorted_infos) else len(buffer)
         )
+        if next_offset > len(buffer):
+            raise ValueError("tensor data extends beyond file length")
+        prev_end = next_offset
         chunk = buffer[info.offset:next_offset]
         if info.ggml_type not in {GGML_TYPE_TQ1_0, GGML_TYPE_TQ2_0, GGML_TYPE_F32}:
             raise ValueError(f"unsupported tensor type {info.ggml_type}")
@@ -448,7 +497,9 @@ def dequantize_gguf(
             numpy_dtype = np.float32
         else:
             raise ValueError(f"unsupported target dtype {dtype}")
-        numpy_array = array.cpu().numpy(dtype=numpy_dtype, copy=False)
+        numpy_array = array.cpu().numpy()
+        if numpy_array.dtype != numpy_dtype:
+            numpy_array = numpy_array.astype(numpy_dtype, copy=False)
         tensor_payloads.append(
             _TensorPayload(
                 name=name,
```
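
The `_pack_row_tq2` layout above (four trits per byte, two bits each; code 0 for zero, 1 for positive, 2 for negative) has a straightforward inverse. A pure-NumPy sketch of the decode direction, for documentation only since the shipped decoder is `t81lib.dequant_tq2_0`:

```python
# Inverse of _pack_row_tq2: decode two-bit trit codes and reapply the scale.
import numpy as np

TQ2_TRITS_PER_BYTE = 4


def unpack_row_tq2(packed: bytes, cols: int, scale: float) -> np.ndarray:
    data = np.frombuffer(packed, dtype=np.uint8)
    codes = np.empty(data.size * TQ2_TRITS_PER_BYTE, dtype=np.uint8)
    for i in range(TQ2_TRITS_PER_BYTE):
        # Trit i of each byte sits at bits 2*i..2*i+1, matching the pack shifts.
        codes[i::TQ2_TRITS_PER_BYTE] = (data >> (2 * i)) & 0b11
    values = np.zeros(codes.shape, dtype=np.float32)
    values[codes == 1] = scale    # code 1: positive
    values[codes == 2] = -scale   # code 2: negative
    return values[:cols]          # drop the pad trits beyond the row width
```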

t81/scripts/t81_dequant.py

Lines changed: 117 additions & 8 deletions
```diff
@@ -6,11 +6,12 @@
 
 import argparse
 from pathlib import Path
-from typing import Iterable
+from typing import Any, Iterable, Mapping
 
 import numpy as np
+import torch
 
-from t81 import dequantize_gguf
+from t81 import gguf
 
 
 def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
@@ -34,6 +35,17 @@ def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
         default="f16",
         help="Output tensor type (f16 yields best compression, q8_0 is not implemented yet).",
     )
+    parser.add_argument(
+        "--tensor",
+        type=str,
+        help="Optional tensor name to print metadata or sample values for (defaults to first tensor).",
+    )
+    parser.add_argument(
+        "--sample",
+        type=int,
+        default=0,
+        help="When used with `--info`, print the first N dequantized values for the selected tensor.",
+    )
     parser.add_argument(
         "--info",
         action="store_true",
@@ -49,6 +61,22 @@ def _parse_args(argv: Iterable[str] | None = None) -> argparse.Namespace:
         action="store_true",
         help="Suppress informational logging.",
     )
+    parser.add_argument(
+        "--validate",
+        action="store_true",
+        help="After writing the GGUF bundle, reload it to ensure it loads cleanly.",
+    )
+    parser.add_argument(
+        "--list-tensors",
+        action="store_true",
+        help="List all tensor names (similar to --info) and exit.",
+    )
+    parser.add_argument(
+        "--version",
+        action="version",
+        version="t81-dequant",
+        help="Show the version of the t81-dequant helper.",
+    )
     return parser.parse_args(argv)
 
 
@@ -62,7 +90,7 @@ def _target_config(target: str) -> tuple[np.dtype, int]:
         return np.float16, 101
     if target == "f32":
         return np.float32, 100
-    raise SystemExit("--target q8_0 currently not implemented")
+    raise SystemExit("--target q8_0 currently not implemented (planned in a follow-up)")
 
 
 def main(argv: Iterable[str] | None = None) -> int:
@@ -71,20 +99,101 @@ def main(argv: Iterable[str] | None = None) -> int:
         raise SystemExit(f"{args.input!s} does not exist")
 
     output_path = args.output or _default_output(args.input, args.target)
-    if args.info or not args.quiet:
-        info_msg = f"Dequantizing {args.input.name} → {output_path.name} (target={args.target})"
+    info_msg = f"Dequantizing {args.input.name} → {output_path.name} (target={args.target})"
+    if not args.quiet:
         print(info_msg)
 
     dtype, ggml_type = _target_config(args.target)
+
+    quantized_payload: Mapping[str, Any] | None = None
+    metadata: Mapping[str, Any] | None = None
+    decoded_payload: Mapping[str, torch.Tensor | bytes] | None = None
+    tensor_names_cache: list[str] | None = None
+
+    def _ensure_quantized():
+        nonlocal quantized_payload, metadata
+        if quantized_payload is None:
+            quantized_payload, metadata = gguf.read_gguf(args.input, dequantize=False, return_metadata=True)
+        return quantized_payload, metadata
+
+    def _ensure_decoded():
+        nonlocal decoded_payload
+        if decoded_payload is None:
+            decoded_payload = gguf.read_gguf(args.input, dequantize=True)
+        return decoded_payload
+
+    def _tensor_names():
+        nonlocal tensor_names_cache
+        if tensor_names_cache is None:
+            payload, _ = _ensure_quantized()
+            tensor_names_cache = sorted(payload.keys())
+        return tensor_names_cache
+
+    def _print_sample(selected_tensor: str, payload: Mapping[str, Any]) -> None:
+        decoded = _ensure_decoded()
+        sample_tensor = decoded.get(selected_tensor)
+        if isinstance(sample_tensor, torch.Tensor):
+            arr = sample_tensor.flatten()[: args.sample].cpu().numpy()
+            formatted = np.array2string(arr, threshold=10, edgeitems=5, floatmode="fixed", precision=4)
+            print(f"Sample ({selected_tensor}) [{arr.shape}]:")
+            print(formatted)
+            raw_chunk = payload.get(selected_tensor)
+            if isinstance(raw_chunk, (bytes, bytearray)) and len(raw_chunk) >= 2:
+                first_scale = float(np.frombuffer(raw_chunk[:2], dtype=np.float16)[0])
+                print(f"First block scale ≈ {first_scale:.4f}")
+        else:
+            print(f"Sample request ignored: tensor {selected_tensor!r} is not numeric")
+
     if args.info:
-        metadata, _ = gguf.read_gguf(args.input, dequantize=False, return_metadata=True)
+        payload, metadata = _ensure_quantized()
         alignment = metadata.get("general.alignment", 32)
-        print(f"Alignment={alignment}, tensors={len(metadata)} keys, target dtype={dtype}")
+        print(f"Alignment={alignment}, tensors={len(payload)} (quantized), metadata keys={len(metadata)}")
+        quant_type = metadata.get("quantization.type")
+        threshold = metadata.get("quantization.threshold", "unknown")
+        block_size = metadata.get("quantization.block_size", "unknown")
+        if quant_type is not None:
+            print(f"Quantization: {quant_type.upper()} (threshold={threshold}, block_size={block_size})")
+        for key, value in sorted(metadata.items()):
+            print(f"  {key} = {value!r}")
+        tensor_names = _tensor_names()
+        if tensor_names:
+            print("Tensors:")
+            for name in tensor_names:
+                print(f"  - {name}")
+        selected_tensor = args.tensor or (tensor_names[0] if tensor_names else None)
+        if selected_tensor and args.sample > 0:
+            _print_sample(selected_tensor, payload)
+        elif args.tensor and selected_tensor not in payload:
+            print(f"Tensor {args.tensor!r} not found in the file")
+
+    if args.list_tensors and not args.info:
+        tensor_names = _tensor_names()
+        if tensor_names:
+            print("Tensors:")
+            for name in tensor_names:
+                print(f"  - {name}")
+        else:
+            print("No tensors found in the file.")
+        return 0
+
+    if args.sample > 0 and not args.info:
+        payload, _ = _ensure_quantized()
+        tensor_names = _tensor_names()
+        selected_tensor = args.tensor or (tensor_names[0] if tensor_names else None)
+        if not selected_tensor:
+            print("No tensors available to sample.")
+        elif selected_tensor not in payload:
+            print(f"Tensor {selected_tensor!r} not found in the file")
+        else:
+            _print_sample(selected_tensor, payload)
     if args.dry_run:
         print("Dry run completed.")
         return 0
 
-    dequantize_gguf(args.input, output_path, dtype=dtype, ggml_type=ggml_type)
+    gguf.dequantize_gguf(args.input, output_path, dtype=dtype, ggml_type=ggml_type)
+    if args.validate:
+        gguf.read_gguf(output_path, dequantize=True)
     if not args.quiet:
         print("Conversion complete.")
     return 0
```
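
Because `main` accepts an explicit argv iterable, the new flags can be exercised in-process as well as from a shell. A usage sketch: `model-tq1.gguf` is a placeholder, and treating the input path as the positional argument is an assumption based on the unchanged part of the parser.

```python
# Drive the CLI in-process; flag names come from the parser additions above.
from t81.scripts.t81_dequant import main

# Print metadata plus the first eight dequantized values of the first tensor.
main(["model-tq1.gguf", "--info", "--sample", "8"])

# Convert with the default f16 target, then reload the output to confirm it parses.
main(["model-tq1.gguf", "--validate"])
```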

temp-gguf-float.gguf

544 Bytes
Binary file not shown.

temp-gguf.gguf

512 Bytes
Binary file not shown.

tests/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+"""Package marker for the test modules."""
```

tests/python/__init__.py

Lines changed: 1 addition & 0 deletions
```diff
@@ -0,0 +1 @@
+"""pytest package marker for the GGUF regression helpers."""
```
