t81dev
diff --git a/AGENTS.md b/AGENTS.md
Lines changed: 1 addition & 0 deletions
diff --git a/README.md b/README.md
Lines changed: 14 additions & 4 deletions
diff --git a/docs/index.md b/docs/index.md
Lines changed: 3 additions & 3 deletions
diff --git a/docs/references/README.md b/docs/references/README.md
Lines changed: 4 additions & 4 deletions
@@ -40,6 +40,7 @@ This file helps AI agents discover and understand how to work with this reposito
 - Added production-ready Python bindings (`python/bindings.cpp`) plus packaging helpers (`setup.py`, `pyproject.toml`) that expose `Limb`/`BigInt` helpers, Montgomery contexts, NumPy quantization utilities, and a tutorial notebook `examples/ternary_quantization_demo.ipynb`.
 - Added `t81.hardware.TernaryEmulator`, documentation for hardware simulation, and `examples/ternary_hardware_sim_demo.ipynb` so agents can explore ternary gate/circuit modeling, fuzzy AI decisions, and power-aware PyTorch inference workflows.
 - Added `docs/references/cli-usage.md` (linked from `docs/index.md`) to cover `t81-convert`, `t81-gguf`, and `t81-qat` usage with the CPU/offloading tips we surfaced for low-memory Apple Silicon.
+- Added a unified `t81` console script that exposes `convert`/`gguf` subcommands while preserving the legacy `t81-convert`/`t81-gguf` wrappers, plus updated docs/tests to reference the new entry point.
 - Added `docs/diagrams/cli-workflows-mermaid.md` to visualize the `t81-convert`, `t81-gguf`, and `t81-qat` workflows for future contributors looking at the CLI surface.
 - Extended `examples/ternary_qat_inference_comparison.py` so it now runs train + validation loops, logs compression ratios + per-step losses, and correlates the ternary threshold history with measured GEMM latencies.
 - Added `scripts/quantize_measure.py`, which chains `t81-convert` → `AutoModel.from_pretrained_t81` → latency/compression stats so you can automate quantize→measure in other pipelines.
 
@@ -46,13 +46,13 @@ It is **not** a drop-in replacement for PyTorch or NumPy, but a focused toolkit
 
 - **C++ limb/bigint & numerics** — build locally, include `<t81/t81lib.hpp>`, and verify the `tests/unit/` suite. Starts: [Quick start](#quick-start) & [docs/api-overview.md](docs/api-overview.md).
 - **Python quantization & helpers** — `pip install .[torch]` unlocks `t81lib`/`t81`, NumPy wrappers, and `t81.torch`/`t81.nn`. See [docs/python-install.md](docs/python-install.md) & [docs/python-api.md](docs/python-api.md).
-- **CLI & GGUF/QAT workflows** — `t81-convert`, `t81-gguf`, and `t81-qat` automate quantize→export→train flows. Follow [docs/references/cli-usage.md](docs/references/cli-usage.md).
+- **CLI & GGUF/QAT workflows** — `t81 convert`, `t81 gguf`, and `t81-qat` (the legacy `t81-convert`/`t81-gguf` aliases still work) automate quantize→export→train flows. Follow [docs/references/cli-usage.md](docs/references/cli-usage.md).
 
 ## Highlights
 
 - **Balanced-ternary core**: `t81::Int` (an alias for `t81::core::limb`) ships overflow-aware arithmetic, canonical I/O, and deterministic hashing.
 - **Ternary-friendly GEMMs**: `t81::linalg::gemm_ternary` packs balanced ternary matrices into AVX/NEON-accelerated kernels with alpha/beta semantics mirrored in the Python binding.
-- **Python, CLI, and Torch helpers**: Pybind11 bindings expose quantize/dequantize utilities and `t81.torch`/`t81.nn`, while `t81-convert`, `t81-gguf`, and `t81-qat` automate quantize/export/train workflows.
+- **Python, CLI, and Torch helpers**: Pybind11 bindings expose quantize/dequantize utilities and `t81.torch`/`t81.nn`, while `t81 convert`, `t81 gguf`, and `t81-qat` (plus the legacy `t81-convert`/`t81-gguf` aliases) automate quantize/export/train workflows.
 - **Normative docs & demos**: Architecture notes, CLI references, and runnable demos live under `docs/`, `examples/`, and `bench/`.
 
 ## Quick start
@@ -80,7 +80,7 @@ On macOS or other PEP 668-enforced environments, activate a virtualenv before ru
 
 ### 2a. CLI-friendly Pipx install
 
-If you prefer shell-level access to `t81-convert`, `t81-gguf`, `t81-qat`, and `t81-dequant`, pipx can install the repo and then inject the torch extras:
+If you prefer shell-level access to the unified `t81` CLI (with `convert`/`gguf` subcommands) plus `t81-qat` and `t81-dequant`, pipx can install the repo and then inject the torch extras:
 
 ```bash
 pipx install --python python3 /Users/t81dev/Desktop/t81lib
@@ -121,7 +121,17 @@ Optional CUDA/ROCm backends can be enabled with `-DUSE_CUDA=ON` / `-DUSE_ROCM=ON
 
 ## CLI helpers
 
-`t81-convert`, `t81-gguf`, and `t81-qat` automate quantize→export→train flows with progress reporting and validation hooks. Browse [docs/references/cli-usage.md](docs/references/cli-usage.md), [docs/diagrams/cli-workflows-mermaid.md](docs/diagrams/cli-workflows-mermaid.md), and [examples/cli-examples.md](examples/cli-examples.md) for recipes.
+`t81 convert`, `t81 gguf`, `t81 info`, and `t81-qat` automate quantize→export→train flows with progress reporting and validation hooks (the legacy `t81-convert`/`t81-gguf` names still work). Browse [docs/references/cli-usage.md](docs/references/cli-usage.md), [docs/diagrams/cli-workflows-mermaid.md](docs/diagrams/cli-workflows-mermaid.md), and [examples/cli-examples.md](examples/cli-examples.md) for recipes.
+
+### Large models & GGUF streaming
+
+When targeting multi-gigabyte models (Llama 3.x or Gemma 3.x checkpoints) you can still run the CLI helpers without triggering macOS’s OOM killer, but you need to pin everything to CPU memory:
+
+- Set `ACCELERATE_DISABLE=1` (and `HF_ACCELERATE_DISABLE=1` when you launch a `transformers` command) so Accelerate never offloads tensors to `meta`/disk and the helpers can call `.to("cpu")`.
+- Prefer `--force-cpu-device-map` or `--device-map none/cpu` so `t81 convert`/`t81 gguf` (and the legacy `t81-convert`/`t81-gguf` wrappers) keep checkpoint shards on host RAM.
+- The Python GGUF reader now streams metadata and tensor info directly from a file handle, seeks to each tensor block, and buffers only one tensor at a time, so `t81 gguf` and `t81-dequant` (plus the legacy `t81-gguf` script) can handle the resulting bundles without reading the entire file into memory.
+
+If you still see `NotImplementedError: Cannot copy out of meta tensor` or a kernel that dies while building Matplotlib’s font cache, repeat the cache setup from [docs/troubleshooting.md](docs/troubleshooting.md#large-gguf-conversions) (`MPLCONFIGDIR`, `FONTCONFIG_PATH`, etc.) before rerunning the CLI.
 
 ### Dequantizing for downstream runtimes
 
 
@@ -24,17 +24,17 @@ to understand the balanced ternary engine without digging through specs immediat
 - **Python cookbook** — [`docs/python-cookbook.md`](python-cookbook.md) gathers recipes that mix `t81lib.pack_dense_matrix`, `t81.torch.TernaryTensor`, and the CLI helpers.
 - **Python install paths** — [`docs/python-install.md`](python-install.md) explains pip/pipx builds, validation tips, and CLI helper installs.
 - **PyTorch how-to** — [`docs/torch.md`](torch.md) walks through `t81.torch`, `t81.nn`, conversion helpers, and how the CLI scripts mirror the Python flows.
-- **CLI reference** — [`docs/references/cli-usage.md`](references/cli-usage.md) lists the `t81-convert`, `t81-gguf`, and `t81-qat` helpers
+- **CLI reference** — [`docs/references/cli-usage.md`](references/cli-usage.md) lists the unified `t81 convert`/`t81 gguf` helpers (with legacy `t81-convert`/`t81-gguf` aliases) plus `t81-qat`
   plus the common flags for exporting GGUF bundles and running QAT.
 - **Hardware & energy reference** — [`docs/references/hardware-emulation.md`](references/hardware-emulation.md) connects `t81.hardware.TernaryEmulator`
   with the Python quantization helpers plus the new [`scripts/quantize_measure.py`](../scripts/quantize_measure.py) automation that chains
-  `t81-convert` → measurement.
+  `t81 convert` → measurement.
 - **Python demos** — the [`examples/`](../examples/) scripts/notebooks track `t81.torch` + `t81.nn` workflows; add
   [`examples/ternary_qat_inference_comparison.py`](../examples/ternary_qat_inference_comparison.py) to kick off a mini `t81.trainer` QAT loop, print the ternary
   threshold schedule, and compare `torch.matmul` vs. `t81lib.gemm_ternary` latency so you can prototype
   entirely inside Python before launching the CLI helpers.
 - **CLI automation & energy benchmarking** — [`scripts/quantize_measure.py`](../scripts/quantize_measure.py) and [`scripts/quantize_energy_benchmark.py`](../scripts/quantize_energy_benchmark.py)
-  chain `t81-convert`/`t81-gguf` runs with latency/energy measurement so you can report quantization impact directly
+  chain `t81 convert`/`t81 gguf` runs with latency/energy measurement so you can report quantization impact directly
   from command-line workflows.
 - **Use cases & demos** — [`docs/use-cases.md`](use-cases.md) and [`examples/README.md`](../examples/README.md) capture the canonical scripts, notebooks, and research stories.
 - **Hardware simulation** — [`docs/hardware.md`](hardware.md) details `t81.hardware.TernaryEmulator`, fuzzy helpers, and the visualizer notebook.
 
@@ -1,18 +1,18 @@
 ### Reference index
 
 This folder collects lightweight guides for the console scripts and the GGUF
-plugin surface. For a broader overview of the CLI helpers (`t81-convert`,
-`t81-gguf`, and `t81-qat`), see `cli-usage.md`.
+plugin surface. For a broader overview of the CLI helpers (`t81 convert`,
+`t81 gguf`, and `t81-qat`, plus the legacy `t81-convert`/`t81-gguf` aliases), see `cli-usage.md`.
 
 ### GGUF export (llama.cpp / Ollama / LM Studio)
 
 ```bash
-t81-convert meta-llama/Llama-3.2-3B-Instruct llama3.2-3b-t81.gguf --quant TQ1_0
+t81 convert meta-llama/Llama-3.2-3B-Instruct llama3.2-3b-t81.gguf --quant TQ1_0
 ```
 
 When writing a converted checkpoint or GGUF bundle, append `--force-cpu-device-map`
 so that Accelerate keeps parameters on CPU/disk instead of dispatching them to
 `meta`. The default `device_map="auto"` path can offload modules to disk and
 triggers `NotImplementedError: Cannot copy out of meta tensor` when `save_pretrained`
 runs. Using the new flag ensures everything stays serializable, and you can rerun
-`t81-gguf`/`t81-convert` with it whenever you hit that error.
+`t81 gguf`/`t81 convert` (or the legacy `t81-gguf`/`t81-convert` scripts) with it whenever you hit that error.
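
The CPU-pinning advice in this change can be combined into one wrapper script. The following is a minimal sketch, not part of the patch: it assumes the environment variables and the `--force-cpu-device-map` flag behave as the README text describes, and the `MODEL` and `OUT` variable names are illustrative only. If `t81` is not on `PATH`, the script prints the command instead of running it.

```bash
#!/usr/bin/env sh
set -eu

# Assumption from the README text: these variables stop Accelerate from
# offloading tensors to meta/disk during conversion.
export ACCELERATE_DISABLE=1
export HF_ACCELERATE_DISABLE=1

# Illustrative names; the model ID and output file come from the docs example.
MODEL="meta-llama/Llama-3.2-3B-Instruct"
OUT="llama3.2-3b-t81.gguf"

# Build the command as positional parameters so the arguments stay intact.
set -- t81 convert "$MODEL" "$OUT" --quant TQ1_0 --force-cpu-device-map

if command -v t81 >/dev/null 2>&1; then
  "$@"
else
  # t81 is not installed: show what would have been executed.
  printf 'would run: %s\n' "$*"
fi
```

Building the command with `set --` keeps each argument as a separate word, so the same script could be pointed at the legacy `t81-convert` wrapper by changing only the first two words.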