TL;DR: this is a Windows-first, offline-first TurboQuant research workspace that measures what usually gets hand-waved away: not just reconstruction, but hidden-state transport, attention behavior, GGUF packaging, and whether TQ4_1S / Triality artifacts actually survive the trip into a real llama.cpp runtime.
- Best current K-side practical line: `key_only_block_so8_triality_vector`
- Best current story for readers: reproducible RTX 3060 12 GB evidence, deterministic fixture export, and a vendored `zapabob/llama.cpp` that can now load and execute real `TQ4_1S` GGUFs on CPU and CUDA
- What is already implemented here: offline TurboQuant replay, Triality + learned SO(8) research/export, GGUF metadata/contract packaging, `TQ4_1S` converter/export, large-batch CUDA `q8_0` scratch staging, and a fused packed-weight CUDA path for small decode in vendored `llama.cpp`
- What is not claimed here: universal runtime wins across every model/runtime stack, a fully profiled final routing policy, or that the current CUDA line is the last word on TurboQuant performance
- Why this repo exists: to keep research-faithful math, artifact contracts, and runtime integration in one place instead of treating them as separate hand-wavy projects
Most TurboQuant repos stop at one of these boundaries:
- paper math without runtime integration
- runtime integration without reproducible offline evidence
- memory wins without hidden-state checks
- codec contracts without deterministic export/import validation
This repo tries to keep the whole chain visible:
- paper-faithful Stage 1 / Stage 2
- captured Qwen3.5-9B replay on real hardware
- Triality + learned SO(8) K-side experiments
- GGUF metadata and artifact packaging
- vendored `zapabob/llama.cpp` runtime consumption
- operator-facing local Studio flows on Windows
That is also why the README keeps the evidence families in the top level instead of hiding them deep in artifact folders.
The latest implementation wave brought the repo much closer to a full research-to-runtime handoff:

- Byte-exact `TQ4_1S` export path
  - Python GGUF export was aligned to the `llama.cpp` reference math and validated against real Gemma 4 artifacts.
- Real `TQ4_1S` loadability in vendored `llama.cpp`
  - `GGML_TYPE_TQ4_1S` is now loadable on CPU, through the staged CUDA path for large batches, and through the fused CUDA decode line for small batches in the vendored runtime.
- Triality shared ABI hardening
  - canonical naming, alias normalization, and stricter fail-closed metadata checks now reject incomplete artifacts instead of quietly accepting them.
- Learned SO(8) export with explicit validity metrics
  - orthogonality and determinant metrics are now carried through the Triality metadata line.
- CUDA runtime closeout for `TQ4_1S`
  - the vendored runtime now has a dedicated `TQ4_1S -> q8_0` scratch path for large batches, a fused packed-weight CUDA path for small decode, and real Gemma 4 CUDA smoke verification on this PC.

Implementation logs:

- `_docs/2026-04-19_tq4_1s-python-e2e-byte-exact-and-real-gemma4-validation.md`
- `_docs/2026-04-19_llama_cpp_tq4_1s_ggml_loadable_path.md`
- `_docs/2026-04-20_triality-shared-abi-and-fail-closed-runtime.md`
- `_docs/2026-04-21_triality-views-and-tq4_1s-q8_0-scratch-kernel.md`
- `_docs/2026-04-21_tq4_1s_fused_cuda_closeout.md`
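The "explicit validity metrics" for a learned SO(8) export reduce to two checks: how far the Gram matrix RᵀR drifts from the identity, and whether det(R) is +1 (a proper rotation rather than a reflection). The following is a stdlib-only sketch of those two checks; the function and metric names are illustrative and are not the repo's actual metadata keys or API.

```python
# Sketch: validity metrics for a learned SO(8) rotation export.
# Names here (so8_validity, etc.) are illustrative, not the repo's API.

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def transpose(a):
    return [list(row) for row in zip(*a)]

def det(a):
    # Gaussian elimination with partial pivoting (plenty for an 8x8 matrix).
    n = len(a)
    m = [row[:] for row in a]
    d = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            return 0.0
        if pivot != col:
            m[col], m[pivot] = m[pivot], m[col]
            d = -d
        d *= m[col][col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n):
                m[r][c] -= f * m[col][c]
    return d

def so8_validity(rotation):
    """Return (orthogonality_error, determinant) for an 8x8 matrix.

    A valid SO(8) export should have error ~0 and determinant ~+1.
    """
    gram = matmul(transpose(rotation), rotation)
    err = max(abs(gram[i][j] - (1.0 if i == j else 0.0))
              for i in range(8) for j in range(8))
    return err, det(rotation)

# The identity is trivially in SO(8): zero error, determinant +1.
identity = [[1.0 if i == j else 0.0 for j in range(8)] for i in range(8)]
err, d = so8_validity(identity)
print(err, d)  # 0.0 1.0
```

A fail-closed exporter would refuse to emit Triality metadata when either metric falls outside tolerance, rather than writing a matrix that is silently not a rotation.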
The repo currently centers on:
- Qwen3.5-9B text-only captured KV
- RTX 3060 12 GB reduced comparison matrices
- Triality fixture export for Qwen 3.5 and Gemma 4
- GGUF packaging for `llama.cpp` and Hypura-style consumers
- weight-side contract staging around `hypura.turboquant.weight.v1` and `codec=tq4_1s`
Mainline K-side modes:
`exact`, `key_only_random`, `full_kv`, `asym_q8_turbo4`, `asym_q8_turbo3`, `multiscreen_relevance`, `key_only_block_so8_triality_vector`
For day-to-day reading, the practical 4-bit headline is:
| Mode | Logit cosine (M +/- SD) | Hidden cosine (M +/- SD) | Memory ratio vs exact (M +/- SD) |
|---|---|---|---|
| `exact` | 1.000000 +/- 0.000000 | 1.000000 +/- 0.000000 | 1.000000 +/- 0.000000 |
| `full_kv` | 0.997396 +/- 0.006379 | 0.994792 +/- 0.003189 | 0.255859 +/- 0.000000 |
| `asym_q8_turbo4` | 1.001302 +/- 0.005881 | 0.994141 +/- 0.004784 | 0.378906 +/- 0.000000 |
| `asym_q8_turbo3` | 0.996745 +/- 0.002941 | 0.981771 +/- 0.004731 | 0.347656 +/- 0.000000 |
| `multiscreen_relevance` | 1.002604 +/- 0.006379 | 1.000000 +/- 0.000000 | 0.660156 +/- 0.000000 |
| `key_only_block_so8_triality_vector` | 1.000000 +/- 0.000000 | 0.999349 +/- 0.001595 | 0.628906 +/- 0.000000 |
Practical reading:
- `key_only_block_so8_triality_vector` remains the production K-side reference because it keeps hidden-state quality high while staying simple to package and consume.
- `asym_q8_turbo4` is still the aggressive memory-saving baseline worth keeping in the README because it is the honest low-memory comparison point.
- `full_kv` still shows why this repo refuses to equate "good logit-like scores" with "safe hidden-state transport."
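The logit and hidden cosine columns above are plain cosine similarities between reference and reconstructed vectors, pooled into the M +/- SD form. A minimal stdlib sketch of both computations (function names are illustrative, not the repo's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def mean_pm_sd(values):
    """Return (mean, sample SD), the M +/- SD pair used in the tables."""
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var)

# Identical vectors score exactly 1.0, matching the `exact` row.
print(cosine_similarity([3.0, 4.0], [3.0, 4.0]))  # 1.0
```

Note that per-trial cosines can exceed 1.0 in the mean when pooled against a reference trial, which is why some rows above show values like 1.001302.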
At this point, the repo has three clear layers.
| Layer | What is implemented now | What is deliberately still incomplete |
|---|---|---|
| Offline research | paper-faithful Stage 1 / Stage 2, captured replay, Triality / SO(8), hidden and attention metrics | not every research branch is promoted to production defaults |
| Artifact contract | GGUF packaging, `hypura.turboquant.*`, `hypura.turboquant.weight.*`, deterministic fixture export, real Gemma 4 multimodal-safe guards | weight codec policy is ahead of fully optimized weight runtime kernels |
| Runtime path | vendored `zapabob/llama.cpp` can load real `TQ4_1S` GGUFs on CPU, use a dedicated `q8_0` scratch CUDA path for large batches, and use a fused packed-weight CUDA path for small decode | no claim that the current CUDA routing thresholds or kernel mix are the final performance architecture |
The current CUDA story is now concrete enough to state plainly:
- Large prefill / large batch: contiguous `TQ4_1S` weights route through `TQ4_1S -> q8_0 scratch -> fp16 -> cuBLAS`
- Small decode: contiguous `TQ4_1S` weights can use the fused packed-weight CUDA MMVQ path
- Shared ABI: Triality `vector`, `spinor_plus_proxy`, and `spinor_minus_proxy` stay fail-closed and metadata-complete
- Real-model smoke: the Gemma 4 `TQ4_1S` GGUF loads on this PC with `type tq4_1s: 228 tensors`, `ngl 99`, and completed `pp1`/`tg1` CUDA runs
For the full closeout commands and verification output, see _docs/2026-04-21_tq4_1s_fused_cuda_closeout.md.
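The routing above amounts to a dispatch on batch size and layout. This is an illustrative Python sketch of that policy, not the vendored CUDA code; the threshold constant, its value, and the path names are assumptions for exposition:

```python
# Illustrative sketch of the TQ4_1S CUDA routing policy described above.
# MMVQ_MAX_BATCH and the returned path names are hypothetical; the real
# decision lives in the vendored llama.cpp CUDA dispatch.
MMVQ_MAX_BATCH = 8  # assumed cutoff between decode and prefill paths

def route_tq4_1s(batch_size: int, contiguous: bool) -> str:
    if not contiguous:
        # Non-contiguous weights fall back to a generic dequantize path.
        return "fallback_dequant"
    if batch_size <= MMVQ_MAX_BATCH:
        # Small decode: fused packed-weight MMVQ kernel.
        return "fused_mmvq"
    # Large prefill / batch: stage through q8_0 scratch, then fp16 cuBLAS.
    return "q8_0_scratch_cublas"

print(route_tq4_1s(1, True))    # fused_mmvq
print(route_tq4_1s(512, True))  # q8_0_scratch_cublas
```

The README's own hedge applies here: nothing guarantees this threshold shape is the final performance architecture, only that both paths exist and were smoke-tested.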
| Path | Contents |
|---|---|
| `artifacts/qwen_3060_matrix/metrics/qwen_3060_matrix_trials.csv` | Raw per-trial rows for the 7-mode 12 GB matrix |
| `artifacts/qwen_3060_matrix/metrics/qwen_3060_matrix_summary.csv` / `.md` | Pooled summary with mean, SD, SEM, and 95% CI |
| `artifacts/qwen_3060_matrix/metrics/qwen_3060_matrix_mean_pm_sd.csv` / `.md` | Mode x bit M +/- SD table |
| `artifacts/qwen_3060_matrix/metrics/qwen_3060_matrix_friedman.csv` / `.md` | Friedman test across the 7 modes |
| `artifacts/qwen_3060_matrix/metrics/qwen_3060_matrix_pairwise.csv` / `.md` | Pairwise Wilcoxon-Holm vs baseline modes |
| `artifacts/qwen_3060_matrix/reports/qwen_3060_matrix_summary.md` | Exported markdown summary used by repo docs |
| `artifacts/qwen_3060_matrix/plots/qwen_3060_matrix_attention.png` | Attention/logit trade-off plot with error bars |
| `artifacts/qwen_3060_matrix/plots/qwen_3060_matrix_runtime.png` | Runtime trade-off plot with error bars |
| Path | Contents |
|---|---|
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_trials_captured.csv` | Raw per-trial rows |
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_summary_captured.csv` / `.md` | Pooled summary with mean, SD, SEM, and 95% CI |
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_summary_mean_pm_sd.csv` / `.md` | Mode x bit M +/- SD table |
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_statistics.csv` / `.md` | Mode-wise statistical tests |
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_friedman_rotation_modes.csv` / `.md` | Friedman test across K modes at fixed bit |
| `artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_pairwise_wilcoxon_rotation_modes.csv` / `.md` | Pairwise Wilcoxon with Holm correction |
| `artifacts/research_extension/triality_full_eval_prod_bf16/plots/triality_*_captured.png` | Trade-off and M +/- SD plots |
| `artifacts/research_extension/triality_full_eval_prod_bf16/plots/triality_advantage_*.png` | Triality advantage figures |
Source: artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_summary_mean_pm_sd.csv
| Mode | Logit cosine (M +/- SD) | Hidden cosine (M +/- SD) | Memory ratio (M +/- SD) |
|---|---|---|---|
| `key_only_block_so8_static` | 0.999512 +/- 0.003699 | 1.000651 +/- 0.003762 | 0.628906 +/- 0.000000 |
| `key_only_block_so8_learned` | 0.999512 +/- 0.002655 | 1.000488 +/- 0.003321 | 0.628906 +/- 0.000000 |
| `key_only_block_so8_triality_vector` | 1.000000 +/- 0.004461 | 1.000651 +/- 0.003762 | 0.628906 +/- 0.000000 |
| `key_only_block_so8_triality_plus` | 1.000814 +/- 0.004150 | 1.000163 +/- 0.003546 | 0.628906 +/- 0.000000 |
| `key_only_block_so8_triality_minus` | 1.000977 +/- 0.005054 | 1.000651 +/- 0.004258 | 0.628906 +/- 0.000000 |
| `full_kv` | 0.999023 +/- 0.003688 | 0.995605 +/- 0.003115 | 0.255859 +/- 0.000000 |
Source: artifacts/research_extension/triality_full_eval_prod_bf16/metrics/triality_friedman_rotation_modes.md
- For hidden cosine, the rotation-family differences are strongly non-random at low bits.
- At 4 bits, the Friedman row is `statistic = 264.2103546677535`, `p = 3.7569452515144635e-54`, `n_blocks = 96`, `n_modes = 7`.
- This is exactly the kind of result the repo wants to expose: memory and logit-like scores alone do not tell the full story.
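For readers who want to sanity-check a Friedman row like this one, the statistic is computable from per-block ranks alone. A stdlib-only sketch (the repo's export may well use scipy's `friedmanchisquare`, which also supplies the chi-square p-value):

```python
# Pure-Python sketch of the Friedman chi-square statistic.
# Illustrative only; a real export would also compute the p-value.

def rank_with_ties(row):
    """Average 1-based ranks for one block, with tied values sharing a rank."""
    order = sorted(range(len(row)), key=lambda i: row[i])
    ranks = [0.0] * len(row)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and row[order[j + 1]] == row[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of positions i..j, 1-based
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def friedman_statistic(blocks):
    """blocks: list of per-block score lists, one value per mode."""
    n, k = len(blocks), len(blocks[0])
    rank_sums = [0.0] * k
    for row in blocks:
        for j, r in enumerate(rank_with_ties(row)):
            rank_sums[j] += r
    return (12.0 * sum(r * r for r in rank_sums) / (n * k * (k + 1))
            - 3.0 * n * (k + 1))

# Toy data: mode 2 always ranks highest across 3 blocks.
blocks = [[0.1, 0.9, 0.5], [0.2, 0.8, 0.4], [0.1, 0.7, 0.3]]
print(friedman_statistic(blocks))  # 6.0
```

With `n_blocks = 96` and `n_modes = 7` as in the row above, a statistic of 264 is far into the tail of the chi-square distribution with 6 degrees of freedom, which is where the reported p-value comes from.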
These are the high-level figures most readers should look at first.
The tracked README copies come from the main 12 GB matrix export path. Plot points are means and the error bars come from the summary statistics emitted by the matrix export.
Keep these three evidence families together:
- Eval Output Layout
  - raw trials, pooled summaries, pairwise tests, and plots
- Pareto Frontiers
  - hidden/logit/memory trade-off views
- Paper Baseline Reference Results
  - paper-faithful captured reference rows
- Triality Advantage Figures
  - direct rotation-family comparisons
That is the minimum set that prevents “nice memory ratios” from turning into misleading conclusions.
Source: artifacts/paper_baseline/qwen_captured_reported/metrics/attention_summary_captured_mean_pm_sd.md
| Bits | Logit cosine | Hidden cosine (KO) | Hidden cosine (FV) | Memory/exact (KO) | Memory/exact (FV) | Attention relative error (KO) | Attention relative error (FV) |
|---|---|---|---|---|---|---|---|
| 2 | 0.995117 +/- 0.001953 | 0.997070 +/- 0.003740 | 0.939453 +/- 0.006766 | 0.566406 | 0.130859 | 0.048950 +/- 0.025218 | 0.340332 +/- 0.003336 |
| 2.5 | 0.998047 +/- 0.002255 | 0.999023 +/- 0.001953 | 0.957031 +/- 0.003189 | 0.574219 | 0.146484 | 0.033066 +/- 0.013036 | 0.287109 +/- 0.001595 |
| 3 | 1.000000 +/- 0.000000 | 0.998047 +/- 0.002255 | 0.980469 +/- 0.005524 | 0.597656 | 0.193359 | 0.023590 +/- 0.007530 | 0.184570 +/- 0.000797 |
| 3.5 | 0.999023 +/- 0.001953 | 0.999023 +/- 0.001953 | 0.988281 +/- 0.003189 | 0.605469 | 0.208984 | 0.014328 +/- 0.004094 | 0.150635 +/- 0.002916 |
| 4 | 0.998047 +/- 0.003906 | 0.999023 +/- 0.001953 | 0.995117 +/- 0.001953 | 0.628906 | 0.255859 | 0.009918 +/- 0.003365 | 0.096924 +/- 0.001668 |
The baseline takeaway remains simple and important:
- K-only TurboQuant-like lines preserve hidden-state quality much better than `full_kv` at low bits.
- That is exactly why this repo treats K-side stability as the practical production question.
Run everything from the repository root, the directory containing `pyproject.toml`.
```powershell
uv python install 3.12.9
uv venv --python 3.12.9
uv sync --extra cu128 --extra dev --extra hf_qwen --extra eval
uv run python scripts\env_check.py
uv run python scripts\validate_repo_contract.py
```

Main extras:

- `--extra cu128`: CUDA PyTorch
- `--extra dev`: pytest and verification helpers
- `--extra hf_qwen`: Hugging Face / Qwen capture path
- `--extra eval`: runtime eval and report export dependencies
```powershell
uv run python scripts\env_check.py
uv run python scripts\validate_repo_contract.py
```

Backend:
```powershell
uv run python scripts\run_turboquant_studio.py
```

Frontend dev server:
```powershell
Set-Location .\studio-web
npm install
npm run dev
```

```powershell
uv run python scripts\pack_turboquant_gguf.py `
  --input-gguf path\to\base.gguf `
  --output-gguf path\to\output.turboquant.gguf `
  --profiles paper,so8_triality_vector `
  --default-profile exact `
  --hypura-compatible-profile auto
```

```powershell
uv run python scripts\export_triality_fixture.py `
  --output-dir artifacts\triality_fixtures `
  --mode triality-proxy-so8-pareto `
  --model-family Qwen/Qwen3.5-27B

uv run python scripts\verify_triality_export.py `
  --manifest artifacts\triality_fixtures\triality-proxy-so8-pareto\triality-fixture-manifest.json
```

This repo is intentionally strict about what it claims.
It does claim:
- offline TurboQuant correctness matters
- hidden-state quality matters
- artifact contracts should be explicit and testable
- `TQ4_1S` should be validated end to end, not just named in metadata
It does not currently claim:
- a finished fused packed-weight CUDA kernel
- a final optimized `TQ4_1S -> q8_0 scratch + cuBLAS` implementation
- that every research branch should become a production default
- that replay-only evidence is the same as end-to-end runtime evidence

What the repo does pin down:

- the vendored runtime is pinned through `.gitmodules` to `zapabob/llama.cpp`
- exported GGUF metadata is expected to preserve the current `tq_*` and `hypura.turboquant.*` contract surfaces
- repo integrity is checked by `repo_contract.toml` and `scripts\validate_repo_contract.py`
- runtime-facing README claims are limited to paths that were actually loaded or measured
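"Fail-closed" in this contract posture means a consumer rejects an artifact when any required metadata key is absent, instead of substituting defaults. A minimal sketch of that gate; the specific key names under `hypura.turboquant.` are hypothetical examples, not the actual contract surface:

```python
# Sketch of a fail-closed metadata gate. REQUIRED_KEYS is hypothetical;
# the real contract lives in repo_contract.toml and the
# hypura.turboquant.* metadata surfaces.
REQUIRED_KEYS = (
    "hypura.turboquant.version",   # hypothetical example key
    "hypura.turboquant.codec",     # hypothetical example key
)

def validate_metadata(metadata: dict) -> None:
    """Raise on any missing key instead of silently defaulting."""
    missing = [k for k in REQUIRED_KEYS if k not in metadata]
    if missing:
        raise ValueError(f"fail-closed: missing contract keys {missing}")

validate_metadata({
    "hypura.turboquant.version": "1",
    "hypura.turboquant.codec": "tq4_1s",
})  # complete artifact passes

try:
    validate_metadata({"hypura.turboquant.version": "1"})
except ValueError as exc:
    print(exc)  # incomplete artifact is rejected, not quietly accepted
```

The same posture explains why the Triality ABI hardening above rejects alias-only or partially populated artifacts outright.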
| Repository | Role |
|---|---|
| zapabob/Turboquant-CUDA | Upstream PyTorch / offline TurboQuant semantics |
| zapabob/llama.cpp | Runtime GGUF loader and serving path |
| zapabob/Hypura | Tiered inference / serving integration target |
| zapabob/multiscreen-pytorch | Multiscreen reference implementation used in the relevance path |
Apache-2.0