Commit 035938c ("gemm")

1 parent 047df23

15 files changed
Lines changed: 1015 additions & 117 deletions

AGENTS.md

Lines changed: 1 addition & 0 deletions

@@ -59,3 +59,4 @@ This file helps AI agents discover and understand how to work with this repository
 - Restructured `README.md` into an onboarding-focused front door and added companion docs (`docs/use-cases.md`, `docs/hardware.md`, `docs/api-overview.md`, `docs/python-install.md`, `docs/torch.md`, `docs/gpu.md`, `examples/README.md`) so heavy reference material lives outside the visitor-facing overview.
 - Added optional CUDA/ROCm toggles plus a GPU dispatcher sketch (`include/t81/linalg/gemm_gpu.hpp`, `src/linalg/{gemm_cuda.cu,gemm_dispatch.cpp,gemm_rocm.cpp}`) so future teams can wire the new `where`/`clamp`/`lerp`/`addcmul` helpers into GPU kernels, introduced `t81::TensorMetadata` + Python helpers (`python/bindings.cpp`) that extract metadata from NumPy/Torch tensors, and expanded `tests/python/test_gpu_ops.py` to cover the metadata-backed bindings on both CPU and GPU paths.
 - Enhanced `tests/python/test_gguf.py` with quant-parameterized round-trip checks, metadata assertions, and a regression case for invalid quant identifiers to spotlight the GGUF helpers before future agents touch them.
+- Hardened the SIMD detection helpers in `include/t81/core/detail/simd.hpp` with CPUID/xgetbv fallbacks, documented the `add_trytes_*` overflow semantics, and made NEON runtime checks opt-out via `T81_DISABLE_NEON`.

CMakeLists.txt

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ option(T81LIB_ENABLE_TORCH_BINDINGS
 
 option(USE_CUDA "Enable CUDA backend" OFF)
 option(USE_ROCM "Enable ROCm/HIP backend" OFF)
+option(USE_METAL "Enable Apple Metal backend" OFF)
 
 if(USE_CUDA)
   enable_language(CUDA)
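The new toggle feeds the backend selection described in `docs/gpu.md`. As an illustrative sketch only (the function name is hypothetical and the CUDA-before-ROCm preference is an assumption; the real logic lives in `src/linalg/gemm_dispatch.cpp`), `Backend::Auto` resolution might look like:

```python
# Hypothetical sketch of Backend::Auto resolution: honor an explicit request,
# otherwise prefer a compiled GPU backend and fall back to the CPU path.
def pick_backend(use_cuda, use_rocm, requested="auto"):
    if requested != "auto":
        return requested        # explicit Backend::CUDA / Backend::ROCm / Backend::CPU
    if use_cuda:                # assumed preference order: CUDA first
        return "cuda"
    if use_rocm:
        return "rocm"
    return "cpu"                # always-available fallback

print(pick_backend(False, True))   # rocm
print(pick_backend(False, False))  # cpu
```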

docs/gpu.md

Lines changed: 3 additions & 1 deletion

@@ -2,4 +2,6 @@
 
 CUDA/ROCm kernels can be built when you configure with `-DUSE_CUDA=ON` or `-DUSE_ROCM=ON` (see `python/CMakeLists.txt`). The bindings expose `t81lib.where`, `t81lib.clamp`, `t81lib.lerp`, and `t81lib.addcmul`, which accept either NumPy buffers or PyTorch tensors and dispatch directly to the GPU kernels.
 
-Dispatch relies on `t81::TensorMetadata` (`include/t81/tensor_metadata.hpp`): a lightweight struct that carries device tags, dtype codes, shape, strides, and `data_ptr` so the dispatcher can call the right CUDA/HIP kernel without copies. When torch is available, `t81lib` automatically wraps tensors; without torch it gracefully falls back to CPU buffers. Review `python/bindings.cpp` for the extraction helpers and lifetime management and follow the [GPU dispatch diagram](diagrams/gpu-dispatch.mermaid.md) for the metadata flow.
+The dispatcher is driven entirely by `t81::TensorMetadata` (`include/t81/tensor_metadata.hpp`): a lightweight struct that carries device tags, dtype codes, shape, strides, and `data_ptr` so the runtime can call the right CUDA/HIP kernel without copies. Torch-aware helpers in `python/bindings.cpp` create metadata for GPU tensors (including a `requires_sync` flag when needed) and fall back to contiguous CPU buffers when torch is unavailable.
+
+`t81lib.gemm_ternary` now shares the same metadata plumbing. The CUDA/HIP kernels view `ScalarType::TernaryLimb` buffers as packed `core::limb` rows (`TRYTES_PER_LIMB` trytes packed into 16 bytes) and expect contiguous layouts (`np.dtype('V16')` rows or `torch.uint8` views with dimensions `(M, K_limbs)` / `(K_limbs, N)`). The accumulator `C` must remain float32 and contiguous. With `Backend::Auto`, the binding dispatches to CUDA/ROCm when available; otherwise it falls back to the CPU path. Review the [GPU dispatch diagram](diagrams/gpu-dispatch.mermaid.md) for how metadata flows from NumPy/Torch -> CUDA/HIP -> back to Python.
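The layout contract is easy to get wrong from Python, so a plain-Python sketch of the checks can help when preparing buffers. The helper name and error messages below are illustrative, not the binding's actual API; the 16-byte limb size and the K-divisible-by-48 rule come from the docs and `gemm.hpp`:

```python
# Pure-Python sketch of the gemm_ternary layout contract. Shapes are plain
# tuples standing in for NumPy dtype('V16') / torch.uint8 views.
LIMB_BYTES = 16        # one packed limb row element, i.e. np.dtype('V16')
TRITS_PER_LIMB = 48    # gemm_ternary requires K divisible by 48

def check_gemm_ternary_layout(m, n, k, a_shape, b_shape, c_shape):
    """Return K_limbs if the shapes satisfy the documented contract."""
    if k % TRITS_PER_LIMB != 0:
        raise ValueError("K must be divisible by 48")
    k_limbs = k // TRITS_PER_LIMB
    if a_shape != (m, k_limbs):
        raise ValueError("A must be (M, K_limbs) packed limb rows")
    if b_shape != (k_limbs, n):
        raise ValueError("B must be (K_limbs, N) packed limb rows")
    if c_shape != (m, n):
        raise ValueError("C must be (M, N), float32, contiguous")
    return k_limbs

print(check_gemm_ternary_layout(4, 3, 96, (4, 2), (2, 3), (4, 3)))  # 2
```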

docs/python-cookbook.md

Lines changed: 6 additions & 0 deletions

@@ -17,6 +17,12 @@ t81lib.gemm_ternary(packed, packed, c, 16, 16, 48)
 
 This shows how to drive the low-level binding (`t81lib.pack_dense_matrix`) together with `gemm_ternary` without needing PyTorch.
 
+### Keep packed buffers on the GPU
+
+When CUDA/ROCm support is enabled, `t81lib.gemm_ternary` accepts GPU-backed metadata directly. The dispatcher expects A/B to describe `ScalarType::TernaryLimb` rows with `TRYTES_PER_LIMB` packed trytes (e.g., NumPy `dtype('V16')` or `torch.uint8` views shaped `(M, k_limbs, 16)` after calling `torch.from_numpy(packed).reshape(...)`). The accumulator `C` stays float32 and contiguous, and with `Backend::Auto` the binding routes the work to the compiled GPU kernel if the necessary backend is available.
+
+Use `t81.torch.TernaryTensor` to keep limbs on the GPU and let the binding generate the required metadata, or copy a packed NumPy buffer onto CUDA/ROCm if you need to interface with other tooling. Because the binding now shares the same `TensorMetadata` flow as `t81lib.where`/`clamp`/`lerp`, no extra copying or manual span conversions are required when the inputs already live on a compatible device.
+
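When validating GPU results, a tiny pure-Python reference of the gemm semantics (`C = alpha * A.B + beta * C` over trits in {-1, 0, +1}) is handy. This sketch deliberately skips the packed limb layout, which `t81lib` handles internally, so it is a cross-check, not the library's algorithm:

```python
# Reference ternary GEMM on unpacked trit matrices: lists of lists with
# entries in {-1, 0, +1}. C may hold arbitrary floats.
def gemm_ternary_reference(a, b, c, alpha=1.0, beta=0.0):
    m, k = len(a), len(a[0])
    n = len(b[0])
    out = [[beta * c[i][j] for j in range(n)] for i in range(m)]
    for i in range(m):
        for j in range(n):
            acc = sum(a[i][p] * b[p][j] for p in range(k))
            out[i][j] += alpha * acc
    return out

a = [[1, -1, 0], [0, 1, 1]]
b = [[1, 0], [-1, 1], [0, -1]]
c = [[0.0, 0.0], [0.0, 0.0]]
print(gemm_ternary_reference(a, b, c))  # [[2.0, -1.0], [-1.0, 0.0]]
```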
 ## 2. Drop in `t81.torch.TernaryTensor` during training
 
 ```python

include/t81/core/detail/simd.hpp

Lines changed: 123 additions & 20 deletions

@@ -5,57 +5,160 @@
 
 #include <optional>
 #include <utility>
+#include <cstdint>
+
+#if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
+#include <immintrin.h>
+#if defined(_MSC_VER)
+#include <intrin.h>
+#elif defined(__GNUC__) || defined(__clang__)
+#include <cpuid.h>
+#endif
+#endif
 
 namespace t81::core {
 class limb;
 } // namespace t81::core
 
 namespace t81::core::detail {
 
-inline bool cpu_supports_avx2() noexcept {
+namespace {
+
 #if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
-#if defined(__has_builtin)
-#if __has_builtin(__builtin_cpu_supports)
-  return __builtin_cpu_supports("avx2");
+struct cpuid_regs {
+  unsigned int eax{};
+  unsigned int ebx{};
+  unsigned int ecx{};
+  unsigned int edx{};
+};
+
+inline cpuid_regs read_cpuid(unsigned int leaf, unsigned int subleaf) {
+#if defined(_MSC_VER)
+  int regs[4];
+  __cpuidex(regs, leaf, subleaf);
+  return {static_cast<unsigned int>(regs[0]),
+          static_cast<unsigned int>(regs[1]),
+          static_cast<unsigned int>(regs[2]),
+          static_cast<unsigned int>(regs[3])};
+#elif defined(__GNUC__) || defined(__clang__)
+  unsigned int eax, ebx, ecx, edx;
+  __get_cpuid_count(leaf, subleaf, &eax, &ebx, &ecx, &edx);
+  return {eax, ebx, ecx, edx};
 #else
-  return false;
+  return {};
 #endif
+}
+
+inline unsigned long long read_xcr0() {
+#if defined(_MSC_VER)
+  return _xgetbv(0);
 #else
-  return false;
+  unsigned int eax, edx;
+  __asm__ volatile("xgetbv" : "=a"(eax), "=d"(edx) : "c"(0));
+  return (static_cast<unsigned long long>(edx) << 32) | eax;
 #endif
-#else
+}
+
+inline bool os_supports_xsave() {
+  const auto leaf1 = read_cpuid(1, 0);
+  return (leaf1.ecx & (1u << 27)) != 0;
+}
+
+inline bool os_supports_xcr_states(unsigned long long mask) {
+  if (!os_supports_xsave()) {
+    return false;
+  }
+  return (read_xcr0() & mask) == mask;
+}
+
+inline bool os_supports_avx_states() {
+  constexpr unsigned long long kMask = (1ull << 1) | (1ull << 2);
+  return os_supports_xcr_states(kMask);
+}
+
+inline bool os_supports_avx512_states() {
+  constexpr unsigned long long kMask =
+      (1ull << 1) | (1ull << 2) | (1ull << 5) | (1ull << 6) | (1ull << 7);
+  return os_supports_xcr_states(kMask);
+}
+
+inline bool cpu_reports_avx() {
+  const auto leaf1 = read_cpuid(1, 0);
+  return (leaf1.ecx & (1u << 28)) != 0;
+}
+
+inline bool cpu_reports_avx2() {
+  const auto leaf7 = read_cpuid(7, 0);
+  return (leaf7.ebx & (1u << 5)) != 0;
+}
+
+inline bool cpu_reports_avx512f() {
+  const auto leaf7 = read_cpuid(7, 0);
+  return (leaf7.ebx & (1u << 16)) != 0;
+}
+
+inline bool has_runtime_avx2() {
+  if (!os_supports_avx_states() || !cpu_reports_avx()) {
+    return false;
+  }
+  return cpu_reports_avx2();
+}
+
+inline bool has_runtime_avx512f() {
+  if (!os_supports_avx512_states()) {
     return false;
+  }
+  return cpu_reports_avx512f();
+}
+#else
+inline bool has_runtime_avx2() {
+  return false;
+}
+
+inline bool has_runtime_avx512f() {
+  return false;
+}
+#endif
+
+} // namespace
+
+inline bool cpu_supports_avx2() noexcept {
+#if defined(__has_builtin)
+#if __has_builtin(__builtin_cpu_supports)
+#if defined(__AVX2__)
+  if (__builtin_cpu_supports("avx2")) {
+    return true;
+  }
+#endif
+#endif
 #endif
+  return has_runtime_avx2();
 }
 
 inline bool cpu_supports_avx512f() noexcept {
-#if defined(__x86_64__) || defined(_M_X64) || defined(__i386) || defined(_M_IX86)
 #if defined(__has_builtin)
 #if __has_builtin(__builtin_cpu_supports)
-  return __builtin_cpu_supports("avx512f");
-#else
-  return false;
+#if defined(__AVX512F__)
+  if (__builtin_cpu_supports("avx512f")) {
+    return true;
+  }
 #endif
-#else
-  return false;
 #endif
-#else
-  return false;
 #endif
+  return has_runtime_avx512f();
 }
 
 inline bool cpu_supports_neon() noexcept {
-#if defined(__ARM_NEON) || defined(__ARM_NEON__)
-#if defined(T81_ENABLE_NEON)
-  return true;
-#else
+#if defined(T81_DISABLE_NEON)
   return false;
-#endif
+#elif defined(__ARM_NEON) || defined(__ARM_NEON__) || defined(T81_ENABLE_NEON)
+  return true;
 #else
   return false;
 #endif
 }
 
+// Returns true when SIMD addition completes without an overflow carry.
 bool add_trytes_avx2(const limb &, const limb &, limb &);
 bool add_trytes_avx512(const limb &, const limb &, limb &);
 bool add_trytes_neon(const limb &, const limb &, limb &);
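The XCR0 masks used by `os_supports_avx_states()` and `os_supports_avx512_states()` are easier to audit when spelled out bit by bit. This fragment reproduces the same constants; the bit names follow the Intel SDM's XSAVE state components:

```python
# XCR0 state-component bits checked by the header above.
XMM       = 1 << 1   # SSE register state
YMM       = 1 << 2   # AVX upper halves
OPMASK    = 1 << 5   # AVX-512 k-registers
ZMM_HI256 = 1 << 6   # upper 256 bits of ZMM0-ZMM15
HI16_ZMM  = 1 << 7   # ZMM16-ZMM31

AVX_MASK    = XMM | YMM
AVX512_MASK = XMM | YMM | OPMASK | ZMM_HI256 | HI16_ZMM

def os_supports(xcr0, mask):
    # Mirrors os_supports_xcr_states(): every required state bit must be set.
    return (xcr0 & mask) == mask

print(hex(AVX_MASK), hex(AVX512_MASK))  # 0x6 0xe6
print(os_supports(0x7, AVX_MASK))       # True
print(os_supports(0x7, AVX512_MASK))    # False
```

If the OS only enables x87/SSE/AVX state (XCR0 = 0x7), the AVX check passes while the AVX-512 check correctly fails, which is exactly why the header gates `has_runtime_avx512f()` on the wider mask.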

include/t81/linalg/gemm.hpp

Lines changed: 66 additions & 70 deletions

@@ -9,6 +9,7 @@
 #include <stdexcept>
 
 #include <t81/core/limb.hpp>
+#include <t81/linalg/gemm_gpu.hpp>
 
 namespace t81::linalg {
 
@@ -62,87 +63,82 @@ namespace t81::linalg {
     return low_value + high_value * radix;
   }
 
-} // namespace detail
-
-inline void gemm_ternary(std::span<const core::limb> A,
-                         std::span<const core::limb> B,
-                         std::span<float> C,
-                         int M,
-                         int N,
-                         int K,
-                         float alpha,
-                         float beta) {
-  if (M < 0 || N < 0 || K < 0) {
-    throw std::invalid_argument("gemm_ternary dimensions must be non-negative");
-  }
-  if (K % core::limb::TRITS != 0) {
-    throw std::invalid_argument("gemm_ternary requires K divisible by 48");
-  }
-  const int K_limbs = K / core::limb::TRITS;
-  if (static_cast<std::size_t>(M) * static_cast<std::size_t>(K_limbs) != A.size()) {
-    throw std::invalid_argument("A span size does not match (M, K / 48)");
-  }
-  if (static_cast<std::size_t>(K_limbs) * static_cast<std::size_t>(N) != B.size()) {
-    throw std::invalid_argument("B span size does not match (K / 48, N)");
-  }
-  if (static_cast<std::size_t>(M) * static_cast<std::size_t>(N) != C.size()) {
-    throw std::invalid_argument("C span size does not match (M, N)");
-  }
-
-  if (M == 0 || N == 0) {
-    return;
-  }
-
-  constexpr int BlockM = 8;
-  constexpr int BlockN = 8;
-  constexpr int BlockK = 4;
-  const std::size_t N_size = static_cast<std::size_t>(N);
-  const auto *const a_data = A.data();
-  const auto *const b_data = B.data();
-  auto *const c_data = C.data();
-
-  for (int ib = 0; ib < M; ib += BlockM) {
-    const int i_end = std::min(M, ib + BlockM);
-    for (int jb = 0; jb < N; jb += BlockN) {
-      const int j_end = std::min(N, jb + BlockN);
-      std::array<std::array<double, BlockN>, BlockM> accum{};
-      for (int i = ib; i < i_end; ++i) {
-        const std::size_t row = static_cast<std::size_t>(i) * N_size;
-        for (int j = jb; j < j_end; ++j) {
-          const float existing = c_data[row + static_cast<std::size_t>(j)];
-          accum[i - ib][j - jb] = static_cast<double>(existing) * beta;
-        }
-      }
-
-      for (int kb = 0; kb < K_limbs; kb += BlockK) {
-        const int k_end = std::min(K_limbs, kb + BlockK);
-        for (int k = kb; k < k_end; ++k) {
-          const std::size_t b_row = static_cast<std::size_t>(k) * N_size;
-          for (int j = jb; j < j_end; ++j) {
-            const core::limb b_value = b_data[b_row + static_cast<std::size_t>(j)];
-            detail::prefetch_read(b_data + b_row + static_cast<std::size_t>(j) + 1);
-            for (int i = ib; i < i_end; ++i) {
-              const std::size_t a_index = static_cast<std::size_t>(i) *
-                                          static_cast<std::size_t>(K_limbs) +
-                                          static_cast<std::size_t>(k);
-              const core::limb a_value = a_data[a_index];
-              const double product = detail::multiply_to_double(a_value, b_value);
-              accum[i - ib][j - jb] += product * static_cast<double>(alpha);
-              detail::prefetch_read(a_data + a_index + 1);
-            }
-          }
-        }
-      }
-
-      for (int i = ib; i < i_end; ++i) {
-        const std::size_t row = static_cast<std::size_t>(i) * N_size;
-        for (int j = jb; j < j_end; ++j) {
-          c_data[row + static_cast<std::size_t>(j)] =
-              static_cast<float>(accum[i - ib][j - jb]);
-        }
-      }
-    }
-  }
-}
+inline void gemm_ternary_cpu_impl(std::span<const core::limb> A,
+                                  std::span<const core::limb> B,
+                                  std::span<float> C,
+                                  int M,
+                                  int N,
+                                  int K,
+                                  int K_limbs,
+                                  float alpha,
+                                  float beta) {
+  if (M == 0 || N == 0) {
+    return;
+  }
+
+  constexpr int BlockM = 8;
+  constexpr int BlockN = 8;
+  constexpr int BlockK = 4;
+  const std::size_t N_size = static_cast<std::size_t>(N);
+  const auto *const a_data = A.data();
+  const auto *const b_data = B.data();
+  auto *const c_data = C.data();
+
+  for (int ib = 0; ib < M; ib += BlockM) {
+    const int i_end = std::min(M, ib + BlockM);
+    for (int jb = 0; jb < N; jb += BlockN) {
+      const int j_end = std::min(N, jb + BlockN);
+      std::array<std::array<double, BlockN>, BlockM> accum{};
+      for (int i = ib; i < i_end; ++i) {
+        const std::size_t row = static_cast<std::size_t>(i) * N_size;
+        for (int j = jb; j < j_end; ++j) {
+          const float existing = c_data[row + static_cast<std::size_t>(j)];
+          accum[i - ib][j - jb] = static_cast<double>(existing) * beta;
+        }
+      }
+
+      for (int kb = 0; kb < K_limbs; kb += BlockK) {
+        const int k_end = std::min(K_limbs, kb + BlockK);
+        for (int k = kb; k < k_end; ++k) {
+          const std::size_t b_row = static_cast<std::size_t>(k) * N_size;
+          for (int j = jb; j < j_end; ++j) {
+            const core::limb b_value = b_data[b_row + static_cast<std::size_t>(j)];
+            detail::prefetch_read(b_data + b_row + static_cast<std::size_t>(j) + 1);
+            for (int i = ib; i < i_end; ++i) {
+              const std::size_t a_index = static_cast<std::size_t>(i) *
+                                          static_cast<std::size_t>(K_limbs) +
+                                          static_cast<std::size_t>(k);
+              const core::limb a_value = a_data[a_index];
+              const double product = detail::multiply_to_double(a_value, b_value);
+              accum[i - ib][j - jb] += product * static_cast<double>(alpha);
+              detail::prefetch_read(a_data + a_index + 1);
+            }
+          }
+        }
+      }
+
+      for (int i = ib; i < i_end; ++i) {
+        const std::size_t row = static_cast<std::size_t>(i) * N_size;
+        for (int j = jb; j < j_end; ++j) {
+          c_data[row + static_cast<std::size_t>(j)] =
+              static_cast<float>(accum[i - ib][j - jb]);
+        }
+      }
+    }
+  }
+}
+
+} // namespace detail
+
+inline void gemm_ternary(std::span<const core::limb> A,
+                         std::span<const core::limb> B,
+                         std::span<float> C,
+                         int M,
+                         int N,
+                         int K,
+                         float alpha,
+                         float beta) {
+  detail::gemm_ternary_dispatch(A, B, C, M, N, K, alpha, beta);
+}
 
 } // namespace t81::linalg
