#ampere

Here are 28 public repositories matching this topic...

Production-grade runtime patches for vLLM (45+ patches) — Qwen3.6-35B-A3B-FP8 hybrid GDN+MoE on NVIDIA Ampere (SM 80-86). 127 tok/s MTP free-form, 99 tok/s suffix tool-call (max 175). TurboQuant k8v4 KV cache, 256K context verified to 252K. P67 multi-query kernel + Suffix Decoding + adaptive ngram K. Zero source modifications.

  • Updated Apr 26, 2026
  • Python

First public benchmark of llama.cpp speculative decoding on Qwen3.6-35B-A3B with a single RTX 3090 (post PR #19493 merge, 2026-04-19). 19 configurations covering ngram-cache, ngram-mod, and classic draft with vocab-matched Qwen3.5-0.8B. Finding: no variant achieves net speedup on Ampere + A3B MoE. Raw JSON, plots, full reproducibility.

  • Updated Apr 26, 2026
  • Python
