Official PyTorch implementation of "GuidedQuant: Large Language Model Quantization via Exploiting End Loss Guidance" (ICML 2025)
Updated Apr 13, 2026 - Python
Open quantization tooling for TurboQuant-style low-bit LLM releases, stock GGUF deployment, and Apple Silicon runtime experiments.
[CAAI AIR'24] Minimize Quantization Output Error with Bias Compensation
A high-performance, memory-efficient healthcare framework that deploys fine-tuned Large Language Models (LLMs) on edge devices. Multi-agent system to provide personalized diagnostic reasoning, health education, and dietary planning.
Deeper research into TurboQuant algorithms
Ternary Quantization for LLMs: Implement balanced ternary (T3_K) weights for 2.63-bit quantization—the first working solution for modern large language models.
Let me make GGUF files quickly
Production-grade LLM quantization, benchmarking, and edge deployment toolkit. Supports bitsandbytes INT8/INT4, GPTQ (Hessian calibration), AWQ (activation-aware), and GGUF (Q2_K–Q8_0). Four-dimensional benchmarking: perplexity, TPS/TTFT, VRAM profiling, and LLM-as-Judge quality scoring. RTX 5090 Blackwell sm_120 ready.
Paired capability-level GGUF quantization fragility benchmark across Qwen2.5-3B and SmolLM2 1.7B.
LLM quantization project built around `llama.cpp` + `Ollama` + `GGUF`
Local & lightweight LLM inference runtime in C++ with support for GGUF & quantization
GWIQ-Atlas is a brain-atlasing and model-interpretability suite that combines per-layer census, compliance behaviour tracing, SAE features, and quantization analyses for LLMs.
Shift-based post-training quantization analysis for LLMs (ShiftQuant paper)
Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery
OpenVINO Model Manager — desktop GUI for Intel Arc
A high-performance inference engine optimized for deploying quantized LLMs on edge devices. Focuses on SIMD optimizations and memory management.
PentaNet extends BitNet's ternary quantization to pentanary {-2,-1,0,+1,+2}, improving perplexity by 6.4% at 124M params while preserving zero-multiplier arithmetic.
Implemented and fine-tuned BERT for a custom sequence classification task, leveraging LoRA adapters for efficient parameter updates and 4-bit quantization to optimize performance and resource utilization.