Releases · snowflakedb/ArcticInference
v0.1.0 Release
What's Changed
- Simplify min_score selection logic, correct type hint for `propose_suffix_draft_token_ids` by @CptTZ in #195
- Add op_builder for jitting the kernels by @sfc-gh-reyazda in #193
- Update links in README for Shift Parallelism by @sfc-gh-mhidayetoglu in #196
- bump to v0.0.10 by @sfc-gh-jrasley in #194
- Move SwiftKV ops to JIT-build by @sfc-gh-yewang in #198
- Add @sfc-gh-reyazda as code owner by @sfc-gh-yewang in #199
- Explicitly initialize CUDA buffers for next tokens by @sfc-gh-yewang in #201
- Port suffix decoding to nanobind by @sfc-gh-aqiao in #206
- upgrade to vllm 0.10.1 by @sfc-gh-yewang in #162
- Suffix decoding: break out of speculate loop early by @sfc-gh-aqiao in #207
- Suffix decoding speculation optimization by @sfc-gh-aqiao in #211
- reshape_and_cache_flash fp4 kernel by @sfc-gh-yewang in #210
- More suffix decoding optimizations by @sfc-gh-aqiao in #212
- remove ulysses moe patch by @sfc-gh-mhidayetoglu in #213
- Bump version from 0.0.10 to 0.1.0 by @sfc-gh-jrasley in #214
New Contributors
- @sfc-gh-reyazda made their first contribution in #193
Full Changelog: v0.0.9...v0.1.0
v0.0.9 Release
What's Changed
- bump v0.0.9 by @sfc-gh-mwyatt in #129
- Spec decoding minor changes by @sfc-gh-yewang in #128
- Add json_mode benchmark by @sfc-gh-yewang in #121
- Enable MoE models (Qwen & Llama) by @sfc-gh-mhidayetoglu in #126
- include citation by @sfc-gh-mhidayetoglu in #139
- fix minor bug by @sfc-gh-mhidayetoglu in #142
- fix bug in create_engine_config by @shixianc in #144
- Small improvements in benchmark by @sfc-gh-yewang in #140
- Enable minimal build by @sfc-gh-yewang in #138
- feat: FlashInfer backend support for SwiftKV by @therealnaveenkamal in #124
- Upgrade to vLLM 0.9.2 by @sfc-gh-aqiao in #132
- fix spec decoding over model len limit by @sfc-gh-yewang in #149
- Fix Suffix Decoding max_model_len overflow by @sfc-gh-yewang in #153
- Accelerate benchmarks by saturating GPUs by @sfc-gh-yewang in #154
- Add gpt-oss blog to README by @sfc-gh-yewang in #167
- Union type hint fix by @CptTZ in #172
- Implement sequence eviction for Suffix Decoding (Part 1/2) by @sfc-gh-aqiao in #166
- Add @sfc-gh-goliaro as code owner by @sfc-gh-jrasley in #177
- Implement sequence eviction for Suffix Decoding (Part 2/2) by @sfc-gh-aqiao in #176
- Reorganize and refactor Suffix Decoding by @sfc-gh-aqiao in #182
- Add environment variable to skip version check by @sfc-gh-aqiao in #186
- Enable SwiftKV when FlashInfer is not available by @sfc-gh-pjoziak in #187
- Fix hybrid mode (spec decoding + suffix) crash on structured_output by @sfc-gh-yewang in #169
- Make Arctic Inference plugin opt-in instead of opt-out by @sfc-gh-aqiao in #188
New Contributors
- @shixianc made their first contribution in #144
- @therealnaveenkamal made their first contribution in #124
- @CptTZ made their first contribution in #172
- @sfc-gh-pjoziak made their first contribution in #187
Full Changelog: v0.0.8...v0.0.9
v0.0.8
What's Changed
- better error message if spec model type isn't supported by @sfc-gh-yewang in #78
- Update README.md by @sfc-gh-aqiao in #73
- Add RTD by @sfc-gh-mwyatt in #59
- Update README.md by @sfc-gh-aqiao in #79
- Add readme header by @sfc-gh-jrasley in #85
- Fix editable install failure by @sfc-gh-mwyatt in #87
- Remove the strict spec model check by @sfc-gh-yewang in #90
- skip attention when capturing CUDA graph. by @sfc-gh-mhidayetoglu in #94
- Update Docs Structure and Speculative Decoding Pages by @sfc-gh-aqiao in #100
- add base_model_arch in spec config by @sfc-gh-yewang in #97
- Update and Fixes for documentation by @sfc-gh-aqiao in #101
- Add Docs CI by @sfc-gh-mwyatt in #86
- Upgrade to vLLM v0.9.0.1 by @sfc-gh-aqiao in #89
- ignore generated pb files by @sfc-gh-yewang in #104
- capture only the required shapes of the SP_TP model by @sfc-gh-mhidayetoglu in #106
- Base model architecture check for Speculators by @sfc-gh-yewang in #105
- minor fix by @sfc-gh-yewang in #107
- SwiftKV improvements by @sfc-gh-aqiao in #110
- Add end-to-end benchmark tests by @sfc-gh-aqiao in #112
- Use correct TP_GROUP and pack data before allGather by @sfc-gh-yewang in #114
- Update/Fix Benchmark tests by @sfc-gh-aqiao in #116
- Small fixes to spec decoding by @sfc-gh-yewang in #117
- compiler patch for fixing issue 72 by @sfc-gh-mhidayetoglu in #118
- fix minor bug by @sfc-gh-mhidayetoglu in #119
- Graph Capture Ulysses by @sfc-gh-mhidayetoglu in #120
- Fix broken wheel build by @sfc-gh-jrasley in #123
Full Changelog: v0.0.7...v0.0.8
v0.0.7 Release
Highlights
- Shift Parallelism release
- Arctic Inference with Shift Parallelism: The Fastest Open Source Inference System for Enterprise AI
What's Changed
- bump to v0.0.7 by @sfc-gh-jrasley in #35
- update install instructions by @sfc-gh-jrasley in #36
- Fix ParallelLMHead quantization by @Xiuyu-Li in #39
- Reclaim `speculator_config` by `vllm_config` before initializing the drafter by @dtransposed in #48
- Add seed and disable_by_batch_size in spec example by @sfc-gh-yewang in #44
- force set seed if not explicitly specified by @sfc-gh-yewang in #45
- Fix crash in eager mode by @sfc-gh-yewang in #56
- Warning message and disabling the plugin when vLLM V0 enabled. by @dtransposed in #57
- Spec model's TP on SP allocated GPUs by @sfc-gh-yewang in #58
- Fix suffix decoding over model len limit by @sfc-gh-aqiao in #61
- Integrate Dynasor to Arctic Inference by @GindaChen in #31
- Introduce Shift Parallelism and integrate it with SwiftKV and Spec. Dec. by @sfc-gh-mhidayetoglu in #66
- Add embedding optimizations by @sfc-gh-juyang in #62
- Fix V1 check and non-shift mode by @sfc-gh-aqiao in #71
- updating embedding readme by @sfc-gh-juyang in #69
- update readme by @sfc-gh-yewang in #68
- Fix for wheel build by @sfc-gh-mwyatt in #74
New Contributors
- @Xiuyu-Li made their first contribution in #39
- @dtransposed made their first contribution in #48
- @GindaChen made their first contribution in #31
- @sfc-gh-mwyatt made their first contribution in #74
Full Changelog: v0.0.6...v0.0.7
v0.0.6
What's Changed
- bump to v0.0.5 by @sfc-gh-jrasley in #17
- remove dependencies to Llama and Qwen by @sfc-gh-mhidayetoglu in #18
- Update CODEOWNERS by @sfc-gh-aqiao in #20
- New encapsulated patching method by @sfc-gh-aqiao in #19
- Add runtime check for installed vllm version by @sfc-gh-aqiao in #21
- Allow dev version of vllm to pass version check by @sfc-gh-aqiao in #22
- Add code for Suffix Decoding by @sfc-gh-aqiao in #23
- upgrade vllm to 0.8.4 and bump version by @sfc-gh-aqiao in #24
- Suffix Decoding + ArcticSpeculator by @sfc-gh-yewang in #25
- Add example spec model link by @sfc-gh-yewang in #26
- Update README.md by @sfc-gh-yewang in #27
- Update README.md by @sfc-gh-aqiao in #28
- Add pybind11 dependencies by @sfc-gh-jrasley in #33
New Contributors
- @sfc-gh-yewang made their first contribution in #25
Full Changelog: v0.0.4...v0.0.6
v0.0.4
What's Changed
- bump to v0.0.3 by @sfc-gh-jrasley in #13
- update readme by @sfc-gh-jrasley in #15
- Ulysses bug fix by @sfc-gh-jrasley in #14
- readme edits by @sfc-gh-jrasley in #16
Full Changelog: v0.0.2...v0.0.4
v0.0.2
What's Changed
- release script by @sfc-gh-jrasley in #9
- bump to v0.0.2 by @sfc-gh-jrasley in #10
- Arctic Ulysses by @sfc-gh-mhidayetoglu in #11
- Projects readme restructure by @sfc-gh-jrasley in #12
New Contributors
- @sfc-gh-mhidayetoglu made their first contribution in #11
Full Changelog: v0.0.1...v0.0.2
v0.0.1
Initial release of ArcticInference!
What's Changed
- Add Llama-SwiftKV by @sfc-gh-aqiao in #1
- add license by @sfc-gh-jrasley in #2
- Create CODEOWNERS by @sfc-gh-jrasley in #3
- add short description to readme by @sfc-gh-aqiao in #4
- Create repo_meta.yaml by @sfc-gh-jrasley in #5
- Create semgroup.yml by @sfc-gh-jrasley in #6
- Add license headers by @sfc-gh-aqiao in #7
- add badges by @sfc-gh-jrasley in #8
New Contributors
- @sfc-gh-aqiao made their first contribution in #1
- @sfc-gh-jrasley made their first contribution in #2
Full Changelog: https://github.com/snowflakedb/ArcticInference/commits/v0.0.1