Releases · snowflakedb/ArcticInference
v0.1.0 Release
What's Changed
- Simplify min_score selection logic, correct type hint for `propose_suffix_draft_token_ids` by @CptTZ in #195
- Add op_builder for jitting the kernels by @sfc-gh-reyazda in #193
- Update links in README for Shift Parallelism by @sfc-gh-mhidayetoglu in #196
- bump to v0.0.10 by @sfc-gh-jrasley in #194
- Move SwiftKV ops to JIT-build by @sfc-gh-yewang in #198
- Add @sfc-gh-reyazda as code owner by @sfc-gh-yewang in #199
- Explicitly initialize CUDA buffers for next tokens by @sfc-gh-yewang in #201
- Port suffix decoding to nanobind by @sfc-gh-aqiao in #206
- upgrade to vllm 0.10.1 by @sfc-gh-yewang in #162
- Suffix decoding: break out of speculate loop early by @sfc-gh-aqiao in #207
- Suffix decoding speculation optimization by @sfc-gh-aqiao in #211
- reshape_and_cache_flash fp4 kernel by @sfc-gh-yewang in #210
- More suffix decoding optimizations by @sfc-gh-aqiao in #212
- remove ulysses moe patch by @sfc-gh-mhidayetoglu in #213
- Bump version from 0.0.10 to 0.1.0 by @sfc-gh-jrasley in #214
New Contributors
- @sfc-gh-reyazda made their first contribution in #193
Full Changelog: v0.0.9...v0.1.0
v0.0.9 Release
What's Changed
- bump v0.0.9 by @sfc-gh-mwyatt in #129
- Spec decoding minor changes by @sfc-gh-yewang in #128
- Add json_mode benchmark by @sfc-gh-yewang in #121
- Enable MoE models (Qwen & Llama) by @sfc-gh-mhidayetoglu in #126
- include citation by @sfc-gh-mhidayetoglu in #139
- fix minor bug by @sfc-gh-mhidayetoglu in #142
- fix bug in create_engine_config by @shixianc in #144
- Small improvements in benchmark by @sfc-gh-yewang in #140
- Enable minimal build by @sfc-gh-yewang in #138
- feat: FlashInfer backend support for SwiftKV by @therealnaveenkamal in #124
- Upgrade to vLLM 0.9.2 by @sfc-gh-aqiao in #132
- fix spec decoding over model len limit by @sfc-gh-yewang in #149
- Fix Suffix Decoding max_model_len overflow by @sfc-gh-yewang in #153
- Accelerate benchmarks by saturating GPUs by @sfc-gh-yewang in #154
- Add gpt-oss blog to README by @sfc-gh-yewang in #167
- Union type hint fix by @CptTZ in #172
- Implement sequence eviction for Suffix Decoding (Part 1/2) by @sfc-gh-aqiao in #166
- Add @sfc-gh-goliaro as code owner by @sfc-gh-jrasley in #177
- Implement sequence eviction for Suffix Decoding (Part 2/2) by @sfc-gh-aqiao in #176
- Reorganize and refactor Suffix Decoding by @sfc-gh-aqiao in #182
- Add environment variable to skip version check by @sfc-gh-aqiao in #186
- Enable SwiftKV when FlashInfer is not available by @sfc-gh-pjoziak in #187
- Fix hybrid mode (spec decoding + suffix) crash on structured_output by @sfc-gh-yewang in #169
- Make Arctic Inference plugin opt-in instead of opt-out by @sfc-gh-aqiao in #188
New Contributors
- @shixianc made their first contribution in #144
- @therealnaveenkamal made their first contribution in #124
- @CptTZ made their first contribution in #172
- @sfc-gh-pjoziak made their first contribution in #187
Full Changelog: v0.0.8...v0.0.9
v0.0.8
What's Changed
- better error message if spec model type isn't supported by @sfc-gh-yewang in #78
- Update README.md by @sfc-gh-aqiao in #73
- Add RTD by @sfc-gh-mwyatt in #59
- Update README.md by @sfc-gh-aqiao in #79
- Add readme header by @sfc-gh-jrasley in #85
- Fix editable install failure by @sfc-gh-mwyatt in #87
- Remove the strict spec model check by @sfc-gh-yewang in #90
- skip attention when capturing CUDA graph. by @sfc-gh-mhidayetoglu in #94
- Update Docs Structure and Speculative Decoding Pages by @sfc-gh-aqiao in #100
- add base_model_arch in spec config by @sfc-gh-yewang in #97
- Update and Fixes for documentation by @sfc-gh-aqiao in #101
- Add Docs CI by @sfc-gh-mwyatt in #86
- Upgrade to vLLM v0.9.0.1 by @sfc-gh-aqiao in #89
- ignore generated pb files by @sfc-gh-yewang in #104
- capture only the required shapes of the SP_TP model by @sfc-gh-mhidayetoglu in #106
- Base model architecture check for Speculators by @sfc-gh-yewang in #105
- minor fix by @sfc-gh-yewang in #107
- SwiftKV improvements by @sfc-gh-aqiao in #110
- Add end-to-end benchmark tests by @sfc-gh-aqiao in #112
- Use correct TP_GROUP and pack data before allGather by @sfc-gh-yewang in #114
- Update/Fix Benchmark tests by @sfc-gh-aqiao in #116
- Small fixes to spec decoding by @sfc-gh-yewang in #117
- compiler patch for fixing issue 72 by @sfc-gh-mhidayetoglu in #118
- fix minor bug by @sfc-gh-mhidayetoglu in #119
- Graph Capture Ulysses by @sfc-gh-mhidayetoglu in #120
- Fix broken wheel build by @sfc-gh-jrasley in #123
Full Changelog: v0.0.7...v0.0.8
v0.0.7 Release
Highlights
- Shift Parallelism release
- Arctic Inference with Shift Parallelism: The Fastest Open Source Inference System for Enterprise AI
What's Changed
- bump to v0.0.7 by @sfc-gh-jrasley in #35
- update install instructions by @sfc-gh-jrasley in #36
- Fix ParallelLMHead quantization by @Xiuyu-Li in #39
- Reclaim `speculator_config` by `vllm_config` before initializing the drafter by @dtransposed in #48
- Add seed and disable_by_batch_size in spec example by @sfc-gh-yewang in #44
- force set seed if not explicitly specified by @sfc-gh-yewang in #45
- Fix crash in eager mode by @sfc-gh-yewang in #56
- Warning message and disabling the plugin when vLLM V0 enabled. by @dtransposed in #57
- Spec model's TP on SP allocated GPUs by @sfc-gh-yewang in #58
- Fix suffix decoding over model len limit by @sfc-gh-aqiao in #61
- Integrate Dynasor to Arctic Inference by @GindaChen in #31
- Introduce Shift Parallelism and integrate it with SwiftKV and Spec. Dec. by @sfc-gh-mhidayetoglu in #66
- Add embedding optimizations by @sfc-gh-juyang in #62
- Fix V1 check and non-shift mode by @sfc-gh-aqiao in #71
- updating embedding readme by @sfc-gh-juyang in #69
- update readme by @sfc-gh-yewang in #68
- Fix for wheel build by @sfc-gh-mwyatt in #74
New Contributors
- @Xiuyu-Li made their first contribution in #39
- @dtransposed made their first contribution in #48
- @GindaChen made their first contribution in #31
- @sfc-gh-mwyatt made their first contribution in #74
Full Changelog: v0.0.6...v0.0.7
v0.0.6
What's Changed
- bump to v0.0.5 by @sfc-gh-jrasley in #17
- remove dependencies to Llama and Qwen by @sfc-gh-mhidayetoglu in #18
- Update CODEOWNERS by @sfc-gh-aqiao in #20
- New encapsulated patching method by @sfc-gh-aqiao in #19
- Add runtime check for installed vllm version by @sfc-gh-aqiao in #21
- Allow dev version of vllm to pass version check by @sfc-gh-aqiao in #22
- Add code for Suffix Decoding by @sfc-gh-aqiao in #23
- upgrade vllm to 0.8.4 and bump version by @sfc-gh-aqiao in #24
- Suffix Decoding + ArcticSpeculator by @sfc-gh-yewang in #25
- Add example spec model link by @sfc-gh-yewang in #26
- Update README.md by @sfc-gh-yewang in #27
- Update README.md by @sfc-gh-aqiao in #28
- Add pybind11 dependencies by @sfc-gh-jrasley in #33
New Contributors
- @sfc-gh-yewang made their first contribution in #25
Full Changelog: v0.0.4...v0.0.6
v0.0.4
What's Changed
- bump to v0.0.3 by @sfc-gh-jrasley in #13
- update readme by @sfc-gh-jrasley in #15
- Ulysses bug fix by @sfc-gh-jrasley in #14
- readme edits by @sfc-gh-jrasley in #16
Full Changelog: v0.0.2...v0.0.4
v0.0.2
What's Changed
- release script by @sfc-gh-jrasley in #9
- bump to v0.0.2 by @sfc-gh-jrasley in #10
- Arctic Ulysses by @sfc-gh-mhidayetoglu in #11
- Projects readme restructure by @sfc-gh-jrasley in #12
New Contributors
- @sfc-gh-mhidayetoglu made their first contribution in #11
Full Changelog: v0.0.1...v0.0.2
v0.0.1
Initial release of ArcticInference!
What's Changed
- Add Llama-SwiftKV by @sfc-gh-aqiao in #1
- add license by @sfc-gh-jrasley in #2
- Create CODEOWNERS by @sfc-gh-jrasley in #3
- add short description to readme by @sfc-gh-aqiao in #4
- Create repo_meta.yaml by @sfc-gh-jrasley in #5
- Create semgroup.yml by @sfc-gh-jrasley in #6
- Add license headers by @sfc-gh-aqiao in #7
- add badges by @sfc-gh-jrasley in #8
New Contributors
- @sfc-gh-aqiao made their first contribution in #1
- @sfc-gh-jrasley made their first contribution in #2
Full Changelog: https://github.com/snowflakedb/ArcticInference/commits/v0.0.1