Releases: NVIDIA/TileGym

v1.2.0

22 Apr 05:43
e0c1602


What's Changed

  • Add PyPI install instructions to README by @hannahli-nv in #96
  • Integrate Qwen3.5 with TileGym cuTile Kernels — 2.68x Speedup & other updates by @hannahli-nv in #92
  • cleanup: Remove dead mask variable and make bounds checking explicit in GELU kernel by @hannahli-nv in #99
  • docs: Update ROADMAP.md statuses & Add more unsloth kernels by @hannahli-nv in #100
  • Update TileGym Julia kernels to cuTile 0.2 by @maleadt in #102
  • Update translated READMEs to match latest English README & other updates by @hannahli-nv in #103
  • perf: gemma_attention CuTile — use approx tanh (rounding_mode=APPROX) for soft cap & other updates by @hannahli-nv in #105
  • Integrate TileGym Kernels for allenai/Olmo-3-1025-7B & [skill] Add cutile auto research by @hannahli-nv in #106
  • fix(unsloth): fix 6 correctness and performance issues in CuTile RoPE kernels & other updates by @hannahli-nv in #108
  • [skill] fix test func for perf improvement skills & perf(sm80): tune cuTile kernels for A100 with README updated by @hannahli-nv in #110
  • fix(ci): render all benchmark columns in summary, not just allowlisted backends by @hannahli-nv in #109
  • Bump version from 1.1.0 to 1.2.0 by @hannahli-nv in #112
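
Among the changes above, #105 switches gemma_attention's logit soft cap from an exact tanh to a hardware-approximate one (rounding_mode=APPROX). The soft-cap math itself is simple; the following is a minimal NumPy sketch of the operation, not TileGym's actual kernel code:

```python
import numpy as np

def soft_cap(logits, cap=50.0):
    # Soft-capping smoothly bounds attention logits to (-cap, cap):
    # cap * tanh(logits / cap). For |logits| << cap it is near-identity,
    # which is why a fast approximate tanh is an attractive swap.
    return cap * np.tanh(logits / cap)

x = np.array([-1e4, 0.0, 3.0, 1e4])
y = soft_cap(x)  # extreme values squashed into (-50, 50), small ones barely changed
```

The cap value of 50.0 here is illustrative; the actual cap is a model hyperparameter.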

Full Changelog: v1.1.0...v1.2.0

v1.1.0

03 Apr 10:27
c5f4d54


What's Changed

  • Ignore vuln in pygments by @xjmxyt in #86
  • Fix random mhc test and benchmark failures & Add unsloth geglu and grouped_gemm kernels by @hannahli-nv in #84
  • bench: replace unfused reference_rms_norm with F.rms_norm as PyTorch baseline by @xjmxyt in #89
  • Add tileiras optional dependency for bundled compiler support by @hannahli-nv in #90
  • Add sparse MLA forward op in experimental by @Weili-0234 in #91
  • Bump CUDA base image from 13.1.0 to 13.2.0 by @hannahli-nv in #85
  • Bump version from 1.0.1 to 1.1.0 by @hannahli-nv in #94
  • Remove cuda-tile-experimental URL dep for PyPI compatibility by @hannahli-nv in #95
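
For context on #89: the benchmark baseline moved from an unfused reference_rms_norm to PyTorch's fused F.rms_norm, but both compute the same quantity. A NumPy sketch of that math (illustrative, not the benchmark's code):

```python
import numpy as np

def rms_norm(x, weight, eps=1e-6):
    # RMSNorm: scale each row by the reciprocal of its root-mean-square,
    # then apply a learned per-feature weight. A fused kernel does the
    # reduction and the scaling in a single pass over the data.
    rms = np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)
    return (x / rms) * weight

x = np.random.randn(4, 8).astype(np.float32)
w = np.ones(8, dtype=np.float32)
y = rms_norm(x, w)
```

With unit weights, each output row has mean-square ≈ 1, which is a quick sanity check for any fused implementation.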

Full Changelog: v1.0.1...v1.1.0

v1.0.1

23 Mar 08:20
2e7eb51


Pre-release

What's Changed

  • Fix checkout issue in release tag workflow by @arjkesh in #73
  • Fix permissions error in release workflow by @arjkesh in #74
  • Bump some python dependencies to remediate vulnerabilities by @hannahli-nv in #76
  • Add TileGym repository URL and fix directory name in install instructions by @brycelelbach in #78
  • Guard CuTile import in test_swiglu.py & Update experimental kernels and tests & other updates by @hannahli-nv in #80
  • feat: swiglu simple math changes for perf upgrades by @aghilann in #77
  • Add recurrent_gated_delta_rule and chunk_gated_delta_rule ops to TileGym & other updates by @hannahli-nv in #81
  • Update tilegym tag from 1.0.0 to 1.0.1 by @hannahli-nv in #83
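
PR #77 above reworks the SwiGLU forward math for performance. As a reminder of the operation being optimized, here is a plain NumPy sketch (TileGym's kernels fuse this elementwise on-GPU; the function names are illustrative):

```python
import numpy as np

def silu(x):
    # SiLU / swish activation: x * sigmoid(x), written as x / (1 + e^-x).
    return x / (1.0 + np.exp(-x))

def swiglu(gate, up):
    # SwiGLU gating as used in LLaMA-style MLPs: silu(gate) * up.
    # Fusing the activation and the multiply avoids a round trip to memory.
    return silu(gate) * up

g = np.array([0.0, 1.0, -1.0])
u = np.array([2.0, 2.0, 2.0])
out = swiglu(g, u)
```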

Full Changelog: v1.0.0...v1.0.1

v1.0.0

11 Mar 00:21


Pre-release

What's Changed

  • [Bug fix] use padding_mode inside the kernel to process elements out of boundary for softmax by @xjmxyt in #1
  • [Bug fix] use ct.gather ct.store for softmax's no-tma op by @yifeis-nv in #2
  • Add PR bot to repository by @arjkesh in #3
  • Update README.md by @xjmxyt in #5
  • remove dead code in silu_and_mul kernel - creates output offsets (for 1D), expect n_elements param... but no need... by @lessw2020 in #6
  • Initialize TileGym CI by @arjkesh in #4
  • Use ruff formatter, introduce helper dev script by @arjkesh in #11
  • Introduce job timeouts, speed up builds by @camille-004 in #9
  • [FEA] add gelu & relu by @xjmxyt in #13
  • Update dockerfile to use cuda 13.1 base image by @arjkesh in #12
  • [Fix] Refactor nightly skip logic by @arjkesh in #8
  • Add automatic header checks and formatting by @arjkesh in #14
  • Standardize softmax.py to avoid numpy dependency by @lessw2020 in #16
  • [Update] update kernels and reformat codes by @hannahli-nv in #18
  • [FEA] Add dropout by @hannahli-nv in #19
  • Split-K reduction kernel cleanup by @lessw2020 in #21
  • Fix: moe_align_block_size() supports non-power-of-2 num_experts by @huanghua1994 in #24
  • Update autotuner: use experimental autotuner in cutile-python by @xjmxyt in #25
  • feat: chunked softmax implementation for large column size by @aghilann in #17
  • [Update] Add benchmark and autotune for group_gemm by @xjmxyt in #26
  • Fix benchmark failure cases by @arjkesh in #27
  • Format benchmark files as json, add perf thresholds by @arjkesh in #15
  • feat: RMSNorm backward pass kernels by @aghilann in #29
  • Split-K reduction: remove un-needed scaling via INV_LOG_2 by @lessw2020 in #22
  • [fix] Update benchmark sparse checkout by @arjkesh in #30
  • [FEA] Add bmm by @hannahli-nv in #31
  • Temporarily avoid job failures due to inconsistent benchmarks by @arjkesh in #32
  • [Update] Fix bmm issue by @hannahli-nv in #34
  • [FEA] Add Qwen2-7B module by @hannahli-nv in #36
  • Update for ragged_bmm moe by @hannahli-nv in #37
  • Add env "DISABLE_FALLBACK" & fix type hint error & other updates by @hannahli-nv in #39
  • Add reusable retry workflow for runner availability timeouts by @arjkesh in #35
  • Add mHC fused kernels and tests by @Edward-lyz in #38
  • Update some comments by @hannahli-nv in #42
  • Add tilegym wheel building by @arjkesh in #41
  • fix matmul illegal address error on DGX Spark by @xjmxyt in #44
  • fix qwen2 fp16 bug by @hannahli-nv in #43
  • [Fix] fix num_kv_split becomes 0 by @xjmxyt in #45
  • Avoid OOM for large GEMM 32k & modify layernorm cutile by @hannahli-nv in #50
  • Add option to ignore specific wheel validations by @arjkesh in #51
  • Add road map by @hannahli-nv in #52
  • [FEA] Add SwiGLU backward pass implementation, test cases and benchmark by @Weili-0234 in #46
  • Enable experimental_kernel marker by @hannahli-nv in #53
  • [FEA] Add FlashAttention backward pass implementation, test cases and benchmark by @Weili-0234 in #49
  • Update README.md by @xjmxyt in #54
  • Add version for tilegym wheels, update reusable workflow by @arjkesh in #55
  • Fix import error for experimental marker & support gemma 3 & other updates by @hannahli-nv in #57
  • Add tilegym homepage to setup.py by @arjkesh in #58
  • Update MoE by @hannahli-nv in #59
  • fix torch dependency by @xjmxyt in #61
  • feat: replace RMSNorm backward with persistent CuTile kernel by @aghilann in #60
  • Scan for CVEs in wheels, fix python versions by @arjkesh in #64
  • feat: add CuTile RoPE backward with tests and backward benchmark by @aghilann in #62
  • A fix for silu_and_mul & Update codes & other updates by @hannahli-nv in #67
  • Add workflow to prepare release tag and artifacts by @arjkesh in #66
  • Update moe type hint & Update gitignore & other updates by @hannahli-nv in #68
  • add cutile kernel skill and Move install_requires dependencies to requirements.txt by @hannahli-nv in #69
  • Add SECURITY.md with vulnerability reporting instructions & Add SPDX license header to SECURITY.md & other updates by @hannahli-nv in #71
  • feat: swiglu forward optimizations by @aghilann in #63
  • feat: chunked fused linear cross-entropy kernel forward by @aghilann in #65
  • Update attention & Add .venv to ruff exclude list by @hannahli-nv in #72
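
Several PRs above, such as the chunked softmax for large column sizes in #17, rely on the online-softmax trick: the running row maximum and normalizer can be updated chunk by chunk, so a row never has to fit in on-chip memory at once. A NumPy sketch of the idea (illustrative only, not the cuTile kernel):

```python
import numpy as np

def chunked_softmax(row, chunk=4):
    # Online softmax: maintain a running max m and normalizer s,
    # rescaling the accumulated sum whenever a new chunk raises the max.
    m, s = -np.inf, 0.0
    for start in range(0, len(row), chunk):
        c = row[start:start + chunk]
        m_new = max(m, c.max())
        s = s * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    # Second pass: normalize each element with the final (m, s).
    return np.exp(row - m) / s

x = np.random.randn(10)
```

The result matches a standard max-subtracted softmax over the whole row, regardless of chunk size.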

Full Changelog: https://github.com/NVIDIA/TileGym/commits/v1.0.0