Releases: NVIDIA/recsys-examples
v26.03
What's Changed
Features & Enhancements
- Add Torch export for HSTU model by @jensenhwa in #327
- [Feature] dynamicemb table fusion and expansion by @jiashuy in #343
- feat(benchmark): HSTU E2E training benchmark suite with progressive optimizations by @JacoCheung in #340
- Add HSTU inference benchmark results on B200 by @geoffreyQiu in #338
- Relax alignment requirements(remove pow of 2) in dynamicemb by @jiashuy in #312
- perf: avoid D2H sync in _Split2DJaggedFunction by precomputing split lengths by @JacoCheung in #318
- refactor: migrate to fbgemm_gpu_hstu, remove legacy HSTU compat layer by @JacoCheung in #321
- Optimize balancer and setup debug logger. by @JacoCheung in #308
- fix: align DynamicEmb capacity to bucket_capacity instead of DEMB_TABLE_ALIGN_SIZE by @JacoCheung in #329
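One entry above removes a device-to-host sync from `_Split2DJaggedFunction` by precomputing split lengths. A minimal sketch of that idea (function names here are illustrative, not the real autograd Function's API):

```python
import torch

def precompute_split_sizes(lengths: torch.Tensor) -> list:
    # One-time host transfer, done at batch-construction time rather
    # than inside the hot forward/backward path of the autograd Function.
    return lengths.cpu().tolist()

def split_jagged_rows(values: torch.Tensor, split_sizes: list):
    # No D2H sync here: split_sizes is already a host-side Python list,
    # so torch.split never has to read a device tensor synchronously.
    return torch.split(values, split_sizes)
```

Precomputing once and reusing the host-side list is what avoids a blocking `.tolist()`/`.item()` on every call.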
Bug Fixes
- fix missing import by @gameofdimension in #320
- refactor: remove redundant apply_optimizer_in_backward in sharding.py by @ShaobinChen-AH in #330
- error handling for empty kv list by @gameofdimension in #331
- Fix docker, cmake and imports after torch export support by @geoffreyQiu in #358
- Make table_ptrs_dev persistent by @jiashuy in #356
- Create DynamicEmbStorage when zero local hbm; reset _prefetch_outstanding_keys only in reset_cache_states by @jiashuy in #354
- Fix empty batch hang fundamentally by @jiashuy in #349
- [bugfix] fix hang issue when fed empty batch by @gameofdimension in #342
- Fix optimizer states dim(ckpt) of rowwise adagrad by @jiashuy in #305
- Refactor test for alignment; add get_sharded_table_capacity by @jiashuy in #348
Misc
- fix(pipeline): drain eval pipeline naturally to prevent batch leak by @JacoCheung in #314
- Fix NVE dependency by @geoffreyQiu in #323
- refactor: move HSTU build to devel stage by @shijieliu in #325
- Upgrade to Torch 2.11 with Cuda 13.1 by @geoffreyQiu in #347
- Update HSTU inference README file by @geoffreyQiu in #360
New Contributors
- @jensenhwa made their first contribution in #327
Full Changelog: v26.01...v26.03
v26.01
What's Changed
Features & Enhancements
- HSTU KV Cache Manager V2 by @geoffreyQiu in #251
- workload balancer and datasets folder refactor by @JacoCheung in #275
- Fea unify pooling to dynamic embedding table by @shijieliu in #301
- Optimize EmbeddingBagCollection preliminarily by @jiashuy in #268
- refactor unique to stateless op by @shijieliu in #290
- Optimize dedup indices and segmented unique by @shijieliu in #293
- optimize backward local_reduce, use fwd unique results by @shijieliu in #299
Bug Fixes
- Fix devel build failure by @JacoCheung in #288
- Fix wrong evicted values when insert failed/busy in insert_and_evict. by @jiashuy in #284
- Fix test_jagged_tensor import bug by @JacoCheung in #289
- Fix KVCounter initialization bug. by @z52527 in #282
- Fix mcore version in training readme by @JacoCheung in #286
- Fix issue related to empty batch by @jiashuy in #271
- Fix issue #272: dump/load score consistency in STEP mode by @ShaobinChen-AH in #298
Misc
- Pull new triton hstu kernel by @JacoCheung in #291
- Add README for embedding pooling. by @z52527 in #270
- rename wrong module name by @gameofdimension in #278
- [CI] split dynamicemb tests by @shijieliu in #273
- Update triton version by @JacoCheung in #287
- release v26.01 by @shijieliu in #309
New Contributors
- @gameofdimension made their first contribution in #278
- @ShaobinChen-AH made their first contribution in #298
Full Changelog: v25.12...v26.01
v25.12
What's Changed
Features & Enhancements
- Support triton-server for hstu inference by @geoffreyQiu in #235
- Deterministic mode in dynamicemb by @jiashuy in #262
- Add sid gr model with validation on amzn beauty dataset by @JacoCheung in #265
Misc
- Reserve and incremental dump of ScoredHashTable by @jiashuy in #246
- Replace HKV with ScoredHashTable by @jiashuy in #255
Full Changelog: v25.11...v25.12
v25.11
What's Changed
Features & Enhancements
- Counter table interface and ScoredHashTable Implementation by @jiashuy in #229
- Embedding admission strategy by @z52527 in #236
- Optimize memory waste in segmented_unique by @z52527 in #244
Bug Fixes
- Fix dtype mismatch of offset and table_range. by @jiashuy in #227
- fix preprocessor local() error in python 3.10 by @shijieliu in #228
- Add new handler despite the existing ones by @JacoCheung in #233
- Fix LFU test failed in incremental_dump by @jiashuy in #242
- Fix default parameter initialization of KVCounter by @jiashuy in #253
Misc
- Format dynamicemb's source codes by @jiashuy in #230
- train docker update to cuda 12.9 by @shijieliu in #205
- Quick fix for commands in Inference README by @geoffreyQiu in #247
Full Changelog: v25.10...v25.11
v25.10
What's Changed
Features & Enhancements
- Add sequence parallelism by @JacoCheung in #216
- Decouple scaling seqlen from max_seqlen in hstu attn by @geoffreyQiu in #208
- Fea support lru score dump load by @shijieliu in #186
- Gradient clipping by reusing TorchRec&FBGEMM's parameters by @jiashuy in #223
- [HSTU]Add SM 89 support by @JacoCheung in #217
- allow allow_overwrite in DynamicEmbDump by @fshhr46 in #206
Bug Fixes
- Fix LFU mode frequency count bug by @z52527 in #176
- Fix config bug when using torchrec's STBE in benchmark by @jiashuy in #193
- Fix IMA in incremental dump and test the dumped embeddings by @jiashuy in #211
- Fix rab num heads by @JacoCheung in #222
- Fix IMA caused by wrong worker id for device of which max threads is … by @jiashuy in #220
Misc
- Code reorganization for hstu training and inference by @geoffreyQiu in #202
- Add embedding pooling kernel by @z52527 in #215
Full Changelog: v25.09...v25.10
v25.09
What's Changed
Features & Enhancements
- Dynamicemb prefetch integration by @JacoCheung in #181
- Support distributed embedding dumping for dynamicemb by @z52527 @shijieliu in #120 #185
- Add kernel fusion in HSTU block for inference, with KVCache fixes by @geoffreyQiu in #184
- export hstu fp8 quant by @shijieliu in #168
- Replace BatchedDynamicEmbeddingTables with BatchedDynamicEmbeddingTablesV2 by @jiashuy in #155
Bug Fixes
- fix DynamicEmbDump - handle long strings in broadcast_string by @fshhr46 in #164
- fix: consider mask when calc hstu attn flops by @shijieliu in #177
- export fix hstu ima when num_candidates = seqlen by @shijieliu in #183
Misc
- Make local hbm budget grow when num_embeddings grows. by @jiashuy in #156
- Fix several errors for inference. by @geoffreyQiu in #167
- Fix setup.py by @yiwenchen2025 in #169
- Suppress mcore deps install by @JacoCheung in #170
- dynamicemb clean BatchedDynamicEmbeddingTables by @jiashuy in #179
- Update hstu layer benchmark doc by @JacoCheung in #171
- Update dynamicemb's benchmark and example with README.md by @jiashuy in #188
Full Changelog: v25.08...v25.09
v25.08
What's Changed
Features & Enhancements
- Refactor dynamicemb with Cache&Storage. by @jiashuy in #128
- Support Kuairand dataset inference with alignment to training by @geoffreyQiu in #122
- Support eval mode for dynamicemb and move insert in backward to forward for use_index_dedup=True by @shijieliu in #136
- export hstu arbitrary mask by @shijieliu in #148
- Optimize TP HSTU layer by @JacoCheung in #132
Bug Fixes
- Fix invalid pip option: replace --no-cache with --no-cache-dir by @mia1460 in #126
- Remove HostAlloc in dataloader by @JacoCheung in #129
- Fix filtering of samples with insufficient history by @mia1460 in #134
- fix pipeline test by @shijieliu in #135
- Hkv timeline clean by @jiashuy in #137
- Fix calc flops by @shijieliu in #139
- fix(dataset): add per-user reorder by time and pre-sort to guarantee … by @mia1460 in #141
- fix preprocessor not working on absolute data path by @shijieliu in #146
- fix codespell checking by @shijieliu in #149
- fix collective utset by @shijieliu in #151
- Fix the shape hint for offsets by @yiwenchen2025 in #153
Misc
- Update dynamicemb benchmark by @jiashuy in #138
- Update the benchmarks and results. by @geoffreyQiu in #144
- update benchmark doc by @shijieliu in #150
- update benchmark result of dynamicemb to figure by @jiashuy in #154
New Contributors
- @mia1460 made their first contribution in #126
- @yiwenchen2025 made their first contribution in #153
Full Changelog: v25.07...v25.08
v25.07
What's Changed
Features & Enhancements
- HSTU inference benchmark and example release by @geoffreyQiu in #92 #85 #93
- Tensor parallelism support for HSTU layer by @JacoCheung in #101
- Print detailed memory consumption of embedding and optimizer states by @jiashuy in #113
- calc flops in ranking by @shijieliu in #96
- add preprocessing mlp for hstu by @shijieliu in #98
Bug Fixes
- fix noncontiguous input for dynamicemb by @shijieliu in #99
- Fix dynamicemb example's local rank bug on multi-node by @z52527 in #95
- [Fix] retrieval shifting prediction embedding bug by @shijieliu in #114
Full Changelog: v25.06...v25.07
v25.06
What's Changed
Features & Enhancements
LFU Eviction Strategy for Dynamic Embeddings
Added a new Least Frequently Used (LFU) eviction strategy to the dynamicemb module, improving memory management and embedding efficiency.
(Contributed by @z52527 — (#52))
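To illustrate what an LFU eviction policy does, here is a toy host-side sketch; the real dynamicemb strategy tracks per-embedding access frequency on the GPU, and the class and method names below are hypothetical:

```python
from collections import Counter

class LFUEvictionTable:
    """Toy fixed-capacity key-value table with least-frequently-used
    eviction. Illustrative only, not the dynamicemb implementation."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.values = {}
        self.freq = Counter()

    def access(self, key, value=None):
        if key not in self.values:
            if len(self.values) >= self.capacity:
                # Evict the resident key with the lowest access count.
                victim = min(self.values, key=lambda k: self.freq[k])
                del self.values[victim]
                del self.freq[victim]
            self.values[key] = value
        self.freq[key] += 1
        return self.values[key]
```

Under this policy, frequently accessed embeddings survive capacity pressure while rarely touched ones are evicted first.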
LayerNorm Recomputation for Fused HSTU Layer
Support for recomputing LayerNorm in the fused HSTU layer to optimize memory usage during training.
(Contributed by @JacoCheung — (#59))
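The general recomputation pattern can be sketched with PyTorch's activation checkpointing: discard the LayerNorm activations in the forward pass and recompute them during backward, trading a little compute for memory. This is a sketch of the technique, not the fused-layer implementation:

```python
import torch
from torch.utils.checkpoint import checkpoint

class RecomputedLayerNorm(torch.nn.Module):
    """Wraps LayerNorm in activation checkpointing so its activations
    are not stored for backward but recomputed on demand."""

    def __init__(self, dim: int):
        super().__init__()
        self.ln = torch.nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # use_reentrant=False is the recommended checkpointing mode.
        return checkpoint(self.ln, x, use_reentrant=False)
```

The fused HSTU layer applies the same idea inside the fused kernel's surrounding module rather than via a generic wrapper.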
Embedding and Optimizer State Insertion to HKV During Backward Pass
When use_index_dedup is enabled, embeddings and optimizer states are now inserted into the HKV during the backward pass, improving training efficiency.
(Contributed by @jiashuy — (#62))
Support for Non-Contiguous Input/Output in HSTU MHA and SiLU Recomputation
Enabled handling of non-contiguous tensors for multi-head attention and SiLU recomputation within HSTU layers.
(Contributed by @JacoCheung — (#64))
Customized CUDA Operation for Concatenating 2D Jagged Tensors
Introduced a new CUDA operator concat_2d_jagged_tensors to efficiently concatenate jagged tensors in 2D.
(Contributed by @z52527 — (#42))
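A jagged 2D tensor is typically stored as a flat values buffer plus row offsets. A pure-Python reference of the row-wise concatenation the CUDA operator performs (the signature here is illustrative, not the real `concat_2d_jagged_tensors` API):

```python
def concat_2d_jagged(values_a, offsets_a, values_b, offsets_b):
    """Concatenate two jagged 2D tensors row by row.

    Each input is (flat values, row offsets); output row i is
    A's row i followed by B's row i. The CUDA kernel parallelizes
    this copy per element instead of looping on the host.
    """
    assert len(offsets_a) == len(offsets_b)
    out_values, out_offsets = [], [0]
    for i in range(len(offsets_a) - 1):
        out_values.extend(values_a[offsets_a[i]:offsets_a[i + 1]])
        out_values.extend(values_b[offsets_b[i]:offsets_b[i + 1]])
        out_offsets.append(len(out_values))
    return out_values, out_offsets
```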
Support for Training Pipeline
Added support for a streamlined training pipeline to facilitate easier model training and experimentation.
(Contributed by @JacoCheung — (#68))
Bug Fixes
Fixed HSTU Preprocess and Postprocess CI Issues
Resolved continuous integration issues related to HSTU preprocessing and postprocessing steps.
(Contributed by @shijieliu — (#76))
Documentation
Updated HSTU Installation Instructions
Clarified and expanded the README installation guide for the HSTU module to improve user onboarding.
(Contributed by @z52527 — (#84))
Dependency Updates
Stable Dependency Upgrades
Updated key dependencies to stable versions:
torchrec updated to 1.2.0
fbgemm_gpu updated to 1.2.0
mcore updated to 0.12.1
(Contributed by @shijieliu and @JacoCheung — (#74), (#75))
v25.05
Changelog
- Dynamicemb example #16 #31 #58
- EmbeddingBagCollection support in Dynamicemb #20
- Dynamicemb functionality enhancement #45 #46 #53
- HSTU cutlass kernel support contextual features in hopper backward #51
- Decouple sharding and model definition in hstu example #37
- Fused hstu layer #43
- Fix kuairand dataset convergence issue #34
- Doc enhancement #39
Full Changelog: https://github.com/NVIDIA/recsys-examples/commits/v25.05