bf16/fp16 cache quantization for Cache Aware Pipelines by naymaraq · Pull Request #15762 · NVIDIA-NeMo/NeMo

naymaraq · 2026-06-06T21:41:05Z

Important

The Update branch button must only be pressed on very rare occasions.
An outdated branch is never blocking the merge of a PR.
Please reach out to the automation team before pressing that button.

What does this PR do ?

Stores the cache-aware streaming cache (cache_last_channel / cache_last_time) in bf16 instead of fp32, cutting cache GPU memory by 2x.
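The 2x figure follows directly from element width: fp32 uses 4 bytes per element, while fp16 and bf16 use 2. A minimal sketch of that arithmetic, using the fp32 sizes reported under Details; the helper name `cache_size_mb` is hypothetical and not part of this PR:

```python
# Bytes per element for the three cache dtypes named in this PR.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}


def cache_size_mb(fp32_size_mb: float, cache_dtype: str) -> float:
    """Scale an fp32 cache size to the size it would occupy in cache_dtype.

    Hypothetical helper for illustration only; it is not NeMo API.
    """
    if cache_dtype not in BYTES_PER_ELEMENT:
        raise ValueError(f"unsupported cache dtype: {cache_dtype!r}")
    return fp32_size_mb * BYTES_PER_ELEMENT[cache_dtype] / BYTES_PER_ELEMENT["fp32"]


# fp32 sizes reported in the Details table of this PR (in MB).
FP32_LAST_CHANNEL_MB = 1680
FP32_LAST_TIME_MB = 192

for dtype in ("fp32", "fp16", "bf16"):
    last_channel = cache_size_mb(FP32_LAST_CHANNEL_MB, dtype)
    last_time = cache_size_mb(FP32_LAST_TIME_MB, dtype)
    print(f"{dtype}: {last_channel + last_time:.0f} MB total")
```

This reproduces the 1872 MB (fp32) and 936 MB (fp16/bf16) totals listed in the table.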

Collection: [ASR]

Changelog

Add specific line by line info of high level changes in this PR.

Usage

python3 asr_streaming_infer.py \
    --config-path=../conf/asr_streaming_inference/ \
    --config-name=cache_aware_rnnt.yaml \
    asr.model_name=nvidia/nemotron-speech-streaming-en-0.6b \
    audio_file=sample.wav \
    output_filename=out.json \
    streaming.cache_dtype=bf16        # or fp16 / fp32

Details

Key Results:

fp16 and bf16 halve cache memory relative to fp32 (936 MB vs. 1872 MB) with essentially no WER degradation: bf16 matches fp32 WER on every test set, and fp16 stays within 0.05 percentage points on each set.
Reduced precision also improves speed: fp16 and bf16 achieve higher RTFX than fp32 in every setting, with bf16 fastest or tied for fastest. The benchmark also revealed a bug in the current CacheAwarePipeline: the model runs in bf16 by default, but the cache is kept in fp32, which causes repeated type casts during inference. Matching the cache dtype to the model (bf16) improves RTFX with no WER degradation while also halving cache memory.
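For intuition about how much rounding each 16-bit format introduces, here is a standard-library sketch. It is not the PR's code (the pipeline presumably relies on PyTorch dtypes); the function names are hypothetical. It rounds a Python float to fp16 (10 mantissa bits) and to bf16 (7 mantissa bits, same exponent range as fp32):

```python
import struct


def round_to_fp16(x: float) -> float:
    """Round x to IEEE half precision by packing it with struct's 'e' format."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def round_to_bf16(x: float) -> float:
    """Round x to bfloat16 using round-to-nearest-even on the float32 bits.

    Assumes x is finite and representable in float32.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add a rounding bias: 0x7FFF plus the lowest bit that will be kept.
    bits += 0x7FFF + ((bits >> 16) & 1)
    # Keep only the top 16 bits (sign, 8 exponent bits, 7 mantissa bits).
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


print(round_to_fp16(1.001))    # 1.0009765625 (finer steps near 1.0)
print(round_to_bf16(1.001))    # 1.0 (coarser steps near 1.0)
print(round_to_bf16(70000.0))  # 70144.0 (above the fp16 maximum of 65504)
```

bf16 trades mantissa precision for fp32's exponent range, whereas fp16 is more precise but overflows above 65504. The table indicates that neither trade-off measurably affects WER for this cache.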

Method	Att Context Size	Num Slots	Batch Size	Bits	Cache Total (MB)	Cache Last Channel (MB)	Cache Last Time (MB)	Compression Ratio	AMI WER	E22 WER	LS Clean WER	LS Other WER	TED WER	VOX WER	Avg. WER	RTFX (LS Other)
fp32	[70, 13]	256	64	32	1872	1680	192	1	14.91%	19.86%	5.28%	8.26%	12.03%	11.13%	11.91%	367
fp16	[70, 13]	256	64	16	936	840	96	2	14.91%	19.87%	5.29%	8.25%	12.05%	11.13%	11.92%	412
bf16	[70, 13]	256	64	16	936	840	96	2	14.91%	19.86%	5.28%	8.26%	12.03%	11.13%	11.91%	414
fp32	[70, 6]	256	64	32	1872	1680	192	1	15.06%	20.01%	5.38%	8.49%	12.02%	11.23%	12.03%	246
fp16	[70, 6]	256	64	16	936	840	96	2	15.05%	20.01%	5.38%	8.50%	12.00%	11.24%	12.03%	273
bf16	[70, 6]	256	64	16	936	840	96	2	15.06%	20.01%	5.38%	8.49%	12.02%	11.23%	12.03%	282
fp32	[70, 1]	256	64	32	1872	1680	192	1	17.08%	20.48%	5.57%	8.91%	12.30%	11.56%	12.65%	103
fp16	[70, 1]	256	64	16	936	840	96	2	17.03%	20.45%	5.56%	8.90%	12.31%	11.55%	12.63%	105
bf16	[70, 1]	256	64	16	936	840	96	2	17.08%	20.48%	5.57%	8.91%	12.30%	11.56%	12.65%	108
fp32	[70, 0]	256	64	32	1872	1680	192	1	20.29%	21.01%	6.06%	9.80%	12.61%	12.70%	13.74%	62
fp16	[70, 0]	256	64	16	936	840	96	2	20.29%	21.00%	6.04%	9.81%	12.65%	12.73%	13.75%	63
bf16	[70, 0]	256	64	16	936	840	96	2	20.29%	21.01%	6.06%	9.80%	12.61%	12.70%	13.74%	63

Experiments were run on an NVIDIA RTX 5000 Ada GPU.
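The relative speedups implied by the RTFX column can be computed directly from the table. A small sketch with the values copied from the rows above; `speedup_percent` is a hypothetical helper:

```python
# RTFX (LS-OTHER) values copied from the table, keyed by attention context size.
RTFX = {
    (70, 13): {"fp32": 367, "fp16": 412, "bf16": 414},
    (70, 6): {"fp32": 246, "fp16": 273, "bf16": 282},
    (70, 1): {"fp32": 103, "fp16": 105, "bf16": 108},
    (70, 0): {"fp32": 62, "fp16": 63, "bf16": 63},
}


def speedup_percent(baseline: float, value: float) -> float:
    """Relative RTFX gain of value over baseline, in percent."""
    return (value / baseline - 1.0) * 100.0


for context, row in RTFX.items():
    gains = ", ".join(
        f"{dtype} +{speedup_percent(row['fp32'], row[dtype]):.1f}%"
        for dtype in ("fp16", "bf16")
    )
    print(f"att_context_size={list(context)}: {gains}")
```

The gain is largest at the wider right contexts (about 13% to 15% for bf16 at [70, 13] and [70, 6]) and shrinks to under 2% at [70, 0].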

GitHub Actions CI

The Jenkins CI system has been replaced by GitHub Actions self-hosted runners.

The GitHub Actions CI will run automatically when the "Run CICD" label is added to the PR.
To re-run CI remove and add the label again.
To run CI on an untrusted fork, a NeMo user with write access must first click "Approve and run".

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?
Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
- Reviewer: Does the PR have correct import guards for all optional libraries?

PR Type:

New Feature
Bugfix
Documentation

If you haven't finished some of the above items you can still open "Draft" PR.

Who can review?

Anyone in the community is free to review the PR once the checks have passed.
Contributor guidelines contains specific people who can review PRs to various areas.

Additional Information

Related to # (issue)

Signed-off-by: naymaraq <dkaramyan@nvidia.com>

copy-pr-bot · 2026-06-06T21:41:09Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

naymaraq added 2 commits June 7, 2026 01:12

add fp16/bf16 cache quantization

fcdca8d

Signed-off-by: naymaraq <dkaramyan@nvidia.com>

add inline log for cache dtype

6cea6e3

Signed-off-by: naymaraq <dkaramyan@nvidia.com>

github-actions Bot added the ASR label Jun 6, 2026

naymaraq added Run CICD CI and removed Run CICD labels Jun 6, 2026

naymaraq marked this pull request as ready for review June 7, 2026 14:03

naymaraq requested review from artbataev and arushidNV June 7, 2026 14:04

naymaraq added CI and removed CI labels Jun 7, 2026

naymaraq closed this Jun 7, 2026

naymaraq wants to merge 2 commits into main from dkaramyan/fp16-cache
