
fix: clamp dynamic NVFP4 FP8 scale export#1465

Open

ShawRong wants to merge 2 commits into NVIDIA:main from ShawRong:fix/nvfp4-dynamic-scale-clamp

Conversation


@ShawRong ShawRong commented May 12, 2026

What does this PR do?

Type of change: Bug fix

Clamp the dynamic NVFP4 per-block FP8 scale before casting it to torch.float8_e4m3fn during export.

The static quantizer-derived path already clamps the exported FP8 scale, but the dynamic path can still produce scale values larger than the finite FP8 E4M3 range. Since torch.float8_e4m3fn has no Inf representation, casting values such as 480 or 1000 produces NaN payloads in the exported scale tensor. This patch applies the same finite-range clamp to the dynamic path so exported dynamic NVFP4 scales remain finite.

Usage

# No API changes. Dynamic NVFP4 export now avoids NaN FP8 scale entries when
# the computed per-block scale exceeds the finite fp8_e4m3fn range.

Testing

uv run pytest tests/gpu/torch/quantization/test_qtensor_cuda.py::TestQTensor::test_nvfp4_dynamic_export_fp8_scale_no_nan_when_scale_exceeds_fp8 -q

Result: 1 passed.

Also ran pre-commit on the touched files:

uv run pre-commit run --files \
  modelopt/torch/quantization/qtensor/nvfp4_tensor.py \
  tests/gpu/torch/quantization/test_qtensor_cuda.py

Result: all applicable hooks passed.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ✅
  • Did you update Changelog?: N/A
  • Did you get Claude approval on this PR?: N/A

Additional Information

This complements the existing static NVFP4 scale clamp by applying the same safety guard to the dynamic scale export path.

Summary by CodeRabbit

Bug Fixes

  • Fixed NaN values in FP8 scale conversions during dynamic NVFP4 quantization export. Scale values now clamp properly when they exceed the valid FP8 range, preventing overflow and ensuring numerical stability in exported scale bytes.

Tests

  • Added regression test to verify FP8 scale exports no longer contain NaN bytes during dynamic NVFP4 operations.

Signed-off-by: ShawRong <shawnrong1213@gmail.com>

copy-pr-bot Bot commented May 12, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

coderabbitai Bot commented May 12, 2026

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 1919a19e-573f-4a66-901b-d1c0526bccaa

📥 Commits

Reviewing files that changed from the base of the PR and between 7a11fb2 and 446536a.

📒 Files selected for processing (2)
  • modelopt/torch/quantization/qtensor/nvfp4_tensor.py
  • tests/gpu/torch/quantization/test_qtensor_cuda.py

📝 Walkthrough

Walkthrough

NVFP4 quantization now prevents numeric saturation to NaN when converting very large per-block scales to FP8 format by clamping values to 448.0 before the conversion. A regression test verifies the fix works by checking that exported FP8 scale bytes contain no NaN values and saturate correctly.

Changes

NVFP4 FP8 Scale Saturation

  • FP8 scale clamping logic — modelopt/torch/quantization/qtensor/nvfp4_tensor.py: when keep_high_precision is False, per_block_scale is clamped to max=448.0 before casting to torch.float8_e4m3fn to avoid NaN overflow; comments updated to explain the NaN-avoidance rationale.
  • Regression test for FP8 saturation — tests/gpu/torch/quantization/test_qtensor_cuda.py: new CUDA test verifies that dynamic NVFP4 export of the FP8 scale bytes contains no NaN (0x7F) when the scale exceeds the FP8 range, and that the first byte saturates to 0x7E.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage ⚠️ Warning — Docstring coverage is 66.67%, below the required 80.00% threshold. Resolution: write docstrings for the functions missing them.

✅ Passed checks (5 passed)

  • Description Check ✅ — Check skipped; CodeRabbit's high-level summary is enabled.
  • Title Check ✅ — The title clearly and concisely summarizes the main change: clamping the dynamic NVFP4 FP8 scale export to prevent NaN values.
  • Linked Issues Check ✅ — Check skipped; no linked issues were found for this pull request.
  • Out of Scope Changes Check ✅ — Check skipped; no linked issues were found for this pull request.
  • Security Anti-Patterns ✅ — No security anti-patterns found. The PR contains numerical changes to NVFP4 quantization plus test additions; none of the six critical patterns are present.


@ShawRong ShawRong marked this pull request as ready for review May 12, 2026 14:18
@ShawRong ShawRong requested a review from a team as a code owner May 12, 2026 14:18
@ShawRong ShawRong requested a review from sychen52 May 12, 2026 14:18
Collaborator

@shengliangxu shengliangxu left a comment


LGTM
