fix: fix h100 release/performance #2184

Merged: terrykong merged 11 commits into main from yukih/release-perf-fix on Apr 10, 2026

yuki-97 (Contributor) commented Apr 1, 2026

All Clear (with Megatron-Bridge version in #2223)

  1. deepseek related: fix: fix dsv3 by disable mtp #2191
    • grpo-deepseek-v3-32n8g
    • grpo-deepseek-v3-64n8g-async-1off
    • grpo-deepseek-v3-64n8g-fp8-async-1off
    • grpo-dapomath17k-dsv3-megatron
    • dapo-deepseek-v3-64n8g: f2071f3
  2. logprob_batch_size related: fix: revert logprob_batch_size to keep same perf as before #2192
    • grpo-qwen3-30ba3b-8n8g-megatron
    • grpo-qwen3-30ba3b-4n8g-40K
  3. grpo-qwen3-235b-32n8g-async-1off: 365c77a
  4. grpo-gemma3-27b-it-8n8g-fsdp2tp8-actckpt-long: fix: fix gemma3 #2185
  5. grpo-gptoss-20b-8n8g-megatron: [model] fix: fix gpt-oss down_proj weight handling Megatron-Bridge#3162

copy-pr-bot bot commented Apr 1, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

yuki-97 force-pushed the yukih/release-perf-fix branch 3 times, most recently from 8f5ef0d to 37a25de (April 5, 2026 05:33)
yuki-97 changed the base branch from main to yukih/bump-mbridge (April 7, 2026 03:22)
yuki-97 force-pushed the yukih/bump-mbridge branch 2 times, most recently from 5371866 to f4de024 (April 7, 2026 12:53)
yuki-97 force-pushed the yukih/release-perf-fix branch from 3875c8a to 40c2dc0 (April 7, 2026 15:01)
yuki-97 force-pushed the yukih/bump-mbridge branch from f4de024 to d7916f6 (April 8, 2026 05:59)
yuki-97 force-pushed the yukih/release-perf-fix branch from 40c2dc0 to 0945485 (April 8, 2026 06:13)
yuki-97 force-pushed the yukih/bump-mbridge branch from 914d613 to fe45277 (April 8, 2026 11:11)
yuki-97 force-pushed the yukih/release-perf-fix branch 2 times, most recently from ae11a64 to e0ee956 (April 8, 2026 13:29)
yuki-97 force-pushed the yukih/bump-mbridge branch 2 times, most recently from 8a616b5 to a0417e4 (April 9, 2026 03:13)
github-actions bot added the Documentation label ("Improvements or additions to documentation") (Apr 9, 2026)
Base automatically changed from yukih/bump-mbridge to main (April 10, 2026 02:57)
yuki-97 added 10 commits April 9, 2026 20:05
- grpo-llama3.1-8b-instruct-4n8g-fsdp2tp1-long.v3
- grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt-long.v3
- grpo-qwen2.5-32b-32n8g-fsdp2tp8-actckpt.v3
- grpo-qwen3-30ba3b-4n8g-40K

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
- revert logprob_batch_size
- update node and test time

Signed-off-by: Yuki Huang <yukih@nvidia.com>
Signed-off-by: Yuki Huang <yukih@nvidia.com>
…3b-4n8g-40K

Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuki-97 force-pushed the yukih/release-perf-fix branch from 3f93034 to 4fa69d0 (April 10, 2026 03:06)
github-actions bot removed the Documentation label ("Improvements or additions to documentation") (Apr 10, 2026)
…htly_compute_stays_below_1340_hours

Signed-off-by: Yuki Huang <yukih@nvidia.com>
yuki-97 marked this pull request as ready for review (April 10, 2026 03:14)
yuki-97 requested review from a team as code owners (April 10, 2026 03:14)
yuki-97 added the CI:Lfast label ("Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version)") (Apr 10, 2026)
yuki-97 (Contributor, Author) commented Apr 10, 2026

/ok to test 9455630

Comment thread tests/test_suites/llm/grpo-dapomath17k-dsv3-megatron.sh
terrykong (Collaborator) left a comment

lgtm modulo that one comment

terrykong enabled auto-merge (squash) (April 10, 2026 03:41)
terrykong merged commit e59c7ef into main (Apr 10, 2026)
27 checks passed
terrykong deleted the yukih/release-perf-fix branch (April 10, 2026 04:26)