[None][infra] take test durations into account to determine cbts splits num by crazydemo · Pull Request #15614 · NVIDIA/TensorRT-LLM

crazydemo · 2026-06-25T05:53:16Z

Summary by CodeRabbit

New Features
- Added split-sizing details to CBTS decision output and dry-run summaries.
- Test stages can now be resized more precisely based on expected runtime, keeping shard counts aligned with stage needs.
Bug Fixes
- Improved how affected test stages are reported so downstream reporting reflects the final stage set consistently.
Documentation
- Updated CBTS guidance and examples to describe the new split-sizing behavior and output fields.
Chores
- Marked some CI decision scripts as out of scope for change detection.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

crazydemo · 2026-06-25T05:53:42Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T05:59:35Z

PR_Github #55711 [ run ] triggered by Bot. Commit: 803b477 Link to invocation

coderabbitai · 2026-06-25T06:04:52Z

📝 Walkthrough

Walkthrough

CBTS now computes per-stage split counts from duration data, serializes them in the decision output, and consumes them in Jenkins scripts to resize stage splits. The related docs, reporting, out-of-scope routing, and temporary waiver entries were updated to match.

Changes

CBTS split sizing

Layer / File(s)	Summary
Duration-derived split counts `jenkins/scripts/cbts/blocks.py`, `jenkins/scripts/cbts/main.py`	Python CBTS helpers compute per-stage split counts from duration data and add `affected_stage_split_counts` to the emitted decision JSON.
Jenkins split-count consumers `jenkins/L0_MergeRequest.groovy`, `jenkins/L0_Test.groovy`, `jenkins/scripts/cbts/rules/out_of_scope_rule.py`	The CBTS result parser records split counts, the noop prefix list includes the Jenkins scripts, and `L0_Test.groovy` resizes stage configs from the new split-count map.
Split-count reporting and docs `jenkins/scripts/cbts/README.md`, `jenkins/scripts/cbts/tools/dryrun.py`, `jenkins/scripts/cbts/tools/report_cbts_decision.py`	CBTS README text, dryrun formatting, and OpenSearch reporting include the new split-count field and split-sizing wording.
Temporary CI waivers `tests/integration/test_lists/waives.txt`	`waives.txt` adds temporary SKIP waivers and a revert-before-merge marker for the split-sizing verification run.

Sequence Diagram(s)

sequenceDiagram
  participant blocks.py
  participant main.py
  participant L0_MergeRequest.groovy
  participant L0_Test.groovy
  main.py->>blocks.py: load_durations() and compute_stage_split_counts()
  blocks.py-->>main.py: affected_stage_split_counts
  main.py->>L0_MergeRequest.groovy: serialize affected_stage_split_counts in decision JSON
  main.py->>L0_Test.groovy: serialize affected_stage_split_counts in decision JSON
  L0_MergeRequest.groovy->>L0_MergeRequest.groovy: _cbtsParseSelectionResult()
  L0_Test.groovy->>L0_Test.groovy: cbtsResizeSplits()

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

NVIDIA/TensorRT-LLM#13382: Adds the CBTS decision schema and Groovy consumption that this PR extends with affected_stage_split_counts.
NVIDIA/TensorRT-LLM#15576: Updates the same duration-based CBTS inputs and CI waivers that feed the new split-count computation.
NVIDIA/TensorRT-LLM#15592: Reworks the L0 test split-handling path that this PR replaces with duration-based split resizing.

Suggested reviewers

tburt-nv
yuanjingx87
yiqingy0

🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name	Status	Explanation	Resolution
Description check	⚠️ Warning	The PR description keeps the template but leaves Description and Test Coverage empty, so it doesn't explain the change or how it was tested.	Add a short Description of the issue and solution, plus Test Coverage listing the relevant tests or verification steps for this change.
Docstring Coverage	⚠️ Warning	Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (3 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title is concise and matches the main change: taking test durations into account for CBTS split sizing.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests

_{Comment @coderabbitai help to get the list of available commands.}

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1706-1732: The shard reduction in cbtsResizeSplits is happening
before the CBTS test-db is actually guaranteed, which can silently skip tests if
renderTestDB falls back to the source DB. Update the CBTS flow so shard counts
are only shrunk after the narrowed DB override is successfully materialized, or
make renderTestDB fail hard when the CBTS artifact cannot be
downloaded/extracted; use cbtsResizeSplits, renderTestDB, and
affected_stage_split_counts to locate the affected logic.

In `@jenkins/scripts/cbts/blocks.py`:
- Around line 642-650: The duration-cache fallback in load_durations should not
cause shard resizing when .test_durations is missing or unreadable. Update the
split-planning flow around load_durations, the defaulting logic that currently
falls back to 1.0, and the narrowed-stage handling in the code that computes
shard counts so that an empty durations result produces no overrides and
preserves stage.total_splits instead of shrinking to 1. Use the existing
load_durations and shard calculation paths as the place to gate this behavior.

In `@jenkins/scripts/cbts/rules/out_of_scope_rule.py`:
- Around line 60-62: The `OutOfScopeRule` temporary exclusions for
`jenkins/L0_Test.groovy` and `jenkins/L0_MergeRequest.groovy` must be removed
before merge so those paths are not treated as noop. Update the exclusion list
in `out_of_scope_rule.py` to drop these two entries and keep only the intended
permanent out-of-scope paths, ensuring CBTS still reacts to changes in its
Groovy producer/consumer layer.

In `@tests/integration/test_lists/waives.txt`:
- Line 1: Remove the temporary CI verification waivers from
tests/integration/test_lists/waives.txt, including the top-of-file REVERT before
merge marker and any https://nvbugs/0000000 placeholder entries. Update the
waiver list so it no longer suppresses the newly covered
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::{test_eagle3,test_pard},
accuracy/test_llm_api_pytorch_multimodal.py::TestNanoV3Omni::test_auto_dtype[fp8],
or the broad unittest/_torch/*, unittest/disaggregated/test_kv_transfer.py, and
unittest/llmapi/test_llm_telemetry.py coverage. If any waiver must remain,
replace the temporary scaffold with a real follow-up entry and valid bug ID.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b76eded-afda-46f9-9aae-f69debd26acc

📥 Commits

Reviewing files that changed from the base of the PR and between bcb9441 and 803b477.

📒 Files selected for processing (9)

jenkins/L0_MergeRequest.groovy
jenkins/L0_Test.groovy
jenkins/scripts/cbts/README.md
jenkins/scripts/cbts/blocks.py
jenkins/scripts/cbts/main.py
jenkins/scripts/cbts/rules/out_of_scope_rule.py
jenkins/scripts/cbts/tools/dryrun.py
jenkins/scripts/cbts/tools/report_cbts_decision.py
tests/integration/test_lists/waives.txt

tensorrt-cicd · 2026-06-25T07:05:30Z

PR_Github #55711 [ run ] completed with state FAILURE. Commit: 803b477
/LLM/main/L0_MergeRequest_PR pipeline #44614 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

crazydemo · 2026-06-25T07:28:01Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T07:34:26Z

PR_Github #55741 [ run ] triggered by Bot. Commit: 2f4b1fc Link to invocation

tensorrt-cicd · 2026-06-25T12:29:06Z

PR_Github #55741 [ run ] completed with state SUCCESS. Commit: 2f4b1fc
/LLM/main/L0_MergeRequest_PR pipeline #44643 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

A CBTS-narrowed stage runs a reduced test set; both the whole-stage reuse (REUSE_STAGE_LIST) and per-test reuse (reusePassedTestResults) key on the stage name, so on the same commit a later non-CBTS full run could reuse a narrowed stage's result and skip full coverage. Rename narrowed stages with a 'cbts-' prefix (cbtsResizeSplits) so their results never match a full run's stage name. Strip the prefix before positional stage-name parsing (parseTaskConfigFromStageName, getSSHConnectionPorts via stripCbtsStagePrefix) and prefix the Layer 2 affectedSet to match. Sanity/PerfSanity full stages keep their original names and reuse normally. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

crazydemo · 2026-06-25T13:46:31Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-06-25T13:55:30Z

PR_Github #55793 [ run ] triggered by Bot. Commit: f9cb2ef Link to invocation

The tarball was written to /tmp and uploaded via rtUpload, whose pattern is resolved relative to the workspace -- so the /tmp path matched 0 files and the artifact was silently never deployed (uploadArtifacts does not throw on 0 matches). renderTestDB's download then 404'd and fell back to source. Harmless until CBTS split-resize (resized stage on source list times out). Tar into and upload from ${LLM_ROOT}, like every other uploadArtifacts call. Pre-existing since NVIDIA#15142. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

…loadable cbtsResizeSplits drops shards based on the narrowed list, but renderTestDB falls back to the source test-db if the cbts_test_db artifact can't be downloaded -- a resized stage would then run the full source list on too few shards and hit the SLURM walltime. Probe the artifact (wget --spider) once on the L0_Test agent (same job as the executors); if it's not retrievable, clear cbts_test_db_artifact_path so split-resize, the stage rename, Layer 2, and renderTestDB all uniformly fall back to the full source list on full splits. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

Temporarily claims jenkins/L0_Test.groovy + L0_MergeRequest.groovy as out-of-scope so CBTS does not fall back on this branch's groovy changes, and adds 15 high-duration waives so DGX_H100-PyTorch narrows to k=3 (6->3 shards) and DGX_B200-PyTorch to k=2 (9->2). Drop this commit after the /bot run verification. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

crazydemo requested review from a team as code owners June 25, 2026 05:53

crazydemo requested review from tburt-nv and yiqingy0 June 25, 2026 05:53

github-actions Bot assigned crazydemo Jun 25, 2026

crazydemo requested a review from EmmaQiaoCh June 25, 2026 05:54

coderabbitai Bot reviewed Jun 25, 2026

View reviewed changes

Comment thread jenkins/L0_Test.groovy Outdated

Comment thread jenkins/scripts/cbts/blocks.py

Comment thread jenkins/scripts/cbts/rules/out_of_scope_rule.py

Comment thread tests/integration/test_lists/waives.txt Outdated

crazydemo force-pushed the fix-xbts branch 3 times, most recently from fa71685 to 2f4b1fc Compare June 25, 2026 07:26

crazydemo force-pushed the fix-xbts branch 5 times, most recently from f11e970 to 0d4302b Compare June 25, 2026 08:03

crazydemo force-pushed the fix-xbts branch 3 times, most recently from 46456ba to f257ccb Compare June 25, 2026 13:43

crazydemo added 2 commits June 25, 2026 21:44

take test durations into account to determine cbts splits num

36aac4d

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>

crazydemo force-pushed the fix-xbts branch from f257ccb to f9cb2ef Compare June 25, 2026 13:45

crazydemo added 3 commits June 25, 2026 22:08

crazydemo force-pushed the fix-xbts branch from f9cb2ef to 4279b35 Compare June 25, 2026 14:09

Uh oh!

Conversation

crazydemo commented Jun 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Description

Test Coverage

PR Checklist

GitHub Bot Help

Uh oh!

crazydemo commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

coderabbitai Bot commented Jun 25, 2026

Walkthrough

Changes

Sequence Diagram(s)

Estimated code review effort

Possibly related PRs

Suggested reviewers

❌ Failed checks (2 warnings)

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

crazydemo commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

crazydemo commented Jun 25, 2026

Uh oh!

tensorrt-cicd commented Jun 25, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

crazydemo commented Jun 25, 2026 •

edited by coderabbitai Bot

Loading