Skip to content

[None][infra] take test durations into account to determine cbts splits num#15614

Open
crazydemo wants to merge 5 commits into
NVIDIA:mainfrom
crazydemo:fix-xbts
Open

[None][infra] take test durations into account to determine cbts splits num#15614
crazydemo wants to merge 5 commits into
NVIDIA:mainfrom
crazydemo:fix-xbts

Conversation

@crazydemo

@crazydemo crazydemo commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • New Features

    • Added split-sizing details to CBTS decision output and dry-run summaries.
    • Test stages can now be resized more precisely based on expected runtime, keeping shard counts aligned with stage needs.
  • Bug Fixes

    • Improved how affected test stages are reported so downstream reporting reflects the final stage set consistently.
  • Documentation

    • Updated CBTS guidance and examples to describe the new split-sizing behavior and output fields.
  • Chores

    • Marked some CI decision scripts as out of scope for change detection.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

@crazydemo crazydemo requested review from a team as code owners June 25, 2026 05:53
@crazydemo crazydemo requested review from tburt-nv and yiqingy0 June 25, 2026 05:53
@crazydemo

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@crazydemo crazydemo requested a review from EmmaQiaoCh June 25, 2026 05:54
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55711 [ run ] triggered by Bot. Commit: 803b477 Link to invocation

@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

CBTS now computes per-stage split counts from duration data, serializes them in the decision output, and consumes them in Jenkins scripts to resize stage splits. The related docs, reporting, out-of-scope routing, and temporary waiver entries were updated to match.

Changes

CBTS split sizing

Layer / File(s) Summary
Duration-derived split counts
jenkins/scripts/cbts/blocks.py, jenkins/scripts/cbts/main.py
Python CBTS helpers compute per-stage split counts from duration data and add affected_stage_split_counts to the emitted decision JSON.
Jenkins split-count consumers
jenkins/L0_MergeRequest.groovy, jenkins/L0_Test.groovy, jenkins/scripts/cbts/rules/out_of_scope_rule.py
The CBTS result parser records split counts, the noop prefix list includes the Jenkins scripts, and L0_Test.groovy resizes stage configs from the new split-count map.
Split-count reporting and docs
jenkins/scripts/cbts/README.md, jenkins/scripts/cbts/tools/dryrun.py, jenkins/scripts/cbts/tools/report_cbts_decision.py
CBTS README text, dryrun formatting, and OpenSearch reporting include the new split-count field and split-sizing wording.
Temporary CI waivers
tests/integration/test_lists/waives.txt
waives.txt adds temporary SKIP waivers and a revert-before-merge marker for the split-sizing verification run.

Sequence Diagram(s)

sequenceDiagram
  participant blocks.py
  participant main.py
  participant L0_MergeRequest.groovy
  participant L0_Test.groovy
  main.py->>blocks.py: load_durations() and compute_stage_split_counts()
  blocks.py-->>main.py: affected_stage_split_counts
  main.py->>L0_MergeRequest.groovy: serialize affected_stage_split_counts in decision JSON
  main.py->>L0_Test.groovy: serialize affected_stage_split_counts in decision JSON
  L0_MergeRequest.groovy->>L0_MergeRequest.groovy: _cbtsParseSelectionResult()
  L0_Test.groovy->>L0_Test.groovy: cbtsResizeSplits()
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#13382: Adds the CBTS decision schema and Groovy consumption that this PR extends with affected_stage_split_counts.
  • NVIDIA/TensorRT-LLM#15576: Updates the same duration-based CBTS inputs and CI waivers that feed the new split-count computation.
  • NVIDIA/TensorRT-LLM#15592: Reworks the L0 test split-handling path that this PR replaces with duration-based split resizing.

Suggested reviewers

  • tburt-nv
  • yuanjingx87
  • yiqingy0
🚥 Pre-merge checks | ✅ 3 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description keeps the template but leaves Description and Test Coverage empty, so it doesn't explain the change or how it was tested. Add a short Description of the issue and solution, plus Test Coverage listing the relevant tests or verification steps for this change.
Docstring Coverage ⚠️ Warning Docstring coverage is 72.73% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Title check ✅ Passed The title is concise and matches the main change: taking test durations into account for CBTS split sizing.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1706-1732: The shard reduction in cbtsResizeSplits is happening
before the CBTS test-db is actually guaranteed, which can silently skip tests if
renderTestDB falls back to the source DB. Update the CBTS flow so shard counts
are only shrunk after the narrowed DB override is successfully materialized, or
make renderTestDB fail hard when the CBTS artifact cannot be
downloaded/extracted; use cbtsResizeSplits, renderTestDB, and
affected_stage_split_counts to locate the affected logic.

In `@jenkins/scripts/cbts/blocks.py`:
- Around line 642-650: The duration-cache fallback in load_durations should not
cause shard resizing when .test_durations is missing or unreadable. Update the
split-planning flow around load_durations, the defaulting logic that currently
falls back to 1.0, and the narrowed-stage handling in the code that computes
shard counts so that an empty durations result produces no overrides and
preserves stage.total_splits instead of shrinking to 1. Use the existing
load_durations and shard calculation paths as the place to gate this behavior.

In `@jenkins/scripts/cbts/rules/out_of_scope_rule.py`:
- Around line 60-62: The `OutOfScopeRule` temporary exclusions for
`jenkins/L0_Test.groovy` and `jenkins/L0_MergeRequest.groovy` must be removed
before merge so those paths are not treated as noop. Update the exclusion list
in `out_of_scope_rule.py` to drop these two entries and keep only the intended
permanent out-of-scope paths, ensuring CBTS still reacts to changes in its
Groovy producer/consumer layer.

In `@tests/integration/test_lists/waives.txt`:
- Line 1: Remove the temporary CI verification waivers from
tests/integration/test_lists/waives.txt, including the top-of-file REVERT before
merge marker and any https://nvbugs/0000000 placeholder entries. Update the
waiver list so it no longer suppresses the newly covered
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::{test_eagle3,test_pard},
accuracy/test_llm_api_pytorch_multimodal.py::TestNanoV3Omni::test_auto_dtype[fp8],
or the broad unittest/_torch/*, unittest/disaggregated/test_kv_transfer.py, and
unittest/llmapi/test_llm_telemetry.py coverage. If any waiver must remain,
replace the temporary scaffold with a real follow-up entry and valid bug ID.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 7b76eded-afda-46f9-9aae-f69debd26acc

📥 Commits

Reviewing files that changed from the base of the PR and between bcb9441 and 803b477.

📒 Files selected for processing (9)
  • jenkins/L0_MergeRequest.groovy
  • jenkins/L0_Test.groovy
  • jenkins/scripts/cbts/README.md
  • jenkins/scripts/cbts/blocks.py
  • jenkins/scripts/cbts/main.py
  • jenkins/scripts/cbts/rules/out_of_scope_rule.py
  • jenkins/scripts/cbts/tools/dryrun.py
  • jenkins/scripts/cbts/tools/report_cbts_decision.py
  • tests/integration/test_lists/waives.txt

Comment thread jenkins/L0_Test.groovy Outdated
Comment thread jenkins/scripts/cbts/blocks.py
Comment thread jenkins/scripts/cbts/rules/out_of_scope_rule.py
Comment thread tests/integration/test_lists/waives.txt Outdated
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55711 [ run ] completed with state FAILURE. Commit: 803b477
/LLM/main/L0_MergeRequest_PR pipeline #44614 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo crazydemo force-pushed the fix-xbts branch 3 times, most recently from fa71685 to 2f4b1fc Compare June 25, 2026 07:26
@crazydemo

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55741 [ run ] triggered by Bot. Commit: 2f4b1fc Link to invocation

@crazydemo crazydemo force-pushed the fix-xbts branch 5 times, most recently from f11e970 to 0d4302b Compare June 25, 2026 08:03
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55741 [ run ] completed with state SUCCESS. Commit: 2f4b1fc
/LLM/main/L0_MergeRequest_PR pipeline #44643 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

  • Please check the failed tests and fix your PR
  • If you cannot view the failures, ask the CI triggerer to share details
  • Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

@crazydemo crazydemo force-pushed the fix-xbts branch 3 times, most recently from 46456ba to f257ccb Compare June 25, 2026 13:43
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
A CBTS-narrowed stage runs a reduced test set; both the whole-stage reuse
(REUSE_STAGE_LIST) and per-test reuse (reusePassedTestResults) key on the
stage name, so on the same commit a later non-CBTS full run could reuse a
narrowed stage's result and skip full coverage. Rename narrowed stages with
a 'cbts-' prefix (cbtsResizeSplits) so their results never match a full
run's stage name. Strip the prefix before positional stage-name parsing
(parseTaskConfigFromStageName, getSSHConnectionPorts via stripCbtsStagePrefix)
and prefix the Layer 2 affectedSet to match. Sanity/PerfSanity full stages
keep their original names and reuse normally.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
@crazydemo

Copy link
Copy Markdown
Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55793 [ run ] triggered by Bot. Commit: f9cb2ef Link to invocation

The tarball was written to /tmp and uploaded via rtUpload, whose pattern is
resolved relative to the workspace -- so the /tmp path matched 0 files and the
artifact was silently never deployed (uploadArtifacts does not throw on 0
matches). renderTestDB's download then 404'd and fell back to source. Harmless
until CBTS split-resize (resized stage on source list times out). Tar into and
upload from ${LLM_ROOT}, like every other uploadArtifacts call. Pre-existing
since NVIDIA#15142.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
…loadable

cbtsResizeSplits drops shards based on the narrowed list, but renderTestDB
falls back to the source test-db if the cbts_test_db artifact can't be
downloaded -- a resized stage would then run the full source list on too few
shards and hit the SLURM walltime. Probe the artifact (wget --spider) once on
the L0_Test agent (same job as the executors); if it's not retrievable, clear
cbts_test_db_artifact_path so split-resize, the stage rename, Layer 2, and
renderTestDB all uniformly fall back to the full source list on full splits.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Temporarily claims jenkins/L0_Test.groovy + L0_MergeRequest.groovy as
out-of-scope so CBTS does not fall back on this branch's groovy changes,
and adds 15 high-duration waives so DGX_H100-PyTorch narrows to k=3 (6->3
shards) and DGX_B200-PyTorch to k=2 (9->2). Drop this commit after the
/bot run verification.

Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants