[None][infra] take test durations into account to determine cbts splits num#15614
[None][infra] take test durations into account to determine cbts splits num#15614crazydemo wants to merge 5 commits into
Conversation
|
/bot run --disable-fail-fast |
|
PR_Github #55711 [ run ] triggered by Bot. Commit: |
📝 WalkthroughWalkthroughCBTS now computes per-stage split counts from duration data, serializes them in the decision output, and consumes them in Jenkins scripts to resize stage splits. The related docs, reporting, out-of-scope routing, and temporary waiver entries were updated to match. ChangesCBTS split sizing
Sequence Diagram(s)sequenceDiagram
participant blocks.py
participant main.py
participant L0_MergeRequest.groovy
participant L0_Test.groovy
main.py->>blocks.py: load_durations() and compute_stage_split_counts()
blocks.py-->>main.py: affected_stage_split_counts
main.py->>L0_MergeRequest.groovy: serialize affected_stage_split_counts in decision JSON
main.py->>L0_Test.groovy: serialize affected_stage_split_counts in decision JSON
L0_MergeRequest.groovy->>L0_MergeRequest.groovy: _cbtsParseSelectionResult()
L0_Test.groovy->>L0_Test.groovy: cbtsResizeSplits()
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 3 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@jenkins/L0_Test.groovy`:
- Around line 1706-1732: The shard reduction in cbtsResizeSplits is happening
before the CBTS test-db is actually guaranteed, which can silently skip tests if
renderTestDB falls back to the source DB. Update the CBTS flow so shard counts
are only shrunk after the narrowed DB override is successfully materialized, or
make renderTestDB fail hard when the CBTS artifact cannot be
downloaded/extracted; use cbtsResizeSplits, renderTestDB, and
affected_stage_split_counts to locate the affected logic.
In `@jenkins/scripts/cbts/blocks.py`:
- Around line 642-650: The duration-cache fallback in load_durations should not
cause shard resizing when .test_durations is missing or unreadable. Update the
split-planning flow around load_durations, the defaulting logic that currently
falls back to 1.0, and the narrowed-stage handling in the code that computes
shard counts so that an empty durations result produces no overrides and
preserves stage.total_splits instead of shrinking to 1. Use the existing
load_durations and shard calculation paths as the place to gate this behavior.
In `@jenkins/scripts/cbts/rules/out_of_scope_rule.py`:
- Around line 60-62: The `OutOfScopeRule` temporary exclusions for
`jenkins/L0_Test.groovy` and `jenkins/L0_MergeRequest.groovy` must be removed
before merge so those paths are not treated as noop. Update the exclusion list
in `out_of_scope_rule.py` to drop these two entries and keep only the intended
permanent out-of-scope paths, ensuring CBTS still reacts to changes in its
Groovy producer/consumer layer.
In `@tests/integration/test_lists/waives.txt`:
- Line 1: Remove the temporary CI verification waivers from
tests/integration/test_lists/waives.txt, including the top-of-file REVERT before
merge marker and any https://nvbugs/0000000 placeholder entries. Update the
waiver list so it no longer suppresses the newly covered
accuracy/test_llm_api_pytorch.py::TestLlama3_1_8BInstruct::{test_eagle3,test_pard},
accuracy/test_llm_api_pytorch_multimodal.py::TestNanoV3Omni::test_auto_dtype[fp8],
or the broad unittest/_torch/*, unittest/disaggregated/test_kv_transfer.py, and
unittest/llmapi/test_llm_telemetry.py coverage. If any waiver must remain,
replace the temporary scaffold with a real follow-up entry and valid bug ID.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 7b76eded-afda-46f9-9aae-f69debd26acc
📒 Files selected for processing (9)
jenkins/L0_MergeRequest.groovyjenkins/L0_Test.groovyjenkins/scripts/cbts/README.mdjenkins/scripts/cbts/blocks.pyjenkins/scripts/cbts/main.pyjenkins/scripts/cbts/rules/out_of_scope_rule.pyjenkins/scripts/cbts/tools/dryrun.pyjenkins/scripts/cbts/tools/report_cbts_decision.pytests/integration/test_lists/waives.txt
|
PR_Github #55711 [ run ] completed with state
|
fa71685 to
2f4b1fc
Compare
|
/bot run --disable-fail-fast |
|
PR_Github #55741 [ run ] triggered by Bot. Commit: |
f11e970 to
0d4302b
Compare
|
PR_Github #55741 [ run ] completed with state
|
46456ba to
f257ccb
Compare
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
A CBTS-narrowed stage runs a reduced test set; both the whole-stage reuse (REUSE_STAGE_LIST) and per-test reuse (reusePassedTestResults) key on the stage name, so on the same commit a later non-CBTS full run could reuse a narrowed stage's result and skip full coverage. Rename narrowed stages with a 'cbts-' prefix (cbtsResizeSplits) so their results never match a full run's stage name. Strip the prefix before positional stage-name parsing (parseTaskConfigFromStageName, getSSHConnectionPorts via stripCbtsStagePrefix) and prefix the Layer 2 affectedSet to match. Sanity/PerfSanity full stages keep their original names and reuse normally. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
|
/bot run --disable-fail-fast |
|
PR_Github #55793 [ run ] triggered by Bot. Commit: |
The tarball was written to /tmp and uploaded via rtUpload, whose pattern is
resolved relative to the workspace -- so the /tmp path matched 0 files and the
artifact was silently never deployed (uploadArtifacts does not throw on 0
matches). renderTestDB's download then 404'd and fell back to source. Harmless
until CBTS split-resize (resized stage on source list times out). Tar into and
upload from ${LLM_ROOT}, like every other uploadArtifacts call. Pre-existing
since NVIDIA#15142.
Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
…loadable cbtsResizeSplits drops shards based on the narrowed list, but renderTestDB falls back to the source test-db if the cbts_test_db artifact can't be downloaded -- a resized stage would then run the full source list on too few shards and hit the SLURM walltime. Probe the artifact (wget --spider) once on the L0_Test agent (same job as the executors); if it's not retrievable, clear cbts_test_db_artifact_path so split-resize, the stage rename, Layer 2, and renderTestDB all uniformly fall back to the full source list on full splits. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Temporarily claims jenkins/L0_Test.groovy + L0_MergeRequest.groovy as out-of-scope so CBTS does not fall back on this branch's groovy changes, and adds 15 high-duration waives so DGX_H100-PyTorch narrows to k=3 (6->3 shards) and DGX_B200-PyTorch to k=2 (9->2). Drop this commit after the /bot run verification. Signed-off-by: Ivy Zhang <25222398+crazydemo@users.noreply.github.com>
Summary by CodeRabbit
New Features
Bug Fixes
Documentation
Chores
Description
Test Coverage
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either
api-compatibleorapi-breaking. Forapi-breaking, includeBREAKINGin the PR title.Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment
/bot help.