Skip to content

[None][infra] Fix node list query failing on tcsh login nodes#15623

Open
yiqingy0 wants to merge 1 commit into
NVIDIA:mainfrom
yiqingy0:fix_slurm_command
Open

[None][infra] Fix node list query failing on tcsh login nodes#15623
yiqingy0 wants to merge 1 commit into
NVIDIA:mainfrom
yiqingy0:fix_slurm_command

Conversation

@yiqingy0

@yiqingy0 yiqingy0 commented Jun 25, 2026

Copy link
Copy Markdown
Collaborator

Summary by CodeRabbit

  • Bug Fixes
    • Improved how job node list information is queried on remote hosts, making SLURM metadata retrieval more reliable.
    • Fixed command handling so fallback lookup behavior works correctly when the primary query is unavailable.

Description

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Signed-off-by: yiqingy <yiqingy@tensorrt-llm-infra-debug-vm-01.nvidia.com>
@yiqingy0

Copy link
Copy Markdown
Collaborator Author

/bot run --stage-list "DGX_H100-PyTorch-1"

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55753 [ run ] triggered by Bot. Commit: 426588e Link to invocation

@yiqingy0 yiqingy0 marked this pull request as ready for review June 25, 2026 11:23
@yiqingy0 yiqingy0 requested review from a team as code owners June 25, 2026 11:23
@coderabbitai

coderabbitai Bot commented Jun 25, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Walkthrough

The PR updates two remote SLURM NodeList command strings in jenkins/L0_Test.groovy to run through bash -c with adjusted quoting.

Changes

SLURM NodeList command quoting

Layer / File(s) Summary
Remote NodeList command quoting
jenkins/L0_Test.groovy
The remote sacct pipeline and scontrol show job fallback parsing are both wrapped in bash -c with revised escaping for NodeList extraction.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~2 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description is still the template text and lacks a real title, issue summary, solution, and test coverage. Replace the placeholders with a proper PR title, a short problem/solution description, and the tests run to verify the fix.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title matches the PR's main change and follows the required [None][type] summary format.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
jenkins/L0_Test.groovy (1)

2133-2143: 🔒 Security & Privacy | 🔴 Critical | ⚡ Quick win

Validate capturedJobID before shell interpolation to prevent command injection.

capturedJobID is interpolated into remote shell commands without a numeric guard. If slurm_job_id.txt is tampered, this can execute arbitrary shell on the login node.

Suggested fix
 def captureSlurmJobNodeList(def pipeline, SlurmCluster cluster, String clusterName, String slurmJobID, Map placementContext, String stageName, String jobWorkspace=null)
 {
@@
-            if (!capturedJobID) {
+            if (!capturedJobID) {
                 return
             }
+            if (!(capturedJobID.toString() ==~ /\d+/)) {
+                echo "[INFRA-RETRY] ${stageName}: invalid SLURM job id '${capturedJobID}', skipping node-list capture"
+                return
+            }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@jenkins/L0_Test.groovy` around lines 2133 - 2143, The remote Slurm lookup in
the job-node discovery logic interpolates capturedJobID directly into
Utils.sshUserCmd/Utils.exec shell commands, so validate that value is strictly
numeric before building the sacct and scontrol commands. Add a guard in the same
flow that reads capturedJobID and only proceed when it matches an expected
job-id format; otherwise fail fast or skip lookup. Keep the fix local to the
node resolution block that uses capturedJobID so the shell interpolation remains
safe.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@jenkins/L0_Test.groovy`:
- Around line 2133-2143: The remote Slurm lookup in the job-node discovery logic
interpolates capturedJobID directly into Utils.sshUserCmd/Utils.exec shell
commands, so validate that value is strictly numeric before building the sacct
and scontrol commands. Add a guard in the same flow that reads capturedJobID and
only proceed when it matches an expected job-id format; otherwise fail fast or
skip lookup. Keep the fix local to the node resolution block that uses
capturedJobID so the shell interpolation remains safe.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 685eadf3-5e6a-4a50-83fc-f33c46baa6ce

📥 Commits

Reviewing files that changed from the base of the PR and between a02214a and 426588e.

📒 Files selected for processing (1)
  • jenkins/L0_Test.groovy

@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55753 [ run ] completed with state SUCCESS. Commit: 426588e
/LLM/main/L0_MergeRequest_PR pipeline #44652 (Partly Tested) completed with status: 'SUCCESS'

CI Report

Link to invocation

@tburt-nv

Copy link
Copy Markdown
Collaborator

/bot reuse-pipeline

@tburt-nv tburt-nv enabled auto-merge (squash) June 25, 2026 14:23
@tensorrt-cicd

Copy link
Copy Markdown
Collaborator

PR_Github #55798 [ reuse-pipeline ] triggered by Bot. Commit: 426588e Link to invocation

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants