[TRTLLM-13409][feat] hard-exit on HangDetector fire + cross-rank propagation #15612
Draft
JunyiXu-nv wants to merge 1 commit into
JunyiXu-nv
commented
Jun 25, 2026
3d845ff to 434b925
7a77657 to aae3c5f
…agation

Detection without escalation is the root cause of hang-shaped CI burn: the HangDetector fires, but on_detected only ran the graceful shutdown path, which can itself deadlock on collectives while peer ranks stay blocked in NCCL, holding their GPUs until the wall-clock pod-kill.

Keep the existing single-tier detector and instead make on_detected hard-kill and propagate: propagate_hard_kill() uses MPI_Abort when MPI is initialized with THREAD_MULTIPLE (safe from the daemon thread); otherwise it sends itself SIGKILL and lets the launcher tear down peers. The timeout remains configurable via the existing hang_detection_timeout argument (no new env vars).

Adds CPU-only unit tests for the detector timer (fire/reset/pause) and the Path-A self-SIGKILL.

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
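A minimal sketch of the path selection described in this commit message, not the code in this PR. The name `propagate_hard_kill_sketch`, its parameters, and the injected `mpi_abort` / `kill` callables are hypothetical, chosen so the branch logic can be exercised without MPI and without killing the caller. The value 137 is the exit code mentioned in the PR description (128 plus SIGKILL's signal number 9).

```python
import os
import signal
from typing import Callable, Optional

# Exit code named in the PR description: 128 + SIGKILL (9).
HANG_EXIT_CODE = 137


def propagate_hard_kill_sketch(
    mpi_abort: Optional[Callable[[int], None]],
    thread_multiple: bool,
    kill: Callable[[int, int], None] = os.kill,
) -> str:
    """Pick a kill path and take it; returns the path name (useful only in tests).

    mpi_abort:       hypothetical callable that aborts the whole MPI job,
                     or None when MPI is not initialized.
    thread_multiple: True only if MPI was initialized with THREAD_MULTIPLE,
                     i.e. calling into MPI from a watchdog thread is allowed.
    kill:            injectable for testing; defaults to os.kill.
    """
    if mpi_abort is not None and thread_multiple:
        # MPI path: abort every rank in the job from this thread.
        mpi_abort(HANG_EXIT_CODE)
        return "mpi_abort"
    # Fallback path: SIGKILL this process only and rely on the launcher
    # to notice the dead rank and tear down the peers.
    kill(os.getpid(), signal.SIGKILL)
    return "self_sigkill"
```

If mpi4py is the MPI binding in use (an assumption), the inputs could plausibly be `MPI.COMM_WORLD.Abort` for `mpi_abort` and `MPI.Is_initialized() and MPI.Query_thread() == MPI.THREAD_MULTIPLE` for `thread_multiple`. How the PR actually wires this is not shown on this page.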
aae3c5f to a12244e
/bot run --disable-fail-fast
PR_Github #55716 [ run ] triggered by Bot. Commit:
@coderabbitai summary
Description
The runtime's HangDetector already fires on a stalled executor loop, but on_detected only ran the graceful shutdown path, which can itself deadlock on collectives while peer ranks stay blocked in NCCL, holding their GPUs until the job's wall-clock pod-kill. Detection without escalation is the dominant source of hang-shaped CI GPU-hours.
Change
Rework HangDetector into a two-tier watchdog:
Soft tier (TLLM_HANG_DETECT_T_SOFT, default 60s): dump all thread stacks and warn. No kill.
Hard tier (TLLM_HANG_DETECT_T_HARD, default 300s): dump stacks, then hard-kill and propagate. propagate_hard_kill() uses MPI_Abort when MPI is initialized with THREAD_MULTIPLE (safe from the daemon thread); otherwise it sends itself SIGKILL and lets the launcher tear down peers. Emits exit code 137 for CI no-retry classifiers (future ST-8).
Gated by TLLM_HARD_KILL_ON_HANG (default on; set 0 for long-running served deployments to keep soft-tier-only). Kill timing is unchanged from the prior single-tier detector: the soft tier is an additional earlier diagnostic, so detection latency does not regress.
Test Coverage
Unit tests (tests/unittest/_torch/pyexecutor/test_hang_detector_kill.py): tier ordering, no-kill soft tier, checkpoint reset, pause, gate env parsing, Path-A self-SIGKILL.
mpirun job with rank 1 wedged → hard tier fired at 5s → MPI_Abort(137) tore down all 4 ranks in 28s (soft tier dumped stacks first), vs. the prior wall-clock pod-kill.
PR Checklist
Please review the following before submitting your PR:
PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.
GitHub Bot Help
To see a list of available CI bot commands, please comment /bot help.
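For reviewers, a rough sketch of the two-tier timing logic described under Change in the PR description. This is not the HangDetector implementation: the class `TwoTierWatchdog`, the helpers, and the `poll_once` / `checkpoint` method names are hypothetical, and the env parsing rules (float seconds, only the literal 0 disabling the hard tier) are assumptions. The environment variable names and the 60s / 300s defaults come from the description. The `dump` and `kill` actions are injected so the tier ordering can be checked with fake timestamps.

```python
import faulthandler
import os
import threading
import time
from typing import Callable, Mapping


def timeouts_from_env(env: Mapping[str, str] = os.environ) -> tuple:
    """Return (t_soft, t_hard) in seconds; float parsing is an assumption."""
    t_soft = float(env.get("TLLM_HANG_DETECT_T_SOFT", "60"))
    t_hard = float(env.get("TLLM_HANG_DETECT_T_HARD", "300"))
    return t_soft, t_hard


def hard_kill_enabled(env: Mapping[str, str] = os.environ) -> bool:
    # Assumption: unset means on, and only the literal "0" turns it off.
    return env.get("TLLM_HARD_KILL_ON_HANG", "1") != "0"


def dump_all_stacks() -> None:
    # Writes the Python stacks of all threads to sys.stderr.
    faulthandler.dump_traceback(all_threads=True)


class TwoTierWatchdog:
    """Hypothetical sketch: soft tier dumps stacks, hard tier dumps then kills."""

    def __init__(self, t_soft: float, t_hard: float,
                 kill: Callable[[], None],
                 dump: Callable[[], None] = dump_all_stacks,
                 hard_enabled: bool = True):
        if not 0 < t_soft <= t_hard:
            raise ValueError("expected 0 < t_soft <= t_hard")
        self.t_soft, self.t_hard = t_soft, t_hard
        self.kill, self.dump, self.hard_enabled = kill, dump, hard_enabled
        self._lock = threading.Lock()
        self._last = time.monotonic()
        self._soft_fired = False

    def checkpoint(self, now: float = None) -> None:
        """Called by the monitored loop on progress; re-arms both tiers."""
        with self._lock:
            self._last = time.monotonic() if now is None else now
            self._soft_fired = False

    def poll_once(self, now: float) -> str:
        """Evaluate both tiers at time `now`; returns "ok", "soft" or "hard"."""
        with self._lock:
            idle = now - self._last
            hard_due = self.hard_enabled and idle >= self.t_hard
            soft_due = idle >= self.t_soft and not self._soft_fired
            if soft_due:
                self._soft_fired = True  # soft tier fires once per stall
        if hard_due:
            self.dump()
            self.kill()
            return "hard"
        if soft_due:
            self.dump()
            return "soft"
        return "ok"

    def run(self, stop: threading.Event, interval: float = 1.0) -> None:
        """Polling loop intended for a daemon thread."""
        while not stop.wait(interval):
            if self.poll_once(time.monotonic()) == "hard":
                return
```

With `hard_enabled=False` the sketch degrades to soft-tier-only behavior, which mirrors what the description says setting TLLM_HARD_KILL_ON_HANG to 0 is for.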