[TRTLLM-13409][feat] hard-exit on HangDetector fire + cross-rank propagation #15612

Draft
JunyiXu-nv wants to merge 1 commit into NVIDIA:main from JunyiXu-nv:dev-junyix-feat-hang-hard-kill

Conversation

@JunyiXu-nv

Collaborator

@coderabbitai summary

Description

The runtime's HangDetector already fires on a stalled executor loop, but on_detected only ran the graceful shutdown path. That path can itself deadlock on collectives, while peer ranks stay blocked in NCCL, holding their GPUs until the job's wall-clock pod-kill. Detection without escalation is the dominant source of CI GPU-hours lost to hangs.

Change

Rework HangDetector into a two-tier watchdog:

  • Soft tier (TLLM_HANG_DETECT_T_SOFT, default 60s): dump all thread stacks + warn. No kill.
  • Hard tier (TLLM_HANG_DETECT_T_HARD, default 300s): dump stacks, then hard-kill and propagate. propagate_hard_kill() calls MPI_Abort when MPI is initialized with THREAD_MULTIPLE (which makes the call safe from the daemon thread); otherwise it sends SIGKILL to itself and lets the launcher tear down peers. Emits exit code 137 (128 + SIGKILL) for CI no-retry classifiers (future ST-8).
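
The path selection in the hard tier could look roughly like the sketch below. It is not the code from this PR: `choose_kill_path` and `propagate_hard_kill_sketch` are hypothetical names, and the sketch assumes mpi4py is the MPI binding and only consults it when the process has already imported it.

```python
import os
import signal
import sys

# 128 + signal number is the shell convention for "killed by signal".
HARD_KILL_EXIT_CODE = 128 + int(signal.SIGKILL)


def choose_kill_path(mpi_initialized: bool, thread_multiple: bool) -> str:
    """Pick MPI_Abort only when MPI calls from a non-main thread are safe."""
    if mpi_initialized and thread_multiple:
        return "mpi_abort"
    return "self_sigkill"


def propagate_hard_kill_sketch(exit_code: int = HARD_KILL_EXIT_CODE) -> None:
    """Hypothetical stand-in for propagate_hard_kill(); does not return."""
    # Assumption: only use MPI if the process already imported mpi4py.MPI,
    # so the hang path never triggers MPI initialization itself.
    mpi = sys.modules.get("mpi4py.MPI")
    initialized = mpi is not None and mpi.Is_initialized()
    multiple = initialized and mpi.Query_thread() == mpi.THREAD_MULTIPLE

    if choose_kill_path(initialized, multiple) == "mpi_abort":
        # Asks the MPI runtime to tear down every rank in the job.
        mpi.COMM_WORLD.Abort(exit_code)
    # Fallback (or if Abort somehow returns): kill this process and
    # leave peer teardown to the launcher.
    os.kill(os.getpid(), signal.SIGKILL)
```

Keeping the decision in a pure function makes the branch testable on CPU without MPI or a real kill.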

Gated by TLLM_HARD_KILL_ON_HANG (default on; set it to 0 for long-running served deployments to keep only the soft tier). Kill timing is unchanged from the prior single-tier detector: the soft tier is an additional, earlier diagnostic, so detection latency does not regress.
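
As a rough illustration of the two tiers and the gate, here is a minimal sketch. The class `TwoTierWatchdog`, its `checkpoint`/`tick` methods, and the rule that only the literal `0` disables the hard tier are assumptions made for this example, not the PR's implementation. Time is passed in explicitly so the logic can be exercised without sleeping.

```python
import faulthandler
import os
import sys
from typing import Callable, Mapping


def hard_kill_enabled(environ: Mapping[str, str] = os.environ) -> bool:
    """Gate check. Assumption: only the literal "0" disables the hard tier."""
    return environ.get("TLLM_HARD_KILL_ON_HANG", "1").strip() != "0"


def dump_all_stacks() -> None:
    """Soft-tier diagnostic: write every thread's stack to stderr."""
    faulthandler.dump_traceback(file=sys.stderr, all_threads=True)


class TwoTierWatchdog:
    """Hypothetical two-tier timer driven by caller-supplied timestamps."""

    def __init__(self, on_soft: Callable[[], None],
                 on_hard: Callable[[], None],
                 t_soft: float = 60.0, t_hard: float = 300.0,
                 hard_enabled: bool = True) -> None:
        if not 0 < t_soft < t_hard:
            raise ValueError("expected 0 < t_soft < t_hard")
        self.on_soft, self.on_hard = on_soft, on_hard
        self.t_soft, self.t_hard = t_soft, t_hard
        self.hard_enabled = hard_enabled
        self._last = 0.0
        self._soft_fired = False
        self._hard_fired = False

    def checkpoint(self, now: float) -> None:
        """Record progress and re-arm the soft tier."""
        self._last = now
        self._soft_fired = False

    def tick(self, now: float) -> str:
        """Fire at most one tier per call; return which one fired."""
        idle = now - self._last
        if self.hard_enabled and not self._hard_fired and idle >= self.t_hard:
            self._hard_fired = True
            self.on_hard()
            return "hard"
        if not self._soft_fired and idle >= self.t_soft:
            self._soft_fired = True
            self.on_soft()
            return "soft"
        return "ok"
```

In a real detector a daemon thread would call `tick(time.monotonic())` on a short interval while the executor loop calls `checkpoint(time.monotonic())` on each iteration, with `dump_all_stacks` as the soft callback.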

Test Coverage

  • 14 CPU unit tests (tests/unittest/_torch/pyexecutor/test_hang_detector_kill.py): tier ordering, no-kill soft tier, checkpoint reset, pause, gate env parsing, Path-A self-SIGKILL.
  • Multi-GPU acceptance (4×H100): a 4-rank mpirun job with rank 1 wedged → hard tier fired at 5s → MPI_Abort(137) tore down all 4 ranks in 28s (soft tier dumped stacks first), vs. the prior wall-clock pod-kill.
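
One way a CPU-only test can exercise a self-SIGKILL without killing the test runner is to do it in a child process. The sketch below is an assumption about how such a test might be shaped, not the contents of test_hang_detector_kill.py; the helper names are hypothetical, and it is POSIX-only. It also shows where 137 comes from: Python's subprocess reports death by signal N as `-N`, while shells report `128 + N`.

```python
import signal
import subprocess
import sys

# Child program: send SIGKILL to itself, as the fallback kill path would.
CHILD = "import os, signal; os.kill(os.getpid(), signal.SIGKILL)"


def run_self_sigkill_child() -> int:
    """Run the child and return subprocess's view of its exit status."""
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=30)
    return proc.returncode


def shell_exit_code(returncode: int) -> int:
    """Map subprocess's negative-signal convention to the shell's 128 + N."""
    return 128 - returncode if returncode < 0 else returncode
```

A test would then assert that `run_self_sigkill_child()` equals `-signal.SIGKILL` and that `shell_exit_code` maps it to 137.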

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

Comment thread tensorrt_llm/_torch/pyexecutor/hang_detector.py Outdated
Comment thread tests/unittest/_torch/pyexecutor/test_hang_detector_kill.py Outdated
JunyiXu-nv force-pushed the dev-junyix-feat-hang-hard-kill branch from 3d845ff to 434b925 on June 25, 2026 04:55
Comment thread tests/unittest/_torch/pyexecutor/test_hang_detector_kill.py
JunyiXu-nv force-pushed the dev-junyix-feat-hang-hard-kill branch 2 times, most recently from 7a77657 to aae3c5f on June 25, 2026 05:31
…agation

Detection without escalation is the root cause of hang-shaped CI burn: the
HangDetector fires but on_detected only ran the graceful shutdown path, which
can itself deadlock on collectives while peer ranks stay blocked in NCCL holding
their GPUs until the wall-clock pod-kill.

Keep the existing single-tier detector and instead make on_detected hard-kill
and propagate: propagate_hard_kill() uses MPI_Abort when MPI is initialized with
THREAD_MULTIPLE (safe from the daemon thread); otherwise it sends itself SIGKILL
and lets the launcher tear down peers. The timeout remains configurable via the
existing hang_detection_timeout argument (no new env vars).

Adds CPU-only unit tests for the detector timer (fire/reset/pause) and the
Path-A self-SIGKILL.

Signed-off-by: JunyiXu-nv <219237550+JunyiXu-nv@users.noreply.github.com>
JunyiXu-nv force-pushed the dev-junyix-feat-hang-hard-kill branch from aae3c5f to a12244e on June 25, 2026 06:23
@JunyiXu-nv

Collaborator Author

/bot run --disable-fail-fast

@tensorrt-cicd

Collaborator

PR_Github #55716 [ run ] triggered by Bot. Commit: a12244e Link to invocation
