Skip to content

fix(vllm): serialise AsyncMPClient input_socket sends to prevent zmq race#2513

Open
kaloyan-inherent wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
kaloyan-inherent:kally/async-grpo-hang
Open

fix(vllm): serialise AsyncMPClient input_socket sends to prevent zmq race#2513
kaloyan-inherent wants to merge 3 commits into
NVIDIA-NeMo:mainfrom
kaloyan-inherent:kally/async-grpo-hang

Conversation

@kaloyan-inherent
Copy link
Copy Markdown

@kaloyan-inherent kaloyan-inherent commented May 17, 2026

What does this PR do ?

This PR addresses issue #2512 ; It wraps _shadow_sock.send_multipart with a threading.Lock, which serialises access when the http server is exposed.

Issues

Closes #2512.

Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>
@kaloyan-inherent kaloyan-inherent requested a review from a team as a code owner May 17, 2026 12:09
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@kaloyan-inherent kaloyan-inherent changed the title fix(vllm): serialise AsyncMPClient.input_socket sends to prevent zmq race and engine failure fix(vllm): serialise AsyncMPClient input_socket sends to prevent zmq race May 17, 2026
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 19, 2026
@terrykong
Copy link
Copy Markdown
Collaborator

Thanks @kaloyan-inherent

@yuki-97 @bxyu-nvidia @ananthsub @sharonyu-115 coudl you review?

@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 20, 2026
Copy link
Copy Markdown
Contributor

@yuki-97 yuki-97 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kaloyan-inherent thanks for the investigation and fix! overall LGTM, only some minor comments.

Comment thread nemo_rl/models/generation/vllm/vllm_worker_async.py Outdated
Comment thread nemo_rl/models/generation/vllm/vllm_worker_async.py Outdated
@svcnvidia-nemo-ci svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 21, 2026
Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>
@kaloyan-inherent
Copy link
Copy Markdown
Author

@kaloyan-inherent thanks for the investigation and fix! overall LGTM, only some minor comments.

great thanks for the review -- addressed both comments

@kaloyan-inherent kaloyan-inherent requested a review from yuki-97 May 21, 2026 17:30
@svcnvidia-nemo-ci svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 21, 2026
@yuki-97 yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 22, 2026
@yuki-97
Copy link
Copy Markdown
Contributor

yuki-97 commented May 22, 2026

/ok to test b6035f6

@yuki-97
Copy link
Copy Markdown
Contributor

yuki-97 commented May 22, 2026

@bxyu-nvidia @ananthsub @sharonyu-115 could you help to take a review as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) community-request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[bug] async_grpo with in flight weight updates hangs

4 participants