fix(vllm): serialise AsyncMPClient input_socket sends to prevent zmq race by kaloyan-inherent · Pull Request #2513 · NVIDIA-NeMo/RL

kaloyan-inherent · 2026-05-17T12:09:27Z

What does this PR do ?

This PR addresses issue #2512. It wraps _shadow_sock.send_multipart with a threading.Lock, which serialises sends on the socket when the HTTP server is exposed.
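To illustrate the idea of guarding send_multipart with a threading.Lock, here is a minimal, self-contained sketch. It is not the code from this PR: SerialisedSender, RecordingSocket, and run_demo are hypothetical names, and RecordingSocket is a stand-in for a real ZeroMQ socket so the example runs with the standard library alone. The sketch assumes only that the wrapped object exposes a send_multipart method.

```python
import threading


class SerialisedSender:
    """Hypothetical wrapper: allow only one send_multipart call at a time."""

    def __init__(self, sock):
        self._sock = sock
        self._lock = threading.Lock()

    def send_multipart(self, frames, **kwargs):
        # ZeroMQ sockets are not thread-safe, so hold the lock for the whole
        # multipart send to keep frames from different callers apart.
        with self._lock:
            return self._sock.send_multipart(frames, **kwargs)


class RecordingSocket:
    """Stand-in for a real socket: records messages and flags overlapping sends."""

    def __init__(self):
        self.messages = []
        self.overlap_detected = False
        self._busy = False

    def send_multipart(self, frames, **kwargs):
        if self._busy:
            # Another caller is still inside send_multipart.
            self.overlap_detected = True
        self._busy = True
        try:
            self.messages.append(list(frames))
        finally:
            self._busy = False


def run_demo(num_threads=8, sends_per_thread=200):
    """Send from several threads through the locked wrapper."""
    sock = RecordingSocket()
    sender = SerialisedSender(sock)

    def worker(tid):
        for i in range(sends_per_thread):
            sender.send_multipart([b"%d" % tid, b"%d" % i])

    threads = [threading.Thread(target=worker, args=(t,)) for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sock
```

Calling run_demo() sends from several threads through the wrapper. Because each call holds the lock for the full multipart send, the stand-in socket should never observe two sends in progress at once.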

Issues

Closes #2512.

Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>

copy-pr-bot · 2026-05-17T12:09:31Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

terrykong · 2026-05-20T17:16:53Z

Thanks @kaloyan-inherent

@yuki-97 @bxyu-nvidia @ananthsub @sharonyu-115 could you review?

yuki-97

@kaloyan-inherent thanks for the investigation and fix! overall LGTM, only some minor comments.

Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>

kaloyan-inherent · 2026-05-21T17:30:34Z

> @kaloyan-inherent thanks for the investigation and fix! overall LGTM, only some minor comments.

great thanks for the review -- addressed both comments

yuki-97 · 2026-05-22T14:02:38Z

/ok to test b6035f6

yuki-97 · 2026-05-22T14:06:11Z

@bxyu-nvidia @ananthsub @sharonyu-115 could you help review this as well?

vllm asyncmpc client: seriealise requests to input socket

fd9931a

Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>

kaloyan-inherent requested a review from a team as a code owner May 17, 2026 12:09

github-actions Bot added the community-request label May 17, 2026

kaloyan-inherent mentioned this pull request May 17, 2026

[bug] async_grpo with in flight weight updates hangs #2512

Open

kaloyan-inherent changed the title ~~fix(vllm): serialise AsyncMPClient.input_socket sends to prevent zmq race and engine failure~~ fix(vllm): serialise AsyncMPClient input_socket sends to prevent zmq race May 17, 2026

svcnvidia-nemo-ci added the waiting-on-maintainers Waiting on maintainers to respond label May 19, 2026

terrykong requested review from ananthsub, bxyu-nvidia, sharonyu-115 and yuki-97 May 20, 2026 17:16

svcnvidia-nemo-ci removed the waiting-on-maintainers Waiting on maintainers to respond label May 20, 2026

yuki-97 reviewed May 21, 2026

View reviewed changes

Comment thread nemo_rl/models/generation/vllm/vllm_worker_async.py Outdated

Comment thread nemo_rl/models/generation/vllm/vllm_worker_async.py Outdated

svcnvidia-nemo-ci added the waiting-on-customer Waiting on the original author to respond label May 21, 2026

pr review address: better comment + remove defensive guard

225d7af

Signed-off-by: Kaloyan <253267049+kaloyan-inherent@users.noreply.github.com>

kaloyan-inherent requested a review from yuki-97 May 21, 2026 17:30

svcnvidia-nemo-ci removed the waiting-on-customer Waiting on the original author to respond label May 21, 2026

Merge branch 'main' into kally/async-grpo-hang

b6035f6

yuki-97 added the CI:Lfast Runs a fast test suite and re-use nightly `main` container (but sync dependencies to PRs version) label May 22, 2026

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:02 Inactive

copy-pr-bot Bot temporarily deployed to nemo-ci May 22, 2026 14:03 Inactive

copy-pr-bot Bot had a problem deploying to nemo-ci May 22, 2026 14:03 Failure

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:03 Inactive

yuki-97 approved these changes May 22, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to public May 22, 2026 14:07 Inactive

kaloyan-inherent wants to merge 3 commits into NVIDIA-NeMo:main from kaloyan-inherent:kally/async-grpo-hang

kaloyan-inherent commented May 17, 2026 (edited by yuki-97)


4 participants
