Summary
The two publisher CLIs in grpc_p2p_client/ open a bidirectional gRPC stream (ListenCommands) and only call stream.Send() in a loop. They never stream.Recv(). Because the server emits trace-event Response messages for every publish, the per-stream HTTP/2 flow-control window fills up, and the publisher then blocks indefinitely on stream.Send. The bug only manifests at larger payload sizes / higher counts because the window is otherwise large enough to absorb the responses.
Affected files:
grpc_p2p_client/cmd/multi-publish/main.go (function sendMessages)
grpc_p2p_client/cmd/single/main.go (function publish)
Subscriber paths are not affected: both cmd/multi-subscribe and cmd/single in subscribe mode already drain stream.Recv() in their main loop.
Reproduction
Local Docker stack from this repo, image getoptimum/p2pnode:v0.0.1-rc16.
# 1. Bring up the local 4-node mesh per the README.
make generate-identity
cp .env.example .env # fill BOOTSTRAP_PEER_ID
docker compose -f docker-compose-optimum.yml up -d
make build
# 2. Try to publish 40 messages of 100 KB each.
cat > ips.txt <<EOF
127.0.0.1:33221
127.0.0.1:33222
127.0.0.1:33223
127.0.0.1:33224
EOF
./grpc_p2p_client/p2p-multi-publish \
-topic=repro \
-ipfile=ips.txt \
-start-index=0 -end-index=1 \
-count=40 -datasize=102400 -sleep=200ms
Expected
40 publications complete in roughly 40 × 200ms = ~8 seconds.
Actual
Process publishes exactly 7 messages, then stream.Send blocks indefinitely. Verified twice in two independent runs on the same stack, identical signature both times: 7 publications, then hang. Observed wait time on the first run before manual SIGINT: ~28 minutes. Second run was bounded by timeout 30 and confirmed exit code 124 (process killed by timer, not by completion).
Truncated output:
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.247ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.003ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.364ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 437us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 539us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 527us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 883us)
<hangs here>
The server side keeps running normally during the hang (p2pnode-1 keeps emitting periodic state-dump heartbeats), the libp2p mesh is healthy, no error is logged on either side. Only the publisher's stream.Send is blocked.
Smaller payloads do not trigger the bug at the same -count: -count=50 -datasize=1024 completes cleanly because the responses are small enough that the per-stream window never fills.
Root cause
ListenCommands is declared as a bidirectional streaming RPC:
service CommandStream {
rpc ListenCommands (stream Request) returns (stream Response) {}
}
The server uses the Response direction to push trace events back to the caller (the same trace stream that multi-subscribe consumes via stream.Recv and writes to -output-trace). On every publish, the server emits multiple Response messages (a PUBLISH_MESSAGE event plus shard-related NEW_SHARD / RECV_RPC events, with payload-sized metadata for each shard).
In cmd/multi-publish/main.go sendMessages:
stream, err := client.ListenCommands(ctx)
// ... no goroutine draining stream.Recv() ...
for i := 0; i < *count; i++ {
// ...
if err := stream.Send(pubReq); err != nil {
return fmt.Errorf("[%s] send publish: %w", ip, err)
}
// ...
}
The client never reads the server-to-client direction. With default gRPC settings (initial HTTP/2 stream window 64 KiB, growing only via WINDOW_UPDATE from the client), the server-side eventually has its outbound buffer full for this stream. In Go's gRPC server implementation, the handler goroutine that produces the response also reads incoming requests for the same bidi stream; when the response side blocks, the handler blocks, request reads stop, the client-to-server flow-control window then drains, and stream.Send on the client blocks too.
This pattern is documented in the Go gRPC project and surfaces in several public reports of bidi-stream deadlock when one direction is not drained. References:
Fix
Spawn a goroutine that drains stream.Recv until EOF or error. This discards the trace events (which the publisher does not use anyway) and keeps the response side unblocked. ~5 lines per call site, no proto change, no API change.
A patch is in PR # on this issue. Verified on getoptimum/p2pnode:v0.0.1-rc16:
| test |
before fix |
after fix |
-count=40 -datasize=102400 -sleep=200ms |
hangs after 7 publications |
completes in ~8.06 s |
-count=20 -datasize=1048576 -sleep=400ms |
hangs after ~7 publications |
completes in ~8.10 s |
-count=10 -datasize=2516582 -sleep=600ms |
hangs after ~3-4 publications |
completes in ~6.12 s |
Environment
- Repo:
optimum-dev-setup-guide@bd8c0b8 (current main).
- Images:
getoptimum/p2pnode:v0.0.1-rc16 + getoptimum/proxy:v0.0.1-rc16.
- OS: Ubuntu 22.04 LTS on x86_64.
- Go: 1.26.3 (snap).
- Docker: 29.3.0.
- 4-node local mesh, full mesh (
MeshDegreeTarget=6 with N=4).
Related observation (not part of this issue)
While investigating this, I noticed the proxy REST POST /api/v1/publish endpoint with message_length does NOT exhibit the same problem (no bidi stream, just request/response). Using REST is a viable workaround for anyone hitting the deadlock today.
Summary
The two publisher CLIs in
grpc_p2p_client/open a bidirectional gRPC stream (ListenCommands) and only callstream.Send()in a loop. They neverstream.Recv(). Because the server emits trace-eventResponsemessages for every publish, the per-stream HTTP/2 flow-control window fills up, and the publisher then blocks indefinitely onstream.Send. The bug only manifests at larger payload sizes / higher counts because the window is otherwise large enough to absorb the responses.Affected files:
grpc_p2p_client/cmd/multi-publish/main.go(functionsendMessages)grpc_p2p_client/cmd/single/main.go(functionpublish)Subscriber paths are not affected: both
cmd/multi-subscribeandcmd/singlein subscribe mode already drainstream.Recv()in their main loop.Reproduction
Local Docker stack from this repo, image
getoptimum/p2pnode:v0.0.1-rc16.Expected
40 publications complete in roughly 40 × 200ms = ~8 seconds.
Actual
Process publishes exactly 7 messages, then
stream.Sendblocks indefinitely. Verified twice in two independent runs on the same stack, identical signature both times: 7 publications, then hang. Observed wait time on the first run before manual SIGINT: ~28 minutes. Second run was bounded bytimeout 30and confirmed exit code 124 (process killed by timer, not by completion).Truncated output:
The server side keeps running normally during the hang (p2pnode-1 keeps emitting periodic state-dump heartbeats), the libp2p mesh is healthy, no error is logged on either side. Only the publisher's
stream.Sendis blocked.Smaller payloads do not trigger the bug at the same
-count:-count=50 -datasize=1024completes cleanly because the responses are small enough that the per-stream window never fills.Root cause
ListenCommandsis declared as a bidirectional streaming RPC:The server uses the
Responsedirection to push trace events back to the caller (the same trace stream thatmulti-subscribeconsumes viastream.Recvand writes to-output-trace). On every publish, the server emits multiple Response messages (aPUBLISH_MESSAGEevent plus shard-relatedNEW_SHARD/RECV_RPCevents, with payload-sized metadata for each shard).In
cmd/multi-publish/main.gosendMessages:The client never reads the server-to-client direction. With default gRPC settings (initial HTTP/2 stream window 64 KiB, growing only via
WINDOW_UPDATEfrom the client), the server-side eventually has its outbound buffer full for this stream. In Go's gRPC server implementation, the handler goroutine that produces the response also reads incoming requests for the same bidi stream; when the response side blocks, the handler blocks, request reads stop, the client-to-server flow-control window then drains, andstream.Sendon the client blocks too.This pattern is documented in the Go gRPC project and surfaces in several public reports of bidi-stream deadlock when one direction is not drained. References:
Fix
Spawn a goroutine that drains
stream.Recvuntil EOF or error. This discards the trace events (which the publisher does not use anyway) and keeps the response side unblocked. ~5 lines per call site, no proto change, no API change.A patch is in PR # on this issue. Verified on
getoptimum/p2pnode:v0.0.1-rc16:-count=40 -datasize=102400 -sleep=200ms-count=20 -datasize=1048576 -sleep=400ms-count=10 -datasize=2516582 -sleep=600msEnvironment
optimum-dev-setup-guide@bd8c0b8(currentmain).getoptimum/p2pnode:v0.0.1-rc16+getoptimum/proxy:v0.0.1-rc16.MeshDegreeTarget=6with N=4).Related observation (not part of this issue)
While investigating this, I noticed the
proxyRESTPOST /api/v1/publishendpoint withmessage_lengthdoes NOT exhibit the same problem (no bidi stream, just request/response). Using REST is a viable workaround for anyone hitting the deadlock today.