Skip to content

multi-publish (and single in publish mode) deadlocks on payloads >=100KB after ~7-8 messages #74

@ColinkaMir

Description

@ColinkaMir

Summary

The two publisher CLIs in grpc_p2p_client/ open a bidirectional gRPC stream (ListenCommands) and only call stream.Send() in a loop. They never stream.Recv(). Because the server emits trace-event Response messages for every publish, the per-stream HTTP/2 flow-control window fills up, and the publisher then blocks indefinitely on stream.Send. The bug only manifests at larger payload sizes / higher counts because the window is otherwise large enough to absorb the responses.

Affected files:

  • grpc_p2p_client/cmd/multi-publish/main.go (function sendMessages)
  • grpc_p2p_client/cmd/single/main.go (function publish)

Subscriber paths are not affected: both cmd/multi-subscribe and cmd/single in subscribe mode already drain stream.Recv() in their main loop.

Reproduction

Local Docker stack from this repo, image getoptimum/p2pnode:v0.0.1-rc16.

# 1. Bring up the local 4-node mesh per the README.
make generate-identity
cp .env.example .env  # fill BOOTSTRAP_PEER_ID
docker compose -f docker-compose-optimum.yml up -d
make build

# 2. Try to publish 40 messages of 100 KB each.
cat > ips.txt <<EOF
127.0.0.1:33221
127.0.0.1:33222
127.0.0.1:33223
127.0.0.1:33224
EOF

./grpc_p2p_client/p2p-multi-publish \
  -topic=repro \
  -ipfile=ips.txt \
  -start-index=0 -end-index=1 \
  -count=40 -datasize=102400 -sleep=200ms

Expected

40 publications complete in roughly 40 × 200ms = ~8 seconds.

Actual

Process publishes exactly 7 messages, then stream.Send blocks indefinitely. Verified twice in two independent runs on the same stack, identical signature both times: 7 publications, then hang. Observed wait time on the first run before manual SIGINT: ~28 minutes. Second run was bounded by timeout 30 and confirmed exit code 124 (process killed by timer, not by completion).

Truncated output:

[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.247ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.003ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.364ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 437us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 539us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 527us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 883us)
<hangs here>

The server side keeps running normally during the hang (p2pnode-1 keeps emitting periodic state-dump heartbeats), the libp2p mesh is healthy, no error is logged on either side. Only the publisher's stream.Send is blocked.

Smaller payloads do not trigger the bug at the same -count: -count=50 -datasize=1024 completes cleanly because the responses are small enough that the per-stream window never fills.

Root cause

ListenCommands is declared as a bidirectional streaming RPC:

service CommandStream {
  rpc ListenCommands (stream Request) returns (stream Response) {}
}

The server uses the Response direction to push trace events back to the caller (the same trace stream that multi-subscribe consumes via stream.Recv and writes to -output-trace). On every publish, the server emits multiple Response messages (a PUBLISH_MESSAGE event plus shard-related NEW_SHARD / RECV_RPC events, with payload-sized metadata for each shard).

In cmd/multi-publish/main.go sendMessages:

stream, err := client.ListenCommands(ctx)
// ... no goroutine draining stream.Recv() ...
for i := 0; i < *count; i++ {
    // ...
    if err := stream.Send(pubReq); err != nil {
        return fmt.Errorf("[%s] send publish: %w", ip, err)
    }
    // ...
}

The client never reads the server-to-client direction. With default gRPC settings (initial HTTP/2 stream window 64 KiB, growing only via WINDOW_UPDATE from the client), the server-side eventually has its outbound buffer full for this stream. In Go's gRPC server implementation, the handler goroutine that produces the response also reads incoming requests for the same bidi stream; when the response side blocks, the handler blocks, request reads stop, the client-to-server flow-control window then drains, and stream.Send on the client blocks too.

This pattern is documented in the Go gRPC project and surfaces in several public reports of bidi-stream deadlock when one direction is not drained. References:

Fix

Spawn a goroutine that drains stream.Recv until EOF or error. This discards the trace events (which the publisher does not use anyway) and keeps the response side unblocked. ~5 lines per call site, no proto change, no API change.

A patch is in PR # on this issue. Verified on getoptimum/p2pnode:v0.0.1-rc16:

test before fix after fix
-count=40 -datasize=102400 -sleep=200ms hangs after 7 publications completes in ~8.06 s
-count=20 -datasize=1048576 -sleep=400ms hangs after ~7 publications completes in ~8.10 s
-count=10 -datasize=2516582 -sleep=600ms hangs after ~3-4 publications completes in ~6.12 s

Environment

  • Repo: optimum-dev-setup-guide@bd8c0b8 (current main).
  • Images: getoptimum/p2pnode:v0.0.1-rc16 + getoptimum/proxy:v0.0.1-rc16.
  • OS: Ubuntu 22.04 LTS on x86_64.
  • Go: 1.26.3 (snap).
  • Docker: 29.3.0.
  • 4-node local mesh, full mesh (MeshDegreeTarget=6 with N=4).

Related observation (not part of this issue)

While investigating this, I noticed the proxy REST POST /api/v1/publish endpoint with message_length does NOT exhibit the same problem (no bidi stream, just request/response). Using REST is a viable workaround for anyone hitting the deadlock today.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions