multi-publish (and single in publish mode) deadlocks on payloads >=100KB after ~7-8 messages

## Summary

The two publisher CLIs in `grpc_p2p_client/` open a bidirectional gRPC stream (`ListenCommands`) and only call `stream.Send()` in a loop. They never `stream.Recv()`. Because the server emits trace-event `Response` messages for every publish, the per-stream HTTP/2 flow-control window fills up, and the publisher then blocks indefinitely on `stream.Send`. The bug only manifests at larger payload sizes / higher counts because the window is otherwise large enough to absorb the responses.

Affected files:
- `grpc_p2p_client/cmd/multi-publish/main.go` (function `sendMessages`)
- `grpc_p2p_client/cmd/single/main.go` (function `publish`)

Subscriber paths are not affected: both `cmd/multi-subscribe` and `cmd/single` in subscribe mode already drain `stream.Recv()` in their main loop.

## Reproduction

Local Docker stack from this repo, image `getoptimum/p2pnode:v0.0.1-rc16`.

```bash
# 1. Bring up the local 4-node mesh per the README.
make generate-identity
cp .env.example .env  # fill BOOTSTRAP_PEER_ID
docker compose -f docker-compose-optimum.yml up -d
make build

# 2. Try to publish 40 messages of 100 KB each.
cat > ips.txt <<EOF
127.0.0.1:33221
127.0.0.1:33222
127.0.0.1:33223
127.0.0.1:33224
EOF

./grpc_p2p_client/p2p-multi-publish \
  -topic=repro \
  -ipfile=ips.txt \
  -start-index=0 -end-index=1 \
  -count=40 -datasize=102400 -sleep=200ms
```

## Expected

40 publications complete in roughly 40 × 200ms = ~8 seconds.

## Actual

Process publishes **exactly 7 messages**, then `stream.Send` blocks indefinitely. Verified twice in two independent runs on the same stack, identical signature both times: 7 publications, then hang. Observed wait time on the first run before manual SIGINT: ~28 minutes. Second run was bounded by `timeout 30` and confirmed exit code 124 (process killed by timer, not by completion).

Truncated output:

```
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.247ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.003ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 1.364ms)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 437us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 539us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 527us)
[127.0.0.1:33221] published 102416 bytes to "repro" (took 883us)
<hangs here>
```

The server side keeps running normally during the hang (p2pnode-1 keeps emitting periodic state-dump heartbeats), the libp2p mesh is healthy, no error is logged on either side. Only the publisher's `stream.Send` is blocked.

Smaller payloads do not trigger the bug at the same `-count`: `-count=50 -datasize=1024` completes cleanly because the responses are small enough that the per-stream window never fills.

## Root cause

`ListenCommands` is declared as a bidirectional streaming RPC:

```proto
service CommandStream {
  rpc ListenCommands (stream Request) returns (stream Response) {}
}
```

The server uses the `Response` direction to push trace events back to the caller (the same trace stream that `multi-subscribe` consumes via `stream.Recv` and writes to `-output-trace`). On every publish, the server emits multiple Response messages (a `PUBLISH_MESSAGE` event plus shard-related `NEW_SHARD` / `RECV_RPC` events, with payload-sized metadata for each shard).

In `cmd/multi-publish/main.go` `sendMessages`:

```go
stream, err := client.ListenCommands(ctx)
// ... no goroutine draining stream.Recv() ...
for i := 0; i < *count; i++ {
    // ...
    if err := stream.Send(pubReq); err != nil {
        return fmt.Errorf("[%s] send publish: %w", ip, err)
    }
    // ...
}
```

The client never reads the server-to-client direction. With default gRPC settings (initial HTTP/2 stream window 64 KiB, growing only via `WINDOW_UPDATE` from the client), the server-side eventually has its outbound buffer full for this stream. In Go's gRPC server implementation, the handler goroutine that produces the response also reads incoming requests for the same bidi stream; when the response side blocks, the handler blocks, request reads stop, the client-to-server flow-control window then drains, and `stream.Send` on the client blocks too.

This pattern is documented in the Go gRPC project and surfaces in several public reports of bidi-stream deadlock when one direction is not drained. References:
- https://grpc.io/docs/languages/go/basics/#bidirectional-streaming-rpc
- https://github.com/grpc/grpc-go/issues/2350 (classic discussion of similar bidi deadlocks)

## Fix

Spawn a goroutine that drains `stream.Recv` until EOF or error. This discards the trace events (which the publisher does not use anyway) and keeps the response side unblocked. ~5 lines per call site, no proto change, no API change.

A patch is in PR #<TBD> on this issue. Verified on `getoptimum/p2pnode:v0.0.1-rc16`:

| test | before fix | after fix |
|---|---|---|
| `-count=40 -datasize=102400 -sleep=200ms` | hangs after 7 publications | completes in ~8.06 s |
| `-count=20 -datasize=1048576 -sleep=400ms` | hangs after ~7 publications | completes in ~8.10 s |
| `-count=10 -datasize=2516582 -sleep=600ms` | hangs after ~3-4 publications | completes in ~6.12 s |

## Environment

- Repo: `optimum-dev-setup-guide@bd8c0b8` (current `main`).
- Images: `getoptimum/p2pnode:v0.0.1-rc16` + `getoptimum/proxy:v0.0.1-rc16`.
- OS: Ubuntu 22.04 LTS on x86_64.
- Go: 1.26.3 (snap).
- Docker: 29.3.0.
- 4-node local mesh, full mesh (`MeshDegreeTarget=6` with N=4).

## Related observation (not part of this issue)

While investigating this, I noticed the `proxy` REST `POST /api/v1/publish` endpoint with `message_length` does NOT exhibit the same problem (no bidi stream, just request/response). Using REST is a viable workaround for anyone hitting the deadlock today.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

multi-publish (and single in publish mode) deadlocks on payloads >=100KB after ~7-8 messages #74

Summary

Reproduction

Expected

Actual

Root cause

Fix

Environment

Related observation (not part of this issue)

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

test	before fix	after fix
`-count=40 -datasize=102400 -sleep=200ms`	hangs after 7 publications	completes in ~8.06 s
`-count=20 -datasize=1048576 -sleep=400ms`	hangs after ~7 publications	completes in ~8.10 s
`-count=10 -datasize=2516582 -sleep=600ms`	hangs after ~3-4 publications	completes in ~6.12 s

multi-publish (and single in publish mode) deadlocks on payloads >=100KB after ~7-8 messages #74

Description

Summary

Reproduction

Expected

Actual

Root cause

Fix

Environment

Related observation (not part of this issue)

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions