
TricklePublisher.SegmentWriter masks CancelledError as TrickleSegmentWriteError on shutdown #10

@rickstaa

Description


Symptom

Normal session shutdown produces tracebacks that look like errors but happen on every clean teardown:

livepeer_gateway.trickle_publisher.TrickleSegmentWriteError:
    Trickle segment writer timed out after 5.0s
Trickle segment close suppressed seq=0
...
TimeoutError

Root cause (revised after investigation)

Not an asyncio.wait_for cancellation race (initial hypothesis was wrong). Verified via timestamped logs after migrating to asyncio.timeout — the timeouts still fire because they're legitimate:

  1. Mid-stream, the HTTP post-body consumer for the active segment becomes wedged or closed (the orchestrator-side reader stops reading, or the connection drops).
  2. SegmentWriter's internal queue fills up.
  3. The next await self.queue.put(data) in _stream_pipe_to_trickle blocks waiting for queue space.
  4. After 5s, the timeout fires → TrickleSegmentWriteError. MediaPublish logs "dropped segment ... mid-stream" and enters drain mode.
  5. On /stream/stop, SegmentWriter.close() calls queue.put(None) to signal end-of-stream — but the consumer is gone, so it hangs.
  6. After 5s the timeout fires → "Trickle segment close suppressed" warning logged.

So during shutdown, every active segment with a wedged consumer adds 5s + a misleading-looking traceback before teardown completes. PR #7 / live_grayscale/test.sh reproduces it consistently.

Proposed fix (needs work)

SegmentWriter needs to know when its associated HTTP post-task is dead and short-circuit close() / write() rather than block on queue.put forever. Options:

  • Track the post-task on the segment; on close() check post_task.done() before queue.put(None).
  • Use put_nowait() and absorb QueueFull.
  • Reduce the close-path timeout to e.g. 0.5s (still misleading but faster).
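A minimal sketch combining the first two options, assuming SegmentWriter holds references named post_task and queue (both names are assumptions, as is the class shape):

```python
import asyncio

class SegmentWriterSketch:
    """Hypothetical close-path guard; not the real SegmentWriter."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue = asyncio.Queue(maxsize=1)
        self.post_task: asyncio.Task | None = None

    async def close(self) -> None:
        # If the HTTP post-task is already done, nobody will ever drain the
        # queue; awaiting put(None) would just block until the timeout.
        if self.post_task is not None and self.post_task.done():
            return
        try:
            self.queue.put_nowait(None)  # end-of-stream sentinel, never blocks
        except asyncio.QueueFull:
            pass  # consumer wedged; drop the sentinel rather than hang
```

With this shape, a wedged consumer costs zero seconds on shutdown instead of 5s per active segment, and no traceback is produced.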

What PR #11 does not fix

PR #11 migrates wait_for → asyncio.timeout, which is modernization, not a fix for this issue. The shutdown noise persists after that migration; the investigation here is separate.

Reproduction

cd examples/runner/live_grayscale
docker compose up -d --wait --build
SKIP_VIEWER=1 bash test.sh        # passes
docker logs live_grayscale        # shows the tracebacks during shutdown

Severity

Cosmetic — teardown completes correctly, but the noise causes false alarms during debugging and adds 5s × N-active-segments to shutdown latency.
