
trickle_publisher: _resolve_next_seq fallback drops first video segment + logs noisy warning #12

@rickstaa


Symptom

Two issues from the same code path in trickle_publisher.py:_resolve_next_seq:

(1) Noisy warning at every session start

WARNING livepeer_gateway.trickle_publisher: Trickle /next missing Lp-Trickle-Latest header

Fires once per publisher (events + data + media-out) at every LivePipeline session start. Visible in every example pipeline (live_grayscale, live_depth, live_transcribe).

(2) First video segment dropped on the -out channel

Real bug, not just noise. Orchestrator log signature:

POST segment stream=<sid>-out idx=0 next=0      ← URL /-1, server resolved to slot 0
POST segment stream=<sid>-out idx=0 next=0      ← URL /0,  duplicate write to slot 0
POST segment stream=<sid>-out idx=0 next=0      ← retry after the drop
POST segment stream=<sid>-out idx=1 next=1      ← in-sync from here onwards

Runner-side log:

WARNING ... MediaPublish[<sid>-out] dropped segment seq=0 mid-stream; draining pipe
livepeer_gateway.trickle_publisher.TrickleSegmentWriteError: Trickle segment writer timed out after 5.0s
WARNING ... Trickle segment close suppressed seq=0

The first MPEG-TS video segment never makes it through. Subsequent segments are fine because the publisher's local counter is back in sync with the server's nextWrite from segment 1.

Root cause

The publisher's _resolve_next_seq() (line 346) does:

async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
        else:
            _LOG.warning("Trickle /next missing Lp-Trickle-Latest header")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return -1

The /next endpoint is in livepeer/go-livepeer#3884 (Josh's "Serverless" PR, still open); both sides were authored as a designed pair on the same day:

| When | Where | Author |
| --- | --- | --- |
| 2026-03-26 01:11 PDT | `_resolve_next_seq` lands in this repo (commit 5007e2c) | Josh Allmann |
| 2026-03-26 12:21 PDT | `/next` endpoint added to go-livepeer (commit 8da1c692, in PR #3884) | Josh Allmann |

The Python side merged to main. The Go side hasn't merged to master. So on master, every probe hits a 400 (no /next route), no header is set, the warning fires, and _resolve_next_seq returns -1.

The caller in next():

if self.seq < 0:
    send_reset = True
    self.seq = await self._resolve_next_seq()    # = -1
self._next_state = await self.preconnect(self.seq, send_reset=send_reset)  # POST /-1
self.seq += 1                                    # = 0
preconnect_task = asyncio.create_task(self._preconnect_task(self.seq))     # POST /0 in background

Server-side, getForWrite(idx == -1) resolves locally to nextWrite (= 0 for a fresh channel) and creates segment[0]. The publisher's next POST then targets URL /0; the server's getForWrite(0) finds the existing segment[0] and returns it, so two HTTP handlers are now both writing to segment[0]. On the events channel the writes are tiny JSON heartbeats and the race is invisible. On the video -out channel the first POST is still mid-stream when the second arrives, the server's segment-write semantics race, and 5 s later the first writer trips its timeout: seq=0 is dropped.
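The duplicate-slot arithmetic can be sketched with a toy model (hypothetical helper name; the real resolution lives in go-livepeer's getForWrite and the publisher's next()):

```python
# Toy model of the pre-#3884 sequence bookkeeping. server_resolve is a
# stand-in for go-livepeer's getForWrite slot resolution.

def server_resolve(idx: int, next_write: int) -> int:
    # getForWrite(-1) resolves locally to nextWrite; any other idx is literal.
    return next_write if idx == -1 else idx

next_write = 0   # fresh channel on the server
seq = -1         # publisher sentinel after a failed /next probe

# First POST: URL /-1 -> server creates segment[0]
first_slot = server_resolve(seq, next_write)

# Publisher then increments its *local* counter from -1 to 0 ...
seq += 1

# ... so the background preconnect POSTs URL /0 and the server hands back
# the *existing* segment[0]: two handlers writing the same slot.
second_slot = server_resolve(seq, next_write)

print(first_slot, second_slot)  # 0 0 — the duplicate-writer collision
```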

Fix

Two small changes, one line each, covering both symptoms:

async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
    # Common case pre-#3884: the server has no /next route, so no header is set.
        # Treat as fresh channel and start at slot 0; matches pytrickle's approach.
        _LOG.debug("Trickle /next missing Lp-Trickle-Latest header — assuming fresh channel")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return 0  # fresh-channel default; the -1 sentinel race causes duplicate POSTs to slot 0

Two changes:

  1. return 0 (was -1) — eliminates the duplicate POST → first video segment delivers cleanly.
  2. debug (was warning) — eliminates the operator-visible noise, matches what PR #6 already does for symptom (1).

Both pre-#3884 (probe fails, fresh-channel default kicks in) and post-#3884 (probe succeeds, returns server-reported nextWrite) work correctly.
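Both paths can be exercised with a fake session object (a sketch, assuming the aiohttp-style .get()/.headers/.release() surface shown above; FakeSession/FakeResp are test doubles, not library code):

```python
# Sketch of a unit test for the fixed _resolve_next_seq behavior.
import asyncio

class FakeResp:
    def __init__(self, headers):
        self.headers = headers
    def release(self):
        pass

class FakeSession:
    def __init__(self, headers):
        self._headers = headers
    async def get(self, url):
        return FakeResp(self._headers)

async def resolve_next_seq(session, url_base: str) -> int:
    try:
        resp = await session.get(f"{url_base}/next")
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)   # post-#3884: server-reported nextWrite
    except Exception:
        pass
    return 0                     # pre-#3884 fallback: fresh-channel default

# post-#3884: header present -> server's nextWrite wins
post = asyncio.run(resolve_next_seq(FakeSession({"Lp-Trickle-Latest": "7"}), "http://example"))
# pre-#3884: header missing -> fresh-channel default, no -1 sentinel
pre = asyncio.run(resolve_next_seq(FakeSession({}), "http://example"))
print(post, pre)  # 7 0
```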

Why not just wait for #3884

That's an option, but the timeline is unbounded. PR #6 already partially addresses (1) by demoting the warning, but it doesn't touch the return -1 line, so even after #6 merges into our branches, symptom (2) persists. The return 0 change is the load-bearing fix.

Caveat — resume semantics

Returning 0 on probe failure means a publisher reconnecting on a non-fresh channel pre-#3884 would overwrite slot 0 instead of resuming from nextWrite. No regression: today's -1 path also breaks resume on master (the sentinel resolves to nextWrite server-side, then the publisher's local counter increments from -1 to 0, posting to /0 and overwriting all existing segments). Resume only works post-#3884 anyway.
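The no-regression claim can be checked with the same toy slot-resolution model as above (hypothetical names; the real logic is in getForWrite and next()):

```python
# Toy comparison of resume behavior on a non-fresh channel (next_write = 5)
# when the pre-#3884 probe yields no header.

def first_two_slots(initial_seq: int, next_write: int):
    resolve = lambda idx: next_write if idx == -1 else idx
    slot_a = resolve(initial_seq)        # first POST
    slot_b = resolve(initial_seq + 1)    # background preconnect
    return slot_a, slot_b

# Old -1 sentinel: server resolves the first POST to slot 5, but the local
# counter increments -1 -> 0, so the next POST hits /0 and clobbers old data.
old = first_two_slots(-1, next_write=5)

# New 0 default: both POSTs target the start of the channel (/0, /1) —
# still overwrites, i.e. no regression; resume only works post-#3884.
new = first_two_slots(0, next_write=5)

print(old, new)  # (5, 0) (0, 1) — both paths break resume pre-#3884
```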

Out of scope

  • Removing _resolve_next_seq() entirely — it's correct and useful post-#3884 for resume.
  • Server-side fixes — covered by livepeer/go-livepeer#3884.

Confirmed impact on the data channel

Observable in live_transcribe's SSE-based caller test (examples/runner/live_transcribe/test.sh). Runner emits 5 transcripts; SSE subscriber on the gateway's data_url proxy receives only 3:

runner _LOG (all 5):                        SSE subscriber output (3 of 5):
  transcript[0]: ask not                      transcript[0]: ask not
  transcript[1]: What your country can do…    (missing — `seq=0` race)
  transcript[2]: ask what you can do for…     transcript[2]: ask what you can do for…
  transcript[3]: And so am I fellow Americans transcript[3]: And so am I fellow Americans
  transcript[4]: Ask!                         (missing — separate teardown race)

transcript[1] is dropped by exactly the bug above: the publisher's first POST goes to URL /-1 (server resolves to slot 0), and the second POST goes to URL /0 — server finds the existing segment[0] and races it. transcript[0]'s write wins; transcript[1]'s overwrites or is dropped depending on timing.

This matches the video-channel seq=0 dropped + writer timed out 5.0s we saw before, just with smaller payloads and different visibility. Same root cause, same one-line fix.

(transcript[4] is a separate teardown timing issue between the runner's on_stream_stop flush and the gateway's SSE end signal — not part of this issue.)

Context

Surfaced while testing live_transcribe end-to-end. The noisy warning is in every example pipeline; the duplicate-POST loss is in any pipeline that emits on the video output channel OR the data channel. The data-channel impact is now directly observable via the SSE test in live_transcribe.
