Symptom
Two issues from the same code path in `trickle_publisher.py:_resolve_next_seq`:

(1) Noisy warning at every session start

```
WARNING livepeer_gateway.trickle_publisher: Trickle /next missing Lp-Trickle-Latest header
```

Fires once per publisher (events + data + media-out) at every `LivePipeline` session start. Visible in every example pipeline (`live_grayscale`, `live_depth`, `live_transcribe`).

(2) First video segment dropped on the `-out` channel

Real bug, not just noise. Orchestrator log signature:
```
POST segment stream=<sid>-out idx=0 next=0   ← URL /-1, server resolved to slot 0
POST segment stream=<sid>-out idx=0 next=0   ← URL /0, duplicate write to slot 0
POST segment stream=<sid>-out idx=0 next=0   ← retry after the drop
POST segment stream=<sid>-out idx=1 next=1   ← in-sync from here onwards
```
Runner-side log:
```
WARNING ... MediaPublish[<sid>-out] dropped segment seq=0 mid-stream; draining pipe
livepeer_gateway.trickle_publisher.TrickleSegmentWriteError: Trickle segment writer timed out after 5.0s
WARNING ... Trickle segment close suppressed seq=0
```

The first MPEG-TS video segment never makes it through. Subsequent segments are fine because the publisher's local counter is back in sync with the server's `nextWrite` from segment 1.
Root cause
The publisher's `_resolve_next_seq()` (line 346) does:

```python
async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
        else:
            _LOG.warning("Trickle /next missing Lp-Trickle-Latest header")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return -1
```
The `/next` endpoint is in livepeer/go-livepeer#3884 (Josh's "Serverless" PR, still open) — both sides authored as a designed pair on the same day:

| When | Where | Author |
| --- | --- | --- |
| 2026-03-26 01:11 PDT | `_resolve_next_seq` lands in this repo (commit `5007e2c`) | Josh Allmann |
| 2026-03-26 12:21 PDT | `/next` endpoint added to go-livepeer (commit `8da1c692`, in PR #3884) | Josh Allmann |

The Python side merged to `main`. The Go side hasn't merged to `master`. So on master, every probe hits a 400 (no `/next` route), no header is set, the warning fires, and `_resolve_next_seq` returns `-1`.
The caller in `next()`:

```python
if self.seq < 0:
    send_reset = True
    self.seq = await self._resolve_next_seq()  # = -1
self._next_state = await self.preconnect(self.seq, send_reset=send_reset)  # POST /-1
self.seq += 1  # = 0
preconnect_task = asyncio.create_task(self._preconnect_task(self.seq))  # POST /0 in background
```
Server-side, `getForWrite(idx == -1)` resolves locally to `nextWrite` (= 0 for a fresh channel) and creates `segment[0]`. The publisher's next POST then targets URL `/0` — the server's `getForWrite(0)` finds the existing `segment[0]` and returns it. Two HTTP handlers are now both writing to `segment[0]`. On the events channel the writes are tiny JSON heartbeats and the race is invisible. On the video `-out` channel the first POST is still mid-stream when the second arrives; the two writers race on the same segment, and 5 s later the first writer trips its timeout, dropping seq=0.
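The duplicate-POST race above can be sketched as a minimal simulation. The `FakeServer` model here (`get_for_write`, `next_write`) is an illustrative stand-in for the go-livepeer behavior described in this issue, not the real implementation:

```python
# Simulate the server-side slot resolution and the publisher's sentinel
# counter. Names are illustrative, not the real go-livepeer code.
class FakeServer:
    def __init__(self):
        self.next_write = 0   # next slot for a fresh channel
        self.segments = {}    # slot -> list of concurrent writers

    def get_for_write(self, idx: int) -> int:
        if idx == -1:                 # sentinel: server resolves locally
            idx = self.next_write
        # An existing segment is returned as-is, so a second POST to the
        # same slot shares the segment with the first writer.
        self.segments.setdefault(idx, []).append(object())
        if idx >= self.next_write:
            self.next_write = idx + 1
        return idx

server = FakeServer()

# Publisher path when _resolve_next_seq() returns -1:
seq = -1
first_slot = server.get_for_write(seq)   # POST /-1 -> server picks slot 0
seq += 1                                 # publisher's local counter: 0
second_slot = server.get_for_write(seq)  # POST /0 -> same slot, second writer

print(first_slot, second_slot, len(server.segments[0]))  # 0 0 2
```

Both POSTs land on slot 0, which is exactly the two-writer collision the orchestrator log signature shows.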
Fix
A small change to `_resolve_next_seq`, both symptoms covered:
```python
async def _resolve_next_seq(self) -> int:
    url = f"{self.url}/next"
    try:
        resp = await self._session.get(url)
        latest = resp.headers.get("Lp-Trickle-Latest")
        resp.release()
        if latest is not None:
            return int(latest)
        # Common pre-#3884 — server has no /next endpoint, so we get no header.
        # Treat as fresh channel and start at slot 0; matches pytrickle's approach.
        _LOG.debug("Trickle /next missing Lp-Trickle-Latest header — assuming fresh channel")
    except Exception:
        _LOG.warning("Trickle /next request failed", exc_info=True)
    return 0  # fresh-channel default; the -1 sentinel race causes duplicate POSTs to slot 0
```
Two changes:

- `return 0` (was `-1`) — eliminates the duplicate POST → first video segment delivers cleanly.
- `debug` (was `warning`) — eliminates the operator-visible noise; matches what PR #6 already does for symptom (1).

Both pre-#3884 (probe fails, fresh-channel default kicks in) and post-#3884 (probe succeeds, returns server-reported `nextWrite`) work correctly.
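Both branches can be checked with a stand-in publisher and a fake aiohttp-style session. The class names and session shape below are illustrative, not the real `trickle_publisher` internals:

```python
import asyncio

class FakeResponse:
    """Minimal aiohttp-style response: a headers dict and a release() no-op."""
    def __init__(self, headers):
        self.headers = headers
    def release(self):
        pass

class FakeSession:
    """Returns the same canned headers for any GET."""
    def __init__(self, headers):
        self._headers = headers
    async def get(self, url):
        return FakeResponse(self._headers)

class Publisher:
    """Stand-in carrying only the patched _resolve_next_seq logic."""
    def __init__(self, session):
        self.url = "http://orch/stream-out"  # hypothetical channel URL
        self._session = session

    async def _resolve_next_seq(self) -> int:
        try:
            resp = await self._session.get(f"{self.url}/next")
            latest = resp.headers.get("Lp-Trickle-Latest")
            resp.release()
            if latest is not None:
                return int(latest)  # post-#3884: server-reported nextWrite
        except Exception:
            pass
        return 0  # pre-#3884: fresh-channel default

# pre-#3884: no header -> fresh-channel 0; post-#3884: header wins
assert asyncio.run(Publisher(FakeSession({}))._resolve_next_seq()) == 0
assert asyncio.run(Publisher(FakeSession({"Lp-Trickle-Latest": "7"}))._resolve_next_seq()) == 7
```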
Why not just wait for #3884
An option, but the timeline is unbounded. PR #6 already partially addresses (1) by demoting the warning, but doesn't touch the `return -1` line — so even after #6 merges into our branches, symptom (2) persists. The `return 0` change is the load-bearing fix.
Caveat — resume semantics
Returning `0` on probe failure means a publisher reconnecting on a non-fresh channel pre-#3884 would overwrite slot 0 instead of resuming from `nextWrite`. No regression: today's `-1` path also breaks resume on master (the sentinel resolves to `nextWrite` server-side, then the publisher's local counter increments from -1 to 0, posting to `/0` and overwriting all existing segments). Resume only works post-#3884 anyway.
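A quick sketch of why the old `-1` path already broke resume, assuming a server that resolves the sentinel to `nextWrite` as described above (the `resolve` helper is hypothetical):

```python
def resolve(idx: int, next_write: int) -> int:
    """Server-side slot resolution: -1 means 'pick nextWrite locally'."""
    return next_write if idx == -1 else idx

next_write = 5                       # non-fresh channel: segments 0..4 exist
seq = -1                             # old probe-failure sentinel

first = resolve(seq, next_write)     # server writes slot 5 (correct resume slot)
seq += 1                             # publisher's local counter: -1 -> 0
second = resolve(seq, next_write)    # next POST targets slot 0 -> overwrite

print(first, second)  # 5 0
```

The server and the publisher disagree from the second segment on, so resume was never usable with the sentinel either.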
Out of scope
- Removing `_resolve_next_seq()` entirely — it's correct and useful post-#3884 for resume.
- Server-side fixes — covered by livepeer/go-livepeer#3884.
Confirmed impact on the data channel
Observable in `live_transcribe`'s SSE-based caller test (`examples/runner/live_transcribe/test.sh`). Runner emits 5 transcripts; the SSE subscriber on the gateway's `data_url` proxy receives only 3:

| runner `_LOG` (all 5) | SSE subscriber output (3 of 5) |
| --- | --- |
| `transcript[0]: ask not` | `transcript[0]: ask not` |
| `transcript[1]: What your country can do…` | (missing — `seq=0` race) |
| `transcript[2]: ask what you can do for…` | `transcript[2]: ask what you can do for…` |
| `transcript[3]: And so am I fellow Americans` | `transcript[3]: And so am I fellow Americans` |
| `transcript[4]: Ask!` | (missing — separate teardown race) |

`transcript[1]` is dropped by exactly the bug above: the publisher's first POST goes to URL `/-1` (the server resolves it to slot 0), and the second POST goes to URL `/0` — the server finds the existing `segment[0]` and races it. `transcript[0]`'s write wins; `transcript[1]`'s is overwritten or dropped depending on timing.

This matches the video-channel `seq=0 dropped` + `writer timed out 5.0s` we saw before, just with smaller payloads and different visibility. Same root cause, same one-line fix.

(`transcript[4]` is a separate teardown timing issue between the runner's `on_stream_stop` flush and the gateway's SSE end signal — not part of this issue.)
Context
Surfaced while testing `live_transcribe` end-to-end. The noisy warning appears in every example pipeline; the duplicate-POST loss affects any pipeline that emits on the video output channel or the data channel. The data-channel impact is now directly observable via the SSE test in `live_transcribe`.