Optimize telemetry event publishing#297
Conversation
61ab04c to
140f7c2
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using high effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 140f7c2. Configure here.
| env, _ = truncateIfNeeded(env) | ||
| es.ring.publish(env) | ||
| return env | ||
| return es.ring.publishNext(env) |
There was a problem hiding this comment.
Truncation checks size before seq
Medium Severity
EventStream.Publish now runs truncateIfNeeded before publishNext assigns seq, so the slow-path size check marshals with "seq":0. Envelopes whose JSON length is just under maxS2RecordBytes at seq zero can pass without truncation, then exceed the S2 1 MiB cap once the real monotonic seq is written and consumers remarshal for SSE or storage.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 140f7c2. Configure here.
140f7c2 to
80e5ed1
Compare
Sayan-
left a comment
There was a problem hiding this comment.
reviewed — nice, well-benchmarked optimization pass. the publish-path wins are real and the seq-before-truncation fix is a genuine correctness improvement over the prior shape. one design question worth acknowledging, plus a few smaller things. nothing blocking.
heads up: I checked both Cursor Bugbot inline findings (fast-path escaping; truncation-before-seq) against current HEAD — both are already resolved and covered by new regression tests, so they don't need action.
main design question (likely an acceptable trade — just want it acknowledged)
ringbuffer.go:147(+eventsstorage.go:49,cmd/api/api/events.go:106) —Reader.ReadmovedRLock→Lockto keep the newwaiterscount accurate, so blocking readers now serialize. that means the S2 durability writer and every SSE stream contend on the exclusive lock, and each publish wakes the whole herd to queue for it. given the crux of this PR is write perf (zero-alloc publish), serializing the far-less-frequent readers is probably a reasonable trade — and the read critical section is tiny (a few field reads + one Envelope copy, no I/O or marshal under the lock), so I'd expect negligible real impact. mostly want it called out as intentional. if S2 read latency is ever a hard constraint, a quick measurement under realistic SSE fan-out would settle it; the way to keep both wins would be an atomicwaiterscounter with a re-check after registering soReadstays onRLock(but that double-check ordering is fiddly, which is likely why the simple one-lock form was chosen).
questions / suggestions
telemetry.go:83— nil-metadata events now share onesessionMetadatamap by pointer. safe today (write-once inStart, frozen before publication, read-only downstream), but it drops the old structural guard where each event got its own map. the safety now rests on an unenforced invariant: a published event'sSource.Metadatais read-only downstream. if any future consumer mutates the metadata of an event pulled off the ring, it races every other reader of that shared map → fatalconcurrent map read and map write(hard crash, intermittent). worth making the invariant explicit with a comment at the assignment and on thesessionMetadatafield.ringbuffer.go:79—truncateIfNeeded(up to 2×json.Marshalof a ~1MB envelope) now runs under the ring write lock, so an oversized event briefly blocks all readers+publishers. the marshal only depends on seq's digit-width (≤20 bytes), so it could be hoisted out of the lock with a worst-case-seq pad, leaving only seq-assign + ring-write locked. current form is correct, just holds the lock during the rare big marshal.
tests
- solid, targeted coverage: the truncation-boundary test (
TestEventStreamPublishTruncatesWithAssignedSeq) straddles the real 1MB limit atseq:0vs the 19-digit seq, so a regression back to truncating-at-seq-0 would fail it; seq assignment/ordering and escaped+large-metadata fallback are covered too. thewaiterswake path is already exercised by the pre-existingTestConcurrentPublishReadandTestRingBufferResetWithActiveReader(both block a reader and rely on a publish waking it — a dropped wakeup would hang them to timeout). no new test directly asserts thewaiters==0skip itself, but the benchmark's 0 allocs/op evidences it.
nits
event.go:128— the warn log dropped"seq"; it's available now (assigned beforetruncateIfNeeded) and useful for correlating which oversized event hit the limit.event.go:133,149—estimatedEnvelopeBytes/maxEscapedJSONLenhave no doc comments while the rest of the file is commented.


Summary
Measurement
Before:
After:
Tests
Note: a broad race run also hit an unrelated lib/devtoolsproxy Chromium temp-dir cleanup failure; the touched telemetry/event/API packages pass under race.