
perf: skip the decoder flush window when only one sequence is live#11

Merged: FuJacob merged 1 commit into main from perf/single-seq-flush on Jun 12, 2026
FuJacob (Owner) commented Jun 12, 2026

Summary

The decoder thread's batching design waits BATCH_WINDOW_MICROS (200us) after the first sample request so that sibling sequences can pile into one llama_decode. With a single live sequence there is no sibling to wait for, so the window was pure added latency on every sampled token: for a 20-26 token suggestion, 200us per token adds up to roughly 4-5ms per suggestion for nothing. The autocomplete app holds exactly one sequence by design (it destroys the old sequence before building a fresh one), so it paid this cost on every generation.
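The per-suggestion figure above follows directly from the window length. A minimal sketch of the arithmetic, assuming exactly one window is waited per sampled token; windowOverheadMicros is a hypothetical helper for illustration, not a function from this repository:

```cpp
#include <cstdint>

// Window length quoted in the summary (200us).
constexpr std::int64_t BATCH_WINDOW_MICROS = 200;

// Hypothetical helper: total time spent waiting in the batching window
// for one suggestion, assuming one window per sampled token.
constexpr std::int64_t windowOverheadMicros(std::int64_t sampled_tokens) {
    return BATCH_WINDOW_MICROS * sampled_tokens;
}

// The 20-26 token range from the summary, checked at compile time.
static_assert(windowOverheadMicros(20) == 4000, "20 tokens -> 4.0ms");
static_assert(windowOverheadMicros(26) == 5200, "26 tokens -> 5.2ms");
```

With one live sequence, the change removes this whole term from the sampling path.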

The flush decision now reads live_sequence_count, a lock-free atomic mirror of sequences.size() that is updated under sequences_mutex at every map mutation (create, destroy, destroy-all). The decoder reads an atomic mirror rather than taking sequences_mutex because it holds decode_mutex at the decision point, and destroySequence nests decode_mutex inside its sequence-scoped section, so locking sequences_mutex there would invert the lock order and risk deadlock. The mirror can lag a concurrent create/destroy by at most one flush decision, so the worst case is one stale decision: a single window waited when it could have been skipped, or skipped when a sibling had just arrived.
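The mirror pattern can be sketched roughly as follows. This is an illustration under assumptions, not the code from this PR: SequenceRegistry, Sequence, publishCountLocked, shouldWaitForSiblings, and the method names are hypothetical, the memory orders are one reasonable choice, and decode_mutex and the real destroySequence nesting are left out. Only sequences_mutex, sequences, and live_sequence_count are names taken from the description.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

struct Sequence {};  // hypothetical placeholder for per-sequence state

class SequenceRegistry {
public:
    // Every map mutation refreshes the mirror while sequences_mutex is held.
    void createSequence(int id) {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.emplace(id, Sequence{});
        publishCountLocked();
    }

    void destroySequence(int id) {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.erase(id);
        publishCountLocked();
    }

    void destroyAllSequences() {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.clear();
        publishCountLocked();
    }

    // Decoder side: reads only the atomic, so it is safe to call while
    // holding another lock. Waiting only pays off if a sibling can arrive.
    bool shouldWaitForSiblings() const {
        return live_sequence_count.load(std::memory_order_acquire) > 1;
    }

private:
    // Caller must hold sequences_mutex.
    void publishCountLocked() {
        live_sequence_count.store(sequences.size(), std::memory_order_release);
    }

    std::mutex sequences_mutex;
    std::unordered_map<int, Sequence> sequences;
    std::atomic<std::size_t> live_sequence_count{0};
};

// Usage: the window is skipped with one live sequence and kept with two.
inline void demoFlushDecision() {
    SequenceRegistry registry;
    registry.createSequence(1);
    assert(!registry.shouldWaitForSiblings());
    registry.createSequence(2);
    assert(registry.shouldWaitForSiblings());
    registry.destroySequence(1);
    assert(!registry.shouldWaitForSiblings());
}
```

Because the reader never takes sequences_mutex, it cannot take part in a lock-order cycle. The price is that a read racing with a mutation may see the previous count, which matches the one-decision lag described above.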

Multi-sequence drivers (evals, tests running two sequences side by side) keep the batching window unchanged.

Validation

swift build
# Build complete!

swift test
# Test Suite 'All tests' passed

Behavioral note: single-sequence end-to-end timing is best confirmed via the bench against a local GGUF; the change is a strict removal of a fixed wait on that path.

🤖 Generated with Claude Code

The decoder thread waited a fixed 200us batching window after every
sample request, even with a single live sequence where no sibling can
ever arrive; that is pure added latency per sampled token for the
autocomplete app, which holds exactly one sequence. The flush decision
reads a lock-free mirror of the sequence count (an atomic updated under
sequences_mutex at every map mutation) because taking sequences_mutex
while holding decode_mutex would invert the order destroySequence uses.
Multi-sequence drivers keep the batching window unchanged.
FuJacob merged commit 87193cd into main on Jun 12, 2026
1 check passed