perf: skip the decoder flush window when only one sequence is live by FuJacob · Pull Request #11 · FuJacob/cotabbyinference

FuJacob · 2026-06-12T01:47:52Z

Summary

The decoder thread's batching design waits BATCH_WINDOW_MICROS (200us) after the first sample request so that sibling sequences can pile into one llama_decode. With a single live sequence there is no sibling to wait for, so the window was pure added latency on every sampled token; for a 20-26 token suggestion that is ~4-5ms per suggestion for nothing. The autocomplete app holds exactly one sequence by design (it destroys the old sequence before building a fresh one), so it paid this cost on every generation.
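The ~4-5ms figure follows from one window per sampled token. A minimal sketch of that arithmetic, assuming exactly one 200us wait per token; `wasted_micros` is a hypothetical helper for illustration, not a function in the repository:

```cpp
#include <cassert>
#include <cstdint>

// Value quoted in the PR description: fixed wait after the first sample request.
constexpr std::int64_t BATCH_WINDOW_MICROS = 200;

// Hypothetical helper: total time spent idling in the batching window over one
// suggestion, assuming one window is paid per sampled token.
inline std::int64_t wasted_micros(std::int64_t sampled_tokens) {
    assert(sampled_tokens >= 0);
    return sampled_tokens * BATCH_WINDOW_MICROS;
}
```

Under that assumption, `wasted_micros(20)` is 4000us (4.0ms) and `wasted_micros(26)` is 5200us (5.2ms), which matches the range given above.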

The flush decision now reads live_sequence_count, a lock-free atomic mirror of sequences.size() that is updated under sequences_mutex at every map mutation (create, destroy, destroy-all). The decoder reads an atomic mirror rather than taking sequences_mutex because it holds decode_mutex at the decision point, and destroySequence nests decode_mutex inside its sequence-scoped section, so locking sequences_mutex there would invert the lock order. The mirror can lag a concurrent create/destroy by at most one flush decision, so the worst case is one stale decision: a single window waited (or skipped).
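The mirror pattern can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the PR: the `Engine` struct, its method names, the `int` placeholder state, and the `> 1` threshold are all hypothetical; only `sequences_mutex`, `live_sequence_count`, and the three mutation points come from the description above.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <unordered_map>

// Hypothetical stand-in for the engine state.
struct Engine {
    std::mutex sequences_mutex;
    std::unordered_map<int, int> sequences;          // id -> placeholder state
    std::atomic<std::size_t> live_sequence_count{0}; // mirror of sequences.size()

    // Every map mutation refreshes the mirror while sequences_mutex is held.
    void create_sequence(int id) {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.emplace(id, 0);
        live_sequence_count.store(sequences.size());
    }

    void destroy_sequence(int id) {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.erase(id);
        live_sequence_count.store(sequences.size());
    }

    void destroy_all() {
        std::lock_guard<std::mutex> lock(sequences_mutex);
        sequences.clear();
        live_sequence_count.store(0);
    }

    // Intended to be called by the decoder while it already holds decode_mutex.
    // It takes no lock, so it cannot participate in a lock-order inversion.
    // Assumption: the window is only worth waiting for with 2+ live sequences.
    bool should_wait_for_siblings() const {
        return live_sequence_count.load() > 1;
    }
};
```

A decoder loop built on this sketch would sleep for the batching window only when `should_wait_for_siblings()` returns true and flush immediately otherwise. Because the read is not synchronized with the map itself, a create or destroy racing with the read can yield a value that is stale by one decision, which is the lag described above.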

Multi-sequence drivers (evals, tests running two sequences side by side) keep the batching window unchanged.

Validation

swift build
# Build complete!

swift test
# Test Suite 'All tests' passed

Behavioral note: single-sequence end-to-end timing is best confirmed via the bench against a local GGUF; the change is a strict removal of a fixed wait on that path.

🤖 Generated with Claude Code

The decoder thread waited a fixed 200us batching window after every sample request, even with a single live sequence where no sibling can ever arrive; that is pure added latency per sampled token for the autocomplete app, which holds exactly one sequence. The flush decision reads a lock-free mirror of the sequence count (an atomic updated under sequences_mutex at every map mutation) because taking sequences_mutex while holding decode_mutex would invert the order destroySequence uses. Multi-sequence drivers keep the batching window unchanged.

FuJacob merged commit 87193cd into main Jun 12, 2026
1 check passed

FuJacob merged 1 commit into main from perf/single-seq-flush
