Skip to content

Rust warn on dropped LCM packets#2200

Merged
jeff-hykin merged 4 commits into
mainfrom
jeff/fix/drop_warn
May 21, 2026
Merged

Rust warn on dropped LCM packets#2200
jeff-hykin merged 4 commits into
mainfrom
jeff/fix/drop_warn

Conversation

@jeff-hykin
Copy link
Copy Markdown
Member

@jeff-hykin jeff-hykin commented May 20, 2026

Backpressure from heavy faster-than-real-time lidar benchmarks is kind of evitable. Right now its a silent failure.

This includes a warning message, and some higher default rust buffer sizes to handle more throughput to match.

Tests added to make sure rust native modules aren't bottlenecking LCM.

Run the tests:

cd native/rust/dimos-module
cargo test

The per-input mpsc queue between the LCM recv loop and the user's
handler had capacity 16. `TypedRoute::try_dispatch` was using
`try_send` and silently dropping new messages when full. At 10 Hz
publish + any handler slower than ~6 Hz, the queue filled in 1.6 s
and every subsequent message was lost without any signal.

This was the actual cause of the ~50 % "LCM drops" observed during
the pgo_rust KITTI-360 benchmark — the kernel reported zero drops the
whole time (UDP RcvbufErrors=0, lo drop=0), and a separate burst
characterization confirmed pure LCM can do 42 MB/s @ 89 Hz at 0%
loss. The drops were here, in dimos-module.

Two changes:

1. INPUT_CHANNEL_CAPACITY 16 → 1024 (and PUBLISH_CHANNEL_CAPACITY
   64 → 1024 for symmetry). At 10 Hz publish, 1024 gives ~100 s of
   headroom for handler stalls. Memory cost is bounded by the
   message types (rust mpsc only allocates entries as needed).

2. Per-route AtomicU64 `drops` counter on TypedRoute. Increments on
   `TrySendError::Full`. Logged via rate-limited eprintln at
   power-of-2 milestones (1, 2, 4, 8, 16, …). First drop fires
   immediately so operators see the first warning; subsequent
   frequency decays exponentially so a runaway handler doesn't spam
   stderr while still surfacing the failure.

4 new unit tests cover the behavior:

- typed_route_drops_count_when_queue_full: 100 sent, capacity 4,
  expect exactly 96 drops.
- typed_route_drops_counter_starts_at_zero: first sends under
  capacity must not increment drops.
- typed_route_does_not_drop_when_queue_has_headroom: 1000 sent,
  capacity 2048, drops must be 0 (guards against the rate-limited
  log firing on the happy path).
- typed_route_rate_limited_log_fires_on_power_of_two: 130 sent,
  capacity 1, expect 129 drops crossing the milestones 1, 2, 4, 8,
  16, 32, 64, 128 — operator sees 8 log lines.

All 22 dimos-module tests pass. Clippy clean.
@jeff-hykin jeff-hykin changed the title dimos-module: bump mpsc channels + log silent drops Rust warn on dropped LCM packets May 20, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented May 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ All tests successful. No failed tests found.

📢 Thoughts on this report? Let us know!

Comment thread native/rust/dimos-module/src/module.rs Outdated
Replaces the power-of-2 milestone trigger with a wall-clock throttle.
New constant MAX_ERROR_LOG_RATE caps drop-warning log lines per
route at N/sec (currently 1).

The drops counter still increments on every dropped message; only
the stderr line is throttled.

Implementation: last_log: Mutex<Option<Instant>> on TypedRoute.
First drop after a quiet period fires immediately; subsequent drops
within 1/MAX_ERROR_LOG_RATE seconds are silent.

22/22 dimos-module tests pass.
Reduced the back-pressure test suite to a single test
(typed_route_logs_error_on_drop) that fills the queue, forces one
drop, and verifies (a) the counter increments and (b) last_log is
set (proxying for "log fired"). Dropped the three other tests as
redundant with the single-message case.

Renamed the field `drops` → `drop_count` to read more naturally.

19/19 dimos-module tests pass.
@jeff-hykin jeff-hykin marked this pull request as ready for review May 20, 2026 21:50
@jeff-hykin jeff-hykin enabled auto-merge (squash) May 20, 2026 21:51
@greptile-apps
Copy link
Copy Markdown
Contributor

greptile-apps Bot commented May 20, 2026

Greptile Summary

This PR adds a rate-limited drop warning to TypedRoute::try_dispatch so that silent LCM packet loss under backpressure is now surfaced on stderr, and bumps both mpsc channel capacities from 16/64 to 1024 to reduce the frequency of drops during faster-than-real-time lidar benchmarks. New concurrency tests and a drop-logging unit test are included.

  • Drop warning: when try_send returns Full, the per-route drop_count (AtomicU64) is incremented and a rate-limited (≤1/sec) eprintln! is emitted with the topic name and queue capacity. The Closed variant continues to be silently ignored.
  • Capacity increase: INPUT_CHANNEL_CAPACITY 16 → 1024 and PUBLISH_CHANNEL_CAPACITY 64 → 1024; the larger buffers give downstream consumers more headroom before drops occur.

Confidence Score: 4/5

Safe to merge; the logic change is isolated to warning output and does not affect message delivery or control flow.

The functional change — emitting a warning on dropped messages — is correct and well-tested for the happy path. The two findings are about warning message clarity and test coverage for rate limiting; neither affects message delivery or system behaviour. The larger channel capacities are a straightforward tuning change with no downstream breakage risk.

native/rust/dimos-module/src/module.rs — the drop warning message and its test are the only areas worth a second look.

Important Files Changed

Filename Overview
native/rust/dimos-module/src/module.rs Adds per-route drop counting + rate-limited eprintln warning when mpsc channel is full; bumps both channel capacities to 1024; adds TypedRoute unit test. Minor issue: n in the warning is an all-time cumulative total, not a per-interval delta, which can be misleading after recovery periods.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[Transport recv loop] -->|channel, data| B{Route lookup}
    B -->|No route| C[Ignore]
    B -->|Route found| D[TypedRoute::try_dispatch]
    D --> E{decode bytes}
    E -->|Err| F[eprintln decode error]
    E -->|Ok msg| G{sender.try_send}
    G -->|Ok| H[Message delivered to Input]
    G -->|Closed| I[Silently ignore]
    G -->|Full - backpressure| J[drop_count.fetch_add]
    J --> K{rate-limit check\nlast_log within 1s?}
    K -->|Yes, suppress| L[Silent drop]
    K -->|No, emit| M[eprintln warning\ndrop_count + topic + cap]
    M --> N[Update last_log]
Loading

Reviews (1): Last reviewed commit: "dimos-module: simplify tests + rename dr..." | Re-trigger Greptile

Comment thread native/rust/dimos-module/src/module.rs
Comment thread native/rust/dimos-module/src/module.rs
@jeff-hykin jeff-hykin self-assigned this May 20, 2026
Comment thread native/rust/dimos-module/src/module.rs
@jeff-hykin jeff-hykin merged commit 1f544d0 into main May 21, 2026
20 checks passed
@jeff-hykin jeff-hykin deleted the jeff/fix/drop_warn branch May 21, 2026 01:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants