Skip to content

feat: auto-scaling voxel grid mapper#1271

Open
spomichter wants to merge 4 commits intodevfrom
feat/1270-autoscale-voxel-mapper
Open

feat: auto-scaling voxel grid mapper#1271
spomichter wants to merge 4 commits intodevfrom
feat/1270-autoscale-voxel-mapper

Conversation

@spomichter
Copy link
Contributor

Summary

Closes #1270 — Auto-scaling global voxel grid mapper.

What

Adds adaptive frame skipping to VoxelGridMapper so it self-regulates on weaker machines instead of falling behind on lidar ingest.

How

When autoscale=True (default), _on_frame() skips frames if they arrive faster than the last add_frame() processing duration. The system self-regulates: fast machines process every frame, slow machines find their natural ceiling.

autoscale_min_frequency (default 1.0 Hz) prevents the map from going completely stale — even if processing takes 2s, we'll still ingest at least 1 frame per second.

Config

@dataclass
class Config(ModuleConfig):
    autoscale: bool = True
    autoscale_min_frequency: float = 1.0  # never drop below this Hz

Telemetry (Rerun)

Logs both ingest and publish timings separately to identify the real bottleneck:

  • voxel_mapper/ingest_time_msadd_frame() wall-clock cost
  • voxel_mapper/publish_time_msget_global_pointcloud2() + publish cost
  • voxel_mapper/map_size — total voxel count (correlates with publish cost growth)
  • voxel_mapper/frames_skipped — cumulative frames dropped by autoscaler

Testing

4 new tests:

  • test_autoscale_skips_when_saturated — simulates slow machine (fakes 500ms processing), verifies frames are skipped
  • test_autoscale_disabled_processes_all — verifies autoscale=False processes every frame
  • test_autoscale_min_frequency_respected — verifies high min_frequency prevents excessive skipping
  • test_autoscale_rerun_logging — verifies rerun telemetry is logged correctly

All existing tests unchanged and passing.

How to Test

pytest -sv dimos/mapping/test_voxels.py -k autoscale

To observe autoscaling live with rerun on a real robot:

dimos run go2-nav  # or any blueprint using voxel_mapper
# Open Rerun viewer → check voxel_mapper/* time series

Add adaptive frame skipping to VoxelGridMapper when processing can't
keep up with incoming lidar rate. Self-regulates by using last
add_frame() duration as the throttle interval, with a configurable
minimum frequency floor.

Config:
  autoscale: bool = True
  autoscale_min_frequency: float = 1.0  # Hz floor

Telemetry via rerun:
  voxel_mapper/ingest_time_ms   - add_frame() cost per frame
  voxel_mapper/publish_time_ms  - get_global_pointcloud2() cost
  voxel_mapper/map_size         - total voxel count
  voxel_mapper/frames_skipped   - cumulative skipped frames

Closes #1270
@greptile-apps
Copy link

greptile-apps bot commented Feb 16, 2026

Greptile Summary

Adds adaptive frame skipping to VoxelGridMapper so it self-regulates on weaker machines, along with Rerun telemetry for ingest time, publish time, map size, and frames skipped. Two new config fields (autoscale, autoscale_min_frequency) control the behavior.

  • _on_frame() now measures add_frame() wall-clock cost and skips incoming frames that arrive faster than the last processing duration, capped by a minimum frequency.
  • Rerun telemetry (rr.log) is added to both _on_frame and publish_global_map for observability.
  • A defensive hasattr(self, "_publish_trigger") guard was added to support calling _on_frame before start() (used in tests).
  • Issue: 1.0 / autoscale_min_frequency will raise ZeroDivisionError if a user sets this config value to 0.
  • Issue: _last_ingest_time is set to the time before add_frame() executes. Since all transports (LCM, SharedMemory, ROS) deliver callbacks synchronously on a single thread, elapsed_since_last will always be >= _last_ingest_duration, making the skip condition unreachable in production. The autoscale only works in the test because _last_ingest_duration is artificially inflated.

Confidence Score: 3/5

  • The PR introduces a useful feature but has a runtime crash risk (ZeroDivisionError) and the core autoscale logic is ineffective with the current synchronous transport model.
  • Score of 3 reflects two issues: (1) an unguarded division by zero on a user-configurable field that will crash at runtime, and (2) the autoscale skip logic is effectively dead code in production because _last_ingest_time is set before processing starts, and all transport callbacks are synchronous. The tests pass by faking internal state rather than exercising the real production code path, masking the timing issue.
  • Pay close attention to dimos/mapping/voxels.py — the _on_frame method's timing logic and the ZeroDivisionError risk in the autoscale configuration.

Important Files Changed

Filename Overview
dimos/mapping/voxels.py Adds autoscale frame-skipping and Rerun telemetry to VoxelGridMapper. Has a ZeroDivisionError risk with autoscale_min_frequency=0, and the autoscale skip logic is ineffective with synchronous transport delivery due to _last_ingest_time being set before processing.
dimos/mapping/test_voxels.py Adds 4 tests for autoscale behavior. Tests are well-structured but the saturation test relies on faking internal state (_last_ingest_duration) rather than testing the actual production code path, masking the timing issue in the implementation.

Sequence Diagram

sequenceDiagram
    participant LCM as LCM Transport Thread
    participant OF as _on_frame()
    participant AF as add_frame()
    participant RR as Rerun Telemetry
    participant PT as _publish_trigger
    participant PGM as publish_global_map()

    LCM->>OF: frame arrives
    alt autoscale ON & frames_processed > 0
        OF->>OF: elapsed = now - _last_ingest_time
        OF->>OF: throttle = min(last_duration, 1/min_freq)
        alt elapsed < throttle
            OF-->>LCM: skip (return early)
            Note right of OF: _frames_skipped++
        end
    end
    OF->>AF: add_frame(frame)
    AF-->>OF: done (ingest_duration measured)
    OF->>OF: update _last_ingest_time, _last_ingest_duration
    OF->>RR: log ingest_time_ms, map_size, frames_skipped
    alt publish_interval == 0
        OF->>PT: on_next(None)
        PT->>PGM: publish_global_map()
        PGM->>RR: log publish_time_ms
        PGM->>PGM: global_map.publish(pc)
    end
Loading

Last reviewed commit: f2ec462

Copy link

@greptile-apps greptile-apps bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

2 files reviewed, 2 comments

Edit Code Review Agent Settings | Greptile

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ZeroDivisionError if autoscale_min_frequency is 0

1.0 / self.config.autoscale_min_frequency will raise ZeroDivisionError at runtime if a user passes autoscale_min_frequency=0. There's no validation preventing this value. Consider adding a guard or a config validation:

Suggested change
max_interval = 1.0 / self.config.autoscale_min_frequency
max_interval = (1.0 / self.config.autoscale_min_frequency) if self.config.autoscale_min_frequency > 0 else float("inf")

Comment on lines 133 to 138
t0 = time.monotonic()
self.add_frame(frame)
if self.config.publish_interval == 0:
ingest_duration = time.monotonic() - t0

self._last_ingest_time = t0
self._last_ingest_duration = ingest_duration
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_last_ingest_time = t0 limits autoscale effectiveness

_last_ingest_time is set to t0 (the time before add_frame() runs). Since all transports (LCM, SharedMemory, ROS) deliver callbacks synchronously on a single thread, the next _on_frame call can only begin after the current one returns. This means elapsed_since_last = now - t0 will always be >= _last_ingest_duration, so the skip condition elapsed < throttle_interval is effectively never true in production.

The autoscale will only skip frames when _on_frame is called directly from a separate thread (not through the transport), or if _last_ingest_duration is artificially set higher than the actual processing time (as done in the test).

If you want skipping to work with synchronous transport delivery, consider setting _last_ingest_time to time.monotonic() (after processing completes), so the elapsed-time check measures the gap between frames exclusive of processing time:

Suggested change
t0 = time.monotonic()
self.add_frame(frame)
if self.config.publish_interval == 0:
ingest_duration = time.monotonic() - t0
self._last_ingest_time = t0
self._last_ingest_duration = ingest_duration
t0 = time.monotonic()
self.add_frame(frame)
ingest_duration = time.monotonic() - t0
self._last_ingest_time = time.monotonic()
self._last_ingest_duration = ingest_duration

With this change, frames that arrive back-to-back (queued in LCM during processing) would be correctly identified as arriving faster than the processing rate.

rr.log() was silently no-op because VoxelGridMapper runs in a
separate Dask worker process from the rerun bridge. Now:
- _ensure_rr() calls rr.init('dimos', spawn=False) in the mapper process
- Uses rr.set_time() with sequence + timestamp for proper time series
- Graphs will appear in Rerun viewer under voxel_mapper/*
Two bugs flagged by Greptile:

1. ZeroDivisionError: autoscale_min_frequency=0 would crash.
   Fix: guard with > 0 check, treat 0 as 'no frequency floor'.

2. Skip logic was dead code: _last_ingest_time was set before
   add_frame(), so elapsed_since_last always >= _last_ingest_duration
   on synchronous transport threads — frames never got skipped.
   Fix: set _last_ingest_time after add_frame() completes, so it
   measures idle time between frames. If idle time < last processing
   duration, the machine is falling behind and the frame is skipped.

Also adds test_autoscale_min_frequency_zero_no_crash to cover the
ZeroDivisionError case, and updates the saturation test to correctly
set both _last_ingest_time and _last_ingest_duration.
@spomichter
Copy link
Contributor Author

Fixed both issues from Greptile review (3639cb6):

1. ZeroDivisionError on autoscale_min_frequency=0
Added a guard: if min_frequency <= 0, fall back to using processing time alone as the throttle interval (no frequency floor).

2. Skip logic was dead code
Greptile was correct — _last_ingest_time was set to t0 (before add_frame()), so on synchronous transport threads elapsed_since_last always >= _last_ingest_duration. Frames never actually got skipped in production.

Fix: set _last_ingest_time = time.monotonic() after add_frame() completes. Now elapsed_since_last measures idle time between frames. If idle < last processing duration, the machine is genuinely behind and we skip.

Also added test_autoscale_min_frequency_zero_no_crash and updated the saturation test to correctly fake both _last_ingest_time (set to now) and _last_ingest_duration to simulate a machine that just finished a slow frame.

@spomichter
Copy link
Contributor Author

@leshy this looks like slop to me. def concerned if the agent cant even do this

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant

Comments