Skip to content

fix: tighten WS heap defenses (cap=4, broadcast floor 32K)#61

Open
skialpine wants to merge 1 commit into
decentespresso:mainfrom
skialpine:fix/ws-heap-defense
Open

fix: tighten WS heap defenses (cap=4, broadcast floor 32K)#61
skialpine wants to merge 1 commit into
decentespresso:mainfrom
skialpine:fix/ws-heap-defense

Conversation

@skialpine
Copy link
Copy Markdown
Contributor

Summary

Two small knobs that together close the silent-WiFi-drop failure mode observed during multi-protocol soaks on top of #56's WS-OOM defense.

  • websocket.cleanupClients(4) — halve the WS client cap from the lib default (8) to 4
  • WS_BROADCAST_HEAP_FLOOR raised 25 KB → 32 KB

Why

Under sustained multi-protocol load (4+ WS clients + BLE + USB), per-client queue + AsyncWebSocketMessage allocations overcommit heap until lwIP can't allocate pbufs. Symptom is nasty: wifi_status stays CONNECTED, disc/rec never increment, but packets drop silently. The on-device watchdog never trips because nothing has crashed.

The existing wsBroadcastHeapOk() gate (PR #58) prevents the bad_alloc → abort reboot, but at the original 25 KB floor a 4-client status burst still dipped free heap to ~22 KB — below the lwIP starvation knee. 32 KB keeps the post-burst trough in safe territory.

cap=8 was always overhead for edge cases. Real workloads (on-device UI + PRs/Decenza app + an occasional second viewer) cap out at 3–4 concurrent clients on this hardware.

Validation

2h+ soak on hardware (skialpine/test/adc-lib+telemetry worktree with these edits applied):

  • 4 ws_drop_repro generators at 10 Hz each + BLE (Decenza app, ~1 Hz heartbeats) + USB serial
  • 0% ping loss, 0 reconnects (disc=0 rec=0 for the full run)
  • 0 panics, 0 bad_alloc, 0 AsyncTCP stalls
  • ~300 broadcasts skipped by the existing heap gate (working as designed)
  • minheap stable at ~3.6 KB across the run — no leak signature

Holding for an 8h soak before merge.

What this doesn't fix

This is a brake, not a root-cause fix. The underlying pressure — large statusAll broadcasts × 4 clients allocating ~2.3 KB per burst — is still there; we're just guaranteeing the gate catches it. Follow-up will split immutable fields (firmware_version, protocol_version, reset_reason) and quasi-immutable fields (stall_count, last_stall_*, adc_recovery_count, soc_temp_max_c) out of the hot statusAll path into a one-shot session_info frame + on-change deltas. Targets ~37% payload reduction on the status broadcast.

Test plan

  • Build clean (pio run -e esp32s3)
  • 2h+ multi-protocol soak — 0 drops
  • 8h soak validation (in progress)
  • Functional smoke: on-device UI + Decenza app + PRs WS coexist
  • Verify [ws] low heap … skip broadcast still fires correctly under load

🤖 Generated with Claude Code

Under sustained multi-protocol load (4+ WS clients + BLE + USB) the
default WS client cap of 8 lets per-client queue + AsyncWebSocketMessage
allocations overcommit heap to the point where lwIP TX buffers starve
silently — WiFi reports status=CONNECTED, disc/rec never increment, but
packets drop. Two small knobs together close the gap:

  * cleanupClients(4) — halves worst-case connection overcommit.
    Real workloads (on-device UI + PRs/Decenza + occasional second
    client) cap out at 3–4; 8 was always headroom for edge cases that
    don't happen on this hardware.

  * WS_BROADCAST_HEAP_FLOOR raised 25000 → 32000. The original 25K
    floor sat too close to the lwIP starvation knee: a 4-client status
    burst dipped free heap to ~22K, below where pbufs can be allocated.
    32K keeps the post-burst trough in safe territory.

Validated on hardware: 2h+ soak with 4 ws_drop_repro generators + BLE
(Decenza) + USB serial, 0% ping loss, 0 reconnects, 0 panics, ~300
broadcasts skipped by the existing wsBroadcastHeapOk gate. minheap
stable across the run (no leak signature).

Holding for an 8h soak before merge.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant