
Stress test backend telemetry ingestion for continuous Jetson publishing with minimal data loss #9

@MohamedEBR

Description

Summary

Telemetry ingestion now uses an HTTP push model, with the Jetson publishing directly to the backend. The next step is to stress test the backend under sustained telemetry load to confirm that it can receive, process, and store continuous data while minimizing loss.

Background

  • Jetson telemetry is sent to POST /api/v1/telemetry using an x-api-key.
  • The backend validates payloads, writes telemetry to InfluxDB, and broadcasts updates to SSE clients.
  • Basic connectivity has been verified manually, but behavior under sustained or higher-frequency publishing has not yet been validated.
  • The PR notes stress testing for 10Hz+ as a next step.
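For reference, a minimal Jetson-like publisher can be sketched in Python using only the standard library. The endpoint URL, key, and payload fields below are placeholders; they should be matched to the real telemetry schema and dev environment. The fixed-deadline schedule (rather than `sleep(period)` after each send) keeps the rate from drifting under load:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:3000/api/v1/telemetry"  # assumed dev endpoint
API_KEY = "dev-key"                                 # assumed dev key

def build_payload(seq: int) -> dict:
    # Hypothetical field names; replace with the real telemetry schema.
    return {"seq": seq, "timestamp": time.time(), "speed": 0.0}

def publish(payload: dict) -> int:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.status

def run(rate_hz: float, duration_s: float) -> None:
    period = 1.0 / rate_hz
    deadline = time.monotonic()
    end = deadline + duration_s
    seq = 0
    while deadline < end:
        publish(build_payload(seq))
        seq += 1
        deadline += period  # fixed schedule: no cumulative drift
        time.sleep(max(0.0, deadline - time.monotonic()))
```

Including a monotonically increasing `seq` in each payload makes sent-vs-stored reconciliation trivial: gaps in the stored sequence numbers are exactly the lost samples.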

Problem

The current pipeline has been validated for correctness at small scale, but we do not yet know:

  • how it behaves under continuous Jetson publishing
  • how much data is lost during sustained load
  • whether the backend can keep up with writes to InfluxDB
  • where the bottlenecks are across HTTP ingestion, validation, broadcast, and persistence

This is especially important because the current telemetry request flow writes to InfluxDB in a fire-and-forget style, so successful HTTP responses alone do not prove durable storage.

Goal

Measure and improve backend reliability for continuous telemetry ingestion so the Jetson can publish sustained streams with minimal packet/sample loss and acceptable end-to-end latency.

Scope

In scope

  • Load and soak testing for POST /api/v1/telemetry
  • Measuring accepted requests versus persisted telemetry points
  • Measuring dropped samples, validation failures, write failures, and latency
  • Testing sustained Jetson-like publish rates, including 10Hz+
  • Identifying bottlenecks in backend processing and storage
  • Recommending code or config changes to reduce data loss

Out of scope

  • Frontend rendering performance
  • Large-scale production deployment work
  • Replacing the ingestion protocol unless testing proves it is necessary

Why This Matters

Telemetry is only useful if the backend can consistently capture and store it. A backend that returns 200 OK but silently drops writes, lags behind, or loses bursts of samples will create blind spots for analysis and live operations.

Technical Context

Current ingestion flow

  1. Jetson sends POST /api/v1/telemetry
  2. Backend validates the payload and timestamp skew
  3. Backend adds serverReceiveTime
  4. Backend writes the payload to InfluxDB
  5. Backend broadcasts the payload to connected SSE clients
  6. Backend returns 200 {"message":"Telemetry received"}

Important reliability note

The current controller calls writeTelemetryPayload(...) without awaiting a flush/acknowledged persistence step. Stress testing should therefore verify:

  • request success rate
  • actual persistence rate in InfluxDB
  • divergence between received and stored telemetry
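The received/stored divergence is easy to reproduce in miniature. The sketch below is not the project's code; it is a generic asyncio illustration of why a fire-and-forget write lets every request report success while zero points are actually persisted if the pending writes never drain:

```python
import asyncio

async def slow_store(db: list, point: int) -> None:
    # Stand-in for an InfluxDB write with nonzero latency.
    await asyncio.sleep(0.01)
    db.append(point)

async def main() -> tuple[int, int]:
    db: list = []
    accepted = 0
    for i in range(100):
        # Fire-and-forget: schedule the write, "return 200" immediately.
        asyncio.get_running_loop().create_task(slow_store(db, i))
        accepted += 1
    # Every request has already been accepted at this point; if the
    # process exits (or the loop closes) now, pending writes are lost.
    return accepted, len(db)

accepted, stored = asyncio.run(main())
print(accepted, stored)  # -> 100 0
```

This is why the test plan must count points in InfluxDB directly rather than trusting HTTP 200 totals.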

Questions This Testing Should Answer

  • What sustained publish rate can the backend handle reliably?
  • At what point do request failures or storage gaps start appearing?
  • Is data loss happening at ingestion, validation, or storage?
  • How much end-to-end lag appears between Jetson publish time and storage time?
  • Does SSE broadcasting materially affect ingestion throughput?
  • What operational safeguards or code changes are needed to improve reliability?

Proposed Deliverables

  • A repeatable stress-test plan for local/dev environments
  • A Jetson-like publisher or simulator configuration for sustained load
  • Metrics and summary results for throughput, loss, and latency
  • A short findings report with bottlenecks and recommended fixes
  • Follow-up implementation tasks if reliability gaps are found

Suggested Test Scenarios

  • Baseline steady-state test at low frequency
  • Sustained 10Hz publish test
  • Sustained higher-rate publish test, such as 20Hz, 50Hz, or higher if realistic
  • Burst test with short spikes above normal operating rate
  • Long-running soak test to detect drift, memory growth, or delayed failures
  • Test with and without SSE clients connected
  • Test across local network conditions representative of Jetson usage
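A burst scenario can be driven by a simple rate function that the publisher samples each cycle. The default rates and timings below are illustrative placeholders, not agreed targets:

```python
def target_rate_hz(
    t: float,
    base_hz: float = 10.0,      # assumed steady-state rate
    burst_hz: float = 50.0,     # assumed spike rate
    burst_period_s: float = 60.0,
    burst_len_s: float = 5.0,
) -> float:
    """Target publish rate at elapsed time t: a short spike at the
    start of every burst_period_s window, base rate otherwise."""
    return burst_hz if (t % burst_period_s) < burst_len_s else base_hz
```

Feeding this into the publisher loop yields the burst test; setting `burst_hz == base_hz` degenerates to the steady-state and soak scenarios, so one driver covers several rows of the list above.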

Metrics To Capture

  • Total telemetry payloads sent
  • Total HTTP 200 responses received
  • Total non-200 responses
  • Validation rejection count
  • Estimated number of telemetry points actually stored in InfluxDB
  • Sample loss percentage: sent vs stored
  • Request latency percentiles
  • Storage lag / end-to-end latency
  • CPU and memory usage on the backend
  • InfluxDB write/backpressure symptoms
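The loss and latency metrics above reduce to two small calculations, sketched here so runs report them consistently (nearest-rank percentiles are assumed; swap in interpolation if preferred):

```python
def loss_percentage(sent: int, stored: int) -> float:
    """Sample loss: payloads sent vs points actually persisted."""
    return 0.0 if sent == 0 else 100.0 * (sent - stored) / sent

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile over request latencies (no interpolation)."""
    s = sorted(samples)
    k = round(pct / 100.0 * (len(s) - 1))
    return s[min(max(k, 0), len(s) - 1)]
```

For example, 1000 payloads sent with 990 persisted is a 1.0% loss, and p95 over the latency samples of a run gives the tail-latency figure for the findings report.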

Acceptance Criteria

  • A repeatable stress-test workflow exists for the backend telemetry path
  • The team can quantify data loss under sustained Jetson-like load
  • Results clearly distinguish between accepted requests and persisted points
  • We identify the maximum reliable sustained publish rate for the current implementation
  • We document bottlenecks and concrete next steps to reduce loss
  • If loss exceeds acceptable limits, follow-up issues are created
