Summary
We have moved Jetson telemetry ingestion to an HTTP push model. The next step is to stress test the backend under sustained telemetry load to confirm that it can receive, process, and store continuous data while minimizing loss.
Background
- Jetson telemetry is sent to `POST /api/v1/telemetry` using an `x-api-key` header.
- The backend validates payloads, writes telemetry to InfluxDB, and broadcasts updates to SSE clients.
- Basic connectivity has been verified manually, but we have not yet validated behavior under sustained or higher-frequency publishing.
- The PR notes stress testing at `10Hz+` as a next step.
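To make "Jetson-like publishing" concrete, here is a minimal pacing sketch for a test publisher. Only the endpoint path comes from this issue; the payload fields, host, and the injected `send` callable are assumptions so the pacing logic can be exercised without a live backend (a real run would wrap `requests.post` with the `x-api-key` header).

```python
import itertools
import time
from typing import Callable

API_PATH = "/api/v1/telemetry"  # from the issue; everything else here is an assumption


def make_payload(seq: int, ts: float) -> dict:
    # `seq` and `timestamp` are hypothetical field names; a real payload
    # would carry actual sensor readings.
    return {"seq": seq, "timestamp": ts}


def publish(send: Callable[[dict], int], rate_hz: float, duration_s: float) -> dict:
    """Send payloads at a fixed rate for duration_s seconds.

    `send` posts one payload and returns the HTTP status code; it is
    injected so the loop can be tested with a stub instead of a server.
    """
    interval = 1.0 / rate_hz
    start = time.monotonic()
    stats = {"sent": 0, "ok": 0}
    for seq in itertools.count():
        if time.monotonic() - start >= duration_s:
            break
        status = send(make_payload(seq, time.time()))
        stats["sent"] += 1
        if status == 200:
            stats["ok"] += 1
        # Pace against the start time so per-request jitter does not accumulate.
        time.sleep(max(0.0, start + (seq + 1) * interval - time.monotonic()))
    return stats
```

Pacing against the start time (rather than sleeping a fixed interval after each send) keeps the average rate honest even when individual requests are slow, which matters for sustained `10Hz+` runs.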
Problem
The current pipeline has been validated for correctness at small scale, but we do not yet know:
- how it behaves under continuous Jetson publishing
- how much data is lost during sustained load
- whether the backend can keep up with writes to InfluxDB
- where the bottlenecks are across HTTP ingestion, validation, broadcast, and persistence
This is especially important because the current telemetry request flow writes to InfluxDB in a fire-and-forget style, so successful HTTP responses alone do not prove durable storage.
Goal
Measure and improve backend reliability for continuous telemetry ingestion so the Jetson can publish sustained streams with minimal packet/sample loss and acceptable end-to-end latency.
Scope
In scope
- Load and soak testing for `POST /api/v1/telemetry`
- Measuring accepted requests versus persisted telemetry points
- Measuring dropped samples, validation failures, write failures, and latency
- Testing sustained Jetson-like publish rates, including `10Hz+`
- Identifying bottlenecks in backend processing and storage
- Recommending code or config changes to reduce data loss
Out of scope
- Frontend rendering performance
- Large-scale production deployment work
- Replacing the ingestion protocol unless testing proves it is necessary
Why This Matters
Telemetry is only useful if the backend can consistently capture and store it. A backend that returns 200 OK but silently drops writes, lags behind, or loses bursts of samples will create blind spots for analysis and live operations.
Technical Context
Current ingestion flow
- Jetson sends `POST /api/v1/telemetry`
- Backend validates the payload and timestamp skew
- Backend adds `serverReceiveTime`
- Backend writes the payload to InfluxDB
- Backend broadcasts the payload to connected SSE clients
- Backend returns `200 {"message":"Telemetry received"}`
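The steps above can be sketched as a reference model, useful when interpreting test results. This is not the real controller: the skew limit and validation details are assumptions, and `write` / `broadcast` stand in for the InfluxDB writer and SSE fan-out. Note the write is not awaited or acknowledged, mirroring the fire-and-forget behavior described below.

```python
import time

MAX_SKEW_S = 5.0  # allowed clock skew; the real limit is an assumption


def handle_telemetry(payload: dict, write, broadcast, now=time.time):
    """Reference model of the ingestion flow. Returns (status, body)."""
    ts = payload.get("timestamp")
    if not isinstance(ts, (int, float)):
        return 400, {"error": "missing or invalid timestamp"}
    server_now = now()
    if abs(server_now - ts) > MAX_SKEW_S:
        return 400, {"error": "timestamp skew too large"}
    payload = {**payload, "serverReceiveTime": server_now}
    write(payload)      # fire-and-forget today: success here proves nothing about durability
    broadcast(payload)  # SSE fan-out to connected clients
    return 200, {"message": "Telemetry received"}
```

Because the `200` is returned regardless of whether `write` eventually lands in InfluxDB, the stress tests must compare accepted requests against stored points rather than trusting response codes.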
Important reliability note
The current controller calls `writeTelemetryPayload(...)` without awaiting a flush/acknowledged persistence step. Stress testing should therefore verify:
- request success rate
- actual persistence rate in InfluxDB
- divergence between received and stored telemetry
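One way to quantify that divergence, assuming each test payload carries a monotonically increasing sequence number (an addition the test publisher would make; it is not confirmed to exist in the current payload) and that the stored sequence numbers can be queried back out of InfluxDB after the run:

```python
def persistence_report(sent_seqs, stored_seqs) -> dict:
    """Compare sequence numbers sent by the publisher against those
    found in storage, returning the gaps and the loss percentage."""
    sent, stored = set(sent_seqs), set(stored_seqs)
    missing = sorted(sent - stored)
    loss_pct = 100.0 * len(missing) / len(sent) if sent else 0.0
    return {
        "sent": len(sent),
        "stored": len(stored & sent),  # ignore points from other runs
        "missing": missing,
        "loss_pct": round(loss_pct, 2),
    }
```

Listing the exact missing sequence numbers (not just a percentage) helps localize loss: contiguous gaps suggest a burst drop or backend stall, while scattered misses point at per-request failures.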
Questions This Testing Should Answer
- What sustained publish rate can the backend handle reliably?
- At what point do request failures or storage gaps start appearing?
- Is data loss happening at ingestion, validation, or storage?
- How much end-to-end lag appears between Jetson publish time and storage time?
- Does SSE broadcasting materially affect ingestion throughput?
- What operational safeguards or code changes are needed to improve reliability?
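For the first question, a simple ascending sweep is usually enough; this sketch assumes a hypothetical `run_test(rate)` that executes one sustained run at the given rate and returns the measured loss percentage, and an acceptable-loss threshold that the team still needs to agree on.

```python
def max_reliable_rate(rates_hz, run_test, max_loss_pct: float = 1.0):
    """Sweep rates in ascending order; return the highest rate whose
    measured loss stays within max_loss_pct, or None if none qualify.

    Stops at the first failing rate, since loss is expected to grow
    monotonically with load.
    """
    best = None
    for rate in sorted(rates_hz):
        if run_test(rate) <= max_loss_pct:
            best = rate
        else:
            break
    return best
```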
Proposed Deliverables
- A repeatable stress-test plan for local/dev environments
- A Jetson-like publisher or simulator configuration for sustained load
- Metrics and summary results for throughput, loss, and latency
- A short findings report with bottlenecks and recommended fixes
- Follow-up implementation tasks if reliability gaps are found
Suggested Test Scenarios
- Baseline steady-state test at low frequency
- Sustained `10Hz` publish test
- Sustained higher-rate publish test, such as `20Hz`, `50Hz`, or higher if realistic
- Burst test with short spikes above the normal operating rate
- Long-running soak test to detect drift, memory growth, or delayed failures
- Test with and without SSE clients connected
- Test across local network conditions representative of Jetson usage
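The scenarios above could be encoded so runs are repeatable and expected sample counts are computed rather than hand-tallied. The specific rates and durations below are placeholders, not agreed numbers:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    rate_hz: float
    duration_s: float
    sse_clients: int = 0  # run each scenario with and without SSE listeners

    @property
    def expected_samples(self) -> int:
        """How many points should exist in InfluxDB if nothing is lost."""
        return int(self.rate_hz * self.duration_s)


# Illustrative matrix only; durations and rates are assumptions.
SCENARIOS = [
    Scenario("baseline-1hz", 1, 300),
    Scenario("sustained-10hz", 10, 600),
    Scenario("sustained-50hz", 50, 600),
    Scenario("soak-10hz", 10, 4 * 3600, sse_clients=2),
]
```

Each run's `expected_samples` becomes the denominator for the loss percentage in the metrics below.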
Metrics To Capture
- Total telemetry payloads sent
- Total HTTP `200` responses received
- Total non-`200` responses
- Validation rejection count
- Estimated number of telemetry points actually stored in InfluxDB
- Sample loss percentage: sent vs stored
- Request latency percentiles
- Storage lag / end-to-end latency
- CPU and memory usage on the backend
- InfluxDB write/backpressure symptoms
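For the latency percentiles, a small nearest-rank helper avoids pulling in a benchmarking framework; the p50/p95/p99 breakdown below is a suggestion, not a requirement from this issue.

```python
import math


def percentile(samples, p: float):
    """Nearest-rank percentile (p in (0, 100]) of a list of samples."""
    if not samples:
        raise ValueError("no samples")
    data = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(data)))
    return data[rank - 1]


def latency_summary(latencies_ms):
    """Summarize request or end-to-end latencies at common percentiles."""
    return {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
```

Percentiles matter more than averages here: a healthy mean can hide a p99 tail long enough to back up a `10Hz` publisher.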
Acceptance Criteria
- A repeatable stress-test workflow exists for the backend telemetry path
- The team can quantify data loss under sustained Jetson-like load
- Results clearly distinguish between accepted requests and persisted points
- We identify the maximum reliable sustained publish rate for the current implementation
- We document bottlenecks and concrete next steps to reduce loss
- If loss exceeds acceptable limits, follow-up issues are created