
Stress test backend telemetry ingestion for continuous Jetson publishing with minimal data loss #9

@MohamedEBR

Description

Summary

Telemetry ingestion now uses an HTTP push model, with the Jetson publishing directly to the backend. The next step is to stress test the backend under sustained telemetry load to confirm that it can receive, process, and store continuous data while minimizing loss.

Background

  • Jetson telemetry is sent to POST /api/v1/telemetry using an x-api-key.
  • The backend validates payloads, writes telemetry to InfluxDB, and broadcasts updates to SSE clients.
  • Basic connectivity has been verified manually, but behavior under sustained or higher-frequency publishing has not yet been validated.
  • The PR notes stress testing for 10Hz+ as a next step.
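For reference, a minimal Jetson-like publisher can be sketched in Python using only the standard library. The endpoint URL, key, and payload fields below are placeholders; they should be matched to the real telemetry schema and dev environment. The fixed-deadline schedule (rather than `sleep(period)` after each send) keeps the rate from drifting under load:

```python
import json
import time
import urllib.request

API_URL = "http://localhost:3000/api/v1/telemetry"  # assumed dev endpoint
API_KEY = "dev-key"                                 # assumed dev key

def build_payload(seq: int) -> dict:
    # Hypothetical field names; replace with the real telemetry schema.
    return {"seq": seq, "timestamp": time.time(), "speed": 0.0}

def publish(payload: dict) -> int:
    req = urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "x-api-key": API_KEY},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.status

def run(rate_hz: float, duration_s: float) -> None:
    period = 1.0 / rate_hz
    deadline = time.monotonic()
    end = deadline + duration_s
    seq = 0
    while deadline < end:
        publish(build_payload(seq))
        seq += 1
        deadline += period  # fixed schedule: no cumulative drift
        time.sleep(max(0.0, deadline - time.monotonic()))
```

Including a monotonically increasing `seq` in each payload makes sent-vs-stored reconciliation trivial: gaps in the stored sequence numbers are exactly the lost samples.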

Problem

The current pipeline has been validated for correctness at small scale, but we do not yet know:

  • how it behaves under continuous Jetson publishing
  • how much data is lost during sustained load
  • whether the backend can keep up with writes to InfluxDB
  • where the bottlenecks are across HTTP ingestion, validation, broadcast, and persistence

This is especially important because the current telemetry request flow writes to InfluxDB in a fire-and-forget style, so successful HTTP responses alone do not prove durable storage.

Goal

Measure and improve backend reliability for continuous telemetry ingestion so the Jetson can publish sustained streams with minimal packet/sample loss and acceptable end-to-end latency.

Scope

In scope

  • Load and soak testing for POST /api/v1/telemetry
  • Measuring accepted requests versus persisted telemetry points
  • Measuring dropped samples, validation failures, write failures, and latency
  • Testing sustained Jetson-like publish rates, including 10Hz+
  • Identifying bottlenecks in backend processing and storage
  • Recommending code or config changes to reduce data loss

Out of scope

  • Frontend rendering performance
  • Large-scale production deployment work
  • Replacing the ingestion protocol unless testing proves it is necessary

Why This Matters

Telemetry is only useful if the backend can consistently capture and store it. A backend that returns 200 OK but silently drops writes, lags behind, or loses bursts of samples will create blind spots for analysis and live operations.

Technical Context

Current ingestion flow

  1. Jetson sends POST /api/v1/telemetry
  2. Backend validates the payload and timestamp skew
  3. Backend adds serverReceiveTime
  4. Backend writes the payload to InfluxDB
  5. Backend broadcasts the payload to connected SSE clients
  6. Backend returns 200 {"message":"Telemetry received"}

Important reliability note

The current controller calls writeTelemetryPayload(...) without awaiting a flush/acknowledged persistence step. Stress testing should therefore verify:

  • request success rate
  • actual persistence rate in InfluxDB
  • divergence between received and stored telemetry
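The received/stored divergence is easy to reproduce in miniature. The sketch below is not the project's code; it is a generic asyncio illustration of why a fire-and-forget write lets every request report success while zero points are actually persisted if the pending writes never drain:

```python
import asyncio

async def slow_store(db: list, point: int) -> None:
    # Stand-in for an InfluxDB write with nonzero latency.
    await asyncio.sleep(0.01)
    db.append(point)

async def main() -> tuple[int, int]:
    db: list = []
    accepted = 0
    for i in range(100):
        # Fire-and-forget: schedule the write, "return 200" immediately.
        asyncio.get_running_loop().create_task(slow_store(db, i))
        accepted += 1
    # Every request has already been accepted at this point; if the
    # process exits (or the loop closes) now, pending writes are lost.
    return accepted, len(db)

accepted, stored = asyncio.run(main())
print(accepted, stored)  # -> 100 0
```

This is why the test plan must count points in InfluxDB directly rather than trusting HTTP 200 totals.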

Questions This Testing Should Answer

  • What sustained publish rate can the backend handle reliably?
  • At what point do request failures or storage gaps start appearing?
  • Is data loss happening at ingestion, validation, or storage?
  • How much end-to-end lag appears between Jetson publish time and storage time?
  • Does SSE broadcasting materially affect ingestion throughput?
  • What operational safeguards or code changes are needed to improve reliability?

Proposed Deliverables

  • A repeatable stress-test plan for local/dev environments
  • A Jetson-like publisher or simulator configuration for sustained load
  • Metrics and summary results for throughput, loss, and latency
  • A short findings report with bottlenecks and recommended fixes
  • Follow-up implementation tasks if reliability gaps are found

Suggested Test Scenarios

  • Baseline steady-state test at low frequency
  • Sustained 10Hz publish test
  • Sustained higher-rate publish test, such as 20Hz, 50Hz, or higher if realistic
  • Burst test with short spikes above normal operating rate
  • Long-running soak test to detect drift, memory growth, or delayed failures
  • Test with and without SSE clients connected
  • Test across local network conditions representative of Jetson usage
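A burst scenario can be driven by a simple rate function that the publisher samples each cycle. The default rates and timings below are illustrative placeholders, not agreed targets:

```python
def target_rate_hz(
    t: float,
    base_hz: float = 10.0,      # assumed steady-state rate
    burst_hz: float = 50.0,     # assumed spike rate
    burst_period_s: float = 60.0,
    burst_len_s: float = 5.0,
) -> float:
    """Target publish rate at elapsed time t: a short spike at the
    start of every burst_period_s window, base rate otherwise."""
    return burst_hz if (t % burst_period_s) < burst_len_s else base_hz
```

Feeding this into the publisher loop yields the burst test; setting `burst_hz == base_hz` degenerates to the steady-state and soak scenarios, so one driver covers several rows of the list above.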

Metrics To Capture

  • Total telemetry payloads sent
  • Total HTTP 200 responses received
  • Total non-200 responses
  • Validation rejection count
  • Estimated number of telemetry points actually stored in InfluxDB
  • Sample loss percentage: sent vs stored
  • Request latency percentiles
  • Storage lag / end-to-end latency
  • CPU and memory usage on the backend
  • InfluxDB write/backpressure symptoms
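The loss and latency metrics above reduce to two small calculations, sketched here so runs report them consistently (nearest-rank percentiles are assumed; swap in interpolation if preferred):

```python
def loss_percentage(sent: int, stored: int) -> float:
    """Sample loss: payloads sent vs points actually persisted."""
    return 0.0 if sent == 0 else 100.0 * (sent - stored) / sent

def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile over request latencies (no interpolation)."""
    s = sorted(samples)
    k = round(pct / 100.0 * (len(s) - 1))
    return s[min(max(k, 0), len(s) - 1)]
```

For example, 1000 payloads sent with 990 persisted is a 1.0% loss, and p95 over the latency samples of a run gives the tail-latency figure for the findings report.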

Acceptance Criteria

  • A repeatable stress-test workflow exists for the backend telemetry path
  • The team can quantify data loss under sustained Jetson-like load
  • Results clearly distinguish between accepted requests and persisted points
  • We identify the maximum reliable sustained publish rate for the current implementation
  • We document bottlenecks and concrete next steps to reduce loss
  • If loss exceeds acceptable limits, follow-up issues are created
