feat(otel): Add checkpoint restart-recovery integration test #705

Open
louisall wants to merge 1 commit into feat/otel-logs-integration-tests from feat/otel-logs-checkpoint-integration-test

@louisall
Contributor

Summary

End-to-end integration test verifying that file_storage checkpoint persistence prevents log loss across agent pod restarts.

Test Flow

  1. Confirms the agent is tailing nginx-test (a warmup marker arrives in CW Logs)
  2. Emits 20 numbered markers via /proc/1/fd/1 to the container log stream
  3. Kills the agent pod mid-sequence (after marker 10)
  4. Emits the remaining markers while the agent is restarting
  5. Waits for the agent to restart and for logs to propagate
  6. Queries CW Logs and asserts that all 20 markers arrive exactly once

Validates: no gaps (checkpoint resumes correctly) and no duplicates (offset is accurate).
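
The final assertion can be sketched as follows. This is a minimal illustration, not the test's actual code: it assumes the queried log messages are already available as a list of strings and that markers follow the `marker-NNN` format shown in the results. The function name `check_exactly_once` is hypothetical.

```python
import re
from collections import Counter

# Assumed marker format, based on the "marker-001" lines in the results.
MARKER_RE = re.compile(r"marker-(\d{3})\b")


def check_exactly_once(messages, total=20):
    """Return (missing, duplicated) for markers 1..total.

    missing    -> marker numbers never seen (a gap: checkpoint resumed too far ahead)
    duplicated -> {marker number: count} for markers seen more than once
                  (a replay: checkpoint resumed from a stale offset)
    """
    counts = Counter()
    for msg in messages:
        for num in MARKER_RE.findall(msg):
            counts[int(num)] += 1

    missing = [n for n in range(1, total + 1) if counts[n] == 0]
    duplicated = {n: counts[n] for n in range(1, total + 1) if counts[n] > 1}
    return missing, duplicated


if __name__ == "__main__":
    logs = [f"marker-{n:03d}" for n in range(1, 21)]
    missing, duplicated = check_exactly_once(logs)
    if not missing and not duplicated:
        print("PASS: all 20 markers delivered exactly once")
    else:
        print(f"FAIL: missing={missing} duplicated={duplicated}")
```

Reporting gaps and duplicates separately keeps the two failure modes distinguishable: a missing marker points at lost data, while a repeated marker points at re-read data.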

Dependencies

Design Decisions

  • Uses existing nginx-test deployment (already deployed by terraform for metrics tests) — no additional terraform resources needed
  • Writes to /proc/1/fd/1 to inject markers into the container log stream (not exec stdout)
  • Targets standard 10.0.x.x nodes (skips Fargate/special nodes)
  • Runs after metrics tests when the agent has been tailing nginx-test for several minutes
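
The marker injection described above can be sketched as a command builder. This is an illustration under stated assumptions, not the PR's implementation: the pod name, the namespace, and the helper `marker_command` are hypothetical, and the sketch only constructs the `kubectl exec` argument list without running it. Output of a `kubectl exec` command is returned to the caller rather than written to the container log, so the shell redirects into `/proc/1/fd/1`, the stdout of the container's PID 1, which the container runtime captures as the log stream.

```python
import shlex


def marker_command(pod, n, namespace="default"):
    """Build a kubectl exec argv that writes marker n into the container log.

    `pod` and `namespace` are placeholders; the real test may select them differently.
    """
    marker = f"marker-{n:03d}"
    # Redirect to PID 1's stdout so the line lands in the container log stream
    # instead of being returned to the kubectl client.
    shell = f"echo {shlex.quote(marker)} > /proc/1/fd/1"
    return ["kubectl", "exec", "-n", namespace, pod, "--", "sh", "-c", shell]


if __name__ == "__main__":
    # Hypothetical pod name, for illustration only.
    for n in (1, 10, 20):
        print(" ".join(shlex.quote(a) for a in marker_command("nginx-test-example", n)))
```

Passing the result to `subprocess.run(cmd, check=True)` would execute it against the current kubectl context.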

Results (manual run on metricsv2-testing cluster)

marker-001: 1
...
marker-020: 1
PASS: all 20 markers delivered exactly once

Verify file_storage checkpoint persistence prevents log loss across
agent pod restarts. Uses the existing nginx-test deployment as a
marker source — no additional terraform resources needed.

Test flow:
- Confirms agent is tailing nginx-test (warmup marker)
- Emits 20 numbered markers via /proc/1/fd/1
- Kills agent pod mid-sequence (after marker 10)
- Emits remaining markers while agent restarts
- Queries CW Logs and asserts all 20 arrive exactly once

Validates: no gaps (checkpoint resumes correctly) and no duplicates
(offset is accurate).
@louisall requested a review from a team as a code owner on May 26, 2026 13:55