fix(traces): adjust stale span timestamps for SnapStart restores #1114

Draft
jchrostek-dd wants to merge 5 commits into main from john/APMS-18793

Conversation

@jchrostek-dd
Contributor

Summary

Fixes APMS-18793 - SnapStart Span Duration Bug

When a SnapStart-enabled Lambda function is restored from a snapshot, tracer spans (such as Java Netty HTTP client spans) may carry timestamps from when the snapshot was created rather than from when the restore happened. As a result, traces appeared to span 24+ hours.

Changes:

  • Store snapstart_restore_time on invocation context when PlatformRestoreStart is received
  • When processing tracer spans, detect spans with stale timestamps (>60 seconds before restore time)
  • Adjust stale span start times forward to the restore time, preserving original timing in tags
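A minimal sketch of the first two changes, for reviewers who want the logic in one place. It is not the code in this PR: `InvocationContext`, `ContextStore`, `on_platform_restore_start`, and `is_stale` are hypothetical names, and timestamps are assumed to be nanoseconds since the Unix epoch. Only `snapstart_restore_time`, `SIXTY_SECONDS_NS`, and the 60-second threshold come from the description and commit log.

```rust
use std::collections::HashMap;

/// Spans starting more than this long before the restore count as stale.
const SIXTY_SECONDS_NS: i64 = 60 * 1_000_000_000;

/// Hypothetical per-invocation context; the real struct has more fields.
#[derive(Debug, Default)]
pub struct InvocationContext {
    /// Set when PlatformRestoreStart is received (assumed: epoch nanoseconds).
    pub snapstart_restore_time: Option<i64>,
}

/// Hypothetical store of invocation contexts keyed by request id.
#[derive(Debug, Default)]
pub struct ContextStore {
    contexts: HashMap<String, InvocationContext>,
}

impl ContextStore {
    /// Record the restore time for an invocation, creating its context if needed.
    pub fn on_platform_restore_start(&mut self, request_id: &str, restore_time_ns: i64) {
        self.contexts
            .entry(request_id.to_string())
            .or_default()
            .snapstart_restore_time = Some(restore_time_ns);
    }

    /// Look up the restore time recorded for an invocation, if any.
    pub fn restore_time(&self, request_id: &str) -> Option<i64> {
        self.contexts
            .get(request_id)
            .and_then(|ctx| ctx.snapstart_restore_time)
    }
}

/// A span is stale when it starts more than 60 seconds before the restore.
pub fn is_stale(span_start_ns: i64, restore_time_ns: i64) -> bool {
    span_start_ns < restore_time_ns.saturating_sub(SIXTY_SECONDS_NS)
}
```

In this sketch the comparison is strict, so a span that starts exactly 60 seconds before the restore is left alone.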

Tags added to adjusted spans:

  • _dd.snapstart_adjusted=true - Indicates the span was adjusted
  • _dd.snapstart_original_start=<timestamp> - Preserves the original start time for debugging
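The adjustment and tagging could look roughly like the sketch below. The `Span` struct and `adjust_stale_span` are hypothetical stand-ins for the extension's real types. It also assumes that tags live in a string `meta` map, that the original start is written as an integer nanosecond value, and that the span's duration is left unchanged. None of these details are confirmed by the description above.

```rust
use std::collections::HashMap;

const SIXTY_SECONDS_NS: i64 = 60 * 1_000_000_000;

/// Hypothetical, simplified span (assumed: times in epoch nanoseconds).
#[derive(Debug, Clone, Default)]
pub struct Span {
    pub start: i64,
    pub duration: i64,
    pub meta: HashMap<String, String>,
}

/// Move a stale span's start forward to the restore time and tag it.
/// Returns true when the span was adjusted.
pub fn adjust_stale_span(span: &mut Span, restore_time_ns: i64) -> bool {
    // Not stale: the span starts within 60 seconds of the restore, or after it.
    if span.start >= restore_time_ns.saturating_sub(SIXTY_SECONDS_NS) {
        return false;
    }
    // Preserve the original start before overwriting it.
    span.meta.insert(
        "_dd.snapstart_original_start".to_string(),
        span.start.to_string(),
    );
    span.meta
        .insert("_dd.snapstart_adjusted".to_string(), "true".to_string());
    // Assumption: only the start moves; the duration is kept as reported.
    span.start = restore_time_ns;
    true
}
```

Spans that are not stale pass through without either tag, so the presence of `_dd.snapstart_adjusted` identifies exactly the spans that were rewritten.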

Test plan

  • cargo fmt
  • cargo clippy
  • cargo test (505+ unit tests pass)
  • Manual testing with SnapStart Java Lambda

🤖 Generated with Claude Code

jchrostek-dd and others added 5 commits March 18, 2026 15:46
- Remove redundant comments that just restate what code does
- Extract magic number into named constant SIXTY_SECONDS_NS
- Consolidate multi-line comments into clearer explanations

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Add test infrastructure to validate that traces have reasonable durations
after SnapStart restore. The test:
- Creates a Java Lambda that makes HTTP requests during static init
- Waits 2 minutes after snapshot creation for timestamps to become stale
- Verifies trace duration is < 1 minute (not 24+ hours)
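The duration check in the last bullet can be sketched as follows. This is an illustration in Rust rather than the integration test itself. It assumes that trace duration means the gap between the earliest span start and the latest span end, with spans given as `(start, duration)` pairs in nanoseconds. `trace_duration_ns` and `has_reasonable_duration` are hypothetical names.

```rust
const ONE_MINUTE_NS: i64 = 60 * 1_000_000_000;

/// Duration of a trace, taken here as latest span end minus earliest span start.
/// Each span is a (start, duration) pair in nanoseconds. Returns None for an
/// empty trace.
pub fn trace_duration_ns(spans: &[(i64, i64)]) -> Option<i64> {
    let earliest_start = spans.iter().map(|(start, _)| *start).min()?;
    let latest_end = spans
        .iter()
        .map(|(start, duration)| start.saturating_add(*duration))
        .max()?;
    Some(latest_end.saturating_sub(earliest_start))
}

/// True when the trace is non-empty and shorter than one minute.
pub fn has_reasonable_duration(spans: &[(i64, i64)]) -> bool {
    matches!(trace_duration_ns(spans), Some(d) if d < ONE_MINUTE_NS)
}
```

With one span starting two minutes before the rest, the computed duration exceeds a minute and the check fails, which is the symptom the test waits for.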

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

- Switch from java.net.http.HttpClient to OkHttp for better dd-trace-java
  instrumentation coverage
- Add test assertion to verify OkHttp spans appear in the invocation trace
- Add diagnostic function to search all spans from service
- Add detailed span logging for debugging

Test now validates:
- OkHttp spans are created by Java tracer
- OkHttp spans are correctly linked to the Lambda invocation trace
- Trace structure includes extension and tracer spans

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

Previously, only spans with request_id metadata were adjusted for stale
SnapStart timestamps. This missed tracer spans like OkHttp requests that
don't have request_id.

Now the fix:
1. Finds request_id from any span in the trace chunk
2. Looks up the restore_time for that invocation
3. Adjusts ALL spans with timestamps before the threshold

Integration test verified: OkHttp span timestamp went from 195 seconds
before invocation (stale) to 2.4 seconds before (at restore time).
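
The three steps above can be sketched at the trace-chunk level as follows. This is an illustration under assumptions, not the committed code: `Span` and `adjust_chunk` are hypothetical, `request_id` is assumed to be a key in each span's string `meta` map, and restore times are assumed to be held in a map keyed by request id.

```rust
use std::collections::HashMap;

const SIXTY_SECONDS_NS: i64 = 60 * 1_000_000_000;

/// Hypothetical, simplified span (assumed: start in epoch nanoseconds).
#[derive(Debug, Clone, Default)]
pub struct Span {
    pub name: String,
    pub start: i64,
    pub meta: HashMap<String, String>,
}

/// Adjust every stale span in a chunk; returns how many spans were changed.
pub fn adjust_chunk(spans: &mut [Span], restore_times: &HashMap<String, i64>) -> usize {
    // 1. Find a request_id on any span in the chunk.
    let Some(request_id) = spans
        .iter()
        .find_map(|span| span.meta.get("request_id").cloned())
    else {
        return 0;
    };
    // 2. Look up the restore time recorded for that invocation.
    let Some(&restore_time) = restore_times.get(&request_id) else {
        return 0;
    };
    // 3. Adjust all spans that start before the threshold, with or without
    //    their own request_id.
    let threshold = restore_time.saturating_sub(SIXTY_SECONDS_NS);
    let mut adjusted = 0;
    for span in spans.iter_mut().filter(|span| span.start < threshold) {
        span.meta.insert(
            "_dd.snapstart_original_start".to_string(),
            span.start.to_string(),
        );
        span.meta
            .insert("_dd.snapstart_adjusted".to_string(), "true".to_string());
        span.start = restore_time;
        adjusted += 1;
    }
    adjusted
}
```

A chunk in which no span carries a `request_id`, or whose invocation has no recorded restore time, is returned untouched in this sketch.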

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>