Skip to content

Add tracer spans inside list path so slow ps calls are visible#208

Open
sjmiller609 wants to merge 1 commit intomainfrom
hypeship/trace-list-instances
Open

Add tracer spans inside list path so slow ps calls are visible#208
sjmiller609 wants to merge 1 commit intomainfrom
hypeship/trace-list-instances

Conversation

@sjmiller609
Copy link
Copy Markdown
Collaborator

@sjmiller609 sjmiller609 commented May 3, 2026

Summary

The /instances Server span has no children today, so a slow hypeman ps (and the api ListInstances calls from reaper workflows) is completely opaque in traces. The handler walks every metadata file, derives state, scans rotated app logs for boot markers, and then runs a second pass that takes per-instance write locks to persist any markers it just hydrated — but none of that work emits spans. When something stalls inside the loop, traces show a single multi-second server span with nothing under it.

This adds tracer.Start spans throughout the listing pipeline:

  • instances.list — wraps ListInstances
  • instances.list.persist_boot_markers — wraps the post-loop persist pass (records how many instances actually persisted)
  • instances.list_metadata — wraps listInstances (records metadata_files and instances counts)
  • instances.list_metadata.hydrate_one — per-instance, with instance_id
  • instances.derive_state — wraps deriveStateWithOptions
  • instances.hydrate_boot_markers — wraps hydrateBootMarkersFromLogs
  • instances.parse_boot_markers — wraps parseBootMarkers (records log_paths count, since the rotated-log scan size matters here)
  • instances.persist_boot_markers — wraps maybePersistBootMarkers (so we can see lock-wait time)

To make those spans nest correctly, hydrateBootMarkersFromLogs and parseBootMarkers now take a context.Context; the only non-test caller already had ctx in scope, and the two test callers pass t.Context().

Test plan

  • go test ./lib/instances/ -run='TestParseBootMarkers|TestHydrateBootMarkers'
  • go test ./cmd/api/api/ -run='TestList'
  • After deploy: confirm a GET /instances trace in SigNoz shows the new instances.* child spans and that the slow region is now attributable.

🤖 Generated with Claude Code


Note

Low Risk
Low risk: changes are limited to adding OpenTelemetry spans/attributes and threading context.Context through boot-marker hydration/parsing, with no intended behavior changes beyond trace visibility.

Overview
Adds detailed OpenTelemetry instrumentation across the /instances listing pipeline so slow hypeman ps/ListInstances calls are attributable in traces.

ListInstances and its inner loops now emit nested instances.* spans (per-instance hydration, state derivation, boot-marker log scanning, and boot-marker persistence/lock time), and record helpful counts via span attributes (e.g., instances returned, metadata files, log paths scanned, and number of instances that persisted boot markers).

Reviewed by Cursor Bugbot for commit b86be59. Bugbot is set up for automated code reviews on this repo. Configure here.

The /instances Server span has no children today, so a slow ps is
opaque in traces. Instrument the listing pipeline:
- instances.list (ListInstances)
- instances.list.persist_boot_markers (post-loop pass)
- instances.list_metadata (listInstances) + per-id hydrate_one
- instances.derive_state
- instances.hydrate_boot_markers
- instances.parse_boot_markers (records log_paths count)
- instances.persist_boot_markers (under per-instance lock)

Plumb context.Context through hydrateBootMarkersFromLogs and
parseBootMarkers so child spans hang off the right parent.

Co-Authored-By: Claude Opus 4.7 <noreply@anthropic.com>
@sjmiller609 sjmiller609 requested a review from hiroTamada May 6, 2026 23:54
@sjmiller609 sjmiller609 marked this pull request as ready for review May 6, 2026 23:54
@firetiger-agent
Copy link
Copy Markdown

Firetiger deploy monitoring skipped

This PR didn't match the auto-monitor filter configured on your GitHub connection:

Any PR that changes the kernel API. Monitor changes to API endpoints (packages/api/cmd/api/) and Temporal workflows (packages/api/lib/temporal) in the kernel repo

Reason: PR adds observability/tracing instrumentation to existing endpoints rather than changing the API contract or workflow definitions themselves; please opt in manually if deploy monitoring is needed.

To monitor this PR anyway, reply with @firetiger monitor this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant