Skip to content

chore(dynamic-sampling): add status for snuba timeouts#115359

Draft
shellmayr wants to merge 1 commit into
masterfrom
shellmayr/chore/add-status-for-snuba-timeouts
Draft

chore(dynamic-sampling): add status for snuba timeouts#115359
shellmayr wants to merge 1 commit into
masterfrom
shellmayr/chore/add-status-for-snuba-timeouts

Conversation

@shellmayr
Copy link
Copy Markdown
Member

  • Snuba timeouts are not generic failures - they should thus not be marked as such in our observability

@shellmayr shellmayr requested a review from a team as a code owner May 12, 2026 07:15
@github-actions github-actions Bot added the Scope: Backend Automatically applied to PRs that change backend components label May 12, 2026
Copy link
Copy Markdown
Contributor

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit ffa5096. Configure here.

except SnubaRPCError as exc:
sentry_sdk.capture_exception(exc)
emit_status(status_metric, TelemetryStatus.SNUBA_TIMEOUT)
raise
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All SnubaRPC errors incorrectly classified as timeouts

Medium Severity

except SnubaRPCError catches all subclasses including SnubaRPCRateLimitExceeded and SnubaRPCTooManySimultaneous, but labels every one of them as SNUBA_TIMEOUT. SnubaRPCError is also raised for generic non-200 HTTP responses and protobuf decode failures. This defeats the stated purpose of distinguishing timeouts from generic failures in observability — non-timeout errors are now misclassified as timeouts rather than as failures.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ffa5096. Configure here.

Comment on lines +117 to +119
except SnubaRPCError as exc:
sentry_sdk.capture_exception(exc)
emit_status(status_metric, TelemetryStatus.SNUBA_TIMEOUT)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Catching the generic SnubaRPCError misclassifies various errors as timeouts and leads to inconsistent metric tagging (snuba_timeout vs. failed) for the same event.
Severity: MEDIUM

Suggested Fix

To accurately capture only timeouts, catch the more specific SnubaRPCTimeout exception instead of the general SnubaRPCError. If the goal is to handle other Snuba errors separately, add specific except blocks for them. To fix the inconsistent tagging, ensure the duration metric's status is set consistently with the status metric within the exception block.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/sentry/dynamic_sampling/per_org/tasks/telemetry.py#L117-L119

Potential issue: The `except SnubaRPCError` block is too broad and will incorrectly
handle various Snuba errors as timeouts. `SnubaRPCError` is a base class for multiple
error types, including rate limits, concurrent query limits, and other HTTP errors, not
just timeouts. Consequently, any of these non-timeout errors will be caught and
mislabeled with the `SNUBA_TIMEOUT` status. Furthermore, because the exception is
re-raised, an outer exception handler tags the associated duration metric with
`status=failed`, creating an inconsistency where the status metric reports
`snuba_timeout` but the duration metric reports `failed` for the exact same event.

Did we get this right? 👍 / 👎 to inform future reviews.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be valid. Maybe we can call this error status SNUBA_ERRORß

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a good point but I really want these to be semantically meaningful errors - if that is not yet possible, I think we should add those statuses to the sentry snuba layer first

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a change here and will revisit when it shipped

@shellmayr shellmayr marked this pull request as draft May 12, 2026 07:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Scope: Backend Automatically applied to PRs that change backend components

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants