chore(dynamic-sampling): add status for snuba timeouts #115359
Conversation
shellmayr
commented
May 12, 2026
- Snuba timeouts are not generic failures; they should therefore not be marked as such in our observability
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit ffa5096. Configure here.
    except SnubaRPCError as exc:
        sentry_sdk.capture_exception(exc)
        emit_status(status_metric, TelemetryStatus.SNUBA_TIMEOUT)
        raise
All SnubaRPC errors incorrectly classified as timeouts
Medium Severity
except SnubaRPCError catches all subclasses including SnubaRPCRateLimitExceeded and SnubaRPCTooManySimultaneous, but labels every one of them as SNUBA_TIMEOUT. SnubaRPCError is also raised for generic non-200 HTTP responses and protobuf decode failures. This defeats the stated purpose of distinguishing timeouts from generic failures in observability — non-timeout errors are now misclassified as timeouts rather than as failures.
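The misclassification can be sketched as follows. The exception classes here are illustrative stand-ins named after the types mentioned in the review, not the real sentry/snuba definitions:

```python
class SnubaRPCError(Exception):
    """Stand-in base class; per the review, also raised for non-200
    HTTP responses and protobuf decode failures."""

class SnubaRPCTimeout(SnubaRPCError):
    """Stand-in for the exception raised only on genuine timeouts."""

class SnubaRPCRateLimitExceeded(SnubaRPCError):
    """Stand-in for the rate-limit exception."""

def status_for(exc: Exception) -> str:
    # Mirrors the patched handler: ANY SnubaRPCError subclass is
    # labeled a timeout, regardless of its actual cause.
    if isinstance(exc, SnubaRPCError):
        return "snuba_timeout"
    return "failed"

# A rate-limit error is mislabeled as a timeout:
assert status_for(SnubaRPCRateLimitExceeded()) == "snuba_timeout"
```

Because `except` matches by subclass, the broad clause swallows every Snuba error before a more specific status can be assigned.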
    except SnubaRPCError as exc:
        sentry_sdk.capture_exception(exc)
        emit_status(status_metric, TelemetryStatus.SNUBA_TIMEOUT)
Bug: Catching the generic SnubaRPCError misclassifies various errors as timeouts and leads to inconsistent metric tagging (snuba_timeout vs. failed) for the same event.
Severity: MEDIUM
Suggested Fix
To accurately capture only timeouts, catch the more specific SnubaRPCTimeout exception instead of the general SnubaRPCError. If the goal is to handle other Snuba errors separately, add specific except blocks for them. To fix the inconsistent tagging, ensure the duration metric's status is set consistently with the status metric within the exception block.
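A minimal sketch of that fix, with stand-in exception classes and a hypothetical `emit_status`/metric name in place of the real Sentry helpers:

```python
class SnubaRPCError(Exception):
    """Stand-in base class for Snuba RPC failures."""

class SnubaRPCTimeout(SnubaRPCError):
    """Stand-in for the specific timeout exception."""

emitted = []  # records (metric, status) pairs in place of real metrics

def emit_status(metric, status):
    emitted.append((metric, status))

STATUS_METRIC = "dynamic_sampling.telemetry.status"  # hypothetical name

def guarded_query(run_query):
    try:
        return run_query()
    except SnubaRPCTimeout:
        # Only genuine timeouts get the timeout status.
        emit_status(STATUS_METRIC, "snuba_timeout")
        raise
    except SnubaRPCError:
        # Every other Snuba error is reported as a plain failure,
        # matching the status the duration metric reports on re-raise.
        emit_status(STATUS_METRIC, "failed")
        raise

def raiser(exc):
    def f():
        raise exc
    return f

for exc in (SnubaRPCTimeout(), SnubaRPCError()):
    try:
        guarded_query(raiser(exc))
    except SnubaRPCError:
        pass

assert [s for _, s in emitted] == ["snuba_timeout", "failed"]
```

Ordering matters: the `SnubaRPCTimeout` clause must come before the `SnubaRPCError` clause, since `except` matches the first clause whose class covers the raised exception.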
Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid.
Location: src/sentry/dynamic_sampling/per_org/tasks/telemetry.py#L117-L119
Potential issue: The `except SnubaRPCError` block is too broad and will incorrectly
handle various Snuba errors as timeouts. `SnubaRPCError` is a base class for multiple
error types, including rate limits, concurrent query limits, and other HTTP errors, not
just timeouts. Consequently, any of these non-timeout errors will be caught and
mislabeled with the `SNUBA_TIMEOUT` status. Furthermore, because the exception is
re-raised, an outer exception handler tags the associated duration metric with
`status=failed`, creating an inconsistency where the status metric reports
`snuba_timeout` but the duration metric reports `failed` for the exact same event.
This might be valid. Maybe we could call this error status SNUBA_ERROR?
Yeah, it's a good point, but I really want these to be semantically meaningful errors. If that's not yet possible, I think we should add those statuses to the Sentry Snuba layer first.
Adding a change here; will revisit once it has shipped.

