Skip to content
Draft
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
6 changes: 6 additions & 0 deletions src/sentry/dynamic_sampling/per_org/tasks/telemetry.py
Original file line number Diff line number Diff line change
Expand Up @@ -14,6 +14,7 @@
metrics_sample_rate,
)
from sentry.utils import metrics
from sentry.utils.snuba_rpc import SnubaRPCError

F = TypeVar("F", bound=Callable[..., object])

Expand All @@ -38,6 +39,7 @@ class TelemetryStatus(StrEnum):
ORG_NOT_FOUND = "org_not_found"
ROLLOUT_DISABLED = "rollout_disabled"
ROLLOUT_EXCLUDED = "rollout_excluded"
SNUBA_TIMEOUT = "snuba_timeout"


class DynamicSamplingException(Exception):
Expand Down Expand Up @@ -112,6 +114,10 @@ def wrapper(*args: object, **kwargs: object) -> object:
result = func(*args, **kwargs)
except DynamicSamplingException as exc:
result = exc.status
except SnubaRPCError as exc:
sentry_sdk.capture_exception(exc)
emit_status(status_metric, TelemetryStatus.SNUBA_TIMEOUT)
Comment on lines +117 to +119
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bug: Catching the generic SnubaRPCError misclassifies various errors as timeouts and leads to inconsistent metric tagging (snuba_timeout vs. failed) for the same event.
Severity: MEDIUM

Suggested Fix

To accurately capture only timeouts, catch the more specific SnubaRPCTimeout exception instead of the general SnubaRPCError. If the goal is to handle other Snuba errors separately, add specific except blocks for them. To fix the inconsistent tagging, ensure the duration metric's status is set consistently with the status metric within the exception block.

Prompt for AI Agent
Review the code at the location below. A potential bug has been identified by an AI
agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's
not valid.

Location: src/sentry/dynamic_sampling/per_org/tasks/telemetry.py#L117-L119

Potential issue: The `except SnubaRPCError` block is too broad and will incorrectly
handle various Snuba errors as timeouts. `SnubaRPCError` is a base class for multiple
error types, including rate limits, concurrent query limits, and other HTTP errors, not
just timeouts. Consequently, any of these non-timeout errors will be caught and
mislabeled with the `SNUBA_TIMEOUT` status. Furthermore, because the exception is
re-raised, an outer exception handler tags the associated duration metric with
`status=failed`, creating an inconsistency where the status metric reports
`snuba_timeout` but the duration metric reports `failed` for the exact same event.

Did we get this right? 👍 / 👎 to inform future reviews.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might be valid. Maybe we can call this error status SNUBA_ERRORß

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it's a good point but I really want these to be semantically meaningful errors - if that is not yet possible, I think we should add those statuses to the sentry snuba layer first

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding a change here and will revisit when it shipped

raise
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All SnubaRPC errors incorrectly classified as timeouts

Medium Severity

except SnubaRPCError catches all subclasses including SnubaRPCRateLimitExceeded and SnubaRPCTooManySimultaneous, but labels every one of them as SNUBA_TIMEOUT. SnubaRPCError is also raised for generic non-200 HTTP responses and protobuf decode failures. This defeats the stated purpose of distinguishing timeouts from generic failures in observability — non-timeout errors are now misclassified as timeouts rather than as failures.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit ffa5096. Configure here.

except Exception as exc:
sentry_sdk.capture_exception(exc)
emit_status(status_metric, TelemetryStatus.FAILED)
Expand Down
Loading