feat(RetryStore): handle transient or retryable DB errors by victoria-yining-huang · Pull Request #634 · getsentry/taskbroker

victoria-yining-huang · 2026-05-13T02:54:36Z

ticket https://linear.app/getsentry/issue/STREAM-940/handle-server-shutting-down-errors-and-retry

This PR is a no-op by default; it only introduces the feature. It can be turned on by setting a value for db_query_max_retries.

Add RetryStore wrapper (src/store/retry.rs) that implements InflightActivationStore and retries failed queries with a static sleep db_query_retry_delay_ms, up to db_query_max_retries times
Uses sqlx::Error downcast to only retry transient errors (Io, PoolTimedOut, PoolClosed, WorkerCrashed) — permanent errors like constraint violations are surfaced immediately
Add db_query_max_retries config option (Option, default None). When None, RetryStore is bypassed entirely (no-op). Set to a number to enable retries.
Wire RetryStore into main.rs so all components (gRPC server, writer, upkeep, push, fetch) benefit when enabled
Add unit tests covering retry success, exhaustion, non-retryable errors, config wiring
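
For readers who want the behaviour described above in one place, here is a minimal, synchronous sketch. It is not the code in src/store/retry.rs: `DbError` is a hypothetical stand-in for the sqlx::Error variants listed in the description, `RetryConfig` and `with_retry` are illustrative names, and the assumption that an unset db_query_max_retries means a single attempt follows from the "bypassed entirely" wording.

```rust
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical stand-in for the sqlx::Error variants named in the description.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum DbError {
    Io,
    PoolTimedOut,
    PoolClosed,
    WorkerCrashed,
    /// Stands in for permanent failures such as constraint violations.
    Constraint,
}

/// Only infrastructure-level failures are treated as retryable.
fn is_retryable(err: &DbError) -> bool {
    matches!(
        err,
        DbError::Io | DbError::PoolTimedOut | DbError::PoolClosed | DbError::WorkerCrashed
    )
}

/// Hypothetical mirror of db_query_max_retries and db_query_retry_delay_ms.
struct RetryConfig {
    max_retries: Option<u32>,
    delay: Duration,
}

/// Runs `op`, retrying transient errors with a static delay.
/// With `max_retries == None` the operation runs exactly once.
fn with_retry<T>(
    cfg: &RetryConfig,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let max_retries = cfg.max_retries.unwrap_or(0);
    let mut attempt = 0;
    loop {
        match op() {
            Ok(value) => return Ok(value),
            Err(err) if is_retryable(&err) && attempt < max_retries => {
                attempt += 1;
                sleep(cfg.delay);
            }
            // Permanent errors and exhausted retries surface immediately.
            Err(err) => return Err(err),
        }
    }
}
```

With `max_retries: Some(3)`, an operation that always fails with a transient error is invoked four times in total (one initial attempt plus three retries), while a constraint-style error is returned after the first call.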

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

george-sentry

I agree that it would be preferable to avoid having two nearly identical macro definitions. Otherwise I like this solution!

victoria-yining-huang · 2026-05-13T19:35:48Z

@george-sentry I removed the duplicated function, only cloning at arg level now

george-sentry · 2026-05-13T21:17:20Z

+    }
+
+    fn assign_partitions(&self, partitions: Vec<i32>) -> Result<(), Error> {
+        self.inner.assign_partitions(partitions)


Most methods are retried, but not all: this one, remove_db, and pending_activation_max_lag are not. Is there a reason for that?

fpacifici · 2026-05-13T22:46:27Z

+                        info!(
+                            method = stringify!($method),
+                            attempt,
+                            "Query succeeded after retry"
+                        );


Please avoid logging this at info level. Debug would be fine.
Info is what we log in production, and this can be very high throughput.

fpacifici · 2026-05-13T23:22:02Z

+}
+
+#[async_trait]
+impl InflightActivationStore for RetryStore {


I don't think wrapping the entire store with the retry logic puts the retry at the right abstraction level.

The ActivationStore does more than run queries against the DB, so we cannot just retry the entire method whenever it raises a retriable DB error. What if you wanted to retry the DB query but not the entire method?

What if the method had side effects beyond the DB query that you do not want to repeat?

What if the method ran two queries and you wanted to retry only one?

The design argument, more generally:

The store has more responsibilities than just running DB queries.

We want to avoid mixing concerns between abstraction levels, to keep the design logically organized and the cognitive overhead low.

When running the query, there are different categories of errors that can happen. We generally want to retry those that are infrastructure related (disconnection, network, etc.)

The infrastructure aspects are hidden as low as possible (as in the connection pool https://github.com/getsentry/taskbroker/blob/main/src/store/adapters/postgres.rs#L106-L114) so that the layers above do not have to be polluted with infrastructure details.

As we are retrying only infrastructure-related issues, the natural place for this logic is as close to the infra as possible, so that the application only cares about application-level issues.

I would recommend adding an abstraction around the connection pool that executes the queries and retries when needed, rather than retrying the entire method.
I am not too familiar with sqlx, but I know it does not manage retries for you:

You can write a wrapper around the connection pool that performs retry

sqlx produces futures, I think you can build a simple wrapper to retry them https://github.com/getsentry/symbolicator/blob/f8883320dcc5e8eb6db852620c1d143cacfdf9b9/crates/symbolicator-service/src/download/mod.rs#L387-L409
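
To make the distinction concrete, here is a rough sketch of query-level retry as opposed to method-level retry. Everything in it is hypothetical: `RetryingPool`, `Store`, `claim_and_notify`, and `QueryError` are illustrative names, not taskbroker or sqlx types, and the code is synchronous to stay self-contained (with sqlx the query would be a closure returning a future and the sleep would be an async timer). The point is only that each query is retried on its own while the side effect in between runs once.

```rust
use std::cell::Cell;
use std::thread::sleep;
use std::time::Duration;

/// Hypothetical error type: Transient models infrastructure failures.
#[allow(dead_code)]
#[derive(Debug, PartialEq)]
enum QueryError {
    Transient,
    Permanent,
}

/// Hypothetical wrapper around a connection pool that owns the retry policy.
struct RetryingPool {
    max_retries: u32,
    delay: Duration,
}

impl RetryingPool {
    /// Runs a single query, retrying only transient failures.
    fn run<T>(
        &self,
        mut query: impl FnMut() -> Result<T, QueryError>,
    ) -> Result<T, QueryError> {
        let mut attempt = 0;
        loop {
            match query() {
                Err(QueryError::Transient) if attempt < self.max_retries => {
                    attempt += 1;
                    sleep(self.delay);
                }
                // Success, permanent errors, and exhausted retries all return here.
                other => return other,
            }
        }
    }
}

/// Hypothetical store whose method does more than run queries.
struct Store {
    pool: RetryingPool,
    notifications_sent: Cell<u32>,
}

impl Store {
    fn claim_and_notify(
        &self,
        fetch: impl FnMut() -> Result<u64, QueryError>,
        mut mark: impl FnMut(u64) -> Result<(), QueryError>,
    ) -> Result<u64, QueryError> {
        // First query: retried independently of everything else.
        let id = self.pool.run(fetch)?;
        // Side effect that must not be repeated when a later query is retried.
        self.notifications_sent.set(self.notifications_sent.get() + 1);
        // Second query: retried without re-running the fetch or the side effect.
        self.pool.run(|| mark(id))?;
        Ok(id)
    }
}
```

If the retry wrapped `claim_and_notify` as a whole, a transient failure in the second query would re-run the fetch and increment `notifications_sent` again. With the retry inside the pool wrapper, the store method does not need to know about retries at all.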

victoria-yining-huang added 2 commits May 12, 2026 17:32

add retry.rs

e822a92

finalize tests

66525aa

victoria-yining-huang requested a review from a team as a code owner May 13, 2026 02:54

cursor Bot reviewed May 13, 2026

Comment thread src/store/retry.rs

whitespace formatting

dc55d5f

sentry Bot reviewed May 13, 2026

Comment thread src/store/retry.rs Outdated

victoria-yining-huang added 2 commits May 12, 2026 23:22

comment

916a0a4

clippy wants matches macro

38c935a

cursor Bot reviewed May 13, 2026

Comment thread src/store/retry.rs Outdated

markstory reviewed May 13, 2026

Comment thread src/store/retry.rs Outdated

Comment thread src/config.rs Outdated

remove exponential backoff

841cdb2

george-sentry reviewed May 13, 2026

remove redundant code, only clone arg

b2a5a58

george-sentry reviewed May 13, 2026

fpacifici reviewed May 13, 2026

victoria-yining-huang wants to merge 7 commits into main from vic/add_query_retry