Skip to content

feat(RetryStore): handle transient or retryable DB errors#634

Open
victoria-yining-huang wants to merge 7 commits into
mainfrom
vic/add_query_retry
Open

feat(RetryStore): handle transient or retryable DB errors#634
victoria-yining-huang wants to merge 7 commits into
mainfrom
vic/add_query_retry

Conversation

@victoria-yining-huang
Copy link
Copy Markdown

@victoria-yining-huang victoria-yining-huang commented May 13, 2026

ticket https://linear.app/getsentry/issue/STREAM-940/handle-server-shutting-down-errors-and-retry

this PR is no-op, just introduces the feature. Can be turned on by adding a value to db_query_max_retries

  • Add RetryStore wrapper (src/store/retry.rs) that implements InflightActivationStore and retries failed queries with a static sleep db_query_retry_delay_ms, up to db_query_max_retries times
  • Uses sqlx::Error downcast to only retry transient errors (Io, PoolTimedOut, PoolClosed, WorkerCrashed) — permanent errors like constraint violations are surfaced immediately
  • Add db_query_max_retries config option (Option, default None). When None, RetryStore is bypassed entirely (no-op). Set to a number to enable retries.
  • Wire RetryStore into main.rs so all components (gRPC server, writer, upkeep, push, fetch) benefit when enabled
  • Add unit tests covering retry success, exhaustion, non-retryable errors, config wiring

@victoria-yining-huang victoria-yining-huang requested a review from a team as a code owner May 13, 2026 02:54
Comment thread src/store/retry.rs
Comment thread src/store/retry.rs Outdated
Copy link
Copy Markdown

@cursor cursor Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 38c935a. Configure here.

Comment thread src/store/retry.rs Outdated
Comment thread src/store/retry.rs Outdated
Comment thread src/config.rs Outdated
Copy link
Copy Markdown
Member

@george-sentry george-sentry left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that it would be preferable to avoid having two nearly identical macro definitions. Otherwise I like this solution!

@victoria-yining-huang
Copy link
Copy Markdown
Author

@george-sentry I removed the duplicated function, only cloning at arg level now

Comment thread src/store/retry.rs
}

fn assign_partitions(&self, partitions: Vec<i32>) -> Result<(), Error> {
self.inner.assign_partitions(partitions)
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most methods are retried, but not all, such as this one and remove_db and pending_activation_max_lag. Is there a reason for that?

Comment thread src/store/retry.rs
Comment on lines +59 to +63
info!(
method = stringify!($method),
attempt,
"Query succeeded after retry"
);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please avoid this logging at info level. DEbug would be fine.
Info is what we log in production, this can be very very high throughput

Comment thread src/store/retry.rs
}

#[async_trait]
impl InflightActivationStore for RetryStore {
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is the right abstraction level where to put the retry (wrapping the entire store with the retry logic).

The ActivationStore does more than running queries to the DB. So we cannot just decide to retry the entire method if the method raises a retriable DB error. What if you wanted to retry the DB query but not retry the entire method ?

  • Let's say your method had side effects beyond the DB query you do not want to retry ?
  • What if the method ran two queries and you wanted to retry only one ?

The design argument more in general:

  • the store has more responsibilities than just running DB queries.
  • We want to avoid mixing concerns between abstractions levels to keep the design logically organized and the cognitive overhead low.
  • When running the query, there are different categories of errors that can happen. We generally want to retry those that are infrastructure related (disconnection, network, etc.)
  • The infrastructure aspects are hidden as lower as possible (like in the connection pool https://github.com/getsentry/taskbroker/blob/main/src/store/adapters/postgres.rs#L106-L114) so that the layers above do not have to be polluted with infrastructure details.
  • As we are retrying only infrastructure related issues, the natural place where to put this logic should be as close to the infra as possible so the application only cares of application level issues.

I would recommend adding an abstraction around the connection pool to execute the queries and perform retry when needed rather than retrying the entire method.
I am not too familiar with sqlx, I know it does not manage retries for you:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants