feat(RetryStore): handle transient or retryable DB errors #634
victoria-yining-huang wants to merge 7 commits into
Conversation
Cursor Bugbot has reviewed your changes and found 1 potential issue.
Reviewed by Cursor Bugbot for commit 38c935a.
george-sentry left a comment:
I agree that it would be preferable to avoid having two nearly identical macro definitions. Otherwise I like this solution!
@george-sentry I removed the duplicated function, only cloning at arg level now
    fn assign_partitions(&self, partitions: Vec<i32>) -> Result<(), Error> {
        self.inner.assign_partitions(partitions)
Most methods are retried, but not all; this one, remove_db, and pending_activation_max_lag are not. Is there a reason for that?
    info!(
        method = stringify!($method),
        attempt,
        "Query succeeded after retry"
    );
Please avoid logging this at info level; debug would be fine.
Info is what we log in production, and this path can be very high throughput.
    #[async_trait]
    impl InflightActivationStore for RetryStore {
I don't think this is the right abstraction level at which to put the retry (wrapping the entire store with the retry logic).
The ActivationStore does more than run queries against the DB, so we cannot just decide to retry the entire method when it raises a retryable DB error. What if you wanted to retry the DB query but not the entire method?
- Let's say your method had side effects beyond the DB query that you do not want to retry.
- What if the method ran two queries and you wanted to retry only one?
The design argument, more generally:
- The store has more responsibilities than just running DB queries.
- We want to avoid mixing concerns between abstraction levels, to keep the design logically organized and the cognitive overhead low.
- When running a query, different categories of errors can happen. We generally want to retry those that are infrastructure-related (disconnections, network issues, etc.).
- The infrastructure aspects are hidden as low as possible (like in the connection pool https://github.com/getsentry/taskbroker/blob/main/src/store/adapters/postgres.rs#L106-L114) so that the layers above are not polluted with infrastructure details.
- Since we are retrying only infrastructure-related issues, the natural place for this logic is as close to the infra as possible, so the application only has to care about application-level issues.
I would recommend adding an abstraction around the connection pool that executes the queries and performs the retry when needed, rather than retrying the entire method.
I am not too familiar with sqlx, but I know it does not manage retries for you:
- You can write a wrapper around the connection pool that performs the retry.
- sqlx produces futures, so I think you can build a simple wrapper to retry them: https://github.com/getsentry/symbolicator/blob/f8883320dcc5e8eb6db852620c1d143cacfdf9b9/crates/symbolicator-service/src/download/mod.rs#L387-L409

Ticket: https://linear.app/getsentry/issue/STREAM-940/handle-server-shutting-down-errors-and-retry
This PR is a no-op; it just introduces the feature. It can be turned on by setting db_query_max_retries. Failed queries are retried after a delay of db_query_retry_delay_ms, up to db_query_max_retries times.