[metrics] Add client-side error and retry tracking metrics#619
[metrics] Add client-side error and retry tracking metrics#619charlesdong1991 wants to merge 3 commits into
Conversation
|
@fresh-borzoni @leekeiabstraction PTAL 🙏 |
leekeiabstraction
left a comment
There was a problem hiding this comment.
Thank you for the PR! Better observability makes sense to me but I've left a question.
| pub const WRITER_ERRORS_TOTAL: &str = "fluss.client.writer.errors.total"; | ||
|
|
||
| /// `error_kind` value for a non-retriable server error. | ||
| pub(crate) const WRITER_ERROR_KIND_NON_RETRIABLE: &str = "non_retriable"; | ||
| /// `error_kind` value for a retriable error that exhausted its retries. | ||
| pub(crate) const WRITER_ERROR_KIND_MAX_RETRIES_EXCEEDED: &str = "max_retries_exceeded"; | ||
| /// `error_kind` value for an idempotent retry abandoned after a writer-id reset. | ||
| pub(crate) const WRITER_ERROR_KIND_WRITER_ID_CHANGED: &str = "writer_id_changed"; | ||
| /// `error_kind` value for a batch that failed to build locally (never sent). | ||
| pub(crate) const WRITER_ERROR_KIND_LOCAL_BUILD: &str = "local_build"; |
There was a problem hiding this comment.
Do all of these metrics exist on Java client side?
If not you may want to consider raising a short FIP (potentially including more new client side metrics that might be valuable) and discussion for the definitions and implementation of these metrics. Metrics are user facing, releasing and then removing these metrics at later release would be a breaking change so we'd want to be mindful about what we introduce. I think we would benefit from discussing and gathering feedback from the community.
There was a problem hiding this comment.
Do all of these metrics exist on Java client side?
No, they are not as i mentioned in the PR description, Java has things similar in server side, but rust client cannot get those... it is mainly inspired by those java side metrics, and i come up with some simple metrics for alerting accordingly.
Metrics are user facing, releasing and then removing these metrics at later release would be a breaking change
You are right, i thought it's mainly on client side, so can be more flexible, but indeed better to share there to collect feedback first... I will think about how to draft discussion properly
Purpose
Add Rust client-side error/retry counters for production alerting: connections poisoned, metadata refreshes/errors, terminal writer errors (by error_kind), and scanner fetch errors (rpc vs bucket).
Java has error family on server, but not reachable from rust client.
Linked issue: close #620
Brief change log
Tests
All tests passed
API and Format
Documentation