Skip to content

[metrics] Add client-side error and retry tracking metrics#619

Open
charlesdong1991 wants to merge 3 commits into
apache:mainfrom
charlesdong1991:add-error-retry-tracking
Open

[metrics] Add client-side error and retry tracking metrics#619
charlesdong1991 wants to merge 3 commits into
apache:mainfrom
charlesdong1991:add-error-retry-tracking

Conversation

@charlesdong1991

@charlesdong1991 charlesdong1991 commented Jun 14, 2026

Copy link
Copy Markdown
Contributor

Purpose

Add Rust client-side error/retry counters for production alerting: connections poisoned, metadata refreshes/errors, terminal writer errors (by error_kind), and scanner fetch errors (rpc vs bucket).

Java has error family on server, but not reachable from rust client.

Linked issue: close #620

Brief change log

Tests

All tests passed

API and Format

Documentation

@charlesdong1991

Copy link
Copy Markdown
Contributor Author

@fresh-borzoni @leekeiabstraction PTAL 🙏

@leekeiabstraction leekeiabstraction left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR! Better observability makes sense to me but I've left a question.

Comment on lines +376 to +385
pub const WRITER_ERRORS_TOTAL: &str = "fluss.client.writer.errors.total";

/// `error_kind` value for a non-retriable server error.
pub(crate) const WRITER_ERROR_KIND_NON_RETRIABLE: &str = "non_retriable";
/// `error_kind` value for a retriable error that exhausted its retries.
pub(crate) const WRITER_ERROR_KIND_MAX_RETRIES_EXCEEDED: &str = "max_retries_exceeded";
/// `error_kind` value for an idempotent retry abandoned after a writer-id reset.
pub(crate) const WRITER_ERROR_KIND_WRITER_ID_CHANGED: &str = "writer_id_changed";
/// `error_kind` value for a batch that failed to build locally (never sent).
pub(crate) const WRITER_ERROR_KIND_LOCAL_BUILD: &str = "local_build";

@leekeiabstraction leekeiabstraction Jun 15, 2026

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all of these metrics exist on Java client side?

If not you may want to consider raising a short FIP (potentially including more new client side metrics that might be valuable) and discussion for the definitions and implementation of these metrics. Metrics are user facing, releasing and then removing these metrics at later release would be a breaking change so we'd want to be mindful about what we introduce. I think we would benefit from discussing and gathering feedback from the community.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do all of these metrics exist on Java client side?

No, they are not as i mentioned in the PR description, Java has things similar in server side, but rust client cannot get those... it is mainly inspired by those java side metrics, and i come up with some simple metrics for alerting accordingly.

Metrics are user facing, releasing and then removing these metrics at later release would be a breaking change

You are right, i thought it's mainly on client side, so can be more flexible, but indeed better to share there to collect feedback first... I will think about how to draft discussion properly

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[metrics] Add error and retry tracking metrics

2 participants