DOC-14039 flakey node metric documentation for Totoro #4102
Open: ggray-cb wants to merge 7 commits into release/8.0 from DOC-14039_flakey_node_metric_totoro_2
Commits:
de1afe0 Work in progress checkin. (ggray-cb)
eb20000 Second part of the work in progress... because VScode sucks. (ggray-cb)
965e7c4 Work in progress checkin (ggray-cb)
ccf4264 Minor touchups (ggray-cb)
5c38ded Added troubleshooting entry. (ggray-cb)
c670c37 * Fixed incorrect link that Steve noticed in his review. (ggray-cb)
a397214 Cleanups based on Copilot review. (ggray-cb)
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,18 +1,18 @@ | ||
| = What's New in Version 8.0 | ||
| = What's New in Version 8.1 | ||
| :description: Couchbase is the modern database for enterprise applications. | ||
| :page-aliases: security:security-watsnew | ||
| :page-toclevels: 2 | ||
|
|
||
| [abstract] | ||
| {description} + | ||
| Couchbase Server 8.0 combines the strengths of relational databases with the flexibility, performance, and scale of Couchbase. | ||
| Couchbase Server 8.1 combines the strengths of relational databases with the flexibility, performance, and scale of Couchbase. | ||
|
|
||
| For information about platform support changes, deprecation notifications, and fixed and known issues, see the xref:release-notes:relnotes.adoc[Release Notes]. | ||
|
|
||
| [#new-features-80] | ||
| == New Features and Enhancements in 8.0.0 | ||
| [#new-features-81] | ||
| == New Features and Enhancements in 8.1.0 | ||
|
|
||
| This release introduces the following new features. | ||
|
|
||
| include::partial$new-features-80.adoc[] | ||
| include::partial$new-features-81.adoc[] | ||
|
|
This file was deleted.
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,63 @@ | ||
| [#section-new-feature-810-platform-support] | ||
| === Platform Support | ||
|
|
||
| Couchbase Server 8.1 adds support for the following operating systems: | ||
|
|
||
| * TBD | ||
|
|
||
| For more information about supported operating systems, see xref:install:install-platforms.adoc[]. | ||
|
|
||
| === Global Secondary Indexing (GSI) Vector Indexes | ||
|
|
||
| TBD | ||
|
|
||
| === Data Service Changes | ||
|
|
||
| Couchbase Server 8.1 introduces several new features for the Data Service. | ||
|
|
||
| TBD | ||
|
|
||
|
|
||
| === Non-Data Services | ||
|
|
||
| Couchbase Server 8.1 includes key enhancements for non-Data Services. | ||
|
|
||
|
|
||
| === Couchbase Cluster | ||
|
|
||
| Couchbase Server 8.1 adds cluster enhancements and diagnostic capabilities. | ||
|
|
||
| [#unstable-metric] | ||
| ==== New Metric to Detect Unstable Nodes | ||
|
|
||
| Couchbase Server 8.1.0 introduces a new metric, `cm_node_unreachable_total`, to help you monitor for unstable nodes in your cluster. | ||
| An unstable node periodically becomes unavailable but recovers before the auto failover timeout expires. | ||
| This metric counts the number of times a node has been unable to reach another node in the cluster. | ||
| By monitoring it, you can identify nodes that are having issues before they become unavailable for long enough to be automatically failed over. | ||
|
|
||
|
|
||
| See xref:learn:clusters-and-availability/unstable-nodes.adoc[] for more information about using this metric to identify unstable nodes. | ||
|
|
||
| === XDCR | ||
|
|
||
| Couchbase Server 8.1 includes key cross datacenter replication (XDCR) enhancements and diagnostic capabilities. | ||
|
|
||
| TBD. | ||
|
|
||
| === Security and Authentication | ||
|
|
||
| Couchbase Server 8.1 includes key security and authentication enhancements. | ||
|
|
||
| TBD. | ||
|
|
||
| === Query Service | ||
|
|
||
| Couchbase Server 8.1 adds these Query Service features. | ||
|
|
||
|
|
||
| TBD. | ||
|
|
||
| === Search Service | ||
|
|
||
| Couchbase Server 8.1 introduces several new features for the xref:search:search.adoc[Search Service]. | ||
|
|
||
| TBD. | ||
Binary file added (+299 KB): ...arn/assets/images/clusters-and-availability/unstable-node-metric-timeseries.png
Binary file added (+202 KB): ...earn/assets/images/clusters-and-availability/unstable-node-prometheus_chart.png
100 changes: 100 additions & 0 deletions
modules/learn/pages/clusters-and-availability/unstable-nodes.adoc
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,100 @@ | ||
| = Unstable Nodes | ||
| :page-topic-type: concept | ||
| :description: Nodes that periodically become unavailable but recover before the auto failover timeout expires are considered unstable. This page describes what unstable nodes are and how to detect them. | ||
|
|
||
| [abstract] | ||
| {description} | ||
|
|
||
| == About Unstable Nodes | ||
|
|
||
| An unstable node periodically becomes unavailable but recovers before the auto failover timeout expires. | ||
| Nodes can become unstable for a variety of reasons, such as: | ||
|
|
||
| Network hardware issues:: | ||
| For example, a network port that's flapping (periodically failing) can cause intermittent connectivity issues. | ||
|
|
||
| Kernel TCP/IP memory pressure:: | ||
| When a node is under heavy network load, the Linux kernel's TCP/IP stack can run low on memory. | ||
| The kernel may respond by refusing connections and dropping packets, which can cause nodes to become unreachable until the memory shortage is resolved. | ||
| See xref:install:tcp_mem_settings.adoc[] for more information about the kernel's TCP/IP memory settings. | ||
|
|
||
| CPU, memory, or disk resource issues:: | ||
| When a node is under heavy CPU, memory, or disk I/O load, it may struggle to meet the demands of the cluster. | ||
| Hardware issues with any of these resources can also cause instability. | ||
|
|
||
| Any of these issues can cause periods where the node is unreachable. | ||
| In some cases, the node may continue to be unreachable until it's automatically failed over. | ||
| However, the node can also recover before the auto failover timeout expires, which lets it repeat the cycle. | ||
|
|
||
| An unstable node can cause performance issues for the cluster, even if it does not lead to automatic failover. | ||
| When a node is unreachable, other nodes in the cluster may experience increased latency as they attempt to communicate with the unreachable node. | ||
|
|
||
| == Tracking Instability Using Metrics | ||
|
|
||
| Couchbase Server provides metrics to help you monitor its performance. | ||
| See xref:metrics-reference:metrics-reference.adoc[] for more information about Couchbase Server metrics. | ||
|
|
||
| To track instability, you can monitor the xref:metrics-reference:ns-server-metrics.adoc#cm_node_unreachable_total[`cm_node_unreachable_total`] metric. | ||
| It is a cross-node counter metric that reports how many times a node has been unable to reach another node. | ||
| When you see multiple nodes incrementing this counter for the same node, it may indicate the node is unstable. | ||
|
|
||
| [#what-metric-reports] | ||
| == What the Metric Reports | ||
|
|
||
| The following screenshot shows an example of using Prometheus to view the `cm_node_unreachable_total` metric. | ||
|
|
||
| image::clusters-and-availability/unstable-node-metric-timeseries.png[Viewing the cm_node_unreachable_total metric in Prometheus] | ||
|
|
||
| Each entry represents a count of the times a node has found another node unreachable for a particular reason. | ||
| For example: | ||
|
|
||
| ---- | ||
| cm_node_unreachable_total{instance="node1.example.com:8091", | ||
| job="couchbase-server", | ||
| node="ns_1@node4.example.com", | ||
| reason="net_tick_timeout"} 26 | ||
| ---- | ||
|
|
||
| The labels in the metric identify the nodes and the type of issue: | ||
|
|
||
| * The `instance` is the node that is reporting another node as unreachable. | ||
| * The `job` is the Prometheus job that you configured to scrape Couchbase Server metrics. | ||
| * The `node` is the node being reported as having issues. | ||
| It uses the https://www.erlang.org/doc/system/distributed.html#nodes[Erlang node name format]. | ||
| The portion before the `@` is the name of the Erlang node (the Erlang runtime instance) running on the host, and the portion after the `@` is the host name. | ||
| * The `reason` is the type of issue the instance (reporting node) observed in its connection to the node. | ||
| The possible reasons are: | ||
| + | ||
| ** `connection_setup_failed`: Setting up the connection failed (after the `nodeup` messages were sent). | ||
| ** `no_network`: The reporting node has no network connection. | ||
| ** `net_kernel_terminated`: The Erlang `net_kernel` process terminated. | ||
| ** `shutdown`: The connection shut down for an unknown reason. | ||
| ** `connection_closed`: The node closed its connection with the instance. | ||
| ** `disconnect`: The reporting node forced a disconnection from the node. | ||
| ** `net_tick_timeout`: The network distribution heartbeat timed out. | ||
| ** `send_net_tick_failed`: The reporting node was not able to send the distribution heartbeat via the connection. | ||
| ** `get_status_failed`: The reporting node failed to retrieve status information from the connection. | ||
| + | ||
| See https://www.erlang.org/doc/apps/kernel/net_kernel.html#:~:text=map.-,nodedown_reason[the Erlang documentation for `nodedown_reason`^] for more details about these errors. | ||
|
|
||
| In the previous example, `node1.example.com` has reported that it was unable to reach `node4.example.com` 26 times due to a network distribution heartbeat timeout. | ||
| In the screenshot, you can see that multiple nodes have reported that `node4.example.com` was unreachable for several reasons. | ||
| These reports may indicate that `node4.example.com` is unstable. | ||
|
|
||
| NOTE: Depending on the reason, the reporting node may increment its metric counter immediately or after a delay. | ||
| For example, a node reports a `connection_closed` error immediately when it happens. | ||
| However, it only reports a `net_tick_timeout` after the timeout period has elapsed. | ||
|
|
||
| Therefore, you may see a lag between issues that nodes report immediately and ones they report after a delay. | ||
|
|
||
| == Spotting Unstable Nodes | ||
|
|
||
| When a node is unstable, you see a pattern of multiple nodes incrementing their `cm_node_unreachable_total` counters for it. | ||
| The following screenshot shows a Prometheus graph of the metric over a period of 50 minutes. | ||
|
|
||
| image::clusters-and-availability/unstable-node-prometheus_chart.png[A chart showing multiple nodes incrementing their cm_node_unreachable_total metric multiple times in between periods of stability.] | ||
|
|
||
| The lines in the graph show nodes incrementing their `cm_node_unreachable_total` metric counters when they were unable to reach `node4.example.com`. | ||
| The `reason` label for most of these counters is `net_tick_timeout`, although several counters report `connection_closed`. | ||
| You can see there are periods of stability (where the values do not increment) followed by multiple nodes incrementing their counts around the same time. | ||
|
|
||
| See xref:manage:monitor/monitor-node-stability.adoc[] for steps you can take to monitor node stability. | ||
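
The pattern that unstable-nodes.adoc describes (several reporting nodes incrementing `cm_node_unreachable_total` for the same target node) can be illustrated with a small Python sketch. It is not part of this pull request. It assumes each sample arrives as a single line in the shape shown in the page's example, with `instance`, `node`, and `reason` labels. The helper names `parse_sample` and `suspect_nodes`, the `min_reporters` threshold, and every sample value except the first are hypothetical.

```python
import re
from collections import defaultdict

# Sample lines in the shape shown in unstable-nodes.adoc. Only the first
# sample mirrors the documented example; the other values are invented.
SAMPLES = """\
cm_node_unreachable_total{instance="node1.example.com:8091",job="couchbase-server",node="ns_1@node4.example.com",reason="net_tick_timeout"} 26
cm_node_unreachable_total{instance="node2.example.com:8091",job="couchbase-server",node="ns_1@node4.example.com",reason="net_tick_timeout"} 19
cm_node_unreachable_total{instance="node3.example.com:8091",job="couchbase-server",node="ns_1@node4.example.com",reason="connection_closed"} 4
cm_node_unreachable_total{instance="node1.example.com:8091",job="couchbase-server",node="ns_1@node2.example.com",reason="connection_closed"} 1
"""

SAMPLE_RE = re.compile(r'^(?P<name>\w+)\{(?P<labels>.*)\}\s+(?P<value>\d+(?:\.\d+)?)$')
LABEL_RE = re.compile(r'(\w+)="([^"]*)"')


def parse_sample(line):
    """Split one sample line into (metric name, label dict, value)."""
    match = SAMPLE_RE.match(line.strip())
    if match is None:
        raise ValueError(f"unrecognized sample line: {line!r}")
    labels = dict(LABEL_RE.findall(match.group("labels")))
    return match.group("name"), labels, float(match.group("value"))


def suspect_nodes(lines, min_reporters=2):
    """Return {target host: (sorted reporter hosts, summed count)} for every
    target that at least `min_reporters` distinct nodes found unreachable."""
    reporters = defaultdict(set)
    totals = defaultdict(int)
    for line in lines:
        if not line.strip():
            continue
        name, labels, value = parse_sample(line)
        if name != "cm_node_unreachable_total" or value <= 0:
            continue
        # Drop the Erlang node-name part ("ns_1@") to get the target host.
        target = labels["node"].split("@", 1)[-1]
        # Drop the port from the reporting instance.
        reporter = labels["instance"].rsplit(":", 1)[0]
        reporters[target].add(reporter)
        totals[target] += int(value)
    return {
        target: (sorted(hosts), totals[target])
        for target, hosts in reporters.items()
        if len(hosts) >= min_reporters
    }


if __name__ == "__main__":
    for target, (hosts, total) in suspect_nodes(SAMPLES.splitlines()).items():
        print(f"{target}: reported unreachable {total} times by {', '.join(hosts)}")
```

With this sample data, only `node4.example.com` is flagged, because three different nodes report it, while `node2.example.com` has a single reporter. Because the metric is a cumulative counter, a snapshot like this shows how often a node was unreachable but not when. Comparing values over time, as the Prometheus graph in the page does, is what reveals the repeating bursts.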