4 changes: 3 additions & 1 deletion modules/ROOT/nav.adoc
@@ -76,6 +76,7 @@ include::third-party:partial$nav.adoc[]
**** xref:learn:clusters-and-availability/graceful-failover.adoc[Graceful]
**** xref:learn:clusters-and-availability/hard-failover.adoc[Hard]
**** xref:learn:clusters-and-availability/automatic-failover.adoc[Automatic]
*** xref:learn:clusters-and-availability/unstable-nodes.adoc[]
*** xref:learn:clusters-and-availability/recovery.adoc[Recovery]
*** xref:learn:clusters-and-availability/node-to-node-encryption.adoc[Node-to-Node Encryption]
** xref:learn:clusters-and-availability/replication-architecture.adoc[Availability]
@@ -185,7 +186,8 @@ include::third-party:partial$nav.adoc[]
*** xref:backup-restore:cbbackupmgr-encryption.adoc[Encryption]
* xref:manage:monitor/monitor-intro.adoc[Monitor]
** xref:manage:monitor/xdcr-monitor-timestamp-conflict-resolution.adoc[Monitor Clock Drift]
** xref:manage:monitor/set-up-prometheus-for-monitoring.adoc[Configure Prometheus to Collect Couchbase Metrics]
** xref:manage:monitor/set-up-prometheus-for-monitoring.adoc[]
** xref:manage:monitor/monitor-node-stability.adoc[]
* xref:manage:troubleshoot/troubleshoot.adoc[Troubleshoot]
** xref:manage:troubleshoot/common-errors.adoc[Common Errors]
** xref:manage:troubleshoot/core-files.adoc[Core Files]
10 changes: 5 additions & 5 deletions modules/introduction/pages/whats-new.adoc
@@ -1,18 +1,18 @@
= What's New in Version 8.0
= What's New in Version 8.1
:description: Couchbase is the modern database for enterprise applications.
:page-aliases: security:security-watsnew
:page-toclevels: 2

[abstract]
{description} +
Couchbase Server 8.0 combines the strengths of relational databases with the flexibility, performance, and scale of Couchbase.
Couchbase Server 8.1 combines the strengths of relational databases with the flexibility, performance, and scale of Couchbase.

For information about platform support changes, deprecation notifications, and fixed and known issues, see the xref:release-notes:relnotes.adoc[Release Notes].

[#new-features-80]
== New Features and Enhancements in 8.0.0
[#new-features-81]
== New Features and Enhancements in 8.1.0

This release introduces the following new features.

include::partial$new-features-80.adoc[]
include::partial$new-features-81.adoc[]

515 changes: 0 additions & 515 deletions modules/introduction/partials/new-features-80.adoc

This file was deleted.

63 changes: 63 additions & 0 deletions modules/introduction/partials/new-features-81.adoc
@@ -0,0 +1,63 @@
[#section-new-feature-810-platform-support]
=== Platform Support

Couchbase Server 8.1 adds support for the following operating systems:

* TBD

For more information about supported operating systems, see xref:install:install-platforms.adoc[].

=== Global Secondary Indexing (GSI) Vector Indexes

TBD

=== Data Service Changes

Couchbase Server 8.1 introduces several new features for the Data Service.

TBD


=== Non-Data Services

The Couchbase Server 8.1 release includes key enhancements to non-Data Services.


=== Couchbase Cluster

The Couchbase Server 8.1 release adds cluster enhancements and diagnostic capabilities.

[#unstable-metric]
==== New Metric to Detect Unstable Nodes

Couchbase Server 8.1.0 introduces a new metric, `cm_node_unreachable_total`, to help you monitor for unstable nodes in your cluster.
An unstable node periodically becomes unavailable but recovers before the auto failover timeout expires.
This metric counts the number of times a node has been unable to reach another node in the cluster.
By monitoring it, you can identify nodes that are having issues before they become unavailable for long enough to be automatically failed over.

See xref:learn:clusters-and-availability/unstable-nodes.adoc[] for more information about using this metric to identify unstable nodes.

=== XDCR

The Couchbase Server 8.1 release includes key cross-datacenter replication (XDCR) enhancements and diagnostic capabilities.

TBD.

=== Security and Authentication

The Couchbase Server 8.1 release includes key security and authentication enhancements.

TBD.

=== Query Service

The Couchbase Server 8.1 release adds the following Query Service features.


TBD.

=== Search Service

Couchbase Server 8.1 introduces several new features for the xref:search:search.adoc[Search Service].

TBD.
15 changes: 15 additions & 0 deletions modules/learn/pages/clusters-and-availability/failover.adoc
@@ -52,6 +52,21 @@ If manual failover is to be used, administrative intervention is required to det
This can be achieved either by assigning an administrator to monitor the cluster; or by creating an externally based monitoring system that uses the Couchbase REST API to monitor the cluster, detect problems, and either provide notifications, or itself trigger failover.
Such a system might be designed to take into account system or network components beyond the scope of Couchbase Server.

[#detect-unstable-nodes]
== Detecting Unstable Nodes

You can configure Couchbase Server's Cluster Manager to automatically fail over nodes that have been unreachable for a set period of time.
Failing over these nodes prevents the loss of a node from degrading database performance.
However, nodes can become unstable, where other nodes lose contact with them periodically.
These nodes may recover before the failover timeout expires and resume operation.

In some cases, these nodes become unreachable for long enough that the Cluster Manager fails them over.
In other cases they can continue the cycle of instability followed by recovery.
Even though these disruptions are not as severe as a complete loss of the node, they can cause performance issues.
You should investigate and resolve these issues to prevent further problems.

See xref:learn:clusters-and-availability/unstable-nodes.adoc[] for more information about unstable nodes.

[#failover-and-replica-promotion]
== Failover and Replica Promotion

100 changes: 100 additions & 0 deletions modules/learn/pages/clusters-and-availability/unstable-nodes.adoc
@@ -0,0 +1,100 @@
= Unstable Nodes
:page-topic-type: concept
:description: Nodes that periodically become unavailable but recover before the auto failover timeout expires are considered unstable. This page describes what unstable nodes are and how to detect them.

[abstract]
{description}

== About Unstable Nodes

An unstable node periodically becomes unavailable but recovers before the auto failover timeout expires.
Nodes can become unstable for a variety of reasons, such as:

Network hardware issues::
For example, a network port that's flapping (periodically failing) can cause intermittent connectivity issues.

Kernel TCP/IP memory pressure::
When a node is under heavy network load, the Linux kernel's TCP/IP stack can run low on memory.
The kernel may respond by refusing connections and dropping packets, which can cause nodes to become unreachable until the memory shortage is resolved.
See xref:install:tcp_mem_settings.adoc[] for more information about the kernel's TCP/IP memory settings.

CPU, Memory, or Disk Resource Issues::
When a node is under heavy CPU, memory, or disk I/O load, it may struggle to meet the demands of the cluster.
Hardware issues with any of these resources can also cause instability.

Any of these issues can cause periods where the node is unreachable.
In some cases, the node may continue to be unreachable until it's automatically failed over.
However, the node can recover before the auto failover timeout expires, which allows the cycle to repeat.

An unstable node can cause performance issues for the cluster, even if it does not lead to automatic failover.
When a node is unreachable, other nodes in the cluster may experience increased latency as they attempt to communicate with the unreachable node.

== Tracking Instability Using Metrics

Couchbase Server provides metrics to help you monitor its performance.
See xref:metrics-reference:metrics-reference.adoc[] for more information about Couchbase Server metrics.

To track instability, you can monitor the xref:metrics-reference:ns-server-metrics.adoc#cm_node_unreachable_total[`cm_node_unreachable_total`] metric.
It is a cross-node counter metric that reports how many times a node has been unable to reach another node.
When you see multiple nodes incrementing this counter for the same node, it may indicate the node is unstable.
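As a starting point, a PromQL query along the following lines can surface nodes that more than one reporter has recently flagged as unreachable. This is a sketch only: the 15-minute window is illustrative, and you should adjust it to your scrape interval and cluster size.

----
# For each reported `node`, count how many distinct reporting
# instances have incremented cm_node_unreachable_total in the
# last 15 minutes (window is illustrative).
count by (node) (
  sum by (node, instance) (
    increase(cm_node_unreachable_total[15m])
  ) > 0
)
----

A result of 2 or more for a given `node` means several peers independently lost contact with it in the window, which is the pattern this page describes.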

[#what-metric-reports]
== What the Metric Reports

The following screenshot shows an example of using Prometheus to view the `cm_node_unreachable_total` metric.

image::clusters-and-availability/unstable-node-metric-timeseries.png[Viewing the cm_node_unreachable_total metric in Prometheus]

Each entry represents a count of the times a node has found another node unreachable for a particular reason.
For example:

----
cm_node_unreachable_total{instance="node1.example.com:8091",
job="couchbase-server",
node="ns_1@node4.example.com",
reason="net_tick_timeout"} 26
----

The labels in the metric identify the nodes and the type of issue:

* The `instance` is the node which is reporting another node as unreachable.
* The `job` is the Prometheus job that you configured to scrape Couchbase Server metrics.
* The `node` is the node being reported as having issues.
It uses the https://www.erlang.org/doc/system/distributed.html#nodes[Erlang node name format].
The portion before the `@` is the name of the Erlang node running on the host.
* The `reason` is the type of issue the instance (reporting node) observed in its connection to the node.
The possible reasons are:
+
** `connection_setup_failed`: Setting up the connection failed (after the `nodeup` messages were sent).
** `no_network`: The reporting node has no network connection.
** `net_kernel_terminated`: The Erlang `net_kernel` process terminated.
** `shutdown`: The connection shut down for an unknown reason.
** `connection_closed`: The node closed its connection with the instance.
** `disconnect`: The reporting node forced a disconnection from the node.
** `net_tick_timeout`: The network distribution heartbeat timed out.
** `send_net_tick_failed`: The reporting node was not able to send the distribution heartbeat via the connection.
** `get_status_failed`: The reporting node failed to retrieve status information from the connection.
+
See https://www.erlang.org/doc/apps/kernel/net_kernel.html#:~:text=map.-,nodedown_reason[the Erlang documentation for `nodedown_reason`^] for more details about these errors.

In the previous example, `node1.example.com` has reported that it was unable to reach `node4.example.com` 26 times due to a network distribution heartbeat timeout.
In the screenshot, you can see that multiple nodes have reported that `node4.example.com` was unreachable for several reasons.
These reports may indicate that `node4.example.com` is unstable.
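To focus on a single suspect node, you can filter the metric by its labels. The following queries are a sketch; the node name is taken from the example above.

----
# All unreachable reports about node4, broken down by
# reporting instance and reason
cm_node_unreachable_total{node="ns_1@node4.example.com"}

# Only the distribution-heartbeat timeouts for node4
cm_node_unreachable_total{node="ns_1@node4.example.com", reason="net_tick_timeout"}
----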

NOTE: Depending on the reason, the reporting node may increment its metric counter immediately or after a delay.
For example, a node reports a `connection_closed` error immediately when it happens.
However, it only reports a `net_tick_timeout` after the timeout period has elapsed.
Therefore, you may see a lag between issues that nodes report immediately and ones they report after a delay.

== Spotting Unstable Nodes

When a node is unstable, you see a pattern of multiple nodes incrementing their `cm_node_unreachable_total` counters for it.
The following screenshot shows a Prometheus graph of the metric over a period of 50 minutes.

image::clusters-and-availability/unstable-node-prometheus_chart.png[A chart showing multiple nodes incrementing their cm_node_unreachable_total metric multiple times in between periods of stability.]

The lines in the graph show nodes incrementing their `cm_node_unreachable_total` metric counters when they were unable to reach `node4.example.com`.
The `reason` label for most of these counters is `net_tick_timeout`, although there are several reports of `connection_closed`.
You can see there are periods of stability (where the values do not increment) followed by multiple nodes incrementing their counts around the same time.
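This pattern can also drive alerting. The following Prometheus alerting-rule fragment is a sketch only: the alert name, 30-minute window, and threshold of two reporters are illustrative values that you should tune for your cluster.

----
groups:
  - name: couchbase-node-stability
    rules:
      - alert: CouchbaseUnstableNode
        # Fire when two or more nodes have reported the same node
        # as unreachable within the last 30 minutes.
        expr: >
          count by (node) (
            sum by (node, instance) (
              increase(cm_node_unreachable_total[30m])
            ) > 0
          ) >= 2
        labels:
          severity: warning
        annotations:
          summary: "Node {{ $labels.node }} may be unstable"
----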

See xref:manage:monitor/monitor-node-stability.adoc[] for steps you can take to monitor node stability.