DOC-14039 flakey node metric documentation for Totoro #4102

Open
ggray-cb wants to merge 7 commits into release/8.0 from DOC-14039_flakey_node_metric_totoro_2

Conversation

@ggray-cb
Contributor

Adds documentation for the cm_node_unreachable_total metric.

The following list summarizes the changes, with links to the preview site. You will need the Docs Team credentials on Confluence.

NOTE: These changes will be backported to 7.6.11 and 8.0.2.

ggray-cb added 5 commits April 9, 2026 16:48
* Partial drafts of the conceptual and procedural doc pages.
* Added screenshots.
* Updated metrics to get the new metric.
* Forward ported change to the metrics template to add anchors to the entries in the metric reference.
@ggray-cb ggray-cb requested review from anuthan and hyunjuV April 13, 2026 14:45
@anuthan

anuthan commented Apr 13, 2026

@stevewatanabe Can you please review this addition? Thanks

Comment thread modules/manage/pages/monitor/monitor-node-stability.adoc Outdated
Comment thread modules/learn/pages/clusters-and-availability/failover.adoc Outdated
@stevewatanabe
Member

A couple of minor comments. Otherwise looks good.

* Fixed missing space after a link to the metric that Steve spotted.
* Miscellaneous grammar fixes.
Contributor

@hyunjuV hyunjuV left a comment

Looked good.

@ggray-cb ggray-cb requested a review from Copilot April 16, 2026 17:37

Copilot AI left a comment

Pull request overview

This PR adds end-user documentation for detecting and investigating unstable Couchbase nodes using the new cm_node_unreachable_total metric, and wires the new content into the docs nav and metrics reference.

Changes:

  • Added new concept + guide docs explaining unstable nodes and how to monitor them via cm_node_unreachable_total.
  • Added/updated metrics metadata (including cm_node_unreachable_total) and adjusted metrics reference rendering to link metric names.
  • Updated troubleshooting and availability docs to reference the new unstable-node guidance, and updated nav/What’s New content.

Reviewed changes

Copilot reviewed 16 out of 18 changed files in this pull request and generated 15 comments.

| File | Description |
| --- | --- |
| modules/metrics-reference/partials/metrics.hbs | Adds anchors to metric rows so metrics can be deep-linked. |
| modules/metrics-reference/attachments/cm_metrics_metadata.json | Adds cm_node_unreachable_total and additional CM/couch view metrics metadata. |
| modules/manage/pages/troubleshoot/common-errors.adoc | Adds an “Unstable Nodes” common error entry and links to the new concept page. |
| modules/manage/pages/monitor/set-up-prometheus-for-monitoring.adoc | Adds :page-topic-type: metadata. |
| modules/manage/pages/monitor/monitor-node-stability.adoc | New guide describing Prometheus + REST API approaches to monitoring node instability. |
| modules/learn/pages/clusters-and-availability/unstable-nodes.adoc | New concept page defining unstable nodes and explaining metric labels/reasons. |
| modules/learn/pages/clusters-and-availability/failover.adoc | Adds a section explaining unstable nodes in the context of failover expectations. |
| modules/introduction/pages/whats-new.adoc | Updates What’s New page to reference 8.1 content/partials. |
| modules/introduction/partials/new-features-81.adoc | New 8.1 partial including a What’s New entry for the unstable-node metric. |
| modules/introduction/partials/new-features-80.adoc | Removes the 8.0 partial previously included by What’s New. |
| modules/ROOT/nav.adoc | Adds nav entries for the new unstable-node pages and monitor guide. |
| modules/metrics-reference/attachments/kv_metrics_metadata.json | Adds/renames multiple KV metrics and updates some KV metric help text. |
| modules/metrics-reference/attachments/index_metrics_metadata.json | Adds new index metrics for lost replicas. |
| modules/metrics-reference/attachments/n1ql_metrics_metadata.json | Adds a new N1QL metric entry. |
| modules/metrics-reference/attachments/goxdcr_metrics_metadata.json | Adds multiple XDCR metrics entries. |
| modules/metrics-reference/attachments/fts_metrics_metadata.json | Adds multiple FTS runtime metrics entries and updates one help string. |

Comment thread modules/learn/pages/clusters-and-availability/unstable-nodes.adoc Outdated
Comment on lines +3304 to +3316
"kv_fusion_num_migrator_threads_bytes": {
"added": "8.0.0",
"help": "The number of Fusion Migrator threads.",
"stability": "committed",
"type": "gauge",
"unit": "bytes"
},
"kv_fusion_num_uploader_threads_bytes": {
"added": "8.0.0",
"help": "The number of Fusion Uploader threads.",
"stability": "committed",
"type": "gauge",
"unit": "bytes"
},
"index_num_lost_replica_indexes": {
"added": "8.1.0",
"help": "Number of index partitions with atleast one lost replica.",
Comment thread modules/learn/pages/clusters-and-availability/unstable-nodes.adoc Outdated
Comment thread modules/introduction/partials/new-features-81.adoc

== Spotting Unstable Nodes

When a node is unstable, you see a pattern of multiple nodes incrementing their `cm_node_unreachable_total` counts for it.
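
The pattern described in the quoted line can be illustrated with a minimal Python sketch. It is not taken from the PR. It assumes two counter snapshots have already been collected, keyed by a hypothetical `(reporter, target)` pair; the metric's real label names are not shown in this conversation, and the function name and `min_reporters` threshold are also illustrative assumptions.

```python
from collections import defaultdict


def suspect_unstable_nodes(before, after, min_reporters=2):
    """Return targets whose unreachable count rose on several reporters.

    `before` and `after` map (reporter, target) -> counter value at two
    scrape times. The key layout is a hypothetical stand-in for the
    labels on cm_node_unreachable_total.
    """
    reporters_by_target = defaultdict(set)
    for (reporter, target), value in after.items():
        # A series missing from the earlier snapshot is treated as 0.
        if value > before.get((reporter, target), 0):
            reporters_by_target[target].add(reporter)
    return {
        target: sorted(reporters)
        for target, reporters in reporters_by_target.items()
        if len(reporters) >= min_reporters
    }


# Made-up sample data: n1 and n2 both saw n3 become unreachable,
# while only n4 reported a problem reaching n2.
before = {("n1", "n3"): 4, ("n2", "n3"): 1, ("n1", "n2"): 0}
after = {("n1", "n3"): 7, ("n2", "n3"): 2, ("n1", "n2"): 0, ("n4", "n2"): 1}

print(suspect_unstable_nodes(before, after))
```

Requiring at least two distinct reporters separates a node that several peers cannot reach from a single peer with a local network problem, which appears to be the distinction the quoted sentence draws.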
Comment on lines 777 to 782
"kv_ep_commit_time_seconds": {
"added": "7.0.0",
- "help": "Number of milliseconds of most recent commit",
+ "help": "Number of microseconds of most recent commit",
"stability": "committed",
"type": "gauge",
"unit": "seconds"
Comment on lines 784 to 789
"kv_ep_commit_time_total_seconds": {
"added": "7.0.0",
- "help": "Cumulative milliseconds spent committing",
+ "help": "Cumulative microseconds spent committing",
"stability": "committed",
"type": "gauge",
"unit": "seconds"
Comment on lines +271 to +286
"type": "gauge"
},
"fts_heap_idle": {
"added": "8.1.0",
"help": "Bytes in idle (unused) heap spans",
"type": "gauge"
},
"fts_heap_inuse": {
"added": "8.1.0",
"help": "Bytes in non-idle heap spans",
"type": "gauge"
},
"fts_heap_released": {
"added": "8.1.0",
"help": "Bytes of physical memory returned to the OS",
"type": "gauge"