DOC-14039 flakey node metric documentation for Totoro #4102
Open
ggray-cb wants to merge 7 commits into release/8.0
* Partial drafts of the conceptual and procedural doc pages.
* Added screenshots.
* Updated metrics to get the new metric.
* Forward-ported the change to the metrics template to add anchors to the entries in the metric reference.
@stevewatanabe Can you please review this addition? Thanks.
A couple of minor comments. Otherwise looks good.
* Fixed missing space after a link to the metric that Steve spotted.
* Some miscellaneous grammar fixes.
Pull request overview
This PR adds end-user documentation for detecting and investigating unstable Couchbase nodes using the new cm_node_unreachable_total metric, and wires the new content into the docs nav and metrics reference.
Changes:
- Added new concept + guide docs explaining unstable nodes and how to monitor them via cm_node_unreachable_total.
- Added/updated metrics metadata (including cm_node_unreachable_total) and adjusted metrics reference rendering to link metric names.
- Updated troubleshooting and availability docs to reference the new unstable-node guidance, and updated nav/What’s New content.
Reviewed changes
Copilot reviewed 16 out of 18 changed files in this pull request and generated 15 comments.
| File | Description |
|---|---|
| modules/metrics-reference/partials/metrics.hbs | Adds anchors to metric rows so metrics can be deep-linked. |
| modules/metrics-reference/attachments/cm_metrics_metadata.json | Adds cm_node_unreachable_total and additional CM/couch view metrics metadata. |
| modules/manage/pages/troubleshoot/common-errors.adoc | Adds an “Unstable Nodes” common error entry and links to the new concept page. |
| modules/manage/pages/monitor/set-up-prometheus-for-monitoring.adoc | Adds :page-topic-type: metadata. |
| modules/manage/pages/monitor/monitor-node-stability.adoc | New guide describing Prometheus + REST API approaches to monitoring node instability. |
| modules/learn/pages/clusters-and-availability/unstable-nodes.adoc | New concept page defining unstable nodes and explaining metric labels/reasons. |
| modules/learn/pages/clusters-and-availability/failover.adoc | Adds a section explaining unstable nodes in the context of failover expectations. |
| modules/introduction/pages/whats-new.adoc | Updates What’s New page to reference 8.1 content/partials. |
| modules/introduction/partials/new-features-81.adoc | New 8.1 partial including a What’s New entry for the unstable-node metric. |
| modules/introduction/partials/new-features-80.adoc | Removes the 8.0 partial previously included by What’s New. |
| modules/ROOT/nav.adoc | Adds nav entries for the new unstable-node pages and monitor guide. |
| modules/metrics-reference/attachments/kv_metrics_metadata.json | Adds/renames multiple KV metrics and updates some KV metric help text. |
| modules/metrics-reference/attachments/index_metrics_metadata.json | Adds new index metrics for lost replicas. |
| modules/metrics-reference/attachments/n1ql_metrics_metadata.json | Adds a new N1QL metric entry. |
| modules/metrics-reference/attachments/goxdcr_metrics_metadata.json | Adds multiple XDCR metrics entries. |
| modules/metrics-reference/attachments/fts_metrics_metadata.json | Adds multiple FTS runtime metrics entries and updates one help string. |
Comment on lines +3304 to +3316
    "kv_fusion_num_migrator_threads_bytes": {
        "added": "8.0.0",
        "help": "The number of Fusion Migrator threads.",
        "stability": "committed",
        "type": "gauge",
        "unit": "bytes"
    },
    "kv_fusion_num_uploader_threads_bytes": {
        "added": "8.0.0",
        "help": "The number of Fusion Uploader threads.",
        "stability": "committed",
        "type": "gauge",
        "unit": "bytes"
    },
    "index_num_lost_replica_indexes": {
        "added": "8.1.0",
        "help": "Number of index partitions with atleast one lost replica.",
== Spotting Unstable Nodes

When a node is unstable, you see a patten of multiple nodes incrementing their `cm_node_unreachable_total` counts for it.
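The pattern this line describes (several nodes each incrementing the counter for the same peer) can be checked mechanically. The following is only a minimal Python sketch under stated assumptions: the counter values are assumed to be already scraped into dictionaries keyed by a hypothetical `(reporter, target)` pair, and the `min_reporters` threshold and sample values are invented for illustration. None of these names come from the metric's actual label set.

```python
from collections import defaultdict


def suspect_unstable_nodes(before, after, min_reporters=2):
    """Return targets whose counter rose on at least min_reporters nodes.

    before/after map (reporter, target) -> cm_node_unreachable_total value
    at two scrape times. The key shape is an assumption for illustration.
    """
    reporters = defaultdict(set)
    for (reporter, target), new_value in after.items():
        old_value = before.get((reporter, target), 0)
        # Only a strict increase counts; a lower value (counter reset) is ignored.
        if new_value > old_value:
            reporters[target].add(reporter)
    return {t: sorted(r) for t, r in reporters.items() if len(r) >= min_reporters}


# Hypothetical samples: n1 and n2 both report n3 as unreachable more often.
before = {("n1", "n3"): 4, ("n2", "n3"): 1, ("n1", "n2"): 0}
after = {("n1", "n3"): 7, ("n2", "n3"): 2, ("n1", "n2"): 0, ("n3", "n2"): 1}
print(suspect_unstable_nodes(before, after))  # {'n3': ['n1', 'n2']}
```

Requiring more than one reporter is the point of the sketch: a single node incrementing its count for a peer may just as well indicate a problem on the reporting side.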
Comment on lines 777 to 782
      "kv_ep_commit_time_seconds": {
          "added": "7.0.0",
    -     "help": "Number of milliseconds of most recent commit",
    +     "help": "Number of microseconds of most recent commit",
          "stability": "committed",
          "type": "gauge",
          "unit": "seconds"
Comment on lines 784 to 789
      "kv_ep_commit_time_total_seconds": {
          "added": "7.0.0",
    -     "help": "Cumulative milliseconds spent committing",
    +     "help": "Cumulative microseconds spent committing",
          "stability": "committed",
          "type": "gauge",
          "unit": "seconds"
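The hunks quoted above pair a help string with a `unit` field that appears to disagree with it (a thread count with unit `bytes`, a microsecond description with unit `seconds`). A small lint could surface such entries before review. This is a hedged Python sketch only: the function name `unit_mismatches` and its two heuristics are hypothetical and are not part of the docs build, and the sample entries are trimmed copies of the ones quoted in this review.

```python
import re

# Sub-second time words that contradict a declared unit of "seconds".
SUBSECOND = re.compile(r"\b(milli|micro|nano)seconds\b", re.IGNORECASE)


def unit_mismatches(metadata):
    """Return (metric, reason) pairs where help text and unit look inconsistent."""
    problems = []
    for name, entry in metadata.items():
        unit = entry.get("unit")
        help_text = entry.get("help", "")
        if unit == "seconds" and SUBSECOND.search(help_text):
            problems.append((name, "help mentions a sub-second unit but unit is seconds"))
        # Heuristic: a help string that starts by describing a count.
        if unit == "bytes" and help_text.lower().startswith(("number of", "the number of")):
            problems.append((name, "help describes a count but unit is bytes"))
    return problems


sample = {
    "kv_ep_commit_time_seconds": {
        "help": "Number of microseconds of most recent commit",
        "unit": "seconds",
    },
    "kv_fusion_num_migrator_threads_bytes": {
        "help": "The number of Fusion Migrator threads.",
        "unit": "bytes",
    },
}
for metric, reason in unit_mismatches(sample):
    print(f"{metric}: {reason}")
```

The heuristics are deliberately narrow and only flag candidates for a human to check; whether the help text or the unit is the wrong side has to be confirmed against the server source.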
Comment on lines +271 to +286
        "type": "gauge"
    },
    "fts_heap_idle": {
        "added": "8.1.0",
        "help": "Bytes in idle (unused) heap spans",
        "type": "gauge"
    },
    "fts_heap_inuse": {
        "added": "8.1.0",
        "help": "Bytes in non-idle heap spans",
        "type": "gauge"
    },
    "fts_heap_released": {
        "added": "8.1.0",
        "help": "Bytes of physical memory returned to the OS",
        "type": "gauge"
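The FTS entries in this hunk carry only `added`, `help`, and `type`, while the KV entries quoted earlier also carry `stability` and `unit`. Assuming, purely for illustration, that every entry is expected to have the same five keys as the KV entries, a check along these lines would list what is missing. `EXPECTED_KEYS` and `missing_fields` are hypothetical names, and some metrics may legitimately omit a unit.

```python
# Assumed key set, copied from the KV entries quoted in this review.
EXPECTED_KEYS = ("added", "help", "stability", "type", "unit")


def missing_fields(metadata, expected=EXPECTED_KEYS):
    """Map each metric name to the expected keys its entry does not define."""
    report = {}
    for name, entry in metadata.items():
        absent = [key for key in expected if key not in entry]
        if absent:
            report[name] = absent
    return report


sample = {
    "fts_heap_idle": {
        "added": "8.1.0",
        "help": "Bytes in idle (unused) heap spans",
        "type": "gauge",
    },
    "kv_ep_commit_time_seconds": {
        "added": "7.0.0",
        "help": "Number of microseconds of most recent commit",
        "stability": "committed",
        "type": "gauge",
        "unit": "seconds",
    },
}
print(missing_fields(sample))  # {'fts_heap_idle': ['stability', 'unit']}
```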
Adds documentation for the cm_node_unreachable_total metric. The following list summarizes the changes, with links to the preview site. You will need the Docs Team credentials on Confluence.
NOTE: These changes will be backported to 7.6.11 and 8.0.2.