collector: add nvmesubsystem collector for NVMe-oF path health #3579

Open
sradco wants to merge 1 commit into prometheus:master from sradco:add_collector_multipath

sradco commented Mar 11, 2026

Add a new disabled-by-default collector that reads
/sys/class/nvme-subsystem/ to expose NVMe over Fabrics subsystem
connectivity metrics.

This complements the existing nvme collector (which reports
per-controller hardware stats) by monitoring the subsystem-level
path redundancy - how many controller paths are live, connecting,
or dead for each NVMe subsystem.

Exposed metrics:

  • node_nvmesubsystem_info
  • node_nvmesubsystem_paths_total
  • node_nvmesubsystem_paths_live
  • node_nvmesubsystem_path_state
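
As a rough illustration of the counting described above (not the code in this PR), the sketch below reads each controller's state under a subsystem directory and tallies total and live paths. The helper names `readControllerStates` and `countPaths` are hypothetical, and the layout it relies on (controller entries named like `nvme0` inside `/sys/class/nvme-subsystem/nvme-subsys0/`, each with a `state` file) is an assumption that should be checked against the target kernel.

```go
package main

import (
	"os"
	"path/filepath"
	"regexp"
	"strings"
)

// controllerName matches controller entries such as "nvme0" but not
// namespace entries such as "nvme0n1". Assumed naming, not taken from the PR.
var controllerName = regexp.MustCompile(`^nvme\d+$`)

// readControllerStates returns controller name -> state for one subsystem
// directory, e.g. /sys/class/nvme-subsystem/nvme-subsys0 (hypothetical helper).
func readControllerStates(subsysDir string) (map[string]string, error) {
	entries, err := os.ReadDir(subsysDir)
	if err != nil {
		return nil, err
	}
	states := make(map[string]string)
	for _, e := range entries {
		if !controllerName.MatchString(e.Name()) {
			continue
		}
		raw, err := os.ReadFile(filepath.Join(subsysDir, e.Name(), "state"))
		if err != nil {
			// Assumption: an unreadable state file is reported as "unknown".
			states[e.Name()] = "unknown"
			continue
		}
		states[e.Name()] = strings.TrimSpace(string(raw))
	}
	return states, nil
}

// countPaths returns the number of paths and how many of them are "live".
func countPaths(states map[string]string) (total, live int) {
	for _, s := range states {
		total++
		if s == "live" {
			live++
		}
	}
	return total, live
}
```

With two controllers both reporting `live`, `countPaths` would yield the pair that `node_nvmesubsystem_paths_total` and `node_nvmesubsystem_paths_live` are meant to expose.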

Signed-off-by: Shirly Radco <sradco@redhat.com>
Co-authored-by: AI Assistant <noreply@cursor.com>

sradco force-pushed the add_collector_multipath branch from 742c1b1 to a0a146e, Mar 11, 2026 19:18

sradco commented Mar 11, 2026

Hi @SuperQ, I created this PR for a new multipath collector.
I would appreciate your review.

sradco force-pushed the add_collector_multipath branch from a0a146e to 1fa2099, Mar 12, 2026 08:55
sradco changed the title from "Add multipath collector" to "Add multipath collector for NVMe-oF subsystem path health", Mar 12, 2026
sradco force-pushed the add_collector_multipath branch from 1fa2099 to 635b613, Mar 12, 2026 09:13
sradco changed the title from "Add multipath collector for NVMe-oF subsystem path health" to "collector: add nvmesubsystem collector for NVMe-oF path health", Mar 12, 2026

jsafrane left a comment

The code looks good to me. I tested it with 2 NVMe over TCP devices and got:

node_nvmesubsystem_info{iopolicy="numa",model="Linux",nqn="tempdisk",serial="bab529a0f32e397e1319",subsystem="nvme-subsys0"} 1
node_nvmesubsystem_path_state{controller="nvme0",state="connecting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="dead",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="live",subsystem="nvme-subsys0",transport="tcp"} 1
node_nvmesubsystem_path_state{controller="nvme0",state="resetting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme0",state="unknown",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="connecting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="dead",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="live",subsystem="nvme-subsys0",transport="tcp"} 1
node_nvmesubsystem_path_state{controller="nvme1",state="resetting",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_path_state{controller="nvme1",state="unknown",subsystem="nvme-subsys0",transport="tcp"} 0
node_nvmesubsystem_paths_live{subsystem="nvme-subsys0"} 2
node_nvmesubsystem_paths_total{subsystem="nvme-subsys0"} 2

Which looks reasonable.
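
The output above uses one series per state: each controller gets a sample for every state label, with value 1 for its current state and 0 for the rest. A minimal sketch of that encoding follows, assuming the five state labels visible in this output. The names `exportedStates` and `oneHot` are hypothetical, and folding unlisted states into `unknown` is an assumption rather than confirmed collector behavior.

```go
package main

// exportedStates lists the state label values seen in the sample output.
var exportedStates = []string{"connecting", "dead", "live", "resetting", "unknown"}

// oneHot returns a value per exported state: 1 for the current state and 0
// for all others. Assumption: a state outside the list counts as "unknown".
func oneHot(current string) map[string]float64 {
	out := make(map[string]float64, len(exportedStates))
	matched := false
	for _, s := range exportedStates {
		if s == current {
			out[s] = 1
			matched = true
		} else {
			out[s] = 0
		}
	}
	if !matched {
		out["unknown"] = 1
	}
	return out
}
```

Because exactly one series per controller is 1 at any time, summing the `live` series over a subsystem would be expected to match `node_nvmesubsystem_paths_live`.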

Comment on lines +64 to +70
	switch raw {
	case "live", "connecting", "resetting", "dead":
		return raw
	case "deleting", "deleting (no IO)", "new":
		return raw
	default:
		return "unknown"

For the record, I checked that this is a complete list of all states reported by the kernel today.
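
For readers following the review, the reviewed excerpt can be completed into a standalone function roughly as follows. Only the switch cases come from the diff. The function name `normalizeState` and the whitespace trimming are assumptions added to make the sketch self-contained.

```go
package main

import "strings"

// normalizeState maps a raw controller state string to a known state label,
// or to "unknown" if the kernel reports something not in the list.
// Hypothetical name; the case values mirror the excerpt under review.
func normalizeState(raw string) string {
	// Assumption: sysfs values carry a trailing newline that must be removed.
	raw = strings.TrimSpace(raw)
	switch raw {
	case "live", "connecting", "resetting", "dead":
		return raw
	case "deleting", "deleting (no IO)", "new":
		return raw
	default:
		return "unknown"
	}
}
```

Keeping a `default` branch means a state added by a future kernel would surface as `unknown` instead of creating an unexpected label value.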
