collector/diskstats: fix rotational queue stats scrape regression#3686
collector/diskstats: fix rotational queue stats scrape regression#3686ruthwikkakumani wants to merge 1 commit into
Conversation
|
Hey @ruthwikkakumani overall LGTM, do you mind paste the results of the benchmark as part of the PR description? |
d802087 to
65da987
Compare
|
Hi @nicolastakashi! Thanks for the approval I noticed CI is blocked on "3 workflows awaiting approval" — could you approve the workflow run so the checks can complete? Benchmark ResultsThe go test -bench=BenchmarkDiskstatsUpdate -benchtime=5s -benchmem ./collector/I can't run it locally since the
This ~30× reduction in sysfs I/O directly explains the 10× latency regression reported in #3282. Note on golangci-lint FailuresAll 15 reported lint issues ( golangci-lint run ./collector/...directly on upstream |
|
Hey @SuperQ @discordianfish — could you please approve the workflow runs so CI can complete? The code has been reviewed and approved by @nicolastakashi. Thank you! |
| // file reads to exactly 1, fixing the 10× scrape regression on systems with | ||
| // large numbers of block devices (issue #3282). | ||
| func readRotational(dev string) string { | ||
| data, err := os.ReadFile(sysFilePath(fmt.Sprintf("block/%s/queue/rotational", dev))) |
There was a problem hiding this comment.
We prefer to have these file readers implemented in the procfs library.
I would be fine with an optimized blockdevice.FS.Rotational(dev) function that only reads the one parameter.
The rest of the PR seems fine.
There was a problem hiding this comment.
Understood! I'll add a SysBlockDeviceRotational(device string) (uint64, error) method
to the blockdevice.FS type in prometheus/procfs and update this PR to use it.
The implementation will follow the same pattern as SysBlockDeviceSize() — a targeted
single-file read of /sys/block/<device>/queue/rotational using util.ReadUintFromFile.
I'll open the procfs PR first, then update this one to depend on it.
644b834 to
d3a254d
Compare
|
@SuperQ I've addressed your feedback — the file reader is now in the procfs library as you requested. prometheus/procfs#824 adds blockdevice.FS.SysBlockDeviceRotational(device string) (uint64, error) — a targeted single-file read of /sys/block//queue/rotational following the same pattern as SysBlockDeviceSize(). This PR now calls fs.SysBlockDeviceRotational(dev) via rotationalLabel(). The go.mod uses a replace directive pointing to the procfs fork until procfs#824 merges — at that point it can be replaced with a proper version bump. Could you also approve the workflow runs so CI can complete? Thank you! |
…ometheus#3282) Fix a 10× scrape-latency regression introduced in prometheus#3022 on systems with many block devices. Previously, Update() called SysBlockDeviceQueueStats(dev) for every block device per scrape, reading ~30 sysfs files per device. On systems with 1,000+ devices this resulted in 30,000+ unnecessary file reads per scrape, causing timeouts. Replace the expensive struct call with a targeted rotationalLabel() helper that delegates to blockdevice.FS.SysBlockDeviceRotational(dev) (added in prometheus/procfs#824). This reads exactly 1 sysfs file per device: /sys/block/<device>/queue/rotational Returns "1" for rotational (HDD) and "0" for non-rotational (SSD/NVMe), failing closed to "0" on any error — matching the zero-initialised-struct behaviour of the old code. Also add: - TestRotationalLabel: dynamic t.TempDir()-based tests via blockdevice.FS, covering HDD, SSD, and missing-file cases. - BenchmarkDiskstatsUpdate: measures the full Update() call to catch future per-device sysfs I/O regressions before merge. Fixes prometheus#3282 Signed-off-by: Ruthwik Kakumani <ruthwikkakumani08@gmail.com>
|
@SuperQ procfs#824 has been merged into prometheus:master. I've updated go.mod to drop the replace directive and now point directly to the official github.com/prometheus/procfs@v0.20.2-0.20260618104242-66a9e6ebfb87. The fork dependency is gone. Could you please approve the workflow runs so CI can complete? Thank you! |
Fixes #3282 - 10x diskstats latency regression on systems with many block devices
This PR resolves a severe performance regression introduced in PR #3022. The collector was previously calling
SysBlockDeviceQueueStats(dev)for every block device, forcing roughly 30 sysfs file reads per device per scrape. On systems with 1,000+ devices, this resulted in 30,000+ unnecessary file operations, causing scrapes to timeout.This fix excises the expensive struct call and replaces it with a targeted
readRotational(dev)function that reads exactly 1 file per device. The new function safely mirrors previous fallback behavior by strictly validating the content and failing closed to"0"on missing or malformed files. Dynamic tests usingt.TempDir()have been added to ensure robust edge-case handling.