[server] Add disk-usage write protection to TabletServer#3340
[server] Add disk-usage write protection to TabletServer#3340swuferhong wants to merge 2 commits into
Conversation
5949356 to
ac209de
Compare
zuston
left a comment
There was a problem hiding this comment.
If exceeding the disk usage ratio threshold (or disk corruption), do we need to make this tablet server as offline or unhealthy status? I think the writer side fencing is not enough, sometimes the disk usage exceeding will not recover automaticlly at the many cases
ac209de to
feb7dc6
Compare
Hi, @zuston. Writer-side fencing is the minimum-sufficient response for a capacity event; promoting it to node-level offline turns a localized capacity problem into a cluster-wide availability incident and triggers cascading failover. Disk corruption is a separate fault domain (IOException-driven Log Directory Failure) and should be addressed in a dedicated PR. Happy to add a follow-up issue tracking the Log Directory Failure work if that helps. |
feb7dc6 to
44c6c93
Compare
wuchong
left a comment
There was a problem hiding this comment.
Thanks @swuferhong , I only left some minor comments.
| KV_SHARED_RATE_LIMITER_BYTES_PER_SEC.key(), | ||
| KV_SNAPSHOT_INTERVAL.key())); | ||
| KV_SNAPSHOT_INTERVAL.key(), | ||
| SERVER_DATA_DISK_WRITE_LIMIT_RATIO.key())); |
There was a problem hiding this comment.
This key now passes the coordinator allowlist, but the range check still exists only in LocalDiskManager.validate(), which is registered on TabletServer, not CoordinatorServer. Values like 0.0 or 1.5 can therefore be persisted through AlterConfigs and only fail later when tablet servers try to apply them. The coordinator path should reject invalid server.data-disk.write-limit-ratio updates up front.
I think we should also validate this on the Coordinator via org.apache.fluss.server.DynamicConfigManager#registerValidator by extending a ConfigValidator. We should also add an IT case for setting valid and invalid server.data-disk.write-limit-ratio (maybe near FlussAdminITCase#testDynamicConfigs()).
| if (total <= 0L) { | ||
| continue; | ||
| } | ||
| double ratio = (double) (total - fs.getUsableSpace()) / total; |
There was a problem hiding this comment.
collect() only treats Files.getFileStore() failures as skippable. If FileStore#getTotalSpace() or getUsableSpace() throws for one data directory, the whole sample aborts instead of skipping just that directory, so DiskUsageMonitor can keep a stale lock state even when the other data dirs are still healthy and measurable. This should handle per-filesystem stat failures the same way as lookup failures.
| diskUsageMonitor.runOnce(); | ||
| scheduler.schedule( | ||
| "disk-usage-monitor", | ||
| diskUsageMonitor::runOnce, | ||
| diskCheckIntervalMs, | ||
| diskCheckIntervalMs); |
There was a problem hiding this comment.
nit: Setting delayMs to 0 can trigger immediate collection, rather than relying on an explicit invocation. This ensures the disk I/O operation executes asynchronously within the scheduler thread, preventing it from blocking the startup process.
Purpose
Linked issue: close #3338
Introduce a periodic disk-usage monitoring mechanism that proactively rejects client writes when the TabletServer's data disk usage exceeds a configurable high-water-mark ratio, preventing ENOSPC errors and potential data corruption.
Key design decisions:
New configuration:
New metrics:
Brief change log
Tests
API and Format
Documentation