[server] Add disk-usage write protection to TabletServer #3340

Open
swuferhong wants to merge 2 commits into apache:main from swuferhong:disk-usage-protect

@swuferhong (Contributor)

Purpose

Linked issue: close #3338

Introduce a periodic disk-usage monitoring mechanism that proactively rejects client writes when the TabletServer's data disk usage exceeds a configurable high-water-mark ratio, preventing ENOSPC errors and potential data corruption.

Key design decisions:

  • Hysteresis state machine with a fixed 10% recovery gap to avoid rapid lock/unlock oscillation (lock at limit, unlock at limit-0.10)
  • Max-per-disk strategy: report the highest usage across all distinct FileStores so a single full disk is never masked by other low-usage disks in multi-disk deployments
  • Only client-driven writes (appendLog/putKv) are rejected with a retriable DiskWriteLockedException; follower replication is not blocked to preserve replica consistency
  • write-limit-ratio supports runtime dynamic reconfiguration via ServerReconfigurable, with an immediate re-check on change
  • Setting ratio to 1.0 completely disables the protection
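The hysteresis rule in the first bullet can be sketched as a small state holder. This is a minimal illustration, not the PR's code: the class name `HysteresisWriteGate`, the method `update`, and the choice of inclusive comparisons at both thresholds are assumptions made for the example.

```java
/**
 * Sketch of a hysteresis gate: lock at `limit`, unlock at `limit - 0.10`.
 * Class and method names are hypothetical, not taken from the PR.
 */
final class HysteresisWriteGate {
    // Fixed recovery gap described in the PR summary.
    private static final double RECOVERY_GAP = 0.10;

    private final double limit;
    private boolean locked;

    HysteresisWriteGate(double limit) {
        this.limit = limit;
    }

    /** Feeds one usage sample in [0.0, 1.0] and returns whether writes are locked. */
    boolean update(double usageRatio) {
        if (limit >= 1.0) {
            // A ratio of 1.0 disables the protection entirely.
            locked = false;
        } else if (!locked && usageRatio >= limit) {
            // Crossed the high-water mark (inclusive comparison is an assumption).
            locked = true;
        } else if (locked && usageRatio <= limit - RECOVERY_GAP) {
            // Dropped to the recovery threshold or below.
            locked = false;
        }
        // Between the two thresholds the previous state is kept.
        return locked;
    }
}
```

With a limit of 0.85, a sample of 0.90 locks writes, a later sample of 0.80 keeps them locked because it is still inside the gap, and a sample of 0.70 unlocks them again.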

New configuration:

  • server.data-disk.write-limit-ratio (default 0.85, dynamic)
  • server.data-disk.check-interval (default 30s)

New metrics:

  • diskUsageRatio: current disk usage ratio [0.0, 1.0]
  • diskWriteLocked: 1 when writes are being rejected, 0 otherwise

Brief change log

Tests

API and Format

Documentation

@swuferhong force-pushed the disk-usage-protect branch 2 times, most recently from 5949356 to ac209de (May 18, 2026 03:16)
@zuston (Member) left a comment

If the disk usage ratio exceeds the threshold (or the disk is corrupted), should we mark this tablet server as offline or unhealthy? I don't think writer-side fencing alone is enough; in many cases the disk usage will not recover automatically.

@swuferhong force-pushed the disk-usage-protect branch from ac209de to feb7dc6 (May 18, 2026 03:36)
@swuferhong (Contributor, Author)

> If the disk usage ratio exceeds the threshold (or the disk is corrupted), should we mark this tablet server as offline or unhealthy? I don't think writer-side fencing alone is enough; in many cases the disk usage will not recover automatically.

Hi, @zuston. Writer-side fencing is the minimum-sufficient response for a capacity event; promoting it to node-level offline turns a localized capacity problem into a cluster-wide availability incident and triggers cascading failover. Disk corruption is a separate fault domain (IOException-driven Log Directory Failure) and should be addressed in a dedicated PR.

Happy to add a follow-up issue tracking the Log Directory Failure work if that helps.

@swuferhong force-pushed the disk-usage-protect branch from feb7dc6 to 44c6c93 (May 20, 2026 09:25)
@wuchong (Member) left a comment

Thanks @swuferhong, I only left some minor comments.

  KV_SHARED_RATE_LIMITER_BYTES_PER_SEC.key(),
- KV_SNAPSHOT_INTERVAL.key()));
+ KV_SNAPSHOT_INTERVAL.key(),
+ SERVER_DATA_DISK_WRITE_LIMIT_RATIO.key()));

This key now passes the coordinator allowlist, but the range check still exists only in LocalDiskManager.validate(), which is registered on TabletServer, not CoordinatorServer. Values like 0.0 or 1.5 can therefore be persisted through AlterConfigs and only fail later when tablet servers try to apply them. The coordinator path should reject invalid server.data-disk.write-limit-ratio updates up front.

I think we should also validate this on the Coordinator via org.apache.fluss.server.DynamicConfigManager#registerValidator by extending a ConfigValidator. We should also add an IT case for setting valid and invalid server.data-disk.write-limit-ratio (maybe near FlussAdminITCase#testDynamicConfigs()).
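As a rough illustration of the range check that would need to run on the coordinator side, the sketch below rejects the values named above. It deliberately does not use Fluss's `ConfigValidator` interface, whose signature is not shown in this thread; `WriteLimitRatioCheck` and `parseAndValidate` are hypothetical names, and the accepted range of (0.0, 1.0] is an assumption inferred from 0.0 and 1.5 being called invalid while 1.0 disables the protection.

```java
/**
 * Standalone range check for server.data-disk.write-limit-ratio.
 * Hypothetical helper; the accepted range (0.0, 1.0] is an assumption.
 */
final class WriteLimitRatioCheck {
    static final String KEY = "server.data-disk.write-limit-ratio";

    /** Parses a proposed value and throws if it is not a number in (0.0, 1.0]. */
    static double parseAndValidate(String rawValue) {
        final double ratio;
        try {
            ratio = Double.parseDouble(rawValue.trim());
        } catch (NumberFormatException e) {
            throw new IllegalArgumentException(
                    KEY + " must be a number, got: " + rawValue, e);
        }
        // NaN fails both comparisons, so it is rejected here as well.
        if (!(ratio > 0.0 && ratio <= 1.0)) {
            throw new IllegalArgumentException(
                    KEY + " must be in (0.0, 1.0], got: " + rawValue);
        }
        return ratio;
    }
}
```

Running the same check in both the coordinator validator and the tablet-server validator would keep an invalid value from being persisted before any server tries to apply it.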

if (total <= 0L) {
    continue;
}
double ratio = (double) (total - fs.getUsableSpace()) / total;

collect() only treats Files.getFileStore() failures as skippable. If FileStore#getTotalSpace() or getUsableSpace() throws for one data directory, the whole sample aborts instead of skipping just that directory, so DiskUsageMonitor can keep a stale lock state even when the other data dirs are still healthy and measurable. This should handle per-filesystem stat failures the same way as lookup failures.
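One possible shape for this, sketched with plain JDK calls: wrap the whole per-directory body in a single try block so that a failure in `Files.getFileStore`, `getTotalSpace`, or `getUsableSpace` skips only that directory. `MaxDiskUsageSampler` and `maxUsageRatio` are hypothetical names, and returning an empty `OptionalDouble` when nothing could be measured is an assumption about how the caller would want to handle that case.

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashSet;
import java.util.List;
import java.util.OptionalDouble;
import java.util.Set;

/** Hypothetical sampler: highest usage ratio across distinct FileStores. */
final class MaxDiskUsageSampler {

    static OptionalDouble maxUsageRatio(List<Path> dataDirs) {
        Set<FileStore> seen = new HashSet<>();
        double max = -1.0;
        for (Path dir : dataDirs) {
            try {
                FileStore fs = Files.getFileStore(dir);
                if (!seen.add(fs)) {
                    continue; // same underlying store was already sampled
                }
                long total = fs.getTotalSpace();
                if (total <= 0L) {
                    continue;
                }
                double ratio = (double) (total - fs.getUsableSpace()) / total;
                max = Math.max(max, ratio);
            } catch (IOException | SecurityException e) {
                // Lookup and stat failures are treated alike:
                // skip this directory and keep sampling the others.
            }
        }
        return max < 0.0 ? OptionalDouble.empty() : OptionalDouble.of(max);
    }
}
```

An empty result lets the monitor distinguish "no directory was measurable" from "usage is low", which matters if a stale lock state should not be silently kept.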

Comment on lines +477 to +482
diskUsageMonitor.runOnce();
scheduler.schedule(
        "disk-usage-monitor",
        diskUsageMonitor::runOnce,
        diskCheckIntervalMs,
        diskCheckIntervalMs);

nit: Setting delayMs to 0 can trigger immediate collection, rather than relying on an explicit invocation. This ensures the disk I/O operation executes asynchronously within the scheduler thread, preventing it from blocking the startup process.
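To illustrate the idea with the JDK's own scheduler rather than Fluss's (whose `schedule` signature is only partially visible in the snippet above), a zero initial delay makes the first run happen on the scheduler thread straight away. `ZeroDelayScheduleSketch` and `scheduleImmediately` are hypothetical names used only for this sketch.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

/** Stand-in using java.util.concurrent; not the Fluss scheduler API. */
final class ZeroDelayScheduleSketch {

    /**
     * Runs `task` once immediately on the scheduler thread, then again
     * `periodMs` after each run completes. The caller is never blocked
     * by the task's disk I/O.
     */
    static void scheduleImmediately(
            ScheduledExecutorService scheduler, Runnable task, long periodMs) {
        scheduler.scheduleWithFixedDelay(task, 0L, periodMs, TimeUnit.MILLISECONDS);
    }
}
```

One trade-off worth weighing: without the explicit synchronous call, the first sample may not have completed by the time the server starts accepting writes, so the initial lock state is briefly unknown.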
