Make self-hosted Durable Objects scalable across a cluster.#6780
Make self-hosted Durable Objects scalable across a cluster.#6780kentonv wants to merge 9 commits into
Conversation
This defines how the new feature is configured and exposed on the network.
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR adds cluster mode to workerd for scalable self-hosted Durable Objects, using filesystem-based service discovery and OFD locks for ownership coordination.
Findings (highest severity first):
- [MEDIUM] Typo:
verifyRigstrationshould beverifyRegistration(3 occurrences) — misspelled method name propagated to aKJ_BIND_METHODcallback that's invoked on every SQLite file open in cluster mode. - [LOW]
catch(...)inconnect()— idiomatic KJ usesKJ_TRY/KJ_CATCH; barecatch(...)will also catch non-KJ exceptions (includingstd::bad_alloc) and route them throughcleanupFailedPeer. - [LOW] Unbounded retry loop in
acquireOrRoute()— no maximum attempt count or deadline; a perpetually contested lock will spin indefinitely. - [INFO]
ClusterConfigfield ordinals@2and@3are in textual reverse order (key @2beforenetwork @3in the struct definition butnetworkappears first textually). Not a bug, but mildly surprising.
These comments were generated by an AI code review assistant and may not be perfect.
|
|
||
| kj::Path path({actorId}); | ||
|
|
||
| for (uint attempt = 0;; ++attempt) { |
There was a problem hiding this comment.
This loop has no upper bound on attempt. If the lock is perpetually contested (e.g., a bug where an owner holds the exclusive lock but writes a different key, or the file is locked by a non-workerd process), this spins indefinitely with backoff capped at 1s. Consider adding a maximum attempt count or a deadline, and returning an error when exceeded.
(Not blocking for a v1, but worth a TODO.)
|
Review posted successfully on PR #6780. Here's a summary of what I found:
The overall design is solid — the OFD lock primitive, VatNetwork implementation, and integration with the existing server architecture are well structured. The capnp schema changes are wire-compatible, the test coverage is thorough (unit tests for OFD locks, cluster registry, cluster lock manager, plus integration tests), and the backward compatibility is maintained properly (cluster mode is gated behind |
|
The generated output of |
6a9647f to
e86a5ee
Compare
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #6780 +/- ##
==========================================
- Coverage 66.62% 66.61% -0.02%
==========================================
Files 402 410 +8
Lines 115914 117188 +1274
Branches 19425 19669 +244
==========================================
+ Hits 77228 78065 +837
- Misses 27095 27365 +270
- Partials 11591 11758 +167 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
|
Ugh goddamit Windows... including winsock headers wrecks the global namespace... we don't even support Windows but I'm trying not to ifdef-out too much stuff... |
e86a5ee to
6b91b13
Compare
RAII wrapper around Linux OFD locks (fcntl F_OFD_SETLK) for per-fd file locking needed by the cluster DO ownership protocol. OFD locks are per open-file-description rather than per-process, which is essential since a single workerd process manages many DOs concurrently. Includes tryLock() (non-blocking), verifyHeld() (for NFSv4 lease verification), and move semantics. Non-Linux platforms fail at runtime rather than compile time so builds still succeed when cluster mode is unused. This code (and commit message) was entirely written by Opus 4.6.
Implements ClusterRegistry, a capnp::VatNetwork for multi-node cluster RPC. Each node generates an X25519 keypair as its identity, binds a listener (Unix socket or ephemeral TCP port based on CIDR), and registers itself in a shared directory. Includes: - cluster.capnp: VatId and stub Level 3/4 types for VatNetwork params - X25519PublicKey: small wrapper with hex encoding and HashMap support - ConnectionImpl: per-peer Connection with MessageStream, idle timeout, and refcounted outbound caching - Handshake protocol: raw 32-byte public key exchange before RPC starts - Peer discovery: lazy directory scanning with mtime-based liveness - CIDR/IP mode: getifaddrs-based IP matching, OFD-locked registry file with periodic mtime heartbeat - Unix mode: socket-file-as-registry-entry, connect-based liveness The end-to-end RPC tests are present but currently failing; the non-network tests (keypair generation, self-bootstrap, peer discovery, self-connect) all pass. This code (and commit message) was originally written by Opus 4.6 but the code sucked and was heavily refactored by my human hand.
Implements ClusterLockManager, which manages the lock files that serialize ownership of Durable Objects across a workerd cluster. Each lock file is named after the actor ID and contains the owner's 32-byte X25519 public key (or is empty if unowned). acquireOrRoute() encapsulates the full protocol: - If the lock file is the expected 32-byte size, read the owner key and look it up in the ClusterRegistry. If the owner is live (or might be), return a WorkerdDebugPort capability via clusterRpc.bootstrap() — the RpcSystem short-circuits self-routing without going through the VatNetwork. - Otherwise (empty file or confirmed-dead owner), try to acquire an exclusive OFD lock. If granted, ftruncate, write our identity, fsync, and return an OwnedLock. If refused, retry with exponential backoff and jitter. No checksum is needed on the lock file payload: a single pwrite() of 32 bytes is atomic with respect to concurrent readers (both on local filesystems and over NFS, where it's a single RPC to the server). The writer's sequence (ftruncate → pwrite → fsync) means readers see either size 0 or the full 32-byte key, never a partial write. This is a deviation from the original spec, which called for a CRC32. OwnedLock is an RAII handle whose destructor truncates the file and releases the lock, returning the DO to the unowned pool. verifyStillOwned() is provided for the NFSv4 lease check that the SQLite VFS will call on every file open (Commit 4). This code (and commit message) was originally written by Opus 4.6 with a fair amount of manual cleanup.
This callback is invoked immediately after `open()` opens a file. In NFS mode we'll use this hook to verify the new open was opened on the same NFS lease as the registry lock.
In cluster mode, all members of the cluster need to use the same key for channel tokens, so they can exchange them.
If the actor ID was derived from a name, it's important to pass the name along with the ID so that `ctx.id.name` works correctly.
(This is the commit that wires everything up, but this commit message describes the overall series.) workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects. workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance. But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance! This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it. The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4. The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks *and* loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout. A complete design doc can be found here: https://gist.github.com/kentonv/cd8001237dc1181058193de0e7509972 The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes). GPT 5.5, Opus 4.6, and Opus 4.7 were all employed. A blog post will come later. This particular commit was initially written by GPT 5.5 but heavily refactored by hand.
This test suite was written entirely by Claude Opus 4.7.
6b91b13 to
f7847b4
Compare
workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects.
workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance.
But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance!
This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it.
The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4.
The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks and loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout.
A complete design doc can be found here:
https://gist.github.com/kentonv/baeebc2de19c6ae81d71e09e822b6c45
The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes).
A blog post will come later.
Opus 4.6, Opus 4.7, and GPT 5.5 were all involved in this PR, though quite a lot of manual editing occurred as well.