Skip to content

Make self-hosted Durable Objects scalable across a cluster.#6780

Open
kentonv wants to merge 9 commits into
mainfrom
kenton/scalable-do
Open

Make self-hosted Durable Objects scalable across a cluster.#6780
kentonv wants to merge 9 commits into
mainfrom
kenton/scalable-do

Conversation

@kentonv
Copy link
Copy Markdown
Member

@kentonv kentonv commented May 23, 2026

workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects.

workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance.

But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance!

This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it.

The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4.

The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks and loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout.

A complete design doc can be found here:

https://gist.github.com/kentonv/baeebc2de19c6ae81d71e09e822b6c45

The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes).

A blog post will come later.

Opus 4.6, Opus 4.7, and GPT 5.5 were all involved in this PR, though quite a lot of manual editing occurred as well.

This defines how the new feature is configured and exposed on the network.
@kentonv kentonv requested a review from justin-mp May 23, 2026 23:39
@kentonv kentonv requested review from a team as code owners May 23, 2026 23:39
Copy link
Copy Markdown
Contributor

@ask-bonk ask-bonk Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR adds cluster mode to workerd for scalable self-hosted Durable Objects, using filesystem-based service discovery and OFD locks for ownership coordination.

Findings (highest severity first):

  1. [MEDIUM] Typo: verifyRigstration should be verifyRegistration (3 occurrences) — misspelled method name propagated to a KJ_BIND_METHOD callback that's invoked on every SQLite file open in cluster mode.
  2. [LOW] catch(...) in connect() — idiomatic KJ uses KJ_TRY/KJ_CATCH; bare catch(...) will also catch non-KJ exceptions (including std::bad_alloc) and route them through cleanupFailedPeer.
  3. [LOW] Unbounded retry loop in acquireOrRoute() — no maximum attempt count or deadline; a perpetually contested lock will spin indefinitely.
  4. [INFO] ClusterConfig field ordinals @2 and @3 are in textual reverse order (key @2 before network @3 in the struct definition but network appears first textually). Not a bug, but mildly surprising.

These comments were generated by an AI code review assistant and may not be perfect.

Comment thread src/workerd/server/server.c++ Outdated
Comment thread src/workerd/server/server.c++ Outdated
Comment thread src/workerd/server/server.c++ Outdated
Comment thread src/workerd/server/cluster-registry.c++

kj::Path path({actorId});

for (uint attempt = 0;; ++attempt) {
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This loop has no upper bound on attempt. If the lock is perpetually contested (e.g., a bug where an owner holds the exclusive lock but writes a different key, or the file is locked by a non-workerd process), this spins indefinitely with backoff capped at 1s. Consider adding a maximum attempt count or a deadline, and returning an error when exceeded.

(Not blocking for a v1, but worth a TODO.)

@ask-bonk
Copy link
Copy Markdown
Contributor

ask-bonk Bot commented May 23, 2026

Review posted successfully on PR #6780. Here's a summary of what I found:

  1. [MEDIUM] Typo verifyRigstration — 3 occurrences of a misspelled method name (verifyRigstration instead of verifyRegistration) in server.c++. Posted suggestions for all three.

  2. [LOW] catch(...) instead of KJ_TRY/KJ_CATCH — In cluster-registry.c++:738, bare catch(...) is not idiomatic for the workerd codebase and catches broader than intended.

  3. [LOW] Unbounded retry loopacquireOrRoute() in cluster-lock.c++ has no maximum attempt count or deadline, which could spin indefinitely on a perpetually contested lock.

  4. [INFO] Cap'n Proto ordinalsClusterConfig field ordinals @2 and @3 appear in reverse textual order. Not a bug but slightly surprising.

The overall design is solid — the OFD lock primitive, VatNetwork implementation, and integration with the existing server architecture are well structured. The capnp schema changes are wire-compatible, the test coverage is thorough (unit tests for OFD locks, cluster registry, cluster lock manager, plus integration tests), and the backward compatibility is maintained properly (cluster mode is gated behind --experimental, localDisk experimental marker is removed appropriately).

github run

@github-actions
Copy link
Copy Markdown

github-actions Bot commented May 23, 2026

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

@kentonv kentonv force-pushed the kenton/scalable-do branch 2 times, most recently from 6a9647f to e86a5ee Compare May 24, 2026 00:54
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 24, 2026

Codecov Report

❌ Patch coverage is 66.35945% with 438 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.61%. Comparing base (71b97ec) to head (f7847b4).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
src/workerd/server/cluster-registry.c++ 63.78% 117 Missing and 55 partials ⚠️
src/workerd/server/server.c++ 24.76% 140 Missing and 18 partials ⚠️
src/workerd/server/cluster-lock-test.c++ 83.13% 0 Missing and 29 partials ⚠️
src/workerd/util/ofd-lock-test.c++ 70.40% 0 Missing and 29 partials ⚠️
src/workerd/server/cluster-registry-test.c++ 89.78% 0 Missing and 19 partials ⚠️
src/workerd/server/channel-token.c++ 26.66% 10 Missing and 1 partial ⚠️
src/workerd/server/cluster-lock.c++ 87.30% 3 Missing and 5 partials ⚠️
src/workerd/util/ofd-lock.c++ 84.44% 1 Missing and 6 partials ⚠️
src/workerd/util/sqlite.c++ 79.16% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #6780      +/-   ##
==========================================
- Coverage   66.62%   66.61%   -0.02%     
==========================================
  Files         402      410       +8     
  Lines      115914   117188    +1274     
  Branches    19425    19669     +244     
==========================================
+ Hits        77228    78065     +837     
- Misses      27095    27365     +270     
- Partials    11591    11758     +167     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@kentonv
Copy link
Copy Markdown
Member Author

kentonv commented May 24, 2026

Ugh goddamit Windows... including winsock headers wrecks the global namespace... we don't even support Windows but I'm trying not to ifdef-out too much stuff...

@kentonv kentonv force-pushed the kenton/scalable-do branch from e86a5ee to 6b91b13 Compare May 24, 2026 14:10
kentonv added 8 commits May 24, 2026 13:04
RAII wrapper around Linux OFD locks (fcntl F_OFD_SETLK) for per-fd file
locking needed by the cluster DO ownership protocol. OFD locks are per
open-file-description rather than per-process, which is essential since a
single workerd process manages many DOs concurrently.

Includes tryLock() (non-blocking), verifyHeld() (for NFSv4 lease
verification), and move semantics. Non-Linux platforms fail at runtime
rather than compile time so builds still succeed when cluster mode is
unused.

This code (and commit message) was entirely written by Opus 4.6.
Implements ClusterRegistry, a capnp::VatNetwork for multi-node cluster
RPC. Each node generates an X25519 keypair as its identity, binds a
listener (Unix socket or ephemeral TCP port based on CIDR), and registers
itself in a shared directory.

Includes:
- cluster.capnp: VatId and stub Level 3/4 types for VatNetwork params
- X25519PublicKey: small wrapper with hex encoding and HashMap support
- ConnectionImpl: per-peer Connection with MessageStream, idle timeout,
  and refcounted outbound caching
- Handshake protocol: raw 32-byte public key exchange before RPC starts
- Peer discovery: lazy directory scanning with mtime-based liveness
- CIDR/IP mode: getifaddrs-based IP matching, OFD-locked registry file
  with periodic mtime heartbeat
- Unix mode: socket-file-as-registry-entry, connect-based liveness

The end-to-end RPC tests are present but currently failing; the
non-network tests (keypair generation, self-bootstrap, peer discovery,
self-connect) all pass.

This code (and commit message) was originally written by Opus 4.6 but the code sucked and was heavily refactored by my human hand.
Implements ClusterLockManager, which manages the lock files that serialize
ownership of Durable Objects across a workerd cluster. Each lock file is
named after the actor ID and contains the owner's 32-byte X25519 public
key (or is empty if unowned).

acquireOrRoute() encapsulates the full protocol:
- If the lock file is the expected 32-byte size, read the owner key and
  look it up in the ClusterRegistry. If the owner is live (or might be),
  return a WorkerdDebugPort capability via clusterRpc.bootstrap() — the
  RpcSystem short-circuits self-routing without going through the
  VatNetwork.
- Otherwise (empty file or confirmed-dead owner), try to acquire an
  exclusive OFD lock. If granted, ftruncate, write our identity, fsync,
  and return an OwnedLock. If refused, retry with exponential backoff
  and jitter.

No checksum is needed on the lock file payload: a single pwrite() of 32
bytes is atomic with respect to concurrent readers (both on local
filesystems and over NFS, where it's a single RPC to the server). The
writer's sequence (ftruncate → pwrite → fsync) means readers see either
size 0 or the full 32-byte key, never a partial write. This is a
deviation from the original spec, which called for a CRC32.

OwnedLock is an RAII handle whose destructor truncates the file and
releases the lock, returning the DO to the unowned pool.
verifyStillOwned() is provided for the NFSv4 lease check that the SQLite
VFS will call on every file open (Commit 4).

This code (and commit message) was originally written by Opus 4.6 with a fair amount of manual cleanup.
This callback is invoked immediately after `open()` opens a file. In NFS mode we'll use this hook to verify the new open was opened on the same NFS lease as the registry lock.
In cluster mode, all members of the cluster need to use the same key for channel tokens, so they can exchange them.
If the actor ID was derived from a name, it's important to pass the name along with the ID so that `ctx.id.name` works correctly.
(This is the commit that wires everything up, but this commit message describes the overall series.)

workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects.

workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance.

But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance!

This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it.

The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4.

The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks *and* loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout.

A complete design doc can be found here:

https://gist.github.com/kentonv/cd8001237dc1181058193de0e7509972

The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes). GPT 5.5, Opus 4.6, and Opus 4.7 were all employed.

A blog post will come later.

This particular commit was initially written by GPT 5.5 but heavily refactored by hand.
This test suite was written entirely by Claude Opus 4.7.
@kentonv kentonv force-pushed the kenton/scalable-do branch from 6b91b13 to f7847b4 Compare May 24, 2026 18:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants