Make self-hosted Durable Objects scalable across a cluster. by kentonv · Pull Request #6780 · cloudflare/workerd

kentonv · 2026-05-23T23:39:10Z

workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects.

workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance.

But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance!

This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it.

The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4.

The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks and loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout.

A complete design doc can be found here:

https://gist.github.com/kentonv/baeebc2de19c6ae81d71e09e822b6c45

The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes).

A blog post will come later.

Opus 4.6, Opus 4.7, and GPT 5.5 were all involved in this PR, though quite a lot of manual editing occurred as well.

This defines how the new feature is configured and exposed on the network.

ask-bonk

I'm Bonk, and I've done a quick review of your PR.

This PR adds cluster mode to workerd for scalable self-hosted Durable Objects, using filesystem-based service discovery and OFD locks for ownership coordination.

Findings (highest severity first):

[MEDIUM] Typo: verifyRigstration should be verifyRegistration (3 occurrences) — misspelled method name propagated to a KJ_BIND_METHOD callback that's invoked on every SQLite file open in cluster mode.
[LOW] catch(...) in connect() — idiomatic KJ uses KJ_TRY/KJ_CATCH; bare catch(...) will also catch non-KJ exceptions (including std::bad_alloc) and route them through cleanupFailedPeer.
[LOW] Unbounded retry loop in acquireOrRoute() — no maximum attempt count or deadline; a perpetually contested lock will spin indefinitely.
[INFO] ClusterConfig field ordinals @2 and @3 are in textual reverse order (key @2 before network @3 in the struct definition but network appears first textually). Not a bug, but mildly surprising.

These comments were generated by an AI code review assistant and may not be perfect.

ask-bonk · 2026-05-23T23:43:11Z

+
+  kj::Path path({actorId});
+
+  for (uint attempt = 0;; ++attempt) {


This loop has no upper bound on attempt. If the lock is perpetually contested (e.g., a bug where an owner holds the exclusive lock but writes a different key, or the file is locked by a non-workerd process), this spins indefinitely with backoff capped at 1s. Consider adding a maximum attempt count or a deadline, and returning an error when exceeded.

(Not blocking for a v1, but worth a TODO.)

ask-bonk · 2026-05-23T23:43:27Z

Review posted successfully on PR #6780. Here's a summary of what I found:

[MEDIUM] Typo verifyRigstration — 3 occurrences of a misspelled method name (verifyRigstration instead of verifyRegistration) in server.c++. Posted suggestions for all three.
[LOW] catch(...) instead of KJ_TRY/KJ_CATCH — In cluster-registry.c++:738, bare catch(...) is not idiomatic for the workerd codebase and catches broader than intended.
[LOW] Unbounded retry loop — acquireOrRoute() in cluster-lock.c++ has no maximum attempt count or deadline, which could spin indefinitely on a perpetually contested lock.
[INFO] Cap'n Proto ordinals — ClusterConfig field ordinals @2 and @3 appear in reverse textual order. Not a bug but slightly surprising.

The overall design is solid — the OFD lock primitive, VatNetwork implementation, and integration with the existing server architecture are well structured. The capnp schema changes are wire-compatible, the test coverage is thorough (unit tests for OFD locks, cluster registry, cluster lock manager, plus integration tests), and the backward compatibility is maintained properly (cluster mode is gated behind --experimental, localDisk experimental marker is removed appropriately).

github run

github-actions · 2026-05-23T23:46:38Z

The generated output of @cloudflare/workers-types matches the snapshot in types/generated-snapshot 🎉

codecov-commenter · 2026-05-24T01:25:11Z

Codecov Report

❌ Patch coverage is 66.35945% with 438 lines in your changes missing coverage. Please review.
✅ Project coverage is 66.61%. Comparing base (71b97ec) to head (f7847b4).
⚠️ Report is 18 commits behind head on main.

Files with missing lines	Patch %	Lines
src/workerd/server/cluster-registry.c++	63.78%	117 Missing and 55 partials ⚠️
src/workerd/server/server.c++	24.76%	140 Missing and 18 partials ⚠️
src/workerd/server/cluster-lock-test.c++	83.13%	0 Missing and 29 partials ⚠️
src/workerd/util/ofd-lock-test.c++	70.40%	0 Missing and 29 partials ⚠️
src/workerd/server/cluster-registry-test.c++	89.78%	0 Missing and 19 partials ⚠️
src/workerd/server/channel-token.c++	26.66%	10 Missing and 1 partial ⚠️
src/workerd/server/cluster-lock.c++	87.30%	3 Missing and 5 partials ⚠️
src/workerd/util/ofd-lock.c++	84.44%	1 Missing and 6 partials ⚠️
src/workerd/util/sqlite.c++	79.16%	5 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #6780      +/-   ##
==========================================
- Coverage   66.62%   66.61%   -0.02%     
==========================================
  Files         402      410       +8     
  Lines      115914   117188    +1274     
  Branches    19425    19669     +244     
==========================================
+ Hits        77228    78065     +837     
- Misses      27095    27365     +270     
- Partials    11591    11758     +167

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

kentonv · 2026-05-24T02:32:31Z

Ugh goddamit Windows... including winsock headers wrecks the global namespace... we don't even support Windows but I'm trying not to ifdef-out too much stuff...

RAII wrapper around Linux OFD locks (fcntl F_OFD_SETLK) for per-fd file locking needed by the cluster DO ownership protocol. OFD locks are per open-file-description rather than per-process, which is essential since a single workerd process manages many DOs concurrently. Includes tryLock() (non-blocking), verifyHeld() (for NFSv4 lease verification), and move semantics. Non-Linux platforms fail at runtime rather than compile time so builds still succeed when cluster mode is unused. This code (and commit message) was entirely written by Opus 4.6.

Implements ClusterRegistry, a capnp::VatNetwork for multi-node cluster RPC. Each node generates an X25519 keypair as its identity, binds a listener (Unix socket or ephemeral TCP port based on CIDR), and registers itself in a shared directory. Includes: - cluster.capnp: VatId and stub Level 3/4 types for VatNetwork params - X25519PublicKey: small wrapper with hex encoding and HashMap support - ConnectionImpl: per-peer Connection with MessageStream, idle timeout, and refcounted outbound caching - Handshake protocol: raw 32-byte public key exchange before RPC starts - Peer discovery: lazy directory scanning with mtime-based liveness - CIDR/IP mode: getifaddrs-based IP matching, OFD-locked registry file with periodic mtime heartbeat - Unix mode: socket-file-as-registry-entry, connect-based liveness The end-to-end RPC tests are present but currently failing; the non-network tests (keypair generation, self-bootstrap, peer discovery, self-connect) all pass. This code (and commit message) was originally written by Opus 4.6 but the code sucked and was heavily refactored by my human hand.

Implements ClusterLockManager, which manages the lock files that serialize ownership of Durable Objects across a workerd cluster. Each lock file is named after the actor ID and contains the owner's 32-byte X25519 public key (or is empty if unowned). acquireOrRoute() encapsulates the full protocol: - If the lock file is the expected 32-byte size, read the owner key and look it up in the ClusterRegistry. If the owner is live (or might be), return a WorkerdDebugPort capability via clusterRpc.bootstrap() — the RpcSystem short-circuits self-routing without going through the VatNetwork. - Otherwise (empty file or confirmed-dead owner), try to acquire an exclusive OFD lock. If granted, ftruncate, write our identity, fsync, and return an OwnedLock. If refused, retry with exponential backoff and jitter. No checksum is needed on the lock file payload: a single pwrite() of 32 bytes is atomic with respect to concurrent readers (both on local filesystems and over NFS, where it's a single RPC to the server). The writer's sequence (ftruncate → pwrite → fsync) means readers see either size 0 or the full 32-byte key, never a partial write. This is a deviation from the original spec, which called for a CRC32. OwnedLock is an RAII handle whose destructor truncates the file and releases the lock, returning the DO to the unowned pool. verifyStillOwned() is provided for the NFSv4 lease check that the SQLite VFS will call on every file open (Commit 4). This code (and commit message) was originally written by Opus 4.6 with a fair amount of manual cleanup.

This callback is invoked immediately after `open()` opens a file. In NFS mode we'll use this hook to verify the new open was opened on the same NFS lease as the registry lock.

In cluster mode, all members of the cluster need to use the same key for channel tokens, so they can exchange them.

If the actor ID was derived from a name, it's important to pass the name along with the ID so that `ctx.id.name` works correctly.

(This is the commit that wires everything up, but this commit message describes the overall series.) workerd has always been intended to be something you can run in production, in order to self-host Workers outside of Cloudflare. For many use cases it does in fact work for this. But, there has been one missing piece: Durable Objects. workerd has always supported running Durable Objects for the purpose of local testing, but only within the scope of a single instance running on a single thread. Any Durable Objects created would run within that same workerd instance. But in production, you probably want to utilize more than one thread on one machine. For stateless Workers, this could be done just fine: just run multiple instances of workerd and load-balance across them. But doing this completely broke the model of Durable Objects, where each object is supposed to have a single instance globally, not one per workerd instance! This change fixes the problem, by introducing a new "cluster" mode. In cluster mode, you can run several workerd instances that are aware of each other and able to route requests to each other. All instances in a cluster run the same config. Mostly these workerd instances behave like they would normally, except when a request is made to a Durable Object, they coordinate to make sure only one workerd instance owns the DO, and others route to it. The design assumes a shared filesystem for underlying DO storage. All instances must be on the same filesystem. If all instances are on the same machine (useful for utilizing multiple cores), then this can be any local filesystem. Otherwise, it must be NFSv4, or some network filesystem that has exactly the same lock/lease semantics as NFSv4. The shared filesystem is also used for service discovery and locking. This is unconventional, but has a major advantage vs something like etcd or Consul: If a node loses its NFS lease, it simultaneously loses its locks *and* loses the ability to write to any open files. This provides "fencing": there's no way a node could continue writing after other nodes believe that it is dead. If locking were provided by a separate service, then it becomes extremely difficult to ensure that a node can't accidentally write after losing its lock due to a timeout. A complete design doc can be found here: https://gist.github.com/kentonv/cd8001237dc1181058193de0e7509972 The "Implementation Plan" in the design doc was largely written by AI (with several rounds of revision guided by me); the sections before that were largely written by me (with some AI review and minor changes). GPT 5.5, Opus 4.6, and Opus 4.7 were all employed. A blog post will come later. This particular commit was initially written by GPT 5.5 but heavily refactored by hand.

This test suite was written entirely by Claude Opus 4.7.

Update config capnp and worker-interface.capnp for clustering.

87a4524

This defines how the new feature is configured and exposed on the network.

kentonv requested a review from justin-mp May 23, 2026 23:39

kentonv requested review from a team as code owners May 23, 2026 23:39

ask-bonk Bot reviewed May 23, 2026

View reviewed changes

kentonv force-pushed the kenton/scalable-do branch 2 times, most recently from 6a9647f to e86a5ee Compare May 24, 2026 00:54

kentonv force-pushed the kenton/scalable-do branch from e86a5ee to 6b91b13 Compare May 24, 2026 14:10

kentonv added 8 commits May 24, 2026 13:04

Add a post-open callback to SQLite.

a63a0e2

This callback is invoked immediately after `open()` opens a file. In NFS mode we'll use this hook to verify the new open was opened on the same NFS lease as the registry lock.

Extend ChannelTokenHandler to use shared keys in cluster mode.

e361e1a

In cluster mode, all members of the cluster need to use the same key for channel tokens, so they can exchange them.

Extend WorkerdDebugPort to include actor name with ID.

d48022d

If the actor ID was derived from a name, it's important to pass the name along with the ID so that `ctx.id.name` works correctly.

Add integration tests for clustering mode.

f7847b4

This test suite was written entirely by Claude Opus 4.7.

kentonv force-pushed the kenton/scalable-do branch from 6b91b13 to f7847b4 Compare May 24, 2026 18:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Make self-hosted Durable Objects scalable across a cluster.#6780

Make self-hosted Durable Objects scalable across a cluster.#6780
kentonv wants to merge 9 commits into
mainfrom
kenton/scalable-do

kentonv commented May 23, 2026 •

edited

Loading

Uh oh!

ask-bonk Bot left a comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ask-bonk Bot May 23, 2026

Uh oh!

ask-bonk Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 24, 2026 •

edited

Loading

Uh oh!

kentonv commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants


		kj::Path path({actorId});

		for (uint attempt = 0;; ++attempt) {

Conversation

kentonv commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

ask-bonk Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

ask-bonk Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

ask-bonk Bot commented May 23, 2026

Uh oh!

github-actions Bot commented May 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov-commenter commented May 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kentonv commented May 24, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

kentonv commented May 23, 2026 •

edited

Loading

github-actions Bot commented May 23, 2026 •

edited

Loading

codecov-commenter commented May 24, 2026 •

edited

Loading