[miniflare] Restart workerd process on crash after successful startup#13045
[miniflare] Restart workerd process on crash after successful startup#13045
Conversation
🦋 Changeset detectedLatest commit: 4a4048f The changes in this PR will be included in the next version bump. Not sure what this means? Click here to learn what changesets are. Click here if you're a maintainer who wants to add another changeset to this PR |
|
Codeowners approval required for this PR:
Show detailed file reviewers
|
The added tests require cloudflare/workerd#6382 for abortIsolate().
There was a problem hiding this comment.
I'm Bonk, and I've done a quick review of your PR.
This PR restarts the workerd process when it crashes after a successful startup, while preserving the existing fail-fast behavior for startup crashes.
Issues (ranked by severity):
-
Unhandled promise rejection on restart failure — The
.catch()handler re-throws the error, but the promise is voided (void this.#runtimeMutex...). This means a failed restart will produce an unhandled promise rejection, which crashes Node.js by default (since Node 15). You should either log the error instead of re-throwing, or store the promise somewhere it can be observed. -
TODO(NOW)left in code — This looks like a development reminder that should be resolved before merging. -
Flaky restart test —
setTimeout(resolve, 0)is almost certainly not enough time for the full restart cycle (mutex acquisition → config assembly → workerd spawn → port ready). This will likely be flaky in CI. Consider pollingmf.readyormf.dispatchFetchwith retries instead. -
Typo — "goit" → "got".
| // can still test async initialisation failures, without test failures | ||
| // telling us to `dispose()` the instance. | ||
| maybeInstanceRegistry?.delete(this); | ||
| throw e; |
There was a problem hiding this comment.
Re-throwing inside .catch() on a voided promise creates an unhandled promise rejection, which crashes Node.js by default. Consider logging the error instead, or not re-throwing:
| throw e; | |
| this.#log.error(e); |
| onWorkerdCrashRestart: () => { | ||
| // workerd crashed after successful startup. Try to restart | ||
| // workerd. I copied this from the constructor. | ||
| // TODO(NOW): What do we do when it fails?? |
There was a problem hiding this comment.
This TODO(NOW) should be resolved before merging — it looks like a development note.
| ).rejects.toThrow(); | ||
|
|
||
| // Wait for workerd to reset | ||
| await new Promise((resolve) => setTimeout(resolve, 0)); |
There was a problem hiding this comment.
setTimeout(resolve, 0) is unlikely to be enough for the full restart cycle (mutex + config assembly + workerd spawn + port listening). This will probably be flaky in CI. Consider awaiting mf.ready instead, which should resolve once the restart completes, or polling dispatchFetch with retries.
7127ca5 to
211410d
Compare
|
I'm Bonk, and I've done a quick review of your PR. This PR restarts the workerd process when it crashes after a successful startup, while preserving the existing fail-fast behavior for startup crashes. I posted a review with 4 issues (ranked by severity):
|
|
✅ All changesets look good |
create-cloudflare
@cloudflare/kv-asset-handler
miniflare
@cloudflare/pages-shared
@cloudflare/unenv-preset
@cloudflare/vite-plugin
@cloudflare/vitest-pool-workers
@cloudflare/workers-editor-shared
wrangler
commit: |
| void this.#runtimeMutex | ||
| .runWith(() => this.#assembleAndUpdateConfig()) | ||
| .catch((e) => { | ||
| // If initialisation failed, attempting to `dispose()` this instance | ||
| // will too. Therefore, remove from the instance registry now, so we | ||
| // can still test async initialisation failures, without test failures | ||
| // telling us to `dispose()` the instance. | ||
| maybeInstanceRegistry?.delete(this); | ||
| throw e; | ||
| }); |
There was a problem hiding this comment.
🔴 Unhandled promise rejection when crash restart fails
The onWorkerdCrashRestart callback copies the error-handling pattern from the constructor (line 1081-1090) but applies it incorrectly. In the constructor, the promise chain (including the throw e in .catch()) is assigned to this.#initPromise, which is later await-ed in #waitForReady, so the rejection is handled. In the crash restart callback, the same .catch((e) => { throw e }) is used, but the resulting promise is void-ed — no handler is ever attached to it. When the restart fails (e.g., workerd crashes again at startup), the throw e produces a rejected promise with no rejection handler, triggering an unhandledRejection event in Node.js, which by default crashes the process (Node.js 15+). The throw e should be removed or replaced with error logging, since there is no consumer to propagate the error to.
| void this.#runtimeMutex | |
| .runWith(() => this.#assembleAndUpdateConfig()) | |
| .catch((e) => { | |
| // If initialisation failed, attempting to `dispose()` this instance | |
| // will too. Therefore, remove from the instance registry now, so we | |
| // can still test async initialisation failures, without test failures | |
| // telling us to `dispose()` the instance. | |
| maybeInstanceRegistry?.delete(this); | |
| throw e; | |
| }); | |
| void this.#runtimeMutex | |
| .runWith(() => this.#assembleAndUpdateConfig()) | |
| .catch((e) => { | |
| // If initialisation failed, attempting to `dispose()` this instance | |
| // will too. Therefore, remove from the instance registry now, so we | |
| // can still test async initialisation failures, without test failures | |
| // telling us to `dispose()` the instance. | |
| maybeInstanceRegistry?.delete(this); | |
| this.#log.error("Failed to restart workerd after crash:", e); | |
| }); |
Was this helpful? React with 👍 or 👎 to provide feedback.
| } else { | ||
| // workerd is now listening. Watch for unexpected exits so we can | ||
| // restart. | ||
| const currentProcess = this.#process; | ||
| currentProcess?.once("exit", () => { | ||
| if (this.#process !== currentProcess) { | ||
| // We goit here because dispose() set this.#process to | ||
| // undefined before sending SIGKILL | ||
| return; | ||
| } | ||
| if (abortSignal.aborted) { | ||
| return; | ||
| } | ||
| // Crash: clear stale #process and notify the caller. | ||
| this.#process = undefined; | ||
| options.onWorkerdCrashRestart?.(); | ||
| }); | ||
| } |
There was a problem hiding this comment.
🟡 Early return in VSCode inspector path skips crash handler registration
When running inside a VSCode Debug Terminal with an inspector socket enabled, if the bootloader path is not found, updateConfig returns early at packages/miniflare/src/runtime/index.ts:279 (return ports). This bypasses the crash handler registration at lines 313-329, meaning workerd crashes after startup will not trigger automatic restarts in this scenario. The exit handler registration should be moved before the VSCode inspector block, or the early return should be refactored to fall through to the exit handler registration.
Prompt for agents
In packages/miniflare/src/runtime/index.ts, the crash handler registration (lines 313-330) is placed after the VSCode inspector block which has an early return at line 279. Move the crash handler registration block (the else clause at lines 313-330) to before the VSCode inspector block (line 270) so that the exit handler is always registered when ports are defined, regardless of the early return. The condition should be: if ports is defined and abortSignal is not aborted, register the exit handler. This ensures crash restarts work even when running in a VSCode Debug Terminal.
Was this helpful? React with 👍 or 👎 to provide feedback.
Co-authored-by: ask-bonk[bot] <249159057+ask-bonk[bot]@users.noreply.github.com>
| // We got here because dispose() set this.#process to | ||
| // undefined before sending SIGKILL | ||
| return; |
There was a problem hiding this comment.
Could we just get dispose to remove this handler instead?
If workerd crashes while starting up, exit as before. If workerd crash happens in a handler, try to restart workerd.
I'm not really sure whether the added logic is correct if restarting workerd fails. On the other hand, starting it the first time worked so hopefully restarting it will work?
The added tests require cloudflare/workerd#6382 for abortIsolate().
cc @penalosa