Skip to content

[miniflare] Restart workerd process on crash after successful startup#13045

Open
hoodmane wants to merge 2 commits intomainfrom
hoodmane/handle-restart
Open

[miniflare] Restart workerd process on crash after successful startup#13045
hoodmane wants to merge 2 commits intomainfrom
hoodmane/handle-restart

Conversation

@hoodmane
Copy link
Contributor

@hoodmane hoodmane commented Mar 24, 2026

If workerd crashes while starting up, exit as before. If workerd crash happens in a handler, try to restart workerd.
I'm not really sure whether the added logic is correct if restarting workerd fails. On the other hand, starting it the first time worked so hopefully restarting it will work?
The added tests require cloudflare/workerd#6382 for abortIsolate().

cc @penalosa

  • Tests
    • Tests included/updated
    • Automated tests not possible - manual testing has been completed as follows:
    • Additional testing not necessary because:
  • Public documentation
    • Cloudflare docs PR(s):
    • Documentation not necessary because: I don't think it's necessary?

Open with Devin

@hoodmane hoodmane requested a review from a team as a code owner March 24, 2026 17:51
@hoodmane hoodmane requested a review from penalosa March 24, 2026 17:51
@changeset-bot
Copy link

changeset-bot bot commented Mar 24, 2026

🦋 Changeset detected

Latest commit: 4a4048f

The changes in this PR will be included in the next version bump.

Not sure what this means? Click here to learn what changesets are.

Click here if you're a maintainer who wants to add another changeset to this PR

@workers-devprod
Copy link
Contributor

Codeowners approval required for this PR:

  • @cloudflare/wrangler
Show detailed file reviewers
  • packages/miniflare/src/index.ts: [@cloudflare/wrangler]
  • packages/miniflare/src/runtime/index.ts: [@cloudflare/wrangler]
  • packages/miniflare/test/index.spec.ts: [@cloudflare/wrangler]

Copy link
Contributor

@ask-bonk ask-bonk bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm Bonk, and I've done a quick review of your PR.

This PR restarts the workerd process when it crashes after a successful startup, while preserving the existing fail-fast behavior for startup crashes.

Issues (ranked by severity):

  1. Unhandled promise rejection on restart failure — The .catch() handler re-throws the error, but the promise is voided (void this.#runtimeMutex...). This means a failed restart will produce an unhandled promise rejection, which crashes Node.js by default (since Node 15). You should either log the error instead of re-throwing, or store the promise somewhere it can be observed.

  2. TODO(NOW) left in code — This looks like a development reminder that should be resolved before merging.

  3. Flaky restart testsetTimeout(resolve, 0) is almost certainly not enough time for the full restart cycle (mutex acquisition → config assembly → workerd spawn → port ready). This will likely be flaky in CI. Consider polling mf.ready or mf.dispatchFetch with retries instead.

  4. Typo — "goit" → "got".

// can still test async initialisation failures, without test failures
// telling us to `dispose()` the instance.
maybeInstanceRegistry?.delete(this);
throw e;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Re-throwing inside .catch() on a voided promise creates an unhandled promise rejection, which crashes Node.js by default. Consider logging the error instead, or not re-throwing:

Suggested change
throw e;
this.#log.error(e);

onWorkerdCrashRestart: () => {
// workerd crashed after successful startup. Try to restart
// workerd. I copied this from the constructor.
// TODO(NOW): What do we do when it fails??
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This TODO(NOW) should be resolved before merging — it looks like a development note.

).rejects.toThrow();

// Wait for workerd to reset
await new Promise((resolve) => setTimeout(resolve, 0));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

setTimeout(resolve, 0) is unlikely to be enough for the full restart cycle (mutex + config assembly + workerd spawn + port listening). This will probably be flaky in CI. Consider awaiting mf.ready instead, which should resolve once the restart completes, or polling dispatchFetch with retries.

@hoodmane hoodmane force-pushed the hoodmane/handle-restart branch from 7127ca5 to 211410d Compare March 24, 2026 17:52
@ask-bonk
Copy link
Contributor

ask-bonk bot commented Mar 24, 2026

I'm Bonk, and I've done a quick review of your PR.

This PR restarts the workerd process when it crashes after a successful startup, while preserving the existing fail-fast behavior for startup crashes.

I posted a review with 4 issues (ranked by severity):

  1. Unhandled promise rejection on restart failure — The .catch() re-throws on a voided promise, which will crash Node.js. Suggested logging the error instead.
  2. TODO(NOW) left in code — Should be resolved before merging.
  3. Flaky restart testsetTimeout(resolve, 0) is almost certainly not enough time for the full restart cycle. Suggested awaiting mf.ready or polling with retries.
  4. Typo — "goit" → "got" (with suggestion).

github run

@github-actions
Copy link
Contributor

github-actions bot commented Mar 24, 2026

✅ All changesets look good

@pkg-pr-new
Copy link

pkg-pr-new bot commented Mar 24, 2026

create-cloudflare

npm i https://pkg.pr.new/create-cloudflare@13045

@cloudflare/kv-asset-handler

npm i https://pkg.pr.new/@cloudflare/kv-asset-handler@13045

miniflare

npm i https://pkg.pr.new/miniflare@13045

@cloudflare/pages-shared

npm i https://pkg.pr.new/@cloudflare/pages-shared@13045

@cloudflare/unenv-preset

npm i https://pkg.pr.new/@cloudflare/unenv-preset@13045

@cloudflare/vite-plugin

npm i https://pkg.pr.new/@cloudflare/vite-plugin@13045

@cloudflare/vitest-pool-workers

npm i https://pkg.pr.new/@cloudflare/vitest-pool-workers@13045

@cloudflare/workers-editor-shared

npm i https://pkg.pr.new/@cloudflare/workers-editor-shared@13045

wrangler

npm i https://pkg.pr.new/wrangler@13045

commit: 4a4048f

Copy link
Contributor

@devin-ai-integration devin-ai-integration bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Devin Review found 2 potential issues.

View 5 additional findings in Devin Review.

Open in Devin Review

Comment on lines +2086 to +2095
void this.#runtimeMutex
.runWith(() => this.#assembleAndUpdateConfig())
.catch((e) => {
// If initialisation failed, attempting to `dispose()` this instance
// will too. Therefore, remove from the instance registry now, so we
// can still test async initialisation failures, without test failures
// telling us to `dispose()` the instance.
maybeInstanceRegistry?.delete(this);
throw e;
});
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 Unhandled promise rejection when crash restart fails

The onWorkerdCrashRestart callback copies the error-handling pattern from the constructor (line 1081-1090) but applies it incorrectly. In the constructor, the promise chain (including the throw e in .catch()) is assigned to this.#initPromise, which is later await-ed in #waitForReady, so the rejection is handled. In the crash restart callback, the same .catch((e) => { throw e }) is used, but the resulting promise is void-ed — no handler is ever attached to it. When the restart fails (e.g., workerd crashes again at startup), the throw e produces a rejected promise with no rejection handler, triggering an unhandledRejection event in Node.js, which by default crashes the process (Node.js 15+). The throw e should be removed or replaced with error logging, since there is no consumer to propagate the error to.

Suggested change
void this.#runtimeMutex
.runWith(() => this.#assembleAndUpdateConfig())
.catch((e) => {
// If initialisation failed, attempting to `dispose()` this instance
// will too. Therefore, remove from the instance registry now, so we
// can still test async initialisation failures, without test failures
// telling us to `dispose()` the instance.
maybeInstanceRegistry?.delete(this);
throw e;
});
void this.#runtimeMutex
.runWith(() => this.#assembleAndUpdateConfig())
.catch((e) => {
// If initialisation failed, attempting to `dispose()` this instance
// will too. Therefore, remove from the instance registry now, so we
// can still test async initialisation failures, without test failures
// telling us to `dispose()` the instance.
maybeInstanceRegistry?.delete(this);
this.#log.error("Failed to restart workerd after crash:", e);
});
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Comment on lines +313 to 330
} else {
// workerd is now listening. Watch for unexpected exits so we can
// restart.
const currentProcess = this.#process;
currentProcess?.once("exit", () => {
if (this.#process !== currentProcess) {
// We goit here because dispose() set this.#process to
// undefined before sending SIGKILL
return;
}
if (abortSignal.aborted) {
return;
}
// Crash: clear stale #process and notify the caller.
this.#process = undefined;
options.onWorkerdCrashRestart?.();
});
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Early return in VSCode inspector path skips crash handler registration

When running inside a VSCode Debug Terminal with an inspector socket enabled, if the bootloader path is not found, updateConfig returns early at packages/miniflare/src/runtime/index.ts:279 (return ports). This bypasses the crash handler registration at lines 313-329, meaning workerd crashes after startup will not trigger automatic restarts in this scenario. The exit handler registration should be moved before the VSCode inspector block, or the early return should be refactored to fall through to the exit handler registration.

Prompt for agents
In packages/miniflare/src/runtime/index.ts, the crash handler registration (lines 313-330) is placed after the VSCode inspector block which has an early return at line 279. Move the crash handler registration block (the else clause at lines 313-330) to before the VSCode inspector block (line 270) so that the exit handler is always registered when ports are defined, regardless of the early return. The condition should be: if ports is defined and abortSignal is not aborted, register the exit handler. This ensures crash restarts work even when running in a VSCode Debug Terminal.
Open in Devin Review

Was this helpful? React with 👍 or 👎 to provide feedback.

Co-authored-by: ask-bonk[bot] <249159057+ask-bonk[bot]@users.noreply.github.com>
Comment on lines +319 to +321
// We got here because dispose() set this.#process to
// undefined before sending SIGKILL
return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we just get dispose to remove this handler instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll try that.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: Untriaged

Development

Successfully merging this pull request may close these issues.

3 participants