
Workspace stuck with empty build_status forever after unexpected EOF from create API call #378

@robobryce

Description

Summary

A brev create invocation whose API response was truncated with unexpected EOF left the workspace in a state where the server-side build job was never enqueued. The VM was provisioned and reached status=RUNNING, but build_status stays empty forever, and the post-cloud-init setup-push (/opt/setup.sh, /etc/systemd/system/instance-oneshot.service, hostname rewrite to brev-${BREV_ENV_ID}, /etc/brev/metadata.json) never runs.

More than an hour later, the workspace is still in this stuck state. The same command with a different name produces a healthy workspace.

Reproduction

  • brev CLI version: v0.6.322
  • Org: vanguard-programming
  • Workspace name: gr-manager
  • Workspace ID: hclv1344v
  • Approximate timestamp of the failed create: 2026-05-01 17:43:46 UTC

Command (run from a logged-in dev box):

brev create gr-manager --type n2d-standard-8 --min-disk 589 --detached

CLI output on the first invocation:

[Worker 1] Trying n2d-standard-8 for instance 'gr-manager'...
2026/05/01 17:43:46 WARN  RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/org-31CQRkwV2DY20PnKdmX92lJgtlm/workspaces?cli_version=v0.6.322&local=true&os=linux&utm_source=cli": unexpected EOF, Attempt 1
2026/05/01 17:43:46 ERROR RESTY Post "...": unexpected EOF
[Worker 1] n2d-standard-8 Failed: ... unexpected EOF
Warning: Only created 0/1 instances
could only create 0/1 instances

A second invocation a few seconds later returned "duplicate workspace with name gr-manager", confirming the workspace row had been created server-side despite the truncated response.

Resulting state

brev ls for the workspace:

NAME        STATUS   BUILD  SHELL      ID         MACHINE         GPU
gr-manager  RUNNING         NOT READY  hclv1344v  n2d-standard-8  -

brev ls --json:

{
  "name": "gr-manager",
  "status": "RUNNING",
  "build_status": "",
  "shell_status": "NOT READY",
  "health_status": "HEALTHY"
}
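For anyone scripting detection of this state, here is a minimal Go sketch that classifies a workspace as stuck based on the `brev ls --json` output above. The struct field names are assumptions inferred from that JSON, not the real CLI types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Workspace mirrors the relevant fields of the `brev ls --json` output
// shown above. Field names are inferred from that JSON, not the CLI source.
type Workspace struct {
	Name        string `json:"name"`
	Status      string `json:"status"`
	BuildStatus string `json:"build_status"`
	ShellStatus string `json:"shell_status"`
}

// isStuck reports whether a workspace matches the broken state in this
// issue: the VM is RUNNING but the build job was never enqueued, so
// build_status is empty.
func isStuck(w Workspace) bool {
	return w.Status == "RUNNING" && w.BuildStatus == ""
}

func main() {
	raw := `{"name":"gr-manager","status":"RUNNING","build_status":"","shell_status":"NOT READY"}`
	var w Workspace
	if err := json.Unmarshal([]byte(raw), &w); err != nil {
		panic(err)
	}
	fmt.Printf("%s stuck=%v\n", w.Name, isStuck(w))
}
```

Healthy workspaces report a non-empty build_status (e.g. COMPLETED) and are not flagged.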

On the VM:

  • Cloud-init completed at 17:44:48 UTC. boot-finished exists, cloud-init status is done.
  • /opt/setup.sh: does not exist.
  • /etc/systemd/system/instance-oneshot.service: does not exist.
  • /etc/brev/metadata.json: does not exist.
  • /etc/hostname: still the GCE-default gr-manager-inst-3d8ij8ob5qwnwtaq3pxwc209nru (file mtime is the base-image bake date, never rewritten).
  • BREV_ENV_ID: unset system-wide.
  • Brev's control plane (source 52.13.205.207) successfully SSHes in as ubuntu every ~12-13 seconds; auth.log has dozens of Accepted publickey entries. None of those sessions ever execute the tee /opt/setup.sh / tee /etc/systemd/system/instance-oneshot.service / systemctl start instance-oneshot chain that healthy workspaces show.
  • 1+ hour after creation, none of the above has changed.

Healthy comparison

Three other workspaces (test, test2, gr-manager2; all n2d-standard-8 in the same vanguard-programming org, the last two created with the identical --type n2d-standard-8 --min-disk 589 --detached flag combination, the only difference being a clean create with no EOF) reached BUILD=COMPLETED + SHELL=READY within a few minutes. On those VMs, auth.log shows a sequence like:

sudo[1832]: ubuntu : COMMAND=/usr/bin/tee -a /opt/setup.sh
sudo[1834]: ubuntu : COMMAND=/usr/bin/chmod +x /opt/setup.sh
sudo[1868]: ubuntu : COMMAND=/usr/bin/tee /etc/systemd/instance-oneshot.env
... (writes the env file with environmentID='<the workspace id>')
sudo[1888]: ubuntu : COMMAND=/usr/bin/tee /etc/systemd/system/instance-oneshot.service
sudo[1923]: ubuntu : COMMAND=/usr/bin/systemctl start instance-oneshot

…running ~16 seconds after cloud-init finished. That sequence never fires on gr-manager.

Hypothesis

Server-side, the workspace creation flow appears to be:

  1. Insert workspace row.
  2. Provision VM.
  3. Enqueue post-cloud-init setup-push job.

The unexpected EOF (presumably from a transient network/proxy/upstream issue mid-response) likely interrupted the handler somewhere between step 1 and step 3. Step 1's effects persisted (workspace row exists, VM later got provisioned), but step 3 never happened, and there appears to be no reconciliation path that re-queues setup for an existing workspace whose build_status is empty.
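The failure mode can be illustrated with a toy Go simulation of the suspected (non-transactional) handler shape. Everything here is a hypothetical stand-in — the in-memory maps, function names, and statuses are not the actual control-plane code — it only shows why an interruption after the row insert but before the enqueue leaves build_status permanently empty:

```go
package main

import (
	"errors"
	"fmt"
)

// In-memory stand-ins for the real control-plane stores (hypothetical).
var (
	workspaces = map[string]string{} // workspace id -> build_status
	buildQueue []string              // ids with a setup-push job enqueued
)

var errEOF = errors.New("unexpected EOF")

// createWorkspace mirrors the suspected handler shape: the row insert
// (step 1) persists even when a later step fails, and nothing ever
// re-enqueues setup for a row left with an empty build_status.
func createWorkspace(id string, failAfterInsert bool) error {
	workspaces[id] = "" // step 1: insert row, build_status empty
	if failAfterInsert {
		return errEOF // handler dies; steps 2-3 never run, no rollback
	}
	buildQueue = append(buildQueue, id) // step 3: enqueue setup-push
	workspaces[id] = "PENDING"
	return nil
}

func main() {
	_ = createWorkspace("hclv1344v", true)   // the failed create
	_ = createWorkspace("gr-manager2", false) // a clean create
	fmt.Printf("hclv1344v build_status=%q queued_jobs=%d\n",
		workspaces["hclv1344v"], len(buildQueue))
}
```

Wrapping steps 1 and 3 in one transaction (or a transactional outbox) would make the row and the enqueued job succeed or fail together.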

The CLI side has no follow-up call after POST /workspaces (per pkg/cmd/gpucreate/gpucreate.go, createWorkspace makes a single store call), so the CLI can't usefully retry; the broken state is entirely server-side.

Suggested mitigations

  • Make the create POST handler atomic: if any step fails after the row insert, roll back the row (or mark build_status=FAILED so a subsequent brev reset --hard rebuilds rather than no-ops).
  • Reconciler for workspaces with status=RUNNING && build_status="" for >N minutes: re-enqueue setup, or transition to FAILED.
  • CLI-side: when brev create returns an error but a workspace with that name later turns out to exist, surface a clear "workspace was partially created — run brev reset --hard or brev delete" message.

Happy to provide more on-VM logs, journalctl output, or open an SSH session to gr-manager (workspace hclv1344v) for diagnostics — the instance is intentionally being kept around in this broken state until this is investigated.
