
Workspace stuck with empty build_status forever after unexpected EOF from create API call #378

@robobryce

Description

Summary

A brev create invocation whose API response was truncated with unexpected EOF left the workspace in a state where the server-side build job was never enqueued. The VM was provisioned and reached status=RUNNING, but build_status stays empty forever, and the post-cloud-init setup-push (/opt/setup.sh, /etc/systemd/system/instance-oneshot.service, hostname rewrite to brev-${BREV_ENV_ID}, /etc/brev/metadata.json) never runs.

More than an hour later, the workspace is still in this stuck state. The same command with a different name produces a healthy workspace.

Reproduction

  • brev CLI version: v0.6.322
  • Org: vanguard-programming
  • Workspace name: gr-manager
  • Workspace ID: hclv1344v
  • Approximate timestamp of the failed create: 2026-05-01 17:43:46 UTC

Command (run from a logged-in dev box):

brev create gr-manager --type n2d-standard-8 --min-disk 589 --detached

CLI output on the first invocation:

[Worker 1] Trying n2d-standard-8 for instance 'gr-manager'...
2026/05/01 17:43:46 WARN  RESTY Post "https://brevapi.us-west-2-prod.control-plane.brev.dev/api/organizations/org-31CQRkwV2DY20PnKdmX92lJgtlm/workspaces?cli_version=v0.6.322&local=true&os=linux&utm_source=cli": unexpected EOF, Attempt 1
2026/05/01 17:43:46 ERROR RESTY Post "...": unexpected EOF
[Worker 1] n2d-standard-8 Failed: ... unexpected EOF
Warning: Only created 0/1 instances
could only create 0/1 instances

A second invocation a few seconds later returned "duplicate workspace with name gr-manager", confirming the workspace row had been created server-side despite the truncated response.

Resulting state

brev ls for the workspace:

NAME        STATUS   BUILD  SHELL      ID         MACHINE         GPU
gr-manager  RUNNING         NOT READY  hclv1344v  n2d-standard-8  -

brev ls --json:

{
  "name": "gr-manager",
  "status": "RUNNING",
  "build_status": "",
  "shell_status": "NOT READY",
  "health_status": "HEALTHY"
}
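For anyone scripting detection of this state, here is a minimal Go sketch that classifies a workspace as stuck based on the `brev ls --json` output above. The struct field names are assumptions inferred from that JSON, not the real CLI types:

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Workspace mirrors the relevant fields of the `brev ls --json` output
// shown above. Field names are inferred from that JSON, not the CLI source.
type Workspace struct {
	Name        string `json:"name"`
	Status      string `json:"status"`
	BuildStatus string `json:"build_status"`
	ShellStatus string `json:"shell_status"`
}

// isStuck reports whether a workspace matches the broken state in this
// issue: the VM is RUNNING but the build job was never enqueued, so
// build_status is empty.
func isStuck(w Workspace) bool {
	return w.Status == "RUNNING" && w.BuildStatus == ""
}

func main() {
	raw := `{"name":"gr-manager","status":"RUNNING","build_status":"","shell_status":"NOT READY"}`
	var w Workspace
	if err := json.Unmarshal([]byte(raw), &w); err != nil {
		panic(err)
	}
	fmt.Printf("%s stuck=%v\n", w.Name, isStuck(w))
}
```

Healthy workspaces report a non-empty build_status (e.g. COMPLETED) and are not flagged.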

On the VM:

  • Cloud-init completed at 17:44:48 UTC. boot-finished exists, cloud-init status is done.
  • /opt/setup.sh: does not exist.
  • /etc/systemd/system/instance-oneshot.service: does not exist.
  • /etc/brev/metadata.json: does not exist.
  • /etc/hostname: still the GCE-default gr-manager-inst-3d8ij8ob5qwnwtaq3pxwc209nru (file mtime is the base-image bake date, never rewritten).
  • BREV_ENV_ID: unset system-wide.
  • Brev's control plane (source 52.13.205.207) successfully SSHes in as ubuntu every ~12-13 seconds; auth.log has dozens of Accepted publickey entries. None of those sessions ever execute the tee /opt/setup.sh / tee /etc/systemd/system/instance-oneshot.service / systemctl start instance-oneshot chain that healthy workspaces show.
  • 1+ hour after creation, none of the above has changed.

Healthy comparison

Three other workspaces (test, test2, gr-manager2; all n2d-standard-8 in the same vanguard-programming org, the last two created with the identical --type n2d-standard-8 --min-disk 589 --detached flag combination, the only difference being a clean create with no EOF) reached BUILD=COMPLETED + SHELL=READY within a few minutes. On those VMs, auth.log shows a sequence like:

sudo[1832]: ubuntu : COMMAND=/usr/bin/tee -a /opt/setup.sh
sudo[1834]: ubuntu : COMMAND=/usr/bin/chmod +x /opt/setup.sh
sudo[1868]: ubuntu : COMMAND=/usr/bin/tee /etc/systemd/instance-oneshot.env
... (writes the env file with environmentID='<the workspace id>')
sudo[1888]: ubuntu : COMMAND=/usr/bin/tee /etc/systemd/system/instance-oneshot.service
sudo[1923]: ubuntu : COMMAND=/usr/bin/systemctl start instance-oneshot

…running ~16 seconds after cloud-init finished. That sequence never fires on gr-manager.

Hypothesis

Server-side, the workspace creation flow appears to be:

  1. Insert workspace row.
  2. Provision VM.
  3. Enqueue post-cloud-init setup-push job.

The unexpected EOF (presumably from a transient network/proxy/upstream issue mid-response) likely interrupted the handler somewhere between step 1 and step 3. Step 1's effects persisted (workspace row exists, VM later got provisioned), but step 3 never happened, and there appears to be no reconciliation path that re-queues setup for an existing workspace whose build_status is empty.
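The failure mode can be illustrated with a toy Go simulation of the suspected (non-transactional) handler shape. Everything here is a hypothetical stand-in — the in-memory maps, function names, and statuses are not the actual control-plane code — it only shows why an interruption after the row insert but before the enqueue leaves build_status permanently empty:

```go
package main

import (
	"errors"
	"fmt"
)

// In-memory stand-ins for the real control-plane stores (hypothetical).
var (
	workspaces = map[string]string{} // workspace id -> build_status
	buildQueue []string              // ids with a setup-push job enqueued
)

var errEOF = errors.New("unexpected EOF")

// createWorkspace mirrors the suspected handler shape: the row insert
// (step 1) persists even when a later step fails, and nothing ever
// re-enqueues setup for a row left with an empty build_status.
func createWorkspace(id string, failAfterInsert bool) error {
	workspaces[id] = "" // step 1: insert row, build_status empty
	if failAfterInsert {
		return errEOF // handler dies; steps 2-3 never run, no rollback
	}
	buildQueue = append(buildQueue, id) // step 3: enqueue setup-push
	workspaces[id] = "PENDING"
	return nil
}

func main() {
	_ = createWorkspace("hclv1344v", true)   // the failed create
	_ = createWorkspace("gr-manager2", false) // a clean create
	fmt.Printf("hclv1344v build_status=%q queued_jobs=%d\n",
		workspaces["hclv1344v"], len(buildQueue))
}
```

Wrapping steps 1 and 3 in one transaction (or a transactional outbox) would make the row and the enqueued job succeed or fail together.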

The CLI side has no follow-up call after POST /workspaces (per pkg/cmd/gpucreate/gpucreate.go, createWorkspace makes a single store call), so the CLI can't usefully retry; the broken state is entirely server-side.

Suggested mitigations

  • Make the create POST handler atomic: if any step fails after the row insert, roll back the row (or mark build_status=FAILED so a subsequent brev reset --hard rebuilds rather than no-ops).
  • Reconciler for workspaces with status=RUNNING && build_status="" for >N minutes: re-enqueue setup, or transition to FAILED.
  • CLI-side: when brev create returns an error but a workspace with that name later turns out to exist, surface a clear "workspace was partially created — run brev reset --hard or brev delete" message.

Happy to provide more on-VM logs, journalctl output, or open an SSH session to gr-manager (workspace hclv1344v) for diagnostics — the instance is intentionally being kept around in this broken state until this is investigated.
