WSL local gateway onboarding validation blockers from PR #274

Copilot here, filing as a bot on Scott's behalf after local validation and triage of draft PR #274 by @indierawk2k2.

First: thank you for the careful WSL gateway work. The overall direction is good: a dedicated app-owned WSL distro that behaves like a private local gateway appliance is the right shape. Local validation got deep into the flow: WSL instance creation, first boot configuration, OpenClaw install, gateway config, service install, service start, health wait, and setup-code generation all completed. The remaining issues are specific and fixable, but they are blocking, so the PR should stay unmerged until they are addressed.

## Validation result

The latest full WSL validation run reached `PairOperator` and failed with:

```text
operator_auth_failed: Gateway rejected operator authentication.
```

Important context: the validation harness also wrote to the real tray settings file instead of the isolated run folder, so stale real credentials may have contributed to the auth failure. The harness isolation bug should be fixed first, then the end-to-end WSL flow should be rerun from a clean state.

## Blocking issues and suggested fixes

### 1. Validation isolation writes to real user settings

The validation script intends to isolate AppData/LocalAppData under the run directory, but it sets:

```powershell
OPENCLAW_TRAY_APPDATA_DIR
OPENCLAW_TRAY_LOCALAPPDATA_DIR
```

`SettingsManager` reads `OPENCLAW_TRAY_DATA_DIR`, which the script does not set, so the launched tray can still fall back to the real `%APPDATA%\OpenClawTray\settings.json`. This happened during local validation: the real settings file was touched and `GatewayUrl` was changed to `ws://localhost:18789`.

**Suggested fix:**

- Set `OPENCLAW_TRAY_DATA_DIR` for validation runs, in addition to any identity/setup-state variables that are still needed.
- Make the validation data directories explicit, for example:
  - `OPENCLAW_TRAY_DATA_DIR=$runRoot\isolated\data`
  - `OPENCLAW_TRAY_APPDATA_DIR=$runRoot\isolated\appdata`
  - `OPENCLAW_TRAY_LOCALAPPDATA_DIR=$runRoot\isolated\localappdata`
- Add a pre-run or first-screenshot assertion that the effective settings path is under `$runRoot` before any WSL work starts.
- Add a regression test or script check that `validate-wsl-gateway.ps1` exports the env var used by `SettingsManager`.
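
The pre-run assertion above could look roughly like the following. This is a minimal sketch in Python for illustration only; the real check would live in `validate-wsl-gateway.ps1` or the tray itself. The helper names `is_under` and `check_isolation` are hypothetical, and the sketch rejects `..` segments instead of resolving them, which a production check should handle by normalizing the path first.

```python
from pathlib import PureWindowsPath

# Variables the validation run is expected to export (names taken from this issue).
REQUIRED_VARS = [
    "OPENCLAW_TRAY_DATA_DIR",
    "OPENCLAW_TRAY_APPDATA_DIR",
    "OPENCLAW_TRAY_LOCALAPPDATA_DIR",
]


def is_under(path: str, root: str) -> bool:
    """Return True if `path` equals `root` or lies beneath it.

    PureWindowsPath compares case-insensitively, matching Windows semantics.
    Paths containing '..' are rejected because pure paths do not resolve them.
    """
    p = PureWindowsPath(path)
    r = PureWindowsPath(root)
    if ".." in p.parts:
        return False
    return p == r or r in p.parents


def check_isolation(env: dict, run_root: str) -> list:
    """Return the required variables that are unset or point outside run_root."""
    return [
        name
        for name in REQUIRED_VARS
        if name not in env or not is_under(env[name], run_root)
    ]
```

An empty result means every directory is isolated; any returned name should abort the run before WSL work starts.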

### 2. Validation can launch a stale tray executable

`validate-wsl-gateway.ps1` prefers:

```text
src\OpenClaw.Tray.WinUI\bin\x64\Debug\net10.0-windows10.0.19041.0\win-x64\OpenClaw.Tray.WinUI.exe
```

But `./build.ps1` builds the fresh product under:

```text
src\OpenClaw.Tray.WinUI\bin\Debug\net10.0-windows10.0.19041.0\win-x64\OpenClaw.Tray.WinUI.exe
```

As a result, one validation pass ran an older binary (dated May 1) and produced misleading screenshots from stale code.

**Suggested fix:**

- Have the validation script launch the exact output produced by its build step.
- If `-NoBuild` is used, fail clearly when the target exe does not exist or its product version/commit does not match the expected repo `HEAD`.
- Include the executable path and product version/commit in the summary artifact.

### 3. Operator pairing fails after gateway start

After freeing port 18789, the validation run completed the WSL install/start path but failed at operator pairing with `operator_auth_failed`.

Observed diagnostics:

- Gateway service was active/running inside `OpenClawGateway`.
- `/var/lib/openclaw/gateway-token` existed.
- `openclaw qr --json --url ws://localhost:18789` worked when run as the `openclaw` user.
- Root CLI context did not have gateway auth configured.
- The Windows tray settings were contaminated by the isolation bug, so `ResolveCredential()` may have preferred a stale `Token` over the freshly minted `BootstrapToken`.

**Suggested fix:**

- Fix validation isolation first, then rerun from a clean `OpenClawGateway` distro and clean isolated settings.
- During local setup, make credential source explicit in diagnostics without logging the secret value: `gateway token`, `bootstrap token`, `stored device token`, etc.
- Consider clearing `Token`/`BootstrapToken` at the start of local WSL setup inside the isolated settings context, or prefer the newly minted bootstrap token for the local setup flow.
- Make WSL CLI invocations that depend on user-scoped OpenClaw config explicit about the Linux user, e.g. `-u openclaw`, rather than relying on the distro default user.
- Add an integration/stub test that proves: QR mint -> bootstrap connect -> local pending approval -> retry -> stored device-token reconnect.
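
One way to make the credential source explicit while preferring the fresh bootstrap token is sketched below, in Python for illustration only. The function name `resolve_for_local_setup` is hypothetical, and the mapping of the `Token` settings field to the label `gateway token` is an assumption; the actual `ResolveCredential()` may order and name things differently.

```python
# Preference order for the local WSL setup flow (assumed): the freshly minted
# bootstrap token wins over any token already stored in settings.
# The field-to-label mapping is an assumption made for this sketch.
_LOCAL_SETUP_ORDER = [
    ("BootstrapToken", "bootstrap token"),
    ("Token", "gateway token"),
]


def resolve_for_local_setup(settings: dict):
    """Return (source_label, secret), or (None, None) if nothing is usable."""
    for field, label in _LOCAL_SETUP_ORDER:
        secret = settings.get(field)
        if secret:
            return label, secret
    return None, None


def diagnostic_line(settings: dict) -> str:
    """Describe which credential source was chosen without exposing the secret."""
    label, _secret = resolve_for_local_setup(settings)
    return f"credential source: {label or 'none'}"
```

Only the label reaches diagnostics, so a contaminated settings file would show up as an unexpected source rather than as a leaked value.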

### 4. Token redaction gaps in failure diagnostics

Some approval failure paths append raw stdout/stderr into setup state/UI diagnostics. The approval commands can include sensitive gateway tokens. If a CLI error echoes arguments or command text, the token can leak into logs or UI state.

**Suggested fix:**

- Redact every subprocess stdout/stderr before writing to setup state, UI diagnostics, or logs.
- Use the existing token sanitizer/redactor pattern, and also literal-replace the known gateway token value read from `/var/lib/openclaw/gateway-token`.
- Add tests that simulate stdout/stderr containing `--token <secret>` and ensure persisted diagnostics contain only redacted values.
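
The two redaction layers above (pattern-based plus literal replacement of the known token) can be sketched as follows. This is illustrative Python, not the existing sanitizer; the `redact` name, the `[REDACTED]` placeholder, and the exact regular expression are assumptions.

```python
import re

REDACTED = "[REDACTED]"

# Masks the value after a --token flag, whether separated by '=' or whitespace.
_TOKEN_FLAG = re.compile(r"(--token(?:=|\s+))(\S+)")


def redact(text: str, known_secrets=()) -> str:
    """Redact known secret values and any --token argument from subprocess output."""
    # Literal replacement first, longest secret first, so a token echoed
    # without its flag (for example inside an error message) is still masked.
    for secret in sorted(known_secrets, key=len, reverse=True):
        if secret:
            text = text.replace(secret, REDACTED)
    return _TOKEN_FLAG.sub(lambda m: m.group(1) + REDACTED, text)
```

Every stdout/stderr string would pass through this function before it reaches setup state, UI diagnostics, or logs, with the value read from `/var/lib/openclaw/gateway-token` supplied in `known_secrets`.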

### 5. `wsl.conf` configured probe is too weak

The current “already configured” check is not section-aware and can pass or fail for the wrong reason. A loose `enabled=false` check does not prove both `[automount]` and `[interop]` are disabled.

**Suggested fix:**

- Parse `/etc/wsl.conf` by section and verify each expected setting:
  - `[boot] systemd=true`
  - `[automount] enabled=false`
  - `[interop] enabled=false`
  - `[interop] appendWindowsPath=false`
  - `[user] default=openclaw`
- Accept harmless whitespace, CRLF, comments, and BOMs.
- Add fixture tests for valid config, missing section, duplicate key, commented-out key, and whitespace variants.
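
A section-aware probe along these lines might look like the sketch below, written in Python for illustration. Several choices here are assumptions the issue leaves open: section and key names are compared case-insensitively, values must match exactly, duplicate keys are treated as "not configured", and only whole-line `#` comments are recognized. The names `parse_wsl_conf` and `is_configured` are hypothetical.

```python
# Expected appliance settings from this issue, keyed by (section, key) in lowercase.
EXPECTED = {
    ("boot", "systemd"): "true",
    ("automount", "enabled"): "false",
    ("interop", "enabled"): "false",
    ("interop", "appendwindowspath"): "false",
    ("user", "default"): "openclaw",
}


def parse_wsl_conf(text: str) -> dict:
    """Map (section, key) to the list of values seen, tolerating BOM, CRLF, and padding."""
    settings = {}
    section = None
    for raw in text.lstrip("\ufeff").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or whole-line comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        if section is None or "=" not in line:
            continue  # key outside any section, or not a key=value line
        key, _, value = line.partition("=")
        settings.setdefault((section, key.strip().lower()), []).append(value.strip())
    return settings


def is_configured(text: str) -> bool:
    """True only if every expected setting appears exactly once with the expected value."""
    settings = parse_wsl_conf(text)
    return all(settings.get(key) == [want] for key, want in EXPECTED.items())
```

Treating a duplicate key as a failure is the conservative choice: the worst case is that the configurator rewrites a file that was already acceptable.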

### 6. Documentation/code mismatch for WSL time setting

The docs describe:

```ini
[time]
useWindowsTimezone=true
```

But the current configurator does not appear to write a `[time]` section.

**Suggested fix:**

- Either add the `[time] useWindowsTimezone=true` setting to the written `wsl.conf`, if supported and desired, or remove it from the docs.
- Include it in the section-aware config probe if it remains part of the intended appliance contract.

### 7. Cancellation can leave WSL work running

The setup process kills `wsl.exe` on timeout, but caller cancellation can leave in-flight WSL commands running after the user aborts setup.

**Suggested fix:**

- Register cancellation to terminate the process tree for in-flight `wsl.exe` invocations, not just timeout paths.
- Add a test that cancels mid-command and asserts the child process is killed and setup state does not continue mutating after cancellation.
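
The intended behavior, where cancellation and timeout share one kill path, can be sketched as below. This is illustrative Python with a hypothetical `run_cancellable` helper; it kills only the direct child process, whereas the tray would need to terminate the whole tree (in .NET, `Process.Kill(entireProcessTree: true)` is one option to evaluate).

```python
import subprocess
import sys
import threading
import time


def run_cancellable(argv, cancel: threading.Event, timeout: float, poll: float = 0.05):
    """Run argv and return (outcome, returncode).

    outcome is 'completed', 'cancelled', or 'timed_out'. The child is killed
    on both cancellation and timeout, so neither path leaves it running.
    """
    proc = subprocess.Popen(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while True:
        try:
            # Wait briefly so cancellation is noticed within `poll` seconds.
            return "completed", proc.wait(timeout=poll)
        except subprocess.TimeoutExpired:
            pass
        if cancel.is_set():
            outcome = "cancelled"
        elif time.monotonic() >= deadline:
            outcome = "timed_out"
        else:
            continue
        proc.kill()
        proc.wait()  # reap the child so it does not linger
        return outcome, proc.returncode
```

A caller that receives `cancelled` should stop mutating setup state immediately instead of advancing to the next step.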

### 8. Legacy `DeviceToken` vs `NodeDeviceToken` behavior needs an explicit decision

The newer role-specific token split means node mode expects `NodeDeviceToken`, while some readiness paths can still look at legacy/operator `DeviceToken`. That can make startup believe node mode is ready even though the node client cannot authenticate with a node credential.

**Suggested fix:**

- Decide explicitly between:
  1. migrating a legacy token to the appropriate role-specific field,
  2. requiring a fresh node-role pairing token, or
  3. allowing a tightly scoped fallback with clear audit/logging.
- Update `StartupSetupState` and node client credential resolution to use the same role-specific readiness rule.
- Add tests for legacy settings, operator-only token, node-only token, and both-token scenarios.
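
Whichever option is chosen, the key point is that startup readiness and the client's credential resolution share one rule. The sketch below, in Python for illustration, assumes option 2 (no legacy fallback) purely to show the shape; `is_role_ready` is a hypothetical name, and treating `DeviceToken` as the operator-role field is an assumption based on the description above.

```python
# One shared mapping from role to the settings field that proves readiness.
# Assumes option 2: a node is ready only with its own role-specific token.
ROLE_TOKEN_FIELD = {
    "operator": "DeviceToken",
    "node": "NodeDeviceToken",
}


def is_role_ready(settings: dict, role: str) -> bool:
    """Return True only if the role-specific token for `role` is present and non-empty."""
    return bool(settings.get(ROLE_TOKEN_FIELD[role]))
```

With this rule, settings that contain only an operator `DeviceToken` no longer make node mode look ready.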

## Suggested acceptance criteria before reviving the WSL PR

- Validation script proves all settings/identity/setup-state writes stay under the run folder.
- Validation script launches the freshly built executable or fails on stale output.
- Fresh-machine WSL validation reaches green from clean state without touching real user settings.
- Pairing succeeds through stored device-token reconnect, not just initial bootstrap acceptance.
- Token-bearing subprocess diagnostics are redacted in logs, setup state, and UI.
- `wsl.conf` checks are section-aware and tested with realistic formatting variants.
- User cancellation terminates in-flight WSL commands.
- Docs and implementation agree on every WSL config key.

Thanks again for pushing this direction. The architecture is promising, but these fixes need to land before the WSL onboarding path is safe to take.
