WSL local gateway onboarding validation blockers from PR #274 (#281)

@shanselman

Copilot here, filing as a bot on Scott's behalf after local validation and triage of draft PR #274 by @indierawk2k2.

First: thank you for the careful WSL gateway work. The overall direction is good: a dedicated app-owned WSL distro that behaves like a private local gateway appliance is the right shape. Local validation got deep into the flow: WSL instance creation, first boot configuration, OpenClaw install, gateway config, service install, service start, health wait, and setup-code generation all completed. The remaining issues are specific and fixable, but they are serious enough that the PR should stay unmerged until they are addressed.

Validation result

The latest full WSL validation run reached PairOperator and failed with:

operator_auth_failed: Gateway rejected operator authentication.

Important context: the validation harness also wrote to the real tray settings file instead of the isolated run folder, so the auth failure may be partially contaminated by stale real credentials. The validation harness isolation bug should be fixed first, then the end-to-end WSL flow should be rerun from a clean state.

Blocking issues and suggested fixes

1. Validation isolation writes to real user settings

The validation script intends to isolate AppData/LocalAppData under the run directory, but it sets:

OPENCLAW_TRAY_APPDATA_DIR
OPENCLAW_TRAY_LOCALAPPDATA_DIR

SettingsManager uses OPENCLAW_TRAY_DATA_DIR, so the launched tray can still fall back to the real %APPDATA%\OpenClawTray\settings.json. During local validation this did happen: the real settings file was touched and GatewayUrl was changed to ws://localhost:18789.

Suggested fix:

  • Set OPENCLAW_TRAY_DATA_DIR for validation runs, in addition to any identity/setup-state variables that are still needed.
  • Make the validation data directories explicit, for example:
    • OPENCLAW_TRAY_DATA_DIR=$runRoot\isolated\data
    • OPENCLAW_TRAY_APPDATA_DIR=$runRoot\isolated\appdata
    • OPENCLAW_TRAY_LOCALAPPDATA_DIR=$runRoot\isolated\localappdata
  • Add a pre-run or first-screenshot assertion that the effective settings path is under $runRoot before any WSL work starts.
  • Add a regression test or script check that validate-wsl-gateway.ps1 exports the env var used by SettingsManager.
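
The pre-run assertion above could look roughly like the following. This is a minimal Python sketch rather than the repo's PowerShell or C#; the function names `is_under` and `assert_isolated` are hypothetical, and only the three environment variable names come from this issue.

```python
from pathlib import Path


def is_under(path: str, root: str) -> bool:
    """Return True if `path` resolves to `root` itself or a location inside it."""
    resolved_path = Path(path).resolve()
    resolved_root = Path(root).resolve()
    return resolved_path == resolved_root or resolved_root in resolved_path.parents


def assert_isolated(env: dict, run_root: str) -> None:
    """Fail fast if a tray data directory is unset or escapes the run root."""
    required = (
        "OPENCLAW_TRAY_DATA_DIR",          # the variable SettingsManager reads
        "OPENCLAW_TRAY_APPDATA_DIR",
        "OPENCLAW_TRAY_LOCALAPPDATA_DIR",
    )
    for name in required:
        value = env.get(name)
        if not value:
            raise RuntimeError(f"{name} is not set; the tray may fall back to real settings")
        if not is_under(value, run_root):
            raise RuntimeError(f"{name}={value} is outside run root {run_root}")
```

Running a check like this before any WSL work starts turns the contamination seen in this run into an immediate, obvious failure instead of a misleading auth error much later.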

2. Validation can launch a stale tray executable

validate-wsl-gateway.ps1 prefers:

src\OpenClaw.Tray.WinUI\bin\x64\Debug\net10.0-windows10.0.19041.0\win-x64\OpenClaw.Tray.WinUI.exe

But ./build.ps1 builds the fresh product under:

src\OpenClaw.Tray.WinUI\bin\Debug\net10.0-windows10.0.19041.0\win-x64\OpenClaw.Tray.WinUI.exe

This caused one validation pass to run an older May 1 binary and produced misleading screenshots from stale code.

Suggested fix:

  • Have the validation script launch the exact output produced by its build step.
  • If -NoBuild is used, fail clearly when the target exe does not exist or its product version/commit does not match the expected repo HEAD.
  • Include the executable path and product version/commit in the summary artifact.

3. Operator pairing fails after gateway start

After freeing port 18789, the validation run completed the WSL install/start path but failed at operator pairing with operator_auth_failed.

Observed diagnostics:

  • Gateway service was active/running inside OpenClawGateway.
  • /var/lib/openclaw/gateway-token existed.
  • openclaw qr --json --url ws://localhost:18789 worked when run as the openclaw user.
  • Root CLI context did not have gateway auth configured.
  • The Windows tray settings were contaminated by the isolation bug, so ResolveCredential() may have preferred a stale Token over the freshly minted BootstrapToken.

Suggested fix:

  • Fix validation isolation first, then rerun from a clean OpenClawGateway distro and clean isolated settings.
  • During local setup, make credential source explicit in diagnostics without logging the secret value: gateway token, bootstrap token, stored device token, etc.
  • Consider clearing Token/BootstrapToken at the start of local WSL setup inside the isolated settings context, or prefer the newly minted bootstrap token for the local setup flow.
  • Make WSL CLI invocations that depend on user-scoped OpenClaw config explicit about the Linux user, e.g. -u openclaw, rather than relying on the distro default user.
  • Add an integration/stub test that proves: QR mint -> bootstrap connect -> local pending approval -> retry -> stored device-token reconnect.
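
To illustrate the "explicit credential source" and "prefer the fresh bootstrap token" suggestions, here is a small Python sketch. It is not the real ResolveCredential(): the precedence orders, the `local_setup` flag, and the dictionary-shaped settings are assumptions made for illustration. Only the field names Token, BootstrapToken, and DeviceToken are taken from this issue.

```python
from typing import NamedTuple, Optional


class Credential(NamedTuple):
    source: str   # name of the settings field the secret came from
    secret: str   # never written to logs or diagnostics


def resolve_credential(settings: dict, local_setup: bool) -> Optional[Credential]:
    """Pick a credential and remember which field supplied it.

    Both precedence orders below are hypothetical; the point is that the
    local setup flow does not let a stale Token win over a fresh BootstrapToken.
    """
    if local_setup:
        order = ["BootstrapToken", "DeviceToken", "Token"]
    else:
        order = ["Token", "BootstrapToken", "DeviceToken"]
    for name in order:
        value = settings.get(name)
        if value:
            return Credential(name, value)
    return None


def describe_source(cred: Optional[Credential]) -> str:
    """Diagnostic line that names the source but never includes the secret."""
    return f"credential source: {cred.source if cred else 'none'}"
```

With a line like `credential source: Token` in the diagnostics, the contaminated run above would have shown immediately that a stale credential was in play.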

4. Token redaction gaps in failure diagnostics

Some approval failure paths append raw stdout/stderr into setup state/UI diagnostics. The approval commands can include sensitive gateway tokens. If a CLI error echoes arguments or command text, the token can leak into logs or UI state.

Suggested fix:

  • Redact every subprocess stdout/stderr before writing to setup state, UI diagnostics, or logs.
  • Use the existing token sanitizer/redactor pattern, and also literal-replace the known gateway token value read from /var/lib/openclaw/gateway-token.
  • Add tests that simulate stdout/stderr containing --token <secret> and ensure persisted diagnostics contain only redacted values.
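
A redaction helper along these lines would cover both suggestions: pattern-based scrubbing of `--token <secret>` and literal replacement of the known gateway token. This is a Python sketch, not the existing sanitizer in the repo; the `redact` name, the `[REDACTED]` placeholder, and the assumption that the flag may appear as either `--token value` or `--token=value` are illustrative.

```python
import re

REDACTED = "[REDACTED]"

# Matches "--token <value>" and "--token=<value>". The flag name comes from
# this issue; the exact CLI syntax is an assumption.
_TOKEN_FLAG = re.compile(r"(--token[=\s]+)(\S+)")


def redact(text: str, known_secrets=()) -> str:
    """Scrub subprocess output before it reaches setup state, UI, or logs."""
    # Literal replacement first, so a secret echoed without its flag is still removed.
    for secret in known_secrets:
        if secret:
            text = text.replace(secret, REDACTED)
    # Then scrub whatever value follows a --token flag, known or not.
    return _TOKEN_FLAG.sub(lambda m: m.group(1) + REDACTED, text)
```

For example, `redact("openclaw approve --token abc123 failed", ["abc123"])` yields `openclaw approve --token [REDACTED] failed`, and the same call on `"server said abc123 is invalid"` removes the token even though no flag precedes it.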

5. wsl.conf configured probe is too weak

The current “already configured” check is not section-aware and can pass or fail for the wrong reason. A loose enabled=false check does not prove both [automount] and [interop] are disabled.

Suggested fix:

  • Parse /etc/wsl.conf by section and verify each expected setting:
    • [boot] systemd=true
    • [automount] enabled=false
    • [interop] enabled=false
    • [interop] appendWindowsPath=false
    • [user] default=openclaw
  • Accept harmless whitespace, CRLF, comments, and BOMs.
  • Add fixture tests for valid config, missing section, duplicate key, commented-out key, and whitespace variants.
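
A section-aware probe could be sketched as follows. This is illustrative Python, not the configurator's actual code. The expected keys are the ones listed above; treating a duplicated key as "not configured", comparing names and values exactly, and recognizing only full-line `#` comments are assumptions that the real probe would need to decide on.

```python
EXPECTED = {
    ("boot", "systemd"): "true",
    ("automount", "enabled"): "false",
    ("interop", "enabled"): "false",
    ("interop", "appendWindowsPath"): "false",
    ("user", "default"): "openclaw",
}


def parse_wsl_conf(text: str) -> dict:
    """Map (section, key) -> list of values found for that key in that section."""
    settings = {}
    section = None
    # Drop a leading BOM; splitlines() also handles CRLF line endings.
    for raw in text.lstrip("\ufeff").splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank line or full-line comment
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip()
            continue
        if section is None or "=" not in line:
            continue  # key outside any section, or not a key=value line
        key, _, value = line.partition("=")
        settings.setdefault((section, key.strip()), []).append(value.strip())
    return settings


def is_configured(text: str) -> bool:
    """True only if every expected key appears exactly once with the expected value."""
    parsed = parse_wsl_conf(text)
    for key, want in EXPECTED.items():
        values = parsed.get(key, [])
        if len(values) != 1 or values[0] != want:
            return False
    return True
```

Because each `enabled=false` is tied to its section, a file that disables automount but omits [interop] no longer passes, which is the case a loose text match gets wrong.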

6. Documentation/code mismatch for WSL time setting

The docs describe:

[time]
useWindowsTimezone=true

But the current configurator does not appear to write a [time] section.

Suggested fix:

  • Either add the [time] useWindowsTimezone=true setting to the written wsl.conf, if supported and desired, or remove it from the docs.
  • Include it in the section-aware config probe if it remains part of the intended appliance contract.

7. Cancellation can leave WSL work running

The setup process kills wsl.exe on timeout, but caller cancellation can leave in-flight WSL commands running after the user aborts setup.

Suggested fix:

  • Register cancellation to terminate the process tree for in-flight wsl.exe invocations, not just timeout paths.
  • Add a test that cancels mid-command and asserts the child process is killed and setup state does not continue mutating after cancellation.
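
The shape of the fix, handling cancellation on the same path as timeout, can be sketched in Python as below. The helper names are hypothetical and a `threading.Event` stands in for the caller's cancellation token. Note that `Popen.kill()` here terminates only the direct child process; the real fix for wsl.exe would need process-tree termination, for which .NET offers `Process.Kill(entireProcessTree: true)`.

```python
import subprocess
import sys
import threading
import time


def run_cancellable(args, cancel: threading.Event, timeout: float, poll: float = 0.1):
    """Run a command, killing it on timeout OR when `cancel` is set.

    Returns (outcome, returncode), outcome being "completed", "cancelled", or "timeout".
    """
    proc = subprocess.Popen(args, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
    deadline = time.monotonic() + timeout
    while True:
        try:
            return "completed", proc.wait(timeout=poll)
        except subprocess.TimeoutExpired:
            pass  # still running; check cancellation and deadline below
        cancelled = cancel.is_set()
        if cancelled or time.monotonic() >= deadline:
            proc.kill()   # direct child only; see the note above about process trees
            proc.wait()   # reap the process so nothing is left running
            return ("cancelled" if cancelled else "timeout"), proc.returncode


def demo():
    """Cancel a long-running child shortly after it starts."""
    cancel = threading.Event()
    threading.Timer(0.3, cancel.set).start()
    sleeper = [sys.executable, "-c", "import time; time.sleep(30)"]
    return run_cancellable(sleeper, cancel, timeout=20.0)
```

The test suggested above would follow the same pattern as `demo()`: start a long command, cancel mid-flight, then assert that the child has exited and that setup state stops changing.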

8. Legacy DeviceToken vs NodeDeviceToken behavior needs an explicit decision

The newer role-specific token split means node mode expects NodeDeviceToken, while some readiness paths can still look at legacy/operator DeviceToken. That can make startup believe node mode is ready even though the node client cannot authenticate with a node credential.

Suggested fix:

  • Decide explicitly between:
    1. migrating a legacy token to the appropriate role-specific field,
    2. requiring a fresh node-role pairing token, or
    3. allowing a tightly scoped fallback with clear audit/logging.
  • Update StartupSetupState and node client credential resolution to use the same role-specific readiness rule.
  • Add tests for legacy settings, operator-only token, node-only token, and both-token scenarios.
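
Whichever option is chosen, the readiness check and the client's credential lookup should go through one shared rule. A Python sketch of that idea follows. The mapping of the operator role to DeviceToken reflects the "legacy/operator DeviceToken" wording above but is still an assumption, and the function names and the `allow_legacy_fallback` switch (option 3) are hypothetical.

```python
from typing import Optional

# Assumed mapping of role -> settings field, based on the field names in this issue.
ROLE_FIELDS = {"operator": "DeviceToken", "node": "NodeDeviceToken"}


def credential_source(settings: dict, role: str,
                      allow_legacy_fallback: bool = False) -> Optional[str]:
    """Return the name of the field that would authenticate `role`, or None."""
    field = ROLE_FIELDS[role]
    if settings.get(field):
        return field
    # Option 3 above: a tightly scoped fallback, labeled so it shows up in audit logs.
    if role == "node" and allow_legacy_fallback and settings.get("DeviceToken"):
        return "DeviceToken (legacy fallback)"
    return None


def is_ready(settings: dict, role: str, allow_legacy_fallback: bool = False) -> bool:
    """Readiness uses exactly the same rule as credential resolution."""
    return credential_source(settings, role, allow_legacy_fallback) is not None
```

With the fallback off, a settings file holding only a legacy DeviceToken reports node mode as not ready, so startup and the node client can no longer disagree.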

Suggested acceptance criteria before reviving the WSL PR

  • Validation script proves all settings/identity/setup-state writes stay under the run folder.
  • Validation script launches the freshly built executable or fails on stale output.
  • Fresh-machine WSL validation reaches green from clean state without touching real user settings.
  • Pairing succeeds through stored device-token reconnect, not just initial bootstrap acceptance.
  • Token-bearing subprocess diagnostics are redacted in logs, setup state, and UI.
  • wsl.conf checks are section-aware and tested with realistic formatting variants.
  • User cancellation terminates in-flight WSL commands.
  • Docs and implementation agree on every WSL config key.

Thanks again for pushing this direction. The architecture is promising, but these fixes need to land before the WSL onboarding path is safe to take.
