feat(onboarding): WSL local gateway setup, onboarding wizard, and security hardening #274
indierawk2k2 wants to merge 46 commits into openclaw:master
…kens (Phase 1)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dation
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…reconnect (Phase 2.1)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ase 2.2)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istro override
- Remove residual PreserveWorkerData property and worker_data_preserved step from LocalGatewayRemoveRequest/LocalGatewayLifecycleManager. Windows tray is the node; no WSL-worker vocabulary remains in product APIs.
- Gate OPENCLAW_WSL_DISTRO_NAME env override and explicit distroName parameter behind #if DEBUG || OPENCLAW_TRAY_TESTS via ResolveDistroName helper. Production builds are now hard-locked to OpenClawGateway regardless of caller input.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…Phase 4)
- App.App now exposes CreateLocalGatewaySetupEngine() backed by LocalGatewaySetupEngineFactory.CreateLocalOnly. Onboarding pages (Phase 5) can request the engine; NodeService is materialized eagerly so the engine can pair the Windows tray node into the gateway it installs.
- Add IdentityDataPath alongside DataPath (operator/node DeviceIdentity store at %APPDATA%\OpenClawTray, OPENCLAW_TRAY_APPDATA_DIR override for tests). NodeService now accepts identityDataPath; WindowsNodeClient is constructed with it so node device tokens land in the same role-aware DeviceIdentity store as operator tokens (Phase 1 model: shared location, role distinction inside).
- StartupSetupState.CanStartNodeGateway / RequiresSetup callsites now use IdentityDataPath so stored node device tokens are detected at the same path WindowsNodeClient writes them.
- No prototype env-var rootfs/manifest overrides, dev-shim auto-accept, or worker-in-WSL wiring ported (Phase 3 already pruned those phases; nothing to strip in App).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tupPath state (Phase 5.1)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… 5.2)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ngine (Phase 5.3)
Wires AdvanceRequested into OnboardingApp, supports OPENCLAW_ONBOARDING_START_SETUP_PATH and OPENCLAW_VISUAL_TEST_LOCAL_SETUP for screenshot capture without running the real WSL engine.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…se 5.4)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…only, no rootfs (Phase 6)
Clean port of prototype validate-wsl-gateway.ps1 reduced to four scenarios: PreflightOnly, UpstreamInstall, FreshMachine, Recreate.
Kept: UI automation (drives SetupWarningPage 'Set up locally' button [OnboardingSetupLocal] -> LocalSetupProgressPage), loopback-only endpoint diagnostics, real upstream setup-code/bootstrap proof, operator pairing proof, Windows tray node proof, separated validation/cleanup status, token/setup-code redaction, aka.ms/wsllogs link on failure.
Stripped: BuildRootfs/InstallOnly/Smoke/Loop scenarios, all rootfs/manifest/signing parameters, worker-in-WSL pairing, WSL-IP/lan/auto fallback diagnostics, AllowNonStandardDistroNameForDestructiveClean.
Recreate uses 'wsl --unregister OpenClawGateway' (NEVER --shutdown) per Craig. Network probes are loopback only.
Validation: PreflightOnly run PASS (status=Passed, validation=Passed). build.ps1 PASS. Shared.Tests 1180/1180. Tray.Tests 434/434.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…target gated cleanup (Phase 7)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…es (Phase 5 fast-follow)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ig's answers (Phase 8)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…(fr-fr/nl-nl/zh-cn/zh-tw)
Extract 17 hard-coded English strings from SetupWarningPage and
LocalSetupProgressPage into Resources.resw and add translations for all
four non-en-us locales. Adds OPENCLAW_TEST_LOCALE env hook on
OnboardingWindow for visual-test locale forcing.
Keys added (per locale):
- Onboarding_SetupWarning_{Title,Body,SetupLocally,Advanced} (4)
- Onboarding_LocalSetup_{Title,SubtitleIdle,SubtitleSuccess,Retry,TerminalFailure,DiagnosticsHint} (6)
- Onboarding_LocalSetup_Phase_{Preflight,CreateInstance,Configure,InstallCli,PrepareConfig,StartGateway,MintToken} (7)
Validation: build PASS, Tray 434/434, Shared 1180/1180,
LocalizationValidationTests green. Screenshot verified for fr-FR at
visual-test-output/phase5-localization/fr-fr/page-02.png; no truncation,
no English fallback, layout contract intact (MaxWidth 460, centered).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…per state (Phase 5 final)
Implements industry-standard onboarding-progress button policy on LocalSetupProgressPage per the autopilot defaults captured in .squad/decisions.md (round 11):
Idle (Pending) Next=Hidden, Back=Enabled
Running Next=VisibleDisabled, Back=Enabled
Complete Next=VisibleEnabled, Back=Enabled (1s pre-auto-advance; tap-to-skip)
FailedRetryable Next=VisibleDisabled, Back=Enabled (in-page Try again)
FailedTerminal Next=VisibleDisabled, Back=Enabled (force back-out)
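The table above is small enough to express as a pure lookup. The following is a minimal Python sketch of that mapping, not the actual LocalSetupProgressPolicy (which is C#); the enum and function names here are hypothetical, and only the state names and button values come from the table.

```python
from enum import Enum


class SetupStatus(Enum):
    PENDING = "Pending"
    RUNNING = "Running"
    COMPLETE = "Complete"
    FAILED_RETRYABLE = "FailedRetryable"
    FAILED_TERMINAL = "FailedTerminal"


class NextButtonState(Enum):
    HIDDEN = "Hidden"
    VISIBLE_DISABLED = "VisibleDisabled"
    VISIBLE_ENABLED = "VisibleEnabled"


# One row per line of the policy table; Back is enabled in every state.
_NEXT_BY_STATUS = {
    SetupStatus.PENDING: NextButtonState.HIDDEN,
    SetupStatus.RUNNING: NextButtonState.VISIBLE_DISABLED,
    SetupStatus.COMPLETE: NextButtonState.VISIBLE_ENABLED,
    SetupStatus.FAILED_RETRYABLE: NextButtonState.VISIBLE_DISABLED,
    SetupStatus.FAILED_TERMINAL: NextButtonState.VISIBLE_DISABLED,
}


def nav_bar_state(status):
    """Return (next_button_state, back_enabled) for a setup status."""
    return _NEXT_BY_STATUS[status], True
```

Keeping the mapping free of UI types is what makes it testable without a window, which is the stated reason for extracting it.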
Contract extension (minimal):
- OnboardingState gains NextButtonState property (Default/Hidden/VisibleDisabled/VisibleEnabled), SetNextButtonState() setter, and NavBarStateChanged event.
- OnboardingApp consults NextButtonState only when currentRoute == LocalSetupProgress; legacy behavior preserved everywhere else.
- Mapping logic extracted to OnboardingTray.Onboarding.Services.LocalSetupProgressPolicy (no WinUI deps) so it is unit-testable from OpenClaw.Tray.Tests.
Bonus fix: gate the Complete-state 1s auto-advance timer on still being on LocalSetupProgress so an early Next-tap doesn't over-advance a later page.
Tests: Tray 447/447 (+13: 3 OnboardingState NextButtonState/NavBarStateChanged + 10 LocalSetupProgressPolicy mapping cases). Shared 1180/1180. Build PASS.
Screenshots: visual-test-output/next-button-impl-2026-05-04/{s1-running,s2-success,s3-failed-terminal,s4-failed-retryable}/page-02.png — all four states verified visually.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…mint and tray pair (Bug 1 from e2e drive)
Bug 1 surfaced by the 2026-05-04 e2e drive: MintBootstrapToken correctly invokes `openclaw qr --json` and the tray sends the resulting token via `auth.bootstrapToken`, but the upstream gateway treats a fresh bootstrap-token connect as a *pending* operator pairing request and rejects the connect itself with `device-auth-invalid` then `pairing-required reason:not-paired`. The pending request is recorded server-side but never redeemed because nothing approves it.
On a local-loopback gateway the user driving the tray is also the operator/approver, so SettingsOperatorPairingService now drives `openclaw devices approve --latest` through the gateway CLI and retries the bootstrap connect once. New IPendingDeviceApprover seam keeps it injectable (default null preserves remote-gateway behavior); WslGatewayCliPendingDeviceApprover authenticates with the locally-stored `/var/lib/openclaw/gateway-token` (read inside the shell so it never touches argv) and scopes the approval to `LocalGatewayApprover.IsLocalGateway` URLs only.
Tests (10 new, all green): round-trip approve+retry, double-PairingRequired no-loop, approval-failure surfaces error code, remote-gateway opt-out, non-bootstrap-token opt-out, first-connect happy path, plus 4 ParseApproveJson cases.
OPENCLAW_RUN_INTEGRATION=1, OPENCLAW_REPO_ROOT=<worktree>:
- OpenClaw.Shared.Tests: 1180/1180/0/0
- OpenClaw.Tray.Tests: 493/493/0/0 (+10 new)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ryable rendering (Bug 2 from e2e drive)
Bug 2 from Aaron's 2026-05-04 e2e drive: the LocalSetupProgressPage UI stayed
on stage 1 ("Checking system" with spinner) for the entire 12-minute run even
though the LocalGatewaySetupEngine progressed through 9+ phases on the gateway
side and ultimately failed at PairOperator. The page never re-rendered past
the first event and never transitioned to FailedRetryable.
Root cause: reference-equality in UseState. The engine raises StateChanged
with the same mutating LocalGatewaySetupState instance every call. The page's
UseState<LocalGatewaySetupState?> compared previous and next with
EqualityComparer<T>.Default — which for a class without an Equals override
falls through to ReferenceEquals. The first null -> state transition rendered
once; every subsequent state -> state event was identified as "no change" and
the framework swallowed the re-render request.
Fix:
- Introduce a private record RenderSnapshot(Phase, Status, LastRunningPhase,
UserMessage, FailureCode) and store *that* in UseState. Records have value
equality, so each engine event yields a fresh RenderSnapshot whose fields
differ from the previous snapshot, reliably triggering re-renders.
- Capture the snapshot off the dispatcher (before TryEnqueue) so values
reflect the engine's state at the moment the event fired, not whatever
the engine has further mutated to by the time the dispatcher dequeues.
- Thread LastRunningPhase through to the stage-list math: previously the
Failed-state rendering only knew Phase=Failed (the highest enum ordinal)
which lost the position of the last running phase. The new helper consults
History to pin the failure marker on the correct stage.
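The root cause and the snapshot fix can be reproduced in a few lines. This is a toy Python model, not the FunctionalUI code: StateCell is a hypothetical stand-in for a UseState slot that skips re-rendering when the new value equals the old one, and the snapshot carries only two of the five fields for brevity.

```python
from dataclasses import dataclass


class SetupState:
    """Mutable engine state with no equality override (identity comparison)."""

    def __init__(self):
        self.phase = "Preflight"
        self.status = "Running"


@dataclass(frozen=True)
class RenderSnapshot:
    """Immutable copy with value equality, analogous to a C# record."""

    phase: str
    status: str


class StateCell:
    """Hypothetical UseState slot: re-render only when the value changes."""

    def __init__(self):
        self.value = None
        self.renders = 0

    def set(self, new_value):
        if new_value == self.value:
            return False  # treated as "no change"; render is skipped
        self.value = new_value
        self.renders += 1
        return True


engine_state = SetupState()
by_reference = StateCell()
by_snapshot = StateCell()
for phase in ("Preflight", "CreateWslInstance", "PairOperator"):
    engine_state.phase = phase  # the engine mutates the same instance
    by_reference.set(engine_state)
    by_snapshot.set(RenderSnapshot(engine_state.phase, engine_state.status))
```

Storing the mutating instance renders once (the null -> state transition) and then never again, while the snapshot cell renders on every phase change.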
Also extracted the stage-list math from the page into a pure helper
(LocalSetupProgressStageMap) so it is unit-testable without WinUI deps:
- VisibleStages array (now also folds PairOperator + later hidden phases
into the MintToken stage, so a PairOperator failure pins correctly).
- ComputeStageState(stagePhases, currentPhase, currentStatus, lastRunningPhase).
- IndexOfStageForPhase, ShouldShowErrorRow, ShouldShowRetryButton.
Tests added (LocalSetupProgressStageMapTests, +36 net):
- Every running engine phase advances the active stage to the expected index
(15 InlineData rows covering all 15 non-terminal phases).
- NotStarted -> all stages Pending.
- Complete -> all stages Complete.
- Coverage guard: every declared LocalGatewaySetupPhase value is either
terminal or covered by some VisibleStage (locks down future enum drift).
- FailedRetryable @ PairOperator pins failure on the last visible stage
(this is the concrete e2e-drive scenario).
- FailedRetryable @ CreateWslInstance pins failure on stage 1.
- FailedTerminal @ Preflight pins failure on stage 0.
- ShouldShowErrorRow + ShouldShowRetryButton truth tables.
Validation:
- ./tests/OpenClaw.Shared.Tests: 1180 passed, 0 failed (anchor 1180/1180).
- ./tests/OpenClaw.Tray.Tests: 493 passed, 0 failed (was 447/447, +46).
- Env: OPENCLAW_REPO_ROOT=<worktree>, OPENCLAW_RUN_INTEGRATION=1.
- Full ./build.ps1 + screenshot verification BLOCKED in this session by
the running tray app at PID 8240 holding a write-lock on the WinUI
output directory (Mike is examining the broken state per the e2e-drive
guardrail). Visual verification deferred until PID 8240 is released.
Existing OPENCLAW_VISUAL_TEST_LOCAL_SETUP harness exercises the new
retryable/terminal paths via the modified TryReadVisualTestState (which
now seeds StartPhase before Block so LastRunningPhase pins correctly).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…icitGatewayAuth (Bug 1 residual)
Drop --url override from WslGatewayCliPendingDeviceApprover. The CLI runs inside the OpenClawGateway distro where openclaw.json pins gateway.mode=local + port 18789, so buildGatewayConnectionDetails resolves the loopback URL itself. Without --url, ensureExplicitGatewayAuth (src/gateway/call.ts) early-returns and shouldUseLocalPairingFallback becomes available, so the CLI silently falls back to local pairing-file approval if the WS hop trips.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… against CLI v2026.5.3-1 (Bug 1 part 3)
CLI v2026.5.3-1 (src/cli/devices-cli.ts, commit aef38de) makes
`openclaw devices approve --latest --json` PREVIEW-ONLY: when --latest
or no requestId is supplied, the action handler enters the
usingImplicitSelection branch which writes a JSON preview
({ selected, approvalState, approveCommand, requiresAuthFlags }) and
returns BEFORE invoking approvePairingWithFallback. Only an explicit
requestId argument bypasses the preview gate and actually calls
device.pair.approve / mutates paired.json.
The previous fix (3927451) correctly removed the --url override that
tripped ensureExplicitGatewayAuth, but the resulting invocation still
only ran the preview, so the engine saw exit 0, retried the WS connect,
got pairing-required again, and surfaced operator_pending_approval_failed.
WslGatewayCliPendingDeviceApprover.ApproveLatestAsync now runs two stages:
1. Preview: openclaw devices approve --latest --json --token "\"
parses selected.requestId from the v2026.5.3-1 preview JSON.
2. Commit: openclaw devices approve <requestId> --json --token "\"
actually approves and mutates paired.json.
A new no_pending_entries error code distinguishes "stage 1 returned no
selected.requestId" from a real approval failure so the engine does not
infinite-loop. Stage 2 failures surface the underlying stderr. The
requestId returned by stage 1 is validated against a safe charset before
interpolation into the bash -lc commit script.
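The preview-then-commit flow can be sketched as follows. This is a Python illustration under stated assumptions, not the C# approver: the safe charset regex, the unsafe_request_id code, and the run_cli callable are hypothetical, the --token argument is omitted, and the handling of unparseable output is a guess. Only the preview field selected.requestId and the codes no_pending_entries and operator_pending_approval_failed come from the commit messages in this PR.

```python
import json
import re

# Assumed charset; the source only says "a safe charset".
_SAFE_REQUEST_ID = re.compile(r"[A-Za-z0-9_-]{1,128}")


def parse_preview(stdout):
    """Return ("ok", request_id) or (error_code, None) for stage-1 output."""
    try:
        doc = json.loads(stdout)
    except ValueError:
        return "operator_pending_approval_failed", None
    if not isinstance(doc, dict) or doc.get("ok") is False:
        return "operator_pending_approval_failed", None
    selected = doc.get("selected")
    request_id = selected.get("requestId") if isinstance(selected, dict) else None
    if not request_id:
        return "no_pending_entries", None
    if not isinstance(request_id, str) or not _SAFE_REQUEST_ID.fullmatch(request_id):
        return "unsafe_request_id", None  # hypothetical code name
    return "ok", request_id


def approve_latest(run_cli):
    """Stage 1 previews with --latest; stage 2 commits the explicit requestId."""
    code, request_id = parse_preview(run_cli(["devices", "approve", "--latest", "--json"]))
    if code != "ok":
        return code  # stage 2 must not run
    run_cli(["devices", "approve", request_id, "--json"])
    return "ok"
```

Validating the requestId before it reaches a shell script is the defense-in-depth step: the value originates from CLI output, so it is treated as untrusted input.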
Tests (tests/OpenClaw.Tray.Tests/OperatorPairingApprovalTests.cs):
- TwoStage_PreviewThenCommit_Succeeds (argv shape pinned for both stages)
- TwoStage_PreviewEmpty_NoPendingEntries (stage 2 must NOT run)
- TwoStage_CommitFails_SurfacesStructuredFailure (surfaces stderr)
- TwoStage_PreviewReturnsUnsafeRequestId_DoesNotRunCommit (defense in depth)
- ParsePreviewJson_V20265_Shape_ReturnsRequestId
- ParsePreviewJson_Empty_ReturnsNoPendingEntries
- ParsePreviewJson_OkFalse_ReturnsApprovalFailure
Existing DoesNotPassUrlOverride and NonZeroExit tests updated for the
two-stage flow; all prior 12 approval tests remain green.
Validation:
./build.ps1 ok
dotnet test tests/OpenClaw.Tray.Tests --no-restore 502 / 502 passed
dotnet test tests/OpenClaw.Shared.Tests --no-restore 1180 / 1180 passed
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…e stderr in failure (Bug 1 part 4)
Bostick-11 Round-2 Path B drive surfaced a deterministic race: the engine's first `--token`-authenticated call into the in-distro CLI in Phase 12 triggers an internal Linux-operator auto-bootstrap inside the gateway. The bootstrap completes successfully (linux operator entry IS persisted to paired.json) but the CLI process that drove it exits non-zero; a fresh process invocation made hundreds of ms later succeeds because the internal operator is now pre-paired.
Fix:
- WslGatewayCliPendingDeviceApprover.ApproveLatestAsync retries stage 1 once on first failure with a 750ms backoff (configurable; tests use TimeSpan.Zero).
- On final stage-1 failure, both attempts' stderr (each truncated to 1 KB) are surfaced in PendingDeviceApprovalResult.ErrorMessage so future regressions are diagnosable from setup-state.json without digging tray.log.
Tests added/updated:
- Stage1FailsThenSucceeds_OverallSuccess (retry path)
- Stage1FailsTwice_SurfacesBothStderrs (structured failure with stderr)
- TruncateStderr_RespectsCap_AndAppendsTruncationMarker
- Existing NonZeroExit_SurfacesStructuredFailureCode updated to assert stderr surfacing
Validation: build.ps1 green; Tray tests 505/505 passed; Shared tests 1158 passed + 22 skipped = 1180 baseline.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…; surface stdout (Bug 1 part 5)
Bostick-11 Round-3 (commit 05f7be0) proved the part-4 retry IS firing but BOTH stage-1 attempts still exit non-zero with EMPTY stderr in the engine's invocation context. The IDENTICAL script run manually via wsl -- bash -lc <script> from PowerShell against the engine's exact post-failure gateway state returns exit 0 with valid 1054-byte preview JSON.
Leading hypothesis (Bostick): the embedded \"\\" shell substitution gets mangled when .NET ProcessStartInfo.ArgumentList encoding forwards the script through wsl.exe to bash -lc: the embedded double-quotes interact badly with .NET's MSVCRT-style escaping and/or wsl.exe's argv re-encoding, leaving bash with an empty/malformed --token argument and causing the CLI to silently exit non-zero.
Fix: read the gateway token via a SEPARATE wsl ... cat /var/lib/openclaw/gateway-token call, capture in C#, then interpolate as a single-quoted shell literal into the approve script. The script body now contains NO \ substitution and NO \" characters at all, so there's nothing for .NET / wsl.exe argv encoding to mangle.
Diagnosability (belt-and-suspenders): also surface STDOUT (paired with stderr) for both stage-1 attempts and stage-2 failures. If some other invocation-context issue is still at play, the next regression is observable from setup-state.json alone; a CLI that writes JSON-mode errors to stdout (with empty stderr, exactly what Round 3 observed) is no longer invisible.
Token safety: reject tokens containing single quotes / newlines / control chars before interpolation. Token-read failures (file missing, empty, unreadable) surface as operator_pending_approval_failed with a 'token-read stage' prefix.
Tests: 511/511 Tray (505 baseline + 6 new: token-read fail, token-empty, unsafe token chars, stage-1 stdout surfaced, stage-2 stdout surfaced, no-\/no-\" script body invariant). 1158/1158 Shared (22 skipped, baseline). build.ps1 green.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
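The token-safety rule above (reject single quotes, newlines, and control characters, then wrap in single quotes) might look like this. A minimal Python sketch; the function name and the exact set of rejected code points are assumptions, and the real implementation is C#.

```python
def single_quote_literal(token):
    """Return token wrapped as a single-quoted shell literal, or raise.

    Inside single quotes a POSIX shell performs no expansion, so the only
    character that could end the literal early is the single quote itself.
    Control characters (including newline) are rejected as well.
    """
    if not token:
        raise ValueError("token is empty")
    for ch in token:
        if ch == "'" or ord(ch) < 0x20 or ord(ch) == 0x7F:
            raise ValueError("token contains an unsafe character")
    return "'" + token + "'"
```

Rejecting rather than escaping keeps the script body free of backslashes and double quotes, which is the invariant the commit message pins with a test.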
… exit code (Bug 1 final)
CLI v2026.5.3-1 's `devices approve --latest --json` returns exit code 1 deterministically in preview mode even on the happy path; the JSON payload on stdout (with `selected.requestId`, `approveCommand`, `requiresAuthFlags`) IS the success signal. Bostick-11 Round-4 captured this via Aaron-20's stdout surfacing and verified manual stage-2 with the captured requestId mutates `paired.json` correctly.
Invert the gate in `WslGatewayCliPendingDeviceApprover.ApproveLatestAsync`: parse the stdout JSON FIRST and treat a parseable preview shape as stage-1 success regardless of exit code. Exit-non-zero only triggers the structured `BuildStage1Failure` path when there is no usable preview to extract.
Also short-circuit the 750ms retry in `RunStage1WithRetryAsync` when attempt 1 returns parseable preview JSON, so the common success path no longer burns the retry delay on every pair.
All prior parts retained: token pre-read + single-quoted shell literal (part 5), retry on stage-1 failure (part 4), two-stage flow (part 3), IsSafeRequestId guard (part 3), --url drop (part 2), stdout/stderr/exit surfacing in failure messages (part 5).
Tests: +5 new in OperatorPairingApprovalTests (exit-1+valid JSON success, exit-0+valid JSON success, exit-1+empty stdout failure, exit-1+malformed JSON failure, exit-1+valid JSON skips retry). Tray 516/516, Shared 1180/1180.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
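The inverted gate combined with the part-4 retry can be sketched as below. This is a Python approximation, not RunStage1WithRetryAsync itself: run_once is a hypothetical callable returning (exit_code, stdout, stderr), the truncation marker text is an assumption, and the result dictionary shape is invented for illustration.

```python
import json
import time

OUTPUT_CAP = 1024  # "each truncated to 1 KB" per part 4


def truncate(text, cap=OUTPUT_CAP, marker="...[truncated]"):
    # Marker text is an assumption; the source only says a marker is appended.
    return text if len(text) <= cap else text[:cap] + marker


def extract_request_id(stdout):
    """Return selected.requestId from preview JSON, or None if unusable."""
    try:
        doc = json.loads(stdout)
    except (TypeError, ValueError):
        return None
    selected = doc.get("selected") if isinstance(doc, dict) else None
    request_id = selected.get("requestId") if isinstance(selected, dict) else None
    return request_id if isinstance(request_id, str) and request_id else None


def run_stage1_with_retry(run_once, backoff_seconds=0.75, sleep=time.sleep):
    """Parse stdout first; the exit code matters only when no preview exists."""
    failures = []
    for attempt in (1, 2):
        exit_code, stdout, stderr = run_once()
        request_id = extract_request_id(stdout)
        if request_id is not None:
            # A usable preview is success regardless of exit code, so no retry.
            return {"ok": True, "request_id": request_id, "attempts": attempt}
        failures.append(
            {"exit": exit_code, "stdout": truncate(stdout), "stderr": truncate(stderr)}
        )
        if attempt == 1:
            sleep(backoff_seconds)
    return {"ok": False, "failures": failures}
```

Injecting sleep mirrors the commit's note that tests run with a zero backoff.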
…airing (Bug 3)
Phase 12 (Bug 1) is GREEN, which unmasked Bug 3 at Phase 14 (PairWindowsTrayNode). On a fresh local-loopback gateway the node-role connect arrives as reason=role-upgrade, isRepair=true; the gateway parks it on the pending list and the connect times out with windows_node_pairing_failed. There is no auto-approve handler upstream for this path.
Mirror the Phase-12 fix: SettingsWindowsTrayNodeProvisioner now takes an optional IPendingDeviceApprover and, when the first connect fails on a local gateway, drives openclaw devices approve --latest via the same WslGatewayCliPendingDeviceApprover and retries the connect once. Approver failures surface their own structured error code; remote gateways and provisioners with no approver wired keep the legacy windows_node_pairing_failed surface.
Tests: 8 new in WindowsTrayNodePairingApprovalTests.cs covering happy path, approver failure, no-pending-entries, retry-after-approve-still-fails, remote-gateway no-approve, first-connect-success no-approve, no-approver legacy passthrough, and OperationCanceled passthrough. Tray 524/524, Shared 1180/1180.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ayClient (Bug #1 from manual test)
OnboardingState.GetPageOrder() previously stripped the Wizard hop whenever Settings.EnableNodeMode==true. SettingsWindowsTrayNodeProvisioner.PairAsync flips that flag mid-onboarding (LocalGatewaySetup.cs:2147) for the loopback role-upgrade, so the Local easy-setup auto-advance after Phase 16 landed on Permissions, not Wizard.
Carve out an explicit SetupPath.Local exception: Local easy-setup pairs the tray as both operator (Phase 12) and node (Phase 14), so it still has operator credentials for wizard.start. Only explicit Advanced + remote-node deployments skip Wizard.
Sister fix: at LocalSetupProgress completion, eagerly (re)initialize the persistent App.GatewayClient and copy the reference into Props.GatewayClient so WizardPage's poll (App.GatewayClient ?? Props.GatewayClient) finds a connected client instead of timing out into 'offline'.
Tests: added GetPageOrder_LocalPath_NodeMode_KeepsWizardAndChat (replaces SkipsWizard expectation), GetPageOrder_LocalPath_NodeMode_NoChat_KeepsWizard, and NextRouteAfterLocalSetupProgress_LocalNodeMode_IsWizard (the integration assertion RubberDucky asked for: pageIndex+1 lookup proves auto-advance lands on Wizard, not Permissions).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…(Bug #2 from manual test)
On the Local easy-setup path, the loopback gateway parks the Phase 14 node-role connect as PairingStatus.Pending for ~100ms before SettingsWindowsTrayNodeProvisioner's pending-approver auto-approves it. App.OnPairingStatusChanged showed the 'copy pairing command' toast for that transient blip even though the user never needed to copy anything.
Engine: add LocalGatewaySetupEngine.IsAutoPairingWindowsNode bracketed exactly around the Phase 14 _windowsTrayNode.PairAsync call (LocalGatewaySetup.cs:2401), via try/finally with Interlocked.Exchange. Phase 12 PairOperator and the rest of RunLocalOnlyAsync are NOT bracketed; scope is exactly the node-role autopair window per RubberDucky's closure condition.
App: cache the engine in App._localSetupEngine when CreateLocalGatewaySetupEngine is invoked (App.xaml.cs:62). OnPairingStatusChanged Pending branch now consults LocalGatewaySetupEngine.ShouldSuppressPairingPendingNotification(_localSetupEngine, status), a pure static decision helper that returns true only for (autopair-on, Pending). Paired/Rejected confirmations and the manual ConnectionPage path (which calls App.ShowPairingPendingNotification directly, bypassing OnPairingStatusChanged) are unaffected.
Tests (LocalGatewaySetupAutoPairFlagTests, +12 cases): IsAutoPairingWindowsNode toggles only during Phase 14 (asserts false during Phase 12 callback, true during Phase 14 callback, false after run); flag resets even if Phase 14 throws; ShouldSuppressPairingPendingNotification theory covers all 6 (autopair × status) combos plus null-engine; all assert that Paired/Rejected and out-of-scope Pending pass through. Also exposed shared helpers in LocalGatewaySetupTests as internal nested classes for reuse.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
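The suppression decision described above reduces to a small truth table. A Python sketch of that predicate follows; the attribute name is_auto_pairing_windows_node and the use of plain strings for statuses are assumptions made for illustration, and the real helper is a C# static method.

```python
from types import SimpleNamespace


def should_suppress_pairing_pending(engine, status):
    """True only while the node-role autopair window is open AND status is Pending."""
    if engine is None:
        return False
    return bool(engine.is_auto_pairing_windows_node) and status == "Pending"


# Hypothetical engines standing in for LocalGatewaySetupEngine instances.
autopair_on = SimpleNamespace(is_auto_pairing_windows_node=True)
autopair_off = SimpleNamespace(is_auto_pairing_windows_node=False)
```

Because the helper takes the engine and status as plain arguments, every (autopair × status) combination, including the null-engine case, can be asserted without running a setup.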
…ot clipboard toast (Bug #3 from manual test)
QuickSendDialog used to capture App._gatewayClient at constructor time into a readonly field. After autopair / SSH tunnel restart / manual ConnectionPage reinit / onboarding completion swapped the App's gateway client, the dialog kept sending into the stale (still-unpaired) instance, tripped its NOT_PAIRED catch, and copied the pair-command remediation to the user's clipboard against a perfectly paired live client.
Fix:
- QuickSendDialog ctor now takes Func<OpenClawGatewayClient?>; App.ShowQuickSend passes () => _gatewayClient.
- New QuickSendCoordinator (pure, UI-free) resolves the live client on every Send via the provider, with explicit null/disposed/swap-window contract:
  null => GatewayInitializing (NO clipboard toast)
  disposed mid-send => Failed (NO clipboard toast)
  live + NOT_PAIRED => PairingRequired (clipboard toast STILL fires, built from the live current client)
- Coordinator retries the provider once after a short delay to absorb the dispose-then-reassign window in App.RestartSshTunnel and the onboarding completion callback.
Tests: 15 new QuickSendCoordinatorTests covering stale snapshot, reused dialog after swap, null provider, retry-after-null, ObjectDisposed mid-send, genuine NOT_PAIRED regression guard (clipboard STILL fires, from live client), missing scope, SSH tunnel restart, ConnectionPage reinit, and an autopair end-to-end resolver-contract simulation.
Tray tests: 551 / 551 passing (baseline 536; +15 new). Shared tests: 1158 / 1180 (22 skipped, baseline).
Scope: QuickSend ONLY per RubberDucky closure condition #1. The OnboardingState.GatewayClient sister-bug is filed as a follow-up at .squad/decisions/inbox/aaron-bug3-onboardingstate-followup.md.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
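The provider-based contract can be illustrated with a short Python sketch. It is not the C# QuickSendCoordinator: the exception classes, the "Sent" outcome name, and the 0.2s retry delay are hypothetical, while the three outcomes GatewayInitializing, Failed, and PairingRequired come from the commit message.

```python
import time


class NotPairedError(Exception):
    """Hypothetical stand-in for the gateway's NOT_PAIRED error."""


class ClientDisposedError(Exception):
    """Hypothetical stand-in for ObjectDisposedException."""


def quick_send(provider, message, retry_delay=0.2, sleep=time.sleep):
    """Resolve the live client on every send instead of caching it."""
    client = provider()
    if client is None:
        # Absorb a dispose-then-reassign window with one delayed retry.
        sleep(retry_delay)
        client = provider()
    if client is None:
        return "GatewayInitializing"  # no clipboard toast
    try:
        client.send(message)
    except ClientDisposedError:
        return "Failed"  # client was swapped mid-send; no clipboard toast
    except NotPairedError:
        return "PairingRequired"  # genuine case; remediation still shown
    return "Sent"
```

The key design point is that the caller passes a function returning the current client, so a dialog created before a client swap never holds a stale reference.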
…strapToken -> DeviceIdentity (Bug #4 from manual test)
App.InitializeGatewayClient previously bailed when settings.Token was empty, leaving App.GatewayClient null after Phase 12 even though OpenClawGatewayClient already treats DeviceIdentity DeviceToken as first-class auth. Wizard then polled the null singleton for 30s and surfaced 'offline'. Mirrors prototype resolver shape (openclaw-windows-node App.xaml.cs:1244-1298).
Resolution order: settings.Token (non-bootstrap) -> settings.BootstrapToken (bootstrap) -> DeviceIdentity DeviceToken (non-bootstrap) -> null (logged skip).
Logic extracted into GatewayCredentialResolver static class so Tray.Tests covers all branches (8 new tests) without booting WinUI. Producer-side promotion and QR dual-token harvest stay as separate follow-ups.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
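The resolution order is a first-match chain. A minimal Python sketch, assuming empty strings and None both count as "not set"; the Credential tuple and the function name are hypothetical and only the order and the bootstrap flag come from the commit message.

```python
from typing import NamedTuple, Optional


class Credential(NamedTuple):
    value: str
    is_bootstrap: bool
    source: str


def resolve_credential(token, bootstrap_token, device_token) -> Optional[Credential]:
    """settings.Token -> settings.BootstrapToken -> DeviceIdentity DeviceToken -> None."""
    if token:
        return Credential(token, False, "settings.Token")
    if bootstrap_token:
        return Credential(bootstrap_token, True, "settings.BootstrapToken")
    if device_token:
        return Credential(device_token, False, "DeviceIdentity.DeviceToken")
    return None  # caller logs the skip and leaves the client uninitialized
```

For example, resolve_credential("", "", "dev-token") yields the device token as a non-bootstrap credential, which is the post-Phase-12 case the fix targets.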
…dvance (Bug #5 instrumentation)
Diagnostics-only per RubberDucky-6 conditional approval. Logs the Complete-branch schedule, Task.Delay completion, dispatcher.TryEnqueue result, dispatched lambda entry, guard pass/skip, RequestAdvance call, OnboardingState subscriber count + invocation return, OnboardingApp handler entry with current route+pageIndex, GoNext advancement vs no-op, WizardPage mount + gateway-client polling, pre-wizard.start send, and OpenClawGatewayClient.SendWizardRequestAsync frame send. No structural changes.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…aw#5)
EffectHookState.Dependencies defaulted to [], so on first mount the guard DependenciesChanged([], []) returned false and the effect was silently skipped. This broke any UseEffect declared with Array.Empty<object>() (or omitted deps via the params overload).
Two surgical edits in FunctionalUI.cs:
- Make Dependencies nullable; null = "never scheduled" sentinel.
- Guard the early-return on Dependencies is not null.
Affected call sites (now correctly running on mount):
- WizardPage StartWizard mount effect (sends wizard.start); root of the "Configuring Gateway" wizard hang Mike saw.
- PermissionsPage permission-state subscription effect.
Adds OpenClawTray.FunctionalUI.Tests with regression coverage for explicit empty deps, omitted deps, stable deps, and changing deps.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
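The sentinel fix can be modeled in a few lines. This Python sketch contrasts the two initial values; EffectSlot and should_run_effect are hypothetical names and the comparison logic is a simplification of DependenciesChanged.

```python
class EffectSlot:
    """Per-effect bookkeeping; `dependencies` holds the last scheduled deps."""

    def __init__(self, initial):
        self.dependencies = initial


def should_run_effect(slot, deps):
    """Run on first mount, then only when the dependency list changes."""
    deps = list(deps)
    # None means "never scheduled", so the first call always runs.
    if slot.dependencies is not None and slot.dependencies == deps:
        return False
    slot.dependencies = deps
    return True


buggy = EffectSlot(initial=[])    # old default: [] equals empty deps on mount
fixed = EffectSlot(initial=None)  # new default: nullable sentinel

buggy_first_mount = should_run_effect(buggy, [])
fixed_first_mount = should_run_effect(fixed, [])
fixed_second_render = should_run_effect(fixed, [])
```

With the old default the mount-only effect never fires because [] compared to [] reports no change; with the sentinel it fires once on mount and is then skipped on later renders.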
…pprove (Bug #6)
Local easy-button autopair was using QR/bootstrap operator scopes (operator.read/write/talk.secrets/approvals, no admin), so wizard.start failed at the gateway with "missing scope: operator.admin". Mobile clients intentionally never call wizard.start; Windows easy-button does (must configure AI provider).
Solution: when on local loopback (gateway we just installed), use the standard admin pair flow:
1. Connect with full s_operatorScopes (incl. operator.admin), only when loopback AND not bootstrap.
2. Parse the requestId from the gateway's connect-error details (structured per connect-error-details.ts schema).
3. Approve EXACTLY that requestId via openclaw devices approve <id> inside the WSL distro; no --latest stage, no race window.
4. Reconnect; admin token persists via existing StoreDeviceTokenWithScopes.
Fail closed: missing/malformed requestId → no auto-approve. We install the gateway, so version-skew has no realistic compat cost. Loopback predicate moved to OpenClaw.Shared; Tray delegates. Bootstrap + role-upgrade callers continue to use ApproveLatestAsync (--latest) unchanged.
Out of scope: scope arrays unchanged; QR bootstrap unchanged; mobile unchanged; remote/non-loopback gateways NEVER auto-admin.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… (Bug #6 root cause)
The Bug #6 Option B admin-scope code path requires settings.Token (not BootstrapToken) to be populated so SettingsOperatorPairingService.ResolveCredential returns IsBootstrapToken=false. Local easy-button setup never persisted Token to settings; bash generated it in WSL and only the bootstrap token went to Windows settings.
C-refactored: C# now owns the shared gateway token from generation through persistence. New SettingsSharedGatewayTokenProvisioner mirrors the existing SettingsBootstrapTokenProvisioner shape. Bash receives the token via env var (WSLENV-forwarded) and writes it to /var/lib/openclaw/gateway-token. After bash succeeds, settings.Token is persisted via the existing Settings.Save() pattern.
Hybrid idempotency: if WSL already has a token (preserved across tray reinstall scenarios), C# reads it back and persists. Otherwise generates fresh via RandomNumberGenerator.GetBytes(32). Either path leaves settings.Token canonical without rotating gateway-side credentials of already-paired clients.
Atomicity: settings.Token is persisted AFTER the bash gateway-config write succeeds. Bash failure leaves settings unchanged.
Closes the Bug #6 Option B predicate gap. Once settings.Token is populated, the existing GetRequestedScopes admin branch fires on local-loopback fresh operator pairing, the explicit-requestId approver mints admin-scoped device tokens, and wizard.start succeeds. Dashboard/chat also work because they already read settings.Token.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
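The hybrid-idempotency and atomicity rules can be sketched as follows. This is a Python approximation with several assumptions: the two callables stand in for the WSL read and the bash write, settings is a plain dict, and the hex encoding of the 32 random bytes is a guess (the source specifies only RandomNumberGenerator.GetBytes(32), not the text encoding).

```python
import secrets


def provision_shared_token(read_gateway_token, write_gateway_token, settings):
    """Reuse an existing gateway token if present; otherwise generate one.

    settings["Token"] is assigned only after the gateway-side write succeeds,
    so a failed write leaves settings untouched.
    """
    existing = read_gateway_token()
    if existing:
        # Preserved across reinstall: adopt it without rotating credentials.
        settings["Token"] = existing
        return existing
    token = secrets.token_hex(32)  # 32 random bytes; hex encoding is assumed
    write_gateway_token(token)     # expected to raise on failure
    settings["Token"] = token
    return token
```

Ordering the settings write last is what gives the "bash failure leaves settings unchanged" property without needing a rollback step.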
When the gateway restarts mid-wizard (e.g. installing a channel like BlueBubbles emits "gateway restarting"), the in-flight wizard.next response never arrives, the in-memory wizard session is lost on the gateway side, and the tray UI hangs for 30s before retry hits "wizard not found" forever.
Two coupled fixes mirror the macOS canonical recovery pattern at OnboardingWizard.swift:177-190:
1. OpenClawGatewayClient.ClearPendingRequests now completes pending wizard TaskCompletionSource entries with OperationCanceledException on socket close, matching the existing chat-send cleanup pattern. Eliminates the 30s apparent hang.
2. WizardPage gets session-lost recovery: on connection-lost, "wizard not found", or "wizard not running", clear stale session state and re-invoke wizard.start exactly once per lost session.
The "already restarted" guard is a mutable reference object (not UseState<bool>) to prevent the team's recurring stale-closure anti-pattern. Successful recovery resets the guard so a second independent disconnect can also recover (matches macOS applyStartResult reset semantic).
TimeoutException recovery is narrowed: only triggers if the client is disconnected at timeout fire time. Slow-but-connected steps surface a Retry UI instead of forcing restart.
Recovery failure presents a real "Restart wizard" action that clears WizardSessionId, WizardStepPayload, and the guard.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…tingly)
RubberDucky review of d1cfbcf found that SetRecoveryFailureError reset the once-only recovery guard before the user clicked Restart, reopening the stale-closure double-start hazard that Closure openclaw#1 was meant to prevent.
Fix: only reset the guard on
- successful recovery (ApplyStep start-shape, unchanged)
- explicit user-initiated Restart wizard action via RestartWizardAsync, unchanged
After automatic recovery failure, the guard stays set until the user clicks Restart. Stale closures from the same lost session arriving after the failure UI is shown observe the set guard and return AlreadyAttempted instead of launching a second automatic wizard.start.
Adds regression test RecoveryFailureFollowedByStaleClosure_DoesNotStartAgain_BeforeUserRestart that fails on d1cfbcf and passes after this fix.
Reviewer rejection lockout: Mattingly authored this revision because Aaron was locked out per Strict Lockout protocol after the d1cfbcf rejection.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
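The guard semantics this commit settles on can be sketched as a small mutable latch. This is a Python illustration, not the project's C#; the class and method names are hypothetical, and only the `AlreadyAttempted` outcome is named in the commit message.

```python
from enum import Enum


class RecoveryResult(Enum):
    STARTED = "started"
    ALREADY_ATTEMPTED = "already_attempted"


class RecoveryGuard:
    """Mutable once-only latch shared by every closure of one lost session."""

    def __init__(self) -> None:
        self.attempted = False

    def try_begin(self) -> RecoveryResult:
        if self.attempted:
            return RecoveryResult.ALREADY_ATTEMPTED
        self.attempted = True
        return RecoveryResult.STARTED

    def on_recovery_succeeded(self) -> None:
        # A later, independent disconnect may recover again.
        self.attempted = False

    def on_recovery_failed(self) -> None:
        # Deliberately no reset: stale closures must still see the latch.
        pass

    def on_user_restart(self) -> None:
        # Only an explicit user action re-arms recovery after a failure.
        self.attempted = False
```

Because every closure holds a reference to the same object, a closure created before the failure still observes the set latch, which is the property the regression test checks.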
… on real input Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rtifact
Fix openclaw#1 - validation script env-var alignment (Scott's PR openclaw#274 backlog item):
- scripts\validate-wsl-gateway.ps1 was setting OPENCLAW_TRAY_APPDATA_DIR and OPENCLAW_TRAY_LOCALAPPDATA_DIR for isolation, but SettingsManager reads OPENCLAW_TRAY_DATA_DIR. Result: validation runs touched real %APPDATA%\OpenClawTray\settings.json, causing operator-auth contamination in earlier test rounds.
- Set OPENCLAW_TRAY_DATA_DIR alongside the existing isolation env vars.
- Document the canonical env-var contract in docs\wsl-owner-validation.md.
- Add SettingsManagerIsolationTest asserting OPENCLAW_TRAY_DATA_DIR redirects writes away from real %APPDATA%.
Fix openclaw#2 - remove .squad\decisions\inbox\aaron-uninstall-plan.md from PR diff:
- The file is an agent planning artifact unrelated to the WSL gateway port and was flagged in RubberDucky's adversarial review.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
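The isolation bug comes down to which environment variable the settings path is resolved from. A minimal Python sketch of that resolution follows; `resolve_settings_path` is a hypothetical helper, and placing settings.json directly inside the override directory is an assumption, not something the commit message states.

```python
from pathlib import PureWindowsPath


def resolve_settings_path(env: dict) -> PureWindowsPath:
    """Pick the settings file location, honouring the isolation override."""
    override = env.get("OPENCLAW_TRAY_DATA_DIR")
    if override:
        # Isolated run: nothing under the real %APPDATA% is touched.
        return PureWindowsPath(override) / "settings.json"
    # No override: fall back to the real per-user location.
    return PureWindowsPath(env["APPDATA"]) / "OpenClawTray" / "settings.json"
```

A script that sets only OPENCLAW_TRAY_APPDATA_DIR would fall through to the second branch here, which matches the contamination the commit describes.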
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ery investigation Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Symptoms fixed:
- Symptom 1+2 (radio flash + two-click): Replace render-time WizardStepPayload
re-parse with stable optionLabels/optionValues arrays from UseState. The old
code called labels.ToArray()/values.ToArray() on every render, producing a new
string[] reference each time. FunctionalUI's ConfigureRadioButtons assigned
this new reference to control.ItemsSource unconditionally, triggering WinUI3's
UpdateItemsSource() -> Select(-1) -> selection cleared -> SelectedIndex
reapplied cycle. Now the same array reference is reused across re-renders so
WinUI3 detects no change and skips the reset.
- Symptom 3 (loopback after channels): Replace recovery lambda's wizard.start
call with TryResumeWithSessionAsync, which first attempts wizard.next({sessionId})
with no answer. If the gateway session is still alive, the current step is
returned immediately (session.next() returns currentStep directly) and the user
resumes where they left off. Falls back to wizard.start on session-not-found,
TimeoutException, or any unexpected error (RD improvement openclaw#1).
- Pending-submission tracking: setPendingSubmission({stepId, stepInput}) before
the wizard.next send; cleared on success. On Scenario B resume (connection
dropped before answer reached gateway, same step returned), the pending answer
is restored into stepInput so the user sees their prior selection.
- Fix broken wizard.status fallback: The 'already running' catch block called
wizard.status with no params (fails upstream schema validation). Replaced with
wizard.next({sessionId}) using the saved session ID, matching the skip-path
pattern already proven at WizardPage.cs:461.
- Add TryResumeWithSessionAsync to WizardFlowController with 5 new tests:
session-alive (calls next not start), session-not-found (fallback),
no-session-id (fallback), TimeoutException (fallback), disconnected (fallback).
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
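The resume-before-restart behaviour of TryResumeWithSessionAsync described in the commit above can be sketched as follows. This is a synchronous Python illustration, not the project's async C#; `FakeGateway`, `SessionNotFound`, and the step names are hypothetical.

```python
class SessionNotFound(Exception):
    """Raised by the fake gateway when the wizard session no longer exists."""


class FakeGateway:
    """Hypothetical stand-in for the gateway RPC client."""

    def __init__(self, alive_sessions):
        self.alive = set(alive_sessions)
        self.calls = []

    def call(self, method, params):
        self.calls.append(method)
        if method == "wizard.next":
            if params["sessionId"] not in self.alive:
                raise SessionNotFound("wizard not found")
            return {"step": "channels"}  # the session's current step
        return {"step": "welcome"}  # wizard.start begins a new session


def try_resume_with_session(gateway, session_id, is_connected):
    """Prefer wizard.next with no answer; fall back to wizard.start."""
    if session_id and is_connected:
        try:
            return "resumed", gateway.call("wizard.next", {"sessionId": session_id})
        except Exception:
            # Session not found, timeout, or anything unexpected: start over.
            pass
    return "restarted", gateway.call("wizard.start", {})
```

With a live session the user lands back on the step they left; in every other case the flow degrades to a fresh wizard.start.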
… step-0 loopback
Recovery fired with connected=False immediately after disconnect; TryResumeWithSessionAsync checked IsConnectedToGateway == true and skipped wizard.next, falling through to wizard.start which created a new session at step 0.
Fix: add WaitForConnectionAsync (injectable delay for testability) to WizardFlowController and call it in the recovery lambda in WizardPage before TryResumeWithSessionAsync. After the wait, IsConnectedToGateway is true so wizard.next(sessionId) is attempted first, preserving the live gateway session and returning the user to the channels step instead of step 0.
New tests: WaitForConnectionAsync_WhenAlreadyConnected, WhenReconnectsAfterTwoPolls, WhenTimesOut. These verify the reconnect-wait contract and that wizard.next is called post-reconnect.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
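A reconnect wait with an injectable delay, in the spirit of WaitForConnectionAsync, might look like the sketch below. It is a Python illustration and not the project's C#; the 30-second budget comes from the PR description, while the 0.5-second poll interval and all parameter names are assumptions.

```python
import asyncio
from typing import Awaitable, Callable


async def wait_for_connection(
    is_connected: Callable[[], bool],
    timeout_s: float = 30.0,
    poll_interval_s: float = 0.5,  # assumed interval
    delay: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> bool:
    """Poll until connected or the time budget is spent."""
    waited = 0.0
    while True:
        if is_connected():
            return True
        if waited >= timeout_s:
            return False
        # Tests inject a fake delay so no real time passes.
        await delay(poll_interval_s)
        waited += poll_interval_s
```

Injecting `delay` lets a test drive the "reconnects after two polls" and "times out" cases deterministically.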
Adds a canonical PowerShell script and Squad skill capturing the 6-step
dev-loop pattern used repeatedly during wsl-gateway-clean development:
kill OpenClaw* by PID -> backup/wipe state dirs -> optionally unregister
WSL distro -> dotnet build x64 -> optionally launch with visual capture.
Script: scripts/dev-reset-rebuild-launch.ps1
- Supports -WipeWslDistro, -CaptureDir, -SkipBuild, -DontLaunch,
-WorktreePath, -NoBackup, -Verbose, -WhatIf
- Kills processes by PID only (Stop-Process -Name is forbidden)
- Uses wsl bash -c for WSL file ops (never wsl-dollar paths)
- Idempotent; exits non-zero on hard failures
Skill: .squad/skills/dev-reset-rebuild-loop/SKILL.md
- Routes agents to the canonical script instead of hand-rolling commands
- Documents anti-patterns: name-based kills, wsl-dollar paths, wrong ordering
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…detection (PR openclaw#274 must-fix openclaw#6)
- Add OnboardingExistingConfigGuard service: detects existing token, bootstrap token, non-default gateway URL, device identity tokens, and completed/running setup-state.json (sync); async variant probes WSL distro list.
- Gate openclaw#1 (SetupWarningPage): warn-and-confirm inline section replaces 'Set up locally' button when existing config detected. Shows immediately on page load per Mike's directive. Lists specifically what would be lost (token, device pairing, gateway URL). 'Replace my setup' sets ReplaceExistingConfigurationConfirmed.
- Gate openclaw#2 (OnboardingWindow): default SetupPath=Advanced when existing config detected so returning users land on SetupWarning with Next button enabled (→ Connection page), not the local setup path.
- Gate openclaw#3 (LocalSetupProgressPage): defense-in-depth guard before engine construction blocks any env-override / deep-link path that bypasses SetupWarningPage. Error code: existing_config_gate.
- Gate openclaw#4 (LocalGatewaySetupEngineFactory): fail-closed check in CreateLocalOnly; throws InvalidOperationException when settings.Token exists and replaceExistingConfigurationConfirmed=false. Default is always strict.
- Conditional menu label (Mike refinement): tray flyout shows 'Reconfigure...' when existing config detected, 'Setup Guide...' otherwise.
- Add 5 localization keys in all 5 locales: Onboarding_SetupWarning_ReplaceHeading/Body/Confirm/Cancel, Menu_Reconfigure.
- Add 13 new tests: OnboardingExistingConfigGuardTests (8), SetupWarningGuardPolicyTests (2), LocalSetupProgressGuardTests (2), OnboardingStateTests +1, LocalGatewaySetupTests +2.
Tray tests: 627/627 pass.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
….gitignore
.squad/ contains internal agent planning docs, skill definitions, and coordination reports that were unintentionally committed in this PR. These files remain on disk but are no longer tracked by git. Going forward, any changes to .squad/ will be ignored by git.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rd + add CancellationToken to WaitForConnectionAsync
H1: Replace narrow settings.Token check in LocalGatewaySetupEngineFactory.CreateLocalOnly with OnboardingExistingConfigGuard.HasExistingConfiguration(), which covers all 6 sync predicates (Token, BootstrapToken, GatewayUrl, operator DeviceToken, node DeviceToken, setup-state phase). The 7th predicate (WSL distro probe) is intentionally excluded because the factory is a synchronous constructor path; the page-level gate runs the full async check. New optional identityDataPath and setupStatePath parameters allow test injection.
H2: Add CancellationToken parameter to WizardFlowController.WaitForConnectionAsync so the 30-second polling loop can be aborted (e.g., on app shutdown or page navigation). Default Task.Delay receives the token; injected delayAsync delegates are guarded by ThrowIfCancellationRequested at the top of each iteration. Caller in WizardPage.cs updated with default + TODO comment for future _disposalCts wiring.
Tests added:
- CreateLocalOnly_ThrowsInvalidOperation_WhenBootstrapTokenExistsAndNotConfirmed
- CreateLocalOnly_ThrowsInvalidOperation_WhenNonDefaultGatewayUrlAndNotConfirmed
- CreateLocalOnly_ThrowsInvalidOperation_WhenOperatorDeviceTokenExistsAndNotConfirmed
- CreateLocalOnly_ThrowsInvalidOperation_WhenNodeDeviceTokenExistsAndNotConfirmed
- CreateLocalOnly_ThrowsInvalidOperation_WhenActiveSetupStateAndNotConfirmed
- WaitForConnectionAsync_WhenCancelledDuringPolling_ThrowsOperationCanceledException
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
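The fail-closed factory check over the six synchronous predicates can be sketched as below. This is a Python illustration, not the project's C#; `ConfigSnapshot`, the placeholder default URL, and the error text are hypothetical, and the WSL distro probe is left out just as H1 describes.

```python
from dataclasses import dataclass

# Placeholder only; the real default gateway URL is not given in this PR text.
DEFAULT_GATEWAY_URL = "<default-gateway-url>"


@dataclass
class ConfigSnapshot:
    token: str = ""
    bootstrap_token: str = ""
    gateway_url: str = DEFAULT_GATEWAY_URL
    operator_device_token: str = ""
    node_device_token: str = ""
    setup_phase: str = ""  # e.g. "completed" or "running"


def has_existing_configuration(c: ConfigSnapshot) -> bool:
    """True if any of the six synchronous signals is present."""
    return any([
        bool(c.token),
        bool(c.bootstrap_token),
        c.gateway_url != DEFAULT_GATEWAY_URL,
        bool(c.operator_device_token),
        bool(c.node_device_token),
        c.setup_phase in ("completed", "running"),
    ])


def create_local_only(c: ConfigSnapshot, replace_confirmed: bool = False) -> str:
    """Fail closed: refuse to build the engine over an existing setup."""
    if has_existing_configuration(c) and not replace_confirmed:
        raise RuntimeError("existing configuration detected; confirmation required")
    return "engine"  # stands in for the constructed setup engine
```

Defaulting `replace_confirmed` to False keeps every caller strict unless it opts in explicitly.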
Resolves 12 conflicts: 9 trivial + 3 semantic per Aaron's recommendations and Mike's picks.
App.xaml.cs:
- Took master's redesigned BuildTrayMenuPopup (sessions/devices/capability toggles/Dashboard-Chat-Canvas-Companion-QuickSend) as the base
- Kept our GatewayCredentialResolver.Resolve credential resolution path
- Kept our EnsureNodeServiceForLocalGatewaySetup alongside master's WireAppCapabilityHandlers
- Took master's BuildBadge helper verbatim
- GRAFTED: conditional Setup Guide / Reconfigure entry from must-fix openclaw#6 into master's actions section before Exit, using existing OnboardingExistingConfigGuard for label and existing wizard launch wiring
DeviceIdentity.cs:
- Kept our multi-method dispatch (operator/node role-aware token storage)
- Added master's empty-token guard to StoreDeviceTokenCore and StoreNodeDeviceTokenCore
SetupCodeDecoder.cs:
- Took master's strict JSON parsing verbatim (explicit ValueKind checks, 'must include url or token' gate)
- Dropped our bootstrap_token/token field name fallbacks (file follow-up if needed)
Known follow-ups (existing tray-menu items lost to master's redesign; file as separate issues):
- Activity Stream flyout
- Support / Debug flyouts
- AutoStart entry
- RestartSshTunnel entry
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
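The strict parsing adopted for SetupCodeDecoder.cs (explicit type checks plus a 'must include url or token' gate) can be illustrated with a small Python sketch. It is not master's actual decoder: `decode_setup_payload` is a hypothetical function, and any outer encoding of the setup code is out of scope here.

```python
import json


def decode_setup_payload(text: str) -> dict:
    """Accept only a JSON object whose url/token fields are strings."""
    data = json.loads(text)
    if not isinstance(data, dict):
        raise ValueError("setup code must be a JSON object")
    result = {}
    for key in ("url", "token"):
        value = data.get(key)
        if value is None:
            continue
        # Explicit kind check: numbers, arrays, and objects are rejected.
        if not isinstance(value, str):
            raise ValueError(f"{key} must be a string")
        if value:  # empty strings count as absent
            result[key] = value
    if not result:
        raise ValueError("must include url or token")
    return result
```

Note that a payload carrying only `bootstrap_token` is rejected here, in line with the dropped field-name fallbacks.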
Smoke test: known issues noted (not blocking)
Local smoke test of the merged build (after master merge 37745b2) surfaced two regressions vs pre-merge behavior. Filing for visibility; neither blocks this PR:
Build pass, Shared.Tests 1251/1274 (22 skipped + 1 pre-existing env-only ReadmeValidationTests), Tray.Tests 617/617.
Summary
Implements a WSL-based one-click OpenClaw Gateway install that auto-pairs the tray app as an operator and a worker node, connects it to the onboarding wizard (RPC-driven from the gateway), and hardens the security posture of token handling and logging. Existing gateway connections are not altered in the app upgrade scenario.
248 files changed across shared library, tray WinUI app, scripts, docs, tests, and CI.
What changed
WSL Local Gateway Setup (Phases 1–5)
- OpenClaw.Shared): Ed25519 keypair management with operator + node role-specific device tokens, v2/v3 connect payload signing.
- OpenClaw.Shared): Bootstrap + role-specific reconnect, credential broadening (Token → BootstrapToken → DeviceIdentity).
Existing-Config Gate (Must-Fix #6)
- OnboardingExistingConfigGuard: detects 7 existing-config signals (settings.Token, BootstrapToken, GatewayUrl, operator/node DeviceToken, setup-state phase, WSL distro).
Security Cluster
- OPENCLAW_GATEWAY_TOKEN environment variable + WSLENV passthrough; never in process argv.
Wizard Fixes
- 7f3f108521 (Apr 27) introduced a startup config write that triggers the gateway's auto-reload listener mid-wizard, killing the in-memory wizard session. Upstream PR openclaw/openclaw#78047 (d4b4660026, May 6) addresses related startup writes but does not fully resolve the channels-step case (the channel-plugin install path also writes config with mode: "auto"). Tracked separately for further upstream investigation.
Wizard Recovery
- WizardFlowController: Connection-loss detection via epoch tracking, automatic wizard.next (no-answer) resume before wizard.start fallback.
- WaitForConnectionAsync: 30-second reconnect polling before recovery attempt; supports CancellationToken.
Dev Loop
- scripts/dev-reset-rebuild-launch.ps1: Kill → backup/wipe state → optional WSL distro wipe → build → launch.
CI
- dotnet-coverage for out-of-process coverage collection.
What's NOT in scope
Validation
- build.ps1 ✅
- dotnet test OpenClaw.Shared.Tests: 1184 passed, 22 skipped
- dotnet test OpenClaw.Tray.Tests: 633 passed, 0 skipped
Follow-up backlog
- OnboardingExistingConfigGuard on App to avoid per-hover file I/O
- async void patterns in WizardPage for exception safety
- static engine state in LocalSetupProgressPage for multi-window scenarios