
fix(cli): roll back gateway registration when auth fails during gateway add #1538

Open
zanetworker wants to merge 1 commit into NVIDIA:main from zanetworker:fix/gateway-add-rollback-on-auth-failure

zanetworker commented May 23, 2026

Summary

When openshell gateway add failed to authenticate (OIDC discovery error, browser timeout, Cloudflare callback failure), the gateway registration was left on disk. Users who retried with different names or flags accumulated broken entries that had to be cleaned up manually with gateway remove.

Root cause: store_gateway_metadata() and save_active_gateway() are called before the auth attempt. On failure, the code printed "Authentication skipped" but never cleaned up. There was no rollback at all, so every failed attempt left a stale registration behind regardless of why auth failed.

The problem in practice

Debugging auth produces this after a few retries with different flags and names:

$ openshell gateway list
  NAME                  TYPE    AUTH
  127.0.0.1             local   plaintext    # duplicate of local-docker
  local-docker          local   plaintext    # works
  localhost             local   plaintext    # dead, refuses connections
  oidc-test             remote  oidc         # dead, one-off experiment
  openshell             local   mtls         # dead, no gateway running
  openshell-gw-default  local   plaintext    # dead, returns 503
  openshift-cluster     remote  oidc         # stale AWS ELB endpoint
  rhai-gw               remote  plaintext    # works
* workload-gw           cloud   cloudflare   # wrong auth mode, hangs

Only 2 of 9 entries work. The rest are leftovers from failed gateway add attempts that were never cleaned up.

After this fix

$ openshell gateway add --name bad-gw \
    --oidc-issuer http://127.0.0.1:1/realms/fake \
    --oidc-client-id test https://gateway.example.com
✓ Gateway 'bad-gw' added and set as active
! Authentication failed: error sending request...
! Registration for 'bad-gw' removed. Fix the issue and retry gateway add.

$ openshell gateway list
  local-docker   local   plaintext
* workload-gw    remote  oidc

The failed registration is gone. The previously active gateway is restored.

Behavior change

| Scenario | Before | After |
| --- | --- | --- |
| Auth fails (network error, timeout, wrong issuer) | Registration kept, stale entry accumulates | Registration rolled back, active gateway restored |
| Auth intentionally skipped (OPENSHELL_NO_BROWSER=1) | Registration kept | Registration kept (unchanged) |
| Auth succeeds | Registration kept | Registration kept (unchanged) |

The only case where a failed registration is preserved is when the user explicitly set OPENSHELL_NO_BROWSER=1, which signals "I know auth will not complete now, I will authenticate separately with gateway login." This env var is used in CI/headless environments where no browser is available.
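The table above reduces to a small decision function. The following is a minimal sketch of that rule, not the code in this PR: the `AuthOutcome` and `RegistrationAction` types and the `registration_action` function are hypothetical names, and treating only the exact value `1` as "suppressed" is an assumption about how OPENSHELL_NO_BROWSER is interpreted.

```rust
/// Outcome of the authentication step during `gateway add` (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum AuthOutcome {
    Succeeded,
    Failed,
}

/// What happens to the new registration afterwards (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum RegistrationAction {
    Keep,
    RollBack,
}

/// Assumption: the browser counts as suppressed only when the variable is exactly "1".
fn browser_suppressed(no_browser: Option<&str>) -> bool {
    no_browser == Some("1")
}

/// Mirrors the three rows of the behavior table.
fn registration_action(outcome: AuthOutcome, no_browser: Option<&str>) -> RegistrationAction {
    match outcome {
        // Auth succeeded: nothing to undo.
        AuthOutcome::Succeeded => RegistrationAction::Keep,
        // Auth did not complete, but the user opted out of the browser flow.
        AuthOutcome::Failed if browser_suppressed(no_browser) => RegistrationAction::Keep,
        // Auth genuinely failed: remove the entry and restore the old active gateway.
        AuthOutcome::Failed => RegistrationAction::RollBack,
    }
}

fn main() {
    // Read the real environment variable and show the decision for a failed attempt.
    let env_value = std::env::var("OPENSHELL_NO_BROWSER").ok();
    let action = registration_action(AuthOutcome::Failed, env_value.as_deref());
    println!("{action:?}");
}
```

Taking the variable's value as a parameter rather than reading it inside the decision function keeps the rule testable without mutating process-wide environment state.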

Related Issue

Fixes #1537

Changes

  • Add is_browser_suppressed() helper that checks OPENSHELL_NO_BROWSER
  • Add rollback_gateway_registration() helper that removes the registration and restores the previously active gateway
  • OIDC auth path: capture previous active gateway before registration, roll back on auth failure
  • Cloud (Cloudflare) auth path: same rollback logic
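The capture-then-roll-back pattern in these changes can be sketched as follows. This is an illustration under stated assumptions rather than the PR's implementation: `GatewayStore` is a hypothetical in-memory stand-in for the on-disk metadata, `add_gateway` and its closure parameter are invented for the example, and the real helpers' signatures may differ.

```rust
use std::collections::BTreeMap;

/// In-memory stand-in for the on-disk gateway store (hypothetical).
#[derive(Debug, Default)]
struct GatewayStore {
    /// Gateway name mapped to its endpoint.
    gateways: BTreeMap<String, String>,
    /// Name of the currently active gateway, if any.
    active: Option<String>,
}

impl GatewayStore {
    /// Registers a gateway, makes it active, and returns the previously active name.
    fn register_and_activate(&mut self, name: &str, endpoint: &str) -> Option<String> {
        let previous = self.active.clone();
        self.gateways.insert(name.to_string(), endpoint.to_string());
        self.active = Some(name.to_string());
        previous
    }

    /// Removes the registration and restores the previously active gateway.
    fn rollback_registration(&mut self, name: &str, previous_active: Option<String>) {
        self.gateways.remove(name);
        self.active = previous_active;
    }
}

/// Registers first, then authenticates; undoes the registration if auth fails.
fn add_gateway<F>(
    store: &mut GatewayStore,
    name: &str,
    endpoint: &str,
    authenticate: F,
) -> Result<(), String>
where
    F: FnOnce() -> Result<(), String>,
{
    // Capture the previous active gateway before the store is modified.
    let previous = store.register_and_activate(name, endpoint);
    if let Err(err) = authenticate() {
        store.rollback_registration(name, previous);
        return Err(err);
    }
    Ok(())
}

fn main() {
    let mut store = GatewayStore::default();
    add_gateway(&mut store, "seed-gw", "https://seed.example.com", || Ok(())).unwrap();
    let result = add_gateway(&mut store, "bad-gw", "https://gateway.example.com", || {
        Err("issuer unreachable".to_string())
    });
    println!("result: {result:?}");
    println!("active: {:?}", store.active);
    println!("registered: {:?}", store.gateways.keys().collect::<Vec<_>>());
}
```

The sketch does not handle the case where the new name overwrites an existing registration; a rollback there would need to restore the old entry rather than delete it.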

Testing

  • 135 lib tests pass, 0 failures
  • 2 new tests:
    • gateway_add_oidc_rolls_back_on_auth_failure: registers a seed gateway, attempts an OIDC add against an unreachable issuer, and verifies that the failed registration is removed and the active gateway is restored
    • gateway_add_cloud_rolls_back_on_auth_failure: same pattern for the Cloudflare auth path
  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests not applicable (auth rollback is a CLI-local operation)

Note

The cloud auth rollback test opens a browser tab to https://127.0.0.1:1/auth/connect, which Chrome blocks with ERR_UNSAFE_PORT. This is expected, and the tab can be closed.

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (not applicable)

…ay add

When OIDC or Cloudflare browser auth fails during gateway add, remove
the gateway registration and restore the previously active gateway
instead of leaving a broken entry on disk.

Previously, store_gateway_metadata and save_active_gateway were called
before the auth attempt. On failure, the registration persisted with an
'authenticate later' message, causing stale entries to accumulate when
users retried with different flags or names.

The rollback is skipped when the browser is intentionally suppressed
(OPENSHELL_NO_BROWSER=1), since the user intends to authenticate later
with gateway login.

Fixes NVIDIA#1537

Signed-off-by: Adel Zaalouk <azaalouk@redhat.com>

copy-pr-bot Bot commented May 23, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

