10 changes: 5 additions & 5 deletions .github/scripts/e2e-prepare-creds.sh
@@ -4,8 +4,8 @@
# GITHUB_ENV. Never echoes secret values.
#
# Inputs (env):
# E2E_SSH_PRIVATE_KEY SSH private key contents (required)
# E2E_CLUSTER_KUBECONFIG base64-encoded kubeconfig (required)
# E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY SSH private key contents (required)
# E2E_DVP_BASE_CLUSTER_KUBECONFIG base64-encoded kubeconfig (required)
# GITHUB_ENV file to append env exports to (required)
# GITHUB_WORKSPACE workspace root to prune (optional)
# RUNNER_TEMP dir for temp files (falls back to TMPDIR, then /tmp)
@@ -16,14 +16,14 @@ tmp_dir="${RUNNER_TEMP:-${TMPDIR:-/tmp}}"
ssh_key_path="$(mktemp "${tmp_dir%/}/e2e_ssh_key.XXXXXX")"
kubeconfig_path="$(mktemp "${tmp_dir%/}/e2e_kubeconfig.XXXXXX")"

printf '%s\n' "${E2E_SSH_PRIVATE_KEY:?E2E_SSH_PRIVATE_KEY is required}" >"$ssh_key_path"
printf '%s\n' "${E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY:?E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY is required}" >"$ssh_key_path"
chmod 600 "$ssh_key_path"

printf '%s' "${E2E_CLUSTER_KUBECONFIG:?E2E_CLUSTER_KUBECONFIG is required}" | base64 -d >"$kubeconfig_path"
printf '%s' "${E2E_DVP_BASE_CLUSTER_KUBECONFIG:?E2E_DVP_BASE_CLUSTER_KUBECONFIG is required}" | base64 -d >"$kubeconfig_path"
chmod 600 "$kubeconfig_path"

{
echo "E2E_DVP_BASE_CLUSTER_SSH_KEY_PATH=${ssh_key_path}"
echo "E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY_PATH=${ssh_key_path}"
echo "E2E_DVP_BASE_CLUSTER_KUBECONFIG_PATH=${kubeconfig_path}"
} >>"${GITHUB_ENV:?GITHUB_ENV is required}"

8 changes: 4 additions & 4 deletions .github/workflows/e2e.yml
@@ -122,8 +122,8 @@ jobs:

      - name: Prepare credentials
        env:
          E2E_SSH_PRIVATE_KEY: ${{ secrets.E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY }}
          E2E_CLUSTER_KUBECONFIG: ${{ secrets.E2E_DVP_BASE_CLUSTER_KUBECONFIG }}
          E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY: ${{ secrets.E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY }}
          E2E_DVP_BASE_CLUSTER_KUBECONFIG: ${{ secrets.E2E_DVP_BASE_CLUSTER_KUBECONFIG }}
        run: bash _storage-e2e/.github/scripts/e2e-prepare-creds.sh

- name: Bootstrap cluster
@@ -229,8 +229,8 @@ jobs:

      - name: Prepare credentials
        env:
          E2E_SSH_PRIVATE_KEY: ${{ secrets.E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY }}
          E2E_CLUSTER_KUBECONFIG: ${{ secrets.E2E_DVP_BASE_CLUSTER_KUBECONFIG }}
          E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY: ${{ secrets.E2E_DVP_BASE_CLUSTER_SSH_PRIVATE_KEY }}
          E2E_DVP_BASE_CLUSTER_KUBECONFIG: ${{ secrets.E2E_DVP_BASE_CLUSTER_KUBECONFIG }}
        run: bash _storage-e2e/.github/scripts/e2e-prepare-creds.sh

- name: Remove cluster
70 changes: 64 additions & 6 deletions docs/ARCHITECTURE.md
@@ -51,11 +51,19 @@ storage-e2e/
│ │ └── vm_block_device.go
│ │
│ ├── infrastructure/ # Infrastructure layer
│ │ └── ssh/ # SSH operations
│ │ └── ssh/ # SSH operations (legacy)
│ │ ├── client.go
│ │ ├── interface.go
│ │ ├── tunnel.go
│ │ └── types.go
│ │ ├── types.go
│ │ └── v2/ # Self-healing SSH client (Dialer/Route + Tunnel)
│ │ ├── client.go # New, Client, Close + package docs
│ │ ├── conn.go # connection core: snapshot/refresh/keepalive + withConn
│ │ ├── dialer.go # Dialer interface, Route, chain closer
│ │ ├── endpoint.go # Endpoint, auth, host/key resolution
│ │ ├── errors.go # transient classification
│ │ ├── options.go # functional options
│ │ └── tunnel.go # Tunnel, accept loop
│ │
│ └── logger/ # Structured logging
│ ├── logger.go # Logger implementation
@@ -473,10 +481,18 @@ internal/kubernetes/ # Internal Kubernetes clients

```
infrastructure/ssh/
├── client.go # SSH client implementation (Exec, ExecCapture, tunnels)
├── interface.go # SSH client interface
├── tunnel.go # Port forwarding and tunneling
└── types.go # SSH-related types
├── client.go # SSH client implementation (Exec, ExecCapture, tunnels) [legacy]
├── interface.go # SSH client interface [legacy]
├── tunnel.go # Port forwarding and tunneling [legacy]
├── types.go # SSH-related types [legacy]
└── v2/ # Self-healing SSH client (see below)
├── client.go # New, Client, Close + package docs
├── conn.go # connection core: snapshot/refresh/keepalive + withConn executor
├── dialer.go # Dialer interface, Route, chain closer
├── endpoint.go # Endpoint, auth, host/key resolution
├── errors.go # transient classification
├── options.go # functional options
└── tunnel.go # Tunnel, accept loop
```

**Responsibilities**:
@@ -494,6 +510,48 @@ infrastructure/ssh/
- `ExecCapture` keeps stdout and stderr separate while preserving retry/reconnect behavior
- Proper resource cleanup

#### 3.4.1 Self-healing SSH client (`internal/infrastructure/ssh/v2/`)

A ground-up rewrite that lives in parallel with the legacy package (no consumers
migrated yet). It separates **how we connect** (directly or via jump hosts) from
**what we do over the connection** (currently only tunneling), and hides every
reconnect from callers.

**Design**:

- `Dialer` is the injection point: `Dial(ctx) (*ssh.Client, io.Closer, error)` +
`Describe()`. `Route(first Endpoint, more ...Endpoint)` builds the built-in
implementation; the last hop is always the target, so the `(first, more...)`
signature guarantees at least one hop at compile time. The returned `io.Closer`
tears down the whole chain (target + every jump + ssh-agent connections).
- `Endpoint` describes a single host: `User`, `Addr` (`host` or `host:port`,
default `:22`), `KeyPath` (`~` expanded), optional `Passphrase`
(falls back to `SSH_PASSPHRASE` then ssh-agent), optional per-hop `HostKey`.
- The unexported `conn` core owns the current `*ssh.Client`, its chain `Closer`,
and a generation counter under a mutex. `snapshot` reads them; `refresh`
re-dials via `singleflight` keyed on the failed generation so concurrent
reconnects collapse into one and a stale generation never tears down a freshly
healed link. The slow `Dial` runs outside the lock on a detached context
(`context.WithoutCancel` + timeout) so one caller's cancellation can't abort
the shared flight.
- A single generic executor `withConn[T]` runs an operation against the live
client and heals on transient failures (bounded by `WithRetries`); the tunnel
uses it today and `Run`/`Upload` are designed to reuse it unchanged.
- Optional keepalive (`WithKeepalive`) probes the link and heals through the same
`refresh` path; every heal is logged at WARN.

**Public API v1**: `New(ctx, Dialer, ...Option)`, `Client.Tunnel(ctx, remotePort)`
(self-healing local forward on a free `127.0.0.1` port; `Tunnel.LocalAddr`,
`Tunnel.Close`), `Client.Close`. Options: `WithKeepalive`, `WithRetries`,
`WithLogger`, `WithHostKeyCallback`, `WithInsecureIgnoreHostKey` (host key
defaults to `InsecureIgnoreHostKey` — a conscious default for ephemeral e2e VMs).

**Extension points (designed, not yet implemented)**: `Run` and `Upload`. `Run`
retries transparently only when the session fails to open. A mid-flight drop
heals the connection but surfaces the error to the caller, because re-running a
partially executed command could repeat its side effects; an opt-in `Idempotent`
enables true retry. Transient-error classification uses `errors.Is`/`errors.As`
against standard types, never error-string matching.

### 3.5 Logger Module (`internal/logger/`)

```
3 changes: 3 additions & 0 deletions docs/WORKLOG.md
@@ -224,3 +224,6 @@ All notable changes to this repository are documented here. New entries are appe
`log.Fatalf` instead of glued key/value arguments.
- **Bugfix** `.github/workflows/e2e.yml`: checkout `storage-e2e` into `_storage-e2e` in the `run-tests` job before
invoking `.github/scripts/e2e-run-tests.sh`.
- **Bugfix** `internal/infrastructure/ssh/v2/conn.go` (`newConn`): derive keepalive context via
`context.WithCancel(context.WithoutCancel(ctx))` instead of `context.Background()` to satisfy `contextcheck` while
keeping the loop lifetime tied to the connection (still cancelled in `Close`).
74 changes: 74 additions & 0 deletions internal/infrastructure/ssh/v2/client.go
@@ -0,0 +1,74 @@
/*
Copyright 2025 Flant JSC

Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
*/

// Package ssh provides a self-healing SSH client whose connection strategy
// ("how we connect" — directly or through jump hosts) is separated from the
// operations performed over it ("what we do" — currently tunneling).
//
// The injection point is the Dialer: Route builds one for a direct connection or
// an arbitrary chain of jump hosts. New opens a Client over a Dialer and hides
// every reconnect: callers invoke methods and never reason about reconnection.
// All operations funnel through a single reconnect-aware executor (withConn) over
// a shared connection core (conn), so future operations such as Run and Upload
// can be added without touching the healing logic.
//
// The primary use case is opening a tunnel to the API server of a closed
// Kubernetes cluster and pointing a kubeconfig at it:
//
//	c, _ := ssh.New(ctx, ssh.Route(jumpEp, targetEp)) // error handling omitted
//	defer c.Close()
//	t, _ := c.OpenTunnel(ctx, 6443)
//	defer t.Close()
//	cfg := &rest.Config{Host: "https://" + t.LocalAddr()}
package ssh

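import (
	"context"
	"errors"
	"log/slog"
)

// Client is a self-healing SSH client. Its operations run over a shared
// connection core that hides reconnects from callers.
type Client struct {
	conn    *conn
	retries int
	log     *slog.Logger
}

// New opens a Client over d, applying opts on top of the defaults.
// It returns an error if d is nil or if newConn fails.
func New(ctx context.Context, d Dialer, opts ...Option) (*Client, error) {
	if d == nil {
		return nil, errors.New("ssh: nil dialer")
	}

	o := defaultOptions()
	for _, opt := range opts {
		opt(&o)
	}

	// Dialers implementing hostKeyDefaulter receive the configured host key
	// setting as their default.
	if hkd, ok := d.(hostKeyDefaulter); ok {
		hkd.setDefaultHostKey(o.hostKey)
	}

	core, err := newConn(ctx, d, o)
	if err != nil {
		return nil, err
	}

	return &Client{conn: core, retries: o.retries, log: o.log}, nil
}

// Close closes the underlying connection core.
func (c *Client) Close() error {
	return c.conn.Close()
}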