
Support native k3s as a cluster backend #134

@bussyjd

Description


Problem

Obol Stack currently runs exclusively on k3d (k3s-in-Docker). This works well for most users, but introduces friction in several real-world scenarios:

  • Bare-metal and VPS deployments — running k3s-in-Docker on cloud VMs adds overhead and complexity that isn't necessary on a Linux host
  • Performance-sensitive workloads — the Docker abstraction adds I/O overhead for persistent volumes and networking, which matters for blockchain nodes syncing large amounts of data
  • Environments where Docker isn't available or desirable — some server setups, CI runners, or minimal Linux installs don't have Docker and shouldn't need it just to run a local Kubernetes cluster
  • Debugging and observability — with k3d, logs, networking, and storage sit behind a Docker layer that makes troubleshooting harder; native k3s exposes everything directly on the host

TEE requirement: k3d fundamentally cannot support Trusted Execution Environments

The most critical driver for native k3s support is Confidential Computing / TEE workloads. k3d runs k3s inside Docker containers, adding an extra containerization layer that blocks hardware TEE access entirely. This means:

  • AMD SEV-SNP and Intel TDX require direct hardware access — the Docker container boundary in k3d prevents the guest kernel from negotiating with the host's TEE firmware
  • Kubernetes CoCo (Confidential Containers) with kata-qemu-snp or kata-qemu-tdx runtimes requires bare-metal k3s
  • GPU TEE workloads (NVIDIA H100/H200 Confidential Computing) similarly need direct hardware access through bare-metal k3s + NVIDIA GPU Operator

This creates a natural two-profile architecture for Obol Stack:

| Environment | Runtime | TEE Hardware | Use Case |
| --- | --- | --- | --- |
| Local dev (k3d) | Standard containers | None | Business logic, x402, routing |
| TEE dev (k3s) | kata-qemu-coco-dev | None (virtualization only) | Test CoCo workflow, no real security |
| TEE production (k3s) | kata-qemu-snp | AMD EPYC (SEV-SNP) | Real confidential inference |
| GPU TEE production (k3s) | CoCo + NVIDIA Operator | H100/H200 + AMD EPYC | High-throughput confidential inference |

This maps directly to the consumer/provider split in the marketplace design — consumers run the standard k3d stack locally, while inference providers run bare-metal k3s with TEE hardware for verifiable private inference.

Proposal

Introduce a pluggable backend system that lets users choose between k3d (default, Docker-based) and k3s (native, bare-metal) when initializing their stack:

# Docker-based (default, unchanged)
obol stack init

# Native k3s (new)
obol stack init --backend k3s

The backend choice is persisted in .stack-backend so all subsequent commands (up, down, purge) work transparently regardless of backend.
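As a rough sketch of how backend resolution could work (assuming a Go implementation; ResolveBackend and the stack-directory layout are illustrative, not taken from the actual codebase), matching the k3d fallback described under "Key design considerations" below:

// backend_resolve.go — hypothetical sketch, not the actual Obol Stack code.
package stack

import (
	"os"
	"path/filepath"
	"strings"
)

// ResolveBackend reads the persisted backend name for a stack directory,
// falling back to "k3d" when no .stack-backend file exists so that
// existing stacks keep working without changes.
func ResolveBackend(stackDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(stackDir, ".stack-backend"))
	if os.IsNotExist(err) {
		return "k3d", nil // default backend, Docker-based
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}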

Key design considerations

  • Backend interface — a common Backend interface (Init, Up, Down, Destroy, IsRunning, DataDir) so the stack lifecycle code is backend-agnostic (see the sketch after this list)
  • k3d remains the default — no breaking changes for existing users; stacks without a .stack-backend file fall back to k3d
  • k3s process management — native k3s runs as a root process via sudo, requiring PID tracking, process group signals, and proper cleanup (k3s-killall.sh)
  • DataDir divergence — k3d mounts host paths into Docker containers (always /data inside), while k3s uses host paths directly. The DataDir() method abstracts this so helmfile templates work on both
  • Shared infrastructure — both backends use the same helmfile, charts, and values templates; only the cluster lifecycle differs
  • TEE-ready foundation — the k3s backend provides the bare-metal Kubernetes surface needed for CoCo runtime classes, kata containers, and GPU TEE operators in future work
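A minimal Go sketch of what the pieces above might look like (only the Backend method set comes from this issue; type names, fields, and method bodies are assumptions for illustration):

// backend.go — hypothetical sketch; only the Backend method set is taken
// from this issue, everything else is illustrative.
package stack

import (
	"os/exec"
	"syscall"
)

// Backend abstracts cluster lifecycle so up/down/purge stay backend-agnostic.
type Backend interface {
	Init() error
	Up() error
	Down() error
	Destroy() error
	IsRunning() (bool, error)
	DataDir() string // path that helmfile templates should use for data
}

type k3dBackend struct{}

// k3d mounts host paths into the server container, so templates always
// see the same fixed path inside the cluster.
func (k3dBackend) DataDir() string { return "/data" }

type k3sBackend struct {
	hostDataDir string // assumed field: host path used directly by native k3s
	pgid        int    // assumed field: process group of the sudo'd k3s server
}

// Native k3s serves data straight from the host filesystem.
func (b *k3sBackend) DataDir() string { return b.hostDataDir }

// Down signals the whole k3s process group, then runs the upstream
// k3s-killall.sh script to clean up child processes and mounts.
func (b *k3sBackend) Down() error {
	if b.pgid > 0 {
		_ = syscall.Kill(-b.pgid, syscall.SIGTERM) // negative pid targets the group
	}
	return exec.Command("sudo", "k3s-killall.sh").Run()
}

The remaining methods (Init, Up, IsRunning, Destroy) would wrap the k3d and k3s commands in the same way; the point is only that lifecycle code calls the interface and never branches on the backend directly.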
