
Support native k3s as a cluster backend #134

@bussyjd

Description


Problem

Obol Stack currently runs exclusively on k3d (k3s-in-Docker). This works well for most users, but introduces friction in several real-world scenarios:

  • Bare-metal and VPS deployments — running k3s-in-Docker on cloud VMs adds overhead and complexity that isn't necessary on a Linux host
  • Performance-sensitive workloads — the Docker abstraction adds I/O overhead for persistent volumes and networking, which matters for blockchain nodes syncing large amounts of data
  • Environments where Docker isn't available or desirable — some server setups, CI runners, or minimal Linux installs don't have Docker and shouldn't need it just to run a local Kubernetes cluster
  • Debugging and observability — with k3d, logs, networking, and storage sit behind a Docker layer that makes troubleshooting harder; native k3s exposes everything directly on the host

TEE requirement: k3d fundamentally cannot support Trusted Execution Environments

The most critical driver for native k3s support is Confidential Computing / TEE workloads. k3d runs k3s inside Docker containers, adding an extra containerization layer that blocks hardware TEE access entirely. This means:

  • AMD SEV-SNP and Intel TDX require direct hardware access — the Docker container boundary in k3d prevents the guest kernel from negotiating with the host's TEE firmware
  • Kubernetes CoCo (Confidential Containers) with kata-qemu-snp or kata-qemu-tdx runtimes requires bare-metal k3s
  • GPU TEE workloads (NVIDIA H100/H200 Confidential Computing) similarly need direct hardware access through bare-metal k3s + NVIDIA GPU Operator

This creates a natural two-profile architecture for Obol Stack:

| Environment | Runtime | TEE Hardware | Use Case |
| --- | --- | --- | --- |
| Local dev (k3d) | Standard containers | None | Business logic, x402, routing |
| TEE dev (k3s) | kata-qemu-coco-dev | None (virtualization only) | Test CoCo workflow, no real security |
| TEE production (k3s) | kata-qemu-snp | AMD EPYC (SEV-SNP) | Real confidential inference |
| GPU TEE production (k3s) | CoCo + NVIDIA Operator | H100/H200 + AMD EPYC | High-throughput confidential inference |

This maps directly to the consumer/provider split in the marketplace design — consumers run the standard k3d stack locally, while inference providers run bare-metal k3s with TEE hardware for verifiable private inference.

Proposal

Introduce a pluggable backend system that lets users choose between k3d (default, Docker-based) and k3s (native, bare-metal) when initializing their stack:

# Docker-based (default, unchanged)
obol stack init

# Native k3s (new)
obol stack init --backend k3s

The backend choice is persisted in .stack-backend so all subsequent commands (up, down, purge) work transparently regardless of backend.
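As a rough sketch of how backend resolution could work (assuming a Go implementation; ResolveBackend and the stack-directory layout are illustrative, not taken from the actual codebase), matching the k3d fallback described under "Key design considerations" below:

// backend_resolve.go — hypothetical sketch, not the actual Obol Stack code.
package stack

import (
	"os"
	"path/filepath"
	"strings"
)

// ResolveBackend reads the persisted backend name for a stack directory,
// falling back to "k3d" when no .stack-backend file exists so that
// existing stacks keep working without changes.
func ResolveBackend(stackDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(stackDir, ".stack-backend"))
	if os.IsNotExist(err) {
		return "k3d", nil // default backend, Docker-based
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}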

Key design considerations

  • Backend interface — a common Backend interface (Init, Up, Down, Destroy, IsRunning, DataDir) so the stack lifecycle code is backend-agnostic (see the sketch after this list)
  • k3d remains the default — no breaking changes for existing users; stacks without a .stack-backend file fall back to k3d
  • k3s process management — native k3s runs as a root process via sudo, requiring PID tracking, process group signals, and proper cleanup (k3s-killall.sh)
  • DataDir divergence — k3d mounts host paths into Docker containers (always /data inside), while k3s uses host paths directly. The DataDir() method abstracts this so helmfile templates work on both
  • Shared infrastructure — both backends use the same helmfile, charts, and values templates; only the cluster lifecycle differs
  • TEE-ready foundation — the k3s backend provides the bare-metal Kubernetes surface needed for CoCo runtime classes, kata containers, and GPU TEE operators in future work
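A minimal Go sketch of what the pieces above might look like (only the Backend method set comes from this issue; type names, fields, and method bodies are assumptions for illustration):

// backend.go — hypothetical sketch; only the Backend method set is taken
// from this issue, everything else is illustrative.
package stack

import (
	"os/exec"
	"syscall"
)

// Backend abstracts cluster lifecycle so up/down/purge stay backend-agnostic.
type Backend interface {
	Init() error
	Up() error
	Down() error
	Destroy() error
	IsRunning() (bool, error)
	DataDir() string // path that helmfile templates should use for data
}

type k3dBackend struct{}

// k3d mounts host paths into the server container, so templates always
// see the same fixed path inside the cluster.
func (k3dBackend) DataDir() string { return "/data" }

type k3sBackend struct {
	hostDataDir string // assumed field: host path used directly by native k3s
	pgid        int    // assumed field: process group of the sudo'd k3s server
}

// Native k3s serves data straight from the host filesystem.
func (b *k3sBackend) DataDir() string { return b.hostDataDir }

// Down signals the whole k3s process group, then runs the upstream
// k3s-killall.sh script to clean up child processes and mounts.
func (b *k3sBackend) Down() error {
	if b.pgid > 0 {
		_ = syscall.Kill(-b.pgid, syscall.SIGTERM) // negative pid targets the group
	}
	return exec.Command("sudo", "k3s-killall.sh").Run()
}

The remaining methods (Init, Up, IsRunning, Destroy) would wrap the k3d and k3s commands in the same way; the point is only that lifecycle code calls the interface and never branches on the backend directly.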
