## Problem
Obol Stack currently runs exclusively on k3d (k3s-in-Docker). This works well for most users, but introduces friction in several real-world scenarios:
- **Bare-metal and VPS deployments** — running k3s-in-Docker on cloud VMs adds overhead and complexity that isn't necessary on a Linux host
- **Performance-sensitive workloads** — the Docker abstraction adds I/O overhead for persistent volumes and networking, which matters for blockchain nodes syncing large amounts of data
- **Environments where Docker isn't available or desirable** — some server setups, CI runners, and minimal Linux installs don't have Docker and shouldn't need it just to run a local Kubernetes cluster
- **Debugging and observability** — with k3d, logs, networking, and storage sit behind a Docker layer that makes troubleshooting harder; native k3s exposes everything directly on the host
### TEE requirement: k3d fundamentally cannot support Trusted Execution Environments
The most critical driver for native k3s support is Confidential Computing / TEE workloads. k3d runs k3s inside Docker containers, adding an extra isolation layer that blocks hardware TEE access entirely. This means:
- **AMD SEV-SNP and Intel TDX require direct hardware access** — the Docker container boundary in k3d prevents a confidential guest from negotiating with the host's TEE firmware
- **Kubernetes CoCo (Confidential Containers)** with `kata-qemu-snp` or `kata-qemu-tdx` runtimes requires bare-metal k3s
- **GPU TEE workloads** (NVIDIA H100/H200 Confidential Computing) similarly need direct hardware access through bare-metal k3s + the NVIDIA GPU Operator
This creates a natural two-profile architecture for Obol Stack:
| Environment | Runtime | TEE Hardware | Use Case |
|---|---|---|---|
| Local dev (k3d) | Standard containers | None | Business logic, x402, routing |
| TEE dev (k3s) | kata-qemu-coco-dev | None (virtualization only) | Test CoCo workflow, no real security |
| TEE production (k3s) | kata-qemu-snp | AMD EPYC (SEV-SNP) | Real confidential inference |
| GPU TEE production (k3s) | CoCo + NVIDIA Operator | H100/H200 + AMD EPYC | High-throughput confidential inference |
This maps directly to the consumer/provider split in the marketplace design — consumers run the standard k3d stack locally, while inference providers run bare-metal k3s with TEE hardware for verifiable private inference.
## Proposal
Introduce a pluggable backend system that lets users choose between k3d (default, Docker-based) and k3s (native, bare-metal) when initializing their stack:
```sh
# Docker-based (default, unchanged)
obol stack init

# Native k3s (new)
obol stack init --backend k3s
```

The backend choice is persisted in `.stack-backend` so all subsequent commands (`up`, `down`, `purge`) work transparently regardless of backend.
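As a minimal Go sketch, a command like `obol stack up` might resolve that persisted choice as follows — `resolveBackend` and the assumption that the marker file lives at the stack root are illustrative, not the actual implementation:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"strings"
)

// resolveBackend reads the persisted backend name from
// <stackDir>/.stack-backend. A missing file means the stack predates
// the backend system, so it falls back to k3d.
func resolveBackend(stackDir string) (string, error) {
	data, err := os.ReadFile(filepath.Join(stackDir, ".stack-backend"))
	if os.IsNotExist(err) {
		return "k3d", nil // legacy stacks: keep today's behavior
	}
	if err != nil {
		return "", err
	}
	return strings.TrimSpace(string(data)), nil
}

func main() {
	backend, err := resolveBackend(".")
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Println("using backend:", backend)
}
```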
### Key design considerations
- **Backend interface** — a common `Backend` interface (`Init`, `Up`, `Down`, `Destroy`, `IsRunning`, `DataDir`) so the stack lifecycle code is backend-agnostic (see the first sketch after this list)
- **k3d remains the default** — no breaking changes for existing users; stacks without a `.stack-backend` file fall back to k3d
- **k3s process management** — native k3s runs as a root process via `sudo`, requiring PID tracking, process group signals, and proper cleanup (`k3s-killall.sh`); the second sketch below illustrates this
- **DataDir divergence** — k3d mounts host paths into Docker containers (always `/data` inside), while k3s uses host paths directly. The `DataDir()` method abstracts this so helmfile templates work on both
- **Shared infrastructure** — both backends use the same helmfile, charts, and values templates; only the cluster lifecycle differs
- **TEE-ready foundation** — the k3s backend provides the bare-metal Kubernetes surface needed for CoCo runtime classes, kata containers, and GPU TEE operators in future work
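A minimal Go sketch of the `Backend` interface and the DataDir divergence described above — the `context.Context` parameters and the partial `K3d`/`K3s` types are illustrative assumptions, not the actual implementation:

```go
package backend

import "context"

// Backend abstracts the cluster lifecycle so stack commands
// (init, up, down, purge) don't care what runs underneath.
type Backend interface {
	Init(ctx context.Context) error    // create the cluster
	Up(ctx context.Context) error      // start an existing cluster
	Down(ctx context.Context) error    // stop without deleting state
	Destroy(ctx context.Context) error // remove the cluster and its state
	IsRunning(ctx context.Context) (bool, error)

	// DataDir is the path helmfile templates should use for
	// persistent volumes; the two backends diverge here.
	DataDir() string
}

// K3d mounts a host directory into the Docker-based nodes, so inside
// the cluster the data path is always the same. (Other methods elided.)
type K3d struct{ HostDataDir string }

func (K3d) DataDir() string { return "/data" }

// K3s runs on the host itself, so templates get the host path directly.
type K3s struct{ HostDataDir string }

func (b K3s) DataDir() string { return b.HostDataDir }
```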
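And a sketch of the process-management concern for the k3s backend. The PID-file handling and helper names are assumptions for illustration; `k3s-killall.sh` is the cleanup script the k3s installer provides:

```go
package backend

import (
	"fmt"
	"os"
	"os/exec"
	"strconv"
	"syscall"
)

// startK3s launches k3s as root in its own process group and records
// the PID so later Down/Destroy calls can find and signal it.
func startK3s(dataDir, pidFile string) error {
	cmd := exec.Command("sudo", "k3s", "server", "--data-dir", dataDir)
	cmd.Stdout, cmd.Stderr = os.Stdout, os.Stderr
	// New process group: k3s and its children can be signaled together
	// without touching the CLI's own process group.
	cmd.SysProcAttr = &syscall.SysProcAttr{Setpgid: true}
	if err := cmd.Start(); err != nil {
		return err
	}
	return os.WriteFile(pidFile, []byte(strconv.Itoa(cmd.Process.Pid)), 0o644)
}

// stopK3s signals the whole (root-owned) process group via sudo, then
// runs k3s-killall.sh to clean up leftover mounts, network interfaces,
// and iptables rules.
func stopK3s(pid int) error {
	// "kill -- -PID" targets the process group rather than a single PID.
	if err := exec.Command("sudo", "kill", "-TERM", "--", fmt.Sprintf("-%d", pid)).Run(); err != nil {
		return err
	}
	return exec.Command("sudo", "k3s-killall.sh").Run()
}
```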