- Overview
- Requirements
- Architecture
- Core Components
- Networking
- Storage & Cloning
- VM Lifecycle
- Guest Agent
- CLI Interface
- Implementation Details
fcvm is a Firecracker VM manager designed to run Podman containers inside lightweight microVMs with lightning-fast cloning capabilities. It provides a simple CLI interface for spinning up isolated container environments with:
- Full-featured VMs: Filesystem access, outbound networking, port forwarding
- Fast cloning: Clone running VMs in <1s using snapshots and CoW disks
- Flexible networking: Both rootless and privileged modes
- Process lifetime binding: VM lifetime tied to controlling process
- Resource configuration: Configurable vCPU/memory with overcommit support
Target Platform: Linux only (requires KVM)
- **`fcvm podman run` Command**
  - Takes a Docker/Podman container image
  - Spins up a Firecracker VM running the container
  - Supports volume mounts via FUSE passthrough (host → guest)
  - Supports port forwarding (host → guest)
  - Process blocks until VM exits (hanging/foreground mode)
  - VM dies when process is killed (lifetime binding)
- **`fcvm exec` Command**
  - Execute commands in running VMs
  - Supports running in guest OS or inside container (`-c` flag)
  - Interactive mode with stdin forwarding (`-i` flag)
  - TTY allocation for terminal apps (`-t` flag)
- **`fcvm snapshot` Commands**
  - `fcvm snapshot create`: Create snapshot from running VM
  - `fcvm snapshot serve`: Start UFFD memory server for cloning
  - `fcvm snapshot run`: Spawn clone from memory server
  - Lightning-fast clone startup (<1 second)
  - Shares memory via UFFD page fault handler
  - Creates independent VM with its own networking
- **Networking Modes**
  - Rootless: Works without root privileges using pasta (from passt)
  - Privileged: Uses iptables + TAP for better performance
  - Routed: IPv6 veth pairs with kernel routing at line rate (no userspace proxy)
  - Port mapping: `[HOSTIP:]HOSTPORT:GUESTPORT[/PROTO]` syntax
  - Supports multiple ports and TCP/UDP protocols
- **Volume Mounting**
  - Map local directories to the guest filesystem
  - Support block device, sshfs, and NFS modes
  - Read-only and read-write mounts
- **Resource Configuration**
  - vCPU overcommit (more vCPUs than physical cores)
  - Memory overcommit with balloon device
  - Configurable memory ballooning
- **Snapshot & Clone**
  - Save VM state at "warm" checkpoint (after container ready)
  - Fast restore from snapshot
  - CoW disks for instant cloning
  - Identity patching (MAC addresses, hostnames)
- Performance: Clone startup <1s
- Isolation: Full VM isolation via Firecracker
- Compatibility: Works with rootless Podman in guest
- Portability: Runs on bare metal or nested VMs (VM-in-VM)
- Reliability: Clean shutdown, resource cleanup
┌──────────────────────────────────────────────────────┐
│ fcvm CLI (Host) │
│ ┌────────────┐ ┌──────────────┐ ┌─────────────┐ │
│ │ Networking │ │ Firecracker │ │ Storage & │ │
│ │ Manager │ │ API Client │ │ Snapshots │ │
│ └────────────┘ └──────────────┘ └─────────────┘ │
│ │ │ │ │
│ └────────────────┴─────────────────┘ │
│ │ │
└──────────────────────────┼────────────────────────────┘
│
▼
┌────────────────────────┐
│ Firecracker Process │
│ (microVM) │
│ ┌──────────────────┐ │
│ │ Linux Kernel │ │
│ │ ┌────────────┐ │ │
│ │ │ fc-agent │ │ │
│ │ │ │ │ │ │
│ │ │ Podman │ │ │
│ │ │ │ │ │ │
│ │ │ Container │ │ │
│ │ └────────────┘ │ │
│ └──────────────────┘ │
└────────────────────────┘
- **fcvm CLI (Rust)**
  - Command-line interface
  - Orchestrates VM lifecycle
  - Manages networking, storage, snapshots
  - Streams logs and handles signals
- **Firecracker (external binary)**
  - Runs the microVM
  - Provides REST API over Unix socket
  - Manages VM resources (vCPU, memory, drives, network)
- **fc-agent (Rust, runs in guest)**
  - Fetches container configuration from MMDS
  - Launches Podman with correct parameters
  - Streams container logs to host via vsock
  - Signals readiness to host
Location: fcvm/src/firecracker/api.rs
Provides Rust interface to Firecracker REST API over Unix socket using hyper + hyperlocal.
Key Functions:
- `set_boot_source()` - Configure kernel + boot args
- `set_machine_config()` - Set vCPU, memory, SMT
- `add_drive()` - Attach rootfs and data disks
- `add_network_interface()` - Set up networking
- `set_mmds_config()` - Configure metadata service
- `put_mmds()` - Provide container plan to guest
- `create_snapshot()` - Save VM state
- `load_snapshot()` - Restore from snapshot
- `set_balloon()` - Configure memory balloon
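All of these reduce to PUT/PATCH requests against Firecracker's API Unix socket. A minimal sketch of such a request with hyper 0.14 + hyperlocal 0.8 (the helper name and socket path are illustrative, not the actual `api.rs` code):

```rust
use anyhow::{ensure, Result};
use hyper::{Body, Client, Method, Request};
use hyperlocal::{UnixClientExt, UnixConnector, Uri};

/// PUT a JSON body to a Firecracker endpoint over the API Unix socket.
async fn put_json(
    client: &Client<UnixConnector>,
    socket: &str,
    endpoint: &str,
    body: serde_json::Value,
) -> Result<()> {
    // hyperlocal encodes the socket path into a hyper-compatible URI
    let uri: hyper::Uri = Uri::new(socket, endpoint).into();
    let req = Request::builder()
        .method(Method::PUT)
        .uri(uri)
        .header("Content-Type", "application/json")
        .body(Body::from(body.to_string()))?;
    let resp = client.request(req).await?;
    ensure!(resp.status().is_success(), "Firecracker API error: {}", resp.status());
    Ok(())
}

// Usage sketch:
// let client = Client::unix();
// put_json(&client, "/tmp/fc.sock", "/machine-config",
//          serde_json::json!({ "vcpu_count": 2, "mem_size_mib": 1024 })).await?;
```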
API Structures:
struct BootSource {
kernel_image_path: String,
initrd_path: Option<String>,
boot_args: Option<String>,
}
struct MachineConfig {
vcpu_count: u8,
mem_size_mib: u32,
smt: Option<bool>,
track_dirty_pages: Option<bool>,
}
struct Drive {
drive_id: String,
path_on_host: String,
is_root_device: bool,
is_read_only: bool,
}
struct NetworkInterface {
iface_id: String,
host_dev_name: String, // TAP device
guest_mac: Option<String>,
}

Location: fcvm/src/firecracker/vm.rs
Manages Firecracker process lifecycle.
Responsibilities:
- Spawn Firecracker process with correct args
- Wait for API socket to be ready
- Stream stdout/stderr to tracing logs
- Handle graceful shutdown
- Clean up resources (socket, processes)
Key Functions:
impl VmManager {
async fn start(&mut self, firecracker_bin, config) -> Result<()>
async fn wait(&mut self) -> Result<ExitStatus>
async fn kill(&mut self) -> Result<()>
async fn stream_console(&self, console_path) -> Result<Receiver<String>>
fn client(&self) -> Result<&FirecrackerClient>
}

Location: fcvm/src/network/
Three implementations based on execution mode.
Uses pasta (from the passt project) for L4 splice-based networking.
Features:
- No root privileges required
- Port forwarding via pasta CLI flags (`-t` for TCP, `-u` for UDP)
- Default guest IP: `10.0.2.100`
- Default host IP: `10.0.2.2`
Implementation:
struct PastaNetwork {
vm_id: String,
tap_device: String,
port_mappings: Vec<PortMapping>,
pasta_process: Option<Child>,
}
async fn setup() -> Result<NetworkConfig> {
// TAP device created by Firecracker
// pasta started after VM boots
// Port forwarding configured via -t/-u CLI flags
}

Uses TAP devices + iptables for native performance.
Features:
- Requires root or CAP_NET_ADMIN
- Better performance than rootless
- Uses DNAT for port forwarding (scoped to veth IP)
- Network namespace isolation per VM
Implementation:
struct BridgedNetwork {
vm_id: String,
tap_device: String,
namespace_id: String,
host_veth: String, // veth_outer in host namespace
guest_veth: String, // veth_inner in VM namespace
guest_ip: String,
host_ip: String, // veth's host IP (used for port forwarding)
port_mappings: Vec<PortMapping>,
}
async fn setup() -> Result<NetworkConfig> {
create_namespace(namespace_id)
create_veth_pair(host_veth, guest_veth)
move_veth_to_namespace(guest_veth, namespace_id)
create_tap_device_in_namespace(tap_name, namespace_id)
for mapping in port_mappings {
// Scope DNAT to veth IP so same port works across VMs
setup_nat_rule(mapping, guest_ip, host_ip)
}
}

NAT Rule Example (scoped to veth IP):

iptables -t nat -A PREROUTING -d 172.30.x.1 -p tcp --dport 8080 -j DNAT --to-destination 172.30.x.2:80

Uses veth pairs + IPv6 routing for kernel line-rate networking without userspace proxies.
Features:
- Requires root and a host with a global IPv6 /64 subnet (or `--ipv6-prefix` to specify one explicitly)
- Native IPv6 routing through the kernel stack (no userspace L4 translation)
- Each VM gets a unique IPv6 derived from the host's /64 prefix
- Port forwarding via built-in TCP proxy (`setns` + tokio relay) on loopback IP (same as rootless)
- Parallel-safe: per-VM routes, proxy NDP, ip6tables rules
Implementation:
struct RoutedNetwork {
vm_id: String,
tap_device: String,
port_mappings: Vec<PortMapping>,
loopback_ip: Option<String>,
namespace_id: Option<String>,
host_veth: Option<String>,
vm_ipv6: Option<String>,
default_iface: Option<String>,
proxy_handles: Vec<JoinHandle<()>>,
ipv6_prefix: Option<String>, // explicit /64 prefix (skips auto-detect + MASQUERADE)
}
async fn setup() -> Result<NetworkConfig> {
self.preflight_check() // root, IPv6, ip6tables (ip6tables skipped if --ipv6-prefix)
detect_host_ipv6() // find /64 subnet (or /128 with on-link /64); skipped if --ipv6-prefix
generate_vm_ipv6(prefix, vm_id) // deterministic IPv6 from hash
create_namespace(ns_name)
create_veth_pair(host_veth, guest_veth)
create_tap_in_ns(ns_name, tap)
connect_tap_to_veth(ns_name, tap, guest_veth) // bridge for L2
// Assign bridge IPs: 10.0.2.1/24 + fd00::1/64
// Host veth: enable forwarding, assign link-local
// Namespace: default IPv6 route via host veth link-local
// Host: /128 route to VM IPv6 via host veth
// Proxy NDP on default interface
// ip6tables MASQUERADE for outbound (skipped if --ipv6-prefix is set)
// TCP proxy port forwarding on loopback IP (setns + tokio relay)
}

Grammar: `[HOSTIP:]HOSTPORT:GUESTPORT[/PROTO]`
Examples:
8080:80 # TCP port 8080 → guest:80
127.0.0.1:8080:80 # Bind to localhost only
8080:80/udp # UDP protocol
0.0.0.0:53:53/udp # DNS forwarding
Parsing Logic (network/types.rs):
impl PortMapping {
pub fn parse(s: &str) -> Result<Self> {
// Split on ':'
// Extract optional host IP
// Extract protocol suffix (/tcp or /udp)
// Default to TCP if not specified
}
}

Each VM has:
- Kernel: Shared across all VMs (read-only)
- Base rootfs: Shared base image with Podman + fc-agent
- CoW overlay: Per-VM writable layer (using btrfs reflinks)
- Volume mounts: Optional host directory mounts
/mnt/fcvm-btrfs/ # btrfs filesystem (CoW reflinks work here)
├── kernels/
│ ├── vmlinux.bin # Symlink to active kernel
│ └── vmlinux-{sha}.bin # Kernel (SHA of URL for cache key)
├── rootfs/
│ └── layer2-{sha}.raw # Base rootfs (~10GB, SHA of setup script)
├── initrd/
│ └── fc-agent-{sha}.initrd # fc-agent injection initrd (SHA of binary)
├── vm-disks/
│ └── vm-{id}/
│ └── disks/rootfs.raw # CoW reflink copy per VM
├── snapshots/
│ └── {snapshot-name}/
│ ├── vmstate.snap # VM memory snapshot
│ ├── disk.snap # Disk snapshot
│ └── config.json # VM configuration
├── state/ # VM state JSON files
└── cache/ # Downloaded cloud images
Goal: Share base rootfs across VMs, only store deltas per-VM.
Options:
- overlayfs (preferred for simplicity):

  mount -t overlay overlay \
    -o lowerdir=/base/rootfs,upperdir=/vm/upper,workdir=/vm/work \
    /vm/merged

- btrfs reflinks (current implementation, see the sketch after this list):

  cp --reflink=always /mnt/fcvm-btrfs/rootfs/layer2-{sha}.raw /mnt/fcvm-btrfs/vm-disks/{id}/disks/rootfs.raw
Benefits:
- Instant cloning (no disk copy)
- Shared memory pages across VMs
- Fast snapshot restore
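For illustration, the reflink step can be as small as shelling out to `cp`, as the current implementation does. A sketch (assumes the target directory lives on the btrfs volume; the helper name is hypothetical):

```rust
use std::path::Path;
use anyhow::{ensure, Context, Result};
use tokio::process::Command;

/// Create a CoW copy of the base rootfs for a new VM.
/// On btrfs, `cp --reflink=always` shares all extents, so the copy is O(1)
/// regardless of image size; it fails on filesystems without reflink support.
async fn clone_rootfs(base: &Path, vm_disk: &Path) -> Result<()> {
    if let Some(parent) = vm_disk.parent() {
        tokio::fs::create_dir_all(parent).await?;
    }
    let status = Command::new("cp")
        .arg("--reflink=always")
        .arg(base)
        .arg(vm_disk)
        .status()
        .await
        .context("spawning cp --reflink")?;
    ensure!(status.success(), "reflink copy failed: {status}");
    Ok(())
}
```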
Memory Snapshot: Firecracker native format
{
"snapshot_path": "/snapshots/warm/disk.snap",
"mem_file_path": "/snapshots/warm/memory.snap",
"snapshot_type": "Full"
}

Clone Process:
- Load snapshot via Firecracker API
- Create new CoW overlay disk
- Patch identity (MAC address, hostname, VM ID)
- Setup new networking (TAP device, ports)
- Resume VM
Identity Patching:
- Generate new MAC address
- Update hostname in guest
- Regenerate machine IDs
- Update MMDS with new config
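For example, the new MAC for a clone can be generated like this (a sketch; the actual scheme in fcvm may differ):

```rust
use rand::RngCore;

/// Generate a random locally-administered, unicast MAC address.
/// Setting bit 1 of the first octet marks it locally administered;
/// clearing bit 0 keeps it unicast (not multicast).
fn generate_mac() -> String {
    let mut bytes = [0u8; 6];
    rand::thread_rng().fill_bytes(&mut bytes);
    bytes[0] = (bytes[0] | 0x02) & 0xFE;
    bytes
        .iter()
        .map(|b| format!("{b:02x}"))
        .collect::<Vec<_>>()
        .join(":")
}
```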
Key Insight: pasta and Firecracker CANNOT share a TAP device (both need exclusive access). Solution: Use a Linux bridge (br0) for L2 forwarding between pasta and Firecracker inside a user namespace.
Architecture: pasta uses splice(2) zero-copy L4 translation for inbound port forwarding (host socket to namespace socket) and L2 TAP path for outbound VM traffic.
Topology:
Host │ User Namespace (unshare --user --net)
│
pasta <──────────────────┼── pasta0 ─┐
(L4 splice/TAP) │ │
│ br0 (10.0.2.1/24) ← namespace IP for health checks
│ │
│ tap-fc ─┘
│ │
│ ▼
│ Firecracker VM
│ eth0: 10.0.2.100
Why Bridge Instead of IP Forwarding?
- Bridge operates at L2 (MAC addresses) - preserves source MAC for proper ARP/NDP learning
- pasta expects traffic from specific MAC addresses for its internal NAT tables
- IP forwarding rewrites source MAC, breaking pasta's connection tracking
- Bridge also enables IPv6 with proper NDP neighbor discovery
Setup Sequence (3-phase with nsenter):
- Spawn holder process: `unshare --user --net -- sleep infinity` (UID/GID mappings written externally)
- Pre-pasta setup via nsenter: create Firecracker TAP device only
- Start pasta attached to holder's namespace (creates pasta0 TAP)
- Post-pasta setup via nsenter: create bridge, attach pasta0 + tap-fc, add namespace IP
- Run Firecracker via nsenter: `nsenter -t HOLDER_PID -U -n -- firecracker ...`
- Health checks via nsenter: `nsenter -t HOLDER_PID -U -n -- curl 10.0.2.100:80`
Pre-Pasta Setup Script (Phase 2, executed via nsenter):
# Create TAP device for Firecracker (pasta creates its own TAP separately)
ip tuntap add tap-fc mode tap
ip link set tap-fc up
ip link set lo up

Post-Pasta Bridge Script (Phase 4, executed via nsenter after pasta starts):
# Bring pasta0 up (pasta creates it but doesn't bring it up without --config-net)
ip link set pasta0 up
# Create L2 bridge — connects pasta0 and Firecracker TAP
ip link add br0 type bridge
ip link set br0 up
ip link set pasta0 master br0
ip link set tap-fc master br0
# Add IP to bridge for health checks
# This enables nsenter to route to guest via the 10.0.2.x subnet
ip addr add 10.0.2.1/24 dev br0
# Enable IP forwarding
echo 1 > /proc/sys/net/ipv4/ip_forward

Port Forwarding (unique loopback IPs):
# Each VM gets a unique loopback IP (127.x.y.z) for port forwarding
# No IP aliasing needed - Linux routes all 127.0.0.0/8 to loopback
# Port forwarding is configured via pasta CLI flags:
# -t <bind_addr>/<host_port>:<guest_port> for TCP
# -u <bind_addr>/<host_port>:<guest_port> for UDP
pasta \
--foreground \
--quiet \
-P <pid-file> \
--ns-ifname pasta0 \
-a 10.0.2.100 \
-n 255.255.255.0 \
-g 10.0.2.2 \
--no-dhcp \
-t 127.0.0.2/8080:80 \
-T none -U none \
<holder-pid>

Traffic Flow (VM to Internet):
Guest (10.0.2.100) → tap-fc → br0 (L2) → pasta0 → pasta → Host → Internet
Traffic Flow (Health Check from namespace):
nsenter curl → br0 (10.0.2.1) → L2 forward → tap-fc → Guest (10.0.2.100:80)
Traffic Flow (Host to VM port forward):
Host (127.0.0.x:8080) → pasta → pasta0 → br0 (L2) → tap-fc → Guest (10.0.2.100:80)
IPv6 Support:
- pasta has native IPv6 support (no custom build needed)
- Guest uses fd00::2 (IPv6 gateway), fd00::100 (guest IPv6)
- Guest DNS uses host DNS servers directly (via `fcvm_dns=` kernel cmdline parameter)
- fc-agent sends gratuitous NDP NA at boot for MAC learning
- On snapshot restore, fc-agent re-sends NDP NA to teach new pasta process
Characteristics:
- No root required (runs entirely in user namespace)
- All VMs use same 10.0.2.x subnet (isolated by user namespace)
- Unique loopback IP per VM enables same port on multiple VMs
- Bridge-based L2 preserves MAC addresses for proper pasta ARP/NDP learning
- Namespace IP (10.0.2.1) enables health checks via nsenter
- IPv6 support with native pasta IPv6 forwarding
- Works in nested VMs and restricted environments
- Fully compatible with rootless Podman in guest
In rootless mode, outbound IPv4 TCP from the guest requires a transparent proxy because there is no NAT gateway. The egress proxy multiplexes all outbound TCP connections over a single vsock connection using a frame-based protocol.
Note: Routed mode does not use the egress proxy. All external traffic in routed mode goes natively through the kernel's IPv6 routing stack at line rate. IPv4 stays internal to the namespace (for health checks and port forwarding only).
Architecture:
Guest VM Host
───────── ────
App connects to 93.184.216.34:80
↓
iptables REDIRECT → proxy (127.0.0.1:12345)
↓
SO_ORIGINAL_DST → 93.184.216.34:80
↓
Assign stream_id, send OPEN frame
↓ ↓
Single persistent vsock (port 52000) ───→ UnixStream reader
↓ OPEN → spawn TCP to destination
↓ send OPEN_OK back
↓ (OPEN_OK received)
Bidirectional DATA frames ←──→ DATA frames relayed to/from TCP
↓
CLOSE frame when done ───→ Close TCP, cleanup
Frame Format (10-byte header):
- `stream_id` (u32 LE): unique per TCP connection
- `frame_type` (u8): OPEN=1, DATA=2, CLOSE=3, RST=4, OPEN_OK=5, OPEN_FAIL=6
- `flags` (u8): reserved
- `payload_len` (u32 LE): payload length after header
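A sketch of encoding and decoding this header with tokio (struct and function names are illustrative, not the actual proxy code):

```rust
use anyhow::Result;
use tokio::io::{AsyncReadExt, AsyncWriteExt};

/// One multiplexed frame: 10-byte header followed by the payload.
struct Frame {
    stream_id: u32,
    frame_type: u8, // OPEN=1, DATA=2, CLOSE=3, RST=4, OPEN_OK=5, OPEN_FAIL=6
    payload: Vec<u8>,
}

async fn write_frame<W: AsyncWriteExt + Unpin>(w: &mut W, f: &Frame) -> Result<()> {
    let mut header = [0u8; 10];
    header[0..4].copy_from_slice(&f.stream_id.to_le_bytes());
    header[4] = f.frame_type;
    header[5] = 0; // flags: reserved
    header[6..10].copy_from_slice(&(f.payload.len() as u32).to_le_bytes());
    w.write_all(&header).await?;
    w.write_all(&f.payload).await?;
    Ok(())
}

async fn read_frame<R: AsyncReadExt + Unpin>(r: &mut R) -> Result<Frame> {
    let mut header = [0u8; 10];
    r.read_exact(&mut header).await?;
    let stream_id = u32::from_le_bytes(header[0..4].try_into()?);
    let frame_type = header[4];
    let len = u32::from_le_bytes(header[6..10].try_into()?) as usize;
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload).await?;
    Ok(Frame { stream_id, frame_type, payload })
}
```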
Guest Side (fc-agent/src/proxy.rs):
- iptables REDIRECT captures outbound TCP (excluding local, link-local, MMDS)
- Single writer task serializes frame writes to vsock
- Reader task routes incoming frames to per-stream channels via `DashMap`
- Per-connection handler: accept → SO_ORIGINAL_DST → OPEN → relay DATA
Host Side (src/network/egress_proxy.rs):
- Accepts vsock connections from guest
- OPEN frames spawn TCP connections to real destinations
- DATA frames relayed bidirectionally between vsock and TCP
- CLOSE/RST frames trigger cleanup
Snapshot Restore: After VIRTIO_VSOCK_EVENT_TRANSPORT_RESET, a tokio::sync::Notify
signal breaks the stale mux session and triggers immediate vsock reconnection.
Topology:
┌─────────────────────────────────────────────────────────────────┐
│ Host Namespace │
│ ┌──────────────┐ veth pair ┌──────────────────┐ │
│ │ veth_outer │◄─────────────────────────►│ VM Namespace │ │
│ │ 172.30.x.1 │ │ (fcvm-vm-xxxxx) │ │
│ └──────────────┘ │ │ │
│ │ veth_inner │ │
│ iptables DNAT (scoped to veth IP): │ 172.30.x.2 │ │
│ -d 172.30.x.1 --dport 8080 → 172.30.x.2 │ │ │ │
│ │ ▼ │ │
│ │ ┌──────────┐ │ │
│ │ │ TAP │ │ │
│ │ └────┬─────┘ │ │
│ │ │ │ │
│ │ ┌────▼─────┐ │ │
│ │ │Firecracker│ │ │
│ │ │eth0: │ │ │
│ │ │172.30.x.2 │ │ │
│ │ └───────────┘ │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Accessing port-forwarded services:
# Curl the veth's host IP (172.30.x.1), NOT localhost
curl http://172.30.x.1:8080
# Get the veth IP from VM state
fcvm ls --json | jq '.[0].config.network.host_ip'

iptables Rules (from src/network/portmap.rs):
# DNAT for external traffic - scoped to veth's host IP to avoid port conflicts
# Each VM has unique veth IP (172.30.x.y) so same port works across VMs
iptables -t nat -A PREROUTING -d 172.30.x.1 -p tcp --dport 8080 -j DNAT --to-destination 172.30.x.2:80
# DNAT for localhost traffic (OUTPUT chain) - also scoped to veth IP
iptables -t nat -A OUTPUT -d 172.30.x.1 -p tcp --dport 8080 -j DNAT --to-destination 172.30.x.2:80
# MASQUERADE for outbound (guest → internet)
iptables -t nat -A POSTROUTING -s 172.30.x.0/30 -j MASQUERADE

Accessing port-forwarded services:
# Curl the veth's host IP (172.30.x.1), NOT localhost
curl http://172.30.x.1:8080

IP Allocation:
- Each VM gets a unique /30 subnet: `172.30.{x}.{y}/30`
- Veth host IP: `172.30.{x}.{y}` (used for port forwarding)
- Guest IP: `172.30.{x}.{y+1}`
Key Insight: pasta's userspace L4 translation adds latency and doesn't scale to many parallel clones. Solution: Use veth pairs with native IPv6 routing through the kernel stack at line rate.
Topology:
┌─────────────────────────────────────────────────────────────────┐
│ Host Namespace │
│ ┌──────────────┐ veth pair ┌──────────────────┐ │
│ │ veth-host │◄─────────────────────────►│ Namespace │ │
│ │ (link-local) │ │ (fcvm-xxxxxxxx) │ │
│ └──────────────┘ │ │ │
│ │ │ veth-ns │ │
│ proxy NDP on eth0 │ │ │ │
│ ip6tables MASQUERADE │ ┌────▼───────┐ │ │
│ route: vm_ipv6/128 via veth-host │ │ br0 │ │ │
│ │ │ 10.0.2.1 │ │ │
│ │ │ fd00::1 │ │ │
│ │ └────┬───────┘ │ │
│ │ │ │ │
│ │ ┌────▼─────┐ │ │
│ │ │ TAP │ │ │
│ │ └────┬─────┘ │ │
│ │ │ │ │
│ │ ┌────▼─────┐ │ │
│ │ │Firecracker│ │ │
│ │ │eth0: │ │ │
│ │ │10.0.2.100 │ │ │
│ │ │vm_ipv6/128│ │ │
│ │ └───────────┘ │ │
│ └──────────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Setup Sequence (14 steps):
- Preflight: verify root, global IPv6, ip6tables
- Detect host IPv6 /64 subnet (supports direct /64 or AWS-style /128 with on-link /64 route)
- Generate deterministic VM IPv6 from host prefix + hash of vm_id, with collision detection (sketched after this list)
- Create network namespace (`ip netns add fcvm-XXXX`)
- Create veth pair, move guest side to namespace
- Create TAP device in namespace, connect to bridge (br0) with veth
- Assign bridge IPs: `10.0.2.1/24` (IPv4 gateway) + `fd00::1/64` (IPv6 gateway, nodad)
- Bring up host veth, enable per-interface IPv6 forwarding
- Assign EUI-64 link-local to host veth (auto-assignment fails when `all.forwarding=1`)
- Namespace default IPv6 route: via host veth link-local through bridge
- Host: route `vm_ipv6/128` via host veth
- Proxy NDP for vm_ipv6 on default interface (so the network fabric routes to this host)
- ip6tables MASQUERADE on outbound interface (required for AWS source/dest check)
- TCP proxy port forwarding on unique loopback IP (127.x.y.z)
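The deterministic address in the hash step can be derived by folding a digest of the vm_id into the host's /64, roughly like this (a sketch; the exact hash and collision handling in `routed.rs` may differ, and the prefix in the comment is made up):

```rust
use std::net::Ipv6Addr;
use sha2::{Digest, Sha256};

/// Derive a stable IPv6 address for a VM from the host /64 prefix and the vm_id.
/// The low 64 bits come from a SHA-256 of the vm_id, so the same VM always maps
/// to the same address within a given prefix.
fn vm_ipv6(prefix: Ipv6Addr, vm_id: &str) -> Ipv6Addr {
    let digest = Sha256::digest(vm_id.as_bytes());
    let mut low = [0u8; 8];
    low.copy_from_slice(&digest[..8]);
    let host_bits = u64::from_be_bytes(low);
    // Keep the upper 64 bits of the prefix, replace the lower 64 with the hash.
    let prefix_bits = u128::from(prefix) & !(u64::MAX as u128);
    Ipv6Addr::from(prefix_bits | u128::from(host_bits))
}

// e.g. vm_ipv6("2600:1f13:abcd:1200::".parse().unwrap(), "vm-abc123")
```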
Port Forwarding (built-in TCP proxy + loopback IP, same as rootless):
# Rust TCP proxy: bind on host loopback, connect inside namespace via setns(2)
Host 127.0.0.2:8080 → tcp_proxy → setns(namespace) → connect 10.0.2.100:80
# Bidirectional relay via tokio::io::copy_bidirectional
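A sketch of that relay loop (the `setns` step that makes the outbound connect happen inside the VM's namespace is omitted here; names and addresses are illustrative):

```rust
use anyhow::Result;
use tokio::io::copy_bidirectional;
use tokio::net::{TcpListener, TcpStream};

/// Accept connections on the per-VM loopback IP and relay each one to the
/// guest service. Entering the VM's network namespace before the connect
/// (setns) is assumed to have been handled elsewhere.
async fn forward(loopback: &str, host_port: u16, guest: &str, guest_port: u16) -> Result<()> {
    let listener = TcpListener::bind((loopback, host_port)).await?;
    loop {
        let (mut inbound, _) = listener.accept().await?;
        let target = format!("{guest}:{guest_port}");
        tokio::spawn(async move {
            if let Ok(mut outbound) = TcpStream::connect(target.as_str()).await {
                // Relay bytes in both directions until either side closes.
                let _ = copy_bidirectional(&mut inbound, &mut outbound).await;
            }
        });
    }
}
```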
Traffic Flow (VM to Internet, IPv6):
Guest → TAP → br0 → veth-ns → veth-host → host kernel → ip6tables MASQUERADE → eth0 → Internet
Traffic Flow (Internet to VM, IPv6):
Internet → eth0 → proxy NDP → host kernel → route vm_ipv6/128 via veth-host → veth-ns → br0 → TAP → Guest
Traffic Flow (Host to VM port forward):
Host (127.0.0.x:8080) → tcp_proxy (setns) → 10.0.2.100:80 → br0 → TAP → Guest
Traffic Flow (Health check):
ip netns exec curl → br0 (10.0.2.1) → L2 forward → TAP → Guest (10.0.2.100:80)
IPv6 Addressing:
- Each VM gets a deterministic IPv6 from the host's /64 subnet (hash of vm_id)
- VM uses /128 prefix (fc-agent configures via boot parameter) to prevent on-link NDP for other subnet addresses
- fd00::1 on the bridge serves as the VM's IPv6 gateway
- Proxy NDP advertises the VM's IPv6 on the host's physical interface
Cleanup (on VM exit):
- Abort TCP proxy tasks (in-process, no external PIDs)
- Remove ip6tables MASQUERADE rule (scoped to vm_ipv6/128)
- Remove proxy NDP entry
- Remove host route (uses `dev` qualifier for parallel safety)
- Delete veth pair (auto-deletes peer)
- Delete namespace
Characteristics:
- Requires root and global IPv6 on host
- Kernel line-rate IPv6 (no userspace proxy for traffic forwarding)
- Each VM gets unique /128 IPv6 — parallel clones route correctly without NAT
- IPv4 internal only (10.0.2.x for health checks, no external IPv4 routing)
- Port forwarding via built-in TCP proxy + loopback IP (same model as rootless)
- All resources per-VM: no shared state, clean parallel operation
┌─────────────────────────────────────────────────────────┐
│ 1. Parse CLI arguments │
│ - Image, vCPU, memory, ports, volumes, snapshot name │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Detect execution mode (rootless/bridged/routed) │
│ - Check for root privileges │
│ - Check for /dev/kvm access │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Setup networking │
│ - Create TAP device (bridged/routed) or prepare pasta│
│ - Parse port mappings │
│ - Generate MAC address │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Prepare disks │
│ - Create CoW overlay from base rootfs │
│ - Setup volume mounts (block/sshfs/nfs) │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 5. Start Firecracker process │
│ - Spawn with Unix socket API │
│ - Wait for socket ready │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 6. Configure VM via API │
│ - set_boot_source (kernel) │
│ - set_machine_config (vCPU, memory) │
│ - add_drive (rootfs) │
│ - add_network_interface (TAP device) │
│ - set_mmds_config (metadata service) │
│ - put_mmds (container plan) │
│ - set_balloon (memory balloon if configured) │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 7. Start VM │
│ - put_action(InstanceStart) │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 8. Stream serial console logs │
│ - Open serial console device │
│ - Stream to stdout/file based on --logs flag │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 9. Wait for readiness (if --wait-ready specified) │
│ - vsock: Wait for guest connection │
│ - http: Poll HTTP endpoint │
│ - log: Search serial console for pattern │
│ - exec: Execute command in guest │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 10. Save snapshot (if --save-snapshot specified) │
│ - create_snapshot(memory + disk) │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 11. Setup signal handlers │
│ - SIGINT/SIGTERM → graceful shutdown │
│ - SIGCHLD → detect VM exit │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 12. Wait for VM exit or signal │
│ - Process blocks here (hanging mode) │
│ - VM lifetime = process lifetime │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 13. Cleanup │
│ - Kill Firecracker process │
│ - Remove TAP device │
│ - Remove NAT rules │
│ - Clean up temp files │
└─────────────────────────────────────────────────────────┘
Step 1: Create Snapshot (fcvm snapshot create)
┌─────────────────────────────────────────────────────────┐
│ 1. Pause the running VM │
│ - Firecracker API: pause │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Create Firecracker snapshot │
│ - Snapshot memory to file │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Copy disk via reflink (VM still paused) │
│ - Ensures memory/disk consistency │
│ - Reflink is O(1) — no pause time impact │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Resume the original VM │
│ - VM continues running │
└─────────────────────────────────────────────────────────┘
Step 2: Start Memory Server (fcvm snapshot serve)
┌─────────────────────────────────────────────────────────┐
│ 1. Load snapshot memory file (mmap, MAP_SHARED) │
│ - Kernel shares physical pages via page cache │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Create Unix socket for clone connections │
│ - /mnt/fcvm-btrfs/uffd-{snapshot}-{pid}.sock │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Register state in state manager │
│ - process_type: "serve" │
│ - snapshot_name │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Wait for clone connections (async) │
│ - Handle UFFD page faults from clones │
│ - Serve memory pages on-demand │
└─────────────────────────────────────────────────────────┘
Step 3: Spawn Clone (fcvm snapshot run)
┌─────────────────────────────────────────────────────────┐
│ 1. Create CoW overlay disk (btrfs reflink) │
│ - cp --reflink=always (~1.5ms) │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 2. Setup new networking │
│ - Generate new MAC address │
│ - Create TAP device (bridged) or pasta (rootless) │
│ - Allocate loopback IP for health checks │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 3. Start Firecracker with UFFD backend │
│ - Connect to memory server's Unix socket │
│ - Firecracker fetches pages via UFFD on access │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 4. Load snapshot via Firecracker API │
│ - track_dirty_pages = !hugepages │
│ - resume_vm = true │
└────────────────┬────────────────────────────────────────┘
▼
┌─────────────────────────────────────────────────────────┐
│ 5. VM resumes (< 1 second total startup) │
│ - Memory pages loaded on-demand │
│ - Shared pages via kernel page cache │
└─────────────────────────────────────────────────────────┘
Goal: VM dies when fcvm podman run process exits.
Implementation (using tokio::signal):
use tokio::signal::unix::{signal, SignalKind};
#[tokio::main]
async fn main() -> Result<()> {
let mut sigterm = signal(SignalKind::terminate())?;
let mut sigint = signal(SignalKind::interrupt())?;
// Start VM
let mut vm = VmManager::new(...);
vm.start().await?;
// Wait for signal or VM exit
tokio::select! {
_ = sigterm.recv() => {
info!("received SIGTERM, shutting down");
vm.kill().await?;
}
_ = sigint.recv() => {
info!("received SIGINT, shutting down");
vm.kill().await?;
}
status = vm.wait() => {
info!("VM exited with status: {:?}", status);
}
}
// Cleanup
network.cleanup().await?;
Ok(())
}

Graceful Shutdown:
- Receive SIGTERM/SIGINT
- Send shutdown signal to Firecracker
- Wait up to 10 seconds for graceful exit
- Force kill if timeout
- Clean up network resources
- Remove temporary files
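A sketch of the timeout handling in these steps, using the `VmManager` from the Firecracker section (the graceful `shutdown()` call is hypothetical; `wait()` and `kill()` are the methods shown earlier):

```rust
use std::time::Duration;
use anyhow::Result;
use tokio::time::timeout;
use tracing::{info, warn};

/// Graceful-then-forced shutdown: ask the guest to power off, wait up to
/// 10 seconds, and force-kill the Firecracker process if it doesn't exit.
async fn shutdown_vm(vm: &mut VmManager) -> Result<()> {
    vm.shutdown().await?; // hypothetical: request a clean guest shutdown
    if timeout(Duration::from_secs(10), vm.wait()).await.is_err() {
        warn!("graceful shutdown timed out, force killing");
        vm.kill().await?;
    }
    info!("VM stopped, cleaning up network and temp files");
    Ok(())
}
```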
Location: fc-agent/src/main.rs
Runs inside the Firecracker VM as a systemd service.
Responsibilities:
- Fetch container plan from MMDS (Metadata Service)
- Launch Podman with correct configuration
- Stream container logs to serial console
- Signal readiness to host (via vsock)
- Handle container lifecycle
Firecracker provides a metadata service accessible at http://169.254.169.254/.
Container Plan Format:
{
"image": "nginx:latest",
"env": {
"KEY": "VALUE",
"DB_HOST": "localhost"
},
"cmd": ["/bin/sh", "-c", "nginx -g 'daemon off;'"],
"volumes": [
{
"host": "/data",
"guest": "/mnt/data",
"readonly": false
}
],
"podman": {
"rootless": true,
"network": "host",
"privileged": false
},
"readiness": {
"mode": "http",
"url": "http://127.0.0.1:80/health"
},
"logs": {
"mode": "stream"
}
}

#[tokio::main]
async fn main() -> Result<()> {
// 1. Fetch plan from MMDS
let plan = fetch_mmds_plan().await?;
// 2. Build Podman command
let mut cmd = Command::new("podman");
cmd.arg("run").arg("--rm");
// Network mode
if plan.podman.network == "host" {
cmd.arg("--network=host");
}
// Environment variables
for (key, val) in plan.env {
cmd.arg("-e").arg(format!("{}={}", key, val));
}
// Volume mounts
for vol in plan.volumes {
let mount = if vol.readonly {
format!("{}:{}:ro", vol.guest, vol.guest)
} else {
format!("{}:{}", vol.guest, vol.guest)
};
cmd.arg("-v").arg(mount);
}
// Image
cmd.arg(&plan.image);
// Command override
if let Some(cmd_override) = plan.cmd {
cmd.args(cmd_override);
}
// 3. Spawn container
let mut child = cmd
.stdout(Stdio::piped())
.stderr(Stdio::piped())
.spawn()?;
// 4. Stream logs to serial console
stream_to_console(child.stdout.take(), "stdout").await;
stream_to_console(child.stderr.take(), "stderr").await;
// 5. Signal readiness (vsock or log marker)
if let Some(readiness) = plan.readiness {
signal_ready(readiness).await?;
}
// 6. Wait for container exit
let status = child.wait().await?;
eprintln!("[agent] container exited: {}", status);
Ok(())
}
async fn fetch_mmds_plan() -> Result<Plan> {
loop {
match reqwest::get("http://169.254.169.254/").await {
Ok(resp) => return resp.json().await.context("parsing MMDS"),
Err(_) => {
tokio::time::sleep(Duration::from_millis(500)).await;
}
}
}
}

The guest is configured to support rootless Podman:
- User setup (`create-rootfs-debian.sh`):

  # Create podman user
  useradd -m -s /bin/bash podman
  # Setup subuid/subgid ranges
  echo "podman:100000:65536" >> /etc/subuid
  echo "podman:100000:65536" >> /etc/subgid

- Podman configuration:

  # Enable unprivileged port binding
  sysctl -w net.ipv4.ip_unprivileged_port_start=0
  # Use crun runtime (faster than runc)
  podman --runtime=crun

- fc-agent runs as:
  - Root: when `podman.rootless = false`
  - podman user: when `podman.rootless = true`
fcvm provides full interactive terminal support for both podman run -it and exec -it, matching docker/podman semantics.
┌─────────────────────────────────────────────────────────────────────────┐
│ Host Process (fcvm) │
│ ┌──────────────┐ ┌──────────────────┐ ┌──────────────────────┐│
│ │ User Terminal│────►│ Raw Mode Handler │────►│ exec_proto Encoder ││
│ │ (stdin/out) │◄────│ (tcsetattr) │◄────│ (binary framing) ││
│ └──────────────┘ └──────────────────┘ └──────────┬───────────┘│
│ │ │
└───────────────────────────────────────────────────────────┼────────────┘
│ vsock
▼
┌───────────────────────────────────────────────────────────────────────┐
│ Guest (fc-agent) │ │
│ ┌──────────────────┐ ┌──────────────┐ ┌──────────▼─────────┐ │
│ │ exec_proto │────►│ PTY Master │────►│ Container Process │ │
│ │ Decoder │◄────│ (openpty) │◄────│ (sh, vim, etc.) │ │
│ └──────────────────┘ └──────────────┘ └────────────────────┘ │
└───────────────────────────────────────────────────────────────────────┘
| Flag | Meaning | Implementation |
|---|---|---|
| `-i` | Keep stdin open | Host reads stdin, sends via STDIN messages |
| `-t` | Allocate PTY | Guest allocates PTY master/slave pair |
| `-it` | Both | Interactive shell with full terminal support |
| neither | Plain exec | Pipes for stdin/stdout, no PTY |
Binary framed protocol over vsock for efficient transport of terminal data:
┌─────────┬─────────┬──────────────────┐
│ Type(1) │ Len(4) │ Payload(N) │
└─────────┴─────────┴──────────────────┘
Message Types:
- `DATA` (0x00): Output from command (stdout/stderr)
- `STDIN` (0x01): Input from user terminal
- `EXIT` (0x02): Command exit code (4 bytes, big-endian i32)
- `ERROR` (0x03): Error message string
Why binary framing?
- Handles escape sequences (Ctrl+C = 0x03, Ctrl+D = 0x04)
- Preserves all bytes without escaping
- Efficient for high-throughput terminal output
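A sketch of decoding one message of the format above (the enum mirrors the Message Types list; the 4-byte length field is assumed big-endian here, and names are illustrative rather than the real exec_proto API):

```rust
use anyhow::{bail, Result};
use tokio::io::AsyncReadExt;

/// exec_proto wire format: Type (1 byte), Len (4 bytes), Payload (Len bytes).
enum Message {
    Data(Vec<u8>),  // 0x00: command output
    Stdin(Vec<u8>), // 0x01: user input
    Exit(i32),      // 0x02: exit code, big-endian i32 payload
    Error(String),  // 0x03: error message
}

async fn read_message<R: AsyncReadExt + Unpin>(r: &mut R) -> Result<Message> {
    let msg_type = r.read_u8().await?;
    let len = r.read_u32().await? as usize; // length assumed big-endian
    let mut payload = vec![0u8; len];
    r.read_exact(&mut payload).await?;
    Ok(match msg_type {
        0x00 => Message::Data(payload),
        0x01 => Message::Stdin(payload),
        0x02 => {
            let bytes: [u8; 4] = payload.as_slice().try_into()?;
            Message::Exit(i32::from_be_bytes(bytes))
        }
        0x03 => Message::Error(String::from_utf8_lossy(&payload).into_owned()),
        other => bail!("unknown exec_proto message type: {other}"),
    })
}
```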
// 1. Set terminal to raw mode
let original = tcgetattr(stdin)?;
let mut raw = original.clone();
cfmakeraw(&mut raw);
tcsetattr(stdin, TCSANOW, &raw)?;
// 2. Spawn reader/writer tasks
tokio::spawn(async move {
// Reader: terminal stdin → vsock
loop {
let n = stdin.read(&mut buf)?;
exec_proto::write_stdin(&mut vsock, &buf[..n])?;
}
});
tokio::spawn(async move {
// Writer: vsock → terminal stdout
loop {
match exec_proto::Message::read_from(&mut vsock)? {
Message::Data(data) => stdout.write_all(&data)?,
Message::Exit(code) => return code,
_ => {}
}
}
});

// 1. Allocate PTY
let (master, slave) = openpty()?;
// 2. Fork child process
match fork() {
0 => {
// Child: setup PTY as controlling terminal
setsid();
ioctl(slave, TIOCSCTTY, 0);
dup2(slave, STDIN);
dup2(slave, STDOUT);
dup2(slave, STDERR);
execvp(command, args);
}
pid => {
// Parent: relay between vsock and PTY master
// Reader thread: PTY master → vsock (DATA messages)
// Writer thread: vsock (STDIN messages) → PTY master
}
}

- Escape sequences: Colors (ANSI), cursor movement, screen clearing
- Control characters: Ctrl+C (SIGINT), Ctrl+D (EOF), Ctrl+Z (SIGTSTP)
- Line editing: Arrow keys, backspace, history (shell-dependent)
- Full-screen apps: vim, htop, less, nano, tmux
- Window resize (SIGWINCH): Not implemented. Terminal size is fixed at session start.
- Job control: Background/foreground (`bg`, `fg`) work within the container, but signals are not forwarded to the host.
Full CLI documentation with examples: See README.md
| Command | Purpose |
|---|---|
| `fcvm setup` | Download kernel, create rootfs (first-time setup, ~5-10 min) |
| `fcvm podman run` | Launch container in Firecracker VM |
| `fcvm exec` | Execute command in running VM/container |
| `fcvm ls` | List running VMs |
| `fcvm snapshot create` | Create snapshot from running VM |
| `fcvm snapshot serve` | Start UFFD memory server for cloning |
| `fcvm snapshot run` | Spawn clone from memory server |
| `fcvm snapshots` | List available snapshots |
- Trailing arguments: Both `podman run` and `exec` support trailing args after `--`:

  fcvm podman run --name test alpine:latest echo "hello"
  fcvm exec --name test -it -- sh -c "ls -la"

- `--name` vs `--pid`: VM identification uses named flags (not positional):

  fcvm exec --name my-vm -- hostname   # By name
  fcvm exec --pid 12345 -- hostname    # By PID

- Interactive flags (`-i`, `-t`): Match docker/podman semantics:
  - `-i`: Keep stdin open
  - `-t`: Allocate PTY
  - `-it`: Both (interactive shell)
- Network modes: `--network rootless` (default, no sudo), `--network bridged` (sudo), or `--network routed` (sudo + IPv6)
Purpose: Create a snapshot from a running VM.
Usage:
fcvm snapshot create [--pid <PID> | <VM_NAME>] [--tag <TAG>]

Options:
--pid <PID> fcvm process PID to snapshot
--tag <TAG> Snapshot name (defaults to VM name)
<VM_NAME> VM name to snapshot (alternative to --pid)
Examples:
# Create snapshot by PID
fcvm snapshot create --pid 12345 --tag my-snapshot
# Create snapshot by name
fcvm snapshot create my-vm --tag warm-nginx

Purpose: Start a UFFD memory server for cloning.
Usage:
fcvm snapshot serve <SNAPSHOT_NAME>

The memory server:
- Loads the snapshot's memory file
- Listens for clone connections via Unix socket
- Serves memory pages on-demand via UFFD (userfaultfd)
- Enables sharing physical pages across multiple clones
Example:
# Start memory server (blocks, keeps running)
fcvm snapshot serve my-snapshot

Purpose: Spawn a clone VM from a running memory server.
Usage:
fcvm snapshot run --pid <SERVE_PID> [OPTIONS]

Options:
--pid <SERVE_PID> Memory server PID (required)
--name <NAME> Clone VM name (auto-generated if not provided)
--exec <CMD> Execute command in container after clone is healthy
Network mode, port mappings, TTY, interactive flags, and --ipv6-prefix are inherited from the snapshot
metadata automatically — no need to re-specify them on clone.
Examples:
# Spawn a clone (inherits network/ports from snapshot)
fcvm snapshot run --pid 12345 --name clone1
# Multiple clones in parallel
for i in {1..10}; do
fcvm snapshot run --pid 12345 --name clone$i &
done
wait  # Lightning fast: all start in <1 second each

Purpose: List running memory servers.
fcvm snapshot ls

Purpose: List running VMs.
Usage:
fcvm ls [--json] [--pid <PID>]

Options:
--json Output in JSON format
--pid <PID> Filter by fcvm process PID
Example output:
NAME PID STATUS HEALTH NETWORK IMAGE
my-nginx 12345 running healthy bridged nginx:alpine
clone-1 12350 running healthy rootless (clone)
Purpose: List available snapshots.
fcvm snapshots

fcvm/
├── Cargo.toml # Workspace manifest
├── DESIGN.md # This document
├── README.md # User-facing documentation
├── Makefile # Build and test commands
├── Containerfile # Test container definition
│
├── src/ # Host CLI (fcvm binary)
│ ├── main.rs # Entry point
│ ├── lib.rs # Module exports
│ ├── paths.rs # Path utilities for btrfs layout
│ ├── health.rs # Health monitoring
│ │
│ ├── cli/ # Command-line parsing
│ │ ├── mod.rs
│ │ └── args.rs # Clap structures
│ │
│ ├── commands/ # CLI command implementations
│ │ ├── mod.rs
│ │ ├── common.rs # Shared utilities
│ │ ├── exec.rs # fcvm exec
│ │ ├── ls.rs # fcvm ls
│ │ ├── podman.rs # fcvm podman run
│ │ ├── setup.rs # fcvm setup
│ │ ├── snapshot.rs # fcvm snapshot {create,serve,run} + UFFD server
│ │ └── snapshots.rs # fcvm snapshots
│ │
│ ├── firecracker/ # Firecracker integration
│ │ ├── mod.rs
│ │ ├── api.rs # API client (hyper + hyperlocal)
│ │ └── vm.rs # VM manager
│ │
│ ├── network/ # Networking
│ │ ├── mod.rs
│ │ ├── bridged.rs # Bridged networking (iptables)
│ │ ├── pasta.rs # Rootless networking (pasta)
│ │ ├── routed.rs # Routed networking (IPv6 veth)
│ │ ├── namespace.rs # Network namespace management
│ │ ├── veth.rs # Veth pair management
│ │ ├── types.rs # Network types
│ │ ├── portmap.rs # Port mapping utilities
│ │ └── egress_proxy.rs # Host-side multiplexed egress proxy
│ │
│ ├── storage/ # Storage & snapshots
│ │ ├── mod.rs
│ │ ├── disk.rs # btrfs CoW disk management
│ │ ├── snapshot.rs # Snapshot management
│ │ └── volume.rs # Volume handling
│ │
│ ├── state/ # VM state management
│ │ ├── mod.rs
│ │ ├── types.rs # VmState, VmConfig
│ │ ├── manager.rs # StateManager (CRUD + loopback IPs)
│ │ └── utils.rs # State utilities
│ │
│ ├── uffd/ # UFFD memory server
│ │ ├── mod.rs
│ │ ├── server.rs # Userfaultfd page handler
│ │ └── handler.rs # UFFD event handler
│ │
│ ├── volume/ # FUSE volume handling
│ │ └── mod.rs # Host → guest filesystem mapping
│ │
│ └── setup/ # Setup utilities
│ ├── mod.rs
│ ├── preflight.rs # Pre-flight checks
│ ├── kernel.rs # Kernel setup
│ ├── kernel_build.rs # Kernel build
│ └── rootfs.rs # Rootfs setup
│
├── fc-agent/ # Guest agent crate
│ ├── Cargo.toml
│ └── src/
│ ├── main.rs # Entry point
│ ├── agent.rs # MMDS + Podman orchestration
│ ├── container.rs # Container lifecycle management
│ ├── exec.rs # Exec command handler
│ ├── mmds.rs # MMDS client + restore-epoch watcher
│ ├── mounts.rs # Mount setup (overlayfs, volumes)
│ ├── network.rs # ARP/NDP, TCP cleanup, localhost forwarding
│ ├── output.rs # Container log streaming via vsock
│ ├── proxy.rs # Guest-side multiplexed egress proxy
│ ├── restore.rs # Snapshot restore handler
│ ├── system.rs # System setup (sysctl, cgroups)
│ ├── tty.rs # TTY/PTY handling
│ ├── types.rs # Shared types (MMDS config)
│ └── vsock.rs # Vsock connection utilities
│
├── fuse-pipe/ # FUSE passthrough library
│ ├── Cargo.toml
│ ├── src/
│ │ ├── client/ # FUSE client (mounts in VM)
│ │ ├── server/ # Async server (runs on host)
│ │ ├── protocol/ # Wire protocol (request/response)
│ │ └── transport/ # vsock/Unix socket transport
│ ├── tests/ # Integration tests
│ └── benches/ # Performance benchmarks
│
└── tests/ # fcvm integration tests
├── common/mod.rs # Shared test utilities
├── test_sanity.rs # VM sanity tests (rootless + bridged)
├── test_snapshot_clone.rs # Snapshot/clone workflow tests
├── test_egress.rs # Egress proxy tests (rootless + bridged)
├── test_egress_stress.rs # Egress proxy stress tests
├── test_egress_proxy_bench.rs # 8000-connection benchmark
├── test_port_forward.rs # Port forwarding tests
├── test_rootless_ipv6.rs # IPv6 networking tests
├── test_exec.rs # Exec command tests
├── test_fuse_in_vm_matrix.rs # In-VM pjdfstest
├── test_localhost_image.rs
├── test_state_manager.rs
├── test_health_monitor.rs
└── ... # 35+ additional test files
fcvm (Cargo.toml):
[dependencies]
anyhow = "1"
clap = { version = "4", features = ["derive"] }
serde = { version = "1", features = ["derive"] }
serde_json = "1"
serde_yaml = "0.9"
tokio = { version = "1", features = ["full"] }
reqwest = { version = "0.11", features = ["json", "rustls-tls"] }
which = "6"
nix = { version = "0.29", features = ["user", "process", "signal", "ioctl", "net"] }
uuid = { version = "1", features = ["v4", "serde"] }
sha2 = "0.10"
tracing = "0.1"
tracing-subscriber = { version = "0.3", features = ["env-filter"] }
libc = "0.2"
hex = "0.4"
chrono = { version = "0.4", features = ["serde"] }
tempfile = "3"
rand = "0.8"
async-trait = "0.1"
hyper = { version = "0.14", features = ["client", "http1"] }
hyperlocal = "0.8"

fc-agent (fc-agent/Cargo.toml):
[dependencies]
anyhow = "1"
serde = { version = "1", features = ["derive"] }
serde_json = "1"
tokio = { version = "1", features = ["rt-multi-thread", "macros", "process", "io-util"] }
reqwest = { version = "0.11", features = ["json", "rustls-tls"] }

All builds are done via the root Makefile. See CLAUDE.md for the complete target list.
make build # Build fcvm + fc-agent
make test-root # Run all tests (requires sudo + KVM)
make help        # Show available targets

All fcvm data is stored under /mnt/fcvm-btrfs/ (btrfs filesystem for CoW reflinks).
Override with FCVM_BASE_DIR environment variable.
Layout (from src/paths.rs):
/mnt/fcvm-btrfs/
├── kernels/ # Kernel binaries
│ └── vmlinux-{sha}.bin
├── rootfs/ # Base rootfs images (contains /etc/fcvm-setup-complete marker)
│ └── layer2-{sha}.raw
├── initrd/ # fc-agent injection initrds
│ └── fc-agent-{sha}.initrd
├── vm-disks/ # Per-VM CoW disk copies
│ └── {vm-id}/disks/rootfs.raw
├── snapshots/ # Firecracker snapshots
├── state/ # VM state JSON files
│ └── {vm-id}.json
└── cache/ # Downloaded images and packages
├── ubuntu-24.04-arm64-{sha}.img # Cloud image cache
└── packages-{sha}/ # Downloaded .deb files
Rootfs Hash Calculation: The layer2-{sha}.raw name is computed from:
- Init script (embeds install + setup scripts)
- Kernel URL
- Download script (package list + Ubuntu codename)
This ensures automatic cache invalidation when any component changes.
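Conceptually the cache key is just a digest over those inputs (a sketch, not the exact `paths.rs` code):

```rust
use sha2::{Digest, Sha256};

/// Compute the rootfs cache key from everything that affects its contents.
/// If any input changes, the SHA changes and a new layer2-{sha}.raw is built.
fn rootfs_sha(init_script: &str, kernel_url: &str, download_script: &str) -> String {
    let mut hasher = Sha256::new();
    hasher.update(init_script.as_bytes());
    hasher.update(kernel_url.as_bytes());
    hasher.update(download_script.as_bytes());
    hex::encode(hasher.finalize())
}
```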
VM State (/mnt/fcvm-btrfs/state/{vm-id}.json):
{
"schema_version": 1,
"vm_id": "vm-abc123...",
"name": "my-nginx",
"status": "running",
"health_status": "healthy",
"exit_code": null,
"pid": 12345,
"created_at": "2025-01-09T12:00:00Z",
"last_updated": "2025-01-09T12:00:05Z",
"config": {
"image": "nginx:alpine",
"vcpu": 2,
"memory_mib": 1024,
"network": {
"tap_device": "tap-abc123",
"guest_ip": "172.16.29.2",
"loopback_ip": "127.0.0.2"
},
"volumes": [],
"process_type": "vm",
"snapshot_name": null,
"serve_pid": null
}
}

Strategy: Use `anyhow::Result` everywhere, with context.
Example:
use anyhow::{Context, Result, bail};
async fn setup_network() -> Result<NetworkConfig> {
create_tap_device("tap0")
.await
.context("creating TAP device for VM network")?;
add_to_bridge("tap0", "fcvmbr0")
.await
.context("adding TAP to bridge")?;
Ok(NetworkConfig { ... })
}

User-facing errors:
// In main.rs
if let Err(e) = run().await {
eprintln!("Error: {:#}", e); // Pretty print error chain
std::process::exit(1);
}

Setup (in main.rs):
use tracing_subscriber::{fmt, EnvFilter};
fn main() {
tracing_subscriber::fmt()
.with_env_filter(EnvFilter::from_default_env())
.with_target(false)
.init();
// ...
}

Usage:
use tracing::{info, warn, error, debug};
info!(vm_id = %vm.id(), "starting VM");
warn!(tap = "tap0", "TAP device already exists");
error!(error = %e, "failed to start Firecracker");
debug!(config = ?config, "loaded configuration");

Environment:
# Set log level
export RUST_LOG=fcvm=debug
# Run with debug logs
RUST_LOG=trace fcvm run nginx:latest

Network Mode Guards: The fcvm binary enforces proper network mode usage:
- Bridged without root: Fails with a helpful error message suggesting `sudo` or `--network rootless`
- Rootless with root: Runs, but prints a warning that bridged would be faster
Test Isolation: All tests use unique resource names to enable parallel execution:
- `unique_names()` helper generates timestamp+counter-based names (sketch below)
- PID-based naming for additional uniqueness
- Automatic cleanup on test exit
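A sketch of such a helper (the real `tests/common` implementation may differ):

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::time::{SystemTime, UNIX_EPOCH};

static COUNTER: AtomicU64 = AtomicU64::new(0);

/// Generate a VM/resource name unique across parallel test processes:
/// prefix + millisecond timestamp + PID + per-process counter.
fn unique_name(prefix: &str) -> String {
    let ts = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .unwrap()
        .as_millis();
    let n = COUNTER.fetch_add(1, Ordering::Relaxed);
    format!("{}-{}-{}-{}", prefix, ts, std::process::id(), n)
}
```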
Test Tier Organization (feature-gated):
- `test-unit`: No feature flags, fast tests without VMs
- `test-integration-fast`: `--features integration-fast,privileged-tests` (quick VM tests <30s)
- `test-root`: All features including `integration-slow` (pjdfstest, slow VM tests)
- Filter by name pattern: `make test-root FILTER=exec`
- Container configs: `CONTAINER_RUN_ROOTLESS` (unit) and `CONTAINER_RUN_ROOT` (VM tests)
Test individual components in isolation:
#[cfg(test)]
mod tests {
use super::*;
#[test]
fn test_parse_port_mapping() {
let pm = PortMapping::parse("8080:80").unwrap();
assert_eq!(pm.host_port, 8080);
assert_eq!(pm.guest_port, 80);
assert_eq!(pm.proto, Protocol::Tcp);
}
#[tokio::test]
async fn test_firecracker_client() {
// Mock Firecracker API
// Test API calls
}
}

Test full workflows:
#!/bin/bash
# tests/integration/test_run.sh
# Test rootless mode (no sudo required)
fcvm podman run --name test-nginx --network rootless nginx:alpine &
PID=$!
sleep 5
kill $PID
# Test bridged mode with port forwarding (requires sudo for iptables/TAP)
sudo fcvm podman run --name web --network bridged --publish 8080:80 nginx:alpine &
PID=$!
sleep 5
curl http://localhost:8080 # Should return nginx page
kill $PID
# Test snapshot & clone (rootless with port forwarding)
fcvm podman run --name baseline --network rootless --publish 9090:80 nginx:alpine &
BASELINE_PID=$!
sleep 5 # Wait for VM to be healthy
# Create snapshot
fcvm snapshot create --pid $BASELINE_PID --tag warm
# Start memory server
fcvm snapshot serve warm &
SERVE_PID=$!
sleep 2
# Spawn clone (inherits network mode + port mappings from snapshot)
fcvm snapshot run --pid $SERVE_PID --name clone1 &
CLONE_PID=$!
sleep 2
curl http://localhost:9090 # Should return nginx page in <2s
kill $CLONE_PID $SERVE_PID $BASELINE_PID

Note: Network mode is set on the baseline VM: `--network rootless` (default, no root required) or `--network bridged` (iptables/TAP, requires sudo). Clones inherit the network mode and port mappings from the snapshot automatically.
The fuse-pipe library passes the pjdfstest POSIX compliance suite. Tests run via make test-root or make container-test-root.
Test Counts:
- 237 total test files in pjdfstest
- 54 skipped on Linux (FreeBSD/ZFS/UFS-specific)
- 183 real test files run
- 8789 assertions pass
Skipped Categories (via quick_exit() - outputs trivial "ok 1"):
| Category | Files | Skipped | Real | Reason |
|---|---|---|---|---|
| granular | 7 | 7 | 0 | FreeBSD extended ACLs only |
| open | 26 | 8 | 18 | FreeBSD-specific open behaviors |
| link | 18 | 6 | 12 | FreeBSD hardlink semantics |
| rename | 25 | 5 | 20 | FreeBSD rename edge cases |
| rmdir | 16 | 4 | 12 | FreeBSD rmdir behaviors |
| ftruncate | 15 | 3 | 12 | FreeBSD:UFS specific |
| mkdir | 13 | 3 | 10 | FreeBSD:UFS specific |
| mkfifo | 13 | 3 | 10 | FreeBSD:UFS specific |
| symlink | 13 | 3 | 10 | FreeBSD:UFS specific |
| truncate | 15 | 3 | 12 | FreeBSD:UFS specific |
| unlink | 15 | 3 | 12 | FreeBSD:UFS specific |
| chflags | 14 | 2 | 12 | Some UFS-specific flags |
| chmod | 13 | 2 | 11 | FreeBSD:ZFS specific |
| chown | 11 | 2 | 9 | FreeBSD:ZFS specific |
| mknod | 12 | 0 | 12 | All run |
| posix_fallocate | 1 | 0 | 1 | All run |
| utimensat | 10 | 0 | 10 | All run |
Skip mechanism: Tests check ${os}:${fs} and call quick_exit() for unsupported OS/filesystem combinations. This outputs TAP format 1..1 + ok 1 (trivial pass) rather than running real assertions.
Goal: <1 second from fcvm clone to ready
Breakdown:
- Snapshot load: ~200ms
- Network setup: ~100ms
- Identity patching: ~50ms
- VM resume: ~300ms
- Container ready: ~300ms
- Total: ~950ms
Optimizations:
- Pre-warmed snapshot (container already running)
- CoW disks (no disk copy)
- Shared memory pages
- Fast network setup (TAP device creation)
Memory:
- Base VM: ~100MB overhead
- Shared kernel + rootfs: ~200MB (shared across all VMs)
- Per-VM: Container memory + ~100MB overhead
Example: 10 nginx VMs
- Traditional VMs: 10 × 512MB = 5GB
- fcvm with cloning: 200MB (shared) + 10 × 150MB = 1.7GB
- Savings: ~66%
CPU:
- Support vCPU overcommit (e.g., 32 vCPUs on 8 cores)
- KVM handles scheduling efficiently
- Minimal overhead when VMs are idle
- VM-level isolation: Full hardware virtualization via KVM
- No shared kernel: Each VM has its own kernel
- No container escape: Podman runs inside VM, not on host
- No root required: Entire stack runs as regular user
- User namespaces: pasta uses user namespaces
- No privileged operations: No sudo, no CAP_NET_ADMIN
- Requires CAP_NET_ADMIN: For TAP/iptables setup
- Minimal privileges: Only for network setup, not VM execution
- Firecracker jailer: Can use jailer for additional sandboxing (future)
- Snapshot contains full VM state: Including memory (may have secrets)
- Encrypt snapshots: Option to encrypt at rest (future)
- Access control: Snapshots stored in user-owned directories
Every kernel in fcvm is delivered through a profile. The [kernel] config section is synthesized into a "default" profile at load time, so all code paths use profiles uniformly. Named profiles (e.g., nested, btrfs) can build custom kernels from source or download from GitHub releases.
A profile delivers a kernel in one of three ways:
- URL-based (default profile): Downloads a pre-built kernel archive (e.g., Kata release)
- Custom build: Builds from source using
kernel_version/kernel_repo - Inherited: Uses the default profile's kernel, adding only runtime overrides (boot_args, firecracker_args, etc.)
# Custom kernel profile (build from source)
[kernel_profiles.minimal.amd64]
description = "Minimal kernel for fast boot"
kernel_version = "6.12"
kernel_repo = "your-org/your-kernel-repo"
build_inputs = ["kernel/minimal.conf", "kernel/patches/*.patch"]
kernel_config = "kernel/minimal.conf"
patches_dir = "kernel/patches"
# firecracker_bin = "/usr/local/bin/firecracker-custom"
# firecracker_args = "--extra-flag"
# boot_args = "quiet"

| Field | Required | Description |
|---|---|---|
| `kernel_url` | URL-based | URL to kernel archive (e.g., Kata release tarball) |
| `kernel_archive_path` | URL-based | Path within the archive to extract the kernel binary |
| `kernel_local_path` | No | Local filesystem path to kernel binary (overrides URL) |
| `kernel_version` | Custom | Kernel version (e.g., "6.18.3") |
| `kernel_repo` | Custom | GitHub repo for releases |
| `build_inputs` | Custom | Files to hash for kernel SHA (supports globs) |
| `base_config_url` | Custom | Base kernel .config URL (e.g., Firecracker's microvm config) |
| `kernel_config` | No | Kernel config fragment file path (applied on top of base) |
| `patches_dir` | No | Directory containing kernel patches |
| `firecracker_bin` | No | Custom Firecracker binary path |
| `firecracker_args` | No | Extra Firecracker CLI args |
| `boot_args` | No | Extra kernel boot parameters |
| `rootfs_type` | No | Root filesystem type: "ext4" (default) or "btrfs" (converts via btrfs-convert) |
- Config is source of truth: All kernel versions and build settings flow from `rootfs-config.toml`
- SHA computation: fcvm hashes all files matching `build_inputs` patterns
- Download first: Tries `kernel_repo` releases with tag `kernel-{profile}-{version}-{arch}-{sha}`
- Build fallback: If download fails and `--build-kernels` is set, Rust generates build scripts on-the-fly
- Config sync: `make build` syncs embedded config to `~/.config/fcvm/`
The rootfs is built from rootfs-config.toml:
[base]
version = "24.04"
codename = "noble"
[packages]
runtime = ["podman", "crun", "fuse-overlayfs", "skopeo"]
fuse = ["fuse3"]
system = ["haveged", "chrony"]
debug = ["strace"]
[services]
enable = ["haveged", "chrony", "systemd-networkd"]
disable = ["snapd", "cloud-init"]
[files."/etc/myconfig"]
content = """
my custom config
"""After changing the config, run fcvm setup to rebuild the rootfs with the new SHA.
--map volumes use FUSE-over-vsock with WRITEBACK_CACHE and AUTO_INVAL_DATA. When a host process modifies a file in a mapped directory, the guest sees the change on its next read — but only after the kernel detects the mtime change (up to ~1 second granularity). Writes within the same second may not be visible immediately.
Directory changes (new files, deletions) are subject to the kernel's directory entry cache TTL. A new file created on the host may not appear in guest readdir() until the cache expires.
There are no push notifications from host to guest. The guest discovers changes only on access. inotify/fanotify in the guest watches the FUSE mount, not the host filesystem, so host-side changes don't trigger guest notifications.
Potential fix: Use FUSE_NOTIFY_INVAL_INODE and FUSE_NOTIFY_INVAL_ENTRY — server-initiated invalidation notifications. The host VolumeServer would watch directories with inotify and push invalidations through the FUSE connection when files change. This is how production network filesystems (NFS, CIFS) handle it.
ARM64 FEAT_NV2 has architectural issues with cache coherency under double Stage 2 translation. The DSB SY kernel patch fixes this for vsock/FUSE data paths, but multi-vCPU L2 VMs still hit interrupt delivery issues (NETDEV WATCHDOG). L2 VMs are limited to single vCPU.
Snapshots are disabled when --map volumes are present because the FUSE-over-vsock connection state may not survive the pause/resume cycle cleanly. This means VMs with volume mounts always do a fresh boot. Block device mounts (--disk, --disk-dir) do not have this limitation.
- Persistent volumes:
  - Support Docker volumes API
  - Persistent storage across clones
- Custom networks:
  - User-defined networks
  - VM-to-VM communication
- Resource limits:
  - CPU pinning
  - Memory limits (cgroups)
  - I/O throttling
- Metrics & monitoring:
  - Prometheus exporter
  - Real-time resource graphs
- Snapshot encryption:
  - Encrypt memory snapshots
  - Key management
- Jailer integration:
  - Use Firecracker jailer for additional sandboxing
  - chroot, cgroups, seccomp
- Multi-host support:
  - Distribute VMs across multiple hosts
  - Remote snapshots
- Live migration:
  - Migrate running VMs between hosts
  - Zero-downtime updates
- GPU passthrough:
  - vGPU support for ML workloads
- Kubernetes integration:
  - Run as CRI runtime
  - Pod → Firecracker VM
- Firecracker: Lightweight VMM (Virtual Machine Monitor) from AWS
- microVM: Minimalistic virtual machine with fast boot times
- KVM: Kernel-based Virtual Machine, Linux's hypervisor
- MMDS: Micro Metadata Service, Firecracker's metadata API
- TAP device: Virtual network interface (TUN/TAP)
- pasta: L4 splice-based networking from the passt project for rootless containers
- CoW: Copy-on-Write, disk strategy for fast cloning
- iptables: Linux firewall/NAT configuration tool
- vsock: Virtual socket for host-guest communication
- Balloon device: Memory reclamation mechanism for VMs
Benchmarked on c6g.metal (64 ARM cores, 128GB RAM).
| Scenario | Time | Notes |
|---|---|---|
| Cold build (clean target) | 44s | ~12 parallel rustc processes |
| Incremental (touch main.rs) | 13s | Only recompiles fcvm |
| test-unit LIST (cold) | 24s | Compiles test binaries |
| test-unit LIST (warm) | 1.2s | No recompilation |
| Tool | Cold Build | Incremental | Verdict |
|---|---|---|---|
| Default (no tools) | 44s | 13.7s | Baseline |
| mold linker | 43s | 12.7s | ~1s savings, not worth config |
| sccache | 52s cold / 21s warm | 13s | Overhead > benefit for local dev |
Cargo parallelizes by crate, limited by the dependency graph:
- Early build: many leaf crates → high parallelism (11+ rustc)
- Late build: waiting on syn, tokio → low parallelism (1-3 rustc)
The 64 CPUs help within each crate (LLVM codegen), but crate-level parallelism is dependency-limited.
- Local dev: Use defaults. Incremental builds are fast (13s).
- CI: Consider sccache if rebuilding from scratch frequently.
- mold: Not worth it - linking is not the bottleneck.
Under parallel test load on ARM64, diff snapshots occasionally capture only ~94 KB of dirty pages instead of the expected ~37-43 MB. Restoring the merged snapshot kernel-panics with:
stack-protector: Kernel stack is corrupted in: do_idle
Kernel panic - not syncing: stack-protector: Kernel stack is corrupted in: do_idle
- Load pre-start snapshot with `track_dirty_pages: true`
- Resume VM — VM boots, container initializes until healthy
- Pause VM — Create diff snapshot
- Merge diff into pre-start base → startup snapshot
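Sketched against Firecracker's HTTP API, the flow looks roughly like this; `FcApi` and its `put`/`patch` helpers are hypothetical stand-ins for fcvm's API client, and the exact name of the dirty-tracking flag on load varies across Firecracker versions:

```rust
use serde_json::{json, Value};

/// Hypothetical transport over Firecracker's API Unix socket; each call sends
/// the JSON body and fails on a non-2xx response.
trait FcApi {
    fn put(&self, path: &str, body: Value) -> anyhow::Result<()>;
    fn patch(&self, path: &str, body: Value) -> anyhow::Result<()>;
}

fn create_startup_snapshot(api: &dyn FcApi) -> anyhow::Result<()> {
    // 1+2. Load the pre-start snapshot with dirty-page tracking and resume it.
    //      (Older Firecracker releases call this flag `enable_diff_snapshots`.)
    api.put("/snapshot/load", json!({
        "snapshot_path": "pre-start.snap",
        "mem_backend": { "backend_type": "File", "backend_path": "pre-start.mem" },
        "track_dirty_pages": true,
        "resume_vm": true,
    }))?;

    // ... wait here until the container's health check passes ...

    // 3. Pause the VM and take a Diff snapshot: only pages dirtied since the
    //    pre-start snapshot are written out.
    api.patch("/vm", json!({ "state": "Paused" }))?;
    api.put("/snapshot/create", json!({
        "snapshot_type": "Diff",
        "snapshot_path": "startup.snap",
        "mem_file_path": "startup.diff.mem",
    }))?;

    // 4. Merging startup.diff.mem into the pre-start memory file (fcvm-side,
    //    not a Firecracker API call) yields the self-contained startup snapshot.
    Ok(())
}
```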
KVM ARM64's `KVM_GET_DIRTY_LOG` silently returns a nearly-empty bitmap. The diff snapshot captures only device-emulation pages (virtio queue pages marked by Firecracker's internal `AtomicBitmap` via `mark_virtio_queue_memory_dirty()`), while missing all guest OS memory writes tracked by KVM's Stage-2 page tables.
CI data showing the failure:
| Round | bytes_merged | data_regions | Result |
|---|---|---|---|
| 07:54 | 37,834,752 | 2,824 | OK |
| 08:34 | 43,225,088 | 3,648 | OK |
| 15:20 | 40,157,184 | 3,039 | OK |
| 17:35 | 94,208 | 9 | KERNEL PANIC |
Both rootless AND routed snapshots created at the same time had EXACTLY 94,208 bytes — a systematic KVM dirty tracking failure, not random corruption.
Investigated the Firecracker source (ejc3/firecracker, branch bump-vsock-max-connections):
- `KVM_MEM_LOG_DIRTY_PAGES` is correctly set on memory regions during `load_snapshot` (via `KVM_SET_USER_MEMORY_REGION` in `vstate/memory.rs`)
- No code path resets or discards the dirty bitmap between load and diff creation
- vCPU pause is properly synchronized before `KVM_GET_DIRTY_LOG`
- The ~94 KB corresponds to virtio queue pages marked by Firecracker's internal bitmap after the previous snapshot — these don't rely on KVM tracking
- UFFD + dirty tracking interaction (highest suspicion): Pages populated via `UFFD_COPY` during snapshot restore may get Stage-2 mappings created without write-protection when dirty logging was enabled before the host PTE existed. Under load, delayed UFFD faults mean pages fault in after the VM resumes, bypassing dirty tracking entirely.
- Stage-2 TLB stale entries: Incomplete TLB invalidation after enabling dirty logging under heavy load (IPI delays) allows writes through stale entries that bypass tracking.
- Block mapping coalescing: KVM ARM64 can create 2 MB block mappings in Stage-2. If splitting into 4 KB write-protected pages is delayed, writes bypass per-page tracking.
In `create_snapshot_core()`, after creating the diff snapshot but before resuming the VM:
- Check the diff file's actual disk usage (`meta.blocks() * 512`)
- If `diff_allocated < memory_bytes / 1024` (0.1% of VM memory), the diff is corrupt
- Retry as a Full snapshot while the VM is still paused
- Skip the merge step and use the full snapshot directly
This is a detection + recovery approach because the bug is in KVM kernel code we don't control.
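A minimal sketch of the detection step; `meta.blocks()` requires the Unix `MetadataExt` trait, and the recovery helpers named in the comment are hypothetical stand-ins for the existing snapshot code path:

```rust
use std::os::unix::fs::MetadataExt;
use std::path::Path;

/// True when the diff memory file is implausibly small for this VM, i.e.
/// KVM's dirty log came back nearly empty (the ARM64 failure described above).
fn diff_looks_corrupt(diff_mem: &Path, vm_memory_bytes: u64) -> std::io::Result<bool> {
    let meta = std::fs::metadata(diff_mem)?;
    // Sparse-file aware: blocks() counts 512-byte blocks actually allocated,
    // not the nominal file length.
    let diff_allocated = meta.blocks() * 512;
    // Anything under 0.1% of guest memory cannot be a real "warm" diff.
    Ok(diff_allocated < vm_memory_bytes / 1024)
}

// In create_snapshot_core(), after the Diff snapshot but before resuming
// (create_full_snapshot and merge_diff are hypothetical stand-ins):
//
//   if diff_looks_corrupt(&diff_mem_path, vm_memory_bytes)? {
//       // Retry as a Full snapshot while the VM is still paused; skip the merge.
//       create_full_snapshot(&api, &snapshot_paths)?;
//   } else {
//       merge_diff(&base_mem_path, &diff_mem_path, &startup_mem_path)?;
//   }
```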
Additional defense:
- Post-resume liveness check: After restoring a snapshot, wait 200 ms and call `try_wait()` to detect an immediate kernel panic. If the process has already exited, return an error so the snapshot is known to be bad.
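The liveness check itself can be as small as the following, where `child` is the `std::process::Child` handle for the restored Firecracker process:

```rust
use std::process::Child;
use std::time::Duration;

/// Give a freshly restored VM a moment to crash: if the Firecracker process
/// has already exited 200 ms after resume, the snapshot it booted from is bad.
fn post_resume_liveness_check(child: &mut Child) -> anyhow::Result<()> {
    std::thread::sleep(Duration::from_millis(200));
    if let Some(status) = child.try_wait()? {
        anyhow::bail!("firecracker exited immediately after restore: {status}");
    }
    Ok(())
}
```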
Multiple clones from the same snapshot should share physical memory pages for read-only data. A large container VM may have 131 GB of guest memory, but most of it is identical across clones (kernel, application code, page cache). Only pages each clone writes to should be unique (Private_Dirty).
Three memory backends were tested. Results for two 131 GB clones:
| Backend | Per-clone RSS | Shared | Private_Clean | Pressure | Status |
|---|---|---|---|---|---|
| File (`MAP_PRIVATE` on memory.bin) | 44 GB | 1.8 MB | 33.6 GB | 40% | Broken — KVM CoW-copies pages into Private_Clean even for reads |
| UFFD MISSING+COPY | 21 GB | 0 | 0 | 11% | Works but no sharing — each fault copies data to fresh anon page |
| UFFD MINOR+CONTINUE (not implemented) | ~5 GB (est.) | ~80 GB (est.) | 0 | ~2% (est.) | True sharing via shared memfd |
File backend: Firecracker maps memory.bin with MAP_PRIVATE | PROT_READ | PROT_WRITE.
When KVM handles a guest page fault, even for a read, the page becomes Private_Clean in the
process's address space. This happens because the kernel creates a private copy of the
file-backed page when setting up writable EPT mappings. The track_dirty_pages flag
(--no-dirty-tracking CLI) controls KVM's dirty bitmap tracking but does NOT prevent
the Private_Clean CoW behavior — that's inherent to MAP_PRIVATE with writable mappings.
UFFD MISSING+COPY: Firecracker creates anonymous memory (MAP_PRIVATE | MAP_ANONYMOUS)
and registers it with UFFD in MISSING mode. On each page fault, the UFFD server reads from
memory.bin and calls UFFDIO_COPY to fill the page. Each clone gets its own physical copy.
No Private_Clean bloat (no file-backed mapping), but no sharing either.
RSS is lower than File mode because only faulted pages are populated (lazy loading).
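For reference, the core of a MISSING+COPY fault loop with the `userfaultfd` crate looks roughly like this simplified sketch (the real UFFD server also handles multiple memory regions, multiple clones, and shutdown):

```rust
use std::os::raw::c_void;

use userfaultfd::{Event, Uffd};

/// Resolve MISSING faults by copying the faulting page out of memory.bin
/// (`src_base` is the snapshot file mapped read-only with the same layout as
/// guest memory). Every UFFDIO_COPY lands in a fresh anonymous page in the
/// clone, which is why this mode never shares memory between clones.
fn serve_missing_faults(uffd: &Uffd, guest_base: usize, src_base: *const u8, page_size: usize) {
    loop {
        match uffd.read_event() {
            Ok(Some(Event::Pagefault { addr, .. })) => {
                // Round the faulting address down to a page boundary.
                let fault = (addr as usize) & !(page_size - 1);
                let src = unsafe { src_base.add(fault - guest_base) };
                // Copy one page into the faulting range and wake the vCPU thread.
                unsafe {
                    uffd.copy(src as *const c_void, fault as *mut c_void, page_size, true)
                        .expect("UFFDIO_COPY failed");
                }
            }
            Ok(_) => continue,
            Err(e) => {
                eprintln!("uffd read_event error: {e}");
                break;
            }
        }
    }
}
```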
KSM: Disabled (/sys/kernel/mm/ksm/run=0). Firecracker doesn't mark guest memory
with MADV_MERGEABLE. Even if enabled, KSM is after-the-fact dedup with scanning overhead.
The kernel (6.13+) supports UFFD_FEATURE_MINOR_SHMEM — verified on our host.
The userfaultfd crate (0.9.0) supports register_with_mode() with raw bits.
Architecture:
┌─────────────────────────────────────────────────────┐
│ fcvm snapshot serve │
│ │
│ 1. memfd_create("snapshot", 131 GB) │
│ 2. Populate memfd from memory.bin │
│ 3. Accept clone connections via UDS │
│ 4. Send memfd fd + UFFD fd to each clone │
│ │
│ On MINOR fault from clone: │
│ UFFDIO_CONTINUE → maps existing memfd page │
│ (zero-copy, page shared across all clones) │
└─────────────────────────────────────────────────────┘
│ memfd fd shared via UDS
▼
┌──────────────────────┐ ┌──────────────────────┐
│ Clone 1 (Firecracker) │ │ Clone 2 (Firecracker) │
│ │ │ │
│ Guest memory: │ │ Guest memory: │
│ MAP_SHARED on memfd │ │ MAP_SHARED on memfd │
│ + UFFD MINOR mode │ │ + UFFD MINOR mode │
│ │ │ │
│ Read → shared page │ │ Read → shared page │
│ Write → kernel CoW │ │ Write → kernel CoW │
└──────────────────────┘ └──────────────────────┘
Changes required:
- Firecracker (`persist.rs`):
  - `guest_memory_from_uffd()`: Use `memfd_backed()` instead of `anonymous()` for guest memory
  - Pass the memfd fd from the UFFD server (received via UDS alongside the UFFD fd)
  - `uffd.register_with_mode(ptr, size, RegisterMode::from_bits_truncate(4))` for MINOR mode
- fcvm UFFD server (`src/uffd/server.rs`):
  - Create the memfd and populate it from memory.bin (one-time cost at serve start)
  - Send the memfd fd to each clone via the UDS handshake
  - On MINOR fault: `UFFDIO_CONTINUE` (maps the existing page) instead of `UFFDIO_COPY` (copies data)
- fcvm serve (`src/commands/snapshot.rs`):
  - `snapshot serve` creates and populates the memfd once
  - Each clone receives the same memfd fd
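A sketch of the serve-side memfd setup using raw `libc` calls; the fd-passing over the UDS handshake (SCM_RIGHTS) and the UFFDIO_CONTINUE fault handler are elided, and `build_shared_memfd` is an illustrative name rather than existing fcvm code:

```rust
use std::fs::File;
use std::io;
use std::os::fd::{FromRawFd, OwnedFd};
use std::path::Path;

/// Create one shared memfd sized to the snapshot and fill it from memory.bin.
/// All clones later mmap this same memfd MAP_SHARED, so reads share pages.
fn build_shared_memfd(memory_bin: &Path, size: u64) -> io::Result<OwnedFd> {
    // memfd_create(2): an anonymous file living entirely in the page cache.
    let raw = unsafe { libc::memfd_create(b"fcvm-snapshot\0".as_ptr().cast(), libc::MFD_CLOEXEC) };
    if raw < 0 {
        return Err(io::Error::last_os_error());
    }
    let memfd = unsafe { OwnedFd::from_raw_fd(raw) };

    // Size it to the guest memory, then copy the snapshot contents in once.
    if unsafe { libc::ftruncate(raw, size as libc::off_t) } < 0 {
        return Err(io::Error::last_os_error());
    }
    let mut src = File::open(memory_bin)?;
    let mut dst = File::from(memfd.try_clone()?);
    io::copy(&mut src, &mut dst)?;
    Ok(memfd)
}
```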
Why this works: With MAP_SHARED on the memfd, all clones' page tables can point to the
same physical pages. UFFDIO_CONTINUE resolves a MINOR fault by installing a PTE pointing to
the already-populated memfd page — no data copy. Writes trigger kernel-level CoW (the page gets
copied to anonymous memory for that process only). This is the same mechanism used by CRIU for
lazy migration and by cloud providers for VM density.
Kernel support: Verified UFFD_FEATURE_MINOR_SHMEM (bit 10) is available on our
kernel 6.13. The userfaultfd crate 0.9.0 doesn't export a MINOR constant but
RegisterMode::from_bits_truncate(4) works since it's a bitflags struct.
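On the clone side, the received memfd would be mapped `MAP_SHARED` and the region registered for MINOR faults; a sketch under those assumptions (the helper name is illustrative):

```rust
use std::os::fd::{AsRawFd, OwnedFd};
use std::os::raw::c_void;

use userfaultfd::{RegisterMode, Uffd};

/// Map guest memory MAP_SHARED over the server's memfd and register it for
/// MINOR faults. Reads resolved with UFFDIO_CONTINUE then share physical pages
/// across every clone; writes are CoW'd by the kernel into private pages.
fn map_guest_memory_minor(uffd: &Uffd, memfd: &OwnedFd, size: usize) -> std::io::Result<*mut c_void> {
    let addr = unsafe {
        libc::mmap(
            std::ptr::null_mut(),
            size,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            memfd.as_raw_fd(),
            0,
        )
    };
    if addr == libc::MAP_FAILED {
        return Err(std::io::Error::last_os_error());
    }

    // The 0.9 crate exports no named MINOR flag; bit value 4 is
    // UFFDIO_REGISTER_MODE_MINOR, per the note above.
    uffd.register_with_mode(addr, size, RegisterMode::from_bits_truncate(4))
        .map_err(|e| std::io::Error::new(std::io::ErrorKind::Other, e.to_string()))?;
    Ok(addr)
}
```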
- Firecracker Documentation
- Firecracker API Specification
- Podman Documentation
- passt/pasta
- iptables Documentation
- KVM Documentation
- Linux UFFD Documentation
End of Design Specification
Version: 2.4 Date: 2026-02-28 Author: fcvm project