-
Notifications
You must be signed in to change notification settings - Fork 7
Optional VM egress MITM proxy with mock-secret header rewriting #134
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from 4 commits
Commits
Show all changes
28 commits
Select commit
Hold shift + click to select a range
e28a8f6
Stabilize integration network and ingress test flakes
sjmiller609 e66e759
Speed up fork-from-running integration tests
sjmiller609 4433590
Reduce fork test agent wait latency
sjmiller609 b708aef
feat(instances): add optional egress MITM proxy secret rewriting
sjmiller609 d8a64b9
test(instances): use prewarmed nginx image in egress proxy integratio…
sjmiller609 e3beda6
feat(egress-proxy): source secret rewrites from instance env
sjmiller609 d56d8b5
chore: remove agents notes from this PR
sjmiller609 cbe7082
test(ci): mirror API/integration image refs and prewarm missing images
sjmiller609 c58e231
Add egress proxy enforcement modes with strict default
sjmiller609 2d76140
Fix CI image pull flakes in API and e2e install tests
sjmiller609 63fa947
Add per-env domain-gated HTTPS secret substitution
sjmiller609 a9231cd
Fix review issues in cert cache and prewarm logging
sjmiller609 26c4a29
Fix follow-up proxy review findings
sjmiller609 3266c10
Address latest bugbot follow-ups
sjmiller609 060e1d1
Harden HTTP proxy replacement destination handling
sjmiller609 e648719
Merge remote-tracking branch 'origin/main' into feature/egress-mitm-p…
sjmiller609 2ad0511
Merge remote-tracking branch 'origin/main' into feature/egress-mitm-p…
sjmiller609 717e16f
Fix restore proxy config refresh and async image queue coverage
sjmiller609 3ef9157
Remove unused egress proxy host normalizer
sjmiller609 8c3775f
Format instances types after merge
sjmiller609 e6d841e
Merge remote-tracking branch 'origin/main' into feature/egress-mitm-p…
sjmiller609 a25a59b
refactor egress API to policy + credentials and fix restore regressions
sjmiller609 865b50c
Fix open review findings for create validation and proxy errors
sjmiller609 2dd5cfc
Align egress credential injection with policy model
sjmiller609 2f6d425
Merge remote-tracking branch 'origin/main' into feature/egress-mitm-p…
sjmiller609 f342efb
Fix API instance test after merging main
sjmiller609 3ba6d54
Merge branch 'main' into feature/egress-mitm-proxy-secret-rewrite
sjmiller609 3f15717
Surface egress proxy CA cert setup failures as hard errors
sjmiller609 File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| # Test Agent Notes | ||
|
|
||
| ## 2026-03-07 - Linux CI flake in `lib/instances` | ||
|
|
||
| ### Flake signature | ||
| - Intermittent failure in `TestBasicEndToEnd`: | ||
| - `start caddy: fork/exec .../system/binaries/caddy/v2.10.2/x86_64/caddy: text file busy` | ||
| - Observed during second full no-cache CI-equivalent run on `deft-kernel-dev` as root. | ||
|
|
||
| ### Root cause | ||
| - Integration tests run in parallel and `prepareIntegrationTestDataDir` symlinks `tmpDir/system/binaries` to a shared prewarm directory. | ||
| - `lib/ingress/ExtractCaddyBinary` previously wrote directly to final binary path with `os.WriteFile`, so concurrent extraction/startup could race and produce ETXTBUSY. | ||
|
|
||
| ### Fix | ||
| - In `lib/ingress/binaries_linux.go`: | ||
| - Added extraction lock (`<binary>.lock` + `syscall.Flock`). | ||
| - Switched binary + hash writes to temp-file + atomic rename. | ||
| - Re-check binary/hash after acquiring lock. | ||
|
|
||
| ### Validation commands used | ||
| - Tight loop: | ||
| - `go test -tags containers_image_openpgp -run '^TestBasicEndToEnd$' -count=6 -timeout=25m ./lib/instances` | ||
| - `go test -tags containers_image_openpgp -run '^(TestBasicEndToEnd|TestQEMUBasicEndToEnd)$' -count=4 -timeout=30m ./lib/instances` | ||
| - Full CI-equivalent flow (`go mod download`, `make oapi-generate`, `make build`, `go run ./cmd/test-prewarm`, `make test TEST_TIMEOUT=20m`) run with fresh caches each time. | ||
|
|
||
| ### Full run durations (fresh caches) | ||
| - Pre-fix baseline: | ||
| - Run 1: 181s (pass) | ||
| - Run 2: 142s (flake) | ||
| - Post-fix full-suite verification: | ||
| - Run 1: 139s (pass) | ||
| - Run 2: 143s (pass) | ||
| - Run 3: 141s (pass) | ||
|
|
||
| ## 2026-03-07 - Additional no-cache flake under direct `go test` | ||
|
|
||
| ### Flake signatures | ||
| - `TestFirecrackerNetworkLifecycle` intermittent failure 1: | ||
| - `allocate network: get default network: network not found` | ||
| - `TestFirecrackerNetworkLifecycle` intermittent failure 2: | ||
| - curl exit code `28` (timeout) when probing `https://public-ping-bucket-kernel.s3.us-east-1.amazonaws.com/index.html`. | ||
|
|
||
| ### Root causes | ||
| - Bridge state readiness race after self-heal re-initialization could still fail immediate lookup. | ||
| - External internet dependency (S3 endpoint) introduced network flakiness unrelated to core networking behavior. | ||
|
|
||
| ### Fixes | ||
| - `lib/network/allocate.go` | ||
| - Added `getDefaultNetworkWithSelfHeal` with bounded short polling (2s total, 100ms interval) after self-heal init. | ||
| - Applied to both `CreateAllocation` and `RecreateAllocation`. | ||
| - `lib/instances/firecracker_test.go` | ||
| - Replaced remote S3 curl dependency with local deterministic probe server bound to the bridge gateway. | ||
| - Kept pre/post-standby connectivity assertions through guest `curl` with retry. | ||
|
|
||
| ### Final required gate (no-cache, 3 consecutive full runs) | ||
| - Command shape per run: | ||
| - `go mod download` | ||
| - `make oapi-generate` | ||
| - `make build` | ||
| - `go run ./cmd/test-prewarm` | ||
| - `go test -count=1 -tags containers_image_openpgp -timeout=20m ./...` | ||
| - Durations: | ||
| - Run 1: 118s (pass) | ||
| - Run 2: 230s (pass) | ||
| - Run 3: 153s (pass) | ||
|
|
||
| ## 2026-03-07 - Rerun round: redundancy + longest-test speed improvements | ||
|
|
||
| ### Fresh full no-cache baseline before new changes | ||
| - Full flow (same as CI prep + direct no-cache test): | ||
| - `go mod download` | ||
| - `make oapi-generate` | ||
| - `make build` | ||
| - `go run ./cmd/test-prewarm` | ||
| - `go test -count=1 -tags containers_image_openpgp -timeout=20m ./...` | ||
| - Results: | ||
| - Run 1: 143s (pass) | ||
| - Run 2: 153s (pass) | ||
|
|
||
| ### Slow test analysis (>2s) | ||
| - Package-level bottlenecks were `lib/images` (~6-8s) and `lib/instances` (~99s+). | ||
| - Longest individual tests (single-test baseline): | ||
| - `TestForkCloudHypervisorFromRunningNetwork`: 53.35s | ||
| - `TestQEMUForkFromRunningNetwork`: 46.87s | ||
| - `TestFirecrackerForkFromRunningNetwork`: 36.69s | ||
|
|
||
| ### Redundancy found and removed | ||
| - Duplicate source reachability assertions in running-fork tests: | ||
| - `lib/instances/fork_test.go` (CloudHypervisor case) | ||
| - `lib/instances/qemu_test.go` | ||
| - `lib/instances/firecracker_test.go` | ||
| - Removed one duplicate `assertHostCanReachNginx(sourceAfterFork...)` in each. | ||
|
|
||
| ### Longest-test speed fix | ||
| - In `lib/instances/fork_test.go`, reduced per-attempt guest-agent wait in `execInInstance`: | ||
| - `WaitForAgent: 30s` -> `5s` | ||
| - Why it mattered: | ||
| - `assertGuestHasOnlyExpectedIPv4` already does bounded polling. A 30s wait per attempt caused large stalls in the longest test while guest-agent was still coming up. | ||
|
|
||
| ### Tight-loop validation after changes | ||
| - `go test -count=1 -tags containers_image_openpgp -run '^(TestForkCloudHypervisorFromRunningNetwork|TestQEMUForkFromRunningNetwork|TestFirecrackerForkFromRunningNetwork)$' -count=3 -timeout=30m ./lib/instances` | ||
| - Pass, package time 84.182s. | ||
|
|
||
| ### Post-fix single-test durations | ||
| - `TestForkCloudHypervisorFromRunningNetwork`: 24.51s (from 53.35s) | ||
| - `TestQEMUForkFromRunningNetwork`: 11.18s (from 46.87s) | ||
| - `TestFirecrackerForkFromRunningNetwork`: 28.50s (from 36.69s) | ||
|
|
||
| ### Required pre-commit gate (3 consecutive full no-cache runs) | ||
| - Run 1: 82s (pass) | ||
| - Run 2: 103s (pass) | ||
| - Run 3: 97s (pass) | ||
| - `lib/instances` package runtime in those runs: | ||
| - 57.806s, 79.853s, 73.199s | ||
|
|
||
| ## 2026-03-08 - Rerun round (again): focused longest-test tuning | ||
|
|
||
| ### Fresh baseline no-cache full runs (before new changes) | ||
| - Run 1: 88s (pass) | ||
| - Run 2: 98s (pass) | ||
|
|
||
| ### What was analyzed | ||
| - Re-profiled slow tests in `lib/instances`; longest remained running-network fork integration tests. | ||
| - Tried a broader change (parallel source/fork reachability checks + additional guest-agent log wait) and observed regression/flakiness in tight loop (`[guest-agent] listening` log not reliably present in streamed logs). That experiment was reverted. | ||
|
|
||
| ### Final change kept | ||
| - `lib/instances/fork_test.go` | ||
| - In `execInInstance`, changed `WaitForAgent` from `5s` to `2s`. | ||
| - This path is used by `assertGuestHasOnlyExpectedIPv4` in the Cloud Hypervisor running-fork test and still uses bounded polling around command execution. | ||
|
|
||
| ### Tight-loop validation for targeted long tests | ||
| - Command: | ||
| - `go test -count=1 -tags containers_image_openpgp -run '^(TestForkCloudHypervisorFromRunningNetwork|TestQEMUForkFromRunningNetwork|TestFirecrackerForkFromRunningNetwork)$' -count=3 -timeout=30m ./lib/instances` | ||
| - Result: | ||
| - Pass; package runtime 102.528s. | ||
|
|
||
| ### Isolated longest-test samples after final change | ||
| - `TestForkCloudHypervisorFromRunningNetwork`: 26.14s | ||
| - `TestQEMUForkFromRunningNetwork`: 11.09s | ||
| - `TestFirecrackerForkFromRunningNetwork`: 27.58s | ||
|
|
||
| ### Required pre-commit gate (3 consecutive full no-cache runs) | ||
| - Run 1: 121s (pass) | ||
| - Run 2: 141s (pass) | ||
| - Run 3: 96s (pass) | ||
| - `lib/instances` runtime in those runs: | ||
| - 97.618s, 117.392s, 71.886s | ||
|
|
||
| ## 2026-03-08 - Egress proxy integration test notes | ||
|
|
||
| ### New integration test | ||
| - Added `TestEgressProxyRewritesHTTPSHeaders` in `lib/instances/egress_proxy_integration_test.go`. | ||
| - Validates end-to-end behavior: | ||
| - VM egress HTTP(S) proxy mode enabled. | ||
| - Mock header secret in guest request. | ||
| - Host-side proxy rewrites header using real secret from host env var. | ||
| - Verified by HTTPS target server response body. | ||
|
|
||
| ### Practical gotchas observed | ||
| - Running this suite from a fresh copied worktree on `deft-kernel-dev` requires embedded binaries to exist (`lib/system/guest_agent/guest-agent`, `lib/system/init/init`, Cloud Hypervisor binary, Caddy binary). | ||
| - `curlimages/curl` image can default/bundle proxy bypass behavior for loopback targets; test command explicitly clears `NO_PROXY` to force proxy routing. | ||
| - For deterministic test behavior with ad-hoc TLS target server, command uses `curl -k` to avoid CA trustchain variance across guest images while still exercising HTTPS MITM path. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,32 @@ | ||
| # Egress Proxy (Mock Secret Substitution) | ||
|
|
||
| This module provides an optional, default-off networking mode for VM egress. | ||
|
|
||
| When enabled for an instance, hypeman does three things: | ||
|
|
||
| 1. It starts (or reuses) a host-side HTTP/HTTPS MITM proxy bound to the VM bridge gateway. | ||
| 2. It injects proxy environment variables into the guest (`HTTP_PROXY` / `HTTPS_PROXY`) and installs the proxy CA certificate in the guest trust store. | ||
| 3. It enforces policy on the host so direct outbound TCP traffic on ports `80` and `443` from that VM's TAP interface is rejected unless it is going to the bridge gateway (the proxy). | ||
|
|
||
| ## Secret substitution flow | ||
|
|
||
| - Workloads inside the VM use mock secret values (for example `mock_openai_key`). | ||
| - Per instance, hypeman stores a mapping of `mock value -> host environment variable name`. | ||
| - For each outbound HTTP request (including HTTPS requests after MITM decryption), the proxy scans every HTTP header value. | ||
| - Any occurrence of a configured mock value is replaced with the real value loaded from the host environment variable. | ||
| - The modified request is then forwarded upstream. | ||
|
|
||
| This keeps real secrets out of the VM while still allowing authenticated egress requests. | ||
|
|
||
| ## Security behavior | ||
|
|
||
| - Real secret values are not persisted in instance metadata. | ||
| - Only host environment variable names are persisted. | ||
| - TLS interception requires guest trust of the proxy CA; hypeman installs this CA in the guest when proxy mode is enabled. | ||
| - Egress enforcement is applied per instance TAP device and removed when the instance stops/standbys/deletes. | ||
|
|
||
| ## Limits of enforcement | ||
|
|
||
| - Enforcement currently targets HTTP/HTTPS default ports (`80` and `443`). | ||
| - Non-HTTP protocols or custom ports are not rewritten. | ||
| - Header replacement is applied to HTTP headers only (not request/response bodies). |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,162 @@ | ||
| package egressproxy | ||
|
|
||
| import ( | ||
| "crypto/rand" | ||
| "crypto/rsa" | ||
| "crypto/tls" | ||
| "crypto/x509" | ||
| "crypto/x509/pkix" | ||
| "encoding/pem" | ||
| "fmt" | ||
| "math/big" | ||
| "net" | ||
| "os" | ||
| "path/filepath" | ||
| "strings" | ||
| "time" | ||
| ) | ||
|
|
||
| func loadOrCreateCA(dataDir string) (*x509.Certificate, *rsa.PrivateKey, string, error) { | ||
| dir := filepath.Join(dataDir, "egressproxy") | ||
| certPath := filepath.Join(dir, "ca.crt") | ||
| keyPath := filepath.Join(dir, "ca.key") | ||
|
|
||
| certPEM, certErr := os.ReadFile(certPath) | ||
| keyPEM, keyErr := os.ReadFile(keyPath) | ||
| if certErr == nil && keyErr == nil { | ||
| cert, key, err := parseCA(certPEM, keyPEM) | ||
| if err == nil { | ||
| return cert, key, string(certPEM), nil | ||
| } | ||
| } | ||
|
|
||
| if err := os.MkdirAll(dir, 0755); err != nil { | ||
| return nil, nil, "", fmt.Errorf("create egress proxy cert dir: %w", err) | ||
| } | ||
|
|
||
| caKey, err := rsa.GenerateKey(rand.Reader, 2048) | ||
| if err != nil { | ||
| return nil, nil, "", fmt.Errorf("generate egress proxy CA key: %w", err) | ||
| } | ||
|
|
||
| serialNumber, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128)) | ||
| if err != nil { | ||
| return nil, nil, "", fmt.Errorf("generate CA serial: %w", err) | ||
| } | ||
|
|
||
| now := time.Now() | ||
| tmpl := &x509.Certificate{ | ||
| SerialNumber: serialNumber, | ||
| Subject: pkix.Name{ | ||
| CommonName: "hypeman-egress-proxy-ca", | ||
| Organization: []string{"hypeman"}, | ||
| }, | ||
| NotBefore: now.Add(-1 * time.Hour), | ||
| NotAfter: now.Add(10 * 365 * 24 * time.Hour), | ||
| KeyUsage: x509.KeyUsageCertSign | x509.KeyUsageCRLSign, | ||
| BasicConstraintsValid: true, | ||
| IsCA: true, | ||
| MaxPathLen: 1, | ||
| } | ||
|
|
||
| certDER, err := x509.CreateCertificate(rand.Reader, tmpl, tmpl, &caKey.PublicKey, caKey) | ||
| if err != nil { | ||
| return nil, nil, "", fmt.Errorf("create egress proxy CA cert: %w", err) | ||
| } | ||
|
|
||
| certPEM = pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: certDER}) | ||
| keyPEM = pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(caKey)}) | ||
|
|
||
| if err := os.WriteFile(certPath, certPEM, 0644); err != nil { | ||
| return nil, nil, "", fmt.Errorf("write egress proxy CA cert: %w", err) | ||
| } | ||
| if err := os.WriteFile(keyPath, keyPEM, 0600); err != nil { | ||
| return nil, nil, "", fmt.Errorf("write egress proxy CA key: %w", err) | ||
| } | ||
|
|
||
| cert, key, err := parseCA(certPEM, keyPEM) | ||
| if err != nil { | ||
| return nil, nil, "", err | ||
| } | ||
|
|
||
| return cert, key, string(certPEM), nil | ||
| } | ||
|
|
||
| func parseCA(certPEM, keyPEM []byte) (*x509.Certificate, *rsa.PrivateKey, error) { | ||
| certBlock, _ := pem.Decode(certPEM) | ||
| if certBlock == nil { | ||
| return nil, nil, fmt.Errorf("parse egress proxy CA cert PEM") | ||
| } | ||
| cert, err := x509.ParseCertificate(certBlock.Bytes) | ||
| if err != nil { | ||
| return nil, nil, fmt.Errorf("parse egress proxy CA cert: %w", err) | ||
| } | ||
| keyBlock, _ := pem.Decode(keyPEM) | ||
| if keyBlock == nil { | ||
| return nil, nil, fmt.Errorf("parse egress proxy CA key PEM") | ||
| } | ||
| key, err := x509.ParsePKCS1PrivateKey(keyBlock.Bytes) | ||
| if err != nil { | ||
| return nil, nil, fmt.Errorf("parse egress proxy CA key: %w", err) | ||
| } | ||
| return cert, key, nil | ||
| } | ||
|
|
||
| func signHostCertificate(caCert *x509.Certificate, caKey *rsa.PrivateKey, host string) (*tls.Certificate, error) { | ||
| leafKey, err := rsa.GenerateKey(rand.Reader, 2048) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("generate leaf key: %w", err) | ||
| } | ||
|
|
||
| serialNumber, err := rand.Int(rand.Reader, new(big.Int).Lsh(big.NewInt(1), 128)) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("generate leaf serial: %w", err) | ||
| } | ||
|
|
||
| now := time.Now() | ||
| tmpl := &x509.Certificate{ | ||
| SerialNumber: serialNumber, | ||
| Subject: pkix.Name{ | ||
| CommonName: host, | ||
| }, | ||
| NotBefore: now.Add(-1 * time.Hour), | ||
| NotAfter: now.Add(24 * time.Hour), | ||
| KeyUsage: x509.KeyUsageDigitalSignature | x509.KeyUsageKeyEncipherment, | ||
| ExtKeyUsage: []x509.ExtKeyUsage{x509.ExtKeyUsageServerAuth}, | ||
| BasicConstraintsValid: true, | ||
| } | ||
|
|
||
| if ip := net.ParseIP(host); ip != nil { | ||
| tmpl.IPAddresses = []net.IP{ip} | ||
| } else { | ||
| tmpl.DNSNames = []string{host} | ||
| } | ||
|
|
||
| leafDER, err := x509.CreateCertificate(rand.Reader, tmpl, caCert, &leafKey.PublicKey, caKey) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("sign leaf cert: %w", err) | ||
| } | ||
|
|
||
| leafPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: leafDER}) | ||
| caPEM := pem.EncodeToMemory(&pem.Block{Type: "CERTIFICATE", Bytes: caCert.Raw}) | ||
| keyPEM := pem.EncodeToMemory(&pem.Block{Type: "RSA PRIVATE KEY", Bytes: x509.MarshalPKCS1PrivateKey(leafKey)}) | ||
|
|
||
| tlsCert, err := tls.X509KeyPair(append(leafPEM, caPEM...), keyPEM) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("build leaf keypair: %w", err) | ||
| } | ||
| if len(tlsCert.Certificate) > 0 { | ||
| if leaf, parseErr := x509.ParseCertificate(tlsCert.Certificate[0]); parseErr == nil { | ||
| tlsCert.Leaf = leaf | ||
| } | ||
| } | ||
| return &tlsCert, nil | ||
| } | ||
|
|
||
| func normalizeHost(rawHost string) string { | ||
| host := strings.TrimSpace(rawHost) | ||
| if h, _, err := net.SplitHostPort(host); err == nil { | ||
| return h | ||
| } | ||
| return host | ||
| } | ||
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.