Add agent-managed firewall with nftables for NATS/monit access control #394
Draft
rkoster wants to merge 46 commits into cloudfoundry:main from
Conversation
Member
It looks like there are a lot of diffs in generated fakes. Are these because of a change in counterfeiter? If so, could these changes be broken out?
Contributor
Author
Yeah definitely, will do some more cleanup before marking it as ready for review.
Force-pushed from 52ed4eb to 9e9299e
Move firewall management from stemcell scripts to bosh-agent to properly handle Jammy containers running on Noble hosts (cgroup v2 inherited from the host).

- Create platform/firewall package with an nftables implementation using the github.com/google/nftables library (pure Go, no C deps)
- Add SetupFirewall() to the Platform interface, called during bootstrap
- Add a 'bosh-agent firewall-allow <service>' CLI command for the monit wrapper
- Detect cgroup version at runtime from /proc/self/cgroup
- Enable the NATS firewall only for Jammy (Noble uses ephemeral credentials)
- Gracefully handle a missing base firewall (warn, don't fail)
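For orientation, here is a minimal sketch of what runtime cgroup detection from /proc/self/cgroup can look like; this is not the agent's actual code, and detectCgroupVersion is a made-up name:

```go
package firewall

import (
	"bufio"
	"os"
	"strings"
)

// detectCgroupVersion reports 2 when /proc/self/cgroup contains only the
// unified "0::<path>" entry, and 1 when any v1 controller hierarchy is listed.
func detectCgroupVersion() (int, error) {
	f, err := os.Open("/proc/self/cgroup")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		if !strings.HasPrefix(scanner.Text(), "0::") {
			return 1, nil // a numbered entry means a v1 (or hybrid) hierarchy
		}
	}
	return 2, scanner.Err()
}
```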
Refactor NftablesFirewall to use dependency injection for testability:

- Add NftablesConn, CgroupResolver, and OSReader interfaces
- Add NewNftablesFirewallWithDeps constructor for testing
- Export TableName, ChainName, and MonitPort constants
- Generate counterfeiter fakes with Linux build tags

Add scenario-based test coverage for:

- SetupAgentRules with various NATS URL formats
- OS-specific behavior (Jammy adds NATS rules, Noble skips them)
- Cgroup v2 rule generation with path verification (including nested container paths)
- Cgroup v1 rule generation with classid verification
- AllowService and Cleanup operations
The firewall-allow CLI is called by BOSH-deployed jobs that need to interact with the monit API directly for controlled failover scenarios (e.g., monitoring process lifecycle). Jobs call this via the permit_monit_access helper function.
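A rough sketch of how the subcommand could dispatch to the firewall manager; Manager and runFirewallAllow are illustrative names, and only AllowService comes from the commits above:

```go
package firewall

import "fmt"

// Manager is a stand-in for the firewall manager used by the agent.
type Manager interface {
	AllowService(name string) error
}

// runFirewallAllow handles `bosh-agent firewall-allow <service>`; for example,
// the permit_monit_access helper would invoke it with "monit" as the service.
func runFirewallAllow(mgr Manager, args []string) error {
	if len(args) != 1 {
		return fmt.Errorf("usage: bosh-agent firewall-allow <service>")
	}
	return mgr.AllowService(args[0])
}
```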
Remove the OSReader dependency from the firewall manager. NATS firewall enablement is now controlled via Platform.Linux.EnableNATSFirewall in agent.json.

- Jammy stemcells set EnableNATSFirewall: true (uses static NATS credentials)
- Noble stemcells leave it unset/false (uses ephemeral NATS credentials)

This properly handles all deployment scenarios, including Jammy containers running on Noble hosts with cgroup v2.
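For illustration, the agent.json shape described above could map to Go settings types roughly like this; the type names are assumptions, only the field path Platform.Linux.EnableNATSFirewall comes from the commit:

```go
package settings

// Platform / LinuxOptions mirror the Platform.Linux.EnableNATSFirewall path
// in agent.json; the Go type names here are illustrative.
type Platform struct {
	Linux LinuxOptions `json:"Linux"`
}

type LinuxOptions struct {
	// Jammy stemcells set this to true (static NATS credentials);
	// Noble stemcells leave it unset/false (ephemeral NATS credentials).
	EnableNATSFirewall bool `json:"EnableNATSFirewall"`
}
```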
Migrate NATS firewall management from network configuration time (iptables) to an mbus connection hook (nftables). This enables DNS re-resolution before each connection/reconnection, supporting HA failover scenarios where the director may move to a different IP.

Architecture changes:
- Monit chain: set up once at bootstrap, never modified
- NATS chain: flushed and rebuilt before each NATS connect/reconnect via a BeforeConnect hook in the mbus handler

Key changes:
- Add NatsFirewallHook interface with a BeforeConnect(mbusURL) method
- Add GetNatsFirewallHook() to the Platform interface
- Store the firewall manager in the linux platform for reuse
- Call BeforeConnect before the initial connection and in the reconnect handler
- Remove the old iptables-based firewall_provider from platform/net
- Update integration tests from iptables to nftables
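A sketch of the BeforeConnect flow under the assumptions above: re-resolve the mbus hostname and rebuild only the NATS chain, leaving the monit chain untouched. rebuildNATSChain is a hypothetical stand-in for the flush-and-rebuild logic:

```go
package firewall

import (
	"net"
	"net/url"
)

// NatsFirewallHook matches the interface described in this commit.
type NatsFirewallHook interface {
	BeforeConnect(mbusURL string) error
}

// natsHook is a sketch; rebuildNATSChain stands in for the real
// flush-and-rebuild of the NATS chain.
type natsHook struct {
	rebuildNATSChain func(directorIPs []string) error
}

func (h *natsHook) BeforeConnect(mbusURL string) error {
	u, err := url.Parse(mbusURL)
	if err != nil {
		return err
	}
	// Re-resolve on every (re)connect so a director failover to a new IP is
	// reflected in the NATS chain before the connection is attempted.
	ips, err := net.LookupHost(u.Hostname())
	if err != nil {
		return err
	}
	return h.rebuildNATSChain(ips)
}
```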
The NATS firewall now adds two rules per director IP:

1. An ACCEPT rule for the agent's cgroup (allows the agent to connect)
2. A DROP rule for everyone else (blocks malicious workloads)

This protects against metadata service attacks where workloads could extract the NATS certs and impersonate the agent.
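Conceptually, the per-IP rule pair orders an ACCEPT before a DROP; addAcceptForAgentCgroup and addDropForOthers below are hypothetical stand-ins for the nftables rule builders:

```go
package firewall

// addNATSRules sketches the per-director-IP rule ordering described above.
func addNATSRules(directorIPs []string, natsPort uint16) error {
	for _, ip := range directorIPs {
		// 1. ACCEPT traffic to ip:natsPort from the agent's own cgroup.
		if err := addAcceptForAgentCgroup(ip, natsPort); err != nil {
			return err
		}
		// 2. DROP the same destination for everyone else, so workloads cannot
		//    reuse extracted NATS certs to impersonate the agent.
		if err := addDropForOthers(ip, natsPort); err != nil {
			return err
		}
	}
	return nil
}

func addAcceptForAgentCgroup(ip string, port uint16) error { return nil } // stub
func addDropForOthers(ip string, port uint16) error        { return nil } // stub
```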
The netlink-based nftables library fails with EOVERFLOW in nested container environments (e.g., Garden containers inside Concourse). This adds a CLI-based implementation that uses the nft command directly, which works reliably in these environments. When the netlink implementation fails, the agent automatically falls back to the CLI implementation.
On hybrid cgroup systems (cgroup v2 mounted but with no controllers enabled), nftables 'socket cgroupv2' matching doesn't work because the socket-to-cgroup association isn't established. This is detected by checking if /sys/fs/cgroup/cgroup.controllers has content. If empty, we fall back to UID-based matching (meta skuid 0) which allows any root process to connect to monit/NATS. Trade-off: UID matching is less secure (any root process can connect) but works on all systems including nested containers on hybrid hosts.
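A sketch of that detection, assuming the check is a simple read of cgroup.controllers; the function name echoes IsCgroupV2SocketMatchFunctional mentioned later in this PR, but the body is illustrative:

```go
package firewall

import (
	"os"
	"strings"
)

// isCgroupV2SocketMatchFunctional: an empty /sys/fs/cgroup/cgroup.controllers
// means cgroup v2 is mounted with no controllers enabled, so socket cgroupv2
// matching will not see the agent's sockets and a fallback is needed.
func isCgroupV2SocketMatchFunctional() bool {
	data, err := os.ReadFile("/sys/fs/cgroup/cgroup.controllers")
	if err != nil {
		return false // file missing: not a cgroup v2 (or hybrid) mount
	}
	return strings.TrimSpace(string(data)) != ""
}
```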
nftables evaluates each table's chains independently - an ACCEPT verdict in one table doesn't prevent other tables from also evaluating the packet. When the agent allows a packet to monit, it now sets mark 0xb054. The base bosh_firewall table checks for this mark and skips the DROP rule when set. This enables proper coordination between the agent-managed bosh_agent table (priority -1) and the stemcell's bosh_firewall table (priority 0).
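A small sketch of the mark constant and the coordination it enables; the rule text in the comment is illustrative nft syntax, not copied from the stemcell:

```go
package firewall

// AllowedPacketMark is the mark described above. Conceptually:
//
//	bosh_agent (priority -1):   accept rules also run `meta mark set 0xb054`
//	bosh_firewall (priority 0): a rule matching `meta mark 0xb054` accepts the
//	                            packet before that table's DROP rule is reached
const AllowedPacketMark uint32 = 0xb054
```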
…ments

- Accept both socket cgroupv2 classid and meta cgroup matching patterns
- Use the (?s) flag for multiline regex matching
- Parse the hostname from the BOSH_ENVIRONMENT URL
- Handle hybrid cgroup v1+v2 systems (grep ^0:: for the unified hierarchy)
- Support both /sys/fs/cgroup and /sys/fs/cgroup/unified mount points
- Skip the cgroup access test when cgroup manipulation is not permitted (nested containers)
Noble stemcells use systemd for service management instead of runit. This change adds OS detection to use the appropriate service manager:

- isSystemdSystem() detects Noble via /etc/lsb-release
- StopAgent/StartAgent use systemctl for Noble, sv for Jammy
- system_mounts_test accepts both service manager log patterns
Use OS detection to stop agent with appropriate service manager (systemctl for Noble, sv for Jammy) before installing new agent binary.
- Add a regex pattern for pure cgroup v2 with the systemd path format
- Wait for NATS rules to be populated before checking nftables
- Use Eventually for the nftables check to handle timing issues
- Truncate the log file in BeforeEach to avoid finding old logs from pipeline setup
- Delete the existing nftables table to ensure a clean state for fresh agent startup
- Remove unreliable log-based waiting (differs between runit and systemd)
- Wait up to 300 seconds for NATS rules to appear in nftables
- Add a meta skuid pattern for hybrid cgroup UID-based matching
Remove nohup wrapper for systemd commands since systemctl is designed to work synchronously and doesn't need background execution like runit's sv.
On systemd systems, 'systemctl start' is a no-op if the service is already running. Tests that need a fresh agent state (e.g., firewall tests that delete the nftables table) require a restart to ensure the agent recreates its firewall rules on startup.
On Noble (systemd-based stemcells), auditd requires /var/log/audit to exist. When CleanupDataDir unmounts /var/log and recreates it as an empty directory, the audit subdirectory is missing, causing auditd to fail. The agent's bootstrap then fails because the bosh-start-logging-and-auditing script cannot start auditd. This fix ensures /var/log/audit is created with proper permissions after /var/log is recreated.
- Fix the SetupAgentRules signature in nftables_firewall_other.go to match the Manager interface (add the enableNATSFirewall parameter)
- Fix a nil pointer dereference in nats_handler.go ClosedHandler when the connection closes gracefully without an error
- Remove the CLI-based nftables fallback from linux_platform.go
- Delete nftables_firewall_cli.go (the CLI implementation is no longer needed)
Add //nolint:errcheck comments for intentionally ignored errors and check url.Parse error return value.
The CgroupResolver and NftablesConn interfaces are defined in nftables_firewall.go which has //go:build linux. Use counterfeiter's -header option to automatically include the linux build tag when generating these fakes. Add linux_header.txt containing the build tag, and reference it in the counterfeiter:generate directives for Linux-only interfaces.
Validate that the parsed port is within the valid TCP port range (1-65535) before converting to uint16 in buildTCPDestPortExprs. This prevents potential integer overflow when converting from architecture-dependent int to uint16.
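A sketch of that validation, assuming the port arrives as a string parsed out of the NATS URL:

```go
package firewall

import (
	"fmt"
	"strconv"
)

// parsePort rejects values outside the TCP port range before narrowing to
// uint16 for the nftables destination-port expression.
func parsePort(s string) (uint16, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, err
	}
	if p < 1 || p > 65535 {
		return 0, fmt.Errorf("port %d outside valid range 1-65535", p)
	}
	return uint16(p), nil
}
```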
The google/nftables Go library has a bug encoding 'socket cgroupv2' expressions via netlink, causing 'netlink receive: value too large for defined data type' errors on Ubuntu Noble. The nft CLI works correctly, but the Go netlink implementation is broken. Until the library is fixed, always use UID-based matching (meta skuid 0) for cgroup v2 systems instead of socket cgroupv2 path matching. This allows root processes to access protected services (NATS, monit). Updated tests to reflect the new behavior where cgroup v2 systems always use UID-based matching regardless of IsCgroupV2SocketMatchFunctional.
The nftables kernel expects an 8-byte cgroup inode ID for 'socket cgroupv2' matching, not the path string. The nft CLI translates paths to inode IDs automatically, but the Go nftables library does not.

This commit:
- Adds GetCgroupID() to look up the cgroup inode via syscall.Stat
- Updates buildCgroupMatchExprs() to use the 8-byte inode ID
- Re-enables cgroup v2 socket matching (disabled in the previous workaround)
- Updates tests to verify cgroup-ID-based matching

Verified on Ubuntu Noble (24.04) with cgroup v2: the agent starts successfully and the nftables rules correctly match the bosh-agent cgroup.
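A sketch of the inode lookup, assuming GetCgroupID simply stats the cgroup directory (path handling simplified):

```go
package firewall

import (
	"path/filepath"
	"syscall"
)

// GetCgroupID returns the inode number of the cgroup directory, which is the
// 8-byte ID the kernel matches on for 'socket cgroupv2'.
func GetCgroupID(cgroupRoot, cgroupPath string) (uint64, error) {
	var st syscall.Stat_t
	if err := syscall.Stat(filepath.Join(cgroupRoot, cgroupPath), &st); err != nil {
		return 0, err
	}
	return st.Ino, nil
}
```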
Simplify cgroup matching logic by removing the UID-based fallback that was used for hybrid cgroup systems (v2 mounted but with v1 controllers). The socket cgroupv2 matching should work regardless of whether cgroup controllers are delegated to v2, since the kernel's socket-to-cgroup association is based on the cgroup hierarchy, not controller delegation.

This simplifies the code from three matching strategies to two:
- Cgroup v1: meta cgroup <classid>
- Cgroup v2: socket cgroupv2 level 2 <inode_id>

Integration tests will verify if this works in nested container scenarios.
Add integration tests to verify nftables firewall rules work correctly inside nested Garden/runC containers using OCI stemcell images.

New files:
- integration/utils/garden_client.go: Garden client helper that connects through an SSH tunnel; supports container creation, command execution, file streaming, and cgroup detection
- integration/garden_firewall_test.go: Tests cgroup detection, nftables availability, and firewall rule creation in Garden containers

Tests verify that on cgroup v2 (Noble), firewall rules use proper cgroup socket matching rather than falling back to UID-based matching. Tests skip automatically if the GARDEN_ADDRESS env var is not set.

Dependencies added:
- code.cloudfoundry.org/garden
- code.cloudfoundry.org/lager/v3

Vendor cleanup: removed unused dependencies that were cleaned up by go mod tidy (iptables, go-systemd, staticcheck, etc.)
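For context, connecting to Garden through the tunnel comes down to something like this; newGardenClient and tunnelAddr are placeholders for the helper in integration/utils/garden_client.go:

```go
package utils

import (
	"code.cloudfoundry.org/garden"
	"code.cloudfoundry.org/garden/client"
	"code.cloudfoundry.org/garden/client/connection"
)

// newGardenClient dials a Garden server over a plain TCP address; in the tests
// this address would be the local end of an SSH tunnel to the VM's Garden.
func newGardenClient(tunnelAddr string) (garden.Client, error) {
	c := client.New(connection.New("tcp", tunnelAddr))
	if err := c.Ping(); err != nil {
		return nil, err
	}
	return c, nil
}
```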
Tests now fail immediately with clear error messages when:

- The GARDEN_ADDRESS environment variable is not set
- The Garden client fails to connect
- The bosh-agent-linux-amd64 binary is not built
- The agent fails to create the firewall table

This ensures test failures are visible in CI rather than silently skipped.
…g testing

The Garden container firewall tests need to run in an environment where Garden remains functional throughout the test. The main integration test suite's SynchronizedBeforeSuite runs CleanupDataDir(), which removes /var/vcap/data, destroying the Garden job.

This commit:
- Moves garden_firewall_test.go to integration/garden/ as a separate test suite
- Adds a deploy-to-noble.sh script for local testing with a Garden deployment
- Fixes test assertions and adds proper agent configuration (agent.json)
- Uses CGO_ENABLED=0 for a static binary that works in containers

All 7 Garden firewall tests now pass, confirming that the agent correctly uses cgroup v2 socket matching (not the UID fallback) in nested containers.
- Create an nft-dump Go utility that uses the nftables library to inspect firewall rules without requiring the nft CLI. Outputs human-readable YAML with interpreted values (IP addresses, ports, marks, etc.)
- Update the Garden firewall tests to use nft-dump instead of the nft CLI for verifying firewall rules work correctly in containers
- Fix the Jammy stemcell container tests by unmounting Garden's bind-mounted /etc/resolv.conf, /etc/hosts, and /etc/hostname before starting the agent. This prevents the 'Device or resource busy' error when the agent runs resolvconf -u (same approach used by bosh-warden-cpi)
- Add nft-dump helper methods to GardenClient and TestEnvironment for use in both container and VM-based integration tests
- Update deploy-to-noble.sh to build the nft-dump binary for container tests

Tests now pass for both Noble and Jammy stemcells.
Force-pushed from f18859f to d262967
Update InstallNftDump in both TestEnvironment and GardenClient to automatically build the nft-dump binary if it doesn't exist, rather than requiring it to be pre-built. This makes the integration tests self-contained and removes the need for CI pipelines to build the binary separately.
On systemd-based systems like Noble, /var/tmp can have issues similar to /var/log, where systemd services briefly hold file handles open even after fuser -km. This caused intermittent 'rm -rf /var/tmp' failures in test cleanup.

Changes:
- Use a lazy unmount for /var/tmp (as we already do for /var/log)
- Add retry logic with brief sleeps for rm -rf to handle the transient busy state
The test was being skipped because we couldn't move a process into the agent's cgroup on systemd systems. This adds:

1. Debug output to understand the cgroup hierarchy
2. Try creating a child cgroup under the agent's cgroup (which should be allowed with proper delegation) instead of writing directly to the agent's cgroup.procs
The previous pattern 'pgrep -f "bosh-agent$"' expected the command line to END with 'bosh-agent', but on systemd the full command line is: /var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json Use 'pgrep -x bosh-agent' to match the exact process name, with fallback to matching the full binary path.
On systemd systems, /var/tmp may be held by various services and difficult to remove even with retries. Since we always recreate /var/tmp right after the cleanup attempt, this failure is not critical.
When losetup -a returns empty (no loop devices), splitting on newline produces an empty string which caused 'sudo losetup -d' to run with no argument and fail. Skip empty strings to avoid this spurious error.
Add a Go package that installs Garden from a compiled BOSH release tarball onto a VM via SSH, enabling integration tests to verify nftables firewall behavior inside Garden containers.

Key components:
- gardeninstaller: extracts packages/templates from the release tarball
- SSH driver with sudo support for non-root users
- XFS-backed grootfs stores (10GB, for multiple stemcell images)
- Generates config files matching the garden-runc-release format

The test suite now automatically installs Garden when GARDEN_RELEASE_TARBALL is set, removing the need for garden-runc as a BOSH release dependency. Tests verify firewall rules work in both Noble and Jammy stemcells.
…upport

- Add a UseDirectStore config option to skip XFS loopback creation for nested containers where loop devices are unavailable
- Add initDirectStores() for grootfs init without store-size-bytes
- Add initXFSStores() for standard XFS loopback initialization
- Fix POSIX compatibility: use '.' instead of 'source' for shell sourcing
- Add GardenClient methods for nested Garden containers:
  - NewGardenClientWithAddress: connect to a nested Garden via forwarded ports
  - CreateNestedGardenContainer: create a privileged container with cgroup mounts
  - StreamTarball: stream gzipped tarballs into containers
  - Container: expose the underlying container for GardenDriver
…n tests

- Extract a Driver interface to the integration/installerdriver package with Bootstrap(), Cleanup(), and IsBootstrapped() lifecycle methods
- Move SSHDriver and GardenDriver to the installerdriver package with config-based constructors (SSHDriverConfig, GardenDriverConfig)
- Add a GardenAPIClient helper for connecting to Garden through an SSH tunnel
- Create an agentinstaller package for installing bosh-agent in containers
- Add DialAgentThroughJumpbox() to utils for proper SSH tunneling to the agent VM (fixes bind-mount directory creation on the wrong host)
- Update garden_firewall_test.go and nested_garden_firewall_test.go to use the new Driver pattern with proper SSH connections
- Add test_helpers.go with stemcell image constants and utility functions

The refactor enables running Garden inside Garden containers (nested) for testing firewall rules at multiple nesting levels. All 10 single-level Garden firewall tests pass (Noble + Jammy stemcells).
…upport
- Disable containerd mode for L1/L2 Garden installations (containerd cannot
run inside containers due to missing capabilities and cgroups)
- Use dynamic IP allocation from Garden pools instead of NetIn port forwarding
(nftables-based systems may not support iptables NetIn rules)
- Add TCP forwarder (socat) in L1 to reach L2 Garden from host
- Add ContainerdMode and NetworkPool options to gardeninstaller Config
- Add ContainerIP() and Handle() methods to GardenDriver
- Fix busybox version mismatch (1.37.0 -> 1.36.1) to match release tarball
- Document cgroup bind mount requirements for firewall in nested containers
Network topology:
Host -> SSH -> Agent VM (Garden L0 :7777, containerd=on)
-> L1 container (Garden :7777, containerd=off, pool 10.253.0.0/22)
-> L2 container (Garden :7777, containerd=off, pool 10.252.0.0/22)
All 8 nested garden firewall tests now pass, verifying nftables rules work
correctly up to 3 levels of container nesting.
Add netcat-based tunneling to reach nested Garden containers (L2) from the test runner. This approach spawns netcat processes inside intermediate containers to tunnel TCP traffic.

Files added:
- garden_process_conn.go: net.Conn wrapper around Garden process stdin/stdout
- container_tunnel.go: persistent tunnel attempt (incomplete)
- tcp_forwarder.go: Python-based TCP proxy for container port forwarding

Changes:
- garden_api_client.go: add ensureLogger() to fix a nil-logger panic, add NewGardenAPIClientThroughContainer() for netcat tunneling
- driver_garden.go: add a Container() method to expose the underlying container
- nested_garden_firewall_test.go: use the netcat tunnel for L2 Garden connectivity

Known issue: EOF errors occur during long operations like container creation with large images, likely due to netcat process lifecycle issues. This commit preserves the netcat approach for reference before switching to the NetIn port forwarding approach.
NetIn is Garden's built-in port forwarding mechanism that creates iptables DNAT rules. This is more reliable than the netcat tunnel approach, which suffered from EOF errors during long operations.

- Use L0 Garden NetIn (17777→7777) to reach L1 Garden via its external IP
- Use L1 Garden NetIn (27777→7777) to reach L2 Garden via the L1 container IP
- Remove the netcat-based tunnel code (garden_process_conn.go, container_tunnel.go, tcp_forwarder.go)
- Simplify garden_api_client.go to only NewGardenAPIClient and NewGardenAPIClientDirect
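Illustrative use of Garden's NetIn for the 17777→7777 mapping above; container would be the L1 container obtained from the L0 Garden client:

```go
package utils

import (
	"code.cloudfoundry.org/garden"
)

// forwardGardenPort asks Garden to expose the nested Garden's port 7777 on
// host port 17777 of the outer container (a DNAT rule on the Garden host).
func forwardGardenPort(container garden.Container) (uint32, uint32, error) {
	return container.NetIn(17777, 7777)
}
```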
…nnections

- Fix cgroup v2 socket matching to use Level 0 (direct cgroup) instead of Level 2
- Add a DROP rule for non-agent connections to monit port 2822
- Add IsRunningUnderSystemd() detection to warn when cgroup isolation is ineffective
- Improve the nested firewall test with agent log capture for debugging
- Add an agentinstaller package for bosh-agent installation in test containers
- Add a cgrouputils package for cgroup diagnostics in tests
Summary
Agent-managed firewall rules implemented with the github.com/google/nftables library (pure Go, no C dependencies). Replaces the previous iptables-based firewall handling.
Supported Container Scenarios
Architectural Changes
Firewall lifecycle moved from network setup to agent bootstrap + mbus hooks
Before (iptables):
- SetupNetworking() in net managers

After (nftables):
- SetupFirewall(mbusURL) called during agent bootstrap after networking
- NATS chain rebuilt via the NatsFirewallHook.BeforeConnect() hook

New Platform interface methods
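A sketch of the added methods, inferred from the commits above; exact signatures in the agent may differ:

```go
package platform

// NatsFirewallHook is invoked before each NATS (re)connect.
type NatsFirewallHook interface {
	// BeforeConnect re-resolves the mbus URL and rebuilds the NATS chain.
	BeforeConnect(mbusURL string) error
}

type Platform interface {
	// SetupFirewall creates the agent-managed nftables table and the monit
	// chain during bootstrap, after networking is configured.
	SetupFirewall(mbusURL string) error
	// GetNatsFirewallHook returns the hook used by the mbus handler.
	GetNatsFirewallHook() NatsFirewallHook
	// ... existing Platform methods elided
}
```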
Cross-table coordination with packet marks
nftables evaluates each table's chains independently. To coordinate with the stemcell's base bosh_firewall table:

- The bosh_agent table (priority -1) sets packet mark 0xb054 on allowed packets
- The bosh_firewall table (priority 0) checks for this mark and skips its DROP rules

Runtime cgroup detection
Cgroup version detected at runtime from /proc/self/cgroup.

New CLI command for BOSH jobs
Called by BOSH jobs needing monit API access for controlled failover scenarios.
Note
The CLI-based nftables fallback (NftablesFirewallCLI) is temporary and only included for debugging while getting the integration tests to pass. It will be removed before merge.

Files Changed
New
- platform/firewall/ - nftables implementation with cgroup detection

Modified
- agent/bootstrap.go - Call SetupFirewall() after networking
- mbus/nats_handler.go - Add firewall hook before connect/reconnect
- platform/*_platform.go - Implement new interface methods
- main/agent.go - Add firewall-allow CLI command

Removed
- platform/net/firewall_provider*.go - Old iptables/Windows Firewall implementation