Add agent-managed firewall with nftables for NATS/monit access control #394

Draft
rkoster wants to merge 46 commits into cloudfoundry:main from rkoster:feature/agent-managed-firewall

Conversation

@rkoster
Contributor

@rkoster rkoster commented Jan 30, 2026

Summary

  • Move firewall management from stemcell scripts to bosh-agent to properly handle mixed cgroup environments in nested container scenarios
  • Add nftables-based firewall using github.com/google/nftables library (pure Go, no C dependencies)
  • Support DNS re-resolution on NATS reconnect for HA failover scenarios where director may move to a different IP

Replaces

Supported Container Scenarios

| Host  | Container | Cgroup         | Matching Strategy                 |
|-------|-----------|----------------|-----------------------------------|
| Noble | Jammy     | v2 (inherited) | cgroup v2 socket matching         |
| Noble | Noble     | v2             | cgroup v2 socket matching         |
| Jammy | Noble     | v1/hybrid      | UID-based fallback (meta skuid 0) |
| Jammy | Jammy     | v1             | net_cls classid matching          |
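The runtime detection behind this table can be sketched as a small classifier over the contents of `/proc/self/cgroup`. This is an illustrative sketch, not the agent's actual code; the function name is hypothetical. A pure v2 (unified) hierarchy exposes a single `0::` entry, while any additional controller entries indicate v1 or a hybrid layout:

```go
package main

import (
	"fmt"
	"strings"
)

// detectCgroupVersion classifies a /proc/self/cgroup snapshot.
// "0::" alone means a unified (v2) hierarchy; "0::" alongside
// controller entries means hybrid; otherwise pure v1.
func detectCgroupVersion(procSelfCgroup string) string {
	hasUnified := false
	hasLegacy := false
	for _, line := range strings.Split(strings.TrimSpace(procSelfCgroup), "\n") {
		if strings.HasPrefix(line, "0::") {
			hasUnified = true
		} else if line != "" {
			hasLegacy = true
		}
	}
	switch {
	case hasUnified && !hasLegacy:
		return "v2"
	case hasUnified && hasLegacy:
		return "hybrid"
	default:
		return "v1"
	}
}

func main() {
	fmt.Println(detectCgroupVersion("0::/system.slice/bosh-agent.service"))
	fmt.Println(detectCgroupVersion("12:pids:/\n0::/\n3:net_cls:/"))
}
```

Because the input is just file content, the same logic works unchanged inside containers that inherit the host's hierarchy (the Noble-host/Jammy-container row above).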

Architectural Changes

Firewall lifecycle moved from network setup to agent bootstrap + mbus hooks

Before (iptables):

  • Firewall rules set up during SetupNetworking() in net managers
  • Single-shot setup at network configuration time
  • DNS resolved once at startup; no re-resolution on reconnect

After (nftables):

  • SetupFirewall(mbusURL) called during agent bootstrap after networking
  • Monit chain: Set up once at bootstrap, never modified
  • NATS chain: Flushed and rebuilt before each NATS connect/reconnect via NatsFirewallHook.BeforeConnect() hook
  • DNS re-resolved on every reconnect attempt, enabling HA failover to new director IP
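The re-resolution step above hinges on extracting the mbus host before each connect attempt. A minimal sketch of that parsing, assuming a hypothetical `resolveMbusHost` helper (the real hook lives behind the `NatsFirewallHook.BeforeConnect()` interface):

```go
package main

import (
	"fmt"
	"net"
	"net/url"
)

// resolveMbusHost splits an mbus URL into host and port. A real
// BeforeConnect hook would then call net.LookupHost(host) on every
// reconnect attempt and rebuild the NATS chain with the fresh IPs,
// so a director failover to a new IP is picked up automatically.
func resolveMbusHost(mbusURL string) (host, port string, err error) {
	u, err := url.Parse(mbusURL)
	if err != nil {
		return "", "", err
	}
	host, port, err = net.SplitHostPort(u.Host)
	if err != nil {
		return "", "", err
	}
	return host, port, nil
}

func main() {
	h, p, err := resolveMbusHost("nats://director.example.com:4222")
	fmt.Println(h, p, err)
}
```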

New Platform interface methods

SetupFirewall(mbusURL string) error
GetNatsFirewallHook() firewall.NatsFirewallHook

Cross-table coordination with packet marks

nftables evaluates each table's chains independently. To coordinate with the stemcell's base bosh_firewall table:

  1. Agent's bosh_agent table (priority -1) sets packet mark 0xb054 on allowed packets
  2. Stemcell's bosh_firewall table (priority 0) checks for this mark and skips DROP rules
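A schematic view of the two-table handshake, with illustrative chain names and a placeholder for the agent's cgroup match (the exact expressions differ in the real rulesets):

```
# agent table, evaluated first (priority -1): mark allowed traffic
table inet bosh_agent {
  chain monit {
    type filter hook input priority -1; policy accept;
    tcp dport 2822 socket cgroupv2 level 0 <agent-cgroup> meta mark set 0xb054 accept
  }
}

# stemcell base table (priority 0): honour the mark, drop the rest
table inet bosh_firewall {
  chain input {
    type filter hook input priority 0; policy accept;
    meta mark 0xb054 accept
    tcp dport 2822 drop
  }
}
```

The mark is needed because an `accept` verdict in `bosh_agent` ends evaluation only within that table; without the mark, `bosh_firewall` would still evaluate and drop the packet.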

Runtime cgroup detection

Cgroup version detected at runtime from /proc/self/cgroup:

  • Jammy container on Noble host → detects cgroup v2 (inherited from host)
  • Hybrid cgroup systems (v2 mounted but no controllers) → falls back to UID matching

New CLI command for BOSH jobs

bosh-agent firewall-allow monit

Called by BOSH jobs needing monit API access for controlled failover scenarios.

Note

The CLI-based nftables fallback (NftablesFirewallCLI) is temporary and is only included for debugging while getting the integration tests to pass. It will be removed before merge.

Files Changed

New

  • platform/firewall/ - nftables implementation with cgroup detection

Modified

  • agent/bootstrap.go - Call SetupFirewall() after networking
  • mbus/nats_handler.go - Add firewall hook before connect/reconnect
  • platform/*_platform.go - Implement new interface methods
  • main/agent.go - Add firewall-allow CLI command

Removed

  • platform/net/firewall_provider*.go - Old iptables/Windows Firewall implementation

@aramprice
Member

It looks like there are a lot of diffs in generated fakes. Are these because of a change in counterfeiter? If so, could these changes be broken out?

@rkoster
Contributor Author

rkoster commented Jan 30, 2026

Yeah definitely, will do some more cleanup before marking it as ready for review.

Move firewall management from stemcell scripts to bosh-agent to properly
handle Jammy containers running on Noble hosts (cgroup v2 inherited from
host).

- Create platform/firewall package with nftables implementation using
  github.com/google/nftables library (pure Go, no C deps)
- Add SetupFirewall() to Platform interface, called during bootstrap
- Add 'bosh-agent firewall-allow <service>' CLI command for monit wrapper
- Detect cgroup version at runtime from /proc/self/cgroup
- NATS firewall only enabled for Jammy (Noble uses ephemeral credentials)
- Gracefully handle missing base firewall (warn, don't fail)
Refactor NftablesFirewall to use dependency injection for testability:
- Add NftablesConn, CgroupResolver, and OSReader interfaces
- Add NewNftablesFirewallWithDeps constructor for testing
- Export TableName, ChainName, and MonitPort constants
- Generate counterfeiter fakes with Linux build tags

Add scenario-based test coverage for:
- SetupAgentRules with various NATS URL formats
- OS-specific behavior (Jammy adds NATS rules, Noble skips them)
- Cgroup v2 rule generation with path verification (including nested container paths)
- Cgroup v1 rule generation with classid verification
- AllowService and Cleanup operations
The firewall-allow CLI is called by BOSH-deployed jobs that need to
interact with the monit API directly for controlled failover scenarios
(e.g., monitoring process lifecycle). Jobs call this via the
permit_monit_access helper function.
Remove OSReader dependency from firewall manager. NATS firewall enablement
is now controlled via Platform.Linux.EnableNATSFirewall in agent.json.

- Jammy stemcells set EnableNATSFirewall: true (uses static NATS credentials)
- Noble stemcells leave it unset/false (uses ephemeral NATS credentials)

This properly handles all deployment scenarios including Jammy containers
running on Noble hosts with cgroup v2.
Migrate NATS firewall management from network configuration time (iptables)
to mbus connection hook (nftables). This enables DNS re-resolution before
each connection/reconnection, supporting HA failover scenarios where the
director may move to a different IP.

Architecture changes:
- Monit chain: set up once at bootstrap, never modified
- NATS chain: flushed and rebuilt before each NATS connect/reconnect
  via BeforeConnect hook in mbus handler

Key changes:
- Add NatsFirewallHook interface with BeforeConnect(mbusURL) method
- Add GetNatsFirewallHook() to Platform interface
- Store firewall manager in linux platform for reuse
- Call BeforeConnect before initial connection and in reconnect handler
- Remove old iptables-based firewall_provider from platform/net
- Update integration tests from iptables to nftables
The NATS firewall now adds two rules per director IP:
1. ACCEPT rule for agent's cgroup (allows agent to connect)
2. DROP rule for everyone else (blocks malicious workloads)

This protects against metadata service attacks where workloads could
extract NATS certs and impersonate the agent.
The netlink-based nftables library fails with EOVERFLOW in nested
container environments (e.g., Garden containers inside Concourse).
This adds a CLI-based implementation that uses the nft command
directly, which works reliably in these environments.

When the netlink implementation fails, the agent automatically
falls back to the CLI implementation.
On hybrid cgroup systems (cgroup v2 mounted but with no controllers
enabled), nftables 'socket cgroupv2' matching doesn't work because
the socket-to-cgroup association isn't established.

This is detected by checking if /sys/fs/cgroup/cgroup.controllers
has content. If empty, we fall back to UID-based matching (meta skuid 0)
which allows any root process to connect to monit/NATS.

Trade-off: UID matching is less secure (any root process can connect)
but works on all systems including nested containers on hybrid hosts.
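The hybrid-cgroup check described above reduces to inspecting one file. A sketch, with a hypothetical function name; the real code reads `/sys/fs/cgroup/cgroup.controllers` and passes its content here:

```go
package main

import (
	"fmt"
	"strings"
)

// socketMatchUsable reports whether 'socket cgroupv2' matching can work.
// An empty cgroup.controllers file means cgroup v2 is mounted but has no
// controllers (hybrid mode), so the socket-to-cgroup association is not
// established and the firewall must fall back to UID matching (meta skuid 0).
func socketMatchUsable(controllersFileContent string) bool {
	return strings.TrimSpace(controllersFileContent) != ""
}

func main() {
	fmt.Println(socketMatchUsable("cpuset cpu io memory pids"))
	fmt.Println(socketMatchUsable("\n"))
}
```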
nftables evaluates each table's chains independently - an ACCEPT verdict
in one table doesn't prevent other tables from also evaluating the packet.

When the agent allows a packet to monit, it now sets mark 0xb054. The base
bosh_firewall table checks for this mark and skips the DROP rule when set.

This enables proper coordination between the agent-managed bosh_agent table
(priority -1) and the stemcell's bosh_firewall table (priority 0).
…ments

- Accept both socket cgroupv2 classid and meta cgroup matching patterns
- Use (?s) flag for multiline regex matching
- Parse hostname from BOSH_ENVIRONMENT URL
- Handle hybrid cgroup v1+v2 systems (grep ^0:: for unified hierarchy)
- Support both /sys/fs/cgroup and /sys/fs/cgroup/unified mount points
- Skip cgroup access test when cgroup manipulation is not permitted (nested containers)
Noble stemcells use systemd for service management instead of runit.
This change adds OS detection to use the appropriate service manager:
- isSystemdSystem() detects Noble via /etc/lsb-release
- StopAgent/StartAgent use systemctl for Noble, sv for Jammy
- system_mounts_test accepts both service manager log patterns
Use OS detection to stop agent with appropriate service manager
(systemctl for Noble, sv for Jammy) before installing new agent binary.
- Add regex pattern for pure cgroupv2 with systemd path format
- Wait for NATS rules to be populated before checking nftables
- Use Eventually for nftables check to handle timing issues
- Truncate log file in BeforeEach to avoid finding old logs from pipeline setup
- Delete existing nftables table to ensure clean state for fresh agent startup
- Remove unreliable log-based waiting (differs between runit/systemd)
- Wait up to 300 seconds for NATS rules to appear in nftables
- Add meta skuid pattern for hybrid cgroup UID-based matching
Remove nohup wrapper for systemd commands since systemctl is designed
to work synchronously and doesn't need background execution like runit's sv.
On systemd systems, 'systemctl start' is a no-op if the service is already
running. Tests that need a fresh agent state (e.g., firewall tests that
delete the nftables table) require a restart to ensure the agent recreates
its firewall rules on startup.
On Noble (systemd-based stemcells), auditd requires /var/log/audit
to exist. When CleanupDataDir unmounts /var/log and recreates it
as an empty directory, the audit subdirectory is missing, causing
auditd to fail. The agent's bootstrap then fails because the
bosh-start-logging-and-auditing script cannot start auditd.

This fix ensures /var/log/audit is created with proper permissions
after /var/log is recreated.
- Fix SetupAgentRules signature in nftables_firewall_other.go to match
  Manager interface (add enableNATSFirewall parameter)
- Fix nil pointer dereference in nats_handler.go ClosedHandler when
  connection closes gracefully without error
- Remove CLI-based nftables fallback from linux_platform.go
- Delete nftables_firewall_cli.go (CLI implementation no longer needed)
Add //nolint:errcheck comments for intentionally ignored errors and
check url.Parse error return value.
The CgroupResolver and NftablesConn interfaces are defined in
nftables_firewall.go which has //go:build linux. Use counterfeiter's
-header option to automatically include the linux build tag when
generating these fakes.

Add linux_header.txt containing the build tag, and reference it in
the counterfeiter:generate directives for Linux-only interfaces.
Validate that the parsed port is within the valid TCP port range (1-65535)
before converting to uint16 in buildTCPDestPortExprs. This prevents
potential integer overflow when converting from architecture-dependent
int to uint16.
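The range check described above can be sketched as follows; `parsePort` is an illustrative stand-in for the validation inside `buildTCPDestPortExprs`:

```go
package main

import (
	"fmt"
	"strconv"
)

// parsePort rejects anything outside 1-65535 before narrowing to uint16,
// so an out-of-range value cannot be silently truncated when converting
// from an architecture-dependent int.
func parsePort(s string) (uint16, error) {
	p, err := strconv.Atoi(s)
	if err != nil {
		return 0, err
	}
	if p < 1 || p > 65535 {
		return 0, fmt.Errorf("port %d out of valid TCP range 1-65535", p)
	}
	return uint16(p), nil
}

func main() {
	fmt.Println(parsePort("4222"))
	fmt.Println(parsePort("70000"))
}
```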
The google/nftables Go library has a bug encoding 'socket cgroupv2'
expressions via netlink, causing 'netlink receive: value too large
for defined data type' errors on Ubuntu Noble. The nft CLI works
correctly, but the Go netlink implementation is broken.

Until the library is fixed, always use UID-based matching (meta skuid 0)
for cgroup v2 systems instead of socket cgroupv2 path matching. This
allows root processes to access protected services (NATS, monit).

Updated tests to reflect the new behavior where cgroup v2 systems
always use UID-based matching regardless of IsCgroupV2SocketMatchFunctional.
The nftables kernel expects an 8-byte cgroup inode ID for 'socket cgroupv2'
matching, not the path string. The nft CLI translates paths to inode IDs
automatically, but the Go nftables library does not.

This commit:
- Adds GetCgroupID() to look up the cgroup inode via syscall.Stat
- Updates buildCgroupMatchExprs() to use the 8-byte inode ID
- Re-enables cgroup v2 socket matching (was disabled in previous workaround)
- Updates tests to verify cgroup ID-based matching

Verified on Ubuntu Noble (24.04) with cgroup v2: agent starts successfully
and nftables rules correctly match the bosh-agent cgroup.
Simplify cgroup matching logic by removing the UID-based fallback that was
used for hybrid cgroup systems (v2 mounted but with v1 controllers).

The socket cgroupv2 matching should work regardless of whether cgroup
controllers are delegated to v2, since the kernel's socket-to-cgroup
association is based on the cgroup hierarchy, not controller delegation.

This simplifies the code from 3 matching strategies to 2:
- Cgroup v1: meta cgroup <classid>
- Cgroup v2: socket cgroupv2 level 2 <inode_id>

Integration tests will verify if this works in nested container scenarios.
Add integration tests to verify nftables firewall rules work correctly
inside nested Garden/runC containers using OCI stemcell images.

New files:
- integration/utils/garden_client.go: Garden client helper that connects
  through SSH tunnel, supports container creation, command execution,
  file streaming, and cgroup detection
- integration/garden_firewall_test.go: Tests cgroup detection, nftables
  availability, and firewall rule creation in Garden containers

Tests verify that on cgroup v2 (Noble), firewall rules use proper cgroup
socket matching rather than falling back to UID-based matching.

Tests skip automatically if GARDEN_ADDRESS env var is not set.

Dependencies added:
- code.cloudfoundry.org/garden
- code.cloudfoundry.org/lager/v3

Vendor cleanup: removed unused dependencies that were cleaned up by
go mod tidy (iptables, go-systemd, staticcheck, etc.)
Tests now fail immediately with clear error messages when:
- GARDEN_ADDRESS environment variable is not set
- Garden client fails to connect
- bosh-agent-linux-amd64 binary is not built
- Agent fails to create firewall table

This ensures test failures are visible in CI rather than silently skipped.
…g testing

The Garden container firewall tests need to run in an environment where
Garden remains functional throughout the test. The main integration test
suite's SynchronizedBeforeSuite runs CleanupDataDir() which removes
/var/vcap/data, destroying the Garden job.

This commit:
- Moves garden_firewall_test.go to integration/garden/ as a separate test suite
- Adds deploy-to-noble.sh script for local testing with Garden deployment
- Fixes test assertions and adds proper agent configuration (agent.json)
- Uses CGO_ENABLED=0 for static binary that works in containers

All 7 Garden firewall tests now pass, confirming that the agent correctly
uses cgroup v2 socket matching (not UID fallback) in nested containers.
- Create nft-dump Go utility that uses nftables library to inspect firewall
  rules without requiring nft CLI. Outputs human-readable YAML with
  interpreted values (IP addresses, ports, marks, etc.)

- Update Garden firewall tests to use nft-dump instead of nft CLI for
  verifying firewall rules work correctly in containers

- Fix Jammy stemcell container tests by unmounting Garden's bind-mounted
  /etc/resolv.conf, /etc/hosts, /etc/hostname before starting the agent.
  This prevents the 'Device or resource busy' error when the agent runs
  resolvconf -u (same approach used by bosh-warden-cpi)

- Add nft-dump helper methods to GardenClient and TestEnvironment for
  use in both container and VM-based integration tests

- Update deploy-to-noble.sh to build nft-dump binary for container tests

Tests now pass for both Noble and Jammy stemcells.
@rkoster rkoster force-pushed the feature/agent-managed-firewall branch from f18859f to d262967 on February 3, 2026 at 11:12
Update InstallNftDump in both TestEnvironment and GardenClient to
automatically build the nft-dump binary if it doesn't exist, rather
than requiring it to be pre-built. This makes the integration tests
self-contained and removes the need for CI pipelines to build the
binary separately.
On systemd-based systems like Noble, /var/tmp can have issues similar to
/var/log where systemd services briefly hold file handles open even after
fuser -km. This caused intermittent 'rm -rf /var/tmp' failures in test
cleanup.

Changes:
- Use lazy unmount for /var/tmp (like we already do for /var/log)
- Add retry logic with brief sleeps for rm -rf to handle transient busy state
The test was being skipped because we couldn't move a process into the
agent's cgroup on systemd systems. This adds:
1. Debug output to understand the cgroup hierarchy
2. Try creating a child cgroup under the agent's cgroup (which should be
   allowed with proper delegation) instead of writing directly to the
   agent's cgroup.procs
The previous pattern 'pgrep -f "bosh-agent$"' expected the command line
to END with 'bosh-agent', but on systemd the full command line is:
/var/vcap/bosh/bin/bosh-agent -P ubuntu -C /var/vcap/bosh/agent.json

Use 'pgrep -x bosh-agent' to match the exact process name, with fallback
to matching the full binary path.
On systemd systems, /var/tmp may be held by various services and difficult
to remove even with retries. Since we always recreate /var/tmp right after
the cleanup attempt, this failure is not critical.
When losetup -a returns empty (no loop devices), splitting on newline
produces an empty string which caused 'sudo losetup -d' to run with no
argument and fail. Skip empty strings to avoid this spurious error.
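The pitfall fixed here is easy to reproduce: splitting an empty string on newline yields one empty element, not zero. A minimal sketch with a hypothetical helper:

```go
package main

import (
	"fmt"
	"strings"
)

// nonEmptyLines filters out empty strings after a newline split. Without
// this, empty `losetup -a` output would produce [""] and feed an empty
// argument to 'losetup -d'.
func nonEmptyLines(out string) []string {
	var lines []string
	for _, l := range strings.Split(out, "\n") {
		if strings.TrimSpace(l) != "" {
			lines = append(lines, l)
		}
	}
	return lines
}

func main() {
	fmt.Println(len(nonEmptyLines("")))
	fmt.Println(len(nonEmptyLines("/dev/loop0: (/tmp/store.img)\n")))
}
```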
Add a Go package that installs Garden from a compiled BOSH release tarball
onto a VM via SSH, enabling integration tests to verify nftables firewall
behavior inside Garden containers.

Key components:
- gardeninstaller: Extracts packages/templates from release tarball
- SSH driver with sudo support for non-root users
- XFS-backed grootfs stores (10GB for multiple stemcell images)
- Generates config files matching garden-runc-release format

The test suite now automatically installs Garden when GARDEN_RELEASE_TARBALL
is set, removing the need for garden-runc as a BOSH release dependency.

Tests verify firewall rules work in both Noble and Jammy stemcells.
…upport

- Add UseDirectStore config option to skip XFS loopback creation for nested
  containers where loop devices are unavailable
- Add initDirectStores() for grootfs init without store-size-bytes
- Add initXFSStores() for standard XFS loopback initialization
- Fix POSIX compatibility: use '.' instead of 'source' for shell sourcing
- Add GardenClient methods for nested Garden containers:
  - NewGardenClientWithAddress: connect to nested Garden via forwarded ports
  - CreateNestedGardenContainer: create privileged container with cgroup mounts
  - StreamTarball: stream gzipped tarballs into containers
  - Container: expose underlying container for GardenDriver
…n tests

- Extract Driver interface to integration/installerdriver package with
  Bootstrap(), Cleanup(), and IsBootstrapped() lifecycle methods
- Move SSHDriver and GardenDriver to installerdriver package with
  config-based constructors (SSHDriverConfig, GardenDriverConfig)
- Add GardenAPIClient helper for connecting to Garden through SSH tunnel
- Create agentinstaller package for installing bosh-agent in containers
- Add DialAgentThroughJumpbox() to utils for proper SSH tunneling to
  agent VM (fixes bind mount directory creation on wrong host)
- Update garden_firewall_test.go and nested_garden_firewall_test.go to
  use the new Driver pattern with proper SSH connections
- Add test_helpers.go with stemcell image constants and utility functions

The refactor enables running Garden inside Garden containers (nested)
for testing firewall rules at multiple nesting levels. All 10 single-level
Garden firewall tests pass (Noble + Jammy stemcells).
…upport

- Disable containerd mode for L1/L2 Garden installations (containerd cannot
  run inside containers due to missing capabilities and cgroups)
- Use dynamic IP allocation from Garden pools instead of NetIn port forwarding
  (nftables-based systems may not support iptables NetIn rules)
- Add TCP forwarder (socat) in L1 to reach L2 Garden from host
- Add ContainerdMode and NetworkPool options to gardeninstaller Config
- Add ContainerIP() and Handle() methods to GardenDriver
- Fix busybox version mismatch (1.37.0 -> 1.36.1) to match release tarball
- Document cgroup bind mount requirements for firewall in nested containers

Network topology:
  Host -> SSH -> Agent VM (Garden L0 :7777, containerd=on)
           -> L1 container (Garden :7777, containerd=off, pool 10.253.0.0/22)
              -> L2 container (Garden :7777, containerd=off, pool 10.252.0.0/22)

All 8 nested garden firewall tests now pass, verifying nftables rules work
correctly up to 3 levels of container nesting.
Add netcat-based tunneling to reach nested Garden containers (L2) from
the test runner. This approach spawns netcat processes inside intermediate
containers to tunnel TCP traffic.

Files added:
- garden_process_conn.go: net.Conn wrapper around Garden process stdin/stdout
- container_tunnel.go: persistent tunnel attempt (incomplete)
- tcp_forwarder.go: Python-based TCP proxy for container port forwarding

Changes:
- garden_api_client.go: Add ensureLogger() to fix nil logger panic,
  add NewGardenAPIClientThroughContainer() for netcat tunneling
- driver_garden.go: Add Container() method to expose underlying container
- nested_garden_firewall_test.go: Use netcat tunnel for L2 Garden connectivity

Known issue: EOF errors occur during long operations like container creation
with large images, likely due to netcat process lifecycle issues.

This commit preserves the netcat approach for reference before switching
to the NetIn port forwarding approach.
NetIn is Garden's built-in port forwarding mechanism that creates iptables
DNAT rules. This is more reliable than the netcat tunnel approach which
suffered from EOF errors during long operations.

- Use L0 Garden NetIn (17777→7777) to reach L1 Garden via external IP
- Use L1 Garden NetIn (27777→7777) to reach L2 Garden via L1 container IP
- Remove netcat-based tunnel code (garden_process_conn.go, container_tunnel.go,
  tcp_forwarder.go)
- Simplify garden_api_client.go to only NewGardenAPIClient and
  NewGardenAPIClientDirect
…nnections

- Fix cgroup v2 socket matching to use Level 0 (direct cgroup) instead of Level 2
- Add DROP rule for non-agent connections to monit port 2822
- Add IsRunningUnderSystemd() detection to warn when cgroup isolation is ineffective
- Improve nested firewall test with agent log capture for debugging
- Add agentinstaller package for bosh-agent installation in test containers
- Add cgrouputils package for cgroup diagnostics in tests