feat: collect Docker operational logs on failure for AWF diagnostics #25548

@Mossaka

Summary

When AWF containers fail to start (e.g., Squid crashes on startup in DinD environments), we currently have no diagnostic information because the application-level logs (access.log, audit.jsonl) are never written. Debugging customer issues then takes multiple rounds of back-and-forth just to gather basic information such as docker logs output.

Motivation: A customer running ARC runners with DinD sidecars hit a Squid container crash (exit code 1) whose root cause was invisible: the Squid access logs were empty because Squid never started. Diagnosing it required asking the customer to manually add debug steps to their workflow. See #18385.

Proposal

Add a --diagnostic-logs flag (off by default) that collects Docker operational logs on failure and includes them in the firewall-audit-logs artifact under a diagnostics/ subdirectory.

What to collect on failure

Data | Command | Why
Container logs | docker logs <container> for squid, agent, api-proxy, iptables-init | Captures entrypoint stderr/stdout, showing WHY a container crashed
Container exit codes | docker inspect --format '{{.State.ExitCode}}' | Quick triage signal
Mount inspection | docker inspect --format '{{json .Mounts}}' | Shows what Docker actually mounted vs. what was requested (critical for DinD debugging)
Sanitized docker-compose.yml | Strip env vars containing tokens/keys | Shows the full container config without leaking secrets
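
To make the table concrete, here is a minimal sketch of how the per-container docker invocations could be assembled. The helper name, the output file names, and the assumption that the container names are exactly squid, agent, api-proxy, and iptables-init are all hypothetical; only the docker logs / docker inspect arguments come from this issue.

```typescript
// Hypothetical sketch: names and file layout are illustrative, not AWF's actual code.
// Assumes the containers are addressable by these exact names.
const DIAGNOSTIC_CONTAINERS = ["squid", "agent", "api-proxy", "iptables-init"] as const;

interface DiagnosticCommand {
  outputFile: string; // file name to write under diagnostics/ (assumed layout)
  argv: string[];     // arguments to pass to the `docker` binary
}

// Builds the three docker invocations listed in the table for one container.
function buildDiagnosticCommands(container: string, tailLines = 200): DiagnosticCommand[] {
  return [
    // Capped container logs (entrypoint stdout/stderr).
    { outputFile: `${container}.log`, argv: ["logs", "--tail", String(tailLines), container] },
    // Exit code only, not the full inspect output (which contains env vars).
    { outputFile: `${container}.exitcode`, argv: ["inspect", "--format", "{{.State.ExitCode}}", container] },
    // What Docker actually mounted.
    { outputFile: `${container}.mounts.json`, argv: ["inspect", "--format", "{{json .Mounts}}", container] },
  ];
}
```

Each argv would then be passed to whatever process runner the CLI already uses. A non-zero exit from docker (for example, because the container was already removed) would be treated as "skip this file" rather than as an error.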

What NOT to collect (even with the flag)

  • Raw environment variables (may contain API keys)
  • Full docker inspect output (contains env vars)
  • Host filesystem contents

Feature flag behavior

  • --diagnostic-logs: Opt-in flag, off by default
  • When enabled and AWF exits with a non-zero code, collect the above and write to ${auditDir}/diagnostics/ or ${workDir}/diagnostics/
  • When disabled (default), no additional data is collected — current behavior preserved
  • Consider making this default-on in a future release once validated
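
The flag behavior above could be sketched roughly as follows. This is an illustration only: the option names are hypothetical, and preferring auditDir over workDir when both are available is an assumption, since the issue does not say which one wins.

```typescript
// Hypothetical option shape; field names are assumptions, not AWF's real config.
interface DiagnosticsOptions {
  diagnosticLogs: boolean; // value of --diagnostic-logs (off by default)
  exitCode: number;        // AWF's exit code
  auditDir?: string;       // set when an audit directory is configured
  workDir: string;
}

// Returns the directory to write diagnostics to, or null when nothing
// should be collected (flag disabled, or AWF exited successfully).
function resolveDiagnosticsDir(opts: DiagnosticsOptions): string | null {
  if (!opts.diagnosticLogs || opts.exitCode === 0) {
    return null; // current behavior preserved
  }
  // Assumption: prefer auditDir when present, otherwise fall back to workDir.
  const base = (opts.auditDir ?? opts.workDir).replace(/\/+$/, "");
  return `${base}/diagnostics`;
}
```

Keeping the decision in one pure function means the catch block and the signal handlers can share it, and it can be unit tested without Docker.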

Implementation notes

  • Collection should happen in the cleanup/error path (src/cli.ts catch block and signal handlers)
  • Use docker logs with --tail 200 to cap output size
  • Sanitize docker-compose.yml by redacting any env var value containing token, key, secret, password (case-insensitive)
  • If a container doesn't exist (already cleaned up), skip gracefully
  • Bundle into existing firewall-audit-logs artifact upload path
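
As a starting point for the sanitization step, here is a minimal sketch of the redaction rule. It assumes the match is on the variable name (the wording above could also be read as matching on the value), and it covers both forms Compose allows for environment: a map and a list of NAME=value strings. The function names and the placeholder string are hypothetical.

```typescript
// Assumption: redact based on the variable NAME, case-insensitively.
const SENSITIVE_NAME = /token|key|secret|password/i;
const REDACTED = "***REDACTED***"; // placeholder text is an arbitrary choice

// Map form: environment: { NAME: value, ... }
function redactEnvMap(env: Record<string, string>): Record<string, string> {
  const out: Record<string, string> = {};
  for (const name of Object.keys(env)) {
    out[name] = SENSITIVE_NAME.test(name) ? REDACTED : env[name];
  }
  return out;
}

// List form: environment: ["NAME=value", ...]
function redactEnvList(entries: string[]): string[] {
  return entries.map((entry) => {
    const eq = entry.indexOf("=");
    if (eq === -1) {
      return entry; // bare NAME carries no value in the file, so nothing to redact
    }
    const name = entry.slice(0, eq);
    return SENSITIVE_NAME.test(name) ? `${name}=${REDACTED}` : entry;
  });
}
```

Name-based matching will over-redact harmless variables whose names happen to contain "key", which seems like the right trade-off for the "No secrets leaked" criterion. It will not catch secrets stored under innocuous names, so a value-based check may be worth considering as a follow-up.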

Acceptance criteria

  • --diagnostic-logs flag added to AWF CLI
  • On failure with flag enabled: container logs, exit codes, mount info, and sanitized compose config collected
  • Output written to diagnostics/ subdirectory alongside existing audit artifacts
  • No secrets leaked in collected diagnostics
  • Works in both standard and DinD environments
  • Documentation updated
