Pac-Man AI Agent Architecture

1. Core Design Decisions

Decision: Direct screen capture (preferred over a webcam feed). Reasoning:

Latency & Frame Rate: Direct capture (e.g., via mss) reads frames straight from the OS, so it typically achieves a higher frame rate and lower latency than processing a webcam feed.
Fidelity: Screen capture provides pixel-perfect data, avoiding the glare, perspective distortion, and lighting changes that plague webcam-based computer vision.
Reliability: It removes external physical variables (camera position, room lighting).
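
To check the latency claim on a given machine, the capture rate can be measured directly. The following is a minimal sketch, not the project's actual capture module: `make_mss_grabber`, `measure_fps`, and the region values are hypothetical names chosen for illustration. The only mss calls used are `mss.mss()`, `grab()`, and the `width`, `height`, and `bgra` attributes of the returned screenshot.

```python
import time
from typing import Callable, Dict, Tuple

Region = Dict[str, int]              # mss-style region: top, left, width, height
Frame = Tuple[int, int, bytes]       # (width, height, raw BGRA bytes)
Grabber = Callable[[Region], Frame]


def make_mss_grabber() -> Grabber:
    """Build a grabber backed by mss (third-party, imported lazily)."""
    import mss  # assumption: mss is installed and a display is available

    sct = mss.mss()

    def grab(region: Region) -> Frame:
        shot = sct.grab(region)
        return shot.width, shot.height, shot.bgra

    return grab


def measure_fps(grab: Grabber, region: Region, frames: int = 30) -> float:
    """Time `frames` consecutive captures and return the achieved rate."""
    start = time.perf_counter()
    for _ in range(frames):
        grab(region)
    elapsed = time.perf_counter() - start
    return frames / elapsed if elapsed > 0 else float("inf")


if __name__ == "__main__":
    # Hypothetical game window position; adjust to the real window.
    game_region = {"top": 100, "left": 100, "width": 448, "height": 576}
    print(f"capture rate: {measure_fps(make_mss_grabber(), game_region):.1f} FPS")
```

Because the grabber is just a callable, the same `measure_fps` function could be pointed at a webcam-backed grabber to compare the two options on equal terms.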

Decision: Hybrid approach (local CV for control, cloud AI for strategy).

Local CV (OpenCV/ML): Handles the high-frequency control loop (30 FPS). It performs immediate tasks: sprite detection, grid mapping, and collision avoidance. It must be fast and deterministic.
Cloud AI (Gemini): Acts as a "Coach" or "Strategist". It runs asynchronously (e.g., every few seconds or on demand) to analyze the broader game state, suggest heuristic tuning, or provide high-level strategy (e.g., "Focus on the bottom-left quadrant"). It is NOT in the critical path for frame-by-frame movement.
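
One way to keep the coach out of the critical path is a background thread fed through queues. This is a sketch under stated assumptions: `ask_coach` is a hypothetical stand-in for whatever function eventually calls Gemini, and `run_control_loop`, `coach_every`, and `frame_period` are illustrative names, not part of the design above.

```python
import queue
import threading
import time
from typing import Callable, Dict, List


def coach_worker(snapshots: queue.Queue, hints: queue.Queue,
                 ask_coach: Callable[[Dict], str]) -> None:
    """Background thread: turn state snapshots into strategy hints."""
    while True:
        snapshot = snapshots.get()
        if snapshot is None:          # sentinel: shut down
            break
        hints.put(ask_coach(snapshot))


def run_control_loop(frames: int, ask_coach: Callable[[Dict], str],
                     coach_every: int = 10, frame_period: float = 1 / 30) -> List[str]:
    snapshots: queue.Queue = queue.Queue(maxsize=1)   # at most one pending request
    hints: queue.Queue = queue.Queue()
    worker = threading.Thread(target=coach_worker,
                              args=(snapshots, hints, ask_coach), daemon=True)
    worker.start()

    strategy = "default"
    applied = []
    for i in range(frames):
        # Pick up a new hint if one has arrived; never wait for it.
        try:
            strategy = hints.get_nowait()
        except queue.Empty:
            pass
        applied.append(strategy)      # stand-in for the local CV + movement step
        if i % coach_every == 0:
            try:
                snapshots.put_nowait({"frame": i})
            except queue.Full:
                pass                  # coach still busy: skip instead of blocking
        time.sleep(frame_period)

    snapshots.put(None)               # ask the worker to exit, then wait for it
    worker.join()
    return applied
```

The frame loop only ever uses non-blocking queue calls, so a slow or failed coach request delays the next hint but not the next frame.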

The architecture follows a classic robotics sense-plan-act cycle:

Capture: Grab the latest frame from the OS window manager.
Perception (Vision):
- Crop to game region.
- Detect entities (Pac-Man, Ghosts, Pellets).
- Update the internal World Model (Grid/Graph).
Reasoning (Agent):
- Update pathfinding costs (e.g., ghost proximity).
- Select the next best action (Policy).
Action (Control):
- Send keyboard input to the OS.
Feedback/Logging:
- Log state for debugging or for asynchronous submission to Gemini.
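
The cycle above can be sketched end to end on a toy input. Everything here is an illustrative assumption: the frame is a character grid rather than pixels, and `WorldModel`, `perceive`, `cell_cost`, `decide`, `step`, and the `ghost_weight` value are hypothetical names and numbers, not the project's real classes or tuning.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set, Tuple

Cell = Tuple[int, int]  # (row, col)
MOVES = {"UP": (-1, 0), "DOWN": (1, 0), "LEFT": (0, -1), "RIGHT": (0, 1)}


@dataclass
class WorldModel:
    walls: Set[Cell] = field(default_factory=set)
    pellets: Set[Cell] = field(default_factory=set)
    ghosts: List[Cell] = field(default_factory=list)
    pacman: Cell = (0, 0)


def perceive(frame: List[str]) -> WorldModel:
    """Perception: '#' wall, '.' pellet, 'G' ghost, 'P' Pac-Man."""
    world = WorldModel()
    for r, row in enumerate(frame):
        for c, ch in enumerate(row):
            if ch == "#":
                world.walls.add((r, c))
            elif ch == ".":
                world.pellets.add((r, c))
            elif ch == "G":
                world.ghosts.append((r, c))
            elif ch == "P":
                world.pacman = (r, c)
    return world


def manhattan(a: Cell, b: Cell) -> int:
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def cell_cost(cell: Cell, world: WorldModel, ghost_weight: float = 10.0) -> float:
    """Reasoning: distance to the nearest pellet plus a ghost-proximity penalty."""
    pellet_dist = min((manhattan(cell, p) for p in world.pellets), default=0)
    ghost_penalty = sum(ghost_weight / (1 + manhattan(cell, g)) for g in world.ghosts)
    return pellet_dist + ghost_penalty


def decide(world: WorldModel) -> str:
    """Policy: pick the cheapest non-wall neighbour (maze assumed wall-enclosed)."""
    options = {}
    for name, (dr, dc) in MOVES.items():
        nxt = (world.pacman[0] + dr, world.pacman[1] + dc)
        if nxt not in world.walls:
            options[name] = cell_cost(nxt, world)
    return min(options, key=options.get) if options else "UP"


def step(frame: List[str], send_key: Callable[[str], None], log: List[Dict]) -> str:
    world = perceive(frame)                                   # Perception
    action = decide(world)                                    # Reasoning
    send_key(action)                                          # Action
    log.append({"pacman": world.pacman, "action": action})    # Feedback/Logging
    return action
```

A single-lookahead cost like this is only a placeholder for real pathfinding; it shows where ghost proximity enters the decision, not how far ahead the agent should plan.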

capture/: Abstraction for getting image data.
vision/: Image processing. Converts pixels -> structured data (positions, types).
agent/: Brains. Converts structured data -> decisions (UP/DOWN/LEFT/RIGHT).
control/: Actuators. Converts decisions -> OS events.
ai_google/: High-level intelligence. Interface to Gemini.
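
As an illustration of the boundary between agent/ and control/, the sketch below passes a small `Direction` value across it and leaves the OS-level key press to an injected backend. `Direction`, `KeyboardController`, and the key names are assumptions made for this example; the source does not name an input library. A real backend could be a function such as `pyautogui.press`, which is likewise an assumption.

```python
from enum import Enum
from typing import Callable, List, Optional


class Direction(Enum):
    """The four decisions the agent can emit; values are key names."""
    UP = "up"
    DOWN = "down"
    LEFT = "left"
    RIGHT = "right"


class KeyboardController:
    """control/: converts a Direction into an OS key event via `press`."""

    def __init__(self, press: Callable[[str], None]) -> None:
        self._press = press
        self._last: Optional[Direction] = None

    def send(self, direction: Direction) -> None:
        # Design choice for this sketch: Pac-Man keeps moving in the current
        # direction, so a key is pressed only when the decision changes.
        if direction is not self._last:
            self._press(direction.value)
            self._last = direction


if __name__ == "__main__":
    pressed: List[str] = []
    controller = KeyboardController(pressed.append)   # fake backend for a dry run
    for d in (Direction.UP, Direction.UP, Direction.LEFT):
        controller.send(d)
    print(pressed)
```

Injecting the backend keeps the module testable without a running game and lets the input library be changed in one place.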

The Agent class structure allows swapping SimpleHeuristicAgent with RLAgent without changing the vision or control pipelines.
The StateEstimator decouples raw pixels from the agent's understanding of the game, allowing us to switch from template matching to a CNN detector without breaking the agent logic.
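
Both swap points can be expressed as small interfaces. The sketch below reuses the class names from the text, but their signatures are assumptions: `act`, `estimate`, `Detections`, and `char_grid_detector` are hypothetical, and the toy detector scans a character grid in place of real template matching or a CNN.

```python
from typing import Callable, Dict, List, Protocol, Tuple

Cell = Tuple[int, int]                       # (row, col)
Detections = Dict[str, List[Cell]]           # entity name -> positions
Detector = Callable[[List[str]], Detections]


def char_grid_detector(frame: List[str]) -> Detections:
    """Toy detector: any function with this signature can replace it."""
    names = {"P": "pacman", "G": "ghosts", ".": "pellets"}
    found: Detections = {"pacman": [], "ghosts": [], "pellets": []}
    for r, row in enumerate(frame):
        for c, ch in enumerate(row):
            if ch in names:
                found[names[ch]].append((r, c))
    return found


class StateEstimator:
    """Hides the detection method behind a stable output format."""

    def __init__(self, detector: Detector) -> None:
        self._detector = detector

    def estimate(self, frame: List[str]) -> Detections:
        return self._detector(frame)


class Agent(Protocol):
    def act(self, state: Detections) -> str: ...


class SimpleHeuristicAgent:
    def act(self, state: Detections) -> str:
        # Assumes Pac-Man was detected; heads for the nearest pellet.
        pacman = state["pacman"][0]
        pellets = state["pellets"]
        if not pellets:
            return "UP"
        target = min(pellets, key=lambda p: abs(p[0] - pacman[0]) + abs(p[1] - pacman[1]))
        if target[1] != pacman[1]:
            return "LEFT" if target[1] < pacman[1] else "RIGHT"
        return "UP" if target[0] < pacman[0] else "DOWN"


class RLAgent:
    """Placeholder: wraps any learned policy exposed as a callable."""

    def __init__(self, policy: Callable[[Detections], str]) -> None:
        self._policy = policy

    def act(self, state: Detections) -> str:
        return self._policy(state)


def run_once(frame: List[str], estimator: StateEstimator, agent: Agent) -> str:
    return agent.act(estimator.estimate(frame))
```

Under these assumptions, changing the detector means passing a different function to `StateEstimator`, and changing the agent means passing a different object to `run_once`; neither change touches the other side.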