AgentBox — Technology Decisions

NETWORG internal — July 2025

Each decision includes rationale and alternatives considered.

Why GitHub Copilot CLI (not Claude Code)?

  1. Microsoft ecosystem alignment (Azure, GitHub, VS Code)
  2. Copilot SDK provides programmatic API for future integration
  3. GitHub CLI + Azure CLI ecosystem synergy
  4. Enterprise licensing via Copilot for Business seats we already have
  5. MCP server support for extensibility

Why ACI for Agent Containers (not Container Apps, not AKS)?

  1. Simplest possible: one command to deploy, one command to delete
  2. Cheapest: ~$0.03/hr per container, no cluster overhead
  3. DNS included: automatic FQDN per container
  4. Sufficient: our containers don't need scaling, service mesh, or ingress
  5. Container Apps is used only for the API app (REST API + proxy), not for agent containers
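
To put the ~$0.03/hr figure in monthly terms, here is a rough arithmetic sketch. The two usage patterns (always on, and 8 hours on 22 working days) are illustrative assumptions, not measured AgentBox usage.

```python
# Rough monthly cost per ACI container at the ~$0.03/hr rate quoted above.
# The usage patterns below are illustrative assumptions, not measured data.
HOURLY_RATE_USD = 0.03


def monthly_cost(hours_per_day: float, days_per_month: int) -> float:
    """Return the approximate monthly cost in USD for one container."""
    return round(HOURLY_RATE_USD * hours_per_day * days_per_month, 2)


always_on = monthly_cost(24, 30)     # container is never deleted
working_hours = monthly_cost(8, 22)  # spawned for a workday, deleted afterwards

print(f"always on:     ${always_on:.2f}/mo")
print(f"working hours: ${working_hours:.2f}/mo")
```

Under these assumptions an always-on container costs about $21.60/mo and a working-hours container about $5.28/mo, which is why deleting idle containers matters more than the hourly rate itself.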

Why YARP on Container Apps for the API (not Caddy, not App Gateway)?

  1. Microsoft's own reverse proxy — built by the ASP.NET team, used in Azure App Service and Entra ID
  2. API + proxy in one app: REST controllers + YARP reverse proxy in one ASP.NET Core deployment
  3. Container Apps Easy Auth: turnkey Entra ID authentication for container access
  4. WebSocket: full support via Kestrel — critical for code-server and ttyd
  5. Dynamic routing: programmatically update routes in-process as containers spawn/die
  6. C# authorization: per-container ownership checks in middleware (no separate forward_auth)
  7. Cost: ~$5-10/mo on Consumption plan
  8. TLS: Cloudflare handles TLS at the edge (free wildcard cert) — Container Apps origin uses HTTP or Cloudflare Origin CA
  9. Portal separate: React SPA served by Azure Static Web App (free CDN) — Container App handles only API + proxy
  10. Alternatives eliminated: App Gateway ($20+/mo), Front Door ($35+/mo), APIM (no WebSocket support on the Consumption tier), Caddy/Traefik (not Microsoft-native)

See proxy-investigation.md for the full 9-option comparison.
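
The dynamic routing and per-container ownership check from items 5 and 6 can be sketched in a language-neutral way. This is a minimal Python illustration of the idea only: it is not YARP's API, and the real implementation would live in C# middleware. The `Route` and `RouteTable` names, the status codes returned, and the example hostnames are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Route:
    owner_id: str  # user who spawned the container
    upstream: str  # container FQDN the proxy forwards to


class RouteTable:
    """In-process route table keyed by box hostname (hypothetical design)."""

    def __init__(self) -> None:
        self._routes: Dict[str, Route] = {}

    def add(self, host: str, owner_id: str, upstream: str) -> None:
        # Called when a container is spawned.
        self._routes[host] = Route(owner_id, upstream)

    def remove(self, host: str) -> None:
        # Called when a container is destroyed; unknown hosts are ignored.
        self._routes.pop(host, None)

    def resolve(self, host: str, user_id: str) -> Tuple[int, Optional[str]]:
        """Return (HTTP status, upstream) for a request made by user_id."""
        route = self._routes.get(host)
        if route is None:
            return 404, None  # no such box
        if route.owner_id != user_id:
            return 403, None  # box exists but belongs to someone else
        return 200, route.upstream


# Example values only; the upstream follows the usual ACI FQDN shape.
table = RouteTable()
table.add("demo.agentbox.networg.com", "alice", "demo.westeurope.azurecontainer.io")
print(table.resolve("demo.agentbox.networg.com", "alice"))
print(table.resolve("demo.agentbox.networg.com", "bob"))
table.remove("demo.agentbox.networg.com")
print(table.resolve("demo.agentbox.networg.com", "alice"))
```

Keeping the ownership check next to the route lookup is what removes the need for a separate forward-auth service: the same in-process table answers both "where does this go" and "may this user go there".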

Why code-server + ttyd (not just SSH)?

  1. Phone access: SSH on iPhone is painful; web terminal works perfectly
  2. No client needed: just a browser URL
  3. VS Code extensions: full extension marketplace via code-server
  4. Zero setup: no SSH keys, no client configuration

Why also SSH (certificate-based)?

  1. VS Code Remote SSH: power users connect local VS Code to the container — full extension support, local keybindings, native performance
  2. SCP / file transfer: easily move files in and out of the sandbox
  3. Port forwarding: forward container ports to localhost for debugging web apps
  4. Standard tooling: integrates with existing SSH workflows, rsync, git over SSH
  5. Security: Entra ID SSH certificate authentication — Microsoft-managed CA, short-lived certificates, Conditional Access, centralized revocation. Fallback: custom CA in Azure Key Vault (HSM-backed). Both options: no persistent keys, identity-bound, 24hr expiry.

See entra-ssh-investigation.md for the ACI feasibility study.

Why nginx Reverse Proxy Inside Containers?

  1. ACI exposes limited ports externally
  2. Users' ISPs or deep packet inspection (DPI) can block HTTP on non-standard ports
  3. Single port 80 for all services simplifies firewall rules
  4. Path-based routing: / = VS Code, /terminal/ = web terminal
  5. WebSocket support for both services
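
The path-based routing in item 4 amounts to a longest-prefix match. The sketch below shows that logic in Python rather than nginx configuration. The backend ports are assumptions (they match the tools' usual defaults, but this document does not state them), and `pick_backend` is a hypothetical helper name.

```python
# Path-prefix routing as described above: "/" -> VS Code, "/terminal/" -> web terminal.
# Backend addresses are assumptions for illustration only.
BACKENDS = {
    "/terminal/": "http://127.0.0.1:7681",  # ttyd (assumed port)
    "/": "http://127.0.0.1:8080",           # code-server (assumed port)
}


def pick_backend(path: str) -> str:
    """Return the backend for a request path using longest-prefix match."""
    for prefix in sorted(BACKENDS, key=len, reverse=True):
        if path.startswith(prefix):
            return BACKENDS[prefix]
    raise ValueError(f"no backend for path {path!r}")


print(pick_backend("/terminal/ws"))  # routed to the web terminal
print(pick_backend("/"))             # routed to VS Code
print(pick_backend("/terminal"))     # no trailing slash, so it falls through to "/"
```

Checking the longer prefix first matters: every path starts with "/", so testing prefixes in arbitrary order would send terminal traffic to VS Code.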

Why Sites.Selected for SharePoint (not delegated Sites.Read.All)?

  1. Principle of least privilege: the AI sees only explicitly selected sites, not the user's full SharePoint access
  2. Per-site granularity: POST /sites/{id}/permissions grants read access to individual sites
  3. Auto-cleanup: permissions revoked on container destroy
  4. Zero extra licenses: Graph API included in existing M365 licenses
  5. Per-container MI (to validate): if the Site Permissions API accepts system-assigned MI appId, each container gets its own SharePoint identity — strongest isolation + automatic cleanup on destroy
  6. Fallback: shared "AgentBox SharePoint Reader" app registration if MI doesn't work with Site Permissions API
  7. Alternative rejected: delegated Sites.Read.All gives AI the user's full SharePoint access

See sharepoint-investigation.md for the full investigation.
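
The per-site grant and the cleanup on destroy could look roughly like the requests below. This sketch only builds the HTTP requests and sends nothing. The helper names, the placeholder IDs, and the token value are hypothetical. Whether a managed identity's app ID is accepted in `grantedToIdentities` is the open question from item 5 and is not settled here.

```python
import json
import urllib.request

GRAPH_BASE = "https://graph.microsoft.com/v1.0"


def build_grant_request(site_id: str, app_id: str, display_name: str,
                        token: str) -> urllib.request.Request:
    """Build (but do not send) a request granting an app read access to one site."""
    body = {
        "roles": ["read"],
        "grantedToIdentities": [
            {"application": {"id": app_id, "displayName": display_name}}
        ],
    }
    return urllib.request.Request(
        url=f"{GRAPH_BASE}/sites/{site_id}/permissions",
        data=json.dumps(body).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def build_revoke_request(site_id: str, permission_id: str,
                         token: str) -> urllib.request.Request:
    """Build (but do not send) the cleanup request used when a container is destroyed."""
    return urllib.request.Request(
        url=f"{GRAPH_BASE}/sites/{site_id}/permissions/{permission_id}",
        headers={"Authorization": f"Bearer {token}"},
        method="DELETE",
    )


# Placeholder values for illustration only.
grant = build_grant_request("example-site-id", "00000000-0000-0000-0000-000000000000",
                            "AgentBox SharePoint Reader", "example-token")
print(grant.get_method(), grant.full_url)
print(grant.data.decode("utf-8"))
```

The permission ID returned by the grant call would need to be stored with the box metadata so that the revoke request can be issued on destroy.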

Why Per-Container Managed Identity for Dataverse (not delegated OAuth)?

  1. Decoupled permissions: AI's Dataverse access defined by admin-selected security role, not user's own role
  2. Per-environment control: user selects environments + role at spawn time
  3. Zero licenses: application users in Dataverse are unlicensed
  4. Audit trail: each container has unique identity in Dataverse audit logs
  5. Auto-cleanup: managed identity auto-deleted when container is destroyed
  6. Alternative rejected: delegated OAuth (user_impersonation) gives AI the user's full Dataverse permissions

Why Shared Copilot Account for Unlicensed Users (not JIT provisioning, not free tier)?

  1. Microsoft/GitHub experimental approval: NETWORG has explicit approval to use a shared account for Copilot CLI sessions
  2. Premium models for everyone: all users get premium Copilot models regardless of personal subscription status
  3. Credential isolation: COPILOT_GITHUB_TOKEN (shared) is independent from GH_TOKEN (user's own) — git operations always use the developer's real identity
  4. No seat management overhead: no JIT provisioning API calls, no per-month billing surprises
  5. Cost-effective: a single Copilot Business seat plus premium request billing, vs. $19/user/month for individual seats
  6. Personal subscriptions preserved: users with their own Pro+/Business seats use their own token, avoiding shared account quota consumption
  7. Alternatives eliminated: per-user seat assignment ($19/mo × N), JIT provisioning (billing complexity), Copilot Free tier (no premium models, 2K completions/month limit)
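
The cost argument can be checked with a small break-even calculation. This is a sketch under simplifying assumptions: it uses the $19 seat price from this section and the $0.04 per premium request rate quoted elsewhere in this document, bills every premium request, and ignores any request allowance included with the seat. The function names are hypothetical. Amounts are in integer cents to avoid floating-point rounding.

```python
SEAT_CENTS = 1900            # $19/user/month, from the list above
PREMIUM_REQUEST_CENTS = 4    # $0.04 per premium request


def per_user_cost_cents(users: int) -> int:
    """Monthly cost if every user gets an individual seat."""
    return SEAT_CENTS * users


def shared_cost_cents(premium_requests: int) -> int:
    """Monthly cost of one shared seat plus billed premium requests.

    Assumption: every premium request is billed; included allowances are ignored.
    """
    return SEAT_CENTS + PREMIUM_REQUEST_CENTS * premium_requests


def break_even_requests(users: int) -> int:
    """Premium requests per month at which both options cost the same."""
    return (per_user_cost_cents(users) - SEAT_CENTS) // PREMIUM_REQUEST_CENTS


users = 10
limit = break_even_requests(users)
print(f"{users} users: shared account is cheaper below {limit} premium requests/month")
```

For 10 users the shared account stays cheaper up to 4,275 billed premium requests per month, or 475 requests for each user beyond the first.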

Why Azure Static Web App for the Portal (not embedded in API)?

  1. Free tier: custom domain, SSL, global CDN, built-in Entra ID auth — $0/month
  2. Separation of concerns: React SPA deployed independently from the API/proxy Container App
  3. Global CDN: static assets served from edge nodes — faster load times worldwide, great for phone access
  4. Built-in auth: SWA Entra ID integration (/.auth/login/aad) — no custom auth code needed for the dashboard
  5. Independent deployments: portal changes deploy in ~30 seconds (static files), no API restart
  6. Mobile-first: excellent Core Web Vitals from CDN-served static files
  7. Alternative considered: React bundled inside ASP.NET Core wwwroot/ (simpler single deployment, but no CDN, no independent scaling, mixes concerns)

Why React for the Portal (not Blazor, not Vanilla JS)?

  1. Rich UX: complex spawn form with multi-step OAuth flows, site pickers, environment dropdowns; benefits from a component model
  2. Ecosystem: largest community, most npm packages, easy to hire/onboard
  3. Decoupled from API: React SPA on Azure Static Web App, API calls to Container App
  4. Mobile-first: excellent responsive component libraries (e.g., Radix, shadcn/ui)
  5. Alternative considered: Blazor (C# everywhere, but smaller ecosystem and less mobile polish)

Why Cloudflare for TLS (not Azure managed certs)?

  1. Free wildcard TLS: Cloudflare proxy mode (orange cloud) on *.agentbox.networg.com — zero cost, zero renewal
  2. Already there: networg.com DNS already managed in Cloudflare — no migration needed
  3. Bonus: DDoS protection, caching static assets, analytics
  4. Simple origin: Container Apps origin can use HTTP or Cloudflare Origin CA cert (free, 15-year validity)
  5. No Azure cert complexity: no managed cert provisioning delays, no rate limits on wildcard certs
  6. Trade-off accepted: Cloudflare terminates TLS (sees plaintext traffic) — acceptable for internal dev tooling

Why Short-Lived OAuth Tokens for Shared Copilot Account (not PAT)?

  1. 8-hour expiry: even if a user extracts COPILOT_GITHUB_TOKEN from the container, it's useless after 8 hours
  2. Server-side refresh: API holds the refresh token in Key Vault, mints fresh access tokens per container spawn
  3. No long-lived secrets in containers: PATs would be extractable and valid for months
  4. Cost protection: $0.04/premium request means a leaked long-lived token could run up significant bills
  5. Alternative rejected: Fine-grained PAT (simpler but long-lived, higher risk if extracted)
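
The 8-hour expiry and per-spawn minting from items 1 and 2 can be modelled as below. This is a minimal sketch, not the actual API code: `AccessToken`, `mint_for_spawn`, and the `exchange_refresh_token` callable are hypothetical names, and the callable stands in for the server-side OAuth refresh call backed by Key Vault.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable

TOKEN_LIFETIME = timedelta(hours=8)  # expiry stated in item 1


@dataclass(frozen=True)
class AccessToken:
    value: str
    issued_at: datetime

    @property
    def expires_at(self) -> datetime:
        return self.issued_at + TOKEN_LIFETIME

    def is_valid(self, now: datetime) -> bool:
        # Valid strictly before the expiry instant.
        return now < self.expires_at


def mint_for_spawn(exchange_refresh_token: Callable[[], str],
                   now: datetime) -> AccessToken:
    """Mint a fresh access token for one container spawn.

    exchange_refresh_token is a hypothetical stand-in for the server-side
    refresh call; the refresh token itself never enters the container.
    """
    return AccessToken(value=exchange_refresh_token(), issued_at=now)


spawned_at = datetime(2025, 7, 1, 9, 0, tzinfo=timezone.utc)
token = mint_for_spawn(lambda: "example-access-token", spawned_at)
print(token.is_valid(spawned_at + timedelta(hours=7, minutes=59)))  # True
print(token.is_valid(spawned_at + timedelta(hours=8)))              # False
```

Only the short-lived `value` would be injected into the container as COPILOT_GITHUB_TOKEN, so an extracted token loses its usefulness within the lifetime window.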

Why Azure Table Storage for BoxMetadata (not Cosmos DB, not SQLite)?

  1. Cheapest serverless option: ~$0.01/mo for our scale, pay only for storage + transactions
  2. Simple key-value + partitioning: PartitionKey = userId, RowKey = boxName — covers all our queries
  3. Azure SDK native: Azure.Data.Tables NuGet package, works with managed identity
  4. Sufficient: we don't need complex queries, joins, or full-text search for box metadata
  5. Alternative rejected: Cosmos DB (~$25/mo minimum, overkill), SQLite (single-instance only, no HA)
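
The key design in item 2 can be illustrated with plain dictionaries, which is also the shape table entities take in the Azure SDK for Python. This sketch does not call the SDK. The property names `Status` and `Fqdn`, the helper names, and the sample values are assumptions.

```python
# Characters Azure Table Storage does not allow in PartitionKey / RowKey values.
FORBIDDEN_KEY_CHARS = set("/\\#?")


def validate_key(value: str) -> str:
    """Reject key values containing forbidden or control characters."""
    for ch in value:
        code = ord(ch)
        if ch in FORBIDDEN_KEY_CHARS or code < 0x20 or 0x7F <= code <= 0x9F:
            raise ValueError(f"invalid character {ch!r} in key {value!r}")
    return value


def box_entity(user_id: str, box_name: str, **properties) -> dict:
    """Build a BoxMetadata entity: PartitionKey = userId, RowKey = boxName."""
    return {
        "PartitionKey": validate_key(user_id),
        "RowKey": validate_key(box_name),
        **properties,
    }


def boxes_for_user_filter(user_id: str) -> str:
    """OData filter listing all boxes of one user (single partition scan)."""
    escaped = user_id.replace("'", "''")  # OData escapes quotes by doubling them
    return f"PartitionKey eq '{escaped}'"


# Sample values for illustration only.
entity = box_entity("alice", "demo", Status="running",
                    Fqdn="demo.westeurope.azurecontainer.io")
print(entity)
print(boxes_for_user_filter("alice"))
```

With this layout, "all boxes for a user" is a single-partition query and "one box by name" is a point lookup on both keys, which covers the queries the list above describes.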

Why Terraform for Infrastructure (not ARM/Bicep, not manual)?

  1. HCL is provider-agnostic: azurerm for Azure resources + azuread for Entra ID app registrations + groups in one language
  2. State management: drift detection, plan before apply, import existing resources
  3. Modular: reusable modules, separate .tfvars per environment (dev/prod)
  4. Ecosystem: vast community, well-documented providers, integrates with GitHub Actions via OIDC
  5. App registrations in code: Entra ID apps, service principals, federated credentials, API permissions — all version-controlled
  6. Alternative considered: Bicep (Azure-only, can't manage Entra ID app registrations natively)

Why GitHub Actions for CI/CD (not Azure DevOps Pipelines)?

  1. Same repo: pipelines live next to the code in .github/workflows/
  2. OIDC workload identity federation: authenticate to Azure without stored secrets
  3. Matrix builds: multi-arch Docker images with Buildx + QEMU in one job
  4. PR integration: Terraform plan posted as PR comment, status checks gate merge
  5. Free tier: 2,000 minutes/month included in GitHub plan
  6. Alternative considered: Azure DevOps Pipelines (separate system, overkill for this project's CI/CD needs)