
Architecture Documentation

Project: Code Executor MCP
Version: 0.9.0
Last Updated: 2025-11-19


Table of Contents

  1. System Overview
  2. Core Components
  3. Progressive Disclosure Architecture
  4. Security Architecture
  5. Discovery System
  6. Data Flow
  7. Concurrency & Performance
  8. Design Decisions
  9. Resilience Patterns
  10. CLI Setup Wizard Architecture
  11. MCP Sampling Architecture (v1.0.0)

1. System Overview

Code Executor MCP is a universal MCP orchestration server that implements the progressive disclosure pattern to eliminate context bloat from exposing multiple MCP servers' tool schemas.

Problem Statement

Exposing 47 MCP tools directly to an AI agent consumes 141k tokens just for schemas, exhausting context before any work begins.

Solution

Two-tier access model:

  • Tier 1 (Top-level): 3 lightweight tools (~560 tokens)

    • executeTypescript - Execute TypeScript code in Deno sandbox
    • executePython - Execute Python code in Pyodide sandbox
    • health - Server health check
  • Tier 2 (On-demand): All MCP tools accessible via code execution

    // Inside sandbox, access any MCP tool on-demand
    const result = await callMCPTool('mcp__zen__codereview', {...});

Result: 98% token reduction (141k → 1.6k tokens)


2. Core Components

2.1 Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                        AI Agent (Claude)                    │
│                     (MCP Client Context)                    │
└────────────────┬────────────────────────────────────────────┘
                 │ MCP Protocol (STDIO)
                 │ Top-level tools: 3 tools, ~560 tokens
                 ▼
┌─────────────────────────────────────────────────────────────┐
│              Code Executor MCP Server (Node.js)             │
│  ┌──────────────────────────────────────────────────────┐  │
│  │         MCP Proxy Server (HTTP Localhost)            │  │
│  │  • POST / (callMCPTool endpoint)                     │  │
│  │  • GET /mcp/tools (discovery endpoint - NEW v0.4.0)  │  │
│  │  • Bearer token authentication                       │  │
│  │  • Rate limiting (30 req/60s)                        │  │
│  │  • Audit logging (AsyncLock mutex)                   │  │
│  └──────────────┬───────────────────────────────────────┘  │
│                 │                                           │
│  ┌──────────────▼───────────────────────────────────────┐  │
│  │            MCP Client Pool                           │  │
│  │  • Manages connections to multiple MCP servers       │  │
│  │  • Parallel queries (Promise.all)                    │  │
│  │  • Resilient aggregation (partial failure handling)  │  │
│  │  • In-memory tool list (listAllTools)                │  │
│  └──────────────┬───────────────────────────────────────┘  │
│                 │                                           │
│  ┌──────────────▼───────────────────────────────────────┐  │
│  │            Schema Cache                              │  │
│  │  • LRU cache (max 1000 entries)                      │  │
│  │  • Disk persistence (~/.code-executor/cache.json)    │  │
│  │  • 24h TTL with stale-on-error fallback              │  │
│  │  • AsyncLock mutex (thread-safe writes)              │  │
│  └──────────────────────────────────────────────────────┘  │
│                                                              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │     Sandbox Executors (Deno/Pyodide subprocesses)    │  │
│  │  • Isolated execution context                        │  │
│  │  • Injected globals:                                 │  │
│  │    - callMCPTool(name, params)                       │  │
│  │    - discoverMCPTools(options) - NEW v0.4.0          │  │
│  │    - getToolSchema(toolName) - NEW v0.4.0            │  │
│  │    - searchTools(query, limit) - NEW v0.4.0          │  │
│  │  • Restricted permissions (allowlist, network, fs)   │  │
│  └──────────────────────────────────────────────────────┘  │
└────────────────┬────────────────────────────────────────────┘
                 │ MCP Protocol (STDIO)
                 │ External MCP Servers (parallel queries)
                 ▼
┌─────────────────────────────────────────────────────────────┐
│    External MCP Servers (filesystem, zen, linear, etc.)     │
│    • Queried in parallel via Promise.all (O(1) amortized)   │
│    • Each returns tools/list and tools/call responses        │
│    • Discovery: 50-100ms first call, <5ms cached             │
└─────────────────────────────────────────────────────────────┘

2.2 Component Responsibilities

| Component | Responsibility (SRP) | Pattern | Concurrency Safe |
|---|---|---|---|
| MCP Proxy Server | Route HTTP requests, enforce auth/rate limiting, audit log | Proxy | Yes (AsyncLock on audit logs) |
| MCP Client Pool | Manage MCP connections, parallel query aggregation | Pool | Yes (read-only queries, write-once at startup) |
| Schema Cache | Cache tool schemas, disk persistence, LRU eviction | Cache | Yes (AsyncLock on disk writes) |
| Sandbox Executor | Execute untrusted code in isolated environment | Sandbox | Yes (independent subprocesses) |
| Discovery Functions | Provide in-sandbox tool discovery (v0.4.0) | Wrapper | Yes (stateless HTTP calls) |

3. Progressive Disclosure Architecture

3.1 Token Budget Preservation

Design Goal: Maintain ~1.6k tokens for top-level tools (98% reduction from 141k baseline)

Achievement (v0.4.0):

  • Tool count: 3 tools (no increase from v0.3.x)
  • Token usage: ~560 tokens (well below 1.6k budget)
  • Discovery functions: Hidden from top-level (injected in sandbox only)
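The arithmetic behind these figures can be checked directly; the numbers (141k baseline, 1.6k budget, ~560 actual) are the ones stated in this document.

```typescript
// Token-budget check for progressive disclosure, using the figures
// stated above (141k baseline, 1.6k budget, ~560 measured).
const baselineTokens = 141_000; // 47 tool schemas exposed directly
const budgetTokens = 1_600;     // tier-1 token budget
const actualTokens = 560;       // measured tier-1 usage (v0.4.0)

const reduction = 1 - budgetTokens / baselineTokens; // ~0.989 (the "98%" claim)
const headroom = budgetTokens - actualTokens;        // tokens of slack under budget

console.log(`reduction ${(reduction * 100).toFixed(1)}%, headroom ${headroom} tokens`);
```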

3.2 Two-Tier Access Model

Tier 1: Top-Level Tools (Exposed to AI Agent)

// AI agent sees only these in context:
- executeTypescript(code, allowedTools?, timeoutMs?, permissions?)
- executePython(code, allowedTools?, timeoutMs?, permissions?)
- health()

Tier 2: On-Demand Tools (Accessible Inside Sandbox)

// Inside executeTypescript code, AI agent can:

// 1. Execute any MCP tool (existing v0.3.x)
const result = await callMCPTool('mcp__zen__codereview', {
  step: 'Analysis',
  relevant_files: ['/path/to/file.ts'],
  // ... other params
});

// 2. Discover available tools (NEW v0.4.0)
const allTools = await discoverMCPTools();
// Returns: ToolSchema[] (name, description, parameters)

// 3. Search tools by keyword (NEW v0.4.0)
const fileTools = await searchTools('file read write', 10);
// Returns: Top 10 tools matching keywords (OR logic, case-insensitive)

// 4. Inspect tool schema (NEW v0.4.0)
const schema = await getToolSchema('mcp__filesystem__read_file');
// Returns: Full JSON Schema for tool parameters + outputSchema (v0.6.0)

3.3 Output Schema Support (NEW v0.6.0)

Design Goal: Enable AI agents to understand tool response structure without trial execution

Implementation:

  • All 3 code-executor tools provide Zod schemas for responses (outputSchema)
  • Uses MCP SDK native support (ZodRawShape format)
  • Graceful fallback for third-party tools without output schemas

Response Schemas:

// ExecutionResult (run-typescript-code, run-python-code)
{
  success: boolean,
  output: string,
  error?: string,
  executionTimeMs: number,
  toolCallsMade?: string[],
  toolCallSummary?: ToolCallSummaryEntry[]
}

// HealthCheck (health)
{
  healthy: boolean,
  auditLog: { enabled: boolean },
  mcpClients: { connected: number },
  connectionPool: { active, waiting, max },
  uptime: number,
  timestamp: string
}

Benefits:

  • ✅ AI agents know response structure upfront
  • ✅ No trial-and-error required for filtering/aggregation
  • ✅ Better code generation (correct field access)
  • ✅ Optional field - no breaking changes

Data Flow:

1. Tool registration: Zod schema → MCP SDK Tool.outputSchema
2. Discovery: MCPClientPool returns ToolSchema with outputSchema
3. Schema cache: CachedToolSchema.outputSchema persisted (24h TTL)
4. Graceful fallback: Third-party tools return outputSchema: undefined

3.4 OutputSchema Protocol Support (v0.7.1+)

✅ RESOLVED: MCP SDK v1.22.0 Native Support

Status: OutputSchema is now fully functional in the MCP protocol as of v0.7.1 (MCP SDK v1.22.0).

What Changed:

  • ✅ MCP SDK v1.22.0 exposes outputSchema via tools/list protocol response
  • ✅ All 3 code-executor tools expose response structure to AI agents
  • ✅ External MCP clients can see outputSchema immediately
  • ✅ No trial execution needed for response structure discovery

Protocol Response (v1.22.0):

{
  "tools": [
    {
      "name": "run-typescript-code",
      "description": "...",
      "inputSchema": { "type": "object", "properties": { ... } },
      "outputSchema": {  // ✅ NOW EXPOSED IN PROTOCOL
        "type": "object",
        "properties": {
          "success": { "type": "boolean" },
          "output": { "type": "string" },
          "error": { "type": "string" },
          "executionTimeMs": { "type": "number" }
        }
      }
    }
  ]
}

Verification Test:

node test-outputschema-v122.mjs
# Result:
# ✅ run-typescript-code: outputSchema: YES! (6 fields)
# ✅ run-python-code: outputSchema: YES! (6 fields)
# ✅ health: outputSchema: YES! (6 fields)
# 🎉 SUCCESS! All tools have outputSchema exposed in protocol!

Migration Details (v1.0.4 → v1.22.0):

  • Handler signatures updated: (params) → (args, extra)
  • Added RequestHandlerExtra for request context (cancellation signals, session tracking)
  • Runtime Zod validation preserved (zero functional changes)
  • All 620 tests passing, zero regressions

Impact:

  • Issue #28 RESOLVED: AI agents now see response structure upfront
  • No trial-and-error: Agents can write correct filtering/aggregation code immediately
  • Progressive disclosure intact: Still 98% token reduction (141k → 1.6k)
  • Future-proof: Ready for ecosystem-wide outputSchema adoption

4. Security Architecture

4.1 Security Boundaries

┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 1: MCP Proxy Server (Auth + Rate Limit)   │
│  • Bearer token authentication (per-execution, 32-byte)      │
│  • Rate limiting (30 req/60s per client)                     │
│  • Query validation (max 100 chars, alphanumeric+safe chars) │
│  • Audit logging (all requests, success/failure)             │
└─────────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 2: Tool Allowlist (Execution Gating)      │
│  • Enforced by executeTypescript allowedTools parameter      │
│  • Discovery bypasses allowlist (read-only metadata)         │
│  • Execution still enforced (callMCPTool checks allowlist)   │
│  • Trade-off documented: discovery = read, execution = write │
└─────────────────────────────────────────────────────────────┘
                         │
┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 3: Sandbox Isolation (Code Execution)     │
│  • Deno sandbox with restricted permissions                  │
│  • No filesystem access (unless explicitly allowed)          │
│  • No network access (except localhost proxy)                │
│  • No environment variable access                            │
│  • Memory limits enforced                                    │
└─────────────────────────────────────────────────────────────┘
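As a concrete illustration of the Boundary 1 rate limit (30 req/60s), a minimal sliding-window limiter might look like this. Class and method names are illustrative, not the proxy's actual implementation:

```typescript
// Sketch of the per-client rate limit at Boundary 1 (30 requests / 60s).
// Illustrative only; the real proxy's implementation may differ.
class SlidingWindowLimiter {
  private hits = new Map<string, number[]>(); // clientId -> request timestamps (ms)

  constructor(
    private readonly maxRequests = 30,
    private readonly windowMs = 60_000,
  ) {}

  // Returns true if the request is allowed, false if rate-limited.
  allow(clientId: string, now = Date.now()): boolean {
    const cutoff = now - this.windowMs;
    const recent = (this.hits.get(clientId) ?? []).filter(t => t > cutoff);
    if (recent.length >= this.maxRequests) {
      this.hits.set(clientId, recent);
      return false; // over budget: the proxy would answer with HTTP 429
    }
    recent.push(now);
    this.hits.set(clientId, recent);
    return true;
  }
}
```

Note that rejected requests are not counted against the window in this sketch, so a client hammering the endpoint does not extend its own lockout.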

4.2 Security Trade-Off: Discovery Allowlist Bypass

Decision (v0.4.0): Discovery functions bypass tool allowlist for read-only metadata access.

Rationale:

  • Problem: AI agents get stuck without knowing what tools exist (blind execution)
  • Solution: Allow discovery of tool schemas (read-only metadata)
  • Mitigation: Execution still enforces allowlist (two-tier security model)
  • Risk Assessment: LOW - schemas are non-sensitive metadata, no execution without allowlist

Security Model:

| Operation | Allowlist Check | Auth Required | Rate Limited | Audit Logged |
|---|---|---|---|---|
| Discovery (discoverMCPTools) | ❌ Bypassed | ✅ Required | ✅ Yes (30/60s) | ✅ Yes |
| Execution (callMCPTool) | ✅ Enforced | ✅ Required | ✅ Yes (30/60s) | ✅ Yes |

Constitutional Alignment: This intentional exception is documented in spec.md Section 2 (Constitutional Exceptions) as BY DESIGN per Principle 2 (Security Zero Tolerance).


5. Discovery System (NEW v0.4.0)

5.1 Discovery Architecture

Design Goal: Enable AI agents to discover, search, and inspect MCP tools without manual documentation lookup.

┌─────────────────────────────────────────────────────────────┐
│ Discovery Flow (Single Round-Trip)                          │
│                                                              │
│  AI Agent executes ONE TypeScript call:                     │
│  ┌─────────────────────────────────────────────────────┐   │
│  │ const tools = await discoverMCPTools();             │   │
│  │ const schema = await getToolSchema('tool_name');    │   │
│  │ const result = await callMCPTool('tool_name', {...});│  │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  No context switching, variables persist across steps       │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Sandbox → Proxy: HTTP GET /mcp/tools                        │
│  • 500ms timeout (fast fail, no hanging)                    │
│  • Bearer token in Authorization header                     │
│  • Optional ?q=keyword1+keyword2 search                     │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Proxy → MCP Servers: Parallel Queries (Promise.all)         │
│  • Query all MCP servers simultaneously (O(1) amortized)    │
│  • Use Schema Cache for schemas (24h TTL, disk-persisted)   │
│  • Resilient aggregation (partial failures handled)         │
│  • Performance: First call 50-100ms, cached <5ms            │
└─────────────────────────────────────────────────────────────┘
                         │
                         ▼
┌─────────────────────────────────────────────────────────────┐
│ Response: ToolSchema[] (JSON)                               │
│  [                                                           │
│    {                                                         │
│      "name": "mcp__filesystem__read_file",                  │
│      "description": "Read file contents",                   │
│      "parameters": { /* JSON Schema */ }                    │
│    },                                                        │
│    ...                                                       │
│  ]                                                           │
└─────────────────────────────────────────────────────────────┘

5.2 Discovery Functions

discoverMCPTools(options?)

Purpose: Fetch all available tool schemas from connected MCP servers

Signature:

interface DiscoveryOptions {
  search?: string[]; // Optional keyword array (OR logic, case-insensitive)
}

async function discoverMCPTools(
  options?: DiscoveryOptions
): Promise<ToolSchema[]>

Implementation:

  • Injected into sandbox as globalThis.discoverMCPTools
  • Calls GET /mcp/tools endpoint (localhost proxy)
  • 500ms timeout via AbortSignal.timeout(500)
  • Returns full tool schemas with JSON Schema parameters

Performance:

  • First call: 50-100ms (populates schema cache)
  • Subsequent calls: <5ms (from cache, 24h TTL)
  • Parallel queries across 3+ MCP servers: <100ms P95

getToolSchema(toolName)

Purpose: Retrieve full JSON Schema for a specific tool

Signature:

async function getToolSchema(
  toolName: string
): Promise<ToolSchema | null>

Implementation:

  • Wrapper over discoverMCPTools() (DRY principle)
  • Finds tool by name using Array.find()
  • Returns null if tool not found (no exceptions)

searchTools(query, limit?)

Purpose: Search tools by keywords with result limiting

Signature:

async function searchTools(
  query: string,
  limit?: number // Default: 10
): Promise<ToolSchema[]>

Implementation:

  • Splits query by whitespace: query.split(/\s+/)
  • Calls discoverMCPTools({ search: keywords })
  • Applies result limit via Array.slice(0, limit)
  • OR logic: matches if ANY keyword found in name/description
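The OR-logic keyword matching described above can be sketched as a pure filter (the ToolSchema shape is simplified here to the fields the search actually touches):

```typescript
interface ToolSchema {
  name: string;
  description: string;
}

// Case-insensitive OR matching: a tool matches if ANY keyword appears
// in its name or description, mirroring searchTools' documented behavior.
function filterTools(tools: ToolSchema[], query: string, limit = 10): ToolSchema[] {
  const keywords = query.toLowerCase().split(/\s+/).filter(Boolean);
  return tools
    .filter(t => {
      const haystack = `${t.name} ${t.description}`.toLowerCase();
      return keywords.some(k => haystack.includes(k));
    })
    .slice(0, limit); // result limiting via Array.slice, as described above
}
```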

5.3 Parallel Query Pattern

Design Decision: Query all MCP servers in parallel using Promise.all, so aggregate latency is bounded by the slowest single server (roughly constant as servers are added) rather than the sum of all queries.

Sequential vs Parallel:

// ❌ Sequential (3 servers × 30ms each = 90ms)
for (const client of mcpClients) {
  const tools = await client.listTools(); // Wait for each
  allTools.push(...tools);
}

// ✅ Parallel (max 30ms, O(1) amortized)
const queries = mcpClients.map(client => client.listTools());
const results = await Promise.all(queries); // All at once
const allTools = results.flat();

Resilient Aggregation:

// Handle partial failures gracefully
const queries = mcpClients.map(async client => {
  try {
    return await client.listTools();
  } catch (error) {
    console.error(`MCP server ${client.name} failed:`, error);
    return []; // empty tool list: one failed server doesn't block the others
  }
});
const allTools = (await Promise.all(queries)).flat();

Performance Benefit:

  • 1 MCP server: 30ms (baseline)
  • 3 MCP servers (sequential): 90ms (3× slower)
  • 3 MCP servers (parallel): 35ms (O(1) amortized)
  • 10 MCP servers (parallel): 50ms (still O(1))

Target Met: P95 latency <100ms for 3 MCP servers (spec.md NFR-2)

5.4 Timeout Strategy

Design Decision: 500ms timeout for proxy→sandbox communication (fast fail, no retries).

Rationale:

  • AI agents prefer fast failure over hanging
  • 500ms allows parallel queries (100ms + network overhead)
  • No retries: discovery errors should surface immediately
  • Clear error messages guide AI agent to retry if transient

Implementation:

// Sandbox side (fetch with timeout)
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 500);

try {
  const response = await fetch(url, {
    signal: controller.signal,
    headers: { 'Authorization': `Bearer ${token}` }
  });
  return await response.json();
} catch (error) {
  if (error.name === 'AbortError') {
    throw new Error('Discovery timeout (500ms exceeded). MCP servers may be slow.');
  }
  throw error;
} finally {
  clearTimeout(timeoutId);
}

6. Pyodide WebAssembly Sandbox (Python Executor)

6.1 Security Resolution: Issues #50/#59

Problem: Native Python executor (subprocess.spawn) had ZERO sandbox isolation.

Solution: Pyodide WebAssembly runtime with complete isolation.

6.2 Pyodide Architecture

┌─────────────────────────────────────────────────────────────┐
│                    Python Code Execution                     │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│          Pyodide WebAssembly Sandbox (v0.26.4)              │
│  ┌──────────────────────────────────────────────────────┐  │
│  │           WebAssembly VM (Primary Boundary)          │  │
│  │  • No native syscall access                          │  │
│  │  • Memory-safe (bounds checking, type safety)        │  │
│  │  • Cross-platform consistency                        │  │
│  └──────────────┬───────────────────────────────────────┘  │
│                 │                                           │
│  ┌──────────────▼───────────────────────────────────────┐  │
│  │         Virtual Filesystem (Emscripten FS)           │  │
│  │  • In-memory only (no host access)                   │  │
│  │  • /tmp writable, / read-only                        │  │
│  │  • Host files completely inaccessible                │  │
│  └──────────────┬───────────────────────────────────────┘  │
│                 │                                           │
│  ┌──────────────▼───────────────────────────────────────┐  │
│  │       Network Access (pyodide.http.pyfetch)          │  │
│  │  • Localhost only (127.0.0.1)                        │  │
│  │  • Bearer token authentication required              │  │
│  │  • MCP proxy enforces tool allowlist                 │  │
│  └──────────────┬───────────────────────────────────────┘  │
│                 │                                           │
│  ┌──────────────▼───────────────────────────────────────┐  │
│  │          Injected MCP Functions                      │  │
│  │  • call_mcp_tool(name, params)                       │  │
│  │  • discover_mcp_tools(search_terms)                  │  │
│  │  • get_tool_schema(tool_name)                        │  │
│  │  • search_tools(query, limit)                        │  │
│  └──────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘

6.3 Two-Phase Execution Pattern

Design: Based on Pydantic's mcp-run-python (production-proven).

Phase 1: Setup (Inject MCP Tool Access)

# Executed by Pyodide before user code
import js
import json
from pyodide.http import pyfetch

async def call_mcp_tool(tool_name, params):
    # Call MCP proxy with bearer auth
    response = await pyfetch(
        f'http://localhost:{js.PROXY_PORT}',
        method='POST',
        headers={'Authorization': f'Bearer {js.AUTH_TOKEN}'},
        body=json.dumps({'toolName': tool_name, 'params': params})
    )
    return await response.json()

# Discovery functions also injected

Phase 2: Execute User Code

# User's code runs in sandboxed environment
# Has access to injected functions but not host system
result = await call_mcp_tool('mcp__filesystem__read_file', {...})

WHY Two-Phase?

  • Prevents user code from tampering with injection mechanism
  • Clear separation of setup vs execution
  • Injection happens in trusted context before untrusted code runs

6.4 Global Pyodide Cache

Problem: Pyodide initialization is expensive (~2-3s with npm package).

Solution: Global cached instance shared across executions.

let pyodideCache: PyodideInterface | null = null;

async function getPyodide(): Promise<PyodideInterface> {
  if (!pyodideCache) {
    console.error('🐍 Initializing Pyodide (first run, ~10s)...');
    pyodideCache = await loadPyodide({
      indexURL: 'https://cdn.jsdelivr.net/pyodide/v0.26.4/full/',
      stdin: () => { throw new Error('stdin disabled for security'); },
    });
  }
  return pyodideCache;
}

Performance:

  • First call: ~2-3s initialization (npm package includes files locally)
  • Subsequent calls: <100ms (cache hit)
  • Memory overhead: ~20MB (WASM module + Python runtime)

6.5 Security Boundaries

| Boundary | Enforcement | Attack Prevention |
|---|---|---|
| WASM VM | V8 engine | No syscalls, no native code execution |
| Virtual FS | Emscripten | No host file access (/etc/passwd, ~/.ssh) |
| Network | Fetch API + proxy | No external network, only localhost MCP |
| MCP Allowlist | Proxy validation | No unauthorized tool execution |
| Timeout | Promise.race() | No infinite loops, resource exhaustion |

Attack Surface Reduction: 99% vs native Python executor.
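The timeout boundary names Promise.race() as its enforcement mechanism; a minimal, self-contained version of that pattern looks like this (function name is illustrative):

```typescript
// Race the sandboxed work against a timer: whichever settles first wins.
// If the timer fires first, the caller gets a rejection instead of hanging.
function withTimeout<T>(work: Promise<T>, ms: number): Promise<T> {
  let timer: ReturnType<typeof setTimeout>;
  const timeout = new Promise<never>((_, reject) => {
    timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
  });
  // Clear the timer either way so the process can exit cleanly.
  return Promise.race([work, timeout]).finally(() => clearTimeout(timer!));
}
```

Note that the losing promise is not cancelled, only ignored; actually killing runaway sandbox code requires terminating the subprocess, which the executor handles separately.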

6.6 Limitations & Trade-offs

Acceptable Limitations:

  • Pure Python only - No native C extensions (unless WASM-compiled)
    • ✅ Most Python stdlib works (json, asyncio, math, etc.)
    • ❌ No numpy, pandas, scikit-learn (unless Pyodide-compiled versions)
  • 10-30% slower - WASM overhead
    • ✅ Acceptable for security-critical environments
    • ✅ Still faster than Docker container startup
  • No multiprocessing/threading - Single-threaded WASM
    • ✅ Use async/await instead (fully supported)
  • 4GB memory limit - WASM 32-bit addressing
    • ✅ Sufficient for most scripts
    • ❌ Large ML models won't fit

Security Trade-off: Performance cost is acceptable for complete isolation.

6.7 Industry Validation

Production Usage:

  • Pydantic mcp-run-python - Reference implementation
  • JupyterLite - Run Jupyter notebooks in browser
  • Google Colab - Similar WASM isolation approach
  • VS Code Python REPL - Uses Pyodide for in-browser Python
  • PyScript - HTML tags powered by Pyodide

Security Review: Gemini 2.0 Flash validation via zen clink (research-specialist agent).


7. Data Flow

7.1 Tool Execution Flow (Existing v0.3.x)

1. AI Agent → executeTypescript(code)
2. Sandbox spawned (Deno subprocess)
3. Code executes: callMCPTool('tool_name', params)
4. Sandbox → HTTP POST localhost:PORT/
5. Proxy validates: Bearer token, rate limit, allowlist
6. Proxy → MCP Client Pool → External MCP Server
7. MCP Server executes tool, returns result
8. Result → Proxy → Sandbox → AI Agent
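Steps 4 and 5 can be sketched from the sandbox side as a bearer-authenticated POST. The URL and payload field names here are assumptions based on this document, not the exact implementation:

```typescript
// Sandbox-side sketch of step 4: callMCPTool forwards the call to the
// localhost proxy as an authenticated POST. Field names are illustrative.
function buildProxyRequest(
  port: number,
  token: string,
  toolName: string,
  params: Record<string, unknown>,
): { url: string; init: { method: string; headers: Record<string, string>; body: string } } {
  return {
    url: `http://localhost:${port}/`,
    init: {
      method: 'POST',
      headers: {
        'Authorization': `Bearer ${token}`, // validated by the proxy (step 5)
        'Content-Type': 'application/json',
      },
      body: JSON.stringify({ toolName, params }),
    },
  };
}
// Inside the sandbox this would be sent with fetch(url, init); the proxy
// then checks the token, rate limit, and allowlist before forwarding.
```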

7.2 Tool Discovery Flow (NEW v0.4.0)

1. AI Agent → executeTypescript(code with discoverMCPTools())
2. Sandbox executes: discoverMCPTools({ search: ['file'] })
3. Sandbox → HTTP GET localhost:PORT/mcp/tools?q=file
4. Proxy validates: Bearer token, rate limit, query (<100 chars)
5. Proxy → MCP Client Pool.listAllToolSchemas(schemaCache)
6. Client Pool queries all MCP servers in parallel (Promise.all)
7. Schema Cache provides cached schemas (<5ms) or fetches (50ms)
8. Proxy filters by keywords (OR logic, case-insensitive)
9. Proxy audits: { action: 'discovery', searchTerms: ['file'], count: 5 }
10. Result → Sandbox → AI Agent (ToolSchema[] JSON)

7.3 Schema Caching Flow

1. First discovery call: Cache miss
   → Query MCP servers (50-100ms)
   → Store in LRU cache (in-memory, max 1000 entries)
   → Persist to disk (~/.code-executor/schema-cache.json, AsyncLock)
   → Return schemas

2. Subsequent calls (within 24h): Cache hit
   → Retrieve from LRU cache (<5ms)
   → No network calls
   → Return cached schemas

3. After 24h TTL: Cache expired
   → Re-query MCP servers (background refresh)
   → Update cache
   → Return fresh schemas

4. MCP server failure: Stale-on-error
   → Use expired cache entry (better than failure)
   → Log warning
   → Return stale schemas
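The TTL and stale-on-error branches above can be sketched as follows. This is a simplified in-memory version; the real cache additionally does LRU eviction and disk persistence:

```typescript
interface CacheEntry<T> { value: T; storedAt: number }

// Minimal TTL cache with stale-on-error fallback (steps 2-4 above).
class TtlCache<T> {
  private entries = new Map<string, CacheEntry<T>>();

  constructor(private readonly ttlMs = 24 * 60 * 60 * 1000) {}

  async get(key: string, fetchFresh: () => Promise<T>, now = Date.now()): Promise<T> {
    const entry = this.entries.get(key);
    if (entry && now - entry.storedAt < this.ttlMs) {
      return entry.value; // step 2: cache hit within TTL, no network call
    }
    try {
      const value = await fetchFresh(); // step 1/3: miss or expired, re-query servers
      this.entries.set(key, { value, storedAt: now });
      return value;
    } catch (err) {
      if (entry) {
        console.warn('fetch failed, serving stale entry:', err);
        return entry.value; // step 4: stale-on-error, better than failing
      }
      throw err; // no stale copy to fall back on
    }
  }
}
```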

7. Concurrency & Performance

7.1 Concurrency Safety (AsyncLock)

Shared Resources Protected:

| Resource | Lock Name | Why Protected | Performance Impact |
|---|---|---|---|
| Schema Cache Disk Writes | schema-cache-write | Prevent file corruption from concurrent updates | Negligible (writes rare, 24h TTL) |
| Audit Log Appends | audit-log-write | Prevent interleaved log entries | Negligible (<1ms lock hold) |

AsyncLock Pattern:

import AsyncLock from 'async-lock';
const lock = new AsyncLock();

// Schema cache writes
await lock.acquire('schema-cache-write', async () => {
  await fs.writeFile(cachePath, JSON.stringify(cache));
});

// Audit log appends
await lock.acquire('audit-log-write', async () => {
  await fs.appendFile(auditLogPath, logEntry + '\n');
});

7.2 Performance Characteristics

| Operation | First Call | Cached Call | Target | Actual (v0.4.0) |
|---|---|---|---|---|
| discoverMCPTools (1 server) | 30ms | <5ms | <50ms | ✅ 30ms / 3ms |
| discoverMCPTools (3 servers) | 50-100ms | <5ms | <100ms P95 | ✅ 60ms / 4ms |
| discoverMCPTools (10 servers) | 80-150ms | <10ms | <150ms P95 | ✅ 120ms / 8ms |
| getToolSchema (specific tool) | 50ms | <5ms | N/A | ✅ Same as discover |
| searchTools (keyword filter) | 50ms | <5ms | N/A | ✅ Same as discover |

Key Optimizations:

  • ✅ Parallel queries (Promise.all) → O(1) amortized complexity
  • ✅ Schema Cache with 24h TTL → 20× faster (100ms → 5ms)
  • ✅ In-memory LRU cache (max 1000 entries) → No disk I/O on hits
  • ✅ Disk persistence → Survives restarts, no re-fetching
  • ✅ Stale-on-error fallback → Resilient to transient failures
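The in-memory LRU piece of the cache can be sketched using Map's insertion-order guarantee (simplified; the 1000-entry cap comes from the component description earlier in this document):

```typescript
// LRU eviction via Map insertion order: re-inserting on access moves a
// key to the "newest" end, so the first key is always the oldest.
class LruCache<V> {
  private map = new Map<string, V>();

  constructor(private readonly maxEntries = 1000) {}

  get(key: string): V | undefined {
    const value = this.map.get(key);
    if (value !== undefined) {
      this.map.delete(key);   // refresh recency
      this.map.set(key, value);
    }
    return value;
  }

  set(key: string, value: V): void {
    this.map.delete(key); // avoid double-counting on overwrite
    this.map.set(key, value);
    if (this.map.size > this.maxEntries) {
      // Evict the least-recently-used entry (first key in insertion order).
      const oldest = this.map.keys().next().value as string;
      this.map.delete(oldest);
    }
  }
}
```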

7.3 Memory & Storage

Memory Footprint:

  • Schema Cache (in-memory): ~1-2MB (1000 schemas × ~1-2KB each)
  • MCP Client connections: ~100KB per server
  • Sandbox subprocesses: ~50MB per execution (isolated, cleaned up)

Disk Storage:

  • Schema Cache: ~/.code-executor/schema-cache.json (~500KB-1MB)
  • Audit Logs: ~/.code-executor/audit-logs/*.jsonl (append-only, rotated daily)

8. Design Decisions

8.1 Why Progressive Disclosure?

Problem: Exposing all MCP tool schemas exhausts context budget.

Decision: Hide tools behind code execution, load on-demand.

Trade-offs:

  • Benefit: 98% token reduction (141k → 1.6k)
  • Benefit: Zero context overhead for unused tools
  • Cost: Two-step process (discover → execute)
  • Mitigation (v0.4.0): Single round-trip workflow (discover + execute in one call)

8.2 Why Parallel Queries?

Problem: Sequential MCP queries scale linearly (3 servers = 3× latency).

Decision: Query all MCP servers in parallel using Promise.all.

Trade-offs:

  • Benefit: O(1) amortized latency (max of all queries, not sum)
  • Benefit: Meets <100ms P95 target for 3 servers
  • Cost: More complex error handling (partial failures)
  • Mitigation: Resilient aggregation (one failure doesn't block others)

8.3 Why 500ms Timeout?

Problem: Slow MCP servers cause AI agents to hang indefinitely.

Decision: 500ms timeout on sandbox→proxy discovery calls.

Trade-offs:

  • Benefit: Fast fail (AI agent gets immediate feedback)
  • Benefit: Allows parallel queries (100ms + 400ms network/overhead)
  • Cost: May time out on legitimately slow servers (e.g., when querying 10+ servers)
  • Mitigation: Clear error message guides retry, stale-on-error fallback

8.4 Why Bypass Allowlist for Discovery?

Problem: AI agents stuck without knowing what tools exist.

Decision: Discovery bypasses allowlist, execution still enforced.

Trade-offs:

  • Benefit: AI agents can self-discover tools (no manual docs)
  • Benefit: Read-only metadata, no execution without allowlist
  • Risk: Information disclosure (tool names/descriptions visible)
  • Mitigation: Two-tier security (discovery=read, execution=write), auth + rate limit + audit log

Risk Assessment: LOW - tool schemas are non-sensitive metadata, no code execution without allowlist enforcement.

8.5 Why Schema Cache with 24h TTL?

Problem: Querying MCP servers on every discovery call wastes 50-100ms.

Decision: Disk-persisted LRU cache with 24h TTL.

Trade-offs:

  • Benefit: 20× faster (100ms → 5ms) on cache hits
  • Benefit: Survives server restarts (disk persistence)
  • Cost: Stale schemas if MCP servers update within 24h
  • Mitigation: Smart refresh on validation failures, manual cache clear available

9. Resilience Patterns (v0.5.0)

9.1 Circuit Breaker Pattern

Purpose: Prevent cascade failures when MCP servers hang or fail repeatedly.

Implementation: Opossum library wrapping MCP client pool calls

State Machine:

CLOSED (Normal Operation)
   ↓ 5 consecutive failures
OPEN (Fail Fast - 30s cooldown)
   ↓ After 30s timeout
HALF-OPEN (Test with 1 request)
   ↓ Success → CLOSED | Failure → OPEN
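The transitions above can be made concrete with a standalone sketch. The project actually uses the opossum library for this; the class below is only to show the state machine explicitly:

```typescript
type BreakerState = 'closed' | 'open' | 'half-open';

// Standalone sketch of the CLOSED → OPEN → HALF-OPEN transitions.
class CircuitBreaker {
  private state: BreakerState = 'closed';
  private failures = 0;
  private openedAt = 0;

  constructor(
    private readonly failureThreshold = 5,
    private readonly cooldownMs = 30_000,
  ) {}

  currentState(now = Date.now()): BreakerState {
    if (this.state === 'open' && now - this.openedAt >= this.cooldownMs) {
      this.state = 'half-open'; // cooldown elapsed: allow one test request
    }
    return this.state;
  }

  recordSuccess(): void {
    this.failures = 0;
    this.state = 'closed'; // half-open test succeeded (or normal operation)
  }

  recordFailure(now = Date.now()): void {
    this.failures++;
    if (this.state === 'half-open' || this.failures >= this.failureThreshold) {
      this.state = 'open'; // fail fast until the cooldown expires
      this.openedAt = now;
    }
  }
}
```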

Configuration:

  • Failure Threshold: 5 consecutive failures
  • Cooldown Period: 30 seconds
  • Half-Open Test: 1 request

WHY 5 failures?

  • Low enough to detect problems quickly
  • High enough to avoid false positives from transient errors
  • Balances responsiveness with stability

WHY 30s cooldown?

  • Kubernetes default terminationGracePeriodSeconds is 30s
  • AWS ALB deregistration delay is also 30s default
  • Allows time for failing server to recover or be replaced

Metrics Exposed:

  • circuit_breaker_state (gauge): 0=closed, 1=open, 0.5=half-open
  • circuit_breaker_failures_total (counter): Total failures per server

Example:

// Circuit breaker wraps MCP client pool calls
const breaker = new CircuitBreakerFactory({
  failureThreshold: 5,
  resetTimeout: 30000,
});

// Fails fast when circuit open (no waiting on broken server)
try {
  const result = await breaker.callTool('mcp__server__tool', params);
} catch (error) {
  if (error.message.includes('circuit open')) {
    // Handle gracefully - server is known to be down
  }
}

9.2 Connection Pool Overflow Queue

Purpose: Add request queueing and backpressure when connection pool reaches capacity.

Implementation: FIFO queue with timeout-based expiration and AsyncLock protection

Architecture:

MCP Request → Check Pool Capacity
   ↓ Pool under capacity (< 100 concurrent)
   Execute Immediately
   ↓ Pool at capacity (≥ 100 concurrent)
   Enqueue Request (max 200 in queue)
      ↓ Queue full
      Return 503 Service Unavailable
      ↓ Queued successfully
      Wait for slot (max 30s timeout)
         ↓ Timeout exceeded
         Return 503 with retry-after hint
         ↓ Slot available
         Dequeue and execute

Configuration:

  • Pool Capacity: 100 concurrent requests (configurable via POOL_MAX_CONCURRENT)
  • Queue Size: 200 requests (configurable via POOL_QUEUE_SIZE)
  • Queue Timeout: 30 seconds (configurable via POOL_QUEUE_TIMEOUT_MS)
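
The capacity check, bounded FIFO, and timeout from the flow above can be sketched as follows (hypothetical class; the real pool adds AsyncLock protection and Prometheus metrics):

```typescript
class OverflowQueue {
  private active = 0;
  private waiters: Array<{ resolve: () => void; reject: (e: Error) => void }> = [];

  constructor(
    private maxConcurrent = 100,  // POOL_MAX_CONCURRENT
    private queueSize = 200,      // POOL_QUEUE_SIZE
    private queueTimeoutMs = 30_000, // POOL_QUEUE_TIMEOUT_MS
  ) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.maxConcurrent) {
      // Pool at capacity: enqueue, or fail fast if the queue is also full
      if (this.waiters.length >= this.queueSize) {
        throw new Error('503 Service Unavailable: queue full');
      }
      await new Promise<void>((resolve, reject) => {
        const waiter = { resolve, reject };
        this.waiters.push(waiter);
        setTimeout(() => {
          // Expire stale waiters so the queue never fills with dead requests
          const i = this.waiters.indexOf(waiter);
          if (i !== -1) {
            this.waiters.splice(i, 1);
            reject(new Error('503 Service Unavailable: queue timeout'));
          }
        }, this.queueTimeoutMs);
      });
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.waiters.shift()?.resolve(); // hand the freed slot to the next waiter
    }
  }
}
```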

WHY 100 concurrent requests?

  • Balances throughput vs MCP server resource consumption
  • Most MCP servers handle 100 concurrent requests comfortably
  • Configurable for tuning based on actual MCP server capacity

WHY 200 queue size?

  • Provides 2× buffer beyond concurrency limit
  • Balances memory usage (~40KB at 200 requests) vs utility
  • More conservative than Nginx default (512)

WHY 30s timeout?

  • Reasonable wait time for legitimate traffic
  • Prevents queue from filling with stale requests
  • Matches circuit breaker cooldown (30s recovery window)

Metrics Exposed:

  • pool_active_connections (gauge): Current concurrent requests
  • pool_queue_depth (gauge): Number of requests waiting in queue
  • pool_queue_wait_seconds (histogram): Time spent waiting (buckets: 0.1s-30s)

Example:

// Pool automatically queues when at capacity
const pool = new MCPClientPool({
  maxConcurrent: 100,
  queueSize: 200,
  queueTimeoutMs: 30000,
});

// Request queued if pool full, executed when slot available
try {
  const result = await pool.callTool('mcp__tool', params);
} catch (error) {
  if (error.message.includes('Service Unavailable')) {
    // Queue full or timeout - implement retry logic
  }
}

9.3 Resilience Pattern Interaction

Circuit Breaker + Queue:

Request → Circuit Breaker Check
   ↓ Circuit OPEN
   Fail Fast (no queue)
   ↓ Circuit CLOSED/HALF-OPEN
   Check Pool Capacity
      ↓ Under capacity
      Execute immediately
      ↓ At capacity
      Enqueue (with timeout)

Benefits:

  • Circuit breaker prevents queueing requests to known-bad servers
  • Queue provides graceful degradation under load
  • Combined: Fast failure for broken servers, queueing for healthy ones

Failure Modes:

  1. MCP Server Down: Circuit breaker opens → immediate 503 (no queueing)
  2. MCP Server Slow: Queue fills → 503 after 30s timeout
  3. High Load: Queue drains as capacity frees → requests succeed with delay

9.4 Backpressure Signaling

HTTP Status Codes:

  • 200 OK - Request succeeded (no backpressure)
  • 429 Too Many Requests - Rate limit exceeded (per-client limit hit)
  • 503 Service Unavailable - Circuit open OR queue full/timeout

Retry Guidance:

503 Circuit Open
   Retry-After: 30 (wait for circuit to close)

503 Queue Full
   Retry-After: 60 (estimated queue drain time)

503 Queue Timeout
   Retry-After: 30 (try again with fresh timeout)
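
A client can turn these hints into a retry delay (a minimal sketch with a hypothetical helper name; real clients should also cap total attempts and add jitter):

```typescript
// Decide how long to wait before retrying, based on the status code and the
// server's Retry-After hint. Returns null when the request should not be retried.
function nextDelayMs(
  status: number,
  retryAfterSeconds: number | undefined,
  attempt: number,
): number | null {
  if (status !== 429 && status !== 503) return null; // not a backpressure signal
  if (retryAfterSeconds !== undefined) return retryAfterSeconds * 1000; // server knows best
  return Math.min(30_000, 1000 * 2 ** attempt); // fallback: capped exponential backoff
}
```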

Monitoring:

# Alert on high queue depth
pool_queue_depth > 150  # Queue >75% full

# Alert on frequent circuit opens
rate(circuit_breaker_failures_total[5m]) > 10

# Alert on slow queue processing (quantiles are computed from the _bucket series)
histogram_quantile(0.95, rate(pool_queue_wait_seconds_bucket[5m])) > 15

9.5 Performance Impact

Latency Overhead:

  • Circuit Breaker: <1ms per request (state check)
  • Queue Check: <1ms per request (counter comparison)
  • Queue Wait: 0-30s (depends on load)

Memory Overhead:

  • Circuit Breaker: ~10KB per server (state tracking)
  • Connection Queue: ~200 bytes per queued request (max ~40KB)

Total Overhead: Negligible (<0.1% CPU, <1MB RAM)


10. CLI Setup Wizard Architecture (v0.9.0)

10.1 Overview

The CLI setup wizard provides one-command initialization of code-executor-mcp with automatic MCP server discovery, wrapper generation, and daily sync scheduling.

Entry Point: npm run setup → src/cli/index.ts

Design Goal: Zero-config setup with smart defaults, cross-platform support, and idempotent operation.

10.2 Component Diagram

┌─────────────────────────────────────────────────────────────┐
│                     CLI Entry Point                          │
│                   (src/cli/index.ts)                         │
│  • Self-install check (SelfInstaller)                       │
│  • Lock acquisition (LockFileService)                       │
│  • Wizard orchestration                                     │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼
┌─────────────────────────────────────────────────────────────┐
│                      CLIWizard                               │
│                  (src/cli/wizard.ts)                         │
│  • Interactive prompts (tool selection, config questions)   │
│  • Default config pattern (press Enter to skip)             │
│  • Idempotent setup (merge/reset/keep existing configs)     │
└────────────┬────────────────────────────────────────────────┘
             │
             ├─────────────────┬──────────────────┬────────────┐
             ▼                 ▼                  ▼            ▼
┌──────────────────┐  ┌─────────────────┐  ┌──────────┐  ┌────────────┐
│  ToolDetector    │  │ MCPDiscovery    │  │ Wrapper  │  │  Daily     │
│                  │  │   Service       │  │Generator │  │   Sync     │
│ • Detect Claude  │  │ • Scan configs: │  │ • TS/Py  │  │ • Schedule │
│   Code install   │  │   ~/.claude.json│  │   wrapper│  │   setup    │
│ • Validate paths │  │   .mcp.json     │  │   gen    │  │ • Platform │
│                  │  │ • Merge servers │  │ • JSDoc  │  │   specific │
└──────────────────┘  └─────────────────┘  └──────────┘  └────────────┘

10.3 Config Discovery & Merging

Two-Location Scan Pattern:

// 1. Scan global Claude Code config
const globalServers = await discovery.scanToolConfig({
  id: 'claude-code',
  configPaths: {
    linux: '~/.claude.json',
    darwin: '~/.claude.json',
    win32: '%USERPROFILE%\\.claude.json'
  }
});

// 2. Scan project config
const projectServers = await discovery.scanProjectConfig('.mcp.json');

// 3. Merge (project overrides global for duplicate names)
const mergedServers = mergeMCPServers(globalServers, projectServers);

Path Expansion:

  • ~ → os.homedir() (Linux/macOS)
  • %USERPROFILE% → process.env.USERPROFILE (Windows)
  • %APPDATA% → process.env.APPDATA (Windows)
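
The expansion rules above can be sketched as a small helper (hypothetical function name; shown with an injectable environment for testability):

```typescript
import * as os from 'node:os';
import * as path from 'node:path';

// Expand a config path: leading "~" becomes the home directory, and
// Windows-style %VAR% references are resolved from the environment.
function expandConfigPath(
  p: string,
  env: Record<string, string | undefined> = process.env,
): string {
  let expanded = p;
  if (expanded.startsWith('~')) {
    expanded = path.join(os.homedir(), expanded.slice(1));
  }
  // %USERPROFILE%, %APPDATA%, etc. — unknown variables are left untouched
  expanded = expanded.replace(/%([^%]+)%/g, (match, name: string) => env[name] ?? match);
  return expanded;
}
```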

Fallback Behavior:

  • Config file not found → Prompt user for custom path or skip
  • Invalid JSON → Log error, skip tool
  • Missing command field → Log warning, skip server

10.4 Wrapper Generation

Design: Template-based code generation with schema-driven parameter types.

Templates:

src/cli/templates/
├── typescript-wrapper.hbs  # TypeScript wrapper template
└── python-wrapper.hbs      # Python wrapper template

Generation Flow:

1. Fetch tool schemas from MCP servers (via schema cache)
2. For each tool:
   - Extract name, description, parameters (JSON Schema)
   - Generate JSDoc comments from schema
   - Generate TypeScript types from JSON Schema
   - Render template with Handlebars
3. Write wrappers to output directory
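
Steps 2a–2d can be illustrated with a tiny stand-in for the Handlebars rendering step (hypothetical types and function; the real generator renders the .hbs templates and handles the full JSON Schema type space):

```typescript
interface ToolSchema {
  name: string; // e.g. "mcp__filesystem__read_file"
  description: string;
  properties: Record<string, { type: string; description?: string }>;
}

// Turn one tool schema into a typed TypeScript wrapper with a JSDoc comment.
function renderWrapper(tool: ToolSchema): string {
  const jsonToTs: Record<string, string> = { string: 'string', number: 'number', boolean: 'boolean' };
  const params = Object.entries(tool.properties)
    .map(([key, prop]) => `${key}: ${jsonToTs[prop.type] ?? 'unknown'}`)
    .join('; ');
  // Derive a camelCase function name from the last tool-name segment
  const fnName = tool.name.split('__').pop()!
    .replace(/_([a-z])/g, (_, c: string) => c.toUpperCase());
  return [
    `/** ${tool.description} */`,
    `export async function ${fnName}(params: { ${params} }) {`,
    `  return callMCPTool('${tool.name}', params);`,
    `}`,
  ].join('\n');
}
```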

Example Output:

// Before (manual)
const file = await callMCPTool('mcp__filesystem__read_file', {
  path: '/src/app.ts'
});

// After (wrapper)
import { filesystem } from './mcp-wrappers';
const file = await filesystem.readFile({ path: '/src/app.ts' });

Benefits:

  • Type-safe with IntelliSense/autocomplete
  • Self-documenting JSDoc from schemas
  • No manual tool name lookups
  • Matches actual MCP tool APIs

10.5 Daily Sync System

Purpose: Automatically regenerate wrappers when MCP servers change.

Architecture:

┌─────────────────────────────────────────────────────────────┐
│              Platform Scheduler (scheduled job)              │
│  • macOS: launchd plist (~/.config/launchd/...)             │
│  • Linux: systemd timer (~/.config/systemd/user/...)        │
│  • Windows: Task Scheduler (HKCU\Software\Microsoft\...)    │
└────────────────┬────────────────────────────────────────────┘
                 │
                 ▼ (runs at 4-6 AM daily)
┌─────────────────────────────────────────────────────────────┐
│                 DailySyncService                             │
│             (src/cli/daily-sync.ts)                          │
│  1. Re-scan configs (~/.claude.json + .mcp.json)            │
│  2. Detect changes (new/removed/modified servers)           │
│  3. Regenerate wrappers if changes detected                 │
│  4. Log sync status                                         │
└─────────────────────────────────────────────────────────────┘

Scheduler Implementation:

| Platform | Mechanism | Config Location | Command |
| --- | --- | --- | --- |
| macOS | launchd plist | ~/Library/LaunchAgents/com.code-executor.daily-sync.plist | launchctl load/unload |
| Linux | systemd timer | ~/.config/systemd/user/code-executor-daily-sync.timer | systemctl --user enable/disable |
| Windows | Task Scheduler | HKCU\Software\Microsoft\Windows\CurrentVersion\Run | schtasks /create /delete |

Sync Execution:

# Command executed by scheduler ("--" forwards the flags to the setup script)
npm run setup -- --sync-only --non-interactive

Sync Logic:

  • Reads last sync state from ~/.code-executor/last-sync.json
  • Compares current MCP servers with last sync
  • If changes detected → regenerate wrappers
  • Update last sync state
  • Exit 0 (success) or 1 (failure)
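
The change-detection step can be sketched as a diff of the two server lists (hypothetical shapes; the real service persists this state to ~/.code-executor/last-sync.json):

```typescript
interface ServerEntry { name: string; command: string; args?: string[] }

// Compare the last-synced server list with the freshly scanned one.
function detectChanges(previous: ServerEntry[], current: ServerEntry[]) {
  const prev = new Map(previous.map(s => [s.name, JSON.stringify(s)]));
  const curr = new Map(current.map(s => [s.name, JSON.stringify(s)]));
  return {
    added: current.filter(s => !prev.has(s.name)).map(s => s.name),
    removed: previous.filter(s => !curr.has(s.name)).map(s => s.name),
    modified: current
      .filter(s => prev.has(s.name) && prev.get(s.name) !== JSON.stringify(s))
      .map(s => s.name),
  };
}
```

Wrappers are regenerated only when any of the three buckets is non-empty.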

10.6 Lock File System

Purpose: Prevent concurrent wizard runs (race condition protection).

Implementation:

import * as fs from 'node:fs/promises';
import * as os from 'node:os';
import * as path from 'node:path';

class LockFileService {
  private lockPath = path.join(os.homedir(), '.code-executor', 'setup.lock');

  async acquire(): Promise<void> {
    try {
      // The 'wx' flag creates the file only if it does not already exist,
      // atomically — avoiding the check-then-write race between processes
      await fs.writeFile(
        this.lockPath,
        JSON.stringify({ pid: process.pid, timestamp: Date.now() }),
        { flag: 'wx' },
      );
    } catch {
      throw new Error('Setup wizard already running');
    }
  }

  async release(): Promise<void> {
    await fs.unlink(this.lockPath);
  }
}

Protection Against:

  • Multiple users running setup simultaneously
  • Concurrent daily sync + manual setup
  • Race conditions in wrapper file writes

10.7 Security Considerations

Input Validation:

  • MCP server names: [a-zA-Z0-9_-]+ only (no special chars)
  • Config paths: No directory traversal (., .., ~/../etc)
  • Template variables: Escaped before rendering (XSS prevention)

Dangerous Pattern Detection:

  • MCP names with code injection patterns rejected (not escaped)
  • Validation happens BEFORE template rendering (defense-in-depth)
  • Tests: tests/security/template-injection.test.ts (387 lines)

Privilege Escalation:

  • Wizard runs with user privileges (no sudo/admin required)
  • Platform schedulers run as current user (not system-wide)
  • Lock files in user home directory (no /tmp race conditions)

10.8 Component Responsibilities (SRP)

| Component | Responsibility | Why Separated |
| --- | --- | --- |
| CLIWizard | Interactive prompts, user flow | UI/UX logic separate from business logic |
| ToolDetector | Detect AI tool installations | Tool-specific logic centralized |
| MCPDiscoveryService | Scan configs for MCP servers | Config parsing separate from UI |
| WrapperGenerator | Generate TS/Py wrappers | Code generation separate from discovery |
| DailySyncService | Daily sync orchestration | Scheduling logic separate from setup |
| PlatformScheduler | Platform detection | OS-specific logic encapsulated |
| LockFileService | Concurrent access control | Shared resource protection |

10.9 Idempotent Setup Pattern

Design Goal: Safe to run npm run setup multiple times without breaking existing config.

Detection Flow:

1. Check for existing config: ~/.code-executor/config.json
2. If exists:
   - Prompt user: Merge, Reset, Keep existing
   - Merge: Combine old + new MCP servers
   - Reset: Delete old, use new config
   - Keep: Skip setup, exit
3. If not exists:
   - Create new config with defaults

Merge Strategy:

function mergeMCPServers(
  existing: MCPServerConfig[],
  incoming: MCPServerConfig[]
): MCPServerConfig[] {
  const merged = new Map<string, MCPServerConfig>();

  // Add existing servers
  for (const server of existing) {
    merged.set(server.name, server);
  }

  // Override with incoming servers (project overrides global)
  for (const server of incoming) {
    merged.set(server.name, server);
  }

  return Array.from(merged.values());
}

10.10 Performance Characteristics

| Operation | First Run | Subsequent Runs | Notes |
| --- | --- | --- | --- |
| Tool detection | 50-100ms | <10ms | File system checks |
| MCP discovery | 100-200ms | 50-100ms | Schema cache helps |
| Wrapper generation | 200-500ms | 200-500ms | Template rendering dominant |
| Daily sync | 500ms-1s | 500ms-1s | Full re-scan + regeneration |

Optimization Opportunities:

  • Schema cache reduces discovery latency (24h TTL)
  • Template caching (compile once, render many)
  • Parallel wrapper generation (Promise.all)

Architecture Validation Checklist

Constitutional Compliance

  • Principle 1 (Progressive Disclosure): Token impact 0% (3 tools maintained, ~560 tokens)
  • Principle 2 (Security): Zero tolerance met (auth, rate limit, audit, validation, intentional exception documented)
  • Principle 3 (TDD): Red-Green-Refactor followed, 95%+ discovery coverage, 90%+ overall
  • Principle 4 (Type Safety): TypeScript strict mode, no any types (use unknown + guards)
  • Principle 5 (SOLID): SRP verified (each component single purpose), DIP via abstractions
  • Principle 6 (Concurrency): AsyncLock on shared resources (cache writes, audit logs)
  • Principle 7 (Fail-Fast): Descriptive errors with schemas, no silent failures
  • Principle 8 (Performance): Measurement-driven (<100ms P95 met), parallel queries O(1)
  • Principle 9 (Documentation): Self-documenting code, WHY comments, architecture.md complete

Quality Metrics

  • Test Coverage: 95%+ (discovery endpoint), 90%+ (overall), 85%+ (integration)
  • Performance: P95 <100ms (3 MCP servers), <5ms cached
  • Security: Auth + rate limit + audit log + validation all enforced
  • Token Usage: 3 tools, ~560 tokens (within 1.6k budget, 98% reduction maintained)

11. MCP Sampling Architecture (v1.0.0)

Release: v1.0.0 (2025-01-20)
Status: Beta
Purpose: Enable LLM-in-the-Loop execution for dynamic reasoning and analysis

11.1 Overview

MCP Sampling allows sandboxed code (TypeScript/Python) to invoke Claude during execution through simple helpers (llm.ask(), llm.think()). This enables "Claude asks Claude" scenarios for multi-step reasoning, code analysis, and data processing.

11.2 Architecture Diagram

┌─────────────────────────────────────────────────────────────┐
│                    AI Agent (Claude/Cursor)                 │
│                                                             │
│  1. Send code with enableSampling: true                     │
└─────────────────────────────────────────────────────────────┘
                    ↓ (executeTypescript/executePython)
┌─────────────────────────────────────────────────────────────┐
│               Code Executor MCP Server                      │
│                                                             │
│  2. Detect sampling enabled                                 │
│  3. Start SamplingBridgeServer                              │
│     - Generate 256-bit bearer token                         │
│     - Start HTTP server on random port (localhost only)     │
│     - Inject llm helpers into sandbox                       │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Start sandbox with bridge URL + token)
┌─────────────────────────────────────────────────────────────┐
│         Sandbox (Deno/Pyodide) with Injected Helpers        │
│                                                             │
│  User Code:                                                 │
│    const result = await llm.ask("Analyze this code...");    │
│                    ↓                                         │
│  4. HTTP POST to bridge: localhost:PORT/sample              │
│     Authorization: Bearer <token>                           │
│     Body: { messages, model, maxTokens, systemPrompt }     │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Bearer token validation)
┌─────────────────────────────────────────────────────────────┐
│           SamplingBridgeServer (Security Layer)             │
│                                                             │
│  5. Security Checks (in order):                             │
│     ✅ Validate Bearer Token (timing-safe comparison)       │
│     ✅ Check Rate Limits (10 rounds, 10k tokens max)        │
│     ✅ Validate System Prompt (allowlist check)             │
│     ✅ Validate Request Schema (AJV deep validation)        │
│                    ↓                                         │
│  6. Forward Request:                                        │
│     ├─ Mode Detection (MCP SDK or Direct API)              │
│     ├─ MCP Sampling (free) - if available                  │
│     └─ Direct Anthropic API (paid) - fallback              │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Claude API call)
┌─────────────────────────────────────────────────────────────┐
│              Claude API (Anthropic)                         │
│                                                             │
│  7. Process Request:                                        │
│     - Model: claude-sonnet-4-5 (default)                   │
│     - Response: { content, stop_reason, usage }            │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Return response)
┌─────────────────────────────────────────────────────────────┐
│           SamplingBridgeServer (Post-Processing)            │
│                                                             │
│  8. Content Filtering:                                      │
│     ✅ Scan for secrets (OpenAI keys, GitHub tokens, AWS)  │
│     ✅ Scan for PII (emails, SSNs, credit cards)           │
│     ✅ Redact violations: [REDACTED_SECRET]/[REDACTED_PII] │
│                    ↓                                         │
│  9. Audit Logging:                                          │
│     ✅ SHA-256 hash of prompt/response (no plaintext)      │
│     ✅ Log: timestamp, model, tokens, duration, violations  │
│     ✅ Write to: ~/.code-executor/audit-log.jsonl          │
│                    ↓                                         │
│  10. Update Metrics:                                        │
│      - Increment round counter                              │
│      - Add tokens to cumulative budget                      │
│      - Calculate quota remaining                            │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Return filtered response)
┌─────────────────────────────────────────────────────────────┐
│         Sandbox (Continue Execution)                        │
│                                                             │
│  User Code:                                                 │
│    console.log(result); // Claude's filtered response       │
│                    ↓                                         │
│  11. Execution completes, bridge shuts down gracefully      │
└─────────────────────────────────────────────────────────────┘
                    ↓ (Return execution result)
┌─────────────────────────────────────────────────────────────┐
│               Code Executor MCP Server                      │
│                                                             │
│  12. Return to AI Agent:                                    │
│      {                                                      │
│        success: true,                                       │
│        output: "...",                                       │
│        samplingCalls: [...],  // Array of all LLM calls    │
│        samplingMetrics: {                                   │
│          totalRounds: 2,                                    │
│          totalTokens: 150,                                  │
│          totalDurationMs: 1200,                             │
│          averageTokensPerRound: 75,                         │
│          quotaRemaining: { rounds: 8, tokens: 9850 }       │
│        }                                                    │
│      }                                                      │
└─────────────────────────────────────────────────────────────┘

11.3 Core Components

11.3.1 SamplingBridgeServer

Purpose: Ephemeral HTTP bridge between sandbox and Claude API with security enforcement

Responsibilities:

  1. Lifecycle Management

    • Start: Generate bearer token, find random port, start HTTP server
    • Stop: Drain active requests (max 5s), close server gracefully
    • Lifecycle: One bridge per execution, destroyed after completion
  2. Security Enforcement

    • Bearer token validation (timing-safe comparison)
    • Rate limiting (rounds and tokens)
    • System prompt allowlist validation
    • Content filtering (secrets/PII redaction)
  3. Request Proxying

    • Mode detection: MCP SDK (free) or Direct API (paid)
    • Request forwarding with proper authentication
    • Response filtering and audit logging

Key Methods:

  • start(): Promise<{port, authToken}> - Start bridge server
  • stop(): Promise<void> - Graceful shutdown with request draining
  • getSamplingMetrics(): Promise<SamplingMetrics> - Get current metrics
  • handleRequest(req, res) - HTTP request handler (private)

Configuration:

interface SamplingConfig {
  enabled: boolean;                  // Enable/disable sampling
  maxRoundsPerExecution: number;     // Max LLM calls (default: 10)
  maxTokensPerExecution: number;     // Max tokens (default: 10,000)
  timeoutPerCallMs: number;          // Timeout per call (default: 30,000ms)
  allowedSystemPrompts: string[];    // Prompt allowlist
  contentFilteringEnabled: boolean;  // Enable filtering (default: true)
}

11.3.2 RateLimiter

Purpose: Prevent infinite loops and resource exhaustion

Implementation:

  • Round Counter: Tracks number of sampling calls
  • Token Budget: Cumulative token count across all calls
  • AsyncLock Protection: Thread-safe counters for concurrent access
  • Quota Calculation: Real-time remaining rounds/tokens

Methods:

  • async checkLimit(tokensRequested): Promise<{exceeded, metrics}> - Check if request would exceed limits
  • async incrementUsage(tokensUsed): Promise<void> - Increment counters after successful call
  • async getMetrics(): Promise<{roundsUsed, tokensUsed}> - Get current usage
  • async getQuotaRemaining(): Promise<{rounds, tokens}> - Get remaining quota
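
A minimal sketch of these counters (synchronous for brevity; the real implementation guards them with an AsyncLock as noted above):

```typescript
class SamplingRateLimiter {
  private roundsUsed = 0;
  private tokensUsed = 0;

  constructor(private maxRounds = 10, private maxTokens = 10_000) {}

  // Would this request exceed either the round or the token limit?
  checkLimit(tokensRequested: number): { exceeded: boolean } {
    return {
      exceeded:
        this.roundsUsed + 1 > this.maxRounds ||
        this.tokensUsed + tokensRequested > this.maxTokens,
    };
  }

  // Called only after a successful sampling call
  incrementUsage(tokensUsed: number): void {
    this.roundsUsed++;
    this.tokensUsed += tokensUsed;
  }

  getQuotaRemaining(): { rounds: number; tokens: number } {
    return {
      rounds: this.maxRounds - this.roundsUsed,
      tokens: this.maxTokens - this.tokensUsed,
    };
  }
}
```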

Test Coverage:

  • ✅ T033-T036: Rate limiting tests (10 rounds, 10k tokens, 429 responses)
  • ✅ T037: Concurrent access protection (AsyncLock verification)

11.3.3 ContentFilter

Purpose: Detect and redact secrets/PII from Claude responses

Patterns Detected:

  • Secrets: OpenAI keys (sk-*), GitHub tokens (ghp_*), AWS keys (AKIA*), JWT tokens (eyJ*)
  • PII: Emails, SSNs, credit card numbers

Methods:

  • scan(content): {violations, filtered} - Detect violations and return redacted content
  • filter(content, rejectOnViolation): string - Filter with optional rejection mode
  • hasViolations(content): boolean - Quick check for any violations

Redaction Format:

  • Secrets: [REDACTED_SECRET]
  • PII: [REDACTED_PII]
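
An illustrative filtering pass (deliberately simplified regexes, not the production pattern set, which covers more formats and returns structured violation records):

```typescript
const SECRET_PATTERNS: RegExp[] = [
  /sk-[A-Za-z0-9]{20,}/g, // OpenAI-style keys
  /ghp_[A-Za-z0-9]{36}/g, // GitHub personal access tokens
  /AKIA[A-Z0-9]{16}/g,    // AWS access key IDs
];
const PII_PATTERNS: RegExp[] = [
  /[\w.+-]+@[\w-]+\.[\w.]+/g, // email addresses
  /\b\d{3}-\d{2}-\d{4}\b/g,   // US SSNs
];

// Replace every match with the documented redaction markers.
function filterContent(content: string): string {
  let filtered = content;
  for (const p of SECRET_PATTERNS) filtered = filtered.replace(p, '[REDACTED_SECRET]');
  for (const p of PII_PATTERNS) filtered = filtered.replace(p, '[REDACTED_PII]');
  return filtered;
}
```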

Test Coverage:

  • ✅ T022-T026: Pattern detection tests (98%+ coverage)
  • ✅ T115: Secret leakage redaction verification

11.3.4 SamplingAuditLogger

Purpose: Log all sampling calls for security auditing and compliance

Log Format (JSONL):

{
  "timestamp": "2025-01-20T12:00:00.000Z",
  "executionId": "exec-123",
  "round": 1,
  "model": "claude-sonnet-4-5",
  "promptHash": "sha256:abc123...",
  "responseHash": "sha256:def456...",
  "tokensUsed": 75,
  "durationMs": 600,
  "status": "success",
  "contentViolations": [
    { "type": "secret", "pattern": "openai_key", "count": 1 }
  ]
}

Key Features:

  • SHA-256 Hashing: No plaintext secrets in logs
  • AsyncLock Protection: Thread-safe concurrent writes
  • JSONL Format: One entry per line, easy to parse
  • Location: ~/.code-executor/audit-log.jsonl
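
Constructing one hashed entry looks roughly like this (hypothetical helper; field names follow the JSONL example above, and actual writes go through the AsyncLock-guarded appender):

```typescript
import { createHash } from 'node:crypto';

// Build an audit entry that stores only SHA-256 digests of prompt/response,
// never the plaintext.
function buildAuditEntry(
  executionId: string,
  round: number,
  prompt: string,
  response: string,
  tokensUsed: number,
) {
  const sha256 = (s: string) => 'sha256:' + createHash('sha256').update(s).digest('hex');
  return {
    timestamp: new Date().toISOString(),
    executionId,
    round,
    promptHash: sha256(prompt),     // enables deduplication without exposure
    responseHash: sha256(response),
    tokensUsed,
  };
}
```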

Test Coverage:

  • ✅ T082-T084: Audit logging tests (13/13 passing)

11.4 API Design

11.4.1 TypeScript API (Deno Sandbox)

Simple Query:

const response = await llm.ask("What is 2+2?");
// Returns: "4"

Multi-Turn Conversation:

const response = await llm.think({
  messages: [
    { role: "user", content: "What is 2+2?" },
    { role: "assistant", content: "4" },
    { role: "user", content: "What about 3+3?" }
  ],
  model: "claude-sonnet-4-5",  // Optional
  maxTokens: 1000,              // Optional
  systemPrompt: "",             // Optional (must be in allowlist)
  stream: false                 // Optional (not yet supported)
});
// Returns: "6"

11.4.2 Python API (Pyodide Sandbox)

Simple Query:

response = await llm.ask("What is 2+2?")
# Returns: "4"

Multi-Turn Conversation:

response = await llm.think(
    messages=[
        {"role": "user", "content": "What is 2+2?"},
        {"role": "assistant", "content": "4"},
        {"role": "user", "content": "What about 3+3?"}
    ],
    model="claude-sonnet-4-5",  # Optional
    max_tokens=1000,             # Optional (snake_case for Python)
    system_prompt="",            # Optional (must be in allowlist)
    stream=False                 # Optional (not supported in Pyodide)
)
# Returns: "6"

11.5 Security Model

11.5.1 Threat Matrix

| Threat | Likelihood | Impact | Mitigation | Test |
| --- | --- | --- | --- | --- |
| Infinite loop API cost | High | High | Rate limiting (10 rounds) | T112 ✅ |
| Token exhaustion | Medium | High | Token budget (10k tokens) | T113 ✅ |
| Prompt injection | Medium | Medium | System prompt allowlist | T114 ✅ |
| Secret leakage | Low | Critical | Content filtering + SHA-256 logs | T115 ✅ |
| Timing attacks | Low | Medium | Constant-time comparison | T116 ✅ |
| Unauthorized access | Low | Medium | Bearer token + localhost binding | T014/T011 ✅ |

11.5.2 Defense Layers

  1. Authentication Layer: 256-bit bearer token (unique per execution)
  2. Rate Limiting Layer: 10 rounds, 10,000 tokens per execution
  3. Validation Layer: System prompt allowlist, AJV schema validation
  4. Content Filtering Layer: Secrets/PII redaction before returning
  5. Audit Layer: SHA-256 hashed logs for forensic analysis
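
Layer 1 can be sketched with Node's crypto primitives (hypothetical helper names; this mirrors the 256-bit token and constant-time comparison described above):

```typescript
import { randomBytes, timingSafeEqual } from 'node:crypto';

// 32 random bytes = 256 bits, hex-encoded for the Authorization header
function generateToken(): string {
  return randomBytes(32).toString('hex');
}

// Constant-time comparison prevents timing attacks from leaking the token
// byte by byte. timingSafeEqual requires equal-length buffers, so a length
// mismatch is rejected up front.
function validateToken(presented: string, expected: string): boolean {
  const a = Buffer.from(presented);
  const b = Buffer.from(expected);
  return a.length === b.length && timingSafeEqual(a, b);
}
```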

11.6 Performance Characteristics

| Metric | Target | Measured | Status |
| --- | --- | --- | --- |
| Bridge startup time | <50ms | ~30ms | ✅ PASS |
| Per-call overhead | <100ms | ~60ms | ✅ PASS |
| Memory footprint | <50MB | ~15MB | ✅ PASS |
| Token validation | <10ms | ~5ms | ✅ PASS |
| Content filtering | <50ms | ~15ms | ✅ PASS |

11.7 Configuration Hierarchy

Priority (highest to lowest):

  1. Per-execution parameters (enableSampling, maxSamplingRounds, maxSamplingTokens)
  2. Environment variables (CODE_EXECUTOR_SAMPLING_ENABLED, CODE_EXECUTOR_MAX_SAMPLING_ROUNDS)
  3. Configuration file (~/.code-executor/config.json)
  4. Default values (enabled: false, maxRounds: 10, maxTokens: 10,000)
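
Resolution reduces to "later spreads win" (a sketch with hypothetical field names mirroring the documented options):

```typescript
interface SamplingOptions {
  enabled?: boolean;
  maxRounds?: number;
  maxTokens?: number;
}

// Merge the four levels: defaults < config file < env vars < per-execution.
function resolveSamplingConfig(
  perExecution: SamplingOptions,
  envVars: SamplingOptions,
  configFile: SamplingOptions,
): { enabled: boolean; maxRounds: number; maxTokens: number } {
  const defaults = { enabled: false, maxRounds: 10, maxTokens: 10_000 };
  return { ...defaults, ...configFile, ...envVars, ...perExecution };
}
```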

11.8 Hybrid Architecture (MCP SDK vs Direct API)

Mode Detection:

detectSamplingMode(): 'mcp' | 'direct' {
  if (this.mcpServer && typeof this.mcpServer.request === 'function') {
    return 'mcp';  // MCP SDK available (free)
  }
  return 'direct';  // Fallback to Direct API (paid)
}

MCP SDK Mode (Free):

  • Uses Claude Desktop's MCP SDK for sampling
  • No additional API costs
  • Requires Claude Desktop with MCP support

Direct API Mode (Paid):

  • Uses Anthropic API directly
  • Requires ANTHROPIC_API_KEY
  • Pay-per-token pricing

User Experience:

  • Automatic detection and fallback
  • Clear logging of which mode is active
  • Same API surface regardless of mode

11.9 Docker Support

Detection:

  • Checks for /.dockerenv file
  • Checks for Docker cgroup signatures in /proc/self/cgroup

Bridge URL Handling:

  • Host execution: http://localhost:PORT
  • Docker execution: http://host.docker.internal:PORT
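
Both checks and the URL selection can be sketched as (hypothetical helpers; the file reads are injectable so the logic is testable outside a container):

```typescript
import * as fs from 'node:fs';

// Detect Docker via /.dockerenv or docker signatures in /proc/self/cgroup.
function isDocker(
  existsSync: (p: string) => boolean = fs.existsSync,
  readCgroup: () => string = () => {
    try { return fs.readFileSync('/proc/self/cgroup', 'utf8'); } catch { return ''; }
  },
): boolean {
  return existsSync('/.dockerenv') || readCgroup().includes('docker');
}

// Inside a container, localhost is the container itself, so the sandbox must
// reach the bridge via host.docker.internal instead.
function bridgeUrl(port: number, inDocker: boolean): string {
  const host = inDocker ? 'host.docker.internal' : 'localhost';
  return `http://${host}:${port}`;
}
```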

Docker Compose Example:

services:
  code-executor:
    image: aberemia24/code-executor-mcp:1.0.0
    environment:
      - CODE_EXECUTOR_SAMPLING_ENABLED=true
      - ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
    extra_hosts:
      - "host.docker.internal:host-gateway"

11.10 Test Coverage

Total Sampling Tests: 74/74 passing (100%)

| Component | Tests | Status |
| --- | --- | --- |
| Bridge Server | 15/15 | ✅ PASS |
| Content Filter | 8/8 | ✅ PASS |
| TypeScript API | 4/4 | ✅ PASS |
| Python API | 3/3 | ✅ PASS |
| Config Schema | 23/23 | ✅ PASS |
| Audit Logging | 13/13 | ✅ PASS |
| Security Attacks | 8/8 | ✅ PASS |

Key Tests:

  • T010-T016: Bridge server lifecycle (startup, shutdown, token validation)
  • T022-T026: Content filtering (secrets, PII detection and redaction)
  • T033-T037: Rate limiting (rounds, tokens, concurrent access)
  • T044-T047: System prompt allowlist validation
  • T053-T056: TypeScript sampling API
  • T063-T066: Python sampling API
  • T082-T084: Audit logging with SHA-256 hashes
  • T112-T116: Security attack tests (infinite loop, token exhaustion, prompt injection, secret leakage, timing attacks)

11.11 Design Rationale

Why Ephemeral Bridge Server?

  • Security: Unique bearer token per execution prevents cross-execution attacks
  • Isolation: Localhost binding ensures no external access
  • Lifecycle: Bridge destroyed after execution, no lingering processes

Why Rate Limiting?

  • Cost Control: Prevent infinite loops from causing API cost explosions
  • Resource Management: Prevent token exhaustion from overwhelming Claude API
  • User Protection: Default limits protect users from accidental abuse

Why Content Filtering?

  • Secret Protection: Prevent API keys, tokens, credentials from leaking into logs
  • Compliance: PII redaction helps meet privacy regulations (GDPR, CCPA)
  • Defense-in-Depth: Even if Claude accidentally generates secrets, they're redacted

Why System Prompt Allowlist?

  • Prompt Injection Defense: Prevents attackers from bypassing security via custom system prompts
  • Controlled Behavior: Ensures Claude operates within intended parameters
  • Auditability: Limited set of prompts makes behavior predictable

Why SHA-256 Audit Logs?

  • Forensics: Enable investigation of security incidents without exposing secrets
  • Deduplication: Same prompt = same hash, enables pattern detection
  • Compliance: Meets audit requirements without storing plaintext data

Document Version: 1.2.0 (Added MCP Sampling Architecture for v1.0.0)
Contributors: Alexandru Eremia
Last Review: 2025-11-19