Project: Code Executor MCP
Version: 0.9.0
Last Updated: 2025-11-19
- System Overview
- Core Components
- Progressive Disclosure Architecture
- Security Architecture
- Discovery System
- Data Flow
- Concurrency & Performance
- Design Decisions
- Resilience Patterns
- CLI Setup Wizard Architecture
- MCP Sampling Architecture (v1.0.0)
Code Executor MCP is a universal MCP orchestration server that implements the progressive disclosure pattern to eliminate context bloat from exposing multiple MCP servers' tool schemas.
Exposing 47 MCP tools directly to an AI agent consumes 141k tokens just for schemas, exhausting context before any work begins.
Two-tier access model:
- Tier 1 (Top-level): 3 lightweight tools (~560 tokens)
  - executeTypescript - Execute TypeScript code in Deno sandbox
  - executePython - Execute Python code in Pyodide sandbox
  - health - Server health check
- Tier 2 (On-demand): All MCP tools accessible via code execution

// Inside sandbox, access any MCP tool on-demand
const result = await callMCPTool('mcp__zen__codereview', {...});

Result: 98% token reduction (141k → 1.6k tokens)
┌─────────────────────────────────────────────────────────────┐
│ AI Agent (Claude) │
│ (MCP Client Context) │
└────────────────┬────────────────────────────────────────────┘
│ MCP Protocol (STDIO)
│ Top-level tools: 3 tools, ~560 tokens
▼
┌─────────────────────────────────────────────────────────────┐
│ Code Executor MCP Server (Node.js) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ MCP Proxy Server (HTTP Localhost) │ │
│ │ • POST / (callMCPTool endpoint) │ │
│ │ • GET /mcp/tools (discovery endpoint - NEW v0.4.0) │ │
│ │ • Bearer token authentication │ │
│ │ • Rate limiting (30 req/60s) │ │
│ │ • Audit logging (AsyncLock mutex) │ │
│ └──────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────────────────────────┐ │
│ │ MCP Client Pool │ │
│ │ • Manages connections to multiple MCP servers │ │
│ │ • Parallel queries (Promise.all) │ │
│ │ • Resilient aggregation (partial failure handling) │ │
│ │ • In-memory tool list (listAllTools) │ │
│ └──────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────────────────────────┐ │
│ │ Schema Cache │ │
│ │ • LRU cache (max 1000 entries) │ │
│ │ • Disk persistence (~/.code-executor/cache.json) │ │
│ │ • 24h TTL with stale-on-error fallback │ │
│ │ • AsyncLock mutex (thread-safe writes) │ │
│ └──────────────────────────────────────────────────────┘ │
│ │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ Sandbox Executors (Deno/Pyodide subprocesses) │ │
│ │ • Isolated execution context │ │
│ │ • Injected globals: │ │
│ │ - callMCPTool(name, params) │ │
│ │ - discoverMCPTools(options) - NEW v0.4.0 │ │
│ │ - getToolSchema(toolName) - NEW v0.4.0 │ │
│ │ - searchTools(query, limit) - NEW v0.4.0 │ │
│ │ • Restricted permissions (allowlist, network, fs) │ │
│ └──────────────────────────────────────────────────────┘ │
└────────────────┬────────────────────────────────────────────┘
│ MCP Protocol (STDIO)
│ External MCP Servers (parallel queries)
▼
┌─────────────────────────────────────────────────────────────┐
│ External MCP Servers (filesystem, zen, linear, etc.) │
│ • Queried in parallel via Promise.all (O(1) amortized) │
│ • Each returns tools/list and tools/call responses │
│ • Discovery: 50-100ms first call, <5ms cached │
└─────────────────────────────────────────────────────────────┘
| Component | Responsibility (SRP) | Pattern | Concurrency Safe |
|---|---|---|---|
| MCP Proxy Server | Route HTTP requests, enforce auth/rate limiting, audit log | Proxy | Yes (AsyncLock on audit logs) |
| MCP Client Pool | Manage MCP connections, parallel query aggregation | Pool | Yes (read-only queries, write-once at startup) |
| Schema Cache | Cache tool schemas, disk persistence, LRU eviction | Cache | Yes (AsyncLock on disk writes) |
| Sandbox Executor | Execute untrusted code in isolated environment | Sandbox | Yes (independent subprocesses) |
| Discovery Functions | Provide in-sandbox tool discovery (v0.4.0) | Wrapper | Yes (stateless HTTP calls) |
Design Goal: Maintain ~1.6k tokens for top-level tools (98% reduction from 141k baseline)
Achievement (v0.4.0):
- Tool count: 3 tools (no increase from v0.3.x)
- Token usage: ~560 tokens (well below 1.6k budget)
- Discovery functions: Hidden from top-level (injected in sandbox only)
Tier 1: Top-Level Tools (Exposed to AI Agent)
// AI agent sees only these in context:
- executeTypescript(code, allowedTools?, timeoutMs?, permissions?)
- executePython(code, allowedTools?, timeoutMs?, permissions?)
- health()

Tier 2: On-Demand Tools (Accessible Inside Sandbox)
// Inside executeTypescript code, AI agent can:
// 1. Execute any MCP tool (existing v0.3.x)
const result = await callMCPTool('mcp__zen__codereview', {
step: 'Analysis',
relevant_files: ['/path/to/file.ts'],
// ... other params
});
// 2. Discover available tools (NEW v0.4.0)
const allTools = await discoverMCPTools();
// Returns: ToolSchema[] (name, description, parameters)
// 3. Search tools by keyword (NEW v0.4.0)
const fileTools = await searchTools('file read write', 10);
// Returns: Top 10 tools matching keywords (OR logic, case-insensitive)
// 4. Inspect tool schema (NEW v0.4.0)
const schema = await getToolSchema('mcp__filesystem__read_file');
// Returns: Full JSON Schema for tool parameters + outputSchema (v0.6.0)

Design Goal: Enable AI agents to understand tool response structure without trial execution
Implementation:
- All 3 code-executor tools provide Zod schemas for responses (outputSchema)
- Uses MCP SDK native support (ZodRawShape format)
- Graceful fallback for third-party tools without output schemas
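For illustration, wiring one of these response schemas into the SDK looks roughly like the sketch below. This is a minimal sketch assuming the MCP SDK's registerTool API; executeInSandbox is a hypothetical stand-in for the real executor.

```typescript
import { McpServer } from '@modelcontextprotocol/sdk/server/mcp.js';
import { z } from 'zod';

// Hypothetical sandbox entry point; stands in for the real executor.
declare function executeInSandbox(
  code: string
): Promise<{ success: boolean; output: string; executionTimeMs: number }>;

const server = new McpServer({ name: 'code-executor', version: '0.9.0' });

// ZodRawShape mirroring the ExecutionResult fields documented below
const executionResultShape = {
  success: z.boolean(),
  output: z.string(),
  error: z.string().optional(),
  executionTimeMs: z.number(),
};

server.registerTool(
  'run-typescript-code',
  {
    description: 'Execute TypeScript code in the Deno sandbox',
    inputSchema: { code: z.string() },
    outputSchema: executionResultShape, // surfaced via tools/list in SDK >= 1.22
  },
  async ({ code }) => {
    const result = await executeInSandbox(code);
    return {
      content: [{ type: 'text', text: JSON.stringify(result) }],
      structuredContent: result, // validated against outputSchema by the SDK
    };
  }
);
```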
Response Schemas:
// ExecutionResult (run-typescript-code, run-python-code)
{
success: boolean,
output: string,
error?: string,
executionTimeMs: number,
toolCallsMade?: string[],
toolCallSummary?: ToolCallSummaryEntry[]
}
// HealthCheck (health)
{
healthy: boolean,
auditLog: { enabled: boolean },
mcpClients: { connected: number },
connectionPool: { active, waiting, max },
uptime: number,
timestamp: string
}

Benefits:
- ✅ AI agents know response structure upfront
- ✅ No trial-and-error required for filtering/aggregation
- ✅ Better code generation (correct field access)
- ✅ Optional field - no breaking changes
Data Flow:
1. Tool registration: Zod schema → MCP SDK Tool.outputSchema
2. Discovery: MCPClientPool returns ToolSchema with outputSchema
3. Schema cache: CachedToolSchema.outputSchema persisted (24h TTL)
4. Graceful fallback: Third-party tools return outputSchema: undefined
Status: OutputSchema is now fully functional in the MCP protocol as of v0.7.1 (MCP SDK v1.22.0).
What Changed:
- ✅ MCP SDK v1.22.0 exposes outputSchema via the tools/list protocol response
- ✅ All 3 code-executor tools expose response structure to AI agents
- ✅ External MCP clients can see outputSchema immediately
- ✅ No trial execution needed for response structure discovery
Protocol Response (v1.22.0):
{
"tools": [
{
"name": "run-typescript-code",
"description": "...",
"inputSchema": { "type": "object", "properties": { ... } },
"outputSchema": { // ✅ NOW EXPOSED IN PROTOCOL
"type": "object",
"properties": {
"success": { "type": "boolean" },
"output": { "type": "string" },
"error": { "type": "string" },
"executionTimeMs": { "type": "number" }
}
}
}
]
}

Verification Test:
node test-outputschema-v122.mjs
# Result:
# ✅ run-typescript-code: outputSchema: YES! (6 fields)
# ✅ run-python-code: outputSchema: YES! (6 fields)
# ✅ health: outputSchema: YES! (6 fields)
# 🎉 SUCCESS! All tools have outputSchema exposed in protocol!

Migration Details (v1.0.4 → v1.22.0):
- Handler signatures updated: (params) → (args, extra)
- Added RequestHandlerExtra for request context (cancellation signals, session tracking)
- Runtime Zod validation preserved (zero functional changes)
- All 620 tests passing, zero regressions
Impact:
- Issue #28 RESOLVED: AI agents now see response structure upfront
- No trial-and-error: Agents can write correct filtering/aggregation code immediately
- Progressive disclosure intact: Still 98% token reduction (141k → 1.6k)
- Future-proof: Ready for ecosystem-wide outputSchema adoption
┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 1: MCP Proxy Server (Auth + Rate Limit) │
│ • Bearer token authentication (per-execution, 32-byte) │
│ • Rate limiting (30 req/60s per client) │
│ • Query validation (max 100 chars, alphanumeric+safe chars) │
│ • Audit logging (all requests, success/failure) │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 2: Tool Allowlist (Execution Gating) │
│ • Enforced by executeTypescript allowedTools parameter │
│ • Discovery bypasses allowlist (read-only metadata) │
│ • Execution still enforced (callMCPTool checks allowlist) │
│ • Trade-off documented: discovery = read, execution = write │
└─────────────────────────────────────────────────────────────┘
│
┌─────────────────────────────────────────────────────────────┐
│ Security Boundary 3: Sandbox Isolation (Code Execution) │
│ • Deno sandbox with restricted permissions │
│ • No filesystem access (unless explicitly allowed) │
│ • No network access (except localhost proxy) │
│ • No environment variable access │
│ • Memory limits enforced │
└─────────────────────────────────────────────────────────────┘
Decision (v0.4.0): Discovery functions bypass tool allowlist for read-only metadata access.
Rationale:
- Problem: AI agents get stuck without knowing what tools exist (blind execution)
- Solution: Allow discovery of tool schemas (read-only metadata)
- Mitigation: Execution still enforces allowlist (two-tier security model)
- Risk Assessment: LOW - schemas are non-sensitive metadata, no execution without allowlist
Security Model:
| Operation | Allowlist Check | Auth Required | Rate Limited | Audit Logged |
|---|---|---|---|---|
| Discovery (discoverMCPTools) | ❌ Bypassed | ✅ Required | ✅ Yes (30/60s) | ✅ Yes |
| Execution (callMCPTool) | ✅ Enforced | ✅ Required | ✅ Yes (30/60s) | ✅ Yes |
Constitutional Alignment: This intentional exception is documented in spec.md Section 2 (Constitutional Exceptions) as BY DESIGN per Principle 2 (Security Zero Tolerance).
Design Goal: Enable AI agents to discover, search, and inspect MCP tools without manual documentation lookup.
┌─────────────────────────────────────────────────────────────┐
│ Discovery Flow (Single Round-Trip) │
│ │
│ AI Agent executes ONE TypeScript call: │
│ ┌─────────────────────────────────────────────────────┐ │
│ │ const tools = await discoverMCPTools(); │ │
│ │ const schema = await getToolSchema('tool_name'); │ │
│ │ const result = await callMCPTool('tool_name', {...});│ │
│ └─────────────────────────────────────────────────────┘ │
│ │
│ No context switching, variables persist across steps │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Sandbox → Proxy: HTTP GET /mcp/tools │
│ • 500ms timeout (fast fail, no hanging) │
│ • Bearer token in Authorization header │
│ • Optional ?q=keyword1+keyword2 search │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Proxy → MCP Servers: Parallel Queries (Promise.all) │
│ • Query all MCP servers simultaneously (O(1) amortized) │
│ • Use Schema Cache for schemas (24h TTL, disk-persisted) │
│ • Resilient aggregation (partial failures handled) │
│ • Performance: First call 50-100ms, cached <5ms │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Response: ToolSchema[] (JSON) │
│ [ │
│ { │
│ "name": "mcp__filesystem__read_file", │
│ "description": "Read file contents", │
│ "parameters": { /* JSON Schema */ } │
│ }, │
│ ... │
│ ] │
└─────────────────────────────────────────────────────────────┘
Purpose: Fetch all available tool schemas from connected MCP servers
Signature:
interface DiscoveryOptions {
search?: string[]; // Optional keyword array (OR logic, case-insensitive)
}
async function discoverMCPTools(
options?: DiscoveryOptions
): Promise<ToolSchema[]>

Implementation:
- Injected into sandbox as globalThis.discoverMCPTools
- Calls GET /mcp/tools endpoint (localhost proxy)
- 500ms timeout via AbortSignal.timeout(500)
- Returns full tool schemas with JSON Schema parameters (see the sketch below)
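A minimal sketch of what the injected global can look like (PROXY_PORT and AUTH_TOKEN are illustrative names for values the executor injects into the sandbox; this is not the actual implementation):

```typescript
// Illustrative sketch only; the executor injects the real values.
declare const PROXY_PORT: number;
declare const AUTH_TOKEN: string;

interface ToolSchema { name: string; description: string; parameters: unknown; }
interface DiscoveryOptions { search?: string[]; }

(globalThis as Record<string, unknown>).discoverMCPTools = async (
  options?: DiscoveryOptions
): Promise<ToolSchema[]> => {
  const query = options?.search?.length
    ? `?q=${encodeURIComponent(options.search.join(' '))}`
    : '';
  const response = await fetch(`http://localhost:${PROXY_PORT}/mcp/tools${query}`, {
    signal: AbortSignal.timeout(500), // fast fail: no hanging on slow MCP servers
    headers: { Authorization: `Bearer ${AUTH_TOKEN}` },
  });
  if (!response.ok) throw new Error(`Discovery failed: HTTP ${response.status}`);
  return response.json();
};
```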
Performance:
- First call: 50-100ms (populates schema cache)
- Subsequent calls: <5ms (from cache, 24h TTL)
- Parallel queries across 3+ MCP servers: <100ms P95
Purpose: Retrieve full JSON Schema for a specific tool
Signature:
async function getToolSchema(
toolName: string
): Promise<ToolSchema | null>

Implementation:
- Wrapper over discoverMCPTools() (DRY principle)
- Finds tool by name using Array.find()
- Returns null if tool not found (no exceptions)
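Put together, the wrapper is only a few lines (a sketch based on the description above):

```typescript
// Sketch per the description above: delegate to discoverMCPTools() (DRY)
// and return null instead of throwing for unknown tools.
declare function discoverMCPTools(): Promise<ToolSchema[]>;
interface ToolSchema { name: string; description: string; parameters: unknown; }

async function getToolSchema(toolName: string): Promise<ToolSchema | null> {
  const tools = await discoverMCPTools();
  return tools.find((tool) => tool.name === toolName) ?? null;
}
```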
Purpose: Search tools by keywords with result limiting
Signature:
async function searchTools(
query: string,
limit?: number // Default: 10
): Promise<ToolSchema[]>

Implementation:
- Splits query by whitespace: query.split(/\s+/)
- Calls discoverMCPTools({ search: keywords })
- Applies result limit via Array.slice(0, limit)
- OR logic: matches if ANY keyword found in name/description
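A sketch following those steps (the OR-logic keyword matching itself happens in the proxy):

```typescript
// Sketch per the steps above: split on whitespace, delegate to discovery,
// then cap the result count.
declare function discoverMCPTools(options?: { search?: string[] }): Promise<ToolSchema[]>;
interface ToolSchema { name: string; description: string; parameters: unknown; }

async function searchTools(query: string, limit = 10): Promise<ToolSchema[]> {
  const keywords = query.split(/\s+/).filter(Boolean);
  const matches = await discoverMCPTools({ search: keywords });
  return matches.slice(0, limit);
}
```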
Design Decision: Query all MCP servers in parallel using Promise.all for O(1) amortized latency.
Sequential vs Parallel:
// ❌ Sequential (3 servers × 30ms each = 90ms)
for (const client of mcpClients) {
const tools = await client.listTools(); // Wait for each
allTools.push(...tools);
}
// ✅ Parallel (max 30ms, O(1) amortized)
const queries = mcpClients.map(client => client.listTools());
const results = await Promise.all(queries); // All at once
const allTools = results.flat();

Resilient Aggregation:
// Handle partial failures gracefully
const queries = mcpClients.map(async client => {
try {
return await client.listTools();
} catch (error) {
console.error(`MCP server ${client.name} failed:`, error);
return { tools: [] }; // Return empty, don't block others
}
});

Performance Benefit:
- 1 MCP server: 30ms (baseline)
- 3 MCP servers (sequential): 90ms (3× slower)
- 3 MCP servers (parallel): 35ms (O(1) amortized)
- 10 MCP servers (parallel): 50ms (still O(1))
Target Met: P95 latency <100ms for 3 MCP servers (spec.md NFR-2)
Design Decision: 500ms timeout for proxy→sandbox communication (fast fail, no retries).
Rationale:
- AI agents prefer fast failure over hanging
- 500ms allows parallel queries (100ms + network overhead)
- No retries: discovery errors should surface immediately
- Clear error messages guide AI agent to retry if transient
Implementation:
// Sandbox side (fetch with timeout)
const controller = new AbortController();
const timeoutId = setTimeout(() => controller.abort(), 500);
try {
const response = await fetch(url, {
signal: controller.signal,
headers: { 'Authorization': `Bearer ${token}` }
});
return await response.json();
} catch (error) {
if (error.name === 'AbortError') {
throw new Error('Discovery timeout (500ms exceeded). MCP servers may be slow.');
}
throw error;
} finally {
clearTimeout(timeoutId);
}

Problem: Native Python executor (subprocess.spawn) had ZERO sandbox isolation.
Solution: Pyodide WebAssembly runtime with complete isolation.
┌─────────────────────────────────────────────────────────────┐
│ Python Code Execution │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Pyodide WebAssembly Sandbox (v0.26.4) │
│ ┌──────────────────────────────────────────────────────┐ │
│ │ WebAssembly VM (Primary Boundary) │ │
│ │ • No native syscall access │ │
│ │ • Memory-safe (bounds checking, type safety) │ │
│ │ • Cross-platform consistency │ │
│ └──────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────────────────────────┐ │
│ │ Virtual Filesystem (Emscripten FS) │ │
│ │ • In-memory only (no host access) │ │
│ │ • /tmp writable, / read-only │ │
│ │ • Host files completely inaccessible │ │
│ └──────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────────────────────────┐ │
│ │ Network Access (pyodide.http.pyfetch) │ │
│ │ • Localhost only (127.0.0.1) │ │
│ │ • Bearer token authentication required │ │
│ │ • MCP proxy enforces tool allowlist │ │
│ └──────────────┬───────────────────────────────────────┘ │
│ │ │
│ ┌──────────────▼───────────────────────────────────────┐ │
│ │ Injected MCP Functions │ │
│ │ • call_mcp_tool(name, params) │ │
│ │ • discover_mcp_tools(search_terms) │ │
│ │ • get_tool_schema(tool_name) │ │
│ │ • search_tools(query, limit) │ │
│ └──────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Design: Based on Pydantic's mcp-run-python (production-proven).
Phase 1: Setup (Inject MCP Tool Access)
# Executed by Pyodide before user code
import js
import json  # needed for json.dumps below
from pyodide.http import pyfetch
async def call_mcp_tool(tool_name, params):
# Call MCP proxy with bearer auth
response = await pyfetch(
f'http://localhost:{js.PROXY_PORT}',
method='POST',
headers={'Authorization': f'Bearer {js.AUTH_TOKEN}'},
body=json.dumps({'toolName': tool_name, 'params': params})
)
return await response.json()
# Discovery functions also injected

Phase 2: Execute User Code
# User's code runs in sandboxed environment
# Has access to injected functions but not host system
result = await call_mcp_tool('mcp__filesystem__read_file', {...})

WHY Two-Phase?
- Prevents user code from tampering with injection mechanism
- Clear separation of setup vs execution
- Injection happens in trusted context before untrusted code runs
Problem: Pyodide initialization is expensive (~2-3s with npm package).
Solution: Global cached instance shared across executions.
let pyodideCache: PyodideInterface | null = null;
async function getPyodide(): Promise<PyodideInterface> {
if (!pyodideCache) {
console.error('🐍 Initializing Pyodide (first run, ~10s)...');
pyodideCache = await loadPyodide({
indexURL: 'https://cdn.jsdelivr.net/pyodide/v0.26.4/full/',
stdin: () => { throw new Error('stdin disabled for security'); },
});
}
return pyodideCache;
}

Performance:
- First call: ~2-3s initialization (npm package includes files locally)
- Subsequent calls: <100ms (cache hit)
- Memory overhead: ~20MB (WASM module + Python runtime)
| Boundary | Enforcement | Attack Prevention |
|---|---|---|
| WASM VM | V8 engine | No syscalls, no native code execution |
| Virtual FS | Emscripten | No host file access (/etc/passwd, ~/.ssh) |
| Network | Fetch API + proxy | No external network, only localhost MCP |
| MCP Allowlist | Proxy validation | No unauthorized tool execution |
| Timeout | Promise.race() | No infinite loops, resource exhaustion |
Attack Surface Reduction: 99% vs native Python executor.
Acceptable Limitations:
- Pure Python only - No native C extensions (unless WASM-compiled)
- ✅ Most Python stdlib works (json, asyncio, math, etc.)
- ❌ No numpy, pandas, scikit-learn (unless Pyodide-compiled versions)
- 10-30% slower - WASM overhead
- ✅ Acceptable for security-critical environments
- ✅ Still faster than Docker container startup
- No multiprocessing/threading - Single-threaded WASM
- ✅ Use async/await instead (fully supported)
- 4GB memory limit - WASM 32-bit addressing
- ✅ Sufficient for most scripts
- ❌ Large ML models won't fit
Security Trade-off: Performance cost is acceptable for complete isolation.
Production Usage:
- Pydantic mcp-run-python - Reference implementation
- JupyterLite - Run Jupyter notebooks in browser
- Google Colab - Similar WASM isolation approach
- VS Code Python REPL - Uses Pyodide for in-browser Python
- PyScript - HTML tags powered by Pyodide
Security Review: Gemini 2.0 Flash validation via zen clink (research-specialist agent).
1. AI Agent → executeTypescript(code)
2. Sandbox spawned (Deno subprocess)
3. Code executes: callMCPTool('tool_name', params)
4. Sandbox → HTTP POST localhost:PORT/
5. Proxy validates: Bearer token, rate limit, allowlist
6. Proxy → MCP Client Pool → External MCP Server
7. MCP Server executes tool, returns result
8. Result → Proxy → Sandbox → AI Agent
1. AI Agent → executeTypescript(code with discoverMCPTools())
2. Sandbox executes: discoverMCPTools({ search: ['file'] })
3. Sandbox → HTTP GET localhost:PORT/mcp/tools?q=file
4. Proxy validates: Bearer token, rate limit, query (<100 chars)
5. Proxy → MCP Client Pool.listAllToolSchemas(schemaCache)
6. Client Pool queries all MCP servers in parallel (Promise.all)
7. Schema Cache provides cached schemas (<5ms) or fetches (50ms)
8. Proxy filters by keywords (OR logic, case-insensitive)
9. Proxy audits: { action: 'discovery', searchTerms: ['file'], count: 5 }
10. Result → Sandbox → AI Agent (ToolSchema[] JSON)
1. First discovery call: Cache miss
→ Query MCP servers (50-100ms)
→ Store in LRU cache (in-memory, max 1000 entries)
→ Persist to disk (~/.code-executor/schema-cache.json, AsyncLock)
→ Return schemas
2. Subsequent calls (within 24h): Cache hit
→ Retrieve from LRU cache (<5ms)
→ No network calls
→ Return cached schemas
3. After 24h TTL: Cache expired
→ Re-query MCP servers (background refresh)
→ Update cache
→ Return fresh schemas
4. MCP server failure: Stale-on-error
→ Use expired cache entry (better than failure)
→ Log warning
→ Return stale schemas
Shared Resources Protected:
| Resource | Lock Name | Why Protected | Performance Impact |
|---|---|---|---|
| Schema Cache Disk Writes | schema-cache-write | Prevent file corruption from concurrent updates | Negligible (writes rare, 24h TTL) |
| Audit Log Appends | audit-log-write | Prevent interleaved log entries | Negligible (<1ms lock hold) |
AsyncLock Pattern:
import AsyncLock from 'async-lock';
const lock = new AsyncLock();
// Schema cache writes
await lock.acquire('schema-cache-write', async () => {
await fs.writeFile(cachePath, JSON.stringify(cache));
});
// Audit log appends
await lock.acquire('audit-log-write', async () => {
await fs.appendFile(auditLogPath, logEntry + '\n');
});

| Operation | First Call | Cached Call | Target | Actual (v0.4.0) |
|---|---|---|---|---|
| discoverMCPTools (1 server) | 30ms | <5ms | <50ms | ✅ 30ms / 3ms |
| discoverMCPTools (3 servers) | 50-100ms | <5ms | <100ms P95 | ✅ 60ms / 4ms |
| discoverMCPTools (10 servers) | 80-150ms | <10ms | <150ms P95 | ✅ 120ms / 8ms |
| getToolSchema (specific tool) | 50ms | <5ms | N/A | ✅ Same as discover |
| searchTools (keyword filter) | 50ms | <5ms | N/A | ✅ Same as discover |
Key Optimizations:
- ✅ Parallel queries (Promise.all) → O(1) amortized complexity
- ✅ Schema Cache with 24h TTL → 20× faster (100ms → 5ms)
- ✅ In-memory LRU cache (max 1000 entries) → No disk I/O on hits
- ✅ Disk persistence → Survives restarts, no re-fetching
- ✅ Stale-on-error fallback → Resilient to transient failures
Memory Footprint:
- Schema Cache (in-memory): ~1-2MB (1000 schemas × ~1-2KB each)
- MCP Client connections: ~100KB per server
- Sandbox subprocesses: ~50MB per execution (isolated, cleaned up)
Disk Storage:
- Schema Cache: ~/.code-executor/schema-cache.json (~500KB-1MB)
- Audit Logs: ~/.code-executor/audit-logs/*.jsonl (append-only, rotated daily)
Problem: Exposing all MCP tool schemas exhausts context budget.
Decision: Hide tools behind code execution, load on-demand.
Trade-offs:
- ✅ Benefit: 98% token reduction (141k → 1.6k)
- ✅ Benefit: Zero context overhead for unused tools
- ❌ Cost: Two-step process (discover → execute)
- ✅ Mitigation (v0.4.0): Single round-trip workflow (discover + execute in one call)
Problem: Sequential MCP queries scale linearly (3 servers = 3× latency).
Decision: Query all MCP servers in parallel using Promise.all.
Trade-offs:
- ✅ Benefit: O(1) amortized latency (max of all queries, not sum)
- ✅ Benefit: Meets <100ms P95 target for 3 servers
- ❌ Cost: More complex error handling (partial failures)
- ✅ Mitigation: Resilient aggregation (one failure doesn't block others)
Problem: Slow MCP servers cause AI agents to hang indefinitely.
Decision: 500ms timeout on sandbox→proxy discovery calls.
Trade-offs:
- ✅ Benefit: Fast fail (AI agent gets immediate feedback)
- ✅ Benefit: Allows parallel queries (100ms + 400ms network/overhead)
- ❌ Cost: May time out on legitimately slow setups (e.g., 10+ MCP servers queried at once)
- ✅ Mitigation: Clear error message guides retry, stale-on-error fallback
Problem: AI agents stuck without knowing what tools exist.
Decision: Discovery bypasses allowlist, execution still enforced.
Trade-offs:
- ✅ Benefit: AI agents can self-discover tools (no manual docs)
- ✅ Benefit: Read-only metadata, no execution without allowlist
- ❌ Risk: Information disclosure (tool names/descriptions visible)
- ✅ Mitigation: Two-tier security (discovery=read, execution=write), auth + rate limit + audit log
Risk Assessment: LOW - tool schemas are non-sensitive metadata, no code execution without allowlist enforcement.
Problem: Querying MCP servers on every discovery call wastes 50-100ms.
Decision: Disk-persisted LRU cache with 24h TTL.
Trade-offs:
- ✅ Benefit: 20× faster (100ms → 5ms) on cache hits
- ✅ Benefit: Survives server restarts (disk persistence)
- ❌ Cost: Stale schemas if MCP servers update within 24h
- ✅ Mitigation: Smart refresh on validation failures, manual cache clear available
Purpose: Prevent cascade failures when MCP servers hang or fail repeatedly.
Implementation: Opossum library wrapping MCP client pool calls
State Machine:
CLOSED (Normal Operation)
↓ 5 consecutive failures
OPEN (Fail Fast - 30s cooldown)
↓ After 30s timeout
HALF-OPEN (Test with 1 request)
↓ Success → CLOSED | Failure → OPEN
Configuration:
- Failure Threshold: 5 consecutive failures
- Cooldown Period: 30 seconds
- Half-Open Test: 1 request
WHY 5 failures?
- Low enough to detect problems quickly
- High enough to avoid false positives from transient errors
- Balances responsiveness with stability
WHY 30s cooldown?
- Kubernetes default terminationGracePeriodSeconds is 30s
- AWS ALB deregistration delay is also 30s default
- Allows time for failing server to recover or be replaced
Metrics Exposed:
- circuit_breaker_state (gauge): 0=closed, 1=open, 0.5=half-open
- circuit_breaker_failures_total (counter): Total failures per server
Example:
// Circuit breaker wraps MCP client pool calls
const breaker = new CircuitBreakerFactory({
failureThreshold: 5,
resetTimeout: 30000,
});
// Fails fast when circuit open (no waiting on broken server)
try {
const result = await breaker.callTool('mcp__server__tool', params);
} catch (error) {
if (error.message.includes('circuit open')) {
// Handle gracefully - server is known to be down
}
}

Purpose: Add request queueing and backpressure when connection pool reaches capacity.
Implementation: FIFO queue with timeout-based expiration and AsyncLock protection
Architecture:
MCP Request → Check Pool Capacity
↓ Pool under capacity (< 100 concurrent)
Execute Immediately
↓ Pool at capacity (≥ 100 concurrent)
Enqueue Request (max 200 in queue)
↓ Queue full
Return 503 Service Unavailable
↓ Queued successfully
Wait for slot (max 30s timeout)
↓ Timeout exceeded
Return 503 with retry-after hint
↓ Slot available
Dequeue and execute
Configuration:
- Pool Capacity: 100 concurrent requests (configurable via POOL_MAX_CONCURRENT)
- Queue Size: 200 requests (configurable via POOL_QUEUE_SIZE)
- Queue Timeout: 30 seconds (configurable via POOL_QUEUE_TIMEOUT_MS)
WHY 100 concurrent requests?
- Balances throughput vs MCP server resource consumption
- Most MCP servers handle 100 concurrent requests comfortably
- Configurable for tuning based on actual MCP server capacity
WHY 200 queue size?
- Provides 2× buffer beyond concurrency limit
- Balances memory usage (~40KB at 200 requests) vs utility
- More conservative than Nginx default (512)
WHY 30s timeout?
- Reasonable wait time for legitimate traffic
- Prevents queue from filling with stale requests
- Matches circuit breaker cooldown (30s recovery window)
Metrics Exposed:
- pool_active_connections (gauge): Current concurrent requests
- pool_queue_depth (gauge): Number of requests waiting in queue
- pool_queue_wait_seconds (histogram): Time spent waiting (buckets: 0.1s-30s)
Example:
// Pool automatically queues when at capacity
const pool = new MCPClientPool({
maxConcurrent: 100,
queueSize: 200,
queueTimeoutMs: 30000,
});
// Request queued if pool full, executed when slot available
try {
const result = await pool.callTool('mcp__tool', params);
} catch (error) {
if (error.message.includes('Service Unavailable')) {
// Queue full or timeout - implement retry logic
}
}

Circuit Breaker + Queue:
Request → Circuit Breaker Check
↓ Circuit OPEN
Fail Fast (no queue)
↓ Circuit CLOSED/HALF-OPEN
Check Pool Capacity
↓ Under capacity
Execute immediately
↓ At capacity
Enqueue (with timeout)
Benefits:
- Circuit breaker prevents queueing requests to known-bad servers
- Queue provides graceful degradation under load
- Combined: Fast failure for broken servers, queueing for healthy ones
Failure Modes:
- MCP Server Down: Circuit breaker opens → immediate 503 (no queueing)
- MCP Server Slow: Queue fills → 503 after 30s timeout
- High Load: Queue drains as capacity frees → requests succeed with delay
HTTP Status Codes:
- 200 OK - Request succeeded (no backpressure)
- 429 Too Many Requests - Rate limit exceeded (per-client limit hit)
- 503 Service Unavailable - Circuit open OR queue full/timeout
Retry Guidance:
503 Circuit Open
Retry-After: 30 (wait for circuit to close)
503 Queue Full
Retry-After: 60 (estimated queue drain time)
503 Queue Timeout
Retry-After: 30 (try again with fresh timeout)
Monitoring:
# Alert on high queue depth
pool_queue_depth > 150 # Queue >75% full
# Alert on frequent circuit opens
rate(circuit_breaker_failures_total[5m]) > 10
# Alert on slow queue processing
histogram_quantile(0.95, pool_queue_wait_seconds) > 15
Latency Overhead:
- Circuit Breaker: <1ms per request (state check)
- Queue Check: <1ms per request (counter comparison)
- Queue Wait: 0-30s (depends on load)
Memory Overhead:
- Circuit Breaker: ~10KB per server (state tracking)
- Connection Queue: ~200 bytes per queued request (max ~40KB)
Total Overhead: Negligible (<0.1% CPU, <1MB RAM)
The CLI setup wizard provides one-command initialization of code-executor-mcp with automatic MCP server discovery, wrapper generation, and daily sync scheduling.
Entry Point: npm run setup → src/cli/index.ts
Design Goal: Zero-config setup with smart defaults, cross-platform support, and idempotent operation.
┌─────────────────────────────────────────────────────────────┐
│ CLI Entry Point │
│ (src/cli/index.ts) │
│ • Self-install check (SelfInstaller) │
│ • Lock acquisition (LockFileService) │
│ • Wizard orchestration │
└────────────────┬────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ CLIWizard │
│ (src/cli/wizard.ts) │
│ • Interactive prompts (tool selection, config questions) │
│ • Default config pattern (press Enter to skip) │
│ • Idempotent setup (merge/reset/keep existing configs) │
└────────────┬────────────────────────────────────────────────┘
│
├─────────────────┬──────────────────┬────────────┐
▼ ▼ ▼ ▼
┌──────────────────┐ ┌─────────────────┐ ┌──────────┐ ┌────────────┐
│ ToolDetector │ │ MCPDiscovery │ │ Wrapper │ │ Daily │
│ │ │ Service │ │Generator │ │ Sync │
│ • Detect Claude │ │ • Scan configs: │ │ • TS/Py │ │ • Schedule │
│ Code install │ │ ~/.claude.json│ │ wrapper│ │ setup │
│ • Validate paths │ │ .mcp.json │ │ gen │ │ • Platform │
│ │ │ • Merge servers │ │ • JSDoc │ │ specific │
└──────────────────┘ └─────────────────┘ └──────────┘ └────────────┘
Two-Location Scan Pattern:
// 1. Scan global Claude Code config
const globalServers = await discovery.scanToolConfig({
id: 'claude-code',
configPaths: {
linux: '~/.claude.json',
darwin: '~/.claude.json',
win32: '%USERPROFILE%\\.claude.json'
}
});
// 2. Scan project config
const projectServers = await discovery.scanProjectConfig('.mcp.json');
// 3. Merge (project overrides global for duplicate names)
const mergedServers = mergeMCPServers(globalServers, projectServers);

Path Expansion:
- ~ → os.homedir() (Linux/macOS)
- %USERPROFILE% → process.env.USERPROFILE (Windows)
- %APPDATA% → process.env.APPDATA (Windows)
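These rules reduce to a few string replacements; a sketch with a hypothetical helper name:

```typescript
import * as os from 'os';

// Hypothetical helper implementing the expansion rules above.
function expandConfigPath(configPath: string): string {
  return configPath
    .replace(/^~(?=$|[\\/])/, os.homedir())                    // ~/... on Linux/macOS
    .replace(/%USERPROFILE%/g, process.env.USERPROFILE ?? '')  // Windows home
    .replace(/%APPDATA%/g, process.env.APPDATA ?? '');         // Windows app data
}
```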
Fallback Behavior:
- Config file not found → Prompt user for custom path or skip
- Invalid JSON → Log error, skip tool
- Missing command field → Log warning, skip server
Design: Template-based code generation with schema-driven parameter types.
Templates:
src/cli/templates/
├── typescript-wrapper.hbs # TypeScript wrapper template
└── python-wrapper.hbs # Python wrapper template
Generation Flow:
1. Fetch tool schemas from MCP servers (via schema cache)
2. For each tool:
- Extract name, description, parameters (JSON Schema)
- Generate JSDoc comments from schema
- Generate TypeScript types from JSON Schema
- Render template with Handlebars
3. Write wrappers to output directory
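The rendering step (step 2 above) might look like the following sketch; renderWrapper is a hypothetical helper, and the template path comes from the layout above:

```typescript
import Handlebars from 'handlebars';
import { promises as fs } from 'fs';

interface ToolSchema { name: string; description: string; parameters: unknown; }

// Sketch of the render step: compile the template, render per tool.
async function renderWrapper(tool: ToolSchema): Promise<string> {
  const source = await fs.readFile('src/cli/templates/typescript-wrapper.hbs', 'utf8');
  const template = Handlebars.compile(source); // compile once, render many
  return template({
    name: tool.name,
    description: tool.description, // becomes the JSDoc comment
    parameters: tool.parameters,   // drives the generated TypeScript types
  });
}
```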
Example Output:
// Before (manual)
const file = await callMCPTool('mcp__filesystem__read_file', {
path: '/src/app.ts'
});
// After (wrapper)
import { filesystem } from './mcp-wrappers';
const file = await filesystem.readFile({ path: '/src/app.ts' });

Benefits:
- Type-safe with IntelliSense/autocomplete
- Self-documenting JSDoc from schemas
- No manual tool name lookups
- Matches actual MCP tool APIs
Purpose: Automatically regenerate wrappers when MCP servers change.
Architecture:
┌─────────────────────────────────────────────────────────────┐
│ Platform Scheduler (scheduled job) │
│ • macOS: launchd plist (~/.config/launchd/...) │
│ • Linux: systemd timer (~/.config/systemd/user/...) │
│ • Windows: Task Scheduler (HKCU\Software\Microsoft\...) │
└────────────────┬────────────────────────────────────────────┘
│
▼ (runs at 4-6 AM daily)
┌─────────────────────────────────────────────────────────────┐
│ DailySyncService │
│ (src/cli/daily-sync.ts) │
│ 1. Re-scan configs (~/.claude.json + .mcp.json) │
│ 2. Detect changes (new/removed/modified servers) │
│ 3. Regenerate wrappers if changes detected │
│ 4. Log sync status │
└─────────────────────────────────────────────────────────────┘
Scheduler Implementation:
| Platform | Mechanism | Config Location | Command |
|---|---|---|---|
| macOS | launchd plist | ~/Library/LaunchAgents/com.code-executor.daily-sync.plist | launchctl load/unload |
| Linux | systemd timer | ~/.config/systemd/user/code-executor-daily-sync.timer | systemctl --user enable/disable |
| Windows | Task Scheduler | HKCU\Software\Microsoft\Windows\CurrentVersion\Run | schtasks /create /delete |
Sync Execution:
# Command executed by scheduler
npm run setup -- --sync-only --non-interactive

Sync Logic:
- Reads last sync state from ~/.code-executor/last-sync.json
- Compares current MCP servers with last sync (see the sketch below)
- If changes detected → regenerate wrappers
- Update last sync state
- Exit 0 (success) or 1 (failure)
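The comparison step can be as simple as fingerprinting the server list (a sketch; the MCPServerConfig shape and helper names are illustrative):

```typescript
import { createHash } from 'crypto';

interface MCPServerConfig { name: string; command: string; args?: string[]; }

// Hypothetical helper: order-insensitive fingerprint of the server list,
// so the "compare with last sync" step is one string comparison.
function fingerprint(servers: MCPServerConfig[]): string {
  const canonical = [...servers].sort((a, b) => a.name.localeCompare(b.name));
  return createHash('sha256').update(JSON.stringify(canonical)).digest('hex');
}

function serversChanged(current: MCPServerConfig[], lastSync: MCPServerConfig[]): boolean {
  return fingerprint(current) !== fingerprint(lastSync);
}
```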
Purpose: Prevent concurrent wizard runs (race condition protection).
Implementation:
import { promises as fs } from 'fs';
import * as os from 'os';
import * as path from 'path';

class LockFileService {
  // Note: '~' is not expanded by Node, so build the path explicitly
  private lockPath = path.join(os.homedir(), '.code-executor', 'setup.lock');

  async acquire(): Promise<void> {
    try {
      // 'wx' creates the file atomically and fails if it already exists,
      // avoiding the check-then-write race of a separate existence test
      await fs.writeFile(
        this.lockPath,
        JSON.stringify({ pid: process.pid, timestamp: Date.now() }),
        { flag: 'wx' }
      );
    } catch {
      throw new Error('Setup wizard already running');
    }
  }

  async release(): Promise<void> {
    await fs.unlink(this.lockPath);
  }
}

Protection Against:
- Multiple users running setup simultaneously
- Concurrent daily sync + manual setup
- Race conditions in wrapper file writes
Input Validation:
- MCP server names: [a-zA-Z0-9_-]+ only (no special chars)
- Config paths: No directory traversal (., .., ~/../etc)
- Template variables: Escaped before rendering (XSS prevention)
Dangerous Pattern Detection:
- MCP names with code injection patterns rejected (not escaped)
- Validation happens BEFORE template rendering (defense-in-depth)
- Tests: tests/security/template-injection.test.ts (387 lines)
Privilege Escalation:
- Wizard runs with user privileges (no sudo/admin required)
- Platform schedulers run as current user (not system-wide)
- Lock files in user home directory (no /tmp race conditions)
| Component | Responsibility | Why Separated |
|---|---|---|
| CLIWizard | Interactive prompts, user flow | UI/UX logic separate from business logic |
| ToolDetector | Detect AI tool installations | Tool-specific logic centralized |
| MCPDiscoveryService | Scan configs for MCP servers | Config parsing separate from UI |
| WrapperGenerator | Generate TS/Py wrappers | Code generation separate from discovery |
| DailySyncService | Daily sync orchestration | Scheduling logic separate from setup |
| PlatformScheduler | Platform detection | OS-specific logic encapsulated |
| LockFileService | Concurrent access control | Shared resource protection |
Design Goal: Safe to run npm run setup multiple times without breaking existing config.
Detection Flow:
1. Check for existing config: ~/.code-executor/config.json
2. If exists:
- Prompt user: Merge, Reset, Keep existing
- Merge: Combine old + new MCP servers
- Reset: Delete old, use new config
- Keep: Skip setup, exit
3. If not exists:
- Create new config with defaults
Merge Strategy:
function mergeMCPServers(
  existing: MCPServerConfig[],
  incoming: MCPServerConfig[] // 'new' is a reserved word, so name it 'incoming'
): MCPServerConfig[] {
  const merged = new Map<string, MCPServerConfig>();
  // Add existing servers
  for (const server of existing) {
    merged.set(server.name, server);
  }
  // Override with incoming servers (project overrides global)
  for (const server of incoming) {
    merged.set(server.name, server);
  }
  return Array.from(merged.values());
}

| Operation | First Run | Subsequent Runs | Notes |
|---|---|---|---|
| Tool detection | 50-100ms | <10ms | File system checks |
| MCP discovery | 100-200ms | 50-100ms | Schema cache helps |
| Wrapper generation | 200-500ms | 200-500ms | Template rendering dominant |
| Daily sync | 500ms-1s | 500ms-1s | Full re-scan + regeneration |
Optimization Opportunities:
- Schema cache reduces discovery latency (24h TTL)
- Template caching (compile once, render many)
- Parallel wrapper generation (Promise.all)
- Principle 1 (Progressive Disclosure): Token impact 0% (3 tools maintained, ~560 tokens)
- Principle 2 (Security): Zero tolerance met (auth, rate limit, audit, validation, intentional exception documented)
- Principle 3 (TDD): Red-Green-Refactor followed, 95%+ discovery coverage, 90%+ overall
- Principle 4 (Type Safety): TypeScript strict mode, no any types (use unknown + guards)
- Principle 5 (SOLID): SRP verified (each component single purpose), DIP via abstractions
- Principle 6 (Concurrency): AsyncLock on shared resources (cache writes, audit logs)
- Principle 7 (Fail-Fast): Descriptive errors with schemas, no silent failures
- Principle 8 (Performance): Measurement-driven (<100ms P95 met), parallel queries O(1)
- Principle 9 (Documentation): Self-documenting code, WHY comments, architecture.md complete
- Test Coverage: 95%+ (discovery endpoint), 90%+ (overall), 85%+ (integration)
- Performance: P95 <100ms (3 MCP servers), <5ms cached
- Security: Auth + rate limit + audit log + validation all enforced
- Token Usage: 3 tools, ~560 tokens (within 1.6k budget, 98% reduction maintained)
Release: v1.0.0 (2025-01-20)
Status: Beta
Purpose: Enable LLM-in-the-Loop execution for dynamic reasoning and analysis
MCP Sampling allows sandboxed code (TypeScript/Python) to invoke Claude during execution through simple helpers (llm.ask(), llm.think()). This enables "Claude asks Claude" scenarios for multi-step reasoning, code analysis, and data processing.
┌─────────────────────────────────────────────────────────────┐
│ AI Agent (Claude/Cursor) │
│ │
│ 1. Send code with enableSampling: true │
└─────────────────────────────────────────────────────────────┘
↓ (executeTypescript/executePython)
┌─────────────────────────────────────────────────────────────┐
│ Code Executor MCP Server │
│ │
│ 2. Detect sampling enabled │
│ 3. Start SamplingBridgeServer │
│ - Generate 256-bit bearer token │
│ - Start HTTP server on random port (localhost only) │
│ - Inject llm helpers into sandbox │
└─────────────────────────────────────────────────────────────┘
↓ (Start sandbox with bridge URL + token)
┌─────────────────────────────────────────────────────────────┐
│ Sandbox (Deno/Pyodide) with Injected Helpers │
│ │
│ User Code: │
│ const result = await llm.ask("Analyze this code..."); │
│ ↓ │
│ 4. HTTP POST to bridge: localhost:PORT/sample │
│ Authorization: Bearer <token> │
│ Body: { messages, model, maxTokens, systemPrompt } │
└─────────────────────────────────────────────────────────────┘
↓ (Bearer token validation)
┌─────────────────────────────────────────────────────────────┐
│ SamplingBridgeServer (Security Layer) │
│ │
│ 5. Security Checks (in order): │
│ ✅ Validate Bearer Token (timing-safe comparison) │
│ ✅ Check Rate Limits (10 rounds, 10k tokens max) │
│ ✅ Validate System Prompt (allowlist check) │
│ ✅ Validate Request Schema (AJV deep validation) │
│ ↓ │
│ 6. Forward Request: │
│ ├─ Mode Detection (MCP SDK or Direct API) │
│ ├─ MCP Sampling (free) - if available │
│ └─ Direct Anthropic API (paid) - fallback │
└─────────────────────────────────────────────────────────────┘
↓ (Claude API call)
┌─────────────────────────────────────────────────────────────┐
│ Claude API (Anthropic) │
│ │
│ 7. Process Request: │
│ - Model: claude-sonnet-4-5 (default) │
│ - Response: { content, stop_reason, usage } │
└─────────────────────────────────────────────────────────────┘
↓ (Return response)
┌─────────────────────────────────────────────────────────────┐
│ SamplingBridgeServer (Post-Processing) │
│ │
│ 8. Content Filtering: │
│ ✅ Scan for secrets (OpenAI keys, GitHub tokens, AWS) │
│ ✅ Scan for PII (emails, SSNs, credit cards) │
│ ✅ Redact violations: [REDACTED_SECRET]/[REDACTED_PII] │
│ ↓ │
│ 9. Audit Logging: │
│ ✅ SHA-256 hash of prompt/response (no plaintext) │
│ ✅ Log: timestamp, model, tokens, duration, violations │
│ ✅ Write to: ~/.code-executor/audit-log.jsonl │
│ ↓ │
│ 10. Update Metrics: │
│ - Increment round counter │
│ - Add tokens to cumulative budget │
│ - Calculate quota remaining │
└─────────────────────────────────────────────────────────────┘
↓ (Return filtered response)
┌─────────────────────────────────────────────────────────────┐
│ Sandbox (Continue Execution) │
│ │
│ User Code: │
│ console.log(result); // Claude's filtered response │
│ ↓ │
│ 11. Execution completes, bridge shuts down gracefully │
└─────────────────────────────────────────────────────────────┘
↓ (Return execution result)
┌─────────────────────────────────────────────────────────────┐
│ Code Executor MCP Server │
│ │
│ 12. Return to AI Agent: │
│ { │
│ success: true, │
│ output: "...", │
│ samplingCalls: [...], // Array of all LLM calls │
│ samplingMetrics: { │
│ totalRounds: 2, │
│ totalTokens: 150, │
│ totalDurationMs: 1200, │
│ averageTokensPerRound: 75, │
│ quotaRemaining: { rounds: 8, tokens: 9850 } │
│ } │
│ } │
└─────────────────────────────────────────────────────────────┘
Purpose: Ephemeral HTTP bridge between sandbox and Claude API with security enforcement
Responsibilities:
- Lifecycle Management
  - Start: Generate bearer token, find random port, start HTTP server
  - Stop: Drain active requests (max 5s), close server gracefully
  - Lifecycle: One bridge per execution, destroyed after completion
- Security Enforcement
  - Bearer token validation (timing-safe comparison)
  - Rate limiting (rounds and tokens)
  - System prompt allowlist validation
  - Content filtering (secrets/PII redaction)
- Request Proxying
  - Mode detection: MCP SDK (free) or Direct API (paid)
  - Request forwarding with proper authentication
  - Response filtering and audit logging
Key Methods:
- start(): Promise<{port, authToken}> - Start bridge server
- stop(): Promise<void> - Graceful shutdown with request draining
- getSamplingMetrics(): Promise<SamplingMetrics> - Get current metrics
- handleRequest(req, res) - HTTP request handler (private)
Configuration:
interface SamplingConfig {
enabled: boolean; // Enable/disable sampling
maxRoundsPerExecution: number; // Max LLM calls (default: 10)
maxTokensPerExecution: number; // Max tokens (default: 10,000)
timeoutPerCallMs: number; // Timeout per call (default: 30,000ms)
allowedSystemPrompts: string[]; // Prompt allowlist
contentFilteringEnabled: boolean; // Enable filtering (default: true)
}

Purpose: Prevent infinite loops and resource exhaustion
Implementation:
- Round Counter: Tracks number of sampling calls
- Token Budget: Cumulative token count across all calls
- AsyncLock Protection: Thread-safe counters for concurrent access
- Quota Calculation: Real-time remaining rounds/tokens
Methods:
- async checkLimit(tokensRequested): Promise<{exceeded, metrics}> - Check if request would exceed limits
- async incrementUsage(tokensUsed): Promise<void> - Increment counters after successful call
- async getMetrics(): Promise<{roundsUsed, tokensUsed}> - Get current usage
- async getQuotaRemaining(): Promise<{rounds, tokens}> - Get remaining quota
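A minimal sketch of how these counters can be guarded with AsyncLock (class shape illustrative; the metrics/quota methods are omitted for brevity):

```typescript
import AsyncLock from 'async-lock';

// Illustrative sketch of the budget tracking described above; the real
// class also exposes getMetrics()/getQuotaRemaining().
class SamplingRateLimiter {
  private roundsUsed = 0;
  private tokensUsed = 0;
  private readonly lock = new AsyncLock();

  constructor(
    private readonly maxRounds = 10,     // default: 10 rounds
    private readonly maxTokens = 10_000  // default: 10k tokens
  ) {}

  async checkLimit(tokensRequested: number): Promise<{ exceeded: boolean }> {
    // AsyncLock serializes reads/writes so concurrent calls can't both pass
    return this.lock.acquire('sampling-budget', async () => ({
      exceeded:
        this.roundsUsed + 1 > this.maxRounds ||
        this.tokensUsed + tokensRequested > this.maxTokens,
    }));
  }

  async incrementUsage(tokensUsed: number): Promise<void> {
    await this.lock.acquire('sampling-budget', async () => {
      this.roundsUsed += 1;
      this.tokensUsed += tokensUsed;
    });
  }
}
```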
Test Coverage:
- ✅ T033-T036: Rate limiting tests (10 rounds, 10k tokens, 429 responses)
- ✅ T037: Concurrent access protection (AsyncLock verification)
Purpose: Detect and redact secrets/PII from Claude responses
Patterns Detected:
- Secrets: OpenAI keys (sk-*), GitHub tokens (ghp_*), AWS keys (AKIA*), JWT tokens (eyJ*)
- PII: Emails, SSNs, credit card numbers
Methods:
- scan(content): {violations, filtered} - Detect violations and return redacted content
- filter(content, rejectOnViolation): string - Filter with optional rejection mode
- hasViolations(content): boolean - Quick check for any violations
Redaction Format:
- Secrets: [REDACTED_SECRET]
- PII: [REDACTED_PII]
Test Coverage:
- ✅ T022-T026: Pattern detection tests (98%+ coverage)
- ✅ T115: Secret leakage redaction verification
Purpose: Log all sampling calls for security auditing and compliance
Log Format (JSONL):
{
"timestamp": "2025-01-20T12:00:00.000Z",
"executionId": "exec-123",
"round": 1,
"model": "claude-sonnet-4-5",
"promptHash": "sha256:abc123...",
"responseHash": "sha256:def456...",
"tokensUsed": 75,
"durationMs": 600,
"status": "success",
"contentViolations": [
{ "type": "secret", "pattern": "openai_key", "count": 1 }
]
}

Key Features:
- SHA-256 Hashing: No plaintext secrets in logs
- AsyncLock Protection: Thread-safe concurrent writes
- JSONL Format: One entry per line, easy to parse
- Location: ~/.code-executor/audit-log.jsonl
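A sketch of how such an entry can be built (field names follow the log format above; the helper names are illustrative):

```typescript
import { createHash } from 'crypto';

// Sketch of the hashed audit entry described above.
function hashForAudit(text: string): string {
  return 'sha256:' + createHash('sha256').update(text).digest('hex');
}

function buildAuditEntry(prompt: string, response: string, tokensUsed: number) {
  return {
    timestamp: new Date().toISOString(),
    promptHash: hashForAudit(prompt),     // plaintext never written to disk
    responseHash: hashForAudit(response), // same prompt → same hash (deduplication)
    tokensUsed,
  };
}
```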
Test Coverage:
- ✅ T082-T084: Audit logging tests (13/13 passing)
Simple Query:
const response = await llm.ask("What is 2+2?");
// Returns: "4"Multi-Turn Conversation:
const response = await llm.think({
messages: [
{ role: "user", content: "What is 2+2?" },
{ role: "assistant", content: "4" },
{ role: "user", content: "What about 3+3?" }
],
model: "claude-sonnet-4-5", // Optional
maxTokens: 1000, // Optional
systemPrompt: "", // Optional (must be in allowlist)
stream: false // Optional (not yet supported)
});
// Returns: "6"Simple Query:
response = await llm.ask("What is 2+2?")
# Returns: "4"Multi-Turn Conversation:
response = await llm.think(
messages=[
{"role": "user", "content": "What is 2+2?"},
{"role": "assistant", "content": "4"},
{"role": "user", "content": "What about 3+3?"}
],
model="claude-sonnet-4-5", # Optional
max_tokens=1000, # Optional (snake_case for Python)
system_prompt="", # Optional (must be in allowlist)
stream=False # Optional (not supported in Pyodide)
)
# Returns: "6"| Threat | Likelihood | Impact | Mitigation | Test |
|---|---|---|---|---|
| Infinite loop API cost | High | High | Rate limiting (10 rounds) | T112 ✅ |
| Token exhaustion | Medium | High | Token budget (10k tokens) | T113 ✅ |
| Prompt injection | Medium | Medium | System prompt allowlist | T114 ✅ |
| Secret leakage | Low | Critical | Content filtering + SHA-256 logs | T115 ✅ |
| Timing attacks | Low | Medium | Constant-time comparison | T116 ✅ |
| Unauthorized access | Low | Medium | Bearer token + localhost binding | T014/T011 ✅ |
- Authentication Layer: 256-bit bearer token (unique per execution)
- Rate Limiting Layer: 10 rounds, 10,000 tokens per execution
- Validation Layer: System prompt allowlist, AJV schema validation
- Content Filtering Layer: Secrets/PII redaction before returning
- Audit Layer: SHA-256 hashed logs for forensic analysis
| Metric | Target | Measured | Status |
|---|---|---|---|
| Bridge startup time | <50ms | ~30ms | ✅ PASS |
| Per-call overhead | <100ms | ~60ms | ✅ PASS |
| Memory footprint | <50MB | ~15MB | ✅ PASS |
| Token validation | <10ms | ~5ms | ✅ PASS |
| Content filtering | <50ms | ~15ms | ✅ PASS |
Priority (highest to lowest):
1. Per-execution parameters (enableSampling, maxSamplingRounds, maxSamplingTokens)
2. Environment variables (CODE_EXECUTOR_SAMPLING_ENABLED, CODE_EXECUTOR_MAX_SAMPLING_ROUNDS)
3. Configuration file (~/.code-executor/config.json)
4. Default values (enabled: false, maxRounds: 10, maxTokens: 10,000)
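Resolving one setting through this chain is a straightforward ?? cascade; a sketch with hypothetical helper names:

```typescript
// Hypothetical helpers showing the precedence chain for one setting.
function parseIntOrUndefined(value?: string): number | undefined {
  const n = value === undefined ? NaN : Number.parseInt(value, 10);
  return Number.isNaN(n) ? undefined : n;
}

function resolveMaxRounds(
  perExecution?: number,                      // 1. per-execution parameter
  fileConfig?: { maxSamplingRounds?: number } // 3. ~/.code-executor/config.json
): number {
  return (
    perExecution ??
    parseIntOrUndefined(process.env.CODE_EXECUTOR_MAX_SAMPLING_ROUNDS) ?? // 2. env var
    fileConfig?.maxSamplingRounds ??
    10 // 4. default
  );
}
```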
Mode Detection:
detectSamplingMode(): 'mcp' | 'direct' {
if (this.mcpServer && typeof this.mcpServer.request === 'function') {
return 'mcp'; // MCP SDK available (free)
}
return 'direct'; // Fallback to Direct API (paid)
}

MCP SDK Mode (Free):
- Uses Claude Desktop's MCP SDK for sampling
- No additional API costs
- Requires Claude Desktop with MCP support
Direct API Mode (Paid):
- Uses Anthropic API directly
- Requires ANTHROPIC_API_KEY
- Pay-per-token pricing
User Experience:
- Automatic detection and fallback
- Clear logging of which mode is active
- Same API surface regardless of mode
Detection:
- Checks for /.dockerenv file
- Checks for Docker cgroup signatures in /proc/self/cgroup
Bridge URL Handling:
- Host execution: http://localhost:PORT
- Docker execution: http://host.docker.internal:PORT
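A sketch of the two heuristics combined:

```typescript
import { existsSync, readFileSync } from 'fs';

// Sketch of the detection heuristics listed above.
function isRunningInDocker(): boolean {
  if (existsSync('/.dockerenv')) return true;
  try {
    return readFileSync('/proc/self/cgroup', 'utf8').includes('docker');
  } catch {
    return false; // /proc not available (e.g., macOS/Windows hosts)
  }
}

const bridgeHost = isRunningInDocker() ? 'host.docker.internal' : 'localhost';
```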
Docker Compose Example:
services:
code-executor:
image: aberemia24/code-executor-mcp:1.0.0
environment:
- CODE_EXECUTOR_SAMPLING_ENABLED=true
- ANTHROPIC_API_KEY=${ANTHROPIC_API_KEY}
extra_hosts:
- "host.docker.internal:host-gateway"Total Sampling Tests: 74/74 passing (100%)
| Component | Tests | Status |
|---|---|---|
| Bridge Server | 15/15 | ✅ PASS |
| Content Filter | 8/8 | ✅ PASS |
| TypeScript API | 4/4 | ✅ PASS |
| Python API | 3/3 | ✅ PASS |
| Config Schema | 23/23 | ✅ PASS |
| Audit Logging | 13/13 | ✅ PASS |
| Security Attacks | 8/8 | ✅ PASS |
Key Tests:
- T010-T016: Bridge server lifecycle (startup, shutdown, token validation)
- T022-T026: Content filtering (secrets, PII detection and redaction)
- T033-T037: Rate limiting (rounds, tokens, concurrent access)
- T044-T047: System prompt allowlist validation
- T053-T056: TypeScript sampling API
- T063-T066: Python sampling API
- T082-T084: Audit logging with SHA-256 hashes
- T112-T116: Security attack tests (infinite loop, token exhaustion, prompt injection, secret leakage, timing attacks)
Why Ephemeral Bridge Server?
- Security: Unique bearer token per execution prevents cross-execution attacks
- Isolation: Localhost binding ensures no external access
- Lifecycle: Bridge destroyed after execution, no lingering processes
Why Rate Limiting?
- Cost Control: Prevent infinite loops from causing API cost explosions
- Resource Management: Prevent token exhaustion from overwhelming Claude API
- User Protection: Default limits protect users from accidental abuse
Why Content Filtering?
- Secret Protection: Prevent API keys, tokens, credentials from leaking into logs
- Compliance: PII redaction helps meet privacy regulations (GDPR, CCPA)
- Defense-in-Depth: Even if Claude accidentally generates secrets, they're redacted
Why System Prompt Allowlist?
- Prompt Injection Defense: Prevents attackers from bypassing security via custom system prompts
- Controlled Behavior: Ensures Claude operates within intended parameters
- Auditability: Limited set of prompts makes behavior predictable
Why SHA-256 Audit Logs?
- Forensics: Enable investigation of security incidents without exposing secrets
- Deduplication: Same prompt = same hash, enables pattern detection
- Compliance: Meets audit requirements without storing plaintext data
Document Version: 1.2.0 (Added MCP Sampling Architecture for v1.0.0)
Contributors: Alexandru Eremia
Last Review: 2025-11-19