Your MCP server works on your machine. It passes tests. The Inspector shows green. Now you need to deploy it where real users can use it, real load can hit it, and real things can go wrong at 3 AM.
This chapter covers production deployment patterns—from simple single-server setups to enterprise architectures with gateways, registries, and multi-tenant isolation.
The simplest production model: distribute your server as a package and let users run it locally.
```bash
# Users install and run with npx
npx -y @yourorg/mcp-server-whatever
```

Package your server properly in `package.json`:

```json
{
  "name": "@yourorg/mcp-server-whatever",
  "version": "1.0.0",
  "bin": {
    "mcp-server-whatever": "./dist/index.js"
  },
  "files": ["dist/"],
  "engines": {
    "node": ">=18"
  }
}
```

For Python servers, the equivalent is uvx:

```bash
# Users install and run with uvx
uvx mcp-server-whatever
```

Set up `pyproject.toml`:

```toml
[project]
name = "mcp-server-whatever"
version = "1.0.0"
requires-python = ">=3.10"
dependencies = ["mcp>=1.0.0"]

[project.scripts]
mcp-server-whatever = "mcp_server_whatever:main"
```

This model has real advantages:

- Zero infrastructure to manage
- Server runs with the user's permissions (appropriate for local tools)
- No authentication needed
- Updates via package manager

And real limitations:

- Can't share state between users
- Each user runs their own instance
- No centralized monitoring
- Hard to enforce version consistency
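For context, a local client such as Claude Desktop typically launches an npx-distributed server with a config entry along these lines (a sketch; the exact schema depends on the client, and the server name and env var are placeholders):

```json
{
  "mcpServers": {
    "whatever": {
      "command": "npx",
      "args": ["-y", "@yourorg/mcp-server-whatever"],
      "env": { "API_KEY": "..." }
    }
  }
}
```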
For shared, remote servers, deploy as an HTTP service.
An Express server using the Streamable HTTP transport:

```typescript
import express from "express";
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StreamableHTTPServerTransport } from "@modelcontextprotocol/sdk/server/streamableHttp.js";

const app = express();
app.use(express.json());

// Health check endpoint
app.get("/health", (req, res) => {
  res.json({ status: "healthy", version: "1.0.0" });
});

// MCP endpoint
const sessions = new Map();

app.post("/mcp", async (req, res) => {
  const sessionId = req.headers["mcp-session-id"];
  if (!sessionId || !sessions.has(sessionId)) {
    // New session
    const server = new McpServer({ name: "prod-server", version: "1.0.0" });
    // ... register tools ...
    const transport = new StreamableHTTPServerTransport({
      sessionIdGenerator: () => crypto.randomUUID(),
    });
    await server.connect(transport);
    sessions.set(transport.sessionId, { server, transport });
    await transport.handleRequest(req, res, req.body);
  } else {
    // Existing session
    const { transport } = sessions.get(sessionId);
    await transport.handleRequest(req, res, req.body);
  }
});

app.listen(process.env.PORT || 3000);
```

Containerize it with a Dockerfile:

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY dist/ ./dist/
EXPOSE 3000
# node:20-slim ships without curl; use Node's built-in fetch for the health check
HEALTHCHECK CMD node -e "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"
CMD ["node", "dist/index.js"]
```

Or run it with Docker Compose:

```yaml
# docker-compose.yml
services:
  mcp-server:
    build: .
    ports:
      - "3000:3000"
    environment:
      - API_KEY=${API_KEY}
      - DATABASE_URL=${DATABASE_URL}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "node", "-e", "fetch('http://localhost:3000/health').then(r => process.exit(r.ok ? 0 : 1)).catch(() => process.exit(1))"]
      interval: 30s
      timeout: 10s
      retries: 3
```

For Kubernetes, a Deployment plus a Service:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: mcp-server
spec:
  replicas: 3  # in-memory sessions need sticky sessions or a shared store
  selector:
    matchLabels:
      app: mcp-server
  template:
    metadata:
      labels:
        app: mcp-server
    spec:
      containers:
        - name: mcp-server
          image: yourorg/mcp-server:latest
          ports:
            - containerPort: 3000
          env:
            - name: API_KEY
              valueFrom:
                secretKeyRef:
                  name: mcp-secrets
                  key: api-key
          readinessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 3000
            initialDelaySeconds: 15
          resources:
            requests:
              memory: "128Mi"
              cpu: "100m"
            limits:
              memory: "512Mi"
              cpu: "500m"
---
apiVersion: v1
kind: Service
metadata:
  name: mcp-server
spec:
  selector:
    app: mcp-server
  ports:
    - port: 80
      targetPort: 3000
  type: ClusterIP
```

MCP's Streamable HTTP transport is compatible with serverless platforms, especially as the protocol moves toward stateless operation.
On AWS Lambda:

```typescript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { z } from "zod";

export async function handler(event) {
  const server = new McpServer({ name: "lambda-server", version: "1.0.0" });

  // Register tools
  server.tool("process_data", "Process data", { input: z.string() }, async ({ input }) => ({
    content: [{ type: "text", text: `Processed: ${input}` }],
  }));

  // Handle the request
  const body = JSON.parse(event.body);
  // ... process JSON-RPC message and return response
}
```

On Cloudflare Workers:

```typescript
export default {
  async fetch(request: Request): Promise<Response> {
    if (request.method === "POST") {
      const body = await request.json();
      // Handle MCP JSON-RPC request
      const response = await handleMcpRequest(body);
      return new Response(JSON.stringify(response), {
        headers: { "Content-Type": "application/json" },
      });
    }
    return new Response("MCP Server", { status: 200 });
  },
};
```

Keep the usual serverless caveats in mind:

- Cold starts — First request will be slower. Minimize initialization.
- Statelessness — Each invocation is independent. Don't rely on in-memory state.
- Session management — Use external storage (Redis, DynamoDB) for sessions if needed.
- Timeouts — Lambda has a 15-minute max. Most MCP tool calls should be much faster.
- Cost — Pay per invocation. Great for bursty workloads, expensive for constant load.
As MCP deployments grow, organizations need a way to manage, secure, and monitor multiple servers. Enter the MCP gateway.
```
Client ──→ ┌─────────────┐ ──→ Server A (GitHub)
           │   Gateway   │ ──→ Server B (Database)
Client ──→ │             │ ──→ Server C (Monitoring)
           │ • Auth      │
           │ • Routing   │
Client ──→ │ • Rate limit│ ──→ Server D (File Storage)
           │ • Logging   │
           │ • Caching   │
           └─────────────┘
```
A gateway sits between clients and servers, providing:
- Authentication — Verify client identity once, proxy to multiple servers
- Routing — Direct requests to the appropriate backend server
- Rate limiting — Prevent abuse and enforce quotas
- Logging — Centralized audit trail
- Caching — Cache resource reads and tool list responses
- Tool aggregation — Present tools from multiple servers as a single unified server
- Access control — Control which users can access which tools
Several companies offer MCP gateway products:
- Cloudflare has built MCP support into their Workers platform
- Kong and other API gateway vendors are adding MCP support
- Smithery and mcp.run offer hosted MCP server registries with gateway features
For many teams, a simple reverse proxy with authentication is sufficient. You don't need a dedicated MCP gateway until you have many servers, many users, or complex access control requirements.
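The core of that simple approach fits in a few lines: check credentials once, then route namespaced tool calls to the backend that owns them. A sketch (all names, keys, and URLs here are hypothetical):

```python
# Backend MCP servers, keyed by tool namespace (hypothetical URLs).
BACKENDS = {
    "github": "http://github-mcp.internal:3000/mcp",
    "database": "http://db-mcp.internal:3000/mcp",
}

# API key -> user mapping (in production, a real auth provider).
API_KEYS = {"key-abc": "alice"}


def route_tool_call(api_key: str, tool_name: str) -> tuple[str, str]:
    """Resolve a namespaced call like 'github/create_issue' to
    (backend_url, bare_tool_name), rejecting unknown credentials."""
    user = API_KEYS.get(api_key)
    if user is None:
        raise PermissionError("unknown API key")
    namespace, _, bare_name = tool_name.partition("/")
    backend = BACKENDS.get(namespace)
    if backend is None:
        raise LookupError(f"no backend registered for namespace '{namespace}'")
    return backend, bare_name


print(route_tool_call("key-abc", "github/create_issue"))
```

Rate limiting, logging, and caching can each be layered onto this same routing function before the request is forwarded.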
When multiple users share an MCP server, you need tenant isolation.
Identify the user on each request and scope operations:
```python
@mcp.tool()
async def list_documents(ctx: Context) -> str:
    """List the current user's documents."""
    user_id = ctx.request_context.get("user_id")  # From auth middleware
    docs = await db.query("SELECT * FROM documents WHERE owner_id = ?", user_id)
    return format_documents(docs)
```

Alternatively, create isolated server instances per session:

```typescript
app.post("/mcp", async (req, res) => {
  const userId = await authenticateRequest(req);
  const sessionKey = `${userId}:${req.headers["mcp-session-id"]}`;
  if (!sessions.has(sessionKey)) {
    // Create a new server instance scoped to this user
    const server = createServerForUser(userId);
    sessions.set(sessionKey, server);
  }
  await sessions.get(sessionKey).handleRequest(req, res);
});
```

For maximum isolation, give each tenant their own database:

```python
def get_db_for_user(user_id: str) -> Connection:
    return connect(f"postgres://host/{user_id}_db")
```

Once a server is live, track these operational metrics:

- Request rate — Tool calls per second, by tool name
- Latency — P50, P95, P99 for tool execution
- Error rate — Percentage of tool calls that return errors
- Active sessions — Number of connected clients
- Resource usage — CPU, memory, connections per server
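The latency percentiles above can be computed from raw duration samples; a minimal nearest-rank sketch (the sample values are illustrative):

```python
import math


def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile: the smallest sample such that at least
    p percent of all samples are <= it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


# Tool-call durations in milliseconds; note the long tail.
latencies_ms = [12.0, 15.0, 11.0, 250.0, 14.0, 13.0, 16.0, 12.5, 900.0, 13.5]
print(percentile(latencies_ms, 50))  # P50
print(percentile(latencies_ms, 95))  # P95
print(percentile(latencies_ms, 99))  # P99
```

The gap between P50 and P99 is the signal to watch: a healthy median with a huge P99 usually points at a slow external dependency, not the server itself.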
A health check endpoint should report the status of each dependency, not just that the process is running:

```typescript
app.get("/health", async (req, res) => {
  const checks = {
    server: "healthy",
    database: await checkDatabase(),      // assumed to return "healthy" | "unhealthy"
    externalApi: await checkExternalApi(),
    uptime: process.uptime(),
    version: "1.0.0",
  };
  // Unhealthy only if some dependency explicitly reported "unhealthy"
  const isHealthy = Object.values(checks).every((v) => v !== "unhealthy");
  res.status(isHealthy ? 200 : 503).json(checks);
});
```

Use structured logging so every tool call can be traced and aggregated:

```python
import time

import structlog

logger = structlog.get_logger()


@mcp.tool()
async def query_data(sql: str, ctx: Context) -> str:
    logger.info(
        "tool_call",
        tool="query_data",
        sql_length=len(sql),
        session_id=ctx.session_id,
    )
    start = time.time()
    try:
        result = await execute_query(sql)
        duration = time.time() - start
        logger.info(
            "tool_success",
            tool="query_data",
            duration_ms=duration * 1000,
            row_count=len(result),
        )
        return format_result(result)
    except Exception as e:
        duration = time.time() - start
        logger.error(
            "tool_error",
            tool="query_data",
            error=str(e),
            duration_ms=duration * 1000,
        )
        raise
```

For stateless servers, scaling is straightforward—add more instances behind a load balancer. For stateful servers (with sessions), you need one of:
- Sticky sessions — Route requests from the same session to the same instance
- Shared session store — Store session state in Redis/Memcached
- Stateless design — Avoid server-side session state entirely
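The sticky-session option can be as simple as deterministic hashing of the session ID at the load balancer. A sketch (instance names are hypothetical; note that unlike true consistent hashing, resizing the pool with this scheme remaps most sessions):

```python
import hashlib

INSTANCES = ["mcp-0", "mcp-1", "mcp-2"]


def pick_instance(session_id: str) -> str:
    """Map a session to an instance: hash the ID, take it modulo the
    pool size. Every request in a session lands on the same instance."""
    digest = hashlib.sha256(session_id.encode()).digest()
    return INSTANCES[int.from_bytes(digest[:4], "big") % len(INSTANCES)]


print(pick_instance("sess-42"))
```

In practice you would configure this in the load balancer (e.g. hashing on the `Mcp-Session-Id` header) rather than in application code.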
Each stdio connection is a process. Each HTTP connection consumes memory. Plan capacity accordingly:
- stdio: Limit the number of concurrent server processes
- HTTP: Use connection pooling and set reasonable timeouts
- WebSocket/SSE: Monitor open connection counts
Cache aggressively:
- Tool lists change infrequently → cache with TTL
- Resource reads may be cacheable → check freshness with subscriptions
- Prompt templates rarely change → cache indefinitely
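A minimal TTL cache covering the first two cases might look like this (a sketch; a production server would likely add an LRU size bound and locking):

```python
import time


class TTLCache:
    """Tiny time-to-live cache for tool-list and resource-read responses."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self._entries: dict[str, tuple[float, object]] = {}

    def get(self, key: str):
        entry = self._entries.get(key)
        if entry is None:
            return None
        stored_at, value = entry
        if time.monotonic() - stored_at > self.ttl:
            del self._entries[key]  # expired: evict and report a miss
            return None
        return value

    def set(self, key: str, value) -> None:
        self._entries[key] = (time.monotonic(), value)


tool_list_cache = TTLCache(ttl_seconds=60)
tool_list_cache.set("tools/list", ["query_data", "process_data"])
print(tool_list_cache.get("tools/list"))
```

If the server emits `listChanged` notifications, invalidate the corresponding entry on receipt rather than waiting for the TTL to lapse.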
When the server won't start:

- Check logs (stderr for stdio, application logs for HTTP)
- Verify dependencies are installed
- Check environment variables
- Try running manually from the command line
- Check permissions (file access, network, ports)
When latency is high:

- Profile tool execution (is the tool slow or the transport?)
- Check external dependencies (API calls, database queries)
- Look for N+1 query patterns
- Consider caching frequently-requested data
- Check resource contention (CPU, memory, connections)
When you suspect a memory leak:

- Monitor memory usage over time
- Check for unclosed connections or file handles
- Watch for growing collections (session maps, caches without TTL)
- Use profiling tools (Node.js: `--inspect`, Python: `tracemalloc`)
When external dependencies fail:
```python
@mcp.tool()
async def get_data(query: str) -> str:
    try:
        return await primary_source.query(query)
    except ConnectionError:
        try:
            return await cache.get(query)
        except CacheMiss:
            return "Error: Data source temporarily unavailable. Please try again in a few minutes."
```

Production MCP deployment ranges from simple package distribution to complex multi-tenant architectures. Key considerations:
- Local distribution (npm/PyPI) for single-user tools
- HTTP deployment (Docker/K8s/serverless) for shared servers
- Gateways for managing fleets of servers
- Multi-tenant isolation for shared infrastructure
- Monitoring and observability for operational health
- Scaling through horizontal replication and caching
The right architecture depends on your scale, security requirements, and operational maturity. Start simple (local distribution), grow as needed (hosted HTTP), and add complexity (gateways, multi-tenancy) only when you have the problems that justify it.
Next: a tour of the MCP ecosystem.