A production-ready LLM proxy gateway written in Go with enterprise features
- Overview
- Supported Architecture
- Supported Providers
- Key Features
- Status
- Use Cases
- Architecture
- Quick Start
- API Standard
- API Reference
- Tech Stack
- Project Structure
- Configuration
- Chat Client
- Conversation Management
- Observability
- Circuit Breakers
- Azure OpenAI
- Azure Anthropic
- Admin Web UI
- Deployment
- Authentication
- Production Features
- Roadmap
- Documentation
- Contributing
- License
A production-ready LLM proxy gateway written in Go that provides a unified API interface for multiple LLM providers. Similar to LiteLLM, but built natively in Go using each provider's official SDK with enterprise features including rate limiting, circuit breakers, observability, and authentication.
latticelm currently supports one deployment architecture:
- Gateway + PostgreSQL + Redis
- PostgreSQL stores providers/models config, users, conversations, and usage data
- Redis is used for distributed rate limiting
There are no alternative storage/runtime backend modes in the pre-release product contract.
- OpenAI (GPT models)
- Azure OpenAI (Azure-deployed OpenAI models)
- Anthropic (Claude models)
- Azure Anthropic (Microsoft Foundry-hosted Claude models)
- Google Generative AI (Gemini models)
- Vertex AI (Google Cloud-hosted Gemini models)
Instead of managing multiple SDK integrations in your application, call one endpoint and let the gateway handle provider-specific implementations.
Client Request
↓
latticelm (unified API)
↓
├─→ OpenAI SDK
├─→ Azure OpenAI (OpenAI SDK + Azure auth)
├─→ Anthropic SDK
├─→ Google Gen AI SDK
└─→ Vertex AI (Google Gen AI SDK + GCP auth)
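The payoff is that client code stays provider-agnostic. A minimal Go sketch of a caller (request shape per the API reference below; endpoint and port from the Quick Start):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

func ask(model, prompt string) (string, error) {
	// Same request body regardless of provider; the gateway routes by
	// model name (gpt-* -> OpenAI, claude-* -> Anthropic, gemini-* -> Google).
	body, _ := json.Marshal(map[string]any{
		"model": model,
		"input": []map[string]any{{
			"role":    "user",
			"content": []map[string]string{{"type": "input_text", "text": prompt}},
		}},
	})
	resp, err := http.Post("http://localhost:8080/v1/responses", "application/json", bytes.NewReader(body))
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	raw, _ := io.ReadAll(resp.Body)
	return string(raw), nil
}

func main() {
	out, err := ask("gpt-4o-mini", "Hello!") // or "claude-3-5-sonnet": same client code
	if err != nil {
		panic(err)
	}
	fmt.Println(out)
}
```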
- Single API interface for multiple LLM providers
- Native Go SDKs for optimal performance and type safety
- Provider abstraction - switch providers without changing client code
- Streaming support - Server-Sent Events for all providers
- Conversation tracking - Efficient context management with previous_response_id
- Circuit breakers - Automatic failure detection and recovery per provider
- Rate limiting - Per-IP token bucket algorithm with configurable limits
- OAuth2/OIDC authentication - Support for Google, Auth0, and any OIDC provider
- Observability - Prometheus metrics and OpenTelemetry tracing
- Health checks - Kubernetes-compatible liveness and readiness endpoints
- Admin Web UI - Built-in dashboard for monitoring and configuration
- Easy setup - YAML configuration with environment variable overrides
- Postgres + Redis - PostgreSQL for data, Redis for rate limiting
- Applications that need multi-provider LLM support
- Cost optimization (route to cheapest provider for specific tasks)
- Failover and redundancy (fallback to alternative providers)
- A/B testing across different models
- Centralized LLM access for microservices
Production Ready - All core features implemented and tested.
✅ All providers use official Go SDKs:
- OpenAI → github.com/openai/openai-go/v3
- Azure OpenAI → github.com/openai/openai-go/v3 (with Azure auth)
- Anthropic → github.com/anthropics/anthropic-sdk-go
- Azure Anthropic → github.com/anthropics/anthropic-sdk-go (with Azure auth)
- Google Gen AI → google.golang.org/genai
- Vertex AI → google.golang.org/genai (with GCP auth)
✅ Provider auto-selection (gpt→OpenAI, claude→Anthropic, gemini→Google; sketched after this list)
✅ Streaming responses (Server-Sent Events)
✅ Conversation tracking with previous_response_id
✅ OAuth2/OIDC authentication
✅ Rate limiting with token bucket algorithm
✅ Circuit breakers for fault tolerance
✅ Observability (Prometheus metrics + OpenTelemetry tracing)
✅ Health & readiness endpoints
✅ Admin Web UI dashboard
✅ Terminal chat client (Python with Rich UI)
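The auto-selection above boils down to prefix routing on the model name. A hypothetical sketch of the idea (the gateway also honors the explicit models mapping in its configuration):

```go
package router

import "strings"

// providerFor maps a model name to a provider by prefix. Illustrative only;
// an explicitly configured model -> provider entry takes precedence.
func providerFor(model string) string {
	switch {
	case strings.HasPrefix(model, "gpt"):
		return "openai"
	case strings.HasPrefix(model, "claude"):
		return "anthropic"
	case strings.HasPrefix(model, "gemini"):
		return "google"
	}
	return "" // unknown prefix: fall back to the configured models table
}
```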
- Docker with Compose (docker compose)
# 1. Clone the repository
git clone https://github.com/yourusername/latticelm.git
cd latticelm
# 2. Create env file
cp .env.example .env
# 3. Set required bootstrap values in .env
# - ENCRYPTION_KEY (openssl rand -base64 32)
# - DATABASE_URL=postgres://latticelm:latticelm@postgres:5432/latticelm?sslmode=disable
# - RATE_LIMIT_REDIS_URL=redis://redis:6379/0
# 4. Start the canonical stack (gateway + postgres + redis)
docker compose up -d --build
# 5. Verify health
curl http://localhost:8080/health
curl http://localhost:8080/ready
# Gateway starts on http://localhost:8080
# Web UI available at http://localhost:8080/

Non-streaming request:
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-4o-mini",
    "input": [
      {
        "role": "user",
        "content": [{"type": "input_text", "text": "Hello!"}]
      }
    ]
  }'

Streaming request:
curl -X POST http://localhost:8080/v1/responses \
  -H "Content-Type: application/json" \
  -N \
  -d '{
    "model": "claude-3-5-sonnet-20241022",
    "stream": true,
    "input": [
      {
        "role": "user",
        "content": [{"type": "input_text", "text": "Write a haiku about Go"}]
      }
    ]
  }'

Run backend and frontend separately for live reloading:
# Terminal 1: Backend with auto-reload
make dev-backend
# Terminal 2: Frontend dev server
make dev-frontend

Frontend runs on http://localhost:5173 with hot module replacement.
This gateway implements the Open Responses specification — an open-source, multi-provider API standard for LLM interfaces based on OpenAI's Responses API.
Why Open Responses:
- Multi-provider by default - one schema that maps cleanly across providers
- Agentic workflow support - consistent streaming events, tool invocation patterns, and "items" as atomic units
- Extensible - stable core with room for provider-specific features
By following the Open Responses spec, this gateway ensures:
- Interoperability across different LLM providers
- Standard request/response formats (messages, tool calls, streaming)
- Compatibility with existing Open Responses tooling and ecosystem
For full specification details, see: https://www.openresponses.org
POST /v1/responses: Create a model response (streaming or non-streaming).
Request body:
{
  "model": "gpt-4o-mini",
  "stream": false,
  "input": [
    {
      "role": "user",
      "content": [{"type": "input_text", "text": "Hello!"}]
    }
  ],
  "previous_response_id": "optional-conversation-id",
  "provider": "optional-explicit-provider"
}

Response (non-streaming):
{
  "id": "resp_abc123",
  "object": "response",
  "model": "gpt-4o-mini",
  "provider": "openai",
  "output": [
    {
      "role": "assistant",
      "content": [{"type": "text", "text": "Hello! How can I help you?"}]
    }
  ],
  "usage": {
    "input_tokens": 10,
    "output_tokens": 8
  }
}

Response (streaming):
Server-Sent Events with data: {...} lines containing deltas.
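One way to consume that stream from Go, assuming only the standard `data: {...}` framing described above:

```go
package main

import (
	"bufio"
	"bytes"
	"fmt"
	"log"
	"net/http"
	"strings"
)

func main() {
	body := []byte(`{"model":"gpt-4o-mini","stream":true,` +
		`"input":[{"role":"user","content":[{"type":"input_text","text":"Hi"}]}]}`)
	resp, err := http.Post("http://localhost:8080/v1/responses", "application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	// SSE frames arrive as "data: {...}" lines separated by blank lines.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		if payload, ok := strings.CutPrefix(scanner.Text(), "data: "); ok {
			fmt.Println(payload) // one JSON delta event per data line
		}
	}
}
```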
List available models.
Response:
{
  "object": "list",
  "data": [
    {"id": "gpt-4o-mini", "provider": "openai"},
    {"id": "claude-3-5-sonnet", "provider": "anthropic"},
    {"id": "gemini-1.5-flash", "provider": "google"}
  ]
}

GET /health: Liveness probe (always returns 200 if the server is running).
Response:
{
  "status": "healthy",
  "timestamp": 1709438400
}

GET /ready: Readiness probe (checks conversation store and providers).
Response:
{
  "status": "ready",
  "timestamp": 1709438400,
  "checks": {
    "conversation_store": "healthy",
    "providers": "healthy"
  }
}

Returns 503 if any check fails.
Additional endpoints expose the web dashboard (when the web UI is enabled), system information, detailed health status, the current configuration (secrets masked), and Prometheus metrics (when observability is enabled).
- Language: Go
- API Specification: Open Responses
- Official SDKs:
  - google.golang.org/genai (Google Generative AI & Vertex AI)
  - github.com/anthropics/anthropic-sdk-go (Anthropic & Azure Anthropic)
  - github.com/openai/openai-go/v3 (OpenAI & Azure OpenAI)
- Observability:
  - Prometheus for metrics
  - OpenTelemetry for distributed tracing
- Resilience:
  - Circuit breakers via github.com/sony/gobreaker
  - Token bucket rate limiting
- Transport: RESTful HTTP with Server-Sent Events for streaming
latticelm/
├── cmd/gateway/              # Main application entry point
├── internal/
│   ├── admin/                # Web UI backend and embedded frontend
│   ├── api/                  # Open Responses types and validation
│   ├── auth/                 # OAuth2/OIDC authentication
│   ├── config/               # YAML configuration loader
│   ├── conversation/         # Conversation tracking and storage
│   ├── logger/               # Structured logging setup
│   ├── metrics/              # Prometheus metrics
│   ├── providers/            # Provider implementations
│   │   ├── anthropic/
│   │   ├── azureanthropic/
│   │   ├── azureopenai/
│   │   ├── google/
│   │   ├── openai/
│   │   └── vertexai/
│   ├── ratelimit/            # Rate limiting implementation
│   ├── server/               # HTTP server and handlers
│   └── tracing/              # OpenTelemetry tracing
├── ui/                       # Vue.js Web UI
├── k8s/                      # Kubernetes manifests
├── tests/                    # Integration tests
├── config.example.yaml       # Example configuration
├── Makefile                  # Build and development tasks
└── README.md
The gateway uses a YAML configuration file with support for environment variable overrides.
server:
  address: ":8080"
  max_request_body_size: 10485760  # 10MB

logging:
  format: "json"   # or "text" for development
  level: "info"    # debug, info, warn, error

# Configure providers (API keys can use ${ENV_VAR} syntax)
providers:
  openai:
    type: "openai"
    api_key: "${OPENAI_API_KEY}"
  anthropic:
    type: "anthropic"
    api_key: "${ANTHROPIC_API_KEY}"
  google:
    type: "google"
    api_key: "${GOOGLE_API_KEY}"

# Map model names to providers
models:
  - name: "gpt-4o-mini"
    provider: "openai"
  - name: "claude-3-5-sonnet"
    provider: "anthropic"
  - name: "gemini-1.5-flash"
    provider: "google"

# Rate limiting
rate_limit:
  enabled: true
  requests_per_second: 10
  burst: 20

# Authentication
auth:
  enabled: true
  issuer: "https://accounts.google.com"
  audience: "your-client-id.apps.googleusercontent.com"

# Observability
observability:
  enabled: true
  metrics:
    enabled: true
    path: "/metrics"
  tracing:
    enabled: true
    service_name: "llm-gateway"
    exporter:
      type: "otlp"
      endpoint: "localhost:4317"

# Conversation storage
conversations:
  enabled: true
  ttl: "1h"

# Web UI
ui:
  enabled: true

See config.example.yaml for complete configuration options with detailed comments.
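The ${ENV_VAR} substitution can be pictured with the standard library's os.Expand. A minimal sketch of the pattern, not the gateway's actual loader:

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	raw := `api_key: "${OPENAI_API_KEY}"`
	// os.Expand substitutes ${VAR} (and $VAR) using the callback below.
	expanded := os.Expand(raw, func(key string) string {
		return os.Getenv(key) // empty string if the variable is unset
	})
	fmt.Println(expanded)
}
```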
Interactive terminal chat interface with a polished UI, built in Python with the Rich library:
# Basic usage
uv run chat.py
# With authentication
uv run chat.py --token "$(gcloud auth print-identity-token)"
# Switch models on the fly
You> /model claude
You> /models        # List all available models

Features:
- Syntax highlighting for code blocks
- Markdown rendering for formatted responses
- Model switching on the fly with the /model command
- Conversation history with automatic previous_response_id tracking
- Streaming responses with real-time display
The chat client uses PEP 723 inline script metadata, so uv run automatically installs dependencies.
The gateway implements efficient conversation tracking using previous_response_id from the Open Responses spec:
- 📉 Reduced token usage - Only send new messages, not full history
- ⚡ Smaller requests - Less bandwidth and faster responses
- 🧠 Server-side context - Gateway maintains conversation state
- ⏰ Auto-expire - Conversations expire after configurable TTL (default: 1 hour)
Conversations are stored in PostgreSQL (the same database used for gateway configuration):
conversations:
  enabled: true
  ttl: "1h"   # Conversation expiration

No additional DSN is needed; the gateway reuses the DATABASE_URL connection.
The gateway provides comprehensive observability through Prometheus metrics and OpenTelemetry tracing.
Enable Prometheus metrics to monitor gateway performance:
observability:
  enabled: true
  metrics:
    enabled: true
    path: "/metrics"   # Default endpoint

Available metrics include:
- Request counts and latencies per provider and model
- Error rates and types
- Circuit breaker state changes
- Rate limit hits
- Conversation store operations
Access metrics at http://localhost:8080/metrics (Prometheus scrape format).
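For orientation, per-provider request counters of this kind are typically declared with prometheus/client_golang along these lines (the metric name here is illustrative, not the gateway's exact series):

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Illustrative metric; the gateway's real series names may differ.
var requestsTotal = promauto.NewCounterVec(
	prometheus.CounterOpts{
		Name: "llm_gateway_requests_total",
		Help: "LLM requests handled, labeled by provider and model.",
	},
	[]string{"provider", "model"},
)

// ObserveRequest records one request; a handler would call this per request.
// The /metrics endpoint then serves the series via promhttp.Handler().
func ObserveRequest(provider, model string) {
	requestsTotal.WithLabelValues(provider, model).Inc()
}
```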
Enable OpenTelemetry tracing for distributed request tracking:
observability:
  enabled: true
  tracing:
    enabled: true
    service_name: "llm-gateway"
    sampler:
      type: "probability"     # "always", "never", or "probability"
      rate: 0.1               # Sample 10% of requests
    exporter:
      type: "otlp"            # Send to OpenTelemetry Collector
      endpoint: "localhost:4317"   # gRPC endpoint
      insecure: true          # set to false (use TLS) in production

Traces include:
- End-to-end request flow
- Provider API calls
- Conversation store lookups
- Circuit breaker operations
- Authentication checks
Use with Jaeger, Zipkin, or any OpenTelemetry-compatible backend.
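A sketch of the equivalent OpenTelemetry SDK bootstrap for the config above (assumed wiring, not the gateway's exact code; service_name would normally be attached via a resource):

```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// Setup mirrors the YAML above: OTLP over gRPC to localhost:4317, 10% sampling.
func Setup(ctx context.Context) (*sdktrace.TracerProvider, error) {
	exp, err := otlptracegrpc.New(ctx,
		otlptracegrpc.WithEndpoint("localhost:4317"),
		otlptracegrpc.WithInsecure(), // matches insecure: true; use TLS in production
	)
	if err != nil {
		return nil, err
	}
	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exp),
		sdktrace.WithSampler(sdktrace.TraceIDRatioBased(0.1)),
	)
	otel.SetTracerProvider(tp)
	return tp, nil // caller should defer tp.Shutdown(ctx)
}
```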
The gateway automatically wraps each provider with a circuit breaker for fault tolerance. When a provider experiences failures, the circuit breaker:
- Closed state - Normal operation, requests pass through
- Open state - Fast-fail after threshold reached, returns errors immediately
- Half-open state - Allows test requests to check if provider recovered
Default configuration (per provider):
- Max requests in half-open: 3
- Interval: 60 seconds (resets failure count)
- Timeout: 30 seconds (open → half-open transition)
- Failure ratio: 0.5 (50% failures trips circuit)
Circuit breaker state changes are logged and exposed via metrics.
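Those defaults map onto gobreaker's Settings roughly as follows (a sketch; the minimum-volume guard is illustrative, not documented gateway behavior):

```go
package providers

import (
	"time"

	"github.com/sony/gobreaker"
)

// newBreaker mirrors the documented per-provider defaults.
func newBreaker(name string) *gobreaker.CircuitBreaker {
	return gobreaker.NewCircuitBreaker(gobreaker.Settings{
		Name:        name,
		MaxRequests: 3,                // probes allowed in half-open state
		Interval:    60 * time.Second, // closed-state failure counts reset
		Timeout:     30 * time.Second, // open -> half-open transition
		ReadyToTrip: func(c gobreaker.Counts) bool {
			// Trip at a 50% failure ratio; the >= 10 volume guard is illustrative.
			return c.Requests >= 10 && float64(c.TotalFailures)/float64(c.Requests) >= 0.5
		},
	})
}

// Usage: wrap each provider call so failures are counted and fast-failed.
//   _, err := cb.Execute(func() (interface{}, error) { return client.Do(req) })
```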
The gateway supports Azure OpenAI with the same interface as standard OpenAI:
providers:
  azureopenai:
    type: "azureopenai"
    api_key: "${AZURE_OPENAI_API_KEY}"
    endpoint: "https://your-resource.openai.azure.com"

models:
  - name: "gpt-4o"
    provider: "azureopenai"
    provider_model_id: "my-gpt4o-deployment"  # optional: defaults to name

export AZURE_OPENAI_API_KEY="..."
export AZURE_OPENAI_ENDPOINT="https://your-resource.openai.azure.com"
./gateway

The provider_model_id field lets you map a friendly model name to the actual provider identifier (e.g., an Azure deployment name). If omitted, the model name is used directly.
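Conceptually the lookup is a one-field fallback; a hypothetical helper:

```go
package providers

// ModelConfig is a pared-down, hypothetical version of a model entry.
type ModelConfig struct {
	Name            string
	Provider        string
	ProviderModelID string
}

// providerModelID returns the identifier actually sent to the provider.
func providerModelID(m ModelConfig) string {
	if m.ProviderModelID != "" {
		return m.ProviderModelID // e.g., the Azure deployment name
	}
	return m.Name // default: the friendly model name is used directly
}
```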
The gateway supports Azure-hosted Anthropic models through Microsoft's AI Foundry:
providers:
  azureanthropic:
    type: "azureanthropic"
    api_key: "${AZURE_ANTHROPIC_API_KEY}"
    endpoint: "https://your-resource.services.ai.azure.com/anthropic"

models:
  - name: "claude-sonnet-4-5"
    provider: "azureanthropic"
    provider_model_id: "claude-sonnet-4-5-20250514"  # optional

export AZURE_ANTHROPIC_API_KEY="..."
export AZURE_ANTHROPIC_ENDPOINT="https://your-resource.services.ai.azure.com/anthropic"
./gateway

Azure Anthropic provides Claude models with Azure's compliance, security, and regional deployment options.
The gateway includes a built-in admin web interface for monitoring and configuration.
- System Information - View version, uptime, platform details
- Health Checks - Monitor server, providers, and conversation store status
- Provider Status - View configured providers and their models
- Configuration - View current configuration (with secrets masked)
- Enable in config (ui: enabled: true)
- Build with frontend assets: make build-all
- Access at http://localhost:8080/
Run backend and frontend separately for development:
# Terminal 1: Run backend
make dev-backend
# Terminal 2: Run frontend dev server
make dev-frontend

The frontend dev server runs on http://localhost:5173 and proxies API requests to the backend.
The canonical deployment path is Docker Compose with PostgreSQL + Redis + gateway.
cp .env.example .env
docker compose up -d --build
docker compose ps
docker compose logs -f gateway

The Docker image:
- Uses 3-stage build (frontend → backend → runtime) for minimal size (~50MB)
- Automatically builds and embeds the Web UI
- Runs as non-root user (UID 1000) for security
- Includes health checks for orchestration
- No local Node.js or Go installation required
Production-ready Kubernetes manifests are available in the k8s/ directory:
# Deploy to Kubernetes
kubectl apply -k k8s/
# Or deploy individual manifests
kubectl apply -f k8s/namespace.yaml
kubectl apply -f k8s/deployment.yaml
kubectl apply -f k8s/service.yaml
kubectl apply -f k8s/ingress.yaml

Features included:
- High availability - 3+ replicas with pod anti-affinity
- Auto-scaling - HorizontalPodAutoscaler (3-20 replicas)
- Security - Non-root, read-only filesystem, network policies
- Monitoring - ServiceMonitor and PrometheusRule for Prometheus Operator
- Storage - PostgreSQL for persisted data and Redis for distributed rate limiting
- Ingress - TLS with cert-manager integration
See k8s/README.md for complete deployment guide including:
- Cloud-specific configurations (AWS EKS, GCP GKE, Azure AKS)
- Secrets management (External Secrets Operator, Sealed Secrets)
- Monitoring and alerting setup
- Troubleshooting guide
The gateway supports OAuth2/OIDC authentication for securing API access.
auth:
  enabled: true
  issuer: "https://accounts.google.com"
  audience: "YOUR-CLIENT-ID.apps.googleusercontent.com"

# Get token
TOKEN=$(gcloud auth print-identity-token)
# Make authenticated request
curl -X POST http://localhost:8080/v1/responses \
-H "Authorization: Bearer $TOKEN" \
-H "Content-Type: application/json" \
-d '{"model": "gemini-2.0-flash-exp", ...}'Per-IP rate limiting using token bucket algorithm to prevent abuse and manage load:
Per-IP rate limiting using a token bucket algorithm to prevent abuse and manage load:

rate_limit:
  enabled: true
  requests_per_second: 10   # Max requests per second per IP
  burst: 20                 # Maximum burst size

Features:
- Token bucket algorithm for smooth rate limiting
- Per-IP limiting with support for X-Forwarded-For headers
- Configurable limits for requests per second and burst size
- Automatic cleanup of stale rate limiters to prevent memory leaks
- 429 responses with Retry-After header when limits exceeded
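The in-process shape of this is one token bucket per client IP. A sketch with golang.org/x/time/rate (an assumed library; per the architecture above, the deployed gateway backs limits with Redis for multi-replica setups):

```go
package ratelimit

import (
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// PerIP holds one token bucket per client IP.
type PerIP struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	rps      rate.Limit
	burst    int
}

func NewPerIP(rps float64, burst int) *PerIP {
	return &PerIP{limiters: map[string]*rate.Limiter{}, rps: rate.Limit(rps), burst: burst}
}

func (p *PerIP) allow(ip string) bool {
	p.mu.Lock()
	l, ok := p.limiters[ip]
	if !ok {
		l = rate.NewLimiter(p.rps, p.burst) // token bucket: rps refill, burst capacity
		p.limiters[ip] = l
	}
	p.mu.Unlock()
	return l.Allow()
}

// Middleware returns 429 with Retry-After once a client's bucket is empty.
func (p *PerIP) Middleware(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip := r.RemoteAddr // the real gateway also honors X-Forwarded-For
		if !p.allow(ip) {
			w.Header().Set("Retry-After", "1")
			http.Error(w, "rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		next.ServeHTTP(w, r)
	})
}
```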
Kubernetes-compatible health check endpoints for orchestration and load balancers:
Liveness endpoint (/health):
curl http://localhost:8080/health
# {"status":"healthy","timestamp":1709438400}Readiness endpoint (/ready):
curl http://localhost:8080/ready
# {
#   "status":"ready",
#   "timestamp":1709438400,
#   "checks":{
#     "conversation_store":"healthy",
#     "providers":"healthy"
#   }
# }

The readiness endpoint verifies:
- Conversation store connectivity
- At least one provider is configured
It returns 503 if any check fails.
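A minimal sketch of such a readiness handler (check names follow the response shown above; the check functions are hypothetical, not the gateway's actual handler):

```go
package server

import (
	"encoding/json"
	"net/http"
	"time"
)

// readyHandler reports per-dependency status and 503 if any check fails.
func readyHandler(checkStore, checkProviders func() error) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		checks := map[string]string{"conversation_store": "healthy", "providers": "healthy"}
		status, code := "ready", http.StatusOK
		if err := checkStore(); err != nil {
			checks["conversation_store"], status, code = err.Error(), "not ready", http.StatusServiceUnavailable
		}
		if err := checkProviders(); err != nil {
			checks["providers"], status, code = err.Error(), "not ready", http.StatusServiceUnavailable
		}
		w.Header().Set("Content-Type", "application/json")
		w.WriteHeader(code)
		json.NewEncoder(w).Encode(map[string]any{
			"status": status, "timestamp": time.Now().Unix(), "checks": checks,
		})
	}
}
```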
- ✅ Streaming responses (Server-Sent Events)
- ✅ OAuth2/OIDC authentication
- ✅ Conversation tracking with previous_response_id
- ✅ Persistent conversation storage (PostgreSQL)
- ✅ Circuit breakers for fault tolerance
- ✅ Rate limiting
- ✅ Observability (Prometheus metrics and OpenTelemetry tracing)
- ✅ Admin Web UI
- ✅ Health and readiness endpoints
- ⬜ Tool/function calling support across providers
- ⬜ Request-level cost tracking and budgets
- ⬜ Advanced routing policies (cost optimization, latency-based, failover)
- ⬜ Multi-tenancy with per-tenant rate limits and quotas
- ⬜ Request caching for identical prompts
- ⬜ Webhook notifications for events (failures, circuit breaker changes)
Comprehensive guides and documentation are available in the /docs directory:
- Docker Deployment Guide - Canonical Docker Compose deployment (gateway + PostgreSQL + Redis)
- Kubernetes Deployment Guide - Production deployment with Kubernetes
- Web UI Documentation - Using the web dashboard
- Configuration Reference - All configuration options explained
See the docs directory README for a complete documentation index.
Contributions are welcome! Here's how you can help:
- Bug reports: Include steps to reproduce, expected vs actual behavior, and environment details
- Feature requests: Describe the use case and why it would be valuable
- Security issues: Email security concerns privately (don't open public issues)
- Fork and clone the repository
- Create a branch for your feature: git checkout -b feature/your-feature-name
- Make your changes with clear, atomic commits
- Add tests for new functionality
- Run tests: make test
- Run linter: make lint
- Update documentation if needed
- Submit a pull request with a clear description
- Follow Go best practices and idioms
- Write tests for new features and bug fixes
- Keep functions small and focused
- Use meaningful variable names
- Add comments for complex logic
- Run go fmt before committing
# Run all tests
make test
# Run specific package tests
go test ./internal/providers/...
# Run with coverage
make test-coverage
# Run integration tests (requires API keys)
make test-integration

Adding a new provider:
- Create the provider implementation in internal/providers/yourprovider/
- Implement the Provider interface
- Register the provider in internal/providers/providers.go
- Add configuration support in internal/config/
- Add tests and update documentation
MIT License - see the repository for details.
- Built with official SDKs from OpenAI, Anthropic, and Google
- Inspired by LiteLLM
- Implements the Open Responses specification
- Uses gobreaker for circuit breaker functionality
- Documentation: Check this README and the files in /docs
- Issues: Open a GitHub issue for bugs or feature requests
- Discussions: Use GitHub Discussions for questions and community support
Made with ❤️ in Go