An end-to-end demonstration of practical semantic caching, real-time streaming, and cost modeling for AI workloads, built with Next.js (App Router) + FastAPI + Redis Stack. The app now ships with a realistic streaming cache pipeline (two parallel columns) and a new interactive Semantic Cache Cost & Sizing Calculator to help you justify Redis spend versus raw LLM costs.
- Azure subscription with Azure OpenAI quota in `eastus2`
- Azure CLI and Azure Developer CLI (`azd`)
- Docker Desktop (for container image builds)
- Logged in to Azure (`az login` or `azd auth login`)
```bash
# 1. Install azd if needed
curl -fsSL https://aka.ms/install-azd.sh | bash

# 2. Authenticate (one time per machine)
azd auth login

# 3. Provision infrastructure and deploy the app
azd up
```

During `azd up`, choose an environment name (e.g. `rg-Redis-AMR-semantic-cache-demo`) and keep the suggested `eastus2` location. The command provisions:
- Azure OpenAI Service with GPT-4o + text-embedding-ada-002 deployments
- Azure Managed Redis Enterprise (Balanced) for semantic cache storage
- Azure Container Apps running the FastAPI backend and Next.js frontend
- Supporting services: Azure Container Apps Environment, Azure Container Registry, Log Analytics, Application Insights, and managed identities with `AcrPull` access
- Iterate on code: `azd deploy` rebuilds/pushes images and updates Container Apps.
- Infrastructure Bicep changes: `azd provision` reapplies infra updates (follow with `azd deploy` if images changed).
- Retrieve connection info: `azd env get-values` or read `.azure/<env>/.env` for URLs, secrets, and resource names.
- Tear everything down: `azd down --force --purge` deletes the resource group and purges OpenAI instances.
Why use `azd up` first? Running `azd provision` by itself leaves the frontend pointing to the fallback container image, which listens on port 80. Our Container App expects port 3000, so the revision gets stuck activating. `azd up` (or `azd provision` followed by `azd deploy`) ensures the real frontend image is built and deployed, preventing port mismatch issues.
- Deployment commands print the frontend and backend URLs; you can also run `azd env get-values` to view `FRONTEND_URI` and `BACKEND_URI`.
- The `.azure/<env>/.env` file mirrors those values and includes Azure credentials for local tooling.
- Logs: `az containerapp logs show --name ca-backend-<token> --resource-group rg-<env>` (and similar for the frontend).
- Telemetry: Application Insights collects backend traces via Azure Monitor OpenTelemetry and frontend telemetry via the web SDK.
- Scaling: Container Apps default to 1-3 replicas sized at 0.5 vCPU / 1 GiB. Adjust `infra/app/backend.bicep` or `frontend.bicep` to tune resources.
- Redis tier: Balanced B0 by default; scale up in the Azure Portal when you outgrow it.
- Lock down networking (VNet integration, private endpoints, WAF) and add authentication (Azure AD).
- Enable Redis persistence/backups and set up alerting in Application Insights.
- Integrate automated CI/CD with `azd pipeline config`, GitHub Actions, or Azure DevOps.
- Raw Column: Direct Azure OpenAI GPT-5 response (baseline, always LLM tokens)
- Cache Column: Ultra-low latency semantic cache path with:
  - Fast path exact prompt hash lookup (no embedding, ~single RTT)
  - Fallback semantic vector search (Redis vector index)
  - Immediate stream of cached content on hit (sub-100ms typical)
  - Automatic async population on miss (stream + background store)
Interactive tool to model:
- Annual token spend (no cache vs with cache)
- Savings vs hit rate curve (dynamic SVG graph)
- Breakeven hit rate vs Redis HA cost
- Automatic Azure Managed Redis SKU selection (subset benchmark table)
- Per-entry memory footprint estimation (vector + payload + overhead); see the sketch below
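The sketch below shows the core arithmetic with made-up inputs; every number is a placeholder for what the real calculator reads from its UI controls, and the per-entry payload/overhead byte counts are rough assumptions:

```python
# All inputs below are hypothetical; the calculator exposes them as controls.
requests_per_year = 10_000_000
tokens_per_request = 1_500             # prompt + completion
price_per_1k_tokens = 0.01             # USD, blended rate
hit_rate = 0.35                        # fraction of requests served from cache
redis_ha_cost_per_year = 6_000         # USD, HA deployment

no_cache_spend = requests_per_year * tokens_per_request / 1_000 * price_per_1k_tokens
with_cache_spend = no_cache_spend * (1 - hit_rate) + redis_ha_cost_per_year
breakeven_hit_pct = redis_ha_cost_per_year / no_cache_spend * 100

# Per-entry footprint: 1536-dim float32 vector + JSON payload + index overhead.
entry_bytes = 1536 * 4 + 2_048 + 512   # payload/overhead sizes are guesses
print(f"annual savings ~= ${no_cache_spend - with_cache_spend:,.0f}")
print(f"breakeven hit rate ~= {breakeven_hit_pct:.1f}%")
print(f"per-entry footprint ~= {entry_bytes / 1024:.1f} KiB")
```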
- Cache hit/miss metadata (similarity, distance, thresholds)
- Underlying Redis query JSON (debug surface)
- Embedding + doc id (for hits), truncated client side
- Toggle with the header "Hover: On/Off" control
- Real-time token, latency & cost tracking (Redis streams)
- Distinguishes LLM tokens vs tokens saved from cache
- Streaming event types: `column_start`, `content_chunk`, `column_complete`, `stream_complete` (see the sketch below)
- Cache column emits an enriched `cache_document` payload for hits & misses (uniform UI handling)
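A minimal sketch of how a FastAPI backend can emit these event types over Server-Sent Events; the payload field names (`type`, `column`, `content`) and the fixed event order are illustrative, not the backend's exact schema:

```python
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(event: dict) -> str:
    # One Server-Sent Events frame per JSON event.
    return f"data: {json.dumps(event)}\n\n"

@app.post("/ask/stream")
async def ask_stream(payload: dict):
    async def gen():
        for column in ("raw", "cache"):
            yield sse({"type": "column_start", "column": column})
        # Illustrative only: the real pipeline interleaves LLM chunks and
        # cache-hit content concurrently across both columns.
        yield sse({"type": "content_chunk", "column": "cache", "content": "..."})
        yield sse({"type": "column_complete", "column": "cache"})
        yield sse({"type": "stream_complete"})
    return StreamingResponse(gen(), media_type="text/event-stream")
```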
- Prompt Hash Fast Path: MD5 of normalized prompt → O(1) lookup prior to embedding (see the lookup sketch below)
- Vector Similarity: 1536-d embedding (OpenAI ada-style) search via a Redis Search index
- Dual Threshold Support: Similarity & cosine distance (configurable via env)
- Immediate Hit Return: Sends cached body + metadata before the raw column finishes its first tokens
- Async Population: Miss path stores generated content after streaming completes
- Graceful Degradation: Raw column still operates when the cache/vector index is missing
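A minimal sketch of that lookup order using redis-py with RediSearch, assuming the `q_idx` index name and 0.70 threshold from the environment table below; the `cache:` key prefix, the `embedding`/`response` field names, and the `embed` callable are illustrative:

```python
import hashlib

import numpy as np
import redis
from redis.commands.search.query import Query

r = redis.Redis(decode_responses=True)

def lookup(prompt: str, embed, threshold: float = 0.70):
    # 1) Fast path: O(1) exact lookup keyed on the MD5 of the normalized prompt.
    normalized = " ".join(prompt.lower().split())
    key = "cache:" + hashlib.md5(normalized.encode()).hexdigest()
    doc = r.json().get(key)
    if doc:
        return doc["response"], "exact"

    # 2) Fallback: KNN search over the vector index (cosine distance).
    vec = np.array(embed(prompt), dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("dist", "response")
        .dialect(2)
    )
    res = r.ft("q_idx").search(query, query_params={"vec": vec})
    if res.docs and 1 - float(res.docs[0].dist) >= threshold:
        return res.docs[0].response, "semantic"
    return None, "miss"
```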
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Next.js UI    │     │     FastAPI     │     │   Redis Stack   │
│                 │     │     Backend     │     │                 │
│ • Chat Interface│────►│ • Semantic Cache│────►│ • Vector Search │
│ • Metrics View  │     │ • Streaming API │     │ • JSON Storage  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  Azure OpenAI   │
                        │  (Embeddings &  │
                        │   GPT-5 Gen)    │
                        └─────────────────┘
```
- Open the frontend URL reported by `azd deploy` or `azd env get-values` (the `FRONTEND_URI`)
- Enter a prompt; both columns start streaming almost immediately
- If a prior semantically similar (or identical) prompt exists, you'll see a near-instant cache column answer
- Hover over the cache result (if hover is enabled) to inspect metadata & the Redis query JSON
- Compare tokens vs tokens saved, latency, and cost
- Click the calculator icon in the header or go to `/calculator`
- Adjust hit rate, traffic, token sizes, TTL, and pricing
- Review the savings vs hit rate curve & automatic Redis SKU selection
- Use the breakeven % to justify enabling semantic cache in your environment
- Token Usage: Track LLM vs cached tokens
- Cost Analysis: Monitor spending across different approaches
- Cache Hit Rates: Measure semantic cache effectiveness
- Latency Comparison: Compare response times
- `POST /ask` - One-shot generation (raw + cache + personalized objects returned)
- `POST /ask/stream` - Real-time streaming for the two active columns (raw & cache; client example below)
- `GET /metrics` - Aggregated metrics (legacy + compatibility)
- `GET /cache/stats` - Cache statistics (hit counts, sizes)
- `POST /cache/search` - Search the semantic cache (text or embedding)
- `POST /cache/store` - Store a document manually
- `DELETE /cache/clear` - Clear cache entries (pattern based)
- `POST /vector/search` - Vector similarity search (top_k, threshold)
- `POST /vector/add` - Manually index arbitrary content
- `GET /telemetry/summary` - Real-time metrics summary
- `GET /telemetry/stream-info` - Stream statistics
- `DELETE /telemetry/clear` - Clear telemetry data
- `GET /llm/test` - Test Azure OpenAI connection
- `GET /health` - Health check with Redis connectivity
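A quick client-side sketch for the streaming endpoint using httpx; the request body shape (`prompt`) is an assumption, so check the backend's request models for the exact fields:

```python
import json
import httpx

# Hypothetical backend URL; take the real one from `azd env get-values`.
BACKEND_URI = "https://ca-backend-example.azurecontainerapps.io"

with httpx.stream(
    "POST", f"{BACKEND_URI}/ask/stream",
    json={"prompt": "What is semantic caching?"}, timeout=60,
) as resp:
    for line in resp.iter_lines():
        # Each SSE frame carries one JSON event (column_start, content_chunk, ...).
        if line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            print(event["type"], event.get("column", ""))
```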
- Cache hits bypass LLM generation → input + output tokens avoided
- Savings scale linearly with hit rate until saturation
- Calculator exposes sensitivity to TTL, entry footprint & traffic
- Prompt hash hit: near instantaneous (no embedding call)
- Semantic hit: embedding + Redis vector RTT (~50-120 ms typical)
- Miss: full model streaming; cache stored asynchronously
Breakeven Hit % = (Annual Redis HA Cost / No-Cache Annual Token Spend) × 100. If it exceeds 100%, the cache cannot pay for itself: optimize prompts or choose a smaller SKU.
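As an illustration with made-up figures: at $6,000/year for Redis HA against $150,000/year of no-cache token spend, breakeven is (6,000 / 150,000) × 100 = 4%, so any sustained hit rate above 4% is net savings.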
- Normalize + hash the prompt → attempt exact index lookup
- If miss → generate embedding & run vector similarity (threshold governed)
- On hit → stream cached content + metadata immediately
- On miss → stream live LLM output while accumulating a content buffer
- After completion → async store of embedding + payload + metadata (sketched below)
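A hedged sketch of that final store step with redis-py and RedisJSON, reusing the illustrative `cache:` key scheme from the lookup sketch; `CACHE_TTL` mirrors the optional environment variable listed below:

```python
import hashlib
import os
import time

CACHE_TTL = int(os.getenv("CACHE_TTL", "86400"))

def store_cache_entry(r, prompt: str, response: str, embedding: list[float]) -> None:
    # Runs after streaming completes, so it never delays the visible answer.
    normalized = " ".join(prompt.lower().split())
    key = "cache:" + hashlib.md5(normalized.encode()).hexdigest()
    doc = {
        "prompt": prompt,
        "response": response,
        "embedding": embedding,   # picked up by the vector field of the index
        "stored_at": time.time(),
    }
    r.json().set(key, "$", doc)   # RedisJSON document visible to the search index
    r.expire(key, CACHE_TTL)      # honor the configured TTL
```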
- Raw: Always LLM → baseline tokens, latency, cost
- Cache: Potentially zero LLM tokens; showcases savings & latency delta
- Token Usage: Prompt, completion, and cached tokens
- Cost Analysis: Per-request and cumulative costs
- Latency Measurement: Response time monitoring
- Cache Performance: Hit rates and effectiveness
- Redis streams capture per-column events with structured metrics
- Aggregations compute hit rate, tokens saved, average latency & cost (sketched below)
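A minimal sketch of that pattern with redis-py; the stream name and field names are illustrative, not the backend's actual schema:

```python
import redis

r = redis.Redis(decode_responses=True)

# Producer: append one entry per completed column from the streaming endpoint.
r.xadd("telemetry:events", {
    "column": "cache",
    "hit": "1",
    "tokens_saved": "812",
    "latency_ms": "43",
})

# Consumer: aggregate hit rate and tokens saved across the whole stream.
entries = r.xrange("telemetry:events")
cache_events = [fields for _, fields in entries if fields["column"] == "cache"]
if cache_events:
    hits = sum(int(e["hit"]) for e in cache_events)
    saved = sum(int(e["tokens_saved"]) for e in cache_events)
    print(f"hit rate {hits / len(cache_events):.0%}, tokens saved {saved}")
```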
Backend Connection Errors:
```bash
# Retrieve backend URL for the active azd environment
BACKEND_URI=$(azd env get-value BACKEND_URI)

# Confirm the API is healthy and Azure OpenAI is reachable
curl "$BACKEND_URI/health"
curl "$BACKEND_URI/llm/test"
```

Frontend Issues:

```bash
# Review recent frontend container logs
az containerapp logs show \
  --name ca-frontend-<token> \
  --resource-group rg-<environment>

# Validate backend response time from the frontend container
az containerapp exec \
  --name ca-frontend-<token> \
  --resource-group rg-<environment> \
  --command "curl -I $BACKEND_URI"
```

- Redis: Required for caching and memory storage
- Azure OpenAI: Required for embeddings and LLM responses
- Secrets: In azd deployments, the backend pulls credentials from Azure Container Apps secrets (Key Vault integration is not yet wired up)
- API Keys: Keep Azure OpenAI keys in environment variables for local work; never commit them
- CORS: Configured for the deployed frontend origin
- Rate Limiting: Add before exposing the service beyond closed demos
- Fork
- Create a feature branch
- Implement / document
- (Optional) Add a scenario to the calculator or enhance telemetry
- PR with clear before/after metrics if performance related
Property of Redis
- Redis for vector search and caching infrastructure
- Azure OpenAI for language model capabilities
- Next.js and FastAPI for the application framework
Built with ❤️ to showcase pragmatic semantic caching, streaming UX, and cost justification tooling for AI workloads.
Philip Laussermair and Roy de Milde
Backend (see `backend/.env.example`):
| Variable | Purpose | Example |
|---|---|---|
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint | https://your-resource.openai.azure.com/ |
| AZURE_OPENAI_API_KEY | Azure OpenAI key | (secret) |
| AZURE_OPENAI_API_VERSION | API version | 2024-12-01-preview |
| AZURE_OPENAI_EMBEDDING_MODEL | Embedding deployment name | text-embedding-3-small |
| AZURE_OPENAI_GPT5_MODEL | Chat model deployment name | gpt-5-chat |
| REDIS_URL | Redis connection string (TLS) | rediss://:pwd@host:10000 |
| VECTOR_INDEX_NAME | Redis vector index name | q_idx |
| VECTOR_DIMENSION | Embedding dimension | 1536 |
| SEMANTIC_SIMILARITY_THRESHOLD | Similarity cutoff | 0.70 |
| CACHE_TTL (optional) | TTL for cached docs (s) | 86400 |
| NEXT_PUBLIC_APPINSIGHTS_CONNECTION_STRING (optional) | Frontend Application Insights telemetry | (connection string) |
Frontend builds read `NEXT_PUBLIC_API_BASE`; `azd deploy` sets it to the backend URI automatically.
Implemented: `GET /health` (Redis + vector index). Planned: `/readiness` for full dependency checks.
- `/readiness` endpoint
- Multi-stage backend Dockerfile for a smaller image
- Structured JSON logging with request IDs
- Pytest suite for cache hit/miss economics