An end-to-end demonstration of practical semantic caching, real-time streaming, and cost modeling for AI workloads, built with Next.js (App Router) + FastAPI + Redis Stack. The app now ships with a realistic streaming cache pipeline (two parallel columns) and a new interactive Semantic Cache Cost & Sizing Calculator to help you justify Redis spend versus raw LLM costs.
- Azure subscription with Azure OpenAI quota in `eastus2`
- Azure CLI and Azure Developer CLI (`azd`)
- Docker Desktop (for container image builds)
- Logged in to Azure (`az login` or `azd auth login`)
```bash
# 1. Install azd if needed
curl -fsSL https://aka.ms/install-azd.sh | bash

# 2. Authenticate (one time per machine)
azd auth login

# 3. Provision infrastructure and deploy the app
azd up
```

During `azd up`, choose an environment name (e.g. `rg-Redis-AMR-semantic-cache-demo`) and keep the suggested `eastus2` location. The command provisions:
- Azure OpenAI Service with GPT-4o + text-embedding-ada-002 deployments
- Azure Managed Redis Enterprise (Balanced) for semantic cache storage
- Azure Container Apps running the FastAPI backend and Next.js frontend
- Supporting services: Azure Container Apps Environment, Azure Container Registry, Log Analytics, Application Insights, and managed identities with `AcrPull` access
- Iterate on code: `azd deploy` rebuilds/pushes images and updates Container Apps.
- Infrastructure Bicep changes: `azd provision` reapplies infra updates (follow with `azd deploy` if images changed).
- Retrieve connection info: `azd env get-values` or read `.azure/<env>/.env` for URLs, secrets, and resource names.
- Tear everything down: `azd down --force --purge` deletes the resource group and purges OpenAI instances.
Why use `azd up` first? Running `azd provision` by itself leaves the frontend pointing to the fallback container image, which listens on port 80. Our Container App expects port 3000, so the revision gets stuck activating. `azd up` (or `azd provision` followed by `azd deploy`) ensures the real frontend image is built and deployed, preventing port mismatch issues.
- Deployment commands print the frontend and backend URLs; you can also run `azd env get-values` to view `FRONTEND_URI` and `BACKEND_URI`.
- The `.azure/<env>/.env` file mirrors those values and includes Azure credentials for local tooling.
- Logs: `az containerapp logs show --name ca-backend-<token> --resource-group rg-<env>` (and similar for the frontend).
- Telemetry: Application Insights collects backend traces via Azure Monitor OpenTelemetry and frontend telemetry via the web SDK.
- Scaling: Container Apps default to 1-3 replicas sized at 0.5 vCPU / 1 GiB. Adjust `infra/app/backend.bicep` or `frontend.bicep` to tune resources.
- Redis tier: Balanced B0 by default; scale up in the Azure Portal when you outgrow it.
- Lock down networking (VNet integration, private endpoints, WAF) and add authentication (Azure AD).
- Enable Redis persistence/backups and set up alerting in Application Insights.
- Integrate automated CI/CD with `azd pipeline config`, GitHub Actions, or Azure DevOps.
- Raw Column: Direct Azure OpenAI GPT-5 response (baseline, always LLM tokens)
- Cache Column: Ultra-low latency semantic cache path with:
  - Fast path exact prompt hash lookup (no embedding, ~single RTT)
  - Fallback semantic vector search (Redis vector index)
  - Immediate stream of cached content on hit (sub-100ms typical)
  - Automatic async population on miss (stream + background store)
Interactive tool to model:
- Annual token spend (no cache vs with cache)
- Savings vs hit rate curve (dynamic SVG graph)
- Breakeven hit rate vs Redis HA cost
- Automatic Azure Managed Redis SKU selection (subset benchmark table)
- Per-entry memory footprint estimation (vector + payload + overhead); see the sketch below
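The sketch below shows the core arithmetic with made-up inputs; every number is a placeholder for what the real calculator reads from its UI controls, and the per-entry payload/overhead byte counts are rough assumptions:

```python
# All inputs below are hypothetical; the calculator exposes them as controls.
requests_per_year = 10_000_000
tokens_per_request = 1_500             # prompt + completion
price_per_1k_tokens = 0.01             # USD, blended rate
hit_rate = 0.35                        # fraction of requests served from cache
redis_ha_cost_per_year = 6_000         # USD, HA deployment

no_cache_spend = requests_per_year * tokens_per_request / 1_000 * price_per_1k_tokens
with_cache_spend = no_cache_spend * (1 - hit_rate) + redis_ha_cost_per_year
breakeven_hit_pct = redis_ha_cost_per_year / no_cache_spend * 100

# Per-entry footprint: 1536-dim float32 vector + JSON payload + index overhead.
entry_bytes = 1536 * 4 + 2_048 + 512   # payload/overhead sizes are guesses
print(f"annual savings ~= ${no_cache_spend - with_cache_spend:,.0f}")
print(f"breakeven hit rate ~= {breakeven_hit_pct:.1f}%")
print(f"per-entry footprint ~= {entry_bytes / 1024:.1f} KiB")
```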
- Cache hit/miss metadata (similarity, distance, thresholds)
- Underlying Redis query JSON (debug surface)
- Embedding + doc id (for hits), truncated client side
- Toggle with the header "Hover: On/Off" control
- Real-time token, latency & cost tracking (Redis streams)
- Distinguishes LLM tokens vs tokens saved from cache
- Streaming event types: `column_start`, `content_chunk`, `column_complete`, `stream_complete` (see the sketch below)
- Cache column emits an enriched `cache_document` payload for hits & misses (uniform UI handling)
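A minimal sketch of how a FastAPI backend can emit these event types over Server-Sent Events; the payload field names (`type`, `column`, `content`) and the fixed event order are illustrative, not the backend's exact schema:

```python
import json
from fastapi import FastAPI
from fastapi.responses import StreamingResponse

app = FastAPI()

def sse(event: dict) -> str:
    # One Server-Sent Events frame per JSON event.
    return f"data: {json.dumps(event)}\n\n"

@app.post("/ask/stream")
async def ask_stream(payload: dict):
    async def gen():
        for column in ("raw", "cache"):
            yield sse({"type": "column_start", "column": column})
        # Illustrative only: the real pipeline interleaves LLM chunks and
        # cache-hit content concurrently across both columns.
        yield sse({"type": "content_chunk", "column": "cache", "content": "..."})
        yield sse({"type": "column_complete", "column": "cache"})
        yield sse({"type": "stream_complete"})
    return StreamingResponse(gen(), media_type="text/event-stream")
```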
- Prompt Hash Fast Path: MD5 of normalized prompt → O(1) lookup prior to embedding (see the lookup sketch below)
- Vector Similarity: 1536-d embedding (OpenAI ada-style) search via a Redis Search index
- Dual Threshold Support: Similarity & cosine distance (configurable via env)
- Immediate Hit Return: Sends cached body + metadata before the raw column finishes its first tokens
- Async Population: Miss path stores generated content after streaming completes
- Graceful Degradation: Raw column still operates when the cache/vector index is missing
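A minimal sketch of that lookup order using redis-py with RediSearch, assuming the `q_idx` index name and 0.70 threshold from the environment table below; the `cache:` key prefix, the `embedding`/`response` field names, and the `embed` callable are illustrative:

```python
import hashlib

import numpy as np
import redis
from redis.commands.search.query import Query

r = redis.Redis(decode_responses=True)

def lookup(prompt: str, embed, threshold: float = 0.70):
    # 1) Fast path: O(1) exact lookup keyed on the MD5 of the normalized prompt.
    normalized = " ".join(prompt.lower().split())
    key = "cache:" + hashlib.md5(normalized.encode()).hexdigest()
    doc = r.json().get(key)
    if doc:
        return doc["response"], "exact"

    # 2) Fallback: KNN search over the vector index (cosine distance).
    vec = np.array(embed(prompt), dtype=np.float32).tobytes()
    query = (
        Query("*=>[KNN 1 @embedding $vec AS dist]")
        .sort_by("dist")
        .return_fields("dist", "response")
        .dialect(2)
    )
    res = r.ft("q_idx").search(query, query_params={"vec": vec})
    if res.docs and 1 - float(res.docs[0].dist) >= threshold:
        return res.docs[0].response, "semantic"
    return None, "miss"
```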
```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│   Next.js UI    │     │     FastAPI     │     │   Redis Stack   │
│                 │     │     Backend     │     │                 │
│ • Chat Interface│────►│ • Semantic Cache│────►│ • Vector Search │
│ • Metrics View  │     │ • Streaming API │     │ • JSON Storage  │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                 │
                                 ▼
                        ┌─────────────────┐
                        │  Azure OpenAI   │
                        │  (Embeddings &  │
                        │   GPT-5 Gen)    │
                        └─────────────────┘
```
- Open the frontend URL reported by `azd deploy` or `azd env get-values` (the `FRONTEND_URI`)
- Enter a prompt; both columns start streaming almost immediately
- If a prior semantically similar (or identical) prompt exists, you'll see a near-instant cache column answer
- Hover over the cache result (if hover is enabled) to inspect metadata & the Redis query JSON
- Compare tokens vs tokens saved, latency, and cost
- Click the calculator icon in the header or go to `/calculator`
- Adjust hit rate, traffic, token sizes, TTL, and pricing
- Review the savings vs hit rate curve & automatic Redis SKU selection
- Use the breakeven % to justify enabling semantic cache in your environment
- Token Usage: Track LLM vs cached tokens
- Cost Analysis: Monitor spending across different approaches
- Cache Hit Rates: Measure semantic cache effectiveness
- Latency Comparison: Compare response times
- `POST /ask` - One-shot generation (raw + cache + personalized objects returned)
- `POST /ask/stream` - Real-time streaming for the two active columns (raw & cache; client example below)
- `GET /metrics` - Aggregated metrics (legacy + compatibility)
- `GET /cache/stats` - Cache statistics (hit counts, sizes)
- `POST /cache/search` - Search the semantic cache (text or embedding)
- `POST /cache/store` - Store a document manually
- `DELETE /cache/clear` - Clear cache entries (pattern based)
- `POST /vector/search` - Vector similarity search (top_k, threshold)
- `POST /vector/add` - Manually index arbitrary content
- `GET /telemetry/summary` - Real-time metrics summary
- `GET /telemetry/stream-info` - Stream statistics
- `DELETE /telemetry/clear` - Clear telemetry data
- `GET /llm/test` - Test Azure OpenAI connection
- `GET /health` - Health check with Redis connectivity
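A quick client-side sketch for the streaming endpoint using httpx; the request body shape (`prompt`) is an assumption, so check the backend's request models for the exact fields:

```python
import json
import httpx

# Hypothetical backend URL; take the real one from `azd env get-values`.
BACKEND_URI = "https://ca-backend-example.azurecontainerapps.io"

with httpx.stream(
    "POST", f"{BACKEND_URI}/ask/stream",
    json={"prompt": "What is semantic caching?"}, timeout=60,
) as resp:
    for line in resp.iter_lines():
        # Each SSE frame carries one JSON event (column_start, content_chunk, ...).
        if line.startswith("data: "):
            event = json.loads(line[len("data: "):])
            print(event["type"], event.get("column", ""))
```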
- Cache hits bypass LLM generation → input + output tokens avoided
- Savings scale linearly with hit rate until saturation
- Calculator exposes sensitivity to TTL, entry footprint & traffic
- Prompt hash hit: near instantaneous (no embedding call)
- Semantic hit: embedding + Redis vector RTT (~50-120 ms typical)
- Miss: full model streaming; cache stored asynchronously
Breakeven Hit % = (Annual Redis HA Cost / No-Cache Annual Token Spend) × 100. If it exceeds 100%, the cache cannot pay for itself: optimize prompts or choose a smaller SKU.
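As an illustration with made-up figures: at $6,000/year for Redis HA against $150,000/year of no-cache token spend, breakeven is (6,000 / 150,000) × 100 = 4%, so any sustained hit rate above 4% is net savings.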
- Normalize + hash the prompt → attempt exact index lookup
- If miss → generate embedding & run vector similarity (threshold governed)
- On hit → stream cached content + metadata immediately
- On miss → stream live LLM output while accumulating a content buffer
- After completion → async store of embedding + payload + metadata (sketched below)
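A hedged sketch of that final store step with redis-py and RedisJSON, reusing the illustrative `cache:` key scheme from the lookup sketch; `CACHE_TTL` mirrors the optional environment variable listed below:

```python
import hashlib
import os
import time

CACHE_TTL = int(os.getenv("CACHE_TTL", "86400"))

def store_cache_entry(r, prompt: str, response: str, embedding: list[float]) -> None:
    # Runs after streaming completes, so it never delays the visible answer.
    normalized = " ".join(prompt.lower().split())
    key = "cache:" + hashlib.md5(normalized.encode()).hexdigest()
    doc = {
        "prompt": prompt,
        "response": response,
        "embedding": embedding,   # picked up by the vector field of the index
        "stored_at": time.time(),
    }
    r.json().set(key, "$", doc)   # RedisJSON document visible to the search index
    r.expire(key, CACHE_TTL)      # honor the configured TTL
```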
- Raw: Always LLM → baseline tokens, latency, cost
- Cache: Potentially zero LLM tokens; showcases savings & latency delta
- Token Usage: Prompt, completion, and cached tokens
- Cost Analysis: Per-request and cumulative costs
- Latency Measurement: Response time monitoring
- Cache Performance: Hit rates and effectiveness
- Redis streams capture per-column events with structured metrics
- Aggregations compute hit rate, tokens saved, average latency & cost (sketched below)
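A minimal sketch of that pattern with redis-py; the stream name and field names are illustrative, not the backend's actual schema:

```python
import redis

r = redis.Redis(decode_responses=True)

# Producer: append one entry per completed column from the streaming endpoint.
r.xadd("telemetry:events", {
    "column": "cache",
    "hit": "1",
    "tokens_saved": "812",
    "latency_ms": "43",
})

# Consumer: aggregate hit rate and tokens saved across the whole stream.
entries = r.xrange("telemetry:events")
cache_events = [fields for _, fields in entries if fields["column"] == "cache"]
if cache_events:
    hits = sum(int(e["hit"]) for e in cache_events)
    saved = sum(int(e["tokens_saved"]) for e in cache_events)
    print(f"hit rate {hits / len(cache_events):.0%}, tokens saved {saved}")
```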
Backend Connection Errors:
```bash
# Retrieve backend URL for the active azd environment
BACKEND_URI=$(azd env get-value BACKEND_URI)

# Confirm the API is healthy and Azure OpenAI is reachable
curl "$BACKEND_URI/health"
curl "$BACKEND_URI/llm/test"
```

Frontend Issues:

```bash
# Review recent frontend container logs
az containerapp logs show \
  --name ca-frontend-<token> \
  --resource-group rg-<environment>

# Validate backend response time from the frontend container
az containerapp exec \
  --name ca-frontend-<token> \
  --resource-group rg-<environment> \
  --command "curl -I $BACKEND_URI"
```

- Redis: Required for caching and memory storage
- Azure OpenAI: Required for embeddings and LLM responses
- Secrets: In azd deployments, the backend pulls credentials from Azure Container Apps secrets (Key Vault integration is not yet wired up)
- API Keys: Keep Azure OpenAI keys in environment variables for local work; never commit them
- CORS: Configured for the deployed frontend origin
- Rate Limiting: Add before exposing the service beyond closed demos
- Fork
- Create a feature branch
- Implement / document
- (Optional) Add a scenario to the calculator or enhance telemetry
- PR with clear before/after metrics if performance related
Property of Redis
- Redis for vector search and caching infrastructure
- Azure OpenAI for language model capabilities
- Next.js and FastAPI for the application framework
Built with ❤️ to showcase pragmatic semantic caching, streaming UX, and cost justification tooling for AI workloads.
Philip Laussermair and Roy de Milde
Backend (see `backend/.env.example`):
| Variable | Purpose | Example |
|---|---|---|
| AZURE_OPENAI_ENDPOINT | Azure OpenAI endpoint | https://your-resource.openai.azure.com/ |
| AZURE_OPENAI_API_KEY | Azure OpenAI key | (secret) |
| AZURE_OPENAI_API_VERSION | API version | 2024-12-01-preview |
| AZURE_OPENAI_EMBEDDING_MODEL | Embedding deployment name | text-embedding-3-small |
| AZURE_OPENAI_GPT5_MODEL | Chat model deployment name | gpt-5-chat |
| REDIS_URL | Redis connection string (TLS) | rediss://:pwd@host:10000 |
| VECTOR_INDEX_NAME | Redis vector index name | q_idx |
| VECTOR_DIMENSION | Embedding dimension | 1536 |
| SEMANTIC_SIMILARITY_THRESHOLD | Similarity cutoff | 0.70 |
| CACHE_TTL (optional) | TTL for cached docs (s) | 86400 |
| NEXT_PUBLIC_APPINSIGHTS_CONNECTION_STRING (optional) | Frontend Application Insights telemetry | (connection string) |
Frontend builds read `NEXT_PUBLIC_API_BASE`; `azd deploy` sets it to the backend URI automatically.
Implemented: `GET /health` (Redis + vector index). Planned: `/readiness` for full dependency checks.
- `/readiness` endpoint
- Multi-stage backend Dockerfile for a smaller image
- Structured JSON logging with request IDs
- Pytest suite for cache hit/miss economics