Semantic Cache Demo

An end-to-end demonstration of practical semantic caching, real-time streaming, and cost modeling for AI workloads, built with Next.js (App Router) + FastAPI + Redis Stack. The app now ships with a realistic streaming cache pipeline (two parallel columns) and a new interactive Semantic Cache Cost & Sizing Calculator to help you justify Redis spend versus raw LLM costs.

🚀 Deploy to Azure

Prerequisites

  • Azure subscription with Azure OpenAI quota in eastus2
  • Azure CLI and Azure Developer CLI (azd)
  • Docker Desktop (for container image builds)
  • Logged in with Azure (az login or azd auth login)

First-Time Provisioning & Deployment

# 1. Install azd if needed
curl -fsSL https://aka.ms/install-azd.sh | bash

# 2. Authenticate (one time per machine)
azd auth login

# 3. Provision infrastructure and deploy the app
azd up

During azd up, choose an environment name (e.g. rg-Redis-AMR-semantic-cache-demo) and keep the suggested eastus2 location. The command provisions:

  • Azure OpenAI Service with GPT-4o + text-embedding-ada-002 deployments
  • Azure Managed Redis Enterprise (Balanced) for semantic cache storage
  • Azure Container Apps running the FastAPI backend and Next.js frontend
  • Supporting services: Azure Container Apps Environment, Azure Container Registry, Log Analytics, Application Insights, and managed identities with AcrPull access

Ongoing Workflow

  • Iterate on code: azd deploy rebuilds/pushes images and updates Container Apps.
  • Infrastructure Bicep changes: azd provision reapplies infra updates (follow with azd deploy if images changed).
  • Retrieve connection info: azd env get-values or read .azure/<env>/.env for URLs, secrets, and resource names.
  • Tear everything down: azd down --force --purge deletes the resource group and purges OpenAI instances.

Why use azd up first? Running azd provision by itself leaves the frontend pointing to the fallback container image, which listens on port 80. Our Container App expects port 3000, so the revision gets stuck activating. azd up (or azd provision followed by azd deploy) ensures the real frontend image is built and deployed, preventing port mismatch issues.

Accessing the Deployed App

  • Deployment commands print the frontend and backend URLs; you can also run azd env get-values to view FRONTEND_URI and BACKEND_URI.
  • The .azure/<env>/.env file mirrors those values and includes Azure credentials for local tooling.

Operations & Observability

  • Logs: az containerapp logs show --name ca-backend-<token> --resource-group rg-<env> (and similar for the frontend).
  • Telemetry: Application Insights collects backend traces via Azure Monitor OpenTelemetry and frontend telemetry via the web SDK.
  • Scaling: Container Apps default to 1–3 replicas sized at 0.5 vCPU / 1 GiB. Adjust infra/app/backend.bicep or frontend.bicep to tune resources.
  • Redis tier: Balanced B0 by default; scale up in the Azure Portal when you outgrow it.

Hardening Checklist

  • Lock down networking (VNet integration, private endpoints, WAF) and add authentication (Azure AD).
  • Enable Redis persistence/backups and set up alerting in Application Insights.
  • Integrate automated CI/CD with azd pipeline config, GitHub Actions, or Azure DevOps.

🚀 Features

Parallel Two-Column Streaming Comparison

  • Raw Column: Direct Azure OpenAI GPT-5 response (baseline, always LLM tokens)
  • Cache Column: Ultra-low latency semantic cache path with:
    • Fast-path exact prompt-hash lookup (no embedding, ~single RTT)
    • Fallback semantic vector search (Redis vector index)
    • Immediate stream of cached content on hit (sub-100ms typical)
    • Automatic async population on miss (stream + background store)

New: Semantic Cache Cost & Sizing Calculator (/calculator)

Interactive tool to model:

  • Annual token spend (no cache vs with cache)
  • Savings vs hit rate curve (dynamic SVG graph)
  • Breakeven hit rate vs Redis HA cost
  • Automatic Azure Managed Redis SKU selection (subset benchmark table)
  • Per-entry memory footprint estimation (vector + payload + overhead); see the sketch below
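
To make the footprint math concrete, here is a minimal sketch of the per-entry estimate in Python; the per-key overhead constant and the one-byte-per-character payload assumption are illustrative, not measured values from the calculator.

# Rough per-entry memory estimate: vector + payload + overhead.
# The overhead constant is an assumption, not a measured value.
VECTOR_DIMENSION = 1536            # embedding size used by the demo
BYTES_PER_FLOAT = 4                # float32 components
REDIS_OVERHEAD_BYTES = 200         # assumed key + index bookkeeping

def entry_bytes(prompt_chars: int, response_chars: int) -> int:
    vector = VECTOR_DIMENSION * BYTES_PER_FLOAT
    payload = prompt_chars + response_chars   # ~1 byte/char for ASCII-heavy text
    return vector + payload + REDIS_OVERHEAD_BYTES

# Example: 500-char prompt, 2,000-char response -> ~8.8 KB per entry
per_entry = entry_bytes(500, 2000)
print(f"{per_entry} B/entry, ~{(1024**3) // per_entry:,} entries per GiB")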

Rich Hover Tooltips (Toggleable)

  • Cache hit/miss metadata (similarity, distance, thresholds)
  • Underlying Redis query JSON (debug surface)
  • Embedding + doc id (for hits), truncated client-side
  • Toggle with the header "Hover: On/Off" control

Performance & Telemetry

  • Real-time token, latency & cost tracking (Redis streams)
  • Distinguishes LLM tokens vs tokens saved from cache
  • Streaming event types: column_start, content_chunk, column_complete, stream_complete (see the consumer sketch below)
  • Cache column emits enriched cache_document payload for hits & misses (uniform UI handling)
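
As a rough illustration of consuming those events, the sketch below assumes the stream is served as SSE-style "data: {json}" lines whose JSON carries a type field matching the names above; verify the actual wire format against the backend before relying on it.

# Hypothetical /ask/stream consumer (wire format assumed, see above).
import json
import httpx

def consume_stream(base_url: str, prompt: str) -> None:
    with httpx.stream("POST", f"{base_url}/ask/stream",
                      json={"prompt": prompt}, timeout=None) as resp:
        for line in resp.iter_lines():
            if not line.startswith("data:"):
                continue                     # skip keep-alives / comments
            event = json.loads(line[len("data:"):])
            if event["type"] == "content_chunk":
                print(event.get("content", ""), end="", flush=True)
            elif event["type"] == "stream_complete":
                break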

Advanced Caching Mechanics

  • Prompt Hash Fast Path: MD5 of the normalized prompt → O(1) lookup prior to any embedding call
  • Vector Similarity: 1536-d embedding (OpenAI ada-style) search via a Redis Search index
  • Dual Threshold Support: similarity & cosine distance (configurable via env)
  • Immediate Hit Return: sends the cached body + metadata before the raw column emits its first tokens
  • Async Population: the miss path stores generated content after streaming completes
  • Graceful Degradation: the raw column still operates when the cache/vector index is missing
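
In code, the two lookup stages might look roughly like the sketch below. The key scheme (cache:<md5>), hash field names, distance threshold, and the q_idx index name are assumptions for illustration (q_idx matches the VECTOR_INDEX_NAME default listed later), using redis-py's search API.

# Sketch: exact prompt-hash fast path, then KNN vector fallback.
# Key scheme, field names, and threshold are assumed, not verified.
import hashlib
import numpy as np
import redis
from redis.commands.search.query import Query

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def lookup(prompt: str, embed) -> str | None:
    # Fast path: O(1) exact match on the normalized prompt hash.
    key = "cache:" + hashlib.md5(prompt.strip().lower().encode()).hexdigest()
    cached = r.hget(key, "response")
    if cached is not None:
        return cached

    # Fallback: KNN search over the vector index. `embed` is any
    # callable returning a list[float] embedding of the prompt.
    vec = np.array(embed(prompt), dtype=np.float32).tobytes()
    q = (Query("*=>[KNN 1 @embedding $vec AS score]")
         .sort_by("score")
         .return_fields("response", "score")
         .dialect(2))
    res = r.ft("q_idx").search(q, query_params={"vec": vec})
    # score is cosine distance here; 0.30 ≈ the 0.70 similarity cutoff.
    if res.docs and float(res.docs[0].score) <= 0.30:
        return res.docs[0].response
    return None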

πŸ—οΈ Architecture

┌─────────────────┐    ┌─────────────────┐    ┌─────────────────┐
│   Next.js UI    │    │   FastAPI       │    │   Redis Stack   │
│                 │    │   Backend       │    │                 │
│ • Chat Interface│◄──►│ • Semantic Cache│◄──►│ • Vector Search │
│ • Metrics View  │    │ • Streaming API │    │ • JSON Storage  │
└─────────────────┘    └─────────────────┘    └─────────────────┘
                                │
                                ▼
                       ┌─────────────────┐
                       │   Azure OpenAI  │
                       │  (Embeddings &  │
                       │    GPT-5 Gen)   │
                       └─────────────────┘

🎯 Usage

Basic Chat

  1. Open the frontend URL reported by azd deploy or azd env get-values (the FRONTEND_URI)
  2. Enter a prompt; both columns start streaming almost immediately
  3. If a prior semantically similar (or identical) prompt exists, you'll see a near-instant answer in the cache column
  4. Hover over the cache result (if hover is enabled) to inspect metadata & the Redis query JSON
  5. Compare tokens vs tokens saved, latency, and cost

Calculator Usage (/calculator)

  1. Click the calculator icon in the header or go to /calculator
  2. Adjust hit rate, traffic, token sizes, TTL, pricing
  3. Review savings vs hit rate curve & automatic Redis SKU selection
  4. Use the breakeven % to justify enabling semantic caching in your environment

Performance Monitoring

  • Token Usage: Track LLM vs cached tokens
  • Cost Analysis: Monitor spending across different approaches
  • Cache Hit Rates: Measure semantic cache effectiveness
  • Latency Comparison: Compare response times

🔧 API Endpoints

Core Endpoints

  • POST /ask - One-shot generation (raw + cache + personalized objects returned)
  • POST /ask/stream - Real-time streaming for the two active columns (raw & cache)
  • GET /metrics - Aggregated metrics (legacy + compatibility)

Cache Management

  • GET /cache/stats - Cache statistics (hit counts, sizes)
  • POST /cache/search - Search semantic cache (text or embedding)
  • POST /cache/store - Store a document manually
  • DELETE /cache/clear - Clear cache entries (pattern based)

Vector Operations

  • POST /vector/search - Vector similarity search (top_k, threshold)
  • POST /vector/add - Manually index arbitrary content
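
A quick, hedged example of exercising the cache and vector endpoints from Python; the payload field names (query, top_k, threshold) are inferred from the descriptions above and may not match the actual request schemas.

# Illustrative client calls; payload shapes are assumptions.
import httpx

BASE = "https://<backend-uri>"   # from azd env get-values

print(httpx.get(f"{BASE}/health").json())

hits = httpx.post(f"{BASE}/cache/search",
                  json={"query": "What is semantic caching?"}).json()

neighbors = httpx.post(f"{BASE}/vector/search",
                       json={"query": "semantic caching",
                             "top_k": 3, "threshold": 0.7}).json()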

Telemetry

  • GET /telemetry/summary - Real-time metrics summary
  • GET /telemetry/stream-info - Stream statistics
  • DELETE /telemetry/clear - Clear telemetry data

Testing

  • GET /llm/test - Test Azure OpenAI connection
  • GET /health - Health check with Redis connectivity

📊 Performance & Economics

Token & Cost Efficiency

  • Cache hits bypass LLM generation → input + output tokens avoided
  • Savings scale linearly with hit rate until saturation
  • Calculator exposes sensitivity to TTL, entry footprint & traffic

Latency Characteristics

  • Prompt hash hit: near-instantaneous (no embedding call)
  • Semantic hit: embedding + Redis vector RTT (~50–120 ms typical)
  • Miss: full model streaming; cache stored asynchronously

Breakeven Logic (Calculator)

Breakeven Hit % = (Annual Redis HA Cost / No-Cache Annual Token Spend) × 100. If the result exceeds 100%, Redis costs more than your entire token spend, so optimize prompts or choose a smaller SKU.
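
The same arithmetic in code, with placeholder numbers rather than real Azure pricing:

# Breakeven hit rate: annual Redis HA cost as a share of the
# annual no-cache token spend. All dollar figures are placeholders.
annual_redis_ha_cost = 12_000.0      # $/year, assumed
no_cache_token_spend = 150_000.0     # $/year, assumed

breakeven_pct = annual_redis_ha_cost / no_cache_token_spend * 100
print(f"Breakeven hit rate: {breakeven_pct:.1f}%")          # 8.0%

# Any hit rate above breakeven is net-positive:
hit_rate = 0.35
savings = hit_rate * no_cache_token_spend - annual_redis_ha_cost
print(f"Net annual savings at {hit_rate:.0%}: ${savings:,.0f}")  # $40,500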

🧠 How It Works

Semantic Cache Flow

  1. Normalize + hash the prompt → attempt exact index lookup
  2. If miss → generate embedding & run vector similarity search (threshold governed)
  3. On hit → stream cached content + metadata immediately
  4. On miss → stream live LLM output while accumulating a content buffer
  5. After completion → async store (embedding + payload + metadata), as sketched below
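
A minimal sketch of that final store step, assuming Redis hashes, redis-py's asyncio client, and the same key scheme and field names as the lookup sketch earlier; the real implementation may differ.

# Sketch of the async miss-path store (step 5). Key scheme and
# field names are assumed; TTL comes from the CACHE_TTL env var.
import asyncio
import hashlib
import os
import numpy as np
import redis.asyncio as aioredis

r = aioredis.Redis(host="localhost", port=6379)
CACHE_TTL = int(os.getenv("CACHE_TTL", "86400"))

async def store_entry(prompt: str, response: str, embedding: list[float]) -> None:
    key = "cache:" + hashlib.md5(prompt.strip().lower().encode()).hexdigest()
    await r.hset(key, mapping={
        "prompt": prompt,
        "response": response,
        "embedding": np.array(embedding, dtype=np.float32).tobytes(),
    })
    await r.expire(key, CACHE_TTL)

# Fired after streaming completes, without blocking the response:
# asyncio.create_task(store_entry(prompt, full_response, embedding))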

Two-Column Streaming Strategy

  • Raw: Always LLM → baseline tokens, latency, cost
  • Cache: Potentially zero LLM tokens; showcases savings & latency delta

📈 Metrics and Analytics

Real-time Tracking

  • Token Usage: Prompt, completion, and cached tokens
  • Cost Analysis: Per-request and cumulative costs
  • Latency Measurement: Response time monitoring
  • Cache Performance: Hit rates and effectiveness

Telemetry Dashboard

  • Redis streams capture per-column events with structured metrics
  • Aggregations compute hit rate, tokens saved, average latency & cost (see the sketch below)
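
For instance, a hit-rate aggregation over such a stream could look like the sketch below; the stream key and event field names are assumptions, not the repo's actual schema.

# Sketch: compute cache hit rate from a Redis stream of per-column
# events. Stream key and field names are assumed.
import redis

r = redis.Redis(host="localhost", port=6379, decode_responses=True)

def cache_hit_rate(stream_key: str = "telemetry:events") -> float:
    entries = r.xrange(stream_key)            # [(id, {field: value}), ...]
    cache_events = [fields for _, fields in entries
                    if fields.get("column") == "cache"]
    if not cache_events:
        return 0.0
    hits = sum(1 for f in cache_events if f.get("cache_hit") == "1")
    return hits / len(cache_events)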

🚨 Troubleshooting

Common Issues

Backend Connection Errors:

# Retrieve backend URL for the active azd environment
BACKEND_URI=$(azd env get-value BACKEND_URI)

# Confirm the API is healthy and Azure OpenAI is reachable
curl "$BACKEND_URI/health"
curl "$BACKEND_URI/llm/test"

Frontend Issues:

# Review recent frontend container logs
az containerapp logs show \
    --name ca-frontend-<token> \
    --resource-group rg-<environment>

# Validate backend response time from the frontend container
az containerapp exec \
    --name ca-frontend-<token> \
    --resource-group rg-<environment> \
    --command "curl -I $BACKEND_URI"

Service Dependencies

  • Redis: Required for caching and memory storage
  • Azure OpenAI: Required for embeddings and LLM responses

🔒 Security Considerations

  • Secrets: In azd deployments, the backend pulls credentials from Azure Container Apps secrets (Key Vault integration is not yet wired up)
  • API Keys: Keep Azure OpenAI keys in environment variables for local work; never commit them
  • CORS: Configured for the deployed frontend origin
  • Rate Limiting: Add before exposing the service beyond closed demos

🤝 Contributing

  1. Fork
  2. Create a feature branch
  3. Implement / document
  4. (Optional) Add a scenario to the calculator or enhance telemetry
  5. Open a PR with clear before/after metrics if the change is performance-related

📄 License

Property of Redis

🙏 Acknowledgments

  • Redis for vector search and caching infrastructure
  • Azure OpenAI for language model capabilities
  • Next.js and FastAPI for the application framework


Built with ❀️ to showcase pragmatic semantic caching, streaming UX, and cost justification tooling for AI workloads.

👥 Authors

Philip Laussermair and Roy de Milde

Environment Variables Summary

Backend (see backend/.env.example):

Variable                                    Purpose                                                      Example
AZURE_OPENAI_ENDPOINT                       Azure OpenAI endpoint                                        https://your-resource.openai.azure.com/
AZURE_OPENAI_API_KEY                        Azure OpenAI key (secret)
AZURE_OPENAI_API_VERSION                    API version                                                  2024-12-01-preview
AZURE_OPENAI_EMBEDDING_MODEL                Embedding deployment name                                    text-embedding-3-small
AZURE_OPENAI_GPT5_MODEL                     Chat model deployment name                                   gpt-5-chat
REDIS_URL                                   Redis connection string (TLS)                                rediss://:pwd@host:10000
VECTOR_INDEX_NAME                           Redis vector index name                                      q_idx
VECTOR_DIMENSION                            Embedding dimension                                          1536
SEMANTIC_SIMILARITY_THRESHOLD               Similarity cutoff                                            0.70
CACHE_TTL                                   (optional) TTL for cached docs (seconds)                     86400
NEXT_PUBLIC_APPINSIGHTS_CONNECTION_STRING   (optional) Frontend Application Insights connection string

Frontend builds read NEXT_PUBLIC_API_BASE; azd deploy sets it to the backend URI automatically.
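
For local development, the backend settings could be loaded along these lines; a sketch only, with defaults mirroring the table above (the Redis URL default here is a local non-TLS placeholder):

# Minimal settings loader mirroring the table above (sketch only).
import os

AZURE_OPENAI_ENDPOINT = os.environ["AZURE_OPENAI_ENDPOINT"]    # required
AZURE_OPENAI_API_KEY = os.environ["AZURE_OPENAI_API_KEY"]      # required secret
AZURE_OPENAI_API_VERSION = os.getenv("AZURE_OPENAI_API_VERSION", "2024-12-01-preview")
REDIS_URL = os.getenv("REDIS_URL", "redis://localhost:6379")   # local placeholder
VECTOR_INDEX_NAME = os.getenv("VECTOR_INDEX_NAME", "q_idx")
VECTOR_DIMENSION = int(os.getenv("VECTOR_DIMENSION", "1536"))
SEMANTIC_SIMILARITY_THRESHOLD = float(os.getenv("SEMANTIC_SIMILARITY_THRESHOLD", "0.70"))
CACHE_TTL = int(os.getenv("CACHE_TTL", "86400"))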

✅ Health & Readiness

Implemented: GET /health (Redis + vector index). Planned: /readiness for full dependency checks.
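
A minimal version of such a health check, assuming FastAPI and redis-py; not the repo's actual implementation, and the q_idx index name is the assumed default:

# Sketch of GET /health: ping Redis and confirm the vector index
# exists via FT.INFO. Index name is assumed.
from fastapi import FastAPI
import redis

app = FastAPI()
r = redis.Redis(host="localhost", port=6379)

@app.get("/health")
def health() -> dict:
    status = {"redis": False, "vector_index": False}
    try:
        status["redis"] = bool(r.ping())
        r.ft("q_idx").info()        # raises if the index is missing
        status["vector_index"] = True
    except redis.RedisError:
        pass
    return status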

🛠 Planned Enhancements

  • /readiness endpoint
  • Multi-stage backend Dockerfile for smaller image
  • Structured JSON logging with request IDs
  • Pytest suite for cache hit/miss economics
