Reference repository: fishdev20/rag-chat
This project extends a basic RAG chatbot into a scalable asynchronous system. Instead of processing each user request inside the API request/response cycle, requests are queued and processed by background workers.
A synchronous RAG API can fail under load because each request does retrieval + LLM call inline. That creates:
- High API latency
- Request timeouts
- Poor concurrency
- Difficult horizontal scaling
Using a queue + workers solves this by decoupling ingestion of requests from execution of heavy RAG jobs.
If we use the original synchronous approach (API handles retrieval + LLM call inline), each request keeps a server worker busy until completion. Under concurrent traffic, this blocks the server, increases latency, and can cause timeouts.
-
API Server (FastAPI)
- Accepts user requests
- Enqueues jobs to Redis/Valkey queue
- Returns
job_idimmediately - Exposes job status endpoint
-
Queue Broker (Valkey/Redis + RQ)
- Stores pending jobs
- Delivers jobs to available workers
-
Distributed Worker Pool
- Multiple identical worker processes/containers
- Each worker runs
process_query() - Pulls jobs from the same queue
-
Vector DB (Qdrant)
- Stores embeddings and metadata
- Returns top relevant chunks for user query
-
LLM + Embedding API (OpenAI)
- Embedding for retrieval
- Chat completion for grounded response
- User sends query to
/chat - API enqueues
process_query(query)and returnsjob_id - Worker picks job from queue
- Worker retrieves relevant chunks from Qdrant
- Worker builds grounded prompt and calls LLM
- Worker stores result in job record
- Client polls
/job-status?job_id=...to fetch final answer
flowchart LR
U[Client / UI] --> API[FastAPI API Server]
API --> Q[(Valkey/Redis Queue)]
Q --> W1[Worker 1]
Q --> W2[Worker 2]
Q --> WN[Worker N]
W1 --> VDB[(Qdrant Vector DB)]
W2 --> VDB
WN --> VDB
W1 --> OAI[OpenAI APIs\nEmbeddings + Chat]
W2 --> OAI
WN --> OAI
API -->|poll by job_id| Q
Q --> API
API --> U
flowchart TB
subgraph Ingress
API1[API Pod 1]
API2[API Pod 2]
end
LB[Load Balancer] --> API1
LB --> API2
API1 --> Q[(Queue)]
API2 --> Q
subgraph Worker Pool
W1[Worker A]
W2[Worker B]
W3[Worker C]
W4[Worker D]
end
Q --> W1
Q --> W2
Q --> W3
Q --> W4
W1 --> VDB[(Qdrant)]
W2 --> VDB
W3 --> VDB
W4 --> VDB
W1 --> LLM[LLM Provider]
W2 --> LLM
W3 --> LLM
W4 --> LLM
- API replicas: for request throughput
- Worker replicas: for job processing throughput
- Queue: managed Redis/Valkey for reliability
- Vector DB: upgrade to managed Qdrant for higher load/storage
- Start queue broker (Valkey/Redis)
- Start Qdrant
- Start API server
- Start one or more workers
Example worker scale:
- Terminal 1:
rq worker - Terminal 2:
rq worker - Terminal 3:
rq worker
All workers consume from the same queue and process jobs in parallel.
- Add request/response logging with job IDs
- Add retries and dead-letter queue strategy for failed jobs
- Add timeout and circuit breaker around external API calls
- Add metrics: queue depth, job latency, worker success/failure rate
- Add autoscaling policy based on queue length
This design turns RAG into a resilient async pipeline:
- API remains fast and responsive
- Heavy retrieval + generation work happens in background workers
- Horizontal scaling is achieved by increasing worker replicas