Scalable RAG with Async Queues and Distributed Workers

Reference repository: fishdev20/rag-chat

This project extends a basic RAG chatbot into a scalable asynchronous system. Instead of processing each user request inside the API request/response cycle, requests are queued and processed by background workers.

Why This Architecture

A synchronous RAG API degrades under load because each request runs retrieval and the LLM call inline. The results are:

- High API latency
- Request timeouts
- Poor concurrency
- Difficult horizontal scaling

A queue plus workers addresses this by decoupling request intake from the execution of heavy RAG jobs.

In the original synchronous approach, where the API handles retrieval and the LLM call inline, each request keeps a server worker busy until completion. Under concurrent traffic this ties up the server's workers, increases latency, and can cause timeouts.

Core Components

API Server (FastAPI)
- Accepts user requests
- Enqueues jobs on the Valkey/Redis queue
- Returns job_id immediately
- Exposes a job status endpoint

Queue Broker (Valkey/Redis + RQ)
- Stores pending jobs
- Delivers jobs to available workers

Distributed Worker Pool
- Multiple identical worker processes/containers
- Each worker runs process_query()
- All workers pull jobs from the same queue

Vector DB (Qdrant)
- Stores embeddings and metadata
- Returns the most relevant chunks for the user query

LLM + Embedding API (OpenAI)
- Embeddings for retrieval
- Chat completion for the grounded response
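
The API server's contract (enqueue, return job_id, report status) can be sketched as two framework-independent functions. This is a minimal sketch under stated assumptions, not the repository's code: `submit_chat`, `job_status`, and the dotted path `queues.worker.process_query` are hypothetical names, and `queue` is assumed to behave like an RQ `Queue` (`enqueue`, `fetch_job`) whose jobs expose `id`, `is_finished`, `is_failed`, and `result`.

```python
from typing import Any, Dict

# Hypothetical import path of the worker function. RQ accepts either a
# callable or a dotted path string when enqueueing.
PROCESS_QUERY_PATH = "queues.worker.process_query"


def submit_chat(queue: Any, query: str) -> Dict[str, Any]:
    """Enqueue the RAG job and return at once with its id."""
    job = queue.enqueue(PROCESS_QUERY_PATH, query)
    return {"status": "queued", "job_id": job.id}


def job_status(queue: Any, job_id: str) -> Dict[str, Any]:
    """Report a job's state; include the answer once it has finished."""
    job = queue.fetch_job(job_id)  # an RQ Queue returns None for unknown ids
    if job is None:
        return {"job_id": job_id, "status": "not_found"}
    if job.is_finished:
        return {"job_id": job_id, "status": "finished", "result": job.result}
    if job.is_failed:
        return {"job_id": job_id, "status": "failed"}
    return {"job_id": job_id, "status": "pending"}
```

In a FastAPI app these would presumably sit behind the /chat and /job-status routes, with `queue` built once at startup (for example `Queue(connection=Redis(...))`). Keeping the handlers free of framework imports also makes them easy to test with a fake queue.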

End-to-End Flow

1. User sends query to /chat
2. API enqueues process_query(query) and returns job_id
3. Worker picks job from queue
4. Worker retrieves relevant chunks from Qdrant
5. Worker builds grounded prompt and calls LLM
6. Worker stores result in job record
7. Client polls /job-status?job_id=... to fetch final answer
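
The worker-side steps (retrieve, build a grounded prompt, call the LLM) can be sketched as follows. This is an illustration rather than the repository's implementation: the `embed`, `search`, and `complete` parameters are hypothetical stand-ins for the OpenAI embedding call, the Qdrant similarity search, and the chat completion call, passed in so the sketch runs without network access. The prompt wording and `top_k` default are assumptions too.

```python
from typing import Callable, List, Sequence

# Hypothetical callable shapes standing in for the real clients.
Embed = Callable[[str], List[float]]                  # query text -> embedding
Search = Callable[[Sequence[float], int], List[str]]  # (embedding, k) -> chunks
Complete = Callable[[str], str]                       # prompt -> answer text


def build_grounded_prompt(query: str, chunks: Sequence[str]) -> str:
    """Number the retrieved chunks and tell the model to stay within them."""
    context = "\n\n".join(
        f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1)
    )
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def process_query(
    query: str,
    embed: Embed,
    search: Search,
    complete: Complete,
    top_k: int = 4,
) -> str:
    """Run one RAG job: embed the query, retrieve chunks, generate an answer."""
    vector = embed(query)
    chunks = search(vector, top_k)
    prompt = build_grounded_prompt(query, chunks)
    return complete(prompt)
```

In the actual worker, `process_query(query)` would presumably bind these three dependencies at module level instead of taking them as arguments. RQ stores the function's return value on the job record, which is what the status endpoint later reads.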

Architecture Diagram

flowchart LR
    U[Client / UI] --> API[FastAPI API Server]
    API --> Q[(Valkey/Redis Queue)]
    Q --> W1[Worker 1]
    Q --> W2[Worker 2]
    Q --> WN[Worker N]

    W1 --> VDB[(Qdrant Vector DB)]
    W2 --> VDB
    WN --> VDB

    W1 --> OAI[OpenAI APIs\nEmbeddings + Chat]
    W2 --> OAI
    WN --> OAI

    API -->|poll by job_id| Q
    Q --> API
    API --> U

Scale-Out Diagram

flowchart TB
    subgraph Ingress
      API1[API Pod 1]
      API2[API Pod 2]
    end

    LB[Load Balancer] --> API1
    LB --> API2

    API1 --> Q[(Queue)]
    API2 --> Q

    subgraph Worker Pool
      W1[Worker A]
      W2[Worker B]
      W3[Worker C]
      W4[Worker D]
    end

    Q --> W1
    Q --> W2
    Q --> W3
    Q --> W4

    W1 --> VDB[(Qdrant)]
    W2 --> VDB
    W3 --> VDB
    W4 --> VDB

    W1 --> LLM[LLM Provider]
    W2 --> LLM
    W3 --> LLM
    W4 --> LLM

What Scales Independently

API replicas: add more to raise request throughput
Worker replicas: add more to raise job processing throughput
Queue: move to managed Redis/Valkey for reliability
Vector DB: move to managed Qdrant for higher load and storage
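
A rough way to size the worker pool is Little's law: the average number of busy workers is about the arrival rate times the mean job duration. The sketch below divides that by a target utilization to leave headroom. The function name `workers_needed`, the 0.7 default, and the example numbers are illustrative assumptions. It also assumes each worker processes one job at a time, which is how a default RQ worker behaves.

```python
import math


def workers_needed(
    jobs_per_second: float,
    seconds_per_job: float,
    target_utilization: float = 0.7,
) -> int:
    """Estimate worker replicas from arrival rate and mean job duration."""
    if jobs_per_second < 0 or seconds_per_job < 0:
        raise ValueError("rates and durations must be non-negative")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    busy_workers = jobs_per_second * seconds_per_job  # Little's law
    return max(1, math.ceil(busy_workers / target_utilization))


# Example with made-up numbers: 5 jobs/s at 4 s each keeps about 20 workers
# busy; at 70% target utilization that rounds up to 29 replicas.
print(workers_needed(5, 4))
```

The estimate covers workers only. The LLM provider's rate limits and Qdrant's capacity cap the useful number of replicas, so real sizing should be checked against measured job latency.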

Local Run (Development)

Start queue broker (Valkey/Redis)
Start Qdrant
Start API server
Start one or more workers

Example worker scale:

Terminal 1: rq worker
Terminal 2: rq worker
Terminal 3: rq worker

All workers consume from the same queue and process jobs in parallel.
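
The competing-consumers behaviour can be illustrated in a single process with the standard library. This is an analogue only, not how RQ is implemented: real RQ workers are separate processes pulling from Valkey/Redis, whereas `run_worker_pool` (a hypothetical name) uses threads and an in-memory `queue.Queue`. The point it shows is that each job is taken by exactly one worker.

```python
import queue
import threading
from typing import Callable, Dict, List


def run_worker_pool(
    jobs: List[str],
    handler: Callable[[str], str],
    num_workers: int = 3,
) -> Dict[str, str]:
    """Drain a shared queue with several workers; each job runs exactly once."""
    pending: "queue.Queue[str]" = queue.Queue()
    for job in jobs:
        pending.put(job)

    results: Dict[str, str] = {}
    results_lock = threading.Lock()

    def worker() -> None:
        while True:
            try:
                job = pending.get_nowait()  # atomic: only one worker gets it
            except queue.Empty:
                return  # queue drained, so this worker exits
            outcome = handler(job)
            with results_lock:
                results[job] = outcome

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results
```

Unlike this sketch, an RQ worker does not exit when the queue is empty. It blocks and waits for the next job, which is why starting another `rq worker` in a new terminal adds capacity straight away.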

Future Enhancements

Add request/response logging with job IDs
Add retries and dead-letter queue strategy for failed jobs
Add timeout and circuit breaker around external API calls
Add metrics: queue depth, job latency, worker success/failure rate
Add autoscaling policy based on queue length
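
As one possible starting point for the retry item, the sketch below wraps an external call (for example an embedding or chat completion request) in bounded retries with exponential backoff. `call_with_retries` and its parameters are hypothetical, not part of the repository or of RQ. The `sleep` argument is injectable so the behaviour can be tested without waiting.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on the given exceptions with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the job fail visibly
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Inside a worker this would wrap individual provider calls, e.g. `call_with_retries(lambda: complete(prompt), retry_on=(TimeoutError,))`, where `complete` is whatever function performs the request. RQ also offers job-level retries through its `Retry` option on enqueue, which may be a better fit when the whole job should be re-run.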

Summary

This design turns RAG into a resilient async pipeline:

API remains fast and responsive
Heavy retrieval + generation work happens in background workers
Horizontal scaling is achieved by increasing worker replicas
