
fishdev20/async-queues-rag


Scalable RAG with Async Queues and Distributed Workers

Reference repository: fishdev20/rag-chat

This project extends a basic RAG chatbot into a scalable asynchronous system. Instead of processing each user request inside the API request/response cycle, requests are queued and processed by background workers.

Why This Architecture

A synchronous RAG API can fail under load because each request performs retrieval and the LLM call inline. This leads to:

  • High API latency
  • Request timeouts
  • Poor concurrency
  • Difficult horizontal scaling

A queue plus workers addresses this by decoupling request intake from the execution of heavy RAG jobs.

In the original synchronous approach, each request holds a server worker for the full duration of retrieval and generation. A burst of concurrent requests therefore exhausts the server's worker pool, and later requests wait behind earlier ones until they complete or time out.
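
To make the blocking effect concrete, here is a small back-of-the-envelope calculation. The numbers (4 server workers, 5 seconds per RAG request, a burst of 20 requests) are hypothetical and chosen only for illustration; they are not measurements from this project.

```python
import math


def max_throughput(server_workers: int, seconds_per_request: float) -> float:
    """Upper bound on requests/second when each request holds a worker until done."""
    return server_workers / seconds_per_request


def worst_case_wait(concurrent_requests: int, server_workers: int,
                    seconds_per_request: float) -> float:
    """Seconds until the last request of a burst finishes, assuming requests run in waves."""
    waves = math.ceil(concurrent_requests / server_workers)
    return waves * seconds_per_request


# Hypothetical load: 4 server workers, 5 s per inline RAG request, burst of 20 requests.
print(max_throughput(4, 5.0))       # 0.8 requests per second
print(worst_case_wait(20, 4, 5.0))  # 25.0 seconds for the last request
```

With a 10-second client timeout, most of that burst would fail under these assumptions even though every individual request is healthy.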

Core Components

  • API Server (FastAPI)

    • Accepts user requests
    • Enqueues jobs to Redis/Valkey queue
    • Returns job_id immediately
    • Exposes job status endpoint
  • Queue Broker (Valkey/Redis + RQ)

    • Stores pending jobs
    • Delivers jobs to available workers
  • Distributed Worker Pool

    • Multiple identical worker processes/containers
    • Each worker runs process_query()
    • Pulls jobs from the same queue
  • Vector DB (Qdrant)

    • Stores embeddings and metadata
    • Returns top relevant chunks for user query
  • LLM + Embedding API (OpenAI)

    • Embedding for retrieval
    • Chat completion for grounded response
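
As a rough illustration of how the worker ties the vector DB and the LLM together, the sketch below shows one possible shape for `process_query()`. It is not the repository's implementation: the `retrieve` and `generate` parameters and the `build_grounded_prompt` helper are hypothetical stand-ins for the Qdrant search and the OpenAI chat call, injected as plain callables so the example runs without any external service.

```python
from typing import Callable, Sequence


def build_grounded_prompt(query: str, chunks: Sequence[str]) -> str:
    """Combine retrieved chunks and the user question into a single prompt."""
    context = "\n\n".join(f"[{i}] {chunk}" for i, chunk in enumerate(chunks, start=1))
    return (
        "Answer the question using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {query}"
    )


def process_query(
    query: str,
    retrieve: Callable[[str], Sequence[str]],  # hypothetical: wraps the vector search
    generate: Callable[[str], str],            # hypothetical: wraps the chat completion
) -> str:
    """Retrieve relevant chunks, build a grounded prompt, and ask the LLM."""
    chunks = retrieve(query)
    prompt = build_grounded_prompt(query, chunks)
    return generate(prompt)


# Usage with stubs in place of Qdrant and OpenAI.
answer = process_query(
    "What is RQ?",
    retrieve=lambda q: ["RQ is a Python job queue backed by Redis."],
    generate=lambda prompt: f"(stub answer for a prompt of {len(prompt)} characters)",
)
print(answer)
```

In a real worker the two callables would more likely be module-level clients than arguments, since the flow described here enqueues `process_query(query)` with only the query.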

End-to-End Flow

  1. User sends query to /chat
  2. API enqueues process_query(query) and returns job_id
  3. Worker picks job from queue
  4. Worker retrieves relevant chunks from Qdrant
  5. Worker builds grounded prompt and calls LLM
  6. Worker stores result in job record
  7. Client polls /job-status?job_id=... to fetch final answer
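
The steps above can be modelled in a few lines of standard-library Python. This is a minimal in-memory sketch, not the project's code: `queue.Queue` and threads stand in for Valkey/Redis, RQ, and separate worker processes, and the `submit`, `job_status`, and `jobs` names are hypothetical. The status strings are chosen to resemble RQ's job states.

```python
import queue
import threading
import uuid

jobs: dict[str, dict] = {}            # job_id -> {"status": ..., "result": ...}
pending: queue.Queue = queue.Queue()  # stand-in for the broker-backed queue


def process_query(query: str) -> str:
    # Placeholder for retrieval plus the LLM call.
    return f"answer to: {query}"


def submit(query: str) -> str:
    """Role of the /chat handler: record the job, enqueue it, return at once."""
    job_id = uuid.uuid4().hex
    jobs[job_id] = {"status": "queued", "result": None}
    pending.put((job_id, query))
    return job_id


def job_status(job_id: str) -> dict:
    """Role of the /job-status handler: report state without blocking."""
    return jobs.get(job_id, {"status": "not_found", "result": None})


def worker() -> None:
    """Pull jobs from the shared queue and store each result in the job record."""
    while True:
        job_id, query = pending.get()
        jobs[job_id]["status"] = "started"
        try:
            jobs[job_id]["result"] = process_query(query)
            jobs[job_id]["status"] = "finished"
        except Exception as exc:
            jobs[job_id]["result"] = str(exc)
            jobs[job_id]["status"] = "failed"
        finally:
            pending.task_done()


# Three identical workers consuming from the same queue.
for _ in range(3):
    threading.Thread(target=worker, daemon=True).start()

job_id = submit("What is RAG?")
pending.join()  # a real client would poll job_status() instead of joining
print(job_status(job_id))
```

With RQ, the roles of `submit` and `job_status` would typically be filled by `Queue.enqueue(...)` and `Job.fetch(...)`, with the job record stored in the broker rather than in a process-local dict.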

Architecture Diagram

flowchart LR
    U[Client / UI] --> API[FastAPI API Server]
    API --> Q[(Valkey/Redis Queue)]
    Q --> W1[Worker 1]
    Q --> W2[Worker 2]
    Q --> WN[Worker N]

    W1 --> VDB[(Qdrant Vector DB)]
    W2 --> VDB
    WN --> VDB

    W1 --> OAI[OpenAI APIs\nEmbeddings + Chat]
    W2 --> OAI
    WN --> OAI

    API -->|poll by job_id| Q
    Q --> API
    API --> U

Scale-Out Diagram

flowchart TB
    subgraph Ingress
      API1[API Pod 1]
      API2[API Pod 2]
    end

    LB[Load Balancer] --> API1
    LB --> API2

    API1 --> Q[(Queue)]
    API2 --> Q

    subgraph Worker Pool
      W1[Worker A]
      W2[Worker B]
      W3[Worker C]
      W4[Worker D]
    end

    Q --> W1
    Q --> W2
    Q --> W3
    Q --> W4

    W1 --> VDB[(Qdrant)]
    W2 --> VDB
    W3 --> VDB
    W4 --> VDB

    W1 --> LLM[LLM Provider]
    W2 --> LLM
    W3 --> LLM
    W4 --> LLM

What Scales Independently

  • API replicas: for request throughput
  • Worker replicas: for job processing throughput
  • Queue: managed Redis/Valkey for reliability
  • Vector DB: upgrade to managed Qdrant for higher load/storage
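
Because worker replicas scale separately from the API, their count can be derived from the backlog alone. The function below is one illustrative sizing rule, not something the repository implements; `jobs_per_worker`, `min_workers`, and `max_workers` are hypothetical tuning knobs.

```python
import math


def desired_workers(queue_depth: int, jobs_per_worker: int = 5,
                    min_workers: int = 1, max_workers: int = 20) -> int:
    """Pick a worker count so each worker has at most `jobs_per_worker` queued jobs."""
    if jobs_per_worker <= 0:
        raise ValueError("jobs_per_worker must be positive")
    needed = math.ceil(queue_depth / jobs_per_worker)
    # Clamp to the allowed range so the pool never drops to zero or grows unbounded.
    return max(min_workers, min(max_workers, needed))


print(desired_workers(0))    # 1  (idle queue, keep the minimum)
print(desired_workers(12))   # 3  (ceil(12 / 5))
print(desired_workers(500))  # 20 (capped at max_workers)
```

The upper cap matters in practice because every extra worker also adds load on the vector DB and the LLM provider.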

Local Run (Development)

  1. Start queue broker (Valkey/Redis)
  2. Start Qdrant
  3. Start API server
  4. Start one or more workers

Example worker scale:

  • Terminal 1: rq worker
  • Terminal 2: rq worker
  • Terminal 3: rq worker

All workers consume from the same queue and process jobs in parallel.

Future Enhancements

  • Add request/response logging with job IDs
  • Add retries and dead-letter queue strategy for failed jobs
  • Add timeout and circuit breaker around external API calls
  • Add metrics: queue depth, job latency, worker success/failure rate
  • Add autoscaling policy based on queue length
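
As a starting point for the retry item, here is a minimal sketch of retrying a flaky external call with exponential backoff. The `call_with_retries` helper and its parameters are hypothetical; RQ also has its own job-level retry support, which may be a better fit than retrying inside the job.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    fn: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,  # injectable so tests need not wait
) -> T:
    """Call `fn`, retrying on any exception with delays of base_delay * 2**n."""
    if attempts < 1:
        raise ValueError("attempts must be >= 1")
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")


# Usage: a stub that fails twice before succeeding.
state = {"calls": 0}


def flaky_llm_call() -> str:
    state["calls"] += 1
    if state["calls"] < 3:
        raise RuntimeError("temporary upstream error")
    return "ok"


print(call_with_retries(flaky_llm_call, base_delay=0.01))  # ok
```

A production version would usually retry only on errors known to be transient (timeouts, rate limits) and let permanent failures go straight to the dead-letter path.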

Summary

This design turns RAG into a resilient async pipeline:

  • API remains fast and responsive
  • Heavy retrieval + generation work happens in background workers
  • Horizontal scaling is achieved by increasing worker replicas
