
๐Ÿง ๐Ÿ–ผ๏ธ๐Ÿ“„ Multimodal RAG Demo (Qwen3-VL)

Try the future of RAG - multimodal RAG - that effortlessly brings PDFs, images, and text together in one index: ingest → retrieve → rerank → (optionally) answer with sources.

[Screenshot: Multimodal RAG UI]


✨ Why multimodal RAG?

Traditional RAG is great when everything is plain text. Real documents aren’t.

PDFs often contain:

  • 📊 charts and tables
  • 🧾 scanned pages / images of text
  • 🧩 figures with crucial labels
  • 🧠 context split across captions + visuals

Multimodal RAG lets you retrieve from both:

  • ๐Ÿ“ plain text (searchable chunks)
  • ๐Ÿ–ผ๏ธ rendered page images / standalone images (visual understanding)

That means you can ask questions like:

  • “Which page contains the table with latency numbers?”
  • “Where is the screenshot that shows the error dialog?”

…and get results that actually reflect the document as it exists, not just what text extraction managed to capture.

You can even search the index using example images, not just with text queries!


🚀 Why Qwen3-VL for this demo?

This demo uses Apache-2.0 Qwen3-VL models designed specifically for vision-language embedding and reranking:

  • 🧲 Qwen3-VL Embedding: turns text + images into vectors in the same semantic space
    → You can retrieve “the right page image” even if the answer isn’t explicitly in extracted text.

  • 🏁 Qwen3-VL Reranker (optional): reorders the initial candidates with a stronger cross-encoder
    → Better top results, fewer “almost relevant” hits.
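The shared-space idea can be sketched in a few lines. The vectors below are hand-made stand-ins (in the demo they would come from the Qwen3-VL embedding model), but they illustrate the key point: text chunks and page images live in one index and are ranked against the same query vector.

```python
# Illustrative sketch: retrieval in a shared text+image embedding space.
# Vectors are hand-made stand-ins, not real model outputs.
from math import sqrt

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

# One index holding both modalities: text chunks and rendered page images.
index = {
    "chunk: latency table, p95 = 120 ms": [0.9, 0.1, 0.0],   # text item
    "page-image: benchmark chart, page 7": [0.8, 0.2, 0.1],  # image item
    "chunk: installation instructions":    [0.0, 0.1, 0.9],  # unrelated text
}

# Embedding of a query like "Which page has the latency numbers?" (stand-in).
query_vec = [1.0, 0.0, 0.0]

# Rank every item, regardless of modality, by similarity to the query.
ranked = sorted(index, key=lambda k: cosine(index[k], query_vec), reverse=True)
print(ranked[0])  # → chunk: latency table, p95 = 120 ms
```

In the real pipeline, the reranker would then re-score the top candidates with a cross-encoder before they are shown (or passed to the generator).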


🧰 What you get

  • 📥 Ingest PDFs (page images + extracted text), images, and text files
  • 🧲 Multimodal embeddings with Qwen3-VL
  • 🏁 Multimodal reranker support (toggleable) for improved ordering
  • 📝🖼️ Text or image queries (or both)
  • ✏️ Optional answer generation with source context:
    • fully local via transformers, or
    • via an OpenAI-compatible API (local or remote)
  • 💾 Disk-backed index for quick restarts
  • 🌐 Web UI at / for ingest + query

๐Ÿ–ฅ๏ธ Quick starts

Open the UI after launch: http://localhost:8000/

0) Retrieval only (no answer generation) 🔎

uv run qwen3vl_multimodal_rag_server.py

1) Fully local (built-in generation via transformers) 🏠

uv run qwen3vl_multimodal_rag_server.py --enable-generator

2) Local LLM via LM Studio or any other OpenAI-compatible API 🔌

uv run qwen3vl_multimodal_rag_server.py \
  --enable-generator \
  --generator-backend openai \
  --generator-remote-endpoint http://localhost:1234/v1

Tip: set OPENAI_API_KEY in a .env file. Many local servers accept any value.
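A minimal .env might look like this (the key name is from this README; the value is a placeholder, since many local servers accept any value):

```shell
# .env (loaded automatically at server startup)
OPENAI_API_KEY=any-value-for-local-servers
```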

3) Remote OpenAI answer generation ☁️

uv run qwen3vl_multimodal_rag_server.py \
  --enable-generator \
  --generator-backend openai \
  --generator-remote-model gpt-5-mini \
  --generator-remote-reasoning-effort minimal

Create a .env with OPENAI_API_KEY=... before starting.

Typical workflow

  1. 📥 Drop a few PDFs / images
  2. ❓ Ask a question
  3. 🔎 See retrieved sources (text chunks + page images)
  4. ✏️ Turn on generation when you want an answer with citations

Parameters

All available CLI parameters (defaults shown):

| Flag | Default | Description |
| --- | --- | --- |
| `--host` | `0.0.0.0` | Bind address for the server. |
| `--port` | `8000` | Port for the server. |
| `--log-level` | `info` | Logging level (`debug`, `info`, `warning`, `error`). |
| `--data-dir` | `./qwen3vl_rag_data` | Storage directory for the index, originals, and derived assets. |
| `--device` | `auto` | Device selection (`auto`, `cuda`, `mps`, `cpu`). |
| `--embedder-model` | `Qwen/Qwen3-VL-Embedding-2B` | Embedding model name. |
| `--embedder-quant` | `off` | Embedder quantization (`off`, `8bit`, `4bit`). |
| `--reranker-model` | `Qwen/Qwen3-VL-Reranker-2B` | Reranker model name. |
| `--reranker-quant` | `off` | Reranker quantization (`off`, `8bit`, `4bit`). |
| `--disable-reranker` | `false` | Disable reranking. |
| `--generator-model` | `Qwen/Qwen3-VL-2B-Instruct` | Local generator model name. |
| `--generator-quant` | `off` | Generator quantization (`off`, `8bit`, `4bit`). |
| `--enable-generator` | `false` | Enable answer generation. |
| `--generator-backend` | `transformers` | Generator backend (`transformers`, `openai`). |
| `--generator-remote-endpoint` | `None` | Base URL for OpenAI-compatible APIs. |
| `--generator-remote-model` | `None` | Model name for OpenAI-compatible APIs. |
| `--generator-remote-reasoning-effort` | `medium` | Reasoning effort for remote generator (`none`, `minimal`, `low`, `medium`, `high`, `xhigh`). |
| `--text-chunk-size` | `1200` | Text chunk size during ingestion. |
| `--text-chunk-overlap` | `200` | Overlap between text chunks. |
| `--pdf-dpi` | `75` | DPI for rendering PDF pages (also used to downscale images). |
| `--pdf-max-pages` | `0` | If >0, only ingest the first N pages of a PDF. |
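The chunking flags can be illustrated with a small sketch. This is a plain fixed-size splitter with overlap, matching the defaults above; the server's actual splitter may differ (for example, it may respect sentence boundaries).

```python
# Sketch of overlapping fixed-size chunking, matching the defaults
# --text-chunk-size 1200 and --text-chunk-overlap 200.
def chunk_text(text: str, size: int = 1200, overlap: int = 200) -> list[str]:
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    step = size - overlap  # how far the window advances each iteration
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already reached the end of the text
    return chunks

doc = "x" * 3000
print([len(p) for p in chunk_text(doc)])  # → [1200, 1200, 1000]
```

Each consecutive pair of chunks shares 200 characters, so a sentence cut at a chunk boundary still appears whole in one of the two chunks.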

Notes

  • .env is loaded automatically at startup (for OPENAI_API_KEY or other env vars).
  • Retrieval defaults: top_k=4, rerank_k=4 (can be overridden per query in the API).
  • The UI exposes /ingest and /query and shows metadata from /health.
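As a sketch, a per-query override of the retrieval defaults might look like the following. The endpoint path and the defaults come from this README, but the JSON field names (`query`, `top_k`, `rerank_k`) are assumptions; check the server source or the UI's network requests for the real schema.

```python
# Hypothetical sketch of calling the /query endpoint with per-query overrides.
# Field names are assumptions, not confirmed against the server's API.
import json
import urllib.request

payload = {
    "query": "Which page contains the table with latency numbers?",
    "top_k": 8,      # override the default of 4
    "rerank_k": 8,   # override the default of 4
}

def post_query(endpoint: str = "http://localhost:8000/query") -> dict:
    """POST the payload as JSON and return the decoded response."""
    req = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Call post_query() while the server is running to get ranked sources back.
```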