
Add multi-representation search tutorial notebook#103

Open
Dylancouzon wants to merge 7 commits into master from multi-representation-search

Conversation

@Dylancouzon

Summary

  • Adds a notebook companion to the new multi-representation search tutorial landing in
    qdrant/landing_page#2334.
  • Walks through the recommended retrieval pipeline (three named-vector prefetches with Reciprocal Rank Fusion plus document-level grouping) step by step on a 2,000-paper ML/CS arXiv slice.
  • Prints qualitative top-K results at each step (title, category tags, matching-chunk excerpt) so the reader can see what each step adds without depending on hand-curated qrels.

Scope

The tutorial covers the why of the design; this notebook covers the how, end to end:

  1. Stream and filter 2,000 papers from gfissore/arxiv-abstracts-2021 (cs.LG / cs.CV / cs.CL / cs.AI / stat.ML) so the demo queries have natural matches.
  2. Create a collection with three named dense vectors (dense_chunk, dense_title, dense_summary) and one named sparse vector (sparse_keywords) using FastEmbed (BAAI/bge-small-en-v1.5 + Qdrant/bm25).
  3. Ingest one point per chunk, with title and summary embeddings reused across the same paper's chunks.
  4. Walk through five retrieval steps that each add one capability:
    • Dense over chunks (baseline)
    • Add BM25 sparse keywords + RRF
    • Add a dense_title prefetch
    • Switch to query_points_groups for document-level grouping
    • Swap RRF for a FormulaQuery boost
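
The Reciprocal Rank Fusion used from step 2 onward can be sketched in plain Python. Qdrant computes this server-side; the constant k=60 below is the conventional default from the original RRF formulation, not something the notebook configures, and the two hit lists are hypothetical:

```python
# Sketch of Reciprocal Rank Fusion (RRF): each result list contributes
# 1/(k + rank) per document, so documents ranked well in several lists
# rise to the top even if no single list ranks them first.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["p3", "p1", "p2"]   # hypothetical dense_chunk ranking
sparse_hits = ["p1", "p4", "p3"]  # hypothetical BM25 ranking
print(rrf([dense_hits, sparse_hits]))  # p1 leads: ranked high in both lists
```

This is why step 2 helps: a paper that is merely decent in both the dense and sparse lists outranks one that only one representation likes.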

File

  • multi-representation-search/multi-representation-search.ipynb — single notebook, 22 cells.

Requirements

The notebook is tested on Python 3.11 and 3.12. Some FastEmbed dependencies don't yet ship wheels for Python 3.14.

Companion notebook to the multi-representation search tutorial in qdrant/landing_page#2334.
Builds the recommended retrieval pipeline (three named-vector prefetches, RRF fusion,
document-level grouping) step by step against a 2,000-paper ML/CS arXiv slice, with
qualitative top-K results at each step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries as filter

Brings the notebook to 1:1 parity with the refactored tutorial:
- Switches to Qdrant Cloud Inference for dense + core BM25 for sparse
  (drops FastEmbed from the notebook dependencies)
- Renames dense_summary to dense_abstract and sparse_keywords to sparse_title
- Moves categories from BM25 input to a filterable payload field with a
  keyword index; sparse_title now indexes only the title (avg_len=10)
- Adds dense_abstract as a fourth prefetch with its own Step 4, so the
  build-up now reads: chunk -> +sparse -> +title -> +abstract -> group -> formula
- Adds an optional tags filter to retrieve_grouped
- Updates probe_queries.py to match the new schema and step structure
Wraps every $score term in MultExpression (with weight 1.0 on chunk)
so the formula reads uniformly. Adds a comment above the QdrantClient
init pointing readers to https://cloud.qdrant.io for their own url
and api_key.
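
What the MultExpression-wrapped formula evaluates to is just a weighted sum of the prefetch $score terms. A plain-Python sketch over three of the scores; the 1.0 chunk weight matches the commit above, while the other weights are illustrative placeholders:

```python
# The formula's value: sum of weight * score per prefetch, with the
# chunk term carrying weight 1.0. Other weights here are made up.
def boosted_score(chunk, title, abstract, w_title=0.5, w_abstract=0.25):
    return 1.0 * chunk + w_title * title + w_abstract * abstract

print(boosted_score(0.8, 0.6, 0.4))  # 0.8 + 0.3 + 0.1
```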
Adds the missing keyword index on document_id so the grouping step
(query_points_groups with group_by="document_id") works under strict
mode. Tweaks the upload_points call to batch_size=256, parallel=2 for
faster ingestion against Cloud Inference. Adds an expected-results
summary to the wrap-up so readers running the same query against the
same dataset can compare their output to the reference.
Dylancouzon added a commit to qdrant/landing_page that referenced this pull request May 12, 2026
Adds the missing keyword index on document_id so grouping works under
strict mode (Cloud default), and tunes the upload_points call to
batch_size=256, parallel=2 for faster ingestion. Mirrors the notebook
in qdrant/examples#103.
