
Add multi-representation search tutorial notebook#103

Open
Dylancouzon wants to merge 7 commits into master from multi-representation-search

Conversation

@Dylancouzon

Summary

  • Adds a notebook companion to the new multi-representation search tutorial landing in
    qdrant/landing_page#2334.
  • Walks through the recommended retrieval pipeline (three named-vector prefetches with Reciprocal Rank Fusion plus document-level grouping) step by step on a 2,000-paper ML/CS arXiv slice.
  • Prints qualitative top-K results at each step (title, category tags, matching-chunk excerpt) so the reader can see what each step adds without depending on hand-curated qrels.

Scope

The tutorial covers the why of the design; this notebook covers the how, end to end:

  1. Stream and filter 2,000 papers from gfissore/arxiv-abstracts-2021 (cs.LG / cs.CV / cs.CL / cs.AI / stat.ML) so the demo queries have natural matches.
  2. Create a collection with three named dense vectors (dense_chunk, dense_title, dense_summary) and one named sparse vector (sparse_keywords) using FastEmbed (BAAI/bge-small-en-v1.5 + Qdrant/bm25).
  3. Ingest one point per chunk, with title and summary embeddings reused across the same paper's chunks.
  4. Walk through five retrieval steps that each add one capability:
    • Dense over chunks (baseline)
    • Add BM25 sparse keywords + RRF
    • Add a dense_title prefetch
    • Switch to query_points_groups for document-level grouping
    • Swap RRF for a FormulaQuery boost
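
The Reciprocal Rank Fusion used from step 2 onward can be sketched in plain Python. Qdrant computes this server-side; the constant k=60 below is the conventional default from the original RRF formulation, not something the notebook configures, and the two hit lists are hypothetical:

```python
# Sketch of Reciprocal Rank Fusion (RRF): each result list contributes
# 1/(k + rank) per document, so documents ranked well in several lists
# rise to the top even if no single list ranks them first.
def rrf(ranked_lists, k=60):
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense_hits = ["p3", "p1", "p2"]   # hypothetical dense_chunk ranking
sparse_hits = ["p1", "p4", "p3"]  # hypothetical BM25 ranking
print(rrf([dense_hits, sparse_hits]))  # p1 leads: ranked high in both lists
```

This is why step 2 helps: a paper that is merely decent in both the dense and sparse lists outranks one that only one representation likes.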

File

  • multi-representation-search/multi-representation-search.ipynb — single notebook, 22 cells.

Requirements

The notebook is tested on Python 3.11 and 3.12. Some FastEmbed dependencies don't yet ship wheels for Python 3.14.

Companion notebook to the multi-representation search tutorial in qdrant/landing_page#2334.
Builds the recommended retrieval pipeline (three named-vector prefetches, RRF fusion,
document-level grouping) step by step against a 2,000-paper ML/CS arXiv slice, with
qualitative top-K results at each step.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ries as filter

Brings the notebook to 1:1 parity with the refactored tutorial:
- Switches to Qdrant Cloud Inference for dense + core BM25 for sparse
  (drops FastEmbed from the notebook dependencies)
- Renames dense_summary to dense_abstract and sparse_keywords to sparse_title
- Moves categories from BM25 input to a filterable payload field with a
  keyword index; sparse_title now indexes only the title (avg_len=10)
- Adds dense_abstract as a fourth prefetch with its own Step 4, so the
  build-up now reads: chunk -> +sparse -> +title -> +abstract -> group -> formula
- Adds an optional tags filter to retrieve_grouped
- Updates probe_queries.py to match the new schema and step structure
Wraps every $score term in MultExpression (with weight 1.0 on chunk)
so the formula reads uniformly. Adds a comment above the QdrantClient
init pointing readers to https://cloud.qdrant.io for their own url
and api_key.
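
What the MultExpression-wrapped formula evaluates to is just a weighted sum of the prefetch $score terms. A plain-Python sketch over three of the scores; the 1.0 chunk weight matches the commit above, while the other weights are illustrative placeholders:

```python
# The formula's value: sum of weight * score per prefetch, with the
# chunk term carrying weight 1.0. Other weights here are made up.
def boosted_score(chunk, title, abstract, w_title=0.5, w_abstract=0.25):
    return 1.0 * chunk + w_title * title + w_abstract * abstract

print(boosted_score(0.8, 0.6, 0.4))  # 0.8 + 0.3 + 0.1
```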
Adds the missing keyword index on document_id so the grouping step
(query_points_groups with group_by="document_id") works under strict
mode. Tweaks the upload_points call to batch_size=256, parallel=2 for
faster ingestion against Cloud Inference. Adds an expected-results
summary to the wrap-up so readers running the same query against the
same dataset can compare their output to the reference.
Dylancouzon added a commit to qdrant/landing_page that referenced this pull request May 12, 2026
Adds the missing keyword index on document_id so grouping works under
strict mode (Cloud default), and tunes the upload_points call to
batch_size=256, parallel=2 for faster ingestion. Mirrors the notebook
in qdrant/examples#103.
