feat(retrieval): add ColBERT late-interaction engine (PyLate + GTE-ModernColBERT) #5
chicham wants to merge 2 commits
Code Review
This pull request introduces ColBERT-based late-interaction retrieval using the pylate library, adding a new ColbertEngine alongside the existing BM25 and SPLADE engines. The feedback identifies a critical runtime issue where models.ColBERT is initialized with unsupported arguments (document_length and query_length), which will cause a TypeError. The reviewer suggests removing these arguments from the constructor and instead passing max_length directly to the .encode() method during both indexing and searching.
```python
def search(self, index_dir: Path, texts: list[str], k: Positive) -> Ranked:
    from pylate import retrieve

    embeddings = self._encoder().encode(
        texts, batch_size=self.batch_size, is_query=True,
        show_progress_bar=self.show_progress)
    index = self._index(index_dir, override=False)
    ranked = retrieve.ColBERT(index=index).retrieve(queries_embeddings=embeddings, k=k)
    return [[(int(h["id"]), float(h["score"])) for h in ranked[q]]
            for q in range(len(texts))]
```
Pass max_length=self.query_length to the .encode() method to enforce the query token limit during retrieval.
```diff
 def search(self, index_dir: Path, texts: list[str], k: Positive) -> Ranked:
     from pylate import retrieve
     embeddings = self._encoder().encode(
         texts, batch_size=self.batch_size, is_query=True,
-        show_progress_bar=self.show_progress)
+        show_progress_bar=self.show_progress, max_length=self.query_length)
     index = self._index(index_dir, override=False)
     ranked = retrieve.ColBERT(index=index).retrieve(queries_embeddings=embeddings, k=k)
     return [[(int(h["id"]), float(h["score"])) for h in ranked[q]]
             for q in range(len(texts))]
```
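Whether a constructor or method accepts a given keyword can be checked from its signature before acting on a suggestion like this one. The sketch below uses only the standard library; `accepts_kwarg` and `StandInEncoder` are hypothetical names introduced for illustration, and the stand-in class deliberately does not mirror the real PyLate signatures.

```python
import inspect


def accepts_kwarg(func, name: str) -> bool:
    """Return True if `func` can be called with the keyword argument `name`."""
    params = inspect.signature(func).parameters
    keyword_kinds = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    if name in params and params[name].kind in keyword_kinds:
        return True
    # A **kwargs catch-all accepts any keyword (it may still ignore it).
    return any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values())


class StandInEncoder:
    """Hypothetical stand-in for an encoder class; NOT the PyLate API."""

    def __init__(self, model_name: str, query_length: int = 32):
        self.model_name = model_name
        self.query_length = query_length

    def encode(self, texts, batch_size: int = 32, is_query: bool = True):
        return [[0.0] for _ in texts]


ctor_ok = accepts_kwarg(StandInEncoder, "query_length")
encode_ok = accepts_kwarg(StandInEncoder.encode, "max_length")
```

With `pylate` installed, the same helper could be pointed at `models.ColBERT` and its `encode` method to confirm which of `document_length`, `query_length`, and `max_length` each one takes in the pinned version.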
What
Adds a third retrieval engine, ColBERT late interaction, to `retrieval/retrieval.py`, alongside the existing BM25 and SPLADE engines. It follows the same frozen `@beartype` dataclass contract (`index()`/`search()`, corpus-position → `ids.json` mapping, shared docid space), so its `run.trec` scores against the same qrels.

Model
No finance- or KPI-specialized ColBERT exists on the HF Hub (searched finance/financial/10-K/SEC/KPI across models and the PyLate tag; the only domain fine-tune is a legal one). The default is therefore the strongest general-domain late-interaction model: `lightonai/GTE-ModernColBERT-v1`. It is ModernBERT-based with an 8k context, so it indexes whole long OCR'd pages instead of truncating at the 512-token limit of a classic BERT ColBERT.

Implementation
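For orientation, late interaction scores a query against a document by taking, for each query token vector, its best-matching document token vector and summing those maxima (MaxSim). The pure-Python sketch below illustrates that scoring rule only; it is not how PyLate or the PLAID index computes it, and `dot`, `maxsim_score`, `rank`, and the toy vectors are all hypothetical. The return shape mirrors the `(id, score)` pairs that `search()` produces.

```python
from typing import Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Dot product; equals cosine similarity when both vectors are unit-length."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_vecs: Sequence[Vector], doc_vecs: Sequence[Vector]) -> float:
    """Sum over query tokens of the best similarity to any document token.

    Assumes doc_vecs is non-empty (max() would raise on an empty document).
    """
    return sum(max(dot(q, d) for d in doc_vecs) for q in query_vecs)


def rank(query_vecs, docs: dict, k: int) -> list:
    """Brute-force top-k: score every document, sort by score descending."""
    scored = [(doc_id, maxsim_score(query_vecs, vecs)) for doc_id, vecs in docs.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy 2-dimensional token embeddings (made up for illustration).
query = [[1.0, 0.0], [0.0, 1.0]]
docs = {
    0: [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]],  # has an exact match for each query token
    1: [[0.6, 0.8]],                          # one token partially matching both
}
top = rank(query, docs, k=2)
```

Document 0 ranks first here because each query token finds an exact match, whereas document 1 has to reuse a single partially matching token. A real index avoids this brute-force loop by pruning candidates through centroids first.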
- `ColbertEngine`: a PyLate PLAID index plus a shared `models.ColBERT` encoder. JVM-free (faiss/FastPlaid, C++), consistent with the existing engines' "no JVM" ethos.
- `search()` reopens the on-disk index (`override=False`); it never re-indexes, matching the bm25/splade `load()` pattern.
- Registered in `ENGINES` and the `Method` literal; added `pylate` to the PEP-723 deps.
- Every `index`/`search` parameter is a CLI flag under the `colbert` subgroup: `--model --device --doc_length --query_length --nbits --kmeans_niters --batch_size --show_progress --index_name`. (Non-knobs are structural: `is_query`, `documents_ids`, `override`, `k`.)

Verification
Smoke-tested end-to-end on a 2-report / 190-page sample:
- `index` → 190 pages, FastPlaid backend, centroids trained
- `search` reopens from disk in a fresh process
- `--queries_file` (order preserved) and the `--report` deep-pool filter (correctly returns only that report's pages)
- `--help` under the subgroup

Notes
- `--doc_length` defaults to 2048 (well under GTE-ModernColBERT's 8192 ceiling, so no silent clamp). It directly trades recall against index size and latency, since ColBERT stores one vector per token.
- `ndocs`, `ncells`, and `centroid_score_threshold` are left at library defaults; they can be promoted to flags in a follow-up if wanted.
- `--query` keeps only the last value; use `--queries_file` for multiple queries.

🤖 Generated with Claude Code
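The `--doc_length` trade-off can be put into rough numbers. The sketch below is a back-of-the-envelope estimate under stated assumptions: an embedding dimension of 128, 4-bit residuals, and an average of 1,500 tokens per page are all illustrative guesses, not values measured in this PR. It counts compressed residual bytes only and ignores centroid ids, centroids, and index metadata. Only the 190-page count and the 2048 default come from the description above; the function names are hypothetical.

```python
def stored_vectors(num_docs: int, avg_tokens: int, doc_length: int) -> int:
    """Token vectors kept when each document is truncated to doc_length tokens."""
    return num_docs * min(avg_tokens, doc_length)


def residual_bytes(num_vectors: int, dim: int, nbits: int) -> int:
    """Compressed residual storage: nbits per dimension for each stored vector."""
    return num_vectors * dim * nbits // 8


NUM_PAGES = 190    # from the smoke test in this PR
AVG_TOKENS = 1500  # assumption: average tokens per OCR'd page
DIM = 128          # assumption: per-token embedding dimension
NBITS = 4          # assumption: residual bits per dimension

at_default = stored_vectors(NUM_PAGES, AVG_TOKENS, doc_length=2048)
at_512 = stored_vectors(NUM_PAGES, AVG_TOKENS, doc_length=512)

size_default = residual_bytes(at_default, DIM, NBITS)
size_512 = residual_bytes(at_512, DIM, NBITS)
```

Under these assumptions the default keeps every token of an average page (285,000 vectors, about 18 MB of residuals), while a 512-token cap keeps roughly a third of them (97,280 vectors, about 6 MB) and drops the rest of each page from retrieval.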