Add LanceDB retrieval mode autodetection #2265
Greptile Summary

This PR adds opt-in sparse (FTS-only) LanceDB ingest and auto-detection of table capabilities so the local query CLI routes dense, hybrid, and sparse tables without explicit mode flags.

| Filename | Overview |
|---|---|
| nemo_retriever/src/nemo_retriever/common/vdb/lancedb_capabilities.py | New module that inspects a LanceDB table's schema, index metadata, and arrow schema metadata to classify it as dense/hybrid/sparse/unknown; logic and edge-case handling are thorough. |
| nemo_retriever/src/nemo_retriever/common/vdb/lancedb.py | Adds sparse schema, sparse result builder, and sparse_retrieval method; design smell where _execute_sparse_lancedb_queries instantiates a LanceDB object whose state is immediately overridden by kwargs. |
| nemo_retriever/src/nemo_retriever/graph/retriever.py | Adds per-instance LanceTableCapabilities cache, _resolve_lancedb_query_mode that auto-routes dense/hybrid/sparse, and _execute_sparse_lancedb_queries; logic is sound but the LanceDB wrapper pattern is redundant. |
| nemo_retriever/src/nemo_retriever/query/workflow.py | Introduces QueryDocumentsResult with hits + strategies; strategy detection via getattr on private _resolve_lancedb_query_mode creates a silent degradation path if the method is renamed. |
| nemo_retriever/src/nemo_retriever/ingest/execution.py | Correctly bypasses embed/vdb_upload for sparse mode and writes the FTS table via _write_sparse_lancedb_result after the pipeline completes. |
| nemo_retriever/src/nemo_retriever/ingest/plan.py | Adds sparse flag to IngestStorageOptions and ResolvedIngestPlan; correctly validates sparse+hybrid mutual exclusion and skips embed_params for sparse mode. |
| nemo_retriever/src/nemo_retriever/common/vdb/records.py | Adds to_sparse_client_vdb_records and makes require_embedding optional in _client_record_from_graph_row; both flat-list and nested-batch paths are handled consistently. |
| nemo_retriever/src/nemo_retriever/cli/query/app.py | Correctly detects whether --hybrid was explicitly passed vs defaulted using ctx.get_parameter_source, and routes through query_documents_with_metadata to get accurate strategy metadata. |
| nemo_retriever/src/nemo_retriever/cli/ingest/options.py | Adds SparseOption and changes HybridOption from --hybrid/--no-hybrid to --hybrid; the --no-hybrid removal is a backward-incompatible CLI change (covered in previous review threads). |
| nemo_retriever/tests/test_lancedb_capabilities.py | Good coverage of dense/hybrid/sparse detection, sparse-skips-embedding, hybrid-auto-enables-hybrid, and capabilities caching; all happy paths covered. |
| nemo_retriever/tests/test_root_cli_workflow.py | New sparse ingest tests verify embed/vdb_upload are skipped, schema is correct, and sparse+hybrid conflict is rejected; assertions on schema metadata are solid. |
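
The dense/hybrid/sparse/unknown classification described for lancedb_capabilities.py can be illustrated with a minimal sketch. It assumes the decision reduces to two observations, whether the table has a vector column and whether it has an FTS index; the PR's module also reads arrow schema metadata, which is not modeled here. `TableCapabilities` and `classify_mode` are hypothetical names chosen for this example, not identifiers from the PR.

```python
from dataclasses import dataclass
from typing import Literal

Mode = Literal["dense", "hybrid", "sparse", "unknown"]


@dataclass(frozen=True)
class TableCapabilities:
    """Hypothetical summary of what was observed on a table."""

    has_vector_column: bool
    has_fts_index: bool


def classify_mode(caps: TableCapabilities) -> Mode:
    """Map observed capabilities to a retrieval mode (assumed rules)."""
    if caps.has_vector_column and caps.has_fts_index:
        return "hybrid"  # both vector search and full-text search are possible
    if caps.has_vector_column:
        return "dense"  # vector search only
    if caps.has_fts_index:
        return "sparse"  # full-text search only, no embeddings needed
    return "unknown"  # nothing usable was detected


if __name__ == "__main__":
    fts_only = TableCapabilities(has_vector_column=False, has_fts_index=True)
    print(classify_mode(fts_only))
```

Keeping the classifier a pure function of an immutable capabilities record makes each of the four outcomes easy to unit test without opening a real table.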
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[retriever query / Retriever.queries] --> B[_resolve_lancedb_query_mode]
B --> C{vdb_op == lancedb?}
C -- No/graph set --> D[Skip auto-detection\nuse existing graph path]
C -- Yes --> E[_inspect_lancedb_capabilities\ncached per uri+table_name]
E --> F[inspect_lancedb_table_object]
F --> G{Detect schema metadata\nvector col / FTS indexes}
G --> H{mode?}
H -- sparse --> I[_execute_sparse_lancedb_queries\nLanceDB.sparse_retrieval FTS-only]
H -- hybrid --> J[set hybrid=True in vdb_call_kwargs\n→ _execute_queries_graph embed+fuse]
H -- dense --> K[_execute_queries_graph\nembed+vector search]
H -- unknown --> L[raise ValueError]
I --> M[normalize_retrieval_results\n→ shape_query_hits]
J --> M
K --> M
subgraph Ingest
N[--sparse flag] --> O[skip embed step\nskip vdb_upload in pipeline]
O --> P[Ingestor.ingest raw extraction]
P --> Q[to_sparse_client_vdb_records]
Q --> R[LanceDB.run sparse=True\ncreate FTS index only]
end
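
The query-side routing in the flowchart can be sketched as follows. This is a simplified illustration under stated assumptions, not the PR's implementation: `ModeResolver`, its `inspect` callback, and the returned route labels are hypothetical. Only the branch structure (skip detection when the backend is not LanceDB or a graph is already set, cache per uri and table name, raise `ValueError` on unknown) comes from the diagram.

```python
from typing import Callable, Dict, Tuple


class ModeResolver:
    """Hypothetical resolver that mirrors the flowchart's branches."""

    def __init__(self, inspect: Callable[[str, str], str]) -> None:
        # `inspect` stands in for table inspection; it returns one of
        # "dense", "hybrid", "sparse", or "unknown".
        self._inspect = inspect
        self._cache: Dict[Tuple[str, str], str] = {}

    def resolve(
        self,
        uri: str,
        table_name: str,
        vdb_op: str = "lancedb",
        graph_set: bool = False,
    ) -> str:
        # Non-LanceDB backends and caller-supplied graphs skip auto-detection.
        if vdb_op != "lancedb" or graph_set:
            return "existing-graph"

        # Inspect each (uri, table_name) pair at most once per resolver.
        key = (uri, table_name)
        if key not in self._cache:
            self._cache[key] = self._inspect(uri, table_name)
        mode = self._cache[key]

        if mode == "unknown":
            raise ValueError(
                f"cannot determine retrieval mode for table {table_name!r}"
            )
        return mode


if __name__ == "__main__":
    resolver = ModeResolver(lambda uri, table: "sparse")
    print(resolver.resolve("./lancedb", "docs"))
```

In this sketch an "unknown" result is cached as well, so repeated queries against an unclassifiable table fail fast without re-inspecting it; whether the PR's cache behaves the same way is not stated in the review.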
Reviews (4): Last reviewed commit: "Add LanceDB retrieval mode autodetection"
Summary
Validation
test_lancedb_capabilities.py, test_root_query_cli.py, and the sparse/hybrid root ingest tests passed.