Add LanceDB retrieval mode autodetection#2265

Open
jioffe502 wants to merge 1 commit into
NVIDIA:mainfrom
jioffe502:codex/lancedb-retrieval-mode-autodetect

@jioffe502

Collaborator

Summary

  • Add opt-in sparse LanceDB ingest that writes text/metadata/source/id rows and creates an FTS index.
  • Add LanceDB table capability detection so local query auto-routes dense, hybrid, and sparse tables without mode flags.
  • Keep existing output shape and remove negative hybrid CLI flags from the local query/ingest surfaces.

Validation

  • Ruff passed on touched files.
  • Focused tests: test_lancedb_capabilities.py, test_root_query_cli.py, and sparse/hybrid root ingest tests passed.
  • A live smoke test on GPU 0 covered sparse ingest/query, dense query, hybrid query, and the local query help flag surface.

@jioffe502 jioffe502 requested review from a team as code owners June 24, 2026 19:32
@jioffe502 jioffe502 requested a review from jdye64 June 24, 2026 19:32
@jioffe502 jioffe502 force-pushed the codex/lancedb-retrieval-mode-autodetect branch from c958184 to 782b40b Compare June 24, 2026 19:36
@greptile-apps

greptile-apps Bot commented Jun 24, 2026

Contributor

Greptile Summary

This PR adds opt-in sparse (FTS-only) LanceDB ingest and auto-detection of table capabilities so the local query CLI routes dense, hybrid, and sparse tables without explicit mode flags. The QueryRetrievalOptions.hybrid field changes from bool = False to bool | None = None, shifting the default behavior from guaranteed dense retrieval to auto-detected retrieval.

  • Sparse ingest path: skips embedding, writes a text/metadata/source/id table with a schema-level retrieval_mode=sparse metadata tag, and builds an FTS index.
  • Auto-detection (lancedb_capabilities.py): inspects arrow schema metadata, FTS index presence, and vector column presence to classify tables; results are cached per (uri, table_name) pair on Retriever to avoid repeated IO on long-lived instances.
  • CLI surface: --hybrid/--no-hybrid is simplified to --hybrid (a flag), and --sparse is added to both ingest and query; strategy metadata (strategies field) is now accurately derived from the resolved retrieval mode rather than being hardcoded.
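The detection and caching described above can be sketched roughly as follows. This is a minimal, standard-library-only illustration and not the code in lancedb_capabilities.py: `TableCapabilities`, `classify_table`, `CapabilityCache`, the `"vector"` column name, the string-keyed metadata mapping, and the precedence between the three signals are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Mapping, Sequence, Tuple


@dataclass(frozen=True)
class TableCapabilities:
    """Hypothetical summary of what a table can serve."""
    has_vector: bool
    has_fts: bool
    mode: str  # "dense", "hybrid", "sparse", or "unknown"


def classify_table(
    schema_metadata: Mapping[str, str],
    column_names: Sequence[str],
    fts_indexed_columns: Sequence[str],
    vector_column: str = "vector",  # assumed column name
) -> TableCapabilities:
    """Classify a table from schema metadata, vector column, and FTS indexes."""
    has_vector = vector_column in column_names
    has_fts = len(fts_indexed_columns) > 0
    # Assumed precedence: an explicit sparse tag wins when an FTS index exists.
    if schema_metadata.get("retrieval_mode") == "sparse" and has_fts:
        mode = "sparse"
    elif has_vector and has_fts:
        mode = "hybrid"
    elif has_vector:
        mode = "dense"
    elif has_fts:
        mode = "sparse"
    else:
        mode = "unknown"
    return TableCapabilities(has_vector, has_fts, mode)


class CapabilityCache:
    """Caches one inspection result per (uri, table_name) pair."""

    def __init__(self, inspect: Callable[[str, str], TableCapabilities]) -> None:
        self._inspect = inspect
        self._cache: Dict[Tuple[str, str], TableCapabilities] = {}

    def get(self, uri: str, table_name: str) -> TableCapabilities:
        key = (uri, table_name)
        if key not in self._cache:
            # Only the first lookup for a given key pays the inspection cost.
            self._cache[key] = self._inspect(uri, table_name)
        return self._cache[key]
```

A cache keyed this way never notices a table that is re-ingested in a different mode while the instance is alive, so a long-lived caller may want an explicit invalidation hook.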

Confidence Score: 5/5

Safe to merge; the auto-detection path is well-tested and the capability cache prevents repeated IO on long-lived Retriever instances.

The core logic — schema metadata tagging, FTS index detection, capability-driven routing, and the separate sparse write path — is correct and covered by focused unit tests. The two design observations (LanceDB wrapper in sparse query execution, getattr on a private method for strategy reporting) do not affect correctness in the current implementation. The backward-incompatible CLI change (--no-hybrid removal) was already flagged in a previous review round.

nemo_retriever/src/nemo_retriever/graph/retriever.py (_execute_sparse_lancedb_queries) and nemo_retriever/src/nemo_retriever/query/workflow.py (strategy detection coupling) warrant a second look if either file is refactored.
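To make the strategy-detection concern concrete, here is a small sketch of how a `getattr` lookup on a private method can degrade silently after a rename, next to a stricter variant that fails loudly. Only the method name `_resolve_lancedb_query_mode` comes from the review; the zero-argument signature, the `["dense"]` fallback, and both helper functions are hypothetical.

```python
class Retriever:
    def _resolve_lancedb_query_mode(self) -> str:
        # Stand-in for the real resolver; signature is assumed.
        return "sparse"


class RenamedRetriever:
    # Same behaviour, but the private method was renamed in a refactor.
    def _resolve_query_mode(self) -> str:
        return "sparse"


def strategies_lenient(retriever) -> list:
    """Falls back to a default when the private method is missing."""
    resolver = getattr(retriever, "_resolve_lancedb_query_mode", None)
    if resolver is None:
        return ["dense"]  # assumed fallback; now silently wrong for sparse tables
    return [resolver()]


def strategies_strict(retriever) -> list:
    """Raises instead of guessing when the private method is missing."""
    resolver = getattr(retriever, "_resolve_lancedb_query_mode", None)
    if not callable(resolver):
        raise AttributeError("retriever does not expose a query-mode resolver")
    return [resolver()]
```

With the lenient helper, `RenamedRetriever` reports a dense strategy even though it would run a sparse query; the strict helper turns the same rename into an immediate error that a test can catch.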

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_retriever/src/nemo_retriever/common/vdb/lancedb_capabilities.py | New module that inspects a LanceDB table's schema, index metadata, and arrow schema metadata to classify it as dense/hybrid/sparse/unknown; logic and edge-case handling are thorough. |
| nemo_retriever/src/nemo_retriever/common/vdb/lancedb.py | Adds sparse schema, sparse result builder, and sparse_retrieval method; design smell where _execute_sparse_lancedb_queries instantiates a LanceDB object whose state is immediately overridden by kwargs. |
| nemo_retriever/src/nemo_retriever/graph/retriever.py | Adds per-instance LanceTableCapabilities cache, _resolve_lancedb_query_mode that auto-routes dense/hybrid/sparse, and _execute_sparse_lancedb_queries; logic is sound but the LanceDB wrapper pattern is redundant. |
| nemo_retriever/src/nemo_retriever/query/workflow.py | Introduces QueryDocumentsResult with hits + strategies; strategy detection via getattr on private _resolve_lancedb_query_mode creates a silent degradation path if the method is renamed. |
| nemo_retriever/src/nemo_retriever/ingest/execution.py | Correctly bypasses embed/vdb_upload for sparse mode and writes the FTS table via _write_sparse_lancedb_result after the pipeline completes. |
| nemo_retriever/src/nemo_retriever/ingest/plan.py | Adds sparse flag to IngestStorageOptions and ResolvedIngestPlan; correctly validates sparse+hybrid mutual exclusion and skips embed_params for sparse mode. |
| nemo_retriever/src/nemo_retriever/common/vdb/records.py | Adds to_sparse_client_vdb_records and makes require_embedding optional in _client_record_from_graph_row; both flat-list and nested-batch paths are handled consistently. |
| nemo_retriever/src/nemo_retriever/cli/query/app.py | Correctly detects whether --hybrid was explicitly passed vs defaulted using ctx.get_parameter_source, and routes through query_documents_with_metadata to get accurate strategy metadata. |
| nemo_retriever/src/nemo_retriever/cli/ingest/options.py | Adds SparseOption and changes HybridOption from --hybrid/--no-hybrid to --hybrid; the --no-hybrid removal is a backward-incompatible CLI change (covered in previous review threads). |
| nemo_retriever/tests/test_lancedb_capabilities.py | Good coverage of dense/hybrid/sparse detection, sparse-skips-embedding, hybrid-auto-enables-hybrid, and capabilities caching; all happy paths covered. |
| nemo_retriever/tests/test_root_cli_workflow.py | New sparse ingest tests verify embed/vdb_upload are skipped, schema is correct, and sparse+hybrid conflict is rejected; assertions on schema metadata are solid. |
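The plan-level rule noted for ingest/plan.py (sparse and hybrid are mutually exclusive, and sparse mode carries no embedding parameters) might look something like the sketch below. `StorageOptions`, `ResolvedPlan`, `resolve_plan`, and the error message are hypothetical stand-ins, not the real IngestStorageOptions or ResolvedIngestPlan definitions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class StorageOptions:
    """Hypothetical subset of the ingest storage options."""
    sparse: bool = False
    hybrid: bool = False


@dataclass(frozen=True)
class ResolvedPlan:
    """Hypothetical resolved plan; embed_params is None when embedding is skipped."""
    sparse: bool
    hybrid: bool
    embed_params: Optional[dict]


def resolve_plan(options: StorageOptions, embed_params: Optional[dict] = None) -> ResolvedPlan:
    # Reject the conflicting combination before any work is scheduled.
    if options.sparse and options.hybrid:
        raise ValueError("sparse and hybrid ingest cannot be combined")
    if options.sparse:
        # Sparse ingest writes text rows plus an FTS index, so no embedder is configured.
        return ResolvedPlan(sparse=True, hybrid=False, embed_params=None)
    return ResolvedPlan(sparse=False, hybrid=options.hybrid, embed_params=dict(embed_params or {}))
```

Validating at plan resolution, rather than inside the write path, means a conflicting invocation fails before extraction starts.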

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[retriever query / Retriever.queries] --> B[_resolve_lancedb_query_mode]
    B --> C{vdb_op == lancedb?}
    C -- No/graph set --> D[Skip auto-detection\nuse existing graph path]
    C -- Yes --> E[_inspect_lancedb_capabilities\ncached per uri+table_name]
    E --> F[inspect_lancedb_table_object]
    F --> G{Detect schema metadata\nvector col / FTS indexes}
    G --> H{mode?}
    H -- sparse --> I[_execute_sparse_lancedb_queries\nLanceDB.sparse_retrieval FTS-only]
    H -- hybrid --> J[set hybrid=True in vdb_call_kwargs\n→ _execute_queries_graph embed+fuse]
    H -- dense --> K[_execute_queries_graph\nembed+vector search]
    H -- unknown --> L[raise ValueError]
    I --> M[normalize_retrieval_results\n→ shape_query_hits]
    J --> M
    K --> M

    subgraph Ingest
    N[--sparse flag] --> O[skip embed step\nskip vdb_upload in pipeline]
    O --> P[Ingestor.ingest raw extraction]
    P --> Q[to_sparse_client_vdb_records]
    Q --> R[LanceDB.run sparse=True\ncreate FTS index only]
    end
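The query-side branch of the flowchart, combined with the tri-state `hybrid` option (`None` meaning "not specified"), can be approximated as below. The `ValueError` for unknown tables is taken from the flowchart; how an explicit `hybrid` value interacts with the detected mode is an assumption, and `resolve_query_mode` is a hypothetical name.

```python
from typing import Optional


def resolve_query_mode(detected_mode: str, hybrid: Optional[bool] = None) -> str:
    """Pick the retrieval path from the detected table mode and an optional override.

    detected_mode: "dense", "hybrid", "sparse", or "unknown".
    hybrid: None means the caller did not specify, so detection decides.
    """
    if detected_mode == "unknown":
        # The flowchart shows unknown tables raising rather than guessing.
        raise ValueError("could not determine retrieval mode for table")
    if detected_mode == "sparse":
        # FTS-only tables have no vectors, so there is nothing to override (assumption).
        return "sparse"
    if hybrid is None:
        # No explicit request: follow whatever the table supports.
        return detected_mode
    # Assumed behaviour: an explicit value wins on tables that have vectors.
    return "hybrid" if hybrid else "dense"
```

Using `None` as the default is what lets the caller distinguish "the user asked for dense" from "the user said nothing", which a plain `bool = False` cannot express.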

Reviews (4): Last reviewed commit: "Add LanceDB retrieval mode autodetection"

Comment thread nemo_retriever/src/nemo_retriever/cli/query/app.py Outdated
Comment thread nemo_retriever/src/nemo_retriever/graph/retriever.py
Comment thread nemo_retriever/src/nemo_retriever/common/vdb/lancedb.py Outdated
@jioffe502 jioffe502 force-pushed the codex/lancedb-retrieval-mode-autodetect branch 2 times, most recently from ad57852 to 6377e0c Compare June 24, 2026 20:14
@jioffe502 jioffe502 changed the title [codex] Add LanceDB retrieval mode autodetection Add LanceDB retrieval mode autodetection Jun 24, 2026
@jioffe502 jioffe502 force-pushed the codex/lancedb-retrieval-mode-autodetect branch from 6377e0c to a7277b4 Compare June 24, 2026 21:00