Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos #687

@DvirDukhan

Problem

SourceAnalyzer.second_pass (api/analyzers/source_analyzer.py) walks every indexed file sequentially and, for each entity in each file, calls lsp.request_definition per symbol to populate CALLS / EXTENDS / RETURNS / PARAMETERS edges. On large Python repos this dominates index wall-time.

Measured during bench calibration (commit 476bc73, sympy 1.6+ at instance sympy-12481):

  • 1,113 Python files in repo
  • ~22 jedi resolves per file → ~24,500 sequential LSP requests
  • ~50–60 minutes total just for second_pass on a single sympy index
  • API log shows ~22 `resolve() failed: Unexpected response from Language Server: None` per file — most jedi requests return None and we still pay the roundtrip

By contrast, the bench's `lsp` agent track (same multilspy/jedi backend, used on-demand at query time) makes 10–30 calls per task. The architectural difference is when resolution happens, not the backend: code_graph resolves every symbol eagerly at index time; the agent resolves lazily, only what a query needs.

This is fine in principle (we pay the cost once to get O(1) FalkorDB queries later) but the current implementation leaves easy wins on the table.

Current code

`source_analyzer.py` ~line 166–189:

```python
with lsps[".java"].start_server(), lsps[".py"].start_server(), lsps[".cs"].start_server():
    for i, file_path in enumerate(files):
        file = self.files.get(file_path)
        if file is None: ...
        for _, entity in file.entities.items():
            entity.resolved_symbol(lambda key, symbol, fp=file_path:
                analyzers[fp.suffix].resolve_symbol(self.files, lsps[fp.suffix], fp, path, key, symbol))
            for key, symbols in entity.symbols.items():
                for symbol in symbols:
                    ...
                    graph.connect_entities(...)
```

Each `resolve_symbol` ultimately reaches `AbstractAnalyzer.resolve` → `lsp.request_definition(...)`. All calls are serial on one SyncLanguageServer instance per language.

Proposed work

  1. Parallelize file-level resolution. Process N files concurrently with a thread pool (LSP I/O releases the GIL). Either:
    • Maintain a pool of M SyncLanguageServer instances per language and round-robin file dispatch, or
    • Switch to multilspy's async API and run resolutions concurrently against one server.
  2. Avoid re-resolving identical (file, line, col) tuples within a single index — small in-memory cache keyed by position.
  3. Down-rank or skip resolves likely to return None — e.g. attribute access where the receiver type is unresolvable (currently we pay the LSP roundtrip and log a WARN per failure). Profile first to confirm.
  4. Add a coarse progress + timing log line per N files so future regressions surface in CI / bench logs.
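
A minimal sketch of how items 1, 2 and 4 could fit together, using the server-pool option. Everything here is hypothetical: `resolve_files`, the `(line, col)` position lists, and the plain callables standing in for `SyncLanguageServer`-backed resolvers are illustration only, not the existing `SourceAnalyzer` API. Because the cache key includes the file and each file is handled by exactly one worker, a per-file dict is enough and needs no lock.

```python
import logging
import time
from concurrent.futures import ThreadPoolExecutor, as_completed
from queue import Queue

log = logging.getLogger("second_pass_sketch")

def resolve_files(files, servers, max_workers=None, log_every=100):
    """Resolve symbol positions for many files concurrently.

    files:   mapping file_path -> list of (line, col) positions (hypothetical shape)
    servers: list of callables resolve(file_path, line, col) -> result or None,
             each standing in for one language-server instance
    """
    pool = Queue()
    for server in servers:
        pool.put(server)

    def work(file_path, positions):
        server = pool.get()  # block until a server instance is free
        cache = {}           # per-file cache keyed by (line, col)
        try:
            resolved = []
            for pos in positions:
                if pos not in cache:  # None results are cached too
                    cache[pos] = server(file_path, *pos)
                resolved.append(cache[pos])
            return file_path, resolved
        finally:
            pool.put(server)  # always hand the server back

    results = {}
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=max_workers or len(servers)) as executor:
        futures = [executor.submit(work, fp, pos) for fp, pos in files.items()]
        for done, future in enumerate(as_completed(futures), start=1):
            file_path, resolved = future.result()
            results[file_path] = resolved
            if done % log_every == 0 or done == len(futures):
                log.info("second_pass: %d/%d files in %.1fs",
                         done, len(futures), time.monotonic() - started)

    # Return in input order so graph writes can stay serial and deterministic.
    return {fp: results[fp] for fp in files}
```

Collecting results on the calling thread and replaying them in input order keeps `graph.connect_entities` out of the workers, which should make it easier to compare the output against the serial version.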

Acceptance

  • Index wall-time for sympy-12481 (1,113 files, fresh worktree, 8-core host) drops from ~50–60 min to <15 min.
  • Graph contents (DEFINES / CALLS / EXTENDS / IMPLEMENTS / RETURNS / PARAMETERS edge counts) are identical to the serial version on at least 3 repos (pytest, sphinx, sympy).
  • New parallelism is bounded and configurable (env var or constant) so we can ship a conservative default.
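
The last two criteria could be checked along these lines. This is a sketch under assumptions: the env var name `CODE_GRAPH_LSP_WORKERS`, the default of 2, the cap of 16, and the `(src, relation, dst)` edge tuples are all placeholders, and how edges are exported from FalkorDB is left out.

```python
import os
from collections import Counter

def worker_count(env=os.environ, default=2, upper=16):
    """Read the worker count from a hypothetical env var, clamped to [1, upper]."""
    raw = env.get("CODE_GRAPH_LSP_WORKERS")  # placeholder name
    if raw is None:
        return default
    try:
        requested = int(raw)
    except ValueError:
        return default  # fall back to the conservative default on bad input
    return max(1, min(requested, upper))

def edge_count_diff(serial_edges, parallel_edges):
    """Return {relation: (serial_count, parallel_count)} for mismatching relations.

    Both arguments are iterables of (src, relation, dst) tuples (assumed shape).
    An empty dict means the per-relation edge counts match.
    """
    serial = Counter(rel for _, rel, _ in serial_edges)
    parallel = Counter(rel for _, rel, _ in parallel_edges)
    return {
        rel: (serial[rel], parallel[rel])
        for rel in sorted(set(serial) | set(parallel))
        if serial[rel] != parallel[rel]
    }
```

Matching counts are a necessary check, not a sufficient one; comparing the edge sets themselves would be stricter if the export makes that cheap.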

Out of scope

  • Replacing jedi with a different LSP backend.
  • Streaming / incremental index updates.
  • LSP server warm-cache reuse across separate `analyze_folder` calls (Redis-side caching).
