Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos

## Problem

`SourceAnalyzer.second_pass` (api/analyzers/source_analyzer.py) walks every indexed file sequentially and, for each entity in each file, calls `lsp.request_definition` per symbol to populate CALLS / EXTENDS / RETURNS / PARAMETERS edges. On large Python repos this dominates index wall-time.

Measured during bench calibration (commit `476bc73`, sympy 1.6+ at instance sympy-12481):
- 1,113 Python files in repo
- ~22 jedi resolves per file → ~24,500 sequential LSP requests
- ~50–60 minutes total just for second_pass on a single sympy index
- API log shows ~22 `resolve() failed: Unexpected response from Language Server: None` per file: most jedi requests return None, and we still pay the roundtrip
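
As a rough sanity check, the per-request cost implied by these figures can be worked out directly. The sketch below assumes essentially all of the second_pass time is spent in LSP roundtrips, which has not been profiled; the helper names are illustrative only.

```python
# Back-of-the-envelope estimate using only the figures quoted above.
# Assumption: all second_pass wall-time is LSP roundtrip time.

def total_requests(files: int, resolves_per_file: int) -> int:
    """Approximate number of sequential LSP requests for one index."""
    return files * resolves_per_file


def ms_per_request(minutes: float, requests: int) -> float:
    """Average milliseconds per request for a given total wall-time."""
    return minutes * 60 * 1000 / requests


requests = total_requests(files=1_113, resolves_per_file=22)
for minutes in (50, 60):
    print(f"{minutes} min / {requests} requests "
          f"= {ms_per_request(minutes, requests):.0f} ms per request")
```

This works out to roughly 120 to 150 ms per request, so the total is driven by request count and serialization rather than by any single slow call.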

By contrast, the bench's `lsp` agent track (same multilspy/jedi backend, used on-demand at query time) makes 10–30 calls per task. The architectural difference is *when* symbols are resolved, not the backend: code_graph eagerly resolves everything at index time; the agent resolves lazily at query time.

This is fine in principle (we pay the cost once to get O(1) FalkorDB queries later) but the current implementation leaves easy wins on the table.

## Current code

`source_analyzer.py` ~line 166–189:

```python
with lsps[".java"].start_server(), lsps[".py"].start_server(), lsps[".cs"].start_server():
    for i, file_path in enumerate(files):
        file = self.files.get(file_path)
        if file is None: ...
        for _, entity in file.entities.items():
            entity.resolved_symbol(lambda key, symbol, fp=file_path:
                analyzers[fp.suffix].resolve_symbol(self.files, lsps[fp.suffix], fp, path, key, symbol))
            for key, symbols in entity.symbols.items():
                for symbol in symbols:
                    ...
                    graph.connect_entities(...)
```

Each `resolve_symbol` ultimately reaches `AbstractAnalyzer.resolve` → `lsp.request_definition(...)`. All calls are serial on one SyncLanguageServer instance per language.

## Proposed work

1. **Parallelize file-level resolution.** Process N files concurrently with a thread pool (LSP I/O releases the GIL). Either:
   - Maintain a pool of M SyncLanguageServer instances per language and round-robin file dispatch, or
   - Switch to multilspy's async API and run resolutions concurrently against one server.
2. **Avoid re-resolving identical (file, line, col) tuples** within a single index — small in-memory cache keyed by position.
3. **Down-rank or skip resolves likely to return None** — e.g. attribute access where the receiver type is unresolvable (currently we pay the LSP roundtrip and log a WARN per failure). Profile first to confirm.
4. **Add a coarse progress + timing log line per N files** so future regressions surface in CI / bench logs.
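
A minimal sketch of how items 1, 2 and 4 could fit together, assuming the server-pool variant. It does not use the real multilspy API: each "server" is a hypothetical callable `(file, line, col) -> result` standing in for a `request_definition` wrapper, and `resolve_files` and its parameters are illustrative names.

```python
import logging
import queue
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional

log = logging.getLogger("second_pass")

# Hypothetical stand-in for one language server: (file, line, col) -> result.
Resolver = Callable[[str, int, int], Optional[str]]


def resolve_files(files: dict[str, list[tuple[int, int]]],
                  servers: list[Resolver],
                  log_every: int = 100) -> dict[str, list[Optional[str]]]:
    """Resolve every (line, col) of every file, one pooled server per file."""
    pool: queue.Queue = queue.Queue()
    for server in servers:
        pool.put(server)

    cache: dict[tuple[str, int, int], Optional[str]] = {}
    cache_lock = threading.Lock()

    def work(item):
        file, positions = item
        server = pool.get()  # borrow one server for the whole file
        try:
            out = []
            for line, col in positions:
                key = (file, line, col)
                with cache_lock:
                    hit = key in cache
                    value = cache.get(key)
                if not hit:
                    # Slow roundtrip happens outside the lock.
                    value = server(file, line, col)
                    with cache_lock:
                        cache[key] = value  # None results are cached too
                out.append(value)
            return file, out
        finally:
            pool.put(server)  # always hand the server back

    results: dict[str, list[Optional[str]]] = {}
    started = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(servers)) as executor:
        # map() yields results in input order, so downstream graph writes
        # can stay on this thread and in the same order as the serial loop.
        for done, (file, out) in enumerate(executor.map(work, files.items()), 1):
            results[file] = out
            if done % log_every == 0:
                log.info("second_pass: %d/%d files in %.1fs",
                         done, len(files), time.monotonic() - started)
    return results
```

Keeping graph writes on the calling thread, in input order, is one way to make the serial and parallel outputs easier to compare. Item 3 is left out on purpose, since it depends on profiling results.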

## Acceptance

- Index wall-time for sympy-12481 (1,113 files, fresh worktree, 8-core host) drops from ~50 min to <15 min.
- Graph contents (DEFINES / CALLS / EXTENDS / IMPLEMENTS / RETURNS / PARAMETERS edge counts) are identical to the serial version's on at least 3 repos (pytest, sphinx, sympy).
- New parallelism is bounded and configurable (env var or constant) so we can ship a conservative default.
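
The last two criteria could be checked with something like the sketch below. The environment variable name `CODE_GRAPH_LSP_WORKERS`, the default of 2 and the cap of 16 are assumptions made for illustration, not existing settings. The per-label edge counts are assumed to have been collected from each graph already; how they are queried is not shown.

```python
import os
from collections import Counter
from typing import Mapping


def worker_count(env: Mapping[str, str] = os.environ,
                 default: int = 2, upper: int = 16) -> int:
    """Read the (hypothetical) CODE_GRAPH_LSP_WORKERS setting, clamped to [1, upper]."""
    try:
        value = int(env.get("CODE_GRAPH_LSP_WORKERS", ""))
    except ValueError:
        value = default  # unset or malformed: fall back to the conservative default
    return max(1, min(value, upper))


def edge_count_diff(serial: Counter, parallel: Counter) -> dict[str, tuple[int, int]]:
    """Return {label: (serial_count, parallel_count)} for labels that differ."""
    labels = sorted(set(serial) | set(parallel))
    # Counter returns 0 for a missing label, so absent edge types are compared too.
    return {label: (serial[label], parallel[label])
            for label in labels if serial[label] != parallel[label]}
```

An empty dict from `edge_count_diff` means the two runs agree on every edge label.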

## Out of scope

- Replacing jedi with a different LSP backend.
- Streaming / incremental index updates.
- LSP server warm-cache reuse across separate `analyze_folder` calls (Redis-side caching).

## References

- Bench results showing the cost: PR #684 calibration comments
- Related analyzer hardening: PR #686 (silent CALLS failures + KeyError in second_pass)
