refactor(analyzer): split second_pass into resolve + write phases (#687) by DvirDukhan · Pull Request #688 · FalkorDB/code-graph

DvirDukhan · 2026-05-28T05:25:36Z

Refactors SourceAnalyzer.second_pass to separate LSP-bound symbol resolution from FalkorDB edge writes, gated by the new CODE_GRAPH_INDEX_WORKERS env var.

Why

Index wall-time on large Python repos is dominated by second_pass — see issue #687. This PR is a structural prerequisite for any parallelization strategy: it lets us swap in a different concurrency model without re-untangling the resolve / write interleaving.

What

New env var CODE_GRAPH_INDEX_WORKERS (default 4).
When > 1, files are dispatched to a ThreadPoolExecutor that populates entity.resolved_symbols concurrently.
Graph writes run on the main thread in the original file order, so output is bit-identical to the pre-change serial path.
When == 1, fully serial (drop-in compatible).

Verified

A/B test on pytest-6202 (204 Python files):

workers	wall (s)	speedup	CALLS / DEFINES / EXTENDS
1	247.1	1.00×	1976 / 4509 / 71
4	250.4	0.99×	1976 / 4509 / 71

✅ Edge counts identical → safe to merge as a no-op.

❌ No speedup observed because multilspy's SyncLanguageServer serializes through one asyncio loop. Followup work to actually win wall-time (multiple LSP servers, async API, or tree-sitter-only resolver) is tracked in #687.

Risk

Low. Default behavior unchanged unless the env var is set.

Summary by CodeRabbit

Refactor
- Optimized symbol resolution performance through parallel processing while maintaining deterministic edge ordering.
- Enhanced error handling with detailed per-file logging during symbol resolution.

Prepares the analyzer for parallel symbol resolution (issue #687). This change is a no-op on wall-time when CODE_GRAPH_INDEX_WORKERS=1 (the default behavior pre-change). With workers>1, files are dispatched to a ThreadPoolExecutor that calls entity.resolved_symbol concurrently; graph writes are then emitted serially in the original file order so edge contents and ordering match the pre-refactor path. Note: in practice this currently produces ~no speedup because multilspy's SyncLanguageServer multiplexes all requests onto a single asyncio loop. The split is still worth landing because it gives us a clean place to plug in either (a) multiple LSP servers per language, or (b) multilspy's async API. See issue #687 comment thread for the experiment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

coderabbitai · 2026-05-28T05:25:46Z

📝 Walkthrough

Walkthrough

The SourceAnalyzer.second_pass method is refactored to parallelize symbol reference resolution across files using a bounded thread pool while keeping graph edge writes serialized in original file order for deterministic output. Symbol resolution is filtered to resolvable files with real LSP support, with worker threads populating entity resolved_symbol sets and per-file exception handling.

Changes

Parallel symbol resolution with serialized edge ordering

Layer / File(s)	Summary
Docstring and imports for two-phase flow `api/analyzers/source_analyzer.py`	The `second_pass` docstring is expanded to describe the two-phase indexing strategy (parallel symbol resolution then serial edge writing), and local imports (`os`, `ThreadPoolExecutor`, `as_completed`) are added to support worker-based execution.
Parallel symbol resolution via ThreadPoolExecutor `api/analyzers/source_analyzer.py`	Worker-count is configured via `CODE_GRAPH_INDEX_WORKERS` (default 4). Input files are filtered to resolvable ones with real LSP support. A worker function populates each entity's `resolved_symbol` set by querying `SyncLanguageServer`. A `ThreadPoolExecutor` runs workers in parallel with per-file exception logging and periodic progress reporting.
Serialized edge emission in original order `api/analyzers/source_analyzer.py`	After parallel symbol resolution completes, resolvable files are iterated in original order to preserve deterministic graph edge ordering when emitting edges from resolved symbols.

Sequence Diagram

sequenceDiagram
  participant ThreadPool as ThreadPoolExecutor
  participant Worker as Worker Thread
  participant LSP as SyncLanguageServer
  participant Entity as Entity<br/>(resolved_symbol)
  
  ThreadPool->>Worker: Process file N (parallel)
  Worker->>LSP: Query symbol references
  LSP-->>Worker: Symbol candidates
  Worker->>Entity: Populate resolved_symbol set
  Worker-->>ThreadPool: Complete file N
  
  Note over ThreadPool: Phase B: Serial edge emission
  ThreadPool->>ThreadPool: Iterate resolvable files<br/>in original order

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Index second_pass: parallelize per-file LSP symbol resolution to cut wall-time on large repos #687: Parallel LSP symbol resolution implementation — both PRs target SourceAnalyzer.second_pass to convert serial symbol resolution into bounded parallel per-file resolution while preserving serialized edge emission for deterministic ordering.

Poem

🐰 A rabbit's verse on parallel threads:
Symbol threads dance, side-by-side,
Each file resolved with worker pride,
But edges march in ordered line—
Deterministic, oh so fine!

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title clearly and accurately describes the main structural change: splitting the second_pass method into separate symbol resolution and edge-writing phases, which is the primary focus of the refactoring.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

📝 Generate docstrings

Create stacked PR
Commit on current branch

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch dvirdukhan/parallel-second-pass

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/analyzers/source_analyzer.py`:
- Around line 228-243: The worker loop that calls fut.result() in the
as_completed handling swallows exceptions for _resolve_file and still proceeds
to Phase B which iterates resolvable and writes edges; modify the logic around
futures and resolvable so that any file whose future raised an exception is
recorded and removed (or cause an early abort) before Phase B starts: capture
failed file paths from the except block (use the mapping futures[fut] to get
fp), store them in a failed set, and then either raise a single aggregated
exception to stop processing or filter out those paths from the resolvable list
before the serial edge-write loop to ensure no edges are written for failed
files.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f76db71-f7e6-4e7b-9d4a-60c7a627a042

📥 Commits

Reviewing files that changed from the base of the PR and between b066064 and 3b743e7.

📒 Files selected for processing (1)

api/analyzers/source_analyzer.py

coderabbitai · 2026-05-28T05:30:04Z

+                    for fut in as_completed(futures):
+                        fp = futures[fut]
+                        try:
+                            fut.result()
+                        except Exception as exc:
+                            logging.warning(
+                                "second_pass: resolution failed for %s: %s",
+                                fp, exc,
+                            )
+                        done += 1
+                        if done % log_every == 0 or done == total:
+                            logging.info("second_pass: resolved %d/%d files", done, total)
+
+            # Phase B: serial edge writes, in the original file order so
+            # the graph is bit-identical to the single-threaded path.
+            for file_path in resolvable:


⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't continue phase B after a worker failure.

Line 232 swallows any _resolve_file exception, but Lines 241-243 still emit graph edges. That makes correctness depend on CODE_GRAPH_INDEX_WORKERS: workers=1 still fails fast, while workers>1 can silently write a partial graph for the failed file. Track failed files and abort before phase B, or explicitly exclude/clear them before writing edges.

🛠️ Proposed fix

done = 0 log_every = max(1, total // 50) if total else 1 + failed_files: list[Path] = [] if n_workers == 1: for fp in resolvable: _resolve_file(fp) done += 1 if done % log_every == 0 or done == total: @@ with ThreadPoolExecutor(max_workers=n_workers, thread_name_prefix="sa-resolve") as ex: futures = {ex.submit(_resolve_file, fp): fp for fp in resolvable} for fut in as_completed(futures): fp = futures[fut] try: fut.result() except Exception: - logging.warning( - "second_pass: resolution failed for %s: %s", - fp, exc, - ) + failed_files.append(fp) + logging.exception( + "second_pass: resolution failed for %s", + fp, + ) done += 1 if done % log_every == 0 or done == total: logging.info("second_pass: resolved %d/%d files", done, total) + + if failed_files: + raise RuntimeError( + f"second_pass aborted after symbol resolution failed for {len(failed_files)} file(s)" + ) # Phase B: serial edge writes, in the original file order so # the graph is bit-identical to the single-threaded path. for file_path in resolvable:

🧰 Tools

🪛 Ruff (0.15.14)

[warning] 232-232: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents

Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@api/analyzers/source_analyzer.py` around lines 228 - 243, The worker loop that calls fut.result() in the as_completed handling swallows exceptions for _resolve_file and still proceeds to Phase B which iterates resolvable and writes edges; modify the logic around futures and resolvable so that any file whose future raised an exception is recorded and removed (or cause an early abort) before Phase B starts: capture failed file paths from the except block (use the mapping futures[fut] to get fp), store them in a failed set, and then either raise a single aggregated exception to stop processing or filter out those paths from the resolvable list before the serial edge-write loop to ensure no edges are written for failed files.

coderabbitai Bot reviewed May 28, 2026

View reviewed changes

DvirDukhan mentioned this pull request May 28, 2026

MCP T18 — Tree-sitter symbol resolver for Python (replace jedi in second_pass) #689

Open

5 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

refactor(analyzer): split second_pass into resolve + write phases (#687)#688

refactor(analyzer): split second_pass into resolve + write phases (#687)#688
DvirDukhan wants to merge 1 commit into
stagingfrom
dvirdukhan/parallel-second-pass

DvirDukhan commented May 28, 2026 •

edited by coderabbitai Bot

Loading

Uh oh!

coderabbitai Bot commented May 28, 2026 •

edited

Loading

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Poem

Uh oh!

coderabbitai Bot left a comment

Uh oh!

coderabbitai Bot May 28, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

DvirDukhan commented May 28, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Why

What

Verified

Risk

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 28, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Sequence Diagram

Estimated code review effort

Possibly related issues

Poem

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

coderabbitai Bot May 28, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

DvirDukhan commented May 28, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 28, 2026 •

edited

Loading