Skip to content

refactor(analyzer): split second_pass into resolve + write phases (#687)#688

Open
DvirDukhan wants to merge 1 commit into
stagingfrom
dvirdukhan/parallel-second-pass
Open

refactor(analyzer): split second_pass into resolve + write phases (#687)#688
DvirDukhan wants to merge 1 commit into
stagingfrom
dvirdukhan/parallel-second-pass

Conversation

@DvirDukhan
Copy link
Copy Markdown

@DvirDukhan DvirDukhan commented May 28, 2026

Refactors SourceAnalyzer.second_pass to separate LSP-bound symbol resolution from FalkorDB edge writes, gated by the new CODE_GRAPH_INDEX_WORKERS env var.

Why

Index wall-time on large Python repos is dominated by second_pass — see issue #687. This PR is a structural prerequisite for any parallelization strategy: it lets us swap in a different concurrency model without re-untangling the resolve / write interleaving.

What

  • New env var CODE_GRAPH_INDEX_WORKERS (default 4).
  • When > 1, files are dispatched to a ThreadPoolExecutor that populates entity.resolved_symbols concurrently.
  • Graph writes run on the main thread in the original file order, so output is bit-identical to the pre-change serial path.
  • When == 1, fully serial (drop-in compatible).

Verified

A/B test on pytest-6202 (204 Python files):

workers wall (s) speedup CALLS / DEFINES / EXTENDS
1 247.1 1.00× 1976 / 4509 / 71
4 250.4 0.99× 1976 / 4509 / 71

✅ Edge counts identical → safe to merge as a no-op.

❌ No speedup observed because multilspy's SyncLanguageServer serializes through one asyncio loop. Followup work to actually win wall-time (multiple LSP servers, async API, or tree-sitter-only resolver) is tracked in #687.

Risk

Low. Default behavior unchanged unless the env var is set.

Summary by CodeRabbit

  • Refactor
    • Optimized symbol resolution performance through parallel processing while maintaining deterministic edge ordering.
    • Enhanced error handling with detailed per-file logging during symbol resolution.

Review Change Stack

Prepares the analyzer for parallel symbol resolution (issue #687).
This change is a no-op on wall-time when CODE_GRAPH_INDEX_WORKERS=1
(the default behavior pre-change). With workers>1, files are
dispatched to a ThreadPoolExecutor that calls entity.resolved_symbol
concurrently; graph writes are then emitted serially in the original
file order so edge contents and ordering match the pre-refactor path.

Note: in practice this currently produces ~no speedup because
multilspy's SyncLanguageServer multiplexes all requests onto a
single asyncio loop. The split is still worth landing because it
gives us a clean place to plug in either (a) multiple LSP servers
per language, or (b) multilspy's async API. See issue #687 comment
thread for the experiment.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

📝 Walkthrough

Walkthrough

The SourceAnalyzer.second_pass method is refactored to parallelize symbol reference resolution across files using a bounded thread pool while keeping graph edge writes serialized in original file order for deterministic output. Symbol resolution is filtered to resolvable files with real LSP support, with worker threads populating entity resolved_symbol sets and per-file exception handling.

Changes

Parallel symbol resolution with serialized edge ordering

Layer / File(s) Summary
Docstring and imports for two-phase flow
api/analyzers/source_analyzer.py
The second_pass docstring is expanded to describe the two-phase indexing strategy (parallel symbol resolution then serial edge writing), and local imports (os, ThreadPoolExecutor, as_completed) are added to support worker-based execution.
Parallel symbol resolution via ThreadPoolExecutor
api/analyzers/source_analyzer.py
Worker-count is configured via CODE_GRAPH_INDEX_WORKERS (default 4). Input files are filtered to resolvable ones with real LSP support. A worker function populates each entity's resolved_symbol set by querying SyncLanguageServer. A ThreadPoolExecutor runs workers in parallel with per-file exception logging and periodic progress reporting.
Serialized edge emission in original order
api/analyzers/source_analyzer.py
After parallel symbol resolution completes, resolvable files are iterated in original order to preserve deterministic graph edge ordering when emitting edges from resolved symbols.

Sequence Diagram

sequenceDiagram
  participant ThreadPool as ThreadPoolExecutor
  participant Worker as Worker Thread
  participant LSP as SyncLanguageServer
  participant Entity as Entity<br/>(resolved_symbol)
  
  ThreadPool->>Worker: Process file N (parallel)
  Worker->>LSP: Query symbol references
  LSP-->>Worker: Symbol candidates
  Worker->>Entity: Populate resolved_symbol set
  Worker-->>ThreadPool: Complete file N
  
  Note over ThreadPool: Phase B: Serial edge emission
  ThreadPool->>ThreadPool: Iterate resolvable files<br/>in original order
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues

Poem

🐰 A rabbit's verse on parallel threads:
Symbol threads dance, side-by-side,
Each file resolved with worker pride,
But edges march in ordered line—
Deterministic, oh so fine!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly and accurately describes the main structural change: splitting the second_pass method into separate symbol resolution and edge-writing phases, which is the primary focus of the refactoring.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch dvirdukhan/parallel-second-pass

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@api/analyzers/source_analyzer.py`:
- Around line 228-243: The worker loop that calls fut.result() in the
as_completed handling swallows exceptions for _resolve_file and still proceeds
to Phase B which iterates resolvable and writes edges; modify the logic around
futures and resolvable so that any file whose future raised an exception is
recorded and removed (or cause an early abort) before Phase B starts: capture
failed file paths from the except block (use the mapping futures[fut] to get
fp), store them in a failed set, and then either raise a single aggregated
exception to stop processing or filter out those paths from the resolvable list
before the serial edge-write loop to ensure no edges are written for failed
files.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 0f76db71-f7e6-4e7b-9d4a-60c7a627a042

📥 Commits

Reviewing files that changed from the base of the PR and between b066064 and 3b743e7.

📒 Files selected for processing (1)
  • api/analyzers/source_analyzer.py

Comment on lines +228 to +243
for fut in as_completed(futures):
fp = futures[fut]
try:
fut.result()
except Exception as exc:
logging.warning(
"second_pass: resolution failed for %s: %s",
fp, exc,
)
done += 1
if done % log_every == 0 or done == total:
logging.info("second_pass: resolved %d/%d files", done, total)

# Phase B: serial edge writes, in the original file order so
# the graph is bit-identical to the single-threaded path.
for file_path in resolvable:
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Don't continue phase B after a worker failure.

Line 232 swallows any _resolve_file exception, but Lines 241-243 still emit graph edges. That makes correctness depend on CODE_GRAPH_INDEX_WORKERS: workers=1 still fails fast, while workers>1 can silently write a partial graph for the failed file. Track failed files and abort before phase B, or explicitly exclude/clear them before writing edges.

🛠️ Proposed fix
             done = 0
             log_every = max(1, total // 50) if total else 1
+            failed_files: list[Path] = []
             if n_workers == 1:
                 for fp in resolvable:
                     _resolve_file(fp)
                     done += 1
                     if done % log_every == 0 or done == total:
@@
                 with ThreadPoolExecutor(max_workers=n_workers, thread_name_prefix="sa-resolve") as ex:
                     futures = {ex.submit(_resolve_file, fp): fp for fp in resolvable}
                     for fut in as_completed(futures):
                         fp = futures[fut]
                         try:
                             fut.result()
                         except Exception:
-                            logging.warning(
-                                "second_pass: resolution failed for %s: %s",
-                                fp, exc,
-                            )
+                            failed_files.append(fp)
+                            logging.exception(
+                                "second_pass: resolution failed for %s",
+                                fp,
+                            )
                         done += 1
                         if done % log_every == 0 or done == total:
                             logging.info("second_pass: resolved %d/%d files", done, total)
+
+            if failed_files:
+                raise RuntimeError(
+                    f"second_pass aborted after symbol resolution failed for {len(failed_files)} file(s)"
+                )
 
             # Phase B: serial edge writes, in the original file order so
             # the graph is bit-identical to the single-threaded path.
             for file_path in resolvable:
🧰 Tools
🪛 Ruff (0.15.14)

[warning] 232-232: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@api/analyzers/source_analyzer.py` around lines 228 - 243, The worker loop
that calls fut.result() in the as_completed handling swallows exceptions for
_resolve_file and still proceeds to Phase B which iterates resolvable and writes
edges; modify the logic around futures and resolvable so that any file whose
future raised an exception is recorded and removed (or cause an early abort)
before Phase B starts: capture failed file paths from the except block (use the
mapping futures[fut] to get fp), store them in a failed set, and then either
raise a single aggregated exception to stop processing or filter out those paths
from the resolvable list before the serial edge-write loop to ensure no edges
are written for failed files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant