refactor(analyzer): split second_pass into resolve + write phases (#687)#688
refactor(analyzer): split second_pass into resolve + write phases (#687)#688DvirDukhan wants to merge 1 commit into
Conversation
Prepares the analyzer for parallel symbol resolution (issue #687). This change is a no-op on wall-time when CODE_GRAPH_INDEX_WORKERS=1 (the default behavior pre-change). With workers>1, files are dispatched to a ThreadPoolExecutor that calls entity.resolved_symbol concurrently; graph writes are then emitted serially in the original file order so edge contents and ordering match the pre-refactor path. Note: in practice this currently produces ~no speedup because multilspy's SyncLanguageServer multiplexes all requests onto a single asyncio loop. The split is still worth landing because it gives us a clean place to plug in either (a) multiple LSP servers per language, or (b) multilspy's async API. See issue #687 comment thread for the experiment. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
📝 WalkthroughWalkthroughThe ChangesParallel symbol resolution with serialized edge ordering
Sequence DiagramsequenceDiagram
participant ThreadPool as ThreadPoolExecutor
participant Worker as Worker Thread
participant LSP as SyncLanguageServer
participant Entity as Entity<br/>(resolved_symbol)
ThreadPool->>Worker: Process file N (parallel)
Worker->>LSP: Query symbol references
LSP-->>Worker: Symbol candidates
Worker->>Entity: Populate resolved_symbol set
Worker-->>ThreadPool: Complete file N
Note over ThreadPool: Phase B: Serial edge emission
ThreadPool->>ThreadPool: Iterate resolvable files<br/>in original order
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes Possibly related issues
Poem
🚥 Pre-merge checks | ✅ 5✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@api/analyzers/source_analyzer.py`:
- Around line 228-243: The worker loop that calls fut.result() in the
as_completed handling swallows exceptions for _resolve_file and still proceeds
to Phase B which iterates resolvable and writes edges; modify the logic around
futures and resolvable so that any file whose future raised an exception is
recorded and removed (or cause an early abort) before Phase B starts: capture
failed file paths from the except block (use the mapping futures[fut] to get
fp), store them in a failed set, and then either raise a single aggregated
exception to stop processing or filter out those paths from the resolvable list
before the serial edge-write loop to ensure no edges are written for failed
files.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 0f76db71-f7e6-4e7b-9d4a-60c7a627a042
📒 Files selected for processing (1)
api/analyzers/source_analyzer.py
| for fut in as_completed(futures): | ||
| fp = futures[fut] | ||
| try: | ||
| fut.result() | ||
| except Exception as exc: | ||
| logging.warning( | ||
| "second_pass: resolution failed for %s: %s", | ||
| fp, exc, | ||
| ) | ||
| done += 1 | ||
| if done % log_every == 0 or done == total: | ||
| logging.info("second_pass: resolved %d/%d files", done, total) | ||
|
|
||
| # Phase B: serial edge writes, in the original file order so | ||
| # the graph is bit-identical to the single-threaded path. | ||
| for file_path in resolvable: |
There was a problem hiding this comment.
Don't continue phase B after a worker failure.
Line 232 swallows any _resolve_file exception, but Lines 241-243 still emit graph edges. That makes correctness depend on CODE_GRAPH_INDEX_WORKERS: workers=1 still fails fast, while workers>1 can silently write a partial graph for the failed file. Track failed files and abort before phase B, or explicitly exclude/clear them before writing edges.
🛠️ Proposed fix
done = 0
log_every = max(1, total // 50) if total else 1
+ failed_files: list[Path] = []
if n_workers == 1:
for fp in resolvable:
_resolve_file(fp)
done += 1
if done % log_every == 0 or done == total:
@@
with ThreadPoolExecutor(max_workers=n_workers, thread_name_prefix="sa-resolve") as ex:
futures = {ex.submit(_resolve_file, fp): fp for fp in resolvable}
for fut in as_completed(futures):
fp = futures[fut]
try:
fut.result()
except Exception:
- logging.warning(
- "second_pass: resolution failed for %s: %s",
- fp, exc,
- )
+ failed_files.append(fp)
+ logging.exception(
+ "second_pass: resolution failed for %s",
+ fp,
+ )
done += 1
if done % log_every == 0 or done == total:
logging.info("second_pass: resolved %d/%d files", done, total)
+
+ if failed_files:
+ raise RuntimeError(
+ f"second_pass aborted after symbol resolution failed for {len(failed_files)} file(s)"
+ )
# Phase B: serial edge writes, in the original file order so
# the graph is bit-identical to the single-threaded path.
for file_path in resolvable:🧰 Tools
🪛 Ruff (0.15.14)
[warning] 232-232: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@api/analyzers/source_analyzer.py` around lines 228 - 243, The worker loop
that calls fut.result() in the as_completed handling swallows exceptions for
_resolve_file and still proceeds to Phase B which iterates resolvable and writes
edges; modify the logic around futures and resolvable so that any file whose
future raised an exception is recorded and removed (or cause an early abort)
before Phase B starts: capture failed file paths from the except block (use the
mapping futures[fut] to get fp), store them in a failed set, and then either
raise a single aggregated exception to stop processing or filter out those paths
from the resolvable list before the serial edge-write loop to ensure no edges
are written for failed files.
Refactors
SourceAnalyzer.second_passto separate LSP-bound symbol resolution from FalkorDB edge writes, gated by the newCODE_GRAPH_INDEX_WORKERSenv var.Why
Index wall-time on large Python repos is dominated by
second_pass— see issue #687. This PR is a structural prerequisite for any parallelization strategy: it lets us swap in a different concurrency model without re-untangling the resolve / write interleaving.What
CODE_GRAPH_INDEX_WORKERS(default 4).ThreadPoolExecutorthat populatesentity.resolved_symbolsconcurrently.Verified
A/B test on pytest-6202 (204 Python files):
✅ Edge counts identical → safe to merge as a no-op.
❌ No speedup observed because multilspy's SyncLanguageServer serializes through one asyncio loop. Followup work to actually win wall-time (multiple LSP servers, async API, or tree-sitter-only resolver) is tracked in #687.
Risk
Low. Default behavior unchanged unless the env var is set.
Summary by CodeRabbit