feat(dataflow): interprocedural variable-level model — P0-P6 complete, all 34 languages#1612
Conversation
Greptile SummaryThis PR completes the interprocedural variable-level dataflow model (P0–P5) across 20+ languages, adding schema vertices/edges, per-file param/return extraction, interprocedural stitching, incremental re-stitch (P4), the C-family rules, and rules for 18 additional languages spanning JVM, scripting, functional, and systems families. The push/pop asymmetry fix for Elixir/Clojure scopes (using a parallel
Confidence Score: 3/5Safe to merge for the languages not affected by the call-tracking bug, but Elixir and Clojure will silently emit zero argument-flow data until the callNode conflict is resolved. The Elixir and Clojure configs set callNode to the same type already listed in functionNodes. The walker's dispatcher short-circuits on all such nodes before the call-handling branch is reached, so handleCallExpr is never invoked for either language — a silent correctness gap with no error signal. The Erlang and OCaml double-scope issue is a second independent concern that could produce duplicate or contradictory param vertices for those languages. src/ast-analysis/rules/b4.ts — the Elixir, Clojure, Erlang, and OCaml config blocks need the most attention before those languages produce correct interprocedural dataflow output. Important Files Changed
Reviews (6): Last reviewed commit: "fix(dataflow): guard exitFunction pop ag..." | Re-trigger Greptile |
| const placeholders = changedFuncIds.map(() => '?').join(','); | ||
| const callerFileRows = db | ||
| .prepare( | ||
| `SELECT DISTINCT n.file AS caller_file | ||
| FROM dataflow d | ||
| JOIN nodes n ON n.id = d.source_id | ||
| WHERE d.target_id IN (${placeholders}) | ||
| AND d.kind = 'flows_to'`, | ||
| ) | ||
| .all(...changedFuncIds) as { caller_file: string }[]; |
There was a problem hiding this comment.
SQLite variable-limit risk in dynamic IN clause
changedFuncIds is spread into .all(...changedFuncIds), which passes each element as a separate positional bind parameter. SQLite enforces SQLITE_MAX_VARIABLE_NUMBER at statement-prepare time; the historical default is 999 and the modern ceiling is 32 766 (SQLite ≥ 3.32). On a large incremental build — e.g. a batch commit touching many files where the combined function/method count in fileSymbols exceeds that limit — db.prepare(...) or .all(...) throws SQLITE_ERROR: too many SQL variables, propagating as an uncaught exception that aborts the entire build.
Consider batching the query in chunks (e.g. 500 IDs at a time) and collecting the distinct caller files across all batches, or using a temporary table to hold the IDs.
There was a problem hiding this comment.
Fixed in the existing commit via chunked queries. The clause is now issued in batches of 500 IDs, collecting distinct caller files across all batches. This keeps the variable count well within SQLite's SQLITE_MAX_VARIABLE_NUMBER on all SQLite versions (999 legacy, 32766 on 3.32+).
| if (vstmts.available) { | ||
| const changedRelPaths = new Set<string>(fileSymbols.keys()); | ||
| const changedFuncIds = collectFuncIdsForFiles(db, changedRelPaths); | ||
| const extra = await collectCallerStitchCandidates( | ||
| db, | ||
| changedFuncIds, | ||
| changedRelPaths, | ||
| rootDir, | ||
| extToLang, | ||
| parsers, | ||
| getParserFn, | ||
| ); | ||
| allCandidates.push(...extra.candidates); | ||
| allCaptures.push(...extra.captures); | ||
| } |
There was a problem hiding this comment.
P4 runs unconditionally after every JS-path call, even full builds
vstmts.available only guards for the absence of the dataflow_vertices table — it does not distinguish full builds from incremental ones. For a full build, changedRelPaths equals all files in fileSymbols, so collectCallerStitchCandidates correctly returns empty after the filter. However, collectFuncIdsForFiles still issues one SELECT per changed file before discovering there is nothing to do. On a large initial build with thousands of files this adds noticeable overhead before the early-return in collectCallerStitchCandidates. A lightweight guard — e.g. skipping P4 entirely when fileSymbols.size equals the total file count in the DB — would avoid the extra queries on full builds.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Added a full-build guard before the P4 block: a single COUNT(DISTINCT file) query checks whether fileSymbols covers all DB files, and if so P4 is skipped entirely (no per-file SELECTs). For an incremental build the guard costs one fast count query; for a full build it avoids N per-file SELECTs with no result.
| for (const callerFile of callerFiles) { | ||
| // Read the caller file from disk without touching its existing DB rows. | ||
| const stub: FileSymbolsDataflow = { definitions: [], _langId: null, _tree: null }; | ||
| const data = getDataflowForFile( | ||
| stub, | ||
| callerFile, | ||
| rootDir, | ||
| extToLang, | ||
| activeParsers, | ||
| activeGetParserFn, | ||
| ); | ||
| if (!data) continue; |
There was a problem hiding this comment.
definitions: [] stub silently skips callee-context-aware resolution
getDataflowForFile passes stub.definitions (always []) to extractDataflow. Today _definitions is unused, so this is harmless. But the coupling is implicit: if a future refactor makes extractDataflow use definitions to restrict which function bodies are walked (a natural optimisation), P4 will silently emit no stitch candidates for caller files whose definitions were not pre-loaded. A small comment here noting the intentional stub would make this safe to refactor.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Added a comment explaining the intentional stub: extractDataflow does not currently use _definitions, so passing [] is safe. The comment notes exactly what must change if a future refactor makes extractDataflow use definitions to filter which function bodies are walked.
Codegraph Impact Analysis29 functions changed → 15 callers affected across 5 files
|
During an incremental build only the changed files are in fileSymbols. When a callee function's file changes its param vertices are purged and rebuilt, but callers' files are not re-parsed — leaving their arg_in edges unrecreated (target vertex was deleted by the purge). Fix: after the per-file tx() loop, collectCallerStitchCandidates() queries flows_to edges that target any function in the changed file set and finds the distinct caller files NOT already in fileSymbols. Each caller file is read from disk and parsed to extract VisitorArgFlow data. StitchCandidates are built from those flows (filtered to only those calling a changed callee) and merged into allCandidates before the existing buildInterproceduralStitch post-pass runs. This preserves the full vertex-level source/target_vertex pointers on arg_in edges: the caller's existing param/local vertex (untouched by the purge) is the source, and the callee's freshly inserted param vertex is the target. No caller DB rows are written — only reads. Closes #1609. docs check acknowledged
Adds DATAFLOW_RULES entries (function scope, param extraction, return/var/call tracking) for all remaining languages: - lc(B)2 (JVM/mobile): Kotlin, Swift, Scala, Dart, Groovy - lc(B)3 (scripting): Lua, R, Julia, Bash - lc(B)4 (functional): Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure - lc(B)5 (systems/blockchain): Zig, Solidity Also fixes enterFunctionScope to early-return when nameExtractor rejects a node (prevents Elixir/Clojure call/list_lit functionNodes from pushing spurious scopes for non-def expressions). Also fixes collectCallerStitchCandidates to chunk IN queries at 500 to avoid SQLITE_MAX_VARIABLE_NUMBER on large repos (Fixes #1613). Skips lc(P)4 on full builds where fileSymbols covers every file in the DB. Closes #1610
63bb901 to
0d8039c
Compare
| // When nameExtractor is set it acts as a gate: null means this node is not a function | ||
| // definition. Needed for languages (Elixir, Clojure) where functionNodes includes generic | ||
| // node types (call/list_lit) that are only sometimes function definitions. | ||
| if (rules.nameExtractor && !rules.nameExtractor(funcNode)) return; |
There was a problem hiding this comment.
Asymmetric push/pop corrupts the scope stack for Elixir and Clojure
enterFunctionScope now returns early (no push) when nameExtractor says a node is not a function definition. But the exitFunction callback always calls scopeStack.pop() unconditionally (line ~547). In Elixir (functionNodes: ['call']) and Clojure (functionNodes: ['list_lit']), every call expression inside a function body triggers an enterFunction/exitFunction cycle. Because enterFunctionScope does not push for these non-def nodes, the subsequent pop() removes the enclosing def/defn scope instead.
For a function body containing N call expressions, the enclosing scope is popped N times. After the first call expression, currentScope(scopeStack) returns undefined, so all variable references, return tracking, and parameter registrations that follow are dropped. The fix should mirror the gate: either track which nodes actually pushed (e.g. a WeakSet<TreeSitterNode>) and only pop for those, or apply the same nameExtractor check inside exitFunction.
There was a problem hiding this comment.
Fixed — enterFunctionScope now returns boolean (true when it pushed a scope frame, false on early return). A parallel pushRecord: boolean[] stack inside createDataflowVisitor mirrors the scope stack: enterFunction pushes the result, and exitFunction calls scopeStack.pop() only when pushRecord.pop() is true. This makes push/pop symmetric for all languages including Elixir and Clojure.
…ionScope enterFunctionScope returns early (no push) when nameExtractor rejects a node. The unconditional scopeStack.pop() in exitFunction caused asymmetric push/pop for languages like Elixir and Clojure whose functionNodes include generic node types (call/list_lit) that only sometimes represent function definitions. Fix: enterFunctionScope now returns boolean (true = pushed). A parallel pushRecord boolean stack in createDataflowVisitor records the result of each enterFunction call; exitFunction pops scopeStack only when pushRecord.pop() is true.
|
Addressed Greptile P1 (comment #3440188988): guarded exitFunction's scopeStack.pop() against early-return enterFunctionScope to prevent asymmetric push/pop on Elixir/Clojure. enterFunctionScope now returns boolean; a parallel pushRecord stack in createDataflowVisitor ensures pop only fires when a frame was actually pushed. |
Summary
Implements the complete interprocedural dataflow vertex model across all 34 supported languages.
Phases
dataflow_vertices,dataflow_summary, new edge kinds (def_use,arg_in,return_out)def_useedges per filearg_in(caller → callee param) andreturn_out(callee return → caller capture) edges;dataflow_summaryper param (Closes feat(dataflow): P4 — incremental re-stitch when callee file changes #1609)arg_inedges when callee file changes without full rebuilddataflow_verticesis populated regardless of engine; v19 migration sentinel forces one-time backfill of existing DBsgenerator_function_declaration/generator_functionto JS + RustfunctionNodesto fix 3 df-vertex divergences injelly-microfixtureAlso includes:
INquery chunking incollectCallerStitchCandidatesto avoidSQLITE_MAX_VARIABLE_NUMBER(Fixes fix(dataflow): chunk SQLite IN queries in collectCallerStitchCandidates to avoid SQLITE_MAX_VARIABLE_NUMBER limit #1613)fileSymbolscovers all DB filesarg_in,return_out,def_useParity
scripts/parity-compare.mjs --dataflow— 36/36 fixtures OK, both engines identical including df-vertices.Performance
Full build regression (~3s → ~11s on 707 files) is expected from:
runDataflowVertexPassre-runs Rust visitor per file (~2-3s)1-file rebuild: 108ms → 207ms (P4 incremental re-stitch reads caller files from disk).
No-op rebuild: unchanged (22ms).
Optimization opportunity (filed): have Rust pipeline return
DataflowResultper file so P6 doesn't need a second visitor pass.Test plan
npx vitest run tests/integration/dataflow*.test.ts— 40 passingnode scripts/parity-compare.mjs --dataflow— 36/36 OKnode scripts/parity-compare.mjs— 36/36 OK (no node/edge regressions)npm run lint— cleannpx tsc --noEmit— cleanOpen follow-ups