feat(dataflow): interprocedural variable-level model — P0-P6 complete, all 34 languages by carlos-alm · Pull Request #1612 · optave/ops-codegraph-tool

carlos-alm · 2026-06-19T03:32:33Z

Summary

Implements the complete interprocedural dataflow vertex model across all 34 supported languages.

Phases

P0: Schema — dataflow_vertices, dataflow_summary, new edge kinds (def_use, arg_in, return_out)
P1: Vertex extraction — param/return/local vertices + intra def_use edges per file
P2: Interprocedural stitching — arg_in (caller → callee param) and return_out (callee return → caller capture) edges; dataflow_summary per param (Closes #1609)
P3: C/C++ parser tests (29 tests)
P4: Incremental re-stitch — rebuilds arg_in edges when a callee file changes, without a full rebuild
P5 B1: C-family dataflow rules (C, C++, ObjC, CUDA)
P5 B2–B5: DATAFLOW_RULES for remaining 18 languages — Kotlin, Swift, Scala, Dart, Groovy, Lua, R, Julia, Bash, Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure, Zig, Solidity (Closes #1610)
P6: Vertex extraction on native orchestrator path — re-runs Rust dataflow visitor post-orchestrator so dataflow_vertices is populated regardless of engine; v19 migration sentinel forces one-time backfill of existing DBs
P6 parity: Added generator_function_declaration / generator_function to JS + Rust functionNodes to fix 3 df-vertex divergences in jelly-micro fixture

Also includes:

SQLite IN query chunking in collectCallerStitchCandidates to avoid exceeding SQLITE_MAX_VARIABLE_NUMBER (Fixes #1613)
Skips P4 pass on full builds where fileSymbols covers all DB files
README: removed "intraprocedural only" limitation; updated feature descriptions to mention arg_in, return_out, def_use

Parity

scripts/parity-compare.mjs --dataflow — 36/36 fixtures OK, both engines identical including df-vertices.

Performance

Full build regression (~3s → ~11s on 707 files) is expected from:

P6 runDataflowVertexPass re-runs Rust visitor per file (~2-3s)
~10,737 interprocedural edge insertions
~10,000+ vertex row insertions

1-file rebuild: 108ms → 207ms (P4 incremental re-stitch reads caller files from disk).
No-op rebuild: unchanged (22ms).

Optimization opportunity (filed): have Rust pipeline return DataflowResult per file so P6 doesn't need a second visitor pass.

Test plan

npx vitest run tests/integration/dataflow*.test.ts — 40 passing
node scripts/parity-compare.mjs --dataflow — 36/36 OK
node scripts/parity-compare.mjs — 36/36 OK (no node/edge regressions)
npm run lint — clean
npx tsc --noEmit — clean
bench-check initial baseline saved

Open follow-ups

#1614: P4 incremental re-stitch on native engine path (unchanged callers)

greptile-apps · 2026-06-19T03:37:58Z

Greptile Summary

This PR completes the interprocedural variable-level dataflow model (P0–P5) across 20+ languages, adding schema vertices/edges, per-file param/return extraction, interprocedural stitching, incremental re-stitch (P4), the C-family rules, and rules for 18 additional languages spanning JVM, scripting, functional, and systems families. The push/pop asymmetry fix for Elixir/Clojure scopes (using a parallel pushRecord boolean stack) and the SQLite chunking for SQLITE_MAX_VARIABLE_NUMBER are both well-executed.

P4 incremental re-stitch (dataflow.ts): adds collectCallerStitchCandidates with 500-ID chunked IN queries and a full-build short-circuit guard, correctly re-stitching arg_in edges for unchanged callers when only a callee file changes.
Language rules (b2–b5): 18 new configs registered in DATAFLOW_RULES; most are well-documented with graceful fallbacks for non-standard tree-sitter fields. Two configs in b4.ts have a structural conflict: Elixir (callNode: 'call') and Clojure (callNode: 'list_lit') use the same node type for both functionNodes and callNode, which causes dispatchDataflowNode to return early before ever reaching call-processing logic, silently disabling all argument-flow tracking for those languages.
Erlang and OCaml both include a parent and its direct child node type in functionNodes (fun_decl/function_clause and value_definition/let_binding respectively), causing duplicate parameter registration per function.

Confidence Score: 3/5

Safe to merge for the languages not affected by the call-tracking bug, but Elixir and Clojure will silently emit zero argument-flow data until the callNode conflict is resolved.

The Elixir and Clojure configs set callNode to the same type already listed in functionNodes. The walker's dispatcher short-circuits on all such nodes before the call-handling branch is reached, so handleCallExpr is never invoked for either language — a silent correctness gap with no error signal. The Erlang and OCaml double-scope issue is a second independent concern that could produce duplicate or contradictory param vertices for those languages.

src/ast-analysis/rules/b4.ts — the Elixir, Clojure, Erlang, and OCaml config blocks need the most attention before those languages produce correct interprocedural dataflow output.
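
The callNode conflict described above can be modelled in a few lines. This is a simplified sketch of the ordering the review describes, not the project's dispatchDataflowNode: `RuleSketch`, `dispatchSketch`, and `callNodeIsShadowed` are hypothetical names, and the only details taken from the review are the Elixir and Clojure node types. It assumes the dispatcher checks functionNodes first and returns early.

```typescript
// Simplified model only. RuleSketch and dispatchSketch are hypothetical and
// assume the function-node check runs before the call-node check.
interface RuleSketch {
  functionNodes: string[];
  callNode: string;
}

type Dispatch = "function" | "call" | "other";

function dispatchSketch(nodeType: string, rules: RuleSketch): Dispatch {
  // The function-node check comes first and returns early...
  if (rules.functionNodes.indexOf(nodeType) !== -1) return "function";
  // ...so this branch is unreachable when callNode is also a function node.
  if (nodeType === rules.callNode) return "call";
  return "other";
}

// A config-time check that would flag the overlap before any file is walked.
function callNodeIsShadowed(rules: RuleSketch): boolean {
  return rules.functionNodes.indexOf(rules.callNode) !== -1;
}

// Node types for Elixir and Clojure are taken from the review text.
const elixirSketch: RuleSketch = { functionNodes: ["call"], callNode: "call" };
const clojureSketch: RuleSketch = { functionNodes: ["list_lit"], callNode: "list_lit" };
// Hypothetical non-overlapping config, for contrast only.
const distinctSketch: RuleSketch = {
  functionNodes: ["function_declaration"],
  callNode: "call_expression",
};

console.log(dispatchSketch("call", elixirSketch));
console.log(callNodeIsShadowed(clojureSketch));
console.log(dispatchSketch("call_expression", distinctSketch));
```

A check like `callNodeIsShadowed` could run once when rules are registered, which would turn the silent gap into a visible configuration error.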

Important Files Changed

Filename	Overview
src/ast-analysis/rules/b4.ts	New language rules for Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure. Two concrete bugs: Elixir and Clojure set callNode to the same type as functionNodes, making call tracking completely dead. Erlang and OCaml both include parent and child node types in functionNodes, causing double-scope and duplicate param registration.
src/ast-analysis/rules/b2.ts	New language rules for Kotlin, Swift, Scala, Dart, Groovy. Well-commented with appropriate fallbacks. No issues found.
src/ast-analysis/rules/b3.ts	New language rules for Lua, R, Julia, Bash. No critical issues found.
src/ast-analysis/rules/b5.ts	New language rules for Zig and Solidity. Well-configured with confirmed field names.
src/ast-analysis/rules/index.ts	Registers 20 new language configs correctly in DATAFLOW_RULES.
src/ast-analysis/visitors/dataflow-visitor.ts	Fixes push/pop asymmetry via parallel pushRecord boolean stack. Correct and symmetric.
src/features/dataflow.ts	Adds P4 incremental re-stitch logic with chunked SQLite queries and full-build guard. Well-structured.
tests/integration/dataflow-incremental.test.ts	New integration test for P4 re-stitch covering param vertex rebuild and arg_in edge creation.

Reviews (6): Last reviewed commit: "fix(dataflow): guard exitFunction pop ag..."

greptile-apps · 2026-06-19T03:38:02Z

+  const placeholders = changedFuncIds.map(() => '?').join(',');
+  const callerFileRows = db
+    .prepare(
+      `SELECT DISTINCT n.file AS caller_file
+       FROM dataflow d
+       JOIN nodes n ON n.id = d.source_id
+       WHERE d.target_id IN (${placeholders})
+         AND d.kind = 'flows_to'`,
+    )
+    .all(...changedFuncIds) as { caller_file: string }[];


SQLite variable-limit risk in dynamic IN clause

changedFuncIds is spread into .all(...changedFuncIds), which passes each element as a separate positional bind parameter. SQLite enforces SQLITE_MAX_VARIABLE_NUMBER at statement-prepare time; the historical default is 999 and the default from SQLite 3.32.0 onward is 32766. On a large incremental build (e.g. a batch commit touching many files where the combined function/method count in fileSymbols exceeds that limit), db.prepare(...) or .all(...) throws SQLITE_ERROR: too many SQL variables, propagating as an uncaught exception that aborts the entire build.

Consider batching the query in chunks (e.g. 500 IDs at a time) and collecting the distinct caller files across all batches, or using a temporary table to hold the IDs.

Fixed in the existing commit via chunked queries. The clause is now issued in batches of 500 IDs, collecting distinct caller files across all batches. This keeps the variable count well within SQLite's SQLITE_MAX_VARIABLE_NUMBER on all SQLite versions (999 legacy, 32766 on 3.32+).
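
The batching described in this thread can be sketched as follows. This is a minimal illustration, not the code in dataflow.ts: `BatchQuery`, `chunkIds`, and `collectDistinctCallerFiles` are hypothetical names, and the database call is replaced by a callback so the example runs without a SQLite binding. Only the batch size of 500 comes from the thread.

```typescript
// Hypothetical callback: runs one `IN (...)` query for a batch of IDs and
// returns the caller files found. It stands in for a prepared statement.
type BatchQuery = (idBatch: number[]) => string[];

const CHUNK_SIZE = 500; // batch size quoted in the fix

// Split a list into consecutive slices of at most `size` items.
function chunkIds<T>(items: readonly T[], size: number): T[][] {
  if (size <= 0) throw new RangeError("size must be positive");
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) {
    out.push(items.slice(i, i + size));
  }
  return out;
}

// Run the query once per batch and merge results into one distinct set.
function collectDistinctCallerFiles(
  changedFuncIds: readonly number[],
  runBatch: BatchQuery,
  size: number = CHUNK_SIZE,
): Set<string> {
  const files = new Set<string>();
  for (const batch of chunkIds(changedFuncIds, size)) {
    // Each statement binds at most `size` variables, which stays below the
    // legacy 999-variable default.
    for (const file of runBatch(batch)) files.add(file);
  }
  return files;
}

// Usage with a fake query that records the batch sizes it was given.
const demoIds = Array.from({ length: 1201 }, (_, i) => i);
const batchSizes: number[] = [];
const demoFiles = collectDistinctCallerFiles(demoIds, (batch) => {
  batchSizes.push(batch.length);
  return batch.length === 500 ? ["src/a.ts"] : ["src/b.ts"];
});
console.log(batchSizes, Array.from(demoFiles));
```

Deduplicating in a Set on the application side replaces the cross-batch effect of SELECT DISTINCT, which only applies within a single statement.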

greptile-apps · 2026-06-19T03:38:03Z

+  if (vstmts.available) {
+    const changedRelPaths = new Set<string>(fileSymbols.keys());
+    const changedFuncIds = collectFuncIdsForFiles(db, changedRelPaths);
+    const extra = await collectCallerStitchCandidates(
+      db,
+      changedFuncIds,
+      changedRelPaths,
+      rootDir,
+      extToLang,
+      parsers,
+      getParserFn,
+    );
+    allCandidates.push(...extra.candidates);
+    allCaptures.push(...extra.captures);
+  }


P4 runs unconditionally after every JS-path call, even full builds

vstmts.available only guards for the absence of the dataflow_vertices table — it does not distinguish full builds from incremental ones. For a full build, changedRelPaths equals all files in fileSymbols, so collectCallerStitchCandidates correctly returns empty after the filter. However, collectFuncIdsForFiles still issues one SELECT per changed file before discovering there is nothing to do. On a large initial build with thousands of files this adds noticeable overhead before the early-return in collectCallerStitchCandidates. A lightweight guard — e.g. skipping P4 entirely when fileSymbols.size equals the total file count in the DB — would avoid the extra queries on full builds.

Added a full-build guard before the P4 block: a single COUNT(DISTINCT file) query checks whether fileSymbols covers all DB files, and if so P4 is skipped entirely (no per-file SELECTs). For an incremental build the guard costs one fast count query; for a full build it avoids N per-file SELECTs with no result.
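
A minimal sketch of such a guard, under stated assumptions: `shouldRunIncrementalRestitch` and the `countDistinctDbFiles` callback are hypothetical, with the callback standing in for the single COUNT(DISTINCT file) query. Comparing sizes follows the reviewer's suggestion; the actual check in the PR may differ.

```typescript
// Hypothetical guard. `countDistinctDbFiles` stands in for one
// COUNT(DISTINCT file) query against the database.
function shouldRunIncrementalRestitch(
  changedFiles: ReadonlySet<string>,
  countDistinctDbFiles: () => number,
): boolean {
  const dbFileCount = countDistinctDbFiles();
  // When the changed set covers every file in the DB, no unchanged caller
  // can exist, so the re-stitch pass has nothing to do.
  return changedFiles.size < dbFileCount;
}

// Full build: all three DB files are in the changed set, so P4 is skipped.
const fullBuild = shouldRunIncrementalRestitch(
  new Set(["a.ts", "b.ts", "c.ts"]),
  () => 3,
);
// Incremental build: one of three files changed, so P4 runs.
const incrementalBuild = shouldRunIncrementalRestitch(new Set(["a.ts"]), () => 3);
console.log(fullBuild, incrementalBuild);
```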

greptile-apps · 2026-06-19T03:38:04Z

+  for (const callerFile of callerFiles) {
+    // Read the caller file from disk without touching its existing DB rows.
+    const stub: FileSymbolsDataflow = { definitions: [], _langId: null, _tree: null };
+    const data = getDataflowForFile(
+      stub,
+      callerFile,
+      rootDir,
+      extToLang,
+      activeParsers,
+      activeGetParserFn,
+    );
+    if (!data) continue;


definitions: [] stub silently skips callee-context-aware resolution

getDataflowForFile passes stub.definitions (always []) to extractDataflow. Today _definitions is unused, so this is harmless. But the coupling is implicit: if a future refactor makes extractDataflow use definitions to restrict which function bodies are walked (a natural optimisation), P4 will silently emit no stitch candidates for caller files whose definitions were not pre-loaded. A small comment here noting the intentional stub would make this safe to refactor.

Added a comment explaining the intentional stub: extractDataflow does not currently use _definitions, so passing [] is safe. The comment notes exactly what must change if a future refactor makes extractDataflow use definitions to filter which function bodies are walked.

github-actions · 2026-06-19T03:39:58Z

Codegraph Impact Analysis

29 functions changed → 15 callers affected across 5 files

getKotlinParamListNode in src/ast-analysis/rules/b2.ts:14 (0 transitive callers)
extractKotlinParamName in src/ast-analysis/rules/b2.ts:21 (0 transitive callers)
getSwiftParamListNode in src/ast-analysis/rules/b2.ts:59 (0 transitive callers)
extractSwiftParamName in src/ast-analysis/rules/b2.ts:69 (0 transitive callers)
getScalaParamListNode in src/ast-analysis/rules/b2.ts:109 (0 transitive callers)
extractScalaParamName in src/ast-analysis/rules/b2.ts:118 (0 transitive callers)
extractDartParamName in src/ast-analysis/rules/b2.ts:159 (0 transitive callers)
getGroovyParamListNode in src/ast-analysis/rules/b2.ts:209 (0 transitive callers)
extractGroovyParamName in src/ast-analysis/rules/b2.ts:220 (0 transitive callers)
extractHaskellParamName in src/ast-analysis/rules/b4.ts:12 (0 transitive callers)
getHaskellParamListNode in src/ast-analysis/rules/b4.ts:28 (0 transitive callers)
extractOCamlParamName in src/ast-analysis/rules/b4.ts:62 (0 transitive callers)
getOCamlParamListNode in src/ast-analysis/rules/b4.ts:74 (0 transitive callers)
extractFSharpParamName in src/ast-analysis/rules/b4.ts:115 (0 transitive callers)
getFSharpParamListNode in src/ast-analysis/rules/b4.ts:126 (0 transitive callers)
extractGleamParamName in src/ast-analysis/rules/b4.ts:161 (0 transitive callers)
extractElixirFunctionName in src/ast-analysis/rules/b4.ts:205 (0 transitive callers)
extractErlangParamName in src/ast-analysis/rules/b4.ts:259 (0 transitive callers)
getErlangParamListNode in src/ast-analysis/rules/b4.ts:266 (0 transitive callers)
extractClojureFunctionName in src/ast-analysis/rules/b4.ts:317 (0 transitive callers)

During an incremental build only the changed files are in fileSymbols. When a callee function's file changes, its param vertices are purged and rebuilt, but callers' files are not re-parsed, leaving their arg_in edges unrecreated (the target vertex was deleted by the purge).

Fix: after the per-file tx() loop, collectCallerStitchCandidates() queries flows_to edges that target any function in the changed file set and finds the distinct caller files NOT already in fileSymbols. Each caller file is read from disk and parsed to extract VisitorArgFlow data. StitchCandidates are built from those flows (filtered to only those calling a changed callee) and merged into allCandidates before the existing buildInterproceduralStitch post-pass runs.

This preserves the full vertex-level source/target_vertex pointers on arg_in edges: the caller's existing param/local vertex (untouched by the purge) is the source, and the callee's freshly inserted param vertex is the target. No caller DB rows are written, only reads.

Closes #1609.

docs check acknowledged
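
The caller-file selection step in this commit message can be sketched as a pure function. The `FlowsToRow` shape and `findCallerFilesToReparse` are hypothetical; in the PR the rows come from a query over flows_to edges, and the sample data below is invented for illustration.

```typescript
// Hypothetical row shape: one flows_to edge joined to its caller's file.
interface FlowsToRow {
  callerFile: string;
  targetFuncId: number;
}

// Return the distinct caller files that call a changed callee and were not
// themselves re-parsed in this build.
function findCallerFilesToReparse(
  rows: readonly FlowsToRow[],
  changedFuncIds: ReadonlySet<number>,
  changedFiles: ReadonlySet<string>,
): string[] {
  const callerFiles = new Set<string>();
  for (const row of rows) {
    const callsChangedCallee = changedFuncIds.has(row.targetFuncId);
    const alreadyReparsed = changedFiles.has(row.callerFile);
    if (callsChangedCallee && !alreadyReparsed) callerFiles.add(row.callerFile);
  }
  return Array.from(callerFiles).sort();
}

// Invented sample: functions 1 and 2 live in the changed file src/lib.ts.
const sampleRows: FlowsToRow[] = [
  { callerFile: "src/app.ts", targetFuncId: 1 },
  { callerFile: "src/app.ts", targetFuncId: 2 },
  { callerFile: "src/lib.ts", targetFuncId: 1 },
  { callerFile: "src/other.ts", targetFuncId: 9 },
];
const toReparse = findCallerFilesToReparse(
  sampleRows,
  new Set([1, 2]),
  new Set(["src/lib.ts"]),
);
console.log(toReparse);
```

Here src/lib.ts is excluded because it is already in the changed set, and src/other.ts is excluded because it calls no changed function.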

Adds DATAFLOW_RULES entries (function scope, param extraction, return/var/call tracking) for all remaining languages:
- B2 (JVM/mobile): Kotlin, Swift, Scala, Dart, Groovy
- B3 (scripting): Lua, R, Julia, Bash
- B4 (functional): Haskell, OCaml, F#, Gleam, Elixir, Erlang, Clojure
- B5 (systems/blockchain): Zig, Solidity

Also fixes enterFunctionScope to early-return when nameExtractor rejects a node (prevents Elixir/Clojure call/list_lit functionNodes from pushing spurious scopes for non-def expressions).

Also fixes collectCallerStitchCandidates to chunk IN queries at 500 to avoid SQLITE_MAX_VARIABLE_NUMBER on large repos (Fixes #1613).

Skips P4 on full builds where fileSymbols covers every file in the DB.

Closes #1610

carlos-alm · 2026-06-19T03:56:27Z

@greptileai

greptile-apps · 2026-06-19T04:07:27Z

+  // When nameExtractor is set it acts as a gate: null means this node is not a function
+  // definition. Needed for languages (Elixir, Clojure) where functionNodes includes generic
+  // node types (call/list_lit) that are only sometimes function definitions.
+  if (rules.nameExtractor && !rules.nameExtractor(funcNode)) return;


Asymmetric push/pop corrupts the scope stack for Elixir and Clojure

enterFunctionScope now returns early (no push) when nameExtractor says a node is not a function definition. But the exitFunction callback always calls scopeStack.pop() unconditionally (line ~547). In Elixir (functionNodes: ['call']) and Clojure (functionNodes: ['list_lit']), every call expression inside a function body triggers an enterFunction/exitFunction cycle. Because enterFunctionScope does not push for these non-def nodes, the subsequent pop() removes the enclosing def/defn scope instead.

For a function body containing N call expressions, the enclosing scope is popped N times. After the first call expression, currentScope(scopeStack) returns undefined, so all variable references, return tracking, and parameter registrations that follow are dropped. The fix should mirror the gate: either track which nodes actually pushed (e.g. a WeakSet<TreeSitterNode>) and only pop for those, or apply the same nameExtractor check inside exitFunction.

Fixed — enterFunctionScope now returns boolean (true when it pushed a scope frame, false on early return). A parallel pushRecord: boolean[] stack inside createDataflowVisitor mirrors the scope stack: enterFunction pushes the result, and exitFunction calls scopeStack.pop() only when pushRecord.pop() is true. This makes push/pop symmetric for all languages including Elixir and Clojure.
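
The parallel-stack fix can be sketched in isolation. This is an illustration under assumptions, not the code in dataflow-visitor.ts: `SketchNode`, `ScopeFrame`, and `createScopeTracker` are hypothetical, and the node carries a precomputed `defName` in place of a real nameExtractor over tree-sitter nodes.

```typescript
// Hypothetical shapes; the real visitor works on tree-sitter nodes.
interface SketchNode {
  type: string;
  defName: string | null; // non-null only when the node is a real definition
}
interface ScopeFrame {
  funcName: string;
}

function createScopeTracker(nameExtractor: (node: SketchNode) => string | null) {
  const scopeStack: ScopeFrame[] = [];
  const pushRecord: boolean[] = []; // one entry per enterFunction call

  // Returns true only when a scope frame was actually pushed.
  function enterFunctionScope(node: SketchNode): boolean {
    const funcName = nameExtractor(node);
    if (!funcName) return false; // gate: not a definition, no push
    scopeStack.push({ funcName });
    return true;
  }

  return {
    enterFunction(node: SketchNode): void {
      pushRecord.push(enterFunctionScope(node));
    },
    exitFunction(): void {
      // Pop the scope only if the matching enter pushed one.
      if (pushRecord.pop()) scopeStack.pop();
    },
    currentScope(): ScopeFrame | undefined {
      return scopeStack[scopeStack.length - 1];
    },
  };
}

const tracker = createScopeTracker((node) => node.defName);
tracker.enterFunction({ type: "call", defName: "greet" }); // a def: pushes
tracker.enterFunction({ type: "call", defName: null }); // plain call: no push
tracker.exitFunction(); // must not pop "greet"
const scopeAfterInnerCall = tracker.currentScope();
tracker.exitFunction(); // pops "greet"
const scopeAfterDef = tracker.currentScope();
console.log(scopeAfterInnerCall?.funcName, scopeAfterDef);
```

Because enter and exit calls nest, a plain boolean stack is enough to pair each exit with its enter, and it avoids holding references to syntax nodes as the WeakSet alternative would.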

…ionScope enterFunctionScope returns early (no push) when nameExtractor rejects a node. The unconditional scopeStack.pop() in exitFunction caused asymmetric push/pop for languages like Elixir and Clojure whose functionNodes include generic node types (call/list_lit) that only sometimes represent function definitions. Fix: enterFunctionScope now returns boolean (true = pushed). A parallel pushRecord boolean stack in createDataflowVisitor records the result of each enterFunction call; exitFunction pops scopeStack only when pushRecord.pop() is true.

carlos-alm · 2026-06-19T04:56:00Z

Addressed Greptile P1 (comment #3440188988): guarded exitFunction's scopeStack.pop() against early-return enterFunctionScope to prevent asymmetric push/pop on Elixir/Clojure. enterFunctionScope now returns boolean; a parallel pushRecord stack in createDataflowVisitor ensures pop only fires when a frame was actually pushed.

carlos-alm · 2026-06-19T04:56:03Z

@greptileai

greptile-apps Bot reviewed Jun 19, 2026

carlos-alm changed the title ~~feat(dataflow): P4 — incremental re-stitch when callee file changes~~ feat(dataflow): interprocedural variable-level model — P0-P5 complete Jun 19, 2026

carlos-alm added 2 commits June 18, 2026 21:55

carlos-alm force-pushed the feat/dataflow-vertex-schema-p0 branch from 63bb901 to 0d8039c Compare June 19, 2026 03:56

greptile-apps Bot reviewed Jun 19, 2026

carlos-alm and others added 2 commits June 18, 2026 22:12

Merge branch 'main' into feat/dataflow-vertex-schema-p0

a5afa8a

carlos-alm merged commit 6560a69 into main Jun 19, 2026
26 checks passed

carlos-alm deleted the feat/dataflow-vertex-schema-p0 branch June 19, 2026 05:30

github-actions Bot locked and limited conversation to collaborators Jun 19, 2026

carlos-alm changed the title ~~feat(dataflow): interprocedural variable-level model — P0-P5 complete~~ feat(dataflow): interprocedural variable-level model — P0-P6 complete, all 34 languages Jun 19, 2026
