Fix static-tool runners to pass the strict scorer (col_offset emission + Jedi correctness) #15

Open
bgoconnor wants to merge 5 commits into secure-software-engineering:main from Archway-Labs-AI:fix/static-runners-emit-col-offset

@bgoconnor

Hi TypeEvalPy maintainers — we've been using the benchmark and noticed that the static-tool runners can't pass the strict scorer (is_same_element, added in 2f7c6056) because they don't emit col_offset. This PR fixes that for HeaderGen, Jedi, and Scalpel. Before: 0 strict matches across all three on micro-benchmark. After: 580/476/180 respectively.

Changes

Jedi (src/target_tools/jedi/src/jedi_type_inference.py)

  • Emit col_offset in the four _info dict construction sites (Jedi's API already provides column; the runner was collecting it via pos["column"] and then discarding it).
  • Rewrite get_function_name to walk parent scopes — the old split(".", 1)[-1] only stripped one dotted component, producing names like assignments.chained.main.func1 instead of func1 outside Docker's shallow file layouts. Also correctly produces qualified names for nested functions (func.dec).
  • Replace parent.name == parent.module_name with parent.type == "module" — the old check added a spurious function key to module-level variables.
  • Distinguish function-definition site from function-reference site, so a = func1 types as callable rather than func1's return type.
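The parent-scope walk and the module check described above can be sketched as follows. This is a minimal illustration, not the runner's code: `Scope` is a hypothetical stand-in for a Jedi name object (assumed to expose `name`, `type`, and a `parent()` method), and `qualified_function_name` is an illustrative helper name. Including class scopes in the qualified name is also an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Scope:
    """Hypothetical stand-in for a Jedi name: a name, a type, and a parent."""
    name: str
    type: str  # e.g. "module", "function", "class"
    parent_scope: Optional["Scope"] = None

    def parent(self):
        return self.parent_scope


def qualified_function_name(scope):
    """Join enclosing scope names, stopping at the module (checked by type)."""
    parts = []
    current = scope
    while current is not None and current.type != "module":
        parts.append(current.name)
        current = current.parent()
    # A module-level scope yields no parts, so no function name is produced.
    return ".".join(reversed(parts)) or None


# The old approach only strips the first dotted component of a full name
# (the full name below is a made-up example):
old_name = "pkg.assignments.chained.main.func1".split(".", 1)[-1]

module = Scope("main", "module")
func = Scope("func", "function", module)
dec = Scope("dec", "function", func)
nested_name = qualified_function_name(dec)
```

Here `old_name` still carries the package path, while the scope walk depends only on the nesting of definitions, so the result does not change with the directory depth of the file.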

HeaderGen (src/target_tools/headergen/src/translator.py + 1 line in runner.py)

  • The headergen server doesn't return position info. The translator now parses the source with stdlib ast, builds a (name, line) → col_offset map, and attaches columns to the server's entries. Subscript/attribute expressions use the base name's column; nested functions use the inner name's. Ambiguous lookups (>1 candidate on the same line, e.g. x = lambda x: x) are skipped rather than guessed — 5 of 853 entries across the benchmark.
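A minimal sketch of this kind of source-parsed enrichment, using only stdlib `ast`. The function names and the entry keys (`variable`, `line_number`) are assumptions for illustration, not the translator's actual code. The sketch covers variable and parameter names only; function definitions would need separate handling, because `ast.FunctionDef.col_offset` points at the `def` keyword rather than the name.

```python
import ast
import re
from collections import defaultdict


def build_position_map(source):
    """Map (name, line) -> sorted 1-indexed columns where that name appears."""
    candidates = defaultdict(set)
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name):
            candidates[(node.id, node.lineno)].add(node.col_offset + 1)
        elif isinstance(node, ast.arg):
            candidates[(node.arg, node.lineno)].add(node.col_offset + 1)
    return {key: sorted(cols) for key, cols in candidates.items()}


def attach_col_offset(entry, positions):
    """Return a copy of entry with col_offset set only if it is unambiguous."""
    # "a[0]" and "self.x" are looked up by their base name ("a", "self").
    base = re.split(r"[.\[]", entry["variable"], maxsplit=1)[0]
    columns = positions.get((base, entry["line_number"]), [])
    enriched = dict(entry)
    if len(columns) == 1:  # skip ambiguous or unfindable names, never guess
        enriched["col_offset"] = columns[0]
    return enriched


source = "a = [1, 2]\nx = lambda x: x\na[0] = 3\n"
positions = build_position_map(source)
unique = attach_col_offset({"variable": "a[0]", "line_number": 3}, positions)
ambiguous = attach_col_offset({"variable": "x", "line_number": 2}, positions)
```

In this example `unique` gains a `col_offset` of 1, while `ambiguous` is returned without one because `x` occurs three times on line 2.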

Scalpel (src/target_tools/scalpel/src/translator.py + 2 lines in runner.py)

  • Same source-parsed enrichment as HeaderGen. Zero ambiguous lookups across the benchmark.

HeaderGen Dockerfile

  • Install g++ in addition to gcc. HeaderGen's transitive dep line-profiler has a Cython C++ extension; on arm64 Linux + Python 3.10 (no prebuilt wheel) pip builds from source and fails without g++.

Results (Docker, micro-benchmark, main_analyze_results.py Total line)

| Tool | Lenient before | Lenient after | Strict before | Strict after |
|---|---|---|---|---|
| Jedi | 414/850 | 479 | 0 | 476 |
| HeaderGen | 591/850 | 601 | 0 | 580 |
| Scalpel | 183/850 | 187 | 0 | 180 |

Lenient also rises modestly, mostly from Jedi's three non-col_offset fixes (which help under both scorers).
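To illustrate why a prediction without col_offset can pass the lenient scorer yet fail the strict one, here is a rough sketch. It is not the benchmark's `is_same_element`; the function names and field names are assumptions, and type comparison is left out. It only assumes that strict matching additionally requires position fields to agree.

```python
# Assumed identifying fields; the benchmark's actual keys may differ.
NAME_KEYS = ("file", "function", "variable", "parameter")
POSITION_KEYS = ("line_number", "col_offset")


def lenient_match(pred, truth):
    """Hypothetical lenient check: identifying names only, no positions."""
    return all(pred.get(k) == truth.get(k) for k in NAME_KEYS)


def strict_match(pred, truth):
    """Hypothetical strict check: names plus line and column must agree."""
    return lenient_match(pred, truth) and all(
        pred.get(k) == truth.get(k) for k in POSITION_KEYS
    )


truth = {"file": "main.py", "variable": "a", "line_number": 3, "col_offset": 1}
without_col = {"file": "main.py", "variable": "a", "line_number": 3}
with_col = dict(without_col, col_offset=1)

results = (
    lenient_match(without_col, truth),
    strict_match(without_col, truth),
    strict_match(with_col, truth),
)
```

Under these assumptions, a runner that never emits col_offset scores zero strict matches regardless of how accurate its inferred types are.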

Methodology note

The HeaderGen and Scalpel col_offset values come from parsing the source with stdlib ast, not from the tool's inference output. We validated empirically that this is recovery, not synthesis: 805/853 HeaderGen entries and 369/369 Scalpel entries have a uniquely-determined source position from the (name, line_number) the tool already emits. The translator skips ambiguous cases. The +1 adjustment for 1-indexed columns follows the existing convention used by Pyright's and RightTyper's runners.

Happy to iterate.

bgoconnor and others added 5 commits May 27, 2026 16:35

The Jedi runner had four pre-existing bugs that caused every prediction
to fail the strict scorer (`is_same_element`, added in 2f7c605):

1. Predictions did not emit col_offset, which strict matching requires.
2. get_function_name only stripped the first dotted component, so paths
   deeper than two levels produced over-long names like
   "assignments.chained.main.func1" instead of "func1". The runner
   silently broke outside of Docker's shallow path layout.
3. Module-level variables received a spurious "function" key because the
   parent-is-module check compared parent.name to parent.module_name
   instead of checking parent.type == "module".
4. Function-reference assignments (e.g., `a = func1`) were typed as
   func1's return type instead of callable, because the runner ran
   find_types_by_execute on every inferred function without checking
   whether the position was the function's own definition site.

Result on micro-benchmark: 5/850 (local, old) and 414/850 (Docker, old)
both rise to 433/850 under both lenient and strict scorers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The HeaderGen server (headergen PyPI package) returns inference results
without col_offset, so every prediction failed the strict scorer
(`is_same_element`). HeaderGen's runner is a thin HTTP client and does
not build its own dicts, so the fix lives in translator.py.

Approach: after receiving the server response, parse the source with
Jedi to build a (name, line) -> col_offset map, then look up each
entry's position. Subscript and attribute expressions reported as
"a[0]" or "self.x" use the base name's column; nested functions
reported as "outer.inner" use the inner name's column.

Result on micro-benchmark: 0/850 strict rises to 603/850 under strict.
Lenient is essentially unchanged (612 baseline -> 611). The 8-entry
gap between lenient (611) and strict (603) is line_number mismatches
between HeaderGen and GT that the lenient scorer silently accepts
(line_number checks are commented out in large_scale_analysis.py).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Scalpel's runner builds dicts in-process but does not include col_offset,
so every prediction failed the strict scorer (`is_same_element`). The
fix mirrors the HeaderGen approach: parse the source with Jedi to build
a position map and look up each entry's column in translator.py, called
once after process_file in the runner.

Result on micro-benchmark: 0/845 strict rises to 179/845 under strict.
Lenient is preserved at 182/845 (Docker baseline 183/850 - close enough
for path/version drift). One file fails Scalpel inference; not addressed
here.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…guous

The previous version of these translators used jedi for the (name, line)
-> col_offset position lookup. That has two issues:

1. Architectural smell: Scalpel's runner would have to depend on jedi (a
   sibling tool being evaluated). HeaderGen happened to have jedi as a
   transitive dep so the import worked, but the principle was wrong.

2. Synthesis vs recovery: when multiple names share the same (name, line)
   key (e.g., `x = lambda x: x` has three `x`s on one line), the previous
   code silently picked the first via setdefault. That risks attaching a
   col_offset the tool didn't actually intend.

This commit replaces jedi with stdlib `ast` (same lookup, no extra dep,
no cross-tool entanglement) and skips col_offset emission entirely when
the lookup is ambiguous. Empirically across the full micro-benchmark:

- HeaderGen: 805/853 entries have a unique position, 5 are ambiguous,
  43 are unfindable in source (the latter are HeaderGen's `ClassName.attr`
  style which differs from GT's `self.attr` convention; unmatchable
  regardless).
- Scalpel: 369/369 unique, zero ambiguous.

So the col_offset enrichment is recovery (the position is determined by
what the tool already emitted) for >94% of HeaderGen output and 100% of
Scalpel output. The remaining ambiguous entries are now correctly handled
by NOT attaching a position rather than guessing.

Docker results after this change:
- HeaderGen: 0 -> 580/850 strict (591 -> 601 lenient)
- Scalpel:   0 -> 180/850 strict (183 -> 187 lenient)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The HeaderGen Docker image previously installed only gcc, not g++.
HeaderGen's transitive dependency `line-profiler` has a Cython C++
extension; on platforms where no prebuilt wheel exists (e.g., arm64
Linux + Python 3.10), pip falls back to building from source, which
requires g++. The build then fails with:

    g++ -... -c line_profiler/_line_profiler.cpp -o ...
    error: command 'g++' failed: No such file or directory
    ERROR: Failed building wheel for line-profiler

This change adds g++ to the apt-get install line. The image now builds
cleanly on arm64 hosts as well as amd64.

(Builds may have appeared to succeed previously on hosts where a
line-profiler wheel was already cached from an earlier build with g++
present; fresh builds without the cache hit the source-compile path
and surface the issue.)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>