
⚡️ Speed up function closest_matching_file_function_name by 175% in PR #1689 (consolidate-python-discovery)#1691

Closed
codeflash-ai[bot] wants to merge 2 commits into consolidate-python-discovery from codeflash/optimize-pr1689-2026-02-27T17.52.22

Conversation


@codeflash-ai codeflash-ai bot commented Feb 27, 2026

⚡️ This pull request contains optimizations for PR #1689

If you approve this dependent PR, these changes will be merged into the original PR branch consolidate-python-discovery.

This PR will be automatically closed if the original PR is merged.


📄 175% (1.75x) speedup for closest_matching_file_function_name in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime : 24.7 milliseconds → 8.99 milliseconds (best of 111 runs)

📝 Explanation and details

Runtime improvement (primary): The optimized version reduces the median runtime of the matcher from 24.7 ms to 8.99 ms — a 175% speedup. Line profiling shows closest_matching_file_function_name total time dropped from ~0.18 s to ~0.037 s, and the hot cost inside full Levenshtein calls was cut dramatically.

What changed (concrete optimizations)

  • Precompute candidate metadata: the code now builds a flattened candidates list of (file_path, function, fn_name_lower, fn_len). This removes repeated attribute lookups, .lower() and len() calls inside the inner loop.
  • Bounded (banded) Levenshtein: replaced many full-distance computations with _bounded_levenshtein(s1, s2, max_distance), which computes the distance only within a narrow diagonal band and returns a value greater than max_distance early if the result cannot beat the bound. The algorithm also:
    • early-exits when length difference already exceeds the bound,
    • restricts DP computation to a band [start..end] per row, and
    • performs a row-level early-exit if the minimum in the active band exceeds the bound.
  • Local binding and micro-optimizations: binding _bounded = _bounded_levenshtein and caching target_len reduce repeated name and attribute lookups. The bounded implementation uses manual comparisons for the min-of-three to avoid tuple/min() overhead in tight loops.
  • Keep original levenshtein_distance intact for external callers (no behavioral change for that API).

Why this speeds things up (mechanics)

  • The original profile shows most time was spent inside levenshtein_distance calls (many full DP passes). The bounded approach avoids doing the O(n*m) DP for any candidate that cannot beat the current min_distance — in a large candidate set most names are either skipped by length or exceed the bound early. That turns many expensive full-distance calls into quick checks or truncated DP runs.
  • Reducing repeated .lower() and len() calls removes Python-level work and attribute lookups inside tight loops, which is relatively expensive in Python.
  • Restricting the DP to a narrow band reduces inner-loop iterations and memory writes/reads per row, giving much less Python-level loop iteration overhead and fewer list operations.
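
To make the candidate-loop mechanics concrete, here is a self-contained sketch under stated assumptions: closest_match, FnStub, and the simple full-DP _bounded_levenshtein stand-in are all illustrative names rather than the repository's actual code (the real version uses the banded distance described above).

```python
from collections import namedtuple
from pathlib import Path

# Stand-in for the real FunctionToOptimize dataclass (illustrative only).
FnStub = namedtuple("FnStub", "function_name")

def _bounded_levenshtein(s1, s2, max_distance):
    # Simple stand-in: full Wagner-Fischer DP plus the length-difference early exit.
    if abs(len(s1) - len(s2)) > max_distance:
        return max_distance + 1
    prev = list(range(len(s1) + 1))
    for i, c2 in enumerate(s2, 1):
        cur = [i]
        for j, c1 in enumerate(s1, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (c1 != c2)))
        prev = cur
    return prev[-1] if prev[-1] <= max_distance else max_distance + 1

def closest_match(target, found_fns, min_distance=4):
    target_lower = target.lower()
    target_len = len(target_lower)
    # Precompute (path, fn, lowered name, length) once, outside the hot loop.
    candidates = [
        (path, fn, fn.function_name.lower(), len(fn.function_name))
        for path, fns in found_fns.items()
        for fn in fns
    ]
    _bounded = _bounded_levenshtein  # local binding avoids repeated global lookups
    best = None
    for path, fn, name_lower, name_len in candidates:
        # Length difference is a lower bound on edit distance: cheap skip.
        if abs(name_len - target_len) >= min_distance:
            continue
        # Any distance >= min_distance cannot improve the match, so bound at -1.
        dist = _bounded(target_lower, name_lower, min_distance - 1)
        if dist < min_distance:
            min_distance = dist
            best = (path, fn)
    return best
```

Note how min_distance tightens as better matches are found, so later candidates face an ever-narrower bound and are rejected ever more cheaply.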

Evidence from profiling and tests

  • Line profiler: original had ~1.78e8 ns in dist calls; optimized shows ~3.25e7 ns for bounded calls — a big reduction in DP work.
  • Annotated tests show the biggest wins for larger and near-match workloads (large_scale_exact_match, large_scale_near_match, choose_closest_of_multiple_candidates, special_characters_and_dots). Example: large-scale search for one exact name dropped from ~13.3 ms to ~5.99 ms in one annotated test.
  • The bounded variant still returns a number > max_distance when the true distance exceeds the bound, which preserves the selection logic (we only care about distances smaller than current min_distance).

Trade-offs and regressions to be aware of

  • Small overhead for tiny inputs: building the candidates list and extra function call overhead cause a few microbenchmarks to be slightly slower (some tests show small regressions, e.g., empty maps or single-item trivial cases). These regressions are minimal and expected given the additional precompute allocation; they are an acceptable trade-off for the large reductions when many candidates are present.
  • Memory: candidates list adds short-lived tuples (file_path, function, lower-name, length). This is small relative to the CPU savings when many names are checked.

Hot-path impact

  • get_functions_to_optimize can call closest_matching_file_function_name when the exact function wasn't found; in common interactive/CLI flows this is a helpful path. The optimization yields the most benefit when many functions are present (e.g., scanning a repository or when user passes a mistyped name and there are many candidates). Because the heavy work (distance comparisons across many candidates) is now bounded and cheaper, interactive latency and batch throughput improve in those hot scenarios.

Summary

  • Primary benefit: 175% speedup in runtime for the matcher by avoiding repeated costly full Levenshtein computations and reducing Python-level overhead inside the candidate loop.
  • Side-effects: minimal memory allocation for candidate metadata and a few micro regressions on trivial inputs — reasonable trade-offs given the large improvements for real workloads (many candidates / near-match searches).
  • Correctness: original external levenshtein_distance kept intact; the bounded version preserves selection semantics (any distance >= current min_distance is treated as not improving the match).

Correctness verification report:

Test Status
⚙️ Existing Unit Tests 🔘 None Found
🌀 Generated Regression Tests 20 Passed
⏪ Replay Tests 🔘 None Found
🔎 Concolic Coverage Tests 🔘 None Found
📊 Tests Coverage 100.0%
🌀 Click to see Generated Regression Tests
from pathlib import Path

# imports
import pytest  # used for our unit tests
# import the real functions and dataclass from the codebase under test
from codeflash.discovery.functions_to_optimize import (
    closest_matching_file_function_name, levenshtein_distance)
from codeflash.models.function_types import FunctionToOptimize

def test_exact_match_and_case_insensitive():
    # Create a FunctionToOptimize with a simple name and Path
    file_path = Path("/tmp/file_a.py")
    fto = FunctionToOptimize(function_name="compute", file_path=file_path)
    found = {file_path: [fto]}

    # exact lowercase query should return the tuple (path, function)
    codeflash_output = closest_matching_file_function_name("compute", found); result = codeflash_output # 20.4μs -> 3.29μs (520% faster)

    # case-insensitive: query with different case should still match
    codeflash_output = closest_matching_file_function_name("Compute", found); result_case = codeflash_output # 16.0μs -> 1.34μs (1089% faster)

def test_no_match_returns_none_when_all_skipped_or_too_different():
    # A function with a very long name that will be skipped due to length difference
    long_path = Path("/tmp/long.py")
    long_fn = FunctionToOptimize(function_name="very_long_function_name", file_path=long_path)
    # Query a very short name - the absolute length difference >= min_distance (4)
    codeflash_output = closest_matching_file_function_name("shrt", {long_path: [long_fn]}); result = codeflash_output # 2.36μs -> 2.73μs (13.2% slower)

    # Also test completely empty found_fns mapping returns None
    codeflash_output = closest_matching_file_function_name("anything", {}) # 531ns -> 731ns (27.4% slower)

def test_choose_closest_of_multiple_candidates():
    # Construct multiple candidate functions with varying closeness to the query
    p = Path("/tmp/multiple.py")
    # Use function_name strings (no parents) - qualified_name == function_name
    f_exact = FunctionToOptimize(function_name="process_data", file_path=p)
    f_one_missing = FunctionToOptimize(function_name="procss_data", file_path=p)
    f_wrong = FunctionToOptimize(function_name="other_thing", file_path=p)

    # Query is slightly misspelled version of "process_data"
    query = "proces_data"  # missing one 's' relative to "process_data"
    found = {p: [f_wrong, f_one_missing, f_exact]}

    # The algorithm should find the closest (lowest Levenshtein distance) function.
    # Here, "process_data" is the true closest match (distance 1), although it appears last.
    codeflash_output = closest_matching_file_function_name(query, found); res = codeflash_output # 77.7μs -> 42.1μs (84.5% faster)

def test_tie_breaker_prefers_first_encountered_candidate():
    # The function uses '<' when comparing distances, so on ties the first encountered candidate remains.
    p = Path("/tmp/tie.py")
    # Two candidates with the same Levenshtein distance to the query
    f1 = FunctionToOptimize(function_name="aXc", file_path=p)  # distance 1 from "abc"
    f2 = FunctionToOptimize(function_name="abY", file_path=p)  # distance 1 from "abc"

    found = {p: [f1, f2]}
    # Query "abc" is at distance 1 to both; the first should be returned
    codeflash_output = closest_matching_file_function_name("abc", found); res = codeflash_output # 12.8μs -> 17.6μs (27.2% slower)

def test_empty_query_and_empty_function_name_match():
    # A function whose function_name is an empty string
    p = Path("/tmp/empty.py")
    empty_fn = FunctionToOptimize(function_name="", file_path=p)
    found = {p: [empty_fn]}

    # Empty query string should match the empty function name (distance 0)
    codeflash_output = closest_matching_file_function_name("", found); res = codeflash_output # 4.13μs -> 2.89μs (43.0% faster)

def test_special_characters_and_dots_in_function_name():
    # Use dots inside the function_name to simulate qualified names without using parents
    p = Path("/tmp/dotted.py")
    dotted = FunctionToOptimize(function_name="MyClass.my_method", file_path=p)
    # Query a lowercase variant and ensure matching is case-insensitive
    codeflash_output = closest_matching_file_function_name("myclass.my_method", {p: [dotted]}); res = codeflash_output # 83.5μs -> 2.90μs (2785% faster)

    # Query only the method portion - because the stored qualified_name contains the class prefix,
    # the length/distance rules may skip it. We assert that it returns None in this case,
    # reflecting the actual implementation (it compares the provided qualified name to the stored qualified_name).
    codeflash_output = closest_matching_file_function_name("my_method", {p: [dotted]}) # 1.23μs -> 1.41μs (12.7% slower)

def test_large_scale_search_finds_exact_among_thousand_entries():
    # Build up to 1000 FunctionToOptimize instances distributed across several files.
    num = 1000  # target size per instructions (up to 1000)
    file_count = 10
    files = [Path(f"/tmp/large_file_{i}.py") for i in range(file_count)]

    found_fns: dict[Path, list[FunctionToOptimize]] = {}
    target_name = "fn_500"
    target_path = files[5]  # place the target in file index 5

    # Populate mapping with many entries; ensure deterministic ordering by iterating i
    for i in range(num):
        filename = f"fn_{i}"
        # disperse across files evenly
        file_for_fn = files[i % file_count]
        fto = FunctionToOptimize(function_name=filename, file_path=file_for_fn)
        found_fns.setdefault(file_for_fn, []).append(fto)

    # Query for the exact name; should return the correct Path and the exact FunctionToOptimize instance
    codeflash_output = closest_matching_file_function_name(target_name, found_fns); result = codeflash_output # 721μs -> 530μs (35.9% faster)
    path_res, fn_res = result
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
from pathlib import Path  # used to construct file path keys

# imports
import pytest  # used for our unit tests
from codeflash.discovery.functions_to_optimize import (
    closest_matching_file_function_name, levenshtein_distance)
from codeflash.models.function_types import \
    FunctionToOptimize  # real dataclass used by the function

def test_closest_match_exact_name_and_case_insensitive():
    # Create a FunctionToOptimize with a simple top-level name
    fn = FunctionToOptimize(function_name="process", file_path=Path("/project/a.py"))
    found = {Path("/project/a.py"): [fn]}
    # Exact match (same case)
    codeflash_output = closest_matching_file_function_name("process", found); result = codeflash_output # 19.6μs -> 3.05μs (544% faster)
    # Case-insensitive match (different case)
    codeflash_output = closest_matching_file_function_name("Process", found); result2 = codeflash_output # 15.6μs -> 1.29μs (1108% faster)

def test_closest_match_with_qualified_like_function_name():
    # Even if function_name contains a dot (simulating a qualified name),
    # qualified_name property returns it intact when parents is empty.
    fn = FunctionToOptimize(function_name="MyClass.do_stuff", file_path=Path("/project/b.py"))
    found = {Path("/project/b.py"): [fn]}
    # Searching for the fully qualified name should return the function
    codeflash_output = closest_matching_file_function_name("MyClass.do_stuff", found) # 74.6μs -> 2.83μs (2532% faster)

def test_empty_found_fns_returns_none():
    # When there are no functions to search, result should be None
    codeflash_output = closest_matching_file_function_name("anything", {}) # 1.07μs -> 1.30μs (17.7% slower)

def test_skip_by_length_difference_leads_to_none_when_all_skipped():
    # min_distance defaults to 4 in the implementation.
    # If all candidate names differ in length by >= 4, they are skipped.
    long_fn = FunctionToOptimize(function_name="very_long_function_name_here", file_path=Path("/x.py"))
    found = {Path("/x.py"): [long_fn]}
    # Query is short; the only candidate is skipped because abs length diff >= 4
    codeflash_output = closest_matching_file_function_name("a", found) # 2.08μs -> 2.77μs (24.6% slower)

def test_input_none_raises_attribute_error():
    # Passing None as the query will attempt to call .lower() and should raise AttributeError
    with pytest.raises(AttributeError):
        closest_matching_file_function_name(None, {}) # 2.94μs -> 2.94μs (0.306% slower)

def test_special_characters_and_small_distance():
    # Special characters should be compared literally and result in small Levenshtein distance
    f1 = FunctionToOptimize(function_name="do-thing_v2", file_path=Path("/spec.py"))
    found = {Path("/spec.py"): [f1]}
    # A query that swaps separators should be close (small edit distance) and be accepted
    codeflash_output = closest_matching_file_function_name("do_thing-v2", found); res = codeflash_output # 40.3μs -> 31.5μs (27.8% faster)

def test_tie_breaking_prefers_first_found():
    # If two distinct functions have the same Levenshtein distance to the query,
    # the function picks the first encountered one (strictly less is used to update).
    a = FunctionToOptimize(function_name="abc", file_path=Path("/first.py"))
    b = FunctionToOptimize(function_name="abd", file_path=Path("/second.py"))
    # Order matters: put 'a' first so it should be selected on ties.
    found = {Path("/first.py"): [a], Path("/second.py"): [b]}
    # Both 'abc' and 'abd' are distance 1 from 'abe' -> expect the first (a) to be chosen
    codeflash_output = closest_matching_file_function_name("abe", found) # 12.9μs -> 17.9μs (28.0% slower)

def test_large_scale_exact_match_among_many_candidates():
    # Construct many functions (1000) distributed across 10 files.
    # We will search for one known name and expect the exact corresponding instance.
    found_fns: dict[Path, list[FunctionToOptimize]] = {}
    target_index = 789  # pick a target index to search for
    target_fn = None
    for i in range(1000):
        # Build a deterministic name
        name = f"func{i}"
        file_path = Path(f"/many_files/file_{i % 10}.py")
        fn = FunctionToOptimize(function_name=name, file_path=file_path)
        # Add to the mapping
        found_fns.setdefault(file_path, []).append(fn)
        # Save the target instance for later comparison
        if i == target_index:
            target_fn = fn

    # Search for the target name; should find exact (distance 0)
    codeflash_output = closest_matching_file_function_name(f"func{target_index}", found_fns); res = codeflash_output # 13.3ms -> 5.99ms (122% faster)

def test_large_scale_near_match_efficiency_and_correctness():
    # Build a large set where the best match is a near miss among many distractors.
    found_fns: dict[Path, list[FunctionToOptimize]] = {}
    # Create 500 distractors with names far from the query (ensuring many will be skipped by length diff)
    for i in range(500):
        name = "X" * (10 + (i % 10))  # long names to be skipped for short queries
        file_path = Path(f"/distractors/file_{i}.py")
        found_fns.setdefault(file_path, []).append(FunctionToOptimize(function_name=name, file_path=file_path))

    # Insert a few plausible near matches; one should be closest.
    good1 = FunctionToOptimize(function_name="compute_value", file_path=Path("/good/one.py"))
    good2 = FunctionToOptimize(function_name="compute_val", file_path=Path("/good/two.py"))
    found_fns.setdefault(Path("/good/one.py"), []).append(good1)
    found_fns.setdefault(Path("/good/two.py"), []).append(good2)

    # Query is closer to "compute_val" (shorter by a few chars)
    codeflash_output = closest_matching_file_function_name("compute_val", found_fns); result = codeflash_output # 10.3ms -> 2.33ms (343% faster)
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, git checkout codeflash/optimize-pr1689-2026-02-27T17.52.22 and push.


@codeflash-ai codeflash-ai bot added ⚡️ codeflash Optimization PR opened by Codeflash AI 🎯 Quality: High Optimization Quality according to Codeflash labels Feb 27, 2026

claude bot commented Feb 27, 2026

PR Review Summary

Prek Checks

Fixed in commit cca120dd:

  • Removed duplicate _bounded_levenshtein function definition (the function was defined twice — 75 lines of dead code)
  • Fixed docstring formatting (D205: blank line between summary and description)
  • Applied FURB136 auto-fixes (use min() instead of ternary)
  • Fixed D209 auto-fixes (newline after last paragraph)
  • Cleaned up extra blank lines

mypy: No issues found.

Code Review

Issues found and fixed:

  1. Duplicate function definition (fixed): _bounded_levenshtein was defined twice at end of file. The second definition silently shadowed the first. While both were identical (no functional bug), this was clearly unintended code duplication.

Observations (non-blocking):

  2. Dead code: The original levenshtein_distance function (lines 419-450) is no longer called anywhere — it was replaced by _bounded_levenshtein but left behind. Consider removing it.
  3. Algorithm correctness: The bounded Levenshtein optimization is correct — passing min_distance - 1 as the bound ensures behavior equivalent to the original full-distance computation for the closest-match use case.

Test Coverage

File | Stmts (PR) | Miss (PR) | Cover (PR) | Cover (main) | Delta
codeflash/discovery/functions_to_optimize.py | 545 | 197 | 64% | 70% | -6%

Coverage regression: -6% overall for this file.

Details:

  • _bounded_levenshtein (lines 951-1023): 0% coverage — the entire new function is untested
  • closest_matching_file_function_name (lines 381-416): 0% coverage — the modified function body is also untested
  • Note: The closest_matching_file_function_name function was also uncovered on main, but the new _bounded_levenshtein adds ~35 new uncovered statements

Pre-existing test failure (unrelated to this PR): tests/test_tracer.py::TestTracer::test_tracer_initialization_normal


Last updated: 2026-02-27T18:20:00Z

@@ -936,3 +946,78 @@ def filter_files_optimized(file_path: Path, tests_root: Path, ignore_paths: list
file_path in submodule_paths
or any(file_path.is_relative_to(submodule_path) for submodule_path in submodule_paths)

Bug (fixed in latest commit): _bounded_levenshtein was defined twice — the second definition silently shadowed the first. While both implementations were identical so there was no functional bug, this is dead code that should not be committed. Fixed by removing the duplicate in cca120dd.


claude bot commented Feb 27, 2026

Duplicate Code Analysis

No duplicates detected.

I analyzed all changed files in this PR:

  • codeflash/discovery/functions_to_optimize.py - Added is_property() method, optimized closest_matching_file_function_name(), added _bounded_levenshtein()
  • codeflash/languages/python/support.py - Simplified discover_functions()
  • codeflash/lsp/beta.py - Updated type hints

All new/modified functions were checked against the codebase:

  • FunctionVisitor.is_property() - Unique decorator-checking logic for properties (not a duplicate of similar pytest fixture checking—different decorators, different purposes)
  • _bounded_levenshtein() - New optimization of the existing levenshtein_distance() function with early-exit logic (complementary, not duplicate)
  • closest_matching_file_function_name() - Unique function matching logic
  • Cross-language check - Python and JavaScript support classes have appropriately different implementations

This PR actually reduces duplication by consolidating Python function discovery to use libcst consistently and removing the old AST-based function_is_a_property function.


codeflash-ai bot commented Feb 27, 2026

⚡️ Codeflash found optimizations for this PR

📄 995% (9.95x) speedup for _bounded_levenshtein in codeflash/discovery/functions_to_optimize.py

⏱️ Runtime : 97.4 milliseconds → 8.90 milliseconds (best of 157 runs)

A dependent PR with the suggested changes has been created. Please review:

If you approve, it will be merged into this PR (branch codeflash/optimize-pr1689-2026-02-27T17.52.22).


@KRRT7 KRRT7 closed this Feb 27, 2026
@codeflash-ai codeflash-ai bot deleted the codeflash/optimize-pr1689-2026-02-27T17.52.22 branch February 27, 2026 20:37