From 96688b3a7088e9808321564c924a913dc24b16cf Mon Sep 17 00:00:00 2001
From: "codeflash-ai[bot]" <148906541+codeflash-ai[bot]@users.noreply.github.com>
Date: Fri, 27 Feb 2026 17:52:25 +0000
Subject: [PATCH 1/2] Optimize closest_matching_file_function_name
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Runtime improvement (primary): the optimized version reduces the median
runtime of the matcher from 24.7 ms to 8.99 ms, a 175% speedup. Line
profiling shows that total time in closest_matching_file_function_name
dropped from ~0.18 s to ~0.037 s, and the hot cost inside full Levenshtein
calls was cut dramatically.

What changed (concrete optimizations)
- Precompute candidate metadata: the code now builds a flattened candidates
  list of (file_path, function, fn_name_lower, fn_len) tuples. This removes
  repeated attribute lookups and .lower() and len() calls from the inner loop.
- Bounded (banded) Levenshtein: many full-distance computations are replaced
  with _bounded_levenshtein(s1, s2, max_distance), which computes the distance
  only within a narrow band and returns a value greater than max_distance as
  soon as the result cannot be better. The algorithm also:
  - early-exits when the length difference alone already exceeds the bound,
  - restricts the DP computation to a band [start..end] per row, and
  - performs a row-level early exit if the minimum in the active band exceeds
    the bound.
- Local binding and micro-optimizations: a local binding
  _bounded = _bounded_levenshtein and a cached target_len reduce attribute
  lookups. The bounded implementation uses manual comparisons for the
  min-of-three to avoid tuple/min overhead in tight loops.
- The original levenshtein_distance is kept intact for external callers (no
  behavioral change for that API).

Why this speeds things up (mechanics)
- The original profile shows most time was spent inside levenshtein_distance
  calls (many full DP passes). The bounded approach avoids the O(n*m) DP for
  any candidate that cannot beat the current min_distance; in a large
  candidate set, most names are either skipped by length or exceed the bound
  early. That turns many expensive full-distance calls into quick checks or
  truncated DP runs.
- Removing repeated .lower() and len() calls eliminates Python-level work and
  attribute lookups inside tight loops, which are relatively expensive in
  Python.
- Restricting the DP to a narrow band reduces inner-loop iterations and memory
  reads/writes per row, giving far less Python-level loop overhead and fewer
  list operations.

Evidence from profiling and tests
- Line profiler: the original spent ~1.78e8 ns in distance calls; the
  optimized version spends ~3.25e7 ns in bounded calls, a large reduction in
  DP work.
- Annotated tests show the biggest wins for larger and near-match workloads
  (large_scale_exact_match, large_scale_near_match,
  choose_closest_of_multiple_candidates, special_characters_and_dots). For
  example, a large-scale search for one exact name dropped from ~13.3 ms to
  ~5.99 ms in one annotated test.
- The bounded variant still returns a number greater than max_distance when
  the true distance exceeds the bound, which preserves the selection logic (we
  only care about distances smaller than the current min_distance).

Trade-offs and regressions to be aware of
- Small overhead for tiny inputs: building the candidates list and the extra
  function-call overhead make a few microbenchmarks slightly slower (some
  tests show small regressions, e.g., empty maps or single-item trivial
  cases). These regressions are minimal and expected given the additional
  precompute allocation; they are an acceptable trade-off for the large
  reductions when many candidates are present.
- Memory: the candidates list adds short-lived tuples
  (file_path, function, lowercased name, length). This is small relative to
  the CPU savings when many names are checked.
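For intuition, the banded early-exit computation can be sketched as follows. This is a minimal illustrative sketch, not the code in this patch (the real implementation is in the diff); the function name and the exact band bookkeeping here are simplified:

```python
def bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
    """Levenshtein distance, but return max_distance + 1 as soon as the
    true distance provably exceeds max_distance (banded DP sketch)."""
    if s1 == s2:
        return 0
    if len(s1) > len(s2):
        s1, s2 = s2, s1  # keep s1 as the shorter string
    n, m = len(s1), len(s2)
    if m - n > max_distance:
        return max_distance + 1  # the length gap alone exceeds the bound
    big = max_distance + 1
    previous = list(range(n + 1))
    for i in range(1, m + 1):
        char2 = s2[i - 1]
        # only cells with |i - j| <= max_distance can still stay under the bound
        start = max(1, i - max_distance)
        end = min(n, i + max_distance)
        current = [big] * (n + 1)
        current[0] = i
        for j in range(start, end + 1):
            if s1[j - 1] == char2:
                current[j] = previous[j - 1]
            else:
                current[j] = 1 + min(previous[j],       # deletion
                                     current[j - 1],    # insertion
                                     previous[j - 1])   # substitution
        if min(current[start:end + 1]) > max_distance:
            return big  # every cell in the band already exceeds the bound
        previous = current
    return previous[n]
```

With max_distance set to min_distance - 1 (as the patch does), any return value >= min_distance means "no improvement", so the caller's dist < min_distance check keeps its original semantics.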
Hot-path impact
- get_functions_to_optimize can call closest_matching_file_function_name when
  the exact function wasn't found; in common interactive/CLI flows this is a
  helpful path. The optimization yields the most benefit when many functions
  are present (e.g., when scanning a repository, or when the user passes a
  mistyped name and there are many candidates). Because the heavy work
  (distance comparisons across many candidates) is now bounded and cheaper,
  interactive latency and batch throughput improve in those hot scenarios.

Summary
- Primary benefit: a 175% runtime speedup for the matcher, achieved by
  avoiding repeated costly full Levenshtein computations and reducing
  Python-level overhead inside the candidate loop.
- Side effects: minimal memory allocation for candidate metadata and a few
  micro-regressions on trivial inputs; reasonable trade-offs given the large
  improvements for real workloads (many candidates / near-match searches).
- Correctness: the original external levenshtein_distance is kept intact, and
  the bounded version preserves selection semantics (any distance >= the
  current min_distance is treated as not improving the match).
---
 codeflash/discovery/functions_to_optimize.py | 186 +++++++++++++++++--
 1 file changed, 174 insertions(+), 12 deletions(-)

diff --git a/codeflash/discovery/functions_to_optimize.py b/codeflash/discovery/functions_to_optimize.py
index 62d6896fe..ff7a69b49 100644
--- a/codeflash/discovery/functions_to_optimize.py
+++ b/codeflash/discovery/functions_to_optimize.py
@@ -384,22 +384,33 @@ def closest_matching_file_function_name(
     qualified_fn_to_find_lower = qualified_fn_to_find.lower()
 
-    # Cache levenshtein_distance locally for improved lookup speed
-    _levenshtein = levenshtein_distance
+    # Prepare a flattened list of candidates with precomputed lowercase names and lengths
+    # to avoid repeated .lower() and len() calls inside the main loop.
+    candidates: list[tuple[Path, FunctionToOptimize, str, int]] = []
     for file_path, functions in found_fns.items():
         for function in functions:
-            # Compare either full qualified name or just function name
-            fn_name = function.qualified_name.lower()
-            # If the absolute length difference is already >= min_distance, skip calculation
-            if abs(len(qualified_fn_to_find_lower) - len(fn_name)) >= min_distance:
-                continue
-            dist = _levenshtein(qualified_fn_to_find_lower, fn_name)
+            fn_name_lower = function.qualified_name.lower()
+            candidates.append((file_path, function, fn_name_lower, len(fn_name_lower)))
+
+    # Use a bounded levenshtein variant here to early-exit when the distance cannot be
+    # smaller than the current min_distance. This avoids expensive full-distance calculations.
+    _bounded = _bounded_levenshtein
+
+    target_len = len(qualified_fn_to_find_lower)
+
+    for file_path, function, fn_name, fn_len in candidates:
+        # If the absolute length difference is already >= min_distance, skip calculation
+        if abs(target_len - fn_len) >= min_distance:
+            continue
+        # compute bounded distance; if result is >= min_distance it won't improve
+        dist = _bounded(qualified_fn_to_find_lower, fn_name, min_distance - 1)
+
+        if dist < min_distance:
+            min_distance = dist
+            closest_match = function
+            closest_file = file_path
 
-            if dist < min_distance:
-                min_distance = dist
-                closest_match = function
-                closest_file = file_path
     if closest_match is not None and closest_file is not None:
         return closest_file, closest_match
@@ -936,3 +947,154 @@ def filter_files_optimized(file_path: Path, tests_root: Path, ignore_paths: list
         file_path in submodule_paths
         or any(file_path.is_relative_to(submodule_path) for submodule_path in submodule_paths)
     )
+
+
+
+
+
+def _bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
+    """Compute Levenshtein distance but stop when distance exceeds max_distance.
+    Returns a value > max_distance when the true distance is > max_distance."""
+    # Fast path equal
+    if s1 == s2:
+        return 0
+
+    # Ensure s1 is the shorter
+    if len(s1) > len(s2):
+        s1, s2 = s2, s1
+
+    n = len(s1)
+    m = len(s2)
+
+    # If length difference already exceeds max allowed distance, we can exit
+    if m - n > max_distance:
+        return max_distance + 1
+
+    # Initialize previous row: distances from empty s2 prefix to s1 prefixes
+    previous = list(range(n + 1))
+    current = [0] * (n + 1)
+
+    # We will only compute values within a "band" [start..end] for each row
+    for i in range(1, m + 1):
+        # Position in s2 is i (1-based for DP), character is s2[i-1]
+        char2 = s2[i - 1]
+        # Compute band boundaries (1-based indices for s1 positions)
+        start = max(1, i - max_distance)
+        end = min(n, i + max_distance)
+
+        # If start > end the band is empty -> distance exceeds max_distance
+        if start > end:
+            return max_distance + 1
+
+        # Set current[0] for the empty prefix of s1
+        current[0] = i
+
+        # Fill left part outside band with large values
+        for k in range(1, start):
+            current[k] = max_distance + 1
+
+        # Compute values inside the band
+        for j in range(start, end + 1):
+            if s1[j - 1] == char2:
+                current[j] = previous[j - 1]
+            else:
+                # deletion = previous[j] + 1
+                # insertion = current[j - 1] + 1
+                # substitution = previous[j - 1] + 1
+                a = previous[j] + 1
+                b = current[j - 1] + 1
+                c = previous[j - 1] + 1
+                # Fast min of three
+                t = a if a < b else b
+                current[j] = c if c < t else t
+
+        # Fill right part outside band with large values
+        for k in range(end + 1, n + 1):
+            current[k] = max_distance + 1
+
+        # Swap rows
+        previous, current = current, previous
+
+        # Early exit: if the minimum value in the active band is greater than max_distance
+        # then the final distance must exceed max_distance
+        # (band width is small: at most 2*max_distance+1).
+        row_min = min(previous[start : end + 1])
+        if row_min > max_distance:
+            return max_distance + 1
+
+    return previous[n]
+
+
+
+
+def _bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
+    """Compute Levenshtein distance but stop when distance exceeds max_distance.
+    Returns a value > max_distance when the true distance is > max_distance."""
+    # Fast path equal
+    if s1 == s2:
+        return 0
+
+    # Ensure s1 is the shorter
+    if len(s1) > len(s2):
+        s1, s2 = s2, s1
+
+    n = len(s1)
+    m = len(s2)
+
+    # If length difference already exceeds max allowed distance, we can exit
+    if m - n > max_distance:
+        return max_distance + 1
+
+    # Initialize previous row: distances from empty s2 prefix to s1 prefixes
+    previous = list(range(n + 1))
+    current = [0] * (n + 1)
+
+    # We will only compute values within a "band" [start..end] for each row
+    for i in range(1, m + 1):
+        # Position in s2 is i (1-based for DP), character is s2[i-1]
+        char2 = s2[i - 1]
+        # Compute band boundaries (1-based indices for s1 positions)
+        start = max(1, i - max_distance)
+        end = min(n, i + max_distance)
+
+        # If start > end the band is empty -> distance exceeds max_distance
+        if start > end:
+            return max_distance + 1
+
+        # Set current[0] for the empty prefix of s1
+        current[0] = i
+
+        # Fill left part outside band with large values
+        for k in range(1, start):
+            current[k] = max_distance + 1
+
+        # Compute values inside the band
+        for j in range(start, end + 1):
+            if s1[j - 1] == char2:
+                current[j] = previous[j - 1]
+            else:
+                # deletion = previous[j] + 1
+                # insertion = current[j - 1] + 1
+                # substitution = previous[j - 1] + 1
+                a = previous[j] + 1
+                b = current[j - 1] + 1
+                c = previous[j - 1] + 1
+                # Fast min of three
+                t = a if a < b else b
+                current[j] = c if c < t else t
+
+        # Fill right part outside band with large values
+        for k in range(end + 1, n + 1):
+            current[k] = max_distance + 1
+
+        # Swap rows
+        previous, current = current, previous
+
+        # Early exit: if the minimum value in the active band is greater than max_distance
+        # then the final distance must exceed max_distance
+        # (band width is small: at most 2*max_distance+1).
+        row_min = min(previous[start : end + 1])
+        if row_min > max_distance:
+            return max_distance + 1
+
+    return previous[n]

From cca120ddb8bece1b67157a71c584a8fe862c94c5 Mon Sep 17 00:00:00 2001
From: "claude[bot]" <41898282+claude[bot]@users.noreply.github.com>
Date: Fri, 27 Feb 2026 17:54:46 +0000
Subject: [PATCH 2/2] style: fix duplicate function definition and linting issues

---
 codeflash/discovery/functions_to_optimize.py | 85 +-------------------
 1 file changed, 4 insertions(+), 81 deletions(-)

diff --git a/codeflash/discovery/functions_to_optimize.py b/codeflash/discovery/functions_to_optimize.py
index ff7a69b49..93881f0ad 100644
--- a/codeflash/discovery/functions_to_optimize.py
+++ b/codeflash/discovery/functions_to_optimize.py
@@ -411,7 +411,6 @@ def closest_matching_file_function_name(
             closest_match = function
             closest_file = file_path
 
-
     if closest_match is not None and closest_file is not None:
         return closest_file, closest_match
     return None
@@ -949,87 +948,11 @@ def filter_files_optimized(file_path: Path, tests_root: Path, ignore_paths: list
     )
 
-
-
-
-
 def _bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
     """Compute Levenshtein distance but stop when distance exceeds max_distance.
-    Returns a value > max_distance when the true distance is > max_distance."""
-    # Fast path equal
-    if s1 == s2:
-        return 0
-
-    # Ensure s1 is the shorter
-    if len(s1) > len(s2):
-        s1, s2 = s2, s1
-
-    n = len(s1)
-    m = len(s2)
-
-    # If length difference already exceeds max allowed distance, we can exit
-    if m - n > max_distance:
-        return max_distance + 1
-
-    # Initialize previous row: distances from empty s2 prefix to s1 prefixes
-    previous = list(range(n + 1))
-    current = [0] * (n + 1)
-
-    # We will only compute values within a "band" [start..end] for each row
-    for i in range(1, m + 1):
-        # Position in s2 is i (1-based for DP), character is s2[i-1]
-        char2 = s2[i - 1]
-        # Compute band boundaries (1-based indices for s1 positions)
-        start = max(1, i - max_distance)
-        end = min(n, i + max_distance)
-
-        # If start > end the band is empty -> distance exceeds max_distance
-        if start > end:
-            return max_distance + 1
-
-        # Set current[0] for the empty prefix of s1
-        current[0] = i
-
-        # Fill left part outside band with large values
-        for k in range(1, start):
-            current[k] = max_distance + 1
-
-        # Compute values inside the band
-        for j in range(start, end + 1):
-            if s1[j - 1] == char2:
-                current[j] = previous[j - 1]
-            else:
-                # deletion = previous[j] + 1
-                # insertion = current[j - 1] + 1
-                # substitution = previous[j - 1] + 1
-                a = previous[j] + 1
-                b = current[j - 1] + 1
-                c = previous[j - 1] + 1
-                # Fast min of three
-                t = a if a < b else b
-                current[j] = c if c < t else t
-
-        # Fill right part outside band with large values
-        for k in range(end + 1, n + 1):
-            current[k] = max_distance + 1
-
-        # Swap rows
-        previous, current = current, previous
-
-        # Early exit: if the minimum value in the active band is greater than max_distance
-        # then the final distance must exceed max_distance
-        # (band width is small: at most 2*max_distance+1).
-        row_min = min(previous[start : end + 1])
-        if row_min > max_distance:
-            return max_distance + 1
-
-    return previous[n]
-
-
-
-
-def _bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
-    """Compute Levenshtein distance but stop when distance exceeds max_distance.
-    Returns a value > max_distance when the true distance is > max_distance."""
+    Returns a value > max_distance when the true distance is > max_distance.
+    """
     # Fast path equal
     if s1 == s2:
         return 0
@@ -1080,8 +1003,8 @@ def _bounded_levenshtein(s1: str, s2: str, max_distance: int) -> int:
                 b = current[j - 1] + 1
                 c = previous[j - 1] + 1
                 # Fast min of three
-                t = a if a < b else b
-                current[j] = c if c < t else t
+                t = min(b, a)
+                current[j] = min(t, c)
 
         # Fill right part outside band with large values
         for k in range(end + 1, n + 1):
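For reference, the overall selection pattern the patched matcher follows (precompute candidate metadata, apply the length pre-filter, keep the best distance) can be sketched independently of the patch. This sketch uses a plain full Levenshtein for brevity where the patch uses the banded variant, and all names here are illustrative, not from the codebase:

```python
from __future__ import annotations


def _levenshtein(a: str, b: str) -> int:
    # plain O(n*m) dynamic-programming Levenshtein (stand-in for the
    # bounded/banded variant used in the actual patch)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                  # deletion
                           cur[j - 1] + 1,               # insertion
                           prev[j - 1] + (ca != cb)))    # substitution/match
        prev = cur
    return prev[-1]


def closest_candidate(target: str, names: list[str], max_distance: int = 5) -> str | None:
    """Return the name closest to target, or None if nothing is within max_distance."""
    target_lower = target.lower()
    target_len = len(target_lower)
    # precompute lowercase names and lengths once, outside the hot loop
    prepared = [(name, name.lower(), len(name)) for name in names]
    best, best_dist = None, max_distance + 1
    for name, name_lower, name_len in prepared:
        # a length gap >= best_dist already rules the candidate out:
        # the distance is never smaller than the length difference
        if abs(target_len - name_len) >= best_dist:
            continue
        dist = _levenshtein(target_lower, name_lower)
        if dist < best_dist:
            best_dist, best = dist, name
    return best
```

Because best_dist shrinks as better candidates are found, the length pre-filter becomes more aggressive over the course of the loop; the patch adds the banded distance on top of this so that even candidates passing the length check rarely pay for a full DP pass.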