[VL] Skip escape arg when offloading Like with no-backslash pattern #12152

Open
zhulipeng wants to merge 1 commit into apache:main from zhulipeng:vl-like-skip-escape-arg

@zhulipeng

What changes are proposed in this pull request?

When offloading Like to Velox, omit the escape literal argument and emit
the 2-arg form (like(input, pattern)) when both:

  1. the pattern is a constant Literal, and
  2. the escape char is Spark's default \ AND the pattern does not contain \.

Otherwise we still emit the 3-arg form (like(input, pattern, escape)) as before.
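
The decision rule above can be sketched as follows. This is a minimal illustration in C++, not the Scala code in this patch; `canOmitEscape` and `likeArgs` are hypothetical names, and the arguments are modeled as plain strings rather than expression nodes.

```cpp
#include <string>
#include <string_view>
#include <vector>

// Hypothetical helper (not the actual Gluten code): returns true when the
// escape argument can be dropped, i.e. the pattern is a constant literal,
// the escape char is the default backslash, and the pattern has no backslash.
bool canOmitEscape(bool patternIsLiteral, std::string_view pattern, char escapeChar) {
  constexpr char kDefaultEscape = '\\';
  return patternIsLiteral && escapeChar == kDefaultEscape &&
      pattern.find(kDefaultEscape) == std::string_view::npos;
}

// Builds the argument list for the offloaded call: two arguments when the
// escape can be omitted, otherwise the original three.
std::vector<std::string> likeArgs(
    const std::string& input,
    const std::string& pattern,
    bool patternIsLiteral,
    char escapeChar) {
  std::vector<std::string> args{input, pattern};
  if (!canOmitEscape(patternIsLiteral, pattern, escapeChar)) {
    args.push_back(std::string(1, escapeChar));
  }
  return args;
}
```

Under these assumptions, a pattern such as `%special%requests%` with the default escape yields two arguments, while a pattern containing a backslash, a custom escape char, or a non-literal pattern keeps all three.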

Why are the changes needed?

Spark's Like node always carries an escapeChar (defaulting to \) even when
the SQL did not specify ESCAPE. We previously always sent the 3-arg form to
Velox, which forces Velox's makeLike (Re2Functions.cpp) to take the
escape-aware path: parsePattern runs an extra unescape pass, and
determinePatternKind runs with escapeChar.has_value() == true, even when
no actual escaping is needed.

When the pattern literal contains no \, the 2-arg and 3-arg forms are
semantically identical, so we can safely send the cheaper 2-arg form. Velox
already registers both signatures via likeSignatures().
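
To illustrate why the two forms agree, here is a simplified model of an unescape pass. It is a sketch under stated assumptions, not Velox's `parsePattern`: `unescapeSketch` is a hypothetical name, and the model only strips escape characters without tracking which wildcards became literals.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Simplified, hypothetical model of an unescape pass. With an escape char,
// each escape character is dropped and the character after it is kept.
// Without one, the pattern is returned unchanged.
std::string unescapeSketch(std::string_view pattern, std::optional<char> escapeChar) {
  if (!escapeChar.has_value()) {
    return std::string(pattern);
  }
  std::string out;
  out.reserve(pattern.size());
  for (std::size_t i = 0; i < pattern.size(); ++i) {
    if (pattern[i] == *escapeChar && i + 1 < pattern.size()) {
      ++i; // skip the escape character and keep the escaped one
    }
    out.push_back(pattern[i]);
  }
  return out;
}
```

In this model, a pattern that never contains the escape character passes through the loop untouched, so the result is the same whether or not an escape char is supplied. The pass is then pure overhead.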

How was this patch tested?

  • Existing Like query coverage in VeloxStringFunctionsSuite (like /
    rlike / ilike) — query results unchanged.
  • TPC-H Q13 end-to-end run at 6 TB scale — see Performance section.
  • ./dev/format-scala-code.sh clean.

Performance

TPC-H Q13 @ 6 TB scale: >6% end-to-end latency reduction.

Q13's o_comment NOT LIKE '%special%requests%' filter scans every orders row.
With the 3-arg form, the constant-pattern fast paths in determinePatternKind
are bypassed and Velox falls back to the generic LikeWithRe2 path, which
hot-loops in re2::DFA::InlinedSearchLoop — CPU profiling shows
InlinedSearchLoop accounts for >8% of total cycles on Q13.

Sending the 2-arg form lets determinePatternKind recognize this as the
kSubstrings shape and dispatch to the dedicated OptimizedLike<kSubstrings>
kernel, eliminating the RE2 DFA cost. No regression observed on other queries.
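
For intuition on why a substrings-shaped pattern does not need a regex engine, a pattern of the form `%s1%s2%` can be matched with ordered substring searches. The sketch below is a hypothetical illustration (`matchSubstringsSketch` is an invented name), not the `OptimizedLike<kSubstrings>` kernel itself.

```cpp
#include <cstddef>
#include <string>
#include <string_view>
#include <vector>

// Hypothetical illustration of matching a pattern shaped like %s1%s2%...%:
// every substring must occur in order, each one starting after the end of
// the previous match. Taking the leftmost occurrence each time is sufficient.
bool matchSubstringsSketch(
    std::string_view input,
    const std::vector<std::string>& substrings) {
  std::size_t pos = 0;
  for (const auto& s : substrings) {
    const std::size_t found = input.find(std::string_view(s), pos);
    if (found == std::string_view::npos) {
      return false;
    }
    pos = found + s.size();
  }
  return true;
}
```

For `'%special%requests%'` the substring list would be `{"special", "requests"}`. The input matches only if "special" appears and "requests" appears somewhere after it.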

Was this patch authored or co-authored using generative AI tooling?

Reviewed-by: Claude claude-opus-4-7

Spark's Like node always carries escapeChar (defaulting to '\') even when
the SQL did not specify ESCAPE. Always sending the 3-arg form to Velox
forces makeLike (Re2Functions.cpp) onto the escape-aware path:
parsePattern runs an extra unescape pass and determinePatternKind runs
with escapeChar.has_value() == true, even when no escaping is needed.

When the pattern literal contains no '\', the 2-arg and 3-arg forms are
semantically identical, so emit the cheaper 2-arg form. Velox already
registers both signatures via likeSignatures().

Performance: TPC-H Q13 @ 6 TB shows >6% end-to-end latency reduction.
With the 3-arg form, the constant-pattern fast paths in
determinePatternKind are bypassed and Velox falls back to LikeWithRe2,
hot-looping in re2::DFA::InlinedSearchLoop (>8% of total cycles on
Q13). Sending the 2-arg form lets determinePatternKind dispatch
'%special%requests%' to OptimizedLike<kSubstrings>, eliminating the
RE2 DFA cost.

Generated-by: Claude claude-opus-4-7

Co-authored-by: Guo Wangyang <wangyang.guo@intel.com>
Co-authored-by: Hengrui Hu <hengrui.hu@intel.com>
Signed-off-by: Lipeng Zhu <lipeng.zhu@intel.com>
@github-actions github-actions Bot added the VELOX label May 27, 2026