Skip to content

Fix GH-14807 ext/standard levenshtein overflow on 3rd, 4th and 5th arguments.#14809

Draft
devnexen wants to merge 1 commit into
php:PHP-8.2from
devnexen:gh14807
Draft

Fix GH-14807 ext/standard levenshtein overflow on 3rd, 4th and 5th arguments.#14809
devnexen wants to merge 1 commit into
php:PHP-8.2from
devnexen:gh14807

Conversation

@devnexen

@devnexen devnexen commented Jul 4, 2024

Copy link
Copy Markdown
Member

No description provided.

@devnexen devnexen changed the title ext/standard levenshtein overflow on 3rd, 4th and 5th arguments. Fix GH-14807 ext/standard levenshtein overflow on 3rd, 4th and 5th arguments. Jul 4, 2024
@devnexen devnexen marked this pull request as ready for review July 4, 2024 06:37
@devnexen devnexen requested a review from bukka as a code owner July 4, 2024 06:37
@ndossche

ndossche commented Jul 4, 2024

Copy link
Copy Markdown
Member

This doesn't solve the problem in general, also the weights are allowed to be negative

@devnexen

devnexen commented Jul 4, 2024

Copy link
Copy Markdown
Member Author

Oh I see you already work on it he might be better if you continue with it :)

@ndossche

ndossche commented Jul 4, 2024

Copy link
Copy Markdown
Member

I didn't really come up with a general solution, see #13823 (comment)

@devnexen devnexen marked this pull request as draft July 5, 2024 11:50
@ndossche

ndossche commented Jul 5, 2024

Copy link
Copy Markdown
Member

@devnexen I thought about this more.

I think there's two ways to go about this:

  1. Check beforehand if overflow can happen. If we set a number L to the largest cost (i.e. MAX(cost_ins, cost_rep, cost_del)) then overflow can only happen if MAX(ZSTR_LEN(string1), ZSTR_LEN(string2)) * L overflows, i.e. if L > ZEND_LONG_MAX / MAX(ZSTR_LEN(string1), ZSTR_LEN(string2)). And same reasoning for the lowest cost. If we check both of these cases, then we could throw beforehand. The disadvantage is that this can be overly defensive, e.g. using a PHP_INT_MAX weight to disallow one of the 3 operations is a legitimate trick and that would break even though the overflow will never happen in practice.
  2. Use overflow-checked math in the computation of the distance. This is a bit harder / more annoying. It should be cheap for the CPU as the overflow flag is computed anyway when performing arithmetic. The advantage is that the function only throws when there actually is overflow, although then if you use extreme weights then depending on some input strings you get an exception and user code may not be prepared to deal with this. Then again, overflow should be rare.

@devnexen

devnexen commented Jul 5, 2024

Copy link
Copy Markdown
Member Author

Thanks I ll look at it within the next days.

@cmb69

cmb69 commented Jul 14, 2024

Copy link
Copy Markdown
Member

I think there's two ways to go about this:

Both appear to be well-thought out and analyzed, and unfortunately neither seems to be fully backward compatible. I consider the current weighted levenshtein() functionality rather half-baked anyway; originally, the function signature supported callbacks (but not even general callables) for the costs, but since that was never implemented, it had been removed. If callables were properly supported, the function might be way more useful (albeit supposedly even much slower), but as is I still see not much use for this function, so it might indeed be "best" to just document the issue and let users deal with it. More elaborate support for levenshtein (and maybe other edit distance algorithms) could still be offered as PECL package (maybe something like this already exists). These are just my 2cents, though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants