Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718
Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718iliaal wants to merge 1 commit intophp:masterfrom
Conversation
Two changes to ext/standard/url.c. 1. Fix phpGH-12703: parse_url mishandles colon inside path. parse_url('/page:1') returns false instead of the path, and parse_url('//www.example.com/foo:65535/') reports a spurious port of 65535. Both stem from the scheme walk's error branch routing /-prefixed inputs into the `parse_port` fallback. Guard the `parse_port` branch with `*s != '/'`. A leading slash can't start a scheme, so the input must be a path or a //host authority. The existing branch ordering then lets the //host check catch `//...` inputs and `just_path` catch the rest. 2. Speed up php_replace_controlchars on typical URLs. Each parse calls php_replace_controlchars once per component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() hits __ctype_b_loc() for a locale-aware lookup, and callgrind showed it at ~14% of total instructions on a realistic URL workload. Replace the call with an inline `*s < 0x20 || *s == 0x7f` compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner uses the same pattern (`yych <= 0x1F`). Benchmark across 13 realistic URL shapes: 12-22% faster, 16% average. No behavior change in the C/POSIX locale. Closes phpGH-12703
62b73ec to
854136f
Compare
|
This PR is making two unrelated changes, it should be split up. But in any case, we shouldn't make any further functional changes to Users should migrate to the new ext/uri classes ( |
|
@TimWolla I can split performance related changes, if it'll be considered? Seems like 10%+ speed increase for a commonly used function is still of value, and https://www.php.net/manual/en/function.parse-url.php doesn't imply function is being deprecated or at risk of being so. |
My understanding is that it might result in a behavioral change for non-C locales, which might be the case if someone calls
I'm not sure we should encourage users to choose
I feel it's a little early for that, but |
parse_url() isn't deprecated today, and even a future RFC wouldn't remove it for years. The function is still worth improving in the meantime.
The behavior does change, and the change is a bug fix.
So the four cases are:
The old behavior made If the concern is BC for the C1 control edge case, I can add a test that demonstrates the UTF-8 corruption under a Latin-1 locale as the motivating bug. The perf number then stops being the primary justification and becomes the side effect of a correctness fix. |
As I had mentioned before, the current behavior of |
Declaring a behavior "intended" after the fact doesn't make it so. Intended behavior is documented, tested, or discussed on record. None of that applies here: there's no test in The full ext/standard/tests/url suite passes with the change. If the old behavior were intended, there would be at least one test asserting it. The absence of any such test is the same signal it always is: a quirk that wasn't codified. This isn't just about userland Reading the thread as a whole, the blocker on the technical side is an unsourced claim that the current locale-dependent behavior is intended, and the blocker on the non-technical side is the preference that parse_url() not be encouraged. I don't see a productive way forward against either, so I'll leave this thread here. The patch lives on my fork at iliaal#31 for anyone who wants to build with it. |
Fixes #12703
Two changes in the same commit, both in
ext/standard/url.c.Bug fix: GH-12703
parse_url('/page:1')returnsfalseinstead of['path' => '/page:1'], andparse_url('//www.example.com/foo:65535/')reports a spuriousport=65535that should never exist (the65535is part of the path, not an authority port).Both bugs stem from the scheme-walk error branch routing
/-prefixed inputs into theparse_portfallback. A leading/can't start a scheme, so the input is either a path or a//hostauthority, never a host-less:portform.The guard
*s != '/'on theparse_portbranch sends/page:1tojust_pathand//www.example.com/foo:65535/to the//hostbranch that follows it, both of which produce the correct output.This matches vdauchy's abandoned PR #12704 (closed without technical objections, author cited the URL parsing API RFC). The RFC doesn't deprecate
parse_url()in 8.5, andparse_url()remains in use by FTP stream wrappers, SoapClient, and session URL rewriting, so the fix is worth landing on master.Uri\Rfc3986\Uri::parse()from the 8.5 URI extension already handles both cases correctly today. After this fix,parse_url()agrees with the RFC 3986 parser on 7 of 9 cases I tested (the remaining two,page:1and./page:1, are long-standing legacy quirks out of scope).Optimization: php_replace_controlchars
Each
parse_url()call invokesphp_replace_controlcharsonce per parsed component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes throughiscntrl().iscntrl()goes through__ctype_b_loc()for a locale-aware lookup on every byte. Callgrind showed this accounted for ~14% of total instructions on a realistic URL workload.Replace the call with an inline
*s < 0x20 || *s == 0x7fcompare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner already uses the same pattern (yych <= 0x1F), andext/standard/quot_print.cusesiscntrl(c) || c == 0x7ffor the same reason (the author didn't trustiscntrlto include DEL across locales).Benchmark
Release build, 1M iterations per case, best-of-5, 13 realistic URL shapes:
http://example.com)https://www.example.com/path/to/page)https://api.example.com/v1/users?id=42&format=json)//www.example.com/path/to/page)/page:1?foo=bar)/foo)http://[2001:db8::1]:8080/path)mailto:user@example.com)file:///home/user/file.txt)Average: 16% faster. No behavior change in the C/POSIX locale.
Tests
New
ext/standard/tests/url/gh12703.phptcovers both bugs with correct expected values (not the buggy port expectations locked into PR #12704's original test). Includes edge cases/:,/:80,//host/a:1/b:2/, IPv6-like inputs, and a full scheme://user:pass@host:port/path?query#fragment regression guard.git stashround-tripext/standard/tests/url/— 32/32 pass, 0 regressionsext/uri/tests/— 309/309 pass (the 8.5 URI extension wraps parse_url, so wrapping regression check)