Fix GH-12703: parse_url colon-in-path + optimize control-char replacement by iliaal · Pull Request #21718 · php/php-src

iliaal · 2026-04-10T23:34:19Z

Two changes in the same commit, both in ext/standard/url.c.

Bug fix: GH-12703

parse_url('/page:1') returns false instead of ['path' => '/page:1'], and parse_url('//www.example.com/foo:65535/') reports a spurious port=65535 that should never exist (the 65535 is part of the path, not an authority port).

Both bugs stem from the scheme-walk error branch routing /-prefixed inputs into the parse_port fallback. A leading / can't start a scheme, so the input is either a path or a //host authority, never a host-less :port form.

The guard *s != '/' on the parse_port branch sends /page:1 to just_path and //www.example.com/foo:65535/ to the //host branch that follows it, both of which produce the correct output.

This matches vdauchy's abandoned PR #12704 (closed without technical objections, author cited the URL parsing API RFC). The RFC doesn't deprecate parse_url() in 8.5, and parse_url() remains in use by FTP stream wrappers, SoapClient, and session URL rewriting, so the fix is worth landing on master.

Uri\Rfc3986\Uri::parse() from the 8.5 URI extension already handles both cases correctly today. After this fix, parse_url() agrees with the RFC 3986 parser on 7 of 9 cases I tested (the remaining two, page:1 and ./page:1, are long-standing legacy quirks out of scope).

Optimization: php_replace_controlchars

Each parse_url() call invokes php_replace_controlchars once per parsed component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() goes through __ctype_b_loc() for a locale-aware lookup on every byte. Callgrind showed this accounted for ~14% of total instructions on a realistic URL workload.

Replace the call with an inline *s < 0x20 || *s == 0x7f compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner already uses the same pattern (yych <= 0x1F), and ext/standard/quot_print.c uses iscntrl(c) || c == 0x7f for the same reason (the author didn't trust iscntrl to include DEL across locales).

Benchmark

Release build, 1M iterations per case, best-of-5, 13 realistic URL shapes:

Case	len	before	after	speedup
short_nopath (`http://example.com`)	18	76.7 ns	65.0 ns	15.3%
typical_web (`https://www.example.com/path/to/page`)	36	100.2 ns	78.3 ns	21.9%
with_query (`https://api.example.com/v1/users?id=42&format=json`)	50	117.4 ns	94.6 ns	19.4%
full_featured (user:pass@host:port/path?q#f)	83	181.9 ns	147.5 ns	18.9%
path_with_colons	44	102.6 ns	84.9 ns	17.3%
relative_scheme (`//www.example.com/path/to/page`)	30	80.8 ns	67.9 ns	16.0%
absolute_path (`/page:1?foo=bar`)	15	67.7 ns	57.6 ns	14.9%
short_absolute_path (`/foo`)	4	46.7 ns	40.3 ns	13.7%
ipv6 (`http://[2001:db8::1]:8080/path`)	30	113.6 ns	96.2 ns	15.3%
mailto (`mailto:user@example.com`)	23	68.6 ns	60.2 ns	12.2%
file_url (`file:///home/user/file.txt`)	26	75.2 ns	63.5 ns	15.6%
long_query (96-byte search URL)	96	141.8 ns	123.0 ns	13.3%
long_path (81-byte asset URL)	81	123.0 ns	99.0 ns	19.5%

Average: 16% faster. No behavior change in the C/POSIX locale.

Tests

New ext/standard/tests/url/gh12703.phpt covers both bugs with correct expected values (not the buggy port expectations locked into PR #12704's original test). Includes edge cases /:, /:80, //host/a:1/b:2/, IPv6-like inputs, and a full scheme://user:pass@host:port/path?query#fragment regression guard.

Test verified to fail on unfixed master and pass on the fixed build via git stash round-trip
ext/standard/tests/url/ — 32/32 pass, 0 regressions
ext/uri/tests/ — 309/309 pass (the 8.5 URI extension wraps parse_url, so wrapping regression check)

Two changes to ext/standard/url.c. 1. Fix phpGH-12703: parse_url mishandles colon inside path. parse_url('/page:1') returns false instead of the path, and parse_url('//www.example.com/foo:65535/') reports a spurious port of 65535. Both stem from the scheme walk's error branch routing /-prefixed inputs into the `parse_port` fallback. Guard the `parse_port` branch with `*s != '/'`. A leading slash can't start a scheme, so the input must be a path or a //host authority. The existing branch ordering then lets the //host check catch `//...` inputs and `just_path` catch the rest. 2. Speed up php_replace_controlchars on typical URLs. Each parse calls php_replace_controlchars once per component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() hits __ctype_b_loc() for a locale-aware lookup, and callgrind showed it at ~14% of total instructions on a realistic URL workload. Replace the call with an inline `*s < 0x20 || *s == 0x7f` compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner uses the same pattern (`yych <= 0x1F`). Benchmark across 13 realistic URL shapes: 12-22% faster, 16% average. No behavior change in the C/POSIX locale. Closes phpGH-12703

TimWolla · 2026-04-11T16:22:10Z

This PR is making two unrelated changes, it should be split up. But in any case, we shouldn't make any further functional changes to parse_url(), since at this point the URLs expected to be parsed by parse_url() are the URLs parsed by parse_url().

Users should migrate to the new ext/uri classes (Uri\* namespace) instead. See also php/doc-en#5477.

iliaal · 2026-04-11T17:32:09Z

@TimWolla I can split performance related changes, if it'll be considered? Seems like 10%+ speed increase for a commonly used function is still of value, and https://www.php.net/manual/en/function.parse-url.php doesn't imply function is being deprecated or at risk of being so.

TimWolla · 2026-04-11T21:39:41Z

I can split performance related changes, if it'll be considered?

My understanding is that it might result in a behavioral change for non-C locales, which might be the case if someone calls setlocale() before.

Seems like 10%+ speed increase for a commonly used function is still of value

I'm not sure we should encourage users to choose parse_url() (e.g. due to performance concerns with the spec-compliant functionality).

doesn't imply function is being deprecated or at risk of being so.

I feel it's a little early for that, but parse_url() should absolutely be deprecated.

iliaal · 2026-04-11T22:03:15Z

parse_url() should absolutely be deprecated.

parse_url() isn't deprecated today, and even a future RFC wouldn't remove it for years. The function is still worth improving in the meantime.

My understanding is that it might result in a behavioral change for non-C locales, which might be the case if someone calls setlocale() before.

The behavior does change, and the change is a bug fix.

iscntrl() in glibc under a Latin-1-family LC_CTYPE locale classifies 0x80 through 0x9F (the C1 control range) as control characters. php_replace_controlchars then rewrites those bytes to _. For a UTF-8 URL like http://example.com/café, the é encodes as 0xC3 0xA9, and 0xA9 sits in that range. Under a Latin-1 locale, the old code rewrites the continuation byte to _ with no warning, yielding an invalid UTF-8 sequence 0xC3 0x5F where a valid é used to be. In a C or UTF-8 locale, behavior is identical to the new inline check.

So the four cases are:

C or POSIX locale: no change.
UTF-8 locale: no change.
Latin-1 locale with UTF-8 URL input (the common case in a mixed process): old code mangles UTF-8 continuation bytes, new code leaves them alone. That's a fix, not a regression.
Latin-1 locale with a Latin-1 URL containing literal C1 control bytes (never in practice): old code replaces with _, new code doesn't.

The old behavior made parse_url() non-deterministic across process states that had nothing to do with URL parsing. Any library or SAPI that called setlocale(LC_CTYPE, "xx_XX.ISO-8859-1") earlier in the process could change URL parsing results without warning. The Zend scanner uses the same inline pattern I'm proposing (yych <= 0x1F) for the same reason: source bytes aren't locale-dependent text, and URL bytes aren't either.

If the concern is BC for the C1 control edge case, I can add a test that demonstrates the UTF-8 corruption under a Latin-1 locale as the motivating bug. The perf number then stops being the primary justification and becomes the side effect of a correctness fix.

TimWolla · 2026-04-11T22:42:56Z

The behavior does change, and the change is a bug fix.

As I had mentioned before, the current behavior of parse_url() is the intended behavior. There are many issues in parse_url() that users might consider to be a bug, but at this point they have all become a feature.

iliaal · 2026-04-11T23:33:21Z

the current behavior of parse_url() is the intended behavior. There are many issues in parse_url() that users might consider to be a bug, but at this point they have all become a feature.

Declaring a behavior "intended" after the fact doesn't make it so. Intended behavior is documented, tested, or discussed on record. None of that applies here: there's no test in ext/standard/tests/url that asserts the locale-dependent control-char replacement, no documentation describing parse_url() as locale-sensitive, and no prior discussion establishing that UTF-8 continuation-byte corruption under a Latin-1 locale is a feature.

The full ext/standard/tests/url suite passes with the change. If the old behavior were intended, there would be at least one test asserting it. The absence of any such test is the same signal it always is: a quirk that wasn't codified.

This isn't just about userland parse_url() either. php_url_parse_ex2 and php_replace_controlchars are on the hot path of every URL-based stream in php-src: the http/https/ftp/ftps stream wrappers (ext/standard/http_fopen_wrapper.c, ext/standard/ftp_fopen_wrapper.c), the TLS transport (ext/openssl/xp_ssl.c), SoapClient (ext/soap/php_http.c), filter_var($url, FILTER_VALIDATE_URL) (ext/filter/logical_filters.c), and the session URL rewriter (ext/standard/url_scanner_ex.c). Declaring every quirk a feature means every one of these keeps the locale-dependent byte mangling in perpetuity, for every user, in code that has no drop-in replacement. That's a high cost for a position that isn't documented anywhere as project policy.

Reading the thread as a whole, the blocker on the technical side is an unsourced claim that the current locale-dependent behavior is intended, and the blocker on the non-technical side is the preference that parse_url() not be encouraged. I don't see a productive way forward against either, so I'll leave this thread here. The patch lives on my fork at iliaal#31 for anyone who wants to build with it.

iliaal requested a review from bukka as a code owner April 10, 2026 23:34

iliaal mentioned this pull request Apr 10, 2026

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement iliaal/php-src#29

Closed

github-actions bot added the Extension: standard label Apr 10, 2026

iliaal force-pushed the fix/gh12703-parse-url-colon-path branch from 62b73ec to 854136f Compare April 11, 2026 13:22

TimWolla closed this Apr 11, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718
iliaal wants to merge 1 commit intophp:masterfrom
iliaal:fix/gh12703-parse-url-colon-path

iliaal commented Apr 10, 2026

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

iliaal commented Apr 10, 2026

Bug fix: GH-12703

Optimization: php_replace_controlchars

Benchmark

Tests

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

TimWolla commented Apr 11, 2026

Uh oh!

iliaal commented Apr 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants