Skip to content

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718

Closed
iliaal wants to merge 1 commit intophp:masterfrom
iliaal:fix/gh12703-parse-url-colon-path
Closed

Fix GH-12703: parse_url colon-in-path + optimize control-char replacement#21718
iliaal wants to merge 1 commit intophp:masterfrom
iliaal:fix/gh12703-parse-url-colon-path

Conversation

@iliaal
Copy link
Copy Markdown
Contributor

@iliaal iliaal commented Apr 10, 2026

Fixes #12703

Two changes in the same commit, both in ext/standard/url.c.

Bug fix: GH-12703

parse_url('/page:1') returns false instead of ['path' => '/page:1'], and parse_url('//www.example.com/foo:65535/') reports a spurious port=65535 that should never exist (the 65535 is part of the path, not an authority port).

Both bugs stem from the scheme-walk error branch routing /-prefixed inputs into the parse_port fallback. A leading / can't start a scheme, so the input is either a path or a //host authority, never a host-less :port form.

The guard *s != '/' on the parse_port branch sends /page:1 to just_path and //www.example.com/foo:65535/ to the //host branch that follows it, both of which produce the correct output.

This matches vdauchy's abandoned PR #12704 (closed without technical objections, author cited the URL parsing API RFC). The RFC doesn't deprecate parse_url() in 8.5, and parse_url() remains in use by FTP stream wrappers, SoapClient, and session URL rewriting, so the fix is worth landing on master.

Uri\Rfc3986\Uri::parse() from the 8.5 URI extension already handles both cases correctly today. After this fix, parse_url() agrees with the RFC 3986 parser on 7 of 9 cases I tested (the remaining two, page:1 and ./page:1, are long-standing legacy quirks out of scope).

Optimization: php_replace_controlchars

Each parse_url() call invokes php_replace_controlchars once per parsed component (scheme, host, user, pass, path, query, fragment), and each call walked the bytes through iscntrl(). iscntrl() goes through __ctype_b_loc() for a locale-aware lookup on every byte. Callgrind showed this accounted for ~14% of total instructions on a realistic URL workload.

Replace the call with an inline *s < 0x20 || *s == 0x7f compare. URL components are bytes, not locale-dependent text, so C-locale semantics are what we want regardless of the process locale. The Zend language scanner already uses the same pattern (yych <= 0x1F), and ext/standard/quot_print.c uses iscntrl(c) || c == 0x7f for the same reason (the author didn't trust iscntrl to include DEL across locales).

Benchmark

Release build, 1M iterations per case, best-of-5, 13 realistic URL shapes:

Case len before after speedup
short_nopath (http://example.com) 18 76.7 ns 65.0 ns 15.3%
typical_web (https://www.example.com/path/to/page) 36 100.2 ns 78.3 ns 21.9%
with_query (https://api.example.com/v1/users?id=42&format=json) 50 117.4 ns 94.6 ns 19.4%
full_featured (user:pass@host:port/path?q#f) 83 181.9 ns 147.5 ns 18.9%
path_with_colons 44 102.6 ns 84.9 ns 17.3%
relative_scheme (//www.example.com/path/to/page) 30 80.8 ns 67.9 ns 16.0%
absolute_path (/page:1?foo=bar) 15 67.7 ns 57.6 ns 14.9%
short_absolute_path (/foo) 4 46.7 ns 40.3 ns 13.7%
ipv6 (http://[2001:db8::1]:8080/path) 30 113.6 ns 96.2 ns 15.3%
mailto (mailto:user@example.com) 23 68.6 ns 60.2 ns 12.2%
file_url (file:///home/user/file.txt) 26 75.2 ns 63.5 ns 15.6%
long_query (96-byte search URL) 96 141.8 ns 123.0 ns 13.3%
long_path (81-byte asset URL) 81 123.0 ns 99.0 ns 19.5%

Average: 16% faster. No behavior change in the C/POSIX locale.

Tests

New ext/standard/tests/url/gh12703.phpt covers both bugs with correct expected values (not the buggy port expectations locked into PR #12704's original test). Includes edge cases /:, /:80, //host/a:1/b:2/, IPv6-like inputs, and a full scheme://user:pass@host:port/path?query#fragment regression guard.

  • Test verified to fail on unfixed master and pass on the fixed build via git stash round-trip
  • ext/standard/tests/url/ — 32/32 pass, 0 regressions
  • ext/uri/tests/ — 309/309 pass (the 8.5 URI extension wraps parse_url, so wrapping regression check)

Two changes to ext/standard/url.c.

1. Fix phpGH-12703: parse_url mishandles colon inside path.

   parse_url('/page:1') returns false instead of the path, and
   parse_url('//www.example.com/foo:65535/') reports a spurious
   port of 65535. Both stem from the scheme walk's error branch
   routing /-prefixed inputs into the `parse_port` fallback.

   Guard the `parse_port` branch with `*s != '/'`. A leading slash
   can't start a scheme, so the input must be a path or a //host
   authority. The existing branch ordering then lets the //host
   check catch `//...` inputs and `just_path` catch the rest.

2. Speed up php_replace_controlchars on typical URLs.

   Each parse calls php_replace_controlchars once per component
   (scheme, host, user, pass, path, query, fragment), and each
   call walked the bytes through iscntrl(). iscntrl() hits
   __ctype_b_loc() for a locale-aware lookup, and callgrind showed
   it at ~14% of total instructions on a realistic URL workload.

   Replace the call with an inline `*s < 0x20 || *s == 0x7f`
   compare. URL components are bytes, not locale-dependent text,
   so C-locale semantics are what we want regardless of the
   process locale. The Zend language scanner uses the same pattern
   (`yych <= 0x1F`).

   Benchmark across 13 realistic URL shapes: 12-22% faster, 16%
   average. No behavior change in the C/POSIX locale.

Closes phpGH-12703
@iliaal iliaal force-pushed the fix/gh12703-parse-url-colon-path branch from 62b73ec to 854136f Compare April 11, 2026 13:22
@TimWolla
Copy link
Copy Markdown
Member

This PR is making two unrelated changes, it should be split up. But in any case, we shouldn't make any further functional changes to parse_url(), since at this point the URLs expected to be parsed by parse_url() are the URLs parsed by parse_url().

Users should migrate to the new ext/uri classes (Uri\* namespace) instead. See also php/doc-en#5477.

@TimWolla TimWolla closed this Apr 11, 2026
@iliaal
Copy link
Copy Markdown
Contributor Author

iliaal commented Apr 11, 2026

@TimWolla I can split performance related changes, if it'll be considered? Seems like 10%+ speed increase for a commonly used function is still of value, and https://www.php.net/manual/en/function.parse-url.php doesn't imply function is being deprecated or at risk of being so.

@TimWolla
Copy link
Copy Markdown
Member

I can split performance related changes, if it'll be considered?

My understanding is that it might result in a behavioral change for non-C locales, which might be the case if someone calls setlocale() before.

Seems like 10%+ speed increase for a commonly used function is still of value

I'm not sure we should encourage users to choose parse_url() (e.g. due to performance concerns with the spec-compliant functionality).

doesn't imply function is being deprecated or at risk of being so.

I feel it's a little early for that, but parse_url() should absolutely be deprecated.

@iliaal
Copy link
Copy Markdown
Contributor Author

iliaal commented Apr 11, 2026

parse_url() should absolutely be deprecated.

parse_url() isn't deprecated today, and even a future RFC wouldn't remove it for years. The function is still worth improving in the meantime.

My understanding is that it might result in a behavioral change for non-C locales, which might be the case if someone calls setlocale() before.

The behavior does change, and the change is a bug fix.

iscntrl() in glibc under a Latin-1-family LC_CTYPE locale classifies 0x80 through 0x9F (the C1 control range) as control characters. php_replace_controlchars then rewrites those bytes to _. For a UTF-8 URL like http://example.com/café, the é encodes as 0xC3 0xA9, and 0xA9 sits in that range. Under a Latin-1 locale, the old code rewrites the continuation byte to _ with no warning, yielding an invalid UTF-8 sequence 0xC3 0x5F where a valid é used to be. In a C or UTF-8 locale, behavior is identical to the new inline check.

So the four cases are:

  • C or POSIX locale: no change.
  • UTF-8 locale: no change.
  • Latin-1 locale with UTF-8 URL input (the common case in a mixed process): old code mangles UTF-8 continuation bytes, new code leaves them alone. That's a fix, not a regression.
  • Latin-1 locale with a Latin-1 URL containing literal C1 control bytes (never in practice): old code replaces with _, new code doesn't.

The old behavior made parse_url() non-deterministic across process states that had nothing to do with URL parsing. Any library or SAPI that called setlocale(LC_CTYPE, "xx_XX.ISO-8859-1") earlier in the process could change URL parsing results without warning. The Zend scanner uses the same inline pattern I'm proposing (yych <= 0x1F) for the same reason: source bytes aren't locale-dependent text, and URL bytes aren't either.

If the concern is BC for the C1 control edge case, I can add a test that demonstrates the UTF-8 corruption under a Latin-1 locale as the motivating bug. The perf number then stops being the primary justification and becomes the side effect of a correctness fix.

@TimWolla
Copy link
Copy Markdown
Member

The behavior does change, and the change is a bug fix.

As I had mentioned before, the current behavior of parse_url() is the intended behavior. There are many issues in parse_url() that users might consider to be a bug, but at this point they have all become a feature.

@iliaal
Copy link
Copy Markdown
Contributor Author

iliaal commented Apr 11, 2026

the current behavior of parse_url() is the intended behavior. There are many issues in parse_url() that users might consider to be a bug, but at this point they have all become a feature.

Declaring a behavior "intended" after the fact doesn't make it so. Intended behavior is documented, tested, or discussed on record. None of that applies here: there's no test in ext/standard/tests/url that asserts the locale-dependent control-char replacement, no documentation describing parse_url() as locale-sensitive, and no prior discussion establishing that UTF-8 continuation-byte corruption under a Latin-1 locale is a feature.

The full ext/standard/tests/url suite passes with the change. If the old behavior were intended, there would be at least one test asserting it. The absence of any such test is the same signal it always is: a quirk that wasn't codified.

This isn't just about userland parse_url() either. php_url_parse_ex2 and php_replace_controlchars are on the hot path of every URL-based stream in php-src: the http/https/ftp/ftps stream wrappers (ext/standard/http_fopen_wrapper.c, ext/standard/ftp_fopen_wrapper.c), the TLS transport (ext/openssl/xp_ssl.c), SoapClient (ext/soap/php_http.c), filter_var($url, FILTER_VALIDATE_URL) (ext/filter/logical_filters.c), and the session URL rewriter (ext/standard/url_scanner_ex.c). Declaring every quirk a feature means every one of these keeps the locale-dependent byte mangling in perpetuity, for every user, in code that has no drop-in replacement. That's a high cost for a position that isn't documented anywhere as project policy.

Reading the thread as a whole, the blocker on the technical side is an unsourced claim that the current locale-dependent behavior is intended, and the blocker on the non-technical side is the preference that parse_url() not be encouraged. I don't see a productive way forward against either, so I'll leave this thread here. The patch lives on my fork at iliaal#31 for anyone who wants to build with it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Colon in url path is not parsed by parse_url

2 participants