Conversation
Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES
so it can be used when characters with those codes are used even when not
escaped. This also resolves the ambuiguity that allowed the following:
PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}\=no_utf_check
No match
Several interrelated bug fixes:
* An UTF-8 or UTF-32 encoded character with a surrogate codepoint should be
allowed with PCRE2_EXTRA_ALLOW_SURROGATES. Before this change non UTF-16
will not allow surrogates if encoded in the pattern directly.
* In ALT_BSUX mode, any escaped character should represent itself if not
"special", but that was not the case for not ASCII characters.
* When match_invalid_utf is enabled a surrogate character in the subject was
being skipped, even if no_utf_check was used together with allow_surrogates
Issues that would be addressed with a new version
* the documentation around \o and \x, specially when bigger than 255 needs
reviewing for clarification.
* pcre2test support for testing escaped UTF characters is suboptimal, so most
testing was done with custom code that should be integrated instead.
* JIT is broken when match_invalid_utf and allow_surrogate.
Issues that are still open and had been punted from this version:
* ALT_BSUX allows escape characters up to \x{ffff} even without PCRE2_UTF.
* PCRE2_ALT_BSUX might assume only UTF-16. It is not clear what 0xf1 might
mean (\u00F1, assume binary matching like it seems to do with surrogates
and match with an extended 0xf1 PCRE2_UCHAR, or error), if the later is
something like PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES needed?.
* PCRE2_ALT_BSUX match shouldn't really do UTF matching when using surrogates.
* PCRE2_EXTRA_ALT_BSUX should restrict the use of \, so for example non ASCII
or \U throw an error. It probably need to break with PCRE2_ALT_SUX.
* Behaviour is different between UTF-16 and UTF-32 but tests are still unified
so tests are missing.
| \x{09f} | ||
| No match | ||
|
|
||
| /^\p{Cs}/match_invalid_utf,allow_surrogates |
There was a problem hiding this comment.
this replicates the behaviour where utf is used instead of match_invalid_utf, by using the "allow_surrogates" to redefine what is valid, but maybe it shouldn't match instead since it is invalid UTF-16?
| /* An additional compile options word is available in the compile context. */ | ||
|
|
||
| #define PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES 0x00000001u /* C */ | ||
| #define PCRE2_EXTRA_ALLOW_SURROGATES 0x00000001u /* C */ |
There was a problem hiding this comment.
It might be better to keep the old flag and add PCRE2_EXTRA_ALLOW_SURROGATES, which will imply it and extend it?
There was a problem hiding this comment.
Yes, I was already going to say that making the non-compatible change of removing an option is not something I'd want to do, especially as there is a way round it, as you suggest.
There was a problem hiding this comment.
... and I note that the GitHub tests have failed...
There was a problem hiding this comment.
correct, I posted this as a work in progress, to help gather feedback on the approach and caveats, and a roundabout way to ask Zoltan for help implementing the new behaviour (use a flag to redefine what is invalid utf), so that match_invalid_utf and utf behave similarly and resolve the ambiguity when matching \p{Cs} in the JIT, if it was sound and correct for the interpreter.
Will update the change and make the RFC more explicit.
Rename PCRE2_EXTRA_ALLOW_SURROGATE_ESCAPES to PCRE2_EXTRA_ALLOW_SURROGATES so it can be used when characters with those codes are used even when not escaped. This also resolves the ambuiguity that allowed the following:
PCRE2 version 10.34 2019-11-21
re> /\p{Cs}/match_invalid_utf
data> \x{dfff}=no_utf_check
No match
Several interrelated bug fixes:
Issues that would be addressed with a new version
Issues that are still open and had been punted from this version: