Add/html character reference decoder by dmsnell · Pull Request #47040 · WordPress/gutenberg

dmsnell · 2023-01-10T21:49:37Z

What?

Adds a new HTML character reference decoder class, used by tag processor, for properly decoding HTML character references (entities). Leaves junk input in output, e.g. when HTML calls to replace a sequence with the replacement character U+FFFD (�) this decoder leaves in the raw input so that it won't change something it doesn't need to.

Why?

html_entity_decode() is an insufficient function in two ways:

it isn't aware of the ambiguous ampersand rule which leads to different decoding based on whether the encoded text comes from an HTML attribute or from data (or a few other contexts)
it doesn't properly decode all the allowable entities and variations allowed by HTML5
- it requires a terminating ;
- doesn't allow zero-extended prefixes to numerical character references e.g. &#x0000065
- doesn't decode the C1 control character replacements, e.g. &#x80 is € not the padding character

How?

Scans an input string for character entities and decodes them as numeric references or as named referenced, looking up names from the HTML5 spec

When performing named character decoding this decoder groups names by their first two letters, forming a naming "group." That group usually contains only a few named references. When finding the appropriate group, we iterate over the candidate names in that group to determine if the input contains that exact name match, and if we do, use that match to determine which text to replace in the input string.

Testing

Need to add tests to confirm this behavior.

Some basic tests so far with the HTML5 spec single-page.html shows very little or no noticeable impact on performance, but slightly increased memory use, probably because of how this is string-copying the class attribute for comparison. There are optimizations we could explore to avoid this allocation.

raw input	how it should be decoded in an attribute (this PR)	how it should be decoded in markup (this PR)	how PHP decodes it
`test`	`test`	`test`	`test`
`&sirnotinthisfilm;`	`&sirnotinthisfilm;`	`&sirnotinthisfilm;`	`&sirnotinthisfilm;`
`&#x0000065`	`e`	`e`	`&#x0000065`
`a&#x1F170b`	`a&#x1F170b`	`a&#x1F170b`	`a&#x1F170b`
`a🅰b`	`a🅰b`	`a🅰b`	`a🅰b`
`a&#x0001F170;b`	`a🅰b`	`a🅰b`	`a🅰b`
`a&#x10FFFCd;b`	`a&#x10FFFCd;b`	`a&#x10FFFCd;b`	`a&#x10FFFCd;b`
`&#x10FFFCt`	`􏿼t`	`􏿼t`	`&#x10FFFCt`
`&#x10FFFC5`	`&#x10FFFC5`	`&#x10FFFC5`	`&#x10FFFC5`
`&#1114101`	`􏿵`	`􏿵`	`&#1114101`
`&#1114101t`	`􏿵t`	`􏿵t`	`&#1114101t`
`&#11141015`	`&#11141015`	`&#11141015`	`&#11141015`
`a&#x1F170x;b`	`a🅰x;b`	`a🅰x;b`	`a&#x1F170x;b`
`a&#1337b`	`aԹb`	`aԹb`	`a&#1337b`
`aԹb`	`aԹb`	`aԹb`	`aԹb`
`a&ampb`	`a&ampb`	`a&b`	`a&ampb`
`a&b`	`a&b`	`a&b`	`a&b`
`a&notin a bind`	`a&notin a bind`	`a¬in a bind`	`a&notin a bind`
`a∉b`	`a∉b`	`a∉b`	`a∉b`
`a&notinb`	`a&notinb`	`a¬inb`	`a&notinb`
`&Abreve;`	`Ă`	`Ă`	`Ă`
`&Abreve`	`&Abreve`	`&Abreve`	`&Abreve`
`Á`	`Á`	`Á`	`Á`
`&Aacute`	`Á`	`Á`	`&Aacute`
`Ála carte`	`Ála carte`	`Ála carte`	`Ála carte`
`&Aacutela carte`	`&Aacutela carte`	`Ála carte`	`&Aacutela carte`
`&Aacute la carte`	`Á la carte`	`Á la carte`	`&Aacute la carte`
`&Aacute=la carte`	`&Aacute=la carte`	`Á=la carte`	`&Aacute=la carte`
`&Aacute*la carte`	`Á*la carte`	`Á*la carte`	`&Aacute*la carte`
``	`€‡•œ`	`€‡•œ`	``
`&#t;`	`&#t;`	`&#t;`	`&#t;`
`&#xd83c;&#xdd70`	`&#xd83c;&#xdd70`	`&#xd83c;&#xdd70`	`&#xd83c;&#xdd70`
`􏿿`	`􏿿`	`􏿿`	`􏿿`
`򟿾`	`򟿾`	`򟿾`	`򟿾`
`&#x09&#x10&#x20`	`&#x10`	`&#x10`	`&#x09&#x10&#x20`
`&#x-65;`	`&#x-65;`	`&#x-65;`	`&#x-65;`

github-actions · 2023-01-10T22:31:19Z

Flaky tests detected in b8e9430.
Some tests passed with failed attempts. The failures may not be related to this commit but are still reported for visibility. See the documentation for more information.

🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924
📝 Reported issues:

[Flaky Test] should always expand single line selection #39787 in specs/editor/various/multi-block-selection.test.js

adamziel · 2023-01-13T11:35:13Z

+	 *
+	 * @var string[] named character reference information
+	 */
+	public static $lookup_table = array(


Is there any way to initialize this on the first use and avoid allocating the memory until the first use? E.g. implementing something like...

public static get_lookup_table() { if ( ! self::$lookup_table ) { self::$lookup_table = /* ... */; } return self::$lookup_table; }

If PHP always brings the methods opcodes into memory then it wouldn't help at all, but it's worth a shot.

as static string literals I don't think this is possible, or that it will save any memory, because those strings already have to be in memory for it to lazily-initialize them.

this is down to a negligible amount IMO and constrasts a few MBs from clearer constructs, such as the array you mention before, such as the array introduced in d663e70

adamziel · 2023-01-13T11:48:52Z

+					$at += $digit_count;
+					$buffer .= substr( $input, $next, $at - $next );
+					continue;


Nit: just continue; should suffice here as well – the next loop iteration will skip over the digits anyway.

adamziel · 2023-01-13T11:57:03Z

+					 * For group "qu", during lookup that will find "&quot;"
+					 *
+					 * ┌─────┬────┬──────┬────┬──────────────┬────┬─────┐
+					 * │ ... │ N5 │ Name │ S5 │ Substitution │ N6 │ ... │
+					 * ├─────┼────┼──────┼────┼──────────────┼────┼─────┤
+					 * │ ... │ 04 │ ote; │ 01 │ "            │ 03 │ ... │
+					 * └─────┴────┴──────┴────┴──────────────┴────┴─────┘
+					 *         ^^          ^^
+					 *          |           |
+					 *          |           ╰ The substitution is one byte,
+					 *          |             even though it's represented in
+					 *          |             the string literal as "\x22", which
+					 *          |             is done for the sake of avoiding
+					 *          |             quoting issues in PHP.
+					 *          |
+					 *          ╰ The "ote;" is four bytes (the finishing of &quo̱ṯe̱;̱)
+					 *
+					 * The part of the group string this represents follows:
+					 * > ...\x04ote;\x01\x22\x03...


I'm confused with this comment. It says the group qu will find ", but the example talks about &quote;. What does N5, S5, and N6 mean in the header? I figured out the group string contains different possible substitutions to reduce the size of the lookup table, but I'm not sure what else is there to know about it.

I looked up that entry in $lookup_table and it differs from what's described in this comment:

"QU" => "\x03OT;\x01\x22\x02OT\x01\x22", // &QUOT; &QUOT

the ote; is a typo, but you're not looking up the listed entry, since you looked up QU instead of qu. QU might be a better example though since it only has the two entries and this one has more.

// &quaternions; &quatint; &questeq; &quest; " &quot "qu" => "\x0aaternions;\x03ℍ\x06atint;\x03⨖\x06esteq;\x03≟\x04est;\x01?\x03ot;\x01\x22\x02ot\x01\x22",

still working on the table; trying to show the structural layout of the lookups

adamziel · 2023-01-13T13:17:33Z

- * @TODO: Decode HTML references/entities in class names when matching.
- *        E.g. match having class `1<"2` needs to recognize `class="1&lt;&quot;2"`.
- * @TODO: Decode character references in `get_attribute()`
 * @TODO: Properly escape attribute value in `set_attribute()`


Does that still apply? As long as the " character is escaped everything should be fine even if nothing else is escaped?

I'm not following your question. can you rephrase it?

the class value of a&#x65a should be interpreted as having class aaa, so the stylesheet to apply rules to it would look like this…

.aaa { … }

these TODOs thus addressed two different issues:

if we search for a tag with class wp-block it should find a tag even if it's written in the tag as class="wp&#x2dblock" or class='wp&#45block' or any other variant with character references. strangely the class &#x1FFFF (and others like it with invalid character references) map to the CSS class name �.

when decoding attribute values PHP has been returning potentially invalid strings since html_entity_decode is both broken by implementation and by design.

adamziel · 2023-01-13T13:26:17Z

+	 * @var string[] named character reference information
+	 */
+	public static $lookup_table = array(
+		"AE" => "\x04lig;\x02Æ\x03lig\x02Æ",               // &AElig; &AElig


I'm confused by this $lookup_table. Why not store it as "AE" => ["lig;" => "Æ", "lig" => "Æ"]? Is this a memory optimization? Here's how _zend_string is stored:

struct _zend_string { zend_refcounted_h gc; zend_ulong h; /* hash value */ size_t len; char val[1]; };

len is just like size byte you use, and val[1] is used to store the actual contents (the length 1 is a "struct hack"). On top of that there's a few integer allocations for the hash (computed on the first use) and the gc reference. So each zend_string would take up a few integer allocations more and enable less complex code.

What are your thoughts on that?

this was an optimization; see d663e70 for the previous data structure. I looked at another, which was storing these as static properties of an auto-generated class using a Character_Reference class with $name and $substitution properties, but each one had its own costs.

you're right that Zend strings don't require more storage, but a couple things led me away from the array as you wrote it:

array access involves a few steps of memory indirection and we don't know if the second or third name in a group will be close or distant in memory. the packed string format I'm using currently should keep most candidate names for a group within a single cache line and all name/substitution values except potentially a few very long ones at that. I haven't measured the impact of this, but at least we know we're preserving locality during iteration of the candidates and will avoid all further memory lookups.

array key lookup requires an additional hashing step which here I don't think will be necessary. we're still doing it for groups, but I'm trying to balance the different tradeoffs and minimize the search space as quickly as possible. the packed string is mostly this same structure you're proposing except we know it comes from a contiguous sequence of bytes vs. a group of pointer dereferences.

because it's all been somewhat fast I've only been looking at memory overhead and as before, arrays have surprised me how much they add. I thought through another few approaches whereby the substitutions and names were all found in a single string, but those didn't offer much and were much messier in an editor.

additionally I looked at packing a tree in a single string which could potentially reduce the size of the table dramatically by avoiding most duplicated letters; it would let us scan one letter at a time (and also avoid new allocations) but I didn't code it up yet for no reason other than time.

there are 2231 named character references and we know we'd need at least a few bytes for each one, let alone the substitutions which total 6420 bytes. if we can get this down to a couple extra bytes of overhead for each one we're still looking at around a minimum of 10KB vs. 135KB, within one order of magnitude, and I don't think it's bad to add that much code on the server.

I'm not exactly sure what this approach would be like, but maybe I'll do an exploration at some point. this seemed pretty much good enough for now 🤷‍♂️

I just played with packing trie into a string for fun: https://gist.github.com/adamziel/a628fc2d63a7733f9fed0dc9c3c34705

@adamziel given that your trie representation still has some issues it's hard for me to test, but, I also find it maybe much more difficult to comprehend when reading. I do admit that my packed structure is itself a bit difficult to see, but at least the groups are easy to scan to and most are short enough where you can see all the names and replacements.

I predict that given we are doing one array lookup in my code based on the first two characters, that it's sufficiently optimized over the more abstract tree where we have approximately as many lookups as we have letters in the character name (less an optimization for when only one name remains given the prefix). the string scanning within a group also should be prompt given that the names within a group should be very local in memory and read in the same burst.

that being said, and knowing that in most cases we won't have any character references in an attribute, I guess we could relax this packing and go back to what's in the original commit (or a slight improvement) and see if we care about the performance.

ironically I do find the packed string in some ways more legible than the noisier array. concerning memory though there's a big difference between 'AE' => "\x04lig;\x02Æ\x03lig\x02Æ" and this…

'AE' => array( 'lig;', 'Æ', 'lig', 'Æ' ), 'AM' => array( 'P;', '&', 'P', '&' ), …

Whereas the packed string is a single memory lookup where all the values coincide, the array version includes string values which might be at very different locations in memory, so we still have in the case of &lig at least four separate memory lookups once we arrive at the group, since each value in the array is going to be a pointer to some other location with the actual string data.

As a side-note I find that any optimization to make the semi-colon optional is a bit fraught because it's not the case that all semi-colons are optional or required. so as far as the spec goes, the semicolon might as well be a normal letter.

For memory limiting I wonder how the trie implementation would look after fixing it. We have to make sure it finds &lig; before it finds &lig and it has to know at each prefix if there are branches, but also if there's a termination.

For example, the trie as written encodes AElig; and AElig in the same structure…

array(52) { ["A"]=> array(16) { ["E"]=> array(1) { ["l"]=> array(1) { ["i"]=> array(1) { ["g"]=> string(2) "Æ" } } } } }

it looks like the problem here is that it finds the version without the semicolon and when the path is empty it overrides the existing [";"]=>string(2) "Æ" that was there already.

I was able to easily modify the script so that it stores the terminators at each branch point, but that made the packed table increase to 26 KB, and this trie traversal looks relatively slow compared to the string comparing. is 110 KB, if we can fit it in there, worth the trie?

- more data locality within each group - fewer memory indirections - less string scanning for length lookup

dmsnell · 2024-05-24T23:55:26Z

Replaced by WordPress/wordpress-develop#6387

dmsnell requested review from adamziel, georgeh and ockham January 10, 2023 21:49

dmsnell requested a review from spacedmonkey as a code owner January 10, 2023 21:49

dmsnell marked this pull request as draft January 10, 2023 21:50

adamziel reviewed Jan 13, 2023

View reviewed changes

dmsnell added 9 commits January 24, 2023 00:59

HTML Processsing: Adds HTML character reference decoder

ad1b759

Decode entities in attributes properly

9bfa185

Add loop budget for character reference decoder

c35a1bf

Add some comments

1cfe5c2

Add some comments

987f787

Use basic UTF-8 encoder to avoid call to html_entity_decode

f563c34

Cleanup code and properly print out invalid named character references

67f9a88

Handle special-cased code-point references

c02718e

Store named character reference lookup table in string format

b8e9430

- more data locality within each group - fewer memory indirections - less string scanning for length lookup

dmsnell force-pushed the add/html-character-reference-decoder branch from 026e439 to b8e9430 Compare January 24, 2023 07:59

Pack C1 replacements, add test cases, and fix bugs

4277f6c

gziolo added the [Feature] Block API API that allows to express the block paradigm. label Jan 31, 2023

dmsnell mentioned this pull request Feb 1, 2023

Editor: Introduce HTML Tag Processor WordPress/wordpress-develop#3920

Closed

adamziel mentioned this pull request Feb 9, 2023

WP_HTML_Tag_Processor overview issue #44410

Closed

26 tasks

gziolo added [Feature] HTML API An API for updating HTML attributes in markup and removed [Feature] Block API API that allows to express the block paradigm. labels Apr 18, 2023

dmsnell mentioned this pull request Apr 12, 2024

HTML API: Add custom text decoder WordPress/wordpress-develop#6387

Closed

dmsnell closed this May 24, 2024

dmsnell deleted the add/html-character-reference-decoder branch May 24, 2024 23:57

Conversation

dmsnell commented Jan 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What?

Why?

How?

Testing

Uh oh!

github-actions Bot commented Jan 10, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adamziel Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamziel Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

adamziel Jan 13, 2023 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

dmsnell commented May 24, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

dmsnell commented Jan 10, 2023 •

edited

Loading

github-actions Bot commented Jan 10, 2023 •

edited

Loading

adamziel Jan 13, 2023 •

edited

Loading

adamziel Jan 13, 2023 •

edited

Loading

adamziel Jan 13, 2023 •

edited

Loading