Conversation
|
Flaky tests detected in b8e9430. 🔍 Workflow run URL: https://github.com/WordPress/gutenberg/actions/runs/3994110924
|
| * | ||
| * @var string[] named character reference information | ||
| */ | ||
| public static $lookup_table = array( |
There was a problem hiding this comment.
Is there any way to initialize this on the first use and avoid allocating the memory until the first use? E.g. implementing something like...
public static get_lookup_table() {
if ( ! self::$lookup_table ) {
self::$lookup_table = /* ... */;
}
return self::$lookup_table;
}If PHP always brings the methods opcodes into memory then it wouldn't help at all, but it's worth a shot.
There was a problem hiding this comment.
as static string literals I don't think this is possible, or that it will save any memory, because those strings already have to be in memory for it to lazily-initialize them.
this is down to a negligible amount IMO and constrasts a few MBs from clearer constructs, such as the array you mention before, such as the array introduced in d663e70
| $at += $digit_count; | ||
| $buffer .= substr( $input, $next, $at - $next ); | ||
| continue; |
There was a problem hiding this comment.
Nit: just continue; should suffice here as well – the next loop iteration will skip over the digits anyway.
| * For group "qu", during lookup that will find """ | ||
| * | ||
| * ┌─────┬────┬──────┬────┬──────────────┬────┬─────┐ | ||
| * │ ... │ N5 │ Name │ S5 │ Substitution │ N6 │ ... │ | ||
| * ├─────┼────┼──────┼────┼──────────────┼────┼─────┤ | ||
| * │ ... │ 04 │ ote; │ 01 │ " │ 03 │ ... │ | ||
| * └─────┴────┴──────┴────┴──────────────┴────┴─────┘ | ||
| * ^^ ^^ | ||
| * | | | ||
| * | ╰ The substitution is one byte, | ||
| * | even though it's represented in | ||
| * | the string literal as "\x22", which | ||
| * | is done for the sake of avoiding | ||
| * | quoting issues in PHP. | ||
| * | | ||
| * ╰ The "ote;" is four bytes (the finishing of &quo̱ṯe̱;̱) | ||
| * | ||
| * The part of the group string this represents follows: | ||
| * > ...\x04ote;\x01\x22\x03... |
There was a problem hiding this comment.
I'm confused with this comment. It says the group qu will find ", but the example talks about "e;. What does N5, S5, and N6 mean in the header? I figured out the group string contains different possible substitutions to reduce the size of the lookup table, but I'm not sure what else is there to know about it.
There was a problem hiding this comment.
I looked up that entry in $lookup_table and it differs from what's described in this comment:
"QU" => "\x03OT;\x01\x22\x02OT\x01\x22", // " "
There was a problem hiding this comment.
the ote; is a typo, but you're not looking up the listed entry, since you looked up QU instead of qu. QU might be a better example though since it only has the two entries and this one has more.
// ℍ ⨖ ≟ ? " "
"qu" => "\x0aaternions;\x03ℍ\x06atint;\x03⨖\x06esteq;\x03≟\x04est;\x01?\x03ot;\x01\x22\x02ot\x01\x22",
still working on the table; trying to show the structural layout of the lookups
| * @TODO: Decode HTML references/entities in class names when matching. | ||
| * E.g. match having class `1<"2` needs to recognize `class="1<"2"`. | ||
| * @TODO: Decode character references in `get_attribute()` | ||
| * @TODO: Properly escape attribute value in `set_attribute()` |
There was a problem hiding this comment.
Does that still apply? As long as the " character is escaped everything should be fine even if nothing else is escaped?
There was a problem hiding this comment.
I'm not following your question. can you rephrase it?
the class value of aٚ should be interpreted as having class aaa, so the stylesheet to apply rules to it would look like this…
.aaa { … }these TODOs thus addressed two different issues:
- if we search for a tag with class
wp-blockit should find a tag even if it's written in the tag asclass="wp˛lock"orclass='wp-block'or any other variant with character references. strangely the class(and others like it with invalid character references) map to the CSS class name�. - when decoding attribute values PHP has been returning potentially invalid strings since
html_entity_decodeis both broken by implementation and by design.
| * @var string[] named character reference information | ||
| */ | ||
| public static $lookup_table = array( | ||
| "AE" => "\x04lig;\x02Æ\x03lig\x02Æ", // Æ Æ |
There was a problem hiding this comment.
I'm confused by this $lookup_table. Why not store it as "AE" => ["lig;" => "Æ", "lig" => "Æ"]? Is this a memory optimization? Here's how _zend_string is stored:
struct _zend_string {
zend_refcounted_h gc;
zend_ulong h; /* hash value */
size_t len;
char val[1];
};len is just like size byte you use, and val[1] is used to store the actual contents (the length 1 is a "struct hack"). On top of that there's a few integer allocations for the hash (computed on the first use) and the gc reference. So each zend_string would take up a few integer allocations more and enable less complex code.
What are your thoughts on that?
There was a problem hiding this comment.
this was an optimization; see d663e70 for the previous data structure. I looked at another, which was storing these as static properties of an auto-generated class using a Character_Reference class with $name and $substitution properties, but each one had its own costs.
you're right that Zend strings don't require more storage, but a couple things led me away from the array as you wrote it:
- array access involves a few steps of memory indirection and we don't know if the second or third name in a group will be close or distant in memory. the packed string format I'm using currently should keep most candidate names for a group within a single cache line and all name/substitution values except potentially a few very long ones at that. I haven't measured the impact of this, but at least we know we're preserving locality during iteration of the candidates and will avoid all further memory lookups.
- array key lookup requires an additional hashing step which here I don't think will be necessary. we're still doing it for groups, but I'm trying to balance the different tradeoffs and minimize the search space as quickly as possible. the packed string is mostly this same structure you're proposing except we know it comes from a contiguous sequence of bytes vs. a group of pointer dereferences.
because it's all been somewhat fast I've only been looking at memory overhead and as before, arrays have surprised me how much they add. I thought through another few approaches whereby the substitutions and names were all found in a single string, but those didn't offer much and were much messier in an editor.
additionally I looked at packing a tree in a single string which could potentially reduce the size of the table dramatically by avoiding most duplicated letters; it would let us scan one letter at a time (and also avoid new allocations) but I didn't code it up yet for no reason other than time.
there are 2231 named character references and we know we'd need at least a few bytes for each one, let alone the substitutions which total 6420 bytes. if we can get this down to a couple extra bytes of overhead for each one we're still looking at around a minimum of 10KB vs. 135KB, within one order of magnitude, and I don't think it's bad to add that much code on the server.
I'm not exactly sure what this approach would be like, but maybe I'll do an exploration at some point. this seemed pretty much good enough for now 🤷♂️
There was a problem hiding this comment.
I just played with packing trie into a string for fun: https://gist.github.com/adamziel/a628fc2d63a7733f9fed0dc9c3c34705
There was a problem hiding this comment.
@adamziel given that your trie representation still has some issues it's hard for me to test, but, I also find it maybe much more difficult to comprehend when reading. I do admit that my packed structure is itself a bit difficult to see, but at least the groups are easy to scan to and most are short enough where you can see all the names and replacements.
I predict that given we are doing one array lookup in my code based on the first two characters, that it's sufficiently optimized over the more abstract tree where we have approximately as many lookups as we have letters in the character name (less an optimization for when only one name remains given the prefix). the string scanning within a group also should be prompt given that the names within a group should be very local in memory and read in the same burst.
that being said, and knowing that in most cases we won't have any character references in an attribute, I guess we could relax this packing and go back to what's in the original commit (or a slight improvement) and see if we care about the performance.
ironically I do find the packed string in some ways more legible than the noisier array. concerning memory though there's a big difference between 'AE' => "\x04lig;\x02Æ\x03lig\x02Æ" and this…
'AE' => array( 'lig;', 'Æ', 'lig', 'Æ' ),
'AM' => array( 'P;', '&', 'P', '&' ),
…Whereas the packed string is a single memory lookup where all the values coincide, the array version includes string values which might be at very different locations in memory, so we still have in the case of &lig at least four separate memory lookups once we arrive at the group, since each value in the array is going to be a pointer to some other location with the actual string data.
As a side-note I find that any optimization to make the semi-colon optional is a bit fraught because it's not the case that all semi-colons are optional or required. so as far as the spec goes, the semicolon might as well be a normal letter.
For memory limiting I wonder how the trie implementation would look after fixing it. We have to make sure it finds &lig; before it finds &lig and it has to know at each prefix if there are branches, but also if there's a termination.
For example, the trie as written encodes AElig; and AElig in the same structure…
array(52) {
["A"]=>
array(16) {
["E"]=>
array(1) {
["l"]=>
array(1) {
["i"]=>
array(1) {
["g"]=>
string(2) "Æ"
}
}
}
}
}
it looks like the problem here is that it finds the version without the semicolon and when the path is empty it overrides the existing [";"]=>string(2) "Æ" that was there already.
I was able to easily modify the script so that it stores the terminators at each branch point, but that made the packed table increase to 26 KB, and this trie traversal looks relatively slow compared to the string comparing. is 110 KB, if we can fit it in there, worth the trie?
- more data locality within each group - fewer memory indirections - less string scanning for length lookup
026e439 to
b8e9430
Compare
|
Replaced by WordPress/wordpress-develop#6387 |
Trac ticket: Core-60841
What?
Adds a new HTML character reference decoder class, used by tag processor, for properly decoding HTML character references (entities). Leaves junk input in output, e.g. when HTML calls to replace a sequence with the replacement character U+FFFD (�) this decoder leaves in the raw input so that it won't change something it doesn't need to.
Why?
html_entity_decode()is an insufficient function in two ways:;e€is€not the padding characterHow?
Scans an input string for character entities and decodes them as numeric references or as named referenced, looking up names from the HTML5 spec
When performing named character decoding this decoder groups names by their first two letters, forming a naming "group." That group usually contains only a few named references. When finding the appropriate group, we iterate over the candidate names in that group to determine if the input contains that exact name match, and if we do, use that match to determine which text to replace in the input string.
Testing
Need to add tests to confirm this behavior.
Some basic tests so far with the HTML5 spec
single-page.htmlshows very little or no noticeable impact on performance, but slightly increased memory use, probably because of how this is string-copying theclassattribute for comparison. There are optimizations we could explore to avoid this allocation.testtesttesttest&sirnotinthisfilm;&sirnotinthisfilm;&sirnotinthisfilm;&sirnotinthisfilm;旮ٚ����🅰ba🅰ba🅰ba🅰ba🅰ba🅰ba🅰ba🅰ba�ba�ba�ba�b􏿼ttt􏿼t����􏿵􏿵􏿵ttt􏿵t����a🅰x;ba🅰x;ba🅰x;ba🅰x;baԹbaԹbaԹbaԹbaԹbaԹbaԹbaԹba&ba&ba&ba&ba&ba&ba&ba&ba¬in a binda¬in a binda¬in a binda¬in a binda∉ba∉ba∉ba∉ba¬inba¬inba¬inba¬inbĂĂĂĂ&Abreve&Abreve&Abreve&AbreveÁÁÁÁÁÁÁÁÁla carteÁla carteÁla carteÁla carteÁla carteÁla carteÁla carteÁla carteÁ la carteÁ la carteÁ la carteÁ la carteÁ=la carteÁ=la carteÁ=la carteÁ=la carteÁ*la carteÁ*la carteÁ*la carteÁ*la carte€‡•œ€‡•œ€‡•œ€‡•œ&#t;&#t;&#t;&#t;��������	 	 &#x-65;&#x-65;&#x-65;&#x-65;