HTML API: Implement adoption agency algorithm and active format reconstruction #81

Draft
sirreal wants to merge 3 commits into trunk from html-apl/implement-aaa-afr

sirreal commented Jul 3, 2026

Owner

Trac ticket:

Use of AI Tools

Example disclosure:

AI assistance: Yes
Tool(s): Claude Code
Model(s): Fable 5
Used for: Implementation.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

sirreal added 3 commits July 3, 2026 12:32
Introduce the supporting operations that the adoption agency algorithm and
active formatting element reconstruction require on the two parser stacks.

On the stack of open elements:

- Extract the "in scope" element list into a shared class constant.
- Add `has_node_in_scope()`, which reports whether a specific node (rather
  than any element of a given tag name) is in scope. The adoption agency
  algorithm must test a specific formatting element, regardless of other
  open elements sharing its tag name.
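
To illustrate the distinction, here is a minimal Python sketch (not the PR's PHP code). The `Node` class, the function signatures, and the reduced boundary list are assumptions for illustration; only the name `has_node_in_scope()` comes from the commit message.

```python
from dataclasses import dataclass

# HTML-namespace scope boundaries from the HTML standard's "has an element in
# scope" definition. The MathML and SVG boundaries are omitted for brevity.
IN_SCOPE_BOUNDARIES = frozenset(
    {"APPLET", "CAPTION", "HTML", "TABLE", "TD", "TH", "MARQUEE", "OBJECT", "TEMPLATE"}
)


@dataclass(eq=False)  # eq=False keeps comparison identity-based
class Node:
    """Hypothetical stand-in for an entry on the stack of open elements."""
    tag_name: str


def has_element_in_scope(stack, tag_name):
    """Tag-name check: any open element with this name counts."""
    for node in reversed(stack):
        if node.tag_name == tag_name:
            return True
        if node.tag_name in IN_SCOPE_BOUNDARIES:
            return False
    return False


def has_node_in_scope(stack, target):
    """Identity check: only this specific node counts."""
    for node in reversed(stack):
        if node is target:
            return True
        if node.tag_name in IN_SCOPE_BOUNDARIES:
            return False
    return False


outer_b, inner_b = Node("B"), Node("B")
stack = [Node("HTML"), Node("BODY"), outer_b, Node("TABLE"), Node("TD"), inner_b]
print(has_element_in_scope(stack, "B"))   # True: the inner B matches by name
print(has_node_in_scope(stack, outer_b))  # False: TD is a boundary above the outer B
```

The two checks disagree exactly when another element with the same tag name sits between the target and a scope boundary, which is the situation the commit message describes.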

On the list of active formatting elements, add position-indexed operations
(`position_of()`, `remove_at()`, `insert_at()`, `replace_node()`) so entries
can be cloned and replaced in place as the algorithms direct.
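
A rough Python sketch of what such position-indexed operations could look like follows. The method names mirror the commit message, but the signatures, return values, and the `Entry` class are assumptions, not the PR's actual PHP API.

```python
class Entry:
    """Hypothetical list entry; compared by identity, never by tag name."""

    def __init__(self, tag_name):
        self.tag_name = tag_name


class ActiveFormattingElements:
    def __init__(self):
        self._entries = []

    def push(self, entry):
        self._entries.append(entry)

    def entries(self):
        return list(self._entries)

    def position_of(self, entry):
        """Return the index of this exact entry, or None if it is absent."""
        for index, candidate in enumerate(self._entries):
            if candidate is entry:
                return index
        return None

    def remove_at(self, index):
        """Remove and return the entry at the given index."""
        return self._entries.pop(index)

    def insert_at(self, index, entry):
        """Insert an entry so that it ends up at the given index."""
        self._entries.insert(index, entry)

    def replace_node(self, old, new):
        """Swap one entry for another in place; report whether old was found."""
        index = self.position_of(old)
        if index is None:
            return False
        self._entries[index] = new
        return True
```

Replacing in place matters because both algorithms care about list order: a clone has to occupy the position of the entry it stands in for.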

These additions are unused until the algorithms are implemented and do not
change parsing behavior.

Implement the adoption agency algorithm and active formatting element
reconstruction so the HTML Processor handles misnested formatting elements
instead of bailing.

Previously the processor stopped whenever a document required reconstructing
implicitly-closed formatting elements (e.g. `<p><b>1<p>2`) or running the
adoption agency algorithm (e.g. `<b>1<p>2</b>3`). Both are now supported:

- `reconstruct_active_formatting_elements()` reopens the run of unclosed
  formatting elements at the end of the list, per the specification's
  rewind/advance/create steps.
- `run_adoption_agency_algorithm()` implements the full algorithm, including
  the furthest-block case and the "any other end tag" fallback.
- The "Noah's Ark clause" limits the list of active formatting elements to
  three equivalent entries (same tag name, namespace, and attributes).
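
The rewind/advance/create steps can be sketched in Python as below. This is a simplified model following the HTML standard's description, not the PR's code: `Element`, `MARKER`, and the function signature are assumptions, and the lists are plain Python lists.

```python
MARKER = object()  # stands in for a marker entry in the list


class Element:
    """Hypothetical element record; compared by identity."""

    def __init__(self, tag_name, attributes=None):
        self.tag_name = tag_name
        self.attributes = dict(attributes or {})


def reconstruct_active_formatting_elements(afe, open_elements):
    """Reopen the trailing run of formatting elements that are no longer open.

    Mutates both lists and returns the newly created clones.
    """

    def is_marker_or_open(entry):
        return entry is MARKER or any(entry is node for node in open_elements)

    if not afe or is_marker_or_open(afe[-1]):
        return []

    # Rewind: walk back to the start of the run of closed entries.
    index = len(afe) - 1
    while index > 0 and not is_marker_or_open(afe[index - 1]):
        index -= 1

    # Advance and create: clone each entry in order and replace it in place.
    created = []
    for position in range(index, len(afe)):
        original = afe[position]
        clone = Element(original.tag_name, original.attributes)
        open_elements.append(clone)
        afe[position] = clone
        created.append(clone)
    return created


# State after `<p><b>1<p>` : the second P closed the first P and, with it, B.
b = Element("B")
open_elements = [Element("HTML"), Element("BODY"), Element("P")]
afe = [b]
reconstruct_active_formatting_elements(afe, open_elements)
print([node.tag_name for node in open_elements])  # ['HTML', 'BODY', 'P', 'B']
```

The reopened `B` is a new object that replaces the original entry in the list, which is why the original entry is no longer present afterwards.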

Because the processor visits a document in a single pass, it cannot relocate
nodes it has already reported. The parser's state (the stack of open elements
and the list of active formatting elements) is maintained exactly as the
specification requires, so every token visited after these algorithms run is
reported with the ancestor chain a browser would produce. Nodes which were
already visited when a misnesting is discovered remain where they were found.

Formatting elements reopened by the parser are reported as "virtual" nodes.
Reading an attribute, class, or qualified name of such a node reports the
value from the tag which opened the original element; these nodes cannot be
modified.

Supporting this required hardening stack-event provenance so a single source
tag never produces two visitor events: pushes are matched to the current
token by identity, and each tag closer is matched to at most one popped node.

The html5lib test cases whose constructed trees differ only because the
adoption agency algorithm re-parents already-visited nodes are skip-listed
with a shared reason; each was verified to match browser behavior for parser
state and normalization. The absorbed
`wpHtmlSupportRequiredActiveFormatReconstruction` test and the previous
bail-asserting cases are replaced with tests of the new behavior.

A FORM end tag encountered while other elements remain open no longer stops
the parser. The form element is removed from the stack of open elements using
the same reconciliation the adoption agency algorithm uses, so any elements
that remain open after it are reported with correct breadcrumbs.

The scope check now tests the specific form element pointer rather than any
FORM element in scope, matching the specification.

One html5lib case (`<form><div></form><div>`) exercises a shape a single-pass
token stream cannot represent: browsers keep the closed FORM as a DOM ancestor
of its still-open descendants. This parser reports following content outside
the closed FORM, mirroring the stack of open elements; the case is skip-listed
with that reason.

sirreal commented Jul 3, 2026

Owner Author

Agent's report @ 764adbd:

The goal is achieved. Both suites pass, and the implementation has been verified against the HTML spec, against browsers, and with a 20,000-case oracle sweep. All changes are uncommitted on html-apl/implement-aaa-afr, ready for review.

What was implemented

Adoption agency algorithm (run_adoption_agency_algorithm) — the complete spec algorithm: the furthest-block path with the full inner loop (cloning, AFE bookmark repositioning, counter-based eviction), the 8-iteration outer loop, an identity-based scope check (new WP_HTML_Open_Elements::has_node_in_scope), and the "any other end tag" fallback that the -A-U handlers now take. All three former bail() sites are gone.
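
For orientation, the "any other end tag" fallback from the HTML standard can be sketched as follows. This is an illustrative Python model, not the PR's code: the stack holds tag-name strings in the HTML namespace only, the `SPECIAL` set is a small assumed subset of the standard's "special" category, and parse-error reporting is omitted.

```python
# Partial, assumed subset of the "special" category; the real list is longer.
SPECIAL = frozenset(
    {"ADDRESS", "BLOCKQUOTE", "BODY", "DIV", "HTML", "LI", "P", "TABLE", "TD", "TH", "UL"}
)


def any_other_end_tag(stack, tag_name):
    """Close the nearest matching open element unless a special element blocks it.

    Mutates the stack and returns the closed tag names, innermost first.
    An empty list means the end tag was ignored.
    """
    for index in range(len(stack) - 1, -1, -1):
        name = stack[index]
        if name == tag_name:
            closed = stack[index:]
            del stack[index:]
            closed.reverse()
            return closed
        if name in SPECIAL:
            return []
    return []


stack = ["HTML", "BODY", "DIV", "SPAN", "EM"]
print(any_other_end_tag(stack, "SPAN"))  # ['EM', 'SPAN']
print(any_other_end_tag(stack, "B"))     # []: DIV is special, so </b> is ignored
```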

Active format reconstruction (reconstruct_active_formatting_elements) — the full rewind/advance/create algorithm. Reconstructed elements appear as virtual openers whose attributes read from the original tag (a clone token carries a fresh bookmark over the source tag's span; a lazily-created reader serves get_attribute, has_class, class_list, etc., including quirks-mode class matching). Clones are read-only; set_attribute on them refuses.

Noah's Ark clause — implemented in the processor at AFE-push time, comparing tag name, namespace, and decoded attribute sets; attribute signatures are computed only when three same-name candidates already exist, so well-formed documents pay nothing. This also bounds reconstruction memory on adversarial input.
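
A minimal Python sketch of the clause with the lazy signature check described above. The `FormattingEntry` class, `MARKER`, and the helper names are hypothetical; the real implementation compares decoded attributes in PHP.

```python
from dataclasses import dataclass, field

MARKER = object()  # stands in for a marker entry in the list


@dataclass(eq=False)  # entries are compared by identity
class FormattingEntry:
    tag_name: str
    namespace: str = "html"
    attributes: dict = field(default_factory=dict)


def attribute_signature(entry):
    """Order-independent summary of an entry's attribute names and values."""
    return tuple(sorted(entry.attributes.items()))


def push_with_noahs_ark(afe, entry):
    """Append entry, first evicting the earliest of three equivalent entries."""
    # Cheap pass: same tag name and namespace, back to the last marker.
    candidates = []
    for existing in reversed(afe):
        if existing is MARKER:
            break
        if existing.tag_name == entry.tag_name and existing.namespace == entry.namespace:
            candidates.append(existing)

    # Attribute signatures are only computed once three candidates exist.
    if len(candidates) >= 3:
        signature = attribute_signature(entry)
        matches = [c for c in candidates if attribute_signature(c) == signature]
        if len(matches) >= 3:
            earliest = matches[-1]  # candidates were collected newest-first
            for index, existing in enumerate(afe):
                if existing is earliest:
                    del afe[index]
                    break
    afe.append(entry)


afe = []
for _ in range(5):
    push_with_noahs_ark(afe, FormattingEntry("B"))
print(len(afe))  # 3
```

Because at most three equivalent entries survive per marker section, repeated identical openers cannot grow the list without bound.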

The core design problem

AAA rearranges the stack of open elements in a random-access way, but this parser's visitor consumes a stream of properly-nested open/close events — and it cannot re-parent nodes it already reported. The design: run the spec algorithm on a working copy of the stack, then reconcile — pop to the deepest shared ancestor and re-open the rest. The parser state stays spec-exact, so every token after the misnesting gets the same breadcrumbs a browser would report; only already-visited nodes stay where they were seen. This required hardening event provenance (token-identity for pushes, a one-shot latch for pops) so one source tag never produces two "real" events. The same reconcile helper fixed two pre-existing bugs: the '+A' mid-stack removal that silently corrupted breadcrumb names (e.g. <a><svg><foreignObject><a>), and the -FORM bail.
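
The reconcile step ("pop to the deepest shared ancestor and re-open the rest") might look roughly like this in Python. The function name, the event tuples, and the `Node` class are assumptions for illustration; how the actual helper labels re-opened nodes is not shown in this report.

```python
class Node:
    """Hypothetical open-element record; compared by identity."""

    def __init__(self, tag_name):
        self.tag_name = tag_name


def reconcile(before, after):
    """Turn a stack rearrangement into properly nested pop/push events.

    `before` is the stack the visitor has already seen; `after` is the
    working copy produced by running the algorithm.
    """
    # Length of the shared prefix: the deepest ancestor both stacks agree on.
    shared = 0
    while (
        shared < len(before)
        and shared < len(after)
        and before[shared] is after[shared]
    ):
        shared += 1

    events = []
    for node in reversed(before[shared:]):  # close innermost first
        events.append(("pop", node.tag_name))
    for node in after[shared:]:  # re-open outermost first
        events.append(("push", node.tag_name))
    return events


# `<b>1<p>2</b>`: per the HTML standard, B leaves the stack and P stays open.
html, body, b, p = Node("HTML"), Node("BODY"), Node("B"), Node("P")
print(reconcile([html, body, b, p], [html, body, p]))
# [('pop', 'P'), ('pop', 'B'), ('push', 'P')]
```

Comparing by identity rather than tag name keeps the shared prefix honest when a clone replaces an element of the same name.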

Verification

  • html5lib suite: runtime skips fell from 420 to 340 with zero failures; 35 tests newly pass outright. The 44 tree-shape mismatches were each mechanically verified as the inherent single-pass limitation — post-parse breadcrumbs match a Dom\HTMLDocument sentinel's ancestor chain and normalization is idempotent — and are skip-listed with documented reason constants.
  • Oracle sweep: 20,000 generated misnesting sequences, zero state divergences from PHP's native HTML5 parser.
  • Performance: typical content parse time unchanged (365ms vs 379ms baseline on 2.7MB); previously-fatal misnested input parses in linear time; clone bookmarks release on GC (13 alive at EOF after 300 reconstruction rounds). Downstream interactivity-api and blocks groups pass untouched.

Tests: new wpHtmlProcessorActiveFormattingElements.php (12 tests: breadcrumbs, clone attributes, Noah's Ark, deep misnesting, event-stream nesting, seek stability); three stale bail-asserting tests updated to assert the now-working behavior; wpHtmlSupportRequiredActiveFormatReconstruction.php absorbed and removed. Class docblocks updated where they described these features as unsupported. PHPCS clean.
