fix: align HTML parsing with the spec #2387

Merged
fb55 merged 6 commits into master from html-fixes
Mar 20, 2026

fb55 commented on Mar 20, 2026

Lots of changes to align us better with the HTML spec:

  • Raw-text tags: <iframe>, <noembed>, <noframes>, and <plaintext> now behave like <script> and <style> — their content is treated as raw text, not parsed as HTML
  • RCDATA entities: <textarea> now decodes entities like <title> already did
  • Bogus comments: <?…>, <!…> (non-DOCTYPE), and </ …> in HTML mode now emit comments instead of being silently dropped or emitted as text
  • Comment edge cases: <!-->, <!--->, <!->, and <!> are now parsed as valid (empty) comments per the spec
  • EOF handling: Unclosed comments, partial <!DOCTYPE, <?…, and <![CDATA[… at end-of-input now emit the correct token type instead of raw text
  • CDATA in HTML: Partial CDATA matches like <![CD> emit a bogus comment; unclosed <![CDATA[… at EOF emits [CDATA[… as a comment (without a spurious ]] suffix)
  • </>: Empty close tags are now silently ignored in HTML mode instead of emitting </> as text
  • DOCTYPE without space: <!DOCTYPEhtml> is now recognized as a DOCTYPE, not a bogus comment
  • SVG/MathML support: Tag names inside <svg> are case-adjusted per the spec (e.g. foreignObject, clipPath); CDATA sections are treated as text; special-tag detection is disabled inside foreign content
  • <image> alias: <image> is rewritten to <img> outside foreign content, matching browser behavior
  • Implicit closes: <h1>–<h6> now close an open heading; <a> closes a previous <a>; nested <form> is ignored when one is already open
  • Self-closing in HTML: <script/>, <style/>, etc. now correctly enter their raw-text state in HTML mode (the / is ignored per spec) unless recognizeSelfClosing is enabled
  • Foreign context correctness: Stray </svg> or </math> no longer corrupt the parser's context tracking; implicitly closed foreign elements are properly cleaned up
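
The raw-text, RCDATA, and foreign-content bullets above can be summarized as a single lookup. The following is a minimal sketch under stated assumptions, not the code from this PR: `TextMode`, `RAW_TEXT_TAGS`, `RCDATA_TAGS`, and `textModeFor` are hypothetical names, and the tag sets contain only the tags named in this description.

```typescript
// Hypothetical names throughout; this is not htmlparser2's internal code.
type TextMode = "data" | "rawtext" | "rcdata" | "plaintext";

// Tags whose content is raw text: no nested markup, no entity decoding.
const RAW_TEXT_TAGS = new Set(["script", "style", "iframe", "noembed", "noframes"]);
// Tags whose content is text with entities decoded.
const RCDATA_TAGS = new Set(["title", "textarea"]);

function textModeFor(tagName: string, inForeignContent: boolean): TextMode {
    // Inside <svg> or <math>, special-tag detection is disabled.
    if (inForeignContent) return "data";
    const name = tagName.toLowerCase();
    // <plaintext> is permanent: nothing after it is parsed as markup.
    if (name === "plaintext") return "plaintext";
    if (RAW_TEXT_TAGS.has(name)) return "rawtext";
    if (RCDATA_TAGS.has(name)) return "rcdata";
    return "data";
}
```

Under these assumptions, `textModeFor("textarea", false)` yields `"rcdata"`, while `textModeFor("script", true)` yields `"data"` because the tag sits inside foreign content.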

Helped by a number of AI Agents (Opus 4.6, GPT-5.4, Devin Reviews).

Tokenizer:
- Treat <iframe>, <noembed>, <noframes> as raw-text tags
- Treat <plaintext> as a permanent raw-text tag
- Parse RCDATA entities in <textarea> (not just <title>)
- Treat <?…> and non-DOCTYPE <!…> as bogus comments in HTML
- Match DOCTYPE case-insensitively via DeclarationSequence state
- Handle <!>, <!-->, <!--->, <!-> as spec-compliant comments
- Treat </> as an ignored token, </ x> as a bogus comment
- Emit bogus comments (not text) for all EOF-in-markup states
- Treat unclosed CDATA in HTML as a bogus comment at EOF
- Skip special-tag detection inside foreign (SVG/MathML) content
- Honor recognizeSelfClosing for raw-text tags (<script/> etc.)
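
As an illustration of the comment and DOCTYPE rules in the list above, here is a rough sketch of how a construct starting with `<` might be classified in HTML mode. It follows the HTML spec's tag-open, end-tag-open, and markup-declaration-open states; `MarkupKind` and `classifyMarkup` are hypothetical names, and the real tokenizer is a character-by-character state machine rather than a string classifier.

```typescript
// Hypothetical helper for HTML (non-foreign) content; not the PR's tokenizer.
type MarkupKind =
    | "comment"
    | "bogus-comment"
    | "doctype"
    | "ignored"
    | "start-tag"
    | "end-tag"
    | "text";

function classifyMarkup(input: string): MarkupKind {
    if (!input.startsWith("<")) return "text";
    const rest = input.slice(1);
    if (rest.startsWith("!")) {
        const decl = rest.slice(1);
        // <!-- ... -->, including the short forms <!--> and <!--->.
        if (decl.startsWith("--")) return "comment";
        // Case-insensitive match; no space is required after the keyword.
        if (decl.slice(0, 7).toLowerCase() === "doctype") return "doctype";
        // Everything else, e.g. <!>, <![CD>, or <![CDATA[ in HTML content.
        return "bogus-comment";
    }
    if (rest.startsWith("?")) return "bogus-comment";
    if (rest.startsWith("/")) {
        const next = rest.charAt(1);
        if (next === ">") return "ignored"; // </> produces no token
        if (/[a-z]/i.test(next)) return "end-tag";
        // The spec emits a bare "</" at end of input as text.
        return next === "" ? "text" : "bogus-comment";
    }
    return /^[a-z]/i.test(rest) ? "start-tag" : "text";
}
```

For example, `classifyMarkup("<!DOCTYPEhtml>")` gives `"doctype"`, `classifyMarkup("</ x>")` gives `"bogus-comment"`, and `classifyMarkup("</>")` gives `"ignored"`.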

Parser:
- Adjust SVG tag names to spec casing (foreignObject, clipPath, …)
- Alias <image> to <img> outside foreign content
- Treat CDATA as text (not a comment) in foreign content
- Headings (h1–h6) implicitly close other headings
- <a> implicitly closes a previous <a>
- Ignore nested <form> when one is already open
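
The implicit-close rules in the parser list can be sketched with a stack of open elements. This is a simplified illustration, not the PR's implementation: `IMPLIES_CLOSE`, `OpenResult`, and `openTag` are hypothetical names, and the sketch assumes an element is only closed implicitly while it is the innermost open element.

```typescript
// Hypothetical table: opening the key closes any of the listed tags
// while one of them is the innermost open element.
const HEADINGS = ["h1", "h2", "h3", "h4", "h5", "h6"];
const IMPLIES_CLOSE = new Map<string, Set<string>>();
IMPLIES_CLOSE.set("a", new Set(["a"]));
for (const heading of HEADINGS) {
    IMPLIES_CLOSE.set(heading, new Set(HEADINGS));
}

interface OpenResult {
    closed: string[]; // tags closed implicitly, innermost first
    ignored: boolean; // true when the new tag was dropped
}

function openTag(stack: string[], name: string): OpenResult {
    // A nested <form> is ignored while another <form> is open.
    if (name === "form" && stack.indexOf("form") !== -1) {
        return { closed: [], ignored: true };
    }
    const closes = IMPLIES_CLOSE.get(name);
    const closed: string[] = [];
    let top = stack[stack.length - 1];
    while (closes !== undefined && top !== undefined && closes.has(top)) {
        stack.pop();
        closed.push(top);
        top = stack[stack.length - 1];
    }
    stack.push(name);
    return { closed, ignored: false };
}
```

With a stack of `["h1"]`, calling `openTag(stack, "h6")` reports `closed: ["h1"]` and leaves `["h6"]` on the stack. An `<a>` separated from a new `<a>` by another open element is not closed by this sketch; the spec handles that case with more involved logic.
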

Copilot AI review requested due to automatic review settings on Mar 20, 2026, 18:11

Copilot AI left a comment

Pull request overview

This PR updates the HTML tokenizer and parser behavior to better align with the HTML parsing spec, particularly around raw-text/RCDATA handling, bogus comments, EOF-in-markup behavior, and foreign (SVG/MathML) content rules.

Changes:

  • Tokenizer: expands “text-only” tag handling (raw-text/RCDATA/plaintext), improves spec-compliant comment/doctype/bogus-comment parsing, and adjusts EOF handling for markup states.
  • Parser: adds foreign-context awareness (SVG/MathML), applies SVG tag-name case adjustments, and implements HTML-specific behaviors such as <a>/heading implicit closes and ignoring nested <form> elements.
  • Tests/snapshots: add/adjust coverage for foreign CDATA/text handling, bogus comments, self-closing behavior, and other spec edge cases.
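
To make the tag-name handling concrete, here is a small sketch of SVG case adjustment combined with the `<image>` alias. The names `Context`, `SVG_TAG_NAMES`, and `adjustTagName` are hypothetical, the table is only a subset of the adjustments listed in the HTML spec, and lowercasing every incoming name is an assumption of this sketch.

```typescript
// Hypothetical helper; not the PR's implementation.
type Context = "html" | "svg" | "mathml";

// Subset of the HTML spec's SVG tag-name adjustment table.
const SVG_TAG_NAMES = new Map<string, string>([
    ["clippath", "clipPath"],
    ["foreignobject", "foreignObject"],
    ["lineargradient", "linearGradient"],
    ["radialgradient", "radialGradient"],
    ["textpath", "textPath"],
]);

function adjustTagName(rawName: string, context: Context): string {
    const name = rawName.toLowerCase();
    // Inside <svg>, restore the spec's mixed-case spelling where one exists.
    if (context === "svg") return SVG_TAG_NAMES.get(name) ?? name;
    // Outside foreign content, <image> is an alias for <img>.
    if (context === "html" && name === "image") return "img";
    return name;
}
```

Here `adjustTagName("FOREIGNOBJECT", "svg")` returns `"foreignObject"`, and `adjustTagName("image", "html")` returns `"img"`, while `<image>` inside SVG keeps its name.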

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| src/Tokenizer.ts | Implements spec-aligned tokenizer behavior for raw-text/RCDATA/plaintext, bogus comments/doctype matching, and EOF handling. |
| src/Parser.ts | Adds foreign-context tracking, SVG casing adjustments, and HTML implicit-close behaviors (headings, <a>, nested <form>). |
| src/Tokenizer.spec.ts | Adds tokenizer test coverage for new edge cases (bogus comments, self-closing text-only tags, etc.). |
| src/Parser.events.spec.ts | Adds parser event tests for foreign content, EOF bogus comments, entities in <textarea>, plaintext, etc. |
| src/Parser.spec.ts | Adds unit tests for declaration casing/preservation behavior. |
| src/index.spec.ts | Adds a snapshot test for parsing CDATA inside SVG/foreign content. |
| src/__snapshots__/* | Updates snapshots to reflect the new spec-aligned behavior. |

fb55 enabled auto-merge (squash) on Mar 20, 2026, 23:04
fb55 merged commit e32389a into master on Mar 20, 2026
12 checks passed
fb55 deleted the html-fixes branch on Mar 20, 2026, 23:06