Tokenizer:
- Treat <iframe>, <noembed>, <noframes> as raw-text tags
- Treat <plaintext> as a permanent raw-text tag
- Parse RCDATA entities in <textarea> (not just <title>)
- Treat <?…> and non-DOCTYPE <!…> as bogus comments in HTML
- Match DOCTYPE case-insensitively via DeclarationSequence state
- Handle <!>, <!-->, <!--->, <!-> as spec-compliant comments
- Treat </> as an ignored token, </ x> as a bogus comment
- Emit bogus comments (not text) for all EOF-in-markup states
- Treat unclosed CDATA in HTML as a bogus comment at EOF
- Skip special-tag detection inside foreign (SVG/MathML) content
- Honor recognizeSelfClosing for raw-text tags (<script/> etc.)

Parser:
- Adjust SVG tag names to spec casing (foreignObject, clipPath, …)
- Alias <image> to <img> outside foreign content
- Treat CDATA as text (not a comment) in foreign content
- Headings (h1–h6) implicitly close other headings
- <a> implicitly closes a previous <a>
- Ignore nested <form> when one is already open
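To make the raw-text items above concrete, here is a minimal sketch of how the end of raw-text content could be located. It is not the code from this PR: `RAW_TEXT_TAGS`, `isRawTextTag`, and `findRawTextEnd` are hypothetical names, and the sketch assumes that a close tag only counts when its name is followed by whitespace, `/`, or `>`, and that <plaintext> consumes everything up to the end of input.

```typescript
// Hypothetical helper names; an illustration, not the PR's implementation.
const RAW_TEXT_TAGS = new Set(["script", "style", "iframe", "noembed", "noframes"]);

/** True if content after the open tag should be treated as raw text. */
function isRawTextTag(tagName: string): boolean {
  const name = tagName.toLowerCase();
  return name === "plaintext" || RAW_TEXT_TAGS.has(name);
}

/**
 * Returns the index at which raw-text content ends, i.e. the start of the
 * matching close tag, or input.length if no close tag is found.
 * Assumes tagName contains only ASCII letters (no regex escaping is done).
 */
function findRawTextEnd(input: string, start: number, tagName: string): number {
  const name = tagName.toLowerCase();
  // <plaintext> is "permanent": nothing ever closes it.
  if (name === "plaintext") return input.length;
  // The close tag name must be followed by whitespace, "/" or ">".
  const closeTag = new RegExp("</" + name + "(?=[\\t\\n\\f\\r />])", "gi");
  closeTag.lastIndex = start;
  const match = closeTag.exec(input);
  return match === null ? input.length : match.index;
}
```

For example, in `<iframe><b>x</b></IFRAME>` the inner `<b>x</b>` is returned as raw text because only `</IFRAME>` (matched case-insensitively) ends the content.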
Pull request overview
This PR updates the HTML tokenizer and parser behavior to better align with the HTML parsing spec, particularly around raw-text/RCDATA handling, bogus comments, EOF-in-markup behavior, and foreign (SVG/MathML) content rules.
Changes:
- Tokenizer: expands “text-only” tag handling (raw-text/RCDATA/plaintext), improves spec-compliant comment/doctype/bogus-comment parsing, and adjusts EOF handling for markup states.
- Parser: adds foreign-context awareness (SVG/MathML), applies SVG tag-name case adjustments, and implements HTML-specific behaviors such as <a>/heading implicit closes and ignoring nested <form> elements.
- Tests/snapshots: add/adjust coverage for foreign CDATA/text handling, bogus comments, self-closing behavior, and other spec edge cases.
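As an illustration of the SVG tag-name case adjustment mentioned above, the following sketch maps lowercased tag names back to their spec casing while inside SVG content. The table is only a small subset of the adjustments listed in the HTML spec, and `SVG_TAG_NAMES` and `adjustSvgTagName` are hypothetical names rather than identifiers from src/Parser.ts.

```typescript
// Subset of the HTML spec's SVG tag-name adjustments; the full table is longer.
const SVG_TAG_NAMES = new Map<string, string>([
  ["clippath", "clipPath"],
  ["foreignobject", "foreignObject"],
  ["lineargradient", "linearGradient"],
  ["radialgradient", "radialGradient"],
  ["textpath", "textPath"],
]);

/**
 * Restores spec casing for a lowercased tag name, but only while the parser
 * is inside SVG content. Names without an entry are returned unchanged.
 */
function adjustSvgTagName(lowercasedName: string, inSvg: boolean): string {
  if (!inSvg) return lowercasedName;
  return SVG_TAG_NAMES.get(lowercasedName) ?? lowercasedName;
}
```

The `inSvg` flag stands in for whatever foreign-context tracking the parser uses; outside SVG content the name is left lowercased.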
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| src/Tokenizer.ts | Implements spec-aligned tokenizer behavior for raw-text/RCDATA/plaintext, bogus comments/doctype matching, and EOF handling. |
| src/Parser.ts | Adds foreign-context tracking, SVG casing adjustments, and HTML implicit-close behaviors (headings, <a>, nested <form>). |
| src/Tokenizer.spec.ts | Adds tokenizer test coverage for new edge cases (bogus comments, self-closing text-only tags, etc.). |
| src/Parser.events.spec.ts | Adds parser event tests for foreign content, EOF bogus comments, entities in <textarea>, plaintext, etc. |
| src/Parser.spec.ts | Adds unit tests for declaration casing/preservation behavior. |
| src/index.spec.ts | Adds a snapshot test for parsing CDATA inside SVG/foreign content. |
| src/snapshots/* | Updates snapshots to reflect the new spec-aligned behavior. |
Lots of changes to align us better with the HTML spec:
- <iframe>, <noembed>, <noframes>, and <plaintext> now behave like <script> and <style>; their content is treated as raw text, not parsed as HTML
- <textarea> now decodes entities like <title> already did
- <?…>, <!…> (non-DOCTYPE), and </ …> in HTML mode now emit comments instead of being silently dropped or emitted as text
- <!-->, <!--->, <!->, and <!> are now parsed as valid (empty) comments per the spec
- <!DOCTYPE, <?…, and <![CDATA[… at end-of-input now emit the correct token type instead of raw text
- <![CD> emits a bogus comment; unclosed <![CDATA[… at EOF emits [CDATA[… as a comment (without a spurious ]] suffix)
- </>: Empty close tags are now silently ignored in HTML mode instead of emitting </> as text
- <!DOCTYPEhtml> is now recognized as a DOCTYPE, not a bogus comment
- Tag names inside <svg> are case-adjusted per the spec (e.g. foreignObject, clipPath); CDATA sections are treated as text; special-tag detection is disabled inside foreign content
- <image> alias: <image> is rewritten to <img> outside foreign content, matching browser behavior
- <h1>–<h6> now close an open heading; <a> closes a previous <a>; nested <form> is ignored when one is already open
- <script/>, <style/>, etc. now correctly enter their raw-text state in HTML mode (the / is ignored per spec) unless recognizeSelfClosing is enabled
- </svg> or </math> no longer corrupt the parser's context tracking; implicitly closed foreign elements are properly cleaned up

Helped by a number of AI Agents (Opus 4.6, GPT-5.4, Devin Reviews).
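The implicit-close rules for headings, <a>, and nested <form> can be sketched roughly as follows. This is a simplified model over a stack of open tag names, not the parser code from this PR: `openTag` and `HEADINGS` are hypothetical names, the heading rule is assumed to apply only when the innermost open element is a heading, and the <a> rule simply closes everything up to the previous <a> instead of reproducing the spec's full algorithm.

```typescript
const HEADINGS = new Set(["h1", "h2", "h3", "h4", "h5", "h6"]);

/**
 * Handles an open tag against a stack of currently open tag names.
 * Returns the names that were implicitly closed (innermost first), or
 * null if the tag is ignored. Mutates `stack` in place.
 */
function openTag(stack: string[], name: string): string[] | null {
  // A nested <form> is ignored while another <form> is open.
  if (name === "form" && stack.includes("form")) return null;

  const closed: string[] = [];
  const top = stack[stack.length - 1];
  if (HEADINGS.has(name) && top !== undefined && HEADINGS.has(top)) {
    // A new heading closes the heading that is currently innermost.
    closed.push(stack.pop() as string);
  } else if (name === "a") {
    // A new <a> closes the previous <a> and anything opened inside it.
    const previous = stack.lastIndexOf("a");
    if (previous !== -1) {
      while (stack.length > previous) closed.push(stack.pop() as string);
    }
  }
  stack.push(name);
  return closed;
}
```

With a stack of `["div", "h1"]`, opening `h2` closes `h1` and leaves `["div", "h2"]`; with `["form", "div"]`, opening another `form` returns null and leaves the stack untouched.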