Fix HtmlToDjot reverse-conversion round-trip fidelity#208
Merged
Conversation
The default DjotConverter output (no round-trip annotations) did not survive HtmlToDjot back-conversion in several cases, changing the rendered HTML after a Djot -> HTML -> Djot -> HTML cycle. This is the path WYSIWYG and HTML-serializing tools rely on. Fixes: - Bracket-in-label corruption: processLink() blanket-escaped every `]` in already-built child markup, mangling structural brackets from a nested image or link (`[](href)` became `[]...`). `]` is now escaped at the text-node level in escapeDjotText(), where it is known to be literal, so structural brackets emitted by child elements stay intact. This also fixes a literal `]` inside span and semantic-span labels. - Footnote reference: the default footnote reference `<a href="#fnN" role="doc-noteref">` was unrecognized and produced invalid Djot; it now round-trips to a footnote marker. Annotated inline footnotes are unaffected. - Loose lists: a loose list (each item wrapped in `<p>`) collapsed into a tight list; items are now separated by a blank line to stay loose. - Explicit heading and section ids: a section with a non-derivable id dropped it; explicit ids are now preserved, while auto-derived slugs (including the `s-N` fallback for unsluggable headings) stay implicit so they regenerate identically. Adds unit tests for each case plus an HTML-stability round-trip suite for the default (un-annotated) converter path.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## master #208 +/- ##
=========================================
Coverage 91.87% 91.87%
- Complexity 3470 3489 +19
=========================================
Files 104 104
Lines 9829 9860 +31
=========================================
+ Hits 9030 9059 +29
- Misses 799 801 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem
The default
DjotConverteroutput (without round-trip annotations) did not surviveHtmlToDjotback-conversion in several cases, so aDjot -> HTML -> Djot -> HTMLcycle changed the rendered HTML. This is exactly the path WYSIWYG and HTML-serializing tools rely on, where HTML stability across the round-trip matters.Found via a round-trip drift harness over the core element set plus element combinations.
Fixes
Bracket-in-label corruption.
processLink()blanket-escaped every]in the already-built child markup, mangling structural brackets produced by a nested image or link.[](https://x.com)became[](https://x.com)(invalid).]is now escaped at the text-node level inescapeDjotText(), where it is known to be literal, so structural]emitted by child element processors stays intact. A literal]inside a link/span label is still neutralized. This also fixes literal]inside span and semantic-span labels, which had the same latent issue.Footnote reference. The default footnote reference
<a href="#fnN" role="doc-noteref">was unrecognized and produced invalid Djot ([^1^](#fn1){#fnref1}). It now round-trips to a footnote marker, pairing with the definition already collected byprocessEndnotesSection(). Annotated inline footnotes (which also carryrole="doc-noteref") are explicitly excluded and unaffected.Loose lists. A loose list (each item wrapped in
<p>) collapsed into a tight list on the way back. Items are now separated by a blank line when the list is loose, so the next parse stays loose. Tight lists are unchanged.Explicit heading and section ids. A
<section id="...">whose id is not derivable from the heading text dropped the id, so the next render generated a different one. Explicit ids are now preserved, while auto-derived slugs stay implicit so they regenerate identically. The unsluggable-heading case (where Djot falls back to ans-Nid) is handled: ans-Nid is treated as auto, any other id as explicit. Auto-slug detection reusesHeadingIdTracker::normalizeId()so it cannot drift from the renderer (transliteration included).Not addressed (documented as known limitations)
These remain "drift" in the harness but are not reverse-converter bugs:
=html) round-trips to a paragraph - the reverse side cannot know the original was a raw block without annotation.<b>maps to strong markup - semantically equivalent.Notes
Unit tests cover each fix with exact reverse output, plus a new HTML-stability data-provider suite for the default (un-annotated) converter path. Full suite green (2485 tests), phpstan and phpcs clean.