
❌ This issue is not open for contribution. Visit Contributing guidelines to learn about the contributing process and how to find suitable issues.

Overview
`utils/pasteTransform.js` covers the common Microsoft Office cases but leaves fidelity gaps when pasting from Word, Excel, SharePoint, and Google Docs. Most visibly, Word multi-level lists collapse to flat paragraphs. This task closes those gaps in the existing sanitizer.
Complexity: Medium
Target branch: hotfixes
Context
We evaluated existing TipTap/ProseMirror plugins for source-specific paste cleanup and chose to keep the homegrown sanitizer: `@tiptap-pro/extension-paste-handler` (the only feature-complete option) is paid; `@intevation/tiptap-extension-office-paste` is narrower than ours and has open list-in-table bugs; `wordsoap` is 7+ years dormant; CKEditor 5's filter is GPL and tightly coupled to its model layer.
The Change
Extend the homegrown sanitizer to close known fidelity gaps. Each is a distinct branch in the sanitizer and can be implemented and reviewed independently:
- Word multi-level lists collapse to flat paragraphs. Word emits `<p class="MsoListParagraph">` with the indent level encoded in `mso-list:l0 level2 lfo1` style hints. We strip both, so the structure disappears and the list becomes a sequence of indistinguishable paragraphs.
- Excel/SharePoint class-driven styles disappear. Excel emits `class="xl63"` referring to definitions in a pasted `<style>` block we ignore. Bold/colored cells paste as plain text.
- Google Docs paragraph fragmentation. GDocs wraps every visual line in `<p style="margin:0">`. Single logical paragraphs explode into many.
- Google Docs `<hr>` pastes as a literal dashed line of text. GDocs encodes the rule as a styled paragraph of dashes.
- ARIA-encoded headings paste as plain paragraphs. SharePoint / Word Online emit `<p role="heading" aria-level="2">` instead of `<h2>`, losing document structure.
- Word bookmark anchors survive as empty/dead anchors. `<a name="_Toc...">`, `_Ref`, `_GoBack`, footnote/endnote markers leave behind dead hyperlinks.
- `<ol list-style-type>` choice is lost. Word "i, ii, iii" lists paste as default "1, 2, 3".
- Table-cell mark normalization. Bold/italic inside `<td>` survives in raw HTML but gets stripped on first re-serialize because ProseMirror doesn't find the mark at the schema-valid position.
Acceptance Criteria pairs each gap with a checkbox for tracking.
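As a rough illustration of the Word multi-level list gap, the sketch below reads the depth from an `mso-list` style value and rebuilds nesting from a flat run of items. `parseMsoListLevel` and `nestListItems` are hypothetical names, not functions in `utils/pasteTransform.js`. The sketch assumes each item's text is already sanitized HTML and uses a single list tag throughout; choosing `<ol>` versus `<ul>` per level is left out.

```javascript
// Hypothetical helper: reads N from "mso-list:lX levelN lfoY"; null if absent.
function parseMsoListLevel(style) {
  const match = /mso-list:\s*l\d+\s+level(\d+)\s+lfo\d+/i.exec(style || '');
  return match ? Number(match[1]) : null;
}

// Hypothetical helper: turns a flat run of { level, text } items into nested
// list markup. Assumes `text` is already-sanitized HTML and that one list tag
// is used throughout (a real implementation would pick <ol>/<ul> per level).
function nestListItems(items, tag = 'ul') {
  let html = '';
  let depth = 0; // number of lists currently open
  for (const { level, text } of items) {
    const target = Math.max(1, level);
    if (target > depth) {
      // Going deeper: open lists inside the current <li>, adding a
      // placeholder <li> for any level that was skipped.
      while (depth < target) {
        html += `<${tag}>`;
        depth += 1;
        if (depth < target) html += '<li>';
      }
    } else {
      // Same level or shallower: close deeper lists, then the previous item.
      while (depth > target) {
        html += `</li></${tag}>`;
        depth -= 1;
      }
      html += '</li>';
    }
    html += `<li>${text}`;
  }
  // Close whatever is still open.
  while (depth > 0) {
    html += `</li></${tag}>`;
    depth -= 1;
  }
  return html;
}
```

For items at levels 1, 2, 2, 1 with texts A, B, C, D this yields `<ul><li>A<ul><li>B</li><li>C</li></ul></li><li>D</li></ul>`. The level has to be read before any `mso-*` style stripping, since the strip removes the token the parser depends on.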
How to Get There
Each gap is reproducible against the TipTap editor in any Studio rich-text field (e.g. question/answer/hint editors in the exercise authoring flow):
- Open a source document in Word / Excel / SharePoint / Google Docs containing the relevant construct (a multi-level list, a styled cell, a multi-line paragraph, etc.).
- Select the content and copy.
- Paste into a TipTap editor field in Studio.
- Observe the symptom described in The Change.
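While following these steps it can help to see exactly what the source application put on the clipboard, both to confirm the symptom and to save the markup for later comparison. A minimal sketch, assuming a browser `paste` event: `getPastedHtml` is a hypothetical helper name, while `clipboardData.getData('text/html')` is the standard clipboard call.

```javascript
// Hypothetical helper: returns the raw text/html payload of a paste event,
// or an empty string when the event carries no clipboard data.
function getPastedHtml(event) {
  const data = event && event.clipboardData;
  return data ? data.getData('text/html') : '';
}

// In the browser console (not executed here), log every paste before the
// editor handles it by listening in the capture phase:
// document.addEventListener('paste', e => console.log(getPastedHtml(e)), true);
```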
Out of Scope
- Adopting a third-party paste-sanitization plugin (see Context).
- Image handling on paste — covered by the separate strip-images work and its follow-ups.
- Any user-visible string for sanitization actions (string freeze).
- Outlook-specific paste patterns (quoted-reply chains, signature wrappers, `cid:` images); separate issue if a real case surfaces.
AI usage
I used Claude (Opus 4.7) for research and drafting:
- Surveying existing TipTap and ProseMirror paste sanitization plugins to confirm we should keep the homegrown sanitizer rather than adopt a dependency.
- Gap analysis comparing TipTap Pro's documented transformations against the current sanitizer to identify what's missing.
- Sourcing the per-mechanism authoritative references linked in each AC (CKEditor 5 source, Microsoft Q&A, ProseMirror forum threads).
- Drafting the issue section by section.
I reviewed each section and each linked reference before submitting. Every AC except "Google Docs <hr> detection" is anchored to a production OSS implementation or official documentation; that one is explicitly flagged as empirical so the implementer captures a real paste to verify.
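Acceptance Criteria
General
- [ ] Word multi-level lists: convert `<p class="MsoListParagraph">` into nested `<ol>`/`<ul>`.
  - Depth comes from the `level\d+` token in `mso-list:l\d+ level\d+ lfo\d+`.
  - An `mso-list:Ignore` glyph child indicates a bullet; otherwise the list is ordered.
  - Runs before the `mso-*` style strip (which would otherwise drop the indent token).
- [ ] Excel/SharePoint class-driven styles: inline `<style>`-block rules onto matching elements, then drop the blocks.
  - Covers class rules (e.g. `.xl63 {...}`).
- [ ] Google Docs paragraph fragmentation: merge consecutive `<p>` with `margin:0` under the same parent into one `<p>`, joined by `<br>`.
  - Matches `margin:0` or paired `margin-top:0` + `margin-bottom:0`.
  - Handles the `<b id="docs-internal-guid-...">` wrapper when present.
- [ ] Google Docs `<hr>` detection (empirical: no authoritative spec; verify by capturing a real paste). Replace `<p>` whose stripped text matches `/^[-—–]{3,}$/` with `<hr>`.
- [ ] ARIA-encoded headings: replace `<p role="heading" aria-level="N">` (and `<div role="heading">`) with `<hN>`.
- [ ] Word bookmark anchors: strip `<a>` whose `name`/`id` matches `/^_(Toc|Ref|Hlt|GoBack|ftn|edn)\w*/i`.
- [ ] `<ol>` numbering style: preserve `list-style-type` on `<ol>` through the `mso-*` strip.
  - Values: `lower-roman`, `upper-roman`, `lower-alpha`, `upper-alpha`, `decimal`.
- [ ] Table-cell mark normalization inside `<td>`/`<th>`: `<span style="font-weight:bold">` → `<strong>`, `font-style:italic` → `<em>`, `text-decoration:underline` → `<u>`.
  - Wrap cell content in `<p>` so marks attach at a valid position.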
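Several of the criteria above reduce to small string checks. The sketch below restates them as standalone helpers. The function names are hypothetical, the two regexes are copied from the criteria, and the `<style>` parsing assumes simple `.class {...}` rules with no nesting, which may not hold for every real paste.

```javascript
// Google Docs <hr> (empirical): a paragraph whose trimmed text is only dashes.
function isHorizontalRuleText(text) {
  return /^[-—–]{3,}$/.test(text.trim());
}

// ARIA heading: map role + aria-level to a tag name, or null when the element
// is not a heading. Assumption: levels outside 1-6 are left untouched.
function headingTagFor(role, ariaLevel) {
  if (role !== 'heading') return null;
  const level = Number.parseInt(ariaLevel, 10);
  return level >= 1 && level <= 6 ? `h${level}` : null;
}

// Word bookmark anchors: true when a name/id looks like _Toc..., _Ref...,
// _Hlt..., _GoBack, _ftn... or _edn...
function isWordBookmarkAnchor(nameOrId) {
  return /^_(Toc|Ref|Hlt|GoBack|ftn|edn)\w*/i.test(nameOrId || '');
}

// Excel/SharePoint: collect ".class { declarations }" rules from the text of
// a pasted <style> block into { className: 'declarations' }.
function parseClassRules(css) {
  const rules = {};
  const pattern = /\.([\w-]+)\s*\{([^}]*)\}/g;
  let match;
  while ((match = pattern.exec(css)) !== null) {
    // Collapse the tabs and newlines Excel puts between declarations.
    rules[match[1]] = match[2].replace(/\s+/g, ' ').trim();
  }
  return rules;
}
```

The declarations returned by `parseClassRules` could then be appended to the `style` attribute of each element carrying the matching class before the `<style>` block is dropped.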
Testing
- [ ] Each source above has a test in `__tests__/pasteTransform.spec.js`, using an HTML fixture captured from a real paste of that source.
References
Per-mechanism source links are inline in the corresponding Acceptance Criteria above.