Summary
When converting a DOCX file in which a space character is formatted differently from the adjacent text, the space is lost in the output. markitdown first converts the DOCX to HTML with mammoth, then converts that HTML to Markdown with markdownify, and markdownify strips whitespace-only content from inline formatting tags.
Example:
- DOCX content:
"further" [normal] + " " [bold] + "reference" [normal]
- HTML from mammoth:
further<strong> </strong>reference
- Current output:
furtherreference
- Expected output:
further reference
Root Cause
This is an upstream issue in markdownify, not markitdown itself. When chomp() encounters whitespace-only text inside inline tags like <strong>, <b>, <em>, or <i>, it strips the content entirely.
Upstream issue: matthewwithanm/python-markdownify#155
Fix PR: matthewwithanm/python-markdownify#253
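The stripping behavior can be illustrated with a small standalone sketch. This is a simplified paraphrase written for this report, not markdownify's actual source: chomp() here only approximates the library function of the same name, and convert_inline is a hypothetical stand-in for converters such as convert_strong.

```python
def chomp(text):
    """Split off one leading/trailing space, then strip the rest (simplified)."""
    prefix = " " if text and text[0] == " " else ""
    suffix = " " if text and text[-1] == " " else ""
    return prefix, suffix, text.strip()


def convert_inline(text, markup="**"):
    """Hypothetical stand-in for an inline converter such as convert_strong."""
    prefix, suffix, inner = chomp(text)
    if not inner:
        # Whitespace-only content: nothing is emitted, so the space disappears.
        return ""
    return f"{prefix}{markup}{inner}{markup}{suffix}"


# Mirrors the HTML from the example: further<strong> </strong>reference
pieces = ["further", convert_inline(" "), "reference"]
print("".join(pieces))

# Non-empty content keeps its surrounding spaces outside the markup.
print(repr(convert_inline(" bold ")))
```

Under these assumptions the first print shows the two words joined with no space between them, which matches the current output reported above.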
Workaround
Until the markdownify fix is released, users can apply a monkey-patch before importing markitdown:
    import markdownify

    # Keep references to the original converters so the wrappers can delegate to them.
    _original_convert_strong = markdownify.MarkdownConverter.convert_strong
    _original_convert_b = markdownify.MarkdownConverter.convert_b
    _original_convert_em = markdownify.MarkdownConverter.convert_em
    _original_convert_i = markdownify.MarkdownConverter.convert_i

    def _make_whitespace_preserving_converter(original_method):
        def wrapper(self, el, text, *args, **kwargs):
            # Whitespace-only content: return it unchanged instead of letting it be dropped.
            if text and not text.strip():
                return text
            return original_method(self, el, text, *args, **kwargs)
        return wrapper

    markdownify.MarkdownConverter.convert_strong = _make_whitespace_preserving_converter(_original_convert_strong)
    markdownify.MarkdownConverter.convert_b = _make_whitespace_preserving_converter(_original_convert_b)
    markdownify.MarkdownConverter.convert_em = _make_whitespace_preserving_converter(_original_convert_em)
    markdownify.MarkdownConverter.convert_i = _make_whitespace_preserving_converter(_original_convert_i)

    from markitdown import MarkItDown  # Import after patch
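The wrapper logic can be sanity-checked without markdownify installed. The sketch below uses StubConverter, a hypothetical stand-in class that only imitates the "drop whitespace-only content" behavior. It is not the real MarkdownConverter, so it shows that the wrapper does what is intended but says nothing about any particular markdownify version.

```python
class StubConverter:
    """Hypothetical stand-in for markdownify.MarkdownConverter."""

    def convert_strong(self, el, text, *args, **kwargs):
        # Imitates the reported behavior: whitespace-only content yields nothing.
        inner = text.strip()
        return f"**{inner}**" if inner else ""


def make_whitespace_preserving(original_method):
    def wrapper(self, el, text, *args, **kwargs):
        # Pass whitespace-only text through untouched; otherwise delegate.
        if text and not text.strip():
            return text
        return original_method(self, el, text, *args, **kwargs)
    return wrapper


before = StubConverter().convert_strong(None, " ")

# Patch the class attribute, as the workaround does for the real converter.
StubConverter.convert_strong = make_whitespace_preserving(StubConverter.convert_strong)

after = StubConverter().convert_strong(None, " ")
bold = StubConverter().convert_strong(None, "bold")
print(repr(before), repr(after), repr(bold))
```

Only whitespace-only input takes the new path. Any other text still goes through the original converter, so normal bold and italic output is unchanged.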
Suggested Action
Once markdownify releases the fix, consider raising markitdown's minimum required markdownify version to a release that includes it.