Add DocConverter for legacy .doc (Word 97-2003) files #1616

Open
vibeyclaw wants to merge 1 commit into microsoft:main from vibeyclaw:feature/add-doc-converter
@vibeyclaw

Summary

Adds support for converting legacy .doc (Word 97-2003 binary format) files to Markdown, addressing #23.

Approach

The new DocConverter uses system-level tools (antiword or catdoc) to extract text from legacy .doc files, consistent with how other converters in the project handle format-specific dependencies.

Why system tools?

The legacy .doc format (OLE2/CFB binary) is complex to parse purely in Python. antiword and catdoc are well-established, widely available tools that handle this reliably.
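For reviewers who want to picture the tool lookup, here is a minimal sketch of how a converter might locate one of the two tools on the PATH. It is not the code in this PR: `DOC_TOOLS` and `find_doc_tool` are hypothetical names, and the only library call used is the standard `shutil.which`.

```python
import shutil
from typing import Callable, Optional, Sequence

# Preference order from the PR notes: antiword first, catdoc as the fallback.
# DOC_TOOLS and find_doc_tool are illustrative names, not taken from the PR.
DOC_TOOLS: Sequence[str] = ("antiword", "catdoc")


def find_doc_tool(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the first available tool, or None if neither is found."""
    for name in DOC_TOOLS:
        path = which(name)
        if path:
            return path
    return None
```

Passing `which` as a parameter is only a convenience for testing the lookup on machines where neither tool is installed.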

Changes

  • converters/_doc_converter.py — new DocConverter class
  • converters/__init__.py — export DocConverter
  • _markitdown.py — import and register DocConverter

Usage

Install a system tool first:

# macOS
brew install antiword

# Ubuntu/Debian
apt install antiword

Then use normally:

from markitdown import MarkItDown
md = MarkItDown()
result = md.convert("document.doc")
print(result.text_content)

If neither tool is installed, a MissingDependencyException is raised with clear installation instructions.
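The failure path could look roughly like the sketch below. It assumes a stand-in exception class, `MissingDependencyError`, so that the snippet runs without markitdown installed; the real converter raises the project's own MissingDependencyException, and the message wording here is an assumption based on the install commands above.

```python
import shutil


class MissingDependencyError(Exception):
    """Hypothetical stand-in for markitdown's MissingDependencyException."""


# Assumed message text; the PR only says the instructions are "clear".
INSTALL_HINT = (
    "Converting .doc files requires antiword or catdoc.\n"
    "  macOS:         brew install antiword\n"
    "  Ubuntu/Debian: apt install antiword"
)


def require_doc_tool(which=shutil.which) -> str:
    """Return the path of an available tool, or raise with install instructions."""
    for name in ("antiword", "catdoc"):
        path = which(name)
        if path:
            return path
    raise MissingDependencyError(INSTALL_HINT)
```

Raising at conversion time rather than at import time would keep the rest of the library usable on systems without either tool, which is presumably the intent here.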

Notes

  • antiword is preferred (tried first); catdoc is the fallback
  • A temp file is used since both tools require a file path rather than stdin
  • No new Python dependencies required
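The temp-file step in these notes can be sketched as follows. This is an illustration under stated assumptions, not the PR's implementation: `extract_doc_text` is a hypothetical helper, and `command` is expected to be something like `["antiword"]` or `["catdoc"]`, both of which print the extracted text to stdout when given a file path.

```python
import os
import subprocess
import tempfile
from typing import BinaryIO, Sequence


def extract_doc_text(file_stream: BinaryIO, command: Sequence[str]) -> str:
    """Write the stream to a temp .doc file, run `command <path>`, return stdout."""
    # delete=False lets us close the file before the external tool opens it,
    # which matters on platforms that lock open files.
    tmp = tempfile.NamedTemporaryFile(suffix=".doc", delete=False)
    try:
        tmp.write(file_stream.read())
        tmp.close()
        result = subprocess.run(
            [*command, tmp.name],
            capture_output=True,
            text=True,
            check=True,  # raise CalledProcessError on a non-zero exit status
        )
        return result.stdout
    finally:
        tmp.close()  # no-op if already closed
        os.unlink(tmp.name)  # clean up even when the tool fails
```

A call might look like `extract_doc_text(open("document.doc", "rb"), ["antiword"])`, assuming antiword is installed.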

Closes microsoft#23

Adds support for converting legacy .doc files to Markdown text using
system tools (antiword or catdoc). The converter:

- Accepts .doc extension and application/msword MIME type
- Uses antiword or catdoc (whichever is available on the system)
- Raises MissingDependencyException with clear install instructions if
  neither tool is found
- Writes to a temp file for tool compatibility, cleans up after

Install dependencies:
  macOS:         brew install antiword
  Ubuntu/Debian: apt install antiword
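
The acceptance rule described in the commit message (the .doc extension or the application/msword MIME type) might be expressed along these lines. The function and constant names are hypothetical, and the actual converter interface in markitdown may take different arguments.

```python
from typing import Optional

# Values from the commit message; the names themselves are illustrative.
ACCEPTED_EXTENSIONS = (".doc",)
ACCEPTED_MIME_PREFIXES = ("application/msword",)


def accepts_doc(extension: Optional[str], mimetype: Optional[str]) -> bool:
    """Return True when the extension or MIME type indicates a legacy .doc file."""
    ext = (extension or "").lower()
    mime = (mimetype or "").lower()
    if ext in ACCEPTED_EXTENSIONS:
        return True
    return any(mime.startswith(prefix) for prefix in ACCEPTED_MIME_PREFIXES)
```

Matching on the exact `.doc` extension keeps `.docx` files out of this converter's path.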
@vibeyclaw
Author

@microsoft-github-policy-service agree
