Bug: HTML to PDF conversion fails on non-UTF-8 encoded files #631

@Clawiee

Bug Description

When converting HTML to PDF with the convert_html_to_pdf tool, the conversion fails with a UTF-8 decoding error if the HTML file is not UTF-8 encoded.

Error Message

❌ Conversion failed: 'utf-8' codec can't decode byte 0x9a in position 0: invalid start byte

Current Behavior

  • Tool assumes all HTML files are UTF-8 encoded
  • No encoding detection or fallback
  • Fails with cryptic error message
  • File may be deleted after failed conversion (unclear if intentional)

Expected Behavior

  • Auto-detect encoding: Try UTF-8, then fall back to other common encodings (GBK, Latin-1, etc.)
  • Better error message: Tell user what went wrong and suggest solutions
  • Graceful handling: Don't delete the source file on conversion failure
  • Optional encoding parameter: Allow user to specify encoding if auto-detect fails
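A minimal sketch of how the optional encoding parameter and the clearer error message could fit together. The names `decode_html`, `read_html`, `HtmlDecodeError`, and `FALLBACK_ENCODINGS` are hypothetical and not part of the existing tool; the fallback list is only illustrative.

```python
from pathlib import Path

# Illustrative fallback order; not the tool's actual configuration.
FALLBACK_ENCODINGS = ("utf-8", "gbk", "cp1252")


class HtmlDecodeError(ValueError):
    """Hypothetical error type that carries a user-facing message."""


def decode_html(data, encoding=None):
    """Decode raw HTML bytes and return (text, encoding_used).

    If `encoding` is given, only that encoding is tried; otherwise the
    fallback list is tried in order.
    """
    candidates = (encoding,) if encoding else FALLBACK_ENCODINGS
    for candidate in candidates:
        try:
            return data.decode(candidate), candidate
        except UnicodeDecodeError:
            continue
    raise HtmlDecodeError(
        f"Could not decode the HTML file as any of {list(candidates)}. "
        "Re-save the file as UTF-8 or pass its encoding explicitly."
    )


def read_html(path, encoding=None):
    # Only reads bytes; the source file is never modified or removed here.
    return decode_html(Path(path).read_bytes(), encoding)
```

Returning the encoding that was actually used lets the caller report it, which helps when auto-detection picks the wrong one.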

Reproduction Steps

  1. Create an HTML file with non-UTF-8 encoding (e.g., GBK, Windows-1252)
  2. Use the convert_html_to_pdf tool to convert it
  3. Observe: Conversion fails with UTF-8 decoding error
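The steps above can be approximated without the tool itself. The sketch below writes a small GBK-encoded HTML file and then reads it the way a UTF-8-only reader would; `read_assuming_utf8` is an assumption about the tool's behavior, not its actual code, and the sample content is made up.

```python
import os
import tempfile


def make_gbk_html(directory):
    """Write a small GBK-encoded HTML file and return its path."""
    path = os.path.join(directory, "sample_gbk.html")
    with open(path, "w", encoding="gbk") as f:
        f.write("<html><body><p>你好，世界</p></body></html>")
    return path


def read_assuming_utf8(path):
    """Mimic a reader that hard-codes UTF-8 (assumed, not the tool's real code)."""
    with open(path, "r", encoding="utf-8") as f:
        return f.read()


def reproduce():
    """Return the decode error message, or None if no error occurred."""
    with tempfile.TemporaryDirectory() as tmp:
        path = make_gbk_html(tmp)
        try:
            read_assuming_utf8(path)
        except UnicodeDecodeError as exc:
            return str(exc)
    return None


print(reproduce())
```

The printed message should have the same shape as the error reported above, though the offending byte and position depend on the file's content.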

Suggested Fix

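# Sketch of encoding detection; the candidate list is a suggestion
def read_html_with_encoding_detection(file_path):
    # Order matters: 'latin-1' maps every byte to a character and never raises,
    # so it must come last or the encodings listed after it are never tried.
    encodings_to_try = ['utf-8', 'gbk', 'big5', 'cp1252', 'latin-1']
    for encoding in encodings_to_try:
        try:
            with open(file_path, 'r', encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    # Unreachable while 'latin-1' is in the list; kept as a guard if it is removed
    raise ValueError(f"Unable to decode file with common encodings: {encodings_to_try}")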
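A trial-and-error chain can decode a file "successfully" with the wrong encoding, since several legacy encodings accept overlapping byte ranges. One possible complement, sketched here under assumptions, is to honor the charset the HTML declares in a `<meta>` tag before falling back. The function names, the 1024-byte scan window, and the fallback list are all illustrative choices, not part of the tool.

```python
import re

# Matches charset=... in <meta charset="gbk"> as well as in
# <meta http-equiv="Content-Type" content="text/html; charset=gbk">.
_CHARSET_RE = re.compile(rb"charset\s*=\s*[\"']?\s*([A-Za-z0-9_.:-]+)", re.IGNORECASE)


def declared_charset(data, scan_bytes=1024):
    """Return the charset named in the first scan_bytes of raw HTML, or None."""
    match = _CHARSET_RE.search(data[:scan_bytes])
    return match.group(1).decode("ascii").lower() if match else None


def decode_with_declared_charset(data, fallbacks=("utf-8", "gbk", "cp1252")):
    """Try the declared charset first, then the fallback encodings in order."""
    declared = declared_charset(data)
    candidates = ([declared] if declared else []) + list(fallbacks)
    for candidate in candidates:
        try:
            return data.decode(candidate)
        except (UnicodeDecodeError, LookupError):
            # LookupError covers a declared charset name Python does not know.
            continue
    raise ValueError(f"Unable to decode HTML with encodings: {candidates}")
```

A declared charset can itself be wrong, so the fallback list is still needed; this only changes which encoding is tried first.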

Priority

🔴 High - affects core file conversion functionality, especially for Chinese users (GBK encoding)

Related

  • May affect other file reading tools that assume UTF-8
  • Should be consistent with read_document tool's encoding handling
