pymupdf · jamie-lemon · Mar 23, 2026 · Mar 20, 2026 · Mar 20, 2026 · Mar 20, 2026
diff --git a/changes.txt b/changes.txt
@@ -7,7 +7,7 @@ Change Log
 * Fixed issues:
 
   * **Fixed** `4902 <https://github.com/pymupdf/PyMuPDF/issues/4902>`_: Incorrect linewidth in elements returned by Page.get_texttrace()
-  * **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: `"Page" has no attribute "find_tables" in PyMuPDF 1.27
+  * **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: "Page" has no attribute "find_tables" in PyMuPDF 1.27
 
 * Other:
 
@@ -20,12 +20,12 @@ Change Log
 
 * Fixed issues:
 
-  * **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of *_forward_decl
+  * **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of `*_forward_decl`
 
 * Other:
 
   * Retrospectively marked #4907 as fixed in pymupdf-1.27.1.
-  * Improved get_textpage_ocr().
+  * Improved `get_textpage_ocr()`.
 
     For partial OCR, **all** page areas outside legible text are now OCRed, not
     just those within images. This means that OCR will now also be performed

diff --git a/docs/installation.rst b/docs/installation.rst
@@ -303,12 +303,11 @@ See :doc:`pyodide`.
 Enabling Integrated OCR Support
 ---------------------------------------------------------
 
-If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.**
-
-PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.
+PyMuPDF already contains all the logic needed for its OCR functions, but it additionally requires `Tesseract's language support data <https://github.com/tesseract-ocr/tessdata>`_.
 
 If not specified explicitly, PyMuPDF will attempt to find the installed
-Tesseract's tessdata, but this should probably not be relied upon.
+Tesseract's `tessdata`, but this should probably not be relied upon.
+
 
 Otherwise PyMuPDF requires that Tesseract's language support folder is
 specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or
@@ -333,6 +332,13 @@ So for a working OCR functionality, make sure to complete this checklist:
 
 .. note::
 
+    English language support is included by default in a Tesseract installation.
+
+    :ref:`Tesseract Language Packs <tesseract-language-packs>` for other languages must be installed separately, and the `tessdata` folder must be specified to PyMuPDF as described above, for OCR to work with those languages.
+
+
+.. note::
+
   Find out more on the `official documentation for installing Tesseract website <https://tesseract-ocr.github.io/tessdoc/Installation.html>`_.
 
 .. include:: footer.rst
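
A minimal sketch of the "specify `tessdata` explicitly" advice in the hunk above, offered as an illustration rather than as PyMuPDF's own lookup logic. The helper names `find_tessdata` and `ocr_first_page` are hypothetical, and the candidate folders are whatever the caller supplies. The only PyMuPDF calls used are `pymupdf.open()`, `Page.get_textpage_ocr()` with its `language`, `tessdata` and `full` arguments, and `Page.get_text(textpage=...)`.

```python
from pathlib import Path
from typing import Iterable, Optional


def find_tessdata(candidates: Iterable[str]) -> Optional[str]:
    """Return the first candidate folder holding at least one .traineddata file.

    Hypothetical helper: the candidate list is an assumption supplied by the
    caller, not something PyMuPDF provides.
    """
    for candidate in candidates:
        folder = Path(candidate)
        if folder.is_dir() and any(folder.glob("*.traineddata")):
            return str(folder)
    return None


def ocr_first_page(pdf_path: str, tessdata: str, language: str = "eng") -> str:
    """OCR the first page of a PDF, passing the tessdata folder explicitly."""
    import pymupdf  # imported lazily so find_tessdata() works without PyMuPDF

    with pymupdf.open(pdf_path) as doc:
        page = doc[0]
        # full=True asks for OCR of the whole page rather than a partial OCR.
        textpage = page.get_textpage_ocr(
            language=language, tessdata=tessdata, full=True
        )
        return page.get_text(textpage=textpage)
```

Passing the folder explicitly keeps the result independent of whatever Tesseract installation PyMuPDF might otherwise discover, which is the situation the paragraph warns should not be relied upon.
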
diff --git a/docs/ocr/tesseract-language-packs.rst b/docs/ocr/tesseract-language-packs.rst
@@ -0,0 +1,251 @@
+
+.. include:: ../header.rst
+
+.. raw:: html
+
+    <script>
+        document.getElementById("headerSearchWidget").action = '../search.html';
+    </script>
+
+
+.. _tesseract-language-packs:
+
+Tesseract Language Packs
+========================
+
+.. meta::
+   :description: How to install additional Tesseract language packs on macOS, Linux, and Windows.
+
+Overview
+--------
+
+Tesseract identifies languages using three-letter `ISO 639-2 <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>`_ codes. English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR.
+
+A full list of supported language codes is available on the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.
+
+.. tip::
+
+   To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal.
+
+----
+
+Linux
+-----
+
+Language pack installation varies slightly by distribution.
+
+**Ubuntu / Debian**
+
+.. code-block:: bash
+
+   # List all available language packs
+   apt-cache search tesseract-ocr
+
+   # Install a specific language (e.g. German)
+   sudo apt install tesseract-ocr-deu
+
+   # Install all available languages at once
+   sudo apt install tesseract-ocr-all
+
+Language packages follow the naming pattern ``tesseract-ocr-<langcode>``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese.
+
+**Fedora / RHEL**
+
+.. code-block:: bash
+
+   # Search for available language packs
+   dnf search tesseract
+
+   # Install a specific language (e.g. German)
+   sudo dnf install tesseract-langpack-deu
+
+   # Install all language packs
+   sudo dnf install 'tesseract-langpack-*'
+
+On Fedora, packages are named ``tesseract-langpack-<langcode>``.
+
+**Arch Linux**
+
+.. code-block:: bash
+
+   # Search for available language packs
+   pacman -Ss tesseract-data
+
+   # Install a specific language (e.g. German)
+   sudo pacman -S tesseract-data-deu
+
+On Arch, packages are named ``tesseract-data-<langcode>``.
+
+Manual Installation (All Distros)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory:
+
+.. code-block:: bash
+
+   # Download language pack (e.g. French)
+   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
+     -o fra.traineddata
+
+   # Copy to tessdata directory (path varies by distro and Tesseract version)
+   sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
+   # or
+   sudo cp fra.traineddata /usr/share/tessdata/
+
+Common tessdata locations on Linux:
+
+.. list-table::
+   :header-rows: 1
+   :widths: 40 60
+
+   * - Distribution
+     - Path
+   * - Ubuntu / Debian
+     - ``/usr/share/tesseract-ocr/4.00/tessdata/`` (Tesseract 4; ``/usr/share/tesseract-ocr/5/tessdata/`` with Tesseract 5)
+   * - Fedora / RHEL
+     - ``/usr/share/tesseract/tessdata/``
+   * - Arch Linux
+     - ``/usr/share/tessdata/``
+
+----
+
+Windows
+-------
+
+During Installation (Recommended)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The Tesseract Windows installer from `UB Mannheim <https://github.com/UB-Mannheim/tesseract/wiki>`_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need.
+
+After Installation (Manual)
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If Tesseract is already installed, download language packs manually:
+
+1. Go to `github.com/tesseract-ocr/tessdata <https://github.com/tesseract-ocr/tessdata>`_
+2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French)
+3. Copy the file into your Tesseract ``tessdata`` folder, typically:
+
+.. code-block:: text
+
+   C:\Program Files\Tesseract-OCR\tessdata\
+
+.. note::
+
+   The Chocolatey (``choco install tesseract``) package only includes English. All additional languages must be added manually using the steps above.
+
+Verify the Install
+~~~~~~~~~~~~~~~~~~
+
+Open Command Prompt or PowerShell and run:
+
+.. code-block:: powershell
+
+   tesseract --list-langs
+
+Your newly installed language should appear in the output.
+
+----
+
+macOS
+-----
+
+The recommended approach on macOS is `Homebrew <https://brew.sh>`_. There are two options depending on how much disk space you want to use.
+
+Install All Languages at Once
+~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+The ``tesseract-lang`` formula bundles Tesseract with every available language pack:
+
+.. code-block:: bash
+
+   brew install tesseract-lang
+
+Install Specific Languages
+~~~~~~~~~~~~~~~~~~~~~~~~~~
+
+If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need:
+
+.. code-block:: bash
+
+   # Install Tesseract engine only
+   brew install tesseract
+
+   # Find the tessdata directory
+   brew info tesseract
+   # Look for a line like: /opt/homebrew/share/tessdata
+
+   # Download a specific language pack (e.g. French)
+   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
+     -o /opt/homebrew/share/tessdata/fra.traineddata
+
+Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine.
+
+.. note::
+
+   If you installed Tesseract via MacPorts instead of Homebrew, use ``sudo port install tesseract-<langcode>``, for example ``sudo port install tesseract-fra``.
+
+----
+
+Using a Language with pymupdf4llm
+----------------------------------
+
+Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter:
+
+.. code-block:: python
+
+   import pymupdf4llm
+
+   # Single language
+   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra")
+
+   # Multiple languages
+   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu")
+
+----
+
+Common Language Codes
+---------------------
+
+.. list-table::
+   :header-rows: 1
+   :widths: 50 50
+
+   * - Language
+     - Code
+   * - English
+     - ``eng``
+   * - French
+     - ``fra``
+   * - German
+     - ``deu``
+   * - Spanish
+     - ``spa``
+   * - Italian
+     - ``ita``
+   * - Portuguese
+     - ``por``
+   * - Simplified Chinese
+     - ``chi_sim``
+   * - Traditional Chinese
+     - ``chi_tra``
+   * - Japanese
+     - ``jpn``
+   * - Korean
+     - ``kor``
+   * - Arabic
+     - ``ara``
+   * - Russian
+     - ``rus``
+   * - Hindi
+     - ``hin``
+
+For the full list of supported languages and their codes, see the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.
+
+
+
+.. include:: ../footer.rst
+
+
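
A small sketch that ties the manual installation steps to the `ocr_language` examples in the new page: before requesting `"eng+fra+deu"`, check that each code has a matching `<code>.traineddata` file in the tessdata folder. The function name `missing_language_packs` is hypothetical and is not part of PyMuPDF or pymupdf4llm. It assumes the `+`-separated language syntax and the file naming shown in the page, and it only inspects the one folder it is given.

```python
from pathlib import Path
from typing import List


def missing_language_packs(tessdata_dir: str, ocr_language: str) -> List[str]:
    """Return the codes in a '+'-joined spec that lack a .traineddata file.

    Hypothetical helper: tessdata_dir is whichever folder you pass to PyMuPDF;
    no other locations are searched.
    """
    folder = Path(tessdata_dir)
    # "eng+fra+deu" -> ["eng", "fra", "deu"]; empty fragments are ignored.
    codes = [code.strip() for code in ocr_language.split("+") if code.strip()]
    return [
        code for code in codes
        if not (folder / f"{code}.traineddata").is_file()
    ]
```

An empty list means every requested language has a data file in that folder. Any code that is returned still needs to be installed using one of the platform-specific methods described in the page.
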