6 changes: 3 additions & 3 deletions changes.txt
@@ -7,7 +7,7 @@ Change Log
* Fixed issues:

* **Fixed** `4902 <https://github.com/pymupdf/PyMuPDF/issues/4902>`_: Incorrect linewidth in elements returned by Page.get_texttrace()
* **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: `"Page" has no attribute "find_tables" in PyMuPDF 1.27
* **Fixed** `4932 <https://github.com/pymupdf/PyMuPDF/issues/4932>`_: "Page" has no attribute "find_tables" in PyMuPDF 1.27

* Other:

@@ -20,12 +20,12 @@ Change Log

* Fixed issues:

* **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of *_forward_decl
* **Fixed** `4903 <https://github.com/pymupdf/PyMuPDF/issues/4903>`_: Typing broken because of `*_forward_decl`

* Other:

* Retrospectively marked #4907 as fixed in pymupdf-1.27.1.
* Improved get_textpage_ocr().
* Improved `get_textpage_ocr()`.

For partial OCR, **all** page areas outside legible text are now OCRed, not
just those within images. This means that OCR will now also be performed
14 changes: 10 additions & 4 deletions docs/installation.rst
@@ -303,12 +303,11 @@ See :doc:`pyodide`.
Enabling Integrated OCR Support
---------------------------------------------------------

If you do not intend to use this feature, skip this step. Otherwise, it is required for both installation paths: **from wheels and from sources.**

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.
PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract's language support data <https://github.com/tesseract-ocr/tessdata>`_.

If not specified explicitly, PyMuPDF will attempt to find the installed
Tesseract's tessdata, but this should probably not be relied upon.
Tesseract's `tessdata`, but this should probably not be relied upon.


Otherwise PyMuPDF requires that Tesseract's language support folder is
specified explicitly either in PyMuPDF OCR functions' `tessdata` arguments or
@@ -333,6 +332,13 @@ So for a working OCR functionality, make sure to complete this checklist:

.. note::

   English language support is included by default in a Tesseract installation.

   :ref:`Tesseract Language Packs <tesseract-language-packs>` for other languages must be installed separately, and the `tessdata` folder must be specified to PyMuPDF as described above, for OCR to work with those languages.


.. note::

   Find out more in the `official Tesseract installation documentation <https://tesseract-ocr.github.io/tessdoc/Installation.html>`_.

.. include:: footer.rst
251 changes: 251 additions & 0 deletions docs/ocr/tesseract-language-packs.rst
@@ -0,0 +1,251 @@

.. include:: ../header.rst

.. _pymupdf-pro:

.. raw:: html

   <script>
   document.getElementById("headerSearchWidget").action = '../search.html';
   </script>


.. _tesseract-language-packs:

Tesseract Language Packs
========================

.. meta::
   :description: How to install additional Tesseract language packs on macOS, Linux, and Windows.

Overview
--------

Tesseract identifies languages using three-letter `ISO 639-2 <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>`_ codes. English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR.

A full list of supported language codes is available on the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.

.. tip::

   To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal.
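
The same check can be scripted. The following is a minimal sketch, assuming the ``tesseract`` executable is on your ``PATH`` and that its output consists of one header line followed by one language code per line. The helper names ``parse_list_langs`` and ``installed_languages`` are hypothetical and not part of PyMuPDF or Tesseract.

.. code-block:: python

   import subprocess


   def parse_list_langs(output: str) -> list[str]:
       """Return language codes from the text printed by ``tesseract --list-langs``."""
       lines = [line.strip() for line in output.splitlines() if line.strip()]
       # Assumption: the header line starts with "List of"; all other
       # non-empty lines are language codes.
       return [line for line in lines if not line.lower().startswith("list of")]


   def installed_languages() -> list[str]:
       """Run the Tesseract CLI and return the installed language codes."""
       result = subprocess.run(
           ["tesseract", "--list-langs"],
           capture_output=True,
           text=True,
           check=True,
       )
       # Some Tesseract versions may print the list to stderr, so read both.
       return parse_list_langs(result.stdout + "\n" + result.stderr)


   if __name__ == "__main__":
       print(installed_languages())

If Tesseract prints warnings alongside the list, they would be returned as if they were codes, so treat the result as a hint and not a guarantee.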

----

Linux
-----

Language pack installation varies slightly by distribution.

**Ubuntu / Debian**

.. code-block:: bash

   # List all available language packs
   apt-cache search tesseract-ocr

   # Install a specific language (e.g. German)
   sudo apt install tesseract-ocr-deu

   # Install all available languages at once
   sudo apt install tesseract-ocr-all

Language packages follow the naming pattern ``tesseract-ocr-<langcode>``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese.

**Fedora / RHEL**

.. code-block:: bash

   # Search for available language packs
   dnf search tesseract

   # Install a specific language (e.g. German)
   sudo dnf install tesseract-langpack-deu

   # Install all language packs (quoted so the shell does not expand the glob)
   sudo dnf install 'tesseract-langpack-*'

On Fedora, packages are named ``tesseract-langpack-<langcode>``.

**Arch Linux**

.. code-block:: bash

   # Search for available language packs
   pacman -Ss tesseract-data

   # Install a specific language (e.g. German)
   sudo pacman -S tesseract-data-deu

On Arch, packages are named ``tesseract-data-<langcode>``.
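
The three naming patterns can be captured in a small lookup. This is only a sketch: the function ``package_name`` and the distro keys are hypothetical, and it assumes that only Debian-style names replace underscores with hyphens (as in ``tesseract-ocr-chi-sim``), while Fedora and Arch keep the Tesseract code unchanged. Confirm the result with your package manager's search command before installing.

.. code-block:: python

   # Package name prefixes taken from the naming patterns described above.
   PACKAGE_PREFIXES = {
       "debian": "tesseract-ocr-",
       "fedora": "tesseract-langpack-",
       "arch": "tesseract-data-",
   }


   def package_name(distro: str, langcode: str) -> str:
       """Build the likely package name for a Tesseract language code."""
       prefix = PACKAGE_PREFIXES[distro]
       if distro == "debian":
           # Debian-style names use hyphens, e.g. chi_sim -> chi-sim.
           langcode = langcode.replace("_", "-")
       return prefix + langcode


   print(package_name("debian", "chi_sim"))
   print(package_name("fedora", "deu"))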

Manual Installation (All Distros)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory:

.. code-block:: bash

   # Download language pack (e.g. French)
   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
        -o fra.traineddata

   # Copy to tessdata directory (path varies by distro and Tesseract version)
   sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/
   # or
   sudo cp fra.traineddata /usr/share/tessdata/
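
The download step can also be done from Python with the standard library only. This is a minimal sketch that reuses the GitHub URL pattern from the ``curl`` command above. The names ``traineddata_url`` and ``download_language`` are hypothetical helpers, and the target directory is assumed to exist and be writable by the current user.

.. code-block:: python

   import shutil
   import urllib.request
   from pathlib import Path

   # Same URL pattern as in the curl command above.
   BASE_URL = "https://github.com/tesseract-ocr/tessdata/raw/main"


   def traineddata_url(langcode: str) -> str:
       """Return the download URL for one language file."""
       return f"{BASE_URL}/{langcode}.traineddata"


   def download_language(langcode: str, target_dir: str) -> Path:
       """Download one language file into target_dir and return its path."""
       target = Path(target_dir) / f"{langcode}.traineddata"
       with urllib.request.urlopen(traineddata_url(langcode)) as response:
           with open(target, "wb") as out:
               shutil.copyfileobj(response, out)
       return target

Writing into a system directory such as ``/usr/share/tessdata/`` normally needs elevated permissions. One alternative is to download into a directory you own and pass that directory to PyMuPDF through the ``tessdata`` argument.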

Common tessdata locations on Linux:

.. list-table::
   :header-rows: 1
   :widths: 40 60

   * - Distribution
     - Path
   * - Ubuntu / Debian
     - ``/usr/share/tesseract-ocr/4.00/tessdata/``
   * - Fedora / RHEL
     - ``/usr/share/tesseract/tessdata/``
   * - Arch Linux
     - ``/usr/share/tessdata/``
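
Because the location differs between systems, a script can probe the candidates instead of hard-coding one. The sketch below checks Tesseract's ``TESSDATA_PREFIX`` environment variable first and then the paths from the table. ``find_tessdata`` is a hypothetical helper, and the candidate list is an assumption that may not match your Tesseract version.

.. code-block:: python

   import os
   from pathlib import Path

   # Candidate directories from the table above; adjust for your system.
   CANDIDATES = [
       "/usr/share/tesseract-ocr/4.00/tessdata",
       "/usr/share/tesseract/tessdata",
       "/usr/share/tessdata",
   ]


   def find_tessdata(candidates=CANDIDATES):
       """Return the first directory containing .traineddata files, or None."""
       env_dir = os.environ.get("TESSDATA_PREFIX")
       paths = ([env_dir] if env_dir else []) + list(candidates)
       for path in paths:
           directory = Path(path)
           if directory.is_dir() and any(directory.glob("*.traineddata")):
               return str(directory)
       return None


   print(find_tessdata())

The returned path is what you would pass as the ``tessdata`` argument of PyMuPDF's OCR functions.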

----

Windows
-------

During Installation (Recommended)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The Tesseract Windows installer from `UB Mannheim <https://github.com/UB-Mannheim/tesseract/wiki>`_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need.

After Installation (Manual)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

If Tesseract is already installed, download language packs manually:

1. Go to `github.com/tesseract-ocr/tessdata <https://github.com/tesseract-ocr/tessdata>`_
2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French)
3. Copy the file into your Tesseract ``tessdata`` folder, typically:

   .. code-block:: text

      C:\Program Files\Tesseract-OCR\tessdata\

.. note::

   The Chocolatey (``choco install tesseract``) package only includes English. All additional languages must be added manually using the steps above.

Verify the Install
~~~~~~~~~~~~~~~~~~

Open Command Prompt or PowerShell and run:

.. code-block:: powershell

   tesseract --list-langs

Your newly installed language should appear in the output.
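
A file-based check is a possible complement to the command above, for example before starting a batch job. This sketch assumes that every language code corresponds to a file named ``<code>.traineddata`` in the ``tessdata`` folder. ``missing_languages`` is a hypothetical helper, and the Windows path in the usage line is the typical location mentioned earlier, not a guaranteed one.

.. code-block:: python

   from pathlib import Path


   def missing_languages(language_spec: str, tessdata_dir: str) -> list[str]:
       """Return codes from e.g. 'eng+fra' that lack a .traineddata file."""
       codes = [code for code in language_spec.split("+") if code]
       return [
           code
           for code in codes
           if not (Path(tessdata_dir) / f"{code}.traineddata").is_file()
       ]


   # Usage: an empty list means every requested language file was found.
   print(missing_languages("eng+fra", r"C:\Program Files\Tesseract-OCR\tessdata"))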

----

macOS
-----

The recommended approach on macOS is `Homebrew <https://brew.sh>`_. There are two options depending on how much disk space you want to use.

Install All Languages at Once
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The ``tesseract-lang`` formula bundles Tesseract with every available language pack:

.. code-block:: bash

   brew install tesseract-lang

Install Specific Languages
~~~~~~~~~~~~~~~~~~~~~~~~~~

If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need:

.. code-block:: bash

   # Install Tesseract engine only
   brew install tesseract

   # Find the tessdata directory
   brew info tesseract
   # Look for a line like: /opt/homebrew/share/tessdata

   # Download a specific language pack (e.g. French)
   curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \
        -o /opt/homebrew/share/tessdata/fra.traineddata

Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine.

.. note::

   If you installed Tesseract via MacPorts instead of Homebrew, install language packs with ``sudo port install tesseract-<langcode>``, for example ``sudo port install tesseract-fra``.

----

Using a Language with pymupdf4llm
----------------------------------

Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter:

.. code-block:: python

   import pymupdf4llm

   # Single language
   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra")

   # Multiple languages
   md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu")
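
The same language string can be used with PyMuPDF directly. The following sketch uses ``Page.get_textpage_ocr()`` with its ``language`` and ``tessdata`` arguments. The helpers ``ocr_language_spec`` and ``ocr_page_text`` are hypothetical names introduced for illustration, and running the OCR part assumes PyMuPDF and the requested language packs are installed.

.. code-block:: python

   def ocr_language_spec(codes):
       """Join language codes into the 'eng+fra' form, dropping duplicates."""
       unique = list(dict.fromkeys(code.strip() for code in codes if code.strip()))
       if not unique:
           raise ValueError("at least one language code is required")
       return "+".join(unique)


   def ocr_page_text(pdf_path, codes, tessdata=None, page_number=0):
       """OCR one page with PyMuPDF and return its text."""
       import pymupdf  # imported lazily so the helper above works without PyMuPDF

       with pymupdf.open(pdf_path) as doc:
           page = doc[page_number]
           textpage = page.get_textpage_ocr(
               language=ocr_language_spec(codes), tessdata=tessdata, full=True
           )
           return page.get_text("text", textpage=textpage)


   print(ocr_language_spec(["eng", "fra", "deu"]))

Leaving ``tessdata`` as ``None`` lets PyMuPDF try to locate the folder itself. Passing the folder explicitly is the more predictable choice.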

----

Common Language Codes
---------------------

.. list-table::
   :header-rows: 1
   :widths: 50 50

   * - Language
     - Code
   * - English
     - ``eng``
   * - French
     - ``fra``
   * - German
     - ``deu``
   * - Spanish
     - ``spa``
   * - Italian
     - ``ita``
   * - Portuguese
     - ``por``
   * - Simplified Chinese
     - ``chi_sim``
   * - Traditional Chinese
     - ``chi_tra``
   * - Japanese
     - ``jpn``
   * - Korean
     - ``kor``
   * - Arabic
     - ``ara``
   * - Russian
     - ``rus``
   * - Hindi
     - ``hin``

For the full list of supported languages and their codes, see the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_.



.. include:: ../footer.rst

