-
Notifications
You must be signed in to change notification settings - Fork 697
Updates docs for 1.27.2.2 release. #4947
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
+453
−10
Merged
Changes from all commits
Commits
Show all changes
6 commits
Select commit
Hold shift + click to select a range
24e4e1c
Updates docs for 1.27.2.2 release.
jamie-lemon 0e5b0d0
Merge branch 'main' into docs-general-updates
jamie-lemon 4182a46
Links OCR section to OCR Plugins page and some tidy up.
jamie-lemon 1059bc4
Some small corrections to the OCR page details.
jamie-lemon fef1eef
Reverts changes in installation.rst for OCR notes and some other mino…
jamie-lemon 0437b00
Typo fix.
jamie-lemon File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,251 @@ | ||
|
|
||
| .. include:: ../header.rst | ||
|
|
||
| .. _pymupdf-pro: | ||
|
|
||
| .. raw:: html | ||
|
|
||
| <script> | ||
| document.getElementById("headerSearchWidget").action = '../search.html'; | ||
| </script> | ||
|
|
||
|
|
||
| .. _tesseract-language-packs: | ||
|
|
||
| Tesseract Language Packs | ||
| ======================== | ||
|
|
||
| .. meta:: | ||
| :description: How to install additional Tesseract language packs on macOS, Linux, and Windows. | ||
|
|
||
| Overview | ||
| -------- | ||
|
|
||
| Tesseract identifies languages using three-letter `ISO 639-2 <https://en.wikipedia.org/wiki/List_of_ISO_639-2_codes>`_ codes. English (``eng``) is installed by default on most platforms. For any other language, you need to install the corresponding language pack before pymupdf4llm can use it for OCR. | ||
|
|
||
| A full list of supported language codes is available on the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_. | ||
|
|
||
| .. tip:: | ||
|
|
||
| To see which languages are already installed on your system, run ``tesseract --list-langs`` in your terminal. | ||
|
|
||
| ---- | ||
|
|
||
| Linux | ||
| ----- | ||
|
|
||
| Language pack installation varies slightly by distribution. | ||
|
|
||
| **Ubuntu / Debian** | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| # List all available language packs | ||
| apt-cache search tesseract-ocr | ||
|
|
||
| # Install a specific language (e.g. German) | ||
| sudo apt install tesseract-ocr-deu | ||
|
|
||
| # Install all available languages at once | ||
| sudo apt install tesseract-ocr-all | ||
|
|
||
| Language packages follow the naming pattern ``tesseract-ocr-<langcode>``, for example ``tesseract-ocr-fra`` for French or ``tesseract-ocr-chi-sim`` for Simplified Chinese. | ||
|
|
||
| **Fedora / RHEL** | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| # Search for available language packs | ||
| dnf search tesseract | ||
|
|
||
| # Install a specific language (e.g. German) | ||
| sudo dnf install tesseract-langpack-deu | ||
|
|
||
| # Install all language packs | ||
| sudo dnf install tesseract-langpack-* | ||
|
|
||
| On Fedora, packages are named ``tesseract-langpack-<langcode>``. | ||
|
|
||
| **Arch Linux** | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| # Search for available language packs | ||
| pacman -Ss tesseract-data | ||
|
|
||
| # Install a specific language (e.g. German) | ||
| sudo pacman -S tesseract-data-deu | ||
|
|
||
| On Arch, packages are named ``tesseract-data-<langcode>``. | ||
|
|
||
| Manual Installation (All Distros) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| If a language pack is not available through your package manager, download the ``.traineddata`` file directly from GitHub and copy it to your Tesseract data directory: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| # Download language pack (e.g. French) | ||
| curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ | ||
| -o fra.traineddata | ||
|
|
||
| # Copy to tessdata directory (path varies by distro) | ||
| sudo cp fra.traineddata /usr/share/tesseract-ocr/4.00/tessdata/ | ||
| # or | ||
| sudo cp fra.traineddata /usr/share/tessdata/ | ||
|
|
||
| Common tessdata locations on Linux: | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 40 60 | ||
|
|
||
| * - Distribution | ||
| - Path | ||
| * - Ubuntu / Debian | ||
| - ``/usr/share/tesseract-ocr/4.00/tessdata/`` | ||
| * - Fedora / RHEL | ||
| - ``/usr/share/tesseract/tessdata/`` | ||
| * - Arch Linux | ||
| - ``/usr/share/tessdata/`` | ||
|
|
||
| ---- | ||
|
|
||
| Windows | ||
| ------- | ||
|
|
||
| During Installation (Recommended) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The Tesseract Windows installer from `UB Mannheim <https://github.com/UB-Mannheim/tesseract/wiki>`_ lets you select additional language packs during setup. When you reach the **Choose Components** screen, expand **Additional language data** and tick the languages you need. | ||
|
|
||
| After Installation (Manual) | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| If Tesseract is already installed, download language packs manually: | ||
|
|
||
| 1. Go to `github.com/tesseract-ocr/tessdata <https://github.com/tesseract-ocr/tessdata>`_ | ||
| 2. Download the ``.traineddata`` file for your language (e.g. ``fra.traineddata`` for French) | ||
| 3. Copy the file into your Tesseract ``tessdata`` folder, typically: | ||
|
|
||
| .. code-block:: text | ||
|
|
||
| C:\Program Files\Tesseract-OCR\tessdata\ | ||
|
|
||
| .. note:: | ||
|
|
||
| The Chocolatey (``choco install tesseract``) package only includes English. All additional languages must be added manually using the steps above. | ||
|
|
||
| Verify the Install | ||
| ~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| Open Command Prompt or PowerShell and run: | ||
|
|
||
| .. code-block:: powershell | ||
|
|
||
| tesseract --list-langs | ||
|
|
||
| Your newly installed language should appear in the output. | ||
|
|
||
| ---- | ||
|
|
||
| macOS | ||
| ----- | ||
|
|
||
| The recommended approach on macOS is `Homebrew <https://brew.sh>`_. There are two options depending on how much disk space you want to use. | ||
|
|
||
| Install All Languages at Once | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| The ``tesseract-lang`` formula bundles Tesseract with every available language pack: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| brew install tesseract-lang | ||
|
|
||
| Install Specific Languages | ||
| ~~~~~~~~~~~~~~~~~~~~~~~~~~ | ||
|
|
||
| If you only need a few languages, install ``tesseract`` first and then manually download the ``.traineddata`` files you need: | ||
|
|
||
| .. code-block:: bash | ||
|
|
||
| # Install Tesseract engine only | ||
| brew install tesseract | ||
|
|
||
| # Find the tessdata directory | ||
| brew info tesseract | ||
| # Look for a line like: /opt/homebrew/share/tessdata | ||
|
|
||
| # Download a specific language pack (e.g. French) | ||
| curl -L https://github.com/tesseract-ocr/tessdata/raw/main/fra.traineddata \ | ||
| -o /opt/homebrew/share/tessdata/fra.traineddata | ||
|
|
||
| Replace ``fra`` with your target language code and adjust the tessdata path to match what ``brew info tesseract`` reports on your machine. | ||
|
|
||
| .. note:: | ||
|
|
||
| If you installed Tesseract via MacPorts instead of Homebrew, use ``port install tesseract-<langcode>``, for example ``sudo port install tesseract-fra``. | ||
|
|
||
| ---- | ||
|
|
||
| Using a Language with pymupdf4llm | ||
| ---------------------------------- | ||
|
|
||
| Once a language pack is installed, pass its code to ``to_markdown()`` via the ``ocr_language`` parameter: | ||
|
|
||
| .. code-block:: python | ||
|
|
||
| import pymupdf4llm | ||
|
|
||
| # Single language | ||
| md = pymupdf4llm.to_markdown("document.pdf", ocr_language="fra") | ||
|
|
||
| # Multiple languages | ||
| md = pymupdf4llm.to_markdown("document.pdf", ocr_language="eng+fra+deu") | ||
|
|
||
| ---- | ||
|
|
||
| Common Language Codes | ||
| --------------------- | ||
|
|
||
| .. list-table:: | ||
| :header-rows: 1 | ||
| :widths: 50 50 | ||
|
|
||
| * - Language | ||
| - Code | ||
| * - English | ||
| - ``eng`` | ||
| * - French | ||
| - ``fra`` | ||
| * - German | ||
| - ``deu`` | ||
| * - Spanish | ||
| - ``spa`` | ||
| * - Italian | ||
| - ``ita`` | ||
| * - Portuguese | ||
| - ``por`` | ||
| * - Simplified Chinese | ||
| - ``chi_sim`` | ||
| * - Traditional Chinese | ||
| - ``chi_tra`` | ||
| * - Japanese | ||
| - ``jpn`` | ||
| * - Korean | ||
| - ``kor`` | ||
| * - Arabic | ||
| - ``ara`` | ||
| * - Russian | ||
| - ``rus`` | ||
| * - Hindi | ||
| - ``hin`` | ||
|
|
||
| For the full list of supported languages and their codes, see the `Tesseract tessdata repository <https://github.com/tesseract-ocr/tessdata>`_. | ||
|
|
||
|
|
||
|
|
||
| .. include:: ../footer.rst | ||
|
|
||
|
|
Oops, something went wrong.
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.