This is another case where the result depends on whether "ActualText" is accepted (the default) or ignored.
The PDF's author chose to store some text (the text drawn in the gray left column, such as "STERILIZATION") as hidden "ActualText" inside the PDF.
Because we (like all state-of-the-art extractors) accept ActualText by default, it appears in the output, together with whatever coordinates we are given for it.
ActualText need not be visible in PDF viewers (nor in Pixmaps), but it is still extracted.
If you add the bit pymupdf.TEXT_IGNORE_ACTUALTEXT to your extraction flags, this text will disappear from the output.
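
As a rough sketch of the above (not a definitive recipe), the flag can be OR-ed onto the default text flags and the two extractions compared. The helper names `extract_both` and `actualtext_only_lines`, and the choice of `pymupdf.TEXTFLAGS_TEXT` as the base flags for plain `"text"` extraction, are assumptions made for illustration. The line comparison is only a heuristic: once ActualText is ignored, the underlying glyphs it replaced may show up in its place.

```python
def extract_both(path, page_number=0):
    """Return (default_text, text_with_actualtext_ignored) for one page."""
    # Imported here so the comparison helper below works without PyMuPDF.
    # Assumes a PyMuPDF version that defines TEXT_IGNORE_ACTUALTEXT.
    import pymupdf

    doc = pymupdf.open(path)
    try:
        page = doc[page_number]
        # Default behaviour: ActualText is accepted.
        default_text = page.get_text("text")
        # Assumption: start from the default flags for "text" and add the bit.
        flags = pymupdf.TEXTFLAGS_TEXT | pymupdf.TEXT_IGNORE_ACTUALTEXT
        visible_text = page.get_text("text", flags=flags)
    finally:
        doc.close()
    return default_text, visible_text


def actualtext_only_lines(default_text, visible_text):
    """Lines in the default extraction that vanish once ActualText is ignored."""
    visible = {line.strip() for line in visible_text.splitlines()}
    return [
        line.strip()
        for line in default_text.splitlines()
        if line.strip() and line.strip() not in visible
    ]
```

Calling `actualtext_only_lines(*extract_both("input.pdf"))` (with `input.pdf` standing in for your own file) should then list candidates such as "STERILIZATION" that come from ActualText rather than from visible text.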

@JorjMcKie
@yinghao-xue
Answer selected by yinghao-xue