Invisible text in get_text() despite being absent in get_pixmap() #4952
I've encountered a case where page.get_text("rawdict") returns text that is not visible in the rendered output.
Rawdict Comparison:
--- VISIBLE SPAN (matches pixmap) ---
{'bbox': (133.10000610351562,
--- INVISIBLE SPAN (exists in rawdict, absent in pixmap) ---
{'bbox': (219.6320037841797, 696.166015625, 305.29010009765625, 708.2080078125),
This is another case of differences between "ActualText" being accepted (the default) or not.
The PDF's author found it cool to put some text (the drawn text in the gray left column like "STERILIZATION") as hidden "actual text" inside the PDF.
Because we (like all state-of-the-art extractors) accept ActualText by default, it will appear ... together with whatever coordinates we are given.
ActualText need not be visible in PDF viewers (nor in pixmaps), but it is still extracted!
If you add the bit pymupdf.TEXT_IGNORE_ACTUALTEXT to your extraction flags, then these things will disappear.
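
To see which spans come only from ActualText, one option is to extract the page twice, once with the default rawdict flags and once with the ignore bit added, and compare the results. The following is a minimal sketch, not an official recipe: the helper names (span_text, flatten_spans, spans_missing_from, actualtext_only_spans) are made up for illustration, and it assumes a PyMuPDF version that provides TEXT_IGNORE_ACTUALTEXT and that all other spans keep the same text and bbox between the two extractions.

```python
def span_text(span):
    """Join the per-character entries of a rawdict span into a string."""
    return "".join(ch["c"] for ch in span["chars"])


def flatten_spans(rawdict):
    """Return all text spans of a rawdict page structure as a flat list."""
    return [
        span
        for block in rawdict.get("blocks", [])
        for line in block.get("lines", [])  # image blocks carry no "lines"
        for span in line["spans"]
    ]


def spans_missing_from(rawdict_all, rawdict_filtered):
    """Spans present in rawdict_all but absent from rawdict_filtered.

    Spans are matched by text plus bbox rounded to one decimal place,
    which is an illustrative choice, not something PyMuPDF prescribes.
    """
    def key(span):
        return (tuple(round(v, 1) for v in span["bbox"]), span_text(span))

    kept = {key(s) for s in flatten_spans(rawdict_filtered)}
    return [s for s in flatten_spans(rawdict_all) if key(s) not in kept]


def actualtext_only_spans(page):
    """List the spans that disappear once ActualText is ignored."""
    import pymupdf  # imported here so the helpers above need no PyMuPDF

    base = pymupdf.TEXTFLAGS_RAWDICT  # default flags for "rawdict"
    with_at = page.get_text("rawdict", flags=base)
    without_at = page.get_text(
        "rawdict", flags=base | pymupdf.TEXT_IGNORE_ACTUALTEXT
    )
    return spans_missing_from(with_at, without_at)
```

Calling actualtext_only_spans(page) on the affected page should then list spans such as the "STERILIZATION" text mentioned above, provided that text really originates from ActualText.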