Can PyMuPDF help me find the place in a PDF stream where specific text is represented as CIDs? #4911
Hello,

My ultimate goal is to delete some characters, encoded as CIDs, from a PDF stream or streams.

First, I read several discussions here related to text deletion. I'm aware of

Once I understood that my particular problem goes deeper, I studied the PDF specification, and now I don't know whether my goal is achievable using the MuPDF library. Currently it looks like it is not, and that it would require implementing a sophisticated parser, so I'm seeking advice.

Details. It wouldn't be a big problem for me if the streams in my PDF file had regular string objects like

Unfortunately, most of the characters in my PDF streams are encoded as CIDs (character identifiers) in the form of

Then I had an idea that I could achieve this using blocks through

Now the question: does PyMuPDF have any API that would help me find a stream, and then find an offset in that stream to a

Thank you!
Replies: 2 comments 3 replies
I think you indeed are. You don't say it clearly, but I assume you not only want to remove unwanted characters? You probably also want to shuffle the remaining text around so that no vacant positions are left on the page where characters have been removed. Isn't there a much simpler solution?
Alright then, thanks for the clarification.
If you know the characters (i.e. their Unicode values) you want to get rid of, and do not mind the empty spaces left behind when they are removed, then redaction annotations are the easiest approach. They are font-independent and fast.