Can PyMuPDF help me find a place in PDF stream with a specific text represented as CIDs? #4911

frkd-dev · 2026-02-18T15:45:56Z

frkd-dev
Feb 18, 2026

Hello,

My ultimate goal is to delete some characters, encoded as CIDs, from one or more PDF content streams.

I have read several discussions here about text deletion. I'm aware of Page.get_text() together with Page.add_redact_annot() and Page.apply_redactions(), but they don't fit my goal because redaction deletes any text that falls within a rectangle.

Once I understood that my particular problem goes deeper, I studied the PDF specification, and now I don't know whether my goal is achievable with the MuPDF library. Currently it looks like it is not, and that it would require implementing a sophisticated parser, so I'm seeking advice.

Details. It wouldn't be a big problem if the streams in my PDF file contained regular string objects like BT ... (Some text to output) Tj ... ET. Those could easily be removed by scanning the stream as text, deleting the objects, and updating the stream in the PDF.

Unfortunately, most of the characters in my PDF streams are encoded as CIDs (character identifiers), in sequences of the form BT ... [<002100250088> 1 <002300120008> 1 ... ] TJ ET with positional adjustments. As you know, these CIDs aren't ASCII or Unicode characters; the actual Unicode values must be looked up through a mapping table in the font objects, so no straightforward searching is possible.
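
To make the encoding concrete, here is a minimal sketch of how the hex strings in such a TJ array break down into character codes. It assumes fixed two-byte codes (as with an Identity-H encoding), which this thread does not confirm for the file in question. The names cids_from_tj and decode and the sample mapping table are hypothetical and only for illustration; a real table would have to come from the font's ToUnicode CMap.

```python
import re

# A PDF hex string: hex digits (whitespace allowed) between < and >.
HEX_STRING = re.compile(rb"<([0-9A-Fa-f\s]*)>")


def cids_from_tj(tj_array: bytes, code_size: int = 2) -> list[int]:
    """Return the character codes found in the hex strings of a TJ array.

    Assumption: every code is `code_size` bytes wide. Fonts with
    variable-width CMaps need a real CMap parser instead.
    """
    cids = []
    for match in HEX_STRING.finditer(tj_array):
        digits = re.sub(rb"\s", b"", match.group(1))
        if len(digits) % 2:  # an odd-length hex string is padded with a trailing 0
            digits += b"0"
        raw = bytes.fromhex(digits.decode("ascii"))
        for i in range(0, len(raw) - code_size + 1, code_size):
            cids.append(int.from_bytes(raw[i:i + code_size], "big"))
    return cids


def decode(cids, to_unicode):
    """Map codes to text; unknown codes become U+FFFD."""
    return "".join(to_unicode.get(cid, "\ufffd") for cid in cids)


# Hypothetical mapping, NOT taken from any real font.
sample_map = {0x0021: "H", 0x0025: "i"}
codes = cids_from_tj(b"[<002100250088> 1 <002300120008> 1] TJ")
text = decode(codes, sample_map)
```

The numbers between the hex strings (the 1 entries above) are positional adjustments and are skipped by this sketch.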

Then I had the idea that I could achieve this using blocks via Page.get_text("blocks"), but I learned that blocks are PyMuPDF's abstraction and there is no API for manipulating them, which makes sense since the PDF specification has no such thing as a "block".

Now the question: does PyMuPDF have any API that would help me find the right stream and then the offset within it of the BT ... ET portion containing the text of interest, considering that it is CID-encoded, or am I doomed to writing a parser+mapper to do this?
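
For illustration, the stream-scanning step could be sketched roughly like this. It is a naive sketch, not a parser: it assumes BT and ET appear as whitespace-delimited tokens, that no string literal contains them, and that a text object does not straddle two content streams or sit inside a Form XObject. text_object_spans and page_text_objects are hypothetical helper names; Page.get_contents() and Document.xref_stream() are existing PyMuPDF calls.

```python
import re

# BT ... ET as whitespace-delimited tokens; lazy so each text object matches separately.
TEXT_OBJECT = re.compile(rb"(?<!\S)BT(?!\S).*?(?<!\S)ET(?!\S)", re.DOTALL)


def text_object_spans(stream: bytes) -> list[tuple[int, int]]:
    """Return (start, end) byte offsets of every BT ... ET section in a stream."""
    return [m.span() for m in TEXT_OBJECT.finditer(stream)]


def page_text_objects(pdf_path, page_number=0):
    """Return (xref, start, end) for each text object of one page.

    PyMuPDF is imported here so the pure helper above works without it.
    """
    import pymupdf

    doc = pymupdf.open(pdf_path)
    page = doc[page_number]
    found = []
    for xref in page.get_contents():      # xrefs of the page's content streams
        stream = doc.xref_stream(xref)    # decompressed stream bytes
        found.extend((xref, s, e) for s, e in text_object_spans(stream))
    return found
```

A modified stream could be written back with Document.update_stream(xref, new_bytes). Deciding which of the found sections holds the text of interest still requires mapping the CIDs through the font, which this sketch does not attempt.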

Thank you!

Answered by JorjMcKie

Feb 20, 2026


JorjMcKie · 2026-02-19T12:14:39Z

JorjMcKie
Feb 19, 2026
Maintainer

or am I doomed to writing a parser+mapper to do this?

I think you indeed are. You don't say it clearly, but I assume you not only want to remove unwanted characters? You probably also want to shift the remaining text so that no vacant positions are left on the page where characters have been removed.

Isn't there a much simpler solution?
For example, you could extract the page content in its entirety (page.get_text("dict") or "rawdict"), make sure you have all required fonts available as Font objects, and then write the desired text subset out to a new page, taking care of the gaps in your page.
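
A minimal sketch of that idea, under loud assumptions: it copies text only (images and drawings are dropped), ignores page rotation, and writes every span with the built-in "helv" font as a stand-in instead of the original fonts, so non-Latin text would not survive. wanted_spans, srgb_to_rgb and copy_text_subset are hypothetical helper names, and is_unwanted is a caller-supplied predicate on a span dictionary.

```python
def wanted_spans(page_dict, is_unwanted):
    """Yield the text spans of a get_text("dict") result that should be kept."""
    for block in page_dict["blocks"]:
        if block["type"] != 0:  # skip image blocks
            continue
        for line in block["lines"]:
            for span in line["spans"]:
                if not is_unwanted(span):
                    yield span


def srgb_to_rgb(color):
    """Convert the integer sRGB span color to an (r, g, b) tuple of floats in 0..1."""
    return (
        ((color >> 16) & 255) / 255,
        ((color >> 8) & 255) / 255,
        (color & 255) / 255,
    )


def copy_text_subset(src_path, dst_path, is_unwanted):
    """Write the kept spans of every page to a new PDF at their original positions."""
    import pymupdf  # imported here so the helpers above work without it

    src = pymupdf.open(src_path)
    dst = pymupdf.open()  # new empty PDF
    for page in src:
        new_page = dst.new_page(width=page.rect.width, height=page.rect.height)
        for span in wanted_spans(page.get_text("dict"), is_unwanted):
            new_page.insert_text(
                span["origin"],         # baseline start point of the span
                span["text"],
                fontsize=span["size"],
                fontname="helv",        # assumption: stand-in for the original font
                color=srgb_to_rgb(span["color"]),
            )
    dst.save(dst_path)
```

A call might look like copy_text_subset("test.pdf", "test-subset.pdf", lambda s: "DRAFT" in s["text"]), where both file names and the predicate are placeholders.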

1 reply

frkd-dev Feb 20, 2026
Author

You don't say it clearly, but I assume you not only want to remove unwanted characters? You probably also want to shuffle around the remaining text such that there are no vacant positions on the page where characters have been removed.

I didn't say so because I don't need any sort of reflow or reformatting of the text. The text in my PDF is organized as visual blocks, and I need to delete some blocks here and there. I understand that the order of characters in a PDF can be completely random while still looking sequential visually, but that is not my case.

extract the page content in its entirety (page.get_text("dict") or "rawdict"), make sure you have all required fonts available as Font object and then write the desired text subset out to a new page

Anything might work if it keeps me away from writing my own parser 😄. Did I get you right? I should create a new page, then go through the dict I got from the original page and write only the parts/objects I need to the new page, right?

JorjMcKie · 2026-02-20T15:55:26Z

JorjMcKie
Feb 20, 2026
Maintainer

Alright then, thanks for the clarification.
If you know the characters (=Unicodes) you want to get rid of and do not mind the empty spaces left over when removing them, then using redaction annotations is the easiest thing to do. It is font-independent and fast.

import pymupdf

doc = pymupdf.open("test.pdf")
unwanted_chars = ["a", "b", "c", chr(0xFFFD)]  # example unwanted characters
for page in doc:
    blocks = page.get_text("rawdict")["blocks"]  # all text in character detail
    chars = [  # all unwanted characters on page
        c
        for b in blocks
        if b["type"] == 0
        for l in b["lines"]
        for s in l["spans"]
        for c in s["chars"]
        if c["c"] in unwanted_chars
    ]
    for c in chars:  # make small rectangles around the character's center point
        bbox = pymupdf.Rect(c["bbox"])
        mp = (bbox.tl + bbox.br) / 2
        minibbox = pymupdf.Rect(mp, mp) + (-1, -1, 1, 1)
        page.add_redact_annot(minibbox)  # make a redaction annotation
    if chars:  # if there are unwanted chars on this page, erase them all in one go
        page.apply_redactions()
doc.ez_save("test-redacted.pdf")

2 replies

JorjMcKie Feb 20, 2026
Maintainer

To explain why the minibbox is used:
Redactions erase everything that overlaps their rectangle, even partly. And character bboxes are in reality larger than the visible glyph.
So a small 2 x 2 rectangle around the center of the bbox is a fairly safe choice.
If your text is written across images, the images will also get holes punched into them!
To prevent this, pass the appropriate option: page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE).
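
The same logic can be sketched with the center-box arithmetic spelled out and the image option applied. center_box and redact_chars are hypothetical helper names, and chars is assumed to be a list of "rawdict" character dictionaries like the one built in the script above.

```python
def center_box(bbox, half=1.0):
    """Return a (2*half) x (2*half) box around the center of bbox = (x0, y0, x1, y1)."""
    x0, y0, x1, y1 = bbox
    cx, cy = (x0 + x1) / 2, (y0 + y1) / 2
    return (cx - half, cy - half, cx + half, cy + half)


def redact_chars(page, chars, keep_images=True):
    """Redact the given characters on a page, optionally leaving images untouched."""
    import pymupdf  # imported here so center_box works without it

    for c in chars:
        # A tiny box only touches the targeted character, not its neighbors.
        page.add_redact_annot(pymupdf.Rect(center_box(c["bbox"])))
    if not chars:
        return
    if keep_images:
        page.apply_redactions(images=pymupdf.PDF_REDACT_IMAGE_NONE)
    else:
        page.apply_redactions()  # library defaults: overlapped image areas are affected
```

For a character bbox of (10, 20, 30, 50) the center is (20, 35), so the redaction rectangle becomes (19, 34, 21, 36).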

frkd-dev Mar 2, 2026
Author

Thank you for showing the example. It does work, but for more complex cases I'll write a more complete solution.
Thank you once again!
