Alright then, thanks for the clarification.
If you know which characters (that is, which Unicode code points) you want to get rid of, and you do not mind the empty spaces left behind when they are removed, then redaction annotations are the easiest approach. The method is font-independent and fast.

import pymupdf

doc = pymupdf.open("test.pdf")
unwanted_chars = ["a", "b", "c", chr(0xFFFD)]  # example unwanted characters
for page in doc:
    blocks = page.get_text("rawdict")["blocks"]  # all text in character detail
    chars = [  # all unwanted characters on page
        c
        for b in blocks
        if b["type"] == 0
        for l in b["lines"]
        for s in l["spans"]
        for c in s["chars"]
        if c["c"] in…
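
The listing above is cut off in this view, so the rest of the original answer is not reproduced here. What follows is only a minimal sketch of how the redaction idea described above could be carried through, not the original code. The function names `find_unwanted_chars` and `redact_unwanted` and the path parameters are hypothetical. The sketch assumes each character's `"bbox"` from `get_text("rawdict")` is passed to `page.add_redact_annot()` and that `page.apply_redactions()` is called once per page.

```python
def find_unwanted_chars(blocks, unwanted):
    """Return the rawdict character entries whose "c" value is in `unwanted`.

    Hypothetical helper: `blocks` is the "blocks" list of page.get_text("rawdict").
    """
    return [
        c
        for b in blocks
        if b["type"] == 0  # text blocks only; image blocks have no "lines"
        for l in b["lines"]
        for s in l["spans"]
        for c in s["chars"]
        if c["c"] in unwanted
    ]


def redact_unwanted(in_path, out_path, unwanted):
    """Hypothetical driver: redact every unwanted character and save a copy."""
    import pymupdf  # imported here so the helper above works without PyMuPDF

    doc = pymupdf.open(in_path)
    for page in doc:
        blocks = page.get_text("rawdict")["blocks"]
        for c in find_unwanted_chars(blocks, unwanted):
            # Assumption: one redaction annotation per character bounding box.
            page.add_redact_annot(c["bbox"])
        # Removes the text under all redaction annotations on this page.
        page.apply_redactions()
    doc.save(out_path)
    doc.close()
```

A call might look like `redact_unwanted("test.pdf", "cleaned.pdf", {"a", "b", "c", chr(0xFFFD)})`. Note that with its default arguments `apply_redactions()` may also affect images overlapping the redacted rectangles, so it is worth checking the result on pages that mix text and images.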

Answer selected by frkd-dev