PDF Tools

A collection of CLI utilities for working with PDF files.

Tools

PDF to Text (`pdf_to_text.py`)

A tiny CLI utility to stream large PDF files into plain text without loading the entire file into memory. It wraps pdfminer.six with page-based iteration, configurable LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs. When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.
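
The page-based iteration described above could look roughly like the following. This is a minimal sketch, not the script's actual code: `iter_page_text` and `stream_to_file` are hypothetical names, and the sketch assumes pdfminer.six's `extract_pages`, `LAParams`, and `LTTextContainer` are what the tool builds on.

```python
from typing import Any, Iterator, Optional, Sequence


def iter_page_text(
    pdf_path: str,
    page_numbers: Optional[Sequence[int]] = None,
    **laparams_kwargs: Any,
) -> Iterator[str]:
    """Yield the text of one page at a time (hypothetical helper)."""
    # Imported lazily so this module still imports without pdfminer.six.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    # Layout options such as char_margin or line_margin are forwarded as-is.
    laparams = LAParams(**laparams_kwargs)
    # page_numbers is zero-based in pdfminer.six; None means every page.
    for layout in extract_pages(pdf_path, page_numbers=page_numbers, laparams=laparams):
        yield "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )


def stream_to_file(pdf_path: str, out_path: str, encoding: str = "utf-8", **kwargs: Any) -> int:
    """Write each page as soon as it is extracted; return the page count."""
    pages = 0
    with open(out_path, "w", encoding=encoding) as out:
        for text in iter_page_text(pdf_path, **kwargs):
            out.write(text)
            pages += 1
    return pages
```

Because `iter_page_text` is a generator, only one page's text is held at a time in this sketch, which is the property that makes very large PDFs manageable.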

Text Replacement (`replace_text.py`)

Replace text in PDF files, with support for case-sensitive or case-insensitive search, whole-word matching, regular expressions, and custom font sizing. When no custom font size is given, the tool preserves the original font, size, and style.

Key Features:

Simple text replacement with formatting preservation
Case-insensitive search option
Whole word matching
Regular expression support
Custom font size specification (--size)
Batch processing of multiple occurrences
Enhanced debugging with detailed logging

Replace Page (`replace_page.py`)

A Python utility to replace a page in a PDF file with an image. The image is automatically scaled to fit the dimensions of the page being replaced while maintaining its aspect ratio.

Key Features:

Replace any page in a PDF with an image
Automatically scales images to match page dimensions
Maintains image aspect ratio
Preserves all other pages in the PDF
Supports various image formats (PNG, JPEG, BMP, etc.)
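
The scaling rule described above (fit the page, keep the aspect ratio) comes down to choosing the smaller of the two width and height ratios. The sketch below illustrates that arithmetic only. `fit_within` is a hypothetical helper, and centering the image on the page is an assumption, since the tool's actual placement is not documented here.

```python
from typing import Tuple


def fit_within(
    img_w: float, img_h: float, page_w: float, page_h: float
) -> Tuple[float, float, float, float]:
    """Return (x, y, width, height) for an image scaled to fit a page."""
    if min(img_w, img_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The smaller ratio guarantees the image fits in both directions.
    scale = min(page_w / img_w, page_h / img_h)
    width, height = img_w * scale, img_h * scale
    # Assumption: center the scaled image in the leftover space.
    return ((page_w - width) / 2, (page_h - height) / 2, width, height)


# A 200x100 image on a 400x400 page is limited by width (scale 2.0).
print(fit_within(200, 100, 400, 400))  # (0.0, 100.0, 400.0, 200.0)
```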

Installation

Create or activate your Python virtual environment (the repository already contains .venv/).
Install the requirements:

pip install -r requirements.txt

Usage

PDF to Text

python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]

Examples

Convert an entire PDF:

python pdf_to_text.py documents/manual.pdf

Extract pages 50-150, overwriting the output file if it already exists:

python pdf_to_text.py big-output.pdf --page-range 50-150 \
    --output extracted.txt --overwrite

Options

--page-range: specify start-end to control the page window (e.g., 50-150, or 10- for page 10 through the end of the document)
--encoding: control the output text encoding (default utf-8)
--char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics
--quiet / --log-level: mute or raise logging verbosity
--no-spinner: disable the CLI animation
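
As an illustration of how a `--page-range` value such as `50-150` or `10-` could be interpreted, here is a minimal sketch. `parse_page_range` is a hypothetical function, and treating the bounds as 1-based and inclusive, as well as accepting a single page number, are assumptions rather than documented behavior.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse 'START-END', 'START-', or 'N' into (start, end).

    Assumes 1-based inclusive bounds; end is None for an open-ended range.
    """
    start_text, sep, end_text = (part.strip() for part in spec.partition("-"))
    if not start_text.isdecimal():
        raise ValueError(f"invalid page range: {spec!r}")
    start = int(start_text)
    if not sep:
        end: Optional[int] = start  # a single page, e.g. "7"
    elif not end_text:
        end = None  # open-ended, e.g. "10-"
    elif end_text.isdecimal():
        end = int(end_text)
    else:
        raise ValueError(f"invalid page range: {spec!r}")
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end


print(parse_page_range("50-150"))  # (50, 150)
print(parse_page_range("10-"))     # (10, None)
```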

Text Replacement

python replace_text.py INPUT_PDF SEARCH_TEXT REPLACE_TEXT [OPTIONS]

Examples

Replace text while preserving formatting:

python replace_text.py document.pdf "old text" "new text"

Replace with custom font size:

python replace_text.py document.pdf "11-21-2020" "11-21-2025" --size 14

Case-insensitive replacement:

python replace_text.py document.pdf "Old Text" "New Text" --ignore-case

Regex replacement (email redaction):

python replace_text.py document.pdf "\b[\w.-]+@[\w.-]+\.\w+\b" "[EMAIL]" --regex

Phone number redaction:

python replace_text.py document.pdf "\b\d{3}-\d{3}-\d{4}\b" "[PHONE]" --regex
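
Before running a redaction pattern against a PDF, it can help to try it on plain text. The sketch below applies the two patterns from the examples above with Python's `re` module. It assumes the tool interprets `--regex` patterns with Python regular-expression syntax, and `redact` is a hypothetical helper.

```python
import re

# Patterns copied from the command-line examples above.
EMAIL = r"\b[\w.-]+@[\w.-]+\.\w+\b"
PHONE = r"\b\d{3}-\d{3}-\d{4}\b"


def redact(text: str) -> str:
    """Replace email addresses and NNN-NNN-NNNN phone numbers."""
    text = re.sub(EMAIL, "[EMAIL]", text)
    return re.sub(PHONE, "[PHONE]", text)


print(redact("Contact jane.doe@example.com or call 555-123-4567."))
# Contact [EMAIL] or call [PHONE].
```

The phone pattern only matches the dashed `NNN-NNN-NNNN` form, so a number written as `(555) 123-4567` would pass through unchanged.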

Options

--ignore-case (-i): Case-insensitive search
--whole-word (-w): Match whole words only
--regex (-r): Treat search text as a regular expression
--size SIZE: Font size to use for replacement text (preserves original size if not specified)
--overwrite: Overwrite the output file if it exists
--output (-o): Specify output file path
--quiet: Suppress informational logging
--log-level: Set logging level (DEBUG, INFO, WARNING, ERROR)
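
The three search flags can be combined into one compiled pattern. The following is a minimal sketch of one way to do that with Python's `re` module. `build_pattern` is a hypothetical function and does not necessarily reflect how `replace_text.py` implements these options.

```python
import re


def build_pattern(
    search: str,
    ignore_case: bool = False,
    whole_word: bool = False,
    regex: bool = False,
) -> "re.Pattern[str]":
    """Compile a search pattern from the three search flags."""
    # Without --regex the search text is matched literally.
    body = search if regex else re.escape(search)
    if whole_word:
        # The non-capturing group keeps alternations such as "a|b" intact.
        body = rf"\b(?:{body})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(body, flags)


pattern = build_pattern("old text", ignore_case=True)
print(pattern.sub("new text", "Old Text here"))  # new text here
```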

Limitations:

Works best with PDFs that have selectable text
Scanned PDFs (images) require OCR preprocessing
Complex layouts may not be perfectly preserved
Encrypted PDFs require a password

Replace Page

python replace_page.py INPUT_PDF IMAGE_FILE [OPTIONS]

Examples

Replace the first page (cover):

python replace_page.py document.pdf new_cover.png

Replace a specific page:

python replace_page.py report.pdf diagram.jpg --page 3 --output updated_report.pdf

Replace with overwrite:

python replace_page.py document.pdf image.png --page 5 --overwrite

Options

-p, --page PAGE: Page number to replace (1-indexed, default: 1)
-o, --output OUTPUT: Path to the output PDF file
--overwrite: Overwrite the output file if it exists
--quiet: Suppress informational logging
--log-level LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)
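
Because `--page` is 1-indexed while PDF libraries such as PyMuPDF index pages from zero, a conversion and bounds check is needed somewhere. The sketch below shows one way to do it. `resolve_page_index` is a hypothetical helper, not a function taken from `replace_page.py`.

```python
def resolve_page_index(page: int, page_count: int) -> int:
    """Convert a 1-indexed page number into a zero-based index."""
    if not 1 <= page <= page_count:
        raise ValueError(f"page {page} is out of range (1-{page_count})")
    return page - 1


print(resolve_page_index(3, 10))  # 2
```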

Supported Image Formats: PNG, JPEG, BMP, GIF, TIFF

Testing

Run the CLIs with --help to verify the scripts start without errors:

python pdf_to_text.py --help
python replace_text.py --help
python replace_page.py --help

Run the test suite:

pytest tests/ -v

Requirements

Python 3.8+
PyMuPDF (fitz) >= 1.23.0
PyPDF2 >= 4.0
Pillow >= 10.0
pdfminer.six

See requirements.txt for complete dependencies.

Notes

Always test on a copy of your PDF first
Complex PDFs with multiple layers may not work perfectly
The tools preserve images and non-text content when possible
Text replacement preserves formatting when custom font size is not specified
