PDF Tools

A collection of CLI utilities for working with PDF files.

Tools

PDF to Text (pdf_to_text.py)

A tiny CLI utility to stream large PDF files into plain text without loading the entire file into memory. It wraps pdfminer.six with page-based iteration, configurable LAParams, a friendly CLI spinner, and safe logging so you can batch-process enormous PDFs. When conversion finishes, the CLI prints a summary showing file sizes and elapsed time.
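The page-based iteration described above can be sketched with pdfminer.six's `extract_pages`, which yields one layout object per page so that only the current page is held in memory. This is a minimal sketch under stated assumptions, not the code in pdf_to_text.py: the names `iter_page_text` and `convert` and their parameters are hypothetical, and the pdfminer imports are deferred so the definitions load even where pdfminer.six is not installed.

```python
from typing import Iterator, Optional, Sequence


def iter_page_text(
    pdf_path: str,
    page_numbers: Optional[Sequence[int]] = None,
    char_margin: float = 2.0,
    line_margin: float = 0.5,
) -> Iterator[str]:
    """Yield the text of each page in turn (hypothetical helper)."""
    # Deferred imports: pdfminer.six is only needed once iteration starts.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LAParams, LTTextContainer

    laparams = LAParams(char_margin=char_margin, line_margin=line_margin)
    # extract_pages is a generator; its page_numbers argument is zero-indexed.
    for layout in extract_pages(pdf_path, page_numbers=page_numbers, laparams=laparams):
        yield "".join(
            element.get_text()
            for element in layout
            if isinstance(element, LTTextContainer)
        )


def convert(pdf_path: str, txt_path: str, encoding: str = "utf-8") -> int:
    """Write page text to txt_path one page at a time; return pages written."""
    pages = 0
    with open(txt_path, "w", encoding=encoding) as out:
        for text in iter_page_text(pdf_path):
            out.write(text)
            pages += 1
    return pages
```

Because each page's text is written before the next page is laid out, peak memory depends on the largest page rather than on the whole document.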

Text Replacement (replace_text.py)

Replace text in PDF files with support for case-sensitive or case-insensitive search, whole-word matching, regular expressions, and custom font sizing. When no custom font size is given, the tool preserves the original font, size, and style.

Key Features:

  • Simple text replacement with formatting preservation
  • Case-insensitive search option
  • Whole word matching
  • Regular expression support
  • Custom font size specification (--size)
  • Batch processing of multiple occurrences
  • Enhanced debugging with detailed logging

Replace Page (replace_page.py)

A Python utility to replace a page in a PDF file with an image. The image is automatically scaled to fit the dimensions of the page being replaced while maintaining its aspect ratio.

Key Features:

  • Replace any page in a PDF with an image
  • Automatically scales images to match page dimensions
  • Maintains image aspect ratio
  • Preserves all other pages in the PDF
  • Supports various image formats (PNG, JPEG, BMP, etc.)
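The scaling rule described above (fit the page, keep the aspect ratio) comes down to choosing the smaller of the two width and height ratios. The sketch below illustrates that arithmetic only; the helper name `fit_within` is hypothetical, and centering the image on the page is an assumption rather than confirmed behavior of replace_page.py.

```python
from typing import Tuple


def fit_within(
    img_w: float, img_h: float, page_w: float, page_h: float
) -> Tuple[float, float, float, float]:
    """Return (x0, y0, x1, y1) of the largest centered box on the page
    that has the same aspect ratio as the image."""
    if min(img_w, img_h, page_w, page_h) <= 0:
        raise ValueError("all dimensions must be positive")
    # The smaller ratio is the one that keeps both sides inside the page.
    scale = min(page_w / img_w, page_h / img_h)
    new_w, new_h = img_w * scale, img_h * scale
    # Assumption: leftover space is split evenly on both sides.
    x0 = (page_w - new_w) / 2
    y0 = (page_h - new_h) / 2
    return (x0, y0, x0 + new_w, y0 + new_h)


# A 1224x612 px landscape image on a 612x792 pt portrait page.
print(fit_within(1224, 612, 612, 792))  # → (0.0, 243.0, 612.0, 549.0)
```

In this example the width is the limiting side, so the image fills the page horizontally and leaves equal margins above and below.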

Installation

  1. Create or activate your Python virtual environment (the repository already contains .venv/).
  2. Install the requirements:
pip install -r requirements.txt

Usage

PDF to Text

python pdf_to_text.py INPUT_PDF [-o OUTPUT_TXT] [OPTIONS]

Examples

Convert an entire PDF:

python pdf_to_text.py documents/manual.pdf

Extract a page range, overwriting the output file if it already exists:

python pdf_to_text.py big-output.pdf --page-range 50-150 \
    --output extracted.txt --overwrite

Options

  • --page-range: specify START-END to limit conversion to a page window (e.g., 50-150, or 10- for page 10 onward)
  • --encoding: control the output text encoding (default utf-8)
  • --char-margin, --line-margin, --word-margin, --boxes-flow, --detect-vertical: customize pdfminer.six layout heuristics
  • --quiet / --log-level: mute or raise logging verbosity
  • --no-spinner: disable the CLI animation
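A value such as 50-150 or 10- has to be split into a start and an optional end before pages can be selected. The sketch below shows one way to do that; `parse_page_range` is a hypothetical helper, and treating the bounds as 1-based and inclusive is an assumption, not something confirmed for pdf_to_text.py.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[int, Optional[int]]:
    """Parse 'START-END' or 'START-' into (start, end).

    Assumption: pages are 1-based and the range is inclusive.
    An end of None means "through the last page".
    """
    start_text, sep, end_text = spec.strip().partition("-")
    if not sep:
        raise ValueError(f"expected START-END, got {spec!r}")
    start = int(start_text)  # raises ValueError if missing or non-numeric
    end = int(end_text) if end_text else None
    if start < 1 or (end is not None and end < start):
        raise ValueError(f"invalid page range: {spec!r}")
    return start, end


print(parse_page_range("50-150"))  # → (50, 150)
print(parse_page_range("10-"))     # → (10, None)
```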

Text Replacement

python replace_text.py INPUT_PDF SEARCH_TEXT REPLACE_TEXT [OPTIONS]

Examples

Replace text while preserving formatting:

python replace_text.py document.pdf "old text" "new text"

Replace with custom font size:

python replace_text.py document.pdf "11-21-2020" "11-21-2025" --size 14

Case-insensitive replacement:

python replace_text.py document.pdf "Old Text" "New Text" --ignore-case

Regex replacement (email redaction):

python replace_text.py document.pdf "\b[\w.-]+@[\w.-]+\.\w+\b" "[EMAIL]" --regex

Phone number redaction:

python replace_text.py document.pdf "\b\d{3}-\d{3}-\d{4}\b" "[PHONE]" --regex
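Before running a pattern against a PDF, it can help to try it on a plain string. The sketch below checks the two patterns from the examples above with Python's `re` module; it assumes replace_text.py interprets patterns with `re` syntax, and the sample sentence is made up for illustration.

```python
import re

# The same patterns as in the command-line examples.
EMAIL = re.compile(r"\b[\w.-]+@[\w.-]+\.\w+\b")
PHONE = re.compile(r"\b\d{3}-\d{3}-\d{4}\b")

sample = "Contact jane.doe@example.com or call 555-123-4567."
redacted = PHONE.sub("[PHONE]", EMAIL.sub("[EMAIL]", sample))
print(redacted)  # → Contact [EMAIL] or call [PHONE].
```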

Options

  • --ignore-case (-i): Case-insensitive search
  • --whole-word (-w): Match whole words only
  • --regex (-r): Treat search text as a regular expression
  • --size SIZE: Font size to use for replacement text (preserves original size if not specified)
  • --overwrite: Overwrite the output file if it exists
  • --output (-o): Specify output file path
  • --quiet: Suppress informational logging
  • --log-level: Set logging level (DEBUG, INFO, WARNING, ERROR)
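The three search options can be combined into a single compiled pattern. The following is a sketch of one plausible mapping onto Python's `re` module, not the tool's actual implementation; `build_pattern` and its keyword names are hypothetical.

```python
import re


def build_pattern(
    search: str,
    *,
    regex: bool = False,
    ignore_case: bool = False,
    whole_word: bool = False,
) -> re.Pattern:
    """Compile a search string according to the three matching options."""
    # Literal searches are escaped so characters like '.' match themselves.
    source = search if regex else re.escape(search)
    if whole_word:
        # The non-capturing group keeps alternations such as 'a|b' intact.
        source = rf"\b(?:{source})\b"
    flags = re.IGNORECASE if ignore_case else 0
    return re.compile(source, flags)


pattern = build_pattern("cat", whole_word=True)
print(pattern.findall("cat concat cat."))  # → ['cat', 'cat']
```

Note that `\b` only marks a boundary next to a word character, so whole-word matching built this way may behave unexpectedly for search text that starts or ends with punctuation.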

Limitations:

  • Works best with PDFs that have selectable text
  • Scanned PDFs (images) require OCR preprocessing
  • Complex layouts may not be perfectly preserved
  • Encrypted PDFs require a password

Replace Page

python replace_page.py INPUT_PDF IMAGE_FILE [OPTIONS]

Examples

Replace the first page (cover):

python replace_page.py document.pdf new_cover.png

Replace a specific page:

python replace_page.py report.pdf diagram.jpg --page 3 --output updated_report.pdf

Replace with overwrite:

python replace_page.py document.pdf image.png --page 5 --overwrite

Options

  • -p, --page PAGE: Page number to replace (1-indexed, default: 1)
  • -o, --output OUTPUT: Path to the output PDF file
  • --overwrite: Overwrite the output file if it exists
  • --quiet: Suppress informational logging
  • --log-level LOG_LEVEL: Set logging level (DEBUG, INFO, WARNING, ERROR)
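Since --page is 1-indexed while PDF libraries such as PyMuPDF index pages from zero, the value has to be validated and shifted before use. A minimal sketch, with `to_zero_based` as a hypothetical helper name:

```python
def to_zero_based(page: int, page_count: int) -> int:
    """Convert a 1-indexed page number to a zero-based index, with bounds check."""
    if not 1 <= page <= page_count:
        raise ValueError(f"page must be between 1 and {page_count}, got {page}")
    return page - 1


print(to_zero_based(3, 10))  # → 2
```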

Supported Image Formats: PNG, JPEG, BMP, GIF, TIFF

Testing

Run the CLIs with --help to verify the scripts start without errors:

python pdf_to_text.py --help
python replace_text.py --help
python replace_page.py --help

Run the test suite:

pytest tests/ -v

Requirements

  • Python 3.8+
  • PyMuPDF (fitz) >= 1.23.0
  • PyPDF2 >= 4.0
  • Pillow >= 10.0
  • pdfminer.six

See requirements.txt for complete dependencies.

Notes

  • Always test on a copy of your PDF first
  • Complex PDFs with multiple layers may not work perfectly
  • The tools preserve images and non-text content when possible
  • Text replacement preserves formatting when custom font size is not specified