PDF Libraries Reference

This file provides a quick reference for the Python libraries used in the pdf-tools skill.

pdfplumber

Installation: pip3 install pdfplumber

Best for: Text extraction, table extraction, page layout analysis

Key features:

  • Extract text with precise positioning
  • Extract tables automatically
  • Get page dimensions and metadata
  • Access visual elements (lines, rectangles, curves)

Basic usage:

import pdfplumber

with pdfplumber.open('file.pdf') as pdf:
    # Get total pages
    total_pages = len(pdf.pages)
    
    # Extract text from a page
    page = pdf.pages[0]
    text = page.extract_text()
    
    # Extract tables
    tables = page.extract_tables()
    
    # Get page dimensions
    width = page.width
    height = page.height
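
The tables returned by extract_tables() are plain nested lists: one list per table, one list per row, with each cell a string or None. A minimal sketch of turning one such table into dictionaries keyed by its header row follows. The helper name table_to_records and the sample data are illustrative assumptions, not part of pdfplumber, and the sketch assumes the first row really is a header.

```python
def table_to_records(table):
    """Convert one table (a list of rows) into dicts keyed by the header row."""
    if not table:
        return []
    # Empty cells come back as None, so normalise them to empty strings
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


# Hypothetical sample shaped like one element of page.extract_tables()
sample_table = [
    ["Name", "Qty"],
    ["Widget", "3"],
    ["Gadget", None],
]
records = table_to_records(sample_table)
print(records)
```

With real data, the same helper would be called on each element of page.extract_tables().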

PyPDF2

Installation: pip3 install PyPDF2 (development of this library now continues under the name pypdf)

Best for: PDF manipulation (merge, split, rotate, watermark)

Key features:

  • Merge multiple PDFs
  • Split PDFs into separate files
  • Rotate pages
  • Add watermarks
  • Encrypt/decrypt PDFs

Basic usage:

from PyPDF2 import PdfReader, PdfWriter, PdfMerger

# Reading
reader = PdfReader('input.pdf')
pages = reader.pages
metadata = reader.metadata

# Writing
writer = PdfWriter()
writer.add_page(reader.pages[0])
with open('output.pdf', 'wb') as f:
    writer.write(f)

# Merging
merger = PdfMerger()
merger.append('file1.pdf')
merger.append('file2.pdf')
merger.write('merged.pdf')
merger.close()

# Rotating
page = reader.pages[0]
page.rotate(90)  # rotate 90 degrees clockwise; the angle must be a multiple of 90

ReportLab (optional, for creating PDFs from scratch)

Installation: pip3 install reportlab

Best for: Generating new PDFs programmatically

Basic usage:

from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter

c = canvas.Canvas("output.pdf", pagesize=letter)
c.drawString(100, 750, "Hello World")
c.save()

Common Patterns

Extract text from specific pages

import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    for page_num in [0, 2, 4]:  # 0-indexed
        page = pdf.pages[page_num]
        text = page.extract_text()
        print(f"Page {page_num + 1}:\n{text}")
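
Because pdf.pages is 0-indexed while people usually refer to pages starting from 1, a small conversion step helps avoid off-by-one mistakes. The sketch below parses a 1-indexed specification such as "1, 3, 5-7" into 0-indexed positions. The function name parse_page_spec and the specification syntax are assumptions made for illustration, not part of pdfplumber.

```python
def parse_page_spec(spec, total_pages):
    """Turn a 1-indexed spec such as '1,3,5-7' into sorted 0-indexed page indices."""
    indices = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
        else:
            start = end = int(part)
        if start < 1 or end > total_pages or start > end:
            raise ValueError(f"Page range {part!r} is outside 1..{total_pages}")
        # Shift to 0-indexed; range() already excludes its stop value
        indices.update(range(start - 1, end))
    return sorted(indices)


page_indices = parse_page_spec("1, 3, 5-7", total_pages=10)
print(page_indices)  # [0, 2, 4, 5, 6]
```

The resulting list can replace the hard-coded [0, 2, 4] above, with total_pages taken from len(pdf.pages).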

Merge PDFs with page ranges

from PyPDF2 import PdfMerger

merger = PdfMerger()
merger.append('file1.pdf', pages=(0, 3))  # First 3 pages
merger.append('file2.pdf')  # All pages
merger.write('output.pdf')
merger.close()
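
The pages tuple behaves like Python's range(): it is 0-indexed and the stop value is excluded, so (0, 3) selects the first three pages. A minimal sketch of converting a human-style, 1-indexed inclusive range into that form follows; the helper name to_page_tuple is hypothetical.

```python
def to_page_tuple(first_page, last_page):
    """Convert a 1-indexed inclusive range into a 0-indexed (start, stop) tuple."""
    if first_page < 1 or last_page < first_page:
        raise ValueError("Expected 1 <= first_page <= last_page")
    # Start shifts down by one; stop is exclusive, so last_page is used as-is
    return (first_page - 1, last_page)


first_three = to_page_tuple(1, 3)
print(first_three)  # (0, 3)
```

The result could then be passed as merger.append('file1.pdf', pages=to_page_tuple(1, 3)).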

Extract and save all pages as separate PDFs

from PyPDF2 import PdfReader, PdfWriter

reader = PdfReader('input.pdf')
for i, page in enumerate(reader.pages):
    writer = PdfWriter()
    writer.add_page(page)
    with open(f'page_{i+1}.pdf', 'wb') as f:
        writer.write(f)

Text Editing

Add text overlay to a page

from PyPDF2 import PdfReader, PdfWriter
from reportlab.pdfgen import canvas
from reportlab.lib.pagesizes import letter
from io import BytesIO

# Create a single-page overlay in memory (assumes the target page is letter size)
packet = BytesIO()
can = canvas.Canvas(packet, pagesize=letter)
can.setFont("Helvetica", 12)
can.drawString(100, 700, "New Text Here")
can.save()

# Merge with existing PDF
packet.seek(0)
overlay = PdfReader(packet)

reader = PdfReader('input.pdf')
writer = PdfWriter()

page = reader.pages[0]
page.merge_page(overlay.pages[0])
writer.add_page(page)

with open('output.pdf', 'wb') as f:
    writer.write(f)
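
Choosing the coordinates for drawString is the error-prone part of the overlay method. ReportLab measures from the bottom-left corner of the page in points, whereas pdfplumber reports positions as a distance from the top of the page. A minimal sketch of the conversion follows, assuming a letter-size page (792 points tall); the helper name top_to_reportlab_y is hypothetical.

```python
LETTER_HEIGHT = 792.0  # points; the height of reportlab.lib.pagesizes.letter


def top_to_reportlab_y(top, page_height=LETTER_HEIGHT):
    """Convert a distance from the top of the page into a bottom-left-origin y value."""
    if not 0 <= top <= page_height:
        raise ValueError("top must lie within the page")
    return page_height - top


# Text placed 92 points below the top edge of a letter page
y = top_to_reportlab_y(92)
print(y)  # 700.0
```

For pages that are not letter size, the height read from the source PDF (for example page.height in pdfplumber) would be passed as page_height.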

Simple text replacement (limited)

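# Note: this only works when the text is stored uncompressed in the file.
# Keep the replacement the same byte length as the original, because a
# different length shifts the byte offsets recorded elsewhere in the PDF.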
with open('input.pdf', 'rb') as f:
    pdf_data = f.read()

# Replace at byte level
modified = pdf_data.replace(b'Old Text', b'New Text')

with open('output.pdf', 'wb') as f:
    f.write(modified)

Important: Direct text replacement in PDFs is unreliable because of the format's complexity: content streams are usually compressed, and text encoding, positioning, and fonts vary between files. For reliable editing:

  1. Extract text → Edit → Regenerate PDF, or
  2. Use overlay method to add new text on top
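
If byte-level replacement is attempted anyway, a guard can at least reject the cases that are known to fail. The sketch below refuses replacements of a different byte length and reports when the text is not present in the raw bytes. The function name replace_same_length and the sample bytes are assumptions for illustration. Passing the guard does not guarantee a correct result, since the font may lack glyphs for the new characters.

```python
def replace_same_length(pdf_data, old, new):
    """Replace old with new only when the swap cannot shift byte offsets."""
    if len(old) != len(new):
        raise ValueError("Replacement must have the same byte length as the original")
    count = pdf_data.count(old)
    if count == 0:
        raise ValueError("Text not found in raw bytes (it may be compressed or encoded)")
    return pdf_data.replace(old, new), count


# Hypothetical uncompressed content-stream fragment, used here instead of a real file
sample = b"BT /F1 12 Tf (Old Text) Tj ET"
modified, replaced = replace_same_length(sample, b"Old Text", b"New Text")
print(modified, replaced)
```

When the guard raises, the two methods listed above remain the safer route.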