A local/offline batch processing tool for PDF accessibility enhancement. This module complements the AWS-based PDF accessibility solution by enabling:
- Offline processing without AWS infrastructure
- Pre-processing before cloud upload
- Development/testing workflows
- High-volume batch processing with folder structure preservation
- OCR Enhancement: Adds invisible searchable text layers using Tesseract (via ocrmypdf)
- PDF/UA-1 Preparation: Adds compliance metadata and markers for accessibility
- Batch Processing: Process entire directory trees with folder structure preservation
- Progress Tracking: Visual progress bar with tqdm
- Parallel Processing: Multi-threaded processing for faster throughput
- Summary Reports: JSON reports with processing statistics
-
Python 3.8+
-
Tesseract OCR (system dependency)
# macOS brew install tesseract # Ubuntu/Debian sudo apt-get install tesseract-ocr # Windows # Download from: https://github.com/UB-Mannheim/tesseract/wiki
-
Ghostscript (required by ocrmypdf)
# macOS brew install ghostscript # Ubuntu/Debian sudo apt-get install ghostscript
cd local_batch_processor
pip install -r requirements.txtProcess a single PDF:
python -m local_batch_processor.cli process input.pdf output.pdfBatch process a directory:
python -m local_batch_processor.cli batch input_folder/ output_folder/With options:
# Process with 4 parallel workers
python -m local_batch_processor.cli batch input/ output/ --workers 4
# Skip OCR (only apply PDF/UA metadata)
python -m local_batch_processor.cli batch input/ output/ --skip-ocr
# Force OCR even if text exists
python -m local_batch_processor.cli batch input/ output/ --force-ocr
# Set custom DPI for OCR
python -m local_batch_processor.cli batch input/ output/ --dpi 400
# Use different OCR language
python -m local_batch_processor.cli batch input/ output/ --ocr-lang deufrom local_batch_processor import BatchProcessor, EnhancementService
# Single file processing
service = EnhancementService(text_threshold=100, dpi=300)
success = service.enhance_document(
input_path="input.pdf",
output_path="output.pdf",
title="My Document",
author="Author Name",
language="en-US"
)
# Batch processing
processor = BatchProcessor(text_threshold=100, dpi=300)
summary = processor.process_batch(
input_dir="./pdfs",
output_dir="./enhanced",
workers=4,
recursive=True
)
print(f"Processed: {summary['processed']}/{summary['total_files']}")
print(f"Failed: {summary['failed']}")-
OCR Enhancement (if needed)
- Analyzes PDF text content
- Applies OCR using sandwich renderer (invisible text behind visible content)
- Normalizes non-standard page boxes for accurate text positioning
-
PDF/UA-1 Preparation
- Strips orphan tags that interfere with accessibility tools
- Adds PDF/UA-1 compliance metadata
- Sets document properties (title, author, language)
- Marks document for manual tagging workflow
output_folder/
├── subfolder1/
│ ├── document1.pdf
│ └── document2.pdf
├── subfolder2/
│ └── document3.pdf
└── batch_processing_summary.json
The folder structure from the input directory is preserved in the output.
After batch processing, a batch_processing_summary.json file is created:
{
"success": true,
"total_files": 100,
"processed": 98,
"failed": 2,
"total_duration": 1234.5,
"avg_duration_per_file": 12.3,
"successful_files": ["file1.pdf", "file2.pdf", ...],
"failed_files": [
{"file": "bad.pdf", "error": "Encrypted PDF"},
{"file": "corrupt.pdf", "error": "Invalid PDF structure"}
],
"timestamp": "2024-01-15T10:30:00"
}This local batch processor can be used alongside the AWS-based solution:
- Pre-processing: Process PDFs locally before uploading to S3
- Testing: Verify accessibility enhancements locally before cloud deployment
- Offline workflow: Process PDFs when AWS infrastructure is not available
- High-volume batch jobs: Process large collections locally with parallel workers
Install Tesseract OCR and the Python package:
# Install Tesseract (system)
brew install tesseract # macOS
# Install Python package
pip install ocrmypdfThe processor cannot handle password-protected PDFs. Remove protection before processing.
Use the --force-ocr flag to regenerate the text layer with corrected positioning:
python -m local_batch_processor.cli process input.pdf output.pdf --force-ocrThis module is part of the PDF Accessibility Solutions project. See the main repository LICENSE for details.