PDF-Parse CLI Tool

A command-line interface for extracting data from PDF files using the pdf-parse library.

Installation

The CLI ships with the pdf-parse package and is invoked as pdf-parse. To make the command available system-wide, install the package globally:

npm install -g pdf-parse

Updating

To update to the latest version:

npm update -g pdf-parse

Uninstallation

To remove the CLI tool:

npm uninstall -g pdf-parse

Usage

pdf-parse <command> <file> [options]

Where <file> is a local PDF file path or, for commands that support remote files, a URL.

Commands

check

Check PDF file headers and validate the format. This command works only with URLs.

pdf-parse check https://example.com/document.pdf

info

Extract PDF metadata and information.

pdf-parse info document.pdf

text

Extract text content from PDF pages.

pdf-parse text document.pdf --pages 1-3

image

Extract embedded images from PDF pages.

pdf-parse image document.pdf --output ./images/

screenshot (alias: ss)

Generate screenshots of PDF pages.

pdf-parse screenshot document.pdf --output ./screenshots/ --scale 2.0

table

Extract tabular data from PDF pages.

pdf-parse table document.pdf --format json

Options

  • -o, --output <file>: Output file path (for single file) or directory (for multiple files)
  • -p, --pages <range>: Page range (e.g., 1,3-5,7)
  • -f, --format <format>: Output format (json, text, dataurl)
  • -m, --min <px>: Minimum image size threshold in pixels (default: 80)
  • -s, --scale <factor>: Scale factor for screenshots (default: 1.0)
  • -w, --width <px>: Desired width for screenshots in pixels
  • -l, --large: Enable optimizations for large PDF files
  • --magic: Validate PDF magic bytes (use --no-magic to skip this validation)
  • -h, --help: Show help message
  • -v, --version: Show version number

Examples

Basic Usage

Get PDF information:

pdf-parse info mydocument.pdf

Extract text from specific pages:

pdf-parse text mydocument.pdf --pages 1,3-5

Image Extraction

Extract all images to a directory:

pdf-parse image mydocument.pdf --output ./extracted-images/

Extract images with minimum size filter:

pdf-parse image mydocument.pdf --min 100 --output ./images/

Screenshots

Generate screenshots with custom scale:

pdf-parse screenshot mydocument.pdf --scale 1.5 --output ./screenshots/

Generate screenshots with specific width:

pdf-parse screenshot mydocument.pdf --width 800 --output ./screenshots/

Table Extraction

Extract tables in JSON format:

pdf-parse table mydocument.pdf --format json --output tables.json

Extract tables from specific pages:

pdf-parse table mydocument.pdf --pages 2-4

Header Validation

Check PDF headers from URL:

pdf-parse check https://example.com/document.pdf

Check without magic byte validation:

pdf-parse check https://example.com/document.pdf --no-magic
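
The magic-byte validation mentioned above can be approximated for a local file. This is a minimal sketch, not the CLI's own implementation, which this document does not describe. A PDF file begins with the five bytes %PDF-, so comparing the start of a file against that signature catches files that are not PDFs at all. The helper name has_pdf_magic is hypothetical.

```shell
# has_pdf_magic FILE
# Succeeds (exit status 0) when FILE starts with the PDF signature "%PDF-".
# Hypothetical helper for illustration; it is not part of pdf-parse.
has_pdf_magic() {
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

if has_pdf_magic document.pdf; then
  echo "document.pdf starts with the PDF signature"
else
  echo "document.pdf is missing or does not look like a PDF"
fi
```

A matching signature only shows that the file claims to be a PDF; it says nothing about whether the rest of the file is intact.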

Large File Processing

For large PDF files (larger than 5 MB), use the --large flag to enable performance optimizations:

pdf-parse text https://example.com/large-document.pdf --large --pages 1-10
pdf-parse info https://example.com/huge-report.pdf --large

With the --large flag, the CLI:

  • Disables auto-fetching of additional pages
  • Uses chunk-based loading instead of streaming
  • Uses an optimized chunk size for range requests

Output Formats

Text Format (default)

Human-readable text output for most commands.

JSON Format

Structured data output using --format json.

Data URL Format

Base64-encoded data URLs for the image and screenshot commands, selected with --format dataurl.
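
A data URL carries the file bytes inline after a ";base64," marker, so it can be turned back into a binary file with standard tools. The sketch below assumes a single data URL on standard input; how the CLI lays out its dataurl output is not specified here, and decode_data_url is a hypothetical helper name.

```shell
# decode_data_url
# Reads a single base64 data URL from stdin and writes the decoded bytes to stdout.
# Hypothetical helper for illustration; it is not part of pdf-parse.
decode_data_url() {
  # Drop the "data:<mime type>;base64," prefix, then decode the remaining payload.
  # On some systems the decode flag is spelled -D or --decode instead of -d.
  sed 's/^data:[^,]*;base64,//' | base64 -d
}

# Self-contained demonstration with a tiny text payload (the word "hello").
printf 'data:text/plain;base64,aGVsbG8=\n' | decode_data_url
```

For an image, redirect the output to a file, for example decode_data_url < page1.txt > page1.png, where both file names are placeholders.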

Page Ranges

Specify page ranges using comma-separated values and ranges:

  • 1: Page 1
  • 1,3,5: Pages 1, 3, and 5
  • 1-5: Pages 1 through 5
  • 1,3-5,7: Pages 1, 3, 4, 5, and 7
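
To make the range syntax concrete, the following sketch expands a range string into individual page numbers. It illustrates the notation only and is not the CLI's parser. The name expand_pages is hypothetical, the helper relies on the seq utility, and it does not validate malformed input.

```shell
# expand_pages SPEC
# Prints one page number per line for a spec such as "1,3-5,7".
# Hypothetical helper for illustration; it is not part of pdf-parse.
expand_pages() {
  # Split on commas, then split each item on "-" into a start and an end page.
  printf '%s\n' "$1" | tr ',' '\n' | while IFS=- read -r start end; do
    # A single page such as "7" has no end part, so it expands to itself.
    seq "$start" "${end:-$start}"
  done
}

expand_pages "1,3-5,7"
```

Called with 1,3-5,7, it prints 1, 3, 4, 5, and 7 on separate lines, matching the last entry in the list above.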

Error Handling

The CLI reports an error message for common problems, including:

  • Invalid commands or options
  • Missing required arguments
  • File not found or inaccessible
  • Invalid page ranges
  • Network errors for URL-based operations
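
In scripts, errors like these are usually handled through the exit status. The sketch below runs the info command over every PDF in a directory and reports the files that failed. It assumes pdf-parse exits with a non-zero status when it reports an error, which this document does not state. The info_all helper and the ./reports directory are hypothetical.

```shell
# info_all DIR
# Runs "pdf-parse info" on each .pdf file in DIR and lists the ones that fail.
# ASSUMPTION: pdf-parse returns a non-zero exit status on error.
# Hypothetical helper for illustration; it is not part of pdf-parse.
info_all() {
  failed=0
  for pdf in "$1"/*.pdf; do
    [ -e "$pdf" ] || continue   # the glob matched nothing
    if ! pdf-parse info "$pdf"; then
      echo "failed: $pdf" >&2
      failed=$((failed + 1))
    fi
  done
  # Succeed only when every file was processed without error.
  [ "$failed" -eq 0 ]
}

info_all ./reports || echo "some files could not be processed" >&2
```

If the exit-status assumption does not hold for your version, the same loop can inspect the command's error output instead.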