A command-line interface for extracting data from PDF files using the pdf-parse library.
The CLI tool is included with the pdf-parse package. If you have pdf-parse installed, the CLI is available as pdf-parse.
To install the CLI globally:
npm install -g pdf-parse
To update to the latest version:
npm update -g pdf-parse
To remove the CLI tool:
npm uninstall -g pdf-parse
Usage:
pdf-parse <command> <file> [options]
Here <file> can be a local PDF file path or, for certain commands, a URL.
Check PDF file headers and validate format. Only works with URLs.
pdf-parse check https://example.com/document.pdf
Extract PDF metadata and information.
pdf-parse info document.pdf
Extract text content from PDF pages.
pdf-parse text document.pdf --pages 1-3
Extract embedded images from PDF pages.
pdf-parse image document.pdf --output ./images/
Generate screenshots of PDF pages.
pdf-parse screenshot document.pdf --output ./screenshots/ --scale 2.0
Extract tabular data from PDF pages.
pdf-parse table document.pdf --format json
-o, --output <file>: Output file path (for single file) or directory (for multiple files)
-p, --pages <range>: Page range (e.g., 1,3-5,7)
-f, --format <format>: Output format (json, text, dataurl)
-m, --min <px>: Minimum image size threshold in pixels (default: 80)
-s, --scale <factor>: Scale factor for screenshots (default: 1.0)
-w, --width <px>: Desired width for screenshots in pixels
-l, --large: Enable optimizations for large PDF files
--magic: Validate PDF magic bytes
-h, --help: Show help message
-v, --version: Show version number
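For context on the magic-byte option: a PDF file begins with the signature %PDF- followed by a version number, and that signature is what a magic-byte check looks for. The sketch below applies the same idea to a local file. The helper name has_pdf_magic is hypothetical, and this is not how the CLI itself implements its check, which is not described here.

```shell
# has_pdf_magic FILE: succeed if FILE starts with the PDF signature "%PDF-".
# Hypothetical helper for local files; the CLI's own check may differ.
has_pdf_magic() {
  # Read only the first five bytes and compare them with the signature.
  [ "$(head -c 5 "$1" 2>/dev/null)" = "%PDF-" ]
}

# Example use (the file name is a placeholder):
#   has_pdf_magic document.pdf && pdf-parse info document.pdf
```

Checking the signature first is a cheap way to skip files that are clearly not PDFs before handing them to a heavier command.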
Get PDF information:
pdf-parse info mydocument.pdf
Extract text from specific pages:
pdf-parse text mydocument.pdf --pages 1,3-5
Extract all images to a directory:
pdf-parse image mydocument.pdf --output ./extracted-images/
Extract images with minimum size filter:
pdf-parse image mydocument.pdf --min 100 --output ./images/
Generate screenshots with custom scale:
pdf-parse screenshot mydocument.pdf --scale 1.5 --output ./screenshots/
Generate screenshots with specific width:
pdf-parse screenshot mydocument.pdf --width 800 --output ./screenshots/
Extract tables in JSON format:
pdf-parse table mydocument.pdf --format json --output tables.json
Extract tables from specific pages:
pdf-parse table mydocument.pdf --pages 2-4
Check PDF headers from URL:
pdf-parse check https://example.com/document.pdf
Check without magic byte validation:
pdf-parse check https://example.com/document.pdf --no-magic
For large PDF files (> 5MB), use the --large flag to enable performance optimizations:
pdf-parse text https://example.com/large-document.pdf --large --pages 1-10
pdf-parse info https://example.com/huge-report.pdf --large
The --large flag enables:
- Disable auto-fetching of additional pages
- Chunk-based loading instead of streaming
- Optimized range request chunk size
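To illustrate what chunk-based loading over range requests means in general, the sketch below splits a file of a known size into the byte ranges that separate HTTP Range requests would ask for. The helper name byte_ranges and the 65536-byte chunk size are assumptions made for illustration. The chunk size the CLI actually uses is not stated here.

```shell
# byte_ranges TOTAL CHUNK: print the "start-end" byte ranges that cover a
# TOTAL-byte file in CHUNK-byte pieces, in the form used by HTTP Range headers.
# Hypothetical helper; not part of pdf-parse.
byte_ranges() {
  total=$1
  chunk=$2
  start=0
  while [ "$start" -lt "$total" ]; do
    end=$((start + chunk - 1))
    # The final chunk is usually shorter, so clamp it to the last byte.
    if [ "$end" -ge "$total" ]; then
      end=$((total - 1))
    fi
    echo "$start-$end"
    start=$((end + 1))
  done
}

# A 250000-byte file in 65536-byte chunks needs four ranges.
byte_ranges 250000 65536
```

Each printed range could be fetched on its own, for example with curl's --range option, so only the parts of a large file that are needed have to be downloaded.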
text: Human-readable text output, available for most commands.
json: Structured data output, selected with --format json.
dataurl: Base64-encoded data URLs for the image and screenshot commands, selected with --format dataurl.
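A data URL is a MIME type and a Base64 payload joined into one string. The sketch below builds one from a local file to show the general shape of such output. The helper name to_data_url is hypothetical, and the exact MIME type and layout the CLI emits are not specified here, so treat image/png in the example as an assumption.

```shell
# to_data_url MIME FILE: print FILE as a Base64 data URL with the given MIME type.
# Hypothetical helper; not part of pdf-parse.
to_data_url() {
  # base64 may wrap its output across lines, so strip the newlines.
  printf 'data:%s;base64,%s\n' "$1" "$(base64 < "$2" | tr -d '\n')"
}

# Example use (file name and MIME type are placeholders):
#   to_data_url image/png page-1.png
```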
Specify page ranges using comma-separated values and ranges:
1: Page 1
1,3,5: Pages 1, 3, and 5
1-5: Pages 1 through 5
1,3-5,7: Pages 1, 3, 4, 5, and 7
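The page range syntax above can be read as a comma-separated list in which each item is either a single page or a start-end pair. The following sketch expands such a specification into individual page numbers. The helper name expand_pages is hypothetical, it is not the CLI's own parser, and it does no validation of its input.

```shell
# expand_pages SPEC: print the page numbers named by a spec such as "1,3-5,7",
# one per line. Hypothetical helper; not part of pdf-parse.
expand_pages() {
  spec=$1
  # Split the spec on commas into the positional parameters.
  old_ifs=$IFS
  IFS=,
  set -- $spec
  IFS=$old_ifs
  for part in "$@"; do
    case $part in
      *-*)
        # "3-5" is a range: take the text before and after the hyphen.
        start=${part%-*}
        end=${part#*-}
        ;;
      *)
        # A bare number is a range of one page.
        start=$part
        end=$part
        ;;
    esac
    i=$start
    while [ "$i" -le "$end" ]; do
      echo "$i"
      i=$((i + 1))
    done
  done
}

# Lists pages 1, 3, 4, 5 and 7, one per line.
expand_pages "1,3-5,7"
```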
The CLI tool provides clear error messages for common issues:
- Invalid commands or options
- Missing required arguments
- File not found or inaccessible
- Invalid page ranges
- Network errors for URL-based operations
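When the CLI is called from a script, two of the issues listed above (an unreadable file and a malformed page range) can be caught before the tool runs. The sketch below shows one way to do that. The helper name check_args and its error messages are hypothetical and are not the messages printed by the CLI.

```shell
# check_args FILE PAGES: report an unreadable file or a malformed page range.
# Returns 0 when both arguments look usable, 1 otherwise.
# Hypothetical helper; not part of pdf-parse.
check_args() {
  file=$1
  pages=$2
  case $file in
    http://*|https://*)
      # URLs are left for the CLI to fetch and report on.
      ;;
    *)
      if [ ! -r "$file" ]; then
        echo "error: cannot read file: $file" >&2
        return 1
      fi
      ;;
  esac
  # Accept comma-separated pages and ranges such as 1,3-5,7.
  if ! printf '%s\n' "$pages" | grep -Eq '^[0-9]+(-[0-9]+)?(,[0-9]+(-[0-9]+)?)*$'; then
    echo "error: invalid page range: $pages" >&2
    return 1
  fi
  return 0
}

# Example use (the file name is a placeholder):
#   check_args report.pdf 1,3-5 && pdf-parse text report.pdf --pages 1,3-5
```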