CLI tool to extract individual chapters from PDF books using regex pattern matching on bookmarks. Built with Go and pdfcpu.

oueslati1990/Book-Chapter-Extractor

Book-Chapter-Extractor

A command-line tool written in Go that extracts a specific chapter from PDF books based on bookmark pattern matching and saves it as a separate PDF file.

Features

  • Extract a single chapter from PDF files using bookmark matching
  • Regex pattern matching for flexible chapter selection
  • Preserves PDF structure and formatting
  • Verbose output mode for detailed logging
  • Clean, modular architecture

Installation

Prerequisites

  • Go 1.24.2 or higher

Build from source

git clone https://github.com/oueslati1990/Book-Chapter-Extractor.git
cd Book-Chapter-Extractor
go build -o pdf-chapter-extractor ./cmd/main.go

Usage

./pdf-chapter-extractor --input <pdf-file> --pattern <regex-pattern> [options]

Options

Flag        Short   Description                                          Default
--input     -i      Input PDF file (required)                            -
--output    -o      Output directory for extracted chapter               Chapters
--pattern   -p      Regex pattern to match chapter bookmark (required)   -
--verbose   -v      Enable verbose output                                false
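
For readers curious how long and short flags with these defaults could be wired up, here is a minimal sketch using only Go's standard flag package. It is an assumption, not the project's actual code: the options struct and parseArgs function are hypothetical names, and the real tool may use a different flag library.

```go
package main

import (
	"errors"
	"flag"
	"fmt"
)

// options is a hypothetical container for the documented flags.
type options struct {
	input   string
	output  string
	pattern string
	verbose bool
}

// parseArgs binds each long flag and its short form to the same field,
// then enforces the two required flags.
func parseArgs(args []string) (options, error) {
	var o options
	fs := flag.NewFlagSet("pdf-chapter-extractor", flag.ContinueOnError)
	fs.StringVar(&o.input, "input", "", "input PDF file (required)")
	fs.StringVar(&o.input, "i", "", "short for --input")
	fs.StringVar(&o.output, "output", "Chapters", "output directory")
	fs.StringVar(&o.output, "o", "Chapters", "short for --output")
	fs.StringVar(&o.pattern, "pattern", "", "regex for the chapter bookmark (required)")
	fs.StringVar(&o.pattern, "p", "", "short for --pattern")
	fs.BoolVar(&o.verbose, "verbose", false, "enable verbose output")
	fs.BoolVar(&o.verbose, "v", false, "short for --verbose")
	if err := fs.Parse(args); err != nil {
		return o, err
	}
	if o.input == "" || o.pattern == "" {
		return o, errors.New("both --input and --pattern are required")
	}
	return o, nil
}

func main() {
	// Sample arguments stand in for os.Args[1:] so the sketch runs on its own.
	o, err := parseArgs([]string{"-i", "book.pdf", "--pattern", "^Chapter 3:"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("%+v\n", o)
}
```

The standard flag package treats one and two leading dashes as equivalent, so -input and --input behave the same in this sketch.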

Important Note

The pattern must match exactly one bookmark. If multiple bookmarks match the pattern, the tool will return an error listing all matches, and you'll need to provide a more specific pattern.
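
The exactly-one-match rule can be illustrated with a short sketch. This is not the tool's source: matchOne is a hypothetical helper, the error messages are invented for illustration, and the sketch assumes unanchored matching with Go's regexp package.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// matchOne returns the single title matched by pattern. It fails when the
// pattern is invalid, matches nothing, or matches more than one title.
func matchOne(titles []string, pattern string) (string, error) {
	re, err := regexp.Compile(pattern)
	if err != nil {
		return "", fmt.Errorf("invalid pattern %q: %w", pattern, err)
	}
	var matches []string
	for _, t := range titles {
		if re.MatchString(t) {
			matches = append(matches, t)
		}
	}
	switch len(matches) {
	case 0:
		return "", fmt.Errorf("no bookmark matches %q", pattern)
	case 1:
		return matches[0], nil
	default:
		return "", fmt.Errorf("pattern %q matches %d bookmarks: %s",
			pattern, len(matches), strings.Join(matches, "; "))
	}
}

func main() {
	titles := []string{"Chapter 1: Introduction", "Chapter 10: Appendix", "Chapter 2: Setup"}
	// "Chapter 1" is ambiguous here; anchoring and adding the colon narrows it down.
	for _, p := range []string{"Chapter 1", "^Chapter 1:"} {
		title, err := matchOne(titles, p)
		if err != nil {
			fmt.Println("error:", err)
			continue
		}
		fmt.Println("matched:", title)
	}
}
```

In this sample, "Chapter 1" also matches "Chapter 10: Appendix", which is the kind of ambiguity the note above describes.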

Examples

Extract a specific chapter by its full title:

./pdf-chapter-extractor -i book.pdf -p "Chapter 1: Introduction"

Extract a chapter into a custom output directory with verbose output:

./pdf-chapter-extractor -i book.pdf -p "Chapter 5" -o output_folder -v

Anchor the pattern to the start of the title so that it matches only one chapter:

./pdf-chapter-extractor -i book.pdf -p "^Chapter 3:"

Extract a chapter whose title contains regex metacharacters (the dot is escaped so that it matches literally):

./pdf-chapter-extractor -i book.pdf -p "Chapter 2\\.1"
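
Why the escaping matters can be seen in a small sketch, assuming the tool compiles patterns with Go's standard regexp package (likely for a Go program, but not confirmed by this README). The helpers literalPattern and matches are hypothetical names used only for illustration.

```go
package main

import (
	"fmt"
	"regexp"
)

// literalPattern builds an anchored pattern that matches title and nothing else.
// regexp.QuoteMeta escapes every regex metacharacter in the title.
func literalPattern(title string) string {
	return "^" + regexp.QuoteMeta(title) + "$"
}

// matches reports whether pattern matches anywhere in title.
func matches(pattern, title string) bool {
	return regexp.MustCompile(pattern).MatchString(title)
}

func main() {
	titles := []string{"Chapter 2.1", "Chapter 201"}
	patterns := []string{"Chapter 2.1", `Chapter 2\.1`, literalPattern("Chapter 2.1")}
	for _, p := range patterns {
		for _, t := range titles {
			fmt.Printf("pattern %q, title %q: %v\n", p, t, matches(p, t))
		}
	}
}
```

An unescaped dot matches any character, so "Chapter 2.1" would also match a bookmark titled "Chapter 201". Quoting the pattern with single quotes in the shell, as in 'Chapter 2\.1', avoids the doubled backslash.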

How it works

  1. The tool reads the PDF file and extracts all bookmarks (the document outline, i.e. its table of contents).
  2. It flattens nested bookmarks so that every level is searched.
  3. It matches each bookmark title against the provided regex pattern.
  4. If exactly one bookmark matches, it determines that chapter's page range.
  5. It creates a new PDF file containing only those pages.
  6. If no bookmark or more than one bookmark matches, it stops with an error; for multiple matches, the error lists all matching titles.
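
The flattening step and the page-range lookup can be sketched as follows. The Bookmark type here is a hypothetical stand-in with illustrative field names, not the project's or pdfcpu's actual type, and the sketch assumes each bookmark already carries its first and last page.

```go
package main

import "fmt"

// Bookmark is a hypothetical outline entry used only for this sketch.
type Bookmark struct {
	Title    string
	PageFrom int // first page of the section (1-based)
	PageThru int // last page of the section (inclusive)
	Kids     []Bookmark
}

// flatten walks the outline depth-first and returns every bookmark,
// each parent followed by its descendants.
func flatten(bms []Bookmark) []Bookmark {
	var out []Bookmark
	for _, b := range bms {
		out = append(out, b)
		out = append(out, flatten(b.Kids)...)
	}
	return out
}

// pageSelection renders the inclusive page range of a bookmark as "from-thru".
func pageSelection(b Bookmark) string {
	return fmt.Sprintf("%d-%d", b.PageFrom, b.PageThru)
}

func main() {
	outline := []Bookmark{
		{Title: "Chapter 1", PageFrom: 1, PageThru: 12, Kids: []Bookmark{
			{Title: "1.1 Background", PageFrom: 2, PageThru: 6},
			{Title: "1.2 Scope", PageFrom: 7, PageThru: 12},
		}},
		{Title: "Chapter 2", PageFrom: 13, PageThru: 30},
	}
	for _, b := range flatten(outline) {
		fmt.Printf("%s: pages %s\n", b.Title, pageSelection(b))
	}
}
```

Flattening first keeps the matching step simple: the regex only has to be tested against a plain slice of titles, regardless of how deeply the outline is nested.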

Project Structure

Book-Chapter-Extractor/
├── cmd/
│   └── main.go                 # CLI entry point
├── internal/
│   ├── bookmark/
│   │   └── bookmark.go         # Bookmark extraction and processing
│   └── extractor/
│       └── extractor.go        # Chapter extraction logic
├── go.mod
├── go.sum
└── README.md

Dependencies

  • pdfcpu - PDF processing library

Error Handling

The tool handles several error cases:

  • Missing input file
  • Invalid regex patterns
  • PDF files without bookmarks
  • Multiple bookmarks matching the pattern (requires more specific pattern)
  • No bookmarks matching the pattern

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

License

This project is open source and available under the MIT License.

Author

oueslati1990
