
Commit 59a1eee

Contentmd (#57)
* Introduce content-md markdown format
* Include page separators in contentmd
* Handle converting existing json output to markdown
* Update documentation
1 parent 52568b7 commit 59a1eee

9 files changed

Lines changed: 1344 additions & 106 deletions


docs/howto/pdf_manipulation.md

Lines changed: 66 additions & 6 deletions
Original file line number | Diff line number | Diff line change
@@ -102,7 +102,7 @@ parxy pdf:merge file1.pdf file2.pdf -o /output/dir/merged.pdf
102102

103103
## Splitting PDFs
104104

105-
The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file.
105+
The `pdf:split` command divides a single PDF into individual pages, with each page becoming a separate PDF file. You can optionally limit which pages are extracted and combine them into a single output PDF.
106106

107107
### Basic Splitting
108108

@@ -139,6 +139,51 @@ Creates files named:
139139
- `chapter_page_2.pdf`
140140
- etc.
141141

142+
### Extracting a Page Range
143+
144+
Use `--pages` to limit which pages are extracted (1-based indexing):
145+
146+
**Single page:**
147+
```bash
148+
parxy pdf:split document.pdf --pages 3
149+
```
150+
151+
**Page range:**
152+
```bash
153+
parxy pdf:split document.pdf --pages 2:5
154+
```
155+
156+
**From start to page N:**
157+
```bash
158+
parxy pdf:split document.pdf --pages :5
159+
```
160+
161+
**From page N to end:**
162+
```bash
163+
parxy pdf:split document.pdf --pages 3:
164+
```
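The four `--pages` forms above can be read as one small grammar. The following is a minimal sketch of that reading, not Parxy's actual parser: the helper name `parse_page_range` is hypothetical, both ends are assumed to be inclusive, and the result uses the 0-based `from_page` / `to_page` convention from the Python tutorial changed in this same commit. `None` stands for an open side of the range.

```python
from typing import Optional, Tuple


def parse_page_range(spec: str) -> Tuple[Optional[int], Optional[int]]:
    """Convert a 1-based spec ('3', '2:5', ':5', '3:') to 0-based inclusive indices.

    Hypothetical helper for illustration only; None means "open" on that side.
    """

    def to_index(text: str) -> Optional[int]:
        text = text.strip()
        if not text:
            return None
        page = int(text)  # raises ValueError on non-numeric input
        if page < 1:
            raise ValueError(f"page numbers are 1-based, got {page}")
        return page - 1

    if ":" in spec:
        start_text, end_text = spec.split(":", 1)
        start, end = to_index(start_text), to_index(end_text)
    else:
        # A bare number selects exactly one page.
        start = end = to_index(spec)
        if start is None:
            raise ValueError("empty page specification")

    if start is not None and end is not None and start > end:
        raise ValueError(f"range start is after range end: {spec!r}")
    return start, end
```

Under these assumptions, `2:5` maps to indices 1 through 4, which matches the `from_page=1, to_page=4` example in the tutorial.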
165+
166+
### Combining Pages into a Single PDF
167+
168+
Use `--combine` to extract a page range into a single output PDF instead of one file per page:
169+
170+
```bash
171+
# Extract pages 2–5 as a single PDF (auto-named)
172+
parxy pdf:split document.pdf --pages 2:5 --combine
173+
# Output: document_pages_2-5.pdf (next to the input file)
174+
175+
# Specify a custom output path
176+
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
177+
178+
# Extract a single page as a PDF
179+
parxy pdf:split document.pdf --pages 3 --combine -o page3.pdf
180+
181+
# Combine all pages (equivalent to a copy)
182+
parxy pdf:split document.pdf --combine -o copy.pdf
183+
```
184+
185+
> **Tip:** `--combine` pairs well with `--pages` to replace the `pdf:merge file.pdf[2:5]` pattern when working with a single source file.
186+
142187
### Complete Examples
143188

144189
**Split with custom output directory:**
@@ -161,14 +206,25 @@ Creates:
161206
parxy pdf:split document.pdf -o ./individual_pages -p page
162207
```
163208

209+
**Extract pages 10–20 as individual files:**
210+
```bash
211+
parxy pdf:split document.pdf --pages 10:20 -o ./extracted_pages
212+
```
213+
164214
## Combining Merge and Split
165215

166216
You can chain operations together using the CLI:
167217

168218
**Example: Extract specific pages and split them:**
169219
```bash
170-
# First, extract pages 10-20
171-
parxy pdf:merge document.pdf[10:20] -o extracted.pdf
220+
# Extract pages 10-20 as individual files
221+
parxy pdf:split document.pdf --pages 10:20 -o ./individual_pages
222+
```
223+
224+
**Example: Extract a range into a single PDF, then split:**
225+
```bash
226+
# First, extract pages 10-20 into one PDF
227+
parxy pdf:split document.pdf --pages 10:20 --combine -o extracted.pdf
172228

173229
# Then split into individual pages
174230
parxy pdf:split extracted.pdf -o ./individual_pages
@@ -232,17 +288,21 @@ parxy pdf:split INPUT_FILE [OPTIONS]
232288
```
233289

234290
**Arguments:**
235-
- `INPUT_FILE`: PDF file to split into individual pages
291+
- `INPUT_FILE`: PDF file to split
236292

237293
**Options:**
238-
- `--output, -o`: Output directory (default: `{filename}_split/`)
239-
- `--prefix, -p`: Output filename prefix (default: input filename)
294+
- `--output, -o`: Without `--combine`: output directory (default: `{filename}_split/`). With `--combine`: output file path (default: `{filename}_pages_{from}-{to}.pdf` next to the input).
295+
- `--prefix, -p`: Output filename prefix for individual split files (default: input filename)
296+
- `--pages`: Page range to extract, 1-based. Formats: `3` (single page), `2:5` (range), `:5` (up to page 5), `3:` (from page 3 to end)
297+
- `--combine`: Combine extracted pages into a single PDF instead of one file per page
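The two `--output` defaults listed above can be spelled out with `pathlib`. This is only a sketch of the documented naming patterns: `default_split_output` is a hypothetical helper, not part of Parxy, and it assumes that `{filename}` means the input file's stem and that the `_split` directory is created next to the input (the documentation only states the location for the `--combine` case).

```python
from pathlib import Path


def default_split_output(
    input_file: Path,
    combine: bool = False,
    first_page: int = 1,
    last_page: int = 1,
) -> Path:
    """Return the documented default output location (page numbers are 1-based).

    Hypothetical helper; the directory location for the non-combine case
    is an assumption.
    """
    if combine:
        # Documented pattern: {filename}_pages_{from}-{to}.pdf next to the input
        name = f"{input_file.stem}_pages_{first_page}-{last_page}.pdf"
        return input_file.with_name(name)
    # Documented pattern: {filename}_split/
    return input_file.parent / f"{input_file.stem}_split"
```

For `document.pdf` with `--pages 2:5 --combine`, this yields `document_pages_2-5.pdf`, the name shown in the how-to example. How open-ended ranges such as `:5` are named is not documented, so the sketch requires both page numbers.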
240298

241299
**Examples:**
242300
```bash
243301
parxy pdf:split document.pdf
244302
parxy pdf:split document.pdf -o ./pages
245303
parxy pdf:split document.pdf -o ./pages -p page
304+
parxy pdf:split document.pdf --pages 2:5
305+
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
246306
```
247307

248308
## Getting Help

docs/tutorials/pdf_manipulation.md

Lines changed: 47 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -81,6 +81,44 @@ for page_path in pages:
8181
# ...
8282
```
8383

84+
You can limit splitting to a page range using 0-based `from_page` / `to_page` indices:
85+
86+
```python
87+
# Split only pages 2–5 (0-based: indices 1–4)
88+
pages = Parxy.pdf.split(
89+
    input_path=Path("document.pdf"),
90+
    output_dir=Path("./pages"),
91+
    prefix="doc",
92+
    from_page=1,
93+
    to_page=4,
94+
)
95+
# Creates: doc_page_2.pdf, doc_page_3.pdf, doc_page_4.pdf, doc_page_5.pdf
96+
```
97+
98+
### Extracting Pages into a Single PDF
99+
100+
Use `extract_pages` to pull a page range from a PDF into a new single-file PDF without splitting each page individually:
101+
102+
```python
103+
from pathlib import Path
104+
from parxy_core.services.pdf_service import PdfService
105+
106+
# Extract pages 3–7 (0-based: indices 2–6)
107+
PdfService.extract_pages(
108+
    input_path=Path("report.pdf"),
109+
    output_path=Path("summary.pdf"),
110+
    from_page=2,
111+
    to_page=6,
112+
)
113+
```
114+
115+
Omit `from_page` / `to_page` to copy all pages:
116+
117+
```python
118+
# Equivalent to a copy
119+
PdfService.extract_pages(Path("original.pdf"), Path("copy.pdf"))
120+
```
121+
84122
### Optimizing PDFs
85123

86124
Reduce PDF file size using compression techniques:
@@ -302,6 +340,12 @@ try:
302340
except FileNotFoundError as e:
303341
    print(f"File not found: {e}")
304342

343+
# ValueError for invalid page ranges
344+
try:
345+
    Parxy.pdf.split(Path("doc.pdf"), Path("./out"), "doc", from_page=100)
346+
except ValueError as e:
347+
    print(f"Invalid page range: {e}")
348+
305349
# ValueError for invalid parameters
306350
try:
307351
    Parxy.pdf.optimize(
@@ -332,7 +376,8 @@ except RuntimeError as e:
332376
In this tutorial you learned:
333377

334378
- **`Parxy.pdf.merge()`** - Combine multiple PDFs with optional page ranges
335-
- **`Parxy.pdf.split()`** - Split a PDF into individual page files
379+
- **`Parxy.pdf.split()`** - Split a PDF into individual page files, with optional page range
380+
- **`PdfService.extract_pages()`** - Extract a page range into a single output PDF
336381
- **`Parxy.pdf.optimize()`** - Reduce file size with compression options
337382
- **`PdfService` context manager** - Work with attachments (add, list, extract, remove)
338383

@@ -344,6 +389,7 @@ In this tutorial you learned:
344389
| Splitting into pages | Extracting attachment content |
345390
| Optimizing file size | Multiple operations on one file |
346391
| One-shot operations | Need fine-grained control |
392+
| Splitting a page range | Extracting a page range into one PDF (`extract_pages`) |
347393

348394
## Next Steps
349395

docs/tutorials/using_cli.md

Lines changed: 55 additions & 5 deletions
Original file line numberDiff line numberDiff line change
@@ -14,7 +14,7 @@ The Parxy CLI lets you:
1414
| `parxy preview` | Interactive document viewer with metadata, table of contents, and scrollable content preview |
1515
| `parxy markdown` | Convert documents to Markdown files, with support for multiple drivers and folder processing |
1616
| `parxy pdf:merge`| Merge multiple PDF files into one, with support for page ranges |
17-
| `parxy pdf:split`| Split a PDF file into individual pages |
17+
| `parxy pdf:split`| Split a PDF into individual pages, with optional page range and single-file extraction |
1818
| `parxy drivers` | List available document processing drivers |
1919
| `parxy env` | Generate a default `.env` configuration file |
2020
| `parxy docker` | Create a Docker Compose configuration for running Parxy-related services |
@@ -218,6 +218,42 @@ parxy markdown document.pdf -d pymupdf -d llamaparse
218218

219219
This produces `pymupdf-document.md` and `llamaparse-document.md`.
220220

221+
### Converting Pre-parsed JSON Results
222+
223+
If you have a JSON file produced by `parxy parse -m json`, you can convert it to Markdown directly without re-parsing:
224+
225+
```bash
226+
parxy markdown result.json
227+
```
228+
229+
This loads the `Document` model from the JSON and converts it immediately — no driver or API call required. You can mix JSON files and PDF files in the same invocation:
230+
231+
```bash
232+
parxy markdown result.json document.pdf -d pymupdf -o output/
233+
```
234+
235+
### Page Separator Comments
236+
237+
Use `--page-separators` to insert HTML comments before each page's content:
238+
239+
```bash
240+
parxy markdown document.pdf --page-separators
241+
```
242+
243+
Output will contain markers like:
244+
245+
```markdown
246+
<!-- page: 1 -->
247+
248+
First page content...
249+
250+
<!-- page: 2 -->
251+
252+
Second page content...
253+
```
254+
255+
This is useful for post-processing scripts that need to identify page boundaries.
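As an illustration of such a post-processing step, here is a minimal sketch that splits Markdown on these comments. It assumes the markers look exactly like the example above (`<!-- page: N -->` alone on a line); `split_pages` is a hypothetical helper, not part of Parxy, and any text before the first marker is ignored.

```python
import re
from typing import Dict

# Matches a marker such as "<!-- page: 3 -->" on a line of its own.
PAGE_MARKER = re.compile(r"^<!--\s*page:\s*(\d+)\s*-->[ \t]*$", re.MULTILINE)


def split_pages(markdown: str) -> Dict[int, str]:
    """Map page number -> that page's Markdown, based on the marker comments."""
    pages: Dict[int, str] = {}
    matches = list(PAGE_MARKER.finditer(markdown))
    for i, match in enumerate(matches):
        # A page runs from the end of its marker to the start of the next one.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        pages[int(match.group(1))] = markdown[match.end():end].strip()
    return pages


sample = (
    "<!-- page: 1 -->\n\nFirst page content...\n\n"
    "<!-- page: 2 -->\n\nSecond page content...\n"
)
pages = split_pages(sample)
```

Here `pages[1]` holds the first page's text and `pages[2]` the second, so a script can process or export pages individually.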
256+
221257
### Inline Output
222258

223259
Use `--inline` with a single file to print markdown directly to stdout with a YAML frontmatter header — useful for shell pipelines:
@@ -276,7 +312,7 @@ parxy pdf:merge cover.pdf /chapters doc.pdf[10:20] appendix.pdf -o book.pdf
276312

277313
### Splitting PDFs
278314

279-
The `pdf:split` command divides a PDF file into individual pages, with each page becoming a separate PDF file.
315+
The `pdf:split` command divides a PDF file into individual pages, with optional page range extraction and single-file output.
280316

281317
**Split into individual pages:**
282318
```bash
@@ -290,7 +326,21 @@ This creates a `document_split/` folder containing `document_page_1.pdf`, `docum
290326
parxy pdf:split report.pdf -o ./pages -p page
291327
```
292328

293-
Creates `page_1.pdf`, `page_2.pdf`, etc. in the `./pages` directory.
329+
**Extract a page range as individual files:**
330+
```bash
331+
parxy pdf:split document.pdf --pages 2:5 -o ./pages
332+
```
333+
334+
**Combine a page range into a single PDF:**
335+
```bash
336+
# Auto-named output next to the input file
337+
parxy pdf:split document.pdf --pages 2:5 --combine
338+
339+
# Custom output path
340+
parxy pdf:split document.pdf --pages 2:5 --combine -o extracted.pdf
341+
```
342+
343+
Page range formats (1-based): `3` · `2:5` · `:5` · `3:`
294344

295345
For more detailed examples and use cases, see the [PDF Manipulation How-to Guide](../howto/pdf_manipulation.md).
296346

@@ -358,9 +408,9 @@ With the CLI, you can use Parxy as a **standalone document parsing tool** — id
358408
|------------------|--------------------------------------------------------------|
359409
| `parxy parse` | Extract text from documents with multiple formats & drivers |
360410
| `parxy preview` | Interactive document viewer with metadata and TOC |
361-
| `parxy markdown` | Generate Markdown files with driver prefix naming |
411+
| `parxy markdown` | Generate Markdown files; accepts JSON results and supports `--page-separators` |
362412
| `parxy pdf:merge`| Merge multiple PDF files with page range support |
363-
| `parxy pdf:split`| Split PDF files into individual pages |
413+
| `parxy pdf:split`| Split PDF into individual pages; supports `--pages` and `--combine` |
364414
| `parxy drivers` | List supported drivers |
365415
| `parxy env` | Create default configuration file |
366416
| `parxy docker` | Generate Docker Compose setup |
