Commit 31fdec8

Author: moresunlight
New feature html to PDF from downloaded confluence pages
1 parent c5e8d9b commit 31fdec8

6 files changed

Lines changed: 589 additions & 117 deletions

CHANGELOG.md

Lines changed: 31 additions & 4 deletions
@@ -4,20 +4,47 @@ All notable changes to this project will be documented in this file.
44

55
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
66

7+
## \[2.6.0\] - 2025-01-26
8+
9+
Added professional PDF publication capabilities.
10+
11+
### Added
12+
13+
- **PDF Generator:** Introduced `htmlToPDF.py`. Converts the dumped HTML structure into a single, hierarchical PDF document.
14+
15+
- **PDF Features:**
16+
17+
- **Smart Splitting:** Option `--split-by-root` to generate separate PDFs for each top-level folder (scalable for 4000+ pages).
18+
19+
- **Mixed Orientation:** Supports mixing Portrait and Landscape pages within the same PDF based on source HTML hints.
20+
21+
- **Bookmarks:** Generates PDF Outlines/Bookmarks matching the sidebar structure.
22+
23+
- **Link Rewriting:** Converts HTML links to internal PDF anchors for seamless navigation.
24+
25+
- **PDF Configuration:** Auto-generates `styles/pdf_settings.css` for user-definable page layouts (A4/Letter, Margins).
26+
27+
728
## \[2.5.0\] - 2025-11-22
829

930
Introduction of the "Architecture Sandbox" for offline restructuring.
1031

1132
### Added
1233

13-
- **Architecture Sandbox:** Introduced `create_editor.py` and `patch_sidebar.py`. Users can now generate a visual Drag & Drop editor (`editor_sidebar.html`) to restructure the exported documentation offline and apply changes massively using the patcher.
34+
- **Architecture Sandbox:** Introduced `create_editor.py` and `patch_sidebar.py`. Users can now generate a visual Drag & Drop editor (`editor_sidebar.html`) to restructure the exported documentation offline.
1435

15-
- **Robust Editor Generation:** The editor generator now uses a safe string concatenation approach to avoid syntax errors and supports creating a working copy of the sidebar structure (`sidebar_edit.md`).
36+
- **Editor Features:**
1637

38+
- **Zero-Dependency:** The editor is a self-contained HTML file requiring no internet access.
39+
40+
- **Drag & Drop:** Robust reordering of pages and folders.
41+
42+
- **Working Copy:** Supports a `sidebar_edit.md` workflow to keep the original structure safe.
43+
1744

1845
### Changed
1946

20-
- **CSS Strategy:** Refined the "Two-Layer" styling approach (Standard + Custom) to be more robust in the documentation and implementation.
47+
- **CSS Strategy:** Refined the "Two-Layer" styling approach (Standard + Custom) to be more robust.
2148

2249

2350
## \[2.4.1\] - 2025-11-21
@@ -95,7 +122,7 @@ Introduction of Static Sidebar Injection.
95122

96123
- **Smart Linking:** Improved detection of dead/external links vs. local links based on the inventory.
97124

98-
- **CSS Auto-Discovery:** The script automatically detects and applies `site.css` from the local `styles/` folder.
125+
- **CSS Auto-Discovery:** The script automatically detects and applies `site.css` from the local `styles/` directory.
99126

100127
- **Multi-CSS Support:** Allows layering multiple CSS files (Standard + Custom).
101128

README.md

Lines changed: 91 additions & 100 deletions
@@ -1,181 +1,172 @@
11
# Confluence Dump with Python
22

3-
This script exports content from a Confluence instance (Cloud or Data Center) using various modes.
3+
This script exports content from a Confluence instance (Cloud or Data Center) using various modes. It creates a static, navigable HTML archive that can be optionally converted into a structured PDF.
44

55
**Key Features:**
66

7-
- **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages, including a **fully functional, static navigation sidebar** on the left—something even the standard Confluence export does not provide.
7+
- **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages (`export_view`), including a **fully functional, static navigation sidebar** on the left.
88

9-
- **Offline Browsing:** Localizes images and links, and downloads **all** attachments (PDFs, Office docs, etc.) for complete offline access.
9+
- **Offline Browsing:** Localizes images and links, and downloads **all** attachments for complete offline access.
1010

11-
- **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) in the sidebar.
11+
- **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) is preserved.
1212

13-
- **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page.
13+
- **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page.
1414

15-
- **Versioning:** Automatically creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management. This allows you to run the script repeatedly (e.g., after changes in Confluence) and maintain a history of snapshots without overwriting previous exports.
15+
- **Versioning:** Creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management.
1616

17-
- **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces.
17+
- **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces.
1818

19-
- **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`.
19+
- **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`.
2020

21-
- **Index Sandbox:** Includes visual tools to manually restructure the navigation tree via Drag & Drop and apply it to the downloaded files without affecting Confluence.
21+
- **Architecture Sandbox:** Tools to manually restructure the navigation tree via Drag & Drop before generating the final output.
22+
23+
- **Professional PDF Output:** Converts the (restructured) HTML dump into a single, hierarchical PDF with Table of Contents, Bookmarks, and mixed Portrait/Landscape orientation.
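The timestamped folder naming described under **Versioning** can be sketched as follows. This is a minimal illustration, not the script's actual code: the helper name `snapshot_dir` and its parameters are hypothetical, and the format string is inferred only from the example name `2025-11-21 1400 Space IT`.

```python
from datetime import datetime
from pathlib import Path
from typing import Optional


def snapshot_dir(outdir: str, label: str, now: Optional[datetime] = None) -> Path:
    """Build a timestamped subfolder path below the output directory.

    Hypothetical helper: the real script may assemble the name differently.
    """
    # "YYYY-MM-DD HHMM" prefix, so repeated runs sort chronologically
    stamp = (now or datetime.now()).strftime("%Y-%m-%d %H%M")
    return Path(outdir) / f"{stamp} {label}"


# Fixed timestamp so the result is reproducible
print(snapshot_dir("./output", "Space IT", datetime(2025, 11, 21, 14, 0)).name)
```

Because every run gets its own folder, an earlier snapshot is not overwritten when the export is repeated.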
2224

2325

2426
## Platform Support
2527

2628
This script supports both:
2729

28-
- **Confluence Cloud**
30+
- **Confluence Cloud**
2931

30-
- **Confluence Data Center**
32+
- **Confluence Data Center**
3133

3234

3335
The platform-specific API paths and authentication methods are defined in the `confluence_products.ini` file.
3436

35-
> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and tested against a **Confluence Data Center** environment.
37+
> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and rigorously tested against a **Confluence Data Center** environment.
3638
>
3739
> While the logic remains consistent with the previous version, the Cloud mode has **not yet been verified in a live environment** by the current maintainer due to lack of access. If you encounter issues with Cloud authentication or API paths, please open an issue or submit a Pull Request.
3840
39-
## Missing Features / Ideas
40-
41-
- **Incremental Update:** Currently, the script always performs a full export. An update mode that only downloads changed pages would be a valuable addition.
42-
43-
4441
## Requirements
4542

46-
- Python 3.x
43+
- Python 3.x
4744

48-
- `requests`, `beautifulsoup4`, `tqdm`
45+
- `requests`, `beautifulsoup4`, `tqdm`
4946

50-
- `pypandoc` (optional, only needed for RST export)
51-
52-
53-
pip install -r requirements.txt
47+
- `pypandoc` (optional, only needed for RST export)
5448

49+
- `weasyprint` (optional, only needed for PDF export)
5550

5651

52+
```
53+
pip install -r requirements.txt
54+
```
55+
5756
## Authentication
5857

5958
Authentication is handled via environment variables, based on the profile you select.
6059

6160
### For Confluence Cloud (`--profile cloud`)
6261

63-
export CONFLUENCE_USER="your-email@example.com"
64-
export CONFLUENCE_TOKEN="YourApiTokenHere"
65-
66-
62+
```
63+
export CONFLUENCE_USER="your-email@example.com"
64+
export CONFLUENCE_TOKEN="YourApiTokenHere"
65+
```
6766

6867
### For Confluence Data Center (`--profile dc`)
6968

70-
export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere"
71-
72-
69+
```
70+
export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere"
71+
```
7372

7473
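**⚠️ Troubleshooting Note for Data Center:** If authentication fails (for example, because an intranet or SSO gateway blocks the request), make sure you are connected to the VPN and that Personal Access Tokens (PATs) are enabled on the instance.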
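As a rough sketch of how the environment variables above might be turned into request headers: the function below is hypothetical and assumes the common Atlassian conventions (HTTP Basic auth with email and API token for Cloud, a Bearer token for Data Center PATs). The authentication methods the script actually uses are defined in `confluence_products.ini`.

```python
import base64
import os


def build_auth_headers(profile: str, env=os.environ) -> dict:
    """Return an Authorization header for the given profile.

    Hypothetical helper; assumes Basic auth for 'cloud' and Bearer for 'dc'.
    """
    token = env.get("CONFLUENCE_TOKEN")
    if not token:
        raise RuntimeError("CONFLUENCE_TOKEN is not set")

    if profile == "cloud":
        user = env.get("CONFLUENCE_USER")
        if not user:
            raise RuntimeError("CONFLUENCE_USER is not set")
        # Basic auth: base64("email:api_token")
        encoded = base64.b64encode(f"{user}:{token}".encode()).decode()
        return {"Authorization": f"Basic {encoded}"}

    if profile == "dc":
        # Personal Access Token sent as a Bearer token
        return {"Authorization": f"Bearer {token}"}

    raise ValueError(f"Unknown profile: {profile!r}")
```

Passing the environment as a parameter keeps the function easy to test without touching the real shell variables.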
7574

76-
## Exporting with CSS Styling
77-
78-
The script uses a robust **Two-Layer Styling Strategy**.
79-
80-
### Layer 1: Standard CSS (Default)
81-
82-
The project folder contains a `styles/` directory. If a CSS file exists there (e.g., `styles/site.css`), it is **automatically applied** to every export.
83-
84-
### Layer 2: Custom CSS (Optional)
85-
86-
Use `--css-file "/path/to/my_custom.css"` to apply specific overrides. This file will be loaded **after** the standard CSS.
87-
88-
## Usage
75+
## Usage 1: HTML Export (The Dump)
8976

9077
### General Syntax
9178

92-
python3 confluenceDumpWithPython.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS]
93-
94-
79+
```
80+
python3 confluenceDumpToHTML.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS]
81+
```
9582

9683
### Global Options
9784

98-
-o OUTDIR, --outdir OUTDIR
99-
The output directory (will be created)
100-
--base-url BASE_URL Confluence Base URL (e.g., 'https://confluence.corp.com')
101-
--profile PROFILE Platform profile ('cloud' or 'dc')
102-
--context-path PATH (DC only) Context path (e.g., '/wiki')
103-
--threads THREADS, -t THREADS
104-
Number of download threads (Default: 1)
105-
--exclude-page-id ID Exclude a page ID and its children (can be repeated)
106-
--no-vpn-reminder Skip the VPN check confirmation (DC only)
107-
--css-file CSS_FILE Path to custom CSS file
108-
-R, --rst Export pages as RST (requires pypandoc)
109-
110-
85+
```
86+
-o OUTDIR, --outdir OUTDIR
87+
The output directory (will be created)
88+
--base-url BASE_URL Confluence Base URL (e.g., 'https://confluence.corp.com')
89+
--profile PROFILE Platform profile ('cloud' or 'dc')
90+
--context-path PATH (DC only) Context path (e.g., '/wiki')
91+
--threads THREADS, -t THREADS
92+
Number of download threads (Default: 1)
93+
--exclude-page-id ID Exclude a page ID and its children (can be repeated)
94+
--no-vpn-reminder Skip the VPN check confirmation (DC only)
95+
--css-file CSS_FILE Path to custom CSS file
96+
-R, --rst Export pages as RST (requires pypandoc)
97+
```
11198

11299
### Commands
113100

114-
- **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down.
101+
- **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down.
115102

116-
- `-sp`, `--space-key`: The Key of the space.
103+
- `-sp`, `--space-key`: The Key of the space.
117104

118-
- **`tree`**: Dumps a specific page and all its descendants.
105+
- **`tree`**: Dumps a specific page and all its descendants.
119106

120-
- `-p`, `--pageid`: The Root Page ID.
107+
- `-p`, `--pageid`: The Root Page ID.
121108

122-
- **`single`**: Dumps a single page.
109+
- **`single`**: Dumps a single page.
123110

124-
- `-p`, `--pageid`: The Page ID.
111+
- `-p`, `--pageid`: The Page ID.
125112

126-
- **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion.
113+
- **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion.
127114

128-
- `-l`, `--label`: The label to include.
115+
- `-l`, `--label`: The label to include.
129116

130-
- `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived').
117+
- `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived').
131118

132-
- **`all-spaces`**: Dumps all visible spaces.
119+
- **`all-spaces`**: Dumps all visible spaces.
133120

134121

135-
### Examples
122+
## Usage 2: Index Restructuring (The Sandbox)
136123

137-
**1\. Data Center: Entire Space, 8 Threads, Exclude Archive**
124+
This toolset allows you to re-organize the index structure locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence.
138125

139-
python3 confluenceDumpWithPython.py \
140-
--base-url "https://confluence.corp.com" \
141-
--profile dc \
142-
--context-path "/wiki" \
143-
-o "./dump_it" \
144-
-t 8 \
145-
--exclude-page-id "999999" \
146-
space -sp "IT"
126+
1. **Generate Editor:** Create a visual Drag & Drop editor for the index.
127+
128+
```
129+
python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
130+
```
147131
132+
2. **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items.
148133
134+
3. **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory.
135+
136+
4. **Apply:** Patch the new index structure into all **downloaded** HTML files.
137+
138+
```
139+
python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
140+
```
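The workflow above revolves around a Markdown file describing the index. As a sketch of how such a file could be read into a tree, the parser below assumes a nested bullet list with two-space indentation, where pages are Markdown links and folders are plain text. The real format of `sidebar_edit.md` is not shown here, so both the format and the function name `parse_sidebar` are assumptions.

```python
import re

# "- [Title](file.html)" for pages, "- Folder name" for folders (assumed format)
ITEM = re.compile(
    r"^(?P<indent>\s*)[-*]\s+"
    r"(?:\[(?P<title>[^\]]+)\]\((?P<href>[^)]+)\)|(?P<plain>.+))$"
)


def parse_sidebar(markdown: str, indent_width: int = 2) -> list:
    """Turn a nested Markdown list into a list of {title, href, children} nodes."""
    root = []
    stack = [(-1, root)]  # (nesting level, children list of the node at that level)
    for line in markdown.splitlines():
        match = ITEM.match(line.rstrip())
        if not match:
            continue  # skip blank lines and anything that is not a list item
        level = len(match.group("indent").expandtabs(indent_width)) // indent_width
        node = {
            "title": match.group("title") or match.group("plain").strip(),
            "href": match.group("href"),  # None for folders
            "children": [],
        }
        # Climb back up until the top of the stack is this node's parent
        while stack[-1][0] >= level:
            stack.pop()
        stack[-1][1].append(node)
        stack.append((level, node["children"]))
    return root


sample = "\n".join([
    "- [Home](home.html)",
    "  - [Setup](setup.html)",
    "  - Guides",
    "    - [Install](install.html)",
    "- [FAQ](faq.html)",
])
tree = parse_sidebar(sample)
print([node["title"] for node in tree])
```

A tree like this is the kind of structure a patcher would need in order to regenerate the sidebar markup for every downloaded page.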
141+
142+
143+
## Usage 3: PDF Generation (Publication)
149144
150-
**2\. Cloud: Single Page Tree**
145+
Once you have the HTML dump (and optionally restructured it), you can generate high-quality PDF documents. This tool is separated from the main downloader to keep dependencies light.
151146
152-
python3 confluenceDumpWithPython.py \
153-
--base-url "https://myteam.atlassian.net" \
154-
--profile cloud \
155-
-o "./dump_tree" \
156-
tree -p "12345"
147+
**Prerequisite:**
148+
149+
- `pip install weasyprint`
157150
151+
- **Windows Users:** You must install the [GTK3 Runtime](https://github.com/tschoonj/GTK-for-Windows-runtime-environment-installer) for WeasyPrint to work.
158152
159153
160-
## Index Restructuring Sandbox
161-
162-
This additional toolset allows you to re-organize the pages and sub-pages structure (the index) of your export locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence or re-downloading pages.
154+
```
155+
python3 htmlToPDF.py --site-dir "./output/2025-01-01 Space IT"
156+
```
163157
164-
**The Workflow:**
158+
**Features:**
165159
166-
1. **Generate Editor:** Create a visual Drag & Drop editor for the index of all exported pages.
160+
- **Mixed Orientation:** Automatically detects Landscape pages based on content styles and switches the PDF page orientation accordingly.
167161
168-
python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
169-
170-
162+
- **Configurable Layout:** A `styles/pdf_settings.css` file is auto-generated on first run. Use it to customize margins, page sizes (A4/Letter), or page numbering.
171163
172-
2. **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items.
164+
- **Hierarchical:** Creates a PDF outline (bookmarks) and a Table of Contents based on your `sidebar.md`.
173165
174-
3. **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory.
166+
- **Internal Linking:** Links between exported pages work seamlessly within the PDF.
175167
176-
4. **Apply:** Patch the new index structure into all **downloaded** HTML files.
168+
- **Table Optimization:** Aggressive CSS rules ensure wide tables fit onto the page (`table-layout: fixed`, word breaking).
177169
178-
python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
179-
180-
181-
170+
- **Smart Splitting:** Use `--split-by-root` to generate separate PDFs for each top-level folder (ideal for massive exports > 1000 pages).
171+
172+
- **Debug Mode:** Use `--debug` to save the intermediate merged HTML file. This is also an **excellent input format for LLMs** (RAG), as it contains the entire documentation in a single, structured HTML file.
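The **Internal Linking** feature relies on turning links between exported pages into anchors inside the merged document. The sketch below shows one way this could work using only the standard library. It is not taken from `htmlToPDF.py`: the anchor naming scheme (`page-<name>`), the function names, and the decision to drop URL fragments are all assumptions made for illustration.

```python
import re
from urllib.parse import urlsplit

HREF = re.compile(r'href="([^"]*)"')


def page_anchor(filename: str) -> str:
    """Hypothetical anchor id for a page inside the merged HTML file."""
    stem = filename.rsplit(".", 1)[0]
    return "page-" + re.sub(r"[^A-Za-z0-9]+", "-", stem).strip("-")


def rewrite_links(html: str, local_pages: set) -> str:
    """Point links to known local pages at in-document anchors."""

    def replace(match: re.Match) -> str:
        parts = urlsplit(match.group(1))
        if parts.scheme or parts.netloc:
            return match.group(0)  # external link (http, mailto, ...): keep
        if parts.path in local_pages:
            # Fragment is dropped for simplicity in this sketch
            return f'href="#{page_anchor(parts.path)}"'
        return match.group(0)  # unknown target: leave untouched

    return HREF.sub(replace, html)


merged = '<a href="setup.html">Setup</a> <a href="https://example.com">Ext</a>'
print(rewrite_links(merged, {"setup.html"}))
```

For this to work, each merged page would also need an element carrying the matching `id`, which is what a PDF renderer can resolve into an internal jump target.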
