|
1 | 1 | # Confluence Dump with Python |
2 | 2 |
|
3 | | -This script exports content from a Confluence instance (Cloud or Data Center) using various modes. |
| 3 | +This script exports content from a Confluence instance (Cloud or Data Center) in several modes. It creates a static, navigable HTML archive that can optionally be converted into a structured PDF. |
4 | 4 |
|
5 | 5 | **Key Features:** |
6 | 6 |
|
7 | | -- **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages, including a **fully functional, static navigation sidebar** on the left—something even the standard Confluence export does not provide. |
| 7 | +- **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages (`export_view`), including a **fully functional, static navigation sidebar** on the left. |
8 | 8 |
|
9 | | -- **Offline Browsing:** Localizes images and links, and downloads **all** attachments (PDFs, Office docs, etc.) for complete offline access. |
| 9 | +- **Offline Browsing:** Localizes images and links, and downloads **all** attachments for complete offline access. |
10 | 10 |
|
11 | | -- **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) in the sidebar. |
| 11 | +- **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) is preserved. |
12 | 12 |
|
13 | | -- **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page. |
| 13 | +- **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page. |
14 | 14 |
|
15 | | -- **Versioning:** Automatically creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management. This allows you to run the script repeatedly (e.g., after changes in Confluence) and maintain a history of snapshots without overwriting previous exports. |
| 15 | +- **Versioning:** Creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management. |
16 | 16 |
|
17 | | -- **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces. |
| 17 | +- **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces. |
18 | 18 |
|
19 | | -- **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`. |
| 19 | +- **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`. |
20 | 20 |
|
21 | | -- **Index Sandbox:** Includes visual tools to manually restructure the navigation tree via Drag & Drop and apply it to the downloaded files without affecting Confluence. |
| 21 | +- **Architecture Sandbox:** Tools to manually restructure the navigation tree via Drag & Drop before generating the final output. |
| 22 | + |
| 23 | +- **Professional PDF Output:** Converts the (restructured) HTML dump into a single, hierarchical PDF with Table of Contents, Bookmarks, and mixed Portrait/Landscape orientation. |
22 | 24 |
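
The versioning feature can be illustrated with a minimal sketch. The folder format is inferred only from the example name `2025-11-21 1400 Space IT`; the function name `snapshot_dir` and its parameters are hypothetical and not taken from the script.

```python
from datetime import datetime
from pathlib import Path


def snapshot_dir(outdir, label, now=None):
    """Build a timestamped subfolder path such as '<outdir>/2025-11-21 1400 Space IT'.

    Assumption: the timestamp is 'YYYY-MM-DD HHMM', as suggested by the README example.
    """
    now = now or datetime.now()
    return Path(outdir) / f"{now.strftime('%Y-%m-%d %H%M')} {label}"


if __name__ == "__main__":
    # Each run gets its own folder, so earlier snapshots are not overwritten.
    print(snapshot_dir("./output", "Space IT"))
```

Because the timestamp leads the folder name, a plain alphabetical listing of the output directory is also chronological.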
|
23 | 25 |
|
24 | 26 | ## Platform Support |
25 | 27 |
|
26 | 28 | This script supports both: |
27 | 29 |
|
28 | | -- **Confluence Cloud** |
| 30 | +- **Confluence Cloud** |
29 | 31 |
|
30 | | -- **Confluence Data Center** |
| 32 | +- **Confluence Data Center** |
31 | 33 |
|
32 | 34 |
|
33 | 35 | The platform-specific API paths and authentication methods are defined in the `confluence_products.ini` file. |
34 | 36 |
|
35 | | -> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and tested against a **Confluence Data Center** environment. |
| 37 | +> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and rigorously tested against a **Confluence Data Center** environment. |
36 | 38 | > |
37 | 39 | > While the logic remains consistent with the previous version, the Cloud mode has **not yet been verified in a live environment** by the current maintainer due to lack of access. If you encounter issues with Cloud authentication or API paths, please open an issue or submit a Pull Request. |
38 | 40 |
|
39 | | -## Missing Features / Ideas |
40 | | - |
41 | | -- **Incremental Update:** Currently, the script always performs a full export. An update mode that only downloads changed pages would be a valuable addition. |
42 | | - |
43 | | - |
44 | 41 | ## Requirements |
45 | 42 |
|
46 | | -- Python 3.x |
| 43 | +- Python 3.x |
47 | 44 |
|
48 | | -- `requests`, `beautifulsoup4`, `tqdm` |
| 45 | +- `requests`, `beautifulsoup4`, `tqdm` |
49 | 46 |
|
50 | | -- `pypandoc` (optional, only needed for RST export) |
51 | | - |
52 | | - |
53 | | - pip install -r requirements.txt |
| 47 | +- `pypandoc` (optional, only needed for RST export) |
54 | 48 |
|
| 49 | +- `weasyprint` (optional, only needed for PDF export) |
55 | 50 |
|
56 | 51 |
|
| 52 | +``` |
| 53 | +pip install -r requirements.txt |
| 54 | +``` |
| 55 | + |
57 | 56 | ## Authentication |
58 | 57 |
|
59 | 58 | Authentication is handled via environment variables, based on the profile you select. |
60 | 59 |
|
61 | 60 | ### For Confluence Cloud (`--profile cloud`) |
62 | 61 |
|
63 | | - export CONFLUENCE_USER="your-email@example.com" |
64 | | - export CONFLUENCE_TOKEN="YourApiTokenHere" |
65 | | - |
66 | | - |
| 62 | +``` |
| 63 | +export CONFLUENCE_USER="your-email@example.com" |
| 64 | +export CONFLUENCE_TOKEN="YourApiTokenHere" |
| 65 | +``` |
67 | 66 |
|
68 | 67 | ### For Confluence Data Center (`--profile dc`) |
69 | 68 |
|
70 | | - export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere" |
71 | | - |
72 | | - |
| 69 | +``` |
| 70 | +export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere" |
| 71 | +``` |
73 | 72 |
|
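74 | 73 | **⚠️ Troubleshooting Note for Data Center:** If authentication fails (for example, because the intranet or SSO blocks the request), make sure you are connected to the VPN and that Personal Access Tokens (PATs) are enabled on the instance. |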
75 | 74 |
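
As a rough illustration of how the two profiles might turn these environment variables into request headers, here is a minimal sketch. It assumes the usual Atlassian conventions (HTTP Basic auth with email and API token for Cloud, a Bearer token for Data Center PATs); the actual scheme used by the script is defined in `confluence_products.ini` and may differ. The function name `build_auth_headers` is hypothetical.

```python
import base64
import os


def build_auth_headers(profile, env=None):
    """Return HTTP headers for the given profile ('cloud' or 'dc')."""
    env = os.environ if env is None else env
    token = env["CONFLUENCE_TOKEN"]
    if profile == "cloud":
        # Assumption: Cloud API tokens are sent via HTTP Basic auth (email:token).
        user = env["CONFLUENCE_USER"]
        raw = f"{user}:{token}".encode("utf-8")
        return {"Authorization": "Basic " + base64.b64encode(raw).decode("ascii")}
    if profile == "dc":
        # Assumption: Data Center Personal Access Tokens are sent as a Bearer token.
        return {"Authorization": f"Bearer {token}"}
    raise ValueError(f"unknown profile: {profile!r}")
```

A missing variable raises `KeyError` here, which surfaces a forgotten `export` before any request is sent.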
|
76 | | -## Exporting with CSS Styling |
77 | | - |
78 | | -The script uses a robust **Two-Layer Styling Strategy**. |
79 | | - |
80 | | -### Layer 1: Standard CSS (Default) |
81 | | - |
82 | | -The project folder contains a `styles/` directory. If a CSS file exists there (e.g., `styles/site.css`), it is **automatically applied** to every export. |
83 | | - |
84 | | -### Layer 2: Custom CSS (Optional) |
85 | | - |
86 | | -Use `--css-file "/path/to/my_custom.css"` to apply specific overrides. This file will be loaded **after** the standard CSS. |
87 | | - |
88 | | -## Usage |
| 75 | +## Usage 1: HTML Export (The Dump) |
89 | 76 |
|
90 | 77 | ### General Syntax |
91 | 78 |
|
92 | | - python3 confluenceDumpWithPython.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS] |
93 | | - |
94 | | - |
| 79 | +``` |
| 80 | +python3 confluenceDumpToHTML.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS] |
| 81 | +``` |
95 | 82 |
|
96 | 83 | ### Global Options |
97 | 84 |
|
98 | | - -o OUTDIR, --outdir OUTDIR |
99 | | - The output directory (will be created) |
100 | | - --base-url BASE_URL Confluence Base URL (e.g., 'https://confluence.corp.com') |
101 | | - --profile PROFILE Platform profile ('cloud' or 'dc') |
102 | | - --context-path PATH (DC only) Context path (e.g., '/wiki') |
103 | | - --threads THREADS, -t THREADS |
104 | | - Number of download threads (Default: 1) |
105 | | - --exclude-page-id ID Exclude a page ID and its children (can be repeated) |
106 | | - --no-vpn-reminder Skip the VPN check confirmation (DC only) |
107 | | - --css-file CSS_FILE Path to custom CSS file |
108 | | - -R, --rst Export pages as RST (requires pypandoc) |
109 | | - |
110 | | - |
| 85 | +``` |
| 86 | + -o OUTDIR, --outdir OUTDIR |
| 87 | + The output directory (will be created) |
| 88 | + --base-url BASE_URL Confluence Base URL (e.g., 'https://confluence.corp.com') |
| 89 | + --profile PROFILE Platform profile ('cloud' or 'dc') |
| 90 | + --context-path PATH (DC only) Context path (e.g., '/wiki') |
| 91 | + --threads THREADS, -t THREADS |
| 92 | + Number of download threads (Default: 1) |
| 93 | + --exclude-page-id ID Exclude a page ID and its children (can be repeated) |
| 94 | + --no-vpn-reminder Skip the VPN check confirmation (DC only) |
| 95 | + --css-file CSS_FILE Path to custom CSS file |
| 96 | + -R, --rst Export pages as RST (requires pypandoc) |
| 97 | +``` |
111 | 98 |
|
112 | 99 | ### Commands |
113 | 100 |
|
114 | | -- **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down. |
| 101 | +- **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down. |
115 | 102 |
|
116 | | - - `-sp`, `--space-key`: The Key of the space. |
| 103 | + - `-sp`, `--space-key`: The Key of the space. |
117 | 104 |
|
118 | | -- **`tree`**: Dumps a specific page and all its descendants. |
| 105 | +- **`tree`**: Dumps a specific page and all its descendants. |
119 | 106 |
|
120 | | - - `-p`, `--pageid`: The Root Page ID. |
| 107 | + - `-p`, `--pageid`: The Root Page ID. |
121 | 108 |
|
122 | | -- **`single`**: Dumps a single page. |
| 109 | +- **`single`**: Dumps a single page. |
123 | 110 |
|
124 | | - - `-p`, `--pageid`: The Page ID. |
| 111 | + - `-p`, `--pageid`: The Page ID. |
125 | 112 |
|
126 | | -- **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion. |
| 113 | +- **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion. |
127 | 114 |
|
128 | | - - `-l`, `--label`: The label to include. |
| 115 | + - `-l`, `--label`: The label to include. |
129 | 116 |
|
130 | | - - `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived'). |
| 117 | + - `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived'). |
131 | 118 |
|
132 | | -- **`all-spaces`**: Dumps all visible spaces. |
| 119 | +- **`all-spaces`**: Dumps all visible spaces. |
133 | 120 |
|
134 | 121 |
|
135 | | -### Examples |
| 122 | +## Usage 2: Index Restructuring (The Sandbox) |
136 | 123 |
|
137 | | -**1\. Data Center: Entire Space, 8 Threads, Exclude Archive** |
| 124 | +This toolset allows you to re-organize the index structure locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence. |
138 | 125 |
|
139 | | - python3 confluenceDumpWithPython.py \ |
140 | | - --base-url "https://confluence.corp.com" \ |
141 | | - --profile dc \ |
142 | | - --context-path "/wiki" \ |
143 | | - -o "./dump_it" \ |
144 | | - -t 8 \ |
145 | | - --exclude-page-id "999999" \ |
146 | | - space -sp "IT" |
| 126 | +1. **Generate Editor:** Create a visual Drag & Drop editor for the index. |
| 127 | + |
| 128 | + ``` |
| 129 | + python3 create_editor.py --site-dir "./output/2025-01-01 Space IT" |
| 130 | + ``` |
147 | 131 | |
| 132 | +2. **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items. |
148 | 133 | |
| 134 | +3. **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory. |
| 135 | + |
| 136 | +4. **Apply:** Patch the new index structure into all **downloaded** HTML files. |
| 137 | + |
| 138 | + ``` |
| 139 | + python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT" |
| 140 | + ``` |
| 141 | + |
| 142 | +
|
| 143 | +## Usage 3: PDF Generation (Publication) |
149 | 144 |
|
150 | | -**2\. Cloud: Single Page Tree** |
| 145 | +Once you have the HTML dump (and optionally restructured it), you can generate high-quality PDF documents. This tool is separated from the main downloader to keep dependencies light. |
151 | 146 |
|
152 | | - python3 confluenceDumpWithPython.py \ |
153 | | - --base-url "https://myteam.atlassian.net" \ |
154 | | - --profile cloud \ |
155 | | - -o "./dump_tree" \ |
156 | | - tree -p "12345" |
| 147 | +**Prerequisites:** |
| 148 | +
|
| 149 | +- `pip install weasyprint` |
157 | 150 | |
| 151 | +- **Windows Users:** You must install the [GTK3 Runtime](https://github.com/tschoonj/GTK-for-Windows-runtime-environment-installer) for WeasyPrint to work. |
158 | 152 | |
159 | 153 |
|
160 | | -## Index Restructuring Sandbox |
161 | | - |
162 | | -This additional toolset allows you to re-organize the pages and sub-pages structure (the index) of your export locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence or re-downloading pages. |
| 154 | +``` |
| 155 | +python3 htmlToPDF.py --site-dir "./output/2025-01-01 Space IT" |
| 156 | +``` |
163 | 157 |
|
164 | | -**The Workflow:** |
| 158 | +**Features:** |
165 | 159 |
|
166 | | -1. **Generate Editor:** Create a visual Drag & Drop editor for the index of all exported pages. |
| 160 | +- **Mixed Orientation:** Automatically detects Landscape pages based on content styles and switches the PDF page orientation accordingly. |
167 | 161 | |
168 | | - python3 create_editor.py --site-dir "./output/2025-01-01 Space IT" |
169 | | - |
170 | | - |
| 162 | +- **Configurable Layout:** A `styles/pdf_settings.css` file is auto-generated on first run. Use it to customize margins, page sizes (A4/Letter), or page numbering. |
171 | 163 | |
172 | | -2. **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items. |
| 164 | +- **Hierarchical:** Creates a PDF outline (bookmarks) and a Table of Contents based on your `sidebar.md`. |
173 | 165 | |
174 | | -3. **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory. |
| 166 | +- **Internal Linking:** Links between exported pages work seamlessly within the PDF. |
175 | 167 | |
176 | | -4. **Apply:** Patch the new index structure into all **downloaded** HTML files. |
| 168 | +- **Table Optimization:** Aggressive CSS rules ensure wide tables fit onto the page (`table-layout: fixed`, word breaking). |
177 | 169 | |
178 | | - python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT" |
179 | | - |
180 | | - |
181 | | - |
| 170 | +- **Smart Splitting:** Use `--split-by-root` to generate separate PDFs for each top-level folder (ideal for massive exports > 1000 pages). |
| 171 | + |
| 172 | +- **Debug Mode:** Use `--debug` to save the intermediate merged HTML file. Because this file contains the entire documentation in a single, structured HTML document, it can also serve as input for LLM pipelines (e.g., RAG). |
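
To make the hierarchy and splitting features more concrete, the sketch below parses a nested Markdown list into a tree and walks it in reading order. The format of `sidebar.md` is not documented here, so this assumes a bullet list with two-space indentation and `[Title](file.html)` links; the names `parse_sidebar` and `page_order` are hypothetical.

```python
import re

# Matches "- [Title](href)" or a plain "- Folder name" entry, with leading indent.
ITEM = re.compile(
    r"^(?P<indent>\s*)[-*]\s+"
    r"(?:\[(?P<title>[^\]]+)\]\((?P<href>[^)]+)\)|(?P<plain>.+))$"
)


def parse_sidebar(markdown, indent_width=2):
    """Parse a nested Markdown bullet list into a list of root nodes."""
    roots = []
    stack = []  # stack[d] is the most recent node seen at depth d
    for line in markdown.splitlines():
        match = ITEM.match(line.rstrip())
        if not match:
            continue  # skip blank lines and anything that is not a list item
        depth = len(match["indent"]) // indent_width
        node = {
            "title": match["title"] or match["plain"].strip(),
            "href": match["href"],  # None for plain folder entries
            "children": [],
        }
        del stack[depth:]  # drop nodes that are not ancestors of this item
        (stack[-1]["children"] if stack else roots).append(node)
        stack.append(node)
    return roots


def page_order(node):
    """Yield the hrefs of a node and its descendants in reading order."""
    if node["href"]:
        yield node["href"]
    for child in node["children"]:
        yield from page_order(child)
```

Under these assumptions, each entry returned by `parse_sidebar` corresponds to one top-level folder, so a split by root would produce one document per entry, with `page_order` giving the page sequence inside it.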