jgoldin-skillz
diff --git a/CHANGELOG.md b/CHANGELOG.md
Lines changed: 31 additions & 4 deletions
diff --git a/README.md b/README.md
Lines changed: 91 additions & 100 deletions
@@ -4,20 +4,47 @@ All notable changes to this project will be documented in this file.
 
 The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/), and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
+## \[2.6.0\] - 2025-01-26
+
+Added professional PDF publication capabilities.
+
+### Added
+
+- **PDF Generator:** Introduced `htmlToPDF.py`, which converts the dumped HTML structure into a single, hierarchical PDF document.
+    
+- **PDF Features:**
+    
+    - **Smart Splitting:** Option `--split-by-root` to generate separate PDFs for each top-level folder (scalable for 4000+ pages).
+        
+    - **Mixed Orientation:** Supports mixing Portrait and Landscape pages within the same PDF based on source HTML hints.
+        
+    - **Bookmarks:** Generates PDF Outlines/Bookmarks matching the sidebar structure.
+        
+    - **Link Rewriting:** Converts HTML links to internal PDF anchors for seamless navigation.
+        
+- **PDF Configuration:** Auto-generates `styles/pdf_settings.css` for user-definable page layouts (A4/Letter, Margins).
+    
+
 ## \[2.5.0\] - 2025-11-22
 
 Introduction of the "Architecture Sandbox" for offline restructuring.
 
 ### Added
 
-- **Architecture Sandbox:** Introduced `create_editor.py` and `patch_sidebar.py`. Users can now generate a visual Drag & Drop editor (`editor_sidebar.html`) to restructure the exported documentation offline and apply changes massively using the patcher.
+- **Architecture Sandbox:** Introduced `create_editor.py` and `patch_sidebar.py`. Users can now generate a visual Drag & Drop editor (`editor_sidebar.html`) to restructure the exported documentation offline.
 
-- **Robust Editor Generation:** The editor generator now uses a safe string concatenation approach to avoid syntax errors and supports creating a working copy of the sidebar structure (`sidebar_edit.md`).
+- **Editor Features:**
 
+    - **Zero-Dependency:** The editor is a self-contained HTML file requiring no internet access.
+        
+    - **Drag & Drop:** Robust reordering of pages and folders.
+        
+    - **Working Copy:** Supports a `sidebar_edit.md` workflow to keep the original structure safe.
+        
 
 ### Changed
 
-- **CSS Strategy:** Refined the "Two-Layer" styling approach (Standard + Custom) to be more robust in the documentation and implementation.
+- **CSS Strategy:** Refined the "Two-Layer" styling approach (Standard + Custom) to be more robust.
 
 
 ## \[2.4.1\] - 2025-11-21
@@ -95,7 +122,7 @@ Introduction of Static Sidebar Injection.
 
 - **Smart Linking:** Improved detection of dead/external links vs. local links based on the inventory.
 
-- **CSS Auto-Discovery:** The script automatically detects and applies `site.css` from the local `styles/` folder.
+- **CSS Auto-Discovery:** The script automatically detects and applies `site.css` from the local `styles/` directory.
 
 - **Multi-CSS Support:** Allows layering multiple CSS files (Standard + Custom).
 
 
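The **Link Rewriting** entry in 2.6.0 describes converting HTML links into internal PDF anchors. A minimal sketch of that idea follows, using only the standard library. The names `anchor_id` and `rewrite_links` and the `page-<slug>` anchor scheme are hypothetical and are not taken from `htmlToPDF.py`; the actual implementation may work differently.

```python
import re

# Matches href="something.html" with an optional "#fragment" suffix.
HREF_RE = re.compile(r'href="([^"#]+\.html?)(#[^"]*)?"', re.IGNORECASE)


def anchor_id(href):
    """Hypothetical scheme: 'Setup Guide.html' -> 'page-setup-guide'."""
    stem = href.rsplit("/", 1)[-1]
    stem = re.sub(r"\.html?$", "", stem, flags=re.IGNORECASE)
    slug = re.sub(r"[^a-z0-9]+", "-", stem.lower()).strip("-")
    return "page-" + slug


def rewrite_links(html, known_pages):
    """Point links to exported pages at in-document anchors.

    External links and links to pages outside the export are left unchanged.
    Any '#fragment' on a rewritten link is dropped in this sketch.
    """
    def repl(match):
        target = match.group(1)
        if target.startswith(("http://", "https://")) or target not in known_pages:
            return match.group(0)
        return 'href="#' + anchor_id(target) + '"'

    return HREF_RE.sub(repl, html)
```

For this to work in a merged document, each page's content would also need a matching `id` attribute (for example `id="page-setup-guide"`) on its wrapper element.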
@@ -1,181 +1,172 @@
 # Confluence Dump with Python
 
-This script exports content from a Confluence instance (Cloud or Data Center) using various modes.
+This script exports content from a Confluence instance (Cloud or Data Center) using various modes. It creates a static, navigable HTML archive that can be optionally converted into a structured PDF.
 
 **Key Features:**
 
--   **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages, including a **fully functional, static navigation sidebar** on the left—something even the standard Confluence export does not provide.
+- **Visual Fidelity & Sidebar:** Creates a visually faithful copy of Confluence pages (`export_view`), including a **fully functional, static navigation sidebar** on the left.
 
--   **Offline Browsing:** Localizes images and links, and downloads **all** attachments (PDFs, Office docs, etc.) for complete offline access.
+- **Offline Browsing:** Localizes images and links, and downloads **all** attachments for complete offline access.
 
--   **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) in the sidebar.
+- **Recursive Inventory:** Scans the tree hierarchy to ensure the **correct sort order** (manual Confluence order) is preserved.
 
--   **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page.
+- **Metadata Injection:** Automatically adds Page Title, Author, and Modification Date to the top of every page.
 
--   **Versioning:** Automatically creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management. This allows you to run the script repeatedly (e.g., after changes in Confluence) and maintain a history of snapshots without overwriting previous exports.
+- **Versioning:** Creates timestamped output subfolders (e.g., `2025-11-21 1400 Space IT`) for clean history management.
 
--   **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces.
+- **Performance:** Supports **Multithreaded** downloading (`--threads`) to speed up the export of large spaces.
 
--   **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`.
+- **Tree Pruning:** Exclude specific branches with `--exclude-page-id` or `--exclude-label`.
 
--   **Index Sandbox:** Includes visual tools to manually restructure the navigation tree via Drag & Drop and apply it to the downloaded files without affecting Confluence.
+- **Architecture Sandbox:** Tools to manually restructure the navigation tree via Drag & Drop before generating the final output.
+    
+- **Professional PDF Output:** Converts the (restructured) HTML dump into a single, hierarchical PDF with Table of Contents, Bookmarks, and mixed Portrait/Landscape orientation.
 
 
 ## Platform Support
 
 This script supports both:
 
--   **Confluence Cloud**
+- **Confluence Cloud**
 
--   **Confluence Data Center**
+- **Confluence Data Center**
 
 
 The platform-specific API paths and authentication methods are defined in the `confluence_products.ini` file.
 
-> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and tested against a **Confluence Data Center** environment.
+> **⚠️ Note on Cloud Verification:** The support for **Confluence Cloud** has been carefully ported to the new modular architecture based on the original codebase. However, this refactoring was developed and rigorously tested against a **Confluence Data Center** environment.
 > 
 > While the logic remains consistent with the previous version, the Cloud mode has **not yet been verified in a live environment** by the current maintainer due to lack of access. If you encounter issues with Cloud authentication or API paths, please open an issue or submit a Pull Request.
 
-## Missing Features / Ideas
-
--   **Incremental Update:** Currently, the script always performs a full export. An update mode that only downloads changed pages would be a valuable addition.
-    
-
 ## Requirements
 
--   Python 3.x
+- Python 3.x
 
--   `requests`, `beautifulsoup4`, `tqdm`
+- `requests`, `beautifulsoup4`, `tqdm`
 
--   `pypandoc` (optional, only needed for RST export)
-    
-
-    pip install -r requirements.txt
+- `pypandoc` (optional, only needed for RST export)
 
+- `weasyprint` (optional, only needed for PDF export)
 
 
+```
+pip install -r requirements.txt
+```
+
 ## Authentication
 
 Authentication is handled via environment variables, based on the profile you select.
 
 ### For Confluence Cloud (`--profile cloud`)
 
-    export CONFLUENCE_USER="your-email@example.com"
-    export CONFLUENCE_TOKEN="YourApiTokenHere"
-    
-    
+```
+export CONFLUENCE_USER="your-email@example.com"
+export CONFLUENCE_TOKEN="YourApiTokenHere"
+```
 
 ### For Confluence Data Center (`--profile dc`)
 
-    export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere"
-    
-    
+```
+export CONFLUENCE_TOKEN="YourPersonalAccessTokenHere"
+```
 
 **⚠️ Troubleshooting Note for Data Center:** If authentication fails (Intranet/SSO blocks), ensure you are connected to the VPN and that Personal Access Tokens (PATs) are enabled.
 
-## Exporting with CSS Styling
-
-The script uses a robust **Two-Layer Styling Strategy**.
-
-### Layer 1: Standard CSS (Default)
-
-The project folder contains a `styles/` directory. If a CSS file exists there (e.g., `styles/site.css`), it is **automatically applied** to every export.
-
-### Layer 2: Custom CSS (Optional)
-
-Use `--css-file "/path/to/my_custom.css"` to apply specific overrides. This file will be loaded **after** the standard CSS.
-
-## Usage
+## Usage 1: HTML Export (The Dump)
 
 ### General Syntax
 
-    python3 confluenceDumpWithPython.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS]
-    
-    
+```
+python3 confluenceDumpToHTML.py [GLOBAL_OPTIONS] <COMMAND> [COMMAND_OPTIONS]
+```
 
 ### Global Options
 
-      -o OUTDIR, --outdir OUTDIR
-                            The output directory (will be created)
-      --base-url BASE_URL   Confluence Base URL (e.g., 'https://confluence.corp.com')
-      --profile PROFILE     Platform profile ('cloud' or 'dc')
-      --context-path PATH   (DC only) Context path (e.g., '/wiki')
-      --threads THREADS, -t THREADS
-                            Number of download threads (Default: 1)
-      --exclude-page-id ID  Exclude a page ID and its children (can be repeated)
-      --no-vpn-reminder     Skip the VPN check confirmation (DC only)
-      --css-file CSS_FILE   Path to custom CSS file
-      -R, --rst             Export pages as RST (requires pypandoc)
-    
-    
+```
+  -o OUTDIR, --outdir OUTDIR
+                        The output directory (will be created)
+  --base-url BASE_URL   Confluence Base URL (e.g., 'https://confluence.corp.com')
+  --profile PROFILE     Platform profile ('cloud' or 'dc')
+  --context-path PATH   (DC only) Context path (e.g., '/wiki')
+  --threads THREADS, -t THREADS
+                        Number of download threads (Default: 1)
+  --exclude-page-id ID  Exclude a page ID and its children (can be repeated)
+  --no-vpn-reminder     Skip the VPN check confirmation (DC only)
+  --css-file CSS_FILE   Path to custom CSS file
+  -R, --rst             Export pages as RST (requires pypandoc)
+```
 
 ### Commands
 
--   **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down.
+- **`space`**: Dumps an entire space. Starts at the Space Homepage and recurses down.
 
-    -   `-sp`, `--space-key`: The Key of the space.
+    - `-sp`, `--space-key`: The Key of the space.
 
--   **`tree`**: Dumps a specific page and all its descendants.
+- **`tree`**: Dumps a specific page and all its descendants.
 
-    -   `-p`, `--pageid`: The Root Page ID.
+    - `-p`, `--pageid`: The Root Page ID.
 
--   **`single`**: Dumps a single page.
+- **`single`**: Dumps a single page.
 
-    -   `-p`, `--pageid`: The Page ID.
+    - `-p`, `--pageid`: The Page ID.
 
--   **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion.
+- **`label`**: Dumps pages by label ("Forest Mode"). Finds all pages with the label and treats them as roots for recursion.
 
-    -   `-l`, `--label`: The label to include.
+    - `-l`, `--label`: The label to include.
 
-    -   `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived').
+    - `--exclude-label`: Exclude subtrees that have this specific label (e.g. 'archived').
 
--   **`all-spaces`**: Dumps all visible spaces.
+- **`all-spaces`**: Dumps all visible spaces.
 
 
-### Examples
+## Usage 2: Index Restructuring (The Sandbox)
 
-**1\. Data Center: Entire Space, 8 Threads, Exclude Archive**
+This toolset allows you to re-organize the index structure locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence.
 
-    python3 confluenceDumpWithPython.py \
-        --base-url "https://confluence.corp.com" \
-        --profile dc \
-        --context-path "/wiki" \
-        -o "./dump_it" \
-        -t 8 \
-        --exclude-page-id "999999" \
-        space -sp "IT"
+1. **Generate Editor:** Create a visual Drag & Drop editor for the index.
+    
+    ```
+    python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
+    ```
     
+2. **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items.
     
+3. **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory.
+    
+4. **Apply:** Patch the new index structure into all **downloaded** HTML files.
+    
+    ```
+    python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
+    ```
+    
+
+## Usage 3: PDF Generation (Publication)
 
-**2\. Cloud: Single Page Tree**
+Once you have the HTML dump (and optionally restructured it), you can generate high-quality PDF documents. This tool is separated from the main downloader to keep dependencies light.
 
-    python3 confluenceDumpWithPython.py \
-        --base-url "https://myteam.atlassian.net" \
-        --profile cloud \
-        -o "./dump_tree" \
-        tree -p "12345"
+**Prerequisite:**
+
+- `pip install weasyprint`
     
+- **Windows Users:** You must install the [GTK3 Runtime](https://github.com/tschoonj/GTK-for-Windows-runtime-environment-installer) for WeasyPrint to work.
     
 
-## Index Restructuring Sandbox
-
-This additional toolset allows you to re-organize the pages and sub-pages structure (the index) of your export locally. This is useful for testing structural changes or cleaning up the navigation flow without touching Confluence or re-downloading pages.
+```
+python3 htmlToPDF.py --site-dir "./output/2025-01-01 Space IT"
+```
 
-**The Workflow:**
+**Features:**
 
-1.  **Generate Editor:** Create a visual Drag & Drop editor for the index of all exported pages.
+- **Mixed Orientation:** Automatically detects Landscape pages based on content styles and switches the PDF page orientation accordingly.
     
-        python3 create_editor.py --site-dir "./output/2025-01-01 Space IT"
-        
-        
+- **Configurable Layout:** A `styles/pdf_settings.css` file is auto-generated on first run. Use it to customize margins, page sizes (A4/Letter), or page numbering.
     
-2.  **Edit:** Open `editor_sidebar.html` in your browser. Move pages, create folders, delete items.
+- **Hierarchical:** Creates a PDF outline (bookmarks) and a Table of Contents based on your `sidebar.md`.
     
-3.  **Save:** Click "Copy Markdown" in the editor and paste the content into a new file `sidebar_edit.md` in the site directory.
+- **Internal Linking:** Links between exported pages work seamlessly within the PDF.
     
-4.  **Apply:** Patch the new index structure into all **downloaded** HTML files.
+- **Table Optimization:** Aggressive CSS rules ensure wide tables fit onto the page (`table-layout: fixed`, word breaking).
     
-        python3 patch_sidebar.py --site-dir "./output/2025-01-01 Space IT"
-
-
-
+- **Smart Splitting:** Use `--split-by-root` to generate separate PDFs for each top-level folder (ideal for massive exports > 1000 pages).
+    
+- **Debug Mode:** Use `--debug` to save the intermediate merged HTML file. This is also an **excellent input format for LLMs** (RAG), as it contains the entire documentation in a single, structured HTML file.
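
The **Smart Splitting** feature above produces one PDF per top-level folder. The grouping step behind such a split could look roughly like the sketch below. It assumes pages are identified by relative paths whose first component is the top-level folder; the function name `group_by_root` and the `_root` bucket for pages without a folder are hypothetical, and `htmlToPDF.py` may derive its groups from the sidebar structure instead.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_root(page_paths):
    """Group relative page paths by their first path component.

    Pages that sit directly at the top level go into a '_root' bucket.
    Input order is preserved within and across groups.
    """
    groups = defaultdict(list)
    for path in page_paths:
        parts = PurePosixPath(path).parts
        root = parts[0] if len(parts) > 1 else "_root"
        groups[root].append(path)
    return dict(groups)
```

Each resulting group could then be rendered to its own PDF, which keeps memory use bounded for very large exports.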