From 6bf3486110eaf18db6b21c153163d034c719b4c9 Mon Sep 17 00:00:00 2001
From: Hermes Agent <root@okwn.cc>
Date: Sat, 23 May 2026 09:28:17 +0000
Subject: [PATCH 1/2] docs: add quickstart section to README

---
 00_STATE.md              | 102 +++++++++++++++
 01_REPO_MAP.md           | 270 +++++++++++++++++++++++++++++++++++++++
 05_PR_CANDIDATES.md      | 120 +++++++++++++++++
 06_SELECTED_5_PR_PLAN.md | 228 +++++++++++++++++++++++++++++++++
 README.md                |  21 +++
 5 files changed, 741 insertions(+)
 create mode 100644 00_STATE.md
 create mode 100644 01_REPO_MAP.md
 create mode 100644 05_PR_CANDIDATES.md
 create mode 100644 06_SELECTED_5_PR_PLAN.md

diff --git a/00_STATE.md b/00_STATE.md
new file mode 100644
index 000000000..f82884d17
--- /dev/null
+++ b/00_STATE.md
@@ -0,0 +1,102 @@
+# 00_STATE.md — PageIndex Repository Analysis
+
+## Repository Identity
+- **Upstream**: VectifyAI/PageIndex (source)
+- **Fork**: okwn/PageIndex (working copy at /root/oss-pr-campaign/repos/pageindex)
+- **License**: MIT
+- **Archived**: No
+- **Language**: Python
+
+## Repository Statistics (Upstream)
+- **Stars**: 31,969
+- **Forks**: 2,754
+- **Open Issues**: 30 (of 158 total)
+- **Open PRs**: 3 (dependency bumps)
+- **Watchers**: 31,969
+- **Default Branch**: main
+
+## Repository Structure
+```
+pageindex/
+├── pageindex/               # Main package
+│   ├── __init__.py          # Exports: page_index, md_to_tree, retrieve functions, PageIndexClient
+│   ├── page_index.py        # PDF indexing logic (~1150 lines, async LLM-driven)
+│   ├── page_index_md.py     # Markdown indexing logic (~340 lines)
+│   ├── client.py            # PageIndexClient workspace-based API (~234 lines)
+│   ├── retrieve.py          # Document/page retrieval helpers (~137 lines)
+│   ├── utils.py             # LLM wrappers, token counting, tree utilities (~710 lines)
+│   └── config.yaml          # Default config (gpt-4o-2024-11-20 model, various limits)
+├── run_pageindex.py         # CLI entry point for PDF/MD processing
+├── requirements.txt         # Dependencies (litellm, pymupdf, PyPDF2, python-dotenv, pyyaml)
+├── examples/
+│   ├── agentic_vectorless_rag_demo.py
+│   ├── documents/
+│   │   ├── q1-fy25-earnings.pdf
+│   │   ├── four-lectures.pdf
+│   │   ├── earthmover.pdf
+│   │   ├── [other PDFs]
+│   │   └── results/         # Pre-generated tree structures
+│   └── tutorials/
+├── .github/
+│   ├── workflows/           # CI: codeql, dependency-review, autoclose, dedupe
+│   ├── scripts/             # autoclose-labeled-issues.js, comment-on-duplicates.sh
+│   └── dependabot.yml      # Weekly GitHub Actions dependency updates
+└── README.md               # Full documentation with examples
+
+```
+
+## Key Upstream Branches
+- `main` — stable release
+- `dev` — development work
+- `feat/markdown-tree`, `feat/md-bold-heading-recognition` — feature branches
+- `fix/cloud-poll-status-completed`, `add-pypdfium2-parser` — fix branches
+- `dependabot/*` — automated dependency updates
+
+## Current Working Branch
+- **Local main** is tracking `upstream/main`
+- Fork created via `gh api --method POST repos/VectifyAI/PageIndex/forks`
+
+## Installation
+```bash
+pip3 install --upgrade -r requirements.txt
+# Optional: openai-agents for examples/agentic_vectorless_rag_demo.py
+```
+
+## Core Functionality Summary
+PageIndex is a **vectorless, reasoning-based RAG** system that:
+1. Builds a hierarchical tree index (ToC-style) from PDFs or markdown
+2. Uses LLMs to reason over the tree for context-aware retrieval
+3. Achieved 98.7% accuracy on FinanceBench (Mafin 2.5 system)
+
+## Package Usage
+```bash
+# PDF processing
+python3 run_pageindex.py --pdf_path /path/to/document.pdf
+
+# Markdown processing
+python3 run_pageindex.py --md_path /path/to/document.md
+```
+```python
+from pageindex import PageIndexClient  # Via Python API
+client = PageIndexClient(api_key="...")
+doc_id = client.index("document.pdf")
+print(client.get_document_structure(doc_id))
+```
+
+## No Test Suite Found
+- No pytest, unittest, or test files present in the repository
+- No CI workflow for running tests
+
+## CI/CD
+- **CodeQL**: Security analysis on push/PR to main
+- **Dependency Review**: Scans dependency changes on PRs
+- **Dependabot**: Weekly GitHub Actions updates (actions/checkout, dependency-review-action, github-script)
+- **Autoclose**: Auto-closes issues with specific labels
+- **Dedupe**: Issue deduplication workflow
+
+## Health Indicators
+- Active upstream (~32k stars, ~2.8k forks, 158 total issues)
+- Regular maintenance via Dependabot
+- Multiple active branches for features/fixes
+- No test suite — notable gap for OSS contribution
+- 3 open dependency PRs (unmerged)
\ No newline at end of file
diff --git a/01_REPO_MAP.md b/01_REPO_MAP.md
new file mode 100644
index 000000000..41da1a6b7
--- /dev/null
+++ b/01_REPO_MAP.md
@@ -0,0 +1,270 @@
+# 01_REPO_MAP.md — PageIndex Codebase Map
+
+## Package Exports (`pageindex/__init__.py`)
+```python
+from .page_index import *           # page_index(), page_index_main()
+from .page_index_md import md_to_tree
+from .retrieve import get_document, get_document_structure, get_page_content
+from .client import PageIndexClient
+```
+
+---
+
+## `pageindex/page_index.py` — PDF Indexing (1153 lines)
+
+### TOC Detection & Extraction
+| Function | Purpose |
+|----------|---------|
+| `toc_detector_single_page(content)` | Detects if a page contains a table of contents |
+| `find_toc_pages(start_page_index, page_list, opt)` | Scans pages for TOC presence |
+| `toc_extractor(page_list, toc_page_list, model)` | Extracts raw TOC text from pages |
+| `detect_page_index(toc_content, model)` | Checks if TOC has page numbers |
+| `toc_transformer(toc_content, model)` | Transforms raw TOC to JSON structure |
+| `extract_toc_content(content, model)` | Full TOC extraction with continuation logic |
+
+### TOC Index Mapping
+| Function | Purpose |
+|----------|---------|
+| `toc_index_extractor(toc, content, model)` | Maps TOC entries to physical page indices |
+| `extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index)` | Matches TOC entries with page indices |
+| `calculate_page_offset(pairs)` | Computes offset between TOC page numbers and physical indices |
+| `add_page_offset_to_toc_json(data, offset)` | Applies offset to TOC entries |
+
+### Title Verification
+| Function | Purpose |
+|----------|---------|
+| `check_title_appearance(item, page_list, start_index, model)` | Async — verifies section title appears on page |
+| `check_title_appearance_in_start(title, page_text, model)` | Async — checks if section starts at page beginning |
+| `check_title_appearance_in_start_concurrent(structure, page_list, model)` | Async — batch title start verification |
+
+### Tree Building
+| Function | Purpose |
+|----------|---------|
+| `page_list_to_group_text(page_contents, token_lengths, max_tokens, overlap_page)` | Chunks pages into LLM-digestible groups |
+| `add_page_number_to_toc(part, structure, model)` | Adds page numbers to partial TOC |
+| `remove_first_physical_index_section(text)` | Strips first section between `<physical_index_*>` tags |
+| `list_to_tree(data)` | Converts flat list to hierarchical tree |
+| `add_preface_if_needed(data)` | Inserts "Preface" node if doc starts after page 1 |
+| `post_processing(structure, end_physical_index)` | Converts `physical_index` → `start_index`/`end_index` |
+
+### Verification & Correction
+| Function | Purpose |
+|----------|---------|
+| `verify_toc(page_list, list_result, start_index, N, model)` | Async — checks TOC accuracy via LLM |
+| `fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, ...)` | Async — retries failed TOC items |
+| `fix_incorrect_toc_with_retries(...)` | Async — multiple fix attempts |
+| `validate_and_truncate_physical_indices(...)` | Removes out-of-bounds page indices |
+
+### Large Node Processing
+| Function | Purpose |
+|----------|---------|
+| `process_large_node_recursively(node, page_list, opt, logger)` | Async — handles oversized nodes by recursive splitting |
+
+### Main Pipeline
+| Function | Purpose |
+|----------|---------|
+| `meta_processor(page_list, mode, toc_content, toc_page_list, start_index, opt, logger)` | Async — orchestrates PDF indexing modes |
+| `tree_parser(page_list, opt, doc, logger)` | Async — builds tree structure from pages |
+| `page_index_main(doc, opt)` | Synchronous entry point |
+| `page_index(doc, model, toc_check_page_num, ...)` | User-facing API |
+
+---
+
+## `pageindex/page_index_md.py` — Markdown Indexing (341 lines)
+
+| Function | Purpose |
+|----------|---------|
+| `extract_nodes_from_markdown(markdown_content)` | Parses `#` headers into node list |
+| `extract_node_text_content(node_list, markdown_lines)` | Extracts text between headers |
+| `update_node_list_with_text_token_count(node_list, model)` | Calculates token counts for thinning |
+| `tree_thinning_for_index(node_list, min_node_token, model)` | Merges small nodes into parents |
+| `build_tree_from_nodes(node_list)` | Converts flat nodes to tree hierarchy |
+| `clean_tree_for_output(tree_nodes)` | Removes internal fields |
+| `get_node_summary(node, summary_token_threshold, model)` | Async — generates or returns truncated text |
+| `generate_summaries_for_structure_md(structure, summary_token_threshold, model)` | Async — batch summary generation |
+| `md_to_tree(md_path, if_thinning, min_token_threshold, ...)` | Async — main markdown indexing function |
+
+---
+
+## `pageindex/client.py` — Workspace Client (234 lines)
+
+### `PageIndexClient` Class
+| Method | Purpose |
+|--------|---------|
+| `__init__(api_key, model, retrieve_model, workspace)` | Initializes client, loads workspace |
+| `index(file_path, mode)` | Indexes PDF or MD, returns `doc_id` |
+| `get_document(doc_id)` | Returns document metadata JSON |
+| `get_document_structure(doc_id)` | Returns tree structure JSON |
+| `get_page_content(doc_id, pages)` | Returns page content (e.g., `'5-7'`) |
+
+### Internal Helpers
+| Method | Purpose |
+|--------|---------|
+| `_make_meta_entry(doc)` | Builds lightweight meta entry |
+| `_read_json(path)` | Safe JSON read |
+| `_save_doc(doc_id)` | Persists doc to workspace |
+| `_rebuild_meta()` | Scans workspace for docs |
+| `_read_meta()` | Reads `_meta.json` |
+| `_save_meta(doc_id, entry)` | Updates `_meta.json` |
+| `_load_workspace()` | Loads existing docs on init |
+| `_ensure_doc_loaded(doc_id)` | Lazy-loads full doc JSON |
+
+### Internal Constants
+- `META_INDEX = "_meta.json"` — workspace metadata filename
+
+---
+
+## `pageindex/retrieve.py` — Retrieval Helpers (137 lines)
+
+| Function | Purpose |
+|----------|---------|
+| `_parse_pages(pages)` | Parses `'5-7'`, `'3,8'`, `'12'` → sorted int list |
+| `_count_pages(doc_info)` | Returns PDF page count |
+| `_get_pdf_page_content(doc_info, page_nums)` | Extracts text from PDF pages |
+| `_get_md_page_content(doc_info, page_nums)` | Extracts text from markdown nodes |
+| `get_document(documents, doc_id)` | Returns doc metadata JSON |
+| `get_document_structure(documents, doc_id)` | Returns structure JSON (no text) |
+| `get_page_content(documents, doc_id, pages)` | Returns page content JSON |
+
+---
+
+## `pageindex/utils.py` — Utilities (710 lines)
+
+### LLM Interface
+| Function | Purpose |
+|----------|---------|
+| `count_tokens(text, model)` | Token counting via litellm |
+| `llm_completion(model, prompt, chat_history, return_finish_reason)` | Sync completion with 10 retries |
+| `llm_acompletion(model, prompt)` | Async completion with 10 retries |
+
+### JSON Parsing
+| Function | Purpose |
+|----------|---------|
+| `extract_json(content)` | Extracts JSON from ` ```json ` blocks, handles cleanup |
+| `get_json_content(response)` | Strips markdown code fences |
+
+### Tree Utilities
+| Function | Purpose |
+|----------|---------|
+| `write_node_id(data, node_id)` | Assigns 4-digit zero-padded IDs |
+| `get_nodes(structure)` | Flatten tree to node list |
+| `structure_to_list(structure)` | Alias for `get_nodes` |
+| `get_leaf_nodes(structure)` | Returns only leaf nodes |
+| `is_leaf_node(data, node_id)` | Checks if node is leaf |
+| `get_last_node(structure)` | Returns last node |
+| `list_to_tree(data)` | Converts flat list to tree |
+| `remove_fields(data, fields)` | Recursively removes fields |
+| `remove_structure_text(data)` | Removes 'text' field from tree |
+| `print_toc(tree, indent)` | Pretty-prints tree |
+| `print_json(data, max_len, indent)` | Pretty-prints JSON |
+
+### PDF Utilities
+| Function | Purpose |
+|----------|---------|
+| `extract_text_from_pdf(pdf_path)` | Full PDF text extraction |
+| `get_pdf_title(pdf_path)` | Extracts PDF metadata title |
+| `get_text_of_pages(pdf_path, start_page, end_page, tag)` | Extracts pages with `<start_index_X>` tags |
+| `get_page_tokens(pdf_path, model, pdf_parser)` | Returns `[(text, token_count)]` per page |
+| `get_text_of_pdf_pages(pdf_pages, start_page, end_page)` | Gets text from page tuples |
+| `get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page)` | Gets text with `<physical_index_*>` labels |
+| `get_number_of_pages(pdf_path)` | Returns page count |
+| `get_pdf_name(pdf_path)` | Extracts sanitized PDF filename |
+
+### Markdown Utilities
+| Function | Purpose |
+|----------|---------|
+| `sanitize_filename(filename, replacement)` | Replaces `/` with replacement |
+
+### Config
+| Class | Purpose |
+|-------|---------|
+| `ConfigLoader` | Loads `config.yaml` with user overrides (from `SimpleNamespace`) |
+| `JsonLogger` | JSON file logger for indexing runs |
+
+---
+
+## `run_pageindex.py` — CLI Entry (133 lines)
+
+### Arguments
+| Argument | Description |
+|----------|-------------|
+| `--pdf_path` | Path to PDF file |
+| `--md_path` | Path to Markdown file |
+| `--model` | Override LLM model |
+| `--toc-check-pages` | Max TOC detection pages (PDF) |
+| `--max-pages-per-node` | Max pages per tree node (PDF) |
+| `--max-tokens-per-node` | Max tokens per node (PDF) |
+| `--if-add-node-id` | Add node IDs |
+| `--if-add-node-summary` | Add node summaries |
+| `--if-add-doc-description` | Add document description |
+| `--if-add-node-text` | Include node text |
+| `--if-thinning` | Apply tree thinning (MD only) |
+| `--thinning-threshold` | Min tokens for thinning (MD only) |
+| `--summary-token-threshold` | Summary threshold (MD only) |
+
+### Output
+- Writes `{pdf_name}_structure.json` to `./results/` directory
+
+---
+
+## Data Flow Summary
+
+```
+PDF/MD Input
+    │
+    ▼
+┌─────────────────────────┐
+│  extract_nodes_from_md  │  (page_index_md.py)
+│  get_page_tokens        │  (page_index.py / utils.py)
+└─────────────────────────┘
+    │
+    ▼
+┌─────────────────────────┐
+│  tree_parser            │  (page_index.py)
+│  md_to_tree             │  (page_index_md.py)
+└─────────────────────────┘
+    │
+    ▼
+┌─────────────────────────┐
+│  LLM calls via litellm  │
+│  - toc_transformer      │
+│  - verify_toc           │
+│  - fix_incorrect_toc    │
+│  - generate_summaries   │
+└─────────────────────────┘
+    │
+    ▼
+┌─────────────────────────┐
+│  Tree Structure JSON    │
+│  {title, node_id,       │
+│   start_index, end_index│
+│   summary, text, nodes} │
+└─────────────────────────┘
+```
+
+---
+
+## Config Defaults (`config.yaml`)
+```yaml
+model: "gpt-4o-2024-11-20"
+retrieve_model: "gpt-5.4"
+toc_check_page_num: 20
+max_page_num_each_node: 10
+max_token_num_each_node: 20000
+if_add_node_id: "yes"
+if_add_node_summary: "yes"
+if_add_doc_description: "no"
+if_add_node_text: "no"
+```
+
+---
+
+## Key Design Patterns
+
+1. **Async LLM Calls**: Heavy use of `asyncio` + `litellm.acompletion` for concurrent API calls
+2. **Fallback Modes**: PDF processing has 3 modes (`process_toc_with_page_numbers` → `process_toc_no_page_numbers` → `process_no_toc`)
+3. **Token Budgeting**: Groups pages into chunks respecting `max_token_num_each_node` (20k)
+4. **Workspace Pattern**: `PageIndexClient` persists indexed documents to a workspace directory
+5. **Lazy Loading**: Workspace documents load structure/pages on demand
+6. **Retry Logic**: 10 retries with 1s sleep on LLM failures
+7. **Verification Loop**: TOC accuracy checked and incorrect entries fixed automatically
\ No newline at end of file
diff --git a/05_PR_CANDIDATES.md b/05_PR_CANDIDATES.md
new file mode 100644
index 000000000..ba383143c
--- /dev/null
+++ b/05_PR_CANDIDATES.md
@@ -0,0 +1,120 @@
+# 05_PR_CANDIDATES.md — PageIndex PR Candidate Analysis
+
+## Open Issues with PRs (Active Contribution Opportunity)
+These issues have open PRs — check if they are close to mergeable or need help:
+
+| # | Title | PR Branch | Status |
+|---|-------|-----------|--------|
+| 285 | Route TOC-without-page-numbers documents to the correct strategy | `fix/toc-no-page-numbers-routing` | Open |
+| 282 | docs: document additional run_pageindex CLI options | `mack/pr-20260519-1648-pageindex` | Open |
+| 281 | Adds missing re import | `patch-1` | Open |
+| 280 | fix: return only requested pages from get_page_content on Markdown | `fix/md-discrete-pages-overcollection` | Open |
+| 279 | get_page_content over-collects on Markdown when given a comma-separated page list | — | No PR yet, related to #280 |
+| 277 | fix: graceful error recovery in TOC builder | `fix/graceful-error-recovery-in-toc-builder` | Open |
+| 276 | Feat/fastapi server | `feat/fastapi-server` | Open |
+| 274 | fix: add missing commas in LLM prompt JSON formats and guard list return types | `fix/llm-response-robustness` | Open |
+| 268 | CLI output path is hardcoded, making scripted usage difficult | `feat/cli-output-path` | Open |
+| 267 | Defend against single-page TOC misclassification dropping all content | `fix/single-page-toc-misclassification` | Open |
+| 266 | Improve indexing robustness and logging | `fix/robustness-indexing` | Open |
+| 264 | fix: clarify single-page toc detection | `codex/single-page-toc-guard` | Open |
+| 250 | Recognize whole-line bold as level-1 heading in markdown parser | `feat/md-bold-heading-recognition` | Open |
+| 249 | fix: fix markdown parser edge cases in page_index_md.py | `fix/issue-245-markdown-parser-edge-cases` | Open |
+| 246 | fix: make markdown parser robust | `markdown_fix` | Open |
+| 245 | Markdown parser fails on common edge cases | — | No PR yet |
+| 218 | fix: comprehensive crash guards for malformed LLM output | `fix/comprehensive-crash-guards` | Open |
+| 213 | fix: apply regex-based trailing comma removal before JSON parse | `fix/issue-195-extract-json-regex-cleanup` | Open |
+
+---
+
+## High-Value Good First Issues (No PR Yet)
+
+### 1. Issue #286 — Requirements Version Conflict
+**Title**: "Installation - Requirements with version of litellm need python-dotenv==1.0.1 but conflict with requirements python-dotenv==1.2.2"
+- **Problem**: Dependency conflict between litellm (expects python-dotenv==1.0.1) and requirements.txt (has python-dotenv==1.2.2)
+- **Impact**: Installation failure for new users
+- **Fix**: Pin compatible versions or update litellm to a version that accepts newer dotenv
+- **Complexity**: Low — dependency version resolution
+
+### 2. Issue #283 — Unthrottled Concurrent LLM Requests Cause 429 Rate Limits
+**Title**: "[Bug] Unthrottled concurrent LLM requests lead to HTTP 429 Rate Limits and cascading KeyError in tree generation"
+- **Problem**: Async code fires unlimited concurrent requests, hitting rate limits
+- **Impact**: Production reliability, cascading failures
+- **Fix**: Add a semaphore/throttle to limit concurrent LLM calls (e.g., `asyncio.Semaphore(5)`)
+- **Complexity**: Medium — async concurrency control
+
+### 3. Issue #284 — Multi-Document Literature Review Support
+**Title**: "Does PageIndex work well for synthesizing a literature review or survey report from dozens of separate documents"
+- **Problem**: Question about multi-document support
+- **Opportunity**: Could close with docs update or reference to `feat/multi-doc-support` PR #216
+- **Complexity**: Documentation or feature
+
+### 4. Issue #278 — Local Private Model Configuration
+**Title**: "How do I configure a local private model? Is there a dashboard-like visual interface when running locally?" (translated from Chinese)
+- **Problem**: User asking about local/private LLM model setup
+- **Fix**: Update docs for LiteLLM local endpoint configuration
+- **Complexity**: Low — documentation
+
+---
+
+## Recently Closed/Merged PRs — Pattern Analysis
+
+| PR | Title | Theme |
+|----|-------|-------|
+| 271 | Update README | Maintenance |
+| 262/261/259 | update README | Maintenance |
+| 256/255 | Fix Agentic RAG entry formatting | Bug fix |
+| 248 | Add security CI workflows | CI/CD |
+| 247 | Bump pip group | Dependencies |
+| 241 | Add Dependabot config for GitHub Actions | CI/CD |
+| 238 | feat: compatible with PageIndex SDK | Feature |
+| 228 | feat: Universal Local LLM & Custom Endpoint Support | Feature |
+| 227 | feat: add checkpoint/resume for long document processing | Feature |
+| 226 | fix: poll status=="completed" in cloud add_document | Bug fix |
+| 221 | Add FastAPI server for PageIndex document indexing service | Feature |
+| 218 | fix: comprehensive crash guards for malformed LLM output | Bug fix |
+| 216 | feat: add multi-document support to retrieval and client API | Feature |
+| 213 | fix: apply regex-based trailing comma removal before JSON parse | Bug fix |
+| 207 | feat: add PageIndex SDK with local/cloud dual-mode support | Feature |
+
+**Pattern**: Recent work focuses on SDK features, robustness, and multi-document support.
+
+---
+
+## Active Feature Branches in Upstream
+
+| Branch | Description |
+|--------|-------------|
+| `dev` | Development branch |
+| `feat/markdown-tree` | Markdown tree feature |
+| `feat/md-bold-heading-recognition` | Bold heading detection |
+| `fix/cloud-poll-status-completed` | Cloud polling fix |
+| `add-pypdfium2-parser` | Optional pypdfium2 PDF parser |
+| `dependabot/*` | Dependency updates (3 branches) |
+
+---
+
+## PR Merge Velocity
+- Sample: the 30 most recently closed PRs (not all of them merged)
+- Merged: ~20 PRs in recent weeks
+- Active maintenance with good review turnaround
+
+---
+
+## Dependency PRs (Quick Wins)
+
+Three dependabot PRs waiting to merge:
+- #275: `actions/dependency-review-action` 4→5
+- #243: `actions/github-script` 7→9
+- #242: `actions/checkout` 4→6
+
+These are routine and should be merged.
+
+---
+
+## Recommended Focus Areas for Contribution
+
+1. **Bug fixes with existing PRs**: Issues #280, #274, #249, #246 — PRs exist, review/test them
+2. **Markdown parser**: Issues #245, #250, #249, #246 — multiple related issues, could be consolidated
+3. **Rate limiting**: Issue #283 — needs implementation of semaphore-based concurrency control
+4. **Dependency conflicts**: Issue #286 — straightforward version pinning
+5. **Test coverage**: Issue #263 — "100% Unit Test Coverage" — major gap in the project
\ No newline at end of file
diff --git a/06_SELECTED_5_PR_PLAN.md b/06_SELECTED_5_PR_PLAN.md
new file mode 100644
index 000000000..f4387f7fd
--- /dev/null
+++ b/06_SELECTED_5_PR_PLAN.md
@@ -0,0 +1,228 @@
+# 06_SELECTED_5_PR_PLAN.md — PageIndex Top 5 PR Implementation Plan
+
+## Selection Criteria
+- **High impact** on user experience or reliability
+- **Medium-to-low complexity** — achievable in focused sessions
+- **Clear scope** — well-defined problem and solution
+- **No existing test suite** — write unit tests alongside fixes
+- **Avoid duplicating active PRs** — pick where we can add unique value
+
+---
+
+## Selected PR #1: Fix `get_page_content` Over-Collection on Markdown (Issues #279 + #280)
+
+**Issue**: `get_page_content` for Markdown documents returns content from nodes outside the requested page range when using comma-separated lists (e.g., `'3,8'`).
+**PR**: #280 already filed (`fix/md-discrete-pages-overcollection`) but needs validation.
+
+### Current Behavior (retrieve.py:56-76)
+```python
+def _get_md_page_content(doc_info: dict, page_nums: list[int]) -> list[dict]:
+    min_line, max_line = min(page_nums), max(page_nums)  # ← BUG: uses min/max not discrete pages
+    # ...
+```
+The function finds nodes where `min_line <= line_num <= max_line`, which groups all nodes between the smallest and largest line numbers.
+
+### Fix Approach
+Rewrite `_get_md_page_content` to:
+1. Treat each page number as a discrete line number
+2. Return only nodes with `line_num == page` for each requested page
+3. Use a set to deduplicate when the same line number is requested more than once
+
+### Implementation Steps
+1. Read `pageindex/retrieve.py` — understand current `_get_md_page_content`
+2. Modify `_get_md_page_content` to collect per-page nodes discretely
+3. Add unit test in `tests/test_retrieve_md.py`
+4. Test with example markdown files in `examples/documents/`
+
+### Files to Modify
+- `pageindex/retrieve.py` — fix `_get_md_page_content`
+- `tests/test_retrieve_md.py` — new test file (create tests/ directory)
+
+---
+
+## Selected PR #2: Add Concurrency Throttling for LLM Requests (Issue #283)
+
+**Issue**: Unthrottled concurrent LLM requests cause HTTP 429 rate limits and cascading `KeyError` in tree generation.
+**Problem**: Code uses `asyncio.gather(*tasks)` with no limit on concurrent calls. When processing large documents, hundreds of LLM calls fire simultaneously.
+
+### Fix Approach
+Add a semaphore to limit concurrent LLM API calls globally.
+
+### Implementation Steps
+1. Create a concurrency limiter module `pageindex/concurrency.py`:
+   ```python
+   import asyncio
+   _sem = asyncio.Semaphore(5)  # Configurable via config.yaml
+   
+   async def limited_acompletion(*args, **kwargs):
+       async with _sem:
+           return await llm_acompletion(*args, **kwargs)
+   ```
+2. Update `config.yaml` to add `max_concurrent_llm_calls: 5`
+3. Update `ConfigLoader` to parse this new option
+4. Replace direct `llm_acompletion` calls with `limited_acompletion` in:
+   - `pageindex/page_index.py` — `check_title_appearance`, `check_title_appearance_in_start`, `verify_toc`, `fix_incorrect_toc`
+   - `pageindex/page_index_md.py` — `get_node_summary`
+5. Add test `tests/test_concurrency.py` that verifies semaphore behavior
+
+### Files to Modify
+- `pageindex/concurrency.py` — new file
+- `pageindex/utils.py` — export limiter function
+- `pageindex/page_index.py` — use limited versions
+- `pageindex/page_index_md.py` — use limited versions
+- `pageindex/config.yaml` — add `max_concurrent_llm_calls`
+- `tests/test_concurrency.py` — new test file
+
+### Complexity: Medium
+- Requires understanding async patterns in the codebase
+- Must not break existing functionality
+- Semaphore should be configurable
+
+---
+
+## Selected PR #3: Fix Dependency Version Conflict (Issue #286)
+
+**Issue**: `litellm==1.83.7` requires `python-dotenv==1.0.1` but `requirements.txt` pins `python-dotenv==1.2.2` — causes installation failure.
+
+### Fix Approach
+Upgrade `litellm` to a version compatible with `python-dotenv>=1.2.2`, or downgrade dotenv in requirements. Check the litellm changelog to see whether and when the dotenv constraint was relaxed.
+
+### Implementation Steps
+1. Check litellm release notes for dotenv compatibility
+2. Update `requirements.txt`:
+   - Option A: Bump litellm to latest (check if it supports dotenv 1.2.x)
+   - Option B: Pin python-dotenv to 1.0.1 if litellm requires it
+3. Test `pip install -r requirements.txt` in fresh virtual environment
+4. Verify PageIndex still imports correctly
+5. Update README if installation steps change
+
+### Files to Modify
+- `requirements.txt` — version adjustment
+
+### Complexity: Low
+- Direct dependency version fix
+- Should verify on fresh install
+
+---
+
+## Selected PR #4: Add `--output-dir` / `--output-file` CLI Options for Scripted Usage (Issue #268)
+
+**Issue**: CLI output path is hardcoded to `./results/{name}_structure.json`, making scripted/automated usage difficult.
+
+### Current Code (run_pageindex.py:72-75)
+```python
+output_dir = './results'
+output_file = f'{output_dir}/{pdf_name}_structure.json'
+os.makedirs(output_dir, exist_ok=True)
+```
+
+### Fix Approach
+Add `--output-dir` and/or `--output-file` CLI arguments.
+
+### Implementation Steps
+1. Add to `run_pageindex.py` argument parser:
+   ```python
+   parser.add_argument('--output-dir', type=str, default='./results',
+                       help='Output directory for results (default: ./results)')
+   parser.add_argument('--output-file', type=str, default=None,
+                       help='Output file path (overrides default naming)')
+   ```
+2. Use `args.output_dir` and `args.output_file` in the output section
+3. Handle the case where the user provides a custom path whose directory doesn't exist
+4. Update README.md usage section with new options
+5. Add `tests/test_cli.py` with subprocess tests for new arguments
+
+### Files to Modify
+- `run_pageindex.py` — add CLI arguments
+- `README.md` — document new options
+- `tests/test_cli.py` — new test file
+
+### Complexity: Low
+- Straightforward CLI enhancement
+- Clear user demand (issue explicitly mentions scripted usage)
+
+---
+
+## Selected PR #5: Markdown Parser Edge Cases (Issues #245, #246, #249, #250)
+
+**Issue Group**: Multiple open issues about markdown parser failures:
+- #245: "Markdown parser fails on common edge cases"
+- #246: "fix: make markdown parser robust" (PR exists)
+- #249: "fix: fix markdown parser edge cases in page_index_md.py" (PR exists)
+- #250: "feat: recognize whole-line bold as level-1 heading in markdown parser" (PR exists)
+
+**Problem**: The markdown parser (`page_index_md.py`) has known edge cases that fail:
+- Bold headings not recognized as level-1
+- Code blocks interfering with header detection
+- Edge cases in header level detection
+
+### Fix Approach
+Audit `extract_nodes_from_markdown()` in `page_index_md.py` and add robust handling.
+
+### Implementation Steps
+1. Read `pageindex/page_index_md.py` lines 32-59 (`extract_nodes_from_markdown`)
+2. Create `examples/documents/test_cases.md` with edge case examples:
+   - Bold headers: `**Section Title**`
+   - Italic headers: `*Section Title*`
+   - Code blocks with hash characters
+   - Mixed header levels
+   - Headers at start/end of code blocks
+3. Write failing tests in `tests/test_markdown_parser.py`
+4. Fix `extract_nodes_from_markdown` to handle:
+   - Skip headers inside code blocks (already done with `in_code_block` flag — verify)
+   - Recognize markdown emphasis patterns as headers
+5. Run full markdown test suite
+
+### Files to Modify
+- `pageindex/page_index_md.py` — fix `extract_nodes_from_markdown`
+- `examples/documents/test_cases.md` — new test file with edge cases
+- `tests/test_markdown_parser.py` — new test file
+
+### Complexity: Medium
+- Requires understanding of regex patterns
+- Multiple edge cases to handle carefully
+
+---
+
+## Summary Table
+
+| # | PR Title | Issue(s) | Files | Complexity |
+|---|----------|----------|-------|------------|
+| 1 | Fix MD page content over-collection | #279, #280 | retrieve.py, tests/ | Low |
+| 2 | Add LLM concurrency throttling | #283 | concurrency.py (new), page_index.py, page_index_md.py, config.yaml | Medium |
+| 3 | Fix dependency version conflict | #286 | requirements.txt | Low |
+| 4 | Add --output-dir / --output-file CLI options | #268 | run_pageindex.py, README.md | Low |
+| 5 | Markdown parser edge cases | #245, #246, #249, #250 | page_index_md.py, tests/ | Medium |
+
+---
+
+## Testing Strategy
+
+Since the project has **no existing test suite**, create a `tests/` directory with:
+- `tests/__init__.py` — package marker
+- `tests/conftest.py` — pytest fixtures (sample PDF path, sample MD content)
+- `tests/test_retrieve_md.py` — markdown retrieval tests
+- `tests/test_concurrency.py` — semaphore behavior tests
+- `tests/test_cli.py` — CLI argument tests
+- `tests/test_markdown_parser.py` — markdown parsing tests
+
+Use `pytest` as the test framework. Add to `requirements.txt` if not present.
+
+---
+
+## CI Integration
+
+Once tests are written, add a test workflow to `.github/workflows/`:
+```yaml
+name: Tests
+on: [push, pull_request]
+jobs:
+  test:
+    runs-on: ubuntu-latest
+    steps:
+      - uses: actions/checkout@v4
+      - name: Install dependencies
+        run: pip install -r requirements.txt pytest
+      - name: Run tests
+        run: pytest tests/ -v
+```
\ No newline at end of file
diff --git a/README.md b/README.md
index ddc80314e..cdfb2a594 100644
--- a/README.md
+++ b/README.md
@@ -101,6 +101,27 @@ The PageIndex service is available as a ChatGPT-style [chat platform](https://ch
   </a>
 </div>
 
+### 🚀 Quickstart
+
+Get started with PageIndex in three steps:
+
+**1. Install dependencies:**
+```bash
+pip3 install --upgrade -r requirements.txt
+```
+
+**2. Set your LLM API key:**
+```bash
+echo "OPENAI_API_KEY=your_openai_key_here" > .env
+```
+
+**3. Generate a tree index from your PDF:**
+```bash
+python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
+```
+
+This outputs a structured tree index with node IDs, page ranges, and summaries. See the [tree structure section](#-pageindex-tree-structure) for example output.
+
 ---
 
 # 🌲 PageIndex Tree Structure

From 5f22fb3fd906632b188e2523fe67af9476ed93f3 Mon Sep 17 00:00:00 2001
From: Hermes Agent <root@okwn.cc>
Date: Sat, 23 May 2026 23:53:13 +0000
Subject: [PATCH 2/2] Add Python version and PDF type note to quickstart

---
 README.md | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/README.md b/README.md
index cdfb2a594..c132bbb70 100644
--- a/README.md
+++ b/README.md
@@ -120,6 +120,8 @@ echo "OPENAI_API_KEY=your_openai_key_here" > .env
 python3 run_pageindex.py --pdf_path /path/to/your/document.pdf
 ```
 
+> **Note:** Ensure you have Python 3.8+ installed. For best results, use a PDF with an embedded text layer (selectable text) rather than scanned page images.
+
 This outputs a structured tree index with node IDs, page ranges, and summaries. See the [tree structure section](#-pageindex-tree-structure) for example output.
 
 ---