From 6bf3486110eaf18db6b21c153163d034c719b4c9 Mon Sep 17 00:00:00 2001 From: Hermes Agent Date: Sat, 23 May 2026 09:28:17 +0000 Subject: [PATCH 1/2] docs: add quickstart section to README --- 00_STATE.md | 102 +++++++++++++++ 01_REPO_MAP.md | 270 +++++++++++++++++++++++++++++++++++++++ 05_PR_CANDIDATES.md | 120 +++++++++++++++++ 06_SELECTED_5_PR_PLAN.md | 228 +++++++++++++++++++++++++++++++++ README.md | 21 +++ 5 files changed, 741 insertions(+) create mode 100644 00_STATE.md create mode 100644 01_REPO_MAP.md create mode 100644 05_PR_CANDIDATES.md create mode 100644 06_SELECTED_5_PR_PLAN.md diff --git a/00_STATE.md b/00_STATE.md new file mode 100644 index 000000000..f82884d17 --- /dev/null +++ b/00_STATE.md @@ -0,0 +1,102 @@ +# 00_STATE.md — PageIndex Repository Analysis + +## Repository Identity +- **Upstream**: VectifyAI/PageIndex (source) +- **Fork**: okwn/PageIndex (working copy at /root/oss-pr-campaign/repos/pageindex) +- **License**: MIT +- **Archived**: No +- **Language**: Python + +## Repository Statistics (Upstream) +- **Stars**: 31,969 +- **Forks**: 2,754 +- **Open Issues**: 30 (of 158 total) +- **Open PRs**: 3 (dependency bumps) +- **Watchers**: 31,969 +- **Default Branch**: main + +## Repository Structure +``` +pageindex/ +├── pageindex/ # Main package +│ ├── __init__.py # Exports: page_index, md_to_tree, retrieve functions, PageIndexClient +│ ├── page_index.py # PDF indexing logic (~1150 lines, async LLM-driven) +│ ├── page_index_md.py # Markdown indexing logic (~340 lines) +│ ├── client.py # PageIndexClient workspace-based API (~234 lines) +│ ├── retrieve.py # Document/page retrieval helpers (~137 lines) +│ ├── utils.py # LLM wrappers, token counting, tree utilities (~710 lines) +│ └── config.yaml # Default config (gpt-4o-2024-11-20 model, various limits) +├── run_pageindex.py # CLI entry point for PDF/MD processing +├── requirements.txt # Dependencies (litellm, pymupdf, PyPDF2, python-dotenv, pyyaml) +├── examples/ +│ ├── 
agentic_vectorless_rag_demo.py +│ ├── documents/ +│ │ ├── q1-fy25-earnings.pdf +│ │ ├── four-lectures.pdf +│ │ ├── earthmover.pdf +│ │ └── [other PDFs] +│ │ └── results/ # Pre-generated tree structures +│ └── tutorials/ +├── .github/ +│ ├── workflows/ # CI: codeql, dependency-review, autoclose, dedupe +│ ├── scripts/ # autoclose-labeled-issues.js, comment-on-duplicates.sh +│ └── dependabot.yml # Weekly GitHub Actions dependency updates +└── README.md # Full documentation with examples + +``` + +## Key Upstream Branches +- `main` — stable release +- `dev` — development work +- `feat/markdown-tree`, `feat/md-bold-heading-recognition` — feature branches +- `fix/cloud-poll-status-completed`, `add-pypdfium2-parser` — fix branches +- `dependabot/*` — automated dependency updates + +## Current Working Branch +- **Local main** is tracking `upstream/main` +- Fork created via `gh api --method POST repos/VectifyAI/PageIndex/forks` + +## Installation +```bash +pip3 install --upgrade -r requirements.txt +# Optional: openai-agents for examples/agentic_vectorless_rag_demo.py +``` + +## Core Functionality Summary +PageIndex is a **vectorless, reasoning-based RAG** system that: +1. Builds a hierarchical tree index (ToC-style) from PDFs or markdown +2. Uses LLMs to reason over the tree for context-aware retrieval +3. 
Achieved 98.7% accuracy on FinanceBench (Mafin 2.5 system) + +## Package Usage +```bash +# PDF processing +python3 run_pageindex.py --pdf_path /path/to/document.pdf + +# Markdown processing +python3 run_pageindex.py --md_path /path/to/document.md +``` +```python +from pageindex import PageIndexClient # Via Python API +client = PageIndexClient(api_key="...") +doc_id = client.index("document.pdf") +print(client.get_document_structure(doc_id)) +``` + +## No Test Suite Found +- No pytest, unittest, or test files present in the repository +- No CI workflow for running tests + +## CI/CD +- **CodeQL**: Security analysis on push/PR to main +- **Dependency Review**: Scans dependency changes on PRs +- **Dependabot**: Weekly GitHub Actions updates (actions/checkout, dependency-review-action, github-script) +- **Autoclose**: Auto-closes issues with specific labels +- **Dedupe**: Issue deduplication workflow + +## Health Indicators +- Active upstream (31k stars, 2.7k forks, 158 issues) +- Regular maintenance via Dependabot +- Multiple active branches for features/fixes +- No test suite — notable gap for OSS contribution +- 3 open dependency PRs (unmerged) \ No newline at end of file diff --git a/01_REPO_MAP.md b/01_REPO_MAP.md new file mode 100644 index 000000000..41da1a6b7 --- /dev/null +++ b/01_REPO_MAP.md @@ -0,0 +1,270 @@ +# 01_REPO_MAP.md — PageIndex Codebase Map + +## Package Exports (`pageindex/__init__.py`) +```python +from .page_index import * # page_index(), page_index_main() +from .page_index_md import md_to_tree +from .retrieve import get_document, get_document_structure, get_page_content +from .client import PageIndexClient +``` + +--- + +## `pageindex/page_index.py` — PDF Indexing (1153 lines) + +### TOC Detection & Extraction +| Function | Purpose | +|----------|---------| +| `toc_detector_single_page(content)` | Detects if a page contains a table of contents | +| `find_toc_pages(start_page_index, page_list, opt)` | Scans pages for TOC presence | +| `toc_extractor(page_list, 
toc_page_list, model)` | Extracts raw TOC text from pages | +| `detect_page_index(toc_content, model)` | Checks if TOC has page numbers | +| `toc_transformer(toc_content, model)` | Transforms raw TOC to JSON structure | +| `extract_toc_content(content, model)` | Full TOC extraction with continuation logic | + +### TOC Index Mapping +| Function | Purpose | +|----------|---------| +| `toc_index_extractor(toc, content, model)` | Maps TOC entries to physical page indices | +| `extract_matching_page_pairs(toc_page, toc_physical_index, start_page_index)` | Matches TOC entries with page indices | +| `calculate_page_offset(pairs)` | Computes offset between TOC page numbers and physical indices | +| `add_page_offset_to_toc_json(data, offset)` | Applies offset to TOC entries | + +### Title Verification +| Function | Purpose | +|----------|---------| +| `check_title_appearance(item, page_list, start_index, model)` | Async — verifies section title appears on page | +| `check_title_appearance_in_start(title, page_text, model)` | Async — checks if section starts at page beginning | +| `check_title_appearance_in_start_concurrent(structure, page_list, model)` | Async — batch title start verification | + +### Tree Building +| Function | Purpose | +|----------|---------| +| `page_list_to_group_text(page_contents, token_lengths, max_tokens, overlap_page)` | Chunks pages into LLM-digestible groups | +| `add_page_number_to_toc(part, structure, model)` | Adds page numbers to partial TOC | +| `remove_first_physical_index_section(text)` | Strips first section between `` tags | +| `list_to_tree(data)` | Converts flat list to hierarchical tree | +| `add_preface_if_needed(data)` | Inserts "Preface" node if doc starts after page 1 | +| `post_processing(structure, end_physical_index)` | Converts `physical_index` → `start_index`/`end_index` | + +### Verification & Correction +| Function | Purpose | +|----------|---------| +| `verify_toc(page_list, list_result, start_index, N, model)` | Async — 
checks TOC accuracy via LLM | +| `fix_incorrect_toc(toc_with_page_number, page_list, incorrect_results, ...)` | Async — retries failed TOC items | +| `fix_incorrect_toc_with_retries(...)` | Async — multiple fix attempts | +| `validate_and_truncate_physical_indices(...)` | Removes out-of-bounds page indices | + +### Large Node Processing +| Function | Purpose | +|----------|---------| +| `process_large_node_recursively(node, page_list, opt, logger)` | Async — handles oversized nodes by recursive splitting | + +### Main Pipeline +| Function | Purpose | +|----------|---------| +| `meta_processor(page_list, mode, toc_content, toc_page_list, start_index, opt, logger)` | Async — orchestrates PDF indexing modes | +| `tree_parser(page_list, opt, doc, logger)` | Async — builds tree structure from pages | +| `page_index_main(doc, opt)` | Synchronous entry point | +| `page_index(doc, model, toc_check_page_num, ...)` | User-facing API | + +--- + +## `pageindex/page_index_md.py` — Markdown Indexing (341 lines) + +| Function | Purpose | +|----------|---------| +| `extract_nodes_from_markdown(markdown_content)` | Parses `#` headers into node list | +| `extract_node_text_content(node_list, markdown_lines)` | Extracts text between headers | +| `update_node_list_with_text_token_count(node_list, model)` | Calculates token counts for thinning | +| `tree_thinning_for_index(node_list, min_node_token, model)` | Merges small nodes into parents | +| `build_tree_from_nodes(node_list)` | Converts flat nodes to tree hierarchy | +| `clean_tree_for_output(tree_nodes)` | Removes internal fields | +| `get_node_summary(node, summary_token_threshold, model)` | Async — generates or returns truncated text | +| `generate_summaries_for_structure_md(structure, summary_token_threshold, model)` | Async — batch summary generation | +| `md_to_tree(md_path, if_thinning, min_token_threshold, ...)` | Async — main markdown indexing function | + +--- + +## `pageindex/client.py` — Workspace Client (234 lines) + 
+### `PageIndexClient` Class +| Method | Purpose | +|--------|---------| +| `__init__(api_key, model, retrieve_model, workspace)` | Initializes client, loads workspace | +| `index(file_path, mode)` | Indexes PDF or MD, returns `doc_id` | +| `get_document(doc_id)` | Returns document metadata JSON | +| `get_document_structure(doc_id)` | Returns tree structure JSON | +| `get_page_content(doc_id, pages)` | Returns page content (e.g., `'5-7'`) | + +### Internal Helpers +| Method | Purpose | +|--------|---------| +| `_make_meta_entry(doc)` | Builds lightweight meta entry | +| `_read_json(path)` | Safe JSON read | +| `_save_doc(doc_id)` | Persists doc to workspace | +| `_rebuild_meta()` | Scans workspace for docs | +| `_read_meta()` | Reads `_meta.json` | +| `_save_meta(doc_id, entry)` | Updates `_meta.json` | +| `_load_workspace()` | Loads existing docs on init | +| `_ensure_doc_loaded(doc_id)` | Lazy-loads full doc JSON | + +### Internal Constants +- `META_INDEX = "_meta.json"` — workspace metadata filename + +--- + +## `pageindex/retrieve.py` — Retrieval Helpers (137 lines) + +| Function | Purpose | +|----------|---------| +| `_parse_pages(pages)` | Parses `'5-7'`, `'3,8'`, `'12'` → sorted int list | +| `_count_pages(doc_info)` | Returns PDF page count | +| `_get_pdf_page_content(doc_info, page_nums)` | Extracts text from PDF pages | +| `_get_md_page_content(doc_info, page_nums)` | Extracts text from markdown nodes | +| `get_document(documents, doc_id)` | Returns doc metadata JSON | +| `get_document_structure(documents, doc_id)` | Returns structure JSON (no text) | +| `get_page_content(documents, doc_id, pages)` | Returns page content JSON | + +--- + +## `pageindex/utils.py` — Utilities (710 lines) + +### LLM Interface +| Function | Purpose | +|----------|---------| +| `count_tokens(text, model)` | Token counting via litellm | +| `llm_completion(model, prompt, chat_history, return_finish_reason)` | Sync completion with 10 retries | +| `llm_acompletion(model, prompt)` | 
Async completion with 10 retries | + +### JSON Parsing +| Function | Purpose | +|----------|---------| +| `extract_json(content)` | Extracts JSON from ` ```json ` blocks, handles cleanup | +| `get_json_content(response)` | Strips markdown code fences | + +### Tree Utilities +| Function | Purpose | +|----------|---------| +| `write_node_id(data, node_id)` | Assigns 4-digit zero-padded IDs | +| `get_nodes(structure)` | Flatten tree to node list | +| `structure_to_list(structure)` | Alias for `get_nodes` | +| `get_leaf_nodes(structure)` | Returns only leaf nodes | +| `is_leaf_node(data, node_id)` | Checks if node is leaf | +| `get_last_node(structure)` | Returns last node | +| `list_to_tree(data)` | Converts flat list to tree | +| `remove_fields(data, fields)` | Recursively removes fields | +| `remove_structure_text(data)` | Removes 'text' field from tree | +| `print_toc(tree, indent)` | Pretty-prints tree | +| `print_json(data, max_len, indent)` | Pretty-prints JSON | + +### PDF Utilities +| Function | Purpose | +|----------|---------| +| `extract_text_from_pdf(pdf_path)` | Full PDF text extraction | +| `get_pdf_title(pdf_path)` | Extracts PDF metadata title | +| `get_text_of_pages(pdf_path, start_page, end_page, tag)` | Extracts pages with `` tags | +| `get_page_tokens(pdf_path, model, pdf_parser)` | Returns `[(text, token_count)]` per page | +| `get_text_of_pdf_pages(pdf_pages, start_page, end_page)` | Gets text from page tuples | +| `get_text_of_pdf_pages_with_labels(pdf_pages, start_page, end_page)` | Gets text with `` labels | +| `get_number_of_pages(pdf_path)` | Returns page count | +| `get_pdf_name(pdf_path)` | Extracts sanitized PDF filename | + +### Markdown Utilities +| Function | Purpose | +|----------|---------| +| `sanitize_filename(filename, replacement)` | Replaces `/` with replacement | + +### Config +| Class | Purpose | +|-------|---------| +| `ConfigLoader` | Loads `config.yaml` with user overrides (from `SimpleNamespace`) | +| `JsonLogger` | JSON 
file logger for indexing runs | + +--- + +## `run_pageindex.py` — CLI Entry (133 lines) + +### Arguments +| Argument | Description | +|----------|-------------| +| `--pdf_path` | Path to PDF file | +| `--md_path` | Path to Markdown file | +| `--model` | Override LLM model | +| `--toc-check-pages` | Max TOC detection pages (PDF) | +| `--max-pages-per-node` | Max pages per tree node (PDF) | +| `--max-tokens-per-node` | Max tokens per node (PDF) | +| `--if-add-node-id` | Add node IDs | +| `--if-add-node-summary` | Add node summaries | +| `--if-add-doc-description` | Add document description | +| `--if-add-node-text` | Include node text | +| `--if-thinning` | Apply tree thinning (MD only) | +| `--thinning-threshold` | Min tokens for thinning (MD only) | +| `--summary-token-threshold` | Summary threshold (MD only) | + +### Output +- Writes `{pdf_name}_structure.json` to `./results/` directory + +--- + +## Data Flow Summary + +``` +PDF/MD Input + │ + ▼ +┌─────────────────────────┐ +│ extract_nodes_from_md │ (page_index_md.py) +│ get_page_tokens │ (page_index.py / utils.py) +└─────────────────────────┘ + │ + ▼ +┌─────────────────────────┐ +│ tree_parser │ (page_index.py) +│ md_to_tree │ (page_index_md.py) +└─────────────────────────┘ + │ + ▼ +┌─────────────────────────┐ +│ LLM calls via litellm │ +│ - toc_transformer │ +│ - verify_toc │ +│ - fix_incorrect_toc │ +│ - generate_summaries │ +└─────────────────────────┘ + │ + ▼ +┌─────────────────────────┐ +│ Tree Structure JSON │ +│ {title, node_id, │ +│ start_index, end_index│ +│ summary, text, nodes} │ +└─────────────────────────┘ +``` + +--- + +## Config Defaults (`config.yaml`) +```yaml +model: "gpt-4o-2024-11-20" +retrieve_model: "gpt-5.4" +toc_check_page_num: 20 +max_page_num_each_node: 10 +max_token_num_each_node: 20000 +if_add_node_id: "yes" +if_add_node_summary: "yes" +if_add_doc_description: "no" +if_add_node_text: "no" +``` + +--- + +## Key Design Patterns + +1. 
**Async LLM Calls**: Heavy use of `asyncio` + `litellm.acompletion` for concurrent API calls +2. **Fallback Modes**: PDF processing has 3 modes (`process_toc_with_page_numbers` → `process_toc_no_page_numbers` → `process_no_toc`) +3. **Token Budgeting**: Groups pages into chunks respecting `max_token_num_each_node` (20k) +4. **Workspace Pattern**: `PageIndexClient` persists indexed documents to a workspace directory +5. **Lazy Loading**: Workspace documents load structure/pages on demand +6. **Retry Logic**: 10 retries with 1s sleep on LLM failures +7. **Verification Loop**: TOC accuracy checked and incorrect entries fixed automatically \ No newline at end of file diff --git a/05_PR_CANDIDATES.md b/05_PR_CANDIDATES.md new file mode 100644 index 000000000..ba383143c --- /dev/null +++ b/05_PR_CANDIDATES.md @@ -0,0 +1,120 @@ +# 05_PR_CANDIDATES.md — PageIndex PR Candidate Analysis + +## Open Issues with PRs (Active Contribution Opportunity) +Most of these items already have an open PR; #279 and #245 are issues with no PR yet. Check whether each PR is close to mergeable or needs help: + +| # | Title | PR Branch | Status | +|---|-------|-----------|--------| +| 285 | Route TOC-without-page-numbers documents to the correct strategy | `fix/toc-no-page-numbers-routing` | Open | +| 282 | docs: document additional run_pageindex CLI options | `mack/pr-20260519-1648-pageindex` | Open | +| 281 | Adds missing re import | `patch-1` | Open | +| 280 | fix: return only requested pages from get_page_content on Markdown | `fix/md-discrete-pages-overcollection` | Open | +| 279 | get_page_content over-collects on Markdown when given a comma-separated page list | — | No PR yet, related to #280 | +| 277 | fix: graceful error recovery in TOC builder | `fix/graceful-error-recovery-in-toc-builder` | Open | +| 276 | Feat/fastapi server | `feat/fastapi-server` | Open | +| 274 | fix: add missing commas in LLM prompt JSON formats and guard list return types | `fix/llm-response-robustness` | Open | +| 268 | CLI output path is hardcoded, making scripted usage 
difficult | `feat/cli-output-path` | Open | +| 267 | Defend against single-page TOC misclassification dropping all content | `fix/single-page-toc-misclassification` | Open | +| 266 | Improve indexing robustness and logging | `fix/robustness-indexing` | Open | +| 264 | fix: clarify single-page toc detection | `codex/single-page-toc-guard` | Open | +| 250 | Recognize whole-line bold as level-1 heading in markdown parser | `feat/md-bold-heading-recognition` | Open | +| 249 | fix: fix markdown parser edge cases in page_index_md.py | `fix/issue-245-markdown-parser-edge-cases` | Open | +| 246 | fix: make markdown parser robust | `markdown_fix` | Open | +| 245 | Markdown parser fails on common edge cases | — | No PR yet | +| 218 | fix: comprehensive crash guards for malformed LLM output | `fix/comprehensive-crash-guards` | Open | +| 213 | fix: apply regex-based trailing comma removal before JSON parse | `fix/issue-195-extract-json-regex-cleanup` | Open | + +--- + +## High-Value Good First Issues (No PR Yet) + +### 1. Issue #286 — Requirements Version Conflict +**Title**: "Installation - Requirements with version of litellm need python-dotenv==1.0.1 but conflict with requirements python-dotenv==1.2.2" +- **Problem**: Dependency conflict between litellm (expects python-dotenv==1.0.1) and requirements.txt (has python-dotenv==1.2.2) +- **Impact**: Installation failure for new users +- **Fix**: Pin compatible versions or update litellm to a version that accepts newer dotenv +- **Complexity**: Low — dependency version resolution + +### 2. 
Issue #283 — Unthrottled Concurrent LLM Requests Cause 429 Rate Limits +**Title**: "[Bug] Unthrottled concurrent LLM requests lead to HTTP 429 Rate Limits and cascading KeyError in tree generation" +- **Problem**: Async code fires unlimited concurrent requests, hitting rate limits +- **Impact**: Production reliability, cascading failures +- **Fix**: Add a semaphore/throttle to limit concurrent LLM calls (e.g., `asyncio.Semaphore(5)`) +- **Complexity**: Medium — async concurrency control + +### 3. Issue #284 — Multi-Document Literature Review Support +**Title**: "Does PageIndex work well for synthesizing a literature review or survey report from dozens of separate documents" +- **Problem**: Question about multi-document support +- **Opportunity**: Could close with docs update or reference to `feat/multi-doc-support` PR #216 +- **Complexity**: Documentation or feature + +### 4. Issue #278 — Local Private Model Configuration +**Title**: "How do I configure a local private model? When running locally, is there a dashboard-like visual interface?" (translated from Chinese) +- **Problem**: User asking about local/private LLM model setup and whether a local visual dashboard exists +- **Fix**: Update docs for LiteLLM local endpoint configuration +- **Complexity**: Low — documentation + +--- + +## Closed/Recently Merged PRs — Pattern Analysis + +| PR | Title | Theme | +|----|-------|-------| +| 271 | Update README | Maintenance | +| 262/261/259 | update README | Maintenance | +| 256/255 | Fix Agentic RAG entry formatting | Bug fix | +| 248 | Add security CI workflows | CI/CD | +| 247 | Bump pip group | Dependencies | +| 241 | Add Dependabot config for GitHub Actions | CI/CD | +| 238 | feat: compatible with PageIndex SDK | Feature | +| 228 | feat: Universal Local LLM & Custom Endpoint Support | Feature | +| 227 | feat: add checkpoint/resume for long document processing | Feature | +| 226 | fix: poll status=="completed" in cloud add_document | Bug fix | +| 221 | Add FastAPI server for PageIndex document indexing service | Feature | +| 218 | fix: comprehensive crash guards for malformed LLM output | Bug fix 
| +| 216 | feat: add multi-document support to retrieval and client API | Feature | +| 213 | fix: apply regex-based trailing comma removal before JSON parse | Bug fix | +| 207 | feat: add PageIndex SDK with local/cloud dual-mode support | Feature | + +**Pattern**: Recent work focuses on SDK features, robustness, and multi-document support. + +--- + +## Active Feature Branches in Upstream + +| Branch | Description | +|--------|-------------| +| `dev` | Development branch | +| `feat/markdown-tree` | Markdown tree feature | +| `feat/md-bold-heading-recognition` | Bold heading detection | +| `fix/cloud-poll-status-completed` | Cloud polling fix | +| `add-pypdfium2-parser` | Optional pypdfium2 PDF parser | +| `dependabot/*` | Dependency updates (3 branches) | + +--- + +## PR Merge Velocity +- Last 30 closed PRs (not all merged) +- Merged: ~20 PRs in recent weeks +- Active maintenance with good review turnaround + +--- + +## Dependency PRs (Quick Wins) + +Three dependabot PRs waiting to merge: +- #275: `actions/dependency-review-action` 4→5 +- #243: `actions/github-script` 7→9 +- #242: `actions/checkout` 4→6 + +These are routine and should be merged. + +--- + +## Recommended Focus Areas for Contribution + +1. **Bug fixes with existing PRs**: Issues #280, #274, #249, #246 — PRs exist, review/test them +2. **Markdown parser**: Issues #245, #250, #249, #246 — multiple related issues, could be consolidated +3. **Rate limiting**: Issue #283 — needs implementation of semaphore-based concurrency control +4. **Dependency conflicts**: Issue #286 — straightforward version pinning +5. 
**Test coverage**: Issue #263 — "100% Unit Test Coverage" — major gap in the project \ No newline at end of file diff --git a/06_SELECTED_5_PR_PLAN.md b/06_SELECTED_5_PR_PLAN.md new file mode 100644 index 000000000..f4387f7fd --- /dev/null +++ b/06_SELECTED_5_PR_PLAN.md @@ -0,0 +1,228 @@ +# 06_SELECTED_5_PR_PLAN.md — PageIndex Top 5 PR Implementation Plan + +## Selection Criteria +- **High impact** on user experience or reliability +- **Medium-to-low complexity** — achievable in focused sessions +- **Clear scope** — well-defined problem and solution +- **No existing test suite** — write unit tests alongside fixes +- **Avoid duplicating active PRs** — pick where we can add unique value + +--- + +## Selected PR #1: Fix `get_page_content` Over-Collection on Markdown (Issues #279 + #280) + +**Issue**: `get_page_content` for Markdown documents returns content from nodes outside the requested page range when using comma-separated lists (e.g., `'3,8'`). +**PR**: #280 already filed (`fix/md-discrete-pages-overcollection`) but needs validation. + +### Current Behavior (retrieve.py:56-76) +```python +def _get_md_page_content(doc_info: dict, page_nums: list[int]) -> list[dict]: + min_line, max_line = min(page_nums), max(page_nums) # ← BUG: uses min/max not discrete pages + # ... +``` +The function finds nodes where `min_line <= line_num <= max_line`, which groups all nodes between the smallest and largest line numbers. + +### Fix Approach +Rewrite `_get_md_page_content` to: +1. Treat each page number as a discrete line number +2. Return only nodes with `line_num == page` for each requested page +3. Use a set to deduplicate if same line appears in multiple requests + +### Implementation Steps +1. Read `pageindex/retrieve.py` — understand current `_get_md_page_content` +2. Modify `_get_md_page_content` to collect per-page nodes discretely +3. Add unit test in `tests/test_retrieve_md.py` +4. 
Test with example markdown files in `examples/documents/` + +### Files to Modify +- `pageindex/retrieve.py` — fix `_get_md_page_content` +- `tests/test_retrieve_md.py` — new test file (create tests/ directory) + +--- + +## Selected PR #2: Add Concurrency Throttling for LLM Requests (Issue #283) + +**Issue**: Unthrottled concurrent LLM requests cause HTTP 429 rate limits and cascading `KeyError` in tree generation. +**Problem**: Code uses `asyncio.gather(*tasks)` with no limit on concurrent calls. When processing large documents, hundreds of LLM calls fire simultaneously. + +### Fix Approach +Add a semaphore to limit concurrent LLM API calls globally. + +### Implementation Steps +1. Create a concurrency limiter module `pageindex/concurrency.py`: + ```python + import asyncio + _sem = asyncio.Semaphore(5) # Configurable via config.yaml + + async def limited_acompletion(*args, **kwargs): + async with _sem: + return await llm_acompletion(*args, **kwargs) + ``` +2. Update `config.yaml` to add `max_concurrent_llm_calls: 5` +3. Update `ConfigLoader` to parse this new option +4. Replace direct `llm_acompletion` calls with `limited_acompletion` in: + - `pageindex/page_index.py` — `check_title_appearance`, `check_title_appearance_in_start`, `verify_toc`, `fix_incorrect_toc` + - `pageindex/page_index_md.py` — `get_node_summary` +5. 
Add test `tests/test_concurrency.py` that verifies semaphore behavior + +### Files to Modify +- `pageindex/concurrency.py` — new file +- `pageindex/utils.py` — export limiter function +- `pageindex/page_index.py` — use limited versions +- `pageindex/page_index_md.py` — use limited versions +- `pageindex/config.yaml` — add `max_concurrent_llm_calls` +- `tests/test_concurrency.py` — new test file + +### Complexity: Medium +- Requires understanding async patterns in the codebase +- Must not break existing functionality +- Semaphore should be configurable + +--- + +## Selected PR #3: Fix Dependency Version Conflict (Issue #286) + +**Issue**: `litellm==1.83.7` requires `python-dotenv==1.0.1` but `requirements.txt` pins `python-dotenv==1.2.2` — causes installation failure. + +### Fix Approach +Upgrade `litellm` to a version compatible with `python-dotenv>=1.2.2`, or downgrade dotenv in requirements. Check litellm changelog for when dotenv constraint was relaxed. + +### Implementation Steps +1. Check litellm release notes for dotenv compatibility +2. Update `requirements.txt`: + - Option A: Bump litellm to latest (check if it supports dotenv 1.2.x) + - Option B: Pin python-dotenv to 1.0.1 if litellm requires it +3. Test `pip install -r requirements.txt` in fresh virtual environment +4. Verify PageIndex still imports correctly +5. Update README if installation steps change + +### Files to Modify +- `requirements.txt` — version adjustment + +### Complexity: Low +- Direct dependency version fix +- Should verify on fresh install + +--- + +## Selected PR #4: Add `--output` CLI Option for Scripted Usage (Issue #268) + +**Issue**: CLI output path is hardcoded to `./results/{name}_structure.json`, making scripted/automated usage difficult. 
+ +### Current Code (run_pageindex.py:72-75) +```python +output_dir = './results' +output_file = f'{output_dir}/{pdf_name}_structure.json' +os.makedirs(output_dir, exist_ok=True) +``` + +### Fix Approach +Add `--output-dir` and/or `--output-file` CLI arguments. + +### Implementation Steps +1. Add to `run_pageindex.py` argument parser: + ```python + parser.add_argument('--output-dir', type=str, default='./results', + help='Output directory for results (default: ./results)') + parser.add_argument('--output-file', type=str, default=None, + help='Output file path (overrides default naming)') + ``` +2. Use `args.output_dir` and `args.output_file` in the output section +3. Handle case where user provides custom path but directory doesn't exist +4. Update README.md usage section with new options +5. Add `tests/test_cli.py` with subprocess tests for new arguments + +### Files to Modify +- `run_pageindex.py` — add CLI arguments +- `README.md` — document new options +- `tests/test_cli.py` — new test file + +### Complexity: Low +- Straightforward CLI enhancement +- Clear user demand (issue explicitly mentions scripted usage) + +--- + +## Selected PR #5: Markdown Parser Edge Cases (Issues #245, #246, #249, #250) + +**Issue Group**: Multiple open issues about markdown parser failures: +- #245: "Markdown parser fails on common edge cases" +- #246: "fix: make markdown parser robust" (PR exists) +- #249: "fix: fix markdown parser edge cases in page_index_md.py" (PR exists) +- #250: "feat: recognize whole-line bold as level-1 heading in markdown parser" (PR exists) + +**Problem**: The markdown parser (`page_index_md.py`) has known edge cases that fail: +- Bold headings not recognized as level-1 +- Code blocks interfering with header detection +- Edge cases in header level detection + +### Fix Approach +Audit `extract_nodes_from_markdown()` in `page_index_md.py` and add robust handling. + +### Implementation Steps +1. 
Read `pageindex/page_index_md.py` lines 32-59 (`extract_nodes_from_markdown`) +2. Create `examples/documents/test_cases.md` with edge case examples: + - Bold headers: `**Section Title**` + - Italic headers: `*Section Title*` + - Code blocks with hash characters + - Mixed header levels + - Headers at start/end of code blocks +3. Write failing tests in `tests/test_markdown_parser.py` +4. Fix `extract_nodes_from_markdown` to handle: + - Skip headers inside code blocks (already done with `in_code_block` flag — verify) + - Recognize markdown emphasis patterns as headers +5. Run full markdown test suite + +### Files to Modify +- `pageindex/page_index_md.py` — fix `extract_nodes_from_markdown` +- `examples/documents/test_cases.md` — new test file with edge cases +- `tests/test_markdown_parser.py` — new test file + +### Complexity: Medium +- Requires understanding of regex patterns +- Multiple edge cases to handle carefully + +--- + +## Summary Table + +| # | PR Title | Issue(s) | Files | Complexity | +|---|----------|----------|-------|------------| +| 1 | Fix MD page content over-collection | #279, #280 | retrieve.py, tests/ | Low | +| 2 | Add LLM concurrency throttling | #283 | concurrency.py (new), page_index.py, page_index_md.py, config.yaml | Medium | +| 3 | Fix dependency version conflict | #286 | requirements.txt | Low | +| 4 | Add --output CLI option | #268 | run_pageindex.py, README.md | Low | +| 5 | Markdown parser edge cases | #245, #246, #249, #250 | page_index_md.py, tests/ | Medium | + +--- + +## Testing Strategy + +Since the project has **no existing test suite**, create a `tests/` directory with: +- `tests/__init__.py` — package marker +- `tests/conftest.py` — pytest fixtures (sample PDF path, sample MD content) +- `tests/test_retrieve_md.py` — markdown retrieval tests +- `tests/test_concurrency.py` — semaphore behavior tests +- `tests/test_cli.py` — CLI argument tests +- `tests/test_markdown_parser.py` — markdown parsing tests + +Use `pytest` as the test 
framework. Add to `requirements.txt` if not present. + +--- + +## CI Integration + +Once tests are written, add a test workflow to `.github/workflows/`: +```yaml +name: Tests +on: [push, pull_request] +jobs: + test: + runs-on: ubuntu-latest + steps: + - uses: actions/checkout@v4 + - name: Install dependencies + run: pip install -r requirements.txt pytest + - name: Run tests + run: pytest tests/ -v +``` \ No newline at end of file diff --git a/README.md b/README.md index ddc80314e..cdfb2a594 100644 --- a/README.md +++ b/README.md @@ -101,6 +101,27 @@ The PageIndex service is available as a ChatGPT-style [chat platform](https://ch +### 🚀 Quickstart + +Get started with PageIndex in three steps: + +**1. Install dependencies:** +```bash +pip3 install --upgrade -r requirements.txt +``` + +**2. Set your LLM API key:** +```bash +echo "OPENAI_API_KEY=your_openai_key_here" > .env +``` + +**3. Generate a tree index from your PDF:** +```bash +python3 run_pageindex.py --pdf_path /path/to/your/document.pdf +``` + +This outputs a structured tree index with node IDs, page ranges, and summaries. See the [tree structure section](#-pageindex-tree-structure) for example output. + --- # 🌲 PageIndex Tree Structure From 5f22fb3fd906632b188e2523fe67af9476ed93f3 Mon Sep 17 00:00:00 2001 From: Hermes Agent Date: Sat, 23 May 2026 23:53:13 +0000 Subject: [PATCH 2/2] Add Python version and PDF type note to quickstart --- README.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/README.md b/README.md index cdfb2a594..c132bbb70 100644 --- a/README.md +++ b/README.md @@ -120,6 +120,8 @@ echo "OPENAI_API_KEY=your_openai_key_here" > .env python3 run_pageindex.py --pdf_path /path/to/your/document.pdf ``` +> **Note:** Ensure you have Python 3.8+ installed. For best results, use a PDF with clear text (not scanned images). + This outputs a structured tree index with node IDs, page ranges, and summaries. See the [tree structure section](#-pageindex-tree-structure) for example output. ---
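Reviewer sketch for the Quickstart prerequisites in the README hunks above (Python 3.8+ and an `OPENAI_API_KEY` in `.env`): a minimal, standard-library-only preflight check. The names `read_env_file` and `preflight` are hypothetical and are not part of PageIndex, and the tiny `KEY=VALUE` parser is an assumption made to keep the sketch dependency-free; the project itself lists `python-dotenv` in `requirements.txt` for `.env` loading.

```python
import os
import sys
from pathlib import Path


def read_env_file(path):
    """Parse simple KEY=VALUE lines from a .env-style file into a dict.

    Hypothetical helper: this is a deliberately small parser, not a
    replacement for python-dotenv.
    """
    values = {}
    env_path = Path(path)
    if not env_path.is_file():
        return values
    for raw in env_path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def preflight(env_path=".env", min_python=(3, 8), version=None, environ=None):
    """Return a list of problems; an empty list means the checks passed.

    `version` and `environ` can be injected for testing; by default the
    running interpreter and the real process environment are used.
    """
    version = tuple(sys.version_info[:2]) if version is None else tuple(version)
    environ = os.environ if environ is None else environ
    problems = []
    if version < min_python:
        problems.append("Python %d.%d+ is required" % min_python)
    # Prefer an exported variable, then fall back to the .env file.
    key = environ.get("OPENAI_API_KEY") or read_env_file(env_path).get("OPENAI_API_KEY")
    if not key or key == "your_openai_key_here":
        problems.append("OPENAI_API_KEY is missing or still the placeholder")
    return problems


if __name__ == "__main__":
    for problem in preflight():
        print("preflight:", problem)
```

Running the script from the repository root before step 3 of the Quickstart would print one line per unmet prerequisite and nothing when both checks pass. The 3.8 floor is taken from the note added in PATCH 2/2; it is not verified against the project's dependencies here.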