fix the bug of md content extraction. by HaiyangPeng · Pull Request #297 · VectifyAI/PageIndex

HaiyangPeng · 2026-05-25T15:05:58Z

This PR fixes the Markdown content extraction bug reported in #296. @BukeLy @KylinMountain

KylinMountain · 2026-05-26T11:27:02Z

I don’t think we should let the LLM know whether the file is PDF or Markdown, just to decide which method to call — that logic should be handled by the code, not the prompt.

Also, if we later add support for other formats like PPT, would we need to update the prompt again?

Separately, I’m still unclear about the root cause of the issue. Is it:

- Failing to extract the correct line-numbered content?
- A prompt design issue?
- Or a function naming problem?
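To make the "handled by the code, not the prompt" suggestion concrete, here is a minimal sketch of dispatching on the file extension. All names here (`get_pdf_content`, `get_md_content`, `EXTRACTORS`, `get_page_content`) are hypothetical placeholders, not functions from the PageIndex codebase; the extractors just return tagged strings so the dispatch itself can be seen.

```python
from pathlib import Path
from typing import Callable, Dict


def get_pdf_content(path: str, pages: str) -> str:
    """Hypothetical PDF extractor (placeholder body)."""
    return f"pdf:{path}:{pages}"


def get_md_content(path: str, pages: str) -> str:
    """Hypothetical Markdown extractor (placeholder body)."""
    return f"md:{path}:{pages}"


# The format decision lives in code; the prompt never mentions PDF or Markdown.
EXTRACTORS: Dict[str, Callable[[str, str], str]] = {
    ".pdf": get_pdf_content,
    ".md": get_md_content,
    ".markdown": get_md_content,
}


def get_page_content(path: str, pages: str) -> str:
    """Pick the extractor from the file suffix and delegate to it."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported document format: {suffix!r}") from None
    return extractor(path, pages)
```

Under this layout, supporting another format such as PPT would mean adding one entry to the table rather than editing the prompt.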

HaiyangPeng · 2026-05-26T12:35:26Z

Yes, the prompt could stay format-agnostic, and in principle there is no need to tell the LLM the file format. With this constraint, however, the LLM infers correct line numbers more reliably.

For example, in my test case the LLM inferred the line ranges "37-39, 4-5" for a Markdown file, which `_parse_pages` turns into [4, 5, 37, 38, 39]. The core error then occurs at `min_line, max_line = min(page_nums), max(page_nums)`: every line between `min_line` and `max_line` (here 4 and 39) is extracted, which is incorrect. I therefore replaced that line in `_get_md_page_content` with the check `if ln and ln in requested and ln not in seen:`, so that only the requested line numbers are returned.

In addition, I constrain the LLM so that it does not infer ranges, using the following prompt:
- For Markdown documents: use ONLY the exact line_num values from the structure, separated by commas. Example: "3,53,69". Do NOT use ranges.

@KylinMountain
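The difference between the two selection strategies can be reproduced in a few lines. This is only a sketch under assumptions: `parse_pages`, `extract_by_span`, `extract_by_membership`, and the `nodes` list are hypothetical stand-ins written for illustration, and the real `_parse_pages` and `_get_md_page_content` in the repository may differ in detail.

```python
from typing import Dict, List


def parse_pages(spec: str) -> List[int]:
    """Hypothetical stand-in for _parse_pages: expand '37-39, 4-5' to sorted ints."""
    nums = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            nums.update(range(start, end + 1))
        else:
            nums.add(int(part))
    return sorted(nums)


def extract_by_span(nodes: List[Dict], line_nums: List[int]) -> List[int]:
    """Old behaviour as described above: keep everything between min and max."""
    min_line, max_line = min(line_nums), max(line_nums)
    return [n["line_num"] for n in nodes if min_line <= n["line_num"] <= max_line]


def extract_by_membership(nodes: List[Dict], line_nums: List[int]) -> List[int]:
    """New behaviour as described above: keep only the requested line numbers, once each."""
    requested = set(line_nums)
    seen = set()
    picked = []
    for n in nodes:
        ln = n.get("line_num")
        if ln and ln in requested and ln not in seen:
            seen.add(ln)
            picked.append(ln)
    return picked


# Assumed structure: one node per heading, keyed by its line_num in the Markdown file.
nodes = [{"line_num": ln} for ln in (1, 4, 5, 12, 20, 37, 38, 39, 50)]
requested_lines = parse_pages("37-39, 4-5")

print(extract_by_span(nodes, requested_lines))        # also includes lines 12 and 20
print(extract_by_membership(nodes, requested_lines))  # only the requested lines
```

With the span-based version, the unrelated nodes at lines 12 and 20 are pulled in because they fall between 4 and 39; the membership check returns only 4, 5, 37, 38, and 39.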

fix the bug of md content extraction.

ebb560e

HaiyangPeng wants to merge 1 commit into VectifyAI:main from HaiyangPeng:main
