fix the bug of md content extraction.#297

Open
HaiyangPeng wants to merge 1 commit into VectifyAI:main from HaiyangPeng:main

HaiyangPeng commented May 25, 2026

This PR fixes the Markdown content extraction bug (#296). @BukeLy @KylinMountain

@KylinMountain
Collaborator

I don’t think we should tell the LLM whether the file is PDF or Markdown just so it can decide which method to call. That logic should be handled by the code, not the prompt.

Also, if we later add support for other formats like PPT, would we need to update the prompt again?

Separately, I’m still unclear about the root cause of the issue. Is it:

- Failing to extract the correct line-numbered content?
- A prompt design issue?
- Or a function naming problem?
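
To illustrate the first point, here is a minimal sketch of choosing the extraction method in code, keyed on the file extension, so the prompt never needs to mention the format. Every name here (`EXTRACTORS`, `get_content`, `get_pdf_content`, `get_md_content`) is hypothetical and does not come from this repository; the two extractors are placeholders that only return a description string.

```python
from pathlib import Path
from typing import Callable, Dict


def get_pdf_content(path: str, spec: str) -> str:
    # Placeholder: a real extractor would read the requested PDF pages.
    return f"pdf pages {spec} of {path}"


def get_md_content(path: str, spec: str) -> str:
    # Placeholder: a real extractor would read the requested Markdown lines.
    return f"markdown lines {spec} of {path}"


# Hypothetical registry: supporting a new format means adding an entry here,
# not editing the prompt.
EXTRACTORS: Dict[str, Callable[[str, str], str]] = {
    ".pdf": get_pdf_content,
    ".md": get_md_content,
    ".markdown": get_md_content,
}


def get_content(path: str, spec: str) -> str:
    """Pick the extractor from the file extension (case-insensitive)."""
    suffix = Path(path).suffix.lower()
    try:
        extractor = EXTRACTORS[suffix]
    except KeyError:
        raise ValueError(f"unsupported file type: {suffix!r}") from None
    return extractor(path, spec)
```

With a registry like this, a later format such as PPT would only need one more entry and one more extractor function.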

@HaiyangPeng
Author

HaiyangPeng commented May 26, 2026

> I don’t think we should tell the LLM whether the file is PDF or Markdown just so it can decide which method to call. That logic should be handled by the code, not the prompt.
>
> Also, if we later add support for other formats like PPT, would we need to update the prompt again?
>
> Separately, I’m still unclear about the root cause of the issue. Is it:
>
> - Failing to extract the correct line-numbered content?
> - A prompt design issue?
> - Or a function naming problem?

Yes, the prompt could stay format-agnostic; there is no strict need to tell the LLM the file format. But with this constraint the LLM infers correct line numbers more reliably. For example, in my test case the LLM returned the line ranges "37-39, 4-5" for a Markdown file, which `_parse_pages` expands to [4, 5, 37, 38, 39]. The core error then occurs at `min_line, max_line = min(page_nums), max(page_nums)`: all content between `min_line` and `max_line` (lines 4 through 39) is extracted, which is incorrect. I therefore replaced that min/max logic in `_get_md_page_content` with the check `if ln and ln in requested and ln not in seen:`, so that only the requested lines are extracted.

In addition, I constrain the LLM to avoid inferring ranges with the following prompt:

- For Markdown documents: use ONLY the exact line_num values from the structure, separated by commas. Example: "3,53,69". Do NOT use ranges.

@KylinMountain
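
The difference between the two selection strategies can be reproduced with a small standalone sketch. It is not the repository code: `parse_pages` is a simplified stand-in for `_parse_pages`, the flat list of `{"line_num", "text"}` dicts is an assumed node shape, and `select_by_span` / `select_exact` are hypothetical names for the old and new behavior.

```python
from typing import Dict, List


def parse_pages(spec: str) -> List[int]:
    """Stand-in for _parse_pages: expand "37-39, 4-5" into sorted unique ints."""
    nums = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            start, end = (int(x) for x in part.split("-", 1))
            nums.update(range(start, end + 1))
        else:
            nums.add(int(part))
    return sorted(nums)


def select_by_span(nodes: List[Dict], line_nums: List[int]) -> List[Dict]:
    """Old behavior: keep every node between the smallest and largest line."""
    min_line, max_line = min(line_nums), max(line_nums)
    return [
        n for n in nodes
        if n.get("line_num") and min_line <= n["line_num"] <= max_line
    ]


def select_exact(nodes: List[Dict], line_nums: List[int]) -> List[Dict]:
    """New behavior: keep only nodes whose line_num was explicitly requested."""
    requested = set(line_nums)
    seen = set()
    picked = []
    for n in nodes:
        ln = n.get("line_num")
        if ln and ln in requested and ln not in seen:
            seen.add(ln)
            picked.append(n)
    return picked


# Assumed structure: one node per heading, identified by its line number.
nodes = [
    {"line_num": ln, "text": f"section at line {ln}"}
    for ln in (4, 5, 12, 20, 37, 38, 39)
]
line_nums = parse_pages("37-39, 4-5")

span_lines = [n["line_num"] for n in select_by_span(nodes, line_nums)]
exact_lines = [n["line_num"] for n in select_exact(nodes, line_nums)]
```

In this sketch the span-based selection also returns the nodes at lines 12 and 20, which were never requested, while the exact-match selection returns only lines 4, 5, 37, 38, and 39.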
