Commit 023b76a

docs: update pdf2model pipeline docs (#170)
1 parent ae5031d commit 023b76a

6 files changed

Lines changed: 336 additions & 85 deletions

File tree

docs/en/notes/guide/pipelines/Pdf2ModelPipeline.md

Lines changed: 128 additions & 39 deletions
@@ -10,17 +10,22 @@ This pipeline is designed to help beginners quickly get started with fine-tuning
 
 ## 1. Overview
 
-The **Pdf-to-Model Fine-tuning Pipeline** is an end-to-end large language model training solution designed to provide fully automated services from raw documents to deployable domain-specific models. The pipeline transforms heterogeneous-format, high-noise PDF documents into high-quality Multi-Hop QA training data and performs parameter-efficient fine-tuning of large models based on this data, enabling models to achieve precise question-answering capabilities in specific domain knowledge.
+The **Pdf-to-Model Fine-tuning Pipeline** is an end-to-end large language model training solution designed to provide fully automated services from raw documents to deployable domain-specific models. The pipeline transforms heterogeneous, high-noise PDF documents into high-quality training data and performs parameter-efficient fine-tuning of large models on that data, enabling precise question answering over specific domain knowledge.
 
 The pipeline integrates advanced document processing technologies (MinerU, trafilatura), intelligent knowledge cleaning methods, and efficient fine-tuning strategies. It significantly enhances model performance in vertical domains while maintaining the general capabilities of base models. According to MIRIAD experimental validation, models trained with Multi-Hop QA format demonstrate excellent performance in complex question-answering scenarios requiring multi-step reasoning.
 
-**Document Parsing Engine**: MinerU1 (recommended to use vlm-backend: pipeline for optimal stability) and partial functionality of MinerU 2.5 (transformers backend)
+**Document Parsing Engine**: MinerU1 and partial functionality of MinerU 2.5. This pipeline is now fully compatible with [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU). Compared to the native MinerU engine, Flash-MinerU offers significant advantages in parsing speed and high-concurrency processing.
 
-**Supported Input Formats**: PDF, Markdown, HTML, URL webpages
+**Automated Format Conversion and Fine-tuning**: The pipeline supports two data extraction and preparation paradigms: KBC (text-based knowledge base) and VQA (multimodal visual question answering).
 
-**Output Model**: Adapter (compatible with any Qwen/Llama series base model)
+- **KBC Mode**: Focused on plain-text data cleaning and QA synthesis. It uses the Alpaca format for model fine-tuning, making it ideal for text-centric knowledge base scenarios.
+- **VQA Mode**: Focused on multimodal data cleaning and QA synthesis. It uses the ShareGPT format for model fine-tuning, specifically optimized for textbook-style PDFs (e.g., mathematics, physics, and chemistry textbooks or exam papers).
 
-**Note**: Currently does not support MinerU 2.5 vlm-vllm-engine, as it requires a higher version of vLLM that is incompatible with the current latest version of LLaMA-Factory (primary conflict lies in transformers library version).
+**Supported Input Formats**: PDF, Markdown, HTML, URL webpages.
+
+**Output Model**: Adapter (compatible with any Qwen/Llama series base model).
+
+<!-- **Note**: Currently does not support MinerU 2.5 vlm-vllm-engine, as it requires a higher version of vLLM that is incompatible with the current latest version of LLaMA-Factory (primary conflict lies in transformers library version). -->

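For orientation: an Alpaca-format sample is a single instruction/input/output record, while a ShareGPT-format sample is a list of chat messages (plus image paths in the VQA case, as shown later in the PDFVQAExtract docs). A minimal Alpaca-style record, with illustrative values, looks like this:

```json
{
    "instruction": "What does the warranty section say about water damage?",
    "input": "",
    "output": "Water damage is excluded from coverage unless the device is certified IP68 and the failure occurred within the first 90 days."
}
```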
@@ -31,12 +36,8 @@ conda create -n dataflow python=3.10
 conda activate dataflow
 git clone https://github.com/OpenDCAI/DataFlow.git
 cd DataFlow
-#prepare environment
+# prepare environment
 pip install -e .[llamafactory]
-# Supports mineru2.5. If you only want to run the pipeline backend, you can skip downloading the whl file and proceed directly to model preparation.
-wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
-
-pip install flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
 
 # prepare models
 mineru-models-download
@@ -46,7 +47,11 @@ mkdir run_dataflow
 cd run_dataflow
 
 # Initialize
-dataflow pdf2model init
+# KBC Mode: Initialize KBC knowledge cleaning pipeline (Default)
+dataflow pdf2model init --qa="kbc"
+
+# VQA Mode: Initialize VQA multimodal extraction pipeline (Optimized for textbooks/exams)
+dataflow pdf2model init --qa="vqa"
 
 # Train
 dataflow pdf2model train
@@ -65,14 +70,20 @@ The Pdf-to-Model pipeline consists of two phases: initialization and execution,
 
 #### Initialization Phase (dataflow pdf2model init)
 
-Automatically generates training configuration file (train_config.yaml) and customizable data processing scripts, configuring default LoRA fine-tuning parameters, dataset paths, and model output directories.
+Automatically generates a training configuration file (`train_config.yaml`) and customizable data processing scripts (`pdf_to_model_pipeline.py`), configuring default LoRA fine-tuning parameters, dataset paths, and model output directories. A sketch of this configuration file appears after the flag descriptions below.
+
+- `--qa="kbc"` (Default): Generates a pipeline focused on knowledge cleaning. This workflow emphasizes long-range logical text cleaning, intelligent chunking, and the production of Alpaca-format data.
+- `--qa="vqa"`: Generates a pipeline focused on visual question answering. This workflow leverages multimodal capabilities to parse charts, diagrams, and formulas within PDFs, producing ShareGPT-format data.

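As a rough illustration of what the generated `train_config.yaml` covers, here is a sketch in the style of LLaMA-Factory's LoRA SFT configs (field names follow LLaMA-Factory's published examples; the model, dataset name, and paths below are placeholders, and the defaults DataFlow actually writes may differ):

```yaml
model_name_or_path: Qwen/Qwen2.5-7B-Instruct  # placeholder base model
stage: sft
do_train: true
finetuning_type: lora
lora_target: all
dataset: pdf2model_qa          # placeholder name for the generated QA dataset
template: qwen
output_dir: ./train_output     # placeholder adapter output directory
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```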
 #### Execution Phase (dataflow pdf2model train)
 
 1. **Document Discovery**: Automatically scans specified directories to identify all PDF files and generate an index list.
-2. **Knowledge Extraction and Cleaning**: Extracts textual information from PDF/Markdown/HTML/URL using tools like MinerU and trafilatura, performs intelligent segmentation via chonkie, and cleans and normalizes raw text by addressing redundant tags, format errors, and privacy information. *(This step reuses the complete workflow of the knowledge base cleaning pipeline)*
-3. **QA Data Generation**: Utilizes a sliding window of three sentences to transform the cleaned knowledge base into a series of Multi-Hop QA pairs requiring multi-step reasoning, and converts them into LlamaFactory standard training format.
-4. **Fine-tuning**: Based on the generated QA data, uses LoRA (Low-Rank Adaptation) method to perform parameter-efficient fine-tuning of the base model, training model parameters and outputting a domain-specific model adapter ready for deployment.
+2. **Knowledge Extraction and Cleaning**: Extracts textual information from PDF/Markdown/HTML/URL using tools like [MinerU](https://github.com/opendatalab/MinerU) and [trafilatura](https://github.com/adbar/trafilatura).
+3. **QA Data Generation and Cleaning**:
+   - KBC Mode: Performs refined cleaning of raw text (removing redundant tags, fixing formatting errors, and protecting sensitive information), then uses a three-sentence sliding window to transform the knowledge into Multi-Hop QA pairs requiring multi-step reasoning (see the sketch after this list).
+   - VQA Mode: Transforms complex raw data into LLM-understandable inputs. For high-value content like textbooks and exam papers, it uses multi-threaded API calls to extract high-quality QA pairs from multimodal page blocks.
+4. **Data Format Conversion**: Converts the extracted QA data into Llama-Factory standard training formats (Alpaca or ShareGPT).
+5. **Fine-tuning**: Uses LoRA (Low-Rank Adaptation) to perform parameter-efficient fine-tuning of the base model on the generated QA data, outputting a domain-specific model adapter ready for deployment.

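To make the three-sentence sliding window in step 3 concrete, here is a minimal sketch of the windowing idea only (illustrative; the actual chunking and QA synthesis are handled inside the pipeline's operators):

```python
import re

def sentence_windows(text: str, window: int = 3, stride: int = 1):
    """Yield overlapping windows of `window` consecutive sentences.

    Each window serves as the context from which multi-hop QA pairs can be
    synthesized, so facts from adjacent sentences can be chained into
    questions that require multi-step reasoning.
    """
    # Naive sentence split on ., !, ?; a real pipeline uses a proper splitter.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    for i in range(0, max(len(sentences) - window + 1, 1), stride):
        yield " ".join(sentences[i:i + window])

for ctx in sentence_windows("A is B. B causes C. C enables D. D blocks E."):
    print(ctx)  # "A is B. B causes C. C enables D.", then the window shifted by one, etc.
```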
 #### Testing Phase (dataflow chat)
 
@@ -87,16 +98,7 @@ conda create -n dataflow python=3.10
 conda activate dataflow
 
 cd DataFlow
-
-pip install -e .[llamafactory]
-
-# Supports mineru2.5. If you only want to run the pipeline backend, you can skip downloading the whl file and proceed directly to model preparation
-# Download flash-attn whl file. You need to download the corresponding whl based on your environment
-# For example, if your environment is python3.10 torch2.4 cuda12.1 https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
-# Version selection URL: https://github.com/Dao-AILab/flash-attention/releases
-wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
-
-pip install flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl
+pip install -e .[pdf2model]
 ```
 
@@ -126,7 +128,11 @@ Place appropriately sized datasets (data files in PDF format) into the working directory
 # Initialize
 # --cache can specify the location of .cache directory (optional)
 # Default value is current folder directory
-dataflow pdf2model init
+# Initialize with KBC mode (Default)
+dataflow pdf2model init --qa="kbc"
+
+# Initialize with VQA mode (For textbooks/exams)
+dataflow pdf2model init --qa="vqa"
 ```
 
 💡After initialization is complete, the project directory becomes:
@@ -142,35 +148,117 @@ Project Root/
 
 ### Step 5: Set Parameters
 
-🌟 **Display common and important parameters:**
+🌟 Display common and important parameters:
+
+#### Mode A: KBC Knowledge Cleaning (Default Mode)
 
 ```python
 self.storage = FileStorage(
-    first_entry_file_name=str(cache_path / ".cache" / "gpu" / "pdf_list.jsonl"),
+    first_entry_file_name="./.cache/pdf_list.jsonl",  # Path of the generated pdf_list.jsonl
     cache_path=str(cache_path / ".cache" / "gpu"),
     file_name_prefix="batch_cleaning_step",  # Prefix for created files
-    cache_type="jsonl",  # Type of created files
+    cache_type="jsonl",  # Format of created files
 )
 
-self.knowledge_cleaning_step1 = FileOrURLToMarkdownConverterBatch(
-    intermediate_dir=str(cache_path / ".cache"),
-    lang="en",
-    mineru_backend="vlm-vllm-engine",  # Options: pipeline, vlm-vllm-engine, vlm-vllm-transformer
+# Flash-MinerU backend (recommended)
+self.mineru_executor = FileOrURLToMarkdownConverterFlash(
+    intermediate_dir="../example_data/PDF2VQAPipeline/flash/",
+    mineru_model_path="<your Model Path>/MinerU2.5-2509-1.2B",  # !!! place your local model path here !!!
+    # https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B
+    batch_size=4,  # batch size per vLLM worker
+    replicas=1,  # number of vLLM workers
+    num_gpus_per_replica=0.5,  # GPU fraction Ray reserves per worker; may be a float, e.g. 0.5 = half a GPU per worker, 1 = a whole GPU
+    engine_gpu_util_rate_to_ray_cap=0.9  # actual GPU memory per worker = num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap (e.g. 0.5 * 0.9 = 0.45 of a GPU); keep at 0.8 or 0.9 to leave buffer for other processes and avoid OOM
 )
 
 self.knowledge_cleaning_step2 = KBCChunkGeneratorBatch(
     split_method="token",  # Specify the splitting method
     chunk_size=512,  # Specify the chunk size
     tokenizer_name="./Qwen2.5-7B-Instruct",  # Path to the tokenizer model
+)
+
+self.knowledge_cleaning_step3 = KBCTextCleaner(
+    llm_serving=self.llm_serving,
+    lang="en"
+)
+
+self.knowledge_cleaning_step4 = Text2MultiHopQAGenerator(
+    llm_serving=self.llm_serving,
+    lang="en",
+    num_q=5
 )
 
-self.extract_format_qa = QAExtractor(
+self.extract_format_qa_step5 = QAExtractor(
     qa_key="qa_pairs",
-    output_json_file="./.cache/data/qa.json",
+    output_json_file="./.cache/data/qa.json",  # Path for the output dataset
 )
 ```
For a more comprehensive guide on parameter settings, please refer to the descriptions in [Case 8. Converting Massive PDFs to QAs](../quickstart/knowledge_cleaning.md).
198+
199+
#### Mode B: VQA Data Extraction
200+
201+
```python
202+
self.storage = FileStorage(
203+
first_entry_file_name="./.cache/pdf_list.jsonl", # Set the path for the default generated pdf_list.json
204+
cache_path="./cache",
205+
file_name_prefix="vqa", # Prefix for created files
206+
cache_type="jsonl", # Format of created files
207+
)
208+
209+
self.llm_serving = APILLMServing_request(
210+
api_url="http://<YOUR_SERVER_IP>:3000/v1/chat/completions", # API endpoint path
211+
key_name_of_api_key="DF_API_KEY",
212+
model_name="gemini-2.5-pro", # Ensure the API node supports this model
213+
max_workers=100,
214+
)
215+
216+
self.vqa_extract_prompt = QAExtractPrompt()
217+
218+
self.pdf_merger = PDF_Merger(output_dir="./cache")
219+
220+
# Flash-MinerU Backend (Recommended)
221+
self.mineru_executor = FileOrURLToMarkdownConverterFlash(
222+
intermediate_dir="../example_data/PDF2VQAPipeline/flash/",
223+
mineru_model_path="<your Model Path>/MinerU2.5-2509-1.2B", # !!! place your local model path here !!!
224+
# https://huggingface.co/opendatalab/MinerU2.5-2509-1.2B.
225+
batch_size=4, # batchsize per vllm worker
226+
replicas=1, # num of vllm workers
227+
num_gpus_per_replica=0.5, # for ray to schedule vllm workers to GPU, can be float, e.g. 0.5 means each worker uses half GPU, 1 means each worker uses whole GPU
228+
engine_gpu_util_rate_to_ray_cap=0.9 # actuall GPU utilization for each worker; acturall memory per worker= num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap; this is to avoid OOM, you can set it to 0.9 or 0.8 to leave some buffer for other processes on
229+
)
230+
231+
self.input_formatter = MinerU2LLMInputOperator()
232+
233+
self.vqa_extractor = ChunkedPromptedGenerator(
234+
llm_serving=self.llm_serving,
235+
system_prompt = self.vqa_extract_prompt.build_prompt(),
236+
max_chunk_len=128000,
237+
)
238+
self.llm_output_parser = LLMOutputParser(
239+
output_dir="./cache", intermediate_dir="intermediate"
240+
)
241+
242+
self.qa_merger = QA_Merger(
243+
output_dir="./cache", strict_title_match=False
244+
)
245+
246+
self.vqa_format_converter = VQAFormatter(
247+
output_json_file="./.cache/data/qa.json", # Path for the output dataset
170248
)
171249
```
172250

251+
For a more comprehensive guide on parameter settings, please refer to the descriptions in [Case 7. PDF VQA Extraction Pipeline](../quickstart/PDFVQAExtract.md).
252+
253+
**Note**:You must configure API credentials to invoke the LLM API. These credentials can be obtained from your LLM provider (e.g., OpenAI, Google Gemini, etc.). Set them as environment variables:
173254

255+
```shell
256+
export DF_API_KEY="sk-xxxxx"
257+
```
258+
259+
```shell
260+
$env:DF_API_KEY = "sk-xxxxx"
261+
```
174262

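Before launching a long run, a quick sanity check that the key is visible to the current process can save time (illustrative snippet; not part of the generated pipeline):

```python
import os
import sys

# Mode B reads the key via key_name_of_api_key="DF_API_KEY" above.
if not os.environ.get("DF_API_KEY"):
    sys.exit("DF_API_KEY is not set; export it before running `dataflow pdf2model train`.")
print("DF_API_KEY found.")
```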
 ### Step 6: One-Click Fine-tuning
 
@@ -179,7 +267,7 @@ self.extract_format_qa = QAExtractor(
 dataflow pdf2model train
 ```
 
-💡After fine-tuning is complete, the project directory becomes:
+💡After fine-tuning is complete, the project directory will reflect a structure similar to the following (based on the `--qa="kbc"` configuration):
 
 ```bash
 Project Root/
@@ -194,6 +282,7 @@ Project Root/
 │ ├── batch_cleaning_step_step2.json
 │ ├── batch_cleaning_step_step3.json
 │ ├── batch_cleaning_step_step4.json
+│ ├── batch_cleaning_step_step5.json
 │ └── pdf_list.jsonl
 ├── mineru/
 │ └── sample/auto/

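With the adapter in place, the Testing Phase command mentioned earlier gives a quick interactive smoke test of the fine-tuned model:

```shell
# Run from the same working directory used for training so the trained adapter is picked up
dataflow chat
```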
docs/en/notes/guide/quickstart/PDFVQAExtract.md

Lines changed: 29 additions & 1 deletion
@@ -204,14 +204,35 @@ Example:
 }
 ```
 
+Finally, the VQAFormatter operator is invoked to convert the synthesized QA pairs into the standard ShareGPT format, facilitating seamless integration into subsequent fine-tuning steps.
+
+Example:
+```json
+{
+    "messages": [
+        {
+            "role": "user",
+            "content": "<image> The incircle of $\\triangle ABC$ touches $BC$ at $D...$"
+        },
+        {
+            "role": "assistant",
+            "content": "Proof: \nLet the sides of $\\triangle ABC$ be $a, b, c$ and the semi-perimeter $p = ...$"
+        }
+    ],
+    "images": [
+        "/path/to/image.jpg"
+    ]
+}
+```
+
 ## 5. Pipeline Example
 
 ```python
 from dataflow.operators.knowledge_cleaning import FileOrURLToMarkdownConverterAPI
 
 from dataflow.serving import APILLMServing_request
 from dataflow.utils.storage import FileStorage
-from dataflow.operators.pdf2vqa import MinerU2LLMInputOperator, LLMOutputParser, QA_Merger, PDF_Merger
+from dataflow.operators.pdf2vqa import MinerU2LLMInputOperator, LLMOutputParser, QA_Merger, PDF_Merger, VQAFormatter
 from dataflow.operators.core_text import ChunkedPromptedGenerator
 
 from dataflow.pipeline import PipelineABC
@@ -248,6 +269,7 @@ class PDF_VQA_extract_optimized_pipeline(PipelineABC):
         )
         self.llm_output_parser = LLMOutputParser(output_dir="./cache", intermediate_dir="intermediate")
         self.qa_merger = QA_Merger(output_dir="./cache", strict_title_match=False)
+        self.vqa_format_converter = VQAFormatter(output_json_file="./.cache/data/qa.json")
     def forward(self):
         self.pdf_merger.run(
             storage=self.storage.step(),
@@ -285,6 +307,12 @@ class PDF_VQA_extract_optimized_pipeline(PipelineABC):
             output_merged_md_path_key="output_merged_md_path",
             output_qa_item_key="vqa_pair",
         )
+        self.vqa_format_converter.run(
+            storage=self.storage.step(),
+            input_qa_item_key="vqa_pair",
+            output_messages_key="messages",
+            output_images_key="images",
+        )
 

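As a usage sketch for the class above (assuming, as the snippet suggests, that the constructor takes no required arguments), the pipeline could be driven like this:

```python
# Hypothetical entry point; adjust serving settings and paths to your environment.
if __name__ == "__main__":
    pipeline = PDF_VQA_extract_optimized_pipeline()
    pipeline.forward()  # merge PDFs -> parse -> extract VQA pairs -> parse/merge -> ShareGPT formatting
```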
docs/en/notes/guide/quickstart/knowledge_cleaning.md

Lines changed: 3 additions & 0 deletions
@@ -61,6 +61,7 @@ During execution, this pipeline will sequentially call:
 2. KBCChunkGenerator Segments the text into chunks
 3. KBCTextCleaner Performs comprehensive cleaning on the segmented text
 4. KBCMultiHopQAGenerator Synthesizes QA data based on the cleaned knowledge
+5. QAExtractor Converts the synthesized QA data to Alpaca format
 
 For detailed descriptions of each operator, refer to the "Knowledge Base Cleaning and QA Generation" section. Once executed, a JSON file will be generated in the `.cache` directory with contents as shown below.
 
@@ -93,6 +94,8 @@ For detailed descriptions of each operator, refer to the "Knowledge Base Cleanin
 
 ## Example of Synthesized Data
 
+Below is an example of the QA data synthesized from the cleaned knowledge base in Step 4:
+
 ```json
 [
     {