This pipeline is designed to help beginners quickly get started with fine-tuning.
## 1. Overview
The **Pdf-to-Model Fine-tuning Pipeline** is an end-to-end large language model training solution designed to provide fully automated services from raw documents to deployable domain-specific models. The pipeline transforms heterogeneous-format, high-noise PDF documents into high-quality training data and performs parameter-efficient fine-tuning of large models based on this data, enabling models to achieve precise question-answering capabilities for specific domain knowledge.
The pipeline integrates advanced document processing technologies (MinerU, trafilatura), intelligent knowledge cleaning methods, and efficient fine-tuning strategies. It significantly enhances model performance in vertical domains while maintaining the general capabilities of base models. According to MIRIAD experimental validation, models trained with the Multi-Hop QA format demonstrate excellent performance in complex question-answering scenarios requiring multi-step reasoning.
**Document Parsing Engine**: MinerU1 and partial functionality of MinerU 2.5. This pipeline is now fully compatible with [Flash-MinerU](https://github.com/OpenDCAI/Flash-MinerU). Compared to the native MinerU engine, Flash-MinerU offers significant advantages in parsing speed and high-concurrency processing.
**Automated Format Conversion and Fine-tuning**: The pipeline supports two data extraction and preparation paradigms: KBC (Text-based Knowledge Base) and VQA (Multimodal Visual Question Answering).
- **KBC Mode**: Focused on plain-text data cleaning and QA synthesis. It utilizes the Alpaca format for model fine-tuning, making it ideal for text-centric knowledge base scenarios.

- **VQA Mode**: Focused on multimodal data cleaning and QA synthesis. It utilizes the ShareGPT format for model fine-tuning, specifically optimized for textbook-style PDFs (e.g., Mathematics, Physics, and Chemistry textbooks or exam papers).
**Output Model**: Adapter (compatible with any Qwen/Llama series base model).
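A LoRA adapter produced this way can be attached to a compatible base model via PEFT. A minimal loading sketch, where the base model name and adapter path are placeholders, not pipeline defaults:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

# Placeholder identifiers: substitute your own base model and adapter output directory
base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base_model, "./saves/pdf2model_adapter")  # attaches the LoRA weights
```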
<!-- **Note**: Currently does not support MinerU 2.5 vlm-vllm-engine, as it requires a higher version of vLLM that is incompatible with the current latest version of LLaMA-Factory (primary conflict lies in transformers library version). -->
#### Initialization Phase (dataflow pdf2model init)
Automatically generates a training configuration file (`train_config.yaml`) and customizable data processing scripts (`pdf_to_model_pipeline.py`), configuring default LoRA fine-tuning parameters, dataset paths, and model output directories.
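The generated file follows LLaMA-Factory conventions; a sketch of what it typically contains (illustrative values only, the pipeline-generated file is authoritative):

```yaml
# Hypothetical train_config.yaml contents
model_name_or_path: Qwen/Qwen2.5-7B-Instruct   # placeholder base model
stage: sft
do_train: true
finetuning_type: lora
lora_rank: 8
dataset: pdf2model_qa                          # hypothetical dataset name registered by the pipeline
template: qwen
output_dir: ./saves/pdf2model_adapter
per_device_train_batch_size: 1
gradient_accumulation_steps: 8
learning_rate: 1.0e-4
num_train_epochs: 3.0
```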
The `--qa` flag selects the pipeline paradigm:

- `--qa="kbc"` (default): Generates a pipeline focused on knowledge cleaning. This workflow emphasizes long-range logical text cleaning, intelligent chunking, and the production of Alpaca-format data.

- `--qa="vqa"`: Generates a pipeline focused on Visual Question Answering. This workflow leverages multimodal capabilities to parse charts, diagrams, and formulas within PDFs, producing ShareGPT-format data.
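For example (a usage sketch; run from your working directory):

```bash
# Initialize the default KBC (text-centric, Alpaca-format) pipeline
dataflow pdf2model init --qa="kbc"

# Or initialize the VQA (multimodal, ShareGPT-format) pipeline
dataflow pdf2model init --qa="vqa"
```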
#### Execution Phase (dataflow pdf2model train)
1. **Document Discovery**: Automatically scans specified directories to identify all PDF files and generate an index list.
2. **Knowledge Extraction and Cleaning**: Extracts textual information from PDF/Markdown/HTML/URL using tools like [MinerU](https://github.com/opendatalab/MinerU) and [trafilatura](https://github.com/adbar/trafilatura).

3. **QA Data Generation and Cleaning**:

   - KBC Mode: Performs refined cleaning of the raw text (removing redundant tags, fixing formatting errors, and protecting sensitive information). It then utilizes a three-sentence sliding window to transform knowledge into Multi-Hop QA pairs requiring multi-step reasoning (see the sketch after this list).

   - VQA Mode: Transforms complex raw data into LLM-understandable inputs. For high-value content like textbooks and exam papers, it uses multi-threaded API calls to extract high-quality QA pairs from multimodal page blocks.

4. **Data Format Conversion**: Converts the extracted QA data into LLaMA-Factory standard training formats (Alpaca or ShareGPT).

5. **Fine-tuning**: Uses the LoRA (Low-Rank Adaptation) method to perform parameter-efficient fine-tuning of the base model on the generated QA data, outputting a domain-specific model adapter ready for deployment.
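To make the KBC windowing in step 3 concrete, here is a minimal sketch of the three-sentence sliding window, assuming sentences are already split. The hypothetical helper below only forms the windows; the real operator hands each window to an LLM to synthesize a Multi-Hop QA pair:

```python
def three_sentence_windows(sentences: list[str]):
    """Yield overlapping three-sentence windows over cleaned text (illustrative helper)."""
    for i in range(max(0, len(sentences) - 2)):
        yield sentences[i : i + 3]

cleaned = [
    "The incircle touches each side of the triangle.",
    "Its radius links the triangle's area to its semi-perimeter.",
    "Hence r = A / s, where s is the semi-perimeter.",
    "This identity underpins many geometric proofs.",
]
for window in three_sentence_windows(cleaned):
    print(" ".join(window))  # the context handed to the QA-generation prompt
```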
**Note**: MinerU 2.5 is supported. If you only want to run the pipeline backend, you can skip downloading the flash-attn whl file and proceed directly to model preparation. Otherwise, download the whl matching your environment (for example, Python 3.10 + torch 2.4 + CUDA 12.1 corresponds to https://github.com/Dao-AILab/flash-attention/releases/download/v2.8.3/flash_attn-2.8.3+cu121torch2.4cxx11abiTRUE-cp310-cp310-linux_x86_64.whl); see https://github.com/Dao-AILab/flash-attention/releases for version selection.
#### Mode A: KBC Data Extraction

```python
    # ...
    num_gpus_per_replica=0.5,  # for Ray to schedule vLLM workers onto GPUs; can be a float, e.g. 0.5 means each worker uses half a GPU, 1 means each worker uses a whole GPU
    engine_gpu_util_rate_to_ray_cap=0.9,  # actual GPU utilization per worker; actual memory per worker = num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap; set 0.9 or 0.8 to leave buffer for other processes and avoid OOM
    output_json_file="./.cache/data/qa.json",  # Path for the output dataset
)
```
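With the values shown, each worker is capped at 0.5 × 0.9 = 0.45 of a single GPU's memory, leaving headroom for other processes and reducing the risk of OOM.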
For a more comprehensive guide on parameter settings, please refer to the descriptions in [Case 8. Converting Massive PDFs to QAs](../quickstart/knowledge_cleaning.md).
#### Mode B: VQA Data Extraction
```python
self.storage = FileStorage(
    first_entry_file_name="./.cache/pdf_list.jsonl",  # Path for the default generated pdf_list.jsonl
    cache_path="./cache",
    file_name_prefix="vqa",  # Prefix for created files
    cache_type="jsonl",  # Format of created files
)

self.llm_serving = APILLMServing_request(
    api_url="http://<YOUR_SERVER_IP>:3000/v1/chat/completions",  # API endpoint path
    key_name_of_api_key="DF_API_KEY",
    model_name="gemini-2.5-pro",  # Ensure the API node supports this model
    # ...
    num_gpus_per_replica=0.5,  # for Ray to schedule vLLM workers onto GPUs; can be a float, e.g. 0.5 means each worker uses half a GPU, 1 means each worker uses a whole GPU
    engine_gpu_util_rate_to_ray_cap=0.9,  # actual GPU utilization per worker; actual memory per worker = num_gpus_per_replica * engine_gpu_util_rate_to_ray_cap; set 0.9 or 0.8 to leave buffer for other processes and avoid OOM
    output_json_file="./.cache/data/qa.json",  # Path for the output dataset
)
```
For a more comprehensive guide on parameter settings, please refer to the descriptions in [Case 7. PDF VQA Extraction Pipeline](../quickstart/PDFVQAExtract.md).
**Note**: You must configure API credentials to invoke the LLM API. These credentials can be obtained from your LLM provider (e.g., OpenAI, Google Gemini). Set them as environment variables:
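For example, with the key name used in the snippets above:

```bash
# The variable name matches key_name_of_api_key in the configuration above; the value is a placeholder
export DF_API_KEY="your_api_key_here"
```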
**docs/en/notes/guide/quickstart/PDFVQAExtract.md**
Finally, the VQAFormatter operator is invoked to convert the synthesized QA pairs into the standard ShareGPT format, facilitating seamless integration into subsequent fine-tuning steps.
Example:
```json
{
  "messages": [
    {
      "role": "user",
      "content": "<image> The incircle of $\\triangle ABC$ touches $BC$ at $D...$"
    },
    {
      "role": "assistant",
      "content": "Proof: \nLet the sides of $\\triangle ABC$ be $a, b, c$ and the semi-perimeter $p = ...$"
    }
  ],
  "images": [
    "/path/to/image.jpg"
  ]
}
```
## 5. Pipeline Example
```python
from dataflow.operators.knowledge_cleaning import FileOrURLToMarkdownConverterAPI
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage
from dataflow.operators.pdf2vqa import MinerU2LLMInputOperator, LLMOutputParser, QA_Merger, PDF_Merger, VQAFormatter
from dataflow.operators.core_text import ChunkedPromptedGenerator
from dataflow.pipeline import PipelineABC

class PDF_VQA_extract_optimized_pipeline(PipelineABC):
    ...
```
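A typical driver then instantiates and runs the pipeline. A minimal sketch, assuming the usual DataFlow entry point in which `forward()` executes the operator chain (the class body is omitted above):

```python
if __name__ == "__main__":
    pipeline = PDF_VQA_extract_optimized_pipeline()  # defined above; body elided in this excerpt
    pipeline.forward()  # runs the configured operators end to end
```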
**docs/en/notes/guide/quickstart/knowledge_cleaning.md**
During execution, this pipeline will sequentially call:
2. KBCChunkGenerator segments the text into chunks
3. KBCTextCleaner performs comprehensive cleaning on the segmented text
4. KBCMultiHopQAGenerator synthesizes QA data based on the cleaned knowledge
5. QAExtractor converts the synthesized QA data to Alpaca format
For detailed descriptions of each operator, refer to the "Knowledge Base Cleaning and QA Generation" section. Once executed, a JSON file will be generated in the `.cache` directory with contents as shown below.
## Example of Synthesized Data
Below is an example of the QA data synthesized from the cleaned knowledge base in Step 4:
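Each entry follows the Alpaca schema; its shape is illustrated below with placeholder values (not actual pipeline output):

```json
{
  "instruction": "A question synthesized from the cleaned knowledge base.",
  "input": "",
  "output": "A multi-step answer grounded in the windowed context."
}
```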