# DataFlow v1.0.0 Release Notes

🎉🎉🎉 We are thrilled to release our data-centric AI system, DataFlow! 🎉🎉🎉

Version: v1.0.0

A modular, AI-assisted data preparation system for high-efficiency pipelines.
## 🚀 Introduction
DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.
It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.
Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.
## 🧠 Core Features
- 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
- 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
- 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
- ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
- 💾 Built-in Storage Layer: Manage intermediate data and caching.
- 🔌 LLM Backend Support: Easily plug into GPT-style backends with `LLMServing`.
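To make the operator/pipeline idea concrete, here is a minimal sketch of the PyTorch-style pattern described above. The class and method names (`Operator`, `Pipeline`, `run`) are illustrative only, not DataFlow's actual API; see the documentation linked below for real usage.

```python
# Minimal sketch of a modular operator/pipeline pattern.
# NOTE: names here are hypothetical, not DataFlow's actual API.

class Operator:
    """A reusable, configurable data processing unit."""
    def run(self, records):
        raise NotImplementedError

class LowercaseOperator(Operator):
    """Rule-based cleaning step: normalize case."""
    def run(self, records):
        return [r.lower() for r in records]

class DedupOperator(Operator):
    """Rule-based filtering step: drop exact duplicates, keep order."""
    def run(self, records):
        seen, out = set(), []
        for r in records:
            if r not in seen:
                seen.add(r)
                out.append(r)
        return out

class Pipeline:
    """Chains operators into a multi-stage workflow."""
    def __init__(self, *operators):
        self.operators = list(operators)

    def run(self, records):
        for op in self.operators:
            records = op.run(records)
        return records

pipeline = Pipeline(LowercaseOperator(), DedupOperator())
print(pipeline.run(["Hello", "hello", "World"]))  # → ['hello', 'world']
```

Because each operator only exposes a `run` method, operators can be reordered, swapped, or reused across pipelines without changing the rest of the workflow.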
## 🧱 Framework Overview
DataFlow consists of the following core modules:
| Module | Description |
|---|---|
| `operator` | Basic data processing units, reusable across pipelines. |
| `pipeline` | Manages multi-step workflows by chaining multiple operators. |
| `storage` | Manages data cache, storage, and I/O between steps. |
| `LLMServing` | Integrates large models for reasoning, filtering, and generation. |
| `Agent` | Automatically generates, orchestrates, and manages data workflows. |
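As a sketch of how the `LLMServing` module can plug into an operator, the example below uses a mock backend to score and filter records. All names are illustrative stand-ins, not DataFlow's actual API; a real backend would call a GPT-style model instead of the word-count heuristic used here.

```python
# Hypothetical sketch: an LLM-backed quality filter.
# NOTE: class names and the scoring heuristic are illustrative,
# not DataFlow's actual API or model behavior.

class MockLLMServing:
    """Stand-in for a GPT-style backend that scores text quality in [0, 1]."""
    def score(self, text: str) -> float:
        # Toy heuristic in place of a model call: longer texts score higher.
        return 1.0 if len(text.split()) >= 3 else 0.0

class LLMQualityFilter:
    """Operator that keeps only records the backend scores above a threshold."""
    def __init__(self, serving, threshold: float = 0.5):
        self.serving = serving
        self.threshold = threshold

    def run(self, records):
        return [r for r in records if self.serving.score(r) >= self.threshold]

flt = LLMQualityFilter(MockLLMServing())
print(flt.run(["too short", "this sentence is long enough"]))
# → ['this sentence is long enough']
```

Keeping the serving backend behind a small interface is what lets rule-based, neural, and LLM-based methods be mixed freely within one pipeline.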
## 🛠️ Example Usage and Operators
To get started quickly with real examples, please refer to our documentation:
- 📘 Example Pipelines: Text Pipeline Tutorial
- 🧩 Available Operators: Operator Reference for Text Evaluation
These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.
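The built-in storage layer described above caches intermediate data between steps, so a rerun can resume from the last completed step instead of recomputing everything. A minimal sketch of that checkpointing idea, with illustrative names that are not DataFlow's actual API:

```python
# Hypothetical sketch of step-level caching via a storage layer.
# NOTE: FileStorage and run_step are illustrative, not DataFlow's actual API.
import json
import os
import tempfile

class FileStorage:
    """Persists each step's output to disk, keyed by step name."""
    def __init__(self, root: str):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _path(self, key: str) -> str:
        return os.path.join(self.root, key + ".json")

    def save(self, key: str, data) -> None:
        with open(self._path(key), "w") as f:
            json.dump(data, f)

    def load(self, key: str):
        path = self._path(key)
        if not os.path.exists(path):
            return None
        with open(path) as f:
            return json.load(f)

def run_step(storage, key, fn, data):
    """Run fn(data) unless a cached result for this step already exists."""
    cached = storage.load(key)
    if cached is not None:
        return cached  # skip recomputation on rerun
    result = fn(data)
    storage.save(key, result)
    return result

storage = FileStorage(tempfile.mkdtemp())
out = run_step(storage, "clean", lambda xs: [x.strip() for x in xs], ["  a ", "b  "])
print(out)  # → ['a', 'b']
```

On a second invocation with the same key, `run_step` returns the cached result without calling `fn`, which is the behavior behind the "checkpointing and result reuse" benefit listed below.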
## 🔍 Why DataFlow?
| Feature | Benefit |
|---|---|
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |
## 📫 Contact
For issues, contributions, or questions, feel free to reach out:

- GitHub: https://github.com/OpenDCAI/DataFlow
- Email: hao.liang@stu.pku.edu.cn