
Dataflow v1.0.0 Release Notes


@SunnyHaze SunnyHaze released this 30 Jun 14:13
· 847 commits to main since this release

🎉🎉🎉 We are thrilled to release our data-centric AI system, DataFlow! 🎉🎉🎉

Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.


🚀 Introduction

DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.

It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.

Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.


🧠 Core Features

  • 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
  • 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
  • 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
  • ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
  • 💾 Built-in Storage Layer: Manage intermediate data and caching.
  • 🔌 LLM Backend Support: Easily plug into GPT-style backends with LLMServing.
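The modular operator and pipeline ideas above can be sketched in a few lines of Python. This is a hypothetical illustration of the PyTorch-style pattern, assuming simple `Operator` and `Pipeline` classes; these names are stand-ins, not DataFlow's documented API.

```python
# Hypothetical sketch of a PyTorch-style operator/pipeline design.
# Operator, Pipeline, and run() are illustrative names, not DataFlow's API.

class Operator:
    """A configurable, reusable data-processing unit."""
    def run(self, records):
        raise NotImplementedError


class Lowercase(Operator):
    """Rule-based cleaning stage: normalize case."""
    def run(self, records):
        return [r.lower() for r in records]


class DropShort(Operator):
    """Filtering stage with a configurable threshold."""
    def __init__(self, min_len=3):
        self.min_len = min_len
    def run(self, records):
        return [r for r in records if len(r) >= self.min_len]


class Pipeline:
    """Chains operators into a multi-stage workflow."""
    def __init__(self, operators):
        self.operators = operators
    def run(self, records):
        for op in self.operators:
            records = op.run(records)
        return records


pipe = Pipeline([Lowercase(), DropShort(min_len=4)])
print(pipe.run(["Hello World", "Hi", "DataFlow"]))  # ['hello world', 'dataflow']
```

Because each operator exposes the same `run` interface, stages can be reordered, swapped, or reused across pipelines, which is the reconfigurability the feature list describes.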

🧱 Framework Overview

DataFlow consists of the following core modules:

| Module | Description |
| --- | --- |
| operator | Basic data processing units, reusable across pipelines. |
| pipeline | Manages multi-step workflows by chaining multiple operators. |
| storage | Manages data cache, storage, and I/O between steps. |
| LLMServing | Integrates large models for reasoning, filtering, and generation. |
| Agent | Automatically generates, orchestrates, and manages data workflows. |
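To make the storage module's role concrete, here is a minimal sketch of caching intermediate results between steps, keyed by step name and input content. The `Storage` class and `run_step` helper are hypothetical illustrations of the idea, not DataFlow's actual interface.

```python
# Illustrative sketch of a storage layer that checkpoints each step's
# output so repeated runs can reuse it. Names are hypothetical.
import hashlib
import json


class Storage:
    """Caches step outputs keyed by (step name, hash of input records)."""
    def __init__(self):
        self._cache = {}

    def _key(self, step, records):
        digest = hashlib.sha256(json.dumps(records).encode()).hexdigest()
        return (step, digest)

    def get(self, step, records):
        return self._cache.get(self._key(step, records))

    def put(self, step, records, result):
        self._cache[self._key(step, records)] = result


storage = Storage()


def run_step(name, fn, records):
    """Run one pipeline step, reusing a cached result when available."""
    cached = storage.get(name, records)
    if cached is not None:
        return cached  # checkpointed result, no recomputation
    result = fn(records)
    storage.put(name, records, result)
    return result


out1 = run_step("dedup", lambda rs: sorted(set(rs)), ["b", "a", "b"])
out2 = run_step("dedup", lambda rs: sorted(set(rs)), ["b", "a", "b"])  # cache hit
```

Hashing the input means a step reruns only when its input actually changes, which is what makes checkpointing and result reuse efficient for expensive LLM-backed stages.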

🛠️ Example Usage and Operators

To get started quickly with real examples, please refer to our documentation.

These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.
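As a taste of what such a workflow looks like, here is a hedged end-to-end sketch that chains a rule-based stage with an LLM-backed filtering stage. Everything here is illustrative: `rule_filter`, `llm_filter`, and the stubbed judge stand in for real operators and a real LLMServing backend.

```python
# Hedged end-to-end sketch of a hybrid (rules + LLM) cleaning workflow.
# All names are illustrative stand-ins, not DataFlow's documented API.

def rule_filter(records):
    """Rule-based stage: drop empty or whitespace-only records."""
    return [r for r in records if r.strip()]


def llm_filter(records, judge):
    """LLM-backed stage: keep records the model judges useful."""
    return [r for r in records if judge(r)]


def run_pipeline(records, stages):
    """Chain stages end to end, passing each stage's output onward."""
    for stage in stages:
        records = stage(records)
    return records


# Stub judge standing in for a call to a real LLM serving backend.
stub_judge = lambda text: "error" not in text.lower()

data = ["good sample", "", "contains ERROR trace", "another sample"]
cleaned = run_pipeline(
    data,
    [rule_filter, lambda rs: llm_filter(rs, stub_judge)],
)
print(cleaned)  # ['good sample', 'another sample']
```

Swapping the stub for a real model call changes only one stage; the rest of the pipeline is untouched, which is the payoff of the modular design.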

🔍 Why DataFlow?

| Feature | Benefit |
| --- | --- |
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |

📫 Contact

For issues, contributions, or questions, feel free to reach out:

GitHub: https://github.com/OpenDCAI/DataFlow
Email: hao.liang@stu.pku.edu.cn