
Dataflow v1.0.0 Release Notes


@SunnyHaze SunnyHaze released this 30 Jun 14:13
· 847 commits to main since this release

🎉🎉🎉 We are thrilled to release our data-centric AI system, DataFlow! 🎉🎉🎉

Version: v1.0.0
Modular and AI-assisted data preparation system for high-efficiency pipelines.


🚀 Introduction

DataFlow is a high-efficiency data preparation system composed of advanced operators and multi-stage data processing pipelines. It integrates rule-based methods, deep learning models, and large language models (LLMs) to provide a modular, scalable, and reconfigurable design.

It aims to improve the quality and efficiency of data cleaning, augmentation, and construction — supporting the development of next-generation large-scale models.

Designed for researchers and engineers working on data-centric AI, LLM training, and scalable data workflows.


🧠 Core Features

  • 🔁 Modular Operator Design: Inspired by PyTorch, each operator is configurable and reusable.
  • 🧩 Multi-stage Pipelines: Flexibly chain operators for end-to-end data processing.
  • 🤖 Agent for DataFlow: LLM-powered automation for pipeline orchestration and operator generation.
  • ⚙️ Hybrid Techniques: Seamlessly combines rule-based, neural, and LLM-based methods.
  • 💾 Built-in Storage Layer: Manage intermediate data and caching.
  • 🔌 LLM Backend Support: Easily plug into GPT-style backends with LLMServing.
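The modular operator and pipeline ideas above can be sketched in a few lines of Python. This is a hypothetical illustration of the PyTorch-style pattern, assuming simple `Operator` and `Pipeline` classes; these names are stand-ins, not DataFlow's documented API.

```python
# Hypothetical sketch of a PyTorch-style operator/pipeline design.
# Operator, Pipeline, and run() are illustrative names, not DataFlow's API.

class Operator:
    """A configurable, reusable data-processing unit."""
    def run(self, records):
        raise NotImplementedError


class Lowercase(Operator):
    """Rule-based cleaning stage: normalize case."""
    def run(self, records):
        return [r.lower() for r in records]


class DropShort(Operator):
    """Filtering stage with a configurable threshold."""
    def __init__(self, min_len=3):
        self.min_len = min_len
    def run(self, records):
        return [r for r in records if len(r) >= self.min_len]


class Pipeline:
    """Chains operators into a multi-stage workflow."""
    def __init__(self, operators):
        self.operators = operators
    def run(self, records):
        for op in self.operators:
            records = op.run(records)
        return records


pipe = Pipeline([Lowercase(), DropShort(min_len=4)])
print(pipe.run(["Hello World", "Hi", "DataFlow"]))  # ['hello world', 'dataflow']
```

Because each operator exposes the same `run` interface, stages can be reordered, swapped, or reused across pipelines, which is the reconfigurability the feature list describes.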

🧱 Framework Overview

DataFlow consists of the following core modules:

| Module | Description |
| --- | --- |
| operator | Basic data processing units, reusable across pipelines. |
| pipeline | Manages multi-step workflows by chaining multiple operators. |
| storage | Manages data cache, storage, and I/O between steps. |
| LLMServing | Integrates large models for reasoning, filtering, and generation. |
| Agent | Automatically generates, orchestrates, and manages data workflows. |
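To make the storage module's role concrete, here is a minimal sketch of caching intermediate results between steps, keyed by step name and input content. The `Storage` class and `run_step` helper are hypothetical illustrations of the idea, not DataFlow's actual interface.

```python
# Illustrative sketch of a storage layer that checkpoints each step's
# output so repeated runs can reuse it. Names are hypothetical.
import hashlib
import json


class Storage:
    """Caches step outputs keyed by (step name, hash of input records)."""
    def __init__(self):
        self._cache = {}

    def _key(self, step, records):
        digest = hashlib.sha256(json.dumps(records).encode()).hexdigest()
        return (step, digest)

    def get(self, step, records):
        return self._cache.get(self._key(step, records))

    def put(self, step, records, result):
        self._cache[self._key(step, records)] = result


storage = Storage()


def run_step(name, fn, records):
    """Run one pipeline step, reusing a cached result when available."""
    cached = storage.get(name, records)
    if cached is not None:
        return cached  # checkpointed result, no recomputation
    result = fn(records)
    storage.put(name, records, result)
    return result


out1 = run_step("dedup", lambda rs: sorted(set(rs)), ["b", "a", "b"])
out2 = run_step("dedup", lambda rs: sorted(set(rs)), ["b", "a", "b"])  # cache hit
```

Hashing the input means a step reruns only when its input actually changes, which is what makes checkpointing and result reuse efficient for expensive LLM-backed stages.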

🛠️ Example Usage and Operators

To get started quickly with real examples, please refer to our documentation.

These guides provide hands-on usage of core modules including Pipeline, Operator, and Agent, and demonstrate how to configure, extend, and run a complete data processing workflow using DataFlow.
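As a taste of what such a workflow looks like, here is a hedged end-to-end sketch that chains a rule-based stage with an LLM-backed filtering stage. Everything here is illustrative: `rule_filter`, `llm_filter`, and the stubbed judge stand in for real operators and a real LLMServing backend.

```python
# Hedged end-to-end sketch of a hybrid (rules + LLM) cleaning workflow.
# All names are illustrative stand-ins, not DataFlow's documented API.

def rule_filter(records):
    """Rule-based stage: drop empty or whitespace-only records."""
    return [r for r in records if r.strip()]


def llm_filter(records, judge):
    """LLM-backed stage: keep records the model judges useful."""
    return [r for r in records if judge(r)]


def run_pipeline(records, stages):
    """Chain stages end to end, passing each stage's output onward."""
    for stage in stages:
        records = stage(records)
    return records


# Stub judge standing in for a call to a real LLM serving backend.
stub_judge = lambda text: "error" not in text.lower()

data = ["good sample", "", "contains ERROR trace", "another sample"]
cleaned = run_pipeline(
    data,
    [rule_filter, lambda rs: llm_filter(rs, stub_judge)],
)
print(cleaned)  # ['good sample', 'another sample']
```

Swapping the stub for a real model call changes only one stage; the rest of the pipeline is untouched, which is the payoff of the modular design.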

🔍 Why DataFlow?

| Feature | Benefit |
| --- | --- |
| PyTorch-style API | Easy to learn and integrate |
| LLM + Rules + NN | Flexible and powerful hybrid workflows |
| Auto Agent Support | Reduces manual data prep burden |
| Storage Layer | Efficient checkpointing and result reuse |
| Fully Modular | Easy to extend, test, and compose |

📫 Contact

For issues, contributions, or questions, feel free to reach out:

GitHub: https://github.com/OpenDCAI/DataFlow
Email: hao.liang@stu.pku.edu.cn