DataFlow-Skills

Reusable agent skills for DataFlow workflows.

Chinese documentation: README_zh.md

Prerequisites: Install Claude Code

Comparison of Usage Methods

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Web | Complete beginners | No installation required | Limited features |
| CLI (command line) | Developers | Full-featured, highly integrated | Requires command-line familiarity |
| Editor integration (VS Code / Cursor, etc.) | Daily development | Seamless workflow | Depends on plugins and environment setup |

Recommendation:

  • Complete beginner → Try the web at https://claude.ai/ first
  • Want to use it for development → Go straight to CLI
  • Already familiar → Consider editor integration

This guide focuses on the CLI.


Installing Claude Code CLI

1. Prerequisites

  • A Claude account — register at claude.ai (skip if using a third-party compatible provider)
  • A command-line tool:
    • Mac / Linux: open Terminal
    • Windows: open PowerShell or install WSL

2. Install via Official Script (Recommended)

macOS / Linux / WSL:

curl -fsSL https://claude.ai/install.sh | bash

Windows PowerShell:

irm https://claude.ai/install.ps1 | iex

Windows CMD:

curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd

Verify the installation:

claude --version

A version number means it installed successfully.

3. Install via npm

Prerequisite: Node.js must be installed (verify: node --version; if missing, download from nodejs.org)

npm install -g @anthropic-ai/claude-code

If the download is slow, use a mirror:

npm install -g @anthropic-ai/claude-code --registry=https://registry.npmmirror.com

4. Updating

Update manually:

claude update

Claude Code checks for updates at launch and installs them in the background; the new version takes effect on the next launch. Configure update behavior in settings.json:

{
  "autoUpdatesChannel": "stable"
}

Disable automatic updates:

{
  "env": {
    "DISABLE_AUTOUPDATER": "1"
  }
}

Note: Homebrew and WinGet installations do not support automatic updates. Update manually:

brew upgrade claude-code           # macOS
winget upgrade Anthropic.ClaudeCode  # Windows

5. Common Installation Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| npm command not found | Node.js not installed | Download from nodejs.org |
| permission denied | Insufficient permissions | Mac/Linux: prefix with sudo; Windows: run PowerShell as Administrator |
| Slow or stalled installation | Network issues | Use a mirror: --registry=https://registry.npmmirror.com |

generating-dataflow-pipeline

Video tutorial: Generate DataFlow Pipeline

Reasoning-guided pipeline planner that generates standard DataFlow pipeline code from a task description and sample data.

What It Does

Given a target (what the pipeline should achieve) and a sample JSONL file (1-5 representative rows), this skill:

  1. Reads and analyzes the sample data — infers field types, content characteristics, and task nature
  2. Selects operators from six core primitives (with extended operators available when needed) using a mandatory decision table
  3. Validates field dependencies across the operator chain
  4. Outputs a two-stage result: an intermediate operator decision (JSON) followed by a complete, runnable Python pipeline

Quick Start

1. Add the Skill

Clone this repository and copy the skill directories into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/generating-dataflow-pipeline .claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text .claude/skills/core_text

# Or personal-level (all your projects)
cp -r DataFlow-Skills/generating-dataflow-pipeline ~/.claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text ~/.claude/skills/core_text

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md. The name field in SKILL.md frontmatter becomes the /slash-command. For more details, see the official skills documentation.

2. Prepare Your Data

Create a JSONL file (one JSON object per line) with 1–5 representative rows:

{"product_name": "Laptop", "category": "Electronics"}
{"product_name": "Coffee Maker", "category": "Appliances"}
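
Since a malformed sample file only surfaces as an error later in the run, it can help to lint it first. A minimal sketch (the validate_jsonl helper below is illustrative, not part of the skill):

```python
import json

def validate_jsonl(lines, max_rows=5):
    """Parse JSONL lines; check the 1-5 row limit and key consistency."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not 1 <= len(rows) <= max_rows:
        raise ValueError(f"expected 1-{max_rows} rows, got {len(rows)}")
    keys = set(rows[0])
    for i, row in enumerate(rows[1:], start=2):
        if set(row) != keys:
            raise ValueError(f"row {i} keys differ from row 1")
    return rows

sample = [
    '{"product_name": "Laptop", "category": "Electronics"}',
    '{"product_name": "Coffee Maker", "category": "Appliances"}',
]
rows = validate_jsonl(sample)
```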

3. Run the Skill

In Claude Code, invoke /generating-dataflow-pipeline and describe your target:

/generating-dataflow-pipeline
Target: Generate product descriptions and filter high-quality ones
Sample file: ./data/products.jsonl
Expected outputs: generated_description, quality_score

4. Review the Output

The skill returns a two-stage result:

Stage 1 — Intermediate Operator Decision (JSON with the operator chain, field flow, and reasoning):

  • Field Mapping — which fields exist vs. need to be generated
  • Ordered Operator List — operators in execution order with justification
  • Reasoning Summary — why this design satisfies the target

Stage 2 — Pipeline Code:

  • Complete Pipeline Code — full executable Python following the standard structure
  • Adjustable Parameters / Caveats — tunable knobs and debugging tips

Six Core Operators

| Operator | Purpose | LLM? |
| --- | --- | --- |
| PromptedGenerator | Single-field LLM generation | Yes |
| FormatStrPromptedGenerator | Multi-field template-based generation | Yes |
| Text2MultiHopQAGenerator | Multi-hop QA pair construction from text | Yes |
| PromptedFilter | LLM-based quality scoring & filtering | Yes |
| GeneralFilter | Rule-based deterministic filtering | No |
| KBC Trio (3 operators, always together in order) | File/URL -> Markdown -> chunks -> clean text | Partial |

Generated Pipeline Structure

All generated pipelines follow the same standard structure:

from dataflow.operators.core_text import PromptedGenerator, PromptedFilter
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage

class MyPipeline:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./data/input.jsonl",  # User-provided path
            cache_path="./cache",
            file_name_prefix="step",
            cache_type="jsonl"
        )
        self.llm_serving = APILLMServing_request(
            api_url="https://api.openai.com/v1/chat/completions",
            model_name="gpt-4o",
            max_workers=10
        )
        # Operator instances ...

    def forward(self):
        # Sequential operator.run() calls, each with storage.step()
        ...

if __name__ == "__main__":
    pipeline = MyPipeline()
    pipeline.forward()

Key rules:

  • first_entry_file_name is set to the exact user-provided JSONL path
  • Each operator.run() call uses storage=self.storage.step() for checkpointing
  • Fields propagate forward: a field must exist in the sample or be output by a prior step before it can be consumed
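
The field-propagation rule can be sketched as a quick pre-flight check (check_field_flow and the operator tuples below are illustrative, not part of DataFlow):

```python
def check_field_flow(sample_fields, steps):
    """Each step is (operator_name, required_inputs, produced_outputs).
    Verify every input field exists before the step that consumes it."""
    available = set(sample_fields)
    for name, inputs, outputs in steps:
        missing = set(inputs) - available
        if missing:
            raise ValueError(f"{name}: missing fields {sorted(missing)}")
        available |= set(outputs)  # outputs become available downstream
    return available

fields = check_field_flow(
    ["product_name", "category"],
    [
        ("PromptedGenerator", ["product_name", "category"], ["generated_description"]),
        ("PromptedFilter", ["generated_description"], ["quality_score"]),
    ],
)
```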

Extended Operators

Beyond the 6 core primitives, DataFlow provides additional operators. See the core_text section for the full operator reference.

Adding a New Operator

Prerequisite: the new operator's skill definition already exists (with SKILL.md, examples/good.md, examples/bad.md, etc.).

As an Extended Operator

Two steps are required:

Step 1. Create an operator directory with its skill definition under any appropriate location (e.g., core_text/<category>/, or a separate skill package):

<skill-directory>/<your-operator-name>/
├── SKILL.md          # API reference (constructor, run() signature, execution logic, constraints)
├── SKILL_zh.md       # Chinese translation (optional)
└── examples/
    ├── good.md       # Best-practice example
    └── bad.md        # Common mistakes

Step 2. Register the operator in SKILL.md's Extended Operator Reference section. Add a row to the corresponding category table (Generate / Filter / Refine / Eval) with the operator name, subdirectory path, and description. Without this entry, the pipeline generator will not know the operator exists.

Promoting to a Core Primitive (Optional)

If the operator is used frequently enough to warrant priority selection, promote it by modifying SKILL.md:

  1. Preferred Operator Strategy — Add to the core primitives list
  2. Operator Selection Priority Rule — Add a decision table row (when to use / when not to use)
  3. Operator Parameter Signature Rule — Add full constructor and run() signatures
  4. Correct Import Paths — Add the import path
  5. Input File Content Analysis Rule — Add input pattern matching if it handles a new data type
  6. Extended Operator Reference — Update or remove the entry from the extended table to avoid duplication with core primitives
  7. Examples — Add a complete example in examples/ (recommended)

core_text

Extended operator reference for generating-dataflow-pipeline.

Per-operator API documentation for all text processing operators used by the pipeline generator. When the 6 core primitives in generating-dataflow-pipeline/SKILL.md don't cover your task, consult the detailed references here.

Available Operators

Generate (core_text/generate/)

  • prompted-generator - Basic LLM generation
  • format-str-prompted-generator - Template-based generation
  • chunked-prompted-generator - Chunked text generation
  • embedding-generator - Generate embeddings
  • retrieval-generator - RAG generation
  • bench-answer-generator - Generate benchmark answers
  • text2multihopqa-generator - Multi-hop QA generation
  • random-domain-knowledge-row-generator - Random domain knowledge generation

Filter (core_text/filter/)

  • prompted-filter - LLM scoring and filtering
  • general-filter - Rule-based numeric filtering
  • kcentergreedy-filter - Diversity-based filtering

Refine (core_text/refine/)

  • prompted-refiner - LLM-based text rewriting
  • pandas-operator - Custom pandas operations

Eval (core_text/eval/)

  • prompted-evaluator - LLM scoring
  • bench-dataset-evaluator - Evaluate benchmark datasets
  • bench-dataset-evaluator-question - Evaluate benchmark questions
  • text2qa-sample-evaluator - Evaluate QA samples
  • unified-bench-dataset-evaluator - Unified evaluation

Directory Structure

Each operator folder contains:

  • SKILL.md - English skill documentation describing use cases, usage, imports, parameters, and examples
  • SKILL_zh.md - Chinese documentation
  • examples/good.md - Correct usage with simple single-operator pipeline, sample input and output
  • examples/bad.md - Common mistakes

dataflow-operator-builder

Video tutorial: Build DataFlow Operator

Production-grade scaffold skill for new DataFlow operators (generate/filter/refine/eval), generating implementation skeletons, CLI wrappers, and test files in one run.

What It Does

Given an operator spec (package name, operator type, input/output keys, etc.), this skill:

  1. Validates the spec against constraint rules in references/ to catch registration, contract, and naming issues early
  2. Generates a complete operator implementation skeleton (generate, filter, refine, or eval)
  3. Creates a standalone CLI module under cli/ for batch jobs and integration testing without extra glue code
  4. Outputs a two-stage result: a --dry-run preview of the file plan, then actual file writes after confirmation

Quick Start

1. Add the Skill

Clone this repository and copy the skill directory into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/dataflow-operator-builder .claude/skills/dataflow-operator-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/dataflow-operator-builder ~/.claude/skills/dataflow-operator-builder

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md.

2. Run the Skill

Mode A (default): Interactive Interview

Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:

/dataflow-operator-builder
  • Round 1: structural fields (package name, operator type, input/output keys, etc.)
  • Round 2: implementation details (LLM usage, CLI module name, test prefix, etc.)

Each question includes a recommended option with a short rationale. The agent proceeds to generation after both rounds.

Mode B: Direct Spec (when you already have a spec)

If you already have an operator spec file (JSON), skip the interview and run directly:

/dataflow-operator-builder --spec path/to/spec.json --output-root path/to/repo

Example spec:

{
  "package_name": "dataflow_ext_demo",
  "operator_type": "filter",
  "operator_class_name": "DemoQualityFilter",
  "operator_module_name": "demo_quality_filter",
  "input_key": "raw_text",
  "output_key": "is_valid",
  "uses_llm": false
}

Required: package_name, operator_type, operator_class_name, operator_module_name, input_key, output_key, uses_llm. Optional: cli_module_name, test_file_prefix, overwrite_strategy, validation_level.
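
The required-field list suggests a simple pre-flight check before running with --spec. A sketch (validate_spec is illustrative, not the skill's actual validator):

```python
REQUIRED = {"package_name", "operator_type", "operator_class_name",
            "operator_module_name", "input_key", "output_key", "uses_llm"}
OPERATOR_TYPES = {"generate", "filter", "refine", "eval"}

def validate_spec(spec):
    """Reject a spec that is missing required keys or uses an unknown type."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if spec["operator_type"] not in OPERATOR_TYPES:
        raise ValueError(f"unknown operator_type: {spec['operator_type']}")
    return True

spec = {
    "package_name": "dataflow_ext_demo",
    "operator_type": "filter",
    "operator_class_name": "DemoQualityFilter",
    "operator_module_name": "demo_quality_filter",
    "input_key": "raw_text",
    "output_key": "is_valid",
    "uses_llm": False,
}
ok = validate_spec(spec)
```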

3. Review the Output

The skill returns a two-stage result:

  1. Create/update plan — --dry-run lists all files to be generated without writing anything
  2. Operator skeleton — class definition, run() signature, and registration entry
  3. CLI module — executable batch script under cli/
  4. Test files — unit, registry, and smoke baselines ready for CI

Helpful Flags

  • --dry-run: preview create/update plan without modifying files
  • --overwrite {ask-each,overwrite-all,skip-existing}: control overwrite behavior safely in existing repos
  • --validation-level {none,basic,full}: choose how strict pre-write checks should be

Generated Artifacts

| Artifact | Path | Description |
| --- | --- | --- |
| Operator implementation | <package>/<module_name>.py | Class definition, run() signature, and registry entry |
| CLI module | cli/<cli_module_name>.py | Standalone batch script |
| Unit test | tests/unit/test_<prefix>.py | Basic unit test |
| Registry test | tests/registry/test_<prefix>_registry.py | Validates correct operator registration |
| Smoke test | tests/smoke/test_<prefix>_smoke.py | End-to-end minimal acceptance |

Generated Operator Skeleton

All generated operators follow the same standard structure:

from dataflow.operators.base import BaseOperator
from dataflow.utils.storage import FileStorage

class DemoQualityFilter(BaseOperator):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def run(self, storage: FileStorage) -> FileStorage:
        # Implement filtering logic here
        ...
        return storage

# Registry entry (auto-generated by the skill)
OPERATOR_REGISTRY.register("DemoQualityFilter", DemoQualityFilter)

Key rules:

  • Must extend BaseOperator and implement run()
  • run() accepts and returns a FileStorage instance for chain propagation
  • Must be registered via OPERATOR_REGISTRY to be discoverable by pipelines
  • The CLI module calls run() directly via --input-file / --output-file, independent of pipeline context
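
The last rule can be sketched as flag parsing alone (the real generated module would also build a FileStorage from these paths and pass it to operator.run(); main here is illustrative):

```python
import argparse

def main(argv=None):
    """Minimal CLI wrapper: parse --input-file/--output-file so the
    operator can run outside any pipeline context."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-file", required=True)
    parser.add_argument("--output-file", required=True)
    args = parser.parse_args(argv)
    # A real cli/ module would construct storage from these paths
    # and call operator.run(storage=...); here we just return them.
    return args.input_file, args.output_file

inp, out = main(["--input-file", "in.jsonl", "--output-file", "out.jsonl"])
```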

prompt-template-builder

Video tutorial: DataFlow Prompt Template Builder

Skill for building/revising DataFlow prompt templates for existing operators, with type-aligned template selection and two-stage auditable outputs.

What It Does

Given a target operator (operator name, constraints, input arguments, etc.), this skill:

  1. Checks operator compatibility and selects the right template style (e.g. DIYPromptABC or FormatStrPrompt) to ensure the template matches operator expectations
  2. Outputs a Stage 1 decision JSON: template strategy, argument mapping, output contract, and static acceptance checks — for code review and traceability
  3. Outputs a Stage 2 final deliverable: template/config content, integration snippet, and acceptance walkthrough — ready for developers and QA to act on

Quick Start

1. Add the Skill

Clone this repository and copy the skill directory into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/prompt-template-builder .claude/skills/prompt-template-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/prompt-template-builder ~/.claude/skills/prompt-template-builder

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md.

2. Run the Skill

Mode A (default): Interactive Interview

Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:

/prompt-template-builder
  • Round 1: structural layer (target scenario, operator name, output contract, constraints)
  • Round 2: implementation layer (argument signatures, boundary samples, acceptance preferences)

Each question includes a recommended option with a short rationale. The agent then proceeds to the two-stage generation.

Mode B: Direct Spec (when you already have a spec)

If you already have a prompt spec file (JSON), skip the interview and run directly:

/prompt-template-builder --spec path/to/prompt_spec.json

Example spec:

{
  "Target": "Generate concise e-commerce selling points",
  "OP_NAME": "PromptedGenerator",
  "Constraints": "Professional tone; <= 80 Chinese chars",
  "Arguments": ["product_name", "category"]
}

Required: Target, OP_NAME. Recommended: Constraints, Expected Output, Arguments, Sample Cases, Tone/Style, Validation Focus.

3. Review the Output

The skill returns a two-stage result:

  1. Stage 1 (decision JSON) — template strategy, argument mapping, output contract, and static checks (including prompt_template_type_aligned)
  2. Stage 2 (final deliverable) — template/config content, integration code snippet, and acceptance walkthrough

Supported Template Types

| Template Type | Compatible Operators | Description |
| --- | --- | --- |
| DIYPromptABC | PromptedGenerator, PromptedFilter, PromptedRefiner, etc. | Fully custom system/user prompt with field interpolation |
| FormatStrPrompt | FormatStrPromptedGenerator | Python f-string style multi-field template |

Stage 1 Decision JSON Format

{
  "prompt_template_type_aligned": "DIYPromptABC",
  "strategy": "Single-field generation using system+user two-layer prompt",
  "argument_mapping": {
    "product_name": "product name",
    "category": "product category"
  },
  "output_contract": "Professional tone, <= 80 Chinese characters, ending with a selling-point phrase",
  "static_checks": [
    "No extra placeholders",
    "Tone matches professional definition",
    "Character limit is verifiable in Stage 2 walkthrough"
  ]
}

Key rules:

  • prompt_template_type_aligned must match the target operator's contract — types cannot be mixed
  • Every item in static_checks must be individually verified in the Stage 2 acceptance walkthrough
  • Argument mapping must fully cover all fields listed in Arguments — no omissions allowed
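
The coverage rule for argument_mapping is easy to verify mechanically. An illustrative sketch (check_argument_mapping is not part of the skill):

```python
def check_argument_mapping(arguments, mapping):
    """Every field listed in Arguments must appear in argument_mapping."""
    missing = [a for a in arguments if a not in mapping]
    if missing:
        raise ValueError(f"unmapped arguments: {missing}")
    return True

ok = check_argument_mapping(
    ["product_name", "category"],
    {"product_name": "product name", "category": "product category"},
)
```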

dataflow-dev

A DataFlow developer expert skill that loads full architecture knowledge and routes to seven specialized workflows — from creating operators and pipelines to diagnosing errors, reviewing code, and syncing the knowledge base when the upstream repo changes.

What It Does

When you invoke /dataflow-dev inside a DataFlow repository, the skill:

  1. Loads context/knowledge_base.md — architecture, API reference, all registered operators
  2. Loads context/dev_notes.md — coding standards, best practices, LLM response templates
  3. Loads diagnostics/known_issues.md — structured symptom → root cause → fix database
  4. Probes the local repo state (git branch, git log, git diff)
  5. Reports a 1–3 line context summary, then routes to the appropriate workflow

Quick Start

1. Add the Skill

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level
cp -r DataFlow-Skills/dataflow-dev .claude/skills/dataflow-dev

# Or personal-level
cp -r DataFlow-Skills/dataflow-dev ~/.claude/skills/dataflow-dev

2. Open a DataFlow Repo

cd /path/to/DataFlow    # must be the repo root
claude                  # launch Claude Code

3. Invoke the Skill

/dataflow-dev
I need a new filter operator that removes texts shorter than N words.

The skill detects your intent, checks for duplicate operators, asks for spec details in a single round, then generates fully compliant code.

Seven Sub-Command Workflows

| Intent keywords | Workflow |
| --- | --- |
| new operator / create operator / 新建算子 | Operator creation (duplicate check → spec confirmation → code generation → registration reminder) |
| new pipeline / 新建 Pipeline | Pipeline creation (operator selection → code generation with storage.step() pattern) |
| new prompt / 新建 Prompt | Prompt creation (PromptABC or DIYPromptABC, registry decorator, @prompt_restrict placement) |
| error / KeyError / AttributeError / Warning / 报错 | Diagnosis (match known_issues.md → root cause + fix code) |
| review / check / 规范审查 | Code review (operator and pipeline checklists, 14-point validation) |
| sync / check updates / 仓库有新算子 | Knowledge base update (detect new operator files, compare against knowledge_base.md, emit update steps) |

Operator Creation Checklist

Every generated operator is validated against these hard rules before output:

✓ Inherits OperatorABC, calls super().__init__()
✓ @OPERATOR_REGISTRY.register() decorator on the class
✓ run() parameter naming: input_* / output_* / storage
✓ run() returns list of output key names
✓ storage.read() and storage.write() both present
✓ LLM-driven operators: member variable named self.llm_serving
✓ Full per-row try/except with sensible defaults on LLM failure
✓ CoT model outputs: <think> tags stripped where needed
✓ @staticmethod get_desc(lang: str = "zh") supporting zh/en
✓ __init__.py TYPE_CHECKING block registration
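
The per-row try/except rule from the checklist can be sketched as follows (run_rows is illustrative; generate stands in for an LLM call):

```python
def run_rows(rows, generate, default=""):
    """Per-row try/except pattern: a failing LLM call falls back to a
    sensible default instead of aborting the whole batch."""
    out = []
    for row in rows:
        try:
            out.append(generate(row))
        except Exception:
            out.append(default)  # sensible default on failure
    return out

results = run_rows(
    [{"text": "ok"}, {"text": None}],          # second row will fail
    lambda row: row["text"].upper(),           # stand-in for an LLM call
)
```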

Diagnostics Quick Reference

| Error keyword | Issue |
| --- | --- |
| Unexpected key 'xxx' in operator | #001 — config param naming (warning only, not an error) |
| No object named 'Xxx' found in 'operators' registry | #002 — missing __init__.py TYPE_CHECKING entry |
| Key Matching Error / does not match any output keys | #003 — pipeline key mismatch |
| You must call storage.step() before | #004 — missing storage.step() call |
| DummyStorage + AttributeError | #005 — DummyStorage API limitations |
| AttributeError: 'NoneType' + re.split | #006 — capturing group in re.split() pattern |
| @prompt_restrict not taking effect | #007 — decorator placement must be adjacent to class definition |

Full root-cause analysis and fix examples are in diagnostics/known_issues.md.

Knowledge Base Update Flow

When the upstream repo (OpenDCAI/DataFlow) merges new operator PRs:

# Check upstream merged PRs
gh pr list --repo OpenDCAI/DataFlow --state merged --limit 20

# Detect newly added operator files in local repo (last 30 commits)
git log --oneline --diff-filter=A -- 'dataflow/operators/**/*.py' | head -30

# Or run the bundled helper script
bash .claude/skills/dataflow-dev/scripts/check_updates.sh /path/to/DataFlow

The script outputs: new operator files, all registered operator names, operators missing from knowledge_base.md, and recent upstream PRs/Issues — with a step-by-step update guide.
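
As a rough illustration of the knowledge-base diff the script performs (operators_missing_from_kb is a sketch, not the bundled script's actual logic):

```python
import re

def operators_missing_from_kb(operator_sources, knowledge_base_text):
    """Collect operator class names from source files and report those
    not mentioned anywhere in knowledge_base.md."""
    pattern = re.compile(r"class\s+(\w+)\s*\(")
    names = set()
    for src in operator_sources:
        names.update(pattern.findall(src))
    return sorted(n for n in names if n not in knowledge_base_text)

missing = operators_missing_from_kb(
    ["class NewShinyFilter(OperatorABC):\n    ..."],
    "Known operators: PromptedGenerator, PromptedFilter",
)
```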

File Structure

dataflow-dev/
├── SKILL.md                        # Skill definition & sub-command routing
├── context/
│   ├── knowledge_base.md           # Architecture, API reference, all operators (read-only)
│   └── dev_notes.md                # Coding standards, best practices (appendable)
├── diagnostics/
│   └── known_issues.md             # Structured Issue database #001–#008
├── templates/
│   ├── operator_template.py        # Operator scaffold
│   ├── pipeline_template.py        # Pipeline scaffold
│   └── prompt_template.py          # Prompt scaffold
└── scripts/
    └── check_updates.sh            # Repo change detection & knowledge base diff

Upstream Repository

All knowledge in this skill is aligned to OpenDCAI/DataFlow (main branch, v1.0.10).
