DataFlow-Skills

Reusable agent skills for DataFlow workflows.

Chinese documentation: README_zh.md

Prerequisites: Install Claude Code

Comparison of Usage Methods

| Method | Best For | Pros | Cons |
| --- | --- | --- | --- |
| Web | Complete beginners | No installation required | Limited features |
| CLI (command line) | Developers | Full-featured, highly integrated | Requires command-line familiarity |
| Editor integration (VS Code / Cursor, etc.) | Daily development | Seamless workflow | Depends on plugins and environment setup |

Recommendation:

  • Complete beginner → Try the web at https://claude.ai/ first
  • Want to use it for development → Go straight to CLI
  • Already familiar → Consider editor integration

This guide focuses on the CLI.


Installing Claude Code CLI

1. Prerequisites

  • A Claude account — register at claude.ai (skip if using a third-party compatible provider)
  • A command-line tool:
    • Mac / Linux: open Terminal
    • Windows: open PowerShell or install WSL

2. Install via Official Script (Recommended)

macOS / Linux / WSL:

curl -fsSL https://claude.ai/install.sh | bash

Windows PowerShell:

irm https://claude.ai/install.ps1 | iex

Windows CMD:

curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd

Verify the installation:

claude --version

A version number means it installed successfully.

3. Install via npm

Prerequisite: Node.js must be installed (verify: node --version; if missing, download from nodejs.org)

npm install -g @anthropic-ai/claude-code

If the download is slow, use a mirror:

npm install -g @anthropic-ai/claude-code --registry=https://registry.npmmirror.com

4. Updating

Update manually:

claude update

Claude Code checks for updates at launch and installs them in the background; the new version takes effect on the next launch. Configure update behavior in settings.json:

{
  "autoUpdatesChannel": "stable"
}

Disable automatic updates:

{
  "env": {
    "DISABLE_AUTOUPDATER": "1"
  }
}

Note: Homebrew and WinGet installations do not support automatic updates. Update manually:

brew upgrade claude-code           # macOS
winget upgrade Anthropic.ClaudeCode  # Windows

5. Common Installation Issues

| Problem | Cause | Solution |
| --- | --- | --- |
| npm command not found | Node.js not installed | Download from nodejs.org |
| permission denied | Insufficient permissions | Mac/Linux: prefix with sudo; Windows: run PowerShell as Administrator |
| Slow or stalled installation | Network issues | Use a mirror: --registry=https://registry.npmmirror.com |

generating-dataflow-pipeline

Video tutorial: Generate DataFlow Pipeline

Reasoning-guided pipeline planner that generates standard DataFlow pipeline code from a task description and sample data.

What It Does

Given a target (what the pipeline should achieve) and a sample JSONL file (1-5 representative rows), this skill:

  1. Reads and analyzes the sample data — infers field types, content characteristics, and task nature
  2. Selects operators from six core primitives (with extended operators available when needed) using a mandatory decision table
  3. Validates field dependencies across the operator chain
  4. Outputs a two-stage result: an intermediate operator decision (JSON) followed by a complete, runnable Python pipeline

Quick Start

1. Add the Skill

Clone this repository and copy the skill directories into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/generating-dataflow-pipeline .claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text .claude/skills/core_text

# Or personal-level (all your projects)
cp -r DataFlow-Skills/generating-dataflow-pipeline ~/.claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text ~/.claude/skills/core_text

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md. The name field in SKILL.md frontmatter becomes the /slash-command. For more details, see the official skills documentation.

2. Prepare Your Data

Create a JSONL file (one JSON object per line) with 1–5 representative rows:

{"product_name": "Laptop", "category": "Electronics"}
{"product_name": "Coffee Maker", "category": "Appliances"}
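
Since a malformed sample file only surfaces as an error later in the run, it can help to lint it first. A minimal sketch (the validate_jsonl helper below is illustrative, not part of the skill):

```python
import json

def validate_jsonl(lines, max_rows=5):
    """Parse JSONL lines; check the 1-5 row limit and key consistency."""
    rows = [json.loads(line) for line in lines if line.strip()]
    if not 1 <= len(rows) <= max_rows:
        raise ValueError(f"expected 1-{max_rows} rows, got {len(rows)}")
    keys = set(rows[0])
    for i, row in enumerate(rows[1:], start=2):
        if set(row) != keys:
            raise ValueError(f"row {i} keys differ from row 1")
    return rows

sample = [
    '{"product_name": "Laptop", "category": "Electronics"}',
    '{"product_name": "Coffee Maker", "category": "Appliances"}',
]
rows = validate_jsonl(sample)
```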

3. Run the Skill

In Claude Code, invoke /generating-dataflow-pipeline and describe your target:

/generating-dataflow-pipeline
Target: Generate product descriptions and filter high-quality ones
Sample file: ./data/products.jsonl
Expected outputs: generated_description, quality_score

4. Review the Output

The skill returns a two-stage result:

Stage 1 — Intermediate Operator Decision (JSON with the operator chain, field flow, and reasoning):

  • Field Mapping — which fields exist vs. need to be generated
  • Ordered Operator List — operators in execution order with justification
  • Reasoning Summary — why this design satisfies the target

Stage 2 — Pipeline Code:

  • Complete Pipeline Code — full executable Python following the standard structure
  • Adjustable Parameters / Caveats — tunable knobs and debugging tips

Six Core Operators

| Operator | Purpose | LLM? |
| --- | --- | --- |
| PromptedGenerator | Single-field LLM generation | Yes |
| FormatStrPromptedGenerator | Multi-field template-based generation | Yes |
| Text2MultiHopQAGenerator | Multi-hop QA pair construction from text | Yes |
| PromptedFilter | LLM-based quality scoring & filtering | Yes |
| GeneralFilter | Rule-based deterministic filtering | No |
| KBC Trio (3 operators, always together in order) | File/URL -> Markdown -> chunks -> clean text | Partial |

Generated Pipeline Structure

All generated pipelines follow the same standard structure:

from dataflow.operators.core_text import PromptedGenerator, PromptedFilter
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage

class MyPipeline:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./data/input.jsonl",  # User-provided path
            cache_path="./cache",
            file_name_prefix="step",
            cache_type="jsonl"
        )
        self.llm_serving = APILLMServing_request(
            api_url="https://api.openai.com/v1/chat/completions",
            model_name="gpt-4o",
            max_workers=10
        )
        # Operator instances ...

    def forward(self):
        # Sequential operator.run() calls, each with storage.step()
        ...

if __name__ == "__main__":
    pipeline = MyPipeline()
    pipeline.forward()

Key rules:

  • first_entry_file_name is set to the exact user-provided JSONL path
  • Each operator.run() call uses storage=self.storage.step() for checkpointing
  • Fields propagate forward: a field must exist in the sample or be output by a prior step before it can be consumed
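
The field-propagation rule can be sketched as a quick pre-flight check (check_field_flow and the operator tuples below are illustrative, not part of DataFlow):

```python
def check_field_flow(sample_fields, steps):
    """Each step is (operator_name, required_inputs, produced_outputs).
    Verify every input field exists before the step that consumes it."""
    available = set(sample_fields)
    for name, inputs, outputs in steps:
        missing = set(inputs) - available
        if missing:
            raise ValueError(f"{name}: missing fields {sorted(missing)}")
        available |= set(outputs)  # outputs become available downstream
    return available

fields = check_field_flow(
    ["product_name", "category"],
    [
        ("PromptedGenerator", ["product_name", "category"], ["generated_description"]),
        ("PromptedFilter", ["generated_description"], ["quality_score"]),
    ],
)
```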

Extended Operators

Beyond the 6 core primitives, DataFlow provides additional operators. See the core_text section for the full operator reference.

Adding a New Operator

Prerequisite: the new operator's skill definition already exists (with SKILL.md, examples/good.md, examples/bad.md, etc.).

As an Extended Operator

Two steps are required:

Step 1. Create an operator directory with its skill definition under any appropriate location (e.g., core_text/<category>/, or a separate skill package):

<skill-directory>/<your-operator-name>/
├── SKILL.md          # API reference (constructor, run() signature, execution logic, constraints)
├── SKILL_zh.md       # Chinese translation (optional)
└── examples/
    ├── good.md       # Best-practice example
    └── bad.md        # Common mistakes

Step 2. Register the operator in SKILL.md's Extended Operator Reference section. Add a row to the corresponding category table (Generate / Filter / Refine / Eval) with the operator name, subdirectory path, and description. Without this entry, the pipeline generator will not know the operator exists.

Promoting to a Core Primitive (Optional)

If the operator is used frequently enough to warrant priority selection, promote it by modifying SKILL.md:

  1. Preferred Operator Strategy — Add to the core primitives list
  2. Operator Selection Priority Rule — Add a decision table row (when to use / when not to use)
  3. Operator Parameter Signature Rule — Add full constructor and run() signatures
  4. Correct Import Paths — Add the import path
  5. Input File Content Analysis Rule — Add input pattern matching if it handles a new data type
  6. Extended Operator Reference — Update or remove the entry from the extended table to avoid duplication with core primitives
  7. Examples — Add a complete example in examples/ (recommended)

core_text

Extended operator reference for generating-dataflow-pipeline.

Per-operator API documentation for all text processing operators used by the pipeline generator. When the 6 core primitives in generating-dataflow-pipeline/SKILL.md don't cover your task, consult the detailed references here.

Available Operators

Generate (core_text/generate/)

  • prompted-generator - Basic LLM generation
  • format-str-prompted-generator - Template-based generation
  • chunked-prompted-generator - Chunked text generation
  • embedding-generator - Generate embeddings
  • retrieval-generator - RAG generation
  • bench-answer-generator - Generate benchmark answers
  • text2multihopqa-generator - Multi-hop QA generation
  • random-domain-knowledge-row-generator - Random domain knowledge generation

Filter (core_text/filter/)

  • prompted-filter - LLM scoring and filtering
  • general-filter - Rule-based numeric filtering
  • kcentergreedy-filter - Diversity-based filtering

Refine (core_text/refine/)

  • prompted-refiner - LLM-based text rewriting
  • pandas-operator - Custom pandas operations

Eval (core_text/eval/)

  • prompted-evaluator - LLM scoring
  • bench-dataset-evaluator - Evaluate benchmark datasets
  • bench-dataset-evaluator-question - Evaluate benchmark questions
  • text2qa-sample-evaluator - Evaluate QA samples
  • unified-bench-dataset-evaluator - Unified evaluation

Directory Structure

Each operator folder contains:

  • SKILL.md - English skill documentation describing use cases, usage, imports, parameters, and examples
  • SKILL_zh.md - Chinese documentation
  • examples/good.md - Correct usage with simple single-operator pipeline, sample input and output
  • examples/bad.md - Common mistakes

dataflow-operator-builder

Video tutorial: Build DataFlow Operator

Production-grade scaffold skill for new DataFlow operators (generate/filter/refine/eval), generating implementation skeletons, CLI wrappers, and test files in one run.

What It Does

Given an operator spec (package name, operator type, input/output keys, etc.), this skill:

  1. Validates the spec against constraint rules in references/ to catch registration, contract, and naming issues early
  2. Generates a complete operator implementation skeleton (generate, filter, refine, or eval)
  3. Creates a standalone CLI module under cli/ for batch jobs and integration testing without extra glue code
  4. Outputs a two-stage result: a --dry-run preview of the file plan, then actual file writes after confirmation

Quick Start

1. Add the Skill

Clone this repository and copy the skill directory into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/dataflow-operator-builder .claude/skills/dataflow-operator-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/dataflow-operator-builder ~/.claude/skills/dataflow-operator-builder

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md.

2. Run the Skill

Mode A (default): Interactive Interview

Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:

/dataflow-operator-builder
  • Round 1: structural fields (package name, operator type, input/output keys, etc.)
  • Round 2: implementation details (LLM usage, CLI module name, test prefix, etc.)

Each question includes a recommended option with a short rationale. The agent proceeds to generation after both rounds.

Mode B: Direct Spec (when you already have a spec)

If you already have an operator spec file (JSON), skip the interview and run directly:

/dataflow-operator-builder --spec path/to/spec.json --output-root path/to/repo

Example spec:

{
  "package_name": "dataflow_ext_demo",
  "operator_type": "filter",
  "operator_class_name": "DemoQualityFilter",
  "operator_module_name": "demo_quality_filter",
  "input_key": "raw_text",
  "output_key": "is_valid",
  "uses_llm": false
}

Required: package_name, operator_type, operator_class_name, operator_module_name, input_key, output_key, uses_llm. Optional: cli_module_name, test_file_prefix, overwrite_strategy, validation_level.
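
The required-field list suggests a simple pre-flight check before running with --spec. A sketch (validate_spec is illustrative, not the skill's actual validator):

```python
REQUIRED = {"package_name", "operator_type", "operator_class_name",
            "operator_module_name", "input_key", "output_key", "uses_llm"}
OPERATOR_TYPES = {"generate", "filter", "refine", "eval"}

def validate_spec(spec):
    """Reject a spec that is missing required keys or uses an unknown type."""
    missing = REQUIRED - spec.keys()
    if missing:
        raise ValueError(f"missing required fields: {sorted(missing)}")
    if spec["operator_type"] not in OPERATOR_TYPES:
        raise ValueError(f"unknown operator_type: {spec['operator_type']}")
    return True

spec = {
    "package_name": "dataflow_ext_demo",
    "operator_type": "filter",
    "operator_class_name": "DemoQualityFilter",
    "operator_module_name": "demo_quality_filter",
    "input_key": "raw_text",
    "output_key": "is_valid",
    "uses_llm": False,
}
ok = validate_spec(spec)
```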

3. Review the Output

The skill returns a two-stage result:

  1. Create/update plan — --dry-run lists all files to be generated without writing anything
  2. Operator skeleton — class definition, run() signature, and registration entry
  3. CLI module — executable batch script under cli/
  4. Test files — unit, registry, and smoke baselines ready for CI

Helpful Flags

  • --dry-run: preview create/update plan without modifying files
  • --overwrite {ask-each,overwrite-all,skip-existing}: control overwrite behavior safely in existing repos
  • --validation-level {none,basic,full}: choose how strict pre-write checks should be

Generated Artifacts

| Artifact | Path | Description |
| --- | --- | --- |
| Operator implementation | <package>/<module_name>.py | Class definition, run() signature, and registry entry |
| CLI module | cli/<cli_module_name>.py | Standalone batch script |
| Unit test | tests/unit/test_<prefix>.py | Basic unit test |
| Registry test | tests/registry/test_<prefix>_registry.py | Validates correct operator registration |
| Smoke test | tests/smoke/test_<prefix>_smoke.py | End-to-end minimal acceptance |

Generated Operator Skeleton

All generated operators follow the same standard structure:

from dataflow.operators.base import BaseOperator
from dataflow.utils.storage import FileStorage

class DemoQualityFilter(BaseOperator):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def run(self, storage: FileStorage) -> FileStorage:
        # Implement filtering logic here
        ...
        return storage

# Registry entry (auto-generated by the skill)
OPERATOR_REGISTRY.register("DemoQualityFilter", DemoQualityFilter)

Key rules:

  • Must extend BaseOperator and implement run()
  • run() accepts and returns a FileStorage instance for chain propagation
  • Must be registered via OPERATOR_REGISTRY to be discoverable by pipelines
  • The CLI module calls run() directly via --input-file / --output-file, independent of pipeline context
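
The last rule can be sketched as flag parsing alone (the real generated module would also build a FileStorage from these paths and pass it to operator.run(); main here is illustrative):

```python
import argparse

def main(argv=None):
    """Minimal CLI wrapper: parse --input-file/--output-file so the
    operator can run outside any pipeline context."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input-file", required=True)
    parser.add_argument("--output-file", required=True)
    args = parser.parse_args(argv)
    # A real cli/ module would construct storage from these paths
    # and call operator.run(storage=...); here we just return them.
    return args.input_file, args.output_file

inp, out = main(["--input-file", "in.jsonl", "--output-file", "out.jsonl"])
```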

prompt-template-builder

Video tutorial: DataFlow Prompt Template Builder

Skill for building/revising DataFlow prompt templates for existing operators, with type-aligned template selection and two-stage auditable outputs.

What It Does

Given a target operator (operator name, constraints, input arguments, etc.), this skill:

  1. Checks operator compatibility and selects the right template style (e.g. DIYPromptABC or FormatStrPrompt) to ensure the template matches operator expectations
  2. Outputs a Stage 1 decision JSON: template strategy, argument mapping, output contract, and static acceptance checks — for code review and traceability
  3. Outputs a Stage 2 final deliverable: template/config content, integration snippet, and acceptance walkthrough — ready for developers and QA to act on

Quick Start

1. Add the Skill

Clone this repository and copy the skill directory into your Claude Code skills folder:

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/prompt-template-builder .claude/skills/prompt-template-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/prompt-template-builder ~/.claude/skills/prompt-template-builder

Claude Code discovers skills from .claude/skills/<skill-name>/SKILL.md.

2. Run the Skill

Mode A (default): Interactive Interview

Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:

/prompt-template-builder
  • Round 1: structural layer (target scenario, operator name, output contract, constraints)
  • Round 2: implementation layer (argument signatures, boundary samples, acceptance preferences)

Each question includes a recommended option with a short rationale. The agent then proceeds to the two-stage generation.

Mode B: Direct Spec (when you already have a spec)

If you already have a prompt spec file (JSON), skip the interview and run directly:

/prompt-template-builder --spec path/to/prompt_spec.json

Example spec:

{
  "Target": "Generate concise e-commerce selling points",
  "OP_NAME": "PromptedGenerator",
  "Constraints": "Professional tone; <= 80 Chinese chars",
  "Arguments": ["product_name", "category"]
}

Required: Target, OP_NAME. Recommended: Constraints, Expected Output, Arguments, Sample Cases, Tone/Style, Validation Focus.

3. Review the Output

The skill returns a two-stage result:

  1. Stage 1 (decision JSON) — template strategy, argument mapping, output contract, and static checks (including prompt_template_type_aligned)
  2. Stage 2 (final deliverable) — template/config content, integration code snippet, and acceptance walkthrough

Supported Template Types

| Template Type | Compatible Operators | Description |
| --- | --- | --- |
| DIYPromptABC | PromptedGenerator, PromptedFilter, PromptedRefiner, etc. | Fully custom system/user prompt with field interpolation |
| FormatStrPrompt | FormatStrPromptedGenerator | Python f-string style multi-field template |

Stage 1 Decision JSON Format

{
  "prompt_template_type_aligned": "DIYPromptABC",
  "strategy": "Single-field generation using system+user two-layer prompt",
  "argument_mapping": {
    "product_name": "product name",
    "category": "product category"
  },
  "output_contract": "Professional tone, <= 80 Chinese characters, ending with a selling-point phrase",
  "static_checks": [
    "No extra placeholders",
    "Tone matches professional definition",
    "Character limit is verifiable in Stage 2 walkthrough"
  ]
}

Key rules:

  • prompt_template_type_aligned must match the target operator's contract — types cannot be mixed
  • Every item in static_checks must be individually verified in the Stage 2 acceptance walkthrough
  • Argument mapping must fully cover all fields listed in Arguments — no omissions allowed
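
The coverage rule for argument_mapping is easy to verify mechanically. An illustrative sketch (check_argument_mapping is not part of the skill):

```python
def check_argument_mapping(arguments, mapping):
    """Every field listed in Arguments must appear in argument_mapping."""
    missing = [a for a in arguments if a not in mapping]
    if missing:
        raise ValueError(f"unmapped arguments: {missing}")
    return True

ok = check_argument_mapping(
    ["product_name", "category"],
    {"product_name": "product name", "category": "product category"},
)
```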

dataflow-dev

A DataFlow developer expert skill that loads full architecture knowledge and routes to seven specialized workflows — from creating operators and pipelines to diagnosing errors, reviewing code, and syncing the knowledge base when the upstream repo changes.

What It Does

When you invoke /dataflow-dev inside a DataFlow repository, the skill:

  1. Loads context/knowledge_base.md — architecture, API reference, all registered operators
  2. Loads context/dev_notes.md — coding standards, best practices, LLM response templates
  3. Loads diagnostics/known_issues.md — structured symptom → root cause → fix database
  4. Probes the local repo state (git branch, git log, git diff)
  5. Reports a 1–3 line context summary, then routes to the appropriate workflow

Quick Start

1. Add the Skill

git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level
cp -r DataFlow-Skills/dataflow-dev .claude/skills/dataflow-dev

# Or personal-level
cp -r DataFlow-Skills/dataflow-dev ~/.claude/skills/dataflow-dev

2. Open a DataFlow Repo

cd /path/to/DataFlow    # must be the repo root
claude                  # launch Claude Code

3. Invoke the Skill

/dataflow-dev
I need a new filter operator that removes texts shorter than N words.

The skill detects your intent, checks for duplicate operators, asks for spec details in a single round, then generates fully compliant code.

Seven Sub-Command Workflows

| Intent keywords | Workflow |
| --- | --- |
| new operator / create operator / 新建算子 | Operator creation (duplicate check → spec confirmation → code generation → registration reminder) |
| new pipeline / 新建 Pipeline | Pipeline creation (operator selection → code generation with storage.step() pattern) |
| new prompt / 新建 Prompt | Prompt creation (PromptABC or DIYPromptABC, registry decorator, @prompt_restrict placement) |
| error / KeyError / AttributeError / Warning / 报错 | Diagnosis (match known_issues.md → root cause + fix code) |
| review / check / 规范审查 | Code review (operator and pipeline checklists, 14-point validation) |
| sync / check updates / 仓库有新算子 | Knowledge base update (detect new operator files, compare against knowledge_base.md, emit update steps) |

Operator Creation Checklist

Every generated operator is validated against these hard rules before output:

✓ Inherits OperatorABC, calls super().__init__()
✓ @OPERATOR_REGISTRY.register() decorator on the class
✓ run() parameter naming: input_* / output_* / storage
✓ run() returns list of output key names
✓ storage.read() and storage.write() both present
✓ LLM-driven operators: member variable named self.llm_serving
✓ Full per-row try/except with sensible defaults on LLM failure
✓ CoT model outputs: <think> tags stripped where needed
✓ @staticmethod get_desc(lang: str = "zh") supporting zh/en
✓ __init__.py TYPE_CHECKING block registration
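
The per-row try/except rule from the checklist can be sketched as follows (run_rows is illustrative; generate stands in for an LLM call):

```python
def run_rows(rows, generate, default=""):
    """Per-row try/except pattern: a failing LLM call falls back to a
    sensible default instead of aborting the whole batch."""
    out = []
    for row in rows:
        try:
            out.append(generate(row))
        except Exception:
            out.append(default)  # sensible default on failure
    return out

results = run_rows(
    [{"text": "ok"}, {"text": None}],          # second row will fail
    lambda row: row["text"].upper(),           # stand-in for an LLM call
)
```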

Diagnostics Quick Reference

| Error keyword | Issue |
| --- | --- |
| Unexpected key 'xxx' in operator | #001 — config param naming (warning only, not an error) |
| No object named 'Xxx' found in 'operators' registry | #002 — missing __init__.py TYPE_CHECKING entry |
| Key Matching Error / does not match any output keys | #003 — pipeline key mismatch |
| You must call storage.step() before | #004 — missing storage.step() call |
| DummyStorage + AttributeError | #005 — DummyStorage API limitations |
| AttributeError: 'NoneType' + re.split | #006 — capturing group in re.split() pattern |
| @prompt_restrict not taking effect | #007 — decorator placement must be adjacent to class definition |

Full root-cause analysis and fix examples are in diagnostics/known_issues.md.

Knowledge Base Update Flow

When the upstream repo (OpenDCAI/DataFlow) merges new operator PRs:

# Check upstream merged PRs
gh pr list --repo OpenDCAI/DataFlow --state merged --limit 20

# Detect newly added operator files in local repo (last 30 commits)
git log --oneline --diff-filter=A -- 'dataflow/operators/**/*.py' | head -30

# Or run the bundled helper script
bash .claude/skills/dataflow-dev/scripts/check_updates.sh /path/to/DataFlow

The script outputs: new operator files, all registered operator names, operators missing from knowledge_base.md, and recent upstream PRs/Issues — with a step-by-step update guide.
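
As a rough illustration of the knowledge-base diff the script performs (operators_missing_from_kb is a sketch, not the bundled script's actual logic):

```python
import re

def operators_missing_from_kb(operator_sources, knowledge_base_text):
    """Collect operator class names from source files and report those
    not mentioned anywhere in knowledge_base.md."""
    pattern = re.compile(r"class\s+(\w+)\s*\(")
    names = set()
    for src in operator_sources:
        names.update(pattern.findall(src))
    return sorted(n for n in names if n not in knowledge_base_text)

missing = operators_missing_from_kb(
    ["class NewShinyFilter(OperatorABC):\n    ..."],
    "Known operators: PromptedGenerator, PromptedFilter",
)
```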

File Structure

dataflow-dev/
├── SKILL.md                        # Skill definition & sub-command routing
├── context/
│   ├── knowledge_base.md           # Architecture, API reference, all operators (read-only)
│   └── dev_notes.md                # Coding standards, best practices (appendable)
├── diagnostics/
│   └── known_issues.md             # Structured Issue database #001–#008
├── templates/
│   ├── operator_template.py        # Operator scaffold
│   ├── pipeline_template.py        # Pipeline scaffold
│   └── prompt_template.py          # Prompt scaffold
└── scripts/
    └── check_updates.sh            # Repo change detection & knowledge base diff

Upstream Repository

All knowledge in this skill is aligned to OpenDCAI/DataFlow (main branch, v1.0.10).
