Reusable agent skills for DataFlow workflows.
Chinese documentation: README_zh.md
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Web | Complete beginners | No installation required | Limited features |
| CLI (command line) | Developers | Full-featured, highly integrated | Requires command-line familiarity |
| Editor integration (VS Code / Cursor, etc.) | Daily development | Seamless workflow | Depends on plugins and environment setup |
Recommendation:
- Complete beginner → Try the web at https://claude.ai/ first
- Want to use it for development → Go straight to CLI
- Already familiar → Consider editor integration
This guide focuses on the CLI.
- A Claude account — register at claude.ai (skip if using a third-party compatible provider)
- A command-line tool:
  - Mac / Linux: open Terminal
  - Windows: open PowerShell or install WSL
macOS / Linux / WSL:

```bash
curl -fsSL https://claude.ai/install.sh | bash
```

Windows PowerShell:

```powershell
irm https://claude.ai/install.ps1 | iex
```

Windows CMD:

```cmd
curl -fsSL https://claude.ai/install.cmd -o install.cmd && install.cmd && del install.cmd
```

Verify the installation:

```bash
claude --version
```

A version number means it installed successfully.
Prerequisite: Node.js must be installed (verify with `node --version`; if missing, download from nodejs.org).

```bash
npm install -g @anthropic-ai/claude-code
```

If the download is slow, use a mirror:

```bash
npm install -g @anthropic-ai/claude-code --registry=https://registry.npmmirror.com
```

Update manually:

```bash
claude update
```

Claude Code checks for updates at launch and installs them in the background; the new version takes effect on the next launch. Configure update behavior in `settings.json`:

```json
{
  "autoUpdatesChannel": "stable"
}
```

Disable automatic updates:

```json
{
  "env": {
    "DISABLE_AUTOUPDATER": "1"
  }
}
```

Note: Homebrew and WinGet installations do not support automatic updates. Update manually:

```bash
brew upgrade claude-code               # macOS
winget upgrade Anthropic.ClaudeCode    # Windows
```
| Problem | Cause | Solution |
|---|---|---|
| `npm command not found` | Node.js not installed | Download from nodejs.org |
| `permission denied` | Insufficient permissions | Mac/Linux: prefix with `sudo`; Windows: run PowerShell as Administrator |
| Slow or stalled installation | Network issues | Use a mirror: `--registry=https://registry.npmmirror.com` |
Video tutorial: Generate DataFlow Pipeline
Reasoning-guided pipeline planner that generates standard DataFlow pipeline code from a task description and sample data.
Given a target (what the pipeline should achieve) and a sample JSONL file (1-5 representative rows), this skill:
- Reads and analyzes the sample data — infers field types, content characteristics, and task nature
- Selects operators from six core primitives (with extended operators available when needed) using a mandatory decision table
- Validates field dependencies across the operator chain
- Outputs a two-stage result: an intermediate operator decision (JSON) followed by a complete, runnable Python pipeline
Clone this repository and copy the skill directories into your Claude Code skills folder:
```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/generating-dataflow-pipeline .claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text .claude/skills/core_text

# Or personal-level (all your projects)
cp -r DataFlow-Skills/generating-dataflow-pipeline ~/.claude/skills/generating-dataflow-pipeline
cp -r DataFlow-Skills/core_text ~/.claude/skills/core_text
```

Claude Code discovers skills from `.claude/skills/<skill-name>/SKILL.md`. The `name` field in SKILL.md frontmatter becomes the `/slash-command`. For more details, see the official skills documentation.
Create a JSONL file (one JSON object per line) with 1–5 representative rows:
```jsonl
{"product_name": "Laptop", "category": "Electronics"}
{"product_name": "Coffee Maker", "category": "Appliances"}
```

In Claude Code, invoke `/generating-dataflow-pipeline` and describe your target:
```
/generating-dataflow-pipeline
Target: Generate product descriptions and filter high-quality ones
Sample file: ./data/products.jsonl
Expected outputs: generated_description, quality_score
```
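Optionally, sanity-check the sample file first. A minimal sketch (the skill performs its own analysis; this only catches malformed JSONL early):

```python
import json

def validate_jsonl_sample(path, max_rows=5):
    """Check that a file is valid JSONL with 1..max_rows rows; return its field names."""
    with open(path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    if not 1 <= len(rows) <= max_rows:
        raise ValueError(f"expected 1-{max_rows} rows, got {len(rows)}")
    fields = set()
    for row in rows:
        fields.update(row)  # collect every key seen across rows
    return sorted(fields)
```

Running it on the `products.jsonl` above would report the fields `category` and `product_name`.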
The skill returns a two-stage result:
- Intermediate Operator Decision — JSON with operator chain, field flow, and reasoning
- Field Mapping — which fields exist vs. need to be generated
- Ordered Operator List — operators in execution order with justification
- Reasoning Summary — why this design satisfies the target
- Complete Pipeline Code — full executable Python following standard structure
- Adjustable Parameters / Caveats — tunable knobs and debugging tips
| Operator | Purpose | LLM? |
|---|---|---|
| `PromptedGenerator` | Single-field LLM generation | Yes |
| `FormatStrPromptedGenerator` | Multi-field template-based generation | Yes |
| `Text2MultiHopQAGenerator` | Multi-hop QA pair construction from text | Yes |
| `PromptedFilter` | LLM-based quality scoring & filtering | Yes |
| `GeneralFilter` | Rule-based deterministic filtering | No |
| KBC Trio (3 operators, always together in order) | File/URL → Markdown → chunks → clean text | Partial |
All generated pipelines follow the same standard structure:
```python
from dataflow.operators.core_text import PromptedGenerator, PromptedFilter
from dataflow.serving import APILLMServing_request
from dataflow.utils.storage import FileStorage

class MyPipeline:
    def __init__(self):
        self.storage = FileStorage(
            first_entry_file_name="./data/input.jsonl",  # User-provided path
            cache_path="./cache",
            file_name_prefix="step",
            cache_type="jsonl"
        )
        self.llm_serving = APILLMServing_request(
            api_url="https://api.openai.com/v1/chat/completions",
            model_name="gpt-4o",
            max_workers=10
        )
        # Operator instances ...

    def forward(self):
        # Sequential operator.run() calls, each with storage.step()
        ...

if __name__ == "__main__":
    pipeline = MyPipeline()
    pipeline.forward()
```

Key rules:
- `first_entry_file_name` is set to the exact user-provided JSONL path
- Each `operator.run()` call uses `storage=self.storage.step()` for checkpointing
- Fields propagate forward: a field must exist in the sample or be output by a prior step before it can be consumed
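The key rules can be sketched as a minimal `forward()` (the operator classes, key names, and `run()` keyword arguments here are illustrative assumptions, not the exact generated code):

```python
# Sketch only: operator classes and run() keyword arguments are assumed for
# illustration; real generated pipelines use the dataflow operator APIs.
class ExamplePipeline:
    def __init__(self, storage, generator, quality_filter):
        self.storage = storage                # e.g. a FileStorage instance
        self.generator = generator            # e.g. a PromptedGenerator
        self.quality_filter = quality_filter  # e.g. a PromptedFilter

    def forward(self):
        # Step 1: produce 'generated_description' from an existing sample field
        self.generator.run(
            storage=self.storage.step(),  # checkpoint before each operator
            input_key="product_name",
            output_key="generated_description",
        )
        # Step 2: consume the field produced in step 1 (field-propagation rule:
        # 'generated_description' now exists, so it may be an input here)
        self.quality_filter.run(
            storage=self.storage.step(),
            input_key="generated_description",
            output_key="quality_score",
        )
```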
Beyond the 6 core primitives, DataFlow provides additional operators. See the core_text section for the full operator reference.
Prerequisite: the new operator's skill definition already exists (with SKILL.md, examples/good.md, examples/bad.md, etc.).
Two steps are required:
Step 1. Create an operator directory with its skill definition under any appropriate location (e.g., core_text/<category>/, or a separate skill package):
```
<skill-directory>/<your-operator-name>/
├── SKILL.md          # API reference (constructor, run() signature, execution logic, constraints)
├── SKILL_zh.md       # Chinese translation (optional)
└── examples/
    ├── good.md       # Best-practice example
    └── bad.md        # Common mistakes
```
Step 2. Register the operator in SKILL.md's Extended Operator Reference section. Add a row to the corresponding category table (Generate / Filter / Refine / Eval) with the operator name, subdirectory path, and description. Without this entry, the pipeline generator will not know the operator exists.
If the operator is used frequently enough to warrant priority selection, promote it by modifying SKILL.md:
- Preferred Operator Strategy — Add to the core primitives list
- Operator Selection Priority Rule — Add a decision table row (when to use / when not to use)
- Operator Parameter Signature Rule — Add full constructor and `run()` signatures
- Correct Import Paths — Add the import path
- Input File Content Analysis Rule — Add input pattern matching if it handles a new data type
- Extended Operator Reference — Update or remove the entry from the extended table to avoid duplication with core primitives
- Examples — Add a complete example in `examples/` (recommended)
Extended operator reference for generating-dataflow-pipeline.
Per-operator API documentation for all text processing operators used by the pipeline generator. When the 6 core primitives in generating-dataflow-pipeline/SKILL.md don't cover your task, consult the detailed references here.
Generate (`core_text/generate/`)

- `prompted-generator` — Basic LLM generation
- `format-str-prompted-generator` — Template-based generation
- `chunked-prompted-generator` — Chunked text generation
- `embedding-generator` — Generate embeddings
- `retrieval-generator` — RAG generation
- `bench-answer-generator` — Generate benchmark answers
- `text2multihopqa-generator` — Multi-hop QA generation
- `random-domain-knowledge-row-generator` — Random domain knowledge generation

Filter (`core_text/filter/`)

- `prompted-filter` — LLM scoring and filtering
- `general-filter` — Rule-based numeric filtering
- `kcentergreedy-filter` — Diversity-based filtering

Refine (`core_text/refine/`)

- `prompted-refiner` — LLM-based text rewriting
- `pandas-operator` — Custom pandas operations

Eval (`core_text/eval/`)

- `prompted-evaluator` — LLM scoring
- `bench-dataset-evaluator` — Evaluate benchmark datasets
- `bench-dataset-evaluator-question` — Evaluate benchmark questions
- `text2qa-sample-evaluator` — Evaluate QA samples
- `unified-bench-dataset-evaluator` — Unified evaluation
Each operator folder contains:
- `SKILL.md` — English skill documentation describing use cases, usage, imports, parameters, and examples
- `SKILL_zh.md` — Chinese documentation
- `examples/good.md` — Correct usage with a simple single-operator pipeline, sample input and output
- `examples/bad.md` — Common mistakes
Video tutorial: Build DataFlow Operator
Production-grade scaffold skill for new DataFlow operators (generate/filter/refine/eval), generating implementation skeletons, CLI wrappers, and test files in one run.
Given an operator spec (package name, operator type, input/output keys, etc.), this skill:
- Validates the spec against constraint rules in `references/` to catch registration, contract, and naming issues early
- Generates a complete operator implementation skeleton (`generate`, `filter`, `refine`, or `eval`)
- Creates a standalone CLI module under `cli/` for batch jobs and integration testing without extra glue code
- Outputs a two-stage result: a `--dry-run` preview of the file plan, then actual file writes after confirmation
Clone this repository and copy the skill directory into your Claude Code skills folder:
```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/dataflow-operator-builder .claude/skills/dataflow-operator-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/dataflow-operator-builder ~/.claude/skills/dataflow-operator-builder
```

Claude Code discovers skills from `.claude/skills/<skill-name>/SKILL.md`.
Mode A (default): Interactive Interview
Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:
```
/dataflow-operator-builder
```
- Round 1: structural fields (package name, operator type, input/output keys, etc.)
- Round 2: implementation details (LLM usage, CLI module name, test prefix, etc.)
Each question includes a recommended option with a short rationale. The agent proceeds to generation after both rounds.
Mode B: Direct Spec (when you already have a spec)
If you already have an operator spec file (JSON), skip the interview and run directly:
```
/dataflow-operator-builder --spec path/to/spec.json --output-root path/to/repo
```
Example spec:
```json
{
  "package_name": "dataflow_ext_demo",
  "operator_type": "filter",
  "operator_class_name": "DemoQualityFilter",
  "operator_module_name": "demo_quality_filter",
  "input_key": "raw_text",
  "output_key": "is_valid",
  "uses_llm": false
}
```

Required: `package_name`, `operator_type`, `operator_class_name`, `operator_module_name`, `input_key`, `output_key`, `uses_llm`. Optional: `cli_module_name`, `test_file_prefix`, `overwrite_strategy`, `validation_level`.
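A malformed spec can be caught locally before running the skill. The sketch below mirrors the required-key list above (the skill performs its own, stricter validation):

```python
REQUIRED_KEYS = {
    "package_name", "operator_type", "operator_class_name",
    "operator_module_name", "input_key", "output_key", "uses_llm",
}
VALID_TYPES = {"generate", "filter", "refine", "eval"}

def check_operator_spec(spec):
    """Return a list of problems; an empty list means the spec looks usable."""
    problems = [f"missing key: {k}" for k in sorted(REQUIRED_KEYS - spec.keys())]
    if spec.get("operator_type") not in VALID_TYPES:
        problems.append(f"operator_type must be one of {sorted(VALID_TYPES)}")
    if "uses_llm" in spec and not isinstance(spec["uses_llm"], bool):
        problems.append("uses_llm must be a boolean")
    return problems
```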
The skill returns a two-stage result:
- Create/update plan — `--dry-run` lists all files to be generated without writing anything
- Operator skeleton — class definition, `run()` signature, and registration entry
- CLI module — executable batch script under `cli/`
- Test files — unit, registry, and smoke baselines ready for CI
- `--dry-run`: preview the create/update plan without modifying files
- `--overwrite {ask-each,overwrite-all,skip-existing}`: control overwrite behavior safely in existing repos
- `--validation-level {none,basic,full}`: choose how strict pre-write checks should be
| Artifact | Path | Description |
|---|---|---|
| Operator implementation | `<package>/<module_name>.py` | Class definition, `run()` signature, and registry entry |
| CLI module | `cli/<cli_module_name>.py` | Standalone batch script |
| Unit test | `tests/unit/test_<prefix>.py` | Basic unit test |
| Registry test | `tests/registry/test_<prefix>_registry.py` | Validates correct operator registration |
| Smoke test | `tests/smoke/test_<prefix>_smoke.py` | End-to-end minimal acceptance |
All generated operators follow the same standard structure:
```python
from dataflow.operators.base import BaseOperator
from dataflow.utils.storage import FileStorage

class DemoQualityFilter(BaseOperator):
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def run(self, storage: FileStorage) -> FileStorage:
        # Implement filtering logic here
        ...
        return storage

# Registry entry (auto-generated by the skill)
OPERATOR_REGISTRY.register("DemoQualityFilter", DemoQualityFilter)
```

Key rules:
- Must extend `BaseOperator` and implement `run()`
- `run()` accepts and returns a `FileStorage` instance for chain propagation
- Must be registered via `OPERATOR_REGISTRY` to be discoverable by pipelines
- The CLI module calls `run()` directly via `--input-file` / `--output-file`, independent of pipeline context
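For illustration, a filled-in `run()` might look like the sketch below, which assumes `read()`/`write()` operate on lists of dicts (the real `FileStorage` API may differ); `ListStorage` is a stand-in used only to keep the example self-contained:

```python
# Self-contained sketch: ListStorage is a stand-in with the read()/write()
# shape assumed above. The real dataflow FileStorage API may differ; treat
# this as an illustration of the chain-propagation rule.
class ListStorage:
    def __init__(self, rows):
        self._rows = rows
    def read(self):
        return list(self._rows)
    def write(self, rows):
        self._rows = list(rows)

class DemoQualityFilter:
    def __init__(self, threshold: float = 0.5):
        self.threshold = threshold

    def run(self, storage):
        rows = storage.read()
        # Keep rows whose quality_score clears the threshold
        kept = [r for r in rows if r.get("quality_score", 0.0) >= self.threshold]
        storage.write(kept)
        return storage  # returned so the next operator in the chain can consume it
```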
Video tutorial: DataFlow Prompt Template Builder
Skill for building/revising DataFlow prompt templates for existing operators, with type-aligned template selection and two-stage auditable outputs.
Given a target operator (operator name, constraints, input arguments, etc.), this skill:
- Checks operator compatibility and selects the right template style (e.g. `DIYPromptABC` or `FormatStrPrompt`) to ensure the template matches operator expectations
- Outputs a Stage 1 decision JSON: template strategy, argument mapping, output contract, and static acceptance checks — for code review and traceability
- Outputs a Stage 2 final deliverable: template/config content, integration snippet, and acceptance walkthrough — ready for developers and QA to act on
Clone this repository and copy the skill directory into your Claude Code skills folder:
```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level (this project only)
cp -r DataFlow-Skills/prompt-template-builder .claude/skills/prompt-template-builder

# Or personal-level (all your projects)
cp -r DataFlow-Skills/prompt-template-builder ~/.claude/skills/prompt-template-builder
```

Claude Code discovers skills from `.claude/skills/<skill-name>/SKILL.md`.
Mode A (default): Interactive Interview
Just invoke the skill — the agent collects everything it needs through two rounds of batch questions. No files to prepare:
```
/prompt-template-builder
```
- Round 1: structural layer (target scenario, operator name, output contract, constraints)
- Round 2: implementation layer (argument signatures, boundary samples, acceptance preferences)
Each question includes a recommended option with a short rationale. The agent then proceeds to the two-stage generation.
Mode B: Direct Spec (when you already have a spec)
If you already have a prompt spec file (JSON), skip the interview and run directly:
```
/prompt-template-builder --spec path/to/prompt_spec.json
```
Example spec:
```json
{
  "Target": "Generate concise e-commerce selling points",
  "OP_NAME": "PromptedGenerator",
  "Constraints": "Professional tone; <= 80 Chinese chars",
  "Arguments": ["product_name", "category"]
}
```

Required: `Target`, `OP_NAME`. Recommended: `Constraints`, `Expected Output`, `Arguments`, `Sample Cases`, `Tone/Style`, `Validation Focus`.
The skill returns a two-stage result:
- Stage 1 (decision JSON) — template strategy, argument mapping, output contract, and static checks (including `prompt_template_type_aligned`)
- Stage 2 (final deliverable) — template/config content, integration code snippet, and acceptance walkthrough
| Template Type | Compatible Operators | Description |
|---|---|---|
| `DIYPromptABC` | `PromptedGenerator`, `PromptedFilter`, `PromptedRefiner`, etc. | Fully custom system/user prompt with field interpolation |
| `FormatStrPrompt` | `FormatStrPromptedGenerator` | Python f-string style multi-field template |
```json
{
  "prompt_template_type_aligned": "DIYPromptABC",
  "strategy": "Single-field generation using system+user two-layer prompt",
  "argument_mapping": {
    "product_name": "product name",
    "category": "product category"
  },
  "output_contract": "Professional tone, <= 80 Chinese characters, ending with a selling-point phrase",
  "static_checks": [
    "No extra placeholders",
    "Tone matches professional definition",
    "Character limit is verifiable in Stage 2 walkthrough"
  ]
}
```

Key rules:
- `prompt_template_type_aligned` must match the target operator's contract — types cannot be mixed
- Every item in `static_checks` must be individually verified in the Stage 2 acceptance walkthrough
- Argument mapping must fully cover all fields listed in `Arguments` — no omissions allowed
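The coverage rule is easy to verify mechanically. A sketch, assuming the spec and Stage 1 decision shapes shown above:

```python
def check_argument_coverage(spec_arguments, decision):
    """Verify the Stage 1 argument_mapping covers every field in Arguments.

    spec_arguments is the spec's Arguments list; decision is the Stage 1
    decision JSON loaded as a dict. Returns the set of missing fields
    (an empty set means full coverage).
    """
    mapped = set(decision.get("argument_mapping", {}))
    return set(spec_arguments) - mapped
```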
A DataFlow developer expert skill that loads full architecture knowledge and routes to seven specialized workflows — from creating operators and pipelines to diagnosing errors, reviewing code, and syncing the knowledge base when the upstream repo changes.
When you invoke /dataflow-dev inside a DataFlow repository, the skill:
- Loads `context/knowledge_base.md` — architecture, API reference, all registered operators
- Loads `context/dev_notes.md` — coding standards, best practices, LLM response templates
- Loads `diagnostics/known_issues.md` — structured symptom → root cause → fix database
- Probes the local repo state (`git branch`, `git log`, `git diff`)
- Reports a 1–3 line context summary, then routes to the appropriate workflow
```bash
git clone https://github.com/haolpku/DataFlow-Skills.git

# Project-level
cp -r DataFlow-Skills/dataflow-dev .claude/skills/dataflow-dev

# Or personal-level
cp -r DataFlow-Skills/dataflow-dev ~/.claude/skills/dataflow-dev
```

```bash
cd /path/to/DataFlow   # must be the repo root
claude                 # launch Claude Code
```

```
/dataflow-dev
I need a new filter operator that removes texts shorter than N words.
```
The skill detects your intent, checks for duplicate operators, asks for spec details in a single round, then generates fully compliant code.
| Intent keywords | Workflow |
|---|---|
| new operator / create operator / 新建算子 | Operator creation (duplicate check → spec confirmation → code generation → registration reminder) |
| new pipeline / 新建 Pipeline | Pipeline creation (operator selection → code generation with storage.step() pattern) |
| new prompt / 新建 Prompt | Prompt creation (PromptABC or DIYPromptABC, registry decorator, @prompt_restrict placement) |
| error / KeyError / AttributeError / Warning / 报错 | Diagnosis (match known_issues.md → root cause + fix code) |
| review / check / 规范审查 | Code review (operator and pipeline checklists, 14-point validation) |
| sync / check updates / 仓库有新算子 | Knowledge base update (detect new operator files, compare against knowledge_base.md, emit update steps) |
Every generated operator is validated against these hard rules before output:
- ✓ Inherits `OperatorABC` and calls `super().__init__()`
- ✓ `@OPERATOR_REGISTRY.register()` decorator on the class
- ✓ `run()` parameter naming: `input_*` / `output_*` / `storage`
- ✓ `run()` returns the list of output key names
- ✓ `storage.read()` and `storage.write()` both present
- ✓ LLM-driven operators: member variable named `self.llm_serving`
- ✓ Full per-row try/except with sensible defaults on LLM failure
- ✓ CoT model outputs: `<think>` tags stripped where needed
- ✓ `@staticmethod get_desc(lang: str = "zh")` supporting zh/en
- ✓ `__init__.py` TYPE_CHECKING block registration
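As an illustration of the checklist (`OperatorABC` and `OPERATOR_REGISTRY` are minimal stand-ins here so the sketch is self-contained, not DataFlow's actual classes), a conforming operator might look like:

```python
# Stand-ins for the framework pieces, kept minimal for illustration.
class OperatorABC:
    def __init__(self):
        pass

class _Registry:
    def __init__(self):
        self._ops = {}
    def register(self):
        def deco(cls):
            self._ops[cls.__name__] = cls
            return cls
        return deco

OPERATOR_REGISTRY = _Registry()

@OPERATOR_REGISTRY.register()          # rule: registry decorator on the class
class MinWordFilter(OperatorABC):
    def __init__(self, min_words: int = 5):
        super().__init__()             # rule: call super().__init__()
        self.min_words = min_words

    @staticmethod
    def get_desc(lang: str = "zh"):    # rule: zh/en description
        return "按最小词数过滤文本" if lang == "zh" else "Filter texts below a minimum word count"

    def run(self, storage, input_key: str = "raw_text", output_key: str = "keep_flag"):
        rows = storage.read()          # rule: storage.read() present
        for row in rows:
            row[output_key] = len(str(row.get(input_key, "")).split()) >= self.min_words
        storage.write(rows)            # rule: storage.write() present
        return [output_key]            # rule: run() returns output key names
```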
| Error keyword | Issue |
|---|---|
| `Unexpected key 'xxx' in operator` | #001 — config param naming (warning only, not an error) |
| `No object named 'Xxx' found in 'operators' registry` | #002 — missing `__init__.py` TYPE_CHECKING entry |
| `Key Matching Error` / does not match any output keys | #003 — pipeline key mismatch |
| `You must call storage.step() before` | #004 — missing `storage.step()` call |
| `DummyStorage` + `AttributeError` | #005 — DummyStorage API limitations |
| `AttributeError: 'NoneType'` + `re.split` | #006 — capturing group in `re.split()` pattern |
| `@prompt_restrict` not taking effect | #007 — decorator placement must be adjacent to class definition |

Full root-cause analysis and fix examples are in `diagnostics/known_issues.md`.
When the upstream repo (OpenDCAI/DataFlow) merges new operator PRs:
```bash
# Check upstream merged PRs
gh pr list --repo OpenDCAI/DataFlow --state merged --limit 20

# Detect newly added operator files in local repo (last 30 commits)
git log --oneline --diff-filter=A -- 'dataflow/operators/**/*.py' | head -30

# Or run the bundled helper script
bash .claude/skills/dataflow-dev/scripts/check_updates.sh /path/to/DataFlow
```

The script outputs: new operator files, all registered operator names, operators missing from knowledge_base.md, and recent upstream PRs/Issues — with a step-by-step update guide.
```
dataflow-dev/
├── SKILL.md                  # Skill definition & sub-command routing
├── context/
│   ├── knowledge_base.md     # Architecture, API reference, all operators (read-only)
│   └── dev_notes.md          # Coding standards, best practices (appendable)
├── diagnostics/
│   └── known_issues.md       # Structured issue database #001–#008
├── templates/
│   ├── operator_template.py  # Operator scaffold
│   ├── pipeline_template.py  # Pipeline scaffold
│   └── prompt_template.py    # Prompt scaffold
└── scripts/
    └── check_updates.sh      # Repo change detection & knowledge base diff
```
All knowledge in this skill is aligned to OpenDCAI/DataFlow (main branch, v1.0.10).