Add Text2Gremlin corpus generation and augmentation pipeline by LRriver · Pull Request #352 · apache/hugegraph-ai

LRriver · 2026-05-31T10:52:46Z

Summary

This PR adds a new text2gremlin/AST_Text2Gremlin module for Text2Gremlin data generation, LLM-based data augmentation, syntax validation, dataset merging, and DPO data construction.

The goal is to provide a reproducible data pipeline for generating high-quality Gremlin query training data from structured graph schemas and templates. The module starts from schema-aware Gremlin template generation, then uses LLMs to produce natural-language instructions, migrates samples across graph scenarios, validates generated Gremlin syntax, merges SFT data, and optionally builds DPO preference data by comparing correct Gremlin queries with Groovy-style negative samples.

The generated Text2Gremlin dataset has also been published on Hugging Face:

https://huggingface.co/datasets/Lriver/Text2Gremlin

Motivation

Text2Gremlin needs training data that is both syntactically valid and diverse enough to cover common graph query patterns. Manually writing these samples is expensive and difficult to scale, especially when the data needs to cover different graph domains, operation types, traversal structures, and natural-language styles.

This module provides a structured generation and augmentation pipeline:

Generate Gremlin queries from schema-aware templates.
Validate query syntax with an ANTLR-based Gremlin parser.
Translate Gremlin queries into multiple natural-language instruction styles.
Migrate existing samples to other graph scenarios.
Merge and deduplicate SFT-style training data.
Generate DPO preference data for Groovy-vs-Gremlin alignment.

Project structure

This PR adds the following Text2Gremlin module structure:

text2gremlin/AST_Text2Gremlin/
├── README.md
├── README_zh.md
├── requirements.txt
├── config_example.json
├── generate_corpus.py
├── analyze_syntax.py
├── run_llm_pipeline.py
├── gremlin_templates.csv
├── base/
│   ├── Config.py
│   ├── Schema.py
│   ├── GremlinBase.py
│   ├── GremlinExpr.py
│   ├── GremlinParse.py
│   ├── GremlinTransVisitor.py
│   ├── TraversalGenerator.py
│   ├── CombinationController.py
│   ├── generator.py
│   ├── combination_control_config.json
│   ├── gremlin/
│   │   ├── Gremlin.g4
│   │   ├── GremlinLexer.py
│   │   ├── GremlinParser.py
│   │   ├── GremlinVisitor.py
│   │   └── GremlinListener.py
│   └── template/
│       ├── schema_dict.txt
│       └── syn_dict.txt
├── db_data/
│   ├── schema/
│   │   └── movie_schema.json
│   ├── movie/raw_data/
│   │   ├── vertex_*.csv
│   │   └── edge_*.csv
│   └── reference/
│       └── schemas_data.json
├── llm_augment/
│   ├── generalize_llm.py
│   ├── migrate_scenario.py
│   ├── merge_dataset.py
│   └── generate_dpo_data.py
└── tests/
    ├── test_analyze_syntax.py
    ├── test_generate_corpus.py
    ├── test_gremlin_base.py
    ├── test_generalize_llm.py
    ├── test_migrate_scenario.py
    ├── test_merge_dataset.py
    └── test_run_llm_pipeline.py

The module is organized around four layers:

base/: schema-aware Gremlin AST parsing, traversal generation, recipe representation, grammar integration, and generation controls.
db_data/: seed graph schema and movie-domain raw data used by the generator.
llm_augment/: LLM-based translation, scenario migration, dataset merge, and DPO data construction.
tests/: focused tests for generation, augmentation, merge behavior, pipeline argument forwarding, and syntax analysis.

What changed

1. AST-based Gremlin corpus generation

This PR adds a schema-aware Gremlin generation framework under text2gremlin/AST_Text2Gremlin/base.

Key pieces include:

ANTLR Gremlin grammar/parser/visitor integration.
Recipe-style Gremlin representation for query construction.
Step, predicate, anonymous traversal, connector, and terminal handling.
Recursive traversal generation with schema constraints.
Connectivity validation for generated paths.
Data value filling from schema and raw graph data.
Deduplication and syntax filtering for generated queries.
Combination control through base/combination_control_config.json.
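
As a rough illustration of the connectivity validation mentioned above, the sketch below checks that a sequence of edge steps can actually be walked from a starting vertex label under a schema. The `EDGE_SCHEMA` layout, the edge directions, and the function name `is_connected_path` are assumptions made for this example; the module's real schema classes may look different.

```python
from typing import Dict, List, Tuple

# Hypothetical schema: edge label -> (source vertex label, target vertex label).
# Labels echo the movie scenario, but this layout is an assumption.
EDGE_SCHEMA: Dict[str, Tuple[str, str]] = {
    "acted_in": ("person", "movie"),
    "directed": ("person", "movie"),
    "rate": ("user", "movie"),
}


def is_connected_path(start_label: str, steps: List[Tuple[str, str]]) -> bool:
    """Return True if every (direction, edge_label) step is valid from the current label.

    direction is "out" (source -> target) or "in" (target -> source).
    """
    current = start_label
    for direction, edge_label in steps:
        if edge_label not in EDGE_SCHEMA:
            return False
        source, target = EDGE_SCHEMA[edge_label]
        if direction == "out" and current == source:
            current = target
        elif direction == "in" and current == target:
            current = source
        else:
            return False
    return True
```

Under these assumptions, a path such as person, out `acted_in`, in `rate` is accepted, while user, out `acted_in` is rejected before any query text is emitted.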

The main entry point is:

python generate_corpus.py

The generation process can use:

db_data/schema/movie_schema.json
db_data/movie/raw_data/*.csv
gremlin_templates.csv
base/template/schema_dict.txt
base/template/syn_dict.txt

2. Movie-domain seed schema and data

This PR includes a movie graph scenario used as the initial schema/data source for Text2Gremlin generation.

Added files include:

db_data/schema/movie_schema.json
movie vertex CSV files, such as vertex_movie.csv, vertex_person.csv, vertex_user.csv
movie edge CSV files, such as edge_acted_in.csv, edge_directed.csv, edge_rate.csv
db_data/reference/schemas_data.json

These files provide the schema, labels, properties, and example values needed by the generation pipeline.

3. LLM-based natural-language augmentation

This PR adds llm_augment/generalize_llm.py, which converts generated Gremlin queries into natural-language Text2Gremlin samples.

The augmentation supports multiple instruction styles, including:

direct question style
command style
natural conversational style
domain-aware wording

The output is designed for SFT-style Text2Gremlin training data, where each sample pairs a user instruction with a valid Gremlin query.
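
To make the style-based translation concrete, here is a minimal sketch of how one prompt per instruction style could be composed and how the result could be stored as an SFT pair. The style keys, the hint wording, and the `instruction`/`output` field names are assumptions for illustration; the actual prompts and record schema in generalize_llm.py may differ.

```python
from typing import Dict, List

# Hypothetical style hints; the real prompts may be worded differently.
STYLE_HINTS: Dict[str, str] = {
    "question": "Phrase the request as a direct question.",
    "command": "Phrase the request as an imperative command.",
    "conversational": "Phrase the request as casual conversation.",
    "domain": "Use vocabulary typical of the graph's domain.",
}


def build_translation_prompt(gremlin: str, style: str) -> str:
    """Compose an LLM prompt asking for one instruction in the given style."""
    hint = STYLE_HINTS[style]
    return (
        "Write a natural-language instruction that the following Gremlin query answers.\n"
        f"{hint}\n"
        f"Gremlin: {gremlin}"
    )


def prompts_for_all_styles(gremlin: str) -> List[str]:
    """Build one prompt per style for the same query."""
    return [build_translation_prompt(gremlin, style) for style in STYLE_HINTS]


def to_sft_sample(instruction: str, gremlin: str) -> Dict[str, str]:
    """Pair an instruction with its query (field names are assumed)."""
    return {"instruction": instruction.strip(), "output": gremlin.strip()}
```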

4. Scenario migration augmentation

This PR adds llm_augment/migrate_scenario.py, which migrates generated samples from one graph scenario to another.

The default migration mode is now:

same_operation

In this mode, the model is asked to migrate a source query into the target scenario while preserving the original operation type, such as read, create, update, or delete. It can generate multiple target-scenario samples for each source sample.

The previous broader behavior is still available through:

mixed_operations

In this mode, the model may generate target-scenario samples across multiple operation types. This is useful when users want more diverse CRUD-style augmentation, but it is no longer the default because not every source query is suitable for migration into every operation type.

The number of same-operation migration samples is configurable and defaults to 3.
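
One way to picture same_operation filtering is the sketch below, which labels each query with a CRUD operation and keeps only migrated candidates that match the source. The regex heuristic, the function names, and the `output` field are assumptions for illustration; only the default of 3 comes from the description above.

```python
import re
from typing import Dict, List

DEFAULT_SAME_OPERATION_SAMPLE_COUNT = 3  # default stated in the PR description


def classify_operation(gremlin: str) -> str:
    """Heuristically label a query as create, delete, update, or read."""
    if re.search(r"\.(addV|addE)\(", gremlin):
        return "create"  # checked first, since creates often chain property()
    if re.search(r"\.drop\(", gremlin):
        return "delete"
    if re.search(r"\.property\(", gremlin):
        return "update"
    return "read"


def keep_same_operation(
    source_query: str,
    candidates: List[Dict[str, str]],
    limit: int = DEFAULT_SAME_OPERATION_SAMPLE_COUNT,
) -> List[Dict[str, str]]:
    """Keep at most `limit` candidates whose operation matches the source query."""
    expected = classify_operation(source_query)
    kept = [c for c in candidates if classify_operation(c["output"]) == expected]
    return kept[:limit]
```

A mixed_operations variant would simply skip the operation check and keep candidates of any type.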

5. Dataset merge and statistics

This PR adds llm_augment/merge_dataset.py for merging augmented Text2Gremlin data into final SFT-style outputs.

The merge step supports:

combining direct translation data and scenario migration data
filtering invalid or incomplete samples
deduplicating samples
collecting domain and CRUD operation statistics
exporting train/validation style data files

This keeps generated data preparation separate from model training, making it easier to inspect and reuse the dataset.
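
The merge behavior listed above can be sketched roughly as follows. The record fields (`instruction`, `output`, `domain`) and the case-insensitive deduplication key are assumptions for this example rather than the documented behavior of merge_dataset.py.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple

Sample = Dict[str, str]


def merge_samples(*sources: Iterable[Sample]) -> Tuple[List[Sample], Counter]:
    """Merge sample lists, drop incomplete or duplicate entries, and count domains."""
    merged: List[Sample] = []
    seen = set()
    domain_counts: Counter = Counter()
    for source in sources:
        for sample in source:
            instruction = (sample.get("instruction") or "").strip()
            query = (sample.get("output") or "").strip()
            if not instruction or not query:
                continue  # filter incomplete samples
            key = (instruction.lower(), query)
            if key in seen:
                continue  # deduplicate on (instruction, query)
            seen.add(key)
            domain = sample.get("domain") or "unknown"
            merged.append({"instruction": instruction, "output": query, "domain": domain})
            domain_counts[domain] += 1
    return merged, domain_counts
```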

6. DPO data generation

This PR adds llm_augment/generate_dpo_data.py for generating DPO preference data.

The DPO generation compares valid Gremlin answers with lower-quality Groovy-style or non-preferred outputs. This is intended to help align models toward producing Gremlin queries instead of code-like alternatives.
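
For readers unfamiliar with preference data, a record could look like the sketch below. The `prompt`/`chosen`/`rejected` keys follow a common DPO convention and are an assumption here, as is the shape of the Groovy-style negative; the published dataset may use different field names.

```python
from typing import Dict


def build_dpo_record(instruction: str, gremlin: str, groovy_negative: str) -> Dict[str, str]:
    """Pair a preferred Gremlin answer with a non-preferred Groovy-style answer."""
    if gremlin.strip() == groovy_negative.strip():
        raise ValueError("chosen and rejected answers must differ")
    return {
        "prompt": instruction,
        "chosen": gremlin,
        "rejected": groovy_negative,
    }


# Illustrative values only; not taken from the dataset.
record = build_dpo_record(
    "How many movies are in the graph?",
    "g.V().hasLabel('movie').count()",
    "def movies = g.V().hasLabel('movie').toList(); movies.size()",
)
```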

7. End-to-end pipeline runner

This PR adds run_llm_pipeline.py as a staged pipeline entry point.

The pipeline supports running stages such as:

translate -> migrate -> merge -> dpo

This makes it possible to run only the required part of the pipeline during development or data refresh.
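
Selecting a contiguous slice of stages can be sketched as below. The stage names come from the description above; the `select_stages` helper and its `start`/`end` parameters are hypothetical and do not describe the actual command-line flags of run_llm_pipeline.py.

```python
from typing import List

# Stage order as listed in the PR description.
STAGES: List[str] = ["translate", "migrate", "merge", "dpo"]


def select_stages(start: str = "translate", end: str = "dpo") -> List[str]:
    """Return the stages from `start` to `end`, inclusive, in pipeline order."""
    if start not in STAGES or end not in STAGES:
        raise ValueError(f"unknown stage; expected one of {STAGES}")
    first, last = STAGES.index(start), STAGES.index(end)
    if first > last:
        raise ValueError(f"'{start}' comes after '{end}' in the pipeline")
    return STAGES[first : last + 1]
```

For example, a data refresh that already has translated output could request only `migrate` through `merge`.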

8. Syntax analysis tooling

This PR adds analyze_syntax.py for analyzing generated Gremlin query distributions.

The analysis can report:

Gremlin step frequency
predicate usage
traversal pattern distribution
operation-type distribution
syntax coverage statistics

This is useful for checking whether generated data is overly concentrated on a small set of Gremlin patterns.
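
A simplified view of the step-frequency report is sketched below. It uses a regular expression instead of the ANTLR parser, which is an assumption made to keep the example self-contained: it only counts steps written as `.name(`, so it misses bare anonymous steps such as `out('x')` inside `where(...)` and can miscount text inside string literals.

```python
import re
from collections import Counter
from typing import Iterable

# Matches ".stepName(" occurrences; a rough stand-in for a real parse.
STEP_PATTERN = re.compile(r"\.([A-Za-z_][A-Za-z0-9_]*)\(")


def step_frequency(queries: Iterable[str]) -> Counter:
    """Count how often each dotted Gremlin step appears across the queries."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(STEP_PATTERN.findall(query))
    return counts


freq = step_frequency(
    [
        "g.V().hasLabel('movie').out('acted_in').count()",
        "g.V().count()",
    ]
)
```

A distribution dominated by a handful of steps would suggest that the templates or combination controls need broadening.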

9. Configuration and examples

This PR adds:

config_example.json
requirements.txt
English README
Chinese README

The config example documents model API settings and generation-related options. Sensitive local config files are excluded by .gitignore.

The README files describe:

module purpose
installation
configuration
corpus generation
LLM augmentation
scenario migration modes
dataset merge
DPO generation
Hugging Face dataset location
expected output files

10. Tests

This PR adds focused pytest coverage under text2gremlin/AST_Text2Gremlin/tests.

The tests cover:

Gremlin syntax analysis
corpus generation behavior
Gremlin base parsing/generation helpers
LLM generalization helpers
scenario migration modes
same-operation filtering
mixed-operation mode behavior
merge dataset behavior
pipeline argument forwarding
config loading
dictionary fallback and dictionary loading behavior

Dataset

The generated dataset is available here:

https://huggingface.co/datasets/Lriver/Text2Gremlin

The dataset is intentionally hosted outside this repository to avoid committing large generated artifacts. This repository contains the generation code, schema/data seeds, configuration examples, and documentation needed to reproduce or extend the dataset.

Compatibility and scope

This PR only adds the new Text2Gremlin generation module under:

text2gremlin/AST_Text2Gremlin

It does not change the existing HugeGraph LLM runtime APIs, HugeGraph Python client APIs, or other existing modules.

Large generated output artifacts are not committed. Local model API configuration is expected to be stored in a local config file and is ignored by git.

Validation

The same head branch has already been validated through the existing PR workflow in hugegraph/hugegraph-ai#52, including Ruff checks, dependency/license checks, module CI checks, and CodeRabbit review.

Local validation used during preparation:

uv run --with-requirements text2gremlin/AST_Text2Gremlin/requirements.txt pytest text2gremlin/AST_Text2Gremlin/tests -q
uv run ruff format --check .
uv run ruff check .
git diff --check

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a new text2gremlin/AST_Text2Gremlin subproject for generating Text2Gremlin training data via AST-based template generalization and multi-stage LLM augmentation, along with project-level lint/license configuration updates to accommodate it.

Changes:

Adds the AST_Text2Gremlin pipeline (AST generalization, LLM translation, scenario migration, dataset merging, DPO data generation) with supporting schema/data, templates, and dictionaries.
Adds pytest-based unit tests for pipeline components and adds a unified run_llm_pipeline.py orchestrator.
Updates pyproject.toml (ruff exclusions/per-file ignores) and .licenserc.yaml (exclude generated .tokens/.interp/.csv files).

Reviewed changes

Copilot reviewed 51 out of 61 changed files in this pull request and generated 5 comments.

File	Description
text2gremlin/AST_Text2Gremlin/tests/*.py	New pytest unit tests covering pipeline stages and helpers
text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py	Unified CLI orchestrator for the 4 LLM stages
text2gremlin/AST_Text2Gremlin/generate_corpus.py	CLI entry to AST-based corpus generation
text2gremlin/AST_Text2Gremlin/analyze_syntax.py	Gremlin syntax distribution analyzer/report generator
text2gremlin/AST_Text2Gremlin/llm_augment/*	Merge dataset stage and package init for LLM augmentation
text2gremlin/AST_Text2Gremlin/base/*.py	Core engine: Config, Schema, GremlinBase, generator, combination controller, expr/parse types
text2gremlin/AST_Text2Gremlin/base/gremlin/*	ANTLR-generated Gremlin tokens/init
text2gremlin/AST_Text2Gremlin/base/template/*.txt	Schema/synonym dictionaries
text2gremlin/AST_Text2Gremlin/base/combination_control_config.json	Combination explosion control config
text2gremlin/AST_Text2Gremlin/db_data/**	Sample movie-domain schema and CSV data
text2gremlin/AST_Text2Gremlin/gremlin_templates.csv	251 Gremlin query templates
text2gremlin/AST_Text2Gremlin/config_example.json	Example config for the pipeline
text2gremlin/AST_Text2Gremlin/requirements.txt	Subproject Python dependencies
text2gremlin/AST_Text2Gremlin/README.md / README_zh.md	English/Chinese documentation
text2gremlin/AST_Text2Gremlin/.gitignore	Ignore config.json/output/pycache
pyproject.toml	Ruff exclude/per-file ignores for the new subproject
.licenserc.yaml	Exclude generated `.interp`/`.tokens`/`.csv` from license header check

Comments suppressed due to low confidence (5)

text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1

Selecting the 'latest' file by lexicographic sort of glob results is fragile. While the current YYYYMMDD_HHMMSS timestamp naming happens to sort correctly, any auxiliary file matching the pattern (e.g., llm_translated_backup.json) or a future filename change could silently pick the wrong file. Consider sorting by os.path.getmtime or parsing the timestamp explicitly to make the 'latest' semantics robust.

text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1

subprocess.run is called without check=False explicitly and without input/stream handling; while the return code is propagated, consider passing check=False explicitly (ruff/bandit S603/PLW1510 style) for clarity, and document that stdout/stderr are inherited from the parent process so users understand interleaving of stage logs.

text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1

Forwarding --migration-mode/--same-operation-sample-count only to the first selected stage is surprising: a user running the full pipeline who passes --migration-mode same_operation will see those args sent to translate (which will fail on unknown args) instead of migrate. Consider routing extras based on which stage owns each flag, or document this constraint prominently and validate that the first stage actually accepts the supplied args.

text2gremlin/AST_Text2Gremlin/tests/test_gremlin_base.py:1

Asserting that a misspelled key perosn_organization is absent is essentially testing a typo that may or may not have ever existed; this assertion adds no useful regression coverage and will silently pass forever. Remove it or replace it with a positive assertion about an expected key, to make the test's intent clear.

text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1

User-facing error messages on failure paths are written to stdout via print. Errors should go to stderr (e.g., print(..., file=sys.stderr)) so that callers/CI can distinguish normal output from error diagnostics; the test in test_merge_dataset.py currently asserts against result.stdout, which would also need updating.
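
The mtime-based lookup suggested in the first suppressed comment could be sketched as follows. The helper name `find_latest_file` and the example glob pattern are assumptions for illustration; this is not the code that landed in the branch.

```python
import glob
import os
from typing import Optional


def find_latest_file(directory: str, pattern: str) -> Optional[str]:
    """Return the most recently modified file matching `pattern`, or None."""
    candidates = glob.glob(os.path.join(directory, pattern))
    if not candidates:
        return None
    # Modification time decides what "latest" means, not the filename order.
    return max(candidates, key=os.path.getmtime)
```

With this approach a file such as llm_translated_backup.json is picked only if it really is the newest match, whereas a lexicographic sort would always rank it after timestamped names.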

LRriver · 2026-06-01T09:33:42Z

+        # Keep the original casing; do not convert it
+        for index, (key, value) in enumerate(templates_data.items()):
+            self.token_dict[key] = index
+            self.template.append(value)


Fixed in 6bbf946 by replacing the duplicate token template entries with a canonical template map plus aliases.

LRriver · 2026-06-01T09:33:45Z

+        # If no count is specified, randomly pick 2 to 5
+        if count is None:
+            count = random.randint(2, 5)


Fixed in 6bbf946 by extracting the default sample range into DEFAULT_SAMPLE_MIN and DEFAULT_SAMPLE_MAX.
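
A minimal sketch of what that extraction might look like is shown below. The constant names are taken from the reply above and their values from the quoted diff; the `resolve_sample_count` helper is hypothetical.

```python
import random
from typing import Optional

DEFAULT_SAMPLE_MIN = 2
DEFAULT_SAMPLE_MAX = 5


def resolve_sample_count(count: Optional[int] = None) -> int:
    """Use the caller's count, or pick a random default within the named range."""
    if count is None:
        return random.randint(DEFAULT_SAMPLE_MIN, DEFAULT_SAMPLE_MAX)
    return count
```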

LRriver · 2026-06-01T09:33:48Z

+        if stats["generated_count"] > 5000:
+            stats["warning"] = f"This template's Recipe is complex, so a large number of queries were generated ({stats['generated_count']})"


Fixed in 6bbf946 by extracting the large-generation threshold into LARGE_GENERATION_THRESHOLD.

LRriver · 2026-06-01T09:33:51Z

+if __name__ == "__main__":
+    # Temporary config object created for testing
+    class MockConfig:
+        def get_schema_dict_path(self):
+            return "./template/schema_dict.txt"
+
+        def get_syn_dict_path(self):
+            return "./template/syn_dict.txt"


Fixed in 6bbf946 by making MockConfig return list paths, matching the production config behavior.

LRriver · 2026-06-01T09:33:53Z

+        # chain_thresholds only needs short, medium, long (ultra is defined implicitly by the else branch)
+        for category in ("short", "medium", "long"):
+            if category not in self.chain_thresholds:
+                raise ValueError(f"chain_thresholds is missing the '{category}' setting")


Fixed in 6bbf946 by extracting the chain category tuples and reusing them in validation and category lookup.
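
The shared-tuple approach could look roughly like the sketch below. The constant and function names are hypothetical, and treating each threshold as an inclusive upper bound on chain length is an assumption; only the short/medium/long keys and the implicit ultra fallback come from the quoted diff.

```python
from typing import Dict

# Categories that must be configured; "ultra" is the implicit fallback.
CHAIN_THRESHOLD_CATEGORIES = ("short", "medium", "long")
ULTRA_CATEGORY = "ultra"


def validate_chain_thresholds(thresholds: Dict[str, int]) -> None:
    """Raise if any required category is missing from the config."""
    for category in CHAIN_THRESHOLD_CATEGORIES:
        if category not in thresholds:
            raise ValueError(f"chain_thresholds is missing the '{category}' setting")


def chain_category(length: int, thresholds: Dict[str, int]) -> str:
    """Map a chain length to the first category whose threshold it fits under."""
    for category in CHAIN_THRESHOLD_CATEGORIES:
        if length <= thresholds[category]:
            return category
    return ULTRA_CATEGORY
```

Reusing one tuple in both functions keeps validation and lookup from drifting apart when a category is added or renamed.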

LRriver added 30 commits September 30, 2025 20:52

feat: add configuration management module with dictionary paths and generation parameters

52f2e01

feat: add Gremlin parsing base classes with Step, Traversal core data structures

fadaaf7

feat: add Gremlin expression processing module with predicates and connectors support

b775d29

feat: add graph database schema management with vertex/edge labels and properties

f0588a1

feat: add Gremlin base component library with synonym replacement and data instances

5f3b039

feat: add ANTLR syntax tree visitor with Gremlin query to Recipe parsing and call/with support

822272f

feat: add recursive backtracking traversal generator for diverse query variants from Recipe

441b32c

feat: add main corpus generator with batch processing, global deduplication and error handling

2de2096

config: add global configuration file with generation parameters and path settings

c92f09a

data: add cypher2gremlin dataset with 3514 real query templates

25ca990

docs: add project README with quick start guide and usage instructions

25a2876

feat: add ANTLR-generated Gremlin grammar package with lexer, parser and visitor classes

541aa20

data: add schema and graph data

eb7eb01

feat: add template directory with schema dictionary and synonym files

f0579e8

test: add gremlin statement generalization generation test module

9c13457

test: add generator unit tests for corpus generation validation

b14ffb3

Add graph2gremlin.py: Initial template-based Gremlin data generation with correctness guarantee and preliminary question generalization

7cd8427

Add gremlin_checker.py: Syntax checking using Antlr4

4da021c

Add llm_handler.py: LLM interaction model for query generalization and translation

bc10fe2

Add qa_generalize.py: Seed data generalization using gremlin_checker and llm_handler

6ea48d5

Add instruct_convert.py: Instruction format conversion and train/test set division

78f8c2a

Add da_data: Schema and graph data

b7f3f4a

Add data/seed_data: Seed data directory

332b879

Add data/vertical_training_sets: Vertical domain scenario generalized data directory

8a94bad

Add books on Gremlin syntax knowledge to process data.

676d28c

Add a dataset of Gremlin QA pairs synthesized based on LLM.

90f346f

Add README.md

4120356

Compatible with OpenAI format

67b523a

Increase Gremlin syntax vocabulary that supports generalization, and add data control policies.

bccc147

modify README.md

44592b4

LRriver and others added 22 commits March 10, 2026 02:34

data: add LLM multi-style translated corpus output

d7c261f

data: add extracted text2gremlin pairs for migration input

f85c6e4

data: add scenario-migrated corpus across 20 domains

8ac9671

data: add merged text2gremlin dataset with CRUD stats

13764f9

data: add DPO preference training data (Groovy vs Gremlin)

ae38958

data: add Gremlin syntax distribution analysis report

0ebdb15

docs: update README with LLM augmentation pipeline and project structure

a6f7740

fix: slim down requirements.txt to direct dependencies only, removing vulnerable pillow

180719e

style: apply ruff format to all text2gremlin Python files

1c967f3

data:Unified Diverse Translation Tone Variable Names

eb3f51c

feat(dpo): add multi-domain DPO generation with Groovy syntax skip for variable references

7e28c00

docs: update README with final DPO data stats and add English version

d6bcfaf

data(dpo): add merged preference data with 8920 samples across 21 domains

2e10379

chore: remove outdated DPO data file

d9da0b8

docs:modify README

8019f8a

Merge branch 'apache:main' into text2gremlin

1ff125a

Fix text2gremlin ruff checks

b534331

Improve scenario migration operation modes

5b2bf1f

Address text2gremlin review feedback

118aaa4

Address remaining text2gremlin review issues

0887abd

Clean up text2gremlin dictionaries

519ef99

Document Text2Gremlin Hugging Face dataset

e9bb59c

Copilot AI review requested due to automatic review settings May 31, 2026 10:52

dosubot Bot added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels May 31, 2026

Copilot AI reviewed May 31, 2026

github-advanced-security AI found potential problems May 31, 2026

LRriver added 3 commits May 31, 2026 19:02

Address text2gremlin CodeQL and review feedback

6bbf946

Document text2gremlin pipeline subprocess output

95b5adb

Use mtime for text2gremlin latest file lookup

ccb0b97

LRriver wants to merge 72 commits into apache:main from LRriver:text2gremlin

4 participants
