Add Text2Gremlin corpus generation and augmentation pipeline #352

Open
LRriver wants to merge 72 commits into apache:main from LRriver:text2gremlin

Conversation

Contributor

LRriver commented May 31, 2026

Summary

This PR adds a new text2gremlin/AST_Text2Gremlin module for Text2Gremlin data generation, LLM-based data augmentation, syntax validation, dataset merging, and DPO data construction.

The goal is to provide a reproducible data pipeline for generating high-quality Gremlin query training data from structured graph schemas and templates. The module starts from schema-aware Gremlin template generation, then uses LLMs to produce natural-language instructions, migrates samples across graph scenarios, validates generated Gremlin syntax, merges SFT data, and optionally builds DPO preference data by comparing correct Gremlin queries with Groovy-style negative samples.

The generated Text2Gremlin dataset has also been published on Hugging Face:

https://huggingface.co/datasets/Lriver/Text2Gremlin

Motivation

Text2Gremlin needs training data that is both syntactically valid and diverse enough to cover common graph query patterns. Manually writing these samples is expensive and difficult to scale, especially when the data needs to cover different graph domains, operation types, traversal structures, and natural-language styles.

This module provides a structured generation and augmentation pipeline:

  1. Generate Gremlin queries from schema-aware templates.
  2. Validate query syntax with an ANTLR-based Gremlin parser.
  3. Translate Gremlin queries into multiple natural-language instruction styles.
  4. Migrate existing samples to other graph scenarios.
  5. Merge and deduplicate SFT-style training data.
  6. Generate DPO preference data for Groovy-vs-Gremlin alignment.

Project structure

This PR adds the following Text2Gremlin module structure:

text2gremlin/AST_Text2Gremlin/
├── README.md
├── README_zh.md
├── requirements.txt
├── config_example.json
├── generate_corpus.py
├── analyze_syntax.py
├── run_llm_pipeline.py
├── gremlin_templates.csv
├── base/
│   ├── Config.py
│   ├── Schema.py
│   ├── GremlinBase.py
│   ├── GremlinExpr.py
│   ├── GremlinParse.py
│   ├── GremlinTransVisitor.py
│   ├── TraversalGenerator.py
│   ├── CombinationController.py
│   ├── generator.py
│   ├── combination_control_config.json
│   ├── gremlin/
│   │   ├── Gremlin.g4
│   │   ├── GremlinLexer.py
│   │   ├── GremlinParser.py
│   │   ├── GremlinVisitor.py
│   │   └── GremlinListener.py
│   └── template/
│       ├── schema_dict.txt
│       └── syn_dict.txt
├── db_data/
│   ├── schema/
│   │   └── movie_schema.json
│   ├── movie/raw_data/
│   │   ├── vertex_*.csv
│   │   └── edge_*.csv
│   └── reference/
│       └── schemas_data.json
├── llm_augment/
│   ├── generalize_llm.py
│   ├── migrate_scenario.py
│   ├── merge_dataset.py
│   └── generate_dpo_data.py
└── tests/
    ├── test_analyze_syntax.py
    ├── test_generate_corpus.py
    ├── test_gremlin_base.py
    ├── test_generalize_llm.py
    ├── test_migrate_scenario.py
    ├── test_merge_dataset.py
    └── test_run_llm_pipeline.py

The module is organized around four layers:

  • base/: schema-aware Gremlin AST parsing, traversal generation, recipe representation, grammar integration, and generation controls.
  • db_data/: seed graph schema and movie-domain raw data used by the generator.
  • llm_augment/: LLM-based translation, scenario migration, dataset merge, and DPO data construction.
  • tests/: focused tests for generation, augmentation, merge behavior, pipeline argument forwarding, and syntax analysis.

What changed

1. AST-based Gremlin corpus generation

This PR adds a schema-aware Gremlin generation framework under text2gremlin/AST_Text2Gremlin/base.

Key pieces include:

  • ANTLR Gremlin grammar/parser/visitor integration.
  • Recipe-style Gremlin representation for query construction.
  • Step, predicate, anonymous traversal, connector, and terminal handling.
  • Recursive traversal generation with schema constraints.
  • Connectivity validation for generated paths.
  • Data value filling from schema and raw graph data.
  • Deduplication and syntax filtering for generated queries.
  • Combination control through base/combination_control_config.json.
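As a rough illustration of recursive traversal generation under schema constraints, the sketch below enumerates only those `out()` chains whose edges actually start at the current vertex label. The `SCHEMA_EDGES` mapping and the `generate_traversals` helper are hypothetical and are not taken from the module; the real generator works on ANTLR-parsed recipes and is considerably richer.

```python
from typing import Dict, List, Tuple

# Hypothetical toy schema: (source label, edge label) -> target label.
# The labels echo the movie scenario, but this mapping is illustrative only.
SCHEMA_EDGES: Dict[Tuple[str, str], str] = {
    ("person", "acted_in"): "movie",
    ("person", "directed"): "movie",
    ("user", "rate"): "movie",
}


def generate_traversals(label: str, max_hops: int) -> List[str]:
    """Enumerate out-edge traversals from `label` that the schema allows."""
    results: List[str] = []

    def walk(current: str, query: str, hops: int) -> None:
        if hops > 0:
            results.append(query)
        if hops == max_hops:
            return
        for (src, edge), dst in SCHEMA_EDGES.items():
            if src == current:  # connectivity check: the edge must start here
                walk(dst, f"{query}.out('{edge}')", hops + 1)

    walk(label, f"g.V().hasLabel('{label}')", 0)
    # Deduplicate while preserving generation order.
    return list(dict.fromkeys(results))


if __name__ == "__main__":
    for query in generate_traversals("person", max_hops=2):
        print(query)
```

Because the toy schema has no edges leaving `movie`, a two-hop request from `person` still yields only one-hop queries, which is the kind of pruning a connectivity check provides.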

The main entry point is:

python generate_corpus.py

The generation process can use:

  • db_data/schema/movie_schema.json
  • db_data/movie/raw_data/*.csv
  • gremlin_templates.csv
  • base/template/schema_dict.txt
  • base/template/syn_dict.txt

2. Movie-domain seed schema and data

This PR includes a movie graph scenario used as the initial schema/data source for Text2Gremlin generation.

Added files include:

  • db_data/schema/movie_schema.json
  • movie vertex CSV files, such as vertex_movie.csv, vertex_person.csv, vertex_user.csv
  • movie edge CSV files, such as edge_acted_in.csv, edge_directed.csv, edge_rate.csv
  • db_data/reference/schemas_data.json

These files provide the schema, labels, properties, and example values needed by the generation pipeline.

3. LLM-based natural-language augmentation

This PR adds llm_augment/generalize_llm.py, which converts generated Gremlin queries into natural-language Text2Gremlin samples.

The augmentation supports multiple instruction styles, including:

  • direct question style
  • command style
  • natural conversational style
  • domain-aware wording

The output is designed for SFT-style Text2Gremlin training data, where each sample pairs a user instruction with a valid Gremlin query.

4. Scenario migration augmentation

This PR adds llm_augment/migrate_scenario.py, which migrates generated samples from one graph scenario to another.

The default migration mode is now:

same_operation

In this mode, the model is asked to migrate a source query into the target scenario while preserving the original operation type, such as read, create, update, or delete. It can generate multiple target-scenario samples for each source sample.

The previous broader behavior is still available through:

mixed_operations

In this mode, the model may generate target-scenario samples across multiple operation types. This is useful when users want more diverse CRUD-style augmentation, but it is no longer the default because not every source query is suitable for migration into every operation type.

The number of same-operation migration samples is configurable and defaults to 3.
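The difference between the two modes can be sketched as a post-filter on the migrated candidates. Everything here is an assumption made for illustration: the keyword-based `classify_operation` heuristic, the `filter_migrated` helper, and the `query` key are hypothetical, and the module may classify operations and structure samples differently. Only the mode names and the default count of 3 come from the description above.

```python
from typing import Dict, List

DEFAULT_SAME_OPERATION_COUNT = 3  # default stated in the PR description


def classify_operation(query: str) -> str:
    """Rough keyword heuristic (assumption); not the module's classifier."""
    if ".drop()" in query:
        return "delete"
    if ".addV(" in query or ".addE(" in query:
        return "create"
    if ".property(" in query:
        return "update"
    return "read"


def filter_migrated(
    source_query: str,
    candidates: List[Dict[str, str]],
    mode: str = "same_operation",
    count: int = DEFAULT_SAME_OPERATION_COUNT,
) -> List[Dict[str, str]]:
    """Keep candidates according to the selected migration mode."""
    if mode == "mixed_operations":
        return list(candidates)  # any operation type is acceptable
    if mode != "same_operation":
        raise ValueError(f"unknown migration mode: {mode}")
    wanted = classify_operation(source_query)
    kept = [c for c in candidates if classify_operation(c["query"]) == wanted]
    return kept[:count]
```

With a read-style source query, a candidate that ends in `.drop()` would be discarded in `same_operation` mode but retained in `mixed_operations` mode.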

5. Dataset merge and statistics

This PR adds llm_augment/merge_dataset.py for merging augmented Text2Gremlin data into final SFT-style outputs.

The merge step supports:

  • combining direct translation data and scenario migration data
  • filtering invalid or incomplete samples
  • deduplicating samples
  • collecting domain and CRUD operation statistics
  • exporting train/validation style data files

This keeps generated data preparation separate from model training, making it easier to inspect and reuse the dataset.
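A minimal sketch of the filter, deduplicate, and count steps is shown below. The record keys `instruction`, `query`, and `domain`, and the `merge_samples` helper, are assumptions chosen for illustration; the actual field names used by merge_dataset.py are not shown in this PR description.

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def merge_samples(
    *sources: Iterable[Dict[str, str]],
) -> Tuple[List[Dict[str, str]], Counter]:
    """Merge sample lists, dropping incomplete and duplicate records."""
    seen = set()
    merged: List[Dict[str, str]] = []
    for samples in sources:
        for sample in samples:
            instruction = (sample.get("instruction") or "").strip()
            query = (sample.get("query") or "").strip()
            if not instruction or not query:
                continue  # filter incomplete samples
            key = (instruction, query)
            if key in seen:
                continue  # exact duplicate of an earlier sample
            seen.add(key)
            merged.append({**sample, "instruction": instruction, "query": query})
    # Per-domain statistics over the surviving samples.
    domain_stats = Counter(s.get("domain", "unknown") for s in merged)
    return merged, domain_stats
```

Deduplicating on the exact (instruction, query) pair is the simplest possible choice; a real pipeline might normalize whitespace or casing before comparing.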

6. DPO data generation

This PR adds llm_augment/generate_dpo_data.py for generating DPO preference data.

The DPO generation pairs each valid Gremlin answer with a lower-quality Groovy-style or otherwise non-preferred output for the same instruction. This is intended to steer models toward emitting Gremlin queries rather than Groovy-style code.
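One way to picture a preference record is the sketch below. The `prompt` / `chosen` / `rejected` layout is a common convention in preference-training tooling, but whether generate_dpo_data.py emits exactly these keys is an assumption, and `build_dpo_record` is a hypothetical helper.

```python
from typing import Dict


def build_dpo_record(
    instruction: str, gremlin_query: str, groovy_style_answer: str
) -> Dict[str, str]:
    """Pair a preferred Gremlin answer with a non-preferred alternative."""
    if gremlin_query.strip() == groovy_style_answer.strip():
        # A preference pair carries no signal if both sides are identical.
        raise ValueError("chosen and rejected answers must differ")
    return {
        "prompt": instruction,
        "chosen": gremlin_query,
        "rejected": groovy_style_answer,
    }


if __name__ == "__main__":
    record = build_dpo_record(
        "How many movies are in the graph?",
        "g.V().hasLabel('movie').count()",
        "def n = g.V().hasLabel('movie').toList().size(); return n",
    )
    print(record)
```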

7. End-to-end pipeline runner

This PR adds run_llm_pipeline.py as a staged pipeline entry point.

The pipeline supports running stages such as:

translate -> migrate -> merge -> dpo

This makes it possible to run only the required part of the pipeline during development or data refresh.
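Running a contiguous slice of the stage chain can be sketched as follows. The stage names come from the chain above; the `select_stages` helper and its `start` / `end` parameters are hypothetical and do not describe the actual command-line interface of run_llm_pipeline.py.

```python
from typing import List

STAGES = ["translate", "migrate", "merge", "dpo"]


def select_stages(start: str = "translate", end: str = "dpo") -> List[str]:
    """Return the contiguous run of stages from `start` to `end`, inclusive."""
    # list.index raises ValueError for an unknown stage name.
    first, last = STAGES.index(start), STAGES.index(end)
    if first > last:
        raise ValueError(f"stage '{start}' comes after '{end}'")
    return STAGES[first : last + 1]


if __name__ == "__main__":
    print(select_stages("migrate", "merge"))
```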

8. Syntax analysis tooling

This PR adds analyze_syntax.py for analyzing generated Gremlin query distributions.

The analysis can report:

  • Gremlin step frequency
  • predicate usage
  • traversal pattern distribution
  • operation-type distribution
  • syntax coverage statistics

This is useful for checking whether generated data is overly concentrated on a small set of Gremlin patterns.
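Step frequency, the first item in the list above, can be approximated with a regular expression as in the sketch below. This is a simplification for illustration only: it is not how analyze_syntax.py is known to work, it would also count predicate calls such as `P.gt(` as steps, and it can be fooled by method-like text inside string literals. The `step_frequency` helper is hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Matches ".name(" and captures the name, e.g. ".out(" -> "out".
STEP_PATTERN = re.compile(r"\.([A-Za-z_]\w*)\(")


def step_frequency(queries: Iterable[str]) -> Counter:
    """Count how often each method-style step name appears across queries."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(STEP_PATTERN.findall(query))
    return counts


if __name__ == "__main__":
    corpus = [
        "g.V().hasLabel('movie').out('acted_in').count()",
        "g.V().out('rate').values('title')",
    ]
    for step, n in step_frequency(corpus).most_common():
        print(step, n)
```

A heavily skewed count (for example, almost every query using only `out` and `values`) would suggest the corpus is concentrated on a narrow set of patterns.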

9. Configuration and examples

This PR adds:

  • config_example.json
  • requirements.txt
  • English README
  • Chinese README

The config example documents model API settings and generation-related options. Sensitive local config files are excluded by .gitignore.

The README files describe:

  • module purpose
  • installation
  • configuration
  • corpus generation
  • LLM augmentation
  • scenario migration modes
  • dataset merge
  • DPO generation
  • Hugging Face dataset location
  • expected output files

10. Tests

This PR adds focused pytest coverage under text2gremlin/AST_Text2Gremlin/tests.

The tests cover:

  • Gremlin syntax analysis
  • corpus generation behavior
  • Gremlin base parsing/generation helpers
  • LLM generalization helpers
  • scenario migration modes
  • same-operation filtering
  • mixed-operation mode behavior
  • merge dataset behavior
  • pipeline argument forwarding
  • config loading
  • dictionary fallback and dictionary loading behavior

Dataset

The generated dataset is available here:

https://huggingface.co/datasets/Lriver/Text2Gremlin

The dataset is intentionally hosted outside this repository to avoid committing large generated artifacts. This repository contains the generation code, schema/data seeds, configuration examples, and documentation needed to reproduce or extend the dataset.

Compatibility and scope

This PR only adds the new Text2Gremlin generation module under:

text2gremlin/AST_Text2Gremlin

It does not change the existing HugeGraph LLM runtime APIs, HugeGraph Python client APIs, or other existing modules.

Large generated output artifacts are not committed. Local model API configuration is expected to be stored in a local config file and is ignored by git.

Validation

The same head branch has already been validated through the existing PR workflow in hugegraph/hugegraph-ai#52, including Ruff checks, dependency/license checks, module CI checks, and CodeRabbit review.

Local validation used during preparation:

uv run --with-requirements text2gremlin/AST_Text2Gremlin/requirements.txt pytest text2gremlin/AST_Text2Gremlin/tests -q
uv run ruff format --check .
uv run ruff check .
git diff --check

Related

This PR is based on the same text2gremlin branch as:

hugegraph#52

…with correctness guarantee and preliminary question generalization
Copilot AI review requested due to automatic review settings May 31, 2026 10:52
dosubot (bot) added the size:XXL (This PR changes 1000+ lines, ignoring generated files.) and enhancement (New feature or request) labels May 31, 2026
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Introduces a new text2gremlin/AST_Text2Gremlin subproject for generating Text2Gremlin training data via AST-based template generalization and multi-stage LLM augmentation, along with project-level lint/license configuration updates to accommodate it.

Changes:

  • Adds the AST_Text2Gremlin pipeline (AST generalization, LLM translation, scenario migration, dataset merging, DPO data generation) with supporting schema/data, templates, and dictionaries.
  • Adds pytest-based unit tests for pipeline components and adds a unified run_llm_pipeline.py orchestrator.
  • Updates pyproject.toml (ruff exclusions/per-file ignores) and .licenserc.yaml (exclude generated .tokens/.interp/.csv files).

Reviewed changes

Copilot reviewed 51 out of 61 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| text2gremlin/AST_Text2Gremlin/tests/*.py | New pytest unit tests covering pipeline stages and helpers |
| text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py | Unified CLI orchestrator for the 4 LLM stages |
| text2gremlin/AST_Text2Gremlin/generate_corpus.py | CLI entry to AST-based corpus generation |
| text2gremlin/AST_Text2Gremlin/analyze_syntax.py | Gremlin syntax distribution analyzer/report generator |
| text2gremlin/AST_Text2Gremlin/llm_augment/* | Merge dataset stage and package init for LLM augmentation |
| text2gremlin/AST_Text2Gremlin/base/*.py | Core engine: Config, Schema, GremlinBase, generator, combination controller, expr/parse types |
| text2gremlin/AST_Text2Gremlin/base/gremlin/* | ANTLR-generated Gremlin tokens/init |
| text2gremlin/AST_Text2Gremlin/base/template/*.txt | Schema/synonym dictionaries |
| text2gremlin/AST_Text2Gremlin/base/combination_control_config.json | Combination explosion control config |
| text2gremlin/AST_Text2Gremlin/db_data/** | Sample movie-domain schema and CSV data |
| text2gremlin/AST_Text2Gremlin/gremlin_templates.csv | 251 Gremlin query templates |
| text2gremlin/AST_Text2Gremlin/config_example.json | Example config for the pipeline |
| text2gremlin/AST_Text2Gremlin/requirements.txt | Subproject Python dependencies |
| text2gremlin/AST_Text2Gremlin/README.md / README_zh.md | English/Chinese documentation |
| text2gremlin/AST_Text2Gremlin/.gitignore | Ignore config.json/output/pycache |
| pyproject.toml | Ruff exclude/per-file ignores for the new subproject |
| .licenserc.yaml | Exclude generated .interp/.tokens/.csv from license header check |
Comments suppressed due to low confidence (5)

text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1

  • Selecting the 'latest' file by lexicographic sort of glob results is fragile. While the current YYYYMMDD_HHMMSS timestamp naming happens to sort correctly, any auxiliary file matching the pattern (e.g., llm_translated_backup.json) or a future filename change could silently pick the wrong file. Consider sorting by os.path.getmtime or parsing the timestamp explicitly to make the 'latest' semantics robust.

text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1

  • subprocess.run is called without check=False explicitly and without input/stream handling; while the return code is propagated, consider passing check=False explicitly (ruff/bandit S603/PLW1510 style) for clarity, and document that stdout/stderr are inherited from the parent process so users understand interleaving of stage logs.

text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1

  • Forwarding --migration-mode/--same-operation-sample-count only to the first selected stage is surprising: a user running the full pipeline who passes --migration-mode same_operation will see those args sent to translate (which will fail on unknown args) instead of migrate. Consider routing extras based on which stage owns each flag, or document this constraint prominently and validate that the first stage actually accepts the supplied args.

text2gremlin/AST_Text2Gremlin/tests/test_gremlin_base.py:1

  • Asserting that a misspelled key perosn_organization is absent is essentially testing a typo that may or may not have ever existed; this assertion adds no useful regression coverage and will silently pass forever. Remove it or replace it with a positive assertion about an expected key, to make the test's intent clear.

text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1

  • User-facing error messages on failure paths are written to stdout via print. Errors should go to stderr (e.g., print(..., file=sys.stderr)) so that callers/CI can distinguish normal output from error diagnostics; the test in test_merge_dataset.py currently asserts against result.stdout, which would also need updating.
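The first suppressed comment (robust 'latest' file selection) could be addressed along the lines of the sketch below, which parses the timestamp explicitly instead of relying on lexicographic order. The filename suffix pattern `_YYYYMMDD_HHMMSS.json` and the `pick_latest` helper are assumptions based on the naming mentioned in that comment, not code from the PR.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import Iterable, Optional

# Assumed suffix: "_YYYYMMDD_HHMMSS.json" at the end of the filename.
TIMESTAMP_RE = re.compile(r"_(\d{8}_\d{6})\.json$")


def pick_latest(paths: Iterable[str]) -> Optional[str]:
    """Return the path with the newest embedded timestamp, or None."""
    dated = []
    for path in paths:
        match = TIMESTAMP_RE.search(Path(path).name)
        if match is None:
            continue  # e.g. llm_translated_backup.json has no timestamp
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # digits present but not a valid date/time
        dated.append((stamp, str(path)))
    if not dated:
        return None
    return max(dated)[1]
```

With plain `sorted(...)[-1]`, a file such as llm_translated_backup.json would sort after every timestamped file because `b` compares greater than any digit; parsing the timestamp avoids that.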

Comment on lines +253 to +256
# Keep the original casing; do not convert it
for index, (key, value) in enumerate(templates_data.items()):
    self.token_dict[key] = index
    self.template.append(value)
Contributor Author

Fixed in 6bbf946 by replacing the duplicate token template entries with a canonical template map plus aliases.

Comment on lines +229 to +231
# If no count is specified, randomly pick 2-5
if count is None:
    count = random.randint(2, 5)
Contributor Author

Fixed in 6bbf946 by extracting the default sample range into DEFAULT_SAMPLE_MIN and DEFAULT_SAMPLE_MAX.
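The shape of that change might look like the sketch below. The constant names come from the reply above and the values 2 and 5 from the quoted snippet; the `resolve_sample_count` wrapper is hypothetical, since the surrounding function is not shown in this thread.

```python
import random
from typing import Optional

DEFAULT_SAMPLE_MIN = 2
DEFAULT_SAMPLE_MAX = 5


def resolve_sample_count(count: Optional[int] = None) -> int:
    """Return `count`, or a random default within the named range."""
    if count is None:
        # randint is inclusive on both ends.
        return random.randint(DEFAULT_SAMPLE_MIN, DEFAULT_SAMPLE_MAX)
    return count
```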

Comment on lines +181 to +182
if stats["generated_count"] > 5000:
    stats["warning"] = f"This template's Recipe is complex, so a large number of queries was generated ({stats['generated_count']} queries)"
Contributor Author

Fixed in 6bbf946 by extracting the large-generation threshold into LARGE_GENERATION_THRESHOLD.

Comment on lines +317 to +324
if __name__ == "__main__":
    # Temporarily create a config object for testing
    class MockConfig:
        def get_schema_dict_path(self):
            return "./template/schema_dict.txt"

        def get_syn_dict_path(self):
            return "./template/syn_dict.txt"
Contributor Author

Fixed in 6bbf946 by making MockConfig return list paths, matching the production config behavior.

Comment on lines +60 to +63
# chain_thresholds only needs short, medium, and long (ultra is defined implicitly by the else branch)
for category in ("short", "medium", "long"):
    if category not in self.chain_thresholds:
        raise ValueError(f"chain_thresholds is missing the '{category}' configuration")
Contributor Author

Fixed in 6bbf946 by extracting the chain category tuples and reusing them in validation and category lookup.
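A possible shape for that refactor is sketched below. The tuple names, the two helper functions, and the interpretation of each threshold as an inclusive upper bound on chain length are all assumptions; only the category names and the implicit `ultra` fallback come from the quoted snippet.

```python
from typing import Dict

# Hypothetical names for the shared category tuples.
REQUIRED_CHAIN_CATEGORIES = ("short", "medium", "long")
ALL_CHAIN_CATEGORIES = REQUIRED_CHAIN_CATEGORIES + ("ultra",)


def validate_chain_thresholds(chain_thresholds: Dict[str, int]) -> None:
    """Raise if any required category is missing from the config."""
    for category in REQUIRED_CHAIN_CATEGORIES:
        if category not in chain_thresholds:
            raise ValueError(
                f"chain_thresholds is missing the '{category}' configuration"
            )


def chain_category(length: int, chain_thresholds: Dict[str, int]) -> str:
    """Map a chain length to a category; 'ultra' is the fallback."""
    for category in REQUIRED_CHAIN_CATEGORIES:
        if length <= chain_thresholds[category]:  # assumed inclusive bound
            return category
    return ALL_CHAIN_CATEGORIES[-1]
```

Sharing one tuple between validation and lookup means a newly added category cannot be validated in one place and forgotten in the other.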

Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/generalize_llm.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/generalize_llm.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/generalize_llm.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/generalize_llm.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/generalize_llm.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/migrate_scenario.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/migrate_scenario.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/migrate_scenario.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/migrate_scenario.py Fixed
Comment thread text2gremlin/AST_Text2Gremlin/llm_augment/migrate_scenario.py Fixed
Labels

enhancement New feature or request size:XXL This PR changes 1000+ lines, ignoring generated files.

4 participants