Add Text2Gremlin corpus generation and augmentation pipeline #352
LRriver wants to merge 72 commits into
…eneration parameters
…ing and call/with support
…y variants from Recipe
…cation and error handling
…and visitor classes
…with correctness guarantee and preliminary question generalization
…add data control policies.
… vulnerable pillow
…r variable references
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Introduces a new text2gremlin/AST_Text2Gremlin subproject for generating Text2Gremlin training data via AST-based template generalization and multi-stage LLM augmentation, along with project-level lint/license configuration updates to accommodate it.
Changes:
- Adds the `AST_Text2Gremlin` pipeline (AST generalization, LLM translation, scenario migration, dataset merging, DPO data generation) with supporting schema/data, templates, and dictionaries.
- Adds pytest-based unit tests for pipeline components and a unified `run_llm_pipeline.py` orchestrator.
- Updates `pyproject.toml` (ruff exclusions/per-file ignores) and `.licenserc.yaml` (exclude generated `.tokens`/`.interp`/`.csv` files).
Reviewed changes
Copilot reviewed 51 out of 61 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| text2gremlin/AST_Text2Gremlin/tests/*.py | New pytest unit tests covering pipeline stages and helpers |
| text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py | Unified CLI orchestrator for the 4 LLM stages |
| text2gremlin/AST_Text2Gremlin/generate_corpus.py | CLI entry to AST-based corpus generation |
| text2gremlin/AST_Text2Gremlin/analyze_syntax.py | Gremlin syntax distribution analyzer/report generator |
| text2gremlin/AST_Text2Gremlin/llm_augment/* | Merge dataset stage and package init for LLM augmentation |
| text2gremlin/AST_Text2Gremlin/base/*.py | Core engine: Config, Schema, GremlinBase, generator, combination controller, expr/parse types |
| text2gremlin/AST_Text2Gremlin/base/gremlin/* | ANTLR-generated Gremlin tokens/init |
| text2gremlin/AST_Text2Gremlin/base/template/*.txt | Schema/synonym dictionaries |
| text2gremlin/AST_Text2Gremlin/base/combination_control_config.json | Combination explosion control config |
| text2gremlin/AST_Text2Gremlin/db_data/** | Sample movie-domain schema and CSV data |
| text2gremlin/AST_Text2Gremlin/gremlin_templates.csv | 251 Gremlin query templates |
| text2gremlin/AST_Text2Gremlin/config_example.json | Example config for the pipeline |
| text2gremlin/AST_Text2Gremlin/requirements.txt | Subproject Python dependencies |
| text2gremlin/AST_Text2Gremlin/README.md / README_zh.md | English/Chinese documentation |
| text2gremlin/AST_Text2Gremlin/.gitignore | Ignore config.json/output/pycache |
| pyproject.toml | Ruff exclude/per-file ignores for the new subproject |
| .licenserc.yaml | Exclude generated .interp/.tokens/.csv from license header check |
Comments suppressed due to low confidence (5)
text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1
- Selecting the "latest" file by lexicographically sorting glob results is fragile. The current `YYYYMMDD_HHMMSS` timestamp naming happens to sort correctly, but any auxiliary file matching the pattern (e.g., `llm_translated_backup.json`) or a future filename change could silently pick the wrong file. Consider sorting by `os.path.getmtime` or parsing the timestamp explicitly to make the "latest" semantics robust.
text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1
- `subprocess.run` is called without an explicit `check=False` and without input/stream handling. The return code is propagated, but consider passing `check=False` explicitly (ruff/bandit S603/PLW1510 style) for clarity, and document that stdout/stderr are inherited from the parent process so users understand how stage logs interleave.
text2gremlin/AST_Text2Gremlin/run_llm_pipeline.py:1
- Forwarding `--migration-mode`/`--same-operation-sample-count` only to the first selected stage is surprising: a user running the full pipeline who passes `--migration-mode same_operation` will see those args sent to `translate` (which will fail on unknown args) instead of `migrate`. Consider routing extras based on which stage owns each flag, or document this constraint prominently and validate that the first stage actually accepts the supplied args.
text2gremlin/AST_Text2Gremlin/tests/test_gremlin_base.py:1
- Asserting that a misspelled key `perosn_organization` is absent essentially tests a typo that may or may not have ever existed; the assertion adds no useful regression coverage and will silently pass forever. Remove it or replace it with a positive assertion about an expected key, to make the test's intent clear.
text2gremlin/AST_Text2Gremlin/llm_augment/merge_dataset.py:1
- User-facing error messages on failure paths are written to stdout via `print`. Errors should go to stderr (e.g., `print(..., file=sys.stderr)`) so that callers/CI can distinguish normal output from error diagnostics; the test in `test_merge_dataset.py` currently asserts against `result.stdout`, which would also need updating.
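For the first suppressed comment, explicit timestamp parsing could look roughly like the sketch below. The filename pattern `llm_translated_YYYYMMDD_HHMMSS.json` and the helper name `pick_latest` are assumptions made for illustration; they are not taken from `merge_dataset.py`.

```python
import re
from datetime import datetime
from pathlib import Path
from typing import List, Optional

# Assumed naming scheme: <prefix>_YYYYMMDD_HHMMSS.json (hypothetical, for illustration).
_STAMP = re.compile(r"_(\d{8}_\d{6})\.json$")


def pick_latest(paths: List[str]) -> Optional[str]:
    """Return the path whose embedded timestamp is newest, skipping names without one."""
    dated = []
    for path in paths:
        match = _STAMP.search(Path(path).name)
        if match is None:
            continue  # e.g. llm_translated_backup.json carries no timestamp
        try:
            stamp = datetime.strptime(match.group(1), "%Y%m%d_%H%M%S")
        except ValueError:
            continue  # fourteen digits that do not form a real date/time
        dated.append((stamp, path))
    return max(dated)[1] if dated else None
```

Files that merely match the glob but carry no parseable timestamp are ignored instead of being sorted alongside the real outputs.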
    # Keep the original case; do not convert it
    for index, (key, value) in enumerate(templates_data.items()):
        self.token_dict[key] = index
        self.template.append(value)
Fixed in 6bbf946 by replacing the duplicate token template entries with a canonical template map plus aliases.
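A canonical map plus aliases might be indexed along these lines. This is a sketch only: the template strings, the alias names, and `build_token_index` are hypothetical and do not reproduce the code in 6bbf946.

```python
from typing import Dict, List, Tuple

# Hypothetical data: one template per canonical token, plus aliases that reuse it.
CANONICAL_TEMPLATES = {
    "hasLabel": "whose label is {}",
    "out": "outgoing neighbours via {}",
}
ALIASES = {"haslabel": "hasLabel", "OUT": "out"}


def build_token_index(
    canonical: Dict[str, str], aliases: Dict[str, str]
) -> Tuple[Dict[str, int], List[str]]:
    """Index each canonical token once and point every alias at the same slot."""
    token_dict: Dict[str, int] = {}
    templates: List[str] = []
    for index, (key, value) in enumerate(canonical.items()):
        token_dict[key] = index
        templates.append(value)
    for alias, target in aliases.items():
        if target not in token_dict:
            raise KeyError(f"alias {alias!r} points at unknown token {target!r}")
        token_dict[alias] = token_dict[target]
    return token_dict, templates
```

Each template is stored once, so an alias cannot drift out of sync with the entry it duplicates.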
    # If no count is specified, randomly choose between 2 and 5
    if count is None:
        count = random.randint(2, 5)
Fixed in 6bbf946 by extracting the default sample range into DEFAULT_SAMPLE_MIN and DEFAULT_SAMPLE_MAX.
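The extracted constants could be used as in the following sketch. Only the names `DEFAULT_SAMPLE_MIN` and `DEFAULT_SAMPLE_MAX` and the 2-5 range come from this thread; the `sample_values` helper and its `rng` parameter are assumptions added to keep the example self-contained and testable.

```python
import random
from typing import List, Optional, Sequence

DEFAULT_SAMPLE_MIN = 2
DEFAULT_SAMPLE_MAX = 5


def sample_values(
    values: Sequence,
    count: Optional[int] = None,
    rng: Optional[random.Random] = None,
) -> List:
    """Pick `count` distinct values; without a count, draw one from the default range."""
    rng = rng or random.Random()
    if count is None:
        count = rng.randint(DEFAULT_SAMPLE_MIN, DEFAULT_SAMPLE_MAX)
    count = min(count, len(values))  # never ask for more items than exist
    return rng.sample(list(values), count)
```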
    if stats["generated_count"] > 5000:
        stats["warning"] = f"Because this template's Recipe is complex, a large number of queries were generated ({stats['generated_count']})"
Fixed in 6bbf946 by extracting the large-generation threshold into LARGE_GENERATION_THRESHOLD.
    if __name__ == "__main__":
        # Temporarily create a config object for testing
        class MockConfig:
            def get_schema_dict_path(self):
                return "./template/schema_dict.txt"

            def get_syn_dict_path(self):
                return "./template/syn_dict.txt"
Fixed in 6bbf946 by making MockConfig return list paths, matching the production config behavior.
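Read literally, the fix makes the mock getters return lists of paths instead of single strings. A minimal sketch under that reading, with the exact return values assumed:

```python
from typing import List


class MockConfig:
    """Test double; assumes the production getters return a list of file paths."""

    def get_schema_dict_path(self) -> List[str]:
        return ["./template/schema_dict.txt"]

    def get_syn_dict_path(self) -> List[str]:
        return ["./template/syn_dict.txt"]
```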
    # chain_thresholds only needs short, medium, and long (ultra is defined implicitly by the else branch)
    for category in ("short", "medium", "long"):
        if category not in self.chain_thresholds:
            raise ValueError(f"chain_thresholds is missing the '{category}' configuration")
Fixed in 6bbf946 by extracting the chain category tuples and reusing them in validation and category lookup.
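Sharing one tuple between validation and lookup could look like this sketch. The constant and function names are hypothetical, and treating each threshold as an inclusive upper bound on the step count is an assumption, not something stated in the PR.

```python
from typing import Dict

REQUIRED_CHAIN_CATEGORIES = ("short", "medium", "long")
IMPLICIT_CHAIN_CATEGORY = "ultra"  # anything above the "long" threshold


def validate_chain_thresholds(chain_thresholds: Dict[str, int]) -> None:
    """Fail fast when a required category has no threshold."""
    for category in REQUIRED_CHAIN_CATEGORIES:
        if category not in chain_thresholds:
            raise ValueError(f"chain_thresholds is missing the '{category}' configuration")


def chain_category(step_count: int, chain_thresholds: Dict[str, int]) -> str:
    """Return the first category whose (assumed inclusive) upper bound fits."""
    for category in REQUIRED_CHAIN_CATEGORIES:
        if step_count <= chain_thresholds[category]:
            return category
    return IMPLICIT_CHAIN_CATEGORY
```

Because both functions iterate the same tuple, adding or renaming a category cannot leave validation and lookup out of step.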
Summary
This PR adds a new `text2gremlin/AST_Text2Gremlin` module for Text2Gremlin data generation, LLM-based data augmentation, syntax validation, dataset merging, and DPO data construction.
The goal is to provide a reproducible data pipeline for generating high-quality Gremlin query training data from structured graph schemas and templates. The module starts from schema-aware Gremlin template generation, then uses LLMs to produce natural-language instructions, migrates samples across graph scenarios, validates generated Gremlin syntax, merges SFT data, and optionally builds DPO preference data by comparing correct Gremlin queries with Groovy-style negative samples.
The generated Text2Gremlin dataset has also been published on Hugging Face:
https://huggingface.co/datasets/Lriver/Text2Gremlin
Motivation
Text2Gremlin needs training data that is both syntactically valid and diverse enough to cover common graph query patterns. Manually writing these samples is expensive and difficult to scale, especially when the data needs to cover different graph domains, operation types, traversal structures, and natural-language styles.
This module provides a structured generation and augmentation pipeline:
Project structure
This PR adds the following Text2Gremlin module structure:
The module is organized around four layers:
- `base/`: schema-aware Gremlin AST parsing, traversal generation, recipe representation, grammar integration, and generation controls.
- `db_data/`: seed graph schema and movie-domain raw data used by the generator.
- `llm_augment/`: LLM-based translation, scenario migration, dataset merge, and DPO data construction.
- `tests/`: focused tests for generation, augmentation, merge behavior, pipeline argument forwarding, and syntax analysis.
What changed
1. AST-based Gremlin corpus generation
This PR adds a schema-aware Gremlin generation framework under `text2gremlin/AST_Text2Gremlin/base`.
Key pieces include:
- the combination-control configuration in `base/combination_control_config.json`.
The main entry point is `generate_corpus.py`.
The generation process can use:
- `db_data/schema/movie_schema.json`
- `db_data/movie/raw_data/*.csv`
- `gremlin_templates.csv`
- `base/template/schema_dict.txt`
- `base/template/syn_dict.txt`
2. Movie-domain seed schema and data
This PR includes a movie graph scenario used as the initial schema/data source for Text2Gremlin generation.
Added files include:
- `db_data/schema/movie_schema.json`
- `vertex_movie.csv`, `vertex_person.csv`, `vertex_user.csv`
- `edge_acted_in.csv`, `edge_directed.csv`, `edge_rate.csv`
- `db_data/reference/schemas_data.json`
These files provide the schema, labels, properties, and example values needed by the generation pipeline.
3. LLM-based natural-language augmentation
This PR adds `llm_augment/generalize_llm.py`, which converts generated Gremlin queries into natural-language Text2Gremlin samples.
The augmentation supports multiple instruction styles, including:
The output is designed for SFT-style Text2Gremlin training data, where each sample pairs a user instruction with a valid Gremlin query.
4. Scenario migration augmentation
This PR adds `llm_augment/migrate_scenario.py`, which migrates generated samples from one graph scenario to another.
The default migration mode is now:
In this mode, the model is asked to migrate a source query into the target scenario while preserving the original operation type, such as read, create, update, or delete. It can generate multiple target-scenario samples for each source sample.
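One way to check that a migrated sample keeps its operation type is a step-based heuristic like the sketch below. It is an illustration, not the check used in `migrate_scenario.py`: the function names are hypothetical, and a regex scan can be fooled by step names that appear inside string literals.

```python
import re


def operation_type(query: str) -> str:
    """Heuristically classify a Gremlin query as create, read, update, or delete."""
    if re.search(r"\.drop\s*\(", query):
        return "delete"
    if re.search(r"\.(addV|addE)\s*\(", query):
        return "create"  # checked before property(), since creation usually sets properties too
    if re.search(r"\.property\s*\(", query):
        return "update"
    return "read"


def keeps_operation(source_query: str, migrated_query: str) -> bool:
    """True when the migrated query falls into the same CRUD class as its source."""
    return operation_type(source_query) == operation_type(migrated_query)
```

A filter like this could reject migrated samples whose operation type changed before they reach the merge stage.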
The previous broader behavior is still available through:
In this mode, the model may generate target-scenario samples across multiple operation types. This is useful when users want more diverse CRUD-style augmentation, but it is no longer the default because not every source query is suitable for migration into every operation type.
The number of same-operation migration samples is configurable and defaults to `3`.
5. Dataset merge and statistics
This PR adds `llm_augment/merge_dataset.py` for merging augmented Text2Gremlin data into final SFT-style outputs.
The merge step supports:
This keeps generated data preparation separate from model training, making it easier to inspect and reuse the dataset.
6. DPO data generation
This PR adds `llm_augment/generate_dpo_data.py` for generating DPO preference data.
The DPO generation compares valid Gremlin answers with lower-quality Groovy-style or non-preferred outputs. This is intended to help align models toward producing Gremlin queries instead of code-like alternatives.
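A preference record pairing a valid Gremlin answer with a Groovy-style negative might be shaped as follows. The `prompt`/`chosen`/`rejected` field names follow a common DPO dataset convention and are an assumption here, as is the `build_dpo_record` helper; the actual output schema of `generate_dpo_data.py` may differ.

```python
from typing import Dict


def build_dpo_record(instruction: str, gremlin_query: str, rejected_answer: str) -> Dict[str, str]:
    """Pair the preferred Gremlin query with a non-preferred answer for one instruction."""
    if gremlin_query.strip() == rejected_answer.strip():
        raise ValueError("chosen and rejected answers must differ")
    return {
        "prompt": instruction,
        "chosen": gremlin_query,
        "rejected": rejected_answer,
    }
```

For example, with a made-up Groovy-style negative:

```python
record = build_dpo_record(
    "List the titles of all movies.",
    "g.V().hasLabel('movie').values('title')",
    "graph.vertices().findAll { it.label() == 'movie' }.collect { it.title }",
)
```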
7. End-to-end pipeline runner
This PR adds `run_llm_pipeline.py` as a staged pipeline entry point.
The pipeline supports running stages such as:
This makes it possible to run only the required part of the pipeline during development or data refresh.
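When several stages run in one invocation, stage-specific flags need to reach the stage that owns them. The sketch below shows one way to route them. The stage names `translate` and `migrate` and the two flags appear in the review discussion of this PR, but the `STAGE_FLAGS` table and `route_extra_args` are hypothetical and are not the orchestrator's actual implementation.

```python
from typing import Dict, List, Sequence, Tuple

# Assumed ownership table: which stage accepts which extra flags.
STAGE_FLAGS = {
    "migrate": {"--migration-mode", "--same-operation-sample-count"},
}


def route_extra_args(
    stages: Sequence[str], extras: Sequence[Tuple[str, str]]
) -> Dict[str, List[str]]:
    """Build a per-stage argv fragment, rejecting flags that no selected stage owns."""
    routed: Dict[str, List[str]] = {stage: [] for stage in stages}
    for flag, value in extras:
        owners = [stage for stage in stages if flag in STAGE_FLAGS.get(stage, set())]
        if not owners:
            raise ValueError(f"no selected stage accepts {flag}")
        for stage in owners:
            routed[stage] += [flag, value]
    return routed
```

Each stage's list can then be appended to that stage's command before it is passed to `subprocess.run`.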
8. Syntax analysis tooling
This PR adds `analyze_syntax.py` for analyzing generated Gremlin query distributions.
The analysis can report:
This is useful for checking whether generated data is overly concentrated on a small set of Gremlin patterns.
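A rough version of such a concentration check can be approximated with a regex over step names, as in this sketch. It is a simplification for illustration and not how `analyze_syntax.py` works: `step_distribution` is a hypothetical helper, and the regex also counts step-like text inside string literals.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Matches ".name(" and captures the step name, e.g. ".out(" -> "out".
_STEP = re.compile(r"\.([A-Za-z_]\w*)\s*\(")


def step_distribution(queries: Iterable[str]) -> Dict[str, float]:
    """Return each step's share of all step occurrences, most frequent first."""
    counts: Counter = Counter()
    for query in queries:
        counts.update(_STEP.findall(query))
    total = sum(counts.values())
    if total == 0:
        return {}
    return {step: n / total for step, n in counts.most_common()}
```

A distribution dominated by a few steps suggests the corpus covers only a narrow set of query patterns.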
9. Configuration and examples
This PR adds:
- `config_example.json`
- `requirements.txt`
The config example documents model API settings and generation-related options. Sensitive local config files are excluded by `.gitignore`.
The README files describe:
10. Tests
This PR adds focused pytest coverage under `text2gremlin/AST_Text2Gremlin/tests`.
The tests cover:
Dataset
The generated dataset is available here:
https://huggingface.co/datasets/Lriver/Text2Gremlin
The dataset is intentionally hosted outside this repository to avoid committing large generated artifacts. This repository contains the generation code, schema/data seeds, configuration examples, and documentation needed to reproduce or extend the dataset.
Compatibility and scope
This PR only adds the new Text2Gremlin generation module under:
It does not change the existing HugeGraph LLM runtime APIs, HugeGraph Python client APIs, or other existing modules.
Large generated output artifacts are not committed. Local model API configuration is expected to be stored in a local config file and is ignored by git.
Validation
The same head branch has already been validated through the existing PR workflow in hugegraph/hugegraph-ai#52, including Ruff checks, dependency/license checks, module CI checks, and CodeRabbit review.
Local validation used during preparation:
Related
This PR is based on the same `text2gremlin` branch as: hugegraph#52