SyntheRela is a comprehensive benchmark designed to evaluate and compare synthetic relational database generation methods. It provides a standardized framework for assessing both the fidelity and utility of synthetic data across multiple real-world databases. The benchmark includes novel evaluation metrics designed specifically for relational data, and supports a range of open-source and commercial synthetic data generation methods.
SyntheRela is highly extensible, allowing users to benchmark on their own custom datasets and implement new evaluation metrics to suit specific use cases.
Our research on SyntheRela is presented in the paper "SyntheRela: A Benchmark For Synthetic Relational Database Generation" at the ICLR 2025 Workshop "Will Synthetic Data Finally Solve the Data Access Problem?", available on OpenReview.
We maintain a public leaderboard on Hugging Face where you can compare the performance of different synthetic data generation methods.
To install only the benchmark package, run the following command:
```bash
pip install syntherela
```

To evaluate your synthetic relational data, configure the `Benchmark` class with your desired metrics and run the evaluation pipeline:

```python
from syntherela.benchmark import Benchmark
from syntherela.metrics.single_column.statistical import ChiSquareTest
from syntherela.metrics.single_table.distance import MaximumMeanDiscrepancy
from syntherela.metrics.multi_table.statistical import CardinalityShapeSimilarity
from syntherela.metrics.multi_table.detection import AggregationDetection
from xgboost import XGBClassifier
# Initialize the benchmark with specific metrics
benchmark = Benchmark(
    real_data_dir="path/to/real_data",
    synthetic_data_dir="path/to/synthetic_data",
    results_dir="results",
    single_column_metrics=[ChiSquareTest()],
    single_table_metrics=[MaximumMeanDiscrepancy()],
    multi_table_metrics=[
        CardinalityShapeSimilarity(),
        AggregationDetection(classifier_cls=XGBClassifier, random_state=42),
    ],
    datasets=["your_dataset_name"],
    methods=["your_method_name"],
)
# Execute evaluation
benchmark.run()
```

We provide example notebooks in the `examples/` directory to help you get started with SyntheRela:
- Evaluating Rossmann Subsampled Dataset: A step-by-step guide to evaluating a subsampled version of the Rossmann dataset using various metrics.
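After `benchmark.run()` completes, the evaluation reports are written to the configured `results_dir`. Below is a minimal sketch of listing them, under the assumption that reports are serialized as JSON files; the exact file layout may differ between SyntheRela versions:

```python
import json
from pathlib import Path

results_dir = Path("results")

# List every saved report and its top-level keys.
# Assumes JSON serialization; adjust the glob pattern if your
# SyntheRela version stores results in a different format.
for report_path in sorted(results_dir.rglob("*.json")):
    with open(report_path) as f:
        report = json.load(f)
    print(report_path.relative_to(results_dir), "->", list(report.keys()))
```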
For detailed instructions on how to replicate the paper's results, please refer to `docs/REPLICATING_RESULTS.md`.
The documentation for adding a new metric can be found in `docs/ADDING_A_METRIC.md`.
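As a rough illustration of what a custom metric can look like, here is a standalone sketch of a distance-style single-column metric computing the total variation distance between two categorical columns. The class shape and `run` signature here are illustrative assumptions; the actual base classes and interface SyntheRela expects are described in `docs/ADDING_A_METRIC.md`:

```python
import numpy as np
import pandas as pd

class TotalVariationDistance:
    """Illustrative single-column metric: total variation distance
    between the empirical category distributions of two columns.
    (The real SyntheRela metric interface is documented in
    docs/ADDING_A_METRIC.md; this sketch is self-contained.)"""

    name = "TotalVariationDistance"

    def run(self, real_column, synthetic_column):
        # Common category support across both columns.
        categories = sorted(set(real_column) | set(synthetic_column))
        real_freq = np.array([(real_column == c).mean() for c in categories])
        synth_freq = np.array([(synthetic_column == c).mean() for c in categories])
        # TVD is half the L1 distance between the two empirical
        # distributions: 0 = identical, 1 = completely disjoint.
        return 0.5 * np.abs(real_freq - synth_freq).sum()

# Example usage on two small categorical columns.
real = pd.Series(["a", "a", "b", "c"])
synthetic = pd.Series(["a", "b", "b", "b"])
print(TotalVariationDistance().run(real, synthetic))  # 0.5
```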
We maintain an official leaderboard to benchmark synthetic relational data generation methods. To ensure fairness and reproducibility, all evaluations are performed by the SyntheRela maintainers on standardized hardware.
| Feature | Specification |
|---|---|
| Compute | Single NVIDIA H100 (80GB) |
| Time Limit | 48 hours of execution time per dataset |
| Submission Frequency | 1 submission per 30-day period |
| Capacity | Up to 2 model variants/checkpoints per submission |
- Prepare your code: Ensure your method is reproducible and includes a clear `README` and `requirements.txt`.
- Open an Issue: Create a new GitHub Issue using the title prefix `[Model Submission]`.
For the complete requirements regarding environment setup, logging, and our privacy/confidentiality policy, please refer to our Full Submission Guidelines.
The authors declare no conflict of interest and are not associated with any of the evaluated commercial synthetic data providers.
If you use SyntheRela in your work, please cite our paper:
```bibtex
@inproceedings{iclrsyntheticdata2025syntherela,
  title={SyntheRela: A Benchmark For Synthetic Relational Database Generation},
  author={Martin Jurkovic and Valter Hudovernik and Erik {\v{S}}trumbelj},
  booktitle={Will Synthetic Data Finally Solve the Data Access Problem?},
  year={2025},
  url={https://openreview.net/forum?id=ZfQofWYn6n}
}
```
This project is licensed under the MIT License.
