SyntheRela is a comprehensive benchmark designed to evaluate and compare synthetic relational database generation methods. It provides a standardized framework for assessing both the fidelity and utility of synthetic data across multiple real-world databases. The benchmark includes novel evaluation metrics designed specifically for relational data, and supports a range of open-source and commercial synthetic data generation methods.
SyntheRela is highly extensible, allowing users to benchmark on their own custom datasets and implement new evaluation metrics to suit specific use cases.
Our research on SyntheRela is presented in the paper "SyntheRela: A Benchmark For Synthetic Relational Database Generation" at the ICLR 2025 Workshop "Will Synthetic Data Finally Solve the Data Access Problem?", available on OpenReview.
We maintain a public leaderboard on Hugging Face where you can compare the performance of different synthetic data generation methods.
To install only the benchmark package, run the following command:
```bash
pip install syntherela
```

To evaluate your synthetic relational data, configure the `Benchmark` class with your desired metrics and run the evaluation pipeline:

```python
from syntherela.benchmark import Benchmark
from syntherela.metrics.single_column.statistical import ChiSquareTest
from syntherela.metrics.single_table.distance import MaximumMeanDiscrepancy
from syntherela.metrics.multi_table.statistical import CardinalityShapeSimilarity
from syntherela.metrics.multi_table.detection import AggregationDetection
from xgboost import XGBClassifier
# Initialize the benchmark with specific metrics
benchmark = Benchmark(
    real_data_dir="path/to/real_data",
    synthetic_data_dir="path/to/synthetic_data",
    results_dir="results",
    single_column_metrics=[ChiSquareTest()],
    single_table_metrics=[MaximumMeanDiscrepancy()],
    multi_table_metrics=[
        CardinalityShapeSimilarity(),
        AggregationDetection(classifier_cls=XGBClassifier, random_state=42),
    ],
    datasets=["your_dataset_name"],
    methods=["your_method_name"],
)
# Execute evaluation
benchmark.run()
```

We provide example notebooks in the `examples/` directory to help you get started with SyntheRela:
- Evaluating Rossmann Subsampled Dataset: A step-by-step guide to evaluating a subsampled version of the Rossmann dataset using various metrics.
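After `benchmark.run()` completes, the evaluation reports are written to the configured `results_dir`. Below is a minimal sketch of listing them, under the assumption that reports are serialized as JSON files; the exact file layout may differ between SyntheRela versions:

```python
import json
from pathlib import Path

results_dir = Path("results")

# List every saved report and its top-level keys.
# Assumes JSON serialization; adjust the glob pattern if your
# SyntheRela version stores results in a different format.
for report_path in sorted(results_dir.rglob("*.json")):
    with open(report_path) as f:
        report = json.load(f)
    print(report_path.relative_to(results_dir), "->", list(report.keys()))
```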
For detailed instructions on how to replicate the paper's results, please refer to `docs/REPLICATING_RESULTS.md`.
The documentation for adding a new metric can be found in `docs/ADDING_A_METRIC.md`.
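As a rough illustration of what a custom metric can look like, here is a standalone sketch of a distance-style single-column metric computing the total variation distance between two categorical columns. The class shape and `run` signature here are illustrative assumptions; the actual base classes and interface SyntheRela expects are described in `docs/ADDING_A_METRIC.md`:

```python
import numpy as np
import pandas as pd

class TotalVariationDistance:
    """Illustrative single-column metric: total variation distance
    between the empirical category distributions of two columns.
    (The real SyntheRela metric interface is documented in
    docs/ADDING_A_METRIC.md; this sketch is self-contained.)"""

    name = "TotalVariationDistance"

    def run(self, real_column, synthetic_column):
        # Common category support across both columns.
        categories = sorted(set(real_column) | set(synthetic_column))
        real_freq = np.array([(real_column == c).mean() for c in categories])
        synth_freq = np.array([(synthetic_column == c).mean() for c in categories])
        # TVD is half the L1 distance between the two empirical
        # distributions: 0 = identical, 1 = completely disjoint.
        return 0.5 * np.abs(real_freq - synth_freq).sum()

# Example usage on two small categorical columns.
real = pd.Series(["a", "a", "b", "c"])
synthetic = pd.Series(["a", "b", "b", "b"])
print(TotalVariationDistance().run(real, synthetic))  # 0.5
```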
We maintain an official leaderboard to benchmark synthetic relational data generation methods. To ensure fairness and reproducibility, all evaluations are performed by the SyntheRela maintainers on standardized hardware.
| Feature | Specification |
|---|---|
| Compute | Single NVIDIA H100 (80GB) |
| Time Limit | 48 hours of execution time per dataset |
| Submission Frequency | 1 submission per 30-day period |
| Capacity | Up to 2 model variants/checkpoints per submission |
- Prepare your code: Ensure your method is reproducible and includes a clear `README` and `requirements.txt`.
- Open an Issue: Create a new GitHub Issue using the title prefix `[Model Submission]`.
For the complete requirements regarding environment setup, logging, and our privacy/confidentiality policy, please refer to our Full Submission Guidelines.
The authors declare no conflict of interest and are not associated with any of the evaluated commercial synthetic data providers.
If you use SyntheRela in your work, please cite our paper:
```bibtex
@inproceedings{iclrsyntheticdata2025syntherela,
  title={SyntheRela: A Benchmark For Synthetic Relational Database Generation},
  author={Martin Jurkovic and Valter Hudovernik and Erik {\v{S}}trumbelj},
  booktitle={Will Synthetic Data Finally Solve the Data Access Problem?},
  year={2025},
  url={https://openreview.net/forum?id=ZfQofWYn6n}
}
```
This project is licensed under the MIT License.
