PolyGraph is a Python library for evaluating graph generative models. It provides standardized datasets and metrics, including our proposed PolyGraph Discrepancy (PGD). Full documentation for this library can be found here.
PolyGraph Discrepancy is a new metric that offers the following advantages over maximum mean discrepancy (MMD):
| Property | MMD | PGD |
|---|---|---|
| Range | [0, ∞) | [0, 1] |
| Intrinsic Scale | ❌ | ✅ |
| Descriptor Comparison | ❌ | ✅ |
| Multi-Descriptor Aggregation | ❌ | ✅ |
| Single Ranking | ❌ | ✅ |
PGD also provides a number of further advantages over MMD, which we discuss in our paper.
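For intuition, PGD is classifier-based: a binary classifier is trained to distinguish descriptors of reference graphs from those of generated graphs, and its held-out performance is mapped to a score in [0, 1]. The sketch below illustrates this idea with plain logistic regression on synthetic features; it is not the library's implementation, and PGD's exact estimator is described in the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import balanced_accuracy_score
from sklearn.model_selection import train_test_split

# Toy descriptor vectors for "reference" vs. "generated" graphs (synthetic data).
rng = np.random.default_rng(0)
ref_feats = rng.normal(0.0, 1.0, size=(200, 8))
gen_feats = rng.normal(0.5, 1.0, size=(200, 8))

X = np.vstack([ref_feats, gen_feats])
y = np.concatenate([np.zeros(200), np.ones(200)])
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
ba = balanced_accuracy_score(y_te, clf.predict(X_te))

# Chance-level balanced accuracy (0.5) maps to 0 (indistinguishable);
# perfect separation (1.0) maps to 1.
print(max(0.0, 2 * ba - 1.0))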
Install the package from PyPI:

```bash
pip install polygraph-benchmark
```

No manual compilation of ORCA is required. For details on the interaction with `graph_tool`, see the more detailed installation instructions in the docs. If you'd like to use SBM graph dataset validation with `graph_tool`, use a mamba or pixi environment; more information is available in the documentation.
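For example, `graph-tool` is available from conda-forge; one possible setup (an illustration, not necessarily the documented recommended environment):

```bash
# Create an isolated environment with graph-tool from conda-forge
mamba create -n polygraph -c conda-forge python=3.11 graph-tool
mamba activate polygraph
pip install polygraph-benchmark
```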
This library provides the following datasets and metrics:
- 🗂️ Datasets: ready-to-use splits for procedural and real-world graphs
  - Procedural: `PlanarLGraphDataset`, `SBMLGraphDataset`, `LobsterLGraphDataset`
  - Real-world: `QM9`, `MOSES`, `Guacamol`, `DobsonDoigGraphDataset`, `ModelNet10GraphDataset`
  - Also: `EgoGraphDataset`, `PointCloudGraphDataset`
- 📊 Metrics: unified, fit-once/compute-many interface with convenience wrappers, avoiding redundant computations
  - MMD²: `GaussianTVMMD2Benchmark`, `RBFMMD2Benchmark`
  - Kernel hyperparameter optimization with `MaxDescriptorMMD2`
  - PolyGraph Discrepancy: `StandardPGD`, `MolecularPGD` (for molecule descriptors)
  - Validity/Uniqueness/Novelty: `VUN`
  - Uncertainty quantification for benchmarking: `GaussianTVMMD2BenchmarkInterval`, `RBFMMD2BenchmarkInterval`, `StandardPGDInterval`
- 🧩 Extendable: users can instantiate custom metrics by specifying descriptors, kernels, or classifiers (`PolyGraphDiscrepancy`, `DescriptorMMD2`); see the sketch after this list. PolyGraph defines all necessary interfaces but imposes no requirements on the data type of graph objects.
- ⚙️ Interoperability: works on Apple Silicon Macs and Linux.
- ✅ Tested, type-checked, and documented.
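As an example of extendability, the sketch below wires a user-defined descriptor into `PolyGraphDiscrepancy`. The constructor signature and descriptor format are assumptions for illustration only; consult the documentation for the actual interface.

```python
import numpy as np
from polygraph.datasets import PlanarGraphDataset
from polygraph.metrics import PolyGraphDiscrepancy

def mean_degree_descriptor(graphs):
    # Hypothetical descriptor: a single feature (mean degree) per graph.
    return np.array([[2 * g.number_of_edges() / g.number_of_nodes()] for g in graphs])

reference = PlanarGraphDataset("test").to_nx()

# Assumed constructor signature (named descriptor functions); see the docs.
pgd = PolyGraphDiscrepancy(reference, descriptors={"mean_degree": mean_degree_descriptor})
```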
⚠️ Important - Dataset Usage Warning
To help reproduce previous results, we provide the datasets `PlanarGraphDataset`, `SBMGraphDataset`, and `LobsterGraphDataset`. However, these should not be used for benchmarking, because they yield unreliable metric estimates (see our paper for details). Use the larger datasets `PlanarLGraphDataset`, `SBMLGraphDataset`, and `LobsterLGraphDataset` instead.
Our demo script showcases some features of our library in action.
Instantiate a benchmark dataset as follows:

```python
import networkx as nx

from polygraph.datasets import PlanarGraphDataset

reference = PlanarGraphDataset("test").to_nx()

# Let's also generate some graphs coming from another distribution.
generated = [nx.erdos_renyi_graph(64, 0.1) for _ in range(40)]
```

To compute existing MMD² formulations (e.g., based on the TV pseudokernel), use the following:
```python
from polygraph.metrics import GaussianTVMMD2Benchmark  # can also be RBFMMD2Benchmark

gtv_benchmark = GaussianTVMMD2Benchmark(reference)
print(gtv_benchmark.compute(generated))  # {'orbit': ..., 'clustering': ..., 'degree': ..., 'spectral': ...}
```

Similarly, you can compute our proposed PolyGraph Discrepancy like so:
```python
from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference)
print(pgd.compute(generated))  # {'pgd': ..., 'pgd_descriptor': ..., 'subscores': {'orbit': ..., }}
```

`pgd_descriptor` names the best descriptor, i.e., the one used to report the final score.
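Since the reported score comes from the best-performing descriptor, the per-descriptor subscores can be inspected directly (assuming the result dictionary shown above):

```python
result = pgd.compute(generated)
best = result["pgd_descriptor"]
print(best, result["subscores"][best])  # the subscore behind the reported 'pgd'
```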
VUN values follow a similar interface:
```python
from polygraph.metrics import VUN

reference_ds = PlanarGraphDataset("test")
# If applicable, validity functions are defined as a dataset attribute.
vun = VUN(reference, validity_fn=reference_ds.is_valid, confidence_level=0.95)
print(vun.compute(generated))  # {'valid': ..., 'valid_unique_novel': ..., 'valid_novel': ..., 'valid_unique': ...}
```

For MMD and PGD, uncertainty quantification is obtained through subsampling. For VUN, a confidence interval is obtained with a binomial test.
For VUN, confidence intervals are obtained by specifying a confidence level when instantiating the metric; the underlying binomial interval is sketched below.
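As a rough illustration of the binomial test (not VUN's internal code; the counts here are hypothetical), a Clopper-Pearson interval for a validity proportion can be computed with SciPy:

```python
from scipy.stats import binomtest

# Hypothetical counts: 31 of 40 generated graphs are valid.
ci = binomtest(k=31, n=40).proportion_ci(confidence_level=0.95)
print(ci)  # exact (Clopper-Pearson) interval for the validity rate
```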
For MMD² and PGD, classes with the `Interval` suffix implement the subsampling:
```python
from tqdm import tqdm

from polygraph.metrics import (
    GaussianTVMMD2BenchmarkInterval,
    RBFMMD2BenchmarkInterval,
    StandardPGDInterval,
)

# Specify the size of each subsample and the number of subsamples drawn.
metrics = [
    GaussianTVMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
    RBFMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
    StandardPGDInterval(reference, subsample_size=8, num_samples=10),
]

for metric in tqdm(metrics):
    metric_results = metric.compute(generated)
```
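Conceptually, these `Interval` classes repeatedly score random subsamples and aggregate the spread. A rough illustration of that idea, reusing the benchmark from above (this is not the library's internal code):

```python
import random

import numpy as np

# Score several random subsamples and report mean ± standard deviation.
benchmark = GaussianTVMMD2Benchmark(reference)
scores = [
    benchmark.compute(random.sample(generated, 8))["degree"]  # subsample_size = 8
    for _ in range(10)                                        # num_samples = 10
]
print(f"degree MMD2: {np.mean(scores):.4f} +/- {np.std(scores):.4f}")
```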
The following results mirror the tables from our paper. Bold indicates best and underlined indicates second-best. Values are multiplied by 100 for legibility. Standard deviations are obtained by subsampling with `StandardPGDInterval` and `MoleculePGDInterval`; specific parameters are discussed in the paper.
| Method | Planar-L | Lobster-L | SBM-L | Proteins | Guacamol | Moses |
|---|---|---|---|---|---|---|
| AutoGraph | 34.0 ± 1.8 | 18.0 ± 1.6 | 5.6 ± 1.5 | 67.7 ± 7.4 | 22.9 ± 0.5 | 29.6 ± 0.4 |
| AutoGraph* | — | — | — | — | 10.4 ± 1.2 | — |
| DiGress | 45.2 ± 1.8 | 3.2 ± 2.6 | 17.4 ± 2.3 | 88.1 ± 3.1 | 32.7 ± 0.5 | 33.4 ± 0.5 |
| GRAN | 99.7 ± 0.2 | 85.4 ± 0.5 | 69.1 ± 1.4 | 89.7 ± 2.7 | — | — |
| ESGG | 45.0 ± 1.4 | 69.9 ± 0.6 | 99.4 ± 0.2 | 79.2 ± 4.3 | — | — |
* AutoGraph* denotes a variant that leverages additional training heuristics as described in the paper.
The `reproducibility/` directory contains scripts to reproduce all tables and figures from the paper.
```bash
# 1. Install dependencies
pixi install

# 2. Download the graph data (~3GB)
cd reproducibility
python download_data.py

# 3. Generate all tables and figures
make all
```

The generated graph data (~3GB) is hosted on Proton Drive. After downloading, extract it to `data/polygraph_graphs/` in the repository root.
```bash
# Full dataset (required for complete reproducibility)
python download_data.py

# Small subset for testing/CI (~50 graphs per model)
python download_data.py --subset
```

Expected data structure after extraction:
```
data/polygraph_graphs/
├── AUTOGRAPH/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   └── proteins.pkl
├── DIGRESS/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   ├── proteins.pkl
│   ├── denoising-iterations/
│   │   └── {15,30,45,60,75,90}_steps.pkl
│   └── training-iterations/
│       └── {119,209,...,3479}_steps.pkl
├── ESGG/
│   └── *.pkl
├── GRAN/
│   └── *.pkl
└── molecule_eval/
    └── *.smiles
```
| Script | Output | Description |
|---|---|---|
| `generate_benchmark_tables.py` | `tables/benchmark_results.tex` | Main PGD benchmark (Table 1) comparing AutoGraph, DiGress, GRAN, ESGG |
| `generate_mmd_tables.py` | `tables/mmd_gtv.tex`, `tables/mmd_rbf_biased.tex` | MMD² metrics with GTV and RBF kernels |
| `generate_gklr_tables.py` | `tables/gklr.tex` | PGD with kernel logistic regression using WL and SP kernels |
| `generate_concatenation_tables.py` | `tables/concatenation.tex` | Ablation comparing individual vs. concatenated descriptors |
| Script | Output | Description |
|---|---|---|
| `generate_subsampling_figures.py` | `figures/subsampling/` | Bias-variance tradeoff as a function of sample size |
| `generate_perturbation_figures.py` | `figures/perturbation/` | Metric sensitivity to edge perturbations |
| `generate_model_quality_figures.py` | `figures/model_quality/` | PGD vs. training/denoising steps for DiGress |
| `generate_phase_plot.py` | `figures/phase_plot/` | Training dynamics showing PGD vs. VUN |
Each script can be run independently with `--subset` for quick testing:
```bash
# Tables (full computation)
python generate_benchmark_tables.py
python generate_mmd_tables.py
python generate_gklr_tables.py
python generate_concatenation_tables.py

# Tables (quick testing with --subset)
python generate_benchmark_tables.py --subset
python generate_mmd_tables.py --subset

# Figures (full computation)
python generate_subsampling_figures.py
python generate_perturbation_figures.py
python generate_model_quality_figures.py
python generate_phase_plot.py

# Figures (quick testing)
python generate_subsampling_figures.py --subset
python generate_perturbation_figures.py --subset
```

The Makefile exposes the following targets:

```bash
make download        # Download full dataset (manual step required)
make download-subset # Create small subset for CI testing
make tables          # Generate all LaTeX tables
make figures         # Generate all figures
make all             # Generate everything
make tables-submit   # Submit table jobs to SLURM cluster
make tables-collect  # Collect results from completed SLURM jobs
make clean           # Remove generated outputs
make help            # Show available targets
```

Resource requirements:

- Memory: 16GB RAM recommended for the full dataset
- Storage: ~4GB for data + outputs
- Time: Full generation takes ~2-4 hours on a modern CPU
The `--subset` flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).
Table generation scripts support SLURM cluster submission via `submitit`. Install the cluster extras first:

```bash
pip install -e ".[cluster]"
```

SLURM parameters are configured in YAML files (see `reproducibility/configs/slurm_default.yaml`):
```yaml
slurm:
  partition: "cpu"
  timeout_min: 360
  cpus_per_task: 8
  mem_gb: 32
```

Submit jobs, then collect results after completion:
```bash
cd reproducibility

# Submit all table jobs to SLURM
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml

# After jobs complete, collect results and generate tables
python generate_benchmark_tables.py --collect

# Or use Make targets
make tables-submit                                      # submit all
make tables-submit SLURM_CONFIG=configs/my_cluster.yaml # custom config
make tables-collect                                     # collect all
```

Use `--local` with `--slurm-config` to test the submission pipeline in-process without SLURM.
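For example, combining the flags shown above:

```bash
# Run the SLURM submission pipeline in-process, without a cluster
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml --local
```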
- **Memory issues:** use the `--subset` flag for testing, process one dataset at a time, or increase system swap space.
- **Missing data:** verify that `data/polygraph_graphs/` exists in the repo root, run `python download_data.py` to check data status, or download manually from Proton Drive.
- **TabPFN issues:** TabPFN is pinned to v2.0.0 for reproducibility: `pip install tabpfn==2.0.0`.
To cite our paper:
```bibtex
@misc{krimmel2025polygraph,
  title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
  author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
  year={2025},
  eprint={2510.06122},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.06122},
}
```