BorgwardtLab/polygraph-benchmark

PolyGraph is a Python library for evaluating graph generative models by providing standardized datasets and metrics (including PolyGraph Discrepancy). Full documentation for this library can be found here.

PolyGraph Discrepancy (PGD) is a new metric we introduced. It offers the following advantages over maximum mean discrepancy (MMD):

Property MMD PGD
Range [0, ∞) [0, 1]
Intrinsic Scale
Descriptor Comparison
Multi-Descriptor Aggregation
Single Ranking

It also provides a number of other advantages over MMD which we discuss in our paper.
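
For readers unfamiliar with MMD, the following is a minimal, generic sketch of the biased MMD² estimator with a Gaussian (RBF) kernel. It is a textbook illustration, not PolyGraph's implementation; the function names, the bandwidth, and the toy descriptor values are all assumptions made for this example.

```python
import math


def rbf_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two descriptor vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mean_kernel(xs, ys, kernel):
    """Average kernel value over all pairs (x, y)."""
    return sum(kernel(x, y) for x in xs for y in ys) / (len(xs) * len(ys))


def biased_mmd2(xs, ys, kernel=rbf_kernel):
    """Biased (V-statistic) estimate of the squared MMD between two samples."""
    return (
        mean_kernel(xs, xs, kernel)
        + mean_kernel(ys, ys, kernel)
        - 2.0 * mean_kernel(xs, ys, kernel)
    )


# Toy 2-dimensional descriptors (made-up values, for illustration only).
reference_desc = [(0.0, 1.0), (0.1, 0.9), (0.2, 1.1)]
generated_desc = [(1.0, 0.0), (0.9, 0.2), (1.1, 0.1)]

print(biased_mmd2(reference_desc, reference_desc))  # identical samples
print(biased_mmd2(reference_desc, generated_desc))  # shifted samples
```

The resulting value changes with the kernel and its bandwidth, which makes raw MMD values hard to compare across descriptors.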

Installation

pip install polygraph-benchmark

No manual compilation of ORCA is required. For details on interaction with graph_tool, see the more detailed installation instructions in the docs.

If you'd like to use SBM graph dataset validation with graph_tool, use a mamba or pixi environment. More information is available in the documentation.

At a glance

The library provides the following datasets and metrics:

  • 🗂️ Datasets: ready-to-use splits for procedural and real-world graphs
    • Procedural datasets: PlanarLGraphDataset, SBMLGraphDataset, LobsterLGraphDataset
    • Real-world: QM9, MOSES, Guacamol, DobsonDoigGraphDataset, ModelNet10GraphDataset
    • Also: EgoGraphDataset, PointCloudGraphDataset
  • 📊 Metrics: unified, fit-once/compute-many interface with convenience wrappers, avoiding redundant computations.
    • MMD2: GaussianTVMMD2Benchmark, RBFMMD2Benchmark
    • Kernel hyperparameter optimization with MaxDescriptorMMD2.
    • PolyGraphDiscrepancy: StandardPGD, MolecularPGD (for molecule descriptors).
    • Validation/Uniqueness/Novelty: VUN.
    • Uncertainty quantification for benchmarking (GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval)
  • 🧩 Extendable: Users can instantiate custom metrics by specifying descriptors, kernels, or classifiers (PolyGraphDiscrepancy, DescriptorMMD2). PolyGraph defines all necessary interfaces but imposes no requirements on the data type of graph objects.
  • ⚙️ Interoperability: Works on Apple Silicon Macs and Linux.
  • Tested, type-checked and documented

⚠️ Important - Dataset Usage Warning

To help reproduce previous results, we provide the following datasets:

  • PlanarGraphDataset
  • SBMGraphDataset
  • LobsterGraphDataset

However, they should not be used for benchmarking because they yield unreliable metric estimates (see our paper for more details).

We provide larger datasets that should be used instead:

  • PlanarLGraphDataset
  • SBMLGraphDataset
  • LobsterLGraphDataset

Tutorial

Our demo script showcases some features of our library in action.

Datasets

Instantiate a benchmark dataset as follows:

import networkx as nx
from polygraph.datasets import PlanarGraphDataset

reference = PlanarGraphDataset("test").to_nx()

# Let's also generate some graphs coming from another distribution.
generated = [nx.erdos_renyi_graph(64, 0.1) for _ in range(40)]

Metrics

Maximum Mean Discrepancy

To compute existing MMD2 formulations (e.g. based on the TV pseudokernel), one can use the following:

from polygraph.metrics import GaussianTVMMD2Benchmark # Can also be RBFMMD2Benchmark

gtv_benchmark = GaussianTVMMD2Benchmark(reference)

print(gtv_benchmark.compute(generated))  # {'orbit': ..., 'clustering': ..., 'degree': ..., 'spectral': ...}
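
To give an idea of what a descriptor and the Gaussian TV pseudokernel look like, here is a small standard-library sketch for the degree descriptor. It assumes graphs are given as edge lists over nodes 0..n-1, and the helper names and the sigma value are hypothetical; the library's own descriptors and kernel settings may differ.

```python
import math
from collections import Counter


def degree_histogram(edges, num_nodes):
    """Normalized histogram of node degrees for an undirected simple graph."""
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree[node] for node in range(num_nodes))
    return [counts[d] / num_nodes for d in range(max(counts) + 1)]


def total_variation(p, q):
    """Total variation distance between histograms, zero-padded to equal length."""
    size = max(len(p), len(q))
    p = list(p) + [0.0] * (size - len(p))
    q = list(q) + [0.0] * (size - len(q))
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def gaussian_tv(p, q, sigma=1.0):
    """Gaussian TV pseudokernel: exp(-TV(p, q)^2 / (2 * sigma^2))."""
    return math.exp(-total_variation(p, q) ** 2 / (2.0 * sigma ** 2))


path = degree_histogram([(0, 1), (1, 2), (2, 3)], num_nodes=4)  # degrees 1, 2, 2, 1
star = degree_histogram([(0, 1), (0, 2), (0, 3)], num_nodes=4)  # degrees 3, 1, 1, 1

print(total_variation(path, star))
print(gaussian_tv(path, star))
```

Plugging such a kernel into an MMD² estimator, once per descriptor, gives one value per descriptor, matching the shape of the dictionary printed above.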

PolyGraphDiscrepancy

Similarly, you can compute our proposed PolyGraphDiscrepancy, like so:

from polygraph.metrics import StandardPGD

pgd = StandardPGD(reference)
print(pgd.compute(generated)) # {'pgd': ..., 'pgd_descriptor': ..., 'subscores': {'orbit': ..., }}

The pgd_descriptor entry names the best descriptor, i.e. the one used to report the final score.
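
The returned dictionary can be inspected with plain Python. The sketch below uses a hypothetical result with the shape shown in the comment above: the numbers and the descriptor names other than 'orbit' are made up, and it assumes the reported descriptor is the one with the largest subscore (please check the paper for the exact definition).

```python
# Hypothetical result with the shape shown above; all numbers are made up.
result = {
    "pgd": 0.42,
    "pgd_descriptor": "orbit",
    "subscores": {"orbit": 0.42, "clustering": 0.31, "degree": 0.18, "spectral": 0.27},
}

# Rank descriptors from the largest to the smallest subscore.
ranking = sorted(result["subscores"].items(), key=lambda item: item[1], reverse=True)

for name, score in ranking:
    marker = "  <- reported" if name == result["pgd_descriptor"] else ""
    print(f"{name:<12}{score:.2f}{marker}")
```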

Validity, uniqueness and novelty

VUN values follow a similar interface:

from polygraph.datasets import PlanarGraphDataset
from polygraph.metrics import VUN

reference_ds = PlanarGraphDataset("test")
# Where applicable, the validity function is available as a dataset attribute.
vun = VUN(reference, validity_fn=reference_ds.is_valid, confidence_level=0.95)
print(vun.compute(generated))  # {'valid': ..., 'valid_unique_novel': ..., 'valid_novel': ..., 'valid_unique': ...}
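
Conceptually, the four reported fractions can be sketched as follows. This is a simplified illustration, not the library's implementation: graphs are encoded as hashable edge sets and compared by exact equality, whereas a real implementation would compare graphs up to isomorphism. The helper names, the toy validity rule, and the convention that only the first occurrence of a duplicated graph counts as unique are assumptions.

```python
def vun_scores(generated, reference, is_valid):
    """Fractions of valid, valid-unique, valid-novel and valid-unique-novel graphs."""
    reference_keys = set(reference)
    seen = set()
    counts = {"valid": 0, "valid_unique": 0, "valid_novel": 0, "valid_unique_novel": 0}
    for graph in generated:
        valid = is_valid(graph)
        unique = graph not in seen           # first occurrence among generated graphs
        novel = graph not in reference_keys  # not present in the reference set
        seen.add(graph)
        counts["valid"] += valid
        counts["valid_unique"] += valid and unique
        counts["valid_novel"] += valid and novel
        counts["valid_unique_novel"] += valid and unique and novel
    return {name: count / len(generated) for name, count in counts.items()}


def graph(*edges):
    """Encode an undirected graph as a frozenset of sorted edge tuples."""
    return frozenset(tuple(sorted(edge)) for edge in edges)


def has_no_self_loops(g):
    """Toy validity rule used only for this example."""
    return all(u != v for u, v in g)


toy_reference = [graph((0, 1), (1, 2))]
toy_generated = [
    graph((0, 1), (1, 2)),  # valid, unique, but already in the reference
    graph((0, 1), (0, 2)),  # valid, unique, novel
    graph((0, 2), (0, 1)),  # duplicate of the previous graph
    graph((0, 0)),          # invalid under the toy rule
]

scores = vun_scores(toy_generated, toy_reference, has_no_self_loops)
print(scores)
```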

Metric uncertainty quantification

For MMD and PGD, uncertainty quantification is obtained through subsampling. For VUN, a confidence interval is obtained with a binomial test.

For VUN, the results can be obtained by specifying a confidence level when instantiating the metric.
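
As a rough illustration of what a confidence interval for a proportion such as a VUN fraction looks like, the sketch below computes a Wilson score interval. This is a swapped-in, commonly used construction chosen because it needs only the standard library; the interval the library actually computes may be built differently, and the counts in the example are made up.

```python
import math
from statistics import NormalDist


def wilson_interval(successes, trials, confidence_level=0.95):
    """Wilson score interval for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    # Two-sided critical value of the standard normal distribution.
    z = NormalDist().inv_cdf(0.5 + confidence_level / 2.0)
    p_hat = successes / trials
    denom = 1.0 + z ** 2 / trials
    center = (p_hat + z ** 2 / (2.0 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1.0 - p_hat) / trials + z ** 2 / (4.0 * trials ** 2)) / denom
    )
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Made-up example: 30 of 40 generated graphs pass the check.
low, high = wilson_interval(successes=30, trials=40)
print(f"point estimate 0.750, 95% interval [{low:.3f}, {high:.3f}]")
```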

For the other metrics, the classes with the Interval suffix implement subsampling.

from polygraph.metrics import GaussianTVMMD2BenchmarkInterval, RBFMMD2BenchmarkInterval, StandardPGDInterval
from tqdm import tqdm

metrics = [
  GaussianTVMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10), # specify size of each subsample, and the number of samples
  RBFMMD2BenchmarkInterval(reference, subsample_size=8, num_samples=10),
  StandardPGDInterval(reference, subsample_size=8, num_samples=10)
]

for metric in tqdm(metrics):
    metric_results = metric.compute(generated)
    print(type(metric).__name__, metric_results)
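
The idea behind subsampling can be sketched generically: evaluate the metric on several random subsamples and summarize the spread. The code below is a standard-library illustration under assumptions, not the library's implementation; in particular, whether both the reference and the generated set are subsampled, and how the results are aggregated, are guesses, and the toy metric and data are made up.

```python
import random
import statistics


def subsampled_metric(metric_fn, reference, generated, subsample_size, num_samples, seed=0):
    """Mean and standard deviation of a metric over random subsamples."""
    rng = random.Random(seed)
    values = []
    for _ in range(num_samples):
        ref_sub = rng.sample(reference, subsample_size)  # without replacement
        gen_sub = rng.sample(generated, subsample_size)
        values.append(metric_fn(ref_sub, gen_sub))
    return statistics.mean(values), statistics.stdev(values)


def mean_gap(xs, ys):
    """Toy metric on scalar 'descriptors': absolute difference of sample means."""
    return abs(statistics.mean(xs) - statistics.mean(ys))


reference_values = [float(i % 10) for i in range(40)]
generated_values = [float(i % 10) + 2.0 for i in range(40)]

mean, std = subsampled_metric(
    mean_gap, reference_values, generated_values, subsample_size=8, num_samples=10
)
print(f"{mean:.3f} ± {std:.3f}")
```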

Example Benchmark

The following results mirror the tables from our paper. Bold indicates best, and underlined indicates second-best. Values are multiplied by 100 for legibility. Standard deviations are obtained with subsampling using StandardPGDInterval and MoleculePGDInterval. Specific parameters are discussed in the paper.

Method Planar-L Lobster-L SBM-L Proteins Guacamol Moses
AutoGraph 34.0 ± 1.8 18.0 ± 1.6 5.6 ± 1.5 67.7 ± 7.4 22.9 ± 0.5 29.6 ± 0.4
AutoGraph* 10.4 ± 1.2
DiGress 45.2 ± 1.8 3.2 ± 2.6 17.4 ± 2.3 88.1 ± 3.1 32.7 ± 0.5 33.4 ± 0.5
GRAN 99.7 ± 0.2 85.4 ± 0.5 69.1 ± 1.4 89.7 ± 2.7
ESGG 45.0 ± 1.4 69.9 ± 0.6 99.4 ± 0.2 79.2 ± 4.3

* AutoGraph* denotes a variant that leverages additional training heuristics as described in the paper.
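
For reference, a table entry can be produced from a raw mean and standard deviation as in the short sketch below. The helper name is hypothetical, and the raw inputs are simply the AutoGraph Planar-L entry divided by 100.

```python
def format_cell(mean, std, scale=100.0):
    """Format 'mean ± std' after multiplying both values by `scale`."""
    return f"{scale * mean:.1f} ± {scale * std:.1f}"


print(format_cell(0.340, 0.018))  # prints "34.0 ± 1.8"
```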

Reproducibility

The reproducibility/ directory contains scripts to reproduce all tables and figures from the paper.

Quick Start

# 1. Install dependencies
pixi install

# 2. Download the graph data (~3GB)
cd reproducibility
python download_data.py

# 3. Generate all tables and figures
make all

Data Download

The generated graph data (~3GB) is hosted on Proton Drive. After downloading, extract to data/polygraph_graphs/ in the repository root.

# Full dataset (required for complete reproducibility)
python download_data.py

# Small subset for testing/CI (~50 graphs per model)
python download_data.py --subset

Expected data structure after extraction:

data/polygraph_graphs/
├── AUTOGRAPH/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   └── proteins.pkl
├── DIGRESS/
│   ├── planar.pkl
│   ├── lobster.pkl
│   ├── sbm.pkl
│   ├── proteins.pkl
│   ├── denoising-iterations/
│   │   └── {15,30,45,60,75,90}_steps.pkl
│   └── training-iterations/
│       └── {119,209,...,3479}_steps.pkl
├── ESGG/
│   └── *.pkl
├── GRAN/
│   └── *.pkl
└── molecule_eval/
    └── *.smiles
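
A quick way to check the extracted layout is a small script along the following lines. This is a hypothetical helper, not part of the repository; it only checks the files and directories that are named explicitly in the tree above, and it may report missing entries when only the subset was downloaded.

```python
from pathlib import Path

# File and directory names taken from the directory tree above.
EXPECTED_FILES = {
    "AUTOGRAPH": ["planar.pkl", "lobster.pkl", "sbm.pkl", "proteins.pkl"],
    "DIGRESS": ["planar.pkl", "lobster.pkl", "sbm.pkl", "proteins.pkl"],
}
EXPECTED_DIRS = [
    "DIGRESS/denoising-iterations",
    "DIGRESS/training-iterations",
    "ESGG",
    "GRAN",
    "molecule_eval",
]


def missing_entries(root):
    """Return the expected files and directories that are absent under `root`."""
    root = Path(root)
    missing = []
    for model, names in EXPECTED_FILES.items():
        missing += [f"{model}/{name}" for name in names if not (root / model / name).is_file()]
    missing += [f"{name}/" for name in EXPECTED_DIRS if not (root / name).is_dir()]
    return missing


problems = missing_entries("data/polygraph_graphs")
print("layout looks complete" if not problems else "missing: " + ", ".join(problems))
```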

Scripts Overview

Table Generation

| Script | Output | Description |
|---|---|---|
| generate_benchmark_tables.py | tables/benchmark_results.tex | Main PGD benchmark (Table 1) comparing AUTOGRAPH, DiGress, GRAN, ESGG |
| generate_mmd_tables.py | tables/mmd_gtv.tex, tables/mmd_rbf_biased.tex | MMD² metrics with GTV and RBF kernels |
| generate_gklr_tables.py | tables/gklr.tex | PGD with Kernel Logistic Regression using WL and SP kernels |
| generate_concatenation_tables.py | tables/concatenation.tex | Ablation comparing individual vs concatenated descriptors |

Figure Generation

| Script | Output | Description |
|---|---|---|
| generate_subsampling_figures.py | figures/subsampling/ | Bias-variance tradeoff as function of sample size |
| generate_perturbation_figures.py | figures/perturbation/ | Metric sensitivity to edge perturbations |
| generate_model_quality_figures.py | figures/model_quality/ | PGD vs training/denoising steps for DiGress |
| generate_phase_plot.py | figures/phase_plot/ | Training dynamics showing PGD vs VUN |

Each script can be run independently with --subset for quick testing:

# Tables (full computation)
python generate_benchmark_tables.py
python generate_mmd_tables.py
python generate_gklr_tables.py
python generate_concatenation_tables.py

# Tables (quick testing with --subset)
python generate_benchmark_tables.py --subset
python generate_mmd_tables.py --subset

# Figures (full computation)
python generate_subsampling_figures.py
python generate_perturbation_figures.py
python generate_model_quality_figures.py
python generate_phase_plot.py

# Figures (quick testing)
python generate_subsampling_figures.py --subset
python generate_perturbation_figures.py --subset

Make Targets

make download        # Download full dataset (manual step required)
make download-subset # Create small subset for CI testing
make tables          # Generate all LaTeX tables
make figures         # Generate all figures
make all             # Generate everything
make tables-submit   # Submit table jobs to SLURM cluster
make tables-collect  # Collect results from completed SLURM jobs
make clean           # Remove generated outputs
make help            # Show available targets

Hardware Requirements

  • Memory: 16GB RAM recommended for full dataset
  • Storage: ~4GB for data + outputs
  • Time: Full generation takes ~2-4 hours on a modern CPU

The --subset flag uses ~50 graphs per model, runs in minutes, and verifies code correctness (results are not publication-quality).

Cluster Submission

Table generation scripts support SLURM cluster submission via submitit. Install the cluster extras first:

pip install -e ".[cluster]"

SLURM parameters are configured in YAML files (see reproducibility/configs/slurm_default.yaml):

slurm:
  partition: "cpu"
  timeout_min: 360
  cpus_per_task: 8
  mem_gb: 32
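
For orientation, the sketch below shows how such a configuration section could be translated into keyword arguments for submitit. The mapping function is hypothetical and the repository's scripts may handle this differently; the only submitit-specific detail relied on is that AutoExecutor takes SLURM-only options with a slurm_ prefix, such as slurm_partition.

```python
def to_submitit_parameters(slurm_config):
    """Translate the `slurm:` section into keyword arguments for submitit."""
    return {
        "slurm_partition": slurm_config["partition"],
        "timeout_min": slurm_config["timeout_min"],
        "cpus_per_task": slurm_config["cpus_per_task"],
        "mem_gb": slurm_config["mem_gb"],
    }


# Values copied from the YAML fragment above.
config = {"partition": "cpu", "timeout_min": 360, "cpus_per_task": 8, "mem_gb": 32}
parameters = to_submitit_parameters(config)
print(parameters)

# With submitit installed, the parameters would typically be applied like this:
#   executor = submitit.AutoExecutor(folder="slurm_logs")  # hypothetical log folder
#   executor.update_parameters(**parameters)
```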

Submit jobs, then collect results after completion:

cd reproducibility

# Submit all table jobs to SLURM
python generate_benchmark_tables.py --slurm-config configs/slurm_default.yaml

# After jobs complete, collect results and generate tables
python generate_benchmark_tables.py --collect

# Or use Make targets
make tables-submit                                        # submit all
make tables-submit SLURM_CONFIG=configs/my_cluster.yaml   # custom config
make tables-collect                                       # collect all

Use --local with --slurm-config to test the submission pipeline in-process without SLURM.

Troubleshooting

Memory issues: Use --subset flag for testing, process one dataset at a time, or increase system swap space.

Missing data: Verify data/polygraph_graphs/ exists in repo root, run python download_data.py to check data status, or download manually from Proton Drive.

TabPFN issues: TabPFN is pinned to v2.0.0 for reproducibility: pip install tabpfn==2.0.0.

Citing

To cite our paper:

@misc{krimmel2025polygraph,
  title={PolyGraph Discrepancy: a classifier-based metric for graph generation},
  author={Markus Krimmel and Philip Hartout and Karsten Borgwardt and Dexiong Chen},
  year={2025},
  eprint={2510.06122},
  archivePrefix={arXiv},
  primaryClass={cs.LG},
  url={https://arxiv.org/abs/2510.06122},
}

About

Benchmarking framework for graph generative models (ICLR 2026)
