dwsmith1983/spark-bestfit

spark-bestfit


Modern distribution fitting library with pluggable backends (Spark, Ray, Local)

Efficiently fit ~90 scipy.stats distributions to your data using parallel processing. Supports Apache Spark for production clusters, Ray for ML workflows, or local execution for development.

Features

  • Parallel Processing: Spark, Ray, or local thread backends
  • Broad Coverage: ~90 continuous + 16 discrete distributions
  • Multiple Metrics: K-S, A-D, SSE, AIC, BIC
  • Bounded Fitting: Truncated distributions with natural bounds
  • Heavy-Tail Detection: Warns when data may need special handling
  • Gaussian Copula: Correlated multi-column sampling
  • Model Serialization: Save/load to JSON or pickle
  • FitterConfig Builder: Fluent API for complex configurations

Full feature list at spark-bestfit.readthedocs.io
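The K-S metric listed above can be illustrated without the library. This numpy-only sketch (not spark-bestfit code) fits a normal by maximum likelihood and computes the one-sample Kolmogorov-Smirnov statistic against the fitted CDF, which is what a low `ks_statistic` in the results summarizes:

```python
import math
import numpy as np

def ks_statistic(data, cdf):
    """One-sample two-sided Kolmogorov-Smirnov statistic against a given CDF."""
    x = np.sort(np.asarray(data, dtype=float))
    n = len(x)
    ecdf_hi = np.arange(1, n + 1) / n  # ECDF just after each data point
    ecdf_lo = np.arange(0, n) / n      # ECDF just before each data point
    f = cdf(x)
    return max(np.max(ecdf_hi - f), np.max(f - ecdf_lo))

def normal_cdf(x, mu, sigma):
    erf = np.vectorize(math.erf)
    return 0.5 * (1 + erf((x - mu) / (sigma * math.sqrt(2))))

rng = np.random.default_rng(0)
data = rng.normal(50, 10, 5000)
mu, sigma = data.mean(), data.std()  # MLE fit for a normal
d = ks_statistic(data, lambda x: normal_cdf(x, mu, sigma))
print(f"KS statistic: {d:.4f}")      # small D indicates a good fit
```

spark-bestfit parallelizes this fit-then-score loop across its candidate distributions rather than running them one at a time.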

Installation

pip install spark-bestfit              # Core (BYO Spark)
pip install spark-bestfit[spark]       # With PySpark
pip install spark-bestfit[ray]         # With Ray
pip install spark-bestfit[plotting]    # With visualization

Quick Start

from spark_bestfit import DistributionFitter
import numpy as np
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Generate sample data and load it into a Spark DataFrame
data = np.random.normal(loc=50, scale=10, size=10_000)
df = spark.createDataFrame([(float(x),) for x in data], ["value"])

# Fit candidate distributions in parallel and rank the results
fitter = DistributionFitter(spark)
results = fitter.fit(df, column="value")

best = results.best(n=1)[0]
print(f"Best: {best.distribution} (KS={best.ks_statistic:.4f})")

Without Spark:

from spark_bestfit import DistributionFitter, LocalBackend
import numpy as np
import pandas as pd

# LocalBackend runs fits on local threads -- no Spark session required
df = pd.DataFrame({"value": np.random.normal(50, 10, 1000)})
fitter = DistributionFitter(backend=LocalBackend())
results = fitter.fit(df, column="value")
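The AIC and BIC metrics the fitter reports can be reproduced by hand as a sanity check. The sketch below (plain numpy, not spark-bestfit code) fits a normal by maximum likelihood and computes both criteria from the log-likelihood:

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(50, 10, 2000)

# MLE for a normal: sample mean and (biased) standard deviation
mu, sigma = data.mean(), data.std()
n, k = len(data), 2  # k = number of fitted parameters (loc, scale)

# Log-likelihood of the data under the fitted normal
loglik = np.sum(
    -0.5 * np.log(2 * np.pi * sigma**2) - (data - mu) ** 2 / (2 * sigma**2)
)

aic = 2 * k - 2 * loglik          # penalizes parameters linearly
bic = k * np.log(n) - 2 * loglik  # penalizes parameters by log(n)
print(f"AIC={aic:.1f}  BIC={bic:.1f}")
```

Lower values are better for both criteria; BIC penalizes extra parameters more heavily than AIC once n exceeds about 8.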

Backends

Backend        Use Case                         Install
SparkBackend   Production clusters, 100M+ rows  [spark] or BYO
LocalBackend   Development, testing             Included
RayBackend     Ray clusters, ML pipelines       [ray]

See Backend Guide for configuration details.

Compatibility

Spark   Python       NumPy
3.5.x   3.11-3.12    < 2.0
4.x     3.12-3.13    2.0+
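The matrix above can be encoded as a small helper for pre-flight checks in CI. This is a hypothetical utility, not part of spark-bestfit:

```python
def compatible(spark_version, python_version, numpy_major):
    """Return True if a (Spark, Python, NumPy) combination matches
    the compatibility matrix above.

    spark_version: version string, e.g. "3.5.1"
    python_version: (major, minor) tuple, e.g. (3, 12)
    numpy_major: int major version of NumPy, e.g. 1 or 2
    """
    if spark_version.startswith("3.5"):
        return python_version in {(3, 11), (3, 12)} and numpy_major < 2
    if spark_version.startswith("4."):
        return python_version in {(3, 12), (3, 13)} and numpy_major >= 2
    return False

print(compatible("3.5.1", (3, 11), 1))  # Spark 3.5 needs NumPy < 2.0
print(compatible("4.0.0", (3, 12), 2))  # Spark 4.x needs NumPy 2.0+
print(compatible("3.5.1", (3, 13), 2))  # unsupported combination
```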

Documentation

Full documentation is available at spark-bestfit.readthedocs.io.

Contributing

Contributions welcome! See Contributing Guide.

License

MIT License - see LICENSE for details.

About

Efficiently fit ~90 scipy.stats distributions to your data using Spark's parallel processing with optimized Pandas UDFs and broadcast variables.
