Skip to content

Tugbars/VectorFFT

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

1,338 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

image

A split-layout Fast Fourier Transform library in C, with AVX2/AVX-512 kernels generated by a DAG FFT compiler. It covers complex (C2C), real (R2C/C2R), 2D, and DSP transforms (DCT/DST/DHT) — in-place and out-of-place, single- and multi-threaded — behind one calibrated wisdom planner that picks the fastest plan per transform.

It beats Intel MKL on all 238 tested 1D C2C cells single- and multi-threaded, and wins on real-FFT and 2D transforms under multithreading. Pure C, no external dependencies.


Benchmark Results

For the full v1.0 performance picture — vs MKL, vs FFTW3, multi-threaded scaling, DCT/DST/DHT, cost-model accuracy, per-codelet VTune profiles, hardware caveats — see docs/performance/v1_0_results.md.

Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitor: Intel MKL 2025 (sequential, mkl_set_num_threads(1))
238 data points across 9 categories, 3 batch sizes (K=4, K=32, K=256), N=8 to N=823,543

1D FFT Throughput — VectorFFT vs Intel MKL

Throughput

Three panels showing GFLOP/s at each batch size. Blue = VectorFFT, Red = MKL. Different marker shapes per category. VectorFFT sits above MKL across the board.

Category Cells Median Best Win Closest MKL Gets
Small pow2 (8-128) 15 4.28x 15.33x (N=8, K=4) 2.60x (N=128, K=4)
Power-of-2 (256-131K) 29 1.86x 3.04x (N=256, K=32) 1.10x (N=16384, K=4)
Composite 43 2.85x 4.51x (N=200, K=32) 1.62x (N=10000, K=4)
Prime powers (3,5,7) 25 2.69x 4.16x (N=2401, K=256) 1.67x (N=2401, K=4)
Prime powers (R=11,13) 17 2.79x 3.75x (N=169, K=32) 1.65x (N=2197, K=256)
Rader primes 24 2.34x 3.85x (N=641, K=32) 1.29x (N=4001, K=32)
Bluestein primes 24 1.55x 3.52x (N=83, K=4) 1.02x (N=179, K=256)
Odd composites 26 3.47x 5.16x (N=119, K=32) 2.26x (N=6615, K=256)
Mixed deep 35 2.71x 5.78x (N=60, K=32) 1.66x (N=126, K=4)

Speedup over Intel MKL — All Categories

Speedup

Every point above the dashed line is a VectorFFT win. Marker size indicates batch count (small=K=4, medium=K=32, large=K=256). All 238 points are above parity.

Combined Dense Scatter — All Sizes & Batch Counts

Scatter

All 238 data points overlaid. Blue cloud (VectorFFT) consistently above red cloud (MKL). Peak throughput at small N with large K where codelets run entirely from L1.

Highlight Results

N K Category Factors VectorFFT MKL Speedup
8 4 small 8 75.7 GF/s 4.9 GF/s 15.33x
60 32 mixed_deep 5x12 89.2 GF/s 15.4 GF/s 5.78x
119 32 odd_comp 17x7 50.7 GF/s 9.8 GF/s 5.16x
200 32 composite 5x8x5 69.6 GF/s 15.4 GF/s 4.51x
2401 256 prime_pow (7^4) 7x7x7x7 44.7 GF/s 10.7 GF/s 4.16x
169 32 genfft (13^2) 13x13 46.3 GF/s 12.3 GF/s 3.75x
641 32 rader (override) 16.8 GF/s 4.4 GF/s 3.85x
83 4 bluestein (override) 7.0 GF/s 2.0 GF/s 3.52x
256 32 pow2 4x8x8 69.7 GF/s 22.9 GF/s 3.04x
390625 4 prime_pow (5^8) 25x25x25x25 32.3 GF/s 11.1 GF/s 2.91x
16384 4 pow2 8x8x16x16 27.2 GF/s 24.7 GF/s 1.10x

2D FFT — Tiled SIMD Transpose

VectorFFT's 2D FFT uses a tiled gather/scatter approach with cache-oblivious SIMD transpose kernels (8x4 line-filling + 4x4 AVX2). Beats MKL at all tested sizes.

Size VectorFFT MKL Speedup
32x32 0.9 us 1.5 us 1.63x
64x64 5.5 us 6.5 us 1.18x
128x128 30.3 us 33.4 us 1.10x
256x256 127.3 us 145.6 us 1.14x
512x512 875.1 us 948.1 us 1.08x
1024x1024 3,900 us 5,512 us 1.41x
100x200 40.7 us 60.9 us 1.50x

Multi-threaded 2D is tile-parallel on the row pass + K-split on the columns (per-thread scratch, zero barriers). A full single- and multi-threaded 2D benchmark — all sizes, head-to-head vs MKL — is in progress; those numbers will be published with that run.

Accuracy

Precision

Strict roundtrip error — max |fwd→bwd / N − x| / max|x|, the worst single element across all N·K outputs after a full forward + backward — across all tested 1D cells. Errors track the theoretical O(log₂N · ε) bound (FP64 ε = 2.2e-16); every cell holds ~14 correct digits.

Category Min Error Max Error
pow2 small (8-128) 2.5e-16 1.3e-14
pow2 (256-131K) 7.9e-16 2.6e-14
composite 1.1e-14 5.7e-14
prime powers (3,5,7) 9.8e-15 7.1e-14
genfft (R=11,13) 1.5e-14 4.0e-14
odd composites 1.0e-14 3.5e-14
mixed deep 1.0e-14 3.7e-14

Overall: min 2.5e-16, median 2.45e-14, max 7.07e-14 — none exceed 1e-13. This is the strictest honest statistic (per-element max, relative, full roundtrip); RMS or forward-only error runs ~10–40× smaller. The errors grow ~log N exactly as a correct Cooley-Tukey decomposition should.

Rader and Bluestein prime cells use a convolution-based path; their roundtrip error (median ~3e-14, max ~7e-14) sits in the same band, dominated by the inner FFT's accumulated rounding — well within FP64.


Architecture

VectorFFT uses a stride-based Cooley-Tukey architecture.

Executor

  • In-place and out-of-place — the in-place path runs the stage chain in a single buffer with no scratch allocation; the out-of-place path reads the source directly into the first stage (no pre-copy)
  • Split-complex — separate re[] / im[] arrays. No interleaved codelets; users with packed {re,im} data convert via vfft_deinterleave / vfft_reinterleave at the boundary

Codelets

  • Emitted by the DAG FFT compiler — it builds each transform's data-flow DAG and emits an ISA-specific schedule per (radix, variant, ISA) triple. Any radix you need can be generated from the same machinery — the shipped set covers every smooth integer up to 100,000+. Not hand-written.
  • Three ISA targets: AVX2 (VL=4), AVX-512 (VL=8), scalar fallback — selected at compile time

Planner

VectorFFT exposes FFTW-style planning rigor tiers — the calibration thoroughness used on a wisdom miss (a hit ignores rigor and builds the cached plan instantly):

Rigor Plan time What it does on a wisdom miss
VFFT_MEASURE seconds–minutes DP planner (≈ FFTW_MEASURE): beam search over factorizations + per-stage variant tuning, then persist
VFFT_PATIENT minutes Wider DP beam + re-measure of the top candidates (≈ FFTW_PATIENT)
VFFT_EXHAUSTIVE minutes–hours Full search — every factorization × permutation × per-stage variant
VFFT_ESTIMATE sub-µs Planned 4th tier — closed-form cost model, no measurement (designed, not yet wired)

Wisdom mechanics:

  • DP planner (MEASURE / PATIENT) — tries each radix as the first stage, recursively solves sub-problems with memoization, then benchmarks orderings of the winning factorization; PATIENT widens the beam and re-measures the top candidates (~150 benchmarks for N=100,000 vs ~61,000 for a full exhaustive search).
  • Per-stage variant tuning — each stage is scored on a noise-robust top-K metric and the best twiddle application (FLAT / LOG3 / T1S / BUF) is chosen per (R, me, ios) cell.
  • Wisdom is a per-feature bundle — one handle holds the calibrated plans for every transform (c2c / out-of-place / r2c / …), loaded from and saved to a directory. Default: library-managed (auto-load + calibrate-on-miss + save); or pass your own. A pre-calibrated set for the i9-14900KF ships under examples/14900KF/.
  • VFFT_ESTIMATE — a closed-form cost model for a zero-measurement plan (for users who won't calibrate). Designed; lands as the 4th rigor tier, not yet wired.

Transforms

VectorFFT v1.0 covers ten transform variants. Each entry below names the method, the multi-threaded scaling strategy, and which planning flags are honored.

Transform Method MT strategy ESTIMATE MEASURE/wisdom
1D C2C DIT forward / DIF backward, fused-twiddle stride executor (Method C) direct K-split ✅ cost-model ✅ joint search
1D R2C / C2R Pair-packing (Makhoul) — N/2-point complex FFT + post-butterfly inner C2C MT (K-split) ✅ inner halfN ✅ inner halfN
2D C2C Tiled gather/scatter via 8×4 SIMD transpose (default) or full-matrix Bailey tile-parallel rows + K-split cols ⚠ same plan as MEASURE ✅ inner row/col c2c wisdom (calibrate-on-miss)
2D R2C / C2R Tiled R2C row pass + 1D C2C col pass (FFTW reduce-along-inner convention) tile-parallel rows + K-split cols ⚠ same plan ✅ inner c2c wisdom (calibrate-on-miss)
DCT-II / III (REDFT10/01) Makhoul reduction — N-point R2C + pre-permute + post-twiddle. Specialized straight-line N=8 codelet for JPEG block size three-phase K-split (pre-permute, inner R2C MT, post-twiddle) ✅ inner halfN ✅ inner halfN
DCT-IV (REDFT11) Lee 1984 — single N/2-point complex FFT + pre/post twiddles. Used in MDCT (MP3/AAC/Vorbis/Opus) three-phase K-split ✅ inner halfN ✅ inner halfN
DST-II / III (RODFT10/01) Wraps DCT-II/III with sign-flip + index reversal K-split sign-flip + reversal passes ✅ via inner DCT ✅ via inner DCT
DHT (Discrete Hartley) Built directly on R2C — no twiddles. H[k]=Re(X[k])-Im(X[k]) butterfly K-split post-butterfly only (pre-pass is bandwidth-bound memcpy) ✅ inner halfN ✅ inner halfN
Prime N (Rader) For prime N where N-1 is 19-smooth — reduces to length-(N-1) cyclic convolution via primitive root inner FFT inherits MT ✅ via inner FFT ✅ via inner FFT
Prime N (Bluestein) For non-smooth primes — chirp-z to length-M convolution where M ≥ 2N-1, M chosen for fewer-stage factorization inner FFT inherits MT ✅ via inner FFT ✅ via inner FFT

1D R2C / C2R

Pair-packing — one N/2-point complex FFT + butterfly post-process. 1.5–1.9× faster than the equivalent complex FFT. Block-walked over K to keep scratch in L2.

2D C2C

Tiled (default, B=8) — gather B rows via SIMD transpose, FFT on scratch, scatter back. Tile-parallel threading on the row pass. Beats MKL 1.08–1.63× across 32² to 1024². Bailey mode (full-matrix transpose around one large-K row FFT) is available as an alternative.

2D R2C / C2R

Same tiled pattern as 2D C2C but the row pass is a 1D R2C (FFTW reduce-along-inner convention; output is N1×(N2/2+1) complex). Backward processes tiles in reverse order to avoid input-buffer overlap during the asymmetric scatter.

DCT-II / DCT-III (REDFT10 / REDFT01)

Makhoul's reduction (1980) — pre-permute + N-point R2C + post-twiddle butterfly. ~2× faster than the textbook 2N-point R2C; matches FFTW's reodft010e-r2hc.c. Specialized straight-line codelets at N=8 bypass Makhoul for the JPEG block (1.48× / 1.16× over FFTW at the JPEG cell). Even N only.

DCT-IV (REDFT11)

Lee 1984 — single N/2-point complex FFT + pre/post twiddle passes. Folds Y[2k] / Y[N−1−2k] into one complex sequence via the (−1)^n identity. ~2× faster than the textbook DCT-III + DST-III combo. Used in MDCT for audio codecs. Even N only.

DST-II / DST-III (RODFT10 / RODFT01)

Wraps DCT-II/III with a sign-flip + index-reversal pass. Reuses all of DCT-II/III's machinery including the N=8 codelets. Even N only.

DHT (Discrete Hartley Transform)

Built on N-point R2C with no twiddlesH[k] = Re(X[k]) − Im(X[k]), H[N−k] = Re(X[k]) + Im(X[k]). Self-inverse up to 1/N. ~2× faster than the equivalent complex FFT. Even N only.

Prime-N support (Rader / Bluestein)

  • Rader for smooth primes (N−1 is 19-smooth) — primitive-root permutation reduces the prime DFT to a length-(N−1) cyclic convolution. ~2× faster than Bluestein.
  • Bluestein (chirp-z) for non-smooth primes — length-M ≥ 2N−1 convolution; M chosen to favor fewer-stage factorizations.

Both inherit ESTIMATE/MEASURE from the inner mixed-radix FFT.


Quick Start

Soon to be updated.


Roadmap

Near-term

  • ILP optimization for large power-of-2 -- VTune profiling shows MKL achieves better instruction-level parallelism on large pow2 at K=4 (our closest margin: 1.02x at N=16384). Codelet scheduling improvements to increase FMA port saturation.
  • Natural-order DFT output (vfft_permute) -- expose the digit-reversal permutation table so users can inspect individual frequency bins without roundtrip
  • R=9 codelet -- unlocks 3^N sizes (3^10 = 59049, 3^12 = 531441) that currently exceed the max stage depth with R=3 alone. Fits AVX2's 16 YMMs.
  • Strided-batch codelets for 2D -- eliminate transpose entirely by allowing non-unit batch stride in codelets. Estimated 1.27x over MKL at 256x256 (currently 1.14x with tiled transpose).
  • K=1 scalar fallback -- AVX2 codelets currently require K>=4 (SIMD width). Add scalar path for single-transform use cases.
  • Native interleaved C2C -- dedicated codelets for interleaved complex layout (re+im adjacent), avoiding deinterleave overhead for users with packed data.

Medium-term

  • 1D Bailey 4-step -- natural-order output for large 1D transforms using transpose + twiddle infrastructure (already built and benchmarked)
  • 3D FFT -- extend tiled 2D approach to three dimensions
  • ARM NEON / SVE codelets -- port codelet generators to ARM SIMD targets
  • Single-precision (float32) support -- separate codelet set with 8-wide AVX2 / 16-wide AVX-512

Long-term

  • GPU backend (Vulkan compute / CUDA) -- VkFFT-style GPU execution sharing the same planner and wisdom infrastructure
  • Distributed FFT (MPI) -- multi-node decomposition for very large transforms
  • White paper -- microarchitectural profiling (PMU counters, spill analysis, roofline) documenting why VectorFFT beats MKL at the instruction level

Acknowledgments

  • FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
  • VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.

About

Split-layout mixed-radix FFT library in C, emitted by a Ocaml DAG FFT compiler. In-place & out-of-place C2C/R2C/C2R, 2D, and DCT/DST/DHT, with a calibrated wisdom planner. Beats Intel MKL on C2C and on multithreaded real/2D transforms. Pure C, zero dependencies.

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors