A split-layout Fast Fourier Transform library in C, with AVX2/AVX-512 kernels generated by a DAG FFT compiler. It covers complex (C2C), real (R2C/C2R), 2D, and DSP transforms (DCT/DST/DHT) — in-place and out-of-place, single- and multi-threaded — behind one calibrated wisdom planner that picks the fastest plan per transform.
It beats Intel MKL on all 238 tested 1D C2C cells single- and multi-threaded, and wins on real-FFT and 2D transforms under multithreading. Pure C, no external dependencies.
For the full v1.0 performance picture — vs MKL, vs FFTW3, multi-threaded scaling, DCT/DST/DHT, cost-model accuracy, per-codelet VTune profiles, hardware caveats — see
docs/performance/v1_0_results.md.
Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitor: Intel MKL 2025 (sequential,mkl_set_num_threads(1))
238 data points across 9 categories, 3 batch sizes (K=4, K=32, K=256), N=8 to N=823,543
Three panels showing GFLOP/s at each batch size. Blue = VectorFFT, Red = MKL. Different marker shapes per category. VectorFFT sits above MKL across the board.
| Category | Cells | Median | Best Win | Closest MKL Gets |
|---|---|---|---|---|
| Small pow2 (8-128) | 15 | 4.28x | 15.33x (N=8, K=4) | 2.60x (N=128, K=4) |
| Power-of-2 (256-131K) | 29 | 1.86x | 3.04x (N=256, K=32) | 1.10x (N=16384, K=4) |
| Composite | 43 | 2.85x | 4.51x (N=200, K=32) | 1.62x (N=10000, K=4) |
| Prime powers (3,5,7) | 25 | 2.69x | 4.16x (N=2401, K=256) | 1.67x (N=2401, K=4) |
| Prime powers (R=11,13) | 17 | 2.79x | 3.75x (N=169, K=32) | 1.65x (N=2197, K=256) |
| Rader primes | 24 | 2.34x | 3.85x (N=641, K=32) | 1.29x (N=4001, K=32) |
| Bluestein primes | 24 | 1.55x | 3.52x (N=83, K=4) | 1.02x (N=179, K=256) |
| Odd composites | 26 | 3.47x | 5.16x (N=119, K=32) | 2.26x (N=6615, K=256) |
| Mixed deep | 35 | 2.71x | 5.78x (N=60, K=32) | 1.66x (N=126, K=4) |
Every point above the dashed line is a VectorFFT win. Marker size indicates batch count (small=K=4, medium=K=32, large=K=256). All 238 points are above parity.
All 238 data points overlaid. Blue cloud (VectorFFT) consistently above red cloud (MKL). Peak throughput at small N with large K where codelets run entirely from L1.
| N | K | Category | Factors | VectorFFT | MKL | Speedup |
|---|---|---|---|---|---|---|
| 8 | 4 | small | 8 | 75.7 GF/s | 4.9 GF/s | 15.33x |
| 60 | 32 | mixed_deep | 5x12 | 89.2 GF/s | 15.4 GF/s | 5.78x |
| 119 | 32 | odd_comp | 17x7 | 50.7 GF/s | 9.8 GF/s | 5.16x |
| 200 | 32 | composite | 5x8x5 | 69.6 GF/s | 15.4 GF/s | 4.51x |
| 2401 | 256 | prime_pow (7^4) | 7x7x7x7 | 44.7 GF/s | 10.7 GF/s | 4.16x |
| 169 | 32 | genfft (13^2) | 13x13 | 46.3 GF/s | 12.3 GF/s | 3.75x |
| 641 | 32 | rader | (override) | 16.8 GF/s | 4.4 GF/s | 3.85x |
| 83 | 4 | bluestein | (override) | 7.0 GF/s | 2.0 GF/s | 3.52x |
| 256 | 32 | pow2 | 4x8x8 | 69.7 GF/s | 22.9 GF/s | 3.04x |
| 390625 | 4 | prime_pow (5^8) | 25x25x25x25 | 32.3 GF/s | 11.1 GF/s | 2.91x |
| 16384 | 4 | pow2 | 8x8x16x16 | 27.2 GF/s | 24.7 GF/s | 1.10x |
VectorFFT's 2D FFT uses a tiled gather/scatter approach with cache-oblivious SIMD transpose kernels (8x4 line-filling + 4x4 AVX2). Beats MKL at all tested sizes.
| Size | VectorFFT | MKL | Speedup |
|---|---|---|---|
| 32x32 | 0.9 us | 1.5 us | 1.63x |
| 64x64 | 5.5 us | 6.5 us | 1.18x |
| 128x128 | 30.3 us | 33.4 us | 1.10x |
| 256x256 | 127.3 us | 145.6 us | 1.14x |
| 512x512 | 875.1 us | 948.1 us | 1.08x |
| 1024x1024 | 3,900 us | 5,512 us | 1.41x |
| 100x200 | 40.7 us | 60.9 us | 1.50x |
Multi-threaded 2D is tile-parallel on the row pass + K-split on the columns (per-thread scratch, zero barriers). A full single- and multi-threaded 2D benchmark — all sizes, head-to-head vs MKL — is in progress; those numbers will be published with that run.
Strict roundtrip error — max |fwd→bwd / N − x| / max|x|, the worst single element across all
N·K outputs after a full forward + backward — across all tested 1D cells. Errors track the theoretical
O(log₂N · ε) bound (FP64 ε = 2.2e-16); every cell holds ~14 correct digits.
| Category | Min Error | Max Error |
|---|---|---|
| pow2 small (8-128) | 2.5e-16 | 1.3e-14 |
| pow2 (256-131K) | 7.9e-16 | 2.6e-14 |
| composite | 1.1e-14 | 5.7e-14 |
| prime powers (3,5,7) | 9.8e-15 | 7.1e-14 |
| genfft (R=11,13) | 1.5e-14 | 4.0e-14 |
| odd composites | 1.0e-14 | 3.5e-14 |
| mixed deep | 1.0e-14 | 3.7e-14 |
Overall: min 2.5e-16, median 2.45e-14, max 7.07e-14 — none exceed 1e-13. This is the strictest honest statistic (per-element max, relative, full roundtrip); RMS or forward-only error runs ~10–40× smaller. The errors grow ~log N exactly as a correct Cooley-Tukey decomposition should.
Rader and Bluestein prime cells use a convolution-based path; their roundtrip error (median ~3e-14, max ~7e-14) sits in the same band, dominated by the inner FFT's accumulated rounding — well within FP64.
VectorFFT uses a stride-based Cooley-Tukey architecture.
- In-place and out-of-place — the in-place path runs the stage chain in a single buffer with no scratch allocation; the out-of-place path reads the source directly into the first stage (no pre-copy)
- Split-complex — separate
re[]/im[]arrays. No interleaved codelets; users with packed{re,im}data convert viavfft_deinterleave/vfft_reinterleaveat the boundary
- Emitted by the DAG FFT compiler — it builds each transform's data-flow DAG and emits an ISA-specific schedule per
(radix, variant, ISA)triple. Any radix you need can be generated from the same machinery — the shipped set covers every smooth integer up to 100,000+. Not hand-written. - Three ISA targets: AVX2 (VL=4), AVX-512 (VL=8), scalar fallback — selected at compile time
VectorFFT exposes FFTW-style planning rigor tiers — the calibration thoroughness used on a wisdom miss (a hit ignores rigor and builds the cached plan instantly):
| Rigor | Plan time | What it does on a wisdom miss |
|---|---|---|
VFFT_MEASURE |
seconds–minutes | DP planner (≈ FFTW_MEASURE): beam search over factorizations + per-stage variant tuning, then persist |
VFFT_PATIENT |
minutes | Wider DP beam + re-measure of the top candidates (≈ FFTW_PATIENT) |
VFFT_EXHAUSTIVE |
minutes–hours | Full search — every factorization × permutation × per-stage variant |
VFFT_ESTIMATE |
sub-µs | Planned 4th tier — closed-form cost model, no measurement (designed, not yet wired) |
Wisdom mechanics:
- DP planner (MEASURE / PATIENT) — tries each radix as the first stage, recursively solves sub-problems with memoization, then benchmarks orderings of the winning factorization; PATIENT widens the beam and re-measures the top candidates (~150 benchmarks for N=100,000 vs ~61,000 for a full exhaustive search).
- Per-stage variant tuning — each stage is scored on a noise-robust top-K metric and the best
twiddle application (FLAT / LOG3 / T1S / BUF) is chosen per
(R, me, ios)cell. - Wisdom is a per-feature bundle — one handle holds the calibrated plans for every transform
(c2c / out-of-place / r2c / …), loaded from and saved to a directory. Default: library-managed
(auto-load + calibrate-on-miss + save); or pass your own. A pre-calibrated set for the i9-14900KF
ships under
examples/14900KF/. VFFT_ESTIMATE— a closed-form cost model for a zero-measurement plan (for users who won't calibrate). Designed; lands as the 4th rigor tier, not yet wired.
VectorFFT v1.0 covers ten transform variants. Each entry below names the method, the multi-threaded scaling strategy, and which planning flags are honored.
| Transform | Method | MT strategy | ESTIMATE | MEASURE/wisdom |
|---|---|---|---|---|
| 1D C2C | DIT forward / DIF backward, fused-twiddle stride executor (Method C) | direct K-split | ✅ cost-model | ✅ joint search |
| 1D R2C / C2R | Pair-packing (Makhoul) — N/2-point complex FFT + post-butterfly | inner C2C MT (K-split) | ✅ inner halfN | ✅ inner halfN |
| 2D C2C | Tiled gather/scatter via 8×4 SIMD transpose (default) or full-matrix Bailey | tile-parallel rows + K-split cols | ⚠ same plan as MEASURE | ✅ inner row/col c2c wisdom (calibrate-on-miss) |
| 2D R2C / C2R | Tiled R2C row pass + 1D C2C col pass (FFTW reduce-along-inner convention) | tile-parallel rows + K-split cols | ⚠ same plan | ✅ inner c2c wisdom (calibrate-on-miss) |
| DCT-II / III (REDFT10/01) | Makhoul reduction — N-point R2C + pre-permute + post-twiddle. Specialized straight-line N=8 codelet for JPEG block size | three-phase K-split (pre-permute, inner R2C MT, post-twiddle) | ✅ inner halfN | ✅ inner halfN |
| DCT-IV (REDFT11) | Lee 1984 — single N/2-point complex FFT + pre/post twiddles. Used in MDCT (MP3/AAC/Vorbis/Opus) | three-phase K-split | ✅ inner halfN | ✅ inner halfN |
| DST-II / III (RODFT10/01) | Wraps DCT-II/III with sign-flip + index reversal | K-split sign-flip + reversal passes | ✅ via inner DCT | ✅ via inner DCT |
| DHT (Discrete Hartley) | Built directly on R2C — no twiddles. H[k]=Re(X[k])-Im(X[k]) butterfly |
K-split post-butterfly only (pre-pass is bandwidth-bound memcpy) | ✅ inner halfN | ✅ inner halfN |
| Prime N (Rader) | For prime N where N-1 is 19-smooth — reduces to length-(N-1) cyclic convolution via primitive root | inner FFT inherits MT | ✅ via inner FFT | ✅ via inner FFT |
| Prime N (Bluestein) | For non-smooth primes — chirp-z to length-M convolution where M ≥ 2N-1, M chosen for fewer-stage factorization | inner FFT inherits MT | ✅ via inner FFT | ✅ via inner FFT |
Pair-packing — one N/2-point complex FFT + butterfly post-process. 1.5–1.9× faster than the equivalent complex FFT. Block-walked over K to keep scratch in L2.
Tiled (default, B=8) — gather B rows via SIMD transpose, FFT on scratch, scatter back. Tile-parallel threading on the row pass. Beats MKL 1.08–1.63× across 32² to 1024². Bailey mode (full-matrix transpose around one large-K row FFT) is available as an alternative.
Same tiled pattern as 2D C2C but the row pass is a 1D R2C (FFTW reduce-along-inner convention; output is N1×(N2/2+1) complex). Backward processes tiles in reverse order to avoid input-buffer overlap during the asymmetric scatter.
Makhoul's reduction (1980) — pre-permute + N-point R2C + post-twiddle butterfly. ~2× faster than the textbook 2N-point R2C; matches FFTW's reodft010e-r2hc.c. Specialized straight-line codelets at N=8 bypass Makhoul for the JPEG block (1.48× / 1.16× over FFTW at the JPEG cell). Even N only.
Lee 1984 — single N/2-point complex FFT + pre/post twiddle passes. Folds Y[2k] / Y[N−1−2k] into one complex sequence via the (−1)^n identity. ~2× faster than the textbook DCT-III + DST-III combo. Used in MDCT for audio codecs. Even N only.
Wraps DCT-II/III with a sign-flip + index-reversal pass. Reuses all of DCT-II/III's machinery including the N=8 codelets. Even N only.
Built on N-point R2C with no twiddles — H[k] = Re(X[k]) − Im(X[k]), H[N−k] = Re(X[k]) + Im(X[k]). Self-inverse up to 1/N. ~2× faster than the equivalent complex FFT. Even N only.
- Rader for smooth primes (N−1 is 19-smooth) — primitive-root permutation reduces the prime DFT to a length-(N−1) cyclic convolution. ~2× faster than Bluestein.
- Bluestein (chirp-z) for non-smooth primes — length-M ≥ 2N−1 convolution; M chosen to favor fewer-stage factorizations.
Both inherit ESTIMATE/MEASURE from the inner mixed-radix FFT.
Soon to be updated.
- ILP optimization for large power-of-2 -- VTune profiling shows MKL achieves better instruction-level parallelism on large pow2 at K=4 (our closest margin: 1.02x at N=16384). Codelet scheduling improvements to increase FMA port saturation.
- Natural-order DFT output (
vfft_permute) -- expose the digit-reversal permutation table so users can inspect individual frequency bins without roundtrip - R=9 codelet -- unlocks 3^N sizes (3^10 = 59049, 3^12 = 531441) that currently exceed the max stage depth with R=3 alone. Fits AVX2's 16 YMMs.
- Strided-batch codelets for 2D -- eliminate transpose entirely by allowing non-unit batch stride in codelets. Estimated 1.27x over MKL at 256x256 (currently 1.14x with tiled transpose).
- K=1 scalar fallback -- AVX2 codelets currently require K>=4 (SIMD width). Add scalar path for single-transform use cases.
- Native interleaved C2C -- dedicated codelets for interleaved complex layout (re+im adjacent), avoiding deinterleave overhead for users with packed data.
- 1D Bailey 4-step -- natural-order output for large 1D transforms using transpose + twiddle infrastructure (already built and benchmarked)
- 3D FFT -- extend tiled 2D approach to three dimensions
- ARM NEON / SVE codelets -- port codelet generators to ARM SIMD targets
- Single-precision (float32) support -- separate codelet set with 8-wide AVX2 / 16-wide AVX-512
- GPU backend (Vulkan compute / CUDA) -- VkFFT-style GPU execution sharing the same planner and wisdom infrastructure
- Distributed FFT (MPI) -- multi-node decomposition for very large transforms
- White paper -- microarchitectural profiling (PMU counters, spill analysis, roofline) documenting why VectorFFT beats MKL at the instruction level
- FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
- VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.



