GitHub - Tugbars/VectorFFT: Split-layout mixed-radix FFT library in C, emitted by a Ocaml DAG FFT compiler. In-place & out-of-place C2C/R2C/C2R, 2D, and DCT/DST/DHT, with a calibrated wisdom planner. Beats Intel MKL on C2C and on multithreaded real/2D transforms. Pure C, zero dependencies.

A split-layout Fast Fourier Transform library in C, with AVX2/AVX-512 kernels generated by a DAG FFT compiler. It covers complex (C2C), real (R2C/C2R), 2D, and DSP transforms (DCT/DST/DHT) — in-place and out-of-place, single- and multi-threaded — behind one calibrated wisdom planner that picks the fastest plan per transform.

It beats Intel MKL on all 238 tested 1D C2C cells single- and multi-threaded, and wins on real-FFT and 2D transforms under multithreading. Pure C, no external dependencies.

Benchmark Results

For the full v1.0 performance picture — vs MKL, vs FFTW3, multi-threaded scaling, DCT/DST/DHT, cost-model accuracy, per-codelet VTune profiles, hardware caveats — see docs/performance/v1_0_results.md.

Platform: Intel Core i9-14900KF, 48 KB L1d, DDR5, AVX2, single-threaded
Competitor: Intel MKL 2025 (sequential, mkl_set_num_threads(1))
238 data points across 9 categories, 3 batch sizes (K=4, K=32, K=256), N=8 to N=823,543

1D FFT Throughput — VectorFFT vs Intel MKL

Three panels showing GFLOP/s at each batch size. Blue = VectorFFT, Red = MKL. Different marker shapes per category. VectorFFT sits above MKL across the board.

Category	Cells	Median	Best Win	Closest MKL Gets
Small pow2 (8-128)	15	4.28x	15.33x (N=8, K=4)	2.60x (N=128, K=4)
Power-of-2 (256-131K)	29	1.86x	3.04x (N=256, K=32)	1.10x (N=16384, K=4)
Composite	43	2.85x	4.51x (N=200, K=32)	1.62x (N=10000, K=4)
Prime powers (3,5,7)	25	2.69x	4.16x (N=2401, K=256)	1.67x (N=2401, K=4)
Prime powers (R=11,13)	17	2.79x	3.75x (N=169, K=32)	1.65x (N=2197, K=256)
Rader primes	24	2.34x	3.85x (N=641, K=32)	1.29x (N=4001, K=32)
Bluestein primes	24	1.55x	3.52x (N=83, K=4)	1.02x (N=179, K=256)
Odd composites	26	3.47x	5.16x (N=119, K=32)	2.26x (N=6615, K=256)
Mixed deep	35	2.71x	5.78x (N=60, K=32)	1.66x (N=126, K=4)

Speedup over Intel MKL — All Categories

Every point above the dashed line is a VectorFFT win. Marker size indicates batch count (small=K=4, medium=K=32, large=K=256). All 238 points are above parity.

Combined Dense Scatter — All Sizes & Batch Counts

All 238 data points overlaid. Blue cloud (VectorFFT) consistently above red cloud (MKL). Peak throughput at small N with large K where codelets run entirely from L1.

Highlight Results

N	K	Category	Factors	VectorFFT	MKL	Speedup
8	4	small	8	75.7 GF/s	4.9 GF/s	15.33x
60	32	mixed_deep	5x12	89.2 GF/s	15.4 GF/s	5.78x
119	32	odd_comp	17x7	50.7 GF/s	9.8 GF/s	5.16x
200	32	composite	5x8x5	69.6 GF/s	15.4 GF/s	4.51x
2401	256	prime_pow (7^4)	7x7x7x7	44.7 GF/s	10.7 GF/s	4.16x
169	32	genfft (13^2)	13x13	46.3 GF/s	12.3 GF/s	3.75x
641	32	rader	(override)	16.8 GF/s	4.4 GF/s	3.85x
83	4	bluestein	(override)	7.0 GF/s	2.0 GF/s	3.52x
256	32	pow2	4x8x8	69.7 GF/s	22.9 GF/s	3.04x
390625	4	prime_pow (5^8)	25x25x25x25	32.3 GF/s	11.1 GF/s	2.91x
16384	4	pow2	8x8x16x16	27.2 GF/s	24.7 GF/s	1.10x

2D FFT — Tiled SIMD Transpose

VectorFFT's 2D FFT uses a tiled gather/scatter approach with cache-oblivious SIMD transpose kernels (8x4 line-filling + 4x4 AVX2). Beats MKL at all tested sizes.

Size	VectorFFT	MKL	Speedup
32x32	0.9 us	1.5 us	1.63x
64x64	5.5 us	6.5 us	1.18x
128x128	30.3 us	33.4 us	1.10x
256x256	127.3 us	145.6 us	1.14x
512x512	875.1 us	948.1 us	1.08x
1024x1024	3,900 us	5,512 us	1.41x
100x200	40.7 us	60.9 us	1.50x

Multi-threaded 2D is tile-parallel on the row pass + K-split on the columns (per-thread scratch, zero barriers). A full single- and multi-threaded 2D benchmark — all sizes, head-to-head vs MKL — is in progress; those numbers will be published with that run.

Accuracy

Strict roundtrip error — max |fwd→bwd / N − x| / max|x|, the worst single element across all N·K outputs after a full forward + backward — across all tested 1D cells. Errors track the theoretical O(log₂N · ε) bound (FP64 ε = 2.2e-16); every cell holds ~14 correct digits.

Category	Min Error	Max Error
pow2 small (8-128)	2.5e-16	1.3e-14
pow2 (256-131K)	7.9e-16	2.6e-14
composite	1.1e-14	5.7e-14
prime powers (3,5,7)	9.8e-15	7.1e-14
genfft (R=11,13)	1.5e-14	4.0e-14
odd composites	1.0e-14	3.5e-14
mixed deep	1.0e-14	3.7e-14

Overall: min 2.5e-16, median 2.45e-14, max 7.07e-14 — none exceed 1e-13. This is the strictest honest statistic (per-element max, relative, full roundtrip); RMS or forward-only error runs ~10–40× smaller. The errors grow ~log N exactly as a correct Cooley-Tukey decomposition should.

Rader and Bluestein prime cells use a convolution-based path; their roundtrip error (median ~3e-14, max ~7e-14) sits in the same band, dominated by the inner FFT's accumulated rounding — well within FP64.

Architecture

VectorFFT uses a stride-based Cooley-Tukey architecture.

Executor

In-place and out-of-place — the in-place path runs the stage chain in a single buffer with no scratch allocation; the out-of-place path reads the source directly into the first stage (no pre-copy)
Split-complex — separate re[] / im[] arrays. No interleaved codelets; users with packed {re,im} data convert via vfft_deinterleave / vfft_reinterleave at the boundary

Codelets

Emitted by the DAG FFT compiler — it builds each transform's data-flow DAG and emits an ISA-specific schedule per (radix, variant, ISA) triple. Any radix you need can be generated from the same machinery — the shipped set covers every smooth integer up to 100,000+. Not hand-written.
Three ISA targets: AVX2 (VL=4), AVX-512 (VL=8), scalar fallback — selected at compile time

Planner

VectorFFT exposes FFTW-style planning rigor tiers — the calibration thoroughness used on a wisdom miss (a hit ignores rigor and builds the cached plan instantly):

Rigor	Plan time	What it does on a wisdom miss
`VFFT_MEASURE`	seconds–minutes	DP planner (≈ FFTW_MEASURE): beam search over factorizations + per-stage variant tuning, then persist
`VFFT_PATIENT`	minutes	Wider DP beam + re-measure of the top candidates (≈ FFTW_PATIENT)
`VFFT_EXHAUSTIVE`	minutes–hours	Full search — every factorization × permutation × per-stage variant
`VFFT_ESTIMATE`	sub-µs	Planned 4th tier — closed-form cost model, no measurement (designed, not yet wired)

Wisdom mechanics:

DP planner (MEASURE / PATIENT) — tries each radix as the first stage, recursively solves sub-problems with memoization, then benchmarks orderings of the winning factorization; PATIENT widens the beam and re-measures the top candidates (~150 benchmarks for N=100,000 vs ~61,000 for a full exhaustive search).
Per-stage variant tuning — each stage is scored on a noise-robust top-K metric and the best twiddle application (FLAT / LOG3 / T1S / BUF) is chosen per (R, me, ios) cell.
Wisdom is a per-feature bundle — one handle holds the calibrated plans for every transform (c2c / out-of-place / r2c / …), loaded from and saved to a directory. Default: library-managed (auto-load + calibrate-on-miss + save); or pass your own. A pre-calibrated set for the i9-14900KF ships under examples/14900KF/.
VFFT_ESTIMATE — a closed-form cost model for a zero-measurement plan (for users who won't calibrate). Designed; lands as the 4th rigor tier, not yet wired.

Transforms

VectorFFT v1.0 covers ten transform variants. Each entry below names the method, the multi-threaded scaling strategy, and which planning flags are honored.

Transform	Method	MT strategy	ESTIMATE	MEASURE/wisdom
1D C2C	DIT forward / DIF backward, fused-twiddle stride executor (Method C)	direct K-split	✅ cost-model	✅ joint search
1D R2C / C2R	Pair-packing (Makhoul) — N/2-point complex FFT + post-butterfly	inner C2C MT (K-split)	✅ inner halfN	✅ inner halfN
2D C2C	Tiled gather/scatter via 8×4 SIMD transpose (default) or full-matrix Bailey	tile-parallel rows + K-split cols	⚠ same plan as MEASURE	✅ inner row/col c2c wisdom (calibrate-on-miss)
2D R2C / C2R	Tiled R2C row pass + 1D C2C col pass (FFTW reduce-along-inner convention)	tile-parallel rows + K-split cols	⚠ same plan	✅ inner c2c wisdom (calibrate-on-miss)
DCT-II / III (REDFT10/01)	Makhoul reduction — N-point R2C + pre-permute + post-twiddle. Specialized straight-line N=8 codelet for JPEG block size	three-phase K-split (pre-permute, inner R2C MT, post-twiddle)	✅ inner halfN	✅ inner halfN
DCT-IV (REDFT11)	Lee 1984 — single N/2-point complex FFT + pre/post twiddles. Used in MDCT (MP3/AAC/Vorbis/Opus)	three-phase K-split	✅ inner halfN	✅ inner halfN
DST-II / III (RODFT10/01)	Wraps DCT-II/III with sign-flip + index reversal	K-split sign-flip + reversal passes	✅ via inner DCT	✅ via inner DCT
DHT (Discrete Hartley)	Built directly on R2C — no twiddles. `H[k]=Re(X[k])-Im(X[k])` butterfly	K-split post-butterfly only (pre-pass is bandwidth-bound memcpy)	✅ inner halfN	✅ inner halfN
Prime N (Rader)	For prime N where N-1 is 19-smooth — reduces to length-(N-1) cyclic convolution via primitive root	inner FFT inherits MT	✅ via inner FFT	✅ via inner FFT
Prime N (Bluestein)	For non-smooth primes — chirp-z to length-M convolution where M ≥ 2N-1, M chosen for fewer-stage factorization	inner FFT inherits MT	✅ via inner FFT	✅ via inner FFT

1D R2C / C2R

Pair-packing — one N/2-point complex FFT + butterfly post-process. 1.5–1.9× faster than the equivalent complex FFT. Block-walked over K to keep scratch in L2.

2D C2C

Tiled (default, B=8) — gather B rows via SIMD transpose, FFT on scratch, scatter back. Tile-parallel threading on the row pass. Beats MKL 1.08–1.63× across 32² to 1024². Bailey mode (full-matrix transpose around one large-K row FFT) is available as an alternative.

2D R2C / C2R

Same tiled pattern as 2D C2C but the row pass is a 1D R2C (FFTW reduce-along-inner convention; output is N1×(N2/2+1) complex). Backward processes tiles in reverse order to avoid input-buffer overlap during the asymmetric scatter.

DCT-II / DCT-III (REDFT10 / REDFT01)

Makhoul's reduction (1980) — pre-permute + N-point R2C + post-twiddle butterfly. ~2× faster than the textbook 2N-point R2C; matches FFTW's reodft010e-r2hc.c. Specialized straight-line codelets at N=8 bypass Makhoul for the JPEG block (1.48× / 1.16× over FFTW at the JPEG cell). Even N only.

DCT-IV (REDFT11)

Lee 1984 — single N/2-point complex FFT + pre/post twiddle passes. Folds Y[2k] / Y[N−1−2k] into one complex sequence via the (−1)^n identity. ~2× faster than the textbook DCT-III + DST-III combo. Used in MDCT for audio codecs. Even N only.

DST-II / DST-III (RODFT10 / RODFT01)

Wraps DCT-II/III with a sign-flip + index-reversal pass. Reuses all of DCT-II/III's machinery including the N=8 codelets. Even N only.

DHT (Discrete Hartley Transform)

Built on N-point R2C with no twiddles — H[k] = Re(X[k]) − Im(X[k]), H[N−k] = Re(X[k]) + Im(X[k]). Self-inverse up to 1/N. ~2× faster than the equivalent complex FFT. Even N only.

Prime-N support (Rader / Bluestein)

Rader for smooth primes (N−1 is 19-smooth) — primitive-root permutation reduces the prime DFT to a length-(N−1) cyclic convolution. ~2× faster than Bluestein.
Bluestein (chirp-z) for non-smooth primes — length-M ≥ 2N−1 convolution; M chosen to favor fewer-stage factorizations.

Both inherit ESTIMATE/MEASURE from the inner mixed-radix FFT.

Quick Start

Soon to be updated.

Roadmap

Near-term

ILP optimization for large power-of-2 -- VTune profiling shows MKL achieves better instruction-level parallelism on large pow2 at K=4 (our closest margin: 1.02x at N=16384). Codelet scheduling improvements to increase FMA port saturation.
Natural-order DFT output (vfft_permute) -- expose the digit-reversal permutation table so users can inspect individual frequency bins without roundtrip
R=9 codelet -- unlocks 3^N sizes (3^10 = 59049, 3^12 = 531441) that currently exceed the max stage depth with R=3 alone. Fits AVX2's 16 YMMs.
Strided-batch codelets for 2D -- eliminate transpose entirely by allowing non-unit batch stride in codelets. Estimated 1.27x over MKL at 256x256 (currently 1.14x with tiled transpose).
K=1 scalar fallback -- AVX2 codelets currently require K>=4 (SIMD width). Add scalar path for single-transform use cases.
Native interleaved C2C -- dedicated codelets for interleaved complex layout (re+im adjacent), avoiding deinterleave overhead for users with packed data.

Medium-term

1D Bailey 4-step -- natural-order output for large 1D transforms using transpose + twiddle infrastructure (already built and benchmarked)
3D FFT -- extend tiled 2D approach to three dimensions
ARM NEON / SVE codelets -- port codelet generators to ARM SIMD targets
Single-precision (float32) support -- separate codelet set with 8-wide AVX2 / 16-wide AVX-512

Long-term

GPU backend (Vulkan compute / CUDA) -- VkFFT-style GPU execution sharing the same planner and wisdom infrastructure
Distributed FFT (MPI) -- multi-node decomposition for very large transforms
White paper -- microarchitectural profiling (PMU counters, spill analysis, roofline) documenting why VectorFFT beats MKL at the instruction level

Acknowledgments

FFTW by Matteo Frigo and Steven G. Johnson -- the gold standard for decades. VectorFFT's prime-radix butterflies (R=11, 13, 17, 19) are derived from FFTW's genfft algebraic output, then re-scheduled using Sethi-Ullman register allocation with explicit spill management to minimize register pressure on AVX2 (16 YMM) and AVX-512 (32 ZMM).
VkFFT by Dmitrii Tolmachev -- inspiration for the benchmarking methodology and presentation style.

Name		Name	Last commit message	Last commit date
Latest commit History 1,338 Commits
build_tuned		build_tuned
cmake		cmake
docs		docs
examples		examples
include		include
src		src
.gitattributes		.gitattributes
.gitignore		.gitignore
CMakeLists.txt		CMakeLists.txt
LICENSE		LICENSE
README.md		README.md

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Benchmark Results