…chmarks

Add an optimized FSST decompressor that replaces the baseline fsst-rs implementation for bulk decompression.

Key changes:
- New `OptimizedDecompressor` with a packed symbol+length lookup table (16-byte aligned entries), eliminating dual array lookups per code
- Compact loop-based escape handling instead of an 8-arm match statement
- SWAR escape detection (same as fsst-rs) with tighter codegen
- Dedicated benchmarks measuring high-escape vs low-escape scenarios

Benchmark results (raw decompress_into, median):
- High escape (10k×16): 105.6µs → 92.4µs (~12% faster)
- High escape (100k×64): 6.04ms → 5.70ms (~6% faster)
- Low escape (10k×64): 128.5µs → 127.4µs (~1% faster)
- Low escape (100k×64): 1.40ms → 1.38ms (~2% faster)

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
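The SWAR escape detection can be sketched as follows; this is a minimal illustration of the underlying trick, not fsst-rs' exact helper. FSST reserves code 255 (0xFF) as the escape marker, so scanning an 8-byte block reduces to finding a zero byte in the complement:

```rust
/// SWAR scan of an 8-byte block for the FSST escape code (0xFF).
/// A byte of `word` equals 0xFF exactly when the corresponding byte
/// of `!word` is zero, so the classic zero-byte trick applies. The
/// result is nonzero iff the block contains an escape, and
/// `trailing_zeros() / 8` locates the first one (lanes above the
/// first match may be inexact due to borrow propagation, which is
/// fine for both uses).
fn escape_mask(word: u64) -> u64 {
    const LO: u64 = 0x0101_0101_0101_0101;
    const HI: u64 = 0x8080_8080_8080_8080;
    let x = !word; // escape bytes (0xFF) become 0x00
    x.wrapping_sub(LO) & !x & HI
}
```

The same mask serves both paths: `mask == 0` takes the escape-free fast path, and a nonzero mask hands `trailing_zeros() / 8` to the escape handler.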
…ir benchmarks

Replace the packed 16-byte struct (4KB table) with separate u64 symbols + u8 lengths arrays (2.3KB total), matching the fsst-rs cache footprint. Use a fully unrolled match statement for escape handling instead of a loop.

Fix benchmarks to use the same buffer allocation for baseline and optimized runs to ensure a fair comparison.

Results (median, raw decompress_into):
- Low escape: 9-40% faster than the fsst-rs baseline
- High escape: 7-13% faster than the fsst-rs baseline

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
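For concreteness, the separate-array layout can be sketched like this (field names hypothetical; the 255-entry sizing is inferred from the 2.3KB figure above, with code 255 reserved as the escape):

```rust
/// Decoder table as two parallel arrays rather than one packed array
/// of 16-byte entries: 255 * 8 + 255 * 1 = 2295 bytes (~2.3 KB),
/// versus 256 * 16 = 4 KB for the packed layout.
#[allow(dead_code)]
struct DecoderTable {
    symbols: [u64; 255], // up to 8 symbol bytes per code, little-endian
    lengths: [u8; 255],  // number of valid bytes in each symbol (1..=8)
}
```

Both arrays stay resident in L1, which is the point of matching fsst-rs' footprint.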
The decompressor now has a multi-level block processing strategy:
- 32-code fast path: reads four 8-byte blocks, checks all of them for escapes at once, and emits 32 symbols when no escapes are present
- 8-code fast path: handles blocks with escape codes using the existing unrolled match statement
- Scalar fallback: processes remaining bytes one at a time

The 32-code path reduces loop overhead by processing 4x more codes per iteration when data compresses well (few escape codes). For high-escape data, it quickly falls through to the 8-code path with no regression.

Benchmark results (raw decompress, median):
- Low escape (10k, 64): 98µs vs 111µs previously (+13%)
- Low escape (10k, 256): 402µs vs 462µs previously (+13%)
- Low escape (100k, 64): 1073µs vs 1190µs previously (+10%)
- High escape: neutral (same performance as before)

Also refactored emit_block! and handle_escape_block! into macros to reduce code duplication across the processing levels.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
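The tiered control flow can be sketched as below (names hypothetical; a simple `contains` check stands in for the SWAR escape mask, and emit bodies are elided to counters):

```rust
const ESCAPE: u8 = 0xFF;

/// Returns (codes eligible for the 32-wide path, codes left to the
/// narrower paths): the 32-code loop runs until a batch of four
/// 8-byte blocks contains an escape, then the 8-code path and the
/// scalar tail take over.
fn split_fast_path(codes: &[u8]) -> (usize, usize) {
    let mut i = 0;
    while i + 32 <= codes.len() {
        let batch = &codes[i..i + 32];
        if batch.contains(&ESCAPE) {
            break; // fall to the 8-code escape-handling path
        }
        i += 32; // escape-free: emit 32 symbols in one iteration
    }
    (i, codes.len() - i)
}
```

When data compresses well, nearly every code flows through the wide path; high-escape data breaks out on the first batch.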
Add two experimental decompressor methods for benchmarking:
1. AVX-512 escape scan: Uses _mm512_cmpeq_epi8_mask to scan 64 bytes
at once for escape codes, then processes escape-free blocks with
scalar emit. Gated with #[target_feature] to avoid CPU frequency
throttling from global target-cpu=native.
2. Combined symbol+length table: Uses a single 4KB lookup table
(SymbolEntry { symbol: u64, length: u64 }) for one cache-line hit
per code instead of two separate array accesses.
Also adds a shared decompress_tail helper for SIMD variant fallback paths.
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
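The combined symbol+length table can be illustrated like this (a sketch of the layout described above, not the exact code):

```rust
/// Combined symbol+length entry: 16 bytes, 16-byte aligned, so each
/// code lookup touches a single cache line. 256 entries = 4 KB.
#[repr(C, align(16))]
#[derive(Clone, Copy)]
struct SymbolEntry {
    symbol: u64, // up to 8 symbol bytes, little-endian
    length: u64, // number of valid bytes (1..=8)
}

/// Branch-free emit: unconditionally write all 8 symbol bytes, then
/// keep only `length` of them. The over-write is cheap and avoids a
/// length-dependent branch.
fn emit(entry: SymbolEntry, out: &mut Vec<u8>) {
    let start = out.len();
    out.extend_from_slice(&entry.symbol.to_le_bytes());
    out.truncate(start + entry.length as usize);
}
```

The trade-off, as later benchmarks showed, is the larger cache footprint (4KB vs 2.3KB for separate arrays).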
The previous implementation broke out of the 32-code escape-free loop permanently on the first escape, falling to the 8-code loop and never re-entering the wide path. This wastes the 32-code fast path on data with scattered escapes interspersed with escape-free stretches.

The new unified loop alternates between 32-code escape-free batches and 8-code escape handling (up to 4 blocks), then re-enters the 32-code path. This yields a 9-21% improvement across all configurations:
- Low escape: 16-21% faster
- High escape: 9-21% faster

Also removes the experimental AVX-512, combined-table, and prefetch variants that benchmarked slower than the separate-table approach. Key findings:
- AVX-512 vpcmpeqb scan: slower due to CPU frequency throttling
- Combined 16-byte table: larger cache footprint (4KB vs 2.3KB) hurts
- Software prefetch: no benefit since the tables already fit in L1

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
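The unified loop can be sketched as follows (a control-flow illustration only: `contains` stands in for the SWAR mask, the symbol-table emit is a pass-through, and `0xFF` escapes the next byte as a literal, as in FSST):

```rust
const ESCAPE: u8 = 0xFF;

/// Unified loop: run 32-code escape-free batches; on an escape,
/// handle one short block with per-code checks, then loop back and
/// re-enter the wide path instead of staying narrow forever.
fn decode(codes: &[u8], out: &mut Vec<u8>) {
    let mut i = 0;
    while i < codes.len() {
        // 32-code fast path: four 8-byte blocks, one combined check.
        while i + 32 <= codes.len() && !codes[i..i + 32].contains(&ESCAPE) {
            out.extend_from_slice(&codes[i..i + 32]); // stand-in emit
            i += 32;
        }
        // One block (or the tail) with escape handling, then the
        // outer loop re-enters the wide path.
        let end = (i + 8).min(codes.len());
        while i < end {
            let c = codes[i];
            i += 1;
            if c == ESCAPE {
                if i < codes.len() {
                    out.push(codes[i]); // escaped literal byte
                    i += 1;
                }
            } else {
                out.push(c); // stand-in for a symbol-table emit
            }
        }
    }
}
```

Scattered escapes now cost one narrow block each instead of demoting the whole remainder of the input to the 8-code loop.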
Documents ~10 optimization strategies explored for the FSST decompressor, including benchmark results, why each was accepted or rejected, and potential future directions for further improvement.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
Two additional optimizations on top of the existing decompressor:

1. Switch from N=4 to N=1 re-entry: after handling each escape block, immediately re-enter the 32-code fast path. This is 1-3% faster for low-escape data (the common case) with no regression on high-escape data.
2. Add runtime CPU feature detection: on x86-64 CPUs with BMI1/BMI2/POPCNT (virtually all modern CPUs), dispatch to a target-feature-optimized code path for better trailing_zeros codegen (tzcnt vs bsf). This gives a consistent 2-4% improvement across all workloads.

Combined speedups vs the fsst-rs baseline (median):
- Low escape: 16-22% faster
- High escape: 3-16% faster

Also explored but rejected: compact loop escape handling, 8-code-only (no 32-code batching). Updated the optimization exploration document.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
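The runtime-dispatch pattern looks roughly like this (function names hypothetical; the inner body is a pass-through so the dispatch shape stands alone):

```rust
/// Entry point: detect BMI1 at runtime on x86-64 and dispatch to a
/// clone compiled with the feature enabled, so `trailing_zeros`
/// lowers to TZCNT instead of BSF on that path.
fn decompress(input: &[u8], out: &mut Vec<u8>) {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("bmi1") {
            // SAFETY: BMI1 support was verified just above.
            return unsafe { decompress_bmi(input, out) };
        }
    }
    decompress_portable(input, out)
}

#[cfg(target_arch = "x86_64")]
#[target_feature(enable = "bmi1")]
unsafe fn decompress_bmi(input: &[u8], out: &mut Vec<u8>) {
    // Same source as the portable path; the attribute only changes
    // codegen for this copy of the function.
    decompress_portable(input, out)
}

fn decompress_portable(input: &[u8], out: &mut Vec<u8>) {
    out.extend_from_slice(input) // stand-in for the real decode loop
}
```

Detection runs once per call here; a real implementation might cache the result or resolve it at decompressor construction.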
Restore the unrolled match for escape handling (the jump table is ~4% faster than a loop on key workloads).

Apply smaller cleanups that don't affect performance: extract a block_end() helper, name the escape_mask intermediates, tighten macro comments, and use shorter variable names for bounds.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
Place calls to an empty #[cold] #[inline(never)] fn cold() {} at the
top of escape branches to hint to LLVM that escape handling is
unlikely. This improves code layout for the hot (escape-free) path,
yielding a 1-3% improvement on low-escape data (the common case).
Also explored and rejected: inline 32-code escape handling (hurts
low-escape icache), #[cold] escape handler function (call overhead
exceeds icache benefit).
Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
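The hint itself is just an empty function whose attributes do the work; a sketch with pass-through escape semantics:

```rust
/// Empty function whose only job is its attributes: calling it marks
/// the enclosing branch as unlikely, so LLVM lays the hot path out
/// contiguously and moves escape handling out of line.
#[cold]
#[inline(never)]
fn cold() {}

const ESCAPE: u8 = 0xFF;

fn decode(codes: &[u8], out: &mut Vec<u8>) {
    let mut i = 0;
    while i < codes.len() {
        let c = codes[i];
        i += 1;
        if c == ESCAPE {
            cold(); // hint: escapes are rare in well-compressed data
            if i < codes.len() {
                out.push(codes[i]); // escaped literal byte
                i += 1;
            }
        } else {
            out.push(c); // stand-in for the symbol-table emit
        }
    }
}
```

Unlike moving the whole handler into a #[cold] function, the empty call adds no argument-passing overhead on the cold path.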
…end speedup

Replace the generic `build_views()` call with an FSST-specific `build_views_fast()` that inlines `BinaryView` construction. The general `make_view()` is `#[inline(never)]` with a 13-arm match, causing a function call per string. The inlined version constructs views directly via `u128` byte manipulation, eliminating:
- Per-string function call overhead
- Buffer splitting checks (FSST data is always < 2 GiB)
- Match-based dispatch on string length

End-to-end improvement (decompress + build views):
- Short strings (avg 16B): 47% faster
- Medium strings (avg 64B): 21-26% faster
- URLs: 32-39% faster

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
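Direct u128 construction for the inlined case can be sketched as below, assuming the Arrow-style 16-byte BinaryView layout (u32 length, then up to 12 inline bytes for strings of length <= 12):

```rust
/// Build an inlined view (len <= 12): u32 length in the low 4 bytes,
/// string bytes immediately after, zero padding out to 16 bytes.
fn make_inline_view(s: &[u8]) -> u128 {
    assert!(s.len() <= 12);
    let mut bytes = [0u8; 16];
    bytes[..4].copy_from_slice(&(s.len() as u32).to_le_bytes());
    bytes[4..4 + s.len()].copy_from_slice(s);
    u128::from_le_bytes(bytes)
}
```

Strings longer than 12 bytes instead store a 4-byte prefix plus buffer index and offset; FSST output always fitting in one buffer is what lets the fast path skip the buffer-splitting checks.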
Make build_views_fast generic over the length ptype (via AsPrimitive<usize>) so it can consume the typed lengths slice directly from the PrimitiveArray. This removes:
- A Vec<usize> heap allocation (10k-100k elements)
- A second iteration over the lengths array to convert types

An additional 5-19% end-to-end improvement on top of the inlined view builder. Cumulative speedup from baseline: 33-54% depending on workload.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
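The genericity lets the view builder walk the typed lengths exactly once. A minimal sketch, with a local stand-in trait in place of num-traits' AsPrimitive<usize> so the snippet has no external dependency:

```rust
/// Stand-in for num_traits::AsPrimitive<usize>.
trait AsUsize: Copy {
    fn as_usize(self) -> usize;
}
impl AsUsize for u32 { fn as_usize(self) -> usize { self as usize } }
impl AsUsize for u64 { fn as_usize(self) -> usize { self as usize } }

/// Compute each string's start offset directly from the typed
/// lengths slice: one pass, no intermediate Vec<usize>.
fn starts<T: AsUsize>(lengths: &[T]) -> Vec<usize> {
    let mut offset = 0;
    lengths
        .iter()
        .map(|l| {
            let s = offset;
            offset += l.as_usize();
            s
        })
        .collect()
}
```

The real builder would emit a view per element inside the same pass rather than collecting offsets.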
New benchmark groups:
- views_old_* / views_new_*: isolated view-building comparison (old general build_views vs new inlined build_views_fast)
- raw_baseline_urls / raw_optimized_urls: raw decompression for URLs

Also expose build_views_fast and the canonical module under the _test-harness feature for direct benchmarking access.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
…tring coverage

Replace the parameterized (string_count, avg_len) synthetic benchmarks with real-world datasets from test_utils (ClickBench URLs, log lines, JSON, emails, file paths, short URLs), plus two custom datasets that exercise the BinaryView inlining threshold:
- short_strings: 3-12 bytes, all inlined (≤12-byte) views
- medium_strings: 8-20 bytes, a mix of inlined and reference views

Three benchmark groups on each dataset (8 datasets × 3 groups = 24 benchmarks):
- e2e_*: end-to-end to_canonical (full pipeline)
- views_old_* / views_new_*: isolated view-building comparison
- raw_baseline_* / raw_optimized_*: raw decompression comparison

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
Replace the 8-arm match statement in emit_before_escape with a compact while-loop that LLVM can unroll when the position is a compile-time constant from trailing_zeros. This reduces code duplication without sacrificing performance.

Also remove FSST_DECOMPRESSOR_OPTIMIZATION.md, as it served its purpose during development.

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
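The compact form can be sketched like this (`lookup` is a hypothetical stand-in for the symbol-table emit, which really writes a multi-byte symbol):

```rust
/// Emit the `n` codes that precede the first escape in an 8-byte
/// block, peeling one code per iteration off the low end of the
/// word. With `n` known at the call site, LLVM can fully unroll
/// this, recovering the effect of the old 8-arm match.
fn emit_before_escape(block: u64, n: usize, out: &mut Vec<u8>) {
    let mut w = block;
    for _ in 0..n {
        out.push(lookup(w as u8));
        w >>= 8;
    }
}

fn lookup(code: u8) -> u8 {
    code // placeholder for the symbols/lengths table access
}
```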
Merging this PR will degrade performance by 12.23%
Performance Changes
Polar Signals Profiling Results (latest run)
Powered by Polar Signals Cloud

Benchmarks: PolarSignals Profiling
Vortex (geomean): 1.086x ➖
datafusion / vortex-file-compressed (1.086x ➖, 0↑ 6↓)

Benchmarks: TPC-H SF=1 on NVMe
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.994x ➖, 0↑ 0↓)
datafusion / vortex-compact (1.000x ➖, 0↑ 0↓)
datafusion / parquet (0.980x ➖, 2↑ 1↓)
datafusion / arrow (0.991x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.993x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.009x ➖, 0↑ 0↓)
duckdb / parquet (1.011x ➖, 1↑ 3↓)
duckdb / duckdb (1.014x ➖, 0↑ 1↓)
Full attributed analysis
Benchmarks: FineWeb NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.990x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.989x ➖, 0↑ 1↓)
datafusion / parquet (0.972x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.955x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.972x ➖, 1↑ 0↓)
duckdb / parquet (0.977x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-DS SF=1 on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.996x ➖, 0↑ 1↓)
datafusion / vortex-compact (1.030x ➖, 0↑ 16↓)
datafusion / parquet (1.003x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (0.999x ➖, 1↑ 3↓)
duckdb / vortex-compact (0.993x ➖, 1↑ 0↓)
duckdb / parquet (0.996x ➖, 0↑ 1↓)
duckdb / duckdb (1.006x ➖, 1↑ 1↓)
Full attributed analysis
Benchmarks: TPC-H SF=1 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.904x ➖, 2↑ 1↓)
datafusion / vortex-compact (0.920x ➖, 3↑ 2↓)
datafusion / parquet (1.047x ➖, 0↑ 1↓)
duckdb / vortex-file-compressed (0.971x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.966x ➖, 0↑ 0↓)
duckdb / parquet (0.970x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.963x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.973x ➖, 0↑ 0↓)
datafusion / parquet (0.967x ➖, 0↑ 0↓)
datafusion / arrow (0.949x ➖, 1↑ 0↓)
duckdb / vortex-file-compressed (0.981x ➖, 0↑ 0↓)
duckdb / vortex-compact (0.974x ➖, 0↑ 0↓)
duckdb / parquet (0.986x ➖, 1↑ 0↓)
duckdb / duckdb (0.996x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: FineWeb S3
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.953x ➖, 0↑ 0↓)
datafusion / vortex-compact (0.944x ➖, 1↑ 0↓)
datafusion / parquet (0.962x ➖, 0↑ 0↓)
duckdb / vortex-file-compressed (0.988x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.003x ➖, 0↑ 0↓)
duckdb / parquet (0.984x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Random Access
Vortex (geomean): 0.924x ➖
unknown / unknown (0.974x ➖, 7↑ 3↓)

Benchmarks: Statistical and Population Genetics
Verdict: No clear signal (low confidence)
duckdb / vortex-file-compressed (0.941x ➖, 1↑ 0↓)
duckdb / vortex-compact (0.972x ➖, 1↑ 0↓)
duckdb / parquet (0.933x ➖, 1↑ 0↓)
Full attributed analysis
Benchmarks: TPC-H SF=10 on S3
Verdict: No clear signal (environment too noisy)
datafusion / vortex-file-compressed (0.936x ➖, 2↑ 0↓)
datafusion / vortex-compact (1.034x ➖, 0↑ 1↓)
datafusion / parquet (0.923x ➖, 3↑ 0↓)
duckdb / vortex-file-compressed (1.037x ➖, 0↑ 0↓)
duckdb / vortex-compact (1.018x ➖, 0↑ 0↓)
duckdb / parquet (0.983x ➖, 0↑ 0↓)
Full attributed analysis
Benchmarks: Clickbench on NVMe
Verdict: No clear signal (low confidence)
datafusion / vortex-file-compressed (0.969x ➖, 2↑ 0↓)
datafusion / parquet (0.974x ➖, 2↑ 1↓)
duckdb / vortex-file-compressed (1.072x ➖, 0↑ 13↓)
duckdb / parquet (0.999x ➖, 0↑ 0↓)
duckdb / duckdb (1.045x ➖, 2↑ 11↓)
Full attributed analysis
Benchmarks: Compression
Vortex (geomean): 1.004x ➖
unknown / unknown (1.018x ➖, 1↑ 9↓)
- Replace zero-init + copy in make_view_inline with a single 16-byte unaligned read, mask, and shift for inlined views (len <= 12)
- Use direct arithmetic for reference views instead of byte array copies
- Add a VIEW_BUILD_PADDING constant for safe 16-byte reads past the buffer end
- Process escape-free blocks before the first escape in the decompressor loop instead of breaking immediately on any escape detection
- Update benchmarks to use padded buffers

Signed-off-by: Claude <noreply@anthropic.com>
https://claude.ai/code/session_019hQy1qLZ3f8raikcRTgmpN
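The single-read construction from the first bullet can be sketched as below, assuming the 16-byte view layout (u32 length, then inline data); the len == 0 branch avoids an overflowing shift:

```rust
/// Buffers handed to the view builder are over-allocated by this many
/// bytes so a full 16-byte read at any string start stays in bounds.
const VIEW_BUILD_PADDING: usize = 16;

/// Inline view (len <= 12) from one unaligned 16-byte read: load,
/// mask off bytes past `len`, then shift the data above the u32
/// length field. No zero-init, no per-byte copy.
fn make_view_inline(buf: &[u8], offset: usize, len: usize) -> u128 {
    debug_assert!(len <= 12);
    debug_assert!(buf.len() >= offset + VIEW_BUILD_PADDING);
    let raw = u128::from_le_bytes(buf[offset..offset + 16].try_into().unwrap());
    let mask = if len == 0 { 0 } else { u128::MAX >> (8 * (16 - len)) };
    ((raw & mask) << 32) | len as u128
}
```

The padding constant is what makes the unconditional 16-byte load safe at strings near the end of the buffer.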
Summary
Closes: #000
Testing