Benchmarking Guide for keyspace_tracker

TL;DR - Maximum Performance

RUSTFLAGS="-C target-cpu=native" cargo bench --release

This single command enables:

Native CPU instruction set (NEON on ARM, AVX2 on x86)
Full LTO (cross-crate inlining)
Single codegen unit (global optimization)
Panic=abort (no unwinding overhead)

Optimization Flags Explained

1. CPU Target (`-C target-cpu=native`)

This is the most impactful flag.

Flag	Effect
`native`	Auto-detect CPU, use all available features
`apple-m1`	Apple Silicon (M1/M2/M3/M4)
`neoverse-v1`	AWS Graviton3
`neoverse-n1`	AWS Graviton2
`haswell`	Intel Haswell+ / AMD Zen+

On Apple M2, native enables:

ARMv8.5 instructions
Aggressive NEON vectorization
Apple-specific instruction scheduling

2. Link-Time Optimization (`lto = "fat"`)

[profile.bench]
lto = "fat"

Option	Effect
`"fat"`	Full LTO, best optimization, slowest compile
`"thin"`	Parallel LTO, good optimization, faster compile
`false`	No LTO

Fat LTO allows:

Cross-crate inlining
Dead code elimination
Hot loop optimization across module boundaries

3. Codegen Units (`codegen-units = 1`)

[profile.bench]
codegen-units = 1

LLVM optimizes the entire crate as a single unit instead of parallel shards. Trade-off: Slower compilation.

4. Panic Strategy (`panic = "abort"`)

[profile.bench]
panic = "abort"

Skips unwinding machinery. Saves ~1-2% on hot paths. Only use if you don't need panic recovery.

5. Overflow Checks (`overflow-checks = false`)

[profile.bench]
overflow-checks = false

Disables integer overflow checks in release. Already default in release, but explicit is clearer.

Platform-Specific Configuration

Apple Silicon (M1/M2/M3/M4)

.cargo/config.toml:

[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=apple-m1"]

Or use native which auto-detects:

RUSTFLAGS="-C target-cpu=native" cargo bench

AWS Graviton

Instance	CPU	Config
c6g, m6g, r6g	Graviton2	`target-cpu=neoverse-n1`
c7g, m7g, r7g	Graviton3	`target-cpu=neoverse-v1`
c8g, m8g, r8g	Graviton4	`target-cpu=neoverse-v2`

.cargo/config.toml:

[target.aarch64-unknown-linux-gnu]
rustflags = ["-C", "target-cpu=neoverse-v1"]

x86_64

CPU	Config
Intel Haswell+	`target-cpu=haswell`
Intel Skylake-X	`target-cpu=skylake-avx512`
AMD Zen3	`target-cpu=znver3`
AMD Zen4	`target-cpu=znver4`

Advanced Optimizations

Nightly-Only Flags

# Extra CPU tuning
RUSTFLAGS="-C target-cpu=native -Z tune-cpu=native" \
  cargo +nightly bench --release

Fast-Math (Floating Point)

⚠️ Only if you don't need strict IEEE compliance:

RUSTFLAGS="-C target-cpu=native -C llvm-args=-enable-no-nans-fp-math" \
  cargo bench --release

This enables:

Reordering of FP operations
Assuming no NaNs/infinities
Fused multiply-add

Verify Vectorization

Check if NEON/AVX instructions are generated:

# Install cargo-asm
cargo install cargo-asm

# View assembly for a function
cargo asm keyspace_tracker::keyspace_tracker::bitmap_simd::neon::popcount_slice

Look for:

ARM: cnt, fmla, ld1, st1
x86: vpopcnt, vpand, vmovdqu

Benchmark Stability Tips

macOS

# Disable App Nap
defaults write NSGlobalDomain NSAppSleepDisabled -bool YES

# Close heavy apps (Chrome, Docker, Slack)

# Plug in power (avoid thermal throttling)

# Use taskpolicy for consistent scheduling
taskpolicy -c utility cargo bench --release

Linux

# Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance

# Pin to specific CPUs
taskset -c 0-3 cargo bench --release

# Disable turbo boost for consistency
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

General

Run benchmarks multiple times
Check the confidence intervals in criterion output
Avoid running during system updates/background tasks
Warm up the CPU before measuring

Running Specific Benchmarks

# All benchmarks
RUSTFLAGS="-C target-cpu=native" cargo bench --release

# Bitmap operations only
cargo bench --release -- bitmap/

# SIMD-specific tests
cargo bench --release -- simd/

# Group operations
cargo bench --release -- group/

# Scaling tests
cargo bench --release -- scaling/

# Sampling and iteration benchmarks
cargo bench --release -- sampling/

# Per-key latency verification (<1µs requirement)
cargo bench --release -- latency/

Comparing Results

Criterion saves results in target/criterion/. To compare:

# Run baseline
cargo bench --release -- --save-baseline main

# Make changes, then compare
cargo bench --release -- --baseline main

Expected Performance

With full optimizations on Apple M2:

Operation	Expected Latency
`test` (exists)	~1.5 ns
`set` (add)	~7 ns
`test_and_set` (claim)	~2.5 ns
`find_next_set`	~3-5 ns
Hierarchical exists	~15-20 ns
Iterator per-key	~2.5-5 ns
Mixed-ratio iteration	~5 ns
Sampling iteration	~5 ns
Bulk `set_range` (1M)	~200 µs
Bulk `clear_range` (1M)	~200 µs

Iterator Performance Requirement

All iteration modes must maintain < 1µs per key regardless of keyspace size:

Keyspace Size	Per-Key Latency	Throughput
10,000	~2.5 ns	400 Melem/s
100,000	~5 ns	200 Melem/s
1,000,000	~5 ns	200 Melem/s
10,000,000	~5 ns	195 Melem/s

Throughput scales linearly with core count for parallel operations.

Cargo.toml Reference

[profile.bench]
opt-level = 3           # Maximum optimization
lto = "fat"             # Full link-time optimization
codegen-units = 1       # Single compilation unit
panic = "abort"         # No unwinding
overflow-checks = false # No integer overflow checks
debug = false           # No debug symbols

Troubleshooting

Benchmark variance too high

Check for background processes
Disable turbo boost
Use taskset to pin to specific cores

Not seeing SIMD instructions

Verify target-cpu is set correctly
Check cargo asm output
Ensure using --release

LTO taking too long

Use lto = "thin" for faster builds
Increase RAM if swapping during link

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Benchmarking Guide for keyspace_tracker

TL;DR - Maximum Performance

Optimization Flags Explained

1. CPU Target (`-C target-cpu=native`)

2. Link-Time Optimization (`lto = "fat"`)

3. Codegen Units (`codegen-units = 1`)

4. Panic Strategy (`panic = "abort"`)

5. Overflow Checks (`overflow-checks = false`)

Platform-Specific Configuration

Apple Silicon (M1/M2/M3/M4)

AWS Graviton

x86_64

Advanced Optimizations

Nightly-Only Flags

Fast-Math (Floating Point)

Verify Vectorization

Benchmark Stability Tips

macOS

Linux

General

Running Specific Benchmarks

Comparing Results

Expected Performance

Iterator Performance Requirement

Cargo.toml Reference

Troubleshooting

Benchmark variance too high

Not seeing SIMD instructions

LTO taking too long

FilesExpand file tree

BENCHMARKING.md

Latest commit

History

BENCHMARKING.md

File metadata and controls

Benchmarking Guide for keyspace_tracker

TL;DR - Maximum Performance

Optimization Flags Explained

1. CPU Target (-C target-cpu=native)

2. Link-Time Optimization (lto = "fat")

3. Codegen Units (codegen-units = 1)

4. Panic Strategy (panic = "abort")

5. Overflow Checks (overflow-checks = false)

Platform-Specific Configuration

Apple Silicon (M1/M2/M3/M4)

AWS Graviton

x86_64

Advanced Optimizations

Nightly-Only Flags

Fast-Math (Floating Point)

Verify Vectorization

Benchmark Stability Tips

macOS

Linux

General

Running Specific Benchmarks

Comparing Results

Expected Performance

Iterator Performance Requirement

Cargo.toml Reference

Troubleshooting

Benchmark variance too high

Not seeing SIMD instructions

LTO taking too long

1. CPU Target (`-C target-cpu=native`)

2. Link-Time Optimization (`lto = "fat"`)

3. Codegen Units (`codegen-units = 1`)

4. Panic Strategy (`panic = "abort"`)

5. Overflow Checks (`overflow-checks = false`)