RUSTFLAGS="-C target-cpu=native" cargo bench --release

Combined with the `[profile.bench]` settings described below, this single command enables:
- Native CPU instruction set (NEON on ARM, AVX2 on x86)
- Full LTO (cross-crate inlining)
- Single codegen unit (global optimization)
- Panic=abort (no unwinding overhead)

`-C target-cpu` is the most impactful flag.

| Flag | Effect |
|---|---|
| `native` | Auto-detect CPU, use all available features |
| `apple-m1` | Apple Silicon (M1/M2/M3/M4) |
| `neoverse-v1` | AWS Graviton3 |
| `neoverse-n1` | AWS Graviton2 |
| `haswell` | Intel Haswell+ / AMD Zen+ |


On Apple M2, `native` enables:
- ARMv8.5 instructions
- Aggressive NEON vectorization
- Apple-specific instruction scheduling
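
One way to confirm which SIMD features the chosen `target-cpu` actually enabled is to query them at compile time. The sketch below relies only on the standard `cfg!(target_feature = "...")` macro; the function name `compiled_simd_features` is a hypothetical helper for illustration, not part of this project.

```rust
/// Returns the SIMD-related target features that were enabled when this
/// code was compiled (hypothetical helper for illustration).
pub fn compiled_simd_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    // `cfg!` is evaluated at compile time, so the result reflects the flags
    // passed via RUSTFLAGS or .cargo/config.toml, not the CPU the binary
    // happens to run on.
    if cfg!(target_feature = "neon") {
        features.push("neon");
    }
    if cfg!(target_feature = "sse4.2") {
        features.push("sse4.2");
    }
    if cfg!(target_feature = "avx2") {
        features.push("avx2");
    }
    features
}
```

Printing the result with and without `-C target-cpu=native` shows the difference. On x86-64, for example, `avx2` is normally absent from the default baseline and only appears once a suitable `target-cpu` is set.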

[profile.bench]
lto = "fat"

| Option | Effect |
|---|---|
| `"fat"` | Full LTO, best optimization, slowest compile |
| `"thin"` | Parallel LTO, good optimization, faster compile |
| `false` | No cross-crate LTO (only thin-local LTO within each crate) |

Fat LTO allows:
- Cross-crate inlining
- Dead code elimination
- Hot loop optimization across module boundaries

[profile.bench]
codegen-units = 1

LLVM optimizes the entire crate as a single unit instead of parallel shards. Trade-off: slower compilation.

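[profile.bench]
panic = "abort"

Skips unwinding machinery. Saves ~1-2% on hot paths. Only use it if you don't need panic recovery. Note that the Cargo documentation describes the `panic` setting as ignored for tests and benchmarks, so verify that it actually applies to your bench builds.
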
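
To make "panic recovery" concrete: the sketch below shows the kind of code that stops working under `panic = "abort"`. It uses the standard `std::panic::catch_unwind`; the helper name `run_isolated` is hypothetical.

```rust
use std::panic::{self, UnwindSafe};

/// Runs `f` and turns a panic into `None` (hypothetical helper).
/// This only works when the code is compiled with unwinding enabled;
/// under `panic = "abort"` the process terminates at the panic and
/// `catch_unwind` never gets a chance to return.
pub fn run_isolated<T, F>(f: F) -> Option<T>
where
    F: FnOnce() -> T + UnwindSafe,
{
    panic::catch_unwind(f).ok()
}
```

If nothing in the code base relies on a pattern like this, switching to `panic = "abort"` should not change observable behavior beyond how a crash is reported.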
[profile.bench]
overflow-checks = false

Disables integer overflow checks. This is already the default for the `release` profile, which `bench` inherits from, but being explicit is clearer.

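
With overflow checks disabled, a plain `+` on integers wraps silently instead of panicking. Where the overflow behavior matters, it can be spelled out with the standard integer methods so the result does not depend on the profile. The function names below are hypothetical.

```rust
/// Adds with explicit wraparound, independent of `overflow-checks`.
pub fn add_wrapping(a: u32, b: u32) -> u32 {
    a.wrapping_add(b)
}

/// Adds and reports overflow as `None`, independent of `overflow-checks`.
pub fn add_checked(a: u32, b: u32) -> Option<u32> {
    a.checked_add(b)
}
```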
.cargo/config.toml:
[target.aarch64-apple-darwin]
rustflags = ["-C", "target-cpu=apple-m1"]

Or use `native`, which auto-detects:

RUSTFLAGS="-C target-cpu=native" cargo bench

| Instance | CPU | Config |
|---|---|---|
| c6g, m6g, r6g | Graviton2 | target-cpu=neoverse-n1 |
| c7g, m7g, r7g | Graviton3 | target-cpu=neoverse-v1 |
| c8g, m8g, r8g | Graviton4 | target-cpu=neoverse-v2 |
.cargo/config.toml:
[target.aarch64-unknown-linux-gnu]
rustflags = ["-C", "target-cpu=neoverse-v1"]

| CPU | Config |
|---|---|
| Intel Haswell+ | target-cpu=haswell |
| Intel Skylake-X | target-cpu=skylake-avx512 |
| AMD Zen3 | target-cpu=znver3 |
| AMD Zen4 | target-cpu=znver4 |
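
# Extra CPU tuning
RUSTFLAGS="-C target-cpu=native -Z tune-cpu=native" \
cargo +nightly bench --release

RUSTFLAGS="-C target-cpu=native -C llvm-args=-enable-no-nans-fp-math" \
cargo bench --release

This relaxes strict IEEE 754 floating-point semantics, so use it only when results do not depend on NaN or infinity handling. It is intended to enable: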
- Reordering of FP operations
- Assuming no NaNs/infinities
- Fused multiply-add
Check if NEON/AVX instructions are generated:
# Install cargo-asm
cargo install cargo-asm
# View assembly for a function
cargo asm keyspace_tracker::keyspace_tracker::bitmap_simd::neon::popcount_slice

Look for:
- ARM: `cnt`, `fmla`, `ld1`, `st1`
- x86: `vpopcnt`, `vpand`, `vmovdqu`

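
For orientation, here is a scalar stand-in for the kind of function being inspected. It assumes `popcount_slice` counts the set bits across a slice of `u64` words; the signature is a guess for illustration and this is not the crate's NEON implementation.

```rust
/// Counts the set bits in `words` (scalar sketch, assumed signature).
/// `#[inline(never)]` keeps the function as a separate symbol so that it
/// stays visible to tools that print per-function assembly.
#[inline(never)]
pub fn popcount_slice(words: &[u64]) -> u64 {
    words.iter().map(|w| u64::from(w.count_ones())).sum()
}
```

With a suitable `target-cpu`, the compiler typically lowers `count_ones` to a population-count instruction (`cnt` on AArch64, `popcnt` or a vector variant on x86), which is what to look for in the assembly listing.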
# Disable App Nap
defaults write NSGlobalDomain NSAppSleepDisabled -bool YES
# Close heavy apps (Chrome, Docker, Slack)
# Plug in power (avoid thermal throttling)
# Use taskpolicy for consistent scheduling
taskpolicy -c utility cargo bench --release

# Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance
# Pin to specific CPUs
taskset -c 0-3 cargo bench --release
# Disable turbo boost for consistency
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo

- Run benchmarks multiple times
- Check the confidence intervals in criterion output
- Avoid running during system updates/background tasks
- Warm up the CPU before measuring
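
The warm-up and repeated-measurement advice can be sketched with the standard library alone. The helper `median_ns` below is hypothetical and far cruder than Criterion's statistical analysis; it only illustrates why warm-up runs are discarded and why a median over many samples is more stable than a single timing.

```rust
use std::time::Instant;

/// Runs `f` `warmup` times untimed, then times `samples` runs and returns
/// the median duration in nanoseconds (hypothetical helper).
pub fn median_ns<F: FnMut()>(mut f: F, warmup: usize, samples: usize) -> u128 {
    assert!(samples > 0, "need at least one sample");
    // Warm-up: let caches, branch predictors and CPU clocks settle.
    for _ in 0..warmup {
        f();
    }
    // Measurement: time each run individually.
    let mut times: Vec<u128> = (0..samples)
        .map(|_| {
            let start = Instant::now();
            f();
            start.elapsed().as_nanos()
        })
        .collect();
    // The median is less sensitive to outliers than the mean.
    times.sort_unstable();
    times[times.len() / 2]
}
```

When timing real work, wrap the result in `std::hint::black_box` inside the closure so the optimizer cannot remove the computation.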
# All benchmarks
RUSTFLAGS="-C target-cpu=native" cargo bench --release
# Bitmap operations only
cargo bench --release -- bitmap/
# SIMD-specific tests
cargo bench --release -- simd/
# Group operations
cargo bench --release -- group/
# Scaling tests
cargo bench --release -- scaling/
# Sampling and iteration benchmarks
cargo bench --release -- sampling/
# Per-key latency verification (<1µs requirement)
cargo bench --release -- latency/

Criterion saves results in `target/criterion/`. To compare:

# Run baseline
cargo bench --release -- --save-baseline main
# Make changes, then compare
cargo bench --release -- --baseline main

With full optimizations on Apple M2:

| Operation | Expected Latency |
|---|---|
| `test` (exists) | ~1.5 ns |
| `set` (add) | ~7 ns |
| `test_and_set` (claim) | ~2.5 ns |
| `find_next_set` | ~3-5 ns |
| Hierarchical exists | ~15-20 ns |
| Iterator per-key | ~2.5-5 ns |
| Mixed-ratio iteration | ~5 ns |
| Sampling iteration | ~5 ns |
| Bulk `set_range` (1M) | ~200 µs |
| Bulk `clear_range` (1M) | ~200 µs |

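
To make the operation names in the table concrete, here is a minimal single-threaded bitmap sketch. The type `Bitmap` and its layout are assumptions for illustration; the benchmarked crate's actual implementation (SIMD, hierarchy, atomics) is not shown in this document and may differ substantially.

```rust
/// Hypothetical fixed-size bitmap backed by 64-bit words.
pub struct Bitmap {
    words: Vec<u64>,
    len: usize,
}

impl Bitmap {
    pub fn new(len: usize) -> Self {
        Bitmap { words: vec![0; (len + 63) / 64], len }
    }

    /// Returns true if bit `i` is set ("exists").
    pub fn test(&self, i: usize) -> bool {
        assert!(i < self.len);
        self.words[i / 64] & (1u64 << (i % 64)) != 0
    }

    /// Sets bit `i` ("add").
    pub fn set(&mut self, i: usize) {
        assert!(i < self.len);
        self.words[i / 64] |= 1u64 << (i % 64);
    }

    /// Sets bit `i` and returns true if it was previously clear ("claim").
    pub fn test_and_set(&mut self, i: usize) -> bool {
        assert!(i < self.len);
        let mask = 1u64 << (i % 64);
        let word = &mut self.words[i / 64];
        let was_clear = *word & mask == 0;
        *word |= mask;
        was_clear
    }

    /// Returns the index of the first set bit at or after `start`.
    pub fn find_next_set(&self, start: usize) -> Option<usize> {
        if start >= self.len {
            return None;
        }
        let mut w = start / 64;
        // Ignore bits below `start` in the first word.
        let mut word = self.words[w] & (u64::MAX << (start % 64));
        loop {
            if word != 0 {
                // `trailing_zeros` gives the lowest set bit in the word.
                return Some(w * 64 + word.trailing_zeros() as usize);
            }
            w += 1;
            if w == self.words.len() {
                return None;
            }
            word = self.words[w];
        }
    }
}
```

Each operation touches a single word in the common case, which is consistent with the single-digit nanosecond figures above, although the numbers in the table come from the real implementation and not from this sketch.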

All iteration modes must stay below 1 µs per key regardless of keyspace size:

| Keyspace Size | Per-Key Latency | Throughput |
|---|---|---|
| 10,000 | ~2.5 ns | 400 Melem/s |
| 100,000 | ~5 ns | 200 Melem/s |
| 1,000,000 | ~5 ns | 200 Melem/s |
| 10,000,000 | ~5 ns | 195 Melem/s |
Throughput scales linearly with core count for parallel operations.
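
The latency and throughput columns in the table above are two views of the same measurement. A small sketch of the conversion follows; the function names are hypothetical.

```rust
/// Converts a per-key latency in nanoseconds to throughput in Melem/s.
pub fn melem_per_sec(ns_per_key: f64) -> f64 {
    // 1 s = 1e9 ns and 1 Melem = 1e6 elements, so the factor is 1e9 / 1e6.
    1_000.0 / ns_per_key
}

/// Checks the "< 1 µs per key" requirement (1 µs = 1000 ns).
pub fn within_latency_budget(ns_per_key: f64) -> bool {
    ns_per_key < 1_000.0
}
```

For example, 2.5 ns per key corresponds to 400 Melem/s and 5 ns to 200 Melem/s, matching the first rows of the table. The 195 Melem/s figure corresponds to roughly 5.1 ns per key.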
[profile.bench]
opt-level = 3 # Maximum optimization
lto = "fat" # Full link-time optimization
codegen-units = 1 # Single compilation unit
panic = "abort" # No unwinding
overflow-checks = false # No integer overflow checks
debug = false # No debug symbols

If results are inconsistent:
- Check for background processes
- Disable turbo boost
- Use `taskset` to pin to specific cores

If performance is lower than expected:
- Verify `target-cpu` is set correctly
- Check `cargo asm` output
- Ensure you are using `--release`

If compilation is too slow:
- Use `lto = "thin"` for faster builds
- Increase RAM if swapping during link