Sparse Dictionary Canonicalize#7841

Draft
gatesn wants to merge 11 commits into develop from ngates/sparse-dict

@gatesn
Contributor

@gatesn gatesn commented May 8, 2026

When the number of unique codes is much smaller than the number of dictionary values, we should collect the unique codes, filter the values down to the ones those codes reference, and remap the codes into the smaller dictionary's value space.
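
The three steps described above can be sketched in plain Rust. This is a minimal illustration over slices, not the code in this PR: `compact_dictionary` and its signature are hypothetical, it ignores nulls, and it assumes every code is a valid index into `values` (an out-of-range code would panic).

```rust
/// Hypothetical helper (not the Vortex API): shrink a dictionary to the values
/// that `codes` actually reference.
/// Returns `(remapped_codes, compacted_values)`.
fn compact_dictionary<T: Clone>(codes: &[u32], values: &[T]) -> (Vec<u32>, Vec<T>) {
    // 1. Collect the unique codes as a "referenced" mask over the values.
    let mut referenced = vec![false; values.len()];
    for &code in codes {
        referenced[code as usize] = true;
    }

    // 2. Filter the values, recording the new position of each kept value.
    //    Unreferenced slots keep the u32::MAX sentinel and are never read.
    let mut remap = vec![u32::MAX; values.len()];
    let mut compacted = Vec::new();
    for (old_idx, value) in values.iter().enumerate() {
        if referenced[old_idx] {
            remap[old_idx] = compacted.len() as u32;
            compacted.push(value.clone());
        }
    }

    // 3. Remap the codes into the smaller dictionary's value space.
    let new_codes = codes.iter().map(|&code| remap[code as usize]).collect();
    (new_codes, compacted)
}
```

Decoding the remapped codes against the compacted values yields the same logical array as decoding the original codes against the original values. The mask and remap table are both sized by the original dictionary, so the extra work scales with the number of values as well as the number of codes.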

gatesn added 3 commits May 8, 2026 09:31
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
@joseph-isaacs
Contributor

I think you want this idea:

fn compute_referenced_values_mask(&self, referenced: bool) -> VortexResult<BitBuffer> {
but could be sampled?
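
One possible reading of the sampling suggestion, sketched under stated assumptions: inspect only every `stride`-th code and count the distinct values seen. The function names, the stride-based sampling, and the threshold below are all hypothetical and are not taken from this PR or from `compute_referenced_values_mask`. Because a sample can only miss references, the count is a lower bound on the true number of referenced values, so it can rule compaction out early but cannot stand in for the exact mask.

```rust
/// Hypothetical heuristic: count distinct dictionary values referenced by
/// every `stride`-th code. The result is a lower bound on the true count.
/// Assumes every sampled code is a valid index below `num_values`.
fn sampled_referenced_lower_bound(codes: &[u32], num_values: usize, stride: usize) -> usize {
    let mut seen = vec![false; num_values];
    let mut count = 0;
    // `step_by` panics on a zero step, so clamp the stride to at least 1.
    for &code in codes.iter().step_by(stride.max(1)) {
        let idx = code as usize;
        if !seen[idx] {
            seen[idx] = true;
            count += 1;
        }
    }
    count
}

/// Hypothetical decision rule: if the sample alone already touches at least
/// `threshold` (a fraction in 0.0..=1.0) of the dictionary, compaction is
/// unlikely to pay off and the exact pass can be skipped.
fn dictionary_looks_dense(codes: &[u32], num_values: usize, stride: usize, threshold: f64) -> bool {
    let lower_bound = sampled_referenced_lower_bound(codes, num_values, stride);
    lower_bound as f64 >= threshold * num_values as f64
}
```

A `false` result here does not prove the dictionary is sparse; it only means the sample did not rule sparsity out, so an exact pass over all codes would still be needed before filtering the values.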

gatesn added 2 commits May 8, 2026 11:23
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>

# Conflicts:
#	vortex-array/src/arrays/dict/vtable/mod.rs
#	vortex-duckdb/src/exporter/dict.rs
@gatesn gatesn force-pushed the ngates/sparse-dict branch from 25ed843 to c3e209d May 8, 2026 15:29
@gatesn gatesn added the changelog/performance A performance improvement label May 8, 2026
@codspeed-hq

codspeed-hq Bot commented May 8, 2026

Merging this PR will degrade performance by 61.57%

⚠️ Unknown Walltime execution environment detected

Using the Walltime instrument on standard Hosted Runners will lead to inconsistent data.

For the most accurate results, we recommend using CodSpeed Macro Runners: bare-metal machines fine-tuned for performance measurement consistency.

⚡ 8 improved benchmarks
❌ 20 regressed benchmarks
✅ 1180 untouched benchmarks

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

| Mode | Benchmark | BASE | HEAD | Efficiency |
|---|---|---|---|---|
| Simulation | bench_many_codes_few_values[1024] | 397.5 µs | 324.2 µs | +22.59% |
| Simulation | bench_many_codes_few_values[2048] | 368.5 µs | 295.9 µs | +24.56% |
| Simulation | bench_many_codes_few_values[4096] | 374.3 µs | 301.7 µs | +24.09% |
| Simulation | bench_many_nulls[0.5] | 361.8 µs | 317.6 µs | +13.9% |
| Simulation | bench_many_nulls[0.9] | 537.1 µs | 456.5 µs | +17.67% |
| Simulation | bench_sparse_coverage[0.01] | 366.4 µs | 293.7 µs | +24.76% |
| Simulation | bench_sparse_coverage[0.1] | 364.8 µs | 292 µs | +24.9% |
| Simulation | bench_sparse_coverage[0.5] | 364.8 µs | 292.1 µs | +24.9% |
| Simulation | take_filter_random_mask_random_indices[10000, 1000] | 277.7 µs | 419.2 µs | -33.76% |
| Simulation | take_filter_random_mask_random_indices[50000, 10000] | 753.2 µs | 1,643.5 µs | -54.17% |
| Simulation | take_filter_random_mask_random_indices[50000, 1000] | 664.6 µs | 1,187.6 µs | -44.04% |
| Simulation | take_filter_random_mask_random_indices[90000, 10000] | 989.3 µs | 2,121.8 µs | -53.37% |
| Simulation | take_filter_random_mask_random_indices[90000, 1000] | 896 µs | 1,708.4 µs | -47.56% |
| Simulation | take_filter_random_mask_sequential_indices[10000, 1000] | 280.6 µs | 424 µs | -33.82% |
| Simulation | take_filter_random_mask_sequential_indices[50000, 10000] | 742.2 µs | 1,619.6 µs | -54.18% |
| Simulation | take_filter_random_mask_sequential_indices[50000, 1000] | 663.2 µs | 1,185.1 µs | -44.04% |
| Simulation | take_filter_random_mask_sequential_indices[90000, 10000] | 984.1 µs | 2,117.1 µs | -53.52% |
| Simulation | take_filter_random_mask_sequential_indices[90000, 1000] | 915 µs | 1,713.6 µs | -46.6% |
| Simulation | take_filter_slice_mask_random_indices[10000, 1000] | 191.4 µs | 316.1 µs | -39.44% |
| Simulation | take_filter_slice_mask_random_indices[50000, 10000] | 879.8 µs | 1,596.8 µs | -44.9% |
| ... | ... | ... | ... | ... |

ℹ️ Only the first 20 benchmarks are displayed. Go to the app to view all benchmarks.


Comparing ngates/sparse-dict (58bb1d4) with develop (f3d5f09)

gatesn added 6 commits May 8, 2026 11:44
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
Signed-off-by: Nicholas Gates <nick@nickgates.com>
