Below we compare speeds and compressed sizes on 3 real-world datasets. All these results, as well as those from the paper, are available in the results CSVs, e.g. the results for columnar datasets on a MacBook Pro. All benchmarks reported here and in the paper can easily be run via the CLI.
The 3 datasets we display here are:
- Devin Smith's air quality data download (15MB)
- NYC taxi data (2023-04 high volume for hire) (469MB)
- Reddit r/place 2022 data
| dataset | uncompressed size | numeric data types |
|---|---|---|
| air quality | 59.7MB | i16, i32, i64 |
| taxi | 2.14GB | f64, i32, i64 |
| r/place | 4.19GB | i32, i64 |
For these results, we used a single performance core of a MacBook Pro (M3 Max). Only numerical columns were used. For Blosc, we used the SHUFFLE filter and its default Zstd level of 3. For Parquet, we used its default Zstd level of 1.
Even at their maximum compression levels, the Zstd-based codecs don't perform much better. For example, on the taxi dataset, Parquet+Zstd at the maximum Zstd level of 22 and Blosc+Zstd at the maximum Blosc level of 9 achieve compression ratios of 5.32 and 2.85, respectively. In contrast, Pco achieves 6.89 at level 8 and 6.98 at level 12.
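To see why cranking up a general-purpose codec's level yields diminishing returns on numeric data, here is a minimal sketch. It uses Python's stdlib `zlib` as a stand-in for Zstd (both are byte-oriented LZ codecs; Zstd bindings are not in the stdlib) and a synthetic f64 random walk as a stand-in for a real numeric column — the exact ratios are illustrative only, not the benchmark numbers above.

```python
import random
import struct
import zlib

# Synthetic stand-in for a numeric column: an f64 random walk. The exponent
# bytes repeat across values, but the mantissa bytes look like noise to a
# byte-oriented LZ codec, so raising the level barely helps.
random.seed(0)
values = []
x = 0.0
for _ in range(100_000):
    x += random.gauss(0.0, 1.0)
    values.append(x)
raw = struct.pack(f"<{len(values)}d", *values)

for level in (1, 9):  # fastest vs. slowest zlib level
    compressed = zlib.compress(raw, level)
    print(f"level {level}: ratio {len(raw) / len(compressed):.2f}")
```

Both levels land at a small ratio because most of the entropy sits in the mantissa bits, which an LZ match finder cannot exploit no matter how hard it searches. A numeric-aware codec like Pco models that structure directly instead.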