Conversation
019c649 to
64bf881
Compare
64bf881 to
46ff95c
Compare
|
I plan to take a look at it during the weekend! Thanks |
46ff95c to
788369c
Compare
|
Have rebase with #91 and it's ready to review. cc @notfilippo @tisonkun |
There was a problem hiding this comment.
Pull request overview
This pull request implements serialization and deserialization support for Theta Sketch's compact representation, adding support for Apache DataSketches binary format versions 1-4. This enables interoperability with Java and C++ implementations and supports Iceberg's Puffin statistics format for storing NDV (number of distinct values) metadata.
Changes:
- Adds
CompactThetaSketchstruct for immutable, serialization-friendly theta sketch representation - Implements uncompressed (v3) and compressed (v4) serialization with delta encoding and bit-packing
- Implements deserialization for legacy versions v1, v2, and modern versions v3, v4 with comprehensive validation
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| datasketches/src/theta/sketch.rs | Adds CompactThetaSketch struct with serialization/deserialization methods, compact() conversion, and versioned deserializers |
| datasketches/src/theta/serialization.rs | Defines binary format constants for flags and version numbers |
| datasketches/src/theta/mod.rs | Exports CompactThetaSketch and DEFAULT_UPDATE_SEED for public API |
| datasketches/src/theta/hash_table.rs | Adds hash_seed() accessor method for seed hash computation |
| datasketches/src/hash/mod.rs | Changes DEFAULT_UPDATE_SEED visibility from pub(crate) to pub for deserialization API |
| datasketches/src/error.rs | Updates invalid_preamble_longs to accept multiple expected values via AsRef<[u8]> |
| datasketches/src/codec/family.rs | Adds Family::THETA constant with id=3 and preamble size range 1-3 |
| datasketches/src/tdigest/sketch.rs | Updates invalid_preamble_longs call site to use std::slice::from_ref |
| datasketches/src/frequencies/sketch.rs | Updates invalid_preamble_longs call sites to use std::slice::from_ref |
| datasketches/src/cpc/sketch.rs | Updates invalid_preamble_longs call site to use std::slice::from_ref |
| datasketches/src/countmin/sketch.rs | Updates invalid_preamble_longs call site to use std::slice::from_ref |
| datasketches/tests/theta_serialization_test.rs | Adds Java/C++ compatibility tests for both compressed and uncompressed formats |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
datasketches/src/theta/bit_pack.rs
Outdated
| /// `(byte_index, offset)` write position. | ||
| /// | ||
| /// Returns the new `(byte_index, offset)` write state. | ||
| pub(crate) fn pack_bits( |
There was a problem hiding this comment.
Let me change it to bit packer for more natural use and readability.
Signed-off-by: tison <wander4096@gmail.com>
datasketches/src/theta/sketch.rs
Outdated
| match pre_longs { | ||
| 1 => Ok(Self { | ||
| entries: vec![], | ||
| theta: MAX_THETA, | ||
| seed_hash, | ||
| ordered: true, | ||
| empty: true, | ||
| }), | ||
| 2 => { | ||
| let num_entries = cursor.read_u32_le().map_err(make_error("num_entries"))? as usize; | ||
| cursor.read_u32_le().map_err(make_error("<unused_u32>"))?; | ||
| let entries = Self::read_entries(&mut cursor, num_entries, MAX_THETA)?; | ||
| Ok(Self { | ||
| entries, | ||
| theta: MAX_THETA, | ||
| seed_hash, | ||
| ordered: true, | ||
| empty: true, | ||
| }) | ||
| } | ||
| 3 => { |
There was a problem hiding this comment.
@ZENOTME We'd better define constants for what 1,2,3 here mean.
There was a problem hiding this comment.
I'm thinking maybe make it as a type is a clearer way:
struct PremableV2(u8);
impl PremableV2 {
bool is_estimate()
...
}
struct PremableV3(u8);
impl PremableV3 {
bool is_estimate()
...
}
There was a problem hiding this comment.
Actually, I noticed that it's different when checking for serialization/deserialization of the same field, e.g., theta, as follows:
// serialize
if (this->is_estimation_mode()) write(os, this->theta_); // means that preamble = 3
// deserialize
if (preamble_longs > 2) theta = read<uint64_t>(is);
Change like the following may be better and more consistent:
// serialize
if (this->is_estimation_mode()) write(os, this->theta_); // means that preamble = 3
// deserialize
if (preamble_longs.is_estimate()) theta = read<uint64_t>(is);
tisonkun
left a comment
There was a problem hiding this comment.
Thanks for your contribution! LGTM.
I'm waiting a few days for @notfilippo and @leerho review. But I'd merge this patch next week if there are no more concerns.
This refers #30, including: