Skip to content

feat(theta): compact sketch serde v3/v4#77

Open
ZENOTME wants to merge 14 commits intoapache:mainfrom
ZENOTME:ZENOTME/compact-theta-md
Open

feat(theta): compact sketch serde v3/v4#77
ZENOTME wants to merge 14 commits intoapache:mainfrom
ZENOTME:ZENOTME/compact-theta-md

Conversation

@ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Jan 25, 2026

This refers #30, including:

  • implement CompactThetaSketch
  • serialize CompactThetaSketch for uncompress/compress
  • deserialize CompactThetaSketch for v1,v2,v3,v4

@ZENOTME ZENOTME marked this pull request as draft January 25, 2026 17:57
@ZENOTME ZENOTME force-pushed the ZENOTME/compact-theta-md branch from 019c649 to 64bf881 Compare February 9, 2026 07:01
@ZENOTME ZENOTME marked this pull request as ready for review February 9, 2026 07:02
@ZENOTME ZENOTME force-pushed the ZENOTME/compact-theta-md branch from 64bf881 to 46ff95c Compare February 9, 2026 07:06
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 9, 2026

cc @tisonkun @PsiACE @notfilippo @Xuanwo

@notfilippo
Copy link
Member

I plan to take a look at it during the weekend! Thanks

@ZENOTME ZENOTME force-pushed the ZENOTME/compact-theta-md branch from 46ff95c to 788369c Compare February 15, 2026 02:45
@ZENOTME
Copy link
Contributor Author

ZENOTME commented Feb 15, 2026

Have rebase with #91 and it's ready to review. cc @notfilippo @tisonkun

Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This pull request implements serialization and deserialization support for Theta Sketch's compact representation, adding support for Apache DataSketches binary format versions 1-4. This enables interoperability with Java and C++ implementations and supports Iceberg's Puffin statistics format for storing NDV (number of distinct values) metadata.

Changes:

  • Adds CompactThetaSketch struct for immutable, serialization-friendly theta sketch representation
  • Implements uncompressed (v3) and compressed (v4) serialization with delta encoding and bit-packing
  • Implements deserialization for legacy versions v1, v2, and modern versions v3, v4 with comprehensive validation

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 6 comments.

Show a summary per file
File Description
datasketches/src/theta/sketch.rs Adds CompactThetaSketch struct with serialization/deserialization methods, compact() conversion, and versioned deserializers
datasketches/src/theta/serialization.rs Defines binary format constants for flags and version numbers
datasketches/src/theta/mod.rs Exports CompactThetaSketch and DEFAULT_UPDATE_SEED for public API
datasketches/src/theta/hash_table.rs Adds hash_seed() accessor method for seed hash computation
datasketches/src/hash/mod.rs Changes DEFAULT_UPDATE_SEED visibility from pub(crate) to pub for deserialization API
datasketches/src/error.rs Updates invalid_preamble_longs to accept multiple expected values via AsRef<[u8]>
datasketches/src/codec/family.rs Adds Family::THETA constant with id=3 and preamble size range 1-3
datasketches/src/tdigest/sketch.rs Updates invalid_preamble_longs call site to use std::slice::from_ref
datasketches/src/frequencies/sketch.rs Updates invalid_preamble_longs call sites to use std::slice::from_ref
datasketches/src/cpc/sketch.rs Updates invalid_preamble_longs call site to use std::slice::from_ref
datasketches/src/countmin/sketch.rs Updates invalid_preamble_longs call site to use std::slice::from_ref
datasketches/tests/theta_serialization_test.rs Adds Java/C++ compatibility tests for both compressed and uncompressed formats

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
@tisonkun tisonkun changed the title Theta: compact sketch serde v3/v4 feat(theta): compact sketch serde v3/v4 Feb 15, 2026
tisonkun and others added 4 commits February 15, 2026 14:25
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
Signed-off-by: tison <wander4096@gmail.com>
/// `(byte_index, offset)` write position.
///
/// Returns the new `(byte_index, offset)` write state.
pub(crate) fn pack_bits(
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me change it to bit packer for more natural use and readability.

Signed-off-by: tison <wander4096@gmail.com>
@tisonkun
Copy link
Member

tisonkun commented Feb 15, 2026

cc @ZENOTME I pushed 0db7843 to fine tune some methods and reduce duplicate.

Comment on lines 664 to 684
match pre_longs {
1 => Ok(Self {
entries: vec![],
theta: MAX_THETA,
seed_hash,
ordered: true,
empty: true,
}),
2 => {
let num_entries = cursor.read_u32_le().map_err(make_error("num_entries"))? as usize;
cursor.read_u32_le().map_err(make_error("<unused_u32>"))?;
let entries = Self::read_entries(&mut cursor, num_entries, MAX_THETA)?;
Ok(Self {
entries,
theta: MAX_THETA,
seed_hash,
ordered: true,
empty: true,
})
}
3 => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ZENOTME We'd better define constants for what 1,2,3 here mean.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm thinking maybe make it as a type is a clearer way:

struct PremableV2(u8);

impl PremableV2 {
  bool is_estimate()
  ...
}

struct PremableV3(u8);

impl PremableV3 {
  bool is_estimate()
  ...
}

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, I noticed that it's different when checking for serialization/deserialization of the same field, e.g., theta, as follows:

// serialize
if (this->is_estimation_mode()) write(os, this->theta_); // means that preamble = 3

// deserialize
if (preamble_longs > 2) theta = read<uint64_t>(is);

Change like the following may be better and more consistent:

// serialize
if (this->is_estimation_mode()) write(os, this->theta_); // means that preamble = 3

// deserialize
if (preamble_longs.is_estimate()) theta = read<uint64_t>(is);

Copy link
Member

@tisonkun tisonkun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for your contribution! LGTM.

I'm waiting a few days for @notfilippo and @leerho review. But I'd merge this patch next week if there are no more concerns.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants