Commit 019c649

Add compact theta sketch serialization
1 parent 583ac18 commit 019c649

8 files changed

Lines changed: 1158 additions & 10 deletions

datasketches/src/common/mod.rs

Lines changed: 2 additions & 0 deletions
@@ -20,8 +20,10 @@
2020
// public common components for datasketches crate
2121
mod num_std_dev;
2222
mod resize;
23+
mod seed_hash;
2324
pub use self::num_std_dev::NumStdDev;
2425
pub use self::resize::ResizeFactor;
2526

2627
// private to datasketches crate
2728
pub(crate) mod binomial_bounds;
29+
pub(crate) use self::seed_hash::compute_seed_hash;
Lines changed: 28 additions & 0 deletions
@@ -0,0 +1,28 @@
1+
// Licensed to the Apache Software Foundation (ASF) under one
2+
// or more contributor license agreements. See the NOTICE file
3+
// distributed with this work for additional information
4+
// regarding copyright ownership. The ASF licenses this file
5+
// to you under the Apache License, Version 2.0 (the
6+
// "License"); you may not use this file except in compliance
7+
// with the License. You may obtain a copy of the License at
8+
//
9+
// http://www.apache.org/licenses/LICENSE-2.0
10+
//
11+
// Unless required by applicable law or agreed to in writing,
12+
// software distributed under the License is distributed on an
13+
// "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
14+
// KIND, either express or implied. See the License for the
15+
// specific language governing permissions and limitations
16+
// under the License.
17+
18+
use std::hash::Hasher;
19+
20+
use crate::hash::MurmurHash3X64128;
21+
22+
pub(crate) fn compute_seed_hash(seed: u64) -> u16 {
23+
let mut hasher = MurmurHash3X64128::with_seed(0);
24+
hasher.write(&seed.to_le_bytes());
25+
let (h1, _) = hasher.finish128();
26+
(h1 & 0xffff) as u16
27+
}
28+

datasketches/src/countmin/serialization.rs

Lines changed: 1 addition & 10 deletions
@@ -15,19 +15,10 @@
1515
// specific language governing permissions and limitations
1616
// under the License.
1717

18-
use std::hash::Hasher;
19-
20-
use crate::hash::MurmurHash3X64128;
21-
2218
pub(super) const PREAMBLE_LONGS_SHORT: u8 = 2;
2319
pub(super) const SERIAL_VERSION: u8 = 1;
2420
pub(super) const COUNTMIN_FAMILY_ID: u8 = 18;
2521
pub(super) const FLAGS_IS_EMPTY: u8 = 1 << 0;
2622
pub(super) const LONG_SIZE_BYTES: usize = 8;
2723

28-
pub(super) fn compute_seed_hash(seed: u64) -> u16 {
29-
let mut hasher = MurmurHash3X64128::with_seed(0);
30-
hasher.write(&seed.to_le_bytes());
31-
let (h1, _) = hasher.finish128();
32-
(h1 & 0xffff) as u16
33-
}
24+
pub(super) use crate::common::compute_seed_hash;

datasketches/src/theta/hash_table.rs

Lines changed: 4 additions & 0 deletions
@@ -296,6 +296,10 @@ impl ThetaHashTable {
296296
self.lg_nom_size
297297
}
298298

299+
pub(crate) fn hash_seed(&self) -> u64 {
300+
self.hash_seed
301+
}
302+
299303
/// Get stride for hash table probing
300304
fn get_stride(key: u64, lg_size: u8) -> usize {
301305
(2 * ((key >> (lg_size)) & STRIDE_MASK) + 1) as usize
Lines changed: 179 additions & 0 deletions
@@ -0,0 +1,179 @@
1+
# Compact Theta Sketch (Implementation Notes)
2+
3+
This document describes how the Rust implementation should represent and interoperate with the
4+
Apache DataSketches **Compact Theta Sketch** formats used by the Java and C++ libraries.
5+
6+
The intent is cross-language compatibility:
7+
8+
- **On-heap representation**: a minimal immutable form of a Theta sketch.
9+
- **Binary format**: compatible serialization/deserialization (uncompressed `serVer = 3`), matching
10+
the preamble layout and flags used by `datasketches-java` and `datasketches-cpp`.
11+
12+
---
13+
14+
## compact theta sketch
15+
16+
A compact theta sketch is the immutable, serialization-friendly form of a theta sketch:
17+
18+
- It stores a **compact array** of retained hash values (no interstitial zeros like the update
19+
sketch’s hash table).
20+
- It stores the **theta** threshold (`thetaLong`) and a **seed hash** (16-bit).
21+
- It can be **ordered** (sorted ascending by hash) or **unordered**.
22+
- It is **read-only** (cannot be updated), but is intended to participate in set operations.
23+
24+
### Hash invariants (cross-language)
25+
26+
Java/C++ (and this Rust crate) treat retained hashes as:
27+
28+
- 63-bit, non-negative values derived from MurmurHash3 (128-bit), taking `h1 >> 1`.
29+
- `0` is reserved for empty slots and must not appear as a retained entry.
30+
- Every retained entry must satisfy: `0 < hash < thetaLong`.
31+
- `thetaLong` uses the signed max (`Long.MAX_VALUE` / `i64::MAX`) as “1.0” (no sampling).
32+
33+
In Rust, `MAX_THETA` is `i64::MAX as u64`, matching Java/C++.
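
The invariants above can be expressed as a couple of small checks. This is a minimal sketch, not the crate's actual code: `to_retained_hash` and `is_valid_entry` are hypothetical helper names introduced here for illustration, and only `MAX_THETA = i64::MAX as u64` comes from the text.

```rust
/// Largest theta value; represents a sampling probability of 1.0.
pub const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical helper: turn the first 64-bit half (`h1`) of a 128-bit
/// MurmurHash3 result into a 63-bit, non-negative retained hash.
pub fn to_retained_hash(h1: u64) -> u64 {
    h1 >> 1
}

/// Hypothetical helper: check the invariant `0 < hash < theta`.
/// Zero is reserved for empty slots, so it is never a valid entry.
pub fn is_valid_entry(hash: u64, theta: u64) -> bool {
    hash != 0 && hash < theta
}
```

Shifting right by one clears the top bit, so every derived hash fits in a signed 64-bit `long` on the Java side.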
34+
35+
### Compact-state truth table (Java/C++ behavior)
36+
37+
When producing a compact sketch (or serializing), Java defines a truth table over `(empty, curCount,
38+
thetaLong)` and applies corrections in specific cases (see `CompactOperations.correctThetaOnCompact`
39+
and related helpers):
40+
41+
- Normal empty: `empty = true`, `curCount = 0`, `thetaLong = MAX_THETA` → encoded as an 8-byte sketch.
42+
- A sketch with `p < 1.0` but never updated may have `empty = true`, `curCount = 0`,
43+
`thetaLong < MAX_THETA` internally; Java corrects theta back to `MAX_THETA` during compaction/
44+
serialization so it becomes a normal empty compact sketch.
45+
- A compact sketch can have **`empty = false`** while still having `curCount = 0` and
46+
`thetaLong < MAX_THETA` as a possible result of set operations; this must serialize with
47+
`preLongs = 3` to preserve theta.
48+
49+
Rust should mirror these behaviors for cross-language parity.
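
As a rough illustration of the truth table, the correction can be written as a single function over `(empty, curCount, thetaLong)`. The function name below is a hypothetical Rust spelling chosen to echo the Java helper mentioned above; the body follows only the rule as stated in this document.

```rust
const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical helper: returns the theta that a compact sketch should carry.
/// An empty sketch with no entries always compacts to `MAX_THETA`, even if it
/// was configured with `p < 1.0`; every other state keeps its theta.
fn correct_theta_on_compact(empty: bool, cur_count: u32, theta: u64) -> u64 {
    if empty && cur_count == 0 {
        MAX_THETA
    } else {
        theta
    }
}
```

The third row of the truth table (`empty = false`, `curCount = 0`, `thetaLong < MAX_THETA`) passes through unchanged, which is why it needs `preLongs = 3`.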
50+
51+
---
52+
53+
## serialization/deserialization
54+
55+
This section documents the uncompressed compact theta sketch binary format (`serVer = 3`), as used
56+
by Java and C++.
57+
58+
### Endianness
59+
60+
Multi-byte integers are written in the platform’s native endianness in the Java/C++ implementations,
61+
with a legacy “big-endian” bit in the flags byte (bit 0). In practice, modern platforms are little
62+
endian and serialize with that bit cleared.
63+
64+
For Rust cross-platform robustness:
65+
66+
- **Serialize** using little-endian encodings and keep the big-endian flag bit clear.
67+
- **Deserialize** by reading the big-endian flag bit and decoding multi-byte fields accordingly.
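
One way to follow the deserialization rule is to pass the flags byte to every multi-byte read. The snippet below is a sketch under that assumption; `read_u64` and `FLAG_BIG_ENDIAN` are names made up for this example.

```rust
/// Bit 0 of the flags byte: legacy big-endian indicator.
const FLAG_BIG_ENDIAN: u8 = 1 << 0;

/// Hypothetical helper: read an 8-byte integer at `offset`, honoring the
/// big-endian flag. Returns `None` if the slice is too short.
fn read_u64(bytes: &[u8], offset: usize, flags: u8) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let slice = bytes.get(offset..end)?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(slice);
    Some(if flags & FLAG_BIG_ENDIAN != 0 {
        u64::from_be_bytes(raw)
    } else {
        u64::from_le_bytes(raw)
    })
}
```

Writing always uses `to_le_bytes` with the flag bit cleared, so only the read path needs the branch.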
68+
69+
### Preamble (first 8 bytes)
70+
71+
All compact sketches start with a single 8-byte “preamble long” with fixed byte offsets:
72+
73+
| Byte offset | Field | Notes |
74+
|---:|---|---|
75+
| 0 | `preLongs` (low 6 bits) | Number of 8-byte longs in the preamble (1–3 for v3 compact). |
76+
| 1 | `serVer` | Must be `3` for uncompressed compact sketches. |
77+
| 2 | `family` | Must be `3` (`Family.COMPACT`). |
78+
| 3 | `lgNomLongs` | **Unused for compact**; must be written as `0`. |
79+
| 4 | `lgArrLongs` | **Unused for compact**; must be written as `0`. |
80+
| 5 | `flags` | Bitfield, defined below. |
81+
| 6–7 | `seedHash` (`u16`) | Must match `computeSeedHash(expectedSeed)` (Java/C++). |
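
The table maps directly onto fixed byte indices. The following is an illustrative sketch only: the `Preamble` struct and `parse_preamble` function are hypothetical and are not taken from the crate.

```rust
/// Hypothetical view of the first preamble long.
#[derive(Debug, PartialEq, Eq)]
struct Preamble {
    pre_longs: u8,
    ser_ver: u8,
    family: u8,
    flags: u8,
    seed_hash: u16,
}

/// Hypothetical helper: decode the first 8 bytes using the offsets in the table.
/// Bytes 3 and 4 (`lgNomLongs`, `lgArrLongs`) are unused for compact sketches
/// and are ignored here.
fn parse_preamble(bytes: &[u8]) -> Option<Preamble> {
    if bytes.len() < 8 {
        return None;
    }
    let flags = bytes[5];
    let raw = [bytes[6], bytes[7]];
    // Bit 0 of the flags byte selects the byte order of multi-byte fields.
    let seed_hash = if flags & 1 != 0 {
        u16::from_be_bytes(raw)
    } else {
        u16::from_le_bytes(raw)
    };
    Some(Preamble {
        pre_longs: bytes[0] & 0x3f, // only the low 6 bits carry preLongs
        ser_ver: bytes[1],
        family: bytes[2],
        flags,
        seed_hash,
    })
}
```

Field validation (`serVer == 3`, `family == 3`, seed hash match) is left to the caller in this sketch.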
82+
83+
### Flags byte (byte 5)
84+
85+
Bit positions follow Java/C++:
86+
87+
- Bit 0: big-endian legacy indicator (reserved in Java, still present in C++).
88+
- Bit 1: read-only (must be set for compact sketches).
89+
- Bit 2: empty.
90+
- Bit 3: compact (must be set for compact sketches).
91+
- Bit 4: ordered.
92+
- Bit 5: single-item.
93+
- Bits 6–7: reserved (must be zero).
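
For reference, the bit positions can be written as constants. This is a sketch: the constant and function names are illustrative choices, while the bit positions come from the list above.

```rust
pub const FLAG_BIG_ENDIAN: u8 = 1 << 0;
pub const FLAG_READ_ONLY: u8 = 1 << 1;
pub const FLAG_EMPTY: u8 = 1 << 2;
pub const FLAG_COMPACT: u8 = 1 << 3;
pub const FLAG_ORDERED: u8 = 1 << 4;
pub const FLAG_SINGLE_ITEM: u8 = 1 << 5;
pub const FLAGS_RESERVED: u8 = 0b1100_0000;

/// Hypothetical helper: build the flags byte for a compact sketch written in
/// little-endian form. Single-item sketches are also marked ordered.
pub fn compact_flags(empty: bool, ordered: bool, single_item: bool) -> u8 {
    let mut flags = FLAG_READ_ONLY | FLAG_COMPACT;
    if empty {
        flags |= FLAG_EMPTY;
    }
    if ordered || single_item {
        flags |= FLAG_ORDERED;
    }
    if single_item {
        flags |= FLAG_SINGLE_ITEM;
    }
    flags
}

/// Hypothetical strict check: read-only and compact set, reserved bits clear.
pub fn has_required_compact_flags(flags: u8) -> bool {
    flags & FLAG_READ_ONLY != 0 && flags & FLAG_COMPACT != 0 && flags & FLAGS_RESERVED == 0
}
```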
94+
95+
### `preLongs` and payload layout (v3)
96+
97+
The total serialized size is `(preLongs + curCount) * 8` bytes; the “empty compact” case
98+
(`preLongs = 1`, `curCount = 0`) is therefore always exactly 8 bytes.
99+
100+
The format varies by `(empty, curCount, thetaLong)`:
101+
102+
#### 1) Empty compact sketch (8 bytes)
103+
104+
- `preLongs = 1`
105+
- `flags.empty = 1`
106+
- No `curCount`, no `thetaLong`, no entries.
107+
- `thetaLong` is implicitly `MAX_THETA`.
108+
109+
#### 2) Single item (16 bytes)
110+
111+
- `preLongs = 1`
112+
- `flags.singleItem = 1`, `flags.ordered = 1` (Java sets ordered for single-item).
113+
- No `curCount`, no `thetaLong`.
114+
- Payload: one 8-byte hash at offset `8`.
115+
116+
#### 3) Exact compact sketch (non-estimating)
117+
118+
- `thetaLong == MAX_THETA`
119+
- `preLongs = 2` for `curCount > 1` (for `curCount == 1`, Java uses the single-item form above).
120+
- Long at offset `8` contains:
121+
- `curCount` as a 4-byte int at offsets `8..12`
122+
- `p` as a 4-byte float at offsets `12..16` (**not used**; Java writes `0.0` to match C++).
123+
- Payload: `curCount` hashes starting at offset `preLongs * 8` (i.e. 16).
124+
125+
#### 4) Estimating compact sketch
126+
127+
- `thetaLong < MAX_THETA`
128+
- `preLongs = 3`
129+
- Long at offset `8` contains:
130+
- `curCount` as a 4-byte int at offsets `8..12`
131+
- `p` as a 4-byte float at offsets `12..16` (**not used**; Java writes `0.0`).
132+
- Long at offset `16` contains:
133+
- `thetaLong` as an 8-byte long at offsets `16..24`.
134+
- Payload: `curCount` hashes starting at offset `preLongs * 8` (i.e. 24).
135+
136+
### Serialization algorithm (v3, conceptual)
137+
138+
1. Determine `(empty, curCount, thetaLong, ordered)` from the compact sketch state.
139+
2. Apply the “empty + never-updated sampled sketch” correction:
140+
if `empty && curCount == 0`, serialize as an empty compact with `thetaLong = MAX_THETA`.
141+
3. Compute `preLongs`:
142+
- if `thetaLong < MAX_THETA` → `preLongs = 3`
143+
- else if `empty` → `preLongs = 1`
144+
- else if `curCount == 1` → `preLongs = 1` (single item)
145+
- else → `preLongs = 2`
146+
4. Write preamble fields; ensure `lgNomLongs = 0`, `lgArrLongs = 0`, and set flags:
147+
`readOnly=1`, `compact=1`, plus `empty/ordered/singleItem` as applicable.
148+
5. For `preLongs >= 2`, write `curCount` and `p = 0.0f`.
149+
6. For `preLongs == 3`, write `thetaLong`.
150+
7. Write the `curCount` retained hashes, ordered if requested.
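
The seven steps above can be sketched as one function. This is a conceptual illustration under several assumptions, not the crate's implementation: `serialize_compact` is a hypothetical name, the caller is assumed to pass entries already sorted when `ordered` is true, and `empty` is assumed to imply an empty `entries` slice.

```rust
const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical serializer for the uncompressed v3 layout (little-endian only).
fn serialize_compact(
    entries: &[u64],
    theta: u64,
    empty: bool,
    ordered: bool,
    seed_hash: u16,
) -> Vec<u8> {
    let cur_count = entries.len() as u32;
    // Step 2: an empty, never-updated sketch always serializes with MAX_THETA.
    let theta = if empty && cur_count == 0 { MAX_THETA } else { theta };
    // Step 3: choose preLongs.
    let single_item = !empty && cur_count == 1 && theta == MAX_THETA;
    let pre_longs: u8 = if theta < MAX_THETA {
        3
    } else if empty || single_item {
        1
    } else {
        2
    };
    // Step 4: readOnly (bit 1) and compact (bit 3) are always set.
    let mut flags: u8 = (1 << 1) | (1 << 3);
    if empty {
        flags |= 1 << 2;
    }
    if ordered || single_item {
        flags |= 1 << 4;
    }
    if single_item {
        flags |= 1 << 5;
    }
    let mut out = Vec::with_capacity((pre_longs as usize + entries.len()) * 8);
    // serVer = 3, family = 3, lgNomLongs = 0, lgArrLongs = 0.
    out.extend_from_slice(&[pre_longs, 3, 3, 0, 0, flags]);
    out.extend_from_slice(&seed_hash.to_le_bytes());
    // Step 5: curCount and the unused `p` field.
    if pre_longs >= 2 {
        out.extend_from_slice(&cur_count.to_le_bytes());
        out.extend_from_slice(&0.0f32.to_le_bytes());
    }
    // Step 6: thetaLong only appears in the estimating form.
    if pre_longs == 3 {
        out.extend_from_slice(&theta.to_le_bytes());
    }
    // Step 7: retained hashes, in the order supplied by the caller.
    for hash in entries {
        out.extend_from_slice(&hash.to_le_bytes());
    }
    out
}
```

In every branch the output length equals `(preLongs + curCount) * 8`, which makes a convenient sanity check in tests.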
151+
152+
### Deserialization requirements (v3)
153+
154+
When decoding bytes into a compact theta sketch:
155+
156+
- Validate `family == 3` and `serVer == 3`.
157+
- Validate flags:
158+
- `compact` must be set
159+
- `readOnly` must be set
160+
- reserved bits must be zero (or tolerated for legacy inputs, depending on strictness)
161+
- Validate `seedHash` matches the expected seed.
162+
- If `empty` flag is set:
163+
- require `preLongs == 1` and total size == 8
164+
- return the empty compact sketch (`thetaLong = MAX_THETA`, `entries = []`)
165+
- Else if `singleItem`:
166+
- require `preLongs == 1` and total size == 16
167+
- read one hash at offset 8
168+
- Else:
169+
- read `curCount` (u32) at offset 8
170+
- if `preLongs == 3`, read `thetaLong` at offset 16; else `thetaLong = MAX_THETA`
171+
- read `curCount` hashes from offset `preLongs * 8`
172+
- optionally validate ordering if `ordered` flag is set
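
The checks above can be combined into a single decoding routine. The following is a sketch under stated assumptions rather than the crate's API: `CompactParts`, `take`, `read_u64`, and `deserialize_compact` are hypothetical names, errors are plain strings, reserved flag bits are tolerated, and the general branch only accepts `preLongs` of 2 or 3, which is stricter than the text requires.

```rust
const MAX_THETA: u64 = i64::MAX as u64;
const FLAG_BIG_ENDIAN: u8 = 1 << 0;
const FLAG_READ_ONLY: u8 = 1 << 1;
const FLAG_EMPTY: u8 = 1 << 2;
const FLAG_COMPACT: u8 = 1 << 3;
const FLAG_ORDERED: u8 = 1 << 4;
const FLAG_SINGLE_ITEM: u8 = 1 << 5;

/// Hypothetical result type; the crate's real `CompactThetaSketch` may differ.
#[derive(Debug, PartialEq)]
struct CompactParts {
    theta: u64,
    entries: Vec<u64>,
    ordered: bool,
    empty: bool,
}

/// Copy `N` bytes starting at `offset`, or fail if the input is too short.
fn take<const N: usize>(bytes: &[u8], offset: usize) -> Result<[u8; N], String> {
    let end = offset.checked_add(N).ok_or("offset overflow")?;
    let slice = bytes.get(offset..end).ok_or("input truncated")?;
    let mut raw = [0u8; N];
    raw.copy_from_slice(slice);
    Ok(raw)
}

fn read_u64(bytes: &[u8], offset: usize, big_endian: bool) -> Result<u64, String> {
    let raw = take::<8>(bytes, offset)?;
    Ok(if big_endian { u64::from_be_bytes(raw) } else { u64::from_le_bytes(raw) })
}

fn deserialize_compact(bytes: &[u8], expected_seed_hash: u16) -> Result<CompactParts, String> {
    let pre = take::<8>(bytes, 0)?;
    let pre_longs = (pre[0] & 0x3f) as usize;
    let (ser_ver, family, flags) = (pre[1], pre[2], pre[5]);
    if ser_ver != 3 || family != 3 {
        return Err(format!("unsupported serVer {} / family {}", ser_ver, family));
    }
    if flags & FLAG_COMPACT == 0 || flags & FLAG_READ_ONLY == 0 {
        return Err("compact and read-only flags must be set".into());
    }
    let big_endian = flags & FLAG_BIG_ENDIAN != 0;
    let seed_raw = [pre[6], pre[7]];
    let seed_hash = if big_endian { u16::from_be_bytes(seed_raw) } else { u16::from_le_bytes(seed_raw) };
    if seed_hash != expected_seed_hash {
        return Err("seed hash mismatch".into());
    }
    let ordered = flags & FLAG_ORDERED != 0;
    if flags & FLAG_EMPTY != 0 {
        if pre_longs != 1 || bytes.len() != 8 {
            return Err("empty sketch must be exactly 8 bytes".into());
        }
        return Ok(CompactParts { theta: MAX_THETA, entries: Vec::new(), ordered, empty: true });
    }
    if flags & FLAG_SINGLE_ITEM != 0 {
        if pre_longs != 1 || bytes.len() != 16 {
            return Err("single-item sketch must be exactly 16 bytes".into());
        }
        let hash = read_u64(bytes, 8, big_endian)?;
        return Ok(CompactParts { theta: MAX_THETA, entries: vec![hash], ordered, empty: false });
    }
    // Assumption of this sketch: only preLongs 2 or 3 is accepted from here on.
    if pre_longs != 2 && pre_longs != 3 {
        return Err(format!("unexpected preLongs {}", pre_longs));
    }
    let count_raw = take::<4>(bytes, 8)?;
    let count = if big_endian { u32::from_be_bytes(count_raw) } else { u32::from_le_bytes(count_raw) };
    let cur_count = count as usize;
    let theta = if pre_longs == 3 { read_u64(bytes, 16, big_endian)? } else { MAX_THETA };
    // Check the length before allocating so a corrupt count cannot force a huge allocation.
    let needed = cur_count
        .checked_add(pre_longs)
        .and_then(|n| n.checked_mul(8))
        .ok_or("size overflow")?;
    if bytes.len() < needed {
        return Err("input truncated".into());
    }
    let mut entries = Vec::with_capacity(cur_count);
    for i in 0..cur_count {
        let hash = read_u64(bytes, (pre_longs + i) * 8, big_endian)?;
        // Enforce the hash invariant 0 < hash < thetaLong.
        if hash == 0 || hash >= theta {
            return Err(format!("entry {} violates 0 < hash < theta", i));
        }
        entries.push(hash);
    }
    if ordered && entries.windows(2).any(|w| w[0] >= w[1]) {
        return Err("ordered flag set but entries are not ascending".into());
    }
    Ok(CompactParts { theta, entries, ordered, empty: false })
}
```

The seed hash values used when exercising this sketch are arbitrary; a real caller would pass the result of `compute_seed_hash` for its expected seed.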
173+
174+
### Note on compressed format (`serVer = 4`)
175+
176+
Java/C++ also support a compressed, delta-encoded ordered compact sketch (`serVer = 4`), with a
177+
different layout (variable-length retained entries count and packed deltas). This is not required
178+
for basic cross-language interoperability, but can be added later for reduced serialized sizes.
179+

datasketches/src/theta/mod.rs

Lines changed: 3 additions & 0 deletions
@@ -28,6 +28,7 @@
2828
//! configurable accuracy and memory usage. The implementation supports:
2929
//!
3030
//! - **ThetaSketch**: Mutable sketch for building from input data
31+
//! - **CompactThetaSketch**: Immutable sketch for serialization and set operations
3132
//!
3233
//! # Usage
3334
//!
@@ -40,6 +41,8 @@
4041
4142
mod hash_table;
4243
mod sketch;
44+
mod serialization;
4345

46+
pub use self::sketch::CompactThetaSketch;
4447
pub use self::sketch::ThetaSketch;
4548
pub use self::sketch::ThetaSketchBuilder;
