|
| 1 | +# Compact Theta Sketch (Implementation Notes) |
| 2 | + |
| 3 | +This document describes how the Rust implementation should represent and interoperate with the |
| 4 | +Apache DataSketches **Compact Theta Sketch** formats used by the Java and C++ libraries. |
| 5 | + |
| 6 | +The intent is cross-language compatibility: |
| 7 | + |
| 8 | +- **On-heap representation**: a minimal immutable form of a Theta sketch. |
| 9 | +- **Binary format**: compatible serialization/deserialization (uncompressed `serVer = 3`), matching |
| 10 | + the preamble layout and flags used by `datasketches-java` and `datasketches-cpp`. |
| 11 | + |
| 12 | +--- |
| 13 | + |
| 14 | +## Compact theta sketch |
| 15 | + |
| 16 | +A compact theta sketch is the immutable, serialization-friendly form of a theta sketch: |
| 17 | + |
| 18 | +- It stores a **compact array** of retained hash values (no interstitial zeros like the update |
| 19 | + sketch’s hash table). |
| 20 | +- It stores the **theta** threshold (`thetaLong`) and a **seed hash** (16-bit). |
| 21 | +- It can be **ordered** (sorted ascending by hash) or **unordered**. |
| 22 | +- It is **read-only** (cannot be updated), but is intended to participate in set operations. |
| 23 | + |
| 24 | +### Hash invariants (cross-language) |
| 25 | + |
| 26 | +Java/C++ (and this Rust crate) treat retained hashes as: |
| 27 | + |
| 28 | +- 63-bit, non-negative values derived from MurmurHash3 (128-bit), taking `h1 >> 1` as an unsigned (logical) shift. |
| 29 | +- `0` is reserved for empty slots and must not appear as a retained entry. |
| 30 | +- Every retained entry must satisfy: `0 < hash < thetaLong`. |
| 31 | +- `thetaLong` uses the signed max (`Long.MAX_VALUE` / `i64::MAX`) as “1.0” (no sampling). |
| 32 | + |
| 33 | +In Rust, `MAX_THETA` is `i64::MAX as u64`, matching Java/C++. |
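
The invariants above can be sketched as a few small helpers. This is a minimal illustration, not the crate's actual API: `to_retained_hash` and `is_valid_entry` are hypothetical names, and only `MAX_THETA` comes from these notes.

```rust
/// Theta value representing 1.0 (no sampling); equals `Long.MAX_VALUE` in Java.
pub const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical helper: turns the first 64-bit half (`h1`) of a 128-bit
/// MurmurHash3 result into a 63-bit non-negative value by dropping the low bit.
/// On a `u64`, `>>` is always a logical (zero-filling) shift.
pub fn to_retained_hash(h1: u64) -> u64 {
    h1 >> 1
}

/// Hypothetical helper: true if `hash` satisfies `0 < hash < theta`,
/// i.e. it may appear as a retained entry.
pub fn is_valid_entry(hash: u64, theta: u64) -> bool {
    hash != 0 && hash < theta
}
```

Keeping hashes as `u64` avoids sign handling in Rust while still producing the same bit patterns that Java stores in a signed `long`.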
| 34 | + |
| 35 | +### Compact-state truth table (Java/C++ behavior) |
| 36 | + |
| 37 | +When producing a compact sketch (or serializing), Java defines a truth table over `(empty, curCount, |
| 38 | +thetaLong)` and applies corrections in specific cases (see `CompactOperations.correctThetaOnCompact` |
| 39 | +and related helpers): |
| 40 | + |
| 41 | +- Normal empty: `empty = true`, `curCount = 0`, `thetaLong = MAX_THETA` → encoded as an 8-byte sketch. |
| 42 | +- A sketch with `p < 1.0` but never updated may have `empty = true`, `curCount = 0`, |
| 43 | + `thetaLong < MAX_THETA` internally; Java corrects theta back to `MAX_THETA` during compaction/ |
| 44 | + serialization so it becomes a normal empty compact sketch. |
| 45 | +- A compact sketch can be **non-empty** (`empty = false`) while still having `curCount = 0` and |
| 46 | + `thetaLong < MAX_THETA`, for example as the result of a set operation; this must serialize with |
| 47 | + `preLongs = 3` to preserve theta. |
| 48 | + |
| 49 | +Rust should mirror these behaviors for cross-language parity. |
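
A minimal sketch of the theta correction described above, assuming a free function with a hypothetical Rust name (`correct_theta_on_compact`, loosely following the Java helper mentioned earlier). Only the `empty && curCount == 0` case is corrected; every other state keeps its theta.

```rust
pub const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical helper: returns the theta that a compact sketch should carry.
/// A sketch that was never updated (empty, no entries) is reset to MAX_THETA,
/// even if it was created with a sampling probability p < 1.0.
pub fn correct_theta_on_compact(empty: bool, cur_count: u32, theta: u64) -> u64 {
    if empty && cur_count == 0 {
        MAX_THETA
    } else {
        theta
    }
}
```

For instance, `correct_theta_on_compact(false, 0, t)` returns `t` unchanged, which is the set-operation case that must later serialize with `preLongs = 3`.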
| 50 | + |
| 51 | +--- |
| 52 | + |
| 53 | +## Serialization/deserialization |
| 54 | + |
| 55 | +This section documents the uncompressed compact theta sketch binary format (`serVer = 3`), as used |
| 56 | +by Java and C++. |
| 57 | + |
| 58 | +### Endianness |
| 59 | + |
| 60 | +Multi-byte integers are written in the platform’s native endianness in the Java/C++ implementations, |
| 61 | +with a legacy “big-endian” bit in the flags byte (bit 0). In practice, modern platforms are little |
| 62 | +endian and serialize with that bit cleared. |
| 63 | + |
| 64 | +For Rust cross-platform robustness: |
| 65 | + |
| 66 | +- **Serialize** using little-endian encodings and keep the big-endian flag bit clear. |
| 67 | +- **Deserialize** by reading the big-endian flag bit and decoding multi-byte fields accordingly. |
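
One way to follow the two rules above is to always write with `to_le_bytes` and to pick the decoder from the flag bit when reading. The helper names `read_u64` and `write_u64` below are illustrative assumptions, not existing crate functions.

```rust
/// Bit 0 of the flags byte: legacy big-endian indicator.
pub const FLAG_BIG_ENDIAN: u8 = 1 << 0;

/// Hypothetical helper: reads a u64 at `offset`, honoring the big-endian flag.
/// Returns `None` if fewer than 8 bytes are available at that offset.
pub fn read_u64(bytes: &[u8], offset: usize, flags: u8) -> Option<u64> {
    let end = offset.checked_add(8)?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(bytes.get(offset..end)?);
    Some(if flags & FLAG_BIG_ENDIAN != 0 {
        u64::from_be_bytes(raw)
    } else {
        u64::from_le_bytes(raw)
    })
}

/// Hypothetical helper: appends a u64 in little-endian order, which is what
/// the serializer always emits (with the big-endian flag left clear).
pub fn write_u64(out: &mut Vec<u8>, value: u64) {
    out.extend_from_slice(&value.to_le_bytes());
}
```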
| 68 | + |
| 69 | +### Preamble (first 8 bytes) |
| 70 | + |
| 71 | +All compact sketches start with a single 8-byte “preamble long” with fixed byte offsets: |
| 72 | + |
| 73 | +| Byte offset | Field | Notes | |
| 74 | +|---:|---|---| |
| 75 | +| 0 | `preLongs` (low 6 bits) | Number of 8-byte longs in the preamble (1–3 for v3 compact). | |
| 76 | +| 1 | `serVer` | Must be `3` for uncompressed compact sketches. | |
| 77 | +| 2 | `family` | Must be `3` (`Family.COMPACT`). | |
| 78 | +| 3 | `lgNomLongs` | **Unused for compact**; must be written as `0`. | |
| 79 | +| 4 | `lgArrLongs` | **Unused for compact**; must be written as `0`. | |
| 80 | +| 5 | `flags` | Bitfield, defined below. | |
| 81 | +| 6–7 | `seedHash` (`u16`) | Must match `computeSeedHash(expectedSeed)` (Java/C++). | |
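
The byte offsets in the table map directly onto a small parsing routine. The `Preamble` struct and `parse_preamble` function below are hypothetical names used for illustration; the sketch only extracts fields and leaves validation (`serVer`, `family`, flags, seed hash) to the caller.

```rust
/// Hypothetical view of the first preamble long.
#[derive(Debug, PartialEq)]
pub struct Preamble {
    pub pre_longs: u8,
    pub ser_ver: u8,
    pub family: u8,
    pub flags: u8,
    pub seed_hash: u16,
}

/// Extracts the preamble fields from the first 8 bytes.
/// Returns `None` if the input is shorter than 8 bytes.
pub fn parse_preamble(bytes: &[u8]) -> Option<Preamble> {
    if bytes.len() < 8 {
        return None;
    }
    let flags = bytes[5];
    let seed_raw = [bytes[6], bytes[7]];
    // Bit 0 of the flags byte selects the byte order of multi-byte fields.
    let seed_hash = if flags & 0x01 != 0 {
        u16::from_be_bytes(seed_raw)
    } else {
        u16::from_le_bytes(seed_raw)
    };
    Some(Preamble {
        pre_longs: bytes[0] & 0x3F, // only the low 6 bits carry preLongs
        ser_ver: bytes[1],
        family: bytes[2],
        flags,
        seed_hash,
    })
}
```

Bytes 3 and 4 (`lgNomLongs`, `lgArrLongs`) are deliberately ignored here because they are unused for compact sketches.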
| 82 | + |
| 83 | +### Flags byte (byte 5) |
| 84 | + |
| 85 | +Bit positions follow Java/C++: |
| 86 | + |
| 87 | +- Bit 0: big-endian legacy indicator (reserved in Java, still present in C++). |
| 88 | +- Bit 1: read-only (must be set for compact sketches). |
| 89 | +- Bit 2: empty. |
| 90 | +- Bit 3: compact (must be set for compact sketches). |
| 91 | +- Bit 4: ordered. |
| 92 | +- Bit 5: single-item. |
| 93 | +- Bits 6–7: reserved (must be zero). |
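
The bit positions listed above translate into constants like the following. The constant names and the `compact_flags` helper are assumptions made for this sketch; only the bit positions come from these notes.

```rust
pub const FLAG_BIG_ENDIAN: u8 = 1 << 0;
pub const FLAG_READ_ONLY: u8 = 1 << 1;
pub const FLAG_EMPTY: u8 = 1 << 2;
pub const FLAG_COMPACT: u8 = 1 << 3;
pub const FLAG_ORDERED: u8 = 1 << 4;
pub const FLAG_SINGLE_ITEM: u8 = 1 << 5;
/// Bits 6 and 7 are reserved and must be zero.
pub const FLAGS_RESERVED: u8 = 0b1100_0000;

/// Hypothetical helper: builds the flags byte for a compact sketch.
/// `readOnly` and `compact` are always set; the big-endian bit stays clear.
pub fn compact_flags(empty: bool, ordered: bool, single_item: bool) -> u8 {
    let mut flags = FLAG_READ_ONLY | FLAG_COMPACT;
    if empty {
        flags |= FLAG_EMPTY;
    }
    if ordered {
        flags |= FLAG_ORDERED;
    }
    if single_item {
        flags |= FLAG_SINGLE_ITEM;
    }
    flags
}
```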
| 94 | + |
| 95 | +### `preLongs` and payload layout (v3) |
| 96 | + |
| 97 | +The total serialized size is `(preLongs + curCount) * 8` bytes. This includes the “empty compact” |
| 98 | +case (`preLongs = 1`, `curCount = 0`), which is always exactly 8 bytes. |
| 99 | + |
| 100 | +The format varies by `(empty, curCount, thetaLong)`: |
| 101 | + |
| 102 | +#### 1) Empty compact sketch (8 bytes) |
| 103 | + |
| 104 | +- `preLongs = 1` |
| 105 | +- `flags.empty = 1` |
| 106 | +- No `curCount`, no `thetaLong`, no entries. |
| 107 | +- `thetaLong` is implicitly `MAX_THETA`. |
| 108 | + |
| 109 | +#### 2) Single item (16 bytes) |
| 110 | + |
| 111 | +- `preLongs = 1` |
| 112 | +- `flags.singleItem = 1`, `flags.ordered = 1` (Java sets ordered for single-item). |
| 113 | +- No `curCount`, no `thetaLong`. |
| 114 | +- Payload: one 8-byte hash at offset `8`. |
| 115 | + |
| 116 | +#### 3) Exact compact sketch (non-estimating) |
| 117 | + |
| 118 | +- `thetaLong == MAX_THETA` |
| 119 | +- `preLongs = 2` for `curCount > 1` (for `curCount == 1`, Java uses the single-item form above). |
| 120 | +- Long at offset `8` contains: |
| 121 | + - `curCount` as a 4-byte int at offsets `8..12` |
| 122 | + - `p` as a 4-byte float at offsets `12..16` (**not used**; Java writes `0.0` to match C++). |
| 123 | +- Payload: `curCount` hashes starting at offset `preLongs * 8` (i.e. 16). |
| 124 | + |
| 125 | +#### 4) Estimating compact sketch |
| 126 | + |
| 127 | +- `thetaLong < MAX_THETA` |
| 128 | +- `preLongs = 3` |
| 129 | +- Long at offset `8` contains: |
| 130 | + - `curCount` as a 4-byte int at offsets `8..12` |
| 131 | + - `p` as a 4-byte float at offsets `12..16` (**not used**; Java writes `0.0`). |
| 132 | +- Long at offset `16` contains: |
| 133 | + - `thetaLong` as an 8-byte long at offsets `16..24`. |
| 134 | +- Payload: `curCount` hashes starting at offset `preLongs * 8` (i.e. 24). |
| 135 | + |
| 136 | +### Serialization algorithm (v3, conceptual) |
| 137 | + |
| 138 | +1. Determine `(empty, curCount, thetaLong, ordered)` from the compact sketch state. |
| 139 | +2. Apply the “empty + never-updated sampled sketch” correction: |
| 140 | + if `empty && curCount == 0`, serialize as an empty compact with `thetaLong = MAX_THETA`. |
| 141 | +3. Compute `preLongs`: |
| 142 | + - if `thetaLong < MAX_THETA` → `preLongs = 3` |
| 143 | + - else if `empty` → `preLongs = 1` |
| 144 | + - else if `curCount == 1` → `preLongs = 1` (single item) |
| 145 | + - else → `preLongs = 2` |
| 146 | +4. Write preamble fields; ensure `lgNomLongs = 0`, `lgArrLongs = 0`, and set flags: |
| 147 | + `readOnly=1`, `compact=1`, plus `empty/ordered/singleItem` as applicable. |
| 148 | +5. For `preLongs >= 2`, write `curCount` and `p = 0.0f`. |
| 149 | +6. For `preLongs == 3`, write `thetaLong`. |
| 150 | +7. Write the `curCount` retained hashes, ordered if requested. |
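
The seven steps above can be sketched as a single function. This is an illustration under stated assumptions rather than the crate's implementation: `serialize_compact` and its parameter list are hypothetical, the caller is assumed to pass entries already sorted when `ordered` is true, and an `empty` flag combined with retained entries (an inconsistent state) is treated as non-empty.

```rust
pub const MAX_THETA: u64 = i64::MAX as u64;

const SER_VER: u8 = 3;
const FAMILY_COMPACT: u8 = 3;
const FLAG_READ_ONLY: u8 = 1 << 1;
const FLAG_EMPTY: u8 = 1 << 2;
const FLAG_COMPACT: u8 = 1 << 3;
const FLAG_ORDERED: u8 = 1 << 4;
const FLAG_SINGLE_ITEM: u8 = 1 << 5;

/// Hypothetical serializer for the uncompressed v3 layout (little-endian only).
pub fn serialize_compact(
    empty: bool,
    ordered: bool,
    seed_hash: u16,
    theta: u64,
    entries: &[u64],
) -> Vec<u8> {
    // Step 2: a never-updated sketch becomes a normal empty compact sketch.
    let is_empty = empty && entries.is_empty();
    let theta = if is_empty { MAX_THETA } else { theta };

    // Step 3: choose preLongs; single-item only applies when theta is MAX_THETA.
    let single_item = !is_empty && theta == MAX_THETA && entries.len() == 1;
    let pre_longs: u8 = if theta < MAX_THETA {
        3
    } else if is_empty || single_item {
        1
    } else {
        2
    };

    // Step 4: preamble with lgNomLongs = lgArrLongs = 0 and the flags byte.
    let mut flags = FLAG_READ_ONLY | FLAG_COMPACT;
    if is_empty {
        flags |= FLAG_EMPTY;
    }
    if ordered || single_item {
        flags |= FLAG_ORDERED; // Java sets ordered for single-item sketches
    }
    if single_item {
        flags |= FLAG_SINGLE_ITEM;
    }
    let mut out = Vec::with_capacity((pre_longs as usize + entries.len()) * 8);
    out.extend_from_slice(&[pre_longs, SER_VER, FAMILY_COMPACT, 0, 0, flags]);
    out.extend_from_slice(&seed_hash.to_le_bytes());

    // Step 5: curCount and the unused p field.
    if pre_longs >= 2 {
        out.extend_from_slice(&(entries.len() as u32).to_le_bytes());
        out.extend_from_slice(&0.0f32.to_le_bytes());
    }
    // Step 6: thetaLong.
    if pre_longs == 3 {
        out.extend_from_slice(&theta.to_le_bytes());
    }
    // Step 7: retained hashes, in the order supplied by the caller.
    for &hash in entries {
        out.extend_from_slice(&hash.to_le_bytes());
    }
    out
}
```

The output length always equals `(preLongs + curCount) * 8`, so an empty sketch yields 8 bytes and a single-item sketch yields 16.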
| 151 | + |
| 152 | +### Deserialization requirements (v3) |
| 153 | + |
| 154 | +When decoding bytes into a compact theta sketch: |
| 155 | + |
| 156 | +- Validate `family == 3` and `serVer == 3`. |
| 157 | +- Validate flags: |
| 158 | + - `compact` must be set |
| 159 | + - `readOnly` must be set |
| 160 | + - reserved bits must be zero (or tolerated for legacy inputs, depending on strictness) |
| 161 | +- Validate `seedHash` matches the expected seed. |
| 162 | +- If `empty` flag is set: |
| 163 | + - require `preLongs == 1` and total size == 8 |
| 164 | + - return the empty compact sketch (`thetaLong = MAX_THETA`, `entries = []`) |
| 165 | +- Else if `singleItem`: |
| 166 | + - require `preLongs == 1` and total size == 16 |
| 167 | + - read one hash at offset 8 |
| 168 | +- Else: |
| 169 | + - read `curCount` (u32) at offset 8 |
| 170 | + - if `preLongs == 3`, read `thetaLong` at offset 16; else `thetaLong = MAX_THETA` |
| 171 | + - read `curCount` hashes from offset `preLongs * 8` |
| 172 | + - optionally validate ordering if `ordered` flag is set |
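
The checks above can be combined into one decoding routine, sketched below. `CompactSketch`, `deserialize_compact`, and the `&'static str` error type are hypothetical choices for this example. The sketch takes the strict options (reserved bits rejected, ordering verified), additionally enforces the `0 < hash < thetaLong` invariant, and tolerates trailing bytes in the general case; a real implementation may choose differently.

```rust
pub const MAX_THETA: u64 = i64::MAX as u64;

/// Hypothetical decoded form of a compact theta sketch.
#[derive(Debug, PartialEq)]
pub struct CompactSketch {
    pub theta: u64,
    pub empty: bool,
    pub ordered: bool,
    pub entries: Vec<u64>,
}

fn read_u64(bytes: &[u8], offset: usize, big_endian: bool) -> Result<u64, &'static str> {
    let slice = bytes.get(offset..offset + 8).ok_or("truncated input")?;
    let mut raw = [0u8; 8];
    raw.copy_from_slice(slice);
    Ok(if big_endian { u64::from_be_bytes(raw) } else { u64::from_le_bytes(raw) })
}

pub fn deserialize_compact(
    bytes: &[u8],
    expected_seed_hash: u16,
) -> Result<CompactSketch, &'static str> {
    if bytes.len() < 8 {
        return Err("input shorter than the 8-byte preamble");
    }
    let pre_longs = (bytes[0] & 0x3F) as usize;
    let flags = bytes[5];
    let big_endian = flags & 0x01 != 0;
    if bytes[1] != 3 || bytes[2] != 3 {
        return Err("unsupported serVer or family");
    }
    if flags & 0x0A != 0x0A {
        return Err("readOnly and compact flags must both be set");
    }
    if flags & 0xC0 != 0 {
        return Err("reserved flag bits must be zero");
    }
    let seed_raw = [bytes[6], bytes[7]];
    let seed_hash = if big_endian {
        u16::from_be_bytes(seed_raw)
    } else {
        u16::from_le_bytes(seed_raw)
    };
    if seed_hash != expected_seed_hash {
        return Err("seed hash mismatch");
    }
    let ordered = flags & 0x10 != 0;

    if flags & 0x04 != 0 {
        // Empty: exactly one preamble long and nothing else.
        if pre_longs != 1 || bytes.len() != 8 {
            return Err("malformed empty sketch");
        }
        return Ok(CompactSketch { theta: MAX_THETA, empty: true, ordered, entries: Vec::new() });
    }

    let (count, theta) = if flags & 0x20 != 0 {
        // Single item: one preamble long followed by one hash.
        if pre_longs != 1 || bytes.len() != 16 {
            return Err("malformed single-item sketch");
        }
        (1usize, MAX_THETA)
    } else {
        if pre_longs != 2 && pre_longs != 3 {
            return Err("preLongs must be 2 or 3");
        }
        if bytes.len() < pre_longs * 8 {
            return Err("truncated input");
        }
        let mut raw = [0u8; 4];
        raw.copy_from_slice(&bytes[8..12]);
        let count32 = if big_endian { u32::from_be_bytes(raw) } else { u32::from_le_bytes(raw) };
        let theta = if pre_longs == 3 { read_u64(bytes, 16, big_endian)? } else { MAX_THETA };
        (count32 as usize, theta)
    };

    // Check the total size before allocating, so a bogus count cannot
    // trigger a huge allocation.
    let needed = count
        .checked_add(pre_longs)
        .and_then(|n| n.checked_mul(8))
        .ok_or("size overflow")?;
    if bytes.len() < needed {
        return Err("truncated input");
    }
    let mut entries: Vec<u64> = Vec::with_capacity(count);
    for i in 0..count {
        let hash = read_u64(bytes, (pre_longs + i) * 8, big_endian)?;
        if hash == 0 || hash >= theta {
            return Err("entry violates 0 < hash < thetaLong");
        }
        if ordered && entries.last().map_or(false, |&prev| hash <= prev) {
            return Err("entries are not strictly ascending");
        }
        entries.push(hash);
    }
    Ok(CompactSketch { theta, empty: false, ordered, entries })
}
```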
| 173 | + |
| 174 | +### Note on compressed format (`serVer = 4`) |
| 175 | + |
| 176 | +Java/C++ also support a compressed, delta-encoded ordered compact sketch (`serVer = 4`), with a |
| 177 | +different layout (variable-length retained entries count and packed deltas). This is not required |
| 178 | +for basic cross-language interoperability, but can be added later for reduced serialized sizes. |
| 179 | + |