- Continues logarithmically up to 18 bytes overhead for sizes up to `isize::MAX`.
11
+
`ColdString` minimizes per-string overhead for both **short and large** strings.
12
+
- Strings ≤ 8 bytes: **8 bytes total**
13
+
- Larger strings: **~9–10 bytes overhead** (other string libraries have 24 bytes per value)
17
14
18
-
Compared to `String`, which stores capacity and length inline (3 machine words), `ColdString` avoids storing length inline for heap strings and compresses metadata into tagged pointer space. This leads to substantial memory savings in benchmarks (see [Memory Comparison (System RSS)](#memory-comparison-system-rss)):
19
-
- **36% – 68%** smaller than `String` in `HashMap`
20
-
- **28% – 65%** smaller than other short-string crates in `HashMap`
15
+
This leads to substantial memory savings over both `String` and other short-string crates (see [Memory Comparison (System RSS)](#memory-comparison-system-rss)):
16
+
- **35% – 67%** smaller than `String` in `HashSet`
17
+
- **35% – 64%** smaller than other short-string crates in `HashSet`
21
18
- **30% – 75%** smaller than `String` in `BTreeSet`
22
19
- **13% – 63%** smaller than other short-string crates in `BTreeSet`
23
20
24
-
`ColdString` has an MSRV of 1.60, is `no_std` compatible, and is a drop-in replacement for immutable `String`s.
25
-
26
-
### Safety
27
-
`ColdString` is written using [Rust's strict provenance API](https://doc.rust-lang.org/beta/std/ptr/index.html#strict-provenance), carefully handles unaligned access internally, and is validated with property testing and MIRI.
21
+
---
28
22
29
-
### Why "Cold"?
30
-
31
-
The heap representation stores the length on the heap, not inline in the struct. This saves memory in the struct itself but *slightly* increases the cost of `len()` since it requires a heap read. In practice, the `len()` cost is only marginally slower than inline storage and is typically negligible compared to:
32
-
- Memory savings
33
-
- Cache density improvements
34
-
- Faster collection operations due to reduced footprint
23
+
### Portability
24
+
`ColdString` has an MSRV of 1.60, is `no_std` compatible, and is a drop-in replacement for immutable `String`s.
ColdString is 8-byte tagged pointer (4 bytes on 32-bit machines):
50
+
ColdString is an 8-byte tagged pointer (4 bytes on 32-bit machines):
61
51
```rust
62
52
#[repr(packed)]
63
53
pub struct ColdString {
64
-
/// The first byte of `encoded` is the "tag" and it determines the type:
65
-
/// - 10xxxxxx: an encoded address for the heap. To decode, 10 is set to 00 and swapped
66
-
/// with the LSB bits of the tag byte. The address is always a multiple of 4 (`HEAP_ALIGN`).
67
-
/// - 11111xxx: xxx is the length in range 0..=7, followed by length UTF-8 bytes.
68
-
/// - xxxxxxxx (valid UTF-8): 8 UTF-8 bytes.
69
54
    encoded: *mut u8,
70
55
}
71
56
```
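As a rough illustration of what `#[repr(packed)]` changes, the sketch below uses a hypothetical stand-in type, `PackedPtr`, with the same shape as the struct above (it is not the crate's actual type). It checks that the struct stays one pointer wide while its alignment drops to 1, and compares that with `String`, which is three machine words on current Rust.

```rust
use std::mem::{align_of, size_of};

// Hypothetical stand-in with the same shape as the struct above;
// this is an illustration, not the crate's own type.
#[repr(packed)]
#[allow(dead_code)]
struct PackedPtr {
    encoded: *mut u8,
}

fn main() {
    // Packing keeps the size of one pointer but lowers the alignment to 1,
    // so the value can sit at any byte offset inside a larger structure.
    assert_eq!(size_of::<PackedPtr>(), size_of::<usize>());
    assert_eq!(align_of::<PackedPtr>(), 1);

    // `String` stores a pointer, a capacity, and a length inline.
    assert_eq!(size_of::<String>(), 3 * size_of::<usize>());

    println!(
        "PackedPtr: {} bytes, String: {} bytes",
        size_of::<PackedPtr>(),
        size_of::<String>()
    );
}
```

An alignment of 1 is also why reads of the field have to be treated as potentially unaligned.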
72
-
`encoded` acts as either a pointer to the heap for strings longer than 8 bytes or is the inlined data itself. The first/"tag" byte indicates one of 3 encodings:
57
+
The 8 bytes encode one of three representations, selected by the first byte:
58
+
- `10xxxxxx`: `encoded` contains a tagged heap pointer. To decode the address, clear the tag bits (`10 → 00`) and rotate so the `00` bits become the least-significant bits. The heap allocation uses [4-byte alignment](https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.from_size_align), guaranteeing the
59
+
least-significant 2 bits of the address are `00`. On the heap, the UTF-8 bytes are preceded by a variable-length encoding of the length: 1 byte for lengths 0–127, 2 bytes for 128–16383, and so on.
60
+
- `11111xxx`: `xxx` is the length (0–7), and the following 0–7 bytes are the UTF-8 data.
61
+
- `xxxxxxxx`: All 8 bytes are UTF-8.
73
62
74
-
### Inline Mode (0 to 7 Bytes)
75
-
The tag byte has bits 11111xxx, where xxx is the length. `self.0[1]` to `self.0[7]` store the bytes of string.
63
+
`10xxxxxx` and `11111xxx` are chosen because they cannot be valid first bytes of UTF-8.
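The tag rules above can be sketched with plain integer arithmetic. This is a minimal illustration, not the crate's code: `Repr`, `classify`, `encode_addr`, and `decode_addr` are hypothetical names, the sketch assumes a 64-bit word whose most-significant byte is the tag byte, and it works on `u64` values rather than real pointers, so it does not involve provenance.

```rust
/// The three representations named in the list above.
#[derive(Debug, PartialEq)]
enum Repr {
    /// Tag `10xxxxxx`: the word holds a tagged heap pointer.
    Heap,
    /// Tag `11111xxx`: `xxx` is the length, 0..=7.
    InlineShort(usize),
    /// Any other first byte: all 8 bytes are string data.
    InlineFull,
}

/// Classify a value by its first ("tag") byte.
fn classify(tag: u8) -> Repr {
    if tag & 0b1100_0000 == 0b1000_0000 {
        Repr::Heap
    } else if tag & 0b1111_1000 == 0b1111_1000 {
        Repr::InlineShort(usize::from(tag & 0b0000_0111))
    } else {
        Repr::InlineFull
    }
}

/// Assumption: the tag byte is the most-significant byte of the word.
/// Rotating right by 2 moves the two always-zero low bits of an aligned
/// address to the top, where they are overwritten with the `10` tag.
fn encode_addr(addr: u64) -> u64 {
    assert_eq!(addr % 4, 0, "heap addresses are 4-byte aligned");
    addr.rotate_right(2) | (0b10u64 << 62)
}

/// Clear the tag bits (`10 -> 00`), then rotate them back to the low end.
fn decode_addr(word: u64) -> u64 {
    (word & !(0b11u64 << 62)).rotate_left(2)
}

fn main() {
    // Neither tag pattern can begin a valid UTF-8 string.
    assert!(std::str::from_utf8(&[0b1000_0000]).is_err());
    assert!(std::str::from_utf8(&[0b1111_1000]).is_err());

    assert_eq!(classify(0b1111_1011), Repr::InlineShort(3));
    assert_eq!(classify(b"abcdefgh"[0]), Repr::InlineFull);

    let addr = 0x7f00_dead_bee0_u64; // made-up, 4-byte-aligned address
    let word = encode_addr(addr);
    assert_eq!(classify(word.to_be_bytes()[0]), Repr::Heap);
    assert_eq!(decode_addr(word), addr);
}
```

Because the rotation is lossless, every bit of the address survives the round trip; only the two bits that alignment guarantees to be zero are reused for the tag.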
76
64
77
-
### Inline Mode (8 Bytes)
78
-
The tag byte is any valid UTF-8 byte. `self.0` stores the bytes of string. Since the string is UTF-8, the tag byte is guaranteed to not be 10xxxxxx or 11111xxx.
65
+
### Why "Cold"?
79
66
80
-
### Heap Mode
81
-
`self.0` encodes the pointer to heap, where tag byte is 10xxxxxx. 10xxxxxx is chosen because it's a UTF-8 continuation byte and therefore an impossible tag byte for inline mode. Since a heap-alignment of 4 is chosen, the pointer's least significant 2 bits are guaranteed to be 0 ([See more](https://doc.rust-lang.org/beta/std/alloc/struct.Layout.html#method.from_size_align)). These bits are swapped with the 10 "tag" bits when de/coding between `self.0` and the address value.
67
+
The heap representation stores the length on the heap, not inline in the struct. This saves memory in the struct itself but *slightly* increases the cost of `len()` since it requires a heap read. In practice, the `len()` cost is only marginally slower than inline storage and is typically negligible compared to memory savings, cache density improvements, and 3x faster operations on inlined strings.
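To make the cost of `len()` concrete, here is a minimal sketch of a length prefix stored ahead of the string bytes. The README only states the prefix sizes (1 byte up to 127, 2 bytes up to 16383); the LEB128-style layout below, with 7 payload bits per byte and the low group first, is an assumption that matches those sizes, and `write_len`, `read_len`, and `heap_block` are hypothetical helpers rather than the crate's API.

```rust
/// Append `len` as a variable-length integer: 7 payload bits per byte,
/// with the high bit set on every byte except the last.
fn write_len(mut len: u64, out: &mut Vec<u8>) {
    loop {
        let byte = (len & 0x7F) as u8;
        len >>= 7;
        if len == 0 {
            out.push(byte);
            return;
        }
        out.push(byte | 0x80);
    }
}

/// Read a length prefix; returns `(length, bytes consumed)`.
fn read_len(buf: &[u8]) -> (u64, usize) {
    let mut len = 0u64;
    // A 64-bit length needs at most 10 groups of 7 bits.
    for (i, &b) in buf.iter().enumerate().take(10) {
        len |= u64::from(b & 0x7F) << (7 * i as u32);
        if b & 0x80 == 0 {
            return (len, i + 1);
        }
    }
    panic!("truncated or oversized length prefix");
}

/// Build the heap layout described above: length prefix, then the bytes.
fn heap_block(s: &str) -> Vec<u8> {
    let mut block = Vec::with_capacity(s.len() + 1);
    write_len(s.len() as u64, &mut block);
    block.extend_from_slice(s.as_bytes());
    block
}

fn main() {
    let s = "a string longer than eight bytes";
    let block = heap_block(s);

    // `len()` has to read the prefix from the heap block.
    let (len, prefix) = read_len(&block);
    assert_eq!(len as usize, s.len());
    assert_eq!(prefix, 1); // lengths up to 127 need a single byte
    assert_eq!(&block[prefix..], s.as_bytes());
}
```

Under this layout, `len()` costs one pointer decode plus a read of one or two prefix bytes for most strings, which is the extra heap read described above.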
82
68
83
-
On the heap, the data starts with a variable length integer encoding of the length, followed by the bytes.
84
-
```text,ignore
85
-
ptr --> <var int length> <data>
86
-
```
69
+
### Safety
87
70
88
-
# Memory Comparisons (Allocator)
71
+
`ColdString` uses `unsafe` to implement its packed representation and pointer tagging. Usage of `unsafe` is narrowly scoped to where layout control is required, and each instance is documented with `// SAFETY: <invariant>`. To further ensure soundness, `ColdString` is written using [Rust's strict provenance API](https://doc.rust-lang.org/beta/std/ptr/index.html#strict-provenance), handles unaligned access internally, maintains explicit heap alignment guarantees, and is validated with property testing and MIRI.
72
+
73
+
## Benchmarks
74
+
75
+
### Memory Comparisons (Allocator)
89
76
90
77
Memory usage per string, measured by tracking the memory requested by the allocator:
**Note:** Columns represent string length (bytes/chars). Values represent average Resident Set Size (RSS) in bytes per string instance. Measurements taken with 10M iterations.