diff --git a/claude.md b/claude.md index f3d0a57c7..b44976b51 100644 --- a/claude.md +++ b/claude.md @@ -13,6 +13,28 @@ are relevant. Only then report completion and wait for feedback. checkpoints") before presenting it. Do NOT proceed to implementation until the plan has been seen and explicitly approved. +**Store plans in PLAN.md**: Always write plans to `PLAN.md` in the repository +root so that context survives session boundaries. Update (not append to) the +file when the plan evolves. This is the single source of truth for what is +planned and what has been completed. + +**Baseline the checkout before starting work**: Before beginning implementation +of any plan, verify that the current checkout builds and passes tests. Run the +build and test suite (per `skills/building_and_testing.md`) and record the +results. If the baseline is broken, report the failures and stop — do not start +implementation on a broken base. Pre-existing failures that are not caused by +your changes must be acknowledged upfront so they are not confused with +regressions introduced by the plan. This establishes the ground truth against +which your changes will be measured. + +**Every plan step must have a test gate**: Each step in a plan must produce +a testable result — a test, a build check, or a verifiable property — that +acts as the gate to the next step. Do not move to step N+1 until step N's +gate passes. This catches integration issues incrementally rather than +deferring all testing to the end. When writing a plan, structure it so that +independently testable components are implemented and verified first, and +later steps build on proven foundations. + **Mandatory review checkpoints**: At each of these points, run the full review loop — spawn a fresh-context reviewer subagent, address findings, spawn another fresh reviewer, repeat until a reviewer finds no issues. When @@ -101,31 +123,19 @@ as a collaborator. 
- **Fix what your change makes stale** - When a change invalidates something elsewhere — a comment, a test description, documentation — fix it in the same PR. Stale artefacts left behind are bugs in the making, and "I didn't modify that line" isn't an excuse when your change is what made it wrong. -## Testing - -- Run `func-malloc-fast` and `func-jemalloc-fast` to catch allocation edge cases -- The `-check` variants include assertions but may pass when `-fast` hangs due to timing differences -- Use `timeout` when running tests to avoid infinite hangs -- Before considering a change complete, run the full test suite: `ctest --output-on-failure -j 4 -C Release --timeout 60` - -### Test library (`snmalloc_testlib`) - -Tests that only use the public allocator API can link against a pre-compiled static library (`snmalloc-testlib-{fast,check}`) instead of compiling the full allocator in each TU. - -- **Header**: `test/snmalloc_testlib.h` — forward-declares the API surface; does NOT include any snmalloc headers. Tests that also need snmalloc internals (sizeclasses, pointer math, etc.) include `` or `` alongside it. -- **CMake**: Add the test name to `LIBRARY_FUNC_TESTS` or `LIBRARY_PERF_TESTS` in `CMakeLists.txt`. -- **Apply broadly**: When adding new API to testlib (e.g., `ScopedAllocHandle`), immediately audit all remaining non-library tests to see which ones can now be migrated. Don't wait for CI to find them one by one. -- **Cannot migrate**: Tests that use custom `Config` types, `Pool`, override machinery, internal data structures (freelists, MPSC queues), or the statically-sized `alloc()` template with many size values genuinely need `snmalloc.h`. +- **Document the code, not the change** - Comments and documentation describe how the code IS, not how it was changed or why it differs from a previous version. Don't leave comments explaining "we removed X" or "this was changed from Y" — a reader shouldn't need the git history to understand the code. 
If code needs context about alternatives or design decisions, put that in design docs, not inline comments. -## Build +## Building, Testing, and Benchmarking -- Build directory: `build/` -- Build system: Ninja with CMake -- Rebuild all targets before running ctest: `ninja -C build` (required - ctest runs pre-built binaries) -- Rebuild specific targets: `ninja -C build ` -- Always run `clang-format` before committing changes: `ninja -C build clangformat` +All build, test, and benchmarking guidance lives in `skills/building_and_testing.md`. -## Benchmarking +**Delegate testing to a subagent.** When it is time to build and run tests, +spawn a subagent whose prompt includes the contents of +`skills/building_and_testing.md` and tells it which tests to run (or "run the +full suite"). Do NOT include implementation context — the subagent must not +know what code changed. This prevents the tester from rationalising failures +as related to the changes instead of reporting them objectively. -- Before benchmarking, verify Release build: `grep CMAKE_BUILD_TYPE build/CMakeCache.txt` should show `Release` -- Debug builds have assertions enabled and will give misleading performance numbers +The subagent will report back: which tests passed, which failed, exact +commands, and full output of any failures. If failures are reported, treat +them as actionable per the failure protocol in the skill file. diff --git a/docs/bitmap_coalesce.md b/docs/bitmap_coalesce.md new file mode 100644 index 000000000..5faac25f8 --- /dev/null +++ b/docs/bitmap_coalesce.md @@ -0,0 +1,579 @@ +# Bitmap-Indexed Coalescing Range + +## Background + +### The buddy system and its generalisation + +The classic binary buddy allocator manages memory as power-of-two blocks. +A block of size $2^k$ must be aligned to a $2^k$ boundary. 
This alignment +invariant makes splitting and merging trivial — a block's *buddy* is at a +known address — but wastes memory: the only valid sizes are exact powers +of two, so a 33-byte request consumes a 64-byte block. + +A natural generalisation keeps the alignment discipline but admits a richer +set of size classes. Instead of only powers of two, we allow sizes with a +few mantissa bits: + +$$S = 2^e + m \cdot 2^{e - B}$$ + +where $e \ge 0$ is the exponent and $m \in \{0, \ldots, 2^B - 1\}$ is +the mantissa, controlled by a parameter $B$ (the number of *intermediate +bits*). When $e < B$ the step size $2^{e-B}$ is fractional, so only those +$(e, m)$ combinations that yield an integer are valid size classes. With +$B = 0$ this is the classic buddy system. With $B = 2$ the valid sizes +(in chunk units) are: + +| $e$ | valid $m$ | sizes | +|-----|-----------|-------| +| 0 | 0 | 1 | +| 1 | 0, 2 | 2, 3 | +| 2 | 0, 1, 2, 3 | 4, 5, 6, 7 | +| 3 | 0, 1, 2, 3 | 8, 10, 12, 14 | +| … | all | … | + +For $e \ge B$ every mantissa value $m$ yields an integer, giving $2^B$ +sub-classes per power-of-two doubling. The full sequence is + +$$1,\;2,\;3,\;4,\;5,\;6,\;7,\;8,\;10,\;12,\;14,\;16,\;20,\;24,\;28,\;32,\;\ldots$$ + +This halves internal fragmentation compared to binary buddy while +preserving the structural properties that make buddy systems efficient. + +### Natural alignment + +Define the *natural alignment* of a positive integer $S$ as the largest +power of two dividing $S$: + +$$\operatorname{align}(S) = S \mathbin{\&} (\mathord{\sim}(S - 1))$$ + +For a power-of-two size this equals the size itself (the classic buddy +constraint); for $S = 12$ it is 4; for $S = 96$ it is 32. + +When computing alignment for an *address* rather than a size, the same +formula is used with one exception: address 0 is divisible by every power +of two, so $\operatorname{align}(0)$ should be treated as arbitrarily +large (i.e. no alignment constraint at all). 
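A minimal sketch of the natural-alignment definition above. The helper name `natural_alignment` and the use of `None` to mean "no constraint" for address 0 are illustrative assumptions, not names taken from the implementation.

```python
def natural_alignment(x):
    """Largest power of two dividing x; None means 'no constraint' (x == 0)."""
    if x < 0:
        raise ValueError("expected a non-negative integer")
    if x == 0:
        return None  # address 0 is divisible by every power of two
    return x & -x  # equivalent to x & ~(x - 1) for positive x


print(natural_alignment(16))  # 16 (a power of two aligns to itself)
print(natural_alignment(12))  # 4
print(natural_alignment(96))  # 32
```

Returning `None` for zero keeps the "arbitrarily large" case explicit; a caller that needs a number could substitute any power of two larger than the arena.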
+ +### The alignment invariant + +The generalised buddy system preserves the classical buddy's alignment +discipline: every block of size $S$ at address $A$ must satisfy + +$$A \bmod \operatorname{align}(S) = 0$$ + +This invariant is important for allocation: when a free list stores blocks +of size class $S$, any block popped from the list is guaranteed to satisfy +the alignment requirement. No address checking is needed. + +### Maximum size class at a given alignment + +A critical property of the size-class scheme: for a given address alignment +$\alpha = 2^k$, there is a bounded maximum valid size class. Only size +classes whose natural alignment is $\le \alpha$ may be placed at that +address. + +With $B$ intermediate bits, valid sizes at exponent level $e$ are +$2^e + m \cdot 2^{e-B}$ for $m \in \{0,\ldots,2^B-1\}$. The largest +size at a given $\alpha$ is obtained at $e = \log_2(\alpha) + B$, +$m = 2^B - 1$: + +$$S_{\max} = (2^{B+1} - 1) \cdot \alpha$$ + +With $B = 0$ (classic buddy): $S_{\max} = \alpha$. +With $B = 2$: $S_{\max} = 7\alpha$, so the maximum block at alignment 16 is +$7 \times 16 = 112$. + +## The problem: efficient aligned allocation + +When a block of size $S$ with alignment $A = \operatorname{align}(S)$ is +requested, the allocator must find a free block that contains a +naturally-aligned region of at least $S$ bytes. + +Three classic approaches and their tradeoffs: + +**Binary buddy** stores blocks of exact power-of-two size at +power-of-two-aligned addresses. Allocation is O(1) — pop any block from +the appropriate bin. But internal fragmentation is up to 50%. + +**TLSF** stores blocks at their actual size with no alignment invariant. +Low internal fragmentation, O(1) coalescing. But aligned allocation +requires scanning the free list to find a block whose address satisfies +the alignment — O(n) per bin in the worst case. + +**Bitmap coalescing** (this design) stores blocks at their actual +(maximally-coalesced) size. 
A flat bitmap indexes blocks by the set of +size classes they can serve, considering all possible alignment offsets. +Allocation is O(1) via bitmap scan. Coalescing is simple boundary-tag +merging with no decomposition. + +## Illustrated walkthrough + +All sizes are in abstract *chunks* (the minimum allocation unit, +$2^{14}$ bytes in production). With $B = 2$ the valid size classes +are $1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, \ldots$ + +### Setup: a 32-chunk arena + +``` +Address: 0 4 8 12 16 20 24 28 32 + ├───────┼───────┼───────┼───────┼───────┼───────┼───────┼───────┤ + │ free (32) │ + └───────────────────────────────────────────────────────────────┘ +``` + +### Step 1: Allocate 5, 10, 5, 7 chunks + +**Binary buddy** rounds each request up to a power of two: + +| Request | Buddy gives | Waste | +|---------|-------------|-------| +| 5 | 8 | 3 | +| 10 | 16 | 6 | +| 5 | 8 | 3 | +| 7 | 8 | 1 | + +The four rounded blocks add up to 40 chunks, more than the 32-chunk arena +holds, so the second 5-chunk request cannot be placed. The three blocks that +fit waste 3 + 6 + 1 = 10 chunks (31% of the arena). + +``` +Binary buddy after allocations: + 0 8 24 32 + ├───────────────┼──────────────────────────────┼───────────────┤ + │ alloc 8 (5+3W)│ alloc 16 (10+6W) │ alloc 8 (7+1W)│ + └───────────────┴──────────────────────────────┴───────────────┘ +``` + +**Bitmap coalescing** uses the exact size classes with zero waste: + +``` +Bitmap coalesce after allocations: + 0 5 15 20 27 32 + ├─────────┼────────────────┼───────────┼───────┼─────┤ + │ alloc 5 │ alloc 10 │ alloc 5 │alloc 7│free5│ + └─────────┴────────────────┴───────────┴───────┴─────┘ +``` + +### Step 2: Free the 10-chunk block at address 5 + +The block $[5, 15)$ is returned. Adjacent blocks on both sides are still +allocated, so no coalescing occurs and a single 10-chunk free block is +inserted. 
+ +``` +After freeing [5,15): + 0 5 15 20 27 32 + ├─────────┼────────────────┼───────────┼───────┼─────┤ + │ alloc 5 │ FREE 10 │ alloc 5 │alloc 7│free5│ + └─────────┴────────────────┴───────────┴───────┴─────┘ +``` + +### Step 3: Free the 5-chunk block at address 15 + +Now $[15, 20)$ is freed next to the free block $[5, 15)$. The contiguous +free range is $[5, 20)$ — 15 chunks. + +**Binary buddy** decomposes into 6 blocks: + +``` +Binary buddy free decomposition of [5, 20): + 5 6 8 12 14 16 20 + ├──┼───┼──────────┼────┼───┼──────────┤ + │1 │ 2 │ 4 │ 2 │ 1 │ 4 │ + └──┴───┴──────────┴────┴───┴──────────┘ + 6 free blocks, largest = 4 +``` + +A subsequent request for 10 chunks would *fail* despite 15 contiguous +free chunks — buddy's largest block is only 4. + +**Bitmap coalescing** maximally coalesces: the left walk finds +$[5, 15)$ adjacent, merges into a single 15-chunk block. + +``` +Bitmap coalesce after merging [5, 20): + 5 20 + ├──────────────────────────────┤ + │ FREE 15 │ + └──────────────────────────────┘ +``` + +One free block of size 15. A request for 10 chunks succeeds immediately. + +### Step 4: Allocate 6 chunks with 4-chunk alignment + +The caller needs a 6-chunk block at an address divisible by 4. + +**Binary buddy**: largest block is 4. *Allocation fails.* + +**TLSF**: has the 15-chunk block at address 5. Address 5 is not +4-aligned; TLSF must scan the free list to find a block where a 4-aligned +6-chunk sub-range exists — O(n) in general. + +**Bitmap coalescing**: the 15-chunk block at address 5 was binned into a +slot that guarantees it can serve this request. The bitmap search finds it +in O(1). The range wrapper carves out the aligned subrange: + +``` +Carving [8,14) from [5,20), returning remainders: + 5 8 14 20 + ├─────┼────────┼────────────────┤ + │free3│alloc 6 │ free 6 │ + └─────┴────────┴────────────────┘ +``` + +No scanning required — the bitmap guarantees the search result can serve +the requested size class at a naturally-aligned offset. 
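The carve in Step 4 can be reproduced with a few lines of arithmetic. This is a sketch in chunk units; the function name `carve` and its tuple return shape are assumptions made for illustration, not the range wrapper's actual interface.

```python
def carve(block_start, block_size, size, alignment):
    """Split a free block into (prefix, allocation, suffix) ranges.

    Each range is a half-open (start, end) pair in chunk units, or None
    when empty. `alignment` must be a power of two.
    """
    block_end = block_start + block_size
    # Round the block start up to the next multiple of `alignment`.
    aligned = (block_start + alignment - 1) & ~(alignment - 1)
    if aligned + size > block_end:
        return None  # this block cannot serve the request
    prefix = (block_start, aligned) if aligned > block_start else None
    suffix = (aligned + size, block_end) if aligned + size < block_end else None
    return prefix, (aligned, aligned + size), suffix


# The 15-chunk block at address 5, serving 6 chunks at 4-chunk alignment.
print(carve(5, 15, 6, 4))  # ((5, 8), (8, 14), (14, 20))
```

The prefix and suffix correspond to the two remainders that the walkthrough returns to the free pool.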
+ +### Summary comparison + +``` + ┌─────────────┬──────────────┬───────────────────┐ + │ Binary Buddy│ TLSF │ Bitmap Coalesce │ + ┌───────────────┼─────────────┼──────────────┼───────────────────┤ + │ Size classes │ powers of 2 │ fine-grained │ fine-grained │ + │ Internal frag.│ up to 50% │ ~3% │ ~3% │ + │ Coalesce cost │ O(log n) │ O(1) │ O(neighbours) │ + │ Aligned search│ free (nat.) │ O(n) scan │ O(1) bitmap │ + │ Free path │ split+merge │ tag update │ tag merge + bin │ + └───────────────┴─────────────┴──────────────┴───────────────────┘ +``` + +Bitmap coalescing combines the structural advantage of buddy (O(1) aligned +search via indexed bins) with the fine-grained size classes of TLSF (low +internal fragmentation) and simple coalescing (no decomposition). + +## The key insight: servable sets and the flat bitmap + +### Servable sets + +A free block at address $A$ with size $n$ can *serve* an allocation of +size class $S$ if: + +1. There exists an address $A' \ge A$ such that $A' + S \le A + n$ + (the block is large enough). +2. $A' \bmod \operatorname{align}(S) = 0$ (the address is + naturally aligned). + +The *servable set* of a block is the set of all size classes it can serve. + +For $B = 2$ intermediate bits, exhaustive analysis over a 256-chunk arena +(all possible address, size pairs) reveals exactly **34 unique servable +sets**. These sets have a nearly-linear ordering: within each exponent +group $e$, there are 5 distinct positions: + +| Slot | Can serve | +|-----------|-----------------------------------| +| A-only | $m{=}0$ at exponent $e$, not $m{=}1$ | +| B-only | $m{=}1$ at exponent $e$, not $m{=}0$ | +| both | both $m{=}0$ and $m{=}1$ | +| +m2 | also $m{=}2$ | +| +m3 | all four mantissas | + +A-only and B-only are the sole *incomparable* pair. Every other pair of +slots is ordered by strict inclusion. + +### The flat bitmap + +Map each free block to one of the 34 servable-set positions (per exponent +level). 
The bitmap layout is: + +``` +Bit layout: + [0..2] PREFIX: sizes 1, 2, 3 (exponents e ≤ 1) + [3..7] Exponent e=2: A-only, B-only, both, +m2, +m3 + [8..12] Exponent e=3: A-only, B-only, both, +m2, +m3 + ... + Total = 3 + 5 × (MAX_EXPONENT - 1) bits +``` + +For a 64-bit address space with 16 KiB chunks: MAX_EXPONENT = 49, +NUM_BINS = 243, requiring 4 × 64-bit words. + +### Allocation: masked bitmap search + +To allocate size class $(e, m)$: + +1. Compute the *start bit* — the lowest-indexed bin that could serve this + request. +2. If $m = 0$: mask out the B-only bit for exponent $e$ (it serves $m=1$ + but not $m=0$). No other masking is needed. +3. Find the first set bit at or above the start bit. Pop the head + of that bin's free list. + +This is O(1): one mask operation, one `ctz` per bitmap word. + +### Insertion: bin index from (size, alignment) + +When inserting a block of size $n$ at address $A$: + +1. Compute $\alpha = \operatorname{align}(A / \text{chunk\_size})$ + (the block's chunk-level alignment). +2. For each size class $S$ at each exponent, compute the *threshold*: + $T(S, \alpha) = S + \max(0, \operatorname{align}(S) - \alpha)$. + This is the minimum block size needed to guarantee a naturally-aligned + sub-region of size $S$ exists within the block. +3. The bin index is the highest slot for which $n \ge T(S, \alpha)$ for + all size classes in that slot's servable set. + +Better-aligned blocks reach higher bins (can serve more size classes), +so the bin assignment uses the block's *actual* alignment, not a +worst-case assumption. + +## Worked example: bitmap operations + +### Arena: 16 chunks at address 0 + +Initial state: one 16-chunk free block at address 0. + +**Insert [0, 16):** +$n = 16$ chunks, $\alpha = \operatorname{align}(0) = \infty$. +At exponent $e = 4$: $S_0 = 16, T = 16$; $S_1 = 20, T = 20 > 16$. +So the block can serve $m{=}0$ at $e{=}4$ but not $m{=}1$. 
+At exponent $e = 3$: $S_3 = 14, T = 14 \le 16$ → can serve all +four mantissas at $e{=}3$. + +Which bin is higher? A-only at $e{=}4$ (bit index = 3 + 5×2 + 0 = 13) +vs +m3 at $e{=}3$ (bit index = 3 + 5×1 + 4 = 12). A-only at $e{=}4$ +wins. The block goes to bin 13. + +**Allocate 10 chunks ($e{=}3, m{=}1$):** +Start bit = B-only at $e{=}3$ = 3 + 5×1 + 1 = 9. +$m \ne 0$, so no masking. +Scan bitmap from bit 9: bin 13 is set. Pop the 16-chunk block at +address 0. + +The range wrapper carves: $\operatorname{align}(10) = 2$, natural +alignment = 2 chunks. Address 0 is already 2-aligned. +Return $[0, 10)$; the remainder $[10, 16)$ (size 6) is re-inserted via +`add_fresh_range`. + +**Insert remainder [10, 16):** +$n = 6$, $\alpha = \operatorname{align}(10) = 2$. +Highest qualifying bin: at $e{=}2$, $S_2 = 6, \operatorname{align}(6) = 2$, +$T(6, 2) = 6 \le 6$. Also $S_1 = 5, T(5, 2) = 5 \le 6$ and +$S_0 = 4, T(4, 2) = 4 + (4 - 2) = 6 \le 6$. The fourth class fails: +$S_3 = 7 > 6$, so stop at +m2. Bin index = 3 + 5×0 + 3 = 6. + +**Allocate 4 chunks ($e{=}2, m{=}0$):** +Start bit = A-only at $e{=}2$ = 3. +$m = 0$: mask out B-only at $e{=}2$ = bit 4. +Scan bitmap from bit 3 with bit 4 masked: bin 6 is set. Pop the +6-chunk block at address 10. + +Carve: $\operatorname{align}(4) = 4$. Address 10 is 2-aligned, and the +first 4-aligned address $\ge 10$ is 12. Return $[12, 16)$; the prefix +$[10, 12)$ (size 2) is re-inserted. + +### Coalescing example + +State after the above: $[0, 10)$ allocated, $[10, 12)$ free (bin 1), +$[12, 16)$ allocated. + +**Free [12, 16):** +`add_block(12, 4)`. + +Left walk: check address 8 ($= 12 - 4$). Not coalesce_free (it's allocated). +Check address 11 ($= 12 - 1$ chunk). `is_free_block(11)`: the 2-chunk +free block at $[10, 12)$ has coalesce_free set at address 10 and 11. Read +$\text{size}(11) = 2$, so $\text{prev\_start} = 12 - 2 = 10$. Check +`is_free_block(10)`: yes. Cross-check $\text{size}(10) = 2$: matches. +Remove $[10, 12)$ from its bin. $\text{merge\_start} = 10$. 
+ +Continue left walk: check address 9. Not coalesce_free (inside the allocated +block $[0, 10)$). Stop. + +Right walk: address 16 is at the boundary. Stop. + +Insert $[10, 16)$ size 6. The two fragments have coalesced into one +block. + +## The algorithm + +### Data structures + +``` +bin_heads[NUM_BINS] — head pointer for each bin's singly-linked free list +bitmap[BITMAP_WORDS] — one bit per bin: set iff the bin is non-empty +``` + +Metadata is stored in the pagemap: each chunk entry has two backend words +(next pointer and size boundary tag) plus two 1-bit flags (coalesce_free and +boundary). + +### Insert (`insert_block`) + +``` +insert_block(addr, size): + n = size / CHUNK_SIZE + α = natural_alignment(addr / CHUNK_SIZE) + bin = bin_index(n, α) + set_boundary_tags(addr, size) // first and last entry + set_coalesce_free(addr) // first entry + if size > CHUNK_SIZE: + set_coalesce_free(addr + size - CHUNK_SIZE) // last entry + prepend to bin_heads[bin] + set bitmap bit +``` + +### Allocate (`remove_block`) + +``` +remove_block(size): + (e, m) = decompose(size / CHUNK_SIZE) + start = alloc_start_bit(e, m) + mask = (m == 0 and e >= 2) ? 
alloc_mask_bit(e) : none + bin = find_first_set_bit(bitmap, from=start, masking=mask) + if none found: return empty + pop head from bin_heads[bin] + clear first-entry size tag and coalesce_free marker + return {addr, block_size} +``` + +### Free with coalescing (`add_block`) + +``` +add_block(addr, size): + merge_start = addr + merge_end = addr + size + + // Left walk: merge with preceding free blocks + while merge_start > 0 and not boundary(merge_start): + prev_last = merge_start - CHUNK_SIZE + if not is_free_block(prev_last): break + prev_size = get_size(prev_last) + if prev_size == 0 or prev_size > merge_start: break + prev_start = merge_start - prev_size + if not is_free_block(prev_start): break + if get_size(prev_start) != prev_size: break // cross-check + remove_from_bin(prev_start, prev_size) + merge_start = prev_start + + // Right walk: merge with following free blocks + while not boundary(merge_end): + if not is_free_block(merge_end): break + next_size = get_size(merge_end) + if next_size == 0: break + remove_from_bin(merge_end, next_size) + clear_size_tags(merge_end, next_size) // prevent stale reads + merge_end += next_size + + // Insert the coalesced block + insert_block(merge_start, merge_end - merge_start) +``` + +Key property: **no decomposition on free**. The freed region is maximally +merged with neighbours and inserted as a single block. The bitmap index +captures exactly which size classes the merged block can serve at its +actual address alignment. + +### Post-allocation carving (range wrapper) + +The block returned by `remove_block` may not be naturally aligned for the +requested size class. 
The range wrapper carves the allocation: + +``` +alloc_range(size): + (e, m) = decompose(size / CHUNK_SIZE) + alignment = natural_alignment(size / CHUNK_SIZE) * CHUNK_SIZE + result = remove_block(size) + if empty: refill from parent + aligned_addr = align_up(result.addr, alignment) + prefix = aligned_addr - result.addr + suffix = result.size - prefix - size + if prefix > 0: add_fresh_range(result.addr, prefix) + if suffix > 0: add_fresh_range(aligned_addr + size, suffix) + return aligned_addr +``` + +## Metadata protocol + +### The coalesce-free marker + +Each metadata entry carries a single bit, `META_COALESCE_FREE_BIT`, set only on +entries belonging to blocks currently in the coalescing range's free pool. +The coalescing algorithm tests `is_free_block(addr)`, which returns true +only when the entry is both *backend-owned* and has the coalesce-free bit set. +Other range components (e.g. DecayRange) also set backend-owned, but they +do not set coalesce-free, so the coalescing algorithm ignores them. + +### Boundary markers + +A separate bit, `META_BOUNDARY_BIT`, prevents coalescing from crossing +OS-level allocation boundaries. On platforms where distinct PAL +allocations cannot be merged (CHERI, Windows VirtualAlloc), the first +chunk of each allocation has its boundary bit set. Both the left walk +and right walk stop when they encounter a boundary. + +When the range wrapper splits a refill (keeping part for the caller, +returning the rest to the free pool), a boundary is set at the split point +to prevent the remainder from coalescing into the returned allocation. + +### Marking lifecycle + +1. **Insert.** Coalesce-free is set on both the first and last chunk entries. + Boundary tags (size) are written at both ends. This allows both + leftward probing (which reads the last entry of a predecessor) and + rightward probing (which reads the first entry of a successor). + +2. 
**Allocate (`remove_block`).** The first entry's size tag and coalesce-free + bit are cleared. The last entry is left stale. This is safe because + the left walk always cross-checks: it reads the predecessor's size from + its last entry, computes the start address, then verifies + `is_free_block(start)`. Since the first entry's coalesce-free was cleared, + this check fails, preventing incorrect coalescing. + +3. **Right-walk absorption.** When the right walk absorbs a neighbour, it + zeros the absorbed block's first and last size tags. This prevents + subsequent left walks from misreading stale sizes at those positions. + +| Bit | Set by | Cleared by | Purpose | +|--------------------|-----------------------|------------------------------|----------------------------| +| `META_COALESCE_FREE_BIT` | `insert_block` (both endpoints) | `remove_block` (first only); right-walk (implicit) | Free-pool membership | +| `META_BOUNDARY_BIT`| PAL allocation; range split | Never (preserved) | Cross-range boundary | + +## B-dependence: structural assumptions + +The bitmap layout, mask count, and slot ordering all depend on `B = +INTERMEDIATE_BITS`. 
This design hardcodes $B = 2$ and guards every +structural assumption with a `static_assert` labelled **[A1]**–**[A7]**: + +| Label | Assumption | What would change for $B \ne 2$ | +|-------|-----------|-------------------------------| +| A1 | 5 slots per exponent | More alignment tiers → more incomparable pairs → more slots | +| A2 | 3 prefix bits (sizes 1–3) | Different sub-exponent range | +| A3 | `alloc_mask_bit` returns one index | Multiple incomparable pairs → multi-bit mask | +| A4 | Mask clears 1 bit | Same as A3 | +| A5 | Only $m{=}0$ needs masking | Intermediate-tier mantissas might also need masks | +| A6 | Slot ordering [A-only, B-only, both, +m2, +m3] | Different DAG shape | +| A7 | Two threshold breakpoints per exponent | More tiers → more breakpoints | + +To support a different $B$ value, each of these assumptions would need to +be re-derived (the `single_bitmap.py` and `index_table.py` prototypes in +`prototype/` can be adapted for this). + +## File organisation + +| File | Purpose | +|------|---------| +| `backend_helpers/bitmap_coalesce_helpers.h` | Pure-math helpers: bin index, threshold, decompose, round-up | +| `backend_helpers/bitmap_coalesce.h` | Core data structure: bitmap, free lists, insert/remove/coalesce | +| `backend_helpers/bitmap_coalesce_range.h` | Pipeline adapter: `alloc_range`/`dealloc_range` with carving and refill | +| `test/func/bc_helpers/bc_helpers.cc` | Unit tests for helper math | +| `test/func/bc_core/bc_core.cc` | Unit tests for core data structure (mock Rep) | +| `test/func/bc_range/bc_range.cc` | Pipeline integration tests | + +## Pipeline position + +`BitmapCoalesceRange` is wired into snmalloc's range pipeline as the +global coalescing layer: + +``` +GlobalR = Pipe< + Base, + BitmapCoalesceRange, + LogRange<2>, + GlobalRange> +``` + +It sits between the OS-level base range and the per-thread caching +layers, managing large free blocks and providing naturally-aligned +allocations to the rest of the pipeline. 
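As a closing sanity check on the insertion rule, the threshold $T(S, \alpha) = S + \max(0, \operatorname{align}(S) - \alpha)$ can be compared against a brute-force search. The sketch below works in chunk units; the function names are illustrative assumptions (they loosely mirror the prototype scripts, not the C++ helpers), and the size-class list is truncated at 16 for brevity.

```python
def natural_alignment(x):
    """Largest power of two dividing the positive integer x."""
    return x & -x


def threshold(size_class, alpha):
    """T(S, alpha) = S + max(0, align(S) - alpha), in chunk units."""
    return size_class + max(0, natural_alignment(size_class) - alpha)


def can_serve(addr, n, size_class):
    """Brute force: does [addr, addr + n) contain an aligned size_class run?"""
    a = natural_alignment(size_class)
    first_aligned = (addr + a - 1) // a * a
    return first_aligned + size_class <= addr + n


SIZE_CLASSES = [1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16]  # B = 2, truncated

# Any block at least T(S, alpha) chunks long must be able to serve S.
for addr in range(1, 256):
    alpha = natural_alignment(addr)
    for s in SIZE_CLASSES:
        assert can_serve(addr, threshold(s, alpha), s)

print(threshold(6, 1))      # 7
print(can_serve(5, 15, 6))  # True
```

The threshold is a worst-case guarantee over all addresses with alignment $\alpha$: a particular address may serve a size class with a smaller block, but no address with that alignment needs a larger one.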
diff --git a/prototype/index_table.py b/prototype/index_table.py new file mode 100644 index 000000000..6b9110a74 --- /dev/null +++ b/prototype/index_table.py @@ -0,0 +1,240 @@ +#!/usr/bin/env python3 +""" +Exhaustively test all blocks in a 256-unit arena. +For each block (address, size), determine which size classes it can serve +with natural alignment. Group blocks by their unique servable set. +""" + +ARENA = 256 +B = 2 # intermediate bits + +def gen_size_classes(max_size): + """Generate all valid size classes up to max_size with B=2.""" + classes = set() + # e=0: S = 1 + # e=1: S = 2, 3 + # e>=2: S = 2^e + m * 2^(e-2) for m in 0..3 + classes.add(1) + classes.add(2) + classes.add(3) + e = 2 + while True: + base = 1 << e + step = 1 << (e - B) + for m in range(1 << B): + s = base + m * step + if s > max_size: + break + classes.add(s) + if base > max_size: + break + e += 1 + return sorted(classes) + +def align(x): + """Natural alignment: largest power of 2 dividing x. align(0) = infinity.""" + if x == 0: + return 1 << 30 # effectively infinite + return x & (-x) + +def can_serve(addr, block_size, sizeclass): + """Can block [addr, addr+block_size) serve a naturally-aligned allocation of sizeclass?""" + A = align(sizeclass) + # First A-aligned address >= addr + first_aligned = ((addr + A - 1) // A) * A + # Need first_aligned + sizeclass <= addr + block_size + return first_aligned + sizeclass <= addr + block_size + +def main(): + size_classes = gen_size_classes(ARENA) + print(f"Size classes (B={B}, up to {ARENA}): {size_classes}") + print(f"Count: {len(size_classes)}") + print() + + # For each (addr, block_size), compute the set of servable size classes + # Key: frozenset of servable classes -> list of (addr, size) blocks + groups = {} + + for a in range(ARENA): + for n in range(1, ARENA - a + 1): + servable = frozenset( + sc for sc in size_classes if can_serve(a, n, sc) + ) + if servable not in groups: + groups[servable] = [] + groups[servable].append((a, n)) + + # 
Sort groups by the minimum block size that achieves this set + sorted_groups = sorted(groups.items(), key=lambda kv: (len(kv[0]), min(kv[0]) if kv[0] else 0)) + + print(f"Total unique servable sets: {len(groups)}") + print() + + # Now the key question: does the servable set depend only on (block_size, align(addr))? + # Check this empirically + print("=" * 80) + print("CHECKING: does servable set depend only on (block_size, align(addr))?") + print("=" * 80) + + by_size_align = {} + conflicts = 0 + for a in range(ARENA): + for n in range(1, ARENA - a + 1): + servable = frozenset( + sc for sc in size_classes if can_serve(a, n, sc) + ) + alpha = min(align(a), ARENA) # cap alignment + key = (n, alpha) + if key not in by_size_align: + by_size_align[key] = servable + elif by_size_align[key] != servable: + conflicts += 1 + if conflicts <= 5: + print(f" CONFLICT: (size={n}, align={alpha})") + print(f" existing: {sorted(by_size_align[key])}") + print(f" new (a={a}): {sorted(servable)}") + + if conflicts == 0: + print(" NO CONFLICTS! 
Servable set depends only on (block_size, align(addr)).") + else: + print(f" {conflicts} total conflicts found.") + print() + + # Now build the flattened index: unique sets indexed by (block_size, align(addr)) + # Group by unique servable set, showing which (size, align) pairs map to it + print("=" * 80) + print("FLATTENED INDEX: unique servable sets ordered by inclusion") + print("=" * 80) + + # Collect unique sets from the (size, align) perspective + unique_sets = {} + for (n, alpha), servable in sorted(by_size_align.items()): + if servable not in unique_sets: + unique_sets[servable] = [] + unique_sets[servable].append((n, alpha)) + + # Sort by set size, then by largest element + sorted_sets = sorted(unique_sets.items(), + key=lambda kv: (len(kv[0]), max(kv[0]) if kv[0] else 0)) + + for idx, (servable, pairs) in enumerate(sorted_sets): + sc_list = sorted(servable) + # Find the smallest and largest block sizes across all (n, alpha) pairs + min_n = min(n for n, _ in pairs) + max_n = max(n for n, _ in pairs) + # Show a compact representation of which (size, align) pairs give this set + print(f"\nIndex {idx}: servable = {sc_list}") + print(f" |servable| = {len(servable)}, block sizes [{min_n}..{max_n}]") + # Show a few representative (n, alpha) pairs + pairs_sorted = sorted(pairs) + if len(pairs_sorted) <= 10: + for n, alpha in pairs_sorted: + print(f" (size={n}, align={alpha})") + else: + for n, alpha in pairs_sorted[:5]: + print(f" (size={n}, align={alpha})") + print(f" ... 
({len(pairs_sorted)} total pairs)") + for n, alpha in pairs_sorted[-3:]: + print(f" (size={n}, align={alpha})") + + print() + print("=" * 80) + print(f"SUMMARY: {len(unique_sets)} unique servable sets (= flat index entries)") + print("=" * 80) + + # Now show just the index by (block_size, block_align) -> index + print() + print("=" * 80) + print("INDEX TABLE: (block_size, block_align) -> index") + print("Showing for block sizes 1..64 and aligns 1,2,4,8,16,32,64") + print("=" * 80) + + # Assign index numbers + set_to_idx = {} + for idx, (servable, _) in enumerate(sorted_sets): + set_to_idx[servable] = idx + + # Print header + aligns = [1, 2, 4, 8, 16, 32, 64, 128, 256] + print(f"{'size':>6}", end="") + for alpha in aligns: + print(f" α={alpha:>3}", end="") + print(" servable max (at high α)") + + for n in range(1, 65): + print(f"{n:>6}", end="") + for alpha in aligns: + key = (n, alpha) + if key in by_size_align: + idx = set_to_idx[by_size_align[key]] + print(f" {idx:>5}", end="") + else: + print(f" {'—':>3}", end="") + # Show max servable sizeclass for highest alpha + best_alpha = max(a for a in aligns if (n, a) in by_size_align) + best_set = by_size_align[(n, best_alpha)] + print(f" max_sc={max(best_set) if best_set else 0}") + + # Show how many indexes are actually distinct per alignment tier + print() + for alpha in aligns: + indices_at_tier = set() + for n in range(1, ARENA + 1): + key = (n, alpha) + if key in by_size_align: + indices_at_tier.add(set_to_idx[by_size_align[key]]) + print(f" Tier α={alpha:>3}: {len(indices_at_tier)} distinct indexes used") + + # Show the progression: for each block_size n (at max alignment), + # what's the index and what new size class was unlocked? 
+ print() + print("=" * 80) + print("PROGRESSION at max alignment (block at address 0):") + print("As block size grows, which size classes become servable?") + print("=" * 80) + prev_set = frozenset() + for n in range(1, 129): + # At address 0, alignment is infinite, so alpha is huge + key = (n, min(align(0), ARENA)) + # Actually address 0 won't appear with all sizes. Use highest alpha. + # Let's compute directly + servable = frozenset( + sc for sc in size_classes if n >= sc # at perfect alignment, just need n >= sc + ) + if servable != prev_set: + new = sorted(servable - prev_set) + print(f" n={n:>4}: +{new} (total {len(servable)} classes)") + prev_set = servable + + print() + print("=" * 80) + print("PROGRESSION at alignment 1 (worst case):") + print("=" * 80) + prev_set = frozenset() + for n in range(1, 257): + key = (n, 1) + if key not in by_size_align: + continue + servable = by_size_align[key] + if servable != prev_set: + new = sorted(servable - prev_set) + print(f" n={n:>4}: +{new} (total {len(servable)} classes)") + prev_set = servable + + print() + print("=" * 80) + print("PROGRESSION at alignment 4:") + print("=" * 80) + prev_set = frozenset() + for n in range(1, 257): + key = (n, 4) + if key not in by_size_align: + continue + servable = by_size_align[key] + if servable != prev_set: + new = sorted(servable - prev_set) + print(f" n={n:>4}: +{new} (total {len(servable)} classes)") + prev_set = servable + +if __name__ == "__main__": + main() diff --git a/prototype/single_bitmap.py b/prototype/single_bitmap.py new file mode 100644 index 000000000..21838b91c --- /dev/null +++ b/prototype/single_bitmap.py @@ -0,0 +1,398 @@ +#!/usr/bin/env python3 +""" +Verify: can we use a SINGLE bitmap with a per-mantissa mask +to do O(1) allocation lookups? + +Key idea: lay out the 34 unique servable sets in a flat bitmap. +For each requested size class, compute a starting bit and a mask +that clears at most 1 bit. Then find-first-set gives the answer. 
+""" + +ARENA = 256 +B = 2 # intermediate bits + +def gen_size_classes(max_size): + classes = set() + classes.add(1) + classes.add(2) + classes.add(3) + e = 2 + while True: + base = 1 << e + step = 1 << (e - B) + for m in range(1 << B): + s = base + m * step + if s > max_size: + break + classes.add(s) + if base > max_size: + break + e += 1 + return sorted(classes) + +def natural_align(x): + if x == 0: + return 1 << 30 + return x & (-x) + +def can_serve(addr, block_size, sizeclass): + A = natural_align(sizeclass) + first_aligned = ((addr + A - 1) // A) * A + return first_aligned + sizeclass <= addr + block_size + +def get_exponent_mantissa(s): + """Return (exponent, mantissa) for size class s with B=2.""" + if s == 1: return (0, 0) + if s == 2: return (1, 0) + if s == 3: return (1, 1) + e = 2 + while True: + base = 1 << e + step = 1 << (e - B) + for m in range(4): + if base + m * step == s: + return (e, m) + e += 1 + if base > s * 2: + return None + +def main(): + size_classes = gen_size_classes(ARENA) + print(f"Size classes: {size_classes}") + print(f"Count: {len(size_classes)}\n") + + # ================================================================ + # Step 1: Compute ALL unique servable sets (exhaustive) + # ================================================================ + all_sets = {} # frozenset -> list of (addr, size) blocks + for a in range(ARENA): + for n in range(1, ARENA - a + 1): + servable = frozenset(sc for sc in size_classes if can_serve(a, n, sc)) + if servable not in all_sets: + all_sets[servable] = [] + all_sets[servable].append((a, n)) + + # Sort by (|set|, max element) + sorted_sets = sorted(all_sets.keys(), key=lambda s: (len(s), max(s) if s else 0)) + set_to_idx = {s: i for i, s in enumerate(sorted_sets)} + + print(f"Total unique servable sets: {len(sorted_sets)}\n") + + # ================================================================ + # Step 2: Analyze the structure per exponent + # 
================================================================ + print("=" * 80) + print("STRUCTURE ANALYSIS: which size classes appear at each step?") + print("=" * 80) + + for i, s in enumerate(sorted_sets): + # What's NEW in this set compared to all strict subsets? + subsets = [sorted_sets[j] for j in range(i) if sorted_sets[j] < s] + if subsets: + biggest_subset = max(subsets, key=len) + new = sorted(s - biggest_subset) + else: + new = sorted(s) + + # Which sets is this incomparable with? + incomparable = [] + for j in range(i): + other = sorted_sets[j] + if not (other < s) and not (other > s) and other != s: + if len(other) == len(s): # same cardinality, truly incomparable + incomparable.append(j) + + em_list = [(sc, get_exponent_mantissa(sc)) for sc in new] + incomp_str = f" INCOMPARABLE with {incomparable}" if incomparable else "" + new_str = str(sorted(new)) + print(f" idx {i:2d}: |{len(s):2d}| +{new_str:30s}{incomp_str}") + + # ================================================================ + # Step 3: Verify the 5-per-exponent structure + # ================================================================ + print() + print("=" * 80) + print("BITMAP LAYOUT: 5 bits per exponent level") + print("=" * 80) + + # For each exponent e >= 2, identify the 5 indices + max_exp = 0 + for sc in size_classes: + em = get_exponent_mantissa(sc) + if em and em[0] > max_exp: + max_exp = em[0] + + # Build mapping: for each exponent e, find: + # - A-only: first index that adds m=0 but NOT m=1 + # - B-only: first index that adds m=1 but NOT m=0 + # - both: first index that has both m=0 and m=1 + # - +m2: first index that has m=2 + # - +m3: first index that has m=3 + + print(f"\nPrefix (e=0,1): bits 0-2") + for i in range(min(3, len(sorted_sets))): + print(f" bit {i}: {sorted(sorted_sets[i])}") + + bit_pos = 3 # next available bit + exponent_bits = {} # e -> {role: bit_position} + + for e in range(2, max_exp + 1): + sizes_at_e = [] + for m in range(4): + step = 1 << (e - B) + 
s = (1 << e) + m * step + if s <= ARENA: + sizes_at_e.append((m, s)) + + if not sizes_at_e: + continue + + m0_size = (1 << e) # m=0 + m1_size = 5 * (1 << (e - 2)) # m=1 + m2_size = 3 * (1 << (e - 1)) # m=2 + m3_size = 7 * (1 << (e - 2)) # m=3 + + # Find the 5 indices for this exponent + roles = {} + for i, s in enumerate(sorted_sets): + has_m0 = m0_size in s and m0_size <= ARENA + has_m1 = m1_size in s and m1_size <= ARENA + has_m2 = m2_size in s and m2_size <= ARENA + has_m3 = m3_size in s and m3_size <= ARENA + + if has_m0 and not has_m1 and 'A-only' not in roles: + roles['A-only'] = i + if has_m1 and not has_m0 and 'B-only' not in roles: + roles['B-only'] = i + if has_m0 and has_m1 and not has_m2 and 'both' not in roles: + roles['both'] = i + if has_m2 and not has_m3 and '+m2' not in roles: + roles['+m2'] = i + if has_m3 and '+m3' not in roles: + roles['+m3'] = i + + exponent_bits[e] = {} + print(f"\nExponent e={e}: sizes {[s for _, s in sizes_at_e]}") + for role in ['A-only', 'B-only', 'both', '+m2', '+m3']: + if role in roles: + idx = roles[role] + exponent_bits[e][role] = bit_pos + print(f" bit {bit_pos:2d} ({role:7s}): idx {idx:2d} = {sorted(sorted_sets[idx])}") + bit_pos += 1 + else: + print(f" ({role:7s}): not present (size > arena)") + + total_bits = bit_pos + print(f"\nTotal bitmap bits: {total_bits}") + + # ================================================================ + # Step 4: The mask rule + # ================================================================ + print() + print("=" * 80) + print("MASK RULE") + print("=" * 80) + print() + print("For each size class S with exponent e and mantissa m:") + print(" m=0: start at A-only bit(e). Mask OUT the B-only bit(e). [1 bit masked]") + print(" m=1: start at B-only bit(e). No mask needed. [0 bits masked]") + print(" m=2: start at +m2 bit(e). No mask needed. [0 bits masked]") + print(" m=3: start at +m3 bit(e). No mask needed. 
[0 bits masked]") + print() + print("Only m=0 needs a mask, and it's exactly 1 bit.") + + # ================================================================ + # Step 5: VERIFY the mask rule exhaustively + # ================================================================ + print() + print("=" * 80) + print("EXHAUSTIVE VERIFICATION") + print("=" * 80) + + # For each size class S, compute which indices' servable sets include S + valid_indices_for = {} # sizeclass -> set of indices + for sc in size_classes: + valid = set() + for i, s in enumerate(sorted_sets): + if sc in s: + valid.add(i) + valid_indices_for[sc] = valid + + # Build the bitmap assignment: index_i -> bit position + # We need to map the 34 indices to the bit positions we assigned + idx_to_bit = {} + # Prefix bits + idx_to_bit[0] = 0 # {1} + idx_to_bit[1] = 1 # {1,2} + idx_to_bit[2] = 2 # {1,2,3} + + for e, roles in exponent_bits.items(): + # Map from role -> index we found + m0_size = (1 << e) + m1_size = 5 * (1 << (e - 2)) + m2_size = 3 * (1 << (e - 1)) + m3_size = 7 * (1 << (e - 2)) + + for i, s in enumerate(sorted_sets): + has_m0 = m0_size in s and m0_size <= ARENA + has_m1 = m1_size in s and m1_size <= ARENA + has_m2 = m2_size in s and m2_size <= ARENA + has_m3 = m3_size in s and m3_size <= ARENA + + if has_m0 and not has_m1 and i not in idx_to_bit: + if 'A-only' in roles: + idx_to_bit[i] = roles['A-only'] + if has_m1 and not has_m0 and i not in idx_to_bit: + if 'B-only' in roles: + idx_to_bit[i] = roles['B-only'] + if has_m0 and has_m1 and not has_m2 and i not in idx_to_bit: + if 'both' in roles: + idx_to_bit[i] = roles['both'] + if has_m2 and not has_m3 and i not in idx_to_bit: + if '+m2' in roles: + idx_to_bit[i] = roles['+m2'] + if has_m3 and i not in idx_to_bit: + if '+m3' in roles: + idx_to_bit[i] = roles['+m3'] + + # Check we mapped everything + unmapped = [i for i in range(len(sorted_sets)) if i not in idx_to_bit] + if unmapped: + print(f" WARNING: unmapped indices: {unmapped}") + + 
bit_to_idx = {v: k for k, v in idx_to_bit.items()} + + # Now verify: for each size class, the mask rule correctly identifies + # all valid bits from the start position upward + errors = 0 + for sc in size_classes: + em = get_exponent_mantissa(sc) + if em is None: + continue + e, m = em + + valid_bits = set() + for idx in valid_indices_for[sc]: + if idx in idx_to_bit: + valid_bits.add(idx_to_bit[idx]) + + # Compute start bit and mask + if e <= 1: + # Prefix: sizes 1,2,3 + start = {1: 0, 2: 1, 3: 2}[sc] + mask_clear = set() # no mask needed for prefix + else: + roles = exponent_bits.get(e, {}) + if m == 0: + start = roles.get('A-only', -1) + mask_clear = {roles.get('B-only', -1)} + elif m == 1: + start = roles.get('B-only', -1) + mask_clear = set() # no mask! + elif m == 2: + start = roles.get('+m2', -1) + mask_clear = set() + elif m == 3: + start = roles.get('+m3', -1) + mask_clear = set() + + if start == -1: + continue + + # The set of bits we'd search: all bits >= start, except mask_clear + search_bits = set(range(start, total_bits)) - mask_clear + + # Valid bits at or above start + reachable = valid_bits & search_bits + + # Invalid bits that we'd incorrectly hit + false_positives = search_bits - valid_bits + # Check: are any false positives below the first valid bit? + # (We only care about false positives that have a set bit in the bitmap + # and appear before a true positive.) + # Actually, the REAL check: for every bit in search_bits, + # if it's set in the bitmap, it should be in valid_bits. + + # Simpler check: search_bits should be a SUBSET of valid_bits ∪ {bits above all valid} + # Actually: we need that for any bit b in search_bits where b has a block, + # it's valid to allocate S from that block. + # This means: every bit in search_bits must be in valid_bits. + + problematic = search_bits - valid_bits + # These are bits where, if a block exists there, we'd incorrectly try to serve + # size class S from it. 
Let's check if any of these are actually reachable. + + # For the verification to pass: search_bits ⊆ valid_bits + # i.e., every bit at or above start (minus masked) should be a valid index for S + + if problematic: + # Find the actual problematic servable set + for pb in sorted(problematic): + if pb in bit_to_idx: + idx = bit_to_idx[pb] + if sc not in sorted_sets[idx]: + errors += 1 + if errors <= 10: + print(f" ERROR: sc={sc} (e={e},m={m}), bit {pb} " + f"(idx {idx}) does NOT contain {sc}") + print(f" set = {sorted(sorted_sets[idx])}") + + if errors == 0: + print(" ALL CHECKS PASSED! Single bitmap + 1-bit mask works correctly.") + else: + print(f"\n {errors} total errors found.") + + # ================================================================ + # Step 6: Show the complete scheme + # ================================================================ + print() + print("=" * 80) + print("COMPLETE ALLOCATION SCHEME") + print("=" * 80) + print() + print(f"Bitmap: {total_bits} bits (one per unique servable set)") + print() + print("On FREE(addr, size):") + print(" 1. Maximally coalesce with neighbors") + print(" 2. Compute exact servable set for (addr, coalesced_size)") + print(" 3. Map to bit position, set bit in bitmap, add to free list") + print() + print("On ALLOCATE(sizeclass S):") + print(" 1. (e, m) = exponent_mantissa(S)") + print(" 2. start = start_bit[e][m]") + print(" 3. masked_bitmap = bitmap") + print(" if m == 0: masked_bitmap &= ~(1 << b_only_bit[e])") + print(" 4. result = find_first_set(masked_bitmap >> start)") + print(" 5. Pop block from free list at (start + result)") + print(" 6. 
Carve S from block, reinsert remainders") + print() + + # Print the lookup tables + print("START BIT TABLE:") + print(f" {'SC':>4s} {'(e,m)':>6s} {'start':>5s} {'mask_bit':>8s}") + for sc in size_classes: + em = get_exponent_mantissa(sc) + if em is None: + continue + e, m = em + if e <= 1: + start = {1: 0, 2: 1, 3: 2}[sc] + mask = "—" + else: + roles = exponent_bits.get(e, {}) + if m == 0: + start = roles.get('A-only', -1) + mask = str(roles.get('B-only', -1)) + elif m == 1: + start = roles.get('B-only', -1) + mask = "—" + elif m == 2: + start = roles.get('+m2', -1) + mask = "—" + elif m == 3: + start = roles.get('+m3', -1) + mask = "—" + print(f" {sc:>4d} ({e},{m}) {start:>5d} {mask:>8s}") + +if __name__ == "__main__": + main() diff --git a/skills/building_and_testing.md b/skills/building_and_testing.md new file mode 100644 index 000000000..019449d7b --- /dev/null +++ b/skills/building_and_testing.md @@ -0,0 +1,57 @@ +# Building and Testing Skill + +This file is the complete reference for building and testing snmalloc. +It is designed to be used by a subagent that has NO context about what +code changes were made — only that it needs to build and verify the +project. This isolation is intentional: test results must be interpreted +without bias from knowing what changed. + +## Build + +- Build directory: `build/` +- Build system: Ninja with CMake +- **Always test with a Debug build.** Debug enables assertions (`-check` variants) that catch invariant violations invisible in Release. A Release-only test run can report 100% pass while masking real bugs. Verify with `grep CMAKE_BUILD_TYPE build/CMakeCache.txt` — it must show `Debug`. 
+- Rebuild all targets before running ctest: `ninja -C build` (required — ctest runs pre-built binaries) +- Rebuild specific targets: `ninja -C build ` +- Always run `clang-format` before committing changes: `ninja -C build clangformat` + +## Testing + +- Run `func-malloc-fast` and `func-jemalloc-fast` to catch allocation edge cases +- The `-check` variants include assertions but may pass when `-fast` hangs due to timing differences +- Use `timeout` when running tests to avoid infinite hangs +- Never run a test on a stale build artifact. Rebuild in the same build directory/config before any run or rerun: `ninja -C ` for `ctest` runs, or `ninja -C ` if you invoke a direct binary. If the rebuild fails, stop and report with the rebuild log. +- Testing skill: keep commands stable (right build dir/config, consistent flags), prefer `ctest -R --output-on-failure`, and avoid ad-hoc command variants that change coverage or filters. +- Before considering a change complete, run the full test suite: `ctest --output-on-failure -j 4 --timeout 60` + +### Test failures (never hand-wave) + +- Never describe a failure as transient without evidence. Treat every failure as actionable until disproven. +- After a rebuild (per the testing rule above) succeeds, rerun the exact failing command twice: Rerun #1 must match the original command (including filters/flags such as `-R`, `-j`, `--timeout`); Rerun #2 may add only `--output-on-failure` if it was missing. No other changes to flags or filters between reruns. +- Required logging bundle for any failure or flake claim: rebuild command plus stdout/stderr; original failing command plus stdout/stderr; both rerun commands plus stdout/stderr; commit/branch; build directory and config (Release/Debug); compiler/toolchain; host OS; env vars/options affecting the run (allocator config, sanitizers, thread count); note if Rerun #2 added `--output-on-failure`. 
+- Workflow: record failing command/output → rebuild in the same build directory (stop/report if rebuild fails) → two reruns as above → capture all logs/context → check CI status and origin/main baseline → only label a flake with evidence. Report flakes or unresolved failures in a PR comment with logs and CI links. + +### Test library (`snmalloc_testlib`) + +Tests that only use the public allocator API can link against a pre-compiled static library (`snmalloc-testlib-{fast,check}`) instead of compiling the full allocator in each TU. + +- **Header**: `test/snmalloc_testlib.h` — forward-declares the API surface; does NOT include any snmalloc headers. Tests that also need snmalloc internals (sizeclasses, pointer math, etc.) include `` or `` alongside it. +- **CMake**: Add the test name to `LIBRARY_FUNC_TESTS` or `LIBRARY_PERF_TESTS` in `CMakeLists.txt`. +- **Apply broadly**: When adding new API to testlib (e.g., `ScopedAllocHandle`), immediately audit all remaining non-library tests to see which ones can now be migrated. Don't wait for CI to find them one by one. +- **Cannot migrate**: Tests that use custom `Config` types, `Pool`, override machinery, internal data structures (freelists, MPSC queues), or the statically-sized `alloc()` template with many size values genuinely need `snmalloc.h`. + +## Benchmarking + +- Before benchmarking, verify Release build: `grep CMAKE_BUILD_TYPE build/CMakeCache.txt` should show `Release` +- Debug builds have assertions enabled and will give misleading performance numbers + +## Subagent protocol + +When you are invoked as a testing subagent: + +1. **Read this file first.** It is your only reference for how to build and test. +2. **You have no knowledge of what changed.** Do not ask. Do not speculate. Report only what you observe. +3. **Rebuild before testing.** Always run `ninja -C build` before any test invocation. If the rebuild fails, report the failure and stop. +4. 
**Run the requested tests** (or the full suite if not specified). Use the exact commands from this file. +5. **Report results factually**: which tests passed, which failed, the exact commands you ran, and the full output of any failures. Do not interpret failures in terms of code changes — you don't know what they are. +6. **Never label a failure as transient.** If a test fails, follow the failure protocol above (rebuild + two reruns + logging bundle). Report all evidence. diff --git a/src/snmalloc/backend/backend.h b/src/snmalloc/backend/backend.h index 2772cf319..07782e486 100644 --- a/src/snmalloc/backend/backend.h +++ b/src/snmalloc/backend/backend.h @@ -92,8 +92,8 @@ namespace snmalloc uintptr_t ras, sizeclass_t sizeclass) { - SNMALLOC_ASSERT(bits::is_pow2(size)); SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); // Calculate the extra bytes required to store the client meta-data. size_t extra_bytes = SlabMetadata::get_extra_bytes(sizeclass); @@ -129,7 +129,20 @@ namespace snmalloc } typename Pagemap::Entry t(meta, ras); - Pagemap::set_metaentry(address_cast(p), size, t); + + // Write the entry to all chunk entries in a single pass. + // For non-pow2 large sizeclasses, also set per-chunk offset + // bits (distance from allocation start in nat_align units). 
+ auto slab_mask = sizeclass_metadata.fast(sizeclass).slab_mask; + size_t nat_align = slab_mask + 1; + if (nat_align < size) + { + Pagemap::set_metaentry(address_cast(p), size, t, nat_align); + } + else + { + Pagemap::set_metaentry(address_cast(p), size, t); + } return {Aal::capptr_bound(p, size), meta}; } diff --git a/src/snmalloc/backend/meta_protected_range.h b/src/snmalloc/backend/meta_protected_range.h index 857e853d2..0991f5244 100644 --- a/src/snmalloc/backend/meta_protected_range.h +++ b/src/snmalloc/backend/meta_protected_range.h @@ -32,7 +32,7 @@ namespace snmalloc // Global range of memory using GlobalR = Pipe< Base, - LargeBuddyRange< + BitmapCoalesceRange< GlobalCacheSizeBits, bits::BITS - 1, Pagemap, @@ -51,10 +51,10 @@ namespace snmalloc // would be able to corrupt meta-data. using CentralObjectRange = Pipe< GlobalR, - LargeBuddyRange, LogRange<3>, GlobalRange, CommitRange, + DecayRange, StatsRange>; // Controls the padding around the meta-data range. @@ -88,23 +88,10 @@ namespace snmalloc StatsRange>; // Local caching of object range - using ObjectRange = Pipe< - CentralObjectRange, - LargeBuddyRange< - LocalCacheSizeBits, - LocalCacheSizeBits, - Pagemap, - page_size_bits>, - LogRange<5>>; + using ObjectRange = Pipe>; // Local caching of meta-data range - using MetaRange = Pipe< - CentralMetaRange, - LargeBuddyRange< - LocalCacheSizeBits - SubRangeRatioBits, - bits::BITS - 1, - Pagemap>, - SmallBuddyRange>; + using MetaRange = Pipe; ObjectRange object_range; diff --git a/src/snmalloc/backend/standard_range.h b/src/snmalloc/backend/standard_range.h index 78609ed2d..72bba5a70 100644 --- a/src/snmalloc/backend/standard_range.h +++ b/src/snmalloc/backend/standard_range.h @@ -29,7 +29,7 @@ namespace snmalloc // Global range of memory, expose this so can be filled by init. 
using GlobalR = Pipe< Base, - LargeBuddyRange< + BitmapCoalesceRange< GlobalCacheSizeBits, bits::BITS - 1, Pagemap, @@ -37,8 +37,18 @@ namespace snmalloc LogRange<2>, GlobalRange>; - // Track stats of the committed memory - using Stats = Pipe, StatsRange>; + // Decay range caches deallocated memory and gradually releases it + // back to the parent, avoiding expensive repeated decommit/recommit + // cycles for transient allocation patterns. +#ifdef SNMALLOC_ENABLE_DECAY + using DecayR = Pipe>; +#else + using DecayR = Pipe, DecayRange>; +#endif + + // Track stats of the memory handed out (outside decay so stats + // methods are directly visible to StatsCombiner). + using Stats = Pipe; private: static constexpr size_t page_size_bits = @@ -46,14 +56,8 @@ namespace snmalloc public: // Source for object allocations and metadata - // Use buddy allocators to cache locally. - using LargeObjectRange = Pipe< - Stats, - StaticConditionalRange>>; + // No thread-local caching for now. + using LargeObjectRange = Stats; private: using ObjectRange = Pipe; @@ -88,8 +92,7 @@ namespace snmalloc static void set_small_heap() { - // This disables the thread local caching of large objects. - LargeObjectRange::disable_range(); + // No thread-local caching to disable. 
} }; } // namespace snmalloc diff --git a/src/snmalloc/backend_helpers/backend_helpers.h b/src/snmalloc/backend_helpers/backend_helpers.h index ee339337b..899e41706 100644 --- a/src/snmalloc/backend_helpers/backend_helpers.h +++ b/src/snmalloc/backend_helpers/backend_helpers.h @@ -2,9 +2,13 @@ #include "../mem/mem.h" #include "authmap.h" +#include "bitmap_coalesce.h" +#include "bitmap_coalesce_helpers.h" +#include "bitmap_coalesce_range.h" #include "buddy.h" #include "commitrange.h" #include "commonconfig.h" +#include "decayrange.h" #include "defaultpagemapentry.h" #include "empty_range.h" #include "globalrange.h" diff --git a/src/snmalloc/backend_helpers/bitmap_coalesce.h b/src/snmalloc/backend_helpers/bitmap_coalesce.h new file mode 100644 index 000000000..edc8ef732 --- /dev/null +++ b/src/snmalloc/backend_helpers/bitmap_coalesce.h @@ -0,0 +1,379 @@ +#pragma once + +#include "../ds/ds.h" +#include "bitmap_coalesce_helpers.h" + +namespace snmalloc +{ + /** + * Bitmap-indexed coalescing allocator for free memory. + * + * Manages free memory blocks stored in a flat bitmap with one bit per + * bin. Bins are indexed by BitmapCoalesceHelpers::bin_index, which + * maps (block_size_chunks, block_alignment_chunks) to a flat index + * that encodes both the servable size classes and the block's alignment + * tier. + * + * Blocks are stored at their actual size (no decomposition into + * sizeclass blocks). The allocation search finds + * a block guaranteed to serve the requested sizeclass at *some* + * naturally-aligned address within the block. The range wrapper + * (Step 4) carves the exact aligned allocation from the returned block. + * + * The Rep template provides pagemap access (see + * BitmapCoalesceRep in bitmap_coalesce_range.h for the concept). + * + * Template parameters: + * Rep - Pagemap accessor implementing the Rep concept. + * MAX_SIZE_BITS - Maximum block size this range manages (log2 bytes). 
+ */ + template + class BitmapCoalesce + { + using BC = BitmapCoalesceHelpers; + + static constexpr size_t NUM_BINS = BC::NUM_BINS; + static constexpr size_t BITMAP_WORDS = BC::BITMAP_WORDS; + + /** + * Head pointers for each bin's singly-linked free list. + * 0 means the bin is empty. + */ + address_t bin_heads[NUM_BINS] = {}; + + /** + * Single flat bitmap: bit i is set iff bin_heads[i] is non-empty. + */ + size_t bitmap[BITMAP_WORDS] = {}; + + // ---- Bitmap operations ---- + + void set_bit(size_t idx) + { + SNMALLOC_ASSERT(idx < NUM_BINS); + bitmap[idx / bits::BITS] |= size_t(1) << (idx % bits::BITS); + } + + void clear_bit(size_t idx) + { + SNMALLOC_ASSERT(idx < NUM_BINS); + bitmap[idx / bits::BITS] &= ~(size_t(1) << (idx % bits::BITS)); + } + + [[nodiscard]] bool test_bit(size_t idx) const + { + SNMALLOC_ASSERT(idx < NUM_BINS); + return (bitmap[idx / bits::BITS] >> (idx % bits::BITS)) & 1; + } + + /** + * Find the first set bit at position >= start, skipping mask_bit + * if it is not SIZE_MAX. Returns SIZE_MAX if no bit is found. + */ + [[nodiscard]] size_t find_set_from(size_t start, size_t mask_bit) const + { + SNMALLOC_ASSERT(start < NUM_BINS); + + size_t word_idx = start / bits::BITS; + size_t bit_ofs = start % bits::BITS; + + for (size_t w = word_idx; w < BITMAP_WORDS; w++) + { + size_t word = bitmap[w]; + + // Apply mask if mask_bit falls in this word. + if (mask_bit != SIZE_MAX) + { + size_t mask_word = mask_bit / bits::BITS; + if (w == mask_word) + word &= ~(size_t(1) << (mask_bit % bits::BITS)); + } + + // Clear bits below start in the first word. + if (w == word_idx) + word &= ~((size_t(1) << bit_ofs) - 1); + + if (word != 0) + { + size_t result = w * bits::BITS + bits::ctz(word); + if (result < NUM_BINS) + return result; + return SIZE_MAX; + } + } + return SIZE_MAX; + } + + // ---- Chunk-level helpers ---- + + /** + * Chunk-level address alignment: natural alignment of the chunk index + * at byte address addr. 
For address 0, returns a very large alignment. + */ + static size_t chunk_alignment(address_t addr) + { + auto chunk_addr = addr / MIN_CHUNK_SIZE; + if (chunk_addr == 0) + return size_t(1) << (bits::BITS - 2); + return BC::natural_alignment(chunk_addr); + } + + // ---- Bin operations ---- + + /** + * Insert a free block into its bin. Prepend to the singly-linked + * list, set boundary tags at both ends, mark coalesce_free, set bitmap bit. + */ + void insert_block(address_t addr, size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + + size_t n_chunks = size / MIN_CHUNK_SIZE; + size_t alpha = chunk_alignment(addr); + size_t bin = BC::bin_index(n_chunks, alpha); + + SNMALLOC_ASSERT(bin < NUM_BINS); + + Rep::set_boundary_tags(addr, size); + Rep::set_next(addr, bin_heads[bin]); + bin_heads[bin] = addr; + + Rep::set_coalesce_free(addr); + if (size > MIN_CHUNK_SIZE) + { + Rep::set_coalesce_free(addr + size - MIN_CHUNK_SIZE); + } + + set_bit(bin); + } + + /** + * Remove a specific block from its bin's linked list. + * Returns true if found and removed, false otherwise. + */ + bool remove_from_bin(address_t addr, size_t size) + { + size_t n_chunks = size / MIN_CHUNK_SIZE; + size_t alpha = chunk_alignment(addr); + size_t bin = BC::bin_index(n_chunks, alpha); + + address_t prev = 0; + address_t curr = bin_heads[bin]; + + while (curr != 0) + { + if (curr == addr) + { + auto next = Rep::get_next(curr); + if (prev == 0) + bin_heads[bin] = next; + else + Rep::set_next(prev, next); + + if (bin_heads[bin] == 0) + clear_bit(bin); + + return true; + } + prev = curr; + curr = Rep::get_next(curr); + } + + return false; + } + + public: + constexpr BitmapCoalesce() = default; + + /** + * Result of a block removal. + */ + struct RemoveResult + { + /// Base address of the block, or 0 if not found. + address_t addr; + /// Size of the block in bytes, or 0 if not found. + size_t size; + }; + + /** + * Insert a block without coalescing. 
+ * + * Inserts the block at its natural bin based on (size, alignment). + * Aligns the input to MIN_CHUNK_SIZE boundaries. + */ + void add_fresh_range(address_t addr, size_t length) + { + if (length == 0) + return; + + auto aligned_start = bits::align_up(addr, MIN_CHUNK_SIZE); + auto aligned_end = bits::align_down(addr + length, MIN_CHUNK_SIZE); + if (aligned_end <= aligned_start) + return; + + insert_block(aligned_start, aligned_end - aligned_start); + } + + /** + * Find and remove a block that can serve the given sizeclass. + * + * The size must be chunk-aligned and a valid sizeclass in chunk units. + * Returns {addr, block_size} or {0, 0} if not found. + * + * The returned block may be larger than requested, and its address + * may not be naturally aligned for the sizeclass. The caller + * (range wrapper) is responsible for carving. + * + * Clears the first-entry boundary tag and coalesce_free marker to + * prevent stale reads by the coalescing left walk. + */ + RemoveResult remove_block(size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + + size_t n_chunks = size / MIN_CHUNK_SIZE; + + size_t e = 0, m = 0; + bool valid = BC::decompose(n_chunks, e, m); + SNMALLOC_ASSERT(valid); + UNUSED(valid); + + size_t start = BC::alloc_start_bit(e, m); + + // [A4,A5] Apply mask for m=0 searches: skip the B-only bin at this + // exponent. For B=2, only m=0 needs masking and only 1 bit is masked. + static_assert( + INTERMEDIATE_BITS == 2, + "[A4,A5] single-bit mask for m=0 only assumes B=2"); + size_t mask_bit = SIZE_MAX; + if (m == 0 && e >= 2) + mask_bit = BC::alloc_mask_bit(e); + + size_t bin = find_set_from(start, mask_bit); + if (bin == SIZE_MAX) + return {0, 0}; + + // Pop head from this bin's list. 
+ address_t addr = bin_heads[bin]; + SNMALLOC_ASSERT(addr != 0); + + size_t block_size = Rep::get_size(addr); + SNMALLOC_ASSERT(block_size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(block_size % MIN_CHUNK_SIZE == 0); + + bin_heads[bin] = Rep::get_next(addr); + if (bin_heads[bin] == 0) + clear_bit(bin); + + // Clear the first-entry boundary tag and coalesce free marker to + // prevent stale reads by the left walk in add_block. The last + // entry's boundary tag and marker are left stale: the left walk's + // cross-check catches this. + Rep::set_size(addr, 0); + Rep::clear_coalesce_free(addr); + + return {addr, block_size}; + } + + /** + * Add a block with coalescing. + * + * Merges with adjacent free blocks using left and right walks, + * then inserts the coalesced block. + * + * (Implemented in Step 3.) + */ + void add_block(address_t addr, size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + + address_t merge_start = addr; + address_t merge_end = addr + size; + + // Phase 1: Left walk. + while (merge_start > 0 && !Rep::is_boundary(merge_start)) + { + address_t prev_last = merge_start - MIN_CHUNK_SIZE; + if (!Rep::is_free_block(prev_last)) + break; + + size_t prev_size = Rep::get_size(prev_last); + if (prev_size == 0 || prev_size > merge_start) + break; + + address_t prev_start = merge_start - prev_size; + if (!Rep::is_free_block(prev_start)) + break; + + if (Rep::get_size(prev_start) != prev_size) + break; + + bool removed = remove_from_bin(prev_start, prev_size); + SNMALLOC_ASSERT(removed); + UNUSED(removed); + + merge_start = prev_start; + } + + // Phase 2: Right walk. 
+ while (!Rep::is_boundary(merge_end)) + { + if (!Rep::is_free_block(merge_end)) + break; + + size_t next_size = Rep::get_size(merge_end); + if (next_size == 0) + break; + + bool removed = remove_from_bin(merge_end, next_size); + SNMALLOC_ASSERT(removed); + UNUSED(removed); + + // Clear absorbed block's tags to prevent stale reads by + // subsequent left walks. + Rep::set_size(merge_end, 0); + if (next_size > MIN_CHUNK_SIZE) + Rep::set_size(merge_end + next_size - MIN_CHUNK_SIZE, 0); + + merge_end += next_size; + } + + // Insert the coalesced block. + insert_block(merge_start, merge_end - merge_start); + } + + // ---- Test/debug access ---- + + /** + * Check if the bitmap bit for a given bin is set. + * Exposed for testing. + */ + [[nodiscard]] bool is_bin_non_empty(size_t bin) const + { + SNMALLOC_ASSERT(bin < NUM_BINS); + return test_bit(bin); + } + + /** + * Get the head of a bin's linked list. + * Exposed for testing. + */ + [[nodiscard]] address_t get_bin_head(size_t bin) const + { + SNMALLOC_ASSERT(bin < NUM_BINS); + return bin_heads[bin]; + } + + /** + * Get the number of bins. + */ + static constexpr size_t num_bins() + { + return NUM_BINS; + } + }; +} // namespace snmalloc diff --git a/src/snmalloc/backend_helpers/bitmap_coalesce_helpers.h b/src/snmalloc/backend_helpers/bitmap_coalesce_helpers.h new file mode 100644 index 000000000..94076ee42 --- /dev/null +++ b/src/snmalloc/backend_helpers/bitmap_coalesce_helpers.h @@ -0,0 +1,451 @@ +#pragma once + +#include "../ds/allocconfig.h" +#include "../ds_core/ds_core.h" + +namespace snmalloc +{ + /** + * Pure-math helpers for the bitmap-indexed coalescing range. + * + * Provides the mapping from (block_size_chunks, block_align_chunks) to a + * flat bitmap bin index, and the allocation lookup tables (start bit, mask + * bit) for each size class. + * + * Size classes follow S = 2^e + m * 2^{e-B} where B = INTERMEDIATE_BITS. 
+ * The bitmap has PREFIX_BITS prefix entries (for sizes 1..2^B-1 in chunks) + * plus SLOTS_PER_EXPONENT entries per exponent level e in [2, MAX_EXPONENT]. + * + * The slot layout within each exponent group is: + * [A-only, B-only, both, +m2, +m3] + * where A-only and B-only form the sole incomparable pair. This structure + * is specific to B=2 and guarded by static_asserts labelled [A1]-[A7]. + * + * Templated on MAX_SIZE_BITS (passed from the range wrapper). + */ + template + struct BitmapCoalesceHelpers + { + // ---- B=2 structural constants ---- + + static constexpr size_t B = INTERMEDIATE_BITS; + static constexpr size_t SL_COUNT = size_t(1) << B; + + /** + * [A1] Slot count per exponent depends on the incomparable-pair structure + * of the servable-set DAG, which is determined by alignment-tier count + * (B+1 tiers). For B=2: 1 incomparable pair -> 5 slots. + */ + static_assert( + INTERMEDIATE_BITS == 2, "[A1] SLOTS_PER_EXPONENT=5 assumes B=2"); + static constexpr size_t SLOTS_PER_EXPONENT = 5; + + /** + * [A2] Prefix bins cover sizes 1..2^B-1 (the sub-exponent range). + * For B=2 these are {1,2,3}. + */ + static_assert(INTERMEDIATE_BITS == 2, "[A2] PREFIX_BITS=3 assumes B=2"); + static constexpr size_t PREFIX_BITS = 3; + + /** + * Highest exponent in chunk units. Exponent range is [2, MAX_EXPONENT], + * giving MAX_EXPONENT - 1 levels of SLOTS_PER_EXPONENT each. 
+ */ + static constexpr size_t MAX_EXPONENT = MAX_SIZE_BITS - MIN_CHUNK_BITS; + + static constexpr size_t NUM_BINS = + PREFIX_BITS + SLOTS_PER_EXPONENT * (MAX_EXPONENT - 1); + + static constexpr size_t BITMAP_WORDS = + (NUM_BINS + bits::BITS - 1) / bits::BITS; + + // ---- Slot offsets within an exponent group ---- + // These name the 5 positions: A-only=0, B-only=1, both=2, +m2=3, +m3=4 + + static constexpr size_t SLOT_A_ONLY = 0; + static constexpr size_t SLOT_B_ONLY = 1; + static constexpr size_t SLOT_BOTH = 2; + static constexpr size_t SLOT_M2 = 3; + static constexpr size_t SLOT_M3 = 4; + + // ---- Helpers ---- + + /** + * Natural alignment of a positive integer: largest power of 2 dividing it. + * For 0, returns a very large power of 2. + */ + static constexpr size_t natural_alignment(size_t s) + { + if (s == 0) + return size_t(1) << (bits::BITS - 1); + return s & ~(s - 1); + } + + /** + * Bit position of the first slot for exponent e (e >= 2). + */ + static constexpr size_t exponent_base_bit(size_t e) + { + SNMALLOC_ASSERT(e >= 2); + return PREFIX_BITS + (e - 2) * SLOTS_PER_EXPONENT; + } + + /** + * Decompose a valid size class (in chunks) into (exponent, mantissa). + * Returns true on success, false if not a valid size class. + * + * For sizes 1..SL_COUNT-1 (prefix range): e <= 1, special handling. + * For sizes >= SL_COUNT: standard S = 2^e + m * 2^{e-B}. 
+ */ + static constexpr bool decompose(size_t s, size_t& e_out, size_t& m_out) + { + if (s == 0) + return false; + if (s == 1) + { + e_out = 0; + m_out = 0; + return true; + } + if (s == 2) + { + e_out = 1; + m_out = 0; + return true; + } + if (s == 3) + { + e_out = 1; + m_out = 1; + return true; + } + + // s >= 4: find fl = floor(log2(s)) + size_t fl = bits::BITS - 1 - bits::clz(s); + if (fl < B) + return false; + size_t base = size_t(1) << fl; + size_t step = base >> B; + if (step == 0) + return false; + size_t remainder = s - base; + if (remainder % step != 0) + return false; + size_t m = remainder / step; + if (m >= SL_COUNT) + return false; + + e_out = fl; + m_out = m; + return true; + } + + /** + * Size (in chunks) for a given (exponent, mantissa) pair. + */ + static constexpr size_t sizeclass_size(size_t e, size_t m) + { + if (e == 0) + return 1; + if (e == 1) + return m == 0 ? 2 : 3; // m can be 0 or 1 for e=1 + size_t base = size_t(1) << e; + size_t step = base >> B; + return base + m * step; + } + + /** + * Natural alignment of size class (e, m) in chunk units. + */ + static constexpr size_t sizeclass_alignment(size_t e, size_t m) + { + return natural_alignment(sizeclass_size(e, m)); + } + + // ---- Threshold computation ---- + + /** + * Minimum block size (in chunks) needed to serve size class S at + * block alignment alpha (in chunks). + * + * T(S, alpha) = S + max(0, align(S) - alpha) + * + * The block must be big enough for S itself plus any padding needed + * to reach the first naturally-aligned address within the block. + */ + static constexpr size_t threshold(size_t s, size_t alpha) + { + size_t a = natural_alignment(s); + if (a <= alpha) + return s; + return s + a - alpha; + } + + // ---- bin_index: the core mapping ---- + + /** + * [A7] Two threshold breakpoints per exponent assumes B+1 = 3 alignment + * tiers producing exactly 1 incomparable pair. Larger B -> more tiers -> + * more breakpoints per exponent. 
+ * + * For each exponent e, comparing against the size classes: + * m=0: S0 = 2^e, align0 = 2^e + * m=1: S1 = 5*2^{e-2}, align1 = 2^{e-2} + * m=2: S2 = 3*2^{e-1}, align2 = 2^{e-1} + * m=3: S3 = 7*2^{e-2}, align3 = 2^{e-2} + * + * T(S0, alpha) = 2^e + max(0, 2^e - alpha) + * T(S1, alpha) = 5*2^{e-2} + max(0, 2^{e-2} - alpha) + * T(S2, alpha) = 3*2^{e-1} + max(0, 2^{e-1} - alpha) + * T(S3, alpha) = 7*2^{e-2} + max(0, 2^{e-2} - alpha) + * + * The progression within an exponent at a given alpha is: + * A-only: can serve S0 but not S1 (S0's threshold met, S1's not) + * B-only: can serve S1 but not S0 (S1's threshold met, S0's not) + * both: can serve S0 and S1, but not S2 + * +m2: can also serve S2, but not S3 + * +m3: can serve all four + * + * Whether A or B threshold is smaller depends on alpha: + * - When alpha >= 2^e: T(S0) = 2^e, T(S1) = 5*2^{e-2} > T(S0), so A first + * - When alpha < 2^{e-2}: T(S0) = 2^{e+1}-alpha, T(S1) = 3*2^{e-1}-alpha, + * so B first + * + * The bin_index function finds the highest bin the block qualifies for. + */ + static_assert( + INTERMEDIATE_BITS == 2, "[A7] bin_index threshold logic assumes B=2"); + + static constexpr size_t bin_index(size_t n_chunks, size_t alpha_chunks) + { + if (n_chunks == 0) + return 0; + + // Prefix range: sizes 1, 2, 3 + // Size 3: align(3) = 1, T(3, α) = 3 for all α. + // Size 2: align(2) = 2, T(2, α) = 2 + max(0, 2 - α). + // Size 1: align(1) = 1, always servable. + if (n_chunks < SL_COUNT) + { + if (n_chunks >= 3) + return 2; + if (n_chunks >= threshold(2, alpha_chunks)) + return 1; + return 0; + } + + // For each exponent from high to low, check which slot the block + // qualifies for. We want the HIGHEST qualifying bin. + // + // Walk exponents downward. At each level, check up to 4 size classes. + // The thresholds within an exponent follow the slot progression: + // +m3 (highest) >= +m2 >= both >= {A-only, B-only} (lowest pair). + // + // Since A-only and B-only are incomparable, we must check both.
+ + size_t best_bin = 0; // fallback: bin 0 (serves sizeclass 1) + + // Upper bound on exponent: n_chunks can't serve classes larger than + // itself (even with perfect alignment). + size_t max_e = bits::BITS - 1 - bits::clz(n_chunks); + if (max_e > MAX_EXPONENT) + max_e = MAX_EXPONENT; + if (max_e < 2) + { + // Only prefix range is reachable + if (n_chunks >= 3) + return 2; + return n_chunks - 1; + } + + for (size_t e = max_e; e >= 2; e--) + { + size_t base_bit = exponent_base_bit(e); + + // Size classes at this exponent: + size_t s0 = size_t(1) << e; // m=0 + size_t s1 = 5 * (size_t(1) << (e - 2)); // m=1 + size_t s2 = 3 * (size_t(1) << (e - 1)); // m=2 + size_t s3 = 7 * (size_t(1) << (e - 2)); // m=3 + + // Thresholds: + size_t t0 = threshold(s0, alpha_chunks); + size_t t1 = threshold(s1, alpha_chunks); + size_t t2 = threshold(s2, alpha_chunks); + size_t t3 = threshold(s3, alpha_chunks); + + // Check from highest slot down + if (n_chunks >= t3) + { + // Can serve all four mantissas at this exponent + size_t candidate = base_bit + SLOT_M3; + if (candidate >= best_bin) + best_bin = candidate; + // This is the highest slot at this exponent; no need to check lower. + // But we also want the highest across ALL exponents, so we return + // immediately since higher exponents have higher bit indices and + // we're walking downward. 
+ return candidate; + } + if (n_chunks >= t2) + { + size_t candidate = base_bit + SLOT_M2; + if (candidate >= best_bin) + best_bin = candidate; + return candidate; + } + + // both: serves m=0 AND m=1 but not m=2 + bool can_m0 = n_chunks >= t0; + bool can_m1 = n_chunks >= t1; + + if (can_m0 && can_m1) + { + size_t candidate = base_bit + SLOT_BOTH; + if (candidate >= best_bin) + best_bin = candidate; + return candidate; + } + + // A-only or B-only: incomparable pair + if (can_m0) + { + size_t candidate = base_bit + SLOT_A_ONLY; + if (candidate > best_bin) + best_bin = candidate; + // A-only at e = base_bit(e)+0 = PREFIX_BITS + (e-2)*5 + // Highest at e-1 = base_bit(e-1)+4 = PREFIX_BITS + (e-3)*5 + 4 + // = PREFIX_BITS + (e-2)*5 - 1 + // So A-only at e (offset 0) > highest at e-1 (offset -1). Return. + return candidate; + } + if (can_m1) + { + size_t candidate = base_bit + SLOT_B_ONLY; + if (candidate > best_bin) + best_bin = candidate; + // B-only at e = base_bit(e)+1 = PREFIX_BITS + (e-2)*5 + 1 + // Highest at e-1 = base_bit(e-1)+4 = PREFIX_BITS + (e-3)*5 + 4 + // = PREFIX_BITS + (e-2)*5 - 1 + // So B-only at e (offset 1) > highest at e-1 (offset -1). Return. + return candidate; + } + + // Can't serve any size class at this exponent; try lower. + } + + // Fall back to prefix range (same threshold logic as the early return). + // For n >= SL_COUNT (=4), T(3, α) = 3 and T(2, α) ≤ 3 are always met, + // so this always returns 2. But keep the threshold checks for clarity.
+ if (n_chunks >= 3) + best_bin = 2; + else if (n_chunks >= threshold(2, alpha_chunks)) + best_bin = bits::max(best_bin, size_t(1)); + // n_chunks >= 1 is always true (we checked n_chunks == 0 above) + + return best_bin; + } + + // ---- Allocation lookup ---- + + /** + * [A6] The named slot layout {A-only, B-only, both, +m2, +m3} is the + * specific DAG linearisation for B=2. + * + * [A3] Returns a single bit index for the mask (not a multi-bit mask). + * For B>2, multiple incomparable slots could require a multi-bit mask. + */ + static_assert(INTERMEDIATE_BITS == 2, "[A3,A6] alloc lookup assumes B=2"); + + /** + * Starting bit position for bitmap search when allocating size class + * with exponent e and mantissa m. + * + * For prefix-range (e <= 1): + * sizeclass 1 -> bit 0, sizeclass 2 -> bit 1, sizeclass 3 -> bit 2 + * + * For exponent-range (e >= 2): + * m=0: A-only bit of exponent e + * m=1: B-only bit of exponent e + * m=2: +m2 bit of exponent e + * m=3: +m3 bit of exponent e + */ + static constexpr size_t alloc_start_bit(size_t e, size_t m) + { + if (e <= 1) + { + // Prefix range + return sizeclass_size(e, m) - 1; + } + + size_t base = exponent_base_bit(e); + switch (m) + { + case 0: + return base + SLOT_A_ONLY; + case 1: + return base + SLOT_B_ONLY; + case 2: + return base + SLOT_M2; + case 3: + return base + SLOT_M3; + default: + SNMALLOC_ASSERT(false); + return 0; + } + } + + /** + * Bit position to mask out when allocating m=0 at exponent e. + * Returns SIZE_MAX if no masking is needed. + * + * [A4,A5] For B=2, only m=0 needs masking (mask out B-only), + * and only 1 bit. + */ + static constexpr size_t alloc_mask_bit(size_t e) + { + static_assert(INTERMEDIATE_BITS == 2, "[A4,A5] mask bit assumes B=2"); + + if (e <= 1) + return SIZE_MAX; // prefix range: no masking needed + + return exponent_base_bit(e) + SLOT_B_ONLY; + } + + /** + * Check if a chunk count is a valid size class. 
+ */ + static constexpr bool is_valid_sizeclass(size_t s) + { + if (s == 0) + return false; + size_t e, m; + return decompose(s, e, m); + } + + /** + * Round up a chunk count to the next valid sizeclass. + * Returns 0 for input 0. + */ + static constexpr size_t round_up_sizeclass(size_t n) + { + if (n <= SL_COUNT) + return n; // 0..4 are all valid (0 → 0, 1..4 → themselves) + + size_t fl = bits::BITS - 1 - bits::clz(n); + size_t base = size_t(1) << fl; + if (n == base) + return n; + + size_t step = base >> B; + size_t remainder = n - base; + size_t m = (remainder + step - 1) / step; // round up + if (m >= SL_COUNT) + return size_t(1) << (fl + 1); + + return base + m * step; + } + }; +} // namespace snmalloc diff --git a/src/snmalloc/backend_helpers/bitmap_coalesce_range.h b/src/snmalloc/backend_helpers/bitmap_coalesce_range.h new file mode 100644 index 000000000..cd9069287 --- /dev/null +++ b/src/snmalloc/backend_helpers/bitmap_coalesce_range.h @@ -0,0 +1,341 @@ +#pragma once + +#include "../ds/ds.h" +#include "../mem/mem.h" +#include "bitmap_coalesce.h" +#include "bitmap_coalesce_helpers.h" +#include "empty_range.h" +#include "range_helpers.h" + +namespace snmalloc +{ + /** + * Pagemap accessor for the BitmapCoalesceRange. + * + * Implements the Rep concept required by BitmapCoalesce. + * Word::One = next pointer, + * Word::Two = size (boundary tag). 
+ */ + template + class BitmapCoalesceRep + { + template + struct pagemap_has_bounds : stl::false_type + {}; + + template + struct pagemap_has_bounds> + : stl::true_type + {}; + + static bool is_out_of_bounds(address_t addr) + { + if constexpr (pagemap_has_bounds::value) + { + auto [pm_base, pm_size] = Pagemap::get_bounds(); + return (addr - pm_base) >= pm_size; + } + else + { + UNUSED(addr); + return false; + } + } + + public: + static address_t get_next(address_t addr) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + return entry.get_backend_word(MetaEntryBase::Word::One).get(); + } + + static void set_next(address_t addr, address_t next) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + entry.get_backend_word(MetaEntryBase::Word::One) = next; + } + + static size_t get_size(address_t addr) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + return static_cast( + entry.get_backend_word(MetaEntryBase::Word::Two).get()); + } + + static void set_size(address_t addr, size_t size) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + entry.get_backend_word(MetaEntryBase::Word::Two) = size; + } + + static void set_boundary_tags(address_t addr, size_t size) + { + set_size(addr, size); + if (size > MIN_CHUNK_SIZE) + { + set_size(addr + size - MIN_CHUNK_SIZE, size); + } + } + + static bool is_free_block(address_t addr) + { + if (addr == 0) + return false; + if (is_out_of_bounds(addr)) + return false; + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + return entry.is_backend_owned() && entry.is_coalesce_free(); + } + + static bool is_boundary(address_t addr) + { + if (is_out_of_bounds(addr)) + return true; + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + return entry.is_boundary(); + } + + static void set_boundary(address_t addr) + { + auto& entry = + Pagemap::template 
get_metaentry_mut(address_cast(addr)); + entry.set_boundary(); + } + + static void set_coalesce_free(address_t addr) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + entry.set_coalesce_free(); + } + + static void clear_coalesce_free(address_t addr) + { + auto& entry = + Pagemap::template get_metaentry_mut(address_cast(addr)); + entry.clear_coalesce_free(); + } + }; + + /** + * Pipeline adapter for the bitmap-indexed coalescing allocator. + * + * Thin wrapper that connects BitmapCoalesce to snmalloc's range + * pipeline, providing alloc_range / dealloc_range with post-allocation + * carving and gradual warm-up refill. Drop-in replacement for + * LargeBuddyRange. + * + * Template parameters: + * REFILL_SIZE_BITS - Maximum refill size from parent (log2). + * MAX_SIZE_BITS - Maximum block size this range manages (log2). + * Pagemap - Pagemap type for boundary tags & linked lists. + * MIN_REFILL_SIZE_BITS - Minimum refill size from parent (log2). + */ + template< + size_t REFILL_SIZE_BITS, + size_t MAX_SIZE_BITS, + SNMALLOC_CONCEPT(IsWritablePagemap) Pagemap, + size_t MIN_REFILL_SIZE_BITS = 0> + class BitmapCoalesceRange + { + static_assert( + REFILL_SIZE_BITS <= MAX_SIZE_BITS, "REFILL_SIZE_BITS > MAX_SIZE_BITS"); + static_assert( + MIN_REFILL_SIZE_BITS <= REFILL_SIZE_BITS, + "MIN_REFILL_SIZE_BITS > REFILL_SIZE_BITS"); + + static constexpr size_t REFILL_SIZE = bits::one_at_bit(REFILL_SIZE_BITS); + static constexpr size_t MIN_REFILL_SIZE = + bits::one_at_bit(MIN_REFILL_SIZE_BITS); + + public: + template> + class Type : public ContainsParent + { + using ContainsParent::parent; + + using Rep = BitmapCoalesceRep; + using BC = BitmapCoalesceHelpers; + + BitmapCoalesce bc{}; + + size_t requested_total = 0; + + // ----- Refill from parent ----- + + capptr::Arena refill(size_t size, size_t alignment) + { + if (ParentRange::Aligned) + { + // Gradual warm-up heuristic (same as LargeBuddyRange). 
+ size_t refill_size = bits::min(REFILL_SIZE, requested_total); + refill_size = bits::max(refill_size, MIN_REFILL_SIZE); + refill_size = bits::max(refill_size, size); + refill_size = bits::next_pow2(refill_size); + + auto refill_range = parent.alloc_range(refill_size); + if (refill_range != nullptr) + { + requested_total += refill_size; + if (refill_size > size) + { + Rep::set_boundary(refill_range.unsafe_uintptr() + size); + bc.add_fresh_range( + refill_range.unsafe_uintptr() + size, refill_size - size); + } + SNMALLOC_ASSERT((refill_range.unsafe_uintptr() % alignment) == 0); + } + return refill_range; + } + + // Unaligned parent: overallocate, carve, return remainders. + size_t extra = size + alignment; + if (extra < size) // overflow check + return nullptr; + + size_t refill_size = bits::min(REFILL_SIZE, requested_total); + refill_size = bits::max(refill_size, MIN_REFILL_SIZE); + refill_size = bits::max(refill_size, extra); + + while (extra <= refill_size) + { + auto range = parent.alloc_range(refill_size); + if (range != nullptr) + { + requested_total += refill_size; + auto base = range.unsafe_uintptr(); + + auto aligned_base = bits::align_up(base, alignment); + SNMALLOC_ASSERT(aligned_base + size <= base + refill_size); + + if (aligned_base > base) + { + Rep::set_boundary(aligned_base); + } + + auto right_start = aligned_base + size; + auto range_end = base + refill_size; + if (right_start < range_end) + { + Rep::set_boundary(right_start); + } + + if (aligned_base > base) + { + bc.add_fresh_range(base, aligned_base - base); + } + + if (right_start < range_end) + { + bc.add_fresh_range(right_start, range_end - right_start); + } + + SNMALLOC_ASSERT((aligned_base % alignment) == 0); + return capptr::Arena::unsafe_from( + reinterpret_cast(aligned_base)); + } + refill_size >>= 1; + } + + return nullptr; + } + + public: + static constexpr bool Aligned = true; + static constexpr bool ConcurrencySafe = false; + + using ChunkBounds = capptr::bounds::Arena; + 
static_assert( + stl::is_same_v); + + constexpr Type() = default; + + /** + * Allocate a range of the given size. + * Size must be >= MIN_CHUNK_SIZE. + * Returns a naturally-aligned block of at least the requested size. + */ + NOINLINE capptr::Arena alloc_range(size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + + // Bypass for sizes at or above our maximum. + if (size >= bits::mask_bits(MAX_SIZE_BITS)) + { + if (ParentRange::Aligned) + return parent.alloc_range(size); + return nullptr; + } + + // Round up to a valid sizeclass for the bitmap search. + size_t n_chunks = size / MIN_CHUNK_SIZE; + size_t sc_chunks = BC::round_up_sizeclass(n_chunks); + if (sc_chunks == 0) + sc_chunks = 1; + size_t sc_bytes = sc_chunks * MIN_CHUNK_SIZE; + + // Required alignment: natural alignment for the rounded-up sizeclass. + size_t sc_align = BC::natural_alignment(sc_chunks) * MIN_CHUNK_SIZE; + + auto result = bc.remove_block(sc_bytes); + if (result.addr != 0) + { + SNMALLOC_ASSERT(result.size >= sc_bytes); + SNMALLOC_ASSERT(result.size % MIN_CHUNK_SIZE == 0); + + // Carve: find aligned address within the block. + address_t aligned_addr = bits::align_up(result.addr, sc_align); + SNMALLOC_ASSERT(aligned_addr + size <= result.addr + result.size); + + size_t prefix = aligned_addr - result.addr; + size_t suffix = result.addr + result.size - aligned_addr - size; + + // Return prefix and suffix remainders to the free pool. + // No boundary bit needed: this carving happens under the lock, + // and subsequent frees will correctly coalesce. + if (prefix > 0) + bc.add_fresh_range(result.addr, prefix); + if (suffix > 0) + bc.add_fresh_range(aligned_addr + size, suffix); + + SNMALLOC_ASSERT((aligned_addr % sc_align) == 0); + return capptr::Arena::unsafe_from( + reinterpret_cast(aligned_addr)); + } + + return refill(size, sc_align); + } + + /** + * Deallocate a range of memory. + * Size must be >= MIN_CHUNK_SIZE. 
+ */ + NOINLINE void dealloc_range(capptr::Arena base, size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + SNMALLOC_ASSERT( + (base.unsafe_uintptr() % + (BC::natural_alignment(size / MIN_CHUNK_SIZE) * MIN_CHUNK_SIZE)) == + 0); + + if constexpr (MAX_SIZE_BITS != (bits::BITS - 1)) + { + if (size >= bits::mask_bits(MAX_SIZE_BITS)) + { + parent.dealloc_range(base, size); + return; + } + } + + bc.add_block(base.unsafe_uintptr(), size); + } + }; + }; +} // namespace snmalloc diff --git a/src/snmalloc/backend_helpers/decayrange.h b/src/snmalloc/backend_helpers/decayrange.h new file mode 100644 index 000000000..22cd10ba2 --- /dev/null +++ b/src/snmalloc/backend_helpers/decayrange.h @@ -0,0 +1,433 @@ +#pragma once + +#include "../ds/ds.h" +#include "../mem/mem.h" +#include "empty_range.h" +#include "largebuddyrange.h" +#include "range_helpers.h" + +namespace snmalloc +{ + /** + * Intrusive singly-linked list using pagemap entries for storage. + * + * This uses BuddyChunkRep's pagemap entry access (direction=false, i.e. + * Word::Two) to store the "next" pointer for each node. 
+ */ + template + class DecayList + { + using Rep = BuddyChunkRep; + + uintptr_t head = 0; + + DecayList(uintptr_t head) : head(head) {} + + public: + constexpr DecayList() = default; + + [[nodiscard]] bool is_empty() const + { + return head == 0; + } + + DecayList get_next() + { + SNMALLOC_ASSERT(!is_empty()); + auto next_field = Rep::ref(false, head); + auto next = Rep::get(next_field); + return {next}; + } + + capptr::Arena get_capability() + { + return capptr::Arena::unsafe_from(reinterpret_cast(head)); + } + + DecayList cons(capptr::Arena new_head_cap) + { + auto new_head = new_head_cap.unsafe_uintptr(); + auto field = Rep::ref(false, new_head); + Rep::set(field, head); + return {new_head}; + } + + template + void forall(F f) + { + auto curr = *this; + while (!curr.is_empty()) + { + auto next = curr.get_next(); + f(curr.get_capability()); + curr = next; + } + } + }; + + /** + * Concurrent stack for caching deallocated ranges. + * + * Supports the following concurrency pattern: + * (push|pop)* || pop_all* || ... || pop_all* + * + * That is, a single thread can do push and pop, and other threads + * can do pop_all. pop_all returns all of the stack if it doesn't + * race, or empty if it does. + * + * The primary use case is single-threaded access, where other threads + * can attempt to drain all values (via the timer callback). 
+ */ + template + class DecayStack + { + static constexpr auto empty = DecayList{}; + + alignas(CACHELINE_SIZE) stl::Atomic> stack{}; + + DecayList take() + { + if (stack.load(stl::memory_order_relaxed).is_empty()) + return empty; + return stack.exchange(empty, stl::memory_order_acquire); + } + + void replace(DecayList new_head) + { + SNMALLOC_ASSERT(stack.load().is_empty()); + stack.store(new_head, stl::memory_order_release); + } + + public: + constexpr DecayStack() = default; + + void push(capptr::Arena new_head_cap) + { + auto old_head = take(); + auto new_head = old_head.cons(new_head_cap); + replace(new_head); + } + + capptr::Arena pop() + { + auto old_head = take(); + if (old_head.is_empty()) + return nullptr; + + auto next = old_head.get_next(); + replace(next); + + return old_head.get_capability(); + } + + DecayList pop_all() + { + return take(); + } + }; + + /** + * A range that provides temporal caching of deallocated ranges. + * + * Instead of immediately releasing deallocated memory back to the parent + * range (which would decommit it), this range caches it locally and + * uses PAL timers to gradually release it. This avoids expensive + * repeated decommit/recommit cycles for transient allocation patterns + * (e.g. repeatedly allocating and deallocating ~800KB objects). + * + * The range uses an epoch-based rotation scheme: + * - Deallocated ranges are placed in the current epoch's stack + * - A timer periodically advances the epoch + * - When the epoch advances, the oldest epoch's entries are flushed + * to the parent range + * + * The parent range MUST be ConcurrencySafe, as the timer callback may + * flush entries from a different thread context. 
+ * + * PAL - Platform abstraction layer (for timer support) + * Pagemap - Used for storing linked list nodes in pagemap entries + */ + template + struct DecayRange + { + template> + class Type : public ContainsParent + { + using ContainsParent::parent; + + public: + static constexpr bool Aligned = ParentRange::Aligned; + + static constexpr bool ConcurrencySafe = false; + + using ChunkBounds = typename ParentRange::ChunkBounds; + + private: + /** + * Maximum chunk size we cache (4 MiB). + * Larger allocations bypass the cache and go directly to/from parent. + */ + static constexpr size_t MAX_CACHEABLE_SIZE = bits::one_at_bit(22); + + /** + * Number of slab sizes that can be cached. + * Covers all fine-grained exp-mant size classes from 1 chunk + * (MIN_CHUNK_SIZE) up to MAX_CACHEABLE_SIZE. + */ + static constexpr size_t NUM_SLAB_SIZES = + bits::to_exp_mant_const( + MAX_CACHEABLE_SIZE / MIN_CHUNK_SIZE) + + 1; + + /** + * Convert a byte size to the decay cache sizeclass index. + * Size must be a multiple of MIN_CHUNK_SIZE. + */ + static constexpr size_t to_sizeclass(size_t size) + { + return bits::to_exp_mant_const( + size / MIN_CHUNK_SIZE); + } + + /** + * Convert a decay cache sizeclass index back to a byte size. + */ + static constexpr size_t from_sizeclass(size_t sc) + { + return bits::from_exp_mant(sc) * MIN_CHUNK_SIZE; + } + + /** + * Number of epoch slots for cached ranges. + * + * Ranges not used within (NUM_EPOCHS - 1) timer periods will be + * released to the parent. E.g., with period=500ms and NUM_EPOCHS=4, + * memory not reused within 1500-2000ms will be released. + * + * Must be a power of 2. + */ + static constexpr size_t NUM_EPOCHS = 4; + static_assert(bits::is_pow2(NUM_EPOCHS), "NUM_EPOCHS must be power of 2"); + + /** + * Per-sizeclass, per-epoch stacks of cached ranges. + */ + ModArray>> + chunk_stack; + + /** + * Current epoch index. + */ + static inline stl::Atomic epoch{0}; + + /** + * Flag to ensure one-shot timer registration with the PAL.
+ */ + static inline stl::AtomicBool registered_timer{false}; + + /** + * Flag indicating this instance has been registered in the global list. + */ + stl::AtomicBool registered_local{false}; + + /** + * Global list of all activated DecayRange instances. + * Used by the timer to iterate and flush old entries. + */ + static inline stl::Atomic all_local{nullptr}; + + /** + * Next pointer for the global intrusive list. + */ + Type* all_local_next{nullptr}; + + /** + * Flush the oldest epoch's entries across all instances + * and advance the epoch. + */ + static void handle_decay_tick() + { + static_assert( + ParentRange::ConcurrencySafe, + "Parent range must be concurrency safe, as dealloc_range is called " + "from the timer callback on a potentially different thread."); + + auto new_epoch = + (epoch.load(stl::memory_order_relaxed) + 1) % NUM_EPOCHS; + + // Flush the epoch that is about to become current + // across all registered instances. + auto curr = all_local.load(stl::memory_order_acquire); + while (curr != nullptr) + { + for (size_t sc = 0; sc < NUM_SLAB_SIZES; sc++) + { + auto old_stack = curr->chunk_stack[sc][new_epoch].pop_all(); + + old_stack.forall([curr, sc](auto cap) { + size_t size = from_sizeclass(sc); +#ifdef SNMALLOC_TRACING + message<1024>( + "DecayRange::tick flushing {} size {} to parent", + cap.unsafe_ptr(), + size); +#endif + curr->parent.dealloc_range(cap, size); + }); + } + curr = curr->all_local_next; + } + + // Advance the epoch + epoch.store(new_epoch, stl::memory_order_release); + } + + /** + * Timer callback object for periodic decay. + */ + class DecayMemoryTimerObject : public PalTimerObject + { + static void process(PalTimerObject*) + { +#ifdef SNMALLOC_TRACING + message<1024>("DecayRange::handle_decay_tick timer"); +#endif + handle_decay_tick(); + } + + /// Timer fires every 500ms. 
+ static constexpr size_t PERIOD = 500; + + public: + constexpr DecayMemoryTimerObject() : PalTimerObject(&process, PERIOD) {} + }; + + static inline DecayMemoryTimerObject timer_object; + + void ensure_registered() + { + // Register the global timer if this is the first instance. + if ( + !registered_timer.load(stl::memory_order_relaxed) && + !registered_timer.exchange(true, stl::memory_order_acq_rel)) + { + PAL::register_timer(&timer_object); + } + + // Register this instance in the global list. + if ( + !registered_local.load(stl::memory_order_relaxed) && + !registered_local.exchange(true, stl::memory_order_acq_rel)) + { + auto* head = all_local.load(stl::memory_order_relaxed); + do + { + all_local_next = head; + } while (!all_local.compare_exchange_weak( + head, this, stl::memory_order_release, stl::memory_order_relaxed)); + } + } + + public: + constexpr Type() = default; + + CapPtr alloc_range(size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + + auto slab_sizeclass = to_sizeclass(size); + + // Bypass cache for sizes beyond what we track. + if (slab_sizeclass >= NUM_SLAB_SIZES) + return parent.alloc_range(size); + + if constexpr (pal_supports) + { + // Try local cache across all epochs, starting from current. + auto current_epoch = epoch.load(stl::memory_order_relaxed); + for (size_t e = 0; e < NUM_EPOCHS; e++) + { + auto p = + chunk_stack[slab_sizeclass][(current_epoch - e) % NUM_EPOCHS] + .pop(); + + if (p != nullptr) + { +#ifdef SNMALLOC_TRACING + message<1024>( + "DecayRange::alloc_range returning {} from local cache", + p.unsafe_ptr()); +#endif + return p; + } + } + } + + // Try parent. If OOM, flush decay caches and retry. 
+ CapPtr result; + for (size_t i = NUM_EPOCHS; i > 0; i--) + { + result = parent.alloc_range(size); + if (result != nullptr) + { +#ifdef SNMALLOC_TRACING + message<1024>( + "DecayRange::alloc_range returning {} from parent", + result.unsafe_ptr()); +#endif + return result; + } + + // OOM: force-flush decay caches to free memory. +#ifdef SNMALLOC_TRACING + message<1024>("DecayRange::alloc_range OOM, flushing decay caches"); +#endif + handle_decay_tick(); + } + + // Final attempt after flushing all epochs. + result = parent.alloc_range(size); +#ifdef SNMALLOC_TRACING + message<1024>( + "DecayRange::alloc_range final attempt: {}", result.unsafe_ptr()); +#endif + return result; + } + + void dealloc_range(CapPtr base, size_t size) + { + SNMALLOC_ASSERT(size >= MIN_CHUNK_SIZE); + SNMALLOC_ASSERT(size % MIN_CHUNK_SIZE == 0); + + auto slab_sizeclass = to_sizeclass(size); + + // Bypass cache for sizes beyond what we track. + if (slab_sizeclass >= NUM_SLAB_SIZES) + { + parent.dealloc_range(base, size); + return; + } + + if constexpr (pal_supports) + { + ensure_registered(); + +#ifdef SNMALLOC_TRACING + message<1024>( + "DecayRange::dealloc_range caching {} size {}", + base.unsafe_ptr(), + size); +#endif + auto current_epoch = epoch.load(stl::memory_order_relaxed); + chunk_stack[slab_sizeclass][current_epoch].push(base); + } + else + { + // No timer support: pass through directly. + parent.dealloc_range(base, size); + } + } + }; + }; +} // namespace snmalloc diff --git a/src/snmalloc/backend_helpers/pagemap.h b/src/snmalloc/backend_helpers/pagemap.h index 2d6ed44b0..b042fe510 100644 --- a/src/snmalloc/backend_helpers/pagemap.h +++ b/src/snmalloc/backend_helpers/pagemap.h @@ -65,6 +65,34 @@ namespace snmalloc } } + /** + * Set the metadata associated with a chunk, computing per-chunk + * offset bits for non-pow2 large allocations in the same pass. + * + * nat_align is the natural alignment of the size class (slab_mask + 1). 
+ * Each entry's offset is its distance from the allocation start, + * measured in nat_align units. + */ + SNMALLOC_FAST_PATH + static void + set_metaentry(address_t p, size_t size, const Entry& t, size_t nat_align) + { + size_t chunks_per_nat = nat_align / MIN_CHUNK_SIZE; + SNMALLOC_ASSERT_MSG( + ((size / MIN_CHUNK_SIZE) / chunks_per_nat) <= 7, + "Offset {} exceeds 3-bit field for size {} nat_align {}", + (size / MIN_CHUNK_SIZE) / chunks_per_nat, + size, + nat_align); + size_t chunk_index = 0; + for (address_t a = p; a < p + size; a += MIN_CHUNK_SIZE) + { + concretePagemap.template get_mut(a).assign_with_offset( + t, static_cast(chunk_index / chunks_per_nat)); + chunk_index++; + } + } + /** * Get the metadata associated with a chunk. * diff --git a/src/snmalloc/ds_core/bits.h b/src/snmalloc/ds_core/bits.h index 3391e70f7..9ce17045b 100644 --- a/src/snmalloc/ds_core/bits.h +++ b/src/snmalloc/ds_core/bits.h @@ -352,6 +352,22 @@ namespace snmalloc return (e << MANTISSA_BITS) + m; } + template + inline SNMALLOC_FAST_PATH size_t to_exp_mant(size_t value) + { + constexpr size_t LEADING_BIT = one_at_bit(MANTISSA_BITS + LOW_BITS) >> 1; + constexpr size_t MANTISSA_MASK = mask_bits(MANTISSA_BITS); + + value = value - 1; + + size_t e = + bits::BITS - MANTISSA_BITS - LOW_BITS - clz(value | LEADING_BIT); + size_t b = (e == 0) ? 
0 : 1; + size_t m = (value >> (LOW_BITS + e - b)) & MANTISSA_MASK; + + return (e << MANTISSA_BITS) + m; + } + template constexpr size_t from_exp_mant(size_t m_e) { diff --git a/src/snmalloc/global/globalalloc.h b/src/snmalloc/global/globalalloc.h index 110051e2f..2f8c2a9d0 100644 --- a/src/snmalloc/global/globalalloc.h +++ b/src/snmalloc/global/globalalloc.h @@ -140,7 +140,8 @@ namespace snmalloc const auto& entry = Config_::Backend::template get_metaentry(p); auto sizeclass = entry.get_sizeclass(); - return snmalloc::remaining_bytes(sizeclass, p); + auto offset = entry.get_offset(); + return snmalloc::remaining_bytes(sizeclass, p, offset); } /** @@ -155,7 +156,8 @@ namespace snmalloc const auto& entry = Config_::Backend::template get_metaentry(p); auto sizeclass = entry.get_sizeclass(); - return snmalloc::index_in_object(sizeclass, p); + auto offset = entry.get_offset(); + return snmalloc::index_in_object(sizeclass, p, offset); } enum Boundary @@ -224,7 +226,8 @@ namespace snmalloc { const auto& entry = Config_::Backend::get_metaentry(address_cast(p)); - size_t index = slab_index(entry.get_sizeclass(), address_cast(p)); + size_t index = + slab_index(entry.get_sizeclass(), address_cast(p), entry.get_offset()); auto* meta_slab = entry.get_slab_metadata(); @@ -253,7 +256,8 @@ namespace snmalloc const auto& entry = Config_::Backend::template get_metaentry(address_cast(p)); - size_t index = slab_index(entry.get_sizeclass(), address_cast(p)); + size_t index = + slab_index(entry.get_sizeclass(), address_cast(p), entry.get_offset()); auto* meta_slab = entry.get_slab_metadata(); diff --git a/src/snmalloc/mem/corealloc.h b/src/snmalloc/mem/corealloc.h index 5ec7bf1f3..ac60d81a0 100644 --- a/src/snmalloc/mem/corealloc.h +++ b/src/snmalloc/mem/corealloc.h @@ -3,6 +3,7 @@ #include "../ds/ds.h" #include "check_init.h" #include "freelist.h" +#include "largecache.h" #include "metadata.h" #include "pool.h" #include "remotecache.h" @@ -181,6 +182,13 @@ namespace snmalloc */ Ticker 
ticker; + /** + * Cache for large object allocations. + * Avoids pagemap manipulation and backend buddy tree operations + * for recently freed large allocations. + */ + LargeObjectCache large_object_cache; + /** * The message queue needs to be accessible from other threads * @@ -534,7 +542,8 @@ namespace snmalloc snmalloc_check_client( mitigations(sanity_checks), - is_start_of_object(entry.get_sizeclass(), address_cast(msg)), + is_start_of_object( + entry.get_sizeclass(), address_cast(msg), entry.get_offset()), "Not deallocating start of an object"); size_t objsize = sizeclass_full_to_size(entry.get_sizeclass()); @@ -695,14 +704,86 @@ namespace snmalloc return Conts::success(result, size, true); } + auto chunk_size = large_size_to_chunk_size(size); + auto sizeclass = size_to_sizeclass_full(size); + + // Check the frontend large object cache first. + // This avoids all pagemap and backend manipulation. + auto* cached_meta = self->large_object_cache.try_alloc( + sizeclass, [self](BackendSlabMetadata* fmeta) { + self->flush_large_cache_entry(fmeta); + }); + if (cached_meta != nullptr) + { + // Cache hit: pagemap still valid, recover address from meta. + // The cache is indexed by fine-grained sizeclass, so the + // pagemap already has the correct sizeclass for this block. + auto slab_addr = + cached_meta->get_slab_interior(freelist::Object::key_root); + cached_meta->initialise_large( + slab_addr, freelist::Object::key_root); + self->laden.insert(cached_meta); + + // Reconstruct the capptr from the address. + auto p = Config::Backend::capptr_rederive_alloc( + capptr::Alloc::unsafe_from( + reinterpret_cast(slab_addr)), + chunk_size); + // Verify the allocation starts at the correct offset. + // On cache hit, the pagemap offset is still valid. + SNMALLOC_ASSERT(is_start_of_object( + sizeclass, + slab_addr, + Config::Backend::get_metaentry(slab_addr).get_offset())); + return Conts::success(capptr_reveal(p), size); + } + + // Cache miss: go to backend. 
// Grab slab of correct size // Set remote as large allocator remote. auto [chunk, meta] = Config::Backend::alloc_chunk( self->get_backend_local_state(), - large_size_to_chunk_size(size), - PagemapEntry::encode( - self->public_state(), size_to_sizeclass_full(size)), - size_to_sizeclass_full(size)); + chunk_size, + PagemapEntry::encode(self->public_state(), sizeclass), + sizeclass); + + // If backend OOM, try staged cache flush and retry. + // First flush smaller sizes — they coalesce upward in the + // buddy. If that's not enough, flush one larger entry — + // the buddy can split it. + if (meta == nullptr) + { + auto flush_fn = [self](BackendSlabMetadata* fmeta) { + self->flush_large_cache_entry(fmeta); + }; + + // Stage 1: flush all smaller sizeclasses. + if (self->large_object_cache.flush_smaller(sizeclass, flush_fn)) + { + auto retry = Config::Backend::alloc_chunk( + self->get_backend_local_state(), + chunk_size, + PagemapEntry::encode(self->public_state(), sizeclass), + sizeclass); + chunk = retry.first; + meta = retry.second; + } + + // Stage 2: flush a single larger-or-equal entry. + if ( + meta == nullptr && + self->large_object_cache.flush_one_larger( + sizeclass, flush_fn)) + { + auto retry = Config::Backend::alloc_chunk( + self->get_backend_local_state(), + chunk_size, + PagemapEntry::encode(self->public_state(), sizeclass), + sizeclass); + chunk = retry.first; + meta = retry.second; + } + } #ifdef SNMALLOC_TRACING message<1024>( @@ -1057,7 +1138,8 @@ namespace snmalloc snmalloc_check_client( mitigations(sanity_checks), - is_start_of_object(entry.get_sizeclass(), address_cast(p)), + is_start_of_object( + entry.get_sizeclass(), address_cast(p), entry.get_offset()), "Not deallocating start of an object"); auto cp = p.as_static>(); @@ -1086,6 +1168,7 @@ namespace snmalloc const PagemapEntry& entry, BackendSlabMetadata* meta) noexcept { + UNUSED(p); // TODO: Handle message queue on this path? 
if (meta->is_large()) @@ -1095,20 +1178,24 @@ namespace snmalloc // XXX: because large objects have unique metadata associated with them, // the ring size here is one. We should probably assert that. - size_t entry_sizeclass = entry.get_sizeclass().as_large(); - size_t size = bits::one_at_bit(entry_sizeclass); - #ifdef SNMALLOC_TRACING + size_t size = sizeclass_full_to_size(entry.get_sizeclass()); message<1024>("Large deallocation: {}", size); -#else - UNUSED(size); #endif // Remove from set of fully used slabs. meta->node.remove(); - Config::Backend::dealloc_chunk( - get_backend_local_state(), *meta, p, size, entry.get_sizeclass()); + // Cache in the frontend large object cache. + // The meta's free_queue already holds the chunk address (from + // initialise_large), and the pagemap entry retains the sizeclass + // and remote allocator info. No data is stored in the freed object. + // Epoch sync happens internally; stale entries are flushed via the + // callback. + large_object_cache.cache( + meta, entry.get_sizeclass(), [this](BackendSlabMetadata* fmeta) { + flush_large_cache_entry(fmeta); + }); return; } @@ -1117,6 +1204,23 @@ namespace snmalloc dealloc_local_object_meta(entry, meta); } + /** + * Flush a single cached large object back to the backend. + * Recovers the chunk address from the metadata and size from the pagemap. + */ + NOINLINE void flush_large_cache_entry(BackendSlabMetadata* meta) + { + auto slab_addr = meta->get_slab_interior(freelist::Object::key_root); + const PagemapEntry& entry = Config::Backend::get_metaentry(slab_addr); + size_t chunk_size = sizeclass_full_to_size(entry.get_sizeclass()); + + auto p = + capptr::Alloc::unsafe_from(reinterpret_cast(slab_addr)); + + Config::Backend::dealloc_chunk( + get_backend_local_state(), *meta, p, chunk_size, entry.get_sizeclass()); + } + /** * Very slow path for object deallocation. 
* @@ -1427,6 +1531,10 @@ namespace snmalloc dealloc_local_slabs(sizeclass); } + // Flush the large object cache back to the backend. + large_object_cache.flush_all( + [this](BackendSlabMetadata* fmeta) { flush_large_cache_entry(fmeta); }); + if constexpr (mitigations(freelist_teardown_validate)) { laden.iterate( diff --git a/src/snmalloc/mem/largecache.h b/src/snmalloc/mem/largecache.h new file mode 100644 index 000000000..389276bd2 --- /dev/null +++ b/src/snmalloc/mem/largecache.h @@ -0,0 +1,421 @@ +#pragma once + +#include "../ds/ds.h" +#include "../pal/pal_ds.h" +#include "metadata.h" +#include "sizeclasstable.h" + +namespace snmalloc +{ + /** + * Frontend cache for large object allocations. + * + * This cache sits in the per-thread Allocator and intercepts large + * alloc/dealloc before they reach the backend. By caching recently freed + * large objects, we avoid: + * - Pagemap writes on dealloc (clearing N entries) and alloc (setting N + * entries) + * - Metadata allocation/deallocation + * - Buddy allocator tree operations + * - Decommit/recommit syscalls (if DecayRange is also in the pipeline) + * + * The cache uses the slab metadata's SeqSet node to link cached entries, + * storing no data inside the freed object itself. The chunk address is + * recovered from the metadata's free_queue, and the chunk size from the + * pagemap entry's sizeclass. + * + * Epoch rotation is driven by a PAL timer (DecayMemoryTimerObject). + * A global epoch counter is advanced periodically by the timer. Each + * cache instance tracks the last epoch it observed and self-flushes + * stale epochs on its next operation. This means no concurrent access + * to the per-thread SeqSets is needed. + * + * Each sizeclass has an adaptive budget that bounds how many items can + * be cached. The budget starts at 1 and adjusts on each epoch rotation: + * - If stale entries were flushed (surplus), decrease budget. 
+ * - If no entries were flushed and the cache was actively drained by + * allocations (not just empty from startup), increase budget. + * This allows the cache to grow to match the working set while shrinking + * when the workload subsides. + * + * Template parameter Config provides Backend, PagemapEntry, Pal, etc. + */ + template + class LargeObjectCache + { + using PAL = typename Config::Pal; + using BackendSlabMetadata = typename Config::Backend::SlabMetadata; + using PagemapEntry = typename Config::PagemapEntry; + + /** + * Maximum chunk size we cache (4 MiB). + * Larger allocations bypass the cache and go directly to/from backend. + */ + static constexpr size_t MAX_CACHEABLE_SIZE = bits::one_at_bit(22); + + /** + * Number of chunk sizeclasses we actually cache. + * Covers all fine-grained large classes up to MAX_CACHEABLE_SIZE. + * This is the large class index of MAX_CACHEABLE_SIZE, plus one. + */ + static constexpr size_t NUM_SIZECLASSES = + size_to_large_class_index_const(MAX_CACHEABLE_SIZE) + 1; + + /** + * Number of epoch slots for cached ranges. + * Must be a power of 2. + */ + static constexpr size_t NUM_EPOCHS = 4; + static_assert(bits::is_pow2(NUM_EPOCHS)); + + /** + * Global epoch counter, advanced by the timer callback. + * All LargeObjectCache instances read this to detect when epochs + * have advanced and stale entries need flushing. + */ + static inline stl::Atomic global_epoch{0}; + + /** + * Timer callback that advances the global epoch. + */ + class DecayMemoryTimerObject : public PalTimerObject + { + static void process(PalTimerObject*) + { + auto e = global_epoch.load(stl::memory_order_relaxed); + global_epoch.store(e + 1, stl::memory_order_release); + } + + /// Timer fires every 500ms. + static constexpr size_t PERIOD = 500; + + public: + constexpr DecayMemoryTimerObject() : PalTimerObject(&process, PERIOD) {} + }; + + static inline DecayMemoryTimerObject timer_object; + + /** + * Flag to ensure one-shot timer registration. 
+ */ + static inline stl::AtomicBool registered_timer{false}; + + /** + * Per-sizeclass adaptive budget state. + */ + struct SizeclassState + { + /// Maximum number of items allowed in the cache for this sizeclass. + /// Starts at 1 so the first deallocation is always cached. + size_t budget{1}; + + /// Current number of cached items across all epoch slots. + size_t count{0}; + + /// Number of cache misses since last cache insert. + /// Reset to 0 each time we successfully add to the cache. + size_t misses{0}; + + /// Peak value of misses this epoch. + /// This is what we use for budget growth - it captures the maximum + /// "depth" of consecutive misses, not cumulative misses. + size_t peak_misses{0}; + }; + + /** + * Per-sizeclass budget tracking. + */ + ModArray sc_state; + + /** + * Per-sizeclass, per-epoch SeqSets of cached metadata. + * Indexed as lists[sizeclass_index][epoch % NUM_EPOCHS]. + */ + ModArray>> + lists; + + /** + * The epoch this instance last synced to. + * Used to detect when new epochs have passed and old ones need flushing. + */ + size_t local_epoch{0}; + + /** + * Convert a sizeclass_t to the cache index. + * Uses sizeclass_t::as_large() to extract the sequential large + * class index directly, avoiding any size-to-index computation. + */ + static size_t to_sizeclass(sizeclass_t sc) + { + return sc.as_large(); + } + + /** + * Check if a sizeclass is cacheable. + * Only sizeclasses whose large class index fits within + * NUM_SIZECLASSES (up to MAX_CACHEABLE_SIZE) are cached; + * larger allocations bypass the cache entirely. + */ + static bool is_cacheable(sizeclass_t sc) + { + return sc.as_large() < NUM_SIZECLASSES; + } + + /** + * Register the global timer if not already done. 
+ */ + void ensure_registered() + { + if constexpr (pal_supports) + { + if ( + !registered_timer.load(stl::memory_order_relaxed) && + !registered_timer.exchange(true, stl::memory_order_acq_rel)) + { + PAL::register_timer(&timer_object); + } + } + } + + /** + * Catch up to the global epoch, flushing any stale epochs and + * adjusting per-sizeclass budgets. + */ + template + void sync_epoch(FlushFn&& flush_fn) + { + if constexpr (pal_supports) + { + auto current = global_epoch.load(stl::memory_order_acquire); + + auto behind = current - local_epoch; + if (behind == 0) + return; + + if (behind > NUM_EPOCHS) + behind = NUM_EPOCHS; + + // Snapshot counts before flushing. + size_t before_count[NUM_SIZECLASSES]; + for (size_t sc = 0; sc < NUM_SIZECLASSES; sc++) + before_count[sc] = sc_state[sc].count; + + // Flush stale epoch slots. + for (size_t i = 0; i < behind; i++) + { + auto epoch_to_flush = (local_epoch + 1 + i) % NUM_EPOCHS; + flush_epoch_slot(epoch_to_flush, flush_fn); + } + + // Adjust budgets based on what happened. + // Net out misses against flushed items to determine direction. + for (size_t sc = 0; sc < NUM_SIZECLASSES; sc++) + { + auto& state = sc_state[sc]; + size_t flushed = before_count[sc] - state.count; + + if (state.peak_misses > flushed) + { + // More misses than surplus: grow budget by the difference. + state.budget += state.peak_misses - flushed; + } + else if (flushed > state.peak_misses) + { + // More surplus than misses: shrink budget smoothly. + state.budget -= (flushed - state.peak_misses) / 2; + } + // If equal, budget stays the same. + + state.misses = 0; + state.peak_misses = 0; + } + + local_epoch = current; + } + } + + /** + * Flush all entries in a single epoch slot. + * Decrements per-sizeclass counts. 
+ */ + template + void flush_epoch_slot(size_t epoch_slot, FlushFn&& flush_fn) + { + for (size_t sc = 0; sc < NUM_SIZECLASSES; sc++) + { + auto& list = lists[sc][epoch_slot]; + while (!list.is_empty()) + { + sc_state[sc].count--; + flush_fn(list.pop_front()); + } + } + } + + public: + constexpr LargeObjectCache() = default; + + /** + * Try to satisfy a large allocation from the cache. + * + * @param sc The sizeclass of the allocation. + * @param flush_fn Callback to flush stale entries during epoch sync. + * @return Metadata for a cached chunk, or nullptr on cache miss. + */ + template + BackendSlabMetadata* try_alloc(sizeclass_t sc, FlushFn&& flush_fn) + { + // Don't cache very large allocations. + if (!is_cacheable(sc)) + return nullptr; + + sync_epoch(flush_fn); + + auto idx = to_sizeclass(sc); + auto current = local_epoch; + + // Check current epoch first, then older ones. + for (size_t age = 0; age < NUM_EPOCHS; age++) + { + auto& list = lists[idx][(current - age) % NUM_EPOCHS]; + if (!list.is_empty()) + { + sc_state[idx].count--; + return list.pop_front(); + } + } + + // Cache miss - track for budget growth. + sc_state[idx].misses++; + if (sc_state[idx].misses > sc_state[idx].peak_misses) + sc_state[idx].peak_misses = sc_state[idx].misses; + return nullptr; + } + + /** + * Cache a large deallocation. + * + * If the sizeclass is at its budget, the entry is flushed immediately + * instead of being cached. + * + * @param meta The slab metadata for the chunk. + * @param sc The sizeclass of the allocation. + * @param flush_fn Callback to flush stale entries during epoch sync, + * and to flush this entry if over budget. + */ + template + void cache(BackendSlabMetadata* meta, sizeclass_t sc, FlushFn&& flush_fn) + { + // Don't cache very large allocations - flush directly to backend. 
+ if (!is_cacheable(sc)) + { + flush_fn(meta); + return; + } + + ensure_registered(); + sync_epoch(flush_fn); + + auto idx = to_sizeclass(sc); + + if (sc_state[idx].count >= sc_state[idx].budget) + { + // Over budget: flush immediately rather than caching. + flush_fn(meta); + return; + } + + sc_state[idx].count++; + sc_state[idx].misses = 0; // Reset miss counter on successful cache. + lists[idx][local_epoch % NUM_EPOCHS].insert(meta); + } + + /** + * Flush all cached entries back to the backend. + * Called during allocator teardown/flush. + */ + template + void flush_all(FlushFn&& flush_fn) + { + for (size_t e = 0; e < NUM_EPOCHS; e++) + { + flush_epoch_slot(e, flush_fn); + } + } + + /** + * Flush all cached entries with sizeclass strictly smaller than + * the given sizeclass. These can coalesce in the buddy allocator + * to form the needed size. + * + * @return true if any entries were flushed. + */ + template + bool flush_smaller(sizeclass_t sc, FlushFn&& flush_fn) + { + // If not cacheable, all cached entries are smaller. + size_t target_idx = is_cacheable(sc) ? to_sizeclass(sc) : NUM_SIZECLASSES; + bool flushed = false; + for (size_t i = 0; i < target_idx; i++) + { + for (size_t e = 0; e < NUM_EPOCHS; e++) + { + auto& list = lists[i][e]; + while (!list.is_empty()) + { + sc_state[i].count--; + flush_fn(list.pop_front()); + flushed = true; + } + } + } + return flushed; + } + + /** + * Flush a single cached entry with sizeclass >= the given sizeclass. + * The buddy allocator can split this to satisfy the request. + * + * @return true if an entry was flushed. + */ + template + bool flush_one_larger(sizeclass_t sc, FlushFn&& flush_fn) + { + // Nothing in cache can satisfy requests larger than our max.
+ if (!is_cacheable(sc)) + return false; + + auto target_idx = to_sizeclass(sc); + for (size_t i = target_idx; i < NUM_SIZECLASSES; i++) + { + for (size_t e = 0; e < NUM_EPOCHS; e++) + { + auto& list = lists[i][e]; + if (!list.is_empty()) + { + sc_state[i].count--; + flush_fn(list.pop_front()); + return true; + } + } + } + return false; + } + + /** + * Check if the cache is completely empty. + */ + bool is_empty() const + { + for (size_t sc = 0; sc < NUM_SIZECLASSES; sc++) + { + for (size_t e = 0; e < NUM_EPOCHS; e++) + { + if (!lists[sc][e].is_empty()) + return false; + } + } + return true; + } + }; +} // namespace snmalloc diff --git a/src/snmalloc/mem/metadata.h b/src/snmalloc/mem/metadata.h index 0284e8a5d..5de40af90 100644 --- a/src/snmalloc/mem/metadata.h +++ b/src/snmalloc/mem/metadata.h @@ -56,6 +56,41 @@ namespace snmalloc */ static constexpr address_t META_BOUNDARY_BIT = 1 << 0; + /** + * Bit used by the coalescing range (BitmapCoalesce / + * BitmapCoalesceRange) to mark entries that belong to its free pool. + * This allows the coalescing algorithm to distinguish its own free + * blocks from other backend-owned entries (e.g., those held by + * DecayRange or other range components that also call + * claim_for_backend). Cleared by claim_for_backend(). + */ + static constexpr address_t META_COALESCE_FREE_BIT = 1 << 1; + + /** + * 3-bit field in the meta word (bits 2–4) that stores the distance + * of this pagemap entry from the start of a non-pow2 large + * allocation, measured in natural-alignment units. For pow2 and + * small allocations this is always 0. + * + * Given slab_mask = natural_alignment(size) - 1: + * nat_base = addr & ~slab_mask + * nat_align = slab_mask + 1 + * alloc_start = nat_base - nat_align * offset_bits + * + * Available because SlabMetadata pointers are at least 32-byte + * aligned, so bits 0–4 of the meta word are free for tagging. 
+ */ + static constexpr unsigned META_OFFSET_SHIFT = 2; + static constexpr address_t META_OFFSET_MASK = address_t(7) + << META_OFFSET_SHIFT; + + /** + * Mask covering all reserved low bits in the meta word. + * get_slab_metadata() masks these off to recover the pointer. + */ + static constexpr address_t META_LOW_BITS_MASK = + META_BOUNDARY_BIT | META_COALESCE_FREE_BIT | META_OFFSET_MASK; + /** * The bit above the sizeclass is always zero unless this is used * by the backend to represent another datastructure such as the buddy @@ -158,7 +193,8 @@ namespace snmalloc /** * Explicit assignment operator, copies the data preserving the boundary bit - * in the target if it is set. + * in the target if it is set. Other reserved bits (e.g. the coalesce + * free marker) are taken from the source, not preserved from the target. */ MetaEntryBase& operator=(const MetaEntryBase& other) { @@ -169,6 +205,19 @@ namespace snmalloc return *this; } + /** + * Combined assign-and-set-offset: copies the entry from `other` + * (preserving the target's boundary bit) and sets the offset field + * in a single operation, avoiding a second read-modify-write. + */ + void assign_with_offset(const MetaEntryBase& other, uint8_t offset) + { + meta = (other.meta & ~(META_BOUNDARY_BIT | META_OFFSET_MASK)) | + address_cast(meta & META_BOUNDARY_BIT) | + (static_cast(offset) << META_OFFSET_SHIFT); + remote_and_sizeclass = other.remote_and_sizeclass; + } + /** * On some platforms, allocations originating from the OS may not be * combined. The boundary bit indicates whether this is meta entry @@ -193,6 +242,48 @@ namespace snmalloc ///@} + /** + * Mark this entry as part of a coalescing-range free block. + * Set on the first and last chunk entries of each free block. 
+ * @{ + */ + void set_coalesce_free() + { + meta |= META_COALESCE_FREE_BIT; + } + + [[nodiscard]] bool is_coalesce_free() const + { + return meta & META_COALESCE_FREE_BIT; + } + + void clear_coalesce_free() + { + meta &= ~META_COALESCE_FREE_BIT; + } + + ///@} + + /** + * Store the 3-bit placement offset for a non-pow2 large allocation. + * For pow2 and small allocations, pass 0. + * @{ + */ + void set_offset(uint8_t offset) + { + SNMALLOC_ASSERT(offset <= 7); + meta = (meta & ~META_OFFSET_MASK) | + (static_cast(offset) << META_OFFSET_SHIFT); + } + + [[nodiscard]] uint8_t get_offset() const + { + return static_cast( + (meta & META_OFFSET_MASK) >> META_OFFSET_SHIFT); + } + + ///@} + /** * Returns the remote. * @@ -772,7 +863,7 @@ namespace snmalloc [[nodiscard]] SNMALLOC_FAST_PATH SlabMetadata* get_slab_metadata() const { SNMALLOC_ASSERT(!is_backend_owned()); - return unsafe_from_uintptr(meta & ~META_BOUNDARY_BIT); + return unsafe_from_uintptr(meta & ~META_LOW_BITS_MASK); } }; diff --git a/src/snmalloc/mem/sizeclasstable.h b/src/snmalloc/mem/sizeclasstable.h index d31a1f692..a48253c24 100644 --- a/src/snmalloc/mem/sizeclasstable.h +++ b/src/snmalloc/mem/sizeclasstable.h @@ -34,10 +34,34 @@ namespace snmalloc constexpr size_t NUM_SMALL_SIZECLASSES = size_to_sizeclass_const(MAX_SMALL_SIZECLASS_SIZE) + 1; - // Large classes range from [MAX_SMALL_SIZECLASS_SIZE, ADDRESS_SPACE). - constexpr size_t NUM_LARGE_CLASSES = + // Number of pow2-only large exponent levels (one per power of two). + constexpr size_t NUM_LARGE_CLASSES_POW2 = DefaultPal::address_bits - MAX_SMALL_SIZECLASS_BITS; + // Number of exponent levels that also get non-pow2 intermediate + // sub-classes (mantissa 1..2^B-1). Kept small enough that the + // total large class count fits in TAG_SIZECLASS_BITS without + // increasing REMOTE_MIN_ALIGN. 6 levels = cutoff at ~4 MiB. + static constexpr size_t NUM_LARGE_FINE_LEVELS = 6; + + // Sub-classes per fine-grained level. 
+ static constexpr size_t LARGE_SUBS_PER_LEVEL = + bits::one_at_bit(INTERMEDIATE_BITS); + + // The largest non-pow2 size class has S = (2^(B+1) - 1) * nat_align, + // spanning (2^(B+1) - 1) nat_align blocks. The last block's offset + // is (2^(B+1) - 2), which must fit in the 3-bit pagemap offset field + // (max value 7). For B=2 the max offset is 6, which fits. + static_assert( + (bits::one_at_bit(INTERMEDIATE_BITS + 1) - 2) <= 7, + "Non-pow2 large class offsets exceed 3-bit pagemap field capacity"); + + // Total large class count: fine-grained levels contribute 2^B classes + // each (including the pow2 at m=0), remaining levels contribute 1. + constexpr size_t NUM_LARGE_CLASSES = + NUM_LARGE_FINE_LEVELS * LARGE_SUBS_PER_LEVEL + + (NUM_LARGE_CLASSES_POW2 - NUM_LARGE_FINE_LEVELS); + // How many bits are required to represent either a large or a small // sizeclass. constexpr size_t TAG_SIZECLASS_BITS = bits::max( @@ -75,13 +99,18 @@ namespace snmalloc } /** - * Takes the number of leading zero bits from the actual large size-1. - * See size_to_sizeclass_full + * Creates a sizeclass_t from a sequential large class index. + * Indices 0..NUM_LARGE_CLASSES-1 enumerate all large classes: + * - Indices 0..(NUM_LARGE_FINE_LEVELS*LARGE_SUBS_PER_LEVEL-1) are + * fine-grained (including pow2 at mantissa 0 for each level). + * - The remaining indices are pow2-only classes. */ static constexpr sizeclass_t from_large_class(size_t large_class) { - SNMALLOC_ASSERT(large_class < TAG); - return {large_class}; + // +1 reserves raw value 0 as a sentinel for uninitialized pagemap + // entries, so that alloc_size(nullptr) returns 0. + SNMALLOC_ASSERT(large_class + 1 < TAG); + return {large_class + 1}; } static constexpr sizeclass_t from_raw(size_t raw) @@ -100,10 +129,14 @@ namespace snmalloc return value & (TAG - 1); } - constexpr chunksizeclass_t as_large() + /** + * Returns the sequential large class index. 
+ */ + constexpr size_t as_large() { SNMALLOC_ASSERT(!is_small()); - return bits::BITS - (value & (TAG - 1)); + // -1 undoes the +1 offset from from_large_class. + return (value & (TAG - 1)) - 1; } constexpr size_t raw() @@ -129,6 +162,101 @@ namespace snmalloc using sizeclass_compress_t = uint8_t; + /** + * The exp_mant index of the smallest large class in chunk units. + * For B=2 with MAX_SMALL=64K and MIN_CHUNK=16K: 4 chunks → em index 3. + */ + static constexpr size_t MIN_LARGE_EM = + bits::to_exp_mant_const( + MAX_SMALL_SIZECLASS_SIZE / MIN_CHUNK_SIZE); + + /** + * Number of large class indices in the fine-grained range. + * These levels have all 2^B sub-classes (m=0..3 for B=2). + */ + static constexpr size_t LARGE_FINE_TOTAL = + NUM_LARGE_FINE_LEVELS * LARGE_SUBS_PER_LEVEL; + + /** + * Convert a sequential large class index to the byte size of the class. + * + * Indices 0..LARGE_FINE_TOTAL-1 cover fine-grained levels: each maps + * directly to a consecutive exp_mant index starting from MIN_LARGE_EM. + * + * Indices >= LARGE_FINE_TOTAL cover pow2-only levels: each maps to + * the pow2 (m=0) entry at the corresponding exponent level. + */ + constexpr size_t large_class_index_to_size(size_t index) + { + size_t em_index; + if (index < LARGE_FINE_TOTAL) + { + em_index = MIN_LARGE_EM + index; + } + else + { + size_t level = NUM_LARGE_FINE_LEVELS + (index - LARGE_FINE_TOTAL); + em_index = MIN_LARGE_EM + level * LARGE_SUBS_PER_LEVEL; + } + size_t chunks = bits::from_exp_mant(em_index); + return chunks * MIN_CHUNK_SIZE; + } + + /** + * Convert a byte size (which must be a valid large class size, i.e. + * the result of round_size for a large allocation) to the sequential + * large class index. + */ + /** + * Constexpr version for compile-time use (e.g. array sizing). 
+ */ + constexpr size_t size_to_large_class_index_const(size_t size) + { + size_t chunks = (size + MIN_CHUNK_SIZE - 1) / MIN_CHUNK_SIZE; + size_t em = bits::to_exp_mant_const(chunks); + size_t local_em = em - MIN_LARGE_EM; + + if (local_em < LARGE_FINE_TOTAL) + return local_em; + + size_t m = local_em % LARGE_SUBS_PER_LEVEL; + if (m != 0) + local_em = (local_em / LARGE_SUBS_PER_LEVEL + 1) * LARGE_SUBS_PER_LEVEL; + + size_t level = local_em / LARGE_SUBS_PER_LEVEL; + size_t index = LARGE_FINE_TOTAL + (level - NUM_LARGE_FINE_LEVELS); + if (index >= NUM_LARGE_CLASSES) + return NUM_LARGE_CLASSES - 1; + return index; + } + + /** + * Runtime version using hardware CLZ intrinsic for fast path. + */ + inline SNMALLOC_FAST_PATH size_t size_to_large_class_index(size_t size) + { + // Ceiling division to avoid truncation when size isn't chunk-aligned. + size_t chunks = (size + MIN_CHUNK_SIZE - 1) / MIN_CHUNK_SIZE; + size_t em = bits::to_exp_mant(chunks); + size_t local_em = em - MIN_LARGE_EM; + + if (local_em < LARGE_FINE_TOTAL) + return local_em; + + // Pow2-only range: if em corresponds to a non-pow2 mantissa, + // round up to the next pow2 level. + size_t m = local_em % LARGE_SUBS_PER_LEVEL; + if (m != 0) + local_em = (local_em / LARGE_SUBS_PER_LEVEL + 1) * LARGE_SUBS_PER_LEVEL; + + size_t level = local_em / LARGE_SUBS_PER_LEVEL; + size_t index = LARGE_FINE_TOTAL + (level - NUM_LARGE_FINE_LEVELS); + // Clamp to valid range for sizes beyond the representable address space. + if (index >= NUM_LARGE_CLASSES) + return NUM_LARGE_CLASSES - 1; + return index; + } + /** * This structure contains the fields required for fast paths for sizeclasses. 
*/ @@ -242,12 +370,18 @@ namespace snmalloc meta.mod_zero_mult = (~zero / meta.size) + 1; } - for (size_t sizeclass = 0; sizeclass < bits::BITS; sizeclass++) + for (size_t index = 0; index < NUM_LARGE_CLASSES; index++) { - auto lsc = sizeclass_t::from_large_class(sizeclass); + auto lsc = sizeclass_t::from_large_class(index); auto& meta = fast(lsc); - meta.size = sizeclass == 0 ? 0 : bits::one_at_bit(lsc.as_large()); - meta.slab_mask = meta.size - 1; + meta.size = large_class_index_to_size(index); + // For pow2 large classes: slab_mask = size - 1. + // For non-pow2 large classes: slab_mask = natural_alignment(size) - 1. + // Each pagemap chunk entry stores its distance (in nat_align + // units) from the allocation start, enabling start_of_object + // to work for any interior pointer without placement constraints. + size_t nat = meta.size & (~(meta.size - 1)); + meta.slab_mask = nat - 1; // The slab_mask will do all the necessary work, so // perform identity multiplication for the test. meta.mod_zero_mult = 1; @@ -255,6 +389,21 @@ namespace snmalloc // so collapse the calculated offset. meta.div_mult = 0; } + + // Raw sizeclass value 0 is the sentinel for uninitialized pagemap + // entries (non-snmalloc memory like stack, globals, etc.). + // Set size=0 so alloc_size(nullptr) returns 0, and + // slab_mask=SIZE_MAX so remaining_bytes returns a large value, + // preventing false-positive bounds check failures on non-heap + // addresses. 
+ { + auto& sentinel = fast_[0]; + sentinel.size = 0; + size_t zero = 0; + sentinel.slab_mask = ~zero; + sentinel.mod_zero_mult = 1; + sentinel.div_mult = 0; + } } }; @@ -305,31 +454,24 @@ namespace snmalloc return bits::next_pow2_bits(ssize) - MIN_CHUNK_BITS; } - constexpr size_t slab_sizeclass_to_size(chunksizeclass_t sizeclass) - { - return bits::one_at_bit(MIN_CHUNK_BITS + sizeclass); - } - - /** - * For large allocations, the metaentry stores the raw log_2 of the size, - * which must be shifted into the index space of slab_sizeclass-es. - */ - constexpr size_t - metaentry_chunk_sizeclass_to_slab_sizeclass(chunksizeclass_t sizeclass) - { - return sizeclass - MIN_CHUNK_BITS; - } - constexpr uint16_t sizeclass_to_slab_object_count(smallsizeclass_t sizeclass) { return sizeclass_metadata.slow(sizeclass_t::from_small_class(sizeclass)) .capacity; } - SNMALLOC_FAST_PATH constexpr size_t slab_index(sizeclass_t sc, address_t addr) + SNMALLOC_FAST_PATH constexpr size_t + slab_index(sizeclass_t sc, address_t addr, uint8_t offset_bits = 0) { auto meta = sizeclass_metadata.fast(sc); size_t offset = addr & meta.slab_mask; + // For non-pow2 large allocations, offset_bits records this + // pagemap entry's distance from the allocation start, measured + // in nat_align (= slab_mask + 1) units. Adding it reconstructs + // the total byte offset from the start of the allocation. + // For small and pow2-large sizeclasses offset_bits is always 0, + // so this addition is a no-op and the fast path is unchanged. 
+ offset += (meta.slab_mask + 1) * offset_bits; if constexpr (sizeof(offset) >= 8) { // Only works for 64 bit multiplication, as the following will overflow in @@ -354,27 +496,40 @@ namespace snmalloc } SNMALLOC_FAST_PATH constexpr address_t - start_of_object(sizeclass_t sc, address_t addr) + start_of_object(sizeclass_t sc, address_t addr, uint8_t offset_bits = 0) { auto meta = sizeclass_metadata.fast(sc); - address_t slab_start = addr & ~meta.slab_mask; - size_t index = slab_index(sc, addr); - return slab_start + (index * meta.size); + address_t nat_base = addr & ~meta.slab_mask; + // Subtract offset_bits * nat_align to recover the allocation start. + // Each pagemap entry stores how many nat_align blocks it is from + // the start, so subtracting walks back to the beginning. + // For small and pow2-large sizeclasses offset_bits is always 0, + // so alloc_start == nat_base and the fast path is unchanged. + address_t alloc_start = nat_base - (meta.slab_mask + 1) * offset_bits; + size_t index = slab_index(sc, addr, offset_bits); + return alloc_start + (index * meta.size); } - constexpr size_t index_in_object(sizeclass_t sc, address_t addr) + constexpr size_t + index_in_object(sizeclass_t sc, address_t addr, uint8_t offset_bits = 0) { - return addr - start_of_object(sc, addr); + return addr - start_of_object(sc, addr, offset_bits); } - constexpr size_t remaining_bytes(sizeclass_t sc, address_t addr) + constexpr size_t + remaining_bytes(sizeclass_t sc, address_t addr, uint8_t offset_bits = 0) { - return sizeclass_metadata.fast(sc).size - index_in_object(sc, addr); + return sizeclass_metadata.fast(sc).size - + index_in_object(sc, addr, offset_bits); } - constexpr bool is_start_of_object(sizeclass_t sc, address_t addr) + constexpr bool + is_start_of_object(sizeclass_t sc, address_t addr, uint8_t offset_bits = 0) { size_t offset = addr & (sizeclass_full_to_slab_size(sc) - 1); + // Branchless offset correction: add the entry's distance from the + // allocation start (in 
nat_align units) to get the total offset. + offset += sizeclass_full_to_slab_size(sc) * offset_bits; // Only works up to certain offsets, exhaustively tested by rounding.cc if constexpr (sizeof(offset) >= 8) @@ -394,12 +549,7 @@ namespace snmalloc inline static size_t large_size_to_chunk_size(size_t size) { - return bits::next_pow2(size); - } - - inline static size_t large_size_to_chunk_sizeclass(size_t size) - { - return bits::next_pow2_bits(size) - MIN_CHUNK_BITS; + return large_class_index_to_size(size_to_large_class_index(size)); } constexpr SNMALLOC_PURE size_t sizeclass_lookup_index(const size_t s) @@ -485,12 +635,9 @@ namespace snmalloc /** * A compressed size representation, - * either a small size class with the 7th bit set - * or a large class with the 7th bit not set. - * Large classes are stored as a mask shift. - * size = (~0 >> lc) + 1; - * Thus large size class 0, has size 0. - * And large size class 33, has size 2^31 + * either a small size class with TAG bit set + * or a large class with TAG bit not set. + * Large classes use a sequential index; see large_class_index_to_size. */ static inline sizeclass_t size_to_sizeclass_full(size_t size) { @@ -498,9 +645,9 @@ namespace snmalloc { return sizeclass_t::from_small_class(size_to_sizeclass(size)); } - // bits::clz is undefined on 0, but we have size == 1 has already been - // handled here. We conflate 0 and sizes larger than we can allocate. - return sizeclass_t::from_large_class(bits::clz(size - 1)); + // For large sizes, compute the sequential large class index. + // size_to_large_class_index rounds up to the next valid class. + return sizeclass_t::from_large_class(size_to_large_class_index(size)); } inline SNMALLOC_FAST_PATH static size_t round_size(size_t size) @@ -526,17 +673,18 @@ namespace snmalloc // failed allocation later. 
return size; } - return bits::next_pow2(size); + // Use fine-grained large size classes (B=2 intermediate classes) + // for sizes within the fine-grained range, pow2 for larger. + return large_class_index_to_size(size_to_large_class_index(size)); } /// Returns the alignment that this size naturally has, that is /// all allocations of size `size` will be aligned to the returned value. inline SNMALLOC_FAST_PATH static size_t natural_alignment(size_t size) { - auto rsize = round_size(size); if (size == 0) return 1; - return bits::one_at_bit(bits::ctz(rsize)); + return bits::one_at_bit(bits::ctz(size)); } constexpr SNMALLOC_FAST_PATH static size_t diff --git a/src/snmalloc/override/new.cc b/src/snmalloc/override/new.cc index 667ca9c45..041413a9b 100644 --- a/src/snmalloc/override/new.cc +++ b/src/snmalloc/override/new.cc @@ -37,7 +37,10 @@ namespace snmalloc SNMALLOC_ASSERT( secondary_allocator || - is_start_of_object(size_to_sizeclass_full(size), address_cast(p))); + is_start_of_object( + size_to_sizeclass_full(size), + address_cast(p), + Config::Backend::get_metaentry(address_cast(p)).get_offset())); return p; } diff --git a/src/snmalloc/pal/pal_windows.h b/src/snmalloc/pal/pal_windows.h index 4ed643649..4ef951e2a 100644 --- a/src/snmalloc/pal/pal_windows.h +++ b/src/snmalloc/pal/pal_windows.h @@ -594,7 +594,7 @@ namespace snmalloc # ifdef PLATFORM_HAS_VIRTUALALLOC2 template - void* PALWindows::reserve_aligned(size_t size) noexcept + inline void* PALWindows::reserve_aligned(size_t size) noexcept { SNMALLOC_ASSERT(bits::is_pow2(size)); SNMALLOC_ASSERT(size >= minimum_alloc_size); @@ -624,7 +624,7 @@ namespace snmalloc } # endif - void* PALWindows::reserve(size_t size) noexcept + inline void* PALWindows::reserve(size_t size) noexcept { void* ret = VirtualAlloc(nullptr, size, MEM_RESERVE, PAGE_READWRITE); diff --git a/src/test/func/bc_core/bc_core.cc b/src/test/func/bc_core/bc_core.cc new file mode 100644 index 000000000..e5d5ce869 --- /dev/null +++ 
b/src/test/func/bc_core/bc_core.cc @@ -0,0 +1,958 @@ +/** + * Unit tests for BitmapCoalesce — core insert/remove/coalescing. + * + * Uses a mock Rep (in-memory array) to simulate pagemap entries. + * Tests the bitmap + free-list machinery in isolation. + */ + +#include +#include +#include +#include +#include + +using namespace snmalloc; + +static constexpr size_t TEST_MAX_SIZE_BITS = bits::BITS - 1; +using BC = BitmapCoalesceHelpers; + +// ---- Mock Rep ---- + +static constexpr size_t ARENA_CHUNKS = 256; +static constexpr address_t ARENA_BASE = MIN_CHUNK_SIZE; // chunk index 1 +static constexpr size_t ARENA_SIZE = ARENA_CHUNKS * MIN_CHUNK_SIZE; + +struct MockEntry +{ + address_t next_ptr = 0; + size_t block_size = 0; + bool coalesce_free = false; + bool boundary = false; +}; + +static MockEntry mock_entries[ARENA_CHUNKS]; + +struct MockRep +{ + static size_t idx(address_t addr) + { + SNMALLOC_ASSERT(addr >= ARENA_BASE); + SNMALLOC_ASSERT(addr < ARENA_BASE + ARENA_SIZE); + SNMALLOC_ASSERT(addr % MIN_CHUNK_SIZE == 0); + return (addr - ARENA_BASE) / MIN_CHUNK_SIZE; + } + + static address_t get_next(address_t addr) + { + return mock_entries[idx(addr)].next_ptr; + } + + static void set_next(address_t addr, address_t next) + { + mock_entries[idx(addr)].next_ptr = next; + } + + static size_t get_size(address_t addr) + { + return mock_entries[idx(addr)].block_size; + } + + static void set_size(address_t addr, size_t size) + { + mock_entries[idx(addr)].block_size = size; + } + + static void set_boundary_tags(address_t addr, size_t size) + { + set_size(addr, size); + if (size > MIN_CHUNK_SIZE) + set_size(addr + size - MIN_CHUNK_SIZE, size); + } + + static bool is_free_block(address_t addr) + { + if (addr < ARENA_BASE || addr >= ARENA_BASE + ARENA_SIZE) + return false; + if (addr % MIN_CHUNK_SIZE != 0) + return false; + return mock_entries[idx(addr)].coalesce_free; + } + + static bool is_boundary(address_t addr) + { + if (addr < ARENA_BASE || addr >= ARENA_BASE + ARENA_SIZE) 
+ return true; + if (addr % MIN_CHUNK_SIZE != 0) + return true; + return mock_entries[idx(addr)].boundary; + } + + static void set_boundary(address_t addr) + { + mock_entries[idx(addr)].boundary = true; + } + + static void set_coalesce_free(address_t addr) + { + mock_entries[idx(addr)].coalesce_free = true; + } + + static void clear_coalesce_free(address_t addr) + { + mock_entries[idx(addr)].coalesce_free = false; + } +}; + +using BCCore = BitmapCoalesce; + +static void reset_mock() +{ + for (auto& e : mock_entries) + e = MockEntry{}; +} + +// ---- Test framework ---- + +static size_t failure_count = 0; + +static void check(bool cond, const char* msg, size_t line) +{ + if (!cond) + { + std::cout << "FAIL (line " << line << "): " << msg << std::endl; + failure_count++; + } +} + +#define CHECK(cond) check(cond, #cond, __LINE__) + +// Helper: convert chunk index (relative to arena) to byte address. +static address_t chunk_addr(size_t chunk_idx) +{ + return ARENA_BASE + chunk_idx * MIN_CHUNK_SIZE; +} + +// Helper: convert chunk count to byte size. +static size_t chunk_size(size_t n) +{ + return n * MIN_CHUNK_SIZE; +} + +// ---- Test: insert/remove round-trip ---- + +void test_insert_remove_roundtrip() +{ + std::cout << "test_insert_remove_roundtrip..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a 4-chunk block at address with alignment 4 (chunk 4). + // chunk_addr(3) = ARENA_BASE + 3*MCS = 4*MCS, so chunk index in + // address space is 4, alignment = 4. + address_t addr = chunk_addr(3); // byte addr = 4 * MCS + size_t sz = chunk_size(4); + bc.add_fresh_range(addr, sz); + + // The block can serve sizeclass 4 (e=2, m=0) since alpha=4 and n=4. + // T(4, 4) = 4. + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == addr); + CHECK(result.size == sz); + + // After removal, another remove should return empty. 
+ auto empty = bc.remove_block(chunk_size(4)); + CHECK(empty.addr == 0); + CHECK(empty.size == 0); +} + +// ---- Test: bitmap correctness ---- + +void test_bitmap_correctness() +{ + std::cout << "test_bitmap_correctness..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a block. + address_t addr = chunk_addr(3); // 4*MCS, alpha=4 + bc.add_fresh_range(addr, chunk_size(4)); + + // Compute expected bin. + size_t bin = BC::bin_index(4, 4); + CHECK(bc.is_bin_non_empty(bin)); + CHECK(bc.get_bin_head(bin) == addr); + + // Remove it. + bc.remove_block(chunk_size(4)); + + // Bitmap bit should be cleared. + CHECK(!bc.is_bin_non_empty(bin)); + CHECK(bc.get_bin_head(bin) == 0); +} + +// ---- Test: multiple bins ---- + +void test_multiple_bins() +{ + std::cout << "test_multiple_bins..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Block A: 4 chunks at chunk 4 (alpha=4) → bin_index(4, 4) + address_t a_addr = chunk_addr(3); // 4*MCS + bc.add_fresh_range(a_addr, chunk_size(4)); + + // Block B: 8 chunks at chunk 8 (alpha=8) → bin_index(8, 8) + address_t b_addr = chunk_addr(7); // 8*MCS + bc.add_fresh_range(b_addr, chunk_size(8)); + + size_t bin_a = BC::bin_index(4, 4); + size_t bin_b = BC::bin_index(8, 8); + CHECK(bin_a != bin_b); + CHECK(bc.is_bin_non_empty(bin_a)); + CHECK(bc.is_bin_non_empty(bin_b)); + + // Remove from bin A (sizeclass 4). + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == a_addr); + + // Block B should still be there. + CHECK(bc.is_bin_non_empty(bin_b)); + CHECK(!bc.is_bin_non_empty(bin_a)); +} + +// ---- Test: best-fit search ---- + +void test_best_fit_search() +{ + std::cout << "test_best_fit_search..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a small block that can't serve sizeclass 8. + // Block at chunk 4, size 4, alpha=4 → can serve sc 4 but not 8. + bc.add_fresh_range(chunk_addr(3), chunk_size(4)); + + // Insert a larger block that can serve sizeclass 8. + // Block at chunk 16, size 8, alpha=16. 
bin_index(8, 16). + // T(8, 16) = 8. So 8 >= 8 → A-only at e=3. + bc.add_fresh_range(chunk_addr(15), chunk_size(8)); + + // Allocate sizeclass 8. The small block can't serve it; the large + // block should be returned. + auto result = bc.remove_block(chunk_size(8)); + CHECK(result.addr == chunk_addr(15)); + CHECK(result.size == chunk_size(8)); +} + +// ---- Test: masked search for m=0 ---- + +void test_masked_search() +{ + std::cout << "test_masked_search..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a block that is B-only at e=2: serves m=1 (sc 5) but not + // m=0 (sc 4). Need n >= T(5, alpha) and n < T(4, alpha). + // + // At alpha=1: T(5,1) = 5, T(4,1) = 7. So n=5 at alpha=1 works: + // bin_index(5, 1) should give B-only at e=2. + // + // Use chunk 1 (addr = ARENA_BASE = MCS, chunk_addr in addr space = 1, + // alpha = 1). + address_t bonly_addr = chunk_addr(0); // addr = 1*MCS, alpha = 1 + bc.add_fresh_range(bonly_addr, chunk_size(5)); + + size_t bin = BC::bin_index(5, 1); + size_t expected_bonly = BC::exponent_base_bit(2) + BC::SLOT_B_ONLY; + CHECK(bin == expected_bonly); + + // Allocate m=0 at e=2 (sizeclass 4). The B-only block should be + // skipped by the mask. + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == 0); // Not found — B-only is masked. + CHECK(result.size == 0); + + // Allocate m=1 at e=2 (sizeclass 5). The B-only block should be found. + result = bc.remove_block(chunk_size(5)); + CHECK(result.addr == bonly_addr); + CHECK(result.size == chunk_size(5)); +} + +// ---- Test: remove_from_bin correctness ---- + +void test_remove_from_bin() +{ + std::cout << "test_remove_from_bin..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert 3 blocks into the same bin. We need blocks at different + // addresses but same (n_chunks, alpha). 
+ // + // Use blocks of 4 chunks at addresses with alpha=4: + // chunk 4 (addr = 4*MCS), alpha = 4 + // chunk 12 (addr = 12*MCS), alpha = 4 (12 = 4*3, natural_alignment = 4) + // chunk 20 (addr = 20*MCS), alpha = 4 (20 = 4*5, natural_alignment = 4) + address_t a1 = chunk_addr(3); // 4*MCS + address_t a2 = chunk_addr(11); // 12*MCS + address_t a3 = chunk_addr(19); // 20*MCS + + bc.add_fresh_range(a1, chunk_size(4)); + bc.add_fresh_range(a2, chunk_size(4)); + bc.add_fresh_range(a3, chunk_size(4)); + + size_t bin = BC::bin_index(4, 4); + CHECK(bc.is_bin_non_empty(bin)); + + // The list should be: a3 -> a2 -> a1 -> 0 (prepend order). + CHECK(bc.get_bin_head(bin) == a3); + + // Remove the middle block (a2) using remove_block(addr, size). + // We need to use add_block for this test since remove_from_bin is private. + // Actually, we test it indirectly through add_block's coalescing. + // For now, let's verify that remove_block (allocation) pops the head. + + // Remove: should get a3 (head). + auto r1 = bc.remove_block(chunk_size(4)); + CHECK(r1.addr == a3); + + // Next remove: should get a2. + auto r2 = bc.remove_block(chunk_size(4)); + CHECK(r2.addr == a2); + + // Next remove: should get a1. + auto r3 = bc.remove_block(chunk_size(4)); + CHECK(r3.addr == a1); + + // Now empty. + auto r4 = bc.remove_block(chunk_size(4)); + CHECK(r4.addr == 0); +} + +// ---- Test: empty return ---- + +void test_empty_return() +{ + std::cout << "test_empty_return..." << std::endl; + reset_mock(); + BCCore bc{}; + + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == 0); + CHECK(result.size == 0); + + result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == 0); + CHECK(result.size == 0); +} + +// ---- Test: boundary tags ---- + +void test_boundary_tags() +{ + std::cout << "test_boundary_tags..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a 4-chunk block and verify boundary tags are set. 
+ address_t addr = chunk_addr(3); // 4*MCS + bc.add_fresh_range(addr, chunk_size(4)); + + // First chunk should have size = 4*MCS. + CHECK(MockRep::get_size(addr) == chunk_size(4)); + // Last chunk should also have size = 4*MCS. + CHECK(MockRep::get_size(addr + chunk_size(3)) == chunk_size(4)); + // coalesce_free should be set on first and last. + CHECK(MockRep::is_free_block(addr)); + CHECK(MockRep::is_free_block(addr + chunk_size(3))); + + // After removal, first-entry tags should be cleared. + bc.remove_block(chunk_size(4)); + CHECK(MockRep::get_size(addr) == 0); + CHECK(!MockRep::is_free_block(addr)); + // Last-entry tag is intentionally left stale. + CHECK(MockRep::get_size(addr + chunk_size(3)) == chunk_size(4)); +} + +// ---- Test: single-chunk block ---- + +void test_single_chunk() +{ + std::cout << "test_single_chunk..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Single-chunk block at chunk 2 (alpha=2). + address_t addr = chunk_addr(1); // 2*MCS + bc.add_fresh_range(addr, chunk_size(1)); + + // Can serve sizeclass 1. + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == addr); + CHECK(result.size == chunk_size(1)); +} + +// ---- Test: large block serves small sizeclass ---- + +void test_large_serves_small() +{ + std::cout << "test_large_serves_small..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a large block (16 chunks) at a well-aligned address. + // chunk 16 (alpha=16). + address_t addr = chunk_addr(15); // 16*MCS + bc.add_fresh_range(addr, chunk_size(16)); + + // Should be able to serve sizeclass 1 (the search walks up from bit 0). + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == addr); + CHECK(result.size == chunk_size(16)); +} + +// ---- Test: coalesce right ---- + +void test_coalesce_right() +{ + std::cout << "test_coalesce_right..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert block A. 
+ address_t a_addr = chunk_addr(3); // 4*MCS + bc.add_fresh_range(a_addr, chunk_size(4)); + + // add_block B immediately to A's right → should coalesce. + address_t b_addr = a_addr + chunk_size(4); + bc.add_block(b_addr, chunk_size(4)); + + // Should now have a single 8-chunk block at a_addr. + // The combined block's bin depends on (8 chunks, alpha(a_addr)). + // alpha(4*MCS) = alpha(chunk 4) = 4. bin_index(8, 4). + size_t combined_bin = BC::bin_index(8, 4); + CHECK(bc.is_bin_non_empty(combined_bin)); + CHECK(bc.get_bin_head(combined_bin) == a_addr); + + // Remove it and verify size. + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == a_addr); + CHECK(result.size == chunk_size(8)); +} + +// ---- Test: coalesce left ---- + +void test_coalesce_left() +{ + std::cout << "test_coalesce_left..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert block B (the right one first). + address_t b_addr = chunk_addr(7); // 8*MCS + bc.add_fresh_range(b_addr, chunk_size(4)); + + // add_block A immediately to B's left → should coalesce. + address_t a_addr = chunk_addr(3); // 4*MCS + bc.add_block(a_addr, chunk_size(4)); + + // Should have a single 8-chunk block at a_addr. + size_t combined_bin = + BC::bin_index(8, BC::natural_alignment(a_addr / MIN_CHUNK_SIZE)); + CHECK(bc.is_bin_non_empty(combined_bin)); + CHECK(bc.get_bin_head(combined_bin) == a_addr); +} + +// ---- Test: coalesce both sides ---- + +void test_coalesce_both() +{ + std::cout << "test_coalesce_both..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert blocks A and C with a gap between. + address_t a_addr = chunk_addr(3); // 4*MCS, 4 chunks + address_t c_addr = chunk_addr(11); // 12*MCS, 4 chunks + + bc.add_fresh_range(a_addr, chunk_size(4)); + bc.add_fresh_range(c_addr, chunk_size(4)); + + // add_block the gap (chunks 8-11 = 4 chunks from 8*MCS). 
+ address_t gap_addr = chunk_addr(7); // 8*MCS + bc.add_block(gap_addr, chunk_size(4)); + + // Should have a single 12-chunk block at a_addr. + size_t combined_bin = + BC::bin_index(12, BC::natural_alignment(a_addr / MIN_CHUNK_SIZE)); + CHECK(bc.is_bin_non_empty(combined_bin)); + CHECK(bc.get_bin_head(combined_bin) == a_addr); + + // Verify by removing. + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == a_addr); + CHECK(result.size == chunk_size(12)); +} + +// ---- Test: boundary prevents coalescing ---- + +void test_boundary_prevents_coalescing() +{ + std::cout << "test_boundary_prevents_coalescing..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert blocks A and B adjacent. + address_t a_addr = chunk_addr(3); // 4*MCS, 4 chunks + address_t b_addr = chunk_addr(7); // 8*MCS, 4 chunks + + bc.add_fresh_range(a_addr, chunk_size(4)); + bc.add_fresh_range(b_addr, chunk_size(4)); + + // Set a boundary between A and B (at B's start). + MockRep::set_boundary(b_addr); + + // add_block C adjacent to B's right → should merge with B, not A. + address_t c_addr = chunk_addr(11); // 12*MCS + bc.add_block(c_addr, chunk_size(4)); + + // B+C should be coalesced into 8 chunks at b_addr. + size_t bc_bin = + BC::bin_index(8, BC::natural_alignment(b_addr / MIN_CHUNK_SIZE)); + CHECK(bc.is_bin_non_empty(bc_bin)); + + // A should still be in its original bin. + size_t a_bin = + BC::bin_index(4, BC::natural_alignment(a_addr / MIN_CHUNK_SIZE)); + CHECK(bc.is_bin_non_empty(a_bin)); +} + +// ---- Test: stale tag after remove ---- + +void test_stale_tag_after_remove() +{ + std::cout << "test_stale_tag_after_remove..." << std::endl; + reset_mock(); + BCCore bc{}; + + // add_fresh_range block A. + address_t a_addr = chunk_addr(3); // 4*MCS, 4 chunks + bc.add_fresh_range(a_addr, chunk_size(4)); + + // Allocate A (this clears first-entry coalesce_free and size). 
+ auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == a_addr); + + // Now add_block B immediately to A's right. The left walk should see + // that A's first entry is not coalesce_free and stop — no coalescing. + address_t b_addr = chunk_addr(7); // 8*MCS, 4 chunks + bc.add_block(b_addr, chunk_size(4)); + + // B should be in its own bin, not merged with ghost of A. + size_t b_bin = + BC::bin_index(4, BC::natural_alignment(b_addr / MIN_CHUNK_SIZE)); + CHECK(bc.is_bin_non_empty(b_bin)); + CHECK(bc.get_bin_head(b_bin) == b_addr); + + // Remove B and verify it's still 4 chunks (not 8). + result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == b_addr); + CHECK(result.size == chunk_size(4)); +} + +// ---- Test: stale tag after right-walk absorption ---- + +void test_stale_tag_after_absorption() +{ + std::cout << "test_stale_tag_after_absorption..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert A, B, C adjacent: chunks 4-7, 8-11, 12-15. + address_t a_addr = chunk_addr(3); // 4*MCS + address_t b_addr = chunk_addr(7); // 8*MCS + address_t c_addr = chunk_addr(11); // 12*MCS + bc.add_fresh_range(a_addr, chunk_size(4)); + bc.add_fresh_range(b_addr, chunk_size(4)); + bc.add_fresh_range(c_addr, chunk_size(4)); + + // add_block a block to the left of A. The right walk absorbs A, B, C. + address_t left_addr = chunk_addr(1); // 2*MCS, 2 chunks + bc.add_block(left_addr, chunk_size(2)); + + // Should have one big block: 2+4+4+4 = 14 chunks at left_addr. + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == left_addr); + CHECK(result.size == chunk_size(14)); + + // Now the stale tags from B and C should have been cleared. + // add_block a block to the right of the merged region. + // The merged region ends at chunk_addr(1) + 14*MCS = chunk_addr(15). + address_t right_addr = chunk_addr(15); // 16*MCS + bc.add_block(right_addr, chunk_size(4)); + + // Should NOT read stale tags from B or C and should stay as 4 chunks. 
+ result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == right_addr); + CHECK(result.size == chunk_size(4)); +} + +// ---- Test: underflow guard (address 0) ---- +// Can't directly test address 0 with our arena base at MCS. +// But we can test that the left walk stops at the arena boundary. + +void test_left_walk_stops_at_boundary() +{ + std::cout << "test_left_walk_stops_at_boundary..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert block at the very start of the arena. + address_t addr = chunk_addr(0); // ARENA_BASE = 1*MCS + bc.add_fresh_range(addr, chunk_size(4)); + + // add_block another block adjacent to its right. + address_t right_addr = chunk_addr(4); // 5*MCS + bc.add_block(right_addr, chunk_size(4)); + + // Should coalesce into one 8-chunk block at chunk_addr(0). + auto result = bc.remove_block(chunk_size(1)); + CHECK(result.addr == addr); + CHECK(result.size == chunk_size(8)); + + // Now insert a block at chunk_addr(0) again. + bc.add_fresh_range(addr, chunk_size(4)); + + // add_block at its left should be impossible (out of arena). + // But we can verify the block stays at 4 chunks. + auto result2 = bc.remove_block(chunk_size(1)); + CHECK(result2.addr == addr); + CHECK(result2.size == chunk_size(4)); +} + +// ---- Test: stress (random alloc/free) ---- + +void test_stress() +{ + std::cout << "test_stress..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Use a deterministic PRNG for reproducibility. + uint32_t seed = 12345; + auto rng = [&seed]() -> uint32_t { + seed ^= seed << 13; + seed ^= seed >> 17; + seed ^= seed << 5; + return seed; + }; + + // Track live allocations (addr -> size). + struct Alloc + { + address_t addr; + size_t size; + }; + + std::vector<Alloc> live; + + // Insert some initial blocks. + // Use 4-chunk aligned addresses to keep things simple. + for (size_t i = 0; i < 64; i += 4) + { + address_t addr = chunk_addr(i); + bc.add_fresh_range(addr, chunk_size(4)); + } + + // Do 1000 random alloc/free operations.
+ for (size_t op = 0; op < 1000; op++) + { + if (live.size() > 0 && (rng() % 3 != 0)) + { + // Free a random live allocation. + size_t idx = rng() % live.size(); + bc.add_block(live[idx].addr, live[idx].size); + live.erase(live.begin() + static_cast(idx)); + } + else + { + // Allocate sizeclass 1 (always the smallest). + auto result = bc.remove_block(chunk_size(1)); + if (result.addr != 0) + { + live.push_back({result.addr, result.size}); + } + } + } + + // Free everything remaining. + for (auto& a : live) + { + bc.add_block(a.addr, a.size); + } + live.clear(); + + // Verify we can allocate everything again as sizeclass 1. + size_t total_freed = 0; + for (;;) + { + auto result = bc.remove_block(chunk_size(1)); + if (result.addr == 0) + break; + total_freed += result.size; + } + + // We initially inserted 16 blocks of 4 chunks each = 64 chunks. + CHECK(total_freed == chunk_size(64)); +} + +// ---- Test: carving simulation ---- +// Simulate the range wrapper's carving logic using the mock. + +void test_carving_exact_fit() +{ + std::cout << "test_carving_exact_fit..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a 4-chunk block at chunk 4 (alpha=4). + // This block exactly fits sizeclass 4 at alignment 4. + address_t addr = chunk_addr(3); // 4*MCS + bc.add_fresh_range(addr, chunk_size(4)); + + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == addr); + CHECK(result.size == chunk_size(4)); + + // Carving: aligned_addr = align_up(addr, 4*MCS) = addr since addr is + // 4-aligned. + address_t aligned = bits::align_up(result.addr, chunk_size(4)); + CHECK(aligned == addr); + // No prefix, no suffix. + size_t prefix = aligned - result.addr; + size_t suffix = result.size - prefix - chunk_size(4); + CHECK(prefix == 0); + CHECK(suffix == 0); +} + +void test_carving_with_prefix() +{ + std::cout << "test_carving_with_prefix..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a block at chunk 1 (alpha=1), size 8 chunks. 
+ // For sizeclass 4: align=4, T(4,1) = 7. n=8 >= 7 → can serve. + // But the block address is 1*MCS which is not 4*MCS-aligned. + address_t addr = chunk_addr(0); // 1*MCS + bc.add_fresh_range(addr, chunk_size(8)); + + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr != 0); + + // Carving: aligned = align_up(1*MCS, 4*MCS) = 4*MCS + address_t aligned = bits::align_up(result.addr, chunk_size(4)); + size_t prefix = aligned - result.addr; + size_t suffix = result.size - prefix - chunk_size(4); + + CHECK(aligned == chunk_addr(3)); // 4*MCS + CHECK(prefix == chunk_size(3)); // 3 chunks of prefix + CHECK(suffix == chunk_size(1)); // 1 chunk of suffix + + // Return remainders to the pool. + if (prefix > 0) + bc.add_fresh_range(result.addr, prefix); + if (suffix > 0) + bc.add_fresh_range(aligned + chunk_size(4), suffix); + + // Should be able to allocate from the remainders. + auto r2 = bc.remove_block(chunk_size(1)); + CHECK(r2.addr != 0); +} + +void test_carving_with_suffix() +{ + std::cout << "test_carving_with_suffix..." << std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a 16-chunk block at chunk 16 (alpha=16). + // For sizeclass 4: aligned, no prefix, suffix = 12 chunks. + address_t addr = chunk_addr(15); // 16*MCS + bc.add_fresh_range(addr, chunk_size(16)); + + auto result = bc.remove_block(chunk_size(4)); + CHECK(result.addr == addr); + + address_t aligned = bits::align_up(result.addr, chunk_size(4)); + size_t prefix = aligned - result.addr; + size_t suffix = result.size - prefix - chunk_size(4); + + CHECK(aligned == addr); + CHECK(prefix == 0); + CHECK(suffix == chunk_size(12)); + + // Return suffix. + if (suffix > 0) + bc.add_fresh_range(aligned + chunk_size(4), suffix); + + // Should be able to allocate from the suffix. + auto r2 = bc.remove_block(chunk_size(1)); + CHECK(r2.addr != 0); + CHECK(r2.size == chunk_size(12)); +} + +void test_carving_both() +{ + std::cout << "test_carving_both..." 
<< std::endl; + reset_mock(); + BCCore bc{}; + + // Insert a 16-chunk block at chunk 2 (alpha=2). + // For sizeclass 8: align=8, T(8,2) = 8+6 = 14. n=16 >= 14 → can serve. + // aligned = align_up(2*MCS, 8*MCS) = 8*MCS. prefix = 6, suffix = 2. + address_t addr = chunk_addr(1); // 2*MCS + bc.add_fresh_range(addr, chunk_size(16)); + + auto result = bc.remove_block(chunk_size(8)); + CHECK(result.addr != 0); + + address_t aligned = bits::align_up(result.addr, chunk_size(8)); + size_t prefix = aligned - result.addr; + size_t suffix = result.size - prefix - chunk_size(8); + + CHECK(aligned == chunk_addr(7)); // 8*MCS + CHECK(prefix == chunk_size(6)); + CHECK(suffix == chunk_size(2)); + + // Verify the aligned address IS properly aligned. + CHECK(aligned % chunk_size(8) == 0); +} + +void test_carving_alignment_correctness() +{ + std::cout << "test_carving_alignment_correctness..." << std::endl; + + // For each valid sizeclass, insert a block with alpha=1 (worst case), + // allocate, and verify the carved address has proper alignment. + std::vector<size_t> test_sizes; + test_sizes.push_back(1); + test_sizes.push_back(2); + test_sizes.push_back(3); + for (size_t e = 2; e <= 5; e++) + { + for (size_t m = 0; m < BC::SL_COUNT; m++) + test_sizes.push_back(BC::sizeclass_size(e, m)); + } + + for (size_t sc : test_sizes) + { + reset_mock(); + BCCore bc{}; + + size_t sc_align = BC::natural_alignment(sc); + + // Compute threshold at alpha=1 to know how big the block needs to be. + size_t thr = BC::threshold(sc, 1); + // Use a block of size = thr at chunk 1 (alpha=1).
+ if (thr > ARENA_CHUNKS) + continue; + + address_t addr = chunk_addr(0); // 1*MCS, alpha=1 + bc.add_fresh_range(addr, chunk_size(thr)); + + auto result = bc.remove_block(chunk_size(sc)); + if (result.addr == 0) + { + std::cout << " FAIL: sc=" << sc << " thr=" << thr + << " could not allocate" << std::endl; + CHECK(false); + continue; + } + + address_t aligned = bits::align_up(result.addr, sc_align * MIN_CHUNK_SIZE); + CHECK(aligned + chunk_size(sc) <= result.addr + result.size); + CHECK(aligned % (sc_align * MIN_CHUNK_SIZE) == 0); + } +} + +// ---- Test: round_up_sizeclass ---- + +void test_round_up_sizeclass() +{ + std::cout << "test_round_up_sizeclass..." << std::endl; + + CHECK(BC::round_up_sizeclass(0) == 0); + CHECK(BC::round_up_sizeclass(1) == 1); + CHECK(BC::round_up_sizeclass(2) == 2); + CHECK(BC::round_up_sizeclass(3) == 3); + CHECK(BC::round_up_sizeclass(4) == 4); + + // 5 is already valid (e=2, m=1) + CHECK(BC::round_up_sizeclass(5) == 5); + // 9 is not valid; rounds to 10 + CHECK(BC::round_up_sizeclass(9) == 10); + // 11 rounds to 12 + CHECK(BC::round_up_sizeclass(11) == 12); + // 15 rounds to 16 + CHECK(BC::round_up_sizeclass(15) == 16); + // 16 is already valid + CHECK(BC::round_up_sizeclass(16) == 16); + + // Exhaustive: for all n from 1 to 256, round_up is >= n and is valid. + for (size_t n = 1; n <= 256; n++) + { + size_t r = BC::round_up_sizeclass(n); + CHECK(r >= n); + CHECK(BC::is_valid_sizeclass(r)); + // And it's the smallest valid sizeclass >= n. 
+ if (r > n) + { + for (size_t k = n; k < r; k++) + CHECK(!BC::is_valid_sizeclass(k)); + } + } +} + +// ---- Main ---- + +int main() +{ + setup(); + + test_insert_remove_roundtrip(); + test_bitmap_correctness(); + test_multiple_bins(); + test_best_fit_search(); + test_masked_search(); + test_remove_from_bin(); + test_empty_return(); + test_boundary_tags(); + test_single_chunk(); + test_large_serves_small(); + test_coalesce_right(); + test_coalesce_left(); + test_coalesce_both(); + test_boundary_prevents_coalescing(); + test_stale_tag_after_remove(); + test_stale_tag_after_absorption(); + test_left_walk_stops_at_boundary(); + test_stress(); + test_carving_exact_fit(); + test_carving_with_prefix(); + test_carving_with_suffix(); + test_carving_both(); + test_carving_alignment_correctness(); + test_round_up_sizeclass(); + + if (failure_count > 0) + { + std::cout << "\n" << failure_count << " FAILURES" << std::endl; + return 1; + } + + std::cout << "\nAll bc_core tests passed." << std::endl; + return 0; +} diff --git a/src/test/func/bc_helpers/bc_helpers.cc b/src/test/func/bc_helpers/bc_helpers.cc new file mode 100644 index 000000000..5a90f1ad0 --- /dev/null +++ b/src/test/func/bc_helpers/bc_helpers.cc @@ -0,0 +1,423 @@ +/** + * Unit tests for BitmapCoalesceHelpers — the pure arithmetic layer. + * + * Tests bin_index mapping, allocation lookup (start bit, mask bit), + * threshold monotonicity, decompose round-trips, and exhaustive + * verification against brute-force servable-set computation. + */ + +#include +#include +#include +#include + +using namespace snmalloc; + +// Use a small MAX_SIZE_BITS for testing so the bitmap is manageable. +// 22 gives MAX_EXPONENT = 22 - MIN_CHUNK_BITS, enough for several exponent +// levels while keeping exhaustive tests fast. +// But for the helpers tests we can also just use the full address space. +// We'll use bits::BITS - 1 to match the real configuration. 
+using BC = BitmapCoalesceHelpers<bits::BITS - 1>; + +static size_t failure_count = 0; + +static void check(bool cond, const char* msg, size_t line) +{ + if (!cond) + { + std::cout << "FAIL (line " << line << "): " << msg << std::endl; + failure_count++; + } +} + +#define CHECK(cond) check(cond, #cond, __LINE__) + +// ---- Brute-force servable-set computation ---- + +/** + * Can a block at `addr` (in chunks) of `block_size` (in chunks) + * serve size class `sc` (in chunks) with natural alignment? + */ +static bool can_serve(size_t addr, size_t block_size, size_t sc) +{ + size_t align = BC::natural_alignment(sc); + // First aligned address >= addr + size_t aligned_addr = ((addr + align - 1) / align) * align; + return aligned_addr + sc <= addr + block_size; +} + +/** + * Generate all valid size classes up to max_size (in chunks). + */ +static std::vector<size_t> gen_size_classes(size_t max_size) +{ + std::vector<size_t> classes; + classes.push_back(1); + if (max_size >= 2) + classes.push_back(2); + if (max_size >= 3) + classes.push_back(3); + + size_t e = 2; + while (true) + { + size_t base = size_t(1) << e; + if (base > max_size) + break; + size_t step = base >> INTERMEDIATE_BITS; + for (size_t m = 0; m < BC::SL_COUNT; m++) + { + size_t s = base + m * step; + if (s > max_size) + break; + classes.push_back(s); + } + e++; + } + return classes; +} + +// ---- Test: constants sanity ---- + +void test_constants() +{ + std::cout << "test_constants..." 
<< std::endl; + + CHECK(BC::B == 2); + CHECK(BC::SL_COUNT == 4); + CHECK(BC::SLOTS_PER_EXPONENT == 5); + CHECK(BC::PREFIX_BITS == 3); + + // For full address space: MAX_EXPONENT should be BITS-1 - MIN_CHUNK_BITS + CHECK(BC::MAX_EXPONENT == bits::BITS - 1 - MIN_CHUNK_BITS); + + // NUM_BINS = 3 + 5 * (MAX_EXPONENT - 1) + CHECK(BC::NUM_BINS == 3 + 5 * (BC::MAX_EXPONENT - 1)); + + // BITMAP_WORDS should be enough to hold NUM_BINS bits + CHECK(BC::BITMAP_WORDS * bits::BITS >= BC::NUM_BINS); + CHECK((BC::BITMAP_WORDS - 1) * bits::BITS < BC::NUM_BINS); +} + +// ---- Test: decompose / sizeclass_size round-trip ---- + +void test_decompose_roundtrip() +{ + std::cout << "test_decompose_roundtrip..." << std::endl; + + // All valid size classes up to 1024 chunks + auto classes = gen_size_classes(1024); + + for (size_t sc : classes) + { + size_t e, m; + bool ok = BC::decompose(sc, e, m); + CHECK(ok); + if (ok) + { + CHECK(BC::sizeclass_size(e, m) == sc); + } + } + + // Invalid sizes should not decompose + size_t e_tmp = 0, m_tmp = 0; + CHECK(!BC::decompose(0, e_tmp, m_tmp)); + // 9 is not a valid size class (between 8 and 10 for B=2) + CHECK(!BC::decompose(9, e_tmp, m_tmp)); + CHECK(!BC::decompose(11, e_tmp, m_tmp)); + CHECK(!BC::decompose(13, e_tmp, m_tmp)); + CHECK(!BC::decompose(15, e_tmp, m_tmp)); +} + +// ---- Test: is_valid_sizeclass ---- + +void test_is_valid_sizeclass() +{ + std::cout << "test_is_valid_sizeclass..." 
<< std::endl; + + auto classes = gen_size_classes(256); + + for (size_t s = 0; s <= 256; s++) + { + bool expected = false; + for (size_t sc : classes) + { + if (sc == s) + { + expected = true; + break; + } + } + if (BC::is_valid_sizeclass(s) != expected) + { + std::cout << " is_valid_sizeclass(" << s + << ") = " << BC::is_valid_sizeclass(s) << ", expected " + << expected << std::endl; + CHECK(false); + } + } +} + +// ---- Test: alloc_start_bit / alloc_mask_bit ---- + +void test_alloc_lookup() +{ + std::cout << "test_alloc_lookup..." << std::endl; + + auto classes = gen_size_classes(1024); + + for (size_t sc : classes) + { + size_t e, m; + bool ok = BC::decompose(sc, e, m); + CHECK(ok); + if (!ok) + continue; + + size_t start = BC::alloc_start_bit(e, m); + CHECK(start < BC::NUM_BINS); + + if (e >= 2 && m == 0) + { + size_t mask = BC::alloc_mask_bit(e); + CHECK(mask < BC::NUM_BINS); + CHECK(mask != start); // mask bit should be different from start + } + else if (e >= 2) + { + // m != 0: alloc_mask_bit for e should still be valid + // but we don't use it; the search is unmasked. + } + if (e <= 1) + { + // Prefix range: no masking + CHECK(BC::alloc_mask_bit(e) == SIZE_MAX); + } + } +} + +// ---- Test: threshold monotonicity ---- + +void test_threshold_monotonicity() +{ + std::cout << "test_threshold_monotonicity..." << std::endl; + + // For a fixed alignment, bin_index(n, alpha) should be non-decreasing in n. 
+ for (size_t alpha_bits = 0; alpha_bits <= 8; alpha_bits++) + { + size_t alpha = size_t(1) << alpha_bits; + size_t prev_bin = 0; + for (size_t n = 1; n <= 256; n++) + { + size_t bin = BC::bin_index(n, alpha); + if (bin < prev_bin) + { + std::cout << " MONOTONICITY FAIL: bin_index(" << n << ", " << alpha + << ") = " << bin << " < prev " << prev_bin << std::endl; + CHECK(false); + } + prev_bin = bin; + } + } +} + +// ---- Test: exhaustive bin_index verification ---- +// For every (addr, block_size) in a small arena, compute the servable set +// by brute force, then verify bin_index assigns a bin whose servable set +// is a subset of the block's actual servable set. + +void test_exhaustive_bin_index() +{ + std::cout << "test_exhaustive_bin_index..." << std::endl; + + constexpr size_t ARENA = 256; + auto classes = gen_size_classes(ARENA); + + // For each bin index, compute its servable set (conservatively). + // A bin's servable set = intersection of servable sets of all blocks in it. + // But we verify from the other direction: for each block, verify that + // every size class claimable from the block's bin is actually servable. + + // First, build the mapping: for each (e, m), which bins have start_bit + // and what's the search path? + // + // The key property to verify: if bin_index assigns a block to bin B, and + // an allocation for sizeclass S would find bin B (i.e., B >= start_bit + // and B is not masked), then the block MUST be able to serve S. + + size_t errors = 0; + + for (size_t addr = 0; addr < ARENA; addr++) + { + for (size_t n = 1; n + addr <= ARENA; n++) + { + // We want alignment in chunk units; since these ARE chunk units already, + // alpha_chunks = natural_alignment(addr) for addr > 0, or large for 0. 
+ size_t alpha_chunks; + if (addr == 0) + alpha_chunks = size_t(1) << (bits::BITS - 1); + else + alpha_chunks = BC::natural_alignment(addr); + + size_t bin = BC::bin_index(n, alpha_chunks); + CHECK(bin < BC::NUM_BINS); + + // For every size class S that would find this bin during allocation: + // The search starts at alloc_start_bit(e, m) and walks upward, + // optionally masking one bit. If bin >= start_bit and bin is not + // masked, the allocation would pick this block. + for (size_t sc : classes) + { + size_t e, m; + if (!BC::decompose(sc, e, m)) + continue; + + size_t start = BC::alloc_start_bit(e, m); + + // Would this bin be found by the search? + if (bin < start) + continue; // bin is below the search window + + // Check if bin is the masked-out bit + if (m == 0 && e >= 2) + { + size_t mask = BC::alloc_mask_bit(e); + if (bin == mask) + continue; // this bit is masked out for m=0 + } + + // The allocation would pick this block. Verify it can serve S. + if (!can_serve(addr, n, sc)) + { + errors++; + if (errors <= 10) + { + std::cout << " bin_index ERROR: addr=" << addr << " n=" << n + << " alpha=" << alpha_chunks << " -> bin=" << bin + << ", but can't serve sc=" << sc << " (e=" << e + << " m=" << m << " start=" << start << ")" << std::endl; + } + } + } + } + } + + if (errors > 0) + { + std::cout << " " << errors << " errors in exhaustive bin_index check" + << std::endl; + } + CHECK(errors == 0); +} + +// ---- Test: threshold-based completeness ---- +// If a block meets the threshold for size class S (i.e. n >= T(S, alpha)), +// then the allocation search for S must be able to find the block's bin. +// Note: this is weaker than "can_serve implies findable" because the +// threshold is a worst-case bound over all addresses with a given alignment. +// Some blocks at favorable addresses can serve S even when n < T(S, alpha), +// but those are not guaranteed findable. + +void test_threshold_completeness() +{ + std::cout << "test_threshold_completeness..." 
<< std::endl; + + constexpr size_t ARENA = 64; + auto classes = gen_size_classes(ARENA); + + size_t errors = 0; + + for (size_t addr = 0; addr < ARENA; addr++) + { + for (size_t n = 1; n + addr <= ARENA; n++) + { + size_t alpha_chunks; + if (addr == 0) + alpha_chunks = size_t(1) << (bits::BITS - 1); + else + alpha_chunks = BC::natural_alignment(addr); + + size_t bin = BC::bin_index(n, alpha_chunks); + + for (size_t sc : classes) + { + // Only check classes where the threshold is met. + size_t t = BC::threshold(sc, alpha_chunks); + if (n < t) + continue; + + size_t e, m; + if (!BC::decompose(sc, e, m)) + continue; + + size_t start = BC::alloc_start_bit(e, m); + + // bin should be >= start (the block should be findable) + if (bin < start) + { + errors++; + if (errors <= 10) + { + std::cout << " COMPLETENESS: addr=" << addr << " n=" << n + << " alpha=" << alpha_chunks << " -> bin=" << bin + << ", but sc=" << sc << " (e=" << e << " m=" << m + << " start=" << start << " T=" << t << ") not findable" + << std::endl; + } + } + + // bin should not be the masked bit (if m=0) + if (m == 0 && e >= 2) + { + size_t mask = BC::alloc_mask_bit(e); + if (bin == mask) + { + errors++; + if (errors <= 10) + { + std::cout << " COMPLETENESS: addr=" << addr << " n=" << n + << " alpha=" << alpha_chunks << " -> bin=" << bin + << " == mask for sc=" << sc << " (e=" << e << ")" + << std::endl; + } + } + } + } + } + } + + if (errors > 0) + { + std::cout << " " << errors << " threshold completeness errors" + << std::endl; + } + CHECK(errors == 0); +} + +// ---- Main ---- + +int main() +{ + setup(); + + test_constants(); + test_decompose_roundtrip(); + test_is_valid_sizeclass(); + test_alloc_lookup(); + test_threshold_monotonicity(); + test_exhaustive_bin_index(); + test_threshold_completeness(); + + if (failure_count > 0) + { + std::cout << "\n" << failure_count << " FAILURES" << std::endl; + return 1; + } + + std::cout << "\nAll bc_helpers tests passed." 
<< std::endl; + return 0; +} diff --git a/src/test/func/bc_range/bc_range.cc b/src/test/func/bc_range/bc_range.cc new file mode 100644 index 000000000..cfb53e74c --- /dev/null +++ b/src/test/func/bc_range/bc_range.cc @@ -0,0 +1,493 @@ +/** + * Pipeline integration tests for BitmapCoalesceRange. + * + * Exercises BitmapCoalesceRange through the real snmalloc pipeline + * (FixedRangeConfig -> StandardLocalState -> BitmapCoalesceRange) + * using large-object allocation patterns that exercise coalescing, + * carving, and refill paths. + */ + +#include "test/setup.h" + +#include +#include +#include +#include + +#ifdef assert +# undef assert +#endif +#define assert please_use_SNMALLOC_ASSERT + +using namespace snmalloc; + +// Use a reasonably large arena so the tests have room to exercise +// coalescing behaviour without running out of backing memory. +static constexpr size_t ARENA_BITS = 30; // 1 GiB +static constexpr size_t ARENA_SIZE = bits::one_at_bit(ARENA_BITS); + +// ---- Per-test globals setup ---- +// FixedRangeConfig has static state, so we can only initialise once. +// All test functions run sequentially within a single init. + +using CustomGlobals = FixedRangeConfig; +using FixedAlloc = Allocator; + +/** + * Assert that the pointer is naturally aligned for the given allocation size. + * Natural alignment = largest power of 2 dividing the size. + */ +static void check_natural_alignment(void* p, size_t size, const char* context) +{ + size_t nat_align = natural_alignment(size); + auto addr = reinterpret_cast(p); + if ((addr % nat_align) != 0) + { + std::cout << "\n FAIL [" << context << "]: alloc(" << size << ") returned " + << p << " not aligned to " << nat_align << " (offset " + << (addr % nat_align) << ")" << std::endl; + } + SNMALLOC_CHECK((addr % nat_align) == 0); +} + +/** + * test_large_alloc_roundtrip + * + * Allocate a chunk-sized block, write to every page, free it, + * then reallocate. 
The second allocation should succeed (proving + * the freed block re-entered the pool via BitmapCoalesceRange). + */ +template<typename Alloc> +static void test_large_alloc_roundtrip(Alloc& a) +{ + std::cout << " test_large_alloc_roundtrip ... " << std::flush; + + constexpr size_t obj_size = MIN_CHUNK_SIZE; // 16 KiB + auto* p1 = static_cast<char*>(a->alloc(obj_size)); + SNMALLOC_CHECK(p1 != nullptr); + check_natural_alignment(p1, obj_size, "roundtrip p1"); + + // Touch every page to ensure commit path works. + for (size_t i = 0; i < obj_size; i += 4096) + p1[i] = static_cast<char>(i & 0xFF); + + a->dealloc(p1); + + // Re-allocate the same size — should succeed via the coalescing range. + auto* p2 = static_cast<char*>(a->alloc(obj_size)); + SNMALLOC_CHECK(p2 != nullptr); + check_natural_alignment(p2, obj_size, "roundtrip p2"); + a->dealloc(p2); + + std::cout << "OK" << std::endl; +} + +/** + * test_large_alloc_after_free + * + * Allocate N chunks, free them all, then allocate a single block + * larger than any individual chunk. This exercises BitmapCoalesceRange's + * coalescing of adjacent freed chunks into a larger block. + */ +template<typename Alloc> +static void test_large_alloc_after_free(Alloc& a) +{ + std::cout << " test_large_alloc_after_free ... " << std::flush; + + constexpr size_t N = 8; + constexpr size_t obj_size = MIN_CHUNK_SIZE; + void* ptrs[N]; + + for (size_t i = 0; i < N; i++) + { + ptrs[i] = a->alloc(obj_size); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], obj_size, "alloc_after_free batch"); + } + + // Free all. + for (size_t i = 0; i < N; i++) + a->dealloc(ptrs[i]); + + // Now allocate something larger. If coalescing works, the freed + // chunks should have been merged and this should not need a fresh + // refill from the parent (though the test doesn't assert on that + // internal detail — it just checks success). 
+ constexpr size_t big_size = MIN_CHUNK_SIZE * 4; + auto* big = a->alloc(big_size); + SNMALLOC_CHECK(big != nullptr); + check_natural_alignment(big, big_size, "alloc_after_free big"); + a->dealloc(big); + + std::cout << "OK" << std::endl; +} + +/** + * test_non_pow2_sizes + * + * Allocate objects whose sizes are valid snmalloc sizeclasses that + * are NOT powers of two. The bitmap coalescing range uses B=2 + * intermediate bits, so sizeclasses include 3/4 and 5/4 multiples + * of each power of two. This exercises the carving / alignment + * logic inside BitmapCoalesceRange::alloc_range. + * + * Verifies natural alignment: alloc_range must return addresses + * aligned to the largest power of two dividing the allocation size. + */ +template<typename Alloc> +static void test_non_pow2_sizes(Alloc& a) +{ + std::cout << " test_non_pow2_sizes ... " << std::flush; + + // Pick a few sizes that land on non-power-of-two sizeclasses. + // These are all multiples of MIN_CHUNK_SIZE so they go through + // the large-object path which uses BitmapCoalesceRange. + const size_t sizes[] = { + MIN_CHUNK_SIZE * 3, // 3 chunks — natural alignment 1 chunk + MIN_CHUNK_SIZE * 5, // 5 chunks — natural alignment 1 chunk + MIN_CHUNK_SIZE * 6, // 6 chunks — natural alignment 2 chunks + }; + + for (auto sz : sizes) + { + auto* p = a->alloc(sz); + SNMALLOC_CHECK(p != nullptr); + + // Verify natural alignment: address must be divisible by the + // largest power of two dividing the allocation size. + check_natural_alignment(p, sz, "non_pow2"); + + a->dealloc(p); + } + + std::cout << "OK" << std::endl; +} + +/** + * test_alignment_after_fragmentation + * + * Randomly allocates and deallocates a mix of pow2 and non-pow2 + * large sizes to build up fragmentation, then verifies that every + * allocation is naturally aligned. + * + * Uses a simple xorshift PRNG for reproducibility. + */ +template<typename Alloc> +static void test_alignment_after_fragmentation(Alloc& a) +{ + std::cout << " test_alignment_after_fragmentation ... 
" << std::flush; + + static constexpr size_t MAX_LIVE = 128; + static constexpr size_t NUM_OPS = 2000; + + // All sizes are valid bitmap-coalesce sizeclasses (multiples of + // MIN_CHUNK_SIZE with chunk counts from the B=2 sequence). + const size_t chunk_counts[] = { + 1, 2, 3, 4, 5, 6, 7, 8, 10, 12, 14, 16, 20, 24, 28, 32}; + constexpr size_t N_SIZES = sizeof(chunk_counts) / sizeof(chunk_counts[0]); + + struct Slot + { + void* ptr; + size_t size; + }; + + Slot live[MAX_LIVE] = {}; + size_t num_live = 0; + uint32_t rng = 42; + + auto xorshift = [&]() -> uint32_t { + rng ^= rng << 13; + rng ^= rng >> 17; + rng ^= rng << 5; + return rng; + }; + + for (size_t op = 0; op < NUM_OPS; op++) + { + bool do_alloc; + if (num_live == 0) + do_alloc = true; + else if (num_live >= MAX_LIVE) + do_alloc = false; + else + do_alloc = (xorshift() % 3) != 0; // 2/3 alloc, 1/3 free + + if (do_alloc) + { + size_t chunks = chunk_counts[xorshift() % N_SIZES]; + size_t sz = chunks * MIN_CHUNK_SIZE; + + auto* p = a->alloc(sz); + SNMALLOC_CHECK(p != nullptr); + + // Verify natural alignment. + check_natural_alignment(p, sz, "fragmentation"); + + live[num_live].ptr = p; + live[num_live].size = sz; + num_live++; + } + else + { + size_t idx = xorshift() % num_live; + a->dealloc(live[idx].ptr); + live[idx] = live[num_live - 1]; + num_live--; + } + } + + // Free remaining. + for (size_t i = 0; i < num_live; i++) + a->dealloc(live[i].ptr); + + std::cout << "OK" << std::endl; +} + +/** + * test_range_stress + * + * Rapid alloc / dealloc cycles with varying sizes, exercising + * the coalescing range under pressure. Allocates in waves: first + * fills up a batch, then frees every other allocation (creating + * fragmentation), then allocates again to exercise coalescing of + * the freed slots. + */ +template<typename Alloc> +static void test_range_stress(Alloc& a) +{ + std::cout << " test_range_stress ... 
" << std::flush; + + constexpr size_t BATCH = 32; + void* ptrs[BATCH]; + + // Wave 1: allocate a batch of chunk-sized objects. + for (size_t i = 0; i < BATCH; i++) + { + ptrs[i] = a->alloc(MIN_CHUNK_SIZE); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], MIN_CHUNK_SIZE, "stress wave1"); + } + + // Free every other one — leaves alternating holes. + for (size_t i = 0; i < BATCH; i += 2) + { + a->dealloc(ptrs[i]); + ptrs[i] = nullptr; + } + + // Re-fill the holes. + for (size_t i = 0; i < BATCH; i += 2) + { + ptrs[i] = a->alloc(MIN_CHUNK_SIZE); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], MIN_CHUNK_SIZE, "stress refill"); + } + + // Free everything. + for (size_t i = 0; i < BATCH; i++) + { + a->dealloc(ptrs[i]); + ptrs[i] = nullptr; + } + + // Final: allocate a multi-chunk block to prove coalescing succeeded. + auto* big = a->alloc(MIN_CHUNK_SIZE * 8); + SNMALLOC_CHECK(big != nullptr); + check_natural_alignment(big, MIN_CHUNK_SIZE * 8, "stress big"); + a->dealloc(big); + + std::cout << "OK" << std::endl; +} + +/** + * test_mixed_sizes + * + * Interleave allocations of different sizes and free them in a + * different order, exercising the coalescing range's ability to + * merge blocks of varying sizes and maintain correct boundary tags. + */ +template<typename Alloc> +static void test_mixed_sizes(Alloc& a) +{ + std::cout << " test_mixed_sizes ... " << std::flush; + + auto* p1 = a->alloc(MIN_CHUNK_SIZE); + auto* p2 = a->alloc(MIN_CHUNK_SIZE * 2); + auto* p3 = a->alloc(MIN_CHUNK_SIZE); + auto* p4 = a->alloc(MIN_CHUNK_SIZE * 4); + + SNMALLOC_CHECK(p1 != nullptr); + SNMALLOC_CHECK(p2 != nullptr); + SNMALLOC_CHECK(p3 != nullptr); + SNMALLOC_CHECK(p4 != nullptr); + + check_natural_alignment(p1, MIN_CHUNK_SIZE, "mixed p1"); + check_natural_alignment(p2, MIN_CHUNK_SIZE * 2, "mixed p2"); + check_natural_alignment(p3, MIN_CHUNK_SIZE, "mixed p3"); + check_natural_alignment(p4, MIN_CHUNK_SIZE * 4, "mixed p4"); + + // Free in reverse order. 
+ a->dealloc(p4); + a->dealloc(p3); + a->dealloc(p2); + a->dealloc(p1); + + // Allocate a large block — should succeed if coalescing worked. + auto* big = a->alloc(MIN_CHUNK_SIZE * 8); + SNMALLOC_CHECK(big != nullptr); + check_natural_alignment(big, MIN_CHUNK_SIZE * 8, "mixed big"); + a->dealloc(big); + + std::cout << "OK" << std::endl; +} + +/** + * test_alignment_after_dealloc_sequences + * + * Allocates blocks of various sizes, frees them in different orders, + * then re-allocates and checks that every result is naturally aligned. + * Exercises the coalescing path's ability to produce correctly aligned + * blocks from merged free regions. + */ +template<typename Alloc> +static void test_alignment_after_dealloc_sequences(Alloc& a) +{ + std::cout << " test_alignment_after_dealloc_sequences ... " << std::flush; + + // Sizes in chunks — mix of pow2 and non-pow2 to stress alignment. + const size_t chunk_counts[] = {1, 2, 3, 4, 5, 6, 8, 10, 12, 16}; + constexpr size_t N = sizeof(chunk_counts) / sizeof(chunk_counts[0]); + + // --- Sequence 1: allocate all, free all forward, re-allocate --- + { + void* ptrs[N]; + size_t sizes[N]; + for (size_t i = 0; i < N; i++) + { + sizes[i] = chunk_counts[i] * MIN_CHUNK_SIZE; + ptrs[i] = a->alloc(sizes[i]); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], sizes[i], "dealloc_seq1 alloc"); + } + for (size_t i = 0; i < N; i++) + a->dealloc(ptrs[i]); + // Re-allocate in reverse size order after forward free. 
+ for (size_t i = N; i > 0; i--) + { + ptrs[i - 1] = a->alloc(sizes[i - 1]); + SNMALLOC_CHECK(ptrs[i - 1] != nullptr); + check_natural_alignment( + ptrs[i - 1], sizes[i - 1], "dealloc_seq1 realloc"); + } + for (size_t i = 0; i < N; i++) + a->dealloc(ptrs[i]); + } + + // --- Sequence 2: allocate all, free all reverse, re-allocate --- + { + void* ptrs[N]; + size_t sizes[N]; + for (size_t i = 0; i < N; i++) + { + sizes[i] = chunk_counts[i] * MIN_CHUNK_SIZE; + ptrs[i] = a->alloc(sizes[i]); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], sizes[i], "dealloc_seq2 alloc"); + } + for (size_t i = N; i > 0; i--) + a->dealloc(ptrs[i - 1]); + for (size_t i = 0; i < N; i++) + { + ptrs[i] = a->alloc(sizes[i]); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], sizes[i], "dealloc_seq2 realloc"); + } + for (size_t i = 0; i < N; i++) + a->dealloc(ptrs[i]); + } + + // --- Sequence 3: interleaved alloc/dealloc with growing sizes --- + { + void* prev = nullptr; + for (size_t i = 0; i < N; i++) + { + size_t sz = chunk_counts[i] * MIN_CHUNK_SIZE; + void* p = a->alloc(sz); + SNMALLOC_CHECK(p != nullptr); + check_natural_alignment(p, sz, "dealloc_seq3 interleave"); + + if (prev != nullptr) + a->dealloc(prev); + + prev = p; + } + if (prev != nullptr) + a->dealloc(prev); + + // Now allocate a large block from the coalesced space. + size_t big_sz = MIN_CHUNK_SIZE * 16; + void* big = a->alloc(big_sz); + SNMALLOC_CHECK(big != nullptr); + check_natural_alignment(big, big_sz, "dealloc_seq3 big"); + a->dealloc(big); + } + + // --- Sequence 4: free odd-indexed, then even-indexed, re-allocate larger --- + { + void* ptrs[N]; + size_t sizes[N]; + for (size_t i = 0; i < N; i++) + { + sizes[i] = chunk_counts[i] * MIN_CHUNK_SIZE; + ptrs[i] = a->alloc(sizes[i]); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], sizes[i], "dealloc_seq4 alloc"); + } + // Free odd indices first. 
+ for (size_t i = 1; i < N; i += 2) + a->dealloc(ptrs[i]); + // Free even indices. + for (size_t i = 0; i < N; i += 2) + a->dealloc(ptrs[i]); + // Re-allocate with different sizes. + const size_t realloc_chunks[] = {2, 4, 6, 8, 3, 5, 1, 10, 12, 16}; + for (size_t i = 0; i < N; i++) + { + sizes[i] = realloc_chunks[i] * MIN_CHUNK_SIZE; + ptrs[i] = a->alloc(sizes[i]); + SNMALLOC_CHECK(ptrs[i] != nullptr); + check_natural_alignment(ptrs[i], sizes[i], "dealloc_seq4 realloc"); + } + for (size_t i = 0; i < N; i++) + a->dealloc(ptrs[i]); + } + + std::cout << "OK" << std::endl; +} + +int main() +{ + setup(); + + std::cout << "bc_range: pipeline integration tests for BitmapCoalesceRange" + << std::endl; + + auto* oe_base = DefaultPal::reserve(ARENA_SIZE); + SNMALLOC_CHECK(oe_base != nullptr); + + CustomGlobals::init(nullptr, oe_base, ARENA_SIZE); + + auto a = get_scoped_allocator(); + + test_large_alloc_roundtrip(a); + test_large_alloc_after_free(a); + test_non_pow2_sizes(a); + test_alignment_after_fragmentation(a); + test_range_stress(a); + test_mixed_sizes(a); + test_alignment_after_dealloc_sequences(a); + + std::cout << "bc_range: all tests passed" << std::endl; + return 0; +} diff --git a/src/test/func/large_nonpow2/large_nonpow2.cc b/src/test/func/large_nonpow2/large_nonpow2.cc new file mode 100644 index 000000000..eb4141906 --- /dev/null +++ b/src/test/func/large_nonpow2/large_nonpow2.cc @@ -0,0 +1,244 @@ +#include +#include +#include + +#define SNMALLOC_NAME_MANGLE(a) our_##a +#include + +using namespace snmalloc; + +/** + * Test that large non-power-of-2 allocations: + * 1. Return non-null pointers + * 2. Have the correct natural alignment (NOT next-pow2 alignment) + * 3. Report usable size >= requested size + * 4. Can be written to and freed without error + * 5. Work correctly under repeated alloc/free cycles + */ + +/// The non-pow2 large size classes in chunks (B=2 scheme). 
+/// These are the sizes that exist in the fine-grained levels: +/// e=1: 3 +/// e=2: 5, 6, 7 +/// e=3: 10, 12, 14 +/// e=4: 20, 24, 28 +/// e=5: 40, 48, 56 +/// e=6: 80, 96, 112 +static constexpr size_t non_pow2_chunks[] = { + 3, 5, 6, 7, 10, 12, 14, 20, 24, 28, 40, 48, 56, 80, 96, 112}; + +static constexpr size_t NUM_SIZES = + sizeof(non_pow2_chunks) / sizeof(non_pow2_chunks[0]); + +/// Natural alignment of a size: largest power of 2 dividing it. +static size_t natural_align(size_t s) +{ + return s & (~(s - 1)); +} + +void test_alignment_and_usable_size() +{ + printf(" test_alignment_and_usable_size\n"); + for (size_t i = 0; i < NUM_SIZES; i++) + { + size_t size = non_pow2_chunks[i] * MIN_CHUNK_SIZE; + void* p = our_malloc(size); + EXPECT(p != nullptr, "malloc({}) returned null", size); + + // Check natural alignment. + size_t expected_align = natural_align(size); + size_t addr = reinterpret_cast<size_t>(p); + EXPECT( + (addr % expected_align) == 0, + "malloc({}) returned {} not aligned to {} (natural alignment)", + size, + p, + expected_align); + + // Next-pow2 alignment is intentionally not asserted here; rounding up + // to pow2 internally is detected by the usable-size check below. + size_t pow2_align = bits::next_pow2(size); + if (pow2_align != size) + { + // For a truly non-pow2 allocation, pow2 alignment is not guaranteed. + // We don't assert it's NOT pow2-aligned (it might be by chance), + // so this branch is intentionally empty. + } + + // Check usable size is at least what we asked for, and not rounded + // up to next pow2. + size_t usable = our_malloc_usable_size(p); + EXPECT( + usable >= size, + "malloc_usable_size({}) = {} < requested {}", + p, + usable, + size); + EXPECT( + usable < bits::next_pow2(size), + "malloc_usable_size({}) = {} rounded up to pow2 {} for request {}", + p, + usable, + bits::next_pow2(size), + size); + + // Write to every page to verify the allocation is valid. 
+ memset(p, 0xAB, size); + + our_free(p); + } +} + +void test_many_allocations() +{ + printf(" test_many_allocations\n"); + // Allocate many objects of each non-pow2 size simultaneously. + static constexpr size_t COUNT = 20; + void* ptrs[NUM_SIZES][COUNT]; + + for (size_t i = 0; i < NUM_SIZES; i++) + { + size_t size = non_pow2_chunks[i] * MIN_CHUNK_SIZE; + for (size_t j = 0; j < COUNT; j++) + { + ptrs[i][j] = our_malloc(size); + EXPECT( + ptrs[i][j] != nullptr, "malloc({}) returned null at j={}", size, j); + + size_t addr = reinterpret_cast<size_t>(ptrs[i][j]); + size_t expected_align = natural_align(size); + EXPECT( + (addr % expected_align) == 0, + "malloc({}) at j={} returned {} not aligned to {}", + size, + j, + ptrs[i][j], + expected_align); + + // Touch the memory. + memset(ptrs[i][j], static_cast<int>(i + j), size); + } + } + + // Free in reverse order. + for (size_t i = NUM_SIZES; i > 0; i--) + { + for (size_t j = COUNT; j > 0; j--) + { + our_free(ptrs[i - 1][j - 1]); + } + } +} + +void test_alloc_free_cycles() +{ + printf(" test_alloc_free_cycles\n"); + // Repeated alloc/free to exercise the cache and decay paths. + static constexpr size_t CYCLES = 50; + + for (size_t cycle = 0; cycle < CYCLES; cycle++) + { + for (size_t i = 0; i < NUM_SIZES; i++) + { + size_t size = non_pow2_chunks[i] * MIN_CHUNK_SIZE; + void* p = our_malloc(size); + EXPECT(p != nullptr, "malloc({}) returned null at cycle={}", size, cycle); + + size_t addr = reinterpret_cast<size_t>(p); + size_t expected_align = natural_align(size); + EXPECT( + (addr % expected_align) == 0, + "cycle {} malloc({}) returned {} not aligned to {}", + cycle, + size, + p, + expected_align); + + // Write first and last bytes. + static_cast<char*>(p)[0] = 'A'; + static_cast<char*>(p)[size - 1] = 'Z'; + + our_free(p); + } + } +} + +void test_mixed_sizes() +{ + printf(" test_mixed_sizes\n"); + // Interleave pow2 and non-pow2 large allocations. 
+ static constexpr size_t COUNT = 10; + void* pow2_ptrs[COUNT]; + void* nonpow2_ptrs[COUNT]; + + for (size_t i = 0; i < COUNT; i++) + { + // Pow2 allocation (128 KiB) + size_t pow2_size = 128 * 1024; + pow2_ptrs[i] = our_malloc(pow2_size); + EXPECT(pow2_ptrs[i] != nullptr, "pow2 malloc({}) returned null", pow2_size); + + // Non-pow2 allocation (96 KiB = 6 chunks) + size_t nonpow2_size = 6 * MIN_CHUNK_SIZE; + nonpow2_ptrs[i] = our_malloc(nonpow2_size); + EXPECT( + nonpow2_ptrs[i] != nullptr, + "nonpow2 malloc({}) returned null", + nonpow2_size); + + size_t addr = reinterpret_cast<size_t>(nonpow2_ptrs[i]); + size_t expected_align = natural_align(nonpow2_size); + EXPECT( + (addr % expected_align) == 0, + "nonpow2 malloc({}) returned {} not aligned to {}", + nonpow2_size, + nonpow2_ptrs[i], + expected_align); + } + + // Free interleaved. + for (size_t i = 0; i < COUNT; i++) + { + our_free(pow2_ptrs[i]); + our_free(nonpow2_ptrs[i]); + } +} + +void test_stack_remaining_bytes() +{ + printf(" test_stack_remaining_bytes\n"); + // Verify that remaining_bytes returns a large value for non-heap + // addresses (stack, globals). The sizeclass sentinel (raw value 0) + // must have slab_mask = SIZE_MAX so that bounds checks pass for + // non-snmalloc memory. Without the sentinel fix, remaining_bytes + // would return 0, causing false-positive bounds check failures. + char stack_buf[64]; + auto remaining = snmalloc::remaining_bytes( + snmalloc::address_cast(stack_buf)); + EXPECT( + remaining > sizeof(stack_buf), + "remaining_bytes on stack address returned {}, expected large value", + remaining); + + // Also check that alloc_size returns 0 for stack addresses.
+ auto sc = snmalloc::Config::Backend::template get_metaentry( + snmalloc::address_cast(stack_buf)) + .get_sizeclass(); + auto sz = snmalloc::sizeclass_full_to_size(sc); + EXPECT(sz == 0, "alloc_size on stack address returned {}, expected 0", sz); +} + +int main() +{ + printf("large_nonpow2: non-power-of-2 large allocation tests\n"); + setup(); + + test_alignment_and_usable_size(); + test_many_allocations(); + test_alloc_free_cycles(); + test_mixed_sizes(); + test_stack_remaining_bytes(); + + printf("large_nonpow2: all tests passed\n"); + return 0; +} diff --git a/src/test/func/sizeclass/sizeclass.cc b/src/test/func/sizeclass/sizeclass.cc index 836c62111..1e38ea300 100644 --- a/src/test/func/sizeclass/sizeclass.cc +++ b/src/test/func/sizeclass/sizeclass.cc @@ -150,4 +150,33 @@ int main(int, char**) abort(); test_align_size(); + + // Verify runtime to_exp_mant matches constexpr to_exp_mant_const. + { + bool exp_mant_failed = false; + for (size_t v = 1; v < 100000; v++) + { + auto ce = snmalloc::bits::to_exp_mant_const<2, 0>(v); + auto re = snmalloc::bits::to_exp_mant<2, 0>(v); + if (ce != re) + { + std::cout << "to_exp_mant mismatch at " << v << ": const=" << ce + << " runtime=" << re << std::endl; + exp_mant_failed = true; + } + auto ce4 = snmalloc::bits::to_exp_mant_const<2, 4>(v); + auto re4 = snmalloc::bits::to_exp_mant<2, 4>(v); + if (ce4 != re4) + { + std::cout << "to_exp_mant<2,4> mismatch at " << v << ": const=" << ce4 + << " runtime=" << re4 << std::endl; + exp_mant_failed = true; + } + } + if (exp_mant_failed) + { + std::cout << "to_exp_mant equivalence test FAILED" << std::endl; + abort(); + } + } }
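Reviewer note: the alignment expectations in the `large_nonpow2` test rest on `natural_align`, which isolates the lowest set bit of the size. The following is a minimal standalone sketch of that arithmetic, not part of the patch. It assumes a chunk size of 16 KiB (consistent with the "96 KiB = 6 chunks" comment in `test_mixed_sizes`), and `kMinChunkSize`, `next_pow2`, and `check_all_test_sizes` are hypothetical names introduced here for illustration; `next_pow2` is a simple stand-in for `bits::next_pow2`, not its actual implementation.

```cpp
#include <cassert>
#include <cstddef>

// Assumption: 16 KiB chunks, matching the "96 KiB = 6 chunks" test comment.
constexpr std::size_t kMinChunkSize = 16 * 1024;

// Largest power of two dividing s (isolates the lowest set bit).
constexpr std::size_t natural_align(std::size_t s)
{
  return s & (~(s - 1));
}

// Hypothetical stand-in for bits::next_pow2: smallest power of two >= s.
constexpr std::size_t next_pow2(std::size_t s)
{
  std::size_t p = 1;
  while (p < s)
    p <<= 1;
  return p;
}

// 6 chunks = 2 * 3 chunks, so the natural alignment is 2 chunks (32 KiB),
// while rounding up to a power of two would give 8 chunks (128 KiB).
static_assert(natural_align(6 * kMinChunkSize) == 2 * kMinChunkSize, "6 chunks");
static_assert(next_pow2(6 * kMinChunkSize) == 8 * kMinChunkSize, "6 chunks");
static_assert(natural_align(3 * kMinChunkSize) == kMinChunkSize, "3 chunks");
static_assert(natural_align(112 * kMinChunkSize) == 16 * kMinChunkSize, "112 chunks");

// Checks the same chunk counts the test iterates over.
void check_all_test_sizes()
{
  constexpr std::size_t chunks[] = {
    3, 5, 6, 7, 10, 12, 14, 20, 24, 28, 40, 48, 56, 80, 96, 112};
  for (std::size_t c : chunks)
  {
    std::size_t size = c * kMinChunkSize;
    std::size_t align = natural_align(size);
    assert(align != 0 && (align & (align - 1)) == 0); // a power of two
    assert(size % align == 0); // divides the size
    assert(size % (2 * align) != 0); // and is the largest such power
    assert(next_pow2(size) > size); // size is genuinely non-pow2
  }
}
```

Under these assumptions, every size in the test table has a natural alignment strictly smaller than its next power of two, which is why the test checks `usable < bits::next_pow2(size)` rather than pow2 alignment.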