
Architecture Documentation

Overview

This document explains the internal architecture of fastalloc, including design decisions, data structures, and implementation details.

Core Components

1. Pool Types

FixedPool

Purpose: Pre-allocated fixed-size pool with O(1) operations.

Structure:

pub struct FixedPool<T> {
    storage: RefCell<Vec<MaybeUninit<T>>>,
    allocator: RefCell<StackAllocator>,
    capacity: usize,
    config: PoolConfig<T>,
}

Design rationale:

  • Vec<MaybeUninit<T>>: Allows uninitialized memory until explicitly allocated
  • RefCell: Provides interior mutability for single-threaded use
  • StackAllocator: LIFO allocation for cache-friendly access patterns

Trade-offs:

  • ✅ Fastest allocation (~3.5ns)
  • ✅ Zero fragmentation
  • ✅ Predictable memory usage
  • ❌ Fixed capacity (cannot grow)
  • ❌ Not thread-safe
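
The allocation path these pieces enable looks roughly like the following sketch (illustrative names and signatures, not the crate's actual API):

use std::mem::MaybeUninit;

// Illustrative only: pop the most recently freed slot, initialize it in place,
// and return the index a handle would wrap.
fn fixed_pool_allocate_sketch<T>(
    storage: &mut [MaybeUninit<T>],
    free_stack: &mut Vec<usize>,
    value: T,
) -> Option<usize> {
    let index = free_stack.pop()?;   // O(1) LIFO: reuse the hottest slot first
    storage[index].write(value);     // slot is uninitialized until this write
    Some(index)
}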

GrowingPool

Purpose: Dynamic pool that grows on demand.

Structure:

pub struct GrowingPool<T> {
    storage: RefCell<Vec<Vec<MaybeUninit<T>>>>,
    allocator: RefCell<FreeListAllocator>,
    capacity: RefCell<usize>,
    chunk_boundaries: RefCell<Vec<usize>>,
    config: PoolConfig<T>,
}

Design rationale:

  • Vec<Vec<...>>: Chunks never move, making pointers stable
  • FreeListAllocator: Better for random access than stack
  • chunk_boundaries: Cached for O(log n) binary search instead of O(n) linear scan

Growth strategy:

  • Linear: Add fixed amount each time
  • Exponential: Multiply by factor (e.g., 2x)
  • Custom: User-defined function
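
One way to model these strategies (a hypothetical sketch; the crate's actual configuration type may differ):

// Hypothetical sketch of a growth policy and how the new total capacity is derived.
enum GrowthStrategy {
    Linear(usize),              // add a fixed number of slots
    Exponential(usize),         // multiply the current capacity by a factor
    Custom(fn(usize) -> usize), // user-defined: current capacity -> new total capacity
}

fn grown_capacity(strategy: &GrowthStrategy, current: usize) -> usize {
    match strategy {
        GrowthStrategy::Linear(amount) => current + amount,
        GrowthStrategy::Exponential(factor) => current * factor,
        GrowthStrategy::Custom(f) => f(current),
    }
}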

Trade-offs:

  • ✅ Automatic growth
  • ✅ No fixed capacity limit
  • ✅ Stable pointers across growth
  • ❌ Slightly slower (~4.6ns)
  • ❌ More complex memory layout

ThreadSafePool

Purpose: Thread-safe pool for concurrent access.

Structure:

pub struct ThreadSafePool<T> {
    inner: Arc<Mutex<GrowingPool<T>>>,
}

pub struct ThreadSafeHandle<T> {
    pool: Arc<Mutex<GrowingPool<T>>>,
    index: usize,
    cached_ptr: *mut T,  // Key optimization!
}

Key optimization: Cached pointer eliminates lock acquisition on every dereference.

Design rationale:

  • Arc<Mutex>: Shared ownership with exclusive access
  • Cached pointer: Lock-free deref after initial allocation
  • Only locks during allocate/deallocate
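
A minimal sketch of the lock-free deref, using the ThreadSafeHandle fields shown above (illustrative; the crate's actual impl may differ):

use std::ops::Deref;

impl<T> Deref for ThreadSafeHandle<T> {
    type Target = T;

    fn deref(&self) -> &T {
        // SAFETY (sketch): the slot stays alive and exclusively owned by this
        // handle for its whole lifetime, so the cached pointer remains valid.
        unsafe { &*self.cached_ptr }
    }
}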

Trade-offs:

  • ✅ Thread-safe
  • ✅ Lock-free deref (10-50x improvement)
  • ✅ Works with multiple threads
  • ❌ Mutex overhead on allocate/free
  • ❌ Cannot be Sync (contains raw pointer)

2. Allocators

StackAllocator (LIFO)

Strategy: Last-In-First-Out using a stack of free indices.

struct StackAllocator {
    free_stack: Vec<usize>,
    capacity: usize,
    allocated_bitmap: Vec<u64>,  // Debug only
}

Allocation: O(1) - Pop from stack
Deallocation: O(1) - Push to stack

Advantages:

  • Excellent cache locality (recently freed = hot in cache)
  • Simple implementation
  • Optimal for allocate/free patterns

Used by: FixedPool

FreeListAllocator

Strategy: Maintains list of free slot indices.

struct FreeListAllocator {
    free_list: Vec<usize>,
    capacity: usize,
    allocated_bitmap: Vec<u64>,  // Debug only
}

Allocation: O(1) - Pop from list
Deallocation: O(1) - Push to list

Advantages:

  • Good for random access patterns
  • Can extend capacity dynamically
  • No preference for recently freed slots

Used by: GrowingPool

BitmapAllocator

Strategy: Bit vector tracking allocation status.

struct BitmapAllocator {
    bitmap: Vec<u64>,
    capacity: usize,
    allocated: usize,
    next_free_hint: usize,
}

Allocation: O(n) worst case, O(1) typical (with hint)
Deallocation: O(1)
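
A sketch of the allocation scan (illustrative; it ignores the allocated counter and only uses the hint to pick a starting word):

// Illustrative: find a zero bit, starting at the word the hint points into.
fn bitmap_allocate_sketch(bitmap: &mut [u64], capacity: usize, hint: usize) -> Option<usize> {
    let words = bitmap.len();
    for i in 0..words {
        let w = (hint / 64 + i) % words;    // start near the hint, wrap around
        let free = !bitmap[w];              // 1-bits mark free slots in this word
        if free != 0 {
            let bit = free.trailing_zeros() as usize;
            let index = w * 64 + bit;
            if index < capacity {
                bitmap[w] |= 1u64 << bit;   // mark the slot allocated
                return Some(index);
            }
        }
    }
    None
}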

Advantages:

  • Minimal memory overhead (1 bit per slot)
  • Fast intrinsics (trailing_zeros)
  • Hint tracking for common patterns

Used by: Optional, for space-constrained scenarios

3. Handles

OwnedHandle

Purpose: RAII-based exclusive ownership of pool object.

pub struct OwnedHandle<'pool, T> {
    pool: &'pool dyn PoolInterface<T>,
    index: usize,
}

Lifetime safety:

  • 'pool lifetime prevents pool from being dropped
  • Borrow checker ensures exclusive access
  • Drop impl automatically returns object

Trait implementations:

  • Deref / DerefMut: Direct access to value
  • Drop: Returns object to pool
  • Debug, Display, PartialEq, Ord: For convenience
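
The RAII return path in sketch form (the release method on PoolInterface is assumed here for illustration):

// Illustrative: dropping the handle hands the slot back to the pool, which
// drops the value in place and marks the index free in its allocator.
impl<'pool, T> Drop for OwnedHandle<'pool, T> {
    fn drop(&mut self) {
        self.pool.release(self.index); // hypothetical PoolInterface method
    }
}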

SharedHandle

Purpose: Reference-counted shared access.

pub struct SharedHandle<'pool, T> {
    inner: Rc<OwnedHandle<'pool, T>>,
}

Use case: Multiple readers need access to same pooled object.

Trade-off: Reference counting overhead vs flexibility.

WeakHandle

Purpose: Non-owning reference that may become invalid.

pub struct WeakHandle<'pool, T> {
    inner: Weak<OwnedHandle<'pool, T>>,
}

Use case: Breaking reference cycles, observer patterns.
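
Upgrading follows the standard Weak pattern (sketch; the method name is assumed):

impl<'pool, T> WeakHandle<'pool, T> {
    // Illustrative: yields a SharedHandle only while the object is still alive.
    fn upgrade(&self) -> Option<SharedHandle<'pool, T>> {
        self.inner.upgrade().map(|inner| SharedHandle { inner })
    }
}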

Memory Overhead Analysis

Per-Object Overhead

FixedPool (per object):

  • Storage: size_of::<T>() bytes
  • No per-object metadata (objects stored directly)
  • Total: Exactly size_of::<T>() bytes per slot

Allocator Overhead (one-time, shared across all objects):

Allocator Type     | Per-Pool Overhead           | Formula
StackAllocator     | capacity * 8 + bitmap_size  | Vec of free indices + optional debug bitmap
FreeListAllocator  | capacity * 8 + bitmap_size  | Vec of free indices + optional debug bitmap
BitmapAllocator    | (capacity + 63) / 64 * 8    | One bit per slot (packed in u64s)

Debug Mode Additional Overhead:

  • StackAllocator: (capacity + 63) / 64 * 8 bytes for bitmap
  • FreeListAllocator: (capacity + 63) / 64 * 8 bytes for bitmap
  • Release mode: No additional overhead

Example Calculation (FixedPool with 1000 i32 objects):

Object storage: 1000 * 4 = 4,000 bytes
StackAllocator: 1000 * 8 = 8,000 bytes (indices)
Debug bitmap: (1000 + 63) / 64 * 8 = 128 bytes
Total: 12,128 bytes
Overhead: 8,128 / 12,128 = 67% (but amortized across all objects)
Per-object view: 8,128 / 1000 = 8.128 bytes overhead per slot

For larger objects (e.g., 256-byte structs):

Object storage: 1000 * 256 = 256,000 bytes
Allocator overhead: 8,128 bytes
Total: 264,128 bytes
Overhead: 8,128 / 264,128 = 3.1%

Conclusion: Allocator overhead is a fixed ~8 bytes per slot (plus the small debug bitmap), so it shrinks as objects grow: roughly 3% for 256-byte objects and under 5% once objects exceed about 160 bytes.
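
The per-pool figures above can be reproduced with a small helper (a back-of-the-envelope sketch, not part of the crate):

// Sketch: per-pool overhead of a stack / free-list allocator.
fn allocator_overhead_bytes(capacity: usize, debug_bitmap: bool) -> usize {
    let index_bytes = capacity * std::mem::size_of::<usize>();  // free_stack / free_list entries
    let bitmap_bytes = if debug_bitmap { (capacity + 63) / 64 * 8 } else { 0 };
    index_bytes + bitmap_bytes
}

fn main() {
    // Matches the 1000-slot example: 8,000 + 128 = 8,128 bytes.
    assert_eq!(allocator_overhead_bytes(1000, true), 8_128);
}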

Handle Overhead

OwnedHandle:

  • Size: size_of::<&dyn Trait>() + size_of::<usize>() = 16 + 8 = 24 bytes (on 64-bit)
  • No heap allocation (stored on stack)

ThreadSafeHandle:

  • Size: Arc (8 bytes, a single pointer) + usize (8 bytes) + *mut T (8 bytes) = 24 bytes (on 64-bit)
  • Arc overhead: Shared, atomically reference-counted control block on the heap

SharedHandle:

  • Size: A single Rc pointer (8 bytes on 64-bit); the wrapped OwnedHandle lives in the Rc's heap allocation
  • Rc overhead: Non-atomic strong/weak reference counters stored alongside the handle

GrowingPool Overhead

Additional overhead:

  • Vec<Vec<MaybeUninit<T>>>: Outer Vec grows, ~24 bytes per chunk pointer
  • chunk_boundaries: Vec<usize>: 8 bytes per chunk
  • Per chunk: ~32 bytes of Vec metadata

Example (GrowingPool with 3 chunks: 100, 200, 400 capacity):

Chunk 0 storage: 100 * size_of::<T>()
Chunk 1 storage: 200 * size_of::<T>()
Chunk 2 storage: 400 * size_of::<T>()
Chunk pointers: 3 * 24 = 72 bytes
Boundaries: 3 * 8 = 24 bytes
Allocator: (100+200+400) * 8 = 5,600 bytes
Total overhead: ~5,696 bytes + 3 * ~32 = ~5,792 bytes

Memory Efficiency Tips

  1. Use larger pool capacities (overhead amortizes better)
  2. Use FixedPool when capacity is known (less overhead than GrowingPool)
  3. For tiny objects (< 8 bytes), consider batching or using larger container types
  4. In release mode, debug bitmap is compiled out

Memory Layout

FixedPool Memory Layout

FixedPool<T>
├── storage: Vec<MaybeUninit<T>>
│   ├── [0]: MaybeUninit<T>  ← Object or uninitialized
│   ├── [1]: MaybeUninit<T>
│   └── [n-1]: MaybeUninit<T>
├── allocator: StackAllocator
│   └── free_stack: [n-1, n-2, ..., 1, 0]  ← Available indices
└── capacity: n

Key properties:

  • Contiguous memory block
  • Fixed size, never moves
  • Pointers stable for pool lifetime

GrowingPool Memory Layout

GrowingPool<T>
├── storage: Vec<Vec<MaybeUninit<T>>>
│   ├── Chunk 0: [MaybeUninit<T>; initial_cap]
│   ├── Chunk 1: [MaybeUninit<T>; growth_amount]
│   └── Chunk 2: [MaybeUninit<T>; growth_amount]
├── chunk_boundaries: [initial_cap, initial_cap + g1, initial_cap + g1 + g2]
├── allocator: FreeListAllocator
└── capacity: total across chunks

Index to pointer mapping:

fn compute_chunk_location(chunk_boundaries: &[usize], index: usize) -> (usize, usize) {
    // Binary search over the cumulative boundaries: first chunk whose end exceeds `index`.
    let chunk_idx = chunk_boundaries.partition_point(|&end| end <= index);
    let offset = if chunk_idx == 0 {
        index
    } else {
        index - chunk_boundaries[chunk_idx - 1]
    };
    (chunk_idx, offset)
}
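
For example, with chunk_boundaries = [100, 300, 700] (chunks of 100, 200, and 400 slots), index 42 maps to chunk 0 at offset 42, and index 250 maps to chunk 1 at offset 150.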

Key properties:

  • Chunks never move individually
  • Pointers remain valid across growth
  • O(log n) lookup with cached boundaries

Unsafe Code Justification

Pattern 1: Extending RefCell Borrow Lifetime

pub(crate) fn get(&self, index: usize) -> &T {
    let storage = self.storage.borrow();
    unsafe {
        let ptr = storage.as_ptr();
        &*ptr.add(index).cast::<T>()
    }
}

Why unsafe?: Returns reference that outlives the borrow.

Why safe?:

  1. Pool owns the storage (won't be freed)
  2. Index is valid (checked by allocator)
  3. Handle prevents concurrent mutable access
  4. Memory is initialized before handle creation

Pattern 2: Mutable from Immutable

pub(crate) fn get_mut(&self, index: usize) -> &mut T {
    let mut storage = self.storage.borrow_mut();
    unsafe {
        let ptr = storage.as_mut_ptr();
        &mut *ptr.add(index).cast::<T>()
    }
}

Why unsafe?: &mut T from &self.

Why safe?:

  1. RefCell provides interior mutability
  2. Only one handle per slot exists
  3. Borrow checker ensures no aliasing via handle

Pattern 3: MaybeUninit Access

// Allocation
storage[index].write(value);  // Initialize
Ok(OwnedHandle::new(self, index))  // Now safe to read

// Deallocation  
let value_ptr = storage[index].as_mut_ptr();
(*value_ptr).on_release();
ptr::drop_in_place(value_ptr);  // Drop before reuse

Why safe?:

  1. Write before any read
  2. Drop before reuse
  3. Handle ensures no access after drop

Performance Optimizations

1. Cached Chunk Boundaries (GrowingPool)

Before: O(n) linear scan through chunks
After: O(log n) binary search

Implementation:

chunk_boundaries: RefCell<Vec<usize>>  // [100, 200, 400, 800]

Updated on growth, used in compute_chunk_location.

2. Lock-Free Deref (ThreadSafeHandle)

Before: Lock on every dereference
After: Cached pointer, lock only on alloc/dealloc

Impact: 10-50x improvement for handle access patterns

3. Bitmap Double-Free Check

Before: O(n) contains() in FreeListAllocator
After: O(1) bitmap check

Impact: Eliminates debug build overhead for large pools
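
In sketch form, the check reduces to a single bit test before freeing (illustrative):

// Illustrative: one bit test replaces an O(n) scan of the free list.
fn is_allocated(bitmap: &[u64], index: usize) -> bool {
    (bitmap[index / 64] >> (index % 64)) & 1 == 1
}

fn deallocate_checked(bitmap: &mut [u64], index: usize) {
    assert!(is_allocated(bitmap, index), "double free of slot {index}");
    bitmap[index / 64] &= !(1u64 << (index % 64)); // clear: slot is free again
}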

4. Inline Annotations

Critical path functions marked with #[inline]:

  • Allocator methods
  • Handle deref/deref_mut
  • Poolable trait hooks

5. Pre-allocation with Capacity

let mut free_stack = Vec::with_capacity(capacity);
for i in (0..capacity).rev() {
    free_stack.push(i);
}

Note: Equivalent to (0..capacity).rev().collect(), which also allocates exactly once thanks to the iterator's exact size hint; the explicit with_capacity loop simply makes the single up-front allocation obvious.

Testing Strategy

Unit Tests

  • Allocator tests: Verify allocation/deallocation logic
  • Pool tests: Test capacity, growth, reuse
  • Handle tests: RAII behavior, drop semantics
  • Safety tests: Borrow checker violations (compile failures)

Integration Tests

  • Multi-threaded scenarios
  • Edge cases (capacity limits, exhaustion)
  • Drop behavior and cleanup

Benchmark Tests

  • Criterion.rs for statistical rigor
  • Comparison with Box allocation
  • Various pool sizes and types

Property-Based Testing (Future)

  • QuickCheck/proptest for invariant checking
  • Fuzz testing with cargo-fuzz

Design Principles

  1. Safety First: Leverage Rust's type system, minimal unsafe
  2. Zero-Cost Abstractions: Performance comparable to hand-written code
  3. Ergonomic API: RAII handles, builder pattern, sensible defaults
  4. Composability: Multiple pool types for different use cases
  5. Transparency: Well-documented unsafe code with justification

Future Improvements

  • Lock-free data structures (experimental lock-free feature)
  • NUMA-aware allocation for large systems
  • Custom allocator backend support
  • Allocation batching for reduced overhead
  • Per-thread caching for thread-safe pool

References