Skip to content

Latest commit

 

History

History
496 lines (353 loc) · 11.8 KB

File metadata and controls

496 lines (353 loc) · 11.8 KB

VMM Extensions & Optimization Roadmap

This document outlines practical enhancements and optimizations for the VMM simulator.


8 Practical Next Enhancements

1. Copy-on-Write (COW) Fork Semantics

Complexity: Medium
Impact: High educational value, realistic process creation

Implementation:

  • Add PTE_COW flag to page table entries
  • Implement reference counting in FrameInfo (ref_count field already present)
  • Fork operation: duplicate page table, mark all pages COW, increment frame ref counts
  • On write to COW page: allocate new frame, copy data, update PTE

Benefits:

  • Simulate efficient fork() system call
  • Demonstrate space-time tradeoff in OS design
  • Add new metric: COW faults

Code Locations:

  • src/pagetable.h: Add PTE_COW flag
  • src/vmm.c: Modify vmm_handle_page_fault() to detect COW faults
  • src/frame.c: Implement frame_inc_ref(), frame_dec_ref()

2. Shared Memory Regions

Complexity: Medium
Impact: Multi-process communication simulation

Implementation:

  • Create SharedMemoryRegion structure to track shared frames
  • Map same physical frames into multiple process page tables
  • Use existing pin_count in FrameInfo for reference tracking
  • Add vmm_create_shared(), vmm_attach_shared() API

Benefits:

  • Simulate IPC (Inter-Process Communication)
  • Test TLB behavior with shared mappings
  • Demonstrate memory efficiency of sharing

Code Locations:

  • src/vmm.h: Add shared memory structures
  • src/vmm.c: Implement shared region management

3. Memory-Mapped File Simulation

Complexity: Medium-High
Impact: Realistic file I/O modeling

Implementation:

  • Extend PageTableEntry with file descriptor and offset
  • On page fault: check if page is file-backed
  • Load page from "file" (simulated) instead of zero page
  • Track file-backed vs anonymous pages separately

Benefits:

  • Simulate mmap() system call
  • Demonstrate demand-paging for executables
  • Add read-only vs read-write file mappings

Metrics:

  • File-backed vs anonymous page faults
  • Clean vs dirty file page evictions

Code Locations:

  • src/pagetable.h: Add file_offset, file_id to PTE
  • src/vmm.c: Extend page fault handler for file-backed pages

4. Large Pages (Huge Pages)

Complexity: Medium
Impact: Performance optimization simulation

Implementation:

  • Support multiple page sizes (4KB, 2MB, 1GB)
  • Extend page table to indicate page size
  • Modify TLB to handle multiple page sizes
  • Add heuristics for when to use large pages

Benefits:

  • Reduce TLB pressure (one entry covers more memory)
  • Lower page fault frequency
  • Demonstrate modern OS optimization

Metrics:

  • Large page vs small page usage
  • TLB reach (total memory covered by TLB)

Code Locations:

  • src/pagetable.h: Add page size field to PTE
  • src/tlb.c: Handle variable-size translations
  • src/frame.c: Allocate contiguous frame groups

5. Prefetching and Speculation

Complexity: Low-Medium
Impact: Performance optimization

Implementation:

  • Detect sequential access patterns in trace
  • Proactively load next N pages on sequential access
  • Add configurable prefetch distance
  • Track prefetch accuracy (useful vs wasted)

Benefits:

  • Reduce apparent page fault latency
  • Demonstrate speculative execution
  • Trade memory for reduced faults

Metrics:

  • Prefetch hits (used pages)
  • Prefetch misses (evicted before use)
  • Net latency reduction

Code Locations:

  • src/vmm.c: Add prefetch logic in vmm_access()
  • src/trace.c: Pattern detection utilities

6. NUMA (Non-Uniform Memory Access) Simulation

Complexity: High
Impact: Modern multi-socket system simulation

Implementation:

  • Partition physical frames into NUMA nodes
  • Add latency penalty for remote node access
  • Implement NUMA-aware page placement policy
  • Add process-to-node affinity

Benefits:

  • Simulate multi-socket servers
  • Demonstrate locality importance
  • Test NUMA-aware replacement

Metrics:

  • Local vs remote memory accesses
  • Average memory latency (node-aware)
  • Node balance (frame distribution)

Code Locations:

  • src/frame.h: Add numa_node to FrameInfo
  • src/replacement.c: Prefer local frames
  • src/vmm.c: Add node latency to AMT calculation

7. Multi-level (3+ Level) Page Tables

Complexity: Medium
Impact: 64-bit address space simulation

Implementation:

  • Support 3-level and 4-level page tables (x86-64 style)
  • Implement PML4 → PDPT → PD → PT hierarchy
  • Lazy allocation of intermediate levels
  • Compare space overhead vs access time

Benefits:

  • Simulate modern 64-bit architectures
  • Demonstrate scalability to large address spaces
  • Show space-time tradeoff

Code Locations:

  • src/pagetable.h: Add PT_THREE_LEVEL, PT_FOUR_LEVEL
  • src/pagetable.c: Implement multi-level walk

8. Advanced Replacement: Working Set Clock (WSClock)

Complexity: Medium
Impact: Improved replacement policy

Implementation:

  • Combine working set model with Clock algorithm
  • Track time of last use and age of page
  • Evict pages outside working set first
  • Add configurable working set window (τ)

Benefits:

  • Better than pure Clock for variable workloads
  • Prevent young, frequently-used pages from eviction
  • Demonstrate advanced OS research

Algorithm:

For each frame in clock order:
    age = current_time - last_use_time
    if age > τ and not dirty:
        evict
    elif age > τ and dirty:
        schedule write-back, continue
    else:
        give second chance

Code Locations:

  • src/replacement.c: Add REPLACE_WSCLOCK

Performance Optimizations

Opt-1: Hash Table for Sparse Page Tables

Problem: Two-level PT wastes space for sparse address spaces

Solution: Hash table mapping VPN → PTE

Tradeoffs:

  • Pros: O(1) average lookup, minimal space for sparse mappings
  • Cons: Hash collisions, slightly slower worst case

Implementation:

typedef struct {
    uint64_t vpn;
    uint32_t frame_number;
    uint32_t flags;
    UT_hash_handle hh;  // uthash library
} HashedPTE;

Opt-2: Inverted Page Table

Problem: Page tables scale with virtual space, not physical

Solution: One global table indexed by frame number

Tradeoffs:

  • Pros: O(physical_frames) space, not O(virtual_pages)
  • Cons: Slower lookup (need hash or search)

Use Case: Systems with very large virtual address spaces


Opt-3: Multi-threading Support

Problem: Single-threaded simulation limits performance

Solution: Parallelize independent components

Parallelization Points:

  • Process page tables (no sharing)
  • Metrics collection (lock-free counters)
  • Trace execution (partition trace by PID)

Challenges:

  • TLB and frame allocator need synchronization
  • Lock-free algorithms for hot paths

Opt-4: SIMD for Aging

Problem: frame_age_all() iterates over all frames

Solution: Use SIMD to age 4-8 frames per instruction

Implementation:

#include <immintrin.h>  // AVX2

void frame_age_all_simd(FrameAllocator *allocator) {
    __m256i *age_vec = (__m256i *)allocator->frames[...].age_counter;
    __m256i shift = _mm256_set1_epi32(1);
    
    for (uint32_t i = 0; i < allocator->total_frames / 8; i++) {
        age_vec[i] = _mm256_srli_epi32(age_vec[i], 1);
        // OR with reference bits...
    }
}

Opt-5: Cache-Conscious Data Layout

Problem: Poor cache locality in linked structures

Solution: Structure-of-Arrays (SoA) instead of Array-of-Structures (AoS)

Example:

// Instead of:
FrameInfo frames[N];  // Hot + cold fields interleaved

// Use:
struct {
    uint32_t frame_numbers[N];
    uint32_t pids[N];
    uint64_t vpns[N];
    // ... hot fields
} hot_frame_data;

struct {
    uint32_t pin_counts[N];
    // ... cold fields
} cold_frame_data;

Opt-6: Profile-Guided Optimization (PGO)

Steps:

  1. Build with instrumentation: gcc -fprofile-generate
  2. Run representative traces
  3. Rebuild with profile data: gcc -fprofile-use

Expected: 10-20% speedup from better branch prediction and inlining


Opt-7: Batch TLB Invalidation

Problem: Invalidating many entries one-by-one is slow

Solution: Batch invalidation with bloom filter or bitmap

Implementation:

void tlb_invalidate_batch(TLB *tlb, uint32_t pid, uint64_t *vpns, uint32_t count) {
    // Use bitmap for fast membership test
    for (uint32_t i = 0; i < tlb->size; i++) {
        if (tlb->entries[i].pid == pid && in_vpn_set(tlb->entries[i].vpn, vpns, count)) {
            tlb->entries[i].valid = false;
        }
    }
}

Opt-8: JIT Compilation of Replacement Policy

Concept: Generate machine code for replacement policy at runtime

Tradeoffs:

  • Pros: Eliminate function call overhead, inline policy logic
  • Cons: Complexity, portability issues

Better Alternative: Link-Time Optimization (LTO)

gcc -flto -O3 src/*.c -o bin/vmm

Testing Extensions

Test-1: Stress Testing Framework

# Generate massive trace
./bin/trace_gen -t random -n 10000000 -o stress.trace

# Run with memory constraints
ulimit -v 2097152  # Limit to 2GB
./bin/vmm -r 1024 -t stress.trace -a LRU

Test-2: Comparative Benchmarking

Compare against other simulators:

  • SimpleScalar
  • gem5
  • Custom simulators

Metrics:

  • Accuracy (page fault count match)
  • Performance (simulation speed)
  • Scalability (max trace size)

Test-3: Fuzzing for Robustness

# Use AFL to fuzz trace parser
afl-gcc -o vmm_fuzz src/*.c
afl-fuzz -i traces/ -o fuzz_results/ ./vmm_fuzz -r 64 -t @@

Deployment & Integration

Deploy-1: Python Bindings (ctypes/cffi)

from ctypes import *

vmm_lib = CDLL('./bin/libvmm.so')

# Create VMM
vmm_lib.vmm_create.restype = c_void_p
vmm = vmm_lib.vmm_create(byref(config))

# Access memory
vmm_lib.vmm_access(vmm, pid, addr, is_write)

Deploy-2: Kernel Module (Real OS Integration)

Concept: Use simulator as user-space page fault handler

Steps:

  1. Intercept page faults with userfaultfd() (Linux)
  2. Forward to VMM simulator
  3. Resolve fault and resume execution

Use Case: Test replacement algorithms with real workloads


Deploy-3: Web-Based Visualizer

Components:

  • Backend: VMM simulator with REST API
  • Frontend: React/Vue for visualization
  • Features:
    • Real-time page table visualization
    • TLB state animation
    • Metrics dashboard

Summary Table

Extension Complexity Educational Value Performance Impact
Copy-on-Write Medium ⭐⭐⭐⭐⭐ Neutral
Shared Memory Medium ⭐⭐⭐⭐ Neutral
Memory-Mapped Files Medium-High ⭐⭐⭐⭐ Neutral
Large Pages Medium ⭐⭐⭐⭐ Positive
Prefetching Low-Medium ⭐⭐⭐ Positive
NUMA High ⭐⭐⭐⭐⭐ Negative (complexity)
Multi-Level PT Medium ⭐⭐⭐⭐ Neutral
WSClock Medium ⭐⭐⭐ Positive

Recommended Implementation Order

  1. Copy-on-Write - Fundamental OS concept, builds on existing code
  2. Large Pages - Significant performance teaching moment
  3. Prefetching - Easy to implement, clear performance benefit
  4. Shared Memory - Natural extension of COW
  5. Memory-Mapped Files - Realistic I/O simulation
  6. Multi-Level Page Tables - Scalability demonstration
  7. NUMA - Advanced topic for graduate students
  8. WSClock - Algorithm research

Contributing

To implement an extension:

  1. Create feature branch: git checkout -b feature/cow
  2. Implement with tests
  3. Update documentation
  4. Submit pull request with benchmarks

References

  • Operating Systems: Three Easy Pieces (OSTEP) - Free online book
  • Linux kernel source: mm/ directory
  • FreeBSD VM subsystem documentation
  • Research papers on modern VM techniques