This document outlines practical enhancements and optimizations for the VMM simulator.
Complexity: Medium
Impact: High educational value, realistic process creation
Implementation:
- Add
PTE_COWflag to page table entries - Implement reference counting in
FrameInfo(ref_countfield already present) - Fork operation: duplicate page table, mark all pages COW, increment frame ref counts
- On write to COW page: allocate new frame, copy data, update PTE
Benefits:
- Simulate efficient
fork()system call - Demonstrate space-time tradeoff in OS design
- Add new metric: COW faults
Code Locations:
src/pagetable.h: AddPTE_COWflagsrc/vmm.c: Modifyvmm_handle_page_fault()to detect COW faultssrc/frame.c: Implementframe_inc_ref(),frame_dec_ref()
Complexity: Medium
Impact: Multi-process communication simulation
Implementation:
- Create
SharedMemoryRegionstructure to track shared frames - Map same physical frames into multiple process page tables
- Use existing
pin_countinFrameInfofor reference tracking - Add
vmm_create_shared(),vmm_attach_shared()API
Benefits:
- Simulate IPC (Inter-Process Communication)
- Test TLB behavior with shared mappings
- Demonstrate memory efficiency of sharing
Code Locations:
src/vmm.h: Add shared memory structuressrc/vmm.c: Implement shared region management
Complexity: Medium-High
Impact: Realistic file I/O modeling
Implementation:
- Extend
PageTableEntrywith file descriptor and offset - On page fault: check if page is file-backed
- Load page from "file" (simulated) instead of zero page
- Track file-backed vs anonymous pages separately
Benefits:
- Simulate
mmap()system call - Demonstrate demand-paging for executables
- Add read-only vs read-write file mappings
Metrics:
- File-backed vs anonymous page faults
- Clean vs dirty file page evictions
Code Locations:
src/pagetable.h: Addfile_offset,file_idto PTEsrc/vmm.c: Extend page fault handler for file-backed pages
Complexity: Medium
Impact: Performance optimization simulation
Implementation:
- Support multiple page sizes (4KB, 2MB, 1GB)
- Extend page table to indicate page size
- Modify TLB to handle multiple page sizes
- Add heuristics for when to use large pages
Benefits:
- Reduce TLB pressure (one entry covers more memory)
- Lower page fault frequency
- Demonstrate modern OS optimization
Metrics:
- Large page vs small page usage
- TLB reach (total memory covered by TLB)
Code Locations:
src/pagetable.h: Add page size field to PTEsrc/tlb.c: Handle variable-size translationssrc/frame.c: Allocate contiguous frame groups
Complexity: Low-Medium
Impact: Performance optimization
Implementation:
- Detect sequential access patterns in trace
- Proactively load next N pages on sequential access
- Add configurable prefetch distance
- Track prefetch accuracy (useful vs wasted)
Benefits:
- Reduce apparent page fault latency
- Demonstrate speculative execution
- Trade memory for reduced faults
Metrics:
- Prefetch hits (used pages)
- Prefetch misses (evicted before use)
- Net latency reduction
Code Locations:
src/vmm.c: Add prefetch logic invmm_access()src/trace.c: Pattern detection utilities
Complexity: High
Impact: Modern multi-socket system simulation
Implementation:
- Partition physical frames into NUMA nodes
- Add latency penalty for remote node access
- Implement NUMA-aware page placement policy
- Add process-to-node affinity
Benefits:
- Simulate multi-socket servers
- Demonstrate locality importance
- Test NUMA-aware replacement
Metrics:
- Local vs remote memory accesses
- Average memory latency (node-aware)
- Node balance (frame distribution)
Code Locations:
src/frame.h: Addnuma_nodetoFrameInfosrc/replacement.c: Prefer local framessrc/vmm.c: Add node latency to AMT calculation
Complexity: Medium
Impact: 64-bit address space simulation
Implementation:
- Support 3-level and 4-level page tables (x86-64 style)
- Implement PML4 → PDPT → PD → PT hierarchy
- Lazy allocation of intermediate levels
- Compare space overhead vs access time
Benefits:
- Simulate modern 64-bit architectures
- Demonstrate scalability to large address spaces
- Show space-time tradeoff
Code Locations:
src/pagetable.h: AddPT_THREE_LEVEL,PT_FOUR_LEVELsrc/pagetable.c: Implement multi-level walk
Complexity: Medium
Impact: Improved replacement policy
Implementation:
- Combine working set model with Clock algorithm
- Track time of last use and age of page
- Evict pages outside working set first
- Add configurable working set window (τ)
Benefits:
- Better than pure Clock for variable workloads
- Prevent young, frequently-used pages from eviction
- Demonstrate advanced OS research
Algorithm:
For each frame in clock order:
age = current_time - last_use_time
if age > τ and not dirty:
evict
elif age > τ and dirty:
schedule write-back, continue
else:
give second chance
Code Locations:
src/replacement.c: AddREPLACE_WSCLOCK
Problem: Two-level PT wastes space for sparse address spaces
Solution: Hash table mapping VPN → PTE
Tradeoffs:
- Pros: O(1) average lookup, minimal space for sparse mappings
- Cons: Hash collisions, slightly slower worst case
Implementation:
typedef struct {
uint64_t vpn;
uint32_t frame_number;
uint32_t flags;
UT_hash_handle hh; // uthash library
} HashedPTE;Problem: Page tables scale with virtual space, not physical
Solution: One global table indexed by frame number
Tradeoffs:
- Pros: O(physical_frames) space, not O(virtual_pages)
- Cons: Slower lookup (need hash or search)
Use Case: Systems with very large virtual address spaces
Problem: Single-threaded simulation limits performance
Solution: Parallelize independent components
Parallelization Points:
- Process page tables (no sharing)
- Metrics collection (lock-free counters)
- Trace execution (partition trace by PID)
Challenges:
- TLB and frame allocator need synchronization
- Lock-free algorithms for hot paths
Problem: frame_age_all() iterates over all frames
Solution: Use SIMD to age 4-8 frames per instruction
Implementation:
#include <immintrin.h> // AVX2
void frame_age_all_simd(FrameAllocator *allocator) {
__m256i *age_vec = (__m256i *)allocator->frames[...].age_counter;
__m256i shift = _mm256_set1_epi32(1);
for (uint32_t i = 0; i < allocator->total_frames / 8; i++) {
age_vec[i] = _mm256_srli_epi32(age_vec[i], 1);
// OR with reference bits...
}
}Problem: Poor cache locality in linked structures
Solution: Structure-of-Arrays (SoA) instead of Array-of-Structures (AoS)
Example:
// Instead of:
FrameInfo frames[N]; // Hot + cold fields interleaved
// Use:
struct {
uint32_t frame_numbers[N];
uint32_t pids[N];
uint64_t vpns[N];
// ... hot fields
} hot_frame_data;
struct {
uint32_t pin_counts[N];
// ... cold fields
} cold_frame_data;Steps:
- Build with instrumentation:
gcc -fprofile-generate - Run representative traces
- Rebuild with profile data:
gcc -fprofile-use
Expected: 10-20% speedup from better branch prediction and inlining
Problem: Invalidating many entries one-by-one is slow
Solution: Batch invalidation with bloom filter or bitmap
Implementation:
void tlb_invalidate_batch(TLB *tlb, uint32_t pid, uint64_t *vpns, uint32_t count) {
// Use bitmap for fast membership test
for (uint32_t i = 0; i < tlb->size; i++) {
if (tlb->entries[i].pid == pid && in_vpn_set(tlb->entries[i].vpn, vpns, count)) {
tlb->entries[i].valid = false;
}
}
}Concept: Generate machine code for replacement policy at runtime
Tradeoffs:
- Pros: Eliminate function call overhead, inline policy logic
- Cons: Complexity, portability issues
Better Alternative: Link-Time Optimization (LTO)
gcc -flto -O3 src/*.c -o bin/vmm# Generate massive trace
./bin/trace_gen -t random -n 10000000 -o stress.trace
# Run with memory constraints
ulimit -v 2097152 # Limit to 2GB
./bin/vmm -r 1024 -t stress.trace -a LRUCompare against other simulators:
- SimpleScalar
- gem5
- Custom simulators
Metrics:
- Accuracy (page fault count match)
- Performance (simulation speed)
- Scalability (max trace size)
# Use AFL to fuzz trace parser
afl-gcc -o vmm_fuzz src/*.c
afl-fuzz -i traces/ -o fuzz_results/ ./vmm_fuzz -r 64 -t @@from ctypes import *
vmm_lib = CDLL('./bin/libvmm.so')
# Create VMM
vmm_lib.vmm_create.restype = c_void_p
vmm = vmm_lib.vmm_create(byref(config))
# Access memory
vmm_lib.vmm_access(vmm, pid, addr, is_write)Concept: Use simulator as user-space page fault handler
Steps:
- Intercept page faults with
userfaultfd()(Linux) - Forward to VMM simulator
- Resolve fault and resume execution
Use Case: Test replacement algorithms with real workloads
Components:
- Backend: VMM simulator with REST API
- Frontend: React/Vue for visualization
- Features:
- Real-time page table visualization
- TLB state animation
- Metrics dashboard
| Extension | Complexity | Educational Value | Performance Impact |
|---|---|---|---|
| Copy-on-Write | Medium | ⭐⭐⭐⭐⭐ | Neutral |
| Shared Memory | Medium | ⭐⭐⭐⭐ | Neutral |
| Memory-Mapped Files | Medium-High | ⭐⭐⭐⭐ | Neutral |
| Large Pages | Medium | ⭐⭐⭐⭐ | Positive |
| Prefetching | Low-Medium | ⭐⭐⭐ | Positive |
| NUMA | High | ⭐⭐⭐⭐⭐ | Negative (complexity) |
| Multi-Level PT | Medium | ⭐⭐⭐⭐ | Neutral |
| WSClock | Medium | ⭐⭐⭐ | Positive |
- Copy-on-Write - Fundamental OS concept, builds on existing code
- Large Pages - Significant performance teaching moment
- Prefetching - Easy to implement, clear performance benefit
- Shared Memory - Natural extension of COW
- Memory-Mapped Files - Realistic I/O simulation
- Multi-Level Page Tables - Scalability demonstration
- NUMA - Advanced topic for graduate students
- WSClock - Algorithm research
To implement an extension:
- Create feature branch:
git checkout -b feature/cow - Implement with tests
- Update documentation
- Submit pull request with benchmarks
- Operating Systems: Three Easy Pieces (OSTEP) - Free online book
- Linux kernel source:
mm/directory - FreeBSD VM subsystem documentation
- Research papers on modern VM techniques