
Request Cancellation, GPU-Resident Workflows, and Wine/Proton Integration

Status: Planning Phase
Priority: Medium to High
Target: Complete feature set for production DirectStorage-style runtime


Part 1: Request Cancellation

1.1 Current State

What Exists:

  • RequestStatus enum with values: Pending, InProgress, Complete, Failed
  • No Cancelled status
  • No cancel() method on Queue
  • No cancellation support in backends

What's Missing:

  • ❌ RequestStatus::Cancelled enum value
  • ❌ Queue::cancel_request(request_id) method
  • ❌ In-flight request tracking for cancellation
  • ❌ Backend cancellation hooks
  • ❌ Race condition handling (completion vs cancellation)

1.2 Design Requirements

Use Cases:

  1. Timeout: Cancel requests that take too long
  2. User Action: User cancels loading operation
  3. Priority Change: Cancel low-priority work to start high-priority
  4. Shutdown: Cancel all in-flight requests on cleanup

Semantics:

// Strong guarantee: the request will not complete after cancel() returns true
bool cancel(request_id_t id);

// Weak guarantee: the request may still complete, but its callback will not fire
bool try_cancel(request_id_t id);
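
As a usage sketch, the timeout case (use case 1) could look like the following; make_read_request and the polling loop are illustrative only, not part of the planned API:

// Hypothetical caller-side timeout (names like make_read_request are not part
// of the planned API; status polling is simplified for brevity).
Request req = make_read_request(fd, dst, size, offset);
queue.enqueue(req);

auto deadline = std::chrono::steady_clock::now() + std::chrono::milliseconds(50);
while (req.status == RequestStatus::Pending ||
       req.status == RequestStatus::InProgress) {
    if (std::chrono::steady_clock::now() > deadline) {
        queue.cancel_request(req.id); // best effort: may lose the race with completion
        break;
    }
    std::this_thread::yield();
}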

1.3 Implementation Plan

Phase 1: API Design

Add to Request:

struct Request {
    // Existing fields...
    std::atomic<bool> cancellation_requested = false;
    request_id_t id = 0; // Unique ID for tracking
};

Add to Queue:

class Queue {
public:
    // Cancel specific request (returns true if cancelled before completion)
    bool cancel_request(request_id_t id);
    
    // Cancel all pending requests (not yet submitted)
    size_t cancel_all_pending();
    
    // Cancel all requests (including in-flight)
    size_t cancel_all();
};

Add Status:

enum class RequestStatus {
    Pending,
    InProgress,
    Complete,
    Failed,
    Cancelled  // NEW
};

Phase 2: Queue Implementation

Request Tracking:

class Queue::Impl {
    std::unordered_map<request_id_t, Request*> active_requests_;
    std::mutex active_mutex_;
    std::atomic<request_id_t> next_id_{1};
};

void Queue::enqueue(Request& req) {
    req.id = impl_->next_id_.fetch_add(1);
    {
        std::lock_guard lock(impl_->active_mutex_);
        impl_->active_requests_[req.id] = &req;
    }
    // ... existing enqueue logic
}
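
The completion path also has to drop the entry from active_requests_, otherwise the map grows without bound and cancel_request() could dereference a freed request. A minimal sketch, assuming a hypothetical on_request_finished hook invoked by the backend on terminal states:

// Sketch: called by the backend once a request reaches a terminal state.
// on_request_finished is an assumed hook, not existing API.
void Queue::on_request_finished(Request& req) {
    {
        std::lock_guard lock(impl_->active_mutex_);
        impl_->active_requests_.erase(req.id);
    }
    // The callback runs only after the request is no longer cancellable.
    if (req.callback) req.callback(&req);
}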

Cancellation:

bool Queue::cancel_request(request_id_t id) {
    std::lock_guard lock(impl_->active_mutex_);
    
    auto it = impl_->active_requests_.find(id);
    if (it == impl_->active_requests_.end()) {
        return false; // Already completed or never existed
    }
    
    Request* req = it->second;
    
    // Mark as cancellation requested
    req->cancellation_requested.store(true, std::memory_order_release);
    
    // If still pending (not submitted), remove immediately
    if (req->status == RequestStatus::Pending) {
        req->status = RequestStatus::Cancelled;
        impl_->active_requests_.erase(it);
        return true;
    }
    
    // If in-flight, backend must handle cancellation
    // Return false to indicate "in progress, might complete"
    return false;
}

Phase 3: Backend Support

CPU Backend:

void CpuBackend::process_request(Request& req) {
    // Check cancellation before I/O
    if (req.cancellation_requested.load(std::memory_order_acquire)) {
        req.status = RequestStatus::Cancelled;
        req.callback(&req);
        return;
    }
    
    // Perform I/O
    ssize_t bytes = pread(req.fd, req.dst, req.size, req.offset);
    
    // Check cancellation after I/O (before callback)
    if (req.cancellation_requested.load(std::memory_order_acquire)) {
        req.status = RequestStatus::Cancelled;
        req.callback(&req);
        return;
    }
    
    // Normal completion
    req.status = RequestStatus::Complete;
    req.callback(&req);
}
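
The load-then-store pattern above still leaves a window where a concurrent cancel lands between the final check and the callback (test case 5 below). One way to close it, assuming Request::status is changed to std::atomic<RequestStatus>, is a compare-exchange so that exactly one terminal state wins:

// Sketch: atomic terminal-state transition. Assumes Request::status becomes
// std::atomic<RequestStatus>; exactly one of Complete/Cancelled ever wins.
bool try_finish(Request& req, RequestStatus terminal) {
    RequestStatus expected = RequestStatus::InProgress;
    return req.status.compare_exchange_strong(
        expected, terminal, std::memory_order_acq_rel);
}

// Completion path: only the winner invokes the callback.
void complete(Request& req) {
    if (try_finish(req, RequestStatus::Complete))
        req.callback(&req);
}

// Cancellation path: if this wins, the request is guaranteed not to complete.
bool cancel_in_flight(Request& req) {
    return try_finish(req, RequestStatus::Cancelled);
}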

Vulkan Backend:

// Harder to cancel GPU work in progress
// Strategy: Don't invoke callback if cancelled
void VulkanBackend::complete_request(Request& req) {
    if (req.cancellation_requested.load(std::memory_order_acquire)) {
        req.status = RequestStatus::Cancelled;
    }
    
    req.callback(&req);
}

io_uring Backend:

// Can cancel SQE before submission
bool IoUringBackend::cancel_sqe(Request& req) {
    // Remove from pending queue if not yet submitted
    std::lock_guard lock(pending_mutex_);
    auto it = std::find(pending_.begin(), pending_.end(), &req);
    if (it != pending_.end()) {
        pending_.erase(it);
        req.status = RequestStatus::Cancelled;
        return true;
    }
    return false; // Already submitted
}
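
Already-submitted SQEs are not necessarily lost either: the kernel supports asynchronous cancellation via IORING_OP_ASYNC_CANCEL. A hedged sketch using liburing (cancel_submitted and the ring_ member are assumptions; the prep helper's signature varies slightly across liburing versions):

// Sketch: request kernel-side cancellation of an in-flight SQE, keyed by the
// user_data that was set when the original read was prepared (&req here).
bool IoUringBackend::cancel_submitted(Request& req) {
    struct io_uring_sqe* sqe = io_uring_get_sqe(&ring_);
    if (!sqe) return false;
    io_uring_prep_cancel(sqe, &req, 0);
    io_uring_submit(&ring_);
    // The cancel op posts its own CQE: res == 0 on success, -ENOENT if the
    // target already completed, -EALREADY if it can no longer be cancelled.
    return true;
}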

1.4 Testing

Test Cases:

  1. Cancel pending request (before submit)
  2. Cancel in-flight request (during I/O)
  3. Cancel completed request (should fail)
  4. Cancel non-existent request (should fail)
  5. Race: cancel vs completion
  6. Cancel all requests
  7. Callback for a cancelled request observes Cancelled status, never Complete

Test File: tests/cancellation_test.cpp

1.5 Timeline

2-3 weeks for complete cancellation support


Part 2: GPU-Resident Workflows

2.1 Motivation

Goal: Zero-copy disk → GPU data path

Traditional Path (current):

Disk → Host Staging Buffer → GPU Buffer
       [copy 1]              [copy 2]

GPU-Resident Path (target):

Disk → GPU Buffer (direct)
       [copy 1 only]

2.2 DirectStorage GPU Upload Heap

Microsoft DirectStorage Concept:

  • GPU upload heap: CPU-visible, GPU-accessible memory
  • Direct writes from storage controller to GPU memory
  • Requires hardware support (PCIe peer-to-peer, GPU Direct Storage)

Linux Equivalent:

  • NVIDIA GPUDirect Storage: Kernel driver enables direct NVMe → GPU transfers
  • AMD equivalent: DirectGMA (less documented)
  • Standard Vulkan: No direct disk → GPU (must use staging)

2.3 Implementation Strategies

Strategy 1: Vulkan External Memory (Current)

Approach: Staging buffer + GPU copy (already implemented)

Pros:

  • Works on all Vulkan hardware
  • Portable across vendors
  • Already implemented

Cons:

  • Extra copy (staging → GPU)
  • Higher latency
  • More memory usage

Strategy 2: GPU Direct Storage Integration

Approach: Integrate with vendor-specific APIs

NVIDIA GPUDirect Storage:

// Open the file with O_DIRECT, as GDS requires
int fd = open(path, O_RDONLY | O_DIRECT);

// Initialize the cuFile driver (once per process)
cuFileDriverOpen();

// Register the file with GDS
CUfileDescr_t descr = {};
descr.handle.fd = fd;
descr.type = CU_FILE_HANDLE_TYPE_OPAQUE_FD;
CUfileHandle_t handle;
cuFileHandleRegister(&handle, &descr);

// Direct read into GPU memory (gpu_buffer is a device pointer)
cuFileRead(handle, gpu_buffer, size, offset, /*devPtr_offset=*/0);

Pros:

  • Zero extra copies
  • Lowest latency
  • Highest throughput

Cons:

  • NVIDIA-only (no AMD/Intel equivalent)
  • Requires special driver setup
  • O_DIRECT alignment requirements
  • Complex integration

Strategy 3: Memory-Mapped Files + GPU Upload

Approach: mmap file, map to GPU upload heap

Implementation:

// Map file to host memory
void* mapped = mmap(nullptr, file_size, PROT_READ, 
                   MAP_SHARED, fd, 0);

// Allocate GPU upload heap (CPU-visible, GPU-accessible)
VkBuffer upload_buffer = create_upload_buffer(device);
void* gpu_mapped = map_buffer(upload_buffer);

// Copy file data to upload heap
memcpy(gpu_mapped, mapped, file_size);

// Unmap
munmap(mapped, file_size);
unmap_buffer(upload_buffer);

// Use upload buffer directly in GPU (no staging copy needed)
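
A variant worth noting: if the upload heap is HOST_VISIBLE and persistently mapped, the mmap step can be skipped entirely by reading straight into the mapped pointer. A minimal sketch (map_buffer/unmap_buffer are the same hypothetical helpers as above):

// Sketch: read file contents directly into the persistently mapped upload
// heap, avoiding the intermediate mmap'd view of the file.
void* gpu_mapped = map_buffer(upload_buffer);
ssize_t n = pread(fd, gpu_mapped, file_size, 0);
if (n != (ssize_t)file_size) { /* handle short read / error */ }
unmap_buffer(upload_buffer);
// Still one copy (page cache -> upload heap), but no mmap/munmap churn.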

Pros:

  • Simpler than GDS
  • Works across vendors
  • Reduces staging buffer usage

Cons:

  • Still one copy (mmap → GPU)
  • Page cache overhead
  • Not true "direct to GPU"

2.4 Recommended Approach

Phase 1: Optimize Current Path

  • Reuse staging buffers via a pool (see the sketch after this list)
  • Async staging → GPU copy (don't wait)
  • Batch multiple requests

Phase 2: Vendor-Specific Paths (Optional)

  • Add GDS backend for NVIDIA
  • Conditional compilation (#ifdef NVIDIA_GDS)
  • Fallback to standard path

Phase 3: Future Hardware

  • Wait for standardized GPU Direct Storage in Vulkan
  • Integrate when available

2.5 GPU-to-GPU Transfers

Use Case: Texture decompression GPU → GPU

Current Path:

Disk → Staging → GPU Compressed Buffer → GPU Decompressed Buffer
                 [compute shader]

Optimization:

Disk → GPU Compressed Buffer → GPU Decompressed Buffer
                               [single command buffer]

Implementation: Already supported via Vulkan backend + compute pipelines
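
For reference, the single-command-buffer pattern is plain Vulkan: copy into the compressed buffer, barrier, then dispatch the decompression shader. A sketch of the recording step (cmd, the buffer handles, decompress_pipeline, and workgroup_count are assumptions; descriptor set binding omitted):

// Sketch: upload + decompress recorded into one command buffer.
VkBufferCopy region = { .srcOffset = 0, .dstOffset = 0, .size = compressed_size };
vkCmdCopyBuffer(cmd, staging_buffer, compressed_buffer, 1, &region);

// Make the copy visible to the compute shader before dispatching.
VkBufferMemoryBarrier barrier = {
    .sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER,
    .srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT,
    .dstAccessMask = VK_ACCESS_SHADER_READ_BIT,
    .srcQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .dstQueueFamilyIndex = VK_QUEUE_FAMILY_IGNORED,
    .buffer = compressed_buffer,
    .offset = 0,
    .size = compressed_size,
};
vkCmdPipelineBarrier(cmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
                     VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
                     0, 0, nullptr, 1, &barrier, 0, nullptr);

vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, decompress_pipeline);
// (descriptor set binding omitted)
vkCmdDispatch(cmd, workgroup_count, 1, 1);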

2.6 Testing

Benchmarks:

  • Staging vs direct (if GDS available)
  • Throughput (MB/s)
  • Latency (ms)
  • CPU overhead (%)

Validation:

  • Data integrity (checksums)
  • Memory usage
  • GPU utilization

2.7 Timeline

Phase 1 (Optimization): 2 weeks
Phase 2 (GDS Integration): 4-6 weeks (if needed)


Part 3: Wine/Proton Integration

3.1 Architecture Overview

Goal: Enable Windows DirectStorage games to run on Linux via Proton

Strategy:

Windows Game (DirectStorage API)
  ↓
Wine/Proton dstorage.dll Shim
  ↓ (translate calls)
ds-runtime (Linux native)
  ↓ (execute)
Linux Kernel (io_uring, Vulkan)

3.2 Integration Approaches

Approach 1: PE DLL Shim (Recommended)

Architecture:

dstorage.dll (PE) - Windows ABI
  ↓ dlopen
libds_runtime.so - Linux ABI

Implementation:

  1. Create dstorage.dll (Wine builtin DLL)
  2. Implement DirectStorage API entry points
  3. Forward to libds_runtime.so via C ABI
  4. Translate types (HANDLE → fd, etc.)

Example:

// dstorage.dll (Wine) -- simplified sketch; in the real API this logic lives
// behind IDStorageFactory::CreateQueue rather than a flat export
HRESULT WINAPI DStorageCreateQueue(
    const DSTORAGE_QUEUE_DESC* desc,
    REFIID riid,
    void** ppv
) {
    // Resolve libds_runtime.so once; a PE DLL cannot link ELF symbols directly
    static void* handle = dlopen("libds_runtime.so", RTLD_NOW);
    if (!handle) return E_FAIL;

    // Get C API functions (both must come through dlsym)
    auto ds_make_cpu_backend = (ds_backend_t* (*)())
        dlsym(handle, "ds_make_cpu_backend");
    auto ds_create_queue = (ds_queue_t* (*)(ds_backend_t*))
        dlsym(handle, "ds_create_queue");
    if (!ds_make_cpu_backend || !ds_create_queue) return E_FAIL;

    // Create backend and queue
    ds_backend_t* backend = ds_make_cpu_backend();
    ds_queue_t* queue = ds_create_queue(backend);
    if (!queue) return E_FAIL;

    // Wrap in a COM object implementing IDStorageQueue
    *ppv = new DStorageQueueImpl(queue);
    return S_OK;
}

Approach 2: Direct Integration (No Shim)

Architecture:

Wine/Proton DirectStorage Implementation
  ↓ (link directly)
libds_runtime_static.a

Implementation:

  1. Build ds-runtime as static library
  2. Link into Wine dlls/dstorage build
  3. Call C++ API directly (no PE/ELF bridge)
  4. Share Vulkan device with vkd3d-proton

Pros:

  • No dlopen overhead
  • Simpler debugging
  • Shared Vulkan context

Cons:

  • Tighter coupling
  • Requires Wine build modifications

Approach 3: Kernel Module (Experimental)

Architecture:

DirectStorage Requests
  ↓
ioctl to kernel module
  ↓
Kernel-side I/O handling

Not Recommended: Too complex, overkill for userspace I/O

3.3 Type Mapping

Windows → Linux Translation:

Windows Type           Linux Type            Conversion
HANDLE                 int                   fd = _open_osfhandle(handle)
DSTORAGE_REQUEST       ds_request            Struct field mapping
ID3D12Resource*        VkBuffer              vkd3d-proton interop
DSTORAGE_COMPRESSION   ds_compression_t      Enum mapping
OVERLAPPED             Completion callback   Async model

Example Struct Mapping:

void translate_request(
    const DSTORAGE_REQUEST_DESC* windows_req,
    ds_request* linux_req
) {
    linux_req->fd = get_fd_from_handle(windows_req->Source.File.Handle);
    linux_req->offset = windows_req->Source.File.Offset;
    linux_req->size = windows_req->Source.File.Size;
    linux_req->dst = get_buffer_pointer(windows_req->Destination);
    linux_req->op = (windows_req->DestinationType == DSTORAGE_REQUEST_DESTINATION_MEMORY)
        ? DS_REQUEST_OP_READ : DS_REQUEST_OP_WRITE;
    linux_req->compression = translate_compression(
        windows_req->CompressionFormat
    );
}
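
The get_fd_from_handle helper above is the most Wine-specific piece. A hedged sketch of how a builtin DLL could implement it, assuming Wine's internal wine_server_handle_to_fd / wine_server_release_fd pair (signatures should be checked against the Wine tree in use):

// Sketch: resolve a Wine HANDLE to the underlying unix fd from inside a
// builtin DLL. wine_server_handle_to_fd is Wine-internal, not public API.
#include <wine/server.h>

int get_fd_from_handle(HANDLE handle) {
    int unix_fd = -1;
    unsigned int options;
    if (wine_server_handle_to_fd(handle, FILE_READ_DATA, &unix_fd, &options))
        return -1; // non-zero status means failure
    // Note: pair with wine_server_release_fd(handle, unix_fd) when done.
    return unix_fd;
}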

3.4 Vulkan Device Sharing

Challenge: DirectStorage expects D3D12 device, we need Vulkan

Solution: vkd3d-proton already handles D3D12 → Vulkan translation

Integration:

// Get the Vulkan device from vkd3d-proton (helper names vary by version;
// vkd3d expects the queue to be acquired/released around use via
// vkd3d_acquire_vk_queue / vkd3d_release_vk_queue on the D3D12 command queue)
VkDevice vk_device = vkd3d_get_vk_device(d3d12_device);
VkQueue vk_queue = vkd3d_acquire_vk_queue(d3d12_command_queue);

// Create ds-runtime Vulkan backend with the shared device
ds_vulkan_backend_config config;
config.device = vk_device;
config.queue = vk_queue;
config.take_ownership = false; // Don't destroy a device we don't own

ds_backend_t* backend = ds_make_vulkan_backend(&config);

3.5 Implementation Steps

Step 1: C ABI Wrapper (Already Exists)

Status: ✅ Complete

  • include/ds_runtime_c.h provides C API
  • Type conversions implemented
  • Tested with c_abi_stats_test.c

Step 2: Create dstorage.dll Skeleton

Location: Outside ds-runtime repo (in Wine tree)

Files:

dlls/dstorage/
├── Makefile.in
├── dstorage.spec
├── dstorage_main.c
├── queue.c
└── request.c

Implement:

  • DStorageGetFactory
  • DStorageSetConfiguration
  • IDStorageFactory::CreateQueue
  • IDStorageQueue::EnqueueRequest
  • IDStorageQueue::Submit
  • IDStorageQueue::EnqueueSignal
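
The two flat exports above go in dstorage.spec; a minimal sketch using Wine's spec syntax follows (the IDStorage* methods are COM vtable entries implemented in queue.c, not DLL exports):

@ stdcall DStorageGetFactory(ptr ptr)
@ stdcall DStorageSetConfiguration(ptr)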

Step 3: Link with ds-runtime

Option A: Dynamic Linking

EXTRADLLFLAGS = -Wl,--no-undefined
EXTRALIBS = -lds_runtime

Option B: Static Linking

EXTRALIBS = $(LIBDS_RUNTIME_STATIC)

Step 4: Test with Real Games

Test Titles:

  • Forspoken (uses DirectStorage)
  • Ratchet & Clank: Rift Apart
  • Any UE5 game with DirectStorage support

Validation:

  • Game launches without crashes
  • Asset loading works
  • Performance acceptable
  • No memory leaks

3.6 Documentation

Create: docs/wine_integration_guide.md

Contents:

  • Build dstorage.dll
  • Configure Wine to use builtin override
  • Debugging tips
  • Performance tuning
  • Known issues

3.7 Timeline

Week 1-2: Prototype

  • Create basic dstorage.dll shim
  • Implement skeleton COM interfaces
  • Test with simple DirectStorage app

Week 3-4: Type Mapping

  • Implement full type conversion
  • Handle edge cases
  • Vulkan device sharing

Week 5-6: Testing

  • Test with real games
  • Performance benchmarking
  • Bug fixing

Week 7-8: Polish

  • Documentation
  • Error handling
  • Wine upstreaming (if desired)

Total Estimate: 8 weeks


Part 4: Master Implementation Roadmap

4.1 Dependency Graph

GDeflate CPU ━━━┓
                ┃
Vulkan Compute ━╋━━> GDeflate GPU
                ┃
io_uring Multi  ┃
                ┃
Cancellation ━━━╋━━> GPU Workflows
                ┃
                ┗━━> Wine/Proton Integration

4.2 Phased Implementation

Phase 1: Foundation (Weeks 1-8)

  • ✅ Initial assessment (complete)
  • ⏩ GDeflate research (2 weeks)
  • ⏩ Vulkan compute infrastructure (8 weeks, parallel)

Phase 2: Core Features (Weeks 9-18)

  • ⏩ GDeflate CPU implementation (5 weeks)
  • ⏩ io_uring multi-worker (6 weeks, parallel)
  • ⏩ Request cancellation (3 weeks, parallel)

Phase 3: Advanced Features (Weeks 19-28)

  • ⏩ GDeflate GPU implementation (6 weeks)
  • ⏩ GPU workflow optimization (4 weeks)

Phase 4: Integration (Weeks 29-36)

  • ⏩ Wine/Proton shim (8 weeks)
  • ⏩ Real game testing
  • ⏩ Performance tuning

Total Timeline: 36 weeks (9 months)

4.3 Parallelization Opportunities

Can Work in Parallel:

  • Vulkan compute + GDeflate research
  • GDeflate CPU + io_uring enhancements
  • GDeflate CPU + cancellation
  • GPU workflows + Wine integration

Must Be Sequential:

  • Vulkan compute → GDeflate GPU
  • GDeflate CPU → GDeflate GPU
  • Core features → Wine integration

4.4 Fast Track Option

Goal: Minimal viable product in 12 weeks

Scope:

  • ✅ CPU backend (working)
  • ⏩ GDeflate CPU (5 weeks)
  • ⏩ Vulkan compute (8 weeks, start week 1)
  • ⏩ Basic Wine shim (3 weeks)
  • ❌ Skip: GPU GDeflate, io_uring multi-worker, advanced features

Timeline: 12 weeks


Part 5: Success Criteria

5.1 Functional Requirements

Core:

  • All features work independently
  • Integration tests pass
  • No regressions in existing functionality
  • Documentation complete

Performance:

  • GDeflate CPU: ≥ 500 MB/s
  • GDeflate GPU: ≥ 2 GB/s
  • io_uring: ≥ 2x CPU backend
  • Wine overhead: < 10%

Quality:

  • No memory leaks
  • Thread-safe
  • Vulkan validation clean
  • Works on CachyOS/Arch Linux

5.2 Wine/Proton Validation

Required:

  • At least one DirectStorage game runs
  • Asset loading works correctly
  • Performance within 20% of Windows
  • No crashes or hangs

Part 6: Risk Assessment

6.1 Technical Risks

Risk                          Probability  Impact  Mitigation
GDeflate format unavailable   Medium       High    Reverse engineer, community collaboration
GPU compute too slow          Low          Medium  Optimize shaders, fall back to CPU
Wine integration complex      High         Medium  Start simple, iterate
Hardware incompatibility      Medium       High    Test on multiple GPUs, provide fallbacks

6.2 Timeline Risks

Risk                                     Impact      Mitigation
GDeflate research longer than expected   +4 weeks    Start GPU work in parallel
Wine upstreaming delays                  +8 weeks    Maintain out-of-tree fork
Testing reveals bugs                     +2-4 weeks  Allocate buffer time

Part 7: Next Actions

Immediate (This Week)

  1. ✅ Complete investigation documents
  2. ⏩ Begin GDeflate format research
  3. ⏩ Start Vulkan compute implementation
  4. ⏩ Install liburing for io_uring testing

Short Term (Weeks 2-4)

  1. ⏩ Implement shader module loading
  2. ⏩ Begin GDeflate CPU decoder
  3. ⏩ Design cancellation API
  4. ⏩ Test io_uring multi-worker prototype

Medium Term (Weeks 5-12)

  1. ⏩ Complete Vulkan compute pipelines
  2. ⏩ Finish GDeflate CPU implementation
  3. ⏩ Implement request cancellation
  4. ⏩ Start Wine shim prototype

Document Status: Draft v1.0
Last Updated: 2026-02-16
Next Review: After Phase 1 milestones complete