Status: Planning Phase
Priority: High
Target: Full GPU compute capability for decompression and data processing
Dependencies: Existing Vulkan device/queue infrastructure
The current Vulkan backend in ds-runtime supports staging buffer copies but lacks compute pipeline functionality. This document outlines the investigation, design, and implementation plan for adding full GPU compute capabilities, primarily for GPU-accelerated decompression.
Existing Vulkan Infrastructure (src/ds_runtime_vulkan.cpp):
- ✅ Vulkan instance creation
- ✅ Physical device selection
- ✅ Logical device and queue creation
- ✅ Command pool management
- ✅ Staging buffer allocation (host-visible)
- ✅ GPU buffer allocation (device-local)
- ✅
vkCmdCopyBufferfor staging ↔ GPU transfers - ✅ Synchronization via
vkDeviceWaitIdle - ✅ Memory type selection and allocation
- ✅ Request submission and completion tracking
Capability: File I/O → Staging buffer → GPU buffer (pure data transfer, no computation)
Compute Pipeline Components:
- ❌ Compute pipeline creation (
vkCreateComputePipelines) - ❌ Shader module loading (
vkCreateShaderModule) - ❌ Descriptor set layout creation
- ❌ Descriptor pool allocation
- ❌ Descriptor set updates (buffer bindings)
- ❌ Pipeline layout creation
- ❌ Compute command recording (
vkCmdBindPipeline,vkCmdDispatch) - ❌ Push constant support
- ❌ Compute-specific synchronization (barriers)
Impact: Cannot execute any GPU compute workloads (decompression, transforms, etc.)
SPIR-V Shader: examples/vk-copy-test/copy.comp.spv (256 bytes)
- Precompiled compute shader
- Currently not loaded or used by any code
- Likely a simple buffer copy/transform shader
- Should be examined and potentially reused
Application
↓ (prepare compute work)
Descriptor Sets (bind buffers, uniforms)
↓
Pipeline Layout (descriptor layouts + push constants)
↓
Compute Pipeline (shader + configuration)
↓
Command Buffer (vkCmdBindPipeline, vkCmdDispatch)
↓
GPU Execution (workgroups → invocations)
↓
Synchronization (barriers, fences)
↓
Results in GPU buffers
| Object | Purpose | Lifetime |
|---|---|---|
| VkShaderModule | Compiled SPIR-V code | Per-shader, reusable |
| VkDescriptorSetLayout | Binding layout (types, stages) | Per-layout, reusable |
| VkDescriptorPool | Allocation pool for descriptor sets | Per-backend, managed |
| VkDescriptorSet | Actual buffer bindings | Per-dispatch, short-lived |
| VkPipelineLayout | Push constants + descriptor layouts | Per-pipeline, reusable |
| VkPipeline | Complete compute configuration | Per-shader, reusable |
| VkCommandBuffer | Recorded GPU commands | Per-submission, pooled |
Workgroup Hierarchy:
Global Work Size (e.g., 1024 elements)
↓ divide by local_size_x
Workgroups (e.g., 4 workgroups if local_size_x = 256)
↓ parallel execution
Invocations (256 invocations per workgroup)
↓
Each invocation processes one element
Shader Invocation IDs:
gl_GlobalInvocationID: Unique ID across all invocationsgl_LocalInvocationID: ID within the workgroupgl_WorkGroupID: Workgroup index
Goal: Load and create VkShaderModule from SPIR-V bytecode
- Add SPIR-V loading utility function
- Implement shader module creation
- Add shader module caching (avoid redundant loads)
- Validate shader compilation
Location: src/ds_runtime_vulkan.cpp
class ShaderModuleCache {
public:
VkShaderModule load_shader(
VkDevice device,
const std::string& path
);
void destroy_all(VkDevice device);
private:
std::unordered_map<std::string, VkShaderModule> modules_;
};
// Add to VulkanBackend::Impl
ShaderModuleCache shader_cache_;VkShaderModule ShaderModuleCache::load_shader(
VkDevice device,
const std::string& path
) {
// Check cache
auto it = modules_.find(path);
if (it != modules_.end()) {
return it->second;
}
// Read SPIR-V file
std::ifstream file(path, std::ios::binary | std::ios::ate);
if (!file) {
throw std::runtime_error("Failed to open shader file");
}
size_t size = file.tellg();
std::vector<char> code(size);
file.seekg(0);
file.read(code.data(), size);
// Create shader module
VkShaderModuleCreateInfo create_info{};
create_info.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
create_info.codeSize = code.size();
create_info.pCode = reinterpret_cast<const uint32_t*>(code.data());
VkShaderModule module;
VkResult result = vkCreateShaderModule(
device, &create_info, nullptr, &module
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to create shader module");
}
modules_[path] = module;
return module;
}- Load existing
copy.comp.spvshader - Validate shader module creation succeeds
- Test error handling for missing files
- Verify shader module caching works
Goal: Define buffer binding layouts for compute shaders
Example: GDeflate Decompression
// Binding 0: Input compressed buffer (read-only)
// Binding 1: Block metadata (read-only)
// Binding 2: Output decompressed buffer (write-only)Location: src/ds_runtime_vulkan.cpp
struct ComputeDescriptorLayouts {
VkDescriptorSetLayout decompression_layout;
VkDescriptorSetLayout copy_layout;
// Add more as needed
};
VkDescriptorSetLayout create_decompression_descriptor_layout(
VkDevice device
) {
VkDescriptorSetLayoutBinding bindings[3];
// Binding 0: Input compressed data
bindings[0].binding = 0;
bindings[0].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
bindings[0].descriptorCount = 1;
bindings[0].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
bindings[0].pImmutableSamplers = nullptr;
// Binding 1: Block metadata
bindings[1].binding = 1;
bindings[1].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
bindings[1].descriptorCount = 1;
bindings[1].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
bindings[1].pImmutableSamplers = nullptr;
// Binding 2: Output decompressed data
bindings[2].binding = 2;
bindings[2].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
bindings[2].descriptorCount = 1;
bindings[2].stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
bindings[2].pImmutableSamplers = nullptr;
VkDescriptorSetLayoutCreateInfo layout_info{};
layout_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_LAYOUT_CREATE_INFO;
layout_info.bindingCount = 3;
layout_info.pBindings = bindings;
VkDescriptorSetLayout layout;
VkResult result = vkCreateDescriptorSetLayout(
device, &layout_info, nullptr, &layout
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to create descriptor set layout");
}
return layout;
}// In VulkanBackend::Impl
ComputeDescriptorLayouts descriptor_layouts_;
// Initialize in constructor
descriptor_layouts_.decompression_layout =
create_decompression_descriptor_layout(device_);Goal: Allocate descriptor pool for runtime descriptor set allocation
Strategy: Pre-allocate pool large enough for concurrent dispatches
// Estimate: 16 concurrent compute dispatches
// Each dispatch needs 3 storage buffer descriptors
// Total: 16 * 3 = 48 storage buffer descriptorsVkDescriptorPool create_compute_descriptor_pool(
VkDevice device,
uint32_t max_sets = 32
) {
VkDescriptorPoolSize pool_size{};
pool_size.type = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
pool_size.descriptorCount = max_sets * 3; // 3 bindings per set
VkDescriptorPoolCreateInfo pool_info{};
pool_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_POOL_CREATE_INFO;
pool_info.poolSizeCount = 1;
pool_info.pPoolSizes = &pool_size;
pool_info.maxSets = max_sets;
pool_info.flags = VK_DESCRIPTOR_POOL_CREATE_FREE_DESCRIPTOR_SET_BIT;
VkDescriptorPool pool;
VkResult result = vkCreateDescriptorPool(
device, &pool_info, nullptr, &pool
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to create descriptor pool");
}
return pool;
}
// Add to VulkanBackend::Impl
VkDescriptorPool descriptor_pool_;
// Initialize in constructor
descriptor_pool_ = create_compute_descriptor_pool(device_);Goal: Create compute pipeline with shader and layout
VkPipelineLayout create_compute_pipeline_layout(
VkDevice device,
VkDescriptorSetLayout descriptor_layout
) {
// Optional: push constants for dispatch parameters
VkPushConstantRange push_constant{};
push_constant.stageFlags = VK_SHADER_STAGE_COMPUTE_BIT;
push_constant.offset = 0;
push_constant.size = sizeof(uint32_t) * 4; // Example: 4 uint32s
VkPipelineLayoutCreateInfo layout_info{};
layout_info.sType = VK_STRUCTURE_TYPE_PIPELINE_LAYOUT_CREATE_INFO;
layout_info.setLayoutCount = 1;
layout_info.pSetLayouts = &descriptor_layout;
layout_info.pushConstantRangeCount = 1;
layout_info.pPushConstantRanges = &push_constant;
VkPipelineLayout layout;
VkResult result = vkCreatePipelineLayout(
device, &layout_info, nullptr, &layout
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to create pipeline layout");
}
return layout;
}VkPipeline create_compute_pipeline(
VkDevice device,
VkPipelineLayout layout,
VkShaderModule shader_module,
const char* entry_point = "main"
) {
VkPipelineShaderStageCreateInfo stage_info{};
stage_info.sType = VK_STRUCTURE_TYPE_PIPELINE_SHADER_STAGE_CREATE_INFO;
stage_info.stage = VK_SHADER_STAGE_COMPUTE_BIT;
stage_info.module = shader_module;
stage_info.pName = entry_point;
VkComputePipelineCreateInfo pipeline_info{};
pipeline_info.sType = VK_STRUCTURE_TYPE_COMPUTE_PIPELINE_CREATE_INFO;
pipeline_info.stage = stage_info;
pipeline_info.layout = layout;
VkPipeline pipeline;
VkResult result = vkCreateComputePipelines(
device, VK_NULL_HANDLE, 1, &pipeline_info, nullptr, &pipeline
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to create compute pipeline");
}
return pipeline;
}struct ComputePipeline {
VkPipeline pipeline;
VkPipelineLayout layout;
VkDescriptorSetLayout descriptor_layout;
VkShaderModule shader;
};
// Add to VulkanBackend::Impl
std::unordered_map<std::string, ComputePipeline> compute_pipelines_;
ComputePipeline create_decompression_pipeline() {
ComputePipeline result;
// Load shader
result.shader = shader_cache_.load_shader(
device_, "shaders/decompress.comp.spv"
);
// Create descriptor layout
result.descriptor_layout =
create_decompression_descriptor_layout(device_);
// Create pipeline layout
result.layout = create_compute_pipeline_layout(
device_, result.descriptor_layout
);
// Create pipeline
result.pipeline = create_compute_pipeline(
device_, result.layout, result.shader
);
return result;
}Goal: Bind buffers to descriptor sets for each dispatch
VkDescriptorSet allocate_descriptor_set(
VkDevice device,
VkDescriptorPool pool,
VkDescriptorSetLayout layout
) {
VkDescriptorSetAllocateInfo alloc_info{};
alloc_info.sType = VK_STRUCTURE_TYPE_DESCRIPTOR_SET_ALLOCATE_INFO;
alloc_info.descriptorPool = pool;
alloc_info.descriptorSetCount = 1;
alloc_info.pSetLayouts = &layout;
VkDescriptorSet descriptor_set;
VkResult result = vkAllocateDescriptorSets(
device, &alloc_info, &descriptor_set
);
if (result != VK_SUCCESS) {
throw std::runtime_error("Failed to allocate descriptor set");
}
return descriptor_set;
}void update_decompression_descriptor_set(
VkDevice device,
VkDescriptorSet descriptor_set,
VkBuffer input_buffer,
VkBuffer metadata_buffer,
VkBuffer output_buffer,
VkDeviceSize input_size,
VkDeviceSize metadata_size,
VkDeviceSize output_size
) {
VkDescriptorBufferInfo buffer_infos[3];
// Input buffer
buffer_infos[0].buffer = input_buffer;
buffer_infos[0].offset = 0;
buffer_infos[0].range = input_size;
// Metadata buffer
buffer_infos[1].buffer = metadata_buffer;
buffer_infos[1].offset = 0;
buffer_infos[1].range = metadata_size;
// Output buffer
buffer_infos[2].buffer = output_buffer;
buffer_infos[2].offset = 0;
buffer_infos[2].range = output_size;
VkWriteDescriptorSet writes[3];
for (int i = 0; i < 3; ++i) {
writes[i].sType = VK_STRUCTURE_TYPE_WRITE_DESCRIPTOR_SET;
writes[i].pNext = nullptr;
writes[i].dstSet = descriptor_set;
writes[i].dstBinding = i;
writes[i].dstArrayElement = 0;
writes[i].descriptorCount = 1;
writes[i].descriptorType = VK_DESCRIPTOR_TYPE_STORAGE_BUFFER;
writes[i].pBufferInfo = &buffer_infos[i];
writes[i].pImageInfo = nullptr;
writes[i].pTexelBufferView = nullptr;
}
vkUpdateDescriptorSets(device, 3, writes, 0, nullptr);
}Goal: Record and execute compute commands
void record_compute_dispatch(
VkCommandBuffer cmd,
VkPipeline pipeline,
VkPipelineLayout layout,
VkDescriptorSet descriptor_set,
uint32_t workgroup_count_x,
uint32_t workgroup_count_y = 1,
uint32_t workgroup_count_z = 1
) {
// Bind compute pipeline
vkCmdBindPipeline(cmd, VK_PIPELINE_BIND_POINT_COMPUTE, pipeline);
// Bind descriptor set
vkCmdBindDescriptorSets(
cmd,
VK_PIPELINE_BIND_POINT_COMPUTE,
layout,
0, // first set
1, // descriptor set count
&descriptor_set,
0, // dynamic offset count
nullptr
);
// Optional: push constants
// vkCmdPushConstants(cmd, layout, VK_SHADER_STAGE_COMPUTE_BIT, ...);
// Dispatch compute work
vkCmdDispatch(cmd, workgroup_count_x, workgroup_count_y, workgroup_count_z);
}void VulkanBackend::Impl::process_request_with_compute(Request& req) {
// Begin command buffer
VkCommandBufferBeginInfo begin_info{};
begin_info.sType = VK_STRUCTURE_TYPE_COMMAND_BUFFER_BEGIN_INFO;
begin_info.flags = VK_COMMAND_BUFFER_USAGE_ONE_TIME_SUBMIT_BIT;
vkBeginCommandBuffer(command_buffer_, &begin_info);
// 1. Copy file data to staging buffer
// (existing code for file I/O)
// 2. Barrier: transfer → compute
VkBufferMemoryBarrier barrier{};
barrier.sType = VK_STRUCTURE_TYPE_BUFFER_MEMORY_BARRIER;
barrier.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
barrier.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
barrier.buffer = staging_buffer_;
barrier.offset = 0;
barrier.size = VK_WHOLE_SIZE;
vkCmdPipelineBarrier(
command_buffer_,
VK_PIPELINE_STAGE_TRANSFER_BIT,
VK_PIPELINE_STAGE_COMPUTE_SHADER_BIT,
0, 0, nullptr, 1, &barrier, 0, nullptr
);
// 3. Dispatch compute (decompression)
auto& pipeline = compute_pipelines_["decompress"];
VkDescriptorSet desc_set = allocate_descriptor_set(
device_, descriptor_pool_, pipeline.descriptor_layout
);
update_decompression_descriptor_set(
device_, desc_set,
staging_buffer_, // compressed input
metadata_buffer_, // block info
req.gpu_buffer, // decompressed output
req.size,
metadata_size,
req.size * 2 // assume 2x expansion
);
uint32_t workgroup_count = (req.size + 255) / 256;
record_compute_dispatch(
command_buffer_,
pipeline.pipeline,
pipeline.layout,
desc_set,
workgroup_count
);
// 4. Barrier: compute → host (if needed for readback)
// ...
// 5. End and submit
vkEndCommandBuffer(command_buffer_);
VkSubmitInfo submit_info{};
submit_info.sType = VK_STRUCTURE_TYPE_SUBMIT_INFO;
submit_info.commandBufferCount = 1;
submit_info.pCommandBuffers = &command_buffer_;
vkQueueSubmit(queue_, 1, &submit_info, VK_NULL_HANDLE);
vkQueueWaitIdle(queue_);
// Free descriptor set
vkFreeDescriptorSets(device_, descriptor_pool_, 1, &desc_set);
}Goal: Proper memory synchronization between pipeline stages
Transfer → Compute:
VkBufferMemoryBarrier transfer_to_compute{};
transfer_to_compute.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
transfer_to_compute.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
// Use: After vkCmdCopyBuffer, before compute dispatchCompute → Transfer:
VkBufferMemoryBarrier compute_to_transfer{};
compute_to_transfer.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
compute_to_transfer.dstAccessMask = VK_ACCESS_TRANSFER_READ_BIT;
// Use: After compute dispatch, before staging buffer readbackCompute → Compute (between dispatches):
VkBufferMemoryBarrier compute_to_compute{};
compute_to_compute.srcAccessMask = VK_ACCESS_SHADER_WRITE_BIT;
compute_to_compute.dstAccessMask = VK_ACCESS_SHADER_READ_BIT;
// Use: Between dependent compute passes- Minimize barriers: Batch operations when possible
- Use appropriate stages: Don't over-synchronize
- Consider queue ownership: Handle queue family transfers if needed
- Validate with layers: Enable
VK_LAYER_KHRONOS_validation
File: shaders/buffer_copy.comp
#version 450
layout(local_size_x = 256) in;
layout(binding = 0) readonly buffer InputBuffer {
uint data[];
} input_buf;
layout(binding = 1) writeonly buffer OutputBuffer {
uint data[];
} output_buf;
layout(push_constant) uniform PushConstants {
uint element_count;
} push;
void main() {
uint idx = gl_GlobalInvocationID.x;
if (idx < push.element_count) {
output_buf.data[idx] = input_buf.data[idx];
}
}Compilation:
glslangValidator -V buffer_copy.comp -o buffer_copy.comp.spvFile: shaders/uppercase.comp (enhanced version of FakeUppercase)
#version 450
layout(local_size_x = 256) in;
layout(binding = 0) readonly buffer InputBuffer {
uint8_t data[];
} input_buf;
layout(binding = 1) writeonly buffer OutputBuffer {
uint8_t data[];
} output_buf;
layout(push_constant) uniform PushConstants {
uint byte_count;
} push;
void main() {
uint idx = gl_GlobalInvocationID.x;
if (idx < push.byte_count) {
uint8_t c = input_buf.data[idx];
// Uppercase ASCII
if (c >= 'a' && c <= 'z') {
c = c - 32;
}
output_buf.data[idx] = c;
}
}Add to CMakeLists.txt:
# Find glslangValidator
find_program(GLSLANG_VALIDATOR glslangValidator)
if(GLSLANG_VALIDATOR)
# Compile shaders
file(GLOB SHADER_SOURCES shaders/*.comp)
foreach(SHADER ${SHADER_SOURCES})
get_filename_component(SHADER_NAME ${SHADER} NAME)
set(SPIRV "${CMAKE_CURRENT_BINARY_DIR}/shaders/${SHADER_NAME}.spv")
add_custom_command(
OUTPUT ${SPIRV}
COMMAND ${CMAKE_COMMAND} -E make_directory
"${CMAKE_CURRENT_BINARY_DIR}/shaders"
COMMAND ${GLSLANG_VALIDATOR} -V ${SHADER} -o ${SPIRV}
DEPENDS ${SHADER}
COMMENT "Compiling shader ${SHADER_NAME}"
)
list(APPEND SPIRV_SHADERS ${SPIRV})
endforeach()
add_custom_target(shaders ALL DEPENDS ${SPIRV_SHADERS})
endif()Test Suite: tests/vulkan_compute_test.cpp
Test Cases:
- Pipeline creation: Verify pipeline objects created successfully
- Shader loading: Test shader module creation from SPIR-V
- Descriptor allocation: Verify descriptor pool management
- Simple dispatch: Buffer copy shader execution
- Transform shader: Uppercase transform on GPU
- Error handling: Invalid pipeline, missing shader, etc.
Scenarios:
- File → GPU compute → GPU buffer: Full decompression path
- Multiple dispatches: Concurrent compute workloads
- Mixed operations: Compute + transfer in same command buffer
- Large buffers: Test with multi-MB data
- Synchronization: Verify barriers work correctly
Tools:
- Vulkan Validation Layers (
VK_LAYER_KHRONOS_validation) - RenderDoc for command buffer inspection
- GPU profilers (Nsight, Radeon GPU Profiler)
Check for:
- Synchronization errors
- Memory leaks (descriptor sets, pipelines)
- Invalid API usage
- Performance bottlenecks
Shader Optimizations:
- Coalesced memory access patterns
- Shared memory for temporary data
- Reduce divergent branches
- Optimize workgroup size for target GPU
Pipeline Management:
- Reuse pipelines across requests
- Minimize pipeline switches
- Batch similar compute work
- Pre-warm pipeline caches
Memory Management:
- Pool descriptor sets (avoid per-frame allocation)
- Reuse command buffers when possible
- Minimize host-device transfers
- Use push constants for small data
Goals:
- Compute dispatch overhead < 100 µs
- Throughput ≥ 10 GB/s for simple transforms
- GPU utilization ≥ 80% during compute
- CPU overhead < 5% during GPU execution
Required:
- Vulkan SDK (already optional dependency)
glslangValidatorfor shader compilation- C++20 compiler (already required)
Optional:
- RenderDoc for debugging
- GPU profiling tools
Minimum:
- Vulkan 1.0 support
- Compute queue support
- Storage buffer support
Recommended:
- Vulkan 1.3 support
- Dedicated compute queue
- ≥ 4GB VRAM for large asset processing
Week 1-2: Foundation
- Shader module loading
- Descriptor layouts
- Descriptor pool creation
- Milestone: Infrastructure ready
Week 3-4: Pipeline Creation
- Pipeline layout
- Compute pipeline creation
- Pipeline management
- Milestone: Simple copy shader works
Week 5-6: Dispatch and Synchronization
- Command buffer recording
- Compute dispatch
- Barriers and synchronization
- Milestone: Transform shader works
Week 7-8: Integration and Testing
- Backend integration
- Comprehensive testing
- Performance tuning
- Milestone: Production-ready compute support
8 weeks for complete Vulkan compute implementation
Dependencies: None (can proceed independently)
- ✅ Shader modules load from SPIR-V files
- ✅ Compute pipelines created successfully
- ✅ Descriptor sets allocated and bound correctly
- ✅ Compute dispatches execute on GPU
- ✅ Synchronization barriers work properly
- ✅ Existing Vulkan tests still pass
- ✅ New compute tests pass (100% coverage)
- ✅ Compute overhead < 100 µs per dispatch
- ✅ GPU utilization ≥ 80% during compute
- ✅ Throughput ≥ 10 GB/s for simple operations
- ✅ No performance regression in existing paths
- ✅ Vulkan validation layers pass (no errors)
- ✅ No memory leaks (descriptor sets, pipelines)
- ✅ Thread-safe pipeline management
- ✅ Documentation complete and accurate
- ✅ API stability maintained
- ✅ Complete investigation document
- ⏩ Examine existing
copy.comp.spvshader - ⏩ Set up shader build system
- ⏩ Create simple test shader
- ⏩ Implement shader module loading
- ⏩ Create descriptor layouts
- ⏩ Set up descriptor pool
- ⏩ Test infrastructure
- ⏩ Complete pipeline creation
- ⏩ Implement compute dispatch
- ⏩ Add synchronization
- ⏩ Comprehensive testing
- Shader Language: Should we support multiple shader languages (GLSL, HLSL via DXC)?
- Pipeline Caching: Do we need VkPipelineCache for faster startup?
- Async Compute: Should we use dedicated compute queue or graphics queue?
- Workgroup Size: How to determine optimal local_size_x for different GPUs?
- Error Recovery: How to handle GPU compute failures gracefully?
- Shader Hot-Reload: Do we need runtime shader recompilation for development?
- Vulkan 1.3 Specification: Compute Shaders
- Khronos Vulkan Guide: Compute
- Vulkan Tutorial: Compute Shaders
- Khronos Vulkan Best Practices
- AMD GPU Architecture
- NVIDIA Compute Best Practices
- Intel GPU Architecture Guide
- glslangValidator documentation
- RenderDoc user guide
- Vulkan Validation Layers
Document Status: Draft v1.0
Last Updated: 2026-02-16
Next Review: After shader loading implementation complete