System design and data flow for GPU-accelerated sorting.
WebGPU Sorting implements two GPU-accelerated sorting algorithms using WebGPU compute shaders written in WGSL (WebGPU Shading Language).
graph TB
subgraph CPU["CPU Side (JavaScript)"]
A[Input Uint32Array] --> B[GPUContext.initialize]
B --> C[Create Storage Buffer]
C --> D[Write Data to GPU]
end
subgraph GPU["GPU Side (WGSL Compute Shaders)"]
D --> E[Bitonic Sort / Radix Sort]
E --> F[Multiple Compute Passes]
F --> G[workgroupBarrier Sync]
G --> H[Write Sorted Buffer]
end
subgraph Readback["Result Readback"]
H --> I[Map Buffer Async]
I --> J[Copy to CPU Array]
J --> K[Return Sorted Data]
end
┌─────────────────────────────────────────────────────────────┐
│ User Interface │
│ ┌─────────────┐ ┌─────────────┐ ┌─────────────────────┐ │
│ │ Controls │ │ Progress │ │ Results Display │ │
│ └─────────────┘ └─────────────┘ └─────────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ Sorting API Layer │
│ ┌─────────────────┐ ┌─────────────────┐ ┌─────────────┐ │
│ │ BitonicSorter │ │ RadixSorter │ │ Benchmark │ │
│ └─────────────────┘ └─────────────────┘ └─────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WebGPU Core Layer │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ GPUContext │ │ BufferManager │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────┐
│ WGSL Compute Shaders │
│ ┌─────────────────┐ ┌─────────────────┐ │
│ │ bitonic.wgsl │ │ radix.wgsl │ │
│ └─────────────────┘ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Manages WebGPU device lifecycle and provides a unified interface for GPU operations.
class GPUContext {
private adapter: GPUAdapter | null = null;
private device: GPUDevice | null = null;
static isSupported(): boolean {
return typeof navigator !== 'undefined' && 'gpu' in navigator;
}
async initialize(config?: GPUContextConfig): Promise<void> {
this.adapter = await navigator.gpu.requestAdapter({
powerPreference: config?.powerPreference ?? 'high-performance',
});
this.device = await this.adapter.requestDevice();
}
getDevice(): GPUDevice {
return this.device;
}
destroy(): void {
this.device?.destroy();
}
}Handles GPU memory allocation and data transfer between CPU and GPU.
class BufferManager {
createStorageBuffer(data: Uint32Array): GPUBuffer {
const buffer = this.device.createBuffer({
size: data.byteLength,
usage: GPUBufferUsage.STORAGE | GPUBufferUsage.COPY_SRC | GPUBufferUsage.COPY_DST,
mappedAtCreation: true,
});
new Uint32Array(buffer.getMappedRange()).set(data);
buffer.unmap();
return buffer;
}
async readBuffer(buffer: GPUBuffer, size: number): Promise<Uint32Array> {
// Create staging buffer, copy, map, and read
}
}graph LR
subgraph WebGPU Memory
A[Storage Buffer<br/>Read-Write] --> B[Workgroup Memory<br/>Shared Local]
B --> C[Atomic Operations]
end
D[Host Memory<br/>JavaScript ArrayBuffer] --> A
| Phase | Operation | Memory | Time |
|---|---|---|---|
| 1 | Upload | CPU → GPU | O(n) |
| 2 | Sort | GPU Compute | O(log²n) or O(n×k) |
| 3 | Readback | GPU → CPU | O(n) |
| Feature | Bitonic Sort | Radix Sort |
|---|---|---|
| Complexity | O(n log²n) | O(n × k) |
| Type | Comparison-based | Non-comparison |
| Data Type | Any comparable | Uint32 only |
| Stability | Not stable | Stable |
| Best For | General purpose | Large integers |
| GPU Passes | log²n | k × 3 |
GPU sorting becomes advantageous when:
- Array size > 65,536 - Buffer transfer overhead is amortized
- Repeated sorting - GPU context can be reused
- Batch processing - Multiple sorts share setup cost
-
Shared Memory (Workgroup Memory)
var<workgroup> shared_data: array<u32, 256>; // Much faster than global memory access
-
Coalesced Memory Access
// ✅ Good: Consecutive access let value = data[global_id.x]; // ❌ Bad: Strided access let value = data[global_id.x * stride];
-
Minimize Synchronization
// Only barrier when data is shared between threads workgroupBarrier();
try {
const gpu = new GPUContext();
await gpu.initialize();
const sorter = new BitonicSorter(gpu);
const result = await sorter.sort(data);
} catch (error) {
if (error instanceof WebGPUNotSupportedError) {
// Fallback to CPU sort
data.sort((a, b) => a - b);
}
}- Bitonic Sort Algorithm - Detailed implementation
- Radix Sort Algorithm - Detailed implementation
- Performance Benchmarks - Real-world measurements