|
1 | 1 | **_GPU-Driven Rendering Performance: Traditional vs Mesh Shaders_** |
2 | | - **William Gunawan** - 22 November, 2025 |
| 2 | + **William Gunawan** - 26 November, 2025 |
3 | 3 |
|
4 | 4 | # Introduction |
5 | 5 |
|
6 | 6 | This technical article is a benchmark comparison of traditional rendering pipelines against modern GPU-driven techniques: draw indirect, mesh shaders, and task shaders. |
7 | 7 | The focus is on measuring the performance impact of moving culling decisions from CPU to GPU, and from per-instance to per-meshlet granularity. |
8 | 8 |
|
| 9 | +## Code Availability |
| 10 | + |
| 11 | +The code for this Vulkan application is readily available at:
| 12 | + - https://github.com/Williscool13/MeshTaskBenchmark |
| 13 | + |
| 14 | +The profiler results and images used in this article are also readily available at: |
| 15 | + - https://github.com/Williscool13/Williscool13.github.io/tree/main/technical/task-mesh-benchmarking |
| 16 | + |
9 | 17 | ## Test Scene |
10 | 18 |
|
11 | 19 | **Geometry:** 125 Stanford Bunnies (72,378 vertices, 144,046 triangles each) arranged in a 5x5x5 grid. |
|
76 | 84 | This configuration adds GPU-driven per-instance culling while keeping the traditional vertex pipeline. |
77 | 85 | A compute shader performs frustum culling and writes draw commands for visible instances to an indirect buffer. |
78 | 86 |
|
79 | | - |
80 | | -### Compute Pass: Instance Culling |
81 | | - |
82 | 87 | The compute shader dispatches one thread per instance to evaluate visibility. Culling here happens only at the instance level, using frustum culling. |
83 | 88 |
|
84 | 89 | ```````````````````cpp |
|
141 | 146 | A more optimized approach would batch identical models into single draws with instanceCount > 1. |
142 | 147 |
|
143 | 148 |
|
144 | | -### Render Pass: Indirect Draw |
145 | | - |
146 | 149 | The GPU reads the culled draw commands without CPU involvement. We're going all-in on GPU-Driven Rendering: |
147 | 150 | ```````````````````cpp |
148 | 151 | vkCmdBindVertexBuffers(cmd, 0, 1, &megaVertexBuffer.handle, &vertexOffset); |
|
332 | 335 |
|
333 | 336 | I'm particularly proud of this implementation :) |
334 | 337 |
|
335 | | -### Compute Pass: Instance Culling |
336 | | - |
337 | 338 |
|
338 | 339 | ``````````````````` |
339 | 340 | public struct MeshIndirectDrawParameters { |
|
390 | 391 | } |
391 | 392 | ``````````````````` |
392 | 393 |
|
393 | | -### Render Pass: Indirect Draw |
394 | 394 | The task shader needs some modification to accommodate indirect draws: |
395 | 395 | ``````````````````` |
396 | 396 | [shader("task")] |
|
479 | 479 |
|
480 | 480 | # Discussion |
481 | 481 |
|
482 | | -**Task and Mesh shaders provide massive gains with only meshlet-level culling** |
| 482 | +(###) **Task and Mesh shaders provide massive gains with only meshlet-level culling** |
| 483 | + |
483 | 484 | - Task and mesh shaders achieve a 5.3x performance increase over the traditional vertex pipeline |
484 | 485 | - Each instance is processed as 2,251 meshlets rather than as one full 72K-vertex model |
485 | 486 |
|
486 | | -**Instance-level culling shows mixed results** |
| 487 | +(###) **Instance-level culling shows mixed results** |
| 488 | + |
487 | 489 | - Indirect + Traditional: 46% faster than baseline (74 FPS vs 51 FPS) |
488 | 490 | - Traditional rendering processes every vertex regardless of visibility, so eliminating even a few instances provides measurable savings. |
489 | 491 | - There is considerable waste in traditional rendering: lots of backface rasterization, noticeably worse cache locality, and less overall control of the geometry pipeline. |
|
496 | 498 | The finer granularity results in better GPU occupancy and aligns the graphics pipeline with modern GPU-driven rendering techniques. |
497 | 499 | More importantly, it allows precise control over which parts of a mesh actually get rendered, eliminating wasted work before it reaches the rasterizer. |
498 | 500 |
|
| 501 | +## Profiler Analysis |
| 502 | + |
| 503 | +Cache behavior shows clear differences between traditional and mesh shader approaches: |
| 504 | + |
| 505 | + |
| 515 | +(###) **L2 Cache Performance** |
| 516 | + |
| 517 | +- Traditional: 57.8% hit rate |
| 518 | +- Traditional Indirect: 57.4% hit rate |
| 519 | +- Mesh: 64.2% hit rate |
| 520 | +- Mesh Indirect: 63.9% hit rate |
| 521 | + |
| 522 | +(###) **Observations** |
| 523 | + |
| 524 | +Mesh shaders achieve ~11% better L2 cache hit rates in relative terms (about 6.4 percentage points: 64.2% vs 57.8%).
| 525 | + |
| 526 | +This improvement likely stems from processing meshlets as independent, tightly-packed units rather than strided vertex buffers across the entire model. |
| 527 | + |
| 528 | +Notably, adding indirect culling doesn't significantly hurt cache performance in either pipeline. The compute pass overhead is minimal compared to the rendering workload. |
| 529 | + |
| 530 | +(###) **L1 Cache Behavior** |
| 531 | + |
| 532 | +L1 cache hit rates are consistently low across all configurations (4-7%). |
| 533 | +This pattern appears in both this benchmark and my game engine, suggesting it may be related to the draw setup or memory access patterns. |
| 534 | +While this could affect absolute performance numbers, the relative comparison between pipelines remains valid. |
499 | 535 |
|
500 | 536 | ## Limitations |
501 | 537 |
|
502 | 538 | This benchmark favors mesh shaders due to the high vertex count (72K vertices per bunny). The 5.3x speedup reflects ideal conditions for meshlet-level culling. |
503 | 539 | Other optimizations may also disproportionately improve the performance of traditional rendering techniques, further reducing the performance gap between the two approaches. |
504 | | -Geometry LOD for example, could help with would likely help traditional rendering slightly more than it does meshlet rendering. |
| 540 | +Geometry LOD, for example, would likely help traditional rendering slightly more than it helps meshlet rendering.
505 | 541 |
|
506 | | -## Pratical Considerations |
| 542 | +### Practical Considerations |
507 | 543 |
|
508 | 544 | Other factors that make traditional pipelines more appealing also need to be considered: |
509 | | - - Much better support on older GPUs. Task+Mesh is only supported on NVIDIA Turing+ (RTX 2000+), AMD RDNA2+ (RX 6000+), and Intel Arc. Traditional pipelines work on any GPU from the past decade. |
| 545 | + - Much better support on older GPUs. Task+Mesh is only supported on NVIDIA Turing+, AMD RDNA2+, and Intel Arc. Traditional pipelines work on any GPU from the past decade. |
510 | 546 | - Simpler debugging and profiling. Mesh shader workloads can be harder to trace and analyze with standard GPU tools. |
511 | 547 | - Traditional rendering is far more ubiquitous, so learning material is plentiful and general developer familiarity is high. |
512 | 548 | - Task + Mesh shaders aren't universally beneficial. Low-poly meshes (< 1000 triangles) may not benefit from the added complexity, while high-density photogrammetry scans and CAD models see the largest gains. |
513 | 549 |
|
| 550 | +# Conclusion |
| 551 | + |
| 552 | +If you're planning on exploring modern rendering techniques for use in your game engine, you need to know the benefits and drawbacks of using them. |
| 553 | +Task and mesh shaders are great for scenes with high geometry complexity, but may not perform as well for simple scenes. |
| 554 | +Adoption rate is still low, requiring modern hardware from the user. [Vulkan GPU Info](https://vulkan.gpuinfo.org/listextensions.php) reports adoption rates at <10%, so there is still a way to go before this technique can be broadly used. |
| 555 | +If you plan on making a game engine or renderer with large reach, this technique may not be the right choice for you. |
| 556 | + |
| 557 | +With all this in mind, if the circumstances are right for you, use task and mesh shaders! They're not that complicated.
| 558 | + |
| 559 | +Thanks for reading! Feel free to contact me for fun talks about graphics and game engines :) |
| 560 | + |
| 561 | +(#) References |
514 | 562 |
|
515 | | -# Further Reading |
516 | 563 | - [NVIDIA - Introduction to Mesh Shaders](https://developer.nvidia.com/blog/introduction-turing-mesh-shaders/) |
517 | 564 | - [AMD - Mesh Shader Guide](https://gpuopen.com/learn/mesh_shaders/mesh_shaders-from_vertex_shader_to_mesh_shader/) |
518 | 565 | - [NVIDIA - Using Mesh Shaders For Professional Graphics](https://developer.nvidia.com/blog/using-mesh-shaders-for-professional-graphics/) |
|