Benchmarks: Micro benchmark - add nvbench based kernel-launch, sleep-kernel & auto-throughput #750
WenqingLan1 wants to merge 45 commits into microsoft:main
Codecov Report
@@ Coverage Diff @@
## main #750 +/- ##
==========================================
+ Coverage 85.69% 86.03% +0.34%
==========================================
Files 103 107 +4
Lines 7890 8113 +223
==========================================
+ Hits 6761 6980 +219
- Misses 1129 1133 +4
Pull request overview
Copilot reviewed 20 out of 23 changed files in this pull request and generated 2 comments.
Copilot reviewed 26 out of 30 changed files in this pull request and generated 3 comments.
Copilot reviewed 26 out of 30 changed files in this pull request and generated 2 comments.
Copilot reviewed 25 out of 29 changed files in this pull request and generated 5 comments.
Copilot reviewed 25 out of 29 changed files in this pull request and generated 4 comments.
# assert benchmark.result['duration_us_25_samples'][0] == 10175
self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
# self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
# self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
# assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)

# assert benchmark.result['duration_us_50_samples'][0] == 8187
# assert benchmark.result['duration_us_75_samples'][0] == 6279
There are several commented-out assertions in this test. If these metrics are intentionally not produced by the benchmark, consider removing the commented assertions; otherwise, consider emitting those metrics (samples/noise) and asserting on them to keep the test expectations complete and avoid stale commented code.
Suggested change:
- # assert benchmark.result['duration_us_25_samples'][0] == 10175
- self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
- # self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
- self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
- # self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
- # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
- self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
- # assert benchmark.result['duration_us_50_samples'][0] == 8187
- # assert benchmark.result['duration_us_75_samples'][0] == 6279
+ self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
+ self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
+ self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
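The review comment offers two options: drop the stale assertions, or emit the metrics and assert on them. A minimal sketch of the first option follows, assuming the samples/noise metrics are intentionally not produced. The `RESULT` dict is a hypothetical stand-in for `benchmark.result`, and the class and helper names are illustrative rather than taken from the PR; the metric names and values are copied from the test above.

```python
import unittest

# Hypothetical stand-in for benchmark.result after parsing a sample log.
RESULT = {
    'duration_us_25_cpu_time': [42.123],
    'duration_us_25_gpu_time': [25.321],
    'duration_us_25_batch_gpu_time': [23.456],
}

# Assumption: these metrics are deliberately not emitted by the benchmark.
NOT_EMITTED = (
    'duration_us_25_samples',
    'duration_us_25_cpu_noise',
    'duration_us_25_gpu_noise',
    'duration_us_25_batch_samples',
)


class ResultExpectationTest(unittest.TestCase):
    """Assert on emitted metrics and make the absence of the others explicit."""

    def test_emitted_metrics(self):
        self.assertAlmostEqual(RESULT['duration_us_25_cpu_time'][0], 42.123)
        self.assertAlmostEqual(RESULT['duration_us_25_gpu_time'][0], 25.321)
        self.assertAlmostEqual(RESULT['duration_us_25_batch_gpu_time'][0], 23.456)

    def test_not_emitted_metrics(self):
        # An explicit check replaces the commented-out assertions.
        for name in NOT_EMITTED:
            self.assertNotIn(name, RESULT)


def run_suite():
    """Run the sketch test case and return the unittest result object."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(ResultExpectationTest)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Asserting absence keeps the expectation visible in the test, so a later change that starts emitting those metrics fails loudly instead of leaving dead comments behind.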
Copilot reviewed 25 out of 29 changed files in this pull request and generated 7 comments.
    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
    'NvbenchAutoThroughput'
The __all__ list formatting here likely violates the repo’s 120-char line limit (see .flake8 max-line-length=120). Please reformat back to one item per line (or otherwise wrap) so these lines stay <=120 and flake8/yapf remain green.
Suggested change:
-     'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
-     'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
-     'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
-     'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
-     'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
-     'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
-     'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
-     'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
-     'NvbenchAutoThroughput'
+     'BlasLtBaseBenchmark',
+     'ComputationCommunicationOverlap',
+     'CpuMemBwLatencyBenchmark',
+     'CpuHplBenchmark',
+     'CpuStreamBenchmark',
+     'CublasBenchmark',
+     'CublasLtBenchmark',
+     'CudaGemmFlopsBenchmark',
+     'CudaMemBwBenchmark',
+     'CudaNcclBwBenchmark',
+     'CudnnBenchmark',
+     'DiskBenchmark',
+     'DistInference',
+     'HipBlasLtBenchmark',
+     'GPCNetBenchmark',
+     'GemmFlopsBenchmark',
+     'GpuBurnBenchmark',
+     'GpuCopyBwBenchmark',
+     'GpuStreamBenchmark',
+     'IBBenchmark',
+     'IBLoopbackBenchmark',
+     'KernelLaunch',
+     'MemBwBenchmark',
+     'MicroBenchmark',
+     'MicroBenchmarkWithInvoke',
+     'ORTInferenceBenchmark',
+     'RocmGemmFlopsBenchmark',
+     'RocmMemBwBenchmark',
+     'ShardingMatmul',
+     'TCPConnectivityBenchmark',
+     'TensorRTInferenceBenchmark',
+     'DirectXGPUEncodingLatency',
+     'DirectXGPUCopyBw',
+     'DirectXGPUMemBw',
+     'DirectXGPUCoreFlops',
+     'NvBandwidthBenchmark',
+     'NvbenchKernelLaunch',
+     'NvbenchSleepKernel',
+     'NvbenchAutoThroughput',
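The 120-character limit mentioned in the comment can be checked locally before pushing. The sketch below is a hypothetical helper, not part of the PR or of flake8: `find_long_lines` reports lines over the limit, and `one_item_per_line` renders a list of names in the one-item-per-line layout the suggestion restores.

```python
# Limit taken from the review comment (.flake8 max-line-length=120).
MAX_LINE_LENGTH = 120


def find_long_lines(source, limit=MAX_LINE_LENGTH):
    """Return (line_number, length) for every line longer than `limit`."""
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]


def one_item_per_line(names, indent='    '):
    """Render names as quoted entries, one per line, each with a trailing comma."""
    return '\n'.join(f'{indent}{name!r},' for name in names)
```

For example, `one_item_per_line(['NvbenchKernelLaunch', 'NvbenchSleepKernel'])` yields two indented lines that can be pasted between the brackets of the list. Running flake8 itself remains the authoritative check.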
#include <cuda/std/chrono>
#include <cuda_runtime.h>
#include <nvbench/nvbench.cuh>
This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.
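A missing header like this can be caught mechanically. The following is a minimal sketch of such a check, assuming the header text matches the two lines used in the Python sources of this PR and that the .cu files carry the same text behind `//` comment markers; the function name and the five-line scan window are illustrative choices.

```python
# Header text as it appears in the Python sources of this PR.
HEADER_LINES = (
    'Copyright (c) Microsoft Corporation.',
    'Licensed under the MIT License.',
)


def has_license_header(source, scan_lines=5):
    """Return True if both header lines appear within the first `scan_lines` lines."""
    head = source.splitlines()[:scan_lines]
    # Strip leading comment markers ('//', '#', '*') and surrounding whitespace.
    stripped = [line.lstrip('/#* ').strip() for line in head]
    return all(expected in stripped for expected in HEADER_LINES)
```

A file that starts directly with `#include <cuda/std/chrono>` fails the check, while one that opens with the two commented header lines passes.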
#include <nvbench/nvbench.cuh>

__global__ void empty_kernel() {}
This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.
find_package(CUDAToolkit QUIET)
if (CUDAToolkit_FOUND)
    include(../cuda_common.cmake)

    # Try to find nvbench, but don't require it
    find_package(nvbench CONFIG QUIET)

    if (nvbench_FOUND)
find_package(nvbench) won’t find the nvbench install when SB_MICRO_PATH/CMAKE_INSTALL_PREFIX is a non-default prefix (e.g., the Dockerfiles set SB_MICRO_PATH=/opt/superbench). As a result, these nvbench benchmark binaries will be skipped even though third_party installs nvbench into SB_MICRO_PATH. Consider prepending CMAKE_INSTALL_PREFIX (or $ENV{SB_MICRO_PATH}) to CMAKE_PREFIX_PATH (or providing an explicit nvbench_DIR hint) before calling find_package(nvbench).
foreach(src ${NVBENCH_SOURCES})
    # strip ".cu" → NAME_WE
    get_filename_component(basename ${src} NAME_WE)
    set(target nvbench_${basename})

    add_executable(${target} ${src})
    target_compile_features(${target} PUBLIC cuda_std_17)
    target_link_libraries(${target}
        PRIVATE nvbench::nvbench nvbench::main
    )
    install(TARGETS ${target} RUNTIME DESTINATION bin)
Unlike other CUDA micro-benchmarks (e.g., kernel_launch_overhead), these nvbench targets don’t set CUDA_ARCHITECTURES, so builds may default to an unintended arch set. Please set CUDA_ARCHITECTURES (e.g., to NVCC_ARCHS_SUPPORTED from cuda_common.cmake) for each generated target to ensure the installed binaries run on the intended GPUs.
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Module of the NVBench Auto Throughput benchmark."""

import re
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks.nvbench_base import NvbenchBase, parse_time_to_us


class NvbenchAutoThroughput(NvbenchBase):
    """The NVBench Auto Throughput benchmark class.

    This benchmark measures memory throughput and cache hit rates using CUPTI.
    It copies a 128 MiB buffer with configurable stride and items per thread.
    """
PR description says only two NVBench benchmarks were implemented, but this PR also adds nvbench-auto-throughput (wrapper, CUDA source, tests, docs, and sample log). Please update the PR description to reflect all benchmarks added, or drop the benchmark if it’s out of scope.
@@ -151,7 +151,7 @@ ADD dockerfile/etc /opt/microsoft/
WORKDIR ${SB_HOME}
The nvbench build is enabled here, but this Dockerfile doesn’t install/upgrade CMake like cuda12.8/12.9 do. Since the nvbench benchmark CMakeLists gates on CMake >= 3.30.4, this image may silently skip building the nvbench benchmark binaries (or fail building nvbench itself) depending on the base image’s CMake version. Please ensure the required CMake version is available in this image too, or adjust the nvbench CMake requirement/gating strategy.
# nvbench requires CMake >= 3.30.4; ensure an adequate version is available
# before building the cuda_nvbench third-party target.
RUN python3 -m pip install --no-cache-dir --upgrade cmake==3.30.4
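The silent-skip risk described in the comment comes down to a version comparison. Here is a minimal sketch of that gate, assuming the input is the first line printed by `cmake --version` (for example `cmake version 3.30.4`); the function names are hypothetical and the 3.30.4 minimum is taken from the comment above.

```python
import re

# Minimum CMake version stated in the review comment.
REQUIRED_CMAKE = (3, 30, 4)


def parse_cmake_version(version_output):
    """Extract (major, minor, patch) from text such as 'cmake version 3.30.4'."""
    match = re.search(r'(\d+)\.(\d+)\.(\d+)', version_output)
    if match is None:
        raise ValueError(f'no version number found in {version_output!r}')
    return tuple(int(part) for part in match.groups())


def cmake_is_sufficient(version_output, required=REQUIRED_CMAKE):
    """Return True if the reported CMake version meets the required minimum."""
    # Tuples compare element by element, so (3, 31, 0) >= (3, 30, 4) holds.
    return parse_cmake_version(version_output) >= required
```

Comparing integer tuples avoids the string-comparison trap where `'3.9.0'` sorts after `'3.30.4'`. A check like this could fail the image build early instead of letting the nvbench targets be skipped.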
This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench.
nvbench-sleep-kernel
nvbench-kernel-launch
Example config: