Benchmarks: Micro benchmark - add nvbench based kernel-launch, sleep-kernel & auto-throughput by WenqingLan1 · Pull Request #750 · microsoft/superbenchmark

WenqingLan1 · 2025-10-09T23:12:33Z

This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench.

- Integrated the NVBench submodule
- Implemented two benchmarks:
  - nvbench-sleep-kernel
  - nvbench-kernel-launch
- Updated documentation and added example scripts

Example config:

version: v0.12
superbench:
  enable:
  # nvbench benchmarks
  - nvbench-sleep-kernel:single
  - nvbench-sleep-kernel:list
  - nvbench-sleep-kernel:range
  - nvbench-sleep-kernel:range-step
  - nvbench-kernel-launch
  var:
    default_local_mode: &default_local_mode
      modes:
      - name: local
        proc_num: 4
        prefix: CUDA_VISIBLE_DEVICES={proc_rank}
        parallel: yes
  benchmarks:
    nvbench-sleep-kernel:single:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "50"                   # Single value format
        timeout: 30
    nvbench-sleep-kernel:list:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[25,50,75]"         # List format - no spaces after commas
        timeout: 30
    nvbench-sleep-kernel:range:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:5]"           # Range format
        timeout: 30
    nvbench-sleep-kernel:range-step:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:50:10]"         # Range with step format
        timeout: 30
    nvbench-kernel-launch:
      <<: *default_local_mode
      timeout: 300
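
The four duration_us formats in the config above can be illustrated with a small expansion helper. This is a minimal sketch, not the parsing code from this PR: the function name expand_duration_us is hypothetical, and it assumes that ranges include the stop value, which the description does not state.

```python
def expand_duration_us(spec):
    """Expand a duration_us string into a list of integers.

    Hypothetical helper for illustration only. Assumption: a range
    "[start:stop]" or "[start:stop:step]" includes the stop value.
    """
    spec = spec.strip()
    if not (spec.startswith("[") and spec.endswith("]")):
        return [int(spec)]  # single value, e.g. "50"

    inner = spec[1:-1]
    if ":" in inner:
        parts = [int(part) for part in inner.split(":")]
        if len(parts) == 2:
            start, stop = parts
            step = 1
        elif len(parts) == 3:
            start, stop, step = parts
        else:
            raise ValueError(f"unsupported range format: {spec!r}")
        if step <= 0:
            raise ValueError("step must be positive")
        return list(range(start, stop + 1, step))  # inclusive stop (assumed)

    return [int(part) for part in inner.split(",")]  # list, e.g. "[25,50,75]"


for example in ("50", "[25,50,75]", "[0:5]", "[0:50:10]"):
    print(example, "->", expand_duration_us(example))
```

Under the inclusive-range assumption, "[0:50:10]" expands to six durations (0, 10, 20, 30, 40, 50), so each range entry in the config runs one measurement per expanded value.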

codecov · 2025-10-10T20:44:21Z

Codecov Report

❌ Patch coverage is 98.20628% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.03%. Comparing base (036c471) to head (e1e12d2).

Files with missing lines	Patch %	Lines
...rbench/benchmarks/micro_benchmarks/nvbench_base.py	97.91%	2 Missing ⚠️
...hmarks/micro_benchmarks/nvbench_auto_throughput.py	98.07%	1 Missing ⚠️
...enchmarks/micro_benchmarks/nvbench_sleep_kernel.py	97.67%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   85.69%   86.03%   +0.34%     
==========================================
  Files         103      107       +4     
  Lines        7890     8113     +223     
==========================================
+ Hits         6761     6980     +219     
- Misses       1129     1133       +4
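
The percentages in this report follow directly from the Hits and Lines rows. A minimal sketch of the arithmetic, assuming coverage is hits divided by lines and that the patch consists of the 223 added lines with 4 of them missed (an assumption that reproduces the reported patch figure):

```python
def coverage_percent(hits, lines):
    """Coverage as a percentage of covered lines."""
    return 100.0 * hits / lines


base = coverage_percent(6761, 7890)      # main
head = coverage_percent(6980, 8113)      # this PR
patch = coverage_percent(223 - 4, 223)   # added lines minus missed lines (assumed)

print(f"base  {base:.2f}%")          # base  85.69%
print(f"head  {head:.2f}%")          # head  86.03%
print(f"delta {head - base:+.2f}%")  # delta +0.34%
print(f"patch {patch:.5f}%")         # patch 98.20628%
```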

Flag	Coverage Δ
cpu-python3.10-unit-test	`71.18% <98.17%> (+0.75%)`	⬆️
cpu-python3.7-unit-test	`70.64% <98.20%> (+0.78%)`	⬆️
cuda-unit-test	`83.99% <98.17%> (+0.39%)`	⬆️

Copilot

Pull request overview

Copilot reviewed 20 out of 23 changed files in this pull request and generated 2 comments.

dockerfile/cuda12.9.dockerfile

examples/benchmarks/nvbench_kernel_launch.py

Copilot

Pull request overview

Copilot reviewed 26 out of 30 changed files in this pull request and generated 3 comments.

third_party/Makefile

superbench/benchmarks/micro_benchmarks/nvbench/auto_throughput.cu

Copilot

Pull request overview

Copilot reviewed 26 out of 30 changed files in this pull request and generated 2 comments.

dockerfile/cuda13.0.dockerfile

.gitmodules

superbench/benchmarks/micro_benchmarks/__init__.py

dockerfile/rocm5.0.x.dockerfile

Copilot

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 5 comments.

third_party/Makefile

superbench/benchmarks/micro_benchmarks/nvbench_base.py

tests/benchmarks/micro_benchmarks/test_nvbench_base.py

dockerfile/cuda12.9.dockerfile

tests/benchmarks/micro_benchmarks/test_nvbench_sleep_kernel.py

Copilot

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 4 comments.

Copilot · 2026-03-10T21:34:24Z

tests/benchmarks/micro_benchmarks/test_nvbench_sleep_kernel.py

+        # assert benchmark.result['duration_us_25_samples'][0] == 10175
+        self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
+        # self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
+        self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
+        # self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
+        # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
+        self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
+
+        # assert benchmark.result['duration_us_50_samples'][0] == 8187
+        # assert benchmark.result['duration_us_75_samples'][0] == 6279


There are several commented-out assertions in this test. If these metrics are intentionally not produced by the benchmark, consider removing the commented assertions; otherwise, consider emitting those metrics (samples/noise) and asserting on them to keep the test expectations complete and avoid stale commented code.

Suggested change

-        # assert benchmark.result['duration_us_25_samples'][0] == 10175
-        self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
-        # self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
-        self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
-        # self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
-        # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
-        self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
-        # assert benchmark.result['duration_us_50_samples'][0] == 8187
-        # assert benchmark.result['duration_us_75_samples'][0] == 6279
+        self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
+        self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
+        self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
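
One way to keep such expectations complete without commented-out lines is to drive the assertions from a single table of expected metrics. The sketch below is illustrative only: FAKE_RESULT stands in for benchmark.result and the class name is hypothetical; the values are the ones from the test above.

```python
import unittest

# Stand-in for benchmark.result: each metric name maps to a list of values.
FAKE_RESULT = {
    'duration_us_25_cpu_time': [42.123],
    'duration_us_25_gpu_time': [25.321],
    'duration_us_25_batch_gpu_time': [23.456],
}

# Single source of truth for the metrics the test expects.
EXPECTED = {
    'duration_us_25_cpu_time': 42.123,
    'duration_us_25_gpu_time': 25.321,
    'duration_us_25_batch_gpu_time': 23.456,
}


class SleepKernelMetricsTest(unittest.TestCase):
    def test_expected_metrics(self):
        for name, value in EXPECTED.items():
            # subTest reports each failing metric separately.
            with self.subTest(metric=name):
                self.assertAlmostEqual(FAKE_RESULT[name][0], value)


suite = unittest.defaultTestLoader.loadTestsFromTestCase(SleepKernelMetricsTest)
outcome = unittest.TextTestRunner(verbosity=0).run(suite)
print(outcome.wasSuccessful())
```

Adding a metric such as samples or noise later then means adding one entry to EXPECTED rather than uncommenting an assertion.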

Copilot

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 7 comments.

Copilot · 2026-04-08T20:42:45Z

superbench/benchmarks/micro_benchmarks/__init__.py

+    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
+    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
+    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
+    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
+    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
+    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
+    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
+    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
+    'NvbenchAutoThroughput'


The __all__ list formatting here likely violates the repo’s 120-char line limit (see .flake8 max-line-length=120). Please reformat back to one item per line (or otherwise wrap) so these lines stay <=120 and flake8/yapf remain green.

Suggested change

-    'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
-    'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
-    'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
-    'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
-    'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
-    'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
-    'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
-    'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
-    'NvbenchAutoThroughput'
+    'BlasLtBaseBenchmark',
+    'ComputationCommunicationOverlap',
+    'CpuMemBwLatencyBenchmark',
+    'CpuHplBenchmark',
+    'CpuStreamBenchmark',
+    'CublasBenchmark',
+    'CublasLtBenchmark',
+    'CudaGemmFlopsBenchmark',
+    'CudaMemBwBenchmark',
+    'CudaNcclBwBenchmark',
+    'CudnnBenchmark',
+    'DiskBenchmark',
+    'DistInference',
+    'HipBlasLtBenchmark',
+    'GPCNetBenchmark',
+    'GemmFlopsBenchmark',
+    'GpuBurnBenchmark',
+    'GpuCopyBwBenchmark',
+    'GpuStreamBenchmark',
+    'IBBenchmark',
+    'IBLoopbackBenchmark',
+    'KernelLaunch',
+    'MemBwBenchmark',
+    'MicroBenchmark',
+    'MicroBenchmarkWithInvoke',
+    'ORTInferenceBenchmark',
+    'RocmGemmFlopsBenchmark',
+    'RocmMemBwBenchmark',
+    'ShardingMatmul',
+    'TCPConnectivityBenchmark',
+    'TensorRTInferenceBenchmark',
+    'DirectXGPUEncodingLatency',
+    'DirectXGPUCopyBw',
+    'DirectXGPUMemBw',
+    'DirectXGPUCoreFlops',
+    'NvBandwidthBenchmark',
+    'NvbenchKernelLaunch',
+    'NvbenchSleepKernel',
+    'NvbenchAutoThroughput',
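
Whether a given wrapping stays within the limit can be checked by measuring line lengths directly. A minimal sketch, assuming the 120-character limit quoted from .flake8; the helper find_long_lines is hypothetical and is not part of the repo or of flake8:

```python
def find_long_lines(text, limit=120):
    """Return (line_number, length) for every line longer than `limit` characters."""
    return [
        (number, len(line))
        for number, line in enumerate(text.splitlines(), start=1)
        if len(line) > limit
    ]


sample = "\n".join([
    "    'NvbenchKernelLaunch',",  # one item per line: short
    "    " + "'X', " * 30,         # 4 + 30 * 5 = 154 characters: too long
])

print(find_long_lines(sample))  # [(2, 154)]
```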

Copilot · 2026-04-08T20:42:45Z

superbench/benchmarks/micro_benchmarks/nvbench/sleep_kernel.cu

+#include <cuda/std/chrono>
+#include <cuda_runtime.h>
+#include <nvbench/nvbench.cuh>


This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.

Copilot · 2026-04-08T20:42:46Z

superbench/benchmarks/micro_benchmarks/nvbench/kernel_launch.cu

+#include <nvbench/nvbench.cuh>
+
+__global__ void empty_kernel() {}


This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.
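
A missing header like this can be caught mechanically. The sketch below is a hypothetical check, not an existing repo script: it takes the two header phrases from the Python files in this PR and assumes the .cu files carry the same phrases behind `//` comments.

```python
# Phrases taken from the header of the Python files in this PR.
REQUIRED_PHRASES = (
    "Copyright (c) Microsoft Corporation.",
    "Licensed under the MIT License.",
)


def has_license_header(source, max_lines=5):
    """Return True if both phrases appear within the first `max_lines` lines."""
    head = "\n".join(source.splitlines()[:max_lines])
    return all(phrase in head for phrase in REQUIRED_PHRASES)


# Assumed header style for CUDA sources (comment prefix is an assumption).
with_header = (
    "// Copyright (c) Microsoft Corporation.\n"
    "// Licensed under the MIT License.\n"
    "\n"
    "#include <nvbench/nvbench.cuh>\n"
)
without_header = "#include <nvbench/nvbench.cuh>\n\n__global__ void empty_kernel() {}\n"

print(has_license_header(with_header))     # True
print(has_license_header(without_header))  # False
```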

Copilot · 2026-04-08T20:42:46Z

superbench/benchmarks/micro_benchmarks/nvbench/CMakeLists.txt

+find_package(CUDAToolkit QUIET)
+if (CUDAToolkit_FOUND)
+  include(../cuda_common.cmake)
+
+  # Try to find nvbench, but don't require it
+  find_package(nvbench CONFIG QUIET)
+
+  if (nvbench_FOUND)


find_package(nvbench) won’t find the nvbench install when SB_MICRO_PATH/CMAKE_INSTALL_PREFIX is a non-default prefix (e.g., the Dockerfiles set SB_MICRO_PATH=/opt/superbench). As a result, these nvbench benchmark binaries will be skipped even though third_party installs nvbench into SB_MICRO_PATH. Consider prepending CMAKE_INSTALL_PREFIX (or $ENV{SB_MICRO_PATH}) to CMAKE_PREFIX_PATH (or providing an explicit nvbench_DIR hint) before calling find_package(nvbench).

Copilot · 2026-04-08T20:42:46Z

superbench/benchmarks/micro_benchmarks/nvbench/CMakeLists.txt

+    foreach(src ${NVBENCH_SOURCES})
+      # strip ".cu" → NAME_WE
+      get_filename_component(basename ${src} NAME_WE)
+      set(target nvbench_${basename})
+
+      add_executable(${target} ${src})
+      target_compile_features(${target} PUBLIC cuda_std_17)
+      target_link_libraries(${target}
+        PRIVATE nvbench::nvbench nvbench::main
+      )
+      install(TARGETS ${target} RUNTIME DESTINATION bin)


Unlike other CUDA micro-benchmarks (e.g., kernel_launch_overhead), these nvbench targets don’t set CUDA_ARCHITECTURES, so builds may default to an unintended arch set. Please set CUDA_ARCHITECTURES (e.g., to NVCC_ARCHS_SUPPORTED from cuda_common.cmake) for each generated target to ensure the installed binaries run on the intended GPUs.

Copilot · 2026-04-08T20:42:46Z

superbench/benchmarks/micro_benchmarks/nvbench_auto_throughput.py

+# Copyright (c) Microsoft Corporation.
+# Licensed under the MIT License.
+
+"""Module of the NVBench Auto Throughput benchmark."""
+
+import re
+from superbench.benchmarks import BenchmarkRegistry, Platform
+from superbench.benchmarks.micro_benchmarks.nvbench_base import NvbenchBase, parse_time_to_us
+
+
+class NvbenchAutoThroughput(NvbenchBase):
+    """The NVBench Auto Throughput benchmark class.
+
+    This benchmark measures memory throughput and cache hit rates using CUPTI.
+    It copies a 128 MiB buffer with configurable stride and items per thread.
+    """


PR description says only two NVBench benchmarks were implemented, but this PR also adds nvbench-auto-throughput (wrapper, CUDA source, tests, docs, and sample log). Please update the PR description to reflect all benchmarks added, or drop the benchmark if it’s out of scope.

Copilot · 2026-04-08T20:42:47Z

dockerfile/cuda13.0.dockerfile

@@ -151,7 +151,7 @@ ADD dockerfile/etc /opt/microsoft/
 WORKDIR ${SB_HOME}



The nvbench build is enabled here, but this Dockerfile doesn’t install/upgrade CMake like cuda12.8/12.9 do. Since the nvbench benchmark CMakeLists gates on CMake >= 3.30.4, this image may silently skip building the nvbench benchmark binaries (or fail building nvbench itself) depending on the base image’s CMake version. Please ensure the required CMake version is available in this image too, or adjust the nvbench CMake requirement/gating strategy.

Suggested change

+# nvbench requires CMake >= 3.30.4; ensure an adequate version is available
+# before building the cuda_nvbench third-party target.
+RUN python3 -m pip install --no-cache-dir --upgrade cmake==3.30.4
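
To confirm whether an image meets the 3.30.4 requirement mentioned above, the first line of `cmake --version` output can be parsed and compared. A minimal sketch with hypothetical helper names; it assumes the usual "cmake version X.Y.Z" output format:

```python
import re


def parse_cmake_version(version_output):
    """Extract (major, minor, patch) from `cmake --version` output."""
    match = re.search(r"cmake version (\d+)\.(\d+)\.(\d+)", version_output)
    if match is None:
        raise ValueError("unrecognized cmake --version output")
    return tuple(int(part) for part in match.groups())


def cmake_is_new_enough(version_output, minimum=(3, 30, 4)):
    """Compare the parsed version against the minimum, component by component."""
    return parse_cmake_version(version_output) >= minimum


print(cmake_is_new_enough("cmake version 3.22.1\n"))  # False
print(cmake_is_new_enough("cmake version 3.30.4\n"))  # True
```

Inside an image, the input could come from `subprocess.run(["cmake", "--version"], capture_output=True, text=True).stdout`.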

WenqingLan1 and others added 15 commits July 22, 2025 16:03

add nvbench kernel launch

741ee98

submodule update

0ae7864

init sleep kernel

35bfb61

Merge branch 'microsoft:main' into feat/third_party/nvbench

66b4786

Merge branch 'microsoft:main' into feat/third_party/nvbench

82aed0c

Merge branch 'microsoft:main' into feat/third_party/nvbench

24ee0a5

test sleep kernel

bd87f50

add sm 103

a663db6

add arg parsing logic

32fe197

Merge branch 'microsoft:main' into feat/third_party/nvbench

76562dc

add arg parsing tests

3eb5525

refactor

4785fe6

refine logic - remove gpu_id

1fb7c05

add doc

83c442c

refine regex & update nvbench submodule

4b274c4

WenqingLan1 requested a review from a team as a code owner October 9, 2025 23:12

WenqingLan1 added the benchmarks (SuperBench Benchmarks) and micro-benchmarks (Micro Benchmark Test for SuperBench Benchmarks) labels Oct 9, 2025

WenqingLan1 added 8 commits October 10, 2025 16:48

update cmake

0cf48bb

fix lint

5905647

fix lint

baa57c9

fix import

ecce2d9

fix

3a58ead

fix

d0d8773

fix

fbb5969

fix

f007745

WenqingLan1 added 3 commits October 10, 2025 21:23

fix

b6b6082

fix

0f2c838

fix

5bd20f6

Copilot AI reviewed Feb 6, 2026

dockerfile/cuda12.9.dockerfile (resolved)

examples/benchmarks/nvbench_kernel_launch.py (resolved)

WenqingLan1 added 2 commits February 6, 2026 11:03

fix comments

0804c12

add auto throughput benchmark

c1d1e43

Copilot AI review requested due to automatic review settings February 18, 2026 22:43

Copilot AI reviewed Feb 18, 2026

third_party/Makefile (resolved)

third_party/Makefile (resolved)

superbench/benchmarks/micro_benchmarks/nvbench/auto_throughput.cu (resolved)

WenqingLan1 added 2 commits February 20, 2026 15:11

refined logic & fix bug

c34591d

add comment to clarify diff between nvbench-kernel-launch and kernel-launch

68f5c7d

Copilot AI review requested due to automatic review settings February 26, 2026 22:04

Copilot AI reviewed Feb 26, 2026

polarG reviewed Mar 5, 2026

dockerfile/cuda13.0.dockerfile (outdated, resolved)

.gitmodules (resolved)

superbench/benchmarks/micro_benchmarks/__init__.py (resolved)

polarG reviewed Mar 5, 2026

dockerfile/rocm5.0.x.dockerfile (resolved)

microsoft deleted a comment from Copilot AI Mar 10, 2026

WenqingLan1 added 2 commits March 10, 2026 10:54

resolve comments

0bde332

fix lint

7c456cf

Copilot AI review requested due to automatic review settings March 10, 2026 20:55

Copilot AI reviewed Mar 10, 2026

WenqingLan1 added 2 commits March 10, 2026 14:18

fix pipeline & resolve comments

9643150

fix lint

f1a3b6d

Copilot AI review requested due to automatic review settings March 10, 2026 21:31

Copilot AI reviewed Mar 10, 2026

microsoft deleted a comment from Copilot AI Mar 10, 2026

fix test

fe48e35

WenqingLan1 changed the title from "Benchmarks: Micro benchmark - add nvbench based kernel-launch & sleep-kernel" to "Benchmarks: Micro benchmark - add nvbench based kernel-launch, sleep-kernel & auto-throughput" Mar 25, 2026

Merge branch 'microsoft:main' into feat/third_party/nvbench

e1e12d2

Copilot AI review requested due to automatic review settings April 8, 2026 20:27

Copilot started reviewing on behalf of WenqingLan1 April 8, 2026 20:29

Copilot AI reviewed Apr 8, 2026
