
Benchmarks: Micro benchmark - add nvbench based kernel-launch, sleep-kernel & auto-throughput #750

Open
WenqingLan1 wants to merge 45 commits into microsoft:main from WenqingLan1:feat/third_party/nvbench


@WenqingLan1
Contributor

This pull request adds support for NVBench-based GPU micro-benchmarks to SuperBench.

  • Integrated the NVBench submodule
  • Implemented two benchmarks
    • nvbench-sleep-kernel
    • nvbench-kernel-launch
  • Updated documentation and added example scripts

Example config:

version: v0.12
superbench:
  enable:
  # nvbench benchmarks
  - nvbench-sleep-kernel:single
  - nvbench-sleep-kernel:list
  - nvbench-sleep-kernel:range
  - nvbench-sleep-kernel:range-step
  - nvbench-kernel-launch
  var:
    default_local_mode: &default_local_mode
      modes:
      - name: local
        proc_num: 4
        prefix: CUDA_VISIBLE_DEVICES={proc_rank}
        parallel: yes
  benchmarks:
    nvbench-sleep-kernel:single:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "50"                   # Single value format
        timeout: 30
    nvbench-sleep-kernel:list:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[25,50,75]"         # List format - no spaces after commas
        timeout: 30
    nvbench-sleep-kernel:range:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:5]"           # Range format
        timeout: 30
    nvbench-sleep-kernel:range-step:
      <<: *default_local_mode
      timeout: 300
      parameters:
        duration_us: "[0:50:10]"         # Range with step format
        timeout: 30
    nvbench-kernel-launch:
      <<: *default_local_mode
      timeout: 300
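
The four duration_us formats in the example config (single value, list, range, range with step) can be illustrated with a small parser. This is a minimal sketch, not code from this PR: the helper name expand_duration_us is hypothetical, and it assumes that ranges include their end point, which is how the NVBench CLI is understood to treat [start:stop] and [start:stop:stride] for integer axes.

```python
import re
from typing import List

# Hypothetical helper for illustration only; not part of SuperBench or NVBench.
_RANGE = re.compile(r"^\[(\d+):(\d+)(?::(\d+))?\]$")
_LIST = re.compile(r"^\[\d+(,\d+)*\]$")  # no spaces after commas, as the config comment notes


def expand_duration_us(spec: str) -> List[int]:
    """Expand a duration_us string into the list of values it denotes."""
    spec = spec.strip()
    if spec.isdigit():  # single value, e.g. "50"
        return [int(spec)]
    match = _RANGE.match(spec)
    if match:  # "[start:stop]" or "[start:stop:step]"
        start, stop = int(match.group(1)), int(match.group(2))
        step = int(match.group(3)) if match.group(3) else 1
        if step <= 0 or stop < start:
            raise ValueError(f"invalid range: {spec!r}")
        # Assumption: the end point is included in the range.
        return list(range(start, stop + 1, step))
    if _LIST.match(spec):  # "[a,b,c]"
        return [int(value) for value in spec[1:-1].split(",")]
    raise ValueError(f"unrecognized duration_us format: {spec!r}")
```

Under that inclusive-range assumption, "[0:50:10]" denotes six sleep durations, while "[25, 50]" (with a space) is rejected rather than silently accepted.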

@WenqingLan1 requested a review from a team as a code owner October 9, 2025 23:12
@WenqingLan1 added the benchmarks (SuperBench Benchmarks) and micro-benchmarks (Micro Benchmark Test for SuperBench Benchmarks) labels Oct 9, 2025
@codecov

codecov bot commented Oct 10, 2025

Codecov Report

❌ Patch coverage is 98.20628% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.03%. Comparing base (036c471) to head (e1e12d2).

Files with missing lines                                 Patch %   Lines
...rbench/benchmarks/micro_benchmarks/nvbench_base.py    97.91%    2 Missing ⚠️
...hmarks/micro_benchmarks/nvbench_auto_throughput.py    98.07%    1 Missing ⚠️
...enchmarks/micro_benchmarks/nvbench_sleep_kernel.py    97.67%    1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #750      +/-   ##
==========================================
+ Coverage   85.69%   86.03%   +0.34%     
==========================================
  Files         103      107       +4     
  Lines        7890     8113     +223     
==========================================
+ Hits         6761     6980     +219     
- Misses       1129     1133       +4     
Flag                        Coverage Δ
cpu-python3.10-unit-test    71.18% <98.17%> (+0.75%) ⬆️
cpu-python3.7-unit-test     70.64% <98.20%> (+0.78%) ⬆️
cuda-unit-test              83.99% <98.17%> (+0.39%) ⬆️


Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 20 out of 23 changed files in this pull request and generated 2 comments.



Copilot AI review requested due to automatic review settings February 18, 2026 22:43
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 30 changed files in this pull request and generated 3 comments.



Copilot AI review requested due to automatic review settings February 26, 2026 22:04
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 26 out of 30 changed files in this pull request and generated 2 comments.



@microsoft deleted a comment from Copilot AI Mar 10, 2026
@microsoft deleted a comment from Copilot AI Mar 10, 2026
Copilot AI review requested due to automatic review settings March 10, 2026 20:55
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 5 comments.



Copilot AI review requested due to automatic review settings March 10, 2026 21:31
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 4 comments.



Comment on lines +78 to +87
# assert benchmark.result['duration_us_25_samples'][0] == 10175
self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
# self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
# self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
# assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)

# assert benchmark.result['duration_us_50_samples'][0] == 8187
# assert benchmark.result['duration_us_75_samples'][0] == 6279
Copilot AI Mar 10, 2026

There are several commented-out assertions in this test. If these metrics are intentionally not produced by the benchmark, consider removing the commented assertions; otherwise, consider emitting those metrics (samples/noise) and asserting on them to keep the test expectations complete and avoid stale commented code.

Suggested change
- # assert benchmark.result['duration_us_25_samples'][0] == 10175
- self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
- # self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_noise'][0], 69.78)
- self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
- # self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_noise'][0], 0.93)
- # assert benchmark.result['duration_us_25_batch_samples'][0] == 17448
- self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)
- # assert benchmark.result['duration_us_50_samples'][0] == 8187
- # assert benchmark.result['duration_us_75_samples'][0] == 6279
+ self.assertAlmostEqual(benchmark.result['duration_us_25_cpu_time'][0], 42.123)
+ self.assertAlmostEqual(benchmark.result['duration_us_25_gpu_time'][0], 25.321)
+ self.assertAlmostEqual(benchmark.result['duration_us_25_batch_gpu_time'][0], 23.456)

Copilot uses AI. Check for mistakes.
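
One way to keep the test expectations complete without leaving commented-out assertions is a table of expected metrics checked in a loop. The sketch below is illustrative only: the result dictionary is a stand-in for benchmark.result, filled with the values from the assertions quoted above, and the class name is hypothetical.

```python
import unittest


class SleepKernelResultSketch(unittest.TestCase):
    """Hypothetical test case; the dict below stands in for benchmark.result."""

    def test_expected_metrics(self):
        # Stand-in data using the metric names and values from the quoted assertions.
        result = {
            'duration_us_25_cpu_time': [42.123],
            'duration_us_25_gpu_time': [25.321],
            'duration_us_25_batch_gpu_time': [23.456],
        }
        # Only metrics the benchmark is known to emit are listed here;
        # samples/noise metrics would be added once they are produced.
        expected = {
            'duration_us_25_cpu_time': 42.123,
            'duration_us_25_gpu_time': 25.321,
            'duration_us_25_batch_gpu_time': 23.456,
        }
        for metric, value in expected.items():
            with self.subTest(metric=metric):
                self.assertIn(metric, result)
                self.assertAlmostEqual(result[metric][0], value)
```

Using subTest reports every failing metric by name instead of stopping at the first mismatch.
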
@microsoft deleted a comment from Copilot AI Mar 10, 2026
@microsoft deleted a comment from Copilot AI Mar 10, 2026
@microsoft deleted a comment from Copilot AI Mar 10, 2026
@WenqingLan1 changed the title from "Benchmarks: Micro benchmark - add nvbench based kernel-launch & sleep-kernel" to "Benchmarks: Micro benchmark - add nvbench based kernel-launch, sleep-kernel & auto-throughput" Mar 25, 2026
Copilot AI review requested due to automatic review settings April 8, 2026 20:27
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 25 out of 29 changed files in this pull request and generated 7 comments.



Comment on lines +47 to +55
'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
'NvbenchAutoThroughput'
Copilot AI Apr 8, 2026

The __all__ list formatting here likely violates the repo’s 120-char line limit (see .flake8 max-line-length=120). Please reformat back to one item per line (or otherwise wrap) so these lines stay <=120 and flake8/yapf remain green.

Suggested change
- 'BlasLtBaseBenchmark', 'ComputationCommunicationOverlap', 'CpuMemBwLatencyBenchmark', 'CpuHplBenchmark',
- 'CpuStreamBenchmark', 'CublasBenchmark', 'CublasLtBenchmark', 'CudaGemmFlopsBenchmark', 'CudaMemBwBenchmark',
- 'CudaNcclBwBenchmark', 'CudnnBenchmark', 'DiskBenchmark', 'DistInference', 'HipBlasLtBenchmark', 'GPCNetBenchmark',
- 'GemmFlopsBenchmark', 'GpuBurnBenchmark', 'GpuCopyBwBenchmark', 'GpuStreamBenchmark', 'IBBenchmark',
- 'IBLoopbackBenchmark', 'KernelLaunch', 'MemBwBenchmark', 'MicroBenchmark', 'MicroBenchmarkWithInvoke',
- 'ORTInferenceBenchmark', 'RocmGemmFlopsBenchmark', 'RocmMemBwBenchmark', 'ShardingMatmul',
- 'TCPConnectivityBenchmark', 'TensorRTInferenceBenchmark', 'DirectXGPUEncodingLatency', 'DirectXGPUCopyBw',
- 'DirectXGPUMemBw', 'DirectXGPUCoreFlops', 'NvBandwidthBenchmark', 'NvbenchKernelLaunch', 'NvbenchSleepKernel',
- 'NvbenchAutoThroughput'
+ 'BlasLtBaseBenchmark',
+ 'ComputationCommunicationOverlap',
+ 'CpuMemBwLatencyBenchmark',
+ 'CpuHplBenchmark',
+ 'CpuStreamBenchmark',
+ 'CublasBenchmark',
+ 'CublasLtBenchmark',
+ 'CudaGemmFlopsBenchmark',
+ 'CudaMemBwBenchmark',
+ 'CudaNcclBwBenchmark',
+ 'CudnnBenchmark',
+ 'DiskBenchmark',
+ 'DistInference',
+ 'HipBlasLtBenchmark',
+ 'GPCNetBenchmark',
+ 'GemmFlopsBenchmark',
+ 'GpuBurnBenchmark',
+ 'GpuCopyBwBenchmark',
+ 'GpuStreamBenchmark',
+ 'IBBenchmark',
+ 'IBLoopbackBenchmark',
+ 'KernelLaunch',
+ 'MemBwBenchmark',
+ 'MicroBenchmark',
+ 'MicroBenchmarkWithInvoke',
+ 'ORTInferenceBenchmark',
+ 'RocmGemmFlopsBenchmark',
+ 'RocmMemBwBenchmark',
+ 'ShardingMatmul',
+ 'TCPConnectivityBenchmark',
+ 'TensorRTInferenceBenchmark',
+ 'DirectXGPUEncodingLatency',
+ 'DirectXGPUCopyBw',
+ 'DirectXGPUMemBw',
+ 'DirectXGPUCoreFlops',
+ 'NvBandwidthBenchmark',
+ 'NvbenchKernelLaunch',
+ 'NvbenchSleepKernel',
+ 'NvbenchAutoThroughput',

Copilot uses AI. Check for mistakes.
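
Whether the wrapped lines actually exceed the limit depends on their indentation in the file, which the excerpt above does not show. A quick local check could look like the sketch below; find_long_lines is a hypothetical helper, and the default of 120 mirrors the max-line-length mentioned in the comment.

```python
from typing import List, Tuple


def find_long_lines(source: str, limit: int = 120) -> List[Tuple[int, int]]:
    """Return (line_number, length) for every line longer than `limit` characters.

    Lengths include leading indentation, matching how flake8 counts characters.
    """
    return [
        (number, len(line))
        for number, line in enumerate(source.splitlines(), start=1)
        if len(line) > limit
    ]
```

Running it over the contents of the file that defines __all__ lists only the offending lines, which makes it easy to confirm the report before reformatting.
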
Comment on lines +1 to +3
#include <cuda/std/chrono>
#include <cuda_runtime.h>
#include <nvbench/nvbench.cuh>
Copilot AI Apr 8, 2026

This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +3
#include <nvbench/nvbench.cuh>

__global__ void empty_kernel() {}
Copilot AI Apr 8, 2026

This CUDA source file is missing the standard Microsoft copyright / MIT license header that’s present in other micro-benchmark .cu files. Please add the header for consistency and license compliance.

Copilot uses AI. Check for mistakes.
Comment on lines +10 to +17
find_package(CUDAToolkit QUIET)
if (CUDAToolkit_FOUND)
    include(../cuda_common.cmake)

    # Try to find nvbench, but don't require it
    find_package(nvbench CONFIG QUIET)

    if (nvbench_FOUND)
Copilot AI Apr 8, 2026

find_package(nvbench) won’t find the nvbench install when SB_MICRO_PATH/CMAKE_INSTALL_PREFIX is a non-default prefix (e.g., the Dockerfiles set SB_MICRO_PATH=/opt/superbench). As a result, these nvbench benchmark binaries will be skipped even though third_party installs nvbench into SB_MICRO_PATH. Consider prepending CMAKE_INSTALL_PREFIX (or $ENV{SB_MICRO_PATH}) to CMAKE_PREFIX_PATH (or providing an explicit nvbench_DIR hint) before calling find_package(nvbench).

Copilot uses AI. Check for mistakes.
Comment on lines +28 to +38
foreach(src ${NVBENCH_SOURCES})
    # strip ".cu" → NAME_WE
    get_filename_component(basename ${src} NAME_WE)
    set(target nvbench_${basename})

    add_executable(${target} ${src})
    target_compile_features(${target} PUBLIC cuda_std_17)
    target_link_libraries(${target}
        PRIVATE nvbench::nvbench nvbench::main
    )
    install(TARGETS ${target} RUNTIME DESTINATION bin)
Copilot AI Apr 8, 2026

Unlike other CUDA micro-benchmarks (e.g., kernel_launch_overhead), these nvbench targets don’t set CUDA_ARCHITECTURES, so builds may default to an unintended arch set. Please set CUDA_ARCHITECTURES (e.g., to NVCC_ARCHS_SUPPORTED from cuda_common.cmake) for each generated target to ensure the installed binaries run on the intended GPUs.

Copilot uses AI. Check for mistakes.
Comment on lines +1 to +16
# Copyright (c) Microsoft Corporation.
# Licensed under the MIT License.

"""Module of the NVBench Auto Throughput benchmark."""

import re
from superbench.benchmarks import BenchmarkRegistry, Platform
from superbench.benchmarks.micro_benchmarks.nvbench_base import NvbenchBase, parse_time_to_us


class NvbenchAutoThroughput(NvbenchBase):
    """The NVBench Auto Throughput benchmark class.

    This benchmark measures memory throughput and cache hit rates using CUPTI.
    It copies a 128 MiB buffer with configurable stride and items per thread.
    """
Copilot AI Apr 8, 2026

PR description says only two NVBench benchmarks were implemented, but this PR also adds nvbench-auto-throughput (wrapper, CUDA source, tests, docs, and sample log). Please update the PR description to reflect all benchmarks added, or drop the benchmark if it’s out of scope.

Copilot uses AI. Check for mistakes.
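
The quoted module imports parse_time_to_us from nvbench_base, but its implementation is not shown in this thread. As a rough illustration of what such a helper might do, the sketch below converts a time string to microseconds. Everything here is an assumption: the function name is deliberately different from the real one, and the accepted input format (a number followed by ns, us, ms or s) is a guess rather than the format the PR actually handles.

```python
import re

# Assumed unit suffixes; the real helper in nvbench_base.py may accept others.
_UNIT_TO_US = {'ns': 1e-3, 'us': 1.0, 'ms': 1e3, 's': 1e6}
_TIME = re.compile(r'^\s*([0-9]*\.?[0-9]+)\s*(ns|us|ms|s)\s*$')


def parse_time_to_us_sketch(text: str) -> float:
    """Convert a string such as '25.321 us' to a float number of microseconds."""
    match = _TIME.match(text)
    if match is None:
        raise ValueError(f'unrecognized time value: {text!r}')
    value, unit = match.groups()
    return float(value) * _UNIT_TO_US[unit]
```

Normalizing every time metric to one unit keeps results such as cpu_time and gpu_time directly comparable across runs.
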
@@ -151,7 +151,7 @@ ADD dockerfile/etc /opt/microsoft/
WORKDIR ${SB_HOME}

Copilot AI Apr 8, 2026

The nvbench build is enabled here, but this Dockerfile doesn’t install/upgrade CMake like cuda12.8/12.9 do. Since the nvbench benchmark CMakeLists gates on CMake >= 3.30.4, this image may silently skip building the nvbench benchmark binaries (or fail building nvbench itself) depending on the base image’s CMake version. Please ensure the required CMake version is available in this image too, or adjust the nvbench CMake requirement/gating strategy.

Suggested change
+ # nvbench requires CMake >= 3.30.4; ensure an adequate version is available
+ # before building the cuda_nvbench third-party target.
+ RUN python3 -m pip install --no-cache-dir --upgrade cmake==3.30.4

Copilot uses AI. Check for mistakes.
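
To avoid a silent skip, the CMake version could be checked explicitly before the build. The following is a minimal sketch with hypothetical helper names; the 3.30.4 minimum comes from the comment above, and the parser assumes the usual "cmake version X.Y.Z" first line printed by cmake --version.

```python
import re
from typing import Tuple

REQUIRED_CMAKE = (3, 30, 4)  # minimum version cited in the review comment


def parse_cmake_version(output: str) -> Tuple[int, ...]:
    """Extract (major, minor, patch) from the text printed by `cmake --version`."""
    match = re.search(r'cmake version (\d+)\.(\d+)\.(\d+)', output)
    if match is None:
        raise ValueError('could not find a CMake version in the given output')
    return tuple(int(part) for part in match.groups())


def cmake_is_new_enough(output: str, required: Tuple[int, ...] = REQUIRED_CMAKE) -> bool:
    """Return True when the reported version is at least `required`."""
    # Tuples compare element by element, so (3, 31, 0) >= (3, 30, 4) holds.
    return parse_cmake_version(output) >= required
```

In an image build, the input could come from subprocess.run(['cmake', '--version'], capture_output=True, text=True).stdout, with the build failing loudly when the check returns False.
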

Labels

benchmarks (SuperBench Benchmarks), micro-benchmarks (Micro Benchmark Test for SuperBench Benchmarks)


4 participants