Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream by WenqingLan1 · Pull Request #769 · microsoft/superbenchmark

WenqingLan1 · 2025-12-19T20:05:13Z

Refinements:

Add support for float (fp32) execution, with a --data_type <float|double> CLI option for runtime type selection.
Refactor the CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move the template kernel implementations into a header so the templates can be instantiated across compilation units.
Fix the buffer allocation size bug: args->size is the buffer size in bytes, not the number of elements.
Restrict execution and output to a single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update the metric/tag formats (removing gpu_id), along with the docs, examples, and test log.
Switch NUMA allocation from the hard-coded numa_alloc_onnode to numa_alloc_local, so host memory is allocated on the local node.
Rename the entry point file from gpu_stream_test.cpp to gpu_stream_main.cpp.
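
The byte-versus-element distinction and the 128-bit access width come down to simple arithmetic. The sketch below is illustrative Python, not code from this PR: `buffer_layout` is a hypothetical helper, and rejecting sizes that are not a multiple of 16 bytes is an assumption based on the 128-bit vector width.

```python
# Illustrative only: models the sizing arithmetic, not the PR's CUDA code.
ELEM_BYTES = {"float": 4, "double": 8}  # sizeof(float), sizeof(double)
VECTOR_BYTES = 16                       # 128 bits: one float4 or one double2


def buffer_layout(size_bytes, data_type):
    """Return (num_elements, num_vectors, elems_per_vector) for a buffer of size_bytes."""
    elem_bytes = ELEM_BYTES[data_type]
    # Assumption: the buffer must split evenly into 128-bit vectors.
    if size_bytes <= 0 or size_bytes % VECTOR_BYTES != 0:
        raise ValueError("size must be a positive multiple of 16 bytes")
    num_elements = size_bytes // elem_bytes
    num_vectors = size_bytes // VECTOR_BYTES
    return num_elements, num_vectors, VECTOR_BYTES // elem_bytes


print(buffer_layout(1048576, "float"))   # (262144, 65536, 4)
print(buffer_layout(1048576, "double"))  # (131072, 65536, 2)
```

Confusing the byte count with the element count changes the allocation by a factor of the element size (4 for float, 8 for double), which is why the two quantities need to be kept separate.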

New config:

    gpu-stream:fp64:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 1308622848
        data_type: double
    gpu-stream:fp64-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: double
        check_data: true
    gpu-stream:fp32:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 10
        num_loops: 40
        size: 654311424
        data_type: float
    gpu-stream:fp32-correctness:
      <<: *default_local_mode
      timeout: 600
      parameters:
        num_warm_up: 0
        num_loops: 1
        size: 1048576
        data_type: float
        check_data: true
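
The fp64 performance config uses exactly twice the `size` of the fp32 one. A short check, assuming `size` is a byte count (as the description says of `args->size`), suggests that both configs then cover the same number of elements. This is an illustrative sketch, not code from the PR.

```python
# Assumption: `size` is a byte count, as stated for args->size in the description.
configs = {
    "gpu-stream:fp64": {"size": 1308622848, "data_type": "double"},
    "gpu-stream:fp32": {"size": 654311424, "data_type": "float"},
}
elem_bytes = {"float": 4, "double": 8}

# Element count per config = bytes / bytes-per-element.
elements = {
    name: cfg["size"] // elem_bytes[cfg["data_type"]]
    for name, cfg in configs.items()
}
for name, count in elements.items():
    print(name, count)  # both print 163577856
```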

New rule:

    gpu-stream:
      statistics:
        - mean
      categories: GPU-STREAM
      aggregate: True
      metrics:
        - gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)
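
To see what the metric pattern accepts, it can be tried against a metric name with Python's `re` module. This is a sketch only: `rank_of` is a hypothetical helper, the use of `fullmatch` is an assumption about how the pattern is applied, and the fp16 name is made up to show a non-match.

```python
import re

# Pattern copied from the rule's `metrics` entry.
METRIC_PATTERN = re.compile(r"gpu-stream:fp(?:32|64)/STREAM_.*_(?:bw|ratio):(\d+)")


def rank_of(metric_name):
    """Return the captured trailing index as an int, or None if the name does not match."""
    m = METRIC_PATTERN.fullmatch(metric_name)
    return int(m.group(1)) if m else None


print(rank_of("gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3"))  # 3
# Hypothetical name with an unsupported precision prefix: no match.
print(rank_of("gpu-stream:fp16/STREAM_COPY_half_buffer_1024_block_256_bw:0"))  # None
```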

Example results:

"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234, 
"gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234

Processed by rules:

| gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw | mean | 1234|
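
The step from the four per-index results to the single processed row can be sketched as follows. This is an illustration of the idea, not SuperBench's implementation: `aggregate_mean` is a hypothetical helper, and it assumes that aggregation drops the trailing `:<index>` suffix and averages the values that share the remaining name.

```python
import re
from collections import defaultdict
from statistics import mean

# The four example results from the PR description.
results = {
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:0": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:1": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:2": 1234,
    "gpu-stream:fp32/STREAM_COPY_float_buffer_2617245696_block_256_bw:3": 1234,
}


def aggregate_mean(raw):
    """Group metrics by name without the trailing ':<index>' and average each group."""
    grouped = defaultdict(list)
    for name, value in raw.items():
        base = re.sub(r":\d+$", "", name)  # drop the trailing index
        grouped[base].append(value)
    return {base: mean(values) for base, values in grouped.items()}


for name, value in aggregate_mean(results).items():
    print(f"| {name} | mean | {value} |")
```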

codecov · 2025-12-19T20:14:11Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 85.69%. Comparing base (036c471) to head (58fead3).

Additional details and impacted files

@@           Coverage Diff           @@
##             main     #769   +/-   ##
=======================================
  Coverage   85.69%   85.69%           
=======================================
  Files         103      103           
  Lines        7890     7891    +1     
=======================================
+ Hits         6761     6762    +1     
  Misses       1129     1129

| Flag | Coverage Δ |
| --- | --- |
| cpu-python3.10-unit-test | `70.42% <50.00%> (+<0.01%)` ⬆️ |
| cpu-python3.7-unit-test | `69.85% <50.00%> (+<0.01%)` ⬆️ |
| cuda-unit-test | `83.60% <100.00%> (+<0.01%)` ⬆️ |


Copilot

Pull request overview

Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.

Changes:

Add --data_type <float|double> to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests.
Refactor CUDA kernels to use 128-bit vectorized accesses (double2 / float4) and move template kernel implementations into a header for cross-TU instantiation.
Restrict execution and output to a single visible GPU (device 0 via CUDA_VISIBLE_DEVICES) and update the metric/tag formats (removing gpu_id), along with the docs, examples, and test log.
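
How a wrapper might validate `--data_type` and forward it to the binary can be sketched with `argparse`. This is a minimal illustration, not the code in `gpu_stream.py`: `build_command`, the binary name `gpu_stream`, and the `double` default are all assumptions, and the flag names are taken from the config parameters in this PR.

```python
import argparse


def build_command(argv, binary="gpu_stream"):  # binary name is an assumption
    """Parse wrapper arguments and return the command line for the benchmark binary."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--size", type=int, required=True)
    parser.add_argument("--num_warm_up", type=int, required=True)
    parser.add_argument("--num_loops", type=int, required=True)
    # Only the two supported precisions are accepted; the default is an assumption.
    parser.add_argument("--data_type", choices=["float", "double"], default="double")
    parser.add_argument("--check_data", action="store_true")
    args = parser.parse_args(argv)

    cmd = (
        f"{binary} --size {args.size} --num_warm_up {args.num_warm_up} "
        f"--num_loops {args.num_loops} --data_type {args.data_type}"
    )
    if args.check_data:
        cmd += " --check_data"
    return cmd


print(build_command(["--size", "1048576", "--num_warm_up", "0",
                     "--num_loops", "1", "--data_type", "float", "--check_data"]))
```

Restricting `choices` at the wrapper level means an unsupported type is rejected before the binary is launched.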

Reviewed changes

Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.


| File | Description |
| --- | --- |
| `tests/data/gpu_stream.log` | Updates golden log output to include data type and new tag format (no `gpu_id`). |
| `tests/benchmarks/micro_benchmarks/test_gpu_stream.py` | Extends command-generation assertions to include `--data_type` (currently only covers `double`). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp` | Removes NUMA/GPU iteration fields from args and adds `Opts::data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp` | Adds CLI parsing/printing for `--data_type`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp` | New entry point replacing the previous main file. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp` | Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu` | Keeps a CUDA compilation unit and moves template implementations to the header. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp` | Expands bench-args variant to support `float` and `double`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu` | Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt` | Switches target sources to the new `gpu_stream_main.cpp`. |
| `superbench/benchmarks/micro_benchmarks/gpu_stream.py` | Adds `--data_type` argument and forwards it to the binary. |
| `examples/benchmarks/gpu_stream.py` | Updates example invocation to include `--data_type double`. |
| `docs/user-tutorial/benchmarks/micro-benchmarks.md` | Updates gpu-stream metric patterns to include `(double


superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

docs/user-tutorial/benchmarks/micro-benchmarks.md

Copilot

Pull request overview

Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.


superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp

Copilot

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

ParseOpts intends to error out when required options are not provided, but size_specified is initialized to true, so missing --size will never be detected by the if (!size_specified || ...) check. Initialize it to false (like the other flags) or remove the required-argument check if defaults are intended.

    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;
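
The required-option check the comment asks for can be modeled in a few lines. This is a Python transliteration of the idea, not the C++ `ParseOpts`: Python's `getopt` module stands in for `getopt_long`, and `parse_opts` and `REQUIRED` are hypothetical names. Every "specified" flag starts as `False`, so a missing `--size` is detected.

```python
import getopt

REQUIRED = ("size", "num_warm_up", "num_loops")


def parse_opts(argv):
    """Return (values, parse_err); parse_err is True if a required option is missing."""
    specified = {name: False for name in REQUIRED}  # no flag starts out as True
    values = {}
    try:
        # "name=" declares a long option that takes an argument.
        opts, _ = getopt.getopt(argv, "", [name + "=" for name in REQUIRED])
    except getopt.GetoptError:
        return values, True
    for opt, arg in opts:
        name = opt[2:]  # strip the leading "--"
        values[name] = int(arg)
        specified[name] = True
    parse_err = not all(specified.values())
    return values, parse_err


print(parse_opts(["--num_warm_up", "10", "--num_loops", "40"]))  # ({'num_warm_up': 10, 'num_loops': 40}, True)
```

If a default buffer size is intended instead, the alternative the comment mentions is to drop `size` from the required set rather than pre-setting its flag.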


superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

Copilot

Pull request overview

Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.

Comments suppressed due to low confidence (1)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99

ParseOpts initializes size_specified=true, which makes --size effectively optional, but PrintUsage presents --size as required and the end-of-parse validation still checks size_specified. Either initialize size_specified=false to enforce explicit --size, or update the usage/validation logic to reflect that the default buffer size is acceptable.

    int getopt_ret = 0;
    int opt_idx = 0;
    bool size_specified = true;
    bool num_warm_up_specified = false;
    bool num_loops_specified = false;

    bool parse_err = false;
    while (true) {
        getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
        if (getopt_ret == -1) {
            if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
                parse_err = true;
            }
            break;


superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu

superbench/benchmarks/micro_benchmarks/gpu_stream.py

WenqingLan1 added 3 commits December 18, 2025 01:16

remove fixed gpu id & numa id assignment

7f23c75

use 128bit alignment, add float support, cleanup

d63fe8c

add data_type arg

242714e

WenqingLan1 requested a review from a team as a code owner December 19, 2025 20:05

WenqingLan1 added the micro-benchmarks label (Micro Benchmark Test for SuperBench Benchmarks) Dec 19, 2025

guoshzhao self-assigned this Dec 19, 2025

guoshzhao requested review from guoshzhao and polarG December 19, 2025 20:32

WenqingLan1 and others added 5 commits December 19, 2025 23:31

fix lint

e8d0282

fix clang lint

5a18946

update doc

fddf56e

Merge branch 'main' into wenqinglan/refine-gpu-stream

3c359a3

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

e445363

Copilot AI review requested due to automatic review settings February 3, 2026 22:14

Copilot started reviewing on behalf of WenqingLan1 February 3, 2026 22:15

Copilot AI reviewed Feb 3, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (outdated, resolved)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (outdated, resolved)

docs/user-tutorial/benchmarks/micro-benchmarks.md (outdated, resolved)

WenqingLan1 and others added 2 commits February 5, 2026 16:04

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

60b130c

fix alloc count & comment

f31933f

Copilot AI review requested due to automatic review settings February 6, 2026 00:20

Copilot AI reviewed Feb 6, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (resolved)

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (resolved)

fix: reset gpu-burn submodule to correct commit

d8a91ab

guoshzhao requested a review from abuccts February 13, 2026 00:11

guoshzhao reviewed Mar 26, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (outdated, resolved)

guoshzhao reviewed Mar 26, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (resolved)

guoshzhao reviewed Mar 26, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp (outdated, resolved)

guoshzhao requested changes Mar 26, 2026

Merge branch 'microsoft:main' into wenqinglan/refine-gpu-stream

2dfa122

Copilot AI review requested due to automatic review settings April 8, 2026 20:27

Copilot started reviewing on behalf of WenqingLan1 April 8, 2026 20:30

Copilot AI reviewed Apr 8, 2026

WenqingLan1 added 2 commits April 9, 2026 14:27

resolve comments

6dfdaa6

fix lint

e3232f5

Copilot AI review requested due to automatic review settings April 9, 2026 21:59

Copilot started reviewing on behalf of WenqingLan1 April 9, 2026 21:59

Copilot AI reviewed Apr 9, 2026

superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu (outdated, resolved)

superbench/benchmarks/micro_benchmarks/gpu_stream.py (resolved)

microsoft deleted a comment from Copilot AI Apr 9, 2026

resolve comment

58fead3

WenqingLan1 wants to merge 15 commits into microsoft:main from WenqingLan1:wenqinglan/refine-gpu-stream

3 participants
