Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769
Benchmark: Micro benchmark - Add float datatype support and other refinements to GPU Stream#769WenqingLan1 wants to merge 15 commits intomicrosoft:mainfrom
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #769 +/- ##
=======================================
Coverage 85.69% 85.69%
=======================================
Files 103 103
Lines 7890 7891 +1
=======================================
+ Hits 6761 6762 +1
Misses 1129 1129
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Pull request overview
Updates the GPU STREAM microbenchmark to support runtime-selectable FP32/FP64 execution and improve GPU memory bandwidth utilization, while aligning SuperBench integration (CLI, output tags, docs, and tests) to the new behavior.
Changes:
- Add
--data_type <float|double>to select FP32/FP64 at runtime and propagate it through the Python benchmark wrapper + unit tests. - Refactor CUDA kernels to use 128-bit vectorized accesses (
double2/float4) and move template kernel implementations into a header for cross-TU instantiation. - Adjust execution/output to single visible GPU (device 0 via
CUDA_VISIBLE_DEVICES) and update metric/tag formats (removinggpu_id) plus docs/examples/test log.
Reviewed changes
Copilot reviewed 11 out of 13 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
tests/data/gpu_stream.log |
Updates golden log output to include data type and new tag format (no gpu_id). |
tests/benchmarks/micro_benchmarks/test_gpu_stream.py |
Extends command-generation assertions to include --data_type (currently only covers double). |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.hpp |
Removes NUMA/GPU iteration fields from args and adds Opts::data_type. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp |
Adds CLI parsing/printing for --data_type. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_main.cpp |
New entry point replacing the previous main file. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp |
Introduces vector-type mapping and templated kernel definitions (128-bit loads/stores). |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.cu |
Keeps a CUDA compilation unit and moves template implementations to the header. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.hpp |
Expands bench-args variant to support float and double. |
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu |
Uses local NUMA allocation, enforces 16B/thread sizing, launches templated vectorized kernels, updates tag format, and runs only CUDA device 0. |
superbench/benchmarks/micro_benchmarks/gpu_stream/CMakeLists.txt |
Switches target sources to the new gpu_stream_main.cpp. |
superbench/benchmarks/micro_benchmarks/gpu_stream.py |
Adds --data_type argument and forwards it to the binary. |
examples/benchmarks/gpu_stream.py |
Updates example invocation to include --data_type double. |
docs/user-tutorial/benchmarks/micro-benchmarks.md |
Updates gpu-stream metric patterns to include `(double |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Outdated
Show resolved
Hide resolved
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 12 out of 14 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Outdated
Show resolved
Hide resolved
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 4 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
- ParseOpts intends to error out when required options are not provided, but
size_specifiedis initialized totrue, so missing--sizewill never be detected by theif (!size_specified || ...)check. Initialize it tofalse(like the other flags) or remove the required-argument check if defaults are intended.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
Show resolved
Hide resolved
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
Show resolved
Hide resolved
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_kernels.hpp
Outdated
Show resolved
Hide resolved
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream.cu
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 11 out of 13 changed files in this pull request and generated 3 comments.
Comments suppressed due to low confidence (1)
superbench/benchmarks/micro_benchmarks/gpu_stream/gpu_stream_utils.cpp:99
- ParseOpts initializes size_specified=true, which makes --size effectively optional, but PrintUsage presents --size as required and the end-of-parse validation still checks size_specified. Either initialize size_specified=false to enforce explicit --size, or update the usage/validation logic to reflect that the default buffer size is acceptable.
int getopt_ret = 0;
int opt_idx = 0;
bool size_specified = true;
bool num_warm_up_specified = false;
bool num_loops_specified = false;
bool parse_err = false;
while (true) {
getopt_ret = getopt_long(argc, argv, "", options, &opt_idx);
if (getopt_ret == -1) {
if (!size_specified || !num_warm_up_specified || !num_loops_specified) {
parse_err = true;
}
break;
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Refinements:
New config:
New rule:
Example results:
Processed by rules: