## Matrix multiplication optimization step by step

A sequence of matrix multiplication optimizations inspired by the [OpenCL SGEMM tutorial by Cedric Nugteren](https://cnugteren.github.io/tutorial/pages/page1.html).
Our goal is to show basic optimizations, so we omit some steps presented in the tutorial.
All kernels compute **C = A * B** and are parametrized by element type and element-wise operations.
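
To make the parametrization concrete, here is a minimal CPU reference in Python. It assumes the element-wise operations are an addition-like and a multiplication-like operation plus a neutral element for addition. The names `matmul`, `add`, `mul` and `zero` are hypothetical and are not taken from the kernels themselves.

```python
def matmul(a, b, add, mul, zero):
    """Reference C = A * B over user-supplied element-wise operations.

    a is an n x k matrix and b is a k x m matrix, both as lists of rows.
    add, mul and zero are assumptions of this sketch, not the kernels' API.
    """
    n, k, m = len(a), len(b), len(b[0])
    c = [[zero for _ in range(m)] for _ in range(n)]
    for i in range(n):
        for j in range(m):
            acc = zero
            for t in range(k):
                acc = add(acc, mul(a[i][t], b[t][j]))
            c[i][j] = acc
    return c


# Ordinary arithmetic: (+, *, 0).
arith = matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]],
               lambda x, y: x + y, lambda x, y: x * y, 0)

# Min-plus (shortest-path style) product: (min, +, infinity).
min_plus = matmul([[0, 3], [2, 0]], [[0, 1], [4, 0]],
                  min, lambda x, y: x + y, float("inf"))
```

Swapping the operations changes what the product means without touching the loop structure, which is why the same optimization steps apply to every element type.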

**kernel0 (K0)** is a naive kernel that accumulates results directly in **C**.
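
The access pattern of such a naive kernel can be sketched on the CPU as follows. This is a Python emulation, not the kernel itself: each call to the hypothetical `kernel0_work_item` plays the role of one work-item, and the flat list `c` stands in for the global-memory buffer.

```python
def kernel0_work_item(i, j, a, b, c, k, m):
    """One emulated work-item computing C[i, j] in place.

    a, b, c are flat row-major lists standing in for global buffers.
    """
    c[i * m + j] = 0
    for t in range(k):
        # Every step reads and writes the "global" buffer c.
        c[i * m + j] += a[i * k + t] * b[t * m + j]


def run_kernel0(a, b, n, k, m):
    """Emulate an n x m grid of work-items, one per element of C."""
    c = [0] * (n * m)
    for i in range(n):
        for j in range(m):
            kernel0_work_item(i, j, a, b, c, k, m)
    return c


result = run_kernel0([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2)
```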

In **kernel1 (K1)** each thread uses a register to accumulate **C[i,j]** and writes the value to **C** once, at the end of the computation.
This reduces global memory I/O.
This kernel reproduces the [naive implementation from the tutorial](https://cnugteren.github.io/tutorial/pages/page3.html).
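
The saving can be illustrated with a small Python emulation that counts accesses to the output buffer. `CountingBuffer`, `k0_item`, `k1_item` and `run` are hypothetical names for this sketch, and the local variable `acc` stands in for a register.

```python
class CountingBuffer:
    """Stand-in for a global-memory buffer that counts element accesses."""

    def __init__(self, data):
        self.data = list(data)
        self.reads = 0
        self.writes = 0

    def __getitem__(self, idx):
        self.reads += 1
        return self.data[idx]

    def __setitem__(self, idx, value):
        self.writes += 1
        self.data[idx] = value


def k0_item(i, j, a, b, c, k, m):
    """Accumulate directly in the output buffer."""
    c[i * m + j] = 0
    for t in range(k):
        c[i * m + j] = c[i * m + j] + a[i * k + t] * b[t * m + j]


def k1_item(i, j, a, b, c, k, m):
    """Accumulate in a private variable, then write the result once."""
    acc = 0
    for t in range(k):
        acc += a[i * k + t] * b[t * m + j]
    c[i * m + j] = acc


def run(item, a, b, n, k, m):
    """Emulate an n x m grid of work-items over a counting output buffer."""
    c = CountingBuffer([0] * (n * m))
    for i in range(n):
        for j in range(m):
            item(i, j, a, b, c, k, m)
    return c


c0 = run(k0_item, [1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2)
c1 = run(k1_item, [1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2)
print(c0.reads, c0.writes, c1.reads, c1.writes)
```

In this emulation the number of writes to **C** per element drops from one per accumulation step plus the initialization to exactly one, and the reads of **C** disappear.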

**kernel2 (K2)** uses local memory to store tiles of the matrices. The idea is based on [block matrix multiplication](https://en.wikipedia.org/wiki/Block_matrix#Multiplication).
The corresponding kernel from the tutorial is [kernel 2](https://cnugteren.github.io/tutorial/pages/page4.html).
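
A rough Python emulation of the tiling idea is given below. It assumes square tiles and matrix dimensions divisible by the tile size. The lists `a_tile` and `b_tile` stand in for local memory shared by one work-group, and `run_kernel2` is a hypothetical name. A real kernel would place a barrier after loading the tiles and another after using them.

```python
def run_kernel2(a, b, n, k, m, tile=2):
    """Emulated tiled C = A * B on flat row-major lists.

    Assumes n, k and m are multiples of tile (an assumption of this sketch).
    """
    assert n % tile == 0 and k % tile == 0 and m % tile == 0
    c = [0] * (n * m)
    for gi in range(0, n, tile):        # row offset of the work-group
        for gj in range(0, m, tile):    # column offset of the work-group
            # One private accumulator per work-item in the group.
            acc = [[0] * tile for _ in range(tile)]
            for t0 in range(0, k, tile):
                # Phase 1: the group copies one tile of A and one of B
                # into "local memory" (a barrier would follow here).
                a_tile = [[a[(gi + li) * k + (t0 + t)] for t in range(tile)]
                          for li in range(tile)]
                b_tile = [[b[(t0 + t) * m + (gj + lj)] for lj in range(tile)]
                          for t in range(tile)]
                # Phase 2: every work-item multiplies its row of the A tile
                # by its column of the B tile (another barrier would follow).
                for li in range(tile):
                    for lj in range(tile):
                        for t in range(tile):
                            acc[li][lj] += a_tile[li][t] * b_tile[t][lj]
            # Each work-item writes its result to global memory once.
            for li in range(tile):
                for lj in range(tile):
                    c[(gi + li) * m + (gj + lj)] = acc[li][lj]
    return c


small = run_kernel2([1, 2, 3, 4], [5, 6, 7, 8], 2, 2, 2, tile=2)
```

Each element of **A** and **B** is fetched from global memory once per tile instead of once per multiply-add, which is the point of the block formulation.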

**kernel3 (K3)** implicitly reduces data transfer between local memory and registers by grouping computations.
The corresponding kernel from the tutorial is [kernel 3](https://cnugteren.github.io/tutorial/pages/page5.html).
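
One way to read this grouping, following the "more work per thread" step of the tutorial, is that each work-item computes several elements of **C** from the same row, so a value fetched from the **A** tile can be reused for all of them. The Python emulation below makes that reuse explicit. `run_kernel3`, `wpt` (work per thread) and `rts` are hypothetical names, and the dimensions are assumed to be multiples of the tile size.

```python
def run_kernel3(a, b, n, k, m, tile=4, wpt=2):
    """Emulated tiled C = A * B where each work-item computes wpt elements.

    Assumes n, k, m are multiples of tile and tile is a multiple of wpt.
    """
    assert n % tile == 0 and k % tile == 0 and m % tile == 0
    assert tile % wpt == 0
    rts = tile // wpt                   # work-items per tile row
    c = [0] * (n * m)
    for gi in range(0, n, tile):
        for gj in range(0, m, tile):
            # acc[li][lj] holds the wpt private accumulators of one work-item.
            acc = [[[0] * wpt for _ in range(rts)] for _ in range(tile)]
            for t0 in range(0, k, tile):
                # Tiles in emulated local memory.
                a_tile = [[a[(gi + li) * k + (t0 + t)] for t in range(tile)]
                          for li in range(tile)]
                b_tile = [[b[(t0 + t) * m + (gj + col)] for col in range(tile)]
                          for t in range(tile)]
                for li in range(tile):
                    for lj in range(rts):
                        for t in range(tile):
                            a_val = a_tile[li][t]   # one local read ...
                            for w in range(wpt):    # ... reused wpt times
                                acc[li][lj][w] += a_val * b_tile[t][lj + w * rts]
            for li in range(tile):
                for lj in range(rts):
                    for w in range(wpt):
                        c[(gi + li) * m + (gj + lj + w * rts)] = acc[li][lj][w]
    return c


a = list(range(32))     # 4 x 8
b = list(range(64))     # 8 x 8
result = run_kernel3(a, b, 4, 8, 8, tile=4, wpt=2)
```

With `wpt=1` the sketch degenerates to one element per work-item, as in the plain tiled version.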

**kernel4 (K4)** uses registers aggressively to hold tiles of the matrices.
This reduces data transfer between local memory and registers even further.
The corresponding kernel from the tutorial is [kernel 6](https://cnugteren.github.io/tutorial/pages/page8.html).
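
A possible shape of this register tiling, loosely following the 2D register blocking step of the tutorial, is sketched below as a Python emulation. Each work-item owns a `wpt` x `wpt` block of **C** and caches the values it needs from the local tiles in private variables (`a_reg`, `b_reg`) before the multiply-adds. All names are hypothetical, and the dimensions are assumed to be multiples of the tile size.

```python
def run_kernel4(a, b, n, k, m, tile=4, wpt=2):
    """Emulated C = A * B with a wpt x wpt register block per work-item.

    Assumes n, k, m are multiples of tile and tile is a multiple of wpt.
    """
    assert n % tile == 0 and k % tile == 0 and m % tile == 0
    assert tile % wpt == 0
    rts = tile // wpt                   # work-items per tile dimension
    c = [0] * (n * m)
    for gi in range(0, n, tile):
        for gj in range(0, m, tile):
            # acc[li][lj] is the private wpt x wpt block of one work-item.
            acc = [[[[0] * wpt for _ in range(wpt)] for _ in range(rts)]
                   for _ in range(rts)]
            for t0 in range(0, k, tile):
                # Tiles in emulated local memory.
                a_tile = [[a[(gi + r) * k + (t0 + t)] for t in range(tile)]
                          for r in range(tile)]
                b_tile = [[b[(t0 + t) * m + (gj + col)] for col in range(tile)]
                          for t in range(tile)]
                for li in range(rts):
                    for lj in range(rts):
                        block = acc[li][lj]
                        for t in range(tile):
                            # Cache wpt values of B in "registers".
                            b_reg = [b_tile[t][lj + wj * rts] for wj in range(wpt)]
                            for wi in range(wpt):
                                a_reg = a_tile[li + wi * rts][t]
                                for wj in range(wpt):
                                    block[wi][wj] += a_reg * b_reg[wj]
            for li in range(rts):
                for lj in range(rts):
                    for wi in range(wpt):
                        for wj in range(wpt):
                            row = gi + li + wi * rts
                            col = gj + lj + wj * rts
                            c[row * m + col] = acc[li][lj][wi][wj]
    return c


a = list(range(32))     # 4 x 8
b = list(range(64))     # 8 x 8
result = run_kernel4(a, b, 4, 8, 8, tile=4, wpt=2)
```

In this sketch a work-item performs `2 * wpt` local reads per step for `wpt * wpt` multiply-adds, so the ratio of arithmetic to local-memory traffic grows with the block size.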