[WIP] Basic README on matrices.

gsvgit · gsvgit · commit 1e4eb065b4bb · 2025-03-29T10:31:38.000+03:00
diff --git a/README.md b/README.md
@@ -1,6 +1,9 @@
-# ImageProcessing
+# Brahma.FSharp examples
 
-Simple image processing on GPGPU in F# using [Brahma.FSharp](https://github.com/YaccConstructor/Brahma.FSharp).
+A few examples of how to utilize GPGPU in F# code using [Brahma.FSharp](https://github.com/YaccConstructor/Brahma.FSharp).
+
+- [Image processing](src/ImageProcessing)
+- [Matrix multiplication](src/MatrixMultiplication) 
 
 ---
 
@@ -12,20 +15,13 @@ GitHub Actions |
 [![GitHub Actions](https://github.com/gsvgit/ImageProcessing/workflows/Build%20master/badge.svg)](https://github.com/gsvgit/ImageProcessing/actions?query=branch%3Amaster) |
 [![Build History](https://buildstats.info/github/chart/gsvgit/ImageProcessing)](https://github.com/gsvgit/ImageProcessing/actions?query=branch%3Amaster) |
 
-## NuGet
-
-Package | Stable | Prerelease
---- | --- | ---
-ImageProcessing |  | 
-
-
 ---
 
 ### Developing
 
 Make sure the following **requirements** are installed on your system:
 
-- [dotnet SDK](https://dotnet.microsoft.com/en-us/download/dotnet/7.0) 7.0 or higher
+- [dotnet SDK 9.0](https://dotnet.microsoft.com/en-us/download/dotnet/9.0) or higher
 - OpenCL-compatible device with respective driver installed.
 
 ---
@@ -34,12 +30,5 @@ Make sure the following **requirements** are installed on your system:
 
 
 ```sh
-> build.cmd <optional buildtarget> // on windows
-$ ./build.sh  <optional buildtarget>// on unix
-```
-
----
-
-### Build Targets
-
-For details look at [MiniScaffold](https://github.com/TheAngryByrd/MiniScaffold), we use it in our project.
+dotnet build -c Release
+```
diff --git a/src/MatrixMultiplication/README.md b/src/MatrixMultiplication/README.md
@@ -0,0 +1,21 @@
+## Matrix multiplication optimization step by step
+
+A sequence of matrix multiplication optimizations inspired by the [OpenCL SGEMM tutorial by Cedric Nugteren](https://cnugteren.github.io/tutorial/pages/page1.html).
+Our goal is to show basic optimizations, so we omit some steps presented in the tutorial.
+All kernels compute **C = A * B** and are parametrized by element type and element-wise operations.
+
+**kernel0 (K0)** is a naive kernel that accumulates results directly in **C**.
+
+In **kernel1 (K1)** each thread uses a register to accumulate **C[i,j]** and writes this value to **C** at the end of the computation.
+This reduces global memory I/O.
+This kernel reproduces the [naive implementation from the tutorial](https://cnugteren.github.io/tutorial/pages/page3.html).
+
+**kernel2 (K2)** utilizes local memory to store tiles of matrices. The idea is based on [block matrix multiplication](https://en.wikipedia.org/wiki/Block_matrix#Multiplication). 
+The corresponding kernel from the tutorial is [kernel 2](https://cnugteren.github.io/tutorial/pages/page4.html).
+
+**kernel3 (K3)** implicitly reduces data transfer between local memory and registers by grouping computations.
+The corresponding kernel from the tutorial is [kernel 3](https://cnugteren.github.io/tutorial/pages/page5.html).
+
+**kernel4 (K4)** is designed to use registers aggressively to hold tiles of matrices.
+Thus we try to reduce data transfer between local memory and registers even more.
+The corresponding kernel from the tutorial is [kernel 6](https://cnugteren.github.io/tutorial/pages/page8.html).