Implementation of LU decomposition using data flow tasks (#65)
Conversation
Codecov Report: ❌ Patch coverage is …

Additional details and impacted files:

```diff
@@            Coverage Diff             @@
##             main      #65      +/-   ##
==========================================
+ Coverage   75.64%   76.47%   +0.83%
==========================================
  Files          14       14
  Lines        1466     1522      +56
==========================================
+ Hits         1109     1164      +55
- Misses        357      358       +1
```
maltezfaria left a comment:
Looks good as a first implementation, but I think we need:

- A task graph for the LU factorization
- Some performance comparisons

I suspect we need to insert tasks at a finer granularity to expose more parallelism in the task graph.
```julia
function DataFlowTasks.memory_overlap(A::HMatrix, B::HMatrix)
    # TODO: compare leaves in a more efficient way.
    if A === B
        return true
    elseif issubmatrix(A, B) || issubmatrix(B, A)
        return true
    end
    # Fall back to comparing the underlying data of all leaves.
    chdA = leaves(A)
    chdB = leaves(B)
    for i in eachindex(chdA)
        for j in eachindex(chdB)
            if data(chdA[i]) === data(chdB[j])
                return true
            end
        end
    end
    return false
end
```
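For context, `DataFlowTasks.memory_overlap` is the hook the scheduler consults to decide whether two spawned tasks touch conflicting data. A minimal, self-contained illustration using a hypothetical toy type (`Block` and its aliasing rule are invented for the example and are not part of the PR):

```julia
using DataFlowTasks

# Hypothetical toy type standing in for HMatrix in this illustration.
struct Block
    data::Matrix{Float64}
end

# Two blocks conflict iff they alias the same underlying matrix.
DataFlowTasks.memory_overlap(a::Block, b::Block) = a.data === b.data

A = Block(zeros(4, 4))
B = Block(A.data)  # aliases A, so tasks touching A and B must be ordered

t1 = @dspawn begin
    a = @W(A)             # declare a write to A
    fill!(a.data, 1.0)
end label = "write A"

t2 = @dspawn begin
    b = @R(B)             # declare a read of B; runs after t1 since B aliases A
    sum(b.data)
end label = "read B"

fetch(t2)  # 16.0
```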
---
I think for most use cases, we can simply:

- Check if `A` and `B` have a common root
- Check the intersection of the row and column ranges of `A` and `B`

When they don't have a common root, we can (probably?) assume they don't overlap?
---

If they don't have a common root, that does not necessarily mean they don't overlap. For example, we can have this situation:

```julia
M = Matrix(...)
H = HMatrix(M, ...)
H1 = HMatrix(M, ...)
```

Then even if we check the intersections of `H` and `H1`, we still need to compare their data anyway. But we can assume that `H` and `H1` are built correctly (without using one data matrix for more than one `HMatrix`) and then remove the loops altogether.
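Under that assumption, the overlap check could reduce to a root comparison plus an index-range intersection. A rough sketch, not the PR's code: `getroot` is a hypothetical helper, and `isroot`, `parent`, `rowrange`, and `colrange` are assumed tree/range accessors (adjust the names to the package's actual API):

```julia
# Sketch, assuming distinct HMatrix trees never share underlying data.
function DataFlowTasks.memory_overlap(A::HMatrix, B::HMatrix)
    getroot(A) === getroot(B) || return false  # distinct trees: assumed disjoint
    # Same tree: blocks conflict iff their row and column ranges both intersect.
    return !isempty(intersect(rowrange(A), rowrange(B))) &&
           !isempty(intersect(colrange(A), colrange(B)))
end

# Hypothetical helper: climb to the root of the block tree.
getroot(H) = isroot(H) ? H : getroot(parent(H))
```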
---

```julia
function _lu_threads!(M::HMatrix, compressor, bufs = nothing, level = 0, parent = (0, 0))
    if isleaf(M)
        # Dense leaf: factor it in place with an unpivoted LU.
        @dspawn begin
            @RW(M)
            d = data(M)
            @assert d isa Matrix
            lu!(d, NOPIVOT())
        end label = "lu($(parent[1]),$(parent[2]))\nlevel=$(level)"
    else
        @assert !hasdata(M)
        chdM = children(M)
        m, n = size(chdM)
        for i in 1:m
            # Factor the diagonal block, then eliminate its row and column.
            _lu_threads!(chdM[i, i], compressor, bufs, level + 1, (i, i))
            for j in (i+1):n
                # U block: solve L[i,i] * U[i,j] = A[i,j].
                @dspawn ldiv!(
                    UnitLowerTriangular(@R(chdM[i, i])),
                    @RW(chdM[i, j]),
                    compressor,
                    bufs,
                ) label = "ldiv($i,$j)\nlevel=$(level+1)"
                # L block: solve L[j,i] * U[i,i] = A[j,i].
                @dspawn rdiv!(
                    @RW(chdM[j, i]),
                    UpperTriangular(@R(chdM[i, i])),
                    compressor,
                    bufs,
                ) label = "rdiv($j,$i)\nlevel=$(level+1)"
            end
            # Schur-complement update of the trailing blocks: A[j,k] -= L[j,i] * U[i,k].
            for j in (i+1):m
                for k in (i+1):n
                    @dspawn hmul!(
                        @RW(chdM[j, k]),
                        @R(chdM[j, i]),
                        @R(chdM[i, k]),
                        -1,
                        1,
                        compressor,
                        bufs,
                    ) label = "hmul($j,$k)\nlevel=$(level+1)"
                end
            end
        end
    end
    return M
end
```
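For reference, the recursion above implements the standard right-looking block LU factorization. Partitioning a node into $2 \times 2$ blocks,

$$
\begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix}
=
\begin{pmatrix} L_{11} & 0 \\ L_{21} & L_{22} \end{pmatrix}
\begin{pmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{pmatrix},
$$

the tasks map onto the four elimination stages: $L_{11}U_{11} = A_{11}$ (the recursive `lu!` on the diagonal block), $U_{12} = L_{11}^{-1}A_{12}$ (`ldiv!`), $L_{21} = A_{21}U_{11}^{-1}$ (`rdiv!`), and the Schur-complement update $A_{22} \leftarrow A_{22} - L_{21}U_{12}$ (`hmul!` with coefficients $-1$ and $1$); the loop over `i` repeats this step for each block column.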
---

Seems fine as a first implementation, but this level of granularity is not going to be sufficient for good parallel scaling (I think). You probably need to spawn tasks at a finer scale inside the `hmul!` and `ldiv!`/`rdiv!` functions...

We should probably start by looking at the task graph for the current implementation. Could you post one, e.g. on the PR?
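One way to produce such a graph, sketched after the DataFlowTasks profiling docs; `H` and `comp` stand for an already assembled `HMatrix` and its compressor, which are not constructed here, and synchronization of the spawned tasks is omitted:

```julia
using DataFlowTasks
using GraphViz  # requires GraphViz.jl; DataFlowTasks ships a package extension for it

# Record the task graph while running the parallel factorization.
log_info = DataFlowTasks.@log _lu_threads!(H, comp)

# Render the recorded DAG; each node carries the label set by @dspawn.
dag = GraphViz.Graph(log_info)
```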
---

Yes, you are right. I am planning to use a finer level of granularity and to spawn tasks inside `ldiv!`, `rdiv!`, and `hmul!`.
---

Left some comments. Overall this looks good as a first implementation, but it is not ready to be merged. We need:

- An HLU task graph to see what is being parallelized
- Some performance comparisons

As mentioned in one of the comments, I think we will need to spawn tasks at a finer granularity to get decent scaling.
---

I see that we are getting the same, but with a missing key, when …

---

I think it is a bug on their side, because I have never seen this problem when running benchmarks locally or running workflows in the Docker container. I will check it carefully again and try to find a solution.

This is the simple parallelization of LU decomposition using DataFlowTasks library. To see the DAG and performance plot run lu_test_3.