Next-Generation Scheduling #715

Open
jpsamaroo wants to merge 21 commits into master from jps/ng-sch

@jpsamaroo

Member

Dagger's core scheduler has historically been effectively single-threaded, with most code paths protected by one big global lock. This has limited Dagger's ability to scale when the scheduler runs on multi-threaded systems, which are by far the most common case today. This PR takes several steps to split the scheduler into pieces that can run concurrently: much of the scheduler's global ComputeState state is moved into Thunks, and locking via state.lock is reduced to very short segments where it is currently unavoidable. Together with upcoming work to parallelize Datadeps analysis and scheduling, this should allow Dagger to scale to arbitrarily large systems and task graphs without the scheduler becoming a bottleneck.
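
To illustrate the general idea of moving state out of a global structure and into per-task objects, here is a minimal sketch in Python. It is not Dagger's code or API: the `Task` and `Scheduler` classes and their fields are hypothetical stand-ins for Thunks and ComputeState, and the only point is that dependency bookkeeping takes a per-task lock, while the shared lock is held just long enough to publish newly ready tasks.

```python
import threading
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field


@dataclass
class Task:
    """Hypothetical stand-in for a Thunk that owns its own scheduling state."""
    id: int
    lock: threading.Lock = field(default_factory=threading.Lock)
    pending_deps: int = 0                           # unfinished upstream tasks
    dependents: list = field(default_factory=list)  # downstream tasks


class Scheduler:
    """Hypothetical stand-in for the shared scheduler state."""

    def __init__(self):
        self.lock = threading.Lock()  # shared lock, held only briefly
        self.ready = []               # shared ready queue

    def finish(self, task):
        newly_ready = []
        for child in task.dependents:
            # Per-task lock: finishing unrelated tasks does not contend here.
            with child.lock:
                child.pending_deps -= 1
                if child.pending_deps == 0:
                    newly_ready.append(child)
        if newly_ready:
            # Shared lock: taken only to append to the ready queue.
            with self.lock:
                self.ready.extend(newly_ready)


def demo():
    sched = Scheduler()
    sink = Task(id=0, pending_deps=8)
    parents = [Task(id=i, dependents=[sink]) for i in range(1, 9)]
    # Finish all parents concurrently; the sink must become ready exactly once.
    with ThreadPoolExecutor(max_workers=4) as pool:
        list(pool.map(sched.finish, parents))
    return [t.id for t in sched.ready]


if __name__ == "__main__":
    print(demo())
```

In this sketch the time spent under the shared lock is one list append per completed task, independent of how much per-task bookkeeping is done. How the PR actually partitions state between ComputeState and Thunks is not shown here.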

Written by Claude Opus and Sonnet

@github-actions

github-actions Bot commented Jul 1, 2026

Contributor

Dagger benchmarks: dirty vs master

master dirty master / dirty
array/dagger/N=1024 (block 512)/add (X + X) 9.98 ± 0.67 ms
array/dagger/N=1024 (block 512)/alloc (rand) 3.91 ± 0.5 ms
array/dagger/N=1024 (block 512)/broadcast (X .+ 1) 8.41 ± 1.4 ms
array/dagger/N=1024 (block 512)/map (sin.(X)) 11.6 ± 0.55 ms
array/dagger/N=1024 (block 512)/norm 3.73 ± 0.33 ms
array/dagger/N=1024 (block 512)/reduce (sum) 8.45 ± 1 ms
array/dagger/N=1024 (block 512)/transpose (permutedims) 9.8 ± 1.3 ms
array/dagger/N=256 (block 256)/add (X + X) 3.39 ± 0.28 ms 3.14 ± 0.33 ms 1.08 ± 0.14
array/dagger/N=256 (block 256)/alloc (rand) 1.34 ± 0.031 ms 1.19 ± 0.16 ms 1.13 ± 0.15
array/dagger/N=256 (block 256)/broadcast (X .+ 1) 3.45 ± 0.21 ms
array/dagger/N=256 (block 256)/map (sin.(X)) 4.07 ± 0.067 ms 3.92 ± 0.21 ms 1.04 ± 0.058
array/dagger/N=256 (block 256)/norm 2.22 ± 0.86 ms 2.22 ± 0.033 ms 1 ± 0.39
array/dagger/N=256 (block 256)/reduce (sum) 4.27 ± 1.2 ms 4.48 ± 1.3 ms 0.952 ± 0.39
array/dagger/N=256 (block 256)/transpose (permutedims) 2.88 ± 0.11 ms 2.44 ± 0.53 ms 1.18 ± 0.26
linalg/dagger/N=1024 (block 512)/cholesky 0.0318 ± 0.0027 s 31.1 ± 2.5 ms 1.02 ± 0.12
linalg/dagger/N=1024 (block 512)/lu 8.02 s 7.53 s 1.07
linalg/dagger/N=1024 (block 512)/matmul (A*A) 0.0555 ± 0.005 s 0.055 ± 0.0016 s 1.01 ± 0.096
linalg/dagger/N=1024 (block 512)/matvec (A*x) 9.06 ± 0.36 ms 9.32 ± 0.16 ms 0.972 ± 0.042
linalg/dagger/N=1024 (block 512)/qr 0.385 ± 0.0071 s 0.343 ± 0.0059 s 1.12 ± 0.028
linalg/dagger/N=1024 (block 512)/solve (A\b via lu) 11.2 s 10.5 s 1.07
linalg/dagger/N=1024 (block 512)/syrk (A'*A) 0.043 ± 0.0024 s 0.0445 ± 0.0021 s 0.967 ± 0.071
linalg/dagger/N=256 (block 256)/cholesky 6.92 ± 0.76 ms 6.77 ± 0.7 ms 1.02 ± 0.15
linalg/dagger/N=256 (block 256)/lu 1.42 ± 0.024 s
linalg/dagger/N=256 (block 256)/matmul (A*A) 4.43 ± 0.32 ms 4.31 ± 0.16 ms 1.03 ± 0.084
linalg/dagger/N=256 (block 256)/matvec (A*x) 3.17 ± 0.078 ms 3.35 ± 0.34 ms 0.945 ± 0.098
linalg/dagger/N=256 (block 256)/qr 11 ± 0.56 ms 10.7 ± 0.44 ms 1.03 ± 0.068
linalg/dagger/N=256 (block 256)/solve (A\b via lu) 2.31 ± 0.015 s 2.25 ± 0.054 s 1.03 ± 0.025
linalg/dagger/N=256 (block 256)/syrk (A'*A) 6.07 ± 0.46 ms 6.96 ± 0.53 ms 0.872 ± 0.094
time_to_load 0.972 ± 0.012 s 0.983 ± 0.0016 s 0.988 ± 0.012

No regressions beyond 25.0% 🎉
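
For reading the "master / dirty" column: the reported values appear consistent with dividing the master time by the dirty time and combining the two relative uncertainties in quadrature, so a ratio above 1 means the PR branch was faster. This is inferred from the numbers in the table, not confirmed from the benchmark tooling, and the function name below is hypothetical. A minimal sketch in Python:

```python
import math


def ratio_with_error(a, da, b, db):
    """Return (a / b, uncertainty) using first-order error propagation."""
    r = a / b
    dr = abs(r) * math.sqrt((da / a) ** 2 + (db / b) ** 2)
    return r, dr


# N=256 add (X + X): master 3.39 ± 0.28 ms, dirty 3.14 ± 0.33 ms
r, dr = ratio_with_error(3.39, 0.28, 3.14, 0.33)
print(f"{r:.3g} ± {dr:.2g}")  # compare with the reported 1.08 ± 0.14
```

The same calculation applied to the N=1024 qr row (0.385 ± 0.0071 s vs 0.343 ± 0.0059 s) can be checked against the reported 1.12 ± 0.028.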

Full results and plots (download the benchmark-results artifact).
