Next-Generation Scheduling by jpsamaroo · Pull Request #715 · JuliaParallel/Dagger.jl

jpsamaroo · 2026-07-01T01:14:45Z

Dagger's core scheduler has historically been effectively single-threaded, with most code paths protected by one big global lock. This has kept Dagger from scaling well when the scheduler runs on multi-threaded systems, which is by far the most common case today. This PR takes several steps to split the scheduler into pieces that can run concurrently: much of the global state in the scheduler's `ComputeState` moves onto individual `Thunk`s, and locking via `state.lock` is reduced to very short segments where it is currently unavoidable. Together with upcoming work to parallelize Datadeps analysis and scheduling, this should allow Dagger to scale to arbitrarily large systems and task graphs without scheduler bottlenecks.
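
To make the idea of per-task state concrete, here is a minimal sketch (not code from this PR). It assumes each task carries its own atomic count of unfinished dependencies, so finishing a dependency needs no global lock. `ToyTask`, `pending`, `add_dependency!`, and `finish!` are hypothetical names invented for this example and do not reflect Dagger's actual types or fields.

```julia
# Hypothetical sketch: a per-task atomic dependency counter instead of a
# lock-protected global table. Not Dagger's implementation.
# Requires Julia 1.7+ for @atomic fields.

mutable struct ToyTask
    id::Int
    @atomic pending::Int          # number of unfinished dependencies
    dependents::Vector{ToyTask}   # tasks waiting on this one (fixed before execution)
end

ToyTask(id::Integer) = ToyTask(id, 0, ToyTask[])

# Record that `task` must wait for `dep` (done while building the graph).
function add_dependency!(task::ToyTask, dep::ToyTask)
    @atomic task.pending += 1
    push!(dep.dependents, task)
    return task
end

# Mark `task` as finished and return the dependents that just became ready.
function finish!(task::ToyTask)
    ready = ToyTask[]
    for t in task.dependents
        # Atomic decrement: only the caller that observes zero schedules `t`.
        if (@atomic t.pending -= 1) == 0
            push!(ready, t)
        end
    end
    return ready
end

# Diamond graph: a -> (b, c) -> d
a, b, c, d = ToyTask.(1:4)
add_dependency!(b, a); add_dependency!(c, a)
add_dependency!(d, b); add_dependency!(d, c)

ready_after_a = finish!(a)   # b and c become ready
# b and c may finish on different threads; exactly one of them sees d reach zero.
spawned = map(t -> Threads.@spawn(finish!(t)), ready_after_a)
ready_d = reduce(vcat, fetch.(spawned))
```

In this toy version the only shared mutable state touched during execution is the counter on each task, which is why no global lock appears. A real scheduler has more state to coordinate (results, errors, cancellation), so this shows only the counter pattern.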

Written by Claude Opus and Sonnet

github-actions · 2026-07-01T01:28:00Z

Dagger benchmarks: `dirty` vs `master`

benchmark	master	dirty	master / dirty
array/dagger/N=1024 (block 512)/add (X + X)	9.98 ± 0.67 ms
array/dagger/N=1024 (block 512)/alloc (rand)	3.91 ± 0.5 ms
array/dagger/N=1024 (block 512)/broadcast (X .+ 1)	8.41 ± 1.4 ms
array/dagger/N=1024 (block 512)/map (sin.(X))	11.6 ± 0.55 ms
array/dagger/N=1024 (block 512)/norm	3.73 ± 0.33 ms
array/dagger/N=1024 (block 512)/reduce (sum)	8.45 ± 1 ms
array/dagger/N=1024 (block 512)/transpose (permutedims)	9.8 ± 1.3 ms
array/dagger/N=256 (block 256)/add (X + X)	3.39 ± 0.28 ms	3.14 ± 0.33 ms	1.08 ± 0.14
array/dagger/N=256 (block 256)/alloc (rand)	1.34 ± 0.031 ms	1.19 ± 0.16 ms	1.13 ± 0.15
array/dagger/N=256 (block 256)/broadcast (X .+ 1)	3.45 ± 0.21 ms
array/dagger/N=256 (block 256)/map (sin.(X))	4.07 ± 0.067 ms	3.92 ± 0.21 ms	1.04 ± 0.058
array/dagger/N=256 (block 256)/norm	2.22 ± 0.86 ms	2.22 ± 0.033 ms	1 ± 0.39
array/dagger/N=256 (block 256)/reduce (sum)	4.27 ± 1.2 ms	4.48 ± 1.3 ms	0.952 ± 0.39
array/dagger/N=256 (block 256)/transpose (permutedims)	2.88 ± 0.11 ms	2.44 ± 0.53 ms	1.18 ± 0.26
linalg/dagger/N=1024 (block 512)/cholesky	31.8 ± 2.7 ms	31.1 ± 2.5 ms	1.02 ± 0.12
linalg/dagger/N=1024 (block 512)/lu	8.02 s	7.53 s	1.07
linalg/dagger/N=1024 (block 512)/matmul (A*A)	0.0555 ± 0.005 s	0.055 ± 0.0016 s	1.01 ± 0.096
linalg/dagger/N=1024 (block 512)/matvec (A*x)	9.06 ± 0.36 ms	9.32 ± 0.16 ms	0.972 ± 0.042
linalg/dagger/N=1024 (block 512)/qr	0.385 ± 0.0071 s	0.343 ± 0.0059 s	1.12 ± 0.028
linalg/dagger/N=1024 (block 512)/solve (A\b via lu)	11.2 s	10.5 s	1.07
linalg/dagger/N=1024 (block 512)/syrk (A'*A)	0.043 ± 0.0024 s	0.0445 ± 0.0021 s	0.967 ± 0.071
linalg/dagger/N=256 (block 256)/cholesky	6.92 ± 0.76 ms	6.77 ± 0.7 ms	1.02 ± 0.15
linalg/dagger/N=256 (block 256)/lu	1.42 ± 0.024 s
linalg/dagger/N=256 (block 256)/matmul (A*A)	4.43 ± 0.32 ms	4.31 ± 0.16 ms	1.03 ± 0.084
linalg/dagger/N=256 (block 256)/matvec (A*x)	3.17 ± 0.078 ms	3.35 ± 0.34 ms	0.945 ± 0.098
linalg/dagger/N=256 (block 256)/qr	11 ± 0.56 ms	10.7 ± 0.44 ms	1.03 ± 0.068
linalg/dagger/N=256 (block 256)/solve (A\b via lu)	2.31 ± 0.015 s	2.25 ± 0.054 s	1.03 ± 0.025
linalg/dagger/N=256 (block 256)/syrk (A'*A)	6.07 ± 0.46 ms	6.96 ± 0.53 ms	0.872 ± 0.094
time_to_load	0.972 ± 0.012 s	0.983 ± 0.0016 s	0.988 ± 0.012

No regressions beyond 25.0% 🎉

Full results and plots (download the benchmark-results artifact).
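
The `master / dirty` column is consistent with a plain ratio of the two means, with the uncertainty obtained by first-order error propagation for uncorrelated measurements. The sketch below checks this for the `add (X + X)` row at N=256. The function names and the regression rule (`dirty` slower than `master` by more than the threshold) are assumptions made for illustration; the benchmark CI may define these differently.

```julia
# Ratio of two measurements with first-order error propagation,
# assuming the two measurements are uncorrelated.
function ratio_with_error(m, sm, d, sd)
    r = m / d
    sr = r * sqrt((sm / m)^2 + (sd / d)^2)
    return r, sr
end

# Assumed rule: a regression means `dirty` is slower than `master`
# by more than `threshold` (0.25 corresponds to 25%).
is_regression(master, dirty; threshold = 0.25) = dirty / master > 1 + threshold

# "array/dagger/N=256 (block 256)/add (X + X)": 3.39 ± 0.28 ms vs 3.14 ± 0.33 ms
r, sr = ratio_with_error(3.39, 0.28, 3.14, 0.33)
println(round(r; sigdigits = 3), " ± ", round(sr; sigdigits = 2))

# "linalg/dagger/N=256 (block 256)/syrk (A'*A)": 6.07 ms vs 6.96 ms
println(is_regression(6.07, 6.96))
```

Under these assumptions the first row reproduces the reported 1.08 ± 0.14. The `syrk` row, the largest slowdown in the table, is about 15% slower on `dirty`, which is below a 25% threshold.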

jpsamaroo added 14 commits June 29, 2026 14:36

NG Phase 1: Move per-thunk-keyed dicts onto Thunk (no semantic change)

b1df760

NG Phase 2: Move futures, introduce Treiber lists (still locked)

86a2edc

NG Phase 3: Dataflow counter replaces waiting/waiting_data (still locked)

54ae440

NG Phase 4: Per-thunk synchronization for finish (lock no longer *required* for correctness)

c40f80e

NG Phase 5: Lock-free cost model

ed66640

processors: Ensure get_processors repopulates OSProc cache as needed

c44d20e

options: Optimize setproperty with generated function

18b8f59

cancellation: Reset occupancy and pressure on pre-running task

dcc494e

Phase 6: Dissolve ready and schedule!; inline self-scheduling

278a27a

NG Fixes: Fix excessive state.lock holding, fix GC of running tasks

8aaf122

Sch/submission: Create strong reference to running Thunks

0649858

Fixes for TASK_TYPED=true

dec4ddd

submission: Cache Base.promote_op results

6d993d4

Sch: Use single reaper task for errormonitor_tracked

2827289
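
The Phase 2 commit above mentions Treiber lists. For readers unfamiliar with the term, a Treiber stack is a lock-free singly linked list whose head is updated with compare-and-swap. The following is a generic sketch of that technique and not the code from the commit; `Node`, `TreiberStack`, and `take_all!` are names chosen for this example.

```julia
# Generic Treiber stack sketch (lock-free LIFO list); not Dagger's code.
# Requires Julia 1.7+ for @atomic, @atomicreplace and @atomicswap.

mutable struct Node{T}
    value::T
    next::Union{Node{T},Nothing}
end

mutable struct TreiberStack{T}
    @atomic head::Union{Node{T},Nothing}
    TreiberStack{T}() where {T} = new{T}(nothing)
end

# Push by retrying a compare-and-swap on the head until it succeeds.
function Base.push!(s::TreiberStack{T}, value) where {T}
    node = Node{T}(value, nothing)
    old = @atomic s.head
    while true
        node.next = old
        # Publish `node` only if the head is still `old`;
        # on failure, `old` is updated to the head that was observed.
        (old, ok) = @atomicreplace s.head old => node
        ok && return s
    end
end

# Atomically detach the whole list and return its values (most recent first).
function take_all!(s::TreiberStack{T}) where {T}
    node = @atomicswap s.head = nothing
    out = T[]
    while node !== nothing
        push!(out, node.value)
        node = node.next
    end
    return out
end

s = TreiberStack{Int}()
Threads.@threads for i in 1:1000
    push!(s, i)       # concurrent pushes, no lock
end
vals = take_all!(s)
```

Detaching the whole list with a single swap, rather than popping one node at a time, sidesteps most of the subtle cases in lock-free pops. Whether the PR consumes its lists this way is not stated in the source.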

jpsamaroo added the scheduler and performance labels Jul 1, 2026

jpsamaroo added 7 commits June 30, 2026 19:28

Sch: Disable thread 1 avoidance

f57c053

thunk: Default TASK_TYPED to true

851c3c5

Bound MemPool to 0.4.16

09b7661

thunk: Fix kwargs processing for typed tasks

a24d2e3

CI: Make 10% the default benchmark regression threshold

87d6313

streaming: Properly support typed DTasks

e133240

test/scheduler: W1T1 is now allowed

ad00974

jpsamaroo wants to merge 21 commits into master from jps/ng-sch
