Issue automatically imported from old repo: EmbarkStudios/rust-gpu#1136
Originally created by eddyb on 2024-02-20T10:38:57Z 🎉: 8 ❤️: 6
We've known for a while that the existing inliner+`mem2reg` setup can lead to long compile times (e.g. #851), but we've never had a simple stress test to measure that behavior and e.g. validate potential solutions against, only full projects.

Thankfully, @schell published a `crabslab` crate with a `#[derive(...)]` proc macro which generates what is effectively a "shallow deserialization from GPU buffer" method, and I was able to turn that into a benchmark:
| pass | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(3.3ⁿ) | 0.664 | 0.965 | 2.135 | 9.124 | 46.472 | 247.335 |
| post-inline `mem2reg` | O(5.4ⁿ) | 0.054 | 0.245 | 1.173 | 7.584 | 43.371 | 239.904 |
| `spirv-opt` | O(2.2ⁿ) | 0.081 | 0.169 | 0.351 | 0.767 | 1.783 | 4.397 |
| `inline` | O(3.4ⁿ) | 0 | 0.007 | 0.020 | 0.067 | 0.234 | 0.959 |
| `SPIR-V -> SPIR-T` | O(2ⁿ) | 0.005 | 0.008 | 0.014 | 0.032 | 0.071 | 0.167 |

(times in seconds)
If you don't mind the very rudimentary curve-fitting (also, I've simplified the row names from the `-Z time-passes` ones), what you should be able to notice is that there are two trends:
- ~2ⁿ: the amount of SPIR-V generated (as observed by `SPIR-V -> SPIR-T` and `spirv-opt`)
  - this is intended for this test: there should be 2ⁿ leaf calls generated and inlined
  - the inliner itself should also fit here, but it's not bottom-up, so it presumably has extra inefficiencies
  - while working on the fix, I also saw the amount of debuginfo generated; that is likely a lot of the cost
- \>4ⁿ: post-inline `mem2reg` is at least (2ⁿ)², i.e. quadratic (or worse) in the amount of SPIR-V
  - we more or less knew this, but this test is simple enough that it shouldn't have any `mem2reg` work left!
What happened? We forgot to switch the inliner over to `OpPhi`s for its return value dataflow, so to this day it generates `OpVariable`s (w/ `OpStore`s replacing callee returns, and `OpLoad`s at the call site):
https://github.com/EmbarkStudios/rust-gpu/blob/8678d58d61a78f01201ec854cb5e3835c014fa3b/crates/rustc_codegen_spirv/src/linker/inline.rs#L658-L664
Some quick hacky testing (using `OpUndef`), for two known projects, ended up making `mem2reg`:
- 15x faster (~730s -> ~50s) for @hatoo's `rene`
  - see also #851
  - EDIT: initially this said "3x", but that was an invalid comparison and Rust-GPU `main` is really slow
- 30x faster (~150s -> ~5s) on @schell's more recent `renderling` (at schell/renderling@d9f4d6f)
(that is, if we fix this bug, it could bring some projects from minutes to seconds - for them, `mem2reg` was spinning its wheels that entire time, due to those `OpVariable`s generated by the inliner, instead of actually helping)

Since this is caused by the inliner itself, and we have to force-inline calls taking pointers into buffers (due to SPIR-V not allowing them to be passed to calls), I repro'd with just `#[derive(Clone)]` too:
| pass | O(_ⁿ) | n=4 | n=5 | n=6 | n=7 | n=8 | n=9 |
|---|---|---|---|---|---|---|---|
| total | O(1.7ⁿ) | 0.543 | 0.567 | 0.625 | 0.875 | 1.952 | 7.683 |
| post-inline `mem2reg` | O(4.8ⁿ) | 0 | 0.013 | 0.059 | 0.264 | 1.225 | 6.695 |
| `spirv-opt` | O(1.9ⁿ) | 0.009 | 0.012 | 0.022 | 0.046 | 0.096 | 0.204 |
| `inline` | O(3ⁿ) | 0 | 0 | 0 | 0.009 | 0.024 | 0.080 |
| `SPIR-V -> SPIR-T` | O(1.7ⁿ) | 0.003 | 0.004 | 0.007 | 0.010 | 0.019 | 0.047 |

(times in seconds)
That one is fast enough that it deserved more columns, but I'm not messing with `jq`/sorting any further.

There is, however, a very compact testcase that can be generated from it:
```rust
#[derive(Clone)]
pub struct D<T>(T, T);
type D4<T> = D<D<D<D<T>>>>;
type D12<T> = D4<D4<D4<T>>>;

#[spirv(fragment)]
pub fn fs_main(
    #[spirv(storage_buffer, descriptor_set = 0, binding = 0)] buf: &D12<f32>,
    out: &mut D12<f32>,
) {
    *out = buf.clone();
}
```
- on `main`, it takes 692s (~11.5min) in `mem2reg`, and ~11.7s everywhere else
- with the local hacky workaround, it's down to ~6.2s in total
  - alright, that should be impossible, even the inlining is faster, the hack is doing too much
  - then again, variables do require weirder handling, and the inliner isn't bottom-up, so maybe?
  - either way, anywhere between 6 and 12 seconds should be possible with the true `OpPhi` fix
And if a 100x speedup isn't impressive enough (or 11-12 minutes not slow enough for a CI timeout), you can always bump it further: a `type D13<T> = D<D12<T>>;` should still take less than a minute once fixed, but anywhere from 45 minutes to a whole hour on `main` (I am not further delaying this issue just to prove that, though).