Conversation
The previous comment reported a time that seemed to have regressed. It was not 8.2ms on main; more like 11ms.
Also notable: LLVM is happy to undo all this shuffle factorization work, fuse the shuffles back together, and just make a big mess. So a new mechanism in this PR is optimization_fence, which abuses LLVM's arithmetic fence intrinsic to prevent fusion of shuffle instructions (it's intended to prevent fusion of floating point ops).
alexreinking
left a comment
A couple comment nits, but otherwise, everything read very clearly. Did you write most of these comments yourself, or did Claude?
I wrote them all myself. And re-reading it all myself, I found a bunch of stuff I didn't like and made more changes. Maybe it existed in the original Python too, but at this point I've rewritten enough of the code that I don't think this counts as coauthored by Claude anymore.
Before:

```
Computing best tile sizes for each type
.................................................
bytes, tile width, tile height, bandwidth (GB/s):
1  8  8  20.9997
1 16  8  20.8329
1  8 16  18.5702
1  8 32  17.2463
1  8 64  14.312
2  8 16  19.2047
2  8  8  18.8368
2 16  8  17.0593
2  8 32  17.0591
2  4  8  15.7681
4  8  8  24.9364
4  4 16  22.9699
4  8 16  22.5743
4  4 32  22.255
4  4  8  20.4468
8  8  8  38.4094
8 16  4  28.4167
8 16  8  27.6184
8  8  4  27.6062
8  8 16  26.8693
```

After:

```
Computing best tile sizes for each type
.................................................
bytes, tile width, tile height, bandwidth (GB/s):
1 16 32  34.1921
1 16 16  31.8399
1  8 16  25.575
1 16 64  25.1665
1 32 16  25.0061
2  8 32  28.2635
2  8 16  27.7648
2 16 16  27.2126
2 16 32  23.9034
2  8  8  23.6345
4  8 16  34.5303
4  8  8  28.3653
4 16  8  26.8521
4  8 32  26.084
4 16 16  24.4519
8  8  8  33.7163
8  8  4  29.1339
8  4 16  26.418
8 16  4  25.4663
8  2  8  24.3949
```
It turns out LLVM was making a mess on ARM too. Adding optimization fences to the base class interleave implementation makes an 8-bit transpose 1.7x faster. See the last commit for the numbers.
Force-pushed 9a09bfa to 3eef5db
Force-pushed 3eef5db to 484bd4c
I rebased on main to trigger the new buildbot workflows.
Yes, sorry, this became a WIP because I wanted to add some functionality to it.
Force-pushed 7452679 to 678a353
To help diagnose occasional illegal instruction errors
This PR now also has a partial implementation of "A decomposition for in-place matrix transposition" by Catanzaro et al. for non-power-of-two interleaves and deinterleaves.
Ready for review. Failure is #8928
Review ping. The changes inside CodeGen_X86 could be deferred to a follow-up PR, but because the changes are exactly segregated by file, the review burden is just the sum of its parts, so it seemed better to get it over with and treat it as one monolithic PR.
alexreinking
left a comment
Still combing through the big files, but there's some low-hanging fruit to address here.
```cpp
// Explicitly vectorized loads from the input. Was necessary before we
// automatically swizzled the 2D load into dense order.
// input.in().compute_at(output, x).vectorize(x).unroll(y);

// Explicit transpose in registers. This used to be the idiom, but is no
// longer necessary because stage_strided_loads should detect the strided
// loads from input.in() and turn it into a transpose.
// input.in().in().reorder_storage(y, x).compute_at(output, x).vectorize(x).unroll(y);

// TODO: Should not be necessary, but prevents licm from doing something dumb.
```
Is there anything we can do, e.g. with a custom lowering pass, to test that the expected IR gets built? Does the old idiom still work? It should probably be tested not to regress.
Yes, I will add a test that checks all the idioms work.
```cpp
// Set the target features to use for dumping to assembly
target.set_features({Target::NoRuntime, Target::NoAsserts, Target::NoBoundsQuery});

std::cout << "\nbytes, interleave factor, interleave bandwidth (GB/s), deinterleave bandwidth (GB/s):\n";
```
Is there any property at all we can test here?
LLVM is free to do clever things under the hood, so I worry that any assert on the actual performance results risks being a flaky test. The test asserts correctness, and the performance output is just informational right now.
I think it wouldn't hurt to try adding a performance test? If LLVM does something weird, at least we know. The whole point of this PR is to get better and faster interleave codegen; if we don't actually test the result that LLVM produces, we're not really demonstrating the fruits of your work.
I think this might also reveal if the optimization_fence breaks in the future.
src/Simplify_Stmts.cpp
Outdated
```cpp
    // foo[x] = foo[x] or foo[x] = undef is a no-op
    return Evaluate::make(0);
} else if (shuf && shuf->is_concat()) {
    // Break a store of a concat of vector indices into separate stores
```
To what end? Should explain more in the comment.
src/IR.cpp
Outdated
```cpp
int cols = indices[1] - indices[0];
int rows = vectors[0].type().lanes() / cols;
if ((int)indices.size() != rows * cols) {
    return false;
}
```
This looks weird. What if the shuffle is like...
shuffle({10,20,30,40}, {2, 1, 0, 3})
Then cols == -1, rows == -4 and so indices.size() == rows * cols, right? Then, below, the two outer for loops are both apparently infinite. But maybe you luck out by returning false right away? indices[0 * -4 + 0] == 2 != 0 * -1 + 0.
Aha! This is the way to make this code cleaner / safer... to return true, we know that indices[col * rows + row] == row * cols + col for all row/col. At 0/0, that's
indices[0 * rows + 0] == 0 * cols + 0
=> indices[0] == 0
So why don't we check that up front?
```diff
-int cols = indices[1] - indices[0];
-int rows = vectors[0].type().lanes() / cols;
-if ((int)indices.size() != rows * cols) {
-    return false;
-}
+if (indices[0] != 0) {
+    return false;
+}
+int cols = indices[1];
+int rows = vectors[0].type().lanes() / cols;
+if ((int)indices.size() != rows * cols) {
+    return false;
+}
```
It actually also looks like this can divide by 0... need to check that indices[1] != 0, too.
```cpp
/** A fence to prevent fusion of ops by llvm. Designed for floats, but we
 * abuse it to prevent shufflevector fusion too. */
virtual llvm::Value *optimization_fence(llvm::Value *);
```
I think we need to rebase this on main after merging the optimization_fence from the other PR.
Yeah, should be easy because that was copy-pasted out of this branch.
I'd also like to defer this until after #8977 so we have coverage of SVE2. I'm nervous about the interactions with scalable vectors in CodeGen_LLVM.cpp
mcourteaux
left a comment
Reviewed everything except the bit-logic thing in x86. If the tests pass, I hope that's sufficient 😛
Looks mostly very good, but reading through it there were a few places where I got confused, so I suggested some comments explaining things.
Also, to my understanding, the tests for interleaving don't seem to test all of the vector length cases, but only the power-of-two cases?
```cpp
std::vector<int> rotation(vec_elements, 0);
for (int i = 0; i < vec_elements; i++) {
    int k = (i * num_vecs) % vec_elements;
    rotation[k] = (i * num_vecs) / vec_elements;
```
As they are coprime, every index in rotation should have been produced. A for loop of internal_assert(rotation[i] != 0) for all i != 0?
src/CodeGen_LLVM.cpp
Outdated
```cpp
// Using unary shuffles, get each element into the right ultimate
// lane. This works out without collisions because the number of vectors
// and the length of each vector is coprime.
const int num_vecs = (int)v.size();
```
Perhaps add a comment that a signed datatype is required for the bit-logic further down.
```cpp
if (f == 1 || f == num_vecs) {
    for (int i = 2; i < num_vecs; i++) {
        if (num_vecs % i == 0) {
            f = i;
```
This logic seems to find the smallest divisor, instead of the largest power of two, as the comment suggests?
It's the initial value that is the largest power-of-two factor: num_vecs & -num_vecs. I'll pull it out into a helper function so it's less magic.
Okay. I think the word "first" raises some questions for a reader. It seems to be some divide-and-conquer strategy, and we prefer big power-of-two factors at high levels of the recursion, if I understand correctly.
src/CodeGen_LLVM.cpp
Outdated
```cpp
// Use the inverse of Catanzaro's algorithm from above. We slice into
// distinct vectors, then rotate each element into the correct final
// vector, then do a unary permutation of each vector.
std::vector<int> shuffle(vec_elements);
```
Move this declaration down to where it's actually used.
src/Simplify_Stmts.cpp
Outdated
```cpp
    // foo[x] = foo[x] or foo[x] = undef is a no-op
    return Evaluate::make(0);
} else if (shuf && shuf->is_concat()) {
    // Break a store of a concat of vector indices into separate stores
```
Are we sure that concat(ramp(0, 1, 8), ramp(8, 1, 8)) is already simplified to ramp(0, 1, 16) at this point, before we needlessly split those up? What brought you to add this? What happened without this transformation?
If you tile a transpose like .tile(x, y, xi, yi, 8, 8) and vectorize both xi and yi, then the store index becomes a concat of dense ramps. This breaks it up. The benchmark uses this, but I'll also add a new test that clarifies.
```cpp
    return false;
}
const Shuffle &v = (const Shuffle &)e;
return v.vectors.size() == 1 &&
```
This makes me wonder: transpose is only happening on one vector. Is make_transpose(make_concat(a, b), 4) going to cause problems? I hope not, and it just doesn't get lowered to the nice work you did?
In general the simplifier doesn't fuse shuffles, so it should be fine.
test/performance/interleave.cpp
Outdated
```cpp
output(x) = in(x / factor, x % factor);

Var xi, yi;
output.unroll(x, factor, TailStrategy::RoundUp).vectorize(x, t.natural_vector_size<T>(), TailStrategy::RoundUp);
```
This also needs testing with different vector sizes, but perhaps in a correctness test? You have logic for handling non-power-of-two sizes that are a multiple of the native length, but this does not seem explicitly tested right now?
Can you document why you do unroll(factor).vectorize()? I would have expected to vectorize(x, natural * factor)?
```cpp
if (!is_power_of_two(vec_elements) &&
    vec_elements % elems_per_native_vec == 0) {
    // It's not a power of two, but it's a multiple of the native vector
    // length, so slice it and recurse.
```
Yes, the interleave test ends up with input vectors of size, say, 24 when deinterleaving 8-vectors by a factor of three.
Using the strategy in CodeGen_LLVM for big vector interleaves (repeated 2-way interleaves), LLVM generates pretty poor code on x86. This is because x86 has no two-way vector interleave instruction until AVX-512, and that instruction requires a runtime shuffle table, using up a register. The instructions x86 does have that take immediates are weird and hard to think about. It's important to stick to instructions that take immediates because interleaves often happen in high-register-pressure contexts (e.g. block transposes).

This PR redoes vector interleaving for power-of-two blocks on x86 to use only unpckl and shufi/vperm2/vinsert instructions. The algorithm is somewhat complex and requires reasoning about permutations of the bits of the indices of each element. Hopefully it is understandable given the jumbo comment. I first got it working in Python, and Claude correctly translated that to C++ for me, after which I made extensive rewrites.
On my machine, this makes block transposes significantly faster and shorter in terms of code size, and avoids some of the pathological cases on main. E.g. a 16x16 transpose of uint16s with AVX2 on main is 621 instructions total, taking 419 cycles. I'd paste it, but it's just a huge mess of various instructions. In this PR it's 134 instructions and 64 cycles:
This changes what block sizes are best used for transposing. Here are the best block sizes for each type before and after this change:
AVX512:
AVX2:
A good rule of thumb seems to be that you now want to use 512-byte blocks on avx2, and 1024-byte blocks on avx512.