4.3ms -> 3.6ms on tiger

~3.6ms -> ~2.8ms on tiger. Things I tried that were either the same perf or worse:

The last thing that could be tried is switching from processing by row to processing by column. This would at the very least save the transpose necessary before writing to the alpha buffer.
```rust
bounds: &[f32; 4],
derivatives: &[f32; 4],
intersection_data: u32,
cannonical_x_dir: bool,
```

```rust
computed_rows[y] = pixel_counts.mul(scale_mul).add(scale_add).shr(3);
}
```

```rust
for x in 0..4 {
```
I'm probably missing something, but why is computed_rows an array of u32x8 when we only use the first four values per row? Shouldn't it be u32x4 instead?
Good question! Also, for whatever reason I didn't get a notification for these.
First, yes, the code here is for 4x4 tiles. But it is u32x8 because each pixel in the tile needs (MSAA sample count × 1 byte) of space to record the winding for each sample point.
So at MSAA 8: 8 bytes per pixel, 4 pixels per row, 32 bytes total, u32x8!
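For intuition, that arithmetic can be sketched in plain Rust (array math standing in for the SIMD types; the names here are illustrative, not the PR's code):

```rust
// At MSAA 8, each pixel in a tile row keeps one u8 winding counter per
// sample point, so a 4-pixel row needs 4 * 8 = 32 bytes of counters.
fn row_bytes(pixels_per_row: usize, msaa_samples: usize) -> usize {
    // One byte per sample counter.
    pixels_per_row * msaa_samples
}

fn main() {
    let bytes = row_bytes(4, 8);
    // A u32 lane packs 4 of those byte counters, so the row is 8 lanes wide.
    let u32_lanes = bytes / std::mem::size_of::<u32>();
    assert_eq!(bytes, 32);
    assert_eq!(u32_lanes, 8); // hence u32x8 per row
}
```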
But am I wrong in seeing that for each u32x8 row in computed_rows, we never observe the elements 4..8? We compute some values for them, but they're never stored anywhere.
For each row, we do:
- initialize to all zeros
- set to the result of `pixel_counts.mul(scale_mul).add(scale_add).shr(3)`
- store elements 0..4 in alpha_buf

And unless I'm missing something, the whole data-flow graph for that computation is insensitive to the elements 4..8? That is, I didn't spot any lane-crossing operations.
If you're just using u32x8 because it meshes better with the types of the other variables involved here, that'd be understandable (though probably deserving of a comment?). But all the types involved seem to derive from a single v = mask[y], and it'd probably be easy enough to take the low half of mask[y] to clarify things?
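A sketch of the narrowing being suggested, with plain arrays standing in for u32x8/u32x4 (illustrative names and a made-up per-lane computation, not the PR's code):

```rust
// When only lanes 0..4 of a row are ever stored, computing on all 8 lanes
// is wasted work, and narrowing to the low half makes the types say so.
fn low_half(row: [u32; 8]) -> [u32; 4] {
    [row[0], row[1], row[2], row[3]]
}

fn main() {
    // Lanes 4..8 (the 99s) never influence the stored result.
    let mask_row: [u32; 8] = [1, 2, 3, 4, 99, 99, 99, 99];
    // Hypothetical per-lane mul/add/shr, mirroring the shape of the
    // computed_rows expression above.
    let computed: [u32; 4] = low_half(mask_row).map(|v| (v * 2 + 4) >> 3);
    assert_eq!(computed, [0, 1, 1, 1]);
}
```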
I misread your original question; I thought you were asking about the use of u32x8 in general, using computed_rows as an example.
Yes, computed_rows should be u32x4. I initially wrote this to push the rows directly into the alpha buffer, but then I remembered that Vello expects column-major, so I haphazardly added this for the transpose!
If you're trying to run it locally, IIRC the last commit passes the lines in incorrectly, so watch out for that too.
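For reference, the row-to-column transpose being described could be sketched like this (array-based and illustrative; the real code operates on SIMD vectors):

```rust
// Transpose a 4x4 tile of per-pixel coverage from row-major rows into a
// column-major alpha buffer, as the `for x in 0..4` loop above does.
fn transpose_to_columns(rows: [[u8; 4]; 4]) -> [u8; 16] {
    let mut alpha_buf = [0u8; 16];
    for x in 0..4 {
        for y in 0..4 {
            // Column x of the tile is stored contiguously.
            alpha_buf[x * 4 + y] = rows[y][x];
        }
    }
    alpha_buf
}

fn main() {
    let rows = [
        [0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
    ];
    let cols = transpose_to_columns(rows);
    assert_eq!(&cols[0..4], &[0, 4, 8, 12]); // first column of the tile
}
```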
That being said, this version is a little outdated compared to what I have on Skia right now, the biggest difference being that I'm using Raph's slope-and-intercept based LUT as opposed to the angle-and-distance LUT from the Li et al. paper.
Furthermore, if you've kept up with the Zulip, I've been tinkering with an RLE tile representation, and there might actually be a new technique it enables, so there may be more changes coming in the future!
Uploading this as a potential feature; if we do want to land it, I can polish it up from here. Currently, the performance is pretty bad: about 3x slower than analytic anti-aliasing (~4.3ms vs ~1.5ms on tiger). However, the pixel intersection logic is currently written in a GPU-ish style, and I can imagine that switching to a more serial-friendly DDA approach could yield some savings.
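For a sense of what a serial DDA would look like, here's a minimal sketch that walks a line segment one pixel-boundary crossing at a time instead of testing every pixel against the line (illustrative only; function and variable names are not from the PR):

```rust
// Minimal DDA walk: enumerate the pixels a line segment crosses by always
// stepping across whichever pixel boundary (vertical or horizontal) the
// ray reaches first.
fn dda_pixels(x0: f32, y0: f32, x1: f32, y1: f32) -> Vec<(i32, i32)> {
    let (mut x, mut y) = (x0.floor() as i32, y0.floor() as i32);
    let (end_x, end_y) = (x1.floor() as i32, y1.floor() as i32);
    let (dx, dy) = (x1 - x0, y1 - y0);
    let step_x = if dx > 0.0 { 1 } else { -1 };
    let step_y = if dy > 0.0 { 1 } else { -1 };
    // Parameter t at which the ray crosses the next vertical/horizontal
    // pixel boundary, and the t increment per full pixel step on each axis.
    let mut t_max_x = if dx != 0.0 {
        let next = if dx > 0.0 { x as f32 + 1.0 } else { x as f32 };
        (next - x0) / dx
    } else {
        f32::INFINITY
    };
    let mut t_max_y = if dy != 0.0 {
        let next = if dy > 0.0 { y as f32 + 1.0 } else { y as f32 };
        (next - y0) / dy
    } else {
        f32::INFINITY
    };
    let t_delta_x = if dx != 0.0 { (1.0 / dx).abs() } else { f32::INFINITY };
    let t_delta_y = if dy != 0.0 { (1.0 / dy).abs() } else { f32::INFINITY };

    let mut out = vec![(x, y)];
    while (x, y) != (end_x, end_y) {
        if t_max_x < t_max_y {
            t_max_x += t_delta_x;
            x += step_x;
        } else {
            t_max_y += t_delta_y;
            y += step_y;
        }
        out.push((x, y));
    }
    out
}

fn main() {
    // A shallow diagonal across a 4x4 tile.
    let px = dda_pixels(0.5, 0.5, 3.5, 1.5);
    assert_eq!(px.first(), Some(&(0, 0)));
    assert_eq!(px.last(), Some(&(3, 1)));
    // Each step moved exactly one pixel, so the walk is O(pixels crossed).
    assert_eq!(px.len(), 5);
}
```

The appeal over the GPU-style formulation is that work scales with the number of pixels the edge actually crosses rather than with the tile area.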