4.3ms -> 3.6ms on tiger

~3.6ms -> ~2.8ms on tiger. Things I tried that were either the same perf or worse:

The last thing that could be tried is switching from processing by row to processing by column. This would at the very least save the transpose necessary before writing to the alpha buffer.
```rust
bounds: &[f32; 4],
derivatives: &[f32; 4],
intersection_data: u32,
cannonical_x_dir: bool,
```

```rust
computed_rows[y] = pixel_counts.mul(scale_mul).add(scale_add).shr(3);
}
```

```rust
for x in 0..4 {
```
I'm probably missing something, but why is computed_rows an array of u32x8 when we only use the first four values per row? Shouldn't it be u32x4 instead?
Good question! Also, for whatever reason I didn't get a notification for these.
First, yes, the code here is for 4x4 tiles. But it is u32x8 because each pixel in the tile needs (MSAA sample count × 1 byte) of space to record the winding for each sample point.
So at MSAA 8: 8 bytes per pixel, 4 pixels per row, 32 bytes total, u32x8!
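For intuition, that arithmetic can be sketched in plain Rust (array math standing in for the SIMD types; the names here are illustrative, not the PR's code):

```rust
// At MSAA 8, each pixel in a tile row keeps one u8 winding counter per
// sample point, so a 4-pixel row needs 4 * 8 = 32 bytes of counters.
fn row_bytes(pixels_per_row: usize, msaa_samples: usize) -> usize {
    // One byte per sample counter.
    pixels_per_row * msaa_samples
}

fn main() {
    let bytes = row_bytes(4, 8);
    // A u32 lane packs 4 of those byte counters, so the row is 8 lanes wide.
    let u32_lanes = bytes / std::mem::size_of::<u32>();
    assert_eq!(bytes, 32);
    assert_eq!(u32_lanes, 8); // hence u32x8 per row
}
```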
But am I wrong in seeing that for each u32x8 row in computed_rows, we never observe the elements 4..8? We compute some values for them, but they're never stored anywhere.
For each row, we do:
- initialize to all zeros
- set to the result of `pixel_counts.mul(scale_mul).add(scale_add).shr(3)`
- store elements 0..4 in alpha_buf

And unless I'm missing something, the whole data-flow graph for that computation is insensitive to the elements 4..8? That is, I didn't spot any lane-crossing operations.
If you're just using u32x8 because it meshes better with the types of the other variables involved here, that'd be understandable (though probably deserving of a comment?). But all the types involved seem to derive from a single v = mask[y], and it'd probably be easy enough to take the low half of mask[y] to clarify things?
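A sketch of the narrowing being suggested, with plain arrays standing in for u32x8/u32x4 (illustrative names and a made-up per-lane computation, not the PR's code):

```rust
// When only lanes 0..4 of a row are ever stored, computing on all 8 lanes
// is wasted work, and narrowing to the low half makes the types say so.
fn low_half(row: [u32; 8]) -> [u32; 4] {
    [row[0], row[1], row[2], row[3]]
}

fn main() {
    // Lanes 4..8 (the 99s) never influence the stored result.
    let mask_row: [u32; 8] = [1, 2, 3, 4, 99, 99, 99, 99];
    // Hypothetical per-lane mul/add/shr, mirroring the shape of the
    // computed_rows expression above.
    let computed: [u32; 4] = low_half(mask_row).map(|v| (v * 2 + 4) >> 3);
    assert_eq!(computed, [0, 1, 1, 1]);
}
```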
I misread your original question; I thought you were asking about the use of u32x8 in general, using computed_rows as an example.
Yes, computed_rows should be u32x4. I initially wrote this to push the rows directly into the alpha buffer, but then I remembered that Vello expects column-major, so I haphazardly added this for the transpose!
If you're trying to run it locally, IIRC the last commit passes the lines in incorrectly, so watch out for that too.
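For reference, the row-to-column transpose being described could be sketched like this (array-based and illustrative; the real code operates on SIMD vectors):

```rust
// Transpose a 4x4 tile of per-pixel coverage from row-major rows into a
// column-major alpha buffer, as the `for x in 0..4` loop above does.
fn transpose_to_columns(rows: [[u8; 4]; 4]) -> [u8; 16] {
    let mut alpha_buf = [0u8; 16];
    for x in 0..4 {
        for y in 0..4 {
            // Column x of the tile is stored contiguously.
            alpha_buf[x * 4 + y] = rows[y][x];
        }
    }
    alpha_buf
}

fn main() {
    let rows = [
        [0, 1, 2, 3],
        [4, 5, 6, 7],
        [8, 9, 10, 11],
        [12, 13, 14, 15],
    ];
    let cols = transpose_to_columns(rows);
    assert_eq!(&cols[0..4], &[0, 4, 8, 12]); // first column of the tile
}
```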
That being said, this version is a little outdated compared to what I have on Skia right now, the biggest difference being that I'm using Raph's slope-and-intercept based LUT as opposed to the angle-and-distance LUT from the Li et al. paper.
Furthermore, if you've kept up with the Zulip, I've been tinkering with an RLE tile representation, and there might actually be a new technique it enables, so there may be more changes coming in the future!
Uploading this as a potential feature; if we do want to land it, I can polish it up from here. Currently, the performance is pretty bad: about 3x slower than analytic anti-aliasing (~4.3ms vs ~1.5ms on tiger). However, the pixel intersection logic is currently written in a GPU-ish style, and I can imagine that switching to a more serial-friendly DDA approach could yield some savings.
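For a sense of what a serial DDA would look like, here's a minimal sketch that walks a line segment one pixel-boundary crossing at a time instead of testing every pixel against the line (illustrative only; function and variable names are not from the PR):

```rust
// Minimal DDA walk: enumerate the pixels a line segment crosses by always
// stepping across whichever pixel boundary (vertical or horizontal) the
// ray reaches first.
fn dda_pixels(x0: f32, y0: f32, x1: f32, y1: f32) -> Vec<(i32, i32)> {
    let (mut x, mut y) = (x0.floor() as i32, y0.floor() as i32);
    let (end_x, end_y) = (x1.floor() as i32, y1.floor() as i32);
    let (dx, dy) = (x1 - x0, y1 - y0);
    let step_x = if dx > 0.0 { 1 } else { -1 };
    let step_y = if dy > 0.0 { 1 } else { -1 };
    // Parameter t at which the ray crosses the next vertical/horizontal
    // pixel boundary, and the t increment per full pixel step on each axis.
    let mut t_max_x = if dx != 0.0 {
        let next = if dx > 0.0 { x as f32 + 1.0 } else { x as f32 };
        (next - x0) / dx
    } else {
        f32::INFINITY
    };
    let mut t_max_y = if dy != 0.0 {
        let next = if dy > 0.0 { y as f32 + 1.0 } else { y as f32 };
        (next - y0) / dy
    } else {
        f32::INFINITY
    };
    let t_delta_x = if dx != 0.0 { (1.0 / dx).abs() } else { f32::INFINITY };
    let t_delta_y = if dy != 0.0 { (1.0 / dy).abs() } else { f32::INFINITY };

    let mut out = vec![(x, y)];
    while (x, y) != (end_x, end_y) {
        if t_max_x < t_max_y {
            t_max_x += t_delta_x;
            x += step_x;
        } else {
            t_max_y += t_delta_y;
            y += step_y;
        }
        out.push((x, y));
    }
    out
}

fn main() {
    // A shallow diagonal across a 4x4 tile.
    let px = dda_pixels(0.5, 0.5, 3.5, 1.5);
    assert_eq!(px.first(), Some(&(0, 0)));
    assert_eq!(px.last(), Some(&(3, 1)));
    // Each step moved exactly one pixel, so the walk is O(pixels crossed).
    assert_eq!(px.len(), 5);
}
```

The appeal over the GPU-style formulation is that work scales with the number of pixels the edge actually crosses rather than with the tile area.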