Migrate vello sparse strips renderers to use glifo#1562
Migrate vello sparse strips renderers to use glifo#1562laurenz-canva merged 158 commits intomainfrom
Conversation
This reverts commit ee0dd29.
Closes #1528. # Context Nearly all gradients have a clearly defined value at all position, the only exception being very specific degenerate radial gradients. The behavior in this case is that all positions that are undefined should yield a transparent pixel. Previously, we had handling for this in the CPU shader, where at sampling time we would explicitly mask out such positions. As part of #1301, a number of performance changes were introduced for gradient handling. Among that, a trick was used that encodes an additional transparent padding pixel in the gradient LUT, allowing to handle that case much better in Vello CPU. While this is a valid approach, it has two problems: 1) It destroys the assumption that the length of gradient LUTs are a multiple of 2. While for Vello CPU this doesn't matter and it currently also doesn't matter for Vello Hybrid, it could be a very useful property to have in the future, especially if we decide to change how gradients are uploaded in Vello hybrid. 2) It requires us to cache stops based on whether the underlying gradient has potentially undefined locations, which is also unfortunate and the root cause for #1528. # Solution This PR revert that change (while still keeping in-tact the other performance changes from that PR), such that gradients LUT's are always a multiple of 2 again and don't need to know anything about undefined locations. From my benchmarks, this doesn't have any notable performance impact for the normal case. In certain cases, it seems to even slightly improve performance. While I haven't measured it, this is likely to make undefined gradients about 2x slower, but since those are very rare edge cases we don't really care about apart from for correctness, I think this is very much a worthy change. Here you can find the change in performance before vs. after this PR: ``` fine/gradient/linear/opaque_u8_neon time: [359.08 ns 360.57 ns 362.26 ns] change: [-5.0401% -4.6310% -4.2254%] (p = 0.00 < 0.05) Performance has improved. Found 9 outliers among 100 measurements (9.00%) 1 (1.00%) low severe 3 (3.00%) high mild 5 (5.00%) high severe fine/gradient/radial/opaque_u8_neon time: [475.66 ns 476.60 ns 477.56 ns] change: [-1.3570% -1.0840% -0.8030%] (p = 0.00 < 0.05) Change within noise threshold. Found 7 outliers among 100 measurements (7.00%) 4 (4.00%) low mild 2 (2.00%) high mild 1 (1.00%) high severe fine/gradient/radial/opaque_conical_u8_neon time: [577.07 ns 578.02 ns 579.05 ns] change: [-1.2222% -0.8150% -0.3570%] (p = 0.00 < 0.05) Change within noise threshold. Found 12 outliers among 100 measurements (12.00%) 1 (1.00%) low mild 5 (5.00%) high mild 6 (6.00%) high severe fine/gradient/sweep/opaque_u8_neon time: [847.68 ns 849.27 ns 850.89 ns] change: [-0.0226% +0.2834% +0.5911%] (p = 0.07 > 0.05) No change in performance detected. Found 2 outliers among 100 measurements (2.00%) 1 (1.00%) low mild 1 (1.00%) high severe fine/gradient/extend/pad_u8_neon time: [346.25 ns 346.96 ns 347.77 ns] change: [-0.2873% +0.0627% +0.3827%] (p = 0.72 > 0.05) No change in performance detected. Found 5 outliers among 100 measurements (5.00%) 1 (1.00%) low mild 3 (3.00%) high mild 1 (1.00%) high severe fine/gradient/extend/repeat_u8_neon time: [368.20 ns 369.45 ns 371.00 ns] change: [-4.5087% -4.1494% -3.8349%] (p = 0.00 < 0.05) Performance has improved. Found 6 outliers among 100 measurements (6.00%) 2 (2.00%) low mild 1 (1.00%) high mild 3 (3.00%) high severe fine/gradient/extend/reflect_u8_neon time: [417.63 ns 418.73 ns 420.20 ns] change: [-6.1869% -5.5205% -4.6695%] (p = 0.00 < 0.05) Performance has improved. Found 15 outliers among 100 measurements (15.00%) 1 (1.00%) low mild 8 (8.00%) high mild 6 (6.00%) high severe fine/gradient/many_stops_u8_neon time: [479.07 ns 480.13 ns 481.45 ns] change: [-0.7523% -0.4523% -0.0884%] (p = 0.01 < 0.05) Change within noise threshold. Found 8 outliers among 100 measurements (8.00%) 1 (1.00%) low mild 5 (5.00%) high mild 2 (2.00%) high severe fine/gradient/transparent_u8_neon time: [459.30 ns 461.39 ns 463.68 ns] change: [-0.4103% +0.0785% +0.6946%] (p = 0.77 > 0.05) No change in performance detected. Found 10 outliers among 100 measurements (10.00%) 1 (1.00%) low mild 3 (3.00%) high mild 6 (6.00%) high severe ``` --------- Co-authored-by: Laurenz Stampfl <laurenz@canva.com> Co-authored-by: Alex Gemberg <gemberg@canva.com> (cherry picked from commit 8859b91)
…o default blending only
|
@taj-p Thanks for the great suggestion, code is now simpler indeed, let me know what you think! I haven't yet merged the two traits because conceptually they do kind of do different things, but I'm happy to revisit this in the future. There's probably still a lot more that can be improved, but I hope this is a good start now! Feel free to make any changes you want and merge yourself, or I can do it once you are happy and approved. |
|
I also removed the COLR scene wrapper and just always wrap color glyphs in a push/pop command, which seems like the correct thing to do on both, Vello Hybrid and Vello CPU (regardless of the |
taj-p
left a comment
There was a problem hiding this comment.
Sorry, I ran out of time this morning to finish the review. The migration looks good.
Some of the comments are addressed in #1571 (feel free to merge that PR / cherry pick / whatever you like).
There's a current high memory cost to glyph caching that we're likely going to need to address before being able to consume it. Either by device detection or improving memory efficiency. But that can definitely come later.
| // TODO: Similarly to wgpu, maybe this can be done in a more effective | ||
| // way? |
There was a problem hiding this comment.
Ohhh I generally write this to mean "100% - totally agree!" - as in I strongly agree.
| (width as u16, height as u16) | ||
| }; | ||
| // TODO: Use a config optionally provided by the user! | ||
| self.glyph_resources = Some(GlyphAtlasResources::with_config( |
There was a problem hiding this comment.
Ah, so the user pays the cost of duplicate alpha and strip storage and all the other buffer allocations of a new renderer. That's quite a memory cost. Ideally, these buffers are shared between renderers, but to enable that, we would require deferring alpha/strip evaluation until render time - that is, buffer all draw commands and then, on render, execute the caching glyph pipeline first, reset and pass internal buffer allocations to the screen renderer, and then render the screen.
No action here. Just sharing thoughts on opportunities.
There was a problem hiding this comment.
Hmm yeah, I think it would be tricky! We should do some measurements to determine memory impact.
| } | ||
|
|
||
| /// Upload all pending bitmaps, rasterize pending outline/COLR glyphs, etc. | ||
| fn sync_glyph_cache(&mut self) { |
There was a problem hiding this comment.
I believe this has been addressed in your updates 🙏
| // TODO: We need to figure something out here API-wise. At the moment, the user can | ||
| // theoretically rasterize the same `RenderContext` multiple times without resetting in-between. | ||
| // However, if glyph caching is enabled, this method call could now evict that were previously | ||
| // assumed to exist in `RenderContext`, meaning that if the user rasterizes the same `RenderContext` | ||
| // again without resetting it, some of the cached glyphs might be stale and not exist anymore. | ||
| resources.after_render(); |
There was a problem hiding this comment.
Sorry, I ran out of my timebox this morning in reviewing the PR
taj-p
left a comment
There was a problem hiding this comment.
Considering this PR was supposed to simply factor out the vello_* logic into corresponding crates, I think I've overstepped how much improvement we should try make to glifo in this PR.
It works and it looks good!
cLGTM if the prior comments are adopted
| glyph::{GlyphCaches, GlyphRenderer, GlyphRunBuilder, GlyphType, PreparedGlyph}, | ||
| }; | ||
| pub(crate) const DEFAULT_GLYPH_ATLAS_SIZE: u16 = 4096; | ||
| // Why do we need this? The reason is that the way uploaded images work in Vello Hybrid |
There was a problem hiding this comment.
We might be able to avoid this but let's not tackle that now. The PR is in a much better state already.
Overview
This PR attempts to integrate glifo into vello_hybrid and vello_cpu. As mentioned, the git diff is a bit messy because 1) it's based on an old version of main and 2) some commits that are needed have been cherry-picked, bust should ideally be merged separately (i.e. the other currently open PRs).
Blockers
PRs I’d consider blocking are:
#1553#1555Semi-blocking is also #1459, but for now we could just accept that (non-cached) bitmap glyphs panic in vello_hybrid and not render them for now.
Limitations / TODOs
Since this PR by itself is already big by itself, it still some TODOs and definitely doesn’t represent the ideal end state we want to have:
Some other TODOs:
slowness.mp4
cache_bug.mp4
jump_bug.mp4
So overall, there are definitely more things to investigate, but as you can see the integration seems to work overall!
Changes
Resources(open for better name suggestions) for certain operations (anything text-related on the scene, and the final rendering operation). In vello_cpu, this basically only contains stuff relevant for glyph caching. In vello_hybrid, this additionally now contains the image cache, which previously lived in the “Renderer”. This allows us to share resources between scene encoding and the final rendering operation. The disadvantage is that it puts the burden on the user to always pass the correct resources, but hopefully this shouldn’t be too hard to get wrong. I don’t really see any better way of handling this otherwise. Perhaps this API can in the future be extended to allow the user passing custom resources (e.g. custom texture views for sampling). In vello_cpu, we could probably get away with hiding this behindRenderContext, but it also gets a bit messy and I think having this explicit consistently is better.Resourcesstruct used by both renders, I’ve made the builder generic over a backend. This way, the user simply has to callglyph_runonce, passing the resources to it, and the rest then happens automatically.GlyphCachesstruct into a separate struct for outline/hint caches and atlas caching. The reason for this change is that I want it to be possible to lazily initialize glyph atlases and glyph renderers (especially in vello_cpu), such that the user doesn’t have to pay the cost for constructing those and keeping them in memory, if they don’t make use of the feature at all. This required making it easier to pass no atlas cache to all of the glyph drawing methods, and splitting those two up seemed like the easiest way. This way, you can still use the outline/hint cache even if atlas caching itself is disabled.Other Notes
Recordingtype lives in vello_common and we need to have the abilities to replay glyph drawing instructions, some of it has to live in vello_common for now. In particular, I had to move theGlyphtype to vello_common, in the long term this should probably move back to glifo.