
Migrate vello sparse strips renderers to use glifo #1562

Merged
laurenz-canva merged 158 commits into main from laurenz/parley_migration on Apr 14, 2026

Conversation

@LaurenzV
Collaborator

@LaurenzV LaurenzV commented Apr 7, 2026

Overview

This PR attempts to integrate glifo into vello_hybrid and vello_cpu. As mentioned, the git diff is a bit messy because 1) it's based on an old version of main and 2) some commits that are needed have been cherry-picked, but should ideally be merged separately (i.e. the other currently open PRs).

Blockers

PRs I’d consider blocking are:
#1553
#1555

Semi-blocking is also #1459, but for now we could just accept that (non-cached) bitmap glyphs panic in vello_hybrid, and simply not render them.

Limitations / TODOs

Since this PR is already big by itself, it still has some TODOs and definitely doesn’t represent the ideal end state we want to have:

  • glifo still has the dependency on vello_common.
  • While I agree that there are probably improvements that can be made to the API (see the Zulip thread), I have not investigated those yet and just implemented the minimum necessary to get tests to run. For now, core text rendering logic still lives in vello_hybrid/vello_cpu instead of an external crate or somewhere else. I also haven’t introduced any “render from sub-rect of atlas” API, instead just sticking as closely as possible to what Alex previously implemented.

Some other TODOs:

  • When caching is enabled, zooming is currently extremely slow. Looking at the profile, it becomes clear that this is because deallocating and reallocating cached glyphs with guillotiere becomes a huge bottleneck with many glyphs. So we either need to explore using etagere for that, or investigate whether there is something else that causes the issue in our case. See also the video:
slowness.mp4
  • There also seems to be some other bug where some glyphs become corrupted at certain zoom levels, but it's not easily reproducible. I am not sure yet whether this is a regression introduced in this PR or already existed previously. If I find the time I might piggy-back a fix into this PR, otherwise it will be a follow-up:
cache_bug.mp4
  • It seems like when caching is enabled, the glyphs have some kind of funny "jump" when panning the viewport. I suspect that the subpixel offsets are somehow applied inversely, but I need to look into it. EDIT: Nevermind, I realized that the answer is very simple: we don't do any subpixel rendering in the y direction at all, only in the x direction.
jump_bug.mp4
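To illustrate why switching allocators could help with the zooming slowness, here is a toy shelf-packing allocator, roughly the strategy etagere uses. This is an illustrative sketch, not the real crate's API: glyphs of similar height share a horizontal "shelf", so allocation is a short scan over shelves rather than the free-rectangle bookkeeping a guillotiere-style allocator performs on every allocate/deallocate.

```rust
#[derive(Debug, Clone, Copy, PartialEq)]
struct Slot {
    x: u16,
    y: u16,
    w: u16,
    h: u16,
}

struct ShelfAtlas {
    width: u16,
    height: u16,
    /// (shelf y, shelf height, cursor x)
    shelves: Vec<(u16, u16, u16)>,
    next_y: u16,
}

impl ShelfAtlas {
    fn new(width: u16, height: u16) -> Self {
        Self { width, height, shelves: Vec::new(), next_y: 0 }
    }

    fn allocate(&mut self, w: u16, h: u16) -> Option<Slot> {
        // Reuse an existing shelf tall enough for this glyph.
        for (y, shelf_h, cursor) in &mut self.shelves {
            if h <= *shelf_h && *cursor + w <= self.width {
                let slot = Slot { x: *cursor, y: *y, w, h };
                *cursor += w;
                return Some(slot);
            }
        }
        // Otherwise open a new shelf below the previous one.
        if w <= self.width && self.next_y + h <= self.height {
            let y = self.next_y;
            self.next_y += h;
            self.shelves.push((y, h, w));
            return Some(Slot { x: 0, y, w, h });
        }
        None // the caller would evict or grow the atlas here
    }
}

fn main() {
    let mut atlas = ShelfAtlas::new(256, 256);
    let a = atlas.allocate(10, 12).unwrap();
    let b = atlas.allocate(10, 12).unwrap();
    assert_eq!((a.y, b.y), (0, 0)); // same shelf
    let c = atlas.allocate(10, 40).unwrap();
    assert_eq!(c.y, 12); // taller glyph opens a new shelf below
}
```

The trade-off is worse packing density in exchange for much cheaper allocation and deallocation, which matters when many glyphs churn per zoom step.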

So overall, there are definitely more things to investigate, but as you can see the integration seems to work overall!

Changes

  • The biggest change is probably to resources handling. Both renderers now require the user to pass a Resources (open for better name suggestions) for certain operations (anything text-related on the scene, and the final rendering operation). In vello_cpu, this basically only contains state relevant for glyph caching. In vello_hybrid, it additionally contains the image cache, which previously lived in the “Renderer”. This allows us to share resources between scene encoding and the final rendering operation. The disadvantage is that it puts the burden on the user to always pass the correct resources, but hopefully this shouldn’t be too easy to get wrong. I don’t really see any better way of handling this otherwise. Perhaps this API can in the future be extended to allow the user to pass custom resources (e.g. custom texture views for sampling). In vello_cpu, we could probably get away with hiding this behind RenderContext, but it also gets a bit messy, and I think having this explicit consistently is better.
  • The default size of an atlas on Vello CPU was changed to 4096, I hope that’s a reasonable default? Or should it be smaller?
  • I’ve made slight changes to how the glyph run builder API works. In the previous parley version, GlyphRunBuilder was a simple struct with a builder pattern, where you needed to manually pass the image cache and glyph caches to the method. In order to make it work better with the Resources struct used by both renderers, I’ve made the builder generic over a backend. This way, the user simply has to call glyph_run once, passing the resources to it, and the rest then happens automatically.
  • The previous benchmarks had some benches for measuring specific parts of the pipeline (e.g. maintain); since those parts are now more hidden away, they aren’t easy to benchmark anymore. Therefore, I’ve changed the benchmarks to simply be “integration” benchmarks for Vello CPU.
  • I split up the previous GlyphCaches struct into a separate struct for outline/hint caches and atlas caching. The reason for this change is that I want it to be possible to lazily initialize glyph atlases and glyph renderers (especially in vello_cpu), such that the user doesn’t have to pay the cost for constructing those and keeping them in memory, if they don’t make use of the feature at all. This required making it easier to pass no atlas cache to all of the glyph drawing methods, and splitting those two up seemed like the easiest way. This way, you can still use the outline/hint cache even if atlas caching itself is disabled.
  • I’m not sure what the best way of handling this is, but obviously we want to exercise the caching path in the tests. Unfortunately, caching can simply result in quite large pixel differences, so reusing existing snapshots is not really an option, and we need dedicated snapshots for the new variants. Right now, I went down the path of simply adding an attribute that parametrizes each test on whether glyph caching should be enabled or not. I’m not sure if the better option here is to just manually hardcode the variants for each test, or something else. Open for suggestions here!
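As a rough illustration of the resource threading described in the first bullet, here is a sketch with stand-in types (these are not the actual vello APIs): the same Resources value is passed to both scene encoding and the final render, so anything cached during encoding is still alive when the frame is rasterized.

```rust
// Stand-in for the caches (glyph atlas, image cache) that the real
// `Resources` would hold.
#[derive(Default)]
struct Resources {
    cached_glyphs: usize,
}

#[derive(Default)]
struct Scene {
    glyph_draws: usize,
}

impl Scene {
    // Text-related operations take `&mut Resources` so they can populate
    // the caches while encoding.
    fn glyph_run(&mut self, resources: &mut Resources, glyphs: usize) {
        resources.cached_glyphs += glyphs;
        self.glyph_draws += 1;
    }
}

struct Renderer;

impl Renderer {
    // The final render takes the same resources; handing it a different
    // `Resources` value is exactly the misuse the description warns about.
    fn render(&self, scene: &Scene, resources: &Resources) -> usize {
        scene.glyph_draws + resources.cached_glyphs
    }
}

fn main() {
    let mut resources = Resources::default();
    let mut scene = Scene::default();
    scene.glyph_run(&mut resources, 3);
    assert_eq!(Renderer.render(&scene, &resources), 4);
}
```

The design choice here is to make the shared state explicit in the signatures rather than hiding it inside the renderer, at the cost of one extra argument at each call site.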

Other Notes

  • Something that makes things awkward is recordings. Ideally, glifo would take care of handling and re-exporting all text-related dependencies, and vello_common wouldn’t have any of that. However, since the Recording type lives in vello_common and we need the ability to replay glyph drawing instructions, some of it has to live in vello_common for now. In particular, I had to move the Glyph type to vello_common; in the long term this should probably move back to glifo.
  • To expand on that, the fact that we have recordings makes things even harder. On current main, we are able to record normal outline glyphs but will panic for COLR and bitmap glyphs. However, that exact setup is unfortunately difficult to replicate now. The reason is that the version currently on main relies on the ability to lower glyph runs into strip commands directly in vello_common. However, that logic now lives in glifo, and vello_common can’t take a dependency on glifo. Therefore, for now I’ve opted to simply record glyph render commands as they are, but not actually provide any caching.
  • For similar reasons, we need to duplicate the glyph run builder API in vello_common, which is also not ideal, but I wouldn't know how to deal with that better.
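The "builder generic over a backend" idea from the Changes section can be sketched as follows. All names here are hypothetical and simplified (the real traits live in glifo and the vello_* crates): each renderer supplies its own backend and resources, and the builder drives the backend for every queued glyph.

```rust
trait GlyphBackend {
    type Resources;
    fn draw_glyph(&mut self, resources: &mut Self::Resources, glyph_id: u32);
}

struct GlyphRunBuilder<'a, B: GlyphBackend> {
    backend: &'a mut B,
    resources: &'a mut B::Resources,
    glyphs: Vec<u32>,
}

impl<'a, B: GlyphBackend> GlyphRunBuilder<'a, B> {
    fn new(backend: &'a mut B, resources: &'a mut B::Resources) -> Self {
        Self { backend, resources, glyphs: Vec::new() }
    }

    fn glyph(mut self, id: u32) -> Self {
        self.glyphs.push(id);
        self
    }

    // Consume the builder and draw every queued glyph through the backend.
    fn render(self) {
        let Self { backend, resources, glyphs } = self;
        for id in glyphs {
            backend.draw_glyph(resources, id);
        }
    }
}

// Toy CPU-style backend whose "resources" just count drawn glyphs.
struct CpuBackend;

impl GlyphBackend for CpuBackend {
    type Resources = usize;
    fn draw_glyph(&mut self, drawn: &mut usize, _glyph_id: u32) {
        *drawn += 1;
    }
}

fn main() {
    let mut backend = CpuBackend;
    let mut drawn = 0usize;
    GlyphRunBuilder::new(&mut backend, &mut drawn)
        .glyph(7)
        .glyph(8)
        .render();
    assert_eq!(drawn, 2);
}
```

Because the backend and its resource type travel together as one generic parameter, the caller only passes the resources once instead of threading individual caches into each builder method.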

laurenz-canva and others added 30 commits March 30, 2026 08:42
Closes #1528.

# Context
Nearly all gradients have a clearly defined value at all positions, the
only exception being very specific degenerate radial gradients. The
behavior in this case is that all positions that are undefined should
yield a transparent pixel.

Previously, we had handling for this in the CPU shader, where at
sampling time we would explicitly mask out such positions. As part of
#1301, a number of performance
changes were introduced for gradient handling. Among them, a trick was
used that encodes an additional transparent padding pixel in the
gradient LUT, allowing that case to be handled much better in Vello CPU.

While this is a valid approach, it has two problems:
1) It destroys the assumption that the length of gradient LUTs is a
multiple of 2. While for Vello CPU this doesn't matter and it currently
also doesn't matter for Vello Hybrid, it could be a very useful property
to have in the future, especially if we decide to change how gradients
are uploaded in Vello hybrid.
2) It requires us to cache stops based on whether the underlying
gradient has potentially undefined locations, which is also unfortunate
and the root cause for #1528.

# Solution

This PR reverts that change (while keeping intact the other
performance changes from that PR), such that gradient LUTs are always
a multiple of 2 again and don't need to know anything about undefined
locations. From my benchmarks, this doesn't have any notable performance
impact for the normal case. In certain cases, it seems to even slightly
improve performance. While I haven't measured it, this is likely to make
undefined gradients about 2x slower, but since those are very rare edge
cases that we only really care about for correctness, I think
this is very much a worthy change. Here you can find the change in
performance before vs. after this PR:

```
fine/gradient/linear/opaque_u8_neon
                        time:   [359.08 ns 360.57 ns 362.26 ns]
                        change: [-5.0401% -4.6310% -4.2254%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 9 outliers among 100 measurements (9.00%)
  1 (1.00%) low severe
  3 (3.00%) high mild
  5 (5.00%) high severe

fine/gradient/radial/opaque_u8_neon
                        time:   [475.66 ns 476.60 ns 477.56 ns]
                        change: [-1.3570% -1.0840% -0.8030%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 7 outliers among 100 measurements (7.00%)
  4 (4.00%) low mild
  2 (2.00%) high mild
  1 (1.00%) high severe

fine/gradient/radial/opaque_conical_u8_neon
                        time:   [577.07 ns 578.02 ns 579.05 ns]
                        change: [-1.2222% -0.8150% -0.3570%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 12 outliers among 100 measurements (12.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  6 (6.00%) high severe

fine/gradient/sweep/opaque_u8_neon
                        time:   [847.68 ns 849.27 ns 850.89 ns]
                        change: [-0.0226% +0.2834% +0.5911%] (p = 0.07 > 0.05)
                        No change in performance detected.
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) low mild
  1 (1.00%) high severe

fine/gradient/extend/pad_u8_neon
                        time:   [346.25 ns 346.96 ns 347.77 ns]
                        change: [-0.2873% +0.0627% +0.3827%] (p = 0.72 > 0.05)
                        No change in performance detected.
Found 5 outliers among 100 measurements (5.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  1 (1.00%) high severe

fine/gradient/extend/repeat_u8_neon
                        time:   [368.20 ns 369.45 ns 371.00 ns]
                        change: [-4.5087% -4.1494% -3.8349%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 6 outliers among 100 measurements (6.00%)
  2 (2.00%) low mild
  1 (1.00%) high mild
  3 (3.00%) high severe

fine/gradient/extend/reflect_u8_neon
                        time:   [417.63 ns 418.73 ns 420.20 ns]
                        change: [-6.1869% -5.5205% -4.6695%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 15 outliers among 100 measurements (15.00%)
  1 (1.00%) low mild
  8 (8.00%) high mild
  6 (6.00%) high severe

fine/gradient/many_stops_u8_neon
                        time:   [479.07 ns 480.13 ns 481.45 ns]
                        change: [-0.7523% -0.4523% -0.0884%] (p = 0.01 < 0.05)
                        Change within noise threshold.
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low mild
  5 (5.00%) high mild
  2 (2.00%) high severe

fine/gradient/transparent_u8_neon
                        time:   [459.30 ns 461.39 ns 463.68 ns]
                        change: [-0.4103% +0.0785% +0.6946%] (p = 0.77 > 0.05)
                        No change in performance detected.
Found 10 outliers among 100 measurements (10.00%)
  1 (1.00%) low mild
  3 (3.00%) high mild
  6 (6.00%) high severe

```
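The restored behavior described above can be sketched as a toy model (this is not vello's actual shader code, and the function name is illustrative): undefined positions are masked out at sample time, so the LUT keeps a plain multiple-of-2 length with no extra padding texel.

```rust
fn sample(lut: &[[u8; 4]], t: f32, defined: bool) -> [u8; 4] {
    // Invariant restored by this change: no padding pixel is appended.
    debug_assert!(lut.len() % 2 == 0);
    if !defined {
        // Degenerate radial gradients leave some positions undefined;
        // those must resolve to a fully transparent pixel.
        return [0, 0, 0, 0];
    }
    let max = lut.len() - 1;
    let idx = (t.clamp(0.0, 1.0) * max as f32) as usize;
    lut[idx.min(max)]
}

fn main() {
    let lut = vec![[255, 0, 0, 255]; 256]; // length is a multiple of 2
    assert_eq!(sample(&lut, 0.5, true), [255, 0, 0, 255]);
    assert_eq!(sample(&lut, 0.5, false), [0, 0, 0, 0]);
}
```

The branch on `defined` is the extra per-sample cost that makes undefined gradients slower, but it only runs for the rare degenerate cases.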

---------

Co-authored-by: Laurenz Stampfl <laurenz@canva.com>
Co-authored-by: Alex Gemberg <gemberg@canva.com>
(cherry picked from commit 8859b91)
@LaurenzV
Collaborator Author

LaurenzV commented Apr 13, 2026

@taj-p Thanks for the great suggestion, the code is now simpler indeed, let me know what you think! I haven't yet merged the two traits because conceptually they kind of do different things, but I'm happy to revisit this in the future.

There's probably still a lot more that can be improved, but I hope this is a good start now! Feel free to make any changes you want and merge yourself, or I can do it once you are happy and have approved.

@LaurenzV
Collaborator Author

I also removed the COLR scene wrapper and just always wrap color glyphs in a push/pop command, which seems like the correct thing to do on both Vello Hybrid and Vello CPU (regardless of the default_blending_problem).
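A minimal sketch of that push/pop wrapping (illustrative command names, not the real vello encoding): a COLR glyph's color layers are emitted between a pushed and popped layer, so their internal blending can't leak into the surrounding scene.

```rust
#[derive(Debug, PartialEq)]
enum Cmd {
    PushLayer,
    DrawColrLayer(u32),
    PopLayer,
}

// Every color glyph is bracketed by push/pop, unconditionally.
fn encode_colr_glyph(layers: &[u32]) -> Vec<Cmd> {
    let mut cmds = vec![Cmd::PushLayer];
    cmds.extend(layers.iter().map(|&l| Cmd::DrawColrLayer(l)));
    cmds.push(Cmd::PopLayer);
    cmds
}

fn main() {
    let cmds = encode_colr_glyph(&[0, 1]);
    assert_eq!(cmds.first(), Some(&Cmd::PushLayer));
    assert_eq!(cmds.last(), Some(&Cmd::PopLayer));
    assert_eq!(cmds.len(), 4);
}
```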

Contributor

@taj-p taj-p left a comment

Sorry, I ran out of time this morning to finish the review. The migration looks good.

Some of the comments are addressed in #1571 (feel free to merge that PR / cherry pick / whatever you like).

There's currently a high memory cost to glyph caching that we're likely going to need to address before being able to consume it, either via device detection or by improving memory efficiency. But that can definitely come later.

Comment thread sparse_strips/vello_hybrid/src/text.rs Outdated
Comment thread sparse_strips/vello_hybrid/src/render/webgl.rs Outdated
Comment on lines +675 to +676
// TODO: Similarly to wgpu, maybe this can be done in a more effective
// way?
Contributor

100

Collaborator Author

100 what? 😄

Contributor

Ohhh I generally write this to mean "100% - totally agree!" - as in I strongly agree.

Comment thread sparse_strips/vello_hybrid/src/resources.rs Outdated
Comment thread sparse_strips/vello_cpu/src/render.rs Outdated
(width as u16, height as u16)
};
// TODO: Use a config optionally provided by the user!
self.glyph_resources = Some(GlyphAtlasResources::with_config(
Contributor

Ah, so the user pays the cost of duplicate alpha and strip storage and all the other buffer allocations of a new renderer. That's quite a memory cost. Ideally, these buffers would be shared between renderers, but enabling that would require deferring alpha/strip evaluation until render time - that is, buffering all draw commands and then, on render, executing the caching glyph pipeline first, resetting and passing the internal buffer allocations to the screen renderer, and then rendering the screen.

No action here. Just sharing thoughts on opportunities.

Collaborator Author

Hmm yeah, I think it would be tricky! We should do some measurements to determine memory impact.

}

/// Upload all pending bitmaps, rasterize pending outline/COLR glyphs, etc.
fn sync_glyph_cache(&mut self) {
Contributor

I believe this has been addressed in your updates 🙏

Comment thread glifo/src/colr.rs Outdated
Comment thread sparse_strips/vello_hybrid/src/render/webgl.rs
Comment on lines +656 to +661
// TODO: We need to figure something out here API-wise. At the moment, the user can
// theoretically rasterize the same `RenderContext` multiple times without resetting in-between.
// However, if glyph caching is enabled, this method call could now evict glyphs that were previously
// assumed to exist in `RenderContext`, meaning that if the user rasterizes the same `RenderContext`
// again without resetting it, some of the cached glyphs might be stale and not exist anymore.
resources.after_render();
Contributor

Sorry, I ran out of my timebox this morning while reviewing the PR.

Comment thread glifo/src/interface.rs
Contributor

@taj-p taj-p left a comment

Considering this PR was supposed to simply factor out the vello_* logic into corresponding crates, I think I've overstepped how much improvement we should try to make to glifo in this PR.

It works and it looks good!

LGTM if the prior comments are adopted.

glyph::{GlyphCaches, GlyphRenderer, GlyphRunBuilder, GlyphType, PreparedGlyph},
};
pub(crate) const DEFAULT_GLYPH_ATLAS_SIZE: u16 = 4096;
// Why do we need this? The reason is that the way uploaded images work in Vello Hybrid
Contributor

We might be able to avoid this but let's not tackle that now. The PR is in a much better state already.

@laurenz-canva laurenz-canva added this pull request to the merge queue Apr 14, 2026
Merged via the queue into main with commit 46cc30b Apr 14, 2026
17 checks passed
@laurenz-canva laurenz-canva deleted the laurenz/parley_migration branch April 14, 2026 06:54