
[Investigation PR]: Improving performance of image sampling in Vello Hybrid #1547

Draft
taj-p wants to merge 11 commits into main from tajp/bilinear

@taj-p
Contributor

taj-p commented Mar 28, 2026

In #1517, there is some uncertainty about the direction for improving bilinear image sampling. I investigated bilinear image sampling to try to understand the bottlenecks. My conclusion is that the bottlenecks identified and previously discussed are slightly off the mark.

Baseline

I used #1493 as the baseline. It produced these values on my Samsung A05s.

| Benchmark | Latency |
| --- | --- |
| 200 Rect - 200×200 - Image - Nearest | 127.23 ms/f (9 iters) |
| 200 Rect - 200×200 - Image - Bilinear | 144.63 ms/f (7 iters) |
| 200 Rect - 200×200 - Opaque Image - Nearest | 120.22 ms/f (9 iters) |
| 200 Rect - 200×200 - Opaque Image - Bilinear | 143.74 ms/f (7 iters) |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Nearest | 105.75 ms/f (10 iters) |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Bilinear | 104.19 ms/f (10 iters) |
[Image: source image, 1000000065]

Strategy 1: Don't query texture dimensions per pixel + simplify its calculation + move per-pixel calcs to vertex

Using #1517 as a base, in 36e3f3e, we simply ensured atlas dimensions were a power of 2 (to remove a division) and moved per-pixel calculations to the vertex shader.

This yielded a roughly 30-50% improvement across the board against the control (see the Improvement column below).
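To illustrate why a power-of-two dimension can remove a division, here is a minimal CPU-side sketch in Rust. It is not the code from 36e3f3e: the helper names (`pow2_atlas_dim`, `texel_xy_div`, `texel_xy_pow2`) are hypothetical, and it assumes the division in question splits a linear index into coordinates, which may not be the exact calculation the shader performs.

```rust
/// Hypothetical helper: rounds a requested atlas dimension up to a power of two.
fn pow2_atlas_dim(requested: u32) -> u32 {
    requested.max(1).next_power_of_two()
}

/// Splits a linear texel index into (x, y) using a general modulo and division.
fn texel_xy_div(index: u32, width: u32) -> (u32, u32) {
    (index % width, index / width)
}

/// Produces the same result when `width` is a power of two,
/// using only a bit mask and a shift (assumption: this mirrors
/// the kind of simplification a power-of-two size allows).
fn texel_xy_pow2(index: u32, width: u32) -> (u32, u32) {
    debug_assert!(width.is_power_of_two());
    let shift = width.trailing_zeros();
    (index & (width - 1), index >> shift)
}
```

For a requested width of 1000, `pow2_atlas_dim` returns 1024, and both coordinate functions agree for every index. The same reasoning applies to normalisation: with a fixed power-of-two size, a per-pixel divide by the dimension can become a multiply by a constant reciprocal.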

| Benchmark | Latency | Improvement |
| --- | --- | --- |
| 200 Rect - 200×200 - Image - Nearest | 76.45 ms/f (15 iters) | -35.9% |
| 200 Rect - 200×200 - Image - Bilinear | 71.47 ms/f (14 iters) | -48.6% |
| 200 Rect - 200×200 - Opaque Image - Nearest | 70.11 ms/f (15 iters) | -39.5% |
| 200 Rect - 200×200 - Opaque Image - Bilinear | 72.89 ms/f (15 iters) | -48.6% |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Nearest | 70.07 ms/f (15 iters) | -33.9% |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Bilinear | 69.81 ms/f (15 iters) | -32.7% |
[Image: source image, 1000000066]

Strategy 2: Remove branching

I wasn't sure why we pay the runtime cost of branching on image quality in the Vello Hybrid shader: don't most consumers want bilinear sampling only? I wondered whether it makes sense to strip out the branching for consumers who only use bilinear sampling. This improved performance by roughly 2x again! See b50aaee.

The idea here would be to add a feature flag to Hybrid to build-time remove this branching from the shader.
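As a rough CPU-side analogy of build-time branch removal (not the shader change in b50aaee), the sketch below uses a Rust const generic so that the quality branch disappears when a caller opts into bilinear-only sampling. All names here (`Quality`, `Image`, `sample`, `BILINEAR_ONLY`) are hypothetical, and in Hybrid the equivalent switch would presumably be a Cargo feature that selects or preprocesses the shader source.

```rust
/// Hypothetical sampling quality flag, standing in for a per-paint quality value.
#[derive(Clone, Copy, Debug, PartialEq)]
enum Quality {
    Nearest,
    Bilinear,
}

/// A tiny single-channel image used only for this illustration.
struct Image {
    width: usize,
    height: usize,
    texels: Vec<f32>,
}

impl Image {
    /// Fetches a texel, clamping coordinates to the image bounds.
    fn texel(&self, x: isize, y: isize) -> f32 {
        let cx = x.clamp(0, self.width as isize - 1) as usize;
        let cy = y.clamp(0, self.height as isize - 1) as usize;
        self.texels[cy * self.width + cx]
    }
}

/// Nearest-neighbour lookup at pixel-space coordinates.
fn sample_nearest(img: &Image, x: f32, y: f32) -> f32 {
    img.texel(x.floor() as isize, y.floor() as isize)
}

/// Bilinear lookup: blends the four texels surrounding (x, y),
/// treating texel centres as sitting at integer + 0.5.
fn sample_bilinear(img: &Image, x: f32, y: f32) -> f32 {
    let (fx, fy) = (x - 0.5, y - 0.5);
    let (x0, y0) = (fx.floor(), fy.floor());
    let (tx, ty) = (fx - x0, fy - y0);
    let (ix, iy) = (x0 as isize, y0 as isize);
    let top = img.texel(ix, iy) * (1.0 - tx) + img.texel(ix + 1, iy) * tx;
    let bottom = img.texel(ix, iy + 1) * (1.0 - tx) + img.texel(ix + 1, iy + 1) * tx;
    top * (1.0 - ty) + bottom * ty
}

/// With `BILINEAR_ONLY = true` the condition is a compile-time constant,
/// so the per-sample quality branch can be eliminated for that instantiation.
fn sample<const BILINEAR_ONLY: bool>(img: &Image, quality: Quality, x: f32, y: f32) -> f32 {
    if BILINEAR_ONLY {
        return sample_bilinear(img, x, y);
    }
    match quality {
        Quality::Nearest => sample_nearest(img, x, y),
        Quality::Bilinear => sample_bilinear(img, x, y),
    }
}
```

Calling `sample::<true>(...)` ignores the quality argument and always takes the bilinear path, while `sample::<false>(...)` keeps the runtime branch. The trade-off is the same as the one proposed above: consumers who need nearest sampling would have to build without the flag.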

| Benchmark | Latency | Improvement |
| --- | --- | --- |
| 200 Rect - 200×200 - Image - Nearest | 43.15 ms/f (30 iters) | -63.8% |
| 200 Rect - 200×200 - Image - Bilinear | 34.97 ms/f (23 iters) | -74.9% |
| 200 Rect - 200×200 - Opaque Image - Nearest | 34.86 ms/f (33 iters) | -69.9% |
| 200 Rect - 200×200 - Opaque Image - Bilinear | 37.35 ms/f (27 iters) | -73.7% |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Nearest | 36.18 ms/f (30 iters) | -65.9% |
| 200 Rect - 200×200 - Opaque Image (draw_image) - Bilinear | 34.29 ms/f (31 iters) | -67.0% |
[Image: source image, 1000000068]

Note: I tried removing the extend_mode function from the shader and didn't see much improvement on top of Strategy 2.

Extra: 58ea8fe

Simply returning final_color gave a further performance improvement (the same as if the tinting logic were commented out).

@LaurenzV
Collaborator

Wow, thanks for digging into this! However, I'm still curious whether this precludes the other optimizations I tried. I'm wondering, have you tried, on top of strategy 2, to see what numbers you get when:

  1. You remove extend calculations and use the transparent padding
  2. You remove the image tinting calculations

I'm still curious if this gives even more performance, because in my experiments that did clearly also have an impact. 🤔

@taj-p
Contributor Author

taj-p commented Mar 28, 2026

> Wow, thanks for digging into this! However, I'm still curious whether this precludes the other optimizations I tried. I'm wondering, have you tried, on top of strategy 2, to see what numbers you get when:
>
>   1. You remove extend calculations and use the transparent padding
>   2. You remove the image tinting calculations
>
> I'm still curious if this gives even more performance, because in my experiments that did clearly also have an impact. 🤔

> 1. You remove extend calculations and use the transparent padding

This didn't have much impact when I tried it on top of Strategy 2, but I think it should be verified.

> 2. You remove the image tinting calculations

About 10-20% improvement in performance on top of Strategy 2 (edit: this is superseded by 58ea8fe)

@taj-p
Contributor Author

taj-p commented Mar 28, 2026

cc @LaurenzV

58ea8fe removes the performance cost of tinting. In fact, it's faster than commenting out tinting for some suites.

edit: Unsure if related to tinting or simply a means to improve performance generally.

@nicoburns
Contributor

What size is your source image here, and what size are you rendering it at? Do these numbers vary for:

  • Downsampling vs. upsampling vs. exact size match?
  • Different ratios between source and output size?

@taj-p
Contributor Author

taj-p commented Mar 30, 2026

> What size is your source image here, and what size are you rendering it at? Do these numbers vary for:
>
>   • Downsampling vs. upsampling vs. exact size match?
>   • Different ratios between source and output size?

Good questions. These are likely good benchmarks to add to vello_bench2 at some point. We're rendering at device screen size, I believe.
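A minimal sketch of how such a case matrix could be enumerated, in Rust. This does not use any vello_bench2 API; `Scaling`, `classify`, and `cases` are hypothetical names, and the source sizes in the usage note are made up for illustration.

```rust
use std::cmp::Ordering;

/// Hypothetical classification of a benchmark case by source-to-output scale.
#[derive(Debug, PartialEq)]
enum Scaling {
    Downsampling,
    Exact,
    Upsampling,
}

/// Classifies a case by comparing the output edge length to the source edge length.
fn classify(source_px: u32, output_px: u32) -> Scaling {
    match output_px.cmp(&source_px) {
        Ordering::Less => Scaling::Downsampling,
        Ordering::Equal => Scaling::Exact,
        Ordering::Greater => Scaling::Upsampling,
    }
}

/// Builds (source, output, scaling) cases for one output size and several source sizes.
fn cases(output_px: u32, source_sizes: &[u32]) -> Vec<(u32, u32, Scaling)> {
    source_sizes
        .iter()
        .map(|&source| (source, output_px, classify(source, output_px)))
        .collect()
}
```

For example, `cases(200, &[50, 200, 800])` would cover upsampling, an exact match, and downsampling for the 200×200 rects used in the existing benchmarks.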
