vello_cpu: Add u8 fast path for some blend modes by LaurenzV · Pull Request #1653 · linebender/vello

LaurenzV · 2026-05-16T13:14:44Z

Mostly generated with codex, but I did look at it myself and make some adjustments, so I hope it's good now. Since we do have quite a few tests for blending (both manual ones as well as via COLR), not too concerned about correctness issues here.

Note that this does not address #1579 yet so it's possible this won't have much effect on AVX2. However, on NEON I'm seeing 4x-5x speedups for blending now:

fine/blend/normal_u8_neon
                        time:   [40.096 ns 40.297 ns 40.521 ns]
                        change: [-1.1748% +0.8265% +3.1392%] (p = 0.45 > 0.05)
                        No change in performance detected.
Found 8 outliers among 100 measurements (8.00%)
  3 (3.00%) high mild
  5 (5.00%) high severe

fine/blend/multiply_u8_neon
                        time:   [976.39 ns 978.78 ns 981.34 ns]
                        change: [-80.790% -80.728% -80.664%] (p = 0.00 < 0.05)
                        Performance has improved.

fine/blend/screen_u8_neon
                        time:   [1.0140 µs 1.0167 µs 1.0199 µs]
                        change: [-80.496% -80.407% -80.320%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

fine/blend/overlay_u8_neon
                        time:   [1.3631 µs 1.3667 µs 1.3701 µs]
                        change: [-74.682% -74.592% -74.499%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

fine/blend/darken_u8_neon
                        time:   [1.1359 µs 1.1385 µs 1.1412 µs]
                        change: [-77.273% -77.197% -77.125%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

fine/blend/lighten_u8_neon
                        time:   [1.1535 µs 1.1557 µs 1.1582 µs]
                        change: [-77.013% -76.936% -76.857%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

fine/blend/color_dodge_u8_neon
                        time:   [5.6951 µs 5.7070 µs 5.7195 µs]
                        change: [+1.6232% +1.9789% +2.3529%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 4 outliers among 100 measurements (4.00%)
  3 (3.00%) high mild
  1 (1.00%) high severe

fine/blend/color_burn_u8_neon
                        time:   [5.6208 µs 5.6334 µs 5.6466 µs]
                        change: [+1.5668% +1.9000% +2.2646%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 1 outliers among 100 measurements (1.00%)
  1 (1.00%) high mild

fine/blend/hard_light_u8_neon
                        time:   [1.3581 µs 1.3602 µs 1.3626 µs]
                        change: [-75.426% -75.345% -75.265%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  4 (4.00%) high mild
  1 (1.00%) high severe

fine/blend/soft_light_u8_neon
                        time:   [6.0497 µs 6.0630 µs 6.0768 µs]
                        change: [+1.2844% +1.7849% +2.2700%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

fine/blend/difference_u8_neon
                        time:   [1.2694 µs 1.2720 µs 1.2747 µs]
                        change: [-75.605% -75.514% -75.423%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 3 outliers among 100 measurements (3.00%)
  1 (1.00%) high mild
  2 (2.00%) high severe

fine/blend/exclusion_u8_neon
                        time:   [1.0596 µs 1.0614 µs 1.0634 µs]
                        change: [-80.316% -80.250% -80.184%] (p = 0.00 < 0.05)
                        Performance has improved.
Found 5 outliers among 100 measurements (5.00%)
  2 (2.00%) high mild
  3 (3.00%) high severe

fine/blend/hue_u8_neon  time:   [8.5128 µs 8.5387 µs 8.5659 µs]
                        change: [+1.5041% +1.9534% +2.4143%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 5 outliers among 100 measurements (5.00%)
  5 (5.00%) high mild

fine/blend/saturation_u8_neon
                        time:   [8.5693 µs 8.6052 µs 8.6431 µs]
                        change: [+1.7844% +2.2460% +2.6872%] (p = 0.00 < 0.05)
                        Performance has regressed.
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

fine/blend/color_u8_neon
                        time:   [7.5338 µs 7.5591 µs 7.5869 µs]
                        change: [+0.7948% +1.2293% +1.6653%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 3 outliers among 100 measurements (3.00%)
  3 (3.00%) high mild

fine/blend/luminosity_u8_neon
                        time:   [7.5325 µs 7.5531 µs 7.5772 µs]
                        change: [+0.8659% +1.3864% +1.8684%] (p = 0.00 < 0.05)
                        Change within noise threshold.
Found 6 outliers among 100 measurements (6.00%)
  3 (3.00%) high mild
  3 (3.00%) high severe
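
As a sanity check on the "4x-5x" figure, a Criterion `change` percentage can be converted into a speedup factor: a change of c percent means the new time is (1 + c/100) times the old time, so the speedup is 1 / (1 + c/100). The helper name `speedup_from_change` is hypothetical and only serves this illustration; the inputs mentioned afterwards are the median change values from the output above.

```rust
/// Converts a Criterion "change" percentage (negative means faster)
/// into a speedup factor, i.e. old time divided by new time.
fn speedup_from_change(change_percent: f64) -> f64 {
    1.0 / (1.0 + change_percent / 100.0)
}
```

Plugging in the medians, `speedup_from_change(-80.728)` (multiply) gives roughly 5.2x and `speedup_from_change(-74.592)` (overlay) roughly 3.9x, while a mode without a fast path such as color_dodge (`+1.9789`) stays at about 0.98x, i.e. effectively unchanged.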

grebmeg

Nice speedup! 🚀

grebmeg · 2026-05-22T01:56:54Z

+}
+
+impl MixExt for BlendMode {
+    fn mix<S: Simd>(&self, src_c: u8x32<S>, bg_c: u8x32<S>) -> Option<u8x32<S>> {


Nit: the trait pattern from highp/blend.rs doesn't earn its keep here — it's private, has a single impl, a single call site, and its method name shadows the BlendMode::mix field, so blend_mode.mix(src_c, bg_c) (method) and blend_mode.mix (field) look identical at a glance. A free function reads more straightforwardly:

fn try_u8_mix<S: Simd>(
    blend_mode: BlendMode,
    src_c: u8x32<S>,
    bg_c: u8x32<S>,
) -> Option<u8x32<S>> {
    // We implement the u8 fast path for blend modes that
    // 1) are separable.
    // 2) don't have too many divisions, since integer normalization is
    // relatively expensive.
    // In the future, it's possible to do further experimentation to see whether
    // some more blend modes are worth doing in integer space.
    Some(match blend_mode.mix {
        Mix::Normal => src_c,
        Mix::Multiply => Multiply::mix(src_c, bg_c),
        Mix::Screen => Screen::mix(src_c, bg_c),
        Mix::Overlay => Overlay::mix(src_c, bg_c),
        Mix::Darken => Darken::mix(src_c, bg_c),
        Mix::Lighten => Lighten::mix(src_c, bg_c),
        Mix::HardLight => HardLight::mix(src_c, bg_c),
        Mix::Difference => Difference::mix(src_c, bg_c),
        Mix::Exclusion => Exclusion::mix(src_c, bg_c),
        Mix::ColorDodge
        | Mix::ColorBurn
        | Mix::SoftLight
        | Mix::Luminosity
        | Mix::Color
        | Mix::Hue
        | Mix::Saturation => return None,
    })
}

Makes sense!
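
To illustrate why the number of normalizations matters, here is a minimal scalar sketch of what "integer normalization" means for a few of the separable modes. This is not the code from the PR: the PR works on `u8x32` SIMD vectors and uses `normalized_mul_u8x32` and `Div255Ext` from `vello_common`, whose implementations are not shown here. The function names below are hypothetical, alpha handling is ignored, and the division uses the well-known shift-and-add trick for rounded division by 255.

```rust
/// Rounded division by 255 using only adds and shifts.
/// Exact for every product of two u8 values (0..=65025).
fn div_255(x: u16) -> u16 {
    let t = x + 128;
    (t + (t >> 8)) >> 8
}

/// Multiplies two channels that encode fractions in [0, 1] as 0..=255,
/// i.e. computes round(a * b / 255). This is one "normalization".
fn normalized_mul(a: u8, b: u8) -> u8 {
    div_255(a as u16 * b as u16) as u8
}

/// Multiply: B(cb, cs) = cb * cs. Needs one normalization.
fn multiply(src: u8, bg: u8) -> u8 {
    normalized_mul(src, bg)
}

/// Screen: B(cb, cs) = cb + cs - cb * cs. Needs one normalization.
fn screen(src: u8, bg: u8) -> u8 {
    // The normalized product is never larger than either input, so the
    // subtraction cannot underflow and the result stays within 0..=255.
    (src as u16 + bg as u16 - normalized_mul(src, bg) as u16) as u8
}

/// Darken: B(cb, cs) = min(cb, cs). Needs no normalization at all.
fn darken(src: u8, bg: u8) -> u8 {
    src.min(bg)
}
```

Modes like these map to a handful of integer operations per channel, whereas modes such as color dodge divide by a per-pixel value, which does not reduce to a fixed division by 255.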

grebmeg · 2026-05-22T01:59:13Z

+use vello_common::fearless_simd::*;
+use vello_common::util::{Div255Ext, f32_to_u8, normalized_mul_u8x32};
+
+pub(crate) fn mix<S: Simd>(src_c: u8x32<S>, bg_c: u8x32<S>, blend_mode: BlendMode) -> u8x32<S> {


Would it make sense to add #[inline(always)] here?

Yes, but I will do this in a follow-up since we need to fix this up in a couple of places anyway (see #1579).

Will add a TODO.

grebmeg · 2026-05-22T02:22:53Z

                        next_src
                    } else {
-                        mix(next_src, bg_v, blend_mode)
+                        blend::mix(next_src, bg_v, blend_mode)


Just thinking from a performance perspective, would it make sense to combine mix and compose into a single fused implementation? Could that improve performance even further?

One downside I can see is that you’d need implementations for every Mix * Compose combination. Still, it might make sense for a few commonly used subsets.

It might help a bit (especially for u8), but I don't think it carries its weight due to the large number of combinations you get (as you mentioned, not good for code size). And blending by itself is already pretty slow. I also don't think it's common at all to have a non-default blend mode + composition mode set.

linebender#1653 added a new fast path for performing blending with u8/u16. Something that was missed there is that in many cases, we can exceed the technically allowed maximum of 255. In the previous f32 path, this wasn't a problem because converting a float larger than 255.0 to u8 automatically clamps it to 255, but this is not the case when doing all of the blending in u8. I noticed this when trying to use the newest version of vello_cpu in my PDF renderer, where some pixels turned black.

This PR fixes this by
1) Ensuring all pixels are clamped to 255.
2) Additionally, ensuring that all RGB values are <= the alpha, which is necessary for premultiplied colors.

With this fix applied, I didn't notice any other regressions in my test suite. First commit adds failing tests, second commit fixes the issue.
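
The failure mode described above can be reproduced in scalar Rust. This is only a sketch under stated assumptions: the function names are hypothetical, a plain addition stands in for whichever blend arithmetic overflows in vello_cpu, and Rust's scalar `as` cast stands in for the float-to-u8 conversion of the old f32 path. It is not the code from the follow-up fix.

```rust
/// Float path: Rust's `as` cast from f32 to u8 saturates, so an
/// out-of-range result is silently clamped to 255.
fn to_u8_via_f32(value: f32) -> u8 {
    value as u8
}

/// Integer path without a clamp: the sum wraps around modulo 256,
/// which can turn a bright channel nearly black.
fn add_unclamped(a: u8, b: u8) -> u8 {
    a.wrapping_add(b)
}

/// Integer path with both fixes: clamp to 255 first, then keep the
/// premultiplied color channel no larger than its alpha.
fn add_clamped(a: u8, b: u8, alpha: u8) -> u8 {
    a.saturating_add(b).min(alpha)
}
```

With inputs 200 and 100, the float path yields 255, the unclamped integer path wraps to 44, and the clamped version yields 255 for an opaque pixel or the alpha value itself (for example 128) for a translucent one.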

LaurenzV requested a review from grebmeg May 16, 2026 13:15

LaurenzV force-pushed the laurenz/fast_blend branch 2 times, most recently from 37cf4ab to d46b2c0 Compare May 16, 2026 13:19

grebmeg approved these changes May 22, 2026


LaurenzV added 3 commits May 22, 2026 19:44

.

c94e5a0

Change trait to function

78fd45b

Add TODO

847b894

laurenz-canva force-pushed the laurenz/fast_blend branch from d46b2c0 to 847b894 Compare May 22, 2026 17:49

LaurenzV changed the title ~~Add u8 fast path for some blend modes~~ vello_cpu: Add u8 fast path for some blend modes May 22, 2026

LaurenzV enabled auto-merge May 22, 2026 17:56

LaurenzV added this pull request to the merge queue May 22, 2026

github-merge-queue Bot removed this pull request from the merge queue due to failed status checks May 22, 2026

LaurenzV added this pull request to the merge queue May 22, 2026

Merged via the queue into main with commit 4057c5a May 22, 2026
17 checks passed

LaurenzV deleted the laurenz/fast_blend branch May 22, 2026 18:59

LaurenzV mentioned this pull request May 29, 2026

vello_cpu: Fix potential overflows in new blending code #1684

Merged

LaurenzV merged 3 commits into main from laurenz/fast_blend
