vello_cpu: Add u8 fast path for some blend modes#1653
Conversation
37cf4ab to
d46b2c0
Compare
| } | ||
|
|
||
| impl MixExt for BlendMode { | ||
| fn mix<S: Simd>(&self, src_c: u8x32<S>, bg_c: u8x32<S>) -> Option<u8x32<S>> { |
There was a problem hiding this comment.
Nit: the trait pattern from highp/blend.rs doesn't earn its keep here — it's private, has a single impl, a single call site, and its method name shadows the BlendMode::mix field, so blend_mode.mix(src_c, bg_c) (method) and blend_mode.mix (field) look identical at a glance. A free function reads more straightforwardly:
fn try_u8_mix<S: Simd>(
blend_mode: BlendMode,
src_c: u8x32<S>,
bg_c: u8x32<S>,
) -> Option<u8x32<S>> {
// We implement the u8 fast path for blend modes that
// 1) are separable.
// 2) don't have too many divisions, since integer normalization is
// relatively expensive.
// In the future, it's possible to do further experimentation to see whether
// some more blend modes are worth doing in integer space.
Some(match blend_mode.mix {
Mix::Normal => src_c,
Mix::Multiply => Multiply::mix(src_c, bg_c),
Mix::Screen => Screen::mix(src_c, bg_c),
Mix::Overlay => Overlay::mix(src_c, bg_c),
Mix::Darken => Darken::mix(src_c, bg_c),
Mix::Lighten => Lighten::mix(src_c, bg_c),
Mix::HardLight => HardLight::mix(src_c, bg_c),
Mix::Difference => Difference::mix(src_c, bg_c),
Mix::Exclusion => Exclusion::mix(src_c, bg_c),
Mix::ColorDodge
| Mix::ColorBurn
| Mix::SoftLight
| Mix::Luminosity
| Mix::Color
| Mix::Hue
| Mix::Saturation => return None,
})
}| use vello_common::fearless_simd::*; | ||
| use vello_common::util::{Div255Ext, f32_to_u8, normalized_mul_u8x32}; | ||
|
|
||
| pub(crate) fn mix<S: Simd>(src_c: u8x32<S>, bg_c: u8x32<S>, blend_mode: BlendMode) -> u8x32<S> { |
There was a problem hiding this comment.
Would it make sense to add #[inline(always)] here?
There was a problem hiding this comment.
Yes, but I will do this in a follow-up since we need to fix this up in a couple of places anyway (see #1579).
| next_src | ||
| } else { | ||
| mix(next_src, bg_v, blend_mode) | ||
| blend::mix(next_src, bg_v, blend_mode) |
There was a problem hiding this comment.
Just thinking from a performance perspective, would it make sense to combine mix and compose into a single fused implementation? Could that improve performance even further?
One downside I can see is that you’d need implementations for every Mix * Compose combination. Still, it might make sense for a few commonly used subsets.
There was a problem hiding this comment.
It might help a bit (especially for u8), but I don't think it carries it's weight due to the large number of combinations you get (as you mentioned, not good for code size). And blending by itself is already pretty slow. I also don't think it's common at all to have a non-default blend mode + composition mode set.
d46b2c0 to
847b894
Compare
) linebender#1653 added a new fast path for performing blending with u8/u16. Something that was missed here was that in many cases, we can exceed the technically allowed maximum of 255. In the previous f32 path, this wasn't a problem because converting a float larger than 255.0 to u8 it will automatically be clamped to, but this is not the case when doing all of blending using u8. I noticed this when trying to use the newest version of vello_cpu in my PDF renderer, where some pixels turned black. This PR fixes this by 1) Ensuring all pixels are clamped to 255. 2) Additionally, ensuring that all RGB values are <= the alpha, which is necessary for premultiplied colors. With this fix applied, I didn't notice any other regressions in my test suite. First commit adds failing tests, second commit fixes the issue.
Mostly generated with codex, but I did look at it myself and make some adjustments, so I hope it's good now. Since we do have quite a few tests for blending (both manual ones as well as via COLR), not too concerned about correctness issues here.
Note that this does not address #1579 yet so it's possible this won't have much effect on AVX2. However, on NEON I'm seeing 4x-5x speedups for blending now: