perf: add palette transform fast paths #713

Open
hjanuschka wants to merge 3 commits into libjxl:main from hjanuschka:perf/pr07-palette-fastpaths

@hjanuschka hjanuschka commented Mar 16, 2026

Collaborator

This optimizes palette transforms by prefetching palette rows and adding a direct no-delta lookup fast path, while keeping delta handling behavior unchanged. It also keeps delta prediction on the cross-grid-safe path to preserve conformance correctness. Unsafe is used only for unchecked palette index reads in the proven in-bounds fast path where idx < palette_size and palette_size <= pal_row.len().
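As a rough illustration of the two ideas in this description, the sketch below fetches the palette row once per row instead of once per pixel, and reads it without a bounds check only where the stated invariant `idx < palette_size <= pal_row.len()` holds. The `Image` struct, both function names, and the `0` returned for out-of-range indices are assumptions made for the example; they are not the crate's actual types or its out-of-range behavior.

```rust
/// Toy row-major image; a hypothetical stand-in for the crate's `Image<i32>`.
struct Image {
    width: usize,
    data: Vec<i32>,
}

impl Image {
    fn row(&self, y: usize) -> &[i32] {
        &self.data[y * self.width..(y + 1) * self.width]
    }
}

/// Baseline shape: the palette row is fetched again for every pixel.
fn lookup_per_pixel(palette: &Image, c: usize, indices: &[usize], out: &mut [i32]) {
    for (o, &idx) in out.iter_mut().zip(indices) {
        *o = palette.row(c)[idx];
    }
}

/// Fast-path shape: fetch the row once, then read it unchecked under the invariant.
fn lookup_prefetched(
    palette: &Image,
    c: usize,
    palette_size: usize,
    indices: &[usize],
    out: &mut [i32],
) {
    let pal_row = palette.row(c);
    assert!(palette_size <= pal_row.len());
    for (o, &idx) in out.iter_mut().zip(indices) {
        *o = if idx < palette_size {
            // SAFETY: idx < palette_size, and palette_size <= pal_row.len()
            // was asserted above, so idx is in bounds.
            unsafe { *pal_row.get_unchecked(idx) }
        } else {
            0 // placeholder (assumption); the real code handles this case differently
        };
    }
}
```

For in-range indices both functions produce the same output; the second only changes where the row lookup and the bounds reasoning happen.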

@github-actions


Benchmark @ 3ca7625

MULTI-FILE BENCHMARK RESULTS (4 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.30). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: e883140e (Base) vs cba318d6 (PR)

| File | Base (MP/s) | PR (MP/s) | Δ% |
|---|---|---|---|
| bike.jxl | 24.486 | 23.857 | -2.57% ±2.7% |
| green_queen_modular_e3.jxl | 7.872 | 7.780 | -1.16% ±0.5% |
| green_queen_vardct_e3.jxl | 23.852 | 23.676 | -0.74% ±2.3% |
| sunset_logo.jxl | 2.784 | 2.698 | -3.07% ±1.8% |
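One possible way to read the Δ% column is to compare each delta against its ± value. The sketch below assumes the ± figure is the half-width of the reported confidence interval, which the benchmark output does not state explicitly. The struct and function names are made up for the example; the numbers are copied from the results above.

```rust
/// One benchmark result: reported delta and its ± margin, both in percent.
struct BenchRow {
    file: &'static str,
    delta_pct: f64,
    margin_pct: f64,
}

/// Values copied from the benchmark comment.
const ROWS: [BenchRow; 4] = [
    BenchRow { file: "bike.jxl", delta_pct: -2.57, margin_pct: 2.7 },
    BenchRow { file: "green_queen_modular_e3.jxl", delta_pct: -1.16, margin_pct: 0.5 },
    BenchRow { file: "green_queen_vardct_e3.jxl", delta_pct: -0.74, margin_pct: 2.3 },
    BenchRow { file: "sunset_logo.jxl", delta_pct: -3.07, margin_pct: 1.8 },
];

/// True when the delta is larger in magnitude than its margin
/// (assumption: the margin is an interval half-width).
fn exceeds_margin(row: &BenchRow) -> bool {
    row.delta_pct.abs() > row.margin_pct
}

/// Names of the files whose delta falls outside its margin.
fn files_outside_margin(rows: &[BenchRow]) -> Vec<&'static str> {
    rows.iter()
        .filter(|r| exceeds_margin(r))
        .map(|r| r.file)
        .collect()
}
```

Under that reading, `files_outside_margin(&ROWS)` returns `green_queen_modular_e3.jxl` and `sunset_logo.jxl`; the other two deltas sit inside their margins, and the noisy-system warning applies to all four.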

Comment thread jxl/src/frame/modular/transforms/palette.rs Outdated
Address review feedback from veluca93: replace the get_unchecked calls in
the no-delta palette fast paths with safe indexing, gated by an
assert!(palette_size <= pal_row.len()) outside the inner loop. Combined
with the existing 'if idx < palette_size' guard, the compiler can prove
in-bounds access and elide the check while still keeping the code
unsafe-free.
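
A minimal sketch of the pattern described in the paragraph above: one `assert!` outside the loop plus the per-pixel guard, followed by ordinary safe indexing. The function names and the `0` placeholder for out-of-range indices are assumptions for illustration. Whether the bounds check is actually elided depends on the optimizer and is best confirmed by inspecting the generated code.

```rust
/// Placeholder for indices outside `0..palette_size` (assumption; the real
/// decoder has its own handling for this case).
fn out_of_range_value(_idx: i32) -> i32 {
    0
}

/// Safe direct lookup for one row of palette indices.
fn lookup_row_safe(pal_row: &[i32], palette_size: usize, indices: &[i32], out: &mut [i32]) {
    // Checked once per row, outside the inner loop.
    assert!(palette_size <= pal_row.len());
    for (o, &idx) in out.iter_mut().zip(indices) {
        *o = if idx >= 0 && (idx as usize) < palette_size {
            // idx < palette_size <= pal_row.len(), so this index cannot panic;
            // the optimizer may use that to drop the bounds check.
            pal_row[idx as usize]
        } else {
            out_of_range_value(idx)
        };
    }
}
```

If the `assert!` were missing, the code would still be memory-safe; it would only lose the fact that links `palette_size` to the slice length.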

Also drop #[inline(always)] from the large body functions
(do_palette_step_general, do_palette_step_one_group,
do_palette_step_group_row, get_prediction_data); keep it only on the
small helpers (scale, get_palette_value_with_row). The original commit's
benchmark showed a regression on all four files (-0.74% .. -3.07%), and
aggressive inlining of these large bodies is a plausible contributor via
icache pressure. The structural wins (hoisting palette.row(c) out of the
inner loops, splitting the no-delta case) are retained.
palette: &Image<i32>,
/// Look up palette value. `pal_row` is the pre-fetched palette row for channel `c`.
#[inline(always)]
fn get_palette_value_with_row(

Member

Please rename this back to get_palette_value.

let index_img = buf_in.data.row(y);
let palette_size = num_colors + num_deltas;

if num_deltas == 0 {

Member

In do_palette_step_general, this is gated by the predictor also being zero - why is this not the same here?

If the fast path is basically equivalent, shouldn't we factor it out to a function?

What images trigger this fast path?

Address review feedback from veluca93:

* Rename get_palette_value_with_row back to get_palette_value. The
  function's role hasn't changed; it just now takes a pre-fetched palette
  row (`&[i32]`) instead of looking it up via `palette.row(c)` on every
  call. Updated the doc comment to reflect that.

* Drop the predictor == Predictor::Zero half of the fast-path gate in
  do_palette_step_general. When num_deltas == 0, the
  `if index < num_deltas as i32` branch in the weighted/general arms is
  never taken, so predictor.predict_one is never called and the WP
  state's update_errors writes are dead. The fast path is therefore
  correct for any predictor when num_deltas == 0, matching what
  do_palette_step_one_group was already doing.

* Factor the common inner loop into apply_palette_lookup_row. Both
  fast paths now call it, removing the duplicated direct-lookup /
  out-of-range-fallback / assert pattern.

Behavior unchanged. 639 unit tests pass.
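
The gating argument in this commit message can be tried out on a toy model. The sketch below is not the crate's code: the two-variant `Predictor`, `predict_one` taking only the left neighbor, `general_row`, and the `0` placeholder for out-of-range indices are all assumptions. Only the name `apply_palette_lookup_row` and the `index < num_deltas as i32` condition come from the text above.

```rust
/// Hypothetical two-variant predictor; the real decoder has more.
#[derive(Clone, Copy)]
enum Predictor {
    Zero,
    Left,
}

fn predict_one(p: Predictor, left: i32) -> i32 {
    match p {
        Predictor::Zero => 0,
        Predictor::Left => left,
    }
}

/// Palette lookup with a placeholder `0` for out-of-range indices (assumption).
fn palette_value(pal_row: &[i32], palette_size: usize, idx: i32) -> i32 {
    if idx >= 0 && (idx as usize) < palette_size {
        pal_row[idx as usize]
    } else {
        0
    }
}

/// Shared fast path: plain lookup, no prediction.
fn apply_palette_lookup_row(
    pal_row: &[i32],
    palette_size: usize,
    indices: &[i32],
    out: &mut [i32],
) {
    assert!(palette_size <= pal_row.len());
    for (o, &idx) in out.iter_mut().zip(indices) {
        *o = palette_value(pal_row, palette_size, idx);
    }
}

/// Toy general path: adds a prediction when `idx < num_deltas as i32`.
fn general_row(
    pal_row: &[i32],
    palette_size: usize,
    num_deltas: usize,
    predictor: Predictor,
    indices: &[i32],
    out: &mut [i32],
) {
    assert!(palette_size <= pal_row.len());
    let mut left = 0;
    for (o, &idx) in out.iter_mut().zip(indices) {
        let mut val = palette_value(pal_row, palette_size, idx);
        if idx < num_deltas as i32 {
            val += predict_one(predictor, left);
        }
        *o = val;
        left = val;
    }
}
```

In this model, with `num_deltas == 0` and non-negative indices, `general_row` matches `apply_palette_lookup_row` for either predictor. A negative index still satisfies `idx < num_deltas as i32` in the model, so the two paths can differ there; whether negative indices can reach this code in the real decoder is not something the sketch settles.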
2 participants