perf: comprehensive single-threaded decode performance optimizations (+46%)#705

Draft
hjanuschka wants to merge 16 commits into
libjxl:mainfrom
hjanuschka:pr/decode-perf-optimizations

@hjanuschka hjanuschka commented Mar 14, 2026

Collaborator

Summary

Bring jxl-rs single-threaded decoding performance on par with libjxl

Key Optimizations

Bounds check elimination

  • Unchecked array access in predict_flat, entropy decoding, Huffman table lookup, and weighted predictor hot paths
  • Split predict_flat into no_wp/with_wp variants to eliminate Option overhead in compute_properties
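
A minimal sketch of the split idea, assuming a much simpler predictor than the real one. `WpState`, `predict_generic`, `predict_no_wp` and `predict_with_wp` are hypothetical names for illustration, not the jxl-rs API. The point is only that the two specialized variants have no `Option` to inspect per pixel:

```rust
/// Hypothetical stand-in for weighted-predictor state; not the jxl-rs type.
pub struct WpState {
    pub prediction: i32,
}

/// Single entry point: every call has to inspect the `Option`.
#[inline]
pub fn predict_generic(left: i32, top: i32, wp: Option<&WpState>) -> i32 {
    let base = (left + top) / 2;
    match wp {
        Some(state) => base + state.prediction,
        None => base,
    }
}

/// Variant for trees that never use the weighted predictor.
#[inline(always)]
pub fn predict_no_wp(left: i32, top: i32) -> i32 {
    (left + top) / 2
}

/// Variant for trees that always carry weighted-predictor state.
#[inline(always)]
pub fn predict_with_wp(left: i32, top: i32, wp: &WpState) -> i32 {
    (left + top) / 2 + wp.prediction
}
```

The caller picks the variant once per channel (or per tree) instead of once per pixel.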

Inlining strategy

  • Inline entropy reads (read_signed_clustered_inline) matching libjxl's inlining for NoWpTree/GeneralTree decoders
  • Comprehensive #[inline(always)] audit across the entire hot path: decode, predict, entropy, blending, render pipeline, SmallVec, Channels, image buffers, utility functions

BitReader optimization

  • Extend section buffers with 8 zero-padding bytes so refill() always takes the fast path, eliminating refill_slow overhead (was ~4.5% of modular decode)
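
To illustrate the padding idea, here is a sketch of an LSB-first bit reader whose buffer carries 8 trailing zero bytes, so refill can always do a full 8-byte load. `PaddedBitReader` and its methods are hypothetical and not the jxl-rs `BitReader`; clamping the position so that over-reads return zeros is an assumption of this sketch, not something the PR describes:

```rust
/// Zero bytes appended so an 8-byte load never runs past the buffer.
const PADDING: usize = 8;

/// Hypothetical minimal LSB-first bit reader; not the jxl-rs `BitReader`.
pub struct PaddedBitReader {
    data: Vec<u8>,      // payload followed by PADDING zero bytes
    payload_len: usize, // length of the real payload
    pos: usize,         // byte that holds the first not-yet-counted bit
    bits: u64,          // buffered bits, least significant first
    count: u32,         // number of valid bits in `bits` (0..=63)
}

impl PaddedBitReader {
    pub fn new(payload: &[u8]) -> Self {
        let mut data = Vec::with_capacity(payload.len() + PADDING);
        data.extend_from_slice(payload);
        data.extend_from_slice(&[0u8; PADDING]);
        Self { data, payload_len: payload.len(), pos: 0, bits: 0, count: 0 }
    }

    /// The only refill path: one unconditional 8-byte load.
    #[inline(always)]
    fn refill(&mut self) {
        // Past the payload every byte is zero, so clamping keeps the load in
        // bounds without changing the bits that are read.
        let pos = self.pos.min(self.payload_len);
        let mut word = [0u8; 8];
        word.copy_from_slice(&self.data[pos..pos + 8]);
        let chunk = u64::from_le_bytes(word);
        // Bits above `count` were loaded from the same bytes, so OR is idempotent.
        self.bits |= chunk << self.count;
        self.pos = pos + ((63 - self.count) >> 3) as usize;
        self.count |= 56; // now between 56 and 63 valid bits
    }

    /// Reads `n` bits (n <= 56); bits past the payload read as zero.
    pub fn read(&mut self, n: u32) -> u64 {
        debug_assert!(n <= 56);
        self.refill();
        let value = self.bits & ((1u64 << n) - 1);
        self.bits >>= n;
        self.count -= n;
        value
    }
}
```

Without the padding, the load near the end of the section would need a separate byte-by-byte path, which is the kind of slow path the bullet above describes removing.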

Allocation elimination

  • Pre-allocated tmp buffer + stack-allocated slice arrays in blending (eliminate per-row heap allocation)
  • Hoist property_buffer allocation outside the row loop in modular decode
  • Optimize get_distinct_indices with raw pointers (eliminate Option overhead in render pipeline)
  • Release profile: thin LTO, panic=abort, overflow-checks=false
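
The hoisting pattern for per-row buffers can be sketched as follows. `process_row` and `decode_rows` are hypothetical helpers with made-up per-row work, not the modular decoder itself. The scratch `Vec` is allocated once and only cleared per row, which keeps its capacity:

```rust
/// Hypothetical per-row work: fills `scratch` with derived values and sums them.
fn process_row(row: &[i32], scratch: &mut Vec<i32>) -> i64 {
    scratch.clear(); // keeps the capacity, so no reallocation after warm-up
    scratch.extend(row.iter().map(|&v| v * 2));
    scratch.iter().map(|&v| v as i64).sum()
}

/// Allocates the scratch buffer once, outside the row loop.
pub fn decode_rows(rows: &[Vec<i32>]) -> Vec<i64> {
    let width = rows.iter().map(|r| r.len()).max().unwrap_or(0);
    let mut scratch = Vec::with_capacity(width);
    rows.iter()
        .map(|row| process_row(row, &mut scratch))
        .collect()
}
```

For example, `decode_rows(&[vec![1, 2, 3], vec![4, 5]])` performs one allocation for `scratch` regardless of the number of rows.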

…(+46%)

Bring jxl-rs single-threaded decoding performance on par with libjxl 0.12.

Measured on AMD EPYC 9645 (AVX-512), single-threaded:
- sunset_logo (modular): 3.6 -> 6.2 MP/s (58% -> 105% of libjxl)
- bike (VarDCT): 33.6 -> 39.0 MP/s (87% -> 100% of libjxl)
- green_queen modular: 9.1 -> 10.9 MP/s (100% -> 119% of libjxl)
- green_queen VarDCT: 29.1 -> 38.1 MP/s (95% -> 118% of libjxl)
- Total: 12.9 -> 19.0 MP/s (+47%)

Key optimizations:
- Eliminate bounds checks in hot paths (predict_flat, entropy decoding,
  Huffman table lookup, weighted predictor) using unchecked array access
- Split predict_flat into no_wp/with_wp variants to eliminate Option
  overhead in compute_properties
- Inline entropy reads (read_signed_clustered_inline) matching libjxl's
  inlining strategy for NoWpTree/GeneralTree decoders
- BitReader buffer padding: extend section buffers with 8 zero bytes so
  refill() always takes the fast path, eliminating refill_slow overhead
- Comprehensive #[inline(always)] audit across entire hot path: decode,
  predict, entropy, blending, render pipeline, SmallVec, Channels,
  image buffers, utility functions
- Eliminate per-row heap allocations in blending (pre-allocated tmp
  buffer + stack-allocated slice arrays) and modular decode (hoist
  property_buffer allocation outside row loop)
- Optimize get_distinct_indices with raw pointers (eliminate Option
  overhead in render pipeline)
- Release profile: thin LTO, panic=abort, overflow-checks=false
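
For reference, the release profile bullet would correspond to a Cargo configuration along these lines. This is a sketch using standard Cargo profile keys; the actual `Cargo.toml` changes in this PR may differ:

```toml
# Sketch of the release profile described above (standard Cargo keys).
[profile.release]
lto = "thin"            # thin link-time optimization
panic = "abort"         # abort instead of unwinding on panic
overflow-checks = false # no integer overflow checks in release builds
```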

No API changes. No new dependencies. All tests pass.
@veluca93

Member

This PR has a lot of changes with a lot of unsafe code.

Let's split into many PRs that are as small as possible and each implement individual, independent optimizations (especially for optimizations that rely on unsafe code)

@hjanuschka

Collaborator Author

This PR is a draft. I'll try autoresearch to see how far it can push performance (I just wanted to kick off the benchmark CI, but it failed anyway). Once it has plateaued, I'll cherry-pick ideas and finalize them as standalone CLs.

Add #[inline(always)] to conformance-set hot paths discovered via
profiling additional benchmark images:

- Palette transform: get_palette_value, get_prediction_data,
  do_palette_step_one_group, do_palette_step_general,
  do_palette_step_group_row
- Prediction: PredictionData::get, get_rows, get_with_neighbors
- Squeeze: do_hsqueeze_step, do_vsqueeze_step
- LZ77/RLE: apply_copy, pull_symbol, push_decoded_symbol, push_token
- Patches: add_one_row, set_patches_for_row

Improves delta_palette by ~10% and bicycles by ~5-7%.

The LZ77 window is accessed via copy_pos & WINDOW_MASK which is always
within bounds. Use get_unchecked/get_unchecked_mut to eliminate bounds
checks in the hot copy loop.
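
A sketch of why the masked index makes unchecked access sound. `Window`, `WINDOW_SIZE` and the method bodies are hypothetical (the real window size and layout are not shown in this thread); the only property relied on is that the buffer length is a power of two equal to `WINDOW_MASK + 1`:

```rust
const WINDOW_SIZE: usize = 1 << 12; // hypothetical size; must be a power of two
const WINDOW_MASK: usize = WINDOW_SIZE - 1;

pub struct Window {
    buf: Box<[u32; WINDOW_SIZE]>,
    pos: usize, // total number of symbols written so far
}

impl Window {
    pub fn new() -> Self {
        Self { buf: Box::new([0; WINDOW_SIZE]), pos: 0 }
    }

    #[inline(always)]
    pub fn push(&mut self, symbol: u32) {
        let idx = self.pos & WINDOW_MASK;
        // SAFETY: idx <= WINDOW_MASK < WINDOW_SIZE == self.buf.len().
        unsafe { *self.buf.get_unchecked_mut(idx) = symbol };
        self.pos = self.pos.wrapping_add(1);
    }

    /// Copies `len` symbols starting `distance` symbols back. Overlapping
    /// copies repeat recent output, as in LZ77.
    pub fn apply_copy(&mut self, distance: usize, len: usize, out: &mut Vec<u32>) {
        let mut copy_pos = self.pos.wrapping_sub(distance);
        for _ in 0..len {
            // SAFETY: the mask keeps the index below WINDOW_SIZE.
            let symbol = unsafe { *self.buf.get_unchecked(copy_pos & WINDOW_MASK) };
            self.push(symbol);
            out.push(symbol);
            copy_pos = copy_pos.wrapping_add(1);
        }
    }
}
```

Because the bound follows from the mask alone, the invariant can be reviewed locally, which is the kind of justification a `// SAFETY:` comment should carry.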

Improves lz77_flower by ~5% and bicycles by ~13%.

Update LZ77 pull_symbol/push_decoded_symbol with unchecked window
access and inline palette scale() helper.

Add #[inline(always)] to Noise::strength (called per-pixel for noise
synthesis) and Xorshift128Plus::fill (PRNG for noise generation).

These were 7.8% and 2.9% respectively in the noise conformance image.

When the input and output color profiles are both ICC and have
identical bytes, skip the CMS transform entirely. This is a no-op
identity transform that lcms2 was performing at great cost (pow()
calls for tone curve evaluation).

This commonly happens for non-XYB images with embedded ICC profiles
where no user output profile is set -- the default output is the
embedded profile itself.
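
The check itself is small. A sketch, assuming a simplified profile type: `ColorProfile` and `needs_cms_transform` are hypothetical names, not the jxl-rs color API:

```rust
/// Hypothetical, simplified profile description.
pub enum ColorProfile {
    /// Raw ICC profile bytes.
    Icc(Vec<u8>),
    /// Stand-in for any non-ICC description (e.g. an enumerated color space).
    Other,
}

/// Returns false only when both profiles are ICC with identical bytes,
/// in which case the transform would be an identity and can be skipped.
pub fn needs_cms_transform(input: &ColorProfile, output: &ColorProfile) -> bool {
    match (input, output) {
        (ColorProfile::Icc(a), ColorProfile::Icc(b)) => a != b,
        _ => true,
    }
}
```

Byte equality is deliberately conservative: two different byte strings that describe the same color space still go through the CMS.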

Impact on conformance test images with embedded ICC:
- cafe: 7% -> 85% of libjxl (11x faster)
- bench_oriented_brg: 26% -> 99% of libjxl (3.8x faster)
- patches_lossless: 40% -> 81% of libjxl (2x faster)

- render_noise_for_group: cast u64 RNG batch to u32 slice directly,
  eliminating per-element high/low byte branching
- Noise::strength: add #[inline(always)]
- Xorshift128Plus::fill: add #[inline(always)]
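
The slice cast mentioned for render_noise_for_group can be sketched like this. `as_u32_slice` and `split_halves` are hypothetical helpers; the order of the two halves in the cast version depends on target endianness, and whether the PR handles big-endian targets is not stated here:

```rust
/// Reinterprets a batch of u64 values as twice as many u32 values.
pub fn as_u32_slice(batch: &[u64]) -> &[u32] {
    // SAFETY: u64 is at least as strictly aligned as u32, every bit pattern is
    // a valid u32, and the new length covers exactly the same bytes.
    unsafe { std::slice::from_raw_parts(batch.as_ptr().cast::<u32>(), batch.len() * 2) }
}

/// Reference version that extracts the low and high half of each element.
pub fn split_halves(batch: &[u64]) -> Vec<u32> {
    let mut out = Vec::with_capacity(batch.len() * 2);
    for &v in batch {
        out.push(v as u32); // low 32 bits
        out.push((v >> 32) as u32); // high 32 bits
    }
    out
}
```

On a little-endian target both functions yield the same sequence, but the cast version does no per-element work.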
@hjanuschka hjanuschka force-pushed the pr/decode-perf-optimizations branch from 6f6705b to b6dfc26 Compare March 14, 2026 17:23
Modular decoding hot path optimizations:
- Interior fast path: skip edge checks for WP predictor in inner loop (y>=2, 0<x<xsize-1)
- Raw pointer access for error buffers and prediction in predict_interior
- Unrolled weighted_average and update_errors (direct writes, no intermediate arrays)
- Precomputed WP cur_row/prev_row per row (eliminates y&1 branch per pixel)
- Const generic COMPUTE_PROPERTY for dead code elimination
- u32 leading_zeros in error_weight/weighted_average (avoids u64 zero-extend)
- Branchless error_weight (clamp shift to 0 instead of if/else)
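
The interior fast path can be illustrated with a much simpler predictor than WP. This sketch uses a clamped gradient over the W, N and NW neighbours, so its interior condition is x >= 1 and y >= 1 rather than the y>=2, 0<x<xsize-1 range listed above; all function names are hypothetical:

```rust
#[inline(always)]
fn clamped_gradient(w: i32, n: i32, nw: i32) -> i32 {
    (w + n - nw).clamp(w.min(n), w.max(n))
}

/// General path: checks on every call which neighbours exist.
pub fn predict_edge(img: &[i32], width: usize, x: usize, y: usize) -> i32 {
    let at = |xx: usize, yy: usize| img[yy * width + xx];
    let w = if x > 0 {
        at(x - 1, y)
    } else if y > 0 {
        at(x, y - 1)
    } else {
        0
    };
    let n = if y > 0 { at(x, y - 1) } else { w };
    let nw = if x > 0 && y > 0 { at(x - 1, y - 1) } else { w };
    clamped_gradient(w, n, nw)
}

/// Fast path: the caller guarantees x >= 1 and that `prev` is the row above.
#[inline(always)]
fn predict_interior(cur: &[i32], prev: &[i32], x: usize) -> i32 {
    clamped_gradient(cur[x - 1], prev[x], prev[x - 1])
}

/// Predicts every pixel, using the edge path only on the first row and column.
pub fn predict_image(img: &[i32], width: usize, height: usize) -> Vec<i32> {
    assert_eq!(img.len(), width * height);
    let mut out = vec![0; img.len()];
    for y in 0..height {
        let start = y * width;
        if y == 0 {
            for x in 0..width {
                out[x] = predict_edge(img, width, x, 0);
            }
            continue;
        }
        let prev = &img[start - width..start];
        let cur = &img[start..start + width];
        let out_row = &mut out[start..start + width];
        if width > 0 {
            out_row[0] = predict_edge(img, width, 0, y);
        }
        for x in 1..width {
            out_row[x] = predict_interior(cur, prev, x);
        }
    }
    out
}
```

The inner loop contains no neighbour-existence branches, which is the property the interior fast path above is after.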

Specialized tree variants:
- NoWpTreeNoLz77/GeneralTreeNoLz77: compile-time LZ77 state elimination
- WpOnlyLookupConfig420: i32 clamp + unchecked LUT access
- decode_one_interior for all Config420 variants
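
Compile-time elimination of LZ77 handling can be sketched with a const generic flag. The symbol format here is invented for illustration (`MIN_COPY_SYMBOL` and the run-length meaning of a copy symbol are assumptions, not the JPEG XL bitstream), and `decode` is not the NoWpTreeNoLz77 implementation:

```rust
/// Hypothetical threshold: symbols at or above it encode a repeat length.
const MIN_COPY_SYMBOL: u32 = 256;

/// Decodes a symbol stream. With `HAS_LZ77 = false` the copy branch is dead
/// code in that instantiation, so the optimizer can remove it entirely.
pub fn decode<const HAS_LZ77: bool>(symbols: &[u32]) -> Vec<u32> {
    let mut out: Vec<u32> = Vec::with_capacity(symbols.len());
    for &s in symbols {
        if HAS_LZ77 && s >= MIN_COPY_SYMBOL {
            // Repeat the previous output value `len` times (toy copy rule).
            let len = (s - MIN_COPY_SYMBOL) as usize + 1;
            let last = out.last().copied().unwrap_or(0);
            out.extend(std::iter::repeat(last).take(len));
        } else {
            out.push(s);
        }
    }
    out
}
```

The caller selects `decode::<true>` or `decode::<false>` once, based on whether the stream uses LZ77, so the per-symbol check disappears from the non-LZ77 instantiation.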

Entropy coding:
- read_signed_clustered_inline for hot paths

Palette transform:
- No-deltas fast path with unchecked lookup
- Interior fast path avoiding get_with_neighbors

Render pipeline:
- StackVec replacing SmallVec in Channels/ChannelsMut/run_stage
- Skip transform_buffer zeroing (dequant overwrites everything)
- Reusable pre-allocated blending tmp buffer
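
As a rough idea of what a StackVec-style container looks like, here is a fixed-capacity, array-backed vector. This is a simplified sketch restricted to `Copy + Default` element types; the StackVec in this PR holds row slices and is presumably implemented differently:

```rust
/// Hypothetical fixed-capacity vector that lives entirely on the stack.
pub struct StackVec<T: Copy + Default, const N: usize> {
    items: [T; N],
    len: usize,
}

impl<T: Copy + Default, const N: usize> StackVec<T, N> {
    pub fn new() -> Self {
        Self { items: [T::default(); N], len: 0 }
    }

    /// Appends a value; panics if the fixed capacity `N` is exceeded.
    #[inline(always)]
    pub fn push(&mut self, value: T) {
        assert!(self.len < N, "StackVec capacity exceeded");
        self.items[self.len] = value;
        self.len += 1;
    }

    /// Returns the initialized prefix as a slice.
    #[inline(always)]
    pub fn as_slice(&self) -> &[T] {
        &self.items[..self.len]
    }
}
```

Unlike SmallVec, there is no heap fallback and therefore no inline-versus-spilled check on access; the trade-off is a hard capacity limit.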

Image buffers:
- row_ptr/row_ptr_mut raw pointer accessors for hot loops

Results (PGO + AVX-512, AMD EPYC 9645, 8-image benchmark):
- Baseline: 11.82 MP/s -> Best: 17.36 MP/s (+46.9%)
- Median: ~16.7 MP/s (+41%)
- 7/8 images at or above libjxl 0.12 parity
- bicycles: 0.85x (codegen-limited, 1.58x instruction gap vs GCC)
@hjanuschka hjanuschka force-pushed the pr/decode-perf-optimizations branch from 87ae20e to 0117e40 Compare March 15, 2026 06:58
github-actions Bot commented Mar 15, 2026


Benchmark @ 20aa006

MULTI-FILE BENCHMARK RESULTS (8 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.52). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: 159c60b9 (Base) vs 5b2efe78 (PR)

| File | Base (MP/s) | PR (MP/s) | Δ% |
|---|---|---|---|
| bicycles.jxl | 7.159 | 9.639 | +34.66% ±2.7% |
| bike.jxl | 24.307 | 26.300 | +8.20% ±2.7% |
| delta_palette.jxl | 6.146 | 7.477 | +21.65% ±0.9% |
| green_queen_modular_e3.jxl | 7.830 | 7.810 | -0.26% ±0.2% |
| green_queen_vardct_e3.jxl | 23.991 | 26.215 | +9.27% ±0.7% |
| lz77_flower.jxl | 3.337 | 3.827 | +14.68% ±1.5% |
| patches_lossless.jxl | 3.213 | 3.313 | +3.11% ±0.3% |
| sunset_logo.jxl | 2.783 | 4.663 | +67.55% ±0.8% |

@hjanuschka

Collaborator Author

| Image | libjxl (MP/s) | jxl-rs (MP/s) | Ratio |
|---|---|---|---|
| sunset_logo (modular) | 6.31 | 7.05 | 1.12x |
| bike (VarDCT) | 39.48 | 36.69 | 0.93x |
| green_queen_modular | 8.99 | 11.42 | 1.27x |
| green_queen_vardct | 35.83 | 33.56 | 0.94x |
| bicycles (modular+WP) | 14.91 | 12.05 | 0.81x |
| delta_palette | 10.60 | 11.15 | 1.05x |
| lz77_flower | 3.52 | 4.38 | 1.24x |
| patches_lossless | 13.81 | 14.30 | 1.04x |
| TOTAL | 15.74 | 16.35 | 1.04x |
