perf: comprehensive single-threaded decode performance optimizations (+46%) by hjanuschka · Pull Request #705 · libjxl/jxl-rs

hjanuschka · 2026-03-14T15:04:44Z

Summary

Bring jxl-rs single-threaded decoding performance on par with libjxl

Key Optimizations

Bounds check elimination

Unchecked array access in predict_flat, entropy decoding, Huffman table lookup, and weighted predictor hot paths
Split predict_flat into no_wp/with_wp variants to eliminate Option overhead in compute_properties
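
The no_wp/with_wp split can be pictured with a toy predictor. This is a minimal sketch under assumptions: `WpState`, the `bias` field and all four functions are hypothetical and are not taken from jxl-rs. It only shows the shape of the change, where the `Option` is resolved once per row rather than inside the per-pixel loop.

```rust
/// Hypothetical stand-in for the weighted-predictor state (not the jxl-rs type).
struct WpState {
    bias: i32,
}

/// Before: one function, and the `Option` is matched for every pixel.
fn predict_row_generic(row: &[i32], wp: Option<&WpState>, out: &mut [i32]) {
    for (o, &v) in out.iter_mut().zip(row) {
        *o = match wp {
            Some(state) => v.wrapping_add(state.bias),
            None => v,
        };
    }
}

/// After: variant that never touches weighted-predictor state.
fn predict_row_no_wp(row: &[i32], out: &mut [i32]) {
    for (o, &v) in out.iter_mut().zip(row) {
        *o = v;
    }
}

/// After: variant that always has weighted-predictor state available.
fn predict_row_with_wp(row: &[i32], wp: &WpState, out: &mut [i32]) {
    for (o, &v) in out.iter_mut().zip(row) {
        *o = v.wrapping_add(wp.bias);
    }
}

/// The `Option` is resolved once per row, outside the per-pixel loop.
fn predict_row(row: &[i32], wp: Option<&WpState>, out: &mut [i32]) {
    match wp {
        Some(state) => predict_row_with_wp(row, state, out),
        None => predict_row_no_wp(row, out),
    }
}
```

In a loop this small the optimizer may hoist the match on its own; the explicit split matters more when the per-pixel body is large and the compiler is less likely to duplicate it.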

Inlining strategy

Inline entropy reads (read_signed_clustered_inline) matching libjxl's inlining for NoWpTree/GeneralTree decoders
Comprehensive #[inline(always)] audit across the entire hot path: decode, predict, entropy, blending, render pipeline, SmallVec, Channels, image buffers, utility functions

BitReader optimization

Extend section buffers with 8 zero-padding bytes so refill() always takes the fast path, eliminating refill_slow overhead (was ~4.5% of modular decode)
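
To illustrate why 8 bytes of zero padding let refill() use a single path, here is a simplified LSB-first bit reader. It is a sketch, not the jxl-rs BitReader: `PaddedBitReader`, its fields and the `min` clamp are assumptions made for the example. With the padding in place, an 8-byte load starting anywhere inside the payload stays inside the buffer, so no byte-by-byte slow path is needed.

```rust
/// Zero bytes appended so an 8-byte load never runs past the buffer.
const PADDING: usize = 8;

/// Hypothetical, simplified LSB-first bit reader (not the jxl-rs type).
struct PaddedBitReader {
    data: Vec<u8>,    // payload followed by PADDING zero bytes
    byte_pos: usize,  // next byte that has not been fully loaded
    bit_buf: u64,     // buffered bits, lowest bit is the next one to read
    bits_in_buf: u32, // number of valid bits in `bit_buf`, always <= 63
}

impl PaddedBitReader {
    fn new(payload: &[u8]) -> Self {
        let mut data = Vec::with_capacity(payload.len() + PADDING);
        data.extend_from_slice(payload);
        data.extend_from_slice(&[0u8; PADDING]);
        Self { data, byte_pos: 0, bit_buf: 0, bits_in_buf: 0 }
    }

    /// Single code path: one 8-byte little-endian load, no tail handling.
    fn refill(&mut self) {
        // Past the payload everything is zero, so clamping keeps the load in bounds.
        let pos = self.byte_pos.min(self.data.len() - PADDING);
        let mut bytes = [0u8; 8];
        bytes.copy_from_slice(&self.data[pos..pos + 8]);
        let chunk = u64::from_le_bytes(bytes);
        // Bits above `bits_in_buf` are overwritten with identical values later.
        self.bit_buf |= chunk << self.bits_in_buf;
        let whole_bytes = (63 - self.bits_in_buf) >> 3;
        self.byte_pos += whole_bytes as usize;
        self.bits_in_buf += whole_bytes * 8;
    }

    /// Reads `n` bits (n <= 56), least significant bit first.
    fn read(&mut self, n: u32) -> u64 {
        debug_assert!(n <= 56);
        self.refill(); // afterwards at least 56 bits are buffered
        let value = self.bit_buf & ((1u64 << n) - 1);
        self.bit_buf >>= n;
        self.bits_in_buf -= n;
        value
    }
}
```

Reads past the end of the payload return zero bits in this sketch; a real decoder would still need to detect and report such over-reads separately.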

Allocation elimination

Pre-allocated tmp buffer + stack-allocated slice arrays in blending (eliminate per-row heap allocation)
Hoist property_buffer allocation outside the row loop in modular decode
Optimize get_distinct_indices with raw pointers (eliminate Option overhead in render pipeline)

Release profile

Thin LTO, panic=abort, overflow-checks=false
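
The property_buffer hoisting listed under Allocation elimination follows a common pattern, sketched below. The function names, the `process_row` helper and the row representation are all hypothetical; only the idea of allocating the scratch buffer once and resetting it per row comes from the description above.

```rust
/// Hypothetical per-row work: fills `props` from the row and returns a checksum.
fn process_row(row: &[i32], props: &mut [i32]) -> i64 {
    for (i, p) in props.iter_mut().enumerate() {
        *p = row.get(i).copied().unwrap_or(0);
    }
    props.iter().map(|&p| i64::from(p)).sum()
}

/// Before: a fresh heap allocation for every row.
fn decode_alloc_per_row(rows: &[Vec<i32>], num_props: usize) -> i64 {
    let mut total = 0;
    for row in rows {
        let mut property_buffer = vec![0i32; num_props];
        total += process_row(row, &mut property_buffer);
    }
    total
}

/// After: one allocation, reset in place before each row.
fn decode_hoisted(rows: &[Vec<i32>], num_props: usize) -> i64 {
    let mut property_buffer = vec![0i32; num_props];
    let mut total = 0;
    for row in rows {
        property_buffer.fill(0);
        total += process_row(row, &mut property_buffer);
    }
    total
}
```

Both functions return the same result; the second simply moves the allocator call out of the loop.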

…(+46%)

Bring jxl-rs single-threaded decoding performance on par with libjxl 0.12.

Measured on AMD EPYC 9645 (AVX-512), single-threaded:
- sunset_logo (modular): 3.6 -> 6.2 MP/s (58% -> 105% of libjxl)
- bike (VarDCT): 33.6 -> 39.0 MP/s (87% -> 100% of libjxl)
- green_queen modular: 9.1 -> 10.9 MP/s (100% -> 119% of libjxl)
- green_queen VarDCT: 29.1 -> 38.1 MP/s (95% -> 118% of libjxl)
- Total: 12.9 -> 19.0 MP/s (+47%)

Key optimizations:
- Eliminate bounds checks in hot paths (predict_flat, entropy decoding, Huffman table lookup, weighted predictor) using unchecked array access
- Split predict_flat into no_wp/with_wp variants to eliminate Option overhead in compute_properties
- Inline entropy reads (read_signed_clustered_inline) matching libjxl's inlining strategy for NoWpTree/GeneralTree decoders
- BitReader buffer padding: extend section buffers with 8 zero bytes so refill() always takes the fast path, eliminating refill_slow overhead
- Comprehensive #[inline(always)] audit across entire hot path: decode, predict, entropy, blending, render pipeline, SmallVec, Channels, image buffers, utility functions
- Eliminate per-row heap allocations in blending (pre-allocated tmp buffer + stack-allocated slice arrays) and modular decode (hoist property_buffer allocation outside row loop)
- Optimize get_distinct_indices with raw pointers (eliminate Option overhead in render pipeline)
- Release profile: thin LTO, panic=abort, overflow-checks=false

No API changes. No new dependencies. All tests pass.

veluca93 · 2026-03-14T15:19:29Z

This PR has a lot of changes with a lot of unsafe code.

Let's split this into many PRs that are as small as possible, each implementing an individual, independent optimization (especially for optimizations that rely on unsafe code).

hjanuschka · 2026-03-14T15:25:48Z

This PR is a draft. I'll try autoresearch to figure out how far it can push it (I just wanted to kick off the benchmark CI, but it failed anyway), and once it has plateaued, I'll cherry-pick ideas and finalize them as standalone CLs.

Add #[inline(always)] to conformance-set hot paths discovered via profiling additional benchmark images:
- Palette transform: get_palette_value, get_prediction_data, do_palette_step_one_group, do_palette_step_general, do_palette_step_group_row
- Prediction: PredictionData::get, get_rows, get_with_neighbors
- Squeeze: do_hsqueeze_step, do_vsqueeze_step
- LZ77/RLE: apply_copy, pull_symbol, push_decoded_symbol, push_token
- Patches: add_one_row, set_patches_for_row

Improves delta_palette by ~10% and bicycles by ~5-7%.

The LZ77 window is accessed via copy_pos & WINDOW_MASK which is always within bounds. Use get_unchecked/get_unchecked_mut to eliminate bounds checks in the hot copy loop. Improves lz77_flower by ~5% and bicycles by ~13%.
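
The masking argument can be made concrete with a small ring buffer. This is a sketch under assumptions: `Lz77Window`, its methods and the window size of 4096 are illustrative and are not the jxl-rs definitions. The safety of `get_unchecked` rests on two facts: the buffer length equals `WINDOW_SIZE`, and `WINDOW_SIZE` is a power of two, so any index masked with `WINDOW_MASK` is in range.

```rust
/// Illustrative size only; it must be a power of two for the mask to work.
const WINDOW_SIZE: usize = 1 << 12;
const WINDOW_MASK: usize = WINDOW_SIZE - 1;

/// Hypothetical ring-buffer window (not the jxl-rs type).
struct Lz77Window {
    buf: Vec<u32>, // invariant: buf.len() == WINDOW_SIZE
}

impl Lz77Window {
    fn new() -> Self {
        Self { buf: vec![0; WINDOW_SIZE] }
    }

    fn push(&mut self, pos: usize, value: u32) {
        self.buf[pos & WINDOW_MASK] = value;
    }

    fn get(&self, pos: usize) -> u32 {
        self.buf[pos & WINDOW_MASK]
    }

    /// Copies `len` symbols from `distance` positions back; returns the new position.
    /// Overlapping copies (distance < len) repeat the pattern, as LZ77 requires.
    fn apply_copy(&mut self, pos: usize, distance: usize, len: usize) -> usize {
        debug_assert_eq!(self.buf.len(), WINDOW_SIZE);
        let mut src = pos.wrapping_sub(distance);
        let mut dst = pos;
        for _ in 0..len {
            // SAFETY: any value masked with WINDOW_MASK is < WINDOW_SIZE,
            // and `buf.len() == WINDOW_SIZE` is an invariant of this type.
            unsafe {
                let value = *self.buf.get_unchecked(src & WINDOW_MASK);
                *self.buf.get_unchecked_mut(dst & WINDOW_MASK) = value;
            }
            src = src.wrapping_add(1);
            dst = dst.wrapping_add(1);
        }
        dst
    }
}
```

The unchecked access is only sound while the length invariant holds, which is why `buf` is never resized after construction in this sketch.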

Update LZ77 pull_symbol/push_decoded_symbol with unchecked window access and inline palette scale() helper.

Add #[inline(always)] to Noise::strength (called per-pixel for noise synthesis) and Xorshift128Plus::fill (PRNG for noise generation). These were 7.8% and 2.9% respectively in the noise conformance image.

When the input and output color profiles are both ICC and have identical bytes, skip the CMS transform entirely. This is a no-op identity transform that lcms2 was performing at great cost (pow() calls for tone curve evaluation).

This commonly happens for non-XYB images with embedded ICC profiles where no user output profile is set -- the default output is the embedded profile itself.

Impact on conformance test images with embedded ICC:
- cafe: 7% -> 85% of libjxl (11x faster)
- bench_oriented_brg: 26% -> 99% of libjxl (3.8x faster)
- patches_lossless: 40% -> 81% of libjxl (2x faster)
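
The identity check described in this commit message amounts to a byte comparison before any CMS is invoked. A minimal sketch follows; `ColorProfile`, its variants and `needs_cms_transform` are hypothetical names for illustration, and the choice to always transform in the non-ICC cases is an assumption, since the commit message only describes the ICC-to-ICC case.

```rust
/// Hypothetical, simplified profile description (not the jxl-rs type).
enum ColorProfile {
    /// Raw ICC profile bytes.
    Icc(Vec<u8>),
    /// Placeholder for any non-ICC description.
    Encoded(String),
}

/// Returns false only when both profiles are ICC with identical bytes,
/// in which case the transform would be an identity and can be skipped.
fn needs_cms_transform(input: &ColorProfile, output: &ColorProfile) -> bool {
    match (input, output) {
        (ColorProfile::Icc(a), ColorProfile::Icc(b)) => a != b,
        // Assumption: every other combination keeps the existing transform path.
        _ => true,
    }
}
```

The comparison is linear in the profile size, which is negligible next to evaluating tone curves for every pixel.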

- render_noise_for_group: cast u64 RNG batch to u32 slice directly, eliminating per-element high/low byte branching
- Noise::strength: add #[inline(always)]
- Xorshift128Plus::fill: add #[inline(always)]
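
Viewing a u64 batch as u32 words can be done with a single pointer cast. The sketch below uses a hypothetical helper name, `as_u32_slice`, and is not the code from render_noise_for_group. Note that the order of the two halves of each u64 depends on the target's endianness, so code relying on a specific order needs to account for that.

```rust
/// Reinterprets a slice of u64 as a slice of u32 with twice the length.
/// On little-endian targets the low half of each u64 comes first.
fn as_u32_slice(batch: &[u64]) -> &[u32] {
    // SAFETY: u64 is at least as strictly aligned as u32, every bit pattern
    // is a valid u32, the byte length is unchanged (len * 2 * 4 == len * 8),
    // and the returned slice borrows `batch` for the same lifetime.
    unsafe { std::slice::from_raw_parts(batch.as_ptr().cast::<u32>(), batch.len() * 2) }
}
```

With this view, a loop can consume one u32 per element without deciding per element whether to take the high or the low half.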

Modular decoding hot path optimizations:
- Interior fast path: skip edge checks for WP predictor in inner loop (y>=2, 0<x<xsize-1)
- Raw pointer access for error buffers and prediction in predict_interior
- Unrolled weighted_average and update_errors (direct writes, no intermediate arrays)
- Precomputed WP cur_row/prev_row per row (eliminates y&1 branch per pixel)
- Const generic COMPUTE_PROPERTY for dead code elimination
- u32 leading_zeros in error_weight/weighted_average (avoids u64 zero-extend)
- Branchless error_weight (clamp shift to 0 instead of if/else)

Specialized tree variants:
- NoWpTreeNoLz77/GeneralTreeNoLz77: compile-time LZ77 state elimination
- WpOnlyLookupConfig420: i32 clamp + unchecked LUT access
- decode_one_interior for all Config420 variants

Entropy coding:
- read_signed_clustered_inline for hot paths

Palette transform:
- No-deltas fast path with unchecked lookup
- Interior fast path avoiding get_with_neighbors

Render pipeline:
- StackVec replacing SmallVec in Channels/ChannelsMut/run_stage
- Skip transform_buffer zeroing (dequant overwrites everything)
- Reusable pre-allocated blending tmp buffer

Image buffers:
- row_ptr/row_ptr_mut raw pointer accessors for hot loops

Results (PGO + AVX-512, AMD EPYC 9645, 8-image benchmark):
- Baseline: 11.82 MP/s -> Best: 17.36 MP/s (+46.9%)
- Median: ~16.7 MP/s (+41%)
- 7/8 images at or above libjxl 0.12 parity
- bicycles: 0.85x (codegen-limited, 1.58x instruction gap vs GCC)
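
The "Const generic COMPUTE_PROPERTY for dead code elimination" item relies on a general Rust pattern, sketched here with toy arithmetic. The function names, the averaging predictor and the property formula are all hypothetical; only the use of a `const` bool parameter named COMPUTE_PROPERTY comes from the commit message. Because the flag is a compile-time constant, each instantiation is compiled with the unused branch removed.

```rust
/// Hypothetical per-pixel step (not the jxl-rs function).
#[inline(always)]
fn predict_pixel<const COMPUTE_PROPERTY: bool>(left: i32, top: i32, property: &mut i32) -> i32 {
    if COMPUTE_PROPERTY {
        // Present only in the `true` instantiation.
        *property = left.wrapping_sub(top);
    }
    // Toy prediction: average of the two neighbours, computed without overflow.
    ((i64::from(left) + i64::from(top)) / 2) as i32
}

/// Predicts every pixel of `cur` except the first from its left and top neighbours.
fn predict_row<const COMPUTE_PROPERTY: bool>(
    cur: &[i32],
    prev: &[i32],
    properties: &mut [i32],
    out: &mut [i32],
) {
    let n = cur.len().min(prev.len()).min(properties.len()).min(out.len());
    for x in 1..n {
        out[x] = predict_pixel::<COMPUTE_PROPERTY>(cur[x - 1], prev[x], &mut properties[x]);
    }
}
```

A caller picks the instantiation once, for example `predict_row::<false>(...)` when no tree node inspects the property, so the inner loop carries no runtime flag.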

github-actions · 2026-03-15T08:13:11Z

Benchmark @ `20aa006`

MULTI-FILE BENCHMARK RESULTS (8 files)
  CPU architecture: x86_64
  WARNING: System appears noisy: high system load (2.52). Results may be unreliable.
Statistics:
  Confidence:               99.0%
  Max relative error:        3.0%

Comparing: 159c60b9 (Base) vs 5b2efe78 (PR)

File	Base (MP/s)	PR (MP/s)	Δ%
bicycles.jxl	7.159	9.639	+34.66% ±2.7%
bike.jxl	24.307	26.300	+8.20% ±2.7%
delta_palette.jxl	6.146	7.477	+21.65% ±0.9%
green_queen_modular_e3.jxl	7.830	7.810	-0.26% ±0.2%
green_queen_vardct_e3.jxl	23.991	26.215	+9.27% ±0.7%
lz77_flower.jxl	3.337	3.827	+14.68% ±1.5%
patches_lossless.jxl	3.213	3.313	+3.11% ±0.3%
sunset_logo.jxl	2.783	4.663	+67.55% ±0.8%

hjanuschka · 2026-03-15T20:06:07Z

Image	libjxl (MP/s)	jxl-rs (MP/s)	Ratio
sunset_logo (modular)	6.31	7.05	1.12x
bike (VarDCT)	39.48	36.69	0.93x
green_queen_modular	8.99	11.42	1.27x
green_queen_vardct	35.83	33.56	0.94x
bicycles (modular+WP)	14.91	12.05	0.81x
delta_palette	10.60	11.15	1.05x
lz77_flower	3.52	4.38	1.24x
patches_lossless	13.81	14.30	1.04x
TOTAL	15.74	16.35	1.04x

hjanuschka added 6 commits March 14, 2026 16:39

perf: unchecked LZ77 window + palette scale() inline

e4c111e

perf: inline noise strength and xorshift128plus fill

8bff1be

perf: optimize noise rendering and inline helpers

6f6705b

hjanuschka force-pushed the pr/decode-perf-optimizations branch from 6f6705b to b6dfc26 on March 14, 2026 17:23

hjanuschka force-pushed the pr/decode-perf-optimizations branch from 87ae20e to 0117e40 on March 15, 2026 06:58

hjanuschka added 2 commits March 15, 2026 08:17

Fix CI lints on PR 705

56fb1ab

Fix delta palette conformance regression

024d6e5

hjanuschka added 2 commits March 15, 2026 12:20

benchmark: run perfhistory on all test images

d12b309

benchmark: use autoresearch image set to avoid timeout

588ff87

hjanuschka added 4 commits March 20, 2026 08:58

perf: avoid extra pull_symbol call in unsigned RLE repeat path

e981527

perf: add HybridUint::read fast path for msb_in_token == 0

38907f2

perf: use direct gradient decode path in SingleGradientOnly

17bf728

style: format HybridUint fast path

20aa006

hjanuschka wants to merge 16 commits into libjxl:main from hjanuschka:pr/decode-perf-optimizations

2 participants
