perf: comprehensive single-threaded decode performance optimizations (+46%) #705
Draft
hjanuschka wants to merge 16 commits into
…(+46%)

Bring jxl-rs single-threaded decoding performance on par with libjxl 0.12.

Measured on AMD EPYC 9645 (AVX-512), single-threaded:
- sunset_logo (modular): 3.6 -> 6.2 MP/s (58% -> 105% of libjxl)
- bike (VarDCT): 33.6 -> 39.0 MP/s (87% -> 100% of libjxl)
- green_queen modular: 9.1 -> 10.9 MP/s (100% -> 119% of libjxl)
- green_queen VarDCT: 29.1 -> 38.1 MP/s (95% -> 118% of libjxl)
- Total: 12.9 -> 19.0 MP/s (+47%)

Key optimizations:
- Eliminate bounds checks in hot paths (predict_flat, entropy decoding, Huffman table lookup, weighted predictor) using unchecked array access
- Split predict_flat into no_wp/with_wp variants to eliminate Option overhead in compute_properties
- Inline entropy reads (read_signed_clustered_inline) matching libjxl's inlining strategy for NoWpTree/GeneralTree decoders
- BitReader buffer padding: extend section buffers with 8 zero bytes so refill() always takes the fast path, eliminating refill_slow overhead
- Comprehensive #[inline(always)] audit across entire hot path: decode, predict, entropy, blending, render pipeline, SmallVec, Channels, image buffers, utility functions
- Eliminate per-row heap allocations in blending (pre-allocated tmp buffer + stack-allocated slice arrays) and modular decode (hoist property_buffer allocation outside row loop)
- Optimize get_distinct_indices with raw pointers (eliminate Option overhead in render pipeline)
- Release profile: thin LTO, panic=abort, overflow-checks=false

No API changes. No new dependencies. All tests pass.
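The buffer-padding idea in the list above can be sketched as follows. This is a minimal illustration under stated assumptions, not the jxl-rs `BitReader`: the type `PaddedBitReader`, its fields, and the LSB-first bit order are all hypothetical. Because the payload is followed by 8 zero bytes, `refill` can always load a full 8-byte word and never needs a separate slow path near the end of the buffer.

```rust
/// Number of zero bytes appended after the payload.
const PADDING: usize = 8;

/// Hypothetical LSB-first bit reader, used only to illustrate padding.
pub struct PaddedBitReader {
    data: Vec<u8>,      // payload followed by PADDING zero bytes
    payload_len: usize, // length of the real payload
    pos: usize,         // next byte to load, always <= payload_len
    bit_buf: u64,       // buffered bits, least significant bit first
    bits: u32,          // number of valid bits in bit_buf
}

impl PaddedBitReader {
    pub fn new(payload: &[u8]) -> Self {
        let mut data = Vec::with_capacity(payload.len() + PADDING);
        data.extend_from_slice(payload);
        data.extend_from_slice(&[0u8; PADDING]);
        Self { data, payload_len: payload.len(), pos: 0, bit_buf: 0, bits: 0 }
    }

    /// Single refill path: since pos <= payload_len, the 8-byte load
    /// always stays inside the padded buffer.
    #[inline(always)]
    fn refill(&mut self) {
        let mut chunk = [0u8; 8];
        chunk.copy_from_slice(&self.data[self.pos..self.pos + 8]);
        self.bit_buf |= u64::from_le_bytes(chunk) << self.bits;
        // Whole bytes that fit into the 64-bit buffer.
        let consumed = ((63 - self.bits) >> 3) as usize;
        // Clamp so that reads past the end keep returning zero bits.
        self.pos = (self.pos + consumed).min(self.payload_len);
        self.bits |= 56; // equals bits + 8 * consumed
    }

    /// Reads `n` bits (n <= 56), least significant bit first.
    pub fn read(&mut self, n: u32) -> u64 {
        debug_assert!(n <= 56);
        self.refill();
        let value = self.bit_buf & ((1u64 << n) - 1);
        self.bit_buf >>= n;
        self.bits -= n;
        value
    }
}
```

In this sketch, reading past the end of the payload yields zero bits instead of branching into a byte-by-byte fallback. A real decoder would still need to detect over-reads separately.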
Member
This PR has a lot of changes with a lot of unsafe code. Let's split it into many PRs that are as small as possible, each implementing an individual, independent optimization (especially for optimizations that rely on unsafe code).
Collaborator
Author
This PR is a draft. I'll try autoresearch to figure out how far it can push things (I just wanted to kick off the benchmark CI, but it failed anyway). Once it has plateaued, I'll cherry-pick ideas and finalize them as standalone CLs.
Add #[inline(always)] to conformance-set hot paths discovered via profiling additional benchmark images:
- Palette transform: get_palette_value, get_prediction_data, do_palette_step_one_group, do_palette_step_general, do_palette_step_group_row
- Prediction: PredictionData::get, get_rows, get_with_neighbors
- Squeeze: do_hsqueeze_step, do_vsqueeze_step
- LZ77/RLE: apply_copy, pull_symbol, push_decoded_symbol, push_token
- Patches: add_one_row, set_patches_for_row

Improves delta_palette by ~10% and bicycles by ~5-7%.
The LZ77 window is accessed via copy_pos & WINDOW_MASK which is always within bounds. Use get_unchecked/get_unchecked_mut to eliminate bounds checks in the hot copy loop. Improves lz77_flower by ~5% and bicycles by ~13%.
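A minimal sketch of the masked-index argument, assuming a power-of-two window. The `Window` type, the `WINDOW_SIZE` value, and the method names are hypothetical and do not mirror the jxl-rs implementation. The key point is that `index & WINDOW_MASK` is always smaller than the buffer length, which is what justifies the unchecked access.

```rust
/// Assumed window size; must be a power of two for the mask to work.
const WINDOW_SIZE: usize = 1 << 20;
const WINDOW_MASK: usize = WINDOW_SIZE - 1;

/// Hypothetical LZ77 history window.
pub struct Window {
    buf: Box<[u32]>, // invariant: buf.len() == WINDOW_SIZE
    pos: usize,      // total number of symbols pushed so far
}

impl Window {
    pub fn new() -> Self {
        Self { buf: vec![0u32; WINDOW_SIZE].into_boxed_slice(), pos: 0 }
    }

    #[inline(always)]
    pub fn push(&mut self, sym: u32) {
        let idx = self.pos & WINDOW_MASK;
        // SAFETY: idx <= WINDOW_MASK < WINDOW_SIZE == self.buf.len().
        unsafe { *self.buf.get_unchecked_mut(idx) = sym };
        self.pos = self.pos.wrapping_add(1);
    }

    /// Copies `len` symbols starting `distance` symbols back, appending
    /// each one to both `out` and the window (overlapping copies allowed).
    pub fn copy(&mut self, distance: usize, len: usize, out: &mut Vec<u32>) {
        let mut copy_pos = self.pos.wrapping_sub(distance);
        for _ in 0..len {
            // SAFETY: the masked index is always < self.buf.len().
            let sym = unsafe { *self.buf.get_unchecked(copy_pos & WINDOW_MASK) };
            out.push(sym);
            self.push(sym);
            copy_pos = copy_pos.wrapping_add(1);
        }
    }
}
```

The safety argument depends entirely on the buffer length being fixed at `WINDOW_SIZE`; keeping `buf` private and constructing it only in `new` is what upholds that invariant here.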
Update LZ77 pull_symbol/push_decoded_symbol with unchecked window access and inline palette scale() helper.
Add #[inline(always)] to Noise::strength (called per-pixel for noise synthesis) and Xorshift128Plus::fill (PRNG for noise generation). These were 7.8% and 2.9% respectively in the noise conformance image.
When the input and output color profiles are both ICC and have identical bytes, skip the CMS transform entirely. This is a no-op identity transform that lcms2 was performing at great cost (pow() calls for tone curve evaluation).

This commonly happens for non-XYB images with embedded ICC profiles where no user output profile is set -- the default output is the embedded profile itself.

Impact on conformance test images with embedded ICC:
- cafe: 7% -> 85% of libjxl (11x faster)
- bench_oriented_brg: 26% -> 99% of libjxl (3.8x faster)
- patches_lossless: 40% -> 81% of libjxl (2x faster)
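The check described above amounts to a byte comparison before any CMS work is set up. The sketch below assumes a simplified `Profile` enum and a helper named `needs_cms_transform`; both are hypothetical and stand in for whatever profile representation the decoder actually uses.

```rust
/// Hypothetical, simplified color profile representation.
pub enum Profile {
    /// Raw ICC profile bytes.
    Icc(Vec<u8>),
    /// Any non-ICC description (for example an enumerated color space).
    Other,
}

/// Returns false only when both profiles are ICC with identical bytes,
/// in which case the transform would be an identity and can be skipped.
pub fn needs_cms_transform(input: &Profile, output: &Profile) -> bool {
    match (input, output) {
        (Profile::Icc(a), Profile::Icc(b)) => a != b,
        // Conservatively keep the transform for every other combination.
        _ => true,
    }
}
```

Comparing bytes is deliberately conservative: two profiles that describe the same color space with different encodings would still go through the CMS in this sketch.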
- render_noise_for_group: cast u64 RNG batch to u32 slice directly, eliminating per-element high/low byte branching
- Noise::strength: add #[inline(always)]
- Xorshift128Plus::fill: add #[inline(always)]
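One way to view a batch of `u64` values as twice as many `u32` values without per-element branching is a slice reinterpretation. The helper name `as_u32_halves` is hypothetical, and this is only a sketch of the general technique, not the code in `render_noise_for_group`. Note that the order of the two halves within each `u64` depends on the target's endianness.

```rust
/// Reinterprets a slice of u64 values as a slice of u32 halves.
/// On little-endian targets the low half of each u64 comes first.
pub fn as_u32_halves(batch: &[u64]) -> &[u32] {
    // SAFETY: u64 is at least as strictly aligned as u32, every bit
    // pattern is a valid u32, the byte length is unchanged, and the
    // returned slice borrows from `batch` for the same lifetime.
    unsafe { std::slice::from_raw_parts(batch.as_ptr().cast::<u32>(), batch.len() * 2) }
}
```

Code that depends on a specific half order would have to account for big-endian targets explicitly.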
6f6705b to b6dfc26
Modular decoding hot path optimizations:
- Interior fast path: skip edge checks for WP predictor in inner loop (y>=2, 0<x<xsize-1)
- Raw pointer access for error buffers and prediction in predict_interior
- Unrolled weighted_average and update_errors (direct writes, no intermediate arrays)
- Precomputed WP cur_row/prev_row per row (eliminates y&1 branch per pixel)
- Const generic COMPUTE_PROPERTY for dead code elimination
- u32 leading_zeros in error_weight/weighted_average (avoids u64 zero-extend)
- Branchless error_weight (clamp shift to 0 instead of if/else)

Specialized tree variants:
- NoWpTreeNoLz77/GeneralTreeNoLz77: compile-time LZ77 state elimination
- WpOnlyLookupConfig420: i32 clamp + unchecked LUT access
- decode_one_interior for all Config420 variants

Entropy coding:
- read_signed_clustered_inline for hot paths

Palette transform:
- No-deltas fast path with unchecked lookup
- Interior fast path avoiding get_with_neighbors

Render pipeline:
- StackVec replacing SmallVec in Channels/ChannelsMut/run_stage
- Skip transform_buffer zeroing (dequant overwrites everything)
- Reusable pre-allocated blending tmp buffer

Image buffers:
- row_ptr/row_ptr_mut raw pointer accessors for hot loops

Results (PGO + AVX-512, AMD EPYC 9645, 8-image benchmark):
- Baseline: 11.82 MP/s -> Best: 17.36 MP/s (+46.9%)
- Median: ~16.7 MP/s (+41%)
- 7/8 images at or above libjxl 0.12 parity
- bicycles: 0.85x (codegen-limited, 1.58x instruction gap vs GCC)
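The const-generic item in the list above can be illustrated with a small sketch. Only the parameter name `COMPUTE_PROPERTY` comes from the commit message; the function `predict_gradient`, its signature, and the particular property it computes are assumptions made for illustration. Because the flag is a compile-time constant, each instantiation is compiled separately and the unused branch can be removed entirely.

```rust
/// Clamped-gradient prediction with an optional extra property.
/// Returns (prediction, property); the property is 0 when not requested.
#[inline(always)]
pub fn predict_gradient<const COMPUTE_PROPERTY: bool>(
    left: i32,
    top: i32,
    topleft: i32,
) -> (i32, i32) {
    let gradient = left.wrapping_add(top).wrapping_sub(topleft);
    let prediction = gradient.clamp(left.min(top), left.max(top));
    // COMPUTE_PROPERTY is known at compile time, so the `false`
    // instantiation contains no code for this computation.
    let property = if COMPUTE_PROPERTY {
        left.wrapping_sub(topleft) // hypothetical property
    } else {
        0
    };
    (prediction, property)
}
```

A caller picks the variant once, outside the pixel loop, for example `predict_gradient::<true>(l, t, tl)` when the tree needs the property and `predict_gradient::<false>(l, t, tl)` otherwise.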
87ae20e to 0117e40
Benchmark @ 20aa006
Comparing: 159c60b9 (Base) vs 5b2efe78 (PR)
This was referenced Mar 16, 2026
Summary
Bring jxl-rs single-threaded decoding performance on par with libjxl

Key Optimizations

Bounds check elimination
- Eliminate bounds checks in predict_flat, entropy decoding, Huffman table lookup, and weighted predictor hot paths
- Split predict_flat into no_wp/with_wp variants to eliminate Option overhead in compute_properties

Inlining strategy
- Inline entropy reads (read_signed_clustered_inline) matching libjxl's inlining for NoWpTree/GeneralTree decoders
- Comprehensive #[inline(always)] audit across the entire hot path: decode, predict, entropy, blending, render pipeline, SmallVec, Channels, image buffers, utility functions

BitReader optimization
- Extend section buffers with 8 zero bytes so refill() always takes the fast path, eliminating refill_slow overhead (was ~4.5% of modular decode)

Allocation elimination
- Hoist property_buffer allocation outside the row loop in modular decode
- Optimize get_distinct_indices with raw pointers (eliminate Option overhead in render pipeline)

Release profile: thin LTO, panic=abort, overflow-checks=false
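The allocation-hoisting item can be sketched as below. Everything except the name `property_buffer` is hypothetical: `NUM_PROPERTIES`, `fill_properties`, and `decode_rows` are placeholders that only show the shape of the change, namely one allocation before the row loop that is overwritten for every pixel instead of a fresh `Vec` per row.

```rust
/// Hypothetical number of per-pixel properties.
const NUM_PROPERTIES: usize = 4;

/// Placeholder for the real property computation; overwrites every slot.
fn fill_properties(props: &mut [i32], x: usize, y: usize) {
    props[0] = y as i32;
    props[1] = x as i32;
    props[2] = (x + y) as i32;
    props[3] = x as i32 - y as i32;
}

/// Returns one accumulated value per row, standing in for decoded output.
pub fn decode_rows(width: usize, height: usize) -> Vec<i32> {
    // Allocated once, outside the row loop, and reused for every pixel.
    let mut property_buffer = vec![0i32; NUM_PROPERTIES];
    let mut rows = Vec::with_capacity(height);
    for y in 0..height {
        let mut acc = 0i32;
        for x in 0..width {
            fill_properties(&mut property_buffer, x, y);
            acc += property_buffer.iter().sum::<i32>();
        }
        rows.push(acc);
    }
    rows
}
```

Reuse is safe in this sketch because `fill_properties` writes every slot before the buffer is read, so no stale values from the previous pixel can leak through.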