WIP: Interleave modular transform processing #796

Draft
hjanuschka wants to merge 6 commits into libjxl:main from hjanuschka:experiment/issue-782-interleave-modular

@hjanuschka hjanuschka commented Jun 8, 2026

Collaborator

WIP follow-up to #795.

This tries the deeper path for #782: for safe gridded Modular frames, decode a group and immediately run any modular transforms that became ready, instead of decoding all groups before transform processing. This lets large squeeze images free intermediates during decode.
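
To illustrate why running transforms as soon as they become ready lowers peak memory, here is a minimal toy model. It is a sketch under stated assumptions, not the jxl-rs implementation: `Step`, `peak_live_buffers`, and `run_ready` are hypothetical names, each group is assumed to allocate exactly one buffer, every input index is assumed to be below `num_groups`, and transform outputs are assumed to go straight to the render pipeline rather than being held.

```rust
/// A transform step that reads a fixed set of decoded group buffers.
/// (Hypothetical type for illustration only.)
pub struct Step {
    pub inputs: Vec<usize>, // indices of the groups this step reads
}

/// Runs every step whose inputs are all decoded, and frees each group
/// buffer once its last reader has run.
fn run_ready(
    steps: &[Step],
    pending: &[usize],
    done: &mut [bool],
    remaining_uses: &mut [usize],
    live: &mut usize,
) {
    for (i, step) in steps.iter().enumerate() {
        if done[i] || pending[i] != 0 {
            continue;
        }
        done[i] = true;
        for &g in &step.inputs {
            remaining_uses[g] -= 1;
            if remaining_uses[g] == 0 {
                *live -= 1; // last reader finished: buffer can be freed
            }
        }
    }
}

/// Returns the peak number of group buffers alive at once.
/// `interleave = false` models "decode all groups, then transform";
/// `interleave = true` runs each step as soon as its inputs exist.
pub fn peak_live_buffers(num_groups: usize, steps: &[Step], interleave: bool) -> usize {
    // How many step reads are still outstanding for each group buffer.
    let mut remaining_uses = vec![0usize; num_groups];
    for step in steps {
        for &g in &step.inputs {
            remaining_uses[g] += 1;
        }
    }
    // How many inputs of each step are still undecoded.
    let mut pending: Vec<usize> = steps.iter().map(|s| s.inputs.len()).collect();
    let mut done = vec![false; steps.len()];
    let (mut live, mut peak) = (0usize, 0usize);

    for g in 0..num_groups {
        live += 1; // decoding group g allocates its buffer
        peak = peak.max(live);
        if remaining_uses[g] == 0 {
            live -= 1; // nobody reads this buffer, so it can go immediately
        }
        for (i, step) in steps.iter().enumerate() {
            pending[i] -= step.inputs.iter().filter(|&&x| x == g).count();
        }
        if interleave {
            run_ready(steps, &pending, &mut done, &mut remaining_uses, &mut live);
        }
    }
    // Batched mode runs everything here, after all groups are decoded.
    run_ready(steps, &pending, &mut done, &mut remaining_uses, &mut live);
    peak
}
```

In this model, a chain of steps that each read two neighboring groups keeps every group buffer alive in batched mode, while the interleaved mode only ever holds the two buffers the current step needs.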

The optimization only applies when all of the following hold: the decode is a straight, non-flush Modular decode; no flush has happened yet; outputs are gridded; and the frame is sufficiently tiled. Flush/progressive and small-frame paths keep the existing batched behavior.
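
The gating conditions can be pictured as a single predicate. The following is only a sketch: `DecodeState`, its field names, `can_interleave`, and the `MIN_GROUPS_FOR_INTERLEAVE` threshold are all assumptions made for illustration, and the actual threshold for "sufficiently tiled" is not stated in this PR.

```rust
/// Hypothetical summary of the decoder state; field names are
/// illustrative and not taken from the jxl-rs code base.
#[derive(Clone, Copy, Debug)]
pub struct DecodeState {
    pub is_modular: bool,
    /// This decode call is a flush / progressive render.
    pub flush_requested: bool,
    /// A flush already happened earlier for this frame.
    pub flushed_before: bool,
    pub outputs_gridded: bool,
    pub num_groups: usize,
}

/// Assumed cut-off for "sufficiently tiled"; the real value may differ.
pub const MIN_GROUPS_FOR_INTERLEAVE: usize = 4;

/// True only when every gating condition holds; otherwise the caller
/// would fall back to the existing batched transform path.
pub fn can_interleave(s: &DecodeState) -> bool {
    s.is_modular
        && !s.flush_requested
        && !s.flushed_before
        && s.outputs_gridded
        && s.num_groups >= MIN_GROUPS_FOR_INTERLEAVE
}
```

Keeping the check in one place makes it easy to see that any single failing condition sends the frame down the batched path.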

Peak RSS on the repro:

| decoder | peak RSS |
| --- | --- |
| jxl-rs main | 2739 MiB |
| #795 | 1198 MiB |
| this branch | 333 MiB |
| jxl-oxide 0.12.6 | 421 MiB |
| djxl -> PPM | 1657 MiB |

This avoids the global scratch-pool cap from #795 by only skipping center-buffer caching while interleaved modular output is actively feeding the low-memory pipeline. Normal scratch-buffer reuse is left unchanged.
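
One way to picture "skip caching only while interleaving is active" is a pool with a toggle instead of a size cap. This is a rough sketch, not the jxl-rs scratch pool: `ScratchPool`, `skip_caching`, `get`, `put`, and `cached` are hypothetical names, and `f32` buffers are assumed purely for the example.

```rust
/// Hypothetical scratch-buffer pool, for illustration only.
pub struct ScratchPool {
    free: Vec<Vec<f32>>,
    /// Assumed flag: set while interleaved modular output is feeding
    /// the low-memory pipeline.
    pub skip_caching: bool,
}

impl ScratchPool {
    pub fn new() -> Self {
        Self { free: Vec::new(), skip_caching: false }
    }

    /// Reuses a cached buffer if one exists, otherwise allocates a new one.
    pub fn get(&mut self, len: usize) -> Vec<f32> {
        let mut buf = self.free.pop().unwrap_or_default();
        buf.clear();
        buf.resize(len, 0.0);
        buf
    }

    /// Returns a buffer to the pool. It is kept for reuse unless caching
    /// is suspended, in which case it is dropped and its memory released.
    pub fn put(&mut self, buf: Vec<f32>) {
        if !self.skip_caching {
            self.free.push(buf);
        }
    }

    /// Number of buffers currently cached.
    pub fn cached(&self) -> usize {
        self.free.len()
    }
}
```

With the flag off, the pool behaves as before and reuses buffers normally; with it on, returned buffers are simply dropped, so the pool cannot grow toward a full-frame copy.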

veluca93 commented Jun 8, 2026

Member

I had a different approach in mind that might be simpler and more effective during progressive renders. I will give that some thought and hopefully write something up in the next day or two :-)

Only retain progressive render snapshots in the CLI when they can be written to an output.
Inverse squeeze steps read neighbor grids (next average and previous decoded) that the transform graph counts as buffer uses, but the per-step code never released them, so those intermediate modular buffers stayed allocated for the whole frame. Mark them used on the final render so they are freed once consumed.
Modular frames never reclaim center group buffers via get_buffer, so the scratch pool grew to a full-frame copy that was retained for the pipeline's lifetime. Cap it to the few buffers sequential rendering can actually reuse.
Run safe gridded modular transforms during decode so large lossless frames can free squeeze intermediates as soon as their dependencies are ready.
Avoid retaining modular center group buffers while interleaved processing is feeding the low-memory pipeline, without changing the normal scratch-buffer reuse path.
@veluca93 veluca93 force-pushed the experiment/issue-782-interleave-modular branch from 35bbb98 to a42d5b2, June 10, 2026 10:38

@veluca93 veluca93 left a comment

Member

approving for benchmark purposes ;-)

@veluca93

Member

Performance Summary (Commit a42d5b2)

Machine Threading Base MP/s PR MP/s Avg Improvement

Detailed per-image results
