Reduce memory use for progressive lossless images #795

Open
hjanuschka wants to merge 5 commits into libjxl:main from hjanuschka:fix/issue-782-progressive-memory

@hjanuschka hjanuschka commented Jun 8, 2026

Collaborator

Fixes #782.

The inverse squeeze transforms read two neighbor grids per step (the next average and the previously decoded output), which the transform graph counts as buffer uses, but do_run only released the primary inputs. So those intermediate modular buffers stayed allocated for the whole frame. This releases the neighbor grids on the final render, mirroring how the graph registers them (with a dedup guard for when a coarser average channel maps the neighbor onto the same grid index).
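
A minimal sketch of the release logic described above, modelling only the final render pass. It is not the code in this PR: `GridPool`, `SqueezeStep`, `register`, `release_primary_only` and `release_all` are hypothetical names, and the use-count bookkeeping is an assumption about how the graph tracks buffer reads.

```rust
/// Hypothetical pool of modular grid buffers with per-grid pending-read counts.
struct GridPool {
    remaining_uses: Vec<usize>,
    data: Vec<Option<Vec<i32>>>,
}

impl GridPool {
    fn new(num_grids: usize, grid_len: usize) -> Self {
        GridPool {
            remaining_uses: vec![0; num_grids],
            data: (0..num_grids).map(|_| Some(vec![0; grid_len])).collect(),
        }
    }

    /// Register one future read of `grid`.
    fn add_use(&mut self, grid: usize) {
        self.remaining_uses[grid] += 1;
    }

    /// Record that one read happened; free the buffer once no reads remain.
    fn mark_used(&mut self, grid: usize) {
        self.remaining_uses[grid] -= 1;
        if self.remaining_uses[grid] == 0 {
            self.data[grid] = None;
        }
    }

    fn live_grids(&self) -> usize {
        self.data.iter().filter(|d| d.is_some()).count()
    }
}

/// Grid indices read by one inverse squeeze step (assumed layout).
struct SqueezeStep {
    avg: usize,
    residual: usize,
    next_avg: usize, // neighbor: the next average
    prev_out: usize, // neighbor: the previously decoded output
}

impl SqueezeStep {
    fn primary(&self) -> [usize; 2] {
        [self.avg, self.residual]
    }

    /// Primary inputs plus neighbors, skipping a neighbor that lands on a
    /// grid index already in the list (the dedup guard).
    fn all_reads(&self) -> Vec<usize> {
        let mut grids = self.primary().to_vec();
        for n in [self.next_avg, self.prev_out] {
            if !grids.contains(&n) {
                grids.push(n);
            }
        }
        grids
    }
}

/// Graph construction: count every distinct grid the step reads.
fn register(pool: &mut GridPool, step: &SqueezeStep) {
    for g in step.all_reads() {
        pool.add_use(g);
    }
}

/// Old behaviour: only primary inputs are released, so neighbors leak.
fn release_primary_only(pool: &mut GridPool, step: &SqueezeStep) {
    for g in step.primary() {
        pool.mark_used(g);
    }
}

/// Fixed behaviour: release exactly the set that `register` counted.
fn release_all(pool: &mut GridPool, step: &SqueezeStep) {
    for g in step.all_reads() {
        pool.mark_used(g);
    }
}
```

With a step whose `next_avg` aliases `avg`, `release_primary_only` leaves the `prev_out` grid allocated, while `release_all` frees every grid without decrementing the aliased one twice.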

jxl_cli also no longer stores progressive render snapshots when there's no output path to write them to.

Peak RSS on the repro:

| scenario | before | after |
|---|---|---|
| decode (-s) | 2740 MiB | 1191 MiB |
| --render-interval 1000 | 4208 MiB | 3626 MiB |
| --render-interval 250 | 7411 MiB | 5815 MiB |

For full decode that's now below djxl (~1516 MiB) and ~1.5x jxl-oxide (~792 MiB), down from 3.4x.

Only retain progressive render snapshots in the CLI when they can be written to an output.
Inverse squeeze steps read neighbor grids (next average and previous decoded) that the transform graph counts as buffer uses, but the per-step code never released them, so those intermediate modular buffers stayed allocated for the whole frame. Mark them used on the final render so they are freed once consumed.
Modular frames never reclaim center group buffers via get_buffer, so the scratch pool grew to a full-frame copy that was retained for the pipeline's lifetime. Cap it to the few buffers sequential rendering can actually reuse.
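
A rough sketch of what capping a scratch pool can look like, assuming a simple free-list design. `ScratchPool`, `put_buffer` and `retained` are hypothetical names, and the `get_buffer` signature here is an assumption, not the one in the codebase.

```rust
/// Hypothetical scratch pool that retains at most `cap` returned buffers.
struct ScratchPool {
    cap: usize,
    free: Vec<Vec<f32>>,
}

impl ScratchPool {
    fn new(cap: usize) -> Self {
        ScratchPool { cap, free: Vec::new() }
    }

    /// Reuse a retained buffer if one exists, otherwise start from an empty Vec.
    fn get_buffer(&mut self, len: usize) -> Vec<f32> {
        let mut buf = self.free.pop().unwrap_or_default();
        buf.clear();
        buf.resize(len, 0.0);
        buf
    }

    /// Keep the buffer for reuse unless the pool is already at its cap.
    fn put_buffer(&mut self, buf: Vec<f32>) {
        if self.free.len() < self.cap {
            self.free.push(buf);
        }
        // Otherwise `buf` is dropped here and its allocation is released.
    }

    /// Number of buffers currently held for reuse.
    fn retained(&self) -> usize {
        self.free.len()
    }
}
```

If buffers are returned but never taken back out, an uncapped pool grows with every return; with a cap, anything beyond the first `cap` returns is dropped immediately.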

@veluca93 veluca93 left a comment

Member

I intend to revisit how modular transform processing works, but this is a good fix in the meantime.

@hjanuschka

Collaborator Author

I started working on the modular transforms, but it was just too big, so this PR is a temporary improvement!

veluca93 commented Jun 8, 2026

Member

Performance Summary (Commit 58821ab)

| Machine | Threading | Base MP/s | PR MP/s | Avg Improvement |
|---|---|---|---|---|
| desktop | Single | 80.80 | 81.01 | +0.61% ± 0.39% |
| desktop | Multi | 80.82 | 80.82 | +0.49% ± 0.33% |
| framework-desktop | Single | 94.43 | 93.71 | +0.13% ± 0.45% |
| framework-desktop | Multi | 94.79 | 94.65 | +0.20% ± 0.44% |
| pixel7a | Single (Fast) | 28.58 | 28.90 | +0.44% ± 0.40% |
| pixel7a | Single (Mid) | 20.83 | 21.10 | +0.39% ± 0.38% |
| pixel7a | Multi | 29.08 | 29.06 | -0.07% ± 0.42% |

Detailed per-image results

veluca93 commented Jun 8, 2026

Member

I think I know where the speed regressions are coming from - can you remove the part to limit the scratch buffers for now?

@hjanuschka

Collaborator Author

Done. I also opened #796, which has the refactored interleave and completely wins in terms of RSS. Could we bench this?

Development

Successfully merging this pull request may close these issues.

High memory use for progressive lossless images

3 participants