feat: Add optional lz4 compression support for arrays passed via base64 or binref encoding#579
feat: Add optional lz4 compression support for arrays passed via base64 or binref encoding#579angela-ko wants to merge 23 commits into
base64 or binref encoding#579Conversation
|
@dionhaefner @nmheim Let me know if this is what you meant by testing compression in tesseract? |
|
That's a good start, thanks @angela-ko ! As next step, please add minimal, meaningful end-to-end tests that cover this functionality - which I expect are going to fail because I do see some issues with how the new lz4 dependency is added :) Once everything is passing end-to-end I'll have a closer look at the design choices here. |
|
And please outline your rationale for choosing lz4 specifically as part of the PR body. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #579 +/- ##
==========================================
+ Coverage 68.30% 77.84% +9.54%
==========================================
Files 39 39
Lines 4635 4690 +55
Branches 754 770 +16
==========================================
+ Hits 3166 3651 +485
+ Misses 1224 727 -497
- Partials 245 312 +67 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Benchmark Resultsℹ️ No baseline found — all benchmarks marked as new. Benchmarks use a no-op Tesseract to measure pure framework overhead.
Benchmark details
|
56294af to
17fb949
Compare
8310d77 to
dc6a43b
Compare
dionhaefner
left a comment
There was a problem hiding this comment.
Taking shape – let's get some clarity on high-level design decisions before diving into details.
…ssion sie if compression is set
af7df61 to
b9ed79a
Compare
base64 or binref encoding
| ### binref + lz4 compression | ||
|
|
||
| Set `TESSERACT_BINREF_COMPRESSION=lz4` to compress arrays in `.bin` files. Each array is compressed individually, preserving offset-based random access. The compressed size is embedded directly in the buffer path (`<file>:<offset>:<compressed_size>`). |
There was a problem hiding this comment.
This now also applies to base64, correct?
| def _lz4_frame(): | ||
| import lz4.frame | ||
|
|
||
| return lz4.frame |
There was a problem hiding this comment.
Can live in global scope since the dep is now mandatory
| output_path: str = "." | ||
| output_format: supported_format_type = "json" | ||
| output_file: str = "" | ||
| binref_compression: Literal["lz4"] | None = None |
| if array_encoding == "base64": | ||
| return _dump_base64_arraydict(arr) | ||
| return _dump_base64_arraydict( | ||
| arr, compression=context.get("base64_compression") |
There was a problem hiding this comment.
I suggest we use a single use_compression variable instead of format-specific ones.
dionhaefner
left a comment
There was a problem hiding this comment.
Thanks @angela-ko. Looking real good now, just a last few comments.
Relevant issue or PR
To be done prior to implementing in pasteur-types
Changes are basically identical to the changes here
https://github.com/pasteurlabs/pasteur-types/pull/358/changes
Following the design doc here - chose to start with lz4 as the minimal dependency option for compression, and we can add in more optional compression types once it's working
https://pasteurisi.atlassian.net/wiki/spaces/~71202060d9f9d7be6c427dafac7d77e930e293/pages/1191247903/Compression+-+Design+Options
Description of changes
Testing done
Unit testing