feat: Add lz4 compression to array_encodings #579

Open
angela-ko wants to merge 20 commits into main from ako/compression

angela-ko commented Apr 30, 2026

Contributor

Relevant issue or PR

To be done prior to implementing in pasteur-types.

The changes are essentially identical to those in
https://github.com/pasteurlabs/pasteur-types/pull/358/changes

Following the design doc linked below, I chose to start with lz4 as the minimal-dependency option for compression; we can add more optional compression types once it is working.
https://pasteurisi.atlassian.net/wiki/spaces/~71202060d9f9d7be6c427dafac7d77e930e293/pages/1191247903/Compression+-+Design+Options

Description of changes

  • Add an optional dependency on lz4
  • Add compress/decompress to array_encodings and output_to_bytes
  • Update the CLI and tesseract.py to support compression as well
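
As a rough illustration of the first two bullets, here is a minimal sketch of an optional lz4 step wrapped around a base64-encoded buffer. It is not the code in this PR: `encode_buffer`, `decode_buffer`, and the `buffer` / `compression` payload keys are hypothetical names chosen for the example. Only `lz4.frame.compress` and `lz4.frame.decompress` come from the real `lz4` package, which is imported lazily so that it can remain an optional dependency.

```python
import base64
import importlib
from typing import Optional


def _load_lz4_frame():
    """Import lz4.frame lazily so that lz4 stays an optional dependency."""
    try:
        return importlib.import_module("lz4.frame")
    except ImportError as exc:
        raise ImportError(
            "lz4 compression was requested but the optional 'lz4' package "
            "is not installed"
        ) from exc


def encode_buffer(raw: bytes, compression: Optional[str] = None) -> dict:
    """Base64-encode a raw buffer, optionally LZ4-compressing it first.

    The payload layout is hypothetical and only serves this example.
    """
    if compression == "lz4":
        raw = _load_lz4_frame().compress(raw)
    elif compression is not None:
        raise ValueError(f"Unsupported compression: {compression!r}")
    return {
        "buffer": base64.b64encode(raw).decode("ascii"),
        "compression": compression,
    }


def decode_buffer(payload: dict) -> bytes:
    """Invert encode_buffer: base64-decode, then decompress if flagged."""
    raw = base64.b64decode(payload["buffer"])
    compression = payload.get("compression")
    if compression == "lz4":
        raw = _load_lz4_frame().decompress(raw)
    elif compression is not None:
        raise ValueError(f"Unsupported compression: {compression!r}")
    return raw
```

In this sketch the uncompressed path never touches lz4, so callers without the extra installed only see an `ImportError` when they explicitly ask for `compression="lz4"`.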

Testing done

Unit testing

@angela-ko

Contributor Author

@dionhaefner @nmheim Let me know if this is what you meant by testing compression in tesseract.

angela-ko marked this pull request as ready for review April 30, 2026 18:01
@dionhaefner

Contributor

That's a good start, thanks @angela-ko! As a next step, please add minimal, meaningful end-to-end tests that cover this functionality. I expect they are going to fail, because I do see some issues with how the new lz4 dependency is added :)

Once everything is passing end-to-end I'll have a closer look at the design choices here.

@dionhaefner

Contributor

And please outline your rationale for choosing lz4 specifically as part of the PR body.

angela-ko marked this pull request as draft May 11, 2026 02:31
codecov Bot commented May 11, 2026

Codecov Report

❌ Patch coverage is 70.49180% with 18 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.12%. Comparing base (2b7ac30) to head (bf7a5a9).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tesseract_core/runtime/array_encoding.py | 76.31% | 7 Missing and 2 partials ⚠️ |
| tesseract_core/sdk/tesseract.py | 52.63% | 5 Missing and 4 partials ⚠️ |

❗ There is a different number of reports uploaded between BASE (2b7ac30) and HEAD (bf7a5a9).

HEAD has 4 uploads less than BASE

| Flag | BASE (2b7ac30) | HEAD (bf7a5a9) |
|---|---|---|
| | 36 | 32 |

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #579      +/-   ##
==========================================
- Coverage   77.95%   72.12%   -5.83%     
==========================================
  Files          39       39              
  Lines        4635     4685      +50     
  Branches      754      768      +14     
==========================================
- Hits         3613     3379     -234     
- Misses        716     1011     +295     
+ Partials      306      295      -11     

PasteurBot commented May 11, 2026

Contributor

Benchmark Results

ℹ️ No baseline found — all benchmarks marked as new.

Benchmarks use a no-op Tesseract to measure pure framework overhead.

| Benchmark | Baseline | Current | Change | Status |
|---|---|---|---|---|
| api/apply_1,000 | - | 0.591ms | new | 🆕 |
| api/apply_100,000 | - | 0.592ms | new | 🆕 |
| api/apply_10,000,000 | - | 0.587ms | new | 🆕 |
| cli/apply_1,000 | - | 1635.676ms | new | 🆕 |
| cli/apply_100,000 | - | 1634.852ms | new | 🆕 |
| cli/apply_10,000,000 | - | 1699.345ms | new | 🆕 |
| decoding/base64_1,000 | - | 0.035ms | new | 🆕 |
| decoding/base64_100,000 | - | 0.535ms | new | 🆕 |
| decoding/base64_10,000,000 | - | 65.555ms | new | 🆕 |
| decoding/base64+lz4_1,000 | - | 0.039ms | new | 🆕 |
| decoding/base64+lz4_100,000 | - | 0.592ms | new | 🆕 |
| decoding/base64+lz4_10,000,000 | - | 109.624ms | new | 🆕 |
| decoding/binref_1,000 | - | 0.207ms | new | 🆕 |
| decoding/binref_100,000 | - | 0.244ms | new | 🆕 |
| decoding/binref_10,000,000 | - | 11.001ms | new | 🆕 |
| decoding/binref+lz4_1,000 | - | 0.215ms | new | 🆕 |
| decoding/binref+lz4_100,000 | - | 0.309ms | new | 🆕 |
| decoding/binref+lz4_10,000,000 | - | 38.996ms | new | 🆕 |
| decoding/json_1,000 | - | 0.108ms | new | 🆕 |
| decoding/json_100,000 | - | 8.934ms | new | 🆕 |
| decoding/json_10,000,000 | - | 1075.916ms | new | 🆕 |
| encoding/base64_1,000 | - | 0.042ms | new | 🆕 |
| encoding/base64_100,000 | - | 0.151ms | new | 🆕 |
| encoding/base64_10,000,000 | - | 26.030ms | new | 🆕 |
| encoding/base64+lz4_1,000 | - | 0.049ms | new | 🆕 |
| encoding/base64+lz4_100,000 | - | 0.357ms | new | 🆕 |
| encoding/base64+lz4_10,000,000 | - | 88.291ms | new | 🆕 |
| encoding/binref_1,000 | - | 0.312ms | new | 🆕 |
| encoding/binref_100,000 | - | 0.491ms | new | 🆕 |
| encoding/binref_10,000,000 | - | 18.720ms | new | 🆕 |
| encoding/binref+lz4_1,000 | - | 0.323ms | new | 🆕 |
| encoding/binref+lz4_100,000 | - | 0.716ms | new | 🆕 |
| encoding/binref+lz4_10,000,000 | - | 82.211ms | new | 🆕 |
| encoding/json_1,000 | - | 0.152ms | new | 🆕 |
| encoding/json_100,000 | - | 13.184ms | new | 🆕 |
| encoding/json_10,000,000 | - | 1408.370ms | new | 🆕 |
| http/apply_1,000 | - | 3.174ms | new | 🆕 |
| http/apply_100,000 | - | 9.257ms | new | 🆕 |
| http/apply_10,000,000 | - | 783.413ms | new | 🆕 |
| roundtrip/base64_1,000 | - | 0.089ms | new | 🆕 |
| roundtrip/base64_100,000 | - | 0.698ms | new | 🆕 |
| roundtrip/base64_10,000,000 | - | 90.609ms | new | 🆕 |
| roundtrip/base64+lz4_1,000 | - | 0.100ms | new | 🆕 |
| roundtrip/base64+lz4_100,000 | - | 0.968ms | new | 🆕 |
| roundtrip/base64+lz4_10,000,000 | - | 197.953ms | new | 🆕 |
| roundtrip/binref_1,000 | - | 0.544ms | new | 🆕 |
| roundtrip/binref_100,000 | - | 0.739ms | new | 🆕 |
| roundtrip/binref_10,000,000 | - | 30.222ms | new | 🆕 |
| roundtrip/binref+lz4_1,000 | - | 0.557ms | new | 🆕 |
| roundtrip/binref+lz4_100,000 | - | 1.018ms | new | 🆕 |
| roundtrip/binref+lz4_10,000,000 | - | 121.022ms | new | 🆕 |
| roundtrip/json_1,000 | - | 0.270ms | new | 🆕 |
| roundtrip/json_100,000 | - | 20.016ms | new | 🆕 |
| roundtrip/json_10,000,000 | - | 2467.532ms | new | 🆕 |

Benchmark details
  • Runner: Linux 6.17.0-1018-azure x86_64
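
To read the lz4 rows against their uncompressed counterparts, the short sketch below computes the relative time cost of the `+lz4` variants from the roundtrip timings reported above for the 10,000,000 size. The numbers are copied from the benchmark table; the helper name `lz4_slowdown` is made up for this example. The result only reflects framework time on this single run and says nothing about how much smaller the compressed payloads are.

```python
# Roundtrip timings in milliseconds for the 10,000,000 benchmark size,
# copied from the benchmark table in this comment.
roundtrip_ms = {
    "base64": 90.609,
    "base64+lz4": 197.953,
    "binref": 30.222,
    "binref+lz4": 121.022,
}


def lz4_slowdown(encoding: str, timings: dict) -> float:
    """Return how many times slower the +lz4 variant is than the plain encoding."""
    return timings[f"{encoding}+lz4"] / timings[encoding]


for encoding in ("base64", "binref"):
    print(f"{encoding}: {lz4_slowdown(encoding, roundtrip_ms):.2f}x")
```

With these timings, the lz4 roundtrip comes out at roughly 2.2x the plain base64 time and roughly 4.0x the plain binref time.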

angela-ko force-pushed the ako/compression branch 2 times, most recently from 56294af to 17fb949 May 11, 2026 18:21
angela-ko force-pushed the ako/compression branch 3 times, most recently from 8310d77 to dc6a43b May 25, 2026 04:58
angela-ko marked this pull request as ready for review May 25, 2026 04:58
Comment thread pyproject.toml
Comment thread tesseract_core/runtime/array_encoding.py Outdated
Comment thread tests/dummy_tesseract/tesseract_requirements.txt Outdated
Comment thread docs/content/using-tesseracts/array-encodings.md

dionhaefner left a comment

Contributor

Taking shape – let's get some clarity on high-level design decisions before diving into details.

angela-ko marked this pull request as draft June 16, 2026 05:55
angela-ko marked this pull request as ready for review June 26, 2026 19:52
angela-ko requested review from dionhaefner and nmheim June 26, 2026 19:52

4 participants