Refactor asset extractors #551
I fixed the M3U8 issue :)
5d88433 to ad5fc09
I finished refactoring all asset extractors to the defined interface. I have a question about the overall logic of asset and outlink extraction: I currently just loop through all extractors, and if one matches I append its newly found assets or outlinks to an array that holds all extracted assets/outlinks. See the comment in Zeno/internal/pkg/postprocessor/assets.go, lines 25 to 26 in aa1fbdb. Should we find the most specific/highest-priority asset extractor and exclusively use its extracted assets/outlinks? (Loop through the ordered extractors and break out on the first match.) I think @vbanos is the original author of that comment.
Can anyone comment on this? Otherwise I will just go ahead and reimplement the same logic as before.
I don't have a solid harness to test the asset extraction logic against a corpus of items. I think Zeno in its entirety would profit from a serialized data format for [...] Writing tests is currently a bit of a hassle: you have to repeat the same bag of tricks again and again.
Yes, use the same logic as before. Extractors that appear earlier in the list have higher priority; once a match is found, [...] After all, we have fewer than 10 asset extractors right now, so there's no need for a complex tree-like matching structure like mimetype.
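To make the two options discussed above concrete, here is a minimal sketch contrasting "append results from every matching extractor" with "use only the first (highest-priority) match". The `Extractor` interface, `suffixExtractor`, `extractAll`, and `extractFirst` are hypothetical names for illustration only; they are not the actual Zeno interface, and the sketch assumes that "higher priority" means stopping at the first match.

```go
package main

import (
	"fmt"
	"strings"
)

// Extractor is a hypothetical stand-in for the refactored interface;
// the real Zeno interface may look different.
type Extractor interface {
	Match(url string) bool
	Extract(url string) []string
}

// suffixExtractor is a toy extractor that matches on a URL suffix
// and returns a fixed list of assets.
type suffixExtractor struct {
	suffix string
	assets []string
}

func (e suffixExtractor) Match(url string) bool       { return strings.HasSuffix(url, e.suffix) }
func (e suffixExtractor) Extract(url string) []string { return e.assets }

// extractAll appends the results of every matching extractor.
func extractAll(extractors []Extractor, url string) []string {
	var out []string
	for _, e := range extractors {
		if e.Match(url) {
			out = append(out, e.Extract(url)...)
		}
	}
	return out
}

// extractFirst walks the ordered list and returns the results of the
// first matching extractor only; later extractors are never consulted.
func extractFirst(extractors []Extractor, url string) []string {
	for _, e := range extractors {
		if e.Match(url) {
			return e.Extract(url)
		}
	}
	return nil
}

func main() {
	// Order encodes priority: the specific extractor comes first,
	// the catch-all (empty suffix matches everything) comes last.
	extractors := []Extractor{
		suffixExtractor{suffix: ".m3u8", assets: []string{"segment0.ts"}},
		suffixExtractor{suffix: "", assets: []string{"generic.png"}},
	}
	url := "https://example.com/video.m3u8"

	fmt.Println(extractAll(extractors, url))   // both extractors contribute
	fmt.Println(extractFirst(extractors, url)) // only the .m3u8 extractor contributes
}
```

With fewer than 10 extractors, a linear scan over an ordered slice like this stays simple, and priority is expressed purely by position in the list.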
Same thing as internetarchive#602 but for the refactored asset extractors
Converted to draft to avoid wasting GitHub Actions runner resources (if the CI is configured like that).
Codecov Report
❌ Patch coverage is [...]
Additional details and impacted files
@@            Coverage Diff             @@
## main #551 +/- ##
==========================================
+ Coverage 56.47% 56.66% +0.18%
==========================================
Files 133 134 +1
Lines 6760 6812 +52
==========================================
+ Hits 3818 3860 +42
+ Misses 2562 2560 -2
- Partials 380 392 +12
Zeno/internal/pkg/postprocessor/assets.go
Lines 24 to 27 in aa1fbdb
I started refactoring; this is less straightforward than I thought.
The massive switch case in assets.go calls these extractors:
INA Extraction
Zeno/internal/pkg/postprocessor/assets.go
Lines 35 to 48 in aa1fbdb
Is this supposed to be a fallback (for when the matched INA response is no longer JSON but HTML)?
Interface design
I tried to take inspiration from
Zeno/internal/pkg/preprocessor/sitespecific/sitespecific.go
Lines 13 to 34 in aa1fbdb
and decided to use
This breaks support for the EmbeddedCSSExtractor (which requires the full Item in its Match method), but it has some special logic anyway:
Zeno/internal/pkg/postprocessor/assets.go
Lines 79 to 90 in aa1fbdb
Truth Social
The regex for the statuses API URL has changed, so I added the new pattern. The rest of the JSON has also changed heavily. I don't know whether this is a prioritized feature; I will do this if I find time in between.