Skip to content

Refactor asset extractors#551

Draft
AltayAkkus wants to merge 25 commits into
internetarchive:mainfrom
AltayAkkus:refactor/asset-extractors
Draft

Refactor asset extractors#551
AltayAkkus wants to merge 25 commits into
internetarchive:mainfrom
AltayAkkus:refactor/asset-extractors

Conversation

@AltayAkkus

Copy link
Copy Markdown
Contributor

// Extract assets and outlinks from the body using the appropriate extractor
// Order is important, we want to check for more specific things first,
// as they may trigger more general extractors (e.g. HTML)
// TODO this should be refactored using interfaces

I started Refactoring, this is less straightforward than I thought.

The massive switch case in assets.go calls these extractors:

Name Match method Input Output Extract method Input Output
INAExtractor ina.IsAPIURL item.GetURL() bool ina.ExtractMedias item.GetURL() assets, err
TruthSocial truthsocial.NeedExtraction item.GetURL() bool truthsocial.ExtractAssets item assets, outlinks, err
m3u8 extractor.IsM3U8 item.GetURL() bool extractor.M3U8 item.GetURL() assets, err
JSON extractor.IsJson item.GetURL() bool extractor.JSON item.GetURL() assets, outlinks, err
XML extractor.IsXML item.GetURL() bool extractor.XML item.GetURL() assets, outlinks, err
HTML extractor.IsHTML item.GetURL() bool extractor.HTMLAssets item assets, err
EmbeddedCSS extractor.IsEmbeddedCSS item bool extractor.ExtractFromURLCSS item.GetURL() assets, atImportLinks, err

INA Extraction

case ina.IsAPIURL(item.GetURL()):
INAAssets, err := ina.ExtractMedias(item.GetURL())
if err != nil {
logger.Error("unable to extract medias from INA", "err", err.Error())
return assets, outlinks, err
}
HTMLAssets, err := extractor.HTMLAssets(item)
if err != nil {
logger.Error("unable to extract assets", "err", err.Error())
return assets, outlinks, err
}
assets = append(INAAssets, HTMLAssets...)

Is this supposed to be a fallback (if the matched INA is not a JSON anymore, but a HTML)?

Interface design

I tried to take inspiration from

type Preprocessor interface {
Match(*models.URL) bool
Apply(*http.Request)
}
var preprocessors = []Preprocessor{
npr.NPRPreprocessor{},
reddit.RedditPreprocessor{},
tiktok.TikTokPreprocessor{},
truthsocial.TruthsocialStatusPreprocessor{},
truthsocial.TruthsocialAccountsPreprocessor{},
}
// Apply the first matching preprocessor.
func RunPreprocessors(URL *models.URL, req *http.Request) {
for _, p := range preprocessors {
if p.Match(URL) {
p.Apply(req)
break
}
}
}

and decided to use

type AssetExtractor interface {
	Match(*models.URL) bool
	Extract(*models.Item) (assets, outlinks []*models.URL, err error)
}

var assetExtractors = []AssetExtractor{
	ina.INAExtractor{},
	truthsocial.TruthsocialExtractor{},
	extractor.M3U8Extractor{},
	extractor.JSONExtractor{},
	extractor.XMLExtractor{},
	extractor.HTMLAssetsExtractor{},
}

This breaks support for the EmbeddedCSSExtractor (which requires the full Item in it's Match method), but it has some special logic anyway

case extractor.IsEmbeddedCSS(item):
var atImportLinks []*models.URL
assets, atImportLinks, err = extractor.ExtractFromURLCSS(item.GetURL())
logArgs := []any{"links", len(assets), "at_import_links", len(atImportLinks)}
if err != nil {
logArgs = append(logArgs, "err", err)
logger.Error("error extracting assets from CSS", logArgs...)
} else {
logger.Debug("extracted assets from CSS", logArgs...)
}
extractor.AddAtImportLinksToItemChild(item, atImportLinks)

Trump social

The RegEx for the statuses API URL has changed, I added that. But the rest of the JSON has also heavily changed, I don't know if this is a prioritized feature, will do this if I find time inbetween.

@AltayAkkus

AltayAkkus commented Jan 30, 2026

Copy link
Copy Markdown
Contributor Author

Progress

  • INAExtractor
    • Interface implementation
    • Tests
  • TruthSocial (just drop it?)
    • Interface implementation
    • Tests
  • M3U8
  • JSON
    • Interface implementation
    • Tests
  • XML
    • Interface implementation
    • Tests
      Note: For XML and JSON only the AssetExtractor struct, Match and Extract functions were created, and since the tests work with URLs only (and already provide great coverage) I did not change them. The code-under-test is return otherFunction(input)
  • HTML
    • Interface implementation
    • Tests

@yzqzss

yzqzss commented Jan 31, 2026

Copy link
Copy Markdown
Collaborator

I fixed the M3U8 issue :)
Any question?

@yzqzss yzqzss force-pushed the refactor/asset-extractors branch from 5d88433 to ad5fc09 Compare January 31, 2026 02:39
@AltayAkkus

AltayAkkus commented Feb 3, 2026

Copy link
Copy Markdown
Contributor Author

Any question?

I finished refactoring all asset extractors to the defined interface, I have a question regarding the overall logic of asset and outlink extraction:
The switch case currently matches URL, if the given extractor thinks it can extract assets and/or outlinks, the case will write into assets and outlinks, WITHOUT accumulating. So if two cases are met, the first one returns early, the second one will not be reached.

I currently just loop through all extractors, and if they match I append their newly found assets or outlinks to an array which holds all extracted assets/outlinks.

The comment in assets.go mentions priority

// Order is important, we want to check for more specific things first,
// as they may trigger more general extractors (e.g. HTML)

Should we find the most specific/highest priority Asset extractor and exclusively use it's extracted assets/outlinks? (Loop through the ordered extractors, break out when matched) I think @vbanos is the original author of that comment

@AltayAkkus

Copy link
Copy Markdown
Contributor Author

Can any1 comment on this? Else I will just go forward with reimplementing the same logic as previously.

@AltayAkkus AltayAkkus marked this pull request as ready for review February 22, 2026 20:58
@AltayAkkus

AltayAkkus commented Feb 22, 2026

Copy link
Copy Markdown
Contributor Author

I don't have a solid harness to test the asset extraction logic against a corpus of items.

I think Zeno in it's entirety would profit from a serialized data format for models.item which can loaded with

// go:embed ina.item
var inaItem []byte
item, ok := testutil.HydrateItem(inaItem).(*models.Item)

Writing tests is currently a little bit of a hassle, you have to do the same bag of tricks again and again and again.
EDIT: #563 implements this

@yzqzss

yzqzss commented Feb 26, 2026

Copy link
Copy Markdown
Collaborator

The switch case currently matches URL, if the given extractor thinks it can extract assets and/or outlinks, the case will write into assets and outlinks, WITHOUT accumulating. So if two cases are met, the first one returns early, the second one will not be reached.

Yes, use the same logic as before. Extractors that appear earlier in the list have higher priority; once a match is found, break.

After all, we only have fewer than 10 assets extractors right now, so there's no need to use a complex tree-like matching structure like mimetype.

@AltayAkkus

Copy link
Copy Markdown
Contributor Author

Yes, use the same logic as before. Extractors that appear earlier in the list have higher priority; once a match is found, break.
nice, thx for the clarification, implemented that in cc557c

@AltayAkkus AltayAkkus marked this pull request as draft June 10, 2026 13:48
@AltayAkkus

Copy link
Copy Markdown
Contributor Author

Converted to draft to avoid wasting Github Actions Runners resources (if the CI is configured like that)

@codecov-commenter

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 48.95833% with 49 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.66%. Comparing base (456cc72) to head (b8e9915).

Files with missing lines Patch % Lines
internal/pkg/postprocessor/testutil/item.go 33.33% 26 Missing and 8 partials ⚠️
internal/pkg/postprocessor/assets.go 69.23% 2 Missing and 2 partials ⚠️
internal/pkg/postprocessor/extractor/m3u8.go 57.14% 3 Missing ⚠️
internal/pkg/postprocessor/extractor/json.go 50.00% 2 Missing ⚠️
internal/pkg/postprocessor/extractor/xml.go 50.00% 2 Missing ⚠️
internal/pkg/postprocessor/sitespecific/ina/ina.go 71.42% 1 Missing and 1 partial ⚠️
...tprocessor/sitespecific/truthsocial/truthsocial.go 50.00% 2 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #551      +/-   ##
==========================================
+ Coverage   56.47%   56.66%   +0.18%     
==========================================
  Files         133      134       +1     
  Lines        6760     6812      +52     
==========================================
+ Hits         3818     3860      +42     
+ Misses       2562     2560       -2     
- Partials      380      392      +12     
Flag Coverage Δ
e2etests 42.02% <55.55%> (+0.15%) ⬆️
unittests 29.62% <45.83%> (+0.53%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants