ETT-1459: reuse main_repo_audit for extracting info from mets #187
Merged
* rename to "crawl repo mets" and remove md5 checking (done by truenas_audit.pl)
* record date of first ingest to feed_audit; extracts all PREMIS 'ingestion' events & takes the first
* separate flag for source mets handling
* extract source METS PREMIS events
* extract methods to make testable (similar to treatment for populate_rights)
moseshll approved these changes on Jun 23, 2026 and left a comment:
Yes, lotsa stuff we've seen before. Could see if HTFeed::RepositoryIterator would do some of the lifting, maybe at a later point. It would probably have some gotchas to refactor since it's really tailored to the truenas audit -- not worth the pain right now IMHO.
This is the second half of ETT-1459 -- backfilling feed_audit with the first ingest date for things that don't have it.

This used the same approach to getting it under test as I did for populate_rights_data.pl -- I tried calling out with `system` but ran into some of the same issues you did with Glacier with not inheriting the config. I figured making it more directly testable was probably preferable, and I think it's close to a point where it could be changed to be an object to further make it testable.

Of particular note is that some of the things in `crawl_repo_mets` are still duplicative of `truenas_audit` -- probably we should extract those somewhere separate or do less rigorous checks in `crawl_repo_mets`.

I don't love that the interface is outputting to stdout, but I can redirect the output to a file in Kubernetes and process later.
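
For readers unfamiliar with the "first ingest date" extraction, here is a rough sketch of the idea, written in Python with only the standard library rather than in the script's own Perl. It is not the code from this PR. The function name `first_ingest_date`, the sample document, and the PREMIS 2 namespace URI are assumptions for illustration, and "first" is interpreted here as the earliest timestamp, which may differ from taking the first event in document order.

```python
import xml.etree.ElementTree as ET

# Assumption: the METS embeds PREMIS 2 events; change the URI for other versions.
PREMIS_URI = "info:lc/xmlschema/premis-v2"
PREMIS_NS = {"premis": PREMIS_URI}


def first_ingest_date(mets_xml):
    """Return the earliest eventDateTime among PREMIS 'ingestion' events, or None."""
    root = ET.fromstring(mets_xml)
    dates = []
    for event in root.iter("{%s}event" % PREMIS_URI):
        event_type = event.findtext("premis:eventType", default="", namespaces=PREMIS_NS)
        event_date = event.findtext("premis:eventDateTime", default="", namespaces=PREMIS_NS)
        if event_type.strip() == "ingestion" and event_date.strip():
            dates.append(event_date.strip())
    # Assumes every timestamp uses the same ISO 8601 layout and offset,
    # so string order matches chronological order.
    return min(dates) if dates else None


# Hypothetical METS fragment: two ingestion events and one unrelated event.
SAMPLE = """<mets:mets xmlns:mets="http://www.loc.gov/METS/"
                      xmlns:premis="info:lc/xmlschema/premis-v2">
  <mets:amdSec>
    <mets:digiprovMD>
      <mets:mdWrap>
        <mets:xmlData>
          <premis:event>
            <premis:eventType>ingestion</premis:eventType>
            <premis:eventDateTime>2015-06-01T10:00:00Z</premis:eventDateTime>
          </premis:event>
          <premis:event>
            <premis:eventType>ingestion</premis:eventType>
            <premis:eventDateTime>2010-03-04T08:30:00Z</premis:eventDateTime>
          </premis:event>
          <premis:event>
            <premis:eventType>fixity check</premis:eventType>
            <premis:eventDateTime>2009-01-01T00:00:00Z</premis:eventDateTime>
          </premis:event>
        </mets:xmlData>
      </mets:mdWrap>
    </mets:digiprovMD>
  </mets:amdSec>
</mets:mets>"""

print(first_ingest_date(SAMPLE))
```

Keeping the extraction in a function that takes the XML and returns a value, instead of printing from inside the crawl loop, is the same "extract methods to make testable" move described above: a test can call it directly without shelling out.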