Split up doc collection and data extraction #447
    entry for entry in results if entry["filtered_reason"] is None
]


def _sort_key(entry):
Are all these auxiliary functions duplicates of ones that already existed, or are you just moving them around?
I'm experimenting with these criteria and already have some changes. What is the best way to coordinate with what you're doing?

A bit of both.
The core logic functions have been moved here: https://github.com/NatLabRockies/COMPASS/blob/pp/split_doc_collection/compass/web/search.py
I think I need to merge this PR ASAP so that we don't all end up with diverging implementations.
    Lower values indicate more confidence in result
    """
    duplicate_count = len(entry.get("duplicates", []))
    return (  # lower is better
It might be worth creating a proper ranking system later, but for now I think this ordering might work best. The duplicate count seems to be a strong criterion, particularly when using multiple engines, and the query index might help more than the engine itself.
-duplicate_count,
entry["query_rank"],
entry["query_index"],
entry["search_engine"],
entry["_order"],
This is not perfect, but I think defining these heuristics should be a separate effort.
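For reference, here is a minimal, self-contained sketch of how the suggested ordering would behave. It is not the code in this PR: the function name `sort_key`, the sample `results` list, the `url` field, and the engine and filter-reason values are all made up for illustration. Only the filter condition and the tuple fields come from the diff and the suggestion above.

```python
def sort_key(entry):
    """Return a tuple used for ranking; lower tuples sort first."""
    duplicate_count = len(entry.get("duplicates", []))
    return (
        -duplicate_count,        # more duplicates -> ranked earlier
        entry["query_rank"],     # position within a single query's results
        entry["query_index"],    # which query produced the result
        entry["search_engine"],  # ties broken alphabetically by engine name
        entry["_order"],         # original insertion order as the final tie-breaker
    )


# Hypothetical sample data, only to show the resulting order.
results = [
    {"url": "a", "duplicates": [], "query_rank": 1, "query_index": 0,
     "search_engine": "engine_a", "_order": 0, "filtered_reason": None},
    {"url": "b", "duplicates": ["x", "y"], "query_rank": 3, "query_index": 1,
     "search_engine": "engine_b", "_order": 1, "filtered_reason": None},
    {"url": "c", "duplicates": ["x"], "query_rank": 1, "query_index": 0,
     "search_engine": "engine_a", "_order": 2, "filtered_reason": None},
    {"url": "d", "query_rank": 2, "query_index": 0,
     "search_engine": "engine_a", "_order": 3, "filtered_reason": None},
    {"url": "e", "duplicates": ["x", "y", "z"], "query_rank": 1, "query_index": 0,
     "search_engine": "engine_a", "_order": 4, "filtered_reason": "example_reason"},
]

# Drop filtered entries first, then rank what is left.
kept = [entry for entry in results if entry["filtered_reason"] is None]
ranked = sorted(kept, key=sort_key)
ranked_urls = [entry["url"] for entry in ranked]
print(ranked_urls)
```

Because Python compares tuples element by element, the duplicate count dominates and the later fields only matter when the earlier ones tie. In this sample, "b" ranks first despite its worse query rank, which is the trade-off a proper ranking system would need to revisit.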
Add new CLI commands that run document collection and data extraction independently, allowing different parts of the pipeline to run in different places.