
Split up doc collection and data extraction#447

Draft
ppinchuk wants to merge 123 commits into main from pp/split_doc_collection

Conversation

@ppinchuk
Collaborator

Add new CLI commands that run document collection and data extraction independently, allowing different parts of the pipeline to run in different places.

Comment thread compass/scripts/search.py
entry for entry in results if entry["filtered_reason"] is None
]

def _sort_key(entry):
Member

Are all these auxiliary functions redundant because they already existed elsewhere, or are you just moving them around?

I'm experimenting with these criteria and already have some changes. What is the best way to coordinate with what you're doing?

Collaborator Author

@ppinchuk ppinchuk May 31, 2026

A little bit of both.

The core logic functions have been moved here: https://github.com/NatLabRockies/COMPASS/blob/pp/split_doc_collection/compass/web/search.py

I think I need to merge this PR ASAP so that we don't all end up with diverging implementations.

Comment thread compass/web/search.py
Lower values indicate more confidence in result
"""
duplicate_count = len(entry.get("duplicates", []))
return ( # lower is better
Member

It might be worth creating a ranking system later, but for now I think this ordering might work best. The duplicate count seems to be a strong criterion, particularly when using multiple engines, and the query index might help more than the engine itself.

    -duplicate_count,
    entry["query_rank"],
    entry["query_index"],
    entry["search_engine"],
    entry["_order"],

This is not perfect, but I think defining these heuristics should be a separate effort.
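For reference, here is a minimal, self-contained sketch of how the ordering suggested above would behave as a sort key. The field names (`duplicates`, `query_rank`, `query_index`, `search_engine`, `_order`) come from the snippet in this thread. The function name `suggested_sort_key`, the `url` field, and the sample entries are hypothetical and only illustrate the tuple comparison. This is not the code in `compass/web/search.py`.

```python
def suggested_sort_key(entry):
    """Hypothetical sort key; tuples that compare lower sort first."""
    duplicate_count = len(entry.get("duplicates", []))
    return (
        -duplicate_count,        # more duplicates -> sorts earlier
        entry["query_rank"],     # then position within a query's results
        entry["query_index"],    # then which query produced the hit
        entry["search_engine"],  # then engine name (plain string comparison)
        entry["_order"],         # finally the original order as a tie-breaker
    )


# Made-up entries, purely for illustration.
results = [
    {"url": "a", "duplicates": [], "query_rank": 0,
     "query_index": 0, "search_engine": "engine_a", "_order": 0},
    {"url": "b", "duplicates": ["x", "y"], "query_rank": 2,
     "query_index": 1, "search_engine": "engine_b", "_order": 1},
    {"url": "c", "duplicates": ["x"], "query_rank": 0,
     "query_index": 0, "search_engine": "engine_a", "_order": 2},
]

ranked_urls = [entry["url"] for entry in sorted(results, key=suggested_sort_key)]
print(ranked_urls)
```

With this ordering, an entry found multiple times outranks a top-ranked hit that was found only once. That matches the intent that duplicates are the strongest signal, but it can bury a rank-0 result if duplicate counts are noisy.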

Labels

breaking: Breaks something in the API or config
enhancement: Update to logic or general code improvements
p-critical: Priority: critical
topic-python-async: Issues/pull requests related to python async code

3 participants