Split up doc collection and data extraction by ppinchuk · Pull Request #447 · NatLabRockies/COMPASS

ppinchuk · 2026-05-27T06:51:30Z

Add new CLI commands that run document collection and data extraction independently, so that different parts of the pipeline can run in different places.

castelao · 2026-05-31T22:26:41Z

-        entry for entry in results if entry["filtered_reason"] is None
-    ]
-
-    def _sort_key(entry):


Were all these auxiliary functions redundant because they already existed elsewhere, or are you just moving them around?

I'm playing with these criteria and have some changes already. What is the best way to connect with what you're doing?

ppinchuk · 2026-05-31 (edited)

Little bit of both.

The core logic functions have been moved here: https://github.com/NatLabRockies/COMPASS/blob/pp/split_doc_collection/compass/web/search.py

I think I need to merge this PR ASAP so that we don't all end up with diverging implementations.

castelao · 2026-05-31T22:35:37Z

+    Lower values indicate more confidence in result
+    """
+    duplicate_count = len(entry.get("duplicates", []))
+    return (  # lower is better


It might be worth creating a ranking system later, but for now, I think this might work best. The duplicate count seems to be a strong criterion, particularly when using multiple engines. And the query index might help more than the engine itself.

-duplicate_count, entry["query_rank"], entry["query_index"], entry["search_engine"], entry["_order"],

This is not perfect, but I think defining these heuristics should be an effort by itself.
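A minimal sketch of how the suggested ordering could behave as a sort key, for discussion only. It reads the leading minus in the suggestion as negation (an assumption), so entries with more duplicates sort first, consistent with the "lower is better" comment in the diff. The sample entries, their values, and the engine names are hypothetical; only the key names come from the snippet above.

```python
def sort_key(entry):
    """Return a tuple used to rank a search result; lower sorts first."""
    duplicate_count = len(entry.get("duplicates", []))
    return (
        -duplicate_count,        # more duplicates -> more negative -> ranked earlier
        entry["query_rank"],     # position of the result within its query
        entry["query_index"],    # which query produced the result
        entry["search_engine"],  # tie-break on engine name
        entry["_order"],         # original order as the final tie-break
    )


# Hypothetical sample data, for illustration only.
results = [
    {"url": "a", "duplicates": [], "query_rank": 0, "query_index": 0,
     "search_engine": "engine_a", "_order": 0},
    {"url": "b", "duplicates": ["x", "y"], "query_rank": 2, "query_index": 1,
     "search_engine": "engine_b", "_order": 1},
    {"url": "c", "duplicates": ["x"], "query_rank": 0, "query_index": 0,
     "search_engine": "engine_a", "_order": 2},
    {"url": "d", "query_rank": 0, "query_index": 0,
     "search_engine": "engine_a", "_order": 3},  # no "duplicates" key at all
]

ranked = sorted(results, key=sort_key)
print([entry["url"] for entry in ranked])
```

With this ordering, an entry seen by several queries or engines outranks one with a better `query_rank` but no duplicates, which is the trade-off the comment describes.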

ppinchuk added 30 commits May 7, 2026 18:42

Set class correctly

124f369

Filter out empty docs

fb99095

Merge remote-tracking branch 'origin/main' into pp/split_doc_collection

59619ae

find_jurisdiction_website can now run without validation

50dddef

Checking for correct jurisdiction can now be disabled via doc attrs

38c35a6

Add reset function for progress bar

cc9e40c

MInor update

bbff3a6

More flexible load_config

8512367

generalized _move_file a little

4cd67ac

Minor update

6bfd39c

Add TempFileCacheCopier

4ca787f

No extra key top-level

2c86554

MInor logic update

80d436f

Fix bug

8b75bac

Func returns docs

8e5428c

Add known website

3bf14ce

Directories now have a collect-only option

84aaf92

Add load function for collected docs

c3607ee

Add arg

9fb3cb5

Add _load_docs_from_collection_info

b0598e3

Add _persist_doc

710e475

Add _collection_doc_key

9006af2

Add _write_collection_manifest

36037a8

Flexibility

a6fe5a5

Add compile_collection_summary_message

37c0353

Minor

5563b4c

Add ParsedFileWriter

003b157

Add function to namespace

8b71af5

Update priority order

4acff6e

MVP of collect/extract split

5d4b4d9

ppinchuk added 28 commits May 29, 2026 20:56

print report if requested

0d47e7b

Bump elm dep

5585cf7

Merge remote-tracking branch 'origin/main' into pp/split_doc_collection

ba47a2f

Update lockfile

aec0e7c

Use elm

1742572

Move code

215b453

Use new module

60ba3e3

Minor pipeline runtime refactor

a90e183

Use pipeline runtime

bae7238

Fix tests

d4e6e8c

Always use holistic search

08b08ac

Add new parameter

d84fade

Search sort now controlled by user input

aeb5e20

Doc update for data classes

45c16b9

Jurisdiction validation is now on a per-document basis

54c1d59

Fix tests

e41d7f8

Fix tests

6a0f20b

Remove redundant class

213be75

Use class for readibility

2484f1f

Keep log messages together

b04c9fd

More logging

9ba3014

Always get max number of results

29d001d

Sort results before output

6ece7f8

Add raw_pages_from_doc

a671b84

Fix docling read file bug

8406900

Sort on collection rank step, if possible

df599e3

Revert unintentional change

9cb1e9e

Fix tests

f2968bd

castelao reviewed May 31, 2026

ppinchuk wants to merge 123 commits into main from pp/split_doc_collection


3 participants
