Split up doc collection and data extraction #447
Merged
Pull request overview
This PR refactors COMPASS orchestration to support splitting the pipeline into independent document collection and data extraction phases, exposed via new CLI commands (collect, extract) while keeping process as the end-to-end workflow. It also introduces manifest-based persistence to replay extraction later, and updates supporting web/search, loaders, and validators accordingly.
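As a rough illustration of the manifest idea described above, the sketch below writes a small JSON manifest during a collection phase and replays it in a later extraction phase. The file name `manifest.json`, the record fields, and the functions `write_manifest` and `load_manifest` are all hypothetical; they are not taken from the COMPASS code base, whose actual manifest/shard format is not shown in this overview.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def write_manifest(out_dir, jurisdiction, docs):
    """Persist collected documents and a manifest describing them.

    `docs` maps a source URL to the collected text (hypothetical layout).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for index, (source, text) in enumerate(sorted(docs.items())):
        doc_path = out_dir / f"doc_{index}.txt"
        doc_path.write_text(text, encoding="utf-8")
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        entries.append(
            {"source": source, "file": doc_path.name, "sha256": checksum}
        )
    manifest_path = out_dir / "manifest.json"
    payload = {"jurisdiction": jurisdiction, "documents": entries}
    manifest_path.write_text(json.dumps(payload, indent=2), encoding="utf-8")
    return manifest_path


def load_manifest(manifest_path):
    """Replay a manifest: reload each document and verify its checksum."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    docs = {}
    for entry in manifest["documents"]:
        doc_path = manifest_path.parent / entry["file"]
        text = doc_path.read_text(encoding="utf-8")
        checksum = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if checksum != entry["sha256"]:
            raise ValueError(f"Checksum mismatch for {entry['file']}")
        docs[entry["source"]] = text
    return manifest["jurisdiction"], docs


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Phase 1: collect (here, two made-up documents)
        path = write_manifest(
            tmp, "Example County", {"https://example.com/a": "setback text"}
        )
        # Phase 2: extract later, possibly elsewhere, from the manifest alone
        print(load_manifest(path))
```

The point of the design is that the extraction phase needs only the output directory: everything it has to know about where each document came from travels with the manifest.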
Changes:
- Add a new `compass.pipeline` orchestration layer (runtime + coordinator + per-jurisdiction workflow) supporting `PROCESS`, `COLLECT`, and `EXTRACT` modes.
- Add CLI subcommands `collect` and `extract`, and consolidate shared CLI behavior into `_cli/common.py`.
- Introduce new web search and persistence utilities (holistic search filtering, collection manifests/shards), plus supporting updates across loaders, validators, and tests/docs.
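The three modes listed above can be pictured as one entry point that dispatches on a run-mode enum. This is a minimal sketch only: `RunMode`, `run`, `collect`, and `extract` are hypothetical names with stand-in bodies, and the real coordinator in `compass.pipeline` may be structured quite differently.

```python
from enum import Enum


class RunMode(Enum):
    """Hypothetical run modes mirroring PROCESS / COLLECT / EXTRACT."""

    PROCESS = "process"
    COLLECT = "collect"
    EXTRACT = "extract"


def collect(jurisdiction):
    # Stand-in for document collection (search, crawl, download).
    return [f"{jurisdiction}-ordinance.pdf"]


def extract(documents):
    # Stand-in for structured data extraction from prepared documents.
    return {doc: "extracted" for doc in documents}


def run(mode, jurisdiction, prepared_docs=None):
    """Dispatch to collection, extraction, or both, depending on the mode."""
    mode = RunMode(mode)  # accepts either a RunMode member or its string value
    if mode is RunMode.COLLECT:
        return {"documents": collect(jurisdiction)}
    if mode is RunMode.EXTRACT:
        if prepared_docs is None:
            raise ValueError("extract mode needs previously collected documents")
        return {"values": extract(prepared_docs)}
    # PROCESS: end-to-end, collection immediately followed by extraction
    documents = collect(jurisdiction)
    return {"documents": documents, "values": extract(documents)}
```

Keeping `PROCESS` as the composition of the other two modes means the split workflow and the end-to-end workflow exercise the same collection and extraction code paths.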
Reviewed changes
Copilot reviewed 74 out of 75 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| tests/python/unit/web/test_web_search.py | New unit coverage for holistic web-search filtering and ranking behavior |
| tests/python/unit/web/test_web_crawl.py | Adjust crawl tests for doc-type detection + URL sanitization behavior |
| tests/python/unit/validation/test_validation_location.py | Update tests for new _weighted_vote/raw-page handling API |
| tests/python/unit/utilities/test_utilities_parsing.py | Update expectations for relative path stringification + add absolute-path test |
| tests/python/unit/utilities/test_utilities_io.py | Add cross-platform coverage for Windows-style relative paths |
| tests/python/unit/utilities/test_utilities_base.py | Remove WebSearchParams tests after refactor/move |
| tests/python/unit/test_exceptions.py | Docstring formatting tweaks to match repo style |
| tests/python/unit/services/test_services_threaded.py | Update expectations for renamed output file suffixes + removed fields |
| tests/python/unit/services/test_services_provider.py | Docstring formatting tweaks |
| tests/python/unit/scripts/test_search.py | Remove search-engine resolution/filter tests moved to compass.web.search |
| tests/python/unit/scripts/test_process.py | Removed (replaced by pipeline orchestration tests) |
| tests/python/unit/pipeline/test_pipeline_orchestration.py | New end-to-end-ish unit tests for collect/extract/process orchestration + manifest flow |
| tests/python/unit/pipeline/test_pipeline_data_classes.py | New tests for request/search settings normalization (WebSearchParams.se_kwargs) |
| tests/python/unit/pipeline/test_pipeline_collection_steps.py | New unit tests ensuring collection steps pass correct loader kwargs |
| tests/python/unit/cli/test_cli_search.py | Update CLI tests to use shared cli_runner fixture + new request/config expectations |
| tests/python/unit/cli/test_cli_main.py | New test ensuring CLI exposes new subcommands |
| tests/python/unit/cli/test_cli_common.py | Update tests to point to _cli/common.py out-dir handling helpers |
| tests/python/conftest.py | Add shared session cli_runner fixture + docstring formatting tweaks |
| pyproject.toml | Bump nlr-elm dependency version |
| examples/parse_existing_docs/CLI/README.rst | Document new collect/extract workflow for local docs |
| examples/execution_basics/README.rst | Document run modes and manifest-based split workflow |
| docs/source/dev/README.rst | Update intersphinx example to new pipeline entrypoint |
| compass/web/website_crawl.py | Switch PDF detection to attribute/type-based helper (is_pdf_doc) |
| compass/web/url_utils.py | Preserve existing percent-encoding via expanded safe character sets |
| compass/web/search.py | New holistic search implementation with blacklist/duplicate/top-N filters |
| compass/web/file_loader.py | Filter empty docs, preserve remote source URI separately, use Docling local loader backend |
| compass/validation/utilities.py | New helper for dynamic validation thresholds |
| compass/validation/location.py | Refactor raw-page handling and threshold computation; improve debug logging |
| compass/validation/graphs.py | Add TODO note (with a typo to fix) near subdivision handling |
| compass/validation/content.py | Add dynamic thresholding + improved logging and chunk-processing defaults |
| compass/utilities/parsing.py | Add is_pdf_doc + raw_pages_from_doc; change path-to-string conversion behavior |
| compass/utilities/nt.py | Remove legacy ProcessKwargs namedtuple |
| compass/utilities/logs.py | Docstring formatting tweak |
| compass/utilities/io.py | Improve config error messages and normalize Windows-style paths in resolve_path |
| compass/utilities/finalize.py | Add collection summary message formatter |
| compass/utilities/enums.py | Add run-mode and collection-step enums with priorities/action strings |
| compass/utilities/base.py | Move/remove WebSearchParams; adjust output directory structure for collect-only mode |
| `compass/utilities/__init__.py` | Export new collection summary helper; remove ProcessKwargs export |
| compass/services/threaded.py | Add parsed-text writer + temp-file copy service; unify PDF detection; adjust file naming |
| compass/services/provider.py | Docstring formatting tweaks |
| compass/services/cpu.py | Preserve original remote URL separately for Docling HTML base URI |
| compass/services/base.py | Docstring formatting tweak |
| compass/scripts/search.py | Refactor search-only to operate on pipeline request/runtime and new compass.web.search |
| compass/scripts/download.py | Use new web search; add collection step ranking + optional jurisdiction validation exemption |
| compass/plugin/registry.py | Add resolve_plugin helper with consistent COMPASS exception behavior |
| compass/plugin/interface.py | Remove need_jurisdiction_verification plumbing from plugin filter API |
| compass/plugin/base.py | Update abstract filter_docs signature to match interface |
| `compass/plugin/__init__.py` | Export resolve_plugin |
| compass/pipeline/runtime.py | New runtime context: services, semaphores, loader kwargs, logging setup |
| compass/pipeline/jurisdiction.py | New per-jurisdiction workflow supporting process/collect/extract behaviors |
| compass/pipeline/extraction.py | New extraction workflow for prepared documents |
| compass/pipeline/coordinator.py | New top-level orchestrator selecting mode strategy + finalization routines |
| compass/pipeline/collection/steps.py | New fixed collection steps (known docs/URLs/SE/ELM crawl/COMPASS crawl) |
| compass/pipeline/collection/persistence.py | New manifest/shard persistence + replay loader for extraction |
| compass/pipeline/collection/dedupe.py | New deduplication domain service keyed by checksum (with fallback) |
| compass/pipeline/collection/base.py | New collection workflow applying ordered steps + optional eager extraction |
| `compass/pipeline/collection/__init__.py` | Export collection workflow |
| `compass/pipeline/__init__.py` | Export request/data-class types and model builder |
| compass/pb.py | Add progress-bar reset + support action label in main task |
| compass/llm/config.py | Add printable text splitter wrapper (docstring typo present) |
| compass/extraction/water/plugin.py | Update plugin method signature to match new base/interface |
| compass/extraction/date.py | Refactor to use raw_pages_from_doc and concurrent LLM calls (contains a logic bug) |
| compass/extraction/apply.py | Adjust default chunk-processing minimum and minor docstring formatting |
| compass/data/domains.json5 | Add default URL blacklist/whitelist data file |
| compass/data/conus_jurisdictions.csv | Fill in some missing jurisdiction website URLs |
| compass/_cli/search.py | Adapt search CLI to new request-based search + optional JSON stdout handling |
| compass/_cli/process.py | Refactor to shared async runner and normalize option naming |
| compass/_cli/main.py | Register new collect and extract commands |
| compass/_cli/finalize.py | Switch model init to pipeline _build_models |
| compass/_cli/extract.py | New CLI subcommand for manifest-based extraction |
| compass/_cli/common.py | New shared CLI runner: logging, progress, out-dir conflict handling |
| compass/_cli/collect.py | New CLI subcommand for collection-only runs |
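The table describes `compass/pipeline/collection/dedupe.py` as deduplication "keyed by checksum (with fallback)". One way such a key could work is sketched below. The document shape (a dict with `text` and `source`), the choice of SHA-256, and the fallback to the source string are all assumptions made for illustration, not the behavior of the actual module.

```python
import hashlib


def dedupe_key(doc):
    """Prefer a content checksum; fall back to the source if text is missing."""
    text = doc.get("text")
    if text:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        return "sha256:" + digest
    # Assumed fallback: documents without text are keyed by their source
    return "source:" + doc.get("source", "")


def dedupe(docs):
    """Keep the first document seen for each key, preserving input order."""
    seen = set()
    unique = []
    for doc in docs:
        key = dedupe_key(doc)
        if key in seen:
            continue
        seen.add(key)
        unique.append(doc)
    return unique
```

Because the first occurrence wins, the order in which collection steps run decides which copy of a duplicated document is kept.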
Add new CLI commands that run document collection and data extraction independently, so that the two phases of the pipeline can run in different places.