Split up doc collection and data extraction #447

Merged: ppinchuk merged 160 commits into main from pp/split_doc_collection on Jun 1, 2026

Conversation

@ppinchuk (Collaborator)

Add new CLI commands that run document collection and data extraction independently, so that different parts of the pipeline can run in different places.
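
As a rough illustration of the split described above, the sketch below maps three subcommands to the pipeline phases each one would run. It is not the COMPASS CLI: `RunMode`, `planned_phases`, `build_parser`, and the `compass-sketch` program name are hypothetical, and `argparse` stands in for whatever CLI framework the project actually uses.

```python
import argparse
from enum import Enum


class RunMode(Enum):
    """Hypothetical run modes, one per CLI subcommand."""

    PROCESS = "process"  # collect, then extract, in a single run
    COLLECT = "collect"  # gather documents only
    EXTRACT = "extract"  # extract data from previously collected documents


def planned_phases(mode):
    """Return the ordered pipeline phases that a given mode would run."""
    phases = []
    if mode in (RunMode.PROCESS, RunMode.COLLECT):
        phases.append("collect documents")
    if mode in (RunMode.PROCESS, RunMode.EXTRACT):
        phases.append("extract data")
    return phases


def build_parser():
    """Build a toy parser that exposes one subcommand per run mode."""
    parser = argparse.ArgumentParser(prog="compass-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for mode in RunMode:
        subparsers.add_parser(mode.value)
    return parser


def main(argv=None):
    """Parse the subcommand and report which phases it would trigger."""
    args = build_parser().parse_args(argv)
    return planned_phases(RunMode(args.command))
```

Under these assumptions, `main(["collect"])` plans only the collection phase, so the extraction phase can be run later, possibly on a different machine.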

ppinchuk marked this pull request as ready for review on June 1, 2026, 07:07
Copilot AI review requested due to automatic review settings on June 1, 2026, 07:07
Copilot AI (Contributor) left a comment

Pull request overview

This PR refactors COMPASS orchestration to support splitting the pipeline into independent document collection and data extraction phases, exposed via new CLI commands (collect, extract) while keeping process as the end-to-end workflow. It also introduces manifest-based persistence to replay extraction later, and updates supporting web/search, loaders, and validators accordingly.
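
To make the manifest idea concrete, here is a minimal sketch of how a collection phase could record what it gathered so that a later extraction phase can replay it. The file layout, the `manifest.json` name, the JSON fields, and both function names are assumptions made for illustration; the actual format lives in `compass/pipeline/collection/persistence.py` and may differ.

```python
import hashlib
import json
from pathlib import Path


def write_manifest(out_dir, jurisdiction, docs):
    """Collection phase: save each document and record it in a manifest.

    `docs` maps a source URL to the document text (hypothetical shape).
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    entries = []
    for index, (source, text) in enumerate(docs.items()):
        doc_path = out_dir / f"doc_{index}.txt"
        doc_path.write_text(text, encoding="utf-8")
        entries.append({
            "source": source,
            # Store a relative name so the whole directory can be moved.
            "file": doc_path.name,
            "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
        })
    manifest = {"jurisdiction": jurisdiction, "documents": entries}
    manifest_path = out_dir / "manifest.json"
    manifest_path.write_text(json.dumps(manifest, indent=2), encoding="utf-8")
    return manifest_path


def replay_manifest(manifest_path):
    """Extraction phase: reload the listed documents, verifying checksums."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    for entry in manifest["documents"]:
        doc_path = manifest_path.parent / entry["file"]
        text = doc_path.read_text(encoding="utf-8")
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if digest != entry["sha256"]:
            raise ValueError(f"Checksum mismatch for {entry['file']}")
        yield entry["source"], text
```

Because the manifest refers to files by relative name, the output directory in this sketch can be copied elsewhere before extraction, which is the kind of split workflow the overview describes.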

Changes:

  • Add a new compass.pipeline orchestration layer (runtime + coordinator + per-jurisdiction workflow) supporting PROCESS, COLLECT, and EXTRACT modes.
  • Add CLI subcommands collect and extract, and consolidate shared CLI behavior into _cli/common.py.
  • Introduce new web search and persistence utilities (holistic search filtering, collection manifests/shards), plus supporting updates across loaders, validators, and tests/docs.
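
The search filtering named in the last bullet (blacklist, duplicate, and top-N filters) can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in `compass/web/search.py`: the function name, its parameters, and the rule used to decide that two URLs are duplicates are all hypothetical.

```python
from urllib.parse import urlsplit


def filter_search_results(ranked_urls, blacklist=(), top_n=5):
    """Apply blacklist, duplicate, and top-N filters to ranked URLs.

    `ranked_urls` is assumed to be ordered best-first already.
    """
    blocked = {domain.lower() for domain in blacklist}
    seen = set()
    kept = []
    for url in ranked_urls:
        # Top-N filter: stop as soon as enough results have been kept.
        if len(kept) >= top_n:
            break
        parts = urlsplit(url)
        host = (parts.hostname or "").lower()
        # Blacklist filter: drop blocked domains and their subdomains.
        if any(host == d or host.endswith("." + d) for d in blocked):
            continue
        # Duplicate filter: ignore scheme, fragment, and trailing slash.
        key = (host, parts.path.rstrip("/"), parts.query)
        if key in seen:
            continue
        seen.add(key)
        kept.append(url)
    return kept
```

Applying the filters in a single pass keeps the original ranking order, so the top-N cut always retains the best-ranked survivors.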

Reviewed changes

Copilot reviewed 74 out of 75 changed files in this pull request and generated 9 comments.

| File | Description |
|---|---|
| `tests/python/unit/web/test_web_search.py` | New unit coverage for holistic web-search filtering and ranking behavior |
| `tests/python/unit/web/test_web_crawl.py` | Adjust crawl tests for doc-type detection + URL sanitization behavior |
| `tests/python/unit/validation/test_validation_location.py` | Update tests for new _weighted_vote/raw-page handling API |
| `tests/python/unit/utilities/test_utilities_parsing.py` | Update expectations for relative path stringification + add absolute-path test |
| `tests/python/unit/utilities/test_utilities_io.py` | Add cross-platform coverage for Windows-style relative paths |
| `tests/python/unit/utilities/test_utilities_base.py` | Remove WebSearchParams tests after refactor/move |
| `tests/python/unit/test_exceptions.py` | Docstring formatting tweaks to match repo style |
| `tests/python/unit/services/test_services_threaded.py` | Update expectations for renamed output file suffixes + removed fields |
| `tests/python/unit/services/test_services_provider.py` | Docstring formatting tweaks |
| `tests/python/unit/scripts/test_search.py` | Remove search-engine resolution/filter tests moved to compass.web.search |
| `tests/python/unit/scripts/test_process.py` | Removed (replaced by pipeline orchestration tests) |
| `tests/python/unit/pipeline/test_pipeline_orchestration.py` | New end-to-end-ish unit tests for collect/extract/process orchestration + manifest flow |
| `tests/python/unit/pipeline/test_pipeline_data_classes.py` | New tests for request/search settings normalization (WebSearchParams.se_kwargs) |
| `tests/python/unit/pipeline/test_pipeline_collection_steps.py` | New unit tests ensuring collection steps pass correct loader kwargs |
| `tests/python/unit/cli/test_cli_search.py` | Update CLI tests to use shared cli_runner fixture + new request/config expectations |
| `tests/python/unit/cli/test_cli_main.py` | New test ensuring CLI exposes new subcommands |
| `tests/python/unit/cli/test_cli_common.py` | Update tests to point to _cli/common.py out-dir handling helpers |
| `tests/python/conftest.py` | Add shared session cli_runner fixture + docstring formatting tweaks |
| `pyproject.toml` | Bump nlr-elm dependency version |
| `examples/parse_existing_docs/CLI/README.rst` | Document new collect/extract workflow for local docs |
| `examples/execution_basics/README.rst` | Document run modes and manifest-based split workflow |
| `docs/source/dev/README.rst` | Update intersphinx example to new pipeline entrypoint |
| `compass/web/website_crawl.py` | Switch PDF detection to attribute/type-based helper (is_pdf_doc) |
| `compass/web/url_utils.py` | Preserve existing percent-encoding via expanded safe character sets |
| `compass/web/search.py` | New holistic search implementation with blacklist/duplicate/top-N filters |
| `compass/web/file_loader.py` | Filter empty docs, preserve remote source URI separately, use Docling local loader backend |
| `compass/validation/utilities.py` | New helper for dynamic validation thresholds |
| `compass/validation/location.py` | Refactor raw-page handling and threshold computation; improve debug logging |
| `compass/validation/graphs.py` | Add TODO note (with a typo to fix) near subdivision handling |
| `compass/validation/content.py` | Add dynamic thresholding + improved logging and chunk-processing defaults |
| `compass/utilities/parsing.py` | Add is_pdf_doc + raw_pages_from_doc; change path-to-string conversion behavior |
| `compass/utilities/nt.py` | Remove legacy ProcessKwargs namedtuple |
| `compass/utilities/logs.py` | Docstring formatting tweak |
| `compass/utilities/io.py` | Improve config error messages and normalize Windows-style paths in resolve_path |
| `compass/utilities/finalize.py` | Add collection summary message formatter |
| `compass/utilities/enums.py` | Add run-mode and collection-step enums with priorities/action strings |
| `compass/utilities/base.py` | Move/remove WebSearchParams; adjust output directory structure for collect-only mode |
| `compass/utilities/__init__.py` | Export new collection summary helper; remove ProcessKwargs export |
| `compass/services/threaded.py` | Add parsed-text writer + temp-file copy service; unify PDF detection; adjust file naming |
| `compass/services/provider.py` | Docstring formatting tweaks |
| `compass/services/cpu.py` | Preserve original remote URL separately for Docling HTML base URI |
| `compass/services/base.py` | Docstring formatting tweak |
| `compass/scripts/search.py` | Refactor search-only to operate on pipeline request/runtime and new compass.web.search |
| `compass/scripts/download.py` | Use new web search; add collection step ranking + optional jurisdiction validation exemption |
| `compass/plugin/registry.py` | Add resolve_plugin helper with consistent COMPASS exception behavior |
| `compass/plugin/interface.py` | Remove need_jurisdiction_verification plumbing from plugin filter API |
| `compass/plugin/base.py` | Update abstract filter_docs signature to match interface |
| `compass/plugin/__init__.py` | Export resolve_plugin |
| `compass/pipeline/runtime.py` | New runtime context: services, semaphores, loader kwargs, logging setup |
| `compass/pipeline/jurisdiction.py` | New per-jurisdiction workflow supporting process/collect/extract behaviors |
| `compass/pipeline/extraction.py` | New extraction workflow for prepared documents |
| `compass/pipeline/coordinator.py` | New top-level orchestrator selecting mode strategy + finalization routines |
| `compass/pipeline/collection/steps.py` | New fixed collection steps (known docs/URLs/SE/ELM crawl/COMPASS crawl) |
| `compass/pipeline/collection/persistence.py` | New manifest/shard persistence + replay loader for extraction |
| `compass/pipeline/collection/dedupe.py` | New deduplication domain service keyed by checksum (with fallback) |
| `compass/pipeline/collection/base.py` | New collection workflow applying ordered steps + optional eager extraction |
| `compass/pipeline/collection/__init__.py` | Export collection workflow |
| `compass/pipeline/__init__.py` | Export request/data-class types and model builder |
| `compass/pb.py` | Add progress-bar reset + support action label in main task |
| `compass/llm/config.py` | Add printable text splitter wrapper (docstring typo present) |
| `compass/extraction/water/plugin.py` | Update plugin method signature to match new base/interface |
| `compass/extraction/date.py` | Refactor to use raw_pages_from_doc and concurrent LLM calls (contains a logic bug) |
| `compass/extraction/apply.py` | Adjust default chunk-processing minimum and minor docstring formatting |
| `compass/data/domains.json5` | Add default URL blacklist/whitelist data file |
| `compass/data/conus_jurisdictions.csv` | Fill in some missing jurisdiction website URLs |
| `compass/_cli/search.py` | Adapt search CLI to new request-based search + optional JSON stdout handling |
| `compass/_cli/process.py` | Refactor to shared async runner and normalize option naming |
| `compass/_cli/main.py` | Register new collect and extract commands |
| `compass/_cli/finalize.py` | Switch model init to pipeline _build_models |
| `compass/_cli/extract.py` | New CLI subcommand for manifest-based extraction |
| `compass/_cli/common.py` | New shared CLI runner: logging, progress, out-dir conflict handling |
| `compass/_cli/collect.py` | New CLI subcommand for collection-only runs |
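
The summary for `compass/pipeline/collection/dedupe.py` mentions deduplication keyed by checksum with a fallback. One way such a scheme could work is sketched below. The `Doc` class, its fields, and the choice of a text hash as the fallback key are assumptions for illustration only; the PR summary does not say what the real fallback is.

```python
import hashlib
from dataclasses import dataclass
from typing import Optional


@dataclass
class Doc:
    """Hypothetical document record used only for this sketch."""

    source: str
    text: str
    checksum: Optional[str] = None  # may be missing for some loaders


def dedupe_key(doc):
    """Prefer the stored checksum; otherwise fall back to hashing the text.

    The prefixes keep the two key spaces separate, so a stored checksum
    is never compared against a text hash computed here.
    """
    if doc.checksum:
        return "checksum:" + doc.checksum
    digest = hashlib.sha256(doc.text.encode("utf-8")).hexdigest()
    return "text:" + digest


def dedupe(docs):
    """Keep the first document seen for each key, preserving input order."""
    seen = set()
    unique = []
    for doc in docs:
        key = dedupe_key(doc)
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique
```

Keeping the first occurrence means that documents from earlier collection steps win over later duplicates, which seems a natural fit when steps run in a fixed order, though the real service may rank duplicates differently.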

Comment thread compass/extraction/date.py
Comment thread compass/validation/location.py
Comment thread compass/scripts/download.py
Comment thread compass/pipeline/coordinator.py
Comment thread compass/scripts/search.py
Comment thread compass/web/search.py
Comment thread compass/validation/graphs.py
Comment thread compass/llm/config.py Outdated
Comment thread compass/validation/utilities.py
ppinchuk linked an issue on Jun 1, 2026 that may be closed by this pull request
ppinchuk merged commit 9a6f60e into main on Jun 1, 2026
31 checks passed
ppinchuk deleted the pp/split_doc_collection branch on June 1, 2026, 20:24

Labels

  • breaking: Breaks something in the API or config
  • enhancement: Update to logic or general code improvements
  • p-critical: Priority: critical
  • topic-python-async: Issues/pull requests related to python async code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Separate search from LLM

4 participants