Add a new tools/report_404_packages.py utility to identify packages that have been failing with fatal 404 errors for at least a minimum age. In normal mode, report them. In non-dry --commit mode, remove them and create a commit with a descriptive commit message. It is then up to the user to create a branch or a PR.
Support skipping known failures when reporting unreachable packages. The reporter now accepts --ignore values and --ignore-file entries, with comma-separated parsing and # comments in ignore files. Filtering applies to package names and details URLs so scheduled automation can avoid re-reporting previously handled packages. Also add tests for ignore behavior and document the new options in tools/README.md.
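The ignore handling described above can be sketched roughly as follows. This is a minimal illustration, not the tool's actual code: the function names `parse_ignore_values`, `parse_ignore_file`, and `filter_ignored` are hypothetical, and it assumes that only whole-line `#` comments are stripped, so URL fragments are not clipped.

```python
from pathlib import Path


def parse_ignore_values(values):
    """Split comma-separated ignore values into a set of entries."""
    entries = set()
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if part:
                entries.add(part)
    return entries


def parse_ignore_file(path):
    """Read an ignore file, skipping blank lines and whole-line # comments.

    Assumption: inline comments are not supported, because '#' can
    legitimately appear inside a URL fragment.
    """
    lines = []
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            lines.append(line)
    return parse_ignore_values(lines)


def filter_ignored(packages, ignored):
    """Drop (name, details_url) pairs whose name or URL is ignored."""
    return [
        (name, details)
        for name, details in packages
        if name not in ignored and details not in ignored
    ]
```

Matching on both the name and the details URL means a package stays filtered even if only one of the two identifiers was recorded by an earlier run.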
Extend the machine-readable -z output records to include package details URLs between the package name and failing_since timestamp. This preserves URL context for downstream automation and avoids losing identifiers when collecting prior reports. Update argparse help text and adjust unit tests to assert the new name\0details\0timestamp record shape.
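To illustrate the name\0details\0timestamp record shape, here is a small sketch of writing and reading such records. The helper names are hypothetical, and it assumes every field (including the last one of each record) is NUL-terminated, so the stream is a flat sequence of fields grouped in threes; the real tool may delimit records differently.

```python
def format_record(name, details, failing_since):
    """Render one record as name\\0details\\0timestamp\\0 (assumed layout)."""
    return f"{name}\0{details}\0{failing_since}\0"


def parse_records(data):
    """Split a NUL-separated stream into (name, details, timestamp) tuples."""
    fields = data.split("\0")
    # A NUL-terminated stream leaves one empty trailing field after split.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) % 3 != 0:
        raise ValueError("expected a multiple of three NUL-separated fields")
    return [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]
```

Because NUL cannot occur in package names or URLs, a downstream script can split on it without worrying about spaces or newlines in the fields.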
Add a --build-pr-message option to tools.report_404_packages. When reportable packages are found, it writes pr_title.txt and pr_body.md in the target root. This moves PR message formatting out of the workflow and into the Python tool, where it is easier to unit test and maintain. Also add tests for singular/plural message rendering and file writing, and document the new option in tools/README.md.
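A rough sketch of what such message building could look like is shown below. Only the file names pr_title.txt and pr_body.md come from the description above; the function names and the exact title and body wording are assumptions made for illustration.

```python
from pathlib import Path


def build_pr_message(names):
    """Return (title, body) with singular/plural wording (hypothetical text)."""
    count = len(names)
    noun = "package" if count == 1 else "packages"
    title = f"Remove {count} unreachable {noun}"
    body_lines = [f"The following {noun} returned fatal 404 errors:", ""]
    body_lines += [f"- {name}" for name in sorted(names)]
    return title, "\n".join(body_lines) + "\n"


def write_pr_message(root, names):
    """Write pr_title.txt and pr_body.md into root; skip when nothing to report."""
    if not names:
        return False
    title, body = build_pr_message(names)
    root = Path(root)
    (root / "pr_title.txt").write_text(title + "\n", encoding="utf-8")
    (root / "pr_body.md").write_text(body, encoding="utf-8")
    return True
```

Keeping the rendering in a pure function like `build_pr_message` is what makes the singular and plural cases easy to cover in unit tests, without touching the file system.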
Add a GitHub Actions workflow to run report_404_packages on a schedule and via workflow_dispatch. When packages are found, it commits the removals, opens a pull request, and uses a deterministic bot/report-404-<hash> branch name derived from the report output. It aborts if the branch already exists to prevent duplicate PRs. The workflow also restores and updates a cached reported_urls.txt file. This file is required for automated runs and must initially be generated by a manual `workflow_dispatch` run.
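The cache requirement above amounts to a small gate, sketched here in Python for illustration. The actual workflow may well implement this as a shell step; the function name `should_abort` is hypothetical. `GITHUB_EVENT_NAME` is the environment variable GitHub Actions sets to the triggering event, such as `schedule` or `workflow_dispatch`.

```python
import os
from pathlib import Path


def should_abort(event_name, cache_file):
    """Scheduled runs need an existing cache; manual runs may create it."""
    return event_name == "schedule" and not Path(cache_file).is_file()


if __name__ == "__main__":
    event = os.environ.get("GITHUB_EVENT_NAME", "workflow_dispatch")
    if should_abort(event, "reported_urls.txt"):
        raise SystemExit("reported_urls.txt missing; run workflow_dispatch once first")
```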
Previously, reported_urls.txt grew without bound. Now go through all reported URLs line by line and keep only the URLs that are still referenced in the workspace. Rationale: since reported_urls.txt is only a gate to prevent reporting URLs more than once, there is no risk in forgetting URLs that are no longer found in the workspace.
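The pruning step can be sketched as below. This is an assumption-laden illustration: the function name and the idea of passing an explicit list of workspace files are hypothetical, and the check is a plain substring match, so a URL that is a prefix of another referenced URL would also be kept.

```python
from pathlib import Path


def prune_reported_urls(reported_file, workspace_files):
    """Keep only reported URLs that still occur in some workspace file."""
    reported_file = Path(reported_file)
    # Concatenate the workspace contents once so each URL needs one lookup.
    corpus = "\n".join(
        Path(path).read_text(encoding="utf-8", errors="replace")
        for path in workspace_files
    )
    urls = [
        line.strip()
        for line in reported_file.read_text(encoding="utf-8").splitlines()
        if line.strip()
    ]
    kept = [url for url in urls if url in corpus]
    reported_file.write_text("".join(f"{url}\n" for url in kept), encoding="utf-8")
    return kept
```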
Package Review. Channel Diff: Removed (none), changed (none), added (none). Result: No changed or added packages to review.
The report workflow used a deterministic branch name based only on the reported record hash. That made the push and PR creation sequence fragile: if the push succeeded but `gh pr create` failed, the workflow could get stuck. The URLs were not marked as reported, but future runs with the same hash would refuse to reuse the existing remote branch and would never retry PR creation. Make the branch name include the GitHub run id and run attempt. This keeps the report hash in the name for human inspection, while ensuring reruns and future scheduled runs can create a fresh branch instead of being blocked by a stranded one. This intentionally chooses the simple recovery path over extra PR lookup and reconciliation logic. The workflow runs infrequently, so avoiding a durable dead state is more valuable than preventing the occasional orphaned bot branch, which can be cleaned up manually if needed.
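A possible shape for such a branch name is sketched below. The `bot/report-404-` prefix comes from the workflow description; the SHA-256 digest, its 12-character truncation, and the ordering of the suffixes are assumptions for illustration only. `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` are the environment variables GitHub Actions provides for the run id and attempt number.

```python
import hashlib
import os


def branch_name(report, run_id, run_attempt):
    """Combine the report hash with run id and attempt (assumed format)."""
    digest = hashlib.sha256(report).hexdigest()[:12]
    return f"bot/report-404-{digest}-{run_id}-{run_attempt}"


if __name__ == "__main__":
    # Placeholder report bytes; a real run would hash the tool's -z output.
    report = b"Foo\0https://example.com/foo\02024-01-01T00:00:00Z\0"
    run_id = os.environ.get("GITHUB_RUN_ID", "0")
    run_attempt = os.environ.get("GITHUB_RUN_ATTEMPT", "1")
    print(branch_name(report, run_id, run_attempt))
```

The hash portion stays identical for identical reports, so a human can still spot related branches, while the run id and attempt guarantee that a rerun never collides with a stranded branch.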
This is somewhat complex. First, it adds tools/report_404_packages, which can be used to report packages that respond with 404 according to thecrawl.
Pretty output is
You can also patch the repository and make a commit with `uv run -m tools.report_404_packages --commit` and then make a PR manually. This was the easy part.
As a last step, this installs a GitHub action that runs this checker on a schedule or manually via `workflow_dispatch` and opens PRs. To initialize the cache, the action must be triggered manually once. The scheduled run aborts if the cache is not initialized; that is, it is meant to abort on GitHub outages.
Fixes packagecontrol/thecrawl#353