
Add 404 reporter and PR maker#9390

Open
kaste wants to merge 8 commits into sublimehq:master from kaste:toolz

kaste (Contributor) commented Apr 24, 2026

This is somewhat complex. First, it adds tools/report_404_packages, which reports packages that respond with 404 according to thecrawl.

The pretty output looks like this:

$ uv run -m tools.report_404_packages
testify               [since 2026-02-28; 7 weeks]
NHP Syntax Highlight  [since 2026-03-07; 6 weeks]
Rune.js Completions   [since 2026-03-15; 5 weeks]
jQuery Mobile Demos   [since 2026-03-18; 5 weeks]
KarmaRunner           [since 2026-03-21; 4 weeks]
LazyTimeTracker       [since 2026-03-28; 3 weeks]
ProjectEnvironment    [since 2026-03-29; 3 weeks]
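
A rough sketch of how such a listing could be produced, not the tool's actual code. `format_report` is a hypothetical helper, and rounding the age down to whole weeks is an assumption; it happens to agree with the sample output above for a run on 2026-04-23.

```python
from datetime import date


def format_report(rows, today):
    """Render (name, failing_since) pairs as aligned report lines.

    Hypothetical helper: the real tool's formatting code may differ.
    """
    # Pad every name to the width of the longest one so the brackets line up.
    width = max((len(name) for name, _ in rows), default=0)
    lines = []
    for name, since in rows:
        # Assumption: the age is rounded down to whole weeks.
        weeks = (today - since).days // 7
        unit = "week" if weeks == 1 else "weeks"
        lines.append(
            f"{name.ljust(width)}  [since {since.isoformat()}; {weeks} {unit}]"
        )
    return lines


rows = [
    ("testify", date(2026, 2, 28)),
    ("NHP Syntax Highlight", date(2026, 3, 7)),
]
for line in format_report(rows, today=date(2026, 4, 23)):
    print(line)
```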

You can also patch the repository and make a commit with `uv run -m tools.report_404_packages --commit`, and then make a PR manually.

This was the easy part.

As a last step, this installs a GitHub Actions workflow that runs this checker on a schedule, or manually via workflow_dispatch, and opens PRs.

To initialize the cache, the action must be triggered manually once. The scheduled run aborts if the cache is not initialized; the intent is to abort during GitHub outages.
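
For illustration only, the gate described above could look roughly like this in Python. The function name `should_abort` and passing the event name as a plain string are assumptions; the actual workflow may implement this check in shell or YAML. The event names `schedule` and `workflow_dispatch` are the values GitHub Actions reports for those triggers.

```python
from pathlib import Path


def should_abort(event_name: str, cache_file: Path) -> bool:
    """Return True when a scheduled run finds no initialized cache.

    Hypothetical helper, not taken from the PR.
    """
    # Manual runs (workflow_dispatch) are what create the cache,
    # so they never abort here.
    if event_name != "schedule":
        return False
    # A missing cache on a scheduled run is treated as a failure
    # (for example, a cache restore that did not work).
    return not cache_file.is_file()
```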

Fixes packagecontrol/thecrawl#353

kaste added 7 commits April 23, 2026 20:01
Add a new tools/report_404_packages.py utility to identify packages that
have been failing with fatal 404 errors for at least a minimum age.

Report them in normal mode.

Remove them and make a commit in non-dry --commit mode with a nice
commit message. It is up to the user to then make a branch or a PR.

Support skipping known failures when reporting unreachable packages.
The reporter now accepts --ignore values and --ignore-file entries,
with comma-separated parsing and # comments in ignore files.

Filtering applies to package names and details URLs so scheduled
automation can avoid re-reporting previously handled packages.

Also add tests for ignore behavior and document the new options
in tools/README.md.
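
A minimal sketch of the ignore handling this commit describes, under stated assumptions: the helper names are hypothetical, and only lines whose first non-blank character is `#` are treated as comments (so that URLs containing `#` fragments survive). The real tool may parse comments differently.

```python
def parse_ignore_values(values):
    """Split (possibly repeated) comma-separated --ignore values into a set."""
    ignored = set()
    for value in values:
        for part in value.split(","):
            part = part.strip()
            if part:
                ignored.add(part)
    return ignored


def parse_ignore_file(text):
    """Parse ignore-file content: one or more entries per line, # comments."""
    ignored = set()
    for line in text.splitlines():
        # Assumption: only whole-line comments are supported.
        if line.lstrip().startswith("#"):
            continue
        ignored |= parse_ignore_values([line])
    return ignored


def is_ignored(name, details_url, ignored):
    """An entry matches either the package name or its details URL."""
    return name in ignored or details_url in ignored
```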
Extend the machine-readable -z output records to include package
details URLs between the package name and failing_since timestamp.
This preserves URL context for downstream automation and avoids
losing identifiers when collecting prior reports.

Update argparse help text and adjust unit tests to assert the new
name\0details\0timestamp record shape.
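
To make the record shape concrete, here is a sketch of writing and reading such records. It assumes every field, including the last one of each record, is terminated by a NUL byte, and that timestamps are plain strings; neither detail is confirmed by the commit message, and both function names are hypothetical.

```python
def encode_records(records):
    """Serialize (name, details, failing_since) triples, NUL-terminating each field."""
    return "".join(
        f"{name}\0{details}\0{since}\0" for name, details, since in records
    )


def decode_records(data):
    """Parse the flat NUL-separated stream back into triples."""
    fields = data.split("\0")
    # The terminator after the final field leaves one empty trailing item.
    if fields and fields[-1] == "":
        fields.pop()
    if len(fields) % 3 != 0:
        raise ValueError("expected name, details, timestamp triples")
    return [tuple(fields[i:i + 3]) for i in range(0, len(fields), 3)]
```

Keeping the details URL in each record lets downstream steps collect already-reported URLs without looking the package up again.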
Add a --build-pr-message option to tools.report_404_packages.
When reportable packages are found, it writes pr_title.txt and
pr_body.md in the target root.

This moves PR message formatting out of the workflow and into the
Python tool, where it is easier to unit test and maintain.

Also add tests for singular/plural message rendering and file
writing, and document the new option in tools/README.md.
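
A sketch of what building and writing the PR message might look like. Only the file names pr_title.txt and pr_body.md come from the commit message; the function names and the exact wording of the title and body are invented for illustration.

```python
from pathlib import Path


def build_pr_message(names):
    """Return (title, body) with singular or plural wording.

    The wording here is a placeholder, not the tool's actual text.
    """
    count = len(names)
    if count == 1:
        title = "Remove 1 package that responds with 404"
        intro = "This package has been responding with 404:"
    else:
        title = f"Remove {count} packages that respond with 404"
        intro = "These packages have been responding with 404:"
    body = "\n".join([intro, "", *(f"- {name}" for name in names)]) + "\n"
    return title, body


def write_pr_message(root, names):
    """Write pr_title.txt and pr_body.md under root; skip when nothing is reportable."""
    if not names:
        return False
    title, body = build_pr_message(names)
    root = Path(root)
    (root / "pr_title.txt").write_text(title + "\n", encoding="utf-8")
    (root / "pr_body.md").write_text(body, encoding="utf-8")
    return True
```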
Add a GitHub Actions workflow to run report_404_packages on a
schedule and via workflow_dispatch.

When packages are found, it commits removals, opens a pull request,
and uses a deterministic bot/report-404-<hash> branch name derived
from report output. It aborts if the branch already exists to prevent
duplicate PRs.

The workflow also restores and updates a cached reported_urls.txt file.
This file is required for automated runs and must be generated by a
manual `workflow_dispatch` initially.

Previously, reported_urls.txt grew without bound.

Go through all reported URLs (line by line) and keep the URLs that are
still referenced in the workspace.

Rationale: reported_urls is a gate that prevents reporting a URL more
than once. A URL that is no longer referenced in the workspace cannot
be reported again, so dropping it from the file carries no risk.
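
The pruning step can be sketched as follows. This is an approximation under assumptions: `prune_reported_urls` is a hypothetical name, the workspace is passed in as already-loaded file contents, and "still referenced" is modelled as a plain substring match, which the actual workflow may do differently (for example with grep).

```python
def prune_reported_urls(reported_lines, workspace_texts):
    """Keep only reported URLs that still occur somewhere in the workspace."""
    texts = list(workspace_texts)
    kept = []
    for line in reported_lines:
        url = line.strip()
        # Note: a substring match can also hit longer URLs that merely
        # share this prefix; that only keeps a harmless extra entry.
        if url and any(url in text for text in texts):
            kept.append(url)
    return kept
```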
@github-actions

Package Review

Channel Diff

Removed (none), changed (none), added (none).

Result

No changed or added packages to review.

The report workflow used a deterministic branch name based only on the
reported record hash. That made the push and PR creation sequence fragile:
if the push succeeded but `gh pr create` failed, the workflow could get
stuck. The URLs were not marked as reported, but future runs with the same
hash would refuse to reuse the existing remote branch and would never retry
PR creation.

Make the branch name include the GitHub run id and run attempt. This keeps
the report hash in the name for human inspection, while ensuring reruns and
future scheduled runs can create a fresh branch instead of being blocked by
a stranded one.

This intentionally chooses the simple recovery path over extra PR lookup and
reconciliation logic. The workflow runs infrequently, so avoiding a durable
dead state is more valuable than preventing the occasional orphaned bot
branch, which can be cleaned up manually if needed.
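
As a sketch of the naming scheme described in this commit: the `bot/report-404-` prefix comes from the earlier commit message, while the hash algorithm, the truncation to 12 characters, and the order of the suffixes are assumptions. In a workflow, the run id and attempt would typically come from the `GITHUB_RUN_ID` and `GITHUB_RUN_ATTEMPT` environment variables.

```python
import hashlib


def branch_name(report_output: bytes, run_id: str, run_attempt: str) -> str:
    """Build a branch name that is unique per run attempt.

    Hypothetical helper; the real workflow may compute this in shell.
    """
    # The report hash stays in the name so humans can see which report
    # a branch belongs to.
    digest = hashlib.sha256(report_output).hexdigest()[:12]
    # Run id and attempt make reruns and later scheduled runs produce a
    # fresh branch instead of colliding with a stranded one.
    return f"bot/report-404-{digest}-{run_id}-{run_attempt}"
```

With this shape, a failed `gh pr create` leaves at most an orphaned branch behind, and the next run is free to push and retry.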
@github-actions

Package Review

Channel Diff

Removed (none), changed (none), added (none).

Result

No changed or added packages to review.

@kaste kaste mentioned this pull request Apr 27, 2026

Development

Successfully merging this pull request may close these issues.

Send automatic PR's when packages 404 for a long time
