fix(github): bisect adaptive time windows for workflow runs 40k pagination cap #8844

Open
yamoyamoto wants to merge 1 commit into apache:main from yamoyamoto:fix/8842-workflow-runs-pagination-cap

@yamoyamoto
Contributor

⚠️ Pre Checklist

  • I have read through the Contributing Documentation.
  • I have added relevant tests.
  • I have added relevant documentation.
  • I will add labels to the PR, such as pr-type/bug-fix.

Summary

Repositories with more than ~40k GitHub Actions workflow runs cannot be collected today: Collect Workflow Runs hits GitHub's per_page * page > 40,000 cap in unfiltered mode and fails with HTTP 422, leaving _tool_github_runs empty.

This PR switches the collector to always use filtered mode (created=<from>..<to>) and adds adaptive time-window bisection to work around filtered mode's own 1,000-item-per-search cap. Leaf windows are collected through a single ApiCollector fed by an Input iterator, so the raw-table Delete still fires only once. github_graphql inherits the fix automatically via its existing CollectRunsMeta import.

Does this close any open issues?

Closes #8842

Screenshots

N/A — internal collector change with no UI surface.

Other Information

  • Unit tests in the new cicd_run_collector_test.go cover pagination edges, total_count >= 1000 / HTTP 422 bisection triggers, the non-overlapping integer-second split rule, bootstrap from epoch, and a thin integration check that the raw-table Delete is invoked exactly once regardless of how many leaf windows are produced. make build and make unit-test pass locally.
  • Validated in our environment: we built this branch into a container image on top of v1.0.3-beta10 and deployed it against our own DevLake instance pointing at a GitHub repository that had previously hit the 40k cap. The Collect Workflow Runs subtask now completes successfully and _tool_github_runs is populated as expected, so the change has been exercised against real GitHub traffic rather than only synthetic mocks.
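For readers unfamiliar with filtered mode, here is a small sketch of how a leaf window might be rendered into the `created=<from>..<to>` query parameter. `CreatedFilter` and `RunsQuery` are hypothetical helpers, not functions from this PR, and the sketch assumes RFC 3339 UTC timestamps and a range that is inclusive on both ends.

```go
package main

import (
	"net/url"
	"strconv"
	"time"
)

// CreatedFilter renders an inclusive [from, to] window of Unix seconds as a
// "<from>..<to>" range with RFC 3339 UTC timestamps (assumed format).
func CreatedFilter(from, to int64) string {
	f := time.Unix(from, 0).UTC().Format(time.RFC3339)
	t := time.Unix(to, 0).UTC().Format(time.RFC3339)
	return f + ".." + t
}

// RunsQuery builds the query parameters for one page of one leaf window.
// The parameter names follow the created/page/per_page terms used in the PR text.
func RunsQuery(from, to int64, page, perPage int) url.Values {
	q := url.Values{}
	q.Set("created", CreatedFilter(from, to))
	q.Set("page", strconv.Itoa(page))
	q.Set("per_page", strconv.Itoa(perPage))
	return q
}
```

Because adjacent windows end at `mid` and start at `mid + 1`, two sibling filters built this way never name the same second, which is what keeps a run from being collected twice under the inclusive-range assumption.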

@dosubot (bot) added labels on Apr 23, 2026: size:L (this PR changes 100-499 lines, ignoring generated files), component/plugins (this issue or PR relates to plugins), pr-type/bug-fix (this PR fixes a bug)
…ation cap

GitHub's /actions/runs enforces a 40k cap in unfiltered mode and a 1,000-item
cap per filtered search, making any repo with >40k workflow runs uncollectable.
Switch to filtered mode and recursively bisect time windows at integer-second
midpoints, probing via SubmitBlocking to share rate-limit with the main
collector and feeding leaves to a single ApiCollector so raw-table Delete
fires only once. See PR description for the full design rationale.

Closes apache#8842

Signed-off-by: yamoyamoto <yamo7yamoto@gmail.com>
@yamoyamoto force-pushed the fix/8842-workflow-runs-pagination-cap branch from fcf84bb to 35cf936 on April 23, 2026 09:33

Labels

  • component/plugins: This issue or PR relates to plugins
  • pr-type/bug-fix: This PR fixes a bug
  • size:L: This PR changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

[Bug][github] "Collect Workflow Runs" fails with HTTP 422 on large repos due to GitHub.com pagination cap
