Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos #8821
Replies: 1 comment
Hi @zaiddialpad! I'm Dosu and I'm helping the incubator-devlake team. Great questions. I dug into the codebase to give you concrete answers on all three.

1. Retry Logic: Yes, it exists, but with caveats.

If you're seeing 502s kill tasks even after retries, you may want to bump the retry settings. A closely related issue was reported in #8614 [2], where 502s and stream cancellations were caused by GitHub's server-side response size limits on large repos; a fix for that was merged in v1.0.3-beta8.

Both are addressed in the open PR #8791 [5], which is not yet merged. Given your GitHub App auth plus 340 repos (likely >1 hour syncs), this is almost certainly hitting you.

2. Cursor Commitment: Strict all-or-nothing, so your concern about gaps is partially valid.

3. Recommended Workarounds at Scale

Hope this helps make the pipelines more resilient. The retry mechanism is there, but the defaults are conservative for a 340+ repo deployment, and the two bugs in #8788 are likely compounding your failures significantly.

To reply, just mention @dosu.
Environment
DevLake Version: v1.0.3-beta9
Plugin: github_graphql
Source: GitHub Cloud (authenticated via GitHub App)
Deployment Scale: 340+ repositories running on a daily sync blueprint
The Context
First off, a huge thanks to the contributors for DevLake; it's been a game-changer for our internal metrics. However, as we've scaled our deployment to handle 346+ repositories, we've run into two interconnected issues that are making our production dashboards a bit shaky.
Currently, our daily pipelines experience 1-2 task failures every single run, primarily due to how transient errors are handled during the collection phase.
Issue 1: GitHub GraphQL 502 Bad Gateway (Transient)
We are seeing intermittent "graphql query got error" failures when GitHub returns a 502. While these are clearly transient on GitHub's end, the github_graphql plugin appears to treat them as fatal.
In a large-scale environment, it’s statistically inevitable that at least one repo will hit a 502 during a massive sync. Because there doesn't seem to be a built-in retry mechanism for these specific HTTP codes, a single "hiccup" from GitHub kills the entire task.
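To illustrate the kind of behavior we're hoping for, here is a rough sketch of a retry wrapper with exponential backoff and jitter (this is not DevLake code; httpError, transient, and withRetry are names we made up, and the real client would surface status codes differently):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// httpError stands in for whatever error type the GraphQL client surfaces;
// all we need from it is the status code.
type httpError struct{ StatusCode int }

func (e *httpError) Error() string {
	return fmt.Sprintf("graphql query got error: HTTP %d", e.StatusCode)
}

// transient reports whether a status code is worth retrying.
func transient(code int) bool {
	return code == http.StatusBadGateway ||
		code == http.StatusServiceUnavailable ||
		code == http.StatusGatewayTimeout
}

// withRetry runs query up to maxAttempts times, sleeping with exponential
// backoff plus jitter between transient failures and failing fast otherwise.
func withRetry(maxAttempts int, query func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = query(); err == nil {
			return nil
		}
		var he *httpError
		if !errors.As(err, &he) || !transient(he.StatusCode) {
			return err // permanent error: no point retrying
		}
		backoff := time.Duration(1<<attempt)*time.Second +
			time.Duration(rand.Intn(500))*time.Millisecond
		time.Sleep(backoff)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulate a query that 502s twice and then succeeds.
	err := withRetry(5, func() error {
		calls++
		if calls < 3 {
			return &httpError{StatusCode: http.StatusBadGateway}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Something along these lines, wired into the GraphQL collector with the attempt count and backoff exposed as connection or environment settings, would turn an occasional 502 into a short delay instead of a failed task.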
Issue 2: The "Silent" Incremental Collection Cursor Gap
When a Collect subtask fails halfway through (due to the 502 mentioned above), we’ve observed the following behavior:
Partial Write: Data collected before the 502 is already committed to the raw tables.
Cursor Stalls: The incremental cursor is not advanced because the subtask failed.
The Gap: On the next scheduled run, the incremental cursor picks up from the last successful run.
We are concerned that if the next run skips over the failed window, or if the logic assumes the data was already handled simply because it exists in the raw tables, we end up with silent data gaps in our domain tables. For an organization relying on these tables for DORA metrics, missing even a few PRs or Issues creates a significant trust issue with the data.
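To make sure we're asking the right question, here is a toy model of the commit semantics we hope are in place (again, not DevLake's actual code; collectPage, saveRaw, and the single cursor field are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// A toy model of one incremental Collect subtask, only meant to show why a
// cursor committed strictly on full success produces duplicates rather than
// gaps after a mid-run 502.

type state struct {
	cursor time.Time // checkpoint the next run will collect from
}

// collectPage pretends to fetch one page of records updated since `since`.
// In the real plugin this is where a transient 502 would surface.
func collectPage(since time.Time, page int) (records []string, done bool, err error) {
	return nil, true, nil
}

// saveRaw pretends to write records into the raw-layer tables.
func saveRaw(records []string) {}

// runCollection writes raw data page by page but only advances the cursor
// once every page has succeeded. If a 502 aborts the loop halfway, the
// partially written raw rows remain, the cursor stays put, and the next run
// re-collects the same window instead of skipping it.
func runCollection(s *state) error {
	start := time.Now()
	for page := 0; ; page++ {
		records, done, err := collectPage(s.cursor, page)
		if err != nil {
			return fmt.Errorf("page %d failed, cursor not advanced: %w", page, err)
		}
		saveRaw(records)
		if done {
			break
		}
	}
	s.cursor = start // committed only after total success
	return nil
}

func main() {
	s := &state{cursor: time.Now().Add(-24 * time.Hour)}
	fmt.Println(runCollection(s), "next cursor:", s.cursor)
}
```

Under that model, a mid-run 502 costs us duplicate raw rows on the next run rather than a gap, provided later stages dedupe on primary keys. Our worry is any code path that deviates from it, for example advancing the cursor per page or treating "raw rows already exist" as "this window is done".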
Questions for the Maintainers
Retry Logic: Does the github_graphql plugin currently support (or have plans for) configurable retries on transient HTTP errors like 502 or 503?
Cursor Commitment: Is the incremental cursor only committed upon total subtask success? If a subtask is partially successful, is there a risk of the next run skipping the "partially collected" data window?
Workarounds: For those running DevLake at this scale (300+ repos), are there recommended configurations to mitigate these transient failures? We are currently using Advanced Mode blueprints to surgically exclude problematic subtasks, but a more automated "resiliency" setting would be ideal.
We’d love to hear if others are seeing this or if there’s a specific configuration in v1.0.3 we might be missing to make these pipelines more self-healing.