Transient 502s & Incremental Cursor Gaps: Scaling github_graphql for 300+ Repos #8821
Replies: 1 comment
Hi @zaiddialpad! I'm Dosu and I'm helping the incubator-devlake team. Great questions. I dug into the codebase to give you concrete answers on all three.

1. Retry Logic: Yes, it exists, but with caveats.

If you're seeing 502s kill tasks even after retries, you may want to bump the retry settings. A closely related issue was reported in #8614 [2], where 502s and stream cancellations were caused by GitHub's server-side response size limits on large repos; a fix for that was merged in v1.0.3-beta8.

Both are addressed in the open PR #8791 [5], which is not yet merged. Given your GitHub App auth plus 340 repos (likely >1 hour syncs), this is almost certainly hitting you.

2. Cursor Commitment: Strict all-or-nothing, so your concern about gaps is partially valid.

3. Recommended Workarounds at Scale

Hope this helps make the pipelines more resilient. The retry mechanism is there, but the defaults are conservative for a 340+ repo deployment, and the two bugs in #8788 are likely compounding your failures significantly.

To reply, just mention @dosu.
Environment
DevLake Version: v1.0.3-beta9
Plugin: github_graphql
Source: GitHub Cloud (authenticated via GitHub App)
Deployment Scale: 340+ repositories running on a daily sync blueprint
The Context
First off, a huge thanks to the contributors for DevLake; it's been a game-changer for our internal metrics. However, as we've scaled our deployment to handle 346+ repositories, we've run into two interconnected issues that are making our production dashboards a bit shaky.
Currently, our daily pipelines experience 1-2 task failures every single run, primarily due to how transient errors are handled during the collection phase.
Issue 1: GitHub GraphQL 502 Bad Gateway (Transient)
We are seeing intermittent "graphql query got error" failures when GitHub returns a 502. While these are clearly transient on GitHub's end, the github_graphql plugin appears to treat them as fatal.
In a large-scale environment, it’s statistically inevitable that at least one repo will hit a 502 during a massive sync. Because there doesn't seem to be a built-in retry mechanism for these specific HTTP codes, a single "hiccup" from GitHub kills the entire task.
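To illustrate the kind of behavior we're hoping for, here is a rough sketch of a retry wrapper with exponential backoff and jitter (this is not DevLake code; httpError, transient, and withRetry are names we made up, and the real client would surface status codes differently):

```go
package main

import (
	"errors"
	"fmt"
	"math/rand"
	"net/http"
	"time"
)

// httpError stands in for whatever error type the GraphQL client surfaces;
// all we need from it is the status code.
type httpError struct{ StatusCode int }

func (e *httpError) Error() string {
	return fmt.Sprintf("graphql query got error: HTTP %d", e.StatusCode)
}

// transient reports whether a status code is worth retrying.
func transient(code int) bool {
	return code == http.StatusBadGateway ||
		code == http.StatusServiceUnavailable ||
		code == http.StatusGatewayTimeout
}

// withRetry runs query up to maxAttempts times, sleeping with exponential
// backoff plus jitter between transient failures and failing fast otherwise.
func withRetry(maxAttempts int, query func() error) error {
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if err = query(); err == nil {
			return nil
		}
		var he *httpError
		if !errors.As(err, &he) || !transient(he.StatusCode) {
			return err // permanent error: no point retrying
		}
		backoff := time.Duration(1<<attempt)*time.Second +
			time.Duration(rand.Intn(500))*time.Millisecond
		time.Sleep(backoff)
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// Simulate a query that 502s twice and then succeeds.
	err := withRetry(5, func() error {
		calls++
		if calls < 3 {
			return &httpError{StatusCode: http.StatusBadGateway}
		}
		return nil
	})
	fmt.Println("calls:", calls, "err:", err)
}
```

Something along these lines, wired into the GraphQL collector with the attempt count and backoff exposed as connection or environment settings, would turn an occasional 502 into a short delay instead of a failed task.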
Issue 2: The "Silent" Incremental Collection Cursor Gap
When a Collect subtask fails halfway through (due to the 502 mentioned above), we’ve observed the following behavior:
Partial Write: Data collected before the 502 is already committed to the raw tables.
Cursor Stalls: The incremental cursor is not advanced because the subtask failed.
The Gap: On the next scheduled run, the incremental cursor picks up from the last successful run.
We are concerned that if the next run skips over the failed window, or if the logic assumes the data was already handled simply because it exists in the raw tables, we end up with silent data gaps in our domain tables. For an organization relying on these tables for DORA metrics, missing even a few PRs or Issues creates a significant trust issue with the data.
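To make sure we're asking the right question, here is a toy model of the commit semantics we hope are in place (again, not DevLake's actual code; collectPage, saveRaw, and the single cursor field are hypothetical):

```go
package main

import (
	"fmt"
	"time"
)

// A toy model of one incremental Collect subtask, only meant to show why a
// cursor committed strictly on full success produces duplicates rather than
// gaps after a mid-run 502.

type state struct {
	cursor time.Time // checkpoint the next run will collect from
}

// collectPage pretends to fetch one page of records updated since `since`.
// In the real plugin this is where a transient 502 would surface.
func collectPage(since time.Time, page int) (records []string, done bool, err error) {
	return nil, true, nil
}

// saveRaw pretends to write records into the raw-layer tables.
func saveRaw(records []string) {}

// runCollection writes raw data page by page but only advances the cursor
// once every page has succeeded. If a 502 aborts the loop halfway, the
// partially written raw rows remain, the cursor stays put, and the next run
// re-collects the same window instead of skipping it.
func runCollection(s *state) error {
	start := time.Now()
	for page := 0; ; page++ {
		records, done, err := collectPage(s.cursor, page)
		if err != nil {
			return fmt.Errorf("page %d failed, cursor not advanced: %w", page, err)
		}
		saveRaw(records)
		if done {
			break
		}
	}
	s.cursor = start // committed only after total success
	return nil
}

func main() {
	s := &state{cursor: time.Now().Add(-24 * time.Hour)}
	fmt.Println(runCollection(s), "next cursor:", s.cursor)
}
```

Under that model, a mid-run 502 costs us duplicate raw rows on the next run rather than a gap, provided later stages dedupe on primary keys. Our worry is any code path that deviates from it, for example advancing the cursor per page or treating "raw rows already exist" as "this window is done".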
Questions for the Maintainers
Retry Logic: Does the github_graphql plugin currently support (or have plans for) configurable retries on transient HTTP errors like 502 or 503?
Cursor Commitment: Is the incremental cursor only committed upon total subtask success? If a subtask is partially successful, is there a risk of the next run skipping the "partially collected" data window?
Workarounds: For those running DevLake at this scale (300+ repos), are there recommended configurations to mitigate these transient failures? We are currently using Advanced Mode blueprints to surgically exclude problematic subtasks, but a more automated "resiliency" setting would be ideal.
We’d love to hear if others are seeing this or if there’s a specific configuration in v1.0.3 we might be missing to make these pipelines more self-healing.