
Retry CreateOnGithubJob on GitHub auth 401s #1490

Merged: tdickers merged 1 commit into main from tdickers/retry-unauthorized-create-on-github on Jun 16, 2026

tdickers (Contributor) commented Jun 15, 2026

Carries the Octokit::Unauthorized retry shape validated in github-certification#1873 over to Shipit's deployment/status creation path.

Summary

CreateOnGithubJob creates GitHub deployments/statuses and surfaces a transient Octokit::Unauthorized when a GitHub installation token is rejected or still propagating. CommitDeployment#create_on_github! only rescues NotFound/Forbidden, so the 401 escaped the job unhandled and reopened shop/issues#8801. The errors arrive in bursts across many repos at once and then go quiet (e.g. a Jun 10 spike), which fits a GitHub-side auth blip rather than a real permission failure.

Add a job-level retry_on Octokit::Unauthorized so transient auth failures get time to settle before counting as a failure.

Review focus

| Area | Choice | Why |
| --- | --- | --- |
| New retry | retry_on Octokit::Unauthorized, polynomially_longer, attempts: 14 on CreateOnGithubJob | ~24h scheduled-retry window. The installation token is cached for 50m (GITHUB_TOKEN_RAILS_CACHE_LIFETIME in lib/shipit/github_app.rb), so a shorter window could give up before a stale cached token refreshes; ~24h outlasts the cache and rides out extended GitHub auth incidents. Transient propagation lag still recovers in the first few attempts. |
| No token invalidation | Do not evict/remint the GitHub App token on 401 | Avoids a retry/remint storm where many workers race to invalidate the shared token and re-fail on freshly minted (still-propagating) tokens |
| Exhaustion | Log and do not re-raise | Matches the existing NotFound/Forbidden give-up path in create_on_github! ("if no one can create the deployment we can only give up") |
| Scope | CreateOnGithubJob only | This is the job in the reported issue; other GitHub jobs are out of scope for this fix |
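
For readers unfamiliar with the retry shape described in the review-focus table, here is a minimal sketch of what such a declaration can look like. It is not the merged diff: the class name CreateOnGithubJobSketch, the log message, and the perform body are hypothetical, and passing the backoff as wait: :polynomially_longer is an assumption about how the PR spells it. It requires the activejob (7.1 or later) and octokit gems.

```ruby
# Sketch only: assumes the activejob (7.1+) and octokit gems are installed.
require "active_job"
require "octokit"

# Hypothetical stand-in for the real job; the class name, log message, and
# perform body are illustrative assumptions, not the merged change.
class CreateOnGithubJobSketch < ActiveJob::Base
  # attempts: 14 means one initial run plus up to 13 scheduled retries.
  # Because a block is given, ActiveJob calls it once the attempts are
  # exhausted instead of re-raising, so the job ends as a logged give-up.
  retry_on Octokit::Unauthorized, wait: :polynomially_longer, attempts: 14 do |job, error|
    job.logger.warn(
      "Giving up on #{job.class.name} after #{job.executions} attempts (#{error.class})"
    )
  end

  def perform(record)
    # create_on_github! is the method named in this PR; per the description
    # it already rescues NotFound/Forbidden on its own.
    record.create_on_github!
  end
end
```

Handling exhaustion in the retry_on block keeps the give-up path at the job level, so the model method and the token cache stay untouched, which matches the "No token invalidation" and "Exhaustion" rows.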

Risk

Low. Retries are ActiveJob-scheduled, not blocking, so workers are not held. Polynomial backoff means later retries are hours apart, so after a multi-hour incident ends, recovery can lag by up to the interval between consecutive retries; the common cases (token propagation lag, 50m cache refresh) recover within the first several attempts. On a persistent failure the job degrades to a logged give-up after ~24h rather than an unhandled crash.
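
The timing claims above can be checked with a little arithmetic. The sketch below assumes ActiveJob's documented :polynomially_longer formula with the random jitter term left out (executions**4 + 2 seconds per retry); the constant and method names are illustrative and do not come from the codebase.

```ruby
# Nominal :polynomially_longer delay (in seconds) before retry number
# `executions`, ignoring jitter: executions**4 + 2.
def nominal_delay(executions)
  executions**4 + 2
end

ATTEMPTS = 14                 # from the PR: attempts: 14
RETRIES = ATTEMPTS - 1        # the first attempt is not a retry
TOKEN_CACHE_SECONDS = 50 * 60 # 50m installation-token cache, per the PR

# Delay before each retry, and the elapsed time at which each retry runs.
DELAYS = (1..RETRIES).map { |n| nominal_delay(n) }
ELAPSED = DELAYS.each_with_object([]) { |d, acc| acc << (acc.last.to_i + d) }

TOTAL_SECONDS = ELAPSED.last
LONGEST_GAP_SECONDS = DELAYS.max
# 1-based number of the first retry that runs after the cache lifetime.
FIRST_RETRY_PAST_CACHE = ELAPSED.index { |t| t > TOKEN_CACHE_SECONDS } + 1

puts "total retry window: #{(TOTAL_SECONDS / 3600.0).round(1)}h"
puts "longest gap:        #{(LONGEST_GAP_SECONDS / 3600.0).round(1)}h"
puts "first retry after the 50m cache lifetime: ##{FIRST_RETRY_PAST_CACHE}"
```

Under these assumptions the 13 retries span about 24.8h, the final gap is about 7.9h, and the seventh retry is the first one scheduled after the 50m cache lifetime has elapsed. Jitter, where enabled, only lengthens each delay.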

Testing

Covered:

| Area | Coverage |
| --- | --- |
| Auth retry | Octokit::Unauthorized enqueues an ActiveJob retry |
| Auth exhaustion | Exhausted auth retries complete without re-raising |

Gaps:

  • Could not run the suite locally; this checkout's bundle is out of sync (Octokit 5.6.1 / Rails 8.1 gems not installed). Syntax-checked with ruby -c; CI runs the suite. Tests raise the bare Octokit::Unauthorized class to match repo convention.

tdickers force-pushed the tdickers/retry-unauthorized-create-on-github branch 2 times, most recently from 1a8c92a to f26b01a (June 16, 2026 16:42)
tdickers marked this pull request as ready for review June 16, 2026 16:44
tdickers force-pushed the tdickers/retry-unauthorized-create-on-github branch 2 times, most recently from 52e1edf to 43aa3f2 (June 16, 2026 19:18)

Deployment/status creation surfaces transient Octokit::Unauthorized when a
GitHub installation token is rejected or still propagating. CommitDeployment#create_on_github!
only rescues NotFound/Forbidden, so the 401 escaped the job unhandled and
reopened the Observe issue.

Add retry_on Octokit::Unauthorized to CreateOnGithubJob with polynomially_longer
backoff and attempts: 14 (~24h window). The window intentionally outlasts the
50m installation-token cache (GITHUB_TOKEN_RAILS_CACHE_LIFETIME in
lib/shipit/github_app.rb) so a stale cached token can refresh before we give
up. On exhaustion, log and do not re-raise, matching the existing
NotFound/Forbidden give-up behavior. No token cache or client changes; we do
not evict/remint the cached token to avoid a remint storm across workers.

This aligns the retry shape with the validated approach from
Shopify/github-certification#1873.

Fixes shop/issues#8801

tdickers force-pushed the tdickers/retry-unauthorized-create-on-github branch from 43aa3f2 to 13f0b33 (June 16, 2026 19:55)
tdickers merged commit 217dcac into main Jun 16, 2026
15 checks passed
tdickers deleted the tdickers/retry-unauthorized-create-on-github branch June 16, 2026 20:17
