Retry CreateOnGithubJob on GitHub auth 401s#1490
Merged
timothysmith0609
approved these changes
Jun 16, 2026
Deployment/status creation surfaces transient Octokit::Unauthorized when a GitHub installation token is rejected or still propagating. CommitDeployment#create_on_github! only rescues NotFound/Forbidden, so the 401 escaped the job unhandled and reopened the Observe issue.
Add retry_on Octokit::Unauthorized to CreateOnGithubJob with polynomially_longer backoff and attempts: 14 (~24h window). The window intentionally outlasts the 50m installation-token cache (GITHUB_TOKEN_RAILS_CACHE_LIFETIME in lib/shipit/github_app.rb) so a stale cached token can refresh before we give up. On exhaustion, log and do not re-raise, matching the existing NotFound/Forbidden give-up behavior.
No token cache or client changes; we do not evict/remint the cached token to avoid a remint storm across workers. This aligns the retry shape with the validated approach from Shopify/github-certification#1873.
Fixes shop/issues#8801
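The "retry, then log and do not re-raise" behavior described above can be modeled in plain Ruby. This is only a sketch under stated assumptions: `FakeUnauthorized` stands in for `Octokit::Unauthorized`, and `run_with_give_up` is a hypothetical helper, not ActiveJob code and not the code in this PR. It also retries immediately, whereas ActiveJob re-enqueues the job with a delay. In ActiveJob the give-up step corresponds to passing a block to `retry_on`, which is called on exhaustion instead of re-raising.

```ruby
# Stand-in for Octokit::Unauthorized so the sketch runs without the octokit gem.
class FakeUnauthorized < StandardError; end

# Hypothetical helper (not ActiveJob): run the block up to `attempts` times.
# When every attempt raises, record a log line and return :gave_up
# instead of re-raising the error.
def run_with_give_up(attempts:, log:)
  executions = 0
  begin
    executions += 1
    yield(executions)
  rescue FakeUnauthorized => e
    retry if executions < attempts
    log << "giving up after #{executions} attempts: #{e.message}"
    :gave_up
  end
end

LOG_LINES = []

# Transient failure: the token is accepted on the third attempt.
RECOVERED = run_with_give_up(attempts: 14, log: LOG_LINES) do |n|
  raise FakeUnauthorized, "401 Unauthorized" if n < 3
  :created
end

# Persistent failure: every attempt is rejected.
EXHAUSTED = run_with_give_up(attempts: 14, log: LOG_LINES) do |_n|
  raise FakeUnauthorized, "401 Unauthorized"
end

puts RECOVERED.inspect
puts EXHAUSTED.inspect
puts LOG_LINES
```

The transient case returns the block's value once an attempt succeeds; the persistent case ends in a single log line and a non-raising result, which is the shape the existing NotFound/Forbidden path already has.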
Summary
`CreateOnGithubJob` creates GitHub deployments/statuses and surfaces a transient `Octokit::Unauthorized` when a GitHub installation token is rejected or still propagating. `CommitDeployment#create_on_github!` only rescues `NotFound`/`Forbidden`, so the 401 escaped the job unhandled and reopened shop/issues#8801. The errors are bursty then quiet across many repos at once (e.g. a Jun 10 spike), which fits a GitHub-side auth blip rather than a real permission failure.
Add a job-level `retry_on Octokit::Unauthorized` so transient auth failures get time to settle before counting as a failure.
Review focus
- `retry_on Octokit::Unauthorized`, `polynomially_longer`, `attempts: 14` on `CreateOnGithubJob`
- Retry window: the installation token is cached for 50m (`GITHUB_TOKEN_RAILS_CACHE_LIFETIME` in `lib/shipit/github_app.rb`), so a shorter window could give up before a stale cached token refreshes; ~24h outlasts the cache and rides out extended GitHub auth incidents. Transient propagation lag still recovers in the first few attempts.
- On exhaustion, log and do not re-raise, matching the existing `NotFound`/`Forbidden` give-up path in `create_on_github!` ("if no one can create the deployment we can only give up")
- No token cache or client changes; the change is to `CreateOnGithubJob` only
Risk
Low. Retries are ActiveJob-scheduled, not blocking, so workers are not held. Polynomial backoff means later retries are hours apart, so recovery from a multi-hour incident can lag by up to the gap; the common cases (token propagation lag, 50m cache refresh) recover within the first several attempts. On a persistent failure the job degrades to a logged give-up after ~24h rather than an unhandled crash.
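The ~24h window and the "hours apart" gaps can be checked with a little arithmetic. The sketch below assumes the base delay ActiveJob uses for `wait: :polynomially_longer`, `executions**4 + 2` seconds, and leaves out jitter, which adds a random amount on top. The constant and method names are illustrative, not taken from the PR.

```ruby
ATTEMPTS = 14                  # attempts: 14, from this PR
TOKEN_CACHE_SECONDS = 50 * 60  # 50m installation-token cache, from this PR

# Assumed base delay for :polynomially_longer, jitter excluded.
def polynomial_delay(executions)
  executions**4 + 2
end

# attempts: 14 means one initial run plus 13 retries.
DELAYS = (1...ATTEMPTS).map { |n| polynomial_delay(n) }

# Seconds elapsed since the first failure when each retry fires.
ELAPSED = DELAYS.each_with_object([]) { |d, acc| acc << (acc.last || 0) + d }

WINDOW_HOURS = (ELAPSED.last / 3600.0).round(1)
LONGEST_GAP_HOURS = (DELAYS.last / 3600.0).round(1)
FIRST_RETRY_AFTER_CACHE = ELAPSED.index { |t| t > TOKEN_CACHE_SECONDS } + 1

puts "retry window: #{WINDOW_HOURS}h"                                   # 24.8h
puts "longest gap: #{LONGEST_GAP_HOURS}h"                               # 7.9h
puts "first retry after cache expiry: retry #{FIRST_RETRY_AFTER_CACHE}" # retry 7
```

Under these assumptions the first six retries all land within about 38 minutes, the seventh is the first to fire after a 50m cached token would have expired, and the final gap is close to eight hours, which is where the recovery lag mentioned above comes from.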
Testing
Covered:
- `Octokit::Unauthorized` enqueues an ActiveJob retry
Gaps:
`ruby -c`; CI runs the suite. Tests raise the bare `Octokit::Unauthorized` class to match repo convention.