Validate and auto-resolve AWS region before Karpenter operations #3057

Draft
L3n41c wants to merge 2 commits into main from lenaic/validate-aws-region

@L3n41c L3n41c commented May 28, 2026

Member

What does this PR do?

Reconciles the AWS region with the target EKS cluster before the kubectl datadog autoscaling cluster commands (install/update/uninstall) build their AWS clients:

  • AWS_REGION unset but derivable from the kubeconfig context ARN → derive the region and reload the AWS config with config.WithRegion (so credential providers also pick it up), then proceed with a log notice.
  • AWS_REGION unset and not derivable → clear, actionable error instead of the opaque STS "Missing Region" noise.
  • AWS_REGION set but different from the cluster's region → hard error (RegionMismatchError) in all three commands.
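
The three cases above can be sketched as a small decision function. This is a minimal illustration, not the PR's code: the signature of `resolveRegion`, the fields of `RegionMismatchError`, and the error wording are assumptions, as is the choice to keep a configured region when the cluster's region cannot be determined.

```go
package main

import (
	"errors"
	"fmt"
)

// RegionMismatchError stands in for the error type named in the PR.
// Its fields are assumptions made for this sketch.
type RegionMismatchError struct {
	Configured string // region from AWS_REGION or the loaded AWS config
	Cluster    string // region parsed from the kubeconfig context ARN
}

func (e *RegionMismatchError) Error() string {
	return fmt.Sprintf("AWS region %q does not match the cluster's region %q",
		e.Configured, e.Cluster)
}

// resolveRegion returns the region to use and whether it was derived from
// the cluster. An empty string means "unset" or "not derivable".
func resolveRegion(configured, cluster string) (region string, derived bool, err error) {
	switch {
	case configured == "" && cluster == "":
		// Nothing to go on: fail early with an actionable message.
		return "", false, errors.New("AWS region is not set and cannot be derived from the kubeconfig context; set AWS_REGION")
	case configured == "":
		// Derive path: the caller would reload the AWS config with this region.
		return cluster, true, nil
	case cluster != "" && configured != cluster:
		return "", false, &RegionMismatchError{Configured: configured, Cluster: cluster}
	default:
		// Regions match, or the cluster region is unknown (assumption: keep the configured one).
		return configured, false, nil
	}
}

func main() {
	_, _, err := resolveRegion("us-west-2", "us-east-2")
	var mismatch *RegionMismatchError
	if errors.As(err, &mismatch) {
		fmt.Println("mismatch:", mismatch)
	}

	region, derived, _ := resolveRegion("", "us-east-2")
	fmt.Println(region, derived)
}
```

Returning a typed error lets callers distinguish a mismatch from other failures with `errors.As`, which is how the unit tests in this PR are described as checking it.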

Motivation

QA hit two confusing failures, both rooted in the AWS region:

  1. With AWS_REGION unset, the only feedback was an opaque, deeply wrapped STS error: `failed to get AWS caller identity: operation error STS: GetCallerIdentity, failed to resolve service endpoint, endpoint rule error, Invalid Configuration: Missing Region`.
  2. With AWS_REGION set to the wrong region, operations silently targeted the wrong region, e.g. uninstall printing `Stack ... does not exist, skipping deletion`.

The kubeconfig context for an EKS cluster is an ARN (arn:aws:eks:<region>:<account>:cluster/<name>) that already carries the region — the same source PR #2892 uses for the AWS account-consistency check. This change reuses it for the region, with no extra AWS API call.

Additional Notes

  • Region resolution lives in clients.Build, before the service clients are constructed, because credential providers (assume-role / web-identity STS clients) capture the region at config-load time — a post-load mutation wouldn't reach them. The config is reloaded with the derived region only in the derive path.
  • Refactors the kubeconfig-ARN parse into a shared getClusterARNFromKubeconfig helper used by both the account and region extractors. The helper is tightened to only trust EKS cluster ARNs (Service == "eks" + cluster/ resource prefix).
  • RegionMismatchError mirrors the existing AccountMismatchError.
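
As an illustration of the shared parsing step, the sketch below splits a kubeconfig cluster name into ARN fields using only the standard library, and accepts it only when the service is `eks` and the resource starts with `cluster/`. The names `clusterARN` and `parseEKSClusterARN` are hypothetical; the actual `getClusterARNFromKubeconfig` helper may use an ARN parsing library and have a different signature.

```go
package main

import (
	"fmt"
	"strings"
)

// clusterARN holds the ARN fields that the account and region checks need.
type clusterARN struct {
	Partition string
	Region    string
	AccountID string
	Name      string
}

// parseEKSClusterARN accepts only strings of the form
// arn:<partition>:eks:<region>:<account>:cluster/<name>
// and reports ok=false for anything else (plain names, FQDNs, other ARNs).
func parseEKSClusterARN(s string) (clusterARN, bool) {
	// An ARN has six colon-separated fields; the resource may itself contain colons.
	parts := strings.SplitN(s, ":", 6)
	if len(parts) != 6 || parts[0] != "arn" {
		return clusterARN{}, false
	}
	const prefix = "cluster/"
	if parts[2] != "eks" || !strings.HasPrefix(parts[5], prefix) {
		return clusterARN{}, false
	}
	name := strings.TrimPrefix(parts[5], prefix)
	if name == "" || parts[3] == "" {
		return clusterARN{}, false
	}
	return clusterARN{
		Partition: parts[1],
		Region:    parts[3],
		AccountID: parts[4],
		Name:      name,
	}, true
}

func main() {
	// Example inputs; the account ID and cluster name are placeholders.
	for _, s := range []string{
		"arn:aws:eks:us-east-2:123456789012:cluster/demo",
		"arn:aws-us-gov:eks:us-gov-west-1:123456789012:cluster/demo",
		"demo.us-east-2.eksctl.io",
		"arn:aws:s3:::my-bucket",
	} {
		a, ok := parseEKSClusterARN(s)
		fmt.Printf("%-60s ok=%-5v region=%q account=%q\n", s, ok, a.Region, a.AccountID)
	}
}
```

Because both the region and the account ID come out of the same parse, a single helper can feed both consistency checks without an extra AWS API call.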

Minimum Agent Versions

N/A — this is a kubectl-datadog plugin change, not an agent change.

Describe your test plan

  • Unit tests added for resolveRegion (all branches incl. mismatch via errors.As, GovCloud partitions) and getClusterARNFromKubeconfig (EKS/GovCloud ARNs, plain names, eksctl FQDNs, non-EKS ARN rejection).
  • go build ./cmd/kubectl-datadog/..., go vet, and the package tests pass; full make lint reports 0 issues.
  • Manual (kubeconfig context = an EKS ARN):
    • Unset region (profile without a region) → logs `AWS region not set; using "us-east-2" from the kubeconfig context.` and proceeds.
    • Wrong region (AWS_REGION=us-west-2 against a us-east-2 cluster) → fails immediately with RegionMismatchError for install, update, and uninstall.
    • Correct region → unchanged behavior.

Checklist

  • PR has at least one valid label: bug, enhancement, refactoring, documentation, tooling, and/or dependencies
  • PR has a milestone or the qa/skip-qa label
  • All commits are signed (see: signing commits)

🤖 Generated with Claude Code

The `kubectl datadog autoscaling cluster` commands build their AWS clients
from the default credential chain while the target EKS cluster comes from the
kubeconfig context. When AWS_REGION was unset, users hit an opaque STS error
("Invalid Configuration: Missing Region" buried in endpoint-resolution noise);
when it was set to the wrong region, operations silently looked at the wrong
place (e.g. uninstall reporting "stack does not exist, skipping").

The kubeconfig context for an EKS cluster is an ARN
(arn:aws:eks:<region>:<account>:cluster/<name>) that already carries the
region — the same source PR #2892 uses for the AWS account-consistency check.
Reconcile the region from it inside clients.Build, before the service clients
are constructed:

- AWS_REGION unset but derivable from the kubeconfig ARN: derive it and reload
  the config with config.WithRegion so credential providers (assume-role /
  web-identity STS clients) also pick it up, then proceed with a notice.
- AWS_REGION unset and not derivable: clear, actionable error.
- AWS_REGION set but different from the cluster's region: hard error
  (RegionMismatchError) in all commands (install, update, uninstall).

Refactors the kubeconfig-ARN parse into a shared getClusterARNFromKubeconfig
helper (also tightened to only trust EKS cluster ARNs) that both the account
and region extractors use.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added the enhancement New feature or request label May 28, 2026
datadog-datadog-prod-us1-2 Bot commented May 28, 2026

Pipelines  Code Coverage

⚠️ Warnings

🚦 1 Pipeline job failed

pull request linter | Check Milestone   View in Datadog   GitHub Actions

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Missing milestone or `qa/skip-qa` label.

ℹ️ Info

🎯 Code Coverage (details)
Patch Coverage: 84.78%
Overall Coverage: 43.44% (+0.08%)

🔗 Commit SHA: 4c0d254 | Docs | Datadog PR Page | Give us feedback!

codecov-commenter commented May 28, 2026

Codecov Report

❌ Patch coverage is 86.53846% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 43.11%. Comparing base (6b4b5f7) to head (4c0d254).
⚠️ Report is 1 commit behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...adog/autoscaling/cluster/common/clients/clients.go | 86.53% | 6 Missing and 1 partial ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3057      +/-   ##
==========================================
+ Coverage   43.03%   43.11%   +0.08%     
==========================================
  Files         339      339              
  Lines       29215    29262      +47     
==========================================
+ Hits        12573    12617      +44     
- Misses      15820    15823       +3     
  Partials      822      822              

| Flag | Coverage Δ |
|---|---|
| unittests | 43.11% <86.53%> (+0.08%) ⬆️ |

Flags with carried forward coverage won't be shown. Click here to find out more.

| Files with missing lines | Coverage Δ |
|---|---|
| ...adog/autoscaling/cluster/common/clients/clients.go | 33.90% <86.53%> (+22.09%) ⬆️ |

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b4b5f7...4c0d254. Read the comment docs.

The patch-coverage gate (target 80%) failed because the region-reconciliation
logic lived inline in clients.Build, which is integration-only and not unit
testable. Extract it verbatim into a reconcileRegion(ctx, awsConfig,
configFlags) helper (behavior-preserving) and add a hermetic table-driven
TestReconcileRegion covering the match, derive, mismatch, undeterminable, and
unreadable-kubeconfig paths.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@L3n41c L3n41c added this to the v1.28.0 milestone May 29, 2026
@khewonc khewonc modified the milestones: v1.28.0, v1.29.0 Jun 10, 2026

3 participants