fix(e2e): reduce E2E test flakiness (sandbox events, duplicate CSE timing) by r2k1 · Pull Request #8480 · Azure/AgentBaker

r2k1 · 2026-05-08T04:13:02Z

What this PR does / why we need it:

Fixes two sources of E2E test false-positive failures found by analyzing recent builds on main.

1. Remove FailedCreatePodSandBox early exit (`kube.go`)

The old code listed pod events every 3s and killed the test on the first FailedCreatePodSandBox event. But kubelet retries sandbox creation automatically — transient errors like "loopback interrupted system call" resolve on their own. The early exit was killing tests before kubelet could recover.

Removed the event check entirely. The 5-minute poll timeout already handles persistent failures. CrashLoopBackOff still fails fast. Also removed the unused WaitUntilPodRunningWithRetry wrapper and maxRetries parameter.

2. Fix duplicate CSE event extraction (`cse_timing.go`)

The find command searched two overlapping paths: cseEventsDir (.../events/) AND its parent directory with -path '*/events/*'. Same files matched twice → duplicate tasks → spurious Task_installDeps#01 subtests (observed in 11/46 builds).

Fixed by using a single starting path.

The strict wireserver validation change that was previously bundled here has been extracted into #8580 for separate review.

Which issue(s) this PR fixes:

N/A — test-only hardening, no linked issue.

Copilot

Pull request overview

This PR targets major sources of E2E test flakiness by tightening readiness gating, reducing validator false positives, and stabilizing timing-based assertions. It also extends scenario mutator hooks to receive cluster context, enabling scenarios that depend on cluster-level derived values (e.g., a proxy URL).

Changes:

Make node readiness waiting more accurate by requiring cloud-provider uninitialized taint removal before proceeding.
Reduce false-positive failures by expanding the eBPF/iptables allowlist and deduplicating CSE timing events; relax overly tight CSE perf thresholds.
Update E2E scenario mutator function signatures to accept *Cluster, adjust specific scenarios (e.g., Flatcar AzureCNI), and skip a consistently failing Ubuntu 20.04 FIPS lane.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated no comments.

Show a summary per file

File	Description
e2e/kube.go	Improves node readiness gating; adds proxy ConfigMap/DaemonSet + proxy discovery; fixes DaemonSet CreateOrUpdate mutation.
e2e/cluster.go	Adds cluster-level `ProxyURL` plumbing, sets up private DNS for API server, and broadens retryable cluster-create errors.
e2e/aks_model.go	Adds 409-conflict handling/waits for private DNS zone and VNet link creation.
e2e/validation.go	Allows a standard DHCP INPUT rule in the eBPF-host-routing iptables compatibility allowlist.
e2e/cse_timing.go	Deduplicates extracted CSE event timings that can appear in multiple event directories.
e2e/scenario_cse_perf_test.go	Adjusts full-install timing thresholds (notably `installDeps`) and updates mutator signatures.
e2e/types.go	Updates `BootstrapConfigMutator` / `AKSNodeConfigMutator` signatures to include `*Cluster`.
e2e/test_helpers.go	Threads `*Cluster` through mutator invocations (including pre-provision flow).
e2e/scenario_test.go	Removes forced Azure CNI plugin settings for Flatcar AzureCNI, skips Ubuntu2004FIPS, adds HTTPS proxy + private DNS scenario, updates mutator signatures across scenarios.
e2e/scenario_win_test.go	Updates bootstrap mutator signatures for Windows scenarios.
e2e/scenario_localdns_hosts_test.go	Updates mutator signatures for LocalDNS hosts plugin scenarios.
e2e/scenario_gpu_managed_experience_test.go	Updates mutator signatures for GPU managed experience scenarios.
e2e/scenario_gpu_daemonset_test.go	Updates mutator signatures for GPU daemonset scenario.

Comments suppressed due to low confidence (2)

e2e/kube.go:330

EnsureDebugDaemonsets now creates the proxy ConfigMap/DaemonSet for every non-network-isolated cluster. If the proxy is only needed for a subset of scenarios, consider decoupling it from the general "debug daemonsets" setup so failures in pulling/starting the proxy don’t break unrelated tests during cluster preparation.

		return nil
	})
	if err != nil {
		return err
	}
	return nil
}

func (k *Kubeclient) createKubernetesSecret(ctx context.Context, namespace, secretName, registryName, username, password string) error {
	defer toolkit.LogStepCtxf(ctx, "creating kubernetes secret %s in namespace %s for registry %s", secretName, namespace, registryName)()

e2e/kube.go:606

The proxy DaemonSet runs with HostNetwork: true and exposes a fixed HostPort (8888) while tolerating all taints. This effectively opens an unauthenticated forward proxy on every system-pool node IP, which is reachable from within the VNet and may be abused or interfere with other host processes that might already bind the port. Consider scoping this down (e.g., run on a single chosen node/instance, tighten tolerations/nodeSelectors, and/or add network-level restrictions) and document the intended threat model for this test-only proxy.

Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

e2e/kube.go:83

PR description mentions tolerating transient FailedCreatePodSandBox events by aggregating event.Count and allowing retries (e.g., maxRetries=3), but WaitUntilPodRunning no longer has any sandbox-event handling and the prior retry-aware method was removed. Either implement the described sandbox-failure tolerance (including using event.Count) or adjust the PR description so reviewers/users don’t assume this mitigation exists.

func (k *Kubeclient) WaitUntilPodRunning(ctx context.Context, namespace string, labelSelector string, fieldSelector string) (*corev1.Pod, error) {
	defer toolkit.LogStepCtxf(ctx, "waiting for pod %s %s in %q namespace", labelSelector, fieldSelector, namespace)()
	var pod *corev1.Pod

	err := wait.PollUntilContextTimeout(ctx, 3*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		pods, err := k.Typed.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{
			FieldSelector: fieldSelector,

r2k1 · 2026-05-10T22:06:18Z

 	for _, check := range checks {
-		var lastResult *podExecResult
-		err := wait.PollUntilContextTimeout(ctx, 10*time.Second, 1*time.Minute, true, func(ctx context.Context) (bool, error) {
-			execResult, execErr := execOnUnprivilegedPod(ctx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
-			if execErr != nil {
-				s.T.Logf("wireserver check %q: exec error (retrying): %v", check.desc, execErr)
-				return false, nil
-			}
-			lastResult = execResult
-			if lastResult.exitCode == "28" {
-				return true, nil
-			}
-			s.T.Logf("wireserver check %q: expected exit code 28, got %s (retrying)", check.desc, lastResult.exitCode)
-			return false, nil
-		})
-		if err != nil {
-			s.T.Logf("host IPTABLES: %s", execScriptOnVMForScenario(ctx, s, "sudo iptables -t filter -L FORWARD -v -n --line-numbers").String())
-			if lastResult == nil {
-				require.NoErrorf(s.T, err, "curl to %s did not complete before polling stopped", check.desc)
-			}
-			s.T.Logf("last curl result for %s: %s", check.desc, lastResult.String())
-			assert.Equal(s.T, "28", lastResult.exitCode, "curl to %s expected to fail with timeout, but it didn't after retries", check.desc)
-			s.T.FailNow()
+		execResult, execErr := execOnUnprivilegedPod(ctx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
+		require.NoError(s.T, execErr, "failed to exec wireserver check %q on debug pod", check.desc)
+		if execResult.exitCode != "28" {
+			iptablesFwd := execScriptOnVMForScenario(ctx, s, "sudo iptables -t filter -L FORWARD -v -n --line-numbers").String()
+			iptablesKubeFwd := execScriptOnVMForScenario(ctx, s, "sudo iptables -t filter -L KUBE-FORWARD -v -n --line-numbers 2>/dev/null || echo 'chain not found'").String()
+			iptablesSave := execScriptOnVMForScenario(ctx, s, "sudo iptables-save -t filter 2>/dev/null | head -80").String()
+			conntrack := execScriptOnVMForScenario(ctx, s, "sudo conntrack -L -d 168.63.129.16 2>/dev/null || echo 'conntrack not available'").String()
+			s.T.Fatalf("wireserver must not be reachable from pods: curl to %s returned exit code %s (expected 28/timeout)\n"+


Every single request to wireserver is expected to fail. If doesn't fail ocasionally, then it highlights a potential issue that needs to be addressed

 // validateWireServerBlocked checks that unprivileged pods cannot reach WireServer.
-// The iptables FORWARD DROP rules blocking pod→WireServer traffic can be transiently
-// absent when kube-proxy or CNI flush/recreate iptables chains during node setup.
-// We resolve the debug pod once up front (outside the retry budget) so that pod
-// scheduling latency doesn't eat into the iptables-check timeout.
+// validateWireServerBlocked checks that unprivileged pods cannot reach WireServer.
+// Wireserver must never be reachable from pods — any successful connection is a
+// security issue, not a transient condition to retry through.
 func validateWireServerBlocked(ctx context.Context, s *Scenario) {


Copilot

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

Comments suppressed due to low confidence (2)

e2e/validation.go:26

ValidatePodRunningWithRetry currently will not compile: for i := range maxRetries attempts to range over an int. Use a standard counted loop (e.g., for i := 0; i < maxRetries; i++) and ensure the function still performs at least one validation attempt when maxRetries is 0 or 1 (depending on intended semantics).

func ValidatePodRunningWithRetry(ctx context.Context, s *Scenario, pod *corev1.Pod, maxRetries int) {
	var err error
	for i := range maxRetries {
		err = validatePodRunning(ctx, s, pod)
		if err != nil {
			time.Sleep(1 * time.Second)
			s.T.Logf("retrying pod %q validation (%d/%d)", pod.Name, i+1, maxRetries)
			continue

e2e/kube.go:82

PR description says WaitUntilNodeReady() was updated to wait for removal of the node.cloudprovider.kubernetes.io/uninitialized:NoSchedule taint before returning success, but the implementation still returns as soon as NodeReady=True (no taint check). Either add the taint-removal wait as described, or update the PR description to match the actual behavior.

func (k *Kubeclient) WaitUntilPodRunning(ctx context.Context, namespace string, labelSelector string, fieldSelector string) (*corev1.Pod, error) {
	defer toolkit.LogStepCtxf(ctx, "waiting for pod %s %s in %q namespace", labelSelector, fieldSelector, namespace)()
	var pod *corev1.Pod

	err := wait.PollUntilContextTimeout(ctx, 3*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		pods, err := k.Typed.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (2)

e2e/cse_timing.go:135

listCmd no longer references cseEventsDir, but the file still declares the cseEventsDir constant. That constant is now unused, which will cause go test/go vet to fail for the e2e module. Either remove cseEventsDir or reintroduce it into the command (without reintroducing duplicate file matches).

		"sudo find /var/log/azure/Microsoft.Azure.Extensions.CustomScript/ -name '*.json' -path '*/events/*' -exec sh -c 'cat \"$1\"; echo' _ {} \\; 2>/dev/null",
	)
	result, err := execScriptOnVm(ctx, s, s.Runtime.VM, listCmd)
	if err != nil {
		return nil, fmt.Errorf("failed to read CSE events: %w", err)
	}

e2e/kube.go:82

PR description mentions raising the installDeps CSE perf threshold in scenario_cse_perf_test.go, but that file is not part of the PR diff (only validation.go, kube.go, cse_timing.go are modified). Either include the intended threshold change in this PR or update the description to reflect the actual changes.

func (k *Kubeclient) WaitUntilPodRunning(ctx context.Context, namespace string, labelSelector string, fieldSelector string) (*corev1.Pod, error) {
	defer toolkit.LogStepCtxf(ctx, "waiting for pod %s %s in %q namespace", labelSelector, fieldSelector, namespace)()
	var pod *corev1.Pod

	err := wait.PollUntilContextTimeout(ctx, 3*time.Second, 5*time.Minute, true, func(ctx context.Context) (bool, error) {
		pods, err := k.Typed.CoreV1().Pods(namespace).List(ctx, metav1.ListOptions{

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

r2k1 · 2026-05-12T21:53:24Z

+		execResult, execErr := execOnUnprivilegedPod(ctx, s.Runtime.Cluster.Kube, nonHostPod.Namespace, nonHostPod.Name, check.cmd)
+		require.NoError(s.T, execErr, "failed to exec wireserver check %q on debug pod", check.desc)
+		if execResult.exitCode == "0" {
+			iptablesFwd := execScriptOnVMForScenario(ctx, s, "sudo iptables -t filter -L FORWARD -v -n --line-numbers").String()
+			iptablesKubeFwd := execScriptOnVMForScenario(ctx, s, "sudo iptables -t filter -L KUBE-FORWARD -v -n --line-numbers 2>/dev/null || echo 'chain not found'").String()
+			iptablesSave := execScriptOnVMForScenario(ctx, s, "sudo iptables-save -t filter 2>/dev/null | head -80").String()


Good catch — addressed in 4f42248.

Now whitelists only 28 (FORWARD DROP timeout) and 7 (FORWARD REJECT — connection refused). Anything else (0 = reachable, 127 = curl missing, 2/3 = bad args, 6 = DNS, etc.) fails the test with the actual exit code logged plus full iptables/conntrack diagnostics.

This is strictly more defensive than the original exit == 28 check (now also accepts REJECT-style blocks) while closing the silent-bypass holes you flagged. command -v curl pre-check is unnecessary because exit 127 now fails explicitly with "unexpected curl exit code 127" instead of being lost.

r2k1

Addressed review comment:

validation.go:300 — curl exit code check: Fixed. Now only fails on exit code 0 (successful connection), not on any non-28 code. Exit codes like 7 (connection refused), 28 (timeout), or 127 (curl missing) all correctly indicate wireserver is not reachable.

djsly · 2026-05-12T14:40:58Z

not sure bout 3) I fear this will cause more unstable issue, I agree we need to better understand, but I did work on this for 3 days and the RCA was never found.

djsly

please decouple the wireserver in another PR to have it run multiple run to help diagnose the failures this will reintroduce

djsly · 2026-05-12T14:43:10Z

-// We resolve the debug pod once up front (outside the retry budget) so that pod
-// scheduling latency doesn't eat into the iptables-check timeout.
+// Wireserver must never be reachable from pods — any successful connection is a
+// security issue, not a transient condition to retry through.


dont commit and ensure stability before removing the retry. or else this will keep generating issues, this test run with every e2e

moved to a #8580

Copilot

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 1 comment.

Copilot · 2026-05-13T01:02:08Z

Just as a heads up, I was blocked by some firewall rules while working on your feedback. Expand below for details.

Warning

Firewall rules blocked me from connecting to one or more addresses (expand for details)

I tried to connect to the following addresses, but was blocked by firewall rules:

azcliprod.blob.core.windows.net
- Triggering command: /usr/bin/../../opt/az/bin/python3 /usr/bin/../../opt/az/bin/python3 -Im azure.cli account get-access-token -o json --resource REDACTED (dns block)

If you need me to access, download, or install something from one of these locations, you can either:

Configure Actions setup steps to set up my environment, which run before the firewall is enabled
Add the appropriate URLs or hosts to the custom allowlist in this repository's Copilot coding agent settings (admins only)

Fixes two sources of false-positive failures observed across recent E2E builds on main. ### 1. Remove FailedCreatePodSandBox early exit (`kube.go`) The previous code listed pod events every 3s and killed the test on the first FailedCreatePodSandBox event. But kubelet retries sandbox creation automatically — transient errors like "loopback interrupted system call" resolve on their own. The early exit was killing tests before kubelet could recover. Removed the event check entirely. The 5-minute poll timeout already handles persistent failures, and CrashLoopBackOff still fails fast. Also removed the now-unused `WaitUntilPodRunningWithRetry` wrapper and its `maxRetries` parameter. ### 2. Fix duplicate CSE event extraction (`cse_timing.go`) The `find` command searched two overlapping paths: `cseEventsDir` (`.../events/`) AND its parent directory with `-path '*/events/*'`. The same files matched twice → duplicate tasks → spurious `Task_installDeps#01` subtests (observed in 11/46 recent builds). Fixed by using a single starting path. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot

Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated no new comments.

timmy-wright · 2026-05-25T01:03:47Z

-	Tasks      []CSETaskTiming
-	Provision  *CSEProvisionTiming
-	taskIndex  map[string]*CSETaskTiming
+	Tasks     []CSETaskTiming


Not related to this PR, but we probably should have some formatting or lint rule to stop whitespace only changes in PRs.

Copilot AI review requested due to automatic review settings May 8, 2026 04:13

r2k1 requested review from AbelHu, Devinwong, SriHarsha001, awesomenix, calvin197, cameronmeissner, djsly, ganeshkumarashok, junjiezhang1997, lilypan26, mxj220, pdamianov-dev, phealy, sulixu, surajssd, timmy-wright and zachary-bailey as code owners May 8, 2026 04:13

r2k1 temporarily deployed to test May 8, 2026 04:13 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 May 8, 2026 04:13 View session

r2k1 force-pushed the e2e-flakiness-fixes branch from f2a44c2 to ba7f563 Compare May 8, 2026 04:18

r2k1 temporarily deployed to test May 8, 2026 04:19 — with GitHub Actions Inactive

Copilot AI reviewed May 8, 2026

View reviewed changes

r2k1 force-pushed the e2e-flakiness-fixes branch from ba7f563 to e29f86b Compare May 8, 2026 04:56

r2k1 temporarily deployed to test May 8, 2026 04:56 — with GitHub Actions Inactive

r2k1 force-pushed the e2e-flakiness-fixes branch from e29f86b to fb806d0 Compare May 8, 2026 04:57

Copilot AI review requested due to automatic review settings May 8, 2026 04:57

r2k1 temporarily deployed to test May 8, 2026 04:57 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 May 8, 2026 04:58 View session

This comment was marked as outdated.

Sign in to view

r2k1 force-pushed the e2e-flakiness-fixes branch from 3e2dd81 to af07286 Compare May 10, 2026 20:47

r2k1 temporarily deployed to test May 10, 2026 20:49 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings May 10, 2026 20:50

r2k1 force-pushed the e2e-flakiness-fixes branch from af07286 to 7480aaf Compare May 10, 2026 20:50

r2k1 temporarily deployed to test May 10, 2026 20:50 — with GitHub Actions Inactive

Copilot started reviewing on behalf of r2k1 May 10, 2026 20:50 View session

r2k1 force-pushed the e2e-flakiness-fixes branch from 7480aaf to 2957903 Compare May 10, 2026 20:51

r2k1 temporarily deployed to test May 10, 2026 20:51 — with GitHub Actions Inactive

This comment was marked as outdated.

Sign in to view

r2k1 force-pushed the e2e-flakiness-fixes branch from 2957903 to bb79253 Compare May 10, 2026 20:57

r2k1 temporarily deployed to test May 10, 2026 20:57 — with GitHub Actions Inactive

Copilot AI review requested due to automatic review settings May 10, 2026 21:01

r2k1 force-pushed the e2e-flakiness-fixes branch from bb79253 to 29ff11a Compare May 10, 2026 21:01

r2k1 temporarily deployed to test May 10, 2026 21:01 — with GitHub Actions Inactive

This comment was marked as outdated.

Sign in to view

Copilot AI reviewed May 10, 2026

View reviewed changes

Comment thread e2e/validation.go Outdated

Copilot AI reviewed May 11, 2026

View reviewed changes

r2k1 commented May 11, 2026

View reviewed changes

djsly requested changes May 12, 2026

View reviewed changes

Copilot AI reviewed May 12, 2026

View reviewed changes

Comment thread e2e/cse_timing.go Outdated

r2k1 mentioned this pull request May 25, 2026

fix(e2e): strict wireserver validation — fail fast on unexpected curl exits #8580

Open

Copilot AI reviewed May 25, 2026

View reviewed changes

timmy-wright reviewed May 25, 2026

View reviewed changes

timmy-wright approved these changes May 25, 2026

View reviewed changes

Conversation

r2k1 commented May 8, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

1. Remove FailedCreatePodSandBox early exit (kube.go)

2. Fix duplicate CSE event extraction (cse_timing.go)

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

This comment was marked as outdated.

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

r2k1 May 10, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

r2k1 May 12, 2026

Choose a reason for hiding this comment

Uh oh!

r2k1 left a comment

Choose a reason for hiding this comment

Uh oh!

djsly commented May 12, 2026

Uh oh!

djsly left a comment

Choose a reason for hiding this comment

Uh oh!

djsly May 12, 2026

Choose a reason for hiding this comment

Uh oh!

r2k1 May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

Uh oh!

Copilot AI commented May 13, 2026

I tried to connect to the following addresses, but was blocked by firewall rules:

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Uh oh!

timmy-wright May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

r2k1 commented May 8, 2026 •

edited

Loading

1. Remove FailedCreatePodSandBox early exit (`kube.go`)

2. Fix duplicate CSE event extraction (`cse_timing.go`)

r2k1 May 10, 2026 •

edited

Loading

r2k1 May 25, 2026 •

edited

Loading