Migrate daily.yml to self-hosted ARC runners for Daily and Weekly runs #3681

Open
roshkhatri wants to merge 1 commit into valkey-io:unstable from roshkhatri:arc-runners-ci
Conversation

@roshkhatri
Member

@roshkhatri roshkhatri commented May 12, 2026

ci: migrate daily.yml to self-hosted ARC runners (schedule/dispatch only)

Use conditional runs-on to route CI jobs to self-hosted EKS runners for scheduled (daily/weekly) and manual dispatch runs. PR-triggered runs continue using GitHub-hosted ubuntu-latest.

Runner selection logic

schedule / workflow_dispatch / workflow_call / PR from valkey-committers -->  self-hosted (EKS + ARC + Karpenter)
pull_request -->  ubuntu-latest (GitHub-hosted)
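
A minimal sketch of how this routing can be written as a conditional `runs-on` expression, in the same form that appears in the diff for this PR. It covers only the `schedule` and `workflow_dispatch` cases; the `workflow_call` and committer-PR routing listed above is not shown, and the steps are placeholders rather than the real `daily.yml` content.

```yaml
# Illustrative job only; the steps below are placeholders.
jobs:
  test-ubuntu-jemalloc:
    # `cond && 'a' || 'b'` behaves like a ternary in GitHub Actions expressions
    # as long as 'a' is a non-empty string: scheduled and manually dispatched
    # runs go to the self-hosted pool, every other trigger falls back to the
    # GitHub-hosted runner.
    runs-on: ${{ (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && 'valkey-x64' || 'ubuntu-latest' }}
    steps:
      - uses: actions/checkout@v4
      - name: Build
        run: make
```

The large-memory and ARM64 jobs would presumably use the same expression with `valkey-x64-largemem` or `valkey-arm64` in place of `valkey-x64`.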

Runner pools

| Pool | Label | Instance Types | Resources/Pod | Jobs |
|---|---|---|---|---|
| x64 | valkey-x64 | c7i/m7i/c6i/m6i .2xlarge–.4xlarge | 4 vCPU, 16 GiB | Standard builds, TLS, io-threads, sanitizers, valgrind, lttng, reply-schemas |
| x64-largemem | valkey-x64-largemem | m7i/m6i .4xlarge | 8 vCPU, 32 GiB | ASan/UBSan large-memory, Valgrind (test + misc + no-malloc-usable-size) |
| arm64 | valkey-arm64 | c7g/m7g .2xlarge–.4xlarge | 4 vCPU, 16 GiB | Native ARM64 builds |

Jobs that always stay on ubuntu-latest

| Job | Reason |
|---|---|
| test-freebsd | QEMU x86 emulation (needs fdisk/qemu) |
| test-s390x | QEMU s390x emulation (run-on-arch-action) |
| test-ubuntu-jemalloc-fortify | Uses container: ubuntu:noble |
| test-rpm-distros-* | Uses container images (almalinux, centos, fedora) |
| test-alpine-* | Uses container: alpine |

Moving these jobs to self-hosted runners would add maintenance effort: the container images would have to be cached on ECR so that we don't run into Docker pull rate limits.

Infrastructure

| Component | Detail |
|---|---|
| Cluster | EKS (Kubernetes 1.31) |
| Auto-scaling | Karpenter 1.1.1 (scales to zero when idle) |
| Runner controller | ARC 0.14.1 (gha-runner-scale-set) |
| Runner image | ghcr.io/actions/actions-runner:latest (no custom image) |
| Max runners | 400 (x64) / 200 (largemem) / 100 (arm64) |
| Infra repo | roshkhatri/valkey-arc-runners (Terraform) |

Scaling performance (observed)

  • Scales from 0 to 100 runners in ~40 seconds (cold-start burst)
  • Each runner pod matches GitHub-hosted specs (4 vCPU / 16 GiB)
  • Nodes terminate 2 minutes after becoming idle

@roshkhatri roshkhatri marked this pull request as ready for review May 12, 2026 18:03
@coderabbitai

coderabbitai Bot commented May 12, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Walkthrough

Updated .github/workflows/daily.yml to conditionally select Valkey runners for scheduled/manual runs and add explicit "Install build dependencies" steps with standardized apt-get update + apt-get install -y package lists and standardized gtest/gmock and Tcl/TLS installs across many test jobs.

Changes

Conditional Runner Selection and Build Dependency Standardization

| Layer / File(s) | Summary |
|---|---|
| Standard Ubuntu memory / allocator variants (.github/workflows/daily.yml) | jemalloc, libc-malloc, no-malloc-usable-size, and 32-bit test jobs now use conditional runs-on expressions (selecting Valkey runners for schedule/dispatch), add an explicit "Install build dependencies" step, and standardize apt-get install -y usage for core build packages, gtest, and Tcl testprep (includes Redis backward-compat Tcl standardization). |
| ARM architecture test job (.github/workflows/daily.yml) | ARM test job selects valkey-arm64 conditionally for scheduled/manual triggers and adds standardized build dependency, gtest, and Tcl testprep apt-get install -y steps. |
| TLS and I/O threads variant tests (.github/workflows/daily.yml) | TLS, TLS-no-TLS, TLS-I/O, and I/O-threads jobs use conditional runners and explicit build dependency installation; testprep steps standardized for Tcl and tcl-tls where applicable. |
| Cache reclaim test job (.github/workflows/daily.yml) | Reclaim-cache job uses conditional runner selection and standardized build dependency installation; vmtouch/fs prep install command formatted to use apt-get install -y while preserving behavior. |
| Valgrind memory-analysis jobs (.github/workflows/daily.yml) | Valgrind jobs now select valkey-x64-largemem conditionally for scheduled/manual triggers and add explicit build dependency installs with standardized gtest and supporting tool installs. |
| Sanitizer jobs (ASan/UBSan + large-memory variants) (.github/workflows/daily.yml) | All sanitizer jobs now select appropriate Valkey runners conditionally, add explicit build dependency installation, standardize gtest installs, and include extra tooling (e.g., clang) in large-memory sanitizer job install lists. |
| LTTNG tracing and reply-schemas-validator jobs (.github/workflows/daily.yml) | LTTNG and reply-schemas-validator jobs use conditional runs-on, add explicit build dependency installs, and standardize lttng/Tcl prerequisites to apt-get install -y. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I hopped through workflows, one by one,
With runners chosen when the schedule's done,
Apt-get hums and packages come to play,
Tcl and gtest greeting the CI day,
A tidy hop — then off I run.

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly summarizes the main change: migrating daily.yml workflows to self-hosted ARC runners for scheduled/dispatch runs, which is exactly what the PR accomplishes. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The pull request description clearly explains the migration to self-hosted ARC runners for scheduled/dispatch runs, with detailed runner pool configurations, infrastructure details, and reasoning for jobs remaining on ubuntu-latest. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


@roshkhatri roshkhatri requested a review from madolson May 12, 2026 18:03
@coderabbitai coderabbitai Bot left a comment

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (2)
.github/workflows/daily.yml (2)

172-210: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Add the missing package/bootstrap step on these ARC jobs.

These jobs now take the self-hosted path for schedule/workflow_dispatch, but unlike the other migrated Ubuntu jobs they never run an initial apt-get update + toolchain/dev-package install before ./configure/make. On the stock ARC runner image, that makes the migrated path fail even if the GitHub-hosted path still passes.

Suggested fix
+      - name: Install build dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y build-essential autoconf automake pkg-config libssl-dev tcl tcl-dev python3

Also applies to: 1143-1185, 1302-1344

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/daily.yml around lines 172 - 210, The ARC job paths are
missing a bootstrap package step so the self-hosted runner fails before
building; add a new step (e.g., name: bootstrap packages) before "Install
libbacktrace" / before the run that does "cd libbacktrace && ./configure && make
&& sudo make install" to run apt-get update and install build/dev toolchain
packages (at minimum: apt-get update && apt-get install -y build-essential
autoconf automake libtool pkg-config gcc g++ make python3) so ./configure/make
and the later "make all-with-unit-tests" step succeed; apply the same insertion
for the other ARC job blocks referenced (the ranges around the other job
copies).

1454-1493: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Do not route the LTTng job to ARC runners—it requires host kernel module support.

This migration moves scheduled/manual LTTng runs onto ARC, but the job still installs lttng-modules-dkms. That package requires host kernel headers, the /lib/modules directory, and module-loading capabilities (modprobe, depmod) to build and load kernel modules. Standard Kubernetes runner pods do not provide these. DKMS post-install scripts will fail because containers cannot load modules into the host kernel, and even with privileged pod configuration, the job would fail without kernel headers matching the EKS host kernel version. This is an architectural incompatibility, not a configuration issue.

Revert runs-on to ubuntu-latest for this job:

Fix
-    runs-on: ${{ (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && 'valkey-x64' || 'ubuntu-latest' }}
+    runs-on: ubuntu-latest
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/daily.yml around lines 1454 - 1493, The job mistakenly
targets ARC runners via the conditional runs-on expression; since the Build
libbacktrace step installs lttng-modules-dkms (and uses modprobe/depmod/kernel
headers), change the job's runs-on to a non-ARC host (set runs-on:
'ubuntu-latest') so the DKMS/module install can run on a real VM, keep the
Install libbacktrace and Build libbacktrace steps as-is (they rely on
lttng-modules-dkms), and remove the complex ternary expression currently
assigned to runs-on to ensure this job does not run on ARC.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: e0e026da-0844-4a17-9131-76fa0ff1c6ac

📥 Commits

Reviewing files that changed from the base of the PR and between d4337d6 and 67554d1.

📒 Files selected for processing (1)
  • .github/workflows/daily.yml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/daily.yml:
- Around line 1182-1187: The workflow runs apt-get install in the "Install
gtest" step before refreshing the package index, which can fail on fresh
runners; update the package index prior to any apt installs by adding or moving
a sudo apt-get update into the "Install gtest" step (or make a shared
preliminary step that runs sudo apt-get update) so that both the "Install gtest"
and "Install build dependencies" steps run after the update.
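
A minimal sketch of the ordering this comment asks for, assuming the gtest step is the first apt-based step in the job. The package names `libgtest-dev` and `libgmock-dev` are the usual Ubuntu packages and are an assumption here; the job's real install list should be kept as it is.

```yaml
      - name: Install gtest
        run: |
          # Refresh the package index first; a fresh runner image may not
          # ship with an up-to-date index, so a bare install can fail.
          sudo apt-get update
          # Assumed package names; keep whatever the job already installs.
          sudo apt-get install -y libgtest-dev libgmock-dev
```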

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: ec9ffaf1-4751-4943-ac10-f4f9945be7ec

📥 Commits

Reviewing files that changed from the base of the PR and between 67554d1 and 2b2f96f.

📒 Files selected for processing (1)
  • .github/workflows/daily.yml

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🧹 Nitpick comments (1)
.github/workflows/daily.yml (1)

106-109: 🏗️ Heavy lift

Consider extracting repeated Ubuntu bootstrap into a shared action

The same apt-get update + dependency install block is duplicated across many jobs. A composite action/reusable workflow would reduce drift and future maintenance overhead when package lists change.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @.github/workflows/daily.yml around lines 106 - 109, The "Install build
dependencies" step duplicates an apt-get bootstrap across jobs; extract that
shell block into a reusable unit (either a composite action under
.github/actions/bootstrap or a reusable workflow) and replace each job step that
runs the apt-get update/install commands with a single "uses:
./.github/actions/bootstrap" (or appropriate workflow call) invocation; ensure
the new composite action exposes any needed inputs (e.g., extra packages) and
preserves the original behavior of the "Install build dependencies" step so all
jobs call the shared action instead of repeating the shell block.
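
One possible shape for such a composite action, as a sketch rather than a drop-in file. The path follows the suggestion above; the input name `extra-packages` and the base package list are assumptions made for illustration.

```yaml
# .github/actions/bootstrap/action.yml (path suggested in the comment above)
name: Bootstrap Ubuntu build dependencies
description: Install the shared apt packages needed before building
inputs:
  extra-packages:            # hypothetical input name
    description: Additional apt packages, space separated
    required: false
    default: ""
runs:
  using: composite
  steps:
    - name: Install build dependencies
      shell: bash            # run steps in a composite action must declare a shell
      env:
        EXTRA_PACKAGES: ${{ inputs.extra-packages }}
      run: |
        sudo apt-get update
        # $EXTRA_PACKAGES is left unquoted on purpose so it splits into words.
        sudo apt-get install -y build-essential autoconf automake pkg-config libssl-dev python3 $EXTRA_PACKAGES
```

A job would then call it after checkout, since a local action is resolved from the checked-out workspace:

```yaml
      - uses: actions/checkout@v4
      - uses: ./.github/actions/bootstrap
        with:
          extra-packages: tcl tcl-dev
```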
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/daily.yml:
- Line 75: The runs-on expression currently sends all workflow_dispatch runs to
self-hosted labels like 'valkey-x64', which can deadlock on forks; update the
expression to only pick the self-hosted label when either event is 'schedule' or
when event is 'workflow_dispatch' AND the run is in the canonical repository
(e.g. check github.repository == 'OWNER/REPO'), otherwise fall back to
'ubuntu-latest' — specifically change the existing runs-on expression (the line
that inspects github.event_name and returns 'valkey-x64') to something like:
(github.event_name == 'schedule' || (github.event_name == 'workflow_dispatch' &&
github.repository == 'OWNER/REPO')) && 'valkey-x64' || 'ubuntu-latest', and
apply the same guarded pattern to the other migrated runs-on expressions (e.g.
'valkey-arm64', 'valkey-x64-largemem', etc.).
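
The guarded expression described above could look like the following sketch. `OWNER/REPO` is the placeholder from the comment and must be replaced with the canonical repository name; the job name and step are illustrative only.

```yaml
jobs:
  test-ubuntu-jemalloc:
    # Self-hosted only for scheduled runs, or for manual dispatch in the
    # canonical repository; forks fall back to ubuntu-latest instead of
    # waiting on a runner label they do not have.
    runs-on: ${{ (github.event_name == 'schedule' || (github.event_name == 'workflow_dispatch' && github.repository == 'OWNER/REPO')) && 'valkey-x64' || 'ubuntu-latest' }}
    steps:
      - uses: actions/checkout@v4
```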


ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro Plus

Run ID: d36950e9-ab1e-4a76-9758-f17ba58fa05e

📥 Commits

Reviewing files that changed from the base of the PR and between 2b2f96f and 3651c12.

📒 Files selected for processing (1)
  • .github/workflows/daily.yml

@sarthakaggarwal97 sarthakaggarwal97 self-requested a review May 12, 2026 19:56
@codecov

codecov Bot commented May 12, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.71%. Comparing base (d4337d6) to head (9cbf5ca).
⚠️ Report is 5 commits behind head on unstable.

Additional details and impacted files
@@             Coverage Diff              @@
##           unstable    #3681      +/-   ##
============================================
+ Coverage     76.66%   76.71%   +0.05%     
============================================
  Files           162      162              
  Lines         80656    80662       +6     
============================================
+ Hits          61831    61876      +45     
+ Misses        18825    18786      -39     

see 26 files with indirect coverage changes


Contributor

@sarthakaggarwal97 sarthakaggarwal97 left a comment

Thanks @roshkhatri. I think this PR is important for supporting patch and CVE releases, as we often hit constraints on our GitHub runners. I left some comments on one of the steps, but they probably apply to all of them.

        with:
          repository: ${{ inputs.use_repo || github.event.inputs.use_repo || github.repository }}
          ref: ${{ inputs.use_git_ref || github.event.inputs.use_git_ref || github.ref }}
      - name: Install build dependencies
Contributor

we don't need this step for ubuntu-latest
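
One way to address this, sketched under the assumption that the step is only needed on the minimal self-hosted image: gate it with an `if:` that mirrors the `runs-on` condition, so pull request runs on ubuntu-latest skip it. The package list is copied from the diff in this PR.

```yaml
      - name: Install build dependencies
        # Runs only for the triggers that are routed to self-hosted runners.
        if: github.event_name == 'schedule' || github.event_name == 'workflow_dispatch'
        run: |
          sudo apt-get update
          sudo apt-get install -y wget build-essential autoconf automake pkg-config libssl-dev tcl tcl-dev python3
```

An alternative condition is `runner.environment == 'self-hosted'`, which keys off the runner itself rather than the trigger; which of the two fits better depends on how the final routing rules turn out.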

Comment thread .github/workflows/daily.yml Outdated
      - name: Install build dependencies
        run: |
          sudo apt-get update
          sudo apt-get install -y wget build-essential autoconf automake pkg-config libssl-dev tcl tcl-dev python3
Contributor

tcl is installed below already

Comment thread .github/workflows/daily.yml Outdated
 jobs:
   test-ubuntu-jemalloc:
-    runs-on: ubuntu-latest
+    runs-on: ${{ (github.event_name == 'schedule' || github.event_name == 'workflow_dispatch') && 'valkey-x64' || 'ubuntu-latest' }}
Contributor

What is the behaviour for run-extra-tests here?

Member Author

We wouldn't use self-hosted runners for testing PRs. I still have to make a change so that it only runs for release versions on PRs from trusted members who are committers.

@sarthakaggarwal97
Contributor

@roshkhatri what is the work required to set these runners up for the valkey project? We talked offline and you had tested this in your personal account.

@roshkhatri
Member Author

> @roshkhatri what is the work required to set these runners up for the valkey project? We talked offline and you had tested this in your personal account.

I would move this repo, https://github.com/roshkhatri/valkey-arc-runners, to valkey and run the setup steps in its README for valkey's AWS account.

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/daily.yml:
- Around line 106-110: The YAML block under the step named "Install build
dependencies" has a misindented line after the run: | scalar (one leading space)
which breaks parsing; fix by aligning all lines inside the block scalar so each
command line is indented consistently (e.g., two spaces or the same indentation
level as the first command) under run: |; apply the same correction to every
identical run: | block in the workflow (the other occurrences of the "Install
build dependencies" run blocks) so each multi-line shell script has uniform
indentation.
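
For reference, a correctly indented block scalar looks like the sketch below. The indentation of the first content line sets the level for the whole block, so a later line with less indentation ends the scalar and breaks parsing. The absolute indentation shown is an assumption about how steps are nested in `daily.yml`; the package list is copied from the diff.

```yaml
      - name: Install build dependencies
        run: |
          # Both commands share the same indentation under `run: |`.
          sudo apt-get update
          sudo apt-get install -y wget build-essential autoconf automake pkg-config libssl-dev tcl tcl-dev python3
```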

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro Plus

Run ID: 4717b73b-53c1-4915-866f-a0afa42a9bf1

📥 Commits

Reviewing files that changed from the base of the PR and between 3651c12 and 5f7bd0b.

📒 Files selected for processing (1)
  • .github/workflows/daily.yml

…nly)

Use conditional runs-on to route jobs to self-hosted ARC runners only
for scheduled (daily/weekly) and manual dispatch runs. PR-triggered
runs continue using GitHub-hosted ubuntu-latest.

Changes:
- Conditional runs-on: schedule/dispatch -> self-hosted, PR -> ubuntu-latest
- Add "Install build dependencies" step for self-hosted (minimal image)
- Add -y flag to all apt-get install commands for non-interactive CI
- Container/emulation jobs (alpine, rpm-distros, freebsd, s390x) always ubuntu-latest

Runner pools:
- valkey-x64: 4 vCPU, 16 GiB (standard builds, TLS, sanitizers, valgrind)
- valkey-x64-largemem: 8 vCPU, 32 GiB (large-memory sanitizer/valgrind)
- valkey-arm64: 4 vCPU, 16 GiB (native ARM64)

Infra: EKS + Karpenter + ARC 0.14.1 (valkey-io/valkey-arc-runners)

Signed-off-by: Roshan Khatri <rvkhatri@amazon.com>