ci: migrate Linux workflow to RunsOn self-hosted runners #14367

Open
mrpollo wants to merge 2 commits into master from mrpollo/runs-on-linux-ci

Member

@mrpollo mrpollo commented May 12, 2026

This PR migrates the Linux workflow (.github/workflows/linux.yml) from GitHub-hosted runners to RunsOn self-hosted ephemeral EC2 runners. Runners are backed by the new qgc-ci-runs-on CloudFormation stack in the Dronecode AWS account (us-west-2, RunsOn v3.0.6). x86_64 builds run on c8i.2xlarge, ARM64 on c8g.2xlarge, the Debug+coverage test job on m8i.2xlarge — all 8 vCPU, On-Demand, ubuntu24-full-* images.

Wall-clock impact

Comparison of the most recent successful GitHub-hosted run on master with the equivalent run on this branch:

| Job | GitHub-hosted | RunsOn (this PR) | Δ |
| --- | --- | --- | --- |
| Release linux_gcc_64 | 36m 40s | 15m 10s | -21m 30s (-59%) |
| Release linux_gcc_arm64 | 21m 26s | 13m 04s | -8m 22s (-39%) |
| Test + Coverage linux_gcc_64 | 30m 43s | 20m 21s | -10m 22s (-34%) |
| Wall-clock total (parallel) | 36m 40s | 20m 21s | -16m 19s (-44%) |
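The Δ column can be recomputed from the two duration columns. The following is a minimal sketch of that arithmetic, not part of the PR; the helper names are made up for illustration, and the wall-clock total is assumed to be the slowest job on each side because the jobs run in parallel.

```python
def seconds(minutes: int, secs: int) -> int:
    """Convert an 'Xm Ys' duration to seconds."""
    return minutes * 60 + secs


def delta(before: int, after: int) -> tuple:
    """Return (signed 'Xm YYs' difference, percent change rounded to an integer)."""
    diff = after - before
    sign = "-" if diff < 0 else "+"
    m, s = divmod(abs(diff), 60)
    return f"{sign}{m}m {s:02d}s", round(100 * diff / before)


# (GitHub-hosted, RunsOn) durations copied from the table above.
RUNS = {
    "Release linux_gcc_64": (seconds(36, 40), seconds(15, 10)),
    "Release linux_gcc_arm64": (seconds(21, 26), seconds(13, 4)),
    "Test + Coverage linux_gcc_64": (seconds(30, 43), seconds(20, 21)),
}

# Assumption: parallel jobs, so wall-clock time is the slowest job on each side.
WALL_CLOCK = (
    max(before for before, _ in RUNS.values()),
    max(after for _, after in RUNS.values()),
)

for job, (before, after) in {**RUNS, "Wall-clock total": WALL_CLOCK}.items():
    text, pct = delta(before, after)
    print(f"{job}: {text} ({pct:+d}%)")
```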

Caches are still cold on this run — no master push has populated the Ubuntu 24.04 warm caches yet. Once this merges, the next push to master populates ccache-linux-…-shared-, cpm-modules-shared-, pipx-linux-…, and qt-linux-desktop-… for everyone, and PR runs should drop another 5-8 min off both Release builds.

Context

PX4-Autopilot has been running on RunsOn for over a year in the same account; this migration brings QGC onto the same infrastructure. Once this merges, the same pattern (inline runs-on= labels plus the runs-on/action@v2 step) can be applied to analysis.yml, pr-checks.yml, and the Linux portions of pre-commit.yml / ci-scripts.yml with very small diffs.

Runner choices

| Job | Instance | On-Demand $/h | Why |
| --- | --- | --- | --- |
| Release linux_gcc_64 | c8i.2xlarge | $0.375 | Intel Granite Rapids, 16 GB, full-fat CPU |
| Release linux_gcc_arm64 | c8g.2xlarge | $0.319 | Graviton4, 16 GB, cheapest in the table |
| Test+Coverage linux_gcc_64 Debug | m8i.2xlarge (volume=60gb) | $0.423 | 32 GB headroom for tests + .gcda |

The runners are deliberately On-Demand instances of the latest generation: earlier iterations on Spot c7i/m7i hit reclaims mid-build. The headline cost is ~$0.30-0.40 per Linux PR run across all three jobs, so the choice is driven by engineer time rather than compute spend.
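As a rough cross-check of the per-PR figure, the hourly prices from the runner table can be multiplied by the cold-cache job durations reported under "Wall-clock impact". This is a back-of-the-envelope sketch under stated assumptions: billing is taken as per-second on job duration only, and instance boot time, EBS volumes, S3 cache storage, and data transfer are ignored, so the real bill will differ somewhat.

```python
# On-Demand prices (USD per hour) from the runner table.
HOURLY_USD = {
    "c8i.2xlarge": 0.375,
    "c8g.2xlarge": 0.319,
    "m8i.2xlarge": 0.423,
}

# (job, instance type, cold-cache duration in seconds) from the timing table.
JOBS = [
    ("Release linux_gcc_64", "c8i.2xlarge", 15 * 60 + 10),
    ("Release linux_gcc_arm64", "c8g.2xlarge", 13 * 60 + 4),
    ("Test + Coverage linux_gcc_64", "m8i.2xlarge", 20 * 60 + 21),
]


def job_cost(instance: str, runtime_s: int) -> float:
    """Compute cost in USD, assuming per-second billing on job time only."""
    return HOURLY_USD[instance] * runtime_s / 3600


def total_cost(jobs) -> float:
    """Sum the compute cost of all jobs in one PR run."""
    return sum(job_cost(instance, runtime_s) for _, instance, runtime_s in jobs)


for name, instance, runtime_s in JOBS:
    print(f"{name}: ${job_cost(instance, runtime_s):.3f}")
print(f"Total per PR run: ${total_cost(JOBS):.2f}")
```

Under these assumptions the total lands at the low end of the quoted range; warm caches would shorten the jobs and lower it further.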

Caching

extras=s3-cache on each runner label plus runs-on/action@v2 as the first step bootstraps RunsOn's "magic cache" — a sidecar that transparently intercepts every actions/cache@v5 call and redirects it to the S3 bucket provisioned by the stack. Existing cache calls for ccache, Qt SDK, GStreamer, pipx, apt packages, and CPM modules all work unchanged. runs-on/action@v2 is a no-op on GitHub-hosted runners so the workflow stays portable if anyone needs to revert.

Matrix cleanup

The build job's matrix used os: [ubuntu-24.04-arm, ubuntu-22.04] both as the runner label and as a discriminator for two size-analysis steps (if: matrix.os == 'ubuntu-22.04'). With RunsOn we don't need os to pick the runner. The dual-purpose field is gone and matrix.arch is the single discriminator. The two if: conditions changed from matrix.os == 'ubuntu-22.04' to matrix.arch == 'linux_gcc_64' which is what they always meant ("x86_64 only").
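To illustrate the idea of a single discriminator, here is a small model of the selection logic, written in Python rather than workflow YAML. It is a sketch, not the contents of linux.yml: the `/`-separated label layout, the `family=` key placement, and the `ubuntu24-full-arm64` image name are assumptions based on the parameters this description mentions (runs-on=, image=, extras=s3-cache, volume=60gb), and both function names are hypothetical.

```python
from typing import Optional

# Assumed mapping from matrix.arch to runner parameters (see the runner table).
RUNNERS = {
    "linux_gcc_64": {"family": "c8i.2xlarge", "image": "ubuntu24-full-x64"},
    "linux_gcc_arm64": {"family": "c8g.2xlarge", "image": "ubuntu24-full-arm64"},
}


def runner_label(arch: str, run_id: str, volume: Optional[str] = None) -> str:
    """Build an inline runner label for one matrix entry (layout is assumed)."""
    spec = RUNNERS[arch]
    parts = [
        f"runs-on={run_id}",
        f"family={spec['family']}",
        f"image={spec['image']}",
        "extras=s3-cache",
    ]
    if volume:
        parts.append(f"volume={volume}")
    return "/".join(parts)


def runs_size_analysis(arch: str) -> bool:
    """Mirror of the step condition matrix.arch == 'linux_gcc_64' (x86_64 only)."""
    return arch == "linux_gcc_64"


for arch in RUNNERS:
    print(arch, runner_label(arch, "12345"), runs_size_analysis(arch))
```

The point of the model is that `arch` alone decides both the runner and whether the size-analysis steps run, so no second matrix field is needed.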

OS bump

The x86_64 Release build's host OS moves from Ubuntu 22.04 → Ubuntu 24.04. That bumps the AppImage's glibc baseline (22.04 ships glibc 2.35 vs. 24.04's 2.39), so AppImages produced here won't run on hosts whose glibc is older than the new baseline (RHEL 8, Ubuntu 20.04, Debian 11). If supporting older distros matters for QGC's release builds, the image= parameter in the runner label can be switched back to ubuntu22-full-x64 with no other workflow changes.
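A host can be checked against the new baseline with a numeric version comparison. The sketch below is illustrative only: it treats the build host's glibc (2.39) as the minimum, whereas the true requirement depends on which versioned symbols the binary actually references. The function names are hypothetical; `platform.libc_ver()` is from the Python standard library.

```python
import platform
from typing import Optional, Tuple

BASELINE = (2, 39)  # glibc shipped by Ubuntu 24.04, per the description above


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '2.35' into (2, 35) so that 2.4 sorts below 2.39."""
    return tuple(int(part) for part in text.split("."))


def meets_baseline(host_glibc: str, baseline: Tuple[int, ...] = BASELINE) -> bool:
    """True if a host with this glibc version satisfies the assumed baseline."""
    return parse_version(host_glibc) >= baseline


def host_glibc() -> Optional[str]:
    """Return the running interpreter's glibc version, or None if not glibc."""
    lib, version = platform.libc_ver()
    return version if lib == "glibc" and version else None


if __name__ == "__main__":
    detected = host_glibc()
    if detected is None:
        print("glibc not detected on this host")
    else:
        print(f"glibc {detected}: meets baseline = {meets_baseline(detected)}")
```

Comparing parsed tuples rather than strings matters here, since a plain string comparison would rank "2.4" above "2.39".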

Test execution restructured

cmake/QGCTest.cmake:164 auto-attaches RESOURCE_LOCK "MockLink" to every Integration test because MockLink shares a LinkManager singleton and a static _nextVehicleSystemId counter. A single CTest invocation over both labels with --parallel auto silently serialized everything on that lock. Split into two passes:

  • Run Unit Tests (parallel): -L Unit, --parallel auto. 151 Unit tests with no shared state.
  • Run Integration Tests (serial): -L Integration, --parallel 1. 37 Integration tests serialize on shared MockLink state.

Each pass writes its own junit + ctest output; downstream Analyze / Report / Upload steps run once per pass. The coverage path picks up .gcda from both passes via the existing find . -name '*.gcda'. The split is responsible for most of the test job's wall-clock improvement (process-startup overhead across 151 Unit invocations was the hidden bottleneck, not test execution itself).
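The two passes can be pictured as two separate CTest command lines. The following sketch only builds the argument lists; it is not the workflow's actual step. The build directory and JUnit file names are hypothetical, and the Unit pass uses an explicit job count from `os.cpu_count()` in place of the `--parallel auto` spelling quoted above.

```python
import os
from typing import List


def ctest_command(label: str, jobs: int, build_dir: str = "build") -> List[str]:
    """Build one CTest invocation restricted to a single test label."""
    return [
        "ctest",
        "--test-dir", build_dir,
        "-L", label,                       # run only tests carrying this label
        "--parallel", str(jobs),           # job count for this pass
        "--output-junit", f"junit-{label.lower()}.xml",  # hypothetical file name
        "--output-on-failure",
    ]


def ctest_passes(build_dir: str = "build") -> List[List[str]]:
    """Unit tests in parallel, then Integration tests strictly serial."""
    cpus = os.cpu_count() or 1
    return [
        ctest_command("Unit", cpus, build_dir),
        ctest_command("Integration", 1, build_dir),
    ]


for command in ctest_passes():
    print(" ".join(command))
```

To execute the passes, each list could be handed to `subprocess.run(command, check=False)` so that the second pass still runs if the first reports failures.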

The tester runner uses volume=60gb. The 40 GB default left only ~1-2 GB headroom at peak (Debug build + Qt SDK + caches + .gcda + scratch), which silently killed the runner agent before any diagnostic step could write to its log — caught after three identical failures by SSM'ing into the live runner.
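Because the failure mode was silent, an early free-space check can make low headroom visible in the job log before the disk fills. This is a hypothetical diagnostic, not something the PR adds; the 5 GiB threshold is an arbitrary illustration, and `shutil.disk_usage` is from the Python standard library.

```python
import shutil


def free_gib(path: str = "/") -> float:
    """Free space on the filesystem containing `path`, in GiB."""
    return shutil.disk_usage(path).free / 2**30


def check_headroom(path: str = "/", min_free_gib: float = 5.0) -> bool:
    """Log free space and report whether it meets the (hypothetical) minimum."""
    free = free_gib(path)
    print(f"{path}: {free:.1f} GiB free (minimum {min_free_gib:.1f} GiB)")
    return free >= min_free_gib


if __name__ == "__main__":
    # A non-zero exit makes the step fail loudly instead of dying silently later.
    raise SystemExit(0 if check_headroom() else 1)
```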

Companion fix: QGCKeychain headless D-Bus fallback

The first PR run failed SigningTest and QGCKeychainTest because the Ubuntu 24.04 image doesn't run a Secret Service daemon by default, and libsecret reports "Cannot autolaunch D-Bus without X11 $DISPLAY" instead of the patterns the existing isMissingSecretService() knew about. The fix recognizes the autolaunch / missing-session-bus error strings as the same "no backend" condition so the QSettings fallback kicks in, matching the behavior on macOS/Windows when no keychain is configured.
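The matching logic can be sketched in a few lines. This is a Python model for illustration only: the real change lives in the C++ file src/Utilities/Platform/QGCKeychain.cc, and the marker list below contains just the strings quoted in this PR (the exact missing-session-bus wording is not reproduced here). The `write_secret` helper and the dict standing in for QSettings are hypothetical.

```python
from typing import Callable, Dict, Optional

# Substrings treated as "no Secret Service backend reachable" (lower-cased).
NO_BACKEND_MARKERS = (
    "org.freedesktop.secrets",   # existing pattern: service not registered
    "serviceunknown",            # existing pattern
    "cannot autolaunch d-bus",   # new: no session bus at all (headless host)
)


def is_missing_secret_service(error_text: str) -> bool:
    """True if the error text looks like a missing keychain backend."""
    lowered = error_text.lower()
    return any(marker in lowered for marker in NO_BACKEND_MARKERS)


def write_secret(
    key: str,
    value: str,
    keychain_write: Callable[[str, str], Optional[str]],
    fallback: Dict[str, str],
) -> bool:
    """Try the keychain; fall back to a plain store only for 'no backend' errors.

    `keychain_write` returns None on success or an error string on failure.
    """
    error = keychain_write(key, value)
    if error is None:
        return True
    if is_missing_secret_service(error):
        fallback[key] = value  # stand-in for the QSettings fallback
        return True
    return False  # any other error stays a real failure


store: Dict[str, str] = {}
headless = lambda k, v: "Cannot autolaunch D-Bus without X11 $DISPLAY"
print(write_secret("signing-key", "abc", headless, store), store)
```

Keeping unrelated errors on the failure path is the important design point: only the "no backend" family is allowed to trigger the fallback.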

This is its own commit so it can be backed out independently if the keychain fix is contentious.

Testing

The PR's own CI run validates:

  • Both Release builds complete on RunsOn (~13-15 min cold-cache, expect ~7-8 min warm after merge).
  • The QGCKeychain fallback lets SigningTest and QGCKeychainTest pass on headless Ubuntu 24.04.
  • Coverage report still produces valid coverage.xml, picking up .gcda from both Unit and Integration passes.

Related infra

  • RunsOn stack: qgc-ci-runs-on (CloudFormation, us-west-2)
  • GitHub App: Dronecode Infra, installed on mavlink/qgroundcontrol
  • Same Dronecode AWS account also runs px4-ci (RunsOn v2.12.6) for PX4-Autopilot

github-actions Bot added the github_actions, RN: IMPROVEMENT, and size/XS labels May 12, 2026
Contributor

github-actions Bot commented May 12, 2026

Build Results

Platform Status

| Platform | Status | Details |
| --- | --- | --- |
| Linux | Passed | View |
| Windows | Passed | View |
| MacOS | Passed | View |
| Android | Passed | View |

All builds passed.

Pre-commit

| Check | Status | Details |
| --- | --- | --- |
| pre-commit | Failed (non-blocking) | View |

Pre-commit hooks: 4 passed, 45 failed, 7 skipped.

Test Results

linux-coverage-integration: 16 passed, 0 skipped
linux-coverage-unit: 74 passed, 0 skipped
Total: 90 passed, 0 skipped

Code Coverage

Coverage: 59.2%

No baseline available for comparison

Artifact Sizes

| Artifact | Size |
| --- | --- |
| QGroundControl | 216.86 MB |
| QGroundControl-aarch64 | 176.58 MB |
| QGroundControl-installer-AMD64 | 134.71 MB |
| QGroundControl-installer-AMD64-ARM64 | 77.33 MB |
| QGroundControl-installer-ARM64 | 106.06 MB |
| QGroundControl-linux | 335.18 MB |
| QGroundControl-mac | 187.14 MB |
| QGroundControl-windows | 187.15 MB |
| QGroundControl-x86_64 | 188.59 MB |

No baseline available for comparison

Updated: 2026-05-13 19:35:11 UTC • Triggered by: Android

mrpollo added 2 commits May 13, 2026 11:44
isMissingSecretService() routes "no Secret Service reachable" errors
through the QSettings fallback so reads and writes still succeed. It
only matched libsecret's org.freedesktop.secrets / ServiceUnknown
patterns, which covers a Secret Service daemon not being registered
on the bus, but not the case where there is no session bus at all.

On a headless host (CI without dbus-launch / gnome-keyring, Docker,
embedded test rigs) libsecret instead reports:

  "Cannot autolaunch D-Bus without X11 $DISPLAY"

That's a QKeychain::OtherError, did not match the existing patterns,
and dropped through to the terminal error branch — QGCKeychain::write
returned false and QGCKeychainTest / SigningTest failed.

Recognize the autolaunch and missing-session-bus messages as the same
"no backend" condition so the fallback kicks in and callers see the
behavior they already see on macOS/Windows when no keychain is
configured.

Signed-off-by: Ramon Roche <mrpollo@gmail.com>
Move the Linux build and debug-validation jobs from GitHub-hosted
runners to RunsOn ephemeral EC2 runners (qgc-ci-runs-on stack in
us-west-2, RunsOn v3.0.6). x86_64 builds now run on c8i.2xlarge,
ARM64 on c8g.2xlarge, the Debug+coverage test job on m8i.2xlarge —
all 8 vCPU, On-Demand, ubuntu24-full-* images.

Runner labels are inline in the workflow so this PR is self-contained
(named runner profiles in .github/runs-on.yml require the config on
the default branch first); the named profiles are kept in the tree
for the next workflow migration to reuse.

Caching: extras=s3-cache + runs-on/action@v2 transparently redirect
all existing actions/cache@v5 calls (ccache, Qt SDK, GStreamer, pipx,
apt, CPM) to the S3 bucket provisioned by the stack. runs-on/action
is a no-op on GitHub-hosted runners so the workflow stays portable.

Matrix cleanup: dropped the dual-purpose `matrix.os` field on the
build job (it was both a runner selector and a discriminator for two
size-analysis steps). matrix.arch is now the single discriminator;
the previous `matrix.os == 'ubuntu-22.04'` conditions on lines 128
and 135 now correctly read `matrix.arch == 'linux_gcc_64'`.

Both architectures build on Ubuntu 24.04 (was 22.04 for x64, 24.04
for ARM). This bumps the AppImage glibc baseline from 2.35 to 2.39;
older distros (RHEL 8, Ubuntu 20.04, Debian 11) won't run binaries
produced here.

Test execution restructured by label. cmake/QGCTest.cmake:164
auto-attaches RESOURCE_LOCK "MockLink" to every Integration test
because MockLink shares a LinkManager singleton and static
_nextVehicleSystemId counter; a single CTest invocation over both
labels with --parallel auto silently serialized everything on that
lock. Split into two passes:

  - Run Unit Tests (parallel): -L Unit, --parallel auto. 151 Unit
    tests with no shared state.
  - Run Integration Tests (serial): -L Integration, --parallel 1.
    37 Integration tests serialize on shared MockLink state.

Each pass writes its own junit + ctest output; downstream Analyze /
Report / Upload steps run once per pass. Coverage path picks up
.gcda from both passes via the existing find . -name '*.gcda'.

Tester runner uses volume=60gb (40GB default left ~1-2GB headroom
at peak with the Debug build + Qt SDK + caches + .gcda + scratch,
which silently killed the agent before any diagnostic could run).

Signed-off-by: Ramon Roche <mrpollo@gmail.com>
@mrpollo mrpollo force-pushed the mrpollo/runs-on-linux-ci branch from 5ac0ef2 to c5de6f3 Compare May 13, 2026 18:45
@mrpollo mrpollo marked this pull request as ready for review May 13, 2026 19:15
@mrpollo mrpollo requested a review from HTRamsey as a code owner May 13, 2026 19:15
Copilot AI review requested due to automatic review settings May 13, 2026 19:15
Member Author

mrpollo commented May 13, 2026

This is ready for review, @HTRamsey. Once we get this one in, I can help you migrate the rest, starting with Windows.

Contributor

Copilot AI left a comment

Pull request overview

This PR migrates QGroundControl’s Linux CI workflow to RunsOn self-hosted ephemeral runners (EC2) to reduce build/test wall-clock time, while also adjusting test execution to avoid unintended serialization and improving QGCKeychain behavior on headless Linux images.

Changes:

  • Migrate .github/workflows/linux.yml from GitHub-hosted runners to RunsOn labels (with runs-on/action@v2) and clean up the build matrix to use arch as the discriminator.
  • Restructure the Debug test job to run Unit tests in a parallel pass and Integration tests in a serial pass, producing separate artifacts/reports per pass.
  • Expand Linux libsecret “no backend/no session bus” detection in QGCKeychain so headless CI falls back to QSettings instead of failing.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| src/Utilities/Platform/QGCKeychain.cc | Broadens detection of headless/missing Secret Service conditions to trigger the existing QSettings fallback. |
| .github/workflows/linux.yml | Switches Linux jobs to RunsOn runners + magic cache and splits Unit vs Integration test execution/reporting. |
| .github/runs-on.yml | Adds RunsOn runner definitions (family/image/extras/volume) for the repository. |

Comment thread .github/workflows/linux.yml
Comment thread .github/workflows/linux.yml