
ci(cache): re-add ccache as artifact-cache fallback#360

Merged
hedgar2017 merged 21 commits into main from ci/ccache-fallback-llvm-builds on Apr 23, 2026

Conversation

@nebasuke
Member

@nebasuke nebasuke commented Apr 21, 2026

Summary

  • I've upped the solx cache size to 150 GB.
  • Re-adds ccache to the build-llvm and build-solc composite actions as a fallback behind the SHA-keyed artifact cache (removed in ci: replace ccache with artifact caching + standardize coverage config #246). ccache only runs on artifact-cache miss; the warm-cache fast path (~30 s restore) is unchanged.
  • Adds ccache-touch steps to each cache-warmup.yaml job so the daily cron extends the ccache LRU even when the artifact cache hits, preventing GHA's 7-day eviction from emptying ccache on quiet periods.
  • Adds a free-disk-space composite action that reclaims ~24 GB of pre-installed host tooling (dotnet, Android SDK, Haskell, CodeQL) via container.volumes: bind-mounts, so hosted ubuntu-24.04 runners have enough room for cold LLVM builds + ccache on the same disk pool.
  • Tunes LLVM_PARALLEL_LINK_JOBS per platform — 2 on Linux/Windows and macOS x86 (14–16 GB RAM), 1 on macOS ARM64 — because link memory pressure becomes the bottleneck once the compile phase is served from ccache at 99%+ hit rates. The hosted macos-15 ARM runner has 3 vCPU / 7 GB RAM; two parallel RelWithDebInfo links at 2–4 GB RSS each push it into swap.
  • max-size: "10G" per ccache scope (matches pre-ci: replace ccache with artifact caching + standardize coverage config #246 value).
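The per-platform link-jobs choice above can be sketched as a small shell helper (the function name and call sites are illustrative, not from the PR; the real logic lives inline in the composite action's `--extra-args` construction, keyed on `runner.os`/`runner.arch`):

```shell
# Pick LLVM_PARALLEL_LINK_JOBS from the runner's OS/arch. Only the
# 7 GB macOS ARM64 runner needs fully serialized links; everything
# else has 14-16 GB RAM and tolerates two parallel links.
link_jobs() {
  local os="$1" arch="$2"
  if [ "$os" = "macOS" ] && [ "$arch" = "ARM64" ]; then
    echo 1
  else
    echo 2
  fi
}

link_jobs macOS ARM64   # prints 1
link_jobs Linux X64     # prints 2
```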

Observed runtimes

Three reference runs on this PR branch show the progression:

  • Cold: run #24731422129 at fb9f415 — first run, ccache empty, pre-free-disk-space; cargo-checks fails on disk exhaustion during lld link of llvm-opt-fuzzer.
  • Warm ccache, link-jobs=2: run #24745368771 at 13d56d5 — LLVM ccache populated by the cold run, free-disk-space wired, but parallel links still causing swap on macOS ARM64.
  • Warm ccache + link-jobs tuning: run #24751701680 at 6f85d0a — LLVM_PARALLEL_LINK_JOBS=1 on macOS ARM64, 100% ccache hit rate across the board.

Whole-job wall-clock

| Platform | Cold | Warm ccache, link-jobs=2 | Warm ccache + link-jobs tuned |
|---|---|---|---|
| cargo-checks | 2h 17min, failed on disk | 18.5 min | 16.5 min |
| Linux x86 gnu | 3h 00min | 32.5 min | 15.5 min |
| Linux ARM64 gnu | 1h 44min | 25 min | 14.5 min |
| macOS x86 | 3h 28min | 1h 53min | 45 min |
| macOS ARM64 | 4h 44min | 2h 23min | 21 min |
| Windows | 4h 31min | 1h 46min | 1h 19min |

Build LLVM step only

Isolates the LLVM-building cost from checkout / solc build / tests / etc. Shows that the dramatic macOS ARM64 win is mechanism-specific:

| Platform | Warm ccache, link-jobs=2 | Warm ccache + link-jobs tuned | Speedup |
|---|---|---|---|
| cargo-checks | 7:07 | 4:20 | 1.6× |
| Linux x86 gnu | 7:00 | 3:59 | 1.8× |
| Linux ARM64 gnu | 6:11 | 4:22 | 1.4× |
| macOS x86 | 17:40 | 10:23 | 1.7× |
| macOS ARM64 | 1h 57min | 5:54 | 20× |
| Windows | 46:35 | 42:16 | 1.1× |

Two effects stacking:

  1. ccache hit rate crept from 99.35% → 100% (confirmed on both macOS legs this run). The previous run had 25 misses out of 3839 cacheable calls; those got saved, so this run hits them. Accounts for the modest 1.4–1.8× speedup on every non-ARM64 leg — roughly 25 cold compiles × ~10–15 s each.
  2. LLVM_PARALLEL_LINK_JOBS=1 eliminated swap thrashing on macOS ARM64, which is the 20× speedup. 25 additional cache hits don't explain ~1h 50min of saved wall time; that was the linker stalled on page faults in a 7 GB runner trying to run two ~3 GB RSS link jobs concurrently. Windows' 1.1× result is the control group: 16 GB RAM, no swap pressure, no link-jobs change, speedup limited to cache warming.

ccache itself handles compile, not link. The ~200 link steps per build still run every time; that's the floor. On most platforms link time is tolerable; on macOS ARM64 specifically the interaction between link count, link RSS, and runner RAM was catastrophic before this PR and is now fine.

Building solc step

solc's ccache key changed in 35949d3 (adding -{cmake-build-type}-end per review feedback), which invalidated run 1's saves — run 2 re-populated with the new key format, and run 3 is the first run that actually restores a warm solc ccache at 100% hit rate.

| Platform | Cold (run 1, fb9f415) | Warm ccache (run 3, 6f85d0a) | Speedup |
|---|---|---|---|
| Linux x86 gnu | 19:11 | 1:32 | 12.5× |
| Linux ARM64 gnu | 10:26 | 1:25 | 7.4× |
| macOS x86 | 25:32 | 7:12 | 3.5× |
| macOS ARM64 | 14:07 | 2:31 | 5.6× |
| Windows | 35:50 | 16:03 | 2.2× |

Observations:

  • Linux warm solc is the clean case — under 2 min thanks to a small link graph (just solc and solc-tests binaries, vs LLVM's ~200 tool-executables). Almost all of the wall-clock is the link-floor we can't eliminate.
  • Windows warm solc is still 16 min. Same link-bound story as LLVM: lld-link is inherently slow per-executable on Windows and solc + boost together produce enough binaries for that to dominate. No swap here (16 GB RAM, not memory-pressured), just slow links.
  • macOS ARM64 solc doesn't need the link-jobs treatment. Solc produces few enough binaries that even link-jobs=2 doesn't exceed the 7 GB RAM.

Fixes and tuning during review

  • Cache-key terminator (-end): llvm-…-mlir was a prefix of llvm-…-mlir-coverage-no-assertions, so ccache-action's restore-keys prefix match was cross-restoring between dev and coverage configs on Linux x64. All LLVM and solc keys now end with an -end marker.
  • Touch-step path mismatch: cache-warmup.yaml's Touch steps used ${{ runner.temp }}/ccache-touch-{llvm,solc} but ccache-action saves with path = CCACHE_DIR = ${{ runner.temp }}/ccache-{llvm,solc}. GHA includes path in the cache version hash, so the Touch was a silent no-op. Fixed to real paths.
  • solc key missing cmake-build-type: warm-solc (RelWithDebInfo) and warm-llvm-integration (Release) were colliding on one key. Key now includes build-type.
  • Show ccache stats → continue-on-error: true, so a missing-ccache-binary failure can't mask the real build failure.
  • apt update handling — went through a couple of iterations: first bare apt update, then sudo apt-get update -qq (per review suggestion for portability), then back to plain apt-get update -qq once a test run (#24751389733) confirmed the solx-ci-runner container image doesn't ship sudo at all. Kept the add-then-remove as separate commits so the iteration is visible in history.
  • YAML anchor reality check: GHA composite-action manifests reject YAML anchors (ActionManifestManagerLegacy explicit refusal), so the four duplicated CCACHE_* env vars across steps in build-llvm/action.yml and build-solc/action.yml stay inlined with a sync comment. Workflow files do accept anchors, which is used to dedupe the five-entry container.volumes: list across the six affected jobs within each of test.yaml and cache-warmup.yaml.
  • find -mindepth 1 -delete instead of rm -rf on the bind-mounted host paths: rm -rf on a bind-mount directory fails with EBUSY on the mount point itself (disk reclaim still happens, but log noise and a masked non-zero exit). find -delete clears contents, leaves the mount intact, and preserves a meaningful exit code.
  • xargs -P for macOS Xcode removal: the pre-installed runner has ~16 Xcode versions. Sequential rm -rf was ~8 min; parallel fan-out cuts it well under two.
  • Windows LLVM tool-disable experiment, landed and reverted: commit 3106d1e added LLVM_BUILD_TOOLS=Off + LLVM_INCLUDE_TOOLS=Off on Windows to skip ~200 unused tool-binary links. Build LLVM step dropped from 42 min → 9:44 (4.3×) on the validation run. Reverted in 90df14b after Run tests failed with llvm-sys unable to find llvm-config at Rust crate-build time — llvm-config is itself a tool binary and got disabled along with the rest. Follow-up to retry with a surgical "build only llvm-config" approach tracked in ci(llvm): reduce Windows LLVM tool count while keeping llvm-config (retry of reverted tool-disable) #364.
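The find -mindepth 1 -delete pattern from the bullet list above can be demonstrated on a plain directory standing in for the bind mount (the helper name and throwaway temp paths are illustrative, not the real /mnt/free-disk-space mounts):

```shell
# Clear a directory's contents without unlinking the directory itself;
# on a live bind mount, the final unlink of the mount point is what
# fails with EBUSY under rm -rf.
reclaim() {
  find "$1" -mindepth 1 -delete
}

tmp=$(mktemp -d)
mkdir -p "$tmp/nested"
touch "$tmp/a" "$tmp/nested/b"
reclaim "$tmp"
# "$tmp" still exists, but is now empty, and the exit code is clean.
```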

Prior art (why this shape)

Cache calculations

Current observed active usage (gh api repos/NomicFoundation/solx/actions/cache/usage):

| Entry | Per entry | Copies¹ | Subtotal |
|---|---|---|---|
| v1-llvm-Windows-X64-RelWithDebInfo-mlir-… | 9.09 GB | 4 | 36.4 GB |
| v1-llvm-macOS-X64-RelWithDebInfo-mlir-… | 2.39 GB | 2 | 4.8 GB |
| v1-llvm-macOS-ARM64-RelWithDebInfo-mlir-… | 2.28 GB | 1 | 2.3 GB |
| v1-llvm-Linux-X64-RelWithDebInfo-mlir-… | 1.86 GB | 2 | 3.7 GB |
| v1-llvm-Linux-ARM64-RelWithDebInfo-mlir-… | 1.82 GB | 2 | 3.6 GB |
| build-and-test-v2-* (rust-cache, 3 OSes) | ~1.5 GB | 3 | ~4.7 GB |
| v1-solc-* (5 platforms) | ~0.35 GB | 5 | 1.8 GB |
| misc small | | | ~0.05 GB |
| Current total (28 entries) | | | ~59.7 GB |

¹ GHA scopes caches per-ref: each merge queue branch (gh-readonly-queue/main/pr-###-…) creates its own cache copy.

Projected ccache additions

With max-size: 10G per scope, actual on-disk ccache per variant tends to sit at ~2–3 GB for LLVM, ~1 GB for solc. GHA stores the zstd-compressed tarball, typically 40–60 % of on-disk size.

| New ccache entry | On-disk (typical) | Count | Compressed subtotal |
|---|---|---|---|
| LLVM dev (`llvm-<OS>-<arch>-RelWithDebInfo-mlir`), 5 platforms | ~3 GB | 5 | ~8 GB |
| LLVM sanitizer (Linux x86 only) | ~3 GB | 1 | ~1.5 GB |
| LLVM coverage (Linux x86 only) | ~3 GB | 1 | ~1.5 GB |
| LLVM integration / Release (Linux x86 only) | ~3 GB | 1 | ~1.5 GB |
| solc (`solc-<OS>-<arch>-<build-type>`), 5 platforms | ~1 GB | 5 | ~3 GB |
| New subtotal | | 13 | ~15–20 GB |

Worst-case upper bound

If every ccache scope fills its 10G cap (unlikely, but the theoretical max):

| Scopes | On-disk | Compressed in GHA |
|---|---|---|
| 13 scopes × 10G | 130 GB | ~50–80 GB |

Projected totals

| Scenario | Active cache |
|---|---|
| Current (artifact cache only) | ~60 GB |
| Current + typical ccache | ~75–80 GB |
| Current + worst-case ccache | ~110–140 GB |
| Quota (newly raised) | 150 GB |

Headroom under quota stays comfortable in the typical case and survives the worst case. If observed usage approaches 150 GB after a few weeks, the max-size cap is a single-line dial-back.

Why ccache-touch in cache-warmup?

The composite action gates the ccache steps on cache-hit != 'true'. When the daily cron fires on an unchanged submodule SHA, the artifact cache hits → composite skips ccache → the ccache entry's LRU timer isn't refreshed. After 7 quiet days, GHA evicts it. The Touch ccache steps use actions/cache/restore with lookup-only: true — the lookup hits the cache service endpoint, resetting the 7-day access timer, without downloading the ~2–4 GB entry.
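Concretely, each Touch step looks roughly like the sketch below (the step name, key shape, and sentinel key are illustrative; the real keys and paths live in cache-warmup.yaml):

```yaml
- name: Touch LLVM ccache
  uses: actions/cache/restore@v4
  with:
    path: ${{ runner.temp }}/ccache-llvm
    # An exact key that never matches, so resolution always falls
    # through to the restore-keys prefix match.
    key: touch-only-never-matches
    restore-keys: |
      llvm-${{ runner.os }}-${{ runner.arch }}-
    # Hit the cache service (refreshing the 7-day access timer)
    # without downloading the multi-GB entry.
    lookup-only: true
```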

Test plan

  • CI passes cold build — confirmed in run #24745368771 once free-disk-space was wired. cargo-checks passes in 16.5 min (was failing on disk with SIGBUS during link).
  • LLVM ccache active, populated, and restored across runs. Stats confirm 100% hit rate once warmed.
  • solc ccache validated end-to-end on run #24751701680 — 100% hit rate on all five platforms after the key change propagated.
  • LLVM wall-clock drops substantially vs cold baseline on every leg. Linux x86: 3h → 15.5 min (12×). Linux ARM64: 1h 44min → 14.5 min (7×). macOS x86: 3h 28min → 45 min (4.6×). macOS ARM64: 4h 44min → 21 min (13×). Windows: 4h 31min → 1h 19min (3.4×).
  • LLVM_PARALLEL_LINK_JOBS=1 on macOS ARM64 validated — Build LLVM step 1h 57min → 5:54 (20×) compared to prior warm-ccache run at link-jobs=2.
  • After merge, watch push-to-main cache-warmup run populate ccache entries on main.
  • Monitor gh api repos/NomicFoundation/solx/actions/cache/usage over 1–2 weeks — confirm total stays under 150 GB.
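The monitoring bullet above can be scripted; the endpoint and its active_caches_size_in_bytes field follow the GitHub REST cache-usage API as we understand it, and the byte count below is a stand-in value for illustration, not a measurement:

```shell
# Real invocation (needs gh + network access):
#   bytes="$(gh api repos/NomicFoundation/solx/actions/cache/usage \
#              --jq '.active_caches_size_in_bytes')"
bytes=64171876352                      # stand-in, roughly the ~60 GB current total
gb=$(( bytes / 1024 / 1024 / 1024 ))   # integer GiB
quota=150
if [ "$gb" -lt "$quota" ]; then
  echo "OK: ${gb} GB active, quota ${quota} GB"
else
  echo "WARN: ${gb} GB active exceeds ${quota} GB quota"
fi
```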

Out of scope / follow-ups

  • Publishing prebuilt LLVM binary tarballs from the solx-llvm repo (build once there, download by SHA here) — biggest long-term win but requires cross-repo infra.
  • Windows LLVM artifact is 9 GB (4× macOS, 5× Linux) — likely PDBs / .lib files in RelWithDebInfo. Not in scope per prior discussion.
  • Capping ninja compile parallelism (LLVM_PARALLEL_COMPILE_JOBS) if link-jobs tuning isn't enough on some future platform.
  • Reduce Windows LLVM tool count (ci(llvm): reduce Windows LLVM tool count while keeping llvm-config (retry of reverted tool-disable) #364) — attempted in this PR and reverted; worth retrying surgically with llvm-config retained. Expected ~4× Windows Build LLVM speedup.


Copilot AI left a comment


Pull request overview

This PR reintroduces ccache as a fallback layer behind the existing SHA-keyed GitHub Actions artifact cache for LLVM/solc builds, and updates the cache warmup workflow to “touch” ccache entries so they don’t get evicted during quiet periods.

Changes:

  • Add hendrikmuhs/ccache-action setup + --ccache-variant=ccache for LLVM and solc builds on artifact-cache miss.
  • Add “Touch LLVM/solc ccache” steps to cache-warmup.yaml jobs using actions/cache/restore with lookup-only: true.
  • Configure per-scope ccache size cap (max-size: "10G").

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| .github/workflows/cache-warmup.yaml | Adds lookup-only cache restores to refresh ccache LRU for LLVM/solc warmup jobs. |
| .github/actions/build-solc/action.yml | Re-adds ccache setup/stats and passes --ccache-variant=ccache when artifact cache misses. |
| .github/actions/build-llvm/action.yml | Re-adds ccache setup/stats, defines per-config ccache keys, and passes --ccache-variant=ccache when artifact cache misses. |

nebasuke added a commit that referenced this pull request Apr 21, 2026
Five fixes from hedgar2017's review on #360:

1. Touch-step path fix (was a silent no-op). The `Touch {LLVM,solc}
   ccache` steps used `${{ runner.temp }}/ccache-{touch-llvm,touch-solc}`,
   but ccache-action saves with `path = CCACHE_DIR = ${{ runner.temp
   }}/ccache-{llvm,solc}`. GHA includes `path` in the cache version hash,
   so the Touch restore-keys prefix match never saw the real ccache
   entries. After 7 quiet days the cron would silently fail to refresh
   the LRU and the entries would age out. Fixed to the real paths across
   all six Touch steps.

2. solc ccache key missing `cmake-build-type`. `warm-solc` builds
   `RelWithDebInfo`; `warm-llvm-integration` builds `Release`. Same key
   → the two configs evict each other. Added the build-type component
   and the `-end` terminator (same rationale as a239dea) to the solc key
   and the two solc Touch restore-keys.

3. `Show ccache stats` → `continue-on-error: true` so a missing `ccache`
   binary (e.g. ccache-action install failed upstream) can't mask the
   real build failure in the step summary.

4. `CCACHE_*` env deduplication via YAML anchor (`&ccache-{llvm,solc}-env`)
   and `<<:` merge keys. Composite-action sibling steps don't share env,
   so four variables were duplicated three times per action, with no
   guardrail against drift. PyYAML verified the merge resolves to the
   same env sets previously written by hand.

5. `apt update` → `sudo apt-get update -qq`. Works in both root (current
   Docker container) and non-root (hosted runner) environments; `-qq`
   silences the default chatter.
@nebasuke nebasuke force-pushed the ci/ccache-fallback-llvm-builds branch from 13d56d5 to 08feef9 Compare April 21, 2026 23:15
@nebasuke nebasuke force-pushed the ci/ccache-fallback-llvm-builds branch from 90df14b to 9659de6 Compare April 22, 2026 10:29
@nebasuke nebasuke marked this pull request as ready for review April 22, 2026 10:48
@nebasuke nebasuke requested a review from hedgar2017 April 22, 2026 10:48
@nebasuke nebasuke force-pushed the ci/ccache-fallback-llvm-builds branch from 9659de6 to a444822 Compare April 22, 2026 11:16
nebasuke added 17 commits April 22, 2026 12:44
When the solx-llvm submodule is bumped the SHA-keyed artifact cache
misses and every platform does a ~3.5h cold LLVM build. PR #246 removed
ccache on the assumption that artifact cache hits made it redundant;
that doesn't hold for the bump case we're now in.

Restore the pre-removal ccache steps (commit 1320d1b), layered behind
the existing artifact cache so ccache only runs on artifact-cache miss:

- build-llvm/action.yml: define ccache key, install ccache, pass
  --ccache-variant=ccache to solx-dev, report --show-stats.
- build-solc/action.yml: same pattern, separate ccache dir.
- Cap max-size at 4G (was 10G) for a tighter cache budget.
- Align ccache key schema with the current artifact-cache key
  (includes -no-assertions; matches the 4-config matrix introduced
  by #246).

Expected: 3.5h cold builds drop to ~45-90min on submodule bumps;
warm-cache fast path unchanged.
build-llvm and build-solc now wire ccache behind the artifact cache,
but the gate (steps.<artifact>-cache.outputs.cache-hit != 'true') means
that when the daily cache-warmup cron fires on an unchanged submodule
SHA, the artifact cache hits, the composite skips its ccache steps,
and the ccache entry's LRU timer is never refreshed. After 7 days of
hits-only it gets evicted, and the next submodule bump finds ccache
also cold.

Add a Touch step after each build in cache-warmup.yaml that resolves
the ccache entry by prefix via actions/cache/restore with lookup-only:
true. The lookup alone resets the 7-day access timer without
downloading the 2-4 GB entry.
4G is tight enough that a single full LLVM EVM-target build (~4000 TUs,
~2.4 GB of cached objects on average) plus one re-build with slightly
different inputs can push ccache over the cap and trigger LRU eviction
of still-useful entries, hurting hit rate on the next submodule bump.

10G (matching the pre-PR-#246 value) leaves plenty of headroom.
Worst-case 13 scopes × 10G = ~130 GB on-disk (compressed: ~50-80 GB),
well under the 150 GB repo quota.
ccache-action's restore-keys does prefix matching, and the previous key
shape let shorter variant keys (e.g. `llvm-Linux-X64-RelWithDebInfo-mlir`)
prefix-match longer ones (`...-mlir-coverage-no-assertions`). On Linux x64
where both dev and coverage warm-ups run, the newer entry would win and
the ccache dir would be cross-restored with differently-compiled objects
— zero hit rate plus cache churn.

Append a literal `-end` terminator to every llvm ccache key and to the
corresponding Touch restore-keys in cache-warmup.yaml. `build-solc`'s
single-variant key is unaffected.
Two changes to validate ccache end-to-end on this branch without waiting
for a post-merge submodule bump:

1. Disable the SHA-keyed artifact cache restore in build-llvm and
   build-solc (`if: false`). Forces the ccache path to run on every CI
   run. The matching Save steps are already guarded by
   `github.event_name != 'pull_request'`, so main's artifact cache is
   not polluted.

2. Flip ccache-action `save` from `github.event_name != 'pull_request'`
   to `true`. Lets this PR's runs populate ccache so a second run can
   restore from the first and `Show ccache stats` reports hit rate.

Expected: first run cold (ccache miss, saves); second run warm (prefix
match on `-end`-terminated key, high hit rate on LLVM/solc builds).
The hosted `ubuntu-24.04` runner has ~14 GB free at start. A cold
LLVM+MLIR RelWithDebInfo build fills ~12 GB, and the new ccache adds
another ~1.1 GB — on this run cargo-checks tipped over with a SIGBUS
while linking llvm-opt-fuzzer ("no space left on device" moments later
during the ccache save). Linux x86 gnu happened to squeak by on the
same commit.

Reclaim ~24 GB of preinstalled host tooling solx doesn't use: .NET SDK,
Android SDK+NDK, Haskell (ghc + ghcup), CodeQL bundles. These live on
the runner VM's disk outside the container's view, so callers bind-mount
each host path into `/mnt/free-disk-space/<name>` via `container.volumes:`
and the composite action rms them from inside. The action refuses to
operate on anything outside that prefix, and skips missing or empty
mounts so a forgotten volume can't cause harm.

Wired into the six hosted-ubuntu-x64 jobs that cold-build LLVM:
  - test.yaml::cargo-checks
  - test.yaml::build-and-test (gated to containerized Linux legs)
  - cache-warmup.yaml::warm-llvm (gated to containerized Linux legs)
  - cache-warmup.yaml::warm-llvm-sanitizer
  - cache-warmup.yaml::warm-llvm-coverage
  - cache-warmup.yaml::warm-llvm-integration

Boost intentionally not removed — solc's `--build-boost` builds its own,
but the preinstalled `/usr/local/share/boost` headers may be pulled in
transitively by other tooling. CodeQL confirmed unused in this repo.

With ~38 GB free during build (vs ~14 GB before), the cold-build path
has a comfortable margin for the LLVM build, ccache, and future growth.
Deduplicates the five-entry `container.volumes:` list that was
repeated on six jobs. YAML anchor-alias within each file collapses
the four single-job cache-warmup definitions to one-liners and the
two test.yaml jobs' definitions to the anchor + one alias.

Cross-file sharing isn't possible (YAML anchors are per-document), so
test.yaml and cache-warmup.yaml each carry their own anchor definition
with a sync comment pointing at the other.
Run 24745149414 failed at action-load time with:

    /home/runner/.../.github/actions/build-llvm/action.yml:
      Anchors are not currently supported. Remove the anchor
      'ccache-llvm-env'

GHA's composite-action manifest loader (ActionManifestManagerLegacy)
explicitly rejects YAML anchors, independent of whether the underlying
YAML library supports them (PyYAML parses these fine; the workflow
parser apparently also does — only action.yml is restricted).

Restore the explicit duplicated CCACHE_* env blocks on the Build and
Show-ccache-stats steps in build-llvm/action.yml and build-solc/action.yml,
with a comment noting why we can't dedupe.
…ontents

First real run confirmed the action freed ~20 GB (85 → 105 GB avail),
but surfaced cosmetic "Device or resource busy" errors on each mount:

    rm: cannot remove '/mnt/free-disk-space/android': Device or resource busy

`rm -rf "$p"` on a bind mount deletes everything underneath just fine,
but the final syscall to unlink the mount-point directory itself fails
with EBUSY because the mount is live. The disk reclaim already happened
by that point, so it's noise — but the non-zero exit also masks any
*real* rm failure under the same error.

Switch to `find "$p" -mindepth 1 -delete`, which only touches contents
under the mount point. Same disk reclaim, clean logs, and the exit code
now actually signals problems worth looking at.
GitHub's macOS x86 runner ships ~16 Xcode versions. Removing them
sequentially took ~45s each (~8 min total on the free-disk-space step)
because `rm -rf` is I/O-bound per inode on APFS. Fan them out concurrently
with one worker per bundle — should cut this to under 2 min.

Soft-fails by design: an opportunistic cleanup shouldn't fail the job if
one rm hits a stray file handle; rm's stderr still pinpoints the path.
With ccache compile-cache hitting at 99%+ (measured on the macOS LLVM
leg of run 24745368771), link time dominates the remaining wall-clock.
Each RelWithDebInfo LLVM tool link peaks at 2-4 GB RSS and ccache
doesn't cache links.

The hosted macos-15 ARM runner has 3 vCPU / 7 GB RAM — two parallel
links exceed RAM and push into swap, which is net slower than
serialized links (paging during mmap-heavy linker work is devastating,
and shows up as elapsed time without CPU utilization). Other hosted
runners have headroom:

  macos-15-intel   4 vCPU / 14 GB RAM   ← 2 parallel links fit
  Linux / Windows           16 GB RAM   ← 2 parallel links fit
  macos-15 (ARM)   3 vCPU /  7 GB RAM   ← 2 parallel links = swap

Drive LLVM_PARALLEL_LINK_JOBS from a shell conditional on runner.os
and runner.arch: 1 only on macOS ARM64, 2 elsewhere. Confined to the
one cmake flag in the --extra-args string.
Run 24751389733 confirmed the hosted `solx-ci-runner` container image
doesn't ship sudo at all:

  /__w/_temp/…sh: line 1: sudo: command not found
  Process completed with exit code 127

Steps already run as root inside the container, so `sudo` was both
unnecessary and broken. Invoke `apt-get update -qq` directly in both
composite actions; update the surrounding comment to reflect the
actual container behavior rather than the generic root/non-root
portability framing.

Reverts the sudo-adding hunk from 35949d3 (where the original review
suggestion for portability was accepted).
On Windows with 100% ccache hits, Build LLVM still takes ~42 min. The
bottleneck is `lld-link` processing the ~200 LLVM tool executables
(opt, llc, llvm-objdump, llvm-pdbutil, ...) — each link is ~10-25 s on
Windows RelWithDebInfo with PDBs, and ccache doesn't cache link.

solx doesn't use any of these tool binaries at runtime:

  - solx consumes LLVM as a library via inkwell FFI on the static libs
    under target-llvm/target-final/ (LLVM_SYS_211_PREFIX). No reference
    to target-llvm/target-final/bin/ anywhere in solx or solx-dev.
  - Runtime tool deps (llvm-cov, llvm-profdata, llvm-symbolizer,
    llvm-lipo) come from distro/Xcode packages via the ci-runner
    Dockerfile / macOS runner tooling, not from the built LLVM.
  - No CI path passes enable-tests: true to build-llvm (the
    enable-tests in deploy-mdbook is for mdbook test, not LLVM tests).
  - solx-llvm's regression-tests.yml runs only inside solx-llvm's own
    CI, not from solx.

Gate the new flags on runner.os == 'Windows' so non-Windows runs stay
byte-identical pending independent validation. Restructures the
--extra-args positional list into a bash array that grows conditionally;
solx-dev's --extra-args is already declared `num_args = 1..` so the
`"${EXTRA_ARGS[@]}"` expansion works as before.

Expected Windows Build LLVM: ~42 min → ~5-10 min (link count drops by
roughly an order of magnitude, scheduling unchanged at link-jobs=2).
Link-jobs cap from #245 is deliberately left at 2 on Windows — #245's
rationale (OOM on 16 GB runners) still applies; we're reducing the
number of links, not running more in parallel.
- build-{llvm,solc}/action.yml: collapse the four-line YAML-anchor
  history block into a one-sentence KEEP-IN-SYNC marker that points at
  the downstream steps and notes the action-manifest parser limitation.
- free-disk-space/action.yml: drop the sudo/Boost paragraph from the
  action description. The sudo half was stale once we confirmed the
  container runs as root, and the Boost half is tangential to
  free-disk-space itself.

Net -8 lines of comments, no behaviour change.
nebasuke added a commit that referenced this pull request Apr 22, 2026
Retry of the Windows LLVM tool-disable that landed+reverted in #360
(commits 3106d1e / 90df14b). The first attempt broke `Run tests` because
`llvm-sys` (pulled in via inkwell) runs `${LLVM_SYS_211_PREFIX}/bin/llvm-config`
at Rust crate-build time to discover include/lib paths, and `llvm-config` is
itself an LLVM tool that got disabled along with the rest.

This version keeps `llvm-config` alive while disabling the other ~200 tool
binaries:

  -DLLVM_BUILD_TOOLS=Off           # tools are no longer in the ALL target
  -DLLVM_INCLUDE_TOOLS=On          # tools/ subdirectory still configured,
                                    # so individual tool targets exist
  -DLLVM_TOOL_LLVM_CONFIG_BUILD=On # per-tool override forces llvm-config
                                    # specifically into the ALL target

LLVM's cmake uses the `LLVM_TOOL_<name>_BUILD` pattern to let individual
tools opt back in when LLVM_BUILD_TOOLS is off. Expected effect: only
`llvm-config.exe` builds (small link), the ~200 heavy tool links are skipped.

Run 24771100328 confirmed the validation target: with all tools disabled
the Windows Build LLVM step went from 42 min → 9:44 (4.3×). This PR aims
to preserve that win while keeping `llvm-sys` happy.

Cache-key hardening (required for correctness)

Extracts the cmake `--extra-args` construction into a new `Compute LLVM
build config` step and hashes the flag list into the artifact cache key
(`...-args<sha8>-<solx-llvm-sha>`). Without this, the existing key only
reflects the solx-dev action inputs + the solx-llvm submodule SHA — not
the cmake flags — so entries built with `LLVM_BUILD_TOOLS=Off` on Windows
would share a key with entries built with tools `On` and silently serve
the wrong install tree.

Side-benefit: the hash catches any future output-affecting flag added to
`--extra-args` without requiring reviewer discipline to update the key.
Non-output-affecting flag tweaks (e.g. LLVM_PARALLEL_LINK_JOBS scheduling)
also rotate the key, costing one cold build per tweak — acceptable with
ccache as fallback and much more robust than manual key maintenance.

The Build LLVM step now reads `${RUNNER_TEMP}/llvm-extra-args` produced by
the new step, so EXTRA_ARGS is constructed once and consumed twice (once
for the hash, once for the build).
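The hash-and-handoff could look roughly like this. The file name comes from the commit message; the flag list, key layout, and digest length are illustrative assumptions, not the workflow's literal key format.

```shell
#!/usr/bin/env bash
# Sketch of the "Compute LLVM build config" step: construct the flag list
# once, persist it to ${RUNNER_TEMP}/llvm-extra-args, hash it for the
# artifact cache key, and let the Build LLVM step re-read the same file.
set -euo pipefail
RUNNER_TEMP="${RUNNER_TEMP:-$(mktemp -d)}"

EXTRA_ARGS=("-DLLVM_BUILD_TOOLS=Off" "-DLLVM_TOOL_LLVM_CONFIG_BUILD=On")
printf '%s\n' "${EXTRA_ARGS[@]}" > "${RUNNER_TEMP}/llvm-extra-args"

# First consumer: fold an 8-char digest of the flags into the cache key,
# so output-affecting flag changes rotate the key automatically.
ARGS_SHA8=$(sha256sum "${RUNNER_TEMP}/llvm-extra-args" | cut -c1-8)
echo "llvm-args${ARGS_SHA8}-<solx-llvm-sha>"

# Second consumer: the Build LLVM step reconstructs the array verbatim.
mapfile -t EXTRA_ARGS < "${RUNNER_TEMP}/llvm-extra-args"
printf 'cmake %s\n' "${EXTRA_ARGS[*]}"
```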

Acceptance:
- Windows Build LLVM step < 15 min on a warm-ccache run.
- Windows `Run tests` succeeds (llvm-sys finds llvm-config).
- No regression on Linux/macOS (no flag change; just the new args-hash
  key component, which triggers a one-time cold build on first run).
- `Show ccache stats` still reports near-100% hit rate.

See #364 for the full design rationale and alternatives considered.
nebasuke added a commit that referenced this pull request Apr 22, 2026
…ERGE

Mirrors the TEMP pattern from #360 but scoped to `build-llvm/action.yml`
only (this branch doesn't touch solc):

1. `if: false` on the LLVM artifact cache restore → forces Build LLVM to
   always run, so the Windows tool-disable change is actually exercised.
   Without this, the args-hash in the cache key only triggers miss on the
   first push (when no entry exists); subsequent pushes to this branch
   would hit the entry we just saved and skip the build, masking any
   regression.

2. `save: true` on the ccache-action → populates ccache in this branch's
   cache scope even on PR events, so a follow-up push can observe
   warm-ccache + tool-disable together (the actual target configuration).

Revert both before merge. The end-state behaviour (artifact cache restore
works, ccache saves only on non-PR events) is what ships.
@nebasuke nebasuke force-pushed the ci/ccache-fallback-llvm-builds branch from a444822 to bf8cf2c Compare April 22, 2026 11:55
Contributor

@hedgar2017 hedgar2017 left a comment
Ran another round! Everything is mostly good, but a couple of things are worth tightening.

Bare `wait` returned 0 regardless of child exit codes, so a failed
`find -delete` on any bind mount was silently swallowed. Capture each
PID and wait per-child; `rc` keeps the last non-zero exit, which is
all we need to trigger the warning path (per-cleanup stderr identifies
which path failed).

On non-zero rc, emit a `::warning::` annotation instead of exiting
non-zero. Partial cleanup is usually still enough headroom for the
downstream build; when it isn't, ENOSPC will surface at a more
specific call site than this action can name.
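The per-child wait plus warn-and-continue shape described above can be sketched like this; the cleanup paths come from the command line so the sketch stays self-contained, and the real action's paths and step wiring are not reproduced here.

```shell
#!/usr/bin/env bash
# Sketch: each cleanup runs in the background, we wait on each PID
# individually, and rc keeps the last non-zero exit so a failed
# `find -delete` is no longer silently swallowed by a bare `wait`.
rc=0
pids=()
for path in "$@"; do
  find "$path" -mindepth 1 -delete 2>/dev/null &
  pids+=("$!")
done
for pid in "${pids[@]}"; do
  wait "$pid" || rc=$?
done
if [ "$rc" -ne 0 ]; then
  # Warn-and-continue: partial cleanup usually still leaves enough
  # headroom; if not, ENOSPC surfaces at a more specific call site.
  echo "::warning::disk cleanup failed for at least one path (rc=$rc)"
fi
exit 0
```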

Also reworded the inline comment at the `find` call — the previous
note about "a real rm failure can still surface as a non-zero exit"
no longer matches the deliberate warn-and-continue semantics.

Reported by @hedgar2017 in PR #360.
The case guard is a shell glob, not a realpath check — so the doc claim
that it "caps the blast radius to /mnt/free-disk-space/*" is false for
`..` traversal (e.g. `/mnt/free-disk-space/../host-dir` passes the
glob). It's unexploitable today since all callers pass hard-coded
literals, but the wording should match what the code actually does.

Reword both the action description and the inline comment to describe
the guard as best-effort typo-catching, explicitly not a security
boundary. No code change.
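Why the glob is not a realpath check can be shown in a few lines; the function name and paths below are illustrative, not the action's code.

```shell
#!/usr/bin/env bash
# Demonstrates why the case guard is best-effort typo-catching, not a
# security boundary: the glob matches the raw string, never resolving
# `..`, so traversal through the prefix still passes.
guard() {
  case "$1" in
    /mnt/free-disk-space/*) echo "allowed: $1" ;;
    *) echo "rejected: $1" ;;
  esac
}
guard /mnt/free-disk-space/dotnet    # intended use: allowed
guard /mnt/free-disk-space/../etc    # traversal: still allowed by the glob
guard /usr/local                     # plain typo: rejected
```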

Reported by @hedgar2017 in PR #360.
warm-solc was the only Linux-container job in cache-warmup.yaml without
volumes + Free disk space, making it an implicit special case that the
next reader has to reason about. solc's build + boost footprint
(~3-5 GB) fits in container headroom today, but the ~30 s the cleanup
costs is cheap insurance against future solc/boost growth hitting
ENOSPC mid-warmup.

Reported by @hedgar2017 in PR #360.
The old code gated `removed=${#to_remove[@]}` on xargs returning zero,
but xargs exits non-zero if *any* child `rm -rf` fails — so the summary
would print "Removed 0 inactive Xcode version(s)" even after 3/4
bundles (~45 GB) were successfully deleted. Misleading on a cleanup
step whose value is precisely the freed disk.

Set `removed` to the attempt count unconditionally before xargs and
keep the warning path for partial failure. Off-by-at-most-one in the
rare failure case is a much better signal than zero.
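The count-before-xargs fix can be sketched as follows, using throwaway temp directories in place of real Xcode bundles:

```shell
#!/usr/bin/env bash
# Sketch: set `removed` to the attempt count unconditionally, since xargs
# exits non-zero if *any* child rm fails even when most succeeded.
set -u
tmp=$(mktemp -d)
mkdir -p "$tmp/Xcode_15.app" "$tmp/Xcode_16.app"
to_remove=("$tmp/Xcode_15.app" "$tmp/Xcode_16.app")

removed=${#to_remove[@]}  # attempt count, set before xargs runs
if ! printf '%s\0' "${to_remove[@]}" | xargs -0 rm -rf; then
  # Keep the warning path for partial failure.
  echo "::warning::some bundles could not be deleted"
fi
echo "Removed ${removed} inactive Xcode version(s)"
```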

Reported by @hedgar2017 in PR #360.
@nebasuke nebasuke requested a review from hedgar2017 April 23, 2026 08:44
Contributor

@hedgar2017 hedgar2017 left a comment
Thank you sir!

@hedgar2017 hedgar2017 added this pull request to the merge queue Apr 23, 2026
Merged via the queue into main with commit 6fc0aaf Apr 23, 2026
41 checks passed
@hedgar2017 hedgar2017 deleted the ci/ccache-fallback-llvm-builds branch April 23, 2026 10:10