
Switch to ParallelTestRunner#3110

Merged
maleadt merged 13 commits into master from tb/ptr
Apr 21, 2026
Conversation

Member

@maleadt maleadt commented Apr 18, 2026

maleadt and others added 2 commits April 18, 2026 15:54
Each subpackage's `Pkg.test` runner is now a minimal call to PTR's
`runtests`, which spawns one worker process per test file and runs them
concurrently. `setup.jl` is loaded via `init_code` so each worker picks
up the shared fixtures.
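
Schematically, a subpackage's `test/runtests.jl` might now look like the following minimal sketch; the exact `runtests` signature and keyword names are assumptions based on the description above, not the PR's actual code:

```julia
# Hypothetical sketch of a subpackage's test/runtests.jl after this change;
# the `runtests` keyword names are assumptions.
using ParallelTestRunner

runtests(ARGS;
         # evaluated in every worker before its test files, so the
         # shared fixtures from setup.jl are in scope
         init_code = :(include(joinpath(@__DIR__, "setup.jl"))))
```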

Side effects:
- Extract cuTENSOR's inline "kernel cache" testset to its own file
  (lib/cutensor/test/kernel_cache.jl) since runtests.jl is no longer
  the place for test code.
- cuSPARSE's array.jl had three show-output tests that implicitly
  relied on `using cuSPARSE, SparseArrays` being in Main (where
  CUDA.jl's top-level runner incidentally loaded them). PTR workers
  run tests in an isolated submodule, so pass an explicit
  `:module => @__MODULE__` context to `sprint(show, …)` so the type
  names are qualified against the worker module's bindings rather
  than Main's.
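
The cuSPARSE fix relies on Julia's standard `IOContext` mechanism: `sprint` accepts a `context` keyword, and `show` consults the `:module` property when deciding how to qualify type names. A minimal illustration, with `x` standing in for the sparse array under test:

```julia
using SparseArrays

x = sparse([1.0 0.0; 0.0 2.0])
# Without a :module context, show qualifies type names relative to Main.
# Inside an isolated worker submodule, pass the module explicitly:
str = sprint(show, x; context = :module => @__MODULE__)
```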

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites `test/runtests.jl` on top of `ParallelTestRunner.runtests`,
unifying CUDA.jl with the subpackages migrated in 94dedf5. The homegrown
runner (~580 lines in runtests.jl + ~210 in setup.jl) becomes a thin
wrapper plus a CUDA-specific `AbstractTestRecord`:

- `CUDATestRecord` carries the standard fields plus `gpu_bytes`,
  `gpu_time`, and `gpu_rss`; an `execute(::Type{CUDATestRecord}, ...)`
  method uses `CUDA.@timed` to capture GPU alloc stats and queries NVML
  for per-process GPU RSS. `print_test_finished`/`print_test_failed`
  overrides add `GPU Alloc (MB)` and `GPU RSS (MB)` columns.
- Worker count is capped by free GPU memory (~2 GiB/worker) in addition
  to PTR's CPU/RAM default.
- `--sanitize[=tool]` wraps every worker by passing a compute-sanitizer
  `Cmd` as `runtests`'s `exename` kwarg (new in PTR 2.6).
- `--all` (or an explicit `libraries/*` positional) includes subpackage
  tests under `lib/*/test/`, using `Base.set_active_project` to activate
  the subpackage's Project.toml.
- Context-destroying tests (`core/initialization`, `core/cudadrv`) are
  isolated on a fresh worker via the `test_worker` hook and use plain
  Julia timing (since CUDA events invalidate with the context).
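
A rough sketch of what such a record could look like; only the three GPU fields are named in the description above, and everything else here is an assumption:

```julia
# Stand-in sketch; ParallelTestRunner's real AbstractTestRecord has its own
# required fields, which are assumed here as `time`/`bytes`.
abstract type AbstractTestRecord end

struct CUDATestRecord <: AbstractTestRecord
    time::Float64      # standard CPU-side timing (assumed field)
    bytes::Int         # standard CPU-side allocations (assumed field)
    gpu_time::Float64  # GPU time, captured via CUDA.@timed
    gpu_bytes::Int     # GPU allocations, captured via CUDA.@timed
    gpu_rss::Int       # per-process GPU memory, queried through NVML
end
```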

Per-worker setup (`CUDATestRecord`, NVML helpers, GPUArrays TestSuite
include, `CUDA.precompile_runtime`) lives in `test/setup.jl` and runs
via `init_worker_code`. Per-test helpers (`testf`, `sink`, `@grab_output`,
`@on_device`, `julia_exec`) are in a new `test/helpers.jl` included via
`init_code`, so subpackage setup.jl's `testf` redefinitions don't clash
with an imported binding.

Drops: `--gpu=…` multi-device selection, exclusive-mode downgrade,
interactive `?` key. GPU selection now goes through `CUDA_VISIBLE_DEVICES`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member Author

maleadt commented Apr 19, 2026

51/41/37 minutes here vs. 42/34/33 on master for Julia 1.11/1.12/1.13. Surprisingly, this looks like the actual tests themselves slowing down, e.g. on Julia 1.12:

gpuarrays/linalg/core                          (9) │   152.89 │   0.02 │      28.29 │   148.00 │   4.48 │  2.9 │   13223.51 │  4035.78 │
gpuarrays/linalg/norm                         (13) │   281.17 │   0.02 │       0.03 │   130.00 │   7.46 │  2.7 │   17295.61 │  3769.99 │

vs

gpuarrays/linalg/core                         (3) |   126.39 |   0.02 |  0.0 |      28.29 |   226.00 |   3.43 |  2.7 |   11518.49 | 10666.16 |
gpuarrays/linalg/norm                         (4) |   250.23 |   0.02 |  0.0 |       0.03 |   146.00 |   6.00 |  2.4 |   14957.74 |  6904.70 |

@giordano
Contributor

Try with `--verbose`, which also shows the init time?

@maleadt
Member Author

maleadt commented Apr 19, 2026

Good idea. It's probably related to my tuning of the memory pool heuristics, though, rather than to PTR itself.

maleadt and others added 8 commits April 19, 2026 10:24
A compute-sanitizer-wrapped worker starts by printing its banner
('========= COMPUTE-SANITIZER') to stdout, which collides with Malt's
port handshake (the first stdout line must be parseable as a UInt16 port
number). Passing `--log-file=<dir>/%p.log` redirects sanitizer text to a
per-process file, leaving the worker's stdout clean for Malt.
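
Schematically, the wrapped worker command might be built like this; the wrapper construction is an assumption, while `--log-file` and the `%p` PID substitution are documented compute-sanitizer options:

```julia
# Sketch: build the sanitizer-wrapped exename so sanitizer output is
# written per-process instead of colliding with Malt's stdout handshake.
logdir = mktempdir()
exename = `compute-sanitizer --tool=memcheck --log-file=$(logdir)/%p.log $(Base.julia_cmd())`
```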

After `runtests` returns (or throws), scan the directory and surface any
logs missing the "ERROR SUMMARY: 0 errors" line; emit a colored
one-liner summary otherwise. This preserves the signal while keeping
clean runs quiet.
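
The post-run scan could look roughly like this; variable names and message wording are illustrative:

```julia
clean = true
for log in filter(endswith(".log"), readdir(logdir; join = true))
    contents = read(log, String)
    if !occursin("ERROR SUMMARY: 0 errors", contents)
        clean = false
        printstyled("compute-sanitizer reported errors in $(basename(log)):\n"; color = :red)
        print(contents)
    end
end
clean && printstyled("compute-sanitizer: no errors detected.\n"; color = :green)
```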

Also silence the `Pkg.activate`/`Pkg.add` chatter during CUDA_SDK_jll
install (`io = devnull`) — the only output we want is the sanitizer
version banner we explicitly print.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers run many tests back-to-back, and pool-cached buffers stay
resident because the release threshold is unbounded and the idle
pool-cleanup task only runs when `isinteractive()`. Calling
`CUDA.reclaim()` after the post-test GC trims the pool and empties
library handle caches, reducing GPU RSS accumulation without
invalidating compiled kernels.
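
In effect, each worker's post-test cleanup becomes something like the following (the hook placement is assumed):

```julia
GC.gc(true)      # full collection so dead CuArrays release pool buffers
CUDA.reclaim()   # trim the cached pool and empty library handle caches,
                 # returning memory to the driver between tests
```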

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match ParallelTestRunner's new composition pattern for `AbstractTestRecord`:
carry a `base::TestRecord` field and delegate Julia-timed execution to
`ParallelTestRunner.execute(TestRecord, …)` instead of redeclaring every
baseline field and re-implementing the non-CUDA timing path inline.
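
A sketch of the composed record and delegated execution; names beyond those in the description above are assumptions:

```julia
struct CUDATestRecord <: ParallelTestRunner.AbstractTestRecord
    base::ParallelTestRunner.TestRecord  # standard fields, by composition
    gpu_time::Float64
    gpu_bytes::Int
    gpu_rss::Int
end

# Delegate the baseline timing path rather than re-implementing it:
function ParallelTestRunner.execute(::Type{CUDATestRecord}, args...)
    stats = CUDA.@timed ParallelTestRunner.execute(ParallelTestRunner.TestRecord, args...)
    # stats carries the GPU measurements; stats.value is the base record
    # (exact CUDA.@timed field names omitted, as they are version-dependent)
end
```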

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-test GPU RSS data (collected after adding post-test CUDA.reclaim())
showed a handful of tests blowing well past the 4 GiB per-worker budget:

- test/core/array.jl: the 512^3 sum!() case allocated ~1 GiB Float64 to
  exercise the big-mapreduce path; (85, 1320, 100) already exercises the
  same serial kernel path. Drop the 512^3 case.
- test/core/sorting.jl: the "large sizes" quicksort input at 2^25 Float32
  was 128 MiB; 2^22 still exercises the multi-block quicksort path.
- examples/peakflops.jl: default n=5000 built four 5000x5000 Float32
  matrices (~400 MiB); n=1024 is enough to demonstrate the example.
- lib/cutensornet/test/contractions.jl: max_ws_size=2^32 (4 GiB
  workspace hint) was inflating cuTensorNet to ~1.5 GiB; 2^28 covers
  the same tuning paths.

Library tests (cusolver/cusparse/cudnn/cutensor/etc.) still sit at
1-2 GiB due to persistent library workspace that's not pool-allocated
and therefore not released by CUDA.reclaim() between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `@test_throws UndefRefError current_context()` / `current_device()`
assertions at the top of initialization.jl require that CUDA hasn't been
touched yet in the current Julia process. With PTR, every worker runs
`setup.jl` as `init_worker_code`, and that already does
`CUDA.functional(true)` / `precompile_runtime` / pool config — so the
worker is never in a fresh state by the time the test runs, and these
assertions fail ("Expected: UndefRefError, No exception thrown").

Run those four assertions (and the paired "now cause initialization"
check) in a subprocess instead, the same way the issue-1331 test at the
bottom of the file already does. The rest of initialization.jl doesn't
depend on fresh state and runs fine on a normal worker.
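
The subprocess pattern might be sketched as follows; the script body follows the assertions described above, while the command-line flags are assumptions:

```julia
script = """
    using CUDA, Test
    # these require a process in which CUDA has never been initialized
    @test_throws UndefRefError current_context()
    @test_throws UndefRefError current_device()
    CUDA.ones(1)  # now cause initialization
    @test current_context() isa CuContext
    """
cmd = `$(Base.julia_cmd()) --project=$(Base.active_project()) -e $script`
@test success(cmd)
```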

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The top-band GPU label spanned 30 columns but the bottom-band GPU cells
(GC/Alloc/RSS) sum to 33, shifting every pipe after the GPU section
three columns left. Widen the dashes (12 + 13) to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt maleadt marked this pull request as ready for review April 20, 2026 13:08
Contributor

@github-actions bot left a comment


CUDA.jl Benchmarks

Details
Benchmark suite Current: 63a68c4 Previous: 9702041 Ratio
array/accumulate/Float32/1d 101140 ns 101336 ns 1.00
array/accumulate/Float32/dims=1 76267 ns 76569 ns 1.00
array/accumulate/Float32/dims=1L 1585826 ns 1585630.5 ns 1.00
array/accumulate/Float32/dims=2 143428 ns 143872.5 ns 1.00
array/accumulate/Float32/dims=2L 657493.5 ns 657817.5 ns 1.00
array/accumulate/Int64/1d 118646 ns 118546 ns 1.00
array/accumulate/Int64/dims=1 80280 ns 80328 ns 1.00
array/accumulate/Int64/dims=1L 1706680.5 ns 1694780.5 ns 1.01
array/accumulate/Int64/dims=2 157247.5 ns 156646 ns 1.00
array/accumulate/Int64/dims=2L 962003 ns 962572 ns 1.00
array/broadcast 20670.5 ns 20507 ns 1.01
array/construct 1264.5 ns 1260.6 ns 1.00
array/copy 17872 ns 18042.5 ns 0.99
array/copyto!/cpu_to_gpu 214938 ns 216139.5 ns 0.99
array/copyto!/gpu_to_cpu 283434 ns 282550 ns 1.00
array/copyto!/gpu_to_gpu 10684 ns 10770 ns 0.99
array/iteration/findall/bool 134489 ns 134891 ns 1.00
array/iteration/findall/int 149743 ns 150607 ns 0.99
array/iteration/findfirst/bool 81215 ns 81621 ns 1.00
array/iteration/findfirst/int 83429.5 ns 83931 ns 0.99
array/iteration/findmin/1d 88136.5 ns 88319.5 ns 1.00
array/iteration/findmin/2d 117332.5 ns 116740 ns 1.01
array/iteration/logical 200219.5 ns 199127 ns 1.01
array/iteration/scalar 67096 ns 69801 ns 0.96
array/permutedims/2d 52173.5 ns 51913 ns 1.01
array/permutedims/3d 52747 ns 52967 ns 1.00
array/permutedims/4d 51373 ns 51865.5 ns 0.99
array/random/rand/Float32 12818 ns 12969 ns 0.99
array/random/rand/Int64 24941 ns 24834 ns 1.00
array/random/rand!/Float32 8318.666666666666 ns 8996.666666666666 ns 0.92
array/random/rand!/Int64 21893 ns 21694 ns 1.01
array/random/randn/Float32 37834 ns 37552.5 ns 1.01
array/random/randn!/Float32 30772 ns 30840 ns 1.00
array/reductions/mapreduce/Float32/1d 34319 ns 35074 ns 0.98
array/reductions/mapreduce/Float32/dims=1 40492 ns 40645 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 51236 ns 51243.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56197 ns 56526 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 69138.5 ns 69833 ns 0.99
array/reductions/mapreduce/Int64/1d 42165.5 ns 43336 ns 0.97
array/reductions/mapreduce/Int64/dims=1 42715 ns 42698 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 87109 ns 87131 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59317 ns 59771 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 84317 ns 84773.5 ns 0.99
array/reductions/reduce/Float32/1d 34432 ns 34857.5 ns 0.99
array/reductions/reduce/Float32/dims=1 40092 ns 39525 ns 1.01
array/reductions/reduce/Float32/dims=1L 51381 ns 51365 ns 1.00
array/reductions/reduce/Float32/dims=2 56802 ns 56840 ns 1.00
array/reductions/reduce/Float32/dims=2L 69698 ns 69873 ns 1.00
array/reductions/reduce/Int64/1d 42618 ns 43072 ns 0.99
array/reductions/reduce/Int64/dims=1 43391.5 ns 42870 ns 1.01
array/reductions/reduce/Int64/dims=1L 87117 ns 87150 ns 1.00
array/reductions/reduce/Int64/dims=2 59838 ns 59657.5 ns 1.00
array/reductions/reduce/Int64/dims=2L 84509 ns 84754 ns 1.00
array/reverse/1d 17975 ns 17920 ns 1.00
array/reverse/1dL 68561 ns 68474 ns 1.00
array/reverse/1dL_inplace 65693 ns 65769 ns 1.00
array/reverse/1d_inplace 10259 ns 10329.333333333334 ns 0.99
array/reverse/2d 20756 ns 20545 ns 1.01
array/reverse/2dL 72734 ns 72599 ns 1.00
array/reverse/2dL_inplace 65813 ns 65843 ns 1.00
array/reverse/2d_inplace 9934 ns 9925 ns 1.00
array/sorting/1d 2734738 ns 2735157 ns 1.00
array/sorting/2d 1068707 ns 1068027 ns 1.00
array/sorting/by 3304798 ns 3304860 ns 1.00
cuda/synchronization/context/auto 1158.2 ns 1186.5 ns 0.98
cuda/synchronization/context/blocking 941.1315789473684 ns 933.8571428571429 ns 1.01
cuda/synchronization/context/nonblocking 8496 ns 7202.1 ns 1.18
cuda/synchronization/stream/auto 1027.875 ns 1047.75 ns 0.98
cuda/synchronization/stream/blocking 835.5714285714286 ns 845.7979797979798 ns 0.99
cuda/synchronization/stream/nonblocking 7455 ns 7400.299999999999 ns 1.01
integration/byval/reference 143779 ns 143847 ns 1.00
integration/byval/slices=1 145974 ns 145799 ns 1.00
integration/byval/slices=2 284791 ns 284592 ns 1.00
integration/byval/slices=3 423503.5 ns 423121 ns 1.00
integration/cudadevrt 102569 ns 102350 ns 1.00
integration/volumerhs 23499620 ns 23414128.5 ns 1.00
kernel/indexing 13265 ns 13228 ns 1.00
kernel/indexing_checked 14047 ns 14089 ns 1.00
kernel/launch 2127.8888888888887 ns 2207.5555555555557 ns 0.96
kernel/occupancy 722.0289855072464 ns 663.5094339622641 ns 1.09
kernel/rand 16687 ns 14119 ns 1.18
latency/import 3837952002 ns 3828719288.5 ns 1.00
latency/precompile 4578362120 ns 4584717609 ns 1.00
latency/ttfp 4425030626 ns 4415845715.5 ns 1.00

This comment was automatically generated by a workflow using github-action-benchmark.

Member Author

maleadt commented Apr 21, 2026

The only remaining issue is the multi-GPU one. I'm looking into it separately, but I think we can merge this already.

@maleadt maleadt merged commit e0e295f into master Apr 21, 2026
1 of 2 checks passed
@maleadt maleadt deleted the tb/ptr branch April 21, 2026 05:49
@giordano
Contributor

What's the issue specifically?

Member Author

maleadt commented Apr 21, 2026

What's the issue specifically?

A test failure due to a driver bug (presumably); it's unrelated to PTR, but PTR does seem to make it occur more often.
