
Switch to ParallelTestRunner#3110

Merged
maleadt merged 13 commits into master from tb/ptr
Apr 21, 2026
Conversation

Member

@maleadt maleadt commented Apr 18, 2026

maleadt and others added 2 commits April 18, 2026 15:54
Each subpackage's `Pkg.test` runner is now a minimal call to PTR's
`runtests`, which spawns one worker process per test file and runs them
concurrently. `setup.jl` is loaded via `init_code` so each worker picks
up the shared fixtures.
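
Schematically, a subpackage's `test/runtests.jl` might now look like the following minimal sketch; the exact `runtests` signature and keyword names are assumptions based on the description above, not the PR's actual code:

```julia
# Hypothetical sketch of a subpackage's test/runtests.jl after this change;
# the `runtests` keyword names are assumptions.
using ParallelTestRunner

runtests(ARGS;
         # evaluated in every worker before its test files, so the
         # shared fixtures from setup.jl are in scope
         init_code = :(include(joinpath(@__DIR__, "setup.jl"))))
```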

Side effects:
- Extract cuTENSOR's inline "kernel cache" testset to its own file
  (lib/cutensor/test/kernel_cache.jl) since runtests.jl is no longer
  the place for test code.
- cuSPARSE's array.jl had three show-output tests that implicitly
  relied on `using cuSPARSE, SparseArrays` being in Main (where
  CUDA.jl's top-level runner incidentally loaded them). PTR workers
  run tests in an isolated submodule, so pass an explicit
  `:module => @__MODULE__` context to `sprint(show, …)` so the type
  names are qualified against the worker module's bindings rather
  than Main's.
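
The cuSPARSE fix relies on Julia's standard `IOContext` mechanism: `sprint` accepts a `context` keyword, and `show` consults the `:module` property when deciding how to qualify type names. A minimal illustration, with `x` standing in for the sparse array under test:

```julia
using SparseArrays

x = sparse([1.0 0.0; 0.0 2.0])
# Without a :module context, show qualifies type names relative to Main.
# Inside an isolated worker submodule, pass the module explicitly:
str = sprint(show, x; context = :module => @__MODULE__)
```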

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrites `test/runtests.jl` on top of `ParallelTestRunner.runtests`,
unifying CUDA.jl with the subpackages migrated in 94dedf5. The homegrown
runner (~580 lines in runtests.jl + ~210 in setup.jl) becomes a thin
wrapper plus a CUDA-specific `AbstractTestRecord`:

- `CUDATestRecord` carries the standard fields plus `gpu_bytes`,
  `gpu_time`, and `gpu_rss`; an `execute(::Type{CUDATestRecord}, ...)`
  method uses `CUDA.@timed` to capture GPU alloc stats and queries NVML
  for per-process GPU RSS. `print_test_finished`/`print_test_failed`
  overrides add `GPU Alloc (MB)` and `GPU RSS (MB)` columns.
- Worker count is capped by free GPU memory (~2 GiB/worker) in addition
  to PTR's CPU/RAM default.
- `--sanitize[=tool]` wraps every worker by passing a compute-sanitizer
  `Cmd` as `runtests`'s `exename` kwarg (new in PTR 2.6).
- `--all` (or an explicit `libraries/*` positional) includes subpackage
  tests under `lib/*/test/`, using `Base.set_active_project` to activate
  the subpackage's Project.toml.
- Context-destroying tests (`core/initialization`, `core/cudadrv`) are
  isolated on a fresh worker via the `test_worker` hook and use plain
  Julia timing (since CUDA events invalidate with the context).
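
A rough sketch of what such a record could look like; only the three GPU fields are named in the description above, and everything else here is an assumption:

```julia
# Stand-in sketch; ParallelTestRunner's real AbstractTestRecord has its own
# required fields, which are assumed here as `time`/`bytes`.
abstract type AbstractTestRecord end

struct CUDATestRecord <: AbstractTestRecord
    time::Float64      # standard CPU-side timing (assumed field)
    bytes::Int         # standard CPU-side allocations (assumed field)
    gpu_time::Float64  # GPU time, captured via CUDA.@timed
    gpu_bytes::Int     # GPU allocations, captured via CUDA.@timed
    gpu_rss::Int       # per-process GPU memory, queried through NVML
end
```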

Per-worker setup (`CUDATestRecord`, NVML helpers, GPUArrays TestSuite
include, `CUDA.precompile_runtime`) lives in `test/setup.jl` and runs
via `init_worker_code`. Per-test helpers (`testf`, `sink`, `@grab_output`,
`@on_device`, `julia_exec`) are in a new `test/helpers.jl` included via
`init_code`, so subpackage setup.jl's `testf` redefinitions don't clash
with an imported binding.

Drops: `--gpu=…` multi-device selection, exclusive-mode downgrade,
interactive `?` key. GPU selection now goes through `CUDA_VISIBLE_DEVICES`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Member Author

maleadt commented Apr 19, 2026

51/41/37 minutes here vs. 42/34/33 on master for Julia 1.11/1.12/1.13. Surprisingly, this looks like the actual tests themselves slowing down, e.g. on Julia 1.12:

gpuarrays/linalg/core                          (9) │   152.89 │   0.02 │      28.29 │   148.00 │   4.48 │  2.9 │   13223.51 │  4035.78 │
gpuarrays/linalg/norm                         (13) │   281.17 │   0.02 │       0.03 │   130.00 │   7.46 │  2.7 │   17295.61 │  3769.99 │

vs

gpuarrays/linalg/core                         (3) |   126.39 |   0.02 |  0.0 |      28.29 |   226.00 |   3.43 |  2.7 |   11518.49 | 10666.16 |
gpuarrays/linalg/norm                         (4) |   250.23 |   0.02 |  0.0 |       0.03 |   146.00 |   6.00 |  2.4 |   14957.74 |  6904.70 |

@giordano
Contributor

Try with `--verbose`, which also shows the init time?

@maleadt
Member Author

maleadt commented Apr 19, 2026

Good idea. It's probably related to my tuning of the memory pool heuristics, though, rather than to PTR itself.

maleadt and others added 8 commits April 19, 2026 10:24
A compute-sanitizer-wrapped worker starts by printing its banner
('========= COMPUTE-SANITIZER') to stdout, which collides with Malt's
port handshake (the first stdout line must be parseable as a UInt16 port
number). Passing `--log-file=<dir>/%p.log` redirects sanitizer text to a
per-process file, leaving the worker's stdout clean for Malt.
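
Schematically, the wrapped worker command might be built like this; the wrapper construction is an assumption, while `--log-file` and the `%p` PID substitution are documented compute-sanitizer options:

```julia
# Sketch: build the sanitizer-wrapped exename so sanitizer output is
# written per-process instead of colliding with Malt's stdout handshake.
logdir = mktempdir()
exename = `compute-sanitizer --tool=memcheck --log-file=$(logdir)/%p.log $(Base.julia_cmd())`
```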

After `runtests` returns (or throws), scan the directory and surface any
logs missing the "ERROR SUMMARY: 0 errors" line; emit a colored
one-liner summary otherwise. This preserves the signal while keeping
clean runs quiet.
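
The post-run scan could look roughly like this; variable names and message wording are illustrative:

```julia
clean = true
for log in filter(endswith(".log"), readdir(logdir; join = true))
    contents = read(log, String)
    if !occursin("ERROR SUMMARY: 0 errors", contents)
        clean = false
        printstyled("compute-sanitizer reported errors in $(basename(log)):\n"; color = :red)
        print(contents)
    end
end
clean && printstyled("compute-sanitizer: no errors detected.\n"; color = :green)
```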

Also silence the `Pkg.activate`/`Pkg.add` chatter during CUDA_SDK_jll
install (`io = devnull`) — the only output we want is the sanitizer
version banner we explicitly print.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Workers run many tests back-to-back, and pool-cached buffers stay
resident because the release threshold is unbounded and the idle
pool-cleanup task only runs when `isinteractive()`. Calling
`CUDA.reclaim()` after the post-test GC trims the pool and empties
library handle caches, reducing GPU RSS accumulation without
invalidating compiled kernels.
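
In effect, each worker's post-test cleanup becomes something like the following (the hook placement is assumed):

```julia
GC.gc(true)      # full collection so dead CuArrays release pool buffers
CUDA.reclaim()   # trim the cached pool and empty library handle caches,
                 # returning memory to the driver between tests
```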

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match ParallelTestRunner's new composition pattern for `AbstractTestRecord`:
carry a `base::TestRecord` field and delegate Julia-timed execution to
`ParallelTestRunner.execute(TestRecord, …)` instead of redeclaring every
baseline field and re-implementing the non-CUDA timing path inline.
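
A sketch of the composed record and delegated execution; names beyond those in the description above are assumptions:

```julia
struct CUDATestRecord <: ParallelTestRunner.AbstractTestRecord
    base::ParallelTestRunner.TestRecord  # standard fields, by composition
    gpu_time::Float64
    gpu_bytes::Int
    gpu_rss::Int
end

# Delegate the baseline timing path rather than re-implementing it:
function ParallelTestRunner.execute(::Type{CUDATestRecord}, args...)
    stats = CUDA.@timed ParallelTestRunner.execute(ParallelTestRunner.TestRecord, args...)
    # stats carries the GPU measurements; stats.value is the base record
    # (exact CUDA.@timed field names omitted, as they are version-dependent)
end
```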

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Per-test GPU RSS data (collected after adding post-test CUDA.reclaim())
showed a handful of tests blowing well past the 4 GiB per-worker budget:

- test/core/array.jl: the 512^3 sum!() case allocated ~1 GiB Float64 to
  exercise the big-mapreduce path; (85, 1320, 100) already exercises the
  same serial kernel path. Drop the 512^3 case.
- test/core/sorting.jl: the "large sizes" quicksort input at 2^25 Float32
  was 128 MiB; 2^22 still exercises the multi-block quicksort path.
- examples/peakflops.jl: default n=5000 built four 5000x5000 Float32
  matrices (~400 MiB); n=1024 is enough to demonstrate the example.
- lib/cutensornet/test/contractions.jl: max_ws_size=2^32 (4 GiB
  workspace hint) was inflating cuTensorNet to ~1.5 GiB; 2^28 covers
  the same tuning paths.

Library tests (cusolver/cusparse/cudnn/cutensor/etc.) still sit at
1-2 GiB due to persistent library workspace that's not pool-allocated
and therefore not released by CUDA.reclaim() between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `@test_throws UndefRefError current_context()` / `current_device()`
assertions at the top of initialization.jl require that CUDA hasn't been
touched yet in the current Julia process. With PTR, every worker runs
`setup.jl` as `init_worker_code`, and that already does
`CUDA.functional(true)` / `precompile_runtime` / pool config — so the
worker is never in a fresh state by the time the test runs, and these
assertions fail ("Expected: UndefRefError, No exception thrown").

Run those four assertions (and the paired "now cause initialization"
check) in a subprocess instead, the same way the issue-1331 test at the
bottom of the file already does. The rest of initialization.jl doesn't
depend on fresh state and runs fine on a normal worker.
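
The subprocess pattern might be sketched as follows; the script body follows the assertions described above, while the command-line flags are assumptions:

```julia
script = """
    using CUDA, Test
    # these require a process in which CUDA has never been initialized
    @test_throws UndefRefError current_context()
    @test_throws UndefRefError current_device()
    CUDA.ones(1)  # now cause initialization
    @test current_context() isa CuContext
    """
cmd = `$(Base.julia_cmd()) --project=$(Base.active_project()) -e $script`
@test success(cmd)
```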

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The top-band GPU label spanned 30 columns but the bottom-band GPU cells
(GC/Alloc/RSS) sum to 33, shifting every pipe after the GPU section
three columns left. Widen the dashes (12 + 13) to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@maleadt maleadt marked this pull request as ready for review April 20, 2026 13:08
Contributor

@github-actions bot left a comment


CUDA.jl Benchmarks

Details
Benchmark suite Current: 63a68c4 Previous: 9702041 Ratio
array/accumulate/Float32/1d 101140 ns 101336 ns 1.00
array/accumulate/Float32/dims=1 76267 ns 76569 ns 1.00
array/accumulate/Float32/dims=1L 1585826 ns 1585630.5 ns 1.00
array/accumulate/Float32/dims=2 143428 ns 143872.5 ns 1.00
array/accumulate/Float32/dims=2L 657493.5 ns 657817.5 ns 1.00
array/accumulate/Int64/1d 118646 ns 118546 ns 1.00
array/accumulate/Int64/dims=1 80280 ns 80328 ns 1.00
array/accumulate/Int64/dims=1L 1706680.5 ns 1694780.5 ns 1.01
array/accumulate/Int64/dims=2 157247.5 ns 156646 ns 1.00
array/accumulate/Int64/dims=2L 962003 ns 962572 ns 1.00
array/broadcast 20670.5 ns 20507 ns 1.01
array/construct 1264.5 ns 1260.6 ns 1.00
array/copy 17872 ns 18042.5 ns 0.99
array/copyto!/cpu_to_gpu 214938 ns 216139.5 ns 0.99
array/copyto!/gpu_to_cpu 283434 ns 282550 ns 1.00
array/copyto!/gpu_to_gpu 10684 ns 10770 ns 0.99
array/iteration/findall/bool 134489 ns 134891 ns 1.00
array/iteration/findall/int 149743 ns 150607 ns 0.99
array/iteration/findfirst/bool 81215 ns 81621 ns 1.00
array/iteration/findfirst/int 83429.5 ns 83931 ns 0.99
array/iteration/findmin/1d 88136.5 ns 88319.5 ns 1.00
array/iteration/findmin/2d 117332.5 ns 116740 ns 1.01
array/iteration/logical 200219.5 ns 199127 ns 1.01
array/iteration/scalar 67096 ns 69801 ns 0.96
array/permutedims/2d 52173.5 ns 51913 ns 1.01
array/permutedims/3d 52747 ns 52967 ns 1.00
array/permutedims/4d 51373 ns 51865.5 ns 0.99
array/random/rand/Float32 12818 ns 12969 ns 0.99
array/random/rand/Int64 24941 ns 24834 ns 1.00
array/random/rand!/Float32 8318.666666666666 ns 8996.666666666666 ns 0.92
array/random/rand!/Int64 21893 ns 21694 ns 1.01
array/random/randn/Float32 37834 ns 37552.5 ns 1.01
array/random/randn!/Float32 30772 ns 30840 ns 1.00
array/reductions/mapreduce/Float32/1d 34319 ns 35074 ns 0.98
array/reductions/mapreduce/Float32/dims=1 40492 ns 40645 ns 1.00
array/reductions/mapreduce/Float32/dims=1L 51236 ns 51243.5 ns 1.00
array/reductions/mapreduce/Float32/dims=2 56197 ns 56526 ns 0.99
array/reductions/mapreduce/Float32/dims=2L 69138.5 ns 69833 ns 0.99
array/reductions/mapreduce/Int64/1d 42165.5 ns 43336 ns 0.97
array/reductions/mapreduce/Int64/dims=1 42715 ns 42698 ns 1.00
array/reductions/mapreduce/Int64/dims=1L 87109 ns 87131 ns 1.00
array/reductions/mapreduce/Int64/dims=2 59317 ns 59771 ns 0.99
array/reductions/mapreduce/Int64/dims=2L 84317 ns 84773.5 ns 0.99
array/reductions/reduce/Float32/1d 34432 ns 34857.5 ns 0.99
array/reductions/reduce/Float32/dims=1 40092 ns 39525 ns 1.01
array/reductions/reduce/Float32/dims=1L 51381 ns 51365 ns 1.00
array/reductions/reduce/Float32/dims=2 56802 ns 56840 ns 1.00
array/reductions/reduce/Float32/dims=2L 69698 ns 69873 ns 1.00
array/reductions/reduce/Int64/1d 42618 ns 43072 ns 0.99
array/reductions/reduce/Int64/dims=1 43391.5 ns 42870 ns 1.01
array/reductions/reduce/Int64/dims=1L 87117 ns 87150 ns 1.00
array/reductions/reduce/Int64/dims=2 59838 ns 59657.5 ns 1.00
array/reductions/reduce/Int64/dims=2L 84509 ns 84754 ns 1.00
array/reverse/1d 17975 ns 17920 ns 1.00
array/reverse/1dL 68561 ns 68474 ns 1.00
array/reverse/1dL_inplace 65693 ns 65769 ns 1.00
array/reverse/1d_inplace 10259 ns 10329.333333333334 ns 0.99
array/reverse/2d 20756 ns 20545 ns 1.01
array/reverse/2dL 72734 ns 72599 ns 1.00
array/reverse/2dL_inplace 65813 ns 65843 ns 1.00
array/reverse/2d_inplace 9934 ns 9925 ns 1.00
array/sorting/1d 2734738 ns 2735157 ns 1.00
array/sorting/2d 1068707 ns 1068027 ns 1.00
array/sorting/by 3304798 ns 3304860 ns 1.00
cuda/synchronization/context/auto 1158.2 ns 1186.5 ns 0.98
cuda/synchronization/context/blocking 941.1315789473684 ns 933.8571428571429 ns 1.01
cuda/synchronization/context/nonblocking 8496 ns 7202.1 ns 1.18
cuda/synchronization/stream/auto 1027.875 ns 1047.75 ns 0.98
cuda/synchronization/stream/blocking 835.5714285714286 ns 845.7979797979798 ns 0.99
cuda/synchronization/stream/nonblocking 7455 ns 7400.299999999999 ns 1.01
integration/byval/reference 143779 ns 143847 ns 1.00
integration/byval/slices=1 145974 ns 145799 ns 1.00
integration/byval/slices=2 284791 ns 284592 ns 1.00
integration/byval/slices=3 423503.5 ns 423121 ns 1.00
integration/cudadevrt 102569 ns 102350 ns 1.00
integration/volumerhs 23499620 ns 23414128.5 ns 1.00
kernel/indexing 13265 ns 13228 ns 1.00
kernel/indexing_checked 14047 ns 14089 ns 1.00
kernel/launch 2127.8888888888887 ns 2207.5555555555557 ns 0.96
kernel/occupancy 722.0289855072464 ns 663.5094339622641 ns 1.09
kernel/rand 16687 ns 14119 ns 1.18
latency/import 3837952002 ns 3828719288.5 ns 1.00
latency/precompile 4578362120 ns 4584717609 ns 1.00
latency/ttfp 4425030626 ns 4415845715.5 ns 1.00

This comment was automatically generated by a workflow using github-action-benchmark.

Member Author

maleadt commented Apr 21, 2026

The only remaining issue is the multi-GPU one. I'm looking into it separately, but I think we can merge this already.

@maleadt maleadt merged commit e0e295f into master Apr 21, 2026
1 of 2 checks passed
@maleadt maleadt deleted the tb/ptr branch April 21, 2026 05:49
@giordano
Contributor

What's the issue specifically?

Member Author

maleadt commented Apr 21, 2026

What's the issue specifically?

A test failure due to a driver bug (presumably); it's unrelated to PTR, but PTR does seem to make it occur more often.
