Conversation
Each subpackage's `Pkg.test` runner is now a minimal call to PTR's `runtests`, which spawns one worker process per test file and runs them concurrently. `setup.jl` is loaded via `init_code` so each worker picks up the shared fixtures.

Side effects:
- Extract cuTENSOR's inline "kernel cache" testset to its own file (`lib/cutensor/test/kernel_cache.jl`), since `runtests.jl` is no longer the place for test code.
- cuSPARSE's `array.jl` had three show-output tests that implicitly relied on `using cuSPARSE, SparseArrays` being in `Main` (where CUDA.jl's top-level runner incidentally loaded them). PTR workers run tests in an isolated submodule, so pass an explicit `:module => @__MODULE__` context to `sprint(show, …)` so the type names are qualified against the worker module's bindings rather than `Main`'s.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
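For readers unfamiliar with the `:module` IOContext key, here is a minimal stand-alone sketch of the behavior that fix relies on (the `DemoMod`/`Foo` names are illustrative, not from the PR):

```julia
# Illustration of the :module IOContext key: it sets the module against which
# `show` qualifies type names (the default is Main).
module DemoMod
    struct Foo end
end

# Without a context, Foo isn't visible from Main, so the name gets qualified
# (e.g. "Main.DemoMod.Foo()"):
default_repr = sprint(show, DemoMod.Foo())

# With :module pointing at DemoMod, the name resolves locally and the
# qualification prefix drops away:
scoped_repr = sprint(show, DemoMod.Foo(); context = :module => DemoMod)
```

This is why PTR's isolated worker submodules need the explicit context: the default `:module` is `Main`, where the worker never loaded the packages.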
Rewrites `test/runtests.jl` on top of `ParallelTestRunner.runtests`, unifying CUDA.jl with the subpackages migrated in 94dedf5. The homegrown runner (~580 lines in runtests.jl + ~210 in setup.jl) becomes a thin wrapper plus a CUDA-specific `AbstractTestRecord`:

- `CUDATestRecord` carries the standard fields plus `gpu_bytes`, `gpu_time`, and `gpu_rss`; an `execute(::Type{CUDATestRecord}, ...)` method uses `CUDA.@timed` to capture GPU alloc stats and queries NVML for per-process GPU RSS. `print_test_finished`/`print_test_failed` overrides add `GPU Alloc (MB)` and `GPU RSS (MB)` columns.
- Worker count is capped by free GPU memory (~2 GiB/worker) in addition to PTR's CPU/RAM default.
- `--sanitize[=tool]` wraps every worker by passing a compute-sanitizer `Cmd` as `runtests`'s `exename` kwarg (new in PTR 2.6).
- `--all` (or an explicit `libraries/*` positional) includes subpackage tests under `lib/*/test/`, using `Base.set_active_project` to activate the subpackage's Project.toml.
- Context-destroying tests (`core/initialization`, `core/cudadrv`) are isolated on a fresh worker via the `test_worker` hook and use plain Julia timing (since CUDA events invalidate with the context).

Per-worker setup (`CUDATestRecord`, NVML helpers, GPUArrays TestSuite include, `CUDA.precompile_runtime`) lives in `test/setup.jl` and runs via `init_worker_code`. Per-test helpers (`testf`, `sink`, `@grab_output`, `@on_device`, `julia_exec`) are in a new `test/helpers.jl` included via `init_code`, so subpackage setup.jl's `testf` redefinitions don't clash with an imported binding.

Drops: `--gpu=…` multi-device selection, exclusive-mode downgrade, and the interactive `?` key. GPU selection now goes through `CUDA_VISIBLE_DEVICES`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
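The free-GPU-memory cap on worker count is simple integer arithmetic. A minimal sketch, assuming a ~2 GiB per-worker budget as described above (the helper names `worker_cap` and `njobs` are hypothetical, not CUDA.jl's actual code):

```julia
# Hypothetical sketch: cap test workers by free GPU memory (~2 GiB each),
# then take the minimum against PTR's CPU/RAM-based default.
const PER_WORKER_BYTES = 2 * 2^30  # ~2 GiB per worker (figure from the PR text)

worker_cap(free_bytes; per_worker = PER_WORKER_BYTES) =
    max(1, free_bytes ÷ per_worker)

# In the real runner, `free_bytes` would come from the driver
# (e.g. CUDA.available_memory()) and `cpu_default` from PTR.
njobs(cpu_default, free_bytes) = min(cpu_default, worker_cap(free_bytes))
```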
Member
Author
51/41/37 mins here vs 42/34/33 on master for Julia 1.11/1.12/1.13. And this surprisingly looks like the actual tests slowing down, e.g. on Julia 1.12.
Contributor
Try with
Member
Author
Good idea. It's probably related to my tuning of the memory pool heuristics though, and not because of PTR.
A compute-sanitizer-wrapped worker starts by printing its banner
('========= COMPUTE-SANITIZER') to stdout, which collides with Malt's
port handshake (the first stdout line must be parseable as a UInt16 port
number). Passing `--log-file=<dir>/%p.log` redirects sanitizer text to a
per-process file, leaving the worker's stdout clean for Malt.
After `runtests` returns (or throws), scan the directory and surface any
logs missing the "ERROR SUMMARY: 0 errors" line; emit a colored
one-liner summary otherwise. This preserves the signal while keeping
clean runs quiet.
Also silence the `Pkg.activate`/`Pkg.add` chatter during CUDA_SDK_jll
install (`io = devnull`) — the only output we want is the sanitizer
version banner we explicitly print.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
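The two steps described above (per-process log files, then a post-run scan for non-clean summaries) can be sketched as follows. This is a hedged illustration: the `--log-file`/`%p` spelling follows the commit message, while the surrounding plumbing and `scan_sanitizer_logs` helper are assumptions.

```julia
# Redirect sanitizer output to per-process files so each worker's stdout
# stays clean for Malt's port handshake.
logdir = mktempdir()
sanitizer = `compute-sanitizer --log-file=$(logdir)/%p.log`
# The wrapped command would then be passed as `runtests`'s `exename` kwarg.

# After the run: surface any log that doesn't report a clean summary.
function scan_sanitizer_logs(dir)
    dirty = String[]
    for log in filter(endswith(".log"), readdir(dir; join = true))
        if !occursin("ERROR SUMMARY: 0 errors", read(log, String))
            push!(dirty, log)
        end
    end
    return dirty
end
```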
Workers run many tests back-to-back, and pool-cached buffers stay resident because the release threshold is unbounded and the idle pool-cleanup task only runs when `isinteractive()`. Calling `CUDA.reclaim()` after the post-test GC trims the pool and empties library handle caches, reducing GPU RSS accumulation without invalidating compiled kernels. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Match ParallelTestRunner's new composition pattern for `AbstractTestRecord`: carry a `base::TestRecord` field and delegate Julia-timed execution to `ParallelTestRunner.execute(TestRecord, …)` instead of redeclaring every baseline field and re-implementing the non-CUDA timing path inline. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
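The composition pattern reads roughly like this PTR-free sketch (all type and function names here are illustrative stand-ins, not ParallelTestRunner's actual API):

```julia
# Generic illustration of delegation-by-composition for test records.
abstract type AbstractRecord end

struct BaseRecord <: AbstractRecord
    time::Float64
end
execute(::Type{BaseRecord}, f) = BaseRecord(@elapsed f())

# The specialized record carries the base record as a field instead of
# redeclaring every baseline field...
struct GPURecord <: AbstractRecord
    base::BaseRecord
    gpu_bytes::Int
end
function execute(::Type{GPURecord}, f)
    # ...and delegates the baseline timing path to the parent implementation.
    base = execute(BaseRecord, f)
    # In the real code the extra field would come from CUDA.@timed; stubbed here.
    GPURecord(base, 0)
end
```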
Per-test GPU RSS data (collected after adding post-test `CUDA.reclaim()`) showed a handful of tests blowing well past the 4 GiB per-worker budget:

- `test/core/array.jl`: the 512^3 `sum!()` case allocated ~1 GiB of Float64 to exercise the big-mapreduce path; (85, 1320, 100) already exercises the same serial kernel path. Drop the 512^3 case.
- `test/core/sorting.jl`: the "large sizes" quicksort input at 2^25 Float32 was 128 MiB; 2^22 still exercises the multi-block quicksort path.
- `examples/peakflops.jl`: the default n=5000 built four 5000x5000 Float32 matrices (~400 MiB); n=1024 is enough to demonstrate the example.
- `lib/cutensornet/test/contractions.jl`: max_ws_size=2^32 (a 4 GiB workspace hint) was inflating cuTensorNet to ~1.5 GiB; 2^28 covers the same tuning paths.

Library tests (cusolver/cusparse/cudnn/cutensor/etc.) still sit at 1-2 GiB due to persistent library workspace that's not pool-allocated and therefore not released by `CUDA.reclaim()` between tests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
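As a quick sanity check, the sizes quoted above follow directly from element counts times element sizes:

```julia
# Verifying the memory figures quoted in the commit message.
gib = 2^30
mib = 2^20

@assert 512^3 * sizeof(Float64) == 1 * gib       # dropped sum!() input: 1 GiB
@assert 2^25  * sizeof(Float32) == 128 * mib     # old quicksort input: 128 MiB
@assert 2^22  * sizeof(Float32) == 16 * mib      # new quicksort input: 16 MiB

# peakflops: four 5000x5000 Float32 matrices, "~400 MiB"
peakflops_bytes = 4 * 5000^2 * sizeof(Float32)
@assert 380 * mib < peakflops_bytes < 400 * mib
```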
The `@test_throws UndefRefError current_context()` / `current_device()`
assertions at the top of initialization.jl require that CUDA hasn't been
touched yet in the current Julia process. With PTR, every worker runs
`setup.jl` as `init_worker_code`, and that already does
`CUDA.functional(true)` / `precompile_runtime` / pool config — so the
worker is never in a fresh state by the time the test runs, and these
assertions fail ("Expected: UndefRefError, No exception thrown").
Run those four assertions (and the paired "now cause initialization"
check) in a subprocess instead, the same way the issue-1331 test at the
bottom of the file already does. The rest of initialization.jl doesn't
depend on fresh state and runs fine on a normal worker.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
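The subprocess pattern above can be sketched without CUDA at all. The script body below is a stand-in: the real test asserts `UndefRefError` from `current_context()`/`current_device()` in the child process.

```julia
# Spawn a pristine Julia process for assertions that require fresh state;
# the parent worker's prior initialization can't leak into it.
script = """
# exits nonzero if this process somehow inherited state from the parent
exit(isdefined(Main, :some_initialized_flag) ? 1 : 0)
"""
cmd = `$(Base.julia_cmd()) --startup-file=no -e $script`
ok = success(cmd)   # true: the child saw a fresh Main
```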
The top-band GPU label spanned 30 columns but the bottom-band GPU cells (GC/Alloc/RSS) sum to 33, shifting every pipe after the GPU section three columns left. Widen the dashes (12 + 13) to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Contributor
CUDA.jl Benchmarks
| Benchmark suite | Current: 63a68c4 | Previous: 9702041 | Ratio |
|---|---|---|---|
| array/accumulate/Float32/1d | 101140 ns | 101336 ns | 1.00 |
| array/accumulate/Float32/dims=1 | 76267 ns | 76569 ns | 1.00 |
| array/accumulate/Float32/dims=1L | 1585826 ns | 1585630.5 ns | 1.00 |
| array/accumulate/Float32/dims=2 | 143428 ns | 143872.5 ns | 1.00 |
| array/accumulate/Float32/dims=2L | 657493.5 ns | 657817.5 ns | 1.00 |
| array/accumulate/Int64/1d | 118646 ns | 118546 ns | 1.00 |
| array/accumulate/Int64/dims=1 | 80280 ns | 80328 ns | 1.00 |
| array/accumulate/Int64/dims=1L | 1706680.5 ns | 1694780.5 ns | 1.01 |
| array/accumulate/Int64/dims=2 | 157247.5 ns | 156646 ns | 1.00 |
| array/accumulate/Int64/dims=2L | 962003 ns | 962572 ns | 1.00 |
| array/broadcast | 20670.5 ns | 20507 ns | 1.01 |
| array/construct | 1264.5 ns | 1260.6 ns | 1.00 |
| array/copy | 17872 ns | 18042.5 ns | 0.99 |
| array/copyto!/cpu_to_gpu | 214938 ns | 216139.5 ns | 0.99 |
| array/copyto!/gpu_to_cpu | 283434 ns | 282550 ns | 1.00 |
| array/copyto!/gpu_to_gpu | 10684 ns | 10770 ns | 0.99 |
| array/iteration/findall/bool | 134489 ns | 134891 ns | 1.00 |
| array/iteration/findall/int | 149743 ns | 150607 ns | 0.99 |
| array/iteration/findfirst/bool | 81215 ns | 81621 ns | 1.00 |
| array/iteration/findfirst/int | 83429.5 ns | 83931 ns | 0.99 |
| array/iteration/findmin/1d | 88136.5 ns | 88319.5 ns | 1.00 |
| array/iteration/findmin/2d | 117332.5 ns | 116740 ns | 1.01 |
| array/iteration/logical | 200219.5 ns | 199127 ns | 1.01 |
| array/iteration/scalar | 67096 ns | 69801 ns | 0.96 |
| array/permutedims/2d | 52173.5 ns | 51913 ns | 1.01 |
| array/permutedims/3d | 52747 ns | 52967 ns | 1.00 |
| array/permutedims/4d | 51373 ns | 51865.5 ns | 0.99 |
| array/random/rand/Float32 | 12818 ns | 12969 ns | 0.99 |
| array/random/rand/Int64 | 24941 ns | 24834 ns | 1.00 |
| array/random/rand!/Float32 | 8318.666666666666 ns | 8996.666666666666 ns | 0.92 |
| array/random/rand!/Int64 | 21893 ns | 21694 ns | 1.01 |
| array/random/randn/Float32 | 37834 ns | 37552.5 ns | 1.01 |
| array/random/randn!/Float32 | 30772 ns | 30840 ns | 1.00 |
| array/reductions/mapreduce/Float32/1d | 34319 ns | 35074 ns | 0.98 |
| array/reductions/mapreduce/Float32/dims=1 | 40492 ns | 40645 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=1L | 51236 ns | 51243.5 ns | 1.00 |
| array/reductions/mapreduce/Float32/dims=2 | 56197 ns | 56526 ns | 0.99 |
| array/reductions/mapreduce/Float32/dims=2L | 69138.5 ns | 69833 ns | 0.99 |
| array/reductions/mapreduce/Int64/1d | 42165.5 ns | 43336 ns | 0.97 |
| array/reductions/mapreduce/Int64/dims=1 | 42715 ns | 42698 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=1L | 87109 ns | 87131 ns | 1.00 |
| array/reductions/mapreduce/Int64/dims=2 | 59317 ns | 59771 ns | 0.99 |
| array/reductions/mapreduce/Int64/dims=2L | 84317 ns | 84773.5 ns | 0.99 |
| array/reductions/reduce/Float32/1d | 34432 ns | 34857.5 ns | 0.99 |
| array/reductions/reduce/Float32/dims=1 | 40092 ns | 39525 ns | 1.01 |
| array/reductions/reduce/Float32/dims=1L | 51381 ns | 51365 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2 | 56802 ns | 56840 ns | 1.00 |
| array/reductions/reduce/Float32/dims=2L | 69698 ns | 69873 ns | 1.00 |
| array/reductions/reduce/Int64/1d | 42618 ns | 43072 ns | 0.99 |
| array/reductions/reduce/Int64/dims=1 | 43391.5 ns | 42870 ns | 1.01 |
| array/reductions/reduce/Int64/dims=1L | 87117 ns | 87150 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2 | 59838 ns | 59657.5 ns | 1.00 |
| array/reductions/reduce/Int64/dims=2L | 84509 ns | 84754 ns | 1.00 |
| array/reverse/1d | 17975 ns | 17920 ns | 1.00 |
| array/reverse/1dL | 68561 ns | 68474 ns | 1.00 |
| array/reverse/1dL_inplace | 65693 ns | 65769 ns | 1.00 |
| array/reverse/1d_inplace | 10259 ns | 10329.333333333334 ns | 0.99 |
| array/reverse/2d | 20756 ns | 20545 ns | 1.01 |
| array/reverse/2dL | 72734 ns | 72599 ns | 1.00 |
| array/reverse/2dL_inplace | 65813 ns | 65843 ns | 1.00 |
| array/reverse/2d_inplace | 9934 ns | 9925 ns | 1.00 |
| array/sorting/1d | 2734738 ns | 2735157 ns | 1.00 |
| array/sorting/2d | 1068707 ns | 1068027 ns | 1.00 |
| array/sorting/by | 3304798 ns | 3304860 ns | 1.00 |
| cuda/synchronization/context/auto | 1158.2 ns | 1186.5 ns | 0.98 |
| cuda/synchronization/context/blocking | 941.1315789473684 ns | 933.8571428571429 ns | 1.01 |
| cuda/synchronization/context/nonblocking | 8496 ns | 7202.1 ns | 1.18 |
| cuda/synchronization/stream/auto | 1027.875 ns | 1047.75 ns | 0.98 |
| cuda/synchronization/stream/blocking | 835.5714285714286 ns | 845.7979797979798 ns | 0.99 |
| cuda/synchronization/stream/nonblocking | 7455 ns | 7400.299999999999 ns | 1.01 |
| integration/byval/reference | 143779 ns | 143847 ns | 1.00 |
| integration/byval/slices=1 | 145974 ns | 145799 ns | 1.00 |
| integration/byval/slices=2 | 284791 ns | 284592 ns | 1.00 |
| integration/byval/slices=3 | 423503.5 ns | 423121 ns | 1.00 |
| integration/cudadevrt | 102569 ns | 102350 ns | 1.00 |
| integration/volumerhs | 23499620 ns | 23414128.5 ns | 1.00 |
| kernel/indexing | 13265 ns | 13228 ns | 1.00 |
| kernel/indexing_checked | 14047 ns | 14089 ns | 1.00 |
| kernel/launch | 2127.8888888888887 ns | 2207.5555555555557 ns | 0.96 |
| kernel/occupancy | 722.0289855072464 ns | 663.5094339622641 ns | 1.09 |
| kernel/rand | 16687 ns | 14119 ns | 1.18 |
| latency/import | 3837952002 ns | 3828719288.5 ns | 1.00 |
| latency/precompile | 4578362120 ns | 4584717609 ns | 1.00 |
| latency/ttfp | 4425030626 ns | 4415845715.5 ns | 1.00 |
This comment was automatically generated by workflow using github-action-benchmark.
[only special]
Member
Author
The only remaining issue is the multigpu one. I'm looking into this separately, but I think we can merge this already.
Contributor
What's the issue specifically?
Member
Author
Test failure due to a driver bug (presumably). It's unrelated to PTR, but PTR does seem to make it occur more often.
Needs JuliaTesting/ParallelTestRunner.jl#129