Merged
Changes from 4 commits
Binary file added docs/user/images/nsys_kernel_profile_example.png
2 changes: 2 additions & 0 deletions docs/user/images/nsys_kernel_profile_example.png.license
@@ -0,0 +1,2 @@
SPDX-FileCopyrightText: 2026 CERN
SPDX-License-Identifier: CC-BY-4.0
Binary file added docs/user/images/nsys_species_pies_example.png
2 changes: 2 additions & 0 deletions docs/user/images/nsys_species_pies_example.png.license
@@ -0,0 +1,2 @@
SPDX-FileCopyrightText: 2026 CERN
SPDX-License-Identifier: CC-BY-4.0
69 changes: 68 additions & 1 deletion docs/user/profiling.md
@@ -52,7 +52,7 @@ Run `nsys` with the CUDA profiler API capture range:
```console
nsys profile --capture-range=cudaProfilerApi --capture-range-end=stop \
--trace=cuda,nvtx --sample=none --cpuctxsw=none \
- --stats=true --force-overwrite=true \
+ --stats=true --export=sqlite --force-overwrite=true \
--output adept_transport_profile \
<application command>
```
@@ -63,6 +63,73 @@ Open the report with:
nsys-ui adept_transport_profile.nsys-rep
```

### Plotting AdePT kernel profiles

The exported SQLite file can be summarized with the AdePT profile plotting
script:

```console
python3 scripts/plot_adept_nsys_profile.py \
--sqlite adept_transport_profile.sqlite \
--output-prefix adept_transport_profile \
--title "AdePT"
```

This writes:

- `adept_transport_profile_kernel_profile.png`: total CUDA kernel times,
transport shares by particle type, the waited kernel category that most often
reaches the end of an AdePT transport iteration, and the corresponding
critical-path margin in milliseconds.
- `adept_transport_profile_species_pies.png`: per-species pie charts showing
which split kernels dominate electron, positron, and gamma transport.
- `adept_transport_profile.txt` plus CSV files with the numeric summaries.
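
For readers who want to inspect the exported SQLite file directly, the following is a minimal sketch of how kernel records could be read with Python's standard `sqlite3` module. It assumes the export contains a `CUPTI_ACTIVITY_KIND_KERNEL` table whose `shortName` column refers to `StringIds.id`; table and column names can differ between `nsys` versions, so check the schema of your own file first. The kernel names and the in-memory stand-in database are hypothetical, and this is not necessarily how `plot_adept_nsys_profile.py` reads the data.

```python
import sqlite3


def load_kernels(conn):
    """Return (name, start_ns, end_ns, stream_id) tuples, ordered by start time.

    Assumes the nsys SQLite layout described above; verify it against your export.
    """
    query = """
        SELECT s.value, k.start, k."end", k.streamId
        FROM CUPTI_ACTIVITY_KIND_KERNEL AS k
        JOIN StringIds AS s ON s.id = k.shortName
        ORDER BY k.start
    """
    return conn.execute(query).fetchall()


# Stand-in database with the assumed schema and hypothetical kernel names.
# For a real profile, use sqlite3.connect("adept_transport_profile.sqlite").
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE StringIds (id INTEGER PRIMARY KEY, value TEXT);
    CREATE TABLE CUPTI_ACTIVITY_KIND_KERNEL (
        start INTEGER, "end" INTEGER, streamId INTEGER, shortName INTEGER
    );
    INSERT INTO StringIds VALUES (1, 'ElectronKernel'), (2, 'GammaKernel');
    INSERT INTO CUPTI_ACTIVITY_KIND_KERNEL VALUES (100, 400, 7, 1), (150, 250, 8, 2);
""")

kernels = load_kernels(conn)
for name, start_ns, end_ns, stream_id in kernels:
    print(f"{name}: {end_ns - start_ns} ns on stream {stream_id}")
```

The column name `end` is quoted because it is also an SQL keyword.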

The percentages labelled as "all kernels" are normalized to all CUDA kernels in
the captured range, including injection, population statistics, and bookkeeping.
The percentages labelled as "transport" are normalized only to electron,
positron, and gamma transport kernels.
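
The two normalizations can be illustrated with a small sketch. The category names and millisecond totals below are made up for illustration, and the function is a hypothetical helper rather than code from the plotting script.

```python
def kernel_shares(kernel_ms, transport_categories):
    """Map each category to (percent of all kernels, percent of transport).

    The transport percentage is None for categories that are not
    electron, positron, or gamma transport.
    """
    total_all = sum(kernel_ms.values())
    total_transport = sum(
        ms for category, ms in kernel_ms.items() if category in transport_categories
    )
    shares = {}
    for category, ms in kernel_ms.items():
        pct_all = 100.0 * ms / total_all
        if category in transport_categories:
            pct_transport = 100.0 * ms / total_transport
        else:
            pct_transport = None
        shares[category] = (pct_all, pct_transport)
    return shares


# Illustrative totals only (not measured data).
kernel_ms = {
    "electron": 50.0,
    "positron": 10.0,
    "gamma": 20.0,
    "InitTracks": 15.0,
    "bookkeeping": 5.0,
}
shares = kernel_shares(kernel_ms, {"electron", "positron", "gamma"})
print(shares["gamma"])
```

With these numbers, gamma accounts for 20% of all kernel time but 25% of transport kernel time, because the transport denominator leaves out `InitTracks` and bookkeeping.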

The limiter plots identify a limiting category for each transport iteration: the
category of the latest-ending CUDA kernel, other than `FinishIteration` itself,
that `FinishIteration` waits for. `InitTracks` kernels are excluded from this
limiter view because they run on a separate injection stream and
`FinishIteration` does not directly wait for them; they remain visible in the
total CUDA kernel-time view. The count view shows how often each waited category
ends last. The critical-margin view credits the limiting kernel only with the
time by which it extends the iteration beyond the end of the runner-up kernel,
capped by the limiting kernel's own duration.
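
The limiter rule for a single iteration can be sketched as follows. This is an interpretation of the description above, not the script's actual code. It takes the runner-up to be the second-latest-ending waited kernel regardless of its category, and credits a lone waited kernel with its full duration; both choices are assumptions. The timings are invented.

```python
EXCLUDED = {"FinishIteration", "InitTracks"}


def limiting_kernel(kernels):
    """Return (category, critical_margin_ms) for one transport iteration.

    kernels: list of (category, start_ms, end_ms) tuples.
    """
    waited = [k for k in kernels if k[0] not in EXCLUDED]
    # Sort by end time, latest first.
    ordered = sorted(waited, key=lambda k: k[2], reverse=True)
    category, start, end = ordered[0]
    duration = end - start
    if len(ordered) > 1:
        runner_up_end = ordered[1][2]
        # Credit only the overhang past the runner-up, capped by the duration.
        margin = min(end - runner_up_end, duration)
    else:
        margin = duration  # assumption: no runner-up, credit the full duration
    return category, margin


# Electron ends last, 0.5 ms after positron.
iteration_a = [
    ("electron", 0.0, 5.0),
    ("positron", 1.0, 4.5),
    ("gamma", 0.5, 3.0),
    ("InitTracks", 0.0, 6.0),  # ignored: runs on the injection stream
]
# Gamma ends 2.0 ms after electron but runs for only 1.0 ms, so the cap applies.
iteration_b = [("gamma", 4.0, 5.0), ("electron", 0.0, 3.0)]

result_a = limiting_kernel(iteration_a)
result_b = limiting_kernel(iteration_b)
print(result_a, result_b)
```

Counting how often each category is returned across iterations corresponds to the count view, and summing the margins per category corresponds to the critical-margin view.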

```{figure} images/nsys_kernel_profile_example.png
:name: fig-nsys-kernel-profile-example
:alt: Example AdePT Nsight Systems kernel profile summary plot.
:align: center
:width: 95%

Example kernel summary from an AdePT split-kernel `nsys` profile.
```

The example above was produced from a CMSSW ttbar simulation. It illustrates why
both the total kernel-time view and the limiter view are useful: gamma transport
accounts for about 25% of all CUDA kernel time and about 27% of transport kernel
time, but gamma kernels almost never determine the end of the transport
iteration in this profile. Electrons and positrons dominate the waited
last-kernel counts, and electrons alone account for most of the critical-path
margin. This means that reducing the gamma workload, for example with Russian
roulette, is not expected to improve the iteration wall time unless the gamma
kernels become the latest waited path. Conversely, `InitTracks` is visible as a
non-negligible total kernel-time bucket, but it is intentionally excluded from
the limiter view because it runs asynchronously on the injection stream.

```{figure} images/nsys_species_pies_example.png
:name: fig-nsys-species-pies-example
:alt: Example AdePT Nsight Systems per-species transport kernel pie charts.
:align: center
:width: 95%

Example per-species transport-kernel summary from the same kind of AdePT
split-kernel `nsys` profile.
```

## Nsight Compute

`ncu` is the tool for detailed kernel analysis: register count, achieved