Add reduced memory runtime toggle for LST #50925
|
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50925/49310
|
|
A new Pull Request was created by @GNiendorf for master. It involves the following packages:
@Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @srimanob can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
|
test parameters:
|
|
@cmsbuild, please test with cms-sw/cms-bot#2740 |
|
+1
Size: This PR adds an extra 80KB to repository
HLT P2 Timing: chart
Comparison Summary:
|
|
assign heterogeneous |
Do you have specific question(s) in mind? |
|
Sorry to bother, I was not fully sure about the use of the kernel in RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc within Alpaka. |
|
test parameters:
|
|
@cmsbuild, please test |
|
I'm not seeing anything obviously concerning (beyond the presumable code size and compilation time increase from the two instantiations of the kernel class templates). |
|
-1
Failed Tests: RelVals-AMD_MI300X
HLT P2 Timing: chart
Failed RelVals-AMD_MI300X
Comparison Summary:
AMD_W7900 Comparison Summary:
NVIDIA_H100 Comparison Summary:
NVIDIA_L40S Comparison Summary:
|
|
+1 |
|
+heterogeneous
Code changes look OK. The MI300X failure is a recurring problem and seems unrelated to these changes. |
|
ignore tests-rejected with ib-failure |
|
This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2) |
This PR adds a `reduceMemByFullPrecompute` runtime flag that enables exact buffer sizing for all LST objects (LS, T3, T5, T4) in each counting kernel, reducing average memory usage from ~80 MB to ~33 MB per event. When the flag is off (default), behavior is identical to master with negligible timing overhead, as the new kernel launches are gated behind host-side `if (reduceMemByFullPrecompute_)` checks and use templated kernel variants. The flag is exposed as `--reduce_mem_by_full_precompute` in standalone and as a `reduceMemByFullPrecompute` config parameter in the CMSSW EDProducer. Increases LST time/event by roughly 10-20% on CPU and GPU (depending on stream count, lower when running multiple streams) for a 60-70% reduction in total memory. Table below shows average and max decreases in memory per event over 100 events.

c.c @slava77
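For readers unfamiliar with the pattern, here is a minimal host-only C++17 sketch of the idea described above: a counting pass sizes the buffer exactly, the extra pass is gated behind a host-side `if`, and one templated body provides both variants. It is not the code in this PR and does not use Alpaka. `Candidate`, `passesCut`, `pairHits`, `buildCandidates`, and the `upperBound` parameter are hypothetical names made up for illustration; only the flag name `reduceMemByFullPrecompute` comes from the description.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in for an LST object (here a pair of hits); not the real LSTCore type.
struct Candidate {
  int inner;
  int outer;
};

// Hypothetical selection cut standing in for the real pairing criteria.
inline bool passesCut(int inner, int outer) { return (inner + outer) % 3 == 0; }

// One loop body, compiled twice: a count-only variant and a variant that also stores.
// Returns the number of pairs passing the cut, even if some did not fit in `out`.
template <bool kCountOnly>
std::size_t pairHits(const std::vector<int>& hits,
                     [[maybe_unused]] Candidate* out,
                     [[maybe_unused]] std::size_t capacity) {
  std::size_t n = 0;
  for (std::size_t i = 0; i < hits.size(); ++i) {
    for (std::size_t j = i + 1; j < hits.size(); ++j) {
      if (!passesCut(hits[i], hits[j]))
        continue;
      if constexpr (!kCountOnly) {
        // Storing variant: write only while there is room in the buffer.
        if (n < capacity)
          out[n] = Candidate{hits[i], hits[j]};
      }
      ++n;
    }
  }
  return n;
}

struct Result {
  std::vector<Candidate> buffer;  // allocated storage
  std::size_t stored;             // entries actually written
};

// Assumption: without the flag, the buffer is sized from a fixed upper bound.
Result buildCandidates(const std::vector<int>& hits,
                       bool reduceMemByFullPrecompute,
                       std::size_t upperBound) {
  std::size_t capacity = upperBound;
  if (reduceMemByFullPrecompute) {
    // Host-side gate: the extra counting pass runs only when the flag is on.
    capacity = pairHits<true>(hits, nullptr, 0);
  }
  Result r{std::vector<Candidate>(capacity), 0};
  const std::size_t found = pairHits<false>(hits, r.buffer.data(), capacity);
  r.stored = found < capacity ? found : capacity;
  return r;
}
```

Calling `buildCandidates(hits, false, 64)` allocates 64 slots regardless of how many pairs exist, while `buildCandidates(hits, true, 64)` allocates exactly as many slots as the counting pass finds. The cost is that the selection runs twice, which mirrors the time-for-memory trade-off quoted in the description.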