LST: add LSTGeometry package and associated ESProducer by ariostas · Pull Request #50679 · cms-sw/cmssw

ariostas · 2026-04-07T15:58:50Z

This PR adds a new RecoTracker/LSTGeometry package containing the module map computation used by the LST algorithm. Currently, the maps are pre-computed by the code in https://github.com/SegmentLinking/LSTGeometry and they are stored in https://github.com/cms-data/RecoTracker-LSTCore. This PR allows for the on-the-fly computation of these maps via an ESProducer, ensuring that they stay consistent with the tracker geometry being used.

This is the last major task in #46746.

c.c. @slava77

cmsbuild · 2026-04-07T15:59:20Z

cms-bot internal usage

cmsbuild · 2026-04-07T16:01:25Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/48907

There are other open Pull requests which might conflict with changes you have proposed:
- File HLTrigger/Configuration/python/HLT_75e33_cff.py modified in PR(s): Phase2 Single_Tau_Trigger Path Added #49637, TICL: Consolidate v5 as Default Configuration and Cleanup Legacy Code #49932
- File HLTrigger/Configuration/python/HLT_75e33_timing_cff.py modified in PR(s): Phase2 Single_Tau_Trigger Path Added #49637, TICL: Consolidate v5 as Default Configuration and Cleanup Legacy Code #49932
- File HLTrigger/Configuration/python/HLT_NGTScouting_cff.py modified in PR(s): TICL: Consolidate v5 as Default Configuration and Cleanup Legacy Code #49932

cmsbuild · 2026-04-07T16:01:48Z

A new Pull Request was created by @ariostas for master.

It involves the following packages:

HLTrigger/Configuration (hlt)
RecoTracker/IterativeTracking (reconstruction)
RecoTracker/LST (reconstruction)
RecoTracker/LSTCore (reconstruction)
RecoTracker/LSTGeometry (****)

The following packages do not have a category, yet:

RecoTracker/LSTGeometry
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

mmusich · 2026-04-07T16:15:20Z

test parameters:

enable = hlt_p2_integration, hlt_p2_timing
workflows = ph2_hlt

mmusich · 2026-04-07T16:15:27Z

@cmsbuild, please test

cmsbuild · 2026-04-07T19:02:25Z

-1

Failed Tests: UnitTests HLTP2Timing
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52513/summary.html
COMMIT: e612f24
CMSSW: CMSSW_17_0_X_2026-04-07-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52513/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

Comparison Summary

Summary:

You potentially removed 1 lines from the logs
Reco comparison results: 0 differences found in the comparisons
DQMHistoTests: Total files compared: 68
DQMHistoTests: Total histograms compared: 4795858
DQMHistoTests: Total failures: 6232
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4789606
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 67 files compared)
Checked 282 log files, 243 edm output root files, 68 DQM output files
TriggerResults: no differences found
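As a quick sanity check of the DQMHistoTests numbers above, the sketch below assumes (this is not stated by the bot) that failures, nulls, successes and skipped histograms partition the total, and derives the failure fraction from them.

```python
# Counts copied from the DQMHistoTests summary above.
summary = {
    "compared": 4795858,
    "failures": 6232,
    "nulls": 0,
    "successes": 4789606,
    "skipped": 20,
}

# Assumption: the four categories are mutually exclusive and cover every histogram.
accounted = (
    summary["failures"] + summary["nulls"] + summary["successes"] + summary["skipped"]
)
print(accounted == summary["compared"])  # True

# Fraction of compared histograms that differ from the baseline.
print(f"failure fraction: {summary['failures'] / summary['compared']:.4%}")
```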

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 17 workflow step(s) with memory usage exceeding the error threshold:

Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 191.9 exceeds +/- 90.0 MiB
Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.752_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5 step2 max memory diff 189.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 191.8 exceeds +/- 90.0 MiB
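The list above is easier to compare across test rounds once it is parsed. The following is a minimal sketch that assumes the line format is exactly as quoted in this comment; the function name `parse_memory_errors` is made up for illustration and is not part of cms-bot.

```python
import re

# Pattern for the cms-bot lines quoted above (format inferred from this comment only).
LINE_RE = re.compile(
    r"Workflow (?P<wf>\S+) (?P<step>step\d+) max memory diff (?P<diff>-?\d+(?:\.\d+)?) "
    r"exceeds \+/- (?P<limit>\d+(?:\.\d+)?) MiB"
)

def parse_memory_errors(text):
    """Return (workflow, step, diff_MiB, limit_MiB) tuples for every matching line."""
    return [
        (m["wf"], m["step"], float(m["diff"]), float(m["limit"]))
        for m in LINE_RE.finditer(text)
    ]

sample = """\
Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
"""

for wf, step, diff, limit in parse_memory_errors(sample):
    # Express the increase as a multiple of the allowed +/- band.
    print(f"{wf} {step}: {diff:+.1f} MiB ({diff / limit:.2f}x the allowed band)")
```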

makortel · 2026-04-07T19:14:17Z

Is ~190 MB increase in memory usage expected?

ariostas · 2026-04-07T19:36:19Z

> Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary. Most of it is freed once the maps are constructed.

makortel · 2026-04-07T19:43:25Z

> > Is ~190 MB increase in memory usage expected?
>
> That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary. Most of it is freed once the maps are constructed.

According to the monitoring, the peak memory usage would increase by ~190 MB, so freeing the memory afterwards doesn't help much if the job is killed for going over the limit.
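The distinction between peak and steady-state memory can be illustrated outside CMSSW. The sketch below is not the LST code; `build_maps` is a hypothetical stand-in that builds a large temporary working set and keeps only a small result, measured with Python's standard `tracemalloc` module.

```python
import tracemalloc

def build_maps(n_temp=200_000, n_keep=1_000):
    """Hypothetical stand-in: large temporary working set, small final result."""
    temp = [float(i) for i in range(n_temp)]  # scratch data, dropped on return
    return temp[:n_keep]                      # small product that stays resident

tracemalloc.start()
maps = build_maps()
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

# The temporaries are gone by now, but the peak still records them.
print(f"resident after construction: {current / 1024:.0f} KiB")
print(f"peak during construction:    {peak / 1024:.0f} KiB")
```

A job limit that is enforced on the peak sees the second number, regardless of how small the first one is.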

makortel · 2026-04-07T19:44:30Z

test parameters:

workflows_profiling = 34434.0
enable_tests = profiling

makortel · 2026-04-07T19:44:46Z

@cmsbuild, please test

Maybe one round of profiling tests would be worth it.

makortel · 2026-04-29T19:15:26Z

> Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

> Commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So actually keeping the module from being called requires that no module says it consumes it.

Right. This behavior is visible in the Tracer log as well:

++++++++++++ starting: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
<cut>
++++++++++++ finished: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
++++++++++ finished: prefetching for esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ starting: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ finished: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord

So when you are "simply commenting out this line", hltLSTGeometry can't be run, and this does not result in an error because the only consumer never actually accesses the data, since these lines are commented out:
https://github.com/SegmentLinking/cmssw/blob/a9ab18292aa3f5a4b0774aecec84d628f17a544a/RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc#L40-L42
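The prefetching behavior described above can be mimicked with a toy model. This is only a sketch of the idea and not the CMSSW framework API: `ToyESProducer` and `ToyFramework` are invented for illustration, and only the module labels are taken from the Tracer log.

```python
class ToyESProducer:
    """Hypothetical producer: declares what it consumes and may or may not read it."""
    def __init__(self, label, consumes=(), reads=()):
        self.label = label
        self.consumes = tuple(consumes)  # declared dependencies
        self.reads = tuple(reads)        # what produce() actually accesses

class ToyFramework:
    def __init__(self, producers):
        self.producers = {p.label: p for p in producers}
        self.ran = []  # labels in the order their produce() ran

    def run(self, label):
        if label in self.ran:
            return
        producer = self.producers[label]
        # Prefetch: every *declared* dependency runs first, used or not.
        for dep in producer.consumes:
            self.run(dep)
        # produce() may only read what was declared and therefore prefetched.
        for dep in producer.reads:
            assert dep in self.ran
        self.ran.append(label)

geometry = ToyESProducer("hltLSTGeometry")
# Declares the dependency but, as in the commented-out case, never reads it.
modules = ToyESProducer("hltESPModulesDevLST", consumes=["hltLSTGeometry"], reads=[])

fw = ToyFramework([geometry, modules])
fw.run("hltESPModulesDevLST")
print(fw.ran)  # ['hltLSTGeometry', 'hltESPModulesDevLST']

# Without the declaration, the geometry producer is never called.
fw2 = ToyFramework([geometry, ToyESProducer("hltESPModulesDevLST")])
fw2.run("hltESPModulesDevLST")
print(fw2.ran)  # ['hltESPModulesDevLST']
```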

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

makortel · 2026-04-29T19:19:53Z

> This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

The Tracer log shows only LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' consuming the data product of hltLSTGeometry (and from the code only the host data product is consumed). The log also shows that only one produce call is made on hltLSTGeometry (i.e. no sign of implicit host-to-device copy; well, there can't be because LSTGeometryESProducer is not an Alpaka module).

makortel · 2026-04-29T19:31:37Z

> > Does the behavior of excessive memory usage reproduce on 1 thread/stream?
>
> No, for 1 thread/stream everything looks normal.

If 1 thread/stream shows "good behavior", I'm wondering if the caching allocator could play a role. The allocator is shared, and if some modules allocate concurrently large temporary buffers, those buffers might end up being held by the caching allocator without being used later in the job. On 1 thread these temporary buffers would be allocated and deallocated serially, and the same large buffer could be used by multiple modules.

But this is, of course, pure speculation, and does not explain the role of the existence of hltLSTGeometry in the GPU memory usage.
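To make the hypothesis concrete, here is a toy model of a caching allocator. It is a deliberate simplification and not the `cms::alpakatools::CachingAllocator` implementation (which, among other things, bins block sizes): freed blocks are kept for reuse and never handed back to the device.

```python
class ToyCachingAllocator:
    """Toy model: freed blocks are cached for reuse and never returned to the device."""
    def __init__(self):
        self.cached = []       # sizes of free blocks held in the cache
        self.device_bytes = 0  # total ever requested from the device

    def allocate(self, size):
        for i, block in enumerate(self.cached):
            if block >= size:          # reuse any cached block that is large enough
                return self.cached.pop(i)
        self.device_bytes += size      # cache miss: a real device allocation
        return size

    def free(self, block):
        self.cached.append(block)      # keep the block instead of releasing it

def run_serial(n_modules, size):
    alloc = ToyCachingAllocator()
    for _ in range(n_modules):
        block = alloc.allocate(size)   # each module allocates and frees before the next starts
        alloc.free(block)
    return alloc.device_bytes

def run_concurrent(n_modules, size):
    alloc = ToyCachingAllocator()
    blocks = [alloc.allocate(size) for _ in range(n_modules)]  # all temporaries live at once
    for block in blocks:
        alloc.free(block)
    return alloc.device_bytes

print(run_serial(4, 100))      # 100
print(run_concurrent(4, 100))  # 400
```

In this model the serial case reuses one buffer, while the concurrent case leaves four buffers parked in the cache after all modules have finished.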

makortel · 2026-04-29T19:45:23Z

The CachingAllocator hypothesis could be investigated further by comparing the behavior between 1-thread and many-thread cases (on a few events).

The debug prints of the CachingAllocator can be enabled with

if not hasattr(process, "AlpakaServiceCudaAsync"):
    process.load("HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi")
# Set outside the if so that verbose output is enabled even when the service was already loaded
process.AlpakaServiceCudaAsync.verbose = True

A crude way to see the functions that lead to actual memory allocations would be

cmsTraceFunction "cms::alpakatools::CachingAllocator<alpaka::DevCudaRt, alpaka::QueueCudaRtNonBlocking>::allocateBuffer" cmsRun ...

(I'm not 100 % sure I got the CachingAllocator template instantiation right, possibly tracing calls to just cudaMalloc might also do the trick)

ariostas · 2026-04-29T19:53:31Z

> The CachingAllocator hypothesis could be investigated further...

I'm currently recompiling everything after adding

<flags CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"/>

to all the LST build files. I'll see what happens and try using the debug prints. Thanks!

ariostas · 2026-04-29T20:29:41Z

Okay, so with the caching allocator disabled the big spike is still there. So it's not actually the caching allocator itself; something is briefly allocating a big chunk of memory.

I'll try tracing CachingAllocator/cudaMalloc calls to see if I can pinpoint what's happening.

ariostas · 2026-05-01T16:33:52Z

I couldn't get cmsTraceFunction to work. Not sure why, but it didn't trace any function that I tried.

By using gdb directly, I looked at calls to cudaMalloc and I see that there is no single giant allocation, but rather there are many more allocations than before (more than double).

I'm going to be on vacation for the next 2 weeks. But I'll keep looking into this when I get back.

slava77 · 2026-05-01T16:58:54Z

can it be something to do with the number of queues (and subsequently some extra allocations coming per queue)?
How is the number of queues defined: can it vary randomly (presumably repeatable but varying from unrelated changes)?

mmusich · 2026-05-07T03:23:17Z

In addition to the problems already discussed, now this branch has conflicts that must be resolved. @ariostas

slava77 · 2026-05-15T18:20:18Z

> In addition to the problems already discussed, now this branch has conflicts that must be resolved. @ariostas

IIUC Andres was away, expected to be back next week

Parsifal-2045 · 2026-05-22T09:03:04Z

> Also, I have tried to profile it with nsys, but it gets stuck when I try to use more than 1 stream.

Since profiling is still ongoing: this problem was surfaced in #50870 and appears to be a weird clash between CMSSW's jemalloc and nsys. You should be able to run any profile by changing the launch command from cmsRun to cmsRunGlibC.

ariostas · 2026-05-22T12:55:06Z

> In addition to the problems already discussed, now this branch has conflicts that must be resolved.

Rebased to fix conflicts. I'll get back to looking into this.

> Since profiling is still ongoing: this problem was surfaced in #50870 and appears to be a weird clash between CMSSW's jemalloc and nsys. You should be able to run any profile by changing the launch command from cmsRun to cmsRunGlibC.

Thank you! I'll see what I can learn from the nsys profile

cmsbuild · 2026-05-22T12:55:33Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/49455

There are other open Pull requests which might conflict with changes you have proposed:
- File HLTrigger/Configuration/python/HLT_75e33_cff.py modified in PR(s): Phase2 Single_Tau_Trigger Path Added #49637, Set up HGCAL GPU vs CPU DQM #50974
- File HLTrigger/Configuration/python/HLT_75e33_timing_cff.py modified in PR(s): Phase2 Single_Tau_Trigger Path Added #49637

cmsbuild · 2026-05-22T12:56:03Z

Pull request #50679 was updated. @Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please check and sign again.

cmsbuild added this to the CMSSW_17_0_X milestone Apr 7, 2026

cmsbuild added reconstruction-pending hlt-pending pending-signatures tests-pending orp-pending new-package-pending code-checks-pending tracking labels Apr 7, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels Apr 7, 2026

ariostas mentioned this pull request Apr 7, 2026

Add package RecoTracker/LSTGeometry to reconstruction cms-sw/cms-bot#2716

Merged

cmsbuild added tests-started and removed tests-pending labels Apr 7, 2026

cmsbuild added tests-rejected and removed tests-started labels Apr 7, 2026

makortel reviewed Apr 7, 2026

Comment thread RecoTracker/LSTGeometry/test/dumpLSTGeometry.py Outdated

cmsbuild added tests-started and removed tests-rejected labels Apr 7, 2026

cmsbuild mentioned this pull request May 1, 2026

LST: Remove Matrix Caps for MDs using Precompute #50856

Merged

cmsbuild mentioned this pull request May 12, 2026

Add reduced memory runtime toggle for LST #50925

Merged

cmsbuild mentioned this pull request May 19, 2026

Set up HGCAL GPU vs CPU DQM #50974

Open

ariostas added 4 commits May 22, 2026 05:39

Added LSTGeometry package and associated ESProducer

f954f8e

Reduce memory usage

8b385b9

Switched to argparse and other minor tweaks

c399d2f

Tighten module maps

46de12c

ariostas force-pushed the ariostas/lst_geometry branch from 4e3de58 to 46de12c Compare May 22, 2026 12:52

cmsbuild added tests-pending code-checks-pending and removed tests-rejected requires-external code-checks-approved labels May 22, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels May 22, 2026

cmsbuild mentioned this pull request May 22, 2026

LST: Merge T5s After Building #51021

Open

8 participants
