LST: add LSTGeometry package and associated ESProducer #50679

Open
ariostas wants to merge 4 commits into cms-sw:master from SegmentLinking:ariostas/lst_geometry

@ariostas
Contributor

@ariostas ariostas commented Apr 7, 2026

This PR adds a new RecoTracker/LSTGeometry package containing the module map computation used by the LST algorithm. Currently, the maps are pre-computed by the code in https://github.com/SegmentLinking/LSTGeometry and they are stored in https://github.com/cms-data/RecoTracker-LSTCore. This PR allows for the on-the-fly computation of these maps via an ESProducer, ensuring that they stay consistent with the tracker geometry being used.

This is the last major task in #46746.

c.c. @slava77

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/48907

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

A new Pull Request was created by @ariostas for master.

It involves the following packages:

  • HLTrigger/Configuration (hlt)
  • RecoTracker/IterativeTracking (reconstruction)
  • RecoTracker/LST (reconstruction)
  • RecoTracker/LSTCore (reconstruction)
  • RecoTracker/LSTGeometry (****)

The following packages do not have a category, yet:

RecoTracker/LSTGeometry
Please create a PR for https://github.com/cms-sw/cms-bot/blob/master/categories_map.py to assign category

@Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @dgulhan, @elusian, @felicepantaleo, @gpetruc, @missirol, @mmasciov, @mmusich, @mtosi, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Apr 7, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich
Contributor

mmusich commented Apr 7, 2026

@cmsbuild, please test

@cmsbuild
Contributor

cmsbuild commented Apr 7, 2026

-1

Failed Tests: UnitTests HLTP2Timing
Size: This PR adds an extra 104KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-7657dc/52513/summary.html
COMMIT: e612f24
CMSSW: CMSSW_17_0_X_2026-04-07-1100/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50679/52513/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test test-das-selected-lumis had ERRORS

Comparison Summary

Summary:

Max Memory Comparisons exceeding threshold

@cms-sw/core-l2 , I found 17 workflow step(s) with memory usage exceeding the error threshold:

  • Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.75_TTbar_14TeV+Run4D121_HLT75e33Timing step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7501_TTbar_14TeV+Run4D121_HLT75e33TrackingOnly step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7502_TTbar_14TeV+Run4D121_HLT75e33TrackingNtuple step2 max memory diff 191.9 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.751_TTbar_14TeV+Run4D121_HLT75e33TimingAlpaka step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.752_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5 step2 max memory diff 189.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.755_TTbar_14TeV+Run4D121_HLT75e33TimingLST step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.756_TTbar_14TeV+Run4D121_HLT75e33TimingTrimmedTracking step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.757_TTbar_14TeV+Run4D121_HLT75e33TimingMkFitFit step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.758_TTbar_14TeV+Run4D121_HLT75e33TimingTiclBarrel step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.759_TTbar_14TeV+Run4D121_HLTPhase2WithNano step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.775_TTbar_14TeV+Run4D121_NGTScoutingCAExtensionMergeT5 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34434.911_TTbar_14TeV+Run4D121_DD4hep step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34496.0_CloseByPGun_CE_E_Front_120um+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34500.0_CloseByPGun_CE_H_Coarse_Scint+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
  • Error: Workflow 34634.999_TTbar_14TeV+Run4D121PU_PMXS1S2PR step3 max memory diff 191.8 exceeds +/- 90.0 MiB
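
The report lines above follow a regular shape, so they can be summarised mechanically. The following is a minimal sketch, assuming the line format shown in this comment; the regular expression and the `parse_memory_errors` helper are illustrative names and are not part of cms-bot.

```python
import re

# Hypothetical pattern for the "max memory diff" lines quoted above.
LINE_RE = re.compile(
    r"Workflow (?P<workflow>\S+) (?P<step>step\d+) "
    r"max memory diff (?P<diff>-?\d+(?:\.\d+)?) "
    r"exceeds \+/- (?P<limit>\d+(?:\.\d+)?) MiB"
)


def parse_memory_errors(text):
    """Return (workflow, step, diff_MiB, limit_MiB) tuples found in text."""
    rows = []
    for match in LINE_RE.finditer(text):
        rows.append(
            (
                match["workflow"],
                match["step"],
                float(match["diff"]),
                float(match["limit"]),
            )
        )
    return rows


# Two lines copied from the report above.
sample = """\
Error: Workflow 34434.0_TTbar_14TeV+Run4D121 step2 max memory diff 191.8 exceeds +/- 90.0 MiB
Error: Workflow 34434.7521_TTbar_14TeV+Run4D121_HLT75e33TimingTiclV5TrackLinkGNN step2 max memory diff 166.0 exceeds +/- 90.0 MiB
"""

rows = parse_memory_errors(sample)
# Entry with the largest absolute memory difference.
worst = max(rows, key=lambda row: abs(row[2]))
```

Sorting or grouping the parsed rows by the difference makes it easy to see that nearly all affected workflows share the same ~190 MiB increase.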

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

Comment thread RecoTracker/LSTGeometry/test/dumpLSTGeometry.py Outdated
@ariostas
Contributor Author

ariostas commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary; most of it is freed once the maps are constructed.

@makortel
Contributor

makortel commented Apr 7, 2026

Is ~190 MB increase in memory usage expected?

That seems a bit high, but it's likely. I'll double-check. Either way, it is only temporary; most of it is freed once the maps are constructed.

According to the monitoring, the peak memory usage would increase by ~190 MB, so freeing the memory afterwards doesn't help much if the job is killed for going over the limit.
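
The distinction between peak and retained memory can be seen in a small, self-contained Python sketch. It is only an analogy: `build_maps` is a hypothetical stand-in for a computation with a large temporary buffer, and the 50 MiB size is arbitrary, not a measurement from this PR.

```python
import tracemalloc


def build_maps():
    """Hypothetical stand-in for a computation that needs a large scratch buffer."""
    scratch = bytearray(50 * 1024 * 1024)  # ~50 MiB of temporary working space
    result = bytes(scratch[:1024])         # small product that is kept
    del scratch                            # the temporary buffer is freed here
    return result


tracemalloc.start()
product = build_maps()
# current: memory still held now; peak: highest value seen since start().
current, peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

print(f"current = {current / 2**20:.2f} MiB, peak = {peak / 2**20:.2f} MiB")
```

The retained memory is small, but the peak still includes the temporary buffer, and the peak is what a memory limit is checked against.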

@makortel
Contributor

makortel commented Apr 7, 2026

test parameters:

  • workflows_profiling = 34434.0
  • enable_tests = profiling

@makortel
Contributor

makortel commented Apr 7, 2026

@cmsbuild, please test

Maybe one round of profiling tests would be worth it.

@makortel
Contributor

Nothing seems obviously wrong. LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' is marked as consuming LSTGeometryESProducer/'hltLSTGeometry', but as I mentioned, it is not actually used because it is commented out. I don't see any obvious duplication of products or anything like that.

Commenting out the request in produce is not enough. Saying you consume the item will cause the framework to prefetch it. So actually keeping the module from being called requires that no module says it consumes the item.

Right. This behavior is visible in the Tracer log as well:

++++++++++++ starting: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
<cut>
++++++++++++ finished: processing esmodule: label = 'hltLSTGeometry' type = LSTGeometryESProducer in record = TrackerRecoGeometryRecord
++++++++++ finished: prefetching for esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ starting: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord
++++++++++ finished: processing esmodule: label = 'hltESPModulesDevLST' type = LSTModulesDevESProducer@alpaka in record = TrackerRecoGeometryRecord

So when you are "simply commenting out this line", hltLSTGeometry can't be run, and this does not result in an error because the only consumer does not actually access the data, since these lines are commented out:
https://github.com/SegmentLinking/cmssw/blob/a9ab18292aa3f5a4b0774aecec84d628f17a544a/RecoTracker/LST/plugins/alpaka/LSTModulesDevESProducer.cc#L40-L42
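
The prefetching behavior described in this comment can be mimicked with a toy model. This is a sketch under stated assumptions and not the CMSSW framework API: `ToyEventSetup`, `register`, `get` and `make_setup` are hypothetical names, and only the two module labels come from the Tracer log above.

```python
class ToyEventSetup:
    """Toy model (not the CMSSW API): declared dependencies are produced
    before a module runs, whether or not the module reads them."""

    def __init__(self):
        self.producers = {}  # label -> (declared consumes, produce function)
        self.cache = {}      # label -> product
        self.calls = []      # order in which produce functions ran

    def register(self, label, consumes, produce):
        self.producers[label] = (tuple(consumes), produce)

    def get(self, label):
        if label not in self.cache:
            consumes, produce = self.producers[label]
            # Prefetch: every declared dependency is produced first.
            for dependency in consumes:
                self.get(dependency)
            self.calls.append(label)
            self.cache[label] = produce()
        return self.cache[label]


def make_setup(declare_dependency):
    """Run the consumer with or without the declared dependency."""
    setup = ToyEventSetup()
    setup.register("hltLSTGeometry", [], lambda: "geometry")
    consumes = ["hltLSTGeometry"] if declare_dependency else []
    # The consumer's produce body never reads the geometry, mimicking the
    # commented-out access discussed above.
    setup.register("hltESPModulesDevLST", consumes, lambda: "modules")
    setup.get("hltESPModulesDevLST")
    return setup.calls


with_declaration = make_setup(True)
without_declaration = make_setup(False)
```

In this model the geometry producer runs whenever the dependency is declared, even though its product is never read; it is skipped only when the declaration itself is removed.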

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

@makortel
Contributor

makortel commented Apr 29, 2026

This analysis does not answer the question of how LSTGeometryESProducer leads to GPU memory being used.

The Tracer log shows only LSTModulesDevESProducer@alpaka/'hltESPModulesDevLST' consuming the data product of hltLSTGeometry (and from the code only the host data product is consumed). The log also shows that only one produce call is made on hltLSTGeometry (i.e. no sign of implicit host-to-device copy; well, there can't be because LSTGeometryESProducer is not an Alpaka module).

@makortel
Contributor

Does the behavior of excessive memory usage reproduce on 1 thread/stream?

No, for 1 thread/stream everything looks normal.

If 1 thread/stream shows "good behavior", I'm wondering if the caching allocator could play a role. The allocator is shared, and if some modules allocate concurrently large temporary buffers, those buffers might end up being held by the caching allocator without being used later in the job. On 1 thread these temporary buffers would be allocated and deallocated serially, and the same large buffer could be used by multiple modules.

But this is, of course, pure speculation, and does not explain the role of the existence of hltLSTGeometry in the GPU memory usage.
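
The hypothesis in this comment can be illustrated with a toy model of a caching allocator. It is a sketch only, not the Alpaka/CMSSW CachingAllocator: the class, the first-fit reuse policy, and the sizes are all simplifying assumptions.

```python
class ToyCachingAllocator:
    """Toy model: freed blocks stay in a cache and are reused rather than
    being returned to the device."""

    def __init__(self):
        self.free_blocks = []  # sizes of cached blocks that are currently unused
        self.held_bytes = 0    # total bytes ever requested from the "device"

    def allocate(self, size):
        # Reuse the first cached block that is large enough.
        for index, cached in enumerate(self.free_blocks):
            if cached >= size:
                return self.free_blocks.pop(index)
        # Otherwise take new memory from the device.
        self.held_bytes += size
        return size

    def free(self, block):
        # The block goes back to the cache, not to the device.
        self.free_blocks.append(block)


def run(n_modules, size, concurrent):
    """Return the bytes held by the allocator after all modules have finished."""
    allocator = ToyCachingAllocator()
    if concurrent:
        # All temporary buffers are alive at the same time.
        blocks = [allocator.allocate(size) for _ in range(n_modules)]
        for block in blocks:
            allocator.free(block)
    else:
        # Each buffer is freed before the next module allocates.
        for _ in range(n_modules):
            allocator.free(allocator.allocate(size))
    return allocator.held_bytes


serial_held = run(4, 100, concurrent=False)
concurrent_held = run(4, 100, concurrent=True)
```

In the serial case one block is reused by every module, while in the concurrent case each module's buffer ends up cached, so the allocator holds four times as much memory after the same amount of work.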

@makortel
Contributor

The CachingAllocator hypothesis could be investigated further by comparing the behavior between 1-thread and many-thread cases (on a few events).

The debug prints of the CachingAllocator can be enabled with

if not hasattr(process, "AlpakaServiceCudaAsync"):
    process.load("HeterogeneousCore.AlpakaServices.AlpakaServiceCudaAsync_cfi")
    process.AlpakaServiceCudaAsync.verbose = True

A crude way to see the functions that lead to actual memory allocations would be

cmsTraceFunction "cms::alpakatools::CachingAllocator<alpaka::DevCudaRt, alpaka::QueueCudaRtNonBlocking>::allocateBuffer" cmsRun ...

(I'm not 100% sure I got the CachingAllocator template instantiation right; tracing calls to just cudaMalloc might also do the trick.)

@ariostas
Contributor Author

The CachingAllocator hypothesis could be investigated further...

I'm currently recompiling everything after adding

<flags CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"/>

to all the LST build files. I'll see what happens and try using the debug prints. Thanks!

@ariostas
Contributor Author

Okay, so the big spike is still there with the caching allocator disabled. So it's not actually the caching allocator itself; something is briefly allocating a big chunk of memory.

image

I'll try tracing CachingAllocator/cudaMalloc calls to see if I can pinpoint what's happening.

@ariostas
Contributor Author

ariostas commented May 1, 2026

I couldn't get cmsTraceFunction to work. Not sure why, but it didn't trace any function that I tried.

By using gdb directly, I looked at calls to cudaMalloc and I see that there is no single giant allocation, but rather there are many more allocations than before (more than double).

image

I'm going to be on vacation for the next 2 weeks. But I'll keep looking into this when I get back.

@slava77
Contributor

slava77 commented May 1, 2026

can it be something to do with the number of queues (and subsequently some extra allocations coming per queue)?
How is the number of queues defined: can it vary randomly (presumably repeatable but varying from unrelated changes)?

@mmusich
Contributor

mmusich commented May 7, 2026

In addition to the problems already discussed, now this branch has conflicts that must be resolved. @ariostas

@slava77
Contributor

slava77 commented May 15, 2026

In addition to the problems already discussed, now this branch has conflicts that must be resolved. @ariostas

IIUC Andres was away, expected to be back next week

@Parsifal-2045
Contributor

Also, I have tried to profile it with nsys, but it gets stuck when I try to use more than 1 stream.

Since profiling is still ongoing: this problem surfaced in #50870 and appears to be a weird clash between CMSSW's jemalloc and nsys. You should be able to run any profile by changing the launch command from cmsRun to cmsRunGlibC.

@ariostas
Contributor Author

In addition to the problems already discussed, now this branch has conflicts that must be resolved.

Rebased to fix conflicts. I'll get back to looking into this.

Since profiling is still ongoing: this problem surfaced in #50870 and appears to be a weird clash between CMSSW's jemalloc and nsys. You should be able to run any profile by changing the launch command from cmsRun to cmsRunGlibC.

Thank you! I'll see what I can learn from the nsys profile.

@cmsbuild
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50679/49455

@cmsbuild
Contributor

Pull request #50679 was updated. @Martin-Grunewald, @Moanwar, @cmsbuild, @jfernan2, @mandrenguyen, @mmusich, @srimanob can you please check and sign again.


8 participants