
Switch to the new tracking baseline (single iteration, CA-extended Patatrack + LST, mkFit) as Phase 2 HLT default#50040

Merged
cmsbuild merged 2 commits into cms-sw:master from SegmentLinking:newTRKAsPhase2HLTDefault on Mar 3, 2026

Conversation

@VourMa
Contributor

@VourMa VourMa commented Feb 5, 2026

This PR switches the default tracking sequence for Phase 2 HLT from the current baseline (two iterations, Patatrack quads + legacy triplets for seeding, CKF for building) to a new baseline (single iteration, CA-extended Patatrack + LST for seeding, mkFit for building) proposed by the TRK POG, in coordination with HLT Upgrade.

Previous behavior (configs as defined above):

  • No procModifiers: Current baseline.
  • phase2LegacyPixelTracks: Current baseline but with legacy (instead of Patatrack) quads.
  • phase2CAExtension,singleIterPatatrack,trackingLST,seedingLST,trackingMkFitCommon,hltTrackingMkFitInitialStep: New baseline.

Behavior after this PR (configs as defined above):

  • No procModifiers: New baseline.
  • hltPhase2LegacyTracking: Current baseline but with legacy (instead of Patatrack) quads.
  • hltPhase2LegacyTrackingPatatrackQuads: Current baseline.
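
The before/after mapping above can be summarized as a small lookup table. The following is a minimal sketch in plain Python, not CMSSW code: the procModifier names are copied from the two lists above, while the table names and the `selected_tracking` helper are hypothetical and exist only for illustration.

```python
# Illustrative lookup only; not CMSSW code. The procModifier names are copied
# from the PR description; the table and function names are hypothetical.
NEW_BASELINE_MODIFIERS = frozenset({
    "phase2CAExtension",
    "singleIterPatatrack",
    "trackingLST",
    "seedingLST",
    "trackingMkFitCommon",
    "hltTrackingMkFitInitialStep",
})

# procModifier set -> tracking configuration, before this PR
BEFORE_PR = {
    frozenset(): "current baseline",
    frozenset({"phase2LegacyPixelTracks"}): "current baseline with legacy quads",
    NEW_BASELINE_MODIFIERS: "new baseline",
}

# procModifier set -> tracking configuration, after this PR
AFTER_PR = {
    frozenset(): "new baseline",
    frozenset({"hltPhase2LegacyTracking"}): "current baseline with legacy quads",
    frozenset({"hltPhase2LegacyTrackingPatatrackQuads"}): "current baseline",
}


def selected_tracking(modifiers, table):
    """Return the tracking configuration that a set of procModifiers selects."""
    return table.get(frozenset(modifiers), "not covered by this table")


if __name__ == "__main__":
    # Running without any procModifier changes meaning with this PR.
    print("before:", selected_tracking([], BEFORE_PR))
    print("after: ", selected_tracking([], AFTER_PR))
```

The point of the sketch is that a job run without procModifiers silently changes from the current to the new baseline, so anything that relied on the old default needs `hltPhase2LegacyTrackingPatatrackQuads` to keep its previous behavior.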

By switching to the new baseline, a significant simplification of the tracking modules has been performed by removing all intermediate tracking configurations. Apart from the configurations discussed above, only the following configurations remain for Phase 2 HLT:

  • trackingLST: single iteration, CA-extended Patatrack, LST for building.
  • trackingMkFitFit: single iteration, CA-extended Patatrack + LST for seeding, mkFit for building, mkFit (instead of CKF) fitting.

As a result of the above, the workflows of the intermediate configurations have been removed. In addition, the updates of #49755 (which is superseded by this PR) have been included here, to avoid conflicts.

The NGT scouting configurations have been touched as part of the aforementioned simplifications, but all of the previous configurations are still supported.

The PR has been validated by running all the supported configurations and making sure that they produce exactly the same results as before the changes, i.e. this PR is purely technical for those configurations:

  • Current baseline [validation plot]
  • Current baseline but with legacy (instead of Patatrack) quads [validation plot]
  • New baseline [validation plot]
  • `trackingLST` [validation plot]
  • `trackingMkFitFit` [validation plot]
  • `ngtScouting` [validation plot]
  • `ngtScouting,trackingLST` [validation plot]

FYI @rovere

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2026

cms-bot internal usage

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50040/47892

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2026

A new Pull Request was created by @VourMa for master.

It involves the following packages:

  • Configuration/ProcessModifiers (operations)
  • Configuration/PyReleaseValidation (pdmv)
  • HLTrigger/Configuration (hlt)
  • HLTrigger/NGTScouting (hlt)
  • Validation/RecoTrack (dqm)
  • Validation/SiTrackerPhase2V (dqm)

@AdrianoDee, @DickyChant, @Martin-Grunewald, @antoniovagnerini, @cmsbuild, @ctarricone, @davidlange6, @fabiocos, @ftenchini, @gabrielmscampos, @mandrenguyen, @miquork, @mmusich, @nothingface0, @rseidita can you please review it and eventually sign? Thanks.
@GiacomoSguazzoni, @Martin-Grunewald, @SohamBhattacharya, @VinInn, @VourMa, @arossi83, @dgulhan, @elusian, @fabiocos, @felicepantaleo, @makortel, @missirol, @mmasciov, @mmusich, @mtosi, @richa2710, @rovere, @slomeo, @sroychow, @wmtford this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@mmusich
Contributor

mmusich commented Feb 5, 2026

test parameters:

  • enable = hlt_p2_integration, hlt_p2_timing
  • workflows = ph2_hlt

@mmusich
Contributor

mmusich commented Feb 5, 2026

@cmsbuild, please test

)

from Configuration.ProcessModifiers.phase2CAExtension_cff import phase2CAExtension
Contributor Author

Previously, extended pixel tracks without the corresponding ID were used here, even though these are not used anywhere in the downstream code, except for pixel vertexing, for reasons that are not explained with a comment in the relevant module. This was in disagreement with the rest of the configurations, which were using here the pixel tracks that are used by downstream modules.

With the deletion of these lines in this PR, extended pixel tracks with the corresponding ID are used here, i.e. the ones that are used by downstream code. This behavior is to me the "natural" one, in agreement with the rest of the configurations. However, if there is a strong preference that this is kept different from the rest of the configurations, I will adjust accordingly.

Tagging @elenavernazza and @mmusich as the original authors of this code. FYI @rovere.

This also affects a couple of replacements in the following lines.

Contributor

@mmusich mmusich Feb 5, 2026

@VourMa, we are using e.g. the non-high purity pixel tracks to make track quality selection studies (Cc: @EmanueleCoradin). We would appreciate it if you could keep sending those to nanoAOD until these studies are done. Once we have a robust selection strategy for pixel tracks, we can go back to sending the HP collection.

Thanks for asking!

Contributor

do I understand correctly that *pixelTrack* tables are not used in the default NANO (I was looking at an expanded config in 29834.772)? In this sense these are a POG/DPG NANO flavor. Is there a workflow that enables these?

Contributor

Is there a workflow that enables these?

yes, actually multiple.

0.759: HLT phase-2 timing menu, with NANO:@Phase2HLT
0.772: HLT phase-2 NGT Scouting menu, with NANO:@NGTScouting
0.773: HLT phase-2 NGT Scouting menu, with NANO:@NGTScoutingVal

they are all included in the ph2_hlt matrix tested here:

prefixDet+34.759, # HLT phase-2 timing menu, with NANO:@Phase2HLT
prefixDet+34.77, # HLT phase-2 NGT Scouting menu
prefixDet+34.771, # HLT phase-2 NGT Scouting menu, Alpaka, TICL-v5, TICL-Barrel, CA Extension
prefixDet+34.772, # HLT phase-2 NGT Scouting menu, with NANO:@NGTScouting
prefixDet+34.773, # HLT phase-2 NGT Scouting menu, with NANO:@NGTScoutingVal
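
For orientation, the full workflow numbers follow from adding these offsets to `prefixDet`. The sketch below is plain Python, not the matrix code itself. It assumes `prefixDet` is 29800, a value inferred only from the 29834.772 workflow quoted earlier in this thread and not confirmed against the matrix definition. `Decimal` is used so that the sums print exactly, without floating-point rounding noise.

```python
from decimal import Decimal

# Assumption: inferred from workflow 29834.772 quoted in this thread,
# not read from the actual matrix definition.
PREFIX_DET = Decimal("29800")

# Offsets and descriptions copied from the ph2_hlt matrix excerpt above.
OFFSETS = {
    "34.759": "HLT phase-2 timing menu, with NANO:@Phase2HLT",
    "34.77": "HLT phase-2 NGT Scouting menu",
    "34.771": "HLT phase-2 NGT Scouting menu, Alpaka, TICL-v5, TICL-Barrel, CA Extension",
    "34.772": "HLT phase-2 NGT Scouting menu, with NANO:@NGTScouting",
    "34.773": "HLT phase-2 NGT Scouting menu, with NANO:@NGTScoutingVal",
}


def full_workflow_numbers(prefix=PREFIX_DET, offsets=OFFSETS):
    """Map each full workflow number (as a string) to its description."""
    return {str(prefix + Decimal(off)): desc for off, desc in offsets.items()}


if __name__ == "__main__":
    for number, description in full_workflow_numbers().items():
        print(number, "->", description)
```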

Contributor

ehm, I was looking at 772 and it is not using the pixelTrack table.
nanoAOD_step = cms.Path(dstNanoFlavour) doesn't have it.

There are sequences containing NanoPixelTables, but they are all unused

  • hltPixelOnlyNanoFlavour
  • dstValidationNanoFlavour
  • hltValidationNanoFlavour

this is CMSSW_16_1_0_pre1
Did I miss something?

Contributor

@mmusich mmusich Feb 5, 2026

try looking in .773 if you are referring specifically to the pixel tables. I thought you referred in general to the HLT nanoAOD-s not being in workflows.

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2026

-1

Failed Tests: HLTP2Integration HLTP2Timing RelVals
Size: This PR adds an extra 112KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-4bd0a4/51120/summary.html
COMMIT: 1cb2ecc
CMSSW: CMSSW_16_1_X_2026-02-04-2300/el8_amd64_gcc13
Additional Tests: HLT_P2_INTEGRATION,HLT_P2_TIMING
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50040/51120/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals

----- Begin Fatal Exception 05-Feb-2026 11:43:09 CET-----------------------
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLT_PFPuppiMETTypeOne140_PFPuppiMHT140'
   [2] Calling method for module CAHitNtupletAlpakaPhase2OT@alpaka/'hltPhase2PixelTracksSoA'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: reco::TrackingRecHitHost
Looking for module label: hltPhase2PixelRecHitsExtendedSoA
Looking for productInstanceName: 

   Additional Info:
      [a] If you wish to continue processing events after a ProductNotFound exception,
add "TryToContinue = cms.untracked.vstring('ProductNotFound')" to the "options" PSet in the configuration.

----- End Fatal Exception -------------------------------------------------
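
When triaging several RelVal logs, the missing product can be pulled out of a block like the one above with a few lines of Python. This is a hypothetical helper, not part of cms-bot or CMSSW, and its regular expressions assume exactly the message layout shown here.

```python
import re

# Excerpt of the exception block above, used as sample input.
SAMPLE = """\
An exception of category 'ProductNotFound' occurred while
   [0] Processing  Event run: 1 lumi: 1 event: 1 stream: 0
   [1] Running path 'HLT_PFPuppiMETTypeOne140_PFPuppiMHT140'
   [2] Calling method for module CAHitNtupletAlpakaPhase2OT@alpaka/'hltPhase2PixelTracksSoA'
Exception Message:
Principal::getByToken: Found zero products matching all criteria
Looking for type: reco::TrackingRecHitHost
Looking for module label: hltPhase2PixelRecHitsExtendedSoA
"""

# One pattern per field of interest; each captures a single group.
PATTERNS = {
    "category": r"An exception of category '([^']+)'",
    "path": r"Running path '([^']+)'",
    "consumer": r"Calling method for module \S+/'([^']+)'",
    "product_type": r"Looking for type: (\S+)",
    "producer_label": r"Looking for module label: (\S+)",
}


def parse_product_not_found(text):
    """Extract the key fields of a ProductNotFound block; missing ones are None."""
    fields = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text)
        fields[key] = match.group(1) if match else None
    return fields


if __name__ == "__main__":
    for key, value in parse_product_not_found(SAMPLE).items():
        print(f"{key}: {value}")
```

For the failure above, this shows that `hltPhase2PixelTracksSoA` asked for a `reco::TrackingRecHitHost` product labelled `hltPhase2PixelRecHitsExtendedSoA` and found none in the event.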

@cmsbuild
Contributor

cmsbuild commented Feb 5, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50040/47906

@mmusich
Contributor

mmusich commented Mar 2, 2026

I found 14 workflow step(s) with memory usage exceeding the error threshold

Would it be possible to identify which is the source of the memory usage increase on CPU?

@VourMa
Contributor Author

VourMa commented Mar 2, 2026

I found 14 workflow step(s) with memory usage exceeding the error threshold

Would it be possible to identify which is the source of the memory usage increase on CPU?

I have made a report previously on this point:
#50040 (comment)

All in all, the new memory for the default baseline is consistent with the memory of the 0.7571 workflow (HLT75e33TimingAlpakaSingleIterLSTSeedingMkFitBuilding), which has now been deleted because it became the new baseline.

@mmusich
Contributor

mmusich commented Mar 2, 2026

I have made a report previously on this point:

is it possible instead to do a direct measurement of the workflow memory (and possibly profiling) instead of trying to infer from the bot results?

@VourMa
Contributor Author

VourMa commented Mar 2, 2026

is it possible instead to do a direct measurement of the workflow memory (and possibly profiling) instead of trying to infer from the bot results?

These bot measurements are pretty consistent across PRs (e.g. similar results can be inferred from the tests in #50283). What is the reason to mistrust them?

@slava77
Contributor

slava77 commented Mar 2, 2026

is this CPU memory a blocking issue? Perhaps it's practical to make a github issue for a follow up.

@mmusich
Contributor

mmusich commented Mar 2, 2026

is this CPU memory a blocking issue? Perhaps it's practical to make a github issue for a follow up.

yes, let's follow-up in an issue. I created #50288

@mmusich
Contributor

mmusich commented Mar 2, 2026

+hlt

@rovere
Contributor

rovere commented Mar 2, 2026

is this CPU memory a blocking issue? Perhaps it's practical to make a github issue for a follow up.

In general nothing is a blocking issue. On the other hand, one of the outcomes of this "strict" review is a considerable reduction of memory in the new tracking baseline from LST. While I agree we should try to have new developments integrated as soon as possible, we also have to keep a constant eye on resource usage and act accordingly to reduce it, when and where feasible.

@mmasciov
Contributor

mmasciov commented Mar 2, 2026

The physics performance has also been thoroughly and carefully validated by the TRK POG, so any residual concerns in that direction can likely be considered already addressed.

While the physics performance was thoroughly reviewed within the TRK POG (@cms-sw/tracking-pog-l2) (as well as within TSG/HLT upgrade), and the TRK POG supports this PR, let me highlight for historical precision that physics-wise improvements are foreseen on top of the current PR; e.g., see the presentations at the TRK POG last week (Feb 23) or today (Mar 2).
It was agreed that these improvements shall build on top of this PR, as independent PRs that will be submitted as soon as ready.

@slava77
Contributor

slava77 commented Mar 2, 2026

For the record, here are the changes in timing due to the PR:

are the circles links available for the timing/job reports?

@mmusich
Contributor

mmusich commented Mar 2, 2026

are the circles links available for the timing/job reports?

they are, follow the link at:

HLT P2 Timing: chart

@VourMa
Contributor Author

VourMa commented Mar 3, 2026

@cms-sw/dqm-l2 We have converged on this, could you take a look and let me know whether it looks OK from your side?

@AdrianoDee
Contributor

+pdmv
(wf wise, all good)

@gabrielmscampos
Member

+dqm

@cmsbuild
Contributor

cmsbuild commented Mar 3, 2026

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @mandrenguyen, @ftenchini (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@mmusich
Contributor

mmusich commented Mar 5, 2026

For the record, in the IB CMSSW_16_1_X_2026-03-03-2300, after this PR was merged, we started observing a higher-than-usual rate of failed workflows in the GPU matrix on the machines with GPUs, from about 1 per IB to about 20-30 per IB:

image

The failure is more often than not a segmentation violation in step2 in the HLT; an example log file has a stack trace containing:

Thread 28 (Thread 0x14a8fb400700 (LWP 3630546) "cmsRun"):
#0  0x000014ab80732ae1 in poll () from /lib64/libc.so.6
#1  0x000014ab7924bace in edm::service::InitRootHandlers::stacktraceFromThread() () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02931/el8_amd64_gcc13/cms/cmssw/CMSSW_16_1_X_2026-03-02-2300/lib/el8_amd64_gcc13/pluginFWCoreServicesPlugins.so
#2  0x000014ab7924bcd3 in sig_dostack_then_abort () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02931/el8_amd64_gcc13/cms/cmssw/CMSSW_16_1_X_2026-03-02-2300/lib/el8_amd64_gcc13/pluginFWCoreServicesPlugins.so
#3  <signal handler called>
#4  0x000014ab806df37a in __memset_avx512_unaligned_erms () from /lib64/libc.so.6
#5  0x000014a8f5977ca1 in std::_Function_handler<std::any (edm::StreamID, edm::WrapperBase const&, edm::WaitingTaskWithArenaHolder), edm::stream::impl::Transformer::registerTransformAsync<reco::TrackingRecHitHost, alpaka_cuda_async::ProducerBase<edm::stream::EDProducer>::produces<reco::TrackingRecHitHost, (edm::Transition)0>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(edm::StreamID, reco::TrackingRecHitHost const&, edm::WaitingTaskWithArenaHolder)#1}, alpaka_cuda_async::ProducerBase<edm::stream::EDProducer>::produces<reco::TrackingRecHitHost, (edm::Transition)0>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(edm::StreamID, auto:1)#1}>(edm::EDPutTokenT<reco::TrackingRecHitHost>, alpaka_cuda_async::ProducerBase<edm::stream::EDProducer>::produces<reco::TrackingRecHitHost, (edm::Transition)0>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(edm::StreamID, reco::TrackingRecHitHost const&, edm::WaitingTaskWithArenaHolder)#1}, alpaka_cuda_async::ProducerBase<edm::stream::EDProducer>::produces<reco::TrackingRecHitHost, (edm::Transition)0>(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(edm::StreamID, auto:1)#1}, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >)::{lambda(edm::StreamID, edm::WrapperBase const&, edm::WaitingTaskWithArenaHolder)#1}>::_M_invoke(std::_Any_data const&, edm::StreamID&&, edm::WrapperBase const&, edm::WaitingTaskWithArenaHolder&&) () from /cvmfs/cms-ib.cern.ch/sw/x86_64/nweek-02931/el8_amd64_gcc13/cms/cmssw-patch/CMSSW_16_1_X_2026-03-03-2300/lib/el8_amd64_gcc13/pluginRecoLocalTrackerPhase2OTRecHitsSoAPluginsPortableCudaAsync.so

@JanGerritSchulz diagnosed and fixed this issue at #50318 (there might be other failures downstream).
