Fix for testModelJit unit test by smuzaffar · Pull Request #50927 · cms-sw/cmssw

smuzaffar · 2026-05-13T06:16:18Z

This PR should fix the failing testModelJit unit test. The code in TestModelJIT::testToDevice_UpdatesUnderlyingState was creating a torch::Model with default auto_freeze=true. First call to Model.to() then sets is_forzen_=true and 2nd call m.to(::torch::kCPU); should have asserted.

      auto m = cms::torch::Model(m_path);
      m.to(dev);
      CPPUNIT_ASSERT_EQUAL(dev, m.device());
      m.to(::torch::kCPU);
      CPPUNIT_ASSERT_EQUAL(::torch::kCPU, m.device().type());

This PR propose to test both auto_freeze ON and OFF.
auto_freeze=false: Multiple calls to Model.to(dev) should not assert
auto_freeze=true: First call to Model.to(dev) should work but next call to Model.to(dev2) should throw.

@lukaszmichalskii , I think for cms::torch::Model constructor , we should call the freeze() if auto_freeze is true. Currently model become frozen only after calling first Model.to(dev) . If the intend is that if a model is created with auto_freeze then it should not be allowed to move to different device then may be we should call freeze() in the constructor

smuzaffar · 2026-05-13T06:16:29Z

enable gpu

smuzaffar · 2026-05-13T06:16:34Z

please test

cmsbuild · 2026-05-13T06:16:47Z

cms-bot internal usage

cmsbuild · 2026-05-13T06:18:34Z

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50927/49316

There are other open Pull requests which might conflict with changes you have proposed:
- File PhysicsTools/PyTorch/interface/Model.h modified in PR(s): Add Torch FP16 inference support #50790
- File PhysicsTools/PyTorch/test/testModelJit.cc modified in PR(s): Add Torch FP16 inference support #50790

cmsbuild · 2026-05-13T06:18:59Z

A new Pull Request was created by @smuzaffar for master.

It involves the following packages:

PhysicsTools/PyTorch (ml)

@hjkwon260, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

smuzaffar · 2026-05-13T06:25:40Z

ah #50790 seem to fix this unit tests by explictily disabling the auto freeze

EmanueleCoradin · 2026-05-13T07:06:56Z

Hi @smuzaffar! I think your proposed change to the test unit is much more elegant than the one I was using in #50790, I would be in favor of keeping it.

Concerning this proposal:

I think for cms::torch::Model constructor , we should call the freeze() if auto_freeze is true. Currently model become frozen only after calling first Model.to(dev) . If the intend is that if a model is created with auto_freeze then it should not be allowed to move to different device then may be we should call freeze() in the constructor

In the workflow to use the PyTorchAlpaka package, the model is created in the constructor of the stream::FixedQueueEDProducer and it is moved to the device during the first call model.forward in the produce method, using the alpaka queue from the framework.
See for instance the SimpleNet example.
So the intention is to freeze the model after this step

cmsbuild · 2026-05-13T10:42:21Z

-1

Failed Tests: RelVals-AMD_MI300X
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd6464/53212/summary.html
COMMIT: 8e6dc28
CMSSW: CMSSW_17_0_X_2026-05-12-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50927/53212/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-AMD_MI300X

The relvals timed out after 4 hours.

34634.40334634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log

Comparison Summary

Summary:

You potentially removed 4 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 3
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187145
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

You potentially removed 30 lines from the logs
Reco comparison results: 362 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 29551
DQMHistoTests: Total nulls: 37
DQMHistoTests: Total successes: 186671
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially added 2 lines to the logs
Reco comparison results: 367 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 29216
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 187009
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially removed 10 lines from the logs
Reco comparison results: 380 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31183
DQMHistoTests: Total nulls: 33
DQMHistoTests: Total successes: 185043
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

smuzaffar · 2026-05-13T10:45:59Z

faillig unit test passed now.

@cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19 ) , can you please review this?

smuzaffar · 2026-05-13T11:02:55Z

ignore tests-rejected with manual-override

relval timing out on AMD mu300 gpu has nothing to do with this change. Something the NGT AMD mi300 just misbehaves.

smuzaffar · 2026-05-14T08:34:53Z

@cms-sw/orp-l2 , can we get this in IBs. This is a trivial fix for torch unit test which is failing for all GPU.

smuzaffar · 2026-05-18T10:45:21Z

@cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19 ) , can you please review this?

valsdav · 2026-05-18T11:46:10Z

+ml

cmsbuild · 2026-05-18T11:46:36Z

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

Fix for testModelJit unit test

8e6dc28

cmsbuild added this to the CMSSW_17_0_X milestone May 13, 2026

cmsbuild added pending-signatures orp-pending tests-started code-checks-pending ml-pending labels May 13, 2026

cmsbuild added code-checks-approved and removed code-checks-pending labels May 13, 2026

This was referenced May 13, 2026

[GPU-Nvidia] nvidia_t4,nvidia_h100,nvidia_l4,nvidia_ls40s failing #50923

Open

Add Torch FP16 inference support #50790

Open

cmsbuild added tests-rejected and removed tests-started labels May 13, 2026

cmsbuild added tests-approved tests-manual-override and removed tests-rejected labels May 13, 2026

cmsbuild removed the pending-signatures label May 18, 2026

cmsbuild added fully-signed ml-approved and removed ml-pending labels May 18, 2026

Conversation

smuzaffar commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smuzaffar commented May 13, 2026

Uh oh!

smuzaffar commented May 13, 2026

Uh oh!

cmsbuild commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

cmsbuild commented May 13, 2026

Uh oh!

cmsbuild commented May 13, 2026

Uh oh!

smuzaffar commented May 13, 2026

Uh oh!

EmanueleCoradin commented May 13, 2026

Uh oh!

cmsbuild commented May 13, 2026

Failed RelVals-AMD_MI300X

Comparison Summary

AMD_W7900 Comparison Summary

NVIDIA_H100 Comparison Summary

NVIDIA_L40S Comparison Summary

Uh oh!

smuzaffar commented May 13, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

smuzaffar commented May 13, 2026

Uh oh!

smuzaffar commented May 14, 2026

Uh oh!

smuzaffar commented May 18, 2026

Uh oh!

valsdav commented May 18, 2026

Uh oh!

cmsbuild commented May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

smuzaffar commented May 13, 2026 •

edited

Loading

cmsbuild commented May 13, 2026 •

edited

Loading

smuzaffar commented May 13, 2026 •

edited

Loading