Skip to content

Fix for testModelJit unit test#50927

Open
smuzaffar wants to merge 1 commit into
cms-sw:masterfrom
smuzaffar:fix-testModelJit
Open

Fix for testModelJit unit test#50927
smuzaffar wants to merge 1 commit into
cms-sw:masterfrom
smuzaffar:fix-testModelJit

Conversation

@smuzaffar
Copy link
Copy Markdown
Contributor

@smuzaffar smuzaffar commented May 13, 2026

This PR should fix the failing testModelJit unit test. The code in TestModelJIT::testToDevice_UpdatesUnderlyingState was creating a torch::Model with default auto_freeze=true. First call to Model.to() then sets is_forzen_=true and 2nd call m.to(::torch::kCPU); should have asserted.

      auto m = cms::torch::Model(m_path);
      m.to(dev);
      CPPUNIT_ASSERT_EQUAL(dev, m.device());
      m.to(::torch::kCPU);
      CPPUNIT_ASSERT_EQUAL(::torch::kCPU, m.device().type());

This PR propose to test both auto_freeze ON and OFF.
auto_freeze=false: Multiple calls to Model.to(dev) should not assert
auto_freeze=true: First call to Model.to(dev) should work but next call to Model.to(dev2) should throw.

@lukaszmichalskii , I think for cms::torch::Model constructor , we should call the freeze() if auto_freeze is true. Currently model become frozen only after calling first Model.to(dev) . If the intend is that if a model is created with auto_freeze then it should not be allowed to move to different device then may be we should call freeze() in the constructor

@smuzaffar
Copy link
Copy Markdown
Contributor Author

enable gpu

@smuzaffar
Copy link
Copy Markdown
Contributor Author

please test

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented May 13, 2026

cms-bot internal usage

@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50927/49316

@cmsbuild
Copy link
Copy Markdown
Contributor

A new Pull Request was created by @smuzaffar for master.

It involves the following packages:

  • PhysicsTools/PyTorch (ml)

@hjkwon260, @valsdav, @y19y19 can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@smuzaffar
Copy link
Copy Markdown
Contributor Author

ah #50790 seem to fix this unit tests by explictily disabling the auto freeze

@EmanueleCoradin
Copy link
Copy Markdown
Contributor

Hi @smuzaffar! I think your proposed change to the test unit is much more elegant than the one I was using in #50790, I would be in favor of keeping it.

Concerning this proposal:

I think for cms::torch::Model constructor , we should call the freeze() if auto_freeze is true. Currently model become frozen only after calling first Model.to(dev) . If the intend is that if a model is created with auto_freeze then it should not be allowed to move to different device then may be we should call freeze() in the constructor

In the workflow to use the PyTorchAlpaka package, the model is created in the constructor of the stream::FixedQueueEDProducer and it is moved to the device during the first call model.forward in the produce method, using the alpaka queue from the framework.
See for instance the SimpleNet example.
So the intention is to freeze the model after this step

@cmsbuild
Copy link
Copy Markdown
Contributor

-1

Failed Tests: RelVals-AMD_MI300X
Size: This PR adds an extra 20KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-fd6464/53212/summary.html
COMMIT: 8e6dc28
CMSSW: CMSSW_17_0_X_2026-05-12-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/50927/53212/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-AMD_MI300X

The relvals timed out after 4 hours.

  • 34634.40334634.403_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation/step2_TTbar_14TeV+Run4D121PU_Patatrack_PixelOnlyAlpaka_Validation.log

Comparison Summary

Summary:

  • You potentially removed 4 lines from the logs
  • ROOTFileChecks: Some differences in event products or their sizes found
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4187168
  • DQMHistoTests: Total failures: 3
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4187145
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

AMD_W7900 Comparison Summary

Summary:

NVIDIA_H100 Comparison Summary

Summary:

  • You potentially added 2 lines to the logs
  • Reco comparison results: 367 differences found in the comparisons
  • DQMHistoTests: Total files compared: 13
  • DQMHistoTests: Total histograms compared: 216259
  • DQMHistoTests: Total failures: 29216
  • DQMHistoTests: Total nulls: 34
  • DQMHistoTests: Total successes: 187009
  • DQMHistoTests: Total skipped: 0
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
  • Checked 49 log files, 50 edm output root files, 13 DQM output files
  • TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

@smuzaffar
Copy link
Copy Markdown
Contributor Author

smuzaffar commented May 13, 2026

faillig unit test passed now.

@cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19 ) , can you please review this?

@smuzaffar
Copy link
Copy Markdown
Contributor Author

ignore tests-rejected with manual-override

relval timing out on AMD mu300 gpu has nothing to do with this change. Something the NGT AMD mi300 just misbehaves.

@smuzaffar
Copy link
Copy Markdown
Contributor Author

@cms-sw/orp-l2 , can we get this in IBs. This is a trivial fix for torch unit test which is failing for all GPU.

@smuzaffar
Copy link
Copy Markdown
Contributor Author

@cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19 ) , can you please review this?

@valsdav
Copy link
Copy Markdown
Contributor

valsdav commented May 18, 2026

+ml

@cmsbuild
Copy link
Copy Markdown
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants