Fix for testModelJit unit test #50927

enable gpu

please test

cms-bot internal usage

+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50927/49316

A new Pull Request was created by @smuzaffar for master. It involves the following packages:
@hjkwon260, @valsdav, @y19y19 can you please review it and eventually sign? Thanks. cms-bot commands are listed here.

Ah, #50790 seems to fix this unit test by explicitly disabling auto freeze.

Hi @smuzaffar! I think your proposed change to the unit test is much more elegant than the one I was using in #50790, and I would be in favor of keeping it. Concerning this proposal:
In the workflow to use the PyTorchAlpaka package, the model is created in the constructor of the

-1
Failed Tests: RelVals-AMD_MI300X
Failed RelVals-AMD_MI300X: The relvals timed out after 4 hours.
Comparison Summary:
AMD_W7900 Comparison Summary:
NVIDIA_H100 Comparison Summary:
NVIDIA_L40S Comparison Summary:

The failing unit test passes now. @cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19), can you please review this?

ignore tests-rejected with manual-override
The RelVal timing out on the AMD MI300 GPU has nothing to do with this change. The NGT AMD MI300 sometimes just misbehaves.

@cms-sw/orp-l2, can we get this into the IBs? This is a trivial fix for the torch unit test, which is failing on all GPUs.

@cms-sw/ml-l2 (@hjkwon260, @valsdav, @y19y19), can you please review this?

+ml

This pull request is fully signed and it will be integrated in one of the next master IBs (test failures were overridden). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

This PR should fix the failing `testModelJit` unit test. The code in `TestModelJIT::testToDevice_UpdatesUnderlyingState` was creating a `torch::Model` with the default `auto_freeze=true`. The first call to `Model.to()` then sets `is_frozen_=true`, so the second call `m.to(::torch::kCPU);` should have asserted. This PR proposes to test both auto_freeze ON and OFF:

- `auto_freeze=false`: multiple calls to `Model.to(dev)` should not assert.
- `auto_freeze=true`: the first call to `Model.to(dev)` should work, but the next call to `Model.to(dev2)` should throw.

@lukaszmichalskii, I think the `cms::torch::Model` constructor should call `freeze()` if `auto_freeze` is true. Currently the model becomes frozen only after the first call to `Model.to(dev)`. If the intent is that a model created with `auto_freeze` should not be allowed to move to a different device, then maybe we should call `freeze()` in the constructor.
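The sketch below makes the two test cases concrete. It is a small, libtorch-free C++ model of the auto_freeze behavior as described in this thread. `ToyModel`, `Device`, `throws_on_to`, and the `check_auto_freeze_*` helpers are hypothetical names and are not part of `cms::torch::Model` or libtorch. The sketch also assumes that a move on a frozen model throws `std::runtime_error`; the real class may assert instead, or throw a different exception type.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a device tag; the real code passes ::torch::Device.
enum class Device { CPU, GPU };

// Hypothetical, libtorch-free model of the auto_freeze behavior described in
// this PR. It is NOT the cms::torch::Model implementation.
class ToyModel {
public:
  explicit ToyModel(bool auto_freeze = true) : auto_freeze_(auto_freeze) {}

  // Moves the model to `dev`. With auto_freeze on, the first successful call
  // freezes the model, and every later call throws (assumed exception type).
  void to(Device dev) {
    if (is_frozen_)
      throw std::runtime_error("model is frozen; cannot move to another device");
    device_ = dev;
    if (auto_freeze_)
      is_frozen_ = true;
  }

  bool is_frozen() const { return is_frozen_; }
  Device device() const { return device_; }

private:
  bool auto_freeze_;
  bool is_frozen_ = false;
  Device device_ = Device::CPU;
};

// Returns true if m.to(dev) throws std::runtime_error.
inline bool throws_on_to(ToyModel& m, Device dev) {
  try {
    m.to(dev);
  } catch (const std::runtime_error&) {
    return true;
  }
  return false;
}

// Mirrors the auto_freeze=false case: repeated moves must not fail.
inline void check_auto_freeze_off() {
  ToyModel m(/*auto_freeze=*/false);
  assert(!throws_on_to(m, Device::GPU));
  assert(!throws_on_to(m, Device::CPU));
  assert(!m.is_frozen() && m.device() == Device::CPU);
}

// Mirrors the auto_freeze=true case: the first move works, the second throws.
inline void check_auto_freeze_on() {
  ToyModel m(/*auto_freeze=*/true);
  assert(!m.is_frozen());
  assert(!throws_on_to(m, Device::GPU));
  assert(m.is_frozen() && m.device() == Device::GPU);
  assert(throws_on_to(m, Device::CPU));
  assert(m.device() == Device::GPU);  // a rejected move leaves the device unchanged
}
```

In this toy model, the constructor-freeze proposal would amount to initializing `is_frozen_` from `auto_freeze` in the constructor. With auto_freeze on, even the first `to()` call would then be rejected. This is only one reading of the proposal, and the real `freeze()` may do more than flip a flag.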