Testing 4 GPU relvals in parallel by smuzaffar · Pull Request #2747 · cms-sw/cms-bot

smuzaffar · 2026-05-13T13:46:32Z

By default we run N jobs in parallel for GPU relvals, where N is the number of GPUs available (mostly 1). This PR is to test whether we can run multiple relvals in parallel.
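A minimal sketch of the job-count choice described above, not the actual cms-bot code. The function name `gpu_relval_jobs` and the `override` parameter are hypothetical, introduced only to illustrate the default (one job per GPU) against a larger fixed value such as the 4 in the PR title.

```python
from typing import Optional


def gpu_relval_jobs(num_gpus: int, override: Optional[int] = None) -> int:
    """Return how many GPU relvals to run concurrently.

    Default behaviour as described in the PR text: one job per available GPU.
    `override` is a hypothetical knob standing in for whatever this PR changes.
    """
    if num_gpus < 1:
        raise ValueError("at least one GPU is required for GPU relvals")
    if override is not None:
        if override < 1:
            raise ValueError("override must be a positive job count")
        return override
    return num_gpus


print(gpu_relval_jobs(1))     # default on a single-GPU node
print(gpu_relval_jobs(1, 4))  # several relvals sharing one GPU
```

Whether the real change applies a fixed count or a per-GPU multiplier is not stated in this thread, so the sketch deliberately keeps only the two cases the description mentions.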

smuzaffar · 2026-05-13T13:46:39Z

enable gpu

smuzaffar · 2026-05-13T13:46:47Z

please test

cmsbuild · 2026-05-13T13:47:13Z

A new Pull Request was created by @smuzaffar for branch master.

@akritkbehera, @iarspider, @raoatifshad, @smuzaffar can you please review it and eventually sign? Thanks.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.
cms-bot commands are listed here

cmsbuild · 2026-05-13T13:47:14Z

cms-bot internal usage

cmsbuild · 2026-05-13T16:12:41Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5835b/53234/summary.html
COMMIT: a556f63
CMSSW: CMSSW_17_0_X_2026-05-12-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cms-bot/2747/53234/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 5 lines from the logs
Reco comparison results: 16 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 42
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187106
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: no differences found
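The DQMHistoTests counters above can be sanity-checked with a short script. This is only a sketch: the helper `parse_dqm_summary` is hypothetical, and the assumption that failures, nulls, successes and skipped together account for every compared histogram is inferred from the numbers in this summary, not from cms-bot documentation.

```python
import re

# Counter lines copied from the comparison summary above.
SUMMARY = """\
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 42
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187106
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
"""


def parse_dqm_summary(text: str) -> dict:
    """Collect 'DQMHistoTests: Total <name>: <int>' lines into a dict."""
    pattern = re.compile(r"^DQMHistoTests: Total (.+?): (\d+)[ \t]*$", re.MULTILINE)
    return {name.lower(): int(value) for name, value in pattern.findall(text)}


counts = parse_dqm_summary(SUMMARY)

# Assumed partition of the compared histograms into four outcome categories.
accounted = (
    counts["failures"] + counts["nulls"] + counts["successes"] + counts["skipped"]
)
print(accounted, counts["histograms compared"])
print(accounted == counts["histograms compared"])
```

The same check can be pointed at any of the per-GPU summaries in this thread by swapping the text in `SUMMARY`.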

AMD_MI300X Comparison Summary

There are some workflows for which there are errors in the baseline:
34634.404 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

You potentially removed 98 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 322 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 34886
DQMHistoTests: Total nulls: 38
DQMHistoTests: Total successes: 181335
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 48 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

AMD_W7900 Comparison Summary

Summary:

You potentially removed 28 lines from the logs
Reco comparison results: 363 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31439
DQMHistoTests: Total nulls: 31
DQMHistoTests: Total successes: 184789
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 10 lines from the logs
Reco comparison results: 321 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 45249
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 170976
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially removed 9 lines from the logs
Reco comparison results: 360 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31135
DQMHistoTests: Total nulls: 32
DQMHistoTests: Total successes: 185092
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

smuzaffar · 2026-05-13T21:27:49Z

please test

cmsbuild · 2026-05-13T21:28:21Z

Pull request #2747 was updated.

cmsbuild · 2026-05-14T01:04:00Z

-1

Failed Tests: RelVals-AMD_W7900
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5835b/53244/summary.html
COMMIT: e23c8f2
CMSSW: CMSSW_17_0_X_2026-05-13-1100/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cms-bot/2747/53244/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-AMD_W7900

34634.751: 34634.751_TTbar_14TeV+Run4D121PU_HLT75e33TimingAlpaka/step2_TTbar_14TeV+Run4D121PU_HLT75e33TimingAlpaka.log

Comparison Summary

Summary:

You potentially removed 2 lines from the logs
Reco comparison results: 4 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 32
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187116
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

There are some workflows for which there are errors in the baseline:
34634.402 step 2
The results for the comparisons for these workflows could be incomplete
This means most likely that the IB is having errors in the relvals. The error does NOT come from this pull request

Summary:

You potentially removed 130 lines from the logs
ROOTFileChecks: Some differences in event products or their sizes found
Reco comparison results: 321 differences found in the comparisons
DQMHistoTests: Total files compared: 12
DQMHistoTests: Total histograms compared: 200550
DQMHistoTests: Total failures: 23086
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 177430
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 11 files compared)
Checked 47 log files, 48 edm output root files, 12 DQM output files
TriggerResults: found differences in 1 / 11 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 9 lines from the logs
Reco comparison results: 369 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 31315
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 184910
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially removed 2 lines from the logs
Reco comparison results: 367 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 27881
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 188344
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

smuzaffar · 2026-05-14T09:00:45Z

please test

cmsbuild · 2026-05-14T11:04:47Z

+1

Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-b5835b/53250/summary.html
COMMIT: e23c8f2
CMSSW: CMSSW_17_0_X_2026-05-13-2300/el8_amd64_gcc13
Additional Tests: GPU,AMD_MI300X,AMD_W7900,NVIDIA_H100,NVIDIA_L40S
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cms-bot/2747/53250/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

You potentially removed 2 lines from the logs
Reco comparison results: 8 differences found in the comparisons
DQMHistoTests: Total files compared: 53
DQMHistoTests: Total histograms compared: 4187168
DQMHistoTests: Total failures: 37
DQMHistoTests: Total nulls: 0
DQMHistoTests: Total successes: 4187111
DQMHistoTests: Total skipped: 20
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
Checked 227 log files, 197 edm output root files, 53 DQM output files
TriggerResults: no differences found

AMD_MI300X Comparison Summary

Summary:

You potentially added 4 lines to the logs
Reco comparison results: 330 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 32848
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 183377
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

AMD_W7900 Comparison Summary

Summary:

You potentially added 3 lines to the logs
Reco comparison results: 365 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 30644
DQMHistoTests: Total nulls: 33
DQMHistoTests: Total successes: 185582
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 2 / 12 workflows

NVIDIA_H100 Comparison Summary

Summary:

You potentially removed 1 lines from the logs
Reco comparison results: 381 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 28113
DQMHistoTests: Total nulls: 34
DQMHistoTests: Total successes: 188112
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows

NVIDIA_L40S Comparison Summary

Summary:

You potentially added 9 lines to the logs
Reco comparison results: 350 differences found in the comparisons
DQMHistoTests: Total files compared: 13
DQMHistoTests: Total histograms compared: 216259
DQMHistoTests: Total failures: 28833
DQMHistoTests: Total nulls: 33
DQMHistoTests: Total successes: 187393
DQMHistoTests: Total skipped: 0
DQMHistoTests: Total Missing objects: 0
DQMHistoSizes: Histogram memory added: 0.0 KiB( 12 files compared)
Checked 49 log files, 50 edm output root files, 13 DQM output files
TriggerResults: found differences in 1 / 12 workflows
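To compare the three test runs side by side, the DQM failure counts from the summaries in this thread can be turned into failure fractions. This is a rough sketch: the `RUNS` table is transcribed by hand from the comments above, and `failure_fraction` is a hypothetical helper, not part of cms-bot.

```python
# (failures, histograms compared) per GPU flavour, keyed by the Jenkins test
# number that appears in each summary URL in this thread.
# AMD_W7900 has no entry for 53244 because its relvals failed in that run.
RUNS = {
    "53234": {
        "AMD_MI300X": (34886, 216259),
        "AMD_W7900": (31439, 216259),
        "NVIDIA_H100": (45249, 216259),
        "NVIDIA_L40S": (31135, 216259),
    },
    "53244": {
        "AMD_MI300X": (23086, 200550),
        "NVIDIA_H100": (31315, 216259),
        "NVIDIA_L40S": (27881, 216259),
    },
    "53250": {
        "AMD_MI300X": (32848, 216259),
        "AMD_W7900": (30644, 216259),
        "NVIDIA_H100": (28113, 216259),
        "NVIDIA_L40S": (28833, 216259),
    },
}


def failure_fraction(failures: int, total: int) -> float:
    """Fraction of compared histograms that were reported as failures."""
    return failures / total


for run, flavours in RUNS.items():
    for flavour, (failures, total) in sorted(flavours.items()):
        pct = 100 * failure_fraction(failures, total)
        print(f"{run}  {flavour:<12}  {pct:5.1f}%")
```

Runs 53244 and 53250 tested the same commit (e23c8f2) against different IBs, so the spread between them gives a feel for how much these counts move without any change to the PR itself.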

smuzaffar · 2026-05-14T14:07:28Z

+externals

cmsbuild · 2026-05-14T14:07:42Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @sextonkennedy, @ftenchini, @mandrenguyen (and backports should be raised in the release meeting by the corresponding L2)

[Do no merge] Testing 2 GPU relvals in parallel

a556f63

cmsbuild added orp-pending pending-signatures externals-pending tests-started labels May 13, 2026

cmsbuild added tests-approved and removed tests-started labels May 13, 2026

Update run-ib-pr-matrix.sh

e23c8f2

cmsbuild added tests-started and removed tests-approved labels May 13, 2026

cmsbuild added tests-rejected and removed tests-started labels May 14, 2026

cmsbuild added tests-started and removed tests-rejected labels May 14, 2026

cmsbuild added tests-approved and removed tests-started labels May 14, 2026

smuzaffar changed the title ~~[Do no merge] Testing 2 GPU relvals in parallel~~ Testing 4 GPU relvals in parallel May 14, 2026

smuzaffar merged commit 7057236 into master May 14, 2026
21 checks passed

cmsbuild removed the pending-signatures label May 14, 2026

cmsbuild added fully-signed externals-approved and removed externals-pending labels May 14, 2026

smuzaffar deleted the 2-gpu-relvals branch May 14, 2026 18:20

smuzaffar merged 2 commits into master from 2-gpu-relvals
