Skip to content

Extend edmMpiSplitConfig for multiple controllers#50678

Merged
cmsbuild merged 1 commit into
cms-sw:masterfrom
Annnnya:multiple_controllers_splitter
Apr 23, 2026
Merged

Extend edmMpiSplitConfig for multiple controllers#50678
cmsbuild merged 1 commit into
cms-sw:masterfrom
Annnnya:multiple_controllers_splitter

Conversation

@Annnnya
Copy link
Copy Markdown
Contributor

@Annnnya Annnnya commented Apr 7, 2026

PR description:

This pull request corresponds to ngt issue #18

It changes the splitter to allow creating multiple-remote setups, where parameters for different processes are separated with ":"

It also changes the logic by which the MPIController and MPISource discover each other: instead of specifying ranks of the follower process, MPIController can set remoteProcess parameter with remote process name, and it's rank will be discovered at runtime. MPISource has to specify the controller process name via controllerProcessName parameter, and MPIController must specify name of matching process in followerProcessName parameter.

Integration tests were modified to account for the configuration changes.

PR validation:

Tried using new splitter with the command:

edmMpiSplitConfig hlt.py -l local.py -m hltEcalDigisSoA hltEcalUncalibRecHitSoA -r remote_ecal.py -n ECAL :  \
-m hltHcalDigisSoA hltHbheRecoSoA hltParticleFlowRecHitHBHESoA hltParticleFlowClusterHBHESoA -d hltHcalDigis -r remote_hcal.py -n HCAL

Also integration test to split multiple remotes was added in MPICore package.

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Apr 7, 2026

cms-bot internal usage

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Apr 7, 2026

-code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/48900

Code check has found code style and quality issues which could be resolved by applying following patch(s)

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Apr 7, 2026

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/48901

@cmsbuild
Copy link
Copy Markdown
Contributor

cmsbuild commented Apr 7, 2026

A new Pull Request was created by @Annnnya for master.

It involves the following packages:

  • HeterogeneousCore/MPICore (heterogeneous)
  • HeterogeneousCore/MPIServices (heterogeneous)

@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks.
@makortel, @rovere this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

Comment thread HeterogeneousCore/MPICore/interface/hash_function.h Outdated
@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 7, 2026

How (un)likely is it that we get a collision in the hash of two process names?

Given that most process names are quite short, maybe we could gather the names directly instead of the hashes ?
Something like using MPI_Allgather to exchange the number of characters in each name, followed by MPI_Allgatherv to names ?

Comment thread HeterogeneousCore/MPICore/interface/hash_function.h Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPIController.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPIController.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc Outdated
Comment thread HeterogeneousCore/MPIServices/interface/MPIService.h Outdated
@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 20, 2026

please test

Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc Outdated
Comment thread HeterogeneousCore/MPICore/plugins/MPISource.cc
Comment thread HeterogeneousCore/MPICore/plugins/MPIController.cc
@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 20, 2026

@Annnnya could you

  • undo trivial changes like this and this
  • remove the comments about MPI_Abort that are no longer relevant
  • add checks that the controllerProcessName and followerProcessName are different from the current process name
  • add a new test that exercises edmMpiSplitConfig with multiple followers

?

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 20, 2026

From a cursory look the python part looks fine.

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 20, 2026

type ngt

@cmsbuild
Copy link
Copy Markdown
Contributor

-1

Failed Tests: RelVals-INPUT
Size: This PR adds an extra 28KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52773/summary.html
COMMIT: 2249e96
CMSSW: CMSSW_17_0_X_2026-04-20-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50678/52773/install.sh to create a dev area with all the needed externals and cmssw changes.

Failed RelVals-INPUT

The relvals timed out after 4 hours.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 5 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4186853
  • DQMHistoTests: Total failures: 50
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4186783
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/49087

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50678 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/49089

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50678 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/49090

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50678 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 21, 2026

please test

@cmsbuild
Copy link
Copy Markdown
Contributor

-1

Failed Tests: UnitTests
Size: This PR adds an extra 16KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52800/summary.html
COMMIT: b1322c0
CMSSW: CMSSW_17_0_X_2026-04-21-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50678/52800/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52800/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52800/git-merge-result

Failed Unit Tests

I found 1 errors in the following unit tests:

---> test testMPITooManyStreams had ERRORS

Comparison Summary

Summary:

  • You potentially added 3 lines to the logs
  • Reco comparison results: 4 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4186853
  • DQMHistoTests: Total failures: 47
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4186786
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

}
edm::LogInfo("MPI") << "MPIController Comm World size: " << size;

// All processes exchane the hashes of their names
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// All processes exchane the hashes of their names
// All processes exchange the hashes of their names.

Comment on lines +41 to +43
*
* If this module encounters a configuration error it will call MPI_Abort() instead of throwing an exception.
* Otherwise the call to MPI_Finalize() in the MPIService destructor may hang.
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
*
* If this module encounters a configuration error it will call MPI_Abort() instead of throwing an exception.
* Otherwise the call to MPI_Finalize() in the MPIService destructor may hang.

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 21, 2026

@Annnnya can you update also the new tests, HeterogeneousCore/MPICore/test/controller_too_many_streams_cfg.py and HeterogeneousCore/MPICore/test/follower_too_many_streams_cfg.py ?

…ange the hashes at communicatior channels creation
@cmsbuild
Copy link
Copy Markdown
Contributor

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50678/49093

  • There are other open Pull requests which might conflict with changes you have proposed:

@cmsbuild
Copy link
Copy Markdown
Contributor

Pull request #50678 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 21, 2026

please test

@cmsbuild
Copy link
Copy Markdown
Contributor

+1

Size: This PR adds an extra 32KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52810/summary.html
COMMIT: 87a5605
CMSSW: CMSSW_17_0_X_2026-04-21-1100/el8_amd64_gcc13
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/50678/52810/install.sh to create a dev area with all the needed externals and cmssw changes.

The following merge commits were also included on top of IB + this PR after doing git cms-merge-topic:

You can see more details here:
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52810/git-recent-commits.json
https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-2ea7bf/52810/git-merge-result

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 6 differences found in the comparisons
  • DQMHistoTests: Total files compared: 53
  • DQMHistoTests: Total histograms compared: 4186853
  • DQMHistoTests: Total failures: 7
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 4186826
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 52 files compared)
  • Checked 227 log files, 197 edm output root files, 53 DQM output files
  • TriggerResults: no differences found

@fwyzard
Copy link
Copy Markdown
Contributor

fwyzard commented Apr 22, 2026

+heterogeneous

@cmsbuild
Copy link
Copy Markdown
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @ftenchini, @mandrenguyen, @sextonkennedy (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Copy link
Copy Markdown
Contributor

+1

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants