Add MPISenderPortable and MPIReceiverPortable modules to send/receive arbitrary device collections #50503

Open
ghyls wants to merge 2 commits into cms-sw:master from ghyls:devel-mpi-generic

@ghyls ghyls commented Mar 24, 2026

PR description:

This PR includes two separate developments, one of which requires the other.

Enable the registration at runtime of DtoH and HtoD product transformations:

  • Implements the ability to register DtoH and HtoD transformations of products whose type is not known at compile time.
  • These changes are motivated by the need for an MPI module that:
    • Is an alpaka module, and can receive products directly in device memory.
    • Produces products that might be needed on host by downstream modules, in which case they should be converted automatically.
    • Handles products whose concrete type is not known at compile time.
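
As a rough illustration of what registering a transformation for a type known only at runtime can look like, here is a minimal standalone sketch. It is not the CMSSW implementation: TransformRegistry, ErasedTransform, DeviceBuffer, HostBuffer and copyToHost are hypothetical names, std::any stands in for a type-erased product wrapper, and no real device memory is involved.

```cpp
#include <any>
#include <cassert>
#include <functional>
#include <stdexcept>
#include <typeindex>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-ins, not CMSSW types.
struct DeviceBuffer { std::vector<float> data; };  // pretend this lives in device memory
struct HostBuffer { std::vector<float> data; };

// Type-erased transformation: takes a wrapped product, returns the converted product.
using ErasedTransform = std::function<std::any(std::any const&)>;

class TransformRegistry {
public:
  // Register a transformation for a source type identified only at runtime.
  void registerTransform(std::type_index source, ErasedTransform transform) {
    assert(static_cast<bool>(transform));  // an empty std::function cannot be called
    transforms_.insert_or_assign(source, std::move(transform));
  }

  bool has(std::type_index source) const { return transforms_.count(source) != 0; }

  // Apply the registered transformation to a product whose static type is unknown here.
  std::any apply(std::any const& product) const {
    auto it = transforms_.find(std::type_index(product.type()));
    if (it == transforms_.end()) {
      throw std::runtime_error("no transformation registered for this product type");
    }
    return it->second(product);
  }

private:
  std::unordered_map<std::type_index, ErasedTransform> transforms_;
};

// Example "device to host" copy for the stand-in types.
inline std::any copyToHost(std::any const& product) {
  auto const& device = std::any_cast<DeviceBuffer const&>(product);
  return HostBuffer{device.data};
}
```

In this sketch the registry is keyed on std::type_index, so a caller that only holds a runtime type identifier can register copyToHost for DeviceBuffer and later apply it to any wrapped product of that type.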

Add MPISenderPortable and MPIReceiverPortable modules to send/receive device collections

  • Introduce MPISenderPortable.cc and MPIReceiverPortable.cc, which can send/receive runtime-typed device collections for which a device TrivialSerialiser plugin exists, directly to/from device memory.

PR validation:

  • A GenericClonerPortable test module is introduced to demonstrate the D to H and H to D transformation registrations at runtime. The module clones a host or a device product and registers the H to D (or D to H) transformation for it. The test is configured via testGenericClonerDevice.py

  • A small test is added to FWCore/Framework/test/stream_producer_catch2.cc to test the non-templated registerTransformAsync overload added to Framework/interface/stream/implementors.h.

  • The Portable MPI modules (MPISenderPortable.cc and MPIReceiverPortable.cc) are tested via the two new configurations added to HeterogeneousCore/MPICore/test.

Backport

  • We plan to backport this PR to 16_0_X and 16_1_X for it to be used in the NGT demonstrator during data taking this year.

cmsbuild commented Mar 24, 2026

cms-bot internal usage

@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50503/48670

cmsbuild commented Apr 4, 2026

Milestone for this pull request has been moved to CMSSW_17_0_X. Please open a backport if it should also go in to CMSSW_16_1_X.

@ghyls ghyls marked this pull request as ready for review May 7, 2026 10:38

cmsbuild commented May 7, 2026

A new Pull Request was created by @ghyls for master.

It involves the following packages:

  • DataFormats/BeamSpot (reconstruction)
  • DataFormats/EcalDigi (simulation)
  • DataFormats/EcalRecHit (reconstruction)
  • DataFormats/HGCalDigi (simulation)
  • DataFormats/HGCalReco (reconstruction)
  • DataFormats/HcalDigi (simulation)
  • DataFormats/HcalRecHit (reconstruction)
  • DataFormats/ParticleFlowReco (reconstruction)
  • DataFormats/PortableTestObjects (heterogeneous)
  • DataFormats/SiPixelClusterSoA (heterogeneous, reconstruction)
  • DataFormats/SiPixelDigiSoA (heterogeneous, reconstruction)
  • DataFormats/SiStripClusterSoA (heterogeneous, reconstruction)
  • DataFormats/SiStripDigiSoA (heterogeneous, reconstruction)
  • DataFormats/TrackSoA (heterogeneous, reconstruction)
  • DataFormats/TrackingRecHitSoA (heterogeneous, reconstruction)
  • DataFormats/VertexSoA (heterogeneous, reconstruction)
  • FWCore/Framework (core)
  • HeterogeneousCore/AlpakaCore (heterogeneous)
  • HeterogeneousCore/MPICore (heterogeneous)
  • HeterogeneousCore/TrivialSerialisation (heterogeneous)

@Dr15Jones, @Moanwar, @civanch, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth, @smuzaffar, @srimanob can you please review it and eventually sign? Thanks.
@IzaakWN, @ReyerBand, @VinInn, @VourMa, @abdoulline, @alesaggio, @argiro, @bsunanda, @dkotlins, @echabert, @elusian, @ferencek, @gpetruc, @hatakeyamak, @jlidrych, @lgray, @makortel, @mariadalfonso, @missirol, @mmasciov, @mmusich, @mroguljic, @mtosi, @pfs, @rchatter, @robervalwalsh, @rovere, @thomreis, @threus, @tsusa, @wang0jin, @wddgit this is something you requested to watch as well.
@ftenchini, @mandrenguyen, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

<use name="HeterogeneousCore/TrivialSerialisation"/>
<flags ALPAKA_BACKENDS="1"/>
<flags EDM_PLUGIN="1"/>
</library> No newline at end of file
Contributor Author

A newline will be added

Contributor Author

will be renamed to testGenericClonerPortable_cfg.py

bool hasCopyToHost() const override { return HasCopyToHost<T, Queue>; }

bool hasCopyToDevice() const override {
if constexpr (HasCopyToHost<T, Queue>) {
Contributor

Why do you check for HasCopyToHost before checking for HasCopyToDevice?

Contributor

To be followed up in a separate PR, see https://github.com/cms-ngt-hlt/ngt32/issues/46 .

}

std::function<std::shared_ptr<Queue>(edm::WrapperBase const&)> getQueue() const override {
if constexpr (HasCopyToHost<T, Queue>) {
Contributor

Why do you check for HasCopyToHost before returning the queue?
A DeviceProduct<T> can provide the queue even if the underlying type does not have a copy-to-host specialisation.

Contributor Author

Hmm true, thank you.
getQueue() is only used in the context of registering a D to H transformation, which requires a HasCopyToHost<T> to exist. That is why I made it return immediately when no such specialisation exists.

However, I see how this makes getQueue misleading. Since this check is already done by preTransformDtoH and transformDtoH I have removed it from getQueue.
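
To illustrate the point settled in this thread, here is a minimal standalone sketch, assuming simplified stand-ins for the real types: Queue, DeviceProduct, CopyToHost, Serialiser and the Hits/Opaque payloads are all hypothetical and not the CMSSW definitions. Only the copy is gated on HasCopyToHost; getQueue() returns the queue held by the wrapper for any payload type.

```cpp
#include <cassert>
#include <memory>
#include <stdexcept>
#include <type_traits>
#include <utility>

// Hypothetical stand-ins; none of these are the CMSSW definitions.
struct Queue { int id = 0; };

template <typename T>
struct DeviceProduct {
  std::shared_ptr<Queue> queue;
  T payload;
};

// Trait that is specialised only for types that can be copied to host.
template <typename T>
struct CopyToHost {};

struct Hits { int n = 0; };
struct Opaque {};

template <>
struct CopyToHost<Hits> {
  static Hits copy(Queue&, Hits const& hits) { return hits; }
};

// Detection idiom: true only if CopyToHost<T>::copy(Queue&, T const&) is well-formed.
template <typename T, typename = void>
inline constexpr bool HasCopyToHost = false;

template <typename T>
inline constexpr bool HasCopyToHost<
    T,
    std::void_t<decltype(CopyToHost<T>::copy(std::declval<Queue&>(), std::declval<T const&>()))>> = true;

struct SerialiserBase {
  virtual ~SerialiserBase() = default;
  virtual bool hasCopyToHost() const = 0;
};

template <typename T>
struct Serialiser : SerialiserBase {
  bool hasCopyToHost() const override { return HasCopyToHost<T>; }

  // The queue belongs to the wrapper, so no capability check is needed here.
  std::shared_ptr<Queue> getQueue(DeviceProduct<T> const& product) const { return product.queue; }

  // Only the copy itself depends on the copy-to-host capability.
  T copyToHost(DeviceProduct<T> const& product) const {
    if constexpr (HasCopyToHost<T>) {
      assert(product.queue != nullptr);
      return CopyToHost<T>::copy(*product.queue, product.payload);
    } else {
      throw std::logic_error("no copy-to-host specialisation for this type");
    }
  }
};
```

With this layout, a caller can query hasCopyToHost() through the type-erased base before attempting a copy, while getQueue() stays usable for payloads such as Opaque that have no specialisation.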

if (deviceSerialiser) {
entry.typeID = edm::TypeID{deviceSerialiser->productTypeID()};
entry.getToken =
this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
Contributor

Suggested change
- this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
+ this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, src);

?


if (verbose_) {
edm::LogInfo("GenericClonerPortable") << "will clone device product of type '" << type << "', label '"
<< label << "', instance '" << instance << "'";
Contributor

Suggested change
- << label << "', instance '" << instance << "'";
+ << src.label() << "', instance '" << src.instance() << "'";

Contributor

Actually, it may be possible to simply print src; can you check if this works

          edm::LogInfo("GenericClonerPortable") << "will clone device product of type '" << type << "', tag '" << src << '\'';

?

Contributor Author

Indeed, it works. Thank you

if (hostSerialiser) {
entry.typeID = edm::TypeID{twd.typeInfo()};
entry.getToken =
this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
Contributor

Suggested change
- this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
+ this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, src);

Comment on lines +184 to +185
edm::LogInfo("GenericClonerPortable") << "will clone host product of type '" << type << "', label '" << label
<< "', instance '" << instance << "'";
Contributor

Suggested change
- edm::LogInfo("GenericClonerPortable") << "will clone host product of type '" << type << "', label '" << label
- << "', instance '" << instance << "'";
+ edm::LogInfo("GenericClonerPortable") << "will clone host product of type '" << type << "', label '" << src.label()
+ << "', instance '" << src.instance() << "'";

}

entry.typeID = edm::TypeID{twd.typeInfo()};
entry.getToken = this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
Contributor

Suggested change
- entry.getToken = this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, edm::InputTag{label, instance});
+ entry.getToken = this->consumes(edm::TypeToGet{entry.typeID, edm::PRODUCT_TYPE}, src);

Comment on lines +210 to +211
edm::LogInfo("GenericClonerPortable") << "will clone ROOT-serialised product of type '" << type << "', label '"
<< label << "', instance '" << instance << "'";
Contributor

Suggested change
- edm::LogInfo("GenericClonerPortable") << "will clone ROOT-serialised product of type '" << type << "', label '"
- << label << "', instance '" << instance << "'";
+ edm::LogInfo("GenericClonerPortable") << "will clone ROOT-serialised product of type '" << type << "', label '"
+ << src.label() << "', instance '" << src.instance() << "'";

[copyAsync = std::forward<TCopyAsync>(copyAsync), synchronize = this->synchronize()](
edm::StreamID streamID, edm::WrapperBase const& wb, edm::WaitingTaskWithArenaHolder holder) -> std::any {
detail::EDMetadataAcquireSentry sentry(streamID, std::move(holder), synchronize);
auto productOnHost = copyAsync(sentry.metadata()->queue(), wb);
Contributor

Isn't this productOnDevice?

Contributor

please rename 🥺

Contributor Author

Sorry for not having noticed this...
It is fixed now.

@ghyls ghyls force-pushed the devel-mpi-generic branch from 46ce159 to 495dd60 Compare May 11, 2026 17:01
@cmsbuild

+code-checks

Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-50503/49298

@cmsbuild

Pull request #50503 was updated. @Dr15Jones, @Moanwar, @civanch, @cmsbuild, @fwyzard, @jfernan2, @kpedro88, @makortel, @mandrenguyen, @mdhildreth, @smuzaffar, @srimanob can you please check and sign again.

fwyzard commented May 14, 2026

allow @ghyls test rights

@makortel makortel left a comment

A few things that caught my eye:

std::string productInstance) {
TransformerBase::registerTransformAsyncImp(
*this, iToken, returnType, std::move(productInstance), std::move(iPre), std::move(iF));
}
Contributor

Please extend this overload to the other module base classes that provide Transformer (that is, global and limited).

The real test suite for the Transformer is in

<test name="testFWCoreIntegrationTransform" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py"/>
<test name="testFWCoreIntegrationTransform_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --async_"/>
<test name="testFWCoreIntegrationTransform_async_tracer" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --async_ --addTracer 2>&amp;1 | fgrep 'transform in event' | wc | awk '{print $1}' | fgrep 24"/>
<test name="testFWCoreIntegrationTransform_onPath" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --onPath"/>
<test name="testFWCoreIntegrationTransform_onPath_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --onPath --async_"/>
<test name="testFWCoreIntegrationTransform_noTransform" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noTransform"/>
<test name="testFWCoreIntegrationTransform_noTransform_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noTransform --async_"/>
<test name="testFWCoreIntegrationTransform_noTransform_onPath" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noTransform --onPath"/>
<test name="testFWCoreIntegrationTransform_noTransform_onPath_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noTransform --onPath --async_"/>
<test name="testFWCoreIntegrationTransform_stream" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --stream"/>
<test name="testFWCoreIntegrationTransform_stream_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --stream --async_"/>
<test name="testFWCoreIntegrationTransform_stream_onPath" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --stream --onPath"/>
<test name="testFWCoreIntegrationTransform_stream_onPath_async" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --stream --onPath --async_"/>
<test name="testFWCoreIntegrationTransform_noPut" command="! cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noPut"/>
<test name="testFWCoreIntegrationTransform_noPut_async" command="! cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --noPut --async_"/>
<test name="testFWCoreIntegrationTransform_exception" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --exception 2>&amp;1 | fgrep 'exception for testing purposes'"/>
<test name="testFWCoreIntegrationTransform_async_exception" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --async --exception 2>&amp;1 | fgrep 'exception for testing purposes'"/>
<test name="testFWCoreIntegrationTransform_onPath_exception" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --onPath --exception 2>&amp;1 | fgrep 'exception for testing purposes'"/>
<test name="testFWCoreIntegrationTransform_onPath_async_exception" command="cmsRun ${LOCALTOP}/src/FWCore/Integration/test/transformTest_cfg.py --onPath --async --exception 2>&amp;1 | fgrep 'exception for testing purposes'"/>

Could you extend those to cover this overload as well?

Comment on lines +129 to +133
edm::EDPutToken produces(edm::TypeID deviceProductType,
edm::TypeID hostProductType,
std::string instanceName,
TCopyAsync&& copyAsync,
TTransform&& transform) {
Contributor

I suppose this produces() overload can be called only when the copy operation exists for the host-to-device copy (which is different from the typed produces()), and the present API does not support producing a type-erased host-only data product. I'd suggest adding a comment about that.

Comment on lines +129 to +130
edm::EDPutToken produces(edm::TypeID deviceProductType,
edm::TypeID hostProductType,
Contributor

My personal preference would be to have the primarily produced type first, and the implicitly copied type second

Suggested change
- edm::EDPutToken produces(edm::TypeID deviceProductType,
- edm::TypeID hostProductType,
+ edm::EDPutToken produces(edm::TypeID hostProductType,
+ edm::TypeID deviceProductType,

TTransform&& transform) {
edm::EDPutToken token = this->producesCollector().template produces<Tr>(hostProductType, instanceName);

if constexpr (not detail::useProductDirectly) {
Contributor

In the detail::useProductDirectly == true case, should deviceProductType == hostProductType?

std::string instanceName,
TCopyAsync&& copyAsync,
TTransform&& transform) {
edm::EDPutToken token = this->producesCollector().template produces<Tr>(hostProductType, instanceName);
Contributor

Why go through producesCollector() instead of

Suggested change
- edm::EDPutToken token = this->producesCollector().template produces<Tr>(hostProductType, instanceName);
+ edm::EDPutToken token = Base::template produces<Tr>(hostProductType, instanceName);

?
