fix(#702, #703): key local_nodes by (participant_gid, node_fullname); release CString memory#705
Open
yumeminami wants to merge 2 commits into
Open
Conversation
…iter creation
create_topic and create_dds_writer leak two CString allocations per call:
the topic name and the type name are converted via CString::into_raw and
handed to cyclonedds (which copies them internally), but never reclaimed.
Under workloads that churn DDS readers/writers (e.g. publisher subscribers
that re-create their endpoints on a short cycle), this accumulates over
hours and eventually exhausts entity slots in the underlying DDS layer
("Error creating DDS Reader: Bad Parameter").
Mirrors the cleanup pattern already used in ros_discovery.rs.
Closes part of eclipse-zenoh#703.
…lname) A zenoh<->DDS route can get stuck `is_active: false` with an empty `local_nodes` after a ROS 2 node restarts under the same name with a new participant GID. The local DDS subscriber is matched, the bridge's own ROS graph shows the subscription, but `local_nodes` desyncs and the route never reactivates until a brand-new (differently named) subscriber appears. Root cause: every route's `local_nodes` was a `HashSet<String>` keyed on the node fullname alone. When a node restarted with the same name but a new participant GID, a late-arriving DISPOSE for the old participant's reader/writer would clear the entry that the new participant's ADD had just re-inserted, since both events used the same string key. The admin space at `ros2/node/<participant_gid>/...` remained consistent because it was already keyed by participant GID; only `local_nodes` lost the race. Fix: - Carry the participant GID through every `ROS2DiscoveryEvent` variant. - Add a `participant: Gid` field on `NodeInfo` and include it in every emitted Discovered*/Undiscovered* event. - Change `local_nodes` to `HashSet<(Gid, String)>` in every route type (subscriber, publisher, service srv/cli, action srv/cli) and update add_local_node/remove_local_node to use the tuple. - Preserve the admin-space JSON format with a custom serializer that emits a deduplicated array of node fullname strings, hiding the internal participant GID. Also fix a secondary issue in `RouteSubscriber::add_local_node`: it ran `len() == 1` after an unconditional `insert`, so re-inserting an existing key (which is common under DISPOSE-then-ADD races) would incorrectly re-run `announce_route`. Now mirrors RoutePublisher's correct pattern (only fire on the 0→1 transition). Side effect: in scenarios where a same-named subscriber on one side created an ADD/DISPOSE race that was previously corrupting cyclonedds state (and triggering a flood of "failed to activate DDS Reader: Bad Parameter" errors on the publisher side), the death loop also disappears once `local_nodes` is keyed correctly. Closes eclipse-zenoh#702. Mitigates the matching-listener Bad-Parameter storm in eclipse-zenoh#703 indirectly.
51d327a to
33e0078
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #702 (subscriber route stuck
is_active: falseafter a same-named ROS 2 node restarts) and part of #703 (Bad Parameter storm from leaked CStrings during rapid create/destroy churn).The two bugs reinforce each other in practice: the #702 same-name race corrupts cyclonedds state on the publisher-side bridge, which then triggers a flood of
failed to activate DDS Reader: Bad Parametererrors — exactly the #703 symptom — but in minutes instead of the hours the original #703 report needed.Root cause
#702 —
local_nodeskeyed on node fullname onlyEvery route type (
RouteSubscriber,RoutePublisher, service srv/cli, action srv/cli) tracked its serving nodes in:When a ROS 2 node restarts with the same name but a new participant GID, the bridge processes two events:
route.add_local_node("/zephyr")route.remove_local_node("/zephyr")If the DISPOSE arrives after the ADD (common), it removes the entry the new participant just inserted. The route deactivates and never recovers because subsequent events for the new participant key into the (now empty) set as if the node had already been added.
The bridge's
ros2/node/<participant_gid>/...admin space stayed consistent because it was already keyed by participant GID. Onlylocal_nodeslost the race.#703 — CString leaks in
create_topic/create_dds_writerBoth functions did:
cyclonedds copies the strings internally, so freeing immediately after the create call is safe — and is exactly what
ros_discovery.rs:171-172already does.Fix
Two commits, each independently reviewable:
fix(#703): release CString memory after DDS topic and writer creation—dds_utils.rsonly. Addsdrop(CString::from_raw(...))in both branches ofcreate_topicand afterdds_create_writer. Mirrors the existing cleanup pattern inros_discovery.rs.fix(#702): key local_nodes by (participant_gid, node_fullname)— the main change:Gid(participant) to everyROS2DiscoveryEventvariant.participant: Gidfield onNodeInfoand threads it through every emitted Discovered*/Undiscovered* event.local_nodestoHashSet<(Gid, String)>in all 6 route types.serialize_local_nodeskeeps the admin-space JSON format unchanged (still a deduplicated array of node fullname strings — participant GID is internal).RouteSubscriber::add_local_nodewhich usedif self.local_nodes.len() == 1after an unconditional insert, so re-inserting an existing key would incorrectly re-runannounce_route. Now matchesRoutePublisher's correct 0→1-transition pattern.How we tested
A docker-compose-based reproducer (not included in this PR — happy to send as a follow-up if useful) that runs entirely on plain Docker bridge networks, no host networking:
bridge-a—zenoh-bridge-ros2ddsin router mode, REST admin enabled on:8000. Onros-anetwork withROS_DOMAIN_ID=42.bridge-b—zenoh-bridge-ros2ddsin client mode, connecting tobridge-aover TCP/7447. Onros-bnetwork withROS_DOMAIN_ID=43.publisher-b— ROS 2 node namedzephyrpublishingstd_msgs/Stringon/gripper/button/left, on theros-bnetwork.subscriber-a— driver loop that restarts a same-named ROS 2 nodezephyrevery ~2 s, subscribing to the same topic, on theros-anetwork.probe— pollsbridge-a's REST admin space every 1 s for~/ros2/route/topic/sub/...and~/ros2/route/topic/pub/...; exits 0 once 5 consecutive probes show both routes healthy (local_nodesandremote_routespopulated,dds_readernon-empty), exits 1 after 90 s otherwise.Different
ROS_DOMAIN_IDvalues on either side force the topic across the two bridges — no DDS shortcut. Both sides naming their nodezephyrmakes the bookkeeping race that #702 describes easy to hit, and (in our run) reliably also triggers the #703 publisher-sideBad Parameterstorm within seconds.Verification
Before the fix —
eclipse/zenoh-bridge-ros2dds:1.9.0Probe never reaches healthy state across the full 90 s window:
Note pub side:
local=[](the local zephyr publisher should be tracked here) whileremote=[...]is correctly populated. This is the bridge-b mirror of #702 — the publisher-side route knows the remote subscriber exists but does not realize a local publisher node exists, so the routeddds_readernever matches a usable local entity and no messages flow.Subscriber never receives a single message across 36 same-name restarts:
Bridge-b logs (earlier longer run): 12,675 occurrences of
Error creating DDS Reader: Bad Parameterin a few minutes, with the matching listener thrashing in a tight loop because activation kept failing but the matching status was never cleared.After the fix — image built from this branch
Probe reaches healthy state within seconds:
Note pub side after fix:
local=['/zephyr']is correctly populated — the bridge now recognises the same-named local publisher as a distinct instance from any prior instance.Subscriber receives messages from the first restart onward:
Bad Parametererrors: 0 (vs. 12,675 against1.9.0).eclipse/zenoh-bridge-ros2dds:1.9.0local_nodes[]['/zephyr']dds_readersetbut useless (no local)set, workingBad ParametererrorsHEALTHYreachedNotes on scope
dds_utils.rsalso contains aBox::into_rawleak for listener callbacks increate_dds_reader(the other half of [BUG] zenoh-bridge-ros2dds v1.9.0: rapid create/destroy of a subscriber destabilizes DDS ("Error creating DDS Reader: Bad Parameter", transport closes) #703). That fix requires changing the listener-arg type to be erased uniformly and touchingRoutePublisher::activate/deactivate_dds_reader— held back to a follow-up PR to keep this one reviewable. The CString half of [BUG] zenoh-bridge-ros2dds v1.9.0: rapid create/destroy of a subscriber destabilizes DDS ("Error creating DDS Reader: Bad Parameter", transport closes) #703 is fixed here; the Box half remains a slow leak under churn but no longer compounds with the [BUG] Subscriber route stuckis_active:false(emptylocal_nodes) after a same-named subscriber node restarts, though discovery shows the subscription [v1.9.0] #702 race-induced storm.@/<zid>/ros2/route/topic/sub/...and@/<zid>/ros2/route/topic/pub/...).cargo check/cargo build --releasesucceed.Test plan
cargo check --workspacecleancargo build --release -p zenoh-bridge-ros2ddscleaneclipse/zenoh-bridge-ros2dds:1.9.0Bad Parameterafter fix (vs. ~12k against v1.9.0)valgrind/ hours) to confirm CString leak elimination — not done here; relying on code review for that part.