fix(#702, #703): key local_nodes by (participant_gid, node_fullname); release CString memory by yumeminami · Pull Request #705 · eclipse-zenoh/zenoh-plugin-ros2dds

yumeminami · 2026-05-28T17:02:38Z

Summary

Fixes #702 (subscriber route stuck is_active: false after a same-named ROS 2 node restarts) and part of #703 (Bad Parameter storm from leaked CStrings during rapid create/destroy churn).

The two bugs reinforce each other in practice: the #702 same-name race corrupts cyclonedds state on the publisher-side bridge, which then triggers a flood of failed to activate DDS Reader: Bad Parameter errors — exactly the #703 symptom — but in minutes instead of the hours the original #703 report needed.

Root cause

#702 — `local_nodes` keyed on node fullname only

Every route type (RouteSubscriber, RoutePublisher, service srv/cli, action srv/cli) tracked its serving nodes in:

local_nodes: HashSet<String>   // node fullname only

When a ROS 2 node restarts with the same name but a new participant GID, the bridge processes two events:

ADD for the new participant's reader/writer → route.add_local_node("/zephyr")
DISPOSE for the old participant's reader/writer → route.remove_local_node("/zephyr")

If the DISPOSE arrives after the ADD (common), it removes the entry the new participant just inserted. The route deactivates and never recovers: the new participant's ADD has already been processed, so nothing re-inserts the node into the now-empty set.

The bridge's ros2/node/<participant_gid>/... admin space stayed consistent because it was already keyed by participant GID. Only local_nodes lost the race.
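The race can be replayed with nothing but two hash sets. The following is a minimal sketch, not the bridge's code: `Gid` is a hypothetical stand-in for the bridge's participant GID type, and the two functions replay the ADD(old), ADD(new), DISPOSE(old) sequence against each keying scheme.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the bridge's participant GID type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Gid(pub [u8; 16]);

/// Replays the restart sequence against a set keyed by node fullname only.
/// Returns true if the node is still tracked afterwards.
pub fn survives_keyed_by_name() -> bool {
    let mut local_nodes: HashSet<String> = HashSet::new();
    local_nodes.insert("/zephyr".to_string()); // ADD from the old participant
    local_nodes.insert("/zephyr".to_string()); // ADD from the new participant: same key, no-op
    local_nodes.remove("/zephyr"); // late DISPOSE for the old participant
    local_nodes.contains("/zephyr")
}

/// Replays the same sequence against a set keyed by (participant GID, fullname).
/// Returns true if the new participant's entry is still tracked afterwards.
pub fn survives_keyed_by_gid_and_name() -> bool {
    let old = Gid([1; 16]);
    let new = Gid([2; 16]);
    let mut local_nodes: HashSet<(Gid, String)> = HashSet::new();
    local_nodes.insert((old, "/zephyr".to_string())); // ADD from the old participant
    local_nodes.insert((new, "/zephyr".to_string())); // ADD from the new participant: distinct key
    local_nodes.remove(&(old, "/zephyr".to_string())); // late DISPOSE only hits the old entry
    local_nodes.contains(&(new, "/zephyr".to_string()))
}
```

With the name-only key the late DISPOSE empties the set; with the tuple key it removes only the old participant's entry.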

#703 — CString leaks in `create_topic` / `create_dds_writer`

Both functions did:

let cton = CString::new(topic_name).unwrap().into_raw();
let ctyn = CString::new(type_name).unwrap().into_raw();
// ...hand to cyclonedds...
// no matching CString::from_raw — leaks every call

cyclonedds copies the strings internally, so freeing immediately after the create call is safe — and is exactly what ros_discovery.rs:171-172 already does.
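The reclaim pattern can be sketched in isolation. This is an illustration under stated assumptions, not the code in dds_utils.rs: `c_api_that_copies` is a hypothetical stand-in for a C function that copies its string argument before returning, which is the behavior the PR attributes to cyclonedds.

```rust
use std::ffi::{CStr, CString};
use std::os::raw::c_char;

/// Hypothetical stand-in for a C API that copies the string it receives.
/// Returns the number of bytes it copied.
unsafe fn c_api_that_copies(name: *const c_char) -> usize {
    // Copy the bytes out, as a real C library keeping its own copy would.
    let copy: Vec<u8> = unsafe { CStr::from_ptr(name) }.to_bytes().to_vec();
    copy.len()
}

/// Hands a string to the C side and frees the Rust allocation afterwards.
pub fn create_with_cleanup(topic_name: &str) -> usize {
    // into_raw gives up ownership: without a matching from_raw this leaks.
    let raw = CString::new(topic_name)
        .expect("topic name must not contain NUL")
        .into_raw();
    let copied = unsafe { c_api_that_copies(raw) };
    // Retake ownership so the allocation is dropped. This is only sound
    // because the callee no longer holds the pointer once it has returned.
    unsafe { drop(CString::from_raw(raw)) };
    copied
}
```

An alternative that avoids the raw round trip is to keep the `CString` in a local variable and pass `as_ptr()`, which leaves ownership on the Rust side; whether that fits the surrounding code in the plugin is not something this sketch can confirm.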

Fix

Two commits, each independently reviewable:

fix(#703): release CString memory after DDS topic and writer creation — dds_utils.rs only. Adds drop(CString::from_raw(...)) in both branches of create_topic and after dds_create_writer. Mirrors the existing cleanup pattern in ros_discovery.rs.
fix(#702): key local_nodes by (participant_gid, node_fullname) — the main change:
- Adds a Gid (participant) to every ROS2DiscoveryEvent variant.
- Adds a participant: Gid field on NodeInfo and threads it through every emitted Discovered*/Undiscovered* event.
- Changes local_nodes to HashSet<(Gid, String)> in all 6 route types.
- Custom serialize_local_nodes keeps the admin-space JSON format unchanged (still a deduplicated array of node fullname strings — participant GID is internal).
- Also fixes a secondary issue in RouteSubscriber::add_local_node which used if self.local_nodes.len() == 1 after an unconditional insert, so re-inserting an existing key would incorrectly re-run announce_route. Now matches RoutePublisher's correct 0→1-transition pattern.
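The tuple key, the 0→1 transition check, and the deduplicated name listing can be sketched together. Everything here is illustrative: `Gid`, `Route`, and the `announcements` counter (which stands in for calling announce_route) are hypothetical, and the sorted output order is an assumption made only to keep the result deterministic.

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for the bridge's participant GID type.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct Gid(pub [u8; 16]);

/// Hypothetical route holding only the bookkeeping discussed above.
#[derive(Default)]
pub struct Route {
    local_nodes: HashSet<(Gid, String)>,
    /// Counts how often the route would have been announced.
    pub announcements: usize,
}

impl Route {
    pub fn add_local_node(&mut self, participant: Gid, node: &str) {
        let was_empty = self.local_nodes.is_empty();
        let inserted = self.local_nodes.insert((participant, node.to_string()));
        // Announce only on the 0 -> 1 transition, not on every insert
        // that happens to leave the set with exactly one element.
        if was_empty && inserted {
            self.announcements += 1;
        }
    }

    pub fn remove_local_node(&mut self, participant: Gid, node: &str) {
        self.local_nodes.remove(&(participant, node.to_string()));
    }

    /// Node fullnames with the participant GID dropped and duplicates merged,
    /// i.e. the shape the admin-space JSON array is described as keeping.
    pub fn local_node_names(&self) -> Vec<String> {
        let names: BTreeSet<&str> = self
            .local_nodes
            .iter()
            .map(|(_, name)| name.as_str())
            .collect();
        names.into_iter().map(str::to_string).collect()
    }
}
```

Re-adding the same (participant, name) pair leaves `announcements` at 1, and two participants sharing the name `/zephyr` still appear as a single string in `local_node_names()`.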

How we tested

A docker-compose-based reproducer (not included in this PR — happy to send as a follow-up if useful) that runs entirely on plain Docker bridge networks, no host networking:

bridge-a — zenoh-bridge-ros2dds in router mode, REST admin enabled on :8000. On ros-a network with ROS_DOMAIN_ID=42.
bridge-b — zenoh-bridge-ros2dds in client mode, connecting to bridge-a over TCP/7447. On ros-b network with ROS_DOMAIN_ID=43.
publisher-b — ROS 2 node named zephyr publishing std_msgs/String on /gripper/button/left, on the ros-b network.
subscriber-a — driver loop that restarts a same-named ROS 2 node zephyr every ~2 s, subscribing to the same topic, on the ros-a network.
probe — polls bridge-a's REST admin space every 1 s for ~/ros2/route/topic/sub/... and ~/ros2/route/topic/pub/...; exits 0 once 5 consecutive probes show both routes healthy (local_nodes and remote_routes populated, dds_reader non-empty), exits 1 after 90 s otherwise.
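The probe's exit rule is a consecutive-success counter. The reproducer itself is not part of this PR, so the following is only a sketch of that rule: `probe_exit_code` is a hypothetical name, each `bool` stands for one poll of the admin space, and the end of the sample sequence stands in for the 90 s timeout.

```rust
/// Returns 0 as soon as `needed` consecutive samples are healthy,
/// or 1 if the samples run out first (the timeout case).
pub fn probe_exit_code<I>(samples: I, needed: usize) -> i32
where
    I: IntoIterator<Item = bool>,
{
    let mut streak = 0usize;
    for healthy in samples {
        // Any unhealthy sample resets the streak.
        streak = if healthy { streak + 1 } else { 0 };
        if streak >= needed {
            return 0;
        }
    }
    1
}
```

Resetting the streak on every unhealthy sample is what lets a restart in progress (one unhealthy probe between healthy ones) delay the verdict without failing it.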

Different ROS_DOMAIN_ID values on either side force the topic across the two bridges — no DDS shortcut. Both sides naming their node zephyr makes the bookkeeping race that #702 describes easy to hit, and (in our run) reliably also triggers the #703 publisher-side Bad Parameter storm within seconds.

Verification

Before the fix — `eclipse/zenoh-bridge-ros2dds:1.9.0`

Probe never reaches healthy state across the full 90 s window:

probe 86: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 87: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 88: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 89: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
not reproduced within probe window (and never reached healthy state)
probe-1 exited with code 1

Note pub side: local=[] (the local zephyr publisher should be tracked here) while remote=[...] is correctly populated. This is the bridge-b mirror of #702 — the publisher-side route knows the remote subscriber exists but does not realize a local publisher node exists, so the routed dds_reader never matches a usable local entity and no messages flow.

Subscriber never receives a single message across 36 same-name restarts:

[INFO] [zephyr]: received 0 messages
[INFO] [zephyr]: received 0 messages
[INFO] [zephyr]: received 0 messages
... (every single restart)

Bridge-b logs (earlier longer run): 12,675 occurrences of Error creating DDS Reader: Bad Parameter in a few minutes, with the matching listener thrashing in a tight loop because activation kept failing but the matching status was never cleared.

After the fix — image built from this branch

Probe reaches healthy state within seconds:

probe 00: sub[active=True local=['/zephyr'] remote=['ee18...:gripper/button/left']] pub[reader=set local=['/zephyr'] remote=['f49b...:gripper/button/left']] node_has_subscriber=True
probe 01: sub[active=False local=[] remote=['ee18...:gripper/button/left']] pub[reader=empty local=['/zephyr'] remote=[]] node_has_subscriber=False    ← subscriber restart in progress
probe 02: sub[active=True local=['/zephyr'] remote=['ee18...:gripper/button/left']] pub[reader=set local=['/zephyr'] remote=['f49b...:gripper/button/left']] node_has_subscriber=True    ← auto-recovers
probe 03-05: healthy
HEALTHY: route flow steady — sub active with local+remote, pub has dds_reader, both populated
probe-1 exited with code 0

Note pub side after fix: local=['/zephyr'] is correctly populated — the bridge now recognises the same-named local publisher as a distinct instance from any prior instance.

Subscriber receives messages from the first restart onward:

[INFO] [zephyr]: first message: button 0
[INFO] [zephyr]: received 9 messages
[INFO] [zephyr]: first message: button 12
[INFO] [zephyr]: received 10 messages
[INFO] [zephyr]: first message: button 25

Bad Parameter errors: 0 (vs. 12,675 against 1.9.0).

| Metric | `eclipse/zenoh-bridge-ros2dds:1.9.0` | This branch |
| --- | --- | --- |
| Subscriber message count | 0 (all 36 restarts) | 9 → 10 → ... continuously |
| Pub route `local_nodes` | `[]` | `['/zephyr']` |
| Pub route `dds_reader` | `set` but useless (no local) | `set`, working |
| `Bad Parameter` errors | 12,675 | 0 |
| `HEALTHY` reached | never (90 s window) | within 5 s |
| Probe exit code | 1 | 0 |

Notes on scope

dds_utils.rs also contains a Box::into_raw leak for listener callbacks in create_dds_reader (the other half of #703). That fix requires changing the listener-arg type to be erased uniformly and touching RoutePublisher::activate/deactivate_dds_reader; it is held back to a follow-up PR to keep this one reviewable. The CString half of #703 is fixed here; the Box half remains a slow leak under churn but no longer compounds with the #702 race-induced storm.
The admin-space JSON format is unchanged — verified manually against the same REST queries (@/<zid>/ros2/route/topic/sub/... and @/<zid>/ros2/route/topic/pub/...).
All workspace cargo check / cargo build --release succeed.

Test plan

cargo check --workspace clean
cargo build --release -p zenoh-bridge-ros2dds clean
Repro: probe exits 0 (healthy) against fix image
Repro: probe exits 1 (healthy state never reached) against eclipse/zenoh-bridge-ros2dds:1.9.0
Bridge logs show 0 Bad Parameter after fix (vs. ~12k against v1.9.0)
Admin-space JSON shape unchanged
Long-run (valgrind / hours) to confirm CString leak elimination — not done here; relying on code review for that part.

…iter creation

create_topic and create_dds_writer leak two CString allocations per call: the topic name and the type name are converted via CString::into_raw and handed to cyclonedds (which copies them internally), but never reclaimed. Under workloads that churn DDS readers/writers (e.g. publishers or subscribers that re-create their endpoints on a short cycle), this accumulates over hours and eventually exhausts entity slots in the underlying DDS layer ("Error creating DDS Reader: Bad Parameter").

Mirrors the cleanup pattern already used in ros_discovery.rs. Closes part of eclipse-zenoh#703.

…lname)

A zenoh<->DDS route can get stuck `is_active: false` with an empty `local_nodes` after a ROS 2 node restarts under the same name with a new participant GID. The local DDS subscriber is matched, the bridge's own ROS graph shows the subscription, but `local_nodes` desyncs and the route never reactivates until a brand-new (differently named) subscriber appears.

Root cause: every route's `local_nodes` was a `HashSet<String>` keyed on the node fullname alone. When a node restarted with the same name but a new participant GID, a late-arriving DISPOSE for the old participant's reader/writer would clear the entry that the new participant's ADD had just re-inserted, since both events used the same string key. The admin space at `ros2/node/<participant_gid>/...` remained consistent because it was already keyed by participant GID; only `local_nodes` lost the race.

Fix:
- Carry the participant GID through every `ROS2DiscoveryEvent` variant.
- Add a `participant: Gid` field on `NodeInfo` and include it in every emitted Discovered*/Undiscovered* event.
- Change `local_nodes` to `HashSet<(Gid, String)>` in every route type (subscriber, publisher, service srv/cli, action srv/cli) and update add_local_node/remove_local_node to use the tuple.
- Preserve the admin-space JSON format with a custom serializer that emits a deduplicated array of node fullname strings, hiding the internal participant GID.

Also fix a secondary issue in `RouteSubscriber::add_local_node`: it ran `len() == 1` after an unconditional `insert`, so re-inserting an existing key (which is common under DISPOSE-then-ADD races) would incorrectly re-run `announce_route`. Now mirrors RoutePublisher's correct pattern (only fire on the 0→1 transition).
Side effect: in scenarios where a same-named subscriber on one side created an ADD/DISPOSE race that was previously corrupting cyclonedds state (and triggering a flood of "failed to activate DDS Reader: Bad Parameter" errors on the publisher side), the death loop also disappears once `local_nodes` is keyed correctly. Closes eclipse-zenoh#702. Mitigates the matching-listener Bad-Parameter storm in eclipse-zenoh#703 indirectly.

yumeminami added 2 commits May 29, 2026 00:54

yumeminami force-pushed the fix/702-local-nodes-participant-gid branch from 51d327a to 33e0078 Compare May 28, 2026 17:03

yumeminami wants to merge 2 commits into eclipse-zenoh:main from yumeminami:fix/702-local-nodes-participant-gid
