fix(#702, #703): key local_nodes by (participant_gid, node_fullname); release CString memory #705

Open
yumeminami wants to merge 2 commits into eclipse-zenoh:main from yumeminami:fix/702-local-nodes-participant-gid

yumeminami commented May 28, 2026

Summary

Fixes #702 (subscriber route stuck is_active: false after a same-named ROS 2 node restarts) and part of #703 (Bad Parameter storm from leaked CStrings during rapid create/destroy churn).

The two bugs reinforce each other in practice: the #702 same-name race corrupts cyclonedds state on the publisher-side bridge, which then triggers a flood of failed to activate DDS Reader: Bad Parameter errors — exactly the #703 symptom — but in minutes instead of the hours the original #703 report needed.

Root cause

#702: local_nodes keyed on node fullname only

Every route type (RouteSubscriber, RoutePublisher, service srv/cli, action srv/cli) tracked its serving nodes in:

local_nodes: HashSet<String>   // node fullname only

When a ROS 2 node restarts with the same name but a new participant GID, the bridge processes two events:

  1. ADD for the new participant's reader/writer → route.add_local_node("/zephyr")
  2. DISPOSE for the old participant's reader/writer → route.remove_local_node("/zephyr")

If the DISPOSE arrives after the ADD (common), it removes the entry the new participant just inserted. The route deactivates and never recovers: the new participant's ADD has already been processed, so the bridge treats the node as already added and no later event re-inserts it into the (now empty) set.

The bridge's ros2/node/<participant_gid>/... admin space stayed consistent because it was already keyed by participant GID. Only local_nodes lost the race.
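The race can be replayed in isolation. The following is a minimal sketch, not the bridge's code: `Gid` is a hypothetical `u64` stand-in for the real participant GID type, and both helper functions are made up for illustration. It only shows why a fullname-only key loses the entry when the old participant's DISPOSE arrives late, while a `(Gid, String)` key does not.

```rust
use std::collections::HashSet;

/// Hypothetical stand-in for the bridge's participant GID type.
type Gid = u64;

/// Replays "ADD (old), ADD (new), late DISPOSE (old)" against a set keyed
/// by node fullname only. Returns true if a local node is still tracked.
fn survives_keyed_by_name(node: &str) -> bool {
    let mut local_nodes: HashSet<String> = HashSet::new();
    local_nodes.insert(node.to_string()); // ADD from the old participant
    local_nodes.insert(node.to_string()); // ADD from the new participant: same key, no-op
    local_nodes.remove(node); // late DISPOSE from the old participant
    !local_nodes.is_empty()
}

/// Replays the same event order against a set keyed by (participant, fullname).
fn survives_keyed_by_gid_and_name(old: Gid, new: Gid, node: &str) -> bool {
    let mut local_nodes: HashSet<(Gid, String)> = HashSet::new();
    local_nodes.insert((old, node.to_string())); // ADD from the old participant
    local_nodes.insert((new, node.to_string())); // ADD from the new participant: distinct key
    local_nodes.remove(&(old, node.to_string())); // late DISPOSE removes only the old entry
    !local_nodes.is_empty()
}
```

With the name-only key the set ends up empty after the late DISPOSE; with the tuple key the new participant's entry remains.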

#703 — CString leaks in create_topic / create_dds_writer

Both functions did:

let cton = CString::new(topic_name).unwrap().into_raw();
let ctyn = CString::new(type_name).unwrap().into_raw();
// ...hand to cyclonedds...
// no matching CString::from_raw — leaks every call

cyclonedds copies the strings internally, so freeing immediately after the create call is safe — and is exactly what ros_discovery.rs:171-172 already does.
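As an illustration of the reclaim pattern, here is a small self-contained sketch. `fake_create_topic` is a hypothetical stand-in for a C API that copies its string argument before returning (which is what this PR relies on for cyclonedds); it is not a cyclonedds binding, and `create_and_reclaim` is not a function from dds_utils.rs.

```rust
use std::ffi::{c_char, CStr, CString};

/// Hypothetical stand-in for a C function that copies its string argument
/// before returning. Returns the number of bytes it copied.
unsafe extern "C" fn fake_create_topic(name: *const c_char) -> usize {
    // Copy the bytes, as a C library keeping its own string would.
    let copy: Vec<u8> = unsafe { CStr::from_ptr(name) }.to_bytes().to_vec();
    copy.len()
}

/// Hands a C string to the callee, then reclaims the allocation.
fn create_and_reclaim(topic_name: &str) -> usize {
    // into_raw transfers ownership out of Rust; nothing frees it automatically.
    let raw: *mut c_char = CString::new(topic_name)
        .expect("topic name must not contain an interior NUL")
        .into_raw();

    // SAFETY: `raw` points to a valid NUL-terminated string for the whole call.
    let copied = unsafe { fake_create_topic(raw) };

    // SAFETY: `raw` came from CString::into_raw above and has not been freed.
    // Rebuilding the CString and dropping it releases the allocation.
    unsafe { drop(CString::from_raw(raw)) };

    copied
}
```

This is only safe when the callee does not retain the pointer after returning. A variant that avoids manual reclamation is to keep the `CString` in a local binding and pass `as_ptr()`, so it is dropped at the end of the scope.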

Fix

Two commits, each independently reviewable:

  1. fix(#703): release CString memory after DDS topic and writer creation (dds_utils.rs only). Adds drop(CString::from_raw(...)) in both branches of create_topic and after dds_create_writer. Mirrors the existing cleanup pattern in ros_discovery.rs.

  2. fix(#702): key local_nodes by (participant_gid, node_fullname) — the main change:

    • Adds a Gid (participant) to every ROS2DiscoveryEvent variant.
    • Adds a participant: Gid field on NodeInfo and threads it through every emitted Discovered*/Undiscovered* event.
    • Changes local_nodes to HashSet<(Gid, String)> in all 6 route types.
    • Custom serialize_local_nodes keeps the admin-space JSON format unchanged (still a deduplicated array of node fullname strings — participant GID is internal).
    • Also fixes a secondary issue in RouteSubscriber::add_local_node which used if self.local_nodes.len() == 1 after an unconditional insert, so re-inserting an existing key would incorrectly re-run announce_route. Now matches RoutePublisher's correct 0→1-transition pattern.
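The last two bullets can be sketched roughly as follows. This is an assumption-laden illustration, not the code in this branch: `Route`, `announcements`, and `local_node_names` are hypothetical names, `Gid` is a `u64` stand-in, the counter stands in for calling `announce_route`, and the sorted output order is a choice made here for determinism (the real serializer may order names differently).

```rust
use std::collections::{BTreeSet, HashSet};

/// Hypothetical stand-in for the bridge's participant GID type.
type Gid = u64;

#[derive(Default)]
struct Route {
    local_nodes: HashSet<(Gid, String)>,
    /// Counts how often the route would have been announced.
    announcements: usize,
}

impl Route {
    /// Inserts the (participant, fullname) key and announces only on the
    /// 0 -> 1 transition, so re-inserting an existing key never re-announces.
    fn add_local_node(&mut self, participant: Gid, node: &str) {
        let was_empty = self.local_nodes.is_empty();
        self.local_nodes.insert((participant, node.to_string()));
        if was_empty {
            self.announcements += 1; // stands in for announce_route
        }
    }

    /// Externally visible view: deduplicated node fullnames with the
    /// participant GID dropped, sorted for a stable order.
    fn local_node_names(&self) -> Vec<String> {
        let names: BTreeSet<&str> = self
            .local_nodes
            .iter()
            .map(|(_, name)| name.as_str())
            .collect();
        names.into_iter().map(str::to_owned).collect()
    }
}
```

Checking emptiness before the insert, rather than `len() == 1` after it, is what distinguishes a genuine first node from a repeated insert of the only existing key.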

How we tested

A docker-compose-based reproducer (not included in this PR — happy to send as a follow-up if useful) that runs entirely on plain Docker bridge networks, no host networking:

  • bridge-a: zenoh-bridge-ros2dds in router mode, REST admin enabled on :8000. On ros-a network with ROS_DOMAIN_ID=42.
  • bridge-b: zenoh-bridge-ros2dds in client mode, connecting to bridge-a over TCP/7447. On ros-b network with ROS_DOMAIN_ID=43.
  • publisher-b: ROS 2 node named zephyr publishing std_msgs/String on /gripper/button/left, on the ros-b network.
  • subscriber-a: driver loop that restarts a same-named ROS 2 node zephyr every ~2 s, subscribing to the same topic, on the ros-a network.
  • probe: polls bridge-a's REST admin space every 1 s for ~/ros2/route/topic/sub/... and ~/ros2/route/topic/pub/...; exits 0 once 5 consecutive probes show both routes healthy (local_nodes and remote_routes populated, dds_reader non-empty), exits 1 after 90 s otherwise.

Different ROS_DOMAIN_ID values on either side force the topic across the two bridges — no DDS shortcut. Both sides naming their node zephyr makes the bookkeeping race that #702 describes easy to hit, and (in our run) reliably also triggers the #703 publisher-side Bad Parameter storm within seconds.
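The probe's exit rule (five consecutive healthy polls before the window closes) amounts to a streak counter. The sketch below is not the actual probe script, which is not part of this PR; `first_steady_probe` is a hypothetical helper that takes already-evaluated health results instead of querying the REST admin space.

```rust
/// Returns the index of the probe at which `needed` consecutive healthy
/// results were first observed, or None if the samples run out first.
/// A `needed` of zero is treated as never satisfied.
fn first_steady_probe(samples: &[bool], needed: usize) -> Option<usize> {
    let mut streak = 0usize;
    for (index, &healthy) in samples.iter().enumerate() {
        // Any unhealthy probe resets the streak to zero.
        streak = if healthy { streak + 1 } else { 0 };
        if needed > 0 && streak >= needed {
            return Some(index);
        }
    }
    None
}
```

A `Some(_)` result would map to exit code 0 and `None` after the full window to exit code 1. Requiring a streak rather than a single healthy poll keeps a momentary recovery between two restarts from counting as success.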

Verification

Before the fix — eclipse/zenoh-bridge-ros2dds:1.9.0

Probe never reaches healthy state across the full 90 s window:

probe 86: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 87: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 88: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
probe 89: sub[active=True local=['/zephyr'] remote=[]] pub[reader=set local=[] remote=['56ca...:gripper/button/left']] node_has_subscriber=True
not reproduced within probe window (and never reached healthy state)
probe-1 exited with code 1

Note pub side: local=[] (the local zephyr publisher should be tracked here) while remote=[...] is correctly populated. This is the bridge-b mirror of #702 — the publisher-side route knows the remote subscriber exists but does not realize a local publisher node exists, so the routed dds_reader never matches a usable local entity and no messages flow.

Subscriber never receives a single message across 36 same-name restarts:

[INFO] [zephyr]: received 0 messages
[INFO] [zephyr]: received 0 messages
[INFO] [zephyr]: received 0 messages
... (every single restart)

Bridge-b logs (earlier longer run): 12,675 occurrences of Error creating DDS Reader: Bad Parameter in a few minutes, with the matching listener thrashing in a tight loop because activation kept failing but the matching status was never cleared.

After the fix — image built from this branch

Probe reaches healthy state within seconds:

probe 00: sub[active=True local=['/zephyr'] remote=['ee18...:gripper/button/left']] pub[reader=set local=['/zephyr'] remote=['f49b...:gripper/button/left']] node_has_subscriber=True
probe 01: sub[active=False local=[] remote=['ee18...:gripper/button/left']] pub[reader=empty local=['/zephyr'] remote=[]] node_has_subscriber=False    ← subscriber restart in progress
probe 02: sub[active=True local=['/zephyr'] remote=['ee18...:gripper/button/left']] pub[reader=set local=['/zephyr'] remote=['f49b...:gripper/button/left']] node_has_subscriber=True    ← auto-recovers
probe 03-05: healthy
HEALTHY: route flow steady — sub active with local+remote, pub has dds_reader, both populated
probe-1 exited with code 0

Note pub side after fix: local=['/zephyr'] is correctly populated — the bridge now recognises the same-named local publisher as a distinct instance from any prior instance.

Subscriber receives messages from the first restart onward:

[INFO] [zephyr]: first message: button 0
[INFO] [zephyr]: received 9 messages
[INFO] [zephyr]: first message: button 12
[INFO] [zephyr]: received 10 messages
[INFO] [zephyr]: first message: button 25

Bad Parameter errors: 0 (vs. 12,675 against 1.9.0).

| Metric | eclipse/zenoh-bridge-ros2dds:1.9.0 | This branch |
|---|---|---|
| Subscriber message count | 0 (all 36 restarts) | 9 → 10 → ... continuously |
| Pub route local_nodes | [] | ['/zephyr'] |
| Pub route dds_reader | set but useless (no local) | set, working |
| Bad Parameter errors | 12,675 | 0 |
| HEALTHY reached | never (90 s window) | within 5 s |
| Probe exit code | 1 | 0 |

Notes on scope

Test plan

  • cargo check --workspace clean
  • cargo build --release -p zenoh-bridge-ros2dds clean
  • Repro: probe exits 0 (healthy) against fix image
  • Repro: probe exits 1 (not reproduced) against eclipse/zenoh-bridge-ros2dds:1.9.0
  • Bridge logs show 0 Bad Parameter after fix (vs. ~12k against v1.9.0)
  • Admin-space JSON shape unchanged
  • Long-run (valgrind / hours) to confirm CString leak elimination — not done here; relying on code review for that part.

…iter creation

create_topic and create_dds_writer leak two CString allocations per call:
the topic name and the type name are converted via CString::into_raw and
handed to cyclonedds (which copies them internally), but never reclaimed.
Under workloads that churn DDS readers/writers (e.g. publishers or
subscribers that re-create their endpoints on a short cycle), this
accumulates over hours and eventually exhausts entity slots in the
underlying DDS layer ("Error creating DDS Reader: Bad Parameter").

Mirrors the cleanup pattern already used in ros_discovery.rs.

Closes part of eclipse-zenoh#703.
…lname)

A zenoh<->DDS route can get stuck `is_active: false` with an empty
`local_nodes` after a ROS 2 node restarts under the same name with a
new participant GID. The local DDS subscriber is matched, the bridge's
own ROS graph shows the subscription, but `local_nodes` desyncs and
the route never reactivates until a brand-new (differently named)
subscriber appears.

Root cause: every route's `local_nodes` was a `HashSet<String>` keyed
on the node fullname alone. When a node restarted with the same name
but a new participant GID, a late-arriving DISPOSE for the old
participant's reader/writer would clear the entry that the new
participant's ADD had just re-inserted, since both events used the
same string key. The admin space at `ros2/node/<participant_gid>/...`
remained consistent because it was already keyed by participant GID;
only `local_nodes` lost the race.

Fix:
- Carry the participant GID through every `ROS2DiscoveryEvent` variant.
- Add a `participant: Gid` field on `NodeInfo` and include it in every
  emitted Discovered*/Undiscovered* event.
- Change `local_nodes` to `HashSet<(Gid, String)>` in every route type
  (subscriber, publisher, service srv/cli, action srv/cli) and update
  add_local_node/remove_local_node to use the tuple.
- Preserve the admin-space JSON format with a custom serializer that
  emits a deduplicated array of node fullname strings, hiding the
  internal participant GID.

Also fix a secondary issue in `RouteSubscriber::add_local_node`: it
ran `len() == 1` after an unconditional `insert`, so re-inserting an
existing key (which is common under DISPOSE-then-ADD races) would
incorrectly re-run `announce_route`. Now mirrors RoutePublisher's
correct pattern (only fire on the 0→1 transition).

Side effect: in scenarios where a same-named subscriber on one side
created an ADD/DISPOSE race that was previously corrupting cyclonedds
state (and triggering a flood of "failed to activate DDS Reader: Bad
Parameter" errors on the publisher side), the death loop also
disappears once `local_nodes` is keyed correctly.

Closes eclipse-zenoh#702. Mitigates the matching-listener Bad-Parameter storm in
eclipse-zenoh#703 indirectly.
yumeminami force-pushed the fix/702-local-nodes-participant-gid branch from 51d327a to 33e0078 on May 28, 2026 17:03