Skip to content

Fix mavros_router segfault with mutex on remote_addrs#2138

Merged
vooon merged 4 commits into
mavlink:ros2from
JonasPerolini:fix-segfault-router
Mar 10, 2026
Merged

Fix mavros_router segfault with mutex on remote_addrs#2138
vooon merged 4 commits into
mavlink:ros2from
JonasPerolini:fix-segfault-router

Conversation

@JonasPerolini
Copy link
Copy Markdown

In the mavros_router, I've encountered a segmentation fault because this->remote_addrs is accessed by two threads concurrently without lock mechanism.

Solution: protect concurrent calls with a mutex: remote_addrs_mutex.

To reproduce the segfault: change the frequency of the timer below from 60s to 1ms

this->create_wall_timer(60s, std::bind(&Router::periodic_clear_stale_remote_addrs, this));

@vooon
Copy link
Copy Markdown
Member

vooon commented Mar 10, 2026

Thanks for tracking down this race — the segfault is real and the fix direction is right.

I have an in-flight refactor of the same file in #2105 that restructures route_message() (replaces the goto retry with a collect_targets lambda, narrows lock scopes in add/del_endpoint, etc.). That PR does not fix the remote_addrs race you found, so your fix is still needed — but the two will conflict in route_message().

To keep both PRs mergeable with minimal rebase pain, could you make two changes?

1. Remove the lock in periodic_reconnect_endpoints()

That function never reads or writes remote_addrs / stale_addrs, so the lock is unnecessary. Worse, it holds the mutex across p->open() which can block on network I/O (TCP connect, serial open), stalling recv_message() and route_message() for that endpoint.

2. Use std::shared_mutex instead of std::mutex

route_message() is the hot path and only reads remote_addrs. A std::shared_mutex lets multiple concurrent routers read without blocking each other:

// header
std::shared_mutex remote_addrs_mutex;

// route_message (read path)
std::shared_lock lock(dest->remote_addrs_mutex);
has_target = dest->remote_addrs.find(target_addr) != dest->remote_addrs.end();

// recv_message, periodic_clear_stale_remote_addrs, diag_run (write / copy paths)
std::unique_lock lock(this->remote_addrs_mutex);

@JonasPerolini
Copy link
Copy Markdown
Author

Thank you for the feedback @vooon. I've updated the code accordingly. Were you able to reproduce the segfault?

(Side note that I also did not have the lock inside periodic_reconnect_endpoints() when I first dev the code on a fork based on an old commit, and added it by mistake once too many times when porting the changes for the upstream PR. Sorry.)

Comment thread mavros/src/lib/mavros_router.cpp Outdated
stat.addf("Remotes count", "%zu", this->remote_addrs.size());
std::set<addr_t> remote_addrs_copy;
{
std::unique_lock<std::shared_mutex> lock(this->remote_addrs_mutex);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe shared lock, since we're just reading here?

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed @vooon updated in bda0aa1 for both remote_addrs_copy = this->remote_addrs;

We could also change unique_lock lock(mu); to shared_lock lock(mu); inside periodic_clear_stale_remote_addrs because this->endpoints does not seem to be modified but probably out of scope for this PR

Copy link
Copy Markdown
Member

@vooon vooon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, Thanks!

@vooon vooon merged commit d07483e into mavlink:ros2 Mar 10, 2026
6 checks passed
@vooon vooon added this to the Version 2.15 milestone Mar 10, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants