chore(NODE-1953): Implement new dynamic route provider#209
chore(NODE-1953): Implement new dynamic route provider#209blind-oracle wants to merge 13 commits into
Conversation
|
✅ No security or compliance issues detected. Reviewed everything up to 1e57956. Security Overview
Detected Code Changes| Change Type | Relevant files ... (code changes summary truncated to fit VCS comment limits.) |
There was a problem hiding this comment.
Pull request overview
This PR replaces the previous ic_bn_lib dynamic route provider integration with an in-repo dynamic route provider that discovers API boundary nodes, health-checks them, and selects routes using weighted round-robin based on latency and reliability.
Changes:
- Introduces a new dynamic routing subsystem (
FetcherManager→HealthCheckManager→RoutesManager→DynamicRouteProvider) undersrc/routing/ic/route_provider/. - Switches IC routing setup to use Hyper-based clients/services and updates mainnet root subnet ID handling to a
Principalconstant. - Adds a local WRR implementation and tests for WRR, health checking, fetching, and route selection.
Reviewed changes
Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.
Show a summary per file
| File | Description |
|---|---|
| src/routing/ic/routing_table_manager.rs | Updates tests to use the new MAINNET_ROOT_SUBNET_ID constant type. |
| src/routing/ic/route_provider/wrr.rs | Adds a weighted round-robin implementation used by the new route snapshot. |
| src/routing/ic/route_provider/routes.rs | Adds route snapshot building and ranking logic (latency/reliability → weights) plus tests. |
| src/routing/ic/route_provider/provider.rs | Adds DynamicRouteProvider and task orchestration managers implementing RouteProvider. |
| src/routing/ic/route_provider/mod.rs | Adds the new route provider module API and wiring via setup_route_provider. |
| src/routing/ic/route_provider/health.rs | Adds HTTP health checker + per-node health actors + manager producing HealthyNode snapshots. |
| src/routing/ic/route_provider/fetcher.rs | Adds agent-based node discovery fetcher + manager publishing refreshed node lists. |
| src/routing/ic/route_provider.rs | Removes the old route provider wiring (previous ic_bn_lib dynamic routing builder usage). |
| src/routing/ic/mod.rs | Changes MAINNET_ROOT_SUBNET_ID to a Principal constant and updates imports. |
| src/routing/ic/http_service.rs | Switches derive usage to derive_new::new. |
| src/core.rs | Rewires startup to build Hyper clients/services and pass them into the new route provider setup. |
| Cargo.toml | Adds thiserror dependency used by the new route provider module. |
| Cargo.lock | Locks thiserror version. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Bownairo
left a comment
There was a problem hiding this comment.
A lot of these are just questions, but I think I've mostly got it 🙂. The roles in the description were really helpful to guide the review.
| })); | ||
|
|
||
| info!( | ||
| "{self}: Got a list of API BNs ({}, {} invalid skipped): {node_list:?}", |
There was a problem hiding this comment.
What exactly? List is refreshed periodically
There was a problem hiding this comment.
Hmmm maybe this comment moved (or I was tired 😴).
I was wondering, when does this invalid case happen?
Hostname should be a valid FQDN & have at least one label
There was a problem hiding this comment.
Ah. Well, in case of some incorrect entry in the Registry potentially. When e.g. an empty hostname is used or it contains some unsupported characters for a FQDN.
| async fn fetch_nodes(&self) -> Result<Vec<String>, RouteError> { | ||
| let api_bns = self | ||
| .agent | ||
| .fetch_api_boundary_nodes_by_subnet_id(MAINNET_ROOT_SUBNET_ID) |
There was a problem hiding this comment.
This means the list is pulled from the NNS? Is it only ever different for testing?
There was a problem hiding this comment.
Yes, it's pulled from the registry. No, the list isn't static, though it changes not very frequently - like recently two Pakistani API BNs were added. And this might happen during runtime by a proposal, so we have to poll for a new list periodically and update it.
In testing I think we don't use the dynamic route provider, but a static one to target a single PocketIC node.
There was a problem hiding this comment.
Would we ever pull from a different subnet ID? Or, are all BNs associated with the NNS subnet?
(When would .fetch_api_boundary_nodes_by_subnet_id(OTHER_SUBNET_ID) be used?)
There was a problem hiding this comment.
Probably not. This subnet ID there is kinda redundant, I don't know why it's used in the first place in the agent API.
|
|
||
| // If we removed some nodes & didn't add anything - trigger an explicit update of healthy nodes. | ||
| // Otherwise removed nodes will be still available until some other node changes health status. | ||
| // If some nodes were added - then the update will come in order once their healthchecks are done. |
There was a problem hiding this comment.
If a newly added node is slow to healthcheck, old nodes would be stale for as long as the timeout - but I guess that's really not much time compared to the idle timeout, or another health status change?
There was a problem hiding this comment.
Yes, they will hang around for up to a health check timeout (currently 3s default).
This overlap is needed for a smooth transition in the case when the node list is fully replaced - like when transitioning from a seed list that consisted of e.g. just [ic0.app, icp0.io] to the fetched list with actual nodes. Or when something happens that changes all nodes in the NNS (unlikely event).
If we just remove the old nodes (seed ones) instantly, then we'll end up with no healthy nodes in the list for up to health check timeout until the new nodes are checked & added as healthy.
There was a problem hiding this comment.
Got it! In this seed case, nodes can be fully replaced even if they are healthy, so we use them in the meantime.
If we are strictly removing, we assume we're not removing all of them, so we have some other nodes to lean on, and pull the removed ones from the pool right away.
There was a problem hiding this comment.
Yes. Technically we can remove all of them and not add anything, but this would mean some edge case.
Maybe we need to add some protection against that (e.g. corrupted registry that might return an empty list w/o error), dunno. Like if an empty list comes - refuse it.
There was a problem hiding this comment.
I've added a safeguard against an empty list
| } | ||
| } | ||
|
|
||
| impl Display for RoutesManager { |
There was a problem hiding this comment.
Super nit: Most places in this PR, Display is before Debug 🤪.
There was a problem hiding this comment.
It's because Debug is using Display in most places, so it comes first :) {self}
There was a problem hiding this comment.
The super nit is that Route has:
Display -> Debug ({self})
and RoutesManager has:
Debug ({self}) -> Display
There was a problem hiding this comment.
Ah, will fix :)
| let fetcher_manager = | ||
| FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx); | ||
| tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token())); |
There was a problem hiding this comment.
The fetcher holds an Arc to this provider, so it won't drop until the fetcher does, but the fetcher doesn't drop until the provider cancels the token in its own drop, so I think we are stuck.
There was a problem hiding this comment.
Yes, shutting down of the provider was sucky.
I've reworked it a bit, removed Drop and added stop(). It looks a bit crappy since we need to return now (Arc<dyn RouteProvider>, Option<Arc<DynamicRouteProvider>>) from setup_route_provider(), but there seems no other good way, since Rust doesn't allow downcasting from Arc<dyn ...> if the trait doesn't have Any. And RouteProvider trait is defined in ic-agent and we can't really change it here (well we can create other trait that extends it but not sure it would look better).
https://doc.rust-lang.org/stable/std/sync/struct.Arc.html#method.downcast
There was a problem hiding this comment.
Do you think it would be possible to use a Weak? I guess that would also mean updating ic-agent, or at least creating yet another wrapper RouteProvivder to handle upgrades...
It would be a little weird between when the Arc is dropped and the FetcherManager is canceled - but it could be things are handled well enough that we just log a few "Refresh error:"?
There was a problem hiding this comment.
Otherwise the tuple with the Option seems fine. I guess this could be an enum, but it's only an extra Arc 🤷♀️.
There was a problem hiding this comment.
I think it makes no sense to bother, it seems to stop nicely:
- We cancel the token, actors are stopped
- When
FetcherManageris stopped - theArc<dyn FetchesNodes>(AgentFetcher) is dropped (it's the only reference, I guess we can even use Box if need be) decreasing refcount onArc<dyn RouteProvider> - After
stop()finishes - everything is cleaned up,Arc<dyn RouteProvider>still exists at this point and dropped at the end ofmain()
Bownairo
left a comment
There was a problem hiding this comment.
I just realized ic-agent still has its own DynamicRouteProvider and the rest of the stack. If that's in use elsewhere, does this one deserve a new name? Or will that one be cleared out?
| pub async fn stop(&self) { | ||
| self.token.cancel(); | ||
| self.tracker.close(); | ||
| self.tracker.wait().await; |
There was a problem hiding this comment.
nit on Eero: I don't know enough about the tracker to know why we wait here now... but I trust you 😅.
There was a problem hiding this comment.
Well, it just waits until all Futures under its control are joined and that's all. This way stop() returns only when it is really stopped and all actors are finished.
| let fetcher_manager = | ||
| FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx); | ||
| tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token())); |
There was a problem hiding this comment.
Do you think it would be possible to use a Weak? I guess that would also mean updating ic-agent, or at least creating yet another wrapper RouteProvivder to handle upgrades...
It would be a little weird between when the Arc is dropped and the FetcherManager is canceled - but it could be things are handled well enough that we just log a few "Refresh error:"?
| let fetcher_manager = | ||
| FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx); | ||
| tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token())); |
There was a problem hiding this comment.
Otherwise the tuple with the Option seems fine. I guess this could be an enum, but it's only an extra Arc 🤷♀️.
It's gated under a very-secret |
A bit simplified version of
DynamicRouteProviderfromagent-rs. It takes into account both node latency & reliability to pick the best nodes, uses Weighted Round Robin to do that.Entity roles are as follows:
FetcherManagerfetches a fresh list of API BNs usingArc<dyn FetchesNodes>(AgentFetcheris included to implement that trait) and sends this list down a channelRouteProviderManagergets a copy of it to share it withDynamicRouteProviderso that it knows how many total nodes there are (required forRouteProvidertrait)HealthCheckManageralso receives the same list and spawnsHealthCheckActors for each node to perform healthchecks usingArc<dyn ChecksHealth>(HttpHealthCheckeris included to do that using an HTTP client). Once the set of healthy nodes changes (and also periodically) it sends the healthy node list (along with latency/reliability stats) down a channel. When the list is updated it figures out which nodes are added/removed and stops & spawns the corresponding actors.RoutesManagerreceives this list, processes and builds a snapshot of URLs for theDynamicRouteProviderto use and shares it usingArcSwapDynamicRouteProviderimplements the actualRouteProvidertrait & uses WRR in theRouteSnapshotto respond to URL queries.