Skip to content

chore(NODE-1953): Implement new dynamic route provider#209

Open
blind-oracle wants to merge 13 commits into
mainfrom
igor/route-provider
Open

chore(NODE-1953): Implement new dynamic route provider#209
blind-oracle wants to merge 13 commits into
mainfrom
igor/route-provider

Conversation

@blind-oracle

Copy link
Copy Markdown
Collaborator

A bit simplified version of DynamicRouteProvider from agent-rs. It takes into account both node latency & reliability to pick the best nodes, uses Weighted Round Robin to do that.

Entity roles are as follows:

  • FetcherManager fetches a fresh list of API BNs using Arc<dyn FetchesNodes> (AgentFetcher is included to implement that trait) and sends this list down a channel
  • RouteProviderManager gets a copy of it to share it with DynamicRouteProvider so that it knows how many total nodes there are (required for RouteProvider trait)
  • HealthCheckManager also receives the same list and spawns HealthCheckActors for each node to perform healthchecks using Arc<dyn ChecksHealth> (HttpHealthChecker is included to do that using an HTTP client). Once the set of healthy nodes changes (and also periodically) it sends the healthy node list (along with latency/reliability stats) down a channel. When the list is updated it figures out which nodes are added/removed and stops & spawns the corresponding actors.
  • RoutesManager receives this list, processes and builds a snapshot of URLs for the DynamicRouteProvider to use and shares it using ArcSwap
  • DynamicRouteProvider implements the actual RouteProvider trait & uses WRR in the RouteSnapshot to respond to URL queries.

@zeropath-ai

zeropath-ai Bot commented Jun 12, 2026

Copy link
Copy Markdown

No security or compliance issues detected. Reviewed everything up to 1e57956.

Security Overview
Detected Code Changes

| Change Type | Relevant files

... (code changes summary truncated to fit VCS comment limits.)

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR replaces the previous ic_bn_lib dynamic route provider integration with an in-repo dynamic route provider that discovers API boundary nodes, health-checks them, and selects routes using weighted round-robin based on latency and reliability.

Changes:

  • Introduces a new dynamic routing subsystem (FetcherManagerHealthCheckManagerRoutesManagerDynamicRouteProvider) under src/routing/ic/route_provider/.
  • Switches IC routing setup to use Hyper-based clients/services and updates mainnet root subnet ID handling to a Principal constant.
  • Adds a local WRR implementation and tests for WRR, health checking, fetching, and route selection.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.

Show a summary per file
File Description
src/routing/ic/routing_table_manager.rs Updates tests to use the new MAINNET_ROOT_SUBNET_ID constant type.
src/routing/ic/route_provider/wrr.rs Adds a weighted round-robin implementation used by the new route snapshot.
src/routing/ic/route_provider/routes.rs Adds route snapshot building and ranking logic (latency/reliability → weights) plus tests.
src/routing/ic/route_provider/provider.rs Adds DynamicRouteProvider and task orchestration managers implementing RouteProvider.
src/routing/ic/route_provider/mod.rs Adds the new route provider module API and wiring via setup_route_provider.
src/routing/ic/route_provider/health.rs Adds HTTP health checker + per-node health actors + manager producing HealthyNode snapshots.
src/routing/ic/route_provider/fetcher.rs Adds agent-based node discovery fetcher + manager publishing refreshed node lists.
src/routing/ic/route_provider.rs Removes the old route provider wiring (previous ic_bn_lib dynamic routing builder usage).
src/routing/ic/mod.rs Changes MAINNET_ROOT_SUBNET_ID to a Principal constant and updates imports.
src/routing/ic/http_service.rs Switches derive usage to derive_new::new.
src/core.rs Rewires startup to build Hyper clients/services and pass them into the new route provider setup.
Cargo.toml Adds thiserror dependency used by the new route provider module.
Cargo.lock Locks thiserror version.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/routing/ic/route_provider/wrr.rs Outdated
Comment thread src/routing/ic/route_provider/wrr.rs
Comment thread src/routing/ic/route_provider/wrr.rs
Comment thread src/routing/ic/route_provider/routes.rs Outdated
Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/provider.rs Outdated
Comment thread src/routing/ic/route_provider/provider.rs
Comment thread src/routing/ic/route_provider/health.rs Outdated
Comment thread src/routing/ic/route_provider/provider.rs Outdated

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 8 comments.

Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/routes.rs Outdated
Comment thread src/routing/ic/route_provider/routes.rs
Comment thread src/routing/ic/route_provider/provider.rs Outdated
Comment thread src/routing/ic/route_provider/fetcher.rs
Comment thread src/routing/ic/route_provider/wrr.rs

@Bownairo Bownairo left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A lot of these are just questions, but I think I've mostly got it 🙂. The roles in the description were really helpful to guide the review.

}));

info!(
"{self}: Got a list of API BNs ({}, {} invalid skipped): {node_list:?}",

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does this happen?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What exactly? List is refreshed periodically

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm maybe this comment moved (or I was tired 😴).

I was wondering, when does this invalid case happen?

Hostname should be a valid FQDN & have at least one label

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah. Well, in case of some incorrect entry in the Registry potentially. When e.g. an empty hostname is used or it contains some unsupported characters for a FQDN.

async fn fetch_nodes(&self) -> Result<Vec<String>, RouteError> {
let api_bns = self
.agent
.fetch_api_boundary_nodes_by_subnet_id(MAINNET_ROOT_SUBNET_ID)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This means the list is pulled from the NNS? Is it only ever different for testing?

@blind-oracle blind-oracle Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it's pulled from the registry. No, the list isn't static, though it changes not very frequently - like recently two Pakistani API BNs were added. And this might happen during runtime by a proposal, so we have to poll for a new list periodically and update it.

In testing I think we don't use the dynamic route provider, but a static one to target a single PocketIC node.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would we ever pull from a different subnet ID? Or, are all BNs associated with the NNS subnet?
(When would .fetch_api_boundary_nodes_by_subnet_id(OTHER_SUBNET_ID) be used?)

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably not. This subnet ID there is kinda redundant, I don't know why it's used in the first place in the agent API.


// If we removed some nodes & didn't add anything - trigger an explicit update of healthy nodes.
// Otherwise removed nodes will be still available until some other node changes health status.
// If some nodes were added - then the update will come in order once their healthchecks are done.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a newly added node is slow to healthcheck, old nodes would be stale for as long as the timeout - but I guess that's really not much time compared to the idle timeout, or another health status change?

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, they will hang around for up to a health check timeout (currently 3s default).

This overlap is needed for a smooth transition in the case when the node list is fully replaced - like when transitioning from a seed list that consisted of e.g. just [ic0.app, icp0.io] to the fetched list with actual nodes. Or when something happens that changes all nodes in the NNS (unlikely event).

If we just remove the old nodes (seed ones) instantly, then we'll end up with no healthy nodes in the list for up to health check timeout until the new nodes are checked & added as healthy.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it! In this seed case, nodes can be fully replaced even if they are healthy, so we use them in the meantime.
If we are strictly removing, we assume we're not removing all of them, so we have some other nodes to lean on, and pull the removed ones from the pool right away.

@blind-oracle blind-oracle Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. Technically we can remove all of them and not add anything, but this would mean some edge case.

Maybe we need to add some protection against that (e.g. corrupted registry that might return an empty list w/o error), dunno. Like if an empty list comes - refuse it.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've added a safeguard against an empty list

Comment thread src/routing/ic/route_provider/health.rs
Comment thread src/routing/ic/route_provider/routes.rs Outdated
}
}

impl Display for RoutesManager {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super nit: Most places in this PR, Display is before Debug 🤪.

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's because Debug is using Display in most places, so it comes first :) {self}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The super nit is that Route has:
Display -> Debug ({self})
and RoutesManager has:
Debug ({self}) -> Display

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, will fix :)

Comment thread src/core.rs Outdated
Comment thread src/core.rs
Comment thread src/routing/ic/route_provider/health.rs
Comment on lines +136 to +138
let fetcher_manager =
FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx);
tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token()));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The fetcher holds an Arc to this provider, so it won't drop until the fetcher does, but the fetcher doesn't drop until the provider cancels the token in its own drop, so I think we are stuck.

@blind-oracle blind-oracle Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, shutting down of the provider was sucky.

I've reworked it a bit, removed Drop and added stop(). It looks a bit crappy since we need to return now (Arc<dyn RouteProvider>, Option<Arc<DynamicRouteProvider>>) from setup_route_provider(), but there seems no other good way, since Rust doesn't allow downcasting from Arc<dyn ...> if the trait doesn't have Any. And RouteProvider trait is defined in ic-agent and we can't really change it here (well we can create other trait that extends it but not sure it would look better).

https://doc.rust-lang.org/stable/std/sync/struct.Arc.html#method.downcast

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be possible to use a Weak? I guess that would also mean updating ic-agent, or at least creating yet another wrapper RouteProvivder to handle upgrades...

It would be a little weird between when the Arc is dropped and the FetcherManager is canceled - but it could be things are handled well enough that we just log a few "Refresh error:"?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise the tuple with the Option seems fine. I guess this could be an enum, but it's only an extra Arc 🤷‍♀️.

@blind-oracle blind-oracle Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it makes no sense to bother, it seems to stop nicely:

  • We cancel the token, actors are stopped
  • When FetcherManager is stopped - the Arc<dyn FetchesNodes> (AgentFetcher) is dropped (it's the only reference, I guess we can even use Box if need be) decreasing refcount on Arc<dyn RouteProvider>
  • After stop() finishes - everything is cleaned up, Arc<dyn RouteProvider> still exists at this point and dropped at the end of main()

@Bownairo Bownairo left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just realized ic-agent still has its own DynamicRouteProvider and the rest of the stack. If that's in use elsewhere, does this one deserve a new name? Or will that one be cleared out?

pub async fn stop(&self) {
self.token.cancel();
self.tracker.close();
self.tracker.wait().await;

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit on Eero: I don't know enough about the tracker to know why we wait here now... but I trust you 😅.

@blind-oracle blind-oracle Jun 17, 2026

Copy link
Copy Markdown
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, it just waits until all Futures under its control are joined and that's all. This way stop() returns only when it is really stopped and all actors are finished.

Comment on lines +136 to +138
let fetcher_manager =
FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx);
tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token()));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it would be possible to use a Weak? I guess that would also mean updating ic-agent, or at least creating yet another wrapper RouteProvivder to handle upgrades...

It would be a little weird between when the Arc is dropped and the FetcherManager is canceled - but it could be things are handled well enough that we just log a few "Refresh error:"?

Comment on lines +136 to +138
let fetcher_manager =
FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx);
tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token()));

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Otherwise the tuple with the Option seems fine. I guess this could be an enum, but it's only an extra Arc 🤷‍♀️.

@blind-oracle

Copy link
Copy Markdown
Collaborator Author

I just realized ic-agent still has its own DynamicRouteProvider and the rest of the stack. If that's in use elsewhere, does this one deserve a new name? Or will that one be cleared out?

It's gated under a very-secret _internal_dynamic-routing feature and nobody really uses it (except Caffeine I think). Probably at some point we can either remove it or replace it with this new version.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants