chore(NODE-1953): Implement new dynamic route provider by blind-oracle · Pull Request #209 · dfinity/ic-gateway

blind-oracle · 2026-06-12T16:36:35Z

A slightly simplified version of DynamicRouteProvider from agent-rs. It takes both node latency and reliability into account to pick the best nodes, and uses Weighted Round Robin (WRR) to choose among them.

Entity roles are as follows:

FetcherManager fetches a fresh list of API BNs using Arc<dyn FetchesNodes> (AgentFetcher is included to implement that trait) and sends this list down a channel
RouteProviderManager gets a copy of it to share it with DynamicRouteProvider so that it knows how many total nodes there are (required for RouteProvider trait)
HealthCheckManager also receives the same list and spawns a HealthCheckActor for each node to perform health checks using Arc<dyn ChecksHealth> (HttpHealthChecker is included to do that using an HTTP client). Once the set of healthy nodes changes (and also periodically) it sends the healthy node list (along with latency/reliability stats) down a channel. When the node list is updated, it figures out which nodes were added or removed and spawns or stops the corresponding actors.
RoutesManager receives this healthy node list, processes it, builds a snapshot of URLs for the DynamicRouteProvider to use, and shares it using ArcSwap
DynamicRouteProvider implements the actual RouteProvider trait & uses WRR in the RouteSnapshot to respond to URL queries.
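
For readers unfamiliar with WRR, here is a minimal sketch of one common variant, smooth weighted round-robin. It is not the code from this PR's wrr.rs, which is not shown here; the `SmoothWrr` and `Entry` names and the integer weights are assumptions made for illustration. How the PR derives weights from latency and reliability is also not covered.

```rust
/// One candidate URL with a static weight and a running score.
/// (Hypothetical type, not taken from the PR.)
#[derive(Debug, Clone)]
struct Entry {
    url: String,
    weight: i64,
    current: i64,
}

/// Smooth weighted round-robin: picks are proportional to weight and
/// interleaved rather than clustered.
#[derive(Debug)]
struct SmoothWrr {
    entries: Vec<Entry>,
}

impl SmoothWrr {
    /// Builds a selector, dropping entries with non-positive weight.
    fn new(items: &[(&str, i64)]) -> Self {
        let entries = items
            .iter()
            .filter(|(_, w)| *w > 0)
            .map(|(url, w)| Entry { url: url.to_string(), weight: *w, current: 0 })
            .collect();
        Self { entries }
    }

    /// Returns the next URL, or None if there are no entries.
    fn next(&mut self) -> Option<&str> {
        let total: i64 = self.entries.iter().map(|e| e.weight).sum();
        let mut best: Option<usize> = None;
        for i in 0..self.entries.len() {
            // Every entry gains its own weight on each round.
            let w = self.entries[i].weight;
            self.entries[i].current += w;
            // Keep the highest running score; the earlier index wins ties.
            match best {
                Some(b) if self.entries[b].current >= self.entries[i].current => {}
                _ => best = Some(i),
            }
        }
        let b = best?;
        // The winner pays back the total weight, letting others catch up.
        self.entries[b].current -= total;
        Some(self.entries[b].url.as_str())
    }
}
```

With weights 5, 1 and 1, seven consecutive calls to `next()` return the first entry five times and each of the others once, with the lighter entries spread out between the heavier one.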

zeropath-ai · 2026-06-12T16:37:38Z

✅ No security or compliance issues detected. Reviewed everything up to 1e57956.

Security Overview

🔎 Scanned files: 14 changed file(s)
🔗 Scan Link: https://zeropath.com/app/repositories/2f4ca4b6-3e66-473c-8350-2a581fad9f23?scanId=8f8b3cc9-a3de-4880-a103-d629cbb9bf63&codeScanTypes=PrScan&tab=issues

Copilot

Pull request overview

This PR replaces the previous ic_bn_lib dynamic route provider integration with an in-repo dynamic route provider that discovers API boundary nodes, health-checks them, and selects routes using weighted round-robin based on latency and reliability.

Changes:

Introduces a new dynamic routing subsystem (FetcherManager → HealthCheckManager → RoutesManager → DynamicRouteProvider) under src/routing/ic/route_provider/.
Switches IC routing setup to use Hyper-based clients/services and updates mainnet root subnet ID handling to a Principal constant.
Adds a local WRR implementation and tests for WRR, health checking, fetching, and route selection.

Reviewed changes

Copilot reviewed 12 out of 13 changed files in this pull request and generated 10 comments.

| File | Description |
| --- | --- |
| src/routing/ic/routing_table_manager.rs | Updates tests to use the new `MAINNET_ROOT_SUBNET_ID` constant type. |
| src/routing/ic/route_provider/wrr.rs | Adds a weighted round-robin implementation used by the new route snapshot. |
| src/routing/ic/route_provider/routes.rs | Adds route snapshot building and ranking logic (latency/reliability → weights) plus tests. |
| src/routing/ic/route_provider/provider.rs | Adds `DynamicRouteProvider` and task orchestration managers implementing `RouteProvider`. |
| src/routing/ic/route_provider/mod.rs | Adds the new route provider module API and wiring via `setup_route_provider`. |
| src/routing/ic/route_provider/health.rs | Adds HTTP health checker + per-node health actors + manager producing `HealthyNode` snapshots. |
| src/routing/ic/route_provider/fetcher.rs | Adds agent-based node discovery fetcher + manager publishing refreshed node lists. |
| src/routing/ic/route_provider.rs | Removes the old route provider wiring (previous `ic_bn_lib` dynamic routing builder usage). |
| src/routing/ic/mod.rs | Changes `MAINNET_ROOT_SUBNET_ID` to a `Principal` constant and updates imports. |
| src/routing/ic/http_service.rs | Switches derive usage to `derive_new::new`. |
| src/core.rs | Rewires startup to build Hyper clients/services and pass them into the new route provider setup. |
| Cargo.toml | Adds `thiserror` dependency used by the new route provider module. |
| Cargo.lock | Locks `thiserror` version. |

Copilot

Pull request overview

Copilot reviewed 13 out of 14 changed files in this pull request and generated 8 comments.

Bownairo

A lot of these are just questions, but I think I've mostly got it 🙂. The roles in the description were really helpful to guide the review.

Bownairo · 2026-06-16T05:10:10Z

+        }));
+
+        info!(
+            "{self}: Got a list of API BNs ({}, {} invalid skipped): {node_list:?}",


When does this happen?

What exactly? List is refreshed periodically

Hmmm maybe this comment moved (or I was tired 😴).

I was wondering, when does this invalid case happen?

Hostname should be a valid FQDN & have at least one label

Ah. Well, potentially when there is an incorrect entry in the Registry, e.g. when an empty hostname is used or it contains characters that aren't supported in an FQDN.
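
To make the "invalid skipped" case concrete, here is a rough sketch of a hostname filter. The PR's actual validation is not shown in this thread, so the rules below (RFC 1123 style labels) and the names `is_valid_hostname` and `partition_hostnames` are assumptions for illustration only.

```rust
/// Returns true if `host` looks like a valid FQDN: at least one label,
/// each label 1 to 63 ASCII alphanumerics or hyphens, no leading or
/// trailing hyphen, at most 253 characters in total.
/// (Assumed rules; the PR may validate differently.)
fn is_valid_hostname(host: &str) -> bool {
    // Tolerate a single trailing dot ("ic0.app.").
    let host = host.strip_suffix('.').unwrap_or(host);
    if host.is_empty() || host.len() > 253 {
        return false;
    }
    host.split('.').all(|label| {
        !label.is_empty()
            && label.len() <= 63
            && !label.starts_with('-')
            && !label.ends_with('-')
            && label.bytes().all(|b| b.is_ascii_alphanumeric() || b == b'-')
    })
}

/// Splits raw registry entries into valid hostnames and a count of
/// skipped invalid ones, mirroring the "{} invalid skipped" log line.
fn partition_hostnames(raw: &[&str]) -> (Vec<String>, usize) {
    let valid: Vec<String> = raw
        .iter()
        .copied()
        .filter(|h| is_valid_hostname(h))
        .map(|h| h.to_string())
        .collect();
    let skipped = raw.len() - valid.len();
    (valid, skipped)
}
```

An empty hostname or one containing an underscore would be counted as skipped under these rules, while the remaining entries pass through unchanged.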

Bownairo · 2026-06-16T05:17:49Z

+    async fn fetch_nodes(&self) -> Result<Vec<String>, RouteError> {
+        let api_bns = self
+            .agent
+            .fetch_api_boundary_nodes_by_subnet_id(MAINNET_ROOT_SUBNET_ID)


This means the list is pulled from the NNS? Is it only ever different for testing?

Yes, it's pulled from the registry. No, the list isn't static, though it doesn't change very often: recently, for example, two Pakistani API BNs were added. This can happen at runtime via a proposal, so we have to poll for a new list periodically and update it.

In testing I think we don't use the dynamic route provider, but a static one to target a single PocketIC node.

Would we ever pull from a different subnet ID? Or, are all BNs associated with the NNS subnet?
(When would .fetch_api_boundary_nodes_by_subnet_id(OTHER_SUBNET_ID) be used?)

Probably not. This subnet ID there is kinda redundant, I don't know why it's used in the first place in the agent API.

Bownairo · 2026-06-16T21:46:36Z

+
+        // If we removed some nodes & didn't add anything - trigger an explicit update of healthy nodes.
+        // Otherwise removed nodes will be still available until some other node changes health status.
+        // If some nodes were added - then the update will come in order once their healthchecks are done.


If a newly added node is slow to healthcheck, old nodes would be stale for as long as the timeout - but I guess that's really not much time compared to the idle timeout, or another health status change?

Yes, they will hang around for up to a health check timeout (currently 3s default).

This overlap is needed for a smooth transition in the case when the node list is fully replaced - like when transitioning from a seed list that consisted of e.g. just [ic0.app, icp0.io] to the fetched list with actual nodes. Or when something happens that changes all nodes in the NNS (unlikely event).

If we just remove the old nodes (seed ones) instantly, then we'll end up with no healthy nodes in the list for up to health check timeout until the new nodes are checked & added as healthy.

Got it! In this seed case, nodes can be fully replaced even if they are healthy, so we use them in the meantime.
If we are strictly removing, we assume we're not removing all of them, so we have some other nodes to lean on, and pull the removed ones from the pool right away.

Yes. Technically we can remove all of them and not add anything, but this would mean some edge case.

Maybe we need to add some protection against that (e.g. corrupted registry that might return an empty list w/o error), dunno. Like if an empty list comes - refuse it.

I've added a safeguard against an empty list
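
The behaviour discussed in this thread can be sketched roughly as follows. This is not the PR's code: `NodeDiff`, `Update`, `apply_node_list` and the `publish_now` flag are hypothetical names, and the real manager works with actors and channels rather than a plain set.

```rust
use std::collections::BTreeSet;

/// Nodes that appeared in or disappeared from the fetched list.
#[derive(Debug, PartialEq, Eq)]
struct NodeDiff {
    added: Vec<String>,
    removed: Vec<String>,
}

/// Outcome of processing a freshly fetched node list (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
enum Update {
    /// Safeguard: an empty list is refused and the current set is kept.
    RejectedEmpty,
    /// The list was accepted. `publish_now` is true when nodes were only
    /// removed, so nothing else would trigger a healthy-list update.
    Applied { diff: NodeDiff, publish_now: bool },
}

fn apply_node_list(current: &mut BTreeSet<String>, fetched: &[&str]) -> Update {
    if fetched.is_empty() {
        return Update::RejectedEmpty;
    }
    let new: BTreeSet<String> = fetched.iter().map(|s| s.to_string()).collect();
    let added: Vec<String> = new.difference(current).cloned().collect();
    let removed: Vec<String> = current.difference(&new).cloned().collect();
    // If anything was added, the update arrives once the new nodes'
    // health checks finish, so old nodes overlap for a short while.
    let publish_now = !removed.is_empty() && added.is_empty();
    *current = new;
    Update::Applied { diff: NodeDiff { added, removed }, publish_now }
}
```

In the seed-list case (every node replaced) `publish_now` stays false, which matches the overlap described above; in the strictly-removing case it is true, so removed nodes leave the pool right away.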

Bownairo · 2026-06-16T22:19:41Z

+    }
+}
+
+impl Display for RoutesManager {


Super nit: Most places in this PR, Display is before Debug 🤪.

It's because Debug is using Display in most places, so it comes first :) {self}

The super nit is that Route has:
Display -> Debug ({self})
and RoutesManager has:
Debug ({self}) -> Display

Ah, will fix :)

Bownairo · 2026-06-16T23:36:49Z

+        let fetcher_manager =
+            FetcherManager::new(fetcher_factory(route_provider.clone())?, node_list_tx);
+        tracker.spawn(fetcher_manager.run(node_fetch_interval, token.child_token()));


The fetcher holds an Arc to this provider, so it won't drop until the fetcher does, but the fetcher doesn't drop until the provider cancels the token in its own drop, so I think we are stuck.

Yes, shutting down of the provider was sucky.

I've reworked it a bit, removed Drop and added stop(). It looks a bit crappy since we now need to return (Arc<dyn RouteProvider>, Option<Arc<DynamicRouteProvider>>) from setup_route_provider(), but there seems to be no other good way, since Rust doesn't allow downcasting from Arc<dyn ...> if the trait doesn't have Any. And the RouteProvider trait is defined in ic-agent and we can't really change it here (well, we can create another trait that extends it, but I'm not sure it would look better).

https://doc.rust-lang.org/stable/std/sync/struct.Arc.html#method.downcast

Do you think it would be possible to use a Weak? I guess that would also mean updating ic-agent, or at least creating yet another wrapper RouteProvider to handle upgrades...

It would be a little weird between when the Arc is dropped and the FetcherManager is canceled - but it could be things are handled well enough that we just log a few "Refresh error:"?

Otherwise the tuple with the Option seems fine. I guess this could be an enum, but it's only an extra Arc 🤷‍♀️.

I think it makes no sense to bother, it seems to stop nicely:

We cancel the token, actors are stopped

When FetcherManager is stopped - the Arc<dyn FetchesNodes> (AgentFetcher) is dropped (it's the only reference, I guess we can even use Box if need be) decreasing refcount on Arc<dyn RouteProvider>

After stop() finishes - everything is cleaned up, Arc<dyn RouteProvider> still exists at this point and is dropped at the end of main()
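
The reference-count reasoning in the steps above can be checked with a small std-only sketch. The `RouteProvider` trait here is a local stand-in, not the ic-agent trait, and `Provider`, `Fetcher` and `refcounts_around_stop` are hypothetical names.

```rust
use std::sync::Arc;

/// Local stand-in for the real RouteProvider trait (assumption).
trait RouteProvider {
    fn name(&self) -> &'static str;
}

struct Provider;

impl RouteProvider for Provider {
    fn name(&self) -> &'static str {
        "dynamic"
    }
}

/// Stand-in for the fetcher, which holds its own reference to the provider.
struct Fetcher {
    provider: Arc<dyn RouteProvider>,
}

impl Fetcher {
    fn provider_name(&self) -> &'static str {
        self.provider.name()
    }
}

/// Returns the strong count while the fetcher is alive and after it is
/// dropped (which is what stopping the fetcher manager amounts to here).
fn refcounts_around_stop() -> (usize, usize) {
    let provider: Arc<dyn RouteProvider> = Arc::new(Provider);
    let fetcher = Fetcher { provider: Arc::clone(&provider) };
    assert_eq!(fetcher.provider_name(), "dynamic");
    let while_running = Arc::strong_count(&provider);
    drop(fetcher);
    let after_stop = Arc::strong_count(&provider);
    (while_running, after_stop)
}
```

Once the fetcher is gone, only the caller's handle keeps the provider alive, so it is freed when that handle goes out of scope.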

Bownairo

I just realized ic-agent still has its own DynamicRouteProvider and the rest of the stack. If that's in use elsewhere, does this one deserve a new name? Or will that one be cleared out?

Bownairo · 2026-06-17T17:13:39Z

+    pub async fn stop(&self) {
        self.token.cancel();
        self.tracker.close();
+        self.tracker.wait().await;


nit on Eero: I don't know enough about the tracker to know why we wait here now... but I trust you 😅.

Well, it just waits until all Futures under its control are joined and that's all. This way stop() returns only when it is really stopped and all actors are finished.
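
As a rough analogue of "cancel, close, then wait until everything is joined", here is a sketch using only std threads. It does not use tokio_util's CancellationToken or TaskTracker, which the PR relies on; the `Workers` type and its fields are assumptions for illustration.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// A group of background workers with a shared cancel flag.
/// (Hypothetical thread-based stand-in for token + tracker.)
struct Workers {
    cancel: Arc<AtomicBool>,
    handles: Mutex<Vec<JoinHandle<()>>>,
    finished: Arc<AtomicUsize>,
}

impl Workers {
    /// Spawns `n` workers that loop until the cancel flag is set.
    fn spawn(n: usize) -> Self {
        let cancel = Arc::new(AtomicBool::new(false));
        let finished = Arc::new(AtomicUsize::new(0));
        let handles: Vec<JoinHandle<()>> = (0..n)
            .map(|_| {
                let cancel = Arc::clone(&cancel);
                let finished = Arc::clone(&finished);
                thread::spawn(move || {
                    while !cancel.load(Ordering::Acquire) {
                        thread::sleep(Duration::from_millis(1));
                    }
                    // Runs only once the worker has observed cancellation.
                    finished.fetch_add(1, Ordering::AcqRel);
                })
            })
            .collect();
        Self { cancel, handles: Mutex::new(handles), finished }
    }

    /// Signals cancellation, then blocks until every worker has exited.
    /// Returns how many workers finished; a second call is a no-op wait.
    fn stop(&self) -> usize {
        self.cancel.store(true, Ordering::Release);
        let handles = std::mem::take(&mut *self.handles.lock().unwrap());
        for handle in handles {
            handle.join().unwrap();
        }
        self.finished.load(Ordering::Acquire)
    }
}
```

Because `stop()` joins every handle before returning, a caller can rely on all workers having finished by the time it returns, which is the property the wait in the diff adds.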


blind-oracle · 2026-06-17T19:48:56Z

I just realized ic-agent still has its own DynamicRouteProvider and the rest of the stack. If that's in use elsewhere, does this one deserve a new name? Or will that one be cleared out?

It's gated under a very-secret _internal_dynamic-routing feature and nobody really uses it (except Caffeine I think). Probably at some point we can either remove it or replace it with this new version.

blind-oracle added 4 commits June 11, 2026 13:51

Initial work

69f2bf2

Merge remote-tracking branch 'origin/main' into igor/route-provider

fe098c4

Final work on route provider

51ff1ff

Remove commented code

71df21d

blind-oracle requested a review from Copilot June 12, 2026 16:36

blind-oracle requested a review from a team as a code owner June 12, 2026 16:36

Copilot started reviewing on behalf of blind-oracle June 12, 2026 16:36

Copilot AI reviewed Jun 12, 2026


blind-oracle added 2 commits June 12, 2026 17:23

Address comments

878509b

Rework weight calculations, improve some logging

dffb8e6

blind-oracle requested a review from Copilot June 15, 2026 14:05

Copilot started reviewing on behalf of blind-oracle June 15, 2026 14:06

Copilot AI reviewed Jun 15, 2026


blind-oracle added 3 commits June 15, 2026 16:13

Address AI comments

92da41f

Fix test

9a48d22

Small improvements / comments

0b54547

Bownairo reviewed Jun 16, 2026


blind-oracle added 2 commits June 17, 2026 09:45

Implement Dynamic Route Provider shutdown

9a130da

remove redundant clone

b3cac57

Bownairo reviewed Jun 17, 2026


blind-oracle added 2 commits June 17, 2026 19:59

Add safeguard against empty list

d729fa2

Nits

1e57956
