[RFC] add Batman V support#3736

Draft
grische wants to merge 5 commits intofreifunk-gluon:mainfrom
freifunkMUC:batman-v-integration

@grische
Contributor

@grische grische commented Apr 10, 2026

This is a first iteration trying to upstream the Batman V changes we have been using for multiple years.

The original patchset can be found here:
v2025.1.x...freifunkMUC:gluon:v2025.1.x-batmanv

Looking for feedback.

Comment on lines +7 to +9
if [ "$IFNAME" = "vx_mesh_uplink" ] || [ "$IFNAME" = "vx_mesh_other" ] || [ "$IFNAME" = "mesh-vpn" ]; then
sleep 3
batctl hardif "${IFNAME}" throughput_override 1000mbit
Member

In the wired mesh case, doesn't this ignore the lower-layer link speed? I'm not using Batman V, but in a scenario where the lower layer has multiple links over a DSA port, a 10 Mbit link should be weighted less than a 1G link, shouldn't it?

Contributor Author

@grische grische Apr 12, 2026

This is a very old piece of code. I asked around and the problem in the past was (rough quote)

that gateways were detected with 1MBit and hence were rejected in favour of WiFi Meshes with a proper throughput (which are of course much worse).
So the traffic was routed within the local mesh until eventually it did go through the gateway into the internet.

cc @krombel

Member

I understand, but in case you have this topology, it would lead to a suboptimal path selection:

[image: topology diagram]

I could imagine this is a bigger can of worms as determining lower link state assumes we don't use a bridge with multiple interfaces.

Contributor Author

I understand your concern, but in practice we almost never see a cable-based connection below 100Mbps; even 100Mbps is very rare, and almost all links are 1000Mbps (with a few 2500Mbps exceptions).

So hardcoding it to 1000Mbps is much more reliable than "hoping" that no device detects a (LAN) speed of 1Mbps or 3Mbps.

What do you think about a compromise like this?

if speed < 100Mbps:
   speed = 1000Mbps
endif
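
The compromise above could be sketched in shell roughly as follows. This is only an illustration: the function name `effective_speed_mbit` is hypothetical, and it assumes the detected speed is already available as an integer in Mbit/s, with empty, negative, or non-numeric values treated as "too low".

```sh
#!/bin/sh
# Hypothetical helper: raise implausibly low detected speeds to 1000 Mbit/s.
# $1: detected link speed in Mbit/s (may be empty, negative or non-numeric)
effective_speed_mbit() {
	speed="$1"
	case "$speed" in
		''|*[!0-9]*) speed=0 ;;  # unknown, negative or garbage: treat as 0
	esac
	if [ "$speed" -lt 100 ]; then
		speed=1000
	fi
	echo "$speed"
}

# Possible use, with "$detected" being a hypothetical variable:
#   batctl hardif "$IFNAME" throughput_override "$(effective_speed_mbit "$detected")mbit"
```

Links that report 100 Mbit/s or more keep their detected value, so a 2500 Mbit/s link is not downgraded to 1000 Mbit/s.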

Member

@grische the vxlan interface is bound to an interface, so at least attempting to set the correct link rate on bringup should be entirely possible.

(Assuming it's not in a bridge with multiple interfaces. In that case the entire approach of having a bridge in the first place seems incompatible with the protocol, but this is the can of worms I've been talking about.)

Forcing the link rate to a fixed value for all interfaces is just a hack that avoids the situation you've seen with the vxlan tunnel, not a proper fix.

Member

cc @T-X, as you might have more insight into what the issue was.

Member

@awlx awlx Apr 14, 2026

I wonder if this could already fix the issue:

diff --git a/net/batman-adv/bat_v_elp.c b/net/batman-adv/bat_v_elp.c
index fdc2abe9..0b685f7a 100644
--- a/net/batman-adv/bat_v_elp.c
+++ b/net/batman-adv/bat_v_elp.c
@@ -170,6 +170,23 @@ static bool batadv_v_elp_get_throughput(struct batadv_hardif_neigh_node *neigh,
         * ethtool (e.g. an Ethernet adapter)
         */
        ret = __ethtool_get_link_ksettings(hard_iface->net_dev, &link_settings);
+
+       /* Virtual/stacked interfaces (VXLAN, VLAN, bridges, ...) often fail
+        * the ethtool query or report SPEED_UNKNOWN. Try to resolve the
+        * underlying real net device and query that instead.
+        */
+       if (ret != 0 ||
+           link_settings.base.speed == 0 ||
+           link_settings.base.speed == SPEED_UNKNOWN) {
+               real_netdev = __batadv_get_real_netdev(hard_iface->net_dev);
+               if (real_netdev) {
+                       if (real_netdev != hard_iface->net_dev)
+                               ret = __ethtool_get_link_ksettings(real_netdev,
+                                                                  &link_settings);
+                       dev_put(real_netdev);
+               }
+       }
+
        rtnl_unlock();
        if (ret == 0) {
                /* link characteristics might change over time */

So we try to check if there is a realdev in the chain to read the speed from. cc @T-X

From my understanding this is already done for Wifi interfaces but not for all other interfaces because of line 109 in that code.

Contributor

I think the batman-adv approach so far has been: when in doubt about the correctness of the throughput, choose a conservative value. Otherwise it gets annoying when a node with a too-optimistic value beats a node with a realistic one.

For VXLAN or mesh-vpn over the internet choosing the lower physical device speed might be too optimistic. So I'm not sure if that would be a patch acceptable for upstream.

However for the Gluon use-case I think we can quite likely say that with mesh-on-lan/wan the lower device of the vxlan one does provide a realistic throughput in most scenarios. So that would make me think that it might be better to implement that lower device check in Gluon scripts? And for mesh-vpn I don't think we can get a realistic value from the lower device via ethtool at the moment.


I think I would also be fine with starting with 1gbit/s on vx_mesh_uplink / vx_mesh_other and 100mbit/s on mesh-vpn in this PR. 1gbit/s is the most common LAN Ethernet link speed right now, and 100mbit/s is the average internet uplink speed in Germany right now.
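
A minimal sketch of those per-interface defaults, in the style of the script under review. The function name `default_throughput_override` is hypothetical; the interface names, the values, and the `batctl hardif ... throughput_override` call are taken from this thread.

```sh
#!/bin/sh
# Hypothetical helper: print the fixed throughput override suggested above.
# Prints nothing for interfaces that should keep batman-adv's own estimate.
default_throughput_override() {
	case "$1" in
		vx_mesh_uplink|vx_mesh_other) echo "1000mbit" ;;  # typical LAN speed
		mesh-vpn)                     echo "100mbit" ;;   # average uplink
	esac
}

# Possible use in a hotplug-style script where IFNAME is set by the caller:
#   override="$(default_throughput_override "$IFNAME")"
#   [ -n "$override" ] && batctl hardif "$IFNAME" throughput_override "$override"
```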

(But if someone feels like working on this then getting/setting the link speed in Gluon from the lower dev for vx_mesh_uplink / vx_mesh_other would be great, too.)
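
Reading the speed from the lower device in a Gluon script might look roughly like the sketch below. It rests on assumptions that are not confirmed in this thread: that the vxlan interface exposes its lower device as a `lower_<name>` entry in sysfs, and that the lower device's `speed` attribute is meaningful (which, as noted earlier in the thread, may not hold for bridges). The function name `lower_dev_speed` and its optional second argument are hypothetical.

```sh
#!/bin/sh
# Hypothetical helper: print the link speed (Mbit/s) of the first lower
# device of an interface that reports a usable value in sysfs.
# $1: interface name, $2: sysfs net directory (default /sys/class/net)
lower_dev_speed() {
	sysnet="${2:-/sys/class/net}"
	for lower in "$sysnet/$1"/lower_*; do
		[ -r "$lower/speed" ] || continue
		# Reading the attribute can fail (e.g. link down); skip on error.
		speed="$(cat "$lower/speed" 2>/dev/null)" || continue
		case "$speed" in
			''|*[!0-9]*) continue ;;  # -1 or garbage means "unknown"
		esac
		[ "$speed" -gt 0 ] || continue
		echo "$speed"
		return 0
	done
	return 1
}
```

When the function returns non-zero, a caller could fall back to a fixed value such as the ones suggested above.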

Contributor Author

After a lot of discussions across several chats, let's remove this from the PR for now and try to sort this independently.

Member

@awlx In Gluon, the WAN / Single interface is always enslaved in a bridge (a necessity due to private WiFi), so the proposed kernel patch will probably not cut it.

Another point: this assumes we are not dealing with a legacy switch, where link-state metrics only describe the link between the SoC and the switch MAC.

Member

@blocktrron blocktrron left a comment

Thoughts so far

Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-statistics.c Outdated
Comment thread package/gluon-radv-filterd/src/gluon-radv-filterd.c
@grische grische force-pushed the batman-v-integration branch from b902a12 to 46cf596 Compare April 12, 2026 18:05
@grische grische marked this pull request as ready for review April 14, 2026 19:27
@grische grische marked this pull request as draft April 14, 2026 19:51
@ambassador86
Contributor

ambassador86 commented Apr 16, 2026

Hello @grische,
thanks for your PR. I built a firmware for our network with your branch.

At first glance it looks good to me. I can switch between our BATMAN IV and BATMAN V domain without problems. In a small mesh cloud with cable and wifi mesh it also behaves correctly. Only the VPN connection falls back to the default of 1Mbit/s.

Thanks for your effort.

@grische
Contributor Author

grische commented Apr 17, 2026

As @neocturne suggested in the Gluon meet, I added commits which drop the fake-TQ and instead use Batman V's throughput everywhere.

I also started downstream PRs to support throughput there as well:

Comment thread package/libgluonutil/src/libgluonutil.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c
Comment thread package/libbatadv/src/batadv-genl.c Outdated
Comment thread package/gluon-status-page-mesh-batman-adv/src/neighbours-batadv.c Outdated
Comment thread package/libbatadv/src/batadv-genl.h
Comment thread package/gluon-status-page-mesh-batman-adv/src/neighbours-batadv.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c Outdated
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c
Comment thread package/gluon-mesh-batman-adv/src/respondd-neighbours.c Outdated
grische and others added 5 commits April 30, 2026 21:54
The error message is in the local translation table query, not the global
one.
Each routing algorithm now emits its native link-quality metric in its
own JSON field: tq (uint8, 0-255) for Batman IV; throughput (uint32,
kbps) for Batman V. Each neighbour link carries exactly one of the two.

This matches the pattern in respondd-statistics.c (gateway_tq vs
gateway_throughput) and removes the need to fake a TQ value from
throughput via gluonutil_get_pseudo_tq().

Note: the status-page pipeline (neighbours-batadv.c) keeps emitting
"tp" as a pre-formatted display string (e.g. "123M"); only the respondd
neighbours output gains the raw numeric "throughput" here for machine
consumers (yanic, meshviewer).

add_neighbour() now takes a metric name plus a caller-allocated
json_object value and owns that value, json_object_put()-ing it on the
two early-return paths (ifname lookup fail, inner object alloc fail).
The IV and V callbacks pass the appropriate (name, value) pair.

Throughput is wrapped with json_object_new_int64 to avoid int32
truncation of the uint32 source; gateway_throughput in
respondd-statistics.c still uses json_object_new_int and can be widened
in a follow-up.
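
To illustrate the truncation concern, the following shell sketch emulates what typically happens when an unsigned 32-bit value is stored in a signed 32-bit integer (two's complement wrap). It is an illustration only, not code from this series; the function name `as_int32` is hypothetical, the shell is assumed to use 64-bit arithmetic, and the exact conversion behaviour in C is implementation-defined.

```sh
#!/bin/sh
# Illustration only: emulate storing an unsigned 32-bit value in a signed
# 32-bit integer. Values above INT32_MAX wrap to negative numbers.
# $1: unsigned 32-bit value (0 .. 4294967295)
as_int32() {
	v="$1"
	if [ "$v" -gt 2147483647 ]; then
		v=$((v - 4294967296))
	fi
	echo "$v"
}
```

Only values above 2147483647 kbit/s (about 2.1 Tbit/s) would be affected, so the wider type is a safety margin rather than a fix for realistic link speeds.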

The libgluonutil include is no longer needed in this file.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The function synthesised a fake Batman IV TQ from a Batman V throughput
value via a log formula. Its only caller in the tree was respondd-
neighbours.c, which now emits the raw throughput field directly.

Drop the function, the <math.h> include it required, and the -lm
linkage it forced on every libgluonutil consumer.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etric column

A given Gluon build runs exactly one batman-adv routing algorithm
(mesh.batman_adv.routing_algo in site config). Showing both TQ and TP
columns on the status page always leaves one permanently blank and
confuses users. Emit only the relevant attr tuple:

- BATMAN_V → TP (throughput)
- BATMAN_IV → TQ (link quality)

The value is read from gluon.site at render time (mesh.lua is loaded
fresh per request), so it tracks the authoritative site config that
also drives gluon_bat0.sh and the kernel. check_site already
guarantees the field is one of {'BATMAN_IV', 'BATMAN_V'}.
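
The selection described above can be sketched as follows. The actual change is in the status page's Lua code, which is not shown here; this shell version only mirrors the mapping, and the function name `status_page_metric` and the lowercase keys it prints are assumptions.

```sh
#!/bin/sh
# Hypothetical helper mirroring the mapping above.
# $1: value of mesh.batman_adv.routing_algo from the site config
status_page_metric() {
	case "$1" in
		BATMAN_V)  echo "tp" ;;  # throughput column
		BATMAN_IV) echo "tq" ;;  # link-quality column
		*)         return 1 ;;   # check_site is said to rule this out
	esac
}
```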

No JS or HTML template change: status-page.html iterates mesh.attrs
dynamically, and the minified JS builds per-key cells from the
rendered <th data-key="..."> headers, so shrinking the list works
transparently. The Monitoring panel's separate "Gateway (TQ: ...)"
line is unrelated (batman-adv reports gateway TQ for both algos) and
stays unchanged.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@grische grische force-pushed the batman-v-integration branch from 9368a41 to 4f05bcf Compare April 30, 2026 20:06
@grische
Contributor Author

grische commented Apr 30, 2026

Rebased on top of latest main and updated the series:

  • replaced uint8_t with enum batadv_algo, which now also has BATADV_ALGO_UNKNOWN = 0
  • fixed batadv_genl_get_algo() to return a negative errno (-EINVAL or a propagated negative value) instead of -1
  • use nla_strlcpy instead of a custom strncpy + manual NUL termination
  • use int64 for throughput in both respondd's gateway_throughput and the status page's tp field
  • drop the C-side display formatter for the status page's TP, emit raw kbit/s instead, and add a bitrate formatter for it
  • LOCAL_THROUGHPUT capped at 10 * 1000 * 1000 kbit/s = 10 Gbit/s
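
As a rough illustration of the last two points, the sketch below caps a kbit/s value at 10 Gbit/s and renders raw kbit/s for display. It is not the formatter added in this series; the names `cap_kbit`, `format_bitrate`, and `MAX_KBIT` are hypothetical, and the output format (one decimal place, truncated rather than rounded) is an assumption.

```sh
#!/bin/sh
# Hypothetical sketch: cap a throughput value and render kbit/s for display.
MAX_KBIT=$((10 * 1000 * 1000))  # 10 Gbit/s, the cap mentioned above

# $1: throughput in kbit/s; prints the value limited to MAX_KBIT
cap_kbit() {
	if [ "$1" -gt "$MAX_KBIT" ]; then echo "$MAX_KBIT"; else echo "$1"; fi
}

# $1: throughput in kbit/s; prints it with one (truncated) decimal place,
# using integer arithmetic only
format_bitrate() {
	kbit="$1"
	if [ "$kbit" -ge 1000000 ]; then
		echo "$((kbit / 1000000)).$((kbit % 1000000 / 100000)) Gbit/s"
	elif [ "$kbit" -ge 1000 ]; then
		echo "$((kbit / 1000)).$((kbit % 1000 / 100)) Mbit/s"
	else
		echo "$kbit kbit/s"
	fi
}
```

Keeping the raw number in the data and formatting only at display time lets machine consumers read the same field without parsing a unit suffix.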
