[RFC] add Batman V support #3736
if [ "$IFNAME" = "vx_mesh_uplink" ] || [ "$IFNAME" = "vx_mesh_other" ] || [ "$IFNAME" = "mesh-vpn" ]; then
	sleep 3
	batctl hardif "${IFNAME}" throughput_override 1000mbit
In the wired mesh case, doesn't this ignore the lower-layer link speed? I'm not using Batman V, but in a scenario where the lower layer has multiple links over a DSA port, a 10 Mbit link should be weighted less than a 1G link, shouldn't it?
This is a very old piece of code. I asked around and the problem in the past was (rough quote)
that gateways were detected with 1MBit and hence were rejected in favour of WiFi Meshes with a proper throughput (which are of course much worse).
So the traffic was routed within the local mesh until eventually it did go through the gateway into the internet.
cc @krombel
I understand your concern, but in our case we realistically never have a cable-based connection below 100Mbps; even 100Mbps is very rare, and almost all links are 1000Mbps (with a few 2500Mbps exceptions).
So hardcoding it to 1000Mbps is much more reliable than "hoping" that no device detects its (LAN) speed as 1Mbps or 3Mbps.
What do you think about a compromise like this?
if speed < 100Mbps:
    speed = 1000Mbps
endif
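A minimal shell sketch of the compromise above, assuming the detected speed is available as a plain number in Mbit/s. The helper name `sanitize_speed` is hypothetical and not part of the PR; empty or non-numeric readings (such as `-1`) are treated as unknown.

```shell
#!/bin/sh
# Hypothetical helper (not part of the PR): apply the proposed floor.
# $1: detected link speed in Mbit/s; may be empty, "-1" or otherwise invalid.
sanitize_speed() {
	speed="$1"

	# Anything that is not a plain non-negative integer counts as unknown.
	case "$speed" in
		''|*[!0-9]*) speed=0 ;;
	esac

	# Implausibly low or unknown values are replaced by 1000 Mbit/s.
	if [ "$speed" -lt 100 ]; then
		speed=1000
	fi

	echo "$speed"
}
```

The result could then be handed to the existing call, e.g. `batctl hardif "$IFNAME" throughput_override "$(sanitize_speed "$detected")mbit"`, where `$detected` stands for whatever speed the script managed to read. Note that this floor would also lift a genuine 10 Mbit/s link to 1000 Mbit/s, which is the trade-off questioned earlier in the thread.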
@grische the vxlan interface is bound to an interface, so at least attempting to set the correct link rate on bringup should be entirely possible.
(assuming it's not in a bridge with multiple interfaces. In that case the entire approach of having a bridge in the first place seems incompatible with the protocol, but this is the can of worms I've been talking about)
Forcing the link rate to a fixed value for all interfaces is just a hack that avoids the situation you've seen with the vxlan tunnel, not a proper fix.
@T-X, you might have more insight into what the issue was.
I wonder if this could already fix the issue:
diff --git a/net/batman-adv/bat_v_elp.c b/net/batman-adv/bat_v_elp.c
index fdc2abe9..0b685f7a 100644
--- a/net/batman-adv/bat_v_elp.c
+++ b/net/batman-adv/bat_v_elp.c
@@ -170,6 +170,23 @@ static bool batadv_v_elp_get_throughput(struct batadv_hardif_neigh_node *neigh,
 	 * ethtool (e.g. an Ethernet adapter)
 	 */
 	ret = __ethtool_get_link_ksettings(hard_iface->net_dev, &link_settings);
+
+	/* Virtual/stacked interfaces (VXLAN, VLAN, bridges, ...) often fail
+	 * the ethtool query or report SPEED_UNKNOWN. Try to resolve the
+	 * underlying real net device and query that instead.
+	 */
+	if (ret != 0 ||
+	    link_settings.base.speed == 0 ||
+	    link_settings.base.speed == SPEED_UNKNOWN) {
+		real_netdev = __batadv_get_real_netdev(hard_iface->net_dev);
+		if (real_netdev) {
+			if (real_netdev != hard_iface->net_dev)
+				ret = __ethtool_get_link_ksettings(real_netdev,
+								   &link_settings);
+			dev_put(real_netdev);
+		}
+	}
+
 	rtnl_unlock();
 	if (ret == 0) {
 		/* link characteristics might change over time */
So we try to check if there is a realdev in the chain to read the speed from. cc @T-X
From my understanding this is already done for Wifi interfaces but not for all other interfaces because of line 109 in that code.
I think the batman-adv approach so far has been: when in doubt about the correctness of the throughput, choose a conservative value. Otherwise it'd get annoying if a too optimistic node beats a node with a realistic value.
For VXLAN or mesh-vpn over the internet choosing the lower physical device speed might be too optimistic. So I'm not sure if that would be a patch acceptable for upstream.
However for the Gluon use-case I think we can quite likely say that with mesh-on-lan/wan the lower device of the vxlan one does provide a realistic throughput in most scenarios. So that would make me think that it might be better to implement that lower device check in Gluon scripts? And for mesh-vpn I don't think we can get a realistic value from the lower device via ethtool at the moment.
I think I would also be fine with starting with 1gbit/s on vx_mesh_uplink / vx_mesh_other and 100mbit/s on mesh-vpn in this PR. 1gbit/s is the typical / most common LAN ethernet link speed right now. And 100mbit/s is the average internet uplink speed in Germany right now.
(But if someone feels like working on this then getting/setting the link speed in Gluon from the lower dev for vx_mesh_uplink / vx_mesh_other would be great, too.)
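One way the Gluon-side approach described above might look, as a rough shell sketch rather than a tested implementation. It assumes the vxlan interface exposes its lower device as a `lower_*` link under `/sys/class/net/<ifname>/` and that this device reports a usable `speed` value; neither is guaranteed, for example when the lower device is a bridge. The names `SYSFS_NET`, `lower_speed` and `pick_throughput` are hypothetical.

```shell
#!/bin/sh
# Sketch only: SYSFS_NET, lower_speed and pick_throughput are hypothetical
# names, not existing Gluon helpers.
SYSFS_NET="${SYSFS_NET:-/sys/class/net}"

# Print the speed (Mbit/s) of the first lower device of interface $1 that
# reports a positive value. Returns non-zero if none does.
lower_speed() {
	for link in "$SYSFS_NET/$1"/lower_*; do
		[ -e "$link/speed" ] || continue
		speed="$(cat "$link/speed" 2>/dev/null)"
		# Skip empty, negative (-1 = unknown) or non-numeric readings.
		case "$speed" in
			''|*[!0-9]*) continue ;;
		esac
		if [ "$speed" -gt 0 ]; then
			echo "$speed"
			return 0
		fi
	done
	return 1
}

# Print a throughput_override value for interface $1, using the defaults
# suggested in the thread when no lower-device speed is available.
pick_throughput() {
	case "$1" in
		vx_mesh_uplink|vx_mesh_other)
			speed="$(lower_speed "$1")" || speed=1000
			echo "${speed}mbit"
			;;
		mesh-vpn)
			echo "100mbit"
			;;
		*)
			return 1
			;;
	esac
}
```

A hook script could then call `batctl hardif "$IFNAME" throughput_override "$(pick_throughput "$IFNAME")"` for the three interfaces in question. The fallback values are only the starting points proposed in the comment above.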
After a lot of discussions across several chats, let's remove this from the PR for now and try to sort this independently.
@awlx For Gluon, WAN / Single is always enslaved in a bridge (a necessity due to private-WiFi), so the proposed kernel patch will probably not cut it.
Another point - This assumes we do not deal with a legacy switch where link-state metrics only describe the link between SoC and switch mac.
b902a12 to 46cf596
Hello @grische, at first glance it looks good to me. I can switch between our BATMAN IV and BATMAN V domains without problems. In a small mesh cloud with cable and wifi mesh it also behaves correctly. Only the VPN connection falls back to the default of 1Mbit/s. Thanks for your effort.
db5ea78 to 9368a41
As @neocturne suggested in the Gluon meet, I added commits which drop the fake-TQ and instead use Batman V's throughput everywhere. I also started downstream PRs to support throughput there as well:
The error message is in the local translation table query, not the global one.
Each routing algorithm now emits its native link-quality metric in its own JSON field: tq (uint8, 0-255) for Batman IV; throughput (uint32, kbps) for Batman V. Each neighbour link carries exactly one of the two. This matches the pattern in respondd-statistics.c (gateway_tq vs gateway_throughput) and removes the need to fake a TQ value from throughput via gluonutil_get_pseudo_tq().
Note: the status-page pipeline (neighbours-batadv.c) keeps emitting "tp" as a pre-formatted display string (e.g. "123M"); only the respondd neighbours output gains the raw numeric "throughput" here for machine consumers (yanic, meshviewer).
add_neighbour() now takes a metric name plus a caller-allocated json_object value and owns that value, json_object_put()-ing it on the two early-return paths (ifname lookup fail, inner object alloc fail). The IV and V callbacks pass the appropriate (name, value) pair.
Throughput is wrapped with json_object_new_int64 to avoid int32 truncation of the uint32 source; gateway_throughput in respondd-statistics.c still uses json_object_new_int and can be widened in a follow-up. The libgluonutil include is no longer needed in this file.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The function synthesised a fake Batman IV TQ from a Batman V throughput value via a log formula. Its only caller in the tree was respondd-neighbours.c, which now emits the raw throughput field directly. Drop the function, the <math.h> include it required, and the -lm linkage it forced on every libgluonutil consumer.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…etric column
A given Gluon build runs exactly one batman-adv routing algorithm
(mesh.batman_adv.routing_algo in site config). Showing both TQ and TP
columns on the status page always leaves one permanently blank and
confuses users. Emit only the relevant attr tuple:
- BATMAN_V → TP (throughput)
- BATMAN_IV → TQ (link quality)
The value is read from gluon.site at render time (mesh.lua is loaded
fresh per request), so it tracks the authoritative site config that
also drives gluon_bat0.sh and the kernel. check_site already
guarantees the field is one of {'BATMAN_IV', 'BATMAN_V'}.
No JS or HTML template change: status-page.html iterates mesh.attrs
dynamically, and the minified JS builds per-key cells from the
rendered <th data-key="..."> headers, so shrinking the list works
transparently. The Monitoring panel's separate "Gateway (TQ: ...)"
line is unrelated (batman-adv reports gateway TQ for both algos) and
stays unchanged.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
9368a41 to 4f05bcf
Rebased on top of latest main and updated the series:

This is a first iteration trying to upstream the Batman V changes we have been using for multiple years.
The original patchset can be found here:
v2025.1.x...freifunkMUC:gluon:v2025.1.x-batmanv
Looking for feedback.