From a7638b5d09ba137732a51fcfa1b0293a57d981e1 Mon Sep 17 00:00:00 2001 From: sudheer-nexthop Date: Fri, 15 May 2026 14:56:01 +0530 Subject: [PATCH] [HLD] MAC Move Guard Signed-off-by: sudheer-nexthop --- doc/mac_move_guard/MAC_MOVE_GUARD_HLD.md | 678 +++++++++++++++++++++++ 1 file changed, 678 insertions(+) create mode 100644 doc/mac_move_guard/MAC_MOVE_GUARD_HLD.md diff --git a/doc/mac_move_guard/MAC_MOVE_GUARD_HLD.md b/doc/mac_move_guard/MAC_MOVE_GUARD_HLD.md new file mode 100644 index 00000000000..0abb56c46e1 --- /dev/null +++ b/doc/mac_move_guard/MAC_MOVE_GUARD_HLD.md @@ -0,0 +1,678 @@ +# SONiC MAC Move Guard High Level Design + +## Table of Contents + +- [1. Revision](#1-revision) +- [2. Scope](#2-scope) +- [3. Definitions/Abbreviations](#3-definitionsabbreviations) +- [4. Overview](#4-overview) + - [4.1 Detection](#41-detection) + - [4.2 Mitigation actions](#42-mitigation-actions) + - [4.3 Mitigation duration](#43-mitigation-duration) +- [5. Requirements](#5-requirements) + - [5.1 Phase 1](#51-phase-1) + - [5.2 Phase 2](#52-phase-2) +- [6. 
Module Design](#6-module-design) + - [6.1 Overall design](#61-overall-design) + - [6.2 Configuration and control flow](#62-configuration-and-control-flow) + - [6.2.1 Detection: native SAI MAC move](#621-detection-native-sai-mac-move) + - [6.2.2 Detection: synthesized move from AGED + LEARNED](#622-detection-synthesized-move-from-aged--learned) + - [6.2.3 Mitigation: DISABLE_PORT](#623-mitigation-disable_port) + - [6.2.4 Mitigation: DISABLE_MAC_MOVE](#624-mitigation-disable_mac_move) + - [6.2.5 Mitigation: DISABLE_LEARN_ON_VLAN](#625-mitigation-disable_learn_on_vlan) + - [6.2.6 Mitigation: DISABLE_LEARN_ON_PORT](#626-mitigation-disable_learn_on_port) + - [6.2.7 Recovery](#627-recovery) + - [6.3 State machines](#63-state-machines) + - [6.3.1 Config-processing FSM](#631-config-processing-fsm) + - [6.3.2 Per-MAC policy FSM](#632-per-mac-policy-fsm) + - [6.4 Data structures](#64-data-structures) + - [6.5 SWSS and syncd changes](#65-swss-and-syncd-changes) +- [7. Configuration and Management](#7-configuration-and-management) + - [7.1 CONFIG_DB](#71-config_db) + - [7.2 DB and Schema changes](#72-db-and-schema-changes) + - [7.3 YANG model](#73-yang-model) + - [7.4 CLI / sonic-cfggen examples](#74-cli--sonic-cfggen-examples) +- [8. Warmboot and Fastboot Impact](#8-warmboot-and-fastboot-impact) +- [9. Memory Consumption](#9-memory-consumption) +- [10. Restrictions/Limitations](#10-restrictionslimitations) +- [11. Testing Requirements](#11-testing-requirements) + - [11.1 Unit tests (one-liners)](#111-unit-tests-one-liners) + - [11.2 System tests](#112-system-tests) +- [12. Open/Action items](#12-openaction-items) + +## 1. Revision +Rev | Date | Author | Change Description +----|------|--------|------------------- +|v0.1|2026-05-15|Sudheer Y R(Nexthop)|Initial version of MAC Move Guard HLD + +## 2. Scope +This document describes the high level design of the MAC Move Guard feature in SONiC. 
MAC Move Guard detects abnormally high rates of MAC address moves between Layer-2 ports within a VLAN/bridge domain and applies a configurable mitigation action against the offending MAC for the configured action interval. The feature is implemented entirely inside `orchagent` as a new orchestrator (`MacMoveGuardOrch`) that subscribes to FDB events emitted by `FdbOrch` and drives `PortsOrch` / SAI FDB / SAI VLAN / SAI Bridge APIs to apply mitigation. + +## 3. Definitions/Abbreviations +Definition/Abbreviation|Description +------------------------|----------- +MAC Move | An FDB event where a previously-learned MAC address is re-learned on a different bridge port within the same VLAN +Bad MAC | A (VLAN, MAC) whose number of moves within `detect_interval` has reached the configured `threshold` +Pinned port | The one port to which a bad MAC stays anchored under the `DISABLE_PORT` action; all other ports the MAC was bouncing on are admin-disabled +Detect interval | Sliding window (seconds) over which moves are counted +Action interval | Time (seconds) the mitigation action stays in effect before recovery +`bv_id` | SAI bridge-vlan OID identifying the L2 broadcast domain a MAC belongs to +`MacKey` | Tuple of (`MacAddress`, `bv_id`) that uniquely identifies a tracked MAC +FDB | Forwarding Database (the bridge MAC address table) +SAI | Switch Abstraction Interface +FSM | Finite State Machine + +## 4. Overview +In a stable L2 network, a host's MAC is normally seen on a single bridge port. Pathological conditions (L2 forwarding loops, a misbehaving host, or a MAC duplicated across two endpoints) cause the same MAC to bounce between two or more ports at very high rates. 
The resulting data-plane "MAC move storm" is harmful in several ways: the CPU is saturated by FDB notifications, forwarding to the affected MAC becomes effectively random because the FDB entry is overwritten constantly, and the storm can mask the real source of the problem by spreading the impact across many ports. + +MAC Move Guard provides contained, automatic protection in two stages: detection, followed by mitigation for a user-specified duration. + +### 4.1 Detection +Per (VLAN, MAC), the orchestrator maintains a sliding window of move timestamps. On each move event it prunes entries older than `detect_interval` and re-derives the move count. When the count reaches `threshold`, the MAC enters the **Bad MAC** state. + +Two FDB-event paths feed detection: +1. **Native** SAI MAC move (`SAI_FDB_EVENT_MOVE`) is delivered as-is through `SUBJECT_TYPE_MAC_MOVE`. +2. **Synthesized** move: for SDKs that only emit `AGED` then `LEARNED` instead of a native MOVE, the orch reconstructs the MOVE from a residual cache entry on the LEARN path. + +### 4.2 Mitigation actions +Four mitigation actions are supported. They are mutually exclusive: only one is in effect at a time, configured by the `action` field in CONFIG_DB. + +Action | Effect on the bad MAC +-------|---------------------- +`DISABLE_PORT` | Pin the MAC to one selected port; admin-disable every other port it was bouncing on +`DISABLE_MAC_MOVE` | Convert the MAC's existing FDB entry to a static entry with `ALLOW_MAC_MOVE=false`, pinned to its most recent port +`DISABLE_LEARN_ON_VLAN` | Disable MAC learning on the bridge-VLAN the MAC belongs to (`SAI_VLAN_ATTR_LEARN_DISABLE=true`) +`DISABLE_LEARN_ON_PORT` | Disable MAC learning on every bridge port the MAC was just bouncing on (`SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE=DISABLE`) + +### 4.3 Mitigation duration +A `SelectableTimer` fires every 30 s. 
For every bad MAC whose `action_expiry_time` has passed, the orchestrator reverses the action that was applied (re-enable the port, return the FDB entry to dynamic, re-enable VLAN learning, restore bridge-port learning mode). Because expiry is checked only on timer ticks, an action can stay in effect for up to 30 s beyond `action_interval`. Action state is reference-counted, so a shared target (e.g. a port disabled because two bad MACs were both bouncing on it) is reverted only when the last bad MAC releases it. + +## 5. Requirements +MAC Move Guard will be implemented in two phases. + +### 5.1 Phase 1 +- Detect MAC moves on a per (VLAN, MAC) basis using a sliding-window threshold +- Support four mitigation actions: `DISABLE_PORT`, `DISABLE_MAC_MOVE`, `DISABLE_LEARN_ON_VLAN`, `DISABLE_LEARN_ON_PORT` +- Mitigation stays in effect for a configurable action interval +- CONFIG_DB-driven configuration via a single `GLOBAL` row, validated by a new YANG model + +### 5.2 Phase 2 +- `DISABLE_LEARN_ON_MAC` action implemented through ACL +- `show mac-move-guard` CLI with operational state surfaced via STATE_DB + +## 6. Module Design +### 6.1 Overall design +- Management framework writes the `MAC_MOVE_GUARD|GLOBAL` row to CONFIG_DB using the YANG model in [§7.3](#73-yang-model) +- `MacMoveGuardOrch` (new) in orchagent subscribes to the `MAC_MOVE_GUARD` table and configures itself +- `MacMoveGuardOrch` attaches as an `Observer` on `FdbOrch` and consumes `SUBJECT_TYPE_MAC_LEARN` / `SUBJECT_TYPE_MAC_MOVE` +- Mitigation is applied by calling `PortsOrch::setPortAdminStatusByAlias()` and the SAI FDB / VLAN / Bridge attribute setters directly +- A 30 s `SelectableTimer` drives recovery: it reverts mitigation actions once the configured action interval has elapsed and ages out per-MAC tracking state +- syncd / SAI: no changes + +The diagram below groups the relationships by *role*. The numbered arrows describe the lifetime of one bad-MAC episode. 
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'12px'},'flowchart':{'useMaxWidth':false,'nodeSpacing':25,'rankSpacing':35}}}%% +flowchart TB + HWIN:::invis + subgraph DETECT["DETECTION PLANE"] + direction TB + FDB["FdbOrch"] + end + + subgraph MGMT["MANAGEMENT PLANE"] + direction TB + CLI["CLI / sonic-cfggen"] + CFGDB["CONFIG_DB
MAC_MOVE_GUARD|GLOBAL"] + CLI --> CFGDB + end + + subgraph CTRL["CONTROL PLANE"] + direction TB + MMG["MacMoveGuardOrch
observer + 30s timer"] + end + + subgraph ACTION["ACTION PLANE"] + direction TB + PO["PortsOrch"] + SAIFDB["sai_fdb_api"] + SAIVLAN["sai_vlan_api"] + SAIBR["sai_bridge_api"] + end + HWOUT:::invis + + HWIN -->|"raw FDB events from hardware
(SAI notif via syncd)"| FDB + CFGDB -->|"(1) config"| MMG + FDB -->|"(2) notify"| MMG + MMG -->|"(3) action"| PO + MMG -->|"(3) action"| SAIFDB + MMG -->|"(3) action"| SAIVLAN + MMG -->|"(3) action"| SAIBR + ACTION -.->|"(4) SAI calls to hardware
(via syncd)"| HWOUT + + classDef invis fill:transparent,stroke:transparent,color:transparent +``` + +`MacMoveGuardOrch` holds: +- maps: `m_macTrackingState`, `m_disabledPorts`, `m_learnDisabledVlans`, `m_learnDisabledBridgePorts`, `m_learntMac` +- timer: `m_recoveryTimer` (30s) +- handlers: `handleMacLearn()`, `handleMacMove()`, `checkRecovery()` + +**Numbered flows (one bad-MAC episode):** +1) Administrator writes `MAC_MOVE_GUARD|GLOBAL` to CONFIG_DB; `MacMoveGuardOrch::doTask(Consumer&)` consumes it +2) FDB events originate in the ASIC; the SDK invokes the SAI FDB-event callback registered by `syncd`, which forwards the event on the swss notification channel where `FdbOrch::handleSaiFdbEvent()` consumes it and emits `SUBJECT_TYPE_MAC_LEARN` / `SUBJECT_TYPE_MAC_MOVE` to `MacMoveGuardOrch::update()` +3) On threshold breach `MacMoveGuardOrch` calls into `PortsOrch` and the SAI FDB / VLAN / Bridge attribute setters +4) These SAI calls reach the hardware via `syncd` and reshape what the SDK will emit next: a disabled port emits no further LEARNs; a static MAC stops moving; a learn-disabled VLAN/port stops generating LEARNs at all + +**Recovery timer (out-of-band).** A 30-second `SelectableTimer` inside `MacMoveGuardOrch` drives `checkRecovery()`. On each tick it iterates `m_macTrackingState`, prunes each MAC's sliding window, garbage-collects quiet non-bad MACs from the map, and for any bad MAC whose `action_expiry_time` has elapsed invokes `releaseBadMac()` to revert the applied SAI action. The timer is not on the data path — it is purely housekeeping. + +### 6.2 Configuration and control flow + +#### 6.2.1 Detection: native SAI MAC move +The SDK does **not** call `FdbOrch` directly. FDB events originate in the ASIC; the SDK invokes the SAI FDB-event callback that `syncd` registered at boot. 
`syncd` forwards the event onto the swss notification channel, where `FdbOrch::handleSaiFdbEvent()` decodes it and emits the appropriate observer notification (`SUBJECT_TYPE_MAC_MOVE` for a native move). `MacMoveGuardOrch::handleMacMove()` processes that notification: + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','noteBkgColor':'#FFF8E1','noteTextColor':'#5C4500','noteBorderColor':'#D4B563','actorBkg':'#EAF1F8','actorBorder':'#5B7BAB','actorTextColor':'#2C3E50','fontSize':'9px'},'sequence':{'useMaxWidth':false,'actorFontSize':9,'noteFontSize':9,'messageFontSize':9,'mirrorActors':false,'noteAlign':'left','wrap':true},'themeCSS':'.noteText, .noteText tspan, .note text, .note tspan { font-size: 9px !important; }'}}%% +sequenceDiagram + participant ASIC as ASIC SDK + participant SYN as syncd + participant FDB as FdbOrch + participant MMG as MMGuardOrch + participant C as learntMac + + ASIC->>SYN: SAI FDB callback MOVE A→B + SYN->>FDB: notification channel + FDB->>MMG: SUBJECT_TYPE_MAC_MOVE old=A new=B + MMG->>C: write (M,V) → B + Note over MMG,C: 1. push timestamp to deque
2. prune entries older than detect_interval
3. count = deque.size()
4. if count >= threshold: markBadMac → apply action +``` + +
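The four steps in the diagram note can be sketched as a small sliding-window helper. This is a minimal illustration, not the orch's actual code: `MoveWindow` and `recordMove()` are hypothetical names, and the threshold comparison uses `>=` as stated in the note.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <deque>

using Clock = std::chrono::steady_clock;

// Hypothetical helper; the HLD only names the deque and the threshold check.
struct MoveWindow {
    std::deque<Clock::time_point> move_timestamps;

    // Drops timestamps that are older than detect_interval relative to `now`.
    void prune(Clock::time_point now, std::chrono::seconds detect_interval) {
        while (!move_timestamps.empty() &&
               now - move_timestamps.front() > detect_interval) {
            move_timestamps.pop_front();
        }
    }

    // Records one move and returns true once the number of moves inside
    // the window has reached `threshold`.
    bool recordMove(Clock::time_point now,
                    std::chrono::seconds detect_interval,
                    std::size_t threshold) {
        assert(threshold > 0);                       // YANG range is 1..max
        move_timestamps.push_back(now);              // 1. push timestamp
        prune(now, detect_interval);                 // 2. prune old entries
        return move_timestamps.size() >= threshold;  // 3-4. count and compare
    }
};
```

Passing `now` in explicitly, rather than reading the clock inside the helper, keeps the window logic deterministic under unit test.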
+ +1) ASIC SDK invokes the SAI FDB-event callback for `SAI_FDB_EVENT_MOVE`; `syncd` posts the event onto the swss notification channel +2) `FdbOrch::handleSaiFdbEvent()` consumes the notification and emits `SUBJECT_TYPE_MAC_MOVE` to attached observers +3) `MacMoveGuardOrch::handleMacMove()` records the move timestamp, prunes entries older than `detect_interval`, updates `m_learntMac[(M,V)] = B`, and if the count reaches `threshold` calls `markBadMac()` to apply the configured action + +#### 6.2.2 Detection: synthesized move from AGED + LEARNED +Some SDKs report a port change as `AGED` on the old port followed by `LEARNED` on the new port, with no native MOVE. The orchestrator never erases `m_learntMac` on AGE; that residual entry is what allows a later LEARN on a different port to be recognized as a move. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','noteBkgColor':'#FFF8E1','noteTextColor':'#5C4500','noteBorderColor':'#D4B563','actorBkg':'#EAF1F8','actorBorder':'#5B7BAB','actorTextColor':'#2C3E50','fontSize':'9px'},'sequence':{'useMaxWidth':false,'actorFontSize':9,'noteFontSize':9,'messageFontSize':9,'mirrorActors':false}}}%% +sequenceDiagram + participant ASIC as ASIC SDK + participant SYN as syncd + participant FDB as FdbOrch + participant MMG as MMGuardOrch + participant C as learntMac + + ASIC->>SYN: LEARN on A + SYN->>FDB: notification + FDB->>MMG: MAC_LEARN + MMG->>C: read prev = none + MMG->>C: write (M,V) → A + Note over MMG: no move + + ASIC->>SYN: AGE on A + SYN->>FDB: notification + Note over FDB: clears m_entries + Note over MMG: NOT notified + Note over C: (M,V) → A unchanged + + ASIC->>SYN: LEARN on B + SYN->>FDB: notification + FDB->>MMG: MAC_LEARN + MMG->>C: read prev = A + MMG->>C: write (M,V) → B + Note over MMG: prev != B
synthesize move + MMG->>MMG: handleMacMove A→B +``` + +
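The LEARN / AGE / LEARN sequence in the diagram can be sketched as a cache that is written on LEARN and deliberately untouched on AGE. The class name `LearnCache`, the `MacKeySketch` alias (a string MAC plus an integer standing in for `bv_id`), and the out-parameter style are assumptions made to keep the example self-contained.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Stand-in for the HLD's MacKey (MacAddress + bv_id).
using MacKeySketch = std::pair<std::string, unsigned long long>;

class LearnCache {
public:
    // Called for every LEARN. Returns true and fills `old_port` when the MAC
    // was last seen on a different port, i.e. when a move must be synthesized.
    bool onLearn(const MacKeySketch& key, const std::string& port,
                 std::string& old_port) {
        assert(!port.empty());
        auto it = m_learntMac.find(key);
        if (it == m_learntMac.end()) {
            m_learntMac.emplace(key, port);  // first sighting: not a move
            return false;
        }
        if (it->second == port) {
            return false;                    // re-learn on the same port
        }
        old_port = it->second;
        it->second = port;                   // remember the new location
        return true;
    }

    // There is intentionally no onAge(): the residual entry is what lets a
    // later LEARN on another port be recognized as a move.

private:
    std::map<MacKeySketch, std::string> m_learntMac;
};
```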
+ +1) First `LEARN` on `A`: `handleMacLearn()` finds `m_learntMac[(M,V)]` empty, sets it to `A`, returns (not a move) +2) `AGE` on `A`: `MacMoveGuardOrch` is **not** subscribed to AGE; `m_learntMac[(M,V)] = A` is intentionally retained +3) Subsequent `LEARN` on `B`: `handleMacLearn()` reads `prev = A`, writes `B`, and synthesizes a `MacMoveNotification{old:A, new:B}` which it forwards to `handleMacMove()` — joining the same code path as the native MOVE in [§6.2.1](#621-detection-native-sai-mac-move) + +#### 6.2.3 Mitigation: DISABLE_PORT +The MAC is pinned to one port; all other ports it appeared on within the detection window are admin-disabled. Port disable is reference-counted so a port shared by multiple bad MACs is only re-enabled when the last bad MAC is released. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'13px'},'flowchart':{'useMaxWidth':false}}}%% +flowchart TD + A[Bad MAC detected] --> B{pinned_port
already set?} + B -->|no| C[Select pinned port
prefer non-disabled] + B -->|yes| D[Use existing
pinned port] + C --> E + D --> E[For each non-pinned
port in ports_seen] + E --> F{port already
disabled?} + F -->|yes| G[Refcount port
add MacKey] + F -->|no| H[Admin-disable port
seed refcount] + G --> I[Record port
in state] + H --> I +``` + +
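The pinned-port choice and the reference counting in the flowchart can be sketched as follows. `PortDisableTable`, `acquire()` and `release()` are hypothetical names, and a plain string stands in for `MacKey`; only the map shape (`port_alias` to a set of bad MACs) is taken from this document.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <vector>

using MacId = std::string;  // stand-in for MacKey

class PortDisableTable {
public:
    // Prefers a port that no other bad MAC has disabled; falls back to the
    // first candidate when every candidate is already disabled.
    std::string selectPinnedPort(const std::vector<std::string>& ports_seen) const {
        assert(!ports_seen.empty());
        for (const auto& port : ports_seen) {
            if (m_disabledPorts.count(port) == 0) {
                return port;
            }
        }
        return ports_seen.front();
    }

    // Adds `mac` as a holder of `port`. Returns true only for the first
    // holder, which is when the caller would issue the admin-down.
    bool acquire(const std::string& port, const MacId& mac) {
        auto& holders = m_disabledPorts[port];
        const bool first = holders.empty();
        holders.insert(mac);
        return first;
    }

    // Removes `mac` as a holder. Returns true only when the last holder is
    // gone, which is when the caller would re-enable the port.
    bool release(const std::string& port, const MacId& mac) {
        auto it = m_disabledPorts.find(port);
        if (it == m_disabledPorts.end()) {
            return false;
        }
        it->second.erase(mac);
        if (!it->second.empty()) {
            return false;
        }
        m_disabledPorts.erase(it);
        return true;
    }

private:
    std::map<std::string, std::set<MacId>> m_disabledPorts;
};
```

The same acquire/release pattern would apply to the VLAN and bridge-port learn-disable maps, with a different key type.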
+ +1) Administrator configures `action=DISABLE_PORT` in `MAC_MOVE_GUARD|GLOBAL` +2) On threshold breach, `markBadMac()` selects a pinned port — preferring a port not currently disabled by any other bad MAC (to maximize ports kept UP) +3) Every other port the MAC was just bouncing on is admin-disabled via `PortsOrch::setPortAdminStatusByAlias(port, false)` +4) `m_disabledPorts[port].insert(MacKey)` adds the bad MAC to the port's reference set; the SAI admin-down call is issued only on the first insertion +5) On recovery (`action_expiry_time` reached), `releaseBadMac()` decrements each port's ref-count and re-enables ports whose count reaches zero + +#### 6.2.4 Mitigation: DISABLE_MAC_MOVE +The MAC's existing dynamic FDB entry is converted to a static entry with MAC move disallowed, anchored to its most recent port. Three SAI attribute writes are issued in order; the bridge_port_id is saved for use during recovery. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'13px'},'flowchart':{'useMaxWidth':false}}}%% +flowchart TD + A[Bad MAC detected] --> B[Resolve VLAN and
last_port objects] + B --> C{valid
bridge_port_id?} + C -->|no| Err[Log error
return] + C -->|yes| D[Set FDB attr
BRIDGE_PORT_ID] + D --> E[Set FDB attr
TYPE=STATIC] + E --> F[Set FDB attr
ALLOW_MAC_MOVE=false] + F --> G[Record bridge_port_id
in state] +``` + +
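The write order in the flowchart, and its reversal on recovery, can be sketched against a mock writer so that the sequence is testable without SAI headers. `FdbAttr` and `FdbAttrWriter` are local stand-ins rather than SAI types, and stopping at the first failed write is an assumption of this sketch; the HLD does not specify partial-failure handling.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Local stand-ins for the three FDB entry attributes named in the text.
enum class FdbAttr { BridgePortId, Type, AllowMacMove };

// Hypothetical writer callback: returns true when the attribute write succeeded.
using FdbAttrWriter = std::function<bool(FdbAttr attr, const std::string& value)>;

// Pins the entry: BRIDGE_PORT_ID, then TYPE=STATIC, then ALLOW_MAC_MOVE=false.
bool pinFdbEntry(const std::string& bridge_port_id, const FdbAttrWriter& write) {
    assert(write && "writer callback must be set");
    if (bridge_port_id.empty()) {
        return false;  // mirrors the "valid bridge_port_id?" branch
    }
    return write(FdbAttr::BridgePortId, bridge_port_id) &&
           write(FdbAttr::Type, "STATIC") &&
           write(FdbAttr::AllowMacMove, "false");
}

// Recovery path: TYPE=DYNAMIC, then ALLOW_MAC_MOVE=true.
bool unpinFdbEntry(const FdbAttrWriter& write) {
    assert(write && "writer callback must be set");
    return write(FdbAttr::Type, "DYNAMIC") &&
           write(FdbAttr::AllowMacMove, "true");
}
```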
+ +1) Administrator configures `action=DISABLE_MAC_MOVE` +2) On threshold breach, `markBadMac()` looks up the bridge port for `state.last_port` +3) Three SAI attribute writes on the existing FDB entry: anchor `BRIDGE_PORT_ID`, set `TYPE=STATIC`, set `ALLOW_MAC_MOVE=false` +4) `state.static_fdb_bridge_port_id` is recorded so recovery can revert the same entry +5) On recovery, `releaseBadMac()` sets `TYPE=DYNAMIC` and `ALLOW_MAC_MOVE=true` + +#### 6.2.5 Mitigation: DISABLE_LEARN_ON_VLAN +The orchestrator disables MAC learning on the entire bridge-VLAN that the bad MAC belongs to. This is a coarse-grained but very effective action when the churn comes from a localized loop in one VLAN — it stops the FDB churn at the source without dropping any traffic. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'13px'},'flowchart':{'useMaxWidth':false}}}%% +flowchart TD + A[Bad MAC detected] --> B[Resolve VLAN
from key.bv_id] + B --> C{first bad MAC
on this VLAN?} + C -->|yes| D[Set VLAN attr
LEARN_DISABLE=true] + C -->|no| E[Skip SAI write] + D --> F[Refcount VLAN
and record in state] + E --> F +``` + +
+ +1) Administrator configures `action=DISABLE_LEARN_ON_VLAN` +2) On threshold breach, `markBadMac()` checks `m_learnDisabledVlans[bv_id]` — the SAI write is issued only on the first bad MAC for the VLAN +3) `state.learn_disabled_vlan_oid` records the VLAN OID so recovery can re-enable learning +4) On recovery, the ref-count is decremented; when it reaches zero, `SAI_VLAN_ATTR_LEARN_DISABLE` is restored to `false` + +#### 6.2.6 Mitigation: DISABLE_LEARN_ON_PORT +The orchestrator disables MAC learning on every bridge port the MAC was seen on within the detection window. Existing FDB entries are preserved; only new learning is suppressed. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'13px'},'flowchart':{'useMaxWidth':false}}}%% +flowchart TD + A[Bad MAC detected] --> B[For each port in
state.ports_seen] + B --> C{first bad MAC
on this port?} + C -->|yes| D[Set bridge-port attr
LEARNING_MODE=DISABLE] + C -->|no| E[Skip SAI write] + D --> F[Refcount port
and record in state] + E --> F +``` + +
+ +1) Administrator configures `action=DISABLE_LEARN_ON_PORT` +2) On threshold breach, `markBadMac()` iterates `state.ports_seen` and resolves each alias to a bridge port +3) For each bridge port whose ref-count was zero, write `SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE = SAI_BRIDGE_PORT_FDB_LEARNING_MODE_DISABLE` +4) On recovery, ref-counts are decremented; when a count reaches zero, the bridge port's learning mode is restored to `SAI_BRIDGE_PORT_FDB_LEARNING_MODE_HW` (SONiC default) + +Compared with `DISABLE_PORT`, this action is **non-disruptive to in-flight forwarding** — the port stays admin-up, existing FDB entries continue to forward, only the FDB-churn source is gated. + +#### 6.2.7 Recovery +A 30 s `SelectableTimer` drives `checkRecovery()`. For each bad MAC whose `action_expiry_time` has elapsed, `releaseBadMac()` dispatches on the configured action and reverts the SAI state. Non-bad MACs whose `ports_seen` map is empty (i.e. quiet for at least one detection window) are garbage-collected. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'13px'},'flowchart':{'useMaxWidth':false}}}%% +flowchart TD + A[Timer tick] --> B{m_enabled?} + B -->|no| R[Return] + B -->|yes| C[For each tracked
MacKey, state] + C --> D[Prune sliding
window] + D --> E{is_bad_mac?} + E -->|no| F{ports_seen
empty?} + F -->|yes| G[Erase entry
GC] + F -->|no| H[Keep TRACKED] + E -->|yes| I{action_expiry
passed?} + I -->|yes| J[releaseBadMac
revert action] + I -->|no| K[Keep BAD] +``` + +
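One tick of the flowchart can be sketched as a single pass over the tracking map. This is a simplified illustration: `TrackedMac` and `checkRecoverySketch()` are hypothetical, a string stands in for `MacKey`, and an empty timestamp window is used in place of the `ports_seen` emptiness check.

```cpp
#include <cassert>
#include <chrono>
#include <deque>
#include <map>
#include <string>
#include <vector>

using Clock = std::chrono::steady_clock;

struct TrackedMac {
    std::deque<Clock::time_point> move_timestamps;
    bool is_bad_mac = false;
    Clock::time_point action_expiry_time{};
};

// Runs one recovery tick and returns the keys whose action expired on it.
std::vector<std::string> checkRecoverySketch(
        std::map<std::string, TrackedMac>& tracked, bool enabled,
        Clock::time_point now, std::chrono::seconds detect_interval) {
    assert(detect_interval.count() > 0);
    std::vector<std::string> released;
    if (!enabled) {
        return released;                       // feature disabled: no-op
    }
    for (auto it = tracked.begin(); it != tracked.end();) {
        TrackedMac& st = it->second;
        while (!st.move_timestamps.empty() &&
               now - st.move_timestamps.front() > detect_interval) {
            st.move_timestamps.pop_front();    // prune the sliding window
        }
        if (!st.is_bad_mac && st.move_timestamps.empty()) {
            it = tracked.erase(it);            // quiet, non-bad MAC: GC
            continue;
        }
        if (st.is_bad_mac && now >= st.action_expiry_time) {
            st.is_bad_mac = false;             // releaseBadMac() would revert here
            released.push_back(it->first);
        }
        ++it;
    }
    return released;
}
```

A released MAC stays in the map as TRACKED for at least one more tick, which matches the BAD to TRACKED transition before any garbage collection.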
+ +1) Timer fires every `RECOVERY_CHECK_INTERVAL_SECS` (30 s) +2) If the feature is disabled, return immediately +3) For each tracked MAC, prune the sliding window first to keep `move_count` and `ports_seen` current +4) For non-bad MACs with empty `ports_seen`, erase the tracking entry (memory GC) +5) For bad MACs whose `action_expiry_time` has passed, call `releaseBadMac()` to revert the action and transition the MAC back to TRACKED + +### 6.3 State machines + +#### 6.3.1 Config-processing FSM +Global per-orch state driven by `doTask(Consumer&)` on the `MAC_MOVE_GUARD` table. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'12px'},'flowchart':{'useMaxWidth':false,'htmlLabels':true,'nodeSpacing':25,'rankSpacing':30}}}%% +flowchart TD + Start([doTask Consumer]) --> KeyCheck{key == GLOBAL?} + KeyCheck -->|no| Drop[WARN log, drop] + KeyCheck -->|yes| Parse["
Parse fields
1. enabled
2. threshold
3. detect_interval
4. action_interval
5. action
"] + Parse --> EnDiamond{enabled?} + EnDiamond -->|true| Enabled["
ENABLED
1. observer active
2. recovery timer ticking
3. checkRecovery iterates
   m_macTrackingState
4. mitigation actions
   applied on threshold
"] + EnDiamond -->|false| Disabling["
DISABLING
1. clearAllState invoked
2. revert every
   in-flight action
3. flush all maps
"] + Disabling --> Disabled["
DISABLED
1. update is a no-op
2. observer + timer
   stay attached
"] +``` + +
+ +Notes: +- Bad input on `threshold` / intervals / `action` does not transition state; the offending field is logged and skipped, other fields in the same SET are applied +- A key other than `GLOBAL` logs a WARN and is dropped before any state change +- The observer attachment / timer are created in the constructor and torn down in the destructor, so those resources live for the lifetime of the orch — not of the ENABLED state + +#### 6.3.2 Per-MAC policy FSM +Driven by the events handled in [§6.2](#62-configuration-and-control-flow). One instance per (VLAN, MAC) currently being observed. + +
+ +```mermaid +%%{init: {'theme':'base','themeVariables':{'primaryColor':'#EAF1F8','primaryBorderColor':'#5B7BAB','primaryTextColor':'#2C3E50','lineColor':'#7A8B9F','secondaryColor':'#F4F0E8','tertiaryColor':'#F9FAFC','fontSize':'12px'},'flowchart':{'useMaxWidth':false,'htmlLabels':true,'nodeSpacing':25,'rankSpacing':30}}}%% +flowchart TD + Start([first move /
first LEARN]) --> Tracked["
TRACKED
1. moves accumulate
   in sliding window
2. count below threshold
3. no action applied
"] + Tracked -->|move event
count below threshold| Tracked + Tracked -->|count reaches
threshold| Bad["
BAD_MAC
1. configured action applied
2. action_expiry_time set
3. all targets tracked
   for refcounted revert
"] + Bad -->|further moves
while bad| Bad + Bad -->|action_expiry passed
releaseBadMac| Tracked + Tracked -->|ports_seen empty
after detect_interval| GC([erase entry / GC]) +``` + +
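The per-MAC transitions can also be written as a pure function, which makes the table easy to unit-test. The enum and function names are hypothetical; `FeatureDisabled` models the `clearAllState()` path described in this section.

```cpp
#include <cassert>

enum class MacState { Tracked, BadMac, Erased };
enum class MacEvent { Move, ThresholdReached, ActionExpired, WindowEmpty, FeatureDisabled };

// Returns the state that follows `s` when `e` occurs.
MacState nextState(MacState s, MacEvent e) {
    assert(s != MacState::Erased);  // erased entries receive no further events
    if (e == MacEvent::FeatureDisabled) {
        return MacState::Erased;    // unconditional, after reverting any action
    }
    switch (s) {
    case MacState::Tracked:
        if (e == MacEvent::ThresholdReached) return MacState::BadMac;
        if (e == MacEvent::WindowEmpty) return MacState::Erased;
        return MacState::Tracked;   // moves below threshold keep accumulating
    case MacState::BadMac:
        if (e == MacEvent::ActionExpired) return MacState::Tracked;
        return MacState::BadMac;    // further moves do not change the state
    case MacState::Erased:
        break;
    }
    return s;
}
```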
 + +`clearAllState()` (feature disable) unconditionally moves every per-MAC FSM to the terminal (erased) state after reverting whatever action it was holding. + +### 6.4 Data structures + +```c++ +struct MacKey { + MacAddress mac; + sai_object_id_t bv_id; // SAI bridge-vlan OID +}; + +struct MacMoveTrackingState { + std::deque<steady_clock::time_point> move_timestamps; // moves inside the detection window + std::map<std::string, size_t> ports_seen; // keyed by port alias; value type is an assumption + size_t move_count; + bool is_bad_mac; + steady_clock::time_point action_expiry_time; + + // DISABLE_PORT + std::string pinned_port; + std::set<std::string> disabled_ports; // port aliases + + // DISABLE_MAC_MOVE + sai_object_id_t static_fdb_bridge_port_id; + + // DISABLE_LEARN_ON_VLAN + sai_object_id_t learn_disabled_vlan_oid; + + // DISABLE_LEARN_ON_PORT + std::set<std::string> learn_disabled_ports; // port aliases + + std::string last_port; +}; +``` + +Per-orch maps: + +Map | Key → Value | Purpose +----|-------------|-------- +`m_macTrackingState` | `MacKey → MacMoveTrackingState` | Per-MAC sliding window + FSM state +`m_disabledPorts` | `port_alias → set<MacKey>` | Reference count for `DISABLE_PORT` +`m_learnDisabledVlans` | `bv_id → set<MacKey>` | Reference count for `DISABLE_LEARN_ON_VLAN` +`m_learnDisabledBridgePorts` | `port_alias → set<MacKey>` | Reference count for `DISABLE_LEARN_ON_PORT` +`m_learntMac` | `MacKey → port_alias` | Last-known port per (VLAN, MAC); used to synthesize MOVE from AGED+LEARNED SDKs (see [§6.2.2](#622-detection-synthesized-move-from-aged--learned)) + +### 6.5 SWSS and syncd changes +- New `MacMoveGuardOrch`: consumes the `MAC_MOVE_GUARD` CONFIG_DB table, attaches as an `Observer` on `FdbOrch`, owns its own 30 s `SelectableTimer`, and drives `PortsOrch` + SAI FDB / VLAN / Bridge attribute setters +- `FdbOrch`: two new notification subjects, `SUBJECT_TYPE_MAC_LEARN` and `SUBJECT_TYPE_MAC_MOVE`, emitted from the existing `SAI_FDB_EVENT_LEARNED` and `SAI_FDB_EVENT_MOVE` handlers. No change to FDB programming or aging behavior +- syncd / SAI: no changes + +## 7. 
Configuration and Management +### 7.1 CONFIG_DB +Configure MAC Move Guard by creating the single `MAC_MOVE_GUARD|GLOBAL` row: +``` +"MAC_MOVE_GUARD": { + "GLOBAL": { + "enabled": "true", + "threshold": "100", + "detect_interval": "5", + "action_interval": "600", + "action": "DISABLE_PORT" + } +} +``` + +### 7.2 DB and Schema changes + +``` +; Defines schema for MAC Move Guard configuration attributes +key = MAC_MOVE_GUARD|GLOBAL ; only the GLOBAL key is permitted +; field = value +enabled = "true" / "false" ; default false +threshold = 1*10DIGIT ; moves within detect_interval that trigger mitigation (default 1000) +detect_interval = 1*4DIGIT ; seconds, 1..3600 (default 5) +action_interval = 1*5DIGIT ; seconds, 1..86400 (default 600) +action = action_value ; default DISABLE_PORT + +; value annotations +action_value = "DISABLE_PORT" / "DISABLE_MAC_MOVE" / + "DISABLE_LEARN_ON_VLAN" / "DISABLE_LEARN_ON_PORT" +``` + +> Note: Refer to swss-schema.md for general BNF conventions used across SONiC documents. + +### 7.3 YANG model +The CONFIG_DB schema is backed by a YANG module added under `src/sonic-yang-models/yang-models/sonic-mac-move-guard.yang`. + +```yang +module sonic-mac-move-guard { + + yang-version 1.1; + + namespace "http://github.com/sonic-net/sonic-mac-move-guard"; + prefix mmg; + + organization + "SONiC"; + + contact + "SONiC"; + + description + "MAC Move Guard - detect and mitigate excessive (VLAN, MAC) + moves on the local FDB."; + + revision 2026-05-15 { + description "Initial revision."; + } + + typedef mac-move-guard-action { + type enumeration { + enum DISABLE_PORT; + enum DISABLE_MAC_MOVE; + enum DISABLE_LEARN_ON_VLAN; + enum DISABLE_LEARN_ON_PORT; + } + default DISABLE_PORT; + } + + container sonic-mac-move-guard { + + container MAC_MOVE_GUARD { + + description + "MAC Move Guard configuration. 
+ Only the GLOBAL key is permitted."; + + list MAC_MOVE_GUARD_LIST { + key "name"; + max-elements 1; + + leaf name { + type enumeration { enum GLOBAL; } + } + + leaf enabled { + type boolean; + default false; + } + + leaf threshold { + type uint32 { range "1..max"; } + default 1000; + } + + leaf detect_interval { + type uint32 { range "1..3600"; } + units "seconds"; + default 5; + } + + leaf action_interval { + type uint32 { range "1..86400"; } + units "seconds"; + default 600; + } + + leaf action { + type mac-move-guard-action; + } + } + } + } +} +``` + +Notes: +- `max-elements 1` enforces the "only `GLOBAL`" constraint, mirroring the orch's runtime check +- Range constraints keep `sonic-cfggen` from accepting nonsensical values such as zero or negative seconds +- Defaults are aligned with the orch's in-class defaults (`m_threshold = 1000`, `m_durationSeconds = 5`, `m_recoverySeconds = 600`, `m_action = DISABLE_PORT`) +- Adding a new action means adding an `enum` value to `mac-move-guard-action` and a `case` in `MacMoveGuardOrch::doTask()` + +### 7.4 CLI / sonic-cfggen examples +No new CLI commands are introduced in Phase 1. Configuration is applied through sonic config utilities, which validate the input against the YANG model in [§7.3](#73-yang-model) before writing CONFIG_DB. 
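The orch-side field handling described in the §6.3.1 notes (an invalid field is skipped while the other fields in the same SET are applied) could mirror the YANG ranges roughly as follows. This is a sketch under assumptions: `GuardConfig`, `parseRanged()` and `applyFields()` are hypothetical names, and counting unknown fields as skipped is a choice made for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>

struct GuardConfig {
    bool enabled = false;
    uint32_t threshold = 1000;
    uint32_t detect_interval = 5;
    uint32_t action_interval = 600;
    std::string action = "DISABLE_PORT";
};

// Parses an unsigned decimal string and checks it against [lo, hi].
bool parseRanged(const std::string& text, uint64_t lo, uint64_t hi, uint32_t& out) {
    assert(lo <= hi);
    if (text.empty() || text.size() > 10) return false;  // at most 10 digits
    uint64_t value = 0;
    for (char c : text) {
        if (c < '0' || c > '9') return false;
        value = value * 10 + static_cast<uint64_t>(c - '0');
    }
    if (value < lo || value > hi) return false;
    out = static_cast<uint32_t>(value);
    return true;
}

// Applies every valid field to `cfg` and returns how many fields were skipped.
int applyFields(const std::map<std::string, std::string>& fields, GuardConfig& cfg) {
    int skipped = 0;
    for (const auto& kv : fields) {
        const std::string& name = kv.first;
        const std::string& value = kv.second;
        bool ok = false;
        if (name == "enabled") {
            ok = (value == "true" || value == "false");
            if (ok) cfg.enabled = (value == "true");
        } else if (name == "threshold") {
            ok = parseRanged(value, 1, UINT32_MAX, cfg.threshold);
        } else if (name == "detect_interval") {
            ok = parseRanged(value, 1, 3600, cfg.detect_interval);
        } else if (name == "action_interval") {
            ok = parseRanged(value, 1, 86400, cfg.action_interval);
        } else if (name == "action") {
            ok = (value == "DISABLE_PORT" || value == "DISABLE_MAC_MOVE" ||
                  value == "DISABLE_LEARN_ON_VLAN" || value == "DISABLE_LEARN_ON_PORT");
            if (ok) cfg.action = value;
        }
        if (!ok) ++skipped;  // a real orch would log the offending field here
    }
    return skipped;
}
```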
+ +Enable with the port-shut action: +```bash +cat <<'EOF' > /tmp/mac-move-guard.json +{ + "MAC_MOVE_GUARD": { + "GLOBAL": { + "enabled": "true", + "threshold": "5000", + "detect_interval": "5", + "action_interval": "120", + "action": "DISABLE_PORT" + } + } +} +EOF +sonic-cfggen -j /tmp/mac-move-guard.json --write-to-db +``` + +Switch to MAC pinning (non-disruptive to forwarding): +```bash +sonic-cfggen -a '{"MAC_MOVE_GUARD": {"GLOBAL": {"action": "DISABLE_MAC_MOVE"}}}' --write-to-db +``` + +Switch to learn-disable on the offending VLAN: +```bash +sonic-cfggen -a '{"MAC_MOVE_GUARD": {"GLOBAL": {"action": "DISABLE_LEARN_ON_VLAN"}}}' --write-to-db +``` + +Switch to learn-disable on the offending bridge ports: +```bash +sonic-cfggen -a '{"MAC_MOVE_GUARD": {"GLOBAL": {"action": "DISABLE_LEARN_ON_PORT"}}}' --write-to-db +``` + +Disable the feature (reverts every in-flight action): +```bash +sonic-cfggen -a '{"MAC_MOVE_GUARD": {"GLOBAL": {"enabled": "false"}}}' --write-to-db +``` + +Persist across reboot: +```bash +config save -y +``` + +Operational visibility is provided through orchagent syslog at NOTICE level, including bad-MAC detection, port admin-down/up transitions, FDB attribute changes for `DISABLE_MAC_MOVE`, VLAN/bridge-port learn enable/disable transitions, and action-interval expiry. + +## 8. Warmboot and Fastboot Impact +- No additional sleeps in boot-critical path; the orch is constructed after `FdbOrch` and `PortsOrch` are up +- The feature is **stateless across reboot** by design — rationale: the bad-MAC state is itself a reaction to ongoing pathological L2 churn, and the same churn (or its absence) will re-establish (or not re-establish) the state quickly after the data plane comes back up +- Any port `MacMoveGuardOrch` had admin-disabled is **not** automatically restored on reboot. 
The admin-disable was persisted via the standard port admin path (the `admin_status` field of the port's `PORT` table entry), so user-visible state is consistent across reboot +- Static FDB entries written by `DISABLE_MAC_MOVE`, and the `SAI_VLAN_ATTR_LEARN_DISABLE` / `SAI_BRIDGE_PORT_ATTR_FDB_LEARNING_MODE` writes from the learn-disable actions, are SAI-only and do not survive warm reboot +- If full warm-reboot fidelity is required later, tracking maps would need to be persisted to STATE_DB and re-hydrated on startup (see [§12](#12-openaction-items)) + +## 9. Memory Consumption +- Minimal control-plane state in orchagent +- `m_macTrackingState` and `m_learntMac` are bounded in steady state by the SDK's MAC-table size. Quiet MACs are garbage-collected from `m_macTrackingState` by `checkRecovery()` after one detection interval of silence +- `m_disabledPorts`, `m_learnDisabledVlans`, `m_learnDisabledBridgePorts` are bounded by the number of bad MACs currently in effect +- No growth while the feature is disabled + +## 10. Restrictions/Limitations +- **Switch-global policy.** Threshold, action, and intervals are global; per-VLAN or per-port policy is not supported in Phase 1 +- **Action set.** Four actions are implemented (`DISABLE_PORT`, `DISABLE_MAC_MOVE`, `DISABLE_LEARN_ON_VLAN`, `DISABLE_LEARN_ON_PORT`). Other actions are reserved in the enum but not wired up +- **`DISABLE_MAC_MOVE` causes drops on other ports.** When a bad MAC is pinned to a port via `DISABLE_MAC_MOVE`, frames sourced from that MAC that arrive on any other port are dropped until the action interval expires +- **`DISABLE_LEARN_ON_VLAN` is coarse.** All new MAC learning on the VLAN stops for the duration of the action interval, affecting every host on that VLAN and not just the offender + +## 11. 
Testing Requirements +### 11.1 Unit tests (one-liners) +1) Threshold trips at exactly N=`threshold` MOVE events within `detect_interval`; bad MAC FSM enters BAD_MAC +2) Threshold trips identically when moves are synthesized from alternating LEARN events on two ports for the same MAC +3) Sliding window: `threshold-1` moves, sleep > `detect_interval`, `threshold-1` more moves — MAC stays TRACKED +4) `DISABLE_PORT` pinning: with two bad MACs sharing two ports, exactly one port is admin-disabled and reference-counted correctly +5) `DISABLE_PORT` recovery: after `action_interval`, port is re-enabled; if a second bad MAC still requires it, the port stays down +6) `DISABLE_MAC_MOVE` programming: three SAI attribute writes occur in the documented order; `bridge_port_id` is anchored to `state.last_port` +7) `DISABLE_MAC_MOVE` recovery: after `action_interval`, FDB entry returns to `TYPE=DYNAMIC` +8) `DISABLE_LEARN_ON_VLAN`: `SAI_VLAN_ATTR_LEARN_DISABLE=true` written exactly once even with multiple bad MACs on the same VLAN; restored to false only after the last bad MAC is released +9) `DISABLE_LEARN_ON_PORT`: per-bridge-port ref-counting writes `FDB_LEARNING_MODE=DISABLE` once per port and restores `HW` mode after the last bad MAC is released +10) Feature disable cleanup: all disabled ports re-enabled, all static FDB entries removed, all learn-disable attributes restored before maps are cleared +11) GC of quiet MACs: after one detection interval of silence, tracking entry is erased +12) Config rejection: non-`GLOBAL` keys are dropped with WARN; bad `action` value falls back to `DISABLE_PORT` with WARN; invalid integer values logged at ERROR without crashing +13) YANG validation: `sonic-cfggen` rejects out-of-range `detect_interval`, invalid `action`, and non-`GLOBAL` keys before they reach CONFIG_DB + +### 11.2 System tests +1) Two ports in the same VLAN, host MAC bouncing at 100 per second: confirm port goes admin-down within `threshold / rate + detect_interval` +2) Stop 
the bouncing source, wait `action_interval + recovery_check_interval`: disabled port restored +3) Switch the `action` while a bad MAC is in effect: in-flight action continues until natural expiry; new action applies on the next violation +4) Stress: 1000 distinct MACs each moving once below the threshold; no false positives and `m_macTrackingState` does not grow unbounded +5) With `action=DISABLE_LEARN_ON_PORT`, pre-existing FDB entries on the affected port continue to forward; only new LEARNs are gated +6) Warm-reboot and verify the feature comes back up disabled-by-default (no STATE_DB persistence) + +## 12. Open/Action items +- `show mac-move-guard` CLI backed by STATE_DB counters (total bad MACs, total disabled ports, total learn-disabled VLANs/bridge ports, per-MAC history) +- Support `action=DISABLE_LEARN_ON_MAC` using ACL entries for detected bad MACs
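
The sliding-window behavior that unit tests 1 to 3 and 11 in §11.1 exercise can be captured in a small reference model, which may be handy when deriving expected results for those tests. The sketch below is illustrative only and is not the orchagent implementation: the class and method names (`MoveGuardModel`, `MacState`, `on_learn`, `gc_quiet`), the `(vlan, mac)` key, and the choice to keep a move that is exactly `detect_interval` old inside the window are all assumptions made for this example.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class MacState:
    """Per-MAC record; a hypothetical stand-in for one m_macTrackingState entry."""
    last_port: str
    last_seen: float
    move_times: deque = field(default_factory=deque)
    bad: bool = False


class MoveGuardModel:
    """Illustrative reference model of detection, not the orchagent code."""

    def __init__(self, threshold: int, detect_interval: float):
        self.threshold = threshold
        self.detect_interval = detect_interval
        self.macs = {}  # (vlan, mac) -> MacState; keying is an assumption

    def on_learn(self, vlan: int, mac: str, port: str, now: float) -> str:
        """Process one LEARN event and return 'TRACKED' or 'BAD_MAC'."""
        key = (vlan, mac)
        state = self.macs.get(key)
        if state is None:
            # First sighting: start tracking, there is no move to count yet.
            self.macs[key] = MacState(last_port=port, last_seen=now)
            return "TRACKED"
        state.last_seen = now
        if port != state.last_port:
            # A LEARN on a different port counts as a synthesized move.
            state.last_port = port
            state.move_times.append(now)
        # Slide the window: forget moves older than detect_interval.
        while state.move_times and now - state.move_times[0] > self.detect_interval:
            state.move_times.popleft()
        if len(state.move_times) >= self.threshold:
            state.bad = True
        return "BAD_MAC" if state.bad else "TRACKED"

    def gc_quiet(self, now: float) -> None:
        """Erase non-bad entries that were silent for longer than detect_interval."""
        quiet = [k for k, s in self.macs.items()
                 if not s.bad and now - s.last_seen > self.detect_interval]
        for key in quiet:
            del self.macs[key]
```

With `threshold=3` and `detect_interval=5`, a first LEARN on one port followed by three alternating LEARNs one second apart reaches `BAD_MAC` on the third move. Two moves, a pause longer than five seconds, and two more moves leave the MAC in `TRACKED`, and a later `gc_quiet()` call erases the entry.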