
soak: add Gate 8b MAC-churn 24h harness (#104)

Merged: lance0 merged 2 commits into main from chore/gate8b-mac-churn-soak-harness on May 15, 2026

lance0 (Owner) commented May 15, 2026

Summary

Sibling to tests/soak/run-gate8b-soak.sh. Reuses the existing 2-PE shared-ESI topology and layers sustained bridge-FDB churn on top of the DF-flip loop so the soak exercises:

  • kernel-learn / local-MAC observation → Type 2 origination
  • ADR-0059 receive-side aliasing-ECMP (FDB nexthop groups)
  • RFC 7432 §15.1 MAC mobility sequencing
  • Gate 8b BUM-suppression while FDB programming is in flight
  • ADR-0059 drift-recovery counters under realistic timing

This harness is the alpha-checklist exit condition for flipping the apply_bum_enforcement and apply_aliasing_ecmp production defaults.

Design choices (per discussion)

  • Sibling harness, not a replacement. The base Gate 8b BUM-state soak is still useful for memory-only validation.
  • Direct kernel FDB mutation (docker exec <pe> bridge fdb add/del <mac> dev ce100a master static). Exercises the local-MAC observation pipeline; deliberately not the gRPC inject path.
  • Bounded rotating pool (default 512 MACs, ~256/PE). Per-batch ops every CHURN_INTERVAL_SEC (default 5s), weighted across add / delete / PE-to-PE mobility. Pool occupancy bracketed around target so the soak doesn't drift to empty or saturated.
  • DF flips run concurrently. PE2 stop/start clears PE2's pool state file to match kernel reality after restart.
  • Sampling additions: kernel-side FDB total + extern_learn counts, ip nexthop count (ADR-0059 NHGs), evpn_local_originations_total, evpn_local_origination_errors_total, evpn_local_observations_dropped_total, evpn_duplicate_mac_moves_total, all four ADR-0059 drift counters, and harness-side add / del / move totals.
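
The bounded-pool and weighted-op choices above can be sketched roughly as follows. This is a minimal illustration, not the harness's actual code: the function names (pool_mac, pick_op), the 40/40/20 weights, the 50% target, and the ±10% bracket are all assumptions made for the example; only the 512-MAC default pool size comes from the description.

```shell
#!/usr/bin/env bash
# Hypothetical sketch: names, weights, and bracket width are assumptions,
# not values taken from run-gate8b-mac-churn-soak.sh.

POOL_SIZE=512     # default pool size from the PR description
TARGET_PCT=50     # assumed target occupancy, as a percentage of the pool
BRACKET_PCT=10    # assumed tolerance either side of the target

# pool_mac <index>: deterministic, locally administered unicast MAC for a slot.
pool_mac() {
    local i=$1
    printf '02:00:00:00:%02x:%02x\n' $(( (i >> 8) & 0xff )) $(( i & 0xff ))
}

# pick_op <occupied> [roll]: choose add / del / move for the next operation.
# Outside the bracket the choice is forced back toward the target; inside it,
# a 0-99 roll picks by weight (40% add, 40% del, 20% move).
pick_op() {
    local occupied=$1 roll=${2:-$(( RANDOM % 100 ))}
    local low=$(( POOL_SIZE * (TARGET_PCT - BRACKET_PCT) / 100 ))
    local high=$(( POOL_SIZE * (TARGET_PCT + BRACKET_PCT) / 100 ))
    if (( occupied < low )); then
        echo add
    elif (( occupied > high )); then
        echo del
    elif (( roll < 40 )); then
        echo add
    elif (( roll < 80 )); then
        echo del
    else
        echo move
    fi
}
```

Forcing the operation type outside the bracket is one simple way to keep occupancy from drifting to empty or saturated; the real harness may bracket differently.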

Smoke-before-soak path

# 6-min aggressive stress (~25 ops/sec) — catches obvious leaks / hangs.
SOAK_HOURS=0.1 CHURN_INTERVAL_SEC=2 CHURN_BATCH_SIZE=50 \
    bash tests/soak/run-gate8b-mac-churn-soak.sh

# 1h soak — catches slower drift.
SOAK_HOURS=1 bash tests/soak/run-gate8b-mac-churn-soak.sh

# 24h.
bash tests/soak/run-gate8b-mac-churn-soak.sh
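
The "~25 ops/sec" figure for the aggressive smoke follows from CHURN_BATCH_SIZE / CHURN_INTERVAL_SEC. A small helper for sizing a run might look like the sketch below; churn_estimate is a hypothetical name, and the total assumes churn runs for the full SOAK_HOURS with every batch full, which the harness may not guarantee.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of the harness.
# churn_estimate <soak_hours> <interval_sec> <batch_size>
# Prints the nominal op rate and the total op count for a run.
churn_estimate() {
    awk -v h="$1" -v i="$2" -v b="$3" 'BEGIN {
        printf "%.1f ops/sec, %.0f ops total\n", b / i, (h * 3600 / i) * b
    }'
}

# The 6-minute smoke settings from above.
churn_estimate 0.1 2 50
```

For the smoke settings this prints 25.0 ops/sec and 9000 ops total.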

Test plan

  • bash -n tests/soak/run-gate8b-mac-churn-soak.sh — syntax clean
  • 6-minute aggressive stress smoke
  • 1h soak
  • 24h soak

Notes for review

  • No dedicated analyzer yet. tests/soak/analyze-gate8b-soak.py still covers memory / DF gates that apply; MAC-churn-specific gates (e.g. evpn_local_origination_errors_total == 0, drift counters bounded, extern_learn count stable on receiver) surface from manual CSV inspection. A dedicated analyzer can land alongside the first 24h run's postmortem.
  • One bug spotted-and-fixed during authoring: the churn loop runs in a background subshell, so the add/del/move totals must live in counter files (not bash vars) for the CSV writer to see them.
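
The subshell pitfall in the last note can be reproduced in a few lines. The sketch below is illustrative only: bump, read_counter, and the counter directory are hypothetical names, and it assumes a single writer per counter file (the read-modify-write is not atomic).

```shell
#!/usr/bin/env bash
# Hypothetical sketch: file and helper names are illustrative only.

COUNTER_DIR=$(mktemp -d)

# bump <name>: increment a counter file by one (single writer assumed).
bump() {
    local f="$COUNTER_DIR/$1" n=0
    if [[ -f $f ]]; then n=$(<"$f"); fi
    echo $(( n + 1 )) > "$f"
}

# read_counter <name>: print the current value, or 0 if never bumped.
read_counter() {
    local f="$COUNTER_DIR/$1"
    if [[ -f $f ]]; then cat "$f"; else echo 0; fi
}

adds_var=0
(
    # Background subshell, standing in for the churn loop.
    for _ in 1 2 3; do
        adds_var=$(( adds_var + 1 ))   # lost: only the subshell's copy changes
        bump adds                      # survives: written to disk
    done
) &
wait

echo "variable=$adds_var file=$(read_counter adds)"
```

The parent shell prints variable=0 file=3: the variable increments never leave the subshell, while the file-backed counter is visible to whatever writes the CSV row.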

Sibling to run-gate8b-soak.sh. Reuses the 2-PE shared-ESI topology
at tests/soak/gate8b-soak.clab.yml and layers sustained bridge-FDB
churn on top of the existing DF-flip loop, so the soak exercises:

- kernel-learn / local-MAC observation → Type 2 origination
- ADR-0059 receive-side aliasing-ECMP (FDB nexthop groups)
- RFC 7432 §15.1 MAC mobility sequencing
- Gate 8b BUM-suppression while FDB programming is in flight
- the ADR-0059 drift-recovery counters under realistic timing

MAC injection is direct kernel mutation via
`docker exec <pe> bridge fdb add/del <mac> dev ce100a master static`
— the daemon's local-MAC observation pipeline is the path under
test, not the gRPC inject path.

Churn pattern: bounded rotating MAC pool (default 512 MACs, ~256
per PE), batched ops every CHURN_INTERVAL_SEC (default 5s) picked
weighted across add / delete / PE-to-PE mobility. Pool occupancy
is bounded so the soak doesn't drift into empty or saturated.

CSV samples extend the base harness with kernel-side FDB totals,
extern_learn counts, ip nexthop counts, evpn_local_originations,
evpn_local_origination_errors, evpn_local_observations_dropped,
evpn_duplicate_mac_moves, and the four ADR-0059 drift counters,
plus harness-side add / del / move totals.

tests/soak/README.md gains the operator-facing section covering
topology reuse, run knobs, sampling fields, and smoke-before-soak
discipline. The MAC-churn variant is the alpha-checklist exit
condition for flipping apply_bum_enforcement and apply_aliasing_ecmp
production defaults.
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 15, 2026 14:54 — with GitHub Actions Inactive
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 15, 2026 15:01 — with GitHub Actions Inactive
@lance0 lance0 temporarily deployed to kernel-dataplane-auto May 15, 2026 15:06 — with GitHub Actions Inactive
@lance0 lance0 merged commit 5cf7d57 into main May 15, 2026
27 checks passed