🐛 BUG: slow memory leak #1633

@theblop

Description

What version of nebula are you using? (nebula -version)

1.10.3

What operating system are you using?

Rocky Linux 8 (RHEL 8 clone)

Describe the Bug

I use nebula on about 100 hosts to run a p2p app inside the mesh. Each host is typically connected to about 50 other hosts max.

I've been running this setup for a few years, but only today, while looking at Prometheus stats, did I notice that nebula slowly leaks memory over many months until it gets restarted. This happened with 1.9.7, and since upgrading all my nodes to 1.10.3 today I can see memory slowly going up again.

nebula_runtime_MemStats_Alloc for the last 5 months using v1.9.7 (each drop to 0 is a host restarting):

Image

and the last 5h with v1.10.3 (memory trend going up slowly too):

Image

Interestingly, my 3 lighthouses do NOT show this leak, but they also carry no p2p traffic, so that may explain the difference.
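To put a number on the trend in nebula_runtime_MemStats_Alloc, one option is to fit a straight line through periodic samples and read off the slope. The following is only a minimal sketch: the function name `leak_rate_bytes_per_hour` is hypothetical, the input is assumed to be (unix seconds, bytes) pairs exported from Prometheus, and the sample values are made up for illustration, not measurements from my hosts.

```python
from typing import Sequence, Tuple


def leak_rate_bytes_per_hour(samples: Sequence[Tuple[float, float]]) -> float:
    """Least-squares slope of (unix_seconds, bytes) samples, in bytes per hour."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    # Variance of the timestamps; zero means every sample shares one timestamp.
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must span more than one timestamp")
    cov = sum((t - mean_t) * (v - mean_v) for t, v in samples)
    # cov / var_t is bytes per second; scale to bytes per hour.
    return cov / var_t * 3600.0


# Made-up samples: one per hour, growing by exactly 1 MiB each hour.
example = [(h * 3600.0, 50_000_000.0 + h * 1_048_576.0) for h in range(6)]
rate = leak_rate_bytes_per_hour(example)
print(f"{rate / 1_048_576:.2f} MiB/hour")
```

Since Alloc is the Go heap in use and moves up and down with every garbage-collection cycle, a fit over a long window (days rather than minutes) is likely to say more than comparing two individual readings.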

Logs from affected hosts

I'm not sure what logs and how much to include that could be relevant to the memory leak...

At info level I mostly see "Handshake timed out" messages (10-20 per min), which I think is normal since not all my nodes are always reachable. There is also an "ssh user logged in" message every minute, because a cronjob fetches the nebula hostmaps over ssh each minute.

Here are a couple of minutes of logs from one of the hosts:

Mar 17 17:54:00 host1 nebula[2606833]: time="2026-03-17T17:54:00+01:00" level=info msg="Handshake timed out" durationNs=6856008331 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=353>
Mar 17 17:54:00 host1 nebula[2606833]: time="2026-03-17T17:54:00+01:00" level=info msg="Handshake timed out" durationNs=6743008693 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=254>
Mar 17 17:54:01 host1 nebula[2606833]: time="2026-03-17T17:54:01+01:00" level=info msg="ssh user logged in" remoteAddress="127.0.0.1:59224" sshFingerprint="SHA256:xxxxxxxxxxxx>
Mar 17 17:54:02 host1 nebula[2606833]: time="2026-03-17T17:54:02+01:00" level=info msg="Handshake timed out" durationNs=6795351794 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=400>
Mar 17 17:54:07 host1 nebula[2606833]: time="2026-03-17T17:54:07+01:00" level=info msg="Handshake timed out" durationNs=7088424603 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=370>
Mar 17 17:54:07 host1 nebula[2606833]: time="2026-03-17T17:54:07+01:00" level=info msg="Handshake timed out" durationNs=6606159160 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=247>
Mar 17 17:54:11 host1 nebula[2606833]: time="2026-03-17T17:54:11+01:00" level=info msg="Handshake timed out" durationNs=6661422710 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=134>
Mar 17 17:54:16 host1 nebula[2606833]: time="2026-03-17T17:54:16+01:00" level=info msg="Handshake timed out" durationNs=6861751980 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=359>
Mar 17 17:54:16 host1 nebula[2606833]: time="2026-03-17T17:54:16+01:00" level=info msg="Handshake timed out" durationNs=6750871014 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=238>
Mar 17 17:54:17 host1 nebula[2606833]: time="2026-03-17T17:54:17+01:00" level=info msg="Handshake timed out" durationNs=6693127148 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=902>
Mar 17 17:54:23 host1 nebula[2606833]: time="2026-03-17T17:54:23+01:00" level=info msg="Handshake timed out" durationNs=6761373363 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=151>
Mar 17 17:54:23 host1 nebula[2606833]: time="2026-03-17T17:54:23+01:00" level=info msg="Handshake timed out" durationNs=6648436805 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=145>
Mar 17 17:54:25 host1 nebula[2606833]: time="2026-03-17T17:54:25+01:00" level=info msg="Handshake timed out" durationNs=6663000386 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=212>
Mar 17 17:54:30 host1 nebula[2606833]: time="2026-03-17T17:54:30+01:00" level=info msg="Handshake timed out" durationNs=6659522520 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=796>
Mar 17 17:54:30 host1 nebula[2606833]: time="2026-03-17T17:54:30+01:00" level=info msg="Handshake timed out" durationNs=6644052190 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=357>
Mar 17 17:54:32 host1 nebula[2606833]: time="2026-03-17T17:54:32+01:00" level=info msg="Handshake timed out" durationNs=6796309453 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=238>
Mar 17 17:54:37 host1 nebula[2606833]: time="2026-03-17T17:54:37+01:00" level=info msg="Handshake timed out" durationNs=6808914034 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=278>
Mar 17 17:54:37 host1 nebula[2606833]: time="2026-03-17T17:54:37+01:00" level=info msg="Handshake timed out" durationNs=6990890344 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=279>
Mar 17 17:54:41 host1 nebula[2606833]: time="2026-03-17T17:54:41+01:00" level=info msg="Handshake timed out" durationNs=6961631643 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=302>
Mar 17 17:54:44 host1 nebula[2606833]: time="2026-03-17T17:54:44+01:00" level=info msg="Handshake timed out" durationNs=6733695836 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=228>
Mar 17 17:54:44 host1 nebula[2606833]: time="2026-03-17T17:54:44+01:00" level=info msg="Handshake timed out" durationNs=6733697616 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=112>
Mar 17 17:54:44 host1 nebula[2606833]: time="2026-03-17T17:54:44+01:00" level=info msg="Handshake timed out" durationNs=6733715816 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=350>
Mar 17 17:54:44 host1 nebula[2606833]: time="2026-03-17T17:54:44+01:00" level=info msg="Handshake timed out" durationNs=6707769459 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=389>
Mar 17 17:54:46 host1 nebula[2606833]: time="2026-03-17T17:54:46+01:00" level=info msg="Handshake timed out" durationNs=6660714223 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=332>
Mar 17 17:54:48 host1 nebula[2606833]: time="2026-03-17T17:54:48+01:00" level=info msg="Handshake timed out" durationNs=6849693245 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=292>
Mar 17 17:54:51 host1 nebula[2606833]: time="2026-03-17T17:54:51+01:00" level=info msg="Handshake timed out" durationNs=6748233156 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=162>
Mar 17 17:54:51 host1 nebula[2606833]: time="2026-03-17T17:54:51+01:00" level=info msg="Handshake timed out" durationNs=6703145971 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=102>
Mar 17 17:54:51 host1 nebula[2606833]: time="2026-03-17T17:54:51+01:00" level=info msg="Handshake timed out" durationNs=6703208511 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=306>
Mar 17 17:54:51 host1 nebula[2606833]: time="2026-03-17T17:54:51+01:00" level=info msg="Handshake timed out" durationNs=6703229851 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=225>
Mar 17 17:54:53 host1 nebula[2606833]: time="2026-03-17T17:54:53+01:00" level=info msg="Handshake timed out" durationNs=6748145156 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=256>
Mar 17 17:54:55 host1 nebula[2606833]: time="2026-03-17T17:54:55+01:00" level=info msg="Handshake timed out" durationNs=6746614980 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=151>
Mar 17 17:55:00 host1 nebula[2606833]: time="2026-03-17T17:55:00+01:00" level=info msg="Handshake timed out" durationNs=6758704525 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=596>
Mar 17 17:55:00 host1 nebula[2606833]: time="2026-03-17T17:55:00+01:00" level=info msg="Handshake timed out" durationNs=6644429325 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=138>
Mar 17 17:55:01 host1 nebula[2606833]: time="2026-03-17T17:55:01+01:00" level=info msg="ssh user logged in" remoteAddress="127.0.0.1:51440" sshFingerprint="SHA256:xxxxxxxxxxxx>
Mar 17 17:55:02 host1 nebula[2606833]: time="2026-03-17T17:55:02+01:00" level=info msg="Handshake timed out" durationNs=6895768992 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=327>
Mar 17 17:55:07 host1 nebula[2606833]: time="2026-03-17T17:55:07+01:00" level=info msg="Handshake timed out" durationNs=6690390610 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=231>
Mar 17 17:55:08 host1 nebula[2606833]: time="2026-03-17T17:55:08+01:00" level=info msg="Handshake timed out" durationNs=6669996449 handshake="map[stage:1 style:ix_psk0]" initiatorIndex=338>
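In case the handshake timeout rate turns out to matter for the leak, the messages can be counted per minute directly from the journal output. This is a rough sketch that assumes the text log format shown above; the helper name `messages_per_minute` is hypothetical, and the three sample lines are shortened copies of lines from the excerpt.

```python
import re
from collections import Counter
from typing import Iterable

# Captures the timestamp truncated to the minute, and the msg field.
_LINE = re.compile(
    r'time="(\d{4}-\d{2}-\d{2}T\d{2}:\d{2}):\d{2}[^"]*" level=\w+ msg="([^"]*)"'
)


def messages_per_minute(lines: Iterable[str], message: str) -> Counter:
    """Count log lines whose msg field equals `message`, keyed by minute."""
    counts: Counter = Counter()
    for line in lines:
        match = _LINE.search(line)
        if match and match.group(2) == message:
            counts[match.group(1)] += 1
    return counts


# Shortened copies of lines from the log excerpt.
sample = [
    'Mar 17 17:54:00 host1 nebula[2606833]: time="2026-03-17T17:54:00+01:00" '
    'level=info msg="Handshake timed out" durationNs=6856008331',
    'Mar 17 17:54:01 host1 nebula[2606833]: time="2026-03-17T17:54:01+01:00" '
    'level=info msg="ssh user logged in" remoteAddress="127.0.0.1:59224"',
    'Mar 17 17:55:00 host1 nebula[2606833]: time="2026-03-17T17:55:00+01:00" '
    'level=info msg="Handshake timed out" durationNs=6758704525',
]
counts = messages_per_minute(sample, "Handshake timed out")
for minute, count in sorted(counts.items()):
    print(minute, count)
```

Plotting these per-minute counts next to the memory graph would show whether hosts with more timeouts also grow faster.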

Config files from affected hosts

configs for the p2p hosts:

pki:
  ca: /opt/nebula/ca.crt
  cert: /opt/nebula/host.crt
  key: /opt/nebula/host.key
  blocklist:
  disconnect_invalid: true
static_host_map:
  "100.96.0.1":
    - "lh1:4242"
  "100.96.0.2":
    - "lh2:4242"
  "100.96.0.3":
    - "lh3:4242"
lighthouse:
  am_lighthouse: false
  serve_dns: false
  dns:
    host: 0.0.0.0
    port: 15353
  interval: 60
  hosts:
    - "100.96.0.3"
    - "100.96.0.1"
    - "100.96.0.2"
  local_allow_list:
    interfaces:
      "docker.*": false
      "br-.*": false
      "nebula.*": false
  remote_allow_list:
    "0.0.0.0/0": true
    "::/0": false
listen:
  host: 0.0.0.0
  port: 4242
  read_buffer: 20000000
  write_buffer: 20000000
punchy:
  punch: true
  respond: true
cipher: aes
sshd:
  enabled: true
  listen: 127.0.0.1:12222
  host_key: /opt/nebula/ssh_host.key
  authorized_users:
    - user: root
      keys:
        - "xxxxxxxxxxxxxxxxxxxx"
relay:
  am_relay: false
  use_relays: true
tun:
  disabled: false
  dev: nebula1
  drop_local_broadcast: false
  drop_multicast: false
  tx_queue: 5000
  mtu: 1300
  routes:
  unsafe_routes:
logging:
  level: info
  format: text
stats:
  type: prometheus
  listen: 0.0.0.0:18888
  path: /metrics
  subsystem: nebula
  interval: 10s
  lighthouse_metrics: true
firewall:
  outbound_action: drop
  inbound_action: drop
  conntrack:
    tcp_timeout: 12m
    udp_timeout: 3m
    default_timeout: 10m
  outbound:
    - port: any
      proto: any
      host: any
  inbound:
    ...

configs for the lighthouses:

    pki:
      ca: /config/ca.crt
      cert: /config/host.crt
      key: /config/host.key
      disconnect_invalid: true
    lighthouse:
      am_lighthouse: true
      serve_dns: true
      dns:
        host: 0.0.0.0
        port: 53
      interval: 60
    listen:
      host: 0.0.0.0
      port: 4242
    punchy:
      punch: true
    relay:
      am_relay: true
      use_relays: true
    tun:
      disabled: false
      dev: nebula1
      drop_local_broadcast: false
      drop_multicast: false
      tx_queue: 500
      mtu: 1440
      routes:
      unsafe_routes:
    logging:
      level: info
      format: text
    sshd:
      enabled: true
      listen: 0.0.0.0:12222
      host_key: /ssh/ssh_host.key
      authorized_users:
        - user: root
          keys:
            - "xxxxxxxxxxxxxx"
    stats:
      type: prometheus
      listen: 0.0.0.0:8080
      path: /metrics
      subsystem: nebula
      interval: 30s
      lighthouse_metrics: true
    firewall:
      outbound_action: drop
      inbound_action: drop
      conntrack:
        tcp_timeout: 12m
        udp_timeout: 3m
        default_timeout: 10m
      outbound:
        - port: any
          proto: any
          host: any
      inbound:
        ...
