Describe the bug
net_kernel:start/2 followed by net_kernel:stop/0 in a tight loop intermittently fails to restart with:
```
Protocol 'inet_tcp': the name @ seems to be in use by another Erlang node
```
The failure is a race between epmd's userspace processing of the prior TCP FIN and the new node's ALIVE2_REQ. When gen_tcp:close/1 returns to erl_epmd, the kernel has queued the FIN, but epmd's read loop has not necessarily processed the EOF and unregistered the node by the time the next ALIVE2_REQ arrives. epmd therefore returns ALIVE2_RESP status=1 ("already registered"), which erl_epmd:wait_for_reg_reply/2 translates to {error, duplicate_name}, and net_kernel:init/1 propagates that as {stop, duplicate_name}.
TCP close is the only unregister signal in the protocol, so a client that wants to re-register cannot avoid this race except by waiting or retrying.
To Reproduce
```erlang
-module(repro_otp_epmd_race).
-export([main/0]).

%% Run with: erlc repro_otp_epmd_race.erl && erl -noshell -s repro_otp_epmd_race main
main() ->
    Errors = loop(1000, 0, []),
    io:format("done. failures=~p~n", [length(Errors)]),
    erlang:halt(case Errors of [] -> 0; _ -> 1 end).

loop(0, _, Errors) ->
    Errors;
loop(N, Iter, Errors) ->
    NextIter = Iter + 1,
    case net_kernel:start(otp_repro, #{name_domain => shortnames}) of
        {ok, _} ->
            ok = net_kernel:stop(),
            loop(N - 1, NextIter, Errors);
        {error, _} = E ->
            catch net_kernel:stop(),
            loop(N - 1, NextIter, [{NextIter, E} | Errors])
    end.
```
On an otherwise idle Linux VM (multipass, on an Intel Mac), I see 4-5 failures per 1000 iterations. Under CPU load (e.g. taskset -c 1,2,3 stress-ng --cpu 3 --cpu-load 90) the failure rate rises slightly.
Expected behavior
net_kernel:start/2 either succeeds or fails for a real reason (the name is actually in use by another live node). It should not fail because epmd has not yet finished its bookkeeping for this same node's prior connection.
Affected versions
This was tested with OTP 28.1.
Additional context
I packet-captured 1000 iterations of an equivalent reproducer with tcpdump -w epmd.pcap port 4369 and parsed the ALIVE2_RESP status byte. All status=1 responses (16/1000) occurred when the new connection's SYN arrived 1 to 3ms after the prior connection's FIN; every connection with more than 3ms gap succeeded.
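For reference, a minimal sketch of the status-byte check described above. The wire format follows the distribution protocol documentation: ALIVE2_RESP is tag 121 ('y'), then a 1-byte Result (0 = ok, nonzero = error) and a 16-bit Creation. The module and function names are mine, not anything in OTP; note also that OTP 23+ epmd may answer with ALIVE2_X_RESP (tag 118, 32-bit creation), which this sketch does not handle.

```erlang
-module(alive2_resp).
-export([decode/1]).

%% Decode an ALIVE2_RESP message: <<121, Result, Creation:16>>.
decode(<<121, 0, Creation:16>>) ->
    {ok, Creation};
decode(<<121, Result, _Creation:16>>) ->
    {error, Result};   %% e.g. Result = 1 for "already registered"
decode(_Other) ->
    {error, unexpected_reply}.
```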
This was found investigating a downstream test_net_kernel flake in AtomVM. The retry approach reduced the failure rate from ~2.5% to 0/1000 in our erl_epmd (4 retries, 5/10/20/40ms backoff). Happy to send a PR for OTP's erl_epmd if the approach is acceptable.
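The retry shape we used can be sketched as below. This is a hypothetical standalone helper, not OTP's actual erl_epmd code: the module/function names and the generic fun-based interface are assumptions; the 5/10/20/40 ms schedule mirrors the one described above. In practice RegisterFun would be a closure around the registration call (e.g. erl_epmd:register_node/3), whose duplicate-name failure surfaces as {error, duplicate_name} per erl_epmd:wait_for_reg_reply/2.

```erlang
-module(epmd_retry_sketch).
-export([with_retry/2]).

%% Call RegisterFun(); if it fails with {error, duplicate_name}, assume
%% epmd has not yet processed the prior connection's FIN, sleep for the
%% next backoff interval (ms), and retry. Any other result is returned
%% as-is; an exhausted backoff list returns the last error.
with_retry(RegisterFun, Backoffs) ->
    case RegisterFun() of
        {ok, _} = Ok ->
            Ok;
        {error, duplicate_name} when Backoffs =/= [] ->
            [Delay | Rest] = Backoffs,
            timer:sleep(Delay),
            with_retry(RegisterFun, Rest);
        Error ->
            Error
    end.
```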