Opened 10 months ago

Last modified 6 weeks ago

#1329 new Bug / Defect

starting a server instance a second time (failing) messes up routing for the first instance

Reported by: Gert Döring Owned by:
Priority: major Milestone:
Component: Generic / unclassified Version: OpenVPN git master branch (Community Ed)
Severity: Not set (select this one, unless your'e a OpenVPN developer) Keywords: sitnl, netlink, double routes
Cc: Antonio, tct

Description

So, I have this server instance

port 51194
proto udp6
dev tun
server 10.204.2.0 255.255.255.0

which will on start do this:

Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: net_iface_mtu_set: mtu 1500 for tun1
Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: net_iface_up: set tun1 up
Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: net_addr_ptp_v4_add: 10.204.2.1 peer 10.204.2.2 dev tun1
...
Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: net_route_v4_add: 10.204.2.0/24 via 10.204.2.2 dev [NULL] table 0 metric -1
...
Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: setsockopt(IPV6_V6ONLY=0)
Sep 18 17:01:32 gentoo tun-udp-p2mp[27140]: UDPv6 link local (bound): [AF_INET6][undef]:51194

all good:

$ ip route |grep 10.204.1
10.204.1.0/24 via 10.204.1.2 dev tun0 
10.204.1.2 dev tun0 proto kernel scope link src 10.204.1.1 

Now, start this instance again (because you messed up your locking in the surrounding scripts, or whatever)... it fails, because it cannot bind:

2020-09-18 17:09:14 us=113238 net_addr_ptp_v4_add: 10.204.1.1 peer 10.204.1.2 dev tun7
2020-09-18 17:09:14 us=114066 net_route_v4_add: 10.204.1.0/24 via 10.204.1.2 dev [NULL] table 0 metric -1
2020-09-18 17:09:14 us=114278 setsockopt(IPV6_V6ONLY=0)
2020-09-18 17:09:14 us=114338 TCP/UDP: Socket bind failed on local address [AF_INET6][undef]:51194: Address already in use (errno=98)
2020-09-18 17:09:14 us=114376 Exiting due to fatal error
2020-09-18 17:09:14 us=114422 net_route_v4_del: 10.204.1.0/24 via 10.204.1.2 dev [NULL] table 0 metric -1

but now:

# ip route |grep 10.204.1
10.204.1.2 dev tun0 proto kernel scope link src 10.204.1.1 

the routes belonging to the *other* instance are gone.

Not sure why this happens. I think netlink should tell us "this route already exists" and then we should *not* remove it on exit (AFAIR our logic already does "we only remove routes that we successfully installed") - but we do not seem to receive this feedback.

I have not tested with an "iproute2" based openvpn binary, or on other platforms.

Change History (7)

comment:1 Changed 10 months ago by tct

Cc: tct added

comment:2 Changed 6 weeks ago by tct

A brief test on Linux:

2021-06-16 13:33:30 us=370834 OpenVPN 2.6_git [git:master/890225c1783d0f11] x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May 27 2021
2021-06-16 13:33:30 us=370855 library versions: OpenSSL 1.1.1f  31 Mar 2020, LZO 2.10
2021-06-16 13:33:30 us=371033 MANAGEMENT: Socket bind failed on local address [AF_INET]127.0.0.1:17122: Address already in use (errno=98)
2021-06-16 13:33:30 us=371057 Exiting due to fatal error

Socket bind fails before routing is changed, so appears to be resolved.

comment:3 Changed 6 weeks ago by Gert Döring

Do you actually have any "route" statements in your config? These would be needed to trigger the problem.

comment:4 Changed 6 weeks ago by tct

At first I only had --server, so I added a --route to the config.
Both versions do not cause a problem because bind happens first.
The first instance worked correctly and the second did not interfere.

Second server instance log

2021-06-16 17:50:01 us=862111 OpenVPN 2.6_git [git:master/890225c1783d0f11] x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May 27 2021
2021-06-16 17:50:01 us=862126 library versions: OpenSSL 1.1.1f  31 Mar 2020, LZO 2.10
2021-06-16 17:50:01 us=862261 MANAGEMENT: Socket bind failed on local address [AF_INET]127.0.0.1:17122: Address already in use (errno=98)
2021-06-16 17:50:01 us=862276 Exiting due to fatal error

I just realised, it is the --management interface causing the failure.

Last edited 6 weeks ago by tct (previous) (diff)

comment:5 Changed 6 weeks ago by tct

Without --management or --writepid, log of conflicting second instance:

2021-06-16 17:55:18 us=259340 OpenVPN 2.6_git [git:master/890225c1783d0f11] x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May 27 2021
2021-06-16 17:55:18 us=259368 library versions: OpenSSL 1.1.1f  31 Mar 2020, LZO 2.10
2021-06-16 17:55:18 us=259671 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
2021-06-16 17:55:18 us=260670 Outgoing Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 17:55:18 us=260702 Outgoing Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 17:55:18 us=260716 Incoming Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 17:55:18 us=260731 Incoming Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 17:55:18 us=260770 tls-crypt-v2 server key: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 17:55:18 us=260801 tls-crypt-v2 server key: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 17:55:18 us=260830 TLS-Auth MTU parms [ L:1622 D:1156 EF:94 EB:0 ET:0 EL:3 ]
2021-06-16 17:55:18 us=260991 ROUTE_GATEWAY 10.10.101.1/255.255.255.0 IFACE=enp5s0 HWADDR=24:b6:fd:31:bc:ca
2021-06-16 17:55:18 us=261039 ERROR: Cannot ioctl TUNSETIFF tun17122: Device or resource busy (errno=16)
2021-06-16 17:55:18 us=261060 Exiting due to fatal error

In the config I have --server and --route, this time it is not binding causing the error but duplicate tun device name. Routing is not effected though.

I guess, you'll want to test all this yourself, just in case I missed a spot.

comment:6 in reply to:  3 Changed 6 weeks ago by tct

Third test: Triggered the problem.

Conflicting second server instance log:

2021-06-16 18:05:08 us=238814 OpenVPN 2.6_git [git:master/890225c1783d0f11] x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on May 27 2021
2021-06-16 18:05:08 us=238846 library versions: OpenSSL 1.1.1f  31 Mar 2020, LZO 2.10
2021-06-16 18:05:08 us=239213 NOTE: the current --script-security setting may allow this configuration to call user-defined scripts
2021-06-16 18:05:08 us=240813 Outgoing Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 18:05:08 us=240859 Outgoing Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 18:05:08 us=240880 Incoming Control Channel Encryption: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 18:05:08 us=240903 Incoming Control Channel Encryption: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 18:05:08 us=240944 tls-crypt-v2 server key: Cipher 'AES-256-CTR' initialized with 256 bit key
2021-06-16 18:05:08 us=240969 tls-crypt-v2 server key: Using 256 bit message hash 'SHA256' for HMAC authentication
2021-06-16 18:05:08 us=240996 TLS-Auth MTU parms [ L:1622 D:1156 EF:94 EB:0 ET:0 EL:3 ]
2021-06-16 18:05:08 us=241215 ROUTE_GATEWAY 10.10.101.1/255.255.255.0 IFACE=enp5s0 HWADDR=24:b6:fd:31:bc:ca
2021-06-16 18:05:08 us=242717 TUN/TAP device tun1 opened
2021-06-16 18:05:08 us=242751 do_ifconfig, ipv4=1, ipv6=0
2021-06-16 18:05:08 us=242768 /sbin/ip link set dev tun1 up mtu 1500
2021-06-16 18:05:08 us=245255 /sbin/ip link set dev tun1 up
2021-06-16 18:05:08 us=248575 /sbin/ip addr add dev tun1 10.171.22.1/28
2021-06-16 18:05:08 us=252439 /sbin/ip route add 172.22.22.0/24 via 10.171.22.2
RTNETLINK answers: File exists
2021-06-16 18:05:08 us=254870 ERROR: Linux route add command failed: external program exited with error status: 2
2021-06-16 18:05:08 us=254927 Data Channel MTU parms [ L:1622 D:1450 EF:122 EB:406 ET:0 EL:3 AF:14/122 ]
2021-06-16 18:05:08 us=254959 Socket Buffers: R=[212992->212992] S=[212992->212992]
2021-06-16 18:05:08 us=254986 setsockopt(IPV6_V6ONLY=0)
2021-06-16 18:05:08 us=255015 TCP/UDP: Socket bind failed on local address [AF_INET6][undef]:17122: Address already in use (errno=98)
2021-06-16 18:05:08 us=255030 Exiting due to fatal error
2021-06-16 18:05:08 us=255060 /sbin/ip route del 172.22.22.0/24
2021-06-16 18:05:08 us=272670 Closing TUN/TAP interface
2021-06-16 18:05:08 us=272724 /sbin/ip addr del dev tun1 10.171.22.1/28

Route 172.22.22.0/24 is deleted by the second conflicting server erroring out. And now the first server's routing is borked.

Last edited 6 weeks ago by tct (previous) (diff)

comment:7 Changed 6 weeks ago by tct

For the record, this issue now has a sibling: --writepid over-writes and subsequently deletes the first instance PID file.

Note: See TracTickets for help on using tickets.