Opened 3 months ago

Closed 3 months ago

Last modified 3 months ago

#1208 closed Bug / Defect (fixed)

TCP socket backlog set too low

Reported by: mjo Owned by:
Priority: major Milestone: release 2.4.8
Component: Generic / unclassified Version: OpenVPN 2.4.6 (Community Ed)
Severity: Not set (select this one, unless you're an OpenVPN developer) Keywords:
Cc: plaisthos, Antonio

Description

Our Nagios monitor has been reporting that OpenVPN is down periodically for about a week. Most users were connected fine -- so I tried to ignore it -- but today I was forced to deal with it. After a server reboot (to troubleshoot another issue), my own OpenVPN client failed to re-establish its connection.

The first symptom I noticed was that in my client logs I see,

Aug 14 10:21:34 [openvpn] TCP/UDP: Preserving recently used remote address: [AF_INET]192.168.0.1:443
Aug 14 10:21:34 [openvpn] Socket Buffers: R=[131072->131072] S=[16384->16384]
Aug 14 10:21:34 [openvpn] Attempting to establish TCP connection with [AF_INET]192.168.0.1:443 [nonblock]

over and over again. (Our OpenVPN server is on TCP/443, and uses a tls-auth key.) This goes on for about half an hour, until it finally works... after I've changed nothing. This is repeatable: if I reboot the server, everyone loses their connection, and then re-establishing it is flaky. It does eventually work, though.

And of course, our Nagios system (which uses https://github.com/liquidat/nagios-icinga-openvpn) is reporting that the service is down the whole time. So there's that symptom too =)

After messing around on the server for a while and tweaking a bunch of things that didn't help, I finally tried the simplest test I could think of. On the server, I tried to establish an OpenVPN connection to localhost/443. That fails:

vpn1 # /usr/lib64/nagios/plugins/check_openvpn --timeout=60 --tcp --port 443 --tls-auth /etc/openvpn/keys/ta.key localhost
CRIT: Not responding

Which is... pretty surprising, given that netstat shows OpenVPN listening on port 443 as usual. I now notice the following in my dmesg output on the server:

[Wed Aug 14 11:06:08 2019] TCP: request_sock_TCP: Possible SYN flooding on port 443. Dropping request.  Check SNMP counters.

This leads me on a wild goose chase, tweaking a bunch of kernel parameters that don't help. Ultimately, I dig into the OpenVPN source code, and find

if (listen(sd, 1))

in src/openvpn/socket.c. Aha! That's setting the "backlog" to one, which limits the server to a single pending connection at a time. I believe it's also responsible for that "syn flooding" message.

I'm not sure why exactly that causes a problem (perhaps because a failed tls-auth session leaves a connection open?), but after patching it to say "8" instead of "1", I've noticed that everyone can connect immediately and that our Nagios monitor is happy.

This is all recent, so I'll keep an eye on things, but to me it looks like that "1" is too low with TCP/tls-auth.
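In case it helps, my local change is essentially a one-liner; a sketch only (the surrounding error handling in socket.c is unchanged and not shown here, and the "8" is just the value I happened to pick):

/* A backlog of 8 lets several pending TCP connections queue up while the
 * server is busy with another client, instead of just one. */
if (listen(sd, 8))    /* was: listen(sd, 1) */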

Change History (9)

comment:1 Changed 3 months ago by Gert Döring

Cc: plaisthos added

Not sure why we do this... but it has been that way since "day one" (I went back to release/2.2 and it also has listen (sd, 1)).

Normally this should be fine, as new connections do not happen that often (read: only every now and then) and there are not "that many in parallel", so it's interesting that it actually fails for you in steady state. Maybe something else is trying to connect to that 443 port all the time, keeping the server busy with garbage connections? Anything interesting to see in the server log (with --verb 4 or so)?

@plaisthos, @ordex: I do not see any obvious drawback to increasing this - it will potentially cause a bit more work, because "massively parallel connections due to DoS" that would previously have been ignored are now accepted and then dropped after the tls-auth check (or tls check). But since we only handle one new connection at a time, nothing really changes here - if an attacker wants to annoy us and sends 5 SYNs per second, we'll be busy either way, no matter what the backlog is...

What is a reasonable value for the backlog? https://stackoverflow.com/questions/10002868/what-value-of-backlog-should-i-use suggests "nobody knows" :-)
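One mitigating factor: on Linux the kernel silently caps whatever we pass to listen() at net.core.somaxconn anyway, so an overly generous constant cannot do much harm. A quick standalone sketch (not a proposed patch) to see what that cap is on a given box:

/* Sketch: print the kernel's cap on the listen() backlog. Whatever backlog
 * the application asks for, Linux silently truncates it to the value in
 * /proc/sys/net/core/somaxconn (historically 128 by default). */
#include <stdio.h>

int main(void)
{
    FILE *f = fopen("/proc/sys/net/core/somaxconn", "r");
    int cap = -1;

    if (f && fscanf(f, "%d", &cap) == 1)
    {
        printf("listen() backlog will be capped at %d\n", cap);
    }
    else
    {
        printf("could not read somaxconn\n");
    }
    if (f)
    {
        fclose(f);
    }
    return 0;
}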

comment:2 Changed 3 months ago by Gert Döring

Cc: Antonio added

comment:3 Changed 3 months ago by mjo

My original suspicion was that the failed tls-auth connections were left incomplete, and stuck in the queue. POSIX suggests that it's possible: the "backlog" argument acts only as a suggestion for the length of the incoming-connection queue, which can consist of both complete and incomplete connections.

But, I'm on Linux, and the man page for the Linux implementation says that the "backlog" should be for established connections only -- the connections that need to be accept()ed. For incomplete connections, there is a separate queue (inaccessible to the application) whose length is controlled by the value of /proc/sys/net/ipv4/tcp_max_syn_backlog. In my case, when I was testing on localhost, netstat convinced me that the connection was never established (I was seeing SYN_SENT, if I recall correctly). All of that suggests that my guess might not be right, since the "backlog" value shouldn't limit incomplete connections.

Nevertheless, everything has been fine all day with the "backlog" set to 8 instead of 1. I'll turn up the logging and see if anything stands out. I'm reluctant to re-break things, but if all else fails, I'm pretty sure I could reproduce the issue.
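For the record, if I do need to reproduce it outside of OpenVPN, something like this minimal listener ought to show the same behaviour (a sketch, not anything I've run against the production box; the port number is arbitrary):

/* Minimal reproducer sketch: listen with a backlog of 1 and never accept().
 * After a couple of clients connect, the accept queue is full; further
 * clients should hang in SYN_SENT and eventually time out -- much like
 * what I saw against port 443. */
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void)
{
    int sd = socket(AF_INET, SOCK_STREAM, 0);
    struct sockaddr_in sa;
    int one = 1;

    if (sd < 0)
    {
        perror("socket");
        return 1;
    }

    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(9443);                  /* arbitrary test port */
    sa.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    setsockopt(sd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

    if (bind(sd, (struct sockaddr *)&sa, sizeof(sa)) || listen(sd, 1))
    {
        perror("bind/listen");
        return 1;
    }

    pause();                                    /* never accept() anything */
    return 0;
}

(Then hit it with a handful of parallel "telnet localhost 9443" sessions.)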

comment:4 Changed 3 months ago by mjo

There's nothing too crazy at verb 4. I have a few of these, which look like random scans (I don't recognize that IP address):

Aug 15 08:40:11 [openvpn] MULTI: multi_create_instance called
Aug 15 08:40:11 [openvpn] Re-using SSL/TLS context
Aug 15 08:40:11 [openvpn] LZO compression initializing
Aug 15 08:40:11 [openvpn] Control Channel MTU parms [ L:1624 D:1182 EF:68 EB:0 ET:0 EL:3 ]
Aug 15 08:40:11 [openvpn] Data Channel MTU parms [ L:1624 D:1450 EF:124 EB:406 ET:0 EL:3 AF:3/1 ]
Aug 15 08:40:11 [openvpn] Local Options String (VER=V4): 'V4,dev-type tun,link-mtu 1544,tun-mtu 1500,proto TCPv4_SERVER,comp-lzo,keydir 0,cipher BF-CBC,auth SHA1,keysize 128,tls-auth,key-method 2,tls-server'
Aug 15 08:40:11 [openvpn] Expected Remote Options String (VER=V4): 'V4,dev-type tun,link-mtu 1544,tun-mtu 1500,proto TCPv4_CLIENT,comp-lzo,keydir 1,cipher BF-CBC,auth SHA1,keysize 128,tls-auth,key-method 2,tls-client'
Aug 15 08:40:11 [openvpn] TCP connection established with [AF_INET]198.108.66.161:23544
Aug 15 08:40:11 [openvpn] TCPv4_SERVER link local: (not bound)
Aug 15 08:40:11 [openvpn] TCPv4_SERVER link remote: [AF_INET]198.108.66.161:23544
Aug 15 08:40:11 [openvpn] 198.108.66.161:23544 WARNING: Bad encapsulated packet length from peer (5635), which must be > 0 and <= 1627 -- please ensure that --tun-mtu or --link-mtu is equal on both peers -- this condition could also indicate a possible active attack on the TCP link -- [Attempting restart...]
Aug 15 08:40:11 [openvpn] 198.108.66.161:23544 Connection reset, restarting [0]
Aug 15 08:40:11 [openvpn] 198.108.66.161:23544 SIGUSR1[soft,connection-reset] received, client-instance restarting
Aug 15 08:40:11 [openvpn] TCP/UDP: Closing socket

And a lot of these, because it's our Nagios server:

Aug 15 08:43:56 [openvpn] MULTI: multi_create_instance called
Aug 15 08:43:56 [openvpn] Re-using SSL/TLS context
Aug 15 08:43:56 [openvpn] LZO compression initializing
Aug 15 08:43:56 [openvpn] Control Channel MTU parms [ L:1624 D:1182 EF:68 EB:0 ET:0 EL:3 ]
Aug 15 08:43:56 [openvpn] Data Channel MTU parms [ L:1624 D:1450 EF:124 EB:406 ET:0 EL:3 AF:3/1 ]
Aug 15 08:43:56 [openvpn] Local Options String (VER=V4): 'V4,dev-type tun,link-mtu 1544,tun-mtu 1500,proto TCPv4_SERVER,comp-lzo,keydir 0,cipher BF-CBC,auth SHA1,keysize 128,tls-auth,key-method 2,tls-server'
Aug 15 08:43:56 [openvpn] Expected Remote Options String (VER=V4): 'V4,dev-type tun,link-mtu 1544,tun-mtu 1500,proto TCPv4_CLIENT,comp-lzo,keydir 1,cipher BF-CBC,auth SHA1,keysize 128,tls-auth,key-method 2,tls-client'
Aug 15 08:43:56 [openvpn] TCP connection established with [AF_INET]10.1.1.10:42786
Aug 15 08:43:56 [openvpn] TCPv4_SERVER link local: (not bound)
Aug 15 08:43:56 [openvpn] TCPv4_SERVER link remote: [AF_INET]10.1.1.10:42786
Aug 15 08:43:56 [openvpn] 10.1.1.10:42786 TLS: Initial packet from [AF_INET]10.1.1.10:42786, sid=545c1c6b aed99cd3
Aug 15 08:43:56 [openvpn] 10.1.1.10:42786 Connection reset, restarting [0]
Aug 15 08:43:56 [openvpn] 10.1.1.10:42786 SIGUSR1[soft,connection-reset] received, client-instance restarting
Aug 15 08:43:56 [openvpn] TCP/UDP: Closing socket

Aside from one old client with a 64-bit cipher warning, those are the only unusual entries.

comment:5 Changed 3 months ago by Gert Döring

Thanks for the logs. This basically confirms what I suspected - you're hit by random port scanners (ubiquitous on the Internet on "well-known" ports these days, alas) and so they occasionally keep the port "engaged" for a short moment, making other connections fail.

As for why Nagios fails "more often recently", this does not really answer it (it should succeed most of the time, only fail when it conflicts with an incoming port scanner - but maybe there were just *more scanners* doing their thing) - but anyway, bumping up the listen() backlog sounds like a relatively sane thing to do.

I have turned this into a formal patch and sent it to the openvpn-devel list. Should show up as soon as SF greylisting wants to talk to me.

comment:6 Changed 3 months ago by mjo

Thank you. This problem is a lot more interesting than the ones I usually solve =)

I'm back again to say that my previous comment (which contradicted the comment before THAT) was wrong. The Cloudflare blog gave me a hint: the way that the SYN queue limit is handled changed with linux-4.3, and we're running linux-5.1.11. Now, the test is against sk_max_ack_backlog, which the linux source code tells me is the "listen backlog set in listen()".

That means that the linux man page for listen() is no longer correct, and neither is the analysis in my previous comment. I've filed a bug about this.

So if I don't change my mind again: with newer kernels, calling listen(sd, 1) also caps the SYN queue length at "1", which perfectly explains what I was seeing - a "syn flood" of length 1 that left me unable to telnet to the openvpn daemon on localhost.

It also means that your patch is justified.

I'll report back if someone tells me I'm wrong on that kernel man page bug, but I'm optimistic.

comment:7 Changed 3 months ago by Gert Döring

I'm all convinced, David and Antonio are also all convinced, so this went into the tree in record time :-)

commit 6d8380c78bf77766454b93b49ab2ebf713b0be48 (master)
commit ec0ca68f4ed1e6aa6f08f470b18e0198b7e5a4da (release/2.4)
Author: Gert Doering
Date: Thu Aug 15 17:53:19 2019 +0200

Increase listen() backlog queue to 32

will hit 2.4.8 whenever that comes out (no fixed release date yet).

comment:8 Changed 3 months ago by Gert Döring

Milestone: release 2.4.8
Resolution: fixed
Status: new → closed

comment:9 Changed 3 months ago by mjo

Thank you!
