Opened 4 years ago
Last modified 3 months ago
#1166 new Bug / Defect
Key renegotiation blocks openvpn server process
Reported by: | sumpfralle | Owned by: | Steffan Karger |
---|---|---|---|
Priority: | major | Milestone: | |
Component: | Crypto | Version: | OpenVPN 2.4.0 (Community Ed) |
Severity: | Not set (select this one, unless you're an OpenVPN developer) | Keywords: |
Cc: | David Sommerseth |
Description
Our wireless community uses OpenVPN servers (Debian Stretch, 2.4.0-6+deb9u3). Each server handles around 60 clients.
Recently we noticed that the OpenVPN server process periodically seems to block for about six seconds. During these six seconds, traffic on the OpenVPN tun interface stops completely. Under normal circumstances around 30 to 60 MBit/s are transferred. The host itself operates normally - only the OpenVPN process seems to misbehave.
After a good amount of digging around, I noticed that every single occurrence of these dropouts happens exactly when one specific client processes its key renegotiation. This timing was easy to see, since that specific client is a bit older (2.3.6) and thus the key renegotiation leaves a warning message (regarding "SWEET32") in the server log.
As soon as I restarted the OpenVPN process of that specific client, everything went back to normal: no visible (longer than one second) dropouts occur.
The troublesome client uses exactly the same setup as many others, so there does not seem to be anything specific to this client; I cannot tell what makes it special.
Then I looked more closely at key renegotiations and minor dropouts (just a few packets getting lost). The pattern seems to be that every key renegotiation (at least those with older clients emitting "SWEET32" warnings) carries a chance of losing some packets.
Now I am curious: is the key negotiation processed on the client or on the server? Does the server process block only for the server-side of this calculation or also for the client-side? (our servers are quite fast; but the clients are slow embedded devices)
The following lines are the strace log during one of these dropouts:
00:24:53.207723 recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(60440), inet_pton(AF_INET6, "::ffff:217.226.195.234", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="#?\236\330\341\243O\235\215\0\0\0\0\21:\344\26\3\1\0F\20\0\0BA\4n\376\v\375i"..., iov_len=2030}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 114
00:24:53.207796 time(NULL) = 1552001093
00:24:53.207852 time(NULL) = 1552001093
00:24:53.207907 time(NULL) = 1552001093
00:24:53.208175 time([1552001093]) = 1552001093
00:24:53.208244 time([1552001093]) = 1552001093
00:24:53.208573 time([1552001093]) = 1552001093
00:24:53.208636 time([1552001093]) = 1552001093
00:24:55.091926 time([1552001095]) = 1552001095
00:24:55.092029 time([1552001095]) = 1552001095
00:24:56.983555 time([1552001096]) = 1552001096
00:24:56.983674 time([1552001096]) = 1552001096
00:24:59.034857 time(NULL) = 1552001099
00:24:59.034998 time(NULL) = 1552001099
00:24:59.035056 time(NULL) = 1552001099
00:24:59.035124 time(NULL) = 1552001099
00:24:59.035170 time(NULL) = 1552001099
00:24:59.035241 time(NULL) = 1552001099
00:24:59.035311 time(NULL) = 1552001099
00:24:59.035368 poll([{fd=7, events=POLLOUT}, {fd=6, events=0}, {fd=3, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=7, revents=POLLOUT}])
00:24:59.035450 time(NULL) = 1552001099
00:24:59.035505 sendmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(60440), inet_pton(AF_INET6, "::ffff:217.226.195.234", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="+\375\251\1\23kk\r\6\1\0\0\0\21?\236\330\341\243O\235\215", iov_len=22}], msg_iovlen=1, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32}], msg_controllen=40, msg_flags=0}, 0) = 22
00:24:59.035613 time(NULL) = 1552001099
00:24:59.035662 time(NULL) = 1552001099
00:24:59.035708 time(NULL) = 1552001099
Please note the lack of activity between 00:24:53 and 00:24:59. Usually there are a few thousand lines of strace log output for every second.
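For reference, a trace of this shape can be captured with something along these lines (the PID lookup and syscall filter are just illustrative, not the exact command I used):
strace -tt -p "$(pidof openvpn)" -e trace=recvmsg,sendmsg,poll,time -o /tmp/openvpn.strace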
Change History (16)
comment:1 Changed 4 years ago by
Cc: | David Sommerseth added |
---|
The 6 second dead time is very unusual. Normally the key renegotiation (as well as the initial negotiation) should be fast enough that it will not significantly impact traffic, though it can cause some delay due to buffering.
As for the SWEET32 client: if the server is 2.4, you could just change that client's config to use a newer cipher (aes-256-cbc) and permit that in the server's --ncp-ciphers list (--ncp-ciphers aes-256-gcm:aes-256-cbc:aes-128-gcm or something like that) - it should then use said cipher for the old client, and AES-GCM for the newer ones.
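For concreteness, a sketch of what that could look like (the cipher list is just the example above, adjust to your policy):
# on the old client (illustrative):
cipher AES-256-CBC
# on the 2.4 server (illustrative):
ncp-ciphers AES-256-GCM:AES-256-CBC:AES-128-GCM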
comment:2 Changed 4 years ago by
Thank you for your hint regarding configuration changes avoiding this issue!
In our case this is not easily possible, since only a few affected clients are under our control.
Regarding the problematic dead time: now (two days later) the issue with that specific troublesome client came back (currently at around four seconds of dead time). This client handles quite a lot of traffic and thus has the shortest key renegotiation period of all connected clients (with older ciphers and thus a reduced renegotiation bytes limit).
The dead time behavior seems to be the following:
- no visible dead time at the beginning of the connection
- the dead time grows after a while, maybe based on:
- duration of the connection
- amount of traffic exchanged
- number of key renegotiations
Other clients also seem to cause a (shorter) dead time of the server process.
comment:3 Changed 4 years ago by
Sorry for spamming / adding more details ...
I tried to determine which of the above properties relates to the length of the noticeable dead time:
- duration of the connection: seems to be unrelated
- amount of traffic exchanged: seems to be a good indicator, e.g.:
- 4 GB traffic, 11 days connected: ~2 seconds of dead time
- 17 GB, 2 days connected: ~6 seconds of dead time
- number of key renegotiations: slightly related
I am obviously just guessing here, but it seems like the required time for the key renegotiation procedure depends on the amount of previously exchanged traffic within the current connection.
comment:4 Changed 4 years ago by
I don't think the absolute amount of traffic is directly relevant, but if reneg-bytes is in use, it will affect the number of renegotiations - and my gut feeling is that "something is queueing up" with renegotiations, so, more renegotiations, longer work time.
No idea what that might be, though... hoping for @syzzer or @dazo here :-)
comment:5 Changed 4 years ago by
I don't think the absolute amount of traffic is directly relevant, but if reneg-bytes is in use, it will affect the number of renegotiations ..
Yes, indeed: the renegotiations depend on reneg-bytes as well as reneg-sec.
Both are unspecified (i.e. using defaults) in our setup. The clients I am discussing here are using OpenVPN before 2.4 (i.e. reneg-bytes=64MB).
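For context, this is roughly what the relevant directives look like in a server config (the reneg-bytes value is only indicative of the ~64 MB fallback; we do not set it ourselves):
# time-based data channel key renegotiation (OpenVPN default: hourly)
reneg-sec 3600
# byte-based limit; unset in our configs, but OpenVPN 2.4 falls back to
# roughly 64 MB when a 64-bit block cipher like BF-CBC is in use (SWEET32)
#reneg-bytes 64000000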
In my example above (4 GB / 11 days, 17 GB / 2 days), I guess that both clients went through a similar number of renegotiations (11 days -> 264 hours; 17 GB -> 265 * 64 MB). The exact number depends on the traffic pattern, but it should be in the same ballpark. But both clients exposed a different dead time duration. Of course other factors could come into play here, too: performance (both clients are different embedded architectures) and load. I just guessed, the accumulated traffic could be the crucial factor. But I do not mind being proven wrong :)
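Spelled out, the ballpark comparison I mean (assuming the hourly reneg-sec default and counting 1 GB as 1000 MB):
11 days -> 264 hours -> about 264 time-triggered renegotiations
17 GB -> about 265 blocks of 64 MB -> about 265 byte-triggered renegotiations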
FYI: sending a SIGUSR1 to the client is sufficient for cutting down the dead time to a low level.
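For completeness, on our clients that simply means something like the following (assuming a single openvpn process on the client; OpenVPN treats SIGUSR1 as a soft restart):
kill -USR1 "$(pidof openvpn)"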
PS: I am amazed by the quick responses whenever I open an issue here for OpenVPN. Thank you!
comment:6 Changed 4 years ago by
Some more small details:
- The CPU load on the client and the server goes up during the key exchange. This is probably not surprising.
- The client that is exposing (or causing) the dead time on the server is a 32bit x86 architecture. The server is a 64bit x86 device.
Regarding the architecture: I cannot tell, whether the architecture is relevant for the problem, since we mainly use 32bit x86 devices (APU) and 32bit MIPS devices (Ubiquiti Nanostations) as clients.
comment:8 follow-up: 10 Changed 4 years ago by
I am looking forward to interesting theories, too! :)
Maybe it is simply expected that each key renegotiation causes a certain dead time on the server? But why should this dead time grow with the duration of the respective connection?
comment:10 Changed 4 years ago by
Replying to sumpfralle:
Maybe it is simply expected that each key renegotiation causes a certain dead time on the server? But why should this dead time grow with the duration of the respective connection?
The "a certain (short) dead time on the server" is expected, since OpenVPN is single-threaded today.
"Dead time growing with number of renegotiations (and/or duration of the connection)" is most certainly NOT expected. So something fishy is going on here.
I had hoped for @syzzer to find time to look into this (because he understands the crypto details much better) but he's awfully busy these days.
@sumpfralle: if you feel like experimenting with this, you could try to test with a separate setup (= do not disturb your production setup) whether this ramps up faster if you do something like reneg-sec 30 to force a renegotiation twice a minute...
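Concretely, that would just be a single line in the throwaway test server's config (the value only makes sense for experiments):
reneg-sec 30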
comment:11 Changed 4 years ago by
@tincantech:
- server: OpenSSL 1.1.0j 20 Nov 2018
- clients: OpenSSL 1.1.1b 26 Feb 2019 (and some older ones)
I tried the reneg-sec approach locally against a mostly unused server.
With reneg-sec 7 I noticed within 90 minutes (i.e. 770 renegotiations) that the dead time slowly grew to around one second.
Just for your information - my test involves the following:
while sleep 0.5; do timeout 0.1 ping -c 1 REMOTE_HOST >/dev/null && printf . || date; done
The resulting output on my host was the following (after 90 minutes):
.............Sun Mar 31 17:47:02 CEST 2019
Sun Mar 31 17:47:02 CEST 2019
............Sun Mar 31 17:47:09 CEST 2019
Sun Mar 31 17:47:10 CEST 2019
Sun Mar 31 17:47:10 CEST 2019
...........Sun Mar 31 17:47:17 CEST 2019
Sun Mar 31 17:47:17 CEST 2019
Sun Mar 31 17:47:18 CEST 2019
.............Sun Mar 31 17:47:25 CEST 2019
Sun Mar 31 17:47:26 CEST 2019
............Sun Mar 31 17:47:33 CEST 2019
Sun Mar 31 17:47:33 CEST 2019
Sun Mar 31 17:47:34 CEST 2019
............Sun Mar 31 17:47:41 CEST 2019
Sun Mar 31 17:47:41 CEST 2019
Sun Mar 31 17:47:42 CEST 2019
.............Sun Mar 31 17:47:49 CEST 2019
(every output of date indicates a single failed ping)
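For anyone trying to reproduce this: a slightly extended variant of that loop, which also prints a rough length for each outage (one failed iteration takes about 0.6 s, so the estimate is coarse; REMOTE_HOST is a placeholder as above):
fails=0
while sleep 0.5; do
    if timeout 0.1 ping -c 1 REMOTE_HOST >/dev/null; then
        # report the previous outage, roughly 0.6 s per failed iteration
        [ "$fails" -gt 0 ] && echo "outage: $fails failed pings (~$((fails * 6 / 10)) s) at $(date)"
        fails=0
        printf .
    else
        fails=$((fails + 1))
    fi
done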
At the start of the connection there were almost no failed pings.
Again (what a coincidence!), my client was a 32 bit system.
Maybe I will manage to test it with a 64 bit client, soon.
comment:12 Changed 4 years ago by
Now I managed to run this test with an amd64 client (Debian Stretch, 2.4.0-6+deb9u3).
Even after 90 minutes with reneg-sec 5 (more than one thousand key renegotiations), the dead time did not increase.
Thus the above problem seems to affect only 32 bit clients.
Is there anything else I could test?
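One thing I can still compare on my side (just my own idea, not something requested here) is raw public-key speed on the 32 bit vs. the 64 bit client, e.g.:
openssl speed -elapsed rsa2048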
comment:13 Changed 4 years ago by
In case there is any patch or weird configuration that you want me to try out: just tell me :)
comment:14 Changed 3 years ago by
This is just a wild stab in the dark ..
OpenSSL has introduced a new Cryptographically Secure Pseudo Random Number Generator [CSPRNG] .. is it possible that your less powerful clients are having trouble gathering enough entropy to satisfy the CSPRNG while building a new data channel key when renegotiating?
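If someone wants to check that theory: the available kernel entropy on such a client can be watched while a renegotiation happens (standard Linux procfs path, just a suggestion):
watch -n1 cat /proc/sys/kernel/random/entropy_avail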
comment:15 Changed 3 months ago by
I still don't understand why "the client does something" should have an effect on the server and other clients. The whole TLS renegotiation is not done synchronously ("waiting for packets from the client") but "whenever a packet comes in we deal with it".
OTOH lots of the TLS code has been refactored and rewritten for 2.5.0 and 2.6_beta2, so it might be worth trying a recent 2.5.x version on the server.
comment:16 Changed 3 months ago by
Thanks for coming back to this mysterious issue!
I quickly tested with our recently updated OpenVPN servers (now v2.5.1) whether the issue is still visible. With my testing approach from above (--reneg-sec 7 in the server configuration) I could not see noticeable unresponsive breaks anymore.
OTOH lots of the TLS code has been refactored and rewritten for 2.5.0 and 2.6_beta2, so it might be worth trying a recent 2.5.x version on the server.
That sounds like a reasonable explanation. Since a few of the clients are still running the same version as before, I guess we can assume that the server implementation was improved by diligent hands.
Thank you for your time and thoughts!