Opened 5 years ago

Last modified 15 months ago

#1166 new Bug / Defect

Key renegotiation blocks openvpn server process

Reported by: sumpfralle Owned by: Steffan Karger
Priority: major Milestone:
Component: Crypto Version: OpenVPN 2.4.0 (Community Ed)
Severity: Not set (select this one, unless you're an OpenVPN developer) Keywords:
Cc: David Sommerseth

Description

Our wireless community uses OpenVPN servers (Debian Stretch, 2.4.0-6+deb9u3). Each server handles around 60 clients.

Recently we noticed that the OpenVPN server process periodically seems to block for about six seconds. During these six seconds, traffic on the OpenVPN tun interface stops completely. Under normal circumstances around 30 to 60 MBit/s are transferred. The host itself operates normally - only the OpenVPN process seems to misbehave.

After a good amount of digging around, I noticed that every single occurrence of these dropouts happens just at the time when one specific client performs its key renegotiation. This timing was easy to see, since that specific client runs a somewhat older version (2.3.6) and thus the key renegotiation leaves a warning message (regarding "SWEET32") in the server log.

As soon as I restarted the OpenVPN process of that specific client, everything went back to normal: no visible dropouts (longer than one second) occurred anymore.

The troublesome client uses exactly the same setup as many others. Thus there does not seem to be anything specific to this client. I cannot tell what made it special.

Then I looked more closely at key renegotiations and minor dropouts (just a few packets getting lost). The pattern seems to be that every key renegotiation (at least the ones with older clients emitting "SWEET32" warnings) carries a chance of a few packets getting lost.

Now I am curious: is the key negotiation processed on the client or on the server? Does the server process block only for the server side of this calculation or also for the client side? (Our servers are quite fast, but the clients are slow embedded devices.)

The following lines are the strace log during one of these dropouts:

00:24:53.207723 recvmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(60440), inet_pton(AF_INET6, "::ffff:217.226.195.234", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="#?\236\330\341\243O\235\215\0\0\0\0\21:\344\26\3\1\0F\20\0\0BA\4n\376\v\375i"..., iov_len=2030}], msg_iovlen=1, msg_controllen=0, msg_flags=0}, 0) = 114
00:24:53.207796 time(NULL)              = 1552001093
00:24:53.207852 time(NULL)              = 1552001093
00:24:53.207907 time(NULL)              = 1552001093
00:24:53.208175 time([1552001093])      = 1552001093
00:24:53.208244 time([1552001093])      = 1552001093
00:24:53.208573 time([1552001093])      = 1552001093
00:24:53.208636 time([1552001093])      = 1552001093
00:24:55.091926 time([1552001095])      = 1552001095
00:24:55.092029 time([1552001095])      = 1552001095
00:24:56.983555 time([1552001096])      = 1552001096
00:24:56.983674 time([1552001096])      = 1552001096
00:24:59.034857 time(NULL)              = 1552001099
00:24:59.034998 time(NULL)              = 1552001099
00:24:59.035056 time(NULL)              = 1552001099
00:24:59.035124 time(NULL)              = 1552001099
00:24:59.035170 time(NULL)              = 1552001099
00:24:59.035241 time(NULL)              = 1552001099
00:24:59.035311 time(NULL)              = 1552001099
00:24:59.035368 poll([{fd=7, events=POLLOUT}, {fd=6, events=0}, {fd=3, events=POLLIN|POLLPRI}], 3, 0) = 1 ([{fd=7, revents=POLLOUT}])
00:24:59.035450 time(NULL)              = 1552001099
00:24:59.035505 sendmsg(7, {msg_name={sa_family=AF_INET6, sin6_port=htons(60440), inet_pton(AF_INET6, "::ffff:217.226.195.234", &sin6_addr), sin6_flowinfo=htonl(0), sin6_scope_id=0}, msg_namelen=28, msg_iov=[{iov_base="+\375\251\1\23kk\r\6\1\0\0\0\21?\236\330\341\243O\235\215", iov_len=22}], msg_iovlen=1, msg_control=[{cmsg_len=36, cmsg_level=SOL_IPV6, cmsg_type=0x32}], msg_controllen=40, msg_flags=0}, 0) = 22
00:24:59.035613 time(NULL)              = 1552001099
00:24:59.035662 time(NULL)              = 1552001099
00:24:59.035708 time(NULL)              = 1552001099

Please note the lack of activity between 00:24:53 and 00:24:59. Usually there are a few thousand lines of strace log output for every second.

Change History (16)

comment:1 Changed 5 years ago by Gert Döring

Cc: David Sommerseth added

the 6 second dead time is very unusual. Normally the key renegotiation (as well as the initial negotiation) should be fast enough that it will not significantly impact traffic, though it can cause some delay due to buffering.

As for the SWEET32 client - if the server is 2.4, you could just change that client's config to use a newer cipher (aes-256-cbc) and permit that in the server's "--ncp-ciphers" list (--ncp-ciphers aes-256-gcm:aes-256-cbc:aes-128-gcm or something like that) - it should then use said cipher for the old client, and AES-GCM for the newer ones.
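
A minimal sketch of what that could look like (just an illustration of the suggestion above; the option names are from the OpenVPN 2.4 manual, the exact cipher list is up to you):

# on the old 2.3.x client - move away from the 64-bit block cipher
cipher AES-256-CBC

# on the 2.4 server - allow that cipher next to the GCM defaults
ncp-ciphers AES-256-GCM:AES-256-CBC:AES-128-GCM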

comment:2 Changed 5 years ago by sumpfralle

Thank you for the hint regarding configuration changes to avoid this issue!
In our case this is not easily possible, since only a few of the affected clients are under our control.

Regarding the problematic dead time: now (two days later) the issue with that specific troublesome client came back (currently at around four seconds of dead time). This client handles quite a lot of traffic and thus has the shortest key renegotiation period of all connected clients (it uses an older cipher and thus has a reduced renegotiation bytes limit).

The dead time behavior seems to be the following:

  • no visible dead time at the beginning of the connection
  • the dead time grows after a while, maybe based on:
    • duration of the connection
    • amount of traffic exchanged
    • number of key renegotiations

Other clients also seem to cause a (shorter) dead time in the server process.

comment:3 Changed 5 years ago by sumpfralle

Sorry for spamming / adding more details ...

I tried to determine which of the above properties relates to the length of the noticeable dead time:

  • duration of the connection: seems to be unrelated
  • amount of traffic exchanged: seems to be a good indicator, e.g.:
    • 4 GB traffic, 11 days connected: ~2 seconds of dead time
    • 17 GB, 2 days connected: ~6 seconds of dead time
  • number of key renegotiations: slightly related

I am obviously just guessing here, but it seems like the required time for the key renegotiation procedure depends on the amount of previously exchanged traffic within the current connection.

comment:4 Changed 5 years ago by Gert Döring

I don't think the absolute amount of traffic is directly relevant, but if reneg-bytes is in use, it will affect the number of renegotiations - and my gut feeling is that "something is queueing up" with renegotiations, so, more renegotiations, longer work time.

No idea what that might be, though... hoping for @syzzer or @dazo here :-)

comment:5 Changed 5 years ago by sumpfralle

I don't think the absolute amount of traffic is directly relevant, but if reneg-bytes is in use, it will affect the number of renegotiations ..

Yes, indeed: the renegotiations depend on reneg-bytes as well as reneg-sec.
Both are unspecified (i.e. using defaults) in our setup. The clients I am discussing here are using OpenVPN before 2.4 (i.e. reneg-bytes=64MB).

In my example above (4 GB / 11 days, 17 GB / 2 days), I guess that both clients went through a similar number of renegotiations (11 days -> 264 hours; 17 GB -> 265 * 64 MB). The exact number depends on the traffic pattern, but it should be in the same ballpark. Yet the two clients exposed different dead time durations. Of course other factors could come into play here, too: performance (the two clients are different embedded architectures) and load. I just guessed that the accumulated traffic could be the crucial factor. But I do not mind being proven wrong :)
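
For reference, the rough renegotiation-count estimate above in shell arithmetic (assuming the documented default of reneg-sec 3600 and the 64 MB reneg-bytes limit mentioned above; these are estimates, not measured numbers):

echo $(( 11 * 24 ))          # default reneg-sec of 3600 s -> one renegotiation per hour, 11 days -> 264
echo $(( 17 * 1000 / 64 ))   # 17 GB against a 64 MB byte-based limit -> ~265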

FYI: sending a SIGUSR1 to the client is sufficient for cutting down the dead time to a low level.
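
(For reproduction that would be something like the line below - a sketch only, assuming a single openvpn process on the client.)

kill -USR1 "$(pidof openvpn)"   # assumes a single openvpn process; asks the client for a soft restart (new connection/keys)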

PS: I am amazed by the quick responses whenever I open an issue here for OpenVPN. Thank you!

comment:6 Changed 5 years ago by sumpfralle

Some more small details:

  • The CPU load on the client and the server goes up during the key exchange. This is probably not surprising.
  • The client that is exposing (or causing) the dead time on the server is a 32bit x86 architecture. The server is a 64bit x86 device.

Regarding the architecture: I cannot tell whether the architecture is relevant for the problem, since we mainly use 32bit x86 devices (APU) and 32bit MIPS devices (Ubiquiti Nanostations) as clients.

comment:7 Changed 5 years ago by tct

curious ;-)

comment:8 Changed 5 years ago by sumpfralle

I am looking forward to interesting theories, too! :)

Maybe it is simply expected that each key renegotiation causes a certain dead time on the server? But why should this dead time grow with the duration of the respective connection?

comment:9 Changed 5 years ago by tct

openssl version

comment:10 in reply to:  8 Changed 5 years ago by Gert Döring

Replying to sumpfralle:

Maybe it is simply expected that each key renegotiation causes a certain dead time on the server? But why should this dead time grow with the duration of the respective connection?

The "a certain (short) dead time on the server" is expected, since OpenVPN is single-threaded today.

"Dead time growing with number of renegotiations (and/or duration of the connection)" is most certainly NOT expected. So something fishy is going on here.

I had hoped for @syzzer to find time to look into this (because he understands the crypto details much better) but he's awfully busy these days.

@sumpfralle: if you feel like experimenting with this, you could test with a separate setup (= do not disturb your production setup) whether this ramps up faster if you do something like reneg-sec 30 to force a renegotiation twice a minute...
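
Something like this in a throw-away test server config should do (just a sketch of the suggestion above):

reneg-sec 30   # force a data channel key renegotiation twice a minute
verb 4         # slightly more verbose logging, to correlate renegotiations with the dead times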

comment:11 Changed 5 years ago by sumpfralle

@tincantech:

  • server: OpenSSL 1.1.0j 20 Nov 2018
  • clients: OpenSSL 1.1.1b 26 Feb 2019 (and some older ones)

I tried the reneg-sec approach locally against a mostly unused server.

With reneg-sec 7 I noticed within 90 minutes (i.e. around 770 renegotiations) that the dead time slowly grew to around one second.

Just for your information - my test involves the following:

while sleep 0.5; do timeout 0.1 ping -c 1 REMOTE_HOST >/dev/null && printf . || date; done

The resulting output on my host was the following (after 90 minutes):

.............Sun Mar 31 17:47:02 CEST 2019
Sun Mar 31 17:47:02 CEST 2019
............Sun Mar 31 17:47:09 CEST 2019
Sun Mar 31 17:47:10 CEST 2019
Sun Mar 31 17:47:10 CEST 2019
...........Sun Mar 31 17:47:17 CEST 2019
Sun Mar 31 17:47:17 CEST 2019
Sun Mar 31 17:47:18 CEST 2019
.............Sun Mar 31 17:47:25 CEST 2019
Sun Mar 31 17:47:26 CEST 2019
............Sun Mar 31 17:47:33 CEST 2019
Sun Mar 31 17:47:33 CEST 2019
Sun Mar 31 17:47:34 CEST 2019
............Sun Mar 31 17:47:41 CEST 2019
Sun Mar 31 17:47:41 CEST 2019
Sun Mar 31 17:47:42 CEST 2019
.............Sun Mar 31 17:47:49 CEST 2019

(every output of date indicates a single failed ping)

At the start of the connection there were almost no failed pings.

Again (what a coincidence!), my client was a 32 bit system.

Maybe I will manage to test it with a 64 bit client, soon.

comment:12 Changed 5 years ago by sumpfralle

Now I managed to run this test with an amd64 client (Debian Stretch, 2.4.0-6+deb9u3).

Even after 90 minutes with reneg-sec 5 (more than one thousand key negotiations), the dead time did not increase.

Thus the above problem seems to affect only 32 bit clients.

Is there anything else I could test?

comment:13 Changed 5 years ago by sumpfralle

In case there is any patch or weird configuration that you want me to try out: just tell me :)

comment:14 Changed 4 years ago by tct

This is just a wild stab in the dark ..

OpenSSL has introduced a new Cryptographically Secure Pseudo Random Number Generator [CSPRNG] .. is it possible that your less powerful clients are having trouble gathering enough entropy to satisfy the CSPRNG while building a new data channel key when renegotiating?
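
If entropy starvation were the culprit, one way to watch for it on a Linux client around a renegotiation could be something like this (a sketch only, not something that has been tried here):

# print the kernel's entropy estimate once per second
while sleep 1; do
    printf '%s %s\n' "$(date +%T)" "$(cat /proc/sys/kernel/random/entropy_avail)"
done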


comment:15 Changed 15 months ago by Gert Döring

I still don't understand why "the client does something" should have an effect on the server and other clients. The whole TLS renegotiation is not done synchronously ("waiting for packets from the client") but "whenever a packet comes in we deal with it".

OTOH lots of the TLS code has been refactored and rewritten for 2.5.0 and 2.6_beta2, so it might be worth trying a recent 2.5.x version on the server.

comment:16 Changed 15 months ago by sumpfralle

Thanks for coming back to this mysterious issue!

I quickly tested with our recently updated OpenVPN servers (now v2.5.1) whether the issue is still visible. With my testing approach from above (--reneg-sec 7 in the server configuration) I could not see noticeable unresponsive breaks anymore.

OTOH lots of the TLS code has been refactored and rewritten for 2.5.0 and 2.6_beta2, so it might be worth trying a recent 2.5.x version on the server.

That sounds like a reasonable explanation. Since a few of the clients are still running the same version as before, I guess we can assume that the server implementation was improved by diligent hands.

Thank you for your time and thoughts!
