Opened 3 years ago

Closed 3 years ago

#854 closed Bug / Defect (fixed)

TLS Renegotiation Slowdown

Reported by: mknutson Owned by: Steffan Karger
Priority: major Milestone: release 2.4.1
Component: Crypto Version: OpenVPN 2.4.0 (Community Ed)
Severity: Not set (select this one, unless your'e a OpenVPN developer) Keywords: TLS renegotiation regen-sec
Cc: Samuli Seppänen

Description

I am working on integrating OpenVPN into a device and ran into an issue with TLS renegotiationg getting slower and slower the more times it occurs. The effects have been observed when using OpenVPN between the device (server) and an x86 Linux VM (client).

The issue occurred when running OpenVPN 2.4.0 on both sides, or when running a combination of 2.4.0 and 2.3.14. The issue did not occur when running OpenVPN 2.3.14 on both sides, suggesting that it may be a regression in 2.4.0, as opposed to a cross-compilation issue on this particular device.

I will attach config files, ping logs, an Excel graph, and OpenVPN logs (verb 5) for the following tests:
Test 1: OpenVPN 2.4.0 on both sides
Test 2: OpenVPN 2.3.13 on the device (server) and OpenVPN 2.4.0 on the VM (client)
Test 3: OpenVPN 2.3.13 on both sides

Directions
==========
Start an OpenVPN connection between two devices with reneg-sec set to a small value
Have each device ping the other device every second
Start "top -d1" on each device
Each time there is a TLS renegotiation, the CPU spikes on both machines spikes and network traffic stops (this is expected).
After a while, you will notice that the renegotiation begins to take longer, causing higher CPU usage.
After a while, the TLS renegotiation starts to take longer than 1 second, causing the ping response times to rise.

Embedded System
===============
With reneg-sec set to 10 seconds, the TLS renegotiation rises from 800ms to 6000ms after running for 1 hour (e.g. approx 360 renegotiations).
800Mhz MIPS processor
Linux 2.6.36
mipsel-unknown-linux-uclibc-gcc (Buildroot 2011.05) 4.3.5

OpenVPN 2.4.0, cross-compiled using default configure options (auto-detected liblz4, liblzo2, libpam, libssl)

OpenVPN 2.4.0 mipsel-unknown-linux-uclibc [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on Mar 2 2017
library versions: OpenSSL 1.0.2h 3 May 2016, LZO 2.03
Originally developed by James Yonan
Copyright (C) 2002-2017 OpenVPN Technologies, Inc. <sales@…>
Compile time defines: enable_async_push=no enable_comp_stub=no enable_crypto=yes enable_crypto_ofb_cfb=yes enable_debug=yes enable_def_auth=yes enable_dlopen=unknown enable_dlopen_self=unknown enable_dlopen_self_static=unknown enable_fast_install=yes enable_fragment=yes enable_iproute2=no enable_libtool_lock=yes enable_lz4=yes enable_lzo=yes enable_management=yes enable_multi=yes enable_multihome=yes enable_pam_dlopen=no enable_pedantic=no enable_pf=yes enable_pkcs11=no enable_plugin_auth_pam=yes enable_plugin_down_root=yes enable_plugins=yes enable_port_share=yes enable_selinux=no enable_server=yes enable_shared=yes enable_shared_with_static_runtimes=no enable_small=no enable_static=yes enable_strict=no enable_strict_options=no enable_systemd=no enable_werror=no enable_win32_dll=yes enable_x509_alt_username=no with_crypto_library=openssl with_gnu_ld=yes with_mem_check=no with_plugindir='$(libdir)/openvpn/plugins' with_sysroot=no

x86 VM
======
With reneg-sec set to 3 seconds, the TLS renegotiation rises to 2000ms after running for 1 hour (e.g. approx 1200 renegotiations)
Intel i7-4870HQ @ 2.5Ghz
VirtualBox 5.1.14 / Vagrant 1.9.1 / UbuntuMATE 14.04
Linux vagrant-ubuntu-trusty-64 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

OpenVPN 2.4.0, installed using openvpn-aptrepo.list:

deb http://build.openvpn.net/debian/openvpn/stable trusty main

OpenVPN 2.4.0 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Dec 27 2016
library versions: OpenSSL 1.0.1f 6 Jan 2014, LZO 2.06
Originally developed by James Yonan
Copyright (C) 2002-2017 OpenVPN Technologies, Inc. <sales@…>
Compile time defines: enable_async_push=no enable_comp_stub=no enable_crypto=yes enable_crypto_ofb_cfb=yes enable_debug=yes enable_def_auth=yes enable_dependency_tracking=no enable_dlopen=unknown enable_dlopen_self=unknown enable_dlopen_self_static=unknown enable_fast_install=yes enable_fragment=yes enable_iproute2=yes enable_libtool_lock=yes enable_lz4=yes enable_lzo=yes enable_maintainer_mode=no enable_management=yes enable_multi=yes enable_multihome=yes enable_pam_dlopen=no enable_pedantic=no enable_pf=yes enable_pkcs11=yes enable_plugin_auth_pam=yes enable_plugin_down_root=yes enable_plugins=yes enable_port_share=yes enable_selinux=no enable_server=yes enable_shared=yes enable_shared_with_static_runtimes=no enable_small=no enable_static=yes enable_strict=no enable_strict_options=no enable_systemd=no enable_werror=no enable_win32_dll=yes enable_x509_alt_username=yes with_crypto_library=openssl with_gnu_ld=yes with_ifconfig_path=/sbin/ifconfig with_iproute_path=/sbin/ip with_mem_check=no with_plugindir='${prefix}/lib/openvpn' with_route_path=/sbin/route with_sysroot=no

Attachments (12)

client.config (230 bytes) - added by mknutson 3 years ago.
Client configuration
server.config (261 bytes) - added by mknutson 3 years ago.
Server configuration
test1_device_to_vm.PNG (25.7 KB) - added by mknutson 3 years ago.
Test 1 ping times
test1_vm_to_device.PNG (33.4 KB) - added by mknutson 3 years ago.
Test 1 ping times
test3_device_to_vm.PNG (12.3 KB) - added by mknutson 3 years ago.
Test 3 ping times
test3_vm_to_device.PNG (11.3 KB) - added by mknutson 3 years ago.
Test 3 ping times
test2_vm_to_device_3sec.PNG (16.0 KB) - added by mknutson 3 years ago.
Test 2 ping times
test1_device_to_vm_60sec.PNG (112.8 KB) - added by mknutson 3 years ago.
test1_device_to_vm_reneg60_tran60_hand30.PNG (60.5 KB) - added by mknutson 3 years ago.
Test 1 ping times with tran-window and hand-window
test1_vm_to_device_reneg60_tran60_hand30.PNG (58.4 KB) - added by mknutson 3 years ago.
Test 1 ping times with tran-window and hand-window
0001-Remove-duplicate-X509-env-variables.patch (2.3 KB) - added by Steffan Karger 3 years ago.
test1_patched_2_4_0.PNG (34.8 KB) - added by mknutson 3 years ago.
After patching, ping response times are flat

Download all attachments as: .zip

Change History (31)

comment:1 Changed 3 years ago by David Sommerseth

Without having checked too much into your test configuration ... you might hit a corner case; I know I have with a similar testing scenario.

When using --reneg-sec lower than 60 seconds, you also need to consider --hand-window, and perhaps even --tran-window. The --hand-window is 60 seconds by default. I have experienced odd behaviours when using --reneg-sec lower than 15 seconds. My initial testing improved by also lowering the --hand-window. But beware, much of the implementation logic around the renegotiations does account for at least 60 seconds life time before a renegotiation happens.

Changed 3 years ago by mknutson

Attachment: client.config added

Client configuration

Changed 3 years ago by mknutson

Attachment: server.config added

Server configuration

Changed 3 years ago by mknutson

Attachment: test1_device_to_vm.PNG added

Test 1 ping times

Changed 3 years ago by mknutson

Attachment: test1_vm_to_device.PNG added

Test 1 ping times

Changed 3 years ago by mknutson

Attachment: test3_device_to_vm.PNG added

Test 3 ping times

Changed 3 years ago by mknutson

Attachment: test3_vm_to_device.PNG added

Test 3 ping times

comment:2 Changed 3 years ago by Gert Döring

Cc: Steffan Karger added

cc'ing syzzer - what could be different in 2.4.0 that also manifests in 2.3.14<->2.4.0 connections (= no AEAD)?

comment:3 Changed 3 years ago by mknutson

One correction, the x86 VM renegotiation times seem to rise to 1000ms after an hour with (reneg-sec set to 3) for test 2 (not 2000ms as originally posted).

Changed 3 years ago by mknutson

Attachment: test2_vm_to_device_3sec.PNG added

Test 2 ping times

Changed 3 years ago by mknutson

comment:4 Changed 3 years ago by mknutson

Hi dazo, Thanks for the suggestion!

We tried a run with reneg-sec=60, and still saw the renegotiation times rising (see attached test1_device_to_vm_60sec.PNG). We also tried another run with reneg-sec=60,tran-window=30,hand-window=30, and got messages stating, "TLS Error: local/remote TLS keys are out of sync" during the second half of each 60-second window.

I'm running another set of tests with reneg-sec=60,tran-window=60,hand-window=30 right now. Is there a different set of values that worked well for you in the past?

comment:5 Changed 3 years ago by Steffan Karger

This could be related to something else dazo spotted earlier: 2.4.0 adds a number of environment variables for each renegotiation, which will increase the number of allocations and amount memory used for each reconnect, but also increase the computation required when going through the list.

I was already planning to spend some of my spare time tonight on OpenVPN, so I'll try to cook up a patch. It would be great if you could test that patch to see if it resolves your issue.

If you have time before tonight, could you check:

  1. Whether the memory usage indeed increases for each renegatiation
  2. Whether both memory usage and connection time drop again once you disconnect the client, wait for - say - 2 minutes, and reconnect again? (But keep the server running)
Last edited 3 years ago by Steffan Karger (previous) (diff)

comment:6 Changed 3 years ago by mknutson

With reneg-sec set to 60, setting tran-window to 60 and hand-window to 30 did not seem to resolve the issue. I did an overnight test with these settings (2.4.0 on both ends) and saw renegotiation times hitting 18 seconds (see attached graphs).

I'm busy for a few hours, but as soon as I get a chance, I'll take a look at the memory usage and the specific reconnect scenario described by syzzer.

Changed 3 years ago by mknutson

Test 1 ping times with tran-window and hand-window

Changed 3 years ago by mknutson

Test 1 ping times with tran-window and hand-window

comment:7 Changed 3 years ago by signo-

Syzzer,

Disconnecting client for ~4 minutes and reconnecting to the server doesn't fix the issue, pings pick up where they left off time wise and memory/CPU util seems to remain about the same. Memory was ~118312K used before the disconnect and after the reconnect. The system on a fresh restart has ~107508K used after a system reboot and starting openvpn so not a

The above tests were done on the same system mknutson was using. I will note that using v2.4 from a windows PC to a windows PC I saw similar behavior with ping times increasing with --reneg-sec 60.

comment:8 Changed 3 years ago by Steffan Karger

Hm, that does not really confirm my suspicions. Thanks for testing though!

But let's try to exclude this for real. I attached a patch that should resolve the issue I had in mind. Could you build OpenVPN (git master, or release/2.4) with this patch applied? Then run this code either at both ends, or 2.3 on the other end, and see if the renegotiation times still increase.

Changed 3 years ago by Steffan Karger

comment:9 Changed 3 years ago by David Sommerseth

Just a question to the patch

        if (item->string
            && 0 == strncmp("X509_", item->string, strlen("X509_")))

Is that strlen() truly needed? Okay, I see the argument of being explicit what the length refers to. But how will this hit the performance? Will the compiler be smart enough to optimize this not to call strlen() for each iteration? If we want explicit clarity, what about:

    const char *filter_item = "X509_";
    const size_t filter_len = strlen(filter_item);
    while (item)
    {
        struct env_item *next = item->next;
        if (item->string
            && 0 == strncmp(filter_item, item->string, filter_len))
        {
            env_set_del(es, item->string);
        }
        item = next;
    }

Otherwise, in this case I am also perfectly fine with just

        if (item->string
            && 0 == strncmp("X509_", item->string, 5))

comment:10 Changed 3 years ago by mknutson

Hi syzzer,

Thank you for the quick turn around on this. I'm impressed at how rapidly this issue was isolated and fixed!

I ran our accelerated testing (with reneg-sec=3) on a patched v2.4.0 on both ends. I also ran with 2.3.14 on the VM and a patched v2.4.0 on the other end. In both cases, I ran for over 20 minutes and did not observe any renegotiation slowdown effects (previously, the renegotiation was over 1 second by that point). We'll run some longer tests to be sure, but I'd tentatively say that this patch seems to resolve this issue.

Let me know if you decide to update the patch per dazo's suggestion and I can give that a sanity check, too.

comment:11 in reply to:  9 Changed 3 years ago by Steffan Karger

Replying to dazo:

Just a question to the patch

        if (item->string
            && 0 == strncmp("X509_", item->string, strlen("X509_")))

Is that strlen() truly needed? Okay, I see the argument of being explicit what the length refers to. But how will this hit the performance? Will the compiler be smart enough to optimize this not to call strlen() for each iteration?

Yes, compilers are smart enough to optimize the strlen() calls away. Both gcc and clang with -O2 do not emit any calls to strlen(). See for example the following intermixed assembly generated by gcc:

1497:../.././../openvpn/src/openvpn/ssl_verify.c ****         if (item->string
 8816                   .loc 8 1497 0
 8817 6745 488B45F0         movq    -16(%rbp), %rax
 8818 6749 4889C2           movq    %rax, %rdx
 8819 674c 48C1EA03         shrq    $3, %rdx
 8820 6750 4881C200         addq    $2147450880, %rdx
 8820      80FF7F
 8821 6757 0FB612           movzbl  (%rdx), %edx
 8822 675a 84D2             testb   %dl, %dl
 8823 675c 7408             je  .L627
 8824 675e 4889C7           movq    %rax, %rdi
 8825 6761 E8000000         call    __asan_report_load8@PLT
 8825      00
 8826               .L627:
 8827 6766 488B45F0         movq    -16(%rbp), %rax
 8828 676a 488B00           movq    (%rax), %rax
 8829 676d 4885C0           testq   %rax, %rax
 8830 6770 7435             je  .L628
1498:../.././../openvpn/src/openvpn/ssl_verify.c ****             && 0 == strncmp("X509_", item->string, strlen("X509_")))
 8831                   .loc 8 1498 0
 8832 6772 488B45F0         movq    -16(%rbp), %rax
 8833 6776 488B00           movq    (%rax), %rax
 8834 6779 BA050000         movl    $5, %edx
 8834      00
 8835 677e 4889C6           movq    %rax, %rsi
 8836 6781 488D3D00         leaq    .LC90(%rip), %rdi
 8836      000000
 8837 6788 E8000000         call    strncmp@PLT
 8837      00
 8838 678d 85C0             testl   %eax, %eax
^LGAS LISTING /tmp/ccu4zSLd.s           page 287


 8839 678f 7516             jne .L628

Note in particular the movl $5, %edx right before the call strncmp@PLT.

So I stand by my approach :) It's explicit about what it wants to do, and the compiler understands it well enough to make it as efficient as more verbose or less explicit approaches.

Last edited 3 years ago by Steffan Karger (previous) (diff)

comment:12 in reply to:  10 Changed 3 years ago by Steffan Karger

Owner: set to Steffan Karger
Status: newaccepted

Replying to mknutson:

Thank you for the quick turn around on this. I'm impressed at how rapidly this issue was isolated and fixed!

I ran our accelerated testing (with reneg-sec=3) on a patched v2.4.0 on both ends. I also ran with 2.3.14 on the VM and a patched v2.4.0 on the other end. In both cases, I ran for over 20 minutes and did not observe any renegotiation slowdown effects (previously, the renegotiation was over 1 second by that point). We'll run some longer tests to be sure, but I'd tentatively say that this patch seems to resolve this issue.

Thanks for the kind words. Most credits go to you and dazo though: good bug reports like this are seldom but extremely helpful in pinning down issues, and dazo already spotted and reported the underlying issue to me. All I did was make the connection and fix the bug I introduced myself ;)

I'll send the patch to the list for inclusion in 2.4.1.

comment:13 Changed 3 years ago by mknutson

I ran a longer test last night with the patched 2.4.0 running on both the VM and the device. I left reneg-sec set to 3 seconds and ran the test for 16 hours (approximately 15000 TLS renegotiations). During the course of this test, there was no observable slowdown in the ping times (all were under 350ms).

It's clear that this patch resolves the issue that we had been seeing. Thanks again for the help on this!

Changed 3 years ago by mknutson

Attachment: test1_patched_2_4_0.PNG added

After patching, ping response times are flat

comment:14 Changed 3 years ago by mknutson

Hi syzzer,

Would you be willing to delete the "test2_device_3sec.log​" and "test2_vm_3sec.log​" attachments from this ticket to prevent spambots from picking up the email addresses in the log? I tried to delete them (or upload new, scrubbed versions), but I don't seem to have the right permissions on Trac. Much appreciated!

comment:15 Changed 3 years ago by Steffan Karger

Cc: Samuli Seppänen added; Steffan Karger removed

I don't seem to have rights to remove attachments either (or I just can't find the right button...). Samuli, can you check this out?

comment:16 Changed 3 years ago by Steffan Karger

Cc: Samuli Seppänen added; Samuli Seppänen removed

Hm, seems his nick here is samuli, not mattock.

Samuli: see above.

comment:17 Changed 3 years ago by Gert Döring

For whatever reasons, I seem to have the necessary superpowers :) - attachments deleted (though the path there is hidden - you have to click on an attachment, and then below the actual text there is a "delete attachment" button)

comment:18 Changed 3 years ago by Gert Döring

Your patch has been applied to the master and release/2.4 branch.

commit fd0361813cd3d5a55f3408a018e2ed776d79fef6 (master)
commit 19d6cd4e63557c983cb0ab5094688c974588c2ff (release/2.4)
Author: Steffan Karger
Date: Thu Mar 9 09:13:32 2017 +0100

Remove duplicate X509 env variables
Signed-off-by: Steffan Karger <steffan@…>
Acked-by: Gert Doering <gert@…>
Message-Id: <1489047212-31994-1-git-send-email-steffan@…>
URL: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg14

237.html

comment:19 Changed 3 years ago by Steffan Karger

Milestone: release 2.4.1
Resolution: fixed
Status: acceptedclosed

... which means that this ticket can be closed!

Note: See TracTickets for help on using tickets.