Opened 8 years ago
Closed 7 years ago
#854 closed Bug / Defect (fixed)
TLS Renegotiation Slowdown
Reported by: | mknutson | Owned by: | Steffan Karger |
---|---|---|---|
Priority: | major | Milestone: | release 2.4.1 |
Component: | Crypto | Version: | OpenVPN 2.4.0 (Community Ed) |
Severity: | Not set (select this one, unless your'e a OpenVPN developer) | Keywords: | TLS renegotiation regen-sec |
Cc: | Samuli Seppänen |
Description
I am working on integrating OpenVPN into a device and ran into an issue with TLS renegotiationg getting slower and slower the more times it occurs. The effects have been observed when using OpenVPN between the device (server) and an x86 Linux VM (client).
The issue occurred when running OpenVPN 2.4.0 on both sides, or when running a combination of 2.4.0 and 2.3.14. The issue did not occur when running OpenVPN 2.3.14 on both sides, suggesting that it may be a regression in 2.4.0, as opposed to a cross-compilation issue on this particular device.
I will attach config files, ping logs, an Excel graph, and OpenVPN logs (verb 5) for the following tests:
Test 1: OpenVPN 2.4.0 on both sides
Test 2: OpenVPN 2.3.13 on the device (server) and OpenVPN 2.4.0 on the VM (client)
Test 3: OpenVPN 2.3.13 on both sides
Directions
==========
Start an OpenVPN connection between two devices with reneg-sec set to a small value
Have each device ping the other device every second
Start "top -d1" on each device
Each time there is a TLS renegotiation, the CPU spikes on both machines spikes and network traffic stops (this is expected).
After a while, you will notice that the renegotiation begins to take longer, causing higher CPU usage.
After a while, the TLS renegotiation starts to take longer than 1 second, causing the ping response times to rise.
Embedded System
===============
With reneg-sec set to 10 seconds, the TLS renegotiation rises from 800ms to 6000ms after running for 1 hour (e.g. approx 360 renegotiations).
800Mhz MIPS processor
Linux 2.6.36
mipsel-unknown-linux-uclibc-gcc (Buildroot 2011.05) 4.3.5
OpenVPN 2.4.0, cross-compiled using default configure options (auto-detected liblz4, liblzo2, libpam, libssl)
OpenVPN 2.4.0 mipsel-unknown-linux-uclibc [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD] built on Mar 2 2017
library versions: OpenSSL 1.0.2h 3 May 2016, LZO 2.03
Originally developed by James Yonan
Copyright (C) 2002-2017 OpenVPN Technologies, Inc. <sales@…>
Compile time defines: enable_async_push=no enable_comp_stub=no enable_crypto=yes enable_crypto_ofb_cfb=yes enable_debug=yes enable_def_auth=yes enable_dlopen=unknown enable_dlopen_self=unknown enable_dlopen_self_static=unknown enable_fast_install=yes enable_fragment=yes enable_iproute2=no enable_libtool_lock=yes enable_lz4=yes enable_lzo=yes enable_management=yes enable_multi=yes enable_multihome=yes enable_pam_dlopen=no enable_pedantic=no enable_pf=yes enable_pkcs11=no enable_plugin_auth_pam=yes enable_plugin_down_root=yes enable_plugins=yes enable_port_share=yes enable_selinux=no enable_server=yes enable_shared=yes enable_shared_with_static_runtimes=no enable_small=no enable_static=yes enable_strict=no enable_strict_options=no enable_systemd=no enable_werror=no enable_win32_dll=yes enable_x509_alt_username=no with_crypto_library=openssl with_gnu_ld=yes with_mem_check=no with_plugindir='$(libdir)/openvpn/plugins' with_sysroot=no
x86 VM
======
With reneg-sec set to 3 seconds, the TLS renegotiation rises to 2000ms after running for 1 hour (e.g. approx 1200 renegotiations)
Intel i7-4870HQ @ 2.5Ghz
VirtualBox 5.1.14 / Vagrant 1.9.1 / UbuntuMATE 14.04
Linux vagrant-ubuntu-trusty-64 3.13.0-108-generic #155-Ubuntu SMP Wed Jan 11 16:58:52 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
OpenVPN 2.4.0, installed using openvpn-aptrepo.list:
deb http://build.openvpn.net/debian/openvpn/stable trusty main
OpenVPN 2.4.0 x86_64-pc-linux-gnu [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [PKCS11] [MH/PKTINFO] [AEAD] built on Dec 27 2016
library versions: OpenSSL 1.0.1f 6 Jan 2014, LZO 2.06
Originally developed by James Yonan
Copyright (C) 2002-2017 OpenVPN Technologies, Inc. <sales@…>
Compile time defines: enable_async_push=no enable_comp_stub=no enable_crypto=yes enable_crypto_ofb_cfb=yes enable_debug=yes enable_def_auth=yes enable_dependency_tracking=no enable_dlopen=unknown enable_dlopen_self=unknown enable_dlopen_self_static=unknown enable_fast_install=yes enable_fragment=yes enable_iproute2=yes enable_libtool_lock=yes enable_lz4=yes enable_lzo=yes enable_maintainer_mode=no enable_management=yes enable_multi=yes enable_multihome=yes enable_pam_dlopen=no enable_pedantic=no enable_pf=yes enable_pkcs11=yes enable_plugin_auth_pam=yes enable_plugin_down_root=yes enable_plugins=yes enable_port_share=yes enable_selinux=no enable_server=yes enable_shared=yes enable_shared_with_static_runtimes=no enable_small=no enable_static=yes enable_strict=no enable_strict_options=no enable_systemd=no enable_werror=no enable_win32_dll=yes enable_x509_alt_username=yes with_crypto_library=openssl with_gnu_ld=yes with_ifconfig_path=/sbin/ifconfig with_iproute_path=/sbin/ip with_mem_check=no with_plugindir='${prefix}/lib/openvpn' with_route_path=/sbin/route with_sysroot=no
Attachments (12)
Change History (31)
comment:1 Changed 8 years ago by
comment:2 Changed 8 years ago by
Cc: | Steffan Karger added |
---|
cc'ing syzzer - what could be different in 2.4.0 that also manifests in 2.3.14<->2.4.0 connections (= no AEAD)?
comment:3 Changed 8 years ago by
One correction, the x86 VM renegotiation times seem to rise to 1000ms after an hour with (reneg-sec set to 3) for test 2 (not 2000ms as originally posted).
Changed 8 years ago by
Attachment: | test1_device_to_vm_60sec.PNG added |
---|
comment:4 Changed 8 years ago by
Hi dazo, Thanks for the suggestion!
We tried a run with reneg-sec=60, and still saw the renegotiation times rising (see attached test1_device_to_vm_60sec.PNG). We also tried another run with reneg-sec=60,tran-window=30,hand-window=30, and got messages stating, "TLS Error: local/remote TLS keys are out of sync" during the second half of each 60-second window.
I'm running another set of tests with reneg-sec=60,tran-window=60,hand-window=30 right now. Is there a different set of values that worked well for you in the past?
comment:5 Changed 8 years ago by
This could be related to something else dazo spotted earlier: 2.4.0 adds a number of environment variables for each renegotiation, which will increase the number of allocations and amount memory used for each reconnect, but also increase the computation required when going through the list.
I was already planning to spend some of my spare time tonight on OpenVPN, so I'll try to cook up a patch. It would be great if you could test that patch to see if it resolves your issue.
If you have time before tonight, could you check:
- Whether the memory usage indeed increases for each renegatiation
- Whether both memory usage and connection time drop again once you disconnect the client, wait for - say - 2 minutes, and reconnect again? (But keep the server running)
comment:6 Changed 8 years ago by
With reneg-sec set to 60, setting tran-window to 60 and hand-window to 30 did not seem to resolve the issue. I did an overnight test with these settings (2.4.0 on both ends) and saw renegotiation times hitting 18 seconds (see attached graphs).
I'm busy for a few hours, but as soon as I get a chance, I'll take a look at the memory usage and the specific reconnect scenario described by syzzer.
Changed 8 years ago by
Attachment: | test1_device_to_vm_reneg60_tran60_hand30.PNG added |
---|
Test 1 ping times with tran-window and hand-window
Changed 8 years ago by
Attachment: | test1_vm_to_device_reneg60_tran60_hand30.PNG added |
---|
Test 1 ping times with tran-window and hand-window
comment:7 Changed 8 years ago by
Syzzer,
Disconnecting client for ~4 minutes and reconnecting to the server doesn't fix the issue, pings pick up where they left off time wise and memory/CPU util seems to remain about the same. Memory was ~118312K used before the disconnect and after the reconnect. The system on a fresh restart has ~107508K used after a system reboot and starting openvpn so not a
The above tests were done on the same system mknutson was using. I will note that using v2.4 from a windows PC to a windows PC I saw similar behavior with ping times increasing with --reneg-sec 60.
comment:8 Changed 8 years ago by
Hm, that does not really confirm my suspicions. Thanks for testing though!
But let's try to exclude this for real. I attached a patch that should resolve the issue I had in mind. Could you build OpenVPN (git master, or release/2.4) with this patch applied? Then run this code either at both ends, or 2.3 on the other end, and see if the renegotiation times still increase.
Changed 8 years ago by
Attachment: | 0001-Remove-duplicate-X509-env-variables.patch added |
---|
comment:9 follow-up: 11 Changed 8 years ago by
Just a question to the patch
if (item->string && 0 == strncmp("X509_", item->string, strlen("X509_")))
Is that strlen() truly needed? Okay, I see the argument of being explicit what the length refers to. But how will this hit the performance? Will the compiler be smart enough to optimize this not to call strlen() for each iteration? If we want explicit clarity, what about:
const char *filter_item = "X509_"; const size_t filter_len = strlen(filter_item); while (item) { struct env_item *next = item->next; if (item->string && 0 == strncmp(filter_item, item->string, filter_len)) { env_set_del(es, item->string); } item = next; }
Otherwise, in this case I am also perfectly fine with just
if (item->string && 0 == strncmp("X509_", item->string, 5))
comment:10 follow-up: 12 Changed 8 years ago by
Hi syzzer,
Thank you for the quick turn around on this. I'm impressed at how rapidly this issue was isolated and fixed!
I ran our accelerated testing (with reneg-sec=3) on a patched v2.4.0 on both ends. I also ran with 2.3.14 on the VM and a patched v2.4.0 on the other end. In both cases, I ran for over 20 minutes and did not observe any renegotiation slowdown effects (previously, the renegotiation was over 1 second by that point). We'll run some longer tests to be sure, but I'd tentatively say that this patch seems to resolve this issue.
Let me know if you decide to update the patch per dazo's suggestion and I can give that a sanity check, too.
comment:11 Changed 8 years ago by
Replying to dazo:
Just a question to the patch
if (item->string && 0 == strncmp("X509_", item->string, strlen("X509_")))Is that strlen() truly needed? Okay, I see the argument of being explicit what the length refers to. But how will this hit the performance? Will the compiler be smart enough to optimize this not to call strlen() for each iteration?
Yes, compilers are smart enough to optimize the strlen() calls away. Both gcc and clang with -O2 do not emit any calls to strlen(). See for example the following intermixed assembly generated by gcc:
1497:../.././../openvpn/src/openvpn/ssl_verify.c **** if (item->string 8816 .loc 8 1497 0 8817 6745 488B45F0 movq -16(%rbp), %rax 8818 6749 4889C2 movq %rax, %rdx 8819 674c 48C1EA03 shrq $3, %rdx 8820 6750 4881C200 addq $2147450880, %rdx 8820 80FF7F 8821 6757 0FB612 movzbl (%rdx), %edx 8822 675a 84D2 testb %dl, %dl 8823 675c 7408 je .L627 8824 675e 4889C7 movq %rax, %rdi 8825 6761 E8000000 call __asan_report_load8@PLT 8825 00 8826 .L627: 8827 6766 488B45F0 movq -16(%rbp), %rax 8828 676a 488B00 movq (%rax), %rax 8829 676d 4885C0 testq %rax, %rax 8830 6770 7435 je .L628 1498:../.././../openvpn/src/openvpn/ssl_verify.c **** && 0 == strncmp("X509_", item->string, strlen("X509_"))) 8831 .loc 8 1498 0 8832 6772 488B45F0 movq -16(%rbp), %rax 8833 6776 488B00 movq (%rax), %rax 8834 6779 BA050000 movl $5, %edx 8834 00 8835 677e 4889C6 movq %rax, %rsi 8836 6781 488D3D00 leaq .LC90(%rip), %rdi 8836 000000 8837 6788 E8000000 call strncmp@PLT 8837 00 8838 678d 85C0 testl %eax, %eax ^LGAS LISTING /tmp/ccu4zSLd.s page 287 8839 678f 7516 jne .L628
Note in particular the movl $5, %edx
right before the call strncmp@PLT
.
So I stand by my approach :) It's explicit about what it wants to do, and the compiler understands it well enough to make it as efficient as more verbose or less explicit approaches.
comment:12 Changed 8 years ago by
Owner: | set to Steffan Karger |
---|---|
Status: | new → accepted |
Replying to mknutson:
Thank you for the quick turn around on this. I'm impressed at how rapidly this issue was isolated and fixed!
I ran our accelerated testing (with reneg-sec=3) on a patched v2.4.0 on both ends. I also ran with 2.3.14 on the VM and a patched v2.4.0 on the other end. In both cases, I ran for over 20 minutes and did not observe any renegotiation slowdown effects (previously, the renegotiation was over 1 second by that point). We'll run some longer tests to be sure, but I'd tentatively say that this patch seems to resolve this issue.
Thanks for the kind words. Most credits go to you and dazo though: good bug reports like this are seldom but extremely helpful in pinning down issues, and dazo already spotted and reported the underlying issue to me. All I did was make the connection and fix the bug I introduced myself ;)
I'll send the patch to the list for inclusion in 2.4.1.
comment:13 Changed 8 years ago by
I ran a longer test last night with the patched 2.4.0 running on both the VM and the device. I left reneg-sec set to 3 seconds and ran the test for 16 hours (approximately 15000 TLS renegotiations). During the course of this test, there was no observable slowdown in the ping times (all were under 350ms).
It's clear that this patch resolves the issue that we had been seeing. Thanks again for the help on this!
Changed 8 years ago by
Attachment: | test1_patched_2_4_0.PNG added |
---|
After patching, ping response times are flat
comment:14 Changed 8 years ago by
Hi syzzer,
Would you be willing to delete the "test2_device_3sec.log" and "test2_vm_3sec.log" attachments from this ticket to prevent spambots from picking up the email addresses in the log? I tried to delete them (or upload new, scrubbed versions), but I don't seem to have the right permissions on Trac. Much appreciated!
comment:15 Changed 8 years ago by
Cc: | Samuli Seppänen added; Steffan Karger removed |
---|
I don't seem to have rights to remove attachments either (or I just can't find the right button...). Samuli, can you check this out?
comment:16 Changed 8 years ago by
Cc: | Samuli Seppänen added; Samuli Seppänen removed |
---|
Hm, seems his nick here is samuli, not mattock.
Samuli: see above.
comment:17 Changed 8 years ago by
For whatever reasons, I seem to have the necessary superpowers :) - attachments deleted (though the path there is hidden - you have to click on an attachment, and then below the actual text there is a "delete attachment" button)
comment:18 Changed 7 years ago by
Your patch has been applied to the master and release/2.4 branch.
commit fd0361813cd3d5a55f3408a018e2ed776d79fef6 (master)
commit 19d6cd4e63557c983cb0ab5094688c974588c2ff (release/2.4)
Author: Steffan Karger
Date: Thu Mar 9 09:13:32 2017 +0100
Remove duplicate X509 env variables
Signed-off-by: Steffan Karger <steffan@…>
Acked-by: Gert Doering <gert@…>
Message-Id: <1489047212-31994-1-git-send-email-steffan@…>
URL: https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg14
237.html
comment:19 Changed 7 years ago by
Milestone: | → release 2.4.1 |
---|---|
Resolution: | → fixed |
Status: | accepted → closed |
... which means that this ticket can be closed!
Without having checked too much into your test configuration ... you might hit a corner case; I know I have with a similar testing scenario.
When using --reneg-sec lower than 60 seconds, you also need to consider --hand-window, and perhaps even --tran-window. The --hand-window is 60 seconds by default. I have experienced odd behaviours when using --reneg-sec lower than 15 seconds. My initial testing improved by also lowering the --hand-window. But beware, much of the implementation logic around the renegotiations does account for at least 60 seconds life time before a renegotiation happens.