Opened 11 years ago
Closed 23 months ago
#311 closed Bug / Defect (fixed)
Client does not always die after a hard SIGTERM
Reported by: | markm | Owned by: | Samuli Seppänen |
---|---|---|---|
Priority: | major | Milestone: | release 2.6 |
Component: | Generic / unclassified | Version: | OpenVPN 2.3.1 (Community Ed) |
Severity: | Not set (select this one, unless you're an OpenVPN developer) | Keywords: | sigterm dns resolve |
Cc: | plaisthos |
Description
Seen on 2.3.2 and 2.1.2 on Linux
An openvpn client process does not reliably die after being killed with a SIGTERM. This is especially the case when a user's network or DNS resolution is flaky, which I've seen is surprisingly common in the field.
It can be pretty reliably reproduced. First set up a bogus DNS server on a Linux host using netcat (can be localhost):
sudo nc -l -k -u -p 53 | hexdump -C
The idea is to get something that will accept the connection (so the client does not fail too quickly), and then do nothing so that the client stalls for a bit.
Then point your DNS resolver to the bogus setup (distro-specific) and all DNS queries should fail.
Next, run this shell script. Adapt to your setup.
#!/bin/bash
OPENVPN_EXE="/opt/openvpn/sbin/openvpn"
OPENVPN_CONFIG="/tmp/myvpn.conf"
OPENVPN_PIDFILE="/var/run/myvpn.pid"
OPENVPN_CMD="$OPENVPN_EXE --config $OPENVPN_CONFIG --daemon TestVPN --writepid $OPENVPN_PIDFILE"
while true; do
    echo "Starting..."
    $OPENVPN_CMD
    echo "Waiting a bit..."
    sleep $(( 2 + $RANDOM % 20 ))
    echo "Stopping..."
    kill `cat $OPENVPN_PIDFILE` || echo "Failed to send SIGTERM signal"
done
This stresses the starting and stopping of an openvpn client, but is not unlike how many init scripts work.
Let it run for a little bit, and you should eventually get multiple instances of openvpn running. Here's a log snip (hostnames edited):
Jul 24 03:41:51 ubuntu TestVPN[34077]: RESOLVE: Cannot resolve host address: xxx.domain.net: System error
Jul 24 03:41:51 ubuntu TestVPN[34077]: SIGUSR1[soft,init_instance] received, process restarting
Jul 24 03:41:51 ubuntu TestVPN[34077]: Restart pause, 2 second(s)
Jul 24 03:41:53 ubuntu TestVPN[34077]: Control Channel Authentication: tls-auth using INLINE static key file
Jul 24 03:41:53 ubuntu TestVPN[34077]: Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Jul 24 03:41:53 ubuntu TestVPN[34077]: Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Jul 24 03:41:53 ubuntu TestVPN[34077]: Socket Buffers: R=[212992->200000] S=[212992->200000]
Jul 24 03:42:01 ubuntu TestVPN[34003]: RESOLVE: Cannot resolve host address: xxx.domain.net: System error
Jul 24 03:42:01 ubuntu TestVPN[34003]: SIGUSR1[soft,init_instance] received, process restarting
Jul 24 03:42:01 ubuntu TestVPN[34003]: Restart pause, 2 second(s)
Seems like soft SIGUSR1 generated by a DNS resolution failure is overriding the hard SIGTERM received, and the process happily continues along.
I did a cursory dive into the code and it seems like the end of link_socket_init_phase2() could be the culprit:
if (sig_save && signal_received)
{
    if (!*signal_received)
        *signal_received = sig_save;
}
If there is both a previously saved signal (hard SIGTERM) and a current one (soft SIGUSR1 from resolving), the code keeps the latter and the saved SIGTERM is dropped. Possible solutions are to separate storage of hard and soft signals, or to consider priorities before overwriting signals.
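To make the overwrite concrete, here is a minimal standalone sketch of the restore step and of the "separate storage" idea. It is not OpenVPN code: `demo_restore_saved`, `struct demo_split_signals` and `demo_effective_signal` are hypothetical names made up for illustration, and the real `link_socket_init_phase2()` does much more than this.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical stand-in for the restore step quoted above (not OpenVPN code):
 * the saved signal is only put back if nothing newer has been recorded. */
void demo_restore_saved(int sig_save, volatile int *signal_received)
{
    if (sig_save && signal_received)
    {
        if (!*signal_received)
        {
            *signal_received = sig_save;
        }
    }
}

/* Hypothetical alternative: keep hard and soft signals in separate slots. */
struct demo_split_signals
{
    int hard; /* externally delivered, e.g. SIGTERM */
    int soft; /* internally generated, e.g. SIGUSR1 after a resolve failure */
};

/* A pending hard signal always wins; the soft one is used only otherwise. */
int demo_effective_signal(const struct demo_split_signals *s)
{
    assert(s != NULL);
    return s->hard ? s->hard : s->soft;
}
```

If the current slot already holds SIGUSR1 when `demo_restore_saved(SIGTERM, &slot)` runs, the slot stays at SIGUSR1 and the SIGTERM is lost, which matches the behaviour seen in the log. With separate slots, `demo_effective_signal()` returns SIGTERM whenever one is pending, regardless of what the resolver set.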
Change History (11)
comment:1 Changed 11 years ago by
comment:2 Changed 11 years ago by
Owner: | set to plaisthos |
---|---|
Status: | new → assigned |
comment:3 Changed 11 years ago by
I've seen quite some signal handling fixes in the dual-stack code - have we covered this with them, already?
comment:4 Changed 10 years ago by
Milestone: | → release 2.4 |
---|---|
Owner: | changed from plaisthos to Samuli Seppänen |
I will try to reproduce this on 2.3.x and Git master and then report back.
comment:5 Changed 10 years ago by
Cc: | plaisthos added |
---|
The signal priorities thing ("if a SIGTERM has been seen, ignore all further soft signals") might still be needed... I think it should be fairly easy to reproduce, even without the "nc" thing - just put an IP address into /etc/resolv.conf that does not respond (not even with a "host unreachable", and must not be on the same LAN).
I'm just now trying to see whether I can reproduce #276, then come back here and look at Samuli's test results :-)
comment:6 Changed 10 years ago by
Actually, with 2.3.6 (well, release/2.3 as of today) this is not easy - even if I make it hang in getaddrinfo() for long with the "unreachable address" trick, it nicely sees the SIGTERM...
Sun May 31 19:27:05 2015 RESOLVE: signal received during DNS resolution attempt
Sun May 31 19:27:05 2015 SIGTERM[hard,init_instance] received, process exiting
This might actually need a combination of things to trigger, like "a resolving failure (setting SIGUSR1 internally) and *then* a SIGTERM", so it is slightly hard to hit the right (microsecond) point in time... maybe it has also already been fixed, though I can't see anything in the logs.
Nothing obvious anyway...
comment:7 Changed 10 years ago by
I managed to reproduce this issue on Debian 7 by simply setting a single, fake DNS server in /etc/resolv.conf:
# Real DNS server is here
# nameserver 192.168.1.1
# There is no DNS server in this IP
nameserver 192.168.1.199
I first launched OpenVPN Git "master" to foreground:
$ /home/samuli/opt/openvpn/src/openvpn/openvpn --config /tmp/community.conf
Git "master" version looped forever because it could not resolve the IP of the server:
RESOLVE: Cannot resolve host address: server.domain.com:1194 (Name or service not known)
Doing a kill -15 <pid> did not kill OpenVPN Git "master" - the kill signals were just ignored. Only fixing /etc/resolv.conf and letting OpenVPN connect to the remote peer put OpenVPN into a state where it would again gracefully die from SIGTERM (15). All signals sent before OpenVPN had connected properly were lost.
Interestingly, OpenVPN 2.3.6 (and Debian-patched 2.3.4) behaved differently. It also initially ignored the SIGTERM, but once the current/ongoing DNS resolution attempt failed, the SIGTERM was processed and OpenVPN exited properly.
Note that if the fake DNS server is set to 127.0.0.1 then both OpenVPN 2.3.6 and Git "master" terminate immediately when they receive a SIGTERM, even though both still loop forever waiting for a DNS response. It did not seem to matter whether there was a netcat instance listening on port 53 or not.
comment:8 Changed 8 years ago by
see also #639
I know we added some signal handling fixes, but I'm not sure if these fix this particular issue.
comment:10 Changed 2 years ago by
Milestone: | release 2.4 → release 2.7 |
---|
it seems to be still there. Not sure if anyone feels like digging into this for 2.6.0 now (since nobody cared for the last 6 years) but we should eventually revisit this...
comment:11 Changed 23 months ago by
Milestone: | release 2.7 → release 2.6 |
---|---|
Resolution: | → fixed |
Status: | assigned → closed |
Discussion on the signal issues was summarized in
https://github.com/OpenVPN/openvpn/issues/205
and fixed in a number of patches from Selva Nair that went into 2.6.0 - basically doing what the original poster suggested: all signals go to a central "raise signal" function, which orders by priority, so a SIGUSR* can never replace a SIGTERM (commit b3b1436955b9db8e557fc58b7e37ba3a881109a6, and here on the list https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg25871.html).
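As a rough illustration of the "central raise signal function, ordered by priority" idea, here is a minimal sketch. The names `demo_signal_priority` and `demo_raise_signal` and the priority values are assumptions made up for this example; they are not taken from the 2.6.0 patches, which should be consulted for the real implementation.

```c
#include <assert.h>
#include <signal.h>
#include <stddef.h>

/* Hypothetical priority table for this sketch; a higher value wins.
 * The actual ordering used by OpenVPN may differ. */
int demo_signal_priority(int sig)
{
    switch (sig)
    {
        case SIGINT:
        case SIGTERM:
            return 3;

        case SIGHUP:
            return 2;

        case SIGUSR1:
        case SIGUSR2:
            return 1;

        default:
            return 0; /* includes 0, meaning "no signal pending" */
    }
}

/* Single entry point for recording a signal, whether it comes from the
 * OS handler or from internal code such as a failed DNS lookup.
 * Returns 1 if the signal was recorded, 0 if a pending signal of equal
 * or higher priority was kept instead. */
int demo_raise_signal(int *pending, int sig)
{
    assert(pending != NULL);
    if (demo_signal_priority(sig) > demo_signal_priority(*pending))
    {
        *pending = sig;
        return 1;
    }
    return 0;
}
```

With this shape, a SIGUSR1 raised after a SIGTERM is ignored, while a SIGTERM raised after a SIGUSR1 replaces it, so the order in which the two arrive no longer matters.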
I am not sure if we want to backport the signal handling changes to release/2.5 ("it qualifies as bugfix"), because it's quite a large changeset to be able to cleanly address this particular effect (pending signals are overwritten by soft-signals at getaddrinfo() errors). Given the lack of sustained yelling, this seems to be mostly an infrequent annoyance.
This seems somewhat related to #276.