Opened 11 years ago

Closed 15 months ago

#311 closed Bug / Defect (fixed)

Client does not always die after a hard SIGTERM

Reported by: markm Owned by: Samuli Seppänen
Priority: major Milestone: release 2.6
Component: Generic / unclassified Version: OpenVPN 2.3.1 (Community Ed)
Severity: Not set (select this one, unless you're an OpenVPN developer) Keywords: sigterm dns resolve
Cc: plaisthos

Description

Seen on 2.3.2 and 2.1.2 on Linux

An openvpn client process does not reliably die after being sent a SIGTERM. This is especially the case when a user's network or DNS resolution is flaky, which I've seen is surprisingly common in the field.

It can be pretty reliably reproduced. First set up a bogus DNS server on a Linux host using netcat (can be localhost):

sudo nc -l -k -u -p 53 | hexdump -C

The idea is to get something that will accept the connection (so the client does not fail too quickly), and then do nothing so that the client stalls for a bit.
Then point your DNS resolver to the bogus setup (distro-specific) and all DNS queries should fail.

Next, run this shell script. Adapt to your setup.

#!/bin/bash

OPENVPN_EXE="/opt/openvpn/sbin/openvpn"
OPENVPN_CONFIG="/tmp/myvpn.conf"
OPENVPN_PIDFILE="/var/run/myvpn.pid"
OPENVPN_CMD="$OPENVPN_EXE --config $OPENVPN_CONFIG --daemon TestVPN --writepid $OPENVPN_PIDFILE"

while true; do
  echo "Starting..."
  $OPENVPN_CMD
  echo "Waiting a bit..."
  sleep $(( 2 + $RANDOM % 20 ))
  echo "Stopping..."
  kill "$(cat "$OPENVPN_PIDFILE")" || echo "Failed to send SIGTERM signal"
done

This stresses the starting and stopping of an openvpn client, but is not unlike how many init scripts work.

Let it run for a little bit, and you should eventually get multiple instances of openvpn running. Here's a log snip (hostnames edited):

Jul 24 03:41:51 ubuntu TestVPN[34077]: RESOLVE: Cannot resolve host address: xxx.domain.net: System error
Jul 24 03:41:51 ubuntu TestVPN[34077]: SIGUSR1[soft,init_instance] received, process restarting
Jul 24 03:41:51 ubuntu TestVPN[34077]: Restart pause, 2 second(s)
Jul 24 03:41:53 ubuntu TestVPN[34077]: Control Channel Authentication: tls-auth using INLINE static key file
Jul 24 03:41:53 ubuntu TestVPN[34077]: Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Jul 24 03:41:53 ubuntu TestVPN[34077]: Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Jul 24 03:41:53 ubuntu TestVPN[34077]: Socket Buffers: R=[212992->200000] S=[212992->200000]
Jul 24 03:42:01 ubuntu TestVPN[34003]: RESOLVE: Cannot resolve host address: xxx.domain.net: System error
Jul 24 03:42:01 ubuntu TestVPN[34003]: SIGUSR1[soft,init_instance] received, process restarting
Jul 24 03:42:01 ubuntu TestVPN[34003]: Restart pause, 2 second(s)

Seems like soft SIGUSR1 generated by a DNS resolution failure is overriding the hard SIGTERM received, and the process happily continues along.

I did a cursory dive into the code and it seems like the end of link_socket_init_phase2() could be the culprit:

  if (sig_save && signal_received)
    {
      if (!*signal_received)
	*signal_received = sig_save;
    }

If there are both a previously saved signal (hard SIGTERM) and a current one (soft SIGUSR1 from resolving), the code keeps the latter and drops the saved one. Possible solutions are to store hard and soft signals separately, or to compare priorities before overwriting a signal.
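As a rough illustration of the "consider priorities" option, the restore step could compare the two signals instead of only checking whether the current one is non-zero. This is a sketch, not OpenVPN code: sig_priority() and restore_saved_signal() are hypothetical names, and the ranking itself is an assumption made for the example.

```c
#include <signal.h>

/* Hypothetical ranking (an assumption, not OpenVPN's): higher value wins. */
static int sig_priority(int sig)
{
    switch (sig) {
    case 0:       return 0;   /* nothing pending */
    case SIGTERM:
    case SIGINT:  return 4;   /* exit requests outrank everything */
    case SIGHUP:  return 3;
    case SIGUSR1: return 2;   /* restart request, e.g. after a resolve failure */
    default:      return 1;   /* any other non-zero signal */
    }
}

/*
 * Priority-aware version of the restore step: the saved signal replaces
 * the current one only if it ranks strictly higher. A saved SIGTERM
 * therefore survives a SIGUSR1 that was set in the meantime.
 */
static void restore_saved_signal(int sig_save, volatile int *signal_received)
{
    if (signal_received
        && sig_priority(sig_save) > sig_priority(*signal_received))
    {
        *signal_received = sig_save;
    }
}
```

With a saved SIGTERM and a current SIGUSR1, this variant leaves SIGTERM pending, whereas the snippet above keeps SIGUSR1.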

Change History (11)

comment:1 Changed 10 years ago by Samuli Seppänen

This seems somewhat related to #276.

comment:2 Changed 10 years ago by Samuli Seppänen

Owner: set to plaisthos
Status: changed from new to assigned

comment:3 Changed 10 years ago by Gert Döring

I've seen quite a few signal handling fixes in the dual-stack code - have we covered this with them already?

comment:4 Changed 9 years ago by Samuli Seppänen

Milestone: release 2.4
Owner: changed from plaisthos to Samuli Seppänen

I will try to reproduce this on 2.3.x and Git master and then report back.

comment:5 Changed 9 years ago by Gert Döring

Cc: plaisthos added

The signal priorities thing ("if a SIGTERM has been seen, ignore all further soft signals") might still be needed... I think it should be fairly easy to reproduce, even without the "nc" thing - just put an IP address into /etc/resolv.conf that does not respond (not even with a "host unreachable", and must not be on the same LAN).

I'm just now trying to see whether I can reproduce #276, then come back here and look at Samuli's test results :-)

comment:6 Changed 9 years ago by Gert Döring

Actually, with 2.3.6 (well, release/2.3 as of today) this is not easy - even if I make it hang in getaddrinfo() for long with the "unreachable address" trick, it nicely sees the SIGTERM...

Sun May 31 19:27:05 2015 RESOLVE: signal received during DNS resolution attempt
Sun May 31 19:27:05 2015 SIGTERM[hard,init_instance] received, process exiting

This might actually need a combination of things to trigger, like "a resolving failure (setting SIGUSR1 internally) and *then* a SIGTERM", so it is slightly hard to hit the right (microsecond) point in time... maybe it has also already been fixed, though I can't see anything in the logs.

Nothing obvious anyway...

comment:7 Changed 9 years ago by Samuli Seppänen

I managed to reproduce this issue on Debian 7 by simply setting a single, fake DNS server in /etc/resolv.conf:

# Real DNS server is here
# nameserver 192.168.1.1

# There is no DNS server in this IP
nameserver 192.168.1.199

I first launched OpenVPN Git "master" to foreground:

$ /home/samuli/opt/openvpn/src/openvpn/openvpn --config /tmp/community.conf

Git "master" version looped forever because it could not resolve the IP of the server:

RESOLVE: Cannot resolve host address: server.domain.com:1194 (Name or service not known)

Doing a kill -15 <pid> did not kill OpenVPN Git "master" - the kill signals were just ignored. Only fixing /etc/resolv.conf and letting OpenVPN connect to the remote peer put OpenVPN into a state where it would again gracefully die from SIGTERM (15). All signals sent before OpenVPN had connected properly were lost.

Interestingly, OpenVPN 2.3.6 (and Debian-patched 2.3.4) behaved differently. It also initially ignored the SIGTERM, but once the ongoing DNS resolution attempt failed, the SIGTERM was processed and OpenVPN exited properly.

Note that if the fake DNS server is set to 127.0.0.1 then both OpenVPN 2.3.6 and Git "master" terminate immediately when they receive a SIGTERM, even though both still loop forever waiting for a DNS response. It did not seem to matter whether there was a netcat instance listening on port 53 or not.

comment:8 Changed 7 years ago by Gert Döring

see also #639

I know we added some signal handling fixes, but I'm not sure if these fix this particular issue.

comment:9 Changed 7 years ago by Samuli Seppänen

I will check if this issue is gone in latest code.

comment:10 Changed 16 months ago by Gert Döring

Milestone: changed from release 2.4 to release 2.7

It seems to still be there. Not sure if anyone feels like digging into this for 2.6.0 now (since nobody cared for the last 6 years) but we should eventually revisit this...

comment:11 Changed 15 months ago by Gert Döring

Milestone: changed from release 2.7 to release 2.6
Resolution: set to fixed
Status: changed from assigned to closed

Discussion on the signal issues was summarized in

https://github.com/OpenVPN/openvpn/issues/205

and fixed in a number of patches from Selva Nair that went into 2.6.0 - basically doing what the original poster suggested: all signals go to a central "raise signal" function, which orders by priority, so a SIGUSR* can never replace a SIGTERM (commit b3b1436955b9db8e557fc58b7e37ba3a881109a6, and here on the list https://www.mail-archive.com/openvpn-devel@lists.sourceforge.net/msg25871.html).
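The idea of a single entry point that orders signals by priority can be sketched as follows. This is only an illustration under stated assumptions, not the code from the commit above: struct pending_signal, rank_of() and raise_signal_sketch() are hypothetical names, and the ranking is chosen for the example.

```c
#include <signal.h>
#include <stdbool.h>

/* Hypothetical pending-signal record; the field names are illustrative. */
struct pending_signal {
    int  sig;   /* 0 when nothing is pending */
    bool hard;  /* true if delivered by the OS, false if raised internally */
};

/* Assumed ranking for this sketch: higher value wins. */
static int rank_of(int sig)
{
    switch (sig) {
    case 0:       return 0;
    case SIGTERM:
    case SIGINT:  return 4;
    case SIGHUP:  return 3;
    case SIGUSR1: return 2;
    default:      return 1;
    }
}

/*
 * Single entry point for both OS-delivered and internally raised signals.
 * A request is recorded only if it ranks at least as high as what is
 * already pending, so a soft SIGUSR1 cannot replace a pending SIGTERM.
 * Returns true if the request was recorded.
 */
static bool raise_signal_sketch(struct pending_signal *p, int sig, bool hard)
{
    if (rank_of(sig) < rank_of(p->sig))
        return false;   /* something more important is already pending */
    p->sig = sig;
    p->hard = hard;
    return true;
}
```

Because every caller goes through the same function, individual code paths such as the resolver no longer need their own save and restore logic.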

I am not sure if we want to backport the signal handling changes to release/2.5 ("it qualifies as bugfix"), because it's quite a large changeset to be able to cleanly address this particular effect (pending signals are overwritten by soft-signals at getaddrinfo() errors). Given the lack of sustained yelling, this seems to be a more infrequent annoyance mostly.
