Opened 9 years ago

Closed 20 months ago

#639 closed Bug / Defect (fixed)

non-interruptible loop in windows dns resolution failure

Reported by: Gert Döring
Owned by: stipa
Priority: major
Milestone: release 2.6
Component: Generic / unclassified
Version: OpenVPN git master branch (Community Ed)
Severity: Not set (select this one, unless you're an OpenVPN developer)
Keywords: windows signal dns loop
Cc: tct

Description

  • win7 VM with no IPv6 (no interface has v6, so win7 disables v6 altogether)
  • git master 6417a6f8a0
  • connecting with "--proto udp6"
  • run from console window

-> result is a "DNS resolution fails, retrying" (because windows will refuse to even lookup a v6 record if "there is no v6 in the system!") endless loop, neither ctrl-c nor f1...f4 work.

Didn't try from GUI, was too annoyed at my VM setup... but it should be easy enough to reproduce.

The endless loop happens on Linux as well, but ctrl-c works there - here's a Linux log:

Mon Dec 14 19:42:29 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:29 2015 Could not determine IPv4/IPv6 protocol
Mon Dec 14 19:42:29 2015 SIGUSR1[soft,init_instance] received, process restarting
Mon Dec 14 19:42:29 2015 Restart pause, 5 second(s)
Mon Dec 14 19:42:34 2015 Control Channel Authentication: tls-auth using INLINE static key file
Mon Dec 14 19:42:34 2015 Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Mon Dec 14 19:42:34 2015 Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Mon Dec 14 19:42:34 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:34 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:34 2015 Could not determine IPv4/IPv6 protocol
Mon Dec 14 19:42:34 2015 SIGUSR1[soft,init_instance] received, process restarting
Mon Dec 14 19:42:34 2015 Restart pause, 5 second(s)
^CMon Dec 14 19:42:35 2015 SIGINT[hard,init_instance] received, process exiting

Change History (12)

comment:1 Changed 9 years ago by Gert Döring

Owner: set to stipa
Status: new → assigned

comment:2 Changed 9 years ago by Gert Döring

thanks :)

comment:3 in reply to:  description Changed 9 years ago by Selva Nair

Replying to cron2:

  • win7 VM with no IPv6 (no interface has v6, so win7 disables v6 altogether)
  • git master 6417a6f8a0
  • connecting with "--proto udp6"
  • run from console window

-> result is a "DNS resolution fails, retrying" (because windows will refuse to even lookup a v6 record if "there is no v6 in the system!") endless loop, neither ctrl-c nor f1...f4 work.

Didn't try from GUI, was too annoyed at my VM setup... but should be reproduceable easy enough.

The endless loop happens on Windows as well, but ctrl-c works

A quick question: is this with Xen, KVM, something else? I've seen ctrl-c sometimes ignored on windows 10 even on physical hardware, though a few tries always work.

Please try ctrl-break as well. Although it was added along with ctrl-c, the former is delivered by windows as a signal, the latter as a key-press like f1..f4, while running from console.

comment:4 Changed 8 years ago by Selva Nair

A failed DNS resolution loop can become non-interruptible even on Linux: as a test, use an unused IP address as the nameserver in resolv.conf and start a connection. It is nearly impossible to break out of it with SIGINT or SIGTERM.

There are a number of places in socket.c where sig_info->signal_received is set to SIGUSR1, overwriting a previous value that could be SIGTERM or SIGINT (e.g., socket.c:1919, which appears to be the culprit in this case). Note that sig_info here is a pointer to siginfo_static and its members are volatile: they can change whenever a signal interrupts.

Interestingly, it's the hard SIGTERM/SIGINT that is easily lost in this case -- a SIGTERM simulated through the management interface doesn't get noticed until the loop restarts and goes back to init.c, so it survives this blatant overwrite of signal_received in socket.c.
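The overwrite described in this comment can be illustrated with a minimal sketch. It assumes a POSIX platform; the struct and both function names are hypothetical stand-ins, not the actual code in socket.c, and only the signal_received field is taken from the ticket.

```c
#include <signal.h>

/* Hypothetical stand-in for OpenVPN's signal info struct; only the
 * signal_received member mentioned in the ticket is modelled here. */
struct signal_info_sketch
{
    volatile int signal_received;
};

/* Lossy pattern: a resolve failure unconditionally requests a soft
 * restart, clobbering a SIGINT or SIGTERM that arrived just before. */
void
on_resolve_failure_lossy(struct signal_info_sketch *si)
{
    si->signal_received = SIGUSR1;
}

/* Minimal guard (an assumption, not the eventual fix): request the
 * soft restart only if no other signal is already pending. */
void
on_resolve_failure_guarded(struct signal_info_sketch *si)
{
    if (si->signal_received == 0)
    {
        si->signal_received = SIGUSR1;
    }
}
```

Even the guarded variant is only an illustration: a real signal can still arrive between the check and the store, so it narrows the window rather than closing it.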

comment:5 Changed 8 years ago by Gert Döring

see also #311 which is the same issue but got forgotten in the meantime *sigh*

comment:6 Changed 5 years ago by tct

Cc: tct added

comment:7 Changed 4 years ago by tct

Milestone: release 2.4 → release 2.5

comment:8 Changed 3 years ago by Gert Döring

Milestone: release 2.5 → release 2.5.3

This needs to be re-tested and then either be closed or fixed.


comment:9 Changed 21 months ago by Gert Döring

I think I've bumped into this with 2.6_beta2 - in certain conditions, OpenVPN just ignores incoming SIGTERM (seems to be "when in SIGUSR1 restart wait").

comment:10 Changed 21 months ago by Selva Nair

We need a revamp of how signals are implemented: use posix signal (sigaction) so that signals can be blocked and a priority order can be enforced etc.

I worked on this years back but dropped the ball. If there is enough interest in pursuing this approach I can resurrect it:

See: https://github.com/OpenVPN/openvpn/commit/7e5d775227e6d304ce24d7505da9332f405ee4f3

Or here is a summary from 2018 (some things may be outdated)
`
Fix signal handling issues

Currently signal received is directly modified in many places in the code,
leading to loss of signals, low priority signals overwriting higher priority
ones etc.

  • Set all signals using functions like register_signal
  • Add a function register_signal_si to help setting of signals when only the pointer to the signal_info struct is available.
  • Allow only a higher or equal priority signal to overwrite an already registered but yet to be processed signal. The signals in increasing order of priority are SIGUSR2, SIGUSR1, SIGHUP, SIGTERM, SIGINT.
  • Use posix signals (sigaction) to properly block signals while in the handler etc.
  • Collect windows signals even when management is not available. Currently Windows signals cannot interrupt openvpn_sleep unless the management interface is in use: the latter forces management_event_loop_n_seconds() in place of sleep().

`
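The priority rule in the summary above can be sketched roughly as follows. This assumes a POSIX platform and uses hypothetical names (sig_state_sketch, signal_priority_sketch, register_signal_sketch); the real register_signal in OpenVPN may have a different signature and behaviour.

```c
#include <signal.h>

/* Hypothetical holder for the pending signal; 0 means none pending. */
struct sig_state_sketch
{
    volatile int signal_received;
};

/* Priority order taken from the summary: SIGUSR2 < SIGUSR1 < SIGHUP
 * < SIGTERM < SIGINT. Anything else, including 0, ranks lowest. */
static int
signal_priority_sketch(int sig)
{
    switch (sig)
    {
        case SIGUSR2: return 1;
        case SIGUSR1: return 2;
        case SIGHUP:  return 3;
        case SIGTERM: return 4;
        case SIGINT:  return 5;
        default:      return 0;
    }
}

/* Store sig only if its priority is higher than or equal to that of
 * the signal already pending. Returns 1 if stored, 0 if ignored. */
int
register_signal_sketch(struct sig_state_sketch *st, int sig)
{
    if (signal_priority_sketch(sig) >= signal_priority_sketch(st->signal_received))
    {
        st->signal_received = sig;
        return 1;
    }
    return 0;
}
```

With this rule, a soft SIGUSR1 raised after a failed lookup is ignored while a SIGTERM or SIGINT is still waiting to be processed, which is the effect this ticket is about.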


comment:11 Changed 21 months ago by Selva Nair

I guess the patch linked above would be too much, too late for 2.6? What about a band-aid fix that adds signal priority but does not extensively re-write sig.c?

comment:12 Changed 20 months ago by Gert Döring

Milestone: release 2.5.3 → release 2.6
Resolution: fixed
Status: assigned → closed

Discussion on the signal issues was summarized in

https://github.com/OpenVPN/openvpn/issues/205

and fixed in a number of patches from Selva Nair that went into 2.6.0 - basically doing what was suggested above. The "POSIX sigaction" part is still open for review and needs more time for cross-platform testing.

I am not sure if we want to backport the signal handling changes to release/2.5 ("it qualifies as bugfix"), because it takes quite a large changeset to cleanly address this particular effect (pending signals are overwritten by soft signals on getaddrinfo() errors). Given the lack of sustained yelling, this seems to be mostly an infrequent annoyance. And since *I* opened this particular ticket, I consider it solved sufficiently for my needs: I never run old versions of OpenVPN on Windows.

Thanks, Selva :-)
