Opened 7 years ago

Closed 6 days ago

#639 closed Bug / Defect (fixed)

non-interruptible loop in windows dns resolution failure

Reported by: Gert Döring
Owned by: stipa
Priority: major
Milestone: release 2.6
Component: Generic / unclassified
Version: OpenVPN git master branch (Community Ed)
Severity: Not set (select this one, unless you're an OpenVPN developer)
Keywords: windows signal dns loop
Cc: tct

Description

  • win7 VM with no IPv6 (no interface has v6, so win7 disables v6 altogether)
  • git master 6417a6f8a0
  • connecting with "--proto udp6"
  • run from console window

-> result is an endless "DNS resolution fails, retrying" loop (because Windows will refuse to even look up a v6 record if "there is no v6 in the system!"); neither ctrl-c nor f1...f4 work.

Didn't try from GUI, was too annoyed at my VM setup... but should be reproducible easily enough.

The endless loop happens on Linux as well, but ctrl-c works there - here's a Linux log:

Mon Dec 14 19:42:29 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:29 2015 Could not determine IPv4/IPv6 protocol
Mon Dec 14 19:42:29 2015 SIGUSR1[soft,init_instance] received, process restarting
Mon Dec 14 19:42:29 2015 Restart pause, 5 second(s)
Mon Dec 14 19:42:34 2015 Control Channel Authentication: tls-auth using INLINE static key file
Mon Dec 14 19:42:34 2015 Outgoing Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Mon Dec 14 19:42:34 2015 Incoming Control Channel Authentication: Using 160 bit message hash 'SHA1' for HMAC authentication
Mon Dec 14 19:42:34 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:34 2015 RESOLVE: Cannot resolve host address: v4only.greenie.net:1194 (Name or service not known)
Mon Dec 14 19:42:34 2015 Could not determine IPv4/IPv6 protocol
Mon Dec 14 19:42:34 2015 SIGUSR1[soft,init_instance] received, process restarting
Mon Dec 14 19:42:34 2015 Restart pause, 5 second(s)
^CMon Dec 14 19:42:35 2015 SIGINT[hard,init_instance] received, process exiting

Change History (12)

comment:1 Changed 7 years ago by Gert Döring

Owner: set to stipa
Status: new → assigned

comment:2 Changed 7 years ago by Gert Döring

thanks :)

comment:3 in reply to:  description Changed 7 years ago by Selva Nair

Replying to cron2:

  • win7 VM with no IPv6 (no interface has v6, so win7 disables v6 altogether)
  • git master 6417a6f8a0
  • connecting with "--proto udp6"
  • run from console window

-> result is an endless "DNS resolution fails, retrying" loop (because Windows will refuse to even look up a v6 record if "there is no v6 in the system!"); neither ctrl-c nor f1...f4 work.

Didn't try from GUI, was too annoyed at my VM setup... but should be reproducible easily enough.

The endless loop happens on Linux as well, but ctrl-c works there

A quick question: is this with Xen, KVM, something else? I've seen ctrl-c sometimes ignored on windows 10 even on physical hardware, though a few tries always work.

Please try ctrl-break as well. Although both were added together, ctrl-break is delivered by Windows as a signal, whereas ctrl-c arrives as a key-press, like f1..f4, when running from the console.

comment:4 Changed 7 years ago by Selva Nair

A failed DNS resolution loop can become non-interruptible even on Linux: as a test, use an unused IP address as the nameserver in resolv.conf and start a connection. It is nearly impossible to break out of it with SIGINT or SIGTERM.

There are a number of places in socket.c where sig_info->signal_received is assigned SIGUSR1, overwriting the previous value, which could be SIGTERM or SIGINT (e.g., line socket.c:1919, which appears to be the culprit in this case). Note that sig_info here is a pointer to siginfo_static and its members are volatile. They can change whenever a signal interrupts.
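A minimal C sketch of the overwrite pattern described above. This is not the actual socket.c code: the struct and both function names are hypothetical, and only the `signal_received` field name is taken from the comment.

```c
#include <signal.h>

/* Hypothetical stand-in for OpenVPN's signal_info; only the field
 * discussed in this ticket is modelled. */
struct signal_info_sketch
{
    volatile int signal_received;
};

/* Unconditional assignment: a SIGTERM or SIGINT that a handler stored
 * earlier is silently replaced by the soft restart request. */
static void
resolve_failed_clobber(struct signal_info_sketch *si)
{
    si->signal_received = SIGUSR1;
}

/* Guarded variant: request a restart only if nothing is pending. */
static void
resolve_failed_guarded(struct signal_info_sketch *si)
{
    if (si->signal_received == 0)
    {
        si->signal_received = SIGUSR1;
    }
}
```

The guarded variant only narrows the problem: a handler can still run between the check and the store, so a complete fix also needs signals to be blocked or ordered by priority.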

Interestingly, it's the hard SIGTERM/SIGINT that is easily lost in this case -- a SIGTERM simulated through the management interface doesn't get noticed until the loop restarts and goes back to init.c, so it survives this blatant overwrite of signal_received in socket.c.

comment:5 Changed 6 years ago by Gert Döring

see also #311 which is the same issue but got forgotten in the meantime *sigh*

comment:6 Changed 3 years ago by tct

Cc: tct added

comment:7 Changed 2 years ago by tct

Milestone: release 2.4 → release 2.5

comment:8 Changed 22 months ago by Gert Döring

Milestone: release 2.5 → release 2.5.3

This needs to be re-tested and then either closed or fixed.


comment:9 Changed 5 weeks ago by Gert Döring

I think I've bumped into this with 2.6_beta2 - in certain conditions, OpenVPN just ignores incoming SIGTERM (seems to be "when in SIGUSR1 restart wait").

comment:10 Changed 5 weeks ago by Selva Nair

We need a revamp of how signals are implemented: use POSIX signals (sigaction) so that signals can be blocked, a priority order can be enforced, etc.
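As a rough illustration of what sigaction offers here, the following C sketch installs one handler for the usual signals and blocks all of them while the handler or a critical section runs. It is not taken from the linked commit; every name in it (`last_signal`, `sketch_handler`, `install_handlers`, `block_handled`, `restore_mask`) is hypothetical, and it assumes a POSIX system.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical global for this sketch; not OpenVPN's own signal state. */
static volatile sig_atomic_t last_signal = 0;

static const int handled[] = { SIGINT, SIGTERM, SIGHUP, SIGUSR1, SIGUSR2 };
#define N_HANDLED (sizeof(handled) / sizeof(handled[0]))

/* Handler only records the signal number. */
static void
sketch_handler(int sig)
{
    last_signal = sig;
}

/* Build a set containing every signal this sketch handles. */
static void
fill_handled_set(sigset_t *set)
{
    sigemptyset(set);
    for (size_t i = 0; i < N_HANDLED; i++)
    {
        sigaddset(set, handled[i]);
    }
}

/* Install the handler; sa_mask keeps the other handled signals blocked
 * while the handler itself is running. Returns 0 on success. */
static int
install_handlers(void)
{
    struct sigaction sa;
    memset(&sa, 0, sizeof(sa));
    sa.sa_handler = sketch_handler;
    fill_handled_set(&sa.sa_mask);
    for (size_t i = 0; i < N_HANDLED; i++)
    {
        if (sigaction(handled[i], &sa, NULL) != 0)
        {
            return -1;
        }
    }
    return 0;
}

/* Block the handled signals around a critical section; signals that
 * arrive meanwhile stay pending until the old mask is restored. */
static int
block_handled(sigset_t *old)
{
    sigset_t set;
    fill_handled_set(&set);
    return sigprocmask(SIG_BLOCK, &set, old);
}

static int
restore_mask(const sigset_t *old)
{
    return sigprocmask(SIG_SETMASK, old, NULL);
}
```

A caller would wrap code that reads or updates the pending-signal state in `block_handled()` and `restore_mask()`, so a handler cannot change that state halfway through an update.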

I worked on this years back but dropped the ball. If there is enough interest in pursuing this approach I can resurrect it:

See: https://github.com/selvanair/openvpn/commit/7e5d775227e6d304ce24d7505da9332f405ee4f3

Or here is a summary from 2018 (some things may be outdated)

Fix signal handling issues

Currently signal_received is modified directly in many places in the code,
leading to loss of signals, low priority signals overwriting higher priority
ones etc.

  • Set all signals using functions like register_signal
  • Add a function register_signal_si to help setting of signals when only the pointer to the signal_info struct is available.
  • Allow only a higher or equal priority signal to overwrite an already registered but yet to be processed signal. The signals in increasing order of priority are SIGUSR2, SIGUSR1, SIGHUP, SIGTERM, SIGINT.
  • Use POSIX signals (sigaction) to properly block signals while in the handler etc.
  • Collect windows signals even when management is not available. Currently Windows signals cannot interrupt openvpn_sleep unless the management interface is in use: the latter forces management_event_loop_n_seconds() in place of sleep().
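The priority rule in the list above could look roughly like the following C sketch. It is an illustration under stated assumptions, not the code from the linked commit or from 2.6: `struct pending_signal`, `signal_priority` and `register_signal_sketch` are hypothetical names, and the sketch assumes a POSIX `<signal.h>` that defines SIGHUP, SIGUSR1 and SIGUSR2.

```c
#include <signal.h>
#include <stddef.h>

/* Hypothetical pending-signal record; 0 means "nothing pending". */
struct pending_signal
{
    volatile int signal_received;
    const char *signal_text;
};

/* Priority order from the summary above, lowest to highest:
 * SIGUSR2, SIGUSR1, SIGHUP, SIGTERM, SIGINT. */
static int
signal_priority(int sig)
{
    switch (sig)
    {
        case SIGUSR2: return 1;
        case SIGUSR1: return 2;
        case SIGHUP:  return 3;
        case SIGTERM: return 4;
        case SIGINT:  return 5;
        default:      return 0; /* no signal, or one not ranked here */
    }
}

/* Record sig only if its priority is higher than or equal to that of
 * the signal already pending. Returns 1 if recorded, 0 if ignored. */
static int
register_signal_sketch(struct pending_signal *ps, int sig, const char *text)
{
    if (signal_priority(sig) < signal_priority(ps->signal_received))
    {
        return 0;
    }
    ps->signal_received = sig;
    ps->signal_text = text;
    return 1;
}
```

With this rule, a soft SIGUSR1 raised after a DNS failure can no longer replace a pending SIGTERM or SIGINT, while either of those can still override a pending restart.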

comment:11 Changed 5 weeks ago by Selva Nair

I guess the patch linked above would be too much, too late for 2.6? What about a band-aid fix that adds signal priority but does not extensively re-write sig.c?

comment:12 Changed 6 days ago by Gert Döring

Milestone: release 2.5.3 → release 2.6
Resolution: fixed
Status: assigned → closed

Discussion on the signal issues was summarized in

https://github.com/OpenVPN/openvpn/issues/205

and fixed in a number of patches from Selva Nair that went into 2.6.0 - basically doing what was suggested above. The "POSIX sigaction" part is still open for review and needs more time for cross-platform testing.

I am not sure if we want to backport the signal handling changes to release/2.5 ("it qualifies as bugfix"), because it takes quite a large changeset to cleanly address this particular effect (pending signals are overwritten by soft signals on getaddrinfo() errors). Given the lack of sustained yelling, this seems to be mostly an infrequent annoyance - and since *I* opened this particular ticket, I consider it solved sufficiently for my needs; I never run old versions of OpenVPN on Windows.

Thanks, Selva :-)
