Opened 6 years ago

Closed 4 years ago

#1017 closed Bug / Defect (worksforme)

OpenVPN TCP with 300 clients, new clients won't ping

Reported by: rgaufman Owned by:
Priority: critical Milestone:
Component: Generic / unclassified Version:
Severity: Not set (select this one, unless you're an OpenVPN developer) Keywords:
Cc:

Description

I have been using OpenVPN successfully in TCP mode for a while, but I recently hit the mark of around 300 simultaneous clients. Since then, new clients that connect cannot be pinged for a period of time:

$ sudo grep 10.7.1.215 /etc/openvpn/log/ipp.txt
ad5953801158e779accf,10.7.1.215
$ sudo grep ad5953801158e779accf /etc/openvpn/log/openvpn-status.log
ad5953801158e779accf,86.154.113.255:53858,11241,9878,Sat Feb 10 11:42:49 2018
$ ping 10.7.1.215
PING 10.7.1.215 (10.7.1.215) 56(84) bytes of data.
From 10.7.0.1 icmp_seq=1 Destination Host Unreachable
From 10.7.0.1 icmp_seq=2 Destination Host Unreachable

The new client can stay unreachable for anywhere from 30 minutes to several hours, but it eventually seems to recover on its own. Once recovered, subsequent reconnects from this client work correctly without the long delay. This happens every single time, with every new client.

There is nothing unexpected appearing in the logs (from what I can tell anyway):

$ sudo cat /var/log/syslog | grep -i ovpn | grep -i ad5953801158e779accf
Feb 10 11:42:49 Timeline ovpn[21183]: 86.154.113.255:53858 VERIFY OK: depth=0, C=UK, L=London, O=Org1, CN=ad5953801158e779accf, emailAddress=vpn@org1.com
Feb 10 11:42:49 Timeline ovpn[21183]: 86.154.113.255:53858 [ad5953801158e779accf] Peer Connection Initiated with [AF_INET]86.154.113.255:53858
Feb 10 11:42:49 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 MULTI_sva: pool returned IPv4=10.7.1.215, IPv6=(Not enabled)
Feb 10 11:42:49 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 OPTIONS IMPORT: reading client specific options from: /tmp/openvpn_cc_723245f993cf734b3cf7f701a0fb8d83.tmp
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 SENT CONTROL [ad5953801158e779accf]: 'PUSH_REPLY,route-gateway 10.7.0.1,ping 10,ping-restart 60,ifconfig 10.7.1.215 255.255.0.0,peer-id 0' (status=1)
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'
Feb 10 11:43:14 Timeline ovpn[21183]: message repeated 2 times: [ ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST']
Feb 10 11:43:15 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'

Top is not showing a particularly high load on the server or the OpenVPN process:

Tasks: 684 total,   3 running, 681 sleeping,   0 stopped,   0 zombie
%Cpu(s): 24.4 us,  5.5 sy, 17.5 ni, 49.2 id,  0.8 wa,  0.0 hi,  2.1 si,  0.5 st
KiB Mem : 24689504 total,  5199748 free, 16910864 used,  2578892 buff/cache
KiB Swap: 16775164 total,  3100740 free, 13674424 used.  7150244 avail Mem

  PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
21183 nobody    20   0  167076  58028   4028 S  14.2  0.2   4:19.40 /usr/sbin/openvpn --daemon ovpn-timeline.is --status /run/openvpn/timeline.is.status 10 --cd /etc/openvpn --script-security 2 --config /etc/openvpn/timeline.is.conf --writepid /run/openvpn/timeline.is.pid

The server is running openvpn 2.4.4-xenial0 with the following config:

port 7443
proto tcp
dev tap0
ca certs/ca.crt
cert certs/server.crt
key certs/server.key
dh certs/dh1024.pem
ifconfig-pool-persist log/ipp.txt 10
server-bridge 10.7.0.1 255.255.0.0 10.7.0.2 10.7.254.254
keepalive 10 60
comp-lzo
user nobody
group nogroup
persist-key
persist-tun
status log/openvpn-status.log
verb 3
management localhost 7506
client-connect /home/deployer/openvpn_control.rb
client-disconnect /home/deployer/openvpn_control.rb
up /etc/openvpn/postup.sh
script-security 2

The clients are running openvpn 2.4.4-xenial0 also and the configs look like this:

client
dev tap
proto tcp
remote vpn-server 7443 tcp-client
resolv-retry infinite
nobind
persist-key
persist-tun
comp-lzo
ca ca.crt
cert client.crt
key client.key
ns-cert-type server
verb 3
mute 20

I'm not sure whether I have provided enough information, and I'm not really sure how to troubleshoot this issue. Any advice would be much appreciated!

Change History (6)

comment:1 Changed 6 years ago by rgaufman

One other symptom I didn't mention: when I enable client-to-client, I seem to randomly lose connectivity between some clients. E.g. client C can connect to client D but not to client E (both ping and ssh), but I'm able to ping and ssh to all 3 from the server.

All the clients have the same version and config, apart from a unique cert/key. I have also tried running the OpenVPN version that comes with Ubuntu 16.04 (2.3.10-1ubuntu2.1), but I am currently running 2.4.4-xenial0. Both exhibit the same behaviour.

This problem only started happening around the 300 client mark, until then everything was working correctly. Any ideas at all?

comment:2 Changed 6 years ago by tct

Watching

comment:3 Changed 6 years ago by Selva Nair

You are running a bridged setup with TCP, a combo I've never used. However, if bridging is not essential for the setup (it's rarely needed), using --dev tun, --topology subnet and --server 10.7.0.0 255.255.0.0 would generate the same VPN network (except at layer 3) and should work better. And if you can replace that TCP with UDP, even better -- that'll also make it a widely used setup.
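A minimal sketch of what such a routed server config might look like, reusing the port and address range from the ticket. The switch to UDP is an assumption that only holds if clients can reach the server over UDP, and the certificate, logging and script lines from the original config are left out here.

```
# Routed (layer 3) sketch of the server config; not a drop-in replacement
port 7443
# Assumption: UDP is reachable for the clients; keep "proto tcp" otherwise
proto udp
# Replaces "dev tap0"
dev tun
topology subnet
# Replaces "server-bridge 10.7.0.1 255.255.0.0 10.7.0.2 10.7.254.254"
server 10.7.0.0 255.255.0.0
ifconfig-pool-persist log/ipp.txt 10
keepalive 10 60
```

Existing clients configured with dev tap would not be able to use this server, which is why a migration plan is needed.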

As for debugging, look at client and server logs at a high verb level to see the packet reads and writes. --verb 5 will write R and W for each packet read and write; --verb 9 will provide more info. Mind you, that will generate a lot of log output, especially on the server, with so many connections active. tcpdump would also help.

comment:4 in reply to:  3 Changed 6 years ago by rgaufman

Replying to selvanair:

Thank you, I have been reading up about it and I think TUN is the way to go. Is there any way to have the server push the "tun" config to the clients? Some of the clients can be offline for many months, so I'm not quite sure how to migrate everything across to tun safely. Any advice for that?

As for UDP, absolutely. Some clients use port 443 on TCP to bypass aggressive firewalls, but that will be a backup rather than the norm.

comment:5 Changed 6 years ago by Selva Nair

There is no perfect migration path. One option is to run two servers (the new one with tun + subnet topology) and gradually get all clients migrated. If UDP and TCP servers are required, that would make it 4 servers. If you have multiple public IPs on the server, that would make it somewhat easier.

Clients that need TCP as a fallback can have multiple remotes (or connection profiles) in the config. But for clients that are known to work with UDP, do not add TCP remotes, as UDP is much better.
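A rough sketch of a client config with a TCP fallback. The host name vpn-server and TCP port 443 come from the ticket; the UDP port 7443 is an assumption. OpenVPN tries the remotes in the order listed unless remote-random is set.

```
# Client sketch: try UDP first, fall back to TCP on port 443
client
dev tun
# Assumption: the UDP server listens on port 7443
remote vpn-server 7443 udp
remote vpn-server 443 tcp-client
resolv-retry infinite
nobind
```

Clients known to work over UDP would carry only the first remote line.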

Anyway, think of this as an opportunity to plan better and avoid mistakes such as bridging when routing would have been the right choice. For example, do use --topology subnet on the server, not the default net30. As you use --server, it will get pushed to the client, so it is not needed in the client config. Update deprecated options (e.g., comp-lzo, ns-cert-type). Hard-code only absolutely necessary options in the client config. As far as possible, use client-side options in a way that is easy to adapt from the server: e.g., --compress instead of --compress lzo, and then push the required compression algorithm from the server. Test thoroughly before deploying.
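A sketch of how these suggestions could look in 2.4-style configs. The choice of lz4-v2 is only an illustration, and remote-cert-tls server (the usual replacement for ns-cert-type server) assumes the server certificate carries the matching key usage attributes. Whether compression is wanted at all is a separate decision.

```
# Server side (sketch): decide options centrally and push them
topology subnet
compress lz4-v2
push "compress lz4-v2"

# Client side (sketch): keep hard-coded options generic
# "compress" with no algorithm only enables the framing, so the
# server can push the actual algorithm later
compress
remote-cert-tls server
```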

comment:6 Changed 4 years ago by Gert Döring

Resolution: worksforme
Status: new → closed

"Hanging" connections can also be caused by --client-connect scripts that take significant time, for example, because a DNS lookup times out. Since OpenVPN is single-threaded, scripts need to exit "very fast" - anything that can take longer needs to be put into a background task.

Since there hasn't been any activity in the last 3 years, I assume that the original problem was solved or worked around based on Selva's suggestions, and will close the ticket.
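A minimal sketch of that pattern as a POSIX shell hook. The ticket does not show what openvpn_control.rb does, so slow_task, run_detached, client_connect_hook, HOOK_LOG and SLOW_SECONDS are all hypothetical names, and sleep merely stands in for the slow work. common_name is the environment variable OpenVPN sets for the connecting client.

```shell
#!/bin/sh
# Hypothetical client-connect hook: hand slow work to a detached
# background job so control returns to OpenVPN immediately.

# Placeholder for slow work (DNS lookup, HTTP call, database write).
# Assumption: sleep stands in for the real, unknown work.
slow_task() {
    sleep "${SLOW_SECONDS:-5}"
    echo "finished slow work for $1" >> "${HOOK_LOG:-/tmp/ovpn-hook.log}"
}

# Run a command in a background subshell with stdio detached, so the
# caller is not kept waiting on inherited file descriptors.
run_detached() {
    ( "$@" ) </dev/null >/dev/null 2>&1 &
}

# Entry point: start the slow work in the background and return at once.
client_connect_hook() {
    run_detached slow_task "${common_name:-unknown}"
    return 0
}
```

In a real hook script, the last line would call client_connect_hook. Anything whose result must influence the connection (for example per-client options) still has to be done in the foreground, so only work that can finish later belongs in slow_task.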