Opened 7 years ago
Closed 4 years ago
#1017 closed Bug / Defect (worksforme)
OpenVPN TCP with 300 clients, new clients won't ping
Reported by: | rgaufman | Owned by: | |
---|---|---|---|
Priority: | critical | Milestone: | |
Component: | Generic / unclassified | Version: | |
Severity: | Not set (select this one, unless you're an OpenVPN developer) | Keywords: | |
Cc: |
Description
I have been using OpenVPN successfully in TCP mode for a while, but I recently reached around 300 simultaneous clients. Since then, newly connected clients cannot be pinged for a period of time:
$ sudo grep 10.7.1.215 /etc/openvpn/log/ipp.txt
ad5953801158e779accf,10.7.1.215
$ sudo grep ad5953801158e779accf /etc/openvpn/log/openvpn-status.log
ad5953801158e779accf,86.154.113.255:53858,11241,9878,Sat Feb 10 11:42:49 2018
$ ping 10.7.1.215
PING 10.7.1.215 (10.7.1.215) 56(84) bytes of data.
From 10.7.0.1 icmp_seq=1 Destination Host Unreachable
From 10.7.0.1 icmp_seq=2 Destination Host Unreachable
A new client can remain unreachable for anywhere from 30 minutes to hours, but eventually it seems to recover on its own. Once recovered, subsequent reconnects from this client work correctly without this long delay. This happens every single time, with every new client.
There is nothing unexpected appearing in the logs (from what I can tell anyway):
$ sudo cat /var/log/syslog | grep -i ovpn | grep -i ad5953801158e779accf
Feb 10 11:42:49 Timeline ovpn[21183]: 86.154.113.255:53858 VERIFY OK: depth=0, C=UK, L=London, O=Org1, CN=ad5953801158e779accf, emailAddress=vpn@org1.com
Feb 10 11:42:49 Timeline ovpn[21183]: 86.154.113.255:53858 [ad5953801158e779accf] Peer Connection Initiated with [AF_INET]86.154.113.255:53858
Feb 10 11:42:49 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 MULTI_sva: pool returned IPv4=10.7.1.215, IPv6=(Not enabled)
Feb 10 11:42:49 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 OPTIONS IMPORT: reading client specific options from: /tmp/openvpn_cc_723245f993cf734b3cf7f701a0fb8d83.tmp
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 SENT CONTROL [ad5953801158e779accf]: 'PUSH_REPLY,route-gateway 10.7.0.1,ping 10,ping-restart 60,ifconfig 10.7.1.215 255.255.0.0,peer-id 0' (status=1)
Feb 10 11:43:14 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'
Feb 10 11:43:14 Timeline ovpn[21183]: message repeated 2 times: [ ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST']
Feb 10 11:43:15 Timeline ovpn[21183]: ad5953801158e779accf/86.154.113.255:53858 PUSH: Received control message: 'PUSH_REQUEST'
Top is not showing a particularly high load on the server or the OpenVPN process:
Tasks: 684 total, 3 running, 681 sleeping, 0 stopped, 0 zombie
%Cpu(s): 24.4 us, 5.5 sy, 17.5 ni, 49.2 id, 0.8 wa, 0.0 hi, 2.1 si, 0.5 st
KiB Mem : 24689504 total, 5199748 free, 16910864 used, 2578892 buff/cache
KiB Swap: 16775164 total, 3100740 free, 13674424 used. 7150244 avail Mem
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
21183 nobody 20 0 167076 58028 4028 S 14.2 0.2 4:19.40 /usr/sbin/openvpn --daemon ovpn-timeline.is --status /run/openvpn/timeline.is.status 10 --cd /etc/openvpn --script-security 2 --config /etc/openvpn/timeline.is.conf --writepid /run/openvpn/timeline.is.pid
The server is running openvpn 2.4.4-xenial0 with the following config:
port 7443
proto tcp
dev tap0
ca certs/ca.crt
cert certs/server.crt
key certs/server.key
dh certs/dh1024.pem
ifconfig-pool-persist log/ipp.txt 10
server-bridge 10.7.0.1 255.255.0.0 10.7.0.2 10.7.254.254
keepalive 10 60
comp-lzo
user nobody
group nogroup
persist-key
persist-tun
status log/openvpn-status.log
verb 3
management localhost 7506
client-connect /home/deployer/openvpn_control.rb
client-disconnect /home/deployer/openvpn_control.rb
up /etc/openvpn/postup.sh
script-security 2
The clients are running openvpn 2.4.4-xenial0 also and the configs look like this:
client
dev tap
proto tcp
remote vpn-server 7443 tcp-client
resolv-retry infinite
nobind
persist-key
persist-tun
comp-lzo
ca ca.crt
cert client.crt
key client.key
ns-cert-type server
verb 3
mute 20
I'm not sure whether I have provided enough information, and I'm not really sure how to troubleshoot this issue. Any advice would be much appreciated!
Change History (6)
comment:3 follow-up: 4 Changed 7 years ago by
You are running a bridged setup with TCP, a combo I've never used. However, if bridging is not essential for the setup (it's rarely needed), using --dev tun, --topology subnet and --server 10.7.0.0 255.255.0.0 would generate the same VPN network (except layer 3) and should work better. And if you can replace that TCP with UDP even better -- that'll also make it a widely used setup.
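A minimal, untested sketch of what the routed server config suggested here might look like, keeping the port, certificate paths, and address range from the original config. Switching to UDP is an assumption that only holds where UDP is usable; comp-lzo, user/group, management, and the script hooks from the original config are left out for brevity.

```
# Routed (layer 3) variant of the server config from the ticket; untested sketch
port 7443
proto udp                      # assumption: UDP is usable; otherwise keep "proto tcp"
dev tun                        # instead of dev tap0
topology subnet                # instead of the default net30
server 10.7.0.0 255.255.0.0    # replaces server-bridge; the server itself takes 10.7.0.1
ca certs/ca.crt
cert certs/server.crt
key certs/server.key
dh certs/dh1024.pem
ifconfig-pool-persist log/ipp.txt 10
keepalive 10 60
persist-key
persist-tun
status log/openvpn-status.log
verb 3
```

Clients would also need dev tun in their own config, since the device type is set locally rather than pushed by the server.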
As for debugging, look at client and server logs at a high verb level to see the packet reads and writes. --verb 5 will write R and W for each packet read and write, and --verb 9 will provide more info. Mind you, that will generate a lot of log output, especially on the server, with so many connections active. tcpdump would also help.
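As a rough sketch of the tcpdump suggestion: "Destination Host Unreachable" reported from the server's own address (10.7.0.1) typically means ARP resolution for the client failed, so watching ARP and ICMP for that one client on the tap interface is a reasonable first step. The function names below are hypothetical; tap0 and 10.7.1.215 are taken from the ticket.

```shell
# Build a capture filter limited to ARP and ICMP traffic for one client address.
client_filter() {
    printf '(arp or icmp) and host %s' "$1"
}

# Capture on the server's tap interface (default tap0, as in the ticket's config).
# Needs root and runs until interrupted, so it is wrapped in a function
# instead of being executed directly.
capture_client() {
    sudo tcpdump -n -e -i "${2:-tap0}" "$(client_filter "$1")"
}

# Usage: capture_client 10.7.1.215
```

The -e flag prints the link-level header, which shows whether ARP requests for the client get any reply through the tunnel.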
comment:4 Changed 7 years ago by
Replying to selvanair:
You are running a bridged setup with TCP, a combo I've never used. However, if bridging is not essential for the setup (it's rarely needed), using --dev tun, --topology subnet and --server 10.7.0.0 255.255.0.0 would generate the same VPN network (except layer 3) and should work better. And if you can replace that TCP with UDP even better -- that'll also make it a widely used setup.
Thank you, I have been reading up about it and I think TUN is the way to go. Is there any way to have the server push the "tun" config to the clients? Some of the clients can be offline for many months, so I'm not quite sure how to migrate everything across to tun safely. Any advice for that?
As for UDP, absolutely. Some clients use TCP on port 443 to bypass aggressive firewalls, but that will be a backup rather than the norm.
comment:5 Changed 7 years ago by
There is no perfect migration path. One option is to run two servers (the new one with tun + subnet topology) and gradually get all clients migrated. If UDP and TCP servers are required, that would make it 4 servers. If you have multiple public IPs on the server, that would make it somewhat easier.
Clients that need TCP as a fallback can have multiple remotes (or connection profiles) in the config. But for clients that are known to work with UDP, do not add TCP remotes, as UDP is much better.
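A minimal sketch of the multiple-remotes idea for a client that needs the TCP fallback. The hostname vpn-server is the one from the ticket; the UDP port 1194 is an assumption (the OpenVPN default), and TCP 443 is the port mentioned earlier in the thread. It assumes a separate server instance is listening on each port.

```
# Client fragment: remotes are tried in the order listed
remote vpn-server 1194 udp    # preferred; port is an assumption
remote vpn-server 443 tcp     # fallback for networks that block UDP
resolv-retry infinite
nobind
```

Clients known to work over UDP would carry only the first remote line.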
Anyway, think of this as an opportunity to plan better and avoid mistakes like bridging when routing would have been the right choice. For example, do use --topology subnet on the server, not the default net30. As you use --server, it will get pushed to the client, so it is not needed in the client config. Update deprecated options (e.g., comp-lzo, ns-cert-type). Hard-code only absolutely necessary options in the client config. As far as possible, use client-side options in a way that is easy to adapt from the server: e.g., --compress instead of --compress lzo, and then push the required compression algorithm from the server. Etc. Test thoroughly before deploying.
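A sketch of how these recommendations might translate into config lines for OpenVPN 2.4. The algorithm lz4-v2 is only an example choice, and remote-cert-tls server assumes the server certificate carries the matching key usage attributes, which the ticket does not confirm.

```
# --- server side (fragment) ---
topology subnet
compress lz4-v2            # example algorithm; an assumption, not from the ticket
push "compress lz4-v2"     # clients take the algorithm from the server

# --- client side (fragment) ---
client
dev tun
compress                   # no algorithm here: enables framing, server pushes the rest
remote-cert-tls server     # replacement for the deprecated ns-cert-type server
```

Keeping the algorithm out of the client config means it can later be changed on the server alone, which matters when some clients are offline for months.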
comment:6 Changed 4 years ago by
Resolution: | → worksforme |
---|---|
Status: | new → closed |
"Hanging" connections can also be caused by --client-connect
scripts that take significant time, for example, because a DNS lookup times out. Since OpenVPN is single-threaded, scripts need to exit "very fast" - anything that can take longer needs to be put into a background task.
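A minimal sketch of the "background task" pattern for a client-connect script, assuming a POSIX shell wrapper. notify_webapp and run_detached are hypothetical names, and the sleep stands in for whatever slow call the real script makes; common_name is the environment variable OpenVPN sets for the connecting client.

```shell
#!/bin/sh
# Hypothetical client-connect wrapper: hand slow work to a background job
# so OpenVPN's single-threaded event loop is not blocked while it runs.

# Placeholder for the real notification; it may stall on DNS or HTTP.
notify_webapp() {
    sleep 2   # stands in for the slow call
    echo "notified: $1" >> "${NOTIFY_LOG:-/tmp/ovpn-notify.log}"
}

# Start a command detached from this script's stdio and return immediately.
# The inner "&" backgrounds the job; the enclosing subshell exits at once,
# so the caller never waits for the job to finish.
run_detached() {
    ( "$@" </dev/null >/dev/null 2>&1 & )
}

run_detached notify_webapp "${common_name:-unknown}"
```

The script itself returns right away with status 0, so the client is accepted even if the notification later fails. That is a design trade-off and only suitable when the script's result is not needed to authorize the connection.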
Since there hasn't been any activity in the last 3 years, I assume that the original problem was solved or worked around based on Selva's suggestions, and will close the ticket.
Note: the client-connect/client-disconnect scripts just notify a web app when a client connects.
One other symptom I didn't mention: when I enable client-to-client, connectivity between some clients seems to be lost at random. E.g. client C can connect to client D but not to client E (both ping/ssh), but I'm able to ping/ssh to all 3 from the server.
All the clients have the same version and config, except for a unique cert/key. I have also tried running the OpenVPN version that comes with Ubuntu 16.04 (2.3.10-1ubuntu2.1), but I am currently running 2.4.4-xenial0 - both exhibit the same behaviour.
This problem only started happening around the 300 client mark, until then everything was working correctly. Any ideas at all?