Opened 3 years ago

Last modified 3 years ago

#950 new Bug / Defect

Long-running VPN session stops decrypting data

Reported by: jdanders Owned by:
Priority: major Milestone:
Component: Crypto Version: OpenVPN 2.4.4 (Community Ed)
Severity: Not set (select this one, unless your'e a OpenVPN developer) Keywords:
Cc: Steffan Karger

Description

I have an OpenVPN connection between my phone and my server. Periodically, the VPN stops sending data, even though the connection seems to be open still. Restarting the client restores data flow.

I turned up logging on the server to try to see what is happening, but I cannot determine a root cause. I've attached the full log. The first "Error" I see is at Sat Oct 21 16:37:27 2017 us=180526, and I know things aren't working by the time I see AEAD errors, starting at Sat Oct 21 17:13:13 2017 us=359274. Because this is an error that seems to occur randomly after lots of time, it is hard to get more detail than this.

The first error is this:
Sat Oct 21 16:37:27 2017 us=180526 TLS State Error: No TLS state for client [AF_INET]192.168.1.1:59790, opcode=9
Sat Oct 21 16:37:27 2017 us=180648 GET INST BY REAL: 192.168.1.1:59790 [failed]

Eventually I see this for every packet sent from client to server, and the server drops the data:
Sat Oct 21 17:13:13 2017 us=359274 phone/192.168.1.1:59790 AEAD Decrypt error: cipher final failed

I am running OpenVPN 2.4.4 on my server, and OpenVPN for Android on my client.

I think this is different than issue 847.

Attachments (2)

truncated_log.log.gz (244.6 KB) - added by jdanders 3 years ago.
Truncated log trimmed to show error points.
issue950_verb4.log (177.8 KB) - added by jdanders 3 years ago.
Issue captured again with verbosity 4 this time

Download all attachments as: .zip

Change History (16)

Changed 3 years ago by jdanders

Attachment: truncated_log.log.gz added

Truncated log trimmed to show error points.

comment:1 Changed 3 years ago by Gert Döring

Cc: Steffan Karger added

comment:2 Changed 3 years ago by tincantech

verb = 11 is completely over the top .. use verb = 4 which will show you all the errors you need to know about.

SENT CONTROL [phone]: 'PUSH_REPLY,redirect-gateway autolocal bypass-dhcp,route 192.168.1.0 255.255.255.0,dhcp-option DNS 192.168.1.1,route 192.168.8.1,topology net30,ifconfig 192.168.8.6 192.168.8.5,peer-id 0' (status=1)

No --ping/-ping-restart (WARNING: --keepalive option is missing from server config)

You may find it easier to use the Forum (https://forums.openvpn.net/viewtopic.php?f=30&t=22603).

Last edited 3 years ago by tincantech (previous) (diff)

comment:3 Changed 3 years ago by Gert Döring

I'm neither convinced that this is a user error or that it is related to --ping - which is why I cc'ed syzzer here, as he's the one who understands the intricacies of TLS renegotiations and AEAD.

comment:4 Changed 3 years ago by jdanders

The client and server both have ping parameters of 30, and --ping-restart and --keepalive are deliberately not used. Note that the connection is healthy and I am guessing that neither client nor server would detect the need to restart the connection in this case.

comment:5 Changed 3 years ago by Selva Nair

I think you are running multiple client instances with same common name either deliberately (if so you'd need --duplicate-cn) or by accident.

Around 16:37 a second connection using the same name 'phone' is opened from port 52655 which replaces the previous connection from port 59790. Data continues to arrive from 59790 for which the server can no more find a state. A log at lower verbosity would have shown this more clearly without having to wade through the chatter -- I understand you suspect some TLS-related error and hence the verbose log...

The reverse happens at 17:12 with connection from port 59790 usurping the other. Two client instances are pushing each other out of the way at regular intervals. Usually this would happen at the ping-restart intervals, but your case may be different.

comment:6 Changed 3 years ago by jdanders

I will try to grab a less verbose log if there aren't any useful details in that verbosity of log.

The phone would be switching from wifi to LTE, which causes multiple reconnections with the same CN. I often see "MULTI: new connection by client 'phone' will cause previous active sessions by this client to be dropped." in my logs, and I've always assumed it's from network switches. In this particular log, I see new connections at 11:57 (trimmed from log port 36571 takes over 40119), 16:10, 16:37, and 17:12.

I see the following:
16:10 - 192.168.1.1:59790 Re-using SSL/TLS context
16:37:23 us=488620 192.168.1.1:52655 Re-using SSL/TLS context
16:37:24 us=751315 SCHEDULE: schedule_add_modify wakeup=[Sat Oct 21 16:37:25 2017 us=652533] pri=190312851
16:37:25 us=654722 192.168.1.1:52655 TIMER: coarse timer wakeup 1 seconds
16:37:26 us=287646 phone/192.168.1.1:59790 TLS: ...
16:37:26 us=318859 MULTI: new connection by client 'phone' will cause previous active sessions by this client to be dropped

And after the "new connection" message, the old connection gets TLS state errors. That makes sense. It looks like the server if setting a timer to do something on the other port after one second? It looks like strange behavior that both connections are maintained from 16:37-17:12. Maybe the phone has both Wifi and LTE connections open at the same time?

Obviously I would not expect good behavior while two connections are canceling each other out, and I would imagine sort of behavior is not very well tested. Do you think there's a bug here, or is it just bad client behavior to keep both connections open? The first reconnection at 11:57 happened without error.

comment:7 Changed 3 years ago by Selva Nair

I will try to grab a less verbose log if there aren't any useful details in that verbosity of log.

Use verb 4 next time please, unless someone asks for log at a particular verb level.

I know little about the android client to say anything more. But if the client is indeed making multiple connections when another interface comes up something is wrong.

comment:8 Changed 3 years ago by jdanders

Looking at the verb 4 log now, and I was wrong -- a switch from LTE to wifi just produces a "peer 1 (phone) floated from 172.56.38.207:32339 to [AF_INET]192.168.1.1:36571". So I am not sure what causes the reconnection.

comment:9 Changed 3 years ago by Selva Nair

I do not think its a reconnection. Its more likely a second instance causing a contention. Check your setup. If the IP is shared (say, by NAT) it may be even from another machine/device using the same certificate.

There are other strange things in your log -- why is the client connecting from 192.168.1.1 -- is that really the ip of the phone with both client and server on a private network? If so, why is the server set to push the same 192.168.1.1 as the DNS server?

comment:10 Changed 3 years ago by jdanders

I noticed that too. 192.168.1.1 is the gateway router. The client phone while on the local network would have an address of 192.168.1.44. That address does show up very rarely (14 times in the whole day's log) in the untruncated log with these messages:

Sat Oct 21 10:15:01 2017 us=468090 phone/192.168.1.1:40119 GET INST BY VIRT: 192.168.1.44 [failed]
Sat Oct 21 10:15:01 2017 us=468197 phone/192.168.1.1:40119 MULTI: bad source address from client [192.168.1.44], packet dropped

The phone uses my home public IP address as the OpenVPN server address whether in the local network or out. The VPN bound packets from inside the network would go to the router (because it's not a local address). I would need to do some tcpdumps to find out what happens next, but I'm guessing it gets NAT'd back in with a source address that matches the router's internal address?

Changed 3 years ago by jdanders

Attachment: issue950_verb4.log added

Issue captured again with verbosity 4 this time

comment:11 Changed 3 years ago by jdanders

So I captured another log and have a little more information this time. One thing is for sure: while this is in a bad state the client on my phone is holding two connections open (ports 37755 and 40916 in this case).

I have this issue open for the client software, and the author of the client pointed out that it uses the peer-id for roaming between networks. It looks like the Android client begins a reconnect when it detects no network connectivity for a while. Typically when this happens, the peer-id alternates between 0 and 1 (since there are no other connections to this server). However, at 18:03:25, peer-id 1 is used for the second time in a row. Is that maybe a problem?

My brief analysis of this second log:

The phone jumped around between LTE (172.x.x.x address) and wifi (192.x.x.x or 65.44.211.33, home and work). While the phone was having connection problems, the client issued a restart, such as at 07:07:49, 09:33:46, etc. At 14:42 there is a TLS failure, probably due to poor connectivity, and then the next reconnect was successful. There are more TLS failures at 17:57:17, but I don't know what caused those, and the next reconnect was still successful.

I am pretty sure everything was good between 18:08:33 and 19:05:08, and then at 19:08:59 the unfortunate second connection is created.

So I think the question for this issue is how do I get to a state where everything is dropped with "AEAD Decrypt error: cipher final failed" and the client doesn't know to restart the connection?

comment:12 Changed 3 years ago by Selva Nair

The phone uses my home public IP address as the OpenVPN server address whether in the local network or out. The VPN bound packets from inside the network would go to the router (because it's not a local address). I would need to do some tcpdumps to find out what happens next, but I'm guessing it gets NAT'd back in with a source address that matches the router's internal address?

It seems you are doing NAT on the external interface translating external ips to that of the router's internal ip. That is an uncommon setup and most likely a mis-configuration.

So I think the question for this issue is how do I get to a state where everything is dropped with "AEAD Decrypt error: cipher final failed" and the client doesn't know to restart the connection?

The server logs wont answer that. Check the client config and logs. But your real problem is the second connection. Even if the client that lost connectivity restarted, it will only push out the other one for a while until that one reconnects. Are you sure you are not mistakenly running two instances?

comment:13 Changed 3 years ago by jdanders

Thanks for the observations.

The server logs wont answer that. Check the client config and logs. But your real problem is the second connection. Even if the client that lost connectivity restarted, it will only push out the other one for a while until that one reconnects. Are you sure you are not mistakenly running two instances?

I think there's a bug in the server that is exposed by this bad client behavior. The client is definitely misbehaving, I can see the two connections open on my Android phone. So there is only one client, and it's maintaining two connections. That is not good behavior. I hope to resolve that behavior with the author of the client.

However, I would expect the server to either alternate between the two connections, throwing a bunch of "MULTI: new connection by client 'phone' will cause previous active sessions by this client to be dropped." warnings, or I would expect it to hold one, and cause the TLS State Error: No TLS state for client to fire and close the connection. I don't expect the server to accept data and then drop it server-side.

I am also curious about the repeated peer-id 1.

comment:14 Changed 3 years ago by Selva Nair

However, I would expect the server to either alternate between the two connections, throwing a bunch of "MULTI: new connection by client 'phone' will cause previous active sessions by this client to be dropped." warnings, or I would expect it to hold one, and cause the TLS State Error: No TLS state for client to fire and close the connection. I don't expect the server to accept data and then drop it server-side.

There is nothing to close in a UDP connection. Only thing the server can do is to drop packets it cant decrypt. Use --ping and --ping-restart to detect and reconnect a broken tunnel. Also fix your multiple concurrent connections issue.

Note: See TracTickets for help on using tickets.