Opened 7 years ago

Last modified 7 years ago

#847 new Bug / Defect

Long lived crypto "state" client lock-up

Reported by: David Sommerseth Owned by: Steffan Karger
Priority: major Milestone:
Component: Crypto Version: OpenVPN 2.3.10 (Community Ed)
Severity: Not set (select this one, unless your'e a OpenVPN developer) Keywords:
Cc: tct, Gert Döring

Description

Configuration setup

Client

[root@localhost ~]# openvpn --client --remote 192.168.122.1 --ca sample/sample-keys/ca.crt --key sample/sample-keys/client.key --cert sample/sample-keys/client.crt --verb 4 --dev tun --auth-user-pass --explicit-exit-notify --auth-nocache

Server

[root@localhost ~]# openvpn --dev tun --ca sample/sample-keys/ca.crt --key sample/sample-keys/server.key --cert sample/sample-keys/server.crt --dh sample/sample-keys/dh2048.pem --server 10.8.0.0 255.255.255.0 --script-security 3 --auth-user-pass-verify .auth.sh via-env --verb 7 --reneg-sec 10

What happens

The client asks for username/password, connects and running ping on the client against 10.8.0.1 works as expected.
A renegotiation is forced (due to --reneg-sec 10). At this point do not enter any credentials until the server have logged:

Sat Feb 25 00:34:55 2017 us=576093 Test-Client/192.168.122.9:1194 TLS Error: TLS key negotiation failed to occur within 60 seconds (check your network connectivity)
Sat Feb 25 00:34:55 2017 us=576099 Test-Client/192.168.122.9:1194 TLS Error: TLS handshake failed
Sat Feb 25 00:34:55 2017 us=576104 Test-Client/192.168.122.9:1194 TLS: move_session: dest=TM_LAME_DUCK src=TM_ACTIVE reinit_src=1

When this stage is completed, enter the proper username/password credentials. The client will re-connect properly (might need a couple of attempts, due to some client time-outs). But from this point of the server will no longer accept packets on the DATA channel from the client. Right before the rejection happens, this can be seen in the server log:

Sat Feb 25 00:35:06 2017 us=972360 Test-Client/192.168.122.9:1194 Data Channel Encrypt: Cipher 'AES-256-GCM' initialized with 256 bit key
Sat Feb 25 00:35:06 2017 us=972379 Test-Client/192.168.122.9:1194 Data Channel Encrypt: CIPHER KEY: 3d4f9143 557d88ff 51cf05f7 4052abd8 93126d2e f2c1826c 1a7888a2 ef3cceb8
Sat Feb 25 00:35:06 2017 us=972401 Test-Client/192.168.122.9:1194 Data Channel Encrypt: CIPHER block_size=16 iv_size=12
Sat Feb 25 00:35:06 2017 us=972424 Test-Client/192.168.122.9:1194 Data Channel Decrypt: Cipher 'AES-256-GCM' initialized with 256 bit key
Sat Feb 25 00:35:06 2017 us=972449 Test-Client/192.168.122.9:1194 Data Channel Decrypt: CIPHER KEY: e8c6b960 88e078da ce001f18 8f465bb0 bc82adfa 63118cd3 7d29d6b1 70273ae4
Sat Feb 25 00:35:06 2017 us=972465 Test-Client/192.168.122.9:1194 Data Channel Decrypt: CIPHER block_size=16 iv_size=12
Sat Feb 25 00:35:06 2017 us=972540 Test-Client/192.168.122.9:1194 UDPv4 WRITE [247] to [AF_INET]192.168.122.9:1194: P_CONTROL_V1 kid=0 [ 5 ] pid=6 DATA len=221

And after that, the following log lines appears for each packet the client tries to send over the VPN tunnel

Sat Feb 25 00:35:06 2017 us=972653 Test-Client/192.168.122.9:1194 AEAD Decrypt error: cipher final failed

The bad thing about this is that this is persistent if the client stops its openvpn process (even with --explicit-exit-notify) and starts a completely fresh session. Restarting the server side re-enables the client again.

Change History (8)

comment:1 Changed 7 years ago by David Sommerseth

Cc: tct added

comment:2 Changed 7 years ago by Steffan Karger

This looks NCP-related. Can you test if this still happens if (at least) one of the sides specifies --ncp-disable?

I've tried reproducing the behaviour, but didn't manage to get "AEAD Decrypt error: cipher final failed", I did manage to get "Key [AF_INET]10.1.1.2:36499 [0] not initialized (yet), dropping packet." though. One reconnect later the connection recovered in my test.

I'll investigate further.

comment:3 Changed 7 years ago by Steffan Karger

Priority: blockermajor
Summary: Long lived AEAD "state" client lock-upLong lived crypto "state" client lock-up
Version: git master branch2.3.10

Just managed to reproduce, and this is not NCP-related. The same happens when running 2.3.10 server vs 2.3.10 client, it just reports Authenticate/Decrypt packet error: packet HMAC authentication failed instead of AEAD Decrypt error: cipher final failed.

So this seems to be a long-standing thing, which requires diving deep into the openvpn connection setup logic, and not something new that lots people will run into. I'm reducing the priority - there's more pressing stuff to dive into first.

comment:4 Changed 7 years ago by Gert Döring

Cc: Gert Döring added

comment:5 Changed 7 years ago by kpielorz

Just ran into this issue as well - OpenVPN had been up for 94 days, with a small number of clients (very low churn rate - i.e. teardown/setup every 3-4 days) - suddenly logging on the server:

" ... 1194 AEAD Decrypt error: cipher final failed"

As above - restarting client didn't help, only restarting the server...

So guess that's a "/me too"

-Karl

comment:6 Changed 7 years ago by Marcin T.

I have the same problem since i upgraded Centos 7.2 to Centos 7.3 and OpenVPN 2.3.10 to OpenVPN 2.4.1. This problem pops up randomly on random clients. After server reboot it's ok for some time (few hours?). When problem occurs, user get error: "Authenticate/Decrypt? packet error: packet HMAC authentication failed" and on server side there is error: "AEAD Decrypt error: cipher final failed". I am going to downgrade back to OpenVPN 2.3.10 for now.

comment:7 Changed 7 years ago by lfarkas

exactly like the above. not always but randomly...

comment:8 Changed 7 years ago by Gert Döring

There might be two different issue here.

One is the NCP bug on TLS reconnects, fixed today in master and release/2.4 (only relevant on 2.4<->2.4 connections, and when this hits you see a comment in the server log about "server has already generated data channel keys"), see trac #887. This one is only relevant on 2.4.0/2.4.1/2.4.2 servers in combinations with clients that bind to a fixed UDP port (= not using nobind) - and fixed in 2.4.3. Workaround is nobind (on the client) or ncp-disable (either side).

The other one is much more rare, and since it can happen 2.3<->2.3, downgrading isn't going to help with that one.

Note: See TracTickets for help on using tickets.