ktls / af_ktls
Linux Kernel TLS/DTLS Module
License: GNU General Public License v2.0
tls_rx_async_work acquires tsk->rx_lock, then tries to lock_sock() twice (for both tsk and the underlying socket). lock_sock() can sleep.
Currently tls_sendmsg does a size check:
if (size > KTLS_MAX_PAYLOAD_SIZE) { ret = -E2BIG; goto send_end; }
For tcp, we should be sending KTLS_MAX_PAYLOAD_SIZE bytes, and returning the number of bytes sent, or providing a socket buffer somehow.
I think this is probably OK for UDP/DTLS though.
The man page says it should be EMSGSIZE:
If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted.
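A minimal user-space sketch of the behaviour argued for above: a stream (TCP/TLS) send is clamped to one record and reported as a short write, while a datagram (UDP/DTLS) send that is too long is rejected with EMSGSIZE. The helper name `ktls_clamp_send` and the 16 KiB value used for `KTLS_MAX_PAYLOAD_SIZE` are assumptions for illustration, not the module's code.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed value: the TLS maximum plaintext fragment size (2^14 bytes). */
#define KTLS_MAX_PAYLOAD_SIZE 16384

/*
 * Hypothetical helper: decide how many bytes one send call may consume.
 * Stores the byte count in *to_send and returns 0, or returns -EMSGSIZE
 * when a datagram would not fit into a single record.
 */
static int ktls_clamp_send(size_t size, bool is_stream, size_t *to_send)
{
    assert(to_send != NULL);

    if (size <= KTLS_MAX_PAYLOAD_SIZE) {
        *to_send = size;
        return 0;
    }
    if (is_stream) {
        /* Stream: send one full record, caller sees a short write. */
        *to_send = KTLS_MAX_PAYLOAD_SIZE;
        return 0;
    }
    /* Datagram: the message cannot be sent atomically. */
    return -EMSGSIZE;
}
```

With this shape, a TCP caller that passes 20000 bytes would see 16384 returned and retry with the rest, which is the usual contract for stream sockets.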
Nonblocking socket support in sendmsg(2) (requested with the MSG_DONTWAIT flag) is still missing. It will need a big reorganization of the sources, so before the work starts, here is my proposed idea. I am open to discussion.
I don't see any differences between UDP and TCP (DTLS/TLS) here, so the implementation could possibly be shared.
Since we dropped the page preallocation optimization (it gave no results that justified keeping it), it should be easier to handle MSG_DONTWAIT as well. Each time tls_sendmsg() or tls_sendpage() is called, we will allocate pages for the TLS/DTLS record and we will not reuse them.
I think it is reasonable to spawn a kernel worker each time a nonblocking sendmsg(2) is called and then return to user space. Each socket will have its own sending worker. We will have to copy the supplied buffer to a queue before returning, though. The pseudocode for tls_sendmsg() (the same applies to tls_sendpage(), but with a page instead of buf; payload handling is not covered here):
plaintext_pages = allocate_pages()
copy_to_pages(buf, plaintext_pages)
data_queue_push(plaintext_pages)
queue_work()
return;
Here is a pseudocode of a worker:
plaintext_pages = data_queue_pop()
ciphertext_pages = allocate_pages()
encrypt(plaintext_pages, ciphertext_pages)
free_pages(plaintext_pages)
record = prepare_record(ciphertext_pages) // header creation
kernel_sendpage(record)
splice(2) can also be called with SPLICE_F_NONBLOCK, so a similar approach should be introduced for kernel_sendpage(). The ordering between kernel_sendpage() and kernel_sendmsg() should be made explicit by the queue (which has to be guarded): first called, first served (user space should synchronize in special cases).
The queue of (plaintext) pages waiting to be sent can be implemented with the linked list available in the kernel. Note that the core part of the worker pseudocode can be reused even for a blocking send.
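The first-called, first-served queue described above can be sketched in user space as a singly linked FIFO. This is only an illustration under assumptions: the names `tx_queue`, `tx_item`, `tx_queue_push` and `tx_queue_pop` are hypothetical, malloc() stands in for page allocation, and the locking that a kernel implementation would need around push and pop is left out.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <string.h>

/* One queued plaintext buffer (stands in for the plaintext pages). */
struct tx_item {
    struct tx_item *next;
    size_t len;
    unsigned char data[];
};

/* FIFO of pending sends; a real implementation must guard it with a lock. */
struct tx_queue {
    struct tx_item *head;
    struct tx_item *tail;
};

/* Copy the caller's buffer and append it, as the sendmsg path would. */
static int tx_queue_push(struct tx_queue *q, const void *buf, size_t len)
{
    struct tx_item *it;

    assert(q != NULL && (buf != NULL || len == 0));
    it = malloc(sizeof(*it) + len);
    if (it == NULL)
        return -ENOMEM;
    it->next = NULL;
    it->len = len;
    if (len > 0)
        memcpy(it->data, buf, len);
    if (q->tail != NULL)
        q->tail->next = it;
    else
        q->head = it;
    q->tail = it;
    return 0;
}

/* Remove the oldest item, as the worker would; the caller frees it. */
static struct tx_item *tx_queue_pop(struct tx_queue *q)
{
    struct tx_item *it;

    assert(q != NULL);
    it = q->head;
    if (it == NULL)
        return NULL;
    q->head = it->next;
    if (q->head == NULL)
        q->tail = NULL;
    it->next = NULL;
    return it;
}
```

Because both the sendmsg and the sendpage paths would push into the same queue, the order in which records reach the wire follows the order of the calls.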
Any suggestions?
related: https://github.com/fridex/af_ktls/issues/4 https://github.com/fridex/af_ktls/issues/22
Currently only AES-GCM is supported from the TLS 1.2 ciphersuites. RFC 7905 defines a newer ciphersuite, chacha20-poly1305, which is used in several places where AES-GCM performance is unsatisfactory. It would be good for af_ktls to support the chacha20-poly1305 ciphersuite.
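One detail that differs from AES-GCM is how the per-record nonce is built. As far as I read RFC 7905, the 64-bit record sequence number is written big-endian, padded on the left with four zero bytes, and XORed with the 12-byte write IV. The sketch below shows only that step; the function name `chacha_record_nonce` is hypothetical and nothing here is taken from af_ktls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CHACHA_NONCE_LEN 12

/*
 * Build the per-record nonce: write IV XOR (0x00000000 || seq), with the
 * sequence number serialized big-endian.
 */
static void chacha_record_nonce(const uint8_t iv[CHACHA_NONCE_LEN],
                                uint64_t seq,
                                uint8_t nonce[CHACHA_NONCE_LEN])
{
    size_t i;

    assert(iv != NULL && nonce != NULL);

    /* The four padding bytes are zero, so the IV passes through. */
    for (i = 0; i < 4; i++)
        nonce[i] = iv[i];
    /* The remaining eight bytes are XORed with the sequence number. */
    for (i = 0; i < 8; i++)
        nonce[4 + i] = iv[4 + i] ^ (uint8_t)(seq >> (56 - 8 * i));
}
```

Unlike the AES-GCM suites, no explicit nonce travels in the record, so both sides must derive it from their own sequence counter.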
The asynchronous cache currently stores data only if decryption and record disassembly went well. If there is an error, it is not reflected in the asynchronous cache. As a result, the asynchronous worker tries to decrypt and disassemble a record every time a record is received, even when there is a bad record at the head of the receive queue.
This should be avoided by propagating the error into the asynchronous cache (e.g. a negative number signals an error, a positive number is the number of bytes in the cache, and zero means the cache is free).
This should be OK from a consistency point of view, since the asynchronous cache gets invalidated every time there is a setsockopt(2) call.
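The proposed encoding can be sketched as a single signed field. The names `async_cache`, `cache_store`, `cache_needs_work` and `cache_invalidate` are hypothetical and only illustrate the negative/zero/positive convention suggested above.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * state < 0: negative errno from the last decrypt/disassemble attempt
 * state = 0: cache is free, the worker may process the next record
 * state > 0: number of decrypted bytes waiting in the cache
 */
struct async_cache {
    int state;
};

/* Record the outcome of one worker run (byte count or negative errno). */
static void cache_store(struct async_cache *c, int result)
{
    assert(c != NULL);
    c->state = result;
}

/* The worker only needs to run when the cache holds neither data nor error. */
static bool cache_needs_work(const struct async_cache *c)
{
    assert(c != NULL);
    return c->state == 0;
}

/* Called from the setsockopt path: drop cached data and cached errors. */
static void cache_invalidate(struct async_cache *c)
{
    assert(c != NULL);
    c->state = 0;
}
```

With a cached error in place, a new data-ready notification no longer triggers another decryption attempt on the same bad record; the error is handed to the next reader instead.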
SSIA
Seen this a couple times, haven't nailed it down yet though
Console initialized - press Ctrl-D for menu [45/96761]
[ 282.613149] tls: --> tls_init
[ 307.655912] tls: --> tls_exit
[ 307.664466] ------------[ cut here ]------------
[ 307.671461] general protection fault: 0000 [#1] SMP
[ 307.672294] Last file read FILE* = ffff8823b7b30800
[ 307.673186] Modules linked in: af_ktls(O-) decnet tcp_diag inet_diag ip6table_filter xt_NFLOG xt_comment iptable_filter netconsole autofs4 hwmon_vid w83795
i2c_piix4 rpcsec_gss_krb5 auth_rpcgss oid_registry dm_mod loop sg serio_raw iTCO_wdt iTCO_vendor_support e1000e ipmi_devintf x86_pkg_temp_thermal coretemp kv
m irqbypass crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr i2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd ipmi_s
i ipmi_msghandler shpchp button
[ 307.681160] CPU: 13 PID: 7393 Comm: rmmod Tainted: G O 4.6.0-rc6_00054_g5294e32 #117
[ 307.682649] Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B09 05/22/2014
[ 307.683901] task: ffff8823e65daa00 ti: ffff8811d4628000 task.ti: ffff8811d4628000
[ 307.685257] RIP: 0010:[] [] proto_unregister+0x48/0xe0
[ 307.686644] RSP: 0018:ffff8811d462be88 EFLAGS: 00010282
[ 307.687541] RAX: dead000000000100 RBX: ffffffffa023e000 RCX: 0000000000000000
[ 307.688729] RDX: dead000000000200 RSI: ffff88247f8ccb18 RDI: ffffffff81f00360
[ 307.689970] RBP: ffff8811d462bea8 R08: 00000000fffffffe R09: 0000000000000000
[ 307.691331] R10: 0000000000000005 R11: 0000000000000001 R12: 00007ffe72736270
[ 307.697318] R13: 00007ffe727377c2 R14: 0000000000000000 R15: 0000000000000001
[ 307.705584] FS: 00007f251200c700(0000) GS:ffff88247f8c0000(0000) knlGS:0000000000000000
[ 307.712018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 307.713913] CR2: 00007f2511ba5183 CR3: 00000023e7bbb000 CR4: 00000000001406e0
[ 307.715328] Stack:
[ 307.715686] 0000000000000000 ffff8811d462bed8 ffffffffa023e180 ffffffffa023e180
[ 307.717317] ffff8811d462beb8 ffffffffa023c29f ffff8811d462bf48 ffffffff810daed5
[ 307.718601] 0000000000000000 ffff8824280896e8 00736c746b5f6661 00007f2512014000
[ 307.720006] Call Trace:
[ 307.720421] [] tls_exit+0x2f/0xd90 [af_ktls]
[ 307.721430] [] SyS_delete_module+0x155/0x1a0
[ 307.722416] [] ? vm_munmap+0x5c/0x80
[ 307.723268] [] ? SyS_munmap+0x2c/0x40
[ 307.724151] [] entry_SYSCALL_64_fastpath+0x13/0x8f
[ 307.725309] Code: 8b 83 c0 00 00 00 83 f8 3f 74 0b 89 c0 f0 48 0f b3 05 c5 82 b5 00 48 8b 83 68 01 00 00 48 8b 93 70 01 00 00 48 c7 c7 60 03 f0 81 <48> 89
50 08 48 89 45 e0 48 89 02 48 b8 00 02 00 00 00 00 ad de
[ 307.728639] RIP [] proto_unregister+0x48/0xe0
[ 307.731148] RSP
[ 307.731732] ------------[ cut here ]------------
If the kernel does not support the rfc5288(gcm(aes)) cipher:
grep /proc/crypto -e 'rfc5288(gcm(aes))' -o || \
echo 'rfc5288(gcm(aes)) is not present'
the error in tls_create() is not handled properly.
First!
Return values should be disjoint from those returned by calls into other parts of the kernel. This will let us tell what went wrong (e.g. there was an error in decryption, the record was not a data record (type 0x17), etc.). Maybe they will need to be remapped (e.g. if the crypto API returns the same error as kernel_{recv,send}msg()).
If server sends: abcde, then sends 12345, a client calling recv() with a read size of 10 would get "abcde" instead of "abcde12345"
In tls_sendmsg, if msghdr *msg contains more than one iovec, we only send the first one, assuming it contains all the data.
af_alg_make_sg only copies a single iovec at a time, and we need to inspect the return value to see how much it copied.
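A user-space analogue of the fix: walk every iovec, respect a capacity limit, and report how much was actually consumed so the caller can continue from there. The function name `gather_iovecs` is hypothetical; the kernel code would build a scatterlist rather than copy into a flat buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Copy data from all iovecs into dst, up to cap bytes.
 * Returns the number of bytes copied, which may be less than the total
 * length of the iovecs when cap is reached.
 */
static size_t gather_iovecs(const struct iovec *iov, int iovcnt,
                            unsigned char *dst, size_t cap)
{
    size_t copied = 0;
    int i;

    assert(iov != NULL && dst != NULL && iovcnt >= 0);

    for (i = 0; i < iovcnt && copied < cap; i++) {
        size_t n = iov[i].iov_len;

        if (n > cap - copied)
            n = cap - copied;   /* partial copy of this iovec */
        if (n > 0)
            memcpy(dst + copied, iov[i].iov_base, n);
        copied += n;
    }
    return copied;
}
```

The important part is the return value: a caller that ignores it and assumes the first iovec held everything reproduces the bug described above.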
OpenConnect protocol support is currently not finished. Instead of hardcoding every protocol or rule, AF_KTLS can be extended with Linux Socket Filtering support. This needs inspection and suitability study.
The current implementation uses kernel_sendmsg() for sending records. This function copies from the passed vector, so it would be nice to avoid it.
kernel_sendpage() can be used instead; it operates directly on pages (see [1] for TCP, see [2] for UDP).
[1] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1017
[2] http://lxr.free-electrons.com/source/net/ipv4/udp.c#L1216
Poll isn't currently supported. Related, it looks like recvmsg / sendmsg don't block in blocking mode.
Line 1731 in 7fbe182
alloc_pages allocates 1 << order number of pages. That line would therefore allocate 16 pages.
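The arithmetic behind that remark, as a small sketch. It assumes the call in question passed an order of 4 (which is what the 16-page figure implies); the helper names are hypothetical.

```c
#include <assert.h>

/* Number of pages returned by an allocation of the given order. */
static unsigned int pages_for_order(unsigned int order)
{
    assert(order < 32);
    return 1u << order;
}

/* Smallest order whose allocation covers at least npages pages. */
static unsigned int order_for_pages(unsigned int npages)
{
    unsigned int order = 0;

    assert(npages > 0);
    while (order < 31 && pages_for_order(order) < npages)
        order++;
    return order;
}
```

So a caller that wants 4 pages should pass order 2, not 4. In kernel code, get_order() computes the order from a size in bytes.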
Can you please explain why the rx async work is needed?
Why can't we read the data and decrypt it inside tls_recvmsg from process context?
I'm guessing the concern is that we will deadlock if there is not enough room for 1 record
in the TCP socket. But can't we just enforce SO_RCVBUF > 16KB?
static void increment_seqno(char *s)
{
u64 *seqno = (u64 *) s;
This cast's operation is cpu-specific. It works on intel arch but there are others that cannot cope with it.
If a client makes a syscall like this:
char recv_mem[1000];
recv(fd, recv_mem, TLS_MAX_PAYLOAD_LENGTH, 0);
and the decryption is done straight into user memory (the else{} branch in tls_recvmsg), a kernel fault is triggered:
http://pastebin.com/XjGu0dHx
There might be (and possibly are) use cases where the async cache can become inconsistent because of user space actions on the TCP/UDP socket. This needs to be inspected in depth.
Bound UDP/TCP sockets should be locked while we are doing operations in AF_KTLS, since userspace (and not only userspace) can operate on them in parallel with AF_KTLS, and that could lead to inconsistency; pseudocode:
in AF_KTLS: userspace:
read(sd, ktls_buf, size, MSG_PEEK)
decrypt !!! read(sd, buf2, size, 0)
!!! pop_record(sd)
tls_splice_read acquires locks in the order:
tls_sendpage acquires lock in the order:
Kernel warns about possible deadlock
http://pastebin.com/tYCk0bxy
In practice I don't think this is much of an issue since I don't think you'd ever splice to and from the same pipe.
I like the approach with fixed values (that's a type-safe userspace API), however it may cause upstream opposition. Be prepared to be asked to use directly the "rfc5288(gcm(aes))" string from userspace. There are good arguments to not use it (such as ability to change the internal API, and type safety).
The KTLS_RECV_READY macro is used to check whether the AEAD has been set up for decryption. However, it merely checks that the keys are set, while tls_init_aead is called after the keys are set. An interrupt that checks KTLS_RECV_READY (such as tls_data_ready) and occurs between setting the keys and initializing the AEAD would therefore cause a crash.
3 options:
The current implementation of DTLS sliding window handling behaves correctly only if there are no out-of-order DTLS records. If we receive a record that is not at the beginning of the window, and user space asks for the sequence number, the sequence number can be incorrect. The returned sequence number should correspond to the very first record expected within the sliding window:
An example scenario:
Since the DTLS window is part of the state, there should probably be an appropriate {get,set}sockopt(2) call to expose the sliding window to user space. This would allow user space to sync with the DTLS socket and vice versa. Keep in mind that the sliding window in user space can be of a different size than the one in the kernel.
For recv, aead_request_set_crypt takes the nonce, and make_aad should take the iv_recv sequence number. The spec allows these two values to be the same, and the current implementation assumes they are (and I guess gnutls must also?), but openssl uses different values, so recv doesn't work when receiving data from openssl.
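To make the distinction concrete, here is a sketch of the two inputs for TLS 1.2 AES-GCM as I understand RFC 5288 and RFC 5246: the nonce is the 4-byte salt followed by the 8-byte explicit nonce taken from the record, while the additional data starts with the implicit 64-bit sequence number. The function names are hypothetical and not the module's make_aad.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TLS_GCM_NONCE_LEN 12
#define TLS_AAD_LEN       13

/* Nonce = salt (from the key block) || explicit nonce (from the record). */
static void tls_gcm_nonce(const uint8_t salt[4],
                          const uint8_t explicit_nonce[8],
                          uint8_t nonce[TLS_GCM_NONCE_LEN])
{
    assert(salt != NULL && explicit_nonce != NULL && nonce != NULL);
    memcpy(nonce, salt, 4);
    memcpy(nonce + 4, explicit_nonce, 8);
}

/* AAD = seq_num (8, big-endian) || type (1) || version (2) || length (2). */
static void tls_make_aad(uint64_t seq, uint8_t type, uint16_t version,
                         uint16_t plaintext_len, uint8_t aad[TLS_AAD_LEN])
{
    int i;

    assert(aad != NULL);
    for (i = 0; i < 8; i++)
        aad[i] = (uint8_t)(seq >> (56 - 8 * i));
    aad[8] = type;
    aad[9] = (uint8_t)(version >> 8);
    aad[10] = (uint8_t)(version & 0xff);
    aad[11] = (uint8_t)(plaintext_len >> 8);
    aad[12] = (uint8_t)(plaintext_len & 0xff);
}
```

On receive, the explicit nonce has to be read from the incoming record and the sequence number has to come from the local counter; treating them as one value only works with peers that happen to send the sequence number as the explicit nonce.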
When bind(2) is called more than once on an AF_KTLS socket instance, the previously bound socket is not correctly freed. We should either allow bind(2) to be called only once per AF_KTLS instance or free the previously bound socket.
The current implementation uses kernel_recvmsg() for receiving records. This function copies from the skbuff to the passed vector (see [1] for TCP, see [2] for UDP), so it would be nice to avoid it.
A candidate replacement is tcp_read_sock(). Unfortunately the implementation of tcp_read_sock() does not support peeking (see [3]), which is necessary according to the current AF_KTLS design.
EDIT: we could consider operating directly on the skbuff.
[1] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1830
[2] http://lxr.free-electrons.com/source/net/ipv4/udp.c#L1392
[3] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1485
Review flags that can be supplied (e.g. MSG_DONTWAIT, MSG_MORE, etc.). Some of them are not useful for TLS/DTLS, but some are. Some ideas:
sendmsg(2) can be implemented with a worker
kernel_recvmsg() rx queue)
If CONFIG_CRYPTO_USER_API=n, there are a bunch of errors building, because we are depending on af_alg things and we shouldn't be:
WARNING: "af_alg_make_sg" [/home/davejwatson/local/af_ktls/af_ktls.ko] undefined!
WARNING: "af_alg_wait_for_completion" [/home/davejwatson/local/af_ktls/af_ktls.ko] undefined!
SSIA
Currently there is one worker for the whole module, called "ktls". It would be worth considering adding a worker for every socket instance, e.g. "ktls-<PID>/id" or so.
SSIA
This one looks like a race between receiving a new message, and close() locally. You can kinda see it in the interleaved logs
[ 372.859537] tls: --> tls_rx_async_work
[ 372.860805] tls: --> tls_peek_data
[ 373.860974] tls: [ 373.861021] tls: --> tls_release
[ 373.861023] tls: --> tls_free_sendpage_ctx
[ 373.861024] tls: --> tls_sock_destruct
[ 373.861026] tls: parallel executions: 2
[ 373.867009] --> tls_data_ready
[ 373.868162] tls: --> tls_rx_async_work
[ 373.870335] tls: --> tls_peek_data
[ 373.871089] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[ 373.873204] IP: [] sock_recvmsg+0x45/0x60
[ 373.874106] PGD 0
[ 373.874430] ------------[ cut here ]------------
[ 373.875749] Oops: 0000 [#1] SMP
[ 373.876466] Last file read FILE* = ffff8823f8401b00
[ 373.887355] Modules linked in: sha256_generic drbg af_ktls(O) tcp_diag inet_diag ip6table_filter xt_NFLOG xt_comment iptable_filter netconsole autofs4 hwmon_vid w83795 i2c_piix4 rp
csec_gss_krb5 auth_rpcgss oid_registry dm_mod loop sg serio_raw iTCO_wdt iTCO_vendor_support e1000e ipmi_devintf x86_pkg_temp_thermal coretemp kvm irqbypass crc32c_intel aesni_intel a
blk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr i2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd ipmi_si ipmi_msghandler shpchp button
[ 373.896046] CPU: 2 PID: 333 Comm: kworker/2:1 Tainted: G O 4.6.0-rc6_00054_g5294e32 #117
[ 373.897631] Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B09 05/22/2014
[ 373.898850] Workqueue: ktls tls_rx_async_work [af_ktls]
[ 373.899694] task: ffff881228b96200 ti: ffff881228470000 task.ti: ffff881228470000
[ 373.901004] RIP: 0010:[] [] sock_recvmsg+0x45/0x60
[ 373.903504] RSP: 0018:ffff881228473a88 EFLAGS: 00010246
[ 373.904457] RAX: 0000000000000000 RBX: ffff8811fd726400 RCX: 0000000000000042
[ 373.912814] RDX: 000000000000401d RSI: ffff881228473b28 RDI: ffff8811fd726400
[ 373.913964] RBP: ffff881228473aa8 R08: 000000000000401d R09: 0000000000000042
[ 373.915112] R10: 000000000000003f R11: 0000000000000259 R12: ffff881228473b28
[ 373.916352] R13: 000000000000401d R14: 0000000000000042 R15: 0000000000000042
[ 373.917616] FS: 0000000000000000(0000) GS:ffff881237880000(0000) knlGS:0000000000000000
[ 373.920577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 373.921502] CR2: 0000000000000090 CR3: 0000000001e06000 CR4: 00000000001406e0
[ 373.922753] Stack:
[ 373.923089] ffff881228474000 ffff881228473b28 000000000000401d ffffffffffffffff
[ 373.924293] ffff881228473af8 ffffffff817de7b9 ffff881228473b58 ffff8811fd726400
[ 373.925505] 0000000000000001 ffff881225fcd000 ffff881225fcd94b 0000000000000042
[ 373.926700] Call Trace:
[ 373.927212] [] kernel_recvmsg+0x69/0x90
[ 373.928145] [] tls_peek_data+0xab/0x200 [af_ktls]
[ 373.930195] [] tls_rx_async_work+0x113/0x1b0 [af_ktls]
[ 373.931298] [] process_one_work+0x16c/0x4f0
[ 373.937924] [] ? __schedule+0x36a/0x9d0
[ 373.946157] [] ? schedule+0x40/0xb0
Met with a couple netdev kernel guys (therbert, hannes, alexei) to discuss buffer management in ktls. Consensus was roughly:
Preallocated send/recv buffers per-socket aren't going to work, would be too much memory for large numbers of sockets. Need to allocate them lazily as we need to decode/encode. Alternatively, if we aren't pre-emptable, they could possibly be preallocated per-cpu as an optimization.
When the tcp socket has data ready, it needs to be pulled off the socket and stored in ktls. Otherwise a tcp socket with a receive buffer size of 4k that is being sent a larger TLS message would not make forward progress. There are also other reasons this needs to happen, like out of order receives in tcp interacting in interesting ways with the receive buffer size, which are also fixed by moving the sk_buffs to being owned by ktls instead. KCM already does a version of this.
Output buffers should probably be stored as sk_buffs on the socket, and not scattergather lists. This would facilitate passing ktls sockets to other kernel things (KCM), as well as supporting multiple message decode by respecting receive buffer sizes on ktls easily.
In tls_bind, on the failure path for bind_end, we need to sockfd_put() the tsk->socket, since we're not using it anymore.
Multiple setsockopt(2) calls are needed to pass key material to the kernel. It would be nice to consider introducing a single setsockopt(2) call that passes all the needed key material at once, to reduce the number of context switches.
E.g.:
struct ktls_conf cnf;
// session_* is obtained material from GnuTLS/OpenSSL
cnf.salt_send = session_salt_send;
cnf.salt_recv = session_salt_recv;
cnf.key_send = session_key_send;
cnf.key_recv = session_key_recv;
cnf.iv_send = session_iv_send;
cnf.iv_recv = session_iv_recv;
setsockopt(ksd, AF_KTLS, KTLS_SET_CONF, &cnf, sizeof(cnf));
EDIT: ... and skip NULL fields in struct ktls_conf in the kernel.
include/linux/socket.h should cover the AF_KTLS socket. For now, you have to choose an unused protocol family in order to do insmod.
Using --sendfile in the AF_KTLS tool with a file larger than 2 GB causes an overflow. This should be an easy fix.
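The usual cause of this kind of overflow is a file size or offset kept in a 32-bit int. A sketch of the bookkeeping with 64-bit counters is below; the helper name `next_chunk` is hypothetical and the tool's actual variables are not known here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Size of the next chunk to hand to a send call, given 64-bit totals.
 * Never returns more than max_chunk and never more than what is left.
 */
static size_t next_chunk(uint64_t total, uint64_t sent, size_t max_chunk)
{
    uint64_t left;

    assert(sent <= total);
    left = total - sent;
    return left < max_chunk ? (size_t)left : max_chunk;
}
```

A send loop then advances `sent` by the number of bytes each call reports until it reaches `total`. Note that sendfile(2) on Linux transfers at most 0x7ffff000 bytes per call, so a loop is needed for large files in any case.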
While in af_ktls-tool you have a testing tool, it may be better to automate the test suite on a simple make check command, that includes the unit tests, as well as stress test the subsystem.
I wrote up a framework for what I think could be a good foundation for testing the AF_KTLS module. Check it out here.
https://github.com/lancerchao/af_ktls-test
Based on https://github.com/fridex/af_ktls/pull/28 I am open to discussion whether there can be only one peek or two peeks per kernel_recvmsg() and splice_read.
We should distinguish stream (TCP) and datagram (UDP) protocols here, as was discussed in https://github.com/fridex/af_ktls/pull/28. This should be a full list of scenarios that can happen:
Can we peek only once (not a separate peek for header and then for the record)?
Nevertheless, the peek is not a big deal (as discussed in #28).
Any suggestions?
related: https://github.com/fridex/af_ktls/pull/28 https://github.com/fridex/af_ktls/issues/21
The crypto API expects data to be contiguous in memory. This means that even though it supports a scatter/gather buffer interface, under the covers it does a copy to make everything contiguous. This makes sense in some ways: the AESNI routines need data aligned on certain byte boundaries to be most efficient.
For af_ktls however, the header aad data and hash are currently never contiguous. We should either make all the af_ktls data contiguous if possible, or modify the crypto API to accept portions of data that aren't contiguous where it doesn't matter.
Attached was my work in progress diff to modify the crypto API to avoid the copies if possible.
At: https://github.com/fridex/af_ktls/blob/master/af_ktls.c#L1046
you do not check whether the data received are sufficient before starting reading the header.
It seems that the license of this module is GPLv3. That means any attempt to get it upstream will require re-licensing it to GPLv2-only or GPLv2+.
We need a way to "peek" data from tcp socket using tcp_read_sock (which is stated in the comments that it is currently not supported). If for whatever reason we decide the packet is bad during the decryption stage, we can't let userspace SSL handle the packet since at that point it has already been pulled from TCP's receive queue.
http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1490
Related #37
Overall the API is very neat. However, currently AF_KTLS in socket() may mean either TLS or DTLS depending on whether TCP or UDP is used. While that makes sense for TCP and (mostly) UDP, there are other protocols such as SCTP that can run either DTLS or TLS. Would it make sense to select the actual protocol (DTLS vs TLS) using the 'protocol' parameter in TLS create?
That would currently break the OPENCONNECT support, but if we plan with upstream in mind, it may make sense to take that aside initially and handle more specific protocols at a second level using (possibly) BPF. https://lwn.net/Articles/599755/
SSIA
After chatting with Fridolin and reading RFC 6347, I notice a discrepancy between the specs and MTU handling in KTLS, which I would like to discuss.
Right now in KTLS, if the MTU is 1500 and the user tries to send a message of length 1501 over UDP, only 1500 bytes are transmitted in order to avoid IP fragmentation. In other words, KTLS is actively cutting down the size of records to avoid fragmentation.
However, RFC 6347 says this.
In general, DTLS's philosophy is to leave PMTU discovery to the application.
It's saying that the application using DTLS is responsible for avoiding IP fragmentation, not the DTLS implementation itself.
Additionally, it goes on to say:
However, DTLS cannot completely ignore PMTU for three
reasons:
- The DTLS record framing expands the datagram size, thus lowering
the effective PMTU from the application's perspective.
- In some implementations, the application may not directly talk to
the network, in which case the DTLS stack may absorb ICMP
[RFC1191] "Datagram Too Big" indications or ICMPv6 [RFC4443]
"Packet Too Big" indications.
- The DTLS handshake messages can exceed the PMTU.
As a solution to the above problems, it says DTLS should provide the following:
If PMTU estimates are available from the underlying transport
protocol, they should be made available to upper layer protocols. In
particular:
- For DTLS over UDP, the upper layer protocol SHOULD be allowed to
obtain the PMTU estimate maintained in the IP layer.
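The first reason quoted above (record framing lowers the effective PMTU) can be put into numbers. The sketch assumes IPv4 without options, UDP, DTLS 1.2, and an AES-GCM ciphersuite with an 8-byte explicit nonce and a 16-byte tag; the function name `dtls_max_payload` is hypothetical.

```c
#include <assert.h>
#include <stddef.h>

#define IPV4_HDR_LEN           20  /* assumed: no IP options */
#define UDP_HDR_LEN             8
#define DTLS_RECORD_HDR_LEN    13  /* type, version, epoch, seq, length */
#define GCM_EXPLICIT_NONCE_LEN  8
#define GCM_TAG_LEN            16

/*
 * Largest application payload that fits into one unfragmented datagram
 * for a given path MTU, or 0 if not even the framing fits.
 */
static size_t dtls_max_payload(size_t pmtu)
{
    const size_t overhead = IPV4_HDR_LEN + UDP_HDR_LEN +
                            DTLS_RECORD_HDR_LEN +
                            GCM_EXPLICIT_NONCE_LEN + GCM_TAG_LEN;

    assert(pmtu <= 65535);  /* IPv4 total length is a 16-bit field */
    return pmtu > overhead ? pmtu - overhead : 0;
}
```

Under these assumptions a path MTU of 1500 leaves 1435 bytes for the application, which is the kind of estimate the RFC asks a DTLS layer to make available to the upper layer.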
My recommended changes: