ktls / af_ktls

Linux Kernel TLS/DTLS Module

License: GNU General Public License v2.0

Makefile 1.17% C 98.83%

dtls kernel socket tls linux linux-kernel-tls linux-kernel-module

af_ktls's Issues

Sleeping with lock held

tls_rx_async_work acquires tsk->rx_lock, then calls lock_sock() twice (once for tsk and once for the underlying socket). lock_sock() can sleep.

tls_sendmsg should send partial data for TCP

Currently tls_sendmsg does a size check:

if (size > KTLS_MAX_PAYLOAD_SIZE) {
    ret = -E2BIG;
    goto send_end;
}

For TCP, we should be sending KTLS_MAX_PAYLOAD_SIZE bytes and returning the number of bytes sent, or providing a socket buffer somehow.

I think this is probably OK for UDP/DTLS though.

The man page says it should be EMSGSIZE:

If the message is too long to pass atomically through the underlying protocol, the error EMSGSIZE is returned, and the message is not transmitted.
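The behaviour asked for here can be modelled in user space C. This is only a sketch of the size check, not the module's code: `ktls_clamp_send_size()` and its `is_stream` parameter are hypothetical names, and `KTLS_MAX_PAYLOAD_SIZE` is assumed to be 1 << 14, the value quoted for TLS_MAX_PAYLOAD_SIZE in another issue on this page.

```c
#include <errno.h>
#include <stddef.h>
#include <sys/types.h>

/* Assumed value: 2^14 bytes, the TLS record payload limit. */
#define KTLS_MAX_PAYLOAD_SIZE ((size_t)1 << 14)

/*
 * Hypothetical helper: decide how many bytes one send call may take.
 * Stream (TCP) sockets accept a partial write; datagram (UDP) sockets
 * cannot split a message, so they fail with -EMSGSIZE.
 */
static ssize_t ktls_clamp_send_size(size_t size, int is_stream)
{
	if (size <= KTLS_MAX_PAYLOAD_SIZE)
		return (ssize_t)size;
	if (is_stream)
		return (ssize_t)KTLS_MAX_PAYLOAD_SIZE;
	return -EMSGSIZE;
}
```

With this shape, a TCP caller that passes 20000 bytes would see a return value of 16384 and call again for the rest, as with any short write.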

Design MSG_DONTWAIT for AF_KTLS

Nonblocking support in sendmsg(2) via the MSG_DONTWAIT flag is still missing. Adding it will require a big reorganization of the sources, so before starting the work, here is my proposal; I am open to discussion.

I don't see any differences between UDP and TCP (DTLS/TLS) here, so the implementation could possibly be shared.

Since we are dropping the page preallocation optimization (the results did not justify keeping it), MSG_DONTWAIT should be easier to handle as well. Each time tls_sendmsg() or tls_sendpage() is called, we will allocate pages for the TLS/DTLS record and will not reuse them.

I think it is reasonable to queue work for a kernel worker each time a nonblocking sendmsg(2) is called, and then return to user space. Each socket will have its own sending worker. We will have to copy the supplied buffer to a queue before returning, though. The pseudocode for tls_sendmsg() (the same applies to tls_sendpage(), but with a page instead of buf; payload handling is not covered):

plaintext_pages = allocate_pages()
copy_to_pages(buf, plaintext_pages)
data_queue_push(plaintext_pages)
queue_work()
return;

Here is pseudocode for the worker:

plaintext_pages = data_queue_pop()
ciphertext_pages = allocate_pages()
encrypt(plaintext_pages, ciphertext_pages)
free_pages(plaintext_pages)
record = prepare_record(ciphertext_pages) // header creation
kernel_sendpage(record)

splice(2) can also be called with SPLICE_F_NONBLOCK, so a similar approach should be introduced for kernel_sendpage(). The ordering between kernel_sendpage() and kernel_sendmsg() should be made explicit by the queue (which must be guarded): first called, first served (user space should synchronize in special cases).

The queue of (plaintext) pages waiting to be sent can be implemented with the linked list available in the kernel. Note that the core part of the worker pseudocode can be reused even for a blocking send.
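The first-called, first-served queue can be sketched in user space C. This is a model under assumptions, not kernel code: `malloc()` stands in for page allocation, the names `tx_item`, `tx_queue`, `tx_queue_push()` and `tx_queue_pop()` are hypothetical, and the locking that must guard the queue is left out.

```c
#include <stdlib.h>
#include <string.h>

/* One queued plaintext buffer (stands in for the plaintext pages). */
struct tx_item {
	struct tx_item *next;
	size_t len;
	unsigned char data[];
};

/* FIFO: push at the tail, pop from the head. */
struct tx_queue {
	struct tx_item *head;
	struct tx_item *tail;
};

/* Copy the caller's buffer so sendmsg() can return immediately. */
static int tx_queue_push(struct tx_queue *q, const void *buf, size_t len)
{
	struct tx_item *it = malloc(sizeof(*it) + len);

	if (!it)
		return -1;
	it->next = NULL;
	it->len = len;
	memcpy(it->data, buf, len);
	if (q->tail)
		q->tail->next = it;
	else
		q->head = it;
	q->tail = it;
	return 0;
}

/* Called by the worker; the caller owns and frees the returned item. */
static struct tx_item *tx_queue_pop(struct tx_queue *q)
{
	struct tx_item *it = q->head;

	if (!it)
		return NULL;
	q->head = it->next;
	if (!q->head)
		q->tail = NULL;
	it->next = NULL;
	return it;
}
```

In the kernel, the same shape would typically be built on `struct list_head` with a lock held around push and pop.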

Any suggestions?

new ciphersuite: chacha20-poly1305

Currently, AES-GCM is the only TLS 1.2 ciphersuite supported. RFC 7905 defines a newer ciphersuite, chacha20-poly1305, which is used in several places where AES-GCM performance is unsatisfactory. It would be good for af_ktls to support the chacha20-poly1305 ciphersuite.

Asynchronous cache should propagate error

The asynchronous cache currently stores data only if decryption and record disassembly went well. If there is an error, the error is not reflected in the asynchronous cache. As a result, the asynchronous worker tries to decrypt and disassemble a record every time a record is received, even when there is a bad record at the head of the receive queue.

This should be avoided by propagating the error through the asynchronous cache (e.g. a negative number signals an error, a positive number is the number of bytes in the cache, and zero means the cache is free).

This should be OK from consistency POV since the asynchronous cache gets invalidated every time there is a setsockopt(2) call.
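The three-state encoding proposed above can be sketched as follows. This is a user space model, not the module's cache: `struct async_cache` and the helper names are hypothetical, and `-EBADMSG` is just an example error code.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical cache descriptor: one signed field carries all states. */
struct async_cache {
	long state; /* < 0: -errno, 0: free, > 0: number of cached bytes */
};

/* Record the outcome of a decrypt attempt (byte count or -errno). */
static void async_cache_store(struct async_cache *c, long result)
{
	c->state = result;
}

/* The worker only needs to run when the cache is free. */
static bool async_worker_should_run(const struct async_cache *c)
{
	return c->state == 0;
}

/* Invalidate the cache, e.g. on setsockopt(2); this also drops errors. */
static void async_cache_invalidate(struct async_cache *c)
{
	c->state = 0;
}
```

A cached error then stops the worker from retrying the same bad record until the cache is invalidated.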

Review socket locking

SSIA

rmmod crashes kernel

Seen this a couple times, haven't nailed it down yet though

Console initialized - press Ctrl-D for menu [45/96761]
[ 282.613149] tls: --> tls_init
[ 307.655912] tls: --> tls_exit
[ 307.664466] ------------[ cut here ]------------
[ 307.671461] general protection fault: 0000 [#1] SMP
[ 307.672294] Last file read FILE* = ffff8823b7b30800
[ 307.673186] Modules linked in: af_ktls(O-) decnet tcp_diag inet_diag ip6table_filter xt_NFLOG xt_comment iptable_filter netconsole autofs4 hwmon_vid w83795 i2c_piix4 rpcsec_gss_krb5 auth_rpcgss oid_registry dm_mod loop sg serio_raw iTCO_wdt iTCO_vendor_support e1000e ipmi_devintf x86_pkg_temp_thermal coretemp kvm irqbypass crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr i2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd ipmi_si ipmi_msghandler shpchp button
[ 307.681160] CPU: 13 PID: 7393 Comm: rmmod Tainted: G O 4.6.0-rc6_00054_g5294e32 #117
[ 307.682649] Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B09 05/22/2014
[ 307.683901] task: ffff8823e65daa00 ti: ffff8811d4628000 task.ti: ffff8811d4628000
[ 307.685257] RIP: 0010:[] [] proto_unregister+0x48/0xe0
[ 307.686644] RSP: 0018:ffff8811d462be88 EFLAGS: 00010282
[ 307.687541] RAX: dead000000000100 RBX: ffffffffa023e000 RCX: 0000000000000000
[ 307.688729] RDX: dead000000000200 RSI: ffff88247f8ccb18 RDI: ffffffff81f00360
[ 307.689970] RBP: ffff8811d462bea8 R08: 00000000fffffffe R09: 0000000000000000
[ 307.691331] R10: 0000000000000005 R11: 0000000000000001 R12: 00007ffe72736270
[ 307.697318] R13: 00007ffe727377c2 R14: 0000000000000000 R15: 0000000000000001
[ 307.705584] FS: 00007f251200c700(0000) GS:ffff88247f8c0000(0000) knlGS:0000000000000000
[ 307.712018] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 307.713913] CR2: 00007f2511ba5183 CR3: 00000023e7bbb000 CR4: 00000000001406e0
[ 307.715328] Stack:
[ 307.715686] 0000000000000000 ffff8811d462bed8 ffffffffa023e180 ffffffffa023e180
[ 307.717317] ffff8811d462beb8 ffffffffa023c29f ffff8811d462bf48 ffffffff810daed5
[ 307.718601] 0000000000000000 ffff8824280896e8 00736c746b5f6661 00007f2512014000
[ 307.720006] Call Trace:
[ 307.720421] [] tls_exit+0x2f/0xd90 [af_ktls]
[ 307.721430] [] SyS_delete_module+0x155/0x1a0
[ 307.722416] [] ? vm_munmap+0x5c/0x80
[ 307.723268] [] ? SyS_munmap+0x2c/0x40
[ 307.724151] [] entry_SYSCALL_64_fastpath+0x13/0x8f
[ 307.725309] Code: 8b 83 c0 00 00 00 83 f8 3f 74 0b 89 c0 f0 48 0f b3 05 c5 82 b5 00 48 8b 83 68 01 00 00 48 8b 93 70 01 00 00 48 c7 c7 60 03 f0 81 <48> 89 50 08 48 89 45 e0 48 89 02 48 b8 00 02 00 00 00 00 ad de
[ 307.728639] RIP [] proto_unregister+0x48/0xe0
[ 307.731148] RSP
[ 307.731732] ------------[ cut here ]------------

Error when a cipher is not present

If the kernel does not support the rfc5288(gcm(aes)) cipher:

grep /proc/crypto -e 'rfc5288(gcm(aes))' -o || \
    echo 'rfc5288(gcm(aes)) is not present'

the error in tls_create() is not handled properly

First issue!

First!

Adapt return values

Return values should be disjoint from those returned by calls into other parts of the kernel. This would let us tell what went wrong (e.g. a decryption error, a record that is not a data record (type other than 0x17), etc.). They may need to be remapped (e.g. if the crypto API returns the same error as kernel_{recv,send}msg()).

Copy data from multiple TLS records to userspace buffer

If the server sends abcde and then sends 12345, a client calling recv() with a read size of 10 gets "abcde" instead of "abcde12345".
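The expected behaviour can be modelled in user space C: keep filling the caller's buffer from consecutive decrypted records, and remember how much of a partially consumed record is left. `struct rx_record` and `recv_from_records()` are hypothetical names for this sketch and do not come from the module.

```c
#include <stddef.h>
#include <string.h>

/* A decrypted record plus the offset of the first unread byte. */
struct rx_record {
	const unsigned char *data;
	size_t len;
	size_t off;
};

/*
 * Copy up to `size` bytes from the queued records into `buf`,
 * crossing record boundaries. Returns the number of bytes copied.
 */
static size_t recv_from_records(struct rx_record *recs, size_t nrecs,
				unsigned char *buf, size_t size)
{
	size_t copied = 0;

	for (size_t i = 0; i < nrecs && copied < size; i++) {
		size_t avail = recs[i].len - recs[i].off;
		size_t n = avail < size - copied ? avail : size - copied;

		memcpy(buf + copied, recs[i].data + recs[i].off, n);
		recs[i].off += n;
		copied += n;
	}
	return copied;
}
```

With the two records from the example queued, a read of size 10 would return all ten bytes, and a read of size 7 would leave the last three bytes of the second record for the next call.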

sendmsg doesn't work with multiple iovecs

In tls_sendmsg, if msghdr *msg contains more than one iovec, we only send the first one, assuming it contains all the data.

af_alg_make_sg only copies a single iovec at a time, and we need to inspect the return value to see how much it copied.
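A user space sketch of the fix, assuming the goal is to walk every iovec and track how many bytes were actually taken. `gather_iovecs()` is a hypothetical helper; in the module the per-iovec step would be the af_alg_make_sg call and its return value.

```c
#include <stddef.h>
#include <string.h>
#include <sys/uio.h>

/*
 * Gather the contents of all iovecs into `dst`, stopping when `dst`
 * is full. Returns the total number of bytes consumed, which the
 * caller must check instead of assuming one iovec holds everything.
 */
static size_t gather_iovecs(const struct iovec *iov, size_t iovcnt,
			    unsigned char *dst, size_t dst_size)
{
	size_t total = 0;

	for (size_t i = 0; i < iovcnt && total < dst_size; i++) {
		size_t n = iov[i].iov_len;

		if (n > dst_size - total)
			n = dst_size - total;
		memcpy(dst + total, iov[i].iov_base, n);
		total += n;
	}
	return total;
}
```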

OpenConnect protocol support

OpenConnect protocol support is currently not finished. Instead of hardcoding every protocol or rule, AF_KTLS can be extended with Linux Socket Filtering support. This needs inspection and suitability study.

Avoid kernel_sendmsg()

The current implementation uses kernel_sendmsg() for sending records. This function copies from the passed vector, so it would be nice to avoid it.

kernel_sendpage() could be used instead; it operates directly on pages (see [1] for TCP, see [2] for UDP).

[1] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1017
[2] http://lxr.free-electrons.com/source/net/ipv4/udp.c#L1216

support poll

Poll isn't currently supported. Related, it looks like recvmsg / sendmsg don't block in blocking mode.

Allocating too many pages

af_ktls/af_ktls.c

Line 1731 in 7fbe182

tsk->pages_send = alloc_pages(GFP_KERNEL, KTLS_DATA_PAGES);

alloc_pages allocates 1 << order number of pages. That line would therefore allocate 16 pages.

Ref:
http://www.makelinux.net/books/lkd2/ch11lev1sec3
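The arithmetic can be checked with a small user space sketch. Both helpers are hypothetical; `order_for_pages()` only mirrors the idea of rounding a page count up to a power-of-two order. The value 4 for KTLS_DATA_PAGES is an assumption inferred from the 16-page figure above.

```c
/* Number of pages an allocation of the given order covers. */
static unsigned int pages_for_order(unsigned int order)
{
	return 1u << order;
}

/* Smallest order whose allocation holds at least nr_pages pages. */
static unsigned int order_for_pages(unsigned int nr_pages)
{
	unsigned int order = 0;

	while (order < 31 && (1u << order) < nr_pages)
		order++;
	return order;
}
```

If KTLS_DATA_PAGES is meant as a page count of 4, the order to pass would be 2, not 4.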

Remove rx async work

Can you please explain why the rx async work is needed?
Why can't we read the data and decrypt it inside tls_recvmsg from process context?

I'm guessing the concern is that we will deadlock if there is not enough room for 1 record
in the TCP socket. But can't we just enforce SO_RCVBUF > 16KB?

possibly dangerous cast

static void increment_seqno(char *s)
{
    u64 *seqno = (u64 *) s;

The behavior of this cast is CPU-specific. It works on Intel architectures, but there are others that cannot cope with it.
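One portable alternative is to increment the counter byte by byte, which needs neither an aligned pointer nor a particular host byte order. This sketch assumes the sequence number is stored as an 8-byte big-endian counter, as in the TLS record layer; `seqno_increment_be()` is a hypothetical name, not the module's function.

```c
#include <stddef.h>

#define SEQNO_LEN 8

/* Increment an 8-byte big-endian counter in place, with wraparound. */
static void seqno_increment_be(unsigned char *s)
{
	for (size_t i = SEQNO_LEN; i-- > 0; ) {
		/* Stop as soon as a byte does not wrap to zero. */
		if (++s[i] != 0)
			break;
	}
}
```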

kernel fault when decrypting to user buffer

If a client makes a syscall like this:

char recv_mem[1000];
recv(fd, recv_mem, TLS_MAX_PAYLOAD_LENGTH, 0);

and the decryption is done straight to user memory (the else {} branch in tls_recvmsg), a kernel fault is triggered:
http://pastebin.com/XjGu0dHx

Review async cache handling

There may be (and possibly are) use cases in which the async cache becomes inconsistent because of user space actions on the TCP/UDP socket. This needs to be inspected closely.

Fix indentation to conform to Linux kernel requirements

SSIA, http://kernelnewbies.org/FirstKernelPatch

Documentation: no parallel socket operations on AF_KTLS and bound socket are possible without explicit synchronization

Bound UDP/TCP sockets should be locked while AF_KTLS operates on them, since user space (among others) can operate on them in parallel with AF_KTLS, which could lead to inconsistency. Pseudocode:

in AF_KTLS:                                   userspace:
read(sd, ktls_buf, size, MSG_PEEK)

decrypt                                         !!! read(sd, buf2, size, 0)

!!! pop_record(sd)

possible lock inversion surrounding splice

tls_splice_read acquires locks in the order:

tsk->rx_lock
lock_sock(tsk)
pipe_lock (called in splice_to_pipe)

tls_sendpage acquires locks in the order:

pipe_lock (called in splice_from_pipe)
tsk->rx_lock
lock_sock(tsk)

Kernel warns about possible deadlock
http://pastebin.com/tYCk0bxy

In practice I don't think this is much of an issue since I don't think you'd ever splice to and from the same pipe.

sa_ktls and sa_cipher (KTLS_CIPHER_AES_GCM_128)

I like the approach with fixed values (it gives a type-safe userspace API), but it may meet opposition upstream. Be prepared to be asked to use the "rfc5288(gcm(aes))" string directly from userspace. There are good arguments against doing so (such as the ability to change the internal API, and type safety).

Race condition in KTLS_RECV_READY

The KTLS_RECV_READY macro is used to check whether the aead has been set up for decryption. However, it merely checks that the keys are set, while tls_init_aead is called after the keys are set. An interrupt that checks KTLS_RECV_READY (such as tls_data_ready) and occurs between setting the keys and initializing the aead would therefore cause a crash.

3 options:

- Acquire the socket lock during decryption so that setting the keys and initializing the aead occur atomically, which means that decryption cannot happen in an IRQ routine like tls_data_ready.
- Use an atomic flag that is set after key setup.
- Set the keys before changing the sk_data_ready callback.
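The atomic-flag option can be sketched with C11 atomics in user space. This is an illustration of the ordering only: `struct rx_state`, `rx_setup()` and `rx_ready()` are hypothetical, `aead_handle` stands in for the initialized aead, and kernel code would use the kernel's own atomic or barrier primitives instead of `<stdatomic.h>`.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical receive-side state; `ready` is published last. */
struct rx_state {
	int aead_handle;
	atomic_bool ready;
};

/* Finish all setup first, then publish it with a release store. */
static void rx_setup(struct rx_state *s, int aead_handle)
{
	s->aead_handle = aead_handle;
	atomic_store_explicit(&s->ready, true, memory_order_release);
}

/*
 * Reader side (e.g. a data-ready callback): the acquire load ensures
 * that a true result also makes aead_handle visible.
 */
static bool rx_ready(struct rx_state *s)
{
	return atomic_load_explicit(&s->ready, memory_order_acquire);
}
```

The key point is that the flag is set only after both the keys and the aead are in place, so a reader never sees a half-initialized state.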

DTLS window handling from user space

The current implementation of DTLS sliding window handling behaves correctly only if there are no out-of-order DTLS records. If we receive a record that is not at the beginning of the window and user space then asks for the seq number, the returned seq number can be incorrect. The kernel should return the seq number of the very first record still expected within the sliding window.

An example scenario:

user space sets seq number to be (epoch 1, seq num 1)
kernel receives (epoch 1, seq num 3) - within window, record is accepted
kernel receives (epoch 1, seq num 4) - within window, record is accepted
user space asks for seq number, the kernel responds with (epoch 1, seq num 3) - incorrect, should be (epoch 1, seq num 2)

Since the DTLS window is part of the state, there should probably be an appropriate {get,set}sockopt(2) call to expose the sliding window to user space. This would allow user space to be synced with the DTLS socket and vice versa. Keep in mind that the sliding window in user space can have a different size than the one in the kernel.
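A minimal user space model of a window that always knows its first expected sequence number, assuming a 64-entry bitmap and ignoring epochs. `struct dtls_window` and `dtls_window_accept()` are hypothetical; a real implementation would also slide the window forward for records beyond its right edge rather than reject them.

```c
#include <stdbool.h>
#include <stdint.h>

#define DTLS_WINDOW_SIZE 64

/* Bit i of `received` is set once seq first_expected + i has arrived. */
struct dtls_window {
	uint64_t first_expected;
	uint64_t received;
};

/* Returns true if the record is new and inside the window. */
static bool dtls_window_accept(struct dtls_window *w, uint64_t seq)
{
	uint64_t off, bit;

	if (seq < w->first_expected)
		return false; /* already seen or too old */
	off = seq - w->first_expected;
	if (off >= DTLS_WINDOW_SIZE)
		return false; /* outside this simplified window */
	bit = (uint64_t)1 << off;
	if (w->received & bit)
		return false; /* replay */
	w->received |= bit;
	/* Advance past every record that is now contiguous. */
	while (w->received & 1) {
		w->received >>= 1;
		w->first_expected++;
	}
	return true;
}
```

In the scenario above, the window starts with first_expected = 2; after records 3 and 4 arrive it still reports 2, and only jumps to 5 once record 2 is received.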

Recv nonce and sequence number are reversed

For recv, aead_request_set_crypt takes the nonce, and make_aad should take the iv_recv sequence number. The spec allows these to be the same, and the current implementation makes them the same (as, I guess, gnutls must too), but openssl does not, so recv doesn't work when receiving data from openssl.

Issues when bind(2) is called more than once

When bind(2) is called more than once on the AF_KTLS socket instance, the previously bound socket is not correctly freed. We should either limit bind(2) to be called only once per AF_KTLS instance or free previously bound socket.

Avoid kernel_recvmsg()

The current implementation uses kernel_recvmsg() for receiving records. This function copies from the skbuff to the passed vector (see [1] for TCP, see [2] for UDP), so it would be nice to avoid it.

When the underlying protocol is TCP, tcp_read_sock() could be used. Unfortunately, the implementation of tcp_read_sock() does not support peeking (see [3]), which is necessary according to the current AF_KTLS design.
When the underlying protocol is UDP, there is currently no such copy-less logic that could be reused (AFAIK).

EDIT: we could consider operating directly on the skbuff

[1] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1830
[2] http://lxr.free-electrons.com/source/net/ipv4/udp.c#L1392
[3] http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1485

Review flags that are supplied (not only) from user space

Review flags that can be supplied (e.g. MSG_DONTWAIT, MSG_MORE, etc.). Some of them are not useful for TLS/DTLS, but some are. Some ideas:

MSG_DONTWAIT - for sendmsg(2), can be implemented with a worker
MSG_MORE - has to store previous bytes; the sendpage() context can be reused for this purpose
MSG_NOSIGNAL - can be propagated when suitable
MSG_PEEK - this will require not clearing the cached record if it was received asynchronously, or writing to the cache when data was waited for (in the kernel_recvmsg() rx queue)
MSG_WAITALL - this can be tricky; a use case where a user supplies more than TLS_MAX_PAYLOAD_SIZE (1 << 14) should not be supported, from my POV
...

Don't depend on af_alg

If CONFIG_CRYPTO_USER_API=n, there are a bunch of build errors, because we depend on af_alg things and we shouldn't:

WARNING: "af_alg_make_sg" [/home/davejwatson/local/af_ktls/af_ktls.ko] undefined!
WARNING: "af_alg_wait_for_completion" [/home/davejwatson/local/af_ktls/af_ktls.ko] undefined!

Create KConfig entry

SSIA

Per socket worker

Currently there is a single worker for the whole module, called "ktls". It would be worth considering a worker for every socket instance, e.g. "ktls-<PID>/id" or so.

Review variable types

SSIA

release & tls_rx_async_work race

This one looks like a race between receiving a new message, and close() locally. You can kinda see it in the interleaved logs

[ 372.859537] tls: --> tls_rx_async_work
[ 372.860805] tls: --> tls_peek_data
[ 373.860974] tls: [ 373.861021] tls: --> tls_release
[ 373.861023] tls: --> tls_free_sendpage_ctx
[ 373.861024] tls: --> tls_sock_destruct
[ 373.861026] tls: parallel executions: 2
[ 373.867009] --> tls_data_ready
[ 373.868162] tls: --> tls_rx_async_work
[ 373.870335] tls: --> tls_peek_data
[ 373.871089] BUG: unable to handle kernel NULL pointer dereference at 0000000000000090
[ 373.873204] IP: [] sock_recvmsg+0x45/0x60
[ 373.874106] PGD 0
[ 373.874430] ------------[ cut here ]------------
[ 373.875749] Oops: 0000 [#1] SMP
[ 373.876466] Last file read FILE* = ffff8823f8401b00
[ 373.887355] Modules linked in: sha256_generic drbg af_ktls(O) tcp_diag inet_diag ip6table_filter xt_NFLOG xt_comment iptable_filter netconsole autofs4 hwmon_vid w83795 i2c_piix4 rpcsec_gss_krb5 auth_rpcgss oid_registry dm_mod loop sg serio_raw iTCO_wdt iTCO_vendor_support e1000e ipmi_devintf x86_pkg_temp_thermal coretemp kvm irqbypass crc32c_intel aesni_intel ablk_helper cryptd lrw gf128mul glue_helper aes_x86_64 pcspkr i2c_i801 i2c_core lpc_ich mfd_core ehci_pci ehci_hcd ipmi_si ipmi_msghandler shpchp button
[ 373.896046] CPU: 2 PID: 333 Comm: kworker/2:1 Tainted: G O 4.6.0-rc6_00054_g5294e32 #117
[ 373.897631] Hardware name: Quanta Freedom/Winterfell, BIOS F03_3B09 05/22/2014
[ 373.898850] Workqueue: ktls tls_rx_async_work [af_ktls]
[ 373.899694] task: ffff881228b96200 ti: ffff881228470000 task.ti: ffff881228470000
[ 373.901004] RIP: 0010:[] [] sock_recvmsg+0x45/0x60
[ 373.903504] RSP: 0018:ffff881228473a88 EFLAGS: 00010246
[ 373.904457] RAX: 0000000000000000 RBX: ffff8811fd726400 RCX: 0000000000000042
[ 373.912814] RDX: 000000000000401d RSI: ffff881228473b28 RDI: ffff8811fd726400
[ 373.913964] RBP: ffff881228473aa8 R08: 000000000000401d R09: 0000000000000042
[ 373.915112] R10: 000000000000003f R11: 0000000000000259 R12: ffff881228473b28
[ 373.916352] R13: 000000000000401d R14: 0000000000000042 R15: 0000000000000042
[ 373.917616] FS: 0000000000000000(0000) GS:ffff881237880000(0000) knlGS:0000000000000000
[ 373.920577] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 373.921502] CR2: 0000000000000090 CR3: 0000000001e06000 CR4: 00000000001406e0
[ 373.922753] Stack:
[ 373.923089] ffff881228474000 ffff881228473b28 000000000000401d ffffffffffffffff
[ 373.924293] ffff881228473af8 ffffffff817de7b9 ffff881228473b58 ffff8811fd726400
[ 373.925505] 0000000000000001 ffff881225fcd000 ffff881225fcd94b 0000000000000042
[ 373.926700] Call Trace:
[ 373.927212] [] kernel_recvmsg+0x69/0x90
[ 373.928145] [] tls_peek_data+0xab/0x200 [af_ktls]
[ 373.930195] [] tls_rx_async_work+0x113/0x1b0 [af_ktls]
[ 373.931298] [] process_one_work+0x16c/0x4f0
[ 373.937924] [] ? __schedule+0x36a/0x9d0
[ 373.946157] [] ? schedule+0x40/0xb0

Buffer changes

Met with a couple netdev kernel guys (therbert, hannes, alexei) to discuss buffer management in ktls. Consensus was roughly:

Preallocated send/recv buffers per-socket aren't going to work, would be too much memory for large numbers of sockets. Need to allocate them lazily as we need to decode/encode. Alternatively, if we aren't pre-emptable, they could possibly be preallocated per-cpu as an optimization.
When the tcp socket has data ready, it needs to be pulled off the socket and stored in ktls. Otherwise a tcp socket with a receive buffer size of 4k that is sent a larger TLS message would not make forward progress. There are also other reasons this needs to happen, like out-of-order receives in tcp interacting in interesting ways with the receive buffer size, which are also fixed by moving the sk_buffs to being owned by ktls instead. KCM already does a version of this.
Output buffers should probably be stored as sk_buffs on the socket, and not as scatter-gather lists. This would facilitate passing ktls sockets to other kernel things (KCM), as well as making it easy to support multiple message decode by respecting receive buffer sizes on ktls.

tls_bind issues

In tls_bind, on the failure path for bind_end, we need to sockfd_put() the tsk->socket, since we're not using it anymore.

Pass all key material in one setsockopt(2) call

Multiple setsockopt(2) calls are currently needed to pass key material to the kernel. It would be nice to consider introducing a single setsockopt(2) call that passes all the needed key material at once, to reduce the number of context switches.

E.g.:

struct ktls_conf cnf;

// session_* is obtained material from GnuTLS/OpenSSL
cnf.salt_send = session_salt_send;
cnf.salt_recv = session_salt_recv;
cnf.key_send = session_key_send;
cnf.key_recv = session_key_recv;
cnf.iv_send = session_iv_send;
cnf.iv_recv = session_iv_recv;

setsockopt(ksd, AF_KTLS, KTLS_SET_CONF, &cnf, sizeof(cnf));

EDIT: ... and skip NULL fields in struct ktls_conf in kernel.

AF_KTLS and PF_KTLS in kernel headers

include/linux/socket.h should cover the AF_KTLS socket. For now, you have to choose an unused protocol family in order to do insmod.

Overflow in sendfile(2)

Using --sendfile in the AF_KTLS tool with a file larger than 2 GB causes an overflow. This should be an easy fix.
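The issue does not say where the overflow occurs. Assuming it comes from holding the file length in a 32-bit signed type, one common remedy is to track the remaining byte count in a 64-bit type and send in bounded chunks. The helpers below are a hypothetical user space sketch of that bookkeeping, not code from the tool.

```c
#include <stddef.h>
#include <stdint.h>

/* Size of the next chunk: never more than max_chunk, never truncated. */
static size_t next_chunk(uint64_t remaining, size_t max_chunk)
{
	return remaining < max_chunk ? (size_t)remaining : max_chunk;
}

/* Number of sendfile-style calls needed to move `total` bytes. */
static uint64_t count_chunks(uint64_t total, size_t max_chunk)
{
	uint64_t calls = 0;

	while (total > 0) {
		total -= next_chunk(total, max_chunk);
		calls++;
	}
	return calls;
}
```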

test suite

While af_ktls-tool provides a testing tool, it may be better to automate the test suite behind a simple make check command that runs the unit tests and also stress tests the subsystem.

Test suite

I wrote up a framework for what I think could be a good foundation for testing the AF_KTLS module. Check it out here.
https://github.com/lancerchao/af_ktls-test

Do MSG_PEEK call only once

Based on https://github.com/fridex/af_ktls/pull/28, I am open to discussion on whether there should be one peek or two peeks per kernel_recvmsg() and splice_read.

We should distinguish stream (TCP) and datagram (UDP) protocols here, as was discussed in https://github.com/fridex/af_ktls/pull/28. This should be the full list of scenarios that can happen:

for DTLS/UDP:
- EBADF when the peeked size is bigger than we expect based on the header
- EBADF when the peeked size is smaller than we expect based on the header (or does not cover the header at all)
- when equal, process
for TLS/TCP:
- when the peeked size is bigger than the size based on the TLS header, process only the TLS record (the part based on the TLS header)
- when the peeked size is smaller than the size based on the TLS header, or it does not cover the header at all (possibly segmented):
  - EAGAIN for a nonblocking socket
  - block for a blocking socket
- when equal, process
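The scenario list can be written down as a small decision function. This is a user space sketch of the table only: `enum peek_action` and `classify_peek()` are hypothetical names, and `payload_len` is the record length taken from the header, which is only meaningful when the peek covered the header.

```c
#include <stdbool.h>
#include <stddef.h>

enum peek_action {
	PEEK_PROCESS, /* a whole record is available: process it */
	PEEK_BLOCK,   /* blocking stream socket: wait for more data */
	PEEK_EAGAIN,  /* nonblocking stream socket: report EAGAIN */
	PEEK_EBADF    /* datagram does not match its header: report EBADF */
};

static enum peek_action classify_peek(size_t peeked, size_t hdr_len,
				      size_t payload_len, bool is_stream,
				      bool nonblocking)
{
	/* Short if the header or the announced payload is incomplete. */
	bool too_short = peeked < hdr_len || peeked - hdr_len < payload_len;

	if (!is_stream) {
		/* A datagram must match its header exactly. */
		if (too_short || peeked - hdr_len > payload_len)
			return PEEK_EBADF;
		return PEEK_PROCESS;
	}
	if (too_short)
		return nonblocking ? PEEK_EAGAIN : PEEK_BLOCK;
	/* Extra stream bytes belong to the next record and stay queued. */
	return PEEK_PROCESS;
}
```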

Can we peek only once (not a separate peek for the header and then for the record)?

for DTLS/UDP it should be pretty straightforward, since when we do the peek, the whole datagram should already be available
for TLS/TCP:
- since we explicitly know where the TLS record begins and ends, we could wait/report when there is not enough data
- for the nonblocking socket scenario, there can be an inconsistency in the return value:
  1. not enough data received to handle the whole record - kernel will return EAGAIN
  2. received the whole record, but it was not possible to decrypt it - kernel will return EBADF (probably)
    Note this will occur even with the double-peek approach

Nevertheless, the extra peek is not a big deal (as discussed in #28).

Any suggestions?

Crypto API scatterwalk copy

The crypto API expects data to be contiguous in memory. This means that even though it supports a scatter/gather buffer interface, under the covers it does a copy to make everything contiguous. This makes sense in some ways: the AESNI routines need data aligned on certain byte boundaries to be most efficient.

For af_ktls however, the header aad data and hash are currently never contiguous. We should either make all the af_ktls data contiguous if possible, or modify the crypto API to accept portions of data that aren't contiguous where it doesn't matter.

Attached was my work in progress diff to modify the crypto API to avoid the copies if possible.

nocopy_crypto.txt

incorrect header length checks

At: https://github.com/fridex/af_ktls/blob/master/af_ktls.c#L1046
you do not check whether enough data has been received before starting to read the header.
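A sketch of the missing check, in user space C and for the TLS case only. It relies on the standard 5-byte TLS record header (1 byte type, 2 bytes version, 2 bytes big-endian length); `struct tls_header` and `tls_parse_header()` are hypothetical names and not the module's code.

```c
#include <stddef.h>
#include <stdint.h>

#define TLS_HEADER_SIZE 5

struct tls_header {
	uint8_t type;    /* 0x17 for application data */
	uint8_t major;
	uint8_t minor;
	uint16_t length; /* payload length announced by the header */
};

/* Returns 0 on success, -1 if fewer than 5 bytes are available. */
static int tls_parse_header(const unsigned char *buf, size_t len,
			    struct tls_header *h)
{
	if (len < TLS_HEADER_SIZE)
		return -1; /* do not touch buf[] beyond what was received */
	h->type = buf[0];
	h->major = buf[1];
	h->minor = buf[2];
	h->length = (uint16_t)((buf[3] << 8) | buf[4]);
	return 0;
}
```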

license is GPLv3

It seems that the license of this module is GPLv3. That means that for any attempt to get it upstream, it will have to be relicensed as GPLv2-only or GPLv2+.

peek tcp data using tcp_read_sock

We need a way to "peek" data from the tcp socket using tcp_read_sock (the comments state that this is currently not supported). If for whatever reason we decide the packet is bad during the decryption stage, we can't let userspace SSL handle the packet, since at that point it has already been pulled from TCP's receive queue.

http://lxr.free-electrons.com/source/net/ipv4/tcp.c#L1490
Related #37

DTLS over TCP is not possible

Overall the API is very neat. However, currently AF_KTLS in socket() may mean either TLS or DTLS depending on whether TCP or UDP is used. While that makes sense for TCP and (mostly) UDP, there are other protocols such as SCTP that can run either DTLS or TLS. Would it make sense to select the actual protocol (DTLS vs TLS) using the 'protocol' parameter in TLS create?

That would currently break the OPENCONNECT support, but if we plan with upstream in mind, it may make sense to take that aside initially and handle more specific protocols at a second level using (possibly) BPF. https://lwn.net/Articles/599755/

Consider adding likely() and unlikely() to meaningful parts

SSIA

MTU handling

After chatting with Fridolin and reading RFC 6347, I notice a discrepancy between the specs and MTU handling in KTLS, which I would like to discuss.

Right now in KTLS, if the MTU is 1500 and the user tries to send a message of length 1501 over UDP, only 1500 bytes are transmitted in order to avoid IP fragmentation. In other words, KTLS is actively cutting down the size of records to avoid fragmentation.

However, RFC 6347 says this:

In general, DTLS's philosophy is to leave PMTU discovery to the application.

It's saying that the application using DTLS is responsible for avoiding IP fragmentation, not the DTLS implementation itself.
Additionally, it goes on to say:

However, DTLS cannot completely ignore PMTU for three
reasons:

The DTLS record framing expands the datagram size, thus lowering
the effective PMTU from the application's perspective.

In some implementations, the application may not directly talk to
the network, in which case the DTLS stack may absorb ICMP
[RFC1191] "Datagram Too Big" indications or ICMPv6 [RFC4443]
"Packet Too Big" indications.

The DTLS handshake messages can exceed the PMTU.

As a solution to the above problems, it says DTLS should provide the following:

If PMTU estimates are available from the underlying transport
protocol, they should be made available to upper layer protocols. In
particular:

For DTLS over UDP, the upper layer protocol SHOULD be allowed to
obtain the PMTU estimate maintained in the IP layer.

My recommended changes:

- Remove setsockopt(MTU).
- Revise getsockopt(MTU) to query the MTU from the IP layer and return MTU - KTLS_DTLS_OVERHEAD to the user.
- Do not truncate records that exceed the MTU. Instead, allow fragmentation to occur.
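The revised getsockopt(MTU) could look roughly like the following user space sketch. It uses the Linux-specific IP_MTU socket option to read the path MTU estimate of a connected socket. The value of KTLS_DTLS_OVERHEAD is not given in this issue, so `DTLS_OVERHEAD_ASSUMED` is an assumption (13-byte DTLS 1.2 record header, 8-byte explicit nonce and 16-byte tag for AES-GCM), and both function names are hypothetical.

```c
#include <netinet/in.h>
#include <sys/socket.h>

/* Assumed per-record overhead for DTLS 1.2 with AES-GCM: 13 + 8 + 16. */
#define DTLS_OVERHEAD_ASSUMED 37

/* MTU left for the application after DTLS record framing. */
static int dtls_payload_mtu(int path_mtu)
{
	if (path_mtu <= DTLS_OVERHEAD_ASSUMED)
		return 0;
	return path_mtu - DTLS_OVERHEAD_ASSUMED;
}

/*
 * Ask the IP layer for its path MTU estimate on a connected UDP
 * socket and subtract the DTLS overhead. Returns -1 on error.
 */
static int query_payload_mtu(int udp_fd)
{
	int mtu = 0;
	socklen_t len = sizeof(mtu);

	if (getsockopt(udp_fd, IPPROTO_IP, IP_MTU, &mtu, &len) < 0)
		return -1;
	return dtls_payload_mtu(mtu);
}
```

Following the formula in the recommendation, only the DTLS overhead is subtracted here; whether IP and UDP header sizes should also be accounted for is left open.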
