Comments (13)
Yeah, it's definitely a drawback that the UniversalHash
trait doesn't provide a multi-block API. It's also problematic that data in AEADs is round-tripping through calls like _mm_storeu_si128
/_mm_loadu_si128
, especially along crate boundaries where it's pretty much guaranteed not to get optimized away even in cases where it's aligned.
These sorts of optimization problems for passing data between stream ciphers and universal hash functions were the impetus for the simd-buffers
work I was attempting here:
I abandoned that, but now I wonder if maybe crypto_bigint::UInt
might make a reasonable replacement buffer type for these use cases, especially if we ensured they were properly aligned such that they could be safely converted to similarly-structured SIMD types. Then a multi-block API could operate over slices of those structured SIMD buffers with guaranteed alignment.
from aeads.
That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.
Such API would have to be unsafe
and represented as free-standing methods. It could work, but I think such solution is quite ad hoc and will be hard to extend, i.e. for each somewhat relevant combination we would have to manually write loops for all combinations. Adding a new backend would mean that we will need to update all combinations using this primitive manually.
I guess it could be a practical stop-gap solution and baseline for comparing generic solutions.
from aeads.
Related issue: generalized AEAD implementations based on stream ciphers RustCrypto/traits#45
from aeads.
As a complete crypto noob who got here from github explore (But not a complete rust noob), would this be feasible? (Or even just a small part, like trying to make decryption for chacha20poly1305 single pass) How robust are existing tests?
Would the change mostly be changing implementations like https://github.com/RustCrypto/AEADs/blob/master/chacha20poly1305/src/cipher.rs#L66-L91 into something more like https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/lib.rs#L317-L347 ?
from aeads.
Yes, but it also needs to be done in a way that actually improves performance. I've tried to do this change naively a few times (to aes-gcm
, mainly, I might still have the code around) and it decreased performance.
I think doing it properly might require keeping the data flowing through XMM registers... at the very least it needs to all stay in L1 cache.
ChaChaPoly is even trickier because this issue is a micro-optimization and so far we don't have an AVX2 backend for Poly1305 (see RustCrypto/universal-hashes#49)
from aeads.
Thanks for the response!
I think I'll give it a shot. Do not let that stop anyone else from trying it since I don't have much hope I'll be able to do much.
How important is performance compiling with avx support vs without? (i.e, would you mostly care about speed when compiling with simd extensions or does the "default" cargo build configuration also matter a lot?)
from aeads.
The ccm
crate could also use a single pass encryption/decryption.
from aeads.
Let's take chacha20poly1305
in its current form and look at the AVX2 hot path (ignoring all the autodetect code in chacha20
and poly1305
).
chacha20poly1305::cipher
:
impl<C> Cipher<C> where C: StreamCipher + StreamCipherSeek,
{
pub(crate) fn encrypt_in_place_detached(
mut self,
associated_data: &[u8],
buffer: &mut [u8],
) -> Result<Tag, Error> {
// ...
// Not currently implemented, but imagine we did this:
for chunk in buffer.chunks_mut(BLOCK_SIZE * 4) {
self.cipher.apply_keystream(chunk);
self.mac.update_padded(chunk);
}
// ...
}
}
chacha20::backend::autodetect
:
pub(crate) const BUFFER_SIZE: usize = BLOCK_SIZE * 4;
chacha20::chacha
:
impl<R: Rounds, MC: MaxCounter> StreamCipher for ChaCha<R, MC> {
fn try_apply_keystream(&mut self, mut data: &mut [u8]) -> Result<(), LoopError> {
// ...
let mut chunks = data.chunks_exact_mut(BUFFER_SIZE);
for chunk in &mut chunks {
let counter_with_offset = self.counter_offset.checked_add(counter).unwrap();
self.block.apply_keystream(counter_with_offset, chunk);
counter = counter.checked_add(COUNTER_INCR).unwrap();
}
// ...
}
}
chacha20::backend::avx2
:
const BLOCKS: usize = 4;
impl<R: Rounds> Core<R> {
pub fn apply_keystream(&self, counter: u64, output: &mut [u8]) {
debug_assert_eq!(output.len(), BUFFER_SIZE);
unsafe {
let state = State {
a: self.v0,
b: self.v1,
c: self.v2,
d: iv_setup(self.iv, counter),
};
let state = self.rounds(state);
for i in 0..BLOCKS {
for (chunk, a) in output[i * BLOCK_SIZE..(i + 1) * BLOCK_SIZE]
.chunks_mut(0x10)
.zip(
[state.a, state.b, state.c, state.d]
.iter()
.map(|s| s.blocks[i]),
)
{
let b = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
let out = _mm_xor_si128(a, b);
_mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, out);
}
}
}
}
}
universal_hash
:
pub trait UniversalHash {
fn update_padded(&mut self, data: &[u8]) {
let mut chunks = data.chunks_exact(Self::BlockSize::to_usize());
for chunk in &mut chunks {
self.update(GenericArray::from_slice(chunk));
}
// ...
}
}
poly1305::backend::avx2
:
impl State {
pub(crate) unsafe fn compute_block(&mut self, block: &Block, partial: bool) {
// ...
self.cached_blocks[self.num_cached_blocks].copy_from_slice(block);
if self.num_cached_blocks < 3 {
self.num_cached_blocks += 1;
return;
} else {
self.num_cached_blocks = 0;
}
let p = Aligned4x130::from_blocks(&self.cached_blocks);
// ...
}
poly1305::backend::avx2::helpers
:
impl Aligned4x130 {
pub(super) unsafe fn from_blocks(src: &[Block; 4]) -> Self {
// 26-bit mask on each 32-bit word.
let mask_26 = _mm256_set1_epi32(0x3ffffff);
// Sets bit 24 of each 32-bit word.
let set_hibit = _mm256_set1_epi32(1 << 24);
// - Load the four blocks into the following 32-bit word layout:
// [b33, b32, b31, b30, b23, b22, b21, b20]
// [b13, b12, b11, b10, b03, b02, b01, b00]
//
// - Unpack the upper and lower 64 bits:
// [b33, b32, b13, b12, b23, b22, b03, b02]
// [b31, b30, b11, b10, b21, b20, b01, b00]
//
// - Swap the middle two 64-bit words:
// a0 = [b33, b32, b23, b22, b13, b12, b03, b02]
// a1 = [b31, b30, b21, b20, b11, b10, b01, b00]
let (lo, hi) = src.split_at(2);
let blocks_23 = _mm256_loadu_si256(hi.as_ptr() as *const _);
let blocks_01 = _mm256_loadu_si256(lo.as_ptr() as *const _);
// ...
}
}
So, the hot path above:
- Splits the plaintext into 4-block byte chunks.
- Passes the 4-block byte chunk to
chacha20
, which:- Chunks it into 4-block chunks (no-op).
- Splits the chunk into individual blocks and:
- Calls
_mm_loadu_si128
on the block to load it into a__m128i
. - XORs the stream into the
__m128i
. - Stores the
__m128i
back into the block.
- Calls
- Passes the (now-encrypted) 4-block byte chunk to
poly1305
, which:- Splits the chunk into individual blocks.
- Copies each block into a 4-block cache (reconstructing the chunk).
- Calls
_mm256_loadu_si256
on each half of the 4-block cache. - Draws the rest of the polynomiOwl.
So the immediate blocker is that the UniversalHash
trait doesn't provide any API to process multiple blocks at a time (StreamCipher::try_apply_keystream
allows the implementor to choose the chunking, whereas UniversalHash::update
is typed on a single block).
Once that is addressed, poly1305
could directly consume 4-block chunks without using its cache, at which point we would be consistently passing around a 4-block chunk size. Then the question becomes the form in which we pass the chunk around. Sketching out two possible directions:
- Add an associated type for the chunk to an
aead
trait, which is constrained to equal an equivalent associated type incipher
anduniversal-hash
. Thenchacha20
andpoly1305
would separately set it to the same concrete type. - Add an an
aead::AeadChunk
trait, and implement it inchacha20poly1305
. Have some way to map it to the chunk inputs ofcipher
anduniversal-hash
.
from aeads.
The fundamental issue here is runtime detection. It not only means that optimal number of blocks processed in parallel can change depending on CPU capabilities (and in some cases even on CPU family!), but also that during combination of primitives we need a way to automatically generate a matrix of possible capability combinations. It means that if algorithm 1 is able to process 3 blocks by default and 8 blocks with feature A and algorithm 2 is able to process 2 blocks by default and 6 with feature B, then ideally when combining them we should generate 3 code paths: by default processing 6 blocks, for feature A processing 8 blocks, for feature B processing 6 blocks, and for feature A and B processing 24 blocks. And if algorithms have different block sizes, problems becomes even harder.
Rust does not have good tools for solving this problem and likely will not have them anytime soon. At the very least we would need some kind of function multi-versioning (i.e. an ability to define different function implementations for different target features) with an ability to query available versions at compile time. And ideally we would need trait multi-versioning as well since it's preferable to store chunk size as an associated constant, but allowing public API (via associated constants and types) to change depending on available target features is a sizable can of worms with potentially non-trivial implications.
Defining those combinations manually could work to some extent, but it will be hard to maintain and I don't think compiler will be able to optimize out our cpufeatures
-based code even if method is used in a context with enabled target features.
Round-tripping _mm_storeu_si128
/_mm_loadu_si128
should not be a big issue since for compiler it's a trivial optimization assuming code gets properly inlined. The issue here is again runtime feature detection, since branching inside MAC/universal hash block processing method acts as an optimization barrier. Without proper inlining and removing the optimization barrier I highly doubt crypto-bigint
will have any measurable effect. By caching blocks into stack you would be able to use aligned loads, but the main improvement is to keep data in registers without spilling it anywhere and you would not achieve it using this approach.
I hope to alleviate some issues in the new trait versions. It introduces slice-based block-level traits for hashes/MACs/universal hashes, hides chunk size from public API and instead uses callback-based methods. Not only should it help with inlinining, but also effectively inverses control over iteration. In other words, iteration over blocks is controlled not by higher-level code which combines primitives, but at the cipher level. It means that we can branch once per loop, instead of doing it every chunk (compiler currently is unable to optimize it automatically). Also it means that callbacks (which are used for passing blocks to MAC) are executed in the context with enabled target features and known chunk size.
Unfortunately this approach is still far from ideal. Roughly it results in the following code:
if is_aesni_available() {
for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
aesni_encrypt(chunk);
if is_pclmul_available() {
pclumul_mac(chunk)
} else {
default_mac(chunk)
}
}
} else {
for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
default_encrypt(chunk);
if is_pclmul_available() {
pclumul_mac(chunk)
} else {
default_mac(chunk)
}
}
}
In other words, if cipher backend does not cover required features for MAC backend we still have the optimization barrier on our hands.
from aeads.
@newpavlov have you actually tested that the optimizations you expect actually work out in practice, especially considering things like traits defined in two crates, being consumed by a third, where the first crate is using _mm_storeu_si128
and the second is using _mm_loadu_si128
? I haven't myself, but I'm skeptical about the degree of inlining which can occur in that sort of 3-crate scenario (which really ends up being more like 6 when you add in the trait crates), especially with a crate as Rust's unit of compilation.
To set a baseline for maximum performance, I think we could move things like the CPU feature tests into the AEAD crates like aes-gcm
/chacha20poly1305
, and expose a couple/few sets of x86-64 and ARM-specific APIs in crates like aes
, chacha20
, ghash
/polyval
, and poly1305
which operate in terms of SIMD registers. We don't even have to ship that, we just need to see what the performance difference is.
Once we're reasonably certain of what a performant implementation looks like, we can experiment with various abstractions, although I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.
I feel like there are slice-based abstractions missing from universal-hash
, similar to the ones I added to address RustCrypto/traits#332, which would also address the problem. I'm not sure we really need any sort of automatic constraint solver to pass around appropriately-sized chunks. We're already writing code which is explicit about the various platforms we support and backends which are detected at runtime, so we can program in an appropriate size for each of those scenarios since they're already factored into relevant modules explicitly. And really, in practice that size is dictated by the stream cipher, at least in the cases we currently care about.
Glossing over a few things, in practice I think the optimal block sizes look like the following:
AES-GCM / AES-GCM-SIV
- x86/x86_64 w\ AES-NI + CLMUL: 8 x 128-bit blocks
- ARMv8 w\ crypto extensions: 8 x 128-bit blocks
- 64-bit portable: 4 x 128-bit blocks
- 32-bit portable: 2 x 128-bit blocks
ChaCha20Poly1305
- x86/x86_64 w\ AVX2: 4 x 128-bit blocks
- x86/x86_64 w\ SSE: 2 x 128-bit blocks
- ARM w\ NEON: 2 x 128-bit blocks(?)
- Portable: 1 x 128-bit block
from aeads.
have you actually tested that the optimizations you expect actually work out in practice
No, I only played a bit with small snippets in godbolt. We may need to abuse #[inline(always)]
to achieve this optimization in practice.
I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.
I am myself far from 100% happy with the result, but right now I don't see a better path forward and, compared to the current design, I think it's definitely an improvement. Could you please comment in the PR on elements which you don't like or do not fully understand? I would appreciate your feedback sooner than later, since I hope to finalize it in the near future.
in practice that size is dictated by the stream cipher, at least in the cases we currently care about.
I agree, this is why the callbacks in my PR are only done on the cipher side, while MACs and universal hashes are left with the slice-based methods. But we are still left with the problem of target feature branching inside chunk iteration. Even if we are to check redundant features such as CLMUL in aes
, I don't think that compiler will be able to remove branches in our cpufeatures
-based code.
Also do not forget that code with enabled target features can not be currently inlined at all, so we definitely should strive to have chunk processing inside context with same target features.
from aeads.
Could you please comment in the PR on elements which you don't like or do not fully understand?
Left a comment on the PR. Just generally I'm confused what is happening there.
But we are still left with the problem of target feature branching inside chunk iteration.
That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.
Then the check can be performed at the level of the entire AEAD, once, at the time the AEAD is initialized, and branched upon at the granularity of large AEAD operations.
The fast path for the entire core can occur within #[target_feature(...)]
annotated code which is amenable to inlining, with data flowing in the form of SIMD register types which don't need to rely on inlining for performance, since they're the desired type to begin with and there's no type conversions that need to be optimized away.
from aeads.
After numerous experiments, I think I've found a good solution to this problem could look like, but, unfortunately, it's blocked on lack of rank-2 polymorphism in Rust. I wrote about it here: https://internals.rust-lang.org/t/15875 So I think the callback-based solution explored in the cipher
v0.4 PRs is the best option which we have right now.
from aeads.
Related Issues (20)
- generic array / stream encryptor type mismatch HOT 4
- Consider adding an API that takes the nonce size and tag size as parameters HOT 3
- Consider using cycles for ascon-aead instead of milliseconds
- Bump dependency on generic-array to version 1.0.0 HOT 1
- Support for streaming AES-GCM encryption HOT 4
- how to add tag/additionalData HOT 2
- Using streamed data with ChaCha20Poly1305 HOT 2
- Support for nonce omission in AES-SIV HOT 4
- chacha20poly1305 decode issue HOT 2
- trap at Instance error with codegen-backend = "cranelift" HOT 2
- Cannot build with no-std HOT 1
- Lack of immediate access to GenericArray to view associated functions and trait impls leads to confusion and annoyance. HOT 13
- Requesting an example HOT 3
- Question about nonce size in xchacha20poly1305 HOT 2
- Extremely poor performance on AES256Gcm with anything but opt-level=3 HOT 2
- Consider exposing AesGcm::compute_tag HOT 3
- OCB3: restrict short nonces
- Enable and fix `missing_debug_implementations`
- `bytes` feature of `aead` is not re-exported by AEADs HOT 1
- Performance on Apple Silicon HOT 5
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from aeads.