Coder Social home page Coder Social logo

One-pass encryption/decryption about aeads HOT 13 OPEN

rustcrypto avatar rustcrypto commented on July 17, 2024
One-pass encryption/decryption

from aeads.

Comments (13)

tarcieri avatar tarcieri commented on July 17, 2024 1

Yeah, it's definitely a drawback that the UniversalHash trait doesn't provide a multi-block API. It's also problematic that data in AEADs is round-tripping through calls like _mm_storeu_si128/_mm_loadu_si128, especially along crate boundaries where it's pretty much guaranteed not to get optimized away even in cases where it's aligned.

These sorts of optimization problems for passing data between stream ciphers and universal hash functions were the impetus for the simd-buffers work I was attempting here:

RustCrypto/utils#221

I abandoned that, but now I wonder if maybe crypto_bigint::UInt might make a reasonable replacement buffer type for these use cases, especially if we ensured they were properly aligned such that they could be safely converted to similarly-structured SIMD types. Then a multi-block API could operate over slices of those structured SIMD buffers with guaranteed alignment.

from aeads.

newpavlov avatar newpavlov commented on July 17, 2024 1

That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.

Such API would have to be unsafe and represented as free-standing methods. It could work, but I think such solution is quite ad hoc and will be hard to extend, i.e. for each somewhat relevant combination we would have to manually write loops for all combinations. Adding a new backend would mean that we will need to update all combinations using this primitive manually.

I guess it could be a practical stop-gap solution and baseline for comparing generic solutions.

from aeads.

tarcieri avatar tarcieri commented on July 17, 2024

Related issue: generalized AEAD implementations based on stream ciphers RustCrypto/traits#45

from aeads.

nico-abram avatar nico-abram commented on July 17, 2024

As a complete crypto noob who got here from github explore (But not a complete rust noob), would this be feasible? (Or even just a small part, like trying to make decryption for chacha20poly1305 single pass) How robust are existing tests?

Would the change mostly be changing implementations like https://github.com/RustCrypto/AEADs/blob/master/chacha20poly1305/src/cipher.rs#L66-L91 into something more like https://github.com/RustCrypto/AEADs/blob/master/aes-gcm-siv/src/lib.rs#L317-L347 ?

from aeads.

tarcieri avatar tarcieri commented on July 17, 2024

Yes, but it also needs to be done in a way that actually improves performance. I've tried to do this change naively a few times (to aes-gcm, mainly, I might still have the code around) and it decreased performance.

I think doing it properly might require keeping the data flowing through XMM registers... at the very least it needs to all stay in L1 cache.

ChaChaPoly is even trickier because this issue is a micro-optimization and so far we don't have an AVX2 backend for Poly1305 (see RustCrypto/universal-hashes#49)

from aeads.

nico-abram avatar nico-abram commented on July 17, 2024

Thanks for the response!

I think I'll give it a shot. Do not let that stop anyone else from trying it since I don't have much hope I'll be able to do much.

How important is performance compiling with avx support vs without? (i.e, would you mostly care about speed when compiling with simd extensions or does the "default" cargo build configuration also matter a lot?)

from aeads.

newpavlov avatar newpavlov commented on July 17, 2024

The ccm crate could also use a single pass encryption/decryption.

from aeads.

str4d avatar str4d commented on July 17, 2024

Let's take chacha20poly1305 in its current form and look at the AVX2 hot path (ignoring all the autodetect code in chacha20 and poly1305).

chacha20poly1305::cipher:

impl<C> Cipher<C> where C: StreamCipher + StreamCipherSeek,
{
    pub(crate) fn encrypt_in_place_detached(
        mut self,
        associated_data: &[u8],
        buffer: &mut [u8],
    ) -> Result<Tag, Error> {
        // ...

        // Not currently implemented, but imagine we did this:
        for chunk in buffer.chunks_mut(BLOCK_SIZE * 4) {
            self.cipher.apply_keystream(chunk);
            self.mac.update_padded(chunk);
        }

        // ...
    }
}

chacha20::backend::autodetect:

pub(crate) const BUFFER_SIZE: usize = BLOCK_SIZE * 4;

chacha20::chacha:

impl<R: Rounds, MC: MaxCounter> StreamCipher for ChaCha<R, MC> {
    fn try_apply_keystream(&mut self, mut data: &mut [u8]) -> Result<(), LoopError> {
        // ...

        let mut chunks = data.chunks_exact_mut(BUFFER_SIZE);
        for chunk in &mut chunks {
            let counter_with_offset = self.counter_offset.checked_add(counter).unwrap();
            self.block.apply_keystream(counter_with_offset, chunk);
            counter = counter.checked_add(COUNTER_INCR).unwrap();
        }

        // ...
    }
}

chacha20::backend::avx2:

const BLOCKS: usize = 4;

impl<R: Rounds> Core<R> {
    pub fn apply_keystream(&self, counter: u64, output: &mut [u8]) {
        debug_assert_eq!(output.len(), BUFFER_SIZE);

        unsafe {
            let state = State {
                a: self.v0,
                b: self.v1,
                c: self.v2,
                d: iv_setup(self.iv, counter),
            };
            let state = self.rounds(state);

            for i in 0..BLOCKS {
                for (chunk, a) in output[i * BLOCK_SIZE..(i + 1) * BLOCK_SIZE]
                    .chunks_mut(0x10)
                    .zip(
                        [state.a, state.b, state.c, state.d]
                            .iter()
                            .map(|s| s.blocks[i]),
                    )
                {
                    let b = _mm_loadu_si128(chunk.as_ptr() as *const __m128i);
                    let out = _mm_xor_si128(a, b);
                    _mm_storeu_si128(chunk.as_mut_ptr() as *mut __m128i, out);
                }
            }
        }
    }
}

universal_hash:

pub trait UniversalHash {
    fn update_padded(&mut self, data: &[u8]) {
        let mut chunks = data.chunks_exact(Self::BlockSize::to_usize());

        for chunk in &mut chunks {
            self.update(GenericArray::from_slice(chunk));
        }

        // ...
    }
}

poly1305::backend::avx2:

impl State {
    pub(crate) unsafe fn compute_block(&mut self, block: &Block, partial: bool) {
        // ...

        self.cached_blocks[self.num_cached_blocks].copy_from_slice(block);
        if self.num_cached_blocks < 3 {
            self.num_cached_blocks += 1;
            return;
        } else {
            self.num_cached_blocks = 0;
        }

        let p = Aligned4x130::from_blocks(&self.cached_blocks);
        // ...
}

poly1305::backend::avx2::helpers:

impl Aligned4x130 {
    pub(super) unsafe fn from_blocks(src: &[Block; 4]) -> Self {
        // 26-bit mask on each 32-bit word.
        let mask_26 = _mm256_set1_epi32(0x3ffffff);
        // Sets bit 24 of each 32-bit word.
        let set_hibit = _mm256_set1_epi32(1 << 24);

        // - Load the four blocks into the following 32-bit word layout:
        //      [b33, b32, b31, b30, b23, b22, b21, b20]
        //      [b13, b12, b11, b10, b03, b02, b01, b00]
        //
        // - Unpack the upper and lower 64 bits:
        //      [b33, b32, b13, b12, b23, b22, b03, b02]
        //      [b31, b30, b11, b10, b21, b20, b01, b00]
        //
        // - Swap the middle two 64-bit words:
        // a0 = [b33, b32, b23, b22, b13, b12, b03, b02]
        // a1 = [b31, b30, b21, b20, b11, b10, b01, b00]
        let (lo, hi) = src.split_at(2);
        let blocks_23 = _mm256_loadu_si256(hi.as_ptr() as *const _);
        let blocks_01 = _mm256_loadu_si256(lo.as_ptr() as *const _);
        // ...
    }
}

So, the hot path above:

  • Splits the plaintext into 4-block byte chunks.
  • Passes the 4-block byte chunk to chacha20, which:
    • Chunks it into 4-block chunks (no-op).
    • Splits the chunk into individual blocks and:
      • Calls _mm_loadu_si128 on the block to load it into a __m128i.
      • XORs the stream into the __m128i.
      • Stores the __m128i back into the block.
  • Passes the (now-encrypted) 4-block byte chunk to poly1305, which:
    • Splits the chunk into individual blocks.
    • Copies each block into a 4-block cache (reconstructing the chunk).
    • Calls _mm256_loadu_si256 on each half of the 4-block cache.
    • Draws the rest of the polynomiOwl.

So the immediate blocker is that the UniversalHash trait doesn't provide any API to process multiple blocks at a time (StreamCipher::try_apply_keystream allows the implementor to choose the chunking, whereas UniversalHash::update is typed on a single block).

Once that is addressed, poly1305 could directly consume 4-block chunks without using its cache, at which point we would be consistently passing around a 4-block chunk size. Then the question becomes the form in which we pass the chunk around. Sketching out two possible directions:

  • Add an associated type for the chunk to an aead trait, which is constrained to equal an equivalent associated type in cipher and universal-hash. Then chacha20 and poly1305 would separately set it to the same concrete type.
  • Add an an aead::AeadChunk trait, and implement it in chacha20poly1305. Have some way to map it to the chunk inputs of cipher and universal-hash.

from aeads.

newpavlov avatar newpavlov commented on July 17, 2024

The fundamental issue here is runtime detection. It not only means that optimal number of blocks processed in parallel can change depending on CPU capabilities (and in some cases even on CPU family!), but also that during combination of primitives we need a way to automatically generate a matrix of possible capability combinations. It means that if algorithm 1 is able to process 3 blocks by default and 8 blocks with feature A and algorithm 2 is able to process 2 blocks by default and 6 with feature B, then ideally when combining them we should generate 3 code paths: by default processing 6 blocks, for feature A processing 8 blocks, for feature B processing 6 blocks, and for feature A and B processing 24 blocks. And if algorithms have different block sizes, problems becomes even harder.

Rust does not have good tools for solving this problem and likely will not have them anytime soon. At the very least we would need some kind of function multi-versioning (i.e. an ability to define different function implementations for different target features) with an ability to query available versions at compile time. And ideally we would need trait multi-versioning as well since it's preferable to store chunk size as an associated constant, but allowing public API (via associated constants and types) to change depending on available target features is a sizable can of worms with potentially non-trivial implications.

Defining those combinations manually could work to some extent, but it will be hard to maintain and I don't think compiler will be able to optimize out our cpufeatures-based code even if method is used in a context with enabled target features.

Round-tripping _mm_storeu_si128/_mm_loadu_si128 should not be a big issue since for compiler it's a trivial optimization assuming code gets properly inlined. The issue here is again runtime feature detection, since branching inside MAC/universal hash block processing method acts as an optimization barrier. Without proper inlining and removing the optimization barrier I highly doubt crypto-bigint will have any measurable effect. By caching blocks into stack you would be able to use aligned loads, but the main improvement is to keep data in registers without spilling it anywhere and you would not achieve it using this approach.

I hope to alleviate some issues in the new trait versions. It introduces slice-based block-level traits for hashes/MACs/universal hashes, hides chunk size from public API and instead uses callback-based methods. Not only should it help with inlinining, but also effectively inverses control over iteration. In other words, iteration over blocks is controlled not by higher-level code which combines primitives, but at the cipher level. It means that we can branch once per loop, instead of doing it every chunk (compiler currently is unable to optimize it automatically). Also it means that callbacks (which are used for passing blocks to MAC) are executed in the context with enabled target features and known chunk size.

Unfortunately this approach is still far from ideal. Roughly it results in the following code:

if is_aesni_available() {
    for chunk in blocks.chunks_exact_mut(AESNI_CHUNK) {
        aesni_encrypt(chunk);
        if is_pclmul_available() {
            pclumul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
} else {
    for chunk in blocks.chunks_exact_mut(DEFAULT_CHUNK) {
        default_encrypt(chunk);
        if is_pclmul_available() {
            pclumul_mac(chunk)
        } else {
            default_mac(chunk)
        }
    }
}

In other words, if cipher backend does not cover required features for MAC backend we still have the optimization barrier on our hands.

from aeads.

tarcieri avatar tarcieri commented on July 17, 2024

@newpavlov have you actually tested that the optimizations you expect actually work out in practice, especially considering things like traits defined in two crates, being consumed by a third, where the first crate is using _mm_storeu_si128 and the second is using _mm_loadu_si128? I haven't myself, but I'm skeptical about the degree of inlining which can occur in that sort of 3-crate scenario (which really ends up being more like 6 when you add in the trait crates), especially with a crate as Rust's unit of compilation.

To set a baseline for maximum performance, I think we could move things like the CPU feature tests into the AEAD crates like aes-gcm/chacha20poly1305, and expose a couple/few sets of x86-64 and ARM-specific APIs in crates like aes, chacha20, ghash/polyval, and poly1305 which operate in terms of SIMD registers. We don't even have to ship that, we just need to see what the performance difference is.

Once we're reasonably certain of what a performant implementation looks like, we can experiment with various abstractions, although I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.

I feel like there are slice-based abstractions missing from universal-hash, similar to the ones I added to address RustCrypto/traits#332, which would also address the problem. I'm not sure we really need any sort of automatic constraint solver to pass around appropriately-sized chunks. We're already writing code which is explicit about the various platforms we support and backends which are detected at runtime, so we can program in an appropriate size for each of those scenarios since they're already factored into relevant modules explicitly. And really, in practice that size is dictated by the stream cipher, at least in the cases we currently care about.

Glossing over a few things, in practice I think the optimal block sizes look like the following:

AES-GCM / AES-GCM-SIV

  • x86/x86_64 w\ AES-NI + CLMUL: 8 x 128-bit blocks
  • ARMv8 w\ crypto extensions: 8 x 128-bit blocks
  • 64-bit portable: 4 x 128-bit blocks
  • 32-bit portable: 2 x 128-bit blocks

ChaCha20Poly1305

  • x86/x86_64 w\ AVX2: 4 x 128-bit blocks
  • x86/x86_64 w\ SSE: 2 x 128-bit blocks
  • ARM w\ NEON: 2 x 128-bit blocks(?)
  • Portable: 1 x 128-bit block

from aeads.

newpavlov avatar newpavlov commented on July 17, 2024

@tarcieri

have you actually tested that the optimizations you expect actually work out in practice

No, I only played a bit with small snippets in godbolt. We may need to abuse #[inline(always)] to achieve this optimization in practice.

I'm still a bit unsold on the changes in RustCrypto/traits#727, or at the very least they seem complicated and unclear to me.

I am myself far from 100% happy with the result, but right now I don't see a better path forward and, compared to the current design, I think it's definitely an improvement. Could you please comment in the PR on elements which you don't like or do not fully understand? I would appreciate your feedback sooner than later, since I hope to finalize it in the near future.

in practice that size is dictated by the stream cipher, at least in the cases we currently care about.

I agree, this is why the callbacks in my PR are only done on the cipher side, while MACs and universal hashes are left with the slice-based methods. But we are still left with the problem of target feature branching inside chunk iteration. Even if we are to check redundant features such as CLMUL in aes, I don't think that compiler will be able to remove branches in our cpufeatures-based code.

Also do not forget that code with enabled target features can not be currently inlined at all, so we definitely should strive to have chunk processing inside context with same target features.

from aeads.

tarcieri avatar tarcieri commented on July 17, 2024

Could you please comment in the PR on elements which you don't like or do not fully understand?

Left a comment on the PR. Just generally I'm confused what is happening there.

But we are still left with the problem of target feature branching inside chunk iteration.

That's why I was suggesting exposing low-level architecture-specific APIs to optimize passing data between ciphers and UHFs.

Then the check can be performed at the level of the entire AEAD, once, at the time the AEAD is initialized, and branched upon at the granularity of large AEAD operations.

The fast path for the entire core can occur within #[target_feature(...)] annotated code which is amenable to inlining, with data flowing in the form of SIMD register types which don't need to rely on inlining for performance, since they're the desired type to begin with and there's no type conversions that need to be optimized away.

from aeads.

newpavlov avatar newpavlov commented on July 17, 2024

After numerous experiments, I think I've found a good solution to this problem could look like, but, unfortunately, it's blocked on lack of rank-2 polymorphism in Rust. I wrote about it here: https://internals.rust-lang.org/t/15875 So I think the callback-based solution explored in the cipher v0.4 PRs is the best option which we have right now.

from aeads.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.