faster

SIMD for Humans

Easy, powerful, portable, absurdly fast numerical calculations. Includes static dispatch with inlining based on your platform and vector types, zero-allocation iteration, vectorized loading/storing, and support for uneven collections.

It looks something like this:

let lots_of_3s = (&[-123.456f32; 128][..]).simd_iter()
    .simd_map(f32s(0.0), |v| {
        f32s(9.0) * v.abs().sqrt().rsqrt().ceil().sqrt() - f32s(4.0) - f32s(2.0)
    })
    .scalar_collect();

Which is analogous to this scalar code:

let lots_of_3s = (&[-123.456f32; 128][..]).iter()
    .map(|v| {
        9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
    })
    .collect::<Vec<f32>>();
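
To see why the result is called lots_of_3s, the scalar pipeline can be run on its own with only the standard library. This is a minimal sketch: the function name `lots_of_3s` and the `main` wrapper are illustrative and are not part of faster.

```rust
/// Scalar reference for the pipeline above (hypothetical helper, std only).
fn lots_of_3s(input: &[f32]) -> Vec<f32> {
    input
        .iter()
        .map(|v| {
            // |-123.456| = 123.456; sqrt gives ~11.11; sqrt().recip() (the
            // scalar stand-in for rsqrt) gives ~0.3; ceil gives 1.0; sqrt of
            // 1.0 is 1.0; and 9.0 * 1.0 - 4.0 - 2.0 = 3.0.
            9.0 * v.abs().sqrt().sqrt().recip().ceil().sqrt() - 4.0 - 2.0
        })
        .collect()
}

fn main() {
    let out = lots_of_3s(&[-123.456f32; 128]);
    // Every one of the 128 inputs maps to exactly 3.0.
    assert!(out.iter().all(|&x| x == 3.0));
    println!("{} elements, first = {}", out.len(), out[0]);
}
```

The `ceil()` step is what makes the result exact: it snaps the intermediate value to 1.0, so rounding differences between a precise and an approximate reciprocal square root should not change the output for this input.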

The vector size is determined entirely by the target you’re compiling for: faster attempts to use the largest vector size that target supports, and it works on any platform or architecture (see below for details).
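
The idea of choosing a vector width at compile time, and of coping with collections whose length is not a multiple of that width, can be sketched with the standard library alone. This is not how faster is implemented. `f32_lanes`, `LANES`, and `map_in_lanes` are hypothetical names, and the scalar tail loop is just one common way to handle leftovers.

```rust
/// Number of f32 lanes in the widest vector the build target enables.
/// Hypothetical helper; the result is fixed at compile time via `cfg!`.
const fn f32_lanes() -> usize {
    if cfg!(target_feature = "avx512f") {
        16 // 512-bit vectors
    } else if cfg!(target_feature = "avx") {
        8 // 256-bit vectors
    } else if cfg!(target_feature = "sse") {
        4 // 128-bit vectors
    } else {
        1 // no vector extension enabled: fall back to scalar
    }
}

const LANES: usize = f32_lanes();

/// Applies `f` to every element, walking the slice in LANES-sized chunks.
fn map_in_lanes(input: &[f32], f: impl Fn(f32) -> f32) -> Vec<f32> {
    let mut out = Vec::with_capacity(input.len());
    let chunks = input.chunks_exact(LANES);
    // Elements left over when the length is not a multiple of LANES.
    let tail = chunks.remainder();
    for chunk in chunks {
        // A SIMD implementation would load this chunk into one register;
        // here the work is done element by element for illustration.
        out.extend(chunk.iter().map(|&x| f(x)));
    }
    out.extend(tail.iter().map(|&x| f(x)));
    out
}

fn main() {
    let doubled = map_in_lanes(&[1.0, 2.0, 3.0, 4.0, 5.0], |x| x * 2.0);
    println!("lanes = {}, result = {:?}", LANES, doubled);
}
```

Building with a different `-C target-cpu` changes `LANES` without touching the calling code, which is the property the paragraph above describes.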

Compare this to traditional explicit SIMD:

use std::mem::transmute;
use stdsimd::{f32x4, f32x8};

let lots_of_3s = &mut [-123.456f32; 128][..];

if cfg!(all(not(target_feature = "avx"), target_feature = "sse")) {
    for ch in lots_of_3s.chunks_mut(4) {
        let mut v = f32x4::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x4::splat(scalar_abs_mask);
        // There isn't actually an absolute value intrinsic for floats - you
        // have to look at the IEEE 754 spec and do some bit flipping
        v = unsafe { _mm_and_ps(v, abs_mask) };
        v = unsafe { _mm_sqrt_ps(v) };
        v = unsafe { _mm_rsqrt_ps(v) };
        // Note: _mm_ceil_ps is an SSE4.1 intrinsic, so plain SSE is not enough
        v = unsafe { _mm_ceil_ps(v) };
        v = unsafe { _mm_sqrt_ps(v) };
        // The arithmetic intrinsics take vectors, so scalars must be splatted
        v = unsafe { _mm_mul_ps(v, f32x4::splat(9.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(4.0)) };
        v = unsafe { _mm_sub_ps(v, f32x4::splat(2.0)) };
        v.store(ch, 0);
    }
    }
} else if cfg!(all(not(target_feature = "avx512f"), target_feature = "avx")) {
    for ch in lots_of_3s.chunks_mut(8) {
        let mut v = f32x8::load(ch, 0);
        let scalar_abs_mask = unsafe { transmute::<u32, f32>(0x7fffffff) };
        let abs_mask = f32x8::splat(scalar_abs_mask);
        v = unsafe { _mm256_and_ps(v, abs_mask) };
        v = unsafe { _mm256_sqrt_ps(v) };
        v = unsafe { _mm256_rsqrt_ps(v) };
        v = unsafe { _mm256_ceil_ps(v) };
        v = unsafe { _mm256_sqrt_ps(v) };
        // The arithmetic intrinsics take vectors, so scalars must be splatted
        v = unsafe { _mm256_mul_ps(v, f32x8::splat(9.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(4.0)) };
        v = unsafe { _mm256_sub_ps(v, f32x8::splat(2.0)) };
        v.store(ch, 0);
    }
    }
}

Even with all of that boilerplate, this still only supports x86-64 machines with SSE or AVX, and you have to look up each intrinsic to ensure it’s usable for your compilation target.
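
The "bit flipping" mentioned in the comment of the explicit SIMD example is easier to follow in scalar form. The sketch below clears the IEEE 754 sign bit of a single `f32` with the same `0x7fffffff` mask. `abs_via_mask` is a hypothetical helper name, and it uses the safe `f32::to_bits` / `f32::from_bits` conversions from the standard library rather than `transmute`.

```rust
/// Absolute value of an f32 computed by clearing the sign bit (bit 31).
/// Hypothetical helper for illustration; `f32::abs` is what you would
/// normally call.
fn abs_via_mask(x: f32) -> f32 {
    // 0x7fff_ffff keeps the exponent and mantissa and zeroes the sign bit.
    f32::from_bits(x.to_bits() & 0x7fff_ffff)
}

fn main() {
    for &x in &[-123.456f32, 0.5, -0.0, f32::NEG_INFINITY] {
        // The masked result matches the standard library bit for bit.
        assert_eq!(abs_via_mask(x).to_bits(), x.abs().to_bits());
    }
    println!("abs_via_mask(-123.456) = {}", abs_via_mask(-123.456));
}
```

The vector version applies the same mask to every lane at once, which is what `_mm_and_ps(v, abs_mask)` does in the example above.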

Upcoming Features

A rewrite of the iterator API is upcoming, as well as internal changes to better match the direction Rust is taking with explicit SIMD.

Compatibility

Faster currently supports any architecture with floating point support, although hardware acceleration is only enabled on machines with x86’s vector extensions.

Performance

Here are some extremely unscientific benchmarks which, at least, show that SIMD iterators are usually no worse than scalar iterators. Even on ancient CPUs, a lot of performance can be extracted out of SIMD. Surprisingly, SIMD iterators outperform scalar iterators in most of these benchmarks even when targeting the SSE-less Pentium, where only the polyfills run.

However, intentionally worsening your program’s locality for SIMD (as seen in tests::bench_determinant2 and tests::bench_determinant3) is not a worthwhile tradeoff unless you are doing a significant amount of work per vector.

$ RUSTFLAGS="-C target-cpu=ivybridge" cargo bench # host is ivybridge; target has AVX
test tests::bench_determinant2_scalar ... bench:         391 ns/iter (+/- 2)
test tests::bench_determinant2_simd   ... bench:         375 ns/iter (+/- 1)
test tests::bench_determinant3_scalar ... bench:         350 ns/iter (+/- 1)
test tests::bench_determinant3_simd   ... bench:         470 ns/iter (+/- 2)
test tests::bench_map_scalar          ... bench:       6,952 ns/iter (+/- 26)
test tests::bench_map_simd            ... bench:         875 ns/iter (+/- 3)
test tests::bench_map_uneven_simd     ... bench:         880 ns/iter (+/- 3)
test tests::bench_nop_scalar          ... bench:          37 ns/iter (+/- 0)
test tests::bench_nop_simd            ... bench:          34 ns/iter (+/- 0)
test tests::bench_reduce_scalar       ... bench:       6,876 ns/iter (+/- 16)
test tests::bench_reduce_simd         ... bench:         835 ns/iter (+/- 2)
test tests::bench_reduce_uneven_simd  ... bench:         836 ns/iter (+/- 2)
test tests::bench_zip_nop_scalar      ... bench:         624 ns/iter (+/- 2)
test tests::bench_zip_nop_simd        ... bench:         361 ns/iter (+/- 1)
test tests::bench_zip_scalar          ... bench:         862 ns/iter (+/- 4)
test tests::bench_zip_simd            ... bench:         771 ns/iter (+/- 2)

$ RUSTFLAGS="-C target-cpu=x86-64" cargo bench # host is ivybridge; target has SSE2
test tests::bench_determinant2_scalar ... bench:         426 ns/iter (+/- 2)
test tests::bench_determinant2_simd   ... bench:         376 ns/iter (+/- 2)
test tests::bench_determinant3_scalar ... bench:         355 ns/iter (+/- 2)
test tests::bench_determinant3_simd   ... bench:         486 ns/iter (+/- 3)
test tests::bench_map_scalar          ... bench:       7,157 ns/iter (+/- 59)
test tests::bench_map_simd            ... bench:       1,886 ns/iter (+/- 10)
test tests::bench_map_uneven_simd     ... bench:       1,889 ns/iter (+/- 11)
test tests::bench_nop_scalar          ... bench:          38 ns/iter (+/- 0)
test tests::bench_nop_simd            ... bench:          34 ns/iter (+/- 0)
test tests::bench_reduce_scalar       ... bench:       7,002 ns/iter (+/- 29)
test tests::bench_reduce_simd         ... bench:       1,865 ns/iter (+/- 10)
test tests::bench_reduce_uneven_simd  ... bench:       1,937 ns/iter (+/- 7)
test tests::bench_zip_nop_scalar      ... bench:         623 ns/iter (+/- 1)
test tests::bench_zip_nop_simd        ... bench:         333 ns/iter (+/- 3)
test tests::bench_zip_scalar          ... bench:         971 ns/iter (+/- 5)
test tests::bench_zip_simd            ... bench:         525 ns/iter (+/- 3)

$ RUSTFLAGS="-C target-cpu=pentium" cargo bench # host is ivybridge; this only runs the polyfills!
test tests::bench_determinant2_scalar ... bench:         427 ns/iter (+/- 2)
test tests::bench_determinant2_simd   ... bench:         402 ns/iter (+/- 1)
test tests::bench_determinant3_scalar ... bench:         354 ns/iter (+/- 1)
test tests::bench_determinant3_simd   ... bench:         593 ns/iter (+/- 1)
test tests::bench_map_scalar          ... bench:       7,195 ns/iter (+/- 28)
test tests::bench_map_simd            ... bench:       6,271 ns/iter (+/- 22)
test tests::bench_map_uneven_simd     ... bench:       6,288 ns/iter (+/- 22)
test tests::bench_nop_scalar          ... bench:          38 ns/iter (+/- 0)
test tests::bench_nop_simd            ... bench:          69 ns/iter (+/- 0)
test tests::bench_reduce_scalar       ... bench:       7,004 ns/iter (+/- 17)
test tests::bench_reduce_simd         ... bench:       6,063 ns/iter (+/- 17)
test tests::bench_reduce_uneven_simd  ... bench:       6,107 ns/iter (+/- 11)
test tests::bench_zip_nop_scalar      ... bench:         623 ns/iter (+/- 2)
test tests::bench_zip_nop_simd        ... bench:         289 ns/iter (+/- 1)
test tests::bench_zip_scalar          ... bench:         972 ns/iter (+/- 3)
test tests::bench_zip_simd            ... bench:         621 ns/iter (+/- 3)
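
To put the map benchmark in perspective, the ratio of scalar to SIMD time can be computed for each target. The sketch below uses only the bench_map_scalar and bench_map_simd figures reported above. `speedup` is a hypothetical helper, and the ratios inherit all the caveats of these unscientific measurements.

```rust
/// Ratio of scalar to SIMD time; values above 1.0 mean SIMD was faster.
fn speedup(scalar_ns: f64, simd_ns: f64) -> f64 {
    scalar_ns / simd_ns
}

fn main() {
    // (target, bench_map_scalar ns/iter, bench_map_simd ns/iter)
    let runs = [
        ("ivybridge (AVX)", 6952.0, 875.0),
        ("x86-64 (SSE2)", 7157.0, 1886.0),
        ("pentium (polyfills)", 7195.0, 6271.0),
    ];
    for (target, scalar_ns, simd_ns) in runs.iter() {
        println!("{}: {:.2}x", target, speedup(*scalar_ns, *simd_ns));
    }
}
```

The ratios come out at a little under 8x with AVX, a little under 4x with SSE2, and a little over 1x with the polyfills, roughly tracking the 8, 4, and 1 `f32` lanes available on each target.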
