
faster's People

Contributors

adamniederer, alecmocatta, andersk, dfockler, ebarnard, matklad, osveron, petrochenkov, ralfbiedert, titaniumtown, tivervac, vorner


faster's Issues

Change `simd_map(Fn)` to `simd_map(FnMut)`?

When trying to work with the 0.3 PackedIterator I noticed simd_map and simd_reduce take a Fn.

Would it make sense to relax this requirement to FnMut, so that the closures could modify their environment?

Rust's std::iter::Iterator mostly uses FnMut, and for my current use case (iterating a second simd_iter alongside another one) I would like to cause "side effects".

I assume my actual problem will be addressed in 0.4, but I could imagine people might have other reasons they could need this.
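For reference, std's map already behaves the way this proposes. A minimal scalar sketch (plain std iterators, not faster's API) of a closure mutating its environment, which a Fn bound would reject:

```rust
// std's Iterator::map takes FnMut, so the closure may mutate captured
// state; this is the behavior the issue asks simd_map/simd_reduce to allow.
pub fn doubled_with_count(xs: &[i32]) -> (Vec<i32>, usize) {
    let mut count = 0; // environment mutated from inside the closure
    let doubled = xs.iter().map(|x| { count += 1; x * 2 }).collect();
    (doubled, count)
}
```

If `map` required `Fn`, the `count += 1` line would fail to compile.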

Can't build for nightly Rust

Hi Adam,

Your library uses stdsimd as an external dependency crate, but stdsimd landed in Rust's std a while ago.
It also uses an old target_feature attribute syntax that the compiler complains about.
Do you plan on updating your library to compile with nightly Rust any time soon?
If not, could you advise me on the necessary changes to make in order to use your library, since I am fairly new to Rust?

LLVM errors

I am seeing errors with debug compilation:

...>cargo build
   Compiling stdsimd v0.0.3
   Compiling faster v0.1.3 (file:.../faster_master)
LLVM ERROR: Cannot select: t7: v16i8 = X86ISD::ABS t3
  t3: v16i8,ch = CopyFromReg t0, Register:v16i8 %vreg45
    t2: v16i8 = Register %vreg45
In function: _ZN6faster4main28_$u7b$$u7b$closure$u7d$$u7d$17h9a8972f3fea441f3E
error: Could not compile `faster`.

And release compilation:

...>cargo build --release
   Compiling faster v0.1.3 (file:.../faster_master)
LLVM ERROR: Cannot select: t144: v16i8 = X86ISD::ABS t158
  t158: v16i8 = bitcast t157
    t157: v2i64,ch = load<LD16[ConstantPool]> t0, t160, undef:i64
      t160: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<16 x i8> <i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10>> 0
        t159: i64 = TargetConstantPool<<16 x i8> <i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10>> 0
      t12: i64 = undef
In function: _ZN6faster4main17hb45e5f709eed73c2E
error: Could not compile `faster`.

The compiler is the last Nightly:

rustc 1.23.0-nightly (3b82e4c74 2017-11-05)
binary: rustc
commit-hash: 3b82e4c74d43fc1273244532c3a90bf9912061cf
commit-date: 2017-11-05
host: x86_64-pc-windows-gnu
release: 1.23.0-nightly
LLVM version: 4.0

No obvious way to xor byte stream with [u8; 4]

I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster iterators API.

The part I'm having trouble with is getting an input [u8; 4] to cycle within a SIMD vector of u8. I have looked at:

  1. load() which does accept &[u8] as input, but its behavior in case of length mismatch is completely undocumented. It's also not obvious what offset parameter does.
  2. Casting the input [u8; 4] to u32, calling vecs::u32s() and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do nothing like what I want.
  3. Getting a SIMD vector of length 4 and arbitrary type inside it, load [u8; 4] into it (lengths now match, so it should work) then downcast repeatedly until I get a vector of u8 with arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 and arbitrary type.
  4. After over an hour of head-scratching I've noticed that From<u32x4> is implemented for u8x16, so I could replace Downcast with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.

I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me a snippet in examples/ folder that loads [u8; 4] into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.
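For what it's worth, staying entirely in the byte domain sidesteps the endianness question from point 4. A scalar sketch of the broadcast-then-XOR shape (not faster's API; a SIMD backend would splat these 16 bytes into one u8 vector):

```rust
/// Unmask a WebSocket payload by XORing with a repeating 4-byte key.
pub fn xor_unmask(payload: &mut [u8], key: [u8; 4]) {
    // Broadcast the key across a 16-byte block (one SSE register's worth).
    let mut block = [0u8; 16];
    for (i, b) in block.iter_mut().enumerate() {
        *b = key[i % 4];
    }
    let mut chunks = payload.chunks_exact_mut(16);
    for chunk in &mut chunks {
        for (dst, m) in chunk.iter_mut().zip(&block) {
            *dst ^= m;
        }
    }
    // 16 is a multiple of 4, so the key phase restarts cleanly in the tail.
    for (i, dst) in chunks.into_remainder().iter_mut().enumerate() {
        *dst ^= key[i % 4];
    }
}
```

Because XOR is its own inverse, applying the function twice round-trips the data.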

Compatibility with Rayon

It wasn’t clear to me if faster can also work with rayon. I tried the simple example both with into_* and placing par_iter before and after simd_iter with no luck. So it looks to me like this isn’t a supported combination right now, but if it is, an example would be nice.

Thanks!
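The combination people usually reach for is chunk-level parallelism outside, SIMD inside: split the slice into large chunks, hand the chunks to rayon, and run the vectorized kernel per chunk. A std-only sketch of that split (replacing `chunks` with rayon's `par_chunks` is the hypothetical integration point; none of this is a documented faster feature):

```rust
/// Sum a slice by chunking for threads and vectorizing within chunks.
pub fn sum_chunked(data: &[f32]) -> f32 {
    data.chunks(4096)                            // rayon: data.par_chunks(4096)
        .map(|chunk| chunk.iter().sum::<f32>())  // inner loop: simd_iter territory
        .sum()
}
```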

Compiling `rust-2018-migration` for `aarch64` and `wasm`

Hi,

I am trying to port a project to aarch64 and wasm using the rust-2018-migration branch. As of today I receive lots of:

2 | use crate::vektor::x86_64::*;
  |                    ^^^^^^ Could not find `x86_64` in `vektor`

Ideally, faster would have fallbacks for not-yet supported architectures. That way I could just write my SIMD code once using the provided API, instead of having two separate implementations.

Do you have any short-term plans of making such a fallback available for the 2018 version?

Also, while I am not a Rust expert, I have 1 - 2 days to look into this myself. If you think it's feasible to outline a solution you prefer, I'd be happy to try to help you out.

PMOVMSKB operation equivalent

It would be useful to me if there were an operation like fn bytemask<T: Packed>(v: T) -> usize that set the bits of the usize to 1 where the packed vector's bytes are all 1, corresponding to the asm PMOVMSKB or VPMOVMSKB. I don't know how possible this would be for faster, but it would be nice.
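A scalar model of the requested semantics (on x86, `_mm_movemask_epi8` in `core::arch` computes exactly this in one instruction; the `bytemask` name is the issue's proposal, not an existing faster API):

```rust
/// Bit i of the result is the top bit of byte i of the vector, which is
/// what PMOVMSKB computes; all-ones bytes therefore set their bit.
pub fn bytemask(v: &[u8; 16]) -> u32 {
    v.iter()
        .enumerate()
        .map(|(i, &b)| ((b >> 7) as u32) << i)
        .sum()
}
```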

Expose `width` on vec types and provide sum() ability.

I have code where I want to add / multiply two vectors and compute the sum over all results. It looks somewhat like this:

let mut simd_sum = f32s::splat(0.0f32); 

for (x, y) in zip(xvec.simd_iter(), yvec.simd_iter()) { 
    simd_sum = simd_sum + (x - y) * (x - y);
}

// TODO ... 
let sum = sum_f32s(simd_sum, simd_width);

I haven't found a way to either a) sum simd_sum (e.g., sum = simd_sum.0 + simd_sum.1 + ... simd_sum.n) with an existing function, or b) generically implement my own sum function.

The problem I'm having with implementing sum_f32s myself is that I haven't seen an easy way to get the current width. The hack I'm using looks like this:

let _temp = [0.0f32; 32];
let simd_width = (&_temp[..]).simd_iter().width();

It would be nice if both a) and b) were implemented, or, if they are already, documented how to use them.
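For a), the usual shape of a horizontal sum is log2(width) pairwise additions; a sketch over a plain 8-lane array standing in for f32s (illustrative only, not an existing faster function):

```rust
/// Horizontal sum of a vector's lanes via tree reduction: halve the
/// active width each step, adding the upper half onto the lower half.
pub fn hsum(mut v: [f32; 8]) -> f32 {
    let mut n = 8;
    while n > 1 {
        n /= 2;
        for i in 0..n {
            v[i] += v[i + n]; // fold the upper half onto the lower half
        }
    }
    v[0]
}
```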

"no-std" feature is not additive and should instead be "std"

The way features work in cargo is that each crate in the dependency graph is built a single time with the union of all requested features from each dependent crate. For this reason, enabling a feature should never remove items or change a function's signature; it must only add items.

As an example of how the current setup can go wrong: Suppose a library depends on faster and enables the no-std feature because it doesn't need std, while another crate enables no features and uses the SIMD iterator impls for Vec. If an application depends on both of them, faster gets compiled with no-std, and the second crate will fail to compile.

The recommendation is to have a std feature that is enabled by default:

[features]
default = ["std"]
std = []

(note: the specific name "std" is recommended by the Rust API Guidelines (see C-FEATURE))
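On the library side, the additive layout would look like this (a sketch of the convention, not faster's current code):

```rust
// lib.rs: opt out of std only when the default "std" feature is off,
// so enabling the feature only ever adds items.
#![cfg_attr(not(feature = "std"), no_std)]

// Anything that needs an allocator is gated on the feature:
#[cfg(feature = "std")]
pub fn scalar_collect(xs: &[u8]) -> Vec<u8> {
    xs.to_vec()
}
```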

Performance inspector

Faster should have a way to hint at performance issues.

From a user's point of view it is hard to predict whether SIMD intrinsics are issued, or a fallback implementation is used. Ideally there should be a method, macro, compiler flag, ... to enable performance metrics or warnings.

In the most simple form, each fallback implementation could call something like fallback!, which in turn could be enabled or disabled with a compile feature. That macro could then emit warnings (maybe even with file names and line numbers).

As a user I could then either change my code, file an issue, or implement the missing intrinsic myself.

Collaboration tools (rustfmt, clippy)

After my last PR I noticed you had to clean up a bit. That made me wonder if it makes sense to configure and use clippy and rustfmt:

  • Since clippy is probably less controversial I went ahead and addressed all current issues, either by changing code (where I thought clippy made sense), or disabling lints (e.g., where faster bends rules for speed). Some of the more pedantic lints could be discussed, e.g., usage and formatting of number literals (unreadable_literal, unseparated_literal_suffix). https://github.com/ralfbiedert/faster/tree/clippy

  • I think rustfmt makes sense as well, but needs more configuration to resonate with the code. I found a few settings that worked for me (e.g., max_width set rather high so as not to break up most macros, which would make them harder to read). However, you should probably take the lead on that one.

Let me know what you think about clippy, in particular unreadable_literal, unseparated_literal_suffix and type_complexity (I prefer the former 2, no opinion on 3rd). I can then create another PR.

Run-time feature detection

Currently the vector size and the SIMD instructions to be used are fixed at compile-time. This is a good first step.

It allows me to write an algorithm once, and by changing a compiler-flag generate different versions of this algorithm for different target architectures (with different vector sizes, SIMD instructions, etc.).

I really like to be able to do this at compile-time within the same binary as well, so that I can write:

// a generic algorithm
fn my_generic_algorithm<T: SimdVec>(x: T, y: T) {
   // generic simd operations
}

and monomorphize it for different instruction sets:

let x: VecAvx;
my_generic_algorithm(x, x); // AVX version
let y: VecSSE42;
my_generic_algorithm(y, y); // SSE42 version

That way input.simd_iter().map(my_generic_algorithm) could monomorphize my_generic_algorithm for SSE, SSE42, AVX, and AVX2, perform run-time feature detection to find the best instruction set a given CPU supports, and dispatch to that one.
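A sketch of that dispatch pattern with today's tools — `SimdVec`, `Sse`, and `Avx` here are stand-ins, not faster types; the only real API used is std's `is_x86_feature_detected!` macro:

```rust
/// Stand-in trait for the hypothetical `SimdVec` in the proposal.
trait SimdVec { const LANES: usize; }
struct Sse;
struct Avx;
impl SimdVec for Sse { const LANES: usize = 4; }
impl SimdVec for Avx { const LANES: usize = 8; }

/// One generic algorithm, monomorphized per instruction set; here it
/// just reports the number of full vector iterations as a placeholder.
fn my_generic_algorithm<T: SimdVec>(len: usize) -> usize {
    len / T::LANES
}

/// Run-time dispatch to the best available instantiation.
pub fn dispatch(len: usize) -> usize {
    #[cfg(target_arch = "x86_64")]
    let have_avx = is_x86_feature_detected!("avx");
    #[cfg(not(target_arch = "x86_64"))]
    let have_avx = false;
    if have_avx {
        my_generic_algorithm::<Avx>(len)
    } else {
        my_generic_algorithm::<Sse>(len)
    }
}
```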

No std?

Hi!

This crate looks great, and I was wondering whether it could have a use_std feature (on by default) so that it could be used in no_std environments. Coresimd, which looks like the core counterpart to stdsimd, could be perhaps used?

Arbitrary collect support

I noticed that some methods return Vecs — in particular, stride and scalar_collect. I was wondering if it made sense to expose generic versions that could use an arbitrary FromIterator type. My motivation for this is:

  • It would allow use in no_std environments, for example with arrayvec.
  • It would allow storing the results on the stack, again with arrayvec or with smallvec (to avoid allocations for the small cases while falling back on it in the big ones when the allocation cost is amortized).

The original methods could still be preserved, as wrappers with Vec implementation. Does it make sense? Should I send a pull request for that?
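The shape being proposed, sketched on plain slices (hypothetical names and signatures, not faster's current API):

```rust
use std::iter::FromIterator;

/// Generic collect: any FromIterator container works, so arrayvec's
/// ArrayVec or smallvec's SmallVec could be used in no_std builds.
pub fn scalar_collect_into<C: FromIterator<f32>>(xs: &[f32]) -> C {
    xs.iter().copied().collect()
}

/// The original Vec-returning method survives as a thin wrapper.
pub fn scalar_collect(xs: &[f32]) -> Vec<f32> {
    scalar_collect_into(xs)
}
```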

Surprising benchmark numbers

While working on #47 I noticed what looks like performance regressions in the cargo bench, in particular functions like map_simd and map_scalar, but quite a few others.

test tests::map_scalar                                ... bench:       2,022 ns/iter (+/- 264)
test tests::map_simd                                  ... bench:       6,898 ns/iter (+/- 392)

However, comparing #49 to the commit before the refactoring, the numbers are mostly unchanged.

I then assumed it's related to unfortunate default feature flags on my machine, but playing with avx2 and sse4.1 didn't have any effect either. I also have a first implementation of #48, and it actually looks like no fallbacks are emitted for map_simd. (I tried to cross-check that with radare2, but had some problems locating the right symbol / disassembly for the benchmarks.) Lastly, the functions map_scalar and map_simd differ a bit, but even when I make them equal (e.g., sqrt vs. rsqrt) the difference remains.

  • Is that a "known issue"?
  • Did rustc become so good at auto-vectorization?
  • Any suggestions how to extract the disassembly from tests::map_simd and tests::map_scalar?

Running on rustc 1.29.0-nightly (9fd3d7899 2018-07-07), MBP 2015, i7-5557U.

Update: I linked the latest faster version from my SVM library and I don't see these problems in 'production':

csvm_predict_sv1024_attr1024_problems1 ... bench:     232,109 ns/iter (+/- 20,808) [faster AVX2]
csvm_predict_sv1024_attr1024_problems1 ... bench:     942,925 ns/iter (+/- 64,156) [scalar]

Update 2 Seems to be related to some intrinsics. When I dissect the benchmark, I get

test tests::map_scalar                                ... bench:         558 ns/iter (+/- 55) [without .abs()]
test tests::map_scalar                                ... bench:         556 ns/iter (+/- 33) [with .abs()]
test tests::map_simd                                  ... bench:         144 ns/iter (+/- 17) [without .abs()]
test tests::map_simd                                  ... bench:         883 ns/iter (+/- 64) [with .abs()]

I now think that each intrinsic should have its own benchmark, e.g. intrinsic_abs_scalar, intrinsic_abs_simd, ...

Update 3 ... oh boy. I think that by "arcane magic" Rust imports and prefers std::simd::f32x4 and friends over the faster types and methods.

So when you do my_f32s.abs(), it calls std::simd::f32x4::abs, not faster::arch::current::intrin::abs.

The reason I think that's the problem is you can now easily do my_f32s.sqrte(), which isn't implemented in faster, but in std::simd.

What's more annoying is that it doesn't warn about any collision, and that std::simd is actually slower than "vanilla" Rust.

TODO:

  • Investigate import tree why that happens
  • Clean up imports if import problem
  • Have single-intrinsic benchmarks to detect bad intrinsics
  • Have Rust warn somehow if similar name conflict happens again?
  • Remove all usages of #![feature(stdsimd)] except in lib.rs

Update 4 Now one more thing makes sense ... I sometimes got use of unstable library feature 'stdsimd' in test cases and I didn't understand why. Probably because that's where the std::simd built-ins were used.

Package still active?

This crate is listed by the official docs for SIMD operations, yet the current version 0.5.0 gives tons of compile errors (e.g., unresolved symbols). Is this crate still active?

Support for Rayon integration

Rayon supports parallel iterators/map functions that process data using multiple threads. How can we integrate with rayon so we can leverage both SIMD and thread-level parallelism?

how to accomplish `zip_mut_with` in faster?

hey,

I have spent a few minutes trying unsuccessfully to write a simd version of code that uses ndarray's zip_mut_with - is there currently a way of combining a simd_iter_mut and simd_iter with zip?

Essentially, I'd like to modify the elements of an array in place with an operation that uses the values in another array of the same shape.

I tried this:

let mut xs = ndarray::Array::from_elem((64,), 0.0f32);
let ys = ndarray::Array::from_elem((64,), 1.0f32);
(xs.as_slice_mut().unwrap().simd_iter_mut(f32s(0.0)),
 ys.as_slice().unwrap().simd_iter(f32s(0.0)))
    .zip()
    .simd_for_each(|(x, y)| {
        *x += *y; // or whatever, not sure if this is correct for the closure
    });

and got this error:

error[E0599]: no method named `simd_for_each` found for type `faster::Zip<(faster::SIMDIter<&mut [f32]>, faster::SIMDIter<&[f32]>)>` in the current scope
   --> src/mlp/m3r.rs:298:26
    |
298 |                         .simd_for_each(|(z, b)| {
    |                          ^^^^^^^^^^^^^
...

Any recommendations for how to proceed? First time using the library, apologies if this is an obvious question.

API

The current API is not bad, but it has some problems. Because this crate implements the standard Iterator trait, simd iterators can be used in for loops (and they currently are, for example in the benchmarks), but inside those loops there is no uneven-collection handling; that could be added, but it would reduce performance. Second, inside for loops there is no in-place mutation — you need to store manually. I think it would be better to have an internal-iterator trait and implement functionality on top of it; that would also allow dropping the simd prefix on iterator methods, making the API more similar to the standard library, with the exception of providing defaults. Thoughts?

Support for Complex Numbers

First of all thanks for making this library, the API looks really neat.
I'm currently working on a project where I need to crunch a lot of complex numbers (esp. Complex) and would love to use faster instead of hooking into external C libraries like VOLK.
Is support for complex types on your roadmap?

mapping a zipped iterator to produce tuples

I have a function which looks vaguely like this:

struct Rect { real: f64, imag: f64 }
struct KetRef<'a> { real: &'a [f64], imag: &'a [f64] }

impl<'a> KetRef<'a> {
    pub fn dot(self, other: KetRef) -> Rect {
        assert_eq!(self.real.len(), other.real.len());
        assert_eq!(self.real.len(), other.imag.len());
        assert_eq!(self.real.len(), self.imag.len());
        zip!(self.real, self.imag, other.real, other.imag)
            .map(|(ar, ai, br, bi)| {
                let real = ar * br + ai * bi;
                let imag = ar * bi - ai * br;
                Rect { real, imag }
            })
            .fold(Rect::zero(), |a,b| a + b)
    }
}

Converting it to use faster requires two passes over the arrays; I am unable to produce both real and imag in one pass because simd_map requires the function output to be a single vector:

pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
    use ::faster::prelude::*;

    let other = other.as_ket_ref();
    assert_eq!(self.real.len(), other.real.len());
    assert_eq!(self.real.len(), other.imag.len());
    assert_eq!(self.real.len(), self.imag.len());

    let real = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * br + ai * bi
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    let imag = (
        self.real.simd_iter(f64s(0.0)),
        self.imag.simd_iter(f64s(0.0)),
        other.real.simd_iter(f64s(0.0)),
        other.imag.simd_iter(f64s(0.0)),
    ).zip().simd_map(|(ar, ai, br, bi)| {
        ar * bi - ai * br
    }).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();

    Rect { real, imag }
}

So is it faster? Well, actually, yes! It is plenty faster... up to a point:

Change in run-time for different ket lengths
dot/16              change: -33.973%
dot/64              change: -29.575%
dot/256             change: -26.762%
dot/1024            change: -34.054%
dot/4096            change: -36.297%
dot/16384           change: -7.3379%

Yikes! Once we hit 16384 elements there is almost no speedup!

I suspect it is because at this point, memory has become the bottleneck, and most of what was gained by using SIMD was lost by making two passes over the arrays. It would be nice to have an API that allowed this to be done in one pass by allowing a mapping function to return a tuple (producing a new PackedZippedIterator or similar).

Simple .zip() produces `assertion failed: !a.0.is_none() && a.0.unwrap().1 ... `

Hi,

When I try to run the following code:

#[cfg(test)]
mod tests {
    use faster::{IntoPackedRefIterator, IntoPackedZip, PackedZippedIterator, PackedIterator, Packed, f64s};

    #[test]
    fn test_faster_simd() {
        
        let a = vec![ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ];
        let b = vec![ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ];
        
        let slice_a = &a[..];
        let slice_b = &b[..];
        
        let sum: f64 = (slice_a.simd_iter(), slice_b.simd_iter()).zip()
            .simd_map((f64s::splat(0f64), f64s::splat(0f64)), |(a,b)| (a - b) * (a - b))
            .simd_reduce(f64s::splat(0f64), f64s::splat(0f64), |a, v| a + v)
            .sum();

    }
}

I get:

---- kernel::rbf::tests::test_faster_simd stdout ----
	thread 'kernel::rbf::tests::test_faster_simd' panicked at 'assertion failed: !a.0.is_none() && a.0.unwrap().1 == a.0.unwrap().1 && !a.1.is_none() &&
    a.1.unwrap().1 == a.0.unwrap().1', /Users/rb/.cargo/registry/src/github.com-1ecc6299db9ec823/faster-0.4.0/src/zip.rs:306:0
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
   0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
   1: std::sys_common::backtrace::print
   2: std::panicking::default_hook::{{closure}}
   3: std::panicking::default_hook
   4: std::panicking::rust_panic_with_hook
   5: std::panicking::begin_panic
   6: ffsvm::kernel::rbf::tests::test_faster_simd
   7: <F as test::FnBox<T>>::call_box
   8: __rust_maybe_catch_panic

I'm using faster 0.4, and rust 1.25.0-nightly (4e3901d35 2018-01-23) and no target-feature. The same happens when I enable target-feature=+avx2. Unfortunately I don't really understand the assertion, so please let me know if there is something broken with the way I use faster.

From std::iter::Iterator to PackedIterator

Hi!

I think that it would be great to be able to call into_simd_iter() on regular Iterators, e.g.:

iter::repeat(42.0)
    .into_simd_iter()
    // ...

This allows SIMD to be applied when buffering the data into a vector would be inefficient and provides better interop with std.

Faster and std::simd

Opening another ticket since this is a separate discussion from #47 and might be more controversial:

The more I look into the upcoming std::simd, the more I wonder if faster should not become a thinner "SIMD-friendly iteration" library that neatly plugs into std::simd and is really good at handling variable slices, zipping, ... instead of providing a blanket implementation over std::arch.

Right now it seems that many common intrinsics and operations faster provides on packed types are or might be implemented in std::simd (compare coresimd/ppsv).

At the same time, for things that won't be in std::simd (and will be more platform specific), faster will have a hard time providing a consistent performance story anyway.

By that reasoning I see a certain appeal primarily focusing on a more consistent cross-platform experience with a much lighter code base (e.g., imagine faster without arch/ and intrin/ and using mostly std::simd instead of vektor).

Faster could also integrate std::arch specific functions and types, but rather as extensions and helpers (e.g., for striding) for special use cases, instead of using them as internal fundamentals.

faster floor/ceil/round in pre SSE 4.1 cases

If I am reading the code correctly, it looks like in the case of SSE2 Faster currently falls back to calling round()/floor() etc on each individual lane via the fallback macro.

You may be able to use these methods instead:
http://dss.stephanierct.com/DevBlog/?p=8

Or Agner Fog has a different method in his vector library:
http://www.agner.org/optimize/vectorclass.zip

edit:
Agner's functions are slower but can handle floating-point values that don't fit in an i32; the first functions only handle values that do fit in an i32.
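The core of the first linked technique, written per lane so it can be checked in isolation — each step has a direct SSE2 equivalent, and as noted above it only holds for values that fit in an i32 (a sketch of the idea, not faster's implementation):

```rust
/// SSE2-style floor for one lane: truncate toward zero
/// (_mm_cvttps_epi32), convert back (_mm_cvtepi32_ps), then a
/// compare + masked subtract fixes negative non-integers.
pub fn floor_via_trunc(x: f32) -> f32 {
    let t = x as i32 as f32;            // truncation toward zero
    if t > x { t - 1.0 } else { t }     // cmpgt + masked subtract in SIMD
}
```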

Incorrect count_ones result with -C target-cpu=native

use {faster::Popcnt, std::simd::u64x4};
u64x4::new(!0, !0, 0, 0).count_ones()

correctly returns 128 without -C target-cpu=native, and incorrectly returns 18446744073709551488 (i.e. !127) with -C target-cpu=native.

Status of AVX 512 ?

Hello, I want to check whether this crate works with AVX-512 instructions, and also whether it is recommended for use, given its inactivity.

Are there other crates? This one seems very good to me.

Thanks

f32x8 and f64x4 for AVX target

I'm currently on a platform that has AVX support but does not have AVX2 support, so currently faster on my platform only uses f32x4 and f64x2, but it should be able to use f32x8 and f64x4.

Renaming structs and traits

I saw you renamed a lot of structs and traits.

Since you are on this, I was wondering what you think about renaming Packed to SIMDVector, along with friends (e.g., Packable -> SIMDConvertable), or something similar.

Main reason is, when I started reading the source and tried to understand how all the Pack* concepts relate, I found your comment Packed // A SIMD vector of some type most helpful, and thought "If that's what it really is, why isn't it just named like that?"

I'm not proposing exactly these names, but think the crate's internals would be easier to follow if the names reflected "well known" concepts.

Failed to build faster version = "0.5.2" with rustc 1.53.0-nightly (07e0e2ec2 2021-03-24)

Hi, I'm trying to use faster in my project, but I can't seem to build it on my machine.
It does not seem to be a problem with Rust itself, but with the vektor crate.

error[E0425]: cannot find function `_mm_extract_pi16` in module `crate::myarch`
    --> /home/xinbg/.cargo/registry/src/mirrors.ustc.edu.cn-12df342d903acd47/vektor-0.2.2/src/x86/sse.rs:1292:28
     |
1292 |             crate::myarch::_mm_extract_pi16(crate::mem::transmute(a), $imm8)
     |                            ^^^^^^^^^^^^^^^^
...
1296 |     crate::mem::transmute(constify_imm8!(imm2, call))
     |                           -------------------------- in this macro invocation
     |
    ::: /home/xinbg/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:1419:1
     |
1419 | pub unsafe fn _mm_extract_epi16(a: __m128i, imm8: i32) -> i32 {
     | ------------------------------------------------------------- similarly named function `_mm_extract_epi16` defined here
     |
     = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
help: a function with a similar name exists
     |
1292 |             crate::myarch::_mm_extract_epi16(crate::mem::transmute(a), $imm8)
     |                            ^^^^^^^^^^^^^^^^^
help: consider importing this function
     |
5    | use crate::_mm_extract_pi16;
     |

^C  Building [=====================>       ] 7/9: vektor

Is there anything I'm missing to make faster 0.5.2 work?
Thx for the help!

Correct way to benchmark faster

I am doing some benchmarking of my own simd lib against faster and want to be sure I'm doing it correctly. I'm using criterion, replicating the "lots of 3s" example, as shown in this gist:

https://gist.github.com/jackmott/a0b8ca811d2cf2ecb97a35f0aee0a5c6

I'm using the default compilation settings which should be targeting SSE2 instructions for Faster, and I'm using the SSE2 settings in my library. Does this look like a fair comparison? Am I missing anything?

Also how is ceil implemented for SSE2? I think it is slower than it needs to be but I can't figure out where it happens in the faster source.

Don't touch the iterator system for the next few days

Hi guys,

I'm currently ripping up the old iterator system, and it's going to look nothing like 0.4.0. The drastic change is to fix a few corner cases with correctness I ran into late in the 0.4.0 development cycle. I'd recommend not touching anything in iters.rs, into_iters.rs, zip.rs, or swizzle.rs in order to avoid unresolvable conflicts. I'll close this issue once most of the changes are pushed.

As always, thank you for contributing to faster.

I cannot compile this crate

I tried several times to compile this crate version 0.4.3, using different versions of the nightly compiler and also the stable one.
When I use the nightly compilers I get lots of errors of this kind:

error: #[target_feature] attribute must be of the form #[target_feature(..)]
   --> /home/---------/.cargo/registry/src/github.com-1ecc6299db9ec823/coresimd-0.0.4/src/x86/x86_64/xsave.rs:103:1
    |
103 | #[target_feature = "+xsave,+xsaves"]

and then

error: aborting due to 973 previous errors

error: Could not compile `coresimd`.

When using the stable one I get

error[E0554]: #![feature] may not be used on the stable release channel
  --> /home/----------/.cargo/registry/src/github.com-1ecc6299db9ec823/coresimd-0.0.4/src/lib.rs:14:1
   |
14 | / #![feature(const_fn, link_llvm_intrinsics, platform_intrinsics, repr_simd,
15 | |            simd_ffi, target_feature, cfg_target_feature, i128_type, asm,
16 | |            const_atomic_usize_new, stmt_expr_attributes, core_intrinsics,
17 | |            crate_in_paths)]
   | |___________________________^

error: aborting due to previous error

For more information about this error, try `rustc --explain E0554`.
error: Could not compile `coresimd`.

simd_for_each does nothing

I have tried to do something using faster and noticed that you added simd_for_each, but it doesn't seem to do anything. Simple case:

extern crate faster;
use faster::*;

fn main() {
    let mut vector = vec![0f32, 1f32, 2f32, 3f32];
    vector
        .as_mut_slice()
        .simd_iter_mut()
        .simd_for_each(f32s(0f32), |mut x| x /= f32s(2f32));
    println!("{:?}", vector);
    vector.iter_mut().for_each(|x| *x /= 2f32);
    println!("{:?}", vector);
}

Outputs:

[0.0, 1.0, 2.0, 3.0]
[0.0, 0.5, 1.0, 1.5]

I have looked at the implementation of this function: it takes an owned SIMD vector, applies the user-provided function to it, and then does nothing with the result. There should be a store somewhere in there, and the user-provided function should take a mutable reference to the SIMD vector. Or maybe I'm doing something wrong, since there aren't any examples yet.

Update crates.io

Compilation with the latest version on crates.io does not work, however master works fine. Would be nice if you could update crates.io.

Arbitrary size bitvectors

Could this library support arbitrarily sized bitvectors? Suppose I have two 32-bit vectors; I could store each in a 32-bit unsigned integer. But if I have bitvectors of 1200 bits, it would be ideal to break them up into two 512-bit pieces and one 256-bit piece (using 176 of its bits). I would like to perform simple operations such as AND, OR, NOT and arithmetic like + and -.
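The bitwise part of this decomposes into word-wise loops that auto-vectorize well; a sketch over u64 words (1200 bits round up to 19 words), with AND as the representative operation. Carry-propagating + and - need a serial pass across words and are not shown:

```rust
/// AND of two arbitrary-width bitvectors stored as slices of u64 words.
pub fn bitand(a: &[u64], b: &[u64], out: &mut [u64]) {
    assert!(a.len() == b.len() && b.len() == out.len());
    for ((o, x), y) in out.iter_mut().zip(a).zip(b) {
        *o = x & y; // the compiler can lift this into 128/256/512-bit ops
    }
}
```

OR and NOT follow the same pattern with `|` and `!`.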

Indexing into an array / lookup table (lut)?

Hi, thanks for the lib! My use case needs to index into an array based on a value. In short, it is doing something like:

for x in 0..width {
  let a = array_one[x+42] - array_one[x-42]; //???
  let b = ...some arthi op on `a` which I know faster can do...
  let c = array_two[b]; //???
}

Question: how can I parallelize the array-indexing operation?
There do exist SIMD intrinsics for such lookup-table (LUT) operations. I have used OpenCV's universal intrinsics in C++, and it did provide one: https://docs.opencv.org/4.5.3/df/d91/group__core__hal__intrin.html#ga37fe7c336a68ae5f48967a066473a4ff
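The operation being asked about is a gather, modeled per 8-lane group below. AVX2 exposes it as vpgatherdd (`_mm256_i32gather_epi32` in `core::arch`), but faster would need to wrap it; `gather8` is an illustrative name, not an existing API:

```rust
/// One 8-lane gather: idx holds lane indices computed elsewhere (e.g.
/// by SIMD arithmetic), and each output lane is table[idx[lane]].
pub fn gather8(table: &[i32], idx: [usize; 8]) -> [i32; 8] {
    let mut out = [0i32; 8];
    for (o, &i) in out.iter_mut().zip(idx.iter()) {
        *o = table[i]; // one lane of the gather
    }
    out
}
```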

Redesign with a better design

The current design of "faster" has several big flaws that should be addressed, including:

  1. It makes no effort to use aligned loads and no effort to support aligned Vec, which is detrimental to performance on several architectures
  2. It abuses the Iterator interface: next() only iterates over the initial portion of the elements whose length is a multiple of the SIMD width, and then you are supposed to call end() for the final part. This is a misuse of Iterator, since next() is supposed to return all items
  3. Zip is very limited since it only supports vectors of the same length, and also cannot zip e.g. u8x16 and u32x4 in a way that results in 4 u8s and 4 u32s at once
  4. A "default" element is required just to create SIMD iterators, which is probably not the best design
  5. The SIMDIterator interface has a bunch of methods like "vector_pos()", "vector_len()", "scalar_len()", etc. that are inappropriate for a SIMD version of Iterator, since the position/length is not a concept valid for general iterators (only for slice iterators)
  6. It has no support for multiple vector sizes
  7. The SIMDZipped* traits are redundant, and can be removed by just implementing Packed and Packable for tuples and using the normal SIMD* traits

Here is a better design:

  1. Introduce a Partial<T> type that represents a partially filled SIMD vector, containing a vector and the number of elements that are valid (and that supports iteration, map(), and and_then())
  2. Implement Packed and Packable for tuples and remove the SIMDZipped* traits in favor of just using the normal traits with tuple types
  3. Implement Iterator with Item = Partial<T> for SIMD iterators, so that the next() interface returns all items properly
  4. Implement slice iterators so they return a partial vector to align the iterator, then return full vectors until the final partial vector, and so that aligned loads are used
  5. Remove the current SIMDIterator and add a new SIMDIterator that provides a next_n(n: usize) -> Partial<T> method that returns a partial SIMD vector filled with exactly n elements or less if the end of the iterator is reached, and a size_hint() in terms of scalars
  6. Remove the SIMDArray and UnsafeIterator traits
  7. Add helpers that can reduce the SIMD size of an iterator, and that can realign the iterator
  8. Add a SIMDExactSizeIterator that specifies that an exact size in terms of scalar is valid for the iterator
  9. Use specialization to implement SIMDIterator and SIMDExactSizeIterator for any Iterator<Partial>, while still allowing the implementation to be overridden for things like slice iterators
  10. Provide a SIMDVec that is like Vec but guarantees sufficient alignment for allocations (alternatively, change Vec to always align to SIMD alignment, although this might be undesirable)
  11. Support all vector sizes including smaller ones when a bigger size is supported
  12. Add a version of simd_iter() that allows specifying the vector size, and add a macro that instantiates a code block for each possible SIMD instruction set on the target platform and dispatches via runtime feature detection (this would be an alternative to the preferred approach of compiling the binary once per SIMD instruction set, which is better overall but which some people may dislike due to code size issues).

With this design:

  1. map(), etc. from Iterator can be used directly on SIMD iterators.
  2. zip() can be implemented by first automatically reducing the vector size to the smallest width, and then calling next() on the first iterator (or next_n() when implementing next_n()), and then calling next_n() on the other iterators with the number of elements the first one returned
  3. A more sophisticated version of zip() could be provided that finds the most frequent alignment among all iterators, aligns all the other with next_n() and then continues zipping with next_n(WIDTH) on all iterators. It's not clear whether this is much more useful than the simpler version that just follows the first iterator though.
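A minimal sketch of the proposed Partial<T> (using a plain array in place of a SIMD type; all names are illustrative, not a committed API):

```rust
// Minimal sketch of Partial<T>: a vector payload plus a count of valid lanes.
struct Partial<T> {
    vector: T,
    len: usize, // number of leading lanes that are valid
}

impl<T> Partial<T> {
    fn new(vector: T, len: usize) -> Self {
        Partial { vector, len }
    }
    // map() transforms the payload while preserving the valid-lane count,
    // which is what lets Iterator adapters work on partial tail vectors.
    fn map<U, F: FnOnce(T) -> U>(self, f: F) -> Partial<U> {
        Partial { vector: f(self.vector), len: self.len }
    }
}

fn main() {
    // A "width 4" vector with only 3 valid lanes, as a slice tail would produce.
    let p = Partial::new([1i32, 2, 3, 0], 3);
    let doubled = p.map(|mut v| {
        for x in v.iter_mut() { *x *= 2; }
        v
    });
    assert_eq!(&doubled.vector[..doubled.len], &[2, 4, 6]);
    println!("ok");
}
```

With Item = Partial<T>, next() can legitimately return every element, and zip() can follow the lane count of the first iterator via next_n().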

Not compiling on Windows 10 because packed_simd broke

at least, that's what happened according to rust-lang/packed_simd#308

   Compiling packed_simd v0.3.3
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
   |
47 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:41:1
   |
41 | impl_mask_reductions!(m8x8);
   | ---------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
   |
62 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:41:1
   |
41 | impl_mask_reductions!(m8x8);
   | ---------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
   |
47 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:47:1
   |
47 | impl_mask_reductions!(m16x4);
   | ----------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
   |
62 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:47:1
   |
47 | impl_mask_reductions!(m16x4);
   | ----------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
   |
47 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:52:1
   |
52 | impl_mask_reductions!(m32x2);
   | ----------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
  --> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
   |
62 |                 use crate::arch::x86_64::_mm_movemask_pi8;
   |                     ^^^^^^^^^^^^^^^^^^^^^----------------
   |                     |                    |
   |                     |                    help: a similar name exists in the module: `_mm_movemask_epi8`
   |                     no `_mm_movemask_pi8` in `arch::x86_64`
   | 
  ::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:52:1
   |
52 | impl_mask_reductions!(m32x2);
   | ----------------------------- in this macro invocation
   |
   = note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)

error: aborting due to 6 previous errors

For more information about this error, try `rustc --explain E0432`.
error: could not compile `packed_simd`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

0.5 can't build on latest nightly Rust

rustc 1.30.0-nightly (5c875d938 2018-09-24)

Compiling faster v0.5.0
error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
  --> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/vecs.rs:10:12
   |
10 | use crate::std::fmt::Debug;
   |            ^^^ Maybe a missing `extern crate std;`?

error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
 --> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/iters.rs:9:12
  |
9 | use crate::std::slice::from_raw_parts;
  |            ^^^ Maybe a missing `extern crate std;`?

error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
 --> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/intrin/eq.rs:8:12
  |
8 | use crate::std::ops::BitXor;
  |            ^^^ Maybe a missing `extern crate std;`?

error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
  --> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/arch/x86/intrin/abs.rs:14:12
   |
14 | use crate::std::mem::transmute;
   |            ^^^ Maybe a missing `extern crate std;`?
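An unverified workaround, given that other reports here say master builds fine: depend on the git repository instead of the broken 0.5.0 release on crates.io.

```toml
# Unverified sketch: use faster's master branch instead of the 0.5.0 release.
[dependencies]
faster = { git = "https://github.com/AdamNiederer/faster" }
```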

No way to process N elements at a time

I've implemented a simple image-processing benchmark in Rust to try out several approaches for use in my crates:

https://github.com/pedrocr/rustc-math-bench

faster looks very interesting so I was trying to add another implementation based on it to this file:

https://github.com/pedrocr/rustc-math-bench/blob/b0e3c047dcdbb2bb0fdcb29eac31f7838f83beab/src/main.rs

It's basically just a color conversion pass where each pixel has 4 values and the result is 3 values (camera space to RGB).

However, I can't seem to find a way in faster to iterate over groups of N values. Basically, is there an ergonomic, SIMD way to do:

for (pixin, pixout) in inb.chunks(4).zip(out.chunks_mut(3)) {

That's a common thing to have to do in graphics processing: turning one buffer with a given number of values per pixel into another buffer with a different number. Is this a target use case for faster at all?

Congratulations on a very promising crate either way.
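For reference, a scalar sketch of the 4-channel-to-3-channel pass the question describes (the matrix values are made up for illustration; a SIMD version would typically de-interleave a register's worth of pixels first):

```rust
// Scalar sketch of the camera-space -> RGB pass: 4 input channels per pixel,
// 3 output channels, via a 3x4 conversion matrix.
fn camera_to_rgb(input: &[f32], output: &mut [f32], m: &[[f32; 4]; 3]) {
    for (pixin, pixout) in input.chunks(4).zip(output.chunks_mut(3)) {
        for (o, row) in pixout.iter_mut().zip(m.iter()) {
            // Dot product of one matrix row with the input pixel.
            *o = row.iter().zip(pixin).map(|(c, x)| c * x).sum();
        }
    }
}

fn main() {
    // Toy matrix: pass R, G, B through and drop the 4th channel.
    let m = [
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 0.0, 0.0],
        [0.0, 0.0, 1.0, 0.0],
    ];
    let input = [0.1f32, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8];
    let mut output = [0f32; 6];
    camera_to_rgb(&input, &mut output, &m);
    assert_eq!(output, [0.1, 0.2, 0.3, 0.5, 0.6, 0.7]);
    println!("{:?}", output);
}
```

The awkward part for a width-generic library is that 4-wide input and 3-wide output never share a common stride within one register, which is presumably why faster's same-length zip can't express this directly.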
