adamniederer / faster
SIMD for humans
License: Mozilla Public License 2.0
When trying to work with the 0.3 PackedIterator, I noticed that simd_map and simd_reduce take a Fn. Would it make sense to relax this requirement to FnMut, so that the closures could modify their environment? Rust's std::iter::Iterator seems to mostly use FnMut, and for my current use case (iterating a second simd_iter while iterating another one) I would like to cause side effects.
I assume my actual problem will be addressed in 0.4, but I could imagine people might have other reasons to need this.
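The distinction at issue can be sketched outside of faster's API. Here apply_fn and apply_fn_mut are hypothetical stand-ins for combinators bounded by Fn and FnMut; only the latter accepts a closure that mutates captured state:

```rust
// Hypothetical stand-ins for combinators bounded by Fn vs. FnMut; only the
// FnMut-bounded one accepts a closure that mutates its environment.
fn apply_fn<F: Fn(i32) -> i32>(f: F) -> i32 {
    f(1)
}

fn apply_fn_mut<F: FnMut(i32) -> i32>(mut f: F) -> i32 {
    f(1)
}

fn main() {
    let mut calls = 0;
    // apply_fn(|x| { calls += 1; x }) would not compile: that closure is only FnMut.
    let doubled = apply_fn_mut(|x| {
        calls += 1; // side effect on the environment
        x * 2
    });
    assert_eq!((doubled, calls), (2, 1));
    assert_eq!(apply_fn(|x| x + 10), 11); // pure closures still satisfy Fn
    println!("ok");
}
```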
Hi Adam,
Your library uses stdsimd as an external dependency crate, which landed in Rust's std a while ago. It also uses an old target_feature attribute syntax that the compiler complains about.
Do you plan on updating your library to compile with nightly Rust any time soon? If not, could you please advise me on how to make the necessary changes to use your library, since I am fairly new to Rust?
These are available as _mm_subs_epu8 and _mm_subs_epu16 for SSE2, and _mm256_subs_epu8 and _mm256_subs_epu16 for AVX2, and are in stdsimd.
I am seeing errors with debug compilation:
...>cargo build
Compiling stdsimd v0.0.3
Compiling faster v0.1.3 (file:.../faster_master)
LLVM ERROR: Cannot select: t7: v16i8 = X86ISD::ABS t3
t3: v16i8,ch = CopyFromReg t0, Register:v16i8 %vreg45
t2: v16i8 = Register %vreg45
In function: _ZN6faster4main28_$u7b$$u7b$closure$u7d$$u7d$17h9a8972f3fea441f3E
error: Could not compile `faster`.
And release compilation:
...>cargo build --release
Compiling faster v0.1.3 (file:.../faster_master)
LLVM ERROR: Cannot select: t144: v16i8 = X86ISD::ABS t158
t158: v16i8 = bitcast t157
t157: v2i64,ch = load<LD16[ConstantPool]> t0, t160, undef:i64
t160: i64 = X86ISD::WrapperRIP TargetConstantPool:i64<<16 x i8> <i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10>> 0
t159: i64 = TargetConstantPool<<16 x i8> <i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10, i8 -10>> 0
t12: i64 = undef
In function: _ZN6faster4main17hb45e5f709eed73c2E
error: Could not compile `faster`.
The compiler is the latest nightly:
rustc 1.23.0-nightly (3b82e4c74 2017-11-05)
binary: rustc
commit-hash: 3b82e4c74d43fc1273244532c3a90bf9912061cf
commit-date: 2017-11-05
host: x86_64-pc-windows-gnu
release: 1.23.0-nightly
LLVM version: 4.0
I'm trying to get into SIMD by implementing a trivial operation: XOR unmasking of a byte stream as required by the WebSocket specification. The implementation in x86 intrinsics is actually very straightforward, but I have a hard time wrapping my head around expressing it in terms of Faster iterators API.
The part I'm having trouble with is getting an input [u8; 4] to cycle within a SIMD vector of u8. I have looked at:
1. load(), which does accept &[u8] as input, but its behavior in case of a length mismatch is completely undocumented. It's also not obvious what the offset parameter does.
2. Converting [u8; 4] to u32, calling vecs::u32s(), and then downcasting repeatedly to get a SIMD vector of u8, but Downcast seems to do not at all what I want.
3. Requesting a SIMD vector of length 4, loading [u8; 4] into it (lengths now match, so it should work), then downcasting repeatedly until I get a vector of u8 of arbitrary length. Except there seems to be no way to request a SIMD vector of length 4 and arbitrary type.
4. From<u32x4> is implemented for u8x16, so I could replace Downcast with it in approach 2 and probably get the correct result, except I have no idea how such conversions interact with host endianness.
I actually expected this to be a trivial task. I guess for someone familiar with SIMD it is, but for the likes of me, a snippet in the examples/ folder that loads [u8; 4] into a vector would go a long way. Or perhaps even a convenience function in the API that deals with endianness properly, to make it harder to mess up.
Or I will do it myself :(
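For reference, here is a scalar version of the operation being asked about (WebSocket XOR unmasking with a cycling 4-byte key); a SIMD version would splat the key pattern across a vector and XOR lane-wise:

```rust
// Scalar sketch of WebSocket XOR unmasking: each payload byte is XORed
// with the masking key, cycling through the 4 key bytes.
fn unmask(payload: &mut [u8], key: [u8; 4]) {
    for (i, byte) in payload.iter_mut().enumerate() {
        *byte ^= key[i % 4];
    }
}

fn main() {
    let key = [0xA1, 0xB2, 0xC3, 0xD4];
    let mut data = vec![0u8; 6];
    unmask(&mut data, key);
    assert_eq!(data, [0xA1, 0xB2, 0xC3, 0xD4, 0xA1, 0xB2]);
    unmask(&mut data, key); // XOR is its own inverse
    assert_eq!(data, [0u8; 6]);
    println!("ok");
}
```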
It wasn’t clear to me if faster can also work with rayon. I tried the simple example both with into_* and placing par_iter before and after simd_iter with no luck. So it looks to me like this isn’t a supported combination right now, but if it is, an example would be nice.
Thanks!
Hi,
I am trying to port a project to aarch64 and wasm using the rust-2018-migration branch. As of today I receive lots of:
2 | use crate::vektor::x86_64::*;
| ^^^^^^ Could not find `x86_64` in `vektor`
Ideally, faster would have fallbacks for not-yet-supported architectures. That way I could just write my SIMD code once using the provided API, instead of having two separate implementations.
Do you have any short-term plans to make such a fallback available for the 2018 version?
Also, while I am not a Rust expert, I have 1-2 days to look into this myself. If you think it's feasible to outline a solution you prefer, I'd be happy to try to help you out.
It would be useful to me if there were an operation like fn bytemask<T: Packed>(v: T) -> usize that sets the bits of the usize to 1 where the packed vector's bytes are all 1, corresponding to the PMOVMSKB or VPMOVMSKB instructions. I don't know how feasible this would be for faster, but it would be nice.
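A scalar model of the requested operation (hedged: faster has no such function, and PMOVMSKB itself collects the high bit of each byte, which coincides with this when bytes are all-ones or all-zeros):

```rust
// Scalar model of the proposed bytemask: set bit i of the result when
// byte i of the vector is all ones (0xFF), as PMOVMSKB does for mask bytes.
fn bytemask(bytes: &[u8; 16]) -> usize {
    bytes.iter()
        .enumerate()
        .filter(|&(_, &b)| b == 0xFF)
        .fold(0, |mask, (i, _)| mask | (1 << i))
}

fn main() {
    let mut v = [0u8; 16];
    v[0] = 0xFF;
    v[3] = 0xFF;
    assert_eq!(bytemask(&v), 0b1001);
    println!("ok");
}
```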
vec![0u8; 100000].iter().map(|&x| {
x as u64 * x as u64
}).sum::<u64>();
is essentially what I'd like to use this crate to SIMD-accelerate.
The issue I'm bumping into is reading the slice as u8x8 such that I can then use <u64x8 as From<u8x8>>::from to cast it. What is the best way to do this currently?
I have code where I want to add / multiply two vectors and compute the sum over all results. It looks somewhat like this:
let mut simd_sum = f32s::splat(0.0f32);
for (x, y) in zip(xvec.simd_iter(), yvec.simd_iter()) {
simd_sum = simd_sum + (x - y) * (x - y);
}
// TODO ...
let sum = sum_f32s(simd_sum, simd_width);
I haven't found a way to a) sum simd_sum (e.g., sum = simd_sum.0 + simd_sum.1 + ... + simd_sum.n) with an existing function, or b) generically implement my own sum function.
The problem I'm having with implementing sum_f32s myself is that I haven't found an easy way to get the current vector width. The hack I'm using looks like this:
let _temp = [0.0f32; 32];
let simd_width = (&_temp[..]).simd_iter().width();
It would be nice if both a) and b) were implemented, or, if they are already, documented how to use them.
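For reference, the computation being accumulated, in scalar form (plain Rust, independent of faster's API), which is useful for validating whatever SIMD reduction ends up being used:

```rust
// Scalar reference: sum of squared differences of two equal-length slices,
// i.e. what the simd_sum accumulation above computes lane-wise.
fn sum_sq_diff(xs: &[f32], ys: &[f32]) -> f32 {
    assert_eq!(xs.len(), ys.len());
    xs.iter().zip(ys).map(|(x, y)| (x - y) * (x - y)).sum()
}

fn main() {
    // (1-0)^2 + (2-4)^2 = 1 + 4 = 5
    assert_eq!(sum_sq_diff(&[1.0, 2.0], &[0.0, 4.0]), 5.0);
    println!("ok");
}
```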
The way features work in cargo is that each crate in the dependency graph is built a single time with the union of all requested features from each dependent crate. For this reason, enabling a feature should never remove items or change a function's signature; features must only add items.
As an example of how the current setup can go wrong: suppose a library depends on faster and enables the no-std feature because it doesn't need std, while another crate enables no features and uses the SIMD iterator impls for Vec. If an application depends on both of them, faster gets compiled with no-std, and the second crate will fail to compile.
The recommendation is to have a std feature that is enabled by default:
[features]
default = ["std"]
std = []
(Note: the specific name "std" is recommended by the Rust API Guidelines; see C-FEATURE.)
Faster should have a way to hint at performance issues.
From a user's point of view it is hard to predict whether SIMD intrinsics are issued, or a fallback implementation is used. Ideally there should be a method, macro, compiler flag, ... to enable performance metrics or warnings.
In the simplest form, each fallback implementation could call something like fallback!, which in turn could be enabled or disabled with a compile-time feature. That macro could then emit warnings (maybe even with file names and line numbers).
As a user I could then either change my code, file an issue, or implement the missing intrinsic myself.
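A hypothetical sketch of what such a fallback! marker could look like; a real implementation would be gated behind a cargo feature so it compiles to nothing by default:

```rust
// Hypothetical fallback! marker: each scalar fallback path invokes it, and
// it reports the call site via file!()/line!(). Gating it behind a feature
// flag (omitted here) would let it compile away entirely by default.
macro_rules! fallback {
    () => {
        eprintln!("faster: scalar fallback used at {}:{}", file!(), line!());
    };
}

// Example fallback implementation that announces itself.
fn abs_fallback(x: f32) -> f32 {
    fallback!();
    x.abs()
}

fn main() {
    assert_eq!(abs_fallback(-2.5), 2.5);
    println!("ok");
}
```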
After my last PR I noticed you had to clean up a bit. That made me wonder if it makes sense to configure and use clippy and rustfmt:
Since clippy is probably less controversial, I went ahead and addressed all current issues, either by changing code (where I thought clippy made sense) or by disabling lints (e.g., where faster bends the rules for speed). Some of the more pedantic lints could be discussed, e.g., usage and formatting of number literals (unreadable_literal, unseparated_literal_suffix). https://github.com/ralfbiedert/faster/tree/clippy
I think rustfmt makes sense as well, but it needs more configuration to resonate with the code. I found a few settings that worked for me (e.g., max_width should be set rather high so as not to break up most macros, which makes them harder to read). However, you should probably take the lead on that one.
Let me know what you think about clippy, in particular unreadable_literal, unseparated_literal_suffix, and type_complexity (I prefer the former two; no opinion on the third). I can then create another PR.
Currently the vector size and the SIMD instructions to be used are fixed at compile time. This is a good first step.
It allows me to write an algorithm once and, by changing a compiler flag, generate different versions of this algorithm for different target architectures (with different vector sizes, SIMD instructions, etc.).
I would really like to be able to do this at compile time within the same binary as well, so that I can write:
// a generic algorithm
fn my_generic_algorithm<T: SimdVec>(x: T, y: T) {
// generic simd operations
}
and monomorphize it for different instruction sets:
let x: VecAvx;
my_generic_algorithm(x, x); // AVX version
let y: VecSSE42;
my_generic_algorithm(y, y); // SSE42 version
That way, input.simd_iter().map(my_generic_algorithm) could monomorphize my_generic_algorithm for SSE, SSE4.2, AVX, and AVX2, do run-time feature detection to find the best instruction set a given CPU supports, and then dispatch to that one.
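The run-time dispatch half of this can already be sketched with std's feature detection macro; sum_squares here is a stand-in for the monomorphized algorithm, not faster's API:

```rust
// Stand-in for the generic algorithm; in the proposal this would be
// monomorphized once per instruction set.
fn sum_squares(xs: &[f32]) -> f32 {
    xs.iter().map(|x| x * x).sum()
}

// Run-time dispatch: pick the best monomorphization the CPU supports.
fn dispatch(xs: &[f32]) -> f32 {
    #[cfg(target_arch = "x86_64")]
    {
        if is_x86_feature_detected!("avx2") {
            // An AVX2-targeted monomorphization would be called here.
            return sum_squares(xs);
        }
    }
    sum_squares(xs) // baseline fallback
}

fn main() {
    assert_eq!(dispatch(&[1.0, 2.0, 3.0]), 14.0);
    println!("ok");
}
```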
Hi!
This crate looks great, and I was wondering whether it could have a use_std feature (on by default) so that it could be used in no_std environments. Coresimd, which looks like the core counterpart to stdsimd, could perhaps be used?
I noticed that some methods return Vecs; in particular, stride and scalar_collect. I was wondering if it makes sense to expose generic versions that could use an arbitrary FromIterator type. My motivation for this is:
The original methods could still be preserved as wrappers around the Vec implementation. Does it make sense? Should I send a pull request for that?
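The proposed generalization could look roughly like this (names are hypothetical; faster's actual signatures differ):

```rust
use std::collections::VecDeque;

// Hypothetical generic form of scalar_collect: gather scalars into any
// FromIterator target instead of a hard-coded Vec.
fn scalar_collect_generic<C: FromIterator<f32>>(lanes: impl IntoIterator<Item = f32>) -> C {
    lanes.into_iter().collect()
}

// The original Vec-returning method becomes a thin wrapper.
fn scalar_collect(lanes: impl IntoIterator<Item = f32>) -> Vec<f32> {
    scalar_collect_generic(lanes)
}

fn main() {
    let v: Vec<f32> = scalar_collect_generic([1.0, 2.0]);
    let d: VecDeque<f32> = scalar_collect_generic([3.0, 4.0]);
    assert_eq!(v, vec![1.0, 2.0]);
    assert_eq!(d, VecDeque::from(vec![3.0, 4.0]));
    assert_eq!(scalar_collect([5.0]), vec![5.0]);
    println!("ok");
}
```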
While working on #47, I noticed what look like performance regressions in the cargo bench, in particular in functions like map_simd and map_scalar, but quite a few others too.
test tests::map_scalar ... bench: 2,022 ns/iter (+/- 264)
test tests::map_simd ... bench: 6,898 ns/iter (+/- 392)
However, comparing #49 to the commit before the refactoring, the numbers are mostly unchanged.
I then assumed it was related to unfortunate default feature flags on my machine, but playing with avx2 and sse4.1 didn't have any effect either. I also have a first implementation of #48, and it actually looks like no fallbacks are emitted for map_simd. (I tried to cross-check that with radare2, but had some problems locating the right symbol / disassembly for the benchmarks.) Lastly, the functions map_scalar and map_simd differ a bit, but even when I make them equal (e.g., sqrt vs. rsqrt) the difference remains.
Has rustc become so good at auto-vectorization? What is the difference between tests::map_simd and tests::map_scalar?
Running on rustc 1.29.0-nightly (9fd3d7899 2018-07-07), MBP 2015, i7-5557U.
Update: I linked the latest faster version from my SVM library and I don't see these problems in 'production':
csvm_predict_sv1024_attr1024_problems1 ... bench: 232,109 ns/iter (+/- 20,808) [faster AVX2]
csvm_predict_sv1024_attr1024_problems1 ... bench: 942,925 ns/iter (+/- 64,156) [scalar]
Update 2 Seems to be related to some intrinsics. When I dissect the benchmark, I get
test tests::map_scalar ... bench: 558 ns/iter (+/- 55) [without .abs()]
test tests::map_scalar ... bench: 556 ns/iter (+/- 33) [with .abs()]
test tests::map_simd ... bench: 144 ns/iter (+/- 17) [without .abs()]
test tests::map_simd ... bench: 883 ns/iter (+/- 64) [with .abs()]
I now think that each intrinsic should have its own benchmark, e.g., intrinsic_abs_scalar, intrinsic_abs_simd, ...
Update 3 ... oh boy. I think that by "arcane magic" Rust imports and prefers std::simd::f32x4 and friends over the faster types and methods.
So when you do my_f32s.abs(), it calls std::simd::f32x4::abs, not faster::arch::current::intrin::abs.
The reason I think that's the problem is that you can now easily do my_f32s.sqrte(), which isn't implemented in faster, but in std::simd.
What's more annoying is that it doesn't warn about any collision, and that std::simd is actually slower than "vanilla" Rust.
TODO:
#![feature(stdsimd)]
except in lib.rs
Update 4 Now one more thing makes sense ... I sometimes got "use of unstable library feature 'stdsimd'" in test cases and I didn't understand why. Probably because that's where the std::simd built-ins were used.
This crate is listed by the official docs for SIMD operations, yet the current version 0.5.0 gives tons of compile errors, i.e., unresolved symbols. Is this crate still active?
Rayon supports parallel iterators and map functions that process data using multiple threads. How can we integrate with Rayon so we can leverage both SIMD and thread-level parallelism?
On crates.io, the documentation link https://docs.adamniederer.com/faster/index.html is not working.
hey,
I have spent a few minutes trying unsuccessfully to write a SIMD version of code that uses ndarray's zip_mut_with. Is there currently a way of combining a simd_iter_mut and a simd_iter with zip?
Essentially, I'd like to modify the elements of an array in place with an operation that uses the values in another array of the same shape.
I tried this:
let mut xs = ndarray::Array::from_elem((64,), 0.0f32);
let ys = ndarray::Array::from_elem((64,), 1.0f32);
(xs.as_slice_mut().unwrap().simd_iter_mut(f32s(0.0)),
ys.as_slice().unwrap().simd_iter(f32s(0.0)))
.zip()
.simd_for_each(|(x, y)| {
*x += *y; // or whatever, not sure if this is correct for the closure
});
and got this error:
error[E0599]: no method named `simd_for_each` found for type `faster::Zip<(faster::SIMDIter<&mut [f32]>, faster::SIMDIter<&[f32]>)>` in the current scope
--> src/mlp/m3r.rs:298:26
|
298 | .simd_for_each(|(z, b)| {
| ^^^^^^^^^^^^^
...
Any recommendations for how to proceed? First time using the library, apologies if this is an obvious question.
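For comparison, here is the scalar in-place zip the snippet above is trying to vectorize; any SIMD iterator combination would need to express exactly this:

```rust
// Scalar in-place zip: mutate each element of xs using the matching
// element of ys, the operation the SIMD iterators should express.
fn zip_add_assign(xs: &mut [f32], ys: &[f32]) {
    for (x, y) in xs.iter_mut().zip(ys) {
        *x += *y;
    }
}

fn main() {
    let mut xs = vec![0.0f32; 4];
    let ys = vec![1.0f32; 4];
    zip_add_assign(&mut xs, &ys);
    assert_eq!(xs, vec![1.0; 4]);
    println!("ok");
}
```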
The current API is not bad, but it has some problems. Because this crate implements the standard Iterator trait, SIMD iterators can be used in for loops (and they currently are, for example in the benchmarks), but inside those loops there is no handling of unevenly sized collections. That could be added, but it would reduce performance. The second thing is that inside for loops there is no in-place mutation; you need to store manually. I think it would be better to have an internal iterator trait and implement the functionality on top of it; that would also allow dropping the simd prefix on the iterator methods. It would make the API more similar to the standard library, with the exception of providing defaults. Thoughts?
First of all thanks for making this library, the API looks really neat.
I'm currently working on a project where I need to crunch a lot of complex numbers (esp. Complex) and would love to use faster instead of hooking into external C libraries like VOLK.
Is support for complex types on your roadmap?
I have a function which looks vaguely like this:
struct Rect { real: f64, imag: f64 }
struct KetRef<'a> { real: &'a [f64], imag: &'a [f64] }
impl<'a> KetRef<'a> {
pub fn dot(self, other: KetRef) -> Rect {
assert_eq!(self.real.len(), other.real.len());
assert_eq!(self.real.len(), other.imag.len());
assert_eq!(self.real.len(), self.imag.len());
zip!(self.real, self.imag, other.real, other.imag)
.map(|(ar, ai, br, bi)| {
let real = ar * br + ai * bi;
let imag = ar * bi - ai * br;
Rect { real, imag }
})
.fold(Rect::zero(), |a,b| a + b)
}
}
Converting it to use faster requires two passes over the arrays; I am unable to produce both real and imag in one pass because simd_map requires the function output to be a single vector:
pub fn dot<K: AsKetRef>(self, other: K) -> Rect {
use ::faster::prelude::*;
let other = other.as_ket_ref();
assert_eq!(self.real.len(), other.real.len());
assert_eq!(self.real.len(), other.imag.len());
assert_eq!(self.real.len(), self.imag.len());
let real = (
self.real.simd_iter(f64s(0.0)),
self.imag.simd_iter(f64s(0.0)),
other.real.simd_iter(f64s(0.0)),
other.imag.simd_iter(f64s(0.0)),
).zip().simd_map(|(ar, ai, br, bi)| {
ar * br + ai * bi
}).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();
let imag = (
self.real.simd_iter(f64s(0.0)),
self.imag.simd_iter(f64s(0.0)),
other.real.simd_iter(f64s(0.0)),
other.imag.simd_iter(f64s(0.0)),
).zip().simd_map(|(ar, ai, br, bi)| {
ar * bi - ai * br
}).simd_reduce(f64s(0.0), |acc, v| acc + v).sum();
Rect { real, imag }
}
So is it faster? Well, actually, yes! It is plenty faster... up to a point:
Change in run-time for different ket lengths
dot/16 change: -33.973%
dot/64 change: -29.575%
dot/256 change: -26.762%
dot/1024 change: -34.054%
dot/4096 change: -36.297%
dot/16384 change: -7.3379%
Yikes! Once we hit 16384 elements there is almost no speedup!
I suspect it is because at this point memory has become the bottleneck, and most of what was gained by using SIMD was lost by making two passes over the arrays. It would be nice to have an API that allowed this to be done in one pass by letting a mapping function return a tuple (producing a new PackedZippedIterator or similar).
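A scalar one-pass reference for the dot product above; the feature request amounts to letting a SIMD mapping closure return a tuple so this single traversal becomes expressible:

```rust
// One-pass scalar complex dot product over split real/imaginary arrays,
// accumulating both components in a single traversal.
fn dot(ar: &[f64], ai: &[f64], br: &[f64], bi: &[f64]) -> (f64, f64) {
    let mut real = 0.0;
    let mut imag = 0.0;
    for i in 0..ar.len() {
        real += ar[i] * br[i] + ai[i] * bi[i];
        imag += ar[i] * bi[i] - ai[i] * br[i];
    }
    (real, imag)
}

fn main() {
    // real = 1*3 + 2*4 = 11, imag = 1*4 - 2*3 = -2, per the formulas above.
    assert_eq!(dot(&[1.0], &[2.0], &[3.0], &[4.0]), (11.0, -2.0));
    println!("ok");
}
```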
https://crates.io/crates/faster has a "documentation" link at the top; clicking it brings me to a 404 page.
Hi,
When I try to run the following code:
#[cfg(test)]
mod tests {
use faster::{IntoPackedRefIterator, IntoPackedZip, PackedZippedIterator, PackedIterator, Packed, f64s};
#[test]
fn test_faster_simd() {
let a = vec![ 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0 ];
let b = vec![ 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0 ];
let slice_a = &a[..];
let slice_b = &b[..];
let sum: f64 = (slice_a.simd_iter(), slice_b.simd_iter()).zip()
.simd_map((f64s::splat(0f64), f64s::splat(0f64)), |(a,b)| (a - b) * (a - b))
.simd_reduce(f64s::splat(0f64), f64s::splat(0f64), |a, v| a + v)
.sum();
}
}
I get:
---- kernel::rbf::tests::test_faster_simd stdout ----
thread 'kernel::rbf::tests::test_faster_simd' panicked at 'assertion failed: !a.0.is_none() && a.0.unwrap().1 == a.0.unwrap().1 && !a.1.is_none() &&
a.1.unwrap().1 == a.0.unwrap().1', /Users/rb/.cargo/registry/src/github.com-1ecc6299db9ec823/faster-0.4.0/src/zip.rs:306:0
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
stack backtrace:
0: std::sys::unix::backtrace::tracing::imp::unwind_backtrace
1: std::sys_common::backtrace::print
2: std::panicking::default_hook::{{closure}}
3: std::panicking::default_hook
4: std::panicking::rust_panic_with_hook
5: std::panicking::begin_panic
6: ffsvm::kernel::rbf::tests::test_faster_simd
7: <F as test::FnBox<T>>::call_box
8: __rust_maybe_catch_panic
I'm using faster 0.4 and rust 1.25.0-nightly (4e3901d35 2018-01-23), with no target-feature. The same happens when I enable target-feature=+avx2. Unfortunately I don't really understand the assertion, so please let me know if there is something broken with the way I use faster.
Hi!
I think it would be great to be able to call into_simd_iter() on regular Iterators, e.g.:
iter::repeat(42.0)
.into_simd_iter()
// ...
This allows SIMD to be applied when buffering the data into a vector would be inefficient and provides better interop with std.
Opening another ticket since this is a separate discussion from #47 and might be more controversial:
The more I look into the upcoming std::simd, the more I wonder whether faster shouldn't become a thinner "SIMD-friendly iteration" library that neatly plugs into std::simd and is really good at handling variable slices, zipping, and so on, instead of providing a blanket implementation over std::arch.
Right now it seems that many common intrinsics and operations faster provides on packed types are or might be implemented in std::simd (compare coresimd/ppsv).
At the same time, for things that won't be in std::simd (and will be more platform-specific), faster will have a hard time providing a consistent performance story anyway.
By that reasoning I see a certain appeal in primarily focusing on a more consistent cross-platform experience with a much lighter code base (e.g., imagine faster without arch/ and intrin/, using mostly std::simd instead of vektor).
Faster could also integrate std::arch-specific functions and types, but rather as extensions and helpers (e.g., for striding) for special use cases, instead of using them as internal fundamentals.
If I am reading the code correctly, it looks like in the case of SSE2 Faster currently falls back to calling round()/floor() etc on each individual lane via the fallback macro.
You may be able to use these methods instead:
http://dss.stephanierct.com/DevBlog/?p=8
Or Agner Fog has a different method in his vector library:
http://www.agner.org/optimize/vectorclass.zip
edit:
Agner's functions are slower but can handle floating point values that don't fit in an i32, the first functions only handle values that do fit in an i32.
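The adjustment trick those SSE2 emulations rely on can be shown in scalar form (valid only for values that fit in an i32, matching the caveat above):

```rust
// floor via truncation: truncate toward zero, then subtract 1 when the
// truncation rounded upward (i.e., for negative non-integers). Only valid
// when x fits in an i32, like the first set of functions linked above.
fn floor_via_trunc(x: f32) -> f32 {
    let t = x as i32 as f32; // cvttps-style truncation toward zero
    if x < t { t - 1.0 } else { t }
}

fn main() {
    assert_eq!(floor_via_trunc(1.5), 1.0);
    assert_eq!(floor_via_trunc(-1.5), -2.0);
    assert_eq!(floor_via_trunc(-2.0), -2.0); // exact integers are unchanged
    println!("ok");
}
```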
use {faster::Popcnt, std::simd::u64x4};
u64x4::new(!0, !0, 0, 0).count_ones()
correctly returns 128 without -C target-cpu=native, and incorrectly returns 18446744073709551488 (i.e., !127) with -C target-cpu=native.
Hello, I want to check whether this crate works with AVX-512 instructions, and also whether you would recommend using it, given its inactivity.
Are there other crates? This one seems very good to me.
Thanks
I'm currently on a platform that has AVX support but not AVX2 support, so faster currently only uses f32x4 and f64x2 on my platform, but it should be able to use f32x8 and f64x4.
I saw you renamed a lot of structs and traits.
Since you are on this, I was wondering what you think about renaming Packed to SIMDVector, along with its friends (e.g., Packable -> SIMDConvertable), or something similar.
The main reason is that when I started reading the source and tried to understand how all the Pack* concepts relate, I found your comment Packed // A SIMD vector of some type most helpful and thought: if that's what it really is, why isn't it just named like that?
I'm not proposing exactly these names, but I think the crate's internals would be easier to follow if the names reflected well-known concepts.
Hi, I'm trying to use faster in my project, but I can't seem to build it on my machine.
It does not seem to be a problem with Rust, but with the vektor crate.
error[E0425]: cannot find function `_mm_extract_pi16` in module `crate::myarch`
--> /home/xinbg/.cargo/registry/src/mirrors.ustc.edu.cn-12df342d903acd47/vektor-0.2.2/src/x86/sse.rs:1292:28
|
1292 | crate::myarch::_mm_extract_pi16(crate::mem::transmute(a), $imm8)
| ^^^^^^^^^^^^^^^^
...
1296 | crate::mem::transmute(constify_imm8!(imm2, call))
| -------------------------- in this macro invocation
|
::: /home/xinbg/.rustup/toolchains/nightly-x86_64-unknown-linux-gnu/lib/rustlib/src/rust/library/core/src/../../stdarch/crates/core_arch/src/x86/sse2.rs:1419:1
|
1419 | pub unsafe fn _mm_extract_epi16(a: __m128i, imm8: i32) -> i32 {
| ------------------------------------------------------------- similarly named function `_mm_extract_epi16` defined here
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
help: a function with a similar name exists
|
1292 | crate::myarch::_mm_extract_epi16(crate::mem::transmute(a), $imm8)
| ^^^^^^^^^^^^^^^^^
help: consider importing this function
|
5 | use crate::_mm_extract_pi16;
|
^C Building [=====================> ] 7/9: vektor
Is there anything I'm missing to make faster 0.5.2 work? Thanks for the help!
I am doing some benchmarking of my own simd lib against faster and want to be sure I'm doing it correctly. I'm using criterion, replicating the "lots of 3s" example, as shown in this gist:
https://gist.github.com/jackmott/a0b8ca811d2cf2ecb97a35f0aee0a5c6
I'm using the default compilation settings which should be targeting SSE2 instructions for Faster, and I'm using the SSE2 settings in my library. Does this look like a fair comparison? Am I missing anything?
Also, how is ceil implemented for SSE2? I think it is slower than it needs to be, but I can't figure out where that happens in the faster source.
Hi, thanks for the lib! I wonder whether it supports NEON (ARM), such that it can be used on Android and iOS? Does it need nightly?
Hi guys,
I'm currently ripping up the old iterator system, and it's going to look nothing like 0.4.0. The drastic change is to fix a few corner cases with correctness I ran into late in the 0.4.0 development cycle. I'd recommend not touching anything in iters.rs, into_iters.rs, zip.rs, or swizzle.rs in order to avoid unresolvable conflicts. I'll close this issue once most of the changes are pushed.
As always, thank you for contributing to faster.
I tried several times to compile this crate version 0.4.3, using different versions of the nightly compiler and also the stable one.
When I use the nightly compilers I get lots of errors of kind:
error: #[target_feature] attribute must be of the form #[target_feature(..)]
--> /home/---------/.cargo/registry/src/github.com-1ecc6299db9ec823/coresimd-0.0.4/src/x86/x86_64/xsave.rs:103:1
|
103 | #[target_feature = "+xsave,+xsaves"]
and then
error: aborting due to 973 previous errors
error: Could not compile `coresimd`.
When using the stable one I get
error[E0554]: #![feature] may not be used on the stable release channel
--> /home/----------/.cargo/registry/src/github.com-1ecc6299db9ec823/coresimd-0.0.4/src/lib.rs:14:1
|
14 | / #![feature(const_fn, link_llvm_intrinsics, platform_intrinsics, repr_simd,
15 | | simd_ffi, target_feature, cfg_target_feature, i128_type, asm,
16 | | const_atomic_usize_new, stmt_expr_attributes, core_intrinsics,
17 | | crate_in_paths)]
| |___________________________^
error: aborting due to previous error
For more information about this error, try `rustc --explain E0554`.
error: Could not compile `coresimd`.
I tried to do something using faster and noticed that you added simd_for_each, but it doesn't seem to do anything. Simple case:
extern crate faster;
use faster::*;
fn main() {
let mut vector = vec![0f32, 1f32, 2f32, 3f32];
vector
.as_mut_slice()
.simd_iter_mut()
.simd_for_each(f32s(0f32), |mut x| x /= f32s(2f32));
println!("{:?}", vector);
vector.iter_mut().for_each(|x| *x /= 2f32);
println!("{:?}", vector);
}
Outputs:
[0.0, 1.0, 2.0, 3.0]
[0.0, 0.5, 1.0, 1.5]
I have looked at the implementation of this function and it takes a owned simd vector and performs user provided function on it and then does nothing with the result. There should be store
somewhere in there and the user provided function should take mutable reference to the simd vector. Or maybe I'm doing something wrong since there aren't any examples yet.
Compilation with the latest version on crates.io does not work; however, master works fine. It would be nice if you could update crates.io.
Could this library support arbitrarily sized bitvectors? Suppose I have two 32-bit bitvectors; I could store each in a 32-bit unsigned integer. But if I have bitvectors of 1200 bits, it would be ideal to break them up into two 512-bit chunks and one 256-bit chunk (holding the remaining 176 bits). I would like to perform simple operations such as AND, OR, NOT, and arithmetic like + and -.
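The chunking described can be modeled in scalar Rust with u64 words; a SIMD backend would apply the same word-wise operations with wider registers:

```rust
// Arbitrary-width bitvector as a slice of u64 words; bitwise ops are
// word-wise, which is exactly what wider SIMD registers would accelerate.
fn bv_and(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x & y).collect()
}

fn bv_or(a: &[u64], b: &[u64]) -> Vec<u64> {
    a.iter().zip(b).map(|(x, y)| x | y).collect()
}

fn main() {
    // 1200 bits fit in ceil(1200 / 64) = 19 words.
    let a = vec![0b1100u64; 19];
    let b = vec![0b1010u64; 19];
    assert_eq!(bv_and(&a, &b)[0], 0b1000);
    assert_eq!(bv_or(&a, &b)[0], 0b1110);
    println!("ok");
}
```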
Hi, thanks for the lib! My use case needs to index into an array based on a value. Shortly speaking, it is doing something like:
for x in 0..width {
let a = array_one[x+42] - array_one[x-42]; //???
let b = ...some arithmetic op on `a` which I know faster can do...
let c = array_two[b]; //???
}
Question: how can I parallelize the array-indexing operations?
There do exist SIMD intrinsics for such lut(lookup table) operations. I have used OpenCV's universal intrinsics in C++, and it did provide one: https://docs.opencv.org/4.5.3/df/d91/group__core__hal__intrin.html#ga37fe7c336a68ae5f48967a066473a4ff
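A scalar model of the lookup being asked about; SIMD gather or shuffle-based LUT intrinsics (like the OpenCV universal intrinsic linked above) perform this per lane:

```rust
// Scalar model of a table lookup (gather): each output element is the
// table entry selected by the corresponding index.
fn lut(table: &[u8; 256], indices: &[u8]) -> Vec<u8> {
    indices.iter().map(|&i| table[i as usize]).collect()
}

fn main() {
    // Hypothetical table mapping i -> 2*i (mod 256), just for illustration.
    let mut table = [0u8; 256];
    for (i, t) in table.iter_mut().enumerate() {
        *t = i.wrapping_mul(2) as u8;
    }
    assert_eq!(lut(&table, &[0, 1, 3]), vec![0, 2, 6]);
    println!("ok");
}
```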
The current design of "faster" has several big flaws that should be addressed, including:
Here is a better design:
With this design:
I just added faster to my Cargo.toml, and the build failed. Maybe there is a need to add #[cfg(target_arch="???")] above the target_feature attributes in the source code. The version is 0.5.2; 0.4.3 builds fine.
at least, that's what happened according to rust-lang/packed_simd#308
Compiling packed_simd v0.3.3
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
|
47 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:41:1
|
41 | impl_mask_reductions!(m8x8);
| ---------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
|
62 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:41:1
|
41 | impl_mask_reductions!(m8x8);
| ---------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
|
47 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:47:1
|
47 | impl_mask_reductions!(m16x4);
| ----------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
|
62 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:47:1
|
47 | impl_mask_reductions!(m16x4);
| ----------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:47:21
|
47 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:52:1
|
52 | impl_mask_reductions!(m32x2);
| ----------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error[E0432]: unresolved import `crate::arch::x86_64::_mm_movemask_pi8`
--> C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask\x86\sse.rs:62:21
|
62 | use crate::arch::x86_64::_mm_movemask_pi8;
| ^^^^^^^^^^^^^^^^^^^^^----------------
| | |
| | help: a similar name exists in the module: `_mm_movemask_epi8`
| no `_mm_movemask_pi8` in `arch::x86_64`
|
::: C:\Users\Khang\.cargo\registry\src\github.com-1ecc6299db9ec823\packed_simd-0.3.3\src\codegen\reductions\mask.rs:52:1
|
52 | impl_mask_reductions!(m32x2);
| ----------------------------- in this macro invocation
|
= note: this error originates in a macro (in Nightly builds, run with -Z macro-backtrace for more info)
error: aborting due to 6 previous errors
For more information about this error, try `rustc --explain E0432`.
error: could not compile `packed_simd`
To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed
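For context: `_mm_movemask_pi8` is an MMX intrinsic operating on the 64-bit `__m64` type, and it appears this nightly's `std::arch::x86_64` no longer exports it, which is why packed_simd 0.3.3 fails to build here. The compiler's suggested near-miss, `_mm_movemask_epi8`, is the SSE2 analogue on 128-bit `__m128i`. A minimal sketch of that suggested intrinsic (the function name `movemask_demo` is mine, not from packed_simd):

```rust
// Sketch of the SSE2 intrinsic the compiler suggests as the nearest name.
// _mm_movemask_epi8 packs the top bit of each of the 16 bytes of a
// __m128i into the low 16 bits of an i32 mask.
#[cfg(target_arch = "x86_64")]
fn movemask_demo() -> i32 {
    use std::arch::x86_64::{_mm_movemask_epi8, _mm_set1_epi8};
    // SSE2 is part of the x86_64 baseline, so this is safe to call at runtime.
    unsafe { _mm_movemask_epi8(_mm_set1_epi8(-1)) }
}

fn main() {
    #[cfg(target_arch = "x86_64")]
    assert_eq!(movemask_demo(), 0xFFFF); // all 16 sign bits set
}
```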
rustc 1.30.0-nightly (5c875d938 2018-09-24)
Compiling faster v0.5.0
error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
--> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/vecs.rs:10:12
|
10 | use crate::std::fmt::Debug;
| ^^^ Maybe a missing `extern crate std;`?
error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
--> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/iters.rs:9:12
|
9 | use crate::std::slice::from_raw_parts;
| ^^^ Maybe a missing `extern crate std;`?
error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
--> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/intrin/eq.rs:8:12
|
8 | use crate::std::ops::BitXor;
| ^^^ Maybe a missing `extern crate std;`?
error[E0433]: failed to resolve. Maybe a missing `extern crate std;`?
--> /Users/blackanger/.cargo/registry/src/mirrors.ustc.edu.cn-15f9db60536bad60/faster-0.5.0/src/arch/x86/intrin/abs.rs:14:12
|
14 | use crate::std::mem::transmute;
| ^^^ Maybe a missing `extern crate std;`?
I've implemented a simple image-processing benchmark in Rust to try out several approaches for use in my crates:
https://github.com/pedrocr/rustc-math-bench
faster looks very interesting, so I was trying to add another implementation based on it to this file:
It's basically just a color conversion pass where each pixel has 4 values and the result is 3 values (camera space to RGB).
However, I can't seem to find a way in faster to iterate over groups of N values. Basically, I'd like an ergonomic, SIMD way to do:
for (pixin, pixout) in inb.chunks(4).zip(out.chunks_mut(3)) {
That's a common thing to do in graphics processing: converting one buffer with a given number of values per pixel into another with a different number per pixel. Is this a target use case for faster at all?
Congratulations on a very promising crate either way.
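For reference, the scalar version of the pattern the question describes can be sketched like this. The 3x4 matrix `M` is a placeholder (an identity-like pass-through), not the actual camera-to-RGB coefficients from rustc-math-bench:

```rust
// Hypothetical 3x4 camera-space-to-RGB matrix; coefficients are
// placeholders, not real conversion values.
const M: [[f32; 4]; 3] = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
];

// Each input pixel has 4 values; each output pixel has 3.
fn convert(inb: &[f32], out: &mut [f32]) {
    for (pixin, pixout) in inb.chunks(4).zip(out.chunks_mut(3)) {
        for (c, row) in pixout.iter_mut().zip(M.iter()) {
            // Dot product of one matrix row with the input pixel.
            *c = row.iter().zip(pixin).map(|(m, x)| m * x).sum();
        }
    }
}

fn main() {
    let inb = [0.5_f32, 0.25, 0.75, 1.0];
    let mut out = [0.0_f32; 3];
    convert(&inb, &mut out);
    assert_eq!(out, [0.5, 0.25, 0.75]); // identity rows pass channels through
}
```

The SIMD question is then how to express this 4-in/3-out stride mismatch with faster's iterators rather than `chunks`/`chunks_mut`.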