mattkretz / std-simd-feedback
Feedback from simd in the Parallelism TS 2
Links to public discussion relevant to the issue before finalizing the TS design:
The status quo of the TS:
simd_mask<T0, A0> is implicitly convertible to simd_mask<T1, A1> if both A0 and A1 are fixed_size<N> (same N).
simd<T0, A0> is implicitly convertible to simd<T1, A1> if A0 and A1 are fixed_size<N> (same N), and converting T0 to T1 preserves all possible values of T0.
As a consequence, the following example is too hard to write:
// set x[i] = 0 where y[i] is even
simd<float, Abi> zero_even(simd<float, Abi> x) {
    auto y = std::static_simd_cast<int>(x);
    //where((y & 1) == 0, x) = 0; // error unless Abi is fixed_size
    //where(std::static_simd_cast<float>(y & 1) == 0, x) = 0; // error unless Abi is fixed_size
    where(std::static_simd_cast<simd<float, Abi>>(y & 1) == 0, x) = 0; // works
    return x;
}

// scalar equivalent:
float zero_even(float x) {
    return ((static_cast<int>(x) & 1) == 0) ? 0 : x;
}

// How it should be for simd (also needs P0917R2):
simd<float, Abi> zero_even(simd<float, Abi> x) {
    return ((std::static_simd_cast<int>(x) & 1) == 0) ? 0 : x;
}
For the TS, the committee preferred to err on the side of being too strict and to discover the need for more implicit conversions through the TS process. It is therefore important to enumerate the cases where implicit conversions would be helpful.
I did prefer 'join', but now...
The standard libraries use multiple names for concatenating two objects into one, e.g. append, operator+=, join, operator+ (and strcat).
Split and join are used for ranges and views (threads use join too).
I can't think of any occurrences of 'concat' (merge, append, insert etc. are used).
On the surface, split and join make a nice pair.
However, join tends to be used where there is more to joining than simply concatenating, viz:
With the latest version of the proposal, is there a canonical way to construct a simd_mask<T, Abi> that is true for the first N lanes and false otherwise?
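Lacking a dedicated constructor, one way to build such a mask today is to go through a bool buffer and copy_from. A minimal sketch, assuming the TS implementation in <experimental/simd>; the helper name first_n_mask is ours:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical helper: a mask that is true for the first n lanes and
// false otherwise. Round-trips through a bool buffer because the TS
// offers no direct constructor for this.
template <typename T, typename Abi = stdx::simd_abi::native<T>>
stdx::simd_mask<T, Abi> first_n_mask(std::size_t n) {
    std::array<bool, stdx::simd_mask<T, Abi>::size()> buf{};
    std::fill_n(buf.begin(), std::min(n, buf.size()), true);
    stdx::simd_mask<T, Abi> m;
    m.copy_from(buf.data(), stdx::element_aligned);
    return m;
}
```

The detour through memory is exactly what a canonical constructor could avoid.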
It has been proposed that a simd can be statically resized:
resize<_NewSize>(simd_value);
If the new size is bigger than the original size, so that the simd value gains elements, what value should be inserted? Should it use the value initialisation of the simd type, or should it be user selectable with a default?
template<std::size_t _NewSize, typename _T, typename _Abi>
fixed_size_simd<_T, _NewSize> resize(simd<_T, _Abi> value, _T new_value = {});
This is the same as container resizing like std::vector::resize. Since the original size and the new size are known, should any growth operation enforce that the user supplies a value to insert into the new elements?
Should resize prevent a value from being shrunk and thereby silently losing elements? std::vector::resize provides no such warning. An alternative to resize would be to use the proposed simd::extract function with a begin offset of 0, which would make it clear that elements are being discarded, but would be more verbose.
If resize enforces certain behaviours then maybe there should be functions which make the expected behaviour clear:
simd::grow<_BiggerSize>(value, insertion_value); // Require the value for new elements to be explicitly given, and new size must be bigger
simd::shrink<_SmallerSize>(value); // New size must be smaller. Obvious that values are discarded.
There is no precedent for names like these though.
These need to be consistent and unambiguous with the corresponding functions introduced for builtin types (countl_zero, countl_one, countr_zero, countr_one, popcount). Note that there are two variants of these functions that are relevant to simd<T>:
- on simd_mask<T>, cf. std::bitset
- on simd<T>, returning simd<int>
The former is present in Par TS 2. The latter should be added, now that the [bit.count] functions are present in C++20.
The count[lr]_(zero|one) functions in C++20 are slightly different from find_(first|last)_set. The latter return in what place the first true value in the mask appears; however, if there is none, the behavior is undefined. The corresponding count[lr]_zero functions count the number of consecutive false values starting from the left or right and thus do not invoke undefined behavior.
It can be useful to create simd_masks from compressed bits which are stored in some unsigned integral value. This could be because the integral value is an easy way to quickly provide a known set of mask bits:
simd_mask<float>(0b0101101010010101);
or it could be because the application is actually using the compressed bit format to efficiently store data and there needs to be a convenient and efficient way to convert into a mask.
This mechanism only works up to the number of bits in the unsigned integral input. So a 64-bit integer couldn't initialise every element of a simd_mask with 128 elements, but only the lower ones.
There is precedent for this sort of operation in std::bitset, which provides a constructor from unsigned long long.
To undo the effects of this constructor we could also provide to_ulong()/to_ullong() functions which convert back to the compressed bit format.
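The std::bitset precedent, including the round trip back to packed bits, looks like this in plain C++; a simd_mask constructor plus to_ullong() could mirror it (bit i maps to lane i):

```cpp
#include <bitset>
#include <cassert>

// std::bitset already offers exactly this pair of operations: a
// constructor from unsigned long long, and to_ullong() to convert back.
std::bitset<16> mask_from_bits(unsigned long long packed) {
    return std::bitset<16>(packed);
}

unsigned long long bits_from_mask(const std::bitset<16>& m) {
    return m.to_ullong();
}
```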
A simd can be broken down into pieces of specific sizes:
const auto [a, b, c] = split<2, 3, 2>(simd_values);
At the moment split requires the sum of the pieces to match the size of the input value, but this seems overly restrictive. I can see that it helps guide correct behaviour by ensuring that nothing gets left behind, but I think it more likely that code might not know the size of the simd value, and consequently it should only be an error if the number of pieces is too large. For example, in the example above perhaps simd_values is a native simd type. This might have enough elements to supply 2 + 3 + 2 simd pieces, with some padding left over, but computing how much padding is needed makes for an unnecessarily verbose call.
I propose that split should require the sum of the pieces to be no larger than the number of elements, rather than exactly the same size.
LEWG asked for hashing support for simd
. My question whether the hash should be element-wise was not a joke, though. We need to explore this in terms of use cases.
to be discussed (and possibly re-added) in a paper discussing extract, insert/replace, and resize
Since implicit conversions on broadcasts are restricted to value-preserving conversions, the load/store interface is inconsistent with this strictness. I.e.:

using V = simd<float>;
V(1.);       // error, double -> float conversion is not value-preserving
V(short(1)); // good, short -> float is value-preserving

double dmem[V::size()] = {1.};
V(dmem, flags::element_aligned); // converting load, compiles (but shouldn't?)

short smem[V::size()] = {1};
V(smem, flags::element_aligned); // value-preserving conversion, compiles
The conversion is not obvious when you look at the line of code that requests the load.
Options:
- v.copy_from(mem, flags) requires mem to be a value_type array.
- v.memload_cvt_safe(mem, flags) allows value-preserving conversions.
- v.memload_cvt_unsafe(mem, flags) allows all conversions.
- V(mem, flags::element_aligned | flags::safe_cvt);
- V(mem, flags::element_aligned | flags::any_cvt);
- V(mem, flags::element_aligned | flags::saturating_cvt); // new feature
As a variation, safe conversions could be enabled by default and only unsafe conversions would require extra typing.
While it is not immediately obvious how many elements a simd will have, in some cases, like with the fixed_size ABI, it is. So it would be great if we could have a constructor that allows initializing a simd of size N with exactly N constructor arguments:

stdx::fixed_size_simd<double, 4> simd{0.0, 11.0, 22.0, 33.0};

I leave it up to you whether to also add such a constructor for simds of other ABIs.
simd_mask is close in behaviour to std::bitset, but unlike std::bitset it doesn't provide NOT (~) or shift operators (<<, >>). I'm not sure how useful the shift operators actually are, but for completeness I think they should be provided.
simd_mask does provide && and ||, but I'm not sure they really make sense. These are meant to be short-circuit operators, but, like any user-defined overload of these operators, they don't actually short-circuit. The simd_mask is a collection of bits, so providing only the bitwise operators would seem to make more sense to me.
I think that operator== and operator!= are also confusing. I would understand an expression like 'my_mask == your_mask' to mean 'are the masks the same', not 'which elements in the masks are the same'. Looking to std::bitset for guidance again, that uses the former meaning (and only provides == anyway, since != is synthesised), and returns a single value indicating equality for the entire value, not its elements.
shuffle is an important feature for simd; in our daily workloads we have found it necessary, so it should be included.
Based on your comment here, you would be open to some kind of function:
simd<T> cond(simd_mask<T>, simd<T>, simd<T>)
This is consistent with APIs I've seen in several other efforts to develop a SIMD abstraction layer, and might be more intuitive to our users.
As you mention, if the standard provides overloading of the ternary operator, this could build on that capability; however, I would recommend that a named non-member function still be provided so that implementations of std::simd could be developed that are usable with currently available C++ standards. In other words, take advantage of, but don't rely solely on, language changes.
In our efforts to implement something close to std::simd, we have chosen to only provide this primitive, which we arbitrarily name choose.
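Under the TS as it stands, such a primitive can already be sketched on top of where(). A minimal version, assuming <experimental/simd>; the name choose is, as above, arbitrary:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// choose(m, a, b): element-wise m[i] ? a[i] : b[i], i.e. a blend.
template <typename T, typename Abi>
stdx::simd<T, Abi> choose(const stdx::simd_mask<T, Abi>& m,
                          const stdx::simd<T, Abi>& a,
                          const stdx::simd<T, Abi>& b) {
    stdx::simd<T, Abi> r = b;
    stdx::where(m, r) = a; // overwrite the lanes where m is true
    return r;
}
```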
There seems to be a pattern that we need to explore. We want slices of:
- simd,
- contiguous containers (e.g. span, vector), and
- mdspan.
More?
Many other languages have syntax to support this. E.g. Python:
a[start:stop] # items start through stop-1
a[start:] # items start through the rest of the array
a[:stop] # items from the beginning through stop-1
a[:] # a copy of the whole array
a[start:stop:step] # start through not past stop, by step
a[-1] # last item in the array
a[-2:] # last two items in the array
a[:-2] # everything except the last two items
a[::-1] # all items in the array, reversed
a[1::-1] # the first two items, reversed
a[:-3:-1] # the last two items, reversed
a[-3::-1] # everything except the last two items, reversed
extract<0, 4>(v) could be written as v[0:4] with that syntax.
I think that the names of the copy_to/from functions, reductions, and so on should be left as-is, and an extra mask parameter added to provide an overload.
I don't like adding the _if suffix to the functions because that makes them too similar to existing functions which have a different behaviour. For example, copy_if only copies values to the destination if the predicate is met, so it has the effect of compressing the data too. If the predicate matched 3 out of 6 values from an input iterator, then the output will only have 3 values. But copy_to/copy_from are actually doing the equivalent of *this = mask ? *ptr : *this, which applies the mask bit in place.
If you wanted to change the name, then I suggest using a variant of the word mask in the name instead, to make it clear that this is an in-place mask being applied, not a C++ copy_if compression.
I think having a copy_if function that works on simd would be useful, but it should behave like its C++ counterpart and perform compression.
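For contrast, the compressing behaviour of the existing std:: algorithm, in plain C++ with nothing simd-specific:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// std::copy_if compresses: if 3 of 6 input values match the predicate,
// the output holds exactly those 3 values and nothing else. A masked
// copy_to, by contrast, writes in place and leaves unmasked destination
// elements untouched.
std::vector<int> keep_even(const std::vector<int>& in) {
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](int v) { return v % 2 == 0; });
    return out; // size == number of matches, not in.size()
}
```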
While porting a small example from Vc to std::simd, I got stuck with rsqrt not being available. For scalar code, when compiled with -ffast-math (or equivalent), the compiler also generates a scalar vrsqrtss from an expression like 1.0f/std::sqrt(x). However, if x is a std::simd, the corresponding vrsqrtps is not generated. See: https://godbolt.org/z/978j5qK44
Now we could argue that rsqrt is not accessible because it is generally not in <cmath>, and IIRC the Parallelism TS v2 proposes SIMD overloads only for the math functions already in <cmath>. Still, this functionality is needed for fast, inaccurate code.
Could the Parallelism TS v2 provide an additional rsqrt(std::simd) function? Or does someone need to write a proposal for an rsqrt in <cmath>? We also have std::fma there, I was surprised :D
The bits of the values of primitive types such as int, long, ... can be evaluated by popcount-like operations. In the same way, it would be useful to support such bit-wise operations on simd, not only popcount on simd_mask.
Whether simd<T> or simd_mask<T> is trivial is currently a QoI issue. In principle simd<T> should be trivial iff T is trivial. I believe simd_mask<T> should unconditionally be trivial (this excludes an implementation of fixed_size_simd_mask<T> via a std::bitset member).
Guaranteeing triviality enables reliable use of std::bit_cast, which is a rather common task in low-level SIMD programming.
The name was discussed in LEWG. If there is no new information via experience, WG21 should not spend any further time on renaming this function. Consequently, this is a call for feedback from usage experience. We need arguments why a certain name works or does not work (clarity, ambiguity); no calls for preference.
Note that if [P0917R2](https://wg21.link/p0917r2) gets adopted, the where function could possibly be removed in favor of conditional expressions for blending.
Consider:
std::simd<float> v(addr, std::vector_aligned);
v.copy_from(addr + 1, std::element_aligned);
v.copy_to(dest, std::element_aligned);
Line 1 supplies an optimization hint to the load operation. Line 2 says what, really? “Please don’t crash. I know this is not a vector-aligned access.” Line 3 says: “I don’t know whether it’s vector aligned or not. Compiler, if you know more, please optimize; otherwise just don’t make it crash.” (To clarify, the difference between lines 2 and 3 is what line 1 says about the alignment of addr.) In both cases of element_aligned access, the developer requested a behavior we take as given in all other situations. Why do we force them to spell it out in this case?
Since C++20, we also have another option:
std::simd<float> v(std::assume_aligned<std::memory_alignment_v<std::simd<float>>>(addr));
v.copy_from(addr + 1);
v.copy_to(dest);
This seems to compose well, except that line 1 is rather long for a common pattern in
this interface. Also, this removes implementation freedom because the library cannot
statically determine the alignment properties of the pointer.
Consequently, I suggest keeping the load/store flags as in the TS but defaulting them to element_aligned. I.e.:
std::simd<float> v(addr, std::vector_aligned);
v.copy_from(addr + 1);
v.copy_to(dest);
Should there be a nexttoward overload for simd? And should it use fixed_size_simd<long double, N> y and/or long double y as the second parameter? The former is more general but is likely to incur unnecessary cost in many cases.
The names were discussed in SG1 and LEWG. If there is no new information via experience, WG21 should not spend any further time on renaming these functions. Consequently, this is a call for feedback from usage experience. We need arguments why a certain name works or does not work (clarity, ambiguity); no calls for preference.
The issue that I have here is that I don't understand whether those support cv-qualifiers or references.
For example, I would assume that the two expressions rebind_simd<simd<whatever, whatever>, to-x> and rebind_simd<const simd<whatever, whatever>, to-x> would give me the same result, except that the const qualification (if present on the original argument) would be preserved for the rebound simd type.
Note: What I describe works for std::allocator_traits<const A>::rebind_alloc<args> today, but it drops const for the resulting A, and it doesn't work for references (https://godbolt.org/z/Ms6MKjKq1). Probably we should do the same, or define these semantics in another way. Currently it looks like the proposal doesn't support such a scenario.
simd<T> uses the simd_abi::compatible<T> ABI tag. Is this the default we really want?
[parallel.simd.math] only asks for is_floating_point_v<T> and forgot to allow signed integral T. All of <cmath> allows integral arguments and implicitly converts to double. This doesn't make sense for simd, which is why no integral overloads are specified. But abs(int) returns int, not double, which makes sense for simd.
It would be useful to provide overloads of operations like std::sqrt that work with simd types. Here is a non-exhaustive list of non-member operations we provide in our attempt that are desired by our users:
std::sqrt, std::sin, std::cos, std::exp, std::log, std::pow, std::copysign, std::fma, std::cbrt
Implementations can just fall back to calling the existing scalar operations. Note that std::fma and std::cbrt can be implemented by calling the Vector Mathematical Functions of the Intel Math Kernel Library on Intel platforms.
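Such a scalar fallback can be written generically with the TS generator constructor. A sketch assuming <experimental/simd>; the name elementwise_sqrt is ours, chosen to avoid colliding with the overloads the TS already specifies, and a real implementation would dispatch to vector math libraries where profitable:

```cpp
#include <cassert>
#include <cmath>
#include <experimental/simd>

namespace stdx = std::experimental;

// Generic scalar fallback: apply std::sqrt lane by lane via the
// generator constructor. The same pattern covers sin, cos, exp, ...
template <typename T, typename Abi>
stdx::simd<T, Abi> elementwise_sqrt(const stdx::simd<T, Abi>& v) {
    return stdx::simd<T, Abi>([&](auto i) { return std::sqrt(v[i]); });
}
```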
Masked memory operations in some instruction sets allow the masked elements to suppress memory faults (e.g., trying to load from an unmapped page). If copy_to and copy_from get a mask parameter then this leads to the question of what should happen with respect to memory faults for each element.
Some implementations of std::simd may choose to use actual hardware support for masking (e.g., avx-512 masked stores) which will ensure that no memory fault can occur, but can the user rely on that behaviour?
In a generic implementation, a masked load could be implemented as a normal load which is then conditionally blended in a register. A generic implementation of a masked store could perform a read-modify-write sequence. Neither of these approaches would suppress memory faults, and providing a generic implementation which did suppress faults (e.g., by conditionally checking every element as it is read or written individually) could be slow for the more likely common case where fault suppression doesn't matter.
If nothing is said about fault suppression in std::simd then the user cannot assume that it will happen, and must avoid using the copy_to and copy_from functions for that purpose.
Perhaps a masked copy_to or copy_from should not guarantee fault suppression unless a flag is supplied which enforces that behaviour? Or we need separate functions which enforce fault suppression while possibly incurring slow-down?
The standard omits g/s (gather/scatter) operations. I think that class simd should offer such capabilities. I think coding g/s as member functions is the cleanest solution. I have proposed code here, which makes use of class simd's constructor taking a generator function. This route should convey sufficient information to the compiler to be optimized into hardware g/s instructions, but can be specialized to enforce this where the optimization fails.
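The generator-constructor route mentioned above can be sketched like this, assuming <experimental/simd>; the helper name gather is ours:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// gather: element i of the result is base[indices[i]]. The generator
// constructor exposes every per-lane index to the compiler, which may
// lower the whole construction to a hardware gather where one exists.
template <typename T, typename Abi>
stdx::simd<T, Abi> gather(const T* base, const int* indices) {
    return stdx::simd<T, Abi>([&](auto i) { return base[indices[i]]; });
}
```

A scatter counterpart would loop over the lanes and store v[i] to base[indices[i]].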
static_simd_cast and simd_cast in the TS support specifying either a value_type or a simd type. Any feedback on casts is interesting. Certainly, the omission of casting simd_mask<T> is a must-fix, in my opinion.
The <cmath> functions return int and therefore, without extra wording, the simd overloads would return fixed_size_simd<int, N>. Since those ints can only have the values 0 or 1, a bool, or in this case a simd_mask, would be a better fit.
This is a question that was never really discussed before the TS. It was dismissed without discussion IIRC.
After adding std::is_constant_evaluated()
to C++20, the necessary facilities to be certain that a constexpr implementation is possible are there. In some cases this is an extra burden on the implementation and thus it becomes a trade-off question of implementation complexity (and by extension compile times) versus usefulness.
While the status quo of the TS does not support constexpr simd, the direction of C++ in general is to make as much of C++ constexpr as possible. For simd it is certainly possible. Feedback listing use cases for adding constexpr would be valuable nonetheless. Note that the std-simd (libstdc++) implementation supports constexpr in most interfaces (which makes it non-conforming, sadly).
There is a proposal to add a generated permutation function which can be used like this:
auto evenValues = permute<_OutputSize>(simd_value, [](std::size_t idx) -> std::size_t { return idx * 2; }); // output element idx reads input element idx * 2
This allows an arbitrarily sized output simd to be created, where the mapping of input index to output index is specified through a supplied generator function. It is inevitable that certain permutations are common, and it would be desirable to have pre-defined functions for freeing users from having to reinvent them. By providing named functions it also becomes possible for the library implementation to provide fast alternatives where the target supports them natively.
A few common functions could include:
reverse(simd_value); // Reverse the order of the elements in the simd.
// Could be given a [_Begin, _End] range sub-region to reverse.
shift_left(simd_value, offset);
slice(simd_value, start, size, stride); // Like valarray::slice
rotate(simd_value, offset);
interleave(value0, value1); // Interleave values from each argument in turn (equivalent to matrix transpose by simd rows)
Like issue #38, should these allow dynamic or static behaviour?
Where these functions could operate on subsets of a simd, should it be up to the user to handle the subset explicitly? For example, to reverse a subset of elements in a simd:
auto e = extract<_Begin, _End>(source);
auto r = reverse(e);
output = insert<_Begin>(source, r);
Note that insert, extract and resize are already pre-defined permutations which can be implemented using the generic permute-with-generator.
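As a proof of concept, reverse() falls out of the generator form directly. A sketch assuming <experimental/simd>:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// reverse as a pre-defined permutation: output lane i reads input lane
// size() - 1 - i. A library version could specialize this to a single
// hardware shuffle where the target supports it.
template <typename T, typename Abi>
stdx::simd<T, Abi> reverse(const stdx::simd<T, Abi>& v) {
    using V = stdx::simd<T, Abi>;
    return V([&](auto i) { return v[V::size() - 1 - i]; });
}
```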
Extracted from #32
class simd {
    template <std::contiguous_iterator It>
        requires std::indirectly_writable<It, const value_type&>
    void copy_to(It first, Flags);

    void copy_from(std::contiguous_iterator auto first, Flags);
};
Ideas:
inner_min(simd)
reduce_min(simd)
min_reduce(simd)
min_element(simd)
simd.min()
The name was discussed in SG1 and LEWG. SG1 chose datapar over simd and other options. LEWG overruled that choice and renamed it to simd.
We should use the TS to discover the following: do users reject simd because it appears too narrowly focused (from its name) on (certain) CPUs? Would such users choose differently if the name focused on the parallelism it allows instead of a specific hardware implementation of such parallelism?

There is a proposal to add static insert/extract functions:
extract<_Begin, _End>(v); // Extract a simd of size (_End - _Begin) from v, starting at index _Begin.
insert<_Begin>(v, child); // Insert a smaller simd into v at index _Begin, overwriting the values in the range [_Begin, _End]
Since the indexes are known, checking can be performed to ensure the insert/extract ranges are valid.
Investigate whether it is useful to allow the insert/extract positions to be dynamic:
extract<_Size>(v, i); // Extract a simd of the given size, starting at the given dynamic index.
insert(v, child, i); // Insert the child into the given position of v, overwriting the existing values.
The size of the insert/extract will always be static, since the simd must have a known size. Only the position could vary dynamically.
Is this useful?
Does it have a performance impact?
How is error handling done?
simd_mask provides copy_to and copy_from which read and write memory regions containing bool values. I think this is a fairly unusual operation because the native simd_mask implementation is likely to be a full set of element bits (AVX-like), or a compact bit set (e.g., ARM, AVX-512). The copy_to/copy_from functions do provide a way of generically storing simd_masks to and from memory, but I'm not sure I like forcing them to be a particular format which doesn't match either of the likely common implementations.
Given that simd_mask is proposed to have the operator+ or a conversion operator, could a programmer use that instead to explicitly mimic the effect:
simd_mask<uint8_t> m = some_mask;
simd<uint8_t>(some_mask).copy_to(uint8_t_ptr);
// or (+some_mask).copy_to(uint8_t_ptr);
Similarly, reading back from memory can be accomplished using operator! to convert from simd elements to mask bits.
!(simd<uint8_t>(ptr_to_memory))
I think we should remove those memory operations and require the user to explicitly manage the storage when they need to.
We have pixel data in R1G1B1A1R2G2B2A2... format. For compositing we need to deinterleave it into R1R2R3..., G1G2G3..., B1B2B3..., A1A2A3... format.
In the Vc library we could use Vc::InterleavedMemoryWrapper to achieve this goal. std-simd doesn't provide any efficient solution for the task. We could use the generator constructor for it, but it doesn't seem to generate what we want:
Old version with Vc::InterleavedMemoryWrapper:
https://godbolt.org/z/fEQ3s1
Generator syntax with std-simd:
https://godbolt.org/z/pUtMXm
You can see that for the generator-based version the compiler generates more than twice as many instructions. That is really suboptimal.
PS:
I think this issue may be somewhat related: VcDevel/std-simd#4
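Pending a dedicated facility, the strided access can at least be stated with the generator constructor. A sketch assuming <experimental/simd>; load_strided is our name, and as the godbolt links above show, this expresses the intent but may not produce optimal code today:

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Deinterleave one channel from AoS pixel data: element i of the result
// is base[i * stride + channel]. E.g. stride 4, channel 0 yields
// R1, R2, R3, ... from R1G1B1A1 R2G2B2A2 ...
template <typename T, typename Abi>
stdx::simd<T, Abi> load_strided(const T* base, std::size_t stride,
                                std::size_t channel) {
    return stdx::simd<T, Abi>([&](auto i) { return base[i * stride + channel]; });
}
```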
Real-world use of rebind_simd_t suggests that simd_cast is a concise and readable way of indicating what you mean, while rebind_simd_t is a bit verbose.
Shen [P0820R1] argues for a substantial change to the definition of the relational operators.
Kretz [P0851R0] presents the reason for the status quo. The choices were:
- operator== and operator!= return simd_mask (status quo).
- Comparisons returning simd_mask are provided via new functions (member vs. non-member).
- operator== and operator!= are defined and return bool.
If there is new information showing that the current behavior is problematic and a different behavior is more useful, the discussion should be reopened.
Whenever the interface requires a simd type that has a specific number of elements (typically deduced from one or more native types), we have a choice:
- Use fixed_size<N>. This is easy to memorize and understand. However, fixed_size<N> may be less efficient if such objects have to pass a function call boundary. Also, absent more implicit conversions, users that prefer native types will have to use an explicit cast.
- Use simd_abi::deduce_t<T, N, Abis...>. It is less obvious what the exact type will be (implementation-defined or fixed_size<N>) and the choice will be target-dependent. This makes it slightly harder to write portable code (typically the logic is portable, it just needs an additional cast for some targets). The portability issue could be reduced by allowing more implicit casts.
The current situation is inconsistent: simd_cast<vectorizable_type> produces fixed_size, whereas split and concat use deduce_t. Consequently, WG21 might want to do either one of:
- Change split and concat to use fixed_size<N>.
- Change simd_cast to use rebind_simd and thus deduce_t.