mattkretz / std-simd-feedback
Feedback from simd in the Parallelism TS 2
Links to public discussion relevant to the issue before finalizing the TS design:
The status quo of the TS:
simd_mask<T0, A0> is implicitly convertible to simd_mask<T1, A1> if both A0 and A1 are fixed_size<N> (same N).
simd<T0, A0> is implicitly convertible to simd<T1, A1> if A0 and A1 are fixed_size<N> (same N), and converting T0 to T1 preserves all possible values of T0.
As a consequence, the following example is too hard to write:
// set x[i] = 0 where y[i] is even
simd<float, Abi> zero_even(simd<float, Abi> x) {
    auto y = std::static_simd_cast<int>(x);
    //where((y & 1) == 0, x) = 0; // error unless Abi is fixed_size
    //where(std::static_simd_cast<float>(y & 1) == 0, x) = 0; // error unless Abi is fixed_size
    where(std::static_simd_cast<simd<float, Abi>>(y & 1) == 0, x) = 0; // works
    return x;
}

// scalar equivalent:
float zero_even(float x) {
    return ((static_cast<int>(x) & 1) == 0) ? 0 : x;
}

// How it should be for simd (also needs P0917R2):
simd<float, Abi> zero_even(simd<float, Abi> x) {
    return ((std::static_simd_cast<int>(x) & 1) == 0) ? 0 : x;
}
For the TS, the committee preferred to err on the side of being too strict and to discover the need for more implicit conversions through the TS process. It is therefore important to enumerate the cases where implicit conversions would be helpful.
I did prefer 'join', but now...
The standard libraries use multiple names for concatenating two objects into one, e.g. append, operator+=, join, operator+ (and strcat).
Split and join are used for ranges and views (threads use join too).
I can't think of any occurrences of 'concat' (merge, append, insert etc. are used).
On the surface, split and join make a nice pair.
However, join tends to be used where there is more to joining than simply concatenating, viz:
With the latest version of the proposal, is there a canonical way to construct a simd_mask<T, Abi> that is true for the first N lanes and false otherwise?
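Lacking a dedicated constructor, one way to build such a mask today is to go through a bool buffer and copy_from. A minimal sketch, assuming the TS implementation in <experimental/simd>; the helper name first_n_mask is ours:

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Hypothetical helper: a mask that is true for the first n lanes and
// false otherwise. Round-trips through a bool buffer because the TS
// offers no direct constructor for this.
template <typename T, typename Abi = stdx::simd_abi::native<T>>
stdx::simd_mask<T, Abi> first_n_mask(std::size_t n) {
    std::array<bool, stdx::simd_mask<T, Abi>::size()> buf{};
    std::fill_n(buf.begin(), std::min(n, buf.size()), true);
    stdx::simd_mask<T, Abi> m;
    m.copy_from(buf.data(), stdx::element_aligned);
    return m;
}
```

The detour through memory is exactly what a canonical constructor could avoid.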
It has been proposed that a simd can be statically resized:
resize<_NewSize>(simd_value);
If the new size is bigger than the original size, so that the simd value gains elements, what value should be inserted? Should it use the value initialisation of the simd type, or should it be user selectable with a default?
template<std::size_t _NewSize, typename _T, typename _Abi>
fixed_size_simd<_T, _NewSize> resize(simd<_T, _Abi> value, _T new_value = {});
This is the same as container resizing like std::vector::resize. Since the original size and the new size are known, should any growth operation enforce that the user supplies a value to insert into the new elements?
Should resize prevent a value from being shrunk and thereby silently losing elements? std::vector::resize provides no such warning. An alternative to resize would be to use the proposed simd::extract function with a begin offset of 0, which would make it clear that elements are being discarded, but would be more verbose.
If resize enforces certain behaviours then maybe there should be functions which make the expected behaviour clear:
simd::grow<_BiggerSize>(value, insertion_value); // Require the value for new elements to be explicitly given, and new size must be bigger
simd::shrink<_SmallerSize>(value); // New size must be smaller. Obvious that values are discarded.
There is no precedent for names like these though.
These need to be consistent and unambiguous with the corresponding functions introduced for builtin types (countl_zero, countl_one, countr_zero, countr_one, popcount). Note that there are two variants of these functions that are relevant to simd<T>:
- on simd_mask<T>, cf. std::bitset
- on simd<T>, returning simd<int>
The former is present in Par TS 2. The latter should be added, now that the [bit.count] functions are present in C++20.
The count[lr]_(zero|one) functions in C++20 are slightly different from find_(first|last)_set. The latter return in what place the first true value in the mask appears; however, if there is none, the behavior is undefined. The corresponding count[lr]_zero functions count the number of consecutive false values starting from the left or right and thus do not invoke undefined behavior.
It can be useful to create simd_masks from compressed bits which are stored in some unsigned integral value. This could be because the integral value is an easy way to quickly provide a known set of mask bits:
simd_mask<float>(0b0101101010010101);
or it could be because the application is actually using the compressed bit format to efficiently store data and there needs to be a convenient and efficient way to convert into a mask.
This mechanism only works up to the number of bits in the unsigned integral input. So a 64-bit integer couldn't initialise every element of a simd_mask with 128 elements, but only the lower ones.
There is precedent for this sort of operation in std::bitset, which provides a constructor from unsigned long long.
To undo the effects of this constructor we could also provide to_ulong()/to_ullong() functions which convert back to the compressed bit format.
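The std::bitset precedent, including the round trip back to packed bits, looks like this in plain C++; a simd_mask constructor plus to_ullong() could mirror it (bit i maps to lane i):

```cpp
#include <bitset>
#include <cassert>

// std::bitset already offers exactly this pair of operations: a
// constructor from unsigned long long, and to_ullong() to convert back.
std::bitset<16> mask_from_bits(unsigned long long packed) {
    return std::bitset<16>(packed);
}

unsigned long long bits_from_mask(const std::bitset<16>& m) {
    return m.to_ullong();
}
```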
A simd can be broken down into pieces of specific sizes:
const auto [a, b, c] = split<2, 3, 2>(simd_values);
At the moment split requires the sum of the pieces to match the size of the input value, but this seems overly restrictive. I can see that it helps guide correct behaviour by ensuring that nothing gets left behind, but I think it more likely that code might not know the size of the simd value, and consequently it should only be an error if the number of pieces is too large. For example, in the example above perhaps simd_values is a native simd type. This might have enough elements to supply 2 + 3 + 2 simd pieces, with some padding left over, but computing how much padding is needed makes for an unnecessarily verbose call.
I propose that split should require the sum of the pieces to be no larger than the number of elements, rather than exactly the same size.
LEWG asked for hashing support for simd
. My question whether the hash should be element-wise was not a joke, though. We need to explore this in terms of use cases.
to be discussed (and possibly re-added) in a paper discussing extract, insert/replace, and resize
Since implicit conversions on broadcasts are restricted to value-preserving conversions, the load/store interface is inconsistent with this strictness. I.e.:

using V = simd<float>;
V(1.);       // error, double -> float conversion is not value-preserving
V(short(1)); // good, short -> float is value-preserving

double dmem[V::size()] = {1.};
V(dmem, flags::element_aligned); // converting load, compiles (but shouldn't?)

short smem[V::size()] = {1};
V(smem, flags::element_aligned); // value-preserving conversion, compiles
The conversion is not obvious when you look at the line of code that requests the load.
Options:
- v.copy_from(mem, flags) requires mem to be a value_type array.
- v.memload_cvt_safe(mem, flags) allows value-preserving conversions.
- v.memload_cvt_unsafe(mem, flags) allows all conversions.
- V(mem, flags::element_aligned | flags::safe_cvt);
- V(mem, flags::element_aligned | flags::any_cvt);
- V(mem, flags::element_aligned | flags::saturating_cvt); // new feature
As a variation, safe conversions could be enabled by default and only unsafe conversions would require extra typing.
While it is not immediately obvious how many elements a simd will have, in some cases, like with the fixed_size ABI, it is. So it would be great if we could have a constructor that allows initializing a simd of size N with exactly N constructor arguments:

stdx::fixed_size_simd<double, 4> simd{0.0, 11.0, 22.0, 33.0};

I leave it up to you whether to also add such a constructor for simds of other ABIs.
simd_mask is close in behaviour to std::bitset, but unlike std::bitset it doesn't provide NOT (~) or shift operators (<<, >>). I'm not sure how useful the shift operators actually are, but for completeness I think they should be provided.
simd_mask does provide && and ||, but I'm not sure they really make sense. These are meant to be short-circuit operators, but, like any user-defined overload of these operators, they don't actually short-circuit. The simd_mask is a collection of bits, so providing only the bitwise operators would seem to make more sense to me.
I think that operator== and operator!= are also confusing. I would understand an expression like 'my_mask == your_mask' to mean 'are the masks the same', not 'which elements in the masks are the same'. Looking to std::bitset for guidance again, that uses the former meaning (and only provides == anyway, since != is synthesised), and returns a single value indicating equality for the entire value, not its elements.
shuffle is an important feature for simd; in our daily workloads we have found it necessary, so it should be included.
Based on your comment here, you would be open to some kind of function:
simd<T> cond(simd_mask<T>, simd<T>, simd<T>)
This is consistent with APIs I've seen in several other efforts to develop a SIMD abstraction layer, and might be more intuitive to our users.
As you mention, if the standard provides overloading of the ternary operator, this could build on that capability; however, I would recommend that a named non-member function still be provided so that implementations of std::simd could be developed that are usable with currently available C++ standards. In other words, take advantage of, but don't rely solely on, language changes.
In our efforts to implement something close to std::simd, we have chosen to only provide this primitive, which we arbitrarily name choose.
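Under the TS as it stands, such a primitive can already be sketched on top of where(). A minimal version, assuming <experimental/simd>; the name choose is, as above, arbitrary:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// choose(m, a, b): element-wise m[i] ? a[i] : b[i], i.e. a blend.
template <typename T, typename Abi>
stdx::simd<T, Abi> choose(const stdx::simd_mask<T, Abi>& m,
                          const stdx::simd<T, Abi>& a,
                          const stdx::simd<T, Abi>& b) {
    stdx::simd<T, Abi> r = b;
    stdx::where(m, r) = a; // overwrite the lanes where m is true
    return r;
}
```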
There seems to be a pattern that we need to explore. We want slices of:
- simd,
- contiguous containers (e.g. span, vector), and
- mdspan.
More?
Many other languages have syntax to support this. E.g. Python:
a[start:stop] # items start through stop-1
a[start:] # items start through the rest of the array
a[:stop] # items from the beginning through stop-1
a[:] # a copy of the whole array
a[start:stop:step] # start through not past stop, by step
a[-1] # last item in the array
a[-2:] # last two items in the array
a[:-2] # everything except the last two items
a[::-1] # all items in the array, reversed
a[1::-1] # the first two items, reversed
a[:-3:-1] # the last two items, reversed
a[-3::-1] # everything except the last two items, reversed
extract<0, 4>(v) could be written as v[0:4] with that syntax.
I think that the names of the copy_to/from functions, reductions, and so on should be left as-is, and an extra mask parameter added to provide an overload.
I don't like adding the _if suffix to the functions because that makes them too similar to existing functions which have a different behaviour. For example, copy_if only copies values to the destination if the predicate is met, so it has the effect of compressing the data too. If the predicate matched 3 out of 6 values from an input iterator, then the output will only have 3 values. But copy_to/copy_from are actually doing the equivalent of *this = mask ? *ptr : *this, which applies the mask bit in place.
If you wanted to change the name, then I suggest using a variant of the word mask in the name instead, to make it clear that this is an in-place mask being applied, not a C++ copy_if compression.
I think having a copy_if function that works on simd would be useful, but it should behave like its C++ counterpart and perform compression.
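For contrast, the compressing behaviour of the existing std:: algorithm, in plain C++ with nothing simd-specific:

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// std::copy_if compresses: if 3 of 6 input values match the predicate,
// the output holds exactly those 3 values and nothing else. A masked
// copy_to, by contrast, writes in place and leaves unmasked destination
// elements untouched.
std::vector<int> keep_even(const std::vector<int>& in) {
    std::vector<int> out;
    std::copy_if(in.begin(), in.end(), std::back_inserter(out),
                 [](int v) { return v % 2 == 0; });
    return out; // size == number of matches, not in.size()
}
```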
While porting a small example from Vc to std::simd, I got stuck with rsqrt not being available. For scalar code, when compiled with -ffast-math (or equivalent), the compiler also generates a scalar vrsqrtss from an expression like 1.0f/std::sqrt(x). However, if x is a std::simd, the corresponding vrsqrtps is not generated. See: https://godbolt.org/z/978j5qK44
Now we could argue that rsqrt is not accessible because it is generally not in <cmath>, and IIRC the Parallelism TS v2 proposes SIMD overloads only for the math functions already in <cmath>. Still, this functionality is needed for fast, inaccurate code.
Could the Parallelism TS v2 provide an additional rsqrt(std::simd) function? Or does someone need to write a proposal for an rsqrt in <cmath>? We also have std::fma there, I was surprised :D
The bits of the values of primitive types such as int, long, ... can be evaluated by popcount-like operations. In the same way, it would be useful to support such bit-wise operations on simd, not only popcount on simd_mask.
Whether simd<T> or simd_mask<T> is trivial is currently a QoI issue. In principle simd<T> should be trivial iff T is trivial. I believe simd_mask<T> should unconditionally be trivial (this excludes an implementation of fixed_size_simd_mask<T> via a std::bitset member).
Guaranteeing triviality enables reliable use of std::bit_cast, which is a rather common task in low-level SIMD programming.
The name was discussed in LEWG. If there is no new information via experience, WG21 should not spend any further time on renaming this function. Consequently, this is a call for feedback from usage experience. We need arguments why a certain name works or does not work (clarity, ambiguity); no calls for preference.
Note that if [P0917R2](https://wg21.link/p0917r2) gets adopted, the where function could possibly be removed in favor of conditional expressions for blending.
Consider:
std::simd<float> v(addr, std::vector_aligned);
v.copy_from(addr + 1, std::element_aligned);
v.copy_to(dest, std::element_aligned);
Line 1 supplies an optimization hint to the load operation. Line 2 says what, really? “Please don’t crash. I know this is not a vector-aligned access.” Line 3 says: “I don’t know whether it’s vector aligned or not. Compiler, if you know more, please optimize; otherwise just don’t make it crash.” (To clarify, the difference between lines 2 and 3 is what line 1 says about the alignment of addr.) In both cases of element_aligned access, the developer requested a behavior we take as given in all other situations. Why do we force them to spell it out in this case?
Since C++20, we also have another option:
std::simd<float> v(std::assume_aligned<std::memory_alignment_v<std::simd<float>>>(addr));
v.copy_from(addr + 1);
v.copy_to(dest);
This seems to compose well, except that line 1 is rather long for a common pattern in
this interface. Also, this removes implementation freedom because the library cannot
statically determine the alignment properties of the pointer.
Consequently, I suggest keeping the load/store flags as in the TS but defaulting them to element_aligned. I.e.:
std::simd<float> v(addr, std::vector_aligned);
v.copy_from(addr + 1);
v.copy_to(dest);
Should there be a nexttoward overload for simd? And should it use fixed_size_simd<long double, N> y and/or long double y as the second parameter? The former is more general but is likely to incur unnecessary cost in many cases.
The names were discussed in SG1 and LEWG. If there is no new information via experience, WG21 should not spend any further time on renaming these functions. Consequently, this is a call for feedback from usage experience. We need arguments why a certain name works or does not work (clarity, ambiguity); no calls for preference.
The issue that I have here is that I don't understand whether those support cv-qualifiers or references.
For example, I would assume that the two expressions rebind_simd<simd<whatever, whatever>, to-x> and rebind_simd<const simd<whatever, whatever>, to-x> would give me the same result, except that the const qualification (if present on the original argument) would be preserved for the rebound simd type.
Note: What I describe works for std::allocator_traits<const A>::rebind_alloc<args> today, but it drops const for the resulting A, and it doesn't work for references (https://godbolt.org/z/Ms6MKjKq1). Probably we should do the same, or define these semantics in another way. Currently it looks like the proposal doesn't support such a scenario.
simd<T> uses the simd_abi::compatible<T> ABI tag. Is this the default we really want?
[parallel.simd.math] only asks for is_floating_point_v<T> and forgot to allow signed integral T. All of <cmath> allows integral arguments and implicitly converts to double. This doesn't make sense for simd, which is why no integral overloads are specified. But abs(int) returns int, not double, which makes sense for simd.
It would be useful to provide overloads of operations like std::sqrt that work with simd types. Here is a non-exhaustive list of non-member operations we provide in our attempt that are desired by our users:
std::sqrt, std::sin, std::cos, std::exp, std::log, std::pow, std::copysign, std::fma, std::cbrt
Implementations can just fall back to calling the existing scalar operations. Note that std::fma and std::cbrt can be implemented by calling the Vector Mathematical Functions of the Intel Math Kernel Library on Intel platforms.
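Such a scalar fallback can be written generically with the TS generator constructor. A sketch assuming <experimental/simd>; the name elementwise_sqrt is ours, chosen to avoid colliding with the overloads the TS already specifies, and a real implementation would dispatch to vector math libraries where profitable:

```cpp
#include <cassert>
#include <cmath>
#include <experimental/simd>

namespace stdx = std::experimental;

// Generic scalar fallback: apply std::sqrt lane by lane via the
// generator constructor. The same pattern covers sin, cos, exp, ...
template <typename T, typename Abi>
stdx::simd<T, Abi> elementwise_sqrt(const stdx::simd<T, Abi>& v) {
    return stdx::simd<T, Abi>([&](auto i) { return std::sqrt(v[i]); });
}
```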
Masked memory operations in some instruction sets allow the masked elements to suppress memory faults (e.g., trying to load from an unmapped page). If copy_to and copy_from get a mask parameter then this leads to the question of what should happen with respect to memory faults for each element.
Some implementations of std::simd may choose to use actual hardware support for masking (e.g., avx-512 masked stores) which will ensure that no memory fault can occur, but can the user rely on that behaviour?
In a generic implementation, a masked load could be implemented as a normal load which is then conditionally blended in a register. A generic implementation of a masked store could perform a read-modify-write sequence. Neither of these approaches would suppress memory faults, and providing a generic implementation which did suppress faults (e.g., by conditionally checking every element as it is read or written individually) could be slow for the more likely common case where fault suppression doesn't matter.
If nothing is said about fault suppression in std::simd then the user cannot assume that it will happen, and must avoid using the copy_to and copy_from functions for that purpose.
Perhaps a masked copy_to or copy_from should not guarantee fault suppression unless a flag is supplied which enforces that behaviour? Or we need separate functions which enforce fault suppression while possibly incurring slow-down?
The standard omits g/s (gather/scatter) operations. I think that class simd should offer such capabilities. I think coding g/s as member functions is the cleanest solution. I have proposed code here, which makes use of class simd's constructor taking a generator function. This route should convey sufficient information to the compiler to be optimized into hardware g/s instructions, but can be specialized to enforce this where the optimization fails.
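The generator-constructor route mentioned above can be sketched like this, assuming <experimental/simd>; the helper name gather is ours:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// gather: element i of the result is base[indices[i]]. The generator
// constructor exposes every per-lane index to the compiler, which may
// lower the whole construction to a hardware gather where one exists.
template <typename T, typename Abi>
stdx::simd<T, Abi> gather(const T* base, const int* indices) {
    return stdx::simd<T, Abi>([&](auto i) { return base[indices[i]]; });
}
```

A scatter counterpart would loop over the lanes and store v[i] to base[indices[i]].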
static_simd_cast and simd_cast in the TS support specifying either a value_type or a simd type. Any feedback on casts is interesting. Certainly, the omission of casting simd_mask<T> is a must-fix, in my opinion.
The <cmath> functions return int and therefore, without extra wording, the simd overloads would return fixed_size_simd<int, N>. Since those ints can only have the values 0 or 1, a bool, or in this case a simd_mask, would be a better fit.
This is a question that was never really discussed before the TS. It was dismissed without discussion IIRC.
After adding std::is_constant_evaluated()
to C++20, the necessary facilities to be certain that a constexpr implementation is possible are there. In some cases this is an extra burden on the implementation and thus it becomes a trade-off question of implementation complexity (and by extension compile times) versus usefulness.
While the status quo of the TS does not support constexpr simd, the direction of C++ in general is to make as much of C++ constexpr as possible. For simd it is certainly possible. Feedback listing use cases for adding constexpr would be valuable nonetheless. Note that the std-simd (libstdc++) implementation supports constexpr in most interfaces (which makes it non-conforming, sadly).
There is a proposal to add a generated permutation function which can be used like this:
auto evenValues = permute<_OutputSize>(simd_value, [](std::size_t idx) -> std::size_t { return idx * 2; }); // output element idx reads input element idx * 2
This allows an arbitrarily sized output simd to be created, where the mapping of input index to output index is specified through a supplied generator function. It is inevitable that certain permutations are common, and it would be desirable to have pre-defined functions for freeing users from having to reinvent them. By providing named functions it also becomes possible for the library implementation to provide fast alternatives where the target supports them natively.
A few common functions could include:
reverse(simd_value); // Reverse the order of the elements in the simd.
// Could be given a [_Begin, _End] range sub-region to reverse.
shift_left(simd_value, offset);
slice(simd_value, start, size, stride); // Like valarray::slice
rotate(simd_value, offset);
interleave(value0, value1); // Interleave values from each argument in turn (equivalent to matrix transpose by simd rows)
Like issue #38, should these allow dynamic or static behaviour?
Where these functions could operate on subsets of a simd, should it be up to the user to handle the subset explicitly? For example, to reverse a subset of elements in a simd:
auto e = extract<_Begin, _End>(source);
auto r = reverse(e);
output = insert<_Begin>(source, r);
Note that insert, extract and resize are already pre-defined permutations which can be implemented using the generic permute-with-generator.
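As a proof of concept, reverse() falls out of the generator form directly. A sketch assuming <experimental/simd>:

```cpp
#include <cassert>
#include <experimental/simd>

namespace stdx = std::experimental;

// reverse as a pre-defined permutation: output lane i reads input lane
// size() - 1 - i. A library version could specialize this to a single
// hardware shuffle where the target supports it.
template <typename T, typename Abi>
stdx::simd<T, Abi> reverse(const stdx::simd<T, Abi>& v) {
    using V = stdx::simd<T, Abi>;
    return V([&](auto i) { return v[V::size() - 1 - i]; });
}
```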
Extracted from #32
class simd {
    template <std::contiguous_iterator It>
        requires std::indirectly_writable<It, const value_type&>
    void copy_to(It first, Flags);

    void copy_from(std::contiguous_iterator auto first, Flags);
};
Ideas:
inner_min(simd)
reduce_min(simd)
min_reduce(simd)
min_element(simd)
simd.min()
The name was discussed in SG1 and LEWG. SG1 chose datapar over simd and other options. LEWG overruled that choice and renamed it to simd.
We should use the TS to discover the following: do users reject simd because it appears too narrowly focused (from its name) on (certain) CPUs? Would such users choose differently if the name focused on the parallelism it allows instead of a specific hardware implementation of such parallelism?

There is a proposal to add static insert/extract functions:
extract<_Begin, _End>(v); // Extract a simd of size (_End - _Begin) from v, starting at index _Begin.
insert<_Begin>(v, child); // Insert a smaller simd into v at index _Begin, overwriting the values in the range [_Begin, _End]
Since the indexes are known, checking can be performed to ensure the insert/extract ranges are valid.
Investigate whether it is useful to allow the insert/extract positions to be dynamic:
extract<_Size>(v, i); // Extract a simd of the given size, starting at the given dynamic index.
insert(v, child, i); // Insert the child into the given position of v, overwriting the existing values.
The size of the insert/extract will always be static, since the simd must have a known size. Only the position could vary dynamically.
Is this useful?
Does it have a performance impact?
How is error handling done?
simd_mask provides copy_to and copy_from which read and write memory regions containing bool values. I think this is a fairly unusual operation because the native simd_mask implementation is likely to be a full set of element bits (AVX-like), or a compact bit set (e.g., ARM, AVX-512). The copy_to/copy_from functions do provide a way of generically storing simd_masks to and from memory, but I'm not sure I like forcing them to be a particular format which doesn't match either of the likely common implementations.
Given that simd_mask is proposed to have the operator+ or a conversion operator, could a programmer use that instead to explicitly mimic the effect:
simd_mask<uint8_t> m = some_mask;
simd<uint8_t>(some_mask).copy_to(uint8_t_ptr);
// or (+some_mask).copy_to(uint8_t_ptr);
Similarly, reading back from memory can be accomplished using operator! to convert from simd elements to mask bits.
!(simd<uint8_t>(ptr_to_memory))
I think we should remove those memory operations and require the user to explicitly manage the storage when they need to.
We have pixel data in R1G1B1A1R2G2B2A2... format. For compositing we need to deinterleave it into R1R2R3..., G1G2G3..., B1B2B3..., A1A2A3... format.
In the Vc library we could use Vc::InterleavedMemoryWrapper to achieve this goal. std-simd doesn't provide any efficient solution for the task. We could use the generator constructor for it, but it doesn't seem to generate what we want:
Old version with Vc::InterleavedMemoryWrapper:
https://godbolt.org/z/fEQ3s1
Generator syntax with std-simd:
https://godbolt.org/z/pUtMXm
You can see that for the generator-based version the compiler generates more than twice as many instructions. That is really suboptimal.
PS:
I think this issue may be somewhat related: VcDevel/std-simd#4
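Pending a dedicated facility, the strided access can at least be stated with the generator constructor. A sketch assuming <experimental/simd>; load_strided is our name, and as the godbolt links above show, this expresses the intent but may not produce optimal code today:

```cpp
#include <cassert>
#include <cstddef>
#include <experimental/simd>

namespace stdx = std::experimental;

// Deinterleave one channel from AoS pixel data: element i of the result
// is base[i * stride + channel]. E.g. stride 4, channel 0 yields
// R1, R2, R3, ... from R1G1B1A1 R2G2B2A2 ...
template <typename T, typename Abi>
stdx::simd<T, Abi> load_strided(const T* base, std::size_t stride,
                                std::size_t channel) {
    return stdx::simd<T, Abi>([&](auto i) { return base[i * stride + channel]; });
}
```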
Real-world use of rebind_simd_t suggests that simd_cast is a concise and readable way of indicating what you mean, while rebind_simd_t is a bit verbose.
Shen [P0820R1] argues for a substantial change to the definition of the relational operators.
Kretz [P0851R0] presents the reason for the status quo. The choices were:
- operator== and operator!= return simd_mask (status quo).
- Comparisons returning simd_mask are provided via new functions (member vs. non-member).
- operator== and operator!= are defined and return bool.
If there is new information showing that the current behavior is problematic and a different behavior is more useful, the discussion should be reopened.
Whenever the interface requires a simd type that has a specific number of elements (typically deduced from one or more native types), we have a choice:
- Use fixed_size<N>. This is easy to memorize and understand. However, fixed_size<N> may be less efficient if such objects have to pass a function call boundary. Also, absent more implicit conversions, users that prefer native types will have to use an explicit cast.
- Use simd_abi::deduce_t<T, N, Abis...>. It is less obvious what the exact type will be (implementation-defined or fixed_size<N>) and the choice will be target-dependent. This makes it slightly harder to write portable code (typically the logic is portable, it just needs an additional cast for some targets). The portability issue could be reduced by allowing more implicit casts.
The current situation is inconsistent: simd_cast<vectorizable_type> produces fixed_size, whereas split and concat use deduce_t. Consequently, WG21 might want to do either one of:
- Change split and concat to use fixed_size<N>.
- Change simd_cast to use rebind_simd and thus deduce_t.