wdmapp / gtensor
GTensor is a multi-dimensional array C++14 header-only library for hybrid GPU development.
License: BSD 3-Clause "New" or "Revised" License
One important change is that mutable lambdas are no longer allowed, so const gtensor/view refs need to allow element modification.
I've seen this in two places now:
template <typename T, size_type N, typename S>
inline void gtensor_span<T, N, S>::fill(const value_type v)
{
  if (v == T(0)) {
    auto data = gt::backend::raw_pointer_cast(this->data());
    backend::standard::memset<S>(data, 0, sizeof(T) * this->size());
  } else {
    assign(*this, scalar(v));
  }
}
and also in the various copy overloads between device and spans. This makes an underlying assumption that the data kept in the span is contiguous, but that's only the case if the strides have been calculated from a shape, as gt::adapt does. Fundamentally, however, a gtensor_span can have arbitrary strides (e.g., we could and should do that for a nicer Fortran interface, but there are other reasons one might want to do this).
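The fast paths above are only valid when the span covers contiguous memory. A minimal sketch of the check they would need, assuming column-major strides and a hypothetical free-standing helper (not part of the gtensor API):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when the column-major strides computed
// from `shape` match `strides`, i.e. the data covered by the span is
// contiguous and a raw memset/memcpy fast path is safe.
template <std::size_t N>
bool is_contiguous_col_major(const std::array<int, N>& shape,
                             const std::array<int, N>& strides)
{
  int expected = 1; // innermost (first) dimension has stride 1 in col-major
  for (std::size_t i = 0; i < N; i++) {
    if (shape[i] > 1 && strides[i] != expected) {
      return false;
    }
    expected *= shape[i];
  }
  return true;
}
```

A fill() or copy() could then fall back to an elementwise assign when this returns false.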
Follow the convention of xtensor, i.e. add another template param for gt::layout_type::row_major / gt::layout_type::column_major. For backward compatibility we can make the default column_major, but since it's a C++ lib, maybe we should consider switching the default.
See
https://xtensor.readthedocs.io/en/latest/container.html?highlight=layout#internal-memory-layout
I didn't look into it more...
the reference types are currently defined to be value_type
Some of the gt-blas calls require temporary space, particularly the SYCL backend. Having a managed storage helper class would be useful for implementing this, in a nice RAII way.
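A minimal sketch of such an RAII temporary-storage helper (the name and shape are hypothetical, not the gtensor API; std::malloc/std::free stand in for the backend allocator a real version would call):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Sketch of a managed temporary-storage helper: the buffer is released in the
// destructor, so it is freed on any exit path, including exceptions (RAII).
template <typename T>
class temp_storage
{
public:
  explicit temp_storage(std::size_t n)
    : data_(static_cast<T*>(std::malloc(n * sizeof(T)))), size_(n)
  {}
  ~temp_storage() { std::free(data_); }

  // non-copyable: exactly one owner of the buffer
  temp_storage(const temp_storage&) = delete;
  temp_storage& operator=(const temp_storage&) = delete;

  T* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  T* data_;
  std::size_t size_;
};
```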
MKL oneAPI doesn't currently support LP64. This shows up for getrf/getrs calls for the pivot array. Since this is allocated on device, there is not an easy way around it, especially regarding the Fortran interface for GENE. Some options:
My inclination is to do (1) for now, and if it becomes a performance bottleneck we can revisit. If it's a big batch, then the computation should dominate anyway, so extra pivot array copies shouldn't be too horrible. Also, in the future LP64 may be supported better in oneAPI, then we can switch and they will all be consistent.
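Option (1) amounts to a widening/narrowing conversion of the pivot array at the interface boundary. A host-side sketch of the idea (function names are hypothetical; in the real backend this would be a small device kernel or gtensor assign over device memory):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Keep a 32-bit pivot interface and convert to/from the 64-bit pivot array
// that the ILP64 getrf/getrs calls require.
inline std::vector<std::int64_t> pivots_to_ilp64(const std::vector<int>& p32)
{
  std::vector<std::int64_t> p64(p32.begin(), p32.end()); // widening, always safe
  return p64;
}

inline std::vector<int> pivots_from_ilp64(const std::vector<std::int64_t>& p64)
{
  std::vector<int> p32;
  p32.reserve(p64.size());
  for (auto p : p64) {
    p32.push_back(static_cast<int>(p)); // narrowing; fine for dims < 2^31
  }
  return p32;
}
```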
SYCL backend for gt-blas will leak if there is an error, and possibly in some cases when there is no error.
gt::gtensor<double, 1, gt::space::device> d_axpy{gt::shape(100)};
d_axpy = 1.0;
fails to compile in clang++10:
/opt/isycl/bin/clang++ -DGTENSOR_HAVE_DEVICE -DNDEBUG -D__SYCL__ -I/home/bda/fusion/gtensor/examples/../../syclthrust/include -I/home/bda/fusion/gtensor/examples/../include -O3 -DNDEBUG -fsycl -fsycl-unnamed-lambda -o CMakeFiles/daxpy.dir/src/daxpy.cxx.o -c /home/bda/fusion/gtensor/examples/src/daxpy.cxx
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: error: no viable overloaded '='
d_axpy = 1.0;
~~~~~~ ^ ~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:42:6: note: candidate template ignored: could not match 'expression<type-parameter-0-0>' against 'double'
D& operator=(const expression<E>& e);
^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
class gcontainer : public gstrided<D>
^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit move assignment operator) not viable: no known conversion from 'double' to 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
/home/bda/fusion/gtensor/examples/../include/gtensor/gtensor.h:46:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gtensor<double, 1, gt::space::device>' for 1st argument
class gtensor : public gcontainer<gtensor<T, N, S>>
^
Note that this also fails with space::host.
Using d_axpy = gt::scalar(1.0) fails with a different error. Not sure if that makes any sense but it seems logical from an abstract interface point of view:
/home/bda/fusion/gtensor/examples/../include/gtensor/sarray.h:66:41: error: cannot convert 'gt::sarray<int, 0>' to 'int' without a conversion operator
sarray<T, N>::sarray(U... args) : data_{T(args)...}
^~~~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:60:10: note: in instantiation of function template specialization 'gt::sarray<int, 1>::sarray<gt::sarray<int, 0> , 0>' requested here
resize(e.derived().shape());
^
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: note: in instantiation of function template specialization 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >::operator=<gt::gscalar<double> >' requested here
d_axpy = gt::scalar(1.0);
^
1 error generated.
Would it make sense to contribute this to xtensor itself, or is it better as a separate library as it is now?
I think @SylvainCorlay, @wolfv and others might be interested in your work.
Ideally would build for no device, CUDA, HIP, and SYCL. I have a machine with both nVidia and AMD GPUs, can use that as a starting point. Too bad it doesn't have an iGPU... but it could test SYCL also by using hipSYCL+AMD, or intel SYCL + Intel CPU OpenCL, or any SYCL with HOST device.
This should cover most reduction use cases. It can wrap thrust::transform_reduce for cuda and hip backends. For sycl backend, this could possibly be implemented on top of the reduction support in 2020 spec, or more simply using oneDPL for Intel SYCL.
In particular this could be used for an efficient sum of squares implementation, needed by @td-mpcdf.
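Sum of squares is exactly a transform-reduce: the same pattern maps onto thrust::transform_reduce for the CUDA/HIP backends and onto SYCL 2020 reductions or oneDPL. A host sketch of the shape of the computation using C++17's std::transform_reduce:

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sum of squares as transform (x -> x*x) followed by reduce (+).
inline double sum_of_squares(const std::vector<double>& v)
{
  return std::transform_reduce(v.begin(), v.end(), 0.0, std::plus<>{},
                               [](double x) { return x * x; });
}
```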
This was introduced in the allocator/backend refactor, and appears to have masked issues in oneMKL getrf/s where it segfaults when using device memory.
gt-blas implementation (possibly gt-fft also?) does lazy error handling and re-uses gtGpuCheck, which handles success vs failure OK but prints extremely confusing and misleading error messages. One way to deal with this is to add an overload for gtGpuCheck that takes the appropriate library error type and prints the library error message, or at the very least print the numeric code and not a misleading message.
When profiling and debugging, tracking down which version of a gtensor kernel is the one of interest can be challenging. For SYCL, kernel names are passed as template parameters with templated type names - another sub-type of the templated name could be optionally supplied by the user, via a gt::assign<typename>(lhs, rhs) helper function. This is ugly, but a starting point for discussion. Not exactly sure how to do this for HIP / CUDA yet.
The CUDA/HIP implementations are fragile in that there may be array sizes that overflow certain limits. Explore using linear launch indexing and mapping back to expression indexes. This requires integer divide and modulo, which may hurt performance, but depending on computational intensity the cost may be negligible, and it may simplify the launch routines a lot. Ideally we have one generic launch for any number of dimensions.
The WIP sycl implementation currently only goes up to 3 dims using range instead of nd_range, and similar challenges apply.
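The mapping from a linear launch index back to an N-d expression index is the divide/modulo chain being weighed above. A sketch for column-major layout (helper name is illustrative, not gtensor code):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Unflatten a linear index into a column-major N-d index: the first
// dimension varies fastest, so peel it off with % and advance with /.
template <std::size_t N>
std::array<int, N> unflatten_col_major(long linear,
                                       const std::array<int, N>& shape)
{
  std::array<int, N> idx{};
  for (std::size_t d = 0; d < N; d++) {
    idx[d] = static_cast<int>(linear % shape[d]);
    linear /= shape[d];
  }
  return idx;
}
```

One generic launch would iterate linear indices 0..size-1 and apply this per element.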
SYCL has a 2k parameter size limit. CUDA typically has 4k, although it may depend on device. Complex RHS expressions can exceed this limit pretty easily, particularly on SYCL, and this is in fact the case for one of the GENE kernels. The workaround I added was to copy the RHS expression gtensor_span object to device memory and pass only the pointer as a parameter. This works, but it can slow things down, particularly for small kernels. While we want to avoid small kernels in general, it's still not great.
One way to improve this, suggested by @uphoffc, is to optimize gview and gtensor_storage. At least for the case of a gview around a gtensor, all that needs to be stored is a pointer to the underlying data and strides. The offset can be added in to the pointer, and multiple layers of strides are not needed (the underlying strides just amount to col major and can be assumed). I am not sure how this will work for more complex cases, however, e.g. composing swapaxes and a slice view.
We can also use the copy to device-mem pass pointer hack only when necessary, using constexpr if (for SYCL backend which already requires C++17) or with more fancy TMP hackery. Note that nvcc didn't support C++17 until CUDA 11, so we could consider bumping the gtensor requirement at the cost of losing support for CUDA < 11.
There also seems to be redundancy in the gtensor_span object, which extends gstrided and contains shape, strides, plus the storage object's size and capacity. Not clear to me whether this is worth optimizing though.
I am also unsure of the effect of these optimizations - the low hanging fruit here is simply constexpr if to only apply the workaround when the parameters are really too big. Most small kernels will have a small RHS and won't require the workaround; kernels with a RHS that exceeds the limit will likely be long running and the extra memcpy is likely to have less impact. I think the first step here is to apply the constexpr if for the SYCL backend, which has the smallest param limit and already requires C++17. Then we can explore further more complicated changes over time. Second step is to analyze the GENE kernels to see what view combos they are using, to get a sense for how much of an effect gview size optimization would have, e.g. would special casing gview of gtensor to avoid double storing strides make a significant difference here.
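The "only apply the workaround when the parameters are really too big" idea is a compile-time size check. A C++17 sketch, where the 2048-byte limit, the function, and the returned labels are all illustrative stand-ins, not the gtensor implementation:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kParamLimit = 2048; // illustrative SYCL-like limit

// Pick, via constexpr if, between passing the kernel argument by value and
// the copy-to-device / pass-a-pointer fallback. Only one branch is compiled
// for any given expression type E.
template <typename E>
const char* choose_launch_path(const E&)
{
  if constexpr (sizeof(E) <= kParamLimit) {
    return "by-value"; // small expression: pass directly as kernel parameter
  } else {
    return "device-pointer"; // too big: copy span to device, pass pointer
  }
}
```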
There is a performance regression in GENE related to the RHS calculation, which does a complex expression eval involving views. The first step is to develop a test case which reproduces the issue, then fix it and update gtensor in GENE.
Since NDEBUG affects other code, we should have a define that is specific to gtensor for disabling asserts. This is important especially for HIP and SYCL, which don't allow assert in device code.
Device vendor dependent code should be refactored, so ifdefs only occur in one place (perhaps space.h). One idea is to have gt::space::cuda, gt::space::hip, etc., and have gt::space::device be a type alias for this, and in the same place the device specific header can be included.
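A sketch of that layout, with per-vendor tag types and a single alias selected by ifdefs in one header (the tag and macro names here are illustrative, not necessarily gtensor's):

```cpp
#include <cassert>
#include <type_traits>

namespace gt {
namespace space {
// one empty tag type per backend
struct host {};
struct cuda {};
struct hip {};
struct sycl {};

// the only place vendor ifdefs appear; the vendor header would be
// included alongside the chosen alias
#if defined(GTENSOR_DEVICE_CUDA)
using device = cuda;
#elif defined(GTENSOR_DEVICE_HIP)
using device = hip;
#elif defined(GTENSOR_DEVICE_SYCL)
using device = sycl;
#else
using device = host; // host-only build
#endif
} // namespace space
} // namespace gt
```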
Currently the main CMakeLists.txt file lists libraries needed for blas and fft for both HIP and SYCL. This is just a quick hack to get things working until upstream cmake is improved. In particular, the HIP config breaks mixed language builds (e.g. Fortran + C); once this is merged the official cmake support for HIP should work again:
ROCm/HIP#2190
Also need to explore oneAPI cmake support further.
For the SYCL assign/launch and sum_axis_to reduction routine, linear indexing based on expression shape is used to create N dimensional implementations. This is done using the index_expression helper function (currently only works up to 6d), which is hacky. This could also be generally useful elsewhere in gtensor to generalize functions beyond six dimensions, and to create specialized routines in application code.
A related operation is access based on an index object, rather than operator() style which takes separate arguments. gview.h implements some of these for internal use, for arbitrary dimension. This could be refactored and moved so that, for example, all containers support both linear access and sarray index access.
We can look to xtensor for what interface we might want to provide for this:
https://xtensor.readthedocs.io/en/latest/indices.html
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#element-access
https://xtensor.readthedocs.io/en/latest/view.html#flatten-views
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#data-buffer
So one possibility is operator[] for index array style access (a[{0, 7}] == a(0, 7)), and forcing the use of a.data() to get linear indexing. However a.data() only works on container types, not general expressions, so it is less powerful for implementing n-dimensional routines. Currently gcontainer provides the data_access method for this, but again it's limited to container types and container spans only.
Also note that gtensor currently assumes column major (Fortran style) indexing. If we add this functionality, it is another place that will need to be updated if row major and other layouts are added.
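The arithmetic behind an index-array style operator[] is a single dot product of index and strides. A sketch, assuming column-major strides as gtensor does today (helper name is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Flat offset of an N-d index given per-dimension strides; this is the
// operation a hypothetical `a[{i, j}]` would perform before indexing the
// underlying linear storage.
template <std::size_t N>
int flat_offset(const std::array<int, N>& idx,
                const std::array<int, N>& strides)
{
  int off = 0;
  for (std::size_t d = 0; d < N; d++) {
    off += idx[d] * strides[d];
  }
  return off;
}
```

Supporting row-major later only changes how the strides are computed, not this accumulation.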
oneMKL SYCL backend doesn't support this at all, and the HIP (rocblas+rocsolver) backend appears to have issues with it as well, even though the interface suggests it should work.
Options here:
(1) change our interface to only support contiguous, by taking a single pointer for A/B data, and allocate / populate the pointer array when needed to adapt to the underlying API.
(2) keep current interface, and emulate support in oneMKL by copying data if it's non-contiguous
The performance hit of (2) is likely prohibitive, but it could be a stopgap if support will be added later. Need to look more closely at the available (non-oneAPI) MKL APIs, for a sense of what might be supported in the future.
Both thrust and no thrust builds fail:
In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:6:
/home/bda/fusion/gtensor/include/gtensor/expression.h:41:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() const& -> const derived_type&
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:35:33: note: previous declaration is here
GT_INLINE const derived_type& derived() const&;
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:47:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() & -> derived_type&
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:36:27: note: previous declaration is here
GT_INLINE derived_type& derived() &;
^
In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:8:
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:92:25: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline int gstrided<D>::shape(int i) const
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:53:17: note: previous declaration is here
GT_INLINE int shape(int i) const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:98:26: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline auto gstrided<D>::shape() const -> const shape_type&
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:54:31: note: previous declaration is here
GT_INLINE const shape_type& shape() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:104:26: error: __host__ function 'strides' cannot overload __host__ __device__ function 'strides'
inline auto gstrided<D>::strides() const -> const strides_type&
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:55:33: note: previous declaration is here
GT_INLINE const strides_type& strides() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:110:31: error: __host__ function 'size' cannot overload __host__ __device__ function 'size'
inline size_type gstrided<D>::size() const
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:56:23: note: previous declaration is here
GT_INLINE size_type size() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:117:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) const -> const_reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:59:29: note: previous declaration is here
GT_INLINE const_reference operator()(Args&&... args) const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:124:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) -> reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:61:23: note: previous declaration is here
GT_INLINE reference operator()(Args&&... args);
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:151:26: error: __host__ function 'data_access' cannot overload __host__ __device__ function 'data_access'
inline auto gstrided<D>::data_access(size_type i) const -> const_reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:70:29: note: previous declaration is here
GT_INLINE const_reference data_access(size_type i) const;
Suggested by @td-mpcdf. My understanding is that this would just be an offset, so if lbound=9, the range [10,20] would actually be [1,11] on the physical array. Basically lbound is subtracted from all indexes. Is this the desired behavior? @germasch do you think this makes sense for gtensor?
This could be its own view type, or a new view parameter. Or something specific to gt::adapt and gt::adapt_device, which is where it would be used in GENE, but this seems like an odd place to put the functionality when looking at gtensor as a whole.
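The lbound behavior described above is just per-dimension subtraction before the usual stride computation: with lbound = 9, logical index 10 maps to physical index 1. A sketch of that mapping (names and layout are illustrative):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Subtract the lower bound from each logical (Fortran-style) index to get
// the zero-based physical index used for the stride computation.
template <std::size_t N>
std::array<int, N> to_physical(const std::array<int, N>& logical,
                               const std::array<int, N>& lbound)
{
  std::array<int, N> phys{};
  for (std::size_t d = 0; d < N; d++) {
    phys[d] = logical[d] - lbound[d];
  }
  return phys;
}
```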
Not sure this is unique to me; it looks like it should be reproducible as long as bounds check is enabled (i.e., build type Debug).
[kai@macbook build (pr/functions-norm)]$ tests/test_launch
Running main() from _deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from gtensor
[ RUN ] gtensor.launch_1d
[ OK ] gtensor.launch_1d (1 ms)
[ RUN ] gtensor.launch_1d_templated
[ OK ] gtensor.launch_1d_templated (0 ms)
[ RUN ] gtensor.launch_view_reverse_1d
[ OK ] gtensor.launch_view_reverse_1d (0 ms)
[ RUN ] gtensor.device_launch_1d
[ OK ] gtensor.device_launch_1d (1822 ms)
[ RUN ] gtensor.device_launch_view_reverse_1d
out-of-bounds error: dim = 0, arg = 2, shape = 2
../include/gtensor/strides.h:102: bounds_check: block: [0,0,0], thread: [2,0,0] Assertion `0` failed.
gpuCheck: 710 (device-side assert triggered) ../include/gtensor/gtensor.h 401
Abort trap: 6
gtensor is designed to have const safe containers, in that operator() returns a const reference if called on a const instance. This is much like std::vector. std::span, on the other hand, does not consider the underlying data part of its state (since the data is not copied when copying a span), and its operator[] returns a non-const reference even if the instance is const. gtensor, like its inspiration xtensor, uses the same semantics on view objects, even though it doesn't provide much extra protection, given that you can assign a view to a non-const and modify it that way. Unfortunately this causes problems for SYCL 2020 provisional, which requires a const kernel function.
More discussion here:
intel/llvm#2538
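The std::span-style semantics in question can be shown in miniature: the view does not treat the pointed-to data as part of its own state, so operator[] on a const view still hands back a mutable reference, which is what a const SYCL 2020 kernel functor needs in order to write through a view (the class below is a toy illustration, not gtensor_span):

```cpp
#include <cassert>

// Minimal non-owning view with "shallow const": the const qualifier on
// operator[] applies to the view object, not the data it points at.
template <typename T>
class mini_span
{
public:
  mini_span(T* data, int size) : data_(data), size_(size) {}
  T& operator[](int i) const { return data_[i]; } // const method, mutable ref
  int size() const { return size_; }

private:
  T* data_;
  int size_;
};
```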
The flatten implementation attempts to return the original object if it's already 1d or 0d, but this often results in a copy. Using decltype(auto) everywhere can fix it (both for the return value and for the variable it's assigned to), but this is fragile, i.e. using auto for the variable instead results in a copy. This breaks the flatten usage in SYCL N-d assign (it can be worked around by calling to_kernel before flatten, since copying the span returned by to_kernel doesn't copy storage).
There may have been some subtleties involved, but it should definitely be possible somehow.
Adding a debug print to the gtensor_storage copy ctor shows some surprises. In particular, assigning a temporary to an auto variable can result in a copy.
This is the interface used by xtensor:
https://xtensor.readthedocs.io/en/latest/operator.html#reducers
The trivial cases to add are 1d sum/min/max, that operate over the entire array (not axis aware). This can be done using thrust for CUDA/HIP, and the built in reduction support for SYCL 2020. One issue here is that thrust can be used even when not being used as the gtensor backing store - so perhaps GTENSOR_USE_THRUST should be renamed to something like GTENSOR_USE_THRUST_VECTOR. I don't see much benefit in avoiding thrust for reductions - we could fallback to CUB/rocPRIM, but rocPRIM still requires a separate install for ROCm, and Thrust/CUB are always available for CUDA.
If the source has a const data type, no signature will match and compilation will fail, e.g.
error: no instance of overloaded function "gt::copy" matches the argument list
argument types are: (gt::gtensor_span<const real_t, 2UL, gt::space::thrust_host>, gt::gtensor<real_t, 2UL, gt::space::device>)`
There is no reason for this not to work.
So I guess I'm not quite sure why the builds with rocm are so slow in the first place...
But since the initialization / rocm install only takes on the order of 1 minute, it seems like it may be worth it to split this up in some fashion so that the various builds can run in parallel in separate containers (if possible without too much hassle).
Currently, one needs to explicitly #include <gtensor/reductions.h>. Unless there's a good reason (say compile times become even slower), it'd be nice to have these available by default, i.e., included from <gtensor/gtensor.h>.
I'm getting this warning:
[cmake] CMake Warning (dev) at build-spack-ubuntu2/_deps/gtensor-src/CMakeLists.txt:230:
[cmake] Syntax Warning in cmake code at column 63
[cmake]
[cmake] Argument not separated from preceding token by whitespace.
[cmake] This warning is for project developers. Use -Wno-dev to suppress it.
I think it has to do with how things are quoted, but I don't really know what the issue is, and it's in sycl-specific code, so I can't easily test what's going on.
This is already needed for SYCL port, and should be easy to generalize to support CUDA and HIP as well. One advantage of this is avoiding the zero fill initialization in thrust, which can really hurt performance when allocating temporary arrays.
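The standard trick for skipping the zero fill is an allocator whose construct() default-initializes (leaving trivial types uninitialized) instead of value-initializing; a vector or storage class using it allocates without touching the memory. A sketch of that well-known pattern (not gtensor's actual allocator):

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Allocator adaptor that replaces value-initialization with
// default-initialization, so resize() of trivial types does no zero fill.
template <typename T, typename Base = std::allocator<T>>
struct no_init_allocator : Base
{
  template <typename U>
  struct rebind
  {
    using other = no_init_allocator<
      U, typename std::allocator_traits<Base>::template rebind_alloc<U>>;
  };

  template <typename U>
  void construct(U* p)
  {
    ::new (static_cast<void*>(p)) U; // default-init: no zeroing for trivial T
  }
  template <typename U, typename... Args>
  void construct(U* p, Args&&... args)
  {
    ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
  }
};
```

The same idea, applied to the device allocator, avoids thrust's zero fill when allocating temporaries.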
This can be implemented via thrust::transform for CUDA/HIP and via std::transform for Intel SYCL
std::vector will preserve values on resize(), which may involve copies when reallocation is necessary. gtensor_storage does the same; however, I don't think it ever happens in its use in gtensor that values are expected to be preserved, so this copy is a waste of time (albeit probably not typically a significant one). Since the semantics of this container differ from std::vector anyway, in particular the lack of value initialization, I think a point could be made that it should also not try to preserve values on resize.
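A resize that does not preserve values is just free-then-reallocate when growing past capacity. A host sketch of the idea (illustrative, not the gtensor_storage code; std::malloc/std::free stand in for the backend allocator):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Storage whose resize() discards old contents instead of copying them,
// since callers never rely on values surviving a resize.
template <typename T>
class raw_storage
{
public:
  raw_storage() = default;
  raw_storage(const raw_storage&) = delete;
  raw_storage& operator=(const raw_storage&) = delete;
  ~raw_storage() { std::free(data_); }

  void resize(std::size_t n)
  {
    if (n > capacity_) {
      std::free(data_); // no copy of old values on reallocation
      data_ = static_cast<T*>(std::malloc(n * sizeof(T)));
      capacity_ = n;
    }
    size_ = n;
  }
  std::size_t size() const { return size_; }
  std::size_t capacity() const { return capacity_; }

private:
  T* data_ = nullptr;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```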
(see #101) Should be easy to enable, but I need a reminder ;)
gtensor currently uses a fixed block size for CUDA/HIP kernels, which fails for arrays that are too large in a certain dimension (but still easily small enough to fit in a single kernel launch with a block size adjustment). This could be handled by a generic flattening n-d implementation (using 1d indexing), similar to the SYCL backend n-d assign/launch implementations, or with more complex logic to determine block and grid sizes.
Can add an error check in gtLaunchKernel macro, for HIP and CUDA backends. SYCL will raise an async exception on launch failure, which should already be handled by the global error handler.
This will simplify the use of gtensor a lot, especially as more GPU backends are added.
It sometimes could be convenient if one could do
auto s = gt::shape(1, 2) + gt::shape(3, 4);
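One reading of that + is concatenation into a 4-d shape (whether it should instead mean elementwise addition is part of what would need deciding). A sketch of the concatenation reading over std::array (the helper is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Concatenate two shape-like arrays into one of combined rank, e.g.
// (1, 2) + (3, 4) -> (1, 2, 3, 4).
template <std::size_t N, std::size_t M>
std::array<int, N + M> concat(const std::array<int, N>& a,
                              const std::array<int, M>& b)
{
  std::array<int, N + M> r{};
  for (std::size_t i = 0; i < N; i++) {
    r[i] = a[i];
  }
  for (std::size_t i = 0; i < M; i++) {
    r[N + i] = b[i];
  }
  return r;
}
```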
I suggest renaming -> main -- going with the times -- unless there are any objections.
It seems to be necessary to somehow make it possible to use (CUDA) streams with gtensor. In the launch of an explicit kernel this could be another argument; for the assignment kernels, we should add a function set_stream(streamID) and launch the assignment kernel in this stream. A set_default_stream should set it back.
Two issues: we need a native SYCL implementation of ij_deriv, and we need to fix N-d assign when the RHS is a scalar. The issue is that the N-d implementation uses flatten, which calls reshape(std::forward<E>(e), shape(e.size())), and the e here is gscalar, which has no size method.
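One fix direction is to detect at compile time whether the expression type has a size() member, and treat types without one (like a scalar) as already flat. A C++14 void_t-style trait sketch (the trait and the two stand-in types are illustrative, not gtensor code):

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Primary template: T has no usable size() member.
template <typename T, typename = void>
struct has_size : std::false_type
{};

// Specialization selected when `declval<const T&>().size()` is well-formed.
template <typename T>
struct has_size<T, decltype(void(std::declval<const T&>().size()))>
  : std::true_type
{};

struct fake_scalar {};                          // stand-in for gscalar
struct fake_span { int size() const { return 8; } }; // stand-in for a span
```

flatten could then branch (tag dispatch or constexpr if) and pass scalar expressions through unchanged.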
For real and complex gtensor expressions. Needed for GENE.
Initializing gt::gtensor<std::complex<double>, 1> = { 1., 2+3.i } works (at least sometimes), but the same with gt::complex usually does not, not only because of 1.i being std::complex but more significantly because thrust::complex's constructors aren't constexpr. This makes, at the very least, for a bunch of ugly tests.
I think it's probably possible to get this to work with a special overload that initializes gt::gtensor<gt::complex<T>> with an initializer list of std::complex<T>. It'd be even nicer to be able to just use std::complex, but pre-C++20 lots of it is not constexpr, i.e., not __host__ __device__.
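What the initializer-list case needs from the complex type is constexpr constructors, so brace-init of elements is usable in constant contexts. A minimal literal complex type showing the property that thrust::complex lacks here (purely illustrative, not a proposed replacement):

```cpp
#include <cassert>

// Minimal complex-like literal type: the constexpr constructor is what makes
// initializer-list elements usable at compile time.
template <typename T>
struct lite_complex
{
  T re, im;
  constexpr lite_complex(T r = T(0), T i = T(0)) : re(r), im(i) {}
};

// Compiles as a constant expression, so such values can appear in init lists.
constexpr lite_complex<double> z{2.0, 3.0};
```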