wdmapp / gtensor
GTensor is a multi-dimensional array C++14 header-only library for hybrid GPU development.
License: BSD 3-Clause "New" or "Revised" License
One important change is that mutable lambdas are no longer allowed, so const gtensor/view refs need to allow element modification.
I've seen this in two places now:
template <typename T, size_type N, typename S>
inline void gtensor_span<T, N, S>::fill(const value_type v)
{
  if (v == T(0)) {
    auto data = gt::backend::raw_pointer_cast(this->data());
    backend::standard::memset<S>(data, 0, sizeof(T) * this->size());
  } else {
    assign(*this, scalar(v));
  }
}
and also in the various copy overloads between device and spans. This makes an underlying assumption that the data kept in the span is contiguous, but that's only the case if the strides have been calculated from a shape, as gt::adapt does. Fundamentally, however, a gtensor_span can have arbitrary strides (e.g., we could and should do that for a nicer Fortran interface, but there are other reasons one might want to do this).
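The fast paths above are only valid when the span covers contiguous memory. A minimal sketch of the check they would need, assuming column-major strides and a hypothetical free-standing helper (not part of the gtensor API):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: returns true when the column-major strides computed
// from `shape` match `strides`, i.e. the data covered by the span is
// contiguous and a raw memset/memcpy fast path is safe.
template <std::size_t N>
bool is_contiguous_col_major(const std::array<int, N>& shape,
                             const std::array<int, N>& strides)
{
  int expected = 1; // innermost (first) dimension has stride 1 in col-major
  for (std::size_t i = 0; i < N; i++) {
    if (shape[i] > 1 && strides[i] != expected) {
      return false;
    }
    expected *= shape[i];
  }
  return true;
}
```

A fill() or copy() could then fall back to an elementwise assign when this returns false.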
Follow the convention of xtensor, i.e. add another template param for gt::layout_type::row_major / gt::layout_type::column_major. For backward compatibility we can make the default column_major, but since it's a C++ lib, maybe we should consider switching the default.
See
https://xtensor.readthedocs.io/en/latest/container.html?highlight=layout#internal-memory-layout
I didn't look into it more...
the reference types are currently defined to be value_type
Some of the gt-blas calls require temporary space, particularly the SYCL backend. Having a managed storage helper class would be useful for implementing this, in a nice RAII way.
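A minimal sketch of such an RAII temporary-storage helper (the name and shape are hypothetical, not the gtensor API; std::malloc/std::free stand in for the backend allocator a real version would call):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Sketch of a managed temporary-storage helper: the buffer is released in the
// destructor, so it is freed on any exit path, including exceptions (RAII).
template <typename T>
class temp_storage
{
public:
  explicit temp_storage(std::size_t n)
    : data_(static_cast<T*>(std::malloc(n * sizeof(T)))), size_(n)
  {}
  ~temp_storage() { std::free(data_); }

  // non-copyable: exactly one owner of the buffer
  temp_storage(const temp_storage&) = delete;
  temp_storage& operator=(const temp_storage&) = delete;

  T* data() { return data_; }
  std::size_t size() const { return size_; }

private:
  T* data_;
  std::size_t size_;
};
```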
MKL oneAPI doesn't currently support LP64. This shows up for getrf/getrs calls for the pivot array. Since this is allocated on device, there is not an easy way around it, especially regarding the Fortran interface for GENE. Some options:
My inclination is to do (1) for now, and if it becomes a performance bottleneck we can revisit. If it's a big batch, then the computation should dominate anyway, so extra pivot array copies shouldn't be too horrible. Also, in the future LP64 may be supported better in oneAPI, then we can switch and they will all be consistent.
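Option (1) amounts to a widening/narrowing conversion of the pivot array at the interface boundary. A host-side sketch of the idea (function names are hypothetical; in the real backend this would be a small device kernel or gtensor assign over device memory):

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Keep a 32-bit pivot interface and convert to/from the 64-bit pivot array
// that the ILP64 getrf/getrs calls require.
inline std::vector<std::int64_t> pivots_to_ilp64(const std::vector<int>& p32)
{
  std::vector<std::int64_t> p64(p32.begin(), p32.end()); // widening, always safe
  return p64;
}

inline std::vector<int> pivots_from_ilp64(const std::vector<std::int64_t>& p64)
{
  std::vector<int> p32;
  p32.reserve(p64.size());
  for (auto p : p64) {
    p32.push_back(static_cast<int>(p)); // narrowing; fine for dims < 2^31
  }
  return p32;
}
```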
SYCL backend for gt-blas will leak if there is an error, and possibly in some cases when there is no error.
gt::gtensor<double, 1, gt::space::device> d_axpy{gt::shape(100)};
d_axpy = 1.0;
fails to compile in clang++10:
/opt/isycl/bin/clang++ -DGTENSOR_HAVE_DEVICE -DNDEBUG -D__SYCL__ -I/home/bda/fusion/gtensor/examples/../../syclthrust/include -I/home/bda/fusion/gtensor/examples/../include -O3 -DNDEBUG -fsycl -fsycl-unnamed-lambda -o CMakeFiles/daxpy.dir/src/daxpy.cxx.o -c /home/bda/fusion/gtensor/examples/src/daxpy.cxx
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: error: no viable overloaded '='
d_axpy = 1.0;
~~~~~~ ^ ~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:42:6: note: candidate template ignored: could not match 'expression<type-parameter-0-0>' against 'double'
D& operator=(const expression<E>& e);
^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
class gcontainer : public gstrided<D>
^
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:20:7: note: candidate function (the implicit move assignment operator) not viable: no known conversion from 'double' to 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >' for 1st argument
/home/bda/fusion/gtensor/examples/../include/gtensor/gtensor.h:46:7: note: candidate function (the implicit copy assignment operator) not viable: no known conversion from 'double' to 'const gt::gtensor<double, 1, gt::space::device>' for 1st argument
class gtensor : public gcontainer<gtensor<T, N, S>>
^
Note that this also fails with space::host.
Using d_axpy = gt::scalar(1.0) fails with a different error. Not sure if that makes any sense but it seems logical from an abstract interface point of view:
/home/bda/fusion/gtensor/examples/../include/gtensor/sarray.h:66:41: error: cannot convert 'gt::sarray<int, 0>' to 'int' without a conversion operator
sarray<T, N>::sarray(U... args) : data_{T(args)...}
^~~~~~
/home/bda/fusion/gtensor/examples/../include/gtensor/gcontainer.h:60:10: note: in instantiation of function template specialization 'gt::sarray<int, 1>::sarray<gt::sarray<int, 0> , 0>' requested here
resize(e.derived().shape());
^
/home/bda/fusion/gtensor/examples/src/daxpy.cxx:146:12: note: in instantiation of function template specialization 'gt::gcontainer<gt::gtensor<double, 1, gt::space::device> >::operator=<gt::gscalar<double> >' requested here
d_axpy = gt::scalar(1.0);
^
1 error generated.
Would it make sense to contribute this to xtensor itself, or is it better as a separate library as it is now?
I think @SylvainCorlay, @wolfv and others might be interested in your work.
Ideally would build for no device, CUDA, HIP, and SYCL. I have a machine with both nVidia and AMD GPUs, can use that as a starting point. Too bad it doesn't have an iGPU... but it could test SYCL also by using hipSYCL+AMD, or intel SYCL + Intel CPU OpenCL, or any SYCL with HOST device.
This should cover most reduction use cases. It can wrap thrust::transform_reduce for cuda and hip backends. For sycl backend, this could possibly be implemented on top of the reduction support in 2020 spec, or more simply using oneDPL for Intel SYCL.
In particular this could be used for an efficient sum of squares implementation, needed by @td-mpcdf.
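Sum of squares is exactly a transform-reduce: the same pattern maps onto thrust::transform_reduce for the CUDA/HIP backends and onto SYCL 2020 reductions or oneDPL. A host sketch of the shape of the computation using C++17's std::transform_reduce:

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Sum of squares as transform (x -> x*x) followed by reduce (+).
inline double sum_of_squares(const std::vector<double>& v)
{
  return std::transform_reduce(v.begin(), v.end(), 0.0, std::plus<>{},
                               [](double x) { return x * x; });
}
```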
This was introduced in the allocator/backend refactor, and appears to have masked issues in oneMKL getrf/s where it segfaults when using device memory.
gt-blas implementation (possibly gt-fft also?) does lazy error handling and re-uses gtGpuCheck, which handles success vs failure OK but prints extremely confusing and misleading error messages. One way to deal with this is to add an overload for gtGpuCheck that takes the appropriate library error type and prints the library error message, or at the very least print the numeric code and not a misleading message.
When profiling and debugging, tracking down which version of a gtensor kernel is the one of interest can be challenging. For SYCL, kernel names are passed as template parameters with templated type names - another sub-type of the templated name could be optionally supplied by the user, via a gt::assign<typename>(lhs, rhs) helper function. This is ugly, but a starting point for discussion. Not exactly sure how to do this for HIP / CUDA yet.
The CUDA/HIP implementations are fragile in that there may be array sizes that overflow certain limits. Explore using linear launch indexing and mapping back to expression indexes. This requires integer divide and modulo, which may hurt performance, but depending on computational intensity the cost may be negligible, and it may simplify the launch routines a lot. Ideally we have one generic launch for any number of dimensions.
The WIP sycl implementation currently only goes up to 3 dims using range instead of nd_range, and similar challenges apply.
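The mapping from a linear launch index back to an N-d expression index is the divide/modulo chain being weighed above. A sketch for column-major layout (helper name is illustrative, not gtensor code):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Unflatten a linear index into a column-major N-d index: the first
// dimension varies fastest, so peel it off with % and advance with /.
template <std::size_t N>
std::array<int, N> unflatten_col_major(long linear,
                                       const std::array<int, N>& shape)
{
  std::array<int, N> idx{};
  for (std::size_t d = 0; d < N; d++) {
    idx[d] = static_cast<int>(linear % shape[d]);
    linear /= shape[d];
  }
  return idx;
}
```

One generic launch would iterate linear indices 0..size-1 and apply this per element.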
SYCL has a 2k parameter size limit. CUDA typically has 4k, although it may depend on device. Complex RHS expressions can exceed this limit pretty easily, particularly on SYCL, and this is in fact the case for one of the GENE kernels. The workaround I added was to copy the RHS expression gtensor_span object to device memory and pass only the pointer as a parameter. This works, but it can slow things down, particularly for small kernels. While we want to avoid small kernels in general, it's still not great.
One way to improve this, suggested by @uphoffc, is to optimize gview and gtensor_storage. At least for the case of a gview around a gtensor, all that needs to be stored is a pointer to the underlying data and strides. The offset can be added in to the pointer, and multiple layers of strides are not needed (the underlying strides just amount to col major and can be assumed). I am not sure how this will work for more complex cases, however, e.g. composing swapaxes and a slice view.
We can also use the copy to device-mem pass pointer hack only when necessary, using constexpr if (for SYCL backend which already requires C++17) or with more fancy TMP hackery. Note that nvcc didn't support C++17 until CUDA 11, so we could consider bumping the gtensor requirement at the cost of losing support for CUDA < 11.
There also seems to be redundancy in the gtensor_span object, which extends gstrided and contains shape, strides, plus the storage object's size and capacity. Not clear to me whether this is worth optimizing though.
I am also unsure of the effect of these optimizations - the low hanging fruit here is simply constexpr if to only apply the workaround when the parameters are really too big. Most small kernels will have a small RHS and won't require the workaround; kernels with a RHS that exceeds the limit will likely be long running and the extra memcpy is likely to have less impact. I think the first step here is to apply the constexpr if for the SYCL backend, which has the smallest param limit and already requires C++17. Then we can explore further more complicated changes over time. Second step is to analyze the GENE kernels to see what view combos they are using, to get a sense for how much of an effect gview size optimization would have, e.g. would special casing gview of gtensor to avoid double storing strides make a significant difference here.
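The "only apply the workaround when the parameters are really too big" idea is a compile-time size check. A C++17 sketch, where the 2048-byte limit, the function, and the returned labels are all illustrative stand-ins, not the gtensor implementation:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstring>

constexpr std::size_t kParamLimit = 2048; // illustrative SYCL-like limit

// Pick, via constexpr if, between passing the kernel argument by value and
// the copy-to-device / pass-a-pointer fallback. Only one branch is compiled
// for any given expression type E.
template <typename E>
const char* choose_launch_path(const E&)
{
  if constexpr (sizeof(E) <= kParamLimit) {
    return "by-value"; // small expression: pass directly as kernel parameter
  } else {
    return "device-pointer"; // too big: copy span to device, pass pointer
  }
}
```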
There is a performance regression in GENE related to the RHS calculation, which does a complex expression eval involving views. The first step is to develop a test case which reproduces the issue, then fix it and update gtensor in GENE.
Since NDEBUG affects other code, we should have a define that is specific to gtensor for disabling asserts. This is important especially for HIP and SYCL, which don't allow assert in device code.
Device vendor dependent code should be refactored, so ifdefs only occur in one place (perhaps space.h). One idea is to have gt::space::cuda, gt::space::hip, etc., and have gt::space::device be a type alias for this, and in the same place the device specific header can be included.
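A sketch of that layout, with per-vendor tag types and a single alias selected by ifdefs in one header (the tag and macro names here are illustrative, not necessarily gtensor's):

```cpp
#include <cassert>
#include <type_traits>

namespace gt {
namespace space {
// one empty tag type per backend
struct host {};
struct cuda {};
struct hip {};
struct sycl {};

// the only place vendor ifdefs appear; the vendor header would be
// included alongside the chosen alias
#if defined(GTENSOR_DEVICE_CUDA)
using device = cuda;
#elif defined(GTENSOR_DEVICE_HIP)
using device = hip;
#elif defined(GTENSOR_DEVICE_SYCL)
using device = sycl;
#else
using device = host; // host-only build
#endif
} // namespace space
} // namespace gt
```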
Currently the main CMakeLists.txt file lists libraries needed for blas and fft for both HIP and SYCL. This is just a quick hack to get things working until upstream cmake is improved. In particular, the HIP config breaks mixed language builds (e.g. Fortran + C); once this is merged the official cmake support for HIP should work again:
ROCm/HIP#2190
Also need to explore oneAPI cmake support further.
For the SYCL assign/launch and sum_axis_to reduction routine, linear indexing based on expression shape is used to create N dimensional implementations. This is done using the index_expression helper function (currently only works up to 6d), which is hacky. This could also be generally useful elsewhere in gtensor to generalize functions beyond six dimensions, and to create specialized routines in application code.
A related operation is access based on an index object, rather than operator() style which takes separate arguments. gview.h implements some of these for internal use, for arbitrary dimension. This could be refactored and moved so that, for example, all containers support both linear access and sarray index access.
We can look to xtensor for what interface we might want to provide for this:
https://xtensor.readthedocs.io/en/latest/indices.html
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#element-access
https://xtensor.readthedocs.io/en/latest/view.html#flatten-views
https://xtensor.readthedocs.io/en/latest/quickref/basic.html#data-buffer
So one possibility is operator[] for index array style access (a[{0, 7}] == a(0, 7)), and forcing the use of a.data() to get linear indexing. However a.data() only works on container types, not general expressions, so it is less powerful for implementing n-dimensional routines. Currently gcontainer provides the data_access method for this, but again it's limited to container types and container spans only.
Also note that gtensor currently assumes column major (Fortran style) indexing. If we add this functionality, it is another place that will need to be updated if row major and other layouts are added.
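The arithmetic behind an index-array style operator[] is a single dot product of index and strides. A sketch, assuming column-major strides as gtensor does today (helper name is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Flat offset of an N-d index given per-dimension strides; this is the
// operation a hypothetical `a[{i, j}]` would perform before indexing the
// underlying linear storage.
template <std::size_t N>
int flat_offset(const std::array<int, N>& idx,
                const std::array<int, N>& strides)
{
  int off = 0;
  for (std::size_t d = 0; d < N; d++) {
    off += idx[d] * strides[d];
  }
  return off;
}
```

Supporting row-major later only changes how the strides are computed, not this accumulation.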
oneMKL SYCL backend doesn't support this at all, and the HIP (rocblas+rocsolver) backend appears to have issues with it as well, even though the interface suggests it should work.
Options here:
(1) change our interface to only support contiguous, by taking a single pointer for A/B data, and allocate / populate the pointer array when needed to adapt to the underlying API.
(2) keep current interface, and emulate support in oneMKL by copying data if it's non-contiguous
The performance hit of (2) is likely prohibitive, but it could be a stopgap if support will be added later. Need to look more closely at the available (non-oneAPI) MKL APIs, for a sense of what might be supported in the future.
Both thrust and no thrust builds fail:
In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:6:
/home/bda/fusion/gtensor/include/gtensor/expression.h:41:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() const& -> const derived_type&
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:35:33: note: previous declaration is here
GT_INLINE const derived_type& derived() const&;
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:47:28: error: __host__ function 'derived' cannot overload __host__ __device__ function 'derived'
inline auto expression<D>::derived() & -> derived_type&
^
/home/bda/fusion/gtensor/include/gtensor/expression.h:36:27: note: previous declaration is here
GT_INLINE derived_type& derived() &;
^
In file included from /home/bda/fusion/gtensor/tests/test_gtensor.cxx:4:
In file included from /home/bda/fusion/gtensor/include/gtensor/gtensor.h:8:
In file included from /home/bda/fusion/gtensor/include/gtensor/gfunction.h:8:
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:92:25: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline int gstrided<D>::shape(int i) const
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:53:17: note: previous declaration is here
GT_INLINE int shape(int i) const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:98:26: error: __host__ function 'shape' cannot overload __host__ __device__ function 'shape'
inline auto gstrided<D>::shape() const -> const shape_type&
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:54:31: note: previous declaration is here
GT_INLINE const shape_type& shape() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:104:26: error: __host__ function 'strides' cannot overload __host__ __device__ function 'strides'
inline auto gstrided<D>::strides() const -> const strides_type&
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:55:33: note: previous declaration is here
GT_INLINE const strides_type& strides() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:110:31: error: __host__ function 'size' cannot overload __host__ __device__ function 'size'
inline size_type gstrided<D>::size() const
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:56:23: note: previous declaration is here
GT_INLINE size_type size() const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:117:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) const -> const_reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:59:29: note: previous declaration is here
GT_INLINE const_reference operator()(Args&&... args) const;
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:124:26: error: __host__ function 'operator()' cannot overload __host__ __device__ function 'operator()'
inline auto gstrided<D>::operator()(Args&&... args) -> reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:61:23: note: previous declaration is here
GT_INLINE reference operator()(Args&&... args);
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:151:26: error: __host__ function 'data_access' cannot overload __host__ __device__ function 'data_access'
inline auto gstrided<D>::data_access(size_type i) const -> const_reference
^
/home/bda/fusion/gtensor/include/gtensor/gstrided.h:70:29: note: previous declaration is here
GT_INLINE const_reference data_access(size_type i) const;
Suggested by @td-mpcdf. My understanding is that this would just be an offset, so if lbound=9, the range [10,20] would actually be [1,11] on the physical array. Basically lbound is subtracted from all indexes. Is this the desired behavior? @germasch do you think this makes sense for gtensor?
This could be its own view type, or a new view parameter. Or something specific to gt::adapt and gt::adapt_device, which is where it would be used in GENE, but this seems like an odd place to put the functionality when looking at gtensor as a whole.
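The lbound behavior described above is just per-dimension subtraction before the usual stride computation: with lbound = 9, logical index 10 maps to physical index 1. A sketch of that mapping (names and layout are illustrative):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Subtract the lower bound from each logical (Fortran-style) index to get
// the zero-based physical index used for the stride computation.
template <std::size_t N>
std::array<int, N> to_physical(const std::array<int, N>& logical,
                               const std::array<int, N>& lbound)
{
  std::array<int, N> phys{};
  for (std::size_t d = 0; d < N; d++) {
    phys[d] = logical[d] - lbound[d];
  }
  return phys;
}
```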
Not sure this is unique to me; it looks like it should be reproducible as long as bounds check is enabled (i.e., build type Debug).
[kai@macbook build (pr/functions-norm)]$ tests/test_launch
Running main() from _deps/googletest-src/googletest/src/gtest_main.cc
[==========] Running 7 tests from 1 test suite.
[----------] Global test environment set-up.
[----------] 7 tests from gtensor
[ RUN ] gtensor.launch_1d
[ OK ] gtensor.launch_1d (1 ms)
[ RUN ] gtensor.launch_1d_templated
[ OK ] gtensor.launch_1d_templated (0 ms)
[ RUN ] gtensor.launch_view_reverse_1d
[ OK ] gtensor.launch_view_reverse_1d (0 ms)
[ RUN ] gtensor.device_launch_1d
[ OK ] gtensor.device_launch_1d (1822 ms)
[ RUN ] gtensor.device_launch_view_reverse_1d
out-of-bounds error: dim = 0, arg = 2, shape = 2
../include/gtensor/strides.h:102: bounds_check: block: [0,0,0], thread: [2,0,0] Assertion `0` failed.
gpuCheck: 710 (device-side assert triggered) ../include/gtensor/gtensor.h 401
Abort trap: 6
gtensor is designed to have const safe containers, in that operator() returns a const reference if called on a const instance. This is much like std::vector. std::span, on the other hand, does not consider the underlying data part of its state (since the data is not copied when copying a span), and its operator[] returns a non-const reference even if the instance is const. gtensor, like its inspiration xtensor, uses the same semantics on view objects, even though it doesn't provide much extra protection, given that you can assign a view to a non-const and modify it that way. Unfortunately this causes problems for SYCL 2020 provisional, which requires a const kernel function.
More discussion here:
intel/llvm#2538
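The std::span-style semantics in question can be shown in miniature: the view does not treat the pointed-to data as part of its own state, so operator[] on a const view still hands back a mutable reference, which is what a const SYCL 2020 kernel functor needs in order to write through a view (the class below is a toy illustration, not gtensor_span):

```cpp
#include <cassert>

// Minimal non-owning view with "shallow const": the const qualifier on
// operator[] applies to the view object, not the data it points at.
template <typename T>
class mini_span
{
public:
  mini_span(T* data, int size) : data_(data), size_(size) {}
  T& operator[](int i) const { return data_[i]; } // const method, mutable ref
  int size() const { return size_; }

private:
  T* data_;
  int size_;
};
```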
The flatten implementation attempts to return the original object if it's already 1d or 0d, but this often results in a copy. Using decltype(auto) everywhere can fix it (both for the return value and for the variable it's assigned to), but this is fragile, i.e. using auto for the variable instead results in a copy. This breaks the flatten usage in SYCL N-d assign (it can be worked around by calling to_kernel before flatten, since copying the span returned by to_kernel doesn't copy storage).
There may have been some subtleties involved, but it should definitely be possible somehow.
Adding a debug print to the gtensor_storage copy ctor shows some surprises. In particular, assigning a temporary to an auto variable can result in a copy.
This is the interface used by xtensor:
https://xtensor.readthedocs.io/en/latest/operator.html#reducers
The trivial cases to add are 1d sum/min/max, that operate over the entire array (not axis aware). This can be done using thrust for CUDA/HIP, and the built in reduction support for SYCL 2020. One issue here is that thrust can be used even when not being used as the gtensor backing store - so perhaps GTENSOR_USE_THRUST should be renamed to something like GTENSOR_USE_THRUST_VECTOR. I don't see much benefit in avoiding thrust for reductions - we could fallback to CUB/rocPRIM, but rocPRIM still requires a separate install for ROCm, and Thrust/CUB are always available for CUDA.
If the source has a const data type, no signature will match and compilation will fail, e.g.
error: no instance of overloaded function "gt::copy" matches the argument list
argument types are: (gt::gtensor_span<const real_t, 2UL, gt::space::thrust_host>, gt::gtensor<real_t, 2UL, gt::space::device>)`
There is no reason for this not to work.
So I guess I'm not quite sure why the builds with rocm are so slow in the first place...
But since the initialization / rocm install only takes on the order of 1 minute, it seems like it may be worth it to split this up in some fashion so that the various builds can run in parallel in separate containers (if possible without too much hassle).
Currently, one needs to explicitly #include <gtensor/reductions.h>. Unless there's a good reason (say compile times become even slower), it'd be nice to have these available by default, i.e., included from <gtensor/gtensor.h>.
I'm getting this warning:
[cmake] CMake Warning (dev) at build-spack-ubuntu2/_deps/gtensor-src/CMakeLists.txt:230:
[cmake] Syntax Warning in cmake code at column 63
[cmake]
[cmake] Argument not separated from preceding token by whitespace.
[cmake] This warning is for project developers. Use -Wno-dev to suppress it.
I think it has to do with how things are quoted, but I don't really know what the issue is, and it's in sycl-specific code, so I can't easily test what's going on.
This is already needed for SYCL port, and should be easy to generalize to support CUDA and HIP as well. One advantage of this is avoiding the zero fill initialization in thrust, which can really hurt performance when allocating temporary arrays.
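The standard trick for skipping the zero fill is an allocator whose construct() default-initializes (leaving trivial types uninitialized) instead of value-initializing; a vector or storage class using it allocates without touching the memory. A sketch of that well-known pattern (not gtensor's actual allocator):

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Allocator adaptor that replaces value-initialization with
// default-initialization, so resize() of trivial types does no zero fill.
template <typename T, typename Base = std::allocator<T>>
struct no_init_allocator : Base
{
  template <typename U>
  struct rebind
  {
    using other = no_init_allocator<
      U, typename std::allocator_traits<Base>::template rebind_alloc<U>>;
  };

  template <typename U>
  void construct(U* p)
  {
    ::new (static_cast<void*>(p)) U; // default-init: no zeroing for trivial T
  }
  template <typename U, typename... Args>
  void construct(U* p, Args&&... args)
  {
    ::new (static_cast<void*>(p)) U(std::forward<Args>(args)...);
  }
};
```

The same idea, applied to the device allocator, avoids thrust's zero fill when allocating temporaries.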
This can be implemented via thrust::transform for CUDA/HIP and via std::transform for Intel SYCL
std::vector will preserve values on resize(), which may involve copies when reallocation is necessary. gtensor_storage does the same; however, I don't think it ever happens in its use in gtensor that values are expected to be preserved, so this copy is a waste of time (albeit probably not typically a significant one). Since the semantics of this container differ from std::vector anyway, in particular the lack of value initialization, I think a point could be made that it should also not try to preserve values on resize.
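A resize that does not preserve values is just free-then-reallocate when growing past capacity. A host sketch of the idea (illustrative, not the gtensor_storage code; std::malloc/std::free stand in for the backend allocator):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>

// Storage whose resize() discards old contents instead of copying them,
// since callers never rely on values surviving a resize.
template <typename T>
class raw_storage
{
public:
  raw_storage() = default;
  raw_storage(const raw_storage&) = delete;
  raw_storage& operator=(const raw_storage&) = delete;
  ~raw_storage() { std::free(data_); }

  void resize(std::size_t n)
  {
    if (n > capacity_) {
      std::free(data_); // no copy of old values on reallocation
      data_ = static_cast<T*>(std::malloc(n * sizeof(T)));
      capacity_ = n;
    }
    size_ = n;
  }
  std::size_t size() const { return size_; }
  std::size_t capacity() const { return capacity_; }

private:
  T* data_ = nullptr;
  std::size_t size_ = 0;
  std::size_t capacity_ = 0;
};
```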
(see #101) Should be easy to enable, but I need a reminder ;)
gtensor currently uses a fixed block size for CUDA/HIP kernels, which fails for arrays that are too large in a certain dimension (but still easily small enough to fit in a single kernel launch with a block size adjustment). This could be handled by a generic flattening n-d implementation (using 1d indexing), similar to the SYCL backend n-d assign/launch implementations, or with more complex logic to determine block and grid sizes.
Can add an error check in gtLaunchKernel macro, for HIP and CUDA backends. SYCL will raise an async exception on launch failure, which should already be handled by the global error handler.
This will simplify the use of gtensor a lot, especially as more GPU backends are added.
It sometimes could be convenient if one could do
auto s = gt::shape(1, 2) + gt::shape(3, 4);
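One reading of that + is concatenation into a 4-d shape (whether it should instead mean elementwise addition is part of what would need deciding). A sketch of the concatenation reading over std::array (the helper is hypothetical):

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Concatenate two shape-like arrays into one of combined rank, e.g.
// (1, 2) + (3, 4) -> (1, 2, 3, 4).
template <std::size_t N, std::size_t M>
std::array<int, N + M> concat(const std::array<int, N>& a,
                              const std::array<int, M>& b)
{
  std::array<int, N + M> r{};
  for (std::size_t i = 0; i < N; i++) {
    r[i] = a[i];
  }
  for (std::size_t i = 0; i < M; i++) {
    r[N + i] = b[i];
  }
  return r;
}
```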
I suggest renaming -> main -- going with the times -- unless there are any objections.
It seems to be necessary to somehow make it possible to use (CUDA) streams with gtensor. In the launch of an explicit kernel this could be another argument; for the assignment kernels, we should add a function set_stream(streamID) and launch the assignment kernel in this stream. A set_default_stream should set it back.
Two issues: we need a native SYCL implementation of ij_deriv, and we need to fix N-d assign when the RHS is a scalar. The issue is that the N-d implementation uses flatten, which calls reshape(std::forward<E>(e), shape(e.size())), and the e here is gscalar, which has no size method.
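One fix direction is to detect at compile time whether the expression type has a size() member, and treat types without one (like a scalar) as already flat. A C++14 void_t-style trait sketch (the trait and the two stand-in types are illustrative, not gtensor code):

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Primary template: T has no usable size() member.
template <typename T, typename = void>
struct has_size : std::false_type
{};

// Specialization selected when `declval<const T&>().size()` is well-formed.
template <typename T>
struct has_size<T, decltype(void(std::declval<const T&>().size()))>
  : std::true_type
{};

struct fake_scalar {};                          // stand-in for gscalar
struct fake_span { int size() const { return 8; } }; // stand-in for a span
```

flatten could then branch (tag dispatch or constexpr if) and pass scalar expressions through unchanged.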
For real and complex gtensor expressions. Needed for GENE.
Initializing gt::gtensor<std::complex<double>, 1> = { 1., 2+3.i } works (at least sometimes), but the same with gt::complex usually does not, not only because of 1.i being std::complex but more significantly because thrust::complex's constructors aren't constexpr. This makes, at the very least, for a bunch of ugly tests.
I think it's probably possible to get this to work with a special overload that initializes gt::gtensor<gt::complex<T>> with an initializer list of std::complex<T>. It'd be even nicer to be able to just use std::complex, but pre-C++20 lots of it is not constexpr, i.e., not __host__ __device__.
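What the initializer-list case needs from the complex type is constexpr constructors, so brace-init of elements is usable in constant contexts. A minimal literal complex type showing the property that thrust::complex lacks here (purely illustrative, not a proposed replacement):

```cpp
#include <cassert>

// Minimal complex-like literal type: the constexpr constructor is what makes
// initializer-list elements usable at compile time.
template <typename T>
struct lite_complex
{
  T re, im;
  constexpr lite_complex(T r = T(0), T i = T(0)) : re(r), im(i) {}
};

// Compiles as a constant expression, so such values can appear in init lists.
constexpr lite_complex<double> z{2.0, 3.0};
```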