Comments (15)
I am curious whether it works with the EXPLICIT_KERNEL version, which uses `gt::launch<1>` instead of array expressions. The errors are not obvious, but my first guesses are that the `cuda_fp16.h` header needs to be included earlier, before gtensor, or that a non-half value gets mixed in somewhere and, since mixed operators are not defined, the compiler tries a conversion and finds too many options.
The way I would recommend going about this, if you want to work on it, is to create a new test file in `tests/test_half.cxx` and start with a super simple host-only test case using `gt::launch_host<1>`. Once you have that working, gradually add more complex tests and get those working. Eventually we want this to be backend independent, so there needs to be a `gt::half` type that e.g. maps to `sycl::half` on the Intel SYCL backend. Note that half is optional in SYCL, so we may want an option for turning half support on or off in gtensor.

I can take a closer look if you are stuck. I suspect the errors will end up being very backend specific and require some tweaks to get working on each backend.
from gtensor.
Another way we could go about this: I could add `sycl::half` support for the Intel backend to provide the basic test/type structure, and you could add the CUDA and/or AMD backends. Let me know what you prefer - I don't want to spoil your fun if you want to dig in more deeply right away, but also don't want to make it too difficult to get started.
Thanks for the quick reply!
Following your input I can report the following:
- The explicit kernel version does not work either, although instead of ERROR#2(a-e) there is only one kind of error, reporting an ambiguous conversion from `thrust::ext::remove_device_reference_t<storage_type>` to a built-in type, twice for `gt::launch<1>(...)`.
- Including `cuda_fp16.h` before gtensor had no impact.
- For host-only code, both the implicit as well as the explicit kernel generation of `axpy` with the `half` type works just fine.
- Still, for the host-only code ERROR#1 above applies, but it can be circumvented either a) by alternative initialization or b) by changing `0` to `0.0` in lines 107-108 of `gtensor/include/gcontainer.h` (this might break something else though). The problem here seems to be that `half` can construct from `float` and `double`, but not from `int`.
I will dig into it and try to get a cuda version running, but will consult you when I get stuck - which might be soon^^
Two more notes, which could be of interest:
- `std::is_floating_point<half>::value` is `false`.
- For everything I reported so far, `nv_bfloat16` from `cuda_bf16.h` behaves just like `half` (up to the obvious different rounding etc.).
Oh I missed that, I would try building without `GTENSOR_USE_THRUST` set. And in general, it may be safer to use cmake to build. Looking at your log I think you have all the relevant options, but cmake builds are what is used most of the time, so there is more possibility of the old Makefile example getting out of date. For example, with cmake the thrust backend is no longer enabled by default.
Note that thrust is still used for some things like the complex type and reductions, even when GTENSOR_USE_THRUST is not set. The name is a bit misleading - it really refers to the storage backend, i.e. whether thrust device_vector is used or the custom gtensor_storage type is used.
Re `std::is_floating_point`, this is one approach: https://github.com/argonne-lcf/SyclCPLX/blob/main/include/sycl_ext_complex.hpp#L406 just replacing `sycl::half` with `gt::half` (which would be an alias for the CUDA, SYCL, or HIP half type depending on backend).
If you end up going down the route of adding some tests, feel free to submit a draft PR, and I can see what happens with sycl::half to try to see cross-backend issues sooner rather than later.
FWIW, I tried daxpy with gcc's `_Float16`, and hit only two issues:
- `std::cout` doesn't know how to print `_Float16`.
- The numbers in the daxpy example are greater than what can be represented in half precision, so I changed `n` to be just `1024`.

Beyond that, the daxpy example seems to work fine, at least on CPU only. I guess next is taking a look at what happens with CUDA.
Thanks for all the input.
Update from my side: with only a slight modification, I got the CUDA explicit kernel version of axpy running (without `GTENSOR_USE_THRUST`).
```cpp
#include <iostream>

#include <gtensor/gtensor.h>

#include <cuda_fp16.h>

using storage_type = half;

int main(int argc, char** argv)
{
  int N = 5;
  gt::gtensor<storage_type, 1, gt::space::device> d_x(gt::shape(N));
  gt::gtensor<storage_type, 1, gt::space::device> d_y(gt::shape(N));
  storage_type a = 0.5;

  auto k_x = d_x.to_kernel();
  auto k_y = d_y.to_kernel();

  gt::launch<1, gt::space::device>(d_x.shape(), [=] __device__ (int i) { k_x(i) = 3.0; });
  gt::launch<1, gt::space::device>(d_x.shape(), [=] __device__ (int i) { k_y(i) = 4.2; });
  gt::launch<1, gt::space::device>(d_x.shape(),
      [=] __device__ (int i) { k_y(i) = a * k_x(i) + k_y(i); });

  gt::gtensor<storage_type, 1, gt::space::host> h_y(gt::shape(N));
  gt::copy(d_y, h_y);

  for (int j = 0; j < N; ++j)
    std::cout << "y[" << j << "] = " << (double)h_y[j] << std::endl;

  return 0;
}
```
Apart from the minor work-arounds for initialization and printing, there are three things to note:
- `GT_LAMBDA` is defined as `__host__ __device__`, but `cuda_fp16.h{pp}` defines operators like `+(half, half)` only for `__device__`; so the kernel lambda needs to be `__device__`-only as well.
- The example also works with `#include <cuda_bf16.h>` and `using storage_type = nv_bfloat16;`.
- To have operators, casts, etc. for `half` available one needs `__CUDA_ARCH__ >= 530` (`>= 800` for `nv_bfloat16`), i.e., I additionally needed to specify the nvcc flags `--generate-code arch=compute_80,code=sm_80` on the A100.
Again, I'm happy for any feedback!
I will only have time to continue on it late next week. Do you think the next step should be to get the implicit kernel running, or does something else make more sense?
Making GT_LAMBDA device only is a major problem, it would break all host launches. Not obvious to me how to workaround that limitation - I wonder if there is a higher level library out there built on top of CUDA that provides a more friendly half type? One ugly possibility would be to define a separate GT_LAMBDA_HOST and GT_LAMBDA_DEVICE, and avoid use of half types in the _HOST. But that is a very ugly breaking change.
Other than finding a solution to that, I think the next step is to define a `gt::half` type (possibly just an alias for a backend specific type) and some basic tests for it.
I wonder if implicit conversion operators from backend specific device type to an appropriate compiler specific host type would help here? Like half -> _Float16 when using CUDA + g++.
I think it might make sense for me to try this out with sycl::half and _Float16, for sycl and host backends respectively. I suspect those to be easy, and then when you have some time to work on this again, the framework will exist (in a branch) for how to integrate CUDA/HIP support.
This might be worth trying? https://forums.developer.nvidia.com/t/error-when-trying-to-use-half-fp16/39786/10
> Making GT_LAMBDA device only is a major problem, it would break all host launches. Not obvious to me how to workaround that limitation - I wonder if there is a higher level library out there built on top of CUDA that provides a more friendly half type? One ugly possibility would be to define a separate GT_LAMBDA_HOST and GT_LAMBDA_DEVICE, and avoid use of half types in the _HOST. But that is a very ugly breaking change.
>
> Other than finding a solution to that, I think the next step is to define a `gt::half` type (possibly just an alias for a backend specific type) and some basic tests for it.
I think an appropriate design of the `gt::half` type will automatically cover this problem and keep `GT_LAMBDA` unchanged. In the axpy code below, note that the custom type `HalfWrapper` allows using `GT_LAMBDA` rather than `[=] __device__`.
```cpp
#include <iostream>

#include <gtensor/gtensor.h>

#include "half_wrapper.hxx"

using storage_type = HalfWrapper;

int main(int argc, char** argv)
{
  int N = 5;
  gt::gtensor<storage_type, 1, gt::space::device> d_x(gt::shape(N));
  gt::gtensor<storage_type, 1, gt::space::device> d_y(gt::shape(N));
  storage_type a = 0.5;

  auto k_x = d_x.to_kernel();
  auto k_y = d_y.to_kernel();

  gt::launch<1, gt::space::device>(d_x.shape(), GT_LAMBDA(int i) { k_x(i) = 3.0; });
  gt::launch<1, gt::space::device>(d_x.shape(), GT_LAMBDA(int i) { k_y(i) = 4.2; });
  gt::launch<1, gt::space::device>(d_x.shape(),
      GT_LAMBDA(int i) { k_y(i) = a * k_x(i) + k_y(i); });

  gt::gtensor<storage_type, 1, gt::space::host> h_y(gt::shape(N));
  gt::copy(d_y, h_y);

  for (int j = 0; j < N; ++j)
    std::cout << "y[" << j << "] = " << h_y[j] << std::endl;

  return 0;
}
```
Here `HalfWrapper` is a prototype for `gt::half` which uses `half` as storage type. On the CUDA device computations are performed in `half`, while on the host calculations just fall back to some compute precision which the host understands (e.g. `float`).
```cpp
// half_wrapper.hxx
#ifndef HALF_WRAPPER
#define HALF_WRAPPER

#include <iostream>

#include <cuda_fp16.h>

#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
#define TARGET_ARCH __host__ __device__
#else
#define TARGET_ARCH
#endif

class HalfWrapper
{
public:
  TARGET_ARCH HalfWrapper(float x) : x(x) {}
  TARGET_ARCH HalfWrapper(half x) : x(x) {}
  TARGET_ARCH const HalfWrapper& operator=(const float f) { x = f; return *this; }
  TARGET_ARCH const half& Get() const { return x; }

private:
  half x;
};

TARGET_ARCH const HalfWrapper operator+(const HalfWrapper& lhs, const HalfWrapper& rhs)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
  return HalfWrapper(lhs.Get() + rhs.Get());
#else
  return HalfWrapper(float(lhs.Get()) + float(rhs.Get()));
#endif
}

TARGET_ARCH const HalfWrapper operator*(const HalfWrapper& lhs, const HalfWrapper& rhs)
{
#if defined(__CUDA_ARCH__) && (__CUDA_ARCH__ >= 530)
  return HalfWrapper(lhs.Get() * rhs.Get());
#else
  return HalfWrapper(float(lhs.Get()) * float(rhs.Get()));
#endif
}

std::ostream& operator<<(std::ostream& s, const HalfWrapper& h)
{
  s << (float)h.Get();
  return s;
}

#endif // HALF_WRAPPER
```
Of course different roundings will lead to slightly different results on the host, but just falling back to `float` definitely requires the least implementation effort for now.
What do you think?
I think that is a great place to start! Trying to detect whether the host compiler has support and use it when available, e.g. gcc's `_Float16`, can be a later enhancement (and in particular I think it is not important for the way GENE uses gtensor). Do you feel comfortable adding a header in `include/gtensor/half.h` (or maybe `fp.h` to indicate it may support e.g. fp8 too in the future), and a simple test in `tests/test_half.cxx`, and opening a draft PR? Might be worth creating a main fork branch for this too, so we can collaborate more easily vs using one of our forks.
We will probably also need a new cmake option GTENSOR_ENABLE_FP16 or similar, that disables test_half when not set and does not include the half.h header in gtensor.h. But that can be added later too. One advantage of this is, we can make it OFF by default and get more experimental stuff mainlined sooner without worrying about breaking some combo of CUDA/HIP version and driver and other messiness.
My understanding for SYCL is that `sycl::half` must be available, but talking with @TApplencourt, it is UB to perform operations on the host. This is the relevant part of the spec: https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#sec:opencl:extension-fp16
Note that this does happen to work on Intel oneAPI icpx, at least with 2023.1.0 and Gen11 iGPU and a discrete GPU I tested:
```cpp
#include <iostream>

#include <sycl/sycl.hpp>

int main(int argc, char** argv)
{
  sycl::half a, b, c;
  a = 1;
  b = 1;
  c = 0;
  c = a + b;
  std::cout << "c = " << c << std::endl;
}
```
Sure - I will follow the steps you suggested and open a draft PR.
Got 1-2 busy weeks ahead, but after that I'm all on it.
Due to holidays everything took a bit longer than expected, but now it's there: draft PR #276.