glavnokoman / vuh

Vulkan compute for people

Home Page: https://glavnokoman.github.io/vuh

License: MIT License

CMake 3.18% C++ 96.08% Shell 0.66% C 0.09%
vulkan vulkan-library vulkan-utils glsl glsl-shaders gpgpu gpgpu-computing framework gpu-programming vulkan-compute-shaders

vuh's Introduction

Watch out!

I no longer have the time or interest to maintain or develop this further. I will lazily review bug-fix requests, but little effort will be made to maintain code quality or style, and no new releases will be made.

Vuh. A Vulkan-based GPGPU computing framework.


Vulkan is the most widely supported GPU programming API on modern hardware and operating systems. It lets you write truly portable and performant GPU-accelerated code that runs on iOS, Android, Linux, Windows, macOS... NVidia, AMD, Intel, Adreno, Mali... whatever. The price is a ridiculous amount of boilerplate. Vuh aims to reduce that boilerplate to a (reasonable) minimum in the most common GPGPU computing scenarios. The ultimate goal is to beat OpenCL in usability, portability and performance.

Motivating Example

A saxpy implementation using vuh:

#include <vuh/vuh.h>
#include <vector>

auto main()-> int {
   auto y = std::vector<float>(128, 1.0f);
   auto x = std::vector<float>(128, 2.0f);

   auto instance = vuh::Instance();
   auto device = instance.devices().at(0);    // just get the first available device

   auto d_y = vuh::Array<float>(device, y);   // create device arrays and copy data
   auto d_x = vuh::Array<float>(device, x);

   using Specs = vuh::typelist<uint32_t>;     // shader specialization constants interface
   struct Params{uint32_t size; float a;};    // shader push-constants interface
   auto program = vuh::Program<Specs, Params>(device, "saxpy.spv"); // load shader
   program.grid(128/64).spec(64)({128, 0.1f}, d_y, d_x); // run once, wait for completion

   d_y.toHost(begin(y));                      // copy data back to host

   return 0;
}

and the corresponding kernel (glsl compute shader) code:

#version 440

layout(local_size_x_id = 0) in;             // workgroup size (set with .spec(64) on C++ side)
layout(push_constant) uniform Parameters {  // push constants (set with {128, 0.1} on C++ side)
   uint size;                               // array size
   float a;                                 // scaling parameter
} params;

layout(std430, binding = 0) buffer lay0 { float arr_y[]; }; // array parameters
layout(std430, binding = 1) buffer lay1 { float arr_x[]; };

void main(){
   const uint id = gl_GlobalInvocationID.x; // current offset
   if(params.size <= id){                   // drop threads outside the buffer
      return;
   }
   arr_y[id] += params.a*arr_x[id];         // saxpy
}
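For reference (not part of vuh), the kernel above computes the standard saxpy operation. A host-side equivalent, useful for validating the results copied back with toHost(), could look like this:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side reference for the saxpy kernel: y[i] += a * x[i].
void saxpy_reference(std::vector<float>& y, const std::vector<float>& x, float a) {
   assert(y.size() == x.size());
   for (std::size_t i = 0; i < y.size(); ++i) {
      y[i] += a * x[i];
   }
}
```

Running this on the same inputs as the GPU version and comparing element-wise (with a small tolerance) is a cheap sanity check for the whole pipeline.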

Features

  • storage buffers as vuh::Array<T>
    • allocated in device-local, host-visible or device-local-host-visible memory
    • data exchange with the host, including hidden staging buffers
  • computation kernels as vuh::Program
    • buffer binding (passing an arbitrary number of array parameters)
    • specialization constants (to set workgroup dimensions, etc.)
    • push constants (to pass small data (<= 128 bytes), like task dimensions, etc.)
    • whatever compute shaders support: shared memory, etc.
  • asynchronous data transfer and kernel execution with host-side synchronization
  • multiple device support
  • yet to come...
  • not ever coming...
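The 128-byte push-constant figure comes from Vulkan's guaranteed minimum for VkPhysicalDeviceLimits::maxPushConstantsSize; larger limits are device-specific. A compile-time guard on the push-constant struct (a sketch, reusing the Params struct from the saxpy example above) is cheap insurance against silently exceeding the portable budget:

```cpp
#include <cstdint>

// Push-constant interface, as in the saxpy example above.
struct Params { std::uint32_t size; float a; };

// Vulkan only guarantees maxPushConstantsSize >= 128 bytes, so staying
// at or under 128 keeps the code portable across devices.
static_assert(sizeof(Params) <= 128,
              "Params exceeds the guaranteed push-constant budget");
```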

Usage

vuh's People

Contributors

federkamm, glavnokoman, liviuchirca, mfairclough, wangqiang1588


vuh's Issues

Rendering result feature

I want to suggest adding support for rendering the result of a computation (i.e. using a swapchain and surface). This can be useful for ray tracers, or any computation that requires graphics output.

Framework for getting device information

There are certain extensions, like VK_AMD_shader_core_properties, which could be useful in a compute scenario, as Vulkan does not (to my knowledge) provide an official method of getting device information such as the number of compute units. The information from VK_AMD_shader_core_properties could be exposed via vuh::Device with a function, for example

auto getShaderCorePropertiesAMD()-> vk::PhysicalDeviceShaderCorePropertiesAMD;

However, I'm unsure whether this is an approach vuh would benefit from: it is specific to AMD devices, and while you could load the extension implicitly and add runtime checks to the getShaderCorePropertiesAMD call (throwing an error if VK_AMD_shader_core_properties was not loaded), it might be better to have a more generic device-information function. That could be implemented for AMD GPUs using VK_AMD_shader_core_properties, but only if respective or similar extensions for Intel and Nvidia GPUs existed (I have not checked) and shared a common way of describing a GPU's internals, or at least enough information to present the results in a generic manner. OpenCL offers this to an extent (for example CL_DEVICE_MAX_COMPUTE_UNITS), but even that is imperfect; see https://stackoverflow.com/a/9326978.

unusually slow vulkan/glsl compared to cuda

The same algorithm using CUDA (v10) is ~4 times faster than glsl/vulkan (v1.1.89, glslang v7.9.2888), even when CUDA uses double-precision data types and glsl/vulkan uses single precision (float).
Another observation: for some reason, with single precision, error accumulates in the CUDA code and results differ after several hundred iterations (double precision works ok), while the single-precision glsl/vulkan code gives good results. Fixed by adding an "f" suffix to constants in the kernel (by advice in https://devtalk.nvidia.com/default/topic/1043602/cuda-programming-and-performance/solved-cuda-single-precision-code-accumulates-error-while-glsl-vulkan-works-why-/)
Here is the code: https://drive.google.com/open?id=1Q72ERuCLypzvRjAzSZ3a3bLD9k2JEGuW
I tried with run_async.

Is it a bug that `fromHost` is called without `vkFlushMappedMemoryRanges`?

Hi, I was trying to understand how Vulkan works from your source code.

When I looked into the file deviceArray.hpp in branch vuh2:

auto fromHost(It1 begin, It2 end)-> void {
    if(Base::isHostVisible()){
        std::copy(begin, end, host_data());
        Base::_dev.unmapMemory(Base::_mem);
    } else { // memory is not host visible, use staging buffer
        auto stage_buf = HostArray<T, AllocDevice<properties::HostCoherent>>(Base::_dev, begin, end);
        copyBuf(Base::_dev, stage_buf, *this, size_bytes());
    }
}

What I think is that even if isHostVisible() returns true, we should check whether the underlying memory buffer includes the flag VK_MEMORY_PROPERTY_HOST_COHERENT_BIT; otherwise we should call vkFlushMappedMemoryRanges and vkInvalidateMappedMemoryRanges.

This is a note from here

Unmapping non-coherent memory does not implicitly flush the host mapped memory, and host writes that have not been flushed may not ever be visible to the device. However, implementations must ensure that writes that have not been flushed do not become visible to any other memory.

Please correct me if I am wrong, thanks anyway!
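The check the commenter describes can be expressed as a small predicate on the memory property flags. The bit values below are copied from VkMemoryPropertyFlagBits in the Vulkan headers; the function name is illustrative, not part of vuh:

```cpp
#include <cstdint>

// Values as defined by VkMemoryPropertyFlagBits in the Vulkan headers.
constexpr std::uint32_t kHostVisibleBit  = 0x00000002; // VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT
constexpr std::uint32_t kHostCoherentBit = 0x00000004; // VK_MEMORY_PROPERTY_HOST_COHERENT_BIT

// A mapped host write needs an explicit vkFlushMappedMemoryRanges only
// when the memory is host-visible but NOT host-coherent; for coherent
// memory the writes become visible to the device without flushing.
constexpr bool needsExplicitFlush(std::uint32_t memoryPropertyFlags) {
   return (memoryPropertyFlags & kHostVisibleBit)
       && !(memoryPropertyFlags & kHostCoherentBit);
}
```

This matches the spec quote above: unmapping does not flush, so a fromHost path that maps non-coherent memory would indeed need the flush call (or should restrict itself to host-coherent allocations, as the staging path does).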

Incompatible Driver

I successfully compiled the examples, however, when I try to run them, I get

libc++abi: terminating with uncaught exception of type vk::IncompatibleDriverError: vk::createInstance: ErrorIncompatibleDriver
Abort trap: 6

Is there a way to fix this? Is my GPU just not supported?

implement Array according to array_usage.md

Provide ForwardIterator and OutputIterator compatible interface for (double-)buffered async io between host and device arrays. Sync happens explicitely or at ArrayView destruction. Device local & host visible memory should be used for staging buffers when available.

Error in the function read_spirv

The latest release (version 1.1.3) works with no problems, but the most recent version of the master branch does not (tested after commit 4d241ed).

When I tried to run the example vuh_example_saxpy, I got the following error:

terminate called after throwing an instance of 'vk::InitializationFailedError'
  what():  vk::Device::createShaderModule: ErrorInitializationFailed

I noticed that the function read_spirv is not returning the proper data.
After I modified the function to be the same as it was in the release 1.1.3, the program worked.

Tested with visual c++ and mingw (msys2) on Windows 10.

Can vuh::Array contain a struct?

I would like to store structs and pass them to the compute shaders in buffers. Is this possible?

For example struct:

struct Vector2
{
    float x;
    float y;
};
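Whether vuh::Array<Vector2> compiles as-is depends on the library's template constraints, but the essential requirement is that the C++ struct layout matches the shader's std430 buffer layout. For a struct of two 32-bit floats the layouts happen to agree (std430 gives vec2 an 8-byte size and array stride, the same as a packed pair of floats); a compile-time check makes the assumption explicit:

```cpp
#include <cstddef>

// Mirrors the struct from the question: two 32-bit floats.
struct Vector2 { float x; float y; };

// Under std430, vec2 occupies 8 bytes and an array of vec2 has an
// 8-byte stride, so a tightly packed array of Vector2 matches
// `buffer { vec2 arr[]; }` on the shader side.
static_assert(sizeof(Vector2) == 8,
              "unexpected padding; layout would not match std430 vec2");
```

Beware that this does not generalize: vec3, for example, has 16-byte alignment under std430, so a struct containing three floats would not match. Declaring the buffer as a flat float[] and indexing pairs manually sidesteps layout pitfalls entirely.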

EXC_BAD_INSTRUCTION on .toHost()

Trying to copy from a vuh::Array to a host vector after running a program causes an EXC_BAD_INSTRUCTION exception. OS X.


Sample code:


    std::vector<float> m_positionsCache;
    std::vector<float> m_outPositionsCache;
    std::vector<Entity*>& FindInCircle(Vector2& center, float radius)
    {
        auto instance = vuh::Instance();
        auto device = instance.devices().at(0);
        
        m_positionsCache.clear();        
        m_outPositionsCache.clear();

        for (auto iter = m_entities.begin(); iter != m_entities.end(); iter++)
        {
            Vector2& position = iter->GetComponent<Vector2>("Position");
            m_positionsCache.push_back(position.x);
            m_positionsCache.push_back(position.y);
            
            m_outPositionsCache.push_back(-1.0f);
            m_outPositionsCache.push_back(-1.0f);
            
        }

        using SpecializationConstants = vuh::typelist<uint32_t,uint32_t,uint32_t>;
        struct PushConstants {
            uint32_t numberOfElements;
        };
      
        uint32_t num = m_positionsCache.size();
        
        
        auto inArray = vuh::Array<float>(device, m_positionsCache);
        auto outArray = vuh::Array<float>(device, m_outPositionsCache);
        
        auto program = vuh::Program<SpecializationConstants, PushConstants>(device, "a.spv");
        program.grid(128);
        program.spec(128, 1, 1);
        program({
            num
        }, inArray, outArray);
        
        outArray.toHost(m_outPositionsCache.begin());
    }

Sample spv

#version 440

layout (local_size_x_id = 0) in;
layout (local_size_y_id = 1) in;
layout (local_size_z_id = 2) in;

layout (push_constant) uniform SearchParameters {
	uint numberOfElements;
} searchParameters;

// INDEX
// X
// Y
// INDEX...

layout (binding = 0) buffer lay0 { float inElementPositions[]; };
layout (binding = 1) buffer lay1 { float outElementPositions[]; };

void main(void)
{
	if (gl_GlobalInvocationID.x >= searchParameters.numberOfElements)
	{
		return;
	}

	outElementPositions[gl_GlobalInvocationID.x] = inElementPositions[gl_GlobalInvocationID.x];
}

State of vuh2

Hi, there is the vuh2 branch, but it's not the default. There seem to be some design improvements in vuh2, but what is its current state? Can you recommend vuh2 for production code, or should I better stick to the master branch? (Btw., is it on purpose that there is no vuh/vuh.hpp in vuh2?)

How to specify offset for the buffer

I just started learning how to use vuh. The library is easy and smooth.
But I have a question. I would like to use one big buffer as an Array that contains a group of images (not just one).
How do I specify the offset to image N in the buffer, so the kernel can process it?
I have one block of memory that I pass to the shader, but inside the shader I need to know which data to process.
I don't want to reject the invocation if the current invocation offset is outside image N; that is a bad way. There should be another solution, I think.
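One common approach (not vuh-specific) is to pack the images back-to-back, pass the image dimensions via push constants, and derive the image index and pixel offset from gl_GlobalInvocationID in the shader. The arithmetic is plain index math; a host-side sketch with illustrative names:

```cpp
#include <cstdint>

// Illustrative: N images of identical size packed back-to-back in one buffer.
struct PackedImages {
   std::uint32_t width;
   std::uint32_t height;
};

// Linear offset (in elements) of pixel (x, y) inside image `n`.
constexpr std::uint32_t offsetOf(const PackedImages& imgs, std::uint32_t n,
                                 std::uint32_t x, std::uint32_t y) {
   const std::uint32_t imageStride = imgs.width * imgs.height;
   return n * imageStride + y * imgs.width + x;
}
```

On the shader side the same expression works with push-constant fields, and the inverse mapping (global index to image number) is simply id / imageStride, so no invocations need to be rejected.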

Allow users to specify entry point of compute shader

I'm using clspv (https://github.com/google/clspv) to generate some SPIRV shaders, and it's nice to be able to specify the entry point names myself.

I noticed you're not actively developing this project too much anymore, so I'm happy to contribute to it myself. Also have a couple other improvements I might file as well if I keep working on vulkan compute shaders!

bug in copy_async

I think there's a bug here, copy_async.hpp line 208:

return Delayed<Copy>{ stage.copy_async(device_begin(stage.array), device_end(stage.array), dst_begin), Copy::wrap(std::move(stage)) };

`std::move(stage)` can actually destroy/move `stage` before it is used in `stage.copy_async`.

Here's a proposed fix:

auto cpy = stage.copy_async(device_begin(stage.array), device_end(stage.array), dst_begin);
return Delayed<Copy>{ std::move(cpy), Copy::wrap(std::move(stage)) };

It seems to work now, but to be honest I have no idea how the whole thing works; I'm lost in all the std::move's.

Edit: I hit the bug with VS, not sure about other compilers. I also had some weird compilation errors with templates, but I don't think they're related.
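For context: parenthesized argument lists have unspecified evaluation order before C++17, and although the elements of a braced-init-list are formally evaluated left to right, some compilers (notably older MSVC, which matches the VS observation above) have not honored that guarantee. Hoisting the first expression into a named local, as in the proposed fix, forces the order. A self-contained illustration of the safe pattern (all names invented for the example):

```cpp
#include <string>
#include <utility>

// Stand-in for the staging object from copy_async.hpp.
struct Stage {
   std::string data{"payload"};
};

// Safe pattern from the proposed fix: evaluate the expression that USES
// `stage` into a named local BEFORE moving the object, so no compiler
// can reorder the move ahead of the use.
std::pair<std::string, Stage> makeDelayed(Stage stage) {
   auto copyHandle = stage.data;  // uses stage; guaranteed to run first
   return { std::move(copyHandle), std::move(stage) };
}
```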

windows CI build

Needs a Windows port of install_dependencies and an (AppVeyor?) CI build.

Dockerfile

I'm having trouble building; it would be great if you could provide a Dockerfile.
Thank you.

Cannot compile

Cannot compile the sample; error message:

error C2580: 'vuh::Delayed<vuh::detail::Noop>::Delayed(vuh::Delayed<vuh::detail::Noop> &&)': multiple versions of a defaulted special member functions are not allowed

assert that `fromHost` does not copy non-allocated memory

Hi! First, thanks for vuh, it makes lots of stuff much easier. :]

I noticed that when copying into host-inaccessible memory through cached memory, fromHost does not check that the iterator range used to construct the buffer (i.e. the distance between begin and end) is actually equal to the array size that gets transferred (which is size_bytes()).

The problem is around here, possibly also in other places:
https://github.com/Glavnokoman/vuh/blob/master/src/include/vuh/arr/deviceArray.hpp#L113

This causes really nasty failures (e.g. literal reboots with certain Linux drivers) and is IMO particularly surprising: if given a smaller range, I'd expect that just the starting piece of memory is filled, as with std::copy.

I can make a PR to "fix" this, just wondering which fix would be best:

  • document this properly and leave the handling to the users?
  • assert() that iterator range matches the array size?
  • behave "as reasonably as possible" and just copy min(end-begin, size()) elements to avoid problems? (I guess this behavior is the least surprising thus most reasonable, asking mainly because I'm not sure whether it would break any other expectations.)

Thanks for suggestions!
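Of the options above, the clamped-copy behavior mirrors std::copy_n semantics. A minimal host-side sketch (an illustrative free function, not vuh's actual fromHost; a std::vector stands in for the mapped device memory):

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Option 3 from the list above: copy at most `deviceShadow.size()`
// elements, mirroring std::copy semantics instead of overrunning
// the allocation. Returns the number of elements transferred.
template <class It>
std::size_t copyClamped(It begin, It end, std::vector<float>& deviceShadow) {
   const auto n = std::min<std::size_t>(
         static_cast<std::size_t>(std::distance(begin, end)),
         deviceShadow.size());
   std::copy_n(begin, n, deviceShadow.begin());
   return n;
}
```

Returning the transferred count lets callers who care about the mismatch still detect it, which softens the "breaks other expectations" concern.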

remove array parameters specification in Kernel

There seem to be just too many options for different bindable objects (SBOs, images, textures, samplers, and uniform/dynamic-uniform variants of those) to be easily expressed by template parameters. It may be easier (or may not) to just deduce those objects at the binding point.
Specialization constants are better specified upfront, to ensure exact types and to make it easier to drop them into the same bind() function with other parameters.
The push-constants structure must be specified somewhere anyway.

Support for textures

Are there any plans to support texture buffers?

Where would be a good place to start to add texture buffers?

Command Buffers?

I suggest adding a command buffer when executing programs...

   auto queue = device.queues().at(0);
   auto cmd = device.command(); // something alike this
   auto program = vuh::Program<Specs, Params>(device, "saxpy.spv"); // load shader
   program.grid(128/64).spec(64)(cmd, {128, 0.1}, d_y, d_x); // write to command
   queue.submit(cmd); // submit command 

VK_RESULT_END_RANGE was removed from vulkan in 1.2.140

Apparently VK_RESULT_END_RANGE no longer exists, so vuh2 doesn't compile since Vulkan 1.2.140. Therefore, the enum VuhError in src/include/vuh/error_code.hpp should use another start value like 10, 11, 100, or 101. Alternatively, it could start from 0, since it looks like the code doesn't depend on VuhError and VkResult not overlapping. How should it be changed?
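One way to keep vuh's error codes clear of VkResult values without depending on removed symbols like VK_RESULT_END_RANGE is to start the enum at an explicit sentinel and register it as a std::error_code category. A sketch (enumerator names and the starting value 100 are illustrative, not the actual contents of error_code.hpp):

```cpp
#include <string>
#include <system_error>

// Illustrative replacement for the VuhError enum: start at an explicit
// value instead of deriving the base from VK_RESULT_END_RANGE.
enum class VuhError { NoDevice = 100, FileReadFailure = 101 };

struct VuhErrorCategory : std::error_category {
   const char* name() const noexcept override { return "vuh"; }
   std::string message(int ev) const override {
      switch (static_cast<VuhError>(ev)) {
         case VuhError::NoDevice:        return "no suitable device";
         case VuhError::FileReadFailure: return "failed to read file";
         default:                        return "unknown vuh error";
      }
   }
};

inline std::error_code make_error_code(VuhError e) {
   static VuhErrorCategory category;  // one category instance for the process
   return {static_cast<int>(e), category};
}
```

Since the codes live in their own error_category, they can never be confused with VkResult values even if the numeric ranges were to overlap, which makes the "start from 0" option safe as well.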

Push constants vs Uniform buffers

Hello, first of all I would like to thank you for this amazing library. It is literally a life saver for me. I am a newbie to vulkan but I have done a lot of work with compute shaders in OpenGL ES on Android and the limitations were just getting in the way too much. I needed to convert over to Vulkan but it is very verbose. This library is exactly what I would need to implement myself for my project to work.

One thing I am curious about though is the question of updating push constants for every frame. I am sending some depth/image data to my shaders via buffers, and also sending a 4x4 matrix. Push constants seem to work just fine for this, but I am wondering about performance, even if it may be a micro optimization.

I am no Vulkan expert, but what I have read indicates that sending push constants involves re-building the command buffer, while with a uniform buffer you only need to update the memory and can re-use the same command buffer. It is unclear which method is more efficient when updating every frame. So far all your examples show push constants being updated on every run.

It certainly isn't slow, but I am just curious if there is anything that can be optimized here if possible, or if this is already the ideal method of updating every frame.

Thanks again! This library is perfect, it is a shame it is not more popular and under heavy development. For me it is the perfect GPGPU implementation, I love it.

Linking problem on MacOS

I am trying to use this tool and run the code provided in the Tutorial, but I keep getting this:

  "vuh::Device::createPipeline(vk::PipelineLayout, vk::PipelineCache, vk::PipelineShaderStageCreateInfo const&, vk::Flags<vk::PipelineCreateFlagBits>)", referenced from:
      vuh::detail::SpecsBase<vuh::typelist<unsigned int> >::init_pipeline() in main.cpp.o
  "vuh::Device::selectMemory(vk::Buffer, vk::Flags<vk::MemoryPropertyFlagBits>) const", referenced from:
      vuh::arr::AllocDevice<vuh::arr::properties::Device>::findMemory(vuh::Device const&, vk::Buffer, vk::Flags<vk::MemoryPropertyFlagBits>) in main.cpp.o
      vuh::arr::AllocDevice<vuh::arr::properties::Host>::findMemory(vuh::Device const&, vk::Buffer, vk::Flags<vk::MemoryPropertyFlagBits>) in main.cpp.o
      vuh::arr::AllocDevice<vuh::arr::properties::HostCoherent>::findMemory(vuh::Device const&, vk::Buffer, vk::Flags<vk::MemoryPropertyFlagBits>) in main.cpp.o
      vuh::arr::AllocDevice<vuh::arr::properties::HostCached>::findMemory(vuh::Device const&, vk::Buffer, vk::Flags<vk::MemoryPropertyFlagBits>) in main.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[3]: *** [vuh_example_saxpy] Error 1
make[2]: *** [CMakeFiles/vuh_example_saxpy.dir/all] Error 2
make[1]: *** [CMakeFiles/vuh_example_saxpy.dir/rule] Error 2
make: *** [vuh_example_saxpy] Error 2

My MacOS version is 10.15.2 (Catalina).
I followed the steps provided in build&install and it worked fine.
I am using CLion; it can find vuh/vuh.h, and it compiles the shader.

Can somebody tell me what I am doing wrong?
