
managedcuda's People

Contributors

cbovar, kunzmi, zhongkaifu


managedcuda's Issues

RGB To HLS - strange byte conversion

Hello
I'm struggling again, and cannot find an explanation for the following problem.

When using NPP and doing an RGB to HLS conversion, described on page 519 of
http://docs.nvidia.com/cuda/pdf/NPP_Library.pdf

Example: if I try to convert a bitmap filled with cyan pixels, where
(R:0 G:255 B:255) corresponds to (H:50% S:100% L:50%),
then reading the converted data I should get (H:127 L:127 S:255), since NPP scales to the 0-255 range.

Instead I get H:42 L:127 S:255, where 42 is roughly one third of 127.

Please have a look at my attached project.

I really cannot make sense of the output values.

Could you shed some light on this?

Thank you !

ManagedCudaConsoleTestNPP3.zip

Array Indexing error

Hello

For some days now I've been struggling with NPPImage_8uC1 pixel indexing.

I've uploaded the following project :

ManagedCudaConsoleTestNPP2.zip

It's a console application, with the following code in a class for testing:

using System;
using System.Drawing;
using System.Drawing.Imaging;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.IO;
using System.Reflection;

using ManagedCuda;
using ManagedCuda.NPP;
using ManagedCuda.NVRTC;
using ManagedCuda.VectorTypes;

namespace ManagedCudaConsoleTestNPP
{
class Program
{
static void Main(string[] args)
{
var ctx = new CudaContext(CudaContext.GetMaxGflopsDeviceId());

        string src =
        @"extern ""C"" {
            __global__ void TestNPPKernel(
                unsigned char* input,unsigned char *output,
                int width, int height,
                int colorWidthStep, int grayWidthStep)
            {
                int tx = blockIdx.x * blockDim.x + threadIdx.x;
                int ty = blockIdx.y * blockDim.y + threadIdx.y;

                if (ty < height && tx < width)
                {
                    int color_tid = ty * colorWidthStep + tx;

                    unsigned char blue = input[color_tid];
                    unsigned char green = input[color_tid + 1];
                    unsigned char red = input[color_tid + 2];

                    float gray = red * 0.298912f + green * 0.586611f + blue * 0.114478f;

                    int gray_tid = ty * grayWidthStep + tx;
                    output[gray_tid] = static_cast<unsigned char>(gray);
                }
            }
        }";


        //compile to ptx and prepare kernel
        var compiler = new CudaRuntimeCompiler(src, "TestNPPKernel.cu");
        compiler.Compile(null);
        byte[] ptx = compiler.GetPTX();

        var kernel = ctx.LoadKernelPTX(ptx, "TestNPPKernel");


        //initialize kernel and bitmaps
        kernel.GridDimensions = new dim3(128, 16, 1);
        kernel.BlockDimensions = new dim3(16, 8, 1);

        int nFrameWidth = 2048;
        int nFrameHeight = 128;

        var bmp_original = new NPPImage_8uC3(nFrameWidth, nFrameHeight);
        var bmp_gray = new NPPImage_8uC1(nFrameWidth, nFrameHeight);

        var bmpCol = new Bitmap(nFrameWidth, nFrameHeight, PixelFormat.Format24bppRgb);
        var bmpGray = new Bitmap(nFrameWidth, nFrameHeight, PixelFormat.Format8bppIndexed);


        //fill bitmap with red, green, blue, cyan, magenta, yellow pixels
        var colR = Color.FromArgb(255, 0, 0);
        var colG = Color.FromArgb(0, 255, 0);
        var colB = Color.FromArgb(0, 0, 255);
        var colC = Color.FromArgb(0, 255, 255);
        var colM = Color.FromArgb(255, 0, 255);
        var colY = Color.FromArgb(255, 255, 0);

        for (int x = 0; x < nFrameWidth / 6; x++)
        {
            for (int y = 0; y < nFrameHeight; y++)
            {
                bmpCol.SetPixel(x * 6 + 0, y, colR);
                bmpCol.SetPixel(x * 6 + 1, y, colG);
                bmpCol.SetPixel(x * 6 + 2, y, colB);
                bmpCol.SetPixel(x * 6 + 3, y, colC);
                bmpCol.SetPixel(x * 6 + 4, y, colM);
                bmpCol.SetPixel(x * 6 + 5, y, colY);
            }
        }

        bmpCol.Save("input.png");



        //copy to cuda device and launch kernel
        bmp_original.CopyToDevice(bmpCol);

        kernel.Run(
            bmp_original.DevicePointer, bmp_gray.DevicePointer,
            bmp_original.Width, bmp_original.Height,
            bmp_original.Width * 3, bmp_gray.Width);

        bmp_gray.CopyToHost(bmpGray);


        //correct palette to gray colors
        var cpGray = bmpGray.Palette;

        for (int i = 0; i < 256; i++)
            cpGray.Entries[i] = Color.FromArgb(i, i, i);

        bmpGray.Palette = cpGray;

        bmpGray.Save("result.png");
    }
}

}

This application compiles a kernel that should perform an RGB-to-gray conversion, fills a bitmap,
launches the kernel, and via NPP gets the output bitmap "bmpGray".

What is happening (of course I'm doing something wrong, but I've tried many different data types!) is that the output always looks misaligned.

I'm filling a NPPImage_8uC3, and I want to receive a NPPImage_8uC1

I'm using arrays made of "unsigned char"

Maybe this is a wrong choice, I've tried many others, but without luck.

What I need to achieve is to be able to use NPP primitive functions and my own ".cu" kernel sources at the same time.

Could you please help? Could you suggest the proper parameter types to use, or provide a working example that starts from a 24bpp image and returns an 8bpp (gray) image, using NPP (not the ColorToGray function) together with a kernel launch?

I would really like to understand what I'm doing wrong.

Thank you for any help.

Best Regards

CudaTextureDescriptor is missing fields

The fields sRGB and normalizedCoords from the CUDA 7.5 cudaTextureDesc struct are missing from CudaTextureDescriptor and are apparently combined into readMode using a composite CUTexRefSetFlags. Unsurprisingly, the CudaTexObject constructor fails.

Conversion from "long"

Hello Michael
I'm having quite some headaches with "long" data types...

I have an application that compiles and runs 64-bit kernels; inside a kernel I have a function like this:

__device__ __forceinline__ long shorts_to_int64(short u1, short u2, short u3, short u4) {
	long check = (short)u1 | (short)(u2 << 16) | (short)(u3 << 32) | (short)(u4 << 48);
	
	return check;
}

The purpose was to avoid using too much memory by placing a 0..65535 number inside an array of "long" values.
In other kernels this shift-pack-unpack works correctly, and the array is initialized like this:

CudaDeviceVariable<long> dev_longvalues = new CudaDeviceVariable<long>(nMax);

and dev_longvalues is passed to the kernel like this:

__global__ void myKernelFunc(long* arLongValues)

So on the CUDA device, the shorts_to_int64 function is used to fill arLongValues.
While on the device, the array looks correct.

When I copy it back to the host, I have the problem:

long[] hostLongValues = dev_longvalues;

At this point, it looks like the contents of the C# long[] array are completely
different from the C++/CUDA long data type; it's as if 128 bits were placed inside each 64-bit slot.

Looking at VectorTypes.cs I read this:

/// long1. long stands here for the long .NET type, i.e. long long or a 64bit long in C++/CUDA 

I've tried to use the long1 type, but with no better result.

In C#, I've tried to write something like this (similar to the CUDA counterpart):

long idxValue = hostLongValues[i];
var lShortValues = LongToShortArray(idxValue).Select(i => (int)i);

    private short[] LongToShortArray(long input)
    {		
        return new[]
        {
            (short)(input & 0x0000ffff),
            (short)((input >> 16) & 0x0000ffff),
            (short)((input >> 32) & 0x0000ffff),
            (short)((input >> 48) & 0x0000ffff),
            (short)((input >> 64) & 0x0000ffff),
            (short)((input >> 80) & 0x0000ffff),
            (short)((input >> 96) & 0x0000ffff),
            (short)((input >> 112) & 0x0000ffff)
        };
    }

or even (fighting with casts....)

    private short[] LongToShortArray(long input)
    {
        return new[]
        {
            (short)((short)input & (long)0x0000ffff),
            (short)((short)(input >> 16) & (long)0x0000ffff),
            (short)((short)(input >> 32) & (long)0x0000ffff),
            (short)((short)(input >> 48) & (long)0x0000ffff),
            (short)((short)(input >> 64) & (long)0x0000ffff),
            (short)((short)(input >> 80) & (long)0x0000ffff),
            (short)((short)(input >> 96) & (long)0x0000ffff),
            (short)((short)(input >> 112) & (long)0x0000ffff)
        };
    }

but had no luck.

So I'm asking: what's the best practice to "download" a C++/CUDA long from the device
to the equivalent C# type? I've compiled the C# application as x64; nothing changed.

Can you give me some advice ?

unified memory implementation

I'm curious as to why you chose to implement the CudaManagedMemory as individual classes instead of generics like you did with the CudaDeviceVariables. I briefly perused the code and didn't see anything that would prohibit this. Unless I'm missing something, it should clean up the code substantially. Would you mind shedding some light on this?

Size of Boolean used to copy from host to device is 4 bytes.

I randomly got this error while copying an array of bool from host to device.

Unhandled Exception: System.AccessViolationException: Attempted to read or write protected memory. This is often an indication that other memory is corrupt.
at ManagedCuda.DriverAPINativeMethods.SynchronousMemcpy_v2.cuMemcpyHtoD_v2(CUdeviceptr dstDevice, IntPtr srcHost, SizeT ByteCount)
at ManagedCuda.CudaContext.CopyToDevice[T](CUdeviceptr aDest, T[] aSource) in i:\ManagedCuda\managedCuda\ManagedCUDA\CudaContext.cs:line 1521

I have found that if I just change from CudaDeviceVariable<bool> to CudaDeviceVariable<byte>, this error will not occur anymore.

This is because CudaDeviceVariable<bool> has a TypeSize of 4, so it copies more than it should and may attempt to read protected memory on the host and write to protected memory on the device.
CudaDeviceVariable<byte>, however, has a TypeSize of 1, so it copies the bool array at the correct length.

ManagedCuda potentially passes parameters incorrectly

At first I thought I messed up something with context creation, but after I simply changed the call to geam from passing alpha and beta from host to the version that uses device pointers, it started throwing the AccessViolation exception. I have a minimal example here. I can't get geam (or gemm) to work at all. In the example provided it only adds up the first element of each matrix.

Moving towards Net Standard

Is it possible to have ManagedCUDA target .NET Standard, or to make some changes in the source so I can create a private build of ManagedCUDA that targets .NET Standard? I need a .NET Standard library because my code is in .NET Standard, and .NET Framework libraries cannot link with .NET Standard libs and apps.

I made some changes in a forked repository as a prototype ( https://github.com/kaby76/managedCuda ). (1) There's a dependency on serialization that I commented out, but it should be ifdef'ed. Unfortunately, Net Standard does not support binary serialization. (2) Array.LongLength is not supported in Net Standard either. I wrote an extension method that appears to work. I tested the changes and they work for me.

Support for cudaMemcpy2DToArrayAsync

Hi,

I am using multiple streams w/ 2D textures in my application, and would like to asynchronously copy from a CudaPitchedDeviceVariable to a CudaTextureArray2D/CudaArray2D. Now, I see from the NVIDIA documentation here that there is a method called cudaMemcpy2DToArrayAsync, which might suit my purposes. However, when I looked through the API for CudaArray2D, I was only able to find synchronous copy methods (I checked the source). Is there any way to asynchronously copy to a CudaArray2D using a stream, using the current ManagedCuda API?

If not, is that a feature we can add? I'd be happy to help and make a pull request if someone would be able to help point me in the right direction!

Thanks,
Dennis

thrust and managedcuda

Hello Michael

I see that this issue is closed :
#10

but I try anyway...

Has anything changed regarding Thrust since Jan 2016, with CUDA 8.0?

It would be great to be able to run kernels with Thrust included directly from ManagedCuda.

Any hope on this side ?
If so, do you have any working sample of ManagedCuda with thrust ?

I'm trying to run this particular kernel :

https://github.com/thrust/thrust/blob/master/examples/bounding_box.cu

without any luck

Best Regards

Possible Inconsistency in ActivationForward methods

Hi,
Thanks for your work!

double version of ActivationForward is:
public void ActivationForward(cudnnActivationDescriptor activationDesc, [...]

whereas float version is:
public void ActivationForward(ActivationDescriptor activationDesc, [...]

Shouldn't the double version accept an ActivationDescriptor rather than a cudnnActivationDescriptor?

nuget is not working for me

I have tried both the 7.5 standalone and x64 versions and I get the error shown in the attached screenshot.

It leaves a packages.config file, but VS2013 never tries to get the package when I build.

Not being an expert with the inner workings of VS and its interaction with nuget I am not sure what is going on. In the NuGet Cache I do see both packages.

Thanks!
Douglas R Jones

Cuda Device Memory Leak

This sample creates a memory leak on the GPU even when disposing the object:

var d_zeros = new CudaDeviceVariable<float>(1000);
float[] h_zeros = new float[1000];
d_zeros = h_zeros;              // will create a memory leak
d_zeros.CopyToDevice(h_zeros);  // will NOT create a memory leak

d_zeros.Dispose();

Requesting feedback on the pull request

Hello, it has been a month since I made the pull request and I have still not received any feedback from you. Have you been too busy to respond, or did the pull request slip through the cracks? I need to know. Thanks.

The description for amax is wrong.

Inside the function description it says that it uses C (zero-based) indexing, but in fact it uses Fortran indexing. This matches the description in the cuBLAS programming guide, which says that it uses Fortran indexing. amin is similarly broken. I would recommend just fixing the description. The NPP Min and Max functions use C-based indexing, in contrast to the cuBLAS ones.

NVML: unnecessary reference to ManagedCuda.csproj

I noticed (when trying to build NVML) that it includes a reference to ManagedCuda. However, this is only used to satisfy some unused using directives. If you remove the additional directives and the reference, it builds fine. Would be happy to submit as a PR, but wanted to see if it was intentional first.

NVRTC - architecture

Hello Michael !

I'm using your NVRTC module, and I cannot find a way to set the "code generation" parameters (e.g. "compute_20,sm_20" or "compute_50,sm_50") when compiling at runtime.

Is it possible? Will this setting make any difference to the results?

Thank you !

Reduction

This is awesome stuff! Hey, is there a reduction function somewhere? I see a reduction in the NPP stuff, but is there a plain old reduction?

The ger method has caused me some trouble

    public void Ger(float alpha, CudaDeviceVariable<float> x, int incx, CudaDeviceVariable<float> y, int incy, CudaDeviceVariable<float> A, int lda)
    {
        _status = CudaBlasNativeMethods.cublasSger_v2(_blasHandle, x.Size, y.Size, ref alpha, x.DevicePointer, incx, y.DevicePointer, incy, A.DevicePointer, lda);
        Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cublasSger_v2", _status));
        if (_status != CublasStatus.Success) throw new CudaBlasException(_status);
    }

More specifically, the x.Size, y.Size part. It seems quite innocuous, but in my ML library I would often reuse memory by pretending the CudaDeviceVariable is smaller than it actually is. So when I pass it into ger, the function would get the actual size directly via the pointer and throw an invalid value error.

I'd suggest the API be changed to something like:

    public void Ger(int m, int n, float alpha, CudaDeviceVariable<float> x, int incx, CudaDeviceVariable<float> y, int incy, CudaDeviceVariable<float> A, int lda)
    {
        _status = CudaBlasNativeMethods.cublasSger_v2(_blasHandle, m, n, ref alpha, x.DevicePointer, incx, y.DevicePointer, incy, A.DevicePointer, lda);
        Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cublasSger_v2", _status));
        if (_status != CublasStatus.Success) throw new CudaBlasException(_status);
    }

I think I recall the library doing something similar in a different method, which caused me trouble in the past, but I can't quite recall which at the moment. Rather than break the API to fix this, it might be better to add extra overloads; I could make them myself if you want.

Exception when using NPP example

When trying to use the NPP example on the main readme, I keep getting a:
An unhandled exception of type 'System.DllNotFoundException' occurred in NPP.dll

Additional information: Unable to load DLL 'npps64_75': The specified module could not be found. (Exception from HRESULT: 0x8007007E)

Is this a dll that wasn't included in either the library and/or in the NuGet?

Generic interop functions exceed array bounds for some value types due to managed/unmanaged size mismatches

public void CopyToHost(float[] aDest, CUdeviceptr aSource)
{
	if (disposed) throw new ObjectDisposedException(this.ToString());
	CUResult res;
	res = DriverAPINativeMethods.SynchronousMemcpy_v2.cuMemcpyDtoH_v2(aDest, aSource, aDest.LongLength * sizeof(float));
	Debug.WriteLine(String.Format("{0:G}, {1}: {2}", DateTime.Now, "cuMemcpyDtoH_v2", res));
	if (res != CUResult.Success)
		throw new CudaException(res);
}

The generic one does, but the autogenerated ones like the above don't. This actually gave me quite a lot of trouble; please fix it as soon as possible.

Also, if you've got the time, check out the Spiral language. At this point I am a long-time user of ManagedCuda and I am using it for all the CUDA interop, so this is as good a time as any to say thanks. The decision of whether to compile to .NET would have been a lot harder for me without it.

CudaDNN not a part of the solution/NuGet package

Hello,

CudaDNN is not a part of the ManagedCuda solution and consequently not a part of the NuGet package. Do you think it would be possible to add CudaDNN to the solution or is it excluded on purpose? :)

Thank you very much!

Getting a Device Allocation Error

Michael,
What would cause such an error? It happens in my dynamic JPEG version of your NPPJpeg sample. I guess I am not getting a device pointer. So far it does not seem to happen in the static version of the library.

ManagedCuda.NPP.NPPException was caught
HResult=-2146233088
Message=Device allocation error
Source=NPP
StackTrace:
at ManagedCuda.NPP.NPPImage_8uC3..ctor(Int32 nWidthPixels, Int32 nHeightPixels) in d:\Develop\Github\managedCuda\NPP\NPPImage_8uC3.cs:line 59
at NppImageUtils.JpegDecoder.JpegDecoder.Decoder(Byte[] pJpegData) in d:\AdVISE_Projects\NdtFileConverterEmgu\NPPImageUtils\JpegDecoder\JpegDecoder.cs:line 476
InnerException:

NVML: nvmlDeviceGetName doesn't seem to work

The actual device name wouldn't come back. I'm not an expert in C/C++ bindings, so I don't know what the right attributes etc. should be, but it works OK if you implement it as a byte* (and presumably as byte[], untested). I ended up hacking it to byte* and using it this way, which worked for me:

            const int MAX_NAME_LEN = 64;
            byte* buffer = stackalloc byte[MAX_NAME_LEN];
            if(NvmlNativeMethods.nvmlDeviceGetName(device, buffer, MAX_NAME_LEN) == nvmlReturn.Success)
            {
                int len = 0; // find the nul terminator
                for(int i = 0; i < MAX_NAME_LEN;i++)
                {
                    if(buffer[i] == 0)
                    {
                        len = i;
                        break;
                    }
                }
                if(len != 0)
                {  // interpret as ASCII directly into char - small length, so fine
                    char* ascii = stackalloc char[len];
                    for (int i = 0; i < len; i++)
                        ascii[i] = (char)buffer[i];
                    name = new string(ascii, 0, len);
                }
            }

(as a side note, there's a nvmlConstants::NVML_DEVICE_NAME_BUFFER_SIZE (64) that it may be beneficial to expose in the API)

What would the CUDA callback example look like in C#/F#?

I am doing some concurrent CUDA programming for the first time and need to use callbacks just to get an idea of the order in which the kernels are being executed. What would the example here look like in a .NET language?

I have absolutely no idea how to call a nativeptr from F#. I know about P/Invoke, but those are static method calls to unmanaged .dlls.

Thanks.

Edit: So here is how to call an unmanaged pointer. I still lack a few pieces of the puzzle.

Help File

Did not see a help file, so I put this one up for anyone interested. It can be downloaded from here

BLAS - SPARSE Memory Problem.

Dear Michael,
I'm seeing strange behavior with CUDA on my NVIDIA GTX 650 Ti (1 GB memory).
I'm launching my CUDA project under the Rhinoceros software (as a plugin).
When idle, Rhino takes roughly 260 MB of GPU memory.
(I'm measuring memory with the TechPowerUp GPU 0.8.7 app; I know it's not the most precise, but it's enough.)

Creating my context, BLAS, Sparse and kernels looks like this in VS (screenshots omitted).
Then, using breakpoints, I recorded the memory usage, and it's huge and strange:
the context takes 22 MB, BLAS 180 MB, and Sparse 347 MB!

Creating these three occupies the whole GPU memory, and creating them takes a few seconds.
What is more, at university I have a GTX 970, and there these contexts never occupied so much memory or took so long to create. The whole calculation on the GTX 970 takes 100 MB (with my data copied to the GPU).
On the 650 I cannot even copy the data; I get an OutOfMemory error.

What is more, the code has Dispose calls. On the 970 they work perfectly, on the 650 they don't.
It's the same code! (I'm using a Team Foundation Git server.)
Do you have any idea why BLAS and Sparse take so much memory? What am I doing wrong?

Best regards,
Sebastian

AccessViolationException when copying to device

Hi

I have a rather strange issue: I am unable to copy a boolean array to the device (or read from it). This is only reproducible for boolean types so far. I am able to copy smaller bool arrays just fine, though.

The code:

    static void Main(string[] args)
    {
        Console.WriteLine("start");

        CudaContext cntxt = new CudaContext();
        CUmodule cumodule = cntxt.LoadModule(@"kernel.ptx");
        CudaKernel myKernel = new CudaKernel("kernel", cumodule, cntxt);

        bool[] bools = new bool[20000];
        var dev = new CudaDeviceVariable<bool>(bools.Length);

        // System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'
        dev.CopyToDevice(bools); // here

        Console.WriteLine("end");
        Console.ReadKey();
    }

Exception:
System.AccessViolationException: 'Attempted to read or write protected memory. This is often an indication that other memory is corrupt.'

Stack trace:
StackTrace " at ManagedCuda.DriverAPINativeMethods.SynchronousMemcpy_v2.cuMemcpyHtoD_v2(CUdeviceptr dstDevice, IntPtr srcHost, SizeT ByteCount)\r\n
at ManagedCuda.CudaDeviceVariable`1.CopyToDevice(T[] source) in i:\ManagedCuda\managedCuda\ManagedCUDA\CudaDeviceVariable.cs:line 327\r\n
at Bug_managedCuda_c_001.Program.Main(String[] args) in c:\users\lucky\source\repos\Bug_managedCuda_001\Bug_managedCuda_c_001\Program.cs:line 25" string

I initially stumbled upon this issue on the trunk version of the code because I wanted to use CUDA 9. However, the exact same issue is also reproducible with the "_80" version from NuGet.
For the above test I used the "_80" version with a ptx file generated in a CUDA 9 project. I don't think the ptx matters at this point, since I never get to the actual execution of the code on the GPU; I am just trying to copy data to the device.

I am using a GeForce 1080 Ti graphics card with 11 GB of memory. Copying a boolean array of such a small size (20000) should not be an issue.
I'm using a .NET Framework 4.6/4.7.1 console app, compiled as "Any CPU" and as "x64", with "Prefer 32-bit" unchecked.

If any additional info is needed please let me know.

Memory alignment of custom structs

I was wondering if there is a generally accepted solution for creating a memory layout that works in both C++ CUDA and C#/ManagedCuda. I'm having trouble with the alignment of structures that contain float4 values. Since float4 is aligned to 16 bytes in CUDA/C++, such structures usually get padding when they have members smaller than 16 bytes. This padding is not added when marshalled from C#, so the sizes of the structs do not match.

You can get more information from this StackOverflow description.

I also made a small project that shows the problem:
https://github.com/RobbinMarcus/ManagedCudaCustomStruct

Compiling for x86 in Windows 7

My application needs to be built for x86 because the code/driver for a hardware device in my system requires it. I installed the ManagedCuda x86-75 NuGet package into my application. However, on the first call to the NPP library (creating an NPPImage_8uC3 object) the system throws an exception saying

{"Unable to load DLL 'nppi32_75': The specified module could not be found. (Exception from HRESULT: 0x8007007E)"}

None of the modules that came down have "32_75" in the file name, nor was there an nppi file.

What did I fail to do?

Thanks,
Doug

Cross-platform .NET Core support

Great library! I'm pretty sure the answer is "no", but is it possible for this to be a cross-platform .NET Core library? I guess it would look for DLL files on Windows, SO files on Linux, etc?

GetPooling2dForwardOutputDim does not follow the correct formula

let inline divup a b = (a-1)/b+1 // Integer division with rounding up. (a+b-1)/b is another variant on this.
// find dimension of convolution output
// outputDim = 1 + (inputDim + 2*pad - filterDim) / convolutionStride
let output_matrix_n = input_sample.num_feature_maps
let output_matrix_c = input_sample.num_channels
let output_matrix_h = 1 + divup (input_sample.num_rows + 2*verticalPadding - windowHeight) verticalStride
let output_matrix_w = 1 + divup (input_sample.num_cols + 2*horizontalPadding - windowWidth) horizontalStride

The correct formula is the above. Also, I've concluded that SetPooling2dDescriptor works properly, so the error can only be in GetPooling2dForwardOutputDim. Very strange. I'll continue investigating.

Edit: Edited the formula to emulate the ceiling function using divup.

Edit2: I am not sure that this is a bug with ManagedCuda. I would guess that it is more likely to be a bug in the cuDNN library.

Edit3: For an example, I opened this thread a while ago though it does not look likely that I will get an answer. The best solution for this would be to move to cuDNN v4.

GetConvolutionBackwardFilterAlgorithm & friends not showing up on Intellisense

At first I thought the GetConvolution* class methods weren't in source, but after checking I see that they are there. They show up in NativeMethods, but not in the context class. I am using the context class from F#. I'll give it a shot from C# and try disassembling the .dll if that fails.

Edit: Nope, I do not see it from C# either. I'll try dnSpy.

Wrap cuDeviceGetCount()

cuDeviceGetCount() is rather important for creating contexts when you want to support multiple devices. You have no idea how many GPUs your user has plugged in! Using the method from DriverAPINativeMethods is of course clunky, since you're going to then have to deal with cuInit() most of the time.

Is there any chance of adding a proper DeviceGetCount() function to the managed API?

From:

 static member Contexts =
    let devCount = ref 0
    DriverAPINativeMethods.cuInit(CUInitializationFlags.None) |> ignore
    DriverAPINativeMethods.DeviceManagement.cuDeviceGetCount(devCount) |> ignore
    //Error handling??!
    [|
        for n in [0..!devCount-1] do
            yield new CudaContext (n)
    |]

To:

static member Contexts =
    [|
        for n in [0..DeviceGetCount()-1] do
            yield new CudaContext (n)
    |]

Is it possible to use Dynamic Parallelism code with ManagedCuda?

Hi,

Thanks for the great work on this project. I was wondering if anyone has gotten a kernel that uses dynamic parallelism working with ManagedCuda (see: https://devblogs.nvidia.com/parallelforall/cuda-dynamic-parallelism-api-principles/). To generate a ptx file for the kernel file it has to be compiled with -arch=sm_35 (or greater) and -rdc=true (relocatable device code) settings.

The example I've been trying to get working is the advanced quick sort provided by NVidia which can be found in their sample directory CUDA Samples\v8.0\6_Advanced\cdpAdvancedQuicksort. When I generate a PTX file off a cut down version of the sample (just with the device functions) it complains that the file is invalid with the error: ErrorInvalidPtx: This indicates that a PTX JIT compilation failed.

Has anyone gotten an example with dynamic parallelism working with ManagedCuda, or is that simply not possible? The Stack Overflow answer at https://stackoverflow.com/questions/26147981/nvlink-relocatable-device-code-and-static-device-libraries makes it sound like this may only be achievable via a P/Invoke call to a DLL with exported functions (inferred from what has been said). However, maybe the problem is in the way I'm compiling it when creating the ptx file (though this post suggests it may not be: https://devtalk.nvidia.com/default/topic/668017/dynamic-parallelism-with-cuda-driver-api/)?

Any help or guidance would be greatly appreciated.

CudaSurface always disposes underlying volume

The CudaSurface class disposes of the CudaArray3D even when an array is passed to the constructor. There should be a concept of ownership, much like you implemented for the CudaArray3D class, as there are instances when you don't want to dispose of the underlying volume when you no longer need the surface.

P.S. I digress, but please take a look at the suggestion I made in my "unified memory" post for how to make those classes generic without sacrificing performance.

Setup new project issue

Dear Michael,

I'm a beginner in C#/CUDA projects. I was very happy when I found your tutorial "Setup-a-managedCuda-project". But when I changed my CUDARuntime project Configuration Type from DLL to Utility, all my #include directives stopped working, and all header files disappeared from "External Dependencies". So all CUDA symbols such as __global__ are undefined...
I'm using VS Pro 2013.
Could you help me? I would be grateful.

Driver version?

Hi,
I am using CudaContext.GetDeviceInfo() to get the driver info among other things, but for 'DriverVersion' I get 9.0, not 22.21.13.8541 nor 385.41. Am I looking in the wrong place?
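
For what it's worth, the CUDA driver API only reports the API version (cuDriverGetVersion returns e.g. 9000, shown as 9.0); the "385.41"-style display driver version is exposed through NVML instead. A hedged sketch, assuming the ManagedCuda.Nvml wrapper follows the native nvmlSystemGetDriverVersion signature (buffer plus length); verify the exact marshalling against your version of the wrapper:

```csharp
// Sketch: the display driver version (e.g. "385.41") comes from NVML,
// not from the CUDA driver API. Wrapper signatures assumed, not verified.
using System.Text;
using ManagedCuda.Nvml;

class DriverVersionDemo
{
    static void Main()
    {
        NvmlNativeMethods.nvmlInit();
        var version = new StringBuilder(80);
        NvmlNativeMethods.nvmlSystemGetDriverVersion(version, 80);
        System.Console.WriteLine(version);
        NvmlNativeMethods.nvmlShutdown();
    }
}
```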

Support jagged array?

thanks ManagedCUDA Opensource.

I have a question: does ManagedCUDA support C# jagged arrays?

thank you

craig tao
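
Device memory is linear and jagged arrays are not blittable, so the usual approach is to flatten the jagged array into one 1D buffer plus a row-offset table before copying. A sketch in plain C# (the commented CudaDeviceVariable lines at the end assume the usual ManagedCuda host-array constructor):

```csharp
// Sketch: flatten int[][] into one linear buffer plus row offsets,
// so it can be copied to the GPU as two ordinary 1D arrays.
int[][] jagged = { new[] { 1, 2 }, new[] { 3, 4, 5 } };

int total = 0;
int[] offsets = new int[jagged.Length];   // start index of each row
for (int i = 0; i < jagged.Length; i++)
{
    offsets[i] = total;
    total += jagged[i].Length;
}

int[] flat = new int[total];
for (int i = 0; i < jagged.Length; i++)
    jagged[i].CopyTo(flat, offsets[i]);   // flat = { 1, 2, 3, 4, 5 }

// var d_flat = new CudaDeviceVariable<int>(flat.Length);    // data
// var d_offsets = new CudaDeviceVariable<int>(offsets.Length); // row starts
// d_flat.CopyToDevice(flat); d_offsets.CopyToDevice(offsets);
```

The kernel then indexes row i, column j as flat[offsets[i] + j].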

Do you have Examples?

Hello.
Do you have examples solutions?
I tried using your library, but I got a runtime exception.

DriverAPINativeMethods.cuInit(ManagedCuda.BasicTypes.CUInitializationFlags.None);
CudaDeviceVariable windowSize = new CudaDeviceVariable(5);

ErrorInvalidContext: This most frequently indicates that there is no context bound to the current thread. This can also be returned if the context passed to an API call is not a valid handle (such as a context that has had cuCtxDestroy() invoked on it). This can also be returned if a user mixes different API versions (i.e. 3010 context with 3020 API calls). See cuCtxGetApiVersion() for more details.

What should I do?
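
The error usually means no context is bound to the calling thread: cuInit alone does not create one. Constructing a CudaContext first (which also initializes the driver API) avoids ErrorInvalidContext. A sketch; note the generic type parameter appears to have been stripped from the quoted snippet by the issue formatting, so float is used here purely for illustration:

```csharp
// Sketch: create a CudaContext before allocating device memory.
using ManagedCuda;

class Example
{
    static void Main()
    {
        using (var ctx = new CudaContext(0)) // binds a context to this thread
        {
            var windowSize = new CudaDeviceVariable<float>(5);
            windowSize.Dispose();
        } // disposing the context at the end frees remaining resources
    }
}
```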

BadImageFormatException when trying to use CuBLASS/CuDNN in custom project

Hello,

I am getting this exception, when I try to use these libraries in a custom project.

An unhandled exception of type 'System.BadImageFormatException' occurred in mscorlib.dll

Additional information: Could not load file or assembly 'CudaBlas, Version=8.0.13.0, Culture=neutral, PublicKeyToken=539d54d12e99fedb' or one of its dependencies. An attempt was made to load a program with an incorrect format.

May I request compiled(and signed) library files from you?

Sorry to contact you here, but I can't find any better way. You can contact me at [email protected].

Thanks in advance,
Stanislav
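
A BadImageFormatException with "incorrect format" is typically a 32/64-bit mismatch: a 32-bit process loading 64-bit managed or native CUDA DLLs, or vice versa. A small sketch for checking which side your process is on (pure C#, no ManagedCuda dependency):

```csharp
// Sketch: report process bitness, the usual culprit behind
// BadImageFormatException when loading CudaBlas/CudaDNN wrappers.
using System;

class BitnessCheck
{
    static void Main()
    {
        Console.WriteLine(Environment.Is64BitProcess
            ? "64-bit process: needs 64-bit wrapper and CUDA DLLs"
            : "32-bit process: needs 32-bit wrapper and CUDA DLLs");
        // In Visual Studio, also uncheck "Prefer 32-bit" or set the
        // Platform target to x64 so the managed wrapper matches the
        // native cublas64_*/cudnn64_* DLLs.
    }
}
```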

PrimaryContext: implicitly invokes the constructor of CudaContext

Hi,
Thanks for your work!

I was having some problems when using the PrimaryContext class.
From the timeline produced by Visual Profiler, it seemed that each time I referenced a primary context (created a PrimaryContext instance), a new context was created, which is not how it's supposed to work. From what I understand, it should create a context only when there is none on the device, and reference the existing context if there is one.

What I found after examining the code is that it seems currently the PrimaryContext class inherits from the CudaContext class. When a PrimaryContext is created, its base class constructor is invoked, which is public CudaContext(): this(0, CUCtxFlags.SchedAuto, true) {...} , which is public CudaContext(int deviceId, CUCtxFlags flags, bool createNew) {...} ,and a new context is then created.

I've fixed the problem in my local repo. However, the fix is rather ugly. To keep the inheritance, the base class constructor must do nothing, as all the logic needed is defined in the constructor of PrimaryContext. But the base class (CudaContext) constructor is heavily overloaded, so I had to create a constructor with dummy parameters in the CudaContext class, which is definitely not what a good fix should do.

To fix the issue, I think some changes to the design need to be made. I think the author needs to know about this problem, although the related functionality may not be commonly used.

It would be really great if you could fix the issue. But again, thanks for your work. It really makes my work a lot easier.
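
The constructor-chaining pitfall described above can be sketched in plain C# (names are illustrative, not the actual ManagedCuda source): a derived-class constructor always runs some base constructor, so if the base default constructor creates a new CUDA context, every PrimaryContext implicitly creates one too. The usual escape hatch is a protected do-nothing base constructor:

```csharp
// Sketch of the pitfall and the dummy-parameter workaround.
class BaseContext
{
    public BaseContext()               // imagine this calls cuCtxCreate
    {
        System.Console.WriteLine("new context created");
    }

    protected BaseContext(bool noInit) // do-nothing ctor for subclasses
    {
    }
}

class PrimaryContextSketch : BaseContext
{
    public PrimaryContextSketch()
        : base(noInit: true)           // skip context creation; the real
    {                                  // class would retain the primary
        System.Console.WriteLine("primary context retained");
    }
}
```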

GPU debugging in managedCuda

Hello Michael, I have recently started learning CUDA programming and I am interested in using it within the .NET framework. I know that NVIDIA Nsight provides tools for debugging and profiling GPU code.
Do you have suggestions on how to do that with managedCUDA?
Thanks for your help!

Allocating a CudaPitchedDeviceVariable using a width and height throws an ErrorInvalidValue exception.

Hello,

When I try to execute the following code, I get an ErrorInvalidValue exception.

CudaContext cudaContext = new CudaContext(CudaContext.GetMaxGflopsDeviceId());
CudaPitchedDeviceVariable<ushort> var = new CudaPitchedDeviceVariable<ushort>(
    2048, 160);

I have tried with multiple parameters, such as (16, 16) for my width and height. It appears to work with float but not with ushort. Any idea why it doesn't support ushort?
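
A likely explanation: the underlying cuMemAllocPitch call documents its ElementSizeBytes parameter as accepting only 4, 8, or 16, and the wrapper presumably passes sizeof(T), so a 2-byte ushort fails with ErrorInvalidValue while a 4-byte float succeeds. Two possible workarounds, sketched below; the "pack" constructor in option 1 is an assumption about your ManagedCuda version, so verify it exists before relying on it:

```csharp
// Sketch: work around the 4/8/16-byte ElementSizeBytes restriction
// of cuMemAllocPitch for 2-byte element types.
using ManagedCuda;

class PitchedUShort
{
    static void Main()
    {
        using (var ctx = new CudaContext(0))
        {
            // 1) If your ManagedCuda version has a "pack" constructor,
            //    group two ushorts so the element size becomes 4 bytes:
            // var a = new CudaPitchedDeviceVariable<ushort>(2048, 160, 2);

            // 2) Otherwise allocate with a 4-byte type and reinterpret
            //    in the kernel, halving the logical width
            //    (2048 ushorts = 1024 uints per row):
            var b = new CudaPitchedDeviceVariable<uint>(1024, 160);
            b.Dispose();
        }
    }
}
```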

NVML observations

Hi - love the lib. I've recently needed some of the APIs from nvml, and had a few observations:

  • there is (AFAIK) no nuget package for nvml
  • the package name (nvml.dll) is awkward for deployment, as the raw API is also nvml.dll

Am I missing something obvious on packaging for NVML?

I appreciate that NVML is essentially a naked wrapper over the native methods, but it works great at what it does. However, as an additional observation: many of the "get" methods use ref in the signature, when the semantics are actually out. For example:

uint count;
NvmlNativeMethods.nvmlDeviceGetCount(out count).Verify();

vs

uint count = 0; // dummy
NvmlNativeMethods.nvmlDeviceGetCount(ref count).Verify();

Simply changing the extern declaration makes this work.

(note that .Verify() here is simply an extension method on the result enum)

This would obviously be a breaking change, as would be changing the library name, but: any thoughts?
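
The suggested signature change can be sketched as a one-line P/Invoke edit, since the marshaller treats ref and out identically at the ABI level (both are pointers); only the C# call site improves. The return type here is simplified to int for illustration, where the real wrapper presumably uses its nvmlReturn enum:

```csharp
// Sketch: switching the extern declaration from `ref` to `out` is
// enough; no native-side change is needed.
using System.Runtime.InteropServices;

static class NvmlOutSketch
{
    [DllImport("nvml.dll", EntryPoint = "nvmlDeviceGetCount")]
    public static extern int nvmlDeviceGetCount(out uint deviceCount);
}
```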
