lebedov / scikit-cuda

Python interface to GPU-powered libraries

Home Page: http://scikit-cuda.readthedocs.org/

License: Other

Makefile 0.04% Python 98.83% C 1.13%

python gpu cuda blas lapack numerical cublas cusolver cufft pycuda

scikit-cuda's Introduction

Package Description

scikit-cuda provides Python interfaces to many of the functions in the CUDA device/runtime, CUBLAS, CUFFT, and CUSOLVER libraries distributed as part of NVIDIA's CUDA Programming Toolkit, as well as interfaces to select functions in the CULA Dense Toolkit. Both low-level wrapper functions similar to their C counterparts and high-level functions comparable to those in NumPy and Scipy are provided.

Documentation

Package documentation is available at http://scikit-cuda.readthedocs.org/. Many of the high-level functions have examples in their docstrings. More illustrations of how to use both the wrappers and high-level functions can be found in the demos/ and tests/ subdirectories.

Development

The latest source code can be obtained from https://github.com/lebedov/scikit-cuda.

When submitting bug reports or questions via the issue tracker, please include the following information:

Python version.
OS platform.
CUDA and PyCUDA version.
Version or git revision of scikit-cuda.

Citing

If you use scikit-cuda in a scholarly publication, please cite it as follows:

@misc{givon_scikit-cuda_2019,
          author = {Lev E. Givon and
                    Thomas Unterthiner and
                    N. Benjamin Erichson and
                    David Wei Chiang and
                    Eric Larson and
                    Luke Pfister and
                    Sander Dieleman and
                    Gregory R. Lee and
                    Stefan van der Walt and
                    Bryant Menn and
                    Teodor Mihai Moldovan and
                    Fr\'{e}d\'{e}ric Bastien and
                    Xing Shi and
                    Jan Schl\"{u}ter and
                    Brian Thomas and
                    Chris Capdevila and
                    Alex Rubinsteyn and
                    Michael M. Forbes and
                    Jacob Frelinger and
                    Tim Klein and
                    Bruce Merry and
                    Nate Merill and
                    Lars Pastewka and
                    Li Yong Liu and
                    S. Clarkson and
                    Michael Rader and
                    Steve Taylor and
                    Arnaud Bergeron and
                    Nikul H. Ukani and
                    Feng Wang and
                    Wing-Kit Lee and
                    Yiyin Zhou},
    title        = {scikit-cuda 0.5.3: a {Python} interface to {GPU}-powered libraries},
    month        = May,
    year         = 2019,
    doi          = {10.5281/zenodo.3229433},
    url          = {http://dx.doi.org/10.5281/zenodo.3229433},
    note         = {\url{http://dx.doi.org/10.5281/zenodo.3229433}}
}

Authors & Acknowledgments

See the included AUTHORS file for more information.

Note Regarding CULA Availability

As of 2021, the CULA toolkit by EM Photonics no longer appears to be available.

Python wrappers for cuDNN by Hannes Bretschneider are available here.

ArrayFire is a free library containing many GPU-based routines with an officially supported Python interface.

License

This software is licensed under the BSD License. See the included LICENSE file for more information.

scikit-cuda's People

stefanv npinto sequoiar jfrelinger renjupaul rpadams larsoner szho42 c0g iskandr teodor-moldovan jellis18 mforbes alemagnani mrgloom stevertaylor urutu untom varunrajk benanne wavelets nouiz capdevc f0k panzermann lurkman psattige jz1536 alexlee-gk shixing grlee77 stonexjr paulosprey s1van mdjawad trumb xsongx oursland zhonghai2810 davidweichiang chenzhongtao cc13ny bmerry nodd angie44 astro44 rwzhao lingerlman acmarsden brainiarc7 bdice arita37 carloholly emillynge lgdkobe24 caomw arjun1610 acuner marcoforte fransal yalechang jadielam michellemay lvaleriu jordanyang nonlining jcmgray gansakumar abergeron fnielsen huangkbaaron rcalland nickorberg pstjohn erichson walkoncross zhencang yiyin nikulukani haiy indivisibleatom xennygrimmato sacrosis prasannababuaddagiri solertis sebitas kingfisher1337 nmerrill67 glennneiger wamsiv supremefist chuckie82 jackustc omghozlan achintya-kumar jaykimbravekjh geoslegend dghernandez kk7nc juno119

scikit-cuda's Issues

SVD speed test

Hi,

I am a newbie. I was comparing the SVD performance of the scikit-cuda library and the pycula library, and I see a major speed difference. Am I missing something? I have a GeForce GTX 560 Ti graphics card.

Thanks.
-Varun

Here is the code:

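import time
import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as culinalg
culinalg.init()

# Init data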

x = np.random.randn(4000,1000).astype('float32')
x = np.dot(x.T, x)/float(x.shape[0])

---------------------------- Scikits cuda---------------

x_gpu = gpuarray.to_gpu(x)

t = time.time()
for i in xrange(3):
    t1 = time.time()
    culinalg.svd(x_gpu)
    print time.time()-t1

Result:
0.978317022324
0.866162061691
0.868548870087

------------------------ pycula-----------------------

x_gpu = cula_Fpitched_gpuarray_like(x)

t = time.time()
for i in xrange(3):
    t1 = time.time()
    gpu_devdevsvd(x_gpu)
    print time.time()-t1

Result:
0.197156906128
0.194029092789
0.194598913193
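
When comparing timings like these, it can help to keep one-time start-up cost out of the measurement and to make sure any asynchronous work has finished before the clock is read. The helper below is a minimal sketch of that pattern and is not code from either library: `time_call` and its parameters are names invented here, and `sync` is an assumed hook where a GPU synchronisation function (for example `pycuda.driver.Context.synchronize`) could be passed.

```python
import time


def time_call(fn, repeats=3, warmup=1, sync=None):
    """Return wall-clock durations (seconds) of `repeats` calls to fn().

    fn     -- zero-argument callable to benchmark (hypothetical name)
    warmup -- number of untimed calls made first, so setup cost is excluded
    sync   -- optional zero-argument callable run before each clock read,
              e.g. a GPU synchronisation function (an assumption, not
              something the original post used)
    """
    for _ in range(warmup):
        fn()
    durations = []
    for _ in range(repeats):
        if sync is not None:
            sync()
        start = time.perf_counter()
        fn()
        if sync is not None:
            sync()
        durations.append(time.perf_counter() - start)
    return durations


if __name__ == "__main__":
    # CPU-only stand-in workload so the sketch runs without a GPU.
    print(time_call(lambda: sum(i * i for i in range(100000))))
```

With the code from the post, `fn` would be something like `lambda: culinalg.svd(x_gpu)`.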

undefined symbol: culaGetLastStatus

Hi,

I am a newbie trying to use cula and scikits.cuda packages.

I found out that the latest CULATools software R17 removed culaGetLastStatus and culaSetLastStatus methods.
http://www.hpc-sol.co.jp/products/cula/doc/CULA_dense_R17/CULAProgrammersGuide_R17.pdf

I see those methods are still in the latest scikits.cuda package. I am not sure if these methods are required. If yes, are there other alternatives?

Thanks,
Varun

CUDA Streams

Is it possible to use CUDA streams with the fft functions? Nvidia's CUFFT manual clearly says yes. However, I haven't found any mention of streams or the cufftSetStream function in the Python code of the fft functions. I'm writing/testing a pseudo-spectral PDE solver, and streams could potentially speed things up a bit.

Assertion Error with the integration functions.

This is the error I am getting:

 z = integrate.trapz2d(x_gpu)
  File "/usr/lib/python2.7/site-packages/scikits/cuda/integrate.py", line 304, in trapz2d
    trapz_mult_gpu = gen_trapz2d_mult(x_gpu.shape, float_type)
  File "/usr/lib/python2.7/site-packages/scikits/cuda/integrate.py", line 231, in gen_trapz2d_mult
    block_dim, grid_dim = misc.select_block_grid_sizes(dev, mat_shape)
  File "/usr/lib/python2.7/site-packages/scikits/cuda/misc.py", line 315, in select_block_grid_sizes
    assert max_blocks_per_grid_dim == max_grid_dim[1]
AssertionError

I ran test_integrate.py from the tests folder, and all of the tests failed with the same error. Not sure if I am doing something wrong.

In [2]: scikits.cuda.__version__
Out[2]: '0.042'

CUDA 4.x changes in CUBLAS function interfaces

The function symbols in libcublas.dylib on MacOSX appear to differ somewhat from those on Linux. This prevents the functions from being properly loaded via ctypes on MacOSX.

$ uname 
Darwin
$ nm /usr/local/cuda/lib/libcublas.dylib |grep _cublas 
0000000000041df0 T _cublasCaxpy_v2
0000000000044990 T _cublasCcopy_v2
000000000003a160 T _cublasCdotc_v2
000000000003a150 T _cublasCdotu_v2
...

(Reported by David Montgomery.)
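
One way a ctypes-based wrapper could cope with symbols that carry a `_v2` suffix on some platforms is to try several candidate names in order. The sketch below illustrates that idea only; `resolve_symbol` and `FakeLib` are hypothetical names, and `FakeLib` stands in for a loaded library so the example runs without CUDA. It relies on the fact that a ctypes library handle raises `AttributeError` when a symbol is missing.

```python
def resolve_symbol(lib, name, suffixes=("_v2", "")):
    """Return the first attribute of `lib` called name + suffix.

    `lib` can be any object supporting getattr, such as a ctypes.CDLL
    handle. Suffixes are tried in the order given.
    """
    tried = []
    for suffix in suffixes:
        candidate = name + suffix
        tried.append(candidate)
        try:
            return getattr(lib, candidate)
        except AttributeError:
            continue
    raise AttributeError("none of %s found" % ", ".join(tried))


class FakeLib(object):
    """Stand-in for a shared library (hypothetical, for illustration)."""

    def cublasCaxpy_v2(self):
        return "v2 entry point"

    def cublasLegacyOnly(self):
        return "legacy entry point"


lib = FakeLib()
caxpy = resolve_symbol(lib, "cublasCaxpy")        # finds cublasCaxpy_v2
legacy = resolve_symbol(lib, "cublasLegacyOnly")  # falls back to the bare name
```

Whether the `_v2` and legacy entry points take the same arguments is a separate question that this sketch does not address.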

failed installation of scikits.cuda: nvcc not in path

Hi,
the installation of scikits.cuda via pip failed: "nvcc not in path"

marco@marco-All-Series:~$ sudo pip install scikits.cuda
Downloading/unpacking scikits.cuda
Downloading scikits.cuda-0.042.tar.gz (97kB): 97kB downloaded
Running setup.py (path:/tmp/pip_build_root/scikits.cuda/setup.py) egg_info for package scikits.cuda

Requirement already satisfied (use --upgrade to upgrade): numpy in /usr/local/lib/python2.7/dist-packages (from scikits.cuda)
Downloading/unpacking pycuda>=0.94.2 (from scikits.cuda)
Downloading pycuda-2014.1.tar.gz (1.6MB): 1.6MB downloaded
Running setup.py (path:/tmp/pip_build_root/pycuda/setup.py) egg_info for package pycuda
*** WARNING: nvcc not in path.
*************************************************************
*** I have detected that you have not run configure.py.
*************************************************************
*** Additionally, no global config files were found.
*** I will go ahead with the default configuration.
*** In all likelihood, this will not work out.
***
*** See README_SETUP.txt for more information.
***
*** If the build does fail, just re-run configure.py with the
*** correct arguments, and then retry. Good luck!
*************************************************************
*** HIT Ctrl-C NOW IF THIS IS NOT WHAT YOU WANT
*************************************************************
Continuing in 1 seconds... ..
Traceback (most recent call last):
File "<string>", line 17, in <module>
File "/tmp/pip_build_root/pycuda/setup.py", line 216, in <module>
main()
File "/tmp/pip_build_root/pycuda/setup.py", line 88, in main
conf["CUDA_INC_DIR"] = [join(conf["CUDA_ROOT"], "include")]
File "/usr/lib/python2.7/posixpath.py", line 77, in join
elif path == '' or path.endswith('/'):
AttributeError: 'NoneType' object has no attribute 'endswith'
Complete output from command python setup.py egg_info:
*** WARNING: nvcc not in path.

*** I have detected that you have not run configure.py.

*** Additionally, no global config files were found.

*** I will go ahead with the default configuration.

*** In all likelihood, this will not work out.

*** See README_SETUP.txt for more information.

*** If the build does fail, just re-run configure.py with the

*** correct arguments, and then retry. Good luck!

*** HIT Ctrl-C NOW IF THIS IS NOT WHAT YOU WANT

Continuing in 1 seconds...

Traceback (most recent call last):

File "<string>", line 17, in <module>

File "/tmp/pip_build_root/pycuda/setup.py", line 216, in <module>

main()

File "/tmp/pip_build_root/pycuda/setup.py", line 88, in main

conf["CUDA_INC_DIR"] = [join(conf["CUDA_ROOT"], "include")]

File "/usr/lib/python2.7/posixpath.py", line 77, in join

elif path == '' or path.endswith('/'):

AttributeError: 'NoneType' object has no attribute 'endswith'

Cleaning up...
Command python setup.py egg_info failed with error code 1 in /tmp/pip_build_root/pycuda
Storing debug log for failure in /home/marco/.pip/pip.log

Yesterday I had the same problem with Theano.
Everything with Theano worked fine until I installed some packages with the Anaconda distribution and then removed Anaconda (I didn't want to rely on a proprietary distribution).
Right after Anaconda's removal, exactly the same error message appeared.
For Theano I solved the problem by putting the path into the .theanorc file (Theano's configuration, or "dashboard", file).

Any hints to solve the problem for scikits.cuda?

Looking forward to your kind help.
Kind regards.
Marco
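
The log shows `conf["CUDA_ROOT"]` ending up as `None` right after the "nvcc not in path" warning, so a first step is to check what the installing process can actually see. The sketch below is a small diagnostic under that assumption; `find_nvcc` and `describe_cuda_env` are names made up for this example. Since the command was run with `sudo`, one possibility (not confirmed by the log) is that `sudo` replaced the user's `PATH`, so it may be worth running the check both with and without `sudo`.

```python
import os
import shutil


def find_nvcc(path=None):
    """Return the full path to nvcc on `path` (default: current PATH), or None."""
    return shutil.which("nvcc", path=path)


def describe_cuda_env(environ=None):
    """Summarise what a build script run in this environment could see."""
    environ = os.environ if environ is None else environ
    return {
        "nvcc": find_nvcc(environ.get("PATH", "")),
        "CUDA_ROOT": environ.get("CUDA_ROOT"),
    }


if __name__ == "__main__":
    print(describe_cuda_env())
```

If `nvcc` comes back as `None`, adding the CUDA `bin` directory to `PATH` for the installing user is the usual remedy; the exact directory depends on the local CUDA installation.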

Mac OSX Mavericks 10.9.5 can't find CULA

I'm consistently getting this problem

raise OSError('%s not found' % _load_err)
OSError: libcula_lapack.dylib, libcula_core.dylib not found

I had to change the default library names in cula.py since CULA 16 gives libcula_lapack.dylib, libcula_core.dylib for Mac OSX 10.9.5. CULA is installed and works. All CULA libraries are included in the DYLD_LIBRARY_PATH. CUBLAS is working fine, but the scikits.cuda just can't find CULA, even though the relevant .dylib files are in the library path.
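
The error message above has the shape of a loader that tries a list of library names and reports the ones that failed. The sketch below shows that general pattern with ctypes so the individual loader errors stay visible, which can help tell "file not found" apart from "file found but a dependency is missing". It is an illustration only: `load_first_available` is a hypothetical name, not the function used in cula.py.

```python
import ctypes


def load_first_available(candidates):
    """Try each library name in order and return the first handle that loads.

    Raises OSError listing every candidate together with the loader's own
    error text when none of them can be loaded.
    """
    errors = []
    for name in candidates:
        try:
            return ctypes.cdll.LoadLibrary(name)
        except OSError as exc:
            errors.append("%s (%s)" % (name, exc))
    raise OSError("%s not found" % ", ".join(errors))


# Example call, using the file names mentioned in the report:
# handle = load_first_available(["libcula_lapack.dylib", "libcula_core.dylib"])
```

Passing an absolute path as a candidate is a quick way to check whether the search path or the library itself is the problem.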

use both major and minor soname number when determining CUBLAS version

Currently, only the major number is used; this causes CUBLAS 5.5 to be detected as CUBLAS 5.0.
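
A minimal sketch of extracting both numbers from a versioned library file name follows. The function name `parse_soname_version` and the sample file names are assumptions made for illustration; how scikits.cuda actually obtains the soname is not shown here.

```python
import re


def parse_soname_version(soname):
    """Extract (major, minor) from a name such as 'libcublas.so.5.5.22'.

    A missing minor component is reported as 0.
    """
    match = re.search(r"\.so\.(\d+)(?:\.(\d+))?", soname)
    if match is None:
        raise ValueError("no version found in %r" % soname)
    major = int(match.group(1))
    minor = int(match.group(2)) if match.group(2) is not None else 0
    return major, minor


v55 = parse_soname_version("libcublas.so.5.5.22")
v50 = parse_soname_version("libcublas.so.5.0.35")
```

Comparing the resulting tuples, rather than only the major number, keeps 5.5 and 5.0 distinct.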

Confused Using scikits.cuda.cula

Hello all,

I want to use some of the CULA functionality, such as LU factorization or matrix inversion, but I have a problem with the pointer inputs. For example, to do an LU factorization with scikits.cuda.cula.culaDeviceSgetrf(m, n, a, lda, ipiv), one needs to pass a pointer for the "a" argument, but Python has no explicit pointers (I know all variables in Python are passed by reference). What should I do in this case? Should I use the ctypes library to create a pointer?

this is what I am trying to do:

import numpy as np
import scikits.cuda.cula as cula
import pycuda.gpuarray as gpuarray

cula.culaInitialize()

# Create a square matrix for simplicity

a=np.array([[1,2,3,4],[6,7,8,9],[7,2,3,5],[2,4,5,6]])

m=n=a.shape[0]

scikits.cuda.cula.culaDeviceSgetrf(m,n,a,n,n)

but I'm getting this error:

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cula.py", line 329, in culaDeviceSgetrf
status = _libcula.culaDeviceSgetrf(m, n, int(a), lda, int(ipiv))
TypeError: only length-1 arrays can be converted to Python scalars

and when I try
a_gpu = gpuarray.to_gpu(a)
scikits.cuda.cula.culaDeviceSgetrf(m,n,a_gpu,n,n) :

Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cula.py", line 329, in culaDeviceSgetrf
status = _libcula.culaDeviceSgetrf(m, n, int(a), lda, int(ipiv))
TypeError: int() argument must be a string or a number, not 'GPUArray'

any solution ?

Thanks ,

Mohsen
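
Both tracebacks stop at `int(a)`, which suggests the wrapper expects arguments that convert to an integer memory address. The sketch below shows on the host, with ctypes only, that a "pointer" in this sense is just such an integer. It is a stand-in for illustration and does not call CULA.

```python
import ctypes

# Four single-precision floats in a C buffer on the host; this plays the
# role that a device allocation would play on the GPU.
buf = (ctypes.c_float * 4)(1.0, 2.0, 3.0, 4.0)

# The "pointer" passed to a C function is the buffer's address as an int.
address = ctypes.addressof(buf)

# Reading back through that address recovers the same data.
view = ctypes.cast(address, ctypes.POINTER(ctypes.c_float))
values = [view[i] for i in range(4)]
```

With PyCUDA, the corresponding object is the array's `gpudata` attribute, which can be converted with `int()`, so passing `a_gpu.gpudata` rather than `a_gpu` is the likely fix for the second error. Two further points are worth checking, stated here as assumptions about the intended call: `ipiv` probably needs to be a device buffer of integers rather than the number `n`, and since the `S` prefix denotes single precision, `a` should be converted to float32 before transfer.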

CUDA FFT / IFFT mis-estimates DC component in downsampling

I built a CUDA-based downsampling method based on scipy's code, and the results are not equivalent when I use CUDA. I have implemented scipy's code myself, and used pyFFTW, and these match -- CUDA is the odd one out.

Scipy's code works like this, to downsample a length-12 vector x to a length-6 vector y:

import numpy as np
from scipy.signal import resample
# assumed source of fft/ifft (overwrite_x is a scipy.fftpack argument)
from scipy.fftpack import fft, ifft

old_len = 12
new_len = 6
double = True

dtype_r = np.float64 if double is True else np.float32
dtype_c = np.complex128 if double is True else np.complex64

# make a repeatable random vector
x = np.random.RandomState(0).randn(old_len).astype(dtype_r)

# use scipy to resample it
y_sp = resample(x, new_len)

# use equivalent python implementation
# (floor division keeps the slice bounds integral on Python 2 and 3)
N = int(np.minimum(new_len, old_len))
sl_1 = slice((N + 1) // 2)
sl_2 = slice(-(N - 1) // 2, None)
x_fft = fft(x).ravel()
y_fft = np.zeros(new_len, np.complex128)
y_fft[sl_1] = x_fft[sl_1]
y_fft[sl_2] = x_fft[sl_2]
y_py = np.real(ifft(y_fft, overwrite_x=True)).ravel()
y_py *= (float(new_len) / old_len)

Basically it copies part of the spectrum into a new vector, and does a shorter-length IFFT on that vector to get y.
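
The same spectrum-truncation idea can be written with a naive DFT from the standard library, which makes one property easy to check: because the DC bin is copied unchanged and the output is rescaled by new_len / old_len, the downsampled signal keeps the mean of the original. The sketch below is an illustration with made-up input values and hypothetical helper names (`dft`, `idft`, `downsample`); it says nothing about how cuFFT behaves.

```python
import cmath


def dft(x):
    """Naive O(n^2) forward DFT."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
            for k in range(n)]


def idft(X):
    """Naive O(n^2) inverse DFT, including the 1/n factor."""
    n = len(X)
    return [sum(X[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
            for t in range(n)]


def downsample(x, new_len):
    """Resample x to new_len (>= 2) samples by truncating its spectrum."""
    old_len = len(x)
    if new_len < 2:
        raise ValueError("new_len must be at least 2 in this sketch")
    n = min(new_len, old_len)
    X = dft(x)
    Y = [0j] * new_len
    low = slice((n + 1) // 2)          # DC and low positive frequencies
    high = slice(-(n - 1) // 2, None)  # low negative frequencies
    Y[low] = X[low]
    Y[high] = X[high]
    return [v.real * new_len / old_len for v in idft(Y)]


x = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0, 0.0, -2.0, 1.0, 0.5, 1.25]
y = downsample(x, 6)
mean_x = sum(x) / len(x)
mean_y = sum(y) / len(y)
```

Comparing `mean_x` and `mean_y` is therefore a cheap sanity check for any implementation of this scheme, including a GPU one.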

Now the equivalent code for (py)FFTW is:

    # use fftw
    fftw_len_x = int((old_len - (old_len % 2)) / 2 + 1)
    fftw_len_y = int((new_len - (new_len % 2)) / 2 + 1)
    xy_fft = np.zeros(max(fftw_len_x, fftw_len_y), dtype_c)
    x_fft = xy_fft[:fftw_len_x]
    y_fft = xy_fft[:fftw_len_y]
    #y_fft = np.zeros((1, fftw_len_y), dtype_c)
    y_fw = np.zeros(new_len, dtype_r)
    y_fft.fill(0)
    fft_plan = pyfftw.FFTW(x, x_fft)
    ifft_plan = pyfftw.FFTW(y_fft, y_fw, direction='FFTW_BACKWARD')
    fft_plan.execute()
    #y_fft[0, sl_1] = x_fft[0, sl_1]
    ifft_plan.execute()
    y_fw /= float(old_len)

Note that for FFTW we don't need to copy the spectrum, because it's done implicitly just by making y_fft and x_fft both appropriately-size slices of / views into xy_fft.

And for (scikits.)CUDA, the code is very similar:

# use (scikits.)CUDA
fft_plan = cudafft.Plan(old_len, dtype_r, dtype_c)
ifft_plan = cudafft.Plan(new_len, dtype_c, dtype_r)
# this allocation is bigger than needed, but results didn't change when smaller
xy_fft_gpu = gpuarray.to_gpu(np.zeros(max(old_len, new_len), dtype_c))
x_gpu = gpuarray.to_gpu(x)
y_gpu = gpuarray.to_gpu(np.zeros(new_len, dtype_r))
cudafft.fft(x_gpu, xy_fft_gpu, fft_plan)
cudafft.ifft(xy_fft_gpu, y_gpu, ifft_plan, scale=False)
y_gpu = y_gpu.get() / float(old_len)

All of the DFT Components of these signals match quite well, /except/ for the DC component, which is wrong for CUDA:

>>> print [np.mean(xx) for xx in [y_py, y_sp, y_fw, y_gpu]]
[0.74821239878434997, 0.74821239878434997, 0.74821239878434997, 0.43901809316455404]

Feel free to look at the combined code in my Gist to see all of the code:

https://gist.github.com/Eric89GXL/4739735

Note that this problem doesn't seem to occur when downsampling to an odd length (e.g., new_len = 7), which is strange...

cholesky recombination errors

Hello,

I'm using scikits.cuda for some statistical work, much of which involves the cholesky factorisation and recombination of various positive-definite covariance matrices. Over the course of this, I stumbled across a problem with culinalg.dot.

Having calculated the cholesky factorisation and its transpose of a small (3x3) matrix on the CPU, I then tasked numpy and scikits.cuda with rederiving the original matrix with a dot product. However, while the CPU successfully recreates the original covariance matrix, the GPU method only correctly computes the upper triangle, filling the lower triangle with zeros, in a way that is inaccurate.

Sample code:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import pycuda.driver as drv
import numpy as np
import scikits.cuda.linalg as culinalg
import scikits.cuda.misc as cumisc
culinalg.init()

Lm=np.array([[ 5.77350269e-01, -5.91335503e-05, -3.47227707e-01], [ 0.00000000e+00 , 5.77350272e-01 , 3.52601194e-05], [ 0.00000000e+00 , 0.00000000e+00 , 6.73721317e-01]])
LmT=Lm.T
LmT_gpu=gpuarray.to_gpu(LmT)
Lm_gpu=gpuarray.to_gpu(Lm)
Kmm_gpu=culinalg.dot(Lm_gpu,LmT_gpu)
print "Lm,LmT GPU",Lm_gpu.get()
print LmT_gpu.get()
Kmm_cpu=np.dot(Lm,LmT)
print "two Kmm", Kmm_cpu
print Kmm_gpu.get()
print np.allclose(Kmm_cpu,Kmm_gpu.get())

Sample output:

Lm,LmT GPU [[ 5.77350269e-01 -5.91335503e-05 -3.47227707e-01]
[ 0.00000000e+00 5.77350272e-01 3.52601194e-05]
[ 0.00000000e+00 0.00000000e+00 6.73721317e-01]]
[[ 5.77350269e-01 0.00000000e+00 0.00000000e+00]
[ -5.91335503e-05 5.77350272e-01 0.00000000e+00]
[ -3.47227707e-01 3.52601194e-05 6.73721317e-01]]
two Kmm [[ 4.53900417e-01 -4.63840618e-05 -2.33934708e-01]
[ -4.63840618e-05 3.33333338e-01 2.37554941e-05]
[ -2.33934708e-01 2.37554941e-05 4.53900413e-01]]
[[ 3.33333333e-01 -6.82815425e-05 -4.34406720e-01]
[ 0.00000000e+00 3.33333337e-01 4.41129336e-05]
[ 0.00000000e+00 0.00000000e+00 4.53900413e-01]]
False

I'm using CUDA 5.5, with scikits.cuda pulled straight from github as of 17/04/2014.
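
The printed numbers allow a quick arithmetic check that needs no GPU. The sketch below recomputes two products from the printed `Lm` in plain Python: `Lm * Lm^T`, which agrees with the CPU result, and `Lm * Lm`, which agrees with the printed GPU result to the displayed precision. That suggests the transpose was not applied on the GPU path. One possible explanation, offered as an assumption rather than a confirmed diagnosis, is that `Lm.T` is a non-contiguous view whose layout is lost on transfer; passing `np.ascontiguousarray(Lm.T)` to `gpuarray.to_gpu` would be one way to test that.

```python
def matmul(a, b):
    """Product of two list-of-rows matrices."""
    return [[sum(a[i][p] * b[p][j] for p in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two matrices."""
    return max(abs(u - v) for ra, rb in zip(a, b) for u, v in zip(ra, rb))


# Values copied from the sample code and sample output above.
Lm = [[5.77350269e-01, -5.91335503e-05, -3.47227707e-01],
      [0.0, 5.77350272e-01, 3.52601194e-05],
      [0.0, 0.0, 6.73721317e-01]]

reported_cpu = [[4.53900417e-01, -4.63840618e-05, -2.33934708e-01],
                [-4.63840618e-05, 3.33333338e-01, 2.37554941e-05],
                [-2.33934708e-01, 2.37554941e-05, 4.53900413e-01]]

reported_gpu = [[3.33333333e-01, -6.82815425e-05, -4.34406720e-01],
                [0.0, 3.33333337e-01, 4.41129336e-05],
                [0.0, 0.0, 4.53900413e-01]]

diff_cpu = max_abs_diff(matmul(Lm, transpose(Lm)), reported_cpu)
diff_gpu = max_abs_diff(matmul(Lm, Lm), reported_gpu)
```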

MAX_GRID_DIM_X and MAX_GRID_DIM_Y not identical for GTX Titan

select_block_grid_sizes asserts that the MAX_GRID_DIM in X and Y direction are the same. I don't know about other devices, but for the GTX Titan this is no longer true and I get the following AssertionError:

/home/hannes/usr/EPD/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/misc.pyc in select_block_grid_sizes(dev, data_shape, threads_per_block)
    313     max_blocks_per_grid_dim = max(max_grid_dim)
    314     assert max_blocks_per_grid_dim == max_grid_dim[0]
--> 315     assert max_blocks_per_grid_dim == max_grid_dim[1]
    316 
    317     # Actual number of thread blocks needed:

AssertionError: 

In [4]: %debug
> /home/hannes/usr/EPD/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/misc.py(315)select_block_grid_sizes()
    314     assert max_blocks_per_grid_dim == max_grid_dim[0]
--> 315     assert max_blocks_per_grid_dim == max_grid_dim[1]
    316

For reference, here is the output of deviceQuery for the Titan:

Device 0: "GeForce GTX TITAN"
  CUDA Driver Version / Runtime Version          5.0 / 5.0
  CUDA Capability Major/Minor version number:    3.5
  Total amount of global memory:                 6144 MBytes (6442254336 bytes)
  (14) Multiprocessors x (192) CUDA Cores/MP:    2688 CUDA Cores
  GPU Clock rate:                                876 MHz (0.88 GHz)
  Memory Clock rate:                             3004 Mhz
  Memory Bus Width:                              384-bit
  L2 Cache Size:                                 1572864 bytes
  Max Texture Dimension Size (x,y,z)             1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
  Max Layered Texture Size (dim) x layers        1D=(16384) x 2048, 2D=(16384,16384) x 2048
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       49152 bytes
  Total number of registers available per block: 65536
  Warp size:                                     32
  Maximum number of threads per multiprocessor:  2048
  Maximum number of threads per block:           1024
  Maximum sizes of each dimension of a block:    1024 x 1024 x 64
  Maximum sizes of each dimension of a grid:     2147483647 x 65535 x 65535
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             512 bytes
  Concurrent copy and kernel execution:          Yes with 1 copy engine(s)
  Run time limit on kernels:                     No
  Integrated GPU sharing Host Memory:            No
  Support host page-locked memory mapping:       Yes
  Alignment requirement for Surfaces:            Yes
  Device has ECC support:                        Disabled
  Device supports Unified Addressing (UVA):      Yes
  Device PCI Bus ID / PCI location ID:           65 / 0
  Compute Mode:
     < Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
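
Since the deviceQuery output shows different limits per grid axis (2147483647 x 65535 x 65535), a grid-size calculation can treat each axis limit separately instead of asserting that they are equal. The following is a minimal sketch of that idea; `select_grid_dims` is a hypothetical helper and not the library's `select_block_grid_sizes`.

```python
def select_grid_dims(n_blocks, max_grid_dim):
    """Spread n_blocks thread blocks over a 2D grid within per-axis limits.

    max_grid_dim -- sequence whose first two entries are the X and Y limits;
                    they need not be equal.
    Returns (grid_x, grid_y) with grid_x * grid_y >= n_blocks.
    """
    max_x, max_y = max_grid_dim[0], max_grid_dim[1]
    if n_blocks < 1:
        raise ValueError("n_blocks must be positive")
    grid_x = min(n_blocks, max_x)
    grid_y = -(-n_blocks // grid_x)  # ceiling division
    if grid_y > max_y:
        raise ValueError("too many blocks for this device")
    return grid_x, grid_y


titan = select_grid_dims(70000, (2147483647, 65535, 65535))
older = select_grid_dims(70000, (65535, 65535, 1))
```

A kernel launched on such a grid still has to ignore the surplus blocks when `grid_x * grid_y` exceeds `n_blocks`.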

Demos failing with CUDA 5

Run any demo with CUDA 5 installed (Linux, 64-bit, with newest CULA as well as of 2/3/2013), and all demos fail, either because np.allclose() fails, meaning computations are incorrect, or there is an ambiguous cublasInternalError() that occurs.

Any thoughts, or should I just go ahead and try downgrading to CUDA 4?

dot_demo failure: cublasArchMismatch

Traceback (most recent call last):
  File "dot_demo.py", line 23, in <module>
    temp_gpu = linalg.dot(a_gpu, b_gpu)
  File "/home/stefan/src/scikits.cuda/scikits/cuda/linalg.py", line 280, in dot
    cublas.cublasCheckStatus(status)
  File "/home/stefan/src/scikits.cuda/scikits/cuda/cublas.py", line 114, in cublasCheckStatus
    raise cublasExceptions[status]
scikits.cuda.cublas.cublasArchMismatch

undefined symbol: cublasChpr2 with libcublas4

Hiya

Does this library definitely work with libcublas 4? I'm getting this error when importing scikits.cuda.linalg:

/usr/lib/x86_64-linux-gnu/libcublas.so: undefined symbol: cublasChpr2

I'm using libcublas 4.0.17 on Ubuntu 12.04. Other CUDA-related stuff (pycuda, cudamat/gnumpy) works fine.

Cheers
-Matt

use cublas*dgmm in linalg.dot

CUBLAS 5 contains dedicated functions for multiplying diagonal and non-diagonal matrices; it would be nice to use these in linalg.dot (suggestion by Teodor Moldovan).
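
The reason a dedicated routine helps is that multiplying by a diagonal matrix only scales rows or columns, so a full matrix product does a lot of work on zeros. The pure-Python sketch below illustrates the arithmetic only, with hypothetical helper names (`scale_rows`, `scale_cols`, `diag`, `matmul`); it does not call CUBLAS.

```python
def scale_rows(x, a):
    """diag(x) * A: row i of A is multiplied by x[i]."""
    return [[xi * v for v in row] for xi, row in zip(x, a)]


def scale_cols(a, x):
    """A * diag(x): column j of A is multiplied by x[j]."""
    return [[v * xj for v, xj in zip(row, x)] for row in a]


def diag(x):
    """Dense square matrix with x on the diagonal."""
    n = len(x)
    return [[x[i] if i == j else 0.0 for j in range(n)] for i in range(n)]


def matmul(p, q):
    """Reference dense product of two list-of-rows matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


a = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]  # 2 x 3
left = scale_rows([10.0, 100.0], a)     # same as diag([10, 100]) * a
right = scale_cols(a, [1.0, 0.5, 2.0])  # same as a * diag([1, 0.5, 2])
```

The scaling versions touch each element once, whereas the dense products perform a full inner loop per output element.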

link the AUTHORS file to the one in docs

you might want to try .. include::docs/source/authors.rst 😄

SegFault

Installed scikits.cuda from latest git (0.042)

nosetests fails on the 5th test with a segmentation fault
Several demos also fail with a segfault
dot_demo.py
mdot_demo.py
pinv_demo.py
tril_demo.py
The indexing_4d_demo.py fails with a compilation error.

Can you help me please?

This was on
Ubuntu 10.04.3.LTS - kernel 2.6.32.34-server (x86_64)
gcc 4.4.3
cuda 4.0
pycuda 2011.1.2
scikits.cuda 0.042

and also on
Ubuntu 11.04 - kernel 2.6.38.11-server (x86_64)
gcc 4.5.2 (but nvcc --compiler-bindir /usr/bin/gcc-4.4)
cuda 4.0
pycuda 2011.1.2
scikits.cuda 0.042

Behaviour of transpose in linalg.dot

Hello. I wanted to flag something which may be general or may just be a problem with my installation. If the latter, then my apologies.

I'm having problems with the behaviour of the transpose options in linalg.dot, specifically for multiplication of non-square matrices. For example,

a_gpu = gpuarray.to_gpu(numpy.array([[1.,2.,9.],[4.,5.,7.]]).astype(numpy.float64))
b_gpu = gpuarray.to_gpu(numpy.array([[17.,25.,92.,16.],[46.,59.,73.,11.],[101.,0.72,0.83,0.11]]).astype(numpy.float64))

In this case linalg.dot(a_gpu,b_gpu) gives the same answer as numpy.dot(a_gpu.get(),b_gpu.get()). However, if we have

c_gpu = gpuarray.to_gpu(b_gpu.get().T)

then linalg.dot(a_gpu,c_gpu,transa='N',transb='T') is not the same as numpy.dot(a_gpu.get(),c_gpu.get().T)
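
As background for reports like this one (not a diagnosis of it): CUBLAS assumes column-major storage, while NumPy and PyCUDA arrays are row-major by default, so a wrapper has to swap operands and dimensions, and mistakes there tend to show up only for non-square matrices. The sketch below reproduces the standard trick in plain Python using the matrices from the report. `gemm_colmajor` is a hypothetical helper that mimics column-major indexing; it is not the CUBLAS signature.

```python
def flatten_rowmajor(rows):
    """Flatten a list-of-rows matrix into one row-major buffer."""
    return [v for row in rows for v in row]


def matmul(x, y):
    """Reference product of two list-of-rows matrices."""
    return [[sum(x[i][p] * y[p][j] for p in range(len(y)))
             for j in range(len(y[0]))] for i in range(len(x))]


def gemm_colmajor(m, n, k, a, lda, b, ldb):
    """C = A * B for column-major buffers; A is m x k, B is k x n.

    Returns C as a column-major buffer with leading dimension m.
    """
    c = [0.0] * (m * n)
    for j in range(n):
        for i in range(m):
            c[i + j * m] = sum(a[i + p * lda] * b[p + j * ldb] for p in range(k))
    return c


a = [[1., 2., 9.], [4., 5., 7.]]          # 2 x 3
b = [[17., 25., 92., 16.],
     [46., 59., 73., 11.],
     [101., 0.72, 0.83, 0.11]]            # 3 x 4

a_buf, b_buf = flatten_rowmajor(a), flatten_rowmajor(b)

# Read as column-major, b_buf is B^T (4 x 3, ld 4) and a_buf is A^T (3 x 2, ld 3).
# B^T * A^T = (A * B)^T, whose column-major buffer is A * B in row-major order.
c_buf = gemm_colmajor(4, 2, 3, b_buf, 4, a_buf, 3)
expected = flatten_rowmajor(matmul(a, b))
```

When a transpose flag is added on top of this, the leading dimensions and the m, n, k arguments all change together, which is where non-square cases can go wrong.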

Syntax error with python 2.6

Currently there is an error in:
/scikits/cuda/misc.py:686

Dict comprehensions are only supported from Python 2.7 on, which causes a syntax error for this line in Python 2.6:
num_nbytes = {np.dtype(t):t(1).nbytes for t in num_types}

Replacing it with this one fixes the issue:

num_nbytes = dict((np.dtype(t),t(1).nbytes) for t in num_types)

transpose returns bad result!

Today I ran into an odd problem when using the transpose function. It had been working fine until I added one column of ones to my matrix; now it returns a weird result!

for example

import numpy as np
import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import scikits.cuda.linalg as linalg

a=np.random.random((3,3))
b=np.c_[np.ones(len(a)),a]
b_gpu=gpuarray.to_gpu(b)

a=
([[ 0.3893433 , 0.39086179, 0.17822495],
[ 0.83947924, 0.1448663 , 0.01559765],
[ 0.86897163, 0.93040886, 0.94529043]])

 ([[ 1.        ,  0.3893433 ,  0.39086179,  0.17822495],

   [ 1.        ,  0.83947924,  0.1448663 ,  0.01559765],

   [ 1.        ,  0.86897163,  0.93040886,  0.94529043]])

now b.T =

  ([[ 1.        ,  1.        ,  1.        ],

   [ 0.3893433 ,  0.83947924,  0.86897163],

   [ 0.39086179,  0.1448663 ,  0.93040886],

   [ 0.17822495,  0.01559765,  0.94529043]])

but linalg.transpose(b_gpu)

([[ 1.        ,  0.83947924,  0.93040886],

   [ 1.        ,  0.86897163,  0.17822495],

   [ 1.        ,  0.39086179,  0.01559765],

   [ 0.3893433 ,  0.1448663 ,  0.94529043]])

I also tried to use the CULA function like this:
(m,n)=b.shape
c=np.zeros((n,m))

c_gpu=gpuarray.to_gpu(c)

cula.culaDeviceDgeTranspose(m,n,b_gpu.gpudata,b_gpu.len(),c_gpu.gpudata,c_gpu.len())

c_gpu=

([[ 1.        ,  0.3893433 ,  0.39086179],

   [ 0.17822495,  1.        ,  0.83947924],

   [ 0.1448663 ,  0.01559765,  1.        ],

   [ 0.86897163,  0.93040886,  0.94529043]])

Any idea why this is happening?

CULA Dense 13a and 14

Hey guys,

It looks like the linking against CULA has changed in versions 13 and 14. The library name has changed from libcula.so to libcula_core.so ... I tried a symbolic link, but there are undefined symbols, etc.

CULA 12 works fine.

Multiple demo failures with CUDA 3.2 on OS X 10.5

As I mentioned in the other thread, I get a whole bunch of failures of the demo programs.

Here is a dump of them (note that the PyCUDA unit tests ran with no failures).

$ for demo in *_demo.py; do printf "%s\n" "" "---------------------" $demo "---------------------"; python $demo; done



---------------------
diag_demo.py
---------------------
Testing real diagonal matrix creation for type float32
Success status:  True
Testing real diagonal matrix creation for type complex64
Success status:  True

---------------------
dot_demo.py
---------------------
Testing matrix multiplication for type float32
Success status:  True
Testing vector multiplication for type float32
Success status:  False
Testing matrix multiplication for type complex64
Traceback (most recent call last):
  File "dot_demo.py", line 38, in <module>
    temp_gpu = culinalg.dot(a_gpu, b_gpu)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 290, in dot
    cublas.cublasCheckStatus(status)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cublas.py", line 121, in cublasCheckStatus
    raise cublasExceptions[status]
scikits.cuda.cublas.cublasMappingError

---------------------
fft_demo.py
---------------------
Testing fft/ifft..
Traceback (most recent call last):
  File "fft_demo.py", line 22, in <module>
    plan_forward = cu_fft.Plan(x_gpu.shape, np.float32, np.complex64)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/fft.py", line 70, in __init__
    self.fft_type, 1)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cufft.py", line 114, in cufftPlan1d
    cufftCheckStatus(status)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cufft.py", line 86, in cufftCheckStatus
    raise cufftExceptions[status]
scikits.cuda.cufft.cufftInvalidValue
Exception AttributeError: "Plan instance has no attribute 'handle'" in > ignored

---------------------
mdot_demo.py
---------------------
Testing multiple matrix multiplication for type float32
Success status:  True
Testing multiple matrix multiplication for type complex64
Traceback (most recent call last):
  File "mdot_demo.py", line 37, in <module>
    d_gpu = linalg.mdot(a_gpu, b_gpu, c_gpu)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 337, in mdot
    temp_gpu = dot(out_gpu, next_gpu)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 290, in dot
    cublas.cublasCheckStatus(status)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cublas.py", line 121, in cublasCheckStatus
    raise cublasExceptions[status]
scikits.cuda.cublas.cublasMappingError

---------------------
pinv_demo.py
---------------------
Testing pinv for type float32
Traceback (most recent call last):
  File "pinv_demo.py", line 29, in <module>
    a_inv_gpu = culinalg.pinv(a_gpu, pycuda.autoinit.device)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 809, in pinv
    u_gpu, s_gpu, vh_gpu = svd(a_gpu, 0)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 160, in svd
    cula.culaCheckStatus(status)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cula.py", line 121, in culaCheckStatus
    raise culaExceptions[status]
scikits.cuda.cula.culaRuntimeError

---------------------
select_block_grid_demo.py
---------------------
Success status:  True
exec time =  0.00113606452942

---------------------
svd_demo.py
---------------------
Testing svd for type float32
Traceback (most recent call last):
  File "svd_demo.py", line 29, in <module>
    u_gpu, s_gpu, vh_gpu = culinalg.svd(a_gpu, pycuda.autoinit.device)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 160, in svd
    cula.culaCheckStatus(status)
  File "/Library/Frameworks/Python.framework/Versions/6.2/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cula.py", line 121, in culaCheckStatus
    raise culaExceptions[status]
scikits.cuda.cula.culaRuntimeError

---------------------
transpose_demo.py
---------------------
Testing transpose for type float32
Success status:  True
Testing transpose for type complex64
Success status:  True

---------------------
tril_demo.py
---------------------
Testing lower triangle extraction for type float32
Success status:  True
Testing lower triangle extraction for type complex64
Success status:  True

and here is my system info (why can't we add attachments?)

      System Version: Mac OS X 10.5.8 (9L31a)
      Kernel Version: Darwin 9.8.0
 ...
      Model Name: MacBook Pro
      Model Identifier: MacBookPro4,1
      Processor Name: Intel Core 2 Duo
      Processor Speed: 2.5 GHz
      Number Of Processors: 1
      Total Number Of Cores: 2
      L2 Cache: 6 MB
      Memory: 4 GB
      Bus Speed: 800 MHz
...
    GeForce 8600M GT:

      Chipset Model: GeForce 8600M GT
      Type: Display
      Bus: PCIe
      PCIe Lane Width: x16
      VRAM (Total): 512 MB
      Vendor: NVIDIA (0x10de)
      Device ID: 0x0407
      Revision ID: 0x00a1
      ROM Revision: 3212
...

 CUDA Device Query (Runtime API) version (CUDART static linking)

There is 1 device supporting CUDA

Device 0: "GeForce 8600M GT"
  CUDA Driver Version:                           3.20
  CUDA Runtime Version:                          3.10
  CUDA Capability Major revision number:         1
  CUDA Capability Minor revision number:         1
  Total amount of global memory:                 536674304 bytes
  Number of multiprocessors:                     4
  Number of cores:                               32
  Total amount of constant memory:               65536 bytes
  Total amount of shared memory per block:       16384 bytes
  Total number of registers available per block: 8192
  Warp size:                                     32
  Maximum number of threads per block:           512
  Maximum sizes of each dimension of a block:    512 x 512 x 64
  Maximum sizes of each dimension of a grid:     65535 x 65535 x 1
  Maximum memory pitch:                          2147483647 bytes
  Texture alignment:                             256 bytes
  Clock rate:                                    0.94 GHz
  Concurrent copy and execution:                 Yes
  Run time limit on kernels:                     Yes
  Integrated:                                    No
  Support host page-locked memory mapping:       Yes
  Compute mode:                                  Default (multiple host threads can use this device simultaneously)

deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 53331, CUDA Runtime Version = 3.10, NumDevs = 1, Device = GeForce 8600M GT


PASSED

Press  to Quit...
-----------------------------------------------------------

different results numpy.solve Vs. scikits.cuda.cula.culaDeviceDgesv()

Hello ,

I am trying to compare CPU and GPU speed for a linear solver using numpy.solve() and cula.culaDeviceDgesv().
When I test with about 300 samples, the GPU result is correct, but when I increase the number of samples I run into two problems:

1- When I increase it to 400 or 500, I get this error:

numpy array time: 0.030175s
correctness= True

Traceback (most recent call last):
File "/home/jadidi/python-workespace/kernel/linear regression/solver.py", line 78, in
gpu_result=gpu_solve(k,y)
File "/home/jadidi/python-workespace/kernel/linear regression/solver.py", line 61, in gpu_solve
t=cula.culaDeviceDgesv(n, nrhs, k_gpu.ptr, lda, ipiv_gpu.ptr, y_gpu.ptr, ldb)
File "/usr/local/lib/python2.7/dist-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cula.py", line 489, in culaDeviceDgesv
culaCheckStatus(status)
File "/usr/local/lib/python2.7/dist-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cula.py", line 210, in culaCheckStatus
raise culaExceptions[status]
scikits.cuda.cula.culaRuntimeError: 4
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuEventDestroy failed: launch failed
PyCUDA WARNING: a clean-up operation failed (dead context maybe?)
cuEventDestroy failed: launch failed

2- When I increase it to 600, the GPU results are incorrect, which is weird.

my code:

https://docs.google.com/document/d/1Owb20-6K_ffRuZH3FX2Vjgp4VD5YsLqXqF5jkWIL_wA/edit

I appreciate any help!

Mohsen
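
One way to narrow this down is to judge each solution by its residual instead of comparing the two outputs elementwise. The sketch below uses NumPy only. `relative_residual` is a helper name made up for this illustration, and the random, well-conditioned system stands in for the actual k and y, which are not shown here. The same check could be applied to the array copied back from the GPU.

```python
import numpy as np

def relative_residual(a, x, b):
    """Return ||a @ x - b|| / ||b||, a scale-free measure of solution quality."""
    return np.linalg.norm(a.dot(x) - b) / np.linalg.norm(b)

rng = np.random.RandomState(0)
n = 500
# Diagonally dominant test matrix, so the system is well conditioned.
a = rng.rand(n, n) + n * np.eye(n)
b = rng.rand(n)

x_cpu = np.linalg.solve(a, b)
res_cpu = relative_residual(a, x_cpu, b)
print("CPU relative residual:", res_cpu)
```

If the GPU residual turns out to be large, it may be worth checking that the matrix is handed over in column-major (Fortran) order, since LAPACK-style routines generally assume that layout.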

OSX Support

I am trying to run scikits.cuda on my Apple laptop running OS X 10.8.3.
I modified utils.py to look for libdl.dylib rather than libdl.so, and used Homebrew to install gobjdump, which (I believe) helps find the CUBLAS version. I also tried changing the objdump invocation to gobjdump/llvm-objdump.

I'm now stuck on line 227 in utils.py, above which it says:

XXX This approach to obtaining the CUBLAS version number may break Windows/MacOSX compatibility XXX

This appears to have happened. I'll probably keep plugging away at this throughout the day, but any chance of help?
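
For picking whichever objdump variant is installed, something along these lines might help. It is only a sketch: `find_objdump` is a hypothetical helper, the candidate list simply repeats the tool names mentioned above, and `shutil.which` needs Python 3.3 or later, so it would not run on the Python 2 setup described here without a substitute.

```python
import shutil

def find_objdump(candidates=("objdump", "gobjdump", "llvm-objdump")):
    """Return the path of the first candidate found on PATH, or None."""
    for name in candidates:
        path = shutil.which(name)
        if path is not None:
            return path
    return None

print(find_objdump())
```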

bug: cublas.cublasZgerc

In cublas.py, line 3531, the return type is set to None but should be int:
_libcublas.cublasZgerc_v2.restype = None --> int
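
To illustrate why this matters, here is a small ctypes sketch that runs without CUBLAS. It uses `Py_IsInitialized` from the running interpreter as a stand-in for a C function that returns an int status; that substitution is an assumption made purely so the example is runnable. With `restype = None` the return value is thrown away, so a status check could never see an error.

```python
import ctypes

# Py_IsInitialized() is a C function in the running interpreter that returns int.
func = ctypes.pythonapi["Py_IsInitialized"]
func.argtypes = []

func.restype = None          # return value is discarded
status_lost = func()

func.restype = ctypes.c_int  # return value is converted to a Python int
status_kept = func()

print(status_lost, status_kept)
```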

Scikits Cuda on OS x

Hi, I am a new user of scikits.cuda and I am trying to set it up on a machine running OS X 10.8.2 with NVIDIA CUDA 5.0 already installed and working. I installed scikits.cuda using easy_install, and when I try to import it I get an error message saying "CUDA driver library not found".

Please find the log below,

Python 2.7.5 (default, Aug 1 2013, 01:01:17)
[GCC 4.2.1 Compatible Apple Clang 4.1 ((tags/Apple/clang-421.11.66))] on darwin
Type "help", "copyright", "credits" or "license" for more information.

import pycuda.autoinit
import scikits.cuda.fft as gpufft
Traceback (most recent call last):
File "", line 1, in
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/fft.py", line 19, in
import misc
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/misc.py", line 16, in
import cuda
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cuda.py", line 8, in
from cudadrv import *
File "/opt/local/Library/Frameworks/Python.framework/Versions/2.7/lib/python2.7/site-packages/scikits.cuda-0.042-py2.7.egg/scikits/cuda/cudadrv.py", line 29, in
raise OSError('CUDA driver library not found')
OSError: CUDA driver library not found

Please help me to resolve this issue.

Thanks & Regards,
Kartheek.
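
A quick way to see which library names the loader can actually resolve is to try a few candidates in turn. This is a sketch, not what cudadrv.py does: `load_first` is a hypothetical helper and the candidate names and paths are assumptions that would need adjusting to the local CUDA installation.

```python
import ctypes

def load_first(candidates):
    """Return (name, handle) for the first library that loads, else raise OSError."""
    errors = []
    for name in candidates:
        try:
            return name, ctypes.CDLL(name)
        except OSError as exc:
            errors.append("%s: %s" % (name, exc))
    raise OSError("none of the candidates could be loaded:\n" + "\n".join(errors))

# Candidate names are assumptions; adjust them to the local installation.
CUDA_DRIVER_CANDIDATES = [
    "libcuda.so",
    "libcuda.dylib",
    "/usr/local/cuda/lib/libcuda.dylib",
]

try:
    name, lib = load_first(CUDA_DRIVER_CANDIDATES)
    print("loaded", name)
except OSError as exc:
    print(exc)
```

The collected error messages show, for each candidate, why the loader rejected it, which is usually more informative than a bare "not found".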

drop scikits namespace, rename package to scikit-cuda

Many scikits (e.g., scikit-image, scikit-learn, statsmodels) have dropped the scikits namespace to simplify usage; the scikits.cuda namespace package should likewise be simplified to skcuda. Renaming the package to scikit-cuda would also be desirable.

ifft(..., scale=True) repeatedly compiles code (_scale_inplace)

The present implementation of fft._scale_inplace calls ElementwiseKernel on every invocation, which invokes the compiler each time. The in-place scaling function needs to be memoized, or cublas.scale should be used.
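
A minimal sketch of the memoization idea, assuming Python 3's `functools.lru_cache`. `compile_scale_kernel` is a hypothetical stand-in for the expensive compilation step (it does not call PyCUDA), and the counter only exists to show that the compile happens once per dtype.

```python
import functools

compile_count = 0

def compile_scale_kernel(dtype_name):
    """Hypothetical stand-in for an expensive kernel compilation step."""
    global compile_count
    compile_count += 1

    def scale(values, factor):
        return [v * factor for v in values]

    return scale

@functools.lru_cache(maxsize=None)
def get_scale_kernel(dtype_name):
    """Compile once per dtype and reuse the result on later calls."""
    return compile_scale_kernel(dtype_name)

for _ in range(3):
    kernel = get_scale_kernel("complex64")
    result = kernel([2.0, 4.0], 0.5)

print(compile_count, result)
```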

Incorrect Laplacian

I've been testing a simple code to calculate the Laplacian of a field with FFT and Finite Difference methods. The values I get with CUFFT are however incorrect. Is this a CUFFT issue (i.e. should I post to CUDA forums) or is this an issue with my python implementation (are the FFT plans set correctly)? NumPy seems to calculate the values correctly.
The PyCUDA code is available at http://users.utu.fi/jtksai/codes/Laplacian.zip .
Note that the -k^2*F_k values are calculated similarly in CUFFT and NumPy calculations i.e. they should be correct.
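
For reference, a NumPy-only spectral Laplacian on a periodic 1D grid can be checked against a known answer as sketched below. The grid size and the test function sin(3x) are arbitrary choices for this illustration and are not taken from the linked code.

```python
import numpy as np

n = 64
length = 2 * np.pi
x = np.arange(n) * (length / n)
f = np.sin(3 * x)

# Angular wavenumbers matching the FFT output ordering.
k = 2 * np.pi * np.fft.fftfreq(n, d=length / n)

# Laplacian in Fourier space: multiply each mode by -k^2, then transform back.
lap = np.fft.ifft(-(k ** 2) * np.fft.fft(f)).real

expected = -9 * np.sin(3 * x)
max_err = np.max(np.abs(lap - expected))
print("max error:", max_err)
```

Note that np.fft.ifft divides by the number of points, whereas CUFFT's inverse transforms are unnormalized, so a missing division by the grid size is one common source of a mismatch between the two.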

"python setup.py build" returns KeyError: 'install'

It seems that setup.py is assuming that the user will always install, not build.

(see full log below)



% python setup.py build
running build
running config_cc
unifing config_cc, config, build_clib, build_ext, build commands --compiler options
running config_fc
unifing config_fc, config, build_clib, build_ext, build commands --fcompiler options
running build_py
creating build
creating build/lib
creating build/lib/scikits
copying ./scikits/__init__.py -> build/lib/scikits
creating build/lib/scikits/cuda
copying ./scikits/cuda/__info__.py -> build/lib/scikits/cuda
copying ./scikits/cuda/__init__.py -> build/lib/scikits/cuda
copying ./scikits/cuda/autoinit.py -> build/lib/scikits/cuda
copying ./scikits/cuda/cublas.py -> build/lib/scikits/cuda
copying ./scikits/cuda/cuda.py -> build/lib/scikits/cuda
copying ./scikits/cuda/cufft.py -> build/lib/scikits/cuda
copying ./scikits/cuda/cula.py -> build/lib/scikits/cuda
copying ./scikits/cuda/fft.py -> build/lib/scikits/cuda
copying ./scikits/cuda/info.py -> build/lib/scikits/cuda
copying ./scikits/cuda/linalg.py -> build/lib/scikits/cuda
copying ./scikits/cuda/misc.py -> build/lib/scikits/cuda
copying ./scikits/cuda/special.py -> build/lib/scikits/cuda
copying ./scikits/cuda/version.py -> build/lib/scikits/cuda
Traceback (most recent call last):
  File "setup.py", line 98, in 
    "install_headers": custom_install_headers})
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/distutils/core.py", line 184, in setup
    return old_setup(**new_attr)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/core.py", line 152, in setup
    dist.run_commands()
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/dist.py", line 975, in run_commands
    self.run_command(cmd)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/dist.py", line 995, in run_command
    cmd_obj.run()
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/numpy/distutils/command/build.py", line 37, in run
    old_build.run(self)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/command/build.py", line 134, in run
    self.run_command(cmd_name)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/cmd.py", line 333, in run_command
    self.distribution.run_command(command)
  File "/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/distutils/dist.py", line 995, in run_command
    cmd_obj.run()
  File "setup.py", line 42, in run
    inst_obj = self.distribution.command_obj['install']
KeyError: 'install'
zsh: exit 1     python setup.py build
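
The traceback suggests that the custom command indexes command_obj['install'] unconditionally, which only works when the install command is part of the run. A sketch of a more tolerant lookup, using a plain dict as a stand-in for distribution.command_obj (`lookup_install_obj` is a hypothetical helper, not code from setup.py):

```python
def lookup_install_obj(command_obj):
    """Return the 'install' command object, or None when only building."""
    return command_obj.get("install")

# Stand-in for distribution.command_obj during "python setup.py build":
# only the commands that actually ran are present.
build_only = {"build": object(), "build_py": object()}

install_obj = lookup_install_obj(build_only)
if install_obj is None:
    print("no install command in this run; skipping install-specific steps")
```

In a real setup.py, distutils' Distribution.get_command_obj('install') may be a better fit, since it creates the command object on demand.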

AttributeError: /usr/local/cula/lib64/libcula.so: undefined symbol: culaGetStatusString

hello all,

I am new to CUDA and have been stuck on a bunch of problems since I started. After installing scikits and CULA, I get errors when I try to run a sample test:

Traceback (most recent call last):
File "", line 1, in
ImportError: No module named intergrate

import scikits.integrate
Traceback (most recent call last):
File "", line 1, in
ImportError: No module named integrate
import scikits.cuda.integrate
Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/scikits/cuda/integrate.py", line 15, in
from misc import select_block_grid_sizes, init, get_current_device
File "/usr/local/lib/python2.7/dist-packages/scikits/cuda/misc.py", line 18, in
import cula
File "/usr/local/lib/python2.7/dist-packages/scikits/cuda/cula.py", line 39, in
_libcula.culaGetStatusString.restype = ctypes.c_char_p
File "/usr/lib/python2.7/ctypes/__init__.py", line 366, in __getattr__
func = self.__getitem__(name)
File "/usr/lib/python2.7/ctypes/__init__.py", line 371, in __getitem__
func = self._FuncPtr((name_or_ordinal, self))
AttributeError: /usr/local/cula/lib64/libcula.so: undefined symbol: culaGetStatusString

I get this error when I try to import these modules:

misc

integrate

linalg

special

I am using the free edition of CULA Dense, and I should note that I am able to run the C examples shipped with CULA.

I appreciate any help!
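
One way a wrapper could tolerate a library that lacks a given symbol is to bind it only if it is present. This is a sketch: `bind_optional` is a hypothetical helper, and ctypes.pythonapi stands in for libcula so that the example runs without CULA installed.

```python
import ctypes

def bind_optional(lib, name, restype):
    """Return the named function with restype set, or None if the symbol is missing."""
    try:
        func = getattr(lib, name)
    except AttributeError:
        return None
    func.restype = restype
    return func

# ctypes.pythonapi stands in for libcula here so the example runs without CULA.
present = bind_optional(ctypes.pythonapi, "Py_GetVersion", ctypes.c_char_p)
missing = bind_optional(ctypes.pythonapi, "culaGetStatusString", ctypes.c_char_p)
print(present is not None, missing is None)
```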

Resource deallocation, synchronization, etc.

There seem to be a bunch of potential bugs in the code related to resource deallocation, synchronization, etc. so I thought I should collect some comments here so they don't get forgotten. I have really only started using Cuda, so I do not know if these are all real issues (some do not seem to cause a problem yet). Please let me know if I should try to fix some of these.

Consider linalg.init(), which calls misc.init(). This calls cublasCreate() and stores the result in the global _global_cublas_handle. Repeated calls will thus overwrite the handle, and the lost handles will never be deallocated by a corresponding call to cublasDestroy(). Also, there are no checks, so a repeated call to misc.shutdown() has issues (see Example A below).

Probably linalg.init() should return the handle (so it can be used later) and the handles should be stored in a structure so that shutdown will close all the handles. One might like to follow the NumbaPro convention of returning a blas object that has all of the appropriate functions. It might be nice to provide a wrapper with the same interface that dispatches directly to the appropriate cublas functions. (Right now linalg is quite incomplete in this sense, though I added a few more in my latest pull request #35 for issue #29.)

According to the cublas docs, several functions in cublas such as nrm2, amax, etc. need proper synchronization before the results are used. It does not seem like this is done.
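
To make the handle-tracking suggestion concrete, here is a rough sketch. `HandleRegistry` is a name invented for this illustration, and the create/destroy callables are placeholders for cublasCreate() and cublasDestroy(), so nothing here touches the GPU.

```python
class HandleRegistry:
    """Track created handles so that each one is destroyed exactly once."""

    def __init__(self, create, destroy):
        self._create = create
        self._destroy = destroy
        self._handles = []

    def init(self):
        handle = self._create()
        self._handles.append(handle)
        return handle

    def shutdown(self):
        # Safe to call repeatedly: the list is empty after the first call.
        while self._handles:
            self._destroy(self._handles.pop())

destroyed = []
counter = iter(range(100))
registry = HandleRegistry(create=lambda: next(counter), destroy=destroyed.append)

registry.init()
registry.init()      # a second handle, tracked rather than overwriting the first
registry.shutdown()
registry.shutdown()  # no-op instead of a double free
print(destroyed)
```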

Example A

Note: this requires some fixes found in my pull request #35 for issue #29.

In [1]: import pycuda.autoinit
from scikits.cuda import misc
def show_mem(id=0):
    free, total = pycuda.driver.mem_get_info()
    print("%0.2fMB of %0.2fMB free" % (free/1024.**2, total/1024.**2))

misc.init(); show_mem()
misc.shutdown(); show_mem()
misc.init(); show_mem()
misc.shutdown(); show_mem()

misc.init(); misc.init(); show_mem()
misc.shutdown(); show_mem()  # Memory not reclaimed because of lost handle
misc.shutdown(); show_mem()  # OOPS! segfault.



In [2]: from scikits.cuda import misc

In [3]: def show_mem(id=0):
   ...:         free, total = pycuda.driver.mem_get_info()
   ...:         print("%0.2fMB of %0.2fMB free" % (free/1024.**2, total/1024.**2))
   ...:     

In [4]: misc.init(); show_mem()
558.36MB of 1023.69MB free

In [5]: misc.shutdown(); show_mem()
560.36MB of 1023.69MB free

In [6]: misc.init(); show_mem()
558.36MB of 1023.69MB free

In [7]: misc.shutdown(); show_mem()
560.36MB of 1023.69MB free

In [8]: 

In [8]: misc.init(); misc.init(); show_mem()
558.36MB of 1023.69MB free

In [9]: misc.shutdown(); show_mem()  # Memory not reclaimed because of lost handle
558.36MB of 1023.69MB free

In [10]: misc.shutdown(); show_mem()  # OOPS! segfault.
Segmentation fault: 11

bumblebee breaks use of libdl to find shared library

When using bumblebee to get CUDA programs to access a discrete NVIDIA GPU on laptops with multiple graphics devices, attempting to load libdl via ctypes finds libdlfaker, which doesn't contain the functions needed to find library paths.

Reported by Kartheek Medathati.

Computation time for Scikits.CUDA fft

I just checked the execution time for FFT on the CPU versus the GPU by adding timing calls to the code:

import pycuda.autoinit
import pycuda.gpuarray as gpuarray
import numpy as np
import time
import scikits.cuda.fft as cu_fft

print 'Testing fft/ifft..'
N = 1024
M = N/2

tic=time.clock()
x = np.asarray(np.random.rand(N, M), np.float32)
xf = np.fft.fft2(x)
y = np.real(np.fft.ifft2(xf))
toc=time.clock()
print 'Time for CPU FFT:', toc-tic

x_gpu = gpuarray.to_gpu(x)
xf_gpu = gpuarray.empty((x.shape[0], x.shape[1]/2+1), np.complex64)
y_gpu = gpuarray.empty_like(x_gpu)

tic=time.clock()
plan_forward = cu_fft.Plan(x_gpu.shape, np.float32, np.complex64)
cu_fft.fft(x_gpu, xf_gpu, plan_forward)
plan_inverse = cu_fft.Plan(x_gpu.shape, np.complex64, np.float32)
cu_fft.ifft(xf_gpu, y_gpu, plan_inverse, True)
toc=time.clock()
print 'TIme for GPU FFT:', toc-tic

print 'Success status: ', np.allclose(y, y_gpu.get(), atol=1e-6)

To my surprise, I found the result

python testpythonfft.py
Testing fft/ifft..
Time for CPU FFT: 0.09
TIme for GPU FFT: 0.3
Success status: True

Am I doing something wrong?
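
Part of the gap may come from what is being timed: the GPU section includes plan creation and any first-call overhead, and GPU calls can return before the work has finished. A sketch of a timing helper that separates warm-up from measurement is given below. `time_call` is a hypothetical name, the workload is a CPU placeholder, and for GPU code the timed function would also need to synchronize with the device before returning.

```python
import time

def time_call(fn, warmup=1, repeats=5):
    """Return the best wall-clock time of fn() over `repeats` runs after warm-up."""
    for _ in range(warmup):
        fn()  # absorbs one-time costs such as compilation or plan creation
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

elapsed = time_call(lambda: sum(range(100000)))
print("best of 5: %.6f s" % elapsed)
```

With plans created once outside the timed function, the comparison reflects the transforms themselves and not the setup.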

support for latest cula

scikits.cuda.cula expects to find libcula_lapack.so, libcula.so, but the latest cula17 ships the following libraries:

lapackcpu.so libcublas.so.5.0 libcublas.so.5.0.40 libcudart.so.5.0 libcudart.so.5.0.35 libcula_lapack_link.so libcula_lapack.so libcula_scalapack.so libiomp5.so

support double precision CULA functions when premium libs are installed

Feature request.

Windows support

Hi there,

I've been wanting to use cufft on Windows via pycuda. I noted that scikits.cuda doesn't support Windows, but monkey-patching it so that at least cufft works on 64-bit Windows 7 proved quite trivial (see below).

import ctypes
import platform
if platform.system() == "Windows":
    _libcufft = ctypes.windll.LoadLibrary('cufft32_42_9.dll')

I am anything but an expert on these matters, so I don't know how hard it is to generalize this; but if at least cufft could be given Windows support this easily, I'd love to see it merged into the main branch, so I don't have to ship my project with this monkey-patch of mine.

Thanks for the great work on this!
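
A slightly more general version of the idea might map each platform to a library file name in one place. This is only a sketch: `cufft_library_name` is a hypothetical helper, and the Linux and OS X names are assumptions; only the Windows DLL name comes from the patch above.

```python
import platform

def cufft_library_name(system=None):
    """Map a platform name to a CUFFT library file name (names are assumptions)."""
    system = system or platform.system()
    names = {
        "Linux": "libcufft.so",
        "Darwin": "libcufft.dylib",
        "Windows": "cufft32_42_9.dll",  # the name used in the patch above
    }
    try:
        return names[system]
    except KeyError:
        raise RuntimeError("unsupported platform: %s" % system)

print(cufft_library_name("Windows"))
```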

FFT plan creation fails due to cufftInvalidValue

Simply running fft_demo.py (or also trying with some of my own code) yields:

File "fft_demo.py", line 22, in
plan_forward = cu_fft.Plan(x_gpu.shape, np.float32, np.complex64)
File "/usr/local/lib/python2.6/dist-packages/scikits.cuda-0.02-py2.6.egg/scikits/cuda/fft.py", line 70, in __init__
self.fft_type, 1)
File "/usr/local/lib/python2.6/dist-packages/scikits.cuda-0.02-py2.6.egg/scikits/cuda/cufft.py", line 107, in cufftPlan1d
cufftCheckStatus(status)
File "/usr/local/lib/python2.6/dist-packages/scikits.cuda-0.02-py2.6.egg/scikits/cuda/cufft.py", line 79, in cufftCheckStatus
raise cufftExceptions[status]
scikits.cuda.cufft.cufftInvalidValue
Exception AttributeError: "Plan instance has no attribute 'handle'" in <bound method Plan.__del__ of <scikits.cuda.fft.Plan instance at 0x2e50200>> ignored

I'm not sure why this is happening, but the issue seems to originate from the binary library in lines 106 & 107 of cufft.py.

I'm using CUDA Toolkit 3.2, Python 2.6.6, and PyCUDA 0.94.
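
As a side note, the trailing "ignored" AttributeError looks like a secondary effect: the destructor runs on an object whose constructor failed before handle was set. A sketch of a destructor that tolerates this is shown below, using a toy class with placeholder create/destroy steps in place of the real Plan. It would not address the cufftInvalidValue itself.

```python
class Plan:
    """Toy stand-in for a plan object that owns a native handle."""

    destroyed = []

    def __init__(self, create):
        self.handle = None       # defined before anything can fail
        self.handle = create()   # may raise, leaving handle as None

    def __del__(self):
        handle = getattr(self, "handle", None)
        if handle is not None:
            Plan.destroyed.append(handle)  # real code would release the handle here

def failing_create():
    raise ValueError("plan creation failed")

try:
    Plan(failing_create)
except ValueError as exc:
    print("creation failed cleanly:", exc)

ok = Plan(lambda: 42)
del ok
print(Plan.destroyed)
```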

0.04 (PyPi): ImportError: No module named version

sudo pip install --upgrade scikits.cuda gives me
ImportError: No module named version

Would it be possible to bump the version with a hotfix and upload it to PyPI?

'module' object has no attribute 'cublasCgemmBatched'

Hi guys,

I recently installed the latest version of scikits.cuda: I simply got the zip from GitHub and executed sudo python setup.py install.

The code that I am running stops when it needs cublasCgemmBatched, which was added recently. I get:

AttributeError: 'module' object has no attribute 'cublasCgemmBatched'
Apply node that caused the error: BatchedComplexDotOp(GpuContiguous.0, GpuContiguous.0)
(I am running Sander Dieleman's FFT convolution code for Theano:
http://benanne.github.io/2014/05/12/fft-convolutions-in-theano.html).

Is there something obvious that could explain the fact that it seems not to find it?

Thanks in advance,
Sam
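
One thing worth ruling out is an older copy of the package shadowing the new install. The sketch below reports where a module was loaded from and whether it defines a given name. `describe` is a hypothetical helper, and json/dumps are placeholders so that the snippet runs anywhere.

```python
import importlib

def describe(module_name, attr):
    """Return (file the module was loaded from, whether it defines attr)."""
    module = importlib.import_module(module_name)
    return getattr(module, "__file__", None), hasattr(module, attr)

# json/dumps are placeholders so the example runs anywhere; for this report the
# arguments would be "scikits.cuda.cublas" and "cublasCgemmBatched".
path, found = describe("json", "dumps")
print(path, found)
```

If the reported path points at an old egg or a different site-packages directory, the import is not picking up the freshly installed source.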

cusparse.py is incomplete and won't import

██ [ br@breach: scikits.cuda ] [ 19:38:38 ]
██ tail scikits/cuda/cusparse.py
                                            cusparseMatDescr,
                                            ctypes.c_void_p,
                                            ctypes.c_int,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p,
                                            ctypes.c_void_p]
def cusparseSdense2csr(handle, m, n, descrA, A, lda, 
                       nnzPerRow, csrValA, csrRowPtrA, csrColIndA):

add bindings for MAGMA libraries

use config file to specify location of various libraries

Rather than rely solely on ctypes to search for the shared libraries wrapped by the package (which doesn't always work given the variety of ways CUDA and associated library packages may be installed), scikit-cuda should use an INI configuration file in which a user can list library paths. These paths could either be read at installation time and hard-coded in the relevant wrapper modules, or be read at import time. See mpi4py for an example of the former; for the latter, it may be desirable to use pyxdg to read config files from ~/.config.
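
A rough sketch of the import-time variant, using the standard library's configparser (Python 3 naming). The section name, option names, and paths are all made up for illustration, and `read_library_paths` is a hypothetical helper.

```python
import configparser

# Hypothetical file contents; the section and option names are illustrative.
EXAMPLE_INI = """
[libraries]
cublas = /usr/local/cuda/lib64/libcublas.so
cula = /usr/local/cula/lib64/libcula_lapack.so
"""

def read_library_paths(text):
    """Parse INI text and return a dict mapping library name to path."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    if not parser.has_section("libraries"):
        return {}
    return dict(parser.items("libraries"))

paths = read_library_paths(EXAMPLE_INI)
print(paths["cublas"])
```

A wrapper module could consult such a mapping first and fall back to the usual ctypes search when no entry is present.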

CULA shared object format change

I'm using the latest version of CULA and I'm getting an error, but I think I know why. Running:

import scikits.cuda.linalg as culinalg

throws:

Traceback (most recent call last):
File "", line 1, in
File "/usr/local/lib/python2.7/dist-packages/scikits/cuda/linalg.py", line 16, in
import cula
File "/usr/local/lib/python2.7/dist-packages/scikits/cuda/cula.py", line 25, in
raise RuntimeError('%s not found' % _libcula_libname)
RuntimeError: libcula.so not found

According to this thread, libcula.so no longer exists as they've broken it into smaller pieces:
http://www.culatools.com/forums/viewtopic.php?f=15&t=1025

From the cula.py source, it looks like you're still looking for libcula.so. Just wanted to let you know that this may be an issue.

OSError: libcula_lapack.so, libcula_lapack_basic.so, libcula.so not found

Hello,

I installed the latest version of scikits.cuda from the GitHub source code, but when I try to run svd_demo.py I get the following error:

OSError: libcula_lapack.so, libcula_lapack_basic.so, libcula.so not found

I have installed cula in /usr/local/cula:
doc
examples
include
lib -> here I have the libcula_lapack_basic.so
lib64

Should I edit some configuration file or something?

Thank you very much in advance. Best regards,

Transpose performance

I'm not sure if this is a known issue, and from what I've read it isn't trivial to do an efficient transpose on GPUs, but the performance I'm seeing for a transpose is pretty bad. Below is an example using scikits.cuda version 0.043, pycuda version (2013,1), cuda 5.0, and python 2.7:

import time as t
import numpy as np
import scikits.cuda.linalg as linalg
import scikits.cuda.misc as cumisc
import pycuda.driver as cuda
from pycuda import gpuarray
import pycuda.autoinit
linalg.init()

mat = np.random.normal(size=(1000,1000))
mat_gpu = gpuarray.to_gpu(mat)
t0 = t.time()
mt = np.transpose(mat)
t_cpu = t.time() - t0
print str(t_cpu)

t0 = t.time()
mt = linalg.transpose(mat_gpu)
t_gpu = t.time() - t0
print str(t_gpu)

Running this I get that t_cpu < 0.001 while t_gpu is around 0.1, which is 2 orders of magnitude worse than the cpu transpose.

I understand that when you call np.transpose(mat) or mat.T, you're just changing the view and not actually doing a transpose. Does that mean that this performance difference is correct?
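
The view-versus-copy distinction on the NumPy side can be checked directly, as in the sketch below. A fairer CPU baseline for the comparison would probably time the copying variant.

```python
import numpy as np

mat = np.random.normal(size=(1000, 1000))

view = np.transpose(mat)           # no data is moved; only strides change
copy = np.ascontiguousarray(view)  # forces the elements to be rearranged

shares_view = np.shares_memory(mat, view)
shares_copy = np.shares_memory(mat, copy)
print(shares_view, shares_copy)
print(view.flags["C_CONTIGUOUS"], copy.flags["C_CONTIGUOUS"])
```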

"nosetests -x -s -v tests" returns cufftInvalidValue (error in test_fft_float64_to_complex128)


% nosetests -x -s -v tests
test_fft_float32_to_complex64 (test_fft.test_fft) ... ok
test_fft_float64_to_complex128 (test_fft.test_fft) ... ERROR
Exception AttributeError: "Plan instance has no attribute 'handle'" in > ignored

======================================================================
ERROR: test_fft_float64_to_complex128 (test_fft.test_fft)
----------------------------------------------------------------------
Traceback (most recent call last):
  File "/Users/nich2o/envs/scikits_cuda/scikits.cuda/tests/test_fft.py", line 40, in test_fft_float64_to_complex128
    plan = fft.Plan(x.shape, np.float64, np.complex128)
  File "/Users/nich2o/envs/scikits_cuda/scikits.cuda/scikits/cuda/fft.py", line 70, in __init__
    self.fft_type, 1)
  File "/Users/nich2o/envs/scikits_cuda/scikits.cuda/scikits/cuda/cufft.py", line 114, in cufftPlan1d
    cufftCheckStatus(status)
  File "/Users/nich2o/envs/scikits_cuda/scikits.cuda/scikits/cuda/cufft.py", line 86, in cufftCheckStatus
    raise cufftExceptions[status]
cufftInvalidValue

persistent scikits.cuda.cublas.cublasNotInitialized problem

Hi,

When I try:

python diag_demo.py

I get

Traceback (most recent call last):
  File "diag_demo.py", line 13, in <module>
    culinalg.init()
  File "/home/reckoner/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/linalg.py", line 35, in init
    cublas.cublasInit()
  File "/home/reckoner/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cublas.py", line 136, in cublasInit
    cublasCheckStatus(status)
  File "/home/reckoner/lib/python2.6/site-packages/scikits.cuda-0.03-py2.6.egg/scikits/cuda/cublas.py", line 121, in cublasCheckStatus
    raise cublasExceptions[status]
scikits.cuda.cublas.cublasNotInitialized

However, when I do this in IPython:

In [2]: import ctypes
In [3]: x=ctypes.cdll.LoadLibrary('libcublas.so')
In [4]: x.cublasInit()
Out[4]: 0 # initialized correctly

It seems to initialize correctly. Furthermore,

import scikits.cuda.linalg as culinalg
culinalg.init()
import pycuda.autoinit

does not produce an initialization error, but changing the order of this to the following:

import pycuda.autoinit
import scikits.cuda.linalg as culinalg
culinalg.init()

produces:

cublasNotInitialized

I have the latest CUDA

 Cuda compilation tools, release 3.2, V0.2.1221
 pycuda.VERSION=94
 GeForce 9800M GTS 
 Linux 2.6.31-22-server 69-Ubuntu SMP x86_64 GNU/Linux

Any help appreciated.

AssertionError

I ran diag_demo.py on an Ubuntu 14.04 x64 machine with a GTX 780 and got the following error:

Testing real diagonal matrix creation for type float32
Traceback (most recent call last):
File "diag_demo.py", line 32, in
d_gpu = culinalg.diag(v_gpu);
File "/home/localadmin/anaconda/lib/python2.7/site-packages/scikits/cuda/linalg.py", line 870, in diag
block_dim, grid_dim = misc.select_block_grid_sizes(dev, d_gpu.shape)
File "/home/localadmin/anaconda/lib/python2.7/site-packages/scikits/cuda/misc.py", line 315, in select_block_grid_sizes
assert max_blocks_per_grid_dim == max_grid_dim[1]
AssertionError