Comments (15)
Aaaah, it is very likely they have a bug in the OS X version. I can't think of any other explanation, as it works cleanly on Linux. You can report bugs to them via the NVIDIA developer tool.
from cudnn.torch.
Yes, on Linux only MaxPooling fails sometimes, as they mention in the docs. On OS X, ReLU, Tanh, Sigmoid and SoftMax all fail a lot. Will report a bug.
from cudnn.torch.
Just caught the same problem on Linux!
--------------------------------------------------------------------------------
ReLU_single
error on state (backward)
LT(<) violation val=nan, condition=0.01
/usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:350: in function <test/test.lua:321>
--------------------------------------------------------------------------------
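A note on why `val=nan` shows up as an LT(<) violation: under IEEE 754, any ordered comparison involving NaN is false, so a "less than threshold" check fails as soon as the error value is NaN. The sketch below illustrates this in Python; `assert_lt` is a hypothetical stand-in, not the Torch `Tester` implementation.

```python
import math

def assert_lt(val, condition):
    """Hypothetical stand-in for a less-than assertion: passes only if val < condition."""
    return val < condition

nan = float("nan")

print(assert_lt(0.001, 0.01))  # a small error is below the threshold
print(assert_lt(nan, 0.01))    # NaN is never less than anything, so the check fails
print(math.isnan(nan))         # an explicit NaN test is the reliable way to detect it
```

So a single NaN anywhere in the backward pass is enough to trip the test, regardless of how small the other errors are.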
from cudnn.torch.
OK, I am going to run the unit tests a few thousand times and see how that goes.
Also, are you making sure to use CUDA 6.5 on Linux?
from cudnn.torch.
Can you give me more details about your Linux setup, so I can try to reproduce this?
from cudnn.torch.
Yes: CUDA 6.5, 4 Titan Blacks, 340.29 driver, with torch, cutorch, nn and cunn updated to the latest version. I also have another (nearly identical) machine on which it fails too, here with Sigmoid:
--------------------------------------------------------------------------------
Sigmoid_single
error on state (backward)
LT(<) violation val=nan, condition=0.01
...ocks/torch-distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
test/test.lua:484: in function <test/test.lua:455>
--------------------------------------------------------------------------------
from cudnn.torch.
thanks, having a look
from cudnn.torch.
it's Ubuntu 14.04 btw, and it fails on all 4 cards in one test, not just one.
from cudnn.torch.
ok that's an interesting detail.
from cudnn.torch.
ok over several hundred runs, reproduced this once on my Tesla K40. Now trying to print out the specific input shape etc, and reproduce this consistently.
from cudnn.torch.
I'm not able to reproduce this even over thousands of runs. I got the NaN once, but can't seem to get it again.
Can you run this: https://github.com/soumith/cudnn.torch/blob/burepro/test/test.lua#L415
Then send me the files badTanh.t7, badSoftmax.t7, badSigmoid.t7, etc.
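For readers who want to capture an intermittent NaN themselves, the general idea can be sketched as follows: run the operation on random inputs many times and save the first input that produces a NaN. This is only an illustration in Python under stated assumptions; `tanh_backward`, `find_bad_input`, and the pickle file format are hypothetical and are not what the linked test script does.

```python
import math
import pickle
import random

def tanh_backward(x, grad_out):
    # Reference gradient of tanh: grad_in = grad_out * (1 - tanh(x)^2)
    return [g * (1.0 - math.tanh(v) ** 2) for v, g in zip(x, grad_out)]

def find_bad_input(fn, runs, size, path, seed=0):
    """Run fn on random inputs; pickle the first input whose output contains a NaN.

    Returns the index of the failing run, or None if no NaN was seen.
    """
    rng = random.Random(seed)
    for i in range(runs):
        x = [rng.uniform(-10.0, 10.0) for _ in range(size)]
        g = [rng.uniform(-1.0, 1.0) for _ in range(size)]
        out = fn(x, g)
        if any(math.isnan(v) for v in out):
            # Save everything needed to replay the failure later.
            with open(path, "wb") as f:
                pickle.dump({"run": i, "input": x, "grad_output": g}, f)
            return i
    return None
```

Saving the exact input and gradient makes a rare failure replayable, which matters when the bug shows up only once in hundreds of runs.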
from cudnn.torch.
Hm, no luck after 100 runs; will leave it running overnight.
from cudnn.torch.
OK, I reproduced the NaNs. It is very likely that the cuDNN guys are using the fast approximations, which generate NaNs in very extreme precision cases. We went down that path in the past and reverted; we have a long history with these things. I will report this to them.
Fast approximations docs:
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html
from cudnn.torch.
I believe it is because you use beta=0, which is not handled properly in cuDNN R2 RC for the activation functions.
When beta=0, we are supposed to write the result directly (without reading the input), because 0 × NaN = NaN.
It will be fixed in the R2 final release.
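The beta=0 problem can be illustrated with a small Python sketch. It assumes the usual blending convention, output = alpha * f(x) + beta * output, and uses a NaN in the old output buffer to stand in for uninitialized memory. The function names are hypothetical; this is not cuDNN code.

```python
def relu(v):
    return max(v, 0.0)

def blend_naive(x, y, alpha, beta):
    # Always reads the old output y, so a NaN in y survives even when beta == 0,
    # because 0 * NaN is NaN.
    return [alpha * relu(a) + beta * b for a, b in zip(x, y)]

def blend_safe(x, y, alpha, beta):
    # When beta == 0, write the result directly and never read y.
    if beta == 0.0:
        return [alpha * relu(a) for a in x]
    return [alpha * relu(a) + beta * b for a, b in zip(x, y)]

x = [-1.0, 2.0]
y = [float("nan"), 5.0]  # stands in for uninitialized output memory

print(blend_naive(x, y, 1.0, 0.0))  # first element is NaN
print(blend_safe(x, y, 1.0, 0.0))   # no NaN: the old buffer is never read
```

This would also explain why the failure is intermittent: it depends on whatever happens to be in the output buffer before the call.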
from cudnn.torch.
Thanks @philvdm, will wait for the final release.
from cudnn.torch.