Just tested on Ubuntu: all tests pass. But on OS X they do not.

OS X R2 Tanh and SoftMax tests fail about cudnn.torch HOT 15 CLOSED

soumith commented on May 19, 2024

OS X R2 Tanh and SoftMax tests fail

from cudnn.torch.

Comments (15)

soumith commented on May 19, 2024

aaaah, it is very likely they have a bug in the OS X version. Can't think of any other explanation, as it works cleanly on Linux. You can report bugs to them via the NVIDIA developer tool.

szagoruyko commented on May 19, 2024

Yes, on Linux only MaxPooling fails sometimes, as they mention in the docs. On OS X, ReLU, Tanh, Sigmoid and SoftMax all fail a lot. Will report a bug.

szagoruyko commented on May 19, 2024

Just caught the same problem in Linux!

--------------------------------------------------------------------------------
ReLU_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:350: in function <test/test.lua:321>

--------------------------------------------------------------------------------
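The `LT(<) violation   val=nan` line follows from how NaN compares: any ordered comparison involving NaN is false, so a "less than tolerance" check fails no matter what the tolerance is. A minimal Python sketch of that behaviour follows. The helper `assert_lt` is hypothetical and only mimics the shape of such a check; it is not the code in `torch/Tester.lua`.

```python
def assert_lt(val, condition):
    """Hypothetical stand-in for a 'val < condition' test assertion."""
    return val < condition

small_error = 1e-4          # a typical passing backward error
nan_error = float("nan")    # what the failing runs reported

print(assert_lt(small_error, 0.01))  # comparison holds, test would pass
print(assert_lt(nan_error, 0.01))    # NaN < anything is False, test would fail
```

So a single NaN anywhere in the gradient is enough to trip the assertion, even if every other element is correct.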

soumith commented on May 19, 2024

ok, i am going to run the unit tests a few thousand times and see how that goes.
Also, are you making sure to use CUDA 6.5 on Linux?

soumith commented on May 19, 2024

can you give me other details about your Linux setup, for a possible reproduction?

szagoruyko commented on May 19, 2024

yes, cuda-6.5, 4 Titan Blacks, 340.29 driver, with torch, cutorch, nn and cunn updated to the latest version. I also have another machine (nearly identical) on which it fails too, here with Sigmoid:

--------------------------------------------------------------------------------
Sigmoid_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    ...ocks/torch-distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:484: in function <test/test.lua:455>

--------------------------------------------------------------------------------

soumith commented on May 19, 2024

thanks, having a look

szagoruyko commented on May 19, 2024

it's Ubuntu 14.04 btw, and it fails on all 4 cards in one test, not just one.

soumith commented on May 19, 2024

ok that's an interesting detail.

soumith commented on May 19, 2024

ok over several hundred runs, reproduced this once on my Tesla K40. Now trying to print out the specific input shape etc, and reproduce this consistently.

soumith commented on May 19, 2024

i'm not able to reproduce this, even over thousands of runs. i got the nan once, but can't seem to get it again.
can you run this: https://github.com/soumith/cudnn.torch/blob/burepro/test/test.lua#L415
and then send me the files badTanh.t7, badSoftmax.t7, badSigmoid.t7 etc.?
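The general idea of catching a rare failure by saving the offending input can be sketched as below. This is a Python illustration, not the contents of the linked `test.lua`, which is not reproduced in this thread. The function `capture_bad_input`, its parameters, and the use of `pickle` in place of Torch's `.t7` serialization are all assumptions made for the example.

```python
import math
import pickle
import random

def capture_bad_input(op, make_input, max_runs, path, seed=0):
    """Run op on fresh random inputs; on the first NaN output, save the input.

    Returns the index of the failing run, or None if no run produced a NaN.
    """
    rng = random.Random(seed)
    for run in range(max_runs):
        x = make_input(rng)
        y = op(x)
        if any(math.isnan(v) for v in y):
            # Persist the exact input so the failure can be replayed later.
            with open(path, "wb") as f:
                pickle.dump(x, f)
            return run
    return None
```

With the input on disk, the failing case can be reloaded and rerun on another machine instead of waiting for it to reappear by chance.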

szagoruyko commented on May 19, 2024

hm, no luck after 100, will leave it to run overnight

soumith commented on May 19, 2024

ok, i reproduced the nans. it is very likely that the cudnn guys are using the fast approximations, which generate nans in very extreme precision cases. we went down that path in the past and reverted; we have a long history of these things. I will report this to them.

Fast approximations docs:
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html

philvdm commented on May 19, 2024

I believe it is because you use beta=0, which is not handled properly in cuDNN R2 RC for the activation functions.
When beta=0, we are supposed to write directly (without reading the input), because 0 x NaN = NaN.
It will be fixed in the R2 Final release.
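The beta=0 point can be illustrated with a small Python sketch. It is not cuDNN code. It assumes a scaling form of `alpha * result + beta * prior`, where `prior` is whatever value is read back from memory, and the names `blend_naive` and `blend_guarded` are hypothetical.

```python
import math

def blend_naive(alpha, result, beta, prior):
    """Always reads prior, so a NaN in prior survives even when beta == 0."""
    return [alpha * r + beta * p for r, p in zip(result, prior)]

def blend_guarded(alpha, result, beta, prior):
    """Skips the read when beta == 0, so stale NaNs cannot leak through."""
    if beta == 0:
        return [alpha * r for r in result]
    return [alpha * r + beta * p for r, p in zip(result, prior)]

result = [math.tanh(0.5), math.tanh(-1.0)]
stale = [math.nan, 0.0]  # assumed: uninitialized memory that happens to hold a NaN

naive = blend_naive(1.0, result, 0.0, stale)      # first element becomes NaN
guarded = blend_guarded(1.0, result, 0.0, stale)  # both elements stay finite
print(naive, guarded)
```

Because IEEE 754 defines 0 x NaN as NaN, multiplying by beta=0 does not clear a stale value. This would also fit the intermittent failures reported above, if the NaN only appears when the memory being read happens to contain one.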

soumith commented on May 19, 2024

thanks @philvdm will wait for the final release.
