Comments (15)

soumith commented on May 19, 2024

Aaaah, it is very likely they have a bug in the OS X version. I can't think of any other explanation, as it works cleanly on Linux. You can report bugs to them via the NVIDIA developer tool.

from cudnn.torch.

szagoruyko commented on May 19, 2024

Yes, on Linux only MaxPooling fails sometimes, as they mention in the docs. On OS X, ReLU, Tanh, Sigmoid and SoftMax all fail a lot. Will report a bug.

szagoruyko commented on May 19, 2024

Just caught the same problem on Linux!

--------------------------------------------------------------------------------
ReLU_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    /usr/local/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:350: in function <test/test.lua:321>

--------------------------------------------------------------------------------
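
The `val=nan` in the log explains why the less-than check trips: every ordered comparison against NaN is false, so a NaN error can never fall below the 0.01 tolerance. A minimal Python sketch of that behaviour follows; the helper name `assert_lt` is hypothetical and only stands in for the tester's check, it is not the Torch `Tester` API.

```python
import math

def assert_lt(val, condition):
    """Hypothetical stand-in for a tester's less-than check; True means pass."""
    return val < condition

print(assert_lt(0.001, 0.01))     # a small error passes the check
print(assert_lt(math.nan, 0.01))  # NaN is never less than anything, so this fails
```

This is also why a tolerance check written as `error < precision` catches NaN for free, whereas the inverted form `not (error >= precision)` would let it through.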

soumith commented on May 19, 2024

OK, I am going to run the unit tests a few thousand times and see how that goes.
Also, are you making sure to use CUDA 6.5 on Linux?

soumith commented on May 19, 2024

Can you give me other details about your Linux setup, for a possible reproduction?

szagoruyko commented on May 19, 2024

Yes: cuda-6.5, 4 Titan Blacks, 340.29 driver, with torch, cutorch, nn and cunn updated to the latest version. I also have another machine (mostly identical) on which it fails too, like here with Sigmoid:

--------------------------------------------------------------------------------
Sigmoid_single
error on state (backward)
 LT(<) violation   val=nan, condition=0.01
    ...ocks/torch-distro/install/share/lua/5.1/torch/Tester.lua:26: in function 'assertlt'
    test/test.lua:484: in function <test/test.lua:455>

--------------------------------------------------------------------------------

soumith commented on May 19, 2024

thanks, having a look

szagoruyko commented on May 19, 2024

it's Ubuntu 14.04 btw, and it fails on all 4 cards in one test, not just one.

soumith commented on May 19, 2024

ok that's an interesting detail.

soumith commented on May 19, 2024

OK, over several hundred runs, I reproduced this once on my Tesla K40. Now trying to print out the specific input shape etc. and reproduce this consistently.

soumith commented on May 19, 2024

I'm not able to reproduce this even over thousands of runs. I got the NaN once, but can't seem to get it again.
Can you run this: https://github.com/soumith/cudnn.torch/blob/burepro/test/test.lua#L415
And then send me the files badTanh.t7, badSoftmax.t7, badSigmoid.t7, etc.
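
The general idea of hunting a rare failure and keeping the offending input can be sketched in plain Python. This is not the linked `test.lua`; `flaky_relu`, `hunt_for_nan` and `save_record` are hypothetical names, and the stand-in op fails at random on roughly 1% of calls purely for illustration.

```python
import math
import pickle
import random

def flaky_relu(x, rng):
    """Hypothetical stand-in for the op under test: returns NaN on ~1% of calls."""
    if rng.random() < 0.01:
        return [math.nan] * len(x)
    return [max(0.0, v) for v in x]

def hunt_for_nan(runs, seed=0):
    """Feed fresh random inputs to the op; return the first input that yields NaN."""
    rng = random.Random(seed)
    for run in range(runs):
        x = [rng.uniform(-1.0, 1.0) for _ in range(8)]
        y = flaky_relu(x, rng)
        if any(math.isnan(v) for v in y):
            return {"run": run, "input": x}
    return None  # no failure seen in this many runs

def save_record(record, path):
    """Write the failing case to disk so it can be attached to a bug report."""
    with open(path, "wb") as f:
        pickle.dump(record, f)
```

A caller would run something like `record = hunt_for_nan(5000)` and, if it is not `None`, pass it to `save_record` with a file name of their choosing. Saving the input only helps if the failure depends on the input, which is itself something the saved case lets you check.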

szagoruyko commented on May 19, 2024

Hm, no luck after 100 runs; will leave it to run overnight.

soumith commented on May 19, 2024

OK, I reproduced the NaNs. It is very likely that the cuDNN guys are using the fast approximations, which generate NaNs in very extreme precision cases. We went down that path in the past and reverted; we have a long history with these things. I will report this to them.

Fast approximations docs:
http://docs.nvidia.com/cuda/cuda-math-api/group__CUDA__MATH__INTRINSIC__SINGLE.html

philvdm commented on May 19, 2024

I believe it is because you use beta=0, and that case is not handled properly in cuDNN R2 RC for the activation functions.
When beta=0, we are supposed to write directly (without reading the input), because 0 x NaN = NaN.
It will be fixed in the R2 Final release.
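
A small Python sketch of the failure mode described above, assuming the usual alpha/beta blending convention in which the value written is `alpha * result + beta * prior`. The function names are hypothetical, and `prior` is taken here to mean whatever was already in the buffer being blended with (one reading of "the input" in the comment above); this is an illustration, not cuDNN code.

```python
import math

def blend_naive(alpha, result, beta, prior):
    """Always reads the prior value, so a NaN there leaks through even with beta=0."""
    return alpha * result + beta * prior

def blend_fixed(alpha, result, beta, prior):
    """Skips the read when beta is 0, so stale NaN in the buffer is ignored."""
    if beta == 0.0:
        return alpha * result
    return alpha * result + beta * prior

stale = math.nan  # stands in for uninitialized memory that happens to hold NaN
print(blend_naive(1.0, 0.5, 0.0, stale))  # nan, because 0.0 * nan is nan
print(blend_fixed(1.0, 0.5, 0.0, stale))  # 0.5
```

Under this reading, the NaNs would appear only when the untouched buffer happened to contain NaN bit patterns, which would be consistent with the failures being rare and hard to reproduce.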

soumith commented on May 19, 2024

Thanks @philvdm, will wait for the final release.
