Comments (5)

EderSantana commented on June 15, 2024

Whoa, this is weird. I'm sure I ran the model several times on both 1080 and Titan X GPUs without getting NaN.
The problem might not be in the data; otherwise people training the steering model would have complained as well.

May I ask what your GPU and TF version are?

from research.

EderSantana commented on June 15, 2024

By any chance, do you have a multi-GPU setup where you are asking TF to use only one GPU?
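For context, one common way to expose a single GPU to a CUDA program is the `CUDA_VISIBLE_DEVICES` environment variable. The sketch below is only an illustration of that approach, not something taken from the training script; the helper name `restrict_to_single_gpu` is made up for this example.

```python
import os


def restrict_to_single_gpu(gpu_index=0):
    """Expose only one GPU to CUDA (hypothetical helper for illustration)."""
    # CUDA_VISIBLE_DEVICES is read when the CUDA runtime initializes,
    # so it has to be set before the framework first touches the GPU.
    os.environ["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return os.environ["CUDA_VISIBLE_DEVICES"]


restrict_to_single_gpu(0)
# import tensorflow as tf  # import only after the variable is set
```

Setting the variable in the shell before launching the script has the same effect.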

Also, are you able to continue training from the checkpoint? If you try to continue training, does it crash at the same point again? I remember getting random crashes due to TF rounding problems, but I could continue training from the checkpoint.

kamal94 commented on June 15, 2024

Graphics card: GTX 1060
TF: tensorflow (0.10.0rc0)
Cuda compilation tools, release 7.5, V7.5.17
cuDNN version 4

I only have 1 GPU, and am using it for training.

I am not sure how to continue training from a checkpoint. I wasn't aware TF automatically creates checkpoints. I have simply been restarting the server and running the training again from scratch every time I get this error. (By the way, it seems to be almost finished now at epoch 195, so fingers crossed.) I just don't think it's safe to leave a bug (if it exists) like this lying around, since it could waste days of training.

For more info, I trained this on an NVIDIA Tesla K20 and, although it was slower than my 1060, it worked the first time without any errors. Again, I'm kind of scared that this might be a randomly occurring error, which can make it hard to hunt down.
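One way to limit the time lost to an intermittent NaN is to check the loss at every step and stop as soon as it is no longer finite, so that the most recent saved weights stay usable. This is a minimal, framework-free sketch of that idea; `step_fn` and `save_fn` are hypothetical callbacks standing in for the real training and saving code.

```python
import math


def train_with_nan_guard(step_fn, num_steps, save_fn, save_every=100):
    """Run step_fn until num_steps or until the loss stops being finite.

    step_fn(step) returns the loss as a float; save_fn(step) persists weights.
    Both are hypothetical callbacks, not part of any real training script.
    Returns the last step whose loss was finite, or -1 if there was none.
    """
    last_good = -1
    for step in range(num_steps):
        loss = step_fn(step)
        if not math.isfinite(loss):
            # Stop here so that no further checkpoint is written.
            print("non-finite loss at step %d, stopping" % step)
            break
        last_good = step
        if (step + 1) % save_every == 0:
            save_fn(step)
    return last_good


# Toy run: the third "loss" is NaN, so training halts after two good steps.
saved = []
losses = [0.9, 0.7, float("nan"), 0.5]
last = train_with_nan_guard(lambda s: losses[s], len(losses), saved.append, save_every=1)
```

In the toy run, only steps 0 and 1 are saved and the loop never reaches step 3.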

EderSantana commented on June 15, 2024

TensorFlow does not do that automatically, but our code does. Add the flag --loadweights to continue from a checkpoint:
https://github.com/commaai/research/blob/master/train_generative_model.py#L137
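As a rough illustration of what a resume flag like this typically does (this is not the code behind the link above), the sketch below parses a `--loadweights` switch and picks up from a saved state when one exists. Treating the flag as a boolean switch, the JSON checkpoint format, and all helper names are assumptions made for the example.

```python
import argparse
import json
import os
import tempfile


def build_parser():
    parser = argparse.ArgumentParser()
    # Assumed to be a boolean switch; the real script may define it differently.
    parser.add_argument("--loadweights", action="store_true")
    return parser


def save_checkpoint(path, epoch, weights):
    """Write the epoch number and weights to a JSON file (illustrative format)."""
    with open(path, "w") as f:
        json.dump({"epoch": epoch, "weights": weights}, f)


def load_checkpoint(path):
    with open(path) as f:
        state = json.load(f)
    return state["epoch"], state["weights"]


def initial_state(args, path):
    """Return (start_epoch, weights), resuming only when asked and possible."""
    if args.loadweights and os.path.exists(path):
        epoch, weights = load_checkpoint(path)
        return epoch + 1, weights
    return 0, [0.0, 0.0]  # fresh start with placeholder weights


# Toy run: a checkpoint from epoch 194 exists, and the flag is passed.
ckpt = os.path.join(tempfile.mkdtemp(), "ckpt.json")
save_checkpoint(ckpt, epoch=194, weights=[0.1, -0.2])
resume_args = build_parser().parse_args(["--loadweights"])
start_epoch, weights = initial_state(resume_args, ckpt)
```

Without the flag, `initial_state` ignores the file and starts again from epoch 0.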

Yeah, I guess it's some rounding error in TF beyond my reach for now... But let me know if the checkpoint thing works for you.

zhaohuaqing1993 commented on June 15, 2024

How did you train the train_generative_model.py autoencoder successfully? I ran into some difficulty. Do I have to change something in the code? Thanks.
