Comments (6)

jonbarron commented on April 28, 2024

Are you training on the GPU? I'm not sure what "stuck" means here, but it sounds like training could just be proceeding very slowly, and TF using the CPU for training is a very common reason for that.
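A rough way to check the GPU question before digging into TensorFlow itself is sketched below. It only looks at two environment-level hints: whether `CUDA_VISIBLE_DEVICES` hides every device, and whether the `nvidia-smi` tool is on the PATH. The helper name `gpu_hints` and its return keys are made up for illustration. Passing these checks does not prove TensorFlow is using the GPU (that also needs a GPU-enabled TensorFlow build), but failing them is a strong sign that training is running on the CPU.

```python
import os
import shutil


def gpu_hints(env=None, which=shutil.which):
    """Return environment-level hints about GPU visibility (hypothetical helper)."""
    env = os.environ if env is None else env
    visible = env.get("CUDA_VISIBLE_DEVICES")
    # An empty value or "-1" hides all CUDA devices from the process.
    hidden = visible is not None and visible.strip() in ("", "-1")
    return {
        "cuda_visible_devices": visible,
        "gpus_hidden_by_env": hidden,
        "nvidia_smi_found": which("nvidia-smi") is not None,
    }
```

Calling `gpu_hints()` in the same shell that launches training reports the hints for that environment. In TF 1.x, `tf.test.is_gpu_available()` gives a more direct answer from inside Python.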

from google-research.

timothybrooks commented on April 28, 2024

I second Jon's analysis of this: it looks like the CPU is being used here, which can be extremely slow and make training appear stuck. After waiting a long time (say, an hour), do you see anything written to your model directory?

I would recommend training with a GPU if that is possible, as it will be much faster. The last log relates to Intel's OpenMP* thread mapping, probably because Intel's MKL-DNN is being used. I have not seen those logs while training before, though, and see no reason why they would cause stalling.
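To answer the "anything written to your model directory?" question without opening a file browser, a small sketch like the following could be used. It assumes the file-name patterns Estimator normally uses (`model.ckpt-<step>.index` for checkpoints and `events.out.tfevents.*` for TensorBoard summaries); the helper name `summarize_model_dir` is hypothetical, and the directory to pass is whatever `--model_dir` was given to train.py.

```python
import glob
import os


def summarize_model_dir(model_dir):
    """Count checkpoint index files and TensorBoard event files in model_dir."""
    # Assumed naming: one "model.ckpt-<step>.index" file per saved checkpoint.
    checkpoints = glob.glob(os.path.join(model_dir, "model.ckpt-*.index"))
    # Assumed naming: summary writers create "events.out.tfevents.*" files.
    events = glob.glob(os.path.join(model_dir, "events.out.tfevents.*"))
    return {"checkpoints": len(checkpoints), "event_files": len(events)}
```

If the checkpoint count grows between two calls made some minutes apart, training is progressing, just slowly.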

jonbarron commented on April 28, 2024

I just tried training with the current code, and it seems to produce model checkpoints as output. @aasharma90, can you confirm that model checkpoints aren't being produced when you run this? It's a little confusing because training doesn't print loss/epoch statements, but that seems to be a visualization issue, not a correctness issue.

timothybrooks commented on April 28, 2024

To print loss in the terminal during training, add tf.logging.set_verbosity(tf.logging.INFO) to set a high enough verbosity to see the training metrics. You can add this line right before the call to tf.estimator.train_and_evaluate(...) in train.py.

By default, Estimator will log every 100 steps. You can change this by modifying the config in train.py:
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=[num of steps])

You may find it easier to use TensorBoard to visualize training progress, which can be done by running tensorboard --logdir=[path to model dir] in a separate terminal during or after training, and opening the printed URL in a web browser.

jonbarron commented on April 28, 2024

Thanks Tim, CL is in flight, I'll close this issue once it's landed.

aasharma90 commented on April 28, 2024

Hi @timothybrooks and @jonbarron

Regarding your questions -

  1. I thought the default setting would be to run it on the GPU? Sorry, I'm very new to TF, so I'm not very familiar with this. Could you please let me know how that can be done? You can have a look at the default command I used for training in my original post above.

  2. I added Tim's suggestions in train.py

...
config = tf.estimator.RunConfig(FLAGS.model_dir, log_step_count_steps=1)
...
tf.logging.set_verbosity(tf.logging.INFO)

The simulation is still at the same point I mentioned.

  3. Launching TensorBoard, I can see the model graph, but I cannot see the training profile.
