Comments (7)

yuval-alaluf commented on May 27, 2024

OK, that makes more sense. You (most likely) won't be able to run training and inference on the same GPU at the same time.
Can I ask why you are running inference.py during training? During training we log the training and test images, showing the input, target, and output for each, at every validation interval. This should help you follow the training progress.
Moreover, during training we write the training and test logs to TensorBoard, which will help you understand how the model performs on the test data over time. This should help you determine which checkpoint is best. I would then run inference.py on the best checkpoint you found.

from pixel2style2pixel.

yuval-alaluf commented on May 27, 2024

All good.
Regarding your question, by default we run for 500,000 steps, but that is more than we used and more than you will probably need.
We plot the training and test losses during training using TensorBoard. I would recommend connecting to TensorBoard to see when the model stops improving. Once the model has converged (i.e. the test losses stop decreasing), you can stop training and use the best checkpoint obtained to perform inference.
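As a rough illustration of "stop when the test losses stop decreasing", here is a minimal sketch in plain Python. It assumes you have copied the test loss at each validation step into a list of (step, loss) pairs yourself. The helper names `best_checkpoint_step` and `has_converged`, the `patience` and `min_delta` parameters, and the numbers are all hypothetical and are not part of the repository.

```python
from typing import List, Tuple

History = List[Tuple[int, float]]  # (training step, test loss) pairs


def best_checkpoint_step(history: History) -> int:
    """Return the step with the lowest test loss (the earliest one on ties)."""
    if not history:
        raise ValueError("history is empty")
    return min(history, key=lambda pair: pair[1])[0]


def has_converged(history: History, patience: int = 3, min_delta: float = 0.0) -> bool:
    """True if none of the last `patience` validations improved on the
    earlier best loss by more than `min_delta`."""
    if len(history) <= patience:
        return False  # not enough validations to judge yet
    losses = [loss for _, loss in history]
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_recent >= best_before - min_delta


# Made-up numbers, only to show the shape of the data.
example_history = [
    (1000, 0.90), (2000, 0.70), (3000, 0.62),
    (4000, 0.63), (5000, 0.64), (6000, 0.62),
]

print(best_checkpoint_step(example_history))       # 3000
print(has_converged(example_history, patience=3))  # True
```

A larger `patience` makes the check less sensitive to a single noisy validation, at the cost of training for longer before you decide to stop.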
I hope that helps answer your question.

yuval-alaluf commented on May 27, 2024

Hi. If I understand you correctly, calling self.validate() in Coach() results in an out-of-memory error?
If that is the case, I wouldn't recommend pausing training, or terminating and restarting it, because the current code does not save the state of the optimizer.
If you find that you are unable to run validation during training, I would recommend setting the save interval to, say, 5000 and the validation interval to the maximum number of steps you're training for. In doing so, you will not run validation during training, but you will save a checkpoint every 5000 steps. After training, you will have a set of checkpoints that you can validate using inference.py and our metrics scripts.
(5000 was used as an example; feel free to change it depending on how often you think you should save checkpoints.)
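To make the effect of those two settings concrete, here is a small sketch in plain Python. It is not the repository's training loop: `planned_events` is a hypothetical helper, and it assumes that steps are counted from 1 and that an event fires whenever the step is a multiple of its interval. The actual code may differ in details such as what happens at step 0.

```python
from typing import Dict, List


def planned_events(max_steps: int, save_interval: int, val_interval: int) -> Dict[str, List[int]]:
    """List the steps at which a checkpoint would be saved and validation would run.

    Assumption: steps run from 1 to max_steps inclusive, and an event fires
    when the step is a multiple of its interval.
    """
    if min(max_steps, save_interval, val_interval) <= 0:
        raise ValueError("all arguments must be positive")
    steps = range(1, max_steps + 1)
    return {
        "save": [s for s in steps if s % save_interval == 0],
        "validate": [s for s in steps if s % val_interval == 0],
    }


# Validation interval equal to the total number of steps: validation is
# pushed to the very last step, while checkpoints are saved along the way.
plan = planned_events(max_steps=20000, save_interval=5000, val_interval=20000)
print(plan["save"])      # [5000, 10000, 15000, 20000]
print(plan["validate"])  # [20000]
```

Under these assumptions, validation only fires at the final step, when training is over anyway, and each saved checkpoint can then be evaluated separately with inference.py.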

spamfold3r commented on May 27, 2024

I'm not sure whether this is the same thing as what you're saying, but I am trying to run inference.py in a different terminal instance whilst training is happening, which results in an error relating to insufficient CUDA memory.

I'm just trying to test the checkpoints using inference.py, but am unsure of how to do this. I am using the recommended setting listed in the training section of the documentation.

spamfold3r commented on May 27, 2024

Ahhhhhh I see, thank you. I was not aware of the visualizations throughout the training process, hence why I thought of running inference.py. My apologies.

As a newcomer, another question - at what point should training stop? Will it stop at a designated point or is it a matter of monitoring the scores and terminating when you see fit?

spamfold3r commented on May 27, 2024

Yes that's just what I was after! Once again, thank you for your help. 😎

yuval-alaluf commented on May 27, 2024

Happy to help!
