Thanks for your work. What hardware is suitable for training the model, specifically which type of GPU and how much GPU memory? I trained a few times on a Quadro 6000, but the log reported a SIGKILL in the second epoch. (cofced issue, 5 comments, closed)

nicozwy commented on August 22, 2024
Thanks for your work. What hardware is suitable for training the model, specifically which type of GPU and how much GPU memory? I trained a few times on a Quadro 6000, but the log reported a SIGKILL in the second epoch.

from cofced.

Comments (5)

nievuelo commented on August 22, 2024

To be clear, the first epoch seems to work well, but partway through the second epoch, around batch 4660/5033, the process is killed and PyCharm reports: "interrupted by signal 9: SIGKILL". I checked the Linux log, and it says the CPU ran out of memory. But this machine has 64 GB of main memory, which I thought would be enough. I don't know what causes this error.
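
Signal 9 together with an out-of-memory entry in the system log usually means the Linux kernel's OOM killer ended the process. One way to see whether host memory climbs steadily across batches is to log the peak resident set size during training. This is a minimal sketch using only the standard library `resource` module (Unix only); `peak_rss_mib` and `log_memory` are hypothetical helper names, not part of cofced.

```python
import resource
import sys


def peak_rss_mib() -> float:
    """Return this process's peak resident set size in MiB."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # Linux reports ru_maxrss in KiB; macOS reports it in bytes.
    divisor = 1024 * 1024 if sys.platform == "darwin" else 1024
    return peak / divisor


def log_memory(step: int, every: int = 500) -> None:
    """Print peak memory every `every` steps so growth shows up before a crash."""
    if step % every == 0:
        print(f"step {step}: peak RSS {peak_rss_mib():.1f} MiB", flush=True)
```

Calling `log_memory(batch_idx)` inside the training loop should show whether the peak keeps rising toward the 64 GB limit as the second epoch progresses, which would point to data being accumulated rather than released.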

Nicozwy commented on August 22, 2024

I think the CPU memory is too small because we have to load many reports.
Have you tried loading a smaller portion of the reports for each claim?
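
The idea of loading fewer reports per claim can be sketched as follows. The data layout here (a dict mapping a claim ID to a list of report strings) and the function name `cap_reports` are assumptions for illustration; the actual cofced loader may store reports differently.

```python
from typing import Dict, List


def cap_reports(
    reports_by_claim: Dict[str, List[str]], report_each_claim: int
) -> Dict[str, List[str]]:
    """Keep at most `report_each_claim` reports per claim, preserving order."""
    if report_each_claim < 0:
        raise ValueError("report_each_claim must be non-negative")
    return {
        claim_id: reports[:report_each_claim]
        for claim_id, reports in reports_by_claim.items()
    }


# Toy data only: three reports for one claim, one report for another.
toy = {"claim-1": ["r1", "r2", "r3"], "claim-2": ["r4"]}
capped = cap_reports(toy, report_each_claim=2)
```

Truncating before tokenization or tensor conversion matters: if the cap is applied only after every report has already been loaded, peak memory stays the same.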

nievuelo commented on August 22, 2024

Thanks a lot for your generous help. I will star this project.
To clarify, I am not talking about the 199 GB of storage but the 64 GB of main memory. I changed report_each_claim from 12 to 6 in eval_exp_fc5 and from 30 to 15 in train_exp_fc5_liar_raw2, but the process still gets killed. The difference is that before the change it crashed during training, and after the change it crashes during evaluation. In eval_exp_fc5 I only changed report_each_claim in the evaluate_model function. Could you help me figure this out?

nievuelo commented on August 22, 2024

Or, in other words, how much memory is required?
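
The thread does not give a figure, but a rough lower bound on the memory needed for tokenized inputs can be worked out by multiplying the counts involved. This is a back-of-the-envelope sketch only: the function name, the example numbers, and the 8 bytes per token (int64 token IDs) are all assumptions, and the estimate ignores Python object overhead, the model itself, and any cached features.

```python
def estimate_input_gib(
    num_claims: int,
    reports_per_claim: int,
    tokens_per_report: int,
    bytes_per_token: int = 8,  # assumption: token IDs stored as int64
) -> float:
    """Lower-bound estimate, in GiB, of memory held by tokenized reports."""
    total_bytes = num_claims * reports_per_claim * tokens_per_report * bytes_per_token
    return total_bytes / 1024**3


# Placeholder numbers, not measurements from cofced.
before = estimate_input_gib(num_claims=10_000, reports_per_claim=30, tokens_per_report=512)
after = estimate_input_gib(num_claims=10_000, reports_per_claim=15, tokens_per_report=512)
print(f"30 reports per claim: {before:.2f} GiB; 15 reports per claim: {after:.2f} GiB")
```

Because the estimate is linear in `reports_per_claim`, halving that parameter halves this part of the footprint. If the crash persists after halving, the dominant memory use is probably elsewhere, for example in results accumulated across batches.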

Nicozwy commented on August 22, 2024

Hi, @nievuelo. I cannot figure it out with this limited information, but you can try running the code on another machine with a 3090 GPU, since we have tested it successfully on that card.
