
Save interval not working (composer) [CLOSED]

naimavahab commented on September 28, 2024
Save interval not working

from composer.

Comments (9)

mvpatel2000 commented on September 28, 2024

How many epochs does your run go for? Also, what is the save_filename? Is it unique per timestamp (to verify the checkpoint isn't being overwritten each time)?
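To make the overwrite concern concrete, here is a minimal sketch of how a filename template behaves with and without per-save placeholders. It only mimics `str.format`-style templating and is not Composer's checkpointing code; the helper name `checkpoint_names`, the second template, and the 100 batches per epoch are assumptions made for illustration.

```python
from typing import List

# Illustration only: plain str.format templating, not Composer internals.
def checkpoint_names(template: str, epochs: int, batches_per_epoch: int, rank: int = 0) -> List[str]:
    """Render one filename for each end-of-epoch save."""
    names = []
    for epoch in range(1, epochs + 1):
        batch = epoch * batches_per_epoch  # cumulative batch count at the end of this epoch
        names.append(template.format(epoch=epoch, batch=batch, rank=rank))
    return names

# Template with epoch/batch placeholders: every save gets its own file.
unique = checkpoint_names("ep{epoch}-ba{batch}-rank{rank}.pt", epochs=3, batches_per_epoch=100)

# Hypothetical template without epoch/batch placeholders: every save maps to the same file.
fixed = checkpoint_names("latest-rank{rank}.pt", epochs=3, batches_per_epoch=100)

print(unique)
print(len(set(unique)), "distinct names vs", len(set(fixed)))
```

If the rendered names collapse to a single value, each save would replace the previous one, which would look like "only one checkpoint is saved".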


naimavahab commented on September 28, 2024

Only 1 checkpoint is getting saved, in the format 'ep3-ba1458-rank0.pt'. If I specify a save interval of 1ep, only the 1st epoch gets saved, even though I have 30 epochs. And if the save interval is 3ep, the 3rd epoch gets saved and the rest are ignored.
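For reference, the schedule the reporter expected can be sketched as below. This is a hedged illustration of "save every N epochs" and not Composer's scheduling logic; the function name `expected_save_epochs` and the minimal parsing of the `"<N>ep"` string are assumptions.

```python
from typing import List

# Illustration only: which epochs a "save every N epochs" interval should hit.
def expected_save_epochs(save_interval: str, max_epochs: int) -> List[int]:
    """Return the epochs at which a checkpoint is expected for an interval like '3ep'."""
    if not save_interval.endswith("ep"):
        raise ValueError("this sketch only handles epoch intervals such as '1ep' or '3ep'")
    every = int(save_interval[: -len("ep")])
    if every <= 0:
        raise ValueError("interval must be a positive number of epochs")
    return [epoch for epoch in range(1, max_epochs + 1) if epoch % every == 0]

print(expected_save_epochs("1ep", 30))  # one checkpoint per epoch
print(expected_save_epochs("3ep", 30))  # every third epoch
```

Under this reading, a 30-epoch run should produce 30 checkpoints with 1ep and 10 with 3ep, rather than the single file reported above.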


mvpatel2000 commented on September 28, 2024

Hm... do you get any errors or traces? Mind sharing a minimal repro please?


naimavahab commented on September 28, 2024

There are no errors and everything is working fine, including plotting the loss etc.
I am using this git branch for mosaicbert pretraining. https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples


eracah commented on September 28, 2024

Mind sharing what version of composer you are using?
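One way to look up the installed version from Python is sketched below. It assumes Composer is installed under the distribution name `mosaicml`; if your environment uses a different name, pass it explicitly. The helper name `installed_version` is made up for this example.

```python
from importlib import metadata
from typing import Optional

# Assumption: Composer is installed under the distribution name "mosaicml".
def installed_version(dist_name: str = "mosaicml") -> Optional[str]:
    """Return the installed version string, or None if the distribution is not found."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None

print(installed_version())
```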


naimavahab commented on September 28, 2024

0.17.2


eracah commented on September 28, 2024

That's a very old (>6 months old) version of composer. Is there a reason you need to use such an old version?


naimavahab commented on September 28, 2024

I have been using this particular docker image https://github.com/Skylion007/mosaicml-examples/tree/skylion007/add-fa2-to-bert/examples to use flashattention, triton etc. These setups require specific versions; if I update to the latest composer, triton fails.
But I have now managed by updating to a slightly higher version, like 0.19, and it works fine. I still wonder why composer failed only at the checkpointing part on the previous 0.17.2 version.


mvpatel2000 commented on September 28, 2024

Going to close as it seems to work.

I'm not super sure what the bug was, but it's certainly possible there was an issue in older versions. Definitely recommend upgrading to latest :)

