
Comments (4)

javirandor commented on May 20, 2024

Thanks a lot for the insights and spending the time looking into this!! It was really helpful :)

from gpt-neox.

AIproj commented on May 20, 2024

Hello, to verify whether this is a bug, can you please divide your budget of iterations by 1.006 and let us know how many epochs that corresponds to? Thank you!


javirandor commented on May 20, 2024

Thanks for your prompt reply! Btw, I am working with Pythia so I am using v1.0.

If I compute it quickly, this would be 8064 / 1.008 = 8000. This results in 1 epoch, but not all the data will be seen, since I can go up to 8023 and still get 1 epoch.

Out of curiosity, may I ask what 1.008 stands for?

Happy to check more stuff to help debug this :)


AIproj commented on May 20, 2024

8023 will be 1 epoch and 8024 will be 2, yes. This is expected behaviour with Megatron data pipelines.
Check this comment for reference, or read on below for the explanation.

The reason is that you would in general have n data sources, each with an associated weight determining its sampling probability. Suppose you're training over 10000 sequences coming from data sources data1 and data2, each with probability 50%. When sampling, you roll the dice for every sample, and the expected number of samples from each data source is 5000. However, this is a random process: with a given seed you might actually end up sampling 4999 from data1 and 5001 from data2, for example. In other words, you need to leave a margin in the number of sequences you'll sample from each dataset to account for this variance.
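Here's a minimal sketch of that variance in plain Python (not the Megatron code): drawing 10000 sequences from two equally weighted sources rarely yields exactly 5000 from each.

```python
import random

random.seed(0)  # any seed; the exact split changes with the seed
counts = {"data1": 0, "data2": 0}
for _ in range(10000):
    # Each sample is drawn from one of the two sources with probability 50%.
    counts[random.choice(["data1", "data2"])] += 1

# The counts hover around the expected value of 5000 each, but usually
# don't hit it exactly -- hence the need for a sampling margin.
print(counts)
```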

A margin that works well in practice is 0.5% of the number of samples, and that's what Megatron uses. That's why I asked you to check with 1.006 whether it would be one epoch, and why it's normal that 8023 (8064/8023 > 1.005) gives you 1 epoch's worth of sample indices while 8024 gives you 2 epochs (8064/8024 < 1.005).
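To make the arithmetic concrete, here's a hypothetical helper (`epochs_needed` is an illustrative name, not a function in the codebase) that mirrors the check:

```python
import math

def epochs_needed(num_samples, dataset_size, margin=1.005):
    """Epochs of shuffled indices built to cover num_samples, padded
    by Megatron's 0.5% over-sampling buffer (illustrative only)."""
    return math.ceil(num_samples * margin / dataset_size)

# With a dataset of 8064 sequences per pass:
print(epochs_needed(8023, 8064))  # 8023 * 1.005 still fits in one pass -> 1
print(epochs_needed(8024, 8064))  # 8024 * 1.005 spills over -> 2
```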

Now of course, in your specific case with 1 data source, this 0.5% buffer is not useful: there will be no variance in the number of sequences you'll have seen from your data source. Generally, my advice is that missing a few iterations won't really matter. However, if you really want to train exactly once on every sequence without this buffer, you can go to this line and turn the 1.005 factor into 1 (or remove it altogether). Make sure to re-enable it if you start training on more data sources.

More subjectively, re: why not make 1 data source an exception and disable the buffer in that case, my personal opinion is that it's preferable for the behaviour to be the same regardless of the number of data sources. :-) But if you want to do it automatically, you can simply add an `if len(weights) == 1:` condition that sets the buffer factor to 1.0, keeping 1.005 otherwise.
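Something like this sketch (`buffer_factor` is an illustrative name, not part of the actual code):

```python
def buffer_factor(weights):
    """Disable the 0.5% over-sampling buffer when there is a single
    data source, since there is no sampling variance to absorb."""
    return 1.0 if len(weights) == 1 else 1.005

print(buffer_factor([1.0]))        # single source -> no buffer
print(buffer_factor([0.5, 0.5]))   # multiple sources -> keep the margin
```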

from gpt-neox.
