Comments (4)
Thanks a lot for the insights and for taking the time to look into this!! It was really helpful :)
from gpt-neox.
Hello, to verify whether this is a bug, could you please divide your budget of iterations by 1.006 and let us know how many epochs that would correspond to? Thank you!
Thanks for your prompt reply! Btw, I am working with Pythia so I am using v1.0. If I compute the number quickly, this would be 8064 / 1.008 = 8000. This results in 1 epoch, but not all data will be seen, since I can go up to 8023 and still get 1 epoch. Out of curiosity, may I ask what the 1.008 stands for? Happy to check more stuff to help debug this :)
8023 will be 1 epoch and 8024 will be 2, yes. This is expected behaviour with Megatron data pipelines. Check this comment for reference, or read on below for the explanation.
The reason is that you would in general have n data sources, with associated weights determining the sampling probability of each source. Suppose you're training over 10000 sequences coming from two data sources, data1 and data2, each with probability 50%. When sampling, you roll the dice for every sample, and the expected number of samples drawn from each source is 5000. However, this is a random process, and for a given seed you might actually end up sampling, say, 4999 from data1 and 5001 from data2. In other words, you need to leave a margin in the number of sequences you'll sample from each dataset to account for this variance.
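To make the variance concrete, here is a small simulation (purely illustrative, not gpt-neox code): draw 10000 samples from two equally weighted sources and count how many come from each. The split hovers around 5000/5000 but is rarely exact.

```python
import random

random.seed(0)

# Hypothetical illustration: sample 10000 sequences from two data sources
# with 50/50 weights and count how many land on each source.
counts = {"data1": 0, "data2": 0}
for _ in range(10000):
    source = random.choices(["data1", "data2"], weights=[0.5, 0.5])[0]
    counts[source] += 1

print(counts)  # close to 5000/5000, but the exact split varies with the seed
```

Any per-source index buffer has to absorb exactly this kind of deviation.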
A margin that works well in practice is 0.5% of the number of samples, and that's what Megatron uses. That's why I asked you to check with 1.006 whether it would be one epoch, and why it's normal that 8023 iterations (8064 / 8023 > 1.005) give you 1 epoch worth of sample indices, while 8024 (8064 / 8024 < 1.005) gives you 2 epochs.
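The arithmetic above can be sketched as follows. This is a simplified stand-in for the Megatron logic (the function name and signature are mine, not from the codebase): oversample the requested number of samples by the margin, then round up to a whole number of epochs over the dataset.

```python
import math

def epochs_needed(num_samples, samples_per_epoch, margin=1.005):
    # Hypothetical sketch of the Megatron-style buffer: pad the sample
    # budget by 0.5%, then count how many full passes over the data
    # are needed to cover it.
    return math.ceil(num_samples * margin / samples_per_epoch)

print(epochs_needed(8023, 8064))  # 1: 8023 * 1.005 = 8063.1 <= 8064
print(epochs_needed(8024, 8064))  # 2: 8024 * 1.005 = 8064.1 > 8064
```

This reproduces the boundary you observed: 8023 stays within one epoch's worth of indices, 8024 tips over into a second.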
Now of course, in your specific case with a single data source, this 0.5% buffer is not useful -- there will be no variance in the number of sequences seen from your data source. Generally, my advice is that missing a few iterations won't really matter. However, if you really want to train exactly once on every sequence, without this buffer, you can go to this line and turn the 1.005 factor into 1 (or remove it altogether). Make sure to re-enable it if you start training on more data sources.
More subjectively, re: why not make 1 data source an exception and disable the buffer in that case, my personal opinion is that it's preferable for the behaviour to stay the same independently of the number of data sources. :-) But if you want to do it automatically, you can simply add an `if len(weights) == 1:` condition that sets the buffer factor to 1.0, and 1.005 otherwise.
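As a sketch, the conditional described above could look like this (a hypothetical helper, not the actual gpt-neox code):

```python
def buffer_factor(weights):
    # Hypothetical helper: skip the 0.5% oversampling margin when there is
    # only one data source, since there is no sampling variance to absorb.
    return 1.0 if len(weights) == 1 else 1.005

print(buffer_factor([1.0]))       # 1.0 -- single source, no margin needed
print(buffer_factor([0.5, 0.5]))  # 1.005 -- multiple sources, keep the margin
```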
from gpt-neox.