Comments (4)
Thanks a lot for the insights and for taking the time to look into this!! It was really helpful :)
from gpt-neox.
Hello, to verify whether this is a bug, could you please divide your budget of iterations by 1.006 and let us know how many epochs that would correspond to? Thank you!
Thanks for your prompt reply! Btw, I am working with Pythia so I am using v1.0. If I compute the number quickly, this would be 8064 / 1.008 = 8000. This results in 1 epoch, but not all data will be seen, since I can go up to 8023 and still get 1 epoch. Out of curiosity, may I ask what the 1.008 stands for? Happy to check more stuff to help debug this :)
8023 will be 1 epoch and 8024 will be 2, yes. This is expected behaviour with Megatron data pipelines. Check this comment for reference, or read on below for the explanation.
The reason is that you would in general have n data sources, with associated weights determining the sampling probability of each source. Suppose you're training over 10000 sequences coming from two data sources, data1 and data2, each with probability 50%. When sampling, you roll the dice for every sample, and the expected number of samples drawn from each source is 5000. However, this is a random process, and for a given seed you might actually end up sampling, say, 4999 from data1 and 5001 from data2. In other words, you need to leave a margin in the number of sequences you'll sample from each dataset to account for this variance.
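To make the variance concrete, here is a small simulation (purely illustrative, not gpt-neox code): draw 10000 samples from two equally weighted sources and count how many come from each. The split hovers around 5000/5000 but is rarely exact.

```python
import random

random.seed(0)

# Hypothetical illustration: sample 10000 sequences from two data sources
# with 50/50 weights and count how many land on each source.
counts = {"data1": 0, "data2": 0}
for _ in range(10000):
    source = random.choices(["data1", "data2"], weights=[0.5, 0.5])[0]
    counts[source] += 1

print(counts)  # close to 5000/5000, but the exact split varies with the seed
```

Any per-source index buffer has to absorb exactly this kind of deviation.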
A margin that works well in practice is 0.5% of the number of samples, and that's what Megatron uses. That's why I asked you to check with 1.006 whether it would be one epoch, and why it's normal that 8023 iterations (8064 / 8023 > 1.005) give you 1 epoch worth of sample indices, while 8024 (8064 / 8024 < 1.005) gives you 2 epochs.
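The arithmetic above can be sketched as follows. This is a simplified stand-in for the Megatron logic (the function name and signature are mine, not from the codebase): oversample the requested number of samples by the margin, then round up to a whole number of epochs over the dataset.

```python
import math

def epochs_needed(num_samples, samples_per_epoch, margin=1.005):
    # Hypothetical sketch of the Megatron-style buffer: pad the sample
    # budget by 0.5%, then count how many full passes over the data
    # are needed to cover it.
    return math.ceil(num_samples * margin / samples_per_epoch)

print(epochs_needed(8023, 8064))  # 1: 8023 * 1.005 = 8063.1 <= 8064
print(epochs_needed(8024, 8064))  # 2: 8024 * 1.005 = 8064.1 > 8064
```

This reproduces the boundary you observed: 8023 stays within one epoch's worth of indices, 8024 tips over into a second.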
Now of course, in your specific case with a single data source, this 0.5% buffer is not useful -- there will be no variance in the number of sequences seen from your data source. Generally, my advice is that missing a few iterations won't really matter. However, if you really want to train exactly once on every sequence, without this buffer, you can go to this line and turn the 1.005 factor into 1 (or remove it altogether). Make sure to re-enable it if you start training on more data sources.
More subjectively, re: why not make 1 data source an exception and disable the buffer in that case, my personal opinion is that it's preferable for the behaviour to stay the same independently of the number of data sources. :-) But if you want to do it automatically, you can simply add an `if len(weights) == 1:` condition that sets the buffer factor to 1.0, and 1.005 otherwise.
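As a sketch, the conditional described above could look like this (a hypothetical helper, not the actual gpt-neox code):

```python
def buffer_factor(weights):
    # Hypothetical helper: skip the 0.5% oversampling margin when there is
    # only one data source, since there is no sampling variance to absorb.
    return 1.0 if len(weights) == 1 else 1.005

print(buffer_factor([1.0]))       # 1.0 -- single source, no margin needed
print(buffer_factor([0.5, 0.5]))  # 1.005 -- multiple sources, keep the margin
```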
from gpt-neox.