Running Megatron-Deepspeed with pipelining seems to call PipeModule with the type:tran

[BUG] 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank.

siddharth9820 commented on May 18, 2024

[BUG] 'type:transformer' partitioning doesn't ensure non-zero parameters on each pipeline rank.

Comments (5)

siddharth9820 commented on May 18, 2024

Yes, it won't be balanced, but at least it will "run" with Megatron-DeepSpeed. With the current approach, I was getting "empty parameter" errors during optimizer initialization. I believe this happened on the second-to-last pipeline-parallel (pp) rank, since it ended up with no parameters.

tjruwase commented on May 18, 2024

@siddharth9820, thanks for reporting this error. I am curious if this is a recent regression due to the below PR that changed the balancing algorithm:
#4312

Can you please try earlier DeepSpeed versions (v0.13.0 or v0.12.6) or revert the PR?

siddharth9820 commented on May 18, 2024

@tjruwase I am able to reproduce the error outside of Megatron-DeepSpeed as well -

I'll try the other versions too. Thanks for the pointer.

About potential fixes: could you first assign one layer to each rank and then run this function on the remaining n-m layers and m ranks (for n layers and m ranks)? That wouldn't be an ideal fix if the weights aren't uniform, though.

tjruwase commented on May 18, 2024

@siddharth9820, thanks for the update. This seems like an implementation bug as I find it hard to believe both the new and old algorithms fail these seemingly practical cases.

Old algorithm - Fast Optimal Load Balancing Algorithms for 1D Partitioning
New algorithm - https://www8.cs.umu.se/kurser/TDBAfl/VT06/algorithms/BOOK/BOOK2/NODE45.HTM

tjruwase commented on May 18, 2024

> About potential fixes: could you first assign one layer to each rank and then run this function on the remaining n-m layers and m ranks (for n layers and m ranks)? That wouldn't be an ideal fix if the weights aren't uniform, though.

Yes, it does not seem like this approach would be balanced. I think it will only increase the minimum from zero to one. Right?
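
To make the trade-off concrete, here is a minimal sketch of one possible reading of the reserve-one-layer idea discussed above. Everything in it is an assumption for illustration: `balance_min_max` is a hypothetical brute-force stand-in balancer (it is not DeepSpeed's implementation of either algorithm), `reserve_one_layer_per_part` and `part_sizes` are hypothetical helpers, the binary weights are a toy input, and feeding the inner balancer only the first n-m weights is an arbitrary choice, since the thread does not say which layers it should see.

```python
from itertools import combinations_with_replacement
from typing import List, Sequence


def balance_min_max(weights: Sequence[int], num_parts: int) -> List[int]:
    """Hypothetical stand-in balancer (brute force, NOT DeepSpeed's code).

    Returns num_parts + 1 boundaries; part p owns items bounds[p]:bounds[p + 1].
    It minimises the heaviest part and allows empty parts.
    """
    n = len(weights)
    best_load, best_bounds = None, None
    # Enumerate every non-decreasing choice of the num_parts - 1 inner cuts.
    for cuts in combinations_with_replacement(range(n + 1), num_parts - 1):
        bounds = [0, *cuts, n]
        load = max(sum(weights[a:b]) for a, b in zip(bounds, bounds[1:]))
        if best_load is None or load < best_load:
            best_load, best_bounds = load, bounds
    return best_bounds


def reserve_one_layer_per_part(inner_bounds: Sequence[int]) -> List[int]:
    """Grow every part of an (n - m)-item partition by one reserved layer.

    Boundary p is shifted right by p, so each of the m parts gains exactly
    one item: the result covers n items and no part is empty.
    """
    return [b + p for p, b in enumerate(inner_bounds)]


def part_sizes(bounds: Sequence[int]) -> List[int]:
    """Number of layers owned by each part."""
    return [b - a for a, b in zip(bounds, bounds[1:])]


# Toy input (assumed): binary weights, 1 = layer matches the requested type.
weights = [0, 1, 1, 1, 1, 0, 0]  # n = 7 layers
num_parts = 5                    # m = 5 pipeline ranks

plain = balance_min_max(weights, num_parts)
# Arbitrary choice: the inner balancer only sees the first n - m weights.
inner = balance_min_max(weights[: len(weights) - num_parts], num_parts)
fixed = reserve_one_layer_per_part(inner)

print("plain:", plain, part_sizes(plain))
print("fixed:", fixed, part_sizes(fixed))
```

On this toy input the stand-in balancer leaves one part with no layers, while the shifted partition gives every part at least one layer. The loads are not rebalanced afterwards, which is the concern raised above: the approach only lifts the minimum number of layers per rank from zero to one.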
