Comments (10)

Hanlard commented on August 15, 2024

I have used the following techniques to save memory...
--fp16
--checkpoint-activations
--distribute-checkpointed-activations
--fp16-lm-cross-entropy
--use-cpu-initialization

from megatron-lm.

Hanlard commented on August 15, 2024

This is on 16 * V100 (32G), for debugging...

Hanlard commented on August 15, 2024

I found that if I add "torch.cuda.empty_cache()" at every train_step, the program does not OOM ...
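
For reference, a minimal sketch of that workaround, assuming a generic training loop rather than Megatron-LM's actual train_step. The names release_cached_gpu_memory and train are hypothetical. torch.cuda.empty_cache() only hands cached blocks that no live tensor is using back to the driver, so it may help when the failure comes from fragmentation in the caching allocator, and it usually makes later allocations slower.

```python
try:
    import torch
except ImportError:  # lets the sketch run on a machine without PyTorch
    torch = None


def release_cached_gpu_memory():
    """Release unused cached CUDA blocks; return True if a release was attempted."""
    if torch is not None and torch.cuda.is_available():
        torch.cuda.empty_cache()
        return True
    return False


def train(train_step, num_iters):
    """Run train_step(iteration) num_iters times, releasing the cache after each step.

    train_step is a hypothetical callable standing in for the real step function.
    """
    losses = []
    for iteration in range(num_iters):
        losses.append(train_step(iteration))
        release_cached_gpu_memory()
    return losses


if __name__ == "__main__":
    # Dummy step so the loop can be exercised without a model.
    print(train(lambda it: 1.0 / (it + 1), num_iters=3))
```

If this makes the OOM disappear, peak usage is probably very close to the card's limit, so it is likely a stopgap rather than a fix.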

eric-haibin-lin commented on August 15, 2024

Did you intentionally choose MP=2? Would increasing MP help? (e.g. MP=4, DP=4)

Hanlard commented on August 15, 2024

Did you intentionally choose MP=2? Would increasing MP help? (e.g. MP=4, DP=4)
Increasing MP will help...

Hanlard commented on August 15, 2024

Did you intentionally choose MP=2? Would increasing MP help? (e.g. MP=4, DP=4)

I solved the problem by changing DDP_impl from "local" to "torch" ...
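
In case it helps others reading this thread, here is a rough sketch of how the options mentioned so far could be assembled for a launch command. The five memory flags are the ones listed at the top of the thread. --DDP-impl is, as far as I know, the command-line spelling of the DDP_impl argument in Megatron-LM versions from that period (choices local and torch), so treat it as an assumption and check arguments.py in your checkout. build_args is a hypothetical helper, and no script name or model size is implied.

```python
# Memory-saving flags quoted from the first comment in this thread.
MEMORY_FLAGS = [
    "--fp16",
    "--checkpoint-activations",
    "--distribute-checkpointed-activations",
    "--fp16-lm-cross-entropy",
    "--use-cpu-initialization",
]

# Assumed valid values for the DDP implementation switch.
DDP_CHOICES = ("local", "torch")


def build_args(ddp_impl="torch"):
    """Return the flag list with the DDP implementation appended (hypothetical helper)."""
    if ddp_impl not in DDP_CHOICES:
        raise ValueError(f"ddp_impl must be one of {DDP_CHOICES}, got {ddp_impl!r}")
    return MEMORY_FLAGS + ["--DDP-impl", ddp_impl]


if __name__ == "__main__":
    print(" ".join(build_args("torch")))
```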

gbxu commented on August 15, 2024

The README says "2.5 billion parameters using 2-way model parallel and 1.2 billion parameters with no model parallel".
It's confusing that splitting the model into 2 shards takes more GPU memory. Assuming the original model size is M, after 2-way model parallelism each GPU should hold M/2.
Does anyone know where the additional GPU memory footprint comes from when DDP_impl is set to "local" and "torch" respectively? Does the communication library pin M or more of GPU memory?
Thanks.
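
A back-of-the-envelope check of the M/2 reasoning, as a sketch only. It assumes roughly 16 bytes of model state per parameter for fp16 training with Adam (2 for fp16 weights, 2 for fp16 gradients, 12 for fp32 master weights plus the two Adam moments). That is a common estimate, not a measurement of Megatron-LM. It ignores activations, communication buffers, and any layers that are replicated rather than split. model_state_gib is a hypothetical helper.

```python
# Assumed bytes of model state per parameter: fp16 weights + fp16 grads
# + fp32 master weights and two fp32 Adam moments.
BYTES_PER_PARAM = 2 + 2 + 12


def model_state_gib(num_params, mp_size):
    """Estimate model-state memory per GPU in GiB for a given model-parallel size."""
    params_per_gpu = num_params / mp_size
    return params_per_gpu * BYTES_PER_PARAM / 2**30


if __name__ == "__main__":
    # The two configurations quoted from the README.
    for label, num_params, mp_size in [
        ("2.5B params, 2-way MP", 2.5e9, 2),
        ("1.2B params, no MP", 1.2e9, 1),
    ]:
        print(f"{label}: {model_state_gib(num_params, mp_size):.1f} GiB per GPU")
```

Under these assumptions both README configurations come to about 18 to 19 GiB of model state per GPU, so the 2-way split holds roughly half of the larger model rather than using more memory per GPU. Anything beyond that on a 32G card would have to come from what the estimate leaves out.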

ShivanshuPurohit commented on August 15, 2024

I am getting out-of-memory errors with mp>1, all of which disappear if I use mp=1, with DeepSpeed Megatron enabled.

github-actions commented on August 15, 2024

Marking as stale. No activity in 60 days. Remove stale label or comment or this will be closed in 7 days.

github-actions commented on August 15, 2024

Marking as stale. No activity in 60 days.
