Comments (3)
Similar question. Here are some findings:
- DeepSpeed's docs claim that under ZeRO stage 3 with the zero.Init context, params are copied (for DP) and transferred to GPU layer by layer, while stage 2 does not do this.
- I tested a 13B model on 8x A100 80G with 1 TB of CPU memory. Under stage 2 it uses about 500 GB: 13B params * 4 bytes/param * DP8 = 416 GB, which is reasonable (see the first sketch below).
- With CPU memory cut to 500 GB (a 50% reduction): CPU OOM.
- Switching to stage 3: still CPU OOM.
So maybe we are using stage-3 zero.Init incorrectly, or DeepSpeed has a bug around here.
For your scenario, 70B * 4 bytes/param * DP4 = 1120 GB > 1 TB, hence the CPU OOM. Maybe you can try loading half the layers (40 instead of 80)? That would likely use 560+ GB, which should not cause CPU OOM.
However, even if half the layers avoid the OOM, the underlying question remains: when loading and copying weights layer by layer, does DeepSpeed release the memory of earlier layers under stage 3? (One way to check this empirically is sketched below.)
ref: https://deepspeed.readthedocs.io/en/latest/memory.html , grep "And often, it's not even possible to buy GPUs with a lot of RAM (112GB GPU anybody?) since they simply don't yet exist."
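A quick back-of-envelope helper for the arithmetic above (the function and the GB = 1e9 bytes convention are mine, not from DeepSpeed):

    # CPU memory needed if every data-parallel rank materializes a full
    # fp32 copy of the weights before ZeRO shards them.
    def full_copy_gb(params_billions: float, bytes_per_param: int, dp_ranks: int) -> float:
        return params_billions * bytes_per_param * dp_ranks  # GB (1e9 bytes)

    print(full_copy_gb(13, 4, 8))  # 416.0 -- close to the ~500 GB seen at stage 2
    print(full_copy_gb(70, 4, 4))  # 1120.0 -- over a 1 TB box, hence CPU OOM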
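To answer the release question empirically, one could watch the process's resident memory while the model is being built. A minimal sketch, assuming psutil is installed (psutil is my choice here, not something used in this thread):

    import os
    import psutil

    def rss_gb() -> float:
        # Resident set size of the current process, in GB
        return psutil.Process(os.getpid()).memory_info().rss / 1e9

    print(f"before load: {rss_gb():.1f} GB")
    # ... construct the model / load the checkpoint here ...
    print(f"after load:  {rss_gb():.1f} GB")

If stage 3 does release each layer's CPU copy after partitioning it, the peak should stay roughly one layer above the sharded footprint instead of growing with every layer.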
@shh2000 hi, I have some new findings, and it works. You need to use deepspeed.zero.Init to handle larger models, like:

    import deepspeed
    from transformers import AutoModelForCausalLM, LlamaConfig

    config = LlamaConfig.from_pretrained(model_name_or_path)
    with deepspeed.zero.Init():
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
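One possible refinement (a sketch, not from the comment above): deepspeed.zero.Init also accepts a config_dict_or_path argument, so it can pick up the stage-3 and offload settings at construction time; the ds_config values below are illustrative assumptions, not taken from this thread:

    import deepspeed
    from transformers import AutoModelForCausalLM, LlamaConfig

    # Illustrative stage-3 config: offload_param keeps each rank's weight
    # shard in pinned CPU memory instead of GPU memory.
    ds_config = {
        "train_micro_batch_size_per_gpu": 1,
        "zero_optimization": {
            "stage": 3,
            "offload_param": {"device": "cpu", "pin_memory": True},
        },
    }

    config = LlamaConfig.from_pretrained(model_name_or_path)
    with deepspeed.zero.Init(config_dict_or_path=ds_config):
        model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)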
And directly loading the model onto the GPUs works fine too. Use the device_map parameter of from_pretrained, like:

    model = AutoModelForCausalLM.from_pretrained(model_name_or_path, device_map="auto", trust_remote_code=True)
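For context, a fuller, self-contained version of that call; this assumes the accelerate package is installed, which device_map="auto" requires:

    from transformers import AutoModelForCausalLM

    # device_map="auto" loads the checkpoint shard by shard and places each
    # weight directly on an available GPU, so a full copy of the model never
    # has to sit in CPU RAM first.
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,  # a local path or a Hub checkpoint id
        device_map="auto",
        torch_dtype="auto",
        trust_remote_code=True,
    )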
Related Issues (19)
- [REQUEST] How to finetune ONLY certain subset of the network parameters
- How to finetune certain portion of the whole parameter
- [BUG] Frozen Parameters not saved when bf16 enabled but are when fp16 enabled
- Deepspeed Ulysses
- Content window is blocking text on deepspeed.ai
- [BUG] Training crashes with "'Tensor' object has no attribute 'ds_id'"
- [BUG] Memory Leak in Stage 2 Optimizer
- [BUG] import deepspeed, MissingCUDAException
- [REQUEST] Add documentation on how to run fast inference of `transformers` models with ZeRO-3
- [REQUEST] Any arguments for disabling saving global steps?
- [BUG] Jamba (Mamba+MoE) + ZeRO3 + LoRA training hangs
- [BUG] 3 GPUs is not as good as expectation compare with 2 GPUs; NV vs AMD performace; flash attention not support for AMD GPUs
- [BUG] Unexpected High Memory Usage (OOM) when finetuning Llama2-7B
- [REQUEST] Enable both CPU and NVMe for optimizer
- [BUG] Mismatch between dtype settings in model and ds_config results in NaN loss
- [REQUEST] Launcher mode with SSH bypass
- FileNotFoundError: [Errno 2] No such file or directory: ':/usr/local/cuda/bin/nvcc'
- [BUG] Uneven work distribution caused by get_shard_size changes
- [BUG] When initializing model_engine, if an mpu is specified, it can lead to an excessively large checkpoint size, and the checkpoint may not be convertible through the `zero_to_fp32.py` script.