
Recalculating the activations in the backwards pass to conserve memory (llm.c, open issue)

ChrisDryden commented on August 16, 2024

Recalculating the activations in the backwards pass to conserve memory

from llm.c.

Comments (3)

ChrisDryden commented on August 16, 2024

To start, I will call the layernorm forward from inside the backward pass and use the ln1 and ln2 values it produces directly. That gives an initial working version that recalculates these activations in the backward pass instead of keeping them in memory.
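A minimal sketch of the idea, not the llm.c implementation: the forward step can drop the layernorm output, and the backward step rebuilds it by running the same layernorm forward on the saved input. The names `block_forward`, `block_backward_input`, and `keep_ln` are hypothetical and exist only for this illustration.

```python
import math

def layernorm_forward(x, weight, bias, eps=1e-5):
    """Layernorm over one row; returns (out, mean, rstd)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rstd = 1.0 / math.sqrt(var + eps)
    out = [(v - mean) * rstd * w + b for v, w, b in zip(x, weight, bias)]
    return out, mean, rstd

def block_forward(x, weight, bias, keep_ln=True):
    """Hypothetical forward step that can skip storing the layernorm output."""
    ln, _, _ = layernorm_forward(x, weight, bias)
    saved = {"x": x}          # the layernorm input is always kept
    if keep_ln:
        saved["ln"] = ln      # only stored when recomputation is disabled
    return ln, saved

def block_backward_input(saved, weight, bias):
    """Return the layernorm output the backward pass needs, recomputing it if it was dropped."""
    if "ln" in saved:
        return saved["ln"]
    ln, _, _ = layernorm_forward(saved["x"], weight, bias)
    return ln
```

The trade is extra compute in the backward pass in exchange for not holding the layernorm outputs between the two passes.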


ChrisDryden commented on August 16, 2024

In the above PR I was able to implement the memory reduction.
Allocations went from this, with recompute set to 1:

allocating 1439 MiB for activations
val loss 4.503491
allocating 237 MiB for parameter gradients
allocating 30 MiB for activation gradients
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params

To this:

allocating 1307 MiB for activations
val loss 4.504488
allocating 237 MiB for parameter gradients
allocating 30 MiB for activation gradients
allocating 474 MiB for AdamW optimizer state m
allocating 474 MiB for AdamW optimizer state v
allocating 474 MiB for master copy of params
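For scale, the difference between the two logs can be worked out directly. This is only arithmetic on the figures printed above; the variable names are made up for the example, and the totals cover just the allocations that appear in the log.

```python
# Figures copied from the two logs above (MiB).
activations_before = 1439
activations_after = 1307

# Allocations that are identical in both logs:
# parameter gradients, activation gradients, AdamW m, AdamW v, master params.
unchanged = 237 + 30 + 474 + 474 + 474

saved_mib = activations_before - activations_after
saved_pct_of_activations = 100.0 * saved_mib / activations_before
saved_pct_of_logged_total = 100.0 * saved_mib / (activations_before + unchanged)

print(saved_mib)                            # 132
print(round(saved_pct_of_activations, 1))   # 9.2
print(round(saved_pct_of_logged_total, 1))  # 4.2
```

So the change frees 132 MiB, about 9% of the activation memory, while the two reported val loss values differ by roughly 0.001.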


ChrisDryden commented on August 16, 2024

The PR was merged, but it still needs the second step: a simplified kernel that doesn't recompute everything and instead reuses the values already calculated in the forward pass.
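One possible shape for that second step, sketched under an assumption the thread does not confirm: that the reusable values are the per-row mean and reciprocal standard deviation from the layernorm forward. With those saved, the output can be rebuilt element by element, with no reductions over the row. `layernorm_from_stats` is a hypothetical name, not an llm.c function.

```python
import math

def layernorm_forward(x, weight, bias, eps=1e-5):
    """Full layernorm over one row; returns (out, mean, rstd)."""
    n = len(x)
    mean = sum(x) / n
    var = sum((v - mean) ** 2 for v in x) / n
    rstd = 1.0 / math.sqrt(var + eps)
    out = [(v - mean) * rstd * w + b for v, w, b in zip(x, weight, bias)]
    return out, mean, rstd

def layernorm_from_stats(x, weight, bias, mean, rstd):
    """Rebuild the output from saved statistics (assumed reuse; no mean/variance pass)."""
    return [(v - mean) * rstd * w + b for v, w, b in zip(x, weight, bias)]
```

Storing two scalars per row is far cheaper than storing the full output row, so the memory saving is kept while the recompute cost drops to one elementwise pass.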

