Comments (7)
I am not able to replicate this:
import torch
import torchao
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
from transformer_nuggets.utils.benchmark import save_memory_snapshot

print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
with save_memory_snapshot(Path("nf4_memory")):
    original = torch.rand([1024, 4096], dtype=torch.bfloat16, device="cuda")
    print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
    t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(original, 64, 256)
    del original
    for _ in range(10):
        a = torch.empty(4096, dtype=torch.bfloat16, device="cuda")
    print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
Produces:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 8388608, Memory Reserved: 20971520
Memory allocated: 2302976, Memory Reserved: 90177536
The final memory allocated by the NF4Tensor is: 2302976
(2302976 / (1024 * 4096)) ≈ 0.55 bytes/param, i.e. about 4.39 bits/param
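As a side note, here is a minimal sketch of that per-parameter arithmetic (the 2302976 figure is just what torch.cuda.memory_allocated() reported in the run above, so it will vary by machine and version):

num_params = 1024 * 4096                    # elements in the original bf16 tensor
nf4_bytes = 2302976                         # memory allocated after quantization (from the run above)
bytes_per_param = nf4_bytes / num_params    # ~0.55 bytes/param
bits_per_param = bytes_per_param * 8        # ~4.39 bits/param (4-bit data plus scale/metadata overhead)
print(f"{bytes_per_param:.2f} B/param, {bits_per_param:.2f} bits/param")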
Hi @drisspg, thanks for the prompt reply! I can get the same result as you using your script. I found out that if the NF4 tensor is initialized from a CPU tensor and then moved to the GPU, the memory usage is about 8 bits per parameter. Also, calling .cuda() the first time doesn't seem to have an effect?
import torch
import torchao
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
# from transformer_nuggets.utils.benchmark import save_memory_snapshot
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
# with save_memory_snapshot(Path("nf4_memory")):
original = torch.rand([1024, 4096], dtype=torch.bfloat16)
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(original, 64, 256)
del original
t4 = t4.cuda()
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
t4 = t4.cuda()
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
Produces:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 4194304, Memory Reserved: 20971520
Yeah I think I broke this:
ao/torchao/dtypes/nf4tensor.py, line 50 in 0dfcbfd
Ahhh, you are right, the problem was in the implementation of .to(), but I think that has since been resolved on main: https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py#L899-L923
Previously you were actually just getting a bf16 tensor back silently, since the conversion wasn't supported.
Actually, one caveat: if you call t.cuda(), this will end up returning a full bf16 value, but if you call t.to("cuda") it should work as expected.
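A minimal way to check which path preserves the NF4 representation (just a sketch of the behavior described above; exact memory numbers will differ by machine and torchao version):

import torch
from torchao.dtypes.nf4tensor import NF4Tensor

cpu_weight = torch.rand([1024, 4096], dtype=torch.bfloat16)
nf4_cpu = NF4Tensor.from_tensor(cpu_weight, 64, 256)

# Path 1: .to("cuda") -- on main this should keep the NF4 representation.
moved = nf4_cpu.to("cuda")
print(type(moved).__name__, torch.cuda.memory_allocated())
del moved

# Path 2: .cuda() -- per the caveat above, this may hand back a plain bf16 tensor.
maybe_bf16 = nf4_cpu.cuda()
print(type(maybe_bf16).__name__, torch.cuda.memory_allocated())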
Thanks for the comments!
Using a nightly release from 2024.4.26, the output from my snippet above is:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 8388608, Memory Reserved: 20971520
Memory allocated: 8388608, Memory Reserved: 20971520
and this happens for both .cuda() and .to('cuda').
I think there's still a memory issue with moving an NF4 tensor from CPU to GPU.
Initializing the NF4 tensor directly on GPU still produces the correct result.
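For reference, a quick way to confirm which torchao build is installed (assuming it was installed with pip, so the package metadata is available):

# Print the installed torchao version to check whether the nightly predates the fix.
from importlib.metadata import version
print(version("torchao"))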
I think the update landed 2 days ago, so it likely isn't in that package.