Comments (7)
I am not able to replicate this:
import torch
import torchao
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
from transformer_nuggets.utils.benchmark import save_memory_snapshot

print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
with save_memory_snapshot(Path("nf4_memory")):
    original = torch.rand([1024, 4096], dtype=torch.bfloat16, device="cuda")
    print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
    t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(original, 64, 256)
    del original
    for _ in range(10):
        a = torch.empty(4096, dtype=torch.bfloat16, device="cuda")
    print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
Produces:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 8388608, Memory Reserved: 20971520
Memory allocated: 2302976, Memory Reserved: 90177536
The final memory allocated by the NF4Tensor is: 2302976
(2302976 / (1024 * 4096)) ≈ 0.55 bytes/param, i.e. about 4.39 bits/param
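As a side note, here is a minimal sketch of that per-parameter arithmetic (the 2302976 figure is just what torch.cuda.memory_allocated() reported in the run above, so it will vary by machine and version):

num_params = 1024 * 4096                    # elements in the original bf16 tensor
nf4_bytes = 2302976                         # memory allocated after quantization (from the run above)
bytes_per_param = nf4_bytes / num_params    # ~0.55 bytes/param
bits_per_param = bytes_per_param * 8        # ~4.39 bits/param (4-bit data plus scale/metadata overhead)
print(f"{bytes_per_param:.2f} B/param, {bits_per_param:.2f} bits/param")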
Hi @drisspg, thanks for the prompt reply! I can get the same result as you using your script. I found out that if the NF4 tensor is initialized from a CPU tensor and then moved to the GPU, the memory usage is about 8 bits per parameter. Also, calling .cuda() the first time doesn't seem to have an effect?
import torch
import torchao
from pathlib import Path
import logging
logging.basicConfig(level=logging.INFO)
# from transformer_nuggets.utils.benchmark import save_memory_snapshot
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
# with save_memory_snapshot(Path("nf4_memory")):
original = torch.rand([1024, 4096], dtype=torch.bfloat16)
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
t4 = torchao.dtypes.nf4tensor.NF4Tensor.from_tensor(original, 64, 256)
del original
t4 = t4.cuda()
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
t4 = t4.cuda()
print(f"Memory allocated: {torch.cuda.memory_allocated()}, Memory Reserved: {torch.cuda.memory_reserved()}")
Produces:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 4194304, Memory Reserved: 20971520
Yeah I think I broke this:
ao/torchao/dtypes/nf4tensor.py, line 50 in 0dfcbfd
Ahhh, you are right, the problem was in the implementation of .to(), but I think that has since been resolved on main: https://github.com/pytorch/ao/blob/main/torchao/dtypes/nf4tensor.py#L899-L923
Previously you were actually just getting a bf16 tensor back silently, since the conversion wasn't supported.
Actually, one caveat: if you call t.cuda(), this will end up returning a full bf16 value, but if you call t.to("cuda") it should work as expected.
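A minimal way to check which path preserves the NF4 representation (just a sketch of the behavior described above; exact memory numbers will differ by machine and torchao version):

import torch
from torchao.dtypes.nf4tensor import NF4Tensor

cpu_weight = torch.rand([1024, 4096], dtype=torch.bfloat16)
nf4_cpu = NF4Tensor.from_tensor(cpu_weight, 64, 256)

# Path 1: .to("cuda") -- on main this should keep the NF4 representation.
moved = nf4_cpu.to("cuda")
print(type(moved).__name__, torch.cuda.memory_allocated())
del moved

# Path 2: .cuda() -- per the caveat above, this may hand back a plain bf16 tensor.
maybe_bf16 = nf4_cpu.cuda()
print(type(maybe_bf16).__name__, torch.cuda.memory_allocated())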
Thanks for the comments!
Using a nightly release from 2024.4.26, the output from my snippet above is:
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 0, Memory Reserved: 0
Memory allocated: 8388608, Memory Reserved: 20971520
Memory allocated: 8388608, Memory Reserved: 20971520
and this happens for both .cuda() and .to('cuda').
I think there's still a memory issue with moving an NF4 tensor from CPU to GPU.
Initializing the NF4 tensor directly on GPU still produces the correct result.
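For reference, a quick way to confirm which torchao build is installed (assuming it was installed with pip, so the package metadata is available):

# Print the installed torchao version to check whether the nightly predates the fix.
from importlib.metadata import version
print(version("torchao"))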
I think the update landed 2 days ago, so it likely isn't in that package.