Comments (12)
RTX 4070 Ti Super, running Ubuntu 22.04.
torch==2.4.0.dev20240426+cu121
bfloat16, cutlass
Fixed k
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 3072 | 3072 | 10240 | 1.10574 | 2.131 | 1.92722 |
1 | 4096 | 4096 | 10240 | 1.9605 | 3.73044 | 1.9028 |
2 | 5120 | 5120 | 10240 | 3.12083 | 6.10269 | 1.95547 |
3 | 6144 | 6144 | 10240 | 4.74411 | 8.79509 | 1.8539 |
4 | 7168 | 7168 | 10240 | 7.29741 | 11.9486 | 1.63738 |
5 | 8192 | 8192 | 10240 | 10.6073 | 15.4296 | 1.45462 |
6 | 9216 | 9216 | 10240 | 13.6835 | 19.1741 | 1.40125 |
7 | 10240 | 10240 | 10240 | 16.8367 | 23.4461 | 1.39256 |
8 | 11264 | 11264 | 10240 | 20.37 | 28.2801 | 1.38832 |
9 | 12288 | 12288 | 10240 | 24.1402 | 33.545 | 1.38959 |
10 | 13312 | 13312 | 10240 | 28.4292 | 39.2493 | 1.3806 |
11 | 14336 | 14336 | 10240 | 32.851 | 45.5614 | 1.38691 |
12 | 15360 | 15360 | 10240 | 37.7906 | 54.6426 | 1.44593 |
13 | 16384 | 16384 | 10240 | 42.789 | 63.5041 | 1.48412 |
14 | 17408 | 17408 | 10240 | 48.5377 | 69.684 | 1.43567 |
15 | 18432 | 18432 | 10240 | 54.2561 | 77.7116 | 1.43231 |
16 | 19456 | 19456 | 10240 | 60.3411 | 85.183 | 1.41169 |
17 | 20480 | 20480 | 10240 | 66.7151 | 97.5466 | 1.46214 |
Fixed mn
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 10240 | 10240 | 2560 | 3.12135 | 6.23817 | 1.99855 |
1 | 10240 | 10240 | 3840 | 4.59394 | 9.28166 | 2.02041 |
2 | 10240 | 10240 | 5120 | 7.15086 | 12.251 | 1.71322 |
3 | 10240 | 10240 | 6400 | 10.5324 | 14.7059 | 1.39625 |
4 | 10240 | 10240 | 7680 | 13.0499 | 18.0573 | 1.38372 |
5 | 10240 | 10240 | 8960 | 15.3995 | 20.6897 | 1.34353 |
6 | 10240 | 10240 | 10240 | 16.8406 | 23.4697 | 1.39364 |
7 | 10240 | 10240 | 11520 | 19.2673 | 26.2984 | 1.36493 |
8 | 10240 | 10240 | 12800 | 20.9322 | 29.0503 | 1.38782 |
9 | 10240 | 10240 | 14080 | 23.14 | 31.9612 | 1.38121 |
10 | 10240 | 10240 | 15360 | 25.6844 | 34.6865 | 1.35049 |
11 | 10240 | 10240 | 16640 | 26.2421 | 37.4893 | 1.42859 |
12 | 10240 | 10240 | 17920 | 30.1967 | 40.3297 | 1.33556 |
13 | 10240 | 10240 | 19200 | 32.4673 | 43.1666 | 1.32954 |
14 | 10240 | 10240 | 20480 | 33.5382 | 46.002 | 1.37163 |
SAM ViT-B shapes
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 32768 | 768 | 3072 | 1.22253 | 1.7901 | 1.46426 |
1 | 32768 | 2304 | 768 | 0.787232 | 1.33425 | 1.69486 |
2 | 32768 | 3072 | 768 | 1.04701 | 1.74003 | 1.66191 |
3 | 32768 | 768 | 768 | 0.271155 | 0.437884 | 1.61488 |
4 | 39200 | 2304 | 768 | 0.948154 | 1.5765 | 1.66271 |
5 | 39200 | 768 | 768 | 0.324627 | 0.510302 | 1.57196 |
I omit some redundant columns from the saved CSV file. The `correct` and `contiguous` columns are all True.
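For readers unfamiliar with the format being benchmarked here: 2:4 semi-structured sparsity keeps at most two nonzero values in every contiguous group of four along a row, which is what lets the sparse tensor cores skip half of the multiply-accumulates. A minimal pure-Python sketch of the usual magnitude-based pruning rule (an illustration of the pattern only, not the kernel PyTorch/torchao actually uses):

```python
def prune_2_4(row):
    """Zero out the two smallest-magnitude values in each group of 4.

    `row` is a flat list whose length is a multiple of 4. Returns a new
    list satisfying the 2:4 pattern (<= 2 nonzeros per group of 4).
    """
    assert len(row) % 4 == 0, "2:4 sparsity works on groups of 4 elements"
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.1, -3.0, 2.0, 0.5, 1.0, 1.0, -0.2, 0.0]))
# [0.0, -3.0, 2.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

In the real workflow this pruning happens on the weight matrix before calling `to_sparse_semi_structured`, which then repacks the surviving values into the compressed layout the kernels consume.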
from ao.
@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
I think this might be an issue with windows, but I'm not sure.
from ao.
Nice work @gau-nernst, pretty cool to see results that seem uniformly faster
@philipbutler would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it
from ao.
@gau-nernst 💯 Thanks for running these - that's awesome! For others reading, I'd like to collect these, with our A100 results somewhere. So please contribute and I'll collate these together in a nice doc. We can also collect block sparse microbenchmarks too, I know @cpuhrsch is interested in those.
@philipbutler Thank you for giving it a shot + your edits were super helpful too :) . Yeah, I agree with Mark that dual booting Linux is probably the easiest solution - but could you open an issue in PyTorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?
from ao.
Had to set up this PC, so I had to do a clean Python install, and noticed that neither `pandas` nor `tqdm` is in `requirements.txt`.
from ao.
The benchmark command should use `--dtype bf16`.
from ao.
Ran into `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`.
Consider adding "install CUDA 12.1" and the CUTLASS Quickstart to the steps.
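For anyone hitting the same error: the CUTLASS/cuSPARSELt sparse kernels need an NVIDIA GPU with compute capability 8.0+ (Ampere or newer) and a recent CUDA toolkit, so a quick environment pre-flight check is worth doing before digging into the install. A small hedged helper; the CUDA version threshold below is my assumption, not an official requirements list:

```python
def supports_semi_structured(cc_major, cc_minor, cuda_version):
    """Rough pre-flight check for 2:4 semi-structured sparse kernels.

    cc_major/cc_minor: CUDA compute capability of the GPU (a 4070 Ti
    Super is SM 8.9). cuda_version: a "12.1"-style version string.
    Thresholds are assumptions: SM >= 8.0 (Ampere+) and CUDA >= 11.8.
    """
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    return (cc_major, cc_minor) >= (8, 0) and (major, minor) >= (11, 8)

# A 4070 Ti Super (SM 8.9) with CUDA 12.1 should pass:
print(supports_semi_structured(8, 9, "12.1"))  # True
# A Pascal-era GTX 1080 (SM 6.1) should not:
print(supports_semi_structured(6, 1, "12.1"))  # False
```

On a live machine you could feed it `torch.cuda.get_device_capability()` and `torch.version.cuda` instead of hard-coded values.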
Running through it now!
(I'm confused rn)
from ao.
Actually, @jcaip, does it make sense that `to_sparse_semi_structured(torch.ones(256, 256).half().cuda())` works, but running the first benchmark script shows `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`?
from ao.
That's strange to me @philipbutler let me think for a bit
Can you open PowerShell, run `nvidia-smi`, and screenshot the results?
from ao.
@jcaip Just making this as easy as possible for future benchmarking, step 2 should say:

```python
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
```
from ao.
> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies? I think this might be an issue with windows, but I'm not sure.
@jcaip Same error with the 2.3 release
from ao.