Comments (12)
RTX 4070 Ti Super, running Ubuntu 22.04.
torch==2.4.0.dev20240426+cu121
bfloat16, cutlass
Fixed k
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 3072 | 3072 | 10240 | 1.10574 | 2.131 | 1.92722 |
1 | 4096 | 4096 | 10240 | 1.9605 | 3.73044 | 1.9028 |
2 | 5120 | 5120 | 10240 | 3.12083 | 6.10269 | 1.95547 |
3 | 6144 | 6144 | 10240 | 4.74411 | 8.79509 | 1.8539 |
4 | 7168 | 7168 | 10240 | 7.29741 | 11.9486 | 1.63738 |
5 | 8192 | 8192 | 10240 | 10.6073 | 15.4296 | 1.45462 |
6 | 9216 | 9216 | 10240 | 13.6835 | 19.1741 | 1.40125 |
7 | 10240 | 10240 | 10240 | 16.8367 | 23.4461 | 1.39256 |
8 | 11264 | 11264 | 10240 | 20.37 | 28.2801 | 1.38832 |
9 | 12288 | 12288 | 10240 | 24.1402 | 33.545 | 1.38959 |
10 | 13312 | 13312 | 10240 | 28.4292 | 39.2493 | 1.3806 |
11 | 14336 | 14336 | 10240 | 32.851 | 45.5614 | 1.38691 |
12 | 15360 | 15360 | 10240 | 37.7906 | 54.6426 | 1.44593 |
13 | 16384 | 16384 | 10240 | 42.789 | 63.5041 | 1.48412 |
14 | 17408 | 17408 | 10240 | 48.5377 | 69.684 | 1.43567 |
15 | 18432 | 18432 | 10240 | 54.2561 | 77.7116 | 1.43231 |
16 | 19456 | 19456 | 10240 | 60.3411 | 85.183 | 1.41169 |
17 | 20480 | 20480 | 10240 | 66.7151 | 97.5466 | 1.46214 |
Fixed mn
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 10240 | 10240 | 2560 | 3.12135 | 6.23817 | 1.99855 |
1 | 10240 | 10240 | 3840 | 4.59394 | 9.28166 | 2.02041 |
2 | 10240 | 10240 | 5120 | 7.15086 | 12.251 | 1.71322 |
3 | 10240 | 10240 | 6400 | 10.5324 | 14.7059 | 1.39625 |
4 | 10240 | 10240 | 7680 | 13.0499 | 18.0573 | 1.38372 |
5 | 10240 | 10240 | 8960 | 15.3995 | 20.6897 | 1.34353 |
6 | 10240 | 10240 | 10240 | 16.8406 | 23.4697 | 1.39364 |
7 | 10240 | 10240 | 11520 | 19.2673 | 26.2984 | 1.36493 |
8 | 10240 | 10240 | 12800 | 20.9322 | 29.0503 | 1.38782 |
9 | 10240 | 10240 | 14080 | 23.14 | 31.9612 | 1.38121 |
10 | 10240 | 10240 | 15360 | 25.6844 | 34.6865 | 1.35049 |
11 | 10240 | 10240 | 16640 | 26.2421 | 37.4893 | 1.42859 |
12 | 10240 | 10240 | 17920 | 30.1967 | 40.3297 | 1.33556 |
13 | 10240 | 10240 | 19200 | 32.4673 | 43.1666 | 1.32954 |
14 | 10240 | 10240 | 20480 | 33.5382 | 46.002 | 1.37163 |
SAM ViT-B shapes
| | m | n | k | sparse_latency (ms) | dense_latency (ms) | speedup (d/s) |
|---|---|---|---|---|---|---|
0 | 32768 | 768 | 3072 | 1.22253 | 1.7901 | 1.46426 |
1 | 32768 | 2304 | 768 | 0.787232 | 1.33425 | 1.69486 |
2 | 32768 | 3072 | 768 | 1.04701 | 1.74003 | 1.66191 |
3 | 32768 | 768 | 768 | 0.271155 | 0.437884 | 1.61488 |
4 | 39200 | 2304 | 768 | 0.948154 | 1.5765 | 1.66271 |
5 | 39200 | 768 | 768 | 0.324627 | 0.510302 | 1.57196 |
I omit some redundant columns from the saved CSV file. The `correct` and `contiguous` columns are all True.
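For readers unfamiliar with the format being benchmarked here: 2:4 semi-structured sparsity keeps at most two nonzero values in every contiguous group of four along a row, which is what lets the sparse tensor cores skip half of the multiply-accumulates. A minimal pure-Python sketch of the usual magnitude-based pruning rule (an illustration of the pattern only, not the kernel PyTorch/torchao actually uses):

```python
def prune_2_4(row):
    """Zero out the two smallest-magnitude values in each group of 4.

    `row` is a flat list whose length is a multiple of 4. Returns a new
    list satisfying the 2:4 pattern (<= 2 nonzeros per group of 4).
    """
    assert len(row) % 4 == 0, "2:4 sparsity works on groups of 4 elements"
    out = []
    for i in range(0, len(row), 4):
        group = row[i:i + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda j: abs(group[j]), reverse=True)[:2]
        out.extend(v if j in keep else 0.0 for j, v in enumerate(group))
    return out

print(prune_2_4([0.1, -3.0, 2.0, 0.5, 1.0, 1.0, -0.2, 0.0]))
# [0.0, -3.0, 2.0, 0.0, 1.0, 1.0, 0.0, 0.0]
```

In the real workflow this pruning happens on the weight matrix before calling `to_sparse_semi_structured`, which then repacks the surviving values into the compressed layout the kernels consume.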
from ao.
@philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies?
I think this might be an issue with windows, but I'm not sure.
from ao.
Nice work @gau-nernst, pretty cool to see results that seem uniformly faster
@philipbutler would highly recommend using WSL or dual booting (I personally dual boot); getting Windows and CUDA to work together is just not worth it
from ao.
@gau-nernst 💯 Thanks for running these - that's awesome! For others reading, I'd like to collect these, with our A100 results somewhere. So please contribute and I'll collate these together in a nice doc. We can also collect block sparse microbenchmarks too, I know @cpuhrsch is interested in those.
@philipbutler Thank you for giving it a shot + your edits were super helpful too :) . Yeah, I agree with Mark that dual booting Linux is probably the easiest solution - but could you open an issue in PyTorch for tracking purposes (feel free to tag me) about the lack of Windows support for semi-structured sparsity?
from ao.
Had to set up this PC, so I had to do a clean Python install, and noticed that neither `pandas` nor `tqdm` is in `requirements.txt`.
from ao.
The benchmark command should use `--dtype bf16`.
from ao.
Ran into `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`.
Consider adding "install CUDA 12.1" and the CUTLASS Quickstart to the steps.
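For anyone hitting the same error: the CUTLASS/cuSPARSELt sparse kernels need an NVIDIA GPU with compute capability 8.0+ (Ampere or newer) and a recent CUDA toolkit, so a quick environment pre-flight check is worth doing before digging into the install. A small hedged helper; the CUDA version threshold below is my assumption, not an official requirements list:

```python
def supports_semi_structured(cc_major, cc_minor, cuda_version):
    """Rough pre-flight check for 2:4 semi-structured sparse kernels.

    cc_major/cc_minor: CUDA compute capability of the GPU (a 4070 Ti
    Super is SM 8.9). cuda_version: a "12.1"-style version string.
    Thresholds are assumptions: SM >= 8.0 (Ampere+) and CUDA >= 11.8.
    """
    major, minor = (int(x) for x in cuda_version.split(".")[:2])
    return (cc_major, cc_minor) >= (8, 0) and (major, minor) >= (11, 8)

# A 4070 Ti Super (SM 8.9) with CUDA 12.1 should pass:
print(supports_semi_structured(8, 9, "12.1"))  # True
# A Pascal-era GTX 1080 (SM 6.1) should not:
print(supports_semi_structured(6, 1, "12.1"))  # False
```

On a live machine you could feed it `torch.cuda.get_device_capability()` and `torch.version.cuda` instead of hard-coded values.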
Running through it now!
(I'm confused rn)
from ao.
Actually, @jcaip, does it make sense that `to_sparse_semi_structured(torch.ones(256, 256).half().cuda())` works, but running the first benchmark script shows `RuntimeError: sparse_semi_structured_mad_op : CUTLASS not supported`?
from ao.
That's strange to me @philipbutler let me think for a bit
Can you open PowerShell, run `nvidia-smi`, and screenshot the results?
from ao.
@jcaip Just making this as easy as possible for future benchmarking, step 2 should say:

```python
import torch
from torch.sparse import to_sparse_semi_structured
to_sparse_semi_structured(torch.ones(256, 256).half().cuda())
```
from ao.
> @philipbutler as a sanity check - can you run using the 2.3 release instead of the nightlies? I think this might be an issue with windows, but I'm not sure.
@jcaip Same error with the 2.3 release
from ao.