Hi. Thanks for this work guys. I was curious as to whether you had b

[Pytorch] AMD GPUs benchmarks about pytorch HOT 20 CLOSED

dmenig commented on August 31, 2024

[Pytorch] AMD GPUs benchmarks

from pytorch.

Comments (20)

iotamudelta commented on August 31, 2024 1

Typically a few hours for a single network. It depends on how many unique convolution configs are missing from our internal database.

dmenig commented on August 31, 2024

These are TensorFlow benchmarks:

http://blog.gpueater.com/en/2018/04/23/00011_tech_cifar10_bench_on_tf13/

iotamudelta commented on August 31, 2024

Thanks for the report! We are working on performance optimizations at the moment. If you are interested in helping, could you export MIOPEN_FIND_ENFORCE=3 prior to running the benchmark? This will tune the MIOpen convolution kernels for the benchmark (it will take a long time though!) - that way we could see what the "ideal" performance currently is.
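For anyone scripting this rather than typing the export by hand, here is a minimal Python sketch of the same idea: copy the current environment, set MIOPEN_FIND_ENFORCE=3, and launch the benchmark as a child process. The helper names and the stand-in command are hypothetical; only the variable name and the value 3 come from the comment above.

```python
import os
import subprocess
import sys


def tuning_env(base=None, value="3"):
    """Return a copy of the environment with MIOPEN_FIND_ENFORCE set.

    The default value "3" is the one suggested in this thread; other
    values are not discussed here.
    """
    env = dict(os.environ if base is None else base)
    env["MIOPEN_FIND_ENFORCE"] = value
    return env


def run_with_tuning(cmd):
    """Run cmd (a list of strings) as a child process with tuning enabled."""
    return subprocess.run(cmd, env=tuning_env(), check=True)


if __name__ == "__main__":
    # Stand-in for the real benchmark command (assumption): the child
    # process only echoes the variable it inherited.
    run_with_tuning([
        sys.executable,
        "-c",
        "import os; print(os.environ['MIOPEN_FIND_ENFORCE'])",
    ])
```

Setting the variable on the child process only keeps the tuning from leaking into other jobs started from the same shell.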

FelixSchwarz commented on August 31, 2024

> it will take a long time though!

How long is "long" on an RX 580? Some hours? A day? A week?

dmenig commented on August 31, 2024

I ran this command with this benchmark:
https://github.com/ryujaehun/pytorch-gpu-benchmark

A whole lot of tests were running. I waited overnight for it to finish. When I came back this morning the task was frozen, with radeontop showing every metric at 100% utilization (I doubt that, though).

I looked for results but found none. I'll try another, maybe lighter, benchmark.

dmenig commented on August 31, 2024

Not sure if this is relevant for you, but my GPU is no longer detected. This is the log from when I killed the first benchmark after it froze. Working on it.

déc. 12 10:15:25 redqueen kernel: WARNING: CPU: 0 PID: 1429 at /var/lib/dkms/amdgpu/1.9-307/build/amd/amdgpu/../display/dc/dc_helper.c:254 generic_reg_wait+0xed/0x170 [amdgpu]
déc. 12 10:15:25 redqueen kernel: Modules linked in: md4 nls_utf8 cifs ccm fscache veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack
déc. 12 10:15:25 redqueen kernel:  aes_x86_64 crypto_simd btintel glue_helper bluetooth cryptd intel_cstate intel_rapl_perf mei_me ecdh_generic soundcore mei shpchp acpi_pad mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 amdkfd(OE) amd_iommu_v2
déc. 12 10:15:25 redqueen kernel: CPU: 0 PID: 1429 Comm: kworker/0:0 Tainted: G        W  OE    4.15.0-42-generic #45-Ubuntu
déc. 12 10:15:25 redqueen kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370M-ITX/ac, BIOS P1.20 09/13/2017
déc. 12 10:15:25 redqueen kernel: Workqueue: events kfd_process_hw_exception [amdkfd]
déc. 12 10:15:25 redqueen kernel: RIP: 0010:generic_reg_wait+0xed/0x170 [amdgpu]
déc. 12 10:15:25 redqueen kernel: RSP: 0018:ffffa292c55bbc90 EFLAGS: 00010297
déc. 12 10:15:25 redqueen kernel: RAX: 000000000000039f RBX: 0000000000000bb9 RCX: 0000000000000000
déc. 12 10:15:25 redqueen kernel: RDX: 0000000000000000 RSI: ffff92d06ec16498 RDI: ffff92d06ec16498
déc. 12 10:15:25 redqueen kernel: RBP: ffffa292c55bbcd0 R08: 0000000000000000 R09: 000000000005a9a0
déc. 12 10:15:25 redqueen kernel: R10: 00000000ffffffff R11: ffffffff88b5380e R12: 000000000000000a
déc. 12 10:15:25 redqueen kernel: R13: ffff92d03e846200 R14: 0000000000010000 R15: 0000000000000000
déc. 12 10:15:25 redqueen kernel: FS:  0000000000000000(0000) GS:ffff92d06ec00000(0000) knlGS:0000000000000000
déc. 12 10:15:25 redqueen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
déc. 12 10:15:25 redqueen kernel: CR2: 00007f2fb125ba38 CR3: 000000030f80a006 CR4: 00000000003606f0
déc. 12 10:15:25 redqueen kernel: Call Trace:
déc. 12 10:15:25 redqueen kernel:  dce110_stream_encoder_dp_blank+0x12f/0x1a0 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  power_down_all_hw_blocks+0x44/0x1e0 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  dce110_power_down+0x12/0x20 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  dc_set_power_state+0x20/0x80 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  dm_suspend+0x4e/0x60 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  amdgpu_device_ip_suspend+0xcf/0x190 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  amdgpu_device_gpu_recover+0x451/0x7b0 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  ? add_timer+0x124/0x280
déc. 12 10:15:25 redqueen kernel:  amdgpu_amdkfd_gpu_reset+0x12/0x20 [amdgpu]
déc. 12 10:15:25 redqueen kernel:  kfd_process_hw_exception+0x26/0x30 [amdkfd]
déc. 12 10:15:25 redqueen kernel:  process_one_work+0x1de/0x410
déc. 12 10:15:25 redqueen kernel:  worker_thread+0x32/0x410
déc. 12 10:15:25 redqueen kernel:  kthread+0x121/0x140
déc. 12 10:15:25 redqueen kernel:  ? process_one_work+0x410/0x410
déc. 12 10:15:25 redqueen kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
déc. 12 10:15:25 redqueen kernel:  ? do_syscall_64+0x73/0x130
déc. 12 10:15:25 redqueen kernel:  ? SyS_exit+0x17/0x20
déc. 12 10:15:25 redqueen kernel:  ret_from_fork+0x35/0x40
déc. 12 10:15:25 redqueen kernel: Code: 44 8b 45 10 44 89 e1 48 c7 c2 28 e7 77 c0 48 c7 c7 b9 6e 78 c0 44 89 55 d4 50 e8 9f cd c5 ff 41 83 7d 20 01 44 8b 55 d4 58 74 02 <0f> 0b 48 8d 65 d8 44 89 d0 5b 41 5c 41 5d 41 5e 41 5f 5d c3 41 
déc. 12 10:15:25 redqueen kernel: ---[ end trace e3447af507105f30 ]---

dmenig commented on August 31, 2024

This is problematic as well:

déc. 12 10:15:45 redqueen kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing DA7E (len 824, WS 0, PS 0) @ 0xDBFE
déc. 12 10:15:45 redqueen kernel: [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing D938 (len 326, WS 0, PS 0) @ 0xDA28
déc. 12 10:15:45 redqueen kernel: [drm:dce110_link_encoder_disable_output [amdgpu]] *ERROR* dce110_link_encoder_disable_output: Failed to execute VBIOS command table!
déc. 12 10:15:45 redqueen kernel: WARNING: CPU: 0 PID: 1429 at /var/lib/dkms/amdgpu/1.9-307/build/amd/amdgpu/../display/dc/dce/dce_link_encoder.c:1062 dce110_link_encoder_disable_output+0x16b/0x180 [amdgpu]
déc. 12 10:15:45 redqueen kernel: Modules linked in: md4 nls_utf8 cifs ccm fscache veth ipt_MASQUERADE nf_nat_masquerade_ipv4 nf_conntrack_netlink nfnetlink xfrm_user xfrm_algo iptable_nat nf_conntrack_ipv4 nf_defrag_ipv4 nf_nat_ipv4 xt_addrtype iptable_filter xt_conntrack
déc. 12 10:15:45 redqueen kernel:  aes_x86_64 crypto_simd btintel glue_helper bluetooth cryptd intel_cstate intel_rapl_perf mei_me ecdh_generic soundcore mei shpchp acpi_pad mac_hid sch_fq_codel parport_pc ppdev lp parport ip_tables x_tables autofs4 amdkfd(OE) amd_iommu_v2
déc. 12 10:15:45 redqueen kernel: CPU: 0 PID: 1429 Comm: kworker/0:0 Tainted: G        W  OE    4.15.0-42-generic #45-Ubuntu
déc. 12 10:15:45 redqueen kernel: Hardware name: To Be Filled By O.E.M. To Be Filled By O.E.M./Z370M-ITX/ac, BIOS P1.20 09/13/2017
déc. 12 10:15:45 redqueen kernel: Workqueue: events kfd_process_hw_exception [amdkfd]
déc. 12 10:15:45 redqueen kernel: RIP: 0010:dce110_link_encoder_disable_output+0x16b/0x180 [amdgpu]
déc. 12 10:15:45 redqueen kernel: RSP: 0018:ffffa292c55bbcd8 EFLAGS: 00010286
déc. 12 10:15:45 redqueen kernel: RAX: 0000000000000000 RBX: ffff92d03f68b7e0 RCX: 0000000000000006
déc. 12 10:15:45 redqueen kernel: RDX: 0000000000000000 RSI: 0000000000000096 RDI: ffff92d06ec16490
déc. 12 10:15:45 redqueen kernel: RBP: ffffa292c55bbd30 R08: 0000000000000000 R09: 000000000005aa8d
déc. 12 10:15:45 redqueen kernel: R10: ffff92d03f7c5540 R11: ffffffff88b5380e R12: 0000000000000000
déc. 12 10:15:45 redqueen kernel: R13: ffffa292c55bbcdc R14: 0000000000000080 R15: 000000000000000c
déc. 12 10:15:45 redqueen kernel: FS:  0000000000000000(0000) GS:ffff92d06ec00000(0000) knlGS:0000000000000000
déc. 12 10:15:45 redqueen kernel: CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
déc. 12 10:15:45 redqueen kernel: CR2: 00007f2fb125ba38 CR3: 000000030f80a006 CR4: 00000000003606f0
déc. 12 10:15:45 redqueen kernel: Call Trace:
déc. 12 10:15:45 redqueen kernel:  power_down_all_hw_blocks+0x84/0x1e0 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  dce110_power_down+0x12/0x20 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  dc_set_power_state+0x20/0x80 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  dm_suspend+0x4e/0x60 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  amdgpu_device_ip_suspend+0xcf/0x190 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  amdgpu_device_gpu_recover+0x451/0x7b0 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  ? add_timer+0x124/0x280
déc. 12 10:15:45 redqueen kernel:  amdgpu_amdkfd_gpu_reset+0x12/0x20 [amdgpu]
déc. 12 10:15:45 redqueen kernel:  kfd_process_hw_exception+0x26/0x30 [amdkfd]
déc. 12 10:15:45 redqueen kernel:  process_one_work+0x1de/0x410
déc. 12 10:15:45 redqueen kernel:  worker_thread+0x32/0x410
déc. 12 10:15:45 redqueen kernel:  kthread+0x121/0x140
déc. 12 10:15:45 redqueen kernel:  ? process_one_work+0x410/0x410
déc. 12 10:15:45 redqueen kernel:  ? kthread_create_worker_on_cpu+0x70/0x70
déc. 12 10:15:45 redqueen kernel:  ? do_syscall_64+0x73/0x130
déc. 12 10:15:45 redqueen kernel:  ? SyS_exit+0x17/0x20
déc. 12 10:15:45 redqueen kernel:  ret_from_fork+0x35/0x40
déc. 12 10:15:45 redqueen kernel: Code: 00 00 75 2b 48 8d 65 e8 5b 41 5c 41 5d 5d c3 48 c7 c1 40 57 73 c0 48 c7 c2 a8 53 77 c0 31 f6 48 c7 c7 05 64 78 c0 e8 55 60 cb ff <0f> 0b eb c6 e8 bc e3 bf c6 66 90 66 2e 0f 1f 84 00 00 00 00 00 
déc. 12 10:15:45 redqueen kernel: ---[ end trace e3447af507105f36 ]---

iotamudelta commented on August 31, 2024

A couple of questions:
a) Can you salvage the content of ~/.config/miopen and attach it here? That will at least tell us how far it got before erroring out, and if you keep that content, these configurations will not need tuning next time round.
b) Which host kernel are you running this on, and which dkms (or are you using upstream)?

Thanks!
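One way to salvage that directory before retrying is to pack it into a single archive that can be attached to the thread and restored later. The sketch below is only an illustration: the function name and the archive name are hypothetical, and the only detail taken from the comment above is the ~/.config/miopen location.

```python
import shutil
from pathlib import Path


def archive_miopen_cache(cache_dir=None, out_base="miopen_cache_backup"):
    """Pack the MIOpen user directory into <out_base>.tar.gz.

    Returns the archive path, or None if the directory does not exist.
    """
    if cache_dir is None:
        # Location mentioned in the thread.
        cache_dir = Path.home() / ".config" / "miopen"
    cache_dir = Path(cache_dir)
    if not cache_dir.is_dir():
        return None
    archive = shutil.make_archive(
        str(out_base),
        "gztar",
        root_dir=str(cache_dir.parent),
        base_dir=cache_dir.name,
    )
    return Path(archive)


if __name__ == "__main__":
    path = archive_miopen_cache()
    print(path if path else "no MIOpen directory found")
```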

skylt commented on August 31, 2024

(I'm working with @hyperfraise )

a) here
b) 4.15.0-42-generic on Ubuntu 18.04, with ROCm 2.0.white-rabbit

iotamudelta commented on August 31, 2024

@skylt Thanks, this is helpful. There is no 2.0.white-rabbit. Could you provide the output of apt list | grep rocm? Thanks!

The instability looks like a kernel driver issue on gfx803.

Delaunay commented on August 31, 2024

I noticed that hcc is sometimes launched before training begins.
What is it doing? Testing different kernels? Compiling missing kernels?
Compiling the best kernels according to MIOpen's DB?

The compilation step seems to disappear after the first script run.
Is it correct to assume that the timings after that first run should be consistent?

You can find the results of the test I ran prior to tuning MIOpen below.
I am running the tuning at the moment (1 done, 3 to go).

The numbers are based on 90 observations (batches), with the first 10 discarded.

Updated numbers can be found below.

iotamudelta commented on August 31, 2024

Concerning your questions: we do compile some kernels the first time we run them and subsequently get them from cache. So you are correct: the second time you invoke these kernels is what you want to time.

I think the performance you observe makes sense. I'd expect resnet18 to get better as you tune MIOpen, and in general I've observed that larger batch sizes help. Judging from your memory consumption numbers, you could increase the batch size for resnet18 on the 580 by 2x (maybe even 4x). Please do attach your performance database here once you are done so that I can make sure we'll get these configs tuned in a future MIOpen release.
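The "skip the first run, time the rest" idea can be sketched as a small harness that records per-step wall time and drops a warm-up prefix, so one-off kernel compilation does not skew the average. This is an illustrative sketch, not the script used for the numbers in this thread: the function names and the default of 100 steps with 10 discarded are assumptions.

```python
import statistics
import time


def time_steps(step, n_total=100, n_warmup=10):
    """Call step() n_total times and return timings after the warm-up.

    For GPU work, step() should block until the device has finished
    (for example by calling torch.cuda.synchronize() inside it),
    otherwise only the kernel launch is timed.
    """
    timings = []
    for i in range(n_total):
        start = time.perf_counter()
        step()
        elapsed = time.perf_counter() - start
        if i >= n_warmup:  # discard warm-up steps (compilation, caching)
            timings.append(elapsed)
    return timings


def summarize(timings):
    """Return (mean, standard deviation) of the kept timings in seconds."""
    return statistics.mean(timings), statistics.stdev(timings)


if __name__ == "__main__":
    # Dummy CPU workload standing in for one training batch (assumption).
    kept = time_steps(lambda: sum(range(10_000)))
    mean, std = summarize(kept)
    print(f"{len(kept)} steps kept, mean {mean:.6f}s, std {std:.6f}s")
```

Throughput in img/sec then follows as the batch size divided by the mean step time.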

geekboood commented on August 31, 2024

@Delaunay

RX580 8GB 153.4 121.76 208.56 FALSE 32 resnet50

I wonder how you got such a nice result.
I use one RX480 but only get about 83.92 img/sec.
Are you using dual RX580s?

Delaunay commented on August 31, 2024

Single GPU. I don't know where the differences could come from; I did not do anything in particular.
How did you measure the compute time?

geekboood commented on August 31, 2024

I used the benchmark script in https://github.com/ROCmSoftwarePlatform/pytorch/wiki/Performance-analysis-of-PyTorch.
In my test, PyTorch is slightly faster than tensorflow-rocm, which only got 77.59 img/sec.
Another comparison I did between the RX480 and the GTX1060 was training DrQA, an LSTM-based machine comprehension network. In this test, 100 training iterations take 42.68s on the GTX1060 but 230.10s on the RX480. Even if we assume the VEGA series is 2x faster than the RX480, a single GTX1060 would still beat it comfortably.
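To make the DrQA comparison concrete, the arithmetic behind it can be written out. The two timings are the ones quoted above; the 2x Vega speed-up is the hypothetical from the same comment, not a measurement.

```python
# Seconds per 100 DrQA training iterations, as reported above.
gtx1060_s = 42.68
rx480_s = 230.10

# How many times slower the RX480 is than the GTX1060.
slowdown = rx480_s / gtx1060_s

# Hypothetical card that is 2x faster than the RX480 (assumption from the comment).
vega_s = rx480_s / 2
vega_slowdown = vega_s / gtx1060_s

print(f"RX480 vs GTX1060: {slowdown:.2f}x slower")         # 5.39x
print(f"2x-faster card vs GTX1060: {vega_slowdown:.2f}x")  # 2.70x
```

So under that assumption the gap would shrink from roughly 5.4x to roughly 2.7x, but it would not close.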

Delaunay commented on August 31, 2024

I see; I did not run the same script and mine had a bug. You can find the updated numbers below.

I have not started to benchmark anything else yet. I probably will next week.
I would not expect everything to work perfectly yet. NVIDIA has had a lot of time to tune their libraries for many different applications; that does not mean the underlying hardware is not capable of the same performance.

I do think the convnet numbers show that their devices are capable. It is now a matter of optimization at the software level. For example, memory usage under ROCm is significantly lower than on NVIDIA, which suggests the algorithms under the hood are different and that a trade-off could be made between memory and speed.

The only big difference is f16, which I think they have not worked on specifically yet, since none of their devices really supports fast f16 operations (I think Radeon VII does, but that's it).

This compares the RX580 (8 GB) against the 1060 (6 GB).

I also attached the MIOpen perf db.

gfx803_36.cd.updb.txt

iotamudelta commented on August 31, 2024

@Delaunay this is very interesting data, thanks for posting it here! Which PyTorch commit did you use for your test? Your fp16 data looks correct for gfx803.

@geekboood could you post a link to the benchmark you were running? I'd be interested in having a look at it. Thanks!

Delaunay commented on August 31, 2024

The commit I used was ffcbf1bd1 from Fri Feb 22 16:36:06 2019 -0800.
Yes, this is for gfx803.

iotamudelta commented on August 31, 2024

Thanks, very good! It may be interesting to rerun this benchmark after 3ed44b67 is included for ROCm.

geekboood commented on August 31, 2024

@iotamudelta The ResNet one is already in the link above.
As for DrQA, I use the code at https://github.com/facebookresearch/DrQA and train on the CMRC dataset. You can also train the reader on the SQuAD dataset.
If you want to inspect the performance of ROCm, you can have a look at https://github.com/ROCmSoftwarePlatform/tensorflow-upstream/issues/173#issuecomment-469017319. I have also done some trivial tests on tensorflow-rocm.
