🐛 Describe the bug With Pytorch 2.3.0, when we run inferencing fo

[VIT Model] [perf Degradation] [X86] [ARM] torch.compile + weight prepacking results in perf degradation for VIT Transformer model

maajidkhann commented on June 15, 2024

[VIT Model] [perf Degradation] [X86] [ARM] torch.compile + weight prepacking results in perf degradation for VIT Transformer model

Comments (4)

leslie-fang-intel commented on June 15, 2024

@Valentine233 Could you help to take a look?

Valentine233 commented on June 15, 2024

Tried on SPR with 56 threads.

According to the MKL verbose output, the time of the mkl_linear kernel (for the shape highlighted in the issue) regresses partway through the run:

257.85us -> 737.93us
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fdd360de040,768,0x1f629040,768,0x7ffcf8a68b38,0x1fca4540,3072) 257.85us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fde2a7d8040,768,0x1f99f980,768,0x7ffcf8a68b38,0x2003cdc0,3072) 737.93us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
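To compare such log lines across a long run, the routine name, the GEMM dimensions and the reported time can be pulled out with a small parser. The following is only a sketch: the helper name `parse_mkl_verbose` is hypothetical, and it assumes that the third to fifth arguments of SGEMM_COMPUTE are m, n and k and that the time is printed with a unit suffix, as in the two lines above.

```python
import re

# Matches e.g. "MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,...) 257.85us ..."
_LINE = re.compile(r"MKL_VERBOSE\s+(\w+)\((.*?)\)\s+([\d.]+)(ns|us|ms|s)\b")
_TO_US = {"ns": 1e-3, "us": 1.0, "ms": 1e3, "s": 1e6}


def parse_mkl_verbose(line):
    """Return routine, (m, n, k) and time in microseconds, or None if no match.

    Assumption: arguments 3 to 5 are the GEMM dimensions m, n, k, which holds
    for the SGEMM_COMPUTE lines quoted in this thread.
    """
    match = _LINE.search(line)
    if match is None:
        return None
    routine, args, value, unit = match.groups()
    fields = args.split(",")
    dims = tuple(int(field) for field in fields[2:5])
    return {"routine": routine, "dims": dims, "time_us": float(value) * _TO_US[unit]}


fast = parse_mkl_verbose(
    "MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fdd360de040,768,0x1f629040,768,"
    "0x7ffcf8a68b38,0x1fca4540,3072) 257.85us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56"
)
slow = parse_mkl_verbose(
    "MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fde2a7d8040,768,0x1f99f980,768,"
    "0x7ffcf8a68b38,0x2003cdc0,3072) 737.93us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56"
)
slowdown = slow["time_us"] / fast["time_us"]
print(f"{fast['routine']} {fast['dims']}: slowdown {slowdown:.2f}x")
```

Both lines have the same dimensions, so the slowdown is not explained by a change in problem size.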

In a small standalone test case that runs torch.addmm 2000 times, the kernel consistently takes around 250us, so we could not reproduce the slower timing there.
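The test case itself was not posted. A minimal sketch of what such a microbenchmark might look like is below; the function name `time_addmm`, the warmup count and the use of random inputs are assumptions, and the shapes are taken from the profile in the issue.

```python
import time

import torch


def time_addmm(iterations=2000, warmup=50):
    """Return the mean wall-clock time of one torch.addmm call, in microseconds."""
    # Shapes from the profile: bias [3072], input [197, 768], weight [768, 3072].
    bias = torch.randn(3072)
    x = torch.randn(197, 768)
    weight = torch.randn(768, 3072)
    with torch.no_grad():
        # Warm up so one-time setup costs are not included in the measurement.
        for _ in range(warmup):
            torch.addmm(bias, x, weight)
        start = time.perf_counter()
        for _ in range(iterations):
            out = torch.addmm(bias, x, weight)
        elapsed = time.perf_counter() - start
    assert out.shape == (197, 3072)
    return elapsed / iterations * 1e6


if __name__ == "__main__":
    print(f"mean addmm time: {time_addmm():.2f} us")
```

Note that this times the plain addmm path, not the prepacked mkl_linear kernel, so it is only a baseline for the unpacked case.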

maajidkhann commented on June 15, 2024

257.85us -> 737.93us
Will the regression be further investigated and fixed?

The issue really appears when weight prepacking is enabled together with torch.compile(), as highlighted in orange in the ticket. The reported input shapes also differ with weight prepacking enabled, compared to [[3072], [197, 768], [768, 3072], [], [], [197, 3072]], which is what plain torch.compile() reports.

Valentine233 commented on June 15, 2024

The previous data was affected by some environment problems. With tcmalloc and iomp5 enabled (iomp5 requires installing intel-openmp), the performance with weight prepacking is better than without it.

Tested on Xeon SPR with 56 threads.

Without weight prepacking:

       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
aten::addmm        15.50%     476.529ms        17.17%     527.830ms     439.858us          1200                             [[768], [197, 3072], [3072, 768], [], [], [197, 768]]
aten::addmm        14.05%     432.157ms        17.59%     540.775ms     112.662us          4800                               [[768], [197, 768], [768, 768], [], [], [197, 768]]
aten::addmm        12.28%     377.698ms        14.79%     454.871ms     379.059us          1200                            [[3072], [197, 768], [768, 3072], [], [], [197, 3072]]
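A table in this format can be produced with torch.profiler by recording input shapes and grouping by them. The sketch below profiles a bare torch.addmm call rather than the ViT model from the issue; the function name `profile_addmm` and the call count are assumptions made for illustration.

```python
import torch
from torch.profiler import ProfilerActivity, profile


def profile_addmm(calls=10):
    """Profile repeated addmm calls and return a table grouped by input shape."""
    # One of the shapes from the table above: bias [768], input [197, 3072],
    # weight [3072, 768].
    bias = torch.randn(768)
    x = torch.randn(197, 3072)
    weight = torch.randn(3072, 768)
    with torch.no_grad(), profile(
        activities=[ProfilerActivity.CPU], record_shapes=True
    ) as prof:
        for _ in range(calls):
            torch.addmm(bias, x, weight)
    # group_by_input_shape=True yields one row per (operator, input shapes) pair.
    return prof.key_averages(group_by_input_shape=True).table(
        sort_by="self_cpu_time_total", row_limit=10
    )


if __name__ == "__main__":
    print(profile_addmm())
```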

With weight prepacking:

            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
mkl::_mkl_linear        17.59%     458.226ms        17.80%     463.612ms      96.586us          4800                               [[197, 768], [2900193, 1], [768, 768], [], []]
mkl::_mkl_linear        14.35%     373.810ms        14.41%     375.306ms     312.755us          1200                             [[197, 3072], [5259489, 1], [768, 3072], [], []]
mkl::_mkl_linear        11.83%     308.105ms        11.89%     309.697ms     258.080us          1200                              [[197, 768], [5259489, 1], [3072, 768], [], []]
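The per-call averages in the two tables can be compared directly. The short calculation below pairs each aten::addmm row with the mkl::_mkl_linear row for the same layer; the pairing is an assumption, made by reading the third mkl_linear shape as the original weight in [out_features, in_features] layout.

```python
# (layer as input x weight, avg us without prepacking, avg us with prepacking)
ROWS = [
    ("[197, 3072] x [3072, 768]", 439.858, 312.755),
    ("[197, 768] x [768, 768]", 112.662, 96.586),
    ("[197, 768] x [768, 3072]", 379.059, 258.080),
]


def speedups(rows):
    """Return the speedup (time without / time with prepacking) for each layer."""
    return {label: before / after for label, before, after in rows}


for label, ratio in speedups(ROWS).items():
    print(f"{label}: {ratio:.2f}x faster with weight prepacking")
```

Under this pairing, all three shapes are faster with weight prepacking in this environment, with the two larger GEMMs gaining the most.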

@maajidkhann Could you retry with the environment settings mentioned above? It may be easiest to run with the PyTorch launcher: https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py
