
[VIT Model] [perf Degradation] [X86] [ARM] torch.compile + weight prepacking results in perf degradation for VIT Transformer model (pytorch, OPEN)

maajidkhann commented on June 15, 2024
[VIT Model] [perf Degradation] [X86] [ARM] torch.compile + weight prepacking results in perf degradation for VIT Transformer model

from pytorch.

Comments (4)

leslie-fang-intel commented on June 15, 2024

@Valentine233 Could you help take a look?


Valentine233 commented on June 15, 2024

Tried on SPR with 56 threads.

According to the MKL verbose output, the time of the mkl_linear kernel (the highlighted shape in the issue) regresses partway through the run:

257.85us -> 737.93us
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fdd360de040,768,0x1f629040,768,0x7ffcf8a68b38,0x1fca4540,3072) 257.85us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fde2a7d8040,768,0x1f99f980,768,0x7ffcf8a68b38,0x2003cdc0,3072) 737.93us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56

In a small test case that runs torch.addmm 2000 times, the kernel time stays around 250us, so the regression does not reproduce in isolation.
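
For anyone comparing longer MKL verbose logs, here is a rough sketch that pulls the dimensions and timing out of the two lines quoted above and reports the slowdown. The regex and the helper name `parse_mkl_verbose` are illustrative assumptions, not part of MKL or PyTorch, and reading the three integers as m, n, k follows the usual GEMM argument order.

```python
import re

# The two MKL_VERBOSE lines quoted above, copied verbatim.
LOG = """\
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fdd360de040,768,0x1f629040,768,0x7ffcf8a68b38,0x1fca4540,3072) 257.85us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
MKL_VERBOSE SGEMM_COMPUTE(P,N,3072,197,768,0x7fde2a7d8040,768,0x1f99f980,768,0x7ffcf8a68b38,0x2003cdc0,3072) 737.93us CNR:OFF Dyn:1 FastMM:1 TID:0 NThr:56
"""

# Matches the routine name, its argument list, and a microsecond timing.
# Assumption: only timings reported in "us" are handled; other lines are skipped.
LINE_RE = re.compile(r"^MKL_VERBOSE (\w+)\(([^)]*)\) ([\d.]+)us\b")


def parse_mkl_verbose(text):
    """Return (routine, m, n, k, microseconds) for each SGEMM_COMPUTE line."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if not match or match.group(1) != "SGEMM_COMPUTE":
            continue
        args = match.group(2).split(",")
        # In the lines above, args[2:5] are the three integer dimensions.
        m, n, k = (int(a) for a in args[2:5])
        records.append((match.group(1), m, n, k, float(match.group(3))))
    return records


records = parse_mkl_verbose(LOG)
fast, slow = records[0][4], records[1][4]
print(f"shape (m, n, k) = {records[0][1:4]}")
print(f"slowdown = {slow / fast:.2f}x")
```

Both lines have the same dimensions and thread count, so the difference is in the timing alone.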


maajidkhann commented on June 15, 2024

257.85us -> 737.93us
Will the regression be further investigated and fixed?

The issue really shows up when weight prepacking is enabled with torch.compile(), as highlighted in orange in the ticket. With weight prepacking enabled, even the input shapes differ from [[3072], [197, 768], [768, 3072], [], [], [197, 3072]], which is what plain torch.compile() reports.


Valentine233 commented on June 15, 2024

The previous data was affected by environment problems. With tcmalloc and iomp5 enabled (iomp5 requires installing intel-openmp), performance with weight prepacking is better than without it.

Tested on Xeon SPR with 56 threads.

Without weight prepacking:

       Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                                                      Input Shapes
aten::addmm        15.50%     476.529ms        17.17%     527.830ms     439.858us          1200                             [[768], [197, 3072], [3072, 768], [], [], [197, 768]]
aten::addmm        14.05%     432.157ms        17.59%     540.775ms     112.662us          4800                               [[768], [197, 768], [768, 768], [], [], [197, 768]]
aten::addmm        12.28%     377.698ms        14.79%     454.871ms     379.059us          1200                            [[3072], [197, 768], [768, 3072], [], [], [197, 3072]]

With weight prepacking:

            Name    Self CPU %      Self CPU   CPU total %     CPU total  CPU time avg    # of Calls                                        Input Shapes
mkl::_mkl_linear        17.59%     458.226ms        17.80%     463.612ms      96.586us          4800      [[197, 768], [2900193, 1], [768, 768], [], []]
mkl::_mkl_linear        14.35%     373.810ms        14.41%     375.306ms     312.755us          1200    [[197, 3072], [5259489, 1], [768, 3072], [], []]
mkl::_mkl_linear        11.83%     308.105ms        11.89%     309.697ms     258.080us          1200     [[197, 768], [5259489, 1], [3072, 768], [], []]
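
To put numbers on the two tables, here is a small sketch that computes the per-shape ratio of the "CPU time avg" column. The row pairing is an assumption based on reading the shapes: aten::addmm appears to list the weight as [in, out], while mkl::_mkl_linear lists it as [out, in], so rows are matched by activation shape and weight dimensions rather than by position. The labels and the helper `speedups` are illustrative only.

```python
# (label, avg time without prepacking [us], avg time with prepacking [us])
# Values are copied from the "CPU time avg" column of the two tables.
# Assumption: rows are paired by activation shape and weight dimensions,
# not by their order in the profiler output.
PAIRS = [
    ("[197, 3072] input, 3072 -> 768", 439.858, 312.755),
    ("[197, 768] input, 768 -> 768", 112.662, 96.586),
    ("[197, 768] input, 768 -> 3072", 379.059, 258.080),
]


def speedups(pairs):
    """Return {label: time_without / time_with} for each paired row."""
    return {label: without / with_ for label, without, with_ in pairs}


ratios = speedups(PAIRS)
for label, ratio in ratios.items():
    print(f"{label}: {ratio:.2f}x faster with weight prepacking")
```

Under that pairing, every shape is faster with prepacking, with the smallest gain on the 768 -> 768 layers.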

@maajidkhann Could you try again with the environment settings mentioned above? It may be easiest to run with the PyTorch launcher: https://github.com/pytorch/pytorch/blob/main/torch/backends/xeon/run_cpu.py.
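
A quick sanity check before re-running: the sketch below reports whether tcmalloc and iomp5 show up in LD_PRELOAD for the current process. It assumes the libraries are injected through LD_PRELOAD under the file names libtcmalloc* and libiomp5*; setups that link them another way will not be detected. The helper name `preloaded` is hypothetical.

```python
import os

# Library name prefixes to look for.
# Assumption: tcmalloc and iomp5 are injected via LD_PRELOAD as
# libtcmalloc*.so / libiomp5.so; other setups are not detected.
WANTED = {"tcmalloc": "libtcmalloc", "iomp5": "libiomp5"}


def preloaded(ld_preload):
    """Map each wanted library to True/False based on an LD_PRELOAD string."""
    # LD_PRELOAD entries may be separated by colons or whitespace.
    entries = ld_preload.replace(":", " ").split()
    names = [os.path.basename(entry) for entry in entries]
    return {
        key: any(name.startswith(prefix) for name in names)
        for key, prefix in WANTED.items()
    }


status = preloaded(os.environ.get("LD_PRELOAD", ""))
for lib, found in status.items():
    print(f"{lib}: {'preloaded' if found else 'not found in LD_PRELOAD'}")
```

If either library is reported as missing, the profiler numbers may not be comparable to the tables above.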
