Comments (9)
@jtoy
Yes, but not entirely. Currently, PowerInfer only shows a benefit when resources are limited, for example when the 7B model exceeds GPU memory. For scenarios where the entire model fits in GPU VRAM, PowerInfer's advantage is not significant, and we are still optimizing the relevant code. Please wait for the next update, in which we will optimize the performance of 7B-scale models. By the way, happy New Year. :)
from powerinfer.
I will now provide some explanation. In fact, PowerInfer's target scenario is models whose size exceeds GPU VRAM. The currently open-sourced version of the code is not well suited to running inference entirely on the GPU. Moreover, for ease of trial use, the open-source code differs from the code tested in the paper, and it introduces roughly a 10% performance decline (we are still investigating the cause).
The open-source version of PowerInfer has the following issues when the entire model is on the GPU:
- There are synchronization issues between the CPU and GPU. Even if the model is entirely on the GPU, the feed-forward network (FFN) layer still requires synchronization between the CPU and GPU, introducing significant synchronization overhead.
- To facilitate the predictor's computations, the predictor's results are currently stored on the CPU. This means that even if the model is on the GPU, some minor computations are still performed on the CPU.
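As a back-of-the-envelope illustration of why per-layer CPU-GPU synchronization hurts even a fully GPU-resident model, here is a toy latency model. All constants here are made-up assumptions for illustration, not PowerInfer measurements:

```python
# Toy per-token latency model for a decoder that synchronizes CPU<->GPU
# once per FFN layer. All numbers below are illustrative assumptions.

def token_latency_ms(n_layers: int, gpu_compute_ms_per_layer: float,
                     sync_ms_per_layer: float) -> float:
    """Latency of one token: GPU compute plus one sync per layer."""
    return n_layers * (gpu_compute_ms_per_layer + sync_ms_per_layer)

# A 7B-class model has 32 layers; assume 0.4 ms of GPU compute per layer
# and 0.1 ms of CPU<->GPU synchronization overhead per layer.
no_sync = token_latency_ms(32, 0.4, 0.0)    # 12.8 ms/token
with_sync = token_latency_ms(32, 0.4, 0.1)  # 16.0 ms/token

print(f"without sync: {no_sync:.1f} ms/token")
print(f"with sync:    {with_sync:.1f} ms/token")
print(f"sync adds {100 * (with_sync - no_sync) / no_sync:.0f}% overhead")
```

Even a small fixed cost per layer compounds across all layers of the model, which is why removing the per-layer synchronization is worthwhile.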
Since our code has not been deeply optimized for models that fit entirely on the GPU, I attempted to eliminate the above two overheads as much as possible in our internal PowerInfer code. Preliminary tests on a 4090 with llama-2-7B yielded the following results:
llama.cpp: 15.9 ms/token
PowerInfer: 12.64 ms/token on average
After further breakdown, I found that some computations are still placed on the CPU, meaning not all overheads have been eliminated, which prevents further improvement in PowerInfer.
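For reference, the ms/token figures above correspond to roughly a 1.26x speedup over llama.cpp; a quick check:

```python
# Relative speedup implied by the ms/token figures quoted above.
llama_cpp_ms = 15.9    # llama.cpp, ms/token
powerinfer_ms = 12.64  # PowerInfer, ms/token (average)

speedup = llama_cpp_ms / powerinfer_ms
print(f"speedup: {speedup:.2f}x")  # ~1.26x
```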
Currently, I am considering whether to provide a testbed for those who wish to reproduce the results of our paper. The level of interest from the community in this project has significantly surpassed our expectations. Please be aware that our open-source code is currently at a preliminary stage of development, and please be patient as we work on further optimizing it.
@jtoy Hi, is this the issue we resolve after EOD?
Thank you,
Tushar
Which commit is it? Has it already been pushed?
And what is required to test it? Do we just recompile, or do we need to reconvert the weights?
Can we add the performance gains for llama2 7b to the paper? When I read it, it wasn't clear what gains we should expect.
Thank you for your feedback. In our previous tests, even when all of the model was placed on the GPU, PowerInfer still achieved a certain speedup (perhaps 1.2-2x; this is the result I measured earlier with OPT). We will check the results you mentioned. Currently, I believe there may be some performance issues in the open-source version of the code. We will get back to you as soon as possible regarding the reason for this result.
Is there any way I can help test?
@YixinSong-e Is it right to say that llama2 7b might not get a good speedup with this library? Are there any new updates?
Has there been any improvement with these smaller models?
Related Issues (20)
- Only 12GB of the 24GB VRAM is used and CUDA utilization is below 10%, but CPU usage is 100% and RAM usage is 35GB HOT 1
- [Question]: High PPL on wikitext2 of ReLU-LLAMA-7B for language modeling tasks HOT 2
- Does PowerInfer support multi-GPU? HOT 1
- Will we have instruct fine-tuned model support in the future? HOT 1
- Clarification on Output Neuron Pruning Method in "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time" HOT 2
- Segmentation fault (core dumped) in ggml test
- two questions that i want to solve HOT 2
- How to assign the specified CUDA_VISIBLE_DEVICE?
- invalid device symbol
- Where is the definition or addition location of GGML_USE_HYBRID_THREADING? HOT 2
- convert.py: error: the following arguments are required: mlp_model HOT 4
- Unable to generate constant output HOT 2
- The code about the figures in paper HOT 1
- Any plans to support llamafied Qwen1.5? HOT 2
- CUDA cannot be found on an A100-80G HOT 2
- Are there any plans to support LLama 3 70B?
- Question about anomalous results measured on an A100 GPU HOT 1
- Why AXPY? HOT 2
- Will this work with Falcon 2?