Comments (9)
@jtoy
Yes, but not entirely. Currently, PowerInfer only shows a benefit when resources are limited, for example when the 7B model exceeds GPU memory. For scenarios where the entire model fits in GPU VRAM, PowerInfer's advantage is not significant, and we are still optimizing the relevant code. Please wait for the next update, in which we will optimize the performance of 7B-scale models. By the way, happy New Year. :)
from powerinfer.
I will now provide some explanation. In fact, PowerInfer's target scenario is models whose size exceeds GPU VRAM. The currently open-sourced version of the code is not well suited to running inference entirely on the GPU. Moreover, for ease of trial use, the open-source code differs from the code tested in the paper, and it introduces roughly a 10% performance decline (we are still investigating the cause).
The open-source version of PowerInfer has the following issues when the entire model is on the GPU:
- There are synchronization issues between the CPU and GPU. Even if the model is entirely on the GPU, the feed-forward network (FFN) layer still requires synchronization between the CPU and GPU, introducing significant synchronization overhead.
- To facilitate the predictor's computations, the predictor's results are currently stored on the CPU. This means that even if the model is on the GPU, some minor computations are still performed on the CPU.
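As a back-of-the-envelope illustration of why per-layer CPU-GPU synchronization hurts even a fully GPU-resident model, here is a toy latency model. All constants here are made-up assumptions for illustration, not PowerInfer measurements:

```python
# Toy per-token latency model for a decoder that synchronizes CPU<->GPU
# once per FFN layer. All numbers below are illustrative assumptions.

def token_latency_ms(n_layers: int, gpu_compute_ms_per_layer: float,
                     sync_ms_per_layer: float) -> float:
    """Latency of one token: GPU compute plus one sync per layer."""
    return n_layers * (gpu_compute_ms_per_layer + sync_ms_per_layer)

# A 7B-class model has 32 layers; assume 0.4 ms of GPU compute per layer
# and 0.1 ms of CPU<->GPU synchronization overhead per layer.
no_sync = token_latency_ms(32, 0.4, 0.0)    # 12.8 ms/token
with_sync = token_latency_ms(32, 0.4, 0.1)  # 16.0 ms/token

print(f"without sync: {no_sync:.1f} ms/token")
print(f"with sync:    {with_sync:.1f} ms/token")
print(f"sync adds {100 * (with_sync - no_sync) / no_sync:.0f}% overhead")
```

Even a small fixed cost per layer compounds across all layers of the model, which is why removing the per-layer synchronization is worthwhile.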
Since our code has not been deeply optimized for models that fit entirely on the GPU, I attempted to eliminate the above two overheads as much as possible in our internal PowerInfer code. Preliminary tests on a 4090 with llama-2-7B yielded the following results:
llama.cpp: 15.9 ms/token
PowerInfer: 12.64 ms/token on average
After further breakdown, I found that some computations are still placed on the CPU, meaning not all overheads have been eliminated, which prevents further improvement in PowerInfer.
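For reference, the ms/token figures above correspond to roughly a 1.26x speedup over llama.cpp; a quick check:

```python
# Relative speedup implied by the ms/token figures quoted above.
llama_cpp_ms = 15.9    # llama.cpp, ms/token
powerinfer_ms = 12.64  # PowerInfer, ms/token (average)

speedup = llama_cpp_ms / powerinfer_ms
print(f"speedup: {speedup:.2f}x")  # ~1.26x
```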
Currently, I am considering whether to provide a testbed for those who wish to reproduce the results of our paper. The level of interest from the community in this project has significantly surpassed our expectations. Please be aware that our open-source code is currently at a preliminary stage of development, and please be patient as we work on further optimizing it.
@jtoy Hi, is this the issue we resolve after EOD?
Thank you,
Tushar
Which commit is it? Has it already been pushed?
And what is required to test it? Do we just recompile, or do we need to reconvert the weights?
Can we add the performance gains for llama2 7b to the paper? When I read it, it wasn't clear what gains we should expect.
Thank you for your feedback. In our previous tests, even when all of the model was placed on the GPU, PowerInfer still achieved a certain speedup (perhaps 1.2-2x; this is the result I measured earlier with OPT). We will check the results you mentioned. Currently, I believe there may be some performance issues in the open-source version of the code. We will get back to you as soon as possible regarding the reason for this result.
Is there any way I can help test?
@YixinSong-e Is it right to say that llama2 7b might not get a good speedup with this library? Are there any new updates?
Has there been any improvement with these smaller models?
Related Issues (20)
- Only 12GB of the 24GB VRAM is used and CUDA utilization is below 10%, but CPU usage is 100% and RAM usage is 35GB HOT 1
- [Question]: High PPL on wikitext2 of ReLU-LLAMA-7B for language modeling tasks HOT 2
- Does PowerInfer support multi-GPU? HOT 1
- Will we have instruct fine-tuned model support in the future? HOT 1
- Clarification on Output Neuron Pruning Method in "Deja Vu: Contextual Sparsity for Efficient LLMs at Inference Time" HOT 2
- Segmentation fault (core dumped) in ggml test
- two questions that i want to solve HOT 2
- How to assign the specified CUDA_VISIBLE_DEVICE?
- invalid device symbol
- Where is the definition or addition location of GGML_USE_HYBRID_THREADING? HOT 2
- convert.py: error: the following arguments are required: mlp_model HOT 4
- Unable to generate constant output HOT 2
- The code about the figures in paper HOT 1
- Any plans to support llamafied Qwen1.5? HOT 2
- CUDA cannot be found on an A100-80G HOT 2
- Are there any plans to support LLama 3 70B?
- Question about anomalous results measured on an A100 GPU HOT 1
- Why AXPY? HOT 2
- Will this work with Falcon 2?