Comments (7)
Seems like this is rearing its head again: ggerganov/llama.cpp#603 (comment)
Interesting... not sure what to make of that. I suppose we could do more performance testing: I wonder if #87 (comment) is related?
I did some more testing with combinations of 6 and 12 threads, 16-bit and 32-bit memory, and ID (increased_determinism) on or off. This was with a 7B Alpaca model, generating 80 tokens and timing the total run time. The results were pretty consistent, and I ran the ID configurations three times each just to be sure.
Sorted by type:
ID | Threads | Memory (bits) | TPS (tokens/s) |
---|---|---|---|
Y | 6 | 32 | 2.7860 |
Y | 6 | 32 | 2.9353 |
Y | 6 | 32 | 2.7716 |
N | 6 | 32 | 3.7873 |
N | 6 | 32 | 3.6932 |
Y | 12 | 32 | 3.4560 |
Y | 12 | 32 | 3.4309 |
Y | 12 | 32 | 3.4439 |
N | 12 | 32 | 3.8925 |
N | 12 | 32 | 3.8376 |
Y | 6 | 16 | 3.2821 |
Y | 6 | 16 | 3.2599 |
Y | 6 | 16 | 3.2978 |
N | 6 | 16 | 3.6942 |
N | 6 | 16 | 3.7541 |
Y | 12 | 16 | 3.4277 |
Y | 12 | 16 | 3.4325 |
Y | 12 | 16 | 3.4361 |
N | 12 | 16 | 3.8940 |
N | 12 | 16 | 3.8127 |
Sorted by TPS (higher is better):
ID | Threads | Memory (bits) | TPS (tokens/s) |
---|---|---|---|
N | 12 | 16 | 3.8940 |
N | 12 | 32 | 3.8925 |
N | 12 | 32 | 3.8376 |
N | 12 | 16 | 3.8127 |
N | 6 | 32 | 3.7873 |
N | 6 | 16 | 3.7541 |
N | 6 | 16 | 3.6942 |
N | 6 | 32 | 3.6932 |
Y | 12 | 32 | 3.4560 |
Y | 12 | 32 | 3.4439 |
Y | 12 | 16 | 3.4361 |
Y | 12 | 16 | 3.4325 |
Y | 12 | 32 | 3.4309 |
Y | 12 | 16 | 3.4277 |
Y | 6 | 16 | 3.2978 |
Y | 6 | 16 | 3.2821 |
Y | 6 | 16 | 3.2599 |
Y | 6 | 32 | 2.9353 |
Y | 6 | 32 | 2.7860 |
Y | 6 | 32 | 2.7716 |
In the TPS-sorted version, ID off comes out ahead consistently, though the difference isn't huge most of the time.
I don't know why ID + 6 threads + 32-bit memory looks so bad, but those results were consistent.
Anyway, it probably makes sense to go back to the earlier behavior: a command-line argument that toggles it, which gets enabled automatically when a seed is specified. The difference is definitely large enough to care about in the 32-bit memory / 6-thread case.
I'm not sure whether it's just an issue with low thread counts in general or something else.
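For context on the TPS numbers above: the figure is just tokens generated divided by wall-clock time for the whole run. Below is a minimal, hypothetical sketch of that calculation in Rust; `tokens_per_second` and the dummy closure are illustrative stand-ins, not the actual llm benchmark code.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical timing harness (not the actual llm benchmark code):
// run a generation closure, time it, and report tokens per second.
fn tokens_per_second<F: FnMut() -> usize>(mut generate: F) -> f64 {
    let start = Instant::now();
    let tokens = generate(); // e.g. 80 tokens, as in the test above
    let elapsed = start.elapsed().as_secs_f64();
    tokens as f64 / elapsed
}

fn main() {
    // Dummy workload standing in for real inference.
    let tps = tokens_per_second(|| {
        sleep(Duration::from_millis(250));
        80
    });
    println!("{tps:.4} tokens/sec");
}
```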
Can anyone else replicate the performance differences I saw? If so, do we want to go back to the previously proposed behavior, where there's a CLI flag that gets enabled automatically when --seed is supplied?
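For illustration, here's a hedged sketch of that defaulting behavior: increased determinism stays off unless the user toggles it explicitly or supplies a seed. The function and parameter names are hypothetical, not the actual llm CLI options.

```rust
// Hypothetical sketch of the proposed default: names are illustrative,
// not the real llm CLI flags.
fn resolve_increased_determinism(explicit_flag: Option<bool>, seed: Option<u64>) -> bool {
    match explicit_flag {
        Some(value) => value,   // user explicitly enabled or disabled it
        None => seed.is_some(), // otherwise, enable it only when --seed was given
    }
}

fn main() {
    assert!(!resolve_increased_determinism(None, None));
    assert!(resolve_increased_determinism(None, Some(42)));
    assert!(!resolve_increased_determinism(Some(false), Some(42)));
}
```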
Sorry, I haven't had time to test. It seems like you've done some pretty thorough testing and other people have reported regressions upstream, so I'm leaning towards your suggested solution of applying it only when a seed is specified.
I'll reopen this issue for now...
I think this is the same thing, right?
I'm guessing people would want to use that approach rather than investing time in making the existing one better.
edit: I've been testing the changes. llama.cpp seems much faster than a few days ago. It also used to get much slower as the context grew; now it's still pretty fast even after generating a page of text.
I'm not 100% sure it's all down to this change, but something changed recently that made a big difference.
Yes, that seems related. Looks like we'll need to sync with them soon...