Comments (7)
Seems like this is rearing its head again: ggerganov/llama.cpp#603 (comment)
Interesting... not sure what to make of that. I suppose we could do more performance testing: I wonder if #87 (comment) is related?
I did some more testing with combinations of 6 and 12 threads, 16-bit and 32-bit memory, and ID (increased_determinism) on or off. This was with a 7B Alpaca model, generating 80 tokens and timing the total run time. The results were pretty consistent, and I ran the ID configurations three times each just to be sure.
Sorted by type:
ID | Threads | Memory (bits) | TPS (tokens/s) |
---|---|---|---|
Y | 6 | 32 | 2.7860 |
Y | 6 | 32 | 2.9353 |
Y | 6 | 32 | 2.7716 |
N | 6 | 32 | 3.7873 |
N | 6 | 32 | 3.6932 |
Y | 12 | 32 | 3.4560 |
Y | 12 | 32 | 3.4309 |
Y | 12 | 32 | 3.4439 |
N | 12 | 32 | 3.8925 |
N | 12 | 32 | 3.8376 |
Y | 6 | 16 | 3.2821 |
Y | 6 | 16 | 3.2599 |
Y | 6 | 16 | 3.2978 |
N | 6 | 16 | 3.6942 |
N | 6 | 16 | 3.7541 |
Y | 12 | 16 | 3.4277 |
Y | 12 | 16 | 3.4325 |
Y | 12 | 16 | 3.4361 |
N | 12 | 16 | 3.8940 |
N | 12 | 16 | 3.8127 |
Sorted by TPS (higher is better):
ID | Threads | Memory (bits) | TPS (tokens/s) |
---|---|---|---|
N | 12 | 16 | 3.8940 |
N | 12 | 32 | 3.8925 |
N | 12 | 32 | 3.8376 |
N | 12 | 16 | 3.8127 |
N | 6 | 32 | 3.7873 |
N | 6 | 16 | 3.7541 |
N | 6 | 16 | 3.6942 |
N | 6 | 32 | 3.6932 |
Y | 12 | 32 | 3.4560 |
Y | 12 | 32 | 3.4439 |
Y | 12 | 16 | 3.4361 |
Y | 12 | 16 | 3.4325 |
Y | 12 | 32 | 3.4309 |
Y | 12 | 16 | 3.4277 |
Y | 6 | 16 | 3.2978 |
Y | 6 | 16 | 3.2821 |
Y | 6 | 16 | 3.2599 |
Y | 6 | 32 | 2.9353 |
Y | 6 | 32 | 2.7860 |
Y | 6 | 32 | 2.7716 |
In the TPS-sorted version, ID off comes out ahead consistently, though the difference isn't huge most of the time.
I don't know why ID + 6 threads + 32-bit memory looks so bad, but those results were consistent.
Anyway, it probably makes sense to go back to the earlier behavior: a command-line argument that toggles it, which gets enabled automatically when a seed is specified. The difference is definitely large enough to care about in the 32-bit memory / 6-thread case.
I'm not sure whether it's just an issue with low thread counts in general or something else.
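For context on the TPS numbers above: the figure is just tokens generated divided by wall-clock time for the whole run. Below is a minimal, hypothetical sketch of that calculation in Rust; `tokens_per_second` and the dummy closure are illustrative stand-ins, not the actual llm benchmark code.

```rust
use std::thread::sleep;
use std::time::{Duration, Instant};

// Hypothetical timing harness (not the actual llm benchmark code):
// run a generation closure, time it, and report tokens per second.
fn tokens_per_second<F: FnMut() -> usize>(mut generate: F) -> f64 {
    let start = Instant::now();
    let tokens = generate(); // e.g. 80 tokens, as in the test above
    let elapsed = start.elapsed().as_secs_f64();
    tokens as f64 / elapsed
}

fn main() {
    // Dummy workload standing in for real inference.
    let tps = tokens_per_second(|| {
        sleep(Duration::from_millis(250));
        80
    });
    println!("{tps:.4} tokens/sec");
}
```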
Can anyone else replicate the performance differences I saw? If so, do we want to go back to the previously proposed behavior, where there's a CLI flag that gets enabled automatically when --seed is supplied?
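For illustration, here's a hedged sketch of that defaulting behavior: increased determinism stays off unless the user toggles it explicitly or supplies a seed. The function and parameter names are hypothetical, not the actual llm CLI options.

```rust
// Hypothetical sketch of the proposed default: names are illustrative,
// not the real llm CLI flags.
fn resolve_increased_determinism(explicit_flag: Option<bool>, seed: Option<u64>) -> bool {
    match explicit_flag {
        Some(value) => value,   // user explicitly enabled or disabled it
        None => seed.is_some(), // otherwise, enable it only when --seed was given
    }
}

fn main() {
    assert!(!resolve_increased_determinism(None, None));
    assert!(resolve_increased_determinism(None, Some(42)));
    assert!(!resolve_increased_determinism(Some(false), Some(42)));
}
```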
Sorry, I haven't had time to test. It seems like you've done some pretty thorough testing and other people have reported regressions upstream, so I'm leaning towards your suggested solution of applying it only when a seed is specified.
I'll reopen this issue for now...
I think this is the same thing, right?
I'm guessing people would want to use that approach rather than investing time in making the existing one better.
edit: I've been testing the changes. llama.cpp seems much faster than a few days ago. It also used to get much slower as the context grew; now it's still pretty fast even after generating a page of text.
I'm not 100% sure it's all down to this change, but something changed recently that made a big difference.
Yes, that seems related. Looks like we'll need to sync with them soon...