Comments (9)
P.S. I got tokenizer.model from Hugging Face and convert.py from llama.cpp, put them in the parent folder of my Alpaca 7B GGML model (named model.bin), and ran this from the shell: python .\convert.py .\models\ --outfile new.bin
from casalioy.
No, I haven't messed around with that yet, just using the db from SSD.
System Manufacturer LENOVO
System Model 81EM
System Type x64-based PC
System SKU LENOVO_MT_81EM_BU_idea_FM_ideapad FLEX 6-14IKB
Processor Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1992 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Mode UEFI
Platform Role Mobile
Installed Physical Memory (RAM) 8.00 GB
Available Virtual Memory 21.1 GB
IDK if that's what you need?
from casalioy.
Awesome. Looks like a weekend without any sleep again haha. I think Vicuna 13B should be our goal since it's the best-performing model at this point. Also might be worth taking a look at FastChat.
If you could craft a routine to convert GGML, this would increase accessibility while keeping things bootstrapped and simple.
Also feel free to commit your benchmark .txt file // I'm using the default demo files.
I'm around 108ms per token with vic7b @ i5-9600k
from casalioy.
This is startLLM, automated to ask "What is my name?" against a document I ingested into it.
# use_mmap=True
llm = LlamaCpp(use_mmap=True, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 8441.23 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 8440.31 ms / 6 tokens ( 1406.72 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 8500.26 ms
It sounds like your name is Alex.
> Question:
What is my name?
> Answer:
It sounds like your name is Alex.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 47.66585969924927 seconds
and
# use_mmap=False
llm = LlamaCpp(use_mmap=False, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 6395.35 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 6394.58 ms / 6 tokens ( 1065.76 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 6507.05 ms
Your name is Alexandra.
> Question:
What is my name?
> Answer:
Your name is Alexandra.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 42.63529133796692 seconds
So I'm not sure mmap does much here; I'm also not sure yet why or how LangChain integrates that argument.
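For context on what use_mmap changes: with mmap the model file is mapped into the process's address space and pages are faulted in on demand, instead of being read into freshly allocated RAM up front. A minimal sketch of the difference using only Python's stdlib (illustrating the OS mechanism, not llama.cpp itself):

```python
import mmap
import os
import tempfile

# Create a throwaway "model" file to illustrate both access patterns.
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # 1 MiB of zeros

# use_mmap=False analogue: read the whole file into a private buffer up front.
with open(path, "rb") as f:
    buf = f.read()

# use_mmap=True analogue: map the file; the OS pages it in lazily and can
# share the pages between processes or drop them under memory pressure.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first_kib = mm[:1024]  # touching a slice faults just those pages in

assert buf[:1024] == first_kib
```

Either way the same bytes come back; the difference shows up in load time, resident memory, and what happens on repeated runs while the file is still in the page cache.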
from casalioy.
> Awesome. Looks like a weekend without any sleep again haha. I think Vicuna13b should be our goal since it's the best performing model at this point. Also might be worth taking a look at FastChat.
> If you could craft a routine to convert ggml this would increase accessibility to keep it boostraped and simple.
> Also feel free to commit your benchmark .txt file // I'm using the default demo files.
> I'm around 108ms per token with vic7b @ i5-9600k
I'm gonna craft an auto-convert for when your model shows up as an older format like GGML. I could probably even support .pth and such. People will be thankful; I can't believe the performance difference. I'll also work on/look into Vicuna if you can test it. I'll try to download the model, but my area's internet is slow and unstable.
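A sketch of the detection step such an auto-convert could start from: llama.cpp-era files begin with a 4-byte magic, so the routine can peek at it before deciding whether to invoke convert.py. The magic values below are my reading of the llama.cpp/ggml sources (unversioned "ggml", "ggmf" v1, and the mmap-able "ggjt") and should be double-checked against the exact version you build from:

```python
import struct

# Little-endian uint32 magics as I understand them from llama.cpp/ggml;
# treat these as assumptions to verify against the source tree you use.
MAGICS = {
    0x67676D6C: "ggml (unversioned, needs conversion)",
    0x67676D66: "ggmf (v1, needs conversion)",
    0x67676A74: "ggjt (mmap-able, current)",
}

def detect_format(path: str) -> str:
    """Return a human-readable label for the model file's container format."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return "unknown (file too short)"
    magic = struct.unpack("<I", head)[0]
    return MAGICS.get(magic, "unknown")
```

An auto-convert wrapper would then shell out to convert.py only when detect_format reports one of the "needs conversion" labels.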
from casalioy.
Why are your runtimes at 1000 ms per token? Can you shoot me your hardware specs, please?
Also, are you using :memory: for testing?
Then we'd be able to craft a benchmark script. Yep, auto-convert seems reasonable.
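For the benchmark script, all we really need is wall-clock time around a generation call divided by token count. A minimal harness sketch with the model call stubbed out (swap in the actual llm and tokenizer; the stub names below are placeholders, not part of the project):

```python
import time
from typing import Callable

def ms_per_token(generate: Callable[[str], str], prompt: str,
                 count_tokens: Callable[[str], int]) -> float:
    """Time one generation call and return milliseconds per produced token."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    n = max(count_tokens(output), 1)  # avoid division by zero on empty output
    return elapsed_ms / n

# Stub example; for a real run, pass llm and its tokenizer's token counter.
def fake_generate(prompt: str) -> str:
    return "four tokens of text"

rate = ms_per_token(fake_generate, "What is my name?", lambda s: len(s.split()))
print(f"{rate:.2f} ms/token")
```

Running this against the same prompt set on both machines would make numbers like 108 ms/token vs 1000 ms/token directly comparable.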
from casalioy.
I'm getting >60 ms per token hits. Running six threads.
Haven't touched GGML conversion yet. Also did not force RAM since I'm only at 16 GiB.
@alxspiker did you try f16_kv=True?
Also ggml-vic7b-uncensored-q4 has format=ggjt baked in. This might be a reason for the speed difference.
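For reference, the flags discussed here all sit on LangChain's LlamaCpp wrapper; a hedged config sketch (parameter names as I've seen them in langchain, worth checking against your installed version; local_path and callbacks as in the snippets above):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path=local_path,  # a ggjt-format model avoids the slow load path
    f16_kv=True,            # half-precision KV cache
    n_threads=6,            # match your physical-core budget
    use_mmap=True,
    callbacks=callbacks,
    verbose=True,
)
```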
from casalioy.
> I'm getting >60ms per token hits. Running six threads.
> Haven't touched ggml convertion yet. Also did not force RAM since I'm only at 16GiB.
> @alxspiker did you try f16_ky=True?
> Also ggml-vic7b-uncensored-q4 has a format=ggjt backed in. This might be a reason for this speed
823.11 ms per token
from casalioy.
Your issue changed my life. My terminal session is close to real time. This is incredible. I'm going to upload the converted ggjt-v1 models onto HuggingFace so it's way easier for people to interact with.
converted vic-7b here
from casalioy.