Comments (9)
P.S. I got tokenizer.model from Hugging Face and convert.py from llama.cpp, put them in the parent folder of my Alpaca 7B GGML model (named model.bin), and ran this from the shell: python .\convert.py .\models\ --outfile new.bin
from casalioy.
No, I haven't messed around with that yet, just using the db from SSD.
System Manufacturer LENOVO
System Model 81EM
System Type x64-based PC
System SKU LENOVO_MT_81EM_BU_idea_FM_ideapad FLEX 6-14IKB
Processor Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1992 Mhz, 4 Core(s), 8 Logical Processor(s)
BIOS Mode UEFI
Platform Role Mobile
Installed Physical Memory (RAM) 8.00 GB
Available Virtual Memory 21.1 GB
IDK if that's what you need?
from casalioy.
Awesome. Looks like a weekend without any sleep again haha. I think Vicuna 13B should be our goal since it's the best-performing model at this point. Also might be worth taking a look at FastChat.
If you could craft a routine to convert GGML, this would increase accessibility while keeping things bootstrapped and simple.
Also feel free to commit your benchmark .txt file // I'm using the default demo files.
I'm around 108ms per token with vic7b @ i5-9600k
from casalioy.
This is startLLM, automated to ask "What is my name?" against a document I ingested into it.
# use_mmap=True
llm = LlamaCpp(use_mmap=True, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 8441.23 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 8440.31 ms / 6 tokens ( 1406.72 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 8500.26 ms
It sounds like your name is Alex.
> Question:
What is my name?
> Answer:
It sounds like your name is Alex.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 47.66585969924927 seconds
and
# use_mmap=False
llm = LlamaCpp(use_mmap=False, model_path=local_path, callbacks=callbacks, verbose=True)
llama_print_timings: load time = 6395.35 ms
llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: prompt eval time = 6394.58 ms / 6 tokens ( 1065.76 ms per token)
llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per run)
llama_print_timings: total time = 6507.05 ms
Your name is Alexandra.
> Question:
What is my name?
> Answer:
Your name is Alexandra.
> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 42.63529133796692 seconds
So I'm not sure mmap does much here; I'm also not sure yet why or how LangChain integrates that argument.
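For context on what use_mmap changes: with mmap the model file is mapped into the process's address space and pages are faulted in on demand, instead of being read into freshly allocated RAM up front. A minimal sketch of the difference using only Python's stdlib (illustrating the OS mechanism, not llama.cpp itself):

```python
import mmap
import os
import tempfile

# Create a throwaway "model" file to illustrate both access patterns.
path = os.path.join(tempfile.mkdtemp(), "model.bin")
with open(path, "wb") as f:
    f.write(b"\x00" * 1024 * 1024)  # 1 MiB of zeros

# use_mmap=False analogue: read the whole file into a private buffer up front.
with open(path, "rb") as f:
    buf = f.read()

# use_mmap=True analogue: map the file; the OS pages it in lazily and can
# share the pages between processes or drop them under memory pressure.
with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    first_kib = mm[:1024]  # touching a slice faults just those pages in

assert buf[:1024] == first_kib
```

Either way the same bytes come back; the difference shows up in load time, resident memory, and what happens on repeated runs while the file is still in the page cache.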
from casalioy.
> Awesome. Looks like a weekend without any sleep again haha. I think Vicuna13b should be our goal since it's the best performing model at this point. Also might be worth taking a look at FastChat.
> If you could craft a routine to convert ggml this would increase accessibility to keep it boostraped and simple.
> Also feel free to commit your benchmark .txt file // I'm using the default demo files.
> I'm around 108ms per token with vic7b @ i5-9600k
I'm gonna craft an auto-convert for when your model shows up as an older format like GGML. I could probably even support .pth and such. People will be thankful; I can't believe the performance difference. I'll also work on/look into Vicuna if you can test it. I'll try to download the model, but my area's internet is slow and unstable.
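A sketch of the detection step such an auto-convert could start from: llama.cpp-era files begin with a 4-byte magic, so the routine can peek at it before deciding whether to invoke convert.py. The magic values below are my reading of the llama.cpp/ggml sources (unversioned "ggml", "ggmf" v1, and the mmap-able "ggjt") and should be double-checked against the exact version you build from:

```python
import struct

# Little-endian uint32 magics as I understand them from llama.cpp/ggml;
# treat these as assumptions to verify against the source tree you use.
MAGICS = {
    0x67676D6C: "ggml (unversioned, needs conversion)",
    0x67676D66: "ggmf (v1, needs conversion)",
    0x67676A74: "ggjt (mmap-able, current)",
}

def detect_format(path: str) -> str:
    """Return a human-readable label for the model file's container format."""
    with open(path, "rb") as f:
        head = f.read(4)
    if len(head) < 4:
        return "unknown (file too short)"
    magic = struct.unpack("<I", head)[0]
    return MAGICS.get(magic, "unknown")
```

An auto-convert wrapper would then shell out to convert.py only when detect_format reports one of the "needs conversion" labels.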
from casalioy.
Why are your runtimes at 1000 ms per token? Can you shoot me your hardware specs, please?
Also, are you using :memory: for testing?
Then we'd be able to craft a benchmark script. Yep, auto-convert seems reasonable.
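For the benchmark script, all we really need is wall-clock time around a generation call divided by token count. A minimal harness sketch with the model call stubbed out (swap in the actual llm and tokenizer; the stub names below are placeholders, not part of the project):

```python
import time
from typing import Callable

def ms_per_token(generate: Callable[[str], str], prompt: str,
                 count_tokens: Callable[[str], int]) -> float:
    """Time one generation call and return milliseconds per produced token."""
    start = time.perf_counter()
    output = generate(prompt)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    n = max(count_tokens(output), 1)  # avoid division by zero on empty output
    return elapsed_ms / n

# Stub example; for a real run, pass llm and its tokenizer's token counter.
def fake_generate(prompt: str) -> str:
    return "four tokens of text"

rate = ms_per_token(fake_generate, "What is my name?", lambda s: len(s.split()))
print(f"{rate:.2f} ms/token")
```

Running this against the same prompt set on both machines would make numbers like 108 ms/token vs 1000 ms/token directly comparable.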
from casalioy.
I'm getting >60 ms per token hits. Running six threads.
Haven't touched GGML conversion yet. Also did not force RAM since I'm only at 16 GiB.
@alxspiker did you try f16_kv=True?
Also ggml-vic7b-uncensored-q4 has format=ggjt baked in. This might be a reason for the speed difference.
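For reference, the flags discussed here all sit on LangChain's LlamaCpp wrapper; a hedged config sketch (parameter names as I've seen them in langchain, worth checking against your installed version; local_path and callbacks as in the snippets above):

```python
from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path=local_path,  # a ggjt-format model avoids the slow load path
    f16_kv=True,            # half-precision KV cache
    n_threads=6,            # match your physical-core budget
    use_mmap=True,
    callbacks=callbacks,
    verbose=True,
)
```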
from casalioy.
> I'm getting >60ms per token hits. Running six threads.
> Haven't touched ggml convertion yet. Also did not force RAM since I'm only at 16GiB.
> @alxspiker did you try f16_ky=True?
> Also ggml-vic7b-uncensored-q4 has a format=ggjt backed in. This might be a reason for this speed
823.11 ms per token
from casalioy.
Your issue changed my life. My terminal session is close to real time. This is incredible. I'm going to upload the converted ggjt-v1 models onto HuggingFace so it's way easier for people to interact with.
converted vic-7b here
from casalioy.