
Comments (9)

alxspiker commented on June 8, 2024

P.S. I got the tokenizer.model from Hugging Face and convert.py from llama.cpp, put them in the parent folder of my Alpaca-7B ggml model (named model.bin), and ran this from the shell: python .\convert.py .\models\ --outfile new.bin
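For anyone scripting this step, the same invocation can be wrapped in a couple of lines (a minimal sketch, not from the repo; the paths and output name just mirror the shell command above):

import subprocess
import sys

# Run llama.cpp's convert.py against the models folder, as in the shell command above.
subprocess.run(
    [sys.executable, "convert.py", "models/", "--outfile", "new.bin"],
    check=True,  # raise CalledProcessError if the conversion fails
)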


alxspiker commented on June 8, 2024

No, I haven't messed around with that yet; I'm just using the db from the SSD.

System Manufacturer: LENOVO
System Model: 81EM
System Type: x64-based PC
System SKU: LENOVO_MT_81EM_BU_idea_FM_ideapad FLEX 6-14IKB
Processor: Intel(R) Core(TM) i7-8550U CPU @ 1.80GHz, 1992 MHz, 4 core(s), 8 logical processor(s)
BIOS Mode: UEFI
Platform Role: Mobile
Installed Physical Memory (RAM): 8.00 GB
Available Virtual Memory: 21.1 GB

I don't know if that's what you need?


su77ungr commented on June 8, 2024

Awesome. Looks like a weekend without any sleep again, haha. I think Vicuna-13B should be our goal since it's the best-performing model at this point. It also might be worth taking a look at FastChat.

If you could craft a routine to convert models to ggml, this would increase accessibility and keep things bootstrapped and simple.

Also feel free to commit your benchmark .txt file // I'm using the default demo files.

I'm at around 108 ms per token with vic7b on an i5-9600K.


alxspiker commented on June 8, 2024

This is starLLM, automated to ask "What is my name?" about a document I had ingested into it.

# use_mmap=True
from langchain.llms import LlamaCpp
llm = LlamaCpp(use_mmap=True, model_path=local_path, callbacks=callbacks, verbose=True)

llama_print_timings:        load time =  8441.23 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  8440.31 ms /     6 tokens ( 1406.72 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  8500.26 ms
 It sounds like your name is Alex.

> Question:
What is my name?

> Answer:
 It sounds like your name is Alex.

> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 47.66585969924927 seconds

and

# use_mmap=False
from langchain.llms import LlamaCpp
llm = LlamaCpp(use_mmap=False, model_path=local_path, callbacks=callbacks, verbose=True)

llama_print_timings:        load time =  6395.35 ms
llama_print_timings:      sample time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings: prompt eval time =  6394.58 ms /     6 tokens ( 1065.76 ms per token)
llama_print_timings:        eval time =     0.00 ms /     1 runs   (    0.00 ms per run)
llama_print_timings:       total time =  6507.05 ms
 Your name is Alexandra.

> Question:
What is my name?

> Answer:
 Your name is Alexandra.

> .\source_documents\state_of_the_union.txt:
My name is alx
Total run time: 42.63529133796692 seconds

So I'm not sure mmap does much; I also haven't figured out why or how LangChain passes that argument through yet.
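A tiny A/B harness would make the comparison less anecdotal (a sketch, not from the repo; the model path and prompt are placeholders):

import time
from langchain.llms import LlamaCpp

def time_one_completion(use_mmap: bool) -> float:
    # Same call as above, minus callbacks, with mmap toggled per run.
    llm = LlamaCpp(model_path="models/model.bin", use_mmap=use_mmap, verbose=False)
    start = time.perf_counter()
    llm("What is my name?")
    return time.perf_counter() - start

for flag in (True, False):
    print(f"use_mmap={flag}: {time_one_completion(flag):.2f} s")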


alxspiker commented on June 8, 2024

> Awesome. Looks like a weekend without any sleep again, haha. I think Vicuna-13B should be our goal since it's the best-performing model at this point. It also might be worth taking a look at FastChat.
>
> If you could craft a routine to convert models to ggml, this would increase accessibility and keep things bootstrapped and simple.
>
> Also feel free to commit your benchmark .txt file // I'm using the default demo files.
>
> I'm at around 108 ms per token with vic7b on an i5-9600K.

I'm going to craft an auto-convert that kicks in if your model shows up as an older format like ggml. I could probably even support .pth and such. People will be thankful; I can't believe the performance difference. I'll also work on / look into Vicuna if you can test it. I'll try to download the model, but my area's internet is slow and unstable.
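Detecting an older model could be as cheap as reading the file header (a sketch of the idea, not the repo's code; llama.cpp's legacy magics are 0x67676d6c for ggml and 0x67676a74 for ggjt, and anything else would need a closer look):

import struct

def model_format(path: str) -> str:
    # The first 4 bytes of a llama.cpp model file are a little-endian magic number.
    with open(path, "rb") as f:
        (magic,) = struct.unpack("<I", f.read(4))
    return {
        0x67676D6C: "ggml (legacy, needs conversion)",
        0x67676A74: "ggjt (current)",
    }.get(magic, "unknown")

print(model_format("models/model.bin"))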


su77ungr commented on June 8, 2024

Why are your runtimes at 1000 ms per token? Can you shoot me your hardware specs, please?

Also, are you using :memory: for testing?

Then we'd be able to craft a benchmark script. Yep, auto-convert seems reasonable.
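For context, :memory: here refers to the vector store location (a sketch, assuming the Qdrant client the project uses): the collection is held entirely in RAM for the benchmark instead of being read from disk.

from qdrant_client import QdrantClient

client_ram = QdrantClient(location=":memory:")  # whole collection held in RAM
client_ssd = QdrantClient(path="db/")           # persisted local storage; path is a placeholder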


su77ungr commented on June 8, 2024

I'm getting >60 ms per token. Running six threads.

Haven't touched ggml conversion yet. Also did not force RAM since I'm only at 16 GiB.

@alxspiker did you try f16_kv=True?

Also, ggml-vic7b-uncensored-q4 has format=ggjt baked in. That might be a reason for its speed.
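Spelled out as LlamaCpp arguments, the settings under discussion would look like this (a sketch; the model path is a placeholder):

from langchain.llms import LlamaCpp

llm = LlamaCpp(
    model_path="models/ggml-vic7b-uncensored-q4.bin",  # placeholder path
    f16_kv=True,   # 16-bit key/value cache, as suggested above
    n_threads=6,   # six threads, matching the setup above
    verbose=True,
)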


alxspiker commented on June 8, 2024

> I'm getting >60 ms per token. Running six threads.
>
> Haven't touched ggml conversion yet. Also did not force RAM since I'm only at 16 GiB.
>
> @alxspiker did you try f16_kv=True?
>
> Also, ggml-vic7b-uncensored-q4 has format=ggjt baked in. That might be a reason for its speed.

With that, I'm at 823.11 ms per token.


su77ungr commented on June 8, 2024

Your issue changed my life. My terminal session is close to real time. This is incredible. I'm going to upload the converted ggjt-v1 models to Hugging Face so it's way easier for people to interact with them.
Converted vic-7b here.

