
fast-llama's Introduction

Fast-LLaMA: A High-Performance Inference Engine


Descriptions

fast-llama is a high-performance inference engine for LLMs such as LLaMA, written in pure C++. It can run an 8-bit quantized LLaMA2-7B model on a 56-core CPU at roughly 25 tokens/s. In the author's benchmarks it outperforms current open-source CPU inference engines, including roughly 2.5x the inference speed of the well-known llama.cpp on a CPU.

Features

| Feature | Current Support | Future Support |
| --- | --- | --- |
| Model Types | ✅ LLaMA2 | Other LLMs such as Baichuan; StableDiffusion |
| Quantization | ✅ INT16, ✅ INT8 | INT4 |
| Model Formats | ✅ HuggingFace, ✅ gguf (by llama.cpp), ✅ flm | |
| Systems | ✅ Linux, ✅ Windows | macOS, Android, iOS |
| CPU/GPU | ✅ x86-64 CPU | ARM, Apple M-series CPUs, GPU, CPU+GPU |
| Architectures | ✅ UMA, ✅ NUMA | |
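To illustrate what the INT8 support listed above involves, here is a minimal sketch of symmetric per-tensor int8 quantization in Python. This is illustrative only; fast-llama's actual quantization kernels are written in C++ and may use a different scheme.

```python
def quantize_int8(values):
    """Symmetric per-tensor int8 quantization: map floats to [-127, 127]."""
    scale = max(abs(v) for v in values) / 127.0
    if scale == 0.0:
        scale = 1.0  # all-zero tensor: any scale works
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize_int8(quantized, scale):
    """Recover approximate float values from int8 codes and the scale."""
    return [q * scale for q in quantized]
```

Inference then runs on the small integer codes plus one float scale per tensor, trading a little precision for a large reduction in memory traffic.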

Advantages

Why should you use Fast-LLaMA?

  • Fast
    • Extremely fast on CPU; faster than other engines on GitHub, including llama.cpp.
  • Simple
    • Fewer than 7k lines of well-organized C++ code, with no dependencies except libnuma (only needed for multi-CPU machines).
  • Easy to use (work in progress ☺️)

Quick Start

Compile

Only Linux is currently supported. Support for other platforms, including Windows, macOS, and GPU, is coming soon.

Requirements

  • GCC 10.x or newer
  • libnuma-dev if your machine has more than one physical CPU
    • Linux kernel v5.x or newer is required for NUMA
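On Linux you can check whether the machine actually has multiple NUMA nodes (and therefore needs libnuma-dev) by counting the node entries under /sys/devices/system/node. A small Python sketch; the helper names are mine, not part of fast-llama:

```python
import os
import re

def count_numa_nodes(entries):
    """Count entries of the form 'node<N>' in a directory listing."""
    return sum(1 for e in entries if re.fullmatch(r"node\d+", e))

def numa_nodes_on_this_machine():
    """Return the NUMA node count from sysfs, or 1 if it is unavailable."""
    try:
        return max(1, count_numa_nodes(os.listdir("/sys/devices/system/node")))
    except OSError:
        return 1  # no sysfs (e.g. non-Linux): assume a single node
```

If this reports 2 or more nodes, install libnuma-dev before compiling.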

Compilation

Method 1. Using the provided build script:

bash ./build.sh

Method 2. Using Make:

make -j 4

Run

1. Run with llama2.c models:

Step 1: Download a model

See llama2.c

Step 2: Run the model

./main -c ./models/stories110M.bin -z ./models/tokenizer.bin -j 14 -q int8 -n 200 -i 'That was a long long story happened in the ancient China.'

2. Run with hugging face format models

Step 1: Download a model

See Chinese-LLaMA-Alpaca-2

Step 2: Convert the model into FLM format

python3 ./tools/convert_flm.py -m /path/to/model-directory -o ./models/model-name-int8.flm -t int8

Step 3: Run the model

./main -c ./models/model-name-int8.flm -j 40 -n 200 -i 'That was a long long story happened in the ancient China.'


All supported command-line options are as follows:

  • -c: Path to the model file
  • -f: Model file format (e.g., gguf)
  • -j: Number of threads to use (e.g., 56)
  • -q: Quantization mode (e.g., int8)
  • -n: Number of tokens to generate (e.g., 200)
  • -i: Input text (e.g., 'That was a long long story happened in the ancient China.')
  • -h: Show usage information
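The options above can also be assembled programmatically. Here is a small Python helper that builds the argument vector for ./main; the helper itself is hypothetical, but the flags are the documented ones:

```python
def build_main_args(model_path, threads=14, quant=None, n_tokens=200,
                    prompt="", fmt=None):
    """Build the argv for ./main from the documented command-line options."""
    args = ["./main", "-c", model_path]
    if fmt is not None:           # -f: model file format, e.g. "gguf"
        args += ["-f", fmt]
    args += ["-j", str(threads)]  # -j: number of threads
    if quant is not None:         # -q: quantization mode, e.g. "int8"
        args += ["-q", quant]
    args += ["-n", str(n_tokens), "-i", prompt]
    return args
```

For example, `build_main_args("./models/stories110M.bin", quant="int8", prompt="Once upon a time")` reproduces the llama2.c invocation shown earlier.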

Performance

Below are some preliminary test results:

| Model | Model Size | Output Speed (8 threads) | Output Speed (28 threads) | Output Speed (56 threads) |
| --- | --- | --- | --- | --- |
| stories110M | 110M | 237 tps | 400 tps | 440 tps |
| Chinese-LLaMA-1.3B | 1.3B | 38.9 tps | 127 tps | 155 tps |
| Chinese-LLaMA-7B | 7B | 7.4 tps | 17.4 tps | 23.5 tps |

  • Note: tps = tokens / second
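Throughput here is simply tokens generated divided by wall-clock time; the thread-scaling behaviour can be read off the table the same way. A quick Python check using the figures above:

```python
def tokens_per_second(n_tokens, elapsed_seconds):
    """tps = tokens / second, as used in the table above."""
    return n_tokens / elapsed_seconds

def scaling_factor(tps_low, tps_high):
    """How much faster the higher-thread run is."""
    return tps_high / tps_low

# Chinese-LLaMA-7B, 8 -> 56 threads (figures from the table above):
speedup = scaling_factor(7.4, 23.5)  # ~3.2x from 7x more threads
```

The sub-linear scaling (about 3.2x speedup from 7x the threads) is typical for memory-bandwidth-bound CPU inference.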

Testing Conditions

  • Testing Prompt: "That was a long long story happened in the ancient Europe. It was about a brave boy name Oliver. Oliver lived in a small village among many big moutains. It was a beautiful village."
  • Quantization: int8
  • NUMA: 2 sockets
    • Note: Make sure that NUMA is truly available if you expect NUMA to accelerate inference.
  • System: Linux coderlsf 5.15.0-72-generic #79-Ubuntu SMP Wed Apr 19 08:22:18 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux (from uname -a)
  • CPU: 56 physical cores with AVX-512 (lscpu excerpt):

    Architecture:            x86_64
    Model name:              Intel(R) Xeon(R) Platinum 8350C CPU @ 2.60GHz
    CPU(s):                  112 (56 physical cores)
    Thread(s) per core:      2
    Core(s) per socket:      28
    Socket(s):               2


Latency of the first token will be optimized later.

Why

Why is it so fast?

  • Ultimate memory efficiency
    • Zero memory allocations and frees during inference.
    • Maximized memory locality.
  • Well-designed thread-scheduling algorithm
  • Optimized operators
    • Fuses all operators that can be fused together
    • Optimizes the computation of several operators
  • Proper quantization
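Operator fusion means computing a chain of element-wise operations in a single pass instead of materializing intermediate buffers. A toy Python illustration (not fast-llama's actual kernels), fusing an RMSNorm-style normalization with the subsequent weight multiply:

```python
import math

def rmsnorm_scale_unfused(xs, ws, eps=1e-5):
    """Two passes: normalize into a temporary list, then multiply by weights."""
    inv = 1.0 / math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    normed = [x * inv for x in xs]               # intermediate buffer written...
    return [n * w for n, w in zip(normed, ws)]   # ...then read back again

def rmsnorm_scale_fused(xs, ws, eps=1e-5):
    """One pass: the same result with no intermediate buffer."""
    inv = 1.0 / math.sqrt(sum(x * x for x in xs) / len(xs) + eps)
    return [x * inv * w for x, w in zip(xs, ws)]
```

The fused version produces identical results while touching each element once, which matters when the working set does not fit in cache.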

License

fast-llama is licensed under the MIT License.

Acknowledgements

Special thanks to AlpinDale for his professional, meticulous, and patient guidance and assistance.

Contact

Email: 📩[email protected]

Contact me if you have any questions.

fast-llama's People

Contributors

coderlsf


fast-llama's Issues

Cannot build: missing Sleef header

g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/utils/ftdebug.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/utils/ftdebug.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/utils/utility.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/utils/utility.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/llama2c_loader.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/llama2c_loader.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/model_loader.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/model_loader.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/gguf_loader.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/model_loaders/gguf_loader.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/main.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/main.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/components/tensor.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/components/tensor.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/platforms/intel/quant_operators.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/platforms/intel/quant_operators.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/platforms/intel/tf_operators.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/platforms/intel/tf_operators.o
g++ -std=c++20 -mavx512f -mavx512bw -mavx512vl -mavx512dq -D_GNU_SOURCE -Wall -I/home/rexommendation/Programs/RapidLLaMA/src -I/home/rexommendation/Programs/RapidLLaMA/src/utils -I/home/rexommendation/Programs/RapidLLaMA/src/model_loaders -I/home/rexommendation/Programs/RapidLLaMA/src/components -I/home/rexommendation/Programs/RapidLLaMA/src/platforms -I/home/rexommendation/Programs/RapidLLaMA/src/platforms/intel -I/home/rexommendation/Programs/RapidLLaMA/src/transformer -O3 -c /home/rexommendation/Programs/RapidLLaMA/src/transformer/tokenizer.cpp -o /home/rexommendation/Programs/RapidLLaMA/src/transformer/tokenizer.o
/home/rexommendation/Programs/RapidLLaMA/src/transformer/tokenizer.cpp:13:10: fatal error: sleef.h: No such file or directory
13 | #include <sleef.h>
| ^~~~~~~~~
compilation terminated.
make: *** [Makefile:37: /home/rexommendation/Programs/RapidLLaMA/src/transformer/tokenizer.o] Error 1

Failed to load model

./main -f gguf -c ../text-generation-webui/models/beagle14-7b.Q5_K_M.gguf

ERROR: [./src/model_loaders/gguf_loader.cpp:263] [load_gguf()] Unsupported file type:17

Failed to load model

Cannot build

Hi, I've been trying to compile RapidLLaMA, but it seems to have issues. Is the repo still incomplete? I also had to build sleef manually with the test units omitted, since sleef doesn't work with the latest versions of mpfr, so it may be worth mentioning that (unless it's fixed upstream).

The error I'm getting at the moment is:

/home/alpindale/AI-Stuff/tools/RapidLLaMA/src/utils/utility.h:25:10: error: ‘unique_ptr’ is not a member of ‘std’
   25 |     std::unique_ptr<char[]> buf(new char[size]);
      |          ^~~~~~~~~~
/home/alpindale/AI-Stuff/tools/RapidLLaMA/src/utils/utility.h:18:1: note: ‘std::unique_ptr’ is defined in header ‘<memory>’; did you forget to ‘#include <memory>’?
   17 | #include <iomanip>
  +++ |+#include <memory>
   18 |
/home/alpindale/AI-Stuff/tools/RapidLLaMA/src/utils/utility.h:25:21: error: expected primary-expression before ‘char’
   25 |     std::unique_ptr<char[]> buf(new char[size]);
      |                     ^~~~
Failed

Llama 2 7B chat Q8 gguf causes an "unknown token id" error

Command:

./main -c ./llama-2-7b-chat.Q8_0.gguf -j 40 -n 200 -i "Advice "

Error

ERROR:[src/model_loaders/gguf_loader.cpp:320][load_gguf()]Unknown key:tokenizer.ggml.unknown_token_id
Failed to load model
