edwinlim0919 / neural-speed

This project forked from intel/neural-speed

An innovative library for efficient LLM inference via low-bit quantization and sparsity

Home Page: https://github.com/intel/neural-speed

License: Apache License 2.0

Shell 0.78% C++ 67.33% Python 12.40% C 18.69% CMake 0.79%

Neural Speed

Neural Speed is an innovative library designed to support efficient inference of large language models (LLMs) on Intel platforms through state-of-the-art (SOTA) low-bit quantization and sparsity, powered by Intel Neural Compressor and llama.cpp.

Neural Speed is under active development so APIs are subject to change.

Installation

Build Python package (Recommended way)

pip install -r requirements.txt
pip install .

Note: Make sure your GCC version is higher than GCC 10.

Quick Start

There are two ways to use Neural Speed:

1. Transformer-like usage, which requires installing ITREX (Intel Extension for Transformers)
2. llama.cpp-like usage

1. Transformer-like usage

PyTorch format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM
model_name = "Intel/neural-chat-7b-v3-1"     # Hugging Face model_id or local model
prompt = "Once upon a time, there existed a little girl,"

tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)

model = AutoModelForCausalLM.from_pretrained(model_name, load_in_4bit=True)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

GGUF format HF model

from transformers import AutoTokenizer, TextStreamer
from intel_extension_for_transformers.transformers import AutoModelForCausalLM

# Specify the GGUF repo on Hugging Face
model_name = "TheBloke/Llama-2-7B-Chat-GGUF"
# Name of the specific GGUF model file to download from the repo above
model_file = "llama-2-7b-chat.Q4_0.gguf"
# Make sure you have been granted access to this model on Hugging Face
tokenizer_name = "meta-llama/Llama-2-7b-chat-hf"

prompt = "Once upon a time"
tokenizer = AutoTokenizer.from_pretrained(tokenizer_name, trust_remote_code=True)
inputs = tokenizer(prompt, return_tensors="pt").input_ids
streamer = TextStreamer(tokenizer)
model = AutoModelForCausalLM.from_pretrained(model_name, model_file=model_file)
outputs = model.generate(inputs, streamer=streamer, max_new_tokens=300)

Please refer to this link for the list of supported models.

If you want to use the Transformer-based API in ITREX (Intel Extension for Transformers), please refer to the ITREX Installation Page.

2. llama.cpp-like usage

One-click Python scripts

Run an LLM with a one-click Python script that covers conversion, quantization, and inference.

python scripts/run.py model-path --weight_dtype int4 -p "She opened the door and see"

Quantize and Inference Step By Step

Neural Speed supports:

1. GGUF models generated by llama.cpp
2. GGUF models from Hugging Face
3. PyTorch models from Hugging Face, quantized by Neural Speed

Neural Speed provides scripts to 1) convert and quantize and 2) run inference on a model you convert yourself. If the GGUF model comes from Hugging Face or was generated by llama.cpp, you can run inference on it directly.

1. Convert and Quantize LLM

Convert the model by following the steps below:

# Convert the model directly using its Hugging Face model ID (recommended).
python scripts/convert.py --outtype f32 --outfile ne-f32.bin EleutherAI/gpt-j-6b

2. Inference

Linux and WSL

OMP_NUM_THREADS=<physic_cores> numactl -m 0 -C 0-<physic_cores-1> python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores> --color -p "She opened the door and see"

Windows

python scripts/inference.py --model_name llama -m ne-q4_j.bin -c 512 -b 1024 -n 256 -t <physic_cores|P-cores> --color -p "She opened the door and see"

For details please refer to Advanced Usage.
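The two inference commands above differ only in the thread pinning that Linux and WSL add through OMP_NUM_THREADS and numactl. A minimal sketch of assembling either form from a physical core count follows. The helper name build_inference_command and its parameters are hypothetical and not part of Neural Speed; only the flags already shown above are used.

```python
import shlex


def build_inference_command(physical_cores, model_path, prompt, use_numactl=True):
    """Return (env, argv) for scripts/inference.py using the flags shown above.

    This helper and its parameters are illustrative assumptions, not a
    Neural Speed API.
    """
    if physical_cores < 1:
        raise ValueError("physical_cores must be at least 1")

    argv = [
        "python", "scripts/inference.py",
        "--model_name", "llama",
        "-m", model_path,
        "-c", "512", "-b", "1024", "-n", "256",
        "-t", str(physical_cores),
        "--color",
        "-p", prompt,
    ]
    env = {}
    if use_numactl:
        # Linux/WSL form: bind memory to NUMA node 0 and pin to cores 0..N-1.
        env["OMP_NUM_THREADS"] = str(physical_cores)
        argv = ["numactl", "-m", "0", "-C", f"0-{physical_cores - 1}"] + argv
    return env, argv


env, argv = build_inference_command(8, "ne-q4_j.bin", "She opened the door and see")
# Print the command as it would be typed in a shell.
print(" ".join(f"{k}={v}" for k, v in env.items()), shlex.join(argv))
```

Passing use_numactl=False yields the plain Windows form. The returned argv list can be handed to subprocess.run together with an environment that merges env into os.environ.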

Supported Hardware

Hardware Optimization
Intel Xeon Scalable Processors
Intel Xeon CPU Max Series
Intel Core Processors

Supported Models

LLAMA, LLAMA2, NeuralChat series, GPT-J, GPT-NEOX, Dolly-v2, MPT, Falcon, BLOOM, OPT, ChatGLM, ChatGLM2, Baichuan, Baichuan2, Qwen, Mistral, Whisper, CodeLlama, MagicCoder and StarCoder.

Neural Speed also supports GGUF models generated by llama.cpp. You need to download the model and use llama.cpp to create the GGUF file. Validated models: llama2-7b-chat-hf, falcon-7b, falcon-40b, mpt-7b, mpt-40b and bloom-7b1.

For more validated GGUF models from Hugging Face, please check the list.

Advanced Usage

1. Quantization and inference

More parameters in llama.cpp-like usage: Advanced Usage.

2. Tensor Parallelism across nodes/sockets

We support a tensor parallelism strategy for distributed inference/training across multiple nodes and multiple sockets. Refer to tensor_parallelism.md to enable this feature.

3. Custom Stopping Criteria

You can customize the stopping criteria to suit your needs by inspecting input_ids to decide whether text generation should stop. See the Custom Stopping Criteria document for a simple example with a minimum generation length of 80 tokens.
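As a rough illustration of the idea, the sketch below shows the kind of check such a criterion might perform on input_ids: stop only when a stop token appears and at least 80 new tokens have been generated. The function name should_stop, its parameters, and the token ids are hypothetical. With Hugging Face transformers, logic like this would typically live in the __call__ method of a StoppingCriteria subclass; how it is wired into Neural Speed is covered by the Custom Stopping Criteria document, not by this sketch.

```python
def should_stop(input_ids, start_length, min_new_tokens, stop_token_ids):
    """Decide whether generation may stop for a single sequence.

    input_ids: token ids of the prompt plus all tokens generated so far.
    start_length: number of prompt tokens at the start of generation.
    min_new_tokens: minimum number of generated tokens before stopping.
    stop_token_ids: collection of token ids that end generation.
    """
    generated = len(input_ids) - start_length
    if not input_ids or generated < min_new_tokens:
        # Too few new tokens so far: keep generating even on a stop token.
        return False
    return input_ids[-1] in stop_token_ids


EOS = 2  # hypothetical end-of-sequence token id
prompt = [1, 15, 27]
too_short = prompt + [5] * 10 + [EOS]    # 11 generated tokens
long_enough = prompt + [5] * 79 + [EOS]  # 80 generated tokens

print(should_stop(too_short, len(prompt), 80, {EOS}))    # False
print(should_stop(long_enough, len(prompt), 80, {EOS}))  # True
```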

4. Verbose Mode

Enable verbose mode and control tracing information using the NEURAL_SPEED_VERBOSE environment variable.

Available modes:

  • 0: Print all tracing information. Comprehensive output, including evaluation time and operator profiling (requires setting NS_PROFILING to ON and recompiling).
  • 1: Print evaluation time. Shows the time taken for each evaluation.
  • 2: Profile individual operators. Helps identify performance bottlenecks within the model (requires setting NS_PROFILING to ON and recompiling).
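The modes above are selected purely through the environment, so one way to apply them from Python is to build an environment mapping and pass it to the process that runs inference. The sketch below assumes a hypothetical helper named verbose_env; only the NEURAL_SPEED_VERBOSE variable and the three mode values come from the text above.

```python
import os

# Mode descriptions taken from the list above.
VERBOSE_MODES = {
    0: "all tracing information (needs a build with NS_PROFILING set to ON)",
    1: "evaluation time",
    2: "per-operator profiling (needs a build with NS_PROFILING set to ON)",
}


def verbose_env(mode, base_env=None):
    """Return a copy of the environment with NEURAL_SPEED_VERBOSE set.

    base_env defaults to the current process environment.
    """
    if mode not in VERBOSE_MODES:
        raise ValueError(f"unknown verbose mode: {mode!r}")
    env = dict(os.environ if base_env is None else base_env)
    env["NEURAL_SPEED_VERBOSE"] = str(mode)
    return env


env = verbose_env(1, base_env={})
print(env)  # {'NEURAL_SPEED_VERBOSE': '1'}
```

The resulting mapping can be passed as the env argument of subprocess.run when launching an inference script, which keeps the setting scoped to that one child process.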

Enable New Model

To add your own model, please follow the graph developer document.

neural-speed's People

Contributors

zhenwei-intel, ddele, intellinjun, zhenzhong1, vincyzhang, a32543254, zhentaoyu, luoyu-intel, airmeng, kevinintel, zhewang1-intc, yuchengliu1, hshen14, clarkchin08, park12sj, aahouzi, penghuicheng, xiguiw, ceciliawwq, eltociear, liangyx2, rdower, thanatosshinji, changwangss, xinyuye-intel, akarx23, lvliang-intel
