intel-analytics / bigdl

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, ModelScope, etc.

Home Page: https://ipex-llm.readthedocs.io

License: Apache License 2.0

Shell 2.25% Python 97.18% Dockerfile 0.34% PowerShell 0.14% Batchfile 0.09%
spark tensorflow keras pytorch bigdl analytics-zoo distributed-deep-learning python scala llm transformers

bigdl's Introduction

Important

bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.


💫 IPEX-LLM

IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1].

Note

ipex-llm Demo

See the demo of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and HuggingFace transformers (on either Intel Core Ultra laptop or Arc GPU) with ipex-llm below.

[Demo videos (webui.mp4, rag.mp4, llama-cpp.mp4, hf.mp4): Text-Generation-WebUI, Local RAG using LangChain-Chatchat, llama.cpp, and HuggingFace transformers, recorded on an Intel Core Ultra laptop and an Intel Arc GPU]

Latest Update 🔥

  • [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
  • [2024/02] ipex-llm now supports directly loading model from ModelScope (魔搭).
  • [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-size LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
  • [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
  • [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
  • [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
  • [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
More updates

ipex-llm Quickstart

Install ipex-llm

  • Windows GPU: installing ipex-llm on Windows with Intel GPU
  • Linux GPU: installing ipex-llm on Linux with Intel GPU
  • Docker: using ipex-llm dockers on Intel CPU and GPU
  • For more details, please refer to the installation guide

Run ipex-llm

  • llama.cpp: running ipex-llm for llama.cpp (using C++ interface of ipex-llm as an accelerated backend for llama.cpp on Intel GPU)
  • vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
  • FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
  • LangChain-Chatchat RAG: running ipex-llm in LangChain-Chatchat (Knowledge Base QA using RAG pipeline)
  • Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
  • Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU

Code Examples

For more details, please refer to the ipex-llm document website.
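
For a quick idea of what the API looks like, here is a minimal sketch of the transformers-style usage (the ipex_llm.transformers package and the load_in_4bit option follow the project documentation; the model id, prompt, and device handling below are illustrative placeholders, so check the official examples for exact details):

    # Minimal sketch: load a Hugging Face model with ipex-llm low-bit optimization.
    # Assumes ipex-llm is installed per the installation guide above; the model id
    # and prompt are placeholders.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model id

    # load_in_4bit=True applies INT4 quantization while the model is loaded
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    model = model.to("xpu")  # Intel GPU; use "cpu" to stay on CPU (GPU runs also
                             # need the driver/oneAPI setup from the install guide)

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")

    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))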

Verified Models

Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Yuan2 link link
Gemma link link
DeciLM-7B link link
Deepseek link link
StableLM link link

Footnotes

  1. Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.


bigdl's Issues

make-dist.sh error

1. rm -r $DIST_DIR/* fails if there is nothing under this folder.
2. cp $BASEDIR/scripts/bigdlvars.sh $BIN_DIR/ fails because the shell file doesn't exist.

Make Module:backward() final

We should make Module:backward final, and override the updateGradInput and accGradParameters methods in its subclasses instead.
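
A hedged sketch of the proposed pattern (illustrative, not the actual BigDL code), where backward is a final template method and subclasses override only the two hooks:

    import com.intel.analytics.bigdl.tensor.Tensor

    // backward() always runs the same two steps and cannot itself be overridden
    abstract class ModuleSketch {
      var gradInput: Tensor[Float] = _

      final def backward(input: Tensor[Float], gradOutput: Tensor[Float]): Tensor[Float] = {
        updateGradInput(input, gradOutput)    // gradient w.r.t. the input
        accGradParameters(input, gradOutput)  // accumulate gradient w.r.t. the parameters
        gradInput
      }

      protected def updateGradInput(input: Tensor[Float], gradOutput: Tensor[Float]): Tensor[Float]
      protected def accGradParameters(input: Tensor[Float], gradOutput: Tensor[Float]): Unit
    }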

In SpatialConvolution, weight address changed, but weightMM not

In SpatialConvolution, weightMM is a view of weight and is only rebuilt when it is empty.

When the getParameter function is called, Module.flatten changes the memory address of weight.

So if we call forward/backward and fetch weight through getParameter, changes to weight won't affect a non-empty weightMM.

The weightMM update rule needs to be changed.

Need a Flatten layer

Currently we only provide a Reshape layer, which requires the user to manually calculate the size from the previous layer.

It would be better if we had a layer similar to Keras:
x = Flatten()(x)

With the current Reshape layer, the user has to work out the size (number of elements in the previous layer divided by the batch size) by hand.
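
A hedged sketch of such a Flatten layer (the Tensor methods size/nElement/reshape are assumed from BigDL's Tensor API; this is not the actual library code):

    import com.intel.analytics.bigdl.tensor.Tensor

    class FlattenSketch {
      def updateOutput(input: Tensor[Float]): Tensor[Float] = {
        val batch = input.size(1)            // first dimension is the mini-batch
        val flat = input.nElement() / batch  // infer the per-sample element count
        input.reshape(Array(batch, flat))    // view as (batch, flat), no user-supplied size
      }
    }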

Support GoogleNet_v2 in Spark-DL

The GoogLeNet v2 model is described in this paper.

A Caffe model is available here.

We should achieve 80% of Intel-Caffe single-node performance, and the same top-1 (68%) and top-5 (88%) error rates as the GPU version, in both local mode and cluster mode.

Support MKL2017 DNN API

Intel has released MKL 2017, which contains a DNN API providing DNN operations optimized for the IA architecture. We will add new layers that leverage these new APIs to get better performance on CPU.

tensor_apply3 on a discontiguous tensor throws an ArrayIndexOutOfBoundsException.

Test code:

    val x = Tensor[Float](2, 1).fill(2f)
    val y = Tensor(Storage(Array(1f, 2, 3, 4, 5, 6)), 1, Array(2, 3))
    x.expandAs(y)
    val z = Tensor[Float](2, 3).zero()
    z.cmul(x, y)  //will call apply3

Exception:

java.lang.ArrayIndexOutOfBoundsException: 2
    at scala.runtime.ScalaRunTime$.array_apply(ScalaRunTime.scala:76)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$$anon$32.apply(DenseTensorMath.scala:60)
    at com.intel.analytics.sparkdl.tensor.DenseTensorApply$.apply3(DenseTensorApply.scala:177)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$.cmul(DenseTensorMath.scala:63)
    at com.intel.analytics.sparkdl.tensor.DenseTensor.cmul(DenseTensor.scala:841)

Enrich the exception info for the require statement

There are lots of require statements here without enough information for debugging, e.g.:
require(1 == this.nDimension, "invalid size")
A better version:
require(1 == this.nDimension, s"nDimension size: ${this.nDimension} should be 1")

Code refactory for Engine, Parameter, Optimizer, DataSet, etc.

  1. Engine
    We should implement an Engine object to represent the environment for the training, including:

    • Type of the underlying execution engine: MKL-BLAS vs. MKL-DNN. We can then provide factory functions to automatically create appropriate version of modules (that is, using MKL-DNN or not) based on the specified type; for now, we can throw an error in the factory functions if MKL-DNN is specified but there is no such version of the module.

    • Configurations for distributed training: partition#, worker#, core#, batch size, batchPerPartition, batchPerWorker, batchPerCore, OMP parameters, etc. We can also perform various checks on these configurations – e.g., batch size should be a multiple of worker# X core# when using MKL-BLAS

    • Pool of threads for running multiple tasks in a worker: we shouldn't expose the multi-threading code in the application logic; instead, we can encapsulate the multi-threading code in the Engine object, and its users (such as the Optimizer) can simply call something like the following (a sketch of such a helper is given at the end of this issue):

      Engine.parallelInvoke(0 until core#) { i =>
         …
      }
  2. Parameters
    We should rename the ps package to parameters, and implement a Parameter class which represents the shared variables (containing both weights and gradients) for both local and distributed modes (somewhat similar to the Broadcast variable). It should provide the following support:

    • User-specified serializer object (e.g., FP16) and update method

    • getWeights/getGradients methods: these should be non-blocking methods that return a FutureResult; the user can then fetch the value through something like FutureResult.getValue(). In this way, the sync weight operations can be overlapped with other operations as follows:

      val W = parameter.getWeights()
      Engine.parallelInvoke(0 until core#) { i =>
          models(i).zeroGradient()
          ...
          W.getValue(localWeights(i))
          …
      }
    • putWeights/putGradients methods for updating the Parameter

  3. Optimizer and DataSet
    We should implement the Optimizer, which exposes only DL-related concepts to the users; it can be constructed using DataSet, Module, Criterion, OptimMethod, etc.; on top of it, we can implement LocalOptimizer and DistributedOptimizer that accept LocalDataSet and DistributedDataSet respectively. Inside the Optimizer, it can create Parameter objects (LocalParameter or DistributedParameter) to manage the local or distributed training respectively.
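
A hedged sketch of the parallelInvoke helper mentioned in the Engine item above (names and pool sizing are illustrative, not the actual implementation):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object EngineSketch {
      private val coreNumber = Runtime.getRuntime.availableProcessors()
      private implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(coreNumber))

      // Run task(i) for every i in range on the hidden pool and wait for completion,
      // so callers never touch threading code directly.
      def parallelInvoke(range: Range)(task: Int => Unit): Unit = {
        val futures = range.map(i => Future(task(i)))
        futures.foreach(f => Await.result(f, Duration.Inf))
      }
    }

    // Usage, mirroring the pseudocode above:
    // EngineSketch.parallelInvoke(0 until coreNumber) { i => /* per-core work */ }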

It might be better if we can provide a general batching/shuffling transformer

Batching is quite common logic; it would be good if the user didn't need to be bothered by it.
We could hide it from the user, or just provide a general Batch transformer.

Ideally, all the user needs to provide is a simple Iterator[Sample]; we take care of the batching and shuffling internally (or this kind of logic can be an out-of-the-box component that is easy to pick up), e.g. (a sketch follows the related-code link below):
Iterator[Sample] -----batching---shuffling--> training/validation

Related code:
https://github.com/intel-analytics/BigDL/blob/0b095036ef3e3b45913d0209ee617e7836e40974/dl/src/main/scala/com/intel/analytics/bigdl/dataset/image/RGBImgToBatch.scala#L33
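
A hedged sketch of such a transformer (Sample and MiniBatch are illustrative placeholders, not the actual BigDL types):

    import scala.util.Random

    case class Sample(features: Array[Float], label: Float)
    case class MiniBatch(samples: Seq[Sample])

    // Shuffle once per epoch (which materializes the iterator), then group lazily
    // into fixed-size batches.
    def toBatches(samples: Iterator[Sample], batchSize: Int,
                  shuffle: Boolean = true): Iterator[MiniBatch] = {
      val source = if (shuffle) Random.shuffle(samples.toSeq).iterator else samples
      source.grouped(batchSize).map(MiniBatch(_))
    }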

Performance of Concat layer

In testing GoogLeNet v1 on sparkdl, there may be a performance issue in the Concat layer.

The time distribution of an iteration is,

  • total, 1471.642277ms
  • forward, 333.476ms
  • backward, 475.903ms

The remaining time (not forward or backward) is 662.351ms, which is almost entirely the cost of the Concat layer.

In WebscaleML, I implemented the concat copy with some repeated and dirty code, as a special case of apply2 in DenseTensorApply.
When tensor1's stride and tensor2's stride are both 1, it uses System.arraycopy instead of a contiguous one-element-at-a-time copy.
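
A hedged sketch of that fast path (illustrative only; the real apply2/DenseTensorApply code is more involved):

    // When both sides are contiguous (stride 1), use one bulk System.arraycopy;
    // otherwise fall back to a strided element-by-element copy.
    def copyStrided(src: Array[Float], srcOffset: Int, srcStride: Int,
                    dst: Array[Float], dstOffset: Int, dstStride: Int, n: Int): Unit = {
      if (srcStride == 1 && dstStride == 1) {
        System.arraycopy(src, srcOffset, dst, dstOffset, n)
      } else {
        var i = 0
        while (i < n) {
          dst(dstOffset + i * dstStride) = src(srcOffset + i * srcStride)
          i += 1
        }
      }
    }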

Refactor ImageClassifier example

  1. change the folder name to ImageClassification
  2. change googlenet to inception
  3. change utils.File to utils.TorchFile
  4. support the case where the input images have no labels
  5. unify the local mode and Spark mode implementations

Add a noop CompressedTensor

Implement a noop CompressedTensor (that can be used in place of FP16CompressedTensor and FP16SplitsCompressedTensor) that does no compression at all.

Support ResNet

We need to support ResNet in spark-dl, including single-node performance testing and tuning, and convergence testing on single node and multi-node.

We can start with ResNet-50. The model topology should reference the Facebook Torch model and the Caffe model. The only difference between these two models is some convolution stride parameters.

The single-node performance goal is to achieve 80% of Intel-Caffe.

The convergence goal is to achieve the same error rates (top-1 and top-5) as the reference model.

Add checking to give a more precise exception instead of NPE

val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1

Add checks for these two values, which might be zero or negative if the user passes incorrect parameters (a sketch follows the stack trace below).

Exception in thread "main" java.lang.NullPointerException
at com.intel.analytics.bigdl.tensor.DenseTensor.fill(DenseTensor.scala:226)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:123)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:30)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:32)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
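
A hedged sketch of the suggested check (variable names follow the formula quoted above; illustrative only, not the actual SpatialConvolution code):

    def checkedOutputSize(inputWidth: Int, inputHeight: Int,
                          kernelW: Int, kernelH: Int,
                          strideW: Int, strideH: Int,
                          padW: Int, padH: Int): (Int, Int) = {
      val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
      val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1
      // Fail early with a readable message instead of a NullPointerException later
      require(outputWidth >= 1 && outputHeight >= 1,
        s"Calculated output size ${outputHeight}x${outputWidth} is too small; " +
          s"check kernel/stride/padding against the ${inputHeight}x${inputWidth} input")
      (outputWidth, outputHeight)
    }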

Save model should not save its buffers

I find that when saving the GoogLeNet model, the model file is 5.7 GB, while its parameter size should only be at the megabyte level. The root cause is that the buffers of the model are also saved to the file. We should mark these buffer fields as transient, so they won't be part of the persisted model file.
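
A hedged sketch of the fix (fInput/fGradInput are illustrative buffer names, not necessarily the actual fields):

    import com.intel.analytics.bigdl.tensor.Tensor

    // @transient keeps scratch buffers out of Java serialization, so only the real
    // parameters (weight, bias) end up in the persisted model file.
    class ConvSketch(nInput: Int, nOutput: Int) extends Serializable {
      val weight: Tensor[Float] = Tensor[Float](nOutput, nInput)  // persisted
      val bias: Tensor[Float] = Tensor[Float](nOutput)            // persisted
      @transient private var fInput: Tensor[Float] = _            // scratch, skipped
      @transient private var fGradInput: Tensor[Float] = _        // scratch, skipped
    }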

Unify data load interface of different dataset

We support using different datasets (ImageNet, CIFAR-10 and MNIST) to train and test the model, but the code is hard to maintain. We should do some code refactoring.

The target is to unify the data loading and transform functions of the different datasets in local and Spark cluster modes.

Refactor SparkML example

  1. Instead of using DataSet on the driver side, we should use only RDD or DataFrame in the driver
  2. In DLClassifier.process, we may transform the Iterator[Row] into a LocalDataSet in mapPartitions
  3. In DLClassifier.process, we cannot share models between different partitions; we need to clone a new localModel in mapPartitions
  4. Move spark.ml.DLClassifier to the utils folder

Give a more general name for the output log

We should not restrict BigDL to only accept Image as input

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: input image smaller than kernel size
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:55)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:30)
