intel-analytics / bigdl

Accelerate local LLM inference and finetuning (LLaMA, Mistral, ChatGLM, Qwen, Baichuan, Mixtral, Gemma, etc.) on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max). A PyTorch LLM library that seamlessly integrates with llama.cpp, HuggingFace, LangChain, LlamaIndex, DeepSpeed, vLLM, FastChat, ModelScope, etc.

Home Page: https://ipex-llm.readthedocs.io

License: Apache License 2.0

Shell 2.25% Python 97.18% Dockerfile 0.34% PowerShell 0.14% Batchfile 0.09%
spark tensorflow keras pytorch bigdl analytics-zoo distributed-deep-learning python scala llm transformers

bigdl's Introduction

Important

bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.


💫 IPEX-LLM

IPEX-LLM is a PyTorch library for running LLMs on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency [1].

Note

ipex-llm Demo

See the demo of running Text-Generation-WebUI, local RAG using LangChain-Chatchat, llama.cpp and HuggingFace transformers (on either Intel Core Ultra laptop or Arc GPU) with ipex-llm below.

[Demo videos (webui.mp4, rag.mp4, llama-cpp.mp4, hf.mp4): Text-Generation-WebUI, Local RAG using LangChain-Chatchat, llama.cpp, and HuggingFace transformers, recorded on an Intel Core Ultra laptop and an Intel Arc GPU]

Latest Update 🔥

  • [2024/03] bigdl-llm has now become ipex-llm (see the migration guide here); you may find the original BigDL project here.
  • [2024/02] ipex-llm now supports directly loading model from ModelScope (魔搭).
  • [2024/02] ipex-llm added initial INT2 support (based on the llama.cpp IQ2 mechanism), which makes it possible to run large-size LLMs (e.g., Mixtral-8x7B) on an Intel GPU with 16GB VRAM.
  • [2024/02] Users can now use ipex-llm through Text-Generation-WebUI GUI.
  • [2024/02] ipex-llm now supports Self-Speculative Decoding, which in practice brings ~30% speedup for FP16 and BF16 inference latency on Intel GPU and CPU respectively.
  • [2024/02] ipex-llm now supports a comprehensive list of LLM finetuning on Intel GPU (including LoRA, QLoRA, DPO, QA-LoRA and ReLoRA).
  • [2024/01] Using ipex-llm QLoRA, we managed to finetune LLaMA2-7B in 21 minutes and LLaMA2-70B in 3.14 hours on 8 Intel Max 1550 GPUs for Stanford-Alpaca (see the blog here).
More updates

ipex-llm Quickstart

Install ipex-llm

  • Windows GPU: installing ipex-llm on Windows with Intel GPU
  • Linux GPU: installing ipex-llm on Linux with Intel GPU
  • Docker: using ipex-llm dockers on Intel CPU and GPU
  • For more details, please refer to the installation guide

Run ipex-llm

  • llama.cpp: running ipex-llm for llama.cpp (using C++ interface of ipex-llm as an accelerated backend for llama.cpp on Intel GPU)
  • vLLM: running ipex-llm in vLLM on both Intel GPU and CPU
  • FastChat: running ipex-llm in FastChat serving on both Intel GPU and CPU
  • LangChain-Chatchat RAG: running ipex-llm in LangChain-Chatchat (Knowledge Base QA using RAG pipeline)
  • Text-Generation-WebUI: running ipex-llm in oobabooga WebUI
  • Benchmarking: running (latency and throughput) benchmarks for ipex-llm on Intel CPU and GPU

Code Examples

For more details, please refer to the ipex-llm document website.
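
For a quick idea of what the API looks like, here is a minimal sketch of the transformers-style usage (the ipex_llm.transformers package and the load_in_4bit option follow the project documentation; the model id, prompt, and device handling below are illustrative placeholders, so check the official examples for exact details):

    # Minimal sketch: load a Hugging Face model with ipex-llm low-bit optimization.
    # Assumes ipex-llm is installed per the installation guide above; the model id
    # and prompt are placeholders.
    import torch
    from transformers import AutoTokenizer
    from ipex_llm.transformers import AutoModelForCausalLM

    model_path = "meta-llama/Llama-2-7b-chat-hf"  # example model id

    # load_in_4bit=True applies INT4 quantization while the model is loaded
    model = AutoModelForCausalLM.from_pretrained(model_path,
                                                 load_in_4bit=True,
                                                 trust_remote_code=True)
    model = model.to("xpu")  # Intel GPU; use "cpu" to stay on CPU (GPU runs also
                             # need the driver/oneAPI setup from the install guide)

    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    input_ids = tokenizer.encode("What is AI?", return_tensors="pt").to("xpu")

    with torch.inference_mode():
        output = model.generate(input_ids, max_new_tokens=32)
    print(tokenizer.decode(output[0], skip_special_tokens=True))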

Verified Models

Over 50 models have been optimized/verified on ipex-llm, including LLaMA/LLaMA2, Mistral, Mixtral, Gemma, LLaVA, Whisper, ChatGLM2/ChatGLM3, Baichuan/Baichuan2, Qwen/Qwen-1.5, InternLM and more; see the list below.

Model CPU Example GPU Example
LLaMA (such as Vicuna, Guanaco, Koala, Baize, WizardLM, etc.) link1, link2 link
LLaMA 2 link1, link2 link
ChatGLM link
ChatGLM2 link link
ChatGLM3 link link
Mistral link link
Mixtral link link
Falcon link link
MPT link link
Dolly-v1 link link
Dolly-v2 link link
Replit Code link link
RedPajama link1, link2
Phoenix link1, link2
StarCoder link1, link2 link
Baichuan link link
Baichuan2 link link
InternLM link link
Qwen link link
Qwen1.5 link link
Qwen-VL link link
Aquila link link
Aquila2 link link
MOSS link
Whisper link link
Phi-1_5 link link
Flan-t5 link link
LLaVA link link
CodeLlama link link
Skywork link
InternLM-XComposer link
WizardCoder-Python link
CodeShell link
Fuyu link
Distil-Whisper link link
Yi link link
BlueLM link link
Mamba link link
SOLAR link link
Phixtral link link
InternLM2 link link
RWKV4 link
RWKV5 link
Bark link link
SpeechT5 link
DeepSeek-MoE link
Ziya-Coding-34B-v1.0 link
Phi-2 link link
Yuan2 link link
Gemma link link
DeciLM-7B link link
Deepseek link link
StableLM link link

Footnotes

  1. Performance varies by use, configuration and other factors. ipex-llm may not optimize to the same degree for non-Intel products. Learn more at www.Intel.com/PerformanceIndex.


bigdl's Issues

make-dist.sh error

1. rm -r $DIST_DIR/* fails if there is nothing under this folder.
2. cp $BASEDIR/scripts/bigdlvars.sh $BIN_DIR/ fails because the shell file doesn't exist.

Make Module:backward() final

We should make Module:backward final, and override the updateGradInput and accGradParameters methods in its subclasses instead.
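
A hedged sketch of the proposed pattern (illustrative, not the actual BigDL code), where backward is a final template method and subclasses override only the two hooks:

    import com.intel.analytics.bigdl.tensor.Tensor

    // backward() always runs the same two steps and cannot itself be overridden
    abstract class ModuleSketch {
      var gradInput: Tensor[Float] = _

      final def backward(input: Tensor[Float], gradOutput: Tensor[Float]): Tensor[Float] = {
        updateGradInput(input, gradOutput)    // gradient w.r.t. the input
        accGradParameters(input, gradOutput)  // accumulate gradient w.r.t. the parameters
        gradInput
      }

      protected def updateGradInput(input: Tensor[Float], gradOutput: Tensor[Float]): Tensor[Float]
      protected def accGradParameters(input: Tensor[Float], gradOutput: Tensor[Float]): Unit
    }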

In SpatialConvolution, weight address changed, but weightMM not

In SpatialConvolution, weightMM is a view of weight and is only rebuilt when it is empty.

When the getParameter function is called, Module.flatten changes the memory address of weight.

So if we call forward/backward and fetch weight through getParameter, changes to weight won't affect a non-empty weightMM.

The weightMM update rule needs to be changed.

Need a Flatten layer

Currently we only provide a Reshape layer, which requires the user to manually calculate the size from the previous layer.

It would be better if we had a layer similar to Keras:
x = Flatten()(x)

With the current Reshape layer, the user has to work out the size (number of elements in the previous layer divided by the batch size) by hand.
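
A hedged sketch of such a Flatten layer (the Tensor methods size/nElement/reshape are assumed from BigDL's Tensor API; this is not the actual library code):

    import com.intel.analytics.bigdl.tensor.Tensor

    class FlattenSketch {
      def updateOutput(input: Tensor[Float]): Tensor[Float] = {
        val batch = input.size(1)            // first dimension is the mini-batch
        val flat = input.nElement() / batch  // infer the per-sample element count
        input.reshape(Array(batch, flat))    // view as (batch, flat), no user-supplied size
      }
    }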

Support GoogleNet_v2 in Spark-DL

The GoogLeNet v2 model is described in this paper.

A Caffe model is available here.

We should achieve 80% of Intel-Caffe single-node performance, and the same top-1 (68%) and top-5 (88%) error rates as the GPU version, in both local mode and cluster mode.

Support MKL2017 DNN API

Intel has released MKL 2017, which contains a DNN API providing DNN operations optimized for the IA architecture. We will add new layers that leverage these new APIs to get better performance on CPU.

tensor_apply3 on a discontiguous tensor throws an ArrayIndexOutOfBoundsException.

Test code:

    val x = Tensor[Float](2, 1).fill(2f)
    val y = Tensor(Storage(Array(1f, 2, 3, 4, 5, 6)), 1, Array(2, 3))
    x.expandAs(y)
    val z = Tensor[Float](2, 3).zero()
    z.cmul(x, y)  //will call apply3

Exception:

java.lang.ArrayIndexOutOfBoundsException: 2
    at scala.runtime.ScalaRunTime$.array_apply(ScalaRunTime.scala:76)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$$anon$32.apply(DenseTensorMath.scala:60)
    at com.intel.analytics.sparkdl.tensor.DenseTensorApply$.apply3(DenseTensorApply.scala:177)
    at com.intel.analytics.sparkdl.tensor.DenseTensorMath$.cmul(DenseTensorMath.scala:63)
    at com.intel.analytics.sparkdl.tensor.DenseTensor.cmul(DenseTensor.scala:841)

Enrich the exception info for the require statement

There are lots of require statements here without enough information for debugging, e.g.:
require(1 == this.nDimension, "invalid size")
A better version:
require(1 == this.nDimension, s"nDimension size: ${this.nDimension} should be 1")

Code refactory for Engine, Parameter, Optimizer, DataSet, etc.

  1. Engine
    We should implement an Engine object to represent the environment for the training, including:

    • Type of the underlying execution engine: MKL-BLAS vs. MKL-DNN. We can then provide factory functions to automatically create appropriate version of modules (that is, using MKL-DNN or not) based on the specified type; for now, we can throw an error in the factory functions if MKL-DNN is specified but there is no such version of the module.

    • Configurations for distributed training: partition#, worker#, core#, batch size, batchPerPartition, batchPerWorker, batchPerCore, OMP parameters, etc. We can also perform various checks on these configurations – e.g., batch size should be a multiple of worker# X core# when using MKL-BLAS

    • Pool of threads for running multiple tasks in a worker: we shouldn't expose the multi-threading code in the application logic; instead, we can encapsulate the multi-threading code in the Engine object, and its users (such as the Optimizer) can simply call something like the following (a sketch of such a helper is given at the end of this issue):

      Engine.parallelInvoke(0 until core#) { i =>
         …
      }
  2. Parameters
    We should rename the ps package to parameters, and implement a Parameter class which represents the shared variables (containing both weights and gradients) for both local and distributed modes (somewhat similar to the Broadcast variable). It should provide the following support:

    • User-specified serializer object (e.g., FP16) and update method

    • getWeights/getGradients methods: these should be non-blocking methods that return a FutureResult; the user can then fetch the value through something like FutureResult.getValue(). In this way, the sync weight operations can be overlapped with other operations as follows:

      val W = parameter.getWeights()
      Engine.parallelInvoke(0 until core#) { i =>
          models(i).zeroGradient()
          ...
          W.getValue(localWeights(i))
          …
      }
    • putWeights/putGradients methods for updating the Parameter

  3. Optimizer and DataSet
    We should implement the Optimizer, which exposes only DL-related concepts to the users; it can be constructed using DataSet, Module, Criterion, OptimMethod, etc.; on top of it, we can implement LocalOptimizer and DistributedOptimizer that accept LocalDataSet and DistributedDataSet respectively. Inside the Optimizer, it can create Parameter objects (LocalParameter or DistributedParameter) to manage the local or distributed training respectively.
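
A hedged sketch of the parallelInvoke helper mentioned in the Engine item above (names and pool sizing are illustrative, not the actual implementation):

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration

    object EngineSketch {
      private val coreNumber = Runtime.getRuntime.availableProcessors()
      private implicit val ec: ExecutionContext =
        ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(coreNumber))

      // Run task(i) for every i in range on the hidden pool and wait for completion,
      // so callers never touch threading code directly.
      def parallelInvoke(range: Range)(task: Int => Unit): Unit = {
        val futures = range.map(i => Future(task(i)))
        futures.foreach(f => Await.result(f, Duration.Inf))
      }
    }

    // Usage, mirroring the pseudocode above:
    // EngineSketch.parallelInvoke(0 until coreNumber) { i => /* per-core work */ }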

It might be better if we can provide a general batching/shuffling transformer

Batching is quite common logic; it would be good if the user didn't need to be bothered by it.
We could hide it from the user, or just provide a general Batch transformer.

Ideally, all the user needs to provide is a simple Iterator[Sample]; we take care of the batching and shuffling internally (or this kind of logic can be an out-of-the-box component that is easy to pick up), e.g. (a sketch follows the related-code link below):
Iterator[Sample] -----batching---shuffling--> training/validation

Related code:
https://github.com/intel-analytics/BigDL/blob/0b095036ef3e3b45913d0209ee617e7836e40974/dl/src/main/scala/com/intel/analytics/bigdl/dataset/image/RGBImgToBatch.scala#L33
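
A hedged sketch of such a transformer (Sample and MiniBatch are illustrative placeholders, not the actual BigDL types):

    import scala.util.Random

    case class Sample(features: Array[Float], label: Float)
    case class MiniBatch(samples: Seq[Sample])

    // Shuffle once per epoch (which materializes the iterator), then group lazily
    // into fixed-size batches.
    def toBatches(samples: Iterator[Sample], batchSize: Int,
                  shuffle: Boolean = true): Iterator[MiniBatch] = {
      val source = if (shuffle) Random.shuffle(samples.toSeq).iterator else samples
      source.grouped(batchSize).map(MiniBatch(_))
    }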

Performance of Concat layer

In testing GoogLeNet v1 on sparkdl, there may be a performance issue in the Concat layer.

The time distribution of an iteration is,

  • total, 1471.642277ms
  • forward, 333.476ms
  • backward, 475.903ms

The remaining time (not forward or backward) is 662.351ms, which is almost entirely the cost of the Concat layer.

In WebscaleML, I implemented the concat copy with some repeated and dirty code, as a special case of apply2 in DenseTensorApply.
When tensor1's stride and tensor2's stride are both 1, it uses System.arraycopy instead of a contiguous one-element-at-a-time copy.
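
A hedged sketch of that fast path (illustrative only; the real apply2/DenseTensorApply code is more involved):

    // When both sides are contiguous (stride 1), use one bulk System.arraycopy;
    // otherwise fall back to a strided element-by-element copy.
    def copyStrided(src: Array[Float], srcOffset: Int, srcStride: Int,
                    dst: Array[Float], dstOffset: Int, dstStride: Int, n: Int): Unit = {
      if (srcStride == 1 && dstStride == 1) {
        System.arraycopy(src, srcOffset, dst, dstOffset, n)
      } else {
        var i = 0
        while (i < n) {
          dst(dstOffset + i * dstStride) = src(srcOffset + i * srcStride)
          i += 1
        }
      }
    }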

Refactor ImageClassifier example

  1. change the folder name to ImageClassification
  2. change googlenet to inception
  3. change utils.File to utils.TorchFile
  4. support the case where the input images have no labels
  5. unify the local mode and Spark mode implementations

Add a noop CompressedTensor

Implement a noop CompressedTensor (that can be used in place of FP16CompressedTensor and FP16SplitsCompressedTensor) that does no compression at all.

Support ResNet

We need to support ResNet in spark-dl, including single-node performance testing and tuning, and convergence testing on single node and multi-node.

We can start with ResNet-50. The model topology should reference the Facebook Torch model and the Caffe model. The only difference between these two models is some convolution stride parameters.

The single-node performance goal is to achieve 80% of Intel-Caffe.

The convergence goal is to achieve the same error rates (top-1 and top-5) as the reference model.

Add checking to give a more precise exception instead of NPE

val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1

Add checks for these two values, which might be zero or negative if the user passes incorrect parameters (a sketch follows the stack trace below).

Exception in thread "main" java.lang.NullPointerException
at com.intel.analytics.bigdl.tensor.DenseTensor.fill(DenseTensor.scala:226)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:123)
at com.intel.analytics.bigdl.nn.SpatialConvolution.updateOutput(SpatialConvolution.scala:30)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.nn.Sequential.updateOutput(Sequential.scala:32)
at com.intel.analytics.bigdl.nn.abstractnn.AbstractModule.forward(AbstractModule.scala:129)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply$mcD$sp(LocalOptimizer.scala:117)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at com.intel.analytics.bigdl.optim.LocalOptimizer$$anonfun$5$$anonfun$apply$1.apply(LocalOptimizer.scala:111)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
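
A hedged sketch of the suggested check (variable names follow the formula quoted above; illustrative only, not the actual SpatialConvolution code):

    def checkedOutputSize(inputWidth: Int, inputHeight: Int,
                          kernelW: Int, kernelH: Int,
                          strideW: Int, strideH: Int,
                          padW: Int, padH: Int): (Int, Int) = {
      val outputWidth = (inputWidth + 2 * padW - kernelW) / strideW + 1
      val outputHeight = (inputHeight + 2 * padH - kernelH) / strideH + 1
      // Fail early with a readable message instead of a NullPointerException later
      require(outputWidth >= 1 && outputHeight >= 1,
        s"Calculated output size ${outputHeight}x${outputWidth} is too small; " +
          s"check kernel/stride/padding against the ${inputHeight}x${inputWidth} input")
      (outputWidth, outputHeight)
    }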

Save model should not save its buffers

I find that when saving the GoogLeNet model, the model file is 5.7 GB, while its parameter size should only be at the megabyte level. The root cause is that the buffers of the model are also saved to the file. We should mark these buffer fields as transient, so they won't be part of the persisted model file.
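
A hedged sketch of the fix (fInput/fGradInput are illustrative buffer names, not necessarily the actual fields):

    import com.intel.analytics.bigdl.tensor.Tensor

    // @transient keeps scratch buffers out of Java serialization, so only the real
    // parameters (weight, bias) end up in the persisted model file.
    class ConvSketch(nInput: Int, nOutput: Int) extends Serializable {
      val weight: Tensor[Float] = Tensor[Float](nOutput, nInput)  // persisted
      val bias: Tensor[Float] = Tensor[Float](nOutput)            // persisted
      @transient private var fInput: Tensor[Float] = _            // scratch, skipped
      @transient private var fGradInput: Tensor[Float] = _        // scratch, skipped
    }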

Unify data load interface of different dataset

We support using different datasets (ImageNet, CIFAR-10 and MNIST) to train and test the model, but the code is hard to maintain. We should do some code refactoring.

The target is to unify the data loading and transform functions of the different datasets in local and Spark cluster modes.

Refactor SparkML example

  1. Instead of using DataSet on the driver side, we should use only RDD or DataFrame in the driver
  2. In DLClassifier.process, we may transform the Iterator[Row] into a LocalDataSet in mapPartitions
  3. In DLClassifier.process, we cannot share models between different partitions; we need to clone a new localModel in mapPartitions
  4. Move spark.ml.DLClassifier to the utils folder

Give a more general name for the output log

We should not restrict BigDL to only accept Image as input

Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: input image smaller than kernel size
at scala.Predef$.require(Predef.scala:233)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:55)
at com.intel.analytics.bigdl.nn.SpatialMaxPooling.updateOutput(SpatialMaxPooling.scala:30)
