
xFasterTransformer

English | 简体中文

xFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.

Models overview

Large Language Models (LLMs) are developing very fast and are increasingly used in many AI scenarios. xFasterTransformer is an optimized solution for LLM inference using the mainstream and popular LLM models on Xeon. xFasterTransformer fully leverages the hardware capabilities of Xeon platforms to achieve high performance and high scalability of LLM inference, both on a single socket and across multiple sockets/multiple nodes.

xFasterTransformer provides a series of C++ and Python APIs so that end users can integrate xFasterTransformer into their own solutions or services directly. A variety of example code is provided to demonstrate usage, benchmark code and scripts are included to measure performance, and web demos for popular LLM models are also provided.

Model support matrix

The following models are supported, through both the PyTorch and C++ APIs and in distributed mode:

  • ChatGLM
  • ChatGLM2
  • ChatGLM3
  • GLM4
  • Llama
  • Llama2
  • Llama3
  • Baichuan
  • Baichuan2
  • QWen
  • QWen2
  • SecLLM (YaRN-Llama)
  • Opt
  • Deepseek-coder
  • gemma
  • gemma-1.1
  • codegemma

DataType support list

  • FP16
  • BF16
  • INT8
  • W8A8
  • INT4
  • NF4
  • BF16_FP16
  • BF16_INT8
  • BF16_W8A8
  • BF16_INT4
  • BF16_NF4
  • W8A8_INT8
  • W8A8_INT4
  • W8A8_NF4
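
These data types correspond to the lowercase dtype strings accepted by the APIs (for example dtype="bf16" or dtype="bf16_fp16"). A minimal sketch of selecting a hybrid data type through the Python API, with a hypothetical model path:

import xfastertransformer

# Hypothetical converted-model directory; the dtype string selects one of the
# data types listed above, here the hybrid BF16_INT8 configuration.
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16_int8")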

Documents

xFasterTransformer Documents and Wiki provide the following resources:

  • An introduction to xFasterTransformer.
  • Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.
  • Practical API usage examples for xFasterTransformer in both C++ and PyTorch.

Installation

From PyPI

pip install xfastertransformer

Using Docker

docker pull intel/xfastertransformer:latest

Run the Docker container with the following command (assuming the model files are in the /data/ directory):

docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest

Notice!!!: Please enlarge --shm-size if a bus error occurs while running in multi-rank mode. Docker limits the shared memory size to 64MB by default, and our implementation uses shared memory extensively to achieve better performance.

Built from source

Prepare Environment

Manually
  • PyTorch v2.3 (required when using the PyTorch API; not needed when using the C++ API.)

    pip install torch --index-url https://download.pytorch.org/whl/cpu
  • For GPU, xFT requires ABI=1, i.e. torch==2.3.0+cpu.cxx11.abi from the torch whl list, because DPC++ requires ABI=1.

Install dependent libraries

Please install the libnuma package:

  • CentOS: yum install libnuma-devel
  • Ubuntu: apt-get install libnuma-dev
How to build

  • Using CMake
    # Build xFasterTransformer
    git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer
    cd xFasterTransformer
    git checkout <latest-tag>
    # Please make sure torch is installed when running the Python example
    mkdir build && cd build
    cmake ..
    make -j
  • Using python setup.py
    # Build xFasterTransformer library and C++ example.
    python setup.py build
    
    # Install xFasterTransformer into pip environment.
    # Notice: Run `python setup.py build` before installation!
    python setup.py install

Models Preparation

xFasterTransformer uses a model format that differs from Hugging Face's, but it is compatible with FasterTransformer's format.

  1. First, download the model in Hugging Face format.

  2. After that, convert the model into xFasterTransformer format using the model convert module in xfastertransformer. If the output directory is not provided, the converted model will be placed in ${HF_DATASET_DIR}-xft.

    python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
    

    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.

    Supported model convert list:

    • LlamaConvert
    • YiConvert
    • GemmaConvert
    • ChatGLMConvert
    • ChatGLM2Convert
    • ChatGLM4Convert
    • OPTConvert
    • BaichuanConvert
    • Baichuan2Convert
    • QwenConvert
    • Qwen2Convert
    • DeepseekConvert
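
The other converters follow the same pattern. For example, a Qwen2 checkpoint could be converted like this (the paths here are hypothetical):

import xfastertransformer as xft

# Hypothetical input/output directories; Qwen2Convert is one of the converter
# classes listed above and uses the same convert(input_dir, output_dir) call.
xft.Qwen2Convert().convert("/data/Qwen2-7B-Instruct", "/data/Qwen2-7B-Instruct-xft")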

API usage

For more details, please see API document and examples.

Python API(PyTorch)

First, please install the dependencies.

  • Python dependencies
    pip install -r requirements.txt
    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.
  • oneCCL (for multi-rank mode)
    Install oneCCL and set up the environment. Please refer to Prepare Environment.

xFasterTransformer's Python API is similar to transformers and also supports the transformers streamer for streaming output. In the example, we use transformers to encode the input prompt into token ids.

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-xft`.
MODEL_PATH="/data/chatglm-6b-xft"
TOKEN_PATH="/data/chatglm-6b-hf"

INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
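
The generated token ids can be decoded back into text with the same tokenizer. A small follow-up sketch (the streamer above already prints the output incrementally; this is just the equivalent one-shot decode):

# Decode the generated token ids back into text.
output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)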

C++ API

SentencePiece can be used to tokenize and detokenize text.

#include <vector>
#include <iostream>
#include "xfastertransformer.h"
// ChatGLM token ids for prompt "Once upon a time, there existed a little girl who liked to have adventures."
std::vector<int> input(
        {3393, 955, 104, 163, 6, 173, 9166, 104, 486, 2511, 172, 7599, 103, 127, 17163, 7, 130001, 130004});

// Assume converted model dir is `/data/chatglm-6b-xft`.
xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

model.config(/*max length*/ 100, /*num beams*/ 1);
model.input(/*input token ids*/ input, /*batch size*/ 1);

while (!model.isDone()) {
    std::vector<int> nextIds = model.generate();
}

std::vector<int> result = model.finalize();
for (auto id : result) {
    std::cout << id << " ";
}
std::cout << std::endl;

How to run

We recommend preloading libiomp5.so for better performance.

  • [Recommended] Run export $(python -c 'import xfastertransformer as xft; print(xft.get_env())') if the xfastertransformer Python wheel package is installed.
  • If building from source, the libiomp5.so file will be in the 3rdparty/mkl/lib directory after xFasterTransformer is built successfully.

Single rank

xFasterTransformer automatically checks the MPI environment, or you can set the SINGLE_INSTANCE=1 environment variable to forcibly deactivate MPI.
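
If everything is driven from Python, the same switch can also be set programmatically before the model is created; a minimal sketch, assuming the variable is read at model construction time and using a hypothetical model path:

import os
os.environ["SINGLE_INSTANCE"] = "1"  # assumed to be read when the model is constructed

import xfastertransformer
# Hypothetical converted-model directory.
model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")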

Multi ranks

Command line

Use MPI to run in multi-rank mode; please install oneCCL first.

  • oneCCL Installation

    • If you have built xFasterTransformer from source, oneCCL was installed into 3rdparty during compilation.
      source ./3rdparty/oneccl/build/_install/env/setvars.sh
      
    • [Recommended] Use provided scripts to build it from source code.
      cd 3rdparty
      sh prepare_oneccl.sh
      source ./oneccl/build/_install/env/setvars.sh
    • Install oneCCL by installing the Intel® oneAPI Base Toolkit (Notice: versions 2023.x and below are recommended), and source the environment with:
      source /opt/intel/oneapi/setvars.sh
      
  • Here is an example of running locally.

    # or export LD_PRELOAD=libiomp5.so manually
    export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
    OMP_NUM_THREADS=48 mpirun \
      -n 1 numactl -N 0  -m 0 ${RUN_WORKLOAD} : \
      -n 1 numactl -N 1  -m 1 ${RUN_WORKLOAD} 

Code

For more details, please refer to examples.

Python

model.rank returns the process's rank; model.rank == 0 is the master.
For slaves, after loading the model, the only thing that needs to be done is model.generate(). The input and generation configuration will be synced automatically.

model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")

# Slave
while True:
    model.generate()
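
Putting both roles into one script, a combined sketch might look like the following (paths are hypothetical; rank 0 drives generation while the other ranks only loop on generate()):

import xfastertransformer
from transformers import AutoTokenizer

model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")

if model.rank == 0:
    # Master: encode the prompt and generate; inputs and generation config are synced to the slaves.
    tokenizer = AutoTokenizer.from_pretrained("/data/chatglm-6b-hf", use_fast=False, trust_remote_code=True)
    input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=200)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
else:
    # Slave: keep calling generate(); it receives the synced inputs from the master.
    while True:
        model.generate()
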
C++

model.getRank() returns the process's rank; model.getRank() == 0 is the master.
For slaves, any value can be passed to model.config() and model.input() since the master's values will be synced.

xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

// Slave
while (1) {
    model.config();
    std::vector<int> input_ids;
    model.input(/*input token ids*/ input_ids, /*batch size*/ 1);

    while (!model.isDone()) {
        model.generate();
    }
}

Web Demo

A web demo based on Gradio is provided in the repo. It currently supports the ChatGLM, ChatGLM2 and Llama2 models.

  • Prepare the model.
  • Install the dependencies
    pip install -r examples/web_demo/requirements.txt
    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.
  • Run the script corresponding to the model. After the web server starts, open the output URL in the browser to use the demo. Please specify the model and tokenizer directory paths and the data type. The transformers tokenizer is used to encode and decode text, so ${TOKEN_PATH} means the Hugging Face model directory. This demo also supports multi-rank mode.
# Recommended: preload `libiomp5.so` for better performance.
# Or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python examples/web_demo/ChatGLM.py \
                      --dtype=bf16 \
                      --token_path=${TOKEN_PATH} \
                      --model_path=${MODEL_PATH}

Serving

vLLM

A fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM's features. Refer to this link for more details.

Install

pip install vllm-xft

Notice: Please do not install both vllm-xft and vllm simultaneously in the environment. Although the package names are different, they will actually overwrite each other.

OpenAI Compatible Server

Notice: Preloading libiomp5.so is required!

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code
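
Once the server is running, it can be queried with any OpenAI-compatible client. A minimal sketch using the requests package (the model name and port match the command above):

import requests

# Query the OpenAI-compatible completions endpoint exposed by the server above.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "xft",              # matches --served-model-name
        "prompt": "Once upon a time,",
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["text"])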

For multi-rank mode, please use python -m vllm.entrypoints.slave as the slave and keep the slaves' parameters aligned with the master's.

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

OMP_NUM_THREADS=48 mpirun \
        -n 1 numactl --all -C 0-47 -m 0 \
          python -m vllm.entrypoints.openai.api_server \
            --model ${MODEL_PATH} \
            --tokenizer ${TOKEN_PATH} \
            --dtype bf16 \
            --kv-cache-dtype fp16 \
            --served-model-name xft \
            --port 8000 \
            --trust-remote-code \
        : -n 1 numactl --all -C 48-95 -m 1 \
          python -m vllm.entrypoints.slave \
            --dtype bf16 \
            --model ${MODEL_PATH} \
            --kv-cache-dtype fp16

FastChat

xFasterTransformer is an official inference backend of FastChat. Please refer to xFasterTransformer in FastChat and FastChat's serving for more details.

MLServer

An MLServer serving example is provided, which supports REST and gRPC interfaces and an adaptive batching feature to group inference requests together on the fly.

Benchmark

Benchmark scripts are provided to quickly measure model inference performance.

  • Prepare the model.
  • Install the dependencies, including oneCCL and python dependencies.
  • Enter the benchmark folder and run run_benchmark.sh. Please refer to Benchmark README for more information.

Notes!!!: System and CPU configurations may vary. For the best performance, try adjusting OMP_NUM_THREADS, the data type, and the number of memory nodes (check the memory nodes using numactl -H) according to your test environment.

Support

Accepted Papers

If xFT is useful for your research, please cite:

@article{he2024distributed,
  title={Distributed Inference Performance Optimization for LLMs on CPUs},
  author={He, Pujiang and Zhou, Shan and Li, Changqing and Huang, Wenhuan and Yu, Weifei and Wang, Duyi and Meng, Chen and Gui, Sheng},
  journal={arXiv preprint arXiv:2407.00029},
  year={2024}
}

and

@inproceedings{he2024inference,
  title={Inference Performance Optimization for Large Language Models on CPUs},
  author={He, Pujiang and Zhou, Shan and Huang, Wenhuan and Li, Changqing and Wang, Duyi and Guo, Bin and Meng, Chen and Gui, Sheng and Yu, Weifei and Xie, Yi},
  booktitle={ICML 2024 Workshop on Foundation Models in the Wild}
}

Q&A

  • Q: Can xFasterTransformer run on an Intel® Core™ CPU?
    A: No. xFasterTransformer requires support for the AMX and AVX512 instruction sets, which are not available on Intel® Core™ CPUs.

  • Q: Can xFasterTransformer run on the Windows system?
    A: There is no native support for Windows, and all compatibility tests are only conducted on Linux, so Linux is recommended.

  • Q: Why does the program freeze or exit with errors when running in multi-rank mode after installing the latest version of oneCCL through oneAPI?
    A: Please try downgrading oneAPI to version 2023.x or below, or use the provided script to install oneCCL from source code.

  • Q: Why does running the program using two CPU sockets result in much lower performance compared to running on a single CPU socket?
    A: Running in this way causes the program to engage in many unnecessary cross-socket communications, significantly impacting performance. If there is a need for cross-socket deployment, consider running in a multi-rank mode with one rank on each socket.

  • Q: The performance is normal when running in a single rank, but why is the performance very slow and the CPU utilization very low when using MPI to run multiple ranks?
    A: Programs launched through MPI see OMP_NUM_THREADS=1 and cannot correctly retrieve the appropriate value from the environment. You need to set OMP_NUM_THREADS manually according to the actual situation.

  • Q: Why do I still encounter errors when converting already supported models?
    A: Try downgrading transformers to an appropriate version, such as the version specified in requirements.txt. This is because different versions of transformers may change the names of certain variables.


xfastertransformer's Issues

Streaming output issue

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-cpu`.
MODEL_PATH="/data/jane/models/chatglm2-6b-cpu/"
TOKEN_PATH="/data/jane/models/chatglm2-6b"

#INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
INPUT_PROMPT = "问:上海在哪?答: "
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
print(streamer)
input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)

After changing the value of INPUT_PROMPT, the output is no longer streamed; everything is printed at once.

Does this CPU meet xFasterTransformer's minimum requirements?

CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           48
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Bochs
CPU family:          6
Model:               61
Model name:          Intel Core Processor (Broadwell, IBRS)
Stepping:            2
CPU MHz:             2095.078
BogoMIPS:            4190.15
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
NUMA node0 CPU(s):   0-47
Flags:               fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear

This is the lscpu output; please help check whether it meets the minimum requirements to run.

Error when converting a model

Hello, when converting a model with xFasterTransformer, I got the following error:
python ./tools/llama_convert.py -i /workspace/llama-2-7b-chat-hf/ -o /workspace/llama-2-7b-chat-cpu/

/usr/local/lib/python3.8/dist-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
How can this be resolved?
I suspect it is a package version issue. I saw that Intel provides a Docker image, but I cannot pull it. Where is it available?

AttributeError: module xfastertransformer has no attribute AutoModel

(demo) [root@iZ2ze5jp679eomnr2xu3s0Z web_demo]# python ChatGLM2.py
[INFO] xfastertransformer is not installed in pip, using source code.
Traceback (most recent call last):
  File "ChatGLM2.py", line 68, in <module>
    demo = ChatGLM2Demo(args.token_path, args.model_path, dtype=args.dtype)
  File "/mnt/xFasterTransformer/examples/web_demo/demo_utils.py", line 61, in __init__
    self.model = xfastertransformer.AutoModel.from_pretrained(model_path, dtype=dtype)
  File "/mnt/xFasterTransformer/examples/web_demo/../../src/xfastertransformer/__init__.py", line 59, in __getattr__                                                                                                        
    raise AttributeError("module {} has no attribute {}".format(self.__name__, name))
AttributeError: module xfastertransformer has no attribute AutoModel

libnuma: Warning: node argument 8 is out of range

In the Docker container, numactl does not recognize the NUMA node for High Bandwidth Memory (HBM), even though 'numactl -H' reveals the corresponding node. It shows the error:

libnuma: Warning: node argument 8 is out of range

HW/SW version:
kernel version: 5.15.0-spr.bkc.pc.16.1.23.x86_64
docker version: 24.0.6
host-numactl version: 2.0.16
container-numactl version: 2.0.14-3ubuntu2
docker run option: docker run -e http_proxy -e https_proxy -e no_proxy --privileged --name wsf-78378ddc4d4ed --rm --detach xftbench-inference-lite:latest

ChatGLM2-6B crashes while running 4 ranks with the w8a8 data type

With xFT 1.2.0, running 4 ranks on HBM with 1 socket. It works when using 1 rank.
bash run_benchmark.sh -m chatglm2-6b -d w8a8 -s 1 -bs 1 -in 1024 -out 32 -i 5

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 7633 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 7634 RUNNING AT spr28
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 7635 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 7636 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

KVCache buffer overflows when token number exceeds the setting of config.ini max_pos_seq_len

If the total sequence length (the sum of the input and output sequence lengths) exceeds the value of max_pos_seq_len in config.ini, the application crashes silently.

The following is the command and output for a llama2-13b test. The default max_pos_seq_len is 2048, which is less than the sum of the input and output token lengths.


LD_PRELOAD=libiomp5.so OMP_NUM_THREADS=28 mpirun
-n 1 numactl -C0-27 -l ../..//build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C28-55 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C56-83 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C84-111 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 | tee 4mpi-result.txt


[INFO] First token time: 3038.85 ms

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1431596 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 1431597 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 1431598 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 1431599 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

ModuleNotFoundError: No module named 'xfastertransformer.tools'

I'm using commit f205d37 to compile xfastertransformer.
When "import xfastertransformer", it shows error:

2: Traceback (most recent call last):
2:   File "/home/workspace/xFasterTransformer/benchmark/benchmark.py", line 89, in <module>
2:     import xfastertransformer
2:   File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/xfastertransformer-1.0.0-py3.9-linux-x86_64.egg/xfastertransformer/__init__.py", line 22, in <module>
2:     from .tools import LlamaConvert
2: ModuleNotFoundError: No module named 'xfastertransformer.tools'

a bug in the doc

root@f62374f19c02:~/xfastertransformer# python ./tools/opt_convert.py -i ./data/opt-1.3b-hf -o ./data/opt-1.3b-cpu
File "./tools/opt_convert.py", line 284
parser.add_argument("--weight_data_type", "-d" type=str, default="fp16", choices=["fp32", "fp16"])
^
SyntaxError: invalid syntax

cmake error when running on Aliyun

Instance spec: ecs.c8i.12xlarge, 48 vCPUs, 96 GiB memory.
Image: Ubuntu 22.04 64-bit (ubuntu_22_04_x64_20G_alibase_20231019.vhd).


CMake Error at /root/xfastertransformer/build/mklml-prefix/src/mklml-stamp/download-mklml.cmake:170 (message):
Each download failed!

error: downloading 'https://gitee.com/qccz123456/oneDNN/releases/download/v0.21/mklml_lnx_2019.0.5.20190502.tgz' failed
      status_code: 28
      status_string: "Timeout was reached"
      log:
      --- LOG BEGIN ---
        Trying 180.76.198.77:443...

Connected to gitee.com (180.76.198.77) port 443 (#0)

ALPN: offers h2

ALPN: offers http/1.1

[CONN-0-0][CF-SSL] TLSv1.0 (OUT), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS handshake, Client hello (1):

[512 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Server hello (2):

[108 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS handshake, Certificate (11):

[3295 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[Model] QWen14B-Chat got wrong output when input tokens is too long

id.txt
attached is the input tokens

the output of Torch version is "在正常情况下,框架大车陆侧和海侧的运行速度可以设置为10%、20%、50%和80%,并且可以通过编码器检测的速度控制器进行速度闭环控制。此外,所有的起升机构都有高度参考点,可以根据需要进行速度和位置同步控制。"

and the output of xft is unreadable.

How do I receive the streaming output?

How do I receive the returned streamer value? I would like to capture it in a variable, similar to stream_chat():

            for response, history in self.model.stream_chat(self.tokenizer, query, history,
                                         max_length=max_length, top_p=top_p, temperature=temperature):

Is there any plan to open-source xdnn?

I want to express my appreciation for the incredible work you have done with xft. However, I noticed that the xdnn lib seems not fully open-sourced, is there any plan for this?

convert Yi34B model fail

Hi,

I try do the Yi34B model conversion with tools/llama_convert.py, but met error...

python tools/llama_convert.py  -i /data/tmp/Yi-34B -o /data/tmp/Yi-34B_xfaster/

=============== Argument ===============
saved_dir: /data/tmp/Yi-34B_xfaster/
in_file: /data/tmp/Yi-34B
processes: 8
weight_data_type: fp32
========================================
Loading checkpoint shards:   0%|                                 | 0/7 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "tools/llama_convert.py", line 225, in <module>
    split_and_convert(args)
  File "tools/llama_convert.py", line 91, in split_and_convert
    model = LlamaForCausalLM.from_pretrained(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 720, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/bin/python3.8/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 7168]) in "weight" (which has shape torch.Size([7168, 7168])), this look incorrect.

[bug] library of Intel level-zero not found

Issue script:

/benchmark/run_benchmark.sh

error message:

2023:12:15-20:39:37:(61144) |CCL_WARN| could not open the library: libze_loader.so, error: libze_loader.so: cannot open shared object file: No such file or directory
......
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 61144 RUNNING AT worker64
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

CPU

SPR9468

related version info:

main branch of xFasterTransformer v1.1.0
CentOS8, kernel 5.17.3-1

other comments:

Such issue is not found in history version;
It took place in both official docker and bare-metal;
Why do we need Level-Zero in xFasterTransformer?

[bug] Segmentation fault occurs at large batch sizes

Segmentation fault occurs at large batch sizes

  1. Command Line:
    ./run_benchmark.sh -m llama-7b -d bf16 -s 1 -bs 100 -in 512 -out 256 -i 1

    Functions with errors:
    onednn_amx_sgemm_f32bf16f32_compute_biasadd

    Matmul matrix shape:
    M = 51200, N = 12288, K= 4096, transA = 0,alpha=1.000000, lda=4096, beta=0.000000,ldc=12288

    oneDNN_verbose:
    onednn_verbose,info,oneDNN v3.2.0 (commit 04b180b9a58a78cf1a1cd2329671a5060c2be8de)
    onednn_verbose,info,cpu,runtime:OpenMP,nthr:48
    onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
    onednn_verbose,info,gpu,runtime:none
    onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time

  2. Command Line:
    ./run_benchmark.sh -m llama-7b -d bf16 -s 1 -bs 100 -in 32 -out 32 -i 1

    Functions with errors:
    hpj::Matrix &input, hpj::Matrix &output, hpj::Matrix &residential, bool isMaster) {
    TimeLine t("DownProj")
    assert(input.Rows() == output.Rows()); (ASSERT FAILED input.Cols()=22016, downWeight.Rows()=11008;)

    Matmul matrix shape:
    M = 3200, N = 12288, K= 4096, transA = 0,alpha=1.000000, lda=4096, beta=0.000000,ldc=12288

    Verbose:
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute_biasadd,m3200n12288k4096,29.308059
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute_residential,m3200n4096k4096,12.953664
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute,m3200n22016k4096,42.813326

xft will be blocked when MPI + QWEN14B + do_sample=true

Using the command below and running generate several times, xft gets blocked in oneCCL.

OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun -n 1 numactl --physcpubind 16-63 --localalloc python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --output_len 512 --dtype bf16_fp16 --do_sample true : -n 1 numactl --physcpubind 80-127 --localalloc python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --output_len 512 --dtype bf16_fp16 --do_sample true

cmake file MD5 bug

-- [download 100% complete]
-- verifying file...
file='/root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_v1.1.tar.gz'
-- MD5 hash of
/root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_v1.1.tar.gz
does not match expected value
expected: 'b55b5d58c92339aa088dcc6e1df6ede2'
actual: 'b49bf8808d66ea75cfba80a406c9a587'
-- Hash mismatch, removing...
CMake Error at /root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_lib-stamp/download-xdnn_lib.cmake:170 (message):
Each download failed!


xdnn.cmake

  • URL_HASH MD5=b55b5d58c92339aa088dcc6e1df6ede2
  • URL_HASH MD5=b49bf8808d66ea75cfba80a406c9a587
    After manual modification, it can be run.

QWEN14B will generate error output when multi queries with long input tokens.

The issue is as follows: with the same long input (3338 tokens) queried multiple times (4 times for this input; smaller inputs need more repetitions), xft generates undecodable or empty outputs.

$ OMP_NUM_THREADS=48 numactl -N 1 python ./demo.py -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu -t /mnt/data/LLM_Models/Qwen-14B-Chat -d bf16_fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
大车行走速度在正常情况下,速度的设定分别为10%、20%、50%和80%。
====================Performance====================
Execution time: 12.97 s
Latency: 432.46 ms/token
Througput: 2.31 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
大车行走速度在正常情况下,速度的设定分别为10%、20%、50%和80%。
====================Performance====================
Execution time: 16.47 s
Latency: 548.86 ms/token
Througput: 1.82 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
根据文档内容,大车机构操作手柄共有4挡,其控制流程图如图3所示。在正常情况下,速度的设定分别为10%、20%、50%和80%
====================Performance====================
Execution time: 15.96 s
Latency: 332.47 ms/token
Througput: 3.01 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338

====================Performance====================
Execution time: 9.74 s
Latency: 4869.03 ms/token
Througput: 0.21 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338

====================Performance====================
Execution time: 15.65 s
Latency: 7823.53 ms/token
Througput: 0.13 tokens/s

baichuan-7b run core dump

雨村问:“政公有个衔玉之子,赦公就没一个?”子兴说:“政公有了玉儿,他的妾又生了一个,还没听说是好是歹。赦公也有二子,次子名叫贾琏,今已二十多岁,娶的是政公王夫人的娘家侄女为妻,亲上加亲。这位琏爷捐了个副知府,也不喜读书
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 4463 RUNNING AT qqq-D50DNP1SBB
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
(llm) root@qqq-D50DNP1SBB:~/llm# bash run_benchmark.sh -m baichuan2-7b -d bf16 -s 1 -bs 1 -in 4096 -out 32 -i 3

xfastertransformer==1.3.1

Running llama2 on two CPUs fails when dtype is set to int8 or bf16_int8

the script is in the attachment.
llama2-7b.zip

the error info is shown as below

  1. int8

memory node number: 16
HBM SNC4 mode
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
FP16 Performance
FP16 Performance
FP16 Performance
FP16 Performance
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
FP16 Performance
FP16 Performance
FP16 Performance
FP16 Performance
Segmentation fault (core dumped)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 21023 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 21024 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 21025 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 21026 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 21028 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 21029 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 21030 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

  1. bf16_int8
    memory node number: 16
    HBM SNC4 mode
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    FP16 Performance
    Segmentation fault (core dumped)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 21300 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 21301 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 21302 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 21303 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 21304 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 21306 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 21307 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

illegal instruction issue

Hi xFasterTransformer Team,

We tried to run the xft example and encountered an "illegal instruction" error (see screenshot in the original issue).

We tried to figure out which instruction it failed on and found it is vcvtps2phx (see screenshot in the original issue).

It seems that AVX512-FP16 instruction set support is necessary for running xft. Is there any way that we can run xft without this instruction set? Thanks!

build xFasterTransformer from source failed

There is no error when installing PyTorch:

$ pip install torch --index-url https://download.pytorch.org/whl/cpu
Looking in indexes: https://download.pytorch.org/whl/cpu
Requirement already satisfied: torch in /usr/local/lib/python2.7/dist-packages (1.5.0+cpu)
Requirement already satisfied: numpy in /usr/local/lib/python2.7/dist-packages (from torch) (1.16.6)

But when building xFasterTransformer, it failed:

# in xFasterTransformer/build directory
$ cmake ..
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
>>> GCC version: 8.3.0
-- Found MPI_C: /root/xFasterTransformer/3rdparty/oneCCL/build/_install/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /root/xFasterTransformer/3rdparty/oneCCL/build/_install/lib/libmpicxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- oneCCL: MPI found
Building with static libraries.
-- PyTorch found. Compiling torch extension...
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: 'module' object has no attribute 'cmake_prefix_path'
-- Configuring done (2.1s)
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
TORCH_GLOBAL_DEPS_LIB-NOTFOUND;TORCH_CPU_LIB-NOTFOUND;TORCH_PYTHON_LIB-NOTFOUND;SHM_CPU_LIB-NOTFOUND;C10_CPU_LIB
    linked by target "xfastertransformer_pt" in directory /root/xFasterTransformer/src/pytorch

-- Generating done (0.0s)
CMake Generate step failed.  Build files cannot be regenerated correctly.

Any help would be greatly appreciated.

Does xFasterTransformer supports Falcon model?

Hello,
Planning to run a PEFT-tuned Falcon model on a CPU (Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz, dockerized Linux). However, the README doesn't mention Falcon; are only the listed models supported?

Thanks.

Qwen-14B-Chat conversion issue

python /root/xFasterTransformer/tools/qwen_convert.py -i /root/autodl-tmp/Qwen-14B-Chat -o /root/autodl-tmp/Qwen-xft/

=============== Argument ===============
saved_dir: /root/autodl-tmp/Qwen-xft/
in_file: /root/autodl-tmp/Qwen-14B-Chat
processes: 8
weight_data_type: fp16
========================================
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:05<00:00,  2.83it/s]
Fail to save the config in config.ini. 'GenerationConfig' object is not subscriptable
Processing ...
Traceback (most recent call last):
  File "/root/xFasterTransformer/tools/qwen_convert.py", line 262, in <module>
    split_and_convert(args)
  File "/root/xFasterTransformer/tools/qwen_convert.py", line 206, in split_and_convert
    param.detach().cpu().numpy().astype(np_weight_data_type).transpose().tofile(os.path.join(saved_dir, "model.wte.bin"))
TypeError: Got unsupported ScalarType BFloat16

How can this issue be resolved?

AMX_int8 not really be used when dtype="int8"

I used 'chatglm-6b.sh' to test chatglm-6b performance with '--dtype int8' set.
But when I used perf to monitor the AMX PMU events, there were no AMX int8 events. When dtype is bf16, it works normally:

OS:Ubuntu22.04 kernel 6.5.0
CPU:intel SPR 6430 *2

Running the benchmark directly with chatglm-6b.sh: with the bf16 data type, the AMX instructions are invoked as expected (the perf AMX PMU-event counter increases), but with the int8 data type, the AMX instructions are not actually used.

OMP_NUM_THREADS=32 mpirun -n 1 numactl -N 0 -m 0 sh chatglm-6b.sh : -n 1 numactl -N 1 -m 1 sh chatglm-6b.sh

(See screenshots in the original issue.)

Benchmark error


SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "FP16 Performance "
python "${SCRIPT_DIR}"/../benchmark.py \
    --token_path /data/jane/models/chatglm2-6b/ \
    --model_path /data/jane/models/chatglm2-6b-cpu/ \
    --prompt_path "${SCRIPT_DIR}"/prompt_pool.json \
    --model_name "ChatGLM-6B" \
    --dtype fp16 \
    --token_in 2016     \
    --token_out 32 --beam_width 1 --iteration 1

When running the benchmark with the input length set to 2016, an error occurs:

Start benchmark:
iteration 0 :

Traceback (most recent call last):
  File "/root/xfastertransformer/benchmark/chatglm-6b/../benchmark.py", line 128, in <module>
    latency_90 = remained_token_times[int(args.iteration * 0.9) - 1] * 1000 / (output_token_nums - 1)
ZeroDivisionError: division by zero

Can't find the libmklml_intel.so

The following is the error
(cpu-xfaster) llm@SPR-ARC:~/xFasterTransformer/build$ python -d -m xfastertransformer.example.web_demo
Traceback (most recent call last):
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/xfastertransformer/__init__.py", line 4, in <module>
    torch.classes.load_library(os.path.dirname(os.path.abspath(__file__)) + "/libxfastertransformer_pt.so")
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/torch/_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmklml_intel.so: cannot open shared object file: No such file or directory

The following is from the Intel oneAPI 23.03 installation; there is no file named libmklml_intel.so:
(cpu-xfaster) llm@SPR-ARC:/opt/intel/oneapi$ find . -name '*_intel.so'
./mkl/2023.2.0/lib/ia32/libmkl_intel.so

xft + QWEN14B + fp16 got unexpected outputs compared with bf16_fp16.

The same input got very different outputs with different datatype. And the bf16_fp16's output is aligned with torch.

BF16_FP16 datatype:

python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --do_sample False --output_len 512 --dtype bf16_fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.
大车行走速度在正常情况下,速度的设定分为10%、20%、50%和80%,具体数值取决于实际情况和需求。

FP16 datatype:

python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --do_sample False --output_len 512 --dtype fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.
大,

Illegal instruction (core dumped)

After installing version 1.2.0 with pip install xfastertransformer and converting Qwen-14B-Chat, I ran the demo program and got the results below. How should I handle this error, which occurs in the generate step? (See the error and lscpu screenshots in the original issue.)

Use mpirun to run benchmark.py get error

the error detail:
run cmd:
mpirun -n 1 numactl -N 0 -m 0 python3 benchmark.py --token_path /data/baichuan2-13b --model_path /data/baichuan2-13b-xft/ --prompt_path ./prompt.json --model_name baichuan2-13b --dtype bf16 --token_in 1024 --token_out 512 --beam_width 1 --batch_size 1 --iteration 10 --warmup 1 --padding True

error
Failed to load xft_comm_helper library from path error code: libxft_comm_helper.so: cannot open shared object file: No such file or directory

xft + sample output result look bad

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-cpu`.
MODEL_PATH="/data/jane/models/Baichuan2-13B-Chat-cpu/"
TOKEN_PATH="/data/jane/models/Baichuan2-13B-Chat/"

#INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
prompt ="""已知信息:
了 label smoothing 和 mixup 微调之后的模型做了权重上的线性加权。实验结果如
表 3.2 所示。结果表明,BANG 算法有效的提高了 WiSE-FT 算法的效果。特别的,
BANG(LS+Mixup)在五个OOD数据集上比现有的最优算法WiSE-FT高出1.9%。
表3.2 在ImageNet上微调ViT-B/16的效果
Methods ModelAveraging IN IN-V2 IN-R IN-A IN-S ObjectNet AvgOOD

ZIN 与现有的几种方法进行了比较:ERM、IRM[58]、EIIL[71]、HRM[70]和 LfF[81]。
对于IRM,本文提供了ground-truth环境划分,并将其性能作为一个上界。LfF试
图通过从错误指定的浅层神经网络样本中直接采用 boosting 来学习一个鲁棒的模
型。而且LfF仅适用于分类任务。
5.4.1 房价预测任务
本实验考虑了来自Kaggle的真实房屋销售价格回归数据集。目标变量是房价,
每个样本包含17维度的特征,如房子的建成年份、卧室数量等。数据集根据构建

BANG(Mixup+LS) Yes 81.6 73.1 79.7 58.2 54.8 58.9 64.9
3.5 小结
本节研究了为什么集成算法具有优越的 OOD 性能。对 WiSE-FT 的实证分析,
加上理论见解,表明虚假特征的多样化改善了模型的泛化性能。进一步的,笔者通
过缓解微调模型的过度自信问题改进了WiSE-FT。
20 
根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,
请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 
问题是:langchain中stuff作用是什么?,答案:"""

from typing import Tuple, List
import torch
def build_inputs_baichuan(tokenizer, query: List[str], padding, history: List[Tuple[str, str]] = []):
    inputs = tokenizer(query, return_tensors="pt", padding=padding).input_ids
    print(inputs, inputs.shape)
    suffix = torch.tensor([[196]])
    prefix = torch.tensor([[195]])
    inputs = torch.cat((prefix.expand((inputs.shape[0], 1)), inputs, suffix.expand(inputs.shape[0], 1)), dim=1)
    return inputs

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
input_ids = build_inputs_baichuan(tokenizer, prompt, padding=True)
#input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH)

model.config(max_length=1024)
model.input(input_ids)
import time
start = time.time()
output = ""
while not model.is_done():
   next_tokens = model.forward()
   res = tokenizer.decode(next_tokens[0])
   output += res
   print(res)
print(output)
generated_ids = model.finalize()

output:

根据已知信息,无法回答该问题; 没有提供关于" Lang chain" 或 " Stuff" 的相关信息; 需要更具体或更详细的信息;

The output contains many extra spaces before and after the text.
xfastertransformer 1.3.1

mpirun -n 1 numactl -N 0 -m 0 python test_baichuan.py

SHM reduceAdd performance issue on HBM with 2 sockets

Llama-2-7b BF16 145in 198out
batch size=1:

reduce-type first token second token
1s 117 41
2s-SHM 350 42
2s-ONECCL 135 25

batch size=38

reduce-type first token second token
1s 3500 88.9
2s-SHM 1926 426
2s-ONECCL 4912 63

core per numa calculation error

cores_per_numa=$(( $sockets_num * $cores_per_socket / $numa_nodes ))
The line of code above cannot produce a floating-point result.
For the 9470, where cores per socket = 52, the calculation goes wrong:
2 * 52 / 16 = 6.5
