
xFasterTransformer

English | 简体中文

xFasterTransformer is an exceptionally optimized solution for large language models (LLM) on the X86 platform, which is similar to FasterTransformer on the GPU platform. xFasterTransformer is able to operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.

Models overview

Large Language Models (LLMs) are developing very fast and are increasingly used in many AI scenarios. xFasterTransformer is an optimized solution for LLM inference using the mainstream and popular LLM models on Xeon. xFasterTransformer fully leverages the hardware capabilities of Xeon platforms to achieve high performance and high scalability of LLM inference, both on a single socket and across multiple sockets/multiple nodes.

xFasterTransformer provides a series of C++ and Python APIs so that end users can integrate xFasterTransformer into their own solutions or services directly. A variety of example code is provided to demonstrate usage, benchmark code and scripts are included to measure performance, and web demos for popular LLM models are also provided.

Model support matrix

The following models are supported, through both the PyTorch and C++ APIs and in distributed mode:

  • ChatGLM
  • ChatGLM2
  • ChatGLM3
  • GLM4
  • Llama
  • Llama2
  • Llama3
  • Baichuan
  • Baichuan2
  • QWen
  • QWen2
  • SecLLM (YaRN-Llama)
  • Opt
  • Deepseek-coder
  • gemma
  • gemma-1.1
  • codegemma

DataType support list

  • FP16
  • BF16
  • INT8
  • W8A8
  • INT4
  • NF4
  • BF16_FP16
  • BF16_INT8
  • BF16_W8A8
  • BF16_INT4
  • BF16_NF4
  • W8A8_INT8
  • W8A8_INT4
  • W8A8_NF4
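
These data types correspond to the lowercase dtype strings accepted by the APIs (for example dtype="bf16" or dtype="bf16_fp16"). A minimal sketch of selecting a hybrid data type through the Python API, with a hypothetical model path:

import xfastertransformer

# Hypothetical converted-model directory; the dtype string selects one of the
# data types listed above, here the hybrid BF16_INT8 configuration.
model = xfastertransformer.AutoModel.from_pretrained("/data/llama-2-7b-xft", dtype="bf16_int8")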

Documents

xFasterTransformer Documents and Wiki provide the following resources:

  • An introduction to xFasterTransformer.
  • Comprehensive API references for both high-level and low-level interfaces in C++ and PyTorch.
  • Practical API usage examples for xFasterTransformer in both C++ and PyTorch.

Installation

From PyPI

pip install xfastertransformer

Using Docker

docker pull intel/xfastertransformer:latest

Run the Docker container with the following command (assuming the model files are in the /data/ directory):

docker run -it \
    --name xfastertransformer \
    --privileged \
    --shm-size=16g \
    -v /data/:/data/ \
    -e "http_proxy=$http_proxy" \
    -e "https_proxy=$https_proxy" \
    intel/xfastertransformer:latest

Notice!!!: Please enlarge --shm-size if a bus error occurs while running in multi-rank mode. Docker limits the shared memory size to 64MB by default, and our implementation uses shared memory extensively to achieve better performance.

Built from source

Prepare Environment

Manually
  • PyTorch v2.3 (required when using the PyTorch API; not needed when using the C++ API.)

    pip install torch --index-url https://download.pytorch.org/whl/cpu
  • For GPU, xFT requires ABI=1, i.e. torch==2.3.0+cpu.cxx11.abi from the torch whl list, because DPC++ requires ABI=1.

Install dependent libraries

Please install the libnuma package:

  • CentOS: yum install libnuma-devel
  • Ubuntu: apt-get install libnuma-dev
How to build

  • Using CMake
    # Build xFasterTransformer
    git clone https://github.com/intel/xFasterTransformer.git xFasterTransformer
    cd xFasterTransformer
    git checkout <latest-tag>
    # Please make sure torch is installed when running the Python example
    mkdir build && cd build
    cmake ..
    make -j
  • Using python setup.py
    # Build xFasterTransformer library and C++ example.
    python setup.py build
    
    # Install xFasterTransformer into pip environment.
    # Notice: Run `python setup.py build` before installation!
    python setup.py install

Models Preparation

xFasterTransformer uses a model format that differs from Hugging Face's, but it is compatible with FasterTransformer's format.

  1. First, download the model in Hugging Face format.

  2. After that, convert the model into xFasterTransformer format using the model convert module in xfastertransformer. If the output directory is not provided, the converted model will be placed in ${HF_DATASET_DIR}-xft.

    python -c 'import xfastertransformer as xft; xft.LlamaConvert().convert("${HF_DATASET_DIR}","${OUTPUT_DIR}")'
    

    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.

    Supported model convert list:

    • LlamaConvert
    • YiConvert
    • GemmaConvert
    • ChatGLMConvert
    • ChatGLM2Convert
    • ChatGLM4Convert
    • OPTConvert
    • BaichuanConvert
    • Baichuan2Convert
    • QwenConvert
    • Qwen2Convert
    • DeepseekConvert
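
The other converters follow the same pattern. For example, a Qwen2 checkpoint could be converted like this (the paths here are hypothetical):

import xfastertransformer as xft

# Hypothetical input/output directories; Qwen2Convert is one of the converter
# classes listed above and uses the same convert(input_dir, output_dir) call.
xft.Qwen2Convert().convert("/data/Qwen2-7B-Instruct", "/data/Qwen2-7B-Instruct-xft")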

API usage

For more details, please see API document and examples.

Python API(PyTorch)

First, please install the dependencies.

  • Python dependencies
    pip install -r requirements.txt
    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.
  • oneCCL (for multi-rank mode)
    Install oneCCL and set up the environment. Please refer to Prepare Environment.

xFasterTransformer's Python API is similar to transformers and also supports the transformers streamer for streaming output. In the example, we use transformers to encode the input prompt into token ids.

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-xft`.
MODEL_PATH="/data/chatglm-6b-xft"
TOKEN_PATH="/data/chatglm-6b-hf"

INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)

input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)
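
The generated token ids can be decoded back into text with the same tokenizer. A small follow-up sketch (the streamer above already prints the output incrementally; this is just the equivalent one-shot decode):

# Decode the generated token ids back into text.
output_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(output_text)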

C++ API

SentencePiece can be used to tokenize and detokenize text.

#include <vector>
#include <iostream>
#include "xfastertransformer.h"
// ChatGLM token ids for prompt "Once upon a time, there existed a little girl who liked to have adventures."
std::vector<int> input(
        {3393, 955, 104, 163, 6, 173, 9166, 104, 486, 2511, 172, 7599, 103, 127, 17163, 7, 130001, 130004});

// Assume converted model dir is `/data/chatglm-6b-xft`.
xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

model.config(/*max length*/ 100, /*num beams*/ 1);
model.input(/*input token ids*/ input, /*batch size*/ 1);

while (!model.isDone()) {
    std::vector<int> nextIds = model.generate();
}

std::vector<int> result = model.finalize();
for (auto id : result) {
    std::cout << id << " ";
}
std::cout << std::endl;

How to run

We recommend preloading libiomp5.so for better performance.

  • [Recommended] Run export $(python -c 'import xfastertransformer as xft; print(xft.get_env())') if the xfastertransformer Python wheel package is installed.
  • If building from source, the libiomp5.so file will be in the 3rdparty/mkl/lib directory after xFasterTransformer is built successfully.

Single rank

xFasterTransformer automatically checks the MPI environment, or you can set the SINGLE_INSTANCE=1 environment variable to forcibly deactivate MPI.
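
If everything is driven from Python, the same switch can also be set programmatically before the model is created; a minimal sketch, assuming the variable is read at model construction time and using a hypothetical model path:

import os
os.environ["SINGLE_INSTANCE"] = "1"  # assumed to be read when the model is constructed

import xfastertransformer
# Hypothetical converted-model directory.
model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")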

Multi ranks

Command line

Use MPI to run in multi-rank mode; please install oneCCL first.

  • oneCCL Installation

    • If you have built xFasterTransformer from source, oneCCL was installed into 3rdparty during compilation.
      source ./3rdparty/oneccl/build/_install/env/setvars.sh
      
    • [Recommended] Use provided scripts to build it from source code.
      cd 3rdparty
      sh prepare_oneccl.sh
      source ./oneccl/build/_install/env/setvars.sh
    • Install oneCCL by installing the Intel® oneAPI Base Toolkit (Notice: versions 2023.x and below are recommended), and source the environment with:
      source /opt/intel/oneapi/setvars.sh
      
  • Here is an example of running locally.

    # or export LD_PRELOAD=libiomp5.so manually
    export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
    OMP_NUM_THREADS=48 mpirun \
      -n 1 numactl -N 0  -m 0 ${RUN_WORKLOAD} : \
      -n 1 numactl -N 1  -m 1 ${RUN_WORKLOAD} 

Code

For more details, please refer to examples.

Python

model.rank returns the process's rank; model.rank == 0 is the master.
For slaves, after loading the model, the only thing that needs to be done is model.generate(). The input and generation configuration will be synced automatically.

model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")

# Slave
while True:
    model.generate()
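
Putting both roles into one script, a combined sketch might look like the following (paths are hypothetical; rank 0 drives generation while the other ranks only loop on generate()):

import xfastertransformer
from transformers import AutoTokenizer

model = xfastertransformer.AutoModel.from_pretrained("/data/chatglm-6b-xft", dtype="bf16")

if model.rank == 0:
    # Master: encode the prompt and generate; inputs and generation config are synced to the slaves.
    tokenizer = AutoTokenizer.from_pretrained("/data/chatglm-6b-hf", use_fast=False, trust_remote_code=True)
    input_ids = tokenizer("Once upon a time,", return_tensors="pt").input_ids
    generated_ids = model.generate(input_ids, max_length=200)
    print(tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0])
else:
    # Slave: keep calling generate(); it receives the synced inputs from the master.
    while True:
        model.generate()
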
C++

model.getRank() returns the process's rank; model.getRank() == 0 is the master.
For slaves, any value can be passed to model.config() and model.input() since the master's values will be synced.

xft::AutoModel model("/data/chatglm-6b-xft", xft::DataType::bf16);

// Slave
while (1) {
    model.config();
    std::vector<int> input_ids;
    model.input(/*input token ids*/ input_ids, /*batch size*/ 1);

    while (!model.isDone()) {
        model.generate();
    }
}

Web Demo

A web demo based on Gradio is provided in the repo. It currently supports the ChatGLM, ChatGLM2 and Llama2 models.

  • Prepare the model.
  • Install the dependencies
    pip install -r examples/web_demo/requirements.txt
    PS: Due to the potential compatibility issues between the model file and the transformers version, please select the appropriate transformers version.
  • Run the script corresponding to the model. After the web server starts, open the output URL in the browser to use the demo. Please specify the model and tokenizer directory paths and the data type. The transformers tokenizer is used to encode and decode text, so ${TOKEN_PATH} means the Hugging Face model directory. This demo also supports multi-rank mode.
# Recommended: preload `libiomp5.so` for better performance.
# Or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python examples/web_demo/ChatGLM.py \
                      --dtype=bf16 \
                      --token_path=${TOKEN_PATH} \
                      --model_path=${MODEL_PATH}

Serving

vLLM

A fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM's features. Refer to this link for more details.

Install

pip install vllm-xft

Notice: Please do not install both vllm-xft and vllm simultaneously in the environment. Although the package names are different, they will actually overwrite each other.

OpenAI Compatible Server

Notice: Preloading libiomp5.so is required!

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

python -m vllm.entrypoints.openai.api_server \
        --model ${MODEL_PATH} \
        --tokenizer ${TOKEN_PATH} \
        --dtype bf16 \
        --kv-cache-dtype fp16 \
        --served-model-name xft \
        --port 8000 \
        --trust-remote-code
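
Once the server is running, it can be queried with any OpenAI-compatible client. A minimal sketch using the requests package (the model name and port match the command above):

import requests

# Query the OpenAI-compatible completions endpoint exposed by the server above.
response = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "xft",              # matches --served-model-name
        "prompt": "Once upon a time,",
        "max_tokens": 64,
    },
)
print(response.json()["choices"][0]["text"])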

For multi-rank mode, please use python -m vllm.entrypoints.slave as the slave and keep the slaves' parameters aligned with the master's.

# Preload libiomp5.so by following cmd or LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

OMP_NUM_THREADS=48 mpirun \
        -n 1 numactl --all -C 0-47 -m 0 \
          python -m vllm.entrypoints.openai.api_server \
            --model ${MODEL_PATH} \
            --tokenizer ${TOKEN_PATH} \
            --dtype bf16 \
            --kv-cache-dtype fp16 \
            --served-model-name xft \
            --port 8000 \
            --trust-remote-code \
        : -n 1 numactl --all -C 48-95 -m 1 \
          python -m vllm.entrypoints.slave \
            --dtype bf16 \
            --model ${MODEL_PATH} \
            --kv-cache-dtype fp16

FastChat

xFasterTransformer is an official inference backend of FastChat. Please refer to xFasterTransformer in FastChat and FastChat's serving for more details.

MLServer

An MLServer serving example is provided, which supports REST and gRPC interfaces and an adaptive batching feature to group inference requests together on the fly.

Benchmark

Benchmark scripts are provided to quickly measure model inference performance.

  • Prepare the model.
  • Install the dependencies, including oneCCL and python dependencies.
  • Enter the benchmark folder and run run_benchmark.sh. Please refer to Benchmark README for more information.

Notes!!!: System and CPU configurations may vary. For the best performance, try adjusting OMP_NUM_THREADS, the data type, and the number of memory nodes (check the memory nodes using numactl -H) according to your test environment.

Support

Accepted Papers

If xFT is useful for your research, please cite:

@article{he2024distributed,
  title={Distributed Inference Performance Optimization for LLMs on CPUs},
  author={He, Pujiang and Zhou, Shan and Li, Changqing and Huang, Wenhuan and Yu, Weifei and Wang, Duyi and Meng, Chen and Gui, Sheng},
  journal={arXiv preprint arXiv:2407.00029},
  year={2024}
}

and

@inproceedings{he2024inference,
  title={Inference Performance Optimization for Large Language Models on CPUs},
  author={He, Pujiang and Zhou, Shan and Huang, Wenhuan and Li, Changqing and Wang, Duyi and Guo, Bin and Meng, Chen and Gui, Sheng and Yu, Weifei and Xie, Yi},
  booktitle={ICML 2024 Workshop on Foundation Models in the Wild}
}

Q&A

  • Q: Can xFasterTransformer run on an Intel® Core™ CPU?
    A: No. xFasterTransformer requires support for the AMX and AVX512 instruction sets, which are not available on Intel® Core™ CPUs.

  • Q: Can xFasterTransformer run on the Windows system?
    A: There is no native support for Windows, and all compatibility tests are only conducted on Linux, so Linux is recommended.

  • Q: Why does the program freeze or exit with errors when running in multi-rank mode after installing the latest version of oneCCL through oneAPI?
    A: Please try downgrading oneAPI to version 2023.x or below, or use the provided script to install oneCCL from source code.

  • Q: Why does running the program using two CPU sockets result in much lower performance compared to running on a single CPU socket?
    A: Running in this way causes the program to engage in many unnecessary cross-socket communications, significantly impacting performance. If there is a need for cross-socket deployment, consider running in a multi-rank mode with one rank on each socket.

  • Q: The performance is normal when running in a single rank, but why is the performance very slow and the CPU utilization very low when using MPI to run multiple ranks?
    A: Programs launched through MPI see OMP_NUM_THREADS=1 and cannot correctly retrieve the appropriate value from the environment. You need to set OMP_NUM_THREADS manually according to the actual situation.

  • Q: Why do I still encounter errors when converting already supported models?
    A: Try downgrading transformers to an appropriate version, such as the version specified in requirements.txt. This is because different versions of transformers may change the names of certain variables.


xfastertransformer's Issues

Streaming output issue

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-cpu`.
MODEL_PATH="/data/jane/models/chatglm2-6b-cpu/"
TOKEN_PATH="/data/jane/models/chatglm2-6b"

#INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
INPUT_PROMPT = "问:上海在哪?答: "
tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
print(streamer)
input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH, dtype="bf16")
generated_ids = model.generate(input_ids, max_length=200, streamer=streamer)

After changing the value of INPUT_PROMPT, the output is no longer streamed; everything is printed at once.

Does this CPU meet xFasterTransformer's minimum requirements?

CPU op-mode(s):      32-bit, 64-bit
Byte Order:          Little Endian
CPU(s):              48
On-line CPU(s) list: 0-47
Thread(s) per core:  1
Core(s) per socket:  1
Socket(s):           48
NUMA node(s):        1
Vendor ID:           GenuineIntel
BIOS Vendor ID:      Bochs
CPU family:          6
Model:               61
Model name:          Intel Core Processor (Broadwell, IBRS)
Stepping:            2
CPU MHz:             2095.078
BogoMIPS:            4190.15
Hypervisor vendor:   KVM
Virtualization type: full
L1d cache:           32K
L1i cache:           32K
L2 cache:            4096K
NUMA node0 CPU(s):   0-47
Flags:               fpu de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 syscall nx rdtscp lm constant_tsc rep_good nopl cpuid tsc_known_freq pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx hypervisor lahf_lm 3dnowprefetch invpcid_single pti ssbd ibrs ibpb fsgsbase bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap xsaveopt md_clear

This is the lscpu output; please help check whether it meets the minimum requirements to run.

Error when converting a model

Hello, when converting a model with xFasterTransformer, I got the following error:
python ./tools/llama_convert.py -i /workspace/llama-2-7b-chat-hf/ -o /workspace/llama-2-7b-chat-cpu/

/usr/local/lib/python3.8/dist-packages/transformer_engine_extensions.cpython-38-x86_64-linux-gnu.so: undefined symbol: _ZN3c106detail23torchInternalAssertFailEPKcS2_jS2_RKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEE
How can this be resolved?
I suspect it is a package version issue. I saw that Intel provides a Docker image, but I cannot pull it. Where is it available?

AttributeError: module xfastertransformer has no attribute AutoModel

(demo) [root@iZ2ze5jp679eomnr2xu3s0Z web_demo]# python ChatGLM2.py
[INFO] xfastertransformer is not installed in pip, using source code.
Traceback (most recent call last):
  File "ChatGLM2.py", line 68, in <module>
    demo = ChatGLM2Demo(args.token_path, args.model_path, dtype=args.dtype)
  File "/mnt/xFasterTransformer/examples/web_demo/demo_utils.py", line 61, in __init__
    self.model = xfastertransformer.AutoModel.from_pretrained(model_path, dtype=dtype)
  File "/mnt/xFasterTransformer/examples/web_demo/../../src/xfastertransformer/__init__.py", line 59, in __getattr__                                                                                                        
    raise AttributeError("module {} has no attribute {}".format(self.__name__, name))
AttributeError: module xfastertransformer has no attribute AutoModel

libnuma: Warning: node argument 8 is out of range

In the Docker container, numactl does not recognize the NUMA node for High Bandwidth Memory (HBM), even though 'numactl -H' reveals the corresponding node. It shows the error:

libnuma: Warning: node argument 8 is out of range

HW/SW version:
kernel version: 5.15.0-spr.bkc.pc.16.1.23.x86_64
docker version: 24.0.6
host-numactl version: 2.0.16
container-numactl version: 2.0.14-3ubuntu2
docker run option: docker run -e http_proxy -e https_proxy -e no_proxy --privileged --name wsf-78378ddc4d4ed --rm --detach xftbench-inference-lite:latest

ChatGLM2-6B crashes while running 4 ranks with the w8a8 data type

With xFT 1.2.0, running 4 ranks on HBM with 1 socket. It works when using 1 rank.
bash run_benchmark.sh -m chatglm2-6b -d w8a8 -s 1 -bs 1 -in 1024 -out 32 -i 5

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 7633 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 7634 RUNNING AT spr28
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 7635 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 7636 RUNNING AT spr28
= KILLED BY SIGNAL: 11 (Segmentation fault)

KVCache buffer overflows when token number exceeds the setting of config.ini max_pos_seq_len

If the total sequence length (the sum of the input and output sequence lengths) exceeds the value of max_pos_seq_len in config.ini, the application crashes silently.

The following is the command and output for a llama2-13b test. The default max_pos_seq_len is 2048, which is less than the sum of the input and output token lengths.


LD_PRELOAD=libiomp5.so OMP_NUM_THREADS=28 mpirun
-n 1 numactl -C0-27 -l ../..//build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C28-55 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C56-83 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 :
-n 1 numactl -C84-111 -l ../../build/example -m /root/xygao/LLM/llama/xFasterTransformer/models/llama2-13b/ -t /home/huaqiang/models/llama-2-13b-chat-hf/tokenizer.model -d bf16_fp16 -l 2048 --output_len 32 --loop 15 | tee 4mpi-result.txt


[INFO] First token time: 3038.85 ms

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 1431596 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 1431597 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 11 (Segmentation fault)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 1431598 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 1431599 RUNNING AT spr-s6q-06
= KILLED BY SIGNAL: 9 (Killed)

ModuleNotFoundError: No module named 'xfastertransformer.tools'

I'm using commit f205d37 to compile xfastertransformer.
When "import xfastertransformer", it shows error:

2: Traceback (most recent call last):
2:   File "/home/workspace/xFasterTransformer/benchmark/benchmark.py", line 89, in <module>
2:     import xfastertransformer
2:   File "/root/anaconda3/envs/llm/lib/python3.9/site-packages/xfastertransformer-1.0.0-py3.9-linux-x86_64.egg/xfastertransformer/__init__.py", line 22, in <module>
2:     from .tools import LlamaConvert
2: ModuleNotFoundError: No module named 'xfastertransformer.tools'

a bug in the doc

root@f62374f19c02:~/xfastertransformer# python ./tools/opt_convert.py -i ./data/opt-1.3b-hf -o ./data/opt-1.3b-cpu
File "./tools/opt_convert.py", line 284
parser.add_argument("--weight_data_type", "-d" type=str, default="fp16", choices=["fp32", "fp16"])
^
SyntaxError: invalid syntax

cmake error when running on Aliyun

Instance spec: ecs.c8i.12xlarge, 48 vCPUs, 96 GiB memory.
Image: Ubuntu 22.04 64-bit (ubuntu_22_04_x64_20G_alibase_20231019.vhd).


CMake Error at /root/xfastertransformer/build/mklml-prefix/src/mklml-stamp/download-mklml.cmake:170 (message):
Each download failed!

error: downloading 'https://gitee.com/qccz123456/oneDNN/releases/download/v0.21/mklml_lnx_2019.0.5.20190502.tgz' failed
      status_code: 28
      status_string: "Timeout was reached"
      log:
      --- LOG BEGIN ---
        Trying 180.76.198.77:443...

Connected to gitee.com (180.76.198.77) port 443 (#0)

ALPN: offers h2

ALPN: offers http/1.1

[CONN-0-0][CF-SSL] TLSv1.0 (OUT), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.3 (OUT), TLS handshake, Client hello (1):

[512 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.3 (IN), TLS handshake, Server hello (2):

[108 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[5 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS handshake, Certificate (11):

[3295 bytes data]

[CONN-0-0][CF-SSL] TLSv1.2 (IN), TLS header, Certificate Status (22):

[Model] QWen14B-Chat got wrong output when input tokens is too long

id.txt
attached is the input tokens

the output of Torch version is "在正常情况下,框架大车陆侧和海侧的运行速度可以设置为10%、20%、50%和80%,并且可以通过编码器检测的速度控制器进行速度闭环控制。此外,所有的起升机构都有高度参考点,可以根据需要进行速度和位置同步控制。"

and the output of xft is unreadable.

How do I receive the streaming output?

How do I receive the returned streamer value? I would like to capture it in a variable, similar to stream_chat():

            for response, history in self.model.stream_chat(self.tokenizer, query, history,
                                         max_length=max_length, top_p=top_p, temperature=temperature):

Is there any plan to open-source xdnn?

I want to express my appreciation for the incredible work you have done with xft. However, I noticed that the xdnn lib seems not fully open-sourced, is there any plan for this?

convert Yi34B model fail

Hi,

I try do the Yi34B model conversion with tools/llama_convert.py, but met error...

python tools/llama_convert.py  -i /data/tmp/Yi-34B -o /data/tmp/Yi-34B_xfaster/

=============== Argument ===============
saved_dir: /data/tmp/Yi-34B_xfaster/
in_file: /data/tmp/Yi-34B
processes: 8
weight_data_type: fp32
========================================
Loading checkpoint shards:   0%|                                 | 0/7 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "tools/llama_convert.py", line 225, in <module>
    split_and_convert(args)
  File "tools/llama_convert.py", line 91, in split_and_convert
    model = LlamaForCausalLM.from_pretrained(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 2881, in from_pretrained
    ) = cls._load_pretrained_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 3228, in _load_pretrained_model
    new_error_msgs, offload_index, state_dict_index = _load_state_dict_into_meta_model(
  File "/usr/bin/python3.8/lib/python3.8/site-packages/transformers/modeling_utils.py", line 720, in _load_state_dict_into_meta_model
    set_module_tensor_to_device(model, param_name, param_device, **set_module_kwargs)
  File "/usr/bin/python3.8/lib/python3.8/site-packages/accelerate/utils/modeling.py", line 285, in set_module_tensor_to_device
    raise ValueError(
ValueError: Trying to set a tensor of shape torch.Size([1024, 7168]) in "weight" (which has shape torch.Size([7168, 7168])), this look incorrect.

[bug] library of Intel level-zero not found

Issue script:

/benchmark/run_benchmark.sh

error message:

2023:12:15-20:39:37:(61144) |CCL_WARN| could not open the library: libze_loader.so, error: libze_loader.so: cannot open shared object file: No such file or directory
......
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 61144 RUNNING AT worker64
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

CPU

SPR9468

related version info:

main branch of xFasterTransformer v1.1.0
CentOS8, kernel 5.17.3-1

other comments:

Such issue is not found in history version;
It took place in both official docker and bare-metal;
Why do we need Level-Zero in xFasterTransformer?

[bug] Segmentation fault occurs at large batch sizes

Segmentation fault occurs at large batch sizes

  1. Command Line:
    ./run_benchmark.sh -m llama-7b -d bf16 -s 1 -bs 100 -in 512 -out 256 -i 1

    Functions with errors:
    onednn_amx_sgemm_f32bf16f32_compute_biasadd

    Matmul matrix shape:
    M = 51200, N = 12288, K= 4096, transA = 0,alpha=1.000000, lda=4096, beta=0.000000,ldc=12288

    oneDNN_verbose:
    onednn_verbose,info,oneDNN v3.2.0 (commit 04b180b9a58a78cf1a1cd2329671a5060c2be8de)
    onednn_verbose,info,cpu,runtime:OpenMP,nthr:48
    onednn_verbose,info,cpu,isa:Intel AVX-512 with float16, Intel DL Boost and bfloat16 support and Intel AMX with bfloat16 and 8-bit integer support
    onednn_verbose,info,gpu,runtime:none
    onednn_verbose,info,prim_template:operation,engine,primitive,implementation,prop_kind,memory_descriptors,attributes,auxiliary,problem_desc,exec_time

  2. Command Line:
    ./run_benchmark.sh -m llama-7b -d bf16 -s 1 -bs 100 -in 32 -out 32 -i 1

    Functions with errors:
    hpj::Matrix &input, hpj::Matrix &output, hpj::Matrix &residential, bool isMaster) {
    TimeLine t("DownProj")
    assert(input.Rows() == output.Rows()); (ASSERT FAILED input.Cols()=22016, downWeight.Rows()=11008;)

    Matmul matrix shape:
    M = 3200, N = 12288, K= 4096, transA = 0,alpha=1.000000, lda=4096, beta=0.000000,ldc=12288

    Verbose:
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute_biasadd,m3200n12288k4096,29.308059
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute_residential,m3200n4096k4096,12.953664
    xft_verbose,exec,cpu,api,onednn_amx_sgemm_f32bf16f32_compute,m3200n22016k4096,42.813326

xft will be blocked when MPI + QWEN14B + do_sample=true

Using the command below and running generate several times, xft gets blocked in oneCCL.

OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun -n 1 numactl --physcpubind 16-63 --localalloc python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --output_len 512 --dtype bf16_fp16 --do_sample true : -n 1 numactl --physcpubind 80-127 --localalloc python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --output_len 512 --dtype bf16_fp16 --do_sample true

cmake file MD5 bug

-- [download 100% complete]
-- verifying file...
file='/root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_v1.1.tar.gz'
-- MD5 hash of
/root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_v1.1.tar.gz
does not match expected value
expected: 'b55b5d58c92339aa088dcc6e1df6ede2'
actual: 'b49bf8808d66ea75cfba80a406c9a587'
-- Hash mismatch, removing...
CMake Error at /root/xfastertransformer/build/xdnn_lib-prefix/src/xdnn_lib-stamp/download-xdnn_lib.cmake:170 (message):
Each download failed!


xdnn.cmake

  • URL_HASH MD5=b55b5d58c92339aa088dcc6e1df6ede2
  • URL_HASH MD5=b49bf8808d66ea75cfba80a406c9a587
    After manual modification, it can be run.

QWEN14B will generate error output when multi queries with long input tokens.

The issue is as follows: with the same long input (3338 tokens) queried multiple times (4 times for this input; smaller inputs need more repetitions), xft generates undecodable or empty outputs.

$ OMP_NUM_THREADS=48 numactl -N 1 python ./demo.py -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu -t /mnt/data/LLM_Models/Qwen-14B-Chat -d bf16_fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
大车行走速度在正常情况下,速度的设定分别为10%、20%、50%和80%。
====================Performance====================
Execution time: 12.97 s
Latency: 432.46 ms/token
Througput: 2.31 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
大车行走速度在正常情况下,速度的设定分别为10%、20%、50%和80%。
====================Performance====================
Execution time: 16.47 s
Latency: 548.86 ms/token
Througput: 1.82 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338
根据文档内容,大车机构操作手柄共有4挡,其控制流程图如图3所示。在正常情况下,速度的设定分别为10%、20%、50%和80%
====================Performance====================
Execution time: 15.96 s
Latency: 332.47 ms/token
Througput: 3.01 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338

====================Performance====================
Execution time: 9.74 s
Latency: 4869.03 ms/token
Througput: 0.21 tokens/s

Please enter the prompt:

input_prompt len: 4526, input_ids len:3338

====================Performance====================
Execution time: 15.65 s
Latency: 7823.53 ms/token
Througput: 0.13 tokens/s

baichuan-7b run core dump

雨村问:“政公有个衔玉之子,赦公就没一个?”子兴说:“政公有了玉儿,他的妾又生了一个,还没听说是好是歹。赦公也有二子,次子名叫贾琏,今已二十多岁,娶的是政公王夫人的娘家侄女为妻,亲上加亲。这位琏爷捐了个副知府,也不喜读书
===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 0 PID 4463 RUNNING AT qqq-D50DNP1SBB
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================
(llm) root@qqq-D50DNP1SBB:~/llm# bash run_benchmark.sh -m baichuan2-7b -d bf16 -s 1 -bs 1 -in 4096 -out 32 -i 3

xfastertransformer==1.3.1

Running llama2 on two CPUs fails when dtype is set to int8 or bf16_int8

the script is in the attachment.
llama2-7b.zip

the error info is shown as below

  1. int8

memory node number: 16
HBM SNC4 mode
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
FP16 Performance
FP16 Performance
FP16 Performance
FP16 Performance
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
llama2-7b.sh: 17: Bad substitution
FP16 Performance
FP16 Performance
FP16 Performance
FP16 Performance
Segmentation fault (core dumped)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 21023 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 21024 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 21025 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 21026 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 5 PID 21028 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 21029 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 21030 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

  1. bf16_int8
    memory node number: 16
    HBM SNC4 mode
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    llama2-7b.sh: 17: Bad substitution
    FP16 Performance
    FP16 Performance
    FP16 Performance
    Segmentation fault (core dumped)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 0 PID 21300 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 1 PID 21301 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 2 PID 21302 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 3 PID 21303 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 4 PID 21304 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 6 PID 21306 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

===================================================================================
= BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
= RANK 7 PID 21307 RUNNING AT ubuntu-desktop
= KILLED BY SIGNAL: 9 (Killed)

illegal instruction issue

Hi xFasterTransformer Team,

We tried to run the xft example and encountered an "illegal instruction" error (see screenshot in the original issue).

We tried to figure out which instruction it failed on and found it is vcvtps2phx (see screenshot in the original issue).

It seems that AVX512-FP16 instruction set support is necessary for running xft. Is there any way that we can run xft without this instruction set? Thanks!

build xFasterTransformer from source failed

There is no error when installing PyTorch:

$ pip install torch --index-url https://download.pytorch.org/whl/cpu
Looking in indexes: https://download.pytorch.org/whl/cpu
Requirement already satisfied: torch in /usr/local/lib/python2.7/dist-packages (1.5.0+cpu)
Requirement already satisfied: numpy in /usr/local/lib/python2.7/dist-packages (from torch) (1.16.6)

But when building xFasterTransformer, it failed:

# in xFasterTransformer/build directory
$ cmake ..
-- The C compiler identification is GNU 8.3.0
-- The CXX compiler identification is GNU 8.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
>>> GCC version: 8.3.0
-- Found MPI_C: /root/xFasterTransformer/3rdparty/oneCCL/build/_install/lib/libmpi.so (found version "3.1")
-- Found MPI_CXX: /root/xFasterTransformer/3rdparty/oneCCL/build/_install/lib/libmpicxx.so (found version "3.1")
-- Found MPI: TRUE (found version "3.1")
-- oneCCL: MPI found
Building with static libraries.
-- PyTorch found. Compiling torch extension...
-- Found OpenMP_C: -fopenmp (found version "4.5")
-- Found OpenMP_CXX: -fopenmp (found version "4.5")
-- Found OpenMP: TRUE (found version "4.5")
Traceback (most recent call last):
  File "<string>", line 1, in <module>
AttributeError: 'module' object has no attribute 'cmake_prefix_path'
-- Configuring done (2.1s)
CMake Error: The following variables are used in this project, but they are set to NOTFOUND.
Please set them or make sure they are set and tested correctly in the CMake files:
TORCH_GLOBAL_DEPS_LIB-NOTFOUND;TORCH_CPU_LIB-NOTFOUND;TORCH_PYTHON_LIB-NOTFOUND;SHM_CPU_LIB-NOTFOUND;C10_CPU_LIB
    linked by target "xfastertransformer_pt" in directory /root/xFasterTransformer/src/pytorch

-- Generating done (0.0s)
CMake Generate step failed.  Build files cannot be regenerated correctly.

Any help would be greatly appreciated.

Does xFasterTransformer supports Falcon model?

Hello,
Planning to run a PEFT-tuned Falcon model on a CPU (Intel(R) Xeon(R) Platinum 8167M CPU @ 2.00GHz, dockerized Linux). However, the README doesn't mention Falcon; are only the listed models supported?

Thanks.

Qwen-14B-Chat conversion issue

python /root/xFasterTransformer/tools/qwen_convert.py -i /root/autodl-tmp/Qwen-14B-Chat -o /root/autodl-tmp/Qwen-xft/

=============== Argument ===============
saved_dir: /root/autodl-tmp/Qwen-xft/
in_file: /root/autodl-tmp/Qwen-14B-Chat
processes: 8
weight_data_type: fp16
========================================
The model is automatically converting to bf16 for faster inference. If you want to disable the automatic precision, please manually add bf16/fp16/fp32=True to "AutoModelForCausalLM.from_pretrained".
Try importing flash-attention for faster inference...
Warning: import flash_attn rotary fail, please install FlashAttention rotary to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/rotary
Warning: import flash_attn rms_norm fail, please install FlashAttention layer_norm to get higher efficiency https://github.com/Dao-AILab/flash-attention/tree/main/csrc/layer_norm
Warning: import flash_attn fail, please install FlashAttention to get higher efficiency https://github.com/Dao-AILab/flash-attention
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 15/15 [00:05<00:00,  2.83it/s]
Fail to save the config in config.ini. 'GenerationConfig' object is not subscriptable
Processing ...
Traceback (most recent call last):
  File "/root/xFasterTransformer/tools/qwen_convert.py", line 262, in <module>
    split_and_convert(args)
  File "/root/xFasterTransformer/tools/qwen_convert.py", line 206, in split_and_convert
    param.detach().cpu().numpy().astype(np_weight_data_type).transpose().tofile(os.path.join(saved_dir, "model.wte.bin"))
TypeError: Got unsupported ScalarType BFloat16

How can this issue be resolved?

AMX_int8 not really be used when dtype="int8"

I used 'chatglm-6b.sh' to test chatglm-6b performance with '--dtype int8' set.
But when I used perf to monitor the AMX PMU events, there were no AMX int8 events. When dtype is bf16, it works normally:

OS:Ubuntu22.04 kernel 6.5.0
CPU:intel SPR 6430 *2

Running the benchmark directly with chatglm-6b.sh: with the bf16 data type, the AMX instructions are invoked as expected (the perf AMX PMU-event counter increases), but with the int8 data type, the AMX instructions are not actually used.

OMP_NUM_THREADS=32 mpirun -n 1 numactl -N 0 -m 0 sh chatglm-6b.sh : -n 1 numactl -N 1 -m 1 sh chatglm-6b.sh

(See screenshots in the original issue.)

Benchmark error


SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"

echo "FP16 Performance "
python "${SCRIPT_DIR}"/../benchmark.py \
    --token_path /data/jane/models/chatglm2-6b/ \
    --model_path /data/jane/models/chatglm2-6b-cpu/ \
    --prompt_path "${SCRIPT_DIR}"/prompt_pool.json \
    --model_name "ChatGLM-6B" \
    --dtype fp16 \
    --token_in 2016     \
    --token_out 32 --beam_width 1 --iteration 1

When running the benchmark with the input length set to 2016, an error occurs:

Start benchmark:
iteration 0 :

Traceback (most recent call last):
  File "/root/xfastertransformer/benchmark/chatglm-6b/../benchmark.py", line 128, in <module>
    latency_90 = remained_token_times[int(args.iteration * 0.9) - 1] * 1000 / (output_token_nums - 1)
ZeroDivisionError: division by zero

Can't find the libmklml_intel.so

The following is the error
(cpu-xfaster) llm@SPR-ARC:~/xFasterTransformer/build$ python -d -m xfastertransformer.example.web_demo
Traceback (most recent call last):
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/runpy.py", line 187, in _run_module_as_main
    mod_name, mod_spec, code = _get_module_details(mod_name, _Error)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/runpy.py", line 110, in _get_module_details
    __import__(pkg_name)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/xfastertransformer/__init__.py", line 4, in <module>
    torch.classes.load_library(os.path.dirname(os.path.abspath(__file__)) + "/libxfastertransformer_pt.so")
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/torch/_classes.py", line 51, in load_library
    torch.ops.load_library(path)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/site-packages/torch/_ops.py", line 643, in load_library
    ctypes.CDLL(path)
  File "/home/llm/miniconda3/envs/cpu-xfaster/lib/python3.10/ctypes/__init__.py", line 374, in __init__
    self._handle = _dlopen(self._name, mode)
OSError: libmklml_intel.so: cannot open shared object file: No such file or directory

The following is from the Intel oneAPI 23.03 installation; there is no file named libmklml_intel.so:
(cpu-xfaster) llm@SPR-ARC:/opt/intel/oneapi$ find . -name '*_intel.so'
./mkl/2023.2.0/lib/ia32/libmkl_intel.so

xft + QWEN14B + fp16 got unexpected outputs compared with bf16_fp16.

The same input got very different outputs with different datatype. And the bf16_fp16's output is aligned with torch.

BF16_FP16 datatype:

python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --do_sample False --output_len 512 --dtype bf16_fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.
大车行走速度在正常情况下,速度的设定分为10%、20%、50%和80%,具体数值取决于实际情况和需求。

FP16 datatype:

python demo.py -t /mnt/data/LLM_Models/Qwen-14B-Chat/ -m /mnt/data/LLM_Models/Qwen-14B-Chat/cpu/ --do_sample False --output_len 512 --dtype fp16
[INFO] xfastertransformer is not installed in pip, using source code.
[INFO] SINGLE_INSTANCE MODE.
大,

Illegal instruction (core dumped)

After installing version 1.2.0 with pip install xfastertransformer and converting Qwen-14B-Chat, I ran the demo program and got the results below. How should I handle this error, which occurs in the generate step? (See the error and lscpu screenshots in the original issue.)

Use mpirun to run benchmark.py get error

the error detail:
run cmd:
mpirun -n 1 numactl -N 0 -m 0 python3 benchmark.py --token_path /data/baichuan2-13b --model_path /data/baichuan2-13b-xft/ --prompt_path ./prompt.json --model_name baichuan2-13b --dtype bf16 --token_in 1024 --token_out 512 --beam_width 1 --batch_size 1 --iteration 10 --warmup 1 --padding True

error
Failed to load xft_comm_helper library from path error code: libxft_comm_helper.so: cannot open shared object file: No such file or directory

xft + sample output result look bad

import xfastertransformer
from transformers import AutoTokenizer, TextStreamer
# Assume huggingface model dir is `/data/chatglm-6b-hf` and converted model dir is `/data/chatglm-6b-cpu`.
MODEL_PATH="/data/jane/models/Baichuan2-13B-Chat-cpu/"
TOKEN_PATH="/data/jane/models/Baichuan2-13B-Chat/"

#INPUT_PROMPT = "Once upon a time, there existed a little girl who liked to have adventures."
prompt ="""已知信息:
了 label smoothing 和 mixup 微调之后的模型做了权重上的线性加权。实验结果如
表 3.2 所示。结果表明,BANG 算法有效的提高了 WiSE-FT 算法的效果。特别的,
BANG(LS+Mixup)在五个OOD数据集上比现有的最优算法WiSE-FT高出1.9%。
表3.2 在ImageNet上微调ViT-B/16的效果
Methods ModelAveraging IN IN-V2 IN-R IN-A IN-S ObjectNet AvgOOD

ZIN 与现有的几种方法进行了比较:ERM、IRM[58]、EIIL[71]、HRM[70]和 LfF[81]。
对于IRM,本文提供了ground-truth环境划分,并将其性能作为一个上界。LfF试
图通过从错误指定的浅层神经网络样本中直接采用 boosting 来学习一个鲁棒的模
型。而且LfF仅适用于分类任务。
5.4.1 房价预测任务
本实验考虑了来自Kaggle的真实房屋销售价格回归数据集。目标变量是房价,
每个样本包含17维度的特征,如房子的建成年份、卧室数量等。数据集根据构建

BANG(Mixup+LS) Yes 81.6 73.1 79.7 58.2 54.8 58.9 64.9
3.5 小结
本节研究了为什么集成算法具有优越的 OOD 性能。对 WiSE-FT 的实证分析,
加上理论见解,表明虚假特征的多样化改善了模型的泛化性能。进一步的,笔者通
过缓解微调模型的过度自信问题改进了WiSE-FT。
20 
根据上述已知信息,简洁和专业的来回答用户的问题。如果无法从中得到答案,
请说 “根据已知信息无法回答该问题” 或 “没有提供足够的相关信息”,不允许在答案中添加编造成分,答案请使用中文。 
问题是:langchain中stuff作用是什么?,答案:"""

from typing import Tuple, List
import torch
def build_inputs_baichuan(tokenizer, query: List[str], padding, history: List[Tuple[str, str]] = []):
    inputs = tokenizer(query, return_tensors="pt", padding=padding).input_ids
    print(inputs, inputs.shape)
    suffix = torch.tensor([[196]])
    prefix = torch.tensor([[195]])
    inputs = torch.cat((prefix.expand((inputs.shape[0], 1)), inputs, suffix.expand(inputs.shape[0], 1)), dim=1)
    return inputs

tokenizer = AutoTokenizer.from_pretrained(TOKEN_PATH, use_fast=False, padding_side="left", trust_remote_code=True)
streamer = TextStreamer(tokenizer, skip_special_tokens=True, skip_prompt=False)
input_ids = build_inputs_baichuan(tokenizer, prompt, padding=True)
#input_ids = tokenizer(INPUT_PROMPT, return_tensors="pt", padding=False).input_ids
model = xfastertransformer.AutoModel.from_pretrained(MODEL_PATH)

model.config(max_length=1024)
model.input(input_ids)
import time
start = time.time()
output = ""
while not model.is_done():
   next_tokens = model.forward()
   res = tokenizer.decode(next_tokens[0])
   output += res
   print(res)
print(output)
generated_ids = model.finalize()

output:

根据已知信息,无法回答该问题; 没有提供关于" Lang chain" 或 " Stuff" 的相关信息; 需要更具体或更详细的信息;

The output contains many extra spaces before and after the text.
xfastertransformer 1.3.1

mpirun -n 1 numactl -N 0 -m 0 python test_baichuan.py

SHM reduceAdd performance issue on HBM with 2 sockets

Llama-2-7b BF16 145in 198out
batch size=1:

reduce-type first token second token
1s 117 41
2s-SHM 350 42
2s-ONECCL 135 25

batch size=38

reduce-type first token second token
1s 3500 88.9
2s-SHM 1926 426
2s-ONECCL 4912 63

core per numa calculation error

cores_per_numa=$(( $sockets_num * $cores_per_socket / $numa_nodes ))
The line of code above cannot produce a floating-point result.
For the 9470, where cores per socket = 52, the calculation goes wrong:
2 * 52 / 16 = 6.5
