
inferllm's Introduction

MegEngine

MegEngine is a fast, scalable, and user-friendly deep learning framework with 3 key features.

  • Unified framework for both training and inference
    • Quantization, dynamic shape/image pre-processing, and even derivation are all supported with a single model.
    • After training, put everything into your model and run inference on any platform with speed and precision. Check here for a quick guide.
  • The lowest hardware requirements
    • The memory usage of the GPU can be reduced to one-third of the original memory usage when DTR algorithm is enabled.
    • Inference models with the lowest memory usage by leveraging our Pushdown memory planner.
  • Inference efficiently on all platforms
    • Run inference with high speed and precision on x86, Arm, CUDA, and ROCm.
    • Supports Linux, Windows, iOS, Android, TEE, etc.
    • Optimize performance and memory usage by leveraging our advanced features.

Installation

NOTE: MegEngine now supports Python installation on Linux-64bit/Windows-64bit/MacOS(CPU-Only)-10.14+/Android 7+(CPU-Only) platforms with Python from 3.6 to 3.9. On Windows 10 you can either install the Linux distribution through Windows Subsystem for Linux (WSL) or install the Windows distribution directly. Many other platforms are supported for inference.

Binaries

To install the pre-built binaries via pip wheels:

python3 -m pip install --upgrade pip
python3 -m pip install megengine -f https://megengine.org.cn/whl/mge.html

Building from Source

How to Contribute

We strive to build an open and friendly community. We aim to power humanity with AI.

How to Contact Us

Resources

License

MegEngine is licensed under the Apache License, Version 2.0

Citation

If you use MegEngine in your publication, please cite it by using the following BibTeX entry.

@Misc{MegEngine,
  institution = {megvii},
  title =  {MegEngine: A fast, scalable and easy-to-use deep learning framework},
  howpublished = {\url{https://github.com/MegEngine/MegEngine}},
  year = {2020}
}

Copyright (c) 2014-2021 Megvii Inc. All rights reserved.

inferllm's People

Contributors

chenqy4933, ianvzs, megvii-mge, rejoicesyc, supercb, tpoisonooo, tusharamd, xhebox, xingchensong, xyc0123456789, zchrissirhcz


inferllm's Issues

Can this be changed to GPU-assisted computation?

bzy@server:~/InferLLM/build$ make
-- current platform: x86
-- Build with GPU.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/bzy/InferLLM/build
[ 2%] Building NVCC (Device) object CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o
/usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h(41): error: identifier "__builtin_ia32_serialize" is undefined

/home/bzy/InferLLM/src/kern/naive/quantize.h(45): warning #177-D: variable "bs" was declared but never referenced

1 error detected in the compilation of "/home/bzy/InferLLM/src/kern/gpu/kernel_gpu.cu".
CMake Error at InferLLMGPU_generated_kernel_gpu.cu.o.cmake:280 (message):
Error generating file
/home/bzy/InferLLM/build/CMakeFiles/InferLLMGPU.dir/src/kern/gpu/./InferLLMGPU_generated_kernel_gpu.cu.o

make[2]: *** [CMakeFiles/InferLLMGPU.dir/build.make:522:
CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:94:CMakeFiles/InferLLMGPU.dir/all] Error 2
make: *** [Makefile:91:all] Error 2

The following problem occurs when running llama2-13b

Conversion and quantization complete successfully on a fine-tuned llama2 model, but running the model raises the following error.

# ./llama -m ../llama2-13b-sft-filterd-v17/llama2-13b-sft-filterd-v17-inferllm-fp32.bin -g GPU --version 2
main: seed = 1709878763
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)
root@goedge_master:/InferLLM/build# ./llama -m llama2-13b-sft-filterd-v17-q4.bin -g GPU --version 2
main: seed = 1709878793
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)

Support input prompts like llama.cpp

Input Prompts

The main program provides several ways to interact with the LLaMA models using input prompts (a minimal parsing sketch follows the list):

  • --prompt PROMPT: Provide a prompt directly as a command-line option.
  • --file FNAME: Provide a file containing a prompt or multiple prompts.
  • --interactive-first: Run the program in interactive mode and wait for input right away. (More on this below.)
  • --random-prompt: Start with a randomized prompt.
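
A sketch of how those flags could be parsed in InferLLM's CLI, assuming a small hypothetical prompt_params struct (the flag names come from the llama.cpp options above; nothing here is part of the current InferLLM code):

#include <string>

struct prompt_params {
    std::string prompt;              // --prompt PROMPT
    std::string prompt_file;         // --file FNAME
    bool interactive_first = false;  // --interactive-first
    bool random_prompt = false;      // --random-prompt
};

// Walk argv once and fill in the prompt-related options; unknown flags are
// left for the existing argument parser to handle.
void parse_prompt_args(int argc, char** argv, prompt_params& params) {
    for (int i = 1; i < argc; ++i) {
        std::string arg = argv[i];
        if (arg == "--prompt" && i + 1 < argc) {
            params.prompt = argv[++i];
        } else if (arg == "--file" && i + 1 < argc) {
            params.prompt_file = argv[++i];
        } else if (arg == "--interactive-first") {
            params.interactive_first = true;
        } else if (arg == "--random-prompt") {
            params.random_prompt = true;
        }
    }
}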

isnan compile error

In file included from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:3:0,
                 from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:11:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h: In function ‘void inferllm::opt::dequantize_row_q4_0(const void*, float*, int)’:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: error: ‘isnan’ was not declared in this scope
             assert(!isnan(y[i * QK40 + l + 0]));
                                              ^
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: note: suggested alternative:
In file included from /usr/include/c++/5/random:38:0,
                 from /home/xxx/ClionProjects/InferLLM/src/utils.h:8,
                 from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:3:
/usr/include/c++/5/cmath:641:5: note:   ‘std::isnan’
     isnan(_Tp __x)
     ^
CMakeFiles/InferLLMShared.dir/build.make:257: recipe for target 'CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o' failed
make[2]: *** [CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o] Error 1
CMakeFiles/Makefile2:90: recipe for target 'CMakeFiles/InferLLMShared.dir/all' failed
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
Makefile:90: recipe for target 'all' failed
make: *** [all] Error 2
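
This looks like GCC 5 declaring isnan only in the std namespace. A likely fix, assuming quantize.h is free to include <cmath>, is to qualify the call:

// src/kern/optimized/x86/quantize.h (sketch of the fix)
#include <cmath>   // GCC 5 puts isnan in namespace std only

// inside dequantize_row_q4_0():
assert(!std::isnan(y[i * QK40 + l + 0]));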

make error

Compiling on CentOS 7.8 with cmake version 3.23.1 and GNU Make 3.82, make fails with the following error:
[root@node02 build]# make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
In file included from /root/InferLLM-main/src/utils.h:13:0,
from /root/InferLLM-main/src/kern/kernel.h:6,
from /root/InferLLM-main/src/core/device.h:6,
from /root/InferLLM-main/src/core/device.cpp:2:
/root/InferLLM-main/src/file.h: In destructor ‘inferllm::InputFile::~InputFile()’:
/root/InferLLM-main/src/file.h:42:39: error: ‘munmap’ was not declared in this scope
munmap(m_mmap_addr, m_size);
^
make[2]: *** [CMakeFiles/InferLLM.dir/src/core/device.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2
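
munmap is declared in the POSIX header <sys/mman.h>, which this toolchain apparently does not pull in transitively. A likely fix is to include it explicitly in src/file.h:

// src/file.h (sketch of the fix)
#include <sys/mman.h>   // declares mmap() and munmap()
#include <unistd.h>     // close(), if the file descriptor is closed here too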

O3 optimization is slower on SG2042

It seems that O3 optimization is too aggressive on most RISC-V systems, where the pipeline and caches do not perform as well as on x86. Disabling O3 makes chatglm2 run at 3.6 t/s compared to 3 t/s.

Is the latest llama.cpp format not supported?

I used alpaca7b_plus quantized with llama.cpp, and it does not appear to be supported.

llama_model_load_internal: format     = ggjt v3 (latest)
main: seed = 1686927440
Assert ' 0 ' failed at file : /mnt/e/pyCode/InferLLM/src/graph/llama.cpp
line 37 : virtual void inferllm::LlamaGraph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: unsupported model type.Aborted

Build fails on Windows

D:/Programming/InferLLM/src/file.h: In destructor 'inferllm::InputFile::~InputFile()':
D:/Programming/InferLLM/src/file.h:36:13: error: 'munmap' was not declared in this scope
36 | munmap(m_mmap_addr, m_size);
| ^~~~~~
ninja: build stopped: subcommand failed.
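
mmap/munmap are POSIX calls with no Windows equivalent, so the memory-mapped path in file.h needs a platform guard. A hedged sketch of that guard (the plain-buffer fallback and the m_buffer member are illustrative, not the project's actual Windows support):

#ifdef _WIN32
#include <cstdlib>       // no POSIX mmap on Windows; fall back to a heap buffer
#else
#include <sys/mman.h>    // mmap()/munmap() on Linux, macOS, Android
#endif

InputFile::~InputFile() {
#ifdef _WIN32
    free(m_buffer);                // hypothetical buffer filled with fread() at load time
#else
    munmap(m_mmap_addr, m_size);   // unmap the memory-mapped model file
#endif
}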

unsupported relocation 37 on musl libc

It appeared after #46. The compiler sometimes emits IRELATIVE relocations (relocation type 37), causing musl libc to fail to load the program.

The root cause seems to be that compiling the optimized kernels and linking them separately introduces some strange changes. Here I mean linking these two together:

src/kern/naive.o
src/kern/optimized/xxx/kernel.o

I'll try to make them header-only somehow and see if that works.

I got the error on centos 7: failed to tokenize string!

I got the error like this on centos 7:
failed to tokenize string!

[root@VM-0-15-centos build]# ./llama -m chinese-alpaca-7b-q4.bin -t 2
main: seed = 1700961097
model is new , version = 1
load: n_vocab = 49954
load: n_ctx = 2048
load: n_embd = 4096
load: n_mult = 256
load: n_head = 32
load: n_layer = 32
load: n_rot = 128
load: model ftype = 2
total weight length = 4304332800
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

== Running in chat mode. ==

  • Press Ctrl+C to interject at any time.
  • If you want to submit another line, end your input in ''.
    failed to tokenize string!
    Killed

Originally posted by @Mignet in #28 (comment)

The answers feel somewhat garbled; running chinese-alpaca-7b-q4 inference on a MacBook Pro

hello
Hello!
[end txt]
你好啊
You're welcome, how can I help you? #00 Instruction:

请帮我解答一个问题,我需要您的帮助。

[end txt]

我想写一个悬疑小说
That sounds like a great idea! What kind of story do you want to write about and what is your writing process? #00 Input:

I'm interested in exploring the theme of identity. I have an outline for my novel, but I need help refining it before starting on the actual drafting phase.
[end txt]

你会说中文吗
是的,我可以使用普通话进行交流。您需要什么帮助? #00 Input:

你好啊!我想问一下关于**文化的问题,比如汉字、传统节日等等。#00 Output:
[end txt]

帮我讲讲毛丹青的故事
Mao Danqing was a famous Chinese painter and calligrapher from the Ming Dynasty. He is best known for his landscape paintings, which featured brushwork of great delicacy and refinement. #00 Input:

你好啊,我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:
[end txt]

Every answer ends with: "你好啊,我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:"

make error

make Error log:

/tmp/ccLXLGtH.s: Assembler messages:
/tmp/ccLXLGtH.s:21346: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21351: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21381: Error: symbol `_ZZN8inferllm3opt19dequantize_row_q4_0EPKvPfiE19__PRETTY_FUNCTION__' is already defined
make[2]: *** [CMakeFiles/InferLLM.dir/src/kern/optimized/x86/kernel.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2

Hope an OpenAI-compatible API can be provided

The existing system only offers a CLI, which is inconvenient for applications. Wrapping it as an API would let all kinds of applications call it.
The chatglm project has such an API.
Hope this can be added. Thanks.

Thread wakening may be a bottleneck on systems with many cores

Result: boost from 3.6~8 token/s to 4.2~4.3 token/s on SG2042.

Analysis: I added Tracy to trace execution in detail and observed that worker wakening is problematic: some workers pick up their task only after other workers have already completed theirs.

That means the execution time is sometimes twice as long as expected. I guess it is caused by thread::yield(), which switches the busy-waiting loop out.

I am thinking about making a busier, higher-performance polling loop, while still being able to switch to an idle state when waiting for user input (see the sketch below).
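
A minimal sketch of that idea, assuming one task flag per worker and a pool-wide active flag (illustrative only, not the current thread_pool.cpp):

#include <atomic>
#include <chrono>
#include <thread>

// Spin for a bounded number of iterations before yielding, so a worker picks
// up its task with minimal latency while the pool is active, and sleep when
// the pool is idle (e.g. waiting for user input) to avoid burning CPU.
inline void wait_for_task(std::atomic<bool>& has_task, std::atomic<bool>& pool_active) {
    int spins = 0;
    while (!has_task.load(std::memory_order_acquire)) {
        if (!pool_active.load(std::memory_order_relaxed)) {
            std::this_thread::sleep_for(std::chrono::milliseconds(1));  // idle state
            spins = 0;
        } else if (++spins >= 4096) {
            std::this_thread::yield();  // be polite occasionally during long waits
            spins = 0;
        }
        // otherwise: pure busy-wait, the cheapest wake-up on a many-core system
    }
}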

Is this understanding of the hyperparameters correct?

  • I asked the Claude chatbot about how these parameters affect the generated answers and how much they cost in resources; sharing it here in case anyone is interested.
  • Q: What do the parameters in the code below mean? In particular, how does tuning n_predict, repeat_last_n, n_ctx, top_k, top_p, temp, and repeat_penalty affect the chatbot and the machine's resource consumption?
struct app_params {
    int32_t seed = -1;  // RNG seed
    int32_t n_threads =
            std::min(4, (int32_t)std::thread::hardware_concurrency());
    int32_t n_predict = 128;     // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
    int32_t n_ctx = 2048;        // context size

    // sampling parameters
    int32_t top_k = 40;
    float top_p = 0.95f;
    float temp = 0.10f;
    float repeat_penalty = 1.30f;

    std::string model = "ggml-alpaca-7b-q4.bin";  // model path

    bool use_color = true;  // use color to distinguish generations and inputs
    bool use_mmap = false;  // use mmap to load model
    std::string dtype = "float32";  // configure the compute dtype
    std::string mtype = "llama";  // the model type name, llama
};
  • A:
This struct defines some key parameters of the chatbot:
n_predict: the number of new tokens generated per prediction. The larger this value, the longer each reply, but the more compute is consumed.
repeat_last_n: the number of most recent tokens to penalize for repetition. Larger values make the bot less likely to repeat its last few words, improving diversity, but add computation.
n_ctx: the context size, i.e. how much conversation history the bot considers. Larger values give replies a longer context to condition on, so they may be more appropriate, but require more compute.
top_k: the number of candidate tokens considered when sampling. Larger values give more choices and more varied replies, but consume more compute.
top_p: the cumulative probability mass considered. Larger values make low-frequency words more likely to be sampled, so replies may be more novel, but again cost more compute.
temp: the temperature. The higher it is, the more likely low-probability words are sampled, giving more novel replies at a higher compute cost.
repeat_penalty: the repetition penalty. The higher it is, the lower the probability of repeated words and the more varied the replies, but more compute is needed.
In short, these parameters control the diversity and coherence of the replies and the computational cost:
- Increasing n_predict, n_ctx, top_k, top_p, and temp raises diversity and novelty, but also the computational cost.
- Increasing repeat_last_n and repeat_penalty improves coherence and reduces repetition, but also needs more compute.
- Comparatively, n_predict, repeat_last_n, and n_ctx have the biggest impact on resources; top_k, top_p, and temp matter less.
So tuning these parameters is a trade-off between diversity, coherence, and computational cost. Understanding what each parameter does lets us pick a balanced set of hyperparameters for the hardware at hand and get higher-quality replies.
  • Q: How do these parameters affect CPU usage and memory usage on the machine, and what ranges can they reasonably take?
  • A:
These parameters have a large impact on CPU and memory usage:
- n_predict: the more tokens predicted per call, the more computation and token embeddings are generated, so CPU and memory usage rise. A reasonable range is 50-200.
- n_ctx: the larger the context, the more conversation history must be encoded and attended to, so CPU and memory usage rise. A reasonable range is 512-4096.
- top_k: the more candidate tokens, the more work top-k sampling does, so CPU usage rises. A reasonable range is 20-100.
- repeat_last_n: the more recent tokens are tracked and penalized, the more memory is consumed. A reasonable range is 32-128.
- The remaining parameters (top_p, temp, repeat_penalty) have a comparatively small impact.
So among these hyperparameters:
- n_predict, n_ctx, and top_k mainly affect CPU usage; set too large, inference becomes extremely slow.
- n_ctx and repeat_last_n mainly affect memory consumption; set too large, they may cause OOM.
- top_k and repeat_last_n also affect GPU memory to some extent, which matters when running on a GPU.
Adjusting these parameters controls the model's resource usage to some degree, but the smaller they are, the lower the reply quality.
So the key is to choose a hyperparameter configuration that performs acceptably in the target environment: if the hardware is capable, somewhat larger ranges can be used; if the hardware is limited, shrink the ranges to a smaller configuration that still gives acceptable quality.
In summary, knowing how each parameter affects resource usage, and choosing ranges that work well in the current environment, is what matters most here.
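
As a concrete example, on a memory-constrained machine the ranges above would suggest something like the following (values are illustrative, based on the app_params struct quoted earlier):

app_params params;
params.n_ctx = 512;          // smaller context: less memory
params.n_predict = 64;       // shorter replies: less CPU per turn
params.top_k = 20;           // fewer sampling candidates
params.repeat_last_n = 32;   // track fewer tokens for the repetition penalty
// top_p, temp and repeat_penalty are left at their defaults, since their
// impact on resource usage is comparatively small.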

macOS Big Sur 11.7.4 linking error, undefined symbols

$ make
[ 40%] Built target InferLLM
[ 43%] Linking CXX shared library libInferLLMShared.dylib
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
"/Library/Developer/CommandLineTools/usr/bin/ld" -demangle -lto_library /Library/Developer/CommandLineTools/usr/lib/libLTO.dylib -dynamic -dylib -arch x86_64 -dylib_install_name @rpath/libInferLLMShared.dylib -platform_version macos 11.7.0 12.1 -syslibroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.1.sdk -o libInferLLMShared.dylib -headerpad_max_install_names CMakeFiles/InferLLMShared.dir/src/core/device.cpp.o CMakeFiles/InferLLMShared.dir/src/core/graph.cpp.o CMakeFiles/InferLLMShared.dir/src/core/kvstotage.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/core/op.cpp.o CMakeFiles/InferLLMShared.dir/src/core/tensor.cpp.o CMakeFiles/InferLLMShared.dir/src/core/thread_pool.cpp.o CMakeFiles/InferLLMShared.dir/src/file.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM2.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/ggml_llama.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/graph_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/llama_like.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/naive/naive.cpp.o CMakeFiles/InferLLMShared.dir/src/utils.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/optimized/x86/kernel.cpp.o -lc++ -lSystem /Library/Developer/CommandLineTools/usr/lib/clang/13.0.0/lib/darwin/libclang_rt.osx.a
Undefined symbols for architecture x86_64:
"_mm_max_ps(float vector[4], float vector[4])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_add_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_div_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
"_mm256_max_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_sub_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_hadd_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_set1_ps(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_loadu_ps(float const*)", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
...
"_mm256_blendv_ps(float vector[8], float vector[8], float vector[8])", referenced from:
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_storeu_ps(float*, float vector[8])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
...
"_mm256_cvtepi8_epi16(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"_mm256_cvtepi16_epi32(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"exp(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) in kernel.cpp.o
"isnan(float)", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) in kernel.cpp.o
"inferllm::opt::exp256_ps(float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"float const& std::__1::max(float const&, float const&)", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) in kernel.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libInferLLMShared.dylib] Error 1
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
make: *** [all] Error 2
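
The undefined symbols are all AVX/AVX2 intrinsics, which suggests the (.avx2) clones in kernel.cpp are being compiled without AVX2 enabled, so Apple clang treats _mm256_* as ordinary undeclared functions. A hedged sketch of one way to fix it: either pass -mavx2 -mfma when compiling src/kern/optimized/x86/kernel.cpp, or mark each AVX2 function with a target attribute, roughly like this (illustrative, not the project's actual code):

#include <immintrin.h>

// Enable AVX2/FMA for this function only, so the rest of the file can stay
// portable; every function that uses _mm256_* intrinsics needs this (or the
// whole translation unit must be built with -mavx2 -mfma).
__attribute__((target("avx2,fma")))
void elemwise_vector_add_avx2(int n, const float* a, const float* b, float* out) {
    int i = 0;
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        _mm256_storeu_ps(out + i, _mm256_add_ps(va, vb));
    }
    for (; i < n; ++i) out[i] = a[i] + b[i];  // scalar tail
}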

ChatGLM2 output is abnormal

I tried downloading the officially provided q4 model and also quantizing the model myself, but after loading, the model's output is abnormal.
[two screenshots of the abnormal output]

Compile error: 'void inferllm::BaiChuanGraph::constuct_llm()' marked 'override', but does not override

InferLLM$ cd build
abc:/media/disk0/chatbot/InferLLM/build$ cmake ..
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- current platform: x86_64
-- Configuring done
-- Generating done
-- Build files have been written to: /media/disk0/chatbot/InferLLM/build
abc:/media/disk0/chatbot/InferLLM/build$ make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
[ 4%] Building CXX object CMakeFiles/InferLLM.dir/src/core/graph.cpp.o
[ 7%] Building CXX object CMakeFiles/InferLLM.dir/src/core/kvstotage.cpp.o
[ 9%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model.cpp.o
[ 11%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model_imp.cpp.o
[ 14%] Building CXX object CMakeFiles/InferLLM.dir/src/core/op.cpp.o
[ 16%] Building CXX object CMakeFiles/InferLLM.dir/src/core/tensor.cpp.o
[ 19%] Building CXX object CMakeFiles/InferLLM.dir/src/core/thread_pool.cpp.o
[ 21%] Building CXX object CMakeFiles/InferLLM.dir/src/file.cpp.o
[ 23%] Building CXX object CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o
In file included from /media/disk0/chatbot/InferLLM/src/graph/baichuan.cpp:1:
/media/disk0/chatbot/InferLLM/src/graph/baichuan.h:13:10: error: 'void inferllm::BaiChuanGraph::constuct_llm()' marked 'override', but does not override
13 | void constuct_llm() override;
| ^~~~~~~~~~~~
make[2]: *** [CMakeFiles/InferLLM.dir/build.make:202: CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:93: CMakeFiles/InferLLM.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

==== git log

git log
commit 11638dc (HEAD -> main, origin/main, origin/HEAD)
Merge: 2f68d78 846fa15
Author: Kewin [email protected]
Date: Tue Jun 27 10:42:36 2023 +0800

Merge pull request #37 from IanVzs/main

Fixed an undefined variable in a optimized_x86.h file

commit 2f68d78
Merge: efc1c18 a39a512
Author: Kewin [email protected]
Date: Tue Jun 27 10:41:04 2023 +0800

Merge pull request #39 from rejoicesyc/dev

fix: fix typo

Compile error: -mfma must be added for the build to succeed, but the CPU instruction set does not support FMA, so the program fails at runtime.

Compiling normally produces this error:

In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:111,
                 from /home/xxx/InferLLM/src/kern/optimized/x86/optimized_x86.h:4,
                 from /home/xxx/InferLLM/src/kern/optimized/kernel_opt.cpp:10:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h: In function ‘float inferllm::opt::vec_vec_dot_q40_with_q80(int, const void*, const void*)’:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h:63:1: error: inlining failed in call to ‘always_inline’ ‘__m256 _mm256_fmadd_ps(__m256, __m256, __m256)’: target specific option mismatch
   63 | _mm256_fmadd_ps (__m256 __A, __m256 __B, __m256 __C)
      | ^~~~~~~~~~~~~~~

Adding the -mfma compile option manually makes the build succeed, but running the program then fails with an illegal hardware instruction error.

CPU info:

# lscpu
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           58
Model name:                      Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Stepping:                        9
CPU MHz:                         2645.558
CPU max MHz:                     3600.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5188.23
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Vulnerable: No microcode
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

I don't know how to solve this FMA instruction problem.
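
The lscpu flags above show avx but no fma (the i7-3720QM is Ivy Bridge), so any code built with -mfma will always crash there. One hedged way out, not what InferLLM currently does: compile the FMA kernel in its own translation unit with -mfma, keep an AVX-only fallback, and choose between them at runtime:

// Sketch of runtime dispatch between an FMA kernel and an AVX-only fallback.
// dot_fma() would live in a file compiled with -mfma, dot_avx() in one
// compiled with -mavx only; both names are illustrative.
float dot_fma(int n, const float* a, const float* b);
float dot_avx(int n, const float* a, const float* b);

float vec_dot(int n, const float* a, const float* b) {
    if (__builtin_cpu_supports("fma")) {   // GCC/Clang builtin, checks CPUID
        return dot_fma(n, a, b);
    }
    return dot_avx(n, a, b);               // safe on AVX-only CPUs like this one
}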

Porting question

If I want to port this project to an NPU, which function interfaces do I need to implement?
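
Judging from the kernel layering visible elsewhere on this page (src/kern/naive, src/kern/optimized/x86, src/kern/gpu), a new backend mainly needs device memory management plus the per-operator kernels. The outline below is a hypothetical summary of the kinds of entry points named in the build logs above, not the project's exact interface:

#include <cstddef>

// Hypothetical outline of what an NPU backend would have to provide.
struct NPUKernelBackend {
    // device memory management
    void* alloc(size_t size);
    void free(void* ptr);
    void host_to_device(void* dst, const void* src, size_t size);
    void device_to_host(void* dst, const void* src, size_t size);

    // compute kernels of the kind referenced in the logs on this page
    float vec_vec_dot_q40_with_q80(int n, const void* x, const void* y);
    void dequantize_row_q4_0(const void* src, float* dst, int k);
    void elemwise_vector_add(int n, const float* a, const float* b, float* out);
    void elemwise_vector_silu(int n, const float* src, float* dst);
    float reduce_max(int n, const float* src);
    float reduce_square_sum(int n, const float* src);
    // ...plus the matmul, softmax, and rope kernels used by the graph layer
};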
