megengine / inferllm

a lightweight LLM model inference framework

License: Apache License 2.0

CMake 1.14% C++ 87.96% Shell 0.51% Python 4.62% Cuda 5.77%

deeplearning inference llm mobile

inferllm's Issues

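chatglm2 GPU version: int4 and int8 quantized models give abnormal predictions

Compile error: the build only succeeds with -mfma, but the CPU does not support FMA, so the program fails at run time.

A normal build fails with: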

In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:111,
                 from /home/xxx/InferLLM/src/kern/optimized/x86/optimized_x86.h:4,
                 from /home/xxx/InferLLM/src/kern/optimized/kernel_opt.cpp:10:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h: In function ‘float inferllm::opt::vec_vec_dot_q40_with_q80(int, const void*, const void*)’:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h:63:1: error: inlining failed in call to ‘always_inline’ ‘__m256 _mm256_fmadd_ps(__m256, __m256, __m256)’: target specific option mismatch
   63 | _mm256_fmadd_ps (__m256 __A, __m256 __B, __m256 __C)
      | ^~~~~~~~~~~~~~~

Manually adding the -mfma compile option lets the build pass, but running the program then fails with an illegal hardware instruction error.

CPU info:

# lscpu 
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   36 bits physical, 48 bits virtual
CPU(s):                          8
On-line CPU(s) list:             0-7
Thread(s) per core:              2
Core(s) per socket:              4
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           58
Model name:                      Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Stepping:                        9
CPU MHz:                         2645.558
CPU max MHz:                     3600.0000
CPU min MHz:                     1200.0000
BogoMIPS:                        5188.23
Virtualization:                  VT-x
L1d cache:                       128 KiB
L1i cache:                       128 KiB
L2 cache:                        1 MiB
L3 cache:                        6 MiB
NUMA node0 CPU(s):               0-7
Vulnerability Itlb multihit:     KVM: Mitigation: VMX disabled
Vulnerability L1tf:              Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds:               Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown:          Mitigation; PTI
Vulnerability Mmio stale data:   Unknown: No mitigations
Vulnerability Retbleed:          Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:        Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:             Vulnerable: No microcode
Vulnerability Tsx async abort:   Not affected
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts

How can I resolve this FMA instruction problem?
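
The CPU flags in this report list avx but neither fma nor avx2, which matches the illegal-instruction crash once -mfma is forced. One possible direction, shown here only as a minimal sketch, is to guard the fused intrinsic with the compiler's `__FMA__` macro and fall back to a separate multiply and add. The helper name `mul_add` is hypothetical and is not taken from InferLLM; the sketch covers only the FMA intrinsic and says nothing about other kernels that may need AVX2.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical helper (not part of InferLLM): computes y[i] += a[i] * b[i].
// Uses _mm256_fmadd_ps only when the compiler targets FMA, a plain AVX
// multiply + add when only AVX is enabled, and scalar code otherwise.
void mul_add(const float* a, const float* b, float* y, std::size_t n) {
    assert(n == 0 || (a != nullptr && b != nullptr && y != nullptr));
    std::size_t i = 0;
#if defined(__AVX__)
    for (; i + 8 <= n; i += 8) {
        __m256 va = _mm256_loadu_ps(a + i);
        __m256 vb = _mm256_loadu_ps(b + i);
        __m256 vy = _mm256_loadu_ps(y + i);
#if defined(__FMA__)
        vy = _mm256_fmadd_ps(va, vb, vy);               // fused: va * vb + vy
#else
        vy = _mm256_add_ps(_mm256_mul_ps(va, vb), vy);  // AVX-only fallback
#endif
        _mm256_storeu_ps(y + i, vy);
    }
#endif
    // Scalar tail, and the whole loop when AVX is not enabled at compile time.
    for (; i < n; ++i) {
        y[i] += a[i] * b[i];
    }
}
```

With this shape the same source builds with -mavx alone on a CPU like the one above, and picks up the fused instruction only when -mfma is actually appropriate for the target.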

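Arm: support fp16 compute

Currently Arm only supports fp32 compute; fp16 compute would be faster on armv8.2 or later architectures.

Are decode & decode_iter safe to use from multiple threads?

If multiple threads call the model's decode and decode_iter methods at the same time, will there be problems?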

unsupported relocation 37 on musl libc

It appeared after #46. The compiler sometimes emits IRELATIVE relocations, which causes musl libc to fail to load the program.

The root cause seems to be that compiling the optimized kernels and linking them separately introduces some strange changes. By that I mean linking these two together:

src/kern/naive.o
src/kern/optimized/xxx/kernel.o

I'll try to make them header-only somehow and see if that works.

Support more models

Support more models, such as https://github.com/THUDM/ChatGLM-6B and https://github.com/openai/whisper

Building on CentOS 7.8 with cmake version 3.23.1 fails during make (GNU Make 3.82)
[root@node02 build]# make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
In file included from /root/InferLLM-main/src/utils.h:13:0,
from /root/InferLLM-main/src/kern/kernel.h:6,
from /root/InferLLM-main/src/core/device.h:6,
from /root/InferLLM-main/src/core/device.cpp:2:
/root/InferLLM-main/src/file.h: In destructor ‘inferllm::InputFile::~InputFile()’:
/root/InferLLM-main/src/file.h:42:39: error: ‘munmap’ was not declared in this scope
munmap(m_mmap_addr, m_size);
^
make[2]: *** [CMakeFiles/InferLLM.dir/src/core/device.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2

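Are there plans to optimize inference acceleration on GPU?

At present, after community LLMs are quantized with mainstream GPTQ, the kernel implementations of the quantized layers are basically a negative optimization. Are there plans to support accelerated inference of quantized models on GPU?

isnan error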

In file included from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:3:0,
                 from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:11:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h: In function ‘void inferllm::opt::dequantize_row_q4_0(const void*, float*, int)’:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: error: ‘isnan’ was not declared in this scope
             assert(!isnan(y[i * QK40 + l + 0]));
                                              ^
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: note: suggested alternative:
In file included from /usr/include/c++/5/random:38:0,
                 from /home/xxx/ClionProjects/InferLLM/src/utils.h:8,
                 from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:3:
/usr/include/c++/5/cmath:641:5: note:   ‘std::isnan’
     isnan(_Tp __x)
     ^
CMakeFiles/InferLLMShared.dir/build.make:257: recipe for target 'CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o' failed
make[2]: *** [CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o] Error 1
CMakeFiles/Makefile2:90: recipe for target 'CMakeFiles/InferLLMShared.dir/all' failed
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
Makefile:90: recipe for target 'all' failed
make: *** [all] Error 2
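
The compiler note in this log suggests `std::isnan` as the alternative. A minimal sketch of that qualified call, assuming the header is free to include `<cmath>`; `has_nan` is a hypothetical helper written for illustration and is not a function from InferLLM:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>

// Hypothetical helper (not from InferLLM): returns true if any of the n
// floats in y is NaN. It calls the namespace-qualified std::isnan from
// <cmath> instead of relying on an unqualified isnan being visible.
bool has_nan(const float* y, std::size_t n) {
    assert(n == 0 || y != nullptr);
    for (std::size_t i = 0; i < n; ++i) {
        if (std::isnan(y[i])) {
            return true;
        }
    }
    return false;
}
```

In the failing line, the equivalent change would be to write the assertion with `std::isnan` rather than `isnan`.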

The asserts directory should probably be renamed to assets

Otherwise it gives people a surprise

Also, xiomi9 in README.md can be changed to xiaomi9

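O3 optimization is slower on SG2042

It seems that O3 optimization is too aggressive on most RISC-V systems, where pipelining and caching do not perform as well as on x86. Disabling O3 makes chatglm2 run at 3.6 t/s compared to 3 t/s.

Can it run on a CPU with the RV64 instruction set?

I would like to run this on a RISC-V CPU. Is that feasible?

Is running on GPU not supported yet? I see that GPU support has been added in the code.

Garbled output on the Arm platform

[new feature] Are there plans to support Tongyi Qianwen?

As the title says: are there plans to support Tongyi Qianwen?

Can it run on a RISC-V CPU that does not support the V extension?

Vector computation uses CPU AVX instructions; can a version that does not use AVX be supported?

Looking forward to a reply, thanks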

Thread wake-up may be a bottleneck on systems with many cores

Result: a boost from 3.6~8 token/s to 4.2~4.3 token/s on SG2042.

Analysis: I added Tracy to trace execution in detail. I observed that worker wake-up is problematic: some workers picked up their task only after other workers had already completed.

That means the execution time is sometimes twice the expected time. I guess it is caused by thread::yield(), which switched out of the busy-waiting loop.

I am thinking about making a higher-performance (busier) poll, while still being able to switch to an idle state when waiting for user input.
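
One common way to get both behaviours described here is to spin for a bounded number of iterations and then block on a condition variable. The following is only a sketch of that idea under stated assumptions: `HybridSignal` and `spin_limit` are hypothetical names, and this is not how InferLLM's thread pool is written.

```cpp
#include <atomic>
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical sketch (not InferLLM's thread pool): a signal that a worker
// first polls in a short busy loop and then waits on by sleeping.
class HybridSignal {
public:
    explicit HybridSignal(int spin_limit) : m_spin_limit(spin_limit) {
        assert(spin_limit >= 0);
    }

    // Producer side: publish that work is ready and wake sleeping workers.
    void notify() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_ready.store(true, std::memory_order_release);
        }
        m_cv.notify_all();
    }

    // Worker side: returns true if the signal was seen while spinning,
    // false if the worker had to fall back to the blocking wait.
    bool wait() {
        for (int i = 0; i < m_spin_limit; ++i) {
            if (m_ready.load(std::memory_order_acquire)) {
                return true;
            }
        }
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_ready.load(std::memory_order_acquire); });
        return false;
    }

    // Re-arm the signal before the next round of work.
    void reset() { m_ready.store(false, std::memory_order_release); }

private:
    int m_spin_limit;
    std::atomic<bool> m_ready{false};
    std::mutex m_mutex;
    std::condition_variable m_cv;
};
```

A large `spin_limit` favours low wake-up latency during token generation; a small one lets workers go idle quickly, for example while the program waits for user input. The right value would have to be measured on the target machine.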

The answers seem somewhat confused; running chinese-alpaca-7b-q4 inference on a MacBook Pro

hello
Hello!
[end txt]
你好啊
You're welcome, how can I help you? #00 Instruction:

请帮我解答一个问题，我需要您的帮助。

[end txt]

我想写一个悬疑小说
That sounds like a great idea! What kind of story do you want to write about and what is your writing process? #00 Input:

I'm interested in exploring the theme of identity. I have an outline for my novel, but I need help refining it before starting on the actual drafting phase.
[end txt]

你会说中文吗
是的，我可以使用普通话进行交流。您需要什么帮助？ #00 Input:

你好啊！我想问一下关于**文化的问题，比如汉字、传统节日等等。#00 Output:
[end txt]

帮我讲讲毛丹青的故事
Mao Danqing was a famous Chinese painter and calligrapher from the Ming Dynasty. He is best known for his landscape paintings, which featured brushwork of great delicacy and refinement. #00 Input:

你好啊，我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:
[end txt]

Every reply ends with: “你好啊，我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:”

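What does the token in the main loop refer to? I could not tell what it represents from inside the function body either.

Are there plans to support chatglm3?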

make error

make Error log:

/tmp/ccLXLGtH.s: Assembler messages:
/tmp/ccLXLGtH.s:21346: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21351: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21381: Error: symbol `_ZZN8inferllm3opt19dequantize_row_q4_0EPKvPfiE19__PRETTY_FUNCTION__' is already defined
make[2]: *** [CMakeFiles/InferLLM.dir/src/kern/optimized/x86/kernel.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2

macOS Big Sur 11.7.4 linking error: undefined symbols

$ make
[ 40%] Built target InferLLM
[ 43%] Linking CXX shared library libInferLLMShared.dylib
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
"/Library/Developer/CommandLineTools/usr/bin/ld" -demangle -lto_library /Library/Developer/CommandLineTools/usr/lib/libLTO.dylib -dynamic -dylib -arch x86_64 -dylib_install_name @rpath/libInferLLMShared.dylib -platform_version macos 11.7.0 12.1 -syslibroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.1.sdk -o libInferLLMShared.dylib -headerpad_max_install_names CMakeFiles/InferLLMShared.dir/src/core/device.cpp.o CMakeFiles/InferLLMShared.dir/src/core/graph.cpp.o CMakeFiles/InferLLMShared.dir/src/core/kvstotage.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/core/op.cpp.o CMakeFiles/InferLLMShared.dir/src/core/tensor.cpp.o CMakeFiles/InferLLMShared.dir/src/core/thread_pool.cpp.o CMakeFiles/InferLLMShared.dir/src/file.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM2.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/ggml_llama.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/graph_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/llama_like.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/naive/naive.cpp.o CMakeFiles/InferLLMShared.dir/src/utils.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/optimized/x86/kernel.cpp.o -lc++ -lSystem /Library/Developer/CommandLineTools/usr/lib/clang/13.0.0/lib/darwin/libclang_rt.osx.a
Undefined symbols for architecture x86_64:
"_mm_max_ps(float vector[4], float vector[4])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_add_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_div_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
"_mm256_max_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_sub_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_hadd_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_set1_ps(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_loadu_ps(float const*)", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
...
"_mm256_blendv_ps(float vector[8], float vector[8], float vector[8])", referenced from:
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_storeu_ps(float*, float vector[8])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
...
"_mm256_cvtepi8_epi16(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"_mm256_cvtepi16_epi32(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"exp(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) in kernel.cpp.o
"isnan(float)", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) in kernel.cpp.o
"inferllm::opt::exp256_ps(float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"float const& std::__1::max(float const&, float const&)", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) in kernel.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libInferLLMShared.dylib] Error 1
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
make: *** [all] Error 2

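With more than one thread, the CPU cores are fully occupied

The yield at line 38 of thread_pool.cpp does not actually give up the thread, so the threads spin in an endless loop

Build fails on Windows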

D:/Programming/InferLLM/src/file.h: In destructor 'inferllm::InputFile::~InputFile()':
D:/Programming/InferLLM/src/file.h:36:13: error: 'munmap' was not declared in this scope
36 | munmap(m_mmap_addr, m_size);
| ^~~~~~
ninja: build stopped: subcommand failed.

Is the Tree of Thoughts approach implemented?

https://www.toutiao.com/article/7235886859996365371/?app=news_article_lite&timestamp=1685061579&use_new_style=1&req_id=202305260839387F9F98C2EB5A5B388799&group_id=7235886859996365371&share_token=c7a583fa-8c81-43b9-a190-240446fae58b&source=m_redirect&wid=1685066447084

An existing method proposes the Tree of Thoughts (ToT) pattern, which seems to give a big speedup

Support input prompt like llama.cpp

Input Prompts

The main program provides several ways to interact with the LLaMA models using input prompts:

--prompt PROMPT: Provide a prompt directly as a command-line option.
--file FNAME: Provide a file containing a prompt or multiple prompts.
--interactive-first: Run the program in interactive mode and wait for input right away. (More on this below.)
--random-prompt: Start with a randomized prompt.
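
The four options listed above could be parsed roughly as follows. This is a sketch under assumptions: `PromptOptions` and `parse_prompt_options` are hypothetical names, and neither InferLLM's nor llama.cpp's actual argument parser is reproduced here.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical container for the options quoted in this request.
struct PromptOptions {
    std::string prompt;              // --prompt PROMPT
    std::string file;                // --file FNAME
    bool interactive_first = false;  // --interactive-first
    bool random_prompt = false;      // --random-prompt
};

// Hypothetical parser: fills `out` from `args` and returns false on an
// unknown flag or when --prompt / --file is missing its value.
bool parse_prompt_options(const std::vector<std::string>& args, PromptOptions& out) {
    for (std::size_t i = 0; i < args.size(); ++i) {
        const std::string& arg = args[i];
        if (arg == "--prompt" || arg == "--file") {
            if (i + 1 >= args.size()) {
                return false;  // the flag needs a value
            }
            const std::string& value = args[i + 1];
            if (arg == "--prompt") {
                out.prompt = value;
            } else {
                out.file = value;
            }
            ++i;  // skip the consumed value
        } else if (arg == "--interactive-first") {
            out.interactive_first = true;
        } else if (arg == "--random-prompt") {
            out.random_prompt = true;
        } else {
            return false;  // unknown flag
        }
    }
    return true;
}
```

A real `main` would typically build the vector from `argv + 1` and print a usage message when the function returns false.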

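ChatGLM2 gives abnormal results

I tried downloading the officially provided q4 model, and also tried quantizing the model myself. In both cases the model output is abnormal after loading.

Please support RWKV for reference and comparison.

Here is an RWKV World model with Hugging Face support; cfg is supported as well.

https://github.com/xiaol/Huggingface-RWKV-World

If you need any help, you are welcome to join the Discord: https://discord.gg/pWH5MkvtNR

Could this be changed to GPU-assisted computation?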

bzy@server:~/InferLLM/build$ make
-- current platform: x86
-- Build with GPU.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/bzy/InferLLM/build
[ 2%] Building NVCC (Device) object CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o
/usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h(41): error: identifier "__builtin_ia32_serialize" is undefined

/home/bzy/InferLLM/src/kern/naive/quantize.h(45): warning #177-D: variable "bs" was declared but never referenced

1 error detected in the compilation of "/home/bzy/InferLLM/src/kern/gpu/kernel_gpu.cu".
CMake Error at InferLLMGPU_generated_kernel_gpu.cu.o.cmake:280 (message):
Error generating file
/home/bzy/InferLLM/build/CMakeFiles/InferLLMGPU.dir/src/kern/gpu/./InferLLMGPU_generated_kernel_gpu.cu.o

make[2]: *** [CMakeFiles/InferLLMGPU.dir/build.make:522: CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:94: CMakeFiles/InferLLMGPU.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

Is the latest llama.cpp format not supported?

I used alpaca7b_plus quantized with llama.cpp, and it looks like it is not supported

llama_model_load_internal: format     = ggjt v3 (latest)

main: seed = 1686927440
Assert ' 0 ' failed at file : /mnt/e/pyCode/InferLLM/src/graph/llama.cpp
line 37 : virtual void inferllm::LlamaGraph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: unsupported model type.Aborted

Compile error: ‘void inferllm::BaiChuanGraph::constuct_llm()’ marked ‘override’, but does not override

InferLLM$ cd build
abc:/media/disk0/chatbot/InferLLM/build$ cmake ..
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- current platform: x86_64
-- Configuring done
-- Generating done
-- Build files have been written to: /media/disk0/chatbot/InferLLM/build
abc:/media/disk0/chatbot/InferLLM/build$ make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
[ 4%] Building CXX object CMakeFiles/InferLLM.dir/src/core/graph.cpp.o
[ 7%] Building CXX object CMakeFiles/InferLLM.dir/src/core/kvstotage.cpp.o
[ 9%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model.cpp.o
[ 11%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model_imp.cpp.o
[ 14%] Building CXX object CMakeFiles/InferLLM.dir/src/core/op.cpp.o
[ 16%] Building CXX object CMakeFiles/InferLLM.dir/src/core/tensor.cpp.o
[ 19%] Building CXX object CMakeFiles/InferLLM.dir/src/core/thread_pool.cpp.o
[ 21%] Building CXX object CMakeFiles/InferLLM.dir/src/file.cpp.o
[ 23%] Building CXX object CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o
In file included from /media/disk0/chatbot/InferLLM/src/graph/baichuan.cpp:1:
/media/disk0/chatbot/InferLLM/src/graph/baichuan.h:13:10: error: ‘void inferllm::BaiChuanGraph::constuct_llm()’ marked ‘override’, but does not override
13 | void constuct_llm() override;
| ^~~~~~~~~~~~
make[2]: *** [CMakeFiles/InferLLM.dir/build.make:202: CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:93: CMakeFiles/InferLLM.dir/all] Error 2
make: *** [Makefile:91: all] Error 2

====git log

git log
commit 11638dc (HEAD -> main, origin/main, origin/HEAD)
Merge: 2f68d78 846fa15
Author: Kewin [email protected]
Date: Tue Jun 27 10:42:36 2023 +0800

Merge pull request #37 from IanVzs/main

Fixed an undefined variable in a optimized_x86.h file

commit 2f68d78
Merge: efc1c18 a39a512
Author: Kewin [email protected]
Date: Tue Jun 27 10:41:04 2023 +0800

Merge pull request #39 from rejoicesyc/dev

fix: fix typo

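How can this inference engine support new models produced with LoRA, P-Tuning and similar plugins trained on private corpora?

As the title says.

Is there a guide?

Is this faster than Python?

Thanks

The following error is reported when running on Linux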

failed to tokenize string!

Optimize Arm Kernel

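https://github.com/MegEngine/InferLLM/blob/main/src/kern/optimized/arm/optimized_arm.h#L45 These kernels need to be optimized.

Updates are too slow, please speed up

Hope it can be built as a .so file

I would like to try the performance on Android later

The following problem occurs when running llama2-13b

Conversion and quantization succeed on a fine-tuned llama2 model, but running the model reports an error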

# ./llama -m ../llama2-13b-sft-filterd-v17/llama2-13b-sft-filterd-v17-inferllm-fp32.bin -g GPU --version 2
main: seed = 1709878763
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)
root@goedge_master:/InferLLM/build# ./llama -m llama2-13b-sft-filterd-v17-q4.bin -g GPU --version 2
main: seed = 1709878793
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)

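Hope an OpenAI-compatible API can be provided

The current system only offers a CLI, which is inconvenient for applications. Wrapped as an API, it could be called by all kinds of applications.
The chatglm project has an API.
Hope this can be added. Thanks.

Model format under chatglm-6b is incorrect

At run time, the name of each op does not match the weight names produced by the chatglm convert script, so weight loading fails.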

python API support

Hi, is there any plan to support a Python API?

[feature] Can the baichuan-7b model use the baichuan-vicuna-chinese-7b model files?

As the title says: I saw that baichuan-vicuna-7b on HF was updated with a Chinese model. Can this model be used? Thanks.

I got the error on CentOS 7: failed to tokenize string!

I got this error on CentOS 7:
failed to tokenize string!

[root@VM-0-15-centos build]# ./llama -m chinese-alpaca-7b-q4.bin -t 2
main: seed = 1700961097
model is new , version = 1
load: n_vocab = 49954
load: n_ctx = 2048
load: n_embd = 4096
load: n_mult = 256
load: n_head = 32
load: n_layer = 32
load: n_rot = 128
load: model ftype = 2
total weight length = 4304332800
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000

== Running in chat mode. ==

Press Ctrl+C to interject at any time.
If you want to submit another line, end your input in '\'.
failed to tokenize string!
Killed

Originally posted by @Mignet in #28 (comment)

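I asked the Claude chatbot how these parameters affect the generated answers and the resource consumption; anyone interested can take a look.
Q: What do the parameters in the code below mean? Focus on how tuning n_predict, repeat_last_n, n_ctx, top_k, top_p, temp and repeat_penalty affects the chatbot and the machine's resource consumption.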

struct app_params {
    int32_t seed = -1;  // RNG seed
    int32_t n_threads =
            std::min(4, (int32_t)std::thread::hardware_concurrency());
    int32_t n_predict = 128;     // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
    int32_t n_ctx = 2048;        // context size

    // sampling parameters
    int32_t top_k = 40;
    float top_p = 0.95f;
    float temp = 0.10f;
    float repeat_penalty = 1.30f;

    std::string model = "ggml-alpaca-7b-q4.bin";  // model path

    bool use_color = true;  // use color to distinguish generations and inputs
    bool use_mmap = false;  // use mmap to load model
    std::string dtype = "float32";  // configure the compute dtype
    std::string mtype = "llama";  // the model type name, llama
};

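This struct defines some key parameters of the chatbot:
n_predict: the number of tokens generated in one prediction. The larger this value, the longer the bot's reply, but the more compute it consumes.
repeat_last_n: applies the repetition penalty to the last n tokens. The larger this value, the less likely the reply is to repeat the last few words, which improves diversity. It also increases computational complexity.
n_ctx: the context size, that is, the length of conversation history the bot considers. The larger this value, the longer the history the reply depends on; replies may be more appropriate, but more compute is needed.
top_k: the number of candidate tokens considered. The larger this value, the more choices the bot has and the more diverse the replies may be, but the more compute it consumes.
top_p: the cumulative probability considered. The larger this value, the more likely low-frequency words are to be considered and the more novel the replies may be, but the more compute it consumes.
temp: the temperature parameter. The higher this value, the more likely low-probability words are to be sampled and the more novel the replies may be, but the more compute it consumes.
repeat_penalty: the penalty for repeated words. The higher this value, the lower the probability of repeated words in the reply and the more diverse the reply, but more compute is needed.
In short, the parameters above mainly control the diversity, coherence and computational complexity of the generated replies:
- Increasing n_predict, n_ctx, top_k, top_p and temp can improve the diversity and novelty of replies, but also raises computational complexity.
- Increasing repeat_last_n and repeat_penalty can improve coherence and reduce repetition, but also needs more compute.
- By comparison, n_predict, repeat_last_n and n_ctx have a larger effect on compute; top_k, top_p and temp have a slightly smaller effect.
So tuning these parameters means trading off diversity, coherence and compute cost. Understanding what each parameter does lets us tune a reasonably balanced set of hyperparameters for our actual needs and hardware, and generate higher-quality replies.
A thorough understanding of a machine learning model's hyperparameters, and of their effect on quality and performance, is also key to high-quality AI. Systematic study and summarizing helps us reach an advanced level in using AI frameworks and algorithms.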
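
To make the roles of repeat_penalty, temp, top_k and top_p concrete, here is a minimal sketch of one common sampling pipeline: penalize recently seen tokens, scale by temperature, keep the k best logits, apply softmax, then cut at cumulative probability top_p. It is an assumption that InferLLM orders the steps this way; `SamplingParams` and `filter_candidates` are hypothetical names, and the final random draw is left out so the result is deterministic.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical parameter bundle; defaults mirror the struct quoted above.
struct SamplingParams {
    int top_k = 40;
    float top_p = 0.95f;
    float temp = 0.10f;
    float repeat_penalty = 1.30f;
};

// Returns the surviving (token id, probability) pairs, sorted by descending
// probability and renormalized so that the probabilities sum to 1.
std::vector<std::pair<int, float>> filter_candidates(
        const std::vector<float>& logits, const std::vector<int>& last_tokens,
        const SamplingParams& p) {
    assert(!logits.empty() && p.temp > 0.0f && p.top_k > 0);
    std::vector<std::pair<int, float>> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        const int id = static_cast<int>(i);
        float logit = logits[i];
        // Repetition penalty: push recently generated tokens down.
        if (std::find(last_tokens.begin(), last_tokens.end(), id) != last_tokens.end()) {
            logit = (logit < 0.0f) ? logit * p.repeat_penalty : logit / p.repeat_penalty;
        }
        cand.emplace_back(id, logit / p.temp);  // temperature scaling
    }
    // Top-k: keep only the k largest scaled logits, in descending order.
    const std::size_t k = std::min(static_cast<std::size_t>(p.top_k), cand.size());
    std::partial_sort(cand.begin(), cand.begin() + static_cast<std::ptrdiff_t>(k), cand.end(),
                      [](const std::pair<int, float>& a, const std::pair<int, float>& b) {
                          return a.second > b.second;
                      });
    cand.resize(k);
    // Softmax over the survivors (max subtracted for numerical stability).
    const float max_logit = cand[0].second;
    float sum = 0.0f;
    for (auto& c : cand) {
        c.second = std::exp(c.second - max_logit);
        sum += c.second;
    }
    for (auto& c : cand) {
        c.second /= sum;
    }
    // Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    float cumulative = 0.0f;
    std::size_t keep = cand.size();
    for (std::size_t i = 0; i < cand.size(); ++i) {
        cumulative += cand[i].second;
        if (cumulative >= p.top_p) {
            keep = i + 1;
            break;
        }
    }
    cand.resize(keep);
    // Renormalize what is left.
    float kept = 0.0f;
    for (const auto& c : cand) {
        kept += c.second;
    }
    for (auto& c : cand) {
        c.second /= kept;
    }
    return cand;
}
```

In this sketch the four sampling parameters only touch one vector of n_vocab logits per generated token, which suggests their cost is small next to the model's forward pass; n_predict and n_ctx are the ones that change how much model computation is done.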

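Q: How do they affect the CPU usage and memory usage of the machine, and what ranges can the parameters take?
A:

These parameters have a considerable effect on CPU and memory usage:
- n_predict: the more tokens predicted at once, the more computation and generated token embeddings are needed, and the higher the CPU and memory usage. A good range for this parameter is 50-200.
- n_ctx: the larger the context size, the longer the conversation history that has to be encoded and understood, and the higher the CPU and memory usage. A good range for this parameter is 512-4096.
- top_k: the more candidate tokens considered, the more often top-k sampling computation is needed, and the higher the CPU usage. A good range for this parameter is 20-100.
- repeat_last_n: the more recent tokens that have to be tracked and penalized, the higher the memory consumption. A good range for this parameter is 32-128.
- The other parameters (top_p, temp, repeat_penalty) have a relatively small effect.
So, among these hyperparameters:
- n_predict, n_ctx and top_k mainly affect CPU usage. Values that are too large make inference extremely slow.
- n_ctx and repeat_last_n mainly affect memory consumption. Values that are too large may cause OOM.
- top_k and repeat_last_n also affect GPU memory consumption to some degree. This needs attention when running on GPU as well.
Adjusting the parameters above gives some control over the model's use of compute resources. But the smaller the parameters, the lower the quality of the generated replies.
So it is important to choose a hyperparameter configuration that performs well in the training environment and is acceptable.
If the hardware in the training environment is good, a somewhat larger parameter range can be chosen; if the hardware is weaker, the range should be reduced to a smaller configuration that still gives acceptable results.
Being familiar with how machine learning models work, and understanding how each parameter affects quality and performance, lets us tune hyperparameters more efficiently and accurately and get the best performance in each environment. This is also a necessary quality for mastering AI frameworks and algorithms.
In summary, working out how the model's parameters affect compute resource usage, and choosing a parameter range that works well in the current environment, is key to high-quality AI. Continued study and summarizing helps us become excellent AI algorithm and systems developers.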
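
The memory effect of n_ctx can be estimated with simple arithmetic. The sketch below assumes a standard multi-head attention key/value cache that stores one key vector and one value vector of n_embd elements per position and per layer; whether InferLLM lays out its cache exactly this way, and which element size it uses, are assumptions. `kv_cache_bytes` is a hypothetical helper.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical estimate of key/value cache size in bytes:
// 2 (K and V) * n_layer * n_ctx * n_embd * bytes per element.
std::uint64_t kv_cache_bytes(std::uint64_t n_layer, std::uint64_t n_ctx,
                             std::uint64_t n_embd, std::uint64_t bytes_per_elem) {
    assert(bytes_per_elem > 0);
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem;
}
```

With the values printed in the `load:` log of the CentOS 7 report above (n_layer = 32, n_ctx = 2048, n_embd = 4096), `kv_cache_bytes(32, 2048, 4096, 4)` gives 2147483648 bytes, that is 2 GiB for 4-byte elements, and half of that for 2-byte elements. Under this assumption the cache grows linearly with n_ctx.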

Porting question

If I want to port this project to an NPU, which function interfaces do I need to implement?
