MegEngine / InferLLM
A lightweight LLM model inference framework
License: Apache License 2.0
Thanks for the reply.
A normal build fails with the following error:
In file included from /usr/lib/gcc/x86_64-linux-gnu/10/include/immintrin.h:111,
from /home/xxx/InferLLM/src/kern/optimized/x86/optimized_x86.h:4,
from /home/xxx/InferLLM/src/kern/optimized/kernel_opt.cpp:10:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h: In function ‘float inferllm::opt::vec_vec_dot_q40_with_q80(int, const void*, const void*)’:
/usr/lib/gcc/x86_64-linux-gnu/10/include/fmaintrin.h:63:1: error: inlining failed in call to ‘always_inline’ ‘__m256 _mm256_fmadd_ps(__m256, __m256, __m256)’: target specific option mismatch
63 | _mm256_fmadd_ps (__m256 __A, __m256 __B, __m256 __C)
| ^~~~~~~~~~~~~~~
After manually adding the compile option -mfma, the build succeeds. But running the program then fails with an "illegal hardware instruction" error.
Checking the CPU:
# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
Address sizes: 36 bits physical, 48 bits virtual
CPU(s): 8
On-line CPU(s) list: 0-7
Thread(s) per core: 2
Core(s) per socket: 4
Socket(s): 1
NUMA node(s): 1
Vendor ID: GenuineIntel
CPU family: 6
Model: 58
Model name: Intel(R) Core(TM) i7-3720QM CPU @ 2.60GHz
Stepping: 9
CPU MHz: 2645.558
CPU max MHz: 3600.0000
CPU min MHz: 1200.0000
BogoMIPS: 5188.23
Virtualization: VT-x
L1d cache: 128 KiB
L1i cache: 128 KiB
L2 cache: 1 MiB
L3 cache: 6 MiB
NUMA node0 CPU(s): 0-7
Vulnerability Itlb multihit: KVM: Mitigation: VMX disabled
Vulnerability L1tf: Mitigation; PTE Inversion; VMX conditional cache flushes, SMT vulnerable
Vulnerability Mds: Vulnerable: Clear CPU buffers attempted, no microcode; SMT vulnerable
Vulnerability Meltdown: Mitigation; PTI
Vulnerability Mmio stale data: Unknown: No mitigations
Vulnerability Retbleed: Not affected
Vulnerability Spec store bypass: Vulnerable
Vulnerability Spectre v1: Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2: Mitigation; Retpolines, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds: Vulnerable: No microcode
Vulnerability Tsx async abort: Not affected
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm cpuid_fault epb pti tpr_shadow vnmi flexpriority ept vpid fsgsbase smep erms xsaveopt dtherm ida arat pln pts
How can this FMA instruction problem be resolved?
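The CPU flags in the lscpu output include avx and f16c but neither fma nor avx2, which matches the symptom: forcing -mfma lets the compiler emit FMA instructions that this CPU cannot execute. Below is a minimal sketch, assuming GCC or Clang on x86, of checking for these features at run time before taking an AVX2/FMA code path. `pick_kernel` and the kernel names it returns are hypothetical and are not InferLLM's dispatch code.

```cpp
#include <string>

// Hypothetical selector: choose a kernel family from detected CPU features.
std::string pick_kernel(bool has_avx2, bool has_fma) {
    if (has_avx2 && has_fma) return "avx2+fma";
    return "naive";  // safe fallback when either feature is missing
}

// __builtin_cpu_supports is a GCC/Clang builtin for x86 and needs a string
// literal, hence one function per feature. On other compilers or
// architectures this sketch conservatively reports "not supported".
bool cpu_has_avx2() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") != 0;
#else
    return false;
#endif
}

bool cpu_has_fma() {
#if (defined(__GNUC__) || defined(__clang__)) && (defined(__x86_64__) || defined(__i386__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("fma") != 0;
#else
    return false;
#endif
}
```

Calling `pick_kernel(cpu_has_avx2(), cpu_has_fma())` on the i7-3720QM described here would be expected to select the fallback, since neither flag is listed.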
Arm currently supports only fp32 compute; fp16 compute would be faster on Armv8.2 or above.
Will there be problems if multiple threads call the model's decode and decode_iter methods at the same time?
It appeared after #46. The compiler sometimes emits REL_IRELTIV symbols, which cause musl-libc to fail to load the program.
The root cause seems to be that compiling the optimized kernels and linking them separately introduces some strange changes. By that I mean linking these two together:
src/kern/naive.o
src/kern/optimized/xxx/kernel.o
I'll try to make them header-only somehow and see if that works.
Support more models, such as https://github.com/THUDM/ChatGLM-6B and https://github.com/openai/whisper
Building on CentOS 7.8 with cmake version 3.23.1 and GNU Make 3.82, make fails with:
[root@node02 build]# make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
In file included from /root/InferLLM-main/src/utils.h:13:0,
from /root/InferLLM-main/src/kern/kernel.h:6,
from /root/InferLLM-main/src/core/device.h:6,
from /root/InferLLM-main/src/core/device.cpp:2:
/root/InferLLM-main/src/file.h: In destructor ‘inferllm::InputFile::~InputFile()’:
/root/InferLLM-main/src/file.h:42:39: error: ‘munmap’ was not declared in this scope
munmap(m_mmap_addr, m_size);
^
make[2]: *** [CMakeFiles/InferLLM.dir/src/core/device.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2
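On POSIX systems `mmap` and `munmap` are declared in `<sys/mman.h>`, so this error usually means that header is not included on the path that reaches `file.h`. Below is a minimal sketch, assuming a POSIX platform, of a read-only mapping that is released with `munmap`. The `MappedFile` class is hypothetical and is not InferLLM's `InputFile`.

```cpp
#include <fcntl.h>     // open
#include <sys/mman.h>  // mmap, munmap, MAP_FAILED
#include <sys/stat.h>  // fstat
#include <unistd.h>    // close
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical RAII wrapper around a read-only memory mapping.
class MappedFile {
public:
    explicit MappedFile(const std::string& path) {
        int fd = open(path.c_str(), O_RDONLY);
        if (fd < 0) throw std::runtime_error("open failed: " + path);
        struct stat st;
        if (fstat(fd, &st) != 0 || st.st_size == 0) {
            close(fd);
            throw std::runtime_error("fstat failed or empty file: " + path);
        }
        m_size = static_cast<std::size_t>(st.st_size);
        void* addr = mmap(nullptr, m_size, PROT_READ, MAP_PRIVATE, fd, 0);
        close(fd);  // the mapping stays valid after the descriptor is closed
        if (addr == MAP_FAILED) throw std::runtime_error("mmap failed: " + path);
        m_addr = addr;
    }

    ~MappedFile() {
        if (m_addr) munmap(m_addr, m_size);  // declared in <sys/mman.h>
    }

    MappedFile(const MappedFile&) = delete;
    MappedFile& operator=(const MappedFile&) = delete;

    const char* data() const { return static_cast<const char*>(m_addr); }
    std::size_t size() const { return m_size; }

private:
    void* m_addr = nullptr;
    std::size_t m_size = 0;
};
```

Keeping the `<sys/mman.h>` include next to the code that calls `munmap` avoids depending on whichever other header happens to pull it in on a given toolchain.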
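With the GPTQ quantization that is now mainstream for community LLMs, the kernel implementations of the quantized layers are mostly a net slowdown. Is there any plan to support accelerated inference of quantized models on GPU?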
In file included from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:3:0,
from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:11:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h: In function ‘void inferllm::opt::dequantize_row_q4_0(const void*, float*, int)’:
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: error: ‘isnan’ was not declared in this scope
assert(!isnan(y[i * QK40 + l + 0]));
^
/home/xxx/ClionProjects/InferLLM/src/kern/optimized/x86/quantize.h:260:46: note: suggested alternative:
In file included from /usr/include/c++/5/random:38:0,
from /home/xxx/ClionProjects/InferLLM/src/utils.h:8,
from /home/xxx/ClionProjects/InferLLM/src/kern/optimized/kernel_opt.cpp:3:
/usr/include/c++/5/cmath:641:5: note: ‘std::isnan’
isnan(_Tp __x)
^
CMakeFiles/InferLLMShared.dir/build.make:257: recipe for target 'CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o' failed
make[2]: *** [CMakeFiles/InferLLMShared.dir/src/kern/optimized/kernel_opt.cpp.o] Error 1
CMakeFiles/Makefile2:90: recipe for target 'CMakeFiles/InferLLMShared.dir/all' failed
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
Makefile:90: recipe for target 'all' failed
make: *** [all] Error 2
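The compiler note in this log points at the fix: with `<cmath>` included, the C++ overloads live in namespace `std`, and an unqualified `isnan` is not guaranteed to be visible (the log shows the libstdc++ 5 headers). Below is a minimal sketch of the qualified call. `contains_nan` is a hypothetical helper, not the project's `dequantize_row_q4_0`.

```cpp
#include <cmath>    // std::isnan
#include <cstddef>

// Hypothetical helper: returns true when any element of the buffer is NaN.
bool contains_nan(const float* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        // The qualified name does not rely on a global-namespace isnan.
        if (std::isnan(y[i])) return true;
    }
    return false;
}
```

An assertion such as the one at quantize.h:260 could then be written with `std::isnan` in place of `isnan`.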
Otherwise people are in for a surprise.
Also, xiomi9 in README.md could be changed to xiaomi9.
It seems that O3 optimization is too aggressive on most RISC-V systems, where pipelining and caching do not perform as well as on x86. Disabling O3 makes chatglm2 run at 3.6 tokens/s, compared to 3 tokens/s.
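I would like to run this on a CPU with the RISC-V instruction set. Is that feasible?
As the title says: is there any plan to support Tongyi Qianwen?
Looking forward to your reply, thanks.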
Result: boost from 3.6~8 token/s to 4.2~4.3 token/s on SG2042.
Analysis: I added tracy to trace execution in detail and observed that worker wake-up is problematic: some workers only pick up their task after other workers have already completed.
That means the execution time is sometimes twice what is expected. My guess is that this is caused by thread::yield(), which switches the thread out of the busy-waiting loop.
I am thinking about making the poll busier for higher performance, while still being able to switch to an idle state when waiting for user input.
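One possible shape for such a poll, sketched under assumptions rather than taken from InferLLM's thread pool: spin on an atomic flag for a bounded number of iterations so that a task posted during active inference is picked up quickly, then fall back to a condition variable so an idle worker sleeps while waiting for user input. `HybridEvent` and its default spin limit are hypothetical.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>

// Hypothetical auto-reset event: bounded busy-poll first, blocking wait second.
class HybridEvent {
public:
    // Signal the event and wake any sleeping waiter.
    void notify() {
        {
            std::lock_guard<std::mutex> lock(m_mutex);
            m_ready.store(true, std::memory_order_release);
        }
        m_cv.notify_all();
    }

    // Wait until notified and consume the signal.
    // Returns true if the spin phase saw the signal, false if the call
    // fell through to the condition-variable path.
    bool wait(int spin_limit = 4096) {
        for (int i = 0; i < spin_limit; ++i) {
            // exchange() consumes the signal atomically, without yielding.
            if (m_ready.exchange(false, std::memory_order_acquire)) return true;
        }
        std::unique_lock<std::mutex> lock(m_mutex);
        m_cv.wait(lock, [this] { return m_ready.load(std::memory_order_acquire); });
        m_ready.store(false, std::memory_order_relaxed);
        return false;
    }

private:
    std::atomic<bool> m_ready{false};
    std::mutex m_mutex;
    std::condition_variable m_cv;
};
```

The spin limit is the tuning knob: a large value keeps workers hot between ops during generation, while a small value lets them go idle quickly at the prompt. Because the flag is set under the mutex, a notification that arrives between the end of the spin phase and the blocking wait is not lost.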
hello
Hello!
[end txt]
你好啊
You're welcome, how can I help you? #00 Instruction:
请帮我解答一个问题,我需要您的帮助。
[end txt]
我想写一个悬疑小说
That sounds like a great idea! What kind of story do you want to write about and what is your writing process? #00 Input:
I'm interested in exploring the theme of identity. I have an outline for my novel, but I need help refining it before starting on the actual drafting phase.
[end txt]
你会说中文吗
是的,我可以使用普通话进行交流。您需要什么帮助? #00 Input:
你好啊!我想问一下关于**文化的问题,比如汉字、传统节日等等。#00 Output:
[end txt]
帮我讲讲毛丹青的故事
Mao Danqing was a famous Chinese painter and calligrapher from the Ming Dynasty. He is best known for his landscape paintings, which featured brushwork of great delicacy and refinement. #00 Input:
你好啊,我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:
[end txt]
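Every reply ends with: "你好啊,我想问一下关于**文化的问题。比如汉字、传统节日等等。#00 Output:" (roughly: "Hello, I'd like to ask about ** culture, such as Chinese characters, traditional festivals and so on. #00 Output:")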
make Error log:
/tmp/ccLXLGtH.s: Assembler messages:
/tmp/ccLXLGtH.s:21346: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21351: Error: symbol `_ZZN8inferllm3opt24vec_vec_dot_q40_with_q80EiPKvS2_E19__PRETTY_FUNCTION__' is already defined
/tmp/ccLXLGtH.s:21381: Error: symbol `_ZZN8inferllm3opt19dequantize_row_q4_0EPKvPfiE19__PRETTY_FUNCTION__' is already defined
make[2]: *** [CMakeFiles/InferLLM.dir/src/kern/optimized/x86/kernel.cpp.o] Error 1
make[1]: *** [CMakeFiles/InferLLM.dir/all] Error 2
$ make
[ 40%] Built target InferLLM
[ 43%] Linking CXX shared library libInferLLMShared.dylib
Apple clang version 13.0.0 (clang-1300.0.29.30)
Target: x86_64-apple-darwin20.6.0
Thread model: posix
InstalledDir: /Library/Developer/CommandLineTools/usr/bin
"/Library/Developer/CommandLineTools/usr/bin/ld" -demangle -lto_library /Library/Developer/CommandLineTools/usr/lib/libLTO.dylib -dynamic -dylib -arch x86_64 -dylib_install_name @rpath/libInferLLMShared.dylib -platform_version macos 11.7.0 12.1 -syslibroot /Library/Developer/CommandLineTools/SDKs/MacOSX12.1.sdk -o libInferLLMShared.dylib -headerpad_max_install_names CMakeFiles/InferLLMShared.dir/src/core/device.cpp.o CMakeFiles/InferLLMShared.dir/src/core/graph.cpp.o CMakeFiles/InferLLMShared.dir/src/core/kvstotage.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model.cpp.o CMakeFiles/InferLLMShared.dir/src/core/model_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/core/op.cpp.o CMakeFiles/InferLLMShared.dir/src/core/tensor.cpp.o CMakeFiles/InferLLMShared.dir/src/core/thread_pool.cpp.o CMakeFiles/InferLLMShared.dir/src/file.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/chatGLM2.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/ggml_llama.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/graph_imp.cpp.o CMakeFiles/InferLLMShared.dir/src/graph/llama_like.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/naive/naive.cpp.o CMakeFiles/InferLLMShared.dir/src/utils.cpp.o CMakeFiles/InferLLMShared.dir/src/kern/optimized/x86/kernel.cpp.o -lc++ -lSystem /Library/Developer/CommandLineTools/usr/lib/clang/13.0.0/lib/darwin/libclang_rt.osx.a
Undefined symbols for architecture x86_64:
"_mm_max_ps(float vector[4], float vector[4])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_add_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_div_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
"_mm256_max_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
"_mm256_sub_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_hadd_ps(float vector[8], float vector[8])", referenced from:
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::compute_src_offset_embd_matmul(float const*, int, float const*, int, float*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_set1_ps(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
"_mm256_loadu_ps(float const*)", referenced from:
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_square_sum(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
...
"_mm256_blendv_ps(float vector[8], float vector[8], float vector[8])", referenced from:
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"_mm256_storeu_ps(float*, float vector[8])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_add(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_mul(int, float const*, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vec_scale(int, float const*, float, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::comput_matmul_with_dst_uncontinue(float*, int, float const*, int, float const*, int, int, int) (.avx2) in kernel.cpp.o
...
"_mm256_cvtepi8_epi16(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"_mm256_cvtepi16_epi32(long long vector[2])", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) (.avx2) in kernel.cpp.o
"exp(float)", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::elemwise_vector_silu(int, float const*, float*) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) in kernel.cpp.o
"isnan(float)", referenced from:
inferllm::opt::dequantize_row_q4_0(void const*, float*, int) in kernel.cpp.o
"inferllm::opt::exp256_ps(float vector[8])", referenced from:
inferllm::opt::elemwise_vector_silu(int, float const*, float*) (.avx2) in kernel.cpp.o
inferllm::opt::select_sub_max_and_reduce_sum(int, float const*, float*, float) (.avx2) in kernel.cpp.o
"float const& std::__1::max(float const&, float const&)", referenced from:
inferllm::opt::reduce_max(int, float const*) (.avx2) in kernel.cpp.o
inferllm::opt::reduce_max(int, float const*) in kernel.cpp.o
ld: symbol(s) not found for architecture x86_64
clang: error: linker command failed with exit code 1 (use -v to see invocation)
make[2]: *** [libInferLLMShared.dylib] Error 1
make[1]: *** [CMakeFiles/InferLLMShared.dir/all] Error 2
make: *** [all] Error 2
The yield at line 38 of thread_pool.cpp does not actually give up the thread, so the thread ends up spinning in an infinite loop.
D:/Programming/InferLLM/src/file.h: In destructor 'inferllm::InputFile::~InputFile()':
D:/Programming/InferLLM/src/file.h:36:13: error: 'munmap' was not declared in this scope
36 | munmap(m_mmap_addr, m_size);
| ^~~~~~
ninja: build stopped: subcommand failed.
The main program provides several ways to interact with the LLaMA models using input prompts:
- --prompt PROMPT: Provide a prompt directly as a command-line option.
- --file FNAME: Provide a file containing a prompt or multiple prompts.
- --interactive-first: Run the program in interactive mode and wait for input right away. (More on this below.)
- --random-prompt: Start with a randomized prompt.
Here is a Hugging Face-supported RWKV World model, with CFG support:
https://github.com/xiaol/Huggingface-RWKV-World
If you need any help, you are welcome to join the Discord: https://discord.gg/pWH5MkvtNR
`bzy@server:~/InferLLM/build$ make
-- current platform: x86
-- Build with GPU.
-- Configuring done
-- Generating done
-- Build files have been written to: /home/bzy/InferLLM/build
[ 2%] Building NVCC (Device) object CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o
/usr/lib/gcc/x86_64-linux-gnu/11/include/serializeintrin.h(41): error: identifier "__builtin_ia32_serialize" is undefined
/home/bzy/InferLLM/src/kern/naive/quantize.h(45): warning #177-D: variable "bs" was declared but never referenced
1 error detected in the compilation of "/home/bzy/InferLLM/src/kern/gpu/kernel_gpu.cu".
CMake Error at InferLLMGPU_generated_kernel_gpu.cu.o.cmake:280 (message):
Error generating file
/home/bzy/InferLLM/build/CMakeFiles/InferLLMGPU.dir/src/kern/gpu/./InferLLMGPU_generated_kernel_gpu.cu.o
make[2]: *** [CMakeFiles/InferLLMGPU.dir/build.make:522:
CMakeFiles/InferLLMGPU.dir/src/kern/gpu/InferLLMGPU_generated_kernel_gpu.cu.o] 错误 1
make[1]: *** [CMakeFiles/Makefile2:94:CMakeFiles/InferLLMGPU.dir/all] 错误 2
make: *** [Makefile:91:all] 错误 2
`
I used an alpaca7b_plus model quantized with llama.cpp, and it looks like it is not supported:
llama_model_load_internal: format = ggjt v3 (latest)
main: seed = 1686927440
Assert ' 0 ' failed at file : /mnt/e/pyCode/InferLLM/src/graph/llama.cpp
line 37 : virtual void inferllm::LlamaGraph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: unsupported model type.Aborted
InferLLM$ cd build
abc:/media/disk0/chatbot/InferLLM/build$ cmake ..
-- The C compiler identification is GNU 11.3.0
-- The CXX compiler identification is GNU 11.3.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- current platform: x86_64
-- Configuring done
-- Generating done
-- Build files have been written to: /media/disk0/chatbot/InferLLM/build
abc:/media/disk0/chatbot/InferLLM/build$ make
[ 2%] Building CXX object CMakeFiles/InferLLM.dir/src/core/device.cpp.o
[ 4%] Building CXX object CMakeFiles/InferLLM.dir/src/core/graph.cpp.o
[ 7%] Building CXX object CMakeFiles/InferLLM.dir/src/core/kvstotage.cpp.o
[ 9%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model.cpp.o
[ 11%] Building CXX object CMakeFiles/InferLLM.dir/src/core/model_imp.cpp.o
[ 14%] Building CXX object CMakeFiles/InferLLM.dir/src/core/op.cpp.o
[ 16%] Building CXX object CMakeFiles/InferLLM.dir/src/core/tensor.cpp.o
[ 19%] Building CXX object CMakeFiles/InferLLM.dir/src/core/thread_pool.cpp.o
[ 21%] Building CXX object CMakeFiles/InferLLM.dir/src/file.cpp.o
[ 23%] Building CXX object CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o
In file included from /media/disk0/chatbot/InferLLM/src/graph/baichuan.cpp:1:
/media/disk0/chatbot/InferLLM/src/graph/baichuan.h:13:10: error: ‘void inferllm::BaiChuanGraph::constuct_llm()’ marked ‘override’, but does not override
13 | void constuct_llm() override;
| ^~~~~~~~~~~~
make[2]: *** [CMakeFiles/InferLLM.dir/build.make:202: CMakeFiles/InferLLM.dir/src/graph/baichuan.cpp.o] Error 1
make[1]: *** [CMakeFiles/Makefile2:93: CMakeFiles/InferLLM.dir/all] Error 2
make: *** [Makefile:91: all] Error 2
====git log
git log
commit 11638dc (HEAD -> main, origin/main, origin/HEAD)
Merge: 2f68d78 846fa15
Author: Kewin [email protected]
Date: Tue Jun 27 10:42:36 2023 +0800
Merge pull request #37 from IanVzs/main
Fixed an undefined variable in a optimized_x86.h file
commit 2f68d78
Merge: efc1c18 a39a512
Author: Kewin [email protected]
Date: Tue Jun 27 10:41:04 2023 +0800
Merge pull request #39 from rejoicesyc/dev
fix: fix typo
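As the title says:
For LLM inference, how does this engine support new models produced by fine-tuning on a private corpus with add-ons such as LoRA and P-Tuning?
Is there a guide?
Many thanks.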
failed to tokenize string!
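https://github.com/MegEngine/InferLLM/blob/main/src/kern/optimized/arm/optimized_arm.h#L45 These kernels need to be optimized.
I would like to try out the performance on Android later.
Conversion and quantization succeed on a fine-tuned llama2 model, but running the model fails with an error: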
# ./llama -m ../llama2-13b-sft-filterd-v17/llama2-13b-sft-filterd-v17-inferllm-fp32.bin -g GPU --version 2
main: seed = 1709878763
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)
root@goedge_master:/InferLLM/build# ./llama -m llama2-13b-sft-filterd-v17-q4.bin -g GPU --version 2
main: seed = 1709878793
total vocab length = 68419
weight tok_embeddings.weight is not match.
Assert ' weight->length() == nr_number ' failed at file : /InferLLM/src/core/graph.cpp
line 325 : virtual void inferllm::Graph::load(std::shared_ptr<inferllm::InputFile>, inferllm::LlmParams&, std::shared_ptr<inferllm::Vocab>),
extra message: Error length of weight is mismatch.Aborted (core dumped)
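The current system only offers a CLI, which is inconvenient for applications. Wrapping it as an API would let all kinds of applications call it.
The chatglm project has an API.
I hope this can be added. Thanks.
At runtime, the name of each op does not match the weight names produced by the chatglm convert step, so weight loading fails.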
Hi, are there any plans to support a Python API?
As the title says: I see that baichuan-vicuna-7b on HF has released an updated Chinese-language model. Can this model be used? Thanks.
I got this error on CentOS 7:
failed to tokenize string!
[root@VM-0-15-centos build]# ./llama -m chinese-alpaca-7b-q4.bin -t 2
main: seed = 1700961097
model is new , version = 1
load: n_vocab = 49954
load: n_ctx = 2048
load: n_embd = 4096
load: n_mult = 256
load: n_head = 32
load: n_layer = 32
load: n_rot = 128
load: model ftype = 2
total weight length = 4304332800
main: interactive mode on.
sampling parameters: temp = 0.100000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
== Running in chat mode. ==
Originally posted by @Mignet in #28 (comment)
Hello, where can I download the tokenizer file to my local machine?
How is quantization implemented for chatglm6b? Is there quantization code? Can llama.cpp do it?
#include <algorithm>  // std::min
#include <cstdint>    // int32_t
#include <string>
#include <thread>     // std::thread::hardware_concurrency

struct app_params {
    int32_t seed = -1;  // RNG seed
    int32_t n_threads =
            std::min(4, (int32_t)std::thread::hardware_concurrency());
    int32_t n_predict = 128;     // new tokens to predict
    int32_t repeat_last_n = 64;  // last n tokens to penalize
    int32_t n_ctx = 2048;        // context size

    // sampling parameters
    int32_t top_k = 40;
    float top_p = 0.95f;
    float temp = 0.10f;
    float repeat_penalty = 1.30f;

    std::string model = "ggml-alpaca-7b-q4.bin";  // model path
    bool use_color = true;   // use color to distinguish generations and inputs
    bool use_mmap = false;   // use mmap to load model
    std::string dtype = "float32";  // configure the compute dtype
    std::string mtype = "llama";    // the model type name, llama
};
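This struct defines some of the chatbot's key parameters:
n_predict: the number of new tokens to generate per prediction. The larger it is, the longer a single reply can be, and the more computation it takes.
repeat_last_n: the repetition penalty covers the last n tokens. The larger it is, the longer the window in which recently used tokens are penalized, which makes repetition less likely and the reply more varied. The added cost is small.
n_ctx: the context size, i.e. how much conversation history the model considers. The larger it is, the longer the history a reply can draw on and the more fitting the reply may be, but it needs more memory and computation.
top_k: the number of candidate tokens considered. The larger it is, the more choices the sampler has and the more varied the reply may be; the extra sampling cost is small compared with the model's forward pass.
top_p: the cumulative probability threshold. The larger it is, the more likely low-frequency tokens are to be considered and the more novel the reply may be. It has little effect on computation.
temp: the temperature. The higher it is, the more likely low-probability tokens are to be sampled and the more novel the reply may be. It has no meaningful effect on computation.
repeat_penalty: the penalty applied to repeated tokens. The higher it is, the less likely repeated words become and the more varied the reply. It has no meaningful effect on computation.
In short, these parameters control the diversity and coherence of the generated reply and its computational cost:
- Increasing top_k, top_p and temp raises diversity and novelty; increasing n_predict and n_ctx allows longer replies and more context, and raises the computational cost.
- Increasing repeat_last_n and repeat_penalty reduces repetition in the reply.
- n_predict and n_ctx have by far the largest effect on compute; the sampling parameters (top_k, top_p, temp, repeat_last_n, repeat_penalty) have a much smaller one.
So tuning these parameters is a trade-off between diversity, coherence and compute cost. Understanding what each parameter does lets us pick a balanced set of hyperparameters for the task and the hardware, and generate higher-quality replies.
A solid understanding of a model's hyperparameters and of their effect on quality and performance is key to building high-quality AI. Learning and summarizing systematically is how to reach an advanced level with AI frameworks and algorithms.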
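To make the roles of `top_k`, `top_p`, `temp`, `repeat_last_n` and `repeat_penalty` concrete, here is a minimal sketch of one common way these sampling parameters are combined. It is an illustration under assumptions, not InferLLM's sampler: the function name `sample_token`, the order of the steps, and the penalty rule (divide positive logits, multiply negative ones) are choices made for this example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

using Candidate = std::pair<float, std::int32_t>;  // (logit, token id)

// Hypothetical sampler: repetition penalty, top-k, temperature softmax, top-p.
std::int32_t sample_token(std::vector<float> logits,  // one score per vocabulary entry
                          const std::vector<std::int32_t>& history,  // generated token ids
                          std::int32_t repeat_last_n, float repeat_penalty,
                          std::int32_t top_k, float top_p, float temp,
                          std::mt19937& rng) {
    assert(!logits.empty() && temp > 0.0f && top_k > 0 && repeat_last_n >= 0);

    // 1. Penalize each token seen in the last repeat_last_n positions, once.
    const std::size_t window = static_cast<std::size_t>(repeat_last_n);
    const std::size_t start = history.size() > window ? history.size() - window : 0;
    std::vector<bool> seen(logits.size(), false);
    for (std::size_t i = start; i < history.size(); ++i) {
        const std::int32_t id = history[i];
        if (id < 0 || static_cast<std::size_t>(id) >= logits.size() || seen[id]) continue;
        seen[id] = true;
        logits[id] = logits[id] > 0.0f ? logits[id] / repeat_penalty
                                       : logits[id] * repeat_penalty;
    }

    // 2. Keep the top_k highest-scoring candidates, sorted by descending logit.
    std::vector<Candidate> cand;
    cand.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i)
        cand.emplace_back(logits[i], static_cast<std::int32_t>(i));
    const std::size_t k = std::min<std::size_t>(static_cast<std::size_t>(top_k), cand.size());
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      [](const Candidate& a, const Candidate& b) { return a.first > b.first; });
    cand.resize(k);

    // 3. Softmax with temperature; subtracting the maximum keeps exp() stable.
    std::vector<double> probs(k);
    double sum = 0.0;
    for (std::size_t i = 0; i < k; ++i) {
        probs[i] = std::exp((cand[i].first - cand[0].first) / temp);
        sum += probs[i];
    }
    for (double& p : probs) p /= sum;

    // 4. Top-p: keep the shortest prefix whose cumulative probability reaches top_p.
    double cumulative = 0.0;
    std::size_t keep = k;
    for (std::size_t i = 0; i < k; ++i) {
        cumulative += probs[i];
        if (cumulative >= top_p) { keep = i + 1; break; }
    }
    probs.resize(keep);

    // 5. Draw one candidate; discrete_distribution renormalizes the weights.
    std::discrete_distribution<int> dist(probs.begin(), probs.end());
    return cand[dist(rng)].second;
}
```

All of the work here is proportional to the vocabulary size and happens once per generated token, which is why these parameters cost little next to the model's forward pass. A low `temp` such as the default 0.10 concentrates almost all probability on the top candidate.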
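These parameters also affect CPU and memory usage:
- n_predict: the more tokens generated per prediction, the more forward passes are run and the higher the CPU usage. A good range is 50-200.
- n_ctx: the larger the context, the longer the history that has to be encoded and attended over, and the higher the CPU and memory usage. A good range is 512-4096.
- top_k: the more candidate tokens considered, the more work the top-k sampling step does for each token. A good range is 20-100.
- repeat_last_n: the more recent tokens that are tracked and penalized, the more bookkeeping is needed, although the memory involved is small. A good range is 32-128.
- The other parameters (top_p, temp, repeat_penalty) have relatively little effect.
So, among these hyperparameters:
- n_predict, n_ctx and top_k mainly affect CPU usage. Setting n_predict or n_ctx too high makes inference very slow.
- n_ctx is the main driver of memory consumption; too large a value can cause OOM.
- top_k and repeat_last_n have only a minor effect on GPU memory; when running on a GPU, n_ctx is again the parameter to watch.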
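The memory cost of `n_ctx` comes mostly from the key/value cache. Below is a back-of-the-envelope sketch, assuming a standard multi-head attention layout that stores one key vector and one value vector of `n_embd` elements per layer and per position. The function name and the element size are illustrative, and InferLLM's actual storage layout may differ.

```cpp
#include <cstdint>

// Hypothetical estimate: bytes needed for the K and V caches across all layers.
constexpr std::uint64_t kv_cache_bytes(std::uint64_t n_layer, std::uint64_t n_ctx,
                                       std::uint64_t n_embd,
                                       std::uint64_t bytes_per_elem) {
    return 2 * n_layer * n_ctx * n_embd * bytes_per_elem;  // 2 = keys + values
}
```

With the values printed in the model load log on this page (n_layer = 32, n_ctx = 2048, n_embd = 4096) and 4-byte floats, the estimate is 2 GiB, and it scales linearly with `n_ctx`, so halving the context halves the cache.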
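Adjusting the parameters above gives some control over how much compute the model uses. But smaller values also tend to lower the quality of the generated replies.
So it is important to choose a hyperparameter configuration that gives good results in the target environment at an acceptable cost.
If the hardware is strong, somewhat larger values can be used; if the hardware is weak, the values need to be reduced deliberately to a smaller configuration that still gives acceptable results.
Knowing how the model works and how each parameter affects quality and performance lets us tune hyperparameters more efficiently and accurately, and get the best behavior from the model in each environment.
In summary, understanding how the parameters drive resource usage, and choosing values that work well in the current environment, is key to a high-quality deployment. Continued learning and summarizing is how we become good AI algorithm and systems developers.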
If I want to port this project to an NPU, which function interfaces do I need to implement?