sjtu-ipads / powerinfer

High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

License: MIT License

Dockerfile 0.12% Shell 1.33% CMake 1.28% Swift 0.06% Zig 0.19% C++ 40.30% C 33.09% Python 6.69% Nix 0.17% Cuda 10.88% Objective-C 2.71% Metal 3.19%

falcon large-language-models llama llm llm-inference local-inference bamboo-7b

powerinfer's Issues

Are you interested in supporting DeepSeek?

Deepseek-llm and Deepseek-coder are also very strong models, and they use the llama architecture: https://github.com/deepseek-ai/deepseek-coder/

Quantized INT4/Q4 model?

How are the GGUF weights quantized to INT4? Is there a script, similar to the one in llama.cpp, to convert FP16 weights to q4_0?
Please share more details about the INT4 model.
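For readers unfamiliar with the q4_0 name used in the question, here is a minimal Python sketch of a Q4_0-style block scheme as found in llama.cpp-style GGUF files: 32 weights share one scale, and each weight is stored as a 4-bit code. This is an illustration, not PowerInfer's conversion script. The function names are hypothetical, the scale is kept as a Python float rather than rounded to 16 bits, and whether PowerInfer follows exactly this procedure is an assumption.

```python
BLOCK_SIZE = 32  # weights per block in the Q4_0-style layout sketched here


def quantize_block_q4_0(values):
    """Quantize 32 floats to (scale, list of 4-bit codes in 0..15)."""
    assert len(values) == BLOCK_SIZE
    # Take the element with the largest magnitude, keeping its sign.
    extreme = max(values, key=abs)
    scale = extreme / -8.0
    inv = 1.0 / scale if scale != 0.0 else 0.0
    # Shift into the 0..15 range and round by adding 0.5 before truncating.
    codes = [min(15, int(v * inv + 8.5)) for v in values]
    return scale, codes


def dequantize_block_q4_0(scale, codes):
    """Recover approximate floats from a scale and 4-bit codes."""
    return [(c - 8) * scale for c in codes]


def q4_0_bits_per_weight():
    """32 four-bit codes plus one 16-bit scale, averaged over the block."""
    return (BLOCK_SIZE * 4 + 16) / BLOCK_SIZE
```

Under these assumptions the storage cost works out to 4.5 bits per weight, which is consistent with the "4.50 BPW" figure reported for a q4_0 model in a log further down this page.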

Is it possible to run Mixtral 8x7B in the future?

Is it possible to run Mixtral 8x7B in the future?

When I enable the GPU split, the inference result is unacceptable

When I enable the GPU split, the inference result is unacceptable.

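Question about support for servers with consumer-grade GPUs

Background: we have a server equipped with consumer-grade cards (8× NVIDIA GeForce RTX 4090) and would like to use PowerInfer on it.
Question: how can we set up PowerInfer on this server, and is this configuration supported?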

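What optimizations are there compared with llama.cpp? Most of the code seems to overlap with it

This may be a bit blunt, but the question is as stated in the title.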

Evaluation

What does the speedup mean here? Speedup compared to what?

No module named powerinfer, cannot split GPU

python3: No module named powerinfer
failed to generate gpu split..............

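When will the Tongyi Qianwen (Qwen) models be supported? We are using the 72B and 14B models and urgently hope for accelerated inference support.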

BUG: LLaMA-7B will not fully offload to GPU

Even when the model can be fully loaded into VRAM, PowerInfer cannot offload all model weights to the GPU.

llama.cpp:3107: vram_allocated_bytes < vram_capacity

Environment: WSL inside Windows.

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23919         580       22559           3         780       23015

$ nvidia-smi
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8              25W / 420W |   1762MiB / 24576MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

cmd:

./build/bin/main -m /mnt/{blur}/OneDrive/Models/oobabooga_windows/text-generation-webui/models/test/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"

result:

llama_model_loader: - tensor  881:                blk.79.fc1.weight q4_0     [  8192,  3072,     1,     1 ]
llama_model_loader: - tensor  882:                blk.79.fc2.weight q4_0     [  3072, 28672,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv  22:               general.quantization_version u32
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 74.98 B
llm_load_print_meta: model size       = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name   = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size =    0.32 MB
GGML_ASSERT: /mnt/{blur}/Projects/PowerInfer/llama.cpp:3107: vram_allocated_bytes < vram_capacity

I experimented with vram-budget and with resetting or disabling the GPU index. No luck.
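As a rough sanity check on the figures in this report, the sketch below redoes the arithmetic in Python using only numbers copied from the log and the nvidia-smi output above (39.28 GiB model size, 74.98 B parameters, 1762 MiB of 24576 MiB VRAM in use). The helper names are hypothetical, and this does not diagnose the assertion; it only shows how the reported sizes relate to each other.

```python
GIB = 1024 ** 3
MIB = 1024 ** 2


def bits_per_weight(model_size_gib, params_billion):
    """Average number of bits stored per parameter."""
    return model_size_gib * GIB * 8 / (params_billion * 1e9)


def vram_shortfall_gib(model_size_gib, vram_total_mib, vram_used_mib):
    """GiB of weights that cannot fit in the VRAM that is still free."""
    free_gib = (vram_total_mib - vram_used_mib) * MIB / GIB
    return max(0.0, model_size_gib - free_gib)


# Figures copied from the log and nvidia-smi output in this issue.
bpw = bits_per_weight(39.28, 74.98)
shortfall = vram_shortfall_gib(39.28, 24576, 1762)
print(f"{bpw:.2f} bits per weight, {shortfall:.2f} GiB beyond free VRAM")
```

Note that the command in this report loads llama-70b-relu.q4.powerinfer.gguf, whose reported size is larger than the card's total VRAM, while the issue title mentions LLaMA-7B.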

Bitcoin

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

Operating System, e.g. for Linux:

$ uname -a

SDK version, e.g. for Linux:

$ python3 --version
$ make --version
$ g++ --version

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

step 1
step 2
step 3
etc.

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use Github's markdown to cleanly format your logs for easy readability.

Example environment info:

llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c

llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization:                  AMD-V

llama.cpp$ python3 --version
Python 3.10.9

llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy                         1.24.2
numpydoc                      1.5.0
sentencepiece                 0.1.97
torch                         1.13.1
torchvision                   0.14.1

llama.cpp$ make --version | head -1
GNU Make 4.3

$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin

Example run with the Linux command perf

llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
     1 -> ''
 12148 -> 'Please'
  3802 -> ' close'
   596 -> ' your'
  2228 -> ' issue'
   746 -> ' when'
   372 -> ' it'
   756 -> ' has'
  1063 -> ' been'
  7699 -> ' answered'
 29889 -> '.'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]


main: mem per token = 71159620 bytes
main:     load time = 19309.95 ms
main:   sample time =   168.62 ms
main:  predict time = 223895.61 ms / 888.47 ms per token
main:    total time = 246406.42 ms

 Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':

        3636882.89 msec task-clock                #   14.677 CPUs utilized
             13509      context-switches          #    3.714 /sec
              2436      cpu-migrations            #    0.670 /sec
          10476679      page-faults               #    2.881 K/sec
    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
    23479217109614      instructions              #    1.79  insn per cycle
                                                  #    0.44  stalled cycles per insn  (16.76%)
     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
        1998682780      branch-misses             #    0.08% of all branches          (16.76%)

     247.802177522 seconds time elapsed

    3618.573072000 seconds user
      18.491698000 seconds sys

Combined with LLM in a flash

https://arxiv.org/abs/2312.11514
LLM in a Flash was recently proposed: a method that uses flash memory to run models that exceed the available DRAM.
If I understand correctly, the two techniques could be applied at the same time.
If that is possible, it would make running very large models easier.

[HELP WANTED] Is InternLM supported?

Hoping for support for InternLM-20B and InternLM-7B.

https://github.com/internLM/internLM/

http://huggingface.co/internlm

Any plans to support Mistral-based models?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

I am running the latest code. Development is very rapid so there are no tagged versions as of now.
I carefully followed the README.md.
I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Why did performance drop so much?

Is it worth it?

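Question about deploying my own model

I would presumably also need your predictor training code, right? And the fine-tuning code for converting to ReLU?

How can I get a ReLU-activated Llama 2 model through fine-tuning? Are there any recommended fine-tuning scripts?

Thanks.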

[HELP WANTED] Is Qwen supported?

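Jetson Orin + RTX A6000

My hardware platform is a Jetson Orin + RTX A6000. The GPU does not seem to be fully utilized: both VRAM usage and GPU utilization are low. How can I adjust the setup to make full use of the hardware?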

Length

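Have you compared performance at longer sequence lengths? If PowerInfer is used on an A100 with length=2048 or higher, does it still achieve results similar to those in the paper?
Also, what line of thinking led you to use the ReLU activation function instead of keeping the original SwiGLU?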

pip install -r requirements reports that ./gguf-py is not installable

Hello, when I run pip install -r requirements.txt, the following error appears. How can I resolve it?

Chat model

Does this project support the chat models of LLaMA?

Accuracy comparison

Hello, is there a comparison of the model's accuracy loss after deploying with PowerInfer?

Testing against ollama mistral gives the same speed on Llama 2 7B

This is on Linux with a 4090.
Running ollama run mistral "why is the sky blue?" and running the same prompt here give the same speed.
Should we expect faster results on llama7b?

Does this support llama type models as well (Using SwiGLU activation)?

Without replacing the activation function?
Again, amazing work!!

How to integrate with LangChain?

Does this framework provide any interoperability with existing ecosystems such as LangChain?

server cannot run

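Build fails with Visual Studio on Windows

When I build the Visual Studio project generated by CMake, compilation fails with the following error:
fatal error C1083: Cannot open include file: 'stdatomic.h': No such file or directory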

From meta-llama/Llama-2-13b-hf to SparseLLM/ReluLLaMA-13B

Good job!
Going from meta-llama/Llama-2-13b-hf to SparseLLM/ReluLLaMA-13B should require a fine-tuning step, but the README does not describe it. Will the fine-tuning procedure be released?

Long prompts do not seem to be supported well.

We noticed that the paper mentioned limited performance improvement for relatively long prompt situations, but our situation is that, in the case of very long prompts, it seems PowerInfer ceases to work, generating no output. This situation occurred with llama-7B, llama-13B, and llama-70B-q4. We speculate that this might be because when the prompt is very long, there are many neurons that need to work simultaneously, and the performance optimization done using the locality principle no longer applies, causing PowerInfer's working mechanism to be ineffective. We would like to hear your perspective on this matter.
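The speculation above can be made concrete with a small back-of-the-envelope sketch in Python. It assumes, purely for illustration, that each prompt token independently activates the same fraction of neurons; real activations are correlated, and the 10% figure used below is hypothetical, not a measurement from PowerInfer.

```python
def expected_union_fraction(p_active, n_tokens):
    """Expected fraction of neurons activated by at least one of n_tokens,
    if every token independently activates a fraction p_active of them."""
    return 1.0 - (1.0 - p_active) ** n_tokens


# Hypothetical per-token activation rate of 10%.
for n in (1, 8, 64, 512):
    print(n, round(expected_union_fraction(0.1, n), 4))
```

Under this simplified model, the set of neurons needed to process a batch of prompt tokens approaches the full layer as the prompt grows, which would reduce the benefit of sparsity during prompt processing. It would not by itself explain a run that produces no output at all.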

Combined with MLX, wouldn't this really take off on the Mac platform!

Feature Description

On Mac M-series platforms, could this be further combined with https://github.com/ml-explore/mlx to achieve an even more dramatic speedup?

fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C

Running cmake --build build --config Release produces the following errors:

E:\Langchain-Chatchat\PowerInfer>cmake --build build --config Release
MSBuild version 17.3.1+2badb37d1 for .NET Framework
  build_info.vcxproj -> E:\Langchain-Chatchat\PowerInfer\build\common\build_info.dir\Release\build_info.lib
  ggml.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
  ggml-alloc.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
  ggml-backend.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
  ggml-quants.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
  Generating Code...

Which A100 are you guys using?

Thanks for the great work!
Just curious about this statement: "Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy."
Does it refer to the 40GB version of the A100 or the 80GB version?

What is the recommended way to run this from Python code?

Should we keep the compiled binary running in memory and have Python call it, or can Python just make arbitrary calls to it?
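One of the two options raised in this question, launching the binary once per request, can be sketched with Python's standard subprocess module. This is a minimal sketch, not a maintainer recommendation: the helper names are hypothetical, and the only flags used (-m, -n, -t, -p, --vram-budget) are the ones that appear in command lines elsewhere on this page.

```python
import subprocess


def build_command(binary, model_path, prompt, n_predict=128, threads=8,
                  vram_budget_gb=None):
    """Assemble the argument list for one run of the compiled binary."""
    cmd = [binary, "-m", model_path, "-n", str(n_predict),
           "-t", str(threads), "-p", prompt]
    if vram_budget_gb is not None:
        cmd += ["--vram-budget", str(vram_budget_gb)]
    return cmd


def generate(binary, model_path, prompt, **kwargs):
    """Run one generation as a child process and return its stdout."""
    result = subprocess.run(
        build_command(binary, model_path, prompt, **kwargs),
        capture_output=True, text=True, check=True,
    )
    return result.stdout
```

With this approach every call starts a new process, so the model is loaded again each time. Keeping a single long-running process avoids that cost but needs some interface for Python to talk to it.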

vram-budget doesn't work well.

According to feedback, there is a discrepancy between the actual GPU usage and the budget.

Will a Docker image be provided?

no CUDA-capable device is detected

I tried to run inference on WSL:
./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
and got
no CUDA-capable device is detected
current device: 231936

Any suggestions on how to fix this?

I have two NVIDIA cards in the computer: a GeForce RTX 2070 and a Tesla M40 24GB.

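Are there any results for running PowerInfer on an A100?

Are there any benchmarks for running PowerInfer on an A100? For example, compared with llama.cpp on the same A100, does PowerInfer offer a speedup, or can it significantly reduce VRAM usage?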

Mixtral MoE support?

Feature Description

Support for Mixtral MoE

How do I convert a Chinese Llama 2 model in HF .bin format to the PowerInfer format?

Convert HF models with sparse threshold specified

Feature Description

Per-model sparse thresholds are supported for LLaMA-family models but not yet for other model architectures.

Possible Implementation

Follow the way the KV parameter is added in convert.py and add similar logic to convert-hf-to-powerinfer-gguf.py. If necessary, update the MLP model repos to add a config.json.

In-depth Analysis of Memory Management for Enhanced Performance on Consumer-grade GPUs

Dear PowerInfer Contributors,

I hope this message finds you well. I am reaching out to discuss a potential enhancement to the PowerInfer inference engine, specifically regarding the memory management strategies employed during LLM inference on consumer-grade GPUs.

Upon a thorough examination of the current implementation, I have observed that while the engine adeptly handles the distribution of workload between the CPU and GPU, there may be room for optimisation in the way memory is allocated and managed, particularly during peak usage scenarios.

The crux of the matter lies in the dynamic allocation of memory for 'hot' and 'cold' neurons. While the preloading of 'hot' neurons onto the GPU is commendable for its efficiency, the allocation of memory for 'cold' neurons during runtime could potentially be streamlined. This is especially pertinent when considering the limited VRAM available on consumer-grade GPUs compared to their server-grade counterparts.

I propose a more granular control over memory allocation, which could include:

Implementing a more sophisticated memory pooling mechanism to reduce fragmentation and improve allocation speed.
Exploring the use of memory compression techniques to increase the effective capacity of VRAM.
Introducing a dynamic memory re-allocation system that can adapt to the changing patterns of 'hot' and 'cold' neuron activations based on real-time usage.

I believe that by addressing these aspects, PowerInfer could achieve even greater performance gains and efficiency, making it more accessible and practical for a wider range of users.

I would be most interested in hearing your thoughts on this matter and am keen to contribute to the development of such enhancements.

Thank you for your time and consideration.

Best regards,
yihong1120
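To illustrate the first item in the proposal above, here is a minimal Python sketch of a fixed-size block pool backed by a free list. It is a generic illustration of memory pooling, not PowerInfer's allocator: the class and method names are hypothetical, and blocks are represented by integer ids rather than real device memory.

```python
class BlockPool:
    """Fixed-size block pool that hands out block ids from a free list."""

    def __init__(self, num_blocks):
        # Stack of free block ids; the lowest id is handed out first.
        self.free = list(range(num_blocks - 1, -1, -1))
        self.in_use = set()

    def alloc(self):
        """Return a free block id, or None if the pool is exhausted."""
        if not self.free:
            return None  # caller must evict something or fall back
        block = self.free.pop()
        self.in_use.add(block)
        return block

    def release(self, block):
        """Return a previously allocated block to the pool."""
        if block not in self.in_use:
            raise ValueError(f"block {block} is not allocated")
        self.in_use.remove(block)
        self.free.append(block)
```

Because every block has the same size, allocation and release are constant-time and the pool cannot fragment. The trade-off is wasted space whenever a request is smaller than one block.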

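[HELP WANTED] Aquila and Aquila2 are LLaMA-like models; hoping they can be supported

Aquila and Aquila2 are LLaMA-like models with a permissive commercial license. Hoping they can be supported.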

nvcc fails due to illegal options

When I try to build this I get the following error:
nvcc fatal : Unknown option 'Wmissing-declarations'

The cmake files seem to be set up to pass gcc options to nvcc which AFAIK are not supported.

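Does the PowerInfer team plan to support RWKV-LM, developed by Bo Peng's team?

The RWKV architecture has always used ReLU as its activation function. It is the only Chinese RNN-architecture language model developed in China and has great pioneering value. Does the PowerInfer team have any plans to support it? Here is the link to the RWKV-LM project: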

https://github.com/BlinkDL/RWKV-LM

On behalf of the enthusiasts in the RWKV community, I strongly hope your team will consider this! Thank you very much!

Eliminate compiler warnings

Our initial implementation of PowerInfer introduced a number of compiler warnings, mostly for unused variables, which made CI complain. A quick fix should resolve more than 95% of them.

CUDA error 1 in ggml-cuda.cu:8332: invalid argument, and then segmentation fault

Running in WSL on an RTX 3090, with all dependencies satisfied and the most recent code pulled.

Command line:
./build/bin/main -m models/7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 12

Log output:

Log start
main: build = 1549 (9d72668)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1703277999
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from models/7B/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
----snip----
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.57 B
llm_load_print_meta: model size       = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name   = syx
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llama_model_load: sparse inference - vram budget = 12.00 GB
llm_load_sparse_model_tensors: ggml ctx size =    0.13 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required  = 8506.63 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [  5376,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    blk.1.gpu_idx i32      [ 11008,     1,     1,     1 ]
----snip----
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  4608,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 6119997440
apply_tensors_to_base_model: applying gpu_idx adapter from 'models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (9.84 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (11859.33 ms)
llm_load_gpu_split: offloaded 5790.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 186.57 MB
llama_new_context_with_model: VRAM scratch buffer: 185.00 MB
llama_new_context_with_model: total VRAM used: 6124.52 MB (model: 5939.52 MB, context: 185.00 MB)

CUDA error 1 at /home/user/Envs/PowerInfer/ggml-cuda.cu:8332: invalid argument
current device: 0

CUDA error 4 at /home/user/Envs/PowerInfer/ggml-cuda.cu:485: driver shutting down
current device: 8192
Segmentation fault
