
sjtu-ipads / powerinfer


High-speed Large Language Model Serving on PCs with Consumer-grade GPUs

License: MIT License

Dockerfile 0.12% Shell 1.33% CMake 1.28% Swift 0.06% Zig 0.19% C++ 40.30% C 33.09% Python 6.69% Nix 0.17% Cuda 10.88% Objective-C 2.71% Metal 3.19%
falcon large-language-models llama llm llm-inference local-inference bamboo-7b

powerinfer's People

Contributors

anzz1, blackhole89, cebtenzzre, comex, crd716, dannydaemonic, ejones, galunid, ggerganov, goerch, green-sky, hodlen, howard0su, ikawrakow, ivanstepanovftw, jhen0409, johannesgaessler, kerfufflev2, li-plus, lshzh-ww, monatis, netrunnereve, prusnak, shibe2, slaren, slyecho, sw, tjohnman, xaedes, yixinsong-e


powerinfer's Issues

Quantized INT4/Q4 model?

How are the GGUF weights quantized to INT4? Is there a script, similar to llama.cpp's, to convert the fp16 weights to q4_0?
Please share more details about the INT4 model.
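
For reference, the usual upstream llama.cpp flow is to convert the HF checkpoint to an fp16 GGUF and then run the bundled quantize tool on it; whether PowerInfer's fork keeps the same tool and the same Q4_0 path is an assumption here, and the file names below are placeholders:

./quantize ggml-model-f16.gguf ggml-model-q4_0.gguf q4_0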

Evaluation

What does the speedup mean here? Speedup compared to what?


llama.cpp:3107: vram_allocated_bytes < vram_capacity

Environment: WSL inside Windows.

$ free -m
               total        used        free      shared  buff/cache   available
Mem:           23919         580       22559           3         780       23015
$ nvidia-smi
|=========================================+======================+======================|
|   0  NVIDIA GeForce RTX 3090        On  | 00000000:01:00.0  On |                  N/A |
|  0%   33C    P8              25W / 420W |   1762MiB / 24576MiB |      7%      Default |
|                                         |                      |                  N/A |
+-----------------------------------------+----------------------+----------------------+

cmd:

./build/bin/main -m /mnt/{blur}/OneDrive/Models/oobabooga_windows/text-generation-webui/models/test/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"

result:

llama_model_loader: - tensor  881:                blk.79.fc1.weight q4_0     [  8192,  3072,     1,     1 ]
llama_model_loader: - tensor  882:                blk.79.fc2.weight q4_0     [  3072, 28672,     1,     1 ]
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:                               general.name str
llama_model_loader: - kv   2:                       llama.context_length u32
llama_model_loader: - kv   3:                     llama.embedding_length u32
llama_model_loader: - kv   4:                          llama.block_count u32
llama_model_loader: - kv   5:                  llama.feed_forward_length u32
llama_model_loader: - kv   6:                 llama.rope.dimension_count u32
llama_model_loader: - kv   7:                 llama.attention.head_count u32
llama_model_loader: - kv   8:              llama.attention.head_count_kv u32
llama_model_loader: - kv   9:     llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv  10:                       llama.rope.freq_base f32
llama_model_loader: - kv  11:                          general.file_type u32
llama_model_loader: - kv  12:                       tokenizer.ggml.model str
llama_model_loader: - kv  13:                      tokenizer.ggml.tokens arr
llama_model_loader: - kv  14:                      tokenizer.ggml.scores arr
llama_model_loader: - kv  15:                  tokenizer.ggml.token_type arr
llama_model_loader: - kv  16:                tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv  17:                tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv  18:            tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv  19:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv  20:               tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv  21:               tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv  22:               general.quantization_version u32
llama_model_loader: - type  f32:  161 tensors
llama_model_loader: - type q4_0:  722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 8192
llm_load_print_meta: n_head           = 64
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_layer          = 80
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 8
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 28672
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 70B
llm_load_print_meta: model ftype      = mostly Q4_0
llm_load_print_meta: model params     = 74.98 B
llm_load_print_meta: model size       = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name   = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size =    0.32 MB
GGML_ASSERT: /mnt/{blur}/Projects/PowerInfer/llama.cpp:3107: vram_allocated_bytes < vram_capacity

I experimented with the VRAM budget and with resetting or disabling the GPU index. No luck.

Bitcoin

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.

Current Behavior

Please provide a detailed written description of what llama.cpp did, instead.

Environment and Context

Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.

  • Physical (or virtual) hardware you are using, e.g. for Linux:

$ lscpu

  • Operating System, e.g. for Linux:

$ uname -a

  • SDK version, e.g. for Linux:
$ python3 --version
$ make --version
$ g++ --version

Failure Information (for bugs)

Please help provide information about the failure / bug.

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. step 1
  2. step 2
  3. step 3
  4. etc.

Failure Logs

Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.

Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use GitHub's Markdown to cleanly format your logs for easy readability.

Example environment info:

llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c

llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID:                       AuthenticAMD
Model name:                      AMD Ryzen Threadripper 1950X 16-Core Processor
Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization:                  AMD-V

llama.cpp$ python3 --version
Python 3.10.9

llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy                         1.24.2
numpydoc                      1.5.0
sentencepiece                 0.1.97
torch                         1.13.1
torchvision                   0.14.1

llama.cpp$ make --version | head -1
GNU Make 4.3

$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487  ./models/65B/ggml-model-q4_0.bin

Example run with the Linux command perf

llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx   = 512
llama_model_load: n_embd  = 8192
llama_model_load: n_mult  = 256
llama_model_load: n_head  = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot   = 128
llama_model_load: f16     = 2
llama_model_load: n_ff    = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size =  2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size =  4869.09 MB / num tensors = 723

system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |

main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
     1 -> ''
 12148 -> 'Please'
  3802 -> ' close'
   596 -> ' your'
  2228 -> ' issue'
   746 -> ' when'
   372 -> ' it'
   756 -> ' has'
  1063 -> ' been'
  7699 -> ' answered'
 29889 -> '.'

sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000


Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]


main: mem per token = 71159620 bytes
main:     load time = 19309.95 ms
main:   sample time =   168.62 ms
main:  predict time = 223895.61 ms / 888.47 ms per token
main:    total time = 246406.42 ms

 Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':

        3636882.89 msec task-clock                #   14.677 CPUs utilized
             13509      context-switches          #    3.714 /sec
              2436      cpu-migrations            #    0.670 /sec
          10476679      page-faults               #    2.881 K/sec
    13133115082869      cycles                    #    3.611 GHz                      (16.77%)
       29314462753      stalled-cycles-frontend   #    0.22% frontend cycles idle     (16.76%)
    10294402631459      stalled-cycles-backend    #   78.39% backend cycles idle      (16.74%)
    23479217109614      instructions              #    1.79  insn per cycle
                                                  #    0.44  stalled cycles per insn  (16.76%)
     2353072268027      branches                  #  647.002 M/sec                    (16.77%)
        1998682780      branch-misses             #    0.08% of all branches          (16.76%)

     247.802177522 seconds time elapsed

    3618.573072000 seconds user
      18.491698000 seconds sys

Combined with LLM in a Flash

https://arxiv.org/abs/2312.11514
Recently, "LLM in a Flash" was proposed: a method that uses flash memory to run models that exceed DRAM capacity.
If I'm right, I think these two technologies could be applied simultaneously.
If that were possible, I think it would make running very large models much easier.

Any plan on supporting Mistral-based models?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

Jetson Orin+ RTXA6000

My hardware platform is a Jetson Orin plus an RTX A6000. It looks like the GPU resources are not being fully utilized: both VRAM usage and GPU utilization are low. How can I adjust things so that the hardware resources are fully used?


Length

  1. I would like to ask whether you have compared performance at longer lengths. If PowerInfer is used on an A100 with length = 2048 or higher, can it still achieve an effect similar to that reported in the paper?
  2. Also, what idea led you to use the ReLU activation function instead of following the original SwiGLU?

Chat model

Does this project support the chat model of LLaMA?

Accuracy comparison

Hi, is there a comparison of the model accuracy loss after deploying with PowerInfer?

Windows Visual Studio build failure

When building the VS project with CMake, the following error is reported during compilation:
fatal error C1083: Cannot open include file: 'stdatomic.h': No such file or directory


Does not seem to support long prompts well.

We noticed that the paper mentions limited performance improvement for relatively long prompts. In our case, however, with very long prompts PowerInfer seems to stop working and generates no output. This occurred with llama-7B, llama-13B, and llama-70B-q4. We speculate that when the prompt is very long, many neurons need to be active simultaneously, so the locality-based performance optimization no longer applies and PowerInfer's working mechanism becomes ineffective. We would like to hear your perspective on this.

Combined with MLX, wouldn't this take off on the Mac platform!

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

On Mac M-series platforms, could this be further combined with https://github.com/ml-explore/mlx to achieve even more dramatic speedups?

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C

Running cmake --build build --config Release reports the following error:

E:\Langchain-Chatchat\PowerInfer>cmake --build build --config Release
MSBuild version 17.3.1+2badb37d1 for .NET Framework
  build_info.vcxproj -> E:\Langchain-Chatchat\PowerInfer\build\common\build_info.dir\Release\bui
  ld_info.lib
  ggml.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-alloc.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-backend.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  ggml-quants.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdato
mic.h(15,1): fatal  error C1189: #error:  <stdatomic.h> is not yet supported when compiling as C
, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj
]
  Generating code...

Which A100 are you guys using?

Thanks for the great work!
Just curious, in "Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy."
Are you guys talking about the 40GB version of A100 or the 80GB version?

Will a Docker image be provided?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.

Motivation

Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.

Possible Implementation

If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.

no CUDA-capable device is detected

Tried to run inference on WSL:
./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
and got:
no CUDA-capable device is detected
current device: 231936

Any suggestions on how to fix this?

I have two NVIDIA cards in the computer: a GeForce RTX 2070 and a Tesla M40 24GB.
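
One hedged thing worth trying (an assumption on my part, not a confirmed fix): restrict the run to a single GPU with CUDA's standard device mask, in case device enumeration across the two cards is the problem under WSL:

CUDA_VISIBLE_DEVICES=0 ./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"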

Mixtral MoE support?

Prerequisites

Please answer the following questions for yourself before submitting an issue.

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Feature Description

Support for Mixtral MoE

Convert HF models with sparse threshold specified

Feature Description

A per-model sparse threshold is already supported for LLaMA-family models, but not yet for other model architectures.

Possible Implementation

Follow the way a KV param is added in convert.py and add similar logic to convert-hf-to-powerinfer-gguf.py. If necessary, update the MLP predictor model repos to add a config.json. A rough sketch is shown below.
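
To make this concrete, here is a minimal, hypothetical sketch of writing such a threshold as a GGUF key-value pair with the gguf Python package. The key name powerinfer.sparse_threshold and the config.json field name are assumptions for illustration, not the project's actual schema:

# Hypothetical sketch: read a per-model threshold from config.json and emit it as a GGUF KV.
# Key and field names are assumptions, not PowerInfer's actual schema.
import json
import gguf

with open("config.json") as f:
    config = json.load(f)
threshold = float(config.get("sparse_threshold", 0.0))

writer = gguf.GGUFWriter("model.powerinfer.gguf", arch="llama")
writer.add_float32("powerinfer.sparse_threshold", threshold)  # the new KV pair
# ... the converter would add the remaining metadata and tensors here ...
writer.write_header_to_file()
writer.write_kv_data_to_file()
writer.close()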

In-depth Analysis of Memory Management for Enhanced Performance on Consumer-grade GPUs

Dear PowerInfer Contributors,

I hope this message finds you well. I am reaching out to discuss a potential enhancement to the PowerInfer inference engine, specifically regarding the memory management strategies employed during LLM inference on consumer-grade GPUs.

Upon a thorough examination of the current implementation, I have observed that while the engine adeptly handles the distribution of workload between the CPU and GPU, there may be room for optimisation in the way memory is allocated and managed, particularly during peak usage scenarios.

The crux of the matter lies in the dynamic allocation of memory for 'hot' and 'cold' neurons. While the preloading of 'hot' neurons onto the GPU is commendable for its efficiency, the allocation of memory for 'cold' neurons during runtime could potentially be streamlined. This is especially pertinent when considering the limited VRAM available on consumer-grade GPUs compared to their server-grade counterparts.

I propose a more granular control over memory allocation, which could include:

  • Implementing a more sophisticated memory pooling mechanism to reduce fragmentation and improve allocation speed (a conceptual sketch follows this list).
  • Exploring the use of memory compression techniques to increase the effective capacity of VRAM.
  • Introducing a dynamic memory re-allocation system that can adapt to the changing patterns of 'hot' and 'cold' neuron activations based on real-time usage.
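
To make the first point a little more concrete, here is a conceptual, purely illustrative sketch of a bump-pointer pool over a fixed VRAM budget. It is written in Python only as pseudocode for the idea; a real implementation would live in the CUDA backend, and all names are hypothetical:

class VRAMPool:
    """Suballocate from one preallocated block instead of making many small allocations."""

    def __init__(self, capacity_bytes: int):
        self.capacity = capacity_bytes  # total budget reserved up front
        self.offset = 0                 # bump pointer into the block

    def alloc(self, nbytes: int, align: int = 256) -> int:
        # round the current offset up to the alignment boundary
        start = (self.offset + align - 1) // align * align
        if start + nbytes > self.capacity:
            # out of budget: the caller would keep these 'cold' weights on the CPU
            raise MemoryError("VRAM pool exhausted")
        self.offset = start + nbytes
        return start  # byte offset into the preallocated block

    def reset(self) -> None:
        # reclaim everything at once, e.g. when the hot/cold split is recomputed
        self.offset = 0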

I believe that by addressing these aspects, PowerInfer could achieve even greater performance gains and efficiency, making it more accessible and practical for a wider range of users.

I would be most interested in hearing your thoughts on this matter and am keen to contribute to the development of such enhancements.

Thank you for your time and consideration.

Best regards,
yihong1120

nvcc fails due to illegal options

When I try to build this I get the following error:
nvcc fatal : Unknown option 'Wmissing-declarations'

The CMake files seem to be set up to pass gcc options to nvcc, which AFAIK are not supported.

Does the PowerInfer team plan to support RWKV-LM, developed by Bo Peng's team?

The RWKV architecture has always used ReLU as its activation function, and it is the only Chinese RNN-architecture language model developed domestically, with great pioneering value. Does the PowerInfer team have any plans to support it? The RWKV-LM project link is below:

https://github.com/BlinkDL/RWKV-LM

On behalf of the enthusiasts in the RWKV community, I strongly hope your team will consider it. Thank you very much!

Eliminate compiler warnings

Our initial implementation of PowerInfer introduced a bunch of compiler warnings, mostly about unused variables, and made CI complain about them. A quick fix should resolve more than 95% of them.

CUDA error 1 in ggml-cuda.cu:8332: invalid argument, and then segmentation fault

Running in WSL, all dependencies satisfied, most recent code pull, on an RTX 3090.

Command line:
./build/bin/main -m models/7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 12

Log output:

Log start
main: build = 1549 (9d72668)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed  = 1703277999
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from models/7B/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                token_embd.weight f16      [  4096, 32000,     1,     1 ]
llama_model_loader: - tensor    1:              blk.0.attn_q.weight f16      [  4096,  4096,     1,     1 ]
----snip----
llama_model_loader: - kv  17:            tokenizer.ggml.padding_token_id u32
llama_model_loader: - type  f32:   65 tensors
llama_model_loader: - type  f16:  290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = SPM
llm_load_print_meta: n_vocab          = 32000
llm_load_print_meta: n_merges         = 0
llm_load_print_meta: n_ctx_train      = 2048
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 32
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_gqa            = 1
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff             = 11008
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx  = 2048
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: model type       = 7B
llm_load_print_meta: model ftype      = mostly F16
llm_load_print_meta: model params     = 7.57 B
llm_load_print_meta: model size       = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name   = syx
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token  = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llama_model_load: sparse inference - vram budget = 12.00 GB
llm_load_sparse_model_tensors: ggml ctx size =    0.13 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required  = 8506.63 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor    0:                    blk.0.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor    1:                 blk.0.gpu_bucket i32      [  5376,     1,     1,     1 ]
llama_model_loader: - tensor    2:                    blk.1.gpu_idx i32      [ 11008,     1,     1,     1 ]
----snip----
llama_model_loader: - tensor   62:                   blk.31.gpu_idx i32      [ 11008,     1,     1,     1 ]
llama_model_loader: - tensor   63:                blk.31.gpu_bucket i32      [  4608,     1,     1,     1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv   0:                       general.architecture str
llama_model_loader: - kv   1:              generic.gpu_index.block_count u32
llama_model_loader: - kv   2:                        split.vram_capacity u64
llama_model_loader: - type  i32:   64 tensors
loaded gpu_idx, vram_required: 6119997440
apply_tensors_to_base_model: applying gpu_idx adapter from 'models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (9.84 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (11859.33 ms)
llm_load_gpu_split: offloaded 5790.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx      = 512
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size  =  256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 186.57 MB
llama_new_context_with_model: VRAM scratch buffer: 185.00 MB
llama_new_context_with_model: total VRAM used: 6124.52 MB (model: 5939.52 MB, context: 185.00 MB)

**CUDA error 1 at /home/user/Envs/PowerInfer/ggml-cuda.cu:8332: invalid argument**
current device: 0

CUDA error 4 at /home/user/Envs/PowerInfer/ggml-cuda.cu:485: driver shutting down
current device: 8192
**Segmentation fault**
