sjtu-ipads / powerinfer
High-speed Large Language Model Serving on PCs with Consumer-grade GPUs
License: MIT License
DeepSeek-LLM and DeepSeek-Coder are also strong models, and they use the LLaMA architecture: https://github.com/deepseek-ai/deepseek-coder/
How are the GGUF weights quantized to INT4? Is there a script, similar to the one in llama.cpp, to convert FP16 weights to q4_0?
Please share more details about INT4 model.
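Will it be possible to run Mixtral 8x7B in the future?
This may be a bit blunt, but the question is as stated in the title.
When will the Tongyi Qianwen (Qwen) models be supported? We are using the 72B and 14B models and urgently hope for accelerated inference support.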
When the model can be fully loaded into VRAM, PowerInfer cannot offload all model weights to the GPU.
Environment: WSL inside Windows.
$ free -m
total used free shared buff/cache available
Mem: 23919 580 22559 3 780 23015
$ nvidia-smi
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3090 On | 00000000:01:00.0 On | N/A |
| 0% 33C P8 25W / 420W | 1762MiB / 24576MiB | 7% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
cmd:
./build/bin/main -m /mnt/{blur}/OneDrive/Models/oobabooga_windows/text-generation-webui/models/test/llama-70b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
result:
llama_model_loader: - tensor 881: blk.79.fc1.weight q4_0 [ 8192, 3072, 1, 1 ]
llama_model_loader: - tensor 882: blk.79.fc2.weight q4_0 [ 3072, 28672, 1, 1 ]
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: general.name str
llama_model_loader: - kv 2: llama.context_length u32
llama_model_loader: - kv 3: llama.embedding_length u32
llama_model_loader: - kv 4: llama.block_count u32
llama_model_loader: - kv 5: llama.feed_forward_length u32
llama_model_loader: - kv 6: llama.rope.dimension_count u32
llama_model_loader: - kv 7: llama.attention.head_count u32
llama_model_loader: - kv 8: llama.attention.head_count_kv u32
llama_model_loader: - kv 9: llama.attention.layer_norm_rms_epsilon f32
llama_model_loader: - kv 10: llama.rope.freq_base f32
llama_model_loader: - kv 11: general.file_type u32
llama_model_loader: - kv 12: tokenizer.ggml.model str
llama_model_loader: - kv 13: tokenizer.ggml.tokens arr
llama_model_loader: - kv 14: tokenizer.ggml.scores arr
llama_model_loader: - kv 15: tokenizer.ggml.token_type arr
llama_model_loader: - kv 16: tokenizer.ggml.bos_token_id u32
llama_model_loader: - kv 17: tokenizer.ggml.eos_token_id u32
llama_model_loader: - kv 18: tokenizer.ggml.unknown_token_id u32
llama_model_loader: - kv 19: tokenizer.ggml.padding_token_id u32
llama_model_loader: - kv 20: tokenizer.ggml.add_bos_token bool
llama_model_loader: - kv 21: tokenizer.ggml.add_eos_token bool
llama_model_loader: - kv 22: general.quantization_version u32
llama_model_loader: - type f32: 161 tensors
llama_model_loader: - type q4_0: 722 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 8192
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_layer = 80
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 28672
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 70B
llm_load_print_meta: model ftype = mostly Q4_0
llm_load_print_meta: model params = 74.98 B
llm_load_print_meta: model size = 39.28 GiB (4.50 BPW)
llm_load_print_meta: general.name = nvme
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llm_load_sparse_model_tensors: ggml ctx size = 0.32 MB
GGML_ASSERT: /mnt/{blur}/Projects/PowerInfer/llama.cpp:3107: vram_allocated_bytes < vram_capacity
I experimented with --vram-budget and with resetting or disabling the GPU index. No luck.
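As a rough sanity check on the log above, the reported model size follows from the parameter count and the bits per weight, and it can be compared with the VRAM that nvidia-smi reports. The sketch below is only illustrative arithmetic: the function name is hypothetical, it ignores the KV cache and scratch buffers, and it does not by itself explain the failed assertion.

```python
GIB = 1024 ** 3

def model_size_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate weight size in GiB from parameter count and bits per weight."""
    return n_params * bits_per_weight / 8 / GIB

# Values copied from the log above: 74.98 B params at 4.50 BPW.
size_gib = model_size_gib(74.98e9, 4.50)
# 24576 MiB is the total memory nvidia-smi reports for the RTX 3090.
vram_gib = 24576 / 1024

print(f"weights: {size_gib:.2f} GiB, VRAM: {vram_gib:.0f} GiB")
print("fits entirely in VRAM:", size_gib < vram_gib)
```

Under these numbers the weights alone are larger than the card's memory, so at most part of the model could reside on the GPU.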
# Prerequisites
Please answer the following questions for yourself before submitting an issue.
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do.
Please provide a detailed written description of what llama.cpp did, instead.
Please provide detailed information about your computer setup. This is important in case the issue is not reproducible except for under certain specific conditions.
$ lscpu
$ uname -a
$ python3 --version
$ make --version
$ g++ --version
Please help provide information about the failure / bug.
Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.
Please include any relevant log snippets or files. If it works under one configuration but not under another, please provide logs for both configurations and their corresponding outputs so it is easy to see where behavior changes.
Also, please try to avoid using screenshots if at all possible. Instead, copy/paste the console output and use Github's markdown to cleanly format your logs for easy readability.
Example environment info:
llama.cpp$ git log | head -1
commit 2af23d30434a677c6416812eea52ccc0af65119c
llama.cpp$ lscpu | egrep "AMD|Flags"
Vendor ID: AuthenticAMD
Model name: AMD Ryzen Threadripper 1950X 16-Core Processor
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid amd_dcm aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 xsaves clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif overflow_recov succor smca sme sev
Virtualization: AMD-V
llama.cpp$ python3 --version
Python 3.10.9
llama.cpp$ pip list | egrep "torch|numpy|sentencepiece"
numpy 1.24.2
numpydoc 1.5.0
sentencepiece 0.1.97
torch 1.13.1
torchvision 0.14.1
llama.cpp$ make --version | head -1
GNU Make 4.3
$ md5sum ./models/65B/ggml-model-q4_0.bin
dbdd682cce80e2d6e93cefc7449df487 ./models/65B/ggml-model-q4_0.bin
Example run with the Linux command perf
llama.cpp$ perf stat ./main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p "Please close your issue when it has been answered."
main: seed = 1679149377
llama_model_load: loading model from './models/65B/ggml-model-q4_0.bin' - please wait ...
llama_model_load: n_vocab = 32000
llama_model_load: n_ctx = 512
llama_model_load: n_embd = 8192
llama_model_load: n_mult = 256
llama_model_load: n_head = 64
llama_model_load: n_layer = 80
llama_model_load: n_rot = 128
llama_model_load: f16 = 2
llama_model_load: n_ff = 22016
llama_model_load: n_parts = 8
llama_model_load: ggml ctx size = 41477.73 MB
llama_model_load: memory_size = 2560.00 MB, n_mem = 40960
llama_model_load: loading model part 1/8 from './models/65B/ggml-model-q4_0.bin'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 2/8 from './models/65B/ggml-model-q4_0.bin.1'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 3/8 from './models/65B/ggml-model-q4_0.bin.2'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 4/8 from './models/65B/ggml-model-q4_0.bin.3'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 5/8 from './models/65B/ggml-model-q4_0.bin.4'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 6/8 from './models/65B/ggml-model-q4_0.bin.5'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 7/8 from './models/65B/ggml-model-q4_0.bin.6'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
llama_model_load: loading model part 8/8 from './models/65B/ggml-model-q4_0.bin.7'
llama_model_load: .......................................................................................... done
llama_model_load: model size = 4869.09 MB / num tensors = 723
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
main: prompt: 'Please close your issue when it has been answered.'
main: number of tokens in prompt = 11
1 -> ''
12148 -> 'Please'
3802 -> ' close'
596 -> ' your'
2228 -> ' issue'
746 -> ' when'
372 -> ' it'
756 -> ' has'
1063 -> ' been'
7699 -> ' answered'
29889 -> '.'
sampling parameters: temp = 0.800000, top_k = 40, top_p = 0.950000, repeat_last_n = 64, repeat_penalty = 1.300000
Please close your issue when it has been answered.
@duncan-donut: I'm trying to figure out what kind of "support" you need for this script and why, exactly? Is there a question about how the code works that hasn't already been addressed in one or more comments below this ticket, or are we talking something else entirely like some sorta bugfixing job because your server setup is different from mine??
I can understand if your site needs to be running smoothly and you need help with a fix of sorts but there should really be nothing wrong here that the code itself could not handle. And given that I'm getting reports about how it works perfectly well on some other servers, what exactly are we talking? A detailed report will do wonders in helping us get this resolved for ya quickly so please take your time and describe the issue(s) you see as clearly & concisely as possible!!
@duncan-donut: I'm not sure if you have access to cPanel but you could try these instructions. It is worth a shot! Let me know how it goes (or what error message, exactly!) when/if ya give that code a go? [end of text]
main: mem per token = 71159620 bytes
main: load time = 19309.95 ms
main: sample time = 168.62 ms
main: predict time = 223895.61 ms / 888.47 ms per token
main: total time = 246406.42 ms
Performance counter stats for './main -m ./models/65B/ggml-model-q4_0.bin -t 16 -n 1024 -p Please close your issue when it has been answered.':
3636882.89 msec task-clock # 14.677 CPUs utilized
13509 context-switches # 3.714 /sec
2436 cpu-migrations # 0.670 /sec
10476679 page-faults # 2.881 K/sec
13133115082869 cycles # 3.611 GHz (16.77%)
29314462753 stalled-cycles-frontend # 0.22% frontend cycles idle (16.76%)
10294402631459 stalled-cycles-backend # 78.39% backend cycles idle (16.74%)
23479217109614 instructions # 1.79 insn per cycle
# 0.44 stalled cycles per insn (16.76%)
2353072268027 branches # 647.002 M/sec (16.77%)
1998682780 branch-misses # 0.08% of all branches (16.76%)
247.802177522 seconds time elapsed
3618.573072000 seconds user
18.491698000 seconds sys
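The derived figures in the example run above (ms per token, CPUs utilized, instructions per cycle) can be reproduced from the raw counters. This is a small illustrative sketch; the variable names are made up here and the inputs are copied from the output above.

```python
# Values copied from the example run above.
ms_per_token = 888.47
task_clock_ms = 3636882.89
elapsed_s = 247.802177522
instructions = 23479217109614
cycles = 13133115082869

# Throughput is the reciprocal of the per-token latency.
tokens_per_second = 1000.0 / ms_per_token
# perf reports CPUs utilized as task-clock time divided by wall-clock time.
cpus_utilized = task_clock_ms / 1000.0 / elapsed_s
# Instructions per cycle is the ratio of the two hardware counters.
insn_per_cycle = instructions / cycles

print(f"{tokens_per_second:.2f} tokens/s")
print(f"{cpus_utilized:.3f} CPUs utilized")
print(f"{insn_per_cycle:.2f} insn per cycle")
```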
https://arxiv.org/abs/2312.11514
Recently, "LLM in a Flash" was proposed, a method that uses flash memory to run models that exceed the available DRAM.
If I'm right, I think we can apply these technologies simultaneously.
If that were possible, I think it would make running very large models easier.
Please add support for InternLM-20B and InternLM-7B.
Please answer the following questions for yourself before submitting an issue.
Please provide a detailed written description of what you were trying to do, and what you expected llama.cpp to do as an enhancement.
Please provide a detailed written description of reasons why this feature is necessary and how it is useful to llama.cpp users.
If you have an idea as to how it can be implemented, please write a detailed description. Feel free to give links to external sources or share visuals that might be helpful to understand the details better.
Is it worth it?
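We would presumably also need your predictor training code, right? And the code for the fine-tuning that converts the activation function to ReLU.
Thanks.
Does this project support the chat models of LLaMA?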
Hello, is there a comparison of the model accuracy loss after deploying with PowerInfer?
This is on Linux with a 4090.
Running ollama run mistral "why is the sky blue?" and running the same prompt with PowerInfer give the same speed.
Should we expect faster results on LLaMA-7B?
Without replacing the activation function?
Again, amazing work!!
Does this framework provide any interoperability with existing ecosystem like LangChain?
Good job!
Going from meta-llama/Llama-2-13b-hf to SparseLLM/ReluLLaMA-13B should require a fine-tuning step, but the README does not describe it. Will the details of this fine-tuning be released?
We noticed that the paper mentioned limited performance improvement for relatively long prompt situations, but our situation is that, in the case of very long prompts, it seems PowerInfer ceases to work, generating no output. This situation occurred with llama-7B, llama-13B, and llama-70B-q4. We speculate that this might be because when the prompt is very long, there are many neurons that need to work simultaneously, and the performance optimization done using the locality principle no longer applies, causing PowerInfer's working mechanism to be ineffective. We would like to hear your perspective on this matter.
On Apple M-series Macs, could this be combined with https://github.com/ml-explore/mlx to achieve an even more dramatic speedup?
Running cmake --build build --config Release reports the following errors:
E:\Langchain-Chatchat\PowerInfer>cmake --build build --config Release
MSBuild version 17.3.1+2badb37d1 for .NET Framework
build_info.vcxproj -> E:\Langchain-Chatchat\PowerInfer\build\common\build_info.dir\Release\build_info.lib
ggml.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
ggml-alloc.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
ggml-backend.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
ggml-quants.c
C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.33.31629\include\stdatomic.h(15,1): fatal error C1189: #error: <stdatomic.h> is not yet supported when compiling as C, but this is planned for a future release. [E:\Langchain-Chatchat\PowerInfer\build\ggml.vcxproj]
Generating code...
Thanks for the great work!
Just curious, in "Evaluation shows that PowerInfer attains an average token generation rate of 13.20 tokens/s, with a peak of 29.08 tokens/s, across various LLMs (including OPT-175B) on a single NVIDIA RTX 4090 GPU, only 18% lower than that achieved by a top-tier server-grade A100 GPU. This significantly outperforms llama.cpp by up to 11.69x while retaining model accuracy."
Are you guys talking about the 40GB version of A100 or the 80GB version?
Should we keep the compiled binary running in memory and have Python call it, or can Python just make arbitrary calls to it?
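One simple option, sketched below, is to launch the binary as a fresh process per request from Python. This is only an illustration and not an official binding: the helper names are hypothetical, the paths are placeholders, and the flags (-m, -n, -t, -p) are the ones used in the command lines quoted elsewhere in this thread. Each call reloads the model, so it is the simpler but slower of the two approaches.

```python
import subprocess
from typing import List

def build_command(binary: str, model: str, prompt: str,
                  n_predict: int = 128, threads: int = 8) -> List[str]:
    """Build the argument list using only flags that appear in this thread."""
    return [binary, "-m", model, "-n", str(n_predict), "-t", str(threads), "-p", prompt]

def generate(binary: str, model: str, prompt: str) -> str:
    """Run the binary once and return its standard output.

    The model is reloaded on every call, so latency includes load time.
    """
    result = subprocess.run(build_command(binary, model, prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout

# Example call (placeholder paths, adjust to your build and model):
# text = generate("./build/bin/main", "path/to/model.powerinfer.gguf", "Once upon a time")
```

Keeping a single long-running process avoids the repeated load time, at the cost of needing some form of inter-process communication.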
According to feedback, there is a discrepancy between the actual GPU usage and the budget.
I tried to run inference on WSL:
./build/bin/main -m ./ReluFalcon-40B-PowerInfer-GGUF/falcon-40b-relu.q4.powerinfer.gguf -n 128 -t 8 -p "Once upon a time"
and got
no CUDA-capable device is detected
current device: 231936
Any suggestions on how to fix this?
I have two NVIDIA cards in the computer: a GeForce RTX 2070 and a Tesla M40 24GB.
Are there any benchmarks of PowerInfer running on an A100? For example, compared with llama.cpp on the same A100, does PowerInfer provide a speedup, or can it significantly reduce VRAM usage?
Support for Mixtral MoE
Per-model sparse thresholds are supported for LLaMA-family models but not yet for other model architectures.
Following the way the KV parameter is added in convert.py, add similar logic in convert-hf-to-powerinfer-gguf.py. If necessary, update the MLP model repos to add a config.json.
Dear PowerInfer Contributors,
I hope this message finds you well. I am reaching out to discuss a potential enhancement to the PowerInfer inference engine, specifically regarding the memory management strategies employed during LLM inference on consumer-grade GPUs.
Upon a thorough examination of the current implementation, I have observed that while the engine adeptly handles the distribution of workload between the CPU and GPU, there may be room for optimisation in the way memory is allocated and managed, particularly during peak usage scenarios.
The crux of the matter lies in the dynamic allocation of memory for 'hot' and 'cold' neurons. While the preloading of 'hot' neurons onto the GPU is commendable for its efficiency, the allocation of memory for 'cold' neurons during runtime could potentially be streamlined. This is especially pertinent when considering the limited VRAM available on consumer-grade GPUs compared to their server-grade counterparts.
I propose a more granular control over memory allocation, which could include:
I believe that by addressing these aspects, PowerInfer could achieve even greater performance gains and efficiency, making it more accessible and practical for a wider range of users.
I would be most interested in hearing your thoughts on this matter and am keen to contribute to the development of such enhancements.
Thank you for your time and consideration.
Best regards,
yihong1120
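To make the hot/cold distinction in the proposal above concrete, here is a toy sketch of a budget-constrained split. It is not PowerInfer's actual placement policy: the function name, the per-neuron size, and the greedy rule are all assumptions chosen for illustration.

```python
from typing import List, Tuple

def split_hot_cold(activation_freq: List[float], bytes_per_neuron: int,
                   vram_budget_bytes: int) -> Tuple[List[int], List[int]]:
    """Greedily assign the most frequently activated neurons to the GPU.

    Neurons are ranked by activation frequency; as many as fit in the
    budget become 'hot' (GPU), and the rest stay 'cold' (CPU).
    """
    order = sorted(range(len(activation_freq)),
                   key=lambda i: activation_freq[i], reverse=True)
    max_hot = max(0, vram_budget_bytes // bytes_per_neuron)
    hot = sorted(order[:max_hot])
    cold = sorted(order[max_hot:])
    return hot, cold

# Hypothetical activation frequencies for five neurons.
freq = [0.9, 0.1, 0.5, 0.05, 0.7]
hot, cold = split_hot_cold(freq, bytes_per_neuron=100, vram_budget_bytes=250)
print("hot (GPU):", hot)
print("cold (CPU):", cold)
```

A finer-grained scheme along the lines the letter suggests would adjust the budget or the ranking at runtime; this sketch fixes both up front.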
Aquila and Aquila2 are LLaMA-like models with permissive commercial licenses. We hope they can be supported.
When I try to build this I get the following error:
nvcc fatal : Unknown option 'Wmissing-declarations'
The cmake files seem to be set up to pass gcc options to nvcc which AFAIK are not supported.
https://github.com/BlinkDL/RWKV-LM
On behalf of enthusiasts in the RWKV community, I strongly hope your team will consider this! Thank you very much!
Our initial implementation of PowerInfer introduced a number of compiler warnings, mostly unused variables, and made CI complain about them. A quick fix should resolve more than 95% of them.
Running in WSL, all dependencies satisfied, most recent code pulled, on an RTX 3090.
Command line:
./build/bin/main -m models/7B/llama-7b-relu.powerinfer.gguf -n 128 -t 8 -p "Once upon a time" --vram-budget 12
Log output:
Log start
main: build = 1549 (9d72668)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1703277999
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
llama_model_loader: loaded meta data with 18 key-value pairs and 355 tensors from models/7B/llama-7b-relu.powerinfer.gguf (version GGUF V3 (latest))
llama_model_loader: - tensor 0: token_embd.weight f16 [ 4096, 32000, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.attn_q.weight f16 [ 4096, 4096, 1, 1 ]
----snip----
llama_model_loader: - kv 17: tokenizer.ggml.padding_token_id u32
llama_model_loader: - type f32: 65 tensors
llama_model_loader: - type f16: 290 tensors
llama_model_load: PowerInfer model loaded. Sparse inference will be used.
llm_load_vocab: special tokens definition check successful ( 259/32000 ).
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = llama
llm_load_print_meta: vocab type = SPM
llm_load_print_meta: n_vocab = 32000
llm_load_print_meta: n_merges = 0
llm_load_print_meta: n_ctx_train = 2048
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_head = 32
llm_load_print_meta: n_head_kv = 32
llm_load_print_meta: n_layer = 32
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_gqa = 1
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: n_ff = 11008
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_yarn_orig_ctx = 2048
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: model type = 7B
llm_load_print_meta: model ftype = mostly F16
llm_load_print_meta: model params = 7.57 B
llm_load_print_meta: model size = 14.11 GiB (16.00 BPW)
llm_load_print_meta: general.name = syx
llm_load_print_meta: BOS token = 1 '<s>'
llm_load_print_meta: EOS token = 2 '</s>'
llm_load_print_meta: UNK token = 0 '<unk>'
llm_load_print_meta: PAD token = 0 '<unk>'
llm_load_print_meta: LF token = 13 '<0x0A>'
llm_load_print_meta: sparse_pred_threshold = 0.00
llama_model_load: sparse inference - vram budget = 12.00 GB
llm_load_sparse_model_tensors: ggml ctx size = 0.13 MB
llm_load_sparse_model_tensors: using CUDA for GPU acceleration
llm_load_sparse_model_tensors: mem required = 8506.63 MB
llm_load_sparse_model_tensors: VRAM used: 5939.52 MB
....................................................................................................
llama_model_loader: loaded meta data with 3 key-value pairs and 64 tensors from models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx (version GGUF V3 (latest))
llama_model_loader: - tensor 0: blk.0.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 1: blk.0.gpu_bucket i32 [ 5376, 1, 1, 1 ]
llama_model_loader: - tensor 2: blk.1.gpu_idx i32 [ 11008, 1, 1, 1 ]
----snip----
llama_model_loader: - tensor 62: blk.31.gpu_idx i32 [ 11008, 1, 1, 1 ]
llama_model_loader: - tensor 63: blk.31.gpu_bucket i32 [ 4608, 1, 1, 1 ]
llama_model_loader: unknown type i32
llama_model_loader: - kv 0: general.architecture str
llama_model_loader: - kv 1: generic.gpu_index.block_count u32
llama_model_loader: - kv 2: split.vram_capacity u64
llama_model_loader: - type i32: 64 tensors
loaded gpu_idx, vram_required: 6119997440
apply_tensors_to_base_model: applying gpu_idx adapter from 'models/7B/llama-7b-relu.powerinfer.gguf.generated.gpuidx' - please wait ...
................................................................ done (9.84 ms)
offload_ffn_split: applying augmentation to model - please wait ...
................................ done (11859.33 ms)
llm_load_gpu_split: offloaded 5790.00 MiB of FFN weights to GPU
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 256.00 MB
llama_build_graph: non-view tensors processed: 548/1028
llama_build_graph: ****************************************************************
llama_build_graph: not all non-view tensors have been processed with a callback
llama_build_graph: this can indicate an inefficiency in the graph implementation
llama_build_graph: build with LLAMA_OFFLOAD_DEBUG for more info
llama_build_graph: ref: https://github.com/ggerganov/llama.cpp/pull/3837
llama_build_graph: ****************************************************************
llama_new_context_with_model: compute buffer total size = 186.57 MB
llama_new_context_with_model: VRAM scratch buffer: 185.00 MB
llama_new_context_with_model: total VRAM used: 6124.52 MB (model: 5939.52 MB, context: 185.00 MB)
**CUDA error 1 at /home/user/Envs/PowerInfer/ggml-cuda.cu:8332: invalid argument**
current device: 0
CUDA error 4 at /home/user/Envs/PowerInfer/ggml-cuda.cu:485: driver shutting down
current device: 8192
**Segmentation fault**