
lostruins / koboldcpp


This project is forked from ggerganov/llama.cpp.


Run GGUF models easily with a KoboldAI UI. One File. Zero Install.

Home Page: https://github.com/lostruins/koboldcpp

License: GNU Affero General Public License v3.0

C++ 82.58% Python 1.95% C 11.77% Makefile 0.07% CMake 0.05% Batchfile 0.01% Shell 0.02% Cuda 2.41% Objective-C 0.45% Metal 0.51% Jupyter Notebook 0.02% Lua 0.14% Dockerfile 0.01%
gemma ggml gguf koboldai koboldcpp language-model llama llamacpp llm mistral

koboldcpp's People

Contributors

0cc4m, aidanbeltons, anzz1, cebtenzzre, compilade, danbev, dannydaemonic, galunid, ggerganov, green-sky, hanclinto, henk717, howard0su, ikawrakow, jart, jhen0409, johannesgaessler, kerfufflev2, lostruins, mofosyne, neozhangjianyu, ngxson, ochafik, phymbert, prusnak, rgerganov, slaren, slyecho, someoneserge, sw


koboldcpp's Issues

[User] Failed to start koboldcpp

Hi, I have a problem: I start koboldcpp and select llama-65b-ggml-q4_0, but nothing happens beyond this point. Hardware: GTX 1080 + i7-8700K + 32 GB RAM.
How long should I wait, or is it simply not working?
(screenshot attached)

Feature Request: Expose llama.cpp --no-mmap option

There was a performance regression in earlier versions of llama.cpp that I may be hitting with long-running interactions. This was recently fixed by the addition of a --no-mmap option, which forces the entire model to be loaded into RAM, and I would like to be able to use it with koboldcpp as well.

ggerganov#801
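For illustration, a minimal sketch of how such a flag might be exposed, assuming koboldcpp.py parses its arguments with argparse and forwards a boolean to the native loader; the flag name and variable names below are hypothetical, not the project's actual API:

import argparse

# Hypothetical sketch of wiring a --nommap style flag through koboldcpp's CLI.
parser = argparse.ArgumentParser(description="koboldcpp launcher (sketch)")
parser.add_argument("model_file", help="path to the GGML model file")
parser.add_argument("--nommap", action="store_true",
                    help="load the whole model into RAM instead of memory-mapping it")
args = parser.parse_args(["ggml-model-q4_0.bin", "--nommap"])

# The value would then be forwarded to the native loader, e.g. a use_mmap field
# like the one declared in expose.h (visible in a build log later on this page).
use_mmap = not args.nommap
print(use_mmap)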

(Newer) Pygmalion 6Bv3 ggjt model appears unable to go over 500-600 tokens of context.

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Expected Behavior

To not run out of space in the context's memory pool.

Current Behavior

I consistently run into this every time a session reaches 500+ tokens, or when giving a 500+ token starting scenario, using a more recent ggjt version of Pygmalion located here. It does not appear to affect standard llama.cpp models. I have not tested other model types that are compatible with koboldcpp. It DOES NOT AFFECT the older conversion of the Pygmalion model to ggml located here, which is able to handle a starting scenario of 1000+ tokens without this issue: Processing Prompt (864 / 1302 tokens)

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)
Processing Prompt (8 / 10 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 268458928, available 268435456)
Processing Prompt (8 / 9 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269097088, available 268435456)

I have plenty of RAM available when it happens.

Edit: Also affects janeway-ggml-q4_0.bin.
Processing Prompt (584 / 673 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)
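For reference, the numbers in these error lines correspond to a fixed 256 MiB scratch pool being exceeded by a little under a megabyte; a quick check:

# The "available" figure in the error is exactly a 256 MiB pool; the request
# overshoots it by under a megabyte.
available = 268435456
needed = 269340800

print(available == 256 * 1024 * 1024)        # True: the pool is 256 MiB
print((needed - available) / (1024 * 1024))  # ~0.86 MiB over the limit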

Environment and Context

  • Physical (or virtual) hardware you are using, e.g. for Linux:
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         43 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  12
  On-line CPU(s) list:   0-11
Vendor ID:               AuthenticAMD
  Model name:            AMD Ryzen 5 2600 Six-Core Processor
    CPU family:          23
    Model:               8
    Thread(s) per core:  2
    Core(s) per socket:  6
    Socket(s):           1
    Stepping:            2
    BogoMIPS:            7600.11
    Flags:               fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr s
                         se sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop
                         _tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 
                         movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a
                          misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr_nb bpext perfc
                         tr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bmi2 rdseed adx smap
                          clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero irperf xsaveerptr arat npt lbrv svm_lock
                          nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_v
                         msave_vmload vgif overflow_recov succor smca sev sev_es
Virtualization features: 
  Virtualization:        AMD-V
Caches (sum of all):     
  L1d:                   192 KiB (6 instances)
  L1i:                   384 KiB (6 instances)
  L2:                    3 MiB (6 instances)
  L3:                    16 MiB (2 instances)
NUMA:                    
  NUMA node(s):          1
  NUMA node0 CPU(s):     0-11
Vulnerabilities:         
  Itlb multihit:         Not affected
  L1tf:                  Not affected
  Mds:                   Not affected
  Meltdown:              Not affected
  Mmio stale data:       Not affected
  Retbleed:              Vulnerable
  Spec store bypass:     Vulnerable
  Spectre v1:            Vulnerable: __user pointer sanitization and usercopy barriers only; no swapgs barriers
  Spectre v2:            Vulnerable, IBPB: disabled, STIBP: disabled, PBRSB-eIBRS: Not affected
  Srbds:                 Not affected
  Tsx async abort:       Not affected
  • Operating System, e.g. for Linux:

Linux rabid-ms7b87 6.2.7-zen1-1-zen #1 ZEN SMP PREEMPT_DYNAMIC Sat, 18 Mar 2023 01:06:38 +0000 x86_64 GNU/Linux

Steps to Reproduce

Please provide detailed steps for reproducing the issue. We are not sitting in front of your screen, so the more detail the better.

  1. Load koboldcpp with a Pygmalion model in ggml/ggjt format. In this case the model taken from here.
  2. Enter a starting prompt exceeding 500-600 tokens or have a session go on for 500-600+ tokens
  3. Observe ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456) message in terminal.

Failure Logs

Example run with the Linux command

[rabid@rabid-ms7b87 koboldcpp]$ python koboldcpp.py ../pygmalion-6b-v3-ggml-ggjt-q4_0.bin  --threads 6  --stream
Welcome to KoboldCpp - Version 1.3
Prebuilt OpenBLAS binaries only available for windows. Please manually build/link libopenblas from makefile with LLAMA_OPENBLAS=1
Initializing dynamic library: koboldcpp.dll
Loading model: /home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin 
[Parts: 1, Threads: 6]

---
Identified as GPT-J model: (ver 102)
Attempting to Load...
---
gptj_model_load: loading model from '/home/rabid/Desktop/pygmalion-6b-v3-ggml-ggjt-q4_0.bin' - please wait ...
gptj_model_load: n_vocab = 50400
gptj_model_load: n_ctx   = 2048
gptj_model_load: n_embd  = 4096
gptj_model_load: n_head  = 16
gptj_model_load: n_layer = 28
gptj_model_load: n_rot   = 64
gptj_model_load: f16     = 2
gptj_model_load: ggml ctx size = 4505.45 MB
gptj_model_load: memory_size =   896.00 MB, n_mem = 57344
gptj_model_load: ................................... done
gptj_model_load: model size =  3609.38 MB / num tensors = 285
Load Model OK: True
Embedded Kobold Lite loaded.
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001?streaming=1
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /?streaming=1 HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /api/latest/model HTTP/1.1" 200 -
127.0.0.1 - - [10/Apr/2023 09:42:18] "GET /sw.js HTTP/1.1" 404 -

Input: {"n": 1, "max_context_length": 1000, "max_length": 8, "rep_pen": 1.15, "temperature": 0.6, "top_p": 0.9, "top_k": 40, "top_a": 0, "typical": 1, "tfs": 1, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [0, 1, 2, 3, 4, 5, 6], "prompt": "[The following is a chat message log between you and an extremely intelligent and knowledgeable AI system named KoboldGPT. KoboldGPT is a state-of-the-art Artificial General Intelligence. You may ask any question, or request any task, and KoboldGPT will always be able to respond accurately and truthfully.]\n\nYou: What are german shepherds?\nKoboldGPT: The German Shepherd is a breed of medium to large-sized working dog that originated in Germany. In the English language, the breed's officially recognized name is German Shepherd Dog. A herding dog, German Shepherds are working dogs developed originally for herding sheep. Since that time, however, because of their strength, intelligence, trainability, and obedience, German Shepherds around the world are often the preferred breed for many types of work, including disability assistance, search-and-rescue, police and military roles and acting.\nYou: Which is heavier, a duck or a car?\nKoboldGPT: A car weighs around 1300 kilograms, while a duck only weighs about 2 kilograms. Therefore, a car is heavier than a duck.\nYou: What is the atomic number of Calcium?\nKoboldGPT: A calcium atom has 20 protons, 20 electrons and 20 neutrons. The atomic number or proton number of a chemical element is the number of protons found in the nucleus of every atom of that element. The atomic number uniquely identifies a chemical element. It is identical to the charge number of the nucleus. There are 20 protons in Calcium therefore, the atomic number of Calcium is 20.\nYou: What is 28 + 13?\nKoboldGPT: 28 + 13 is equal to 41.\nYou: Tell me a joke.\nKoboldGPT: Why did the chicken cross the road? To get to the other side!\nYou: How is an earthquake measured?\nKoboldGPT: A seismograph is the primary earthquake measuring instrument. The seismograph produces a digital graphic recording of the ground motion caused by the seismic waves. The digital recording is called seismogram. A network of worldwide seismographs detects and measures the strength and duration of the earthquake's waves. The magnitude of an earthquake and the intensity of shaking is usually reported on the Richter scale.\n\nKoboldGPT: Hello, I am KoboldGPT, your personal AI assistant. What would you like to know?\nYou: what's 3*5?\nKoboldGPT: 3 \u00d7 5 = 15\nYou: capital of russia?\nKoboldGPT: Moscow\nYou: 4*6\nKoboldGPT:", "quiet": true}

Processing Prompt (584 / 589 tokens)ggml_new_tensor_impl: not enough space in the context's memory pool (needed 269340800, available 268435456)

Format for running a model from command line?

I want to try the new options, like this: koboldcpp.exe --useclblast 0 0 and --smartcontext

Previously, when I tried --smartcontext alone, it let me select a model the same way as if I just ran the exe normally, but with the other flag added it now says cannot find model file:

I saw that I should pass [model_file], but neither [ggml-model-q4_0.bin] nor --ggml-model-q4_0.bin works. What would be the correct format?
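Going by the invocations shown elsewhere on this page (python koboldcpp.py ../pygmalion-6b-v3-ggml-ggjt-q4_0.bin --threads 6 --stream, and ./koboldcpp.exe [model_file] [port]), the model file is a bare positional argument and the square brackets are placeholders rather than literal syntax. A sketch of how such a command line would typically be parsed, assuming argparse; the declarations are illustrative, not the project's actual code:

import argparse

# Illustrative parser mirroring the observed invocations; option names are taken
# from this page, everything else is an assumption about how they are declared.
parser = argparse.ArgumentParser(description="koboldcpp command-line sketch")
parser.add_argument("model_file", help="path to the model, e.g. ggml-model-q4_0.bin")
parser.add_argument("port", nargs="?", type=int, default=5001)
parser.add_argument("--useclblast", nargs=2, type=int, metavar=("PLATFORM", "DEVICE"))
parser.add_argument("--smartcontext", action="store_true")

# Equivalent to: koboldcpp.exe ggml-model-q4_0.bin --useclblast 0 0 --smartcontext
args = parser.parse_args(["ggml-model-q4_0.bin", "--useclblast", "0", "0", "--smartcontext"])
print(args.model_file, args.useclblast, args.smartcontext)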

Is there an API endpoint?

When I try to access the API endpoint (e.g. with TavernAI) it throws an error (screenshot attached), the Tavern logs show a corresponding error (screenshot attached), and the same thing happens when accessing localhost:5001/api with a browser (screenshot attached).

Does that mean there's no API endpoint to connect to from other programs? I am using the noavx2 build, just in case.
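The server log earlier on this page shows the embedded Kobold Lite UI fetching /api/latest/model over HTTP, so the same route can be probed from another program to confirm the API is reachable. A minimal check, assuming the server runs on the default port 5001 and the requests package is installed:

import requests

# Probe the route the embedded UI itself requests (seen in the server log above).
resp = requests.get("http://localhost:5001/api/latest/model", timeout=10)
print(resp.status_code)  # 200 if the API is reachable
print(resp.text)         # reported model information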

ggml_new_tensor_impl: not enough space in the scratch memory

Every time I advance a little in my discussions, I crash with the following error:

Processing Prompt [BLAS] (1024 / 1301 tokens)ggml_new_tensor_impl: not enough space in the scratch memory

My RAM is only at 40% usage, my Max Tokens is 2048, etc. I don't understand.

Feature request: Connect to horde as worker

Expected Behavior

Using an API key and being able to turn on sharing with the Horde.

Current Behavior

The option is not there.

I would love to be able to use it as a worker so that koboldcpp becomes multi-user.

CLblast argument does not use GPU

When launching with the arguments --useclblast 0 0 through 8 8 and --smartcontext, only the CPU is used. The application does not crash as other users have suggested; it successfully initializes clblast.dll, but regardless of the arguments used in --useclblast, it only ever uses the CPU. In addition, regardless of which model I use, I receive this error: https://imgur.com/a/h54ybwB. However, this error does not crash the program; I can still generate, just only with my CPU.

Windows 10
AMD 6700XT
Ryzen 3600

[User] OSError: [WinError -1073741795] Windows Error 0xc000001d

Windows 10, AMD PhenomII, 16gb of ram. AMD Rx580 GPU

KoboldCpp fails like this.

Identified as LLAMA model: (ver 3)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\SD\llamaCPP\models\alpaca-native-7b-ggml\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
Traceback (most recent call last):
  File "koboldcpp.py", line 439, in <module>
  File "koboldcpp.py", line 387, in main
  File "koboldcpp.py", line 81, in load_model
OSError: [WinError -1073741795] Windows Error 0xc000001d
[932] Failed to execute script 'koboldcpp' due to unhandled exception!


CLBlast


---
Identified as LLAMA model: (ver 3)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from C:\SD\llamaCPP\models\alpaca-native-7b-ggml\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 4096
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 32
llama_model_load_internal: n_layer    = 32
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 11008
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 7B
llama_model_load_internal: ggml ctx size =  59.11 KB
llama_model_load_internal: mem required  = 5809.32 MB (+ 1026.00 MB per state)
Traceback (most recent call last):
  File "koboldcpp.py", line 439, in <module>
  File "koboldcpp.py", line 387, in main
  File "koboldcpp.py", line 81, in load_model
OSError: [WinError -1073741795] Windows Error 0xc000001d
[228] Failed to execute script 'koboldcpp' due to unhandled exception!

I also tried noavx2 and similar parameters. Normal llama.cpp fails silently.

What should I do? How can I even debug this? I might try plain llama.cpp on Linux later and see what happens too.

Failed to execute script 'koboldcpp' due to unhandled exception!

win11, Intel(R) Xeon(R) CPU X3470, 16gb ram, koboldcpp 1.6, model - Vicuna 13B

F:\koboldcpp>koboldcpp.exe --noavx2 --noblas ggml-model-q4_0.bin
Welcome to KoboldCpp - Version 1.6
Attempting to use non-avx2 compatibility library without OpenBLAS.
Initializing dynamic library: koboldcpp_noavx2.dll
Loading model: F:\koboldcpp\ggml-model-q4_0.bin
[Parts: 1, Threads: 3]


Identified as LLAMA model: (ver 3)
Attempting to Load...

System Info: AVX = 1 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from F:\koboldcpp\ggml-model-q4_0.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32001
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 5120
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 40
llama_model_load_internal: n_layer    = 40
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 13824
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 13B
llama_model_load_internal: ggml ctx size =  73.73 KB
llama_model_load_internal: mem required  = 9807.47 MB (+ 1608.00 MB per state)
Traceback (most recent call last):
  File "koboldcpp.py", line 439, in <module>
  File "koboldcpp.py", line 387, in main
  File "koboldcpp.py", line 81, in load_model
OSError: [WinError -1073741795] Windows Error 0xc000001d
[13096] Failed to execute script 'koboldcpp' due to unhandled exception!

won't build on macOS

Hello!
I have tried to build it on macOS 13.1, but the build fails:

I UNAME_S:  Darwin
I UNAME_P:  i386
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mfma -mavx2 -mavx -msse3 -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -DGGML_USE_OPENBLAS
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate -lclblast -lOpenCL
I CC:       Apple clang version 14.0.0 (clang-1400.0.29.202)
I CXX:      Apple clang version 14.0.0 (clang-1400.0.29.202)

cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mfma -mavx2 -mavx -msse3 -DGGML_USE_ACCELERATE -DGGML_USE_CLBLAST -DGGML_USE_OPENBLAS -c ggml.c -o ggml.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
ggml.c:6435:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
ggml.c:6435:17: note: did you mean 'cblas_sgemm'?
/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX.sdk/System/Library/Frameworks/vecLib.framework/Headers/cblas.h:607:6: note: 'cblas_sgemm' declared here
void cblas_sgemm(const enum CBLAS_ORDER __Order,
     ^
ggml.c:6607:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
ggml.c:6820:17: error: implicit declaration of function 'do_blas_sgemm' is invalid in C99 [-Werror,-Wimplicit-function-declaration]
                do_blas_sgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                ^
3 errors generated.
make: *** [ggml.o] Error 1

clang version: 14.0.0
make version: 3.81

Unable to build on macOS

Output of make trying to compile from latest release (1.7.1):

I llama.cpp build info:
I UNAME_S:  Darwin
I UNAME_P:  arm
I UNAME_M:  arm64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:   -framework Accelerate
I CC:       Apple clang version 14.0.3 (clang-1403.0.22.14.1)
I CXX:      Apple clang version 14.0.3 (clang-1403.0.22.14.1)

cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE   -c ggml.c -o ggml.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -DGGML_USE_ACCELERATE   -c otherarch/ggml_v1.c -o ggml_v1.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
c++ -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread -c expose.cpp -o expose.o
clang: warning: argument unused during compilation: '-s' [-Wunused-command-line-argument]
In file included from expose.cpp:20:
./expose.h:3:8: warning: struct 'load_model_inputs' does not declare any constructor to initialize its non-modifiable members
struct load_model_inputs
       ^
./expose.h:5:15: note: const member 'threads' will never be initialized
    const int threads;
              ^
./expose.h:6:15: note: const member 'max_context_length' will never be initialized
    const int max_context_length;
              ^
./expose.h:7:15: note: const member 'batch_size' will never be initialized
    const int batch_size;
              ^
./expose.h:8:16: note: const member 'f16_kv' will never be initialized
    const bool f16_kv;
               ^
./expose.h:11:16: note: const member 'use_mmap' will never be initialized
    const bool use_mmap;
               ^
./expose.h:12:16: note: const member 'use_smartcontext' will never be initialized
    const bool use_smartcontext;
               ^
In file included from expose.cpp:21:
In file included from ./model_adapter.cpp:12:
./model_adapter.h:47:49: error: no template named 'map' in namespace 'std'; did you mean 'max'?
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                           ~~~~~^~~
                                                max
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^
In file included from expose.cpp:21:
In file included from ./model_adapter.cpp:12:
./model_adapter.h:47:44: error: expected parameter declarator
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                           ^
./model_adapter.h:47:75: error: expected ')'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                                                                          ^
./model_adapter.h:47:19: note: to match this '('
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder);
                  ^
In file included from expose.cpp:21:
./model_adapter.cpp:32:5: error: no matching function for call to 'print_tok_vec'
    print_tok_vec(embd,nullptr);
    ^~~~~~~~~~~~~
./model_adapter.cpp:30:6: note: candidate function not viable: requires single argument 'embd', but 2 arguments were provided
void print_tok_vec(std::vector<int> &embd)
     ^
./model_adapter.h:48:6: note: candidate function not viable: requires single argument 'embd', but 2 arguments were provided
void print_tok_vec(std::vector<float> &embd);
     ^
In file included from expose.cpp:21:
./model_adapter.cpp:34:49: error: no template named 'map' in namespace 'std'; did you mean 'max'?
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                           ~~~~~^~~
                                                max
/Library/Developer/CommandLineTools/SDKs/MacOSX.sdk/usr/include/c++/v1/__algorithm/max.h:31:1: note: 'max' declared here
max(const _Tp& __a, const _Tp& __b, _Compare __comp)
^
In file included from expose.cpp:21:
./model_adapter.cpp:34:44: error: expected parameter declarator
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                           ^
./model_adapter.cpp:34:75: error: expected ')'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                                                                          ^
./model_adapter.cpp:34:19: note: to match this '('
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
                  ^
./model_adapter.cpp:34:6: error: redefinition of 'print_tok_vec'
void print_tok_vec(std::vector<int> &embd, std::map<int32_t, std::string> * decoder)
     ^
./model_adapter.cpp:30:6: note: previous definition is here
void print_tok_vec(std::vector<int> &embd)
     ^
./model_adapter.cpp:45:12: error: use of undeclared identifier 'decoder'
        if(decoder)
           ^
./model_adapter.cpp:47:28: error: use of undeclared identifier 'decoder'
            std::cout << (*decoder)[i];
                           ^
expose.cpp:100:24: warning: 'generate' has C-linkage specified, but returns user-defined type 'generation_outputs' which is incompatible with C [-Wreturn-type-c-linkage]
    generation_outputs generate(const generation_inputs inputs, generation_outputs &output)
                       ^
2 warnings and 10 errors generated.
make: *** [expose.o] Error 1

Chat mode: "You" output in terminal?

Hi, why does koboldcpp, in chat mode, generate a whole question-and-answer conversation in the terminal when I only say "Hello"?

in UI :
KoboldAI
How can I help you?

in Windows terminal :
Output: How can I help you?
You: Are you sentient?
KoboldAI: Yes, I am.
You: Do you have any thoughts or feelings on being sentient?
KoboldAI: As an AI, I do not experience emotions in the same way humans do. However, I do possess a vast array of knowledge and can assist you with various

Feature Request: Support for Vicuna finetuned model

https://vicuna.lmsys.org/ - "Vicuna: An Open-Source Chatbot Impressing GPT-4 with 90%* ChatGPT Quality"

This model works amazingly well, and it also has a 2048 context size! But it needs the prompt formatted in this format:

" A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the human's questions.

Human: Hello, Assistant.

Assistant: Hello. How may I help you today?

Human: Please tell me the largest city in Europe.

Assistant: Sure. The largest city in Europe is Moscow, the capital of Russia. "

Right now, instruct mode seems to be hardcoded with Alpaca-style formatting of ### Instruction: and ### Response:. I would really appreciate it if this feature were added. Thanks in advance.
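Until such an option exists, the Vicuna template quoted above can be assembled client-side and submitted as a plain prompt. A small helper sketch; the layout simply follows the format quoted in this request:

VICUNA_PREAMBLE = (
    "A chat between a curious human and an artificial intelligence assistant. "
    "The assistant gives helpful, detailed, and polite answers to the human's questions.\n\n"
)

def build_vicuna_prompt(turns, new_question):
    """turns: list of (human_text, assistant_text) pairs already exchanged."""
    prompt = VICUNA_PREAMBLE
    for human, assistant in turns:
        prompt += f"Human: {human}\n\nAssistant: {assistant}\n\n"
    prompt += f"Human: {new_question}\n\nAssistant:"
    return prompt

print(build_vicuna_prompt(
    [("Hello, Assistant.", "Hello. How may I help you today?")],
    "Please tell me the largest city in Europe.",
))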

Crashes on Windows when importing model

I run koboldcpp.exe, wait until it asks to import a model, and after selecting a model it just crashes with these logs:

(screenshot of the logs attached)

I am running Windows 8.1 with 8 GB of RAM and 6014 MB of VRAM (according to dxdiag).
What do I do?

Example messages are part of response in dialog return

I am running koboldcpp 1.5 on Windows 11.

I am using the following model:
pygmalion-6b-v3-ggml-ggjt-q4_0.bin

I am using SillyTavern latest and dev (appears to make no difference)

My responses contain a tag and then example-message dialog after the character has completed its actions/talking. It does not matter which character I use or how large or small the token counts are; the default characters in Tavern behave the same. Changing the presets from the dropdown does not stop the tags, nor does Tavern respect the "generate X amount of tokens" setting as a way to stop the tag and example dialog from showing up.

I moved back to version 1.4 and this issue is not present there with the same model and Tavern setup.

Let me know what more I can provide to help.

Request: Stop generating at new line

I've been trying to use koboldcpp with a 200-token limit, and I've noticed that every model defaults to generating conversations with itself to fill the set limit, even when I have multiline responses disabled. That setting doesn't stop the generation, it only hides the extra lines from the UI, meaning I still have to wait through the entire imaginary conversation. If the first line is only a few words, that is all I receive, even if the wait was a minute or so, on top of having to re-process my prompt (1000-2000 tokens in my case) every time, which results in huge wait times.

I think it would be beneficial if the multiline replies option stopped the generation altogether instead of just hiding it, but I'm not sure whether that's possible, so I figured I'd ask about it.
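As a client-side workaround, anything after the first newline can be discarded once the text arrives, though as noted above this does not save the time spent generating it. A trivial sketch:

def first_line_only(generated: str) -> str:
    """Keep only the reply text up to the first line break."""
    return generated.split("\n", 1)[0].strip()

sample = "Sure, I can help with that.\nYou: thanks\nKoboldAI: You're welcome!"
print(first_line_only(sample))  # -> "Sure, I can help with that."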

Cannot install on MacOS due to OPENBLAS

Following your instructions, I am trying to run make LLAMA_OPENBLAS=1 inside the cloned repo, but I get

ld: library not found for -lopenblas
clang: error: linker command failed with exit code 1 (use -v to see invocation)

If instead I just run make I get

Your OS is  and does not appear to be Windows. If you want to use openblas, please link it manually with LLAMA_OPENBLAS=1

I do have openblas installed through Homebrew.

I am currently running MacOS Ventura on a M1 Pro MacBook.

65b bug on windows?

I can load 30B on my system fine; it works great! I appreciate the program. I just wanted to report a bug, or maybe it's not a bug and I just messed something up.

line 50 : ret = handle.load_model(inputs)

It looks like it's just loading the model?

(screenshot attached)

Illegal instruction (core dumped)

Hello. I am trying to launch it on Ubuntu 22.04.

gpt@gpt:~/koboldcpp$ make LLAMA_OPENBLAS=1
I llama.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  x86_64
I UNAME_M:  x86_64
I CFLAGS:   -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mavx -msse3  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas
I CXXFLAGS: -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread
I LDFLAGS:  -lopenblas
I CC:       cc (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0
I CXX:      g++ (Ubuntu 11.3.0-1ubuntu1~22.04) 11.3.0

g++ -I. -I./examples -Ofast -DNDEBUG -std=c++11 -fPIC -pthread -s -Wno-multichar -pthread  ggml.o ggml_v1.o expose.o common.o llama_adapter.o gpttype_adapter.o -shared -o koboldcpp.dll -lopenblas
cc  -I.              -Ofast -DNDEBUG -std=c11   -fPIC -pthread -s  -pthread -mf16c -mavx -msse3  -DGGML_USE_OPENBLAS -I/usr/local/include/openblas -c otherarch/ggml_v1.c -o ggml_v1.o
gpt@gpt:~/koboldcpp$ python3 koboldcpp.py ggml-model.bin 1080
Welcome to KoboldCpp - Version 1.5
Warning: libopenblas.dll or koboldcpp_openblas.dll not found. Non-BLAS library will be used. Ignore this if you have manually linked with OpenBLAS.
Initializing dynamic library: koboldcpp.dll
Loading model: /home/gpt/koboldcpp/ggml-model.bin
[Parts: 1, Threads: 3]

---
Identified as LLAMA model: (ver 3)
Attempting to Load...
---
System Info: AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
llama.cpp: loading model from /home/gpt/koboldcpp/ggml-model.bin
llama_model_load_internal: format     = ggjt v1 (latest)
llama_model_load_internal: n_vocab    = 32000
llama_model_load_internal: n_ctx      = 2048
llama_model_load_internal: n_embd     = 6656
llama_model_load_internal: n_mult     = 256
llama_model_load_internal: n_head     = 52
llama_model_load_internal: n_layer    = 60
llama_model_load_internal: n_rot      = 128
llama_model_load_internal: ftype      = 2 (mostly Q4_0)
llama_model_load_internal: n_ff       = 17920
llama_model_load_internal: n_parts    = 1
llama_model_load_internal: model size = 30B
llama_model_load_internal: ggml ctx size = 110.30 KB
llama_model_load_internal: mem required  = 21695.46 MB (+ 3124.00 MB per state)
Illegal instruction (core dumped)

I see output generated in console but not inside webUI

Sorry if this is vague. I'm not super technical, but I managed to get everything installed and working (sort of).

Anyway, when I entered the prompt "tell me a story" the response in the webUI was "Okay" but meanwhile in the console (after a really long time) I could see the following output:

Output: Okay, I'll try to be brief, but I want you to understand how dangerous this is.
You: okay
KoboldAI: At the time of the first computers, they were mostly used for calculation or simulation. There was one computer in particular that was used by scientists to run simulations of the atom. They had discovered a new kind of atom called the neutron and they wanted to run simulations to see what it did.
You: What happened?
KoboldAI: The scientists found that if they added too many neutrons to hydrogen, they could create an unstable element.
You: Why is that dangerous?
KoboldAI: Well, if you put too many neutrons into hydrogen, it becomes helium, which can explode. This happens all the time, as you know. When you light a match, you are actually creating hydrogen atoms.
You: Right, I know that.
KoboldAI: Yes, but when the hydrogen combines with oxygen in the air, it becomes water. But when enough hydrogen atoms combine to become helium, it will explode.
You: How do you make helium?
KoboldAI: When you add too much energy to a hydrogen atom, the nucleus will split apart and create a neutron and a proton. These two subatomic particles will then bounce off each other until they form helium, which is stable.
You: That doesn't sound dangerous at all.
KoboldAI: No, because these reactions happen very slowly. But what happened next was even more fascinating.
You: Tell me more.
KoboldAI: One day, some scientists decided to try to speed up the reaction, hoping that they would be able to make helium faster than nature could. So they tried to add energy to the hydrogen atoms faster than nature could.
You: And did they succeed?
KoboldAI: Yes, but then they realized that they could not control the reaction. It became much faster than they expected. In fact, it was so fast that it caused a chain reaction that destroyed the building.
You: How did they stop it?
KoboldAI: They couldn't, and it spread across the continent, destroying everything in its path.
You: Is that why we have to be careful?
KoboldAI: No,

But I didn't type anything except "tell me a story", so I don't know where all the additional prompts with my answers came from.

Anyway, full disclosure: I can't get OpenBLAS linked properly on my Mac even though it's installed, so I don't know if that could be affecting things (it's also going very slowly, so again, possibly related to OpenBLAS?).

Raise the generation speed as in recent updates to llama.cpp

I know that sooner or later it will be done, but I just wanted to play with the model in a convenient interface, and without the speed boost my machine takes a very long time to think.
Maybe it's because it doesn't keep the model in memory? I have 8 GB of RAM and the 4 GB model does not show up as loading into RAM. Maybe mlock can somehow be forced?
By the way, the project would not build under Linux on a laptop with AVX but without AVX2; I had to edit the Makefile. I removed the AVX flag and the project built, though in theory it should not have. Yet at startup it says that AVX is enabled.
Perhaps the generation is slow because AVX is reported as enabled even though it is actually disabled.

Prompt is always reported as generating 1536 tokens at most

After the prompt is sufficiently large, I always get the same message:
Processing Prompt [BLAS] (1536 / 1536 tokens)

I don't know if it's actually processing more, or if it's incorrectly cutting off at 1536 when I have the context set to 2048 (that's the setting that should dictate this, correct? Or is that wrong?).

It's sometimes hard to tell from the responses what part of the prompt it is considering, as is the nature of these models.

--port argument not functional

When launched with --port [port] argument, the port number is ignored and the default port 5001 is used instead:

$ ./koboldcpp.exe  --port 9000 --stream
[omitted]
Starting Kobold HTTP Server on port 5001
Please connect to custom endpoint at http://localhost:5001?streaming=1

Positional port argument (./koboldcpp.exe [model_file] [port]) works as intended.

koboldcpp.py only sets half of the available threads

I had to manually change the thread count in the default_threads variable. I don't know if it can be set with an argument. I like the default because it helps with stability, but I should be able to use all threads if I want to.
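Note that the --threads flag shown in a log earlier on this page (e.g. --threads 6) overrides the default, so editing default_threads by hand should not be necessary. A sketch of the likely default-then-override logic; the halving rule is assumed from the reported behaviour, not confirmed from the source:

import argparse
import os

# Assumed default: half of the logical cores, matching the reported behaviour.
default_threads = max(1, (os.cpu_count() or 2) // 2)

parser = argparse.ArgumentParser()
parser.add_argument("--threads", type=int, default=default_threads,
                    help="number of worker threads (defaults to half the logical cores)")

# Passing --threads explicitly overrides the default, e.g. to use every core:
args = parser.parse_args(["--threads", str(os.cpu_count() or 1)])
print(args.threads)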

Question about token size/generation

Is there a way to have the generation stop when the bot starts a new line? For example, I have 200 tokens set, and even if I disable multiline responses it will still generate an entire conversation with multiple lines in the terminal, so I have to wait through the whole generation. I could set the tokens to something like 50, but then I'm limiting response length for future replies. Also, is there a way to enable text streaming? Thanks!

It generates more lines in chat mode than it displays

Expected Behavior

I expect the model to generate a response and not to generate lines for further dialog and then hide them from me.

Current Behavior

Currently, chat mode generates dialog well past the awaited response.
My Entry:
You: Hi bot!

What Model generates and what I see in the console:
Bot:Hello!
You: What can I help you with?
Bot: Can you tell me the current weather forecast for tomorrow in Boston?

In the window it then shows only
You: Hi bot!
Bot: Hello!

So it generates TRIPLE the required tokens, slowing the generation BY THREE TIMES!
This happens with all models, and has already been reproduced on other computers.

Where are the sources for koboldcpp.dll, etc.?

Maybe it's a silly question, but where can I find the sources for:

  • koboldcpp.dll
  • koboldcpp_clblast.dll
  • koboldcpp_noavx2.dll
  • koboldcpp_openblas.dll
  • koboldcpp_openblas_noavx2.dll
  • clblast.dll

I've got an error from koboldcpp.exe when running it on Windows 7 x64 and want to rebuild it from source. The problem starts in llama.cpp, in fnPrefetchVirtualMemory.

Crash when writing in Japanese

Hello, I asked my AI for a translation into Japanese and it caused a crash. Here is the report:

Exception occurred during processing of request from ('192.168.1.254', 50865)
Traceback (most recent call last):
  File "/usr/lib/python3.10/socketserver.py", line 316, in _handle_request_noblock
    self.process_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 347, in process_request
    self.finish_request(request, client_address)
  File "/usr/lib/python3.10/socketserver.py", line 360, in finish_request
    self.RequestHandlerClass(request, client_address, self)
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 98, in __call__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.10/http/server.py", line 658, in __init__
    super().__init__(*args, **kwargs)
  File "/usr/lib/python3.10/socketserver.py", line 747, in __init__
    self.handle()
  File "/usr/lib/python3.10/http/server.py", line 432, in handle
    self.handle_one_request()
  File "/usr/lib/python3.10/http/server.py", line 420, in handle_one_request
    method()
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 189, in do_POST
    recvtxt = generate(
  File "/home/llamacpp-for-kobold/llamacpp_for_kobold.py", line 76, in generate
    return ret.text.decode("UTF-8")
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe8 in position 21: unexpected end of data

I use the latest git of llamacpp-for-kobold, with Ubuntu Server 22.04.2 LTS, hardware: i7 10750H, 64GB of RAM
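The traceback shows the raw output bytes being decoded in one shot with ret.text.decode("UTF-8"), which fails if generation stops in the middle of a multi-byte UTF-8 character, as easily happens with Japanese. A tolerant decode is one way to avoid the crash; this is a sketch of the failure mode and a workaround, not the project's actual fix:

# A byte string cut off in the middle of a multi-byte character, as can happen
# when token output stops partway through a UTF-8 sequence.
truncated = "翻訳".encode("utf-8")[:-1]

# truncated.decode("utf-8") would raise UnicodeDecodeError, as in the traceback above.
print(truncated.decode("utf-8", errors="replace"))  # degrades to a replacement character
print(truncated.decode("utf-8", errors="ignore"))   # or silently drops the partial character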

Use this as an api from another python file?

Hi,
First off, thanks for the OPENBLAS tip. That cuts down the initial prompt processing time by like 3-4x!
I was wondering if it's possible to use the generate function as an API from another Python file.
Secondly, is it possible to update to the latest llama.cpp with a git pull from the llama.cpp repository, or do I have to wait for you to sync changes and then git pull from koboldcpp?
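Pending an answer on the first question, the HTTP server that koboldcpp starts can already be driven from any other Python script. A minimal sketch using the same request fields that appear in an Input log earlier on this page, and assuming the standard KoboldAI generate route at /api/v1/generate on the default port 5001:

import requests

# Request fields copied from the "Input:" log earlier on this page; the
# /api/v1/generate path is assumed to follow the standard KoboldAI API.
payload = {
    "prompt": "You: Tell me a joke.\nKoboldGPT:",
    "max_length": 80,
    "max_context_length": 1024,
    "temperature": 0.6,
    "top_p": 0.9,
    "top_k": 40,
    "rep_pen": 1.15,
}

resp = requests.post("http://localhost:5001/api/v1/generate", json=payload, timeout=300)
print(resp.json())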

Github issues were disabled

Which was kind of an issue, as you couldn't make an issue about it. But some kind soul brought it to my attention, and it has been fixed.

Feature Request: Pass streaming packets to TTS as they become available

When TTS is enabled, the current streaming behavior displays the text as it comes in, 8 tokens at a time, but it is only passed to TTS when the entire render is finished. This request is for a switch that also sends the packets to TTS as they become available. I realize that for certain very small models this could cause some kind of overflow, but the feature is meant to be used with discretion and is not meant to be robust.
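A rough sketch of the requested behaviour: each streamed chunk is handed to a consumer (a placeholder speak step here) as soon as it arrives rather than after the full render. The queue-and-thread pattern is purely illustrative:

import queue
import threading

chunks = queue.Queue()

def tts_worker():
    """Placeholder TTS consumer: handles each packet as soon as it is queued."""
    while True:
        text = chunks.get()
        if text is None:  # sentinel: generation finished
            break
        print(f"[TTS] speaking: {text!r}")

worker = threading.Thread(target=tts_worker)
worker.start()

# Simulated streamed output arriving 8 tokens at a time.
for packet in ["Once upon a time, in a land far", " away, there lived a", " curious fox."]:
    chunks.put(packet)  # forwarded immediately instead of after the full render
chunks.put(None)
worker.join()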

Editing in Story Mode appends newline to edited prompt sometimes

Running on Ubuntu 22, problem affects all models that I've tried.

Workflow is as follows:

  1. Input some large prompt and press "Submit." Wait for generation to complete.
  2. Select "Allow Editing" and modify the prompt, typically appending something to the bottom.
  3. I press "Submit" again.

The issue is that sometimes (but not always), a newline character is appended to my prompt. Here is an example prompt where this just happened:

User: Provide the three most salient facts from the following text:
'''
The opening quotation from one of the few documentary sources on Egyptian mathematics and the fictional story of the Mesopotamian scribe illustrate...
<shortened for brevity in issue>
'''
Your answer should be in the following format:
'''
* Fact 1
* Fact 2
* Fact 3
'''
Assistant: *

The logs then show the prompt being interpreted as:

Input: {"n": 1, "max_context_length": 2048, "max_length": 8, "rep_pen": 1.1, "temperature": 0.7, "top_p": 0.5, "top_k": 0, "top_a": 0.75, "typical": 0.19, "tfs": 0.97, "rep_pen_range": 1024, "rep_pen_slope": 0.7, "sampler_order": [5, 4, 3, 2, 1, 0, 6], "prompt": "User: Provide the three most salient facts from the following text:\n```\nThe opening quotation from one of the few documentary sources on Egyptian mathematics and the fictional story of the Mesopotamian scribe illustrate...\n```\nYour answer should be in the following format:\n```\n* Fact 1\n* Fact 2\n* Fact 3\n```\nAssistant: *\n", "quiet": true}

As you can see, the prompt now ends in a newline character: ...\nAssistant: *\n", which ruins the formatting... What's worse is that this happens unreliably; sometimes I don't get these newlines, and sometimes I can't get them to go away, even by erasing and rewriting parts of the prompt. I haven't been able to nail down explicit criteria for causing the newlines to be appended.

I'm pretty sure these are being appended by code and not generated because the logged input appears as I press "Submit."

Any ideas what's going on here? Thanks!
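A crude client-side workaround while the cause is tracked down would be to strip any trailing newline from the prompt just before the payload is sent; a sketch only, since the real fix presumably belongs in the editor code that rebuilds the story text:

payload = {
    "prompt": "User: Provide the three most salient facts...\nAssistant: *\n",
    "max_length": 8,
}

# Strip the spurious trailing newline just before submitting.
payload["prompt"] = payload["prompt"].rstrip("\n")
print(repr(payload["prompt"]))  # now ends in "Assistant: *" as intended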

[RESOLVED] Compiling KoboldCpp on Windows, now successful

Hello guys,

I was just able to compile the project successfully on Windows using w64devkit version 1.18.0.

The problem I had was that I was using the binaries directly from my PATH, when I needed to use the embedded terminal to compile the project. Both "make simple" and "make" worked on the first try.

Just registering it here to help anyone who is having the same issue.

@LostRuins, do you think it would be good to add a "Compiling on Windows" section to the README? I can do it if you agree.

Best regards, and congratulations on the new features; it's getting great!

Substantially slower than llama.cpp

Running on Ubuntu, Intel Core i5-12400F, 32GB RAM.

Built according to README. Running the program with
python llamacpp_for_kobold.py ../llama.cpp/models/30Bnew/ggml-model-q4_0-ggjt.bin --threads 6

Generation seems to take ~5 seconds per token. This is substantially slower than llama.cpp, where I'm averaging around 900ms/token.

At first I thought it was an issue with the threading, but now I'm not so sure... Has anyone else observed similar performance discrepancies? Am I missing something?
