mdrokz / rust-llama.cpp
LLama.cpp rust bindings
Home Page: https://crates.io/crates/llama_cpp_rs/
License: MIT License
error occurred: Command ZERO_AR_DATE="1" "ar" "cq" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/libbinding.a" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/./llama.cpp/common/common.o" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/./llama.cpp/llama.o" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/./binding.o" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/llama.cpp/ggml.o" "/home/tc-wolf/rust-llama.cpp/target/release/build/llama_cpp_rs-75252caa56296e09/out/llama.cpp/ggml-metal.o" with args "ar" did not execute successfully (status code exit status: 1).
This happens because the cc-rs crate adds a hash to each generated object file name, to avoid collisions when a file with the same name exists in another subdirectory.
Should be fixed by #39
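For illustration, the mismatch can be mimicked in plain shell (the directory layout and hash below are made up for the demo): an `ar` invocation that hard-codes a name like `common.o` misses the hashed file cc-rs actually wrote, while matching by pattern finds it.

```shell
# Simulate cc-rs output: the object file carries a hash suffix.
mkdir -p demo_out/llama.cpp/common
touch demo_out/llama.cpp/common/common-75252caa.o

# The hard-coded name the failing ar command expects is not there.
ls demo_out/llama.cpp/common/common.o 2>/dev/null || echo "common.o: not found"

# Matching by pattern picks up the hashed object instead.
ls demo_out/llama.cpp/common/*.o
```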
When trying to build it on my aarch64 MacBook, I'm getting a build error.
cargo build --verbose
Fresh unicode-ident v1.0.9
Fresh glob v0.3.1
Fresh minimal-lexical v0.2.1
Fresh proc-macro2 v1.0.63
Fresh cfg-if v1.0.0
Fresh regex-syntax v0.7.2
Fresh libc v0.2.147
Fresh quote v1.0.29
Fresh memchr v2.5.0
Fresh libloading v0.7.4
Fresh either v1.8.1
Fresh regex v1.8.4
Fresh syn v2.0.22
Fresh nom v7.1.3
Fresh which v4.4.0
Fresh clang-sys v1.6.1
Fresh log v0.4.19
Fresh peeking_take_while v0.1.2
Fresh bitflags v2.3.3
Fresh cexpr v0.6.0
Fresh prettyplease v0.2.9
Fresh shlex v1.1.0
Fresh lazy_static v1.4.0
Fresh rustc-hash v1.1.0
Fresh lazycell v1.3.0
Fresh cc v1.0.79
Fresh bindgen v0.66.1
Compiling llama_cpp_rs v0.2.0 (/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp)
Running `/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-31c12aeaf8da45ac/build-script-build`
The following warnings were emitted during compilation:
warning: clang: warning: argument unused during compilation: '-shared' [-Wunused-command-line-argument]
warning: In file included from ./llama.cpp/examples/common.cpp:1:
warning: In file included from ./llama.cpp/examples/common.h:5:
warning: In file included from ./llama.cpp/llama.h:4:
warning: ./llama.cpp/ggml.h:254:24: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_TYPE_COUNT,
warning: ^
warning: ./llama.cpp/ggml.h:260:36: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_BACKEND_GPU_SPLIT = 20,
warning: ^
warning: ./llama.cpp/ggml.h:278:36: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_FTYPE_MOSTLY_Q6_K = 14, // except 1d tensors
warning: ^
warning: ./llama.cpp/ggml.h:355:22: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_OP_COUNT,
warning: ^
warning: ./llama.cpp/ggml.h:450:27: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_TASK_FINALIZE,
warning: ^
warning: ./llama.cpp/ggml.h:1294:23: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_OPT_LBFGS,
warning: ^
warning: ./llama.cpp/ggml.h:1303:54: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_LINESEARCH_BACKTRACKING_STRONG_WOLFE = 2,
warning: ^
warning: ./llama.cpp/ggml.h:1318:43: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: GGML_LINESEARCH_INVALID_PARAMETERS,
warning: ^
warning: In file included from ./llama.cpp/examples/common.cpp:1:
warning: In file included from ./llama.cpp/examples/common.h:5:
warning: ./llama.cpp/llama.h:124:46: warning: commas at the end of enumerator lists are a C++11 extension [-Wc++11-extensions]
warning: LLAMA_FTYPE_MOSTLY_Q6_K = 18,// except 1d tensors
warning: ^
warning: In file included from ./llama.cpp/examples/common.cpp:1:
warning: ./llama.cpp/examples/common.h:25:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t seed = -1; // RNG seed
warning: ^
warning: ./llama.cpp/examples/common.h:26:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_threads = get_num_physical_cores();
warning: ^
warning: ./llama.cpp/examples/common.h:27:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_predict = -1; // new tokens to predict
warning: ^
warning: ./llama.cpp/examples/common.h:28:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_ctx = 512; // context size
warning: ^
warning: ./llama.cpp/examples/common.h:29:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_batch = 512; // batch size for prompt processing (must be >=32 to use BLAS)
warning: ^
warning: ./llama.cpp/examples/common.h:30:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_keep = 0; // number of tokens to keep from initial prompt
warning: ^
warning: ./llama.cpp/examples/common.h:31:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t n_gpu_layers = 0; // number of layers to store in VRAM
warning: ^
warning: ./llama.cpp/examples/common.h:32:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
warning: ^
warning: ./llama.cpp/examples/common.h:33:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float tensor_split[LLAMA_MAX_DEVICES] = {0}; // how split tensors should be distributed across GPUs
warning: ^
warning: ./llama.cpp/examples/common.h:34:45: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool low_vram = 0; // if true, reduce VRAM usage at the cost of performance
warning: ^
warning: ./llama.cpp/examples/common.h:38:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t top_k = 40; // <= 0 to use vocab size
warning: ^
warning: ./llama.cpp/examples/common.h:39:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float top_p = 0.95f; // 1.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:40:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float tfs_z = 1.00f; // 1.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:41:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float typical_p = 1.00f; // 1.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:42:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float temp = 0.80f; // 1.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:43:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float repeat_penalty = 1.10f; // 1.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:44:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int32_t repeat_last_n = 64; // last n tokens to penalize (0 = disable penalty, -1 = context size)
warning: ^
warning: ./llama.cpp/examples/common.h:45:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float frequency_penalty = 0.00f; // 0.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:46:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float presence_penalty = 0.00f; // 0.0 = disabled
warning: ^
warning: ./llama.cpp/examples/common.h:47:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: int mirostat = 0; // 0 = disabled, 1 = mirostat, 2 = mirostat 2.0
warning: ^
warning: ./llama.cpp/examples/common.h:48:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float mirostat_tau = 5.00f; // target entropy
warning: ^
warning: ./llama.cpp/examples/common.h:49:31: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: float mirostat_eta = 0.10f; // learning rate
warning: ^
warning: ./llama.cpp/examples/common.h:51:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string model = "models/7B/ggml-model.bin"; // model path
warning: ^
warning: ./llama.cpp/examples/common.h:52:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string model_alias = "unknown"; // model alias
warning: ^
warning: ./llama.cpp/examples/common.h:53:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string prompt = "";
warning: ^
warning: ./llama.cpp/examples/common.h:54:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
warning: ^
warning: ./llama.cpp/examples/common.h:55:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string input_prefix = ""; // string to prefix user inputs with
warning: ^
warning: ./llama.cpp/examples/common.h:56:35: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string input_suffix = ""; // string to suffix user inputs with
warning: ^
warning: ./llama.cpp/examples/common.h:59:30: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string lora_adapter = ""; // lora adapter path
warning: ^
warning: ./llama.cpp/examples/common.h:60:30: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: std::string lora_base = ""; // base model path for the lora adapter
warning: ^
warning: ./llama.cpp/examples/common.h:62:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool memory_f16 = true; // use f16 instead of f32 for memory kv
warning: ^
warning: ./llama.cpp/examples/common.h:63:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool random_prompt = false; // do not randomize prompt if none provided
warning: ^
warning: ./llama.cpp/examples/common.h:64:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool use_color = false; // use color to distinguish generations and inputs
warning: ^
warning: ./llama.cpp/examples/common.h:65:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool interactive = false; // interactive mode
warning: ^
warning: ./llama.cpp/examples/common.h:66:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool prompt_cache_all = false; // save user input and generations to prompt cache
warning: ^
warning: ./llama.cpp/examples/common.h:67:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool prompt_cache_ro = false; // open the prompt cache read-only and do not update it
warning: ^
warning: ./llama.cpp/examples/common.h:69:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool embedding = false; // get only sentence embedding
warning: ^
warning: ./llama.cpp/examples/common.h:70:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool interactive_first = false; // wait for user input immediately
warning: ^
warning: ./llama.cpp/examples/common.h:71:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool multiline_input = false; // reverse the usage of `\`
warning: ^
warning: ./llama.cpp/examples/common.h:73:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool instruct = false; // instruction mode (used for Alpaca models)
warning: ^
warning: ./llama.cpp/examples/common.h:74:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool penalize_nl = true; // consider newlines as a repeatable token
warning: ^
warning: ./llama.cpp/examples/common.h:75:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool perplexity = false; // compute perplexity over the prompt
warning: ^
warning: ./llama.cpp/examples/common.h:76:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool use_mmap = true; // use mmap for faster loads
warning: ^
warning: ./llama.cpp/examples/common.h:77:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool use_mlock = false; // use mlock to keep model in memory
warning: ^
warning: ./llama.cpp/examples/common.h:78:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool mem_test = false; // compute maximum memory usage
warning: ^
warning: ./llama.cpp/examples/common.h:79:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool numa = false; // attempt optimizations that help on some NUMA systems
warning: ^
warning: ./llama.cpp/examples/common.h:80:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool export_cgraph = false; // export the computation graph
warning: ^
warning: ./llama.cpp/examples/common.h:81:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool verbose_prompt = false; // print prompt tokens before generation
warning: ^
warning: ./llama.cpp/examples/common.h:100:6: error: no template named 'tuple' in namespace 'std'
warning: std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params);
warning: ~~~~~^
warning: ./llama.cpp/examples/common.h:123:26: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool multiline_input = false;
warning: ^
warning: ./llama.cpp/examples/common.h:124:20: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: bool use_color = false;
warning: ^
warning: ./llama.cpp/examples/common.h:125:27: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: console_color_t color = CONSOLE_COLOR_DEFAULT;
warning: ^
warning: ./llama.cpp/examples/common.h:127:15: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: FILE* out = stdout;
warning: ^
warning: ./llama.cpp/examples/common.h:131:15: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
warning: FILE* tty = nullptr;
warning: ^
warning: ./llama.cpp/examples/common.cpp:537:6: error: no template named 'tuple' in namespace 'std'
warning: std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
warning: ~~~~~^
warning: ./llama.cpp/examples/common.cpp:538:5: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
warning: auto lparams = llama_context_default_params();
warning: ^
warning: ./llama.cpp/examples/common.cpp:556:21: error: no member named 'make_tuple' in namespace 'std'
warning: return std::make_tuple(nullptr, nullptr);
warning: ~~~~~^
warning: ./llama.cpp/examples/common.cpp:563:21: error: no member named 'make_tuple' in namespace 'std'
warning: return std::make_tuple(nullptr, nullptr);
warning: ~~~~~^
warning: ./llama.cpp/examples/common.cpp:575:25: error: no member named 'make_tuple' in namespace 'std'
warning: return std::make_tuple(nullptr, nullptr);
warning: ~~~~~^
warning: ./llama.cpp/examples/common.cpp:579:17: error: no member named 'make_tuple' in namespace 'std'
warning: return std::make_tuple(model, lctx);
warning: ~~~~~^
warning: 63 warnings and 6 errors generated.
error: failed to run custom build command for `llama_cpp_rs v0.2.0 (/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp)`
Caused by:
process didn't exit successfully: `/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-31c12aeaf8da45ac/build-script-build` (exit status: 1)
--- stdout
cargo:rerun-if-env-changed=TARGET
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_aarch64-apple-darwin
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_aarch64_apple_darwin
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS
cargo:rerun-if-changed=/Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/lib/clang/15.0.0/include/stdbool.h
TARGET = Some("aarch64-apple-darwin")
OPT_LEVEL = Some("0")
HOST = Some("aarch64-apple-darwin")
cargo:rerun-if-env-changed=CC_aarch64-apple-darwin
CC_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CC_aarch64_apple_darwin
CC_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CFLAGS_aarch64-apple-darwin
CFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CFLAGS_aarch64_apple_darwin
CFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("true")
CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
running: "cc" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "-arch" "arm64" "-I" "./llama.cpp" "-Wall" "-Wextra" "-Wall" "-Wextra" "-Wpedantic" "-Wcast-qual" "-Wdouble-promotion" "-Wshadow" "-Wstrict-prototypes" "-Wpointer-arith" "-march=native" "-mtune=native" "-o" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/./llama.cpp/ggml.o" "-c" "./llama.cpp/ggml.c"
exit status: 0
cargo:rerun-if-env-changed=AR_aarch64-apple-darwin
AR_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=AR_aarch64_apple_darwin
AR_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_AR
HOST_AR = None
cargo:rerun-if-env-changed=AR
AR = None
cargo:rerun-if-env-changed=ARFLAGS_aarch64-apple-darwin
ARFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=ARFLAGS_aarch64_apple_darwin
ARFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_ARFLAGS
HOST_ARFLAGS = None
cargo:rerun-if-env-changed=ARFLAGS
ARFLAGS = None
running: ZERO_AR_DATE="1" "ar" "cq" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/libggml.a" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/./llama.cpp/ggml.o"
exit status: 0
running: "ar" "s" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/libggml.a"
exit status: 0
cargo:rustc-link-lib=static=ggml
cargo:rustc-link-search=native=/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out
TARGET = Some("aarch64-apple-darwin")
OPT_LEVEL = Some("0")
HOST = Some("aarch64-apple-darwin")
cargo:rerun-if-env-changed=CXX_aarch64-apple-darwin
CXX_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CXX_aarch64_apple_darwin
CXX_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CXX
HOST_CXX = None
cargo:rerun-if-env-changed=CXX
CXX = None
cargo:rerun-if-env-changed=CXXFLAGS_aarch64-apple-darwin
CXXFLAGS_aarch64-apple-darwin = None
cargo:rerun-if-env-changed=CXXFLAGS_aarch64_apple_darwin
CXXFLAGS_aarch64_apple_darwin = None
cargo:rerun-if-env-changed=HOST_CXXFLAGS
HOST_CXXFLAGS = None
cargo:rerun-if-env-changed=CXXFLAGS
CXXFLAGS = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
DEBUG = Some("true")
CARGO_CFG_TARGET_FEATURE = Some("aes,crc,dit,dotprod,dpb,dpb2,fcma,fhm,flagm,fp16,frintts,jsconv,lor,lse,neon,paca,pacg,pan,pmuv3,ras,rcpc,rcpc2,rdm,sb,sha2,sha3,ssbs,vh")
running: "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "-arch" "arm64" "-shared" "-I" "./llama.cpp/examples" "-I" "./llama.cpp" "-Wall" "-Wextra" "-Wall" "-Wdeprecated-declarations" "-Wunused-but-set-variable" "-Wextra" "-Wpedantic" "-Wcast-qual" "-Wno-unused-function" "-Wno-multichar" "-march=native" "-mtune=native" "-o" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/./llama.cpp/examples/common.o" "-c" "./llama.cpp/examples/common.cpp"
cargo:warning= bool interactive_first = false; // wait for user input immediately
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:71:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool multiline_input = false; // reverse the usage of `\`
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:73:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool instruct = false; // instruction mode (used for Alpaca models)
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:74:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool penalize_nl = true; // consider newlines as a repeatable token
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:75:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool perplexity = false; // compute perplexity over the prompt
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:76:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool use_mmap = true; // use mmap for faster loads
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:77:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool use_mlock = false; // use mlock to keep model in memory
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:78:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool mem_test = false; // compute maximum memory usage
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:79:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool numa = false; // attempt optimizations that help on some NUMA systems
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:80:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool export_cgraph = false; // export the computation graph
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:81:28: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool verbose_prompt = false; // print prompt tokens before generation
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:100:6: error: no template named 'tuple' in namespace 'std'
cargo:warning=std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params);
cargo:warning=~~~~~^
cargo:warning=./llama.cpp/examples/common.h:123:26: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool multiline_input = false;
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:124:20: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= bool use_color = false;
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:125:27: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= console_color_t color = CONSOLE_COLOR_DEFAULT;
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:127:15: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= FILE* out = stdout;
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.h:131:15: warning: default member initializer for non-static data member is a C++11 extension [-Wc++11-extensions]
cargo:warning= FILE* tty = nullptr;
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.cpp:537:6: error: no template named 'tuple' in namespace 'std'
cargo:warning=std::tuple<struct llama_model *, struct llama_context *> llama_init_from_gpt_params(const gpt_params & params) {
cargo:warning=~~~~~^
cargo:warning=./llama.cpp/examples/common.cpp:538:5: warning: 'auto' type specifier is a C++11 extension [-Wc++11-extensions]
cargo:warning= auto lparams = llama_context_default_params();
cargo:warning= ^
cargo:warning=./llama.cpp/examples/common.cpp:556:21: error: no member named 'make_tuple' in namespace 'std'
cargo:warning= return std::make_tuple(nullptr, nullptr);
cargo:warning= ~~~~~^
cargo:warning=./llama.cpp/examples/common.cpp:563:21: error: no member named 'make_tuple' in namespace 'std'
cargo:warning= return std::make_tuple(nullptr, nullptr);
cargo:warning= ~~~~~^
cargo:warning=./llama.cpp/examples/common.cpp:575:25: error: no member named 'make_tuple' in namespace 'std'
cargo:warning= return std::make_tuple(nullptr, nullptr);
cargo:warning= ~~~~~^
cargo:warning=./llama.cpp/examples/common.cpp:579:17: error: no member named 'make_tuple' in namespace 'std'
cargo:warning= return std::make_tuple(model, lctx);
cargo:warning= ~~~~~^
cargo:warning=63 warnings and 6 errors generated.
exit status: 1
--- stderr
error occurred: Command "c++" "-O0" "-ffunction-sections" "-fdata-sections" "-fPIC" "-gdwarf-2" "-fno-omit-frame-pointer" "-arch" "arm64" "-shared" "-I" "./llama.cpp/examples" "-I" "./llama.cpp" "-Wall" "-Wextra" "-Wall" "-Wdeprecated-declarations" "-Wunused-but-set-variable" "-Wextra" "-Wpedantic" "-Wcast-qual" "-Wno-unused-function" "-Wno-multichar" "-march=native" "-mtune=native" "-o" "/Users/jorgosnomikos/RustroverProjects/rust-llama.cpp/target/debug/build/llama_cpp_rs-16e92bd0bb55faf0/out/./llama.cpp/examples/common.o" "-c" "./llama.cpp/examples/common.cpp" with args "c++" did not execute successfully (status code exit status: 1).
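The warnings ("default member initializer ... is a C++11 extension") and errors ("no template named 'tuple'") all indicate the compiler is running in a pre-C++11 dialect. A hedged sketch of a possible fix in the crate's build script, using the cc crate's `flag_if_supported` (the file list here is illustrative, not the crate's actual build.rs):

```rust
// Hypothetical build.rs excerpt: request C++11 explicitly so that
// std::tuple and default member initializers in common.h compile.
fn main() {
    cc::Build::new()
        .cpp(true)
        .include("./llama.cpp")
        .include("./llama.cpp/examples")
        .file("./llama.cpp/examples/common.cpp")
        .flag_if_supported("-std=c++11") // no-op on compilers that reject it
        .compile("common");
}
```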
Llama.cpp has had support for BNF-style grammars for a while, but I don't see how to use them with these bindings.
Is there a way?
If not, is there a good starting place for hooking them up? I could take a whack at it, but I don't know a lot about C++ <-> Rust bindings.
Hi, I cannot compile on my Win11 machine.
This is the verbose output:

PS C:\Users\gtnom\RustroverProjects\rust-llama.cpp> cargo build --verbose
Fresh unicode-ident v1.0.9
Fresh glob v0.3.1
Fresh minimal-lexical v0.2.1
Fresh regex-syntax v0.7.2
Fresh either v1.8.1
Fresh once_cell v1.18.0
Fresh log v0.4.19
Fresh shlex v1.1.0
Fresh lazy_static v1.4.0
Fresh proc-macro2 v1.0.63
Fresh regex v1.8.4
Fresh lazycell v1.3.0
Fresh rustc-hash v1.1.0
Fresh bitflags v2.3.3
Fresh peeking_take_while v0.1.2
Fresh cc v1.0.79
Fresh libc v0.2.147
Fresh winapi v0.3.9
Fresh quote v1.0.29
Fresh memchr v2.5.0
Fresh nom v7.1.3
Fresh syn v2.0.22
Fresh libloading v0.7.4
Fresh which v4.4.0
Fresh prettyplease v0.2.9
Fresh clang-sys v1.6.1
Fresh cexpr v0.6.0
Fresh bindgen v0.66.1
Compiling llama_cpp_rs v0.2.0 (C:\Users\gtnom\RustroverProjects\rust-llama.cpp)
Running C:\Users\gtnom\RustroverProjects\rust-llama.cpp\target\debug\build\llama_cpp_rs-684aac4c827c5037\build-script-build
The following warnings were emitted during compilation:
warning: cl : Command line error D8021 : invalid numeric argument '/Wextra'
error: failed to run custom build command for llama_cpp_rs v0.2.0 (C:\Users\gtnom\RustroverProjects\rust-llama.cpp)
Caused by:
process didn't exit successfully: C:\Users\gtnom\RustroverProjects\rust-llama.cpp\target\debug\build\llama_cpp_rs-684aac4c827c5037\build-script-build
(exit code: 1)
--- stdout
cargo:rerun-if-env-changed=TARGET
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_x86_64-pc-windows-msvc
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_x86_64_pc_windows_msvc
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS
cargo:rerun-if-changed=C:\Program Files\LLVM\lib\clang\17\include\stdbool.h
TARGET = Some("x86_64-pc-windows-msvc")
OPT_LEVEL = Some("0")
HOST = Some("x86_64-pc-windows-msvc")
cargo:rerun-if-env-changed=CC_x86_64-pc-windows-msvc
CC_x86_64-pc-windows-msvc = None
cargo:rerun-if-env-changed=CC_x86_64_pc_windows_msvc
CC_x86_64_pc_windows_msvc = None
cargo:rerun-if-env-changed=HOST_CC
HOST_CC = None
cargo:rerun-if-env-changed=CC
CC = None
cargo:rerun-if-env-changed=CFLAGS_x86_64-pc-windows-msvc
CFLAGS_x86_64-pc-windows-msvc = None
cargo:rerun-if-env-changed=CFLAGS_x86_64_pc_windows_msvc
CFLAGS_x86_64_pc_windows_msvc = None
cargo:rerun-if-env-changed=HOST_CFLAGS
HOST_CFLAGS = None
cargo:rerun-if-env-changed=CFLAGS
CFLAGS = None
cargo:rerun-if-env-changed=CRATE_CC_NO_DEFAULTS
CRATE_CC_NO_DEFAULTS = None
CARGO_CFG_TARGET_FEATURE = Some("fxsr,sse,sse2")
DEBUG = Some("true")
running: "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\cl.exe" "-nologo" "-MD" "-Z7" "-Brepro" "-I" "./llama.cpp" "-W4" "-Wall" "-Wextra" "-Wpedantic" "-Wcast-qual" "-Wdouble-promotion" "-Wshadow" "-Wstrict-prototypes" "-Wpointer-arith" "-march=native" "-mtune=native" "-FoC:\Users\gtnom\RustroverProjects\rust-llama.cpp\target\debug\build\llama_cpp_rs-261afeb35ceff647\out\./llama.cpp/ggml.o" "-c" "./llama.cpp/ggml.c"
cargo:warning=cl : Command line error D8021 : invalid numeric argument '/Wextra'
exit code: 2
--- stderr
error occurred: Command "C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.36.32532\bin\HostX64\x64\cl.exe" "-nologo" "-MD" "-Z7" "-Brepro" "-I" "./llama.cpp" "-W4" "-Wall" "-Wextra" "-Wpedantic" "-Wcast-qual" "-Wdouble-promotion" "-Wshadow" "-Wstrict-prototypes" "-Wpointer-arith" "-march=native" "-mtune=native" "-FoC:\Users\gtnom\RustroverProjects\rust-llama.cpp\target\debug\build\llama_cpp_rs-261afeb35ceff647\out\./llama.cpp/ggml.o" "-c" "./llama.cpp/ggml.c" with args "cl.exe" did not execute successfully (status code exit code: 2).
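The D8021 error comes from GCC-style flags (`-Wextra`, `-march=native`, ...) being passed verbatim to MSVC's cl.exe, which rejects them. A hedged sketch of one way a build script can avoid this with the cc crate (the file list is illustrative, not the crate's actual build.rs):

```rust
// Hypothetical build.rs excerpt: probe each warning flag instead of passing
// it unconditionally, so compilers that reject it (like cl.exe) skip it.
fn main() {
    let mut build = cc::Build::new();
    build.file("./llama.cpp/ggml.c").include("./llama.cpp");
    for flag in ["-Wall", "-Wextra", "-Wpedantic", "-march=native", "-mtune=native"] {
        build.flag_if_supported(flag); // silently dropped when unsupported
    }
    build.compile("ggml");
}
```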
Some prompts fail with:
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Utf8Error { valid_up_to: 0, error_len: None }', /home/kimt/.cargo/registry/src/index.crates.io-6f17d22bba15001f/llama_cpp_rs-0.3.0/src/lib.rs:528:46
I'm using this model: https://huggingface.co/TheBloke/Luna-AI-Llama2-Uncensored-GGUF
Anything I can do to try to fix this problem?
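A Utf8Error with error_len: None usually means a multi-byte UTF-8 character was split across a token boundary and then decoded strictly. A minimal stdlib demonstration (the model and the crate are not involved; the split is simulated):

```rust
fn main() {
    // "é" is two bytes in UTF-8 (0xC3 0xA9). If a token boundary falls
    // between them, the partial byte sequence is not valid UTF-8 on its own.
    let bytes = "café".as_bytes(); // [0x63, 0x61, 0x66, 0xC3, 0xA9]
    let head = &bytes[..4]; // cuts the 'é' in half

    // Strict decoding fails; this is the Err that unwrap() panics on:
    assert!(std::str::from_utf8(head).is_err());

    // A lossy decode substitutes U+FFFD instead of panicking:
    assert_eq!(String::from_utf8_lossy(head), "caf\u{FFFD}");
}
```

Buffering the raw bytes and decoding lossily (or deferring decoding until the sequence is complete) would avoid the panic, though that would be a change inside the binding's token callback path.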
Can't utilize the GPU on Mac with:

llama_cpp_rs = { git = "https://github.com/mdrokz/rust-llama.cpp", version = "0.3.0", features = [
    "metal",
] }
Code
use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

fn main() {
    let model_options = ModelOptions {
        n_gpu_layers: 1,
        ..Default::default()
    };

    let llama = LLama::new("zephyr-7b-alpha.Q2_K.gguf".into(), &model_options);
    println!("llama: {:?}", llama);

    let predict_options = PredictOptions {
        tokens: 0,
        threads: 14,
        top_k: 90,
        top_p: 0.86,
        token_callback: Some(Box::new(|token| {
            println!("token1: {}", token);
            true
        })),
        ..Default::default()
    };

    llama
        .unwrap()
        .predict(
            "what are the national animals of india".into(),
            predict_options,
        )
        .unwrap();
}
Error
llama_new_context_with_model: kv self size = 64.00 MB
llama_new_context_with_model: ggml_metal_init() failed
llama: Err("Failed to load model")
thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: "Failed to load model"', src/main.rs:40:10
To use Metal, the ggml-metal.metal file needs to be placed in the current directory. In the llm crate we have a little hack that patches ggml-metal.m to include the contents of that file directly in the source code, which is more convenient. See https://github.com/rustformers/llm/blob/9376078c12ea1990bd42e63432656819a056d379/crates/ggml/sys/build.rs#L198
The same hack can be applied here too. I can make a PR if this is deemed a good idea...
Is there no Rust binding to get the embeddings?
Using llama.cpp directly, one would run:
./embedding -m ./path/to/model --log-disable -p "Hello World!" 2>/dev/null
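A hedged sketch of what this might look like through the bindings. The `embeddings` field on ModelOptions and the `embeddings` method on LLama are assumptions (mirroring the upstream go-llama.cpp binding this crate is ported from); check the crate source before relying on them:

```rust
use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

fn main() {
    // Assumption: ModelOptions exposes an `embeddings` switch analogous to
    // llama.cpp's --embedding flag; field name unverified.
    let model_options = ModelOptions {
        embeddings: true,
        ..Default::default()
    };
    let llama = LLama::new("./path/to/model.gguf".into(), &model_options).unwrap();

    // Assumption: an `embeddings` method returning the embedding vector;
    // signature unverified.
    let emb = llama
        .embeddings("Hello World!".into(), PredictOptions::default())
        .unwrap();
    println!("embedding has {} dimensions", emb.len());
}
```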
I'm in a context where I have to instantiate a LLama instance once and then call it across threads.
Within the compiler errors I see this, which is probably of use:
error: future cannot be sent between threads safely
...
help: within `LLama`, the trait `std::marker::Send` is not implemented for `*mut c_void`
Is there any way to make LLama thread-safe? Or maybe some way to accomplish more or less the same thing, where one model is called to generate text from multiple threads?
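LLama is !Send because it holds a raw pointer to the llama.cpp context. One common workaround is a newtype that asserts Send and is only ever accessed through a Mutex, so the context is touched by one thread at a time. A sketch with a stand-in type (Model and SendModel are illustrative, not the crate's API, and the `unsafe impl` is only sound if every access really is serialized):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

// Stand-in for LLama: it holds a raw pointer, which makes it !Send.
struct Model(*mut std::ffi::c_void);

// Workaround: a newtype that asserts Send. Sound only if all access to the
// inner context goes through the Mutex below.
struct SendModel(Model);
unsafe impl Send for SendModel {}

// Run `n` "predictions" against one shared model, one thread at a time.
fn predict_from_threads(n: usize) -> Vec<String> {
    let shared = Arc::new(Mutex::new(SendModel(Model(std::ptr::null_mut()))));
    let handles: Vec<_> = (0..n)
        .map(|i| {
            let m = Arc::clone(&shared);
            thread::spawn(move || {
                let _guard = m.lock().unwrap(); // exclusive access to the context
                format!("thread {i}: predict() would run here")
            })
        })
        .collect();
    handles.into_iter().map(|h| h.join().unwrap()).collect()
}

fn main() {
    for line in predict_from_threads(4) {
        println!("{line}");
    }
}
```

Note that this serializes generation; it shares one model across threads but does not make inference concurrent.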
I'm running the example script with a few different models:
use llama_cpp_rs::{
    options::{ModelOptions, PredictOptions},
    LLama,
};

pub fn llama_predict() -> Result<String, anyhow::Error> {
    // metal seems to give really bad results
    let model_options = ModelOptions {
        //n_gpu_layers: 1,
        ..Default::default()
    };
    // let model_options = ModelOptions::default();

    let llama = LLama::new(
        "models/mistral-7b-instruct-v0.1.Q4_0.gguf".into(),
        &model_options,
    )
    .unwrap();

    let predict_options = PredictOptions {
        //top_k: 20,
        // top_p: 0.1,
        // f16_kv: true,
        token_callback: Some(Box::new(|token| {
            println!("token: {}", token);
            true
        })),
        ..Default::default()
    };

    // TODO: get this working on master. Metal support is flakey.
    let response = llama
        .predict(
            "what are the national animals of india".into(),
            predict_options,
        )
        .unwrap();
    println!("Response: {}", response);
    Ok(response)
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn test_llama_cpp_rs() -> Result<(), anyhow::Error> {
        let response = llama_predict()?;
        println!("Response: {}", response);
        assert!(!response.is_empty());
        Ok(())
    }
}
When not using Metal (not setting n_gpu_layers), the models generate tokens, e.g.:
token: ind
token: ian
token: national
token: animal
token: is
token: t
token: iger
token:
Response: indian national animal is tiger
Response: indian national animal is tiger
When I use n_gpu_layers, it does not generate tokens, e.g.:
llama_new_context_with_model: n_ctx = 512
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: kv self size = 64.00 MiB
llama_build_graph: non-view tensors processed: 740/740
llama_new_context_with_model: compute buffer total size = 76.07 MiB
llama_new_context_with_model: max tensor size = 102.54 MiB
count 0
token:
token:
token:
token:
...
Response:
Response:
Is this a known behavior?
This is in the readme, but it doesn't actually work:
git clone --recurse-submodules https://github.com/mdrokz/rust-llama.cpp
Running this within the repo fixes it:
git submodule add https://github.com/ggerganov/llama.cpp/ llama.cpp
It still won't build, though. Looks like breaking changes in llama.cpp.
Hi, just wanted to say thank you for creating this project! I am testing out building a simple application, identical to your example, but setting the crate type as a lib and building with wasm-pack. And I get the following error:
cargo:warning=clang: warning: argument unused during compilation: '-march=native' [-Wunused-command-line-argument]
cargo:warning=In file included from ./llama.cpp/ggml.c:4:
cargo:warning=./llama.cpp/ggml-impl.h:7:10: fatal error: 'assert.h' file not found
cargo:warning=#include <assert.h>
cargo:warning= ^~~~~~~~~~
cargo:warning=1 error generated.
exit status: 1
I am fairly new to Rust; any ideas on how to work around this? I am running on macOS and just building with "wasm-pack build".
@mdrokz, are you planning to maintain this project? I saw it uses a pretty old llama.cpp version.
When enabling the cuda feature, I get the following error on windows:
[...]
running: "nvcc" "-O0" "-ffunction-sections" "-fdata-sections" "-g" "-fno-omit-frame-pointer" "-m64" "-I" "./llama.cpp/ggml-cuda.h" "-Wall" "-Wextra" "--forward-unknown-to-host-compiler" "-arch=native" "/W4" "/Wall" "/wd4820" "/wd4710" "/wd4711" "/wd4820" "/wd4514" "-DGGML_USE_CUBLAS" "-DGGML_CUDA_DMMV_X=32" "-DGGML_CUDA_DMMV_Y=1" "-DK_QUANTS_PER_ITERATION=2" "-Wno-pedantic" "-o" "C:\\dev\\ai_kuinox\\target\\debug\\build\\llama_cpp_rs-dbbb5a5dac5f7f5e\\out\\./llama.cpp/ggml-cuda.o" "-c" "./llama.cpp/ggml-cuda.cu"
nvcc fatal : A single input file is required for a non-link phase when an outputfile is specified
exit code: 1
When running the example from the readme, the build fails: it cannot find the ggml.o file or directory. I can't figure out why. Has anyone had a successful build on macOS with an M2 chip?
When trying to build it on my Ubuntu 22.04, I'm getting a build error.
cargo build
+ cargo build
Compiling proc-macro2 v1.0.63
Compiling quote v1.0.29
Compiling libc v0.2.147
Compiling memchr v2.5.0
Compiling glob v0.3.1
Compiling unicode-ident v1.0.9
Compiling prettyplease v0.2.9
Compiling cfg-if v1.0.0
Compiling minimal-lexical v0.2.1
Compiling bindgen v0.66.1
Compiling regex-syntax v0.7.2
Compiling either v1.8.1
Compiling bitflags v2.3.3
Compiling rustc-hash v1.1.0
Compiling lazy_static v1.4.0
Compiling shlex v1.1.0
Compiling lazycell v1.3.0
Compiling libloading v0.7.4
Compiling log v0.4.19
Compiling peeking_take_while v0.1.2
Compiling cc v1.0.79
Compiling clang-sys v1.6.1
Compiling nom v7.1.3
Compiling which v4.4.0
Compiling syn v2.0.22
Compiling regex v1.8.4
Compiling cexpr v0.6.0
Compiling llama_cpp_rs v0.3.0 (/home/rodrigo/Documents/SRS/rust-llama.cpp)
error: failed to run custom build command for `llama_cpp_rs v0.3.0 (/home/rodrigo/Documents/SRS/rust-llama.cpp)`
Caused by:
process didn't exit successfully: `/home/rodrigo/Documents/SRS/rust-llama.cpp/target/debug/build/llama_cpp_rs-3e62109abc25cc59/build-script-build` (exit status: 101)
--- stdout
cargo:rerun-if-env-changed=TARGET
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_x86_64-unknown-linux-gnu
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS_x86_64_unknown_linux_gnu
cargo:rerun-if-env-changed=BINDGEN_EXTRA_CLANG_ARGS
--- stderr
thread 'main' panicked at 'Unable to find libclang: "couldn't find any valid shared libraries matching: ['libclang.so', 'libclang-*.so', 'libclang.so.*', 'libclang-*.so.*'], set the `LIBCLANG_PATH` environment variable to a path where one of these files can be found (invalid: [])"', /home/rodrigo/.cargo/registry/src/index.crates.io-6f17d22bba15001f/bindgen-0.66.1/lib.rs:604:31
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
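This is bindgen failing to locate the libclang shared library at build time. On Ubuntu 22.04 it can be installed via apt, or LIBCLANG_PATH can be pointed at an existing install (the llvm-14 path below is what Ubuntu 22.04 ships by default; the version may differ on other systems):

```shell
# Install libclang so bindgen can find it (Ubuntu 22.04)
sudo apt-get update
sudo apt-get install -y clang libclang-dev

# If bindgen still cannot find it, point LIBCLANG_PATH at the directory
# containing libclang.so (path varies with the LLVM version installed):
export LIBCLANG_PATH=/usr/lib/llvm-14/lib
cargo build
```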
First, thanks for your work :)
I'm trying to silence llama.cpp output and keep only the answer.
I've closed stderr temporarily while loading the model (this is not a nice approach, but it works).
// Save the real stderr, then point fd 2 at /dev/null while the model loads.
// (The original fdopen/dup2 version was broken: fdopen returns a FILE*,
// not an fd, and "w" was not NUL-terminated.)
let saved = unsafe { libc::dup(libc::STDERR_FILENO) };
let devnull = unsafe {
    libc::open(b"/dev/null\0".as_ptr() as *const libc::c_char, libc::O_WRONLY)
};
unsafe {
    libc::dup2(devnull, libc::STDERR_FILENO);
}

let llama = LLama::new(model, &options);

unsafe {
    // Restore the original stderr.
    libc::dup2(saved, libc::STDERR_FILENO);
    libc::close(devnull);
    libc::close(saved);
}
But when I call predict I still have an unwanted output: count 0.
Maybe you can change it to log::debug!("count {}", reverse_count)?
I've been playing around with the Python and Rust bindings of llama and noticed that Python was producing content much faster despite the same model and input.
When I printed out the args/specs of the run, I noticed some things were missing from the Rust binding that Python was using:
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: n_yarn_orig_ctx = 2048
I am not sure whether Python is using better settings or I am using inappropriately poor ones, since I have been playing around with threads, n_batch, batch, and n_gpu_layers.
I tried to find comments in the Rust code but couldn't find anything.
Any recommendations?
When attempting to run dolphin-2_6-phi-2.Q4_0.gguf, I'm getting: error loading model: unknown model architecture: 'phi2'.
Phi2 support was added a couple of weeks ago: ggerganov/llama.cpp#4490.
Is there a way to include this?
For reference, I am using this repo as part of a different package using current master:
[dependencies]
llama_cpp_rs = { git = "https://github.com/mdrokz/rust-llama.cpp.git", rev = "4922cac", features = ["metal"] }
Hello!
I'm trying to run the basic CPU example in the repo and I'm facing the following error when trying to load the "wizard-vicuna-13B.ggmlv3.q4_0.bin" model:
gguf_init_from_file: invalid magic number 67676a74
error loading model: llama_model_loader: failed to load model from /<hidden>/models/wizard-vicuna-13B.ggmlv3.q4_0.bin
llama_load_model_from_file: failed to load model
called `Result::unwrap()` on an `Err` value: "Failed to load model"
thread 'llama::tests::cuda_inference' panicked at 'called `Result::unwrap()` on an `Err` value: "Failed to load model"', app/llm/src/llama.rs:84:127
stack backtrace:
Then I tried other .gguf models, and in all my attempts the code would load the model but get stuck in the prediction until I got a free() error (which would take some minutes).
Does llama.cpp not support .bin files, and are the llama models just so heavy that I can't run them on my notebook? (I have an Intel® Core™ i5-12500H and an NVIDIA® GeForce® RTX™ 3050 Ti with 4 GB of GDDR6.)
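The magic number in that error explains the failure: 0x67676a74 decodes to ASCII "ggjt", the old pre-GGUF ggml container that .ggmlv3.q4_0.bin files use, while current llama.cpp only loads files beginning with "GGUF". A stdlib check of the decode (converting the model to GGUF, e.g. with llama.cpp's conversion scripts, is the likely fix):

```rust
// Decode a 4-byte magic (as printed big-endian in the error) to ASCII.
fn magic_to_ascii(magic: u32) -> String {
    magic.to_be_bytes().iter().map(|&b| b as char).collect()
}

fn main() {
    // The rejected .bin file starts with the old "ggjt" (ggml v3) magic:
    assert_eq!(magic_to_ascii(0x6767_6a74), "ggjt");
    // A GGUF file begins with the literal bytes "GGUF" instead.
    println!("0x67676a74 = {:?}", magic_to_ascii(0x6767_6a74));
}
```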