
mybigday / llama.rn

React Native binding of llama.cpp

License: MIT License

JavaScript 0.15% C++ 55.59% TypeScript 1.16% Swift 0.01% Ruby 0.10% C 36.68% Objective-C 4.62% Objective-C++ 0.64% Shell 0.08% CMake 0.05% Java 0.92%
android ios llama llama-cpp llm react-native

llama.rn's People

Contributors

jhen0409, smashinfries


llama.rn's Issues

Feature Request: TextStreaming

Is it possible to add a text streaming feature? It looks like you're loading a local C++ server; I wonder, does Swift support sockets for React Native? Inference is quite slow on mobile devices right now, and streaming would help the user know something is happening. I'm interested in contributing if you need contributors. I believe streaming is supported by llama.cpp in LangChain's implementation, but I'm not sure if that's custom. A sketch of what I mean follows below.
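
For what it's worth, the completion API in this library already accepts a per-token callback (see the code snippet in the context-limit issue below), which gives token-by-token streaming without sockets. A minimal sketch, where '/path/to/model.gguf' is a placeholder for a real on-device file:

import { initLlama } from 'llama.rn';

// Streaming sketch: completion() takes a second argument that is invoked
// once per generated token, so the UI can render text as it arrives.
const streamCompletion = async (prompt: string, onToken: (t: string) => void) => {
  const context = await initLlama({ model: '/path/to/model.gguf', n_ctx: 2048 });
  const result = await context.completion(
    { prompt, n_predict: 128 },
    (data) => onToken(data.token), // fires for each partial token
  );
  return result.text; // full text once generation finishes
};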

LLaVa support

llama.cpp includes a LLaVA example (plus clip.cpp), which we could use to provide vision support. We may implement it after #30 is done.

Also, it would be great to make another package named clip.rn or react-native-clip, but I'm afraid we don't currently have the resources to maintain it, so just keep it in mind.

cannot load model

There is an issue in the README regarding model loading: it mentions the GGUF model format but lacks clear instructions. Is file loading implemented yet? The result is always "No model found".
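
For reference, a minimal loading sketch consistent with the snippet in the context-limit issue below; it assumes the .gguf file was already downloaded or copied into the app's document directory (the file name is a placeholder):

import * as FileSystem from 'expo-file-system';
import { initLlama } from 'llama.rn';

// Placeholder path: the .gguf file must already exist on-device, e.g. after
// fetching it with FileSystem.downloadAsync or bundling it with the app.
const modelPath = `${FileSystem.documentDirectory}model.gguf`;

const loadModel = async () => {
  const info = await FileSystem.getInfoAsync(modelPath);
  if (!info.exists) {
    // Matches the reported symptom: the file simply isn't where we look.
    throw new Error(`No model found at ${modelPath}`);
  }
  return initLlama({ model: modelPath, n_ctx: 2048 }); // resolves with a context on success
};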

Error Initializing Llama Model: Context Limit Reached

Description

When using this library in an Expo v50 app, I hit a persistent error that stopped completions from working. The app functioned correctly at first, but after a series of prompts it failed with the error "[Error: Context limit reached]". I suspect this is an easy fix on my end? Any help is appreciated :)

Steps to Reproduce

  1. Initialize the Llama model with standard settings.
  2. Send multiple messages to the model.
  3. Observe the failure after several interactions.

Environment

  • Expo version: v50
  • Model: phi2 3B Q2_K - Medium
  • Device: iPhone 14 Pro Max (development build)

Code Snippet

import * as FileSystem from 'expo-file-system';
import { initLlama } from 'llama.rn';

export const runLlama = async (message: string, modelId: string) => {
  const MODEL_FILE_PATH = `${FileSystem.documentDirectory}${modelId}.gguf`;
  return new Promise((resolve, reject) => {
    initLlama({
      model: MODEL_FILE_PATH,
      use_mlock: true,
      n_ctx: 2048,
      n_gpu_layers: 1,
    })
      .then((context) => {
        context
          .completion(
            {
              prompt: message,
              n_predict: 60,
              temperature: 0.7,
              top_p: 1.0,
              stop: ['</s>', 'Llama:', 'User:'],
            },
            (data) => {
              // Partial-token callback; note this resolves the outer promise
              // on the first token, before the completion has finished.
              const { token } = data;
              resolve({ type: 'partial', token });
            },
          )
          .then((result) => {
            console.log('Completion result:', result.text);
            resolve({ type: 'final', text: result.text });
          })
          .catch((error) => {
            console.log('Error running Llama:', error);
            reject(error);
          });
      })
      .catch((error) => {
        console.log('Error initializing Llama:', error);
        reject(error);
      });
  });
};

Error

 ERROR  Error running Llama: [Error: Context limit reached]
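
One workaround sketch for this symptom, assuming the error means the accumulated prompt plus n_predict no longer fits in n_ctx (2048 here): trim older turns from the conversation before each call. MAX_PROMPT_CHARS is a rough hypothetical proxy for the real token budget; a tokenizer-based check would be more precise.

// Keep only the most recent turns so the prompt stays within the context
// window. Character count is a crude stand-in for token count.
const MAX_PROMPT_CHARS = 4000;

const buildPrompt = (turns: string[]): string => {
  const kept: string[] = [];
  let length = 0;
  // Walk the history newest-to-oldest, dropping old turns once we overflow.
  for (let i = turns.length - 1; i >= 0; i--) {
    length += turns[i].length;
    if (length > MAX_PROMPT_CHARS) break;
    kept.unshift(turns[i]);
  }
  return kept.join('\n');
};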

Android: Cannot load models, stopCompletions not working.

As it says on the tin: loading small 3B models such as TinyLlama or StableLM does not work. Tested models:

Attempting to call initLlama results in

  • Error: Failed to initialize context

Which I can only assume is here:

I do not know enough about native functions to investigate further.

In addition, stopCompletions() does not stop a completion on Android.
Thanks for your work, the project is fantastic otherwise.
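
For context, a usage sketch of how stopping is typically wired up, assuming the method is named stopCompletion as referenced in the parallel-decoding issue below; whether the native side honors it on Android is exactly what this report questions:

import { initLlama } from 'llama.rn';

// Start a completion, then cancel it from elsewhere (e.g. a "Stop" button).
// '/path/to/model.gguf' is a placeholder for a real on-device file.
const run = async () => {
  const context = await initLlama({ model: '/path/to/model.gguf', n_ctx: 2048 });
  const pending = context.completion({ prompt: 'Hello', n_predict: 512 }, () => {});
  setTimeout(() => context.stopCompletion(), 2000); // request cancellation after 2s
  const result = await pending; // should resolve with whatever was generated so far
  console.log(result.text);
};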

stablelm-2-zephyr-1_6b-Q8_0.gguf does not work

Hello,

I've been working on getting stablelm-2-zephyr-1_6b-Q8_0.gguf operational (link: https://huggingface.co/spaces/stabilityai/stablelm-2-1_6b-zephyr), especially since the 3B version seems to function quite well. However, the 1.6B version fails to initialize the context. I'm currently compiling the library from the latest master branch. Is there a straightforward modification I can make on my end to resolve this?

From the logs:

01-29 22:50:05.365 3017 20732 E RNLLAMA_LOG_ANDROID: llama_model_load: error loading model: done_getting_tensors: wrong number of tensors; expected 340, got 268

Thank you.

Crash on loading specific model

Related to:
Vali-98/ChatterUI#20

Model used:
https://huggingface.co/Crataco/stablelm-2-1_6b-chat-imatrix-GGUF/blob/main/stablelm-2-1_6b-chat.Q4_K_M.imx.gguf

llama.rn version:
0.3.1

Error provided by llama.rn:

[RNLlama] is_model_loaded false
handling signal: 11
Fatal signal 11 (SIGSEGV), code 1 (SEGV_MAPERR), fault addr 0x698 in tid 4201 (AsyncTask #1), pid 31056 (ali98.ChatterUI)

From what I can tell, it's attempting to access memory outside its address space. Oddly enough, this doesn't occur in the emulator, only in built APKs.
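
Before suspecting the native loader, one sanity-check sketch: verify the on-device file actually starts with the GGUF magic bytes, since a truncated or corrupted download is a common cause of is_model_loaded false followed by a crash. Assumes expo-file-system is available:

import * as FileSystem from 'expo-file-system';

// Every valid GGUF file begins with the ASCII magic "GGUF". Reading the
// first 4 bytes as base64 should therefore yield "R0dVRg==".
const isLikelyGguf = async (fileUri: string): Promise<boolean> => {
  const head = await FileSystem.readAsStringAsync(fileUri, {
    encoding: FileSystem.EncodingType.Base64,
    position: 0,
    length: 4,
  });
  return head === 'R0dVRg==';
};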

Parallel decoding

llama.cpp now supports parallel decoding within one context, so we can support it here.

Breaking change: deprecate the stopCompletion method and move stopping into the return value of completion.
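
One possible shape for that breaking change (purely a sketch, not a decided API): have completion() return a handle that carries both the result promise and a per-request stop method, so each of several parallel requests can be cancelled individually.

// Hypothetical API sketch for the proposed change; none of these names are
// final. Each request gets its own stop() instead of a context-wide call.
interface CompletionHandle {
  promise: Promise<{ text: string }>;
  stop: () => Promise<void>;
}

// Hypothetical usage once such an API exists:
//   const handle = context.completion(params, onToken);
//   await handle.stop();                 // cancels just this request
//   const result = await handle.promise; // partial text generated so far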

Failure to initLlama on Xiaomi phones.

Hello again, I've received reports from users of ChatterUI that model loading fails on Xiaomi-branded phones:

Confirmed not working:

  • Xiaomi Poco F5 - Android 14
  • Redmi 10C - Android 13

I've also queried about other phones, and got a few responses for working devices.

Confirmed working:

  • Samsung A71 - Android 13
  • Samsung M52 - Android 13

Version used:

  • llama.rn 0.3.0-rc.14

Logcat response on the tested Poco F5:

RNLLAMA_ANDROID_JNI: [RNLlama] is_model_loaded false

There aren't enough users to confirm this is a trend across all Xiaomi phones, but it is peculiar.

[Android] Seed value does not create deterministic outputs.

As mentioned in the title, setting a seed value does not make an output deterministic on Android.

  • llama.rn version: 0.3.0-rc.13

  • Model used: phi-2.Q3_K_M.gguf

  • Android Devices Tested on: Emulated Pixel 3a - Android 14

Params used:

{
  "frequency_penalty": 0, 
  "grammar": "", 
  "min_p": 0.07, 
  "mirostat": 0, 
  "mirostat_eta": 0.1, 
  "mirostat_tau": 5, 
  "n_predict": 288, 
  "n_threads": 5, 
  "presence_penalty": 0, 
  "prompt": "", 
  "repeat_penalty": 1, 
  "seed": 2, 
  "stop": ["User:", "### Response: "], 
  "temperature": 1, 
  "tfs_z": 1, 
  "top_k": 0, 
  "top_p": 1, 
  "typical_p": 1
}
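
A minimal repro sketch for the determinism check, assuming two runs with identical params (including seed: 2) over fresh contexts should produce identical text:

import { initLlama } from 'llama.rn';

// Run the same completion twice with the same seed and compare outputs.
// This issue reports the comparison comes out false on Android.
const checkSeed = async (modelPath: string) => {
  const params = { prompt: 'Once upon a time', n_predict: 32, seed: 2, temperature: 1 };
  const a = await (await initLlama({ model: modelPath })).completion(params);
  const b = await (await initLlama({ model: modelPath })).completion(params);
  console.log('deterministic:', a.text === b.text);
};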

Update llamacpp module to latest

I am having trouble updating the llama.cpp submodule myself. Could the project be updated? llama.cpp has added support for a few new base models that currently do not work in llama.rn.

Bump llamacpp version pls

Could we get a new release with the latest llama.cpp? llama.cpp is evolving fast, and as far as I understand it received a few fixes over the last few days related to Llama 3. In other words, Llama 3 is broken with the current release (the models do run, but the quality of the GGUF versions is apparently affected).

OpenCL Implementation for Android

First of all, thanks for the hard work on bringing this project to the react-native ecosystem.

I have been using llama.rn for a few weeks now in my personal project:
https://github.com/Vali-98/ChatterUI

I was wondering if there is any interest in implementing OpenCL for Android. I have attempted to work on it myself with little success, given my inexperience with native modules.

Early stopping inference

Shouldn't there be a function that allows the user to stop inference? It could be implemented as a callback function, just like whisper.rn's realtimeInference(). A stopgap sketch follows below.
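
In the meantime, one stopgap sketch: watch tokens in the partial callback and request cancellation when a caller-supplied condition fires (assuming context.stopCompletion() exists, per the issues above):

// Minimal structural type for the bits of the context used here.
type Ctx = {
  completion: (params: object, onToken: (d: { token: string }) => void) => Promise<{ text: string }>;
  stopCompletion: () => Promise<void>;
};

// Early-stop sketch: the per-token callback checks a predicate (e.g. a
// "Stop" flag toggled from the UI) and asks the context to cancel.
const completeWithEarlyStop = (ctx: Ctx, prompt: string, shouldStop: () => boolean) =>
  ctx.completion({ prompt, n_predict: 256 }, () => {
    if (shouldStop()) ctx.stopCompletion(); // request mid-generation cancel
  });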
