zilliztech / gptcache

Semantic cache for LLMs. Fully integrated with LangChain and llama_index.

Home Page: https://gptcache.readthedocs.io

License: MIT License

Python 99.66% Makefile 0.10% Shell 0.21% Dockerfile 0.03%
chatbot chatgpt chatgpt-api llm milvus similarity-search vector-search aigc openai memcache

gptcache's Introduction

GPTCache : A Library for Creating Semantic Cache for LLM Queries

Slash Your LLM API Costs by 10x 💰, Boost Speed by 100x ⚡

Release pip download Codecov License Twitter Discord

🎉 GPTCache has been fully integrated with 🦜️🔗LangChain! Here are detailed usage instructions.

🐳 The GPTCache server Docker image has been released, which means that applications written in any language can use GPTCache!
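For example, once the server is running (the issues further down this page show it being started with something like docker run -p 8000:8000 zilliz/gptcache:latest gptcache_server -s 0.0.0.0 -p 8000), any HTTP client can talk to it. The sketch below mirrors the curl PUT/GET calls quoted in those issues; treat the endpoint shape as an assumption and check the server docs for your version.

import urllib.parse
import urllib.request

BASE = "http://localhost:8000"  # assumed host/port of a running gptcache_server

def put_answer(prompt, answer):
    # Mirrors: curl -X PUT -d "<answer>" "http://localhost:8000?prompt=<prompt>"
    url = f"{BASE}?prompt={urllib.parse.quote(prompt)}"
    req = urllib.request.Request(url, data=answer.encode("utf-8"), method="PUT")
    urllib.request.urlopen(req)

def get_answer(prompt):
    # Mirrors: curl -X GET "http://localhost:8000?prompt=<prompt>"
    url = f"{BASE}?prompt={urllib.parse.quote(prompt)}"
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8")  # raw response body

put_answer("hello", "receive a hello message")
print(get_answer("hello"))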

📔 This project is under rapid development, and as such, the API may change at any time. For the most up-to-date information, please refer to the latest documentation and release notes.

NOTE: As the number of large models is growing explosively and their API shapes are constantly evolving, we no longer add support for new APIs or models. We encourage using the get and set APIs in GPTCache; here is the demo code: https://github.com/zilliztech/GPTCache/blob/main/examples/adapter/api.py
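A minimal sketch of that get/put style, based on the linked api.py example; double-check the exact imports and signatures against the file above for your installed version.

from gptcache import cache
from gptcache.adapter.api import put, get
from gptcache.processor.pre import get_prompt

# Use the raw prompt string as the cache key (exact match by default).
cache.init(pre_embedding_func=get_prompt)

put("hello", "foo")   # store an answer for the prompt "hello"
print(get("hello"))   # -> "foo"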

Quick Install

pip install gptcache

🚀 What is GPTCache?

ChatGPT and various large language models (LLMs) boast incredible versatility, enabling the development of a wide range of applications. However, as your application grows in popularity and encounters higher traffic levels, the expenses related to LLM API calls can become substantial. Additionally, LLM services might exhibit slow response times, especially when dealing with a significant number of requests.

To tackle this challenge, we have created GPTCache, a project dedicated to building a semantic cache for storing LLM responses.

😊 Quick Start

Note:

  • You can quickly try GPTCache and put it into a production environment without extensive development work. However, please note that the repository is still under heavy development.
  • By default, only a limited number of libraries are installed to support the basic cache functionalities. When you need to use additional features, the related libraries will be automatically installed.
  • Make sure that the Python version is 3.8.1 or higher, check: python --version
  • If you encounter issues installing a library due to a low pip version, run: python -m pip install --upgrade pip.

dev install

# clone GPTCache repo
git clone -b dev https://github.com/zilliztech/GPTCache.git
cd GPTCache

# install the repo
pip install -r requirements.txt
python setup.py install

example usage

These examples will help you understand how to use exact and similar matching with caching. You can also run the examples on Colab. For more examples, refer to the Bootcamp.

Before running the example, make sure the OPENAI_API_KEY environment variable is set by executing echo $OPENAI_API_KEY.

If it is not already set, it can be set by using export OPENAI_API_KEY=YOUR_API_KEY on Unix/Linux/MacOS systems or set OPENAI_API_KEY=YOUR_API_KEY on Windows systems.

It is important to note that this method is only effective temporarily; for a permanent effect, you'll need to modify the environment variable configuration file. For instance, on a Mac, you can modify the file located at /etc/profile.
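If you prefer, you can also set the key from inside Python before initializing the cache. This is just a convenience sketch; it assumes cache.set_openai_key() reads OPENAI_API_KEY from the environment, as the examples below suggest.

import os

# Set the key for the current process only, e.g. after loading it from a secrets manager.
os.environ["OPENAI_API_KEY"] = "YOUR_API_KEY"

from gptcache import cache

cache.init()
cache.set_openai_key()  # picks up OPENAI_API_KEY from the environment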


OpenAI API original usage

import os
import time

import openai


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']


question = "what's chatgpt"

# OpenAI API original usage
openai.api_key = os.getenv("OPENAI_API_KEY")
start_time = time.time()
response = openai.ChatCompletion.create(
  model='gpt-3.5-turbo',
  messages=[
    {
        'role': 'user',
        'content': question
    }
  ],
)
print(f'Question: {question}')
print("Time consuming: {:.2f}s".format(time.time() - start_time))
print(f'Answer: {response_text(response)}\n')

OpenAI API + GPTCache, exact match cache

If you ask ChatGPT the exact same question twice, the answer to the second request will be obtained from the cache without requesting ChatGPT again.

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
      model='gpt-3.5-turbo',
      messages=[
        {
            'role': 'user',
            'content': question
        }
      ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

OpenAI API + GPTCache, similar search cache

After obtaining an answer from ChatGPT in response to several similar questions, the answers to subsequent questions can be retrieved from the cache without the need to request ChatGPT again.

import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

from gptcache import cache
from gptcache.adapter import openai
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

print("Cache loading.....")

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )
cache.set_openai_key()

questions = [
    "what's github",
    "can you explain what GitHub is",
    "can you tell me more about GitHub",
    "what is the purpose of GitHub"
]

for question in questions:
    start_time = time.time()
    response = openai.ChatCompletion.create(
        model='gpt-3.5-turbo',
        messages=[
            {
                'role': 'user',
                'content': question
            }
        ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')

OpenAI API + GPTCache, use temperature

You can always pass a temperature parameter when requesting the API service or model.

The range of temperature is [0, 2]; the default value is 0.0.

A higher temperature means a higher chance of skipping the cache search and requesting the large model directly. When temperature is 2, the cache is always skipped and the request goes straight to the model. When temperature is 0, the cache is always searched before requesting the large model service.

The default post_process_messages_func is temperature_softmax. In this case, refer to the API reference to learn how temperature affects the output.

import time

from gptcache import cache, Config
from gptcache.manager import manager_factory
from gptcache.embedding import Onnx
from gptcache.processor.post import temperature_softmax
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from gptcache.adapter import openai

cache.set_openai_key()

onnx = Onnx()
data_manager = manager_factory("sqlite,faiss", vector_params={"dimension": onnx.dimension})

cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    post_process_messages_func=temperature_softmax
    )
# cache.config = Config(similarity_threshold=0.2)

question = "what's github"

for _ in range(3):
    start = time.time()
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        temperature=1.0,  # Change temperature here
        messages=[{
            "role": "user",
            "content": question
        }],
    )
    print("Time elapsed:", round(time.time() - start, 3))
    print("Answer:", response["choices"][0]["message"]["content"])

To use GPTCache exclusively, only the following lines of code are required, and there is no need to modify any existing code.

from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()

More Docs:

🎓 Bootcamp

😎 What can this help with?

GPTCache offers the following primary benefits:

  • Decreased expenses: Most LLM services charge fees based on a combination of number of requests and token count. GPTCache effectively minimizes your expenses by caching query results, which in turn reduces the number of requests and tokens sent to the LLM service. As a result, you can enjoy a more cost-efficient experience when using the service.
  • Enhanced performance: LLMs employ generative AI algorithms to generate responses in real-time, a process that can sometimes be time-consuming. However, when a similar query is cached, the response time significantly improves, as the result is fetched directly from the cache, eliminating the need to interact with the LLM service. In most situations, GPTCache can also provide superior query throughput compared to standard LLM services.
  • Adaptable development and testing environment: As a developer working on LLM applications, you're aware that connecting to LLM APIs is generally necessary, and comprehensive testing of your application is crucial before moving it to a production environment. GPTCache provides an interface that mirrors LLM APIs and accommodates storage of both LLM-generated and mocked data. This feature enables you to effortlessly develop and test your application, eliminating the need to connect to the LLM service.
  • Improved scalability and availability: LLM services frequently enforce rate limits, which are constraints that APIs place on the number of times a user or client can access the server within a given timeframe. Hitting a rate limit means that additional requests are blocked until a certain period has elapsed, leading to a service outage. With GPTCache, you can easily scale to accommodate an increasing volume of queries, ensuring consistent performance as your application's user base expands.

🤔 How does it work?

Online services often exhibit data locality, with users frequently accessing popular or trending content. Cache systems take advantage of this behavior by storing commonly accessed data, which in turn reduces data retrieval time, improves response times, and eases the burden on backend servers. Traditional cache systems typically utilize an exact match between a new query and a cached query to determine if the requested content is available in the cache before fetching the data.

However, using an exact match approach for LLM caches is less effective due to the complexity and variability of LLM queries, resulting in a low cache hit rate. To address this issue, GPTCache adopts alternative strategies like semantic caching. Semantic caching identifies and stores similar or related queries, thereby increasing cache hit probability and enhancing overall caching efficiency.

GPTCache employs embedding algorithms to convert queries into embeddings and uses a vector store for similarity search on these embeddings. This process allows GPTCache to identify and retrieve similar or related queries from the cache storage, as illustrated in the Modules section.
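The toy example below is not GPTCache's implementation; it only illustrates the hit/miss flow described above, with a bag-of-words embedding, a brute-force similarity search, and a stubbed LLM.

import math
from collections import Counter

def embed(text):
    # Stand-in for the Embedding Generator: a bag-of-words vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def fake_llm(question):
    # Stand-in for the real LLM call.
    return f"answer to: {question}"

cache_store = []  # list of (embedding, question, answer): Cache Storage + Vector Store in one

def cached_answer(question, threshold=0.6):
    emb = embed(question)
    best = max(cache_store, key=lambda entry: cosine(emb, entry[0]), default=None)
    if best and cosine(emb, best[0]) >= threshold:   # Similarity Evaluator
        return best[2]                               # semantic cache hit
    answer = fake_llm(question)                      # cache miss: ask the LLM
    cache_store.append((emb, question, answer))
    return answer

print(cached_answer("what is github"))          # miss, calls the "LLM"
print(cached_answer("what is github exactly"))  # similar enough, served from the cache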

Featuring a modular design, GPTCache makes it easy for users to customize their own semantic cache. The system offers various implementations for each module, and users can even develop their own implementations to suit their specific needs.

In a semantic cache, you may encounter false positives during cache hits and false negatives during cache misses. GPTCache offers three metrics to gauge its performance, which are helpful for developers to optimize their caching systems:

  • Hit Ratio: This metric quantifies the cache's ability to fulfill content requests successfully, compared to the total number of requests it receives. A higher hit ratio indicates a more effective cache.
  • Latency: This metric measures the time it takes for a query to be processed and the corresponding data to be retrieved from the cache. Lower latency signifies a more efficient and responsive caching system.
  • Recall: This metric represents the proportion of queries served by the cache out of the total number of queries that should have been served by the cache. Higher recall percentages indicate that the cache is effectively serving the appropriate content.

A sample benchmark is included to help users get started with assessing the performance of their semantic cache.
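As a rough illustration of how the three metrics relate, using hypothetical counters collected while replaying a workload (this is not a GPTCache API):

# Hypothetical counters, not produced by GPTCache itself.
cache_hits = 410         # requests answered from the cache
cache_misses = 90        # requests forwarded to the LLM service
should_have_hit = 500    # requests for which a suitable cached answer existed
total_latency_s = 12.3   # total time spent answering all requests

hit_ratio = cache_hits / (cache_hits + cache_misses)      # 0.82
recall = cache_hits / should_have_hit                     # 0.82
avg_latency_s = total_latency_s / (cache_hits + cache_misses)
print(hit_ratio, recall, round(avg_latency_s, 3))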

🤗 Modules

GPTCache Struct

  • LLM Adapter: The LLM Adapter is designed to integrate different LLM models by unifying their APIs and request protocols. GPTCache offers a standardized interface for this purpose, with current support for ChatGPT integration.

    • Support OpenAI ChatGPT API.
    • Support langchain.
    • Support minigpt4.
    • Support Llamacpp.
    • Support dolly.
    • Support other LLMs, such as Hugging Face Hub, Bard, Anthropic.
  • Multimodal Adapter (experimental): The Multimodal Adapter is designed to integrate different large multimodal models by unifying their APIs and request protocols. GPTCache offers a standardized interface for this purpose, with current support for image generation and audio transcription integrations.

    • Support OpenAI Image Create API.
    • Support OpenAI Audio Transcribe API.
    • Support Replicate BLIP API.
    • Support Stability Inference API.
    • Support Hugging Face Stable Diffusion Pipeline (local inference).
    • Support other multimodal services or self-hosted large multimodal models.
  • Embedding Generator: This module is created to extract embeddings from requests for similarity search. GPTCache offers a generic interface that supports multiple embedding APIs, and presents a range of solutions to choose from.

    • Disable embedding. This will turn GPTCache into a keyword-matching cache.
    • Support OpenAI embedding API.
    • Support ONNX with the GPTCache/paraphrase-albert-onnx model.
    • Support Hugging Face embedding with transformers, ViTModel, Data2VecAudio.
    • Support Cohere embedding API.
    • Support fastText embedding.
    • Support SentenceTransformers embedding.
    • Support Timm models for image embedding.
    • Support other embedding APIs.
  • Cache Storage: Cache Storage is where the response from LLMs, such as ChatGPT, is stored. Cached responses are retrieved to assist in evaluating similarity and are returned to the requester if there is a good semantic match. At present, GPTCache supports SQLite and offers a universally accessible interface for extension of this module.

  • Vector Store: The Vector Store module helps find the K most similar requests from the input request's extracted embedding. The results can help assess similarity. GPTCache provides a user-friendly interface that supports various vector stores, including Milvus, Zilliz Cloud, and FAISS. More options will be available in the future.

    • Support Milvus, an open-source vector database for production-ready AI/LLM applications.
    • Support Zilliz Cloud, a fully-managed cloud vector database based on Milvus.
    • Support Milvus Lite, a lightweight version of Milvus that can be embedded into your Python application.
    • Support FAISS, a library for efficient similarity search and clustering of dense vectors.
    • Support Hnswlib, a header-only C++/Python library for fast approximate nearest neighbors.
    • Support PGVector, open-source vector similarity search for Postgres.
    • Support Chroma, the AI-native open-source embedding database.
    • Support DocArray, a library for representing, sending, and storing multi-modal data, well suited for machine learning applications.
    • Support Qdrant.
    • Support Weaviate.
    • Support other vector databases.
  • Cache Manager: The Cache Manager is responsible for controlling the operation of both the Cache Storage and Vector Store.

    • Eviction Policy: Cache eviction can be managed in memory using Python's cachetools or in a distributed fashion using Redis as a key-value store.
    • In-Memory Caching

    Currently, GPTCache makes eviction decisions based solely on the number of cached entries. This approach can result in inaccurate resource evaluation and may cause out-of-memory (OOM) errors. We are actively investigating and developing a more sophisticated strategy.

    • Support LRU eviction policy.
    • Support FIFO eviction policy.
    • Support LFU eviction policy.
    • Support RR eviction policy.
    • Support more complicated eviction policies.
    • Distributed Caching

    Scaling a GPTCache deployment horizontally with in-memory caching alone is not possible, since the cached information would be limited to a single pod.

    To keep cache information consistent across all replicas, we can use distributed cache stores like Redis.

    • Support Redis distributed cache
    • Support memcached distributed cache
  • Similarity Evaluator: This module collects data from both the Cache Storage and Vector Store, and uses various strategies to determine the similarity between the input request and the requests from the Vector Store. Based on this similarity, it determines whether a request matches the cache. GPTCache provides a standardized interface for integrating various strategies, along with a collection of implementations to use. The following similarity definitions are currently supported or will be supported in the future:

    • The distance we obtain from the Vector Store.
    • A model-based similarity determined using the GPTCache/albert-duplicate-onnx model from ONNX.
    • Exact matches between the input request and the requests obtained from the Vector Store.
    • Distance represented by applying linalg.norm from numpy to the embeddings.
    • BM25 and other similarity measurements.
    • Support other model serving frameworks such as PyTorch.

    Note: Not all combinations of different modules may be compatible with each other. For instance, if we disable the Embedding Extractor, the Vector Store may not function as intended. We are currently working on implementing a combination sanity check for GPTCache. A minimal sketch of one working module combination follows this list.
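The sketch below shows one way the modules compose. The get_data_manager arguments follow the usage shown elsewhere on this page; the eviction name ("LRU") and the exact parameter names are assumptions to verify against your installed version.

from gptcache import cache
from gptcache.embedding import Onnx
from gptcache.manager import CacheBase, VectorBase, get_data_manager
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation

onnx = Onnx()                                        # Embedding Generator
data_manager = get_data_manager(                     # Cache Manager
    CacheBase("sqlite"),                             # Cache Storage
    VectorBase("faiss", dimension=onnx.dimension),   # Vector Store
    max_size=1000,                                   # assumed: max cached entries
    clean_size=100,                                  # assumed: entries removed per eviction
    eviction="LRU",                                  # assumed eviction policy name
)
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),  # Similarity Evaluator
)
cache.set_openai_key()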

😇 Roadmap

Coming soon! Stay tuned!

😍 Contributing

We are extremely open to contributions, be it through new features, enhanced infrastructure, or improved documentation.

For comprehensive instructions on how to contribute, please refer to our contribution guide.

gptcache's People

Contributors

a9raag, bennu-li, binbinlv, chiiizzzy, cxie, fzliu, jacktempo7, jaelgu, junjiejiangjjj, jupyterjazz, keenborder786, leio10, liliu-z, pouyanpi, pranaychandekar, progm, raycerossum, rested, shanghaikid, shiyu22, simfg, tongtie, vax521, vovor, wxywb, wybryan, xiaofan-luan, yyyasin19, zc277584121, zhuwenxing


gptcache's Issues

[Feature]: JavaScript/TypeScript Port

Python is obviously the ecosystem to be in for ML at the moment, but offering an easy interface for JavaScript/TypeScript (which is extremely common to use for both frontend and backend) could help increase library adoption.

I'd be curious to hear thoughts from those more knowledgeable on the best approach for this, or on whether it's even a good idea. Would embedding this library basically be the best way, so that the code generally only needs to be written once? Or would a full port to JS be worth it? As I understand it, this library isn't actually doing anything data-intensive or ML-intensive like creating the embeddings or the database system, so a port would mostly be syntactical changes rather than creating anything new.

[Feature]: Support for mutually exclusive multiple contexts

Is your feature request related to a problem? Please describe.

I am currently working on something that uses multiple GPT-3.5 contexts. Sharing the cache between those contexts throws up some errors. Is there a way to set up the data_manager so that it can name the DBs differently?

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Bug]: error: update_cache_func() takes 1 positional argument but 2 were given

Current Behavior


I ran the demo from the GitHub repo in Colab, but encountered an error. The error message is as follows: WARNING: root: failed to save the data to cache, error: update_cache_func() takes 1 positional argument but 2 were given. How can I fix this?

Expected Behavior

Successfully cached information

Steps To Reproduce

No response

Environment

colab

Anything else?

No response

[Feature]: Think of adding concept of session

Is your feature request related to a problem? Please describe.

There are a couple of cases where we need the idea of a session and context.

Let me quickly go through some examples:


Another use case is the LangChain SQL demo, see https://python.langchain.com/en/latest/modules/chains/examples/sqlite.html
The chain does the following:

  1. Based on the query, determine which tables to use.
  2. Based on those tables, call the normal SQL database chain.
    The chain requires context, so it won't hit the cache the second time.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: higher onnx similarity evaluation token limit

Is your feature request related to a problem? Please describe.

Currently, the implemented ONNX similarity evaluation using "GPTCache/albert-duplicate-onnx" is limited to 512 tokens. Is it possible to go higher than 512?

Describe the solution you'd like.

A LangChain conversational chat agent prompt produces around 600 to a few thousand tokens; with a higher limit it would be easier to get a cache hit from the ONNX similarity evaluation without reducing the prompt's token size.

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: Support MongoDB Atlas for both scalar and vector data

Is your feature request related to a problem? Please describe.

As noted in #200, it would be nice to have a NoSQL implementation. Personally, I prefer Mongo to DynamoDB due to the breadth of operations you can perform on it compared to Dynamo.

Describe the solution you'd like.

Hook into their Python driver, PyMongo, to store the scalar data in a standard NoSQL format. Mongo also offers a graph database option for storing and accessing the vector data (MongoDB as a Graph Database). I would like to have a simplified data storage solution using only one provider.

Describe an alternate solution.

No response

Anything else? (Additional Context)

Also, this is brilliant, was a few weeks away from trying to create something similar for myself. But this is a great solution for my problems.

[DOCS]: Update Chart

Documentation Link

No response

Describe the problem

No response

Describe the improvement

No response

Anything else?

No response

[Feature]: Streaming support

Is your feature request related to a problem? Please describe.

No response

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: Moderation API

Is your feature request related to a problem? Please describe.

I am not able to use Moderation API.

Describe the solution you'd like.

Add the OpenAI Moderation API (and possibly other providers), and add a caching mechanism for it too.

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: Support DynamoDB

Is your feature request related to a problem? Please describe.

It's like NoSQL, but it comes with all the AWS support: https://aws.amazon.com/dynamodb/

Describe the solution you'd like.

Possibly hook up DynamoDB through https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/dynamodb.html and then populate or fetch from it.

Describe an alternate solution.

Maybe mongo atlas? https://www.mongodb.com/cloud/atlas/register

Anything else? (Additional Context)

Thanks for the work on the cache implementations!

Naming suggestion: perhaps you can drop the "GPT" once the hype is over and call it something else, because I think this cache can be applied to many ML/NLP/CV applications beyond just GPT.

[Feature]: Support more configs for openAI models

Is your feature request related to a problem? Please describe.

Compared to the OpenAI documentation, we are missing some major parameters; see:

https://platform.openai.com/docs/api-reference/completions/create

  1. max_tokens: just bypass to GPT for now.
  2. temperature: there are a couple of things we can do:
    1. randomly pick an answer from the returned results if they are all very similar.
    2. edit the answer with another small model; for images, for instance -> https://huggingface.co/lambdalabs/sd-image-variations-diffusers
  3. n -> if there are not enough cached results, we will need to generate from OpenAI anyway.
  4. best_of -> controls the top-k number of results we want to retrieve from the cache.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: qdrant as a vector store

Is your feature request related to a problem? Please describe.

Hi. Glad someone finally made this; it's been on my mind for a long time.
Any chance of supporting Qdrant as a vector store? Qdrant also allows filtering by metadata, which can be helpful if, for example, you only want to retrieve cache entries from within a certain date range. Qdrant can also store the LLM response as metadata, which eliminates the SQLite requirement. The new Qdrant local mode means you don't have to set up a server via Docker; just install from pip and you are ready to go.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Bug]: Multiple messages not answering last message

Current Behavior

Using OpenAI's API, when I pass multiple messages to OpenAI, it randomly returns the answer to one of the questions.

Expected Behavior

It should only return the answer to the last question.

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[Bug]: Wrong behavior when used from the Docker Image

Current Behavior

I am following the doc for using GPTCache with Docker from here.

I noticed that no matter what prompt I send, it always returns the first item inserted, for example:

curl -X GET  "http://localhost:8000?prompt=hello"
null
curl -X PUT -d "receive a hello message" "http://localhost:8000?prompt=hello+world"
curl -X GET  "http://localhost:8000?prompt=hello"
"receive a hello message"
curl -X GET  "http://localhost:8000?prompt=bye"
"receive a hello message"

I expect that the last bye query will not return any value, just a null.

I created a sample gptcache.yml with the following

model_src:
    openai
...
config:
    similarity_threshold: 0.2

By the way, if I use threshold (as detailed in the docs) I get an error, and then another error when using the -f gptcache.yml parameter:

Traceback (most recent call last):
  File "/usr/local/bin/gptcache_server", line 8, in <module>
    sys.exit(main())
  File "/usr/local/lib/python3.8/site-packages/gptcache_server/server.py", line 55, in main
    init_similar_cache_from_config(config_dir=args.cache_config_file)
  File "/usr/local/lib/python3.8/site-packages/gptcache/adapter/api.py", line 167, in init_similar_cache_from_config
    with open(config_dir, "r", encoding="utf-8") as f:
FileNotFoundError: [Errno 2] No such file or directory: 'gptcache.yml'

I modified the command to something like the following, since I am using Windows, and with the volume mounted I got it running:

docker run -p 8000:8000 -v %cd%:"/workspace":rw -it zilliz/gptcache:latest gptcache_server -s 0.0.0.0 -p 8000 -f gptcache.yml

I expected to get a better result by lowering the similarity_threshold value, but the behavior is exactly the same.

Am I missing something in how the component should be used?

thanks!

Expected Behavior

No response

Steps To Reproduce

No response

Environment

Windows 10 + Docker Image.

Anything else?

No response

[Bug]: Session module is not working as expected

Current Behavior

When I run the SQLite + FAISS + ONNX example, I get the following error.
I tried adding __enter__ and __exit__ functions in sql_storage.py, but it's not working.

def __enter__(self):
    print("debug: enter function test")
    return self

def __exit__(self):  # 
    self.drop()

def drop(self):
    self._data_manager.delete_session(self.name)
    print("debug:drop sql connnect")

    return SSDataManager(cache_base, vector_base, object_base, max_size, clean_size, eviction)
  File "C:\ProgramData\Anaconda3\lib\site-packages\gptcache\manager\data_manager.py", line 209, in __init__
    self.eviction_base.put(self.s.get_ids(deleted=False))
  File "C:\ProgramData\Anaconda3\lib\site-packages\gptcache\manager\scalar_data\sql_storage.py", line 229, in get_ids
    with self.Session() as session:
AttributeError: __enter__

Expected Behavior

No response

Steps To Reproduce

No response

Environment

python : 3.8.3

Anything else?

No response

[Bug]: Moderation api is not working

Current Behavior

    modOutputres = openai.Moderation.create(input=question)
  File "/opt/anaconda3/envs/openai/lib/python3.9/site-packages/gptcache/adapter/openai.py", line 319, in create
    res = adapt(
  File "/opt/anaconda3/envs/openai/lib/python3.9/site-packages/gptcache/adapter/adapter.py", line 39, in adapt
    pre_embedding_res = chat_cache.pre_embedding_func(
  File "/opt/anaconda3/envs/openai/lib/python3.9/site-packages/gptcache/processor/pre.py", line 19, in last_content
    return data.get("messages")[-1]["content"]
TypeError: 'NoneType' object is not subscriptable

Expected Behavior

it should give moderation api output.

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[Feature]:Support audio embeddings

Is your feature request related to a problem? Please describe.

For audio-to-text generation models such as OpenAI Whisper, we need to cache with audio as the key.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

please give some usage scenarios for GPTCache

Hi

I am confused about scenarios like AI chat or NPCs in games, which seem not really suitable for GPTCache, as the contents of these requests are almost all different.

Although similarity evaluation can identify some similar requests and answer them from GPTCache, the quality of those responses seems a little worse than responses from the LLM.

So can you give some typical scenarios where GPTCache can be applied? In other words, what characteristics do suitable GPTCache scenarios have?

thanks!

[Feature]: PostgreSQL pgvector as vector store

Is your feature request related to a problem? Please describe.

pgvector is an extension for PostgreSQL that allows vector similarity search.
Any chance of supporting it as a vector store?

Personally, I find it very interesting since it may allow running GPTCache with a single DB engine (Postgres as both the cache store and the vector store), and it has lately been supported on AWS RDS and other hosted solutions.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: Add Redis as a VectorStore

Is your feature request related to a problem? Please describe.

Redis is a really popular vector store and caching database used by industries such as fintech. It would make it really easy to integrate into existing services and APIs without adding a new vector DB such as FAISS or ChromaDB and without pulling all of PyTorch into the build image. Also, it seems RediSearch provides an efficient KNN search.

Describe the solution you'd like.

Use the Redis async client and create an index that is used solely for the vector store. The index name prefix must NOT match other existing index names, to avoid namespace overlap; e.g. searching my-cache would otherwise also look for entries under my-cache-gpt.

Describe an alternate solution.

No response

Anything else? (Additional Context)

I think this will be a really nice feature for those who want to use the cache without adding a pure vector store to their infra.

[Feature]: Support Weaviate as an option for a vector store

Is your feature request related to a problem? Please describe.

I want to use Weaviate to find the K most similar requests from the input request's extracted embedding.

Describe the solution you'd like.

Be able to do this:

data_manager = get_data_manager(CacheBase("weaviate"), VectorBase("weaviate", dimension=128))

or at least:

data_manager = get_data_manager(CacheBase("other db"), VectorBase("weaviate", dimension=128))

Describe an alternate solution.

None 😀

Anything else? (Additional Context)

No response

[Bug]: failed to save the data to cache, error: adapt.<locals>.update_cache_func()

Current Behavior

Hello, I was trying to run the following code from the doc (chat with GPT), but I am getting this message in the console:

WARNING: failed to save the data to cache, error: adapt.<locals>.update_cache_func() takes 1 positional argument but 2 were given

Expected Behavior

Cache save and update is expected

Steps To Reproduce

Code :


import time


def response_text(openai_resp):
    return openai_resp['choices'][0]['message']['content']

print("Cache loading.....")

# To use GPTCache, that's all you need
# -------------------------------------------------
from gptcache import cache
from gptcache.adapter import openai

cache.init()
cache.set_openai_key()
# -------------------------------------------------

question = "what's github"
for _ in range(2):
    start_time = time.time()
    response = openai.ChatCompletion.create(
      model='gpt-3.5-turbo',
      messages=[
        {
            'role': 'user',
            'content': question
        }
      ],
    )
    print(f'Question: {question}')
    print("Time consuming: {:.2f}s".format(time.time() - start_time))
    print(f'Answer: {response_text(response)}\n')


Environment

Env: Python 3.10.7, GPTCache v0.1.13

Anything else?

No response

Scalar store

Why not use SQLAlchemy, which can help us support multiple databases.

[Feature]: Support to store images in cache storage.

Is your feature request related to a problem? Please describe.

For models working on image generation, we can store the result image in cache storage rather than just text.

TODO: support minio cache storage backend and local disk storage backend

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Bug]: Cache not writing when prompt is greater than 1000 characters for SQL Scalar Cache

Current Behavior

I have a large prompt that I want to cache (it's part of a prompt template in langchain where the query itself is small but the complete template is large) and it is not saving to cache because the prompt is > 1000 characters.

The expected output itself is small (e.g. ~256 characters), but the prompt is large because it contains explicit instructions. The error I get is this:

2023-05-22 11:28:08,013 - 140704405832512 - adapter.py-adapter:162 - WARNING: failed to save the data to cache, error: (pyodbc.ProgrammingError) ('42000', "[42000] [Microsoft][ODBC Driver 18 for SQL Server][SQL Server]String or binary data would be truncated in table 'gambrinus-cachestore.dbo.gptcache_question', column 'question'. Truncated value: '...'. (2628) (SQLExecDirectW)")
[SQL: INSERT INTO gptcache_question (question, create_on, last_access, embedding_data, deleted) OUTPUT inserted.id VALUES (?, ?, ?, ?, ?)]
[parameters: ('... (1424 characters truncated) ...', datetime.datetime(2023, 5, 22, 11, 28, 7, 965579), datetime.datetime(2023, 5, 22, 11, 28, 7, 965586), bytearray(b'e\xe4 \xbd\xb1CA=\xed\x96\x00<\x1b\xa9\xbd\xbd\xd4\x8c7\xbd\xde\x93]\xbd4\x0c\xd2\xbb\xbfo$\xbd\x01\x89\x82=\xb1\xff\xe5<\xb7\xd25\xbd\xf0 ... (8408 characters truncated) ... xb8 \'\xbdM\xab\xb6\xbc\xd3\xc7\x84=\x93a\x15=\xe9\x03A\xbd\xde\xe2\x18\xbd)0\x1c\xbb\x86\x95\x08<Vl\x19=T\x9c\x95\xbdX\x13j\xbdEg\x11\xbdk\xafi\xbd'), 0)]
(Background on this error at: https://sqlalche.me/e/20/f405)

I see that the root issue is that the question column in the question table is set to varchar(1000), which is too small for this prompt. The test prompt I am using has 1706 characters. This is with SQL Server, but the bug doesn't seem to be specific to SQL Server; it applies to SQL scalar caches in general.

Expected Behavior

Allow the question column in the question table to be larger, either as a flexible variable or large enough for most LLMs (larger than 1000 characters anyway).

Steps To Reproduce

1. Create a SQL Server Scalar Cache.
2. Input a total prompt with > 1000 characters as input.
3. Attempt to run the cache.

Environment

Here's the data manager. max_size and clean_size don't seem to do anything.


cache_base = CacheBase('sqlserver', sql_url=SQL_URL, table_name="gptcache")
vector_base = VectorBase('milvus', host=vector_database['host'], port=vector_database['port'],
                         user=vector_database['user'], password=vector_database['password'], secure=vector_database['secure'],
                         collection_name=vector_database['collection_name'], search_params=vector_database['search_params'], local_mode=vector_database['local_mode'],
                         dimension=onnx.dimension)

data_manager = get_data_manager(cache_base, vector_base, max_size=5000, clean_size=200)


Anything else?

No response

[Bug]: Chinese support not very well?

Current Behavior

I tested the official similarity example in the README.

onnx = Onnx()
data_manager = get_data_manager(CacheBase("sqlite"), VectorBase("faiss", dimension=onnx.dimension))
cache.init(
    embedding_func=onnx.to_embeddings,
    data_manager=data_manager,
    similarity_evaluation=SearchDistanceEvaluation(),
    )

...

but it doesn't support Chinese very well. When I ask different questions, it always returns the same answer:

q:俄罗斯总统是谁 (Who is the president of Russia?)
目前的俄罗斯总统是弗拉基米尔·普京。(The current president of Russia is Vladimir Putin.)
q:你是谁? (Who are you?)
目前的俄罗斯总统是弗拉基米尔·普京。(The current president of Russia is Vladimir Putin.)
q:东风夜放花千树 (a line of classical Chinese poetry)
目前的俄罗斯总统是弗拉基米尔·普京。(The current president of Russia is Vladimir Putin.)
q:who are you?
I am an AI language model developed by OpenAI. I am designed to assist and provide information to users through conversation.
q:我儿子8岁, 我3年后比我儿子2倍大3岁, 我多少岁? (My son is 8. In 3 years I will be 3 years more than twice my son's age. How old am I?)
目前你的年龄是13岁,因为(8+3)*2=22。(You are currently 13 years old, because (8+3)*2=22.)

q:东风夜放花千树 (the same line of poetry again)
目前的俄罗斯总统是弗拉基米尔·普京。(The current president of Russia is Vladimir Putin.)
Time consuming: 0.10s
2023-04-28 18:30:01,839 - 140497058133568 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "POST / HTTP/1.1" 302 -
2023-04-28 18:30:01,853 - 140497184024128 - _internal.py-_internal:186 - INFO: 127.0.0.1 - - [28/Apr/2023 18:30:01] "GET /?result=目前的俄罗斯总统是弗拉基米尔·普京。 HTTP/1.1" 200 -

I don't know how to avoid these problems.

Thank you!

Expected Behavior

match the right question and give the right answer.

Steps To Reproduce

run the similar match in the readme.

Environment

ubuntu 22.04

Anything else?

using onnx

[Feature]: Support DuckDB

Is your feature request related to a problem? Please describe.

SQLite is good but old

Describe the solution you'd like.

Support DuckDB

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Enhancement]: Support the bilingual llm request

What would you like to be added?

When I asked multiple questions, it always returned the same answer just because they mention the same keywords.

For example:
I asked "what is coffee (咖啡是什么)",
and it returned "Coffee is a drink made of coffee beans"... and so on.
Correct answer.
But when I asked "How to make Americano coffee (美式咖啡怎么制作的)", it answered with the same answer as the question above.

I expected better NLP.

Why is this needed?

Because it's a totally different question.

Anything else?

No response

[Bug]: GET from server fail and crash

Current Behavior

When I try to get an answer from the server, I get an empty reply and the server crashes.

Expected Behavior

When I try to get the answer for a prompt Hello, it should return a proper reply, which in this case should be receive a hello message.

Steps To Reproduce

First I started the server:

python GPTCache/gptcache_server/server.py

Then I put and get:

❯ curl -X PUT -d "receive a hello message" "http://localhost:8000?prompt=hello"
❯ curl -X GET "http://localhost:8000?prompt=hello"
curl: (52) Empty reply from server

The server reports:

Starting server at localhost:8000
127.0.0.1 - - [25/Apr/2023 15:02:06] "PUT /?prompt=hello HTTP/1.1" 200 -
OMP: Error #15: Initializing libomp.a, but found libiomp5.dylib already initialized.
OMP: Hint This means that multiple copies of the OpenMP runtime have been linked into the program. That is dangerous, since it can degrade performance or cause incorrect results. The best thing to do is to ensure that only a single OpenMP runtime is linked into the process, e.g. by avoiding static linking of the OpenMP runtime in any library. As an unsafe, unsupported, undocumented workaround you can set the environment variable KMP_DUPLICATE_LIB_OK=TRUE to allow the program to continue to execute, but that may cause crashes or silently produce incorrect results. For more information, please see http://openmp.llvm.org/
[1]    49798 abort      python GPTCache/gptcache_server/server.py

Environment

I'm using:

  • Mac Intel
  • Python 3.8
  • GPTCache from main branch - f3406ee

Though, everything works smoothly on ubuntu server.

Anything else?

No response

[Feature]: Support Image embedding and cache image as key.

Is your feature request related to a problem? Please describe.

For models like CLIP, and BLIP, we need to cache image as key

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

Listen as an OpenAI API?

I checked the documentation, and I am wondering: is it possible to make GPTCache listen as an OpenAI-like API? Then we could connect it to other services via the OpenAI API, such as ChatBox.

[Bug]: Customize Cache

Current Behavior

I want to customize the Cache component in order to add more static methods, such as an OpenAI base_url. But I get an error whether I remove cache.init or not:

gptcache.utils.error.NotInitError: The cache should be inited before using


Expected Behavior

No response

Steps To Reproduce

No response

Environment

linux
latest packages

Anything else?

No response

[Enhancement]: Caching Support for Agents

What would you like to be added?

While it is possible to cache each LLM call, I notice that there is no way to cache the entire thought process and subsequent output of an Agent call, e.g. LLMSingleActionAgent from LangChain. Is there any way this can be achieved?

Why is this needed?

Agents will be increasingly important and heavily utilized

Anything else?

No response

[Bug]: 'sentence-transformers/paraphrase-albert-small-v2' is NOT a correct model identifier listed on Huggingface

Current Behavior

I installed the latest version of GPTCache (0.1.22) and ran:

from gptcache.embedding import Onnx
Onnx()

then I see the error message below:

OSError: Can't load config for 'sentence-transformers/paraphrase-albert-small-v2'. Make sure that:

- 'sentence-transformers/paraphrase-albert-small-v2' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'sentence-transformers/paraphrase-albert-small-v2' is the correct path to a directory containing a config.json file

I checked on Huggingface, and that model is no longer available.
Thank you.

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[Enhancement]: Add RWKV model support (RWKV is a 100% RNN Language Model - ctxlen 8192 models available, longer ctxlen soon)

What would you like to be added?

RWKV Raven 7B Gradio Demo: https://huggingface.co/spaces/BlinkDL/Raven-RWKV-7B

Use rwkv.cpp for CPU INT4 / INT8: https://github.com/saharNooby/rwkv.cpp

Github project: https://github.com/BlinkDL/ChatRWKV

Sample code using rwkv pip package: https://github.com/BlinkDL/ChatRWKV/blob/main/v2/benchmark_more.py

Please let me know if you have any questions :)

Why is this needed?

No response

Anything else?

No response

[Feature]: Support huggingface transformers LLM model

Is your feature request related to a problem? Please describe.

Can chat caching be supported for Hugging Face LLM models?

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]: Support MiniGPT4 and Blip2

Is your feature request related to a problem? Please describe.

MiniGPT-4 would be an interesting demo of how GPTCache can work with multiple modalities.

see https://github.com/Vision-CAIR/MiniGPT-4/

The input will be a photo and a question, while the output is the answer to the question based on the photo.

have fun

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Bug]: The current caching strategy does not support multi-round conversations.

Current Behavior

   messages=[
       {"role": "system", "content": "You are a helpful assistant."},
       {"role": "user", "content": "who is the CEO of OpenAI?"},
       {"role": "user", "content": "how old is he/she"},

   ]

the answer

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "As of September 2021, the CEO of OpenAI is Sam Altman. He was born on April 22, 1985, which makes him 36 years old.",
        "role": "assistant"
      }
    }
  ],
  "created": 1680522845,
  "id": "chatcmpl-71D37v4kvN1haGqD322L445bqqM0P",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 37,
    "prompt_tokens": 37,
    "total_tokens": 74
  }
}

Then I changed the company name to APPLE

    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "who is the CEO of APPLE?"},
        {"role": "user", "content": "how old is he/she"},

    ]

Since the current caching strategy uses only the latest message content, "how old is he/she" will hit the cache. So the answer is:

{'gptcache': True, 'choices': [{'message': {'role': 'assistant', 'content': 'As of September 2021, the CEO of OpenAI is Sam Altman. He was born on April 22, 1985, which makes him 36 years old.'}, 'finish_reason': 'stop', 'index': 0}]}

But if you look at the context, this answer is clearly unreasonable.
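A possible workaround (not an official fix, and not part of this issue): make the cache key depend on the whole message list instead of only the last message, by passing a custom pre_embedding_func to cache.init(). The (data, **_) signature mirrors gptcache.processor.pre.last_content quoted in another issue on this page.

from gptcache import cache
from gptcache.adapter import openai

def all_content(data, **_):
    # Join every message so a different system/user history produces a different cache key.
    return "\n".join(m["content"] for m in data.get("messages", []))

cache.init(pre_embedding_func=all_content)  # combine with embedding/data_manager args for semantic matching
cache.set_openai_key()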

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response

[Feature]: GPTCache openAI should make the cached result more similar to openAI response

Is your feature request related to a problem? Please describe.

ChatGPT returned:

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The Valley of Kings is located in the west bank of the Nile river in Luxor, Egypt.",
        "role": "assistant"
      }
    }
  ],
  "created": 1680670004,
  "id": "chatcmpl-71pKeRARTWzSiQE5uu6NzKkZYUTLE",
  "model": "gpt-3.5-turbo-0301",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 20,
    "prompt_tokens": 229,
    "total_tokens": 249
  }
}

GPTCache returned:

{'gptcache': True, 'choices': [{'message': {'role': 'assistant', 'content': 'The Valley of Kings is located in the west bank of the Nile river in Luxor, Egypt.'}, 'finish_reason': 'stop', 'index': 0}]}

Currently, the response returned by GPTCache is similar in choices but lacks other fields such as usage and created time.

Describe the solution you'd like.

No response

Describe an alternate solution.

No response

Anything else? (Additional Context)

No response

[Feature]:Support the PaddleNLP Embedding

Is your feature request related to a problem? Please describe.

Support embedding with PaddleNLP.

Describe the solution you'd like.

Support embedding with PaddleNLP.

Describe an alternate solution.

Support embedding with PaddleNLP.

Anything else? (Additional Context)

Support embedding with PaddleNLP.

[Bug]: GPTCache similarity caching code example encountered an error during execution

Current Behavior

This is an issue relating to the integration of GPTCache with LangChain

import os
import time
import gptcache
from gptcache.processor.pre import get_prompt
from gptcache.manager.factory import get_data_manager
from langchain.cache import GPTCache, SQLiteCache
from gptcache.manager import get_data_manager, CacheBase, VectorBase
from gptcache import Cache
from gptcache.embedding import Onnx
from gptcache.similarity_evaluation.distance import SearchDistanceEvaluation
from langchain.llms import OpenAI
import langchain
import openai
from decouple import config

os.environ["OPENAI_API_KEY"] = config("OPENAI_API_KEY")
openai.api_base = config("OPENAI_API_BASE")

llm = OpenAI(model_name="text-davinci-002", n=1, best_of=1)
i = 0
file_prefix = "data_map"
llm_cache = Cache()

def init_gptcache_map(cache_obj: gptcache.Cache):
    global i
    cache_path = f'{file_prefix}_{i}.txt'
    onnx = Onnx()
    cache_base = CacheBase('sqlite')
    vector_base = VectorBase('faiss', dimension=onnx.dimension)
    data_manager = get_data_manager(cache_base, vector_base, max_size=10, clean_size=2)
    cache_obj.init(
        pre_embedding_func=get_prompt,
        embedding_func=onnx.to_embeddings,
        data_manager=data_manager,
        similarity_evaluation=SearchDistanceEvaluation(),
    )
    i += 1

langchain.llm_cache = GPTCache(init_gptcache_map)

llm("Tell me a joke")

Error:

Traceback (most recent call last):
  File "D:\chat-main\tt.py", line 43, in <module>
    llm("Tell me a joke")
  File "D:\chat-main\venv\Lib\site-packages\langchain\llms\base.py", line 246, in __call__
    return self.generate([prompt], stop=stop).generations[0][0].text
  File "D:\chat-main\venv\Lib\site-packages\langchain\llms\base.py", line 161, in generate
    llm_output = update_cache(
  File "D:\chat-main\venv\Lib\site-packages\langchain\llms\base.py", line 51, in update_cache
    langchain.llm_cache.update(prompt, llm_string, result)
  File "D:\chat-main\venv\Lib\site-packages\langchain\cache.py", line 255, in update
    return adapt(
  File "D:\chat-main\venv\Lib\site-packages\gptcache\adapter\adapter.py", line 22, in adapt
    embedding_data = time_cal(
  File "D:\chat-main\venv\Lib\site-packages\gptcache\__init__.py", line 25, in inner
    res = func(*args, **kwargs)
  File "D:\chat-main\venv\Lib\site-packages\gptcache\embedding\onnx.py", line 58, in to_embeddings
    ort_outputs = self.ort_session.run(None, ort_inputs)
  File "D:\Program Files (x86)\Python311\Lib\site-packages\onnxruntime\capi\onnxruntime_inference_collection.py", line 200, in run
    return self._sess.run(output_names, input_feed, run_options)
onnxruntime.capi.onnxruntime_pybind11_state.InvalidArgument: [ONNXRuntimeError] : 2 : INVALID_ARGUMENT : Unexpected input data type. Actual: (tensor(int32)) , expected: (tensor(int64))

Expected Behavior

No response

Steps To Reproduce

No response

Environment

No response

Anything else?

No response
