hegelai / prompttools

Open-source tools for prompt testing and experimentation, with support for both LLMs (e.g. OpenAI, LLaMA) and vector databases (e.g. Chroma, Weaviate, LanceDB).

Home Page: http://prompttools.readthedocs.io

License: Apache License 2.0

Python 100.00%
deep-learning large-language-models machine-learning prompt-engineering python embeddings llms vector-search developer-tools

prompttools's Introduction


PromptTools

🔧 Test and experiment with prompts, LLMs, and vector databases. 🔨


Welcome to prompttools, created by Hegel AI! This repo offers a set of open-source, self-hostable tools for experimenting with, testing, and evaluating LLMs, vector databases, and prompts. The core idea is to let developers evaluate using familiar interfaces: code, notebooks, and a local playground.

In just a few lines of code, you can test your prompts and parameters across different models (whether you are using OpenAI, Anthropic, or LLaMA models). You can even evaluate the retrieval accuracy of vector databases.

from prompttools.experiment import OpenAIChatExperiment

# Each entry is a separate chat conversation to test.
messages = [
    [{"role": "user", "content": "Tell me a joke."},],
    [{"role": "user", "content": "Is 17077 a prime number?"},],
]

models = ["gpt-3.5-turbo", "gpt-4"]
temperatures = [0.0]
# Runs every combination of model, message, and temperature, then renders a results table.
openai_experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
openai_experiment.run()
openai_experiment.visualize()

(Screenshot: results table rendered by visualize())
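You can then score the results with an evaluation function. Below is a minimal sketch using the semantic_similarity utility from prompttools.utils; the expected answers are illustrative, and the expected list is assumed to need one entry per result row, in the same order as the rows shown by visualize().

from prompttools.utils import semantic_similarity

# Score each response against an expected answer
# (one expected entry per result row, matching the order of the visualized table).
openai_experiment.evaluate(
    "similar_to_expected",
    semantic_similarity,
    expected=["Here is a joke.", "Yes, 17077 is a prime number."] * len(models),
)
openai_experiment.visualize()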

To stay in touch with us about issues and future updates, join the Discord.

Quickstart

To install prompttools, you can use pip:

pip install prompttools

You can run a simple prompttools example locally with the following commands:

git clone https://github.com/hegelai/prompttools.git
cd prompttools && jupyter notebook examples/notebooks/OpenAIChatExperiment.ipynb

You can also run the notebook in Google Colab

Playground

If you want to interact with prompttools using our playground interface, you can launch it with the following commands.

First, install the required packages:

pip install notebook  # If jupyter notebook has not been installed
pip install prompttools

Then, clone the git repo and launch the streamlit app:

git clone https://github.com/hegelai/prompttools.git
cd prompttools && streamlit run prompttools/playground/playground.py

You can also access a hosted version of the playground on the Streamlit Community Cloud.

Note: The hosted version does not support LlamaCpp

Documentation

Our documentation website contains the full API reference and more description of individual components. Check it out!

Supported Integrations

Here is a list of APIs that we support with our experiments:

LLMs

  • OpenAI (Completion, ChatCompletion, Fine-tuned models) - Supported
  • LLaMA.Cpp (LLaMA 1, LLaMA 2) - Supported
  • HuggingFace (Hub API, Inference Endpoints) - Supported
  • Anthropic - Supported
  • Mistral AI - Supported
  • Google Gemini - Supported
  • Google PaLM (legacy) - Supported
  • Google Vertex AI - Supported
  • Azure OpenAI Service - Supported
  • Replicate - Supported
  • Ollama - In Progress

Vector Databases and Data Utility

  • Chroma - Supported
  • Weaviate - Supported
  • Qdrant - Supported
  • LanceDB - Supported
  • Milvus - Exploratory
  • Pinecone - Supported
  • Epsilla - In Progress

Frameworks

  • LangChain - Supported
  • MindsDB - Supported
  • LlamaIndex - Exploratory

Computer Vision

  • Stable Diffusion - Supported
  • Replicate's hosted Stable Diffusion - Supported

If there is an API you'd like to see supported, please open an issue or a PR to add it. Feel free to discuss it in our Discord channel as well.

Frequently Asked Questions (FAQs)

  1. Will this library forward my LLM calls to a server before sending them to providers such as OpenAI and Anthropic?

    • No, the source code is executed on your machine. Any call to LLM APIs is made directly from your machine, without any forwarding.
  2. Does prompttools store my API keys or LLM inputs and outputs to a server?

    • No, all of that data stays on your local machine. We do not collect any PII (personally identifiable information).
  3. How do I persist my results?

    • To persist the results of your tests and experiments, you can export your Experiment with the methods to_csv, to_json, to_lora_json, or to_mongo_db. We are building more persistence features and would be happy to discuss your use cases, pain points, and which export options would be useful for you.
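    For example, a minimal sketch of persisting results after running an experiment (the filenames are illustrative and the exact signatures may differ; see the documentation):

      openai_experiment.to_csv("experiment_results.csv")    # flat table of prompts, parameters, and responses
      openai_experiment.to_json("experiment_results.json")  # the same results as JSON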

Usage Tracking (Sentry)

Because our API is changing rapidly, some errors arise from our own mistakes or out-of-date documentation. To improve the user experience, we collect data from normal package usage to help us understand the errors that are raised. This data is collected and sent to Sentry, a third-party error-tracking service commonly used in open-source software. It only logs this library's own actions.

You can easily opt out by defining an environment variable called SENTRY_OPT_OUT.
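For example, a minimal sketch in Python (per the note above, defining the variable before importing prompttools should be enough; the value here is arbitrary):

import os

# Opt out of Sentry usage tracking before prompttools is imported.
os.environ["SENTRY_OPT_OUT"] = "1"

import prompttools  # imported after opting out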

Contributing

We welcome PRs and suggestions! Don't hesitate to open a PR/issue or to reach out to us via email. Please have a look at our contribution guide and "Help Wanted" issues to get started!

Usage and Feedback

We will be delighted to work with early adopters to shape our designs. Please reach out to us via email if you're interested in using this tooling for your project or have any feedback.

License

We will be gradually releasing more components to the open-source community. The current license can be found in the LICENSE file. If there is any concern, please contact us and we will be happy to work with you.

prompttools's People

Contributors

ayushexel, bweber-rebellion, greydoubt, hashemalsaket, imalsky, jonatng, kacperlukawski, mmmaia, nivekt, pramitbhatia25, steventkrawczyk, tonykipkemboi


prompttools's Issues

GoogleVertexChatExperiment.ipynb notebook does not work

๐Ÿ› Describe the bug

I downloaded the notebook GoogleVertexChatExperiment.ipynb and tried to run it and I get this error:

----> 1 from prompttools.experiment import GoogleVertexChatCompletionExperiment
      4 model = ["chat-bison"]
      6 context = ["You are a helpful assistant.",
      7            "Answer the following question only if you know the answer or can make a well-informed guess; otherwise tell me you don't know it. In addition, explain your reasoning of your final answer."]

File ~/.pyenv/versions/3.11.5/envs/onc/lib/python3.11/site-packages/prompttools/__init__.py:7
      1 # Copyright (c) Hegel AI, Inc.
      2 # All rights reserved.
      3 #
      4 # This source code's license can be found in the
      5 # LICENSE file in the root directory of this source tree.
----> 7 from .prompttest import prompttest
      8 from .sentry import init_sentry
     11 init_sentry()

File ~/.pyenv/versions/3.11.5/envs/onc/lib/python3.11/site-packages/prompttools/prompttest/prompttest.py:13
     11 from .threshold_type import ThresholdType
     12 from .error.failure import PromptTestSetupException
---> 13 from .runner.runner import run_prompttest
     15 TESTS_TO_RUN = []
     18 def prompttest(
     19     metric_name: str,
...
     42         ),
     43         before_sleep=before_sleep_log(logging.getLogger(__name__), logging.WARNING),
     44     )

AttributeError: module 'openai' has no attribute 'APIConnectionError'

Not sure why GoogleVertexChatExperiment depends on openai.

The relevant packages I have installed locally are:

openai==1.3.4
prompttools==0.0.43

Running on macOS with Python 3.11.5.

PyPI release missing utils

๐Ÿ› Describe the bug

I was able to copy the utils folder from a git clone to get the README example to work, but utils is missing from the PyPI release.

(Screenshot: hegel_missing_utils)

Pydantic import issues

๐Ÿ› Describe the bug

When running experiment.evaluate("similar_to_expected", measure_similarity), I got the following error.

---------------------------------------------------------------------------
PydanticImportError Traceback (most recent call last)
Cell In[8], line 1
----> 1 experiment.evaluate("similar_to_expected", measure_similarity)

File G:\RajArun\os\PromptTools\env\Lib\site-packages\prompttools\experiment\experiment.py:172, in Experiment.evaluate(self, metric_name, eval_fn, input_pairs)
169 return
170 for i, result in enumerate(self.results):
171 # Pass the messages and results into the eval function
--> 172 score = eval_fn(
173 input_pairs[self.argument_combos[i][1]]
174 if input_pairs
175 else self.argument_combos[i][1],
176 result,
177 {
178 name: self.scores[name][i]
179 for name in self.scores.keys()
180 if name is not metric_name
181 },
182 )
183 self.scores[metric_name].append(score)
184 if self.use_scribe:

Cell In[7], line 17, in measure_similarity(messages, results, metadata)
10 def measure_similarity(
11 messages: List[Dict[str, str]], results: Dict, metadata: Dict
12 ) -> float:
13 """
14 A simple test that checks semantic similarity between the user input
15 and the model's text responses.
16 """
---> 17 distances = [
18 similarity.compute(EXPECTED[messages[1]["content"]], response)
19 for response in extract_responses(results)
20 ]
21 return min(distances)

Cell In[7], line 18, in (.0)
10 def measure_similarity(
11 messages: List[Dict[str, str]], results: Dict, metadata: Dict
12 ) -> float:
13 """
14 A simple test that checks semantic similarity between the user input
15 and the model's text responses.
16 """
17 distances = [
---> 18 similarity.compute(EXPECTED[messages[1]["content"]], response)
19 for response in extract_responses(results)
20 ]
21 return min(distances)

File G:\RajArun\os\PromptTools\env\Lib\site-packages\prompttools\utils\similarity.py:53, in compute(doc1, doc2, use_chroma)
51 def compute(doc1, doc2, use_chroma=True):
52 if use_chroma:
---> 53 return _from_chroma(doc1, doc2)
54 else:
55 return _from_huggingface(doc1, doc2)

File G:\RajArun\os\PromptTools\env\Lib\site-packages\prompttools\utils\similarity.py:43, in _from_chroma(doc1, doc2)
42 def _from_chroma(doc1, doc2):
---> 43 chroma_client = _get_chroma_client()
44 collection = chroma_client.create_collection(name="test_collection")
45 collection.add(documents=[doc1], ids=["id1"])

File G:\RajArun\os\PromptTools\env\Lib\site-packages\prompttools\utils\similarity.py:27, in _get_chroma_client()
25 def _get_chroma_client():
26 if len(CHROMA_CLIENT) == 0:
---> 27 import chromadb
29 CHROMA_CLIENT.append(chromadb.Client())
30 return CHROMA_CLIENT[0]

File G:\RajArun\os\PromptTools\env\Lib\site-packages\chromadb\__init__.py:1
----> 1 import chromadb.config
2 import logging
3 from chromadb.telemetry.events import ClientStartEvent

File G:\RajArun\os\PromptTools\env\Lib\site-packages\chromadb\config.py:1
----> 1 from pydantic import BaseSettings
2 from typing import Optional, List, Any, Dict, TypeVar, Set, cast, Iterable, Type
3 from typing_extensions import Literal

File G:\RajArun\os\PromptTools\env\Lib\site-packages\pydantic\__init__.py:206, in __getattr__(attr_name)
204 dynamic_attr = _dynamic_imports.get(attr_name)
205 if dynamic_attr is None:
--> 206 return _getattr_migration(attr_name)
208 from importlib import import_module
210 module = import_module(_dynamic_imports[attr_name], package=package)

File G:\RajArun\os\PromptTools\env\Lib\site-packages\pydantic\_migration.py:279, in getattr_migration.<locals>.wrapper(name)
277 return import_string(DEPRECATED_MOVED_IN_V2[import_path])
278 if import_path == 'pydantic:BaseSettings':
--> 279 raise PydanticImportError(
280 'BaseSettings has been moved to the pydantic-settings package. '
281 f'See https://docs.pydantic.dev/{VERSION}/migration/#basesettings-has-moved-to-pydantic-settings '
282 'for more details.'
283 )
284 if import_path in REMOVED_IN_V2:
285 raise PydanticImportError(f'{import_path} has been removed in V2.')

PydanticImportError: BaseSettings has been moved to the pydantic-settings package. See https://docs.pydantic.dev/2.0.2/migration/#basesettings-has-moved-to-pydantic-settings for more details.

For further information visit https://errors.pydantic.dev/2.0.2/u/import-error

The issue occurs when prompttools creates the Chroma client, due to the pydantic v2 BaseSettings migration.


Fixing this required changes in chromadb.

PR raised to address this: chroma-core/chroma#775

Add common benchmarks

🚀 The feature

We need to add benchmark test sets so folks can run them on models / embeddings / systems.

A few essentials:

  • BEIR for information retrieval
  • MTEB for embeddings
  • Some stuff from HELM (e.g. ROUGE, BLEU) for LLMs

Motivation, pitch

Users have told us that they want to run academic benchmarks as "smoke tests" on new models.

Alternatives

No response

Additional context

No response

Ollama Integration

🚀 The feature

We want to integrate with Ollama: https://github.com/jmorganca/ollama

Goal is to build an experiment, like the LlamaCpp example: https://github.com/hegelai/prompttools/blob/main/prompttools/experiment/experiments/llama_cpp_experiment.py

Motivation, pitch

Ollama is a new way to run local models. It would be good to support this so developers can compare Ollama to LlamaCpp and other models/frameworks.

Alternatives

No response

Additional context

No response

Error when running prompttools

๐Ÿ› Describe the bug

After installing the dependencies and running the app with streamlit I get:

ModuleNotFoundError: No module named 'pyperclip'

Neither streamlit nor pyperclip seems to be listed as a dependency.

Add support for Qdrant

🚀 The feature

Qdrant is a popular vector similarity search engine and vector database.

Motivation, pitch

We can integrate with its APIs to allow experimentation on various configurations and look at its performance.

Alternatives

No response

Additional context

You can have a look at the Chroma or Weaviate experiments for inspiration. If you would like to work on this, comment and let us know! We will be more than happy to support you.

Examples are mixed/stacking up

📚 The doc issue

Under examples/notebooks there are many example notebooks, covering LLMs, text, images, and DBs. Should we store them in separate directories?

Suggest a potential fix

examples
└── notebooks
    ├── benchmark
    ├── image_experiments
    │   └── StableDiffusionExperiment.ipynb
    ├── llm_experiments
    │   └── LlamaCppExperiment.ipynb
    ├── db_experiments
    │   └── LanceDBExperiment.ipynb
    ├── audio_experiments
    │   └── MusicGenExperiment.ipynb
    └── video_experiments

Robustness evaluation

🚀 The feature

Request from potential user: "There are two main aspects: 1) adjusting prompts so that changing semantic words does not trigger hallucination, 2) the prompt itself is such that the LLM doesn't slip away from instruction"

Idea: for (1) use prompt templates to substitute words, run evals to check semantic similarity of all results. For (2) use auto-evaluation given instruction, prompt, and response to determine if the LLM followed instructions.
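A minimal sketch of idea (1), reusing the OpenAIChatExperiment and semantic_similarity utilities shown elsewhere in this repo; the paraphrased prompts and expected answer are illustrative only.

from prompttools.experiment import OpenAIChatExperiment
from prompttools.utils import semantic_similarity

# Paraphrases of the same question, produced by substituting semantically similar words.
variants = [
    [{"role": "user", "content": "Who wrote the novel Moby-Dick?"}],
    [{"role": "user", "content": "Which author penned the book Moby-Dick?"}],
]

experiment = OpenAIChatExperiment(["gpt-3.5-turbo"], variants, temperature=[0.0])
experiment.run()
# If the model is robust to the rewording, every variant should score close to the expected answer.
experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Herman Melville"] * len(variants))
experiment.visualize()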

Motivation, pitch

We got this request from a potential user, and robustness is a common concern in LLM evaluation.

Alternatives

No response

Additional context

No response

ModuleNotFoundError: Package `llama-cpp-python` is required to be installed to use this experiment

โ‰๏ธ Discussion/Question

Suggested: Please use pip install llama-cpp-python to install the package.

I am encountering difficulties installing the llama-cpp-python package in Google Colab using the command pip install llama-cpp-python. Despite following the installation instructions, the package does not seem to be recognized. This issue is hindering my ability to proceed with the experiment.

ModuleNotFoundError                       Traceback (most recent call last)
.ipynb Cell 3 line 1
      9 temperatures = [0.0, 1.0]
     11 call_params = dict(temperature=temperatures)
---> 13 experiment = LlamaCppExperiment(model_paths, prompts, call_params=call_params)

File ~/anaconda3/envs/llm_testing/lib/python3.10/site-packages/prompttools/experiment/experiments/llama_cpp_experiment.py:107, in LlamaCppExperiment.__init__(self, model_path, prompt, model_params, call_params)
     99 def __init__(
    100     self,
    101     model_path: List[str],
   (...)
    104     call_params: Dict[str, list[object]] = {},
    105 ):
    106     if Llama is None:
--> 107         raise ModuleNotFoundError(
    108             "Package `llama-cpp-python` is required to be installed to use this experiment."
    109             "Please use `pip install llama-cpp-python` to install the package"
    110         )
    111     self.completion_fn = self.llama_completion_fn
    112     self.model_params = model_params

Vector Database Experiment Support

Add support for experimenting across embeddings, vector DBs, queries, etc.

Some components that we will include:

  • Low level experiment for vector DBs
  • Harness for document retrieval tests
  • prompttest runner for document retrieval

More testing utilities

🚀 The feature

  • LLM-generated expected responses
  • Move auto-eval to a utility function
  • LLM chooses between multiple responses

Motivation, pitch

We want more pre-built evaluation functions and utilities

Alternatives

No response

Additional context

No response

LlamaIndex Integration

🚀 The feature

We need an experiment or example for LlamaIndex. This will be similar to the work we need to do for LangChain and MindsDB.

Motivation, pitch

LlamaIndex is a popular framework for connecting LLMs to data. It would be good to support LlamaIndex testing using our existing vectorDB and LLM evaluations.

Alternatives

N/A

Additional context

N/A

Add support for LanceDB

🚀 The feature

LanceDB is a developer-friendly, serverless vector database.

Motivation, pitch

We can integrate with its APIs to allow experimentation on various configurations and look at its performance.

Alternatives

No response

Additional context

You can have a look at the Chroma or Weaviate for inspiration. If you would like to work on this, comment and let us know! We will be more than happy to support you.

Local LLM Support / Examples

🚀 The feature

Support local LLMs experiments. This would include adding experiments for Local LLM Chat and Completion APIs, harnesses and unit test runners for those experiments, and providing a few examples in notebooks.

We should start with huggingface models, and look at containerized models as well.

Motivation, pitch

Today we only support OpenAI models. We need a way to support local models as well.

Alternatives

No response

Additional context

No response

Example in README doesn't work

๐Ÿ› Describe the bug

The example in the README doesn't work. I installed prompttools and upgraded openai to the latest version. The following snippet is what I ran.

from prompttools.experiment import OpenAIChatExperiment

prompts = ["Tell me a joke.", "Is 17077 a prime number?"]
models = ["gpt-3.5-turbo"]
temperatures = [0.0]
openai_experiment = OpenAIChatExperiment(models, prompts, temperature=temperatures)
openai_experiment.run()

This throws the following error when calling run.

openai.error.InvalidRequestError: 'Tell me a joke.' is not of type 'array' - 'messages'

UI Features

🚀 The feature

  • Right now we only support adjusting the temperature parameter; it would be good to add other parameters to the left-side pane
  • Connect to the to_mongo_db method and other exporters
  • Support aggregations / charting in the streamlit UI
  • General cleanup / refactoring of the streamlit code

Motivation, pitch

We have a beta version of the UI out, but there are still many features missing and the code could be cleaned up greatly

Alternatives

No response

Additional context

No response

It can't load with llama_load_model_from_file: failed to load model

โ‰๏ธ Discussion/Question

Hi.

I'm just getting started with prompttools. I have two problems; I will post them separately.

First, I couldn't load a downloaded model. I get the error below when testing LlamaCppExperiment.ipynb. I tried different models and got the same errors.

Environments

  • CPU : M2
  • RAM : 16GB
  • OS : Ventura 13.2.1
  • Python : 3.11.5
  • prompttools : 0.0.43

Logs

gguf_init_from_file: invalid magic characters tjgg(๏ฟฝ๏ฟฝk๏ฟฝ.
error loading model: llama_model_loader: failed to load model from /Users/sewonist/Downloads/llama-2-7b-chat.ggmlv3.q2_K.bin

llama_load_model_from_file: failed to load model
AVX = 0 | AVX2 = 0 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 0 | NEON = 1 | ARM_FMA = 1 | F16C = 0 | FP16_VA = 1 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 0 | SSSE3 = 0 | VSX = 0 |
AssertionError                            Traceback (most recent call last)
Cell In[3], line 1
----> 1 experiment.run()

File ~/Projects/13.AIChat/05.Projects/prompttools/prompttools/experiment/experiments/llama_cpp_experiment.py:177, in LlamaCppExperiment.run(self, runs)
    175 latencies = []
    176 for model_combo in self.model_argument_combos:
--> 177     client = Llama(**model_combo)
    178     for call_combo in self.call_argument_combos:
    179         for _ in range(runs):

File ~/anaconda3/envs/prompttools/lib/python3.11/site-packages/llama_cpp/llama.py:923, in Llama.__init__(self, model_path, n_gpu_layers, main_gpu, tensor_split, vocab_only, use_mmap, use_mlock, seed, n_ctx, n_batch, n_threads, n_threads_batch, rope_scaling_type, rope_freq_base, rope_freq_scale, yarn_ext_factor, yarn_attn_factor, yarn_beta_fast, yarn_beta_slow, yarn_orig_ctx, mul_mat_q, f16_kv, logits_all, embedding, last_n_tokens_size, lora_base, lora_scale, lora_path, numa, chat_format, chat_handler, verbose, **kwargs)
    920 self.chat_format = chat_format
    921 self.chat_handler = chat_handler
--> 923 self._n_vocab = self.n_vocab()
    924 self._n_ctx = self.n_ctx()
    926 self._token_nl = self.token_nl()

File ~/anaconda3/envs/prompttools/lib/python3.11/site-packages/llama_cpp/llama.py:2184, in Llama.n_vocab(self)
   2182 def n_vocab(self) -> int:
   2183     """Return the vocabulary size."""
-> 2184     return self._model.n_vocab()

File ~/anaconda3/envs/prompttools/lib/python3.11/site-packages/llama_cpp/llama.py:250, in _LlamaModel.n_vocab(self)
    249 def n_vocab(self) -> int:
--> 250     assert self.model is not None
    251     return llama_cpp.llama_n_vocab(self.model)

AssertionError:

I wonder whether LlamaCpp doesn't support the M series? Please let me know if you have any ideas.

Thanks.

Support for OSS Models behind an API?

🚀 The feature

Right now, mainly proprietary LLMs are supported. It would be great to also support DIY/OSS LLMs, for instance those hosted on Databricks Model Serving endpoints, or, more holistically, LLMs deployed behind a web API running in a container.

Motivation, pitch

I think this will be super useful for people or companies who want to do prompt engineering with open-source LLMs. I'm also happy to work on this feature.

Alternatives

Could also allow for testing prompts targeting models that are running on the local machine.

Additional context

No response

PaLM Support

Add experiments for Google PaLM, and harnesses to compare between model providers.

Harness Visualize does not accept String Literals

๐Ÿ› Describe the bug

In the image below, you can see that the harness visualize call produced nothing where there should have been some output.
(Screenshot: empty harness visualize output)

I would treat this as a bug; string literals are so ubiquitous in Python-based LLM apps now that it would be crazy not to support them.

MindsDB integration

🚀 The feature

MindsDB is a server for AI logic
https://github.com/mindsdb/mindsdb

We can add a MindsDbExperiment to test that AI logic

Motivation, pitch

MindsDB is a large OSS project for AI orchestration, and it would be good to support them with our test framework

Alternatives

N/A

Additional context

We have gotten a few users asking about MindsDB

experiment.evaluate() shows stale evaluation results

๐Ÿ› Describe the bug

Hi folks,

Thanks again for your work on this library.

I noticed an issue where similarity scores do not get updated when I change my expected fields. Only when I re-run the experiment are the values updated.

Bug

(Screenshot: stale evaluation results)

Steps to reproduce:

models = ["gpt-3.5-turbo", "gpt-3.5-turbo-0613"]
messages = [
    [
        {"role": "system", "content": "Who is the first president of the US? Give me only the name"},
    ]
]
temperatures = [0.0]

experiment = OpenAIChatExperiment(models, messages, temperature=temperatures)
experiment.run()
experiment.visualize()

from prompttools.utils import semantic_similarity

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["George Washington"] * 2)
experiment.visualize()

from prompttools.utils import semantic_similarity

experiment.evaluate("similar_to_expected", semantic_similarity, expected=["Lady Gaga"] * 2)
experiment.visualize() # the evaluation results here indicate that "Lady Gaga" is semantically identical to "George Washington"

In my opinion, evaluate() should re-compute metrics every time it is run, rather than depending on / being coupled to another function (run()). I haven't tested other eval_fns, but it could be worth checking whether this is the case for them as well.

Error running example.py

๐Ÿ› Describe the bug

Ran into this error when running example.py

ERROR:root:Authentication error. Skipping request.
ERROR:root:Authentication error. Skipping request.
WARNING:root:Can't find similar_to_expected in scores. Did you run `evaluate`?
Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/Users/qianh1/Code/prompttools/examples/prompttests/example.py", line 43, in <module>
    prompttest.main()
  File "/Users/qianh1/Code/prompttools/prompttools/testing/prompttest.py", line 108, in main
    failures = sum([test() for test in TESTS_TO_RUN])
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qianh1/Code/prompttools/prompttools/testing/prompttest.py", line 108, in <listcomp>
    failures = sum([test() for test in TESTS_TO_RUN])
                    ^^^^^^
  File "/Users/qianh1/Code/prompttools/prompttools/testing/prompttest.py", line 58, in runs_test
    return run_prompt_template_test(
           ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/qianh1/Code/prompttools/prompttools/testing/runner/prompt_template_runner.py", line 82, in run_prompt_template_test
    scored_template[prompt_template] < threshold
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

Anthropic Support

Add experiments for Anthropic, and harnesses to compare between model providers.

Function Calling Experiment Support

Add support for function calling, evaluating functions, chaining LLM call with function calls, etc.

Some components will include:

  • Low level function calling experiment
  • Function calling experiment harness
  • Eval functions for structured outputs
  • Prompttest support for function calling
  • Experiments across function calls and tools

Add support for other models in AutoEval

🚀 The feature

This is a good task for a new contributor

We have a few utility functions to perform AutoEval:

https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval.py
https://github.com/hegelai/prompttools/blob/main/prompttools/utils/autoeval_scoring.py
https://github.com/hegelai/prompttools/blob/main/prompttools/utils/expected.py

Currently, they tend to only support one model each. Someone could refactor the code for each of them to support multiple models. I would recommend making sure they all support the best-known models, such as GPT-4 and Claude 2 (a rough sketch of one possible model dispatch is shown after the task list below).

We can even consider LLaMA, but that is less urgent.

Tasks

  • Update this file such that autoeval_binary_scoring can take in model as an argument. Let's make sure gpt-4 and claude-2 are both accepted and invoke the right completion function.
  • Same as the above but for this file and the function autoeval_scoring. OpenAI needs to be added here.
  • Same as the above but for this file and the function compute_similarity_against_model
  • Allow auto evaluation by multiple models (e.g. both gpt-4 and claude-2) at the same time
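A rough, hypothetical sketch of the model dispatch described above; the helper names and signatures here are illustrative, not the current prompttools API.

from typing import Callable, Dict

def _gpt4_judge(judge_prompt: str) -> str:
    # Assumes the openai>=1.0 client; the API key is read from the environment.
    from openai import OpenAI
    response = OpenAI().chat.completions.create(
        model="gpt-4", messages=[{"role": "user", "content": judge_prompt}]
    )
    return response.choices[0].message.content

def _claude2_judge(judge_prompt: str) -> str:
    # Assumes the anthropic Messages API client.
    from anthropic import Anthropic
    response = Anthropic().messages.create(
        model="claude-2.1", max_tokens=32,
        messages=[{"role": "user", "content": judge_prompt}],
    )
    return response.content[0].text

# Dispatch table: judge model name -> completion function that knows how to call it.
_JUDGES: Dict[str, Callable[[str], str]] = {"gpt-4": _gpt4_judge, "claude-2": _claude2_judge}

def autoeval_binary_scoring(prompt: str, response: str, model: str = "gpt-4") -> str:
    judge_prompt = (
        "Grade the following answer as RIGHT or WRONG.\n"
        f"Question: {prompt}\nAnswer: {response}"
    )
    return _JUDGES[model](judge_prompt)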

Motivation, pitch

Allowing people to auto-evaluate with different top models would be ideal.

Alternatives

No response

Additional context

No response

Issue with HuggingFaceHubExperiment: Unable to Use Models Other Than gpt2

โ‰๏ธ Discussion/Question

I'm currently facing an issue while attempting to use the HuggingFaceHubExperiment for testing models from Hugging Face. Specifically, it seems that only the gpt2 model is functional in this experiment, while other models from Hugging Face do not work. I am trying to use other models available from HuggingFace.

I've tried inputting Model IDs from Hugging Face, and I've also explored the available models on https://huggingface.co/api/models. Unfortunately, none of the models, aside from gpt2, seem to be compatible with the HuggingFaceHubExperiment.

Is there a specific reason for this limitation, or am I overlooking something in the setup process? I would greatly appreciate any guidance or insights you could provide to help resolve this issue.

Streamlit playground support for localLLMs

🚀 The feature

Support local LLMs by connecting the Streamlit playground with a local dev container.
The user could run the LLM locally and provide an endpoint to connect it with Streamlit, or we could provide a wrapper function for the LLM to connect with Streamlit.

Motivation, pitch

The motivation behind this is Colab, where you can connect a notebook to your local machine; what if we could do the same for the Streamlit playground?

Alternatives

No response

Additional context

No response

Improve link sharing

🚀 The feature

Today, we only support link sharing for 1 instruction and 1 prompt, and we don't capture configuration like temperature and other variables.

First, we should support capturing configuration.
Then, we should support multiple prompts and instructions.

We need to be mindful of the total size of the URL, and look for alternatives if it gets too long.

Motivation, pitch

Link sharing supports collaboration and helps more folks use the prompttools playground.

Alternatives

No response

Additional context

No response

Support for Google Vertex AI

🚀 The feature

Support for Google's PaLM 2 API is documented, but there's no mention of Google's Vertex AI model support (e.g. chat-bison). Please add support for this.

Motivation, pitch

Google's Vertex AI offerings are appropriate where data isolation is required. We use Vertex AI models internally and would like to use Prompt Tools in conjunction with them

Alternatives

No response

Additional context

No response

Playground can't load openai experiment

โ‰๏ธ Discussion/Question

Hi

When I run the playground, it doesn't work and shows the error below.

2023-11-25 19:24:35.521 Uncaught app exception
Traceback (most recent call last):
  File "/Users/sewonist/anaconda3/envs/prompttools/lib/python3.11/site-packages/streamlit/runtime/scriptrunner/script_runner.py", line 534, in _run_script
    exec(code, module.__dict__)
  File "/Users/sewonist/Projects/13.AIChat/05.Projects/prompttools/prompttools/playground/playground.py", line 196, in <module>
    placeholders[i][j].markdown(df["response"][i + len(prompts) * j])
                                ~~^^^^^^^^^^^^
TypeError: 'NoneType' object is not subscriptable

I added logging to data_loader.py

    print(experiment)
    # <prompttools.experiment.experiments.openai_chat_experiment.OpenAIChatExperiment object at 0x133ffe650>
    
    df = experiment.to_pandas_df()
    print(df) 
    # None

    return df

It looks like experiment.to_pandas_df() returned None. Could this be a problem with the package version?

Environments

CPU : M2
RAM : 16GB
OS : Ventura 13.2.1
Python : 3.11.5
prompttools : 0.0.43

Wrong Argument Order in LlamaCpp Experiment

๐Ÿ› Describe the bug

While executing examples/notebooks/LlamaCppExperiment.ipynb, we are testing the temperature parameter

temperatures = [0.0, 1.0]

But it is passed as model_params in LlamaCppExperiment rather than as call_params; specifically, it ends up being used as {'lora_path': 1.0} when it should be {'temperature': 1.0} instead.

This is likely due to how self.all_args is constructed and subsequently used in Experiment.

MusicGen Support

🚀 The feature

Support measurement of audio, sound, music, and similar models, starting with MusicGen by Facebook.

Motivation, pitch

Audio, sound, and music generation models are growing rapidly, and the expectation is this will only speed up. To continue supporting a multi-modal approach, prompttools should support testing audio models.

Alternatives

No response

Additional context

Example: https://huggingface.co/spaces/facebook/MusicGen

Error in experiment.evaluate() in introductory example OpenAIChatExperiment.ipynb

Hi folks, thanks for creating this tool.

I'm trying out prompttools and was following the introductory example (OpenAIChatExperiment.ipynb) listed on the quickstart page and encountered this error. I can reproduce the error locally and on the provided Colab notebook

๐Ÿ› Describe the bug

This is the line that raises an error:

experiment.evaluate("similar_to_expected", similarity.evaluate, expected="George Washington")

And this is the error: TypeError: evaluate() missing 2 required positional arguments: 'response' and 'metadata'

(Screenshot: TypeError raised by experiment.evaluate)

SQL Validator

🚀 The feature

We need a SQL validator for evaluation, like what we have for JSON and Python.

We can use this package: https://pypi.org/project/sqlvalidator/
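A rough sketch of what such an eval function might look like (hypothetical helper, not part of prompttools today), using the sqlvalidator package linked above:

import sqlvalidator

def validate_sql(response: str) -> float:
    """Return 1.0 if the model's response parses as valid SQL, else 0.0 (illustrative only)."""
    query = sqlvalidator.parse(response)
    return 1.0 if query.is_valid() else 0.0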

Motivation, pitch

Many folks are building text-to-SQL engines, and this will be a common validation.

Alternatives

No response

Additional context

No response

HuggingFace Inference Endpoint support

🚀 The feature

Today we support the HuggingFace Inference API, which provides lightweight instances for testing models.

There is a different feature, called Inference Endpoints, which is meant for larger models or more persistent instances. See https://huggingface.co/inference-endpoints

Motivation, pitch

We want to test across different infrastructure providers. Today, our only ways to test OSS models are (1) locally or (2) via the HuggingFace Inference API. We want to add HuggingFace Inference Endpoints as well.

Alternatives

N/A

Additional context

N/A

Version Control

🚀 The feature

  • A way to store inference results
  • A way to store old prompts along with their aggregate stats
  • A way to designate which version is live in production
  • (Stretch) Track branching changes

Motivation, pitch

After a few conversations with users, how we manage prompt versions has become a question on a lot of folks' minds.

Alternatives

  • Wrap a DB that stores prompts
  • Git-like system
  • Leave as is - tell people to export prompts and manage it themselves

Additional context

Example: https://www.reddit.com/r/PromptEngineering/comments/1589sm0/how_can_i_manage_prompts_better_for_my_project/

LangChain Support

Add Harnesses and Experiments to support testing LangChain chains natively.

Components will include:

  • Low level chain and agent experiments
  • Chain and agent harnesses
  • Step-by-step visualizations, and support for evaluating intermediate outputs

Stable diffusion support

🚀 The feature

We should look into experiments and eval functions for image models, like stable diffusion

Motivation, pitch

Many GenAI apps are multi-modal, and we'll have to support more than LLMs in the long term.

Alternatives

No response

Additional context

No response

Add ingestion harness for vectorDB experiments

🚀 The feature

We need a way to experiment with different chunking + ingestion strategies. We have some "raw" documents we want to ingest into a vector database, and there are different ways of transforming those "raw" documents into the documents we end up vectorizing: we can ingest them as is, "chunk" them into 10-line chunks, or do other pre-processing to extract keywords and relevant phrases.
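As a concrete illustration of the chunking strategy described above, here is a minimal, hypothetical helper (not part of prompttools) that splits a raw document into fixed-size line chunks before ingestion:

from typing import List

def chunk_by_lines(raw_document: str, lines_per_chunk: int = 10) -> List[str]:
    """Split a raw document into chunks of at most `lines_per_chunk` lines (illustrative only)."""
    lines = raw_document.splitlines()
    return [
        "\n".join(lines[i : i + lines_per_chunk])
        for i in range(0, len(lines), lines_per_chunk)
    ]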

Motivation, pitch

Talking to some customers about their needs regarding vector DB evaluation at scale.

Alternatives

No response

Additional context

No response
