auto-evaluator's Introduction

Auto-evaluator 🧠📝

Context

Document Question-Answering is a popular LLM use-case. LangChain makes it easy to assemble LLM components (e.g., models and retrievers) into chains that support question-answering: input documents are split into chunks and stored in a retriever, relevant chunks are retrieved given a user question and passed to an LLM for synthesis into an answer.

Challenge

The quality of QA systems can vary considerably; for example, we have seen cases of hallucination and poor answer quality due to specific parameter settings. But it is not always obvious how to (1) evaluate answer quality in a systematic way and (2) use this evaluation to guide improved QA chain settings (e.g., chunk size) or components (e.g., model or retriever choice).

App overview

This app aims to address the above limitations. Recent work from Anthropic has used model-written evaluation sets. OpenAI and others have shown that model-graded evaluation is an effective way to evaluate models. This app combines both of these ideas into a single workspace, auto-generating a QA test set and auto-grading the result of the specified QA chain.

image

Usage

The app can be used in two ways:

  • Demo: We pre-loaded a document (a transcript of the Lex Fridman podcast with Andrej Karpathy) and a set of 5 question-answer pairs from the podcast. You can configure QA chain(s) and run an experiment.

image

  • Playground: Input a set of documents that you want to ask questions about. Optionally, also include your own test set of question-answer pairs related to the documents; see an example here. If you do not supply a test set, the app will auto-generate one. If the test set is smaller than the desired number of eval questions specified in the top left, the app will auto-generate the remainder.

image

Building the document retrieval:

  • The app will build a retriever for the input documents.
  • A retriever is a LangChain abstraction that accepts a question and returns a set of relevant documents.
  • The retriever can be selected by the user in the drop-down list in the configurations (red panel above).
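The retrieval step above can be sketched in pure Python. This is a simplified illustration of what a retriever does (embed, score, return top-k), not LangChain's actual implementation; the bag-of-words "embedding" stands in for real learned text embeddings:

```python
from collections import Counter
import math

def embed(text):
    # Toy "embedding": a bag-of-words count vector (real retrievers use
    # learned text embeddings such as OpenAI's).
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse count vectors.
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(question, chunks, k=2):
    # Score every chunk against the question and return the top-k.
    q = embed(question)
    scored = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return scored[:k]

chunks = ["Karpathy discusses neural networks.",
          "The podcast covers self-driving cars.",
          "Lex asks about AGI timelines."]
print(retrieve("What did Karpathy say about neural networks?", chunks, k=1))
```

The user-selectable retrievers (similarity search, SVM, TF-IDF) all follow this same question-in, chunks-out contract; only the scoring method differs.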

Test set generation:

  • The app will auto-generate a test set of question-answer pairs on the doc(s).
  • To do this, it uses LangChain's QAGenerationChain with the default prompt here.
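Conceptually, the generation step prompts the model for a JSON question-answer pair and parses the result. The prompt text below is a hypothetical stand-in (the real default prompt lives in the LangChain repo), and the model response is mocked:

```python
import json

# Hypothetical prompt for illustration; not LangChain's actual default.
QA_GEN_PROMPT = """You are a teacher creating a quiz.
Given the following document, generate one question-answer pair.
Return JSON with keys "question" and "answer".

Document: {doc}"""

def parse_qa_pair(llm_output):
    # The chain expects the model to return a JSON object; parsing can fail
    # on malformed output, so callers should be prepared to retry.
    pair = json.loads(llm_output)
    return pair["question"], pair["answer"]

# Example with a mocked model response:
mock = '{"question": "Who hosts the podcast?", "answer": "Lex Fridman"}'
print(parse_qa_pair(mock))
```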

LLM question-answering:

  • For each question, we use a RetrievalQA chain to answer it.
  • This will fetch chunks that are relevant to the question from the retriever and pass them to the LLM.
  • We expose the QA_CHAIN_PROMPT used to pass this context to the LLM here.
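The context-stuffing step above reduces to filling a prompt template with the retrieved chunks. A hand-rolled sketch (not the actual QA_CHAIN_PROMPT text):

```python
# Illustrative template; the app's real prompt is linked in the repo.
QA_CHAIN_PROMPT = """Use the following pieces of context to answer the question.
If you don't know the answer, just say you don't know.

Context:
{context}

Question: {question}
Answer:"""

def build_qa_prompt(question, retrieved_chunks):
    # Concatenate the retrieved chunks into the {context} slot.
    context = "\n\n".join(retrieved_chunks)
    return QA_CHAIN_PROMPT.format(context=context, question=question)

prompt = build_qa_prompt(
    "Who is the guest?",
    ["Andrej Karpathy joins the show.", "They discuss Tesla."],
)
print(prompt)
```

This is why chunk size and the number of retrieved chunks matter: everything placed in {context} competes for the model's attention and the context window.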

Model-graded evaluation:

  • We let the user select from a number of model-graded evaluation prompts:

(1) The app will evaluate the relevance of the retrieved documents relative to the question.

(2) The app will evaluate the similarity of the LLM-generated answer relative to the ground-truth answer.

  • The prompts for both can be seen here.
  • Users can select which grading prompt to use. Here are some notes on prompt selection from our experience.
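Whatever prompt is chosen, the grading step ultimately reduces to extracting a PASS/FAIL verdict from the grader model's free-text output. A minimal sketch (the "GRADE:" marker is an assumed output format for illustration, not necessarily what the real prompts produce):

```python
def parse_grade(grader_output):
    # Look for a final "GRADE: PASS" / "GRADE: FAIL" line; descriptive
    # prompts may precede the verdict with a justification paragraph.
    for line in reversed(grader_output.strip().splitlines()):
        if line.upper().startswith("GRADE:"):
            return line.split(":", 1)[1].strip().upper()
    return "UNKNOWN"

print(parse_grade("The answer matches the ground truth.\nGRADE: PASS"))
```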

Experimental results:

  • The app will produce a table summarizing the results.
  • It shows the question and the ground truth (expected) answer.
  • It shows the chain-generated answer.
  • It shows the binary score (PASS / FAIL) for retrieval and the answer.
  • It shows the latency for retrieval and LLM answer summarization per question.
  • It shows the model grader output (the raw output of the grading prompt).

image

User inputs

The left panel of the app (shown in red in the above image) has several user-configurable parameters.

Number of eval questions - This is the number of question-answer pairs to auto-generate for the given input documents. As mentioned above, question-answer pair auto-generation uses LangChain's QAGenerationChain with the prompt specified here.

Chunk size - Number of characters per chunk when the input documents are split. This can impact answer quality. Retrievers often use text embedding similarity to select chunks related to the question. If the chunks are too large, each chunk may contain more information unrelated to the question, which may degrade the summarized answer quality. If chunks are too small, important context may be left out of the retrieved chunks.

Overlap - The overlap in characters between chunks.
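The interaction between chunk size and overlap can be illustrated with a naive character splitter (a simplification for intuition; the app actually uses LangChain splitters such as RecursiveTextSplitter, which prefer to break on separators rather than at fixed offsets):

```python
def split_chars(text, chunk_chars=1000, overlap=100):
    # Slide a window of chunk_chars over the text, stepping forward by
    # chunk_chars - overlap so adjacent chunks share trailing context.
    step = chunk_chars - overlap
    return [text[i:i + chunk_chars]
            for i in range(0, max(len(text) - overlap, 1), step)]

doc = "x" * 2500
chunks = split_chars(doc, chunk_chars=1000, overlap=100)
print([len(c) for c in chunks])
```

With the defaults above, a 2,500-character document yields three chunks of 1,000, 1,000, and 700 characters, with each pair of neighbors sharing 100 characters.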

Embedding - The method used to embed chunks.

Retriever - The method used to retrieve chunks that are relevant to the user question. The default vector database used for similarity search is FAISS, but support for others is a welcome addition. You can also try other methods, such as SVM or TF-IDF.

Number of chunks to retrieve - Number of chunks retrieved. More chunks can improve performance by giving the LLM more context for answer summarization.

Model - LLM for summarization of retrieved chunks into the answer.

Grading prompt style - The prompt choice for model-graded evaluation. As mentioned above, the prompts can be seen here. More prompts would be a welcome addition. For example, with the Descriptive prompt, you will see a more detailed output with model grade justification.

Logging experiments

A user can select the desired configuration and then choose Re-Run Experiment.

This will run the new chain on the existing test set.

The results from all experiments will be summarized in the table and chart.

image

Contributing

Run the backend from the api folder:

 pip install -r requirements.txt
 uvicorn evaluator_app:app

Test the API locally:

curl -X POST -F "files=@docs/karpathy-lex-pod/karpathy-pod.txt" -F "num_eval_questions=1" -F "chunk_chars=1000" -F "overlap=100" -F "split_method=RecursiveTextSplitter" -F "retriever_type=similarity-search" -F "embeddings=OpenAI" -F "model_version=gpt-3.5-turbo" -F "grade_prompt=Fast" -F "num_neighbors=3" http://localhost:8000/evaluator-stream
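The same request can be issued from Python. This sketch assumes the `requests` library is installed; the form fields mirror the curl command above:

```python
# Form fields mirroring the curl command above.
params = {
    "num_eval_questions": "1",
    "chunk_chars": "1000",
    "overlap": "100",
    "split_method": "RecursiveTextSplitter",
    "retriever_type": "similarity-search",
    "embeddings": "OpenAI",
    "model_version": "gpt-3.5-turbo",
    "grade_prompt": "Fast",
    "num_neighbors": "3",
}

if __name__ == "__main__":
    import requests  # assumed available: pip install requests
    with open("docs/karpathy-lex-pod/karpathy-pod.txt", "rb") as f:
        resp = requests.post(
            "http://localhost:8000/evaluator-stream",
            data=params,
            files={"files": f},
        )
    print(resp.status_code)
```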

Run the frontend from the nextjs folder and view the web app at the specified URL (e.g., http://localhost:3000/):

yarn install
yarn dev

Environment Variables

Front-end:

.env.local contains the env variables needed to run the project.

Back-end:

Specify the API keys for any models that you want to use.

OPENAI_API_KEY=
ANTHROPIC_API_KEY=

Deployment

The front-end is deployed to Vercel.

The back-end is deployed to Railway.

auto-evaluator's People

Contributors

avibanerjee, barniker, benisgold, dankolesnikov, hwchase17, rlancemartin


auto-evaluator's Issues

Back-end crashing

Reports crashed -

image

Server crashed -

image

Logging is not obvious -

INFO:     Uvicorn running on http://0.0.0.0:7106/ (Press CTRL+C to quit)
INFO:uvicorn.error:Uvicorn running on http://0.0.0.0:7106/ (Press CTRL+C to quit)
INFO:     192.168.0.2:33390 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:04:42,541 loglevel=INFO   logger=evaluator_app run_evaluator() L334  Reading file: PosteVivereProtetti_CGA.pdf
2023-05-05 09:04:42,544 loglevel=INFO   logger=evaluator_app run_evaluator() L338  File PosteVivereProtetti_CGA.pdf is a PDF
2023-05-05 09:05:01,346 loglevel=INFO   logger=evaluator_app run_evaluator() L355  Splitting texts
2023-05-05 09:05:01,347 loglevel=INFO   logger=evaluator_app split_texts() L80   `Splitting doc ...`
2023-05-05 09:05:01,375 loglevel=INFO   logger=evaluator_app run_evaluator() L358  Make LLM
model!
gpt-4
2023-05-05 09:05:01,376 loglevel=INFO   logger=evaluator_app run_evaluator() L361  Make retriever
2023-05-05 09:05:01,376 loglevel=INFO   logger=evaluator_app make_retriever() L120  `Making retriever ...`
2023-05-05 09:05:07,734 loglevel=INFO   logger=evaluator_app run_evaluator() L365  Make chain
2023-05-05 09:05:07,735 loglevel=INFO   logger=evaluator_app generate_eval() L46   `Generating eval QA pair ...`
2023-05-05 09:05:32,805 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:05:50,387 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:05:55,938 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
INFO:     192.168.0.4:36928 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:06:33,392 loglevel=INFO   logger=evaluator_app run_evaluator() L334  Reading file: karpathy-pod.txt
2023-05-05 09:06:33,392 loglevel=INFO   logger=evaluator_app run_evaluator() L347  File karpathy-pod.txt is a TXT
2023-05-05 09:06:33,392 loglevel=INFO   logger=evaluator_app run_evaluator() L355  Splitting texts
2023-05-05 09:06:33,392 loglevel=INFO   logger=evaluator_app split_texts() L80   `Splitting doc ...`
2023-05-05 09:06:33,506 loglevel=INFO   logger=evaluator_app run_evaluator() L358  Make LLM
model!
gpt-3.5-turbo
2023-05-05 09:06:33,506 loglevel=INFO   logger=evaluator_app run_evaluator() L361  Make retriever
2023-05-05 09:06:33,506 loglevel=INFO   logger=evaluator_app make_retriever() L120  `Making retriever ...`
2023-05-05 09:06:35,081 loglevel=INFO   logger=evaluator_app run_evaluator() L365  Make chain
2023-05-05 09:06:35,082 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:06:38,116 loglevel=INFO   logger=evaluator_app generate_eval() L46   `Generating eval QA pair ...`
2023-05-05 09:06:48,428 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:06:57,001 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:07:00,491 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:07:02,402 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:07:04,823 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:07:15,961 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:07:22,977 loglevel=INFO   logger=evaluator_app generate_eval() L46   `Generating eval QA pair ...`
2023-05-05 09:07:26,002 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:07:38,376 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:07:39,099 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:07:47,511 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
INFO:     192.168.0.4:56978 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:07:50,268 loglevel=INFO   logger=evaluator_app run_evaluator() L334  Reading file: karpathy-pod.txt
2023-05-05 09:07:50,269 loglevel=INFO   logger=evaluator_app run_evaluator() L347  File karpathy-pod.txt is a TXT
2023-05-05 09:07:50,269 loglevel=INFO   logger=evaluator_app run_evaluator() L355  Splitting texts
2023-05-05 09:07:50,269 loglevel=INFO   logger=evaluator_app split_texts() L80   `Splitting doc ...`
2023-05-05 09:07:50,386 loglevel=INFO   logger=evaluator_app run_evaluator() L358  Make LLM
model!
gpt-4
2023-05-05 09:07:50,386 loglevel=INFO   logger=evaluator_app run_evaluator() L361  Make retriever
2023-05-05 09:07:50,386 loglevel=INFO   logger=evaluator_app make_retriever() L120  `Making retriever ...`
2023-05-05 09:07:50,743 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:07:51,940 loglevel=INFO   logger=evaluator_app run_evaluator() L365  Make chain
2023-05-05 09:07:51,940 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:08:01,489 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:08:09,404 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:08:14,427 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:08:24,486 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:08:28,668 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:08:35,293 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
INFO:     192.168.0.4:60550 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:08:45,730 loglevel=INFO   logger=evaluator_app run_evaluator() L334  Reading file: PosteVivereProtetti_CGA.pdf
2023-05-05 09:08:45,731 loglevel=INFO   logger=evaluator_app run_evaluator() L338  File PosteVivereProtetti_CGA.pdf is a PDF
2023-05-05 09:08:49,638 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:08:55,964 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:08:59,938 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:09:04,879 loglevel=INFO   logger=evaluator_app run_evaluator() L355  Splitting texts
2023-05-05 09:09:04,879 loglevel=INFO   logger=evaluator_app split_texts() L80   `Splitting doc ...`
2023-05-05 09:09:04,906 loglevel=INFO   logger=evaluator_app run_evaluator() L358  Make LLM
model!
gpt-4
2023-05-05 09:09:04,906 loglevel=INFO   logger=evaluator_app run_evaluator() L361  Make retriever
2023-05-05 09:09:04,906 loglevel=INFO   logger=evaluator_app make_retriever() L120  `Making retriever ...`
2023-05-05 09:09:16,493 loglevel=INFO   logger=evaluator_app grade_model_answer() L177  `Grading model answer ...`
2023-05-05 09:09:20,289 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L205  `Grading relevance of retrieved docs ...`
2023-05-05 09:09:24,466 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`
2023-05-05 09:09:30,584 loglevel=INFO   logger=evaluator_app run_evaluator() L365  Make chain
2023-05-05 09:09:30,585 loglevel=INFO   logger=evaluator_app run_eval() L236  `Running eval ...`

Prod (likely) hitting OOM

TL;DR: on prod we see OOM (or the server returning a 503).

These 15 large PDFs fail (503 server error) on prod, but work locally:

curl -X POST \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream

The first 7 work on prod w/ similarity-search:

curl -X POST \                                                                                        
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream

They are ~17 MB.

Adding any additional files fails (e.g., the request below fails):

curl -X POST \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream

Add instructions to the demo landing page

Top (Text box, grey background)

Welcome to the auto-evaluator! This is an app to evaluate the performance of question-answering LLM chains. This demo has pre-loaded two things: (1) a document (the Lex Fridman podcast with Andrej Karpathy) and (2) a "test set" of question-answer pairs for this episode. The aim is to evaluate the performance of various question-answering LLM chain configurations against the test set. You can build any QA chain from the components and score its performance.

Button (Text box, green background)

Choose the question-answering chain configuration (left) and launch an experiment using the button below. For more detail on each setting, see the full documentation here.

  • Color: green
  • Title: run experiment

Summary

  • Re-name initial row as baseline

Experiment Results (Text box, grey background)

This table shows each question-answer pair from the test set along with the model's answer to the question. The app will score two things: (1) the relevance of the retrieved documents relative to the question and (2) the similarity of the LLM-generated answer relative to the ground-truth answer. The prompts for both can be seen here and can be chosen by the user in the drop-down list Grading prompt style. The FAST prompt will only have the LLM grader output the score. The other prompts will also produce an explanation.

Fails to run w/ Llama-Ix

image

And gives the incorrect Insight
The experiment that performed the best was Experiment #2 due to combination of accuracy and latency.

Catch server errors and return alert to the user

Test w/ GPT-4 paper (example of invalid token):

ValueError: Encountered text corresponding to disallowed special token '<|endofprompt|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endofprompt|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endofprompt|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
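One workaround is to pre-screen uploaded text for special-token strings before tokenization. This is a pure-Python sketch; with tiktoken itself you would instead pass `disallowed_special=()` to `encode`, as the error message suggests:

```python
# Literal special-token strings that tiktoken rejects by default.
SPECIAL_TOKENS = {"<|endofprompt|>", "<|endoftext|>"}

def strip_special_tokens(text):
    # Remove literal special-token strings so the tokenizer's
    # disallowed-token check cannot fire on user-uploaded documents.
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return text

print(strip_special_tokens("GPT-4 paper text <|endofprompt|> more text"))
```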

Bug in the FAST scoring

Bad answer scored as PASS:
image

But using the DETAILED prompt, it is correctly marked as INCORRECT:
image

Missing logging / alert when back-end crashes

Back-end crashed according to Railway.

Did not see clear logging / alert in Sentry.

Last logging:

2023-05-02 17:31:03,008 loglevel=INFO   logger=evaluator_app grade_model_retrieval() L206  `Grading relevance of retrieved docs ...`
INFO:     Shutting down
2023-05-02 17:38:48,691 loglevel=INFO   logger=uvicorn.error shutdown() L253  Shutting down
INFO:     Waiting for application shutdown.
2023-05-02 17:38:48,793 loglevel=INFO   logger=uvicorn.error shutdown() L66   Waiting for application shutdown.
INFO:     Application shutdown complete.
2023-05-02 17:38:48,793 loglevel=INFO   logger=uvicorn.error shutdown() L77   Application shutdown complete.
INFO:     Finished server process [1]
2023-05-02 17:38:48,793 loglevel=INFO   logger=uvicorn.error serve() L85   Finished server process [1]

It is not obvious from this logging why the server crashed.

Remove Doppler dep

We still need Doppler for local testing. Should we remove it?

For local testing, we should create a .env template for folks to follow (e.g., with all required API keys).

Show intermediate states in the loading bar

How much effort would it take to make the "processing files" stage more granular (e.g., show making retriever, making eval set, etc.)? It would be slightly better UX, because this stage can hang for ~20-30 sec (especially if making the eval set).

image

Anthropic model appears to be deprecated

/opt/venv/lib/python3.8/site-packages/langchain/llms/anthropic.py:130: UserWarning: This Anthropic LLM is deprecated. Please use `from langchain.chat_models import ChatAnthropic` instead

Unbreak

In 7b3ef16 we removed doppler -

"dev": "doppler run -- next dev",

Removal of doppler means that some env variables were missing.

Both local (main) and prod are broken.

Adding these env variables to .env.local fixes local -

NEXT_PUBLIC_EVALUATOR_API_URL="http://127.0.0.1:8000"
NEXT_PUBLIC_API_URL="http://127.0.0.1:8000"

What is required for prod to work?
