langchain-ai / auto-evaluator
Home Page: https://autoevaluator.langchain.com/
License: Other
If the text consistently produces a JSONDecodeError during QA generation, the app may get stuck in a while loop.
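A bounded retry is one way to avoid the hang. This is only a sketch, not the app's actual code; `generate_pair` stands in for whatever function produces the raw QA JSON:

```python
import json

MAX_RETRIES = 5  # cap the loop so a consistently bad text can't hang the app

def generate_qa_pair_with_retry(generate_pair):
    """Call `generate_pair` (returns a JSON string) until it parses, up to MAX_RETRIES."""
    for _attempt in range(MAX_RETRIES):
        raw = generate_pair()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            continue  # malformed output; try again
    return None  # give up instead of looping forever
```

The caller can then skip the chunk or surface an error when `None` comes back, rather than spinning.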
I kick off a job, then refresh the client.
The API will proceed. But:
1/ This does not happen if the API and front-end are both deployed locally.
2/ This does not happen if only the front-end is deployed locally.
It appears to be a problem with Vercel (the remotely deployed front-end).
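One way to make a job survive a client refresh is to detach it from the request entirely and let the client poll by id. A stdlib-only sketch; the names `start_job` and `JOBS` are hypothetical, not from the repo:

```python
import threading
import uuid

JOBS = {}  # job_id -> {"status": ..., "result": ...}

def start_job(fn, *args):
    """Run `fn` in a daemon thread; return a job id the client can poll later."""
    job_id = str(uuid.uuid4())
    JOBS[job_id] = {"status": "running", "result": None}

    def worker():
        try:
            JOBS[job_id]["result"] = fn(*args)
            JOBS[job_id]["status"] = "done"
        except Exception as exc:  # record failures instead of dying silently
            JOBS[job_id] = {"status": "error", "result": str(exc)}

    threading.Thread(target=worker, daemon=True).start()
    return job_id
```

With something like this, a refresh only drops the client's stream; the evaluation itself keeps running server-side.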
/opt/venv/lib/python3.8/site-packages/langchain/llms/anthropic.py:130: UserWarning: This Anthropic LLM is deprecated. Please use `from langchain.chat_models import ChatAnthropic` instead
It is timing out; it may require a LangChain upgrade.
Top
(Text box, grey background)
Welcome to the auto-evaluator! This is an app to evaluate the performance of question-answering LLM chains. This demo has pre-loaded two things: (1) a document (the Lex Fridman podcast with Andrej Karpathy) and (2) a "test set" of question-answer pairs for this episode. The aim is to evaluate the performance of various question-answering LLM chain configurations against the test set. You can build any QA chain from the components and score its performance.
Button
(Text box, green background)
Choose the question-answering chain configuration (left) and launch an experiment using the button below. For more detail on each setting, see the full documentation here.
Summary
baseline
Experiment Results
(Text box, grey background)
This table shows each question-answer pair from the test set along with the model's answer to the question. The app will score two things: (1) the relevance of the retrieved documents relative to the question and (2) the similarity of the LLM-generated answer relative to the ground-truth answer. The prompts for both can be seen here and can be chosen by the user in the Grading prompt style
drop-down list. The FAST
prompt will have the LLM grader output only the score. The other prompts will also produce an explanation.
Apparently Harrison needs to do this; I don't have edit permission.
And it gives the incorrect Insight:
The experiment that performed the best was Experiment #2 due to a combination of accuracy and latency.
Still need Doppler for local testing.
Should we remove it?
And for local testing we should create a .env
template for folks to follow (e.g., with all required API keys).
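Such a template might look like the following. The two `NEXT_PUBLIC_*` variables are the ones noted elsewhere in these issues; the API key names are assumptions about what the backend reads, so check them against the code:

```shell
# auto-evaluator/.env.local (front-end)
NEXT_PUBLIC_EVALUATOR_API_URL="http://127.0.0.1:8000"
NEXT_PUBLIC_API_URL="http://127.0.0.1:8000"

# api/.env (back-end; key names assumed, verify against evaluator_app)
OPENAI_API_KEY=""
ANTHROPIC_API_KEY=""
```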
PR is #14
TL;DR: we see OOM (or the server returning a 503 error) on prod.
These 15 large PDFs fail (503 server error) on prod, but work on local:
curl -X POST \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream
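The `chunk_chars=1000` / `overlap=100` parameters above describe a sliding-window split. The app itself uses LangChain's RecursiveCharacterTextSplitter, which splits on separators rather than fixed offsets; this is only a rough stdlib approximation of what those two numbers mean:

```python
def split_text(text, chunk_chars=1000, overlap=100):
    """Naive fixed-width splitter: each chunk shares `overlap` chars with the previous one."""
    step = chunk_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunk = text[start:start + chunk_chars]
        if chunk:
            chunks.append(chunk)
        if start + chunk_chars >= len(text):
            break
    return chunks
```

Memory use scales with total characters across all uploaded PDFs, which is consistent with many large files tipping the prod container into OOM.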
The first 7 work on prod w/ similarity-search:
curl -X POST \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream
They are ~17 MB.
Adding any additional file fails (e.g., the request below fails):
curl -X POST \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "[email protected]" \
-F "num_eval_questions=1" \
-F "chunk_chars=1000" \
-F "overlap=100" \
-F "split_method=RecursiveTextSplitter" \
-F "retriever_type=similarity-search" \
-F "embeddings=OpenAI" \
-F "model_version=gpt-3.5-turbo" \
-F "grade_prompt=Fast" \
-F "num_neighbors=3" \
https://auto-evaluator-production.up.railway.app/evaluator-stream
Sometimes the output will have one fewer pair than the number of eval pairs requested.
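One plausible guard is to keep generating until the requested count is actually reached, with a cap so a bad document can't loop forever. A sketch, not the app's code; `gen_pair` is a stand-in that returns a QA dict or `None` on failure:

```python
def generate_eval_set(gen_pair, num_requested, max_attempts_per_pair=3):
    """Top up the eval set until `num_requested` pairs exist or attempts run out."""
    pairs = []
    attempts = 0
    while len(pairs) < num_requested and attempts < num_requested * max_attempts_per_pair:
        attempts += 1
        pair = gen_pair()
        if pair is not None:  # only count successfully generated pairs
            pairs.append(pair)
    return pairs
```

This turns an occasional off-by-one into either the full set or a visibly short one when generation genuinely can't succeed.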
Allow user to specify the prompt (both for the QA task and for the grading) in the UI.
Reports crashed -
Server crashed -
Logging is not obvious -
INFO: Uvicorn running on http://0.0.0.0:7106/ (Press CTRL+C to quit)
INFO:uvicorn.error:Uvicorn running on http://0.0.0.0:7106/ (Press CTRL+C to quit)
INFO: 192.168.0.2:33390 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:04:42,541 loglevel=INFO logger=evaluator_app run_evaluator() L334 Reading file: PosteVivereProtetti_CGA.pdf
2023-05-05 09:04:42,544 loglevel=INFO logger=evaluator_app run_evaluator() L338 File PosteVivereProtetti_CGA.pdf is a PDF
2023-05-05 09:05:01,346 loglevel=INFO logger=evaluator_app run_evaluator() L355 Splitting texts
2023-05-05 09:05:01,347 loglevel=INFO logger=evaluator_app split_texts() L80 `Splitting doc ...`
2023-05-05 09:05:01,375 loglevel=INFO logger=evaluator_app run_evaluator() L358 Make LLM
model!
gpt-4
2023-05-05 09:05:01,376 loglevel=INFO logger=evaluator_app run_evaluator() L361 Make retriever
2023-05-05 09:05:01,376 loglevel=INFO logger=evaluator_app make_retriever() L120 `Making retriever ...`
2023-05-05 09:05:07,734 loglevel=INFO logger=evaluator_app run_evaluator() L365 Make chain
2023-05-05 09:05:07,735 loglevel=INFO logger=evaluator_app generate_eval() L46 `Generating eval QA pair ...`
2023-05-05 09:05:32,805 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:05:50,387 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:05:55,938 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
INFO: 192.168.0.4:36928 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:06:33,392 loglevel=INFO logger=evaluator_app run_evaluator() L334 Reading file: karpathy-pod.txt
2023-05-05 09:06:33,392 loglevel=INFO logger=evaluator_app run_evaluator() L347 File karpathy-pod.txt is a TXT
2023-05-05 09:06:33,392 loglevel=INFO logger=evaluator_app run_evaluator() L355 Splitting texts
2023-05-05 09:06:33,392 loglevel=INFO logger=evaluator_app split_texts() L80 `Splitting doc ...`
2023-05-05 09:06:33,506 loglevel=INFO logger=evaluator_app run_evaluator() L358 Make LLM
model!
gpt-3.5-turbo
2023-05-05 09:06:33,506 loglevel=INFO logger=evaluator_app run_evaluator() L361 Make retriever
2023-05-05 09:06:33,506 loglevel=INFO logger=evaluator_app make_retriever() L120 `Making retriever ...`
2023-05-05 09:06:35,081 loglevel=INFO logger=evaluator_app run_evaluator() L365 Make chain
2023-05-05 09:06:35,082 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:06:38,116 loglevel=INFO logger=evaluator_app generate_eval() L46 `Generating eval QA pair ...`
2023-05-05 09:06:48,428 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:06:57,001 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:07:00,491 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:07:02,402 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:07:04,823 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:07:15,961 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:07:22,977 loglevel=INFO logger=evaluator_app generate_eval() L46 `Generating eval QA pair ...`
2023-05-05 09:07:26,002 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:07:38,376 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:07:39,099 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:07:47,511 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
INFO: 192.168.0.4:56978 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:07:50,268 loglevel=INFO logger=evaluator_app run_evaluator() L334 Reading file: karpathy-pod.txt
2023-05-05 09:07:50,269 loglevel=INFO logger=evaluator_app run_evaluator() L347 File karpathy-pod.txt is a TXT
2023-05-05 09:07:50,269 loglevel=INFO logger=evaluator_app run_evaluator() L355 Splitting texts
2023-05-05 09:07:50,269 loglevel=INFO logger=evaluator_app split_texts() L80 `Splitting doc ...`
2023-05-05 09:07:50,386 loglevel=INFO logger=evaluator_app run_evaluator() L358 Make LLM
model!
gpt-4
2023-05-05 09:07:50,386 loglevel=INFO logger=evaluator_app run_evaluator() L361 Make retriever
2023-05-05 09:07:50,386 loglevel=INFO logger=evaluator_app make_retriever() L120 `Making retriever ...`
2023-05-05 09:07:50,743 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:07:51,940 loglevel=INFO logger=evaluator_app run_evaluator() L365 Make chain
2023-05-05 09:07:51,940 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:08:01,489 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:08:09,404 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:08:14,427 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:08:24,486 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:08:28,668 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:08:35,293 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
INFO: 192.168.0.4:60550 - "POST /evaluator-stream HTTP/1.1" 200 OK
2023-05-05 09:08:45,730 loglevel=INFO logger=evaluator_app run_evaluator() L334 Reading file: PosteVivereProtetti_CGA.pdf
2023-05-05 09:08:45,731 loglevel=INFO logger=evaluator_app run_evaluator() L338 File PosteVivereProtetti_CGA.pdf is a PDF
2023-05-05 09:08:49,638 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:08:55,964 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:08:59,938 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:09:04,879 loglevel=INFO logger=evaluator_app run_evaluator() L355 Splitting texts
2023-05-05 09:09:04,879 loglevel=INFO logger=evaluator_app split_texts() L80 `Splitting doc ...`
2023-05-05 09:09:04,906 loglevel=INFO logger=evaluator_app run_evaluator() L358 Make LLM
model!
gpt-4
2023-05-05 09:09:04,906 loglevel=INFO logger=evaluator_app run_evaluator() L361 Make retriever
2023-05-05 09:09:04,906 loglevel=INFO logger=evaluator_app make_retriever() L120 `Making retriever ...`
2023-05-05 09:09:16,493 loglevel=INFO logger=evaluator_app grade_model_answer() L177 `Grading model answer ...`
2023-05-05 09:09:20,289 loglevel=INFO logger=evaluator_app grade_model_retrieval() L205 `Grading relevance of retrieved docs ...`
2023-05-05 09:09:24,466 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
2023-05-05 09:09:30,584 loglevel=INFO logger=evaluator_app run_evaluator() L365 Make chain
2023-05-05 09:09:30,585 loglevel=INFO logger=evaluator_app run_eval() L236 `Running eval ...`
We are probably missing the API key?
GPT-3
https://arxiv.org/pdf/2005.14165.pdf
Galactica
https://arxiv.org/abs/2211.09085
Chinchilla
https://arxiv.org/abs/2203.15556
The Default Test Set is correctly loaded:
https://github.com/dankolesnikov/auto-evaluator-app/blob/main/api/docs/karpathy-lex-pod/karpathy-pod-eval.csv
But, when the app is first opened, the Q-A pairs in the Experiment Results
table don't match the Test Set.
In 7b3ef16 we removed doppler -
"dev": "doppler run -- next dev",
Removal of doppler means that some env variables were missing.
Both local (main) and prod are broken.
Adding these env variables to .env.local fixes local -
NEXT_PUBLIC_EVALUATOR_API_URL="http://127.0.0.1:8000"
NEXT_PUBLIC_API_URL="http://127.0.0.1:8000"
What is required for prod to work?
Back-end crashed according to Railway.
Did not see clear logging / alert in Sentry.
Last logging:
2023-05-02 17:31:03,008 loglevel=INFO logger=evaluator_app grade_model_retrieval() L206 `Grading relevance of retrieved docs ...`
INFO: Shutting down
2023-05-02 17:38:48,691 loglevel=INFO logger=uvicorn.error shutdown() L253 Shutting down
INFO: Waiting for application shutdown.
2023-05-02 17:38:48,793 loglevel=INFO logger=uvicorn.error shutdown() L66 Waiting for application shutdown.
INFO: Application shutdown complete.
2023-05-02 17:38:48,793 loglevel=INFO logger=uvicorn.error shutdown() L77 Application shutdown complete.
INFO: Finished server process [1]
2023-05-02 17:38:48,793 loglevel=INFO logger=uvicorn.error serve() L85 Finished server process [1]
Not quite obvious why it crashed based on this logging.
Test w/ GPT-4 paper (example of invalid token):
ValueError: Encountered text corresponding to disallowed special token '<|endofprompt|>'.
If you want this text to be encoded as a special token, pass it to `allowed_special`, e.g. `allowed_special={'<|endofprompt|>', ...}`.
If you want this text to be encoded as normal text, disable the check for this token by passing `disallowed_special=(enc.special_tokens_set - {'<|endofprompt|>'})`.
To disable this check for all special tokens, pass `disallowed_special=()`.
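The simplest fix is probably the one the traceback itself suggests: pass `disallowed_special=()` to tiktoken's `encode`. If we instead want to strip these markers from uploaded text before tokenization, a stdlib-only sanitizer could look like this (the token list is an assumption and only partial):

```python
# Special-token markers that tiktoken refuses by default (partial list; extend as needed).
SPECIAL_TOKENS = ("<|endoftext|>", "<|endofprompt|>")

def strip_special_tokens(text):
    """Remove special-token markers so encoding with default checks won't raise."""
    for tok in SPECIAL_TOKENS:
        text = text.replace(tok, "")
    return text
```

Stripping changes the document's content slightly, so passing `disallowed_special=()` and encoding the markers as plain text is likely the less invasive option.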
This is b/c they are no longer comparable.