
gptscore's Introduction

GPTScore: Evaluate as You Desire

This is the source code for the paper GPTScore: Evaluate as You Desire.

What is GPTScore?

GPTScore is a novel evaluation framework that utilizes the emergent abilities (e.g., zero-shot instruction) of Generative Pre-Trained models to Score generated texts.

The GPTScore evaluation framework supports:

  1. Customizable. Customized instructions and demonstrations enable the evaluation of new aspects without labeled datasets;
  2. Multifaceted. One evaluator performs multifaceted evaluations;
  3. Training-free.
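At its core, GPTScore casts evaluation as conditional generation: a hypothesis is scored by the average log-probability its tokens receive from a generative PLM, conditioned on an evaluation instruction (plus optional demonstrations) and the source input. Below is a minimal sketch of this idea using GPT-2 via Hugging Face Transformers; it illustrates the scoring principle only, is not this repository's implementation, and the example prompt is made up.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Illustrative sketch of the GPTScore principle (not the repo's code):
# score = average log p(hypothesis tokens | instruction + context).
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def gpt_score(prefix: str, hypothesis: str) -> float:
    """Average per-token log p(hypothesis | prefix) under the PLM."""
    prefix_ids = tokenizer(prefix, return_tensors="pt").input_ids
    hypo_ids = tokenizer(hypothesis, return_tensors="pt").input_ids
    input_ids = torch.cat([prefix_ids, hypo_ids], dim=1)
    with torch.no_grad():
        log_probs = torch.log_softmax(model(input_ids).logits, dim=-1)
    # Logits at position t predict token t+1, so the hypothesis tokens
    # are scored by the logits at positions [start-1, seq_len-1).
    start = prefix_ids.size(1)
    token_lps = log_probs[0, start - 1 : -1].gather(1, hypo_ids[0].unsqueeze(1))
    return token_lps.mean().item()

# Hypothetical aspect instruction and hypothesis:
print(gpt_score("Generate a fluent description of the restaurant:",
                " A cozy bistro serving fresh pasta downtown."))

A higher (less negative) score means the PLM finds the hypothesis more probable given the instruction and context, which is the signal GPTScore uses for aspect-wise quality.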

What PLMs does GPTScore support?

We explored 19 Pre-trained Language Models (PLMs) ranging in size from 80M (FLAN-T5-Small) to 175B (GPT3) to design GPTScore.
The PLMs studied in this paper are listed as follows:

| Model            | Parameter | Evaluator Name   | Model    | Parameter | Evaluator Name    |
|------------------|-----------|------------------|----------|-----------|-------------------|
| GPT3             |           |                  | OPT      |           |                   |
| text-ada-001     | 350M      | gpt3_score       | OPT-350M | 350M      | opt350m_score     |
| text-babbage-001 | 1.3B      | gpt3_score       | OPT-1.3B | 1.3B      | opt1_3B_score     |
| text-curie-001   | 6.7B      | gpt3_score       | OPT-6.7B | 6.7B      | opt6_7B_score     |
| text-davinci-001 | 175B      | gpt3_score       | OPT-13B  | 13B       | opt13B_score      |
| text-davinci-003 | 175B      | gpt3_score       | OPT-66B  | 66B       | opt66B_score      |
| FLAN-T5          |           |                  | GPT2     |           |                   |
| FT5-small        | 80M       | flan_small_score | GPT2-M   | 355M      | gpt2_medium_score |
| FT5-base         | 250M      | flan_base_score  | GPT2-L   | 774M      | gpt2_large_score  |
| FT5-L            | 770M      | flan_large_score | GPT2-XL  | 1.5B      | gpt2_xl_score     |
| FT5-XL           | 3B        | flan_xl_score    | GPT-J-6B | 6B        | gptJ6B_score      |
| FT5-XXL          | 11B       | flan_xxl_score   |          |           |                   |
  • Evaluator Name is the name of the evaluator corresponding to the model in the same row.

Usage

Use a GPT3-based model as the evaluator

Take evaluation with the GPT3 text-curie-001 model as an example.

  • Setting gpt3_score to True: the GPTScore evaluator uses a GPT3-based PLM.
  • Setting gpt3model to curie: the text-curie-001 model is utilized.
  • out_dir_name: set the folder for saving scoring results.
  • dataname: set the dataset name for evaluation (e.g., BAGEL).
  • aspect: set the aspect name to be evaluated (e.g., quality).

1. GPTScore with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

2. GPTScore with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

3. GPTScore without both Instruction and Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --gpt3_score True \
    --gpt3model "curie" \
    --out_dir_name "gpt3Score_based" \
    --aspect 'quality'

Use a non-GPT3-based model (e.g., OPT) as the evaluator

Here, we take evaluation with the OPT-350M model as an example.

  • Setting opt350m_score to True: use the evaluator named opt350m_score.
  • out_dir_name: set the folder for saving scoring results.
  • dataname: set the dataset name for evaluation (e.g., BAGEL).
  • aspect: set the aspect name to be evaluated (e.g., quality).

1. opt350m_score with Instruction and Demonstration

Set both use_demo and use_ist to True.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo True \
    --use_ist True \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'

2. opt350m_score with only Instruction

Set use_ist to True and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist True \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'

3. opt350m_score without both Instruction and Demonstration

Set both use_ist and use_demo to False.

python score_d2t.py \
    --dataname "BAGEL" \
    --use_demo False \
    --use_ist False \
    --opt350m_score True \
    --out_dir_name "optScore_based" \
    --aspect 'quality'

Bib

@article{fu2023gptscore,
  title={GPTScore: Evaluate as You Desire},
  author={Fu, Jinlan and Ng, See-Kiong and Jiang, Zhengbao and Liu, Pengfei},
  journal={arXiv preprint arXiv:2302.04166},
  year={2023}
}


gptscore's Issues

About the Evaluation of Dialogue Generation

GPTScore reports very thorough experimental results for generation-based evaluation across many downstream NLG tasks; thank you so much for your work.

Recently, I have also noticed that large-scale language models may become a universal and powerful evaluation method, and I have run some experiments on meta-evaluation benchmarks for the dialogue generation task, for example the Empathetic-Eval benchmark mentioned in MDD-Eval.

However, I find that GPT-3 and other publicly available large-scale language models show very limited correlation with human judgments (Pearson and Spearman scores). I notice that you only report experimental results on the FED-turn and FED-dialogue meta-evaluations. Have you observed results similar to mine on other meta-evaluation benchmarks (beyond FED-turn and FED-dialogue)?

Looking forward to your response.

gpt3.5 version

First of all, thanks for your great work!

As the GPT-3 InstructGPT models are deprecated, I am currently updating to gpt-3.5-turbo-instruct, but the logprobs and echo parameters are incompatible there; I cannot use the two parameters at the same time. How can I get the log-probs of the prompt?

Request examples for evaluating text summarization

Great work. I intend to use it for my abstractive text summarization paper. Would you be able to upload the related examples? I am interested in evaluating semantic coverage, factuality, informativeness, and fluency, following the aspect definitions in your paper.

Kind regards

How do you get logprobs from openai?

In gpt_inference.py, you calculate the loss via out['logprobs']:

loss = -sum(out['logprobs']["token_logprobs"][i:-1])

However, openai.Completion.create() does not include logprobs in its output by default.

Could you please tell me how you get the logprobs?
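For reference, the legacy Completions endpoint (pre-1.0 openai SDK, now-deprecated GPT-3 models) could return log-probs for the prompt tokens themselves when called with echo=True and a logprobs value; a sketch under that assumption, with an illustrative prompt:

import openai

# Sketch (legacy openai<1.0 SDK, deprecated GPT-3 Completions endpoint):
# echo=True echoes the prompt back with per-token log-probs, and
# max_tokens=0 suppresses any new generation.
resp = openai.Completion.create(
    engine="text-curie-001",
    prompt="Instruction, context, and the hypothesis to be scored",
    max_tokens=0,
    echo=True,
    logprobs=0,
)
token_logprobs = resp["choices"][0]["logprobs"]["token_logprobs"]
# token_logprobs[0] is None (the first token has no context), so skip it.
loss = -sum(token_logprobs[1:])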

About results on NewsRoom

Hi, thanks for your brilliant and comprehensive work.

I am currently working on using GPTScore to evaluate the NewsRoom benchmark. I found that there are three different annotations for each sample in the dataset, which are not very consistent with each other. If I take all of them for correlation analysis, the results are far below those reported.

I would like to know how you processed the NewsRoom dataset for correlation analysis (e.g., averaging the three annotations, or something else). Thanks, and looking forward to your response.

Multilingual Evaluation

Hi, really appreciate the nice work here! I understand that GPTScore is mainly designed for English text evaluation. I am just wondering: do you think it makes sense to use GPTScore to evaluate other languages, at least high-resource languages such as French and Chinese? Thanks!
