
nustm / chatgpt-sentiment-evaluation


Can ChatGPT really understand the opinions, sentiments, and emotions contained in the text? We provide a preliminary evaluation.

Languages: Python 100.0%

Topics: aspect-based-sentiment-analysis, chatgpt, emotion-analysis, emotion-cause-pair-extraction, large-language-models, llm, sentiment-analysis, emotion-cause-extraction, open-domain, emotion-cause-analysis


Is ChatGPT a Good Sentiment Analyzer?

Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study [arXiv:2304.04339]

In this repo, we release the test sets we used for evaluation in our paper.

Introduction (TL;DR)

Recently, ChatGPT has drawn great attention from both the research community and the public. However, despite its huge success, we still know little about its capability boundaries, i.e., where it performs well and where it fails. We are particularly curious about how ChatGPT performs on sentiment analysis tasks: can it really understand the opinions, sentiments, and emotions contained in text?

To answer this question, we conduct a preliminary evaluation on 5 representative sentiment analysis tasks and 18 benchmark datasets, covering four different settings: standard evaluation, polarity shift evaluation, open-domain evaluation, and sentiment inference evaluation. On each task, we compare ChatGPT with fine-tuned BERT-based models and the corresponding SOTA models for reference.
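As a concrete illustration of the standard zero-shot setting, the sketch below builds a sentiment-classification prompt for a single test sentence. The template wording is a hypothetical example for illustration, not the exact prompt used in the paper.

```python
# Hypothetical zero-shot prompt builder for sentence-level sentiment
# classification; the instruction wording is illustrative only.
LABELS = ["positive", "negative", "neutral"]

def build_zero_shot_prompt(sentence: str) -> str:
    """Wrap a single test sentence in a classification instruction."""
    return (
        "Please classify the sentiment of the following sentence as "
        f"{', '.join(LABELS)}.\n"
        f"Sentence: {sentence}\n"
        "Sentiment:"
    )

print(build_zero_shot_prompt("The screen is gorgeous but the battery dies fast."))
```

The resulting string would be sent to the model as-is; the model's completion after "Sentiment:" is then mapped back to one of the candidate labels.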

Through rigorous evaluation, our findings are as follows:

  1. ChatGPT exhibits impressive zero-shot performance on sentiment classification tasks and can rival fine-tuned BERT, although it falls slightly behind the domain-specific fully-supervised SOTA models.
  2. ChatGPT appears to be less accurate on sentiment information extraction tasks such as E2E-ABSA. Upon closer inspection, we find that ChatGPT often generates reasonable answers even when they do not strictly match the textual expression. From this point of view, exact-matching evaluation in information extraction is not entirely fair to ChatGPT; in our human evaluation, ChatGPT still performs well on these tasks.
  3. Few-shot prompting (i.e., providing a few demonstration examples in the input) can significantly improve performance across various tasks, datasets, and domains, even surpassing fine-tuned BERT in some cases, though it still lags behind the SOTA models.
  4. When coping with the polarity shift phenomenon (e.g., negation and speculation), a challenging problem in sentiment analysis, ChatGPT can make more accurate predictions than fine-tuned BERT.
  5. Compared to the conventional practice of training domain-specific models, which typically generalize poorly to unseen domains, ChatGPT demonstrates powerful open-domain sentiment analysis ability in general, though it is worth noting that its performance is quite limited in a few specific domains.
  6. ChatGPT exhibits impressive sentiment inference ability, achieving performance comparable to the fully-supervised SOTA models we set up on both the emotion cause extraction (ECE) and emotion-cause pair extraction (ECPE) tasks.
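Finding 2 hinges on how exact-match scores are computed for extraction tasks such as E2E-ABSA. Below is a minimal sketch of micro precision/recall/F1 over (aspect term, sentiment) pairs, assuming gold and predicted answers are given as sets of string tuples; a prediction counts only on a verbatim match, which is exactly why reasonable paraphrases of the aspect term get penalized.

```python
def micro_f1(gold_pairs, pred_pairs):
    """Micro precision/recall/F1 over exact (aspect, sentiment) pairs.

    gold_pairs / pred_pairs: sets of (aspect_term, sentiment) tuples.
    A prediction counts only if both the aspect string and the
    sentiment label match the gold annotation verbatim.
    """
    tp = len(gold_pairs & pred_pairs)
    precision = tp / len(pred_pairs) if pred_pairs else 0.0
    recall = tp / len(gold_pairs) if gold_pairs else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

gold = {("battery life", "negative"), ("screen", "positive")}
# "battery" is a reasonable answer, but not a verbatim match to
# "battery life", so it counts as both a false positive and a miss.
pred = {("battery", "negative"), ("screen", "positive")}
print(micro_f1(gold, pred))  # (0.5, 0.5, 0.5)
```

In a human evaluation, the near-miss pair above would likely be judged correct, illustrating the gap between exact-match and human-judged scores described in the findings.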

In summary, compared to training a specialized sentiment analysis system for each domain or dataset, ChatGPT can already serve as a universal and well-behaved sentiment analyzer.

Citation

If you find this work helpful, please cite our paper as follows:

@article{wang2023chatgpt-sentiment,
  title={Is ChatGPT a Good Sentiment Analyzer? A Preliminary Study},
  author={Zengzhi Wang and Qiming Xie and Zixiang Ding and Yi Feng and Rui Xia},
  journal={arXiv preprint arXiv:2304.04339},
  year={2023}
}

If you have any questions about this work, you can open an issue with details, or feel free to email Zengzhi ([email protected]) or Qiming ([email protected]).

Evaluation

Standard Evaluation

Zero-shot Results

Human Evaluation (still in zero-shot)

Few-shot Prompting
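Few-shot prompting prepends a handful of labeled demonstrations to the zero-shot instruction. The sketch below shows one plausible way to assemble such a prompt; the demonstration format is an assumption for illustration, not the exact template from the paper.

```python
def build_few_shot_prompt(demos, test_sentence):
    """Assemble a k-shot sentiment prompt from (sentence, label) demos.

    The instruction and demonstration wording below is a hypothetical
    template, not the paper's exact prompt.
    """
    lines = ["Classify the sentiment of each sentence as positive, "
             "negative, or neutral."]
    for sent, label in demos:
        lines.append(f"Sentence: {sent}\nSentiment: {label}")
    # The test sentence is left without a label for the model to fill in.
    lines.append(f"Sentence: {test_sentence}\nSentiment:")
    return "\n\n".join(lines)

demos = [("I love this keyboard.", "positive"),
         ("The service was awful.", "negative")]
print(build_few_shot_prompt(demos, "The food was okay, nothing special."))
```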

Polarity Shift Evaluation

Open-Domain Evaluation

Sentiment Inference Evaluation

We choose the ECE and ECPE tasks as the testbed.
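For reference, ECPE is commonly scored at the clause level: a predicted (emotion clause, cause clause) index pair counts only if both indices match a gold pair. A minimal sketch under that assumption, aggregated over a small toy corpus:

```python
def ecpe_pair_f1(gold, pred):
    """Pair-level F1 for emotion-cause pair extraction.

    gold / pred: parallel lists with one set of
    (emotion_clause_idx, cause_clause_idx) tuples per document.
    """
    tp = fp = fn = 0
    for g, p in zip(gold, pred):
        tp += len(g & p)   # exactly matched pairs
        fp += len(p - g)   # spurious predicted pairs
        fn += len(g - p)   # missed gold pairs
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return (2 * precision * recall / (precision + recall)
            if precision + recall else 0.0)

# Two toy documents: one perfect prediction, one with a spurious pair.
gold = [{(3, 1)}, {(5, 4)}]
pred = [{(3, 1)}, {(5, 4), (5, 2)}]
print(ecpe_pair_f1(gold, pred))  # 0.8
```

The same function covers ECE as a special case if each pair's emotion index is fixed and only the cause clause is predicted.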

Case Study

Standard Evaluation

Polarity Shift Evaluation

Open-Domain Evaluation

Sentiment Inference Evaluation

Emotion Cause Extraction (ECE)

Emotion-Cause Pair Extraction (ECPE)

Note that the right part is the English version translation of the left part for both ECE and ECPE.

Contributors

balancedzx, grayground, rxiacn, sinclaircoder


Issues

Experimental design

Since ChatGPT does not provide an API and GPT-3.5 requires payment, how did you use ChatGPT for your experiments? Did you input the data manually, one example at a time?

Question about the number of test examples in the E2E-ABSA task

Hi, I noticed the statistics in Table 2 of your paper and found that the number of test instances for the E2E-ABSA task is inconsistent with the Sem14 test dataset. In the paper by Pontiki et al. (2014), the test size for the laptop and restaurant domains is stated as 800 sentences each. However, in Table 2, the numbers are changed to 339 and 496, and you mentioned in your paper that you used the entire test set. Therefore, I am curious about the differences here.

Prompt for few-shot

Thank you for your interesting work.
Could you share the prompt you used for few-shot learning?
Thank you very much.

Performance Record Documents

Hi there!

Your work is excellent!

To manually check and understand the prediction results of different models, I am eager to find the documents storing relevant information.

I went through your work and read the code on GitHub. I understand the raw data (true labels) stored in the standard data files. However, there are many files named "50_test", "100_test", "train", "test", and "dev" under different folders whose relationships I did not understand. To my understanding, "50_test" and "100_test" are generated by extracting 50 and 100 lines from "dev", respectively.

I am trying to understand these files in order to find the documents that record the detailed prediction results for different models on different tasks. Could you kindly clarify this for me?

I would greatly appreciate any pointers to the logic behind the filenames, or to where I can find the documents storing the prediction results of the different models!
