🎓 LLM Drifts: How Is ChatGPT's Behavior Changing over Time?

Large language model (LLM) services such as GPT-4 and GPT-3.5 are widely used. However, when and how these models are updated over time is opaque. To help fill this gap, this repository contains (i) a diverse set of datasets, and (ii) generations from popular LLMs (including GPT-4 and GPT-3.5) on these datasets over time.

๐Ÿ” Main Findings

Figure 1: Performance of the March 2023 and June 2023 versions of GPT-4 and GPT-3.5 on several diverse tasks: solving math problems, answering sensitive questions, taking surveys, answering knowledge-intensive questions, generating code, and visual reasoning. The performance of GPT-4 and GPT-3.5 can vary substantially over time, and gets worse on some tasks.

What are the main findings? In a nutshell, there are many interesting performance shifts over time. For example, GPT-4 (March 2023) was very good at identifying prime numbers (accuracy 84.0%), but GPT-4 (June 2023) was very poor on the same questions (accuracy 51.1%). Interestingly, GPT-3.5 (June 2023) was much better than GPT-3.5 (March 2023) at this task. We hope that releasing the datasets and generations helps the community better understand how LLM services drift. The above figure gives a quantitative summary.
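To make the size of the GPT-4 shift concrete, the short sketch below turns the two accuracies quoted above into an absolute change (percentage points) and a relative change. The helper name `accuracy_drift` is made up for this illustration and is not part of the repository.

```python
# Accuracies quoted in the text above for GPT-4 on the prime-number task.
march_accuracy = 84.0  # percent, GPT-4 (March 2023)
june_accuracy = 51.1   # percent, GPT-4 (June 2023)


def accuracy_drift(before: float, after: float) -> tuple[float, float]:
    """Return (change in percentage points, relative change in percent).

    Hypothetical helper for illustration only.
    """
    absolute = after - before
    relative = 100.0 * absolute / before
    return absolute, relative


absolute, relative = accuracy_drift(march_accuracy, june_accuracy)
print(f"absolute change: {absolute:+.1f} points, relative change: {relative:+.1f}%")
```

Under these numbers the drop is 32.9 percentage points, which is roughly 39% of the March accuracy.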

🚀 Reproducing the Figures (No API Needed)

You can directly run the Google Colab Notebook to reproduce the monitored performance drifts in our paper. You don't need API keys to get started. You can also use the local intro notebook directly.

๐Ÿ•น๏ธ Obtaining ChatGPT Generations (API Needed)

We also offer a Python system to obtain LLM generations for any given dataset. The code and usage instructions can be found here. Note that you need your own OpenAI API key to use this feature.
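As a rough sketch of what collecting one generation might involve, the snippet below wraps a single query, times it, and returns a record with the kinds of fields described under Datasets and Generations. Everything here is an assumption for illustration: `collect_generation`, the `query_llm` callable, the dictionary keys, and the model name are hypothetical, and the repository's own code may be organized quite differently. A fake model stands in for the real API so the example runs without a key.

```python
import time
from typing import Callable

# Hypothetical signature: (model, query, temperature, max_tokens) -> generated text.
QueryFn = Callable[[str, str, float, int], str]


def collect_generation(query_llm: QueryFn, model: str, query: str, reference: str,
                       temperature: float, max_tokens: int) -> dict:
    """Run one query and return one record, including wall-clock latency."""
    start = time.perf_counter()
    answer = query_llm(model, query, temperature, max_tokens)
    latency = time.perf_counter() - start
    # Key names are illustrative, not the repository's actual CSV header.
    return {
        "model": model,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "query": query,
        "ref_answer": reference,
        "answer": answer,
        "latency": latency,
    }


def fake_llm(model: str, query: str, temperature: float, max_tokens: int) -> str:
    """Stand-in for a real API call; always answers 'Yes'."""
    return "Yes"


record = collect_generation(fake_llm, "example-model", "Is 17 a prime number?",
                            "Yes", temperature=0.0, max_tokens=16)
print(record["answer"], f"{record['latency']:.6f}s")
```

Passing the client in as a callable keeps the timing and record-building logic separate from any particular API library, so a real client could be swapped in for `fake_llm`.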

💾 Datasets and Generations

The datasets and generations can be found under generation/. Each CSV file corresponds to one dataset. Each row corresponds to one query and the generation from one LLM service.

Figure 2: The first few rows in the LLM generations on PRIME dataset.

The above figure shows the first few rows of generation/PRIME_FULL_EVAL.csv. Each row includes the model, the query parameters (such as temperature and maximum token count), the query, the reference answer, the generated answer, and the latency. This information can be leveraged to study various aspects of LLM services.
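One way such records could be analyzed is to group rows by model and compare generated answers with reference answers. The sketch below does this with the standard library on a tiny in-memory CSV. The column names (`model`, `query`, `ref_answer`, `answer`), the model labels, and the sample rows are all made up for illustration; check the header of the real file before adapting it.

```python
import csv
import io
from collections import defaultdict

# Tiny stand-in for a generations CSV. Columns and rows are invented for
# this example and do not come from the repository.
SAMPLE_CSV = """model,query,ref_answer,answer
model-march,Is 17 a prime number?,Yes,Yes
model-march,Is 21 a prime number?,No,No
model-june,Is 17 a prime number?,Yes,No
model-june,Is 21 a prime number?,No,No
"""


def accuracy_by_model(csv_file, model_col="model", ref_col="ref_answer",
                      gen_col="answer"):
    """Exact-match accuracy per model (case-insensitive, whitespace-trimmed)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for row in csv.DictReader(csv_file):
        model = row[model_col]
        total[model] += 1
        if row[gen_col].strip().lower() == row[ref_col].strip().lower():
            correct[model] += 1
    return {model: correct[model] / total[model] for model in total}


scores = accuracy_by_model(io.StringIO(SAMPLE_CSV))
for model, acc in scores.items():
    print(f"{model}: {acc:.0%}")
```

To try this on a real file, pass an open file handle instead of the `io.StringIO` object and set the column arguments to match its header. Real generations are often free-form text, so exact matching would likely need an answer-extraction step first.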

📚 Read More

You can get an overview via our Twitter threads:

Introducing LLM Drifts (July 18, 2023)

More Explanations (July 23, 2023)

Updated and Expanded Evaluation (August 2, 2023)

You can find more details in the academic paper: How Is ChatGPT's Behavior Changing over Time? (arXiv preprint arXiv:2307.09009)

📣 Updates & Changelog

🔹 2024.01.03 - Added Monitoring Code

  • ✅ Added performance monitoring code to the repository

🔹 2023.08.01 - Added Tasks, Expanded Queries & Analysis

  • ✅ Added four new tasks to the repository
  • ✅ Expanded one existing task with more diverse queries
  • ✅ Additional analysis in the paper

🔹 2023.07.18 - Initial Release

  • ✅ The project is now live!

🎯 Reference

If you use our findings and/or datasets in a research paper, please cite our work as follows:

@article{chen2023LLMDrift,
  title={How Is ChatGPT's Behavior Changing over Time?},
  author={Chen, Lingjiao and Zaharia, Matei and Zou, James},
  journal={arXiv preprint arXiv:2307.09009},
  year={2023}
}
