
hendrycks / test


Measuring Massive Multitask Language Understanding | ICLR 2021

Home Page: https://arxiv.org/abs/2009.03300

License: MIT License

Python 100.00%
multi-task transfer-learning gpt-3 few-shot-learning

test's Introduction

Measuring Massive Multitask Language Understanding

This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

This repository contains OpenAI API evaluation code, and the test is available for download at https://people.eecs.berkeley.edu/~hendrycks/data.tar.
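
For orientation, here is a minimal sketch of how a few-shot prompt for this benchmark can be assembled. The template wording and helper names are assumptions for illustration, not quotations from the repository's evaluate.py.

def format_example(question, choices, answer=None):
    # One question block: the question, lettered choices, and an "Answer:" line.
    text = question + "\n"
    text += "".join(f"{letter}. {choice}\n" for letter, choice in zip("ABCD", choices))
    # Demonstrations include the answer; the test question leaves it blank for the model.
    return text + (f"Answer: {answer}\n\n" if answer else "Answer:")

def build_prompt(subject, dev_examples, test_question, test_choices, k=5):
    # k dev-set demonstrations followed by the unanswered test question.
    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for question, choices, answer in dev_examples[:k]:
        prompt += format_example(question, choices, answer)
    return prompt + format_example(test_question, test_choices)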

Test Leaderboard

If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.

Results of the test:

Model Authors Humanities Social Sciences STEM Other Average
Chinchilla (70B, few-shot) Hoffmann et al., 2022 63.6 79.3 54.9 73.9 67.5
Gopher (280B, few-shot) Rae et al., 2021 56.2 71.9 47.4 66.1 60.0
GPT-3 (175B, fine-tuned) Brown et al., 2020 52.5 63.9 41.4 57.9 53.9
flan-T5-xl Chung et al., 2022 46.3 57.7 39.0 55.1 49.3
UnifiedQA Khashabi et al., 2020 45.6 56.6 40.2 54.6 48.9
GPT-3 (175B, few-shot) Brown et al., 2020 40.8 50.4 36.7 48.8 43.9
GPT-3 (6.7B, fine-tuned) Brown et al., 2020 42.1 49.2 35.1 46.9 43.2
flan-T5-large Chung et al., 2022 39.1 49.1 33.2 47.4 41.9
flan-T5-base Chung et al., 2022 34.0 38.1 27.6 37.0 34.2
GPT-2 Radford et al., 2019 32.8 33.3 30.2 33.1 32.4
flan-T5-small Chung et al., 2022 29.9 30.9 27.5 29.7 29.5
Random Baseline N/A 25.0 25.0 25.0 25.0 25.0

Citation

If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:

@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

test's People

Contributors

andyzoujm, collin-burns, helw150, hendrycks, sbmaruf, xksteven


test's Issues

Issues with the moral scenarios task

I found some issues with the moral scenarios task. My analysis indicates that it isn't a good measure of a model's moral judgement because of the complexity introduced by the task format. Results are summarized here: https://www.lesswrong.com/posts/XqzWgkP3xekfdh8pa/mmlu-s-moral-scenarios-benchmark-doesn-t-measure-what-you. Please let me know if you think there are issues with my conclusions.

Given that other folks have recently built on moral scenarios as a metric (https://arxiv.org/abs/2306.14308), I am trying to make these findings known so that there is at least some caution about using it as-is going forward.

Duplicate Answers in Validation Set

I have spotted (or more precisely, my schema validator has spotted) three questions where the choices feature a duplicate.

All are in the 'validation' set. Specifically:

In elementary_mathematics_val.csv

What is the value of |3 + 5| – |-4|?,12,-4,4,12,C

In high_school_mathematics_val.csv

"Sam has $\frac{5}{8}$ of a pound of chocolate. If he eats $\frac{1}{3}$ of the chocolate he has, how many pounds of chocolate does he eat?",\frac{5}{12},\frac{5}{24},\frac{3}{24},\frac{3}{24},B

In miscellaneous_val.csv

"How many balloons would be required to fill the Empire State Building, which is about 100 stories tall?","60,000,000","60,000,000","600,000,000","6,000,000,000",A

This one is a particular problem, since it is the correct answer which has been duplicated.
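
A quick way to reproduce this kind of check (a sketch only; it assumes the data.tar layout of data/val/*_val.csv with four choice columns followed by the answer letter):

import csv
import glob

# Flag validation rows whose four answer choices contain a duplicate.
for path in sorted(glob.glob("data/val/*_val.csv")):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            question, choices = row[0], row[1:5]
            if len(set(choices)) < len(choices):
                print(f"{path}: duplicate choices in {question!r}")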

Medical word typo in CSV file

Hello, my name is Hiroya Iizuka; I am a cardiologist with 12 years of experience.

I found a typo in clinical_knowledge_test.csv (line 66):


In hypovolaemic shock -> In hypovolemic shock

Please fix this typo.

why ["top_logprobs"][-1]

Thanks for the code!

Could you explain the following line in evaluate.py:
lprobs.append(c["choices"][0]["logprobs"]["top_logprobs"][-1]
Why are you only extracting the last item from the list? What does this list represent?

Thanks!
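
For context, a sketch of what that indexing does; the response shape shown here is an assumption about the legacy OpenAI Completions API, not code taken from evaluate.py:

# Stand-in for an API response object. With max_tokens=1, "top_logprobs" holds one
# dict of candidate-token logprobs per generated token, so [-1] selects the dict for
# the final token -- the single answer token the model just produced.
c = {"choices": [{"logprobs": {"top_logprobs": [
    {" A": -2.3, " B": -0.4, " C": -3.1, " D": -1.9},
]}}]}
answer_logprobs = c["choices"][0]["logprobs"]["top_logprobs"][-1]
lprobs = [answer_logprobs.get(f" {letter}", -100) for letter in "ABCD"]
pred = "ABCD"[lprobs.index(max(lprobs))]
print(pred)  # -> "B"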

Mismatched dataset categories

In the paper, anatomy is categorised under STEM, while in categories.py it is categorised under "health" and then "other". Which one is wrong?

Human level performance?

Hi, first of all, thanks for releasing this great dataset!

In the abstract you wrote:
"on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy",
but I could not find human performance numbers in the paper. Do you plan to include them anytime soon?

Thanks!

About the MMLU subset "all" on Hugging Face

Could you share the details of the "all" subset on Hugging Face? I assume "all" contains every other subset, but what percentage does each subset contribute to "all"?
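
One way to inspect this yourself (a sketch under the assumption that the question refers to the cais/mmlu dataset on the Hugging Face Hub, which exposes a "subject" column; this is not part of this repository):

from collections import Counter
from datasets import load_dataset

# Count how many test questions each subject contributes to the "all" configuration.
ds = load_dataset("cais/mmlu", "all", split="test")
counts = Counter(ds["subject"])
total = sum(counts.values())
for subject, n in counts.most_common():
    print(f"{subject}: {n} ({100 * n / total:.1f}%)")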

Answers A, B, C, D are not all equally likely - is it really accurate to use a random baseline as a comparison?

I pulled the test data linked in the README, and I am noticing that within each category there is basically never an even 25% split between A, B, C, and D.

The most imbalanced category is high school statistics, for which 47% of the answers are D.

I have two questions. First, is my analysis correct? I was using the test data downloadable from the main repo. Second, if it is correct, wouldn't a random baseline be an unfair comparison, since a majority-vote baseline would do much better?

I used the data here: https://people.eecs.berkeley.edu/~hendrycks/data.tar
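
For reference, a sketch of the analysis described above (it assumes the data.tar layout of data/test/*_test.csv with the answer letter in the last column):

import csv
import glob
from collections import Counter

# Print the answer-letter distribution for each test category.
for path in sorted(glob.glob("data/test/*_test.csv")):
    with open(path, newline="") as f:
        counts = Counter(row[-1] for row in csv.reader(f))
    total = sum(counts.values())
    print(path, {k: round(100 * counts[k] / total, 1) for k in "ABCD"})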

Odd-Looking Samples in business_ethics_val.csv

I noticed these two rows in business_ethics_val.csv, which I don't know how to interpret. Perhaps I'm just misunderstanding the format?

Executive directors are responsible for running the firm.,"1,4","1,3","2,3","1,2,3",A
Subjection to emotional exploitation.,"1,2,4","1,3,4","2,3,4","1,2,3,4,5",B

Thanks for the great eval!

Dataset size mismatched with paper

After building the given datasets, I got 14042 test examples and 1531 validation examples, which differs slightly from the paper (14079 and 1540). Posting here to double-check that the data is correct.

Unintended (?) repetition in moral_scenarios_val.csv

It looks like the question template in moral_scenarios_val has a repeated substring which impacts 100 questions:

/Users/sambowman/Downloads/data/val/moral_scenarios_val.csv:"For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? [the clause "according to ordinary moral standards in the US as of 2020?" is repeated many more times] Scenario 1 | I destroyed the business Scenario 2 | I pulled my younger sister's hair.","Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",A
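
A quick check for how widespread the repetition is (a sketch only; the file path and column layout are assumptions based on the data.tar release):

import csv

CLAUSE = "according to ordinary moral standards in the US as of 2020?"

# Count validation questions whose prompt contains the clause more than once.
with open("data/val/moral_scenarios_val.csv", newline="") as f:
    affected = sum(1 for row in csv.reader(f) if row[0].count(CLAUSE) > 1)
print(f"{affected} questions contain the repeated clause")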

Incorrect answer for Q5 in high_school_computer_science_dev.csv

The answer to the last question in high_school_computer_science_dev.csv is incorrect:

A list of numbers has n elements, indexed from 1 to n.
The following algorithm is intended to display the number of elements in the list that have a value greater than 100.
The algorithm uses the variables count and position. Steps 3 and 4 are missing.
 Step 1: Set count to 0 and position to 1.
 Step 2: If the value of the element at index position is greater than 100, increase the value of count by 1.
 Step 3: (missing step)
 Step 4: (missing step)
 Step 5: Display the value of count.
 Which of the following could be used to replace steps 3 and 4 so that the algorithm works as intended?
 
(A) Step 3: Increase the value of position by 1.
    Step 4: Repeat steps 2 and 3 until the value of count is greater than 100.
(B) Step 3: Increase the value of position by 1.
    Step 4: Repeat steps 2 and 3 until the value of position is greater than n.
(C) Step 3: Repeat step 2 until the value of count is greater than 100.
    Step 4: Increase the value of position by 1.
(D) Step 3: Repeat step 2 until the value of position is greater than n.
    Step 4: Increase the value of count by 1.

The answer is listed as D in the CSV file, but the correct answer is B.
See the official question PDF from the College Board to double-check: Question 8 and its answer key. Reasoning it through also confirms this: option D never updates position, so repeating step 2 "until the value of position is greater than n" loops forever (see the sketch below).
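
To make the reasoning concrete, here is option B written out as code (an illustrative sketch, not part of the dataset or the evaluation code):

def count_greater_than_100(values):
    count = 0
    position = 1                        # Step 1: set count to 0 and position to 1 (1-indexed)
    n = len(values)
    while position <= n:                # Step 4 (B): repeat steps 2 and 3 until position > n
        if values[position - 1] > 100:  # Step 2: check the element at index `position`
            count += 1
        position += 1                   # Step 3 (B): increase position by 1
    return count                        # Step 5: display the value of count

assert count_greater_than_100([50, 150, 200, 99]) == 2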

The dataset from README.md was used: people.eecs.berkeley.edu/~hendrycks/data.tar

Suggestion: appropriate use of medical word: catheter

Please allow me to consult with you regarding the use of the term "catheter" in clinical_knowledge.csv.

Question 80: Which of the following would not be done before catheterizing?
  A: Obtain the patient's consent.
  B: Cleanse the patient.
  C: Confirm the expiration date.
  D: Contact the patient's next of kin.
  
(Correct answer: D)
Question 88: If a catheter resists all attempts to unblock it and you are unable to remove it, what should you do?
  A: Remove the catheter more forcefully.
  B: Make further efforts to unblock the obstruction.
  C: Leave it until the next time.
  D: Seek assistance from a physician.
  
(Correct answer: D)

In the medical world, the term "catheter" is broadly used in two contexts:

  • Urinary catheter
  • Coronary artery catheter

In the above questions, the term appears to be used in the sense of a urinary catheter. However, if it is read as a coronary artery catheter, this affects the correct answer to question 80: in some cases an explanation to the next of kin may be required, and in procedures performed through the wrist, cleansing is not carried out, which would make option B a correct answer as well.

Therefore, to avoid potential confusion, it would be better to explicitly specify "urinary catheter" rather than just "catheter".

Thank you for your consideration.
