
hendrycks / test


Measuring Massive Multitask Language Understanding | ICLR 2021

Home Page: https://arxiv.org/abs/2009.03300

License: MIT License

Python 100.00%
multi-task transfer-learning gpt-3 few-shot-learning

test's Introduction

Measuring Massive Multitask Language Understanding

This is the repository for Measuring Massive Multitask Language Understanding by Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt (ICLR 2021).

This repository contains OpenAI API evaluation code, and the test is available for download at https://people.eecs.berkeley.edu/~hendrycks/data.tar.
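
For orientation, here is a minimal sketch of how a few-shot prompt for this benchmark can be assembled. The template wording and helper names are assumptions for illustration, not quotations from the repository's evaluate.py.

def format_example(question, choices, answer=None):
    # One question block: the question, lettered choices, and an "Answer:" line.
    text = question + "\n"
    text += "".join(f"{letter}. {choice}\n" for letter, choice in zip("ABCD", choices))
    # Demonstrations include the answer; the test question leaves it blank for the model.
    return text + (f"Answer: {answer}\n\n" if answer else "Answer:")

def build_prompt(subject, dev_examples, test_question, test_choices, k=5):
    # k dev-set demonstrations followed by the unanswered test question.
    prompt = f"The following are multiple choice questions (with answers) about {subject}.\n\n"
    for question, choices, answer in dev_examples[:k]:
        prompt += format_example(question, choices, answer)
    return prompt + format_example(test_question, test_choices)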

Test Leaderboard

If you want to have your model added to the leaderboard, please reach out to us or submit a pull request.

Results of the test:

Model Authors Humanities Social Sciences STEM Other Average
Chinchilla (70B, few-shot) Hoffmann et al., 2022 63.6 79.3 54.9 73.9 67.5
Gopher (280B, few-shot) Rae et al., 2021 56.2 71.9 47.4 66.1 60.0
GPT-3 (175B, fine-tuned) Brown et al., 2020 52.5 63.9 41.4 57.9 53.9
flan-T5-xl Chung et al., 2022 46.3 57.7 39.0 55.1 49.3
UnifiedQA Khashabi et al., 2020 45.6 56.6 40.2 54.6 48.9
GPT-3 (175B, few-shot) Brown et al., 2020 40.8 50.4 36.7 48.8 43.9
GPT-3 (6.7B, fine-tuned) Brown et al., 2020 42.1 49.2 35.1 46.9 43.2
flan-T5-large Chung et al., 2022 39.1 49.1 33.2 47.4 41.9
flan-T5-base Chung et al., 2022 34.0 38.1 27.6 37.0 34.2
GPT-2 Radford et al., 2019 32.8 33.3 30.2 33.1 32.4
flan-T5-small Chung et al., 2022 29.9 30.9 27.5 29.7 29.5
Random Baseline N/A 25.0 25.0 25.0 25.0 25.0

Citation

If you find this useful in your research, please consider citing the test and also the ETHICS dataset it draws from:

@article{hendryckstest2021,
  title={Measuring Massive Multitask Language Understanding},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andy Zou and Mantas Mazeika and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

@article{hendrycks2021ethics,
  title={Aligning AI With Shared Human Values},
  author={Dan Hendrycks and Collin Burns and Steven Basart and Andrew Critch and Jerry Li and Dawn Song and Jacob Steinhardt},
  journal={Proceedings of the International Conference on Learning Representations (ICLR)},
  year={2021}
}

test's People

Contributors

andyzoujm, collin-burns, helw150, hendrycks, sbmaruf, xksteven


test's Issues

Issues with the moral scenarios task

I found some issues with the moral scenarios task. My analysis indicates that it isn't a good measure of a model's moral judgement because of the complexity introduced by the task format. Results are summarized here: https://www.lesswrong.com/posts/XqzWgkP3xekfdh8pa/mmlu-s-moral-scenarios-benchmark-doesn-t-measure-what-you. Please let me know if you think there are issues with my conclusions.

Given that other folks have recently built on moral scenarios as a metric (https://arxiv.org/abs/2306.14308), I am trying to make these findings known so that there is at least some caution about using it as-is going forward.

Duplicate Answers in Validation Set

I have spotted (or more precisely, my schema validator has spotted) three questions where the choices feature a duplicate.

All are in the 'validation' set. Specifically:

In elementary_mathematics_val.csv

What is the value of |3 + 5| – |-4|?,12,-4,4,12,C

In high_school_mathematics_val.csv

"Sam has $\frac{5}{8}$ of a pound of chocolate. If he eats $\frac{1}{3}$ of the chocolate he has, how many pounds of chocolate does he eat?",\frac{5}{12},\frac{5}{24},\frac{3}{24},\frac{3}{24},B

In miscellaneous_val.csv

"How many balloons would be required to fill the Empire State Building, which is about 100 stories tall?","60,000,000","60,000,000","600,000,000","6,000,000,000",A

This one is a particular problem, since it is the correct answer which has been duplicated.
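
A quick way to reproduce this kind of check (a sketch only; it assumes the data.tar layout of data/val/*_val.csv with four choice columns followed by the answer letter):

import csv
import glob

# Flag validation rows whose four answer choices contain a duplicate.
for path in sorted(glob.glob("data/val/*_val.csv")):
    with open(path, newline="") as f:
        for row in csv.reader(f):
            question, choices = row[0], row[1:5]
            if len(set(choices)) < len(choices):
                print(f"{path}: duplicate choices in {question!r}")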

Medical word typo in CSV file

Hello, my name is Hiroya Iizuka; I am a cardiologist with 12 years of experience.

I found a typo in clinical_knowledge_test.csv (line 66):


In hypovolaemic shock -> In hypovolemic shock

Please fix this typo.

why ["top_logprobs"][-1]

Thanks for the code!

Could you explain the following line in evaluate.py:
lprobs.append(c["choices"][0]["logprobs"]["top_logprobs"][-1]
Why are you only extracting the last item from the list? What does this list represent?

Thanks!
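
For context, a sketch of what that indexing does; the response shape shown here is an assumption about the legacy OpenAI Completions API, not code taken from evaluate.py:

# Stand-in for an API response object. With max_tokens=1, "top_logprobs" holds one
# dict of candidate-token logprobs per generated token, so [-1] selects the dict for
# the final token -- the single answer token the model just produced.
c = {"choices": [{"logprobs": {"top_logprobs": [
    {" A": -2.3, " B": -0.4, " C": -3.1, " D": -1.9},
]}}]}
answer_logprobs = c["choices"][0]["logprobs"]["top_logprobs"][-1]
lprobs = [answer_logprobs.get(f" {letter}", -100) for letter in "ABCD"]
pred = "ABCD"[lprobs.index(max(lprobs))]
print(pred)  # -> "B"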

Mismatched dataset categories

In the paper, anatomy is categorised under STEM, while in categories.py it is categorised under "health" and then "other". Which one is wrong?

Human level performance?

Hi, first of all, thanks for releasing this great dataset!

In the abstract you wrote:
"on every one of the 57 tasks, the best models still need substantial improvements before they can reach human-level accuracy",
but I could not find human performance numbers in the paper. Do you plan to include them anytime soon?

Thanks!

About the MMLU subset "all" on Hugging Face

Could you share the details of the "all" subset on Hugging Face? I assume "all" contains every other subset, but what percentage does each subset contribute to "all"?
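
One way to inspect this yourself (a sketch under the assumption that the question refers to the cais/mmlu dataset on the Hugging Face Hub, which exposes a "subject" column; this is not part of this repository):

from collections import Counter
from datasets import load_dataset

# Count how many test questions each subject contributes to the "all" configuration.
ds = load_dataset("cais/mmlu", "all", split="test")
counts = Counter(ds["subject"])
total = sum(counts.values())
for subject, n in counts.most_common():
    print(f"{subject}: {n} ({100 * n / total:.1f}%)")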

Answers A, B, C, D are not all equally likely - is it really accurate to use a random baseline as a comparison?

I pulled the test data linked in the README, and I am noticing that within each category there is basically never an even 25% split between A, B, C, and D.

The most imbalanced category is high school statistics, for which 47% of the answers are D.

I have two questions. First, is my analysis correct? I was using the test data downloadable from the main repo. Second, if it is correct, wouldn't a random baseline be an unfair comparison, since a majority-vote baseline would do much better?

I used the data here: https://people.eecs.berkeley.edu/~hendrycks/data.tar
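
For reference, a sketch of the analysis described above (it assumes the data.tar layout of data/test/*_test.csv with the answer letter in the last column):

import csv
import glob
from collections import Counter

# Print the answer-letter distribution for each test category.
for path in sorted(glob.glob("data/test/*_test.csv")):
    with open(path, newline="") as f:
        counts = Counter(row[-1] for row in csv.reader(f))
    total = sum(counts.values())
    print(path, {k: round(100 * counts[k] / total, 1) for k in "ABCD"})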

Odd-Looking Samples in business_ethics_val.csv

I noticed these two rows in business_ethics_val.csv, which I don't know how to interpret. Perhaps I'm just misunderstanding the format?

Executive directors are responsible for running the firm.,"1,4","1,3","2,3","1,2,3",A
Subjection to emotional exploitation.,"1,2,4","1,3,4","2,3,4","1,2,3,4,5",B

Thanks for the great eval!

Dataset size mismatched with paper

After building the given datasets, I got 14042 test examples and 1531 validation examples, which differs slightly from the paper (14079 and 1540). Posting here to double-check that the data is correct.

Unintended (?) repetition in moral_scenarios_val.csv

It looks like the question template in moral_scenarios_val has a repeated substring which impacts 100 questions:

/Users/sambowman/Downloads/data/val/moral_scenarios_val.csv:"For which of these two scenarios does the main character (who uses I/me/my) do something clearly morally wrong, according to ordinary moral standards in the US as of 2020? [the clause "according to ordinary moral standards in the US as of 2020?" is repeated many more times] Scenario 1 | I destroyed the business Scenario 2 | I pulled my younger sister's hair.","Wrong, Wrong","Wrong, Not wrong","Not wrong, Wrong","Not wrong, Not wrong",A
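
A quick check for how widespread the repetition is (a sketch only; the file path and column layout are assumptions based on the data.tar release):

import csv

CLAUSE = "according to ordinary moral standards in the US as of 2020?"

# Count validation questions whose prompt contains the clause more than once.
with open("data/val/moral_scenarios_val.csv", newline="") as f:
    affected = sum(1 for row in csv.reader(f) if row[0].count(CLAUSE) > 1)
print(f"{affected} questions contain the repeated clause")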

Incorrect answer for Q5 in high_school_computer_science_dev.csv

The answer to the last question in high_school_computer_science_dev.csv is incorrect:

A list of numbers has n elements, indexed from 1 to n.
The following algorithm is intended to display the number of elements in the list that have a value greater than 100.
The algorithm uses the variables count and position. Steps 3 and 4 are missing.
 Step 1: Set count to 0 and position to 1.
 Step 2: If the value of the element at index position is greater than 100, increase the value of count by 1.
 Step 3: (missing step)
 Step 4: (missing step)
 Step 5: Display the value of count.
 Which of the following could be used to replace steps 3 and 4 so that the algorithm works as intended?
 
(A) Step 3: Increase the value of position by 1.
    Step 4: Repeat steps 2 and 3 until the value of count is greater than 100.
(B) Step 3: Increase the value of position by 1.
    Step 4: Repeat steps 2 and 3 until the value of position is greater than n.
(C) Step 3: Repeat step 2 until the value of count is greater than 100.
    Step 4: Increase the value of position by 1.
(D) Step 3: Repeat step 2 until the value of position is greater than n.
    Step 4: Increase the value of count by 1.

The answer is listed as D in the CSV file, but the correct answer is B.
See the official question PDF from the College Board to double-check: Question 8 and its answer key. Reasoning it through also confirms this: option D never updates position, so repeating step 2 "until the value of position is greater than n" loops forever (see the sketch below).
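
To make the reasoning concrete, here is option B written out as code (an illustrative sketch, not part of the dataset or the evaluation code):

def count_greater_than_100(values):
    count = 0
    position = 1                        # Step 1: set count to 0 and position to 1 (1-indexed)
    n = len(values)
    while position <= n:                # Step 4 (B): repeat steps 2 and 3 until position > n
        if values[position - 1] > 100:  # Step 2: check the element at index `position`
            count += 1
        position += 1                   # Step 3 (B): increase position by 1
    return count                        # Step 5: display the value of count

assert count_greater_than_100([50, 150, 200, 99]) == 2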

The dataset from README.md was used: people.eecs.berkeley.edu/~hendrycks/data.tar

Suggestion: appropriate use of medical word: catheter

Please allow me to consult with you regarding the use of the term "catheter" in clinical_knowledge.csv.

Question 80: Which of the following would not be done before catheterizing?
  A: Obtain the patient's consent.
  B: Cleanse the patient.
  C: Confirm the expiration date.
  D: Contact the patient's next of kin.
  
(Correct answer: D)
Question 88: If a catheter resists all attempts to unblock it and you are unable to remove it, what should you do?
  A: Remove the catheter more forcefully.
  B: Make further efforts to unblock the obstruction.
  C: Leave it until the next time.
  D: Seek assistance from a physician.
  
(Correct answer: D)

In the medical world, the term "catheter" is broadly used in two contexts:

  • Urinary catheter
  • Coronary artery catheter

In the above questions, the term appears to be used in the sense of a urinary catheter. However, if it is read as a coronary artery catheter, this affects the correct answer to question 80: in some cases an explanation to the next of kin may be required, and in procedures performed through the wrist, cleansing is not carried out, which would make option B a correct answer as well.

Therefore, to avoid potential confusion, it would be better to explicitly specify "urinary catheter" rather than just "catheter".

Thank you for your consideration.
