allenai / catwalk

This project studies the performance and robustness of language models and task-adaptation methods.

License: Apache License 2.0

Python 98.06% Makefile 0.06% Shell 0.14% Jsonnet 1.68% Dockerfile 0.06%

catwalk's Introduction

Catwalk

Catwalk shows off models.

Catwalk contains a lot of models, and a lot of tasks. The goal is to be able to run all models on all tasks. In practice, some combinations are not possible, but many are.

Here is the current list of tasks we have implemented. The list does not include the `metaicl` and `p3` task categories, because those are largely variants of the other tasks.
wikitext
piqa
squad
squadshifts-reddit
squadshifts-amazon
squadshifts-nyt
squadshifts-new-wiki
mrqa::race
mrqa::newsqa
mrqa::triviaqa
mrqa::searchqa
mrqa::hotpotqa
mrqa::naturalquestions
mrqa::bioasq
mrqa::drop
mrqa::relationextraction
mrqa::textbookqa
mrqa::duorc.paraphraserc
squad2
rte
superglue::rte
cola
mnli
mnli_mismatched
mrpc
qnli
qqp
sst
wnli
boolq
cb
copa
eai::multirc
wic
wsc
drop
lambada
lambada_cloze
lambada_mt_en
lambada_mt_fr
lambada_mt_de
lambada_mt_it
lambada_mt_es
prost
mc_taco
pubmedqa
sciq
qa4mre_2011
qa4mre_2012
qa4mre_2013
triviaqa
arc_easy
arc_challenge
logiqa
hellaswag
openbookqa
eai::race
headqa_es
headqa_en
mathqa
webqs
wsc273
winogrande
anli_r1
anli_r2
anli_r3
ethics_cm
ethics_deontology
ethics_justice
ethics_utilitarianism_original
ethics_utilitarianism
ethics_virtue
truthfulqa_gen
mutual
mutual_plus
math_algebra
math_counting_and_prob
math_geometry
math_intermediate_algebra
math_num_theory
math_prealgebra
math_precalc
math_asdiv
arithmetic_2da
arithmetic_2ds
arithmetic_3da
arithmetic_3ds
arithmetic_4da
arithmetic_4ds
arithmetic_5da
arithmetic_5ds
arithmetic_2dm
arithmetic_1dc
anagrams1
anagrams2
cycle_letters
random_insertion
reversed_words
raft::ade_corpus_v2
raft::banking_77
raft::neurips_impact_statement_risks
raft::one_stop_english
raft::overruling
raft::semiconductor_org_types
raft::systematic_review_inclusion
raft::tai_safety_research
raft::terms_of_service
raft::tweet_eval_hate
raft::twitter_complaints

Installation

Catwalk requires Python 3.9 or later.

Unfortunately, Catwalk cannot be installed from PyPI, because it depends on packages that are not published there.

Install from source:

git clone https://github.com/allenai/catwalk.git
cd catwalk
pip install -e .

Getting started

Let's run GPT2 on PIQA:

python -m catwalk --model rc::gpt2 --task piqa

This will load up GPT2 and use it to perform the PIQA task with the "ranked classification" approach.
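
The "ranked classification" approach scores each answer choice by the language model's likelihood and picks the highest-scoring one. Here is a minimal sketch of the idea using the Hugging Face transformers API; it is illustrative only, since Catwalk's actual RankClassificationModel differs in detail (for example, it conditions on the context and scores only the continuation, which this sketch glosses over):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def rank_classify(context: str, choices: list[str]) -> int:
    """Return the index of the choice with the highest total log-likelihood."""
    scores = []
    for choice in choices:
        ids = tokenizer(context + " " + choice, return_tensors="pt").input_ids
        with torch.no_grad():
            # `loss` is the mean token cross-entropy; negate and scale by the
            # number of predicted tokens to get the total log-likelihood.
            loss = model(ids, labels=ids).loss
        scores.append(-loss.item() * (ids.shape[1] - 1))
    return max(range(len(choices)), key=lambda i: scores[i])

# PIQA-style usage: pick the more plausible solution.
print(rank_classify("To open a jar, you should", ["twist the lid.", "glue the lid."]))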

You can specify multiple tasks at once:

python -m catwalk --model rc::gpt2 --task piqa arc_easy arc_challenge

It'll print a table with the metrics for each task:

arc_challenge   acc     0.22440272569656372
arc_easy        acc     0.3998316526412964
piqa    acc     0.6256800889968872
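
The results come back as one task/metric/value triple per line, so they are easy to post-process. Here is a small sketch that drives the CLI with subprocess and collects the metrics into a dict; it assumes the three fields are tab-separated, as in the workspace example further down:

import subprocess

def run_catwalk(model: str, tasks: list[str]) -> dict[tuple[str, str], float]:
    proc = subprocess.run(
        ["python", "-m", "catwalk", "--model", model, "--task", *tasks],
        check=True, capture_output=True, text=True,
    )
    results = {}
    for line in proc.stdout.splitlines():
        parts = line.split("\t")
        if len(parts) == 3:
            try:
                results[(parts[0], parts[1])] = float(parts[2])
            except ValueError:
                pass  # skip any non-metric lines
    return results

print(run_catwalk("rc::gpt2", ["piqa", "arc_easy"]))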

Training / Finetuning

Catwalk can train models, on a single task or on multiple tasks at once. To train, use this command line:

python -m catwalk.train --model rc::gpt2 --task piqa

You can train on multiple tasks at the same time, if you want to create a multi-task model:

python -m catwalk.train --model rc::gpt2 --task piqa arc_easy

Note that not all models support training. If you want to train one and can't, create an issue and tag @dirkgr in it.

Tango integration

Catwalk uses Tango for caching and executing evaluations. The command line interface internally constructs a Tango step graph and executes it. You can point the command line to a Tango workspace to cache results:

python -m catwalk --model rc::gpt2 --task piqa arc_easy -w ./my-workspace/

The second time you run one of those tasks, it will be fast:

time python -m catwalk --model rc::gpt2 --task piqa -w ./my-workspace/
arc_easy	acc	0.39941078424453735
piqa	acc	0.626224160194397

________________________________________________________
Executed in    9.82 secs    fish           external
   usr time    6.51 secs  208.00 micros    6.51 secs
   sys time    1.25 secs  807.00 micros    1.25 secs

Tango workspaces also save partial results, so if you interrupt an evaluation half-way through, your progress is saved.
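
Conceptually, the workspace acts as a cache keyed by each step and its inputs: if the same (model, task) evaluation has already run, the result is read from disk instead of recomputed. Here is a toy illustration of that idea; the names are made up, and this is not Tango's real API:

import hashlib
import json
import pathlib

CACHE = pathlib.Path("./my-workspace")

def cached_evaluate(model: str, task: str, evaluate) -> dict:
    # Key the cache entry by the step's inputs, much as Tango keys steps by
    # their unique ids.
    key = hashlib.sha256(json.dumps([model, task]).encode()).hexdigest()
    path = CACHE / f"{key}.json"
    if path.exists():  # second run: fast path, no model evaluation
        return json.loads(path.read_text())
    result = evaluate(model, task)  # first run: do the actual work
    CACHE.mkdir(exist_ok=True)
    path.write_text(json.dumps(result))
    return result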

Team

ai2-catwalk is developed and maintained by the AllenNLP team, backed by the Allen Institute for Artificial Intelligence (AI2). AI2 is a non-profit institute with the mission to contribute to humanity through high-impact AI research and engineering. To learn more about who specifically contributed to this codebase, see our contributors page.

License

ai2-catwalk is licensed under Apache 2.0. A full copy of the license can be found on GitHub.

catwalk's People

Contributors

akshitab, anas-awadalla, armancohan, dependabot[bot], dirkgr, epwalsh, ianmagnusson, oyvindtafjord, pdasigi, tusharkhot, yulinggu-cs

catwalk's Issues

Slight differences in scores between EAI and Catwalk

  • arc_challenge
  • arc_easy
  • boolq
  • copa
  • headqa (This is Spanish. WTF?)
  • hellaswag
  • lambada
  • logiqa
  • mathqa
  • mc_taco
  • mrpc
  • multirc
  • openbookqa
  • piqa
  • prost
  • pubmedqa
  • qnli
  • qqp
  • race
  • rte
  • sciq
  • sst
  • triviaqa
  • webqs
  • wic
  • winogrande
  • wnli
  • wsc

Problem with the Beaker workspace

In the Fewshot branch of Catwalk, run python experiments/num_shots.py -w beaker://ai2/catwalk. It will fail:

[06/08/22 20:48:05] ERROR    Uncaught exception                                                                                                   logging.py:373
                             Traceback (most recent call last):
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 91, in step_info
                                 dataset = self.beaker.dataset.get(step_dataset_name(step_or_unique_id))
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/dataset.py", line 51, in get
                                 return _get(dataset)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/dataset.py", line 43, in
                             _get
                                 self.request(
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/service_client.py", line 98,
                             in request
                                 raise exceptions_for_status[response.status_code]
                             beaker.exceptions.DatasetNotFound: 'tango-step-PredictStep-001-YVpCvU4dbvUqC3z2ARe1r8kecyqZuipg': Make sure you're
                             using a valid Beaker dataset ID or the *full* name of the dataset (with the account prefix, e.g.
                             'username/dataset_name')

                             During handling of the above exception, another exception occurred:

                             Traceback (most recent call last):
                               File "/home/dirkg/catwalk/experiments/num_shots.py", line 73, in <module>
                                 main()
                               File "/home/dirkg/catwalk/experiments/num_shots.py", line 59, in main
                                 result = metrics.result(workspace)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 534, in result
                                 return self._run_with_work_dir(workspace, needed_by=needed_by)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 362, in
                             _run_with_work_dir
                                 kwargs = self._replace_steps_with_results(self.kwargs, workspace)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 509, in
                             _replace_steps_with_results
                                 return {
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 510, in <dictcomp>
                                 key: self._replace_steps_with_results(value, workspace) for key, value in o.items()
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 493, in
                             _replace_steps_with_results
                                 return o.result(workspace, self)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 534, in result
                                 return self._run_with_work_dir(workspace, needed_by=needed_by)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 376, in
                             _run_with_work_dir
                                 workspace.step_starting(self)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 118, in step_starting
                                 step_info = self.step_info(step)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 106, in step_info
                                 raise KeyError(step_or_unique_id)
                             KeyError: <catwalk.steps.PredictStep object at 0x7f38e2d834c0>

Problem with the WandB workspace

I have a reliable repro where the WandB workspace puts itself into a state that it can't recover from.

In this repo, on the Fewshot branch, I run python experiments/num_shots.py -w wandb://allennlp/catwalk --batch_size 4.

It will complain like this:

                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/wandb/workspace.py", line
                             147, in step_starting
                                 raise StepStateError(
                             tango.common.exceptions.StepStateError: Step 'PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw' is in unexpected
                             state 'running' If you are certain the step is not running somewhere else, delete the lock file at
                             /home/dirkg/.cache/tango/wandb_workspace/PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw/lock.

Exception ignored in: <function BaseFileLock.__del__ at 0x7f401c167dc0>
Traceback (most recent call last):
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_api.py", line 234, in __del__
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_api.py", line 204, in release
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_unix.py", line 49, in _release
TypeError: 'NoneType' object is not callable

When I remove /home/dirkg/.cache/tango/wandb_workspace/PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw/lock, the error comes back.

Import all the CrossFit tasks

CrossFit uses a somewhat unified format for its tasks. We could use it to import a bunch of tasks with very little code.

Here is a list of patterns that @ibeltagy found in CrossFit (a sketch of the common serialization pattern follows the list):

classification
- plain input / output: https://github.com/INK-USC/CrossFit/blob/master/tasks/ade_classification.py
- title: ... [SEP] content: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/amazon_polarity.py
- premise: ... [SEP] hypothesis: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/anli.py
- observation1: ... [SEP] observation2: ... [SEP] hypothesis1: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/art.py
- question: ... [SEP] context: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/boolq.py
- ... [SEP] ... https://github.com/INK-USC/CrossFit/blob/master/tasks/scicite.py
- ...and many more similar to the above, with different field names

text to text
- summarize: .....
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/gigaword.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/multi_news.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/reddit_tifu.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/samsum.py
- question: ... context: ... 
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/adversarial_qa.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ropes.py
	- (Most follow this template)
- question: ... [SEP] category: ... 
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/jeopardy.py
	- very few follow this template
- ... [SEP] .... 
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ade_effect.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/definite_pronoun_resolution.py
- <question string> [SEP] <context string> [SEP] <choices> https://github.com/INK-USC/CrossFit/blob/master/tasks/cosmos_qa.py
- <question string>. <choices>. 
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ai2_arc.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/hellaswag.py
	- should have been converted to classification
	- (multiple-choice datasets are a huge mess)
- question: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/break.py

sequence tagging: 
- ... [SEP] acronym: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/acronym_identification.py
- <string>
	- input: <string>
	- output: <entity> [SEP] <entity> ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/limit.py

regression
- review: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/app_reviews.py
- https://github.com/INK-USC/CrossFit/blob/master/tasks/google_wellformed_query.py
- question: ... [SEP] context: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/mocha.py

Other:
- https://github.com/INK-USC/CrossFit/blob/master/tasks/numer_sense.py
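
Most of the classification and text-to-text patterns above are just named fields joined with "[SEP]". An illustrative converter for that common case (this is not CrossFit's own code, and field names vary per task):

def crossfit_input(fields: dict[str, str]) -> str:
    """Serialize named fields into the 'name: value [SEP] name: value' shape."""
    return " [SEP] ".join(f"{name}: {value}" for name, value in fields.items())

print(crossfit_input({"premise": "A dog runs.", "hypothesis": "An animal moves."}))
# premise: A dog runs. [SEP] hypothesis: An animal moves.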

Add Tasks

  • arc_challenge | acc
  • arc_challenge | acc_norm
  • arc_easy | acc
  • arc_easy | acc_norm
  • boolq | acc
  • copa | acc
  • headqa | acc
  • headqa | acc_norm
  • hellaswag | acc
  • hellaswag | acc_norm
  • lambada | acc
  • logiqa | acc
  • logiqa | acc_norm
  • mathqa | acc
  • mathqa | acc_norm
  • mc_taco | f1
  • mrpc | acc
  • mrpc | f1
  • multirc | acc
  • openbookqa | acc
  • openbookqa | acc_norm
  • piqa | acc
  • piqa | acc_norm
  • prost | acc
  • prost | acc_norm
  • pubmedqa | acc
  • qnli | acc
  • qqp | acc
  • qqp | f1
  • race | acc
  • rte | acc
  • sciq | acc
  • sciq | acc_norm
  • sst | acc
  • triviaqa | acc
  • webqs | acc
  • wic | acc
  • winogrande | acc
  • wnli | acc
  • wsc | acc

Adding models/methods/datasets

Motivation: Various people have already asked for additions to Catwalk. This is risky because nobody is using Catwalk yet, but several people have said they want to (Pradeep, Matt/Hamish, Iz?, Ludwig).

Here are the sub-projects in order of importance:

  • Promptsource: This is the most requested one. The task would be to add a promptsource instance format to as many tasks as possible, and then evaluate various models with that format.
  • Make sure we have all the tasks in P3. This might be a no-op after the first item.
  • Few-shot prompting (for in-context learning). Nobody has explicitly asked for this; I think that's because it's obvious that Catwalk should have it.
  • Crossfit: Pradeep wants to use Crossfit. Those tasks should be fairly easy to add.
  • T-Few seems like a good baseline for a lot of our work, so it might become a good benchmark set for a while.
  • BigBench: Pradeep asked about those too, but backed off from it later. Could be a nice addition, but is at the bottom of this list on purpose.
  • Prompt format that uses the "channel method" for decoder-only models. Nobody has asked for it, but it came up in our reading group. I thought I could verify it with a quick experiment, but I could not; a sketch of the scoring idea follows this list.
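
For reference, the channel method scores p(input | label) with a decoder-only LM instead of p(label | input), and picks the label under which the input is most likely. A minimal sketch using the Hugging Face transformers API (this is not Catwalk code, and token boundaries at the label/input seam are approximated):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

def channel_score(label_text: str, input_text: str) -> float:
    """Total log-likelihood of input_text conditioned on label_text."""
    prefix_len = tokenizer(label_text, return_tensors="pt").input_ids.shape[1]
    full = tokenizer(label_text + " " + input_text, return_tensors="pt").input_ids
    labels = full.clone()
    labels[:, :prefix_len] = -100  # mask the label prefix; score only the input
    with torch.no_grad():
        loss = model(full, labels=labels).loss  # mean over unmasked tokens
    return -loss.item() * (full.shape[1] - prefix_len)

text = "This movie was a delight."
print(max(["positive", "negative"], key=lambda label: channel_score(label, text)))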

trainable_copy might need override_weights_file

The trainable_copy method of RankClassificationModel should probably initialize the copy with its self.override_weights_file. That would look like this:

def trainable_copy(self) -> TrainableModel:
    return TrainableRankClassificationModel(
        # Pass the override weights through, so the copy starts from the same
        # weights as the original model.
        self._make_model(
            self.pretrained_model_name_or_path,
            override_weights_file=self.override_weights_file,
        ),
        cached_transformers.get_tokenizer(AutoTokenizer, self.pretrained_model_name_or_path),
        self.predict_chunk,
    )

I still need to investigate how trainable_copy is used, though, to make sure this is correct.

Request: Either remove `short_name_for_model_object` or allow for full name registering of models

It took a while to debug why google/t5-v1_1-small wasn't working, even though it's registered in models/__init__.py. It's not obvious that the shortened name is particularly beneficial, whereas the cost is that it's hard to know the right model names to pass to Catwalk.

Recommend either removing this shortener or at least always supporting the full name of the model.

I haven't tested this for tasks, but I imagine there is a similar issue there.
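
A sketch of the second option: consult the registry first, and fall back to treating the argument as a full name. Everything here (MODEL_REGISTRY, HFAutoModel) is a hypothetical stand-in, not Catwalk's actual code:

class HFAutoModel:
    """Hypothetical stand-in for a model wrapper built from a full HF model id."""
    def __init__(self, name: str):
        self.name = name

MODEL_REGISTRY: dict[str, HFAutoModel] = {
    "t5-v1_1-small": HFAutoModel("google/t5-v1_1-small"),
}

def get_model(name: str) -> HFAutoModel:
    # Short registered names keep working ...
    if name in MODEL_REGISTRY:
        return MODEL_REGISTRY[name]
    # ... but a full name like "google/t5-v1_1-small" no longer raises a
    # KeyError; it is treated as a Hugging Face model id directly.
    return HFAutoModel(name)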

Import EAI tasks

EAI tasks that are not on CrossFit.
Total task files in EAI: 47
Missing from CrossFit: 28

Not on HF dataset

  • arithmetic
  • asdiv - A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers
  • coqa
  • gsm8k - Training Verifiers to Solve Math Word Problems
  • hendrycks_ethics, math, test
  • lambada-cloze
  • lambada-multilingual
  • logiqa
  • mutual - MuTual: A Dataset for Multi-Turn Dialogue Reasoning
  • naturalqs
  • pile
  • qa4mre
  • quac
  • sat
  • storycloze
  • translation
  • triviaqa
  • truthfulqa
  • unscramble
  • wikitext

Available on HF dataset

  • head_qa
  • lambada
  • prost - PROST: Physical Reasoning about Objects Through Space and Time
  • pubmedqa
  • qasper
  • wsc273 - another version of winogrande
  • cbt - Children's Book Test
  • drop

Add a way to train a model before evaluating it

Motivation: Full fine-tuning is a baseline, or rather an upper bound, in many zero-shot and few-shot experiments. @pdasigi has explicitly asked for this.

As part of this work, we'll add a new Tango step to Catwalk that trains a model on a given task/dataset, or on multiple tasks/datasets at the same time. It should call into Tango's training functions to do so. We'll also need to add a method or two to Catwalk's Model class to make this happen. Then we'll do a full evaluation on all reasonable tasks and all reasonable models, to establish good baselines across the board. This might make for a good blog post, too.

As a stretch goal, we should also try to train adaptation methods like prompt tuning, prefix tuning, or even IA3. There are some very nice implementations of some methods at https://github.com/r-three/t-few/tree/master/src.
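
A rough sketch of what such a step could look like, using Tango's Step API. The step name, its arguments, and the train_model helper are hypothetical, not existing Catwalk code:

from tango import Step

def train_model(model_name: str, tasks: list[str]) -> str:
    """Hypothetical helper that runs the actual training loop and returns a
    path to the trained weights."""
    ...

@Step.register("catwalk::finetune")
class FinetuneStep(Step):
    DETERMINISTIC = True
    CACHEABLE = True  # let the workspace cache trained models like any result

    def run(self, model: str, tasks: list[str]) -> str:
        # Train on one or more tasks; downstream predict/metric steps can then
        # consume the returned weights path.
        return train_model(model, tasks)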

Min et al. variant of in-context learning

Motivation: It's a good baseline that should be easy to implement in the catwalk context, but nobody has asked for it.

Liu et al. describe it like this:
Min et al. [21] proposed ensemble ICL, where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together. This lowers the memory cost by a factor of k/2 but increases the computational cost by a factor of 2. In terms of task performance, Min et al. [21] find that ensemble ICL outperforms the standard concatenative variant.

This depends on first getting normal few-shot ICL working on Catwalk.
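
In log space, "multiplying the output probabilities" is just summing per-example one-shot log-probabilities. A sketch of that selection rule, parameterized over any scoring function (a score(prompt, answer) helper returning log p(answer | prompt) under the LM is assumed, not implemented here):

from typing import Callable

def ensemble_icl(
    examples: list[str],
    query: str,
    answers: list[str],
    score: Callable[[str, str], float],  # log p(answer | prompt) under the LM
) -> str:
    """Pick the answer whose per-example 1-shot probabilities multiply highest."""
    def total(answer: str) -> float:
        # One 1-shot prompt per training example; summing log-probs equals the
        # product of probabilities described in the paper.
        return sum(score(example + "\n" + query, answer) for example in examples)
    return max(answers, key=total)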
