allenai / catwalk

This project studies the performance and robustness of language models and task-adaptation methods.

License: Apache License 2.0

Python 98.05% Makefile 0.06% Shell 0.14% Jsonnet 1.68% Dockerfile 0.06%

catwalk's People

Contributors

akshitab, anas-awadalla, armancohan, dependabot[bot], dirkgr, epwalsh, ianmagnusson, oyvindtafjord, pdasigi, tusharkhot, yulinggu-cs


catwalk's Issues

Problem with the Beaker workspace

On the `Fewshot` branch of Catwalk, run `python experiments/num_shots.py -w beaker://ai2/catwalk`. It fails:

[06/08/22 20:48:05] ERROR    Uncaught exception                                                                                                   logging.py:373
                             Traceback (most recent call last):
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 91, in step_info
                                 dataset = self.beaker.dataset.get(step_dataset_name(step_or_unique_id))
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/dataset.py", line 51, in get
                                 return _get(dataset)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/dataset.py", line 43, in
                             _get
                                 self.request(
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/beaker/services/service_client.py", line 98,
                             in request
                                 raise exceptions_for_status[response.status_code]
                             beaker.exceptions.DatasetNotFound: 'tango-step-PredictStep-001-YVpCvU4dbvUqC3z2ARe1r8kecyqZuipg': Make sure you're
                             using a valid Beaker dataset ID or the *full* name of the dataset (with the account prefix, e.g.
                             'username/dataset_name')

                             During handling of the above exception, another exception occurred:

                             Traceback (most recent call last):
                               File "/home/dirkg/catwalk/experiments/num_shots.py", line 73, in <module>
                                 main()
                               File "/home/dirkg/catwalk/experiments/num_shots.py", line 59, in main
                                 result = metrics.result(workspace)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 534, in result
                                 return self._run_with_work_dir(workspace, needed_by=needed_by)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 362, in
                             _run_with_work_dir
                                 kwargs = self._replace_steps_with_results(self.kwargs, workspace)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 509, in
                             _replace_steps_with_results
                                 return {
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 510, in <dictcomp>
                                 key: self._replace_steps_with_results(value, workspace) for key, value in o.items()
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 493, in
                             _replace_steps_with_results
                                 return o.result(workspace, self)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 534, in result
                                 return self._run_with_work_dir(workspace, needed_by=needed_by)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/step.py", line 376, in
                             _run_with_work_dir
                                 workspace.step_starting(self)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 118, in step_starting
                                 step_info = self.step_info(step)
                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/beaker/workspace.py",
                             line 106, in step_info
                                 raise KeyError(step_or_unique_id)
                             KeyError: <catwalk.steps.PredictStep object at 0x7f38e2d834c0>
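
Stripped of the tango/Beaker specifics, the traceback reduces to a lookup helper that translates a "not found" error into a `KeyError`, plus a caller (`step_starting`) that never handles the miss for a brand-new step. Here is a minimal self-contained sketch of that failure shape and one possible fix; every name below is an illustrative stand-in, not tango's actual code:

```python
class DatasetNotFound(Exception):
    """Stand-in for beaker.exceptions.DatasetNotFound."""


class Workspace:
    def __init__(self):
        # Stands in for the remote Beaker dataset store.
        self._datasets = {}

    def _get_dataset(self, name):
        try:
            return self._datasets[name]
        except KeyError:
            raise DatasetNotFound(name)

    def step_info(self, step_id):
        # Mirrors the first traceback: a missing dataset becomes a KeyError.
        try:
            return self._get_dataset("tango-step-" + step_id)
        except DatasetNotFound:
            raise KeyError(step_id)

    def step_starting(self, step_id):
        # Bug shape: a brand-new step has no dataset yet, so step_info raises.
        # A fix is to create the missing entry instead of crashing.
        try:
            return self.step_info(step_id)
        except KeyError:
            info = {"id": step_id, "state": "running"}
            self._datasets["tango-step-" + step_id] = info
            return info
```

With a fix of this shape, starting a never-seen step would register it rather than raise.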

Add a way to train a model before evaluating it

Motivation: Full fine-tuning is a baseline, or rather an upper bound, in many zero-shot and few-shot experiments. @pdasigi has explicitly asked for this.

As part of this work, we'll add a new Tango step to Catwalk that trains a model on a given task/dataset, or on multiple tasks/datasets at the same time. It should call into Tango's training functions to do so. We'll also need to add a method or two to Catwalk's Model class to make this happen. Then we'll do a full evaluation on all reasonable tasks and all reasonable models, to establish good baselines across the board. This might make for a good blog post, too.
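
A rough sketch of the shape this step could take. Everything below is a hypothetical stand-in: the real version would subclass tango's `Step` and call into Tango's training loop, and `trainable_copy` is only sketched as a comment:

```python
class TrainableModel:
    """Stand-in for the trainable variant of Catwalk's Model class."""

    def __init__(self, name):
        self.name = name
        self.steps_trained = 0

    def training_step(self, batch):
        # Real code would run forward/backward/optimizer here.
        self.steps_trained += 1


def train_step(model, tasks, epochs=1):
    """Train `model` on batches from one or more tasks, then return it so a
    downstream prediction step can evaluate the trained weights."""
    trained = model  # real code: model.trainable_copy()
    for _ in range(epochs):
        for task in tasks:
            for batch in task:  # each task yields batches of instances
                trained.training_step(batch)
    return trained
```

Training on multiple tasks at once then falls out naturally: pass more than one task and the loop interleaves them per epoch.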

As a stretch goal, we should also try to train adaptation methods like prompt tuning, prefix tuning, or even IA3. There are some very nice implementations of some methods at https://github.com/r-three/t-few/tree/master/src.

Add Tasks

  • arc_challenge | acc
  • arc_challenge | acc_norm
  • arc_easy | acc
  • arc_easy | acc_norm
  • boolq | acc
  • copa | acc
  • headqa | acc
  • headqa | acc_norm
  • hellaswag | acc
  • hellaswag | acc_norm
  • lambada | acc
  • logiqa | acc
  • logiqa | acc_norm
  • mathqa | acc
  • mathqa | acc_norm
  • mc_taco | f1
  • mrpc | acc
  • mrpc | f1
  • multirc | acc
  • openbookqa | acc
  • openbookqa | acc_norm
  • piqa | acc
  • piqa | acc_norm
  • prost | acc
  • prost | acc_norm
  • pubmedqa | acc
  • qnli | acc
  • qqp | acc
  • qqp | f1
  • race | acc
  • rte | acc
  • sciq | acc
  • sciq | acc_norm
  • sst | acc
  • triviaqa | acc
  • webqs | acc
  • wic | acc
  • winogrande | acc
  • wnli | acc
  • wsc | acc

Problem with the WandB workspace

I have a reliable repro where the WandB workspace puts itself into a state that it can't recover from.

In this repo, on the Fewshot branch, I run python experiments/num_shots.py -w wandb://allennlp/catwalk --batch_size 4.

It will complain like this:

                               File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/tango/integrations/wandb/workspace.py", line
                             147, in step_starting
                                 raise StepStateError(
                             tango.common.exceptions.StepStateError: Step 'PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw' is in unexpected
                             state 'running' If you are certain the step is not running somewhere else, delete the lock file at
                             /home/dirkg/.cache/tango/wandb_workspace/PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw/lock.

Exception ignored in: <function BaseFileLock.__del__ at 0x7f401c167dc0>
Traceback (most recent call last):
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_api.py", line 234, in __del__
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_api.py", line 204, in release
  File "/home/dirkg/miniconda3/envs/catwalk/lib/python3.9/site-packages/filelock/_unix.py", line 49, in _release
TypeError: 'NoneType' object is not callable

Even when I remove /home/dirkg/.cache/tango/wandb_workspace/PredictStep-001-aUzzaKky7tw1rXhp38mMWuboeag8cCGw/lock, the error just comes back.

Min et al. variant of in-context learning

Motivation: It's a good baseline that should be easy to implement in the catwalk context, but nobody has asked for it.

Described by Liu et al. like this:
Min et al. [21] proposed ensemble ICL, where instead of using the output probability from concatenating the k training examples, the output probabilities of the model on each training example (i.e. 1-shot ICL for each of the k examples) are multiplied together. This lowers the memory cost by a factor of k/2 but increases the computational cost by a factor of 2. In terms of task performance, Min et al. [21] find that ensemble ICL outperforms the standard concatenative variant.

This depends on first getting normal few-shot ICL working on Catwalk.
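
For concreteness, the difference between the two variants can be sketched in a few lines, with `logprob` standing in for a model's conditional log-probability of the answer given a context. Multiplying probabilities is summing log-probabilities. Everything here is hypothetical scaffolding, not Catwalk API:

```python
def concat_icl_score(logprob, shots, query, answer):
    """Standard ICL: one pass over the concatenation of all k shots."""
    context = " ".join(shots) + " " + query
    return logprob(context, answer)


def ensemble_icl_score(logprob, shots, query, answer):
    """Min et al. ensemble ICL: k independent 1-shot passes, with the
    per-shot answer probabilities multiplied (log-probs summed)."""
    return sum(logprob(shot + " " + query, answer) for shot in shots)
```

The memory saving comes from each forward pass seeing only one shot instead of k; the extra compute comes from running k short passes instead of one long one.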

Request: Either remove `short_name_for_model_object` or allow for full name registering of models

It took a while to debug why google/t5-v1_1-small wasn't working even though it's registered in models/__init__.py. It's not obvious how the shortened name is beneficial, whereas the cost is that it's hard to know which model names to pass to catwalk.

I recommend either removing this shortener or at least always supporting the full name of the model.

I haven't tested this for Tasks, but I imagine they have a similar issue.
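
One way to get both behaviors is to treat the short name as an optional alias on top of a registry keyed by full names. A hypothetical sketch, not Catwalk's actual registry:

```python
MODELS = {}


def register_model(full_name, short_name=None):
    """Register a model under its full name, plus an optional short alias."""
    MODELS[full_name] = full_name  # the full name always works
    if short_name is not None:
        MODELS[short_name] = full_name  # the short name is optional sugar


def resolve_model(name):
    """Resolve either a full name or an alias to the full model name."""
    try:
        return MODELS[name]
    except KeyError:
        known = ", ".join(sorted(MODELS))
        raise KeyError("Unknown model '" + name + "'. Known names: " + known)
```

Listing the known names in the error message would also have made the original `google/t5-v1_1-small` failure much quicker to debug.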

Adding models/methods/datasets

Motivation: Various people have already asked for various additions to Catwalk. This is risky because nobody is using Catwalk yet, but several people have said they want to (Pradeep, Matt/Hamish, Iz?, Ludwig).

Here are the sub-projects in order of importance:

  • Promptsource: This is the most requested one. The task would be to add a promptsource instance format to as many tasks as possible, and then evaluate various models with that format.
  • Make sure we have all the tasks in P3. This might be a no-op after the first item.
  • Few-shot prompting (for in-context learning). Nobody has explicitly asked for this. I think nobody asks because it's obvious that catwalk would have this.
  • Crossfit: Pradeep wants to use Crossfit. Those tasks should be fairly easy to add.
  • T-Few seems like a good baseline for a lot of our work, so it might become a good benchmark set for a while.
  • BigBench: Pradeep asked about those too, but backed off from it later. Could be a nice addition, but is at the bottom of this list on purpose.
  • Prompt format that uses the "channel method" for decoder-only models. Nobody has asked for it, but it came up in our reading group. I thought I could verify it with a quick experiment, but I could not.

Slight differences in scores between EAI and Catwalk

  • arc_challenge
  • arc_easy
  • boolq
  • copa
  • headqa (This is Spanish. WTF?)
  • hellaswag
  • lambada
  • logiqa
  • mathqa
  • mc_taco
  • mrpc
  • multirc
  • openbookqa
  • piqa
  • prost
  • pubmedqa
  • qnli
  • qqp
  • race
  • rte
  • sciq
  • sst
  • triviaqa
  • webqs
  • wic
  • winogrande
  • wnli
  • wsc

trainable_copy might need override_weights_file

The trainable_copy method of RankClassificationModel should probably initialize its model with self.override_weights_file. That would look like this:

def trainable_copy(self) -> TrainableModel:
    return TrainableRankClassificationModel(
        self._make_model(
            self.pretrained_model_name_or_path,
            override_weights_file=self.override_weights_file,
        ),
        cached_transformers.get_tokenizer(AutoTokenizer, self.pretrained_model_name_or_path),
        self.predict_chunk,
    )

I still need to investigate how trainable_copy is used to make sure this is correct.

Import all the CrossFit tasks

CrossFit has a somewhat unified format for their tasks. We could use it to get a bunch of tasks with very little code.

Here is a list of patterns that @ibeltagy found in CrossFit:

classification
- plain input / output: https://github.com/INK-USC/CrossFit/blob/master/tasks/ade_classification.py
- title: ... [SEP] content: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/amazon_polarity.py
- premise: ... [SEP] hypothesis: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/anli.py
- observation1: ... [SEP] observation2: ... [SEP] hypothesis1: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/art.py
- question: ... [SEP] context: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/boolq.py
- ... [SEP] ... https://github.com/INK-USC/CrossFit/blob/master/tasks/scicite.py
- ... and many more similar to the above with different field names

text to text
- summarize: ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/gigaword.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/multi_news.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/reddit_tifu.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/samsum.py
- question: ... context: ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/adversarial_qa.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ropes.py
	- (most follow this template)
- question: ... [SEP] category: ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/jeopardy.py
	- (very few follow this template)
- ... [SEP] ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ade_effect.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/definite_pronoun_resolution.py
- <question string> [SEP] <context string> [SEP] <choices>: https://github.com/INK-USC/CrossFit/blob/master/tasks/cosmos_qa.py
- <question string>. <choices>.
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/ai2_arc.py
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/hellaswag.py
	- (should have been converted to classification; multiple-choice datasets are a huge mess)
- question: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/break.py

sequence tagging
- ... [SEP] acronym: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/acronym_identification.py
- input: <string>, output: <entity> [SEP] <entity> ...
	- https://github.com/INK-USC/CrossFit/blob/master/tasks/limit.py

regression
- review: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/app_reviews.py
- https://github.com/INK-USC/CrossFit/blob/master/tasks/google_wellformed_query.py
- question: ... [SEP] context: ... https://github.com/INK-USC/CrossFit/blob/master/tasks/mocha.py

other
- https://github.com/INK-USC/CrossFit/blob/master/tasks/numer_sense.py
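
Since most of the patterns above are `field: value [SEP] field: value` strings, one small parser could cover many of these tasks with very little per-task code. A hypothetical sketch, not CrossFit's actual API:

```python
def parse_crossfit_input(text):
    """Split a CrossFit-style 'question: ... [SEP] context: ...' string
    into a field dict. Segments without a 'name:' prefix are collected
    under the 'text' key (covering the '... [SEP] ...' patterns)."""
    fields = {}
    for segment in text.split(" [SEP] "):
        name, sep, value = segment.partition(": ")
        if sep:
            fields[name] = value
        else:
            fields.setdefault("text", []).append(segment)
    return fields
```

Per-task code would then shrink to naming the fields and mapping them onto a Catwalk instance format.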

Import EAI tasks

EAI tasks that are not in CrossFit.
Total task files in EAI: 47
Missing from CrossFit: 28

Not on HF dataset

  • arithmetic
  • asdiv - A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers
  • coqa
  • gsm8k - Training Verifiers to Solve Math Word Problems
  • hendrycks_ethics, math, test
  • lambada-cloze
  • lambada-multilingual
  • logiqa
  • mutual - MuTual: A Dataset for Multi-Turn Dialogue Reasoning
  • naturalqs
  • pile
  • qa4mre
  • quac
  • sat
  • storycloze
  • translation
  • triviaqa
  • truthfulqa
  • unscramble
  • wikitext

Available on HF dataset

  • head_qa
  • lambada
  • prost - PROST: Physical Reasoning about Objects Through Space and Time
  • pubmedqa
  • qasper
  • wsc273 - another version of winogrande
  • cbt - Children’s Book
  • drop
