
kear's Introduction

Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention

This PyTorch package implements the KEAR model, which surpasses human performance on the CommonsenseQA benchmark, as described in:

Yichong Xu, Chenguang Zhu, Shuohang Wang, Siqi Sun, Hao Cheng, Xiaodong Liu, Jianfeng Gao, Pengcheng He, Michael Zeng and Xuedong Huang
Human Parity on CommonsenseQA: Augmenting Self-Attention with External Attention
The 31st International Joint Conference on Artificial Intelligence (IJCAI), 2022.

The package also includes code for our earlier DEKCOR model, as described in:

Yichong Xu∗, Chenguang Zhu∗, Ruochen Xu, Yang Liu, Michael Zeng and Xuedong Huang
Fusing Context Into Knowledge Graph for Commonsense Question Answering
Findings of the 59th Annual Meeting of the Association for Computational Linguistics (ACL), 2021

Please cite the above papers if you use this code.

Results

This package achieves state-of-the-art performance on the CommonsenseQA leaderboard: 86.1% (single model) and 89.4% (ensemble), surpassing the human accuracy of 88.9%.

Quickstart

  1. Pull the Docker image:
    > docker pull yichongx/csqa:human_parity

  2. Run the container:
    > nvidia-docker run -it --mount src='/',target=/workspace/,type=bind yichongx/csqa:human_parity /bin/bash
    > cd /workspace/path/to/repo
    Note that src='/' bind-mounts the entire host filesystem into /workspace; point src at your repository checkout for a narrower mount.
    If this is your first time using Docker, see https://docs.docker.com/
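  Once inside the container, a quick sanity check that the GPUs are visible (this assumes the NVIDIA container runtime is installed on the host):
    > nvidia-smi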

Features

Our code supports flexible training of various models on multiple-choice QA.

  • Distributed training with PyTorch native DDP or DeepSpeed: see bash/task_train.sh
  • Pause and resume training at any step with the --continue_train option (see the example below)
  • Use any Transformer encoder, including ELECTRA, DeBERTa, and ALBERT
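
For example, an interrupted distributed run can be resumed by relaunching the same command with --continue_train added. This is a sketch; the remaining flags must match the original run:

    > python -m torch.distributed.launch --nproc_per_node=2 task.py --ddp --continue_train [other flags as in bash/task_train.sh]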

Preprocessing data

Pre-processed data is located at data/.

We release the code for knowledge-graph and dictionary external attention in preprocess/:

  1. Download data
    > cd preprocess
    > bash download_data.sh
  2. Add ConceptNet triples and Wiktionary definitions to the data (a conceptual sketch of the resulting input follows this list)
    > python add_knowledge.py
  3. Add the most frequent relations in each question as side information
    > python add_freq_rel.py

Training and Prediction

  1. Train a model
    > bash bash/task_train.sh
  2. Make predictions
    > bash bash/task_predict.sh
  See task.py for available options; a reconstructed example invocation follows.
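
For reference, bash/task_train.sh wraps a torch.distributed.launch invocation of task.py. The command below is reconstructed from a user's training log in the issues further down this page, so treat it as a sketch; the exact flag set may differ across versions of the script:

    > python -m torch.distributed.launch --nproc_per_node=2 task.py --ddp \
        --data_version csqa_ret_3datasets --preset_model_type electra \
        --append_descr 1 --append_answer_text 1 --lr 1e-5 --weight_decay 0.01 \
        --batch_size 2 --max_seq_length 50 --num_train_epochs 10 \
        --optimizer_type adamw --warmup_proportion 0.1 --seed 42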

Running codes for DEKCOR

The current code is mostly compatible with DEKCOR. To run the original DEKCOR code, please check out the tag DEKCOR to use the previous version.

by Yichong Xu
[email protected]

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

kear's People

Contributors

microsoft-github-operations[bot], microsoftopensource, xycforgithub


kear's Issues

[DEKCOR] Request for original code

Hi,

I came across a paper titled "Fusing Context Into Knowledge Graph for Commonsense Question Answering" and I was highly interested in your work.

I sent an email to [email protected] to get DEKCOR's original code, but it was blocked, so I am asking here instead.

Can I get DEKCOR's original code?

Thank you.


AssertionError

I have the following problem when I run it:

(yzy-KEAR) jizhi2@jizhi2-MS-7A78:/media/jizhi2/软件/yzy/KEAR$ bash/task_train.sh
*****************************************
Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed.
*****************************************
start is 1660190764.6195216
start is 1660190764.6195297

[1858634] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '0', 'WORLD_SIZE': '2'}
[1858635] Initializing process group with: {'MASTER_ADDR': '127.0.0.1', 'MASTER_PORT': '29500', 'RANK': '1', 'WORLD_SIZE': '2'}
[1858635]: world_size = 2, rank = 1, backend=nccl
[1858634]: world_size = 2, rank = 0, backend=nccl
batch size: 2, total_batch_size: 10
batch size: 2, total_batch_size: 10

clearing output folder.
args.fp16 is 0
args.fp16 is 0
load_vocab google/electra-large-discriminator
load_vocab google/electra-large-discriminator
load_data data/csqa_ret_3datasets/train_data.json
load_data data/csqa_ret_3datasets/train_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 9741, world_size: 2
load_data data/csqa_ret_3datasets/dev_data.json
data: 1222, world_size: 2
get dir test/
make dataloader ...
data: 1222, world_size: 2
get dir test/
make dataloader ...
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 200
95 percent len: 98
train_data 9741
total length: 2436
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
max len: 168
95 percent len: 97
devlp_data 1222
init_model google/electra-large-discriminator
set config, model_type= electra
deepspeed: False
resume_training: False
model_type= electra
model_type= electra
init model finished.
init model finished.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense.bias', 'discriminator_predictions.dense.weight']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Some weights of the model checkpoint at google/electra-large-discriminator were not used when initializing Model: ['discriminator_predictions.dense.weight', 'discriminator_predictions.dense_prediction.weight', 'discriminator_predictions.dense_prediction.bias', 'discriminator_predictions.dense.bias']
- This IS expected if you are initializing Model from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Model from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Model were not initialized from the model checkpoint at google/electra-large-discriminator and are newly initialized: ['scorer.csqa_ret_3datasets.scorer.weight', 'scorer.csqa_ret_3datasets.scorer.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
2022-08-11 12:06:35,144 - __main__ - INFO - initializing trainer.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
Trainer: fp16 is 0
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
2022-08-11 12:06:35,906 - __main__ - INFO - initialize trainer finished.
2022-08-11 12:06:35,906 - __main__ - INFO - setting up optimizer
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/transformers/optimization.py:306: FutureWarning: This implementation of AdamW is deprecated and will be removed in a future version. Use the PyTorch implementation torch.optim.AdamW instead, or set `no_deprecation_warning=True` to disable this warning
  warnings.warn(
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
2022-08-11 12:06:35,912 - __main__ - INFO - deepspeed wrap
finish deepspeed wrap
load successfully.
load successfully.

2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
2022-08-11 12:06:35,915 - utils.trainer - INFO - total n_step = 2436, evaluate_step = 1218
---- Epoch: 01 ----
Traceback (most recent call last):
  File "task.py", line 410, in <module>
    srt.train(train_dataloader, devlp_dataloaders, save_last=False, save_every=args.save_every)
  File "task.py", line 93, in train
    self.trainer.train(
  File "/media/jizhi2/软件/yzy/KEAR/utils/trainer.py", line 81, in train
    for step, batch in enumerate(train_looper):
  File "/media/jizhi2/软件/yzy/KEAR/utils/dataloader_sampler.py", line 32, in __iter__
    batch = next(self.dataloader_iter)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 435, in __next__
    data = self._next_data()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 474, in _next_data
    index = self._next_index()  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 427, in _next_index
    return next(self._sampler_iter)  # may raise StopIteration
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/utils/data/sampler.py", line 227, in __iter__
    for idx in self.sampler:
  File "/media/jizhi2/软件/yzy/KEAR/utils/resumable_sampler.py", line 31, in __iter__
    assert len(self.perm) == self.total_size
AssertionError
AssertionError
Traceback (most recent call last):
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 192, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 260, in <module>
    main()
  File "/home/jizhi2/.conda/envs/yzy-KEAR/lib/python3.8/site-packages/torch/distributed/launch.py", line 255, in main
    raise subprocess.CalledProcessError(returncode=process.returncode,
subprocess.CalledProcessError: Command '['/home/jizhi2/.conda/envs/yzy-KEAR/bin/python', '-u', 'task.py', '--local_rank=1', '--append_descr', '1', '--data_version', 'csqa_ret_3datasets', '--lr', '1e-5', '--append_answer_text', '1', '--weight_decay', '0.01', '--preset_model_type', 'electra', '--batch_size', '2', '--max_seq_length', '50', '--num_train_epochs', '10', '--save_interval_step', '2', '--continue_train', '--print_number_per_epoch', '2', '--vary_segment_id', '--seed', '42', '--warmup_proportion', '0.1', '--optimizer_type', 'adamw', '--ddp', '--print_loss_step', '10', '--clear_output_folder']' returned non-zero exit status 1.

I debugged and found that the value of len(self.perm) is 9741 and the value of self.total_size is 9742.
What is the reason for this?
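
A plausible explanation, inferred from the numbers rather than from the repo's code: with 9741 examples and a world size of 2, a distributed sampler rounds the per-epoch total up to 9742 so that both ranks receive the same number of indices, which requires padding the permutation to that length. The standalone sketch below shows the standard padding step as performed by torch.utils.data.DistributedSampler; it is not necessarily the fix the authors intended:

    import math

    def pad_permutation(perm, num_replicas):
        # Pad an index permutation so it splits evenly across ranks.
        total_size = num_replicas * math.ceil(len(perm) / num_replicas)
        padding = total_size - len(perm)   # e.g. 9742 - 9741 = 1
        perm = perm + perm[:padding]       # reuse leading indices as padding
        assert len(perm) == total_size
        return perm

    # Matching the log above: 9741 examples on 2 GPUs -> 9742 indices in total.
    padded = pad_permutation(list(range(9741)), num_replicas=2)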

Is it possible to preprocess the data myself?

Hello, thanks for the awesome work!
I was trying to apply a different knowledge graph to this model, so I'd like to ask whether it is possible to get access to the code for the preprocessing part of your paper.

Retrieval field in the dataset

Hello! Thank you for sharing this amazing work!

I wonder if we can also get the "retrieval" field in the csqa ret dataset (I want to make my own preprocessed dataset). Do you have any code for retrieving that? Is it about retrieving relevant training examples (q/a pairs) based on the question and choice? I tried looking into the code but could not find it.

train model

Hello, excuse me. Could you tell me how I can reproduce the results in your paper? When I train the model following the method in the README, the accuracy keeps dropping with each round of training. Can you tell me what the reason might be? Looking forward to your reply~

Missing preprocessing script?

Hi, I'm trying to run your model on a similar QA dataset, but I am wondering which script you used for generating the concepts in question & choices.

Probably needs a bit more guidance for the "general public"

Super nice to have this code here. I will probably spend much more time with it, but at first I simply tried to see whether it would run on my laptop. Your Docker one-liner does eventually give the /workspace prompt, though I'm not sure where to go from there, or whether any data is already included. At least I did discover that the scripts are meant to run from inside /workspace; I had thought some of them might be useful straight after cloning the repo, but they are not. I am not entirely sure what black magic you've done, but I should probably study it and copy it for my own "reproducible" ML experiments.

It would be nice if someone who's more in touch with the tech did a couple of walkthroughs and even timed them, so we would know whether we're on the right path or not!

Is the code incomplete?

Excuse me, the class behind "self.deberta = MyDebertaV2Model(config)" does not seem to be uploaded. Could you provide it?

Performance on other PLM

Hello,
Amazing work! Did you ever try other PLMs (BERT, RoBERTa, ...) as your backbone model? Or did they perform poorly in your preliminary experiments? Thanks so much!

Bad results with DeBERTa V2 and runtime error (CUDA out of memory) with DeBERTa V3

Hello !

First of all, thanks for sharing your work.

I tried to run the training with the line that should reproduce the paper's results (the last one in "task_train.sh"), but I keep getting a runtime error telling me that I don't have enough memory. So my question is: which GPU configuration did you use to run the training with those hyper-parameters?

Also, the second line (training with DeBERTa V2 xlarge) gives me very poor accuracy (around 20%); is that normal? I ran all the preprocessing scripts and had to change the Wiktionary link in "download_data.sh" to download it, so maybe the latest dump differs too much and the data no longer work properly with the model?
As an indication, the new link I used is https://kaikki.org/dictionary/English/all-non-inflected-senses/kaikki_dot_org-dictionary-English-all-non-infl-PIoLCx8T.json

Maybe something else needs an update?

Looking forward to your reply!

GPU Memory size

In the paper you mentioned "The batch size is set to 48 or smaller to fit the batch on to a single GPU."

I am running your code with DeBERTa-Large on a single NVIDIA V100 (16 GB) GPU with max_seq_length set to 512, and even after breaking the input into 5 chunks (one inference per question-choice score), I am unable to train the model without running into memory issues.

Could you provide some details regarding the hardware used?

Inquiry regarding OBQA data processing for "Fusing Context Into Knowledge Graph for Commonsense Question Answering"

I am writing to seek your guidance regarding the processing of the OBQA dataset in the context of the DEKCOR paper. Specifically, I have been identifying entities from question-answer pairs, retrieving the most frequent question and answer entities in the text, and then extracting the first corresponding description from Wikipedia. However, the experimental results have not been satisfactory, with an accuracy of around 28%.
