acs-qg's Introduction

ACS-QG

Factorized question generation for controlled MRC training data generation.

This is the code for our paper "Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus". Please cite this paper if it is useful for your projects. Thanks.

How to run

  1. Check and install requirements

  2. Download datasets: GloVe, SQuAD1.1-Zhou, SQuAD 2.0, BPE, etc. Get wiki10000.json from https://www.dropbox.com/s/mkwfazyr9bmrqc5/wiki10000.json.zip?dl=0. Some data is also included in the Datasets folder. Note that the path of Datasets/ does not match the one in common/constants.py; revise the path there to point to your Datasets directory.

  3. Change paths in the code: if your code structure is different, go to "config.py" and change "dataset_name" and other paths. In addition, go to "common/constants.py" and change the paths there.

  4. Set the experiment platform: some platform-specific workarounds were added for running experiments on an internal platform. Go to common/constants.py and set EXP_PLATFORM to "others" instead of "venus"; this skips the platform-specific code (see the snippet after this list).

  5. How to run:

     a. Debug: run experiments_0_debug.sh. If it succeeds, your environment works well.

     b. Train the models once: run the experiments_1_... scripts, possibly at the same time. You can run them on different GPUs to save time.

     c. Extract sentences once (used by the data augmenter). Notice: change the data paths based on your structure. Run experiments_2-DA_file2sents.sh.

     d. Run different versions of experiments_3_repeat_da_de.sh in parallel with different arguments. This step samples inputs by sequential sampling; see the content in the header of experiments_3_repeat_da_de.sh. Search and replace the parameters to perform data augmentation and data evaluation on different parts of the generated sentences. We do this to save time: different GPUs can generate a lot of data in parallel.

     e. Generate questions in parallel using seq2seq or GPT-2; you can choose just one kind of generation model. Run experiments_4_QG_generate_gpt2.sh to generate questions with GPT-2, or experiments_4_QG_generate_seq2seq.sh to generate questions with seq2seq.

     f. Remove duplicated data: run experiments_5_uniq_seq2seq.sh.

     g. Post-process the seq2seq results to handle the repetition problem (not required if you use GPT-2): run experiments_6_postprocess_seq2seq.sh.
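A minimal sketch of the change in step 4, assuming EXP_PLATFORM is a module-level constant in common/constants.py:

# common/constants.py
# Skip the platform-specific code paths when running outside the internal platform.
EXP_PLATFORM = "others"  # instead of "venus"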

acs-qg's People

Contributors

bangliu, dependabot[bot]


acs-qg's Issues

Where to find the "SQuAD1.1-Zhou" and properly formatted SQuAD2.0 datasets

Apologies if this request is not suitable for an issue.

I'm trying to train/reproduce this model (or rather, collection of models), but I'm having a hard time gathering all the necessary datasets. I especially cannot find any dataset matching "SQuAD1.1-Zhou". I was only able to find this version of SQuAD1.1, but it doesn't seem to fit the expected input format: the preprocessor in FQG_data.py expects tab-separated lines of at least 10 fields, whereas this dataset only has 4.
Could anyone give me a pointer to the right dataset?

EDIT: Now also concerning the SQuAD2.0 data set, see my comment below.

Possible discrepancies between training pipeline in code vs paper

I'm trying to reproduce the results of the associated paper, but I have trouble making sense of how the code fits the text. In the paper, the pipeline seems fairly straightforward (fig. 2, p. 3): a QA dataset is augmented to obtain ACS-aware datasets; with these, a QG model is trained; and, in a third step, its results are refined via filtering.

In the code, various "experiments" exist, some resembling certain steps of the pipeline in the paper. However, here is what I don't really understand: one of the first experiments, experiments_1_QG_train_seq2seq.sh, trains a QG model via QG_main.py, but without the augmented data. Is this just for comparison (e.g. as a baseline)?

A later experiment, experiments_3_repeat_da_de.sh, seems closer to the pipeline in the paper. Augmented data is created and then used for another QG model, this one in QG_augment_main.py. However, this model doesn't actually seem to be trained at all; in the code, it only gets tested (see here). I don't really understand, then, where the training step of a model with augmented data actually takes place. Or am I just missing something?

Apologies if this is not really fitting for an issue. And great work on the paper!

packages in requirements.txt have Python version mismatches

I can't install everything in requirements.txt: some packages require one version of Python, while others require another.

I hit a roadblock while installing the requirements when I could not find the de-core-news-sm package on PyPI.

Can you please add some more information on which Python version is required, which OS the tool has been tested on, etc.?

code

When will the code be provided?

Fixes and questions for several issues in this repository

I have issues with this code under several conditions, which I explain in the statements below.

Issue 1

Because the code uses benepar and needs to download benepar_en2 (or benepar_en2.gz), we must first run a setup script like:

import nltk, benepar
nltk.download('punkt')
benepar.download('benepar_en2')

After that, when I run python config.py, an error says we must change several statements in the benepar library, e.g. in /usr/local/lib/<python_version>/dist-packages/benepar/base_parser.py. Change them to the following syntax (I prefer using the nano editor to edit files under the root or nltk directories):

graph = tf.Graph() 
graph_def = tf.compat.v1.GraphDef()

But another error appears; we must disable the TF2 behaviour like this:

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

This works, in my experience, but we should think twice: a future update of the benepar library may break this again.
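Putting the two fixes together, the patched section of benepar/base_parser.py would look roughly like this (a sketch only; the exact surrounding code depends on the benepar version):

import tensorflow.compat.v1 as tf
tf.disable_v2_behavior()

graph = tf.Graph()
graph_def = tf.GraphDef()  # TF1-compat equivalent of the original GraphDef()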

Issue 2

Because we're using GloVe pretrained vectors with load_word2vec_format, we must first convert them from GloVe to word2vec format, using code like this:

import gensim
from gensim.scripts.glove2word2vec import glove2word2vec
...
GLOVE_TXT_PATH = DATA_PATH + "original/Glove/glove.6B.300d.txt"
GLOVE_OUT_PATH = DATA_PATH + "original/Glove/word2vec.txt"  # whatever name you want
...
glove2word2vec(GLOVE_TXT_PATH, GLOVE_OUT_PATH)
# glove2word2vec writes a plain-text file, so load it with binary=False
GLOVE = gensim.models.KeyedVectors.load_word2vec_format(GLOVE_OUT_PATH, binary=False, encoding='utf-8', unicode_errors='ignore')
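A quick sanity check after loading (assuming the 300-dimensional glove.6B vectors):

print(GLOVE["the"].shape)  # expect (300,)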

And don't forget to create the folders mentioned in common/constants.py, such as (a helper snippet follows this list):

  1. FQG/src/model/Factorized
  2. FQG/output
  3. FQG/output/checkpoint
  4. FQG/output/figure
  5. FQG/output/log
  6. FQG/output/pkl
  7. FQG/output/result
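A small helper to create them all (a sketch; adjust the FQG base path to your checkout and keep the directory names in sync with common/constants.py):

import os

BASE = "FQG"
for sub in ("src/model/Factorized", "output/checkpoint", "output/figure",
            "output/log", "output/pkl", "output/result"):
    os.makedirs(os.path.join(BASE, sub), exist_ok=True)  # also creates FQG/output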

Issue 3

If we have a CUDA device, I suggest using whichever CUDA channel is available, e.g.:
DEVICE = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")
or
DEVICE = torch.device("cuda:1" if torch.cuda.is_available() else "cpu")

But there are several common problems to solve, such as (a minimal illustration follows this list):

  1. model/encoder.py - in line 50
    emb = pack(sorted_input_emb, sorted_lengths.to('cpu'), batch_first=False)
    We must move the lengths tensor to the CPU.
  2. model/beam_searcher.py - in line 138
    self.attn.append(attnOut.index_select(0, father_beam_idx))
    This fails because father_beam_idx is computed with true division ('/'), which produces a float tensor; change '/' to '//' so it holds integer indices:
    father_beam_idx = best_output_accumulate_scores_id // vocab_plus_input_size
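A minimal standalone illustration of why the floor division in fix 2 matters (not the repo's code, just the dtype behaviour):

import torch

best_output_accumulate_scores_id = torch.tensor([7, 12, 23])
vocab_plus_input_size = 5
father_beam_idx = best_output_accumulate_scores_id // vocab_plus_input_size
print(father_beam_idx.dtype)  # torch.int64, valid for index_select()
# With '/', the result would be a float tensor and index_select() would raise an error.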

Issue 4

Because spaCy no longer supports the "en" shortcut, we must change it to "en_core_web_sm".
Don't forget to use this command line to download the model, as referenced here:
python -m spacy download en_core_web_sm
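In the code, the change looks like this (the exact call sites depend on where the repo loads spaCy):

import spacy

nlp = spacy.load("en_core_web_sm")  # instead of spacy.load("en")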

Issue 5

When, after installing allennlp==0.8.3,
from allennlp.modules.elmo import batch_to_ids
doesn't work and an error like this one appears:
TypeError: Highway.forward: return type <class 'torch.Tensor'> is not a <class 'NoneType'>.
the fix is to change the version of the overrides package, as in this reference:

pip uninstall overrides
pip install overrides==3.1.0

It works fine for me.
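A quick check that the import now works (batch_to_ids maps tokenized sentences to ELMo character ids of shape (batch, max_len, 50)):

from allennlp.modules.elmo import batch_to_ids

ids = batch_to_ids([["Hello", "world"]])
print(ids.shape)  # torch.Size([1, 2, 50])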

Issue 6

There are errors around line 225 of /util/nlp_utils.py, because GLOVE is now a gensim KeyedVectors model converted from GloVe to word2vec format (as mentioned in Issue 2). Change

if token in GLOVE.vocab:
    token_in_glove = token
elif token.lower() in GLOVE.vocab:
    token_in_glove = token.lower()

to be

if token in GLOVE.key_to_index:
    token_in_glove = token
elif token.lower() in GLOVE.key_to_index:
    token_in_glove = token.lower()
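If the code needs to run under both gensim 3.x and 4.x, a version-agnostic variant is possible (a sketch):

# gensim 3.x exposes .vocab; gensim 4.x replaced it with .key_to_index
keys = getattr(GLOVE, "key_to_index", None)
if keys is None:
    keys = GLOVE.vocab
if token in keys:
    token_in_glove = token
elif token.lower() in keys:
    token_in_glove = token.lower()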

Issue 7

I've got so many issues when using run_glue.py.
My best solution is not to take any risks with local_rank: because of it, I couldn't use CUDA on a single channel, so I deactivated all the statements that caused problems, as well as those loading the config, tokenizer, and model from pretrained data.

I recommend downloading the pretrained data from the internet: when we can't load the model offline, this is the best solution so far. We should use proper syntax like this:

MODEL_CLASSES = {
    'bert': (BertConfig, BertForSequenceClassification, BertTokenizer, BertModel, 'bert-base-uncased'),
    'xlnet': (XLNetConfig, XLNetForSequenceClassification, XLNetTokenizer, XLNetModel, 'xlnet-base-cased'),
    'xlm': (XLMConfig, XLMForSequenceClassification, XLMTokenizer, XLMModel, 'xlm-mlm-enfr-1024'),
}
....

config_class, model_sequence_class, tokenizer_class, model_class, model_name = MODEL_CLASSES[args.model_type]
...
config = config_class.from_pretrained(model_name)
tokenizer = tokenizer_class.from_pretrained(model_name)
model = model_sequence_class(config)
# or
model = model_class.from_pretrained(model_name)

But there were several errors during training with this,

so we must use code like the following.

Example code to resolve the issue:

from torch import nn
from pytorch_transformers import XLNetConfig, XLNetModel

# Load the configuration
config = XLNetConfig.from_pretrained("xlnet-base-cased")

# Check if config.n_token and config.d_model are positive integers
if config.n_token <= 0 or config.d_model <= 0:
    # Modify config.n_token and config.d_model to be positive integers
    config.n_token = 32000
    config.d_model = 768

# Create the XLNet model
model = XLNetModel(config)

After I was done with this syntax, there were no more conflicts; loading pretrained models offline sometimes produces several errors.

Issue 8

Because I can't find several metadata files, please give me proper metadata links that I can download.

Here are the links I can use so far:

  1. SQUAD dataset here
  2. download_glue_data here
  3. wiki datasets here

Issue 9

When I debug run_glue.py, several errors appear:

  1. There's an error about an unexpected keyword argument 'labels'; we must remove the labels entry, leaving this:
inputs = {'input_ids': batch[0],
          'attention_mask': batch[1],
          'token_type_ids': batch[2] if args.model_type in ['bert', 'xlnet'] else None}  # XLM doesn't use segment_ids
outputs = model(**inputs)
loss = outputs[0]  # model outputs are always tuple in pytorch-transformers (see doc)

The error message indicates that the model's forward method is being passed an unexpected keyword argument, labels, from the inputs dictionary: "forward() got an unexpected keyword argument 'labels'".

To solve this, make sure the forward method can handle all the values in the inputs dictionary: if labels is not a necessary argument, remove it from the inputs dictionary (along with the corresponding loss calculation) before calling the model; if it is necessary, add it to the signature of the forward method.

Also change the loss computation if we get an error like
RuntimeError: grad can be implicitly created only for scalar outputs
by changing the syntax to:
loss = outputs[0].mean()

The error message "grad can be implicitly created only for scalar outputs" means that the loss should be a scalar tensor, but it seems to be a tensor of shape (batch_size,), which is not a scalar tensor. You'll need to reduce the tensor to a scalar by taking the mean over the batch_size.

After that, if we meet this error:
AttributeError: 'tuple' object has no attribute 'detach'

we must change the code to:

outputs = model(**inputs)
logits = outputs[0]
preds = logits.detach().cpu().numpy()

The error message suggests that the output from the model is a tuple, but the .detach() method is only supported for tensors, not tuples. To fix this, you need to extract the tensor component from the tuple that you want to use for computing the preds.

Note that the specific index used for outputs[0] may vary depending on the structure of the model output, so you may need to modify this based on your specific model.


Expected behavior

I would expect the tests to pass.

Environment

  • OS: Ubuntu 18.04.3 LTS
  • Python version: 3.7
  • PyTorch version: 1.8.0 + CUDA using this pip installation:
    pip install torch==1.8.0+cu111 torchvision==0.9.0+cu111 torchaudio==0.8.0 -f https://download.pytorch.org/whl/torch_stable.html
  • PyTorch Transformers version (or branch): 2.0.0
  • Using GPU? Yes, channel 1 ("cuda:1")
  • Distributed or parallel setup? Yes, but I changed it to a single GPU
  • Any other relevant information: System is completely clean before pip installs

Additional context

Please let me know if there is any more information I can provide.

bug in advance method | beam_searcher.py

Change:
father_beam_idx = best_output_accumulate_scores_id / vocab_plus_input_size
To:
father_beam_idx = best_output_accumulate_scores_id // vocab_plus_input_size

This ensures the father_beam_idx tensor contains ints and thus valid index values. Otherwise the tensor holds float values, which causes an error in the index_select() method in Torch used in the same function.
