

UnsupervisedQA

Code, Data and models supporting the experiments in the ACL 2019 Paper: Unsupervised Question Answering by Cloze Translation.

Obtaining training data for Question Answering (QA) is time-consuming and resource-intensive, and existing QA datasets are only available for limited domains and languages. In this work, we take some of the first steps towards unsupervised QA, and develop an approach that, without using the SQuAD training data at all, achieves 56.4 F1 on SQuAD v1.1, and 64.5 F1 when the answer is a named entity mention.


This repository provides code to run pre-trained models to generate synthetic question answering training data. We also make available a very large synthetic training dataset for extractive question answering.

Dataset Downloads

We make available a dataset of 4 million SQuAD-like question answering datapoints, automatically generated by the unsupervised system described in the paper.

The data can be downloaded here. The data is in the SQuAD v1 format, and contains:

Fold                          # Paragraphs    # QA pairs
unsupervised_qa_train.json    782,556         3,915,498
unsupervised_qa_dev.json      1,000           4,795
unsupervised_qa_test.json     1,000           4,804

Fine-tuning BERT-Large for reading comprehension on this training data achieves over 50.0 F1 on the SQuAD v1.1 development set, given an appropriate early stopping strategy on the unsupervised_qa dev set.
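As a quick sanity check after downloading, the folds can be loaded and counted with a few lines of Python. This is a minimal sketch that assumes the standard SQuAD v1 JSON layout and that the three files sit in the working directory:

# count paragraphs and QA pairs in each downloaded fold (SQuAD v1 layout)
import json

def count_fold(path):
    with open(path) as f:
        articles = json.load(f)["data"]
    paragraphs = [p for article in articles for p in article["paragraphs"]]
    return len(paragraphs), sum(len(p["qas"]) for p in paragraphs)

for fold in ("unsupervised_qa_train.json",
             "unsupervised_qa_dev.json",
             "unsupervised_qa_test.json"):
    n_paragraphs, n_qas = count_fold(fold)
    print(f"{fold}: {n_paragraphs} paragraphs, {n_qas} QA pairs")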

Models and Code

In addition to the above data, this repository provides functionality to generate synthetic training data from user-provided documents.

Installation:

The code is built to run on top of UnsupervisedMT, and requires all of its dependencies. Additional requirements are spaCy (for NER and noun chunking), attrs, NLTK, and allennlp (for constituency parsing). It was developed to run on Ubuntu Linux 18.04 and Python 3.7, with CUDA 9.

(Optionally) Create a conda environment to keep things clean:

conda create -n uqa37 python=3.7 && conda activate uqa37

The recommended way to install is shown below, which should install and handle all dependencies:

# clone the repo
git clone https://github.com/facebookresearch/UnsupervisedQA.git
cd UnsupervisedQA

# install python dependencies:
pip install -r requirements.txt

# install UnsupervisedMT and its dependencies
./install_tools.sh

Models:

Four UNMT models are made available for download:

  • Sentence Cloze boundaries, Noun Phrase Answers
  • Sentence Cloze boundaries, Named Entity Answers
  • Sub-clause Cloze boundaries, Named Entity Answers
  • Sub-clause Cloze boundaries, Named Entity Answers, Wh Heuristics (best downstream performance)

The models can be downloaded using the script:

./download_models.sh

This will download all the models and unzip them to the appropriate directory. Each unzipped model is about 850MB, so the total space requirement is about 3.5GB.

Usage:

You can generate reading comprehension training data using unsupervisedqa.generate_synthetic_qa_data.

This script generates unsupervised question answering data using the identity, noisy cloze, or unsupervised NMT methods explored in the paper, and exposes several configuration options (e.g. whether to shorten clozes to subclauses, whether to use named entity answers, and whether to use the wh heuristic).

This script provides the following command line arguments:

usage: generate_synthetic_qa_data.py [-h] [--input_file_format {txt,jsonl}]
                                     [--output_file_formats OUTPUT_FILE_FORMATS]
                                     [--translation_method {identity,noisy_cloze,unmt}]
                                     [--use_subclause_clozes]
                                     [--use_named_entity_clozes]
                                     [--use_wh_heuristic]
                                     input_file output_file

Generate synthetic training data for extractive QA tasks without supervision

positional arguments:
  input_file            input file, see readme for formatting info
  output_file           Path to write generated data to, see readme for
                        formatting info

optional arguments:
  -h, --help            show this help message and exit
  --input_file_format {txt,jsonl}
                        input file format, see readme for more info, default
                        is txt
  --output_file_formats OUTPUT_FILE_FORMATS
                        comma-separated list of output file formats, from
                        [jsonl, squad], an output file will be created for
                        each format. Default is 'jsonl,squad'
  --translation_method {identity,noisy_cloze,unmt}
                        define the method to generate clozes -- either the
                        Unsupervised NMT method (unmt), or the identity or
                        noisy cloze baseline methods. UNMT is recommended for
                        downstream performance, but the noisy_cloze is
                        relatively strong on downstream QA and fast to
                        generate. Default is unmt
  --use_subclause_clozes
                        pass this flag to shorten clozes with constituency
                        parsing instead of using sentence boundaries
                        (recommended for downstream performance)
  --use_named_entity_clozes
                        pass this flag to use named entity answer prior
                        instead of noun phrases (recommended for downstream
                        performance)
  --use_wh_heuristic    pass this flag to use the wh-word heuristic
                        (recommended for downstream performance). Only
                        compatible with named entity clozes

The input format is specified by the --input_file_format argument. The input can be either a .txt file of paragraphs, one per line, from which questions and answers will be generated, or a .jsonl file in which each line is a JSON-serialised dict of the form {"text": text of paragraph, "paragraph_id": your unique identifier for the paragraph}.
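For the .jsonl route, a minimal sketch that writes a valid input file is below (the paragraph texts are placeholders, and my_input.jsonl is an arbitrary filename):

# write one JSON object per line with the two required keys
import json
import uuid

paragraphs = [
    "First paragraph of raw text to generate questions from.",
    "Second paragraph of raw text.",
]

with open("my_input.jsonl", "w") as f:
    for text in paragraphs:
        record = {"text": text, "paragraph_id": str(uuid.uuid4())}
        f.write(json.dumps(record) + "\n")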

The output format is specified with the --output_file_formats argument; the user can choose between the jsonl and squad formats (or both). Requesting the squad format will output a file in the SQuAD v1.1 format, ready to be plugged into downstream extractive QA pipelines. The jsonl format provides more metadata than the squad format; its fields are explained below:

{
    "cloze_id": unique identifier for this datapoint
    "paragraph": data on the paragraph this datapoint was generated from
    "source_text": the text from the paragraph the cloze was generated from
    "source_start": character index in paragraph where "source_text" starts
    "cloze_text": the text of the cloze question the question is generated from
    "answer_text": the answer text of the (cloze) question
    "answer_start": the character index that the answer starts at in the paragraph
    "constituency_parse": the constituency parse of the "source_text" if available, otherwise null,
    "root_label": the node label of the root of the constituency parse if available, otherwise null,
    "answer_type": The named entity label of the answer (if using named entity clozes) otherwise "NOUNPHRASE"
    "question_text": the text of the natural question, translated from "cloze_text"
}
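
A minimal sketch for consuming the jsonl output, assuming the fields above (the filename is a placeholder; point it at the jsonl file the script actually writes):

# read one generated datapoint per line and print the QA pairs
import json

with open("example_output.jsonl") as f:  # adjust to the real output path
    datapoints = [json.loads(line) for line in f]

for d in datapoints:
    print(f"{d['question_text']} -> {d['answer_text']} ({d['answer_type']})")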

A working example to produce unsupervised NMT-translated questions, using the model trained with wh heuristics, named entity answers, and subclause shortening, is below:

python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_formats "jsonl,squad" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic 

I'm running out of GPU memory

The repository requires a CUDA-enabled GPU (this is a requirement of UnsupervisedMT), but you can reduce the amount of GPU memory required by adjusting the batch sizes. This can be done by modifying the unsupervisedqa/configs.py file, adjusting CONSTITUENCY_BATCH_SIZE and UNMT_BATCH_SIZE.
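
For example, a lower-memory configuration might look like the excerpt below; the two variable names come from this README, while the values are purely illustrative and should be tuned to your GPU:

# unsupervisedqa/configs.py (excerpt, illustrative values)
CONSTITUENCY_BATCH_SIZE = 16  # batch size for the constituency parser
UNMT_BATCH_SIZE = 8           # batch size for UNMT question translation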

Training your own question translation models

This repository only provides functionality to run the pre-trained unsupervised question translation models from the paper. Users who want to train new question translation models should use the training functionality in UnsupervisedMT, or consider the newer and more powerful XLM repository.

To train question translation models in UnsupervisedMT, first prepare a large corpus of cloze questions (potentially using the functionality in this repository) and a large corpus of natural questions. Preprocess these corpora by adapting UnsupervisedMT/NMT/get_data_enfr.sh, and train using the example script in UnsupervisedMT/README, with appropriate edits to the args (e.g. en->cloze and fr->question) and paths.
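
As a starting point for the cloze side of such a corpus, the jsonl output of this repository can be flattened to one cloze per line, the usual shape for monolingual MT corpora. A minimal sketch (paths are placeholders):

# dump cloze questions, one per line, for UNMT preprocessing
import json

with open("generated.jsonl") as f_in, open("corpus.cloze.txt", "w") as f_out:
    for line in f_in:
        cloze = json.loads(line)["cloze_text"].replace("\n", " ").strip()
        f_out.write(cloze + "\n")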

References

Please cite [1] and [2] if you found the resources in this repository useful.

Unsupervised Question Answering by Cloze Translation

[1] P. Lewis, L. Denoyer, S. Riedel. Unsupervised Question Answering by Cloze Translation.

@inproceedings{lewis2019unsupervisedqa,
  title={Unsupervised Question Answering by Cloze Translation},
  author={Lewis, Patrick and Denoyer, Ludovic and Riedel, Sebastian},
  booktitle={Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2019}
}

Phrase-Based & Neural Unsupervised Machine Translation

[2] G. Lample, M. Ott, A. Conneau, L. Denoyer, M.-A. Ranzato. Phrase-Based & Neural Unsupervised Machine Translation.

@inproceedings{lample2018phrase,
  title={Phrase-Based \& Neural Unsupervised Machine Translation},
  author={Lample, Guillaume and Ott, Myle and Conneau, Alexis and Denoyer, Ludovic and Ranzato, Marc'Aurelio},
  booktitle={Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP)},
  year={2018}
}

License

See the LICENSE file for more details.

Troubleshooting

If you run into problems installing dependencies (particularly allennlp), installing libffi may help:

apt-get install libffi6 libffi-dev


unsupervisedqa's Issues

fastBPE errors when I compile it

Hello,
The fastBPE compilation instructions have changed, from 'g++ -std=c++11 -pthread -O3 fast.cc -o fast' to 'g++ -std=c++11 -pthread -O3 fastBPE/main.cc -IfastBPE -o fast', so the original install instructions will fail.

Output memory map failed : 22.

Traceback (most recent call last):
  File "/home/askaydevs/.conda/envs/reports/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/askaydevs/.conda/envs/reports/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/generate_synthetic_qa_data.py", line 165, in <module>
    generate_synthetic_training_data(args)
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/generate_synthetic_qa_data.py", line 113, in generate_synthetic_training_data
    args.translation_method
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/generate_synthetic_qa_data.py", line 75, in get_questions_for_clozes
    clozes, subclause_clozes, ne_answers, wh_heuristic)
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/unmt_translation.py", line 316, in get_unmt_questions_for_clozes
    translation_input_path = preprocessing(clozes, tempdir, vocab_path, bpe_codes_path, wh_heuristic)
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/unmt_translation.py", line 70, in preprocessing
    _apply_bpe(tok_cloze_file, bpe_cloze_file, bpe_codes_path, vocab_path)
  File "/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/unmt_translation.py", line 39, in _apply_bpe
    subprocess.check_call(cmd, shell=True)
  File "/home/askaydevs/.conda/envs/reports/lib/python3.7/subprocess.py", line 347, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '/home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/../UnsupervisedMT/NMT/tools/fastBPE/fast applybpe /tmp/tmp9gk9zgfa/dev.cloze.tok.bpe /tmp/tmp9gk9zgfa/dev.cloze.tok /home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/../data/subclause_ne_wh_heuristic/bpe_codes /home/askaydevs/Public/natural_language_processing/solverminds/local_ec2_uqa37/UnsupervisedQA/unsupervisedqa/../data/subclause_ne_wh_heuristic/vocab.cloze-question.60000' returned non-zero exit status 1.
[INFO/MainProcess] process shutting down

Request help

When I run the sample code, the generated "question_text" is a bunch of garbled characters.

question_text='Aires Aires scoreline 璟kindergarscoreline Gemeinscoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline 灘scoreline headers bbs scoreline headers 灘scoreline 灘scoreline 璟bim溾 brighter headers bbs Persons Persons Persons Persons Aires scoreline headers neighbourAires 溾 electric erian neighbourAires Persons Persons Persons neighbourAires Persons Persons Persons Persons Aires Persons Persons Persons Persons Persons neighbourAires Persons neighbourAires brighter brighter Persons Persons Persons Persons Persons Persons neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires Persons Persons Persons Persons neighbourAires neighbourAires neighbourAires neighbourAires neighbourAires brighter neighbourAires neighbourAires erian erian erian erian erian erian erian erian erian erian erian neighbourAires headers Painting Mocscoreline Removal headers Painting Persons Painting gramPersons Persons Persons Persons Persons Persons TomneighbourAires erian neighbourAires erian erian erian erian erian erian erian erian erian erian neighbourAires 溾 Painting neighbourAires headers neighbourAires headers neighbourAires headers Removal headers neighbourAires brighter ayo Painting butter erian erian erian erian erian erian erian erian erian erian erian erian erian erian erian erian'

ModuleNotFoundError: No module named 'src'

Hi, when I run:
python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output --input_file_format "txt" --output_file_format "jsonl,squad" --translation_method unmt --use_named_entity_clozes --use_subclause_clozes --use_wh_heuristic

I face this issue:
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-1.10.2/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ma-user/anaconda3/envs/PyTorch-1.10.2/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ma-user/work/UnsupervisedQA-main/unsupervisedqa/generate_synthetic_qa_data.py", line 20, in <module>
    from .unmt_translation import get_unmt_questions_for_clozes
  File "/home/ma-user/work/UnsupervisedQA-main/unsupervisedqa/unmt_translation.py", line 21, in <module>
    from src.data.loader import check_all_data_params, load_data
ModuleNotFoundError: No module named 'src'

In unmt_translation.py, where does this "src" come from?

from src.utils import restore_segmentation
from src.model import check_mt_model_params, build_mt_model
from src.trainer import TrainerMT
from src.evaluator import EvaluatorMT

ModuleNotFoundError: No module named 'sklearn.utils.linear_assignment_'

I correctly installed all the requirements following the readme, but I hit this error when I ran the program:

Traceback (most recent call last):
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/zms/Documents/Study/NLP/UnsupervisedQA/unsupervisedqa/generate_synthetic_qa_data.py", line 19, in <module>
    from .constituency_parsing import get_constituency_parsed_clozes, shorten_cloze
  File "/Users/zms/Documents/Study/NLP/UnsupervisedQA/unsupervisedqa/constituency_parsing.py", line 12, in <module>
    from allennlp.models.archival import load_archive
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/site-packages/allennlp/models/__init__.py", line 8, in <module>
    from allennlp.models.biattentive_classification_network import BiattentiveClassificationNetwork
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/site-packages/allennlp/models/biattentive_classification_network.py", line 16, in <module>
    from allennlp.training.metrics import CategoricalAccuracy
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/site-packages/allennlp/training/metrics/__init__.py", line 11, in <module>
    from allennlp.training.metrics.conll_coref_scores import ConllCorefScores
  File "/Users/zms/anaconda3/envs/uqa37/lib/python3.7/site-packages/allennlp/training/metrics/conll_coref_scores.py", line 5, in <module>
    from sklearn.utils.linear_assignment_ import linear_assignment
ModuleNotFoundError: No module named 'sklearn.utils.linear_assignment_'
[INFO/MainProcess] process shutting down

missing Dataset Info

Thank you very much for making the generated dataset publicly available.

I wondered if you could also mention which method was used to generate this data (i.e., cloze answer type, cloze boundary type, translation method, with/without wh-heuristics, with/without XLM pre-training), similar to Table 6 in the paper.

While installing the requirements.txt file

While installing requirements.txt I am facing the issue below.

I am following the steps mentioned in the repository: creating a conda environment and then proceeding as described.

ERROR: Command errored out with exit status 1:
command: /home/xxxx/anaconda3/envs/uqa37/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-glq8b46f/cffi/setup.py'"'"'; __file__='"'"'/tmp/pip-install-glq8b46f/cffi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-0554_ssy/install-record.txt --single-version-externally-managed --compile
cwd: /tmp/pip-install-glq8b46f/cffi/
Complete output (55 lines):
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
Package libffi was not found in the pkg-config search path.
Perhaps you should add the directory containing `libffi.pc'
to the PKG_CONFIG_PATH environment variable
No package 'libffi' found
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/cffi
copying cffi/__init__.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/model.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/cparser.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/lock.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/error.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/vengine_cpy.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/recompiler.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/backend_ctypes.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/api.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/commontypes.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/ffiplatform.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/setuptools_ext.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/cffi_opcode.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/verifier.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/vengine_gen.py -> build/lib.linux-x86_64-3.7/cffi
copying cffi/_cffi_include.h -> build/lib.linux-x86_64-3.7/cffi
copying cffi/parse_c_type.h -> build/lib.linux-x86_64-3.7/cffi
copying cffi/_embedding.h -> build/lib.linux-x86_64-3.7/cffi
copying cffi/_cffi_errors.h -> build/lib.linux-x86_64-3.7/cffi
running build_ext
building '_cffi_backend' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/c
gcc -pthread -B /home/xxxx/anaconda3/envs/uqa37/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -DUSE__THREAD -DHAVE_SYNC_SYNCHRONIZE -I/usr/include/ffi -I/usr/include/libffi -I/home/xxxx/anaconda3/envs/uqa37/include/python3.7m -c c/_cffi_backend.c -o build/temp.linux-x86_64-3.7/c/_cffi_backend.o
c/_cffi_backend.c:15:10: fatal error: ffi.h: No such file or directory
#include <ffi.h>
^~~~~~~
compilation terminated.
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command errored out with exit status 1: /home/xxx/anaconda3/envs/uqa37/bin/python -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-glq8b46f/cffi/setup.py'"'"'; __file__='"'"'/tmp/pip-install-glq8b46f/cffi/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /tmp/pip-record-0554_ssy/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

AssertionError: mismatch between number of clozes and translations

I am failing to understand why I am encountering this issue with many paragraphs in my dataset. My dataset is a set of Wikipedia pages from a certain category.

Following is an example paragraph.
{"text": "As the flat spring moves, the force from the curved spring increases. This results in acceleration until the normally-closed contacts are hit. Just as in the downward direction, the switch is designed so that the curved spring is strong enough to move the contacts, even if the flat spring must flex, because the actuator does not move during the changeover. ==Applications==\nMicroswitches have two main areas of application:\nFirstly they are used when a low operating force with a clearly defined action is needed. Secondly they are used when long-term reliability is needed.", "paragraph_id": "ee63d5c3-38f9-44c6-a3e7-dcccfeacfbb1"}

Following are the intermediate outputs for clozes and translations.

Cloze Size: 3
[Cloze(cloze_id='18e074b67a5d4424e328bd68b4d5ef32270df1e2_0', paragraph=Paragraph(paragraph_id='ee63d5c3-38f9-44c6-a3e7-dcccfeacfbb1', text='As the flat spring moves, the force from the curved spring increases. This results in acceleration until the normally-closed contacts are hit. Just as in the downward direction, the switch is designed so that the curved spring is strong enough to move the contacts, even if the flat spring must flex, because the actuator does not move during the changeover. ==Applications==\nMicroswitches have two main areas of application:\nFirstly they are used when a low operating force with a clearly defined action is needed. Secondly they are used when long-term reliability is needed.'), source_text='=Applications==\nMicroswitches have two main areas of application', source_start=360, cloze_text='=Applications==\nIDENTITYMASK have two main areas of application', answer_text='Microswitches', answer_start=16, constituency_parse=None, root_label=None, answer_type='ORG', question_text=None), Cloze(cloze_id='718e5f5bdc5cd0fe4c27ca36b4bb0d416d06e178_0', paragraph=Paragraph(paragraph_id='ee63d5c3-38f9-44c6-a3e7-dcccfeacfbb1', text='As the flat spring moves, the force from the curved spring increases. This results in acceleration until the normally-closed contacts are hit. Just as in the downward direction, the switch is designed so that the curved spring is strong enough to move the contacts, even if the flat spring must flex, because the actuator does not move during the changeover. ==Applications==\nMicroswitches have two main areas of application:\nFirstly they are used when a low operating force with a clearly defined action is needed. Secondly they are used when long-term reliability is needed.'), source_text='=Applications==\nMicroswitches have two main areas of application', source_start=360, cloze_text='=Applications==\nMicroswitches have NUMERICMASK main areas of application', answer_text='two', answer_start=35, constituency_parse=None, root_label=None, answer_type='CARDINAL', question_text=None), Cloze(cloze_id='7ae37543efcc853a4c1b53007fe826fb622afff5_0', paragraph=Paragraph(paragraph_id='ee63d5c3-38f9-44c6-a3e7-dcccfeacfbb1', text='As the flat spring moves, the force from the curved spring increases. This results in acceleration until the normally-closed contacts are hit. Just as in the downward direction, the switch is designed so that the curved spring is strong enough to move the contacts, even if the flat spring must flex, because the actuator does not move during the changeover. ==Applications==\nMicroswitches have two main areas of application:\nFirstly they are used when a low operating force with a clearly defined action is needed. Secondly they are used when long-term reliability is needed.'), source_text='Firstly they are used when a low operating force with a clearly defined action is needed', source_start=426, cloze_text='NUMERICMASK they are used when a low operating force with a clearly defined action is needed', answer_text='Firstly', answer_start=0, constituency_parse=None, root_label=None, answer_type='ORDINAL', question_text=None)]

Translation Size: 5
['Who = Applications = ?', 'When you have two main areas of application ?', 'How much = Applications = ?', 'What are the main Microswitches areas of application ?', 'How many times are they used when a low operating force with a clearly defined action is needed ?']

Could someone please explain what causes the different numbers of clozes and translations, and how this can be fixed?

I am using the following parameters:
python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "jsonl" \
    --output_file_format "jsonl" \
    --translation_method unmt \
    --use_named_entity_clozes \
    --use_subclause_clozes \
    --use_wh_heuristic

Error when changing the "example_input.txt"

I am getting this error when I give my own data in the "example_input.txt" file:

Your label namespace was 'pos'. We recommend you use a namespace ending with 'labels' or 'tags', so we don't add UNK and PAD tokens by default to your vocabulary. See documentation for non_padded_namespaces parameter in Vocabulary.

Question about the model settings

Hello, thanks for sharing your impressive work and dataset!
I fine-tuned a BERT-base model with the dataset you provided and got 30.70 EM and 38.99 F1 on the SQuAD dev set. According to Table 2 in the paper, this is close to the NE-Sentence-Noisy Cloze setting, which does not seem to be the best one. Was the released data generated with this setting? Or did I miss something?
Thanks.

Model Query

How can we create a model specific to our own data?

Codes for rule-based question generation described in the paper

Hi @patrick-s-h-lewis, thanks for the nice paper and open-sourcing the code. It is really helpful.

I see that you've mentioned in the paper about rule-based systems (Heilman & Smith 2010) and there is an off-the-shelf tool for English. Do you have any plan to release the code to run it, or would you give a pointer to the off-the-shelf tool that you used? I'd really appreciate it!

Finding issues while giving input txt file

When giving several paragraphs as an input text file for QA generation, only one line is detected and just one question is generated.
What is the correct format for a text input file?

TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`

When I ran this command:
python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output --input_file_format "txt" --output_file_format "jsonl,squad" --translation_method unmt

I've received this error message:
Traceback (most recent call last):
  File "/anaconda/envs/uqa37/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/anaconda/envs/uqa37/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/azureuser/UnsupervisedQA/unsupervisedqa/generate_synthetic_qa_data.py", line 19, in <module>
    from .constituency_parsing import get_constituency_parsed_clozes, shorten_cloze
  File "/home/azureuser/UnsupervisedQA/unsupervisedqa/constituency_parsing.py", line 12, in <module>
    from allennlp.models.archival import load_archive
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/models/__init__.py", line 6, in <module>
    from allennlp.models.model import Model
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/models/model.py", line 16, in <module>
    from allennlp.data import Instance, Vocabulary
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/__init__.py", line 1, in <module>
    from allennlp.data.dataset_readers.dataset_reader import DatasetReader
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/dataset_readers/__init__.py", line 10, in <module>
    from allennlp.data.dataset_readers.ccgbank import CcgBankDatasetReader
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/dataset_readers/ccgbank.py", line 9, in <module>
    from allennlp.data.dataset_readers.dataset_reader import DatasetReader
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/dataset_readers/dataset_reader.py", line 4, in <module>
    from allennlp.data.instance import Instance
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/instance.py", line 3, in <module>
    from allennlp.data.fields.field import DataArray, Field
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/fields/__init__.py", line 7, in <module>
    from allennlp.data.fields.array_field import ArrayField
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 10, in <module>
    class ArrayField(Field[numpy.ndarray]):
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/allennlp/data/fields/array_field.py", line 42, in ArrayField
    @overrides
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/overrides/overrides.py", line 88, in overrides
    return _overrides(method, check_signature, check_at_runtime)
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/overrides/overrides.py", line 114, in _overrides
    _validate_method(method, super_class, check_signature)
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/overrides/overrides.py", line 135, in _validate_method
    ensure_signature_is_compatible(super_method, method, is_static)
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/overrides/signature.py", line 93, in ensure_signature_is_compatible
    ensure_return_type_compatibility(super_type_hints, sub_type_hints, method_name)
  File "/anaconda/envs/uqa37/lib/python3.7/site-packages/overrides/signature.py", line 288, in ensure_return_type_compatibility
    f"{method_name}: return type {sub_return} is not a {super_return}."
TypeError: ArrayField.empty_field: return type `None` is not a `<class 'allennlp.data.fields.field.Field'>`

Could you help me run this command, please?

Discrepancy between jsonl and squad outputs

Example jsonl output:

{"cloze_id": "86dbff215d73d9762cfe24d01a07fb6bdc27944b_0", "paragraph": {"paragraph_id": "59bb6ee9-320e-4532-9065-69679d580449", "text": "The process has been used to create silicon for thin-film solar cells and far-infrared photodetectors. Temperature and centrifuge spin rate are used to control layer growth. Centrifugal LPE has the capability to create dopant concentration gradients while the solution is held at constant temperature. Solid-phase epitaxy (SPE) is a transition between the amorphous and crystalline phases of a material. It is usually done by first depositing a film of amorphous material on a crystalline substrate. The substrate is then heated to crystallize the film. The single crystal substrate serves as a template for crystal growth. The annealing step used to recrystallize or heal silicon layers amorphized during ion implantation is also considered one type of Solid Phase Epitaxy. The Impurity segregation and redistribution at the growing crystal-amorphous layer interface during this process is used to incorporate low-solubility dopants in metals and Silicon."}, "source_text": "first depositing a film of amorphous material on a crystalline substrate", "source_start": 426, "cloze_text": "NUMERICMASK depositing a film of amorphous material on a crystalline substrate", "answer_text": "first", "answer_start": 0, "constituency_parse": null, "root_label": null, "answer_type": "ORDINAL", "question_text": "How much money can a film of depositorphamous material have on a crystalline substrate ?"}
{"cloze_id": "1aaf58187c460870247326230f81c232a2ce12ec_0", "paragraph": {"paragraph_id": "59bb6ee9-320e-4532-9065-69679d580449", "text": "The process has been used to create silicon for thin-film solar cells and far-infrared photodetectors. Temperature and centrifuge spin rate are used to control layer growth. Centrifugal LPE has the capability to create dopant concentration gradients while the solution is held at constant temperature. Solid-phase epitaxy (SPE) is a transition between the amorphous and crystalline phases of a material. It is usually done by first depositing a film of amorphous material on a crystalline substrate. The substrate is then heated to crystallize the film. The single crystal substrate serves as a template for crystal growth. The annealing step used to recrystallize or heal silicon layers amorphized during ion implantation is also considered one type of Solid Phase Epitaxy. The Impurity segregation and redistribution at the growing crystal-amorphous layer interface during this process is used to incorporate low-solubility dopants in metals and Silicon."}, "source_text": "one type of Solid Phase Epitaxy", "source_start": 742, "cloze_text": "NUMERICMASK type of Solid Phase Epitaxy", "answer_text": "one", "answer_start": 0, "constituency_parse": null, "root_label": null, "answer_type": "CARDINAL", "question_text": "How many type of Solid Phase Epitaxy ?"}
{"cloze_id": "22075909516185f297036cd82759d96df84cc444_0", "paragraph": {"paragraph_id": "59bb6ee9-320e-4532-9065-69679d580449", "text": "The process has been used to create silicon for thin-film solar cells and far-infrared photodetectors. Temperature and centrifuge spin rate are used to control layer growth. Centrifugal LPE has the capability to create dopant concentration gradients while the solution is held at constant temperature. Solid-phase epitaxy (SPE) is a transition between the amorphous and crystalline phases of a material. It is usually done by first depositing a film of amorphous material on a crystalline substrate. The substrate is then heated to crystallize the film. The single crystal substrate serves as a template for crystal growth. The annealing step used to recrystallize or heal silicon layers amorphized during ion implantation is also considered one type of Solid Phase Epitaxy. The Impurity segregation and redistribution at the growing crystal-amorphous layer interface during this process is used to incorporate low-solubility dopants in metals and Silicon."}, "source_text": "one type of Solid Phase Epitaxy", "source_start": 742, "cloze_text": "one type of IDENTITYMASK", "answer_text": "Solid Phase Epitaxy", "answer_start": 12, "constituency_parse": null, "root_label": null, "answer_type": "ORG", "question_text": "Who knows what type one ?"}

Example squad output:

{"title": "59bb6ee9-320e-4532-9065-69679d580449", "paragraphs": [{"context": "The process has been used to create silicon for thin-film solar cells and far-infrared photodetectors. Temperature and centrifuge spin rate are used to control layer growth. Centrifugal LPE has the capability to create dopant concentration gradients while the solution is held at constant temperature. Solid-phase epitaxy (SPE) is a transition between the amorphous and crystalline phases of a material. It is usually done by first depositing a film of amorphous material on a crystalline substrate. The substrate is then heated to crystallize the film. The single crystal substrate serves as a template for crystal growth. The annealing step used to recrystallize or heal silicon layers amorphized during ion implantation is also considered one type of Solid Phase Epitaxy. The Impurity segregation and redistribution at the growing crystal-amorphous layer interface during this process is used to incorporate low-solubility dopants in metals and Silicon.", "qas": [{"question": "How much money can a film of depositorphamous material have on a crystalline substrate ?", "id": "86dbff215d73d9762cfe24d01a07fb6bdc27944b_0", "answers": [{"text": "first", "answer_start": 0}]}, {"question": "How many type of Solid Phase Epitaxy ?", "id": "1aaf58187c460870247326230f81c232a2ce12ec_0", "answers": [{"text": "one", "answer_start": 0}]}, {"question": "Who knows what type one ?", "id": "22075909516185f297036cd82759d96df84cc444_0", "answers": [{"text": "Solid Phase Epitaxy", "answer_start": 12}]}]}]}

In jsonl "answer_start" refers to position of answer in "source_text" (that aligns perfectly). Here source text is just a part of the whole paragraph.

In squad, "context" refer to the whole paragraph but "answer_start" uses the same value as "answer_start" in jsonl, instead "answer_start" (in squad) should be "answer_start" (in jsonl) + "source_start" (in jsonl).

This discrepancy leads to errors while fine-tuning for downstream tasks.

Error is as follows:
WARNING - farm.data_handler.processor - Answer using start/end indices is 'has been used to cr' while gold label text is 'Solid Phase Epitaxy'.
Example will not be converted for training/evaluation.
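
Until the offsets are fixed upstream, a workaround along the lines the report suggests is to rebuild the squad-format offsets from the jsonl metadata. The sketch below assumes the squad output uses the standard {"data": [...]} wrapper and that the squad "id" matches the jsonl "cloze_id" (as in the examples above); the paths are placeholders:

# shift each squad answer_start into paragraph coordinates
import json

with open("example_output.jsonl") as f:
    meta = {d["cloze_id"]: d for d in map(json.loads, f)}

with open("example_output.squad.json") as f:
    squad = json.load(f)

for article in squad["data"]:
    for paragraph in article["paragraphs"]:
        for qa in paragraph["qas"]:
            m = meta[qa["id"]]
            for answer in qa["answers"]:
                answer["answer_start"] = m["answer_start"] + m["source_start"]

with open("example_output.squad.fixed.json", "w") as f:
    json.dump(squad, f)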

Error while finding module specification for 'unsupervisedqa.generate_synthetic_qa_data'

I am trying to run the code in Google's Colab environment; I think I've run all the scripts in readme.md.

However, when I run this command:
!python -m unsupervisedqa.generate_synthetic_qa_data example_input.txt example_output \
    --input_file_format "txt" \
    --output_file_format "jsonl,squad" \
    --translation_method unmt

I received this error:
/usr/local/bin/python: Error while finding module specification for 'unsupervisedqa.generate_synthetic_qa_data' (ModuleNotFoundError: No module named 'unsupervisedqa')

Could anyone help me, please?
