
Quiz Datasets for NLP

This repository maintains question answering (QA) datasets created from Japanese quiz (trivia) questions.

The datasets are used in baseline systems for the AI王 question answering competition, such as cl-tohoku/AIO3_BPR_baseline.

Some of the datasets are also available on the Hugging Face Hub.

Data source

Questions

  • abc_01-12
    • 17,735 questions used in the first (2003) through 12th (2012) abc/EQIDEN quiz competitions.
  • aio_01_dev
    • 1,992 questions used in the development set for the first AI王 competition.
  • aio_01_test
    • 2,000 questions used in the test set for the first AI王 competition.
  • aio_01_unused
    • 608 questions prepared but unused for the first AI王 competition.
  • aio_02_train
    • 22,335 questions distributed as the training set for the second AI王 competition (2021).
    • The questions are the same as the concatenation of abc_01-12, aio_01_dev, aio_01_test, and aio_01_unused.
  • aio_02_dev
    • 1,000 questions used in the development set for the second AI王 competition.

Passages

In the following sets of passages, each passage consists of consecutive sentences from Japanese Wikipedia as of 2022-04-04 and is no longer than 400 characters. The sets differ in the number of Wikipedia pages from which the sentences are extracted; a minimal sketch of the passage packing follows the list below.

  • jawiki-20220404-c400-small
    • 394,124 passages from 28,246 pages which have at least 500 incoming links within Wikipedia.
  • jawiki-20220404-c400-medium
    • 1,678,986 passages from 233,981 pages which have at least 100 incoming links within Wikipedia.
  • jawiki-20220404-c400-large
    • 4,288,198 passages from 903,024 pages which have at least 10 incoming links within Wikipedia.
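
The passage files themselves are built by the code in singletongue/wikipedia-utils, but the packing of consecutive sentences into passages of at most 400 characters can be illustrated with a minimal Python sketch (this is only an illustration of the idea, not the actual implementation; sentence splitting and page filtering are omitted):

def pack_sentences(sentences, max_chars=400):
    """Greedily pack consecutive sentences into passages of at most max_chars characters."""
    buffer, length = [], 0
    for sentence in sentences:
        # Start a new passage when adding the next sentence would exceed the limit.
        if buffer and length + len(sentence) > max_chars:
            yield "".join(buffer)
            buffer, length = [], 0
        buffer.append(sentence)
        length += len(sentence)
    if buffer:
        yield "".join(buffer)

# Hypothetical input: sentences extracted from a single Wikipedia page.
sentences = ["グラタンはフランス発祥の料理である。", "日本ではマカロニを使ったものが一般的である。"]
for passage in pack_sentences(sentences):
    print(len(passage), passage)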

Dataset formats

The QA datasets generated by make_dataset.py are in JSON Lines format. Specifically, each line in the dataset is a JSON object like the one below:

{
    "qid": "QA20CAPR-0010",
    "competition": "第1回AI王",
    "timestamp": "2019/12/25",
    "section": "開発データ問題 (dev1)",
    "number": "10",
    "original_question": "「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?",
    "original_answer": "グラタン",
    "original_additional_info": "",
    "question": "「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?",
    "answers": ["グラタン"],
    "passages": [
    {
        "passage_id": 271687,
        "title": "グラタン",
        "text": "グラタン(仏: gratin)は、フランスのドーフィネ地方が発祥の地といわれる郷土料理から発達した料理である。「オーブンなどで料理の表面を多少焦がすように調理する」という調理法、およびその調理法を用いて作られた料理の両方を意味する。この調理法を用いたものはすべてグラタンであり、デザート用に作られるものなどもある。主にマカロニがベースとして入ることが多く、後述のドリアとは一線を画している。 日本では、ベシャメルソースを用いオーブンで焼いた料理をして「グラタン」と呼んでいるが、フランス語では、本来鍋に張り付いたおこげという意味でもあり、転じて素材が何であれ焼いて焦げ目をつけた料理を意味する言葉である。"
    },
    {
        "passage_id": 2246907,
        "title": "お焦げ",
        "text": "中華料理、特に四川料理には鍋巴(グオパー、あるいは「中華おこげ」)(en)という料理がある。本来は鍋から掻き取ったお焦げをそのまま使っていた料理で、現代では米飯を乾燥させたものを使用している。揚げた鍋巴にニンジン、白菜、ピーマンなどの野菜や、海老、豚肉などの入ったあんをかけ、溶けて柔らかくなる前のサクサクとした歯ごたえと香ばしさを味わいつつ賞味する。とりわけあんをかける瞬間がこの料理の醍醐味であり、派手な音と立ち上る香りをアピールする為、料理店では客の目の前の卓上でパフォーマンスとして見せるのが定番である。鍋巴を乾燥させる方法は食材として市販されているものは天日で長時間乾燥させたものだが、家庭において米飯を薄く平らに広げたものをフライパンや電子レンジなどで乾燥させることでも代用できる。"
    },
        ...
    ],
    "positive_passage_indices": [0, 20],
    "negative_passage_indices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 97, 98, 99]
}

Each question is given a maximum of 100 Wikipedia passages retrieved by a full-text search engine.

The positive_passage_indices field gives the zero-based indices of the passages whose text contains any of the answers as a substring. In the above example, the 0th and 20th passages contain the answer "グラタン" in their texts. Note that a positive passage does not necessarily contain enough information to infer the answer to the question; passages are classified as positive purely by this string matching.

Similarly, the negative_passage_indices field gives the zero-based indices of the passages whose text does not contain any of the answers as a substring.
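
For example, this classification can be reproduced by reading a dataset file and re-checking the substring match (a minimal sketch; the file path below is just a placeholder for one of the downloaded dataset files):

import gzip
import json

# Placeholder path to one of the downloaded dataset files.
dataset_file = "datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz"

with gzip.open(dataset_file, "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Recompute the positive indices by simple substring matching,
        # mirroring how positive_passage_indices is defined above.
        positives = [
            i for i, passage in enumerate(example["passages"])
            if any(answer in passage["text"] for answer in example["answers"])
        ]
        assert positives == example["positive_passage_indices"]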

The datasets are also available in the format used by DPR, so that they are compatible with existing retrieval and reading comprehension models.

Downloads

Datasets

The dataset files are generated by make_dataset.py.

File format: gzipped JSON Lines (.jsonl.gz)

| Questions \ Passages | no_passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
| --- | --- | --- | --- | --- |
| abc_01-12 | Download (1.63 MB) | Download (581 MB) | Download (521 MB) | Download (480 MB) |
| aio_01_dev | Download (185 KB) | Download (65.1 MB) | Download (58.6 MB) | Download (53.6 MB) |
| aio_01_test | Download (189 KB) | Download (65.6 MB) | Download (59 MB) | Download (54.1 MB) |
| aio_01_unused | Download (58.3 KB) | Download (19.8 MB) | Download (17.6 MB) | Download (16 MB) |
| aio_02_train | Download (2.07 MB) | Download (735 MB) | Download (660 MB) | Download (607 MB) |
| aio_02_dev | Download (92.6 KB) | Download (32.9 MB) | Download (29.6 MB) | Download (27.3 MB) |

Passages

The passages files are generated by filter_passages.py.

File format: gzipped JSON Lines (.jsonl.gz)

Passages
jawiki-20220404-c400-small Download (116 MB)
jawiki-20220404-c400-medium Download (448 MB)
jawiki-20220404-c400-large Download (1.03 GB)

DPR-formatted datasets

Retriever input files

The format is the same as DPR's datasets for training retrievers (e.g., data.retriever.nq-train). Questions without any positive passages are excluded from these datasets. A sketch of reading these files follows the table below.

File format: gzipped JSON (.json.gz)

| Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
| --- | --- | --- | --- |
| abc_01-12 | Download (405 MB) | Download (425 MB) | Download (414 MB) |
| aio_01_dev | Download (49.6 MB) | Download (52.5 MB) | Download (50.9 MB) |
| aio_01_test | Download (48.6 MB) | Download (52 MB) | Download (51.3 MB) |
| aio_01_unused | Download (14.1 MB) | Download (15 MB) | Download (14.5 MB) |
| aio_02_train | Download (517 MB) | Download (544 MB) | Download (530 MB) |
| aio_02_dev | Download (23 MB) | Download (24.2 MB) | Download (23.5 MB) |
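
Since these files follow DPR's retriever training format, each one is a single gzipped JSON array whose records carry the question, the answers, and lists of positive and negative contexts. A minimal reading sketch, assuming a locally downloaded file (the path is a placeholder, and the exact set of context fields may differ from this sketch):

import gzip
import json

# Placeholder path to a downloaded retriever input file.
retriever_file = "dpr/retriever/jawiki-20220404-c400-small/aio_01_dev.json.gz"

with gzip.open(retriever_file, "rt", encoding="utf-8") as f:
    records = json.load(f)  # the whole file is one JSON array

first = records[0]
print(first["question"], first["answers"])
# Each context is expected to have at least a title and a text, as in DPR.
print(first["positive_ctxs"][0]["title"])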

Questions TSV files

The format is the same as DPR's datasets for validating retrievers (e.g., data.retriever.qas.nq-train). A reading sketch follows the table below.

File format: TSV (.tsv)

Questions
abc_01-12 Download (2.69 MB)
aio_01_dev Download (326 KB)
aio_01_test Download (334 KB)
aio_01_unused Download (104 KB)
aio_02_train Download (3.43 MB)
aio_02_dev Download (153 KB)
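
Each line in these TSV files pairs a question with its list of acceptable answers, as in DPR's qas files. A minimal reading sketch, assuming the answer column is a list literal as in DPR (the path is a placeholder):

import ast

# Placeholder path to a downloaded questions TSV file.
qas_file = "dpr/qas/aio_02_dev.tsv"

with open(qas_file, encoding="utf-8") as f:
    for line in f:
        question, answers = line.rstrip("\n").split("\t")
        answer_list = ast.literal_eval(answers)  # e.g. '["グラタン"]' -> ["グラタン"]
        print(question, answer_list)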

Passages TSV files

The format is the same as DPR's passages file (e.g., data.wikipedia_split.psgs_w100). A reading sketch follows the table below.

File format: gzipped TSV (.tsv.gz)

Passages
jawiki-20220404-c400-small Download (113 MB)
jawiki-20220404-c400-medium Download (433 MB)
jawiki-20220404-c400-large Download (1020 MB)
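
These files follow the layout of DPR's psgs_w100 passages file, i.e., tab-separated columns for the passage id, text, and title. A minimal streaming sketch, assuming a header row with those column names (the path is a placeholder):

import csv
import gzip

# Placeholder path to a downloaded passages TSV file.
passages_file = "dpr/wikipedia_split/jawiki-20220404-c400-small.tsv.gz"

with gzip.open(passages_file, "rt", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")  # expects "id", "text", "title" columns
    for row in reader:
        print(row["id"], row["title"], row["text"][:40])
        break  # print only the first passage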

Steps to generate datasets

Generate QA datasets without passages

$ mkdir ~/work/quiz-datasets/datasets/no_passages

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/no_passages/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_02_dev.jsonl.gz
# Total output questions: 1000

Generate QA datasets with passages for reading comprehension

Requirements

Before executing the scripts, you need to build an Elasticsearch index of Wikipedia passages by running the code in singletongue/wikipedia-utils.

Note: since Elasticsearch does not guarantee fully consistent search results across indices, the generated datasets may differ slightly from the distributed ones if you build your own Elasticsearch index.

jawiki-20220404-sentences-c400-small

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 12428
# Questions with no positive passage: 5307
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1521
# Questions with no positive passage: 471
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1485
# Questions with no positive passage: 515
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 431
# Questions with no positive passage: 177
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 704
# Questions with no positive passage: 296
# Total output questions: 1000

jawiki-20220404-sentences-c400-medium

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 14471
# Questions with no positive passage: 3264
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1785
# Questions with no positive passage: 207
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1760
# Questions with no positive passage: 240
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 514
# Questions with no positive passage: 94
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 824
# Questions with no positive passage: 176
# Total output questions: 1000

jawiki-20220404-sentences-c400-large

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 15283
# Questions with no positive passage: 2452
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1885
# Questions with no positive passage: 107
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1890
# Questions with no positive passage: 110
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 546
# Questions with no positive passage: 62
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 864
# Questions with no positive passage: 136
# Total output questions: 1000

Generate the filtered passages files

Before executing the scripts, you need to generate the Wikipedia passages files by running the code in singletongue/wikipedia-utils.

$ mkdir ~/work/quiz-datasets/passages

# jawiki-20220404-c400-small
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-small.jsonl.gz \
--min_inlinks 500 \
--exclude_sexual_pages
# The number of output page titles: 28246
# The number of output passages: 394124

# jawiki-20220404-c400-medium
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-medium.jsonl.gz \
--min_inlinks 100 \
--exclude_sexual_pages
# The number of output page titles: 233981
# The number of output passages: 1678986

# jawiki-20220404-c400-large
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-large.jsonl.gz \
--min_inlinks 10 \
--exclude_sexual_pages
# The number of output page titles: 903024
# The number of output passages: 4288198

Convert the datasets into the format of DPR Retriever input files

Note: specifying --skip_no_positive removes questions with no positive passages from the output file.

# jawiki-20220404-sentences-c400-small
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 12428
# The number of skipped questions: 5307
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 1521
# The number of skipped questions: 471
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 1485
# The number of skipped questions: 515
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 431
# The number of skipped questions: 177
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 15865
# The number of skipped questions: 6470
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 704
# The number of skipped questions: 296

# jawiki-20220404-sentences-c400-medium
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 14471
# The number of skipped questions: 3264
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 1785
# The number of skipped questions: 207
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 1760
# The number of skipped questions: 240
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 514
# The number of skipped questions: 94
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 18530
# The number of skipped questions: 3805
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 824
# The number of skipped questions: 176

# jawiki-20220404-sentences-c400-large
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 15283
# The number of skipped questions: 2452
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 1885
# The number of skipped questions: 107
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 1890
# The number of skipped questions: 110
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 546
# The number of skipped questions: 62
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 19604
# The number of skipped questions: 2731
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 864
# The number of skipped questions: 136

Convert the datasets into the format of DPR questions TSV files

$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/abc_01-12.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_dev.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_test.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_unused.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_02_train.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_02_dev.tsv

Convert the passages file into DPR passages TSV files

# jawiki-20220404-c400-small
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-small.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-small.tsv.gz

# jawiki-20220404-c400-medium
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-medium.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-medium.tsv.gz

# jawiki-20220404-c400-large
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-large.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-large.tsv.gz

License
