
Quiz Datasets for NLP

This repository maintains question answering (QA) datasets created from Japanese quiz (trivia) questions.

The datasets are used in baseline systems for the AI王 question answering competition, such as cl-tohoku/AIO3_BPR_baseline.

Some of the datasets are also available on the Hugging Face Hub.

Data source

Questions

  • abc_01-12
    • 17,735 questions used in the first (2003) through 12th (2012) abc/EQIDEN quiz competitions.
  • aio_01_dev
    • 1,992 questions used in the development set for the first AI王 competition.
  • aio_01_test
    • 2,000 questions used in the test set for the first AI王 competition.
  • aio_01_unused
    • 608 questions prepared but unused for the first AI王 competition.
  • aio_02_train
    • 22,335 questions distributed as the training set for the second AI王 competition (2021).
    • The questions are the same as the concatenation of abc_01-12, aio_01_dev, aio_01_test, and aio_01_unused.
  • aio_02_dev
    • 1,000 questions used in the development set for the second AI王 competition.

Passages

In the following sets of passages, each passage consists of consecutive sentences from Japanese Wikipedia as of 2022-04-04 and is no longer than 400 characters. The sets differ in the number of Wikipedia pages from which the sentences are extracted; a minimal sketch of the passage packing follows the list below.

  • jawiki-20220404-c400-small
    • 394,124 passages from 28,246 pages which have at least 500 incoming links within Wikipedia.
  • jawiki-20220404-c400-medium
    • 1,678,986 passages from 233,981 pages which have at least 100 incoming links within Wikipedia.
  • jawiki-20220404-c400-large
    • 4,288,198 passages from 903,024 pages which have at least 10 incoming links within Wikipedia.
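
The passage files themselves are built by the code in singletongue/wikipedia-utils, but the packing of consecutive sentences into passages of at most 400 characters can be illustrated with a minimal Python sketch (this is only an illustration of the idea, not the actual implementation; sentence splitting and page filtering are omitted):

def pack_sentences(sentences, max_chars=400):
    """Greedily pack consecutive sentences into passages of at most max_chars characters."""
    buffer, length = [], 0
    for sentence in sentences:
        # Start a new passage when adding the next sentence would exceed the limit.
        if buffer and length + len(sentence) > max_chars:
            yield "".join(buffer)
            buffer, length = [], 0
        buffer.append(sentence)
        length += len(sentence)
    if buffer:
        yield "".join(buffer)

# Hypothetical input: sentences extracted from a single Wikipedia page.
sentences = ["グラタンはフランス発祥の料理である。", "日本ではマカロニを使ったものが一般的である。"]
for passage in pack_sentences(sentences):
    print(len(passage), passage)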

Dataset formats

The QA datasets generated by make_dataset.py are in JSON Lines format. Specifically, each line in the dataset is a JSON object like the one below:

{
    "qid": "QA20CAPR-0010",
    "competition": "第1回AI王",
    "timestamp": "2019/12/25",
    "section": "開発データ問題 (dev1)",
    "number": "10",
    "original_question": "「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?",
    "original_answer": "グラタン",
    "original_additional_info": "",
    "question": "「鍋についたおこげ」という意味の言葉が語源であるとされる、日本ではマカロニを使ったものが一般的な西洋料理は何でしょう?",
    "answers": ["グラタン"],
    "passages": [
    {
        "passage_id": 271687,
        "title": "グラタン",
        "text": "グラタン(仏: gratin)は、フランスのドーフィネ地方が発祥の地といわれる郷土料理から発達した料理である。「オーブンなどで料理の表面を多少焦がすように調理する」という調理法、およびその調理法を用いて作られた料理の両方を意味する。この調理法を用いたものはすべてグラタンであり、デザート用に作られるものなどもある。主にマカロニがベースとして入ることが多く、後述のドリアとは一線を画している。 日本では、ベシャメルソースを用いオーブンで焼いた料理をして「グラタン」と呼んでいるが、フランス語では、本来鍋に張り付いたおこげという意味でもあり、転じて素材が何であれ焼いて焦げ目をつけた料理を意味する言葉である。"
    },
    {
        "passage_id": 2246907,
        "title": "お焦げ",
        "text": "中華料理、特に四川料理には鍋巴(グオパー、あるいは「中華おこげ」)(en)という料理がある。本来は鍋から掻き取ったお焦げをそのまま使っていた料理で、現代では米飯を乾燥させたものを使用している。揚げた鍋巴にニンジン、白菜、ピーマンなどの野菜や、海老、豚肉などの入ったあんをかけ、溶けて柔らかくなる前のサクサクとした歯ごたえと香ばしさを味わいつつ賞味する。とりわけあんをかける瞬間がこの料理の醍醐味であり、派手な音と立ち上る香りをアピールする為、料理店では客の目の前の卓上でパフォーマンスとして見せるのが定番である。鍋巴を乾燥させる方法は食材として市販されているものは天日で長時間乾燥させたものだが、家庭において米飯を薄く平らに広げたものをフライパンや電子レンジなどで乾燥させることでも代用できる。"
    },
        ...
    ],
    "positive_passage_indices": [0, 20],
    "negative_passage_indices": [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, ..., 97, 98, 99]
}

Each question is given a maximum of 100 Wikipedia passages retrieved by a full-text search engine.

The positive_passage_indices field gives the zero-based indices of the passages whose text contains any of the answers as a substring. In the above example, the 0th and 20th passages contain the answer "グラタン" in their texts. Note that a positive passage does not necessarily contain enough information to infer the answer to the question; passages are classified as positive purely by this string matching.

Similarly, the negative_passage_indices field gives the zero-based indices of the passages whose text does not contain any of the answers as a substring.
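
For example, this classification can be reproduced by reading a dataset file and re-checking the substring match (a minimal sketch; the file path below is just a placeholder for one of the downloaded dataset files):

import gzip
import json

# Placeholder path to one of the downloaded dataset files.
dataset_file = "datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz"

with gzip.open(dataset_file, "rt", encoding="utf-8") as f:
    for line in f:
        example = json.loads(line)
        # Recompute the positive indices by simple substring matching,
        # mirroring how positive_passage_indices is defined above.
        positives = [
            i for i, passage in enumerate(example["passages"])
            if any(answer in passage["text"] for answer in example["answers"])
        ]
        assert positives == example["positive_passage_indices"]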

The datasets are also available in the format used by DPR, so that they are compatible with existing retrieval and reading comprehension models.

Downloads

Datasets

The dataset files are generated by make_dataset.py.

File format: gzipped JSON Lines (.jsonl.gz)

| Questions \ Passages | no_passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
| --- | --- | --- | --- | --- |
| abc_01-12 | Download (1.63 MB) | Download (581 MB) | Download (521 MB) | Download (480 MB) |
| aio_01_dev | Download (185 KB) | Download (65.1 MB) | Download (58.6 MB) | Download (53.6 MB) |
| aio_01_test | Download (189 KB) | Download (65.6 MB) | Download (59 MB) | Download (54.1 MB) |
| aio_01_unused | Download (58.3 KB) | Download (19.8 MB) | Download (17.6 MB) | Download (16 MB) |
| aio_02_train | Download (2.07 MB) | Download (735 MB) | Download (660 MB) | Download (607 MB) |
| aio_02_dev | Download (92.6 KB) | Download (32.9 MB) | Download (29.6 MB) | Download (27.3 MB) |

Passages

The passages files are generated by filter_passages.py.

File format: gzipped JSON Lines (.jsonl.gz)

Passages
jawiki-20220404-c400-small Download (116 MB)
jawiki-20220404-c400-medium Download (448 MB)
jawiki-20220404-c400-large Download (1.03 GB)

DPR-formatted datasets

Retriever input files

The format is the same as DPR's datasets for training retrievers (e.g., data.retriever.nq-train). Questions without any positive passages are excluded from these datasets. A sketch of reading these files follows the table below.

File format: gzipped JSON (.json.gz)

| Questions \ Passages | jawiki-20220404-c400-small | jawiki-20220404-c400-medium | jawiki-20220404-c400-large |
| --- | --- | --- | --- |
| abc_01-12 | Download (405 MB) | Download (425 MB) | Download (414 MB) |
| aio_01_dev | Download (49.6 MB) | Download (52.5 MB) | Download (50.9 MB) |
| aio_01_test | Download (48.6 MB) | Download (52 MB) | Download (51.3 MB) |
| aio_01_unused | Download (14.1 MB) | Download (15 MB) | Download (14.5 MB) |
| aio_02_train | Download (517 MB) | Download (544 MB) | Download (530 MB) |
| aio_02_dev | Download (23 MB) | Download (24.2 MB) | Download (23.5 MB) |
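
Since these files follow DPR's retriever training format, each one is a single gzipped JSON array whose records carry the question, the answers, and lists of positive and negative contexts. A minimal reading sketch, assuming a locally downloaded file (the path is a placeholder, and the exact set of context fields may differ from this sketch):

import gzip
import json

# Placeholder path to a downloaded retriever input file.
retriever_file = "dpr/retriever/jawiki-20220404-c400-small/aio_01_dev.json.gz"

with gzip.open(retriever_file, "rt", encoding="utf-8") as f:
    records = json.load(f)  # the whole file is one JSON array

first = records[0]
print(first["question"], first["answers"])
# Each context is expected to have at least a title and a text, as in DPR.
print(first["positive_ctxs"][0]["title"])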

Questions TSV files

The format is the same as DPR's datasets for validating retrievers (e.g., data.retriever.qas.nq-train). A reading sketch follows the table below.

File format: TSV (.tsv)

Questions
abc_01-12 Download (2.69 MB)
aio_01_dev Download (326 KB)
aio_01_test Download (334 KB)
aio_01_unused Download (104 KB)
aio_02_train Download (3.43 MB)
aio_02_dev Download (153 KB)
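
Each line in these TSV files pairs a question with its list of acceptable answers, as in DPR's qas files. A minimal reading sketch, assuming the answer column is a list literal as in DPR (the path is a placeholder):

import ast

# Placeholder path to a downloaded questions TSV file.
qas_file = "dpr/qas/aio_02_dev.tsv"

with open(qas_file, encoding="utf-8") as f:
    for line in f:
        question, answers = line.rstrip("\n").split("\t")
        answer_list = ast.literal_eval(answers)  # e.g. '["グラタン"]' -> ["グラタン"]
        print(question, answer_list)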

Passages TSV files

The format is the same as DPR's passages file (e.g., data.wikipedia_split.psgs_w100). A reading sketch follows the table below.

File format: gzipped TSV (.tsv.gz)

Passages
jawiki-20220404-c400-small Download (113 MB)
jawiki-20220404-c400-medium Download (433 MB)
jawiki-20220404-c400-large Download (1020 MB)
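
These files follow the layout of DPR's psgs_w100 passages file, i.e., tab-separated columns for the passage id, text, and title. A minimal streaming sketch, assuming a header row with those column names (the path is a placeholder):

import csv
import gzip

# Placeholder path to a downloaded passages TSV file.
passages_file = "dpr/wikipedia_split/jawiki-20220404-c400-small.tsv.gz"

with gzip.open(passages_file, "rt", encoding="utf-8", newline="") as f:
    reader = csv.DictReader(f, delimiter="\t")  # expects "id", "text", "title" columns
    for row in reader:
        print(row["id"], row["title"], row["text"][:40])
        break  # print only the first passage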

Steps to generate datasets

Generate QA datasets without passages

$ mkdir ~/work/quiz-datasets/datasets/no_passages

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/no_passages/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/no_passages/aio_02_dev.jsonl.gz
# Total output questions: 1000

Generate QA datasets with passages for reading comprehension

Requirements

Before executing the scripts, you need to build an Elasticsearch index of Wikipedia passages by running the code in singletongue/wikipedia-utils.

Note: since Elasticsearch does not guarantee fully consistent search results across indices, the generated datasets may differ slightly from the distributed ones if you build your own Elasticsearch index.

jawiki-20220404-sentences-c400-small

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 12428
# Questions with no positive passage: 5307
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1521
# Questions with no positive passage: 471
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1485
# Questions with no positive passage: 515
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 431
# Questions with no positive passage: 177
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 500 \
--exclude_sexual_pages
# Questions with at least one positive passage: 704
# Questions with no positive passage: 296
# Total output questions: 1000

jawiki-20220404-sentences-c400-medium

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 14471
# Questions with no positive passage: 3264
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1785
# Questions with no positive passage: 207
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1760
# Questions with no positive passage: 240
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 514
# Questions with no positive passage: 94
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 100 \
--exclude_sexual_pages
# Questions with at least one positive passage: 824
# Questions with no positive passage: 176
# Total output questions: 1000

jawiki-20220404-sentences-c400-large

$ mkdir ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large

$ python make_dataset.py \
--input_files data/abc/abc_01.txt data/abc/abc_02.txt data/abc/abc_03.txt data/abc/abc_04.txt data/abc/abc_05.txt data/abc/abc_06.txt data/abc/abc_07.txt data/abc/abc_08.txt data/abc/abc_09.txt data/abc/abc_10.txt data/abc/abc_11.txt data/abc/abc_12.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 15283
# Questions with no positive passage: 2452
# Total output questions: 17735

$ python make_dataset.py \
--input_files data/aio/aio_01_dev1.txt data/aio/aio_01_dev2.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1885
# Questions with no positive passage: 107
# Total output questions: 1992

$ python make_dataset.py \
--input_files data/aio/aio_01_test_lb.txt data/aio/aio_01_test_lc.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 1890
# Questions with no positive passage: 110
# Total output questions: 2000

$ python make_dataset.py \
--input_files data/aio/aio_01_unused.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 546
# Questions with no positive passage: 62
# Total output questions: 608

$ zcat ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
       ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
| gzip > ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_train.jsonl.gz

$ python make_dataset.py \
--input_files data/aio/aio_02_dev.txt \
--output_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_dev.jsonl.gz \
--num_passages_per_question 100 \
--es_index_name jawiki-20220404-c400 \
--min_inlinks 10 \
--exclude_sexual_pages
# Questions with at least one positive passage: 864
# Questions with no positive passage: 136
# Total output questions: 1000

Generate the filtered passages files

Before executing the scripts, you need to generate the Wikipedia passages files by running the code in singletongue/wikipedia-utils.

$ mkdir ~/work/quiz-datasets/passages

# jawiki-20220404-c400-small
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-small.jsonl.gz \
--min_inlinks 500 \
--exclude_sexual_pages
# The number of output page titles: 28246
# The number of output passages: 394124

# jawiki-20220404-c400-medium
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-medium.jsonl.gz \
--min_inlinks 100 \
--exclude_sexual_pages
# The number of output page titles: 233981
# The number of output passages: 1678986

# jawiki-20220404-c400-large
$ python filter_passages.py \
--passages_file ~/work/wikipedia-utils/20220404/passages-c400-jawiki-20220404.json.gz \
--page_ids_file ~/work/wikipedia-utils/20220404/page-ids-jawiki-20220404.json \
--output_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-large.jsonl.gz \
--min_inlinks 10 \
--exclude_sexual_pages
# The number of output page titles: 903024
# The number of output passages: 4288198

Convert the datasets into the format of DPR Retriever input files

Note: specifying --skip_no_positive removes questions with no positive passages from the output file.

# jawiki-20220404-sentences-c400-small
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 12428
# The number of skipped questions: 5307
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 1521
# The number of skipped questions: 471
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 1485
# The number of skipped questions: 515
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 431
# The number of skipped questions: 177
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 15865
# The number of skipped questions: 6470
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-small/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-small/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-small \
--skip_no_positive
# The number of output questions: 704
# The number of skipped questions: 296

# jawiki-20220404-sentences-c400-medium
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 14471
# The number of skipped questions: 3264
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 1785
# The number of skipped questions: 207
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 1760
# The number of skipped questions: 240
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 514
# The number of skipped questions: 94
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 18530
# The number of skipped questions: 3805
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-medium/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-medium/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-medium \
--skip_no_positive
# The number of output questions: 824
# The number of skipped questions: 176

# jawiki-20220404-sentences-c400-large
$ mkdir ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/abc_01-12.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 15283
# The number of skipped questions: 2452
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 1885
# The number of skipped questions: 107
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_test.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 1890
# The number of skipped questions: 110
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_01_unused.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 546
# The number of skipped questions: 62
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_02_train.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 19604
# The number of skipped questions: 2731
$ python convert_dataset_to_dpr_retriever_input_file.py \
--input_file ~/work/quiz-datasets/datasets/jawiki-20220404-c400-large/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/retriever/jawiki-20220404-c400-large/aio_02_dev.json.gz \
--dataset_label jawiki-20220404-sentences-c400-large \
--skip_no_positive
# The number of output questions: 864
# The number of skipped questions: 136

Convert the datasets into the format of DPR questions TSV files

$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/abc_01-12.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/abc_01-12.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_dev.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_test.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_test.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_01_unused.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_01_unused.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_02_train.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_02_train.tsv
$ python convert_dataset_to_dpr_qas_file.py \
--input_file ~/work/quiz-datasets/datasets/no_passages/aio_02_dev.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/qas/aio_02_dev.tsv

Convert the passages file into DPR passages TSV files

# jawiki-20220404-c400-small
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-small.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-small.tsv.gz

# jawiki-20220404-c400-medium
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-medium.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-medium.tsv.gz

# jawiki-20220404-c400-large
$ python convert_passages_to_dpr_format.py \
--passages_file ~/work/quiz-datasets/passages/jawiki-20220404-c400-large.jsonl.gz \
--output_file ~/work/quiz-datasets/dpr/wikipedia_split/jawiki-20220404-c400-large.tsv.gz

License
