chineseqa-with-bert's Introduction

EECS 496: Advanced Topics in Deep Learning
Final Project: Chinese Question Answering with BERT (Baidu DuReader Dataset)

Dataset

The DuReader dataset is a machine reading comprehension (MRC) dataset in Chinese, roughly equivalent to the popular Stanford Question Answering Dataset (SQuAD) in English. DuReader includes question types beyond those in SQuAD, namely "yes/no" and entity detection. We ignored those components and predicted only answer spans within the input paragraph, as regular MRC models do.

Download the Dataset

To download the DuReader dataset:

git clone git@github.com:baidu/DuReader.git
cd DuReader/data && bash download.sh

Preprocess the Dataset

We convert the DuReader dataset to the same format that BERT uses for the SQuAD dataset:

python3 src/preprocessing/dr_to_squad.py [path/to/dureader.processed.json]

Additional flags are documented in dr_to_squad.py, or can be listed by running python3 src/preprocessing/dr_to_squad.py --help.
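As a rough illustration of the target format, the sketch below builds a SQuAD v1.1-style JSON structure from (id, question, paragraph, answer) tuples. It is not the code in dr_to_squad.py: the function names, the input tuple layout, and the choice to skip answers that are not literal substrings of the paragraph are all assumptions made for this example.

```python
import json


def to_squad_example(qid, question, paragraph, answer_text):
    """Build one SQuAD v1.1-style article from a question, paragraph, and answer."""
    # SQuAD stores the answer as a character offset into the context.
    start = paragraph.find(answer_text)
    if start == -1:
        return None  # assumption: drop answers that are not literal spans
    return {
        "title": str(qid),
        "paragraphs": [{
            "context": paragraph,
            "qas": [{
                "id": str(qid),
                "question": question,
                "answers": [{"text": answer_text, "answer_start": start}],
            }],
        }],
    }


def to_squad_dataset(records):
    """Wrap all convertible records in the top-level SQuAD structure."""
    data = [ex for ex in (to_squad_example(*r) for r in records) if ex is not None]
    return {"version": "1.1", "data": data}


# Hypothetical records; real DuReader entries carry many more fields.
records = [
    (1, "北京是哪个国家的首都?", "北京是中国的首都。", "中国"),
    (2, "a question whose answer is not a span", "some paragraph", "missing"),
]
dataset = to_squad_dataset(records)
print(json.dumps(dataset, ensure_ascii=False, indent=2))
```

Only the first record survives, because the second answer does not occur in its paragraph. Restricting the data to literal spans matches the span-prediction setup described in the Dataset section.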

Bidirectional Encoder Representations from Transformers (BERT)

BERT is a method of pre-training Transformer models for a variety of NLP tasks, including QA-IR. It achieved state-of-the-art results on the SQuAD dataset, so we wanted to apply it to DuReader.

Train BERT-Chinese on the Dataset

First, install Hugging Face's PyTorch implementation of BERT (https://github.com/huggingface/pytorch-pretrained-BERT):

pip install pytorch-pretrained-bert

Using either our preprocessed training files or larger training sets produced with our preprocessing script, run the training script:

python run_dureader.py --bert_model bert-base-chinese --do_train --do_lower_case --train_file data/20000_search.train.json --train_batch_size 12 --gradient_accumulation_steps 3 --learning_rate 3e-5 --num_train_epochs 2.0 --max_seq_length 384 --doc_stride 128 --output_dir ../duoutput

These hyperparameters were tested on a GTX 1080 with 8 GB of memory. Generating features from the training file for the Chinese IR task is CPU-only and can take a long time with the current scripts: for a training set of 10,000 examples, it takes about 8-10 hours on our setup. Training itself takes 1-2 hours, depending on the hyperparameters. After training completes, use the following command to generate predictions for the 1,000-example preprocessed development set included in this repository:

python run_dureader.py --bert_model ../duoutput --do_predict --predict_file data/1000_search.dev.json --max_seq_length 384 --doc_stride 128 --output_dir ../duprediction
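The --max_seq_length and --doc_stride flags in the commands above control how documents longer than one input sequence are split into overlapping windows. The sketch below is modelled on the windowing used in the original BERT SQuAD scripts; whether run_dureader.py follows it exactly is an assumption, and the function name and the example token counts are hypothetical.

```python
def sliding_windows(num_doc_tokens, max_tokens_per_window, doc_stride):
    """Return (start, length) spans that cover a document with overlapping windows."""
    spans = []
    start = 0
    while start < num_doc_tokens:
        # Each window holds as many tokens as fit, up to the per-window budget.
        length = min(num_doc_tokens - start, max_tokens_per_window)
        spans.append((start, length))
        if start + length == num_doc_tokens:
            break  # the last window reaches the end of the document
        # Advance by the stride, so consecutive windows overlap.
        start += min(length, doc_stride)
    return spans


# Assumed example: a 20-token question leaves 384 - 20 - 3 = 361 tokens per
# window once [CLS] and two [SEP] tokens are reserved.
max_seq_length, question_tokens = 384, 20
budget = max_seq_length - question_tokens - 3
for start, length in sliding_windows(600, budget, doc_stride=128):
    print(f"window covers tokens {start}..{start + length - 1}")
```

A smaller stride produces more windows per document, which increases the number of features to generate and may be one reason feature generation is slow on long paragraphs.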

BLEU Scoring

BLEU is a metric for evaluating generated text by measuring its n-gram overlap with one or more reference texts. It correlates fairly well with human judgement and is better suited to free-form answers than plain accuracy, which gives no credit for partially correct answers. DuReader evaluates results using BLEU.
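To make the metric concrete, here is a minimal sentence-level BLEU sketch with clipped n-gram precision, uniform weights over 1- to 4-grams, and a brevity penalty. It is not the implementation in src/bleu/bert_bleu.py: tokenizing Chinese text by character, using a single reference, and applying no smoothing are assumptions made for this example.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(candidate, reference, max_n=4):
    """Sentence-level BLEU against one reference, without smoothing."""
    cand, ref = list(candidate), list(reference)  # assumption: character tokens
    if not cand:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngram_counts(cand, n)
        ref_counts = ngram_counts(ref, n)
        total = sum(cand_counts.values())
        # Clipping: an n-gram is credited at most as often as it occurs in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # any zero precision makes the geometric mean zero
        log_precision_sum += math.log(matched / total)
    # The brevity penalty discourages candidates shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))
    return brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "北京是中国的首都"
print(bleu("北京是中国的首都", reference))  # exact match
print(bleu("北京是中国首都", reference))    # one character missing
print(bleu("上海", reference))              # no overlap
```

Unlike exact-match accuracy, the second candidate still receives partial credit, since most of its n-grams appear in the reference.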

Compute the BLEU Score

The BERT model writes its predictions to predictions.json. Pass that file, together with the preprocessed file, to the scoring script:

python3 src/bleu/bert_bleu.py [path/to/predictions.json] [path/to/preprocessed.json]
