
Clinical Reading Comprehension (CliniRC)

Introduction

This repository provides the code for our analysis of the Clinical Reading Comprehension (CliniRC) task in the ACL 2020 paper: Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset

@inproceedings{yue2020CliniRC,
  title={Clinical Reading Comprehension: A Thorough Analysis of the emrQA Dataset},
  author={Xiang Yue and Bernal Jimenez Gutierrez and Huan Sun},
  booktitle={ACL},
  year={2020}
}

Set up

Run the following commands to clone the repository and install the requirements. The code requires Python 3.5 or higher, PyTorch 1.0 or higher, and TensorFlow 1.1 or higher; the remaining dependencies are listed in requirements.txt.

$ git clone https://github.com/xiangyue9607/CliniRC.git
$ cd CliniRC
$ pip install -r requirements.txt
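Note that the BERT fine-tuning code in this repo uses the TensorFlow 1.x API (see Known Issues at the bottom of this page), so make sure a 1.x release is installed. The exact versions below are suggestions, not tested pins:

$ pip install "torch>=1.0"
$ pip install "tensorflow-gpu==1.15.0"   # TF 1.x; the BERT code breaks under TF 2.x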

Preparing the emrQA Dataset

Our analysis is based on the recently released clinical QA dataset: emrQA [EMNLP'18]. Note that the emrQA dataset is generated from the n2c2 (previously "i2b2") datasets. We do not have the right to include either the emrQA dataset or the n2c2 datasets in this repo. Users need to first sign the n2c2 data use agreement and then follow the instructions in the emrQA repo to generate the emrQA dataset. After you generate the emrQA dataset, create the directory ./data/datasets and put the generated data.json into it.
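For example (the source path below is a placeholder for wherever the emrQA generation scripts wrote their output):

$ mkdir -p ./data/datasets
$ cp /path/to/emrQA/output/data.json ./data/datasets/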

Preprocessing

We provide a preprocessing script to clean up the generated emrQA dataset. Specifically, the script:

  1. Removes extra whitespace, punctuation, and newlines, and joins sentences into one paragraph;
  2. Reformulates the dataset into the "SQuAD" format;
  3. Randomly splits the dataset into train/dev/test sets (7:1:2).
$ python src/preprocessing.py \
--data_dir ./data/datasets \
--filename data.json \
--out_dir ./data/datasets

Note that there are 5 subsets in the emrQA dataset. We only use the Medication and Relation subsets, as (1) they make up 80% of the entire emrQA dataset and (2) their format is consistent with the span-extraction task, which is more challenging and more meaningful for clinical decision support.

After running the preprocessing.py script, you will obtain 6 JSON files in your output directory (i.e., train, dev, and test sets for the Medication and Relation datasets).
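Based on the file names the later commands expect, these should be:

medication-train.json  medication-dev.json  medication-test.json
relation-train.json    relation-dev.json    relation-test.json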

Sampling Subsets to Accelerate Training

As we demonstrate in the paper (Section 4.1), though there are more than 1 million questions in the emrQA dataset, many questions and their patterns are very similar, since they are generated from the same question templates. We show that we do not need that many questions to train a CliniRC system: using a sampled subset achieves roughly the same performance as training on the entire dataset.

To randomly sample questions from the original dataset, run:

$ python src/sample_dataset.py \
--data_dir ./data/datasets \
--filename medication-train \
--out_dir ./data/datasets \
--sample_ratio 0.2
$ python src/sample_dataset.py \
--data_dir ./data/datasets \
--filename relation-train \
--out_dir ./data/datasets \
--sample_ratio 0.05

--sample_ratio controls the fraction of questions sampled from each document.
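With the commands above, the sampled sets are written back to ./data/datasets with the ratio appended to the file name (e.g., relation-train-sampled-0.05.json, which the training commands below expect). Conceptually, per-document sampling over the SQuAD-format JSON looks like the following sketch; this is only an illustration of the idea, and src/sample_dataset.py is the actual implementation:

import json
import random

def sample_questions(squad, ratio, seed=42):
    """Keep roughly `ratio` of the questions under each document's paragraphs."""
    random.seed(seed)
    for doc in squad["data"]:
        for para in doc["paragraphs"]:
            qas = para["qas"]
            if not qas:
                continue
            k = max(1, int(len(qas) * ratio))  # keep at least one question
            para["qas"] = random.sample(qas, k)
    return squad

with open("./data/datasets/relation-train.json") as f:
    sampled = sample_questions(json.load(f), ratio=0.05)
with open("./data/datasets/relation-train-sampled-0.05.json", "w") as f:
    json.dump(sampled, f)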

Train and Test a QA model

In our paper, we compare several state-of-the-art QA models on the emrQA dataset. Here, we give two examples: BERT and DocReader. For the other QA models tested in the paper, refer to their GitHub repos for details.

BERT

  1. Download the pretrained BERT models (bert-base-cased, BioBERT-base-PubMed, and ClinicalBERT). Feel free to try other BERT variants:
$ chmod +x download_pretrained_models.sh; ./download_pretrained_models.sh
  2. Train (fine-tune) a BERT model on the emrQA Medication/Relation datasets. The training script is adapted from the BERT GitHub repo:
$ CUDA_VISIBLE_DEVICES=0 python ./BERT/run_squad.py \
    --vocab_file=./pretrained_bert_models/clinicalbert/vocab.txt \
    --bert_config_file=./pretrained_bert_models/clinicalbert/bert_config.json \
    --init_checkpoint=./pretrained_bert_models/clinicalbert/model.ckpt-100000 \
    --do_train=True \
    --train_file=./data/datasets/relation-train-sampled-0.05.json \
    --do_predict=True \
    --do_lower_case=False \
    --predict_file=./data/datasets/relation-dev.json \
    --train_batch_size=6 \
    --learning_rate=3e-5 \
    --num_train_epochs=4.0 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=./output/bert_models/clinicalbert_relation_0.05/
  3. Run inference on the test set:
$ python ./BERT/run_squad.py \
    --vocab_file=./pretrained_bert_models/clinicalbert/vocab.txt \
    --bert_config_file=./pretrained_bert_models/clinicalbert/bert_config.json \
    --init_checkpoint=./output/bert_models/clinical_relation_0.05_epoch51/model.ckpt-21878 \
    --do_train=False \
    --do_predict=True \
    --do_lower_case=False \
    --predict_file=./data/datasets/relation-test.json \
    --train_batch_size=6 \
    --learning_rate=3e-5 \
    --num_train_epochs=3.0 \
    --max_seq_length=384 \
    --doc_stride=128 \
    --output_dir=./output/bert_models/clinical_relation_0.05_epoch51_test/
  4. Evaluate the model. We adopt the official evaluation script from SQuAD v1.1:
$ python ./src/evaluate-v1.1.py ./data/datasets/medication-dev.json ./output/bert_models/bertbase_medication_0.2/predictions.json
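The script prints the exact match (EM) and F1 scores as a JSON object on stdout, in the form (values are placeholders):

{"exact_match": <EM>, "f1": <F1>}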

DocReader

We adopt the DocReader code from the DrQA GitHub repo.

  1. Set up:
$ git clone https://github.com/facebookresearch/DrQA.git
$ cd DrQA; python setup.py develop
  2. Download the pretrained GloVe embeddings and put them into data/embeddings. You can also run our script to automate this step:
$ chmod +x ../download_glove_embeddings.sh; ../download_glove_embeddings.sh
  3. Preprocess the train/dev files:
$ python scripts/reader/preprocess.py \
../data/datasets/ \
../data/datasets/ \
--split relation-train-sampled-0.05 \
--tokenizer spacy
$ python scripts/reader/preprocess.py \
../data/datasets/ \
../data/datasets/ \
--split relation-dev \
--tokenizer spacy
  4. Train the Reader:
$ python scripts/reader/train.py \
--embedding-file glove.840B.300d.txt \
--tune-partial 1000 \
--train-file relation-train-sampled-0.05-processed-spacy.txt \
--dev-file relation-dev-processed-spacy.txt \
--dev-json relation-dev.json \
--random-seed 20 \
--batch-size 16 \
--test-batch-size 16 \
--official-eval True \
--valid-metric exact_match \
--checkpoint True \
--model-dir ../output/drqa-models/relation \
--data-dir ../data/datasets \
--embed-dir ../data/embeddings \
--data-workers 0 \
--max-len 30 
  5. Run inference on the test set:
$ python scripts/reader/predict.py \
../data/datasets/relations-mimic-new-qs-ver3.json \
--model ../output/drqa-models/[YOUR MODEL NAME] \
--batch-size 16 \
--official \
--tokenizer spacy \
--out-dir ../output/drqa-models/ \
--embedding ../data/embeddings/glove.840B.300d.txt
  6. Evaluate the model. We adopt the official evaluation script from SQuAD v1.1:
$ cd ..
$ python ./src/evaluate-v1.1.py ./data/datasets/medication-dev.json ./output/drqa-models/predictions.json

Known Issues

AttributeError: module 'tensorflow._api.v2.train' has no attribute 'Optimizer'

This error occurs when running the BERT fine-tuning command from step 2 of the BERT section above.

Error:
Traceback (most recent call last):
File "./BERT/run_squad.py", line 28, in
import optimization
File "/content/CliniRC/BERT/optimization.py", line 88, in
class AdamWeightDecayOptimizer(tf.train.Optimizer):
AttributeError: module 'tensorflow._api.v2.train' has no attribute 'Optimizer'
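
The traceback indicates a TensorFlow 2.x installation: tf.train.Optimizer exists in the TF 1.x API but was removed in TF 2.x, and BERT/optimization.py is written against TF 1.x. Assuming no code changes, installing a 1.x release should resolve it (the exact version below is a suggestion; any TF 1.x release satisfying the >=1.1 requirement in Set up should work):

$ pip install "tensorflow-gpu==1.15.0"   # or tensorflow==1.15.0 for CPU-only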
