smitkiri / ehr-relation-extraction

NER and Relation Extraction from Electronic Health Records (EHR).

License: MIT License

Python 86.43% JavaScript 7.80% Makefile 1.11% HTML 4.66%

ehr biobert ner bilstm-crf adverse-drug-events n2c2 ehr-records named-entity-recognition relation-extraction bert-relation-extraction

ehr-relation-extraction's Introduction

NER and Relation Extraction from EHR

This repository includes code for Named Entity Recognition (NER) and Relation Extraction (RE) methods on Electronic Health Records (EHR). These methods were applied to the n2c2 2018 challenge dataset, which was augmented with a sample of the ADE Corpus dataset. This project serves as the capstone project for my Master's in Data Science degree at Northeastern University. A demo of this project can be accessed at ehr-info.ml. The website might not work if the GCP instance is turned off (it costs a lot of money, especially for a student).

How to Run
Introduction
Named Entity Recognition
Relation Extraction
End-to-End Results
Front-end and API deployment
References

How to Run

Using Makefile (for linux-based systems)

Edit the Makefile for any parameter changes that you want. All parameters are defined at the top of the file. Check expected parameter values in the next section.

Generate data: make generate-data
Train BioBERT for NER: make train-biobert-ner
Train BiLSTM + CRF for NER: make train-bilstm
Train BioBERT for RE: make train-biobert-re
Run API in development mode with debugging: make start-api-local
Run API in production mode: make start-api-gcp
Run the front-end: Edit the IP address for AJAX call in front-end/ehr.html and open the HTML file in a browser.

Using direct commands from terminal

To generate the preprocessed data required for model input
```
python generate_data.py \
    --task ner \
    --input_dir data/ \
    --ade_dir ade_corpus/ \
    --target_dir dataset/ \
    --max_seq_len 512 \
    --dev_split 0.1 \
    --tokenizer biobert-base \
    --ext txt \
    --sep " "
```
- The task parameter can be either ner or re for Named Entity Recognition and Relation Extraction tasks respectively.
- The input directory should have two folders named train and test in them. Each folder should have txt and ann files from the original dataset.
- ade_dir is an optional parameter. It should contain json files from the ADE Corpus dataset.
- The max_seq_len should not exceed 512 for BioBERT models.
- For BioBERT models, use biobert-base as the tokenizer value and for BiLSTM + CRF model, use scispacy_plus.
- Use txt for the ext (extension) parameter and " " as the sep (separator) parameter for NER, and the tsv extension and a tab as the separator for RE.

Instructions for running individual models can be found in their respective directories.

To run the API in development mode with debugging on, run the following command:
```
uvicorn fast_api:app --reload
```

To run the API in production mode with gunicorn, run the following command:

```
gunicorn -b 0.0.0.0:8000 -w 4 -k uvicorn.workers.UvicornWorker fast_api:app --timeout 120
```

To run the front-end, edit the IP address for AJAX call in front-end/ehr.html and open the HTML file in a browser.

Introduction

An Electronic Health Record (EHR) [1] is an electronic version of a patient's medical history that includes extremely important information including, but not limited to, problems, medications, progress notes, immunizations and laboratory reports. EHRs are large free-text documents written by healthcare professionals, such as clinical notes, discharge summaries or lab reports. Finding information in this data is time-consuming, since the data is unstructured and there may be multiple such records for a single patient. Natural Language Processing (NLP) techniques could be used to structure this data and quickly find information whenever needed, thereby saving healthcare professionals time on these mundane tasks.

In this project, we aim to build a tool that automatically structures this data into a format that enables doctors and patients to quickly find the information they need. Specifically, we aim to build a Named Entity Recognition (NER) model that recognizes entities such as drug, strength, duration, frequency, adverse drug event (ADE) [2], reason for taking the drug, route and form. Further, the model would also recognize the relationship between a drug and every other named entity. This would allow healthcare professionals to look not only at individual entities, but also at all the relationships between them. It would also allow doctors to easily find relationships between a drug and ADEs so that such drugs can be monitored carefully.

The final goal of this project is to build an API to which healthcare professionals and patients can send EHR data. The API would return character ranges for each annotation, so they can be highlighted in the original data, along with structured JSON data that separately labels medication history and discharge medications. The highlighted annotations could be useful when a healthcare professional wants to see important information along with other details in the EHR. The structured information can be stored for quick reference in the future. Because the EHR contains medication history as well as discharge medications, labelling them as such could help in merging new information, as the medication history would remain the same.

Named Entity Recognition (NER)

To identify named entities from the text, three different models were built. A rule-based model was built as a baseline along with two machine learning models.

NER Data Preprocessing

EHR documents are usually lengthy, and such large inputs are undesirable for machine learning models, especially for models like BERT that have an input size restriction of 512 tokens. So, a function was implemented that splits the EHR records based on a maximum sequence length parameter. The function tries to include the maximum number of tokens, maintaining as much context as possible for every token. The splitting points are decided based on the following criteria:

Includes as many paragraphs as possible within the maximum token limit, and splits at the end of the last paragraph found.
If the function cannot find a single complete paragraph, it splits on the last line (within the token limit) which marks the end of a sentence.
Otherwise, the function includes as many tokens as specified by the token limit, and then splits on the next index.
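
The three criteria above can be sketched as a small function. This is a minimal illustration and not the repository's implementation: the names `split_ranges` and `_last_boundary`, and the idea of passing precomputed paragraph and sentence boundary indices, are assumptions made for the example.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def _last_boundary(boundaries: Sequence[int], start: int, limit: int) -> Optional[int]:
    """Return the largest boundary b with start < b <= limit, or None."""
    i = bisect_right(boundaries, limit)
    if i > 0 and boundaries[i - 1] > start:
        return boundaries[i - 1]
    return None


def split_ranges(n_tokens: int,
                 paragraph_ends: Sequence[int],
                 sentence_ends: Sequence[int],
                 max_seq_len: int) -> List[Tuple[int, int]]:
    """Split n_tokens tokens into half-open (start, end) ranges of at most max_seq_len.

    paragraph_ends and sentence_ends are sorted token indices at which a
    paragraph (or a line that ends a sentence) finishes, i.e. valid end values.
    """
    ranges = []
    start = 0
    while start < n_tokens:
        limit = min(start + max_seq_len, n_tokens)
        if limit == n_tokens:
            end = n_tokens  # the remainder fits in one chunk
        else:
            # 1) prefer the last complete paragraph within the limit
            end = _last_boundary(paragraph_ends, start, limit)
            # 2) otherwise fall back to the last sentence-ending line
            if end is None:
                end = _last_boundary(sentence_ends, start, limit)
            # 3) otherwise cut hard at the token limit
            if end is None:
                end = limit
        ranges.append((start, end))
        start = end
    return ranges


print(split_ranges(20, [4, 9, 20], [2, 4, 7, 9, 13, 16, 20], max_seq_len=6))
# [(0, 4), (4, 9), (9, 13), (13, 16), (16, 20)]
```

In this toy input the first two chunks end on paragraph boundaries, the next two fall back to sentence boundaries, and the last chunk simply takes the remaining tokens.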

For the BiLSTM + CRF model, the data is tokenized using a modified ScispaCy tokenizer, which simply removes the whitespace tokens after ScispaCy tokenizes the text. For the BioBERT model, the BioBERT base tokenizer was used to tokenize the data. For both models, each sequence of labels was represented using the IOB2 (Inside, Outside, Beginning) tagging scheme.

Rule-based Model

To establish a baseline, a traditional dictionary and regular-expression based NER model was used. A regular expression was written to find the dosage entity, matching any number followed by "mg" or "mcg". For all other entities, the data was split into 80% train data and 20% test data. The train data was used to create a dictionary for each entity type, so that if the same text appears in the test data, it is classified as the corresponding entity.
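
A baseline of this kind might look roughly like the sketch below. The exact regular expression, the function names `build_dictionary` and `rule_based_ner`, and the sample sentence are illustrative assumptions, not taken from the repository.

```python
import re

# Any number (optionally with a decimal part) followed by "mg" or "mcg".
DOSAGE_RE = re.compile(r"\b\d+(?:\.\d+)?\s*(?:mg|mcg)\b", re.IGNORECASE)


def build_dictionary(annotations):
    """Map lower-cased entity text to its label, from (text, label) training pairs."""
    return {text.lower(): label for text, label in annotations}


def rule_based_ner(text, dictionary):
    """Return sorted (start, end, label) character spans found by regex and lookup."""
    spans = [(m.start(), m.end(), "Dosage") for m in DOSAGE_RE.finditer(text)]
    lowered = text.lower()
    for term, label in dictionary.items():
        # Whole-word match of every dictionary term seen in the training data.
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", lowered):
            spans.append((m.start(), m.end(), label))
    return sorted(spans)


train_annotations = [("Lisinopril", "Drug"), ("by mouth", "Route")]
dictionary = build_dictionary(train_annotations)
print(rule_based_ner("Started lisinopril 20 mg daily by mouth.", dictionary))
# [(8, 18, 'Drug'), (19, 24, 'Dosage'), (31, 39, 'Route')]
```

Because the lookup ignores context, any occurrence of a dictionary term is tagged, which is consistent with the high false positive rate reported for this baseline.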

BiLSTM + CRF for NER

A BiLSTM network alone is enough to classify each token into an entity type along with its position class (B: beginning or I: inside), or to mark it as not part of any entity we are looking for (O: outside), but we observed some common misclassification errors. Because the outputs of the BiLSTM for each word are label scores, we can select the label with the highest score for each word. With this scheme, we may end up with invalid outputs, e.g. I-Drug followed by I-ADE, or B-Drug followed by I-ADE. Hence we use a CRF (Conditional Random Field) layer to calculate the loss of our BiLSTM network, as it adds constraints to the final predicted labels to ensure they are valid. These constraints are learned by the CRF automatically from the training dataset during training. A CRF also considers context, rather than predicting the label for a single token without considering neighboring samples [3].
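
To make the notion of an invalid output concrete, the sketch below checks a predicted tag sequence against the IOB2 rule that an I- tag must continue an entity of the same type. It is only an illustration of the constraint: a CRF learns transition scores from data rather than applying a hard rule like this, and the function names are hypothetical.

```python
def is_valid_transition(prev_tag, tag):
    """IOB2 rule: I-X may only follow B-X or I-X of the same entity type X."""
    if not tag.startswith("I-"):
        return True  # O and B-* tags can follow anything
    entity_type = tag[2:]
    return prev_tag in ("B-" + entity_type, "I-" + entity_type)


def invalid_positions(tags):
    """Return the indices whose tag is not allowed after the preceding tag."""
    bad = []
    prev_tag = "O"  # treat the start of the sequence like an O tag
    for i, tag in enumerate(tags):
        if not is_valid_transition(prev_tag, tag):
            bad.append(i)
        prev_tag = tag
    return bad


predicted = ["B-Drug", "I-ADE", "O", "I-Drug", "B-ADE", "I-ADE"]
print(invalid_positions(predicted))
# [1, 3]
```

Position 1 is the "B-Drug followed by I-ADE" case from the text, and position 3 is an I- tag that starts an entity without a preceding B- tag.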

The model was built using the architecture described in Guillaume Genthial's Blog [4] and its PyTorch implementation [5]. To train the model, the EHR dataset was tokenized using a modified version of the ScispaCy [6] tokenizer. The original tokenizer keeps all whitespace characters in separate tokens, but it was modified so that all of the whitespace tokens are removed. All other tokens remain the same. The input sequence length was set to 512 and the EHR records were split using the steps discussed in the preprocessing section. The model was trained for 15 epochs using GPU resources from Google Cloud Compute Engine.

BioBERT for NER

The output for each token from the BERT model is passed through a fully connected neural network with a softmax layer at the end that classifies that token as an entity. The entities here are in IOB format; for example, B-DRUG and I-DRUG are treated as separate classes. This entire model is called BERT for token classification, and its architecture is available in the Python transformers library [7].

BioBERT is a pre-trained BERT model that is trained on medical corpora of more than 18 billion words. Since it has a medical vocabulary and is trained on biomedical data, we chose this model to fine-tune on our dataset. Code for fine-tuning from the official BioBERT for PyTorch GitHub repository [8] was used, with modifications to the input format. The input sequence length was set to 128, and the model was fine-tuned for 5 epochs using GPU resources from Google Cloud Compute Engine.

NER Results

The Rule-Based model did not perform very well, but that was expected as it does not take context into account and has a very high false positive rate. For BiLSTM + CRF and BioBERT, a sample of an external dataset, the ADE Corpus dataset, was integrated into our data, which improved performance considerably. The F1 score for the ADE entity improved from 0.3403 to 0.8673 for the BioBERT model after adding a sample of the ADE Corpus.

Also, the BioBERT and BiLSTM + CRF models produce F1 scores similar to that of the model that won the n2c2 challenge on the same dataset. The winning model was submitted by Alibaba Inc. and used an architecture of BiLSTM + character-level CNN + CRF for dependencies.

Model	Micro F1
Rule Based	0.2200
BiLSTM + CRF	0.8831
BioBERT	0.9328
Alibaba Inc. (Challenge winner)	0.9418

Relation Extraction (RE)

For the relation extraction task, a BERT for sequence classification model was used.

RE Data Preprocessing

Similar to the NER model, the data had to be transformed into a particular format in order to train and test the RE model. After splitting the train data into train and dev, each record was further split into paragraphs using the same method that was used for NER. The next step was to pair each drug entity with every other entity within that paragraph. This forms a list of all possible relations in that paragraph. Once the list was obtained, each entity text was replaced with @entity-type$. For example, the drug 'Lisinopril' would be replaced by @Drug$ and '20mg' by @Strength$. This was done for each relation, which means each data point has only one relation, i.e. one pair of entities. Finally, a label indicating whether the entities in that text are related was added: 1 representing a relationship and 0 otherwise.
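
The pairing and masking steps can be sketched as follows. This is a simplified illustration under stated assumptions: the `Entity` tuple, the function names, and the representation of gold relations as (entity id, drug id) pairs are hypothetical, and entity spans are assumed not to overlap.

```python
from typing import List, NamedTuple, Set, Tuple


class Entity(NamedTuple):
    eid: str    # annotation id, e.g. "T1"
    etype: str  # entity type, e.g. "Drug" or "Strength"
    start: int  # character offset where the entity starts
    end: int    # character offset just past the entity


def mask_pair(text: str, drug: Entity, other: Entity) -> str:
    """Replace both entity spans with @Type$ placeholders.

    The right-most span is replaced first so the other span's offsets stay valid.
    """
    for ent in sorted((drug, other), key=lambda e: e.start, reverse=True):
        text = text[:ent.start] + "@" + ent.etype + "$" + text[ent.end:]
    return text


def make_examples(text: str, entities: List[Entity],
                  gold: Set[Tuple[str, str]]) -> List[Tuple[str, int]]:
    """Pair every drug with every non-drug entity and label each pair 1 or 0."""
    drugs = [e for e in entities if e.etype == "Drug"]
    others = [e for e in entities if e.etype != "Drug"]
    examples = []
    for drug in drugs:
        for other in others:
            label = 1 if (other.eid, drug.eid) in gold else 0
            examples.append((mask_pair(text, drug, other), label))
    return examples


paragraph = "Lisinopril 20mg daily and aspirin."
entities = [Entity("T1", "Drug", 0, 10),
            Entity("T2", "Strength", 11, 15),
            Entity("T3", "Drug", 26, 33)]
for masked, label in make_examples(paragraph, entities, gold={("T2", "T1")}):
    print(label, masked)
# 1 @Drug$ @Strength$ daily and aspirin.
# 0 Lisinopril @Strength$ daily and @Drug$.
```

Each output row contains exactly one masked pair, so the same paragraph appears once per candidate relation with a different pair highlighted.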

BioBERT for RE

On a close look at the transformed data, Relation Extraction (RE) is simply a binary classification problem. It was decided to use the BioBERT model again, as the training process was much faster compared to the LSTM models. Also, the biomedical domain knowledge would be an advantage over other methods.

Unlike in NER, where it was used for token classification, BioBERT here uses sequence classification to predict the relations. In sequence classification, a sentence-level representation of the input, the output for the CLS (classification) token, is obtained. This CLS representation, which contains both word-level and contextual information about the whole sentence, is fed into a fully connected neural network that performs the binary classification. This model was trained with the same specifications as the NER BioBERT model.

RE Results

The BioBERT model performed extremely well on RE too. With an overall F1 score of 0.942, the model fell just 0.021 short of the challenge winners' score of 0.963.

In addition, the model achieved high F1 scores for each type of relation. The highest was Form-Drug with a score of 0.99, followed by Strength-Drug and Dosage-Drug, each with a score of 0.98. It is interesting to see that the scores for the Reason-Drug and ADE-Drug relations are similar, at 0.82 and 0.83 respectively. This also suggests that adding the external ADE corpus may have improved the F1 score for the ADE-Drug relation (though conclusive proof was not obtained, as the model was not trained without the ADE corpus). Also, based on these results, it can be said that the model does a decent job of differentiating between these two relations, which is a crucial part of this project.

End-to-end Results

The above Relation Extraction results were obtained using the actual entities provided in the data. In practice, this will not be the case, since the relationships would be obtained using the entities predicted by the NER model. This is what the end-to-end pipeline represents, and the corresponding scores can be seen in the figure below. By using the BioBERT model for both Named Entity Recognition and Relation Extraction, we get an F1 score of 0.86, which is a significant drop from the earlier score of 0.94. The obvious reason for this reduction is the cascading effect of the entities that were incorrectly predicted by the NER model. However, it should be noted that the challenge winners experienced a drop in F1 score as well. Moreover, the drop is even more significant when using the BiLSTM + CRF model for predicting the entities.

Front-end and API Deployment

An API was built for better accessibility, and a front-end website was built to showcase the work and to make it easier for users to visualize the results. They were deployed on Google Cloud Platform (GCP).

FastAPI

An end-to-end pipeline was created to transform raw EHR documents into a structured and more intuitive form. First, the raw EHR document is preprocessed for a Named Entity Recognition model. This preprocessed data is then passed through either the BiLSTM + CRF model or the BioBERT model for NER, based on the user's choice, to get predictions of the entities present in the EHR document. Using these predicted entities, the raw EHR is again preprocessed for the Relation Extraction model. After getting the predictions of relationships among the entities, a table and a knowledge graph are generated, which map each drug to all of its related entities. The relation table is created using the Python pandas package and the knowledge graph is created using Python's networkx package.
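
The last step, mapping each drug to its related entities, can be illustrated with the standard library alone. The sketch below is a stand-in for the pandas table and networkx graph used in the project, not their actual construction: the function names and the (drug, entity type, entity text) triple format are assumptions.

```python
from collections import defaultdict


def group_by_drug(relations):
    """Build a nested mapping drug -> entity type -> list of related entity texts.

    `relations` is an iterable of (drug_text, entity_type, entity_text) triples
    for the pairs that the RE model predicted as related.
    """
    table = defaultdict(lambda: defaultdict(list))
    for drug, entity_type, entity_text in relations:
        table[drug][entity_type].append(entity_text)
    return {drug: dict(columns) for drug, columns in table.items()}


def to_edges(relations):
    """Turn the same triples into (node, node, attributes) edges for a graph."""
    return [(drug, entity_text, {"type": entity_type})
            for drug, entity_type, entity_text in relations]


predicted = [("Lisinopril", "Strength", "20mg"),
             ("Lisinopril", "Frequency", "daily"),
             ("Aspirin", "Strength", "81mg")]
print(group_by_drug(predicted))
# {'Lisinopril': {'Strength': ['20mg'], 'Frequency': ['daily']}, 'Aspirin': {'Strength': ['81mg']}}
print(to_edges(predicted)[0])
# ('Lisinopril', '20mg', {'type': 'Strength'})
```

The nested mapping corresponds to one table row per drug, and the edge list is in the (u, v, attribute dict) form that a graph library could ingest.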

This end-to-end pipeline was converted into an API using a Python web framework named FastAPI [9]. This package allows building a production-ready API and is compatible with HTTP web servers like Gunicorn [10].

Front-end Website

To visualize the results of the end-to-end pipeline, a static front-end website was built using HTML, Bootstrap and jQuery. The Bootstrap framework ensures that the website layout adapts automatically when the website is accessed on a mobile device. The website provides an option to choose the NER model and an option to either upload or type/paste an EHR document. It also lets the user load a sample EHR document to test the results. Once the user requests the results, an AJAX [11] call sends the EHR text and the NER model choice to the API. The API then runs the entire pipeline and returns the results to the website.

To visualize the results and make it more user friendly, the website highlights each entity with a different color and hovering over the highlighted text gives a tooltip of the entity type. To visualize the relations of an entity, hovering over an entity creates a red colored border around all other entities that are related to it. The website also has an option to visualize all the relations in a tabular format and in a knowledge graph.

Deployment on GCP

The API was first deployed on Google Cloud Platform's App Engine service, which provides Platform as a Service (PaaS), where the application can be managed without the complexity of building infrastructure such as networks, servers and operating systems. We just have to submit the deployment code and everything is managed automatically. However, App Engine does not currently support GPUs, which caused the run-times of the models to be very high.

Due to the lack of GPU support in App Engine, the API and the front-end website were then deployed on GCP Compute Engine, which provides Infrastructure as a Service (IaaS). On Compute Engine, everything from installing an operating system, building a web server, managing networks, obtaining certificates and domain names, and managing load balancing was done by us. The final website can be accessed at ehr-info.ml.

References

[1] Electronic Health Records: https://www.cms.gov/Medicare/E-Health/EHealthRecords

[2] Adverse Drug Events: https://www.cdc.gov/medicationsafety/adult_adversedrugevents.html

[3] Conditional Random Fields: https://medium.com/ml2vec/overview-of-conditional-random-fields-68a2a20fa541

[4] Guillaume Genthial's Blog: https://guillaumegenthial.github.io/

[5] PyTorch-ELMo-BiLSTM-CRF implementation: https://github.com/yongyuwen/PyTorch-Elmo-BiLSTMCRF

[6] ScispaCy: https://allenai.github.io/scispacy/

[7] HuggingFace Transformers: https://huggingface.co/transformers/

[8] DMIS Lab - BioBERT PyTorch: https://github.com/dmis-lab/biobert-pytorch/tree/master/named-entity-recognition

[9] FastAPI: https://fastapi.tiangolo.com/

[10] Gunicorn: https://gunicorn.org/

[11] jQuery AJAX Method: https://api.jquery.com/Jquery.ajax/

ehr-relation-extraction's Issues

ADE Dataset on NER Task

I'm curious about how you dealt with de-identified data: did you leave them as they are or replace them with something? The [Hospital6 29] notation, for example.

Pretrained models

Hi, hope you're fine

Do you plan to release pre-trained models?

Thanks

BertTokenizer lock file not found

When I run the RE training, I get this error:

FileNotFoundError: [Errno 2] No such file or directory: './dataset/cached_train_BertTokenizer_128_ehr-re.lock'

How can I solve it?

ade_corpus format

hi,
all the ade_corpus is txt(rel) or csv format, where could I find the json format for generate-data? thanks.

How to run the API + frontend locally?

Hi,

I see there are instructions to deploy the model on GCP, but there's nothing to deploy it on local computers.

Could you give me a hint?

Thanks.

Issue with running ner script with only --do_predict argument

The NER model works fine when I run the ner script with the --do_train --do_predict arguments.
However, when I just want to find the prediction results and I try to run the script with only --do_predict (without --do_train) using the trained model, I get completely wrong results.
Is there something small I am missing out?

BiLSTM-CRF code not working

Hi I completed the build_data, but when running train.py the ner_learner throws following error
Traceback (most recent call last):
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 185, in <module>
    main()
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/train.py", line 181, in main
    learn.fit(train, dev)
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 186, in fit
    self.train(epoch, nbatches_train, train_generator, fine_tune=fine_tune)
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 217, in train
    for batch_idx, (inputs, targets, sequence_lengths) in enumerate(train_generator):
  File "/data/storage_hpc_nishant/ade_bench/bilstm_crf/model/ner_learner.py", line 139, in data_generator
    "word_ids": np.asarray(word_ids)
ValueError: setting an array element with a sequence. The requested array has an inhomogeneous shape after 1 dimensions. The detected shape was (5,) + inhomogeneous part.
Any pointers to what could be the reason

F1 score is zero when using run_ner.py and run_re.py to fine-tune the BERT model

When I use the run_ner.py and run_re.py scripts, the F1 score and precision are zero for both the NER model and the RE model. I think the problem lies in the format of the training data generated by generate_data, right? How do you overcome that?

Getting zero results on running evaluation script

Hi there, I ran the Track2-evaluate-ver4.py file on the gold standard test data and the test data(track 2) from n2c2 and I am getting zero scores on Relations for every entity. Please find the output of the execution below

****************************** TRACK 2 *******************************
------- strict ------- ------ lenient -------
Prec. Rec. F(b=1) Prec. Rec. F(b=1)
Drug 0.9990 0.9997 0.9993 0.9990 0.9997 0.9993
Strength 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Duration 0.9947 1.0000 0.9974 0.9947 1.0000 0.9974
Route 1.0000 1.0000 1.0000 1.0000 1.0000 1.0000
Form 0.9998 0.9995 0.9997 1.0000 0.9998 0.9999
Ade 0.9968 0.9968 0.9968 0.9968 0.9968 0.9968
Dosage 0.9996 0.9996 0.9996 0.9996 0.9996 0.9996
Reason 0.9945 1.0000 0.9973 0.9945 1.0000 0.9973
Frequency 0.9988 0.9995 0.9991 0.9995 1.0000 0.9998
------------------------------------------------
Overall (micro) 0.9989 0.9997 0.9993 0.9990 0.9998 0.9994
Overall (macro) 0.9990 0.9997 0.9993 0.9991 0.9997 0.9994

***************************** RELATIONS ******************************
Strength -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Dosage -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Duration -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Frequency -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Form -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Route -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Reason -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
ADE -> Drug 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
------------------------------------------------
Overall (micro) 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000
Overall (macro) 0.0000 0.0000 0.0000 0.0000 0.0000 0.0000

This is the output I am getting and I am confused what's wrong here. Also when I ran run_re.py, I am getting ZeroDivision error as mentioned here and now I am wondering whether these two have any relation and causing error in the training of model.
It would mean a lot if you can help me regarding this or point out if I am doing something wrong here.

`RuntimeError: CUDA out of memory` when training BioBERT NER model

I have tried to run the run_ner.py script on my local laptop and I had the following error:

Iteration:   0%|                                                                                                                             | 0/189 [00:00<?, ?it/s]
Epoch:   0%|                                                                                                                               | 0/5 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "run_ner.py", line 284, in <module>
    main()
  File "run_ner.py", line 206, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 499, in train
    tr_loss += self._training_step(model, inputs, optimizer)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/trainer.py", line 622, in _training_step
    outputs = model(**inputs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 1446, in forward
    output_hidden_states=output_hidden_states,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 762, in forward
    output_hidden_states=output_hidden_states,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 439, in forward
    output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 371, in forward
    hidden_states, attention_mask, head_mask, output_attentions=output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask, output_attentions,
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/xegulon/miniconda3/envs/ehr-re/lib/python3.7/site-packages/transformers/modeling_bert.py", line 256, in forward
    context_layer = torch.matmul(attention_probs, value_layer)
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 5.80 GiB total capacity; 4.70 GiB already allocated; 18.19 MiB free; 4.85 GiB reserved in total by PyTorch)

Is there any easy workaround for that? Is it compulsory to run that on a cloud GPU?

Missing .tsv files in generated dataset for RE

Hi,
I used generate_data.py file for generating training dataset for RE task but in the output I am only getting .pkl files and .txt files but not .tsv files which I guess are necessary for training of model.

Any idea on how to fix this?

Rename the variable "words" in `read_examples_from_file()` and `InputExample` in utils_ner.py

The current implementation of biobert_ner.utils_ner.read_examples_from_file() uses the variable name words to store all the tokens in a document. This is misleading, especially in the case of word-piece tokenizers like BERT where individual words can be split into multiple tokens. A better variable name would be tokens since that is precisely what we are reading.

The variable name should be changed in read_examples_from_file() and also in the InputExample class.

Error generating data using default tokenizer

Hi, thank you so much for your repository, it has been extremely helpful for me in my research work.

Would like to highlight this particular issue when running generate_data.py when using the default tokenizer (not applicable to the scispacy tokenizer which is the current default).

The following error is encountered

$ python3 -m app.training.generate_data

Reading data

Train data:
Progress: [====================] 303/303

Test Data:
Progress: [==========>         ] 103/202/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py:179: UserWarning: Invalid annotation encountered: ['12 hours']
  warnings.warn("Invalid annotation encountered: " + str(line))
Progress: [==============>     ] 145/202Traceback (most recent call last):
  File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 197, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/jiayi/anaconda3/envs/training/lib/python3.9/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 214, in <module>
    main()
  File "/home/jiayi/adverse_drug_event_extraction/app/training/generate_data.py", line 175, in main
    train_dev, test = read_data(data_dir=args.input_dir,
  File "/home/jiayi/adverse_drug_event_extraction/app/training/utils.py", line 293, in read_data
    record = HealthRecord(fid, text_path=os.path.join(test_path, fid + '.txt'),
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 62, in __init__
    self.set_tokenizer(tokenizer)
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 269, in set_tokenizer
    self._compute_tokens()
  File "/home/jiayi/adverse_drug_event_extraction/app/training/ehr.py", line 226, in _compute_tokens
    raise Exception("Error computing token to char map.")
Exception: Error computing token to char map.

After some debugging I have noticed that this is due to the 2 "###" instances in the test set for the 109724.txt file. After removing the 2 instances observed, everything was running fine. Not too sure why this is happening but thought it would be interesting to point it out for future use. Additionally, after training the model the model predictions were unexpected as well. Output was supposed to be [('ROU', 1, 1), ('DUR', 2, 2), ('DOS', 3, 3), ('ADE', 4, 4), ('FRE', 5, 6)] but it returned [('inpatient', 0, 0), ('Pseudoaneurysm', 1, 1), ('Fevers', 2, 4), ('QTc', 5, 6)].

I ended up adopting the scispacy tokenizer which works perfectly. Cheers!

How to use the relation extraction model for inference?

Everything is in the title.

After having run run_re.py, how to use it for RE predictions with new data?

Do the new data need to have its entities annotated?

How to get the n2c2 data in TSV format?

Hi,

I see in the code (this file) that to generate the data for RE, one needs to have the dataset as TSV files.

However, as far as I know, the n2c2 2018 dataset is provided as BRAT files (I asked for it).

Am I wrong, or is there a script that makes the conversion from BRAT to the TSV format you use?

Train BioBert_ner error, while running run_ner.py (Colab)

I have installed all the packages as per the requirements.txt file on Colab, including torch==1.6.0 , torchvision==0.7.0 , transformers==3.0.2

when i run the command :
! make train-biobert-ner

i get the following error :

cd biobert_ner/ && \
python run_ner.py \
--data_dir ./dataset/ \
--labels ./dataset/labels.txt \
--model_name_or_path dmis-lab/biobert-large-cased-v1.1 \
--output_dir ./output/ \
--max_seq_length 128 \
--num_train_epochs 1 \
--per_device_train_batch_size 8 \
--save_steps 4000 \
--seed 0 \
--do_train \
--do_eval \
--do_predict \
--overwrite_output_dir
01/31/2022 11:45:17 - INFO - transformers.training_args - PyTorch: setting up devices
01/31/2022 11:45:17 - WARNING - main - Process rank: -1, device: cuda:0, n_gpu: 1, distributed training: False, 16-bits training: False
01/31/2022 11:45:17 - INFO - main - Training/evaluation parameters TrainingArguments(output_dir='./output/', overwrite_output_dir=True, do_train=True, do_eval=True, do_predict=True, evaluate_during_training=False, per_device_train_batch_size=8, per_device_eval_batch_size=8, per_gpu_train_batch_size=None, per_gpu_eval_batch_size=None, gradient_accumulation_steps=1, learning_rate=5e-05, weight_decay=0.0, adam_epsilon=1e-08, max_grad_norm=1.0, num_train_epochs=1.0, max_steps=-1, warmup_steps=0, logging_dir='runs/Jan31_11-45-17_5c9d189f141d', logging_first_step=False, logging_steps=500, save_steps=4000, save_total_limit=None, no_cuda=False, seed=0, fp16=False, fp16_opt_level='O1', local_rank=-1, tpu_num_cores=None, tpu_metrics_debug=False, debug=False, dataloader_drop_last=False, eval_steps=1000, past_index=-1)
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"id2label": {
"0": "B-DRUG",
"1": "I-DRUG",
"2": "B-STR",
"3": "I-STR",
"4": "B-DUR",
"5": "I-DUR",
"6": "B-ROU",
"7": "I-ROU",
"8": "B-FOR",
"9": "I-FOR",
"10": "B-ADE",
"11": "I-ADE",
"12": "B-DOS",
"13": "I-DOS",
"14": "B-REA",
"15": "I-REA",
"16": "B-FRE",
"17": "I-FRE",
"18": "O"
},
"initializer_range": 0.02,
"intermediate_size": 4096,
"label2id": {
"B-ADE": 10,
"B-DOS": 12,
"B-DRUG": 0,
"B-DUR": 4,
"B-FOR": 8,
"B-FRE": 16,
"B-REA": 14,
"B-ROU": 6,
"B-STR": 2,
"I-ADE": 11,
"I-DOS": 13,
"I-DRUG": 1,
"I-DUR": 5,
"I-FOR": 9,
"I-FRE": 17,
"I-REA": 15,
"I-ROU": 7,
"I-STR": 3,
"O": 18
},
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}

01/31/2022 11:45:17 - INFO - transformers.configuration_utils - loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/config.json from cache at /root/.cache/torch/transformers/3493610bf2342adb1bf68e2a34c59b725a710eb59df1883605e40ae7e95bf9e4.5b7a692f7cc36e826065fed1096ab38064bca502b90349c26fb1b70aae2defb6
01/31/2022 11:45:17 - INFO - transformers.configuration_utils - Model config BertConfig {
"attention_probs_dropout_prob": 0.1,
"gradient_checkpointing": false,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-12,
"max_position_embeddings": 512,
"model_type": "bert",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 0,
"type_vocab_size": 2,
"vocab_size": 58996
}

01/31/2022 11:45:17 - INFO - transformers.tokenization_utils_base - Model name 'dmis-lab/biobert-large-cased-v1.1' not found in model shortcut name list (bert-base-uncased, bert-large-uncased, bert-base-cased, bert-large-cased, bert-base-multilingual-uncased, bert-base-multilingual-cased, bert-base-chinese, bert-base-german-cased, bert-large-uncased-whole-word-masking, bert-large-cased-whole-word-masking, bert-large-uncased-whole-word-masking-finetuned-squad, bert-large-cased-whole-word-masking-finetuned-squad, bert-base-cased-finetuned-mrpc, bert-base-german-dbmdz-cased, bert-base-german-dbmdz-uncased, TurkuNLP/bert-base-finnish-cased-v1, TurkuNLP/bert-base-finnish-uncased-v1, wietsedv/bert-base-dutch-cased). Assuming 'dmis-lab/biobert-large-cased-v1.1' is a path, a model identifier, or url to a directory containing tokenizer files.
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/vocab.txt from cache at /root/.cache/torch/transformers/701732fae654e0c36bf4554c7758f748495aa3427b4084607df605f2049a89a0.b2d452d8aee26fe2e337e17013b48f3d5a81bb300c38986450d4022986348bdd
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/added_tokens.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/special_tokens_map.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer_config.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.tokenization_utils_base - loading file https://s3.amazonaws.com/models.huggingface.co/bert/dmis-lab/biobert-large-cased-v1.1/tokenizer.json from cache at None
01/31/2022 11:45:18 - INFO - transformers.modeling_utils - loading weights file https://cdn.huggingface.co/dmis-lab/biobert-large-cased-v1.1/pytorch_model.bin from cache at /root/.cache/torch/transformers/8c1699719a69e0d7cccc2c016217edb876ee6732c3aa2809e15a09c70e9bc22e.2c1d459b35b7f0b1938ff35bf6334bc60282ea79ea7cf7e9656e27f726ed07c6
01/31/2022 11:45:33 - WARNING - transformers.modeling_utils - Some weights of the model checkpoint at dmis-lab/biobert-large-cased-v1.1 were not used when initializing BertForTokenClassification: ['cls.predictions.bias', 'cls.predictions.transform.dense.weight', 'cls.predictions.transform.dense.bias', 'cls.predictions.transform.LayerNorm.weight', 'cls.predictions.transform.LayerNorm.bias', 'cls.predictions.decoder.weight', 'cls.predictions.decoder.bias', 'cls.seq_relationship.weight', 'cls.seq_relationship.bias']

This IS expected if you are initializing BertForTokenClassification from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPretraining model).
This IS NOT expected if you are initializing BertForTokenClassification from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
01/31/2022 11:45:33 - WARNING - transformers.modeling_utils - Some weights of BertForTokenClassification were not initialized from the model checkpoint at dmis-lab/biobert-large-cased-v1.1 and are newly initialized: ['classifier.weight', 'classifier.bias']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564664784 acquired on ./dataset/cached_train_dev_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - utils_ner - Loading features from cached file ./dataset/cached_train_dev_BertTokenizer_128
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564664784 released on ./dataset/cached_train_dev_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564665488 acquired on ./dataset/cached_devel_BertTokenizer_128.lock
01/31/2022 11:45:33 - INFO - utils_ner - Loading features from cached file ./dataset/cached_devel_BertTokenizer_128
01/31/2022 11:45:33 - INFO - filelock - Lock 140349564665488 released on ./dataset/cached_devel_BertTokenizer_128.lock
01/31/2022 11:45:35 - INFO - transformers.trainer - You are instantiating a Trainer but W&B is not installed. To use wandb logging, run pip install wandb; wandb login see https://docs.wandb.com/huggingface.
Traceback (most recent call last):
  File "run_ner.py", line 284, in <module>
    main()
  File "run_ner.py", line 206, in main
    model_path=model_args.model_name_or_path if os.path.isdir(model_args.model_name_or_path) else None
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 384, in train
    train_dataloader = self.get_train_dataloader()
  File "/usr/local/lib/python3.7/dist-packages/transformers/trainer.py", line 240, in get_train_dataloader
    if self.args.local_rank == -1
  File "/usr/local/lib/python3.7/dist-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
Makefile:53: recipe for target 'train-biobert-ner' failed
make: *** [train-biobert-ner] Error 1
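
The final ValueError says the sampler received a dataset with zero samples, so a reasonable first check is whether the generated training file contains any examples at all. The sketch below counts examples in a token-per-line file, assuming (this is an assumption, not confirmed by the log) that examples are separated by blank lines and that the training file is ./dataset/train_dev.txt. The function names are hypothetical. The log also shows features being loaded from a cached file, so a cache built from an earlier empty dataset may be worth ruling out as well.

```python
from pathlib import Path


def count_examples_in_text(text: str) -> int:
    """Count blank-line-separated examples in token-per-line text."""
    count = 0
    in_example = False
    for line in text.splitlines():
        if line.strip():
            if not in_example:
                # First non-blank line after a separator starts a new example.
                count += 1
                in_example = True
        else:
            in_example = False
    return count


def count_examples(path: str) -> int:
    """Count examples in a file on disk (path is supplied by the caller)."""
    return count_examples_in_text(Path(path).read_text(encoding="utf-8"))
```

If `count_examples("./dataset/train_dev.txt")` returns 0, the problem lies in data generation rather than in training.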

Empty data files generated for the RE task

Hi,
Every time I generate data for the RE task I get empty files, even though the script runs successfully.

I'm running this command:

python generate_data.py \
--task re \
--input_dir dataset_split \
--target_dir datasetRE/ \
--max_seq_len 512 \
--dev_split 0.1 \
--tokenizer biobert-base \
--ext tsv \
--sep tab

RE task returns zero for all metrics

The RE task gives this warning: UndefinedMetricWarning: Recall and F-score are ill-defined and being set to 0.0 in labels with no true samples. Use zero_division parameter to control this behavior.
This happens on the test set but not on the evaluation set.
The NER task works completely fine.
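
The warning text says that some label has no true samples in the data being scored, so one way to narrow this down is to compare the labels actually present in the test set against the labels the model is expected to predict. The sketch below is a minimal illustration in plain Python; the function name is hypothetical, and how the true labels are read from the test file is left to the caller because the file layout is not shown here.

```python
from collections import Counter
from typing import Iterable, List


def missing_labels(expected: Iterable[str], observed: Iterable[str]) -> List[str]:
    """Return the expected labels that never occur among the observed true labels."""
    counts = Counter(observed)
    return sorted(label for label in set(expected) if counts[label] == 0)
```

For example, `missing_labels(["0", "1"], test_labels)` returning `["1"]` would mean the test set has no positive examples, which is exactly the situation the warning describes.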

Documentation for End-to-End RE task

Is there any written documentation explaining the details of the end-to-end RE task?

Error running run_re.py - TypeError: TextInputSequence must be str

Hi there,

I was running run_re.py for biobert_re when I encountered the following error:

01/21/2022 16:21:02 - WARNING - __main__ -   Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
01/21/2022 16:21:16 - INFO - utils_re -   Creating features from dataset file at /data/jiayi/n2c2/dataset_re
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/utils_re.py", line 132, in __init__
    self.features = glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 40, in glue_convert_examples_to_features
    return _glue_convert_examples_to_features(
  File "/home/jiayi/adverse_drug_event_extraction/app/bilstm_crf_re/data_processor.py", line 74, in _glue_convert_examples_to_features
    batch_encoding = tokenizer(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2418, in __call__
    return self.batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_base.py", line 2609, in batch_encode_plus
    return self._batch_encode_plus(
  File "/home/jiayi/anaconda3/envs/re/lib/python3.8/site-packages/transformers/tokenization_utils_fast.py", line 409, in _batch_encode_plus
    encodings = self._tokenizer.encode_batch(
TypeError: TextInputSequence must be str

Upon further debugging, this seems to be a HuggingFace issue, as seen in the comment here. To work around it, I simply added a use_fast=False parameter at line 99 of run_re.py, as shown below.

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    model_args.tokenizer_name if model_args.tokenizer_name else model_args.model_name_or_path,
    cache_dir=model_args.cache_dir,
    use_fast=False
)

Hope this will be helpful to anyone else encountering the same issue.
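
As the message itself states, the fast tokenizer rejects any input that is not a str. As a complementary check, and only as a sketch, the helper below lists the positions of non-string items in a sequence of texts before they reach the tokenizer. The function name is hypothetical, and whether the RE data actually contains such items (for example empty fields read as None) is an assumption to verify.

```python
from typing import Any, List, Sequence, Tuple


def find_non_string_inputs(texts: Sequence[Any]) -> List[Tuple[int, Any]]:
    """Return (index, value) pairs for every item that is not a str."""
    return [(i, t) for i, t in enumerate(texts) if not isinstance(t, str)]
```

An empty result means every input is already a string, in which case the cause lies elsewhere.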

List index out of range while running RE

WARNING:__main__:Process rank: -1, device: cpu, n_gpu: 0, distributed training: False, 16-bits training: False
Traceback (most recent call last):
  File "run_re.py", line 230, in <module>
    main()
  File "run_re.py", line 103, in main
    REDataset(data_args, tokenizer=tokenizer, cache_dir=model_args.cache_dir) if training_args.do_train else None
  File "/content/ehr-relation-extraction/biobert_re/utils_re.py", line 129, in __init__
    examples = self.processor.get_train_examples(args.data_dir)
  File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 116, in get_train_examples
    return self._create_examples(self._read_tsv(os.path.join(data_dir, "train.tsv")), "train")
  File "/content/ehr-relation-extraction/biobert_re/data_processor.py", line 139, in _create_examples
    label = None if set_type == "test" else line[1]
IndexError: list index out of range
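
The failing statement reads `line[1]`, so at least one row of train.tsv must have fewer than two tab-separated columns. The sketch below lists such rows with their line numbers. The function name is hypothetical, and the use of `csv.QUOTE_NONE` is an assumption made so that quote characters in the text are not treated specially.

```python
import csv
import io
from typing import List, Tuple


def find_short_rows(tsv_text: str, min_cols: int = 2) -> List[Tuple[int, List[str]]]:
    """Return (line_number, row) for rows with fewer than min_cols columns."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    # Blank lines are yielded as empty lists, so they are reported too.
    return [(n, row) for n, row in enumerate(reader, start=1) if len(row) < min_cols]
```

Running it on the contents of train.tsv, for example `find_short_rows(open("train.tsv", encoding="utf-8").read())`, shows which lines would trigger the IndexError.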

BiLSTM-CRF NERLearner - Skipping batch of size=1

Hi there, I've been trying to better understand the BiLSTM-CRF NER model, more specifically the NERLearner class in bilstm_crf_ner/model/ner_learner.py.

To run the NER model, I ran the generate_data and build_data scripts, and then the train and test scripts. However, I noticed that during training (and also when running test.py), the line 'Skipping batch of size=1' is logged many times because of the following snippet of code (in both the train and test functions).

https://github.com/smitkiri/ehr-relation-extraction/blob/master/bilstm_crf_ner/model/ner_learner.py#L220-L222

if inputs['word_ids'].shape[0] == 1:
    self.logger.info('Skipping batch of size=1')
    continue

Every item in my training and evaluation sets is caught by this if-statement and never reaches the rest of the loop body. I tried removing this check for evaluation and the model did produce some predictions, but not with great accuracy; I suspect the check also affects model performance during training.

UPDATE: I realised this was because the batch size was set to 1, which does not suit this model. My two questions below still remain!

What is this code for, and is it okay to remove it for training and evaluation?

Also, how did you derive the results shown in the BiLSTM-CRF README file? Is there a specific script that you ran to produce them?
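
To make the effect of the check concrete, here is a small sketch that counts how many items end up in batches of size 1 for a given dataset size and batch size. It assumes plain sequential batching where the last batch keeps the remainder; the function names are hypothetical and the repository's own data loader may batch differently.

```python
from typing import List


def batch_sizes(n_items: int, batch_size: int) -> List[int]:
    """Sizes of consecutive batches when the last batch keeps the remainder."""
    full, rest = divmod(n_items, batch_size)
    return [batch_size] * full + ([rest] if rest else [])


def skipped_items(n_items: int, batch_size: int) -> int:
    """Number of items that fall in batches of size 1 and would be skipped."""
    return sum(1 for size in batch_sizes(n_items, batch_size) if size == 1)
```

With a batch size of 1 every item is skipped, which matches the behaviour described in the update above. With a larger batch size, at most one trailing item is skipped, and only when the dataset size leaves a remainder of 1.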

NER model is not predicting any labels, even on test set during training

Hi, thank you for this repo. Unfortunately, I ran into some issues, and because there is no error message I am unable to debug them.

When I try to generate predictions, I get blank output (model isn't predicting anything).
I realised that during training it also failed to predict any entities, but there was no error message, just warnings.

I used the following script to prepare the dataset:

!python generate_data.py --task ner \
    --input_dir ./data/ \
    --ade_dir ./ade_corpus/ \
    --target_dir biobert_ner/dataset/ \
    --max_seq_len 512 --dev_split 0.1 \
    --tokenizer biobert-base \
    --ext txt --sep " "

Log:

Reading data

Train data:
Progress: [====================] 303/303

Test Data:
Progress: [==================> ] 191/202
/Users/lsolis/Documents/GitHub/PubMed_pipeline/ehr-relation-extraction/ehr.py:186: UserWarning: Invalid annotation encountered: ['12 hours'], File: ./data/test/106967.ann
  warnings.warn(msg)
Progress: [====================] 202/202

ADE data: Done


Data successfully saved in biobert_ner/dataset/train.txt
Variable successfully saved in biobert_ner/dataset/train.pkl
Data successfully saved in biobert_ner/dataset/train_dev.txt
Variable successfully saved in biobert_ner/dataset/train_dev.pkl
Data successfully saved in biobert_ner/dataset/devel.txt
Variable successfully saved in biobert_ner/dataset/devel.pkl
Data successfully saved in biobert_ner/dataset/test.txt
Variable successfully saved in biobert_ner/dataset/test.pkl

Generating files successful. Files generated: train.txt, train.pkl, train_dev.txt, train_dev.pkl, devel.txt, devel.pkl, test.txt, test.pkl, labels.txt

Then I ran run_ner.py:

cd biobert_ner
export SAVE_DIR=./output1
export DATA_DIR=./dataset

export MAX_LENGTH=128
export BATCH_SIZE=16
export NUM_EPOCHS=5
export SAVE_STEPS=1000
export SEED=0

python run_ner.py --data_dir ${DATA_DIR}/ --labels ${DATA_DIR}/labels.txt --model_name_or_path dmis-lab/biobert-large-cased-v1.1 --output_dir ${SAVE_DIR}/ --max_seq_length ${MAX_LENGTH} --num_train_epochs ${NUM_EPOCHS} --per_device_train_batch_size ${BATCH_SIZE} --save_steps ${SAVE_STEPS} --seed ${SEED} --do_train --do_eval --do_predict --overwrite_output_dir

At the very last stage there was a printout in the terminal; here is a fraction of it:

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: 8 O

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: . O

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: per B-DRUG

05/10/2022 04:38:49 - WARNING - __main__ -   Example 1966, Example: 5 B-STR

Content of test_results.txt:

eval_loss = 1.8185148617116416
eval_precision = 0.0
eval_recall = 0.0
eval_f1 = 0.0

And a fraction of test_predictions.txt: every prediction is "O"; not a single other label is predicted.

##9 O
##am O
##b O
##c O
##b O
##c O
##g O
##b O
##ct O
##c O
##v O
##ch O
##ch O
##c O
##d O
##w O
##lt O
##t O
##9 O
##am O
##uts O
##ymph O
##s O
##os O
##os O
##as O
##o O
##y O

Dataset

Is the data available to download?

IndexError: list index out of range

Hi Smit,

I trained the NER model using your repo plus some custom data. When I run predictions on a custom dataset, it works fine on most files, but about 20% of the files fail with the following error:

Prediction: 100%
1/1 [00:00<00:00, 4.22it/s]
---------------------------------------------------------------------------
IndexError                                Traceback (most recent call last)
<ipython-input-33-ad82028bb83c> in <module>()
----> 1 ner_predictions = get_ner_predictions(ehr_record=text, model_name="biobert")
      2 text_ner = ner_predictions.get_entities()

2 frames
/content/ehr.py in get_char_idx(self, token_idx)
    319             raise AttributeError("Tokenizer not set.")
    320 
--> 321         char_idx = self.token_to_char_map[token_idx]
    322 
    323         return char_idx

IndexError: list index out of range

Would you know where the issue might be, please?
Thank you!