Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition

This is the code for reproducing the experiments from "Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition" (El Boukkouri et al.).

Paper Link: https://www.aclweb.org/anthology/papers/P/P19/P19-2041/

Python environment

The code was tested on Linux (Ubuntu 16.04.4 LTS) with Python 3.6. Using Anaconda, create a new environment from the .yml file:

conda env create --name ACL_PAPER_env -f=environment.yml

Then activate it:

source activate ACL_PAPER_env

Steps for reproducing the experiments

Step 0: Download the 2010 i2b2/VA Challenge dataset

Follow the instructions at https://www.i2b2.org/NLP/DataSets/ to get your own copy of the 2010 i2b2/VA Challenge dataset. Then put this data inside the i2b2_data folder so that it looks like:

 i2b2_data/
 |
  --> 2010/
      |
      | --> test/
      |     |
      |     | --> concepts/
      |     |     |
      |     |      --> *.con
      |     |
      |      --> texts/
      |          |
      |           --> *.txt
      |
       --> train/
           |
           | --> concepts/
           |     |
           |      --> *.con
           |
            --> texts/
                |
                 --> *.txt
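
If you have already extracted the challenge archives somewhere else, a sketch like the following can lay out the expected structure. The source path /path/to/i2b2_download is a placeholder, and the loop assumes your copy keeps the .con and .txt files grouped per split; adjust it to however your download is organised:

# Placeholder source location -- adjust to wherever your i2b2/VA copy lives.
for split in train test; do
    mkdir -p i2b2_data/2010/${split}/concepts i2b2_data/2010/${split}/texts
    cp /path/to/i2b2_download/${split}/*.con i2b2_data/2010/${split}/concepts/
    cp /path/to/i2b2_download/${split}/*.txt i2b2_data/2010/${split}/texts/
done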

Step 1: Prepare your corpora

Prepare each corpus you want to train a set of static embeddings on by adding a folder embeddings/corpora/{corpus_name}/ where {corpus_name} is the name of your corpus. Then put your preprocessed corpus as a single text file called corpus.txt inside that folder.

The code comes with a small example corpus from the English Wikipedia in embeddings/corpora/wiki/. This is only there as an example and will not result in great performance.
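
For your own data, a minimal sketch for a hypothetical corpus named my_corpus (the source path is a placeholder):

mkdir -p embeddings/corpora/my_corpus
cp /path/to/preprocessed_text.txt embeddings/corpora/my_corpus/corpus.txt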

Step 2: Download & compile embedding codes

Inside the embeddings folder you will also find three other folders called word2vec, glove and fasttext. Each of these folders contains a shell script called download_and_compile_code.sh. Running this script downloads the method's source code and compiles it in preparation for embedding training.

bash download_and_compile_code.sh
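
To prepare all three methods in one go, you can run the script in each folder in turn, for example with a small loop from the repository root:

for method in word2vec glove fasttext; do
    (cd embeddings/${method} && bash download_and_compile_code.sh)
done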

Step 3: Train embeddings on your corpus

The word2vec, glove and fasttext folders also include another script called train_embeddings.sh. By default, running this script trains embeddings with the same parameters as in the paper, using the method of your choice, on each corpus available inside embeddings/corpora/.

bash train_embeddings.sh
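
As in Step 2, the three methods can be trained one after the other from the repository root:

for method in word2vec glove fasttext; do
    (cd embeddings/${method} && bash train_embeddings.sh)
done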

Step 4: Prepare experiments

Edit the main.py script by changing the list of models to run, as well as the name of the experiment. This name will be used to create a results folder in results/{experiment_name}. Changing the experiment name will result in a different folder - this can be used to group your results as you wish.

To use an embedding trained with method {method} on corpus {corpus}, use the name {method}_{corpus} (e.g. word2vec_wiki for word2vec embeddings trained on the example wiki corpus). There are also three available ELMo embeddings: Small, Original and PubMed, as described at https://allennlp.org/elmo. To use them, use the names elmo_small, elmo_original and elmo_pubmed.

For everything related to combining ELMo with other embeddings (word2vec, fastText and GloVe), refer to the examples in the main.py script.

Step 5: Run the experiments

Edit the run_experiment.sh script and choose a GPU if one is available. If no GPU is available, change the parameter --device='gpu' to --device='cpu', then run the script:

bash run_experiment.sh

This should run your experiments in "debug mode", meaning each model is trained for only one epoch. Use this mode to check that everything works before launching longer training runs. To run a full training, change the parameter --debug=True to --debug=False.
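
For example, assuming run_experiment.sh simply forwards these flags to main.py, the relevant line for a full (non-debug) training on CPU would look roughly like:

python main.py --device='cpu' --debug=False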
