Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition

This is the code for reproducing the experiments from "Embedding Strategies for Specialized Domains: Application to Clinical Entity Recognition" (El Boukkouri et al.).

Paper Link: https://www.aclweb.org/anthology/papers/P/P19/P19-2041/

Python environment

The code was tested on Linux (Ubuntu 16.04.4 LTS) with Python 3.6. Using Anaconda, create a new environment from the .yml file:

conda env create --name ACL_PAPER_env -f=environment.yml

Then activate it:

source activate ACL_PAPER_env

Steps for reproducing the experiments

Step 0: Download the 2010 i2b2/VA Challenge dataset

Follow the instructions at https://www.i2b2.org/NLP/DataSets/ to get your own copy of the 2010 i2b2/VA Challenge dataset. Then put this data inside the i2b2_data folder so that it looks like:

 i2b2_data/
 |
  --> 2010/
      |
      | --> test/
      |     |
      |     | --> concepts/
      |     |     |
      |     |      --> *.con
      |     |
      |      --> texts/
      |          |
      |           --> *.txt
      |
       --> train/
           |
           | --> concepts/
           |     |
           |      --> *.con
           |
            --> texts/
                |
                 --> *.txt
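
If you have already extracted the challenge archives somewhere else, a sketch like the following can lay out the expected structure. The source path /path/to/i2b2_download is a placeholder, and the loop assumes your copy keeps the .con and .txt files grouped per split; adjust it to however your download is organised:

# Placeholder source location -- adjust to wherever your i2b2/VA copy lives.
for split in train test; do
    mkdir -p i2b2_data/2010/${split}/concepts i2b2_data/2010/${split}/texts
    cp /path/to/i2b2_download/${split}/*.con i2b2_data/2010/${split}/concepts/
    cp /path/to/i2b2_download/${split}/*.txt i2b2_data/2010/${split}/texts/
done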

Step 1: Prepare your corpora

Prepare each corpus you want to train a set of static embeddings on by adding a folder embeddings/corpora/{corpus_name}/ where {corpus_name} is the name of your corpus. Then put your preprocessed corpus as a single text file called corpus.txt inside that folder.

The code comes with a small example corpus from the English Wikipedia in embeddings/corpora/wiki/. This is only there as an example and will not result in great performance.
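
For your own data, a minimal sketch for a hypothetical corpus named my_corpus (the source path is a placeholder):

mkdir -p embeddings/corpora/my_corpus
cp /path/to/preprocessed_text.txt embeddings/corpora/my_corpus/corpus.txt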

Step 2: Download & compile embedding codes

Inside the embeddings folder you will also find three other folders called word2vec, glove and fasttext. Each of these folders contains a shell script called download_and_compile_code.sh. Running this script downloads the method's source code and compiles it in preparation for embedding training.

bash download_and_compile_code.sh
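
To prepare all three methods in one go, you can run the script in each folder in turn, for example with a small loop from the repository root:

for method in word2vec glove fasttext; do
    (cd embeddings/${method} && bash download_and_compile_code.sh)
done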

Step 3: Train embeddings on your corpus

The word2vec, glove and fasttext folders also include another script called train_embeddings.sh. By default, running this script trains embeddings with the same parameters as in the paper, using the method of your choice, on each corpus available inside embeddings/corpora/.

bash train_embeddings.sh
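
As in Step 2, the three methods can be trained one after the other from the repository root:

for method in word2vec glove fasttext; do
    (cd embeddings/${method} && bash train_embeddings.sh)
done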

Step 4: Prepare experiments

Edit the main.py script by changing the list of models to run, as well as the name of the experiment. This name will be used to create a results folder in results/{experiment_name}. Changing the experiment name will result in a different folder - this can be used to group your results as you wish.

To use an embedding trained with method {method} on corpus {corpus}, use the name {method}_{corpus} (e.g. word2vec_wiki for word2vec embeddings trained on the example wiki corpus). There are also three available ELMo embeddings: Small, Original and PubMed, as described at https://allennlp.org/elmo. To use them, use the names elmo_small, elmo_original and elmo_pubmed.

For everything related to combining ELMo with other embeddings (word2vec, fastText and GloVe), refer to the examples in the main.py script.

Step 5: Run the experiments

Edit the run_experiment.sh script and choose a GPU if one is available. If no GPU is available, change the parameter --device='gpu' to --device='cpu', then run the script:

bash run_experiment.sh

This should run your experiments in "debug mode", meaning each model is trained for only one epoch. Use this mode to check that everything works before launching longer training runs. To run a full training, change the parameter --debug=True to --debug=False.
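
For example, assuming run_experiment.sh simply forwards these flags to main.py, the relevant line for a full (non-debug) training on CPU would look roughly like:

python main.py --device='cpu' --debug=False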
