cs-598-dlh-team87

Prerequisites

Environment Setup

  1. Set up your conda virtual environment:

    Use environments.yaml to install the dependencies for the conda environment (Jupyter notebooks, TensorFlow, spaCy, Gensim, ...)

    Run:

    conda env create --name your_env_name_here --file=environments.yaml
    conda activate your_env_name_here

  2. If any of the packages fail to install via the conda command above, use the scripts in the package-installer-helpers directory:

    • install-all.sh installs all dependencies manually, without the environments.yaml file
      • This script combines the commands from the three scripts below
    • install-biobert-embedding.sh installs the biobert embedding model dependency
    • install-glove.sh installs the GloVe dependency
    • install-pip-dependencies.sh installs the remaining pip dependencies
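Once the environment is created and activated, a quick stdlib-only check can confirm that the key packages resolve before launching the notebooks. The helper below is a hypothetical convenience, not part of the repo:

```python
# Hypothetical sanity check: list which required packages are missing from the
# active environment, without importing heavyweight modules.
import importlib.util

def missing_packages(names):
    """Return the package names that cannot be found in this environment."""
    return [n for n in names if importlib.util.find_spec(n) is None]

# Packages named in this README's environment description
required = ["tensorflow", "spacy", "gensim", "jupyter"]
print(missing_packages(required))  # an empty list means the env looks complete
```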

Data downloads

  1. Download the Med7 model and install it with pip (quote the URL so the shell does not mangle the query string):
    wget "https://www.dropbox.com/s/xbgsy6tyctvrqz3/en_core_med7_lg.tar.gz?dl=1"
    pip install /path/to/downloaded/spacy2_model

  2. Download the pertinent MIMIC-III data. The link below is a Cloud Storage console page, so open it in a browser rather than fetching it with wget:
    https://console.cloud.google.com/storage/browser/mimic_extract;tab=objects?prefix=&forceOnObjectsSortingFiltering=false?

  3. Download the pre-trained fasttext model. The link below is a Google Drive folder, so open it in a browser rather than fetching it with wget:
    https://drive.google.com/drive/folders/1bcR6ThMEPhguU9T4qPcPaZJ3GQzhLKlz?usp=sharing

  4. Download the pre-trained word2vec model. The link below is a Google Drive view page, so open it in a browser rather than fetching it with wget:
    https://drive.google.com/file/d/14EOqvvjJ8qUxihQ_SFnuRsjK9pOTrP-6/view

  5. Download the biobert model dependencies (you will pass the path of the extracted biobert files as a string to the Biobert class object):
    wget https://www.dropbox.com/s/hvsemunmv0htmdk/biobert_v1.1_pubmed_pytorch_model.tar.gz
    tar -xvzf biobert_v1.1_pubmed_pytorch_model.tar.gz
    Move the extracted files to /home/ubuntu/biobertmodel.
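The BioBERT extraction step can also be done from Python. This is a sketch with a hypothetical helper; the archive and destination names mirror this README and should be adjusted to your machine:

```python
# Extract a .tar.gz archive into a destination directory and list the results.
# Archive/destination names below come from this README; adjust as needed.
import tarfile
from pathlib import Path

def extract_to(archive, dest):
    """Extract `archive` into `dest`, creating `dest` if necessary."""
    dest = Path(dest)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive, "r:gz") as tar:
        tar.extractall(dest)
    return sorted(p.name for p in dest.iterdir())

# extract_to("biobert_v1.1_pubmed_pytorch_model.tar.gz", "/home/ubuntu/biobertmodel")
```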

Training Code

Pre-requisites to start training code

  1. Clone the repository, change into it, and complete the Environment Setup and Data downloads steps described above:
    git clone https://github.com/sidmeister/cs-598-dlh-team87.git
    cd cs-598-dlh-team87

  2. Copy the MIMIC-Extract pipeline output file, all_hourly_data.h5, into the data folder.

  3. Run 01-Extract-Timseries-Features.ipynb to extract the first 24 hours of timeseries features from the MIMIC-Extract raw data.

  4. Copy the ADMISSIONS.csv, NOTEEVENTS.csv, and ICUSTAYS.csv files into the data folder.

  5. Run 02-Select-SubClinicalNotes.ipynb to select the subset of notes meeting the criteria from all MIMIC-III notes.

  6. Run 03-Prprocess-Clinical-Notes.ipynb to preprocess the notes.

  7. Run 04-Apply-med7-on-Clinical-Notes.ipynb to extract medical entities.

  8. Unzip embeddings.zip into embeddings folder.

  9. Run 05-Represent-Entities-With-Different-Embeddings.ipynb. This notebook does the following:

    1. Converts medical entities into word representations.
    2. Prepares the timeseries data to be fed through the GRU / LSTM models.
  10. Run 05.5_biobert_embedding.ipynb to generate the biobert embedding vectors.

  11. Run 06-Create-Timeseries-Data.ipynb to generate the appropriate ids to run in the baseline model.
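As a sketch of what step 9's second action involves: each ICU stay's first 24 hours of hourly features are padded or truncated to a fixed-length sequence before being fed to the GRU / LSTM. This stand-in is illustrative only (the feature values are made up; the real notebooks use the repo's own data structures):

```python
# Pad or truncate a patient's hourly feature rows to exactly n_hours rows so
# every sample has the same (timesteps, features) shape for the RNN.
def to_sequence(hourly_rows, n_hours=24, pad_value=0.0):
    n_features = len(hourly_rows[0])
    rows = [list(r) for r in hourly_rows[:n_hours]]   # truncate past n_hours
    while len(rows) < n_hours:                        # pad short stays
        rows.append([pad_value] * n_features)
    return rows

# Made-up example: 2 observed hours of (temperature, heart rate)
seq = to_sequence([[98.6, 80.0], [99.1, 82.0]])
# len(seq) == 24; hours 3..24 are zero-padded
```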

Training Code

The notebooks below train the models and write their evaluation results to disk.

  1. Run 07-Timeseries-Baseline.ipynb to run the timeseries baseline models, LSTM and GRU, with output-space dimensionalities of 128 and 256. This notebook requires a /results/timeseries-baseline directory to exist.

  2. Run 08-Multimodal-Baseline.ipynb to generate the baseline multi-modal model. This model trains with all embedding types (concat, word2vec, fasttext, and biobert) to predict 4 clinical tasks (hosp_mort, icu_mort, los_3, los_7). This notebook requires a /results/multimodal-baseline directory to exist.

  3. Run 09-Proposed-Model.ipynb to run the proposed model on the same 4 clinical tasks (hosp_mort, icu_mort, los_3, los_7). This notebook requires a /results/cnn directory to exist.
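Since each notebook above expects its results directory to exist, a small setup snippet can create all three up front. The paths are taken from this README and assumed to be relative to the repository root:

```python
# Create the output directories the three training notebooks expect.
from pathlib import Path

for d in ("results/timeseries-baseline", "results/multimodal-baseline", "results/cnn"):
    Path(d).mkdir(parents=True, exist_ok=True)
```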

Evaluation Code

These are the same notebooks as in the training section; they write evaluation results to disk as they run. Step 4 is the critical one for summarizing the evaluation results of the trained models.

  1. Run 07-Timeseries-Baseline.ipynb to run and evaluate the timeseries baseline model on the 4 clinical tasks.

  2. Run 08-Multimodal-Baseline.ipynb to run and evaluate the multi-modal baseline.

  3. Run 09-Proposed-Model.ipynb to run and evaluate the proposed model to predict 4 different clinical tasks (hosp_mort, icu_mort, los_3, los_7).

  4. Run 10-Summary.ipynb to display results of each model.
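For intuition, the kind of per-model, per-task summary that 10-Summary.ipynb displays can be sketched as follows. The helper, the scores, and the layout here are illustrative only, not actual results from the notebooks:

```python
# Format (model, task, score) rows into aligned summary lines.
def summarize(rows):
    return "\n".join(f"{model:<12} {task:<10} {score:.3f}" for model, task, score in rows)

print(summarize([
    ("GRU-256", "mort_hosp", 0.85),   # made-up scores
    ("proposed", "mort_hosp", 0.87),
]))
```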

Pretrained Models

pretrained-models is the directory containing the models we generated.
The model files use the following naming formats:

  • (GRU|LSTM)-(128|256)-problem_type*-best_model.hdf5: These are the models generated from 07-Timeseries-Baseline.ipynb
    • 128|256 denotes the dimensionality of the GRU|LSTM output space
    • problem_type: mort_hosp, mort_icu, los_3, los_7
  • avg-embedding_type*-problem_type*-best_model.hdf5: These are models generated from 08-Multimodal-Baseline.ipynb
    • problem_type: mort_hosp, mort_icu, los_3, los_7
    • embedding_type: fasttext, concat, biobert, word2vec
  • 64-basiccnn1d-embedding_type*-problem_type*-best_model.hdf5: These are the models generated from 09-Proposed-Model.ipynb
    • 64 denotes the max height of the CNN image size
    • problem_type: mort_hosp, mort_icu, los_3, los_7
    • embedding_type: fasttext, concat, biobert, word2vec
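The naming scheme above is regular enough to parse mechanically. This hypothetical helper (not part of the repo) turns a pretrained-model filename into its components, with one regex per pattern listed above:

```python
# Parse pretrained-model filenames per the three naming formats in this README.
import re

PATTERNS = [
    re.compile(r"^(?P<rnn>GRU|LSTM)-(?P<size>128|256)-"
               r"(?P<task>mort_hosp|mort_icu|los_3|los_7)-best_model\.hdf5$"),
    re.compile(r"^avg-(?P<emb>fasttext|concat|biobert|word2vec)-"
               r"(?P<task>mort_hosp|mort_icu|los_3|los_7)-best_model\.hdf5$"),
    re.compile(r"^64-basiccnn1d-(?P<emb>fasttext|concat|biobert|word2vec)-"
               r"(?P<task>mort_hosp|mort_icu|los_3|los_7)-best_model\.hdf5$"),
]

def parse_model_name(name):
    """Return the filename's components as a dict, or None if it doesn't match."""
    for pat in PATTERNS:
        m = pat.match(name)
        if m:
            return m.groupdict()
    return None

# parse_model_name("GRU-128-mort_hosp-best_model.hdf5")
# -> {"rnn": "GRU", "size": "128", "task": "mort_hosp"}
```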

Results

Below are the results from training the original model.

(Result figures omitted.)

References

Original paper: https://www.sciencedirect.com/science/article/pii/S0933365721001056?via%3Dihub

Original paper's repository: https://github.com/tanlab/ConvolutionMedicalNer

MIMIC-III dataset download: https://mimic.physionet.org/

MIMIC-Extract implementation: https://github.com/MLforHealth/MIMIC_Extract

med7 implementation: https://github.com/kormilitzin/med7

Pre-trained Word2Vec & FastText embeddings download: https://github.com/kexinhuang12345/clinicalBERT

Preprocessing script: https://github.com/kaggarwal/ClinicalNotesICU

BioBERT embedding repository: https://github.com/Overfitter/biobert_embedding

Contributors

watersnoopy, sidmeister

