# Implementation for Improving Clinical Outcome Predictions Using Convolution over Medical Entities with Multimodal Learning
- Set up your conda virtual environment from `environments.yaml`, which lists the dependencies (Jupyter notebooks, TensorFlow, spaCy, Gensim, ...). Run:

  ```
  conda env create --name your_env_name_here --file=environments.yaml
  conda activate your_env_name_here
  ```
- If you have issues installing any of the packages via the conda command above, use the scripts in the `package-installer-helpers` directory:
  - `install-all.sh` installs all dependencies manually, without the `environments.yaml` file; it encompasses all of the commands from the scripts below.
  - `install-biobert-embedding.sh` installs the BioBERT embedding model dependency.
  - `install-glove.sh` installs the GloVe dependency.
  - `install-pip-dependencies.sh` installs all other pip dependencies.
- Install the Med7 model:

  ```
  wget https://www.dropbox.com/s/xbgsy6tyctvrqz3/en_core_med7_lg.tar.gz?dl=1
  pip install /path/to/downloaded/spacy2_model
  ```
- Download the pertinent MIMIC-III data (the MIMIC-Extract output) from the Google Cloud Storage bucket below. This is a browser console link that requires credentialed access, so open it in a browser rather than fetching it with wget:

  https://console.cloud.google.com/storage/browser/mimic_extract
- Download the pre-trained FastText model (a Google Drive folder, so download it through the browser):

  https://drive.google.com/drive/folders/1bcR6ThMEPhguU9T4qPcPaZJ3GQzhLKlz?usp=sharing
- Download the pre-trained Word2Vec model (a Google Drive file, so download it through the browser):

  https://drive.google.com/file/d/14EOqvvjJ8qUxihQ_SFnuRsjK9pOTrP-6/view
- Install the BioBERT model dependencies (you will pass the path to the extracted files as a string to the `Biobert` class object):

  ```
  wget https://www.dropbox.com/s/hvsemunmv0htmdk/biobert_v1.1_pubmed_pytorch_model.tar.gz
  tar -xvzf biobert_v1.1_pubmed_pytorch_model.tar.gz
  ```

  You will need to move the extracted files to `/home/ubuntu/biobertmodel`.
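  For orientation, a minimal sketch of how that path is consumed downstream, assuming the `BiobertEmbedding` class from the `biobert_embedding` package linked at the bottom of this README (the `model_path` keyword follows that repo; verify against the notebook's `Biobert` wrapper):

  ```python
  # Sketch only: class and methods follow the Overfitter/biobert_embedding package.
  from biobert_embedding.embedding import BiobertEmbedding

  # Point the model at the directory holding the extracted BioBERT files.
  biobert = BiobertEmbedding(model_path='/home/ubuntu/biobertmodel')

  text = 'the patient was started on aspirin'
  word_vectors = biobert.word_vector(text)         # one 768-d vector per token
  sentence_vector = biobert.sentence_vector(text)  # single pooled 768-d vector
  print(len(word_vectors), len(sentence_vector))
  ```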
- Clone the code locally:

  ```
  git clone https://github.com/sidmeister/cs-598-dlh-team87.git
  cd cs-598-dlh-team87
  ```
- Run the prerequisites as described in the section above.
- Copy the output file of the MIMIC-Extract pipeline, named `all_hourly_data.h5`, to the `data` folder.
- Run `01-Extract-Timseries-Features.ipynb` to extract the first 24 hours of timeseries features from the MIMIC-Extract raw data.
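  If you want to sanity-check the MIMIC-Extract output before running the notebook, the `.h5` file is a pandas HDF store. A quick sketch (the `patients` and `vitals_labs` keys are the standard MIMIC-Extract table names; verify them against your copy with `HDFStore.keys()`):

  ```python
  import pandas as pd

  # 'patients' holds static demographics/outcomes; 'vitals_labs' holds the
  # hourly timeseries. Both are standard MIMIC-Extract keys.
  statics = pd.read_hdf('data/all_hourly_data.h5', 'patients')
  vitals = pd.read_hdf('data/all_hourly_data.h5', 'vitals_labs')
  print(statics.shape, vitals.shape)
  ```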
- Copy the `ADMISSIONS.csv`, `NOTEEVENTS.csv`, and `ICUSTAYS.csv` files into the `data` folder.
- Run `02-Select-SubClinicalNotes.ipynb` to select sub-notes from all the MIMIC-III notes based on the selection criteria.
- Run `03-Prprocess-Clinical-Notes.ipynb` to preprocess the notes.
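  The cleaning broadly follows the ClinicalNotesICU preprocessing script linked at the bottom of this README; the exact rules live in the notebook. An illustrative sketch of the kind of normalization applied:

  ```python
  import re

  def clean_note(text):
      """Lowercase, drop MIMIC de-identification placeholders, collapse whitespace."""
      text = text.lower()
      text = re.sub(r'\[\*\*.*?\*\*\]', ' ', text)  # de-id spans like [**2101-10-20**]
      text = re.sub(r'[0-9]+\.', ' ', text)         # numbered-list markers
      text = re.sub(r'\s+', ' ', text)
      return text.strip()

  print(clean_note('Admission Date: [**2101-10-20**]  1. Aspirin 81mg daily'))
  ```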
- Run `04-Apply-med7-on-Clinical-Notes.ipynb` to extract medical entities.
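  The core of this step uses the Med7 model installed in the prerequisites (`en_core_med7_lg` is the package name from the Dropbox tarball above):

  ```python
  import spacy

  # Load the Med7 spaCy model installed earlier via pip.
  med7 = spacy.load('en_core_med7_lg')

  doc = med7('Patient was given 40 mg of furosemide IV twice daily.')

  # Med7 tags drugs, strengths, dosages, routes, frequencies, forms, durations.
  print([(ent.text, ent.label_) for ent in doc.ents])
  # e.g. [('40 mg', 'DOSAGE'), ('furosemide', 'DRUG'), ('IV', 'ROUTE'), ...]
  ```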
- Unzip `embeddings.zip` into the `embeddings` folder.
- Run `05-Represent-Entities-With-Different-Embeddings.ipynb`. This notebook does the following:
  - converts the medical entities into word-vector representations (sketched below);
  - prepares the timeseries data to be fed through the GRU / LSTM.
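  A condensed sketch of the entity-to-vector conversion, assuming a Gensim-saved Word2Vec file (the path is a placeholder for whatever you extracted into `embeddings`; depending on the file's format you may need `KeyedVectors.load_word2vec_format` instead, and `dim` must match the embedding size):

  ```python
  import numpy as np
  from gensim.models import Word2Vec

  # Placeholder path; the FastText file works the same way through gensim.
  w2v = Word2Vec.load('embeddings/word2vec.model').wv

  def entity_vector(entity, dim=100):
      """Average the vectors of an entity's tokens; zeros if none are in vocab."""
      vecs = [w2v[tok] for tok in entity.lower().split() if tok in w2v]
      return np.mean(vecs, axis=0) if vecs else np.zeros(dim)

  print(entity_vector('furosemide').shape)
  ```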
- Run `05.5_biobert_embedding.ipynb` to generate the BioBERT embedding vectors.
- Run `06-Create-Timeseries-Data.ipynb` to generate the appropriate IDs to run through the baseline model.
- Run `07-Timeseries-Baseline.ipynb` to run the timeseries baseline models, LSTM and GRU, across 128 and 256 dimensionalities of the RNN output space. This notebook requires a `results/timeseries-baseline` directory to be created first.
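  For orientation, a minimal Keras version of this baseline (the 128/256 sizes match the variants above; the 24x104 input shape is an assumption about the extracted hourly features, and the dropout is illustrative):

  ```python
  from tensorflow.keras import Input, layers, models

  def build_rnn_baseline(cell='GRU', units=128, n_hours=24, n_features=104):
      """Binary classifier over the first-24-hour timeseries features."""
      rnn = layers.GRU if cell == 'GRU' else layers.LSTM
      model = models.Sequential([
          Input(shape=(n_hours, n_features)),
          rnn(units),
          layers.Dropout(0.2),
          layers.Dense(1, activation='sigmoid'),  # one clinical task at a time
      ])
      model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
      return model

  build_rnn_baseline('LSTM', 256).summary()
  ```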
- Run `08-Multimodal-Baseline.ipynb` to train the baseline multimodal model. This model trains with every embedding type (concat, word2vec, fasttext, and biobert) to predict 4 different clinical tasks (hosp_mort, icu_mort, los_3, los_7). This notebook requires a `results/multimodal-baseline` directory to be created first.
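  A sketch of the two-branch idea behind this notebook, joining the timeseries GRU with a dense branch over the averaged entity embedding (the 768 width matches BioBERT; other embedding types have different widths, and the layer sizes are illustrative):

  ```python
  from tensorflow.keras import Input, layers, models

  def build_multimodal(units=256, n_hours=24, n_features=104, emb_dim=768):
      """GRU over timeseries + dense branch over the averaged entity embedding."""
      ts_in = Input(shape=(n_hours, n_features), name='timeseries')
      emb_in = Input(shape=(emb_dim,), name='avg_entity_embedding')

      ts_branch = layers.GRU(units)(ts_in)
      emb_branch = layers.Dense(128, activation='relu')(emb_in)

      merged = layers.concatenate([ts_branch, emb_branch])
      out = layers.Dense(1, activation='sigmoid')(merged)

      model = models.Model([ts_in, emb_in], out)
      model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
      return model
  ```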
- Run `09-Proposed-Model.ipynb` to run the proposed model, which predicts the same 4 clinical tasks (hosp_mort, icu_mort, los_3, los_7). This notebook requires a `results/cnn` directory to be created first.
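  A sketch of the convolution-over-entities branch that replaces the averaging step of the multimodal baseline (the 64-entity height matches the `64-basiccnn1d-*` model names listed below; the filter count and kernel size are illustrative):

  ```python
  from tensorflow.keras import Input, layers, models

  def build_proposed(units=256, n_hours=24, n_features=104,
                     max_entities=64, emb_dim=768):
      """GRU over timeseries + 1D CNN over the (entities x embedding) matrix."""
      ts_in = Input(shape=(n_hours, n_features), name='timeseries')
      ent_in = Input(shape=(max_entities, emb_dim), name='entity_matrix')

      ts_branch = layers.GRU(units)(ts_in)

      # Convolve along the entity axis, then pool to a fixed-size vector.
      conv = layers.Conv1D(64, kernel_size=3, activation='relu')(ent_in)
      ent_branch = layers.GlobalMaxPooling1D()(conv)

      merged = layers.concatenate([ts_branch, ent_branch])
      out = layers.Dense(1, activation='sigmoid')(merged)

      model = models.Model([ts_in, ent_in], out)
      model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['AUC'])
      return model
  ```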
The notebook files below perform training and write their evaluation results to disk. Note that the fourth step (`10-Summary.ipynb`) is what actually surfaces the evaluation results of the trained models.
- Run `07-Timeseries-Baseline.ipynb` to run and evaluate the timeseries baseline model on the 4 clinical tasks.
- Run `08-Multimodal-Baseline.ipynb` to run and evaluate the multimodal baseline.
- Run `09-Proposed-Model.ipynb` to run and evaluate the proposed model on the 4 clinical tasks (hosp_mort, icu_mort, los_3, los_7).
- Run `10-Summary.ipynb` to display the results of each model.
`pretrained-models` is the directory for the models that we generated. The models are named in the following formats:

- `(GRU|LSTM)-(128|256)-problem_type*-best_model.hdf5`: models generated from `07-Timeseries-Baseline.ipynb`.
  - 128|256 denotes the GRU/LSTM output size
  - problem_type: mort_hosp, mort_icu, los_3, los_7
- `avg-embedding_type*-problem_type*-best_model.hdf5`: models generated from `08-Multimodal-Baseline.ipynb`.
  - problem_type: mort_hosp, mort_icu, los_3, los_7
  - embedding_type: fasttext, concat, biobert, word2vec
- `64-basiccnn1d-embedding_type*-problem_type*-best_model.hdf5`: models generated from `09-Proposed-Model.ipynb`.
  - 64 denotes the max height of the CNN input image
  - problem_type: mort_hosp, mort_icu, los_3, los_7
  - embedding_type: fasttext, concat, biobert, word2vec
- Original paper: https://www.sciencedirect.com/science/article/pii/S0933365721001056?via%3Dihub
- Original paper's repository: https://github.com/tanlab/ConvolutionMedicalNer
- MIMIC-III dataset: https://mimic.physionet.org/
- MIMIC-Extract implementation: https://github.com/MLforHealth/MIMIC_Extract
- med7 implementation: https://github.com/kormilitzin/med7
- Pre-trained Word2Vec & FastText embeddings: https://github.com/kexinhuang12345/clinicalBERT
- Preprocessing script: https://github.com/kaggarwal/ClinicalNotesICU
- BioBERT embedding repo: https://github.com/Overfitter/biobert_embedding