
This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (https://dl.acm.org/doi/10.1145/3450439.3451860).

Home Page: https://dl.acm.org/doi/10.1145/3450439.3451860

License: MIT License


tpc-los-prediction's Introduction


Patient Outcome Prediction with TPC Networks

This repository contains the code used for Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit (published at ACM CHIL 2021) and implementation instructions. You can watch a brief project talk here:

Watch the video

Citation

If you use this code or the models in your research, please cite the following:

@inproceedings{rocheteau2021,
author = {Rocheteau, Emma and Li\`{o}, Pietro and Hyland, Stephanie},
title = {Temporal Pointwise Convolutional Networks for Length of Stay Prediction in the Intensive Care Unit},
year = {2021},
isbn = {9781450383592},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3450439.3451860},
doi = {10.1145/3450439.3451860},
booktitle = {Proceedings of the Conference on Health, Inference, and Learning},
pages = {58–68},
numpages = {11},
keywords = {intensive care unit, length of stay, temporal convolution, mortality, patient outcome prediction},
location = {Virtual Event, USA},
series = {CHIL '21}
}

Motivation

The pressure of ever-increasing patient demand and budget restrictions make hospital bed management a daily challenge for clinical staff. Most critical is the efficient allocation of resource-heavy Intensive Care Unit (ICU) beds to the patients who need life support. Central to solving this problem is knowing for how long the current set of ICU patients are likely to stay in the unit. In this work, we propose a new deep learning model based on the combination of temporal convolution and pointwise (1x1) convolution, to solve the length of stay prediction task on the eICU and MIMIC-IV critical care datasets. The model – which we refer to as Temporal Pointwise Convolution (TPC) – is specifically designed to mitigate common challenges with Electronic Health Records, such as skewness, irregular sampling and missing data. In doing so, we have achieved significant performance benefits of 18-68% (metric and dataset dependent) over the commonly used Long Short-Term Memory (LSTM) network, and the multi-head self-attention network known as the Transformer. By adding mortality prediction as a side-task, we can improve performance further still, resulting in a mean absolute deviation of 1.55 days (eICU) and 2.28 days (MIMIC-IV) on predicting remaining length of stay.

Headline Results

Length of Stay Prediction

We report on the following metrics:

  • Mean absolute deviation (MAD)
  • Mean absolute percentage error (MAPE)
  • Mean squared error (MSE)
  • Mean squared log error (MSLE)
  • Coefficient of determination (R2)
  • Cohen's kappa score (Harutyunyan et al. 2019)

For the first four metrics, lower is better. For the last two, higher is better.
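
For reference, here is a minimal sketch of how these metrics can be computed, assuming NumPy arrays of true and predicted LoS in days (the kappa bucket edges follow Harutyunyan et al. 2019; the repository's exact implementation may differ, e.g. in clipping):

    import numpy as np
    from sklearn.metrics import cohen_kappa_score

    def los_metrics(y_true, y_pred):
        y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
        mad = np.mean(np.abs(y_true - y_pred))
        mape = 100 * np.mean(np.abs((y_true - y_pred) / y_true))  # assumes y_true > 0
        mse = np.mean((y_true - y_pred) ** 2)
        msle = np.mean((np.log1p(y_true) - np.log1p(y_pred)) ** 2)
        r2 = 1 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)
        # Kappa needs discrete classes: bin LoS into ten buckets
        # (<1, 1-2, ..., 7-8, 8-14, >14 days) and use linear weights.
        bins = [1, 2, 3, 4, 5, 6, 7, 8, 14]
        kappa = cohen_kappa_score(np.digitize(y_true, bins),
                                  np.digitize(y_pred, bins), weights='linear')
        return {'MAD': mad, 'MAPE': mape, 'MSE': mse, 'MSLE': msle,
                'R2': r2, 'Kappa': kappa}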

eICU

| Model       | MAD       | MAPE      | MSE      | MSLE      | R2        | Kappa     |
|-------------|-----------|-----------|----------|-----------|-----------|-----------|
| Mean*       | 3.21      | 395.7     | 29.5     | 2.87      | 0.00      | 0.00      |
| Median*     | 2.76      | 184.4     | 32.6     | 2.15      | -0.11     | 0.00      |
| LSTM        | 2.39±0.00 | 118.2±1.1 | 26.9±0.1 | 1.47±0.01 | 0.09±0.00 | 0.28±0.00 |
| CW LSTM     | 2.37±0.00 | 114.5±0.4 | 26.6±0.1 | 1.43±0.00 | 0.10±0.00 | 0.30±0.00 |
| Transformer | 2.36±0.00 | 114.1±0.6 | 26.7±0.1 | 1.43±0.00 | 0.09±0.00 | 0.30±0.00 |
| TPC         | 1.78±0.02 | 63.5±4.3  | 21.7±0.5 | 0.70±0.03 | 0.27±0.02 | 0.58±0.01 |

Our model (TPC) significantly outperforms all baselines by large margins. *The mean and median "models" always predict 3.47 and 1.67 days respectively (the mean and median of the training set).

MIMIC-IV

Please note that this is not the same cohort as used in Harutyunyan et al. 2019: they use the older MIMIC-III database, whereas I have developed my own MIMIC-IV preprocessing pipeline to closely match that of eICU.

| Model       | MAD       | MAPE      | MSE      | MSLE      | R2        | Kappa     |
|-------------|-----------|-----------|----------|-----------|-----------|-----------|
| Mean*       | 5.24      | 474.9     | 77.7     | 2.80      | 0.00      | 0.00      |
| Median*     | 4.60      | 216.8     | 86.8     | 2.09      | -0.12     | 0.00      |
| LSTM        | 3.68±0.02 | 107.2±3.1 | 65.7±0.7 | 1.26±0.01 | 0.15±0.01 | 0.43±0.01 |
| CW LSTM     | 3.68±0.02 | 107.0±1.8 | 66.4±0.6 | 1.23±0.01 | 0.15±0.01 | 0.43±0.00 |
| Transformer | 3.62±0.02 | 113.8±1.8 | 63.4±0.5 | 1.21±0.01 | 0.18±0.01 | 0.45±0.00 |
| TPC         | 2.39±0.03 | 47.6±1.4  | 46.3±1.3 | 0.39±0.02 | 0.40±0.02 | 0.78±0.01 |

*The mean and median "models" always predict 5.70 and 2.70 days respectively (the mean and median of the training set).

Mortality Prediction

We report on the following metrics:

  • Area under the receiver operating characteristic curve (AUROC)
  • Area under the precision recall curve (AUPRC)

For both metrics, higher is better.

eICU

| Model       | AUROC       | AUPRC       |
|-------------|-------------|-------------|
| LSTM        | 0.849±0.002 | 0.407±0.012 |
| CW LSTM     | 0.855±0.001 | 0.464±0.004 |
| Transformer | 0.851±0.002 | 0.454±0.005 |
| TPC         | 0.864±0.001 | 0.508±0.005 |

MIMIC-IV

| Model       | AUROC       | AUPRC       |
|-------------|-------------|-------------|
| LSTM        | 0.895±0.001 | 0.657±0.003 |
| CW LSTM     | 0.897±0.002 | 0.650±0.005 |
| Transformer | 0.890±0.002 | 0.641±0.008 |
| TPC         | 0.905±0.001 | 0.691±0.006 |

Multitask Prediction

These are the results when the model is trained to solve length of stay and mortality at the same time.

eICU

| Model       | AUROC       | AUPRC       | MAD       | MAPE      | MSE      | MSLE      | R2        | Kappa     |
|-------------|-------------|-------------|-----------|-----------|----------|-----------|-----------|-----------|
| LSTM        | 0.852±0.003 | 0.436±0.007 | 2.40±0.01 | 116.5±0.8 | 27.2±0.2 | 1.47±0.01 | 0.08±0.01 | 0.28±0.01 |
| CW LSTM     | 0.865±0.002 | 0.490±0.007 | 2.37±0.00 | 115.0±0.7 | 26.8±0.1 | 1.44±0.00 | 0.09±0.00 | 0.30±0.00 |
| Transformer | 0.858±0.001 | 0.475±0.004 | 2.36±0.00 | 114.2±0.7 | 26.6±0.1 | 1.43±0.00 | 0.10±0.00 | 0.30±0.00 |
| TPC         | 0.865±0.002 | 0.523±0.006 | 1.55±0.01 | 46.4±2.6  | 18.7±0.2 | 0.40±0.02 | 0.37±0.01 | 0.70±0.00 |

MIMIC-IV

| Model       | AUROC       | AUPRC       | MAD       | MAPE      | MSE      | MSLE      | R2        | Kappa     |
|-------------|-------------|-------------|-----------|-----------|----------|-----------|-----------|-----------|
| LSTM        | 0.896±0.002 | 0.659±0.004 | 3.66±0.01 | 106.8±2.7 | 65.3±0.6 | 1.25±0.01 | 0.16±0.01 | 0.44±0.00 |
| CW LSTM     | 0.899±0.002 | 0.654±0.003 | 3.69±0.02 | 107.2±1.6 | 66.3±0.6 | 1.23±0.01 | 0.15±0.01 | 0.44±0.00 |
| Transformer | 0.898±0.001 | 0.656±0.005 | 3.61±0.01 | 112.3±2.0 | 63.3±0.3 | 1.20±0.01 | 0.19±0.00 | 0.45±0.00 |
| TPC         | 0.918±0.002 | 0.713±0.007 | 2.28±0.07 | 32.4±1.2  | 42.0±1.2 | 0.19±0.00 | 0.46±0.02 | 0.85±0.00 |

Pre-Processing Instructions

eICU

  1. To run the SQL files you must have the eICU database set up: https://physionet.org/content/eicu-crd/2.0/.

  2. Follow the instructions: https://eicu-crd.mit.edu/tutorials/install_eicu_locally/ to ensure the correct connection configuration.

  3. Replace the eICU_path in paths.json with a convenient location on your computer, and do the same in eICU_preprocessing/create_all_tables.sql using find and replace for '/Users/emmarocheteau/PycharmProjects/TPC-LoS-prediction/eICU_data/'. Leave the extra '/' at the end (a sketch of paths.json follows these steps).

  4. In your terminal, navigate to the project directory, then type the following commands:

    psql 'dbname=eicu user=eicu options=--search_path=eicu'
    

    Inside the psql console:

    \i eICU_preprocessing/create_all_tables.sql
    

    This step might take a couple of hours.

    To quit the psql console:

    \q
    
  5. Then run the pre-processing scripts in your terminal. This will need to run overnight:

    python3 -m eICU_preprocessing.run_all_preprocessing
    

MIMIC-IV

  1. To run the SQL files you must have the MIMIC-IV database set up: https://physionet.org/content/mimiciv/0.4/.

  2. The officially recommended way to access MIMIC-IV is via BigQuery: https://mimic-iv.mit.edu/docs/access/bigquery/. Personally, I did not find it easy to store the necessary views, and there is a 1 GB size limit on the data you can download in the free tier, which is less than this pipeline uses (the largest file to extract is timeseries.csv, which is 4.49 GB). However, if you do wish to use BigQuery, note that you will have to make minor modifications to the code, e.g. you would need to replace a reference to the table patients with physionet-data.mimic_core.patients (see the SQL sketch after these steps).

    Alternatively, you can follow the instructions to set up the full database. The instructions for the previous version of MIMIC (MIMIC-III) are here: https://mimic.physionet.org/tutorials/install-mimic-locally-ubuntu/ for Unix systems, or here: https://mimic.physionet.org/tutorials/install-mimic-locally-windows/ for Windows. You will need to change the mimiciii schema to mimiciv and use the files in https://github.com/EmmaRocheteau/MIMIC-IV-Postgres in place of the files in https://github.com/MIT-LCP/mimic-code/tree/master/buildmimic/postgres (referenced in the instructions). Additionally, you may find this resource helpful: https://github.com/MIT-LCP/mimic-iv/tree/master/buildmimic/postgres, which was still in the process of being updated as of November 2020.

  3. Once you have a database connection, replace the MIMIC_path in paths.json with a convenient location on your computer, and do the same in MIMIC_preprocessing/create_all_tables.sql using find and replace for '/Users/emmarocheteau/PycharmProjects/TPC-LoS-prediction/MIMIC_data/'. Leave the extra '/' at the end.

  4. If you have set up the database on your local computer, you can navigate to the project directory in your terminal, then type the following commands:

    psql 'dbname=mimic user=mimicuser options=--search_path=mimiciv'
    

    Inside the psql console:

    \i MIMIC_preprocessing/create_all_tables.sql
    

    This step might take a couple of hours.

    To quit the psql console:

    \q
    
  5. Then run the pre-processing scripts in your terminal. This will need to run overnight:

    python3 -m MIMIC_preprocessing.run_all_preprocessing
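
Regarding the BigQuery modification mentioned in step 2 above: a minimal sketch of the kind of table-reference change needed (the column names here are illustrative):

    -- Local PostgreSQL, as assumed by the repository's scripts:
    select subject_id, gender, anchor_age from patients;

    -- BigQuery: fully qualify the table with project and dataset.
    select subject_id, gender, anchor_age from `physionet-data.mimic_core.patients`;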
    

Running the models

  1. Once you have run the pre-processing steps, you can run all the models in your terminal. Set the working directory to TPC-LoS-prediction, and run the following:

    python3 -m models.run_tpc
    

    Note that your experiment can be customised using command line arguments, e.g.:

    python3 -m models.run_tpc --dataset eICU --task LoS --model_type tpc --n_layers 4 --kernel_size 3 --no_temp_kernels 10 --point_size 10 --last_linear_size 20 --diagnosis_size 20 --batch_size 64 --learning_rate 0.001 --main_dropout_rate 0.3 --temp_dropout_rate 0.1 
    

    Each experiment you run will create a directory within models/experiments. The naming of the directory is based on the date and time that you ran the experiment (to ensure that there are no name clashes). The experiments are saved in the standard trixi format: https://trixi.readthedocs.io/en/latest/_api/trixi.experiment.html.

  2. The hyperparameter searches can be replicated by running:

    python3 -m models.hyperparameter_scripts.eICU.tpc
    

    Trixi provides a useful way to visualise the effects of the hyperparameters (after running the following command, navigate to http://localhost:8080 in your browser):

    python3 -m trixi.browser --port 8080 models/experiments/hyperparameters/eICU/TPC
    

    The final experiments for the paper are found in models/final_experiment_scripts e.g.:

    python3 -m models.final_experiment_scripts.eICU.LoS.tpc
    

References

Hrayr Harutyunyan, Hrant Khachatrian, David C. Kale, Greg Ver Steeg, and Aram Galstyan. Multitask Learning and Benchmarking with Clinical Time Series Data. Scientific Data, 6(96), 2019.


tpc-los-prediction's Issues

Question regarding masked datafields in timeseries.csv processed file

Hello Emma,

I ran the preprocessing scripts on the original eICU dataset and noticed that the data fields in the timeseries.csv file have a "_mask" suffix, e.g. "temperature_mask", "total protein_mask".
Can you please help me understand the reason behind creating masked data fields in the processed timeseries.csv file?

Best,
Kinara Pandya

some conceptual questions about temp_pointwise

Hi Emma,

I have some conceptual questions regarding the temp_pointwise implementation. I have marked 3 steps in the source code below for my questions; the comments are my understanding, and the four uncommented lines are extracted from your source code.

def temp_pointwise(...):
  ...
  # temp_skip(batch_size, ts_feature_value_dim, ts_feature_conv_dim+1, n_measure_of_patient)
  # temp_skip is combination of temporal convolution and skip connection. Each ts_feature_conv_dim(12 values in 1 layer) values are 
  # concatenated with a feature value from skip connection.
  # step 1
  temp_skip = cat((point_skip.unsqueeze(2),  # B * (F + Zt) * 1 * T
                         X_temp.view(B, point_skip.shape[1], temp_kernels, T)),  # B * (F + Zt) * temp_kernels * T
                        dim=2)  # B * (F + Zt) * (1 + temp_kernels) * T

  # point_output(batch_size * n_measure_of_patient, point_size)
  #   -> view(batch_size, n_measure_of_patient, point_size, 1)
  #   -> permute(batch_size, point_size, 1, n_measure_of_patient)
  #   -> X_point_rep(batch_size, point_size, ts_feature_pattern_dim+1, n_measure_of_patient)
  # X_point_rep contains representation of each measure in low-dimensional space
  # step 2
  X_point_rep = point_output.view(B, T, point_size, 1).permute(0, 2, 3, 1).repeat(1, 1, (1 + temp_kernels), 1)  # B * point_size * (1 + temp_kernels) * T
  
  # X_combined(batch_size, ts_feature_value_dim + point_size, ts_feature_conv_dim+1, n_measure_of_patient)
  # temp_skip and X_point_rep are concatenated along ts_feature_value_dim axis.
  # step 3
  X_combined = self.relu(cat((temp_skip, X_point_rep), dim=1))  # B * (F + Zt) * (1 + temp_kernels) * T
  next_X = X_combined.contiguous().view(B, (point_skip.shape[1] + point_size) * (1 + temp_kernels), T)  # B * ((F + Zt + point_size) * (1 + temp_kernels)) * T
  ...

At step 3 X_combined, my understanding for the reason of concatenating temp_skip and X_point_rep along ts_feature_value_dim is that X_point_rep contains representation at ts_feature_value_dim level. If so, why don't do the following:

X_combined = self.relu(cat(
    (temp_skip.view(B, point_skip.shape[1] * (temp_kernels + 1), T),
     point_output.view(B, T, point_size).permute(0, 2, 1)  # B * point_size * T
    ),
    dim=1
))

So, flatten temp_skip so that it can be concatenated with point_output at the ts_feature_value_dim level.

I actually have difficulty understanding the reasoning behind repeating each point_size value (1 + temp_kernels) times at step 2 (X_point_rep). The only reason I can think of is to match the dimensions of temp_skip. But with the repetition, won't next_X contain (1 + temp_kernels) repeated values at dim=1, which adds no information for the network?

Asking about source code in text is a bit difficult; I am not sure if I have stated my question clearly.

Thanks in advance for your time and help,
Cheng

configuration of GPU machine for training?

Hi Emma,

Thanks for sharing the detailed code implementation. I am studying your paper, which looks very interesting. May I ask what GPU machine configuration you used for training, and roughly how long it took to train the best TPC model on the eICU dataset? I am trying to train the model on an AWS ml.p3.2xlarge (NVIDIA V100 with 16 GB GPU memory) with the eICU dataset. I noticed that GPU utilization is quite low when I inspect it with 'nvidia-smi' (I set batch_size to 64, which occupies about 11 GB of GPU memory). The GPU usage percentage fluctuates a lot, dropping back to 0%, and rarely stays above 80%.

Thanks, Cheng

Python Version

May I ask the Python version of your environment? Thanks!

Some questions about performance

Hi Emma,

For the figures in the performance tables (take Table 2, for example), are the scores calculated on the test set (I assume the test set) or the validation set?

Another question about the transformer model in transformer_model.py.

class TransformerEncoder(nn.Module):
  ...
  def forward(self, X, T):
    ...
    # question about this line.
    X = self.transformer_encoder(src=X.permute(2, 0, 1), mask=self._causal_mask(size=T))  # T * B * d_model
    ....

Is _causal_mask telling the transformer to mask padded data?

Thanks,
Cheng
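
For context, a causal mask in a PyTorch transformer is typically an upper-triangular matrix of -inf values that stops position t attending to later positions; padded timesteps are usually masked separately (e.g. via src_key_padding_mask). A minimal sketch of the standard construction, which may differ from this repository's _causal_mask:

    import torch

    def causal_mask(size: int) -> torch.Tensor:
        # -inf entries are added to the attention logits before the softmax,
        # so position t can only attend to positions <= t.
        return torch.triu(torch.full((size, size), float('-inf')), diagonal=1)

    print(causal_mask(4))  # strictly-above-diagonal entries are -inf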

kappa for los

Hi, when I calculate kappa for LoS using sklearn:
from sklearn.metrics import cohen_kappa_score
kappa = cohen_kappa_score(y_true, y_pred)

it raises ValueError: continuous is not supported. It seems kappa is not appropriate for regression, only classification?

KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 153006\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 178858 179142 179554] not in index'

Hi,
I am getting following error while running the command:
python -m eICU_preprocessing.run_all_preprocessing

/opt/conda/lib/python3.6/runpy.py:85: DtypeWarning: Columns (3) have mixed types. Specify dtype option on import or set low_memory=False.
  exec(code, run_globals)
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"main", mod_spec)
File "/opt/conda/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/run_all_preprocessing.py", line 19, in
timeseries_main(eICU_path, test=False)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/timeseries.py", line 228, in timeseries_main
gen_timeseries_file(eICU_path, test)
File "/Code/TPC-LoS-prediction-master/eICU_preprocessing/timeseries.py", line 166, in gen_timeseries_file
merged = timeseries_lab.loc[patient_chunk].append(timeseries_resp.loc[patient_chunk], sort=False)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 879, in getitem
return self._getitem_axis(maybe_callable, axis=axis)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1099, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1037, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis, raise_missing=False)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexing.py", line 1240, in _get_listlike_indexer
indexer, keyarr = ax._convert_listlike_indexer(key)
File "/opt/conda/lib/python3.6/site-packages/pandas/core/indexes/multi.py", line 2397, in _convert_listlike_indexer
raise KeyError(f"{keyarr[mask]} not in index")
KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 153006\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 178858 179142 179554] not in index'
==> Removing the stays.txt file if it exists...

==> Removing the preprocessed_timeseries.csv file if it exists...
==> Loading data from timeseries files...
==> Reconfiguring lab timeseries...
==> Reconfiguring respiratory timeseries...
==> Reconfiguring nurse timeseries...
==> Reconfiguring aperiodic timeseries...
==> Reconfiguring periodic timeseries...
==> Starting main processing loop...

Any idea how to fix this?

Channel mismatch in model_type 'tpc'

Command
python -m models.run_tpc --model_type tpc --mode test --n_epochs 5

Error
Traceback (most recent call last):
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "D:\git\cs598_replication_tpc_los_prediction\models\run_tpc.py", line 39, in <module>
    run_tpc()
  File "D:\git\cs598_replication_tpc_los_prediction\models\run_tpc.py", line 34, in run_tpc
    tpc.run()
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\experiment.py", line 108, in run
    self.process_err(e)
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\pytorchexperiment.py", line 391, in process_err
    raise e
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\trixi\experiment\experiment.py", line 90, in run
    self.validate(epoch=self._epoch_idx)
  File "D:\git\cs598_replication_tpc_los_prediction\models\experiment_template.py", line 221, in validate
    self.test()
  File "D:\git\cs598_replication_tpc_los_prediction\models\experiment_template.py", line 254, in test
    y_hat_los, y_hat_mort = self.model(padded, diagnoses, flat)
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\git\cs598_replication_tpc_los_prediction\models\tpc_model.py", line 618, in forward
    diagnoses_enc = self.relu(self.main_dropout(self.bn_diagnosis_encoder(self.diagnosis_encoder(diagnoses))))  # B * diagnosis_size
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\modules\module.py", line 1051, in _call_impl
    return forward_call(*input, **kwargs)
  File "D:\git\cs598_replication_tpc_los_prediction\models\tpc_model.py", line 78, in forward
    training=True, momentum=exponential_average_factor, eps=self.eps)  # set training to True so it calculates the norm of the batch
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2279, in batch_norm
    _verify_batch_size(input.size())
  File "C:\Users\sewoo\AppData\Local\Programs\Python\Python36\lib\site-packages\torch\nn\functional.py", line 2247, in _verify_batch_size
    raise ValueError("Expected more than 1 value per channel when training, got input size {}".format(size))
ValueError: Expected more than 1 value per channel when training, got input size torch.Size([1, 64])

I will share the solution here when I solve this issue; in the meantime, please let me know if anybody knows how to fix this error.
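
For anyone hitting this: the ValueError is raised by batch normalisation when a training-mode batch contains a single sample (here the final batch has size 1). A common workaround, sketched below under the assumption that a standard PyTorch DataLoader feeds the model, is to drop the last incomplete batch:

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    # Dummy stand-in dataset: 65 samples with batch_size 32 would otherwise
    # leave a final batch of size 1, which BatchNorm cannot normalise.
    dataset = TensorDataset(torch.randn(65, 10))

    # drop_last=True discards the incomplete final batch.
    loader = DataLoader(dataset, batch_size=32, shuffle=True, drop_last=True)
    for (batch,) in loader:
        assert batch.shape[0] > 1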

Issues preprocessing MIMIC-IV using BigQuery

Hi, I'm having trouble preprocessing MIMIC-IV with BigQuery. I'm using their query translation tool; however, I'm getting errors. I'm trying to translate the following query to BigQuery:

create table ld_commonlabs as
  -- extracting the itemids for all the labevents that occur within the time bounds for our cohort
  with labsstay as (
    select l.itemid, la.stay_id
    from labevents as l
    inner join ld_labels as la
      on la.hadm_id = l.hadm_id
    where l.valuenum is not null  -- stick to the numerical data
      -- epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
      -- admission and the end of the patients' stay
      and (date_part('epoch', l.charttime) - date_part('epoch', la.intime))/(60*60*24) between -1 and la.los),
  -- getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
  avg_obs_per_stay as (
    select itemid, avg(count) as avg_obs
    from (select itemid, count(*) from labsstay group by itemid, stay_id) as obs_per_stay
    group by itemid
    having avg(count) > 3)  -- we want the features to have at least 3 values entered for the average patient
  select d.label, count(distinct labsstay.stay_id) as count, a.avg_obs
    from labsstay
    inner join d_labitems as d
      on d.itemid = labsstay.itemid
    inner join avg_obs_per_stay as a
      on a.itemid = labsstay.itemid
    group by d.label, a.avg_obs
    -- only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
    having count(distinct labsstay.stay_id) > (select count(distinct stay_id) from ld_labels)*0.25
    order by count desc;

My resulting BigQuery SQL is:

CREATE TABLE mimic_iv.ld_commonlabs
  AS
    WITH labsstay AS (
      SELECT
          --  extracting the itemids for all the labevents that occur within the time bounds for our cohort
          l.itemid,
          la.stay_id
        FROM
          physionet-data.mimiciv_hosp.labevents AS l
          INNER JOIN mimic_iv.ld_labels AS la ON la.hadm_id = l.hadm_id
        WHERE l.valuenum IS NOT NULL
         AND (UNIX_SECONDS(CAST(CAST(l.charttime as DATE) AS TIMESTAMP)) - CAST(UNIX_SECONDS(CAST(CAST(la.intime as DATE) AS TIMESTAMP)) as FLOAT64)) / (60 * 60 * 24) BETWEEN -1 AND la.los
    ), avg_obs_per_stay AS (
      SELECT
          --  stick to the numerical data
          --  epoch extracts the number of seconds since 1970-01-01 00:00:00-00, we want to extract measurements between
          --  admission and the end of the patients' stay
          --  getting the average number of times each itemid appears in an icustay (filtering only those that are more than 2)
          obs_per_stay.itemid,
          avg(CAST(obs_per_stay.count as BIGNUMERIC)) AS avg_obs
        FROM
          (
            SELECT
                labsstay.itemid,
                count(*) AS count
              FROM
                labsstay
              GROUP BY 1, labsstay.stay_id
          ) AS obs_per_stay
        GROUP BY 1
        HAVING avg(CAST(obs_per_stay.count as BIGNUMERIC)) > 3
    )
    SELECT
        --  we want the features to have at least 3 values entered for the average patient
        d.label,
        count(DISTINCT labsstay.stay_id) AS count,
        a.avg_obs
      FROM
        labsstay
        INNER JOIN physionet-data.mimiciv_hosp.d_labitems AS d ON d.itemid = labsstay.itemid
        INNER JOIN avg_obs_per_stay AS a ON a.itemid = labsstay.itemid
      GROUP BY 1, 3
      HAVING count(DISTINCT labsstay.stay_id) > (
        SELECT
            --  only keep data that is present at some point for at least 25% of the patients, this gives us 45 lab features
            count(DISTINCT labsstay.stay_id) AS count
          FROM
            mimic_iv.ld_labels
      ) * NUMERIC '0.25'

However, this produces the following error:

An expression references labsstay.stay_id which is neither grouped nor aggregated at [46:28]

I'm not very good with SQL, and I had other issues setting up the PostgreSQL database locally. Maybe you could help explain what this query is doing and how to better translate it to BigQuery style, as I would like to generate the CSV files.

Thanks.
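
Comparing the translated query with the original suggests the cause: in the Postgres version, the scalar subquery counts stay_id from ld_labels itself, but the translation tool qualified it as labsstay.stay_id, a reference to the outer CTE that BigQuery cannot resolve inside the subquery. A sketch of a corrected HAVING clause, assuming the same dataset names as above:

    HAVING count(DISTINCT labsstay.stay_id) > (
      SELECT count(DISTINCT stay_id)  -- count stays in ld_labels itself,
        FROM mimic_iv.ld_labels       -- not through the outer labsstay alias
    ) * 0.25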

Preprocessing MIMIC-IV issue

When running the command
\copy D_HCPCS FROM 'd_hcpcs.csv' DELIMITER ',' CSV HEADER NULL ''

in postgresql, it says

ERROR: character with byte sequence 0xe2 0x80 in encoding "UHC" has no equivalent in encoding "UTF8"
CONTEXT: COPY d_hcpcs, line 88856

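The bytes 0xe2 0x80 are the start of a multi-byte UTF-8 punctuation character, so this looks like a client/server encoding mismatch (UHC is the Korean Windows encoding). A common workaround, assuming the CSV is UTF-8, is to force the client encoding inside psql before the \copy:

    SET client_encoding = 'UTF8';
    \copy D_HCPCS FROM 'd_hcpcs.csv' DELIMITER ',' CSV HEADER NULL ''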

Preprocessing eICU issue

I got the following error while running the pre-processing scripts with python3 -m eICU_preprocessing.run_all_preprocessing:

File "/anaconda3/envs/TPC_Proj/lib/python3.8/runpy.py", line 194, in_run_module_as_main
return _run_code(code, main_globals, None,
File "/anaconda3/envs/TPC_Proj/lib/python3.8/runpy.py", line 87, in_run_code
exec(code, run_globals)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/run_all_preprocessing.py", line 18, in
timeseries_main(eICU_path, test=False)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/timeseries.py", line 228, in timeseries_main
gen_timeseries_file(eICU_path, test)
File "/eICU/TPC-LoS-prediction/eICU_preprocessing/timeseries.py", line 166, in gen_timeseries_file
merged = timeseries_lab.loc[patient_chunk].append(timeseries_resp.loc[patien t_chunk], sort=False)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexing.py", line 967, in getitem
return self._getitem_axis(maybe_callable, axis=axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexing.py", line 1194, in _getitem_axis
return self._getitem_iterable(key, axis=axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexing.py", line 1132, in _getitem_iterable
keyarr, indexer = self._get_listlike_indexer(key, axis)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexing.py", line 1330, in _get_listlike_indexer
keyarr, indexer = ax._get_indexer_strict(key, axis_name)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexes/multi.py", line 2587, in _get_indexer_strict
self._raise_if_missing(key, indexer, axis_name)
File "/anaconda3/envs/TPC_Proj/lib/python3.8/site-packages/pandas/co re/indexes/multi.py", line 2605, in _raise_if_missing
raise KeyError(f"{keyarr[cmask]} not in index")
KeyError: '[141939 142056 142476 142521 142560 146391 147447 149039 149606 15300 6\n 160529 162431 166572 166709 167391 167417 171174 175528 177651 178069\n 1788 58 179142 179554] not in index'
