cumc-dbmi / cehr-bert

CEHR-BERT: Incorporating temporal information from structured EHR data to improve prediction tasks

License: MIT License

Python 100.00%
bert disease-prediction ehr-data pretrained-language-model transformers deep-learning machine-learning temporal-data

cehr-bert's Introduction

CEHR-BERT

CEHR-BERT is a large language model developed for structured EHR data; the work has been published at https://proceedings.mlr.press/v158/pang21a.html. CEHR-BERT currently supports only structured EHR data in the OMOP format, a common data model used to support observational studies and managed by the Observational Health Data Sciences and Informatics (OHDSI) open-science community. There are three major components in CEHR-BERT: data generation, model pre-training, and model evaluation with fine-tuning. These components work in conjunction to provide an end-to-end model evaluation framework. The CEHR-BERT framework is designed to be extensible: users can write their own pre-training models, evaluation procedures, and downstream prediction tasks by extending the abstract classes; click on the links for more details. For a quick start, navigate to the Getting Started section.

Patient Representation

For each patient, all medical codes were aggregated and arranged into a chronological sequence. To incorporate temporal information, we inserted an artificial time token (ATT) between two neighboring visits based on their time interval. The following logic was used to create ATTs from the time interval between visits: 1) if less than 28 days, ATTs take the form $W_n$, where n represents the week number ranging from 0-3 (e.g. $W_1$); 2) if between 28 days and 365 days, ATTs take the form $M_n$, where n represents the month number ranging from 1-11 (e.g. $M_{11}$); 3) beyond 365 days, an LT (long term) token is inserted. In addition, we added two more special tokens, VS and VE, to represent the start and the end of a visit and to explicitly define the visit segment, where all the concepts associated with the visit are subsumed by VS and VE.

"patient_representation"

Model Architecture

Overview of our BERT architecture on structured EHR data. To distinguish visit boundaries, visit segment embeddings are added to concept embeddings. Next, both visit embeddings and concept embeddings go through a temporal transformation, where concept, age, and time embeddings are concatenated together. The concatenated embeddings are then fed into a fully connected layer. This temporal concept embedding becomes the input to BERT. We used the BERT learning objective, masked language modeling, as the primary learning objective and introduced an EHR-specific secondary learning objective, visit type prediction.

"cehr-bert architecture diagram"

Pretrained model release

We will release the pre-trained model soon.

Getting Started

Prerequisites

The project is built with Python 3.7, and the project dependencies need to be installed.

Create a new Python virtual environment

python3 -m venv venv3.7;
source venv3.7/bin/activate;

Install the packages in requirements.txt

pip3 install -r requirements.txt

Add the JTDS jar to the spark jars folder in the python environment

cp extra/jtds-1.3.1.jar venv3.7/lib/python3.7/site-packages/pyspark/jars/

Create the following folders for the tutorial below

mkdir -p ~/Documents/omop_test/cehr-bert;

1. Download OMOP tables as parquet files

We created a Spark app to download OMOP tables from SQL Server as parquet files. You need to adjust the properties in db_properties.ini to match your database setup.

PYTHONPATH=./: spark-submit tools/download_omop_tables.py -c db_properties.ini -tc person visit_occurrence condition_occurrence procedure_occurrence drug_exposure measurement observation_period concept concept_relationship concept_ancestor -o ~/Documents/omop_test/

We have prepared a Synthea dataset with 1M patients for you to test; you can download it at omop_synthea.tar.gz.

tar -xvf omop_synthea.tar.gz -C ~/Documents/omop_test/

2. Generate training data for CEHR-BERT

We order the patient events chronologically and put all data points in a single sequence. We insert the artificial tokens VS (visit start) and VE (visit end) at the start and the end of each visit. In addition, we insert artificial time tokens (ATTs) between visits to indicate the time intervals between them. This approach allows us to apply BERT to structured EHR data as-is. The sequence can be seen conceptually as [VS] [V1] [VE] [ATT] [VS] [V2] [VE], where [V1] and [V2] represent the lists of concepts associated with those visits.
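
To make this concrete, here is a small illustrative sketch of assembling such a sequence from per-visit concept lists (toy data and a hypothetical helper; the actual generation is performed by the Spark job below):

from datetime import date

# Hypothetical toy input: each visit is (visit start date, list of OMOP concept IDs).
visits = [
    (date(2020, 1, 1), ["320128", "1112807"]),
    (date(2020, 2, 15), ["201826"]),
]

def att_token(days: int) -> str:
    # The same interval-to-token rule sketched in the Patient Representation section.
    return f"W{days // 7}" if days < 28 else (f"M{days // 30}" if days < 365 else "LT")

sequence = []
for i, (visit_date, concepts) in enumerate(visits):
    if i > 0:
        gap = (visit_date - visits[i - 1][0]).days
        sequence.append(att_token(gap))  # ATT between two neighboring visits
    sequence += ["VS"] + concepts + ["VE"]

print(sequence)
# ['VS', '320128', '1112807', 'VE', 'M1', 'VS', '201826', 'VE']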

PYTHONPATH=./: spark-submit spark_apps/generate_training_data.py -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -tc condition_occurrence procedure_occurrence drug_exposure -d 1985-01-01 --is_new_patient_representation -iv 

3. Pre-train CEHR-BERT

If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea at sample_data/patient_sequence in the repo. CEHR-BERT expects the data folder to be named patient_sequence.

PYTHONPATH=./: python3 trainers/train_bert_only.py -i sample_data/ -o ~/Documents/omop_test/cehr-bert -iv -m 512 -e 2 -b 32 -d 5 --use_time_embedding 

If your dataset is large, you can add --use_dask to the command above.

4. Generate hf readmission prediction task

If you don't have your own OMOP instance, we have provided a sample of patient sequence data generated using Synthea at sample_data/hf_readmission in the repo.

PYTHONPATH=./:$PYTHONPATH spark-submit spark_apps/prediction_cohorts/hf_readmission.py -c hf_readmission -i ~/Documents/omop_test/ -o ~/Documents/omop_test/cehr-bert -dl 1985-01-01 -du 2020-12-31 -l 18 -u 100 -ow 360 -ps 0 -pw 30 --is_new_patient_representation

5. Fine-tune CEHR-BERT for hf readmission

# Copy the hf_readmission sample data
cp -r sample_data/hf_readmission ~/Documents/omop_test/cehr-bert/hf_readmission;

# Create the evaluation folder
mkdir -p ~/Documents/omop_test/evaluation_train_val_split/hf_readmission/;

# In our experiment, we use the model snapshot generated from the second epoch
cp ~/Documents/omop_test/cehr-bert/bert_model_02_* ~/Documents/omop_test/cehr-bert/bert_model.h5;
PYTHONPATH=./: python3 evaluations/evaluation.py -a sequence_model -sd sample_data/hf_readmission -ef ~/Documents/omop_test/evaluation_train_val_split/hf_readmission/ -m 512 -b 32 -p 10 -vb ~/Documents/omop_test/cehr-bert -me vanilla_bert_lstm --sequence_model_name CEHR_BERT_512 --num_of_folds 4;

Contact us

If you have any questions, feel free to contact us at [email protected]

Citation

Please acknowledge the following work in papers

Chao Pang, Xinzhuo Jiang, Krishna S. Kalluri, Matthew Spotnitz, RuiJun Chen, Adler Perotte, and Karthik Natarajan. "Cehr-bert: Incorporating temporal information from structured ehr data to improve prediction tasks." In Proceedings of Machine Learning for Health, volume 158 of Proceedings of Machine Learning Research, pages 239–260. PMLR, 04 Dec 2021.

cehr-bert's People

Contributors

chaopang, egillax, ksdkalluri, schuemie, xj2193


cehr-bert's Issues

Syntax error when creating training data

After the event-visit linking PR (#26) was accepted, I'm now seeing this syntax error. I'm running on Windows:

spark-submit --driver-memory 64g --executor-cores 8 --num-executors 3 --executor-memory 8g spark_apps/generate_training_data.py -i d:/gpm_ccae/ -o d:/gpm_ccae/cehr-bert -tc condition_occurrence procedure_occurrence drug_exposure -d 1985-01-01 --is_new_patient_representation -iv
23/06/05 00:31:35 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Traceback (most recent call last):
  File "C:/Users/admin_mschuemi/Documents/git/cehr-bert/spark_apps/generate_training_data.py", line 7, in <module>
    from utils.spark_utils import *
  File "C:\Users\admin_mschuemi\Documents\git\cehr-bert\utils\spark_utils.py", line 785
    patient_event = patient_event.drop("_max")
                ^
SyntaxError: invalid syntax
log4j:WARN No appenders could be found for logger (org.apache.spark.util.ShutdownHookManager).
log4j:WARN Please initialize the log4j system properly.
log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.

Any thoughts?

Use-case of the model

Hi, can you please add a simple notebook showing a use case of the models and how to train the network from scratch with sample temporal data X and prediction labels Y?

Update age normalization method in sequence based model evaluator

The current normalization method for age is applied to the entire dataset before splitting, which could leak age information from the future, so it isn't aligned with best practice. The method needs to be updated to make the evaluations fair across the train/test/validation sets.

For sequence models, including the LSTM- and BERT-based ones, batch normalization could be used to track running statistics on the training data, which are then reused for the test and validation sets.

https://github.com/cumc-dbmi/cehr-bert/blob/master/models/evaluation_models.py
https://github.com/cumc-dbmi/cehr-bert/blob/master/evaluations/model_evaluators.py
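
A minimal sketch of the intended fix (hypothetical names, not the repository's evaluator code): fit the age normalization statistics on the training split only and reuse them for the validation and test splits.

import numpy as np

def fit_age_normalizer(train_ages):
    """Compute normalization statistics from the training split only."""
    mean, std = np.mean(train_ages), np.std(train_ages)
    return lambda ages: (np.asarray(ages) - mean) / (std + 1e-8)

# Statistics come from the training ages alone, so no information about the
# validation/test age distribution leaks into the model inputs.
normalize = fit_age_normalizer(train_ages=[34, 51, 62, 70])
val_ages_normalized = normalize([45, 80])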

Dependency on visit_occurrence_id?

If I understand correctly, the script currently relies entirely on the visit_occurrence_id to link all events to visits. However, many databases don't have these links for most events. (E.g. in CCAE only a small fraction of drug_exposure records has a non-null visit_occurrence_id.)

Instead, (for example for cohort definitions) we tend to simply rely on the date. If an event occurs on the same day as a visit, we assume they are linked.

I think it would be helpful if this was the backup option: if the visit_occurrence_id is not provided, attempt to link by date.

(or does the script already do that?)
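
For illustration, a date-based fallback along these lines could look roughly like the following PySpark sketch (hypothetical paths and column selections; not the script's current behavior):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Parquet folders produced by the download step; adjust the paths as needed.
events = spark.read.parquet("drug_exposure").select(
    "person_id",
    "drug_concept_id",
    F.col("drug_exposure_start_date").alias("event_date"),
    "visit_occurrence_id",
)
visits = spark.read.parquet("visit_occurrence").select(
    F.col("person_id").alias("visit_person_id"),
    F.col("visit_occurrence_id").alias("matched_visit_occurrence_id"),
    "visit_start_date",
    "visit_end_date",
)

# Keep explicit links where they exist; otherwise match an event to a visit of the
# same person whose date range covers the event date.
unlinked = events.where(F.col("visit_occurrence_id").isNull())
linked_by_date = unlinked.join(
    visits,
    (unlinked.person_id == visits.visit_person_id)
    & unlinked.event_date.between(visits.visit_start_date, visits.visit_end_date),
    "left",
).drop("visit_person_id")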

Remove the experimental code on temporal bert model/layer/trainer

  • Remove trainer trainers/train_bert_with_pretrained_timeattention.py
  • Remove function transformer_temporal_bert_model_visit_prediction in models/bert_models_visit_prediction.py
  • Remove function transformer_temporal_bert_model in models/bert_models.py
  • Remove class TimeSelfAttention in models/custom_layers.py

Unable to finish tutorial: can't run prediction evaluation

I've been able to follow the tutorial in the README just fine, except for the last line.
I've downloaded the Synthea data, converted it to the training data format, and used this line to create a pre-trained model:

PYTHONPATH=./: python3 trainers/train_bert_only.py -i sample_data/ -o ~/Documents/omop_test/cehr-bert -iv -m 512 -e 1 -b 32 -d 5 --use_time_embedding 

As a result, I now have a file called 'bert_model_01_3.67.h5'

However, this last line is throwing an error:

PYTHONPATH=./: python3 evaluations/evaluation.py -a sequence_model -sd sample_data/hf_readmission -ef ~/Documents/omop_test/evaluation_train_val_split/hf_readmission/ -m 512 -b 32 -p 10 -vb ~/Documents/omop_test/cehr-bert -me vanilla_bert_lstm --sequence_model_name CEHR_BERT_512 --num_of_folds 4;

The error is

OSError: SavedModel file does not exist at: d:/omopSynthea/cehr-bert\bert_model.h5/{saved_model.pbtxt|saved_model.pb}

(I changed the path because I'm running on Windows). However, d:/omopSynthea/cehr-bert\bert_model.h5 exists (I renamed the aforementioned 'bert_model_01_3.67.h5').

Am I doing something wrong? How do I run a 2nd epoch?

Error when pretraining data

Hi, I'm trying to apply cehr-bert to our hospital data, but I got an error during pre-training.

I ran the command below under CUDA 10.1 and TensorFlow 2.2.0 on a machine with four A100 GPUs:

PYTHONPATH=./: python3 trainers/train_bert_only.py -i ./Documents_230617/omop_test/cehr-bert -o ./Documents_230617/omop_test/cehr-bert -iv -m 512 -e 5 -b 4 -d 5 --use_time_embedding

but I got the output below, and the logs say the errors may have originated from an input operation.
Could you give any advice?

2023-06-21 23:44:49,819 - _load_training_data - INFO - Started running trainers.model_trainer: _load_training_data at line 101
2023-06-21 23:45:18,416 - _load_training_data - INFO - Took 0:00:28.597028 to run trainers.model_trainer: _load_training_data.
2023-06-21 23:45:20,826 - tokenize_concepts - INFO - Started running utils.model_utils: tokenize_concepts at line 63
2023-06-21 23:45:20,827 - utils.model_utils - INFO - Loading the existing tokenizer from ./Documents_230617/omop_test/cehr-bert/tokenizer.pickle
2023-06-21 23:46:36,348 - tokenize_concepts - INFO - Took 0:01:15.522283 to run utils.model_utils: tokenize_concepts.
2023-06-21 23:46:36,351 - tokenize_concepts - INFO - Started running utils.model_utils: tokenize_concepts at line 63
2023-06-21 23:46:36,352 - utils.model_utils - INFO - Loading the existing tokenizer from ./Documents_230617/omop_test/cehr-bert/visit_tokenizer.pickle
2023-06-21 23:48:34,747 - tokenize_concepts - INFO - Took 0:01:58.395994 to run utils.model_utils: tokenize_concepts.
2023-06-21 23:48:34.768708: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-06-21 23:48:34.836277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.837996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:25:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.839699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:c1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.841387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:e1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.842187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:34.845731: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-21 23:48:34.848337: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-06-21 23:48:34.849189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-06-21 23:48:34.852776: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-06-21 23:48:34.855862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-06-21 23:48:34.864436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-06-21 23:48:34.878344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2023-06-21 23:48:34.886036: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-06-21 23:48:34.920966: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2350065000 Hz
2023-06-21 23:48:34.929515: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3fd4000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-06-21 23:48:34.929597: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2023-06-21 23:48:35.323193: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6fa4350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-06-21 23:48:35.323283: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323297: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323309: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323320: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.365127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.366812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:25:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.368494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:c1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.370173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:e1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.370229: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:35.370243: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-21 23:48:35.370254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-06-21 23:48:35.370264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-06-21 23:48:35.370275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-06-21 23:48:35.370285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-06-21 23:48:35.370295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-06-21 23:48:35.383312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2023-06-21 23:48:35.389500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:35.401616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-06-21 23:48:35.401637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108]      0 1 2 3
2023-06-21 23:48:35.401645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0:   N Y Y Y
2023-06-21 23:48:35.401650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1:   Y N Y Y
2023-06-21 23:48:35.401656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2:   Y Y N Y
2023-06-21 23:48:35.401660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3:   Y Y Y N
2023-06-21 23:48:35.410048: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.410115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37713 MB memory) -> physical GPU (device: 0, name: A100-PCIE-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0)
2023-06-21 23:48:35.420518: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.420618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 37713 MB memory) -> physical GPU (device: 1, name: A100-PCIE-40GB, pci bus id: 0000:25:00.0, compute capability: 8.0)
2023-06-21 23:48:35.424621: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.424722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 37713 MB memory) -> physical GPU (device: 2, name: A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2023-06-21 23:48:35.427247: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.427315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 37713 MB memory) -> physical GPU (device: 3, name: A100-PCIE-40GB, pci bus id: 0000:e1:00.0, compute capability: 8.0)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0','/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
2023-06-21 23:48:35,442 - tensorflow - INFO - Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
2023-06-21 23:48:35,442 - VanillaBertTrainer - INFO - Number of devices: 4
2023-06-22 00:10:52,835 - VanillaBertTrainer - INFO - training_data_parquet_path: ./Documents_230617/omop_test/cehr-bert/patient_sequence
model_path: ./Documents_230617/omop_test/cehr-bert/bert_model_{epoch:02d}_{loss:.2f}.h5
batch_size: 4
epochs: 5
learning_rate: 0.0002
tf_board_log_path: ./logs
shuffle_training_data: True
cache_dataset: False
use_dask: False

2023-06-22 00:10:52,835 - VanillaBertTrainer - INFO - VanillaBertTrainer will be trained with the following parameters:
tokenizer_path: ./Documents_230617/omop_test/cehr-bert/tokenizer.pickle
visit_tokenizer_path: ./Documents_230617/omop_test/cehr-bert/visit_tokenizer.pickle
embedding_size: 128
context_window_size: 512
depth: 5
num_heads: 8
include_visit_prediction: True
include_prolonged_length_stay: False
use_time_embeddings: True
use_behrt: False
time_embeddings_size: 16
2023-06-22 00:10:52,835 - BertVisitPredictionDataGenerator - INFO - batch_size: 4
max_seq_len: 512
min_num_of_concepts: 5
is_random_cursor: True
is_training: True

2023-06-22 00:10:55,688 - VanillaBertTrainer - INFO - Calculating steps per epoch
2023-06-22 00:10:55,688 - VanillaBertTrainer - INFO - Calculated 535972 steps per epoch
2023-06-22 00:10:55.703279: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2023-06-22 00:10:55.703346: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1363] Profiler found 4 GPUs
2023-06-22 00:10:55.707876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2023-06-22 00:10:56.144173: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed

Epoch 00001: LearningRateScheduler reducing learning rate to 0.0002.
Epoch 1/5
INFO:tensorflow:batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
2023-06-22 00:11:03,227 - tensorflow - INFO - batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
2023-06-22 00:11:04,079 - tensorflow - WARNING - Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
2023-06-22 00:11:04,079 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,584 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,589 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,595 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,599 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,603 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,605 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,608 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,610 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
2023-06-22 00:11:14,761 - tensorflow - INFO - batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
2023-06-22 00:11:15,602 - tensorflow - WARNING - Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
2023-06-22 00:11:15,602 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:16,427 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:16,431 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:31.148695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-22 00:13:04.903005: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903108: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903128: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903138: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903146: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903177: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903387: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903388: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903146: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903438: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903415: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903456: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903482: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903207: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903404: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903527: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
  File "trainers/train_bert_only.py", line 216, in <module>
    main(create_parse_args_base_bert().parse_args())
  File "trainers/train_bert_only.py", line 212, in main
    tf_board_log_path=config.tf_board_log_path).train_model()
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/trainers/model_trainer.py", line 139,in train_model
    callbacks=self._get_callbacks())
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
    return method(self, *args, **kwargs)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
    tmp_logs = train_function(iterator)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
    result = self._call(*args, **kwds)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
    return self._stateless_fn(*args, **kwds)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
    return graph_function._filtered_call(args, kwargs)  # pylint: disable=protected-access
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
    self.captured_inputs)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
    ctx, args, cancellation_manager=cancellation_manager))
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 598, in call
    ctx=ctx)
  File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
    inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal:  Blas xGEMMBatched launch failed : a.shape=[8,512,16], b.shape=[8,512,16], m=512, n=512, k=16, batch_size=8
         [[node replica_3/model/decoder_layer/multi_head_attention_5/MatMul (defined at home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:57) ]]
         [[div_no_nan_2/ReadVariableOp_7/_1902]]
  (1) Internal:  Blas xGEMMBatched launch failed : a.shape=[8,512,16], b.shape=[8,512,16], m=512, n=512, k=16, batch_size=8
         [[node replica_3/model/decoder_layer/multi_head_attention_5/MatMul (defined at home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:57) ]]
0 successful operations.
3 derived errors ignored. [Op:__inference_train_function_66633]

Errors may have originated from an input operation.
Input Source operations connected to node replica_3/model/decoder_layer/multi_head_attention_5/MatMul:
 replica_3/model/decoder_layer/multi_head_attention_5/transpose (defined at home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:105)

Input Source operations connected to node replica_3/model/decoder_layer/multi_head_attention_5/MatMul:
 replica_3/model/decoder_layer/multi_head_attention_5/transpose (defined at home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:105)

Function call stack:
train_function -> train_function

2023-06-22 00:13:11.183064: I tensorflow/stream_executor/stream.cc:1990] [stream=0xc090680,impl=0xb0f47eb0] did not wait for [stream=0xbb11850,impl=0xb0e36cf0]
2023-06-22 00:13:11.183143: I tensorflow/stream_executor/stream.cc:4938] [stream=0xc090680,impl=0xb0f47eb0] did not memcpy host-to-device; source: 0x7f3a7c03e980
2023-06-22 00:13:11.183225: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed
Aborted (core dumped)

Update age normalization method in frequency based model evaluator

The current normalization method for age is applied to the entire dataset before splitting, which could leak age information from the future, so it isn't aligned with best practice. The method needs to be updated to make the evaluations fair across the train/test/validation sets.

For frequency baseline models, we need to "STOP" normalizing age in the corresponding evaluators where we process the data for evaluation.

class BaselineModelEvaluator(AbstractModelEvaluator, ABC):
