Hi, I'm trying to apply CEHR-BERT to our hospital data, but I ran into an error during pretraining. The full log is below; I noticed the traceback says the errors may have originated from an input operation. Could you give me any advice?
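For context, here is a quick check I can run to confirm which CUDA/cuDNN versions my TensorFlow build expects versus what it can see. This is just a sketch: tf.sysconfig.get_build_info() only exists in TF >= 2.3, so on older builds the libcudart line in the startup log below tells the same story.

import tensorflow as tf

# Report the TF version and the GPUs it can see.
print(tf.__version__)
print(tf.config.list_physical_devices('GPU'))

# On TF >= 2.3 this reports the CUDA/cuDNN versions the wheel was built against.
if hasattr(tf.sysconfig, 'get_build_info'):
    print(tf.sysconfig.get_build_info())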
2023-06-21 23:44:49,819 - _load_training_data - INFO - Started running trainers.model_trainer: _load_training_data at line 101
2023-06-21 23:45:18,416 - _load_training_data - INFO - Took 0:00:28.597028 to run trainers.model_trainer: _load_training_data.
2023-06-21 23:45:20,826 - tokenize_concepts - INFO - Started running utils.model_utils: tokenize_concepts at line 63
2023-06-21 23:45:20,827 - utils.model_utils - INFO - Loading the existing tokenizer from ./Documents_230617/omop_test/cehr-bert/tokenizer.pickle
2023-06-21 23:46:36,348 - tokenize_concepts - INFO - Took 0:01:15.522283 to run utils.model_utils: tokenize_concepts.
2023-06-21 23:46:36,351 - tokenize_concepts - INFO - Started running utils.model_utils: tokenize_concepts at line 63
2023-06-21 23:46:36,352 - utils.model_utils - INFO - Loading the existing tokenizer from ./Documents_230617/omop_test/cehr-bert/visit_tokenizer.pickle
2023-06-21 23:48:34,747 - tokenize_concepts - INFO - Took 0:01:58.395994 to run utils.model_utils: tokenize_concepts.
2023-06-21 23:48:34.768708: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2023-06-21 23:48:34.836277: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.837996: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:25:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.839699: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:c1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.841387: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:e1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:34.842187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:34.845731: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-21 23:48:34.848337: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-06-21 23:48:34.849189: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-06-21 23:48:34.852776: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-06-21 23:48:34.855862: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-06-21 23:48:34.864436: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-06-21 23:48:34.878344: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2023-06-21 23:48:34.886036: I tensorflow/core/platform/cpu_feature_guard.cc:143] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 FMA
2023-06-21 23:48:34.920966: I tensorflow/core/platform/profile_utils/cpu_utils.cc:102] CPU Frequency: 2350065000 Hz
2023-06-21 23:48:34.929515: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x7f3fd4000b20 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2023-06-21 23:48:34.929597: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): Host, Default Version
2023-06-21 23:48:35.323193: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x6fa4350 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2023-06-21 23:48:35.323283: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (0): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323297: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (1): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323309: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (2): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.323320: I tensorflow/compiler/xla/service/service.cc:176] StreamExecutor device (3): A100-PCIE-40GB, Compute Capability 8.0
2023-06-21 23:48:35.365127: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 0 with properties:
pciBusID: 0000:01:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.366812: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 1 with properties:
pciBusID: 0000:25:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.368494: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 2 with properties:
pciBusID: 0000:c1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.370173: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1561] Found device 3 with properties:
pciBusID: 0000:e1:00.0 name: A100-PCIE-40GB computeCapability: 8.0
coreClock: 1.41GHz coreCount: 108 deviceMemorySize: 39.59GiB deviceMemoryBandwidth: 1.41TiB/s
2023-06-21 23:48:35.370229: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:35.370243: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-21 23:48:35.370254: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2023-06-21 23:48:35.370264: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2023-06-21 23:48:35.370275: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2023-06-21 23:48:35.370285: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2023-06-21 23:48:35.370295: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2023-06-21 23:48:35.383312: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1703] Adding visible gpu devices: 0, 1, 2, 3
2023-06-21 23:48:35.389500: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2023-06-21 23:48:35.401616: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1102] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-06-21 23:48:35.401637: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1108] 0 1 2 3
2023-06-21 23:48:35.401645: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 0: N Y Y Y
2023-06-21 23:48:35.401650: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 1: Y N Y Y
2023-06-21 23:48:35.401656: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 2: Y Y N Y
2023-06-21 23:48:35.401660: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1121] 3: Y Y Y N
2023-06-21 23:48:35.410048: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.410115: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 37713 MB memory) -> physical GPU (device: 0, name: A100-PCIE-40GB, pci bus id: 0000:01:00.0, compute capability: 8.0)
2023-06-21 23:48:35.420518: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.420618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:1 with 37713 MB memory) -> physical GPU (device: 1, name: A100-PCIE-40GB, pci bus id: 0000:25:00.0, compute capability: 8.0)
2023-06-21 23:48:35.424621: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.424722: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:2 with 37713 MB memory) -> physical GPU (device: 2, name: A100-PCIE-40GB, pci bus id: 0000:c1:00.0, compute capability: 8.0)
2023-06-21 23:48:35.427247: W tensorflow/core/common_runtime/gpu/gpu_bfc_allocator.cc:39] Overriding allow_growth setting because the TF_FORCE_GPU_ALLOW_GROWTH environment variable is set. Original config value was 0.
2023-06-21 23:48:35.427315: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1247] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:3 with 37713 MB memory) -> physical GPU (device: 3, name: A100-PCIE-40GB, pci bus id: 0000:e1:00.0, compute capability: 8.0)
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
2023-06-21 23:48:35,442 - tensorflow - INFO - Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
2023-06-21 23:48:35,442 - VanillaBertTrainer - INFO - Number of devices: 4
2023-06-22 00:10:52,835 - VanillaBertTrainer - INFO - training_data_parquet_path: ./Documents_230617/omop_test/cehr-bert/patient_sequence
model_path: ./Documents_230617/omop_test/cehr-bert/bert_model_{epoch:02d}_{loss:.2f}.h5
batch_size: 4
epochs: 5
learning_rate: 0.0002
tf_board_log_path: ./logs
shuffle_training_data: True
cache_dataset: False
use_dask: False
2023-06-22 00:10:52,835 - VanillaBertTrainer - INFO - VanillaBertTrainer will be trained with the following parameters:
tokenizer_path: ./Documents_230617/omop_test/cehr-bert/tokenizer.pickle
visit_tokenizer_path: ./Documents_230617/omop_test/cehr-bert/visit_tokenizer.pickle
embedding_size: 128
context_window_size: 512
depth: 5
num_heads: 8
include_visit_prediction: True
include_prolonged_length_stay: False
use_time_embeddings: True
use_behrt: False
time_embeddings_size: 16
2023-06-22 00:10:52,835 - BertVisitPredictionDataGenerator - INFO - batch_size: 4
max_seq_len: 512
min_num_of_concepts: 5
is_random_cursor: True
is_training: True
2023-06-22 00:10:55,688 - VanillaBertTrainer - INFO - Calculating steps per epoch
2023-06-22 00:10:55,688 - VanillaBertTrainer - INFO - Calculated 535972 steps per epoch
2023-06-22 00:10:55.703279: I tensorflow/core/profiler/lib/profiler_session.cc:159] Profiler session started.
2023-06-22 00:10:55.703346: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1363] Profiler found 4 GPUs
2023-06-22 00:10:55.707876: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcupti.so.10.1
2023-06-22 00:10:56.144173: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
Epoch 00001: LearningRateScheduler reducing learning rate to 0.0002.
Epoch 1/5
INFO:tensorflow:batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
2023-06-22 00:11:03,227 - tensorflow - INFO - batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
2023-06-22 00:11:04,079 - tensorflow - WARNING - Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
2023-06-22 00:11:04,079 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,584 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,589 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,595 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,599 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,603 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,605 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,608 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:05,610 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
2023-06-22 00:11:14,761 - tensorflow - INFO - batch_all_reduce: 118 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
2023-06-22 00:11:15,602 - tensorflow - WARNING - Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
2023-06-22 00:11:15,602 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:16,427 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:16,431 - tensorflow - INFO - Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
2023-06-22 00:11:31.148695: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2023-06-22 00:13:04.903005: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903108: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903128: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903138: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903146: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903177: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903387: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903388: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903146: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903438: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903415: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903456: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903482: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903207: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_NOT_SUPPORTED
2023-06-22 00:13:04.903404: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
2023-06-22 00:13:04.903527: E tensorflow/stream_executor/cuda/cuda_blas.cc:2301] Internal: failed BLAS call, see log for details
Traceback (most recent call last):
File "trainers/train_bert_only.py", line 216, in <module>
main(create_parse_args_base_bert().parse_args())
File "trainers/train_bert_only.py", line 212, in main
tf_board_log_path=config.tf_board_log_path).train_model()
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/trainers/model_trainer.py", line 139,in train_model
callbacks=self._get_callbacks())
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 66, in _method_wrapper
return method(self, *args, **kwargs)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/keras/engine/training.py", line 848, in fit
tmp_logs = train_function(iterator)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 580, in __call__
result = self._call(*args, **kwds)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/def_function.py", line 644, in _call
return self._stateless_fn(*args, **kwds)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 2420, in __call__
return graph_function._filtered_call(args, kwargs) # pylint: disable=protected-access
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1665, in _filtered_call
self.captured_inputs)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 1746, in _call_flat
ctx, args, cancellation_manager=cancellation_manager))
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/function.py", line 598, in call
ctx=ctx)
File "/home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/venv3.7/lib/python3.7/site-packages/tensorflow/python/eager/execute.py", line 60, in quick_execute
inputs, attrs, num_outputs)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas xGEMMBatched launch failed : a.shape=[8,512,16], b.shape=[8,512,16], m=512, n=512, k=16, batch_size=8
[[node replica_3/model/decoder_layer/multi_head_attention_5/MatMul (defined at /home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:57) ]]
[[div_no_nan_2/ReadVariableOp_7/_1902]]
(1) Internal: Blas xGEMMBatched launch failed : a.shape=[8,512,16], b.shape=[8,512,16], m=512, n=512, k=16, batch_size=8
[[node replica_3/model/decoder_layer/multi_head_attention_5/MatMul (defined at /home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:57) ]]
0 successful operations.
3 derived errors ignored. [Op:__inference_train_function_66633]
Errors may have originated from an input operation.
Input Source operations connected to node replica_3/model/decoder_layer/multi_head_attention_5/MatMul:
replica_3/model/decoder_layer/multi_head_attention_5/transpose (defined at /home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:105)
Input Source operations connected to node replica_3/model/decoder_layer/multi_head_attention_5/MatMul:
replica_3/model/decoder_layer/multi_head_attention_5/transpose (defined at /home/ted9219/synology/ted9219/dr6ho/ted9219/cehr-bert/models/custom_layers.py:105)
Function call stack:
train_function -> train_function
2023-06-22 00:13:11.183064: I tensorflow/stream_executor/stream.cc:1990] [stream=0xc090680,impl=0xb0f47eb0] did not wait for [stream=0xbb11850,impl=0xb0e36cf0]
2023-06-22 00:13:11.183143: I tensorflow/stream_executor/stream.cc:4938] [stream=0xc090680,impl=0xb0f47eb0] did not memcpy host-to-device; source: 0x7f3a7c03e980
2023-06-22 00:13:11.183225: F tensorflow/core/common_runtime/gpu/gpu_util.cc:340] CPU->GPU Memcpy failed
Aborted (core dumped)
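One thing I noticed in the startup log: TensorFlow loads libcudart.so.10.1 and libcudnn.so.7, while the A100s report compute capability 8.0, so I wonder whether a CUDA-version mismatch could explain the CUBLAS_STATUS_NOT_SUPPORTED errors. To narrow it down, I sketched a minimal reproduction of the failing op, with the shapes copied from the traceback (a.shape=[8,512,16], b.shape=[8,512,16], m=512, n=512, k=16); this is just my own attempt at isolating the batched matmul, not code from the repo:

import tensorflow as tf

# Same batched matmul shape as the failing node
# replica_3/model/decoder_layer/multi_head_attention_5/MatMul.
with tf.device('/GPU:0'):
    a = tf.random.normal([8, 512, 16])
    b = tf.random.normal([8, 512, 16])
    # Attention-style scores: [8, 512, 16] x [8, 16, 512] -> [8, 512, 512]
    scores = tf.matmul(a, b, transpose_b=True)
    print(scores.shape)

Does that seem like a sensible way to isolate the problem, or should I check something else first?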