Comments (7)

agemagician commented on May 4, 2024

Thanks @jeffra for the update.
I will test it and report back.

from deepspeed.

ShadenSmith commented on May 4, 2024

Hello! Thank you for your interest in DeepSpeed. DeepSpeed uses its own launcher and relies on NCCL, not MPI, for communication. Your code needs to call DeepSpeed's small API to run, and Horovod is not used. To launch a DeepSpeed program you only need a hostfile, whose format is compatible with many MPI implementations. DeepSpeed looks for /job/hostfile by default, or you can pass one explicitly with the argument --hostfile=path/to/hostfile.

Finally, you can launch with:

deepspeed cifar_deepspeed.py --deepspeed --deepspeed_config=ds_config.json

agemagician commented on May 4, 2024

Thanks for the clarification.
This will be a little tricky on SUMMIT, since I don't know the hostnames of the allocated nodes in advance.
I will check whether bsub provides them somehow.
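
A possible way to get the hostnames, sketched here under stated assumptions rather than taken from this thread: IBM Spectrum LSF (the scheduler behind bsub) normally exports LSB_DJOB_HOSTFILE inside a job, pointing at a file with one hostname per allocated slot. The helper below (build_hostfile is a hypothetical name) collapses that list into the "host slots=N" lines that a DeepSpeed hostfile uses. The value gpus_per_node=6 and the option to drop the first entry (on some systems it is the launch node, not a compute node) are assumptions you would need to check for your allocation.

```python
import os
from collections import OrderedDict


def build_hostfile(hostnames, gpus_per_node, skip_first=False):
    """Return hostfile text with one '<host> slots=<n>' line per unique host.

    hostnames may contain repeats (one entry per slot); order is preserved.
    """
    unique = list(OrderedDict.fromkeys(h.strip() for h in hostnames if h.strip()))
    if skip_first:
        # Assumption: the first listed host is a launch/batch node without GPUs.
        unique = unique[1:]
    return "".join(f"{host} slots={gpus_per_node}\n" for host in unique)


if __name__ == "__main__":
    # LSB_DJOB_HOSTFILE is set by LSF inside a running job; outside one, do nothing.
    path = os.environ.get("LSB_DJOB_HOSTFILE")
    if path:
        with open(path) as fh:
            print(build_hostfile(fh, gpus_per_node=6, skip_first=True), end="")
```

Redirecting the output to a file and passing that file with --hostfile would then match the launch instructions above.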

jeffra commented on May 4, 2024

@agemagician, we just merged a new PR that should make this a bit easier for you and for others who want to launch with MPI. Please see the new text in our README for more details: https://github.com/microsoft/DeepSpeed/#mpi-compatibility

In your case you should be able to do something like:
ddlrun python cifar10_deepspeed.py --deepspeed_mpi --deepspeed --deepspeed_config ds_config.json

Also make sure the Python package mpi4py is installed.
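
For readers wondering what rank discovery under an MPI launcher involves, here is a minimal sketch. It is not DeepSpeed's implementation (which, per the note above, relies on mpi4py); it assumes an Open MPI based launcher, which exports OMPI_COMM_WORLD_RANK, OMPI_COMM_WORLD_SIZE and OMPI_COMM_WORLD_LOCAL_RANK to each process. The function name mpi_env_to_torch_env is hypothetical. RANK and WORLD_SIZE are the names PyTorch's env:// initialization reads, and LOCAL_RANK is the conventional name for the per-node index; MASTER_ADDR and MASTER_PORT would still have to be set separately.

```python
import os


def mpi_env_to_torch_env(environ):
    """Translate Open MPI rank variables into the names used by torch.distributed."""
    mapping = {
        "OMPI_COMM_WORLD_RANK": "RANK",
        "OMPI_COMM_WORLD_SIZE": "WORLD_SIZE",
        "OMPI_COMM_WORLD_LOCAL_RANK": "LOCAL_RANK",
    }
    missing = [src for src in mapping if src not in environ]
    if missing:
        # Most likely the script was not started by an Open MPI based launcher.
        raise KeyError(f"missing MPI variables: {missing}")
    # int() validates that each value is numeric before it is re-exported as text.
    return {dst: str(int(environ[src])) for src, dst in mapping.items()}


if __name__ == "__main__":
    try:
        os.environ.update(mpi_env_to_torch_env(os.environ))
    except KeyError as err:
        print(f"skipping MPI discovery: {err}")
```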

agemagician commented on May 4, 2024

Launching with ddlrun did not work; it failed with the following output:

2020-02-28 04:41:22.616358: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616342: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616370: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.616420: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 04:41:22.633562: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.633708: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.633842: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634003: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634142: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.634298: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 04:41:22.651291: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651443: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651589: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651739: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.651892: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.654235: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 04:41:22.669154: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669466: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669618: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.669775: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.672114: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 04:41:22.687058: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687205: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687346: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687501: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.687649: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.690016: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 04:41:22.704951: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705096: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705238: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705392: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.705540: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.707887: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 04:41:22.722871: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.722895: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.722962: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.722997: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723020: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723030: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723045: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723112: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723150: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723169: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723192: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723187: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723255: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723292: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723326: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723325: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723350: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723414: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723452: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.723489: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.723493: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.723548: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.723584: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.723617: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.724633: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 04:41:22.724656: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 04:41:22.724721: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 04:41:22.724756: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 04:41:22.724790: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 04:41:22.778804: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778808: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778801: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778846: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778881: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778904: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778849: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778883: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778854: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 04:41:22.778896: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778890: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778897: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 04:41:22.778940: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778930: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.778941: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 04:41:22.990990: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991437: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991584: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991725: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.991867: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:22.992162: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 04:41:23.008098: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.008098: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.008934: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009407: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009421: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.009645: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 04:41:23.010798: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15c157940 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.010839: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.010975: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16f5b8af0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.010998: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.011583: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x157038d60 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.011607: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.014651: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1204ed950 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.014674: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.014892: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x15339d650 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.014917: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.015113: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x16933d290 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 04:41:23.015143: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 04:41:23.033880: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033886: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033952: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.033952: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.062768: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063391: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063428: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063558: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.063765: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064749: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064869: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.064908: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065097: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065229: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065453: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065515: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.065567: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066307: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066389: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066423: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.066552: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068210: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068217: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068321: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.068787: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
2020-02-28 04:41:23.068957: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA
2020-02-28 04:41:23.091814: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.094324: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.095155: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.095759: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.096799: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.097594: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 04:41:23.098078: F tensorflow/stream_executor/lib/statusor.cc:34] Attempting to fetch value instead of handling error Internal: no supported devices found for platform CUDA

When I tried jsrun instead, I got a different error:

THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
THCudaCheck FAIL file=/opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp line=38 error=101 : invalid device ordinal
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 518, in main
    initialize_distributed(args)
  File "pretrain_bert.py", line 441, in initialize_distributed
    torch.cuda.set_device(device)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/torch/cuda/__init__.py", line 280, in set_device
    torch._C._cuda_setDevice(device)
RuntimeError: cuda runtime error (101) : invalid device ordinal at /opt/anaconda/conda-bld/pytorch-base_1571884578074/work/torch/csrc/cuda/Module.cpp:38
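
For anyone reading along: "invalid device ordinal" from `torch.cuda.set_device` generally means the index passed in is not smaller than the number of GPUs visible to that process. Below is a minimal sketch of a guard that could sit in front of that call. It assumes the launcher may restrict each process to a subset of GPUs, and that an Open MPI-style launcher exports `OMPI_COMM_WORLD_LOCAL_RANK`; whether ddlrun does so on this system is an assumption, not something the logs confirm. The helper name `pick_local_device` is hypothetical and not part of DeepSpeed or Megatron-LM.

```python
import os


def pick_local_device(global_rank, visible_devices, env=None):
    """Return a device index that is valid for this process.

    global_rank:     rank of this process in the whole job
    visible_devices: number of GPUs this process can see
                     (e.g. the value of torch.cuda.device_count())
    env:             mapping to read launcher variables from
                     (defaults to os.environ; injectable for testing)
    """
    env = os.environ if env is None else env
    if visible_devices <= 0:
        raise RuntimeError("no CUDA devices are visible to this process")

    # Assumption: an Open MPI-style launcher sets this variable.
    # If it is absent, fall back to the global rank.
    local_rank = env.get("OMPI_COMM_WORLD_LOCAL_RANK")
    index = int(local_rank) if local_rank is not None else global_rank

    # Wrap into the visible range so the index can never exceed it.
    return index % visible_devices


if __name__ == "__main__":
    # Hypothetical case: rank 5 in a process that only sees one GPU.
    print(pick_local_device(5, 1, env={}))
```

In the training script this would be used as `torch.cuda.set_device(pick_local_device(rank, torch.cuda.device_count()))`. If each process only sees one GPU, every rank resolves to device 0 instead of failing.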

from deepspeed.

agemagician avatar agemagician commented on May 4, 2024

I tried changing the distributed-backend parameter to ddl, and got a different error:


2020-02-28 05:04:04.719317: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 05:04:04.719482: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-02-28 05:04:04.734093: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734242: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734367: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734522: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734670: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.734821: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 0 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:04:00.0
2020-02-28 05:04:04.750180: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750333: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750468: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750622: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750763: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.750914: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 1 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:05:00.0
2020-02-28 05:04:04.765597: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.765750: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.765886: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766037: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766179: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.766332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 2 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0004:06:00.0
2020-02-28 05:04:04.781044: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781193: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781324: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781480: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781626: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.781772: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 3 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:03:00.0
2020-02-28 05:04:04.798361: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798512: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798643: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.798800: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.799109: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.799260: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 4 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:04:00.0
2020-02-28 05:04:04.816072: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816097: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816158: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816195: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816219: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816231: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816244: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816305: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816345: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816350: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816373: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816383: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816429: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816467: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816504: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816508: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816593: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816633: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816674: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816817: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816842: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.816898: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.816935: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.816962: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1618] Found device 5 with properties: 
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0035:05:00.0
2020-02-28 05:04:04.816971: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.816990: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
2020-02-28 05:04:04.817047: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10
2020-02-28 05:04:04.817085: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcufft.so.10
2020-02-28 05:04:04.817122: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcurand.so.10
2020-02-28 05:04:04.818449: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818448: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818494: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818492: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818529: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818533: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818560: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818561: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818577: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818603: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818621: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818640: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818641: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818656: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:04.818716: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusolver.so.10
2020-02-28 05:04:04.818761: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcusparse.so.10
2020-02-28 05:04:04.818799: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-02-28 05:04:05.014825: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.014973: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015120: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015250: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015402: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.015546: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1746] Adding visible gpu devices: 0, 1, 2, 3, 4, 5
2020-02-28 05:04:05.027646: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.027664: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.027664: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.030542: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.030785: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d5295b00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.030805: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.030794: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a9f353c0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.030814: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.030831: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.031007: I tensorflow/core/platform/profile_utils/cpu_utils.cc:101] CPU Frequency: 3450000000 Hz
2020-02-28 05:04:05.031106: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x178883f00 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.031136: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.034994: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035028: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035127: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035165: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035255: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035256: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035453: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035580: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035807: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.035920: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036083: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036213: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036294: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.036426: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.037278: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b48043b0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.037351: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.037394: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.037758: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2f527d0 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.037778: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.038390: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x183c1b700 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.038414: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-02-28 05:04:05.041883: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.041883: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.041933: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.042908: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043108: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043148: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:4: failed initializing StreamExecutor for CUDA device ordinal 4: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043477: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.043686: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:0: failed initializing StreamExecutor for CUDA device ordinal 0: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044267: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044469: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1d52f8dd0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.044486: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.044530: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.044575: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045110: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:3: failed initializing StreamExecutor for CUDA device ordinal 3: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045164: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:1: failed initializing StreamExecutor for CUDA device ordinal 1: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045387: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:5: failed initializing StreamExecutor for CUDA device ordinal 5: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.045488: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1a9f986d0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.045504: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.045932: W tensorflow/compiler/xla/service/platform_util.cc:256] unable to create StreamExecutor for CUDA:2: failed initializing StreamExecutor for CUDA device ordinal 2: Internal: failed call to cuDevicePrimaryCtxRetain: CUDA_ERROR_INVALID_DEVICE: invalid device ordinal
2020-02-28 05:04:05.046762: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1788e71f0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.046776: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
2020-02-28 05:04:05.053050: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b2fb6290 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053070: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.053248: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x1b4867ea0 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053262: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
2020-02-28 05:04:05.053294: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x183c7f160 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-02-28 05:04:05.053309: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-16GB, Compute Capability 7.0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: Invalid device ordinal value (1). Valid range is [0, 0].
	while setting up XLA_GPU_JIT device number 1
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0
Traceback (most recent call last):
  File "pretrain_bert.py", line 581, in <module>
    main()
  File "pretrain_bert.py", line 528, in main
    args.tokenizer_num_type_tokens = get_train_val_test_data(args)
  File "pretrain_bert.py", line 475, in get_train_val_test_data
    (train_data, val_data, test_data), tokenizer = data_config.apply(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 34, in apply
    return make_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 120, in make_loaders
    return make_tfrecord_loaders(args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/configure_data.py", line 92, in make_tfrecord_loaders
    **data_set_args)
  File "/autofs/nccs-svm1_proj/bif120/deepforce/scripts/deepspeed/DeepSpeedExamples/Megatron-LM/data_utils/tf_dl.py", line 42, in __init__
    self.dataset = tf.data.Dataset.from_tensor_slices(tf.constant(records))
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 161, in constant_v1
    allow_broadcast=False)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 235, in _constant_impl
    t = convert_to_eager_tensor(value, ctx, dtype)
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/framework/constant_op.py", line 95, in convert_to_eager_tensor
    ctx.ensure_initialized()
  File "/ccs/proj/bif120/deepforce/virtualenvs/ibm_wml_ce-1.6.2-2/lib/python3.6/site-packages/tensorflow_core/python/eager/context.py", line 492, in ensure_initialized
    context_handle = pywrap_tensorflow.TFE_NewContext(opts)
tensorflow.python.framework.errors_impl.InvalidArgumentError: device CUDA:0 not supported by XLA service
	while setting up XLA_GPU_JIT device number 0

from deepspeed.

agemagician avatar agemagician commented on May 4, 2024

Oh, that output was actually from running the Megatron-LM code, which doesn't use DeepSpeed's distributed code.

I will test it again with the CIFAR test.

from deepspeed.
