
OOM in tensorflow · medaka · closed · 9 comments

nanoporetech commented on August 16, 2024
OOM in tensorflow

from medaka.

Comments (9)

cjw85 commented on August 16, 2024

@txje,

To avoid OOM errors, reduce the batch size with the -b option when running medaka_consensus.
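A minimal sketch of that suggestion: start from the default batch size and halve -b until the run fits in GPU memory. The input/output paths below are illustrative placeholders (check medaka_consensus --help for your version), and the echo stands in for the real invocation:

```shell
#!/bin/sh
# Halve the batch size until the run fits in GPU memory.
# Paths are placeholders; echo stands in for the real command.
b=200                       # starting batch size (assumed default)
while [ "$b" -ge 25 ]; do
  echo "medaka_consensus -i basecalls.fastq -d draft.fasta -o medaka_out -b $b"
  b=$((b / 2))
done
```

In practice you would run each command and stop at the first batch size that completes without an OOM error.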


txje commented on August 16, 2024

If I halve the batch size with -b 100, I get the same thing, but it fails allocating a tensor of 768M instead...
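Back-of-envelope arithmetic on the numbers reported above (assuming the failing tensor scales linearly with batch size, as the halving from 1536M to 768M suggests):

```python
# Estimate per-sample memory for the failing allocation and project
# what the same tensor would need at a smaller batch size.
fail_batch = 100          # batch size that still failed (-b 100)
fail_tensor_mib = 768     # size of the tensor TensorFlow could not allocate

per_sample_mib = fail_tensor_mib / fail_batch   # MiB per batch element

# At -b 50 (the batch size that ultimately worked, per this thread),
# the same tensor would need roughly half the memory:
est_50_mib = per_sample_mib * 50
print(f"~{per_sample_mib:.2f} MiB/sample, ~{est_50_mib:.0f} MiB at -b 50")
```

This only accounts for the single failing tensor; the nvidia-smi snapshots below show the process holding ~15.7 GiB overall, so other allocations dominate the total footprint.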


cjw85 commented on August 16, 2024

If absolutely nothing else is running on the GPU, I must admit I am at a loss here; a 16 GB GPU should easily handle a batch size of 100. Can you watch the output of nvidia-smi before, during, and after medaka_consensus runs and report what you observe?


txje commented on August 16, 2024

Before:

Tue May 28 17:44:18 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

During:
It ramps up to ~397 MiB for several seconds, then jumps to 15,345 MiB and quickly from there to 15,731 MiB, where it sat for several seconds before crashing.

Tue May 28 17:44:34 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    57W / 300W |  15731MiB / 16130MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7890      C   ...me/ubuntu/.conda/envs/medaka/bin/python 15721MiB |
+-----------------------------------------------------------------------------+

After:

Tue May 28 17:44:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


cjw85 commented on August 16, 2024

I notice from above you are using CUDA 10.1.

Which version of tensorflow are you using, and how did you build/obtain it? The tensorflow version in the requirements.txt file is pinned at 1.12.2, and the binary for that version available on PyPI is built against CUDA 9.

medaka is untested with tensorflow builds other than the binary available from PyPI.
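A hypothetical sanity check for this situation: confirm the installed tensorflow matches the 1.12.2 pin mentioned above. The parse_pin helper and the hard-coded requirement line are illustrative; in practice you would read the real pin out of medaka's requirements.txt.

```python
import re

def parse_pin(requirement):
    """Return (package, version) from a 'pkg==x.y.z' requirement line, else None."""
    m = re.match(r"\s*([A-Za-z0-9_.-]+)==([\w.]+)", requirement)
    return m.groups() if m else None

# The pin quoted in the comment above (illustrative; read it from requirements.txt):
pkg, pinned = parse_pin("tensorflow==1.12.2")

try:
    from importlib.metadata import version  # Python 3.8+
    print(f"pinned {pinned}, installed {version(pkg)}")
except Exception:
    print(f"pinned {pinned}; tensorflow not importable in this environment")
```

A mismatch here (or a tensorflow built against a different CUDA than the one installed) would be a plausible source of unexpected memory behaviour.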


txje commented on August 16, 2024

I noticed this too. I’m using the CUDA 9.0 toolkit and compiler, and tensorflow is installed through the medaka build chain, so it’s 1.12.2. I’m not sure whether the 10.1 reported by nvidia-smi is just the driver’s CUDA version, separate from the toolkit and nvcc, or whether 10.1 snuck in somewhere. I’m testing on a clean Google Cloud instance running Ubuntu 16.04 with a single V100 attached, and I never explicitly installed CUDA 10.1, so I don’t see how my versions could have gotten mixed up.


cjw85 commented on August 16, 2024

@txje Did you resolve this issue? It might be useful for other users to know your resolution.


txje commented on August 16, 2024

No, I have to stick with a smaller batch size - 50 works. I can get it to run as expected on some other systems, but I haven't been able to fix this particular installation.


cjw85 commented on August 16, 2024

Thanks for the feedback. It seems that memory use varies with factors beyond our control. We've had some communication with Nvidia on this matter, and there are changes coming in tensorflow that will lower memory use for RNNs such as those used in medaka.

I will close this issue, as short of reducing the batch size there is not much more we can advise.

