
OOM in tensorflow · medaka · closed · 9 comments

nanoporetech commented on August 16, 2024
OOM in tensorflow

from medaka.

Comments (9)

cjw85 commented on August 16, 2024

@txje,

To avoid OOM errors, reduce the batch size with the -b option when running medaka_consensus.
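A minimal sketch of that suggestion: start from the default batch size and halve -b until the run fits in GPU memory. The input/output paths below are illustrative placeholders (check medaka_consensus --help for your version), and the echo stands in for the real invocation:

```shell
#!/bin/sh
# Halve the batch size until the run fits in GPU memory.
# Paths are placeholders; echo stands in for the real command.
b=200                       # starting batch size (assumed default)
while [ "$b" -ge 25 ]; do
  echo "medaka_consensus -i basecalls.fastq -d draft.fasta -o medaka_out -b $b"
  b=$((b / 2))
done
```

In practice you would run each command and stop at the first batch size that completes without an OOM error.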


txje commented on August 16, 2024

If I halve the batch size with -b 100, I get the same thing, but it fails allocating a tensor of 768M instead...
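Back-of-envelope arithmetic on the numbers reported above (assuming the failing tensor scales linearly with batch size, as the halving from 1536M to 768M suggests):

```python
# Estimate per-sample memory for the failing allocation and project
# what the same tensor would need at a smaller batch size.
fail_batch = 100          # batch size that still failed (-b 100)
fail_tensor_mib = 768     # size of the tensor TensorFlow could not allocate

per_sample_mib = fail_tensor_mib / fail_batch   # MiB per batch element

# At -b 50 (the batch size that ultimately worked, per this thread),
# the same tensor would need roughly half the memory:
est_50_mib = per_sample_mib * 50
print(f"~{per_sample_mib:.2f} MiB/sample, ~{est_50_mib:.0f} MiB at -b 50")
```

This only accounts for the single failing tensor; the nvidia-smi snapshots below show the process holding ~15.7 GiB overall, so other allocations dominate the total footprint.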


cjw85 commented on August 16, 2024

If absolutely nothing else is running on the GPU, I must admit I am at a loss here; a 16 GB GPU should easily handle a batch size of 100. Can you watch the output of nvidia-smi before, during, and after medaka_consensus runs and report what you observe?


txje commented on August 16, 2024

Before:

Tue May 28 17:44:18 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

During:
It ramps up to ~397 MiB for several seconds, then jumps to 15,345 MiB and quickly from there to 15,731 MiB, where it sat for several seconds before crashing.

Tue May 28 17:44:34 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   45C    P0    57W / 300W |  15731MiB / 16130MiB |     86%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|    0      7890      C   ...me/ubuntu/.conda/envs/medaka/bin/python 15721MiB |
+-----------------------------------------------------------------------------+

After:

Tue May 28 17:44:52 2019       
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 418.67       Driver Version: 418.67       CUDA Version: 10.1     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla V100-SXM2...  Off  | 00000000:00:04.0 Off |                    0 |
| N/A   42C    P0    40W / 300W |      0MiB / 16130MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID   Type   Process name                             Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+


cjw85 commented on August 16, 2024

I notice from above you are using CUDA 10.1.

Which version of tensorflow are you using, and how did you build/obtain it? The tensorflow version in the requirements.txt file is pinned at 1.12.2, and the binary for that version available on PyPI is built against CUDA 9.

medaka is untested with tensorflow builds other than the binary available from PyPI.
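A hypothetical sanity check for this situation: confirm the installed tensorflow matches the 1.12.2 pin mentioned above. The parse_pin helper and the hard-coded requirement line are illustrative; in practice you would read the real pin out of medaka's requirements.txt.

```python
import re

def parse_pin(requirement):
    """Return (package, version) from a 'pkg==x.y.z' requirement line, else None."""
    m = re.match(r"\s*([A-Za-z0-9_.-]+)==([\w.]+)", requirement)
    return m.groups() if m else None

# The pin quoted in the comment above (illustrative; read it from requirements.txt):
pkg, pinned = parse_pin("tensorflow==1.12.2")

try:
    from importlib.metadata import version  # Python 3.8+
    print(f"pinned {pinned}, installed {version(pkg)}")
except Exception:
    print(f"pinned {pinned}; tensorflow not importable in this environment")
```

A mismatch here (or a tensorflow built against a different CUDA than the one installed) would be a plausible source of unexpected memory behaviour.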


txje commented on August 16, 2024

I noticed this too. I’m using the CUDA 9.0 toolkit and compiler, and tensorflow is installed through the medaka build chain, so it’s 1.12.2. I’m not sure whether the 10.1 reported by nvidia-smi is just the driver’s CUDA version, separate from the toolkit and nvcc, or whether 10.1 snuck in somewhere. I’m testing on a clean Google Cloud instance running Ubuntu 16.04 with a single V100 attached, and I never explicitly installed CUDA 10.1, so I don’t see how my versions could have gotten mixed up.


cjw85 commented on August 16, 2024

@txje Did you resolve this issue? It might be useful for other users to know your resolution.


txje commented on August 16, 2024

No, I have to stick with a smaller batch size - 50 works. I can get it to run as expected on some other systems, but I haven't been able to fix this particular installation.


cjw85 commented on August 16, 2024

Thanks for the feedback. It seems that memory use varies with factors beyond our control. We've had some communication with Nvidia on this matter, and there are changes coming in tensorflow that will lower memory use for RNNs such as those used in medaka.

I will close this issue, as short of reducing the batch size there is not much more we can advise.

