Comments (24)

jchodera avatar jchodera commented on June 12, 2024

Yes, this is EXACTLY the behavior I'm seeing with CUDA on our new Exxact node. Peter and I are trying to track down the issue.

It looks like nvcc is spin-locking on the other processes. Can you do a 'ps xauwww | grep nvcc' when these processes are hanging and see if this is what you're seeing too?

jchodera avatar jchodera commented on June 12, 2024

This was Peter Eastman's suggestion. I still need to try the tests he mentions.

I have a simple mpi4py test I'll send you as well.

John

---------- Forwarded message ----------
From: Peter Eastman [email protected]

Let's consider what we know here.

nvcc is being successfully launched, since we can see it in the ps output. Furthermore, we can see it's using 99% of a core, so it clearly is doing something. Assuming it's spinning while waiting for a lock (a reasonable hypothesis, but not at all certain), it's a lock that nvcc itself looks for, not anything in OpenMM.

Waiting for nvcc to finish in one process does not allow it to work when called from another process. So if it's a lock, that lock is not released when nvcc itself exits while the parent process keeps running. (Or there's a bug somewhere that keeps nvcc from realizing the lock has been released.)

We haven't determined yet whether the parent process exiting allows nvcc to succeed in another process.

A second script that does not use MPI but is otherwise similar (it launches several processes, each of which compiles kernels at the same time) does work. This seems very odd. Are there any other obvious differences in what the scripts are doing? Does MPI do anything "strange" to the processes it creates?

What happens if you launch two independent MPI jobs at the same time, each of which creates a single process?

What MPI implementation are you using? Is it one of the ones that has special CUDA features built in?

Peter
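
The simple mpi4py test mentioned above (mpitest.py, referenced later in the thread) is not shown here. A minimal sketch of that kind of test, assuming it just builds a trivial OpenMM System and times Context creation on the CUDA platform (the step that invokes nvcc), and assuming the CudaDeviceIndex property name, might look like this:

import time
from mpi4py import MPI
from simtk import openmm

comm = MPI.COMM_WORLD
rank, size = comm.rank, comm.size

# A trivial one-particle system is enough to force the CUDA platform to
# compile kernels (the nvcc invocation seen in the ps output below).
system = openmm.System()
system.addParticle(39.9)
integrator = openmm.VerletIntegrator(0.001)

platform = openmm.Platform.getPlatformByName('CUDA')
# Assumed property name: one GPU per MPI rank.
properties = {'CudaDeviceIndex': str(rank)}

print('rank %d/%d platform CUDA deviceid %d' % (rank, size, rank))
print('rank %d/%d creating context...' % (rank, size))
start = time.time()
context = openmm.Context(system, integrator, platform, properties)
print('rank %d/%d context created in %.3f s' % (rank, size, time.time() - start))

Launched as in the thread (e.g. mpirun -np 2 mpitest.py), both ranks reach the kernel-compilation step at the same time.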

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

Here's what's running:

kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb    29699  0.0  0.0   4400   604 pts/1    S+   16:48   0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x1950f80.ptx" --use_fast_math "/tmp/openmmTempKernel0x1950f80.cu" 2> "/tmp/openmmTempKernel0x1950f80.log"
kyleb    29701 99.1  0.0   6680   732 pts/1    R+   16:48   1:21 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x1950f80.ptx --use_fast_math /tmp/openmmTempKernel0x1950f80.cu
kyleb    30062  0.0  0.0  13584   900 pts/2    S+   16:49   0:00 grep --color=auto nvcc

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

mpitest.py seems to hang as well.

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb    30386  0.0  0.0   4400   604 pts/1    S+   16:52   0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x2034060.ptx" --use_fast_math "/tmp/openmmTempKernel0x2034060.cu" 2> "/tmp/openmmTempKernel0x2034060.log"
kyleb    30389 99.8  0.0   6684   736 pts/1    R+   16:52   1:15 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x2034060.ptx --use_fast_math /tmp/openmmTempKernel0x2034060.cu
kyleb    30561  0.0  0.0  13584   896 pts/2    S+   16:53   0:00 grep --color=auto nvcc

jchodera avatar jchodera commented on June 12, 2024

It looks like that nvcc process is at 99.1% CPU utilization, which suggested to Peter and me some sort of spin lock.

Even if we manually set each process's CudaTempDirectory platform property to a different directory, that isn't sufficient to get past the spin lock.
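
A sketch of that workaround attempt, assuming the properties are passed at Context creation (the directory names are illustrative; the output later in the thread shows CUDA0 and CUDA1):

import os
from mpi4py import MPI

rank = MPI.COMM_WORLD.rank
tempdir = 'CUDA%d' % rank            # illustrative name; matches the directories seen below
if not os.path.isdir(tempdir):
    os.makedirs(tempdir)

properties = {
    'CudaDeviceIndex': str(rank),    # one GPU per rank
    'CudaTempDirectory': tempdir,    # separate nvcc temp directory per rank
}
# context = openmm.Context(system, integrator, platform, properties)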

jchodera avatar jchodera commented on June 12, 2024

OK, so this simple example can recapitulate the buggy behavior, which is good.

Can you run two copies of the mpitest.py with one process each and see if that works?

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

I'm off to a meeting for an hour or so, I'll run it this evening.

jchodera avatar jchodera commented on June 12, 2024

Thanks!

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

I ran this in two screens and it runs fine.

mpirun -np 1 ~/src/yank/src/mpitest.py

jchodera avatar jchodera commented on June 12, 2024

Great! I think this means either (1) the fact that the mpirun-spawned processes have a parent process is causing trouble, or (2) mpirun is doing something unusual to the processes it spawns that makes them different from shell-spawned processes, and that in turn trips up nvcc and its locking.

jchodera avatar jchodera commented on June 12, 2024

Actually, I wonder if it has something to do with the fact that the mpirun processes are launched at exactly the same time. This could cause some random number seed to be the same for generating temporary file names...
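
Purely to illustrate that concern (this is not OpenMM's actual naming scheme; the ps output above suggests the temporary names come from a heap address such as 0x1950f80 rather than a random seed): two processes that seed a generator from the wall clock in the same second would pick identical "random" file names.

import random
import time

def temp_kernel_name(seed):
    # Hypothetical naming scheme used only for this illustration.
    rng = random.Random(seed)
    return '/tmp/openmmTempKernel%08x.cu' % rng.getrandbits(32)

seed = int(time.time())          # identical in processes started within the same second
print(temp_kernel_name(seed))    # both processes would print the same path and collide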

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

Still hangs

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024
kyleb@amd6core:~$ mpirun -np 2 ~/src/yank/src/mpitest3.py 
rank 1/2 platform CUDA deviceid 1
rank 0/2 platform CUDA deviceid 0
rank 1/2 creating context...
rank 0/2 creating context...
rank 0/2 context created in 6.027 s

jchodera avatar jchodera commented on June 12, 2024

Does the same thing for me. Do you at least see some additional temporary directories being created and used? And can you send the output of 'ps xauwww | grep nvcc'?

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb    20905  0.0  0.0   4396   596 pts/0    S+   18:17   0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "CUDA1/openmmTempKernel0x19cd340.ptx" --use_fast_math "CUDA1/openmmTempKernel0x19cd340.cu" 2> "CUDA1/openmmTempKernel0x19cd340.log"
kyleb    20907 99.5  0.0   6680   732 pts/0    R+   18:17   3:11 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o CUDA1/openmmTempKernel0x19cd340.ptx --use_fast_math CUDA1/openmmTempKernel0x19cd340.cu
kyleb    21241  0.0  0.0  13580   888 pts/1    S+   18:20   0:00 grep --color=auto nvcc

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

I see CUDA0 and CUDA1

jchodera avatar jchodera commented on June 12, 2024

I'm running the mpirun that comes with Canopy 1.0.0. Are you by chance running the same version?

[chodera@node05 src]$ mpirun --version
HYDRA build details:
Version: 1.4.1
Release Date: Wed Aug 24 14:40:04 CDT 2011

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

kyleb@kb-intel:~/dat/ala-lvbp/amber99$ mpirun --version
HYDRA build details:
Version: 1.4.1p1
Release Date: Thu Sep 1 13:53:02 CDT 2011
CC: gcc
CXX: c++
F77: gfortran
F90: f95

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

I'm using Anaconda, not Canopy.

jchodera avatar jchodera commented on June 12, 2024

I'm curious if the choice of "launcher" has any impact. If you're up to trying a few of the launchers, that would provide some useful information, I think:

  Launch options:
    -launcher                        launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
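
For example, assuming Hydra accepts the flag in this position, forcing the fork launcher would look like:

mpirun -launcher fork -np 2 ~/src/yank/src/mpitest.py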

jchodera avatar jchodera commented on June 12, 2024

I think ssh/rsh and fork may be the important ones to try.

kyleabeauchamp avatar kyleabeauchamp commented on June 12, 2024

So I just tried the Ubuntu 12.04 mpich mpirun and found the same results. It's also HYDRA 1.4.1, though.

I tried ssh and fork. Same.

jchodera avatar jchodera commented on June 12, 2024

OK, thanks. Still not at all sure what is going on here. Independent processes seem to work totally fine when accessing different GPUs...
