Comments (24)
Yes, this is EXACTLY the behavior I'm seeing with CUDA on our new Exxact node. Peter and I are trying to track down the issue.
It looks like nvcc is spin-locking on the other processes. Can you do a 'ps xauwww | grep nvcc' when these processes are hanging and see if this is what you're seeing too?
from brokenyank.
This was Peter Eastman's suggestion. I still need to try the tests he mentions.
I have a simple mpi4py test I'll send you as well.
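Roughly, the test just has each MPI rank build a tiny System and create a Context on the CUDA platform, which is what triggers the nvcc compile. A minimal sketch of that kind of script (not the exact file I'll send) looks something like this:

from mpi4py import MPI
import simtk.openmm as mm

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

platform = mm.Platform.getPlatformByName('CUDA')
print("rank %d/%d platform %s" % (rank, size, platform.getName()))

# A trivial one-particle system is enough to force kernel compilation.
system = mm.System()
system.addParticle(39.9)
integrator = mm.VerletIntegrator(0.001)  # step size in ps

print("rank %d/%d creating context..." % (rank, size))
context = mm.Context(system, integrator, platform)
print("rank %d/%d context created" % (rank, size))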
John
---------- Forwarded message ----------
From: Peter Eastman [email protected]
Let's consider what we know here.
nvcc is being successfully launched, since we can see it in the ps output. Furthermore, we can see it's using 99% of a core, so it clearly is doing something. Assuming it's spinning while waiting for a lock (a reasonable hypothesis, but not at all certain), it's a lock that nvcc itself looks for, not anything in OpenMM.
Waiting for nvcc to finish in one process does not allow it to work when called from another process. So if it's a lock, that lock does not get released when nvcc itself exits but the parent process does not. (Or there's a bug somewhere that keeps it from realizing the lock is released.)
We haven't determined yet whether the parent process exiting allows nvcc to succeed in another process.
A second script which does not use MPI but is otherwise similar (it launches several processes, each of which compiles kernels at the same time) does work. This seems very odd. Are there any other obvious differences in what the scripts are doing? Does MPI do anything "strange" to the processes it creates?
What happens if you launch two independent MPI jobs at the same time, each of which creates a single process?
What MPI implementation are you using? Is it one of the ones that has special CUDA features built in?
Peter
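For reference, the non-MPI control Peter describes could be as simple as spawning the workers with multiprocessing instead of mpirun; here's a sketch of one way to do it (not the actual script that was used):

import multiprocessing
import simtk.openmm as mm

def make_context(worker_id):
    # Each worker builds a tiny system and creates a CUDA Context,
    # so all workers compile kernels at the same time.
    platform = mm.Platform.getPlatformByName('CUDA')
    system = mm.System()
    system.addParticle(39.9)
    integrator = mm.VerletIntegrator(0.001)
    context = mm.Context(system, integrator, platform)
    print("worker %d: context created" % worker_id)

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=make_context, args=(i,)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()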
from brokenyank.
Here's what's running:
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 29699 0.0 0.0 4400 604 pts/1 S+ 16:48 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x1950f80.ptx" --use_fast_math "/tmp/openmmTempKernel0x1950f80.cu" 2> "/tmp/openmmTempKernel0x1950f80.log"
kyleb 29701 99.1 0.0 6680 732 pts/1 R+ 16:48 1:21 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x1950f80.ptx --use_fast_math /tmp/openmmTempKernel0x1950f80.cu
kyleb 30062 0.0 0.0 13584 900 pts/2 S+ 16:49 0:00 grep --color=auto nvcc
from brokenyank.
mpitest.py seems to hang as well.
from brokenyank.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 30386 0.0 0.0 4400 604 pts/1 S+ 16:52 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x2034060.ptx" --use_fast_math "/tmp/openmmTempKernel0x2034060.cu" 2> "/tmp/openmmTempKernel0x2034060.log"
kyleb 30389 99.8 0.0 6684 736 pts/1 R+ 16:52 1:15 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x2034060.ptx --use_fast_math /tmp/openmmTempKernel0x2034060.cu
kyleb 30561 0.0 0.0 13584 896 pts/2 S+ 16:53 0:00 grep --color=auto nvcc
from brokenyank.
It looks like nvcc is at 99.1% CPU utilization, which suggested to Peter and me some sort of spin lock.
Even if we manually set each process's CudaTempDirectory platform property to be a different directory, this isn't sufficient to get past the spin-lock.
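For concreteness, this is roughly how we're setting it (a sketch, not the actual YANK code; 'CudaDeviceIndex' is the device property name as I remember it, and 'CudaTempDirectory' is the one mentioned above):

import os
from mpi4py import MPI
import simtk.openmm as mm

rank = MPI.COMM_WORLD.Get_rank()

# Give every rank its own GPU and its own temp directory for the .cu/.ptx files.
tempdir = "CUDA%d" % rank
if not os.path.exists(tempdir):
    os.makedirs(tempdir)
properties = {'CudaDeviceIndex': str(rank), 'CudaTempDirectory': tempdir}

platform = mm.Platform.getPlatformByName('CUDA')
system = mm.System()
system.addParticle(39.9)
integrator = mm.VerletIntegrator(0.001)
context = mm.Context(system, integrator, platform, properties)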
from brokenyank.
OK, so this simple example can recapitulate the buggy behavior, which is good.
Can you run two copies of mpitest.py with one process each and see if that works?
from brokenyank.
I'm off to a meeting for an hour or so, I'll run it this evening.
from brokenyank.
Thanks!
from brokenyank.
I ran this in two screens and it runs fine.
mpirun -np 1 ~/src/yank/src/mpitest.py
from brokenyank.
Great! I think this means either (1) the fact that the mpirun-spawned processes have a parent process is causing trouble, or (2) mpirun is doing something funny to the spawned processes that makes them different from shell-spawned processes, which in turn interacts badly with nvcc and its spin lock.
from brokenyank.
Actually, I wonder if it has something to do with the fact that the mpirun processes are launched at exactly the same time. This could cause some random number seed to be the same for generating temporary file names...
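Just to illustrate the idea (a toy example, not a claim about how nvcc actually names its files): two processes that seed a PRNG with the same second-resolution timestamp will pick identical "random" temp-file names and collide.

import random, time

def temp_name(seed):
    rng = random.Random(seed)
    return "tmp%08x" % rng.getrandbits(32)

# Two processes launched within the same second would compute the same seed...
seed = int(time.time())
# ...and therefore generate the same "random" name.
print(temp_name(seed) == temp_name(seed))  # True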
from brokenyank.
Still hangs
from brokenyank.
kyleb@amd6core:~$ mpirun -np 2 ~/src/yank/src/mpitest3.py
rank 1/2 platform CUDA deviceid 1
rank 0/2 platform CUDA deviceid 0
rank 1/2 creating context...
rank 0/2 creating context...
rank 0/2 context created in 6.027 s
from brokenyank.
Does the same thing for me. Do you at least see some additional temporary directories being created and used? And can you send the output of 'ps xauwww | grep nvcc'?
from brokenyank.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 20905 0.0 0.0 4396 596 pts/0 S+ 18:17 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "CUDA1/openmmTempKernel0x19cd340.ptx" --use_fast_math "CUDA1/openmmTempKernel0x19cd340.cu" 2> "CUDA1/openmmTempKernel0x19cd340.log"
kyleb 20907 99.5 0.0 6680 732 pts/0 R+ 18:17 3:11 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o CUDA1/openmmTempKernel0x19cd340.ptx --use_fast_math CUDA1/openmmTempKernel0x19cd340.cu
kyleb 21241 0.0 0.0 13580 888 pts/1 S+ 18:20 0:00 grep --color=auto nvcc
from brokenyank.
I see CUDA0 and CUDA1
from brokenyank.
I'm running the mpirun that comes with Canopy 1.0.0. Are you by chance running the same version?
[chodera@node05 src]$ mpirun --version
HYDRA build details:
Version: 1.4.1
Release Date: Wed Aug 24 14:40:04 CDT 2011
from brokenyank.
kyleb@kb-intel:~/dat/ala-lvbp/amber99$ mpirun --version
HYDRA build details:
Version: 1.4.1p1
Release Date: Thu Sep 1 13:53:02 CDT 2011
CC: gcc
CXX: c++
F77: gfortran
F90: f95
from brokenyank.
I'm using Anaconda, not Canopy.
from brokenyank.
I'm curious if the choice of "launcher" has any impact. If you're up to trying a few of the launchers, that would provide some useful information, I think:
Launch options:
-launcher launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
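e.g. (if I have the syntax right) something like:
mpirun -launcher fork -np 2 ~/src/yank/src/mpitest.py
mpirun -launcher ssh -np 2 ~/src/yank/src/mpitest.py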
from brokenyank.
I think ssh/rsh and fork may be the important ones to try.
from brokenyank.
So I just tried the Ubuntu 12.04 mpich mpirun and found the same results. It's also HYDRA 1.4.1, though.
I tried ssh and fork. Same.
from brokenyank.
OK, thanks. Still not at all sure what is going on here. Independent processes seem to work totally fine when accessing different GPUs...
from brokenyank.