Comments (24)
Yes, this is EXACTLY the behavior I'm seeing with CUDA on our new Exxact node. Peter and I are trying to track down the issue.
It looks like nvcc is spin-locking on the other processes. Can you do a 'ps xauwww | grep nvcc' when these processes are hanging and see if this is what you're seeing too?
from brokenyank.
This was Peter Eastman's suggestion. I still need to try the tests he mentions.
I have a simple mpi4py test I'll send you as well.
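Roughly, the test just has each MPI rank build a tiny System and create a Context on the CUDA platform, which is what triggers the nvcc compile. A minimal sketch of that kind of script (not the exact file I'll send) looks something like this:

from mpi4py import MPI
import simtk.openmm as mm

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

platform = mm.Platform.getPlatformByName('CUDA')
print("rank %d/%d platform %s" % (rank, size, platform.getName()))

# A trivial one-particle system is enough to force kernel compilation.
system = mm.System()
system.addParticle(39.9)
integrator = mm.VerletIntegrator(0.001)  # step size in ps

print("rank %d/%d creating context..." % (rank, size))
context = mm.Context(system, integrator, platform)
print("rank %d/%d context created" % (rank, size))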
John
---------- Forwarded message ----------
From: Peter Eastman [email protected]
Let's consider what we know here.
nvcc is being successfully launched, since we can see it in the ps output. Furthermore, we can see it's using 99% of a core, so it clearly is doing something. Assuming it's spinning while waiting for a lock (a reasonable hypothesis, but not at all certain), it's a lock that nvcc itself looks for, not anything in OpenMM.
Waiting for nvcc to finish in one process does not allow it to work when called from another process. So if it's a lock, that lock does not get released when nvcc itself exits but the parent process does not. (Or there's a bug somewhere that keeps it from realizing the lock is released.)
We haven't determined yet whether the parent process exiting allows nvcc to succeed in another process.
A second script which does not use MPI but is otherwise similar (it launches several processes, each of which compiles kernels at the same time) does work. This seems very odd. Are there any other obvious differences in what the scripts are doing? Does MPI do anything "strange" to the processes it creates?
What happens if you launch two independent MPI jobs at the same time, each of which creates a single process?
What MPI implementation are you using? Is it one of the ones that has special CUDA features built in?
Peter
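For reference, the non-MPI control Peter describes could be as simple as spawning the workers with multiprocessing instead of mpirun; here's a sketch of one way to do it (not the actual script that was used):

import multiprocessing
import simtk.openmm as mm

def make_context(worker_id):
    # Each worker builds a tiny system and creates a CUDA Context,
    # so all workers compile kernels at the same time.
    platform = mm.Platform.getPlatformByName('CUDA')
    system = mm.System()
    system.addParticle(39.9)
    integrator = mm.VerletIntegrator(0.001)
    context = mm.Context(system, integrator, platform)
    print("worker %d: context created" % worker_id)

if __name__ == '__main__':
    workers = [multiprocessing.Process(target=make_context, args=(i,)) for i in range(2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()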
from brokenyank.
Here's what's running:
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 29699 0.0 0.0 4400 604 pts/1 S+ 16:48 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x1950f80.ptx" --use_fast_math "/tmp/openmmTempKernel0x1950f80.cu" 2> "/tmp/openmmTempKernel0x1950f80.log"
kyleb 29701 99.1 0.0 6680 732 pts/1 R+ 16:48 1:21 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x1950f80.ptx --use_fast_math /tmp/openmmTempKernel0x1950f80.cu
kyleb 30062 0.0 0.0 13584 900 pts/2 S+ 16:49 0:00 grep --color=auto nvcc
from brokenyank.
mpitest.py seems to hang as well.
from brokenyank.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 30386 0.0 0.0 4400 604 pts/1 S+ 16:52 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "/tmp/openmmTempKernel0x2034060.ptx" --use_fast_math "/tmp/openmmTempKernel0x2034060.cu" 2> "/tmp/openmmTempKernel0x2034060.log"
kyleb 30389 99.8 0.0 6684 736 pts/1 R+ 16:52 1:15 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o /tmp/openmmTempKernel0x2034060.ptx --use_fast_math /tmp/openmmTempKernel0x2034060.cu
kyleb 30561 0.0 0.0 13584 896 pts/2 S+ 16:53 0:00 grep --color=auto nvcc
from brokenyank.
It looks like nvcc is at 99.1% CPU utilization, which suggested to Peter and me some sort of spin lock.
Even if we manually set each process's CudaTempDirectory platform property to be a different directory, this isn't sufficient to get past the spin-lock.
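For concreteness, this is roughly how we're setting it (a sketch, not the actual YANK code; 'CudaDeviceIndex' is the device property name as I remember it, and 'CudaTempDirectory' is the one mentioned above):

import os
from mpi4py import MPI
import simtk.openmm as mm

rank = MPI.COMM_WORLD.Get_rank()

# Give every rank its own GPU and its own temp directory for the .cu/.ptx files.
tempdir = "CUDA%d" % rank
if not os.path.exists(tempdir):
    os.makedirs(tempdir)
properties = {'CudaDeviceIndex': str(rank), 'CudaTempDirectory': tempdir}

platform = mm.Platform.getPlatformByName('CUDA')
system = mm.System()
system.addParticle(39.9)
integrator = mm.VerletIntegrator(0.001)
context = mm.Context(system, integrator, platform, properties)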
from brokenyank.
OK, so this simple example can recapitulate the buggy behavior, which is good.
Can you run two copies of mpitest.py with one process each and see if that works?
from brokenyank.
I'm off to a meeting for an hour or so, I'll run it this evening.
from brokenyank.
Thanks!
from brokenyank.
I ran this in two screens and it runs fine.
mpirun -np 1 ~/src/yank/src/mpitest.py
from brokenyank.
Great! I think this means either (1) the fact that the mpirun-spawned processes have a parent process is causing trouble, or (2) mpirun is doing something funny to the spawned processes that makes them different from shell-spawned processes, which in turn interacts badly with nvcc and its spin lock.
from brokenyank.
Actually, I wonder if it has something to do with the fact that the mpirun processes are launched at exactly the same time. This could cause some random number seed to be the same for generating temporary file names...
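Just to illustrate the idea (a toy example, not a claim about how nvcc actually names its files): two processes that seed a PRNG with the same second-resolution timestamp will pick identical "random" temp-file names and collide.

import random, time

def temp_name(seed):
    rng = random.Random(seed)
    return "tmp%08x" % rng.getrandbits(32)

# Two processes launched within the same second would compute the same seed...
seed = int(time.time())
# ...and therefore generate the same "random" name.
print(temp_name(seed) == temp_name(seed))  # True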
from brokenyank.
Still hangs
from brokenyank.
kyleb@amd6core:~$ mpirun -np 2 ~/src/yank/src/mpitest3.py
rank 1/2 platform CUDA deviceid 1
rank 0/2 platform CUDA deviceid 0
rank 1/2 creating context...
rank 0/2 creating context...
rank 0/2 context created in 6.027 s
from brokenyank.
Does the same thing for me. Do you at least see some additional temporary directories being created and used? And can you send the output of 'ps xauwww | grep nvcc'?
from brokenyank.
kyleb@amd6core:~$ ps xauwww | grep nvcc
kyleb 20905 0.0 0.0 4396 596 pts/0 S+ 18:17 0:00 sh -c "/usr/local/cuda/bin/nvcc" --ptx --machine 64 -arch=sm_30 -o "CUDA1/openmmTempKernel0x19cd340.ptx" --use_fast_math "CUDA1/openmmTempKernel0x19cd340.cu" 2> "CUDA1/openmmTempKernel0x19cd340.log"
kyleb 20907 99.5 0.0 6680 732 pts/0 R+ 18:17 3:11 /usr/local/cuda/bin/nvcc --ptx --machine 64 -arch sm_30 -o CUDA1/openmmTempKernel0x19cd340.ptx --use_fast_math CUDA1/openmmTempKernel0x19cd340.cu
kyleb 21241 0.0 0.0 13580 888 pts/1 S+ 18:20 0:00 grep --color=auto nvcc
from brokenyank.
I see CUDA0 and CUDA1
from brokenyank.
I'm running the mpirun that comes with Canopy 1.0.0. Are you by chance running the same version?
[chodera@node05 src]$ mpirun --version
HYDRA build details:
Version: 1.4.1
Release Date: Wed Aug 24 14:40:04 CDT 2011
from brokenyank.
kyleb@kb-intel:~/dat/ala-lvbp/amber99$ mpirun --version
HYDRA build details:
Version: 1.4.1p1
Release Date: Thu Sep 1 13:53:02 CDT 2011
CC: gcc
CXX: c++
F77: gfortran
F90: f95
from brokenyank.
I'm using Anaconda, not Canopy.
from brokenyank.
I'm curious if the choice of "launcher" has any impact. If you're up to trying a few of the launchers, that would provide some useful information, I think:
Launch options:
-launcher launcher to use ( ssh rsh fork slurm ll lsf sge manual persist)
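e.g. (if I have the syntax right) something like:
mpirun -launcher fork -np 2 ~/src/yank/src/mpitest.py
mpirun -launcher ssh -np 2 ~/src/yank/src/mpitest.py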
from brokenyank.
I think ssh/rsh and fork may be the important ones to try.
from brokenyank.
So I just tried the Ubuntu 12.04 mpich mpirun and found the same results. It's also HYDRA 1.4.1, though.
I tried ssh and fork. Same.
from brokenyank.
OK, thanks. Still not at all sure what is going on here. Independent processes seem to work totally fine when accessing different GPUs...
from brokenyank.