Comments (6)

Yangqing commented on June 4, 2024

FWIW, we found when developing Caffe2 that if you have multi-threading going on and one thread calls cudaMallocHost() or cudaFree(), that causes NCCL to freeze. What solved the problem for us was guarding all malloc, free, and NCCL calls so they don't overlap. That needed some good care (thanks to @akyrola). An example can be seen here:

https://github.com/caffe2/caffe2/blob/master/caffe2/core/context_gpu.cu#L284
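
The guarding idea described above can be sketched roughly as follows. This is a minimal illustration, not Caffe2's actual code: `gpu_call_mutex`, `guarded`, `fake_gpu_call`, and `run_demo` are hypothetical names, and `fake_gpu_call` is a placeholder standing in for real calls such as cudaMallocHost(), cudaFree(), or ncclAllReduce(). The sketch assumes a single process-wide mutex is acceptable, which serializes every guarded call.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <thread>
#include <vector>

// One process-wide mutex shared by allocation, free, and collective calls.
// (Hypothetical helper; not taken from Caffe2 or NCCL.)
std::mutex& gpu_call_mutex() {
  static std::mutex m;
  return m;
}

// Runs fn while holding the shared mutex, so guarded calls never overlap.
template <typename Fn>
auto guarded(Fn&& fn) -> decltype(fn()) {
  std::lock_guard<std::mutex> lock(gpu_call_mutex());
  return fn();
}

// Counters used only to show that guarded sections do not overlap.
std::atomic<int> in_flight{0};
std::atomic<int> max_in_flight{0};

// Placeholder for a real call such as cudaMallocHost, cudaFree,
// or ncclAllReduce. It records how many calls are active at once.
void fake_gpu_call() {
  int now = ++in_flight;
  int prev = max_in_flight.load();
  while (now > prev && !max_in_flight.compare_exchange_weak(prev, now)) {
  }
  std::this_thread::yield();  // give other threads a chance to interleave
  --in_flight;
}

// Spawns several threads that all issue guarded calls and returns the
// largest number of calls that were ever active at the same time.
int run_demo(int num_threads, int calls_per_thread) {
  in_flight = 0;
  max_in_flight = 0;
  std::vector<std::thread> workers;
  for (int t = 0; t < num_threads; ++t) {
    workers.emplace_back([=] {
      for (int i = 0; i < calls_per_thread; ++i) {
        guarded(fake_gpu_call);
      }
    });
  }
  for (auto& w : workers) {
    w.join();
  }
  assert(in_flight.load() == 0);
  return max_in_flight.load();
}
```

In a real program, each lambda passed to `guarded` would wrap the actual CUDA or NCCL call. Note that a mutex like this only coordinates threads inside one process; it does nothing across separate MPI processes.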

from nccl.

sjeaugey commented on June 4, 2024

First, are you running all the processes on the same node?

In any case, you should not need a barrier in the case of MPI, only with multiple threads. A small code sample to reproduce could help.

hiyijian commented on June 4, 2024

Yes, it happens on a single node.
The project that hangs is Caffe, calling ncclAllReduce during the backward pass. I will try to reproduce it with a small demo.
Thanks.

hiyijian commented on June 4, 2024

Hi @Yangqing, thank you very much. I am working on our own branch of a very old version of Caffe with OpenMPI integrated. In other words, our case is multi-process on a single node, with one GPU per process. I think no thread-based lock is needed. Am I right?

hiyijian commented on June 4, 2024

P.S. It hangs roughly once every 1 to 1.5 days. After each freeze I have to resume training from a snapshot. The trained model's performance is quite good.

sjeaugey commented on June 4, 2024

Closing old issue. Please re-open if you still have issues with 2.3.
