Comments (6)

giorgiosav commented on August 29, 2024

Is NCCL already installed in the base image? I'm looking at this line in /pytorch/base/Dockerfile:44:

RUN apt-get update && apt-get install -y --no-install-recommends --allow-downgrades \
        --allow-change-held-packages \
        libnccl2=2.0.5-3+cuda9.0 \
        libnccl-dev=2.0.5-3+cuda9.0 && \
    rm -rf /var/lib/apt/lists/*

If so, what needs to be done here? Does PyTorch need to be built differently?
(I'm new to a lot of these concepts... just finding my way around the code :D)

from mlbench-benchmarks.

martinjaggi commented on August 29, 2024

Thanks! It would be very nice to compare the performance of NCCL vs. Gloo on our existing benchmark tasks. Let's see if we can make it a bit easier to install and use. @Panaetius, what do you think?

Panaetius commented on August 29, 2024

> If so, what needs to be done here? Does PyTorch need to be built differently?

I'm not sure whether the base image contains everything that's needed. But the benchmarks themselves just run with Open MPI at the moment, so having them run with NCCL, via a switch passed as an argument to main.py or as a separate image, would be cool.
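
As a rough illustration of the "switch passed as an argument" idea, here is a minimal sketch. The `--backend` flag and both function names are hypothetical and not part of the existing main.py; the only PyTorch calls used are `torch.distributed.init_process_group`, `get_rank` and `get_world_size`.

```python
import argparse

# Backend names accepted by torch.distributed.init_process_group.
SUPPORTED_BACKENDS = ("mpi", "gloo", "nccl")


def parse_args(argv=None):
    """Parse a hypothetical --backend switch (not an existing main.py option)."""
    parser = argparse.ArgumentParser(description="Benchmark entry point (sketch)")
    parser.add_argument(
        "--backend",
        choices=SUPPORTED_BACKENDS,
        default="mpi",  # assumption: keep the current Open MPI setup as default
        help="communication backend for torch.distributed",
    )
    return parser.parse_args(argv)


def init_distributed(backend):
    """Initialize the default process group and return (rank, world_size)."""
    # Imported lazily so argument parsing works without PyTorch installed.
    import torch.distributed as dist

    # With no init_method given, PyTorch defaults to env:// initialization.
    dist.init_process_group(backend=backend)
    return dist.get_rank(), dist.get_world_size()
```

With something like this, `python main.py --backend nccl` would select NCCL while the default invocation keeps the MPI behaviour, so the same image could serve both cases.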

tlin-taolin commented on August 29, 2024

I think only the MPI backend requires building PyTorch from source; the other backends can use a pre-built PyTorch. The base image contains everything; the backends differ only in how the multiple processes of the distributed world are initialized.

Backends such as Gloo and NCCL might currently support only TCP initialization and shared-file-system initialization; it may be worth checking how Horovod is implemented to figure out how to use MPI for the initialization.

giorgiosav commented on August 29, 2024

According to the docs (https://pytorch.org/docs/stable/distributed.html#initialization), both Gloo and NCCL support environment variable initialization, which is the default. Is this how MPI is initialized at the moment?
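
For reference, environment variable initialization (`env://`) reads `MASTER_ADDR`, `MASTER_PORT`, `RANK` and `WORLD_SIZE`. A minimal sketch of how that could be combined with an MPI launcher, assuming the processes are started by Open MPI's `mpirun` (which exports `OMPI_COMM_WORLD_RANK` and `OMPI_COMM_WORLD_SIZE`). The helper names and the default address and port are assumptions for illustration, not what the benchmarks currently do.

```python
import os


def export_open_mpi_rank(env=None):
    """Copy Open MPI's rank info into the variables that env:// reads."""
    env = os.environ if env is None else env
    # Open MPI's mpirun exports these for every process it launches.
    env["RANK"] = env["OMPI_COMM_WORLD_RANK"]
    env["WORLD_SIZE"] = env["OMPI_COMM_WORLD_SIZE"]
    # Assumption: rank 0 is reachable at this address and port from all
    # workers; on a real cluster these would be set by the launcher.
    env.setdefault("MASTER_ADDR", "127.0.0.1")
    env.setdefault("MASTER_PORT", "29500")
    return env


def init_from_env(backend="gloo"):
    """Initialize Gloo or NCCL through env:// using ranks assigned by MPI."""
    # Imported lazily so the helper above is usable without PyTorch.
    import torch.distributed as dist

    export_open_mpi_rank()
    dist.init_process_group(backend=backend, init_method="env://")
    return dist.get_rank(), dist.get_world_size()
```

The design choice here is that MPI is used only as a launcher that hands out ranks, while the collective communication itself goes through Gloo or NCCL.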

giorgiosav commented on August 29, 2024

I am currently getting this error (CUDA driver version is insufficient for CUDA runtime version). Does anyone know what it could mean? I am doing some research, but any advice is appreciated.

/conda/lib/python3.6/site-packages/requests/__init__.py:91: RequestsDependencyWarning: urllib3 (1.25.8) or chardet (3.0.4) doesn't match a supported version!
  RequestsDependencyWarning)
Traceback (most recent call last):
  File "/codes/main.py", line 218, in <module>
    validation_only=args.validation_only, gpu=args.gpu, light_target=args.light)
  File "/codes/main.py", line 181, in main
    validation_only, use_cuda=gpu, light_target=light_target)
  File "/codes/main.py", line 73, in train_loop
    train_set = partition_dataset_by_rank(train_set, rank, world_size)
  File "/conda/lib/python3.6/site-packages/mlbench_core/dataset/util/pytorch/partition.py", line 108, in partition_dataset_by_rank
    partition = DataPartitioner(dataset, rank, shuffle, partition_sizes)
  File "/conda/lib/python3.6/site-packages/mlbench_core/dataset/util/pytorch/partition.py", line 71, in __init__
    indices = self.consistent_indices(rank, indices, shuffle)
  File "/conda/lib/python3.6/site-packages/mlbench_core/dataset/util/pytorch/partition.py", line 46, in consistent_indices
    dist.broadcast(indices, src=0)
  File "/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py", line 737, in broadcast
    work = _default_pg.broadcast([tensor], opts)
RuntimeError: CUDA error: CUDA driver version is insufficient for CUDA runtime version (device_count at /tmp/pip-req-build-rcyhbqmk/c10/cuda/CUDAFunctions.h:20)
frame #0: c10::Error::Error(c10::SourceLocation, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) + 0x6a (0x7f2f409f605a in /conda/lib/python3.6/site-packages/torch/lib/libc10.so)
frame #1: <unknown function> + 0x69b019 (0x7f2f66d3e019 in /conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #2: c10d::ProcessGroupNCCL::tensorCheckHelper(std::vector<at::Tensor, std::allocator<at::Tensor> > const&, std::vector<at::Tensor, std::allocator<at::Tensor> > const&, int) + 0x58 (0x7f2f66d37fa8 in /conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #3: c10d::ProcessGroupNCCL::broadcast(std::vector<at::Tensor, std::allocator<at::Tensor> >&, c10d::BroadcastOptions const&) + 0x43 (0x7f2f66d3a5e3 in /conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5ec3e9 (0x7f2f66c8f3e9 in /conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
frame #5: <unknown function> + 0x10fd2d (0x7f2f667b2d2d in /conda/lib/python3.6/site-packages/torch/lib/libtorch_python.so)
<omitting python frames>
frame #40: main + 0x16c (0x400bbc in /conda/bin/python)
frame #41: __libc_start_main + 0xf0 (0x7f2f74c0b830 in /lib/x86_64-linux-gnu/libc.so.6)
frame #42: /conda/bin/python() [0x400c7d]
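
The trace fails inside `ProcessGroupNCCL::broadcast` while it queries the CUDA device count. This error typically means the NVIDIA driver visible to the process is older than the CUDA runtime PyTorch was built against, or that no driver is visible at all (for example, a container started without GPU access). A minimal sketch of a guard that checks this before selecting NCCL; both function names are hypothetical, and falling back to Gloo is an assumption about the desired behaviour rather than something the benchmarks do.

```python
def cuda_usable():
    """Return True only if PyTorch can actually see a CUDA device."""
    try:
        import torch

        return torch.cuda.is_available() and torch.cuda.device_count() > 0
    except Exception:
        # Covers a missing PyTorch install as well as driver/runtime
        # errors raised while probing for devices.
        return False


def choose_backend(requested, cuda_ok):
    """Fall back from NCCL to Gloo when CUDA is unusable (assumed policy)."""
    if requested == "nccl" and not cuda_ok:
        return "gloo"
    return requested
```

Calling `choose_backend("nccl", cuda_usable())` before `init_process_group` would avoid the crash, although the underlying driver or container setup would still need to be fixed for NCCL to be benchmarked.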
