
federallab / openfed


A Comprehensive and Versatile Open-Source Federated Learning Framework

Home Page: https://openfed.readthedocs.io

License: MIT License

Python 99.61% Dockerfile 0.09% Shell 0.31%
artificial-intelligence benchmark deep-learning distributed-learning federated-learning framework pytorch safety

openfed's Issues

RuntimeError: Timed out waiting for send operation to complete

Describe the bug
I ran the example command python -m openfed.tools.simulator --nproc 6 examples/run.py, as given in the repository, to check that the code works. The first round hung for about 30 minutes and then failed with the following error.

(openfed) ozaland@prec3660c:~/OpenFed$ python -m openfed.tools.simulator --nproc 6 examples/run.py
  0%|                                                    | 0/10 [00:00<?, ?it/s]/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py:430: UserWarning: torch.distributed.distributed_c10d._get_global_rank is deprecated please use torch.distributed.distributed_c10d.get_global_rank instead
  "torch.distributed.distributed_c10d._get_global_rank is deprecated "
 10%|████                                    | 1/10 [30:02<4:30:18, 1802.07s/it]
Traceback (most recent call last):
  File "examples/run.py", line 99, in <module>
    simulate()
  File "examples/run.py", line 52, in simulate
    api.run()
  File "/home/ozaland/OpenFed/openfed/api.py", line 71, in run
    maintainer.step()
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 306, in step
    return self._aggregator_step(*args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 378, in _aggregator_step
    flag = self.upload()
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 298, in upload
    self.transfer(to=True)
  File "/home/ozaland/OpenFed/openfed/core/functional.py", line 33, in _fed_context
    return safe_call(self, *args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/functional.py", line 24, in safe_call
    return func(*args, **kwargs)
  File "/home/ozaland/OpenFed/openfed/core/maintainer.py", line 253, in transfer
    self.pipe.upload(self.packaged_data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 164, in upload
    self.transfer(True, data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 233, in transfer
    self.push(data)
  File "/home/ozaland/OpenFed/openfed/federated/pipe.py", line 249, in push
    distributed_c10d.gather_object(data, None, dst=rank, group=self.pg)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 1981, in gather_object
    all_gather(object_size_list, local_size, group=group)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/site-packages/torch/distributed/distributed_c10d.py", line 2282, in all_gather
    work.wait()
RuntimeError: [../third_party/gloo/gloo/transport/tcp/unbound_buffer.cc:133] Timed out waiting 1800000ms for send operation to complete
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7fc6515eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f2c126eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f60a89eff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7ff26edaff80>
  warnings.warn(f'Failed to call {func}')
/home/ozaland/OpenFed/openfed/core/functional.py:26: UserWarning: Failed to call <function Maintainer.transfer at 0x7f290cfaff80>
  warnings.warn(f'Failed to call {func}')
Killing subprocess 1740079
Killing subprocess 1740080
Killing subprocess 1740081
Killing subprocess 1740082
Killing subprocess 1740083
Killing subprocess 1740084
Traceback (most recent call last):
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/home/ozaland/anaconda3/envs/openfed/lib/python3.7/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 221, in <module>
    main()
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 204, in main
    sigkill_handler(signal.SIGTERM, None)
  File "/home/ozaland/OpenFed/openfed/tools/simulator.py", line 138, in sigkill_handler
    returncode=last_return_code, cmd=cmd)
subprocess.CalledProcessError: Command '['/home/ozaland/anaconda3/envs/openfed/bin/python', '-u', 'examples/run.py', '--props=/tmp/collaborator-5.json']' returned non-zero exit status 1.

Environment (please complete the following information):

  • OS Platform and Distribution (e.g., Linux Ubuntu 22.04):
  • Python package versions: PyTorch v1.13.1, openfed v0.0.0, torchvision v0.14.1
  • Python version: 3.7
  • CUDA/cuDNN version: 11.7
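
A note on reading the log: the 1800000 ms in the RuntimeError is 30 minutes, which matches the default process-group timeout in torch.distributed and the 30:02 shown on the progress bar. That suggests the gather_object call blocked until the collective timed out, rather than failing right away. The sketch below checks that arithmetic and shows one way to make such a hang surface sooner while debugging. The helper name init_with_short_timeout, the address tcp://127.0.0.1:29500, and the 120-second value are assumptions for illustration only; how OpenFed itself creates its process groups is not shown in this issue.

```python
from datetime import timedelta

# Value copied from the RuntimeError message in the log above.
GLOO_TIMEOUT_MS = 1_800_000


def timeout_from_ms(ms: int) -> timedelta:
    """Convert the millisecond figure from the gloo error into a timedelta."""
    return timedelta(milliseconds=ms)


def init_with_short_timeout(rank: int, world_size: int, seconds: int = 120) -> None:
    """Hypothetical helper (not part of OpenFed): create a gloo process group
    whose collectives give up after `seconds` instead of the 30-minute default.
    """
    # Imported lazily so the arithmetic above can run without PyTorch installed.
    import torch.distributed as dist

    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",  # assumed address, adjust as needed
        rank=rank,
        world_size=world_size,
        timeout=timedelta(seconds=seconds),
    )


if __name__ == "__main__":
    print(timeout_from_ms(GLOO_TIMEOUT_MS))  # prints 0:30:00
```

A shorter timeout does not fix the underlying stall; it only shortens the wait before the traceback appears, which can make it easier to see which rank never reached the collective.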
