
decentralizepy

decentralizepy is a framework for running distributed applications (particularly machine learning) on top of arbitrary topologies: decentralized, federated, or parameter server. It was primarily conceived for evaluating research ideas on several aspects of distributed learning (communication efficiency, privacy, data heterogeneity, etc.).

Setting up decentralizepy

  • Fork the repository.

  • Clone and enter your local repository.

  • Check that you have Python >= 3.8.

    python --version
    
  • (Optional) Create and activate a virtual environment.

    python3 -m venv [venv-name]
    source [venv-name]/bin/activate
    
  • Update pip.

    pip3 install --upgrade pip
    
  • On Apple M1 Macs, installing pyzmq with pip fails. Use conda instead.

  • Install decentralizepy for development. (zsh)

    pip3 install --editable .\[dev\]
    
  • Install decentralizepy for development. (bash)

    pip3 install --editable .[dev]
    
  • Download CIFAR-10 using download_dataset.py.

    python download_dataset.py
    
  • (Optional) Download other datasets from LEAF <https://github.com/TalwalkarLab/leaf> and place them in eval/data/.

Running the code

  • Follow the tutorial in tutorial/. OR,

  • Generate a new graph file with the required topology using generate_graph.py.

    python generate_graph.py --help
    
  • Choose and modify one of the config files in eval/{step,epoch}_configs.

  • Modify the dataset paths and addresses_filepath in the config file.

  • In eval/run.sh, modify arguments as required.

  • Execute eval/run.sh on all the machines simultaneously. There is a synchronization barrier mechanism at the start so that all processes start training together.
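For reference, the config files are plain INI files readable with Python's configparser; the sketch below overrides the dataset path and addresses_filepath programmatically before a run. The section and key names here are illustrative assumptions: check the files in eval/step_configs for the schema your version actually uses.

```python
import configparser

# Hypothetical section/key names; consult eval/step_configs for the real schema.
config = configparser.ConfigParser()
config.read_string("""
[DATASET]
dataset_package = decentralizepy.datasets.CIFAR10
train_dir = eval/data/cifar10

[COMMUNICATION]
addresses_filepath = ip_addr_6Machines.json
""")

# Point the config at local paths before launching eval/run.sh.
config["DATASET"]["train_dir"] = "/data/cifar10"
config["COMMUNICATION"]["addresses_filepath"] = "/tmp/addresses.json"

with open("my_config.ini", "w") as f:
    config.write(f)
```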

Citing

Cite us as

 @inproceedings{decentralizepy,
author = {Dhasade, Akash and Kermarrec, Anne-Marie and Pires, Rafael and Sharma, Rishi and Vujasinovic, Milos},
title = {Decentralized Learning Made Easy with DecentralizePy},
year = {2023},
isbn = {9798400700842},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3578356.3592587},
doi = {10.1145/3578356.3592587},
booktitle = {Proceedings of the 3rd Workshop on Machine Learning and Systems},
pages = {34–41},
numpages = {8},
keywords = {peer-to-peer, distributed systems, machine learning, middleware, decentralized learning, network topology},
location = {Rome, Italy},
series = {EuroMLSys '23}
}

Built with DecentralizePy

  • Epidemic Learning
    Tutorial: tutorial/EpidemicLearning
    Source files: src/node/EpidemicLearning/
    Cite: Martijn de Vos, Sadegh Farhadkhani, Rachid Guerraoui, Anne-Marie Kermarrec, Rafael Pires, and Rishi Sharma. Epidemic Learning: Boosting Decentralized Learning with Randomized Communication. In Thirty-seventh Conference on Neural Information Processing Systems (NeurIPS), 2023.

  • JWINS
    Tutorial: tutorial/JWINS
    Source files: src/sharing/JWINS/
    Cite: Akash Dhasade, Anne-Marie Kermarrec, Rafael Pires, Rishi Sharma, Jeffrey Wigger, and Milos Vujasinovic. Get More for Less in Decentralized Learning Systems. In IEEE 43rd International Conference on Distributed Computing Systems (ICDCS), 2023.

Contributing

  • isort and black are installed along with the package for code linting.

  • While in the root directory of the repository, before committing the changes, please run

    black .
    isort .
    

Modules

Following are the modules of decentralizepy:

Node

  • The Manager. Optimizations at process level.

Dataset

  • Static

Training

  • Heterogeneity. How much do I want to work?

Graph

  • Static. Who are my neighbours? Topologies.
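To illustrate what a topology encodes, here is a minimal sketch (not decentralizepy's own Graph class) that builds the neighbour sets of a ring by node id:

```python
def ring_neighbours(num_nodes):
    """Return {node_id: set of neighbour ids} for a ring topology."""
    return {
        i: {(i - 1) % num_nodes, (i + 1) % num_nodes}
        for i in range(num_nodes)
    }

# Each node in a 5-node ring talks to exactly two neighbours.
print(ring_neighbours(5)[0])  # -> {1, 4}
```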

Mapping

  • Naming. The globally unique ids of the processes <-> machine_id, local_rank
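The mapping can be sketched as a simple bijection, assuming a fixed number of processes per machine (a simplification; the framework's Mapping classes may differ):

```python
def to_uid(machine_id, local_rank, procs_per_machine):
    """Globally unique process id from (machine_id, local_rank)."""
    return machine_id * procs_per_machine + local_rank

def from_uid(uid, procs_per_machine):
    """Inverse mapping: recover (machine_id, local_rank)."""
    return divmod(uid, procs_per_machine)

print(to_uid(2, 3, 8))   # -> 19
print(from_uid(19, 8))   # -> (2, 3)
```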

Sharing

  • Leverage Redundancy. Privacy. Optimizations in model and data sharing.
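Model sharing typically flattens parameter tensors into one linear array before compression and transmission. A minimal pure-Python sketch of that round trip (decentralizepy itself operates on model tensors, not plain lists):

```python
def flatten(params):
    """Concatenate a list of 1-D parameter lists into one flat list,
    remembering each parameter's length so it can be restored."""
    flat, lengths = [], []
    for p in params:
        lengths.append(len(p))
        flat.extend(p)
    return flat, lengths

def unflatten(flat, lengths):
    """Inverse of flatten: split the linear array back into parameters."""
    params, start = [], 0
    for n in lengths:
        params.append(flat[start:start + n])
        start += n
    return params

weights = [[0.1, 0.2], [0.3], [0.4, 0.5, 0.6]]
flat, lengths = flatten(weights)
assert unflatten(flat, lengths) == weights
```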

Communication

  • IPC/Network level. Compression. Privacy. Reliability.

Model

  • Learning Model

decentralizepy's People

Contributors: eliaguerra, mvujas, rafaelppires, rishi-s8, sissiki


decentralizepy's Issues

Refactor

The submodules in DecentralizePy have a lot of redundant code.

For instance,

  1. Most of the elements of the run() function of Node are repeated.
  2. Submodules of Sharing have a lot of redundant code, especially the part about serialization into a linear array and back.

Feel free to refactor others.

Consider using `argparse` in the various scripts

The built-in argparse library is able to replace a lot of code in various Python scripts, such as generate_graph.py. It would also print a prettier message compared to the following:

mdevos@IC-ITs-MacBook-Pro-2 decentralizepy % python generate_graph.py       
Traceback (most recent call last):
  File "/Users/mdevos/Documents/decentralizepy/generate_graph.py", line 40, in <module>
    assert len(sys.argv) >= 2, __doc__
AssertionError: Usage: python3 generate_graph.py -g <graph_type> -n <num_nodes> -s <seed> -d <degree> -k <k_over_2> -b <beta> -f <file_name> -a -h
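A hedged sketch of how the flags in that usage string could map onto argparse (flag semantics and defaults are inferred from the usage message, not verified against generate_graph.py):

```python
import argparse

# Flags mirror the usage string above; defaults are illustrative guesses.
parser = argparse.ArgumentParser(description="Generate a topology graph file.")
parser.add_argument("-g", "--graph-type", required=True)
parser.add_argument("-n", "--num-nodes", type=int, required=True)
parser.add_argument("-s", "--seed", type=int, default=42)
parser.add_argument("-d", "--degree", type=int, default=4)
parser.add_argument("-k", "--k-over-2", type=int, default=2)
parser.add_argument("-b", "--beta", type=float, default=0.1)
parser.add_argument("-f", "--file-name", default="graph.txt")
parser.add_argument("-a", "--adjacency", action="store_true")

args = parser.parse_args(["-g", "regular", "-n", "16", "-d", "4"])
print(args.graph_type, args.num_nodes)  # regular 16
```

Note that argparse provides -h/--help automatically, which also replaces the hand-rolled usage string.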

Fix Message Loss with ZMQ

With a large number of nodes on a single machine, i.e., many sockets, some messages are dropped by ZMQ. Investigate why, and in exactly what scenarios, this happens.
TCP_ACK introduces application-level retransmissions to work around this problem, but with TCP, messages should be delivered first-in-first-out. Hence, stress-test it to check whether it still has the same problems as the basic version. If not, describe why.

Seed not used by some datasets

Some datasets do not use the self.random_seed variable.
This leads to an inconsistent dataset partition within a single run, since a data element can be assigned to multiple nodes.

my_clients = DataPartitioner(files, self.sizes).use(self.dataset_id)

Moreover, no seed is used when generating a validation set:

validation_indexes = np.random.choice(
    self.train_x.shape[0], num_samples, replace=False
)
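A hedged sketch of a fix, using a seeded generator so every node draws the same validation indexes (shown with the stdlib here; the NumPy analogue would be a seeded np.random.default_rng):

```python
import random

def validation_indexes(num_train, num_samples, seed):
    """Sample validation indexes reproducibly from a seeded RNG."""
    rng = random.Random(seed)  # seeded, so the split is deterministic
    return rng.sample(range(num_train), num_samples)

# Same seed -> same split, regardless of which node computes it.
assert validation_indexes(100, 5, 90) == validation_indexes(100, 5, 90)
```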

`epoch_configs` does not exist

The README.md mentions:

Choose and modify one of the config files in eval/{step,epoch}_configs

However, it doesn't look like epoch_configs exists; I only see step_configs.

`ModuleNotFoundError: No module named 'smallworld'` when running from source

When running the framework from source and following the instructions in the README.md file, I'm getting the following error:

mdevos@IC-ITs-MacBook-Pro-2 decentralizepy % python generate_graph.py --help
Traceback (most recent call last):
  File "/Users/mdevos/Documents/decentralizepy/generate_graph.py", line 1, in <module>
    from decentralizepy.graphs.Regular import Regular
  File "/Users/mdevos/Documents/decentralizepy/src/decentralizepy/graphs/__init__.py", line 2, in <module>
    from .SmallWorld import SmallWorld
  File "/Users/mdevos/Documents/decentralizepy/src/decentralizepy/graphs/SmallWorld.py", line 1, in <module>
    import smallworld
ModuleNotFoundError: No module named 'smallworld'

It seems like smallworld should be added to requirements.txt.
