Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks

This repo contains the simulation implementation for the paper:

Xiaofan Yu, Ludmila Cherkasova, Harsh Vardhan, Quanling Zhao, Emily Ekaireb, Xiyuan Zhang, Arya Mazumdar, Tajana Rosing. "Async-HFL: Efficient and Robust Asynchronous Federated Learning in Hierarchical IoT Networks" in the Proceedings of IoTDI 2023.

[arXiv link]

File Structure

The implementation is based on FLSim and ns3-fl.

.
├── client.py          // Implementation of client class
├── config.py          // Implementation of argument parsing and configuration settings
├── configs            // The json configuration files for all datasets and test scenarios (iid vs non-iid, sync vs async)
├── delays             // Delays generated by ns3-fl and the script to generate computational delays
├── LICENSE
├── load_data.py       // Implementation of data loaders for both image datasets and the LEAF dataset
├── models             // Implementation of ML models for all datasets
├── README.md          // This file
├── requirements.txt   // Prerequisites
├── run.py             // Main script to fire simulations
├── scripts            // Collection of bash scripts for various experiments in the paper
├── server             // Implementation of servers (sync, semi-async, async)
└── utils              // Necessary util files in FLSim

Prerequisites

We tested with Python 3.7. We recommend using a conda environment:

conda create --name asynchfl-py37 python=3.7
conda activate asynchfl-py37
python -m pip install -r requirements.txt

All required Python packages are listed in requirements.txt and can be installed with the command above.

Async-HFL uses Gurobi to solve the gateway-level device selection and the cloud-level device-gateway association problems. A license is required. After installation, add the following lines to your bash initialization file (e.g., ~/.bashrc):

export GUROBI_HOME="PATH-TO-GUROBI-INSTALLATION"
export PATH="${PATH}:${GUROBI_HOME}/bin"
export LD_LIBRARY_PATH="${LD_LIBRARY_PATH}:${GUROBI_HOME}/lib"
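
To verify that the Gurobi installation and license are picked up correctly, a quick sanity check (a minimal sketch, assuming gurobipy was installed via requirements.txt) is:

import gurobipy as gp

# Building and solving a trivial model raises a GurobiError if no valid license is found.
m = gp.Model("license_check")
x = m.addVar(lb=0.0, name="x")
m.setObjective(x, gp.GRB.MINIMIZE)
m.optimize()
print("Gurobi is installed and licensed; objective =", m.ObjVal)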

Dataset Preparation

As mentioned in the paper, we experiment on MNIST, FashionMNIST, CIFAR-10, Shakespeare, HAR, and HPWREN.

  • For MNIST, FashionMNIST, and CIFAR-10, the script downloads the datasets automatically into the ./data folder, as specified in the json configuration files under ./configs. The non-iid partition is created synthetically by assigning two random classes to each client, as set in the json configuration (a minimal sketch of such a partition is given below, after this list).

  • The Shakespeare dataset is adapted from the LEAF dataset, and we use their original script for natural data partition:

    git clone https://github.com/TalwalkarLab/leaf.git
    cd leaf/data/shakespeare/
    ./preprocess.sh -s niid --sf 0.2 -k 0 -t sample -tf 0.8
    

    The preprocess.sh script generates partitioned data in the same directory. Then, you need to specify the path to the Shakespeare data in ./configs/Shakespeare/xxx.json, where xxx corresponds to the setting.

  • The HAR dataset is downloaded from the UCI Machine Learning Repository. We provide a partition script similar to LEAF's here. After partitioning, specify the path to the partitioned data in ./configs/HAR/xxx.json.

  • The HPWREN dataset is constructed from the historical data of HPWREN. We provide the data download script and partition scripts here. You need to specify the path to the partitioned data in ./configs/HPWREN/xxx.json.

Note that Shakespeare, HAR, and HPWREN use natural non-iid data partitions, as detailed in the paper.
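
For illustration, a minimal sketch of the synthetic two-class-per-client partition used for the image datasets (a hypothetical helper, not the actual implementation in load_data.py):

import numpy as np

def two_class_partition(labels, num_clients, samples_per_client=600, seed=0):
    """Synthetic non-iid split: each client receives samples from two randomly chosen classes."""
    rng = np.random.default_rng(seed)
    classes = np.unique(labels)
    by_class = {c: np.where(labels == c)[0] for c in classes}
    partition = []
    for _ in range(num_clients):
        c1, c2 = rng.choice(classes, size=2, replace=False)
        idx = np.concatenate([
            rng.choice(by_class[c1], samples_per_client // 2, replace=False),
            rng.choice(by_class[c2], samples_per_client // 2, replace=False),
        ])
        partition.append(idx)
    return partition  # one array of sample indices per client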

Delay Generation

One major novelty of Async-HFL is accounting for network heterogeneity in FL. Specifically, we generate the communication delays with ns3-fl, and draw the computational delays randomly from a log-normal distribution.

  • For the communication delays, the hierarchical network topology is configured based on NYCMesh with 184 edge devices, 6 gateways, and 1 server. We assume that edge devices are connected to the gateways via Wi-Fi, and the gateways are connected to the server via Ethernet. For each node, we retrieve its latitude, longitude, and height as input to the HybridBuildingsPropagationLossModel in ns-3 to obtain the average point-to-point latency. The communication delays generated from ns3-fl are stored in delays/delay_client_to_gateway.csv and delays/delay_gateway_to_cloud.csv.
  • For the computational delays, we run python3 delays/generate.py, which samples the computational delays from a log-normal distribution (a minimal sketch is shown below).
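
A minimal sketch of such log-normal sampling (the distribution parameters and output file name below are illustrative only; the actual values come from delays/generate.py and the json configs):

import numpy as np

rng = np.random.default_rng(42)
num_clients = 184  # number of edge devices in the NYCMesh setup
# Hypothetical log-normal parameters; delays/generate.py defines the real ones.
comp_delays = rng.lognormal(mean=1.0, sigma=0.5, size=num_clients)  # seconds per round
np.savetxt("delay_computation.csv", comp_delays, delimiter=",")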

The paths to the corresponding delay files are set in the json configuration files.

The NYCMesh topology and the round delay distributions are shown below.

[Figure: NYCMesh topology and round delay distributions]

Getting Started

To run synchronous FL on MNIST with non-iid data partition and random client selection:

python run.py --config=configs/MNIST/sync_noniid.json --delay_mode=nycmesh --selection=random

For synchronous FL, client-selection strategies other than random selection are also supported via the --selection flag.

To run RFL-HA (sync aggregation at gateways and async aggregation at cloud) on FashionMNIST with non-iid data partition and random client selection:

python run.py --config=configs/FashionMNIST/rflha_noniid.json --delay_mode=nycmesh --selection=random

To run semi-asynchronous FL on CIFAR-10 with non-iid data partition and random client selection:

python run.py --config=configs/CIFAR-10/semiasync_noniid.json --delay_mode=nycmesh --selection=random

To run asynchronous FL on Shakespeare with non-iid data partition and random client selection:

python run.py --config=configs/Shakespeare/async_noniid.json --delay_mode=nycmesh --selection=random

To run Async-HFL on HAR with non-iid data partition:

python3.7 run.py --config=configs/HAR/async_noniid.json --delay_mode=nycmesh --selection=coreset_v1 --association=gurobi_v1

To run Async-HFL on HPWREN with non-iid data partition and specific values of alpha (the weight for delays in gateway-level client selection, used in server/clientSelection.py) and phi (the weight for throughput in cloud-level device-gateway association, used in server/clientAssociation.py):

python3.7 run.py --config=configs/HPWREN/async_noniid.json --delay_mode=nycmesh --selection=coreset_v1 --association=gurobi_v1 --cs_alpha=<alpha> --ca_phi=<phi>
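
For example, with illustrative values alpha=0.5 and phi=0.1 (chosen here only for demonstration; see the paper and the scripts folder for the values used in the experiments):

python3.7 run.py --config=configs/HPWREN/async_noniid.json --delay_mode=nycmesh --selection=coreset_v1 --association=gurobi_v1 --cs_alpha=0.5 --ca_phi=0.1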

Scripts

We provide the scripts used for the various experiments in the paper under scripts:

.
├── run_ablation.sh                      // Script for running ablation study on Async-HFL
|                                        // (various combinations of client selection and association)
├── run_baseline_nycmesh.sh              // Script for running baselines in the NYCMesh setup
├── run_baseline.sh                      // Script for running baselines in the random delay setup
├── run_exp_nycmesh.sh                   // Script for running Async-HFL in the NYCMesh setup
├── run_exp.sh                           // Script for running Async-HFL in the random delay setup
├── run_motivation_nycmesh.sh            // Script for running the motivation study
├── run_sensitivity_pca.sh               // Script for running sensitivity study regarding PCA dimension
└── run_sensitivity_phi.sh               // Script for running sensitivity study regarding phi

Physical Deployment

Apart from simulation, we also evaluate Async-HFL in a physical deployment using Raspberry Pis and CPU clusters. The implementation is at https://github.com/Orienfish/FedML, which is adapted from the FedML framework.

The round-delay distribution measured in the physical deployment exhibits a long tail, which validates the delay-generation strategy used in our simulation framework:

[Figure: histogram of round latencies in the physical deployment]

The plot above shows a histogram of all round latencies collected whenever an updated model was returned in our physical deployment, with the x axis representing round latency and the y axis representing counts. The majority of rounds complete quickly, while in rare cases the round latency can be unacceptably long.

License

MIT

If you have any questions, please feel free to contact [email protected].
