aiot-mlsys-lab / fedrolex

57 stars · 14 forks · 950 KB

[NeurIPS 2022] "FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction" by Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang

License: Apache License 2.0

Python 100.00%
federated-learning

fedrolex's People

Contributors: mi-zhang, samiul272

fedrolex's Issues

Problem about the calculation of 'Local model accuracy'

Hi, @samiul272. As you mentioned in your paper, "local model accuracy is defined as the accuracy of the server model on each of the clients' local datasets." In my opinion, what you mean is that when the total number of clients is 100 and 10 clients are randomly selected to participate in training in each round, local model accuracy is computed as follows:

  1. Server sends the global model to 10 clients participating in the training;

  2. Each client uses its own test dataset to evaluate the global model and gets its own Local-Accuracy and Local-Loss.

  3. Average these Local-Accuracy values and Local-Loss values as the final values for this round.

However, while reading your code, I noticed that it only selects 10 fixed clients numbered 0, 10, 20, ..., 90 to calculate Local-Accuracy and Local-Loss in each round (as shown in the screenshot, the variable m in your code can only take 10 fixed values), instead of evaluating the global model on the test sets of the clients that actually participate in the current round of training (whose indices are stored in the variable user_idx in your code). Is this really reasonable?

[screenshot]

On the other hand, it seems that you only use one batch of data (line 212, result[0]) instead of all batches of data (line 212, result) from the client's local test set to calculate each client's Local-Accuracy and Local-Loss. Is that reasonable?
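For reference, a size-weighted aggregation over every batch would look roughly like the sketch below. The names are hypothetical: here `result` stands for a list of per-batch (mean_loss, num_correct, batch_size) tuples, which is not necessarily the repository's actual structure.

```python
def aggregate_metrics(result):
    # result: list of per-batch tuples (mean_loss, num_correct, batch_size).
    # Weight each batch by its size so a smaller final batch is not
    # over-counted; reading only result[0] would reflect a single batch.
    total = sum(n for _, _, n in result)
    loss = sum(l * n for l, _, n in result) / total
    acc = 100.0 * sum(c for _, c, _ in result) / total
    return loss, acc

# Example: two batches of different sizes
loss, acc = aggregate_metrics([(2.0, 64, 128), (1.0, 60, 72)])
print(loss, acc)  # 1.64 62.0
```

Averaging these per-client values over the sampled clients would then give the round's Local-Accuracy and Local-Loss.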

Bug in the client optimizer

Hi, @samiul272. After careful debugging, I finally found the problem: if I take the repository code as-is and run the default command from README.md:
python main_resnet.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 3 \
    --devices 0 1 2
then each client locally uses the Adam optimizer instead of the SGD optimizer!

I believe the reason is that the default optimizer in config.yml is Adam. Although you change the value of cfg['optimizer_name'] in the process_control function in utils.py, this change is only visible to the main process. The Ray framework runs clients in parallel, assigning a separate process to each client, so when a client constructs a new optimizer in its step function, the parameters it reads from cfg are still those from config.yml, which means the client actually runs the Adam optimizer. To verify this, we printed the optimizer information on the client side, as shown below.
[screenshot]
Then, ① with optimizer_name set to Adam in config.yml (which is also the default in your code), running the command above gives the following result:
[screenshot]
② With optimizer_name set to SGD in config.yml, running the command above gives:
[screenshot]
After testing, mode ① reproduces the results of Table 3, while mode ② cannot be trained. That is why my attempted reproduction in issue #7 failed.
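If this diagnosis is right, the underlying pattern can be sketched as follows (hypothetical names, not the repository's actual code): a Ray worker re-imports the module and therefore sees the un-overridden default config, so the fix is to snapshot the resolved config in the driver and pass it to each worker explicitly.

```python
DEFAULT_CFG = {"optimizer_name": "Adam", "lr": 2e-4}  # what config.yml provides

def process_control(cfg):
    # Driver-side override, analogous to process_control in utils.py.
    cfg["optimizer_name"] = "SGD"
    return cfg

def client_step(cfg=None):
    # A worker process re-imports the module, so without an explicit
    # argument it falls back to the un-overridden default.
    cfg = cfg if cfg is not None else dict(DEFAULT_CFG)
    return cfg["optimizer_name"]

resolved = process_control(dict(DEFAULT_CFG))
print(client_step())          # Adam (stale default: the bug)
print(client_step(resolved))  # SGD  (explicitly passed: the fix)
```

With Ray, "passing explicitly" would mean making the resolved config an argument of the remote call rather than a module-level global.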

Question about Table 3

Hi, @samiul272, I'm sorry to bother you again. I would like to reproduce the high-data-heterogeneity results in Table 3 of your paper. I ran the following command:
python main.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 1 \
    --devices 0 1 2 3 4
However, the accuracy on the test set was very low, even after more than 2000 rounds of training. Is there a problem with my experimental setup?

[screenshot]

#--------------------------Update-----------------------------------#

By the way, I use the SGD optimizer rather than Adam, since your paper says you used SGD.
[screenshot]

The config.yml setting is shown as below.

control

exp_name: hetero_fl_roll_50_100
control:
  fed: '1'
  num_users: '20'
  frac: '1.0'
  data_split_mode: 'iid'
  model_split_mode: 'fix'
  model_mode: 'a1'
  norm: 'bn'
  scale: '1'
  mask: '1'
data

data_name: CIFAR10
subset: label
batch_size:
  train: 128
  test: 128
shuffle:
  train: False
  test: False
num_workers: 0
model_name: resnet18
metric_name:
  train:
    - Loss
    - Accuracy
  test:
    - Loss
    - Accuracy

optimizer

# optimizer_name: Adam
optimizer_name: SGD
lr: 2.0e-4
momentum: 0.9
weight_decay: 5.0e-4

scheduler

scheduler_name: None
step_size: 1
milestones:
  - 100
  - 150
patience: 10
threshold: 1.0e-3
factor: 0.5
min_lr: 1.0e-4

experiment

init_seed: 31
num_experiments: 1
num_epochs: 200
log_interval: 0.25
device: cuda
world_size: 1
resume_mode: 0

other

save_format: pdf

How to run the transformer experiment?

Hi, @samiul272, I want to run the transformer experiment on the Stack Overflow dataset, but I don't know how to process the data. Your code seems to load the data directly from a file named "/egr/research-zhanglambda/samiul/stackoverflow/stackoverflow_train.pt". However, after searching the Internet, I only found versions of the dataset in other formats. For example, I found a Stack Overflow dataset on Kaggle (https://www.kaggle.com/datasets/stackoverflow/stackoverflow), but there is no way to use it directly with your code. Can you tell me how to obtain a dataset like yours?

Question about the randomness of experiment

Hi, @samiul272, I ran your code with the following command
python main_resnet.py --data_name CIFAR10 --model_name resnet18 --control_name 1_100_0.1_non-iid-2_fix_a1-b1-c1-d1-e1_bn_1_1 --exp_name roll_test --algo roll --g_epoch 3200 --l_epoch 1 --lr 2e-4 --schedule 1200 --seed 31 --num_experiments 3 --devices 0 1 2 3 4
and set cfg['shuffle']['train']=False and cfg['shuffle']['test']=False. I didn't change any random seeds, but when I ran the code twice, the global model had inconsistent accuracy on the test set. The results of the first 6 rounds of the two runs are shown below. Is there any way to make them consistent? Is this randomness due to the use of the Ray framework?

the first run result

Test Epoch: 1(100%) Local-Loss: 2.2380 Local-Accuracy: 45.0000 Global-Loss: 2.3141 Global-Accuracy: 16.2200
Test Epoch: 2(100%) Local-Loss: 2.1790 Local-Accuracy: 43.6000 Global-Loss: 2.3342 Global-Accuracy: 12.0700
Test Epoch: 3(100%) Local-Loss: 2.1406 Local-Accuracy: 46.8000 Global-Loss: 2.3556 Global-Accuracy: 8.7900
Test Epoch: 4(100%) Local-Loss: 2.0628 Local-Accuracy: 54.2000 Global-Loss: 2.3327 Global-Accuracy: 10.7000
Test Epoch: 5(100%) Local-Loss: 2.0630 Local-Accuracy: 49.8000 Global-Loss: 2.3717 Global-Accuracy: 10.0100
Test Epoch: 6(100%) Local-Loss: 2.0313 Local-Accuracy: 49.0000 Global-Loss: 2.3643 Global-Accuracy: 13.5500

the second run result

Test Epoch: 1(100%) Local-Loss: 2.2269 Local-Accuracy: 51.7000 Global-Loss: 2.3072 Global-Accuracy: 17.6600
Test Epoch: 2(100%) Local-Loss: 2.1619 Local-Accuracy: 51.5000 Global-Loss: 2.3470 Global-Accuracy: 11.0700
Test Epoch: 3(100%) Local-Loss: 2.1262 Local-Accuracy: 48.0000 Global-Loss: 2.3466 Global-Accuracy: 9.4800
Test Epoch: 4(100%) Local-Loss: 2.0643 Local-Accuracy: 55.1000 Global-Loss: 2.3411 Global-Accuracy: 10.0100
Test Epoch: 5(100%) Local-Loss: 2.0296 Local-Accuracy: 49.6000 Global-Loss: 2.3790 Global-Accuracy: 9.6500
Test Epoch: 6(100%) Local-Loss: 1.9768 Local-Accuracy: 49.4000 Global-Loss: 2.3819 Global-Accuracy: 9.7700
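For what it's worth, exact repeatability usually requires seeding every source of randomness in every process. A minimal sketch follows; only the stdlib part is runnable here, the commented-out lines are what a PyTorch setup would additionally need (stated as assumptions), and with Ray the same function would also have to run inside each worker process, not just the driver.

```python
import os
import random

def seed_everything(seed: int):
    # Seed the stdlib RNG; PYTHONHASHSEED only affects child processes
    # launched after this point. With Ray, each remote worker is a separate
    # process, so call this inside every worker too.
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # A PyTorch setup would additionally need (not executed here):
    # numpy.random.seed(seed)
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False

seed_everything(31)
a = [random.random() for _ in range(3)]
seed_everything(31)
b = [random.random() for _ in range(3)]
print(a == b)  # True
```

Even with all seeds fixed, some CUDA kernels and the scheduling order of parallel workers can remain nondeterministic, which may explain run-to-run differences of the size shown above.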

Some questions about the 'overlap' and aggregation code

Hi, @samiul272, I am reading your code, but unfortunately I'm confused by the following code.
[screenshot]
Suppose we have 2 clients with $\beta_1=\beta_2=1$, and suppose there is a layer with only 10 neurons, i.e. $K_i=10$. Based on Appendix A.4 of your paper and the code in the screenshot above, I drew the situations for overlap=0.2 and overlap=1.0 in communication round $j$ in the picture below. Can you tell me if my understanding is right? If so, what does overlap mean?
[screenshot]

Why is this experiment needed? Besides, why do you shuffle the original order of the neurons (lines 51-52, shown below)?
[screenshot]
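For context, my reading of the rolling sub-model extraction that the overlap question concerns can be sketched like this. This is a hypothetical illustration only, not the repository's code: `shift_per_round` is my own name for whatever step the overlap parameter controls.

```python
def rolling_window(round_j, beta, K, shift_per_round=1):
    # In round j, take a contiguous window of beta*K of the layer's K
    # neurons, starting at an offset that advances by shift_per_round each
    # round and wraps around modulo K, so every neuron is trained over time.
    width = int(beta * K)
    start = (round_j * shift_per_round) % K
    return [(start + k) % K for k in range(width)]

# K = 10 neurons, half-capacity client (beta = 0.5):
print(rolling_window(0, 0.5, 10))  # [0, 1, 2, 3, 4]
print(rolling_window(1, 0.5, 10))  # [1, 2, 3, 4, 5]
print(rolling_window(9, 0.5, 10))  # [9, 0, 1, 2, 3]
```

With $\beta=1$ as in the question above, the window covers the whole layer, so any overlap setting would only change how much consecutive rounds' windows coincide for smaller clients.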

Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 0.15}. Add suitable node types to this cluster to resolve this issue.

Hi @samiul272, I downloaded your code and ran the commands
pip install -r requirements.txt
pip install tensorboard
Then, I ran the command
python main_resnet.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 3 \
    --devices 0 1 2
However, I got the following error:
[screenshot]
Since I am not familiar with the Ray framework, can you help me solve this error?
