aiot-mlsys-lab / fedrolex

57 stars · 14 forks · 950 KB

[NeurIPS 2022] "FedRolex: Model-Heterogeneous Federated Learning with Rolling Sub-Model Extraction" by Samiul Alam, Luyang Liu, Ming Yan, and Mi Zhang

License: Apache License 2.0

Python 100.00%
federated-learning

fedrolex's People

Contributors: mi-zhang, samiul272

fedrolex's Issues

Problem about the calculation of 'Local model accuracy'

Hi, @samiul272. As you mentioned in your paper, "local model accuracy is defined as the accuracy of the server model on each of the clients' local datasets." In my opinion, what you mean is that when the total number of clients is 100 and 10 clients are randomly selected to participate in training in each round, local model accuracy is computed as follows:

  1. Server sends the global model to 10 clients participating in the training;

  2. Each client uses its own test dataset to evaluate the global model and gets its own Local-Accuracy and Local-Loss.

  3. Average these Local-Accuracy values and Local-Loss values as the final values for this round.

However, while reading your code, I noticed that it only selects 10 fixed clients numbered 0, 10, 20, ..., 90 to calculate Local-Accuracy and Local-Loss in each round (as shown in the screenshot, the variable m in your code can only take 10 fixed values), instead of evaluating the global model on the test sets of the clients that actually participate in the current round of training (whose indices are stored in the variable user_idx in your code). Is this really reasonable?

[screenshot]

On the other hand, it seems that you only use one batch of data (line 212, result[0]) instead of all batches of data (line 212, result) from the client's local test set to calculate each client's Local-Accuracy and Local-Loss. Is that reasonable?
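For reference, a size-weighted aggregation over every batch would look roughly like the sketch below. The names are hypothetical: here `result` stands for a list of per-batch (mean_loss, num_correct, batch_size) tuples, which is not necessarily the repository's actual structure.

```python
def aggregate_metrics(result):
    # result: list of per-batch tuples (mean_loss, num_correct, batch_size).
    # Weight each batch by its size so a smaller final batch is not
    # over-counted; reading only result[0] would reflect a single batch.
    total = sum(n for _, _, n in result)
    loss = sum(l * n for l, _, n in result) / total
    acc = 100.0 * sum(c for _, c, _ in result) / total
    return loss, acc

# Example: two batches of different sizes
loss, acc = aggregate_metrics([(2.0, 64, 128), (1.0, 60, 72)])
print(loss, acc)  # 1.64 62.0
```

Averaging these per-client values over the sampled clients would then give the round's Local-Accuracy and Local-Loss.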

Bug in the client optimizer

Hi, @samiul272. After careful debugging, I finally found the problem: if I take the repository code as-is and run the default command from README.md:
python main_resnet.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 3 \
    --devices 0 1 2
then each client locally uses the Adam optimizer instead of the SGD optimizer!

I believe the reason is that the default optimizer in config.yml is Adam. Although you change the value of cfg['optimizer_name'] in the process_control function in utils.py, this change is only visible to the main process. The Ray framework runs clients in parallel, assigning a separate process to each client, so when a client constructs a new optimizer in its step function, the parameters it reads from cfg are still those from config.yml, which means the client actually runs the Adam optimizer. To verify this, we printed the optimizer information on the client side, as shown below.
[screenshot]
Then, ① with optimizer_name set to Adam in config.yml (which is also the default in your code), running the command above gives the following result:
[screenshot]
② With optimizer_name set to SGD in config.yml, running the command above gives:
[screenshot]
After testing, mode ① reproduces the results of Table 3, while mode ② cannot be trained. That is why my attempted reproduction in issue #7 failed.
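If this diagnosis is right, the underlying pattern can be sketched as follows (hypothetical names, not the repository's actual code): a Ray worker re-imports the module and therefore sees the un-overridden default config, so the fix is to snapshot the resolved config in the driver and pass it to each worker explicitly.

```python
DEFAULT_CFG = {"optimizer_name": "Adam", "lr": 2e-4}  # what config.yml provides

def process_control(cfg):
    # Driver-side override, analogous to process_control in utils.py.
    cfg["optimizer_name"] = "SGD"
    return cfg

def client_step(cfg=None):
    # A worker process re-imports the module, so without an explicit
    # argument it falls back to the un-overridden default.
    cfg = cfg if cfg is not None else dict(DEFAULT_CFG)
    return cfg["optimizer_name"]

resolved = process_control(dict(DEFAULT_CFG))
print(client_step())          # Adam (stale default: the bug)
print(client_step(resolved))  # SGD  (explicitly passed: the fix)
```

With Ray, "passing explicitly" would mean making the resolved config an argument of the remote call rather than a module-level global.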

Question about Table 3

Hi, @samiul272, I'm sorry to bother you again. I would like to reproduce the high-data-heterogeneity results in Table 3 of your paper. I ran the following command:
python main.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 1 \
    --devices 0 1 2 3 4
However, the accuracy on the test set was very low, even after more than 2000 rounds of training. Is there a problem with my experimental setup?

[screenshot]

#--------------------------Update-----------------------------------#

By the way, I use the SGD optimizer rather than Adam, since your paper says you used SGD.
[screenshot]

The config.yml setting is shown as below.

control

exp_name: hetero_fl_roll_50_100
control:
  fed: '1'
  num_users: '20'
  frac: '1.0'
  data_split_mode: 'iid'
  model_split_mode: 'fix'
  model_mode: 'a1'
  norm: 'bn'
  scale: '1'
  mask: '1'
data

data_name: CIFAR10
subset: label
batch_size:
  train: 128
  test: 128
shuffle:
  train: False
  test: False
num_workers: 0
model_name: resnet18
metric_name:
  train:
    - Loss
    - Accuracy
  test:
    - Loss
    - Accuracy

optimizer

# optimizer_name: Adam
optimizer_name: SGD
lr: 2.0e-4
momentum: 0.9
weight_decay: 5.0e-4

scheduler

scheduler_name: None
step_size: 1
milestones:
  - 100
  - 150
patience: 10
threshold: 1.0e-3
factor: 0.5
min_lr: 1.0e-4

experiment

init_seed: 31
num_experiments: 1
num_epochs: 200
log_interval: 0.25
device: cuda
world_size: 1
resume_mode: 0

other

save_format: pdf

How to run the transformer experiment?

Hi, @samiul272, I want to run the transformer experiment on the Stack Overflow dataset, but I don't know how to process the data. Your code seems to load the data directly from a file named "/egr/research-zhanglambda/samiul/stackoverflow/stackoverflow_train.pt". However, after searching the Internet, I only found versions of the dataset in other formats. For example, I found a Stack Overflow dataset on Kaggle (https://www.kaggle.com/datasets/stackoverflow/stackoverflow), but there is no way to use it directly with your code. Can you tell me how to obtain a dataset like yours?

Question about the randomness of experiment

Hi, @samiul272, I ran your code with the following command
python main_resnet.py --data_name CIFAR10 --model_name resnet18 --control_name 1_100_0.1_non-iid-2_fix_a1-b1-c1-d1-e1_bn_1_1 --exp_name roll_test --algo roll --g_epoch 3200 --l_epoch 1 --lr 2e-4 --schedule 1200 --seed 31 --num_experiments 3 --devices 0 1 2 3 4
and set cfg['shuffle']['train']=False and cfg['shuffle']['test']=False. I didn't change any random seeds, but when I ran the code twice, the global model had inconsistent accuracy on the test set. The results of the first 6 rounds of the two runs are shown below. Is there any way to make them consistent? Is this randomness due to the use of the Ray framework?

the first run result

Test Epoch: 1(100%) Local-Loss: 2.2380 Local-Accuracy: 45.0000 Global-Loss: 2.3141 Global-Accuracy: 16.2200
Test Epoch: 2(100%) Local-Loss: 2.1790 Local-Accuracy: 43.6000 Global-Loss: 2.3342 Global-Accuracy: 12.0700
Test Epoch: 3(100%) Local-Loss: 2.1406 Local-Accuracy: 46.8000 Global-Loss: 2.3556 Global-Accuracy: 8.7900
Test Epoch: 4(100%) Local-Loss: 2.0628 Local-Accuracy: 54.2000 Global-Loss: 2.3327 Global-Accuracy: 10.7000
Test Epoch: 5(100%) Local-Loss: 2.0630 Local-Accuracy: 49.8000 Global-Loss: 2.3717 Global-Accuracy: 10.0100
Test Epoch: 6(100%) Local-Loss: 2.0313 Local-Accuracy: 49.0000 Global-Loss: 2.3643 Global-Accuracy: 13.5500

the second run result

Test Epoch: 1(100%) Local-Loss: 2.2269 Local-Accuracy: 51.7000 Global-Loss: 2.3072 Global-Accuracy: 17.6600
Test Epoch: 2(100%) Local-Loss: 2.1619 Local-Accuracy: 51.5000 Global-Loss: 2.3470 Global-Accuracy: 11.0700
Test Epoch: 3(100%) Local-Loss: 2.1262 Local-Accuracy: 48.0000 Global-Loss: 2.3466 Global-Accuracy: 9.4800
Test Epoch: 4(100%) Local-Loss: 2.0643 Local-Accuracy: 55.1000 Global-Loss: 2.3411 Global-Accuracy: 10.0100
Test Epoch: 5(100%) Local-Loss: 2.0296 Local-Accuracy: 49.6000 Global-Loss: 2.3790 Global-Accuracy: 9.6500
Test Epoch: 6(100%) Local-Loss: 1.9768 Local-Accuracy: 49.4000 Global-Loss: 2.3819 Global-Accuracy: 9.7700
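For what it's worth, exact repeatability usually requires seeding every source of randomness in every process. A minimal sketch follows; only the stdlib part is runnable here, the commented-out lines are what a PyTorch setup would additionally need (stated as assumptions), and with Ray the same function would also have to run inside each worker process, not just the driver.

```python
import os
import random

def seed_everything(seed: int):
    # Seed the stdlib RNG; PYTHONHASHSEED only affects child processes
    # launched after this point. With Ray, each remote worker is a separate
    # process, so call this inside every worker too.
    random.seed(seed)
    os.environ["PYTHONHASHSEED"] = str(seed)
    # A PyTorch setup would additionally need (not executed here):
    # numpy.random.seed(seed)
    # torch.manual_seed(seed); torch.cuda.manual_seed_all(seed)
    # torch.backends.cudnn.deterministic = True
    # torch.backends.cudnn.benchmark = False

seed_everything(31)
a = [random.random() for _ in range(3)]
seed_everything(31)
b = [random.random() for _ in range(3)]
print(a == b)  # True
```

Even with all seeds fixed, some CUDA kernels and the scheduling order of parallel workers can remain nondeterministic, which may explain run-to-run differences of the size shown above.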

Some questions about the 'overlap' and aggregation code

Hi, @samiul272, I am reading your code, but unfortunately I'm confused by the following code.
[screenshot]
Suppose we have 2 clients with $\beta_1=\beta_2=1$, and suppose there is a layer with only 10 neurons, i.e. $K_i=10$. Based on Appendix A.4 of your paper and the code in the screenshot above, I drew the situations for overlap=0.2 and overlap=1.0 in communication round $j$ in the picture below. Can you tell me if my understanding is right? If so, what does overlap mean?
[screenshot]

Why is this experiment needed? Besides, why do you shuffle the original order of the neurons (lines 51-52, shown below)?
[screenshot]
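For context, my reading of the rolling sub-model extraction that the overlap question concerns can be sketched like this. This is a hypothetical illustration only, not the repository's code: `shift_per_round` is my own name for whatever step the overlap parameter controls.

```python
def rolling_window(round_j, beta, K, shift_per_round=1):
    # In round j, take a contiguous window of beta*K of the layer's K
    # neurons, starting at an offset that advances by shift_per_round each
    # round and wraps around modulo K, so every neuron is trained over time.
    width = int(beta * K)
    start = (round_j * shift_per_round) % K
    return [(start + k) % K for k in range(width)]

# K = 10 neurons, half-capacity client (beta = 0.5):
print(rolling_window(0, 0.5, 10))  # [0, 1, 2, 3, 4]
print(rolling_window(1, 0.5, 10))  # [1, 2, 3, 4, 5]
print(rolling_window(9, 0.5, 10))  # [9, 0, 1, 2, 3]
```

With $\beta=1$ as in the question above, the window covers the whole layer, so any overlap setting would only change how much consecutive rounds' windows coincide for smaller clients.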

Error: No available node types can fulfill resource request {'CPU': 1.0, 'GPU': 0.15}. Add suitable node types to this cluster to resolve this issue.

Hi @samiul272, I downloaded your code and ran the commands
pip install -r requirements.txt
pip install tensorboard
Then, I ran the command
python main_resnet.py --data_name CIFAR10 \
    --model_name resnet18 \
    --control_name 1_100_0.1_non-iid-2_dynamic_a1-b1-c1-d1-e1_bn_1_1 \
    --exp_name roll_test \
    --algo roll \
    --g_epoch 3200 \
    --l_epoch 1 \
    --lr 2e-4 \
    --schedule 1200 \
    --seed 31 \
    --num_experiments 3 \
    --devices 0 1 2
However, I got the following error:
[screenshot]
Since I am not familiar with the Ray framework, can you help me solve this error?
