zhuangdizhu / fedgen
Code and data accompanying the FedGen paper
If you hit wrong-tensor-type errors when running experiments with the FedGen algorithm, see the changes in #3.
Thank you for the great work.
Also, has anyone tried training on CIFAR-10? I followed the setup for Mnist: replaced the Mnist data loader with CIFAR-10, changed the input dimension from 1 to 3, and kept the same models. However, the result is not good (about 31%) with FedAvg.
Is there any special setting needed when experimenting with a new dataset? Thank you.
Thanks
Hi.
Does your implementation of FedProx correspond to Algorithm 2 in the original FedProx paper? More specifically, the update formula at lines 53-54 of "fedoptimizer.py" looks a little strange to me. In particular, what does lambda mean in the FedProx algorithm?
The update formula as I understand it should be:
p.data = p.data - group['lr'] * (p.grad.data + group['mu'] * (p.data - pstar.data.clone()))
Looking forward to your reply.
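For reference, my reading of the FedProx paper: each client k minimizes a proximal objective around the global model w^t, so a single SGD step takes the form below (here \mu is the proximal coefficient, matching group['mu'] in the code, and \eta is the learning rate):

h_k(w; w^t) = F_k(w) + \frac{\mu}{2}\,\lVert w - w^t \rVert^2
w \leftarrow w - \eta\,\bigl(\nabla F_k(w) + \mu\,(w - w^t)\bigr)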
No issue.
I ran the example experiment for FedGen on Mnist from README.md with the option "--device cuda", but found that no process was deployed on the GPU. Exploring the code further, it seems that "args.device" is not handled in any of the scripts. I also added "os.environ["CUDA_VISIBLE_DEVICES"] = '0'" in main.py, but the model is still deployed only on the CPU. I wonder how I can utilize the GPU for FedGen. I really appreciate your help!
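As a workaround, the generic PyTorch pattern is sketched below (this is not the repo's actual wiring; the model here is a stand-in): resolve the device once, then move both the models and every batch onto it.

import torch

# Minimal sketch: pick the device, falling back to CPU when CUDA is unavailable.
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

model = torch.nn.Linear(784, 10)   # stand-in for the FedGen/user model
model.to(device)                   # parameters now live on the GPU if available

X = torch.randn(32, 784)
X = X.to(device)                   # every batch must be moved as well
out = model(X)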
FedGen/FLAlgorithms/users/userpFedGen.py
Line 58 in 0bfd4e1
Inside this loop, the first time user_output_logp is used it is the one defined at line 47, outside the loop; in every later iteration it is the one defined at line 64.
The former is the label of the locally trained batch, while the latter is the label of a randomly chosen batch. Isn't that a bit odd?
It seems the code as implemented does not perform partial parameter sharing. As shown in line 103 of serverpFedGen.py, the partial-parameter option defaults to False, but the pseudo-code in the paper shows that only the classifier layer of the user's model is shared. Is this a bug, or is there something I misunderstand in the code?
self.aggregate_parameters()
It seems that the function 'visualize_image' doesn't work when using the commands.
Hello,
I have been working with the FedGen implementation and have a question regarding the broadcasting of the updated generative model w to users after it has been trained on the server.
In the FedGen class, the generative model w is trained using the train_generator method. However, I couldn't find the part of the code where the updated generative model parameters are broadcast to the users after each iteration.
I noticed that the send_parameters method broadcasts the global model parameters to users but does not broadcast the generative model parameters.
def train(self, args):
    #### pretraining
    for glob_iter in range(self.num_glob_iters):
        print("\n\n-------------Round number: ", glob_iter, " -------------\n\n")
        self.selected_users, self.user_idxs = self.select_users(glob_iter, self.num_users, return_idx=True)
        if not self.local:
            self.send_parameters(mode=self.mode)  # broadcast averaged prediction model
        self.evaluate()
        chosen_verbose_user = np.random.randint(0, len(self.users))
        self.timestamp = time.time()  # log user-training start time
        for user_id, user in zip(self.user_idxs, self.selected_users):  # allow selected users to train
            verbose = user_id == chosen_verbose_user
            # perform regularization using generated samples after the first communication round
            user.train(
                glob_iter,
                personalized=self.personalized,
                early_stop=self.early_stop,
                verbose=verbose and glob_iter > 0,
                regularization=glob_iter > 0)
        curr_timestamp = time.time()  # log user-training end time
        train_time = (curr_timestamp - self.timestamp) / len(self.selected_users)
        self.metrics['user_train_time'].append(train_time)
        if self.personalized:
            self.evaluate_personalized_model()
        self.timestamp = time.time()  # log server-agg start time
        self.train_generator(
            self.batch_size,
            epoches=self.ensemble_epochs // self.n_teacher_iters,
            latent_layer_idx=self.latent_layer_idx,
            verbose=True
        )
        self.aggregate_parameters()
        curr_timestamp = time.time()  # log server-agg end time
        agg_time = curr_timestamp - self.timestamp
        self.metrics['server_agg_time'].append(agg_time)
        if glob_iter > 0 and glob_iter % 20 == 0 and self.latent_layer_idx == 0:
            self.visualize_images(self.generative_model, glob_iter, repeats=10)
    self.save_results(args)
    self.save_model()
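If one wanted the generator to be broadcast as well, a minimal sketch could look like the following. This is an assumption on my part, not the repo's behavior; a user-side generative_model attribute is hypothetical.

def send_generator(self):
    # Hypothetical helper: copy the server generator's weights into
    # each selected user's local copy of the generative model.
    for user in self.selected_users:
        for server_param, user_param in zip(
                self.generative_model.parameters(),
                user.generative_model.parameters()):
            user_param.data = server_param.data.clone()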
Hello,
Sorry, I have a problem with main_plot.py.
The problem:
FileNotFoundError: [Errno 2] No such file or directory: 'figs\Mnist/ratio0.5\Mnist-ratio0.5.png'
I hope you can take a look when you have time; I have only just started in this direction. Thank you!
Thank you for open-sourcing your project. I notice that "FedDF" (Ensemble Distillation for Robust Model Fusion in Federated Learning) is one of the baselines in your paper; however, the repository provides code only for FedAvg, FedProx, FedDistill, and FedGen. Could you please help me reproduce the results of FedDF? I really appreciate your help.
I think that in plot_utils.py the variable 'all_curves', used outside the loop, only keeps the last algorithm's results. As a result, when several algorithms are listed in the config, the plotted figure clips the other algorithms' curves to the last algorithm's range.
max_acc = np.max([max_acc, np.max(all_curves) ]) + 4e-2
python main_plot.py --dataset EMnist-alpha0.1-ratio0.1 --algorithms FedAvg,FedGen,FedProx,FedDistill --batch_size 32 --local_epochs 20 --num_users 10 --num_glob_iters 200 --plot_legend 1
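A possible fix, sketched under assumptions (load_curves and plot_curves are hypothetical stand-ins for the per-algorithm logic in plot_utils.py): accumulate every algorithm's curves before computing the axis limit.

import numpy as np

all_curves = []                        # accumulate across ALL algorithms
for algorithm in algorithms:
    curves = load_curves(algorithm)    # hypothetical: one algorithm's accuracy curves
    plot_curves(algorithm, curves)     # hypothetical: per-algorithm plotting
    all_curves.extend(curves)          # extend instead of overwriting

# The y-axis limit now covers every algorithm, not just the last one.
max_acc = np.max([np.max(c) for c in all_curves]) + 4e-2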
Hi Zhuang,
Can you share the script used to generate the Celeb data? Thanks.
When I run "python main.py --dataset Mnist-alpha0.01-ratio0.05 --algorithm FedAvg --batch_size 32 --num_glob_iters 200 --local_epochs 20 --num_users 10 --lamda 1 --learning_rate 0.01 --model cnn --personal_learning_rate 0.01 --times 3", I get the following problem. How can I solve it?
Average Global Accurancy = 0.0950, Loss = 2.31.
Traceback (most recent call last):
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\FLAlgorithms\users\userbase.py", line 163, in get_next_train_batch
(X, y) = next(self.iter_trainloader)
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 676, in _next_data
index = self._next_index() # may raise StopIteration
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 623, in _next_index
return next(self._sampler_iter) # may raise StopIteration
StopIteration
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\main.py", line 85, in
main(args)
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\main.py", line 42, in main
run_job(args, i)
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\main.py", line 37, in run_job
server.train(args)
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\FLAlgorithms\servers\serveravg.py", line 35, in train
user.train(glob_iter, personalized=self.personalized) #* user.train_samples
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\FLAlgorithms\users\useravg.py", line 23, in train
result =self.get_next_train_batch(count_labels=count_labels)
File "C:\kust\xuesu\code\FedGen-main\FedGen-main\FLAlgorithms\users\userbase.py", line 167, in get_next_train_batch
(X, y) = next(self.iter_trainloader)
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 633, in next
data = self._next_data()
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 676, in _next_data
index = self._next_index() # may raise StopIteration
File "C:\Users\Administrator\anaconda3\envs\FedGen\lib\site-packages\torch\utils\data\dataloader.py", line 623, in _next_index
return next(self._sampler_iter) # may raise StopIteration
StopIteration
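For what it's worth, the traceback suggests get_next_train_batch already catches the first StopIteration, recreates the iterator, and then next() raises again; that usually means the user's train loader yields zero batches (e.g. fewer samples than --batch_size with drop_last=True). A sketch of the usual guard, with names assumed from userbase.py:

def get_next_train_batch(self):
    try:
        (X, y) = next(self.iter_trainloader)
    except StopIteration:
        # Epoch exhausted: restart the iterator and try once more.
        # If this second next() also raises StopIteration, the loader
        # yields zero batches, e.g. batch_size > len(dataset) with
        # drop_last=True; lowering --batch_size or creating the
        # DataLoader with drop_last=False avoids that.
        self.iter_trainloader = iter(self.trainloader)
        (X, y) = next(self.iter_trainloader)
    return X, y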
I wonder why the user_latent_loss is not mentioned in your paper.
It seems that the code does not support CUDA?
--device "cuda" can be set, but it seems that it always runs on the CPU.
Thanks
It seems that torch.rand generates values uniformly in [0, 1) according to the official documentation, rather than from a standard Gaussian. Is this intended? Thanks
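For context, this is standard PyTorch behavior: torch.rand samples Uniform[0, 1), while torch.randn samples a standard Gaussian, so swapping one for the other is a one-character change:

import torch

u = torch.rand(4)    # uniform on [0, 1)
z = torch.randn(4)   # standard normal: mean 0, std 1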
Full error message: RuntimeError: Can't call numpy() on Tensor that requires grad. Use tensor.detach().numpy() instead.
Added the following as line 227 to serverbase.py to resolve:
test_losses = [t.detach() for t in test_losses]
Python version: 3.8.6
When I ran the EMNIST experiment after generating the emnist dataset, I got:
(pt) wangshu@ubuntu:~/projects/FedGen$ CUDA_VISIBLE_DEVICES=3 python main.py --dataset EMnist-alpha0.1-ratio0.1 --algorithm FedGen --batch_size 32 --local_epochs 20 --num_users 10 --lamda 1 --model cnn --learning_rate 0.01 --personal_learning_rate 0.01 --num_glob_iters 200 --times 3
================================================================================
Summary of training process:
Algorithm: FedGen
Batch size: 32
Learing rate : 0.01
Ensemble learing rate : 0.0001
Average Moving : 1.0
Subset of users : 10
Number of global rounds : 200
Number of local rounds : 20
Dataset : EMnist-alpha0.1-ratio0.1
Local Model : cnn
Device : cpu
================================================================================
[ Start training iteration 0 ]
Creating model for emnist
Network configs: [6, 16, 'F']
Dataset emnist
/home/wangshu/miniconda3/envs/pt/lib/python3.9/site-packages/torch/nn/_reduction.py:42: UserWarning: size_average and reduce args will be deprecated, please use reduction='none' instead.
warnings.warn(warning.format(ret))
Build layer 57 X 256
Build last layer 256 X 32
ensemble_lr: 0.0001
ensemble_batch_size: 128
unique_labels: 25
latent_layer_idx: -1
label embedding 0
ensemeble learning rate: 0.0001
ensemeble alpha = 1, beta = 0, eta = 1
generator alpha = 10, beta = 1
Number of Train/Test samples: 12480 8120
Data from 20 users in total.
Finished creating FedAvg server.
-------------Round number: 0 -------------
Traceback (most recent call last):
File "/home/wangshu/projects/FedGen/main.py", line 85, in <module>
main(args)
File "/home/wangshu/projects/FedGen/main.py", line 42, in main
run_job(args, i)
File "/home/wangshu/projects/FedGen/main.py", line 37, in run_job
server.train(args)
File "/home/wangshu/projects/FedGen/FLAlgorithms/servers/serverpFedGen.py", line 78, in train
self.evaluate()
File "/home/wangshu/projects/FedGen/FLAlgorithms/servers/serverbase.py", line 226, in evaluate
test_ids, test_samples, test_accs, test_losses = self.test(selected=selected)
File "/home/wangshu/projects/FedGen/FLAlgorithms/servers/serverbase.py", line 165, in test
ct, c_loss, ns = c.test()
File "/home/wangshu/projects/FedGen/FLAlgorithms/users/userbase.py", line 137, in test
loss += self.loss(output, y)
File "/home/wangshu/miniconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
result = self.forward(*input, **kwargs)
File "/home/wangshu/miniconda3/envs/pt/lib/python3.9/site-packages/torch/nn/modules/loss.py", line 216, in forward
return F.nll_loss(input, target, weight=self.weight, ignore_index=self.ignore_index, reduction=self.reduction)
File "/home/wangshu/miniconda3/envs/pt/lib/python3.9/site-packages/torch/nn/functional.py", line 2388, in nll_loss
ret = torch._C._nn.nll_loss(input, target, weight, _Reduction.get_enum(reduction), ignore_index)
IndexError: Target 25 is out of bounds.
(pt) wangshu@ubuntu:~/projects/FedGen$
PyTorch 1.8.1, Python 3.9.4.
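For what it's worth, "Target 25 is out of bounds" usually means a label equals or exceeds the model's class count: the log shows unique_labels: 25, while the EMNIST letters labels appear to run 0..25, i.e. 26 classes. A quick diagnostic sketch (all_train_labels is a hypothetical stand-in for the loaded labels):

import torch

labels = torch.as_tensor(all_train_labels)   # hypothetical: every training label
num_classes = int(labels.max().item()) + 1   # e.g. labels 0..25 -> 26 classes
print(num_classes)  # the output layer must have at least this many units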
The performance of FedAvg is not as good as FedGen simply because the train loader does not shuffle. After fixing this bug, FedGen is not as effective as FedAvg.
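For reference, shuffling is controlled by the DataLoader flag in standard PyTorch (the dataset below is a stand-in, not the repo's loader):

import torch
from torch.utils.data import DataLoader, TensorDataset

dataset = TensorDataset(torch.randn(100, 784), torch.randint(0, 10, (100,)))
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)  # reshuffle every epoch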
[ Start training iteration 0 ]
Creating model for mnist
Network configs: [6, 16, 'F']
Algorithm FedDistll-FL has not been implemented.