ashwinrj / federated-learning-pytorch

Implementation of Communication-Efficient Learning of Deep Networks from Decentralized Data

License: MIT License

Python 100.00%

deep-learning distributed-computing federated-learning python pytorch

federated-learning-pytorch's Issues

New dataset app

Hi, I want to try the model on a new dataset. Which .py files will I need to change? (utils.py and which others?)

Error calculating Test loss

Federated-Learning-PyTorch/src/update.py

Line 114 in 235b8f0

def test_inference(args, model, test_dataset):

I think this function should divide the accumulated loss by len(testloader) at the end, or am I missing something?
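To illustrate the point being raised: if the criterion returns a per-batch mean (the default `reduction='mean'` for PyTorch loss modules), then summing those values grows with the number of batches. The sketch below is a hypothetical helper, not code from the repository. It weights each batch loss by its batch size so the result is a per-sample mean, which also stays correct when the last batch is smaller.

```python
def average_loss(batch_losses, batch_sizes):
    """Return the per-sample mean loss from per-batch mean losses.

    Hypothetical helper for illustration; not part of the repository.
    """
    if len(batch_losses) != len(batch_sizes):
        raise ValueError("need one batch size per batch loss")
    total_samples = sum(batch_sizes)
    if total_samples == 0:
        raise ValueError("no samples to average over")
    # Undo each per-batch mean, then divide by the total sample count.
    weighted_sum = sum(loss * n for loss, n in zip(batch_losses, batch_sizes))
    return weighted_sum / total_samples
```

When all batches have the same size, this reduces to dividing the summed loss by the number of batches, which is the change suggested above.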

Miswriting in function 'get_dataset'.

There is a typo in /src/utils.py.

In the function get_dataset(args) (line 24),

        train_dataset = datasets.MNIST(data_dir, train=True, download=True,
                                       transform=apply_transform)

        test_dataset = datasets.MNIST(data_dir, train=False, download=True,
                                      transform=apply_transform)

should be

        train_dataset = datasets.CIFAR10(data_dir, train=True, download=True,
                                         transform=apply_transform)

        test_dataset = datasets.CIFAR10(data_dir, train=False, download=True,
                                        transform=apply_transform)

Parallel computing support

Hi, thanks for providing this wonderful repository. I'm wondering whether there will be support for parallelizing client training in each round,

specifically, having the local updates in federated_main.py executed by parallel processes:

for idx in idxs_users:
    local_model = LocalUpdate(args=args, dataset=train_dataset,
                              idxs=user_groups[idx], logger=logger)
    w, loss = local_model.update_weights(
        model=copy.deepcopy(global_model), global_round=epoch)
    local_weights.append(copy.deepcopy(w))
    local_losses.append(copy.deepcopy(loss))

Are there any suggestions on how to start working on this approach?
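One possible starting point is to move the per-client work into a function and hand it to a worker pool. The sketch below uses a thread pool from the standard library. `train_client` and `run_round` are hypothetical names, and the "training" is a placeholder for the real local update, so treat this as an outline of the control flow only.

```python
from concurrent.futures import ThreadPoolExecutor


def train_client(client_id, global_weights):
    """Stand-in for one client's local update (hypothetical)."""
    # A real version would copy the global model, train it on the client's
    # data, and return (updated_weights, loss).
    local_weights = [w + 0.1 * client_id for w in global_weights]
    loss = 1.0 / (client_id + 1)
    return local_weights, loss


def run_round(client_ids, global_weights, max_workers=4):
    """Run all selected clients concurrently and collect results in order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map preserves the order of client_ids in its results.
        results = list(
            pool.map(lambda cid: train_client(cid, global_weights), client_ids)
        )
    local_weights = [weights for weights, _ in results]
    local_losses = [loss for _, loss in results]
    return local_weights, local_losses
```

Threads keep the example simple. For true process-level parallelism, a process pool is the usual alternative, but the worker function and its arguments then have to be picklable, and sharing one GPU across processes needs extra care.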

there is something wrong in federated_main.py

The code should be changed at the place marked in the red box.

Regarding attribute errors during the federated learning both in equal and unequal cases

While running the code, I got the following errors. Can anyone explain what causes them?
For the equal case:

Traceback (most recent call last):
  File "src/federated_main.py", line 36, in <module>
    train_dataset, test_dataset, user_groups = get_dataset(args)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\utils.py", line 41, in get_dataset
    user_groups = cifar_noniid(train_dataset, args.num_users)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\sampling.py", line 173, in cifar_noniid
    labels = np.array(dataset.train_labels)
  File "C:\Users\sharm\.conda\envs\newEnv\lib\site-packages\torch\utils\data\dataset.py", line 83, in __getattr__
    raise AttributeError
AttributeError

For the unequal case:

Traceback (most recent call last):
  File "src/federated_main.py", line 36, in <module>
    train_dataset, test_dataset, user_groups = get_dataset(args)
  File "C:\Users\sharm\Downloads\Federated-Learning-PyTorch-master\src\utils.py", line 38, in get_dataset
    raise NotImplementedError()
NotImplementedError

Loss: nan WARNING:root:NaN or Inf found in input tensor. WARNING:root:NaN or Inf found in input tensor.

bug

In this for loop, are we looping over all users? The current code seems to be fixed for all users.

Federated-Learning-PyTorch/src/federated_main.py

Line 105 in 1bffc34

idxs=user_groups[idx], logger=logger)

When you are testing on all local models, why is it called train accuracy and train loss?

In federated_main.py, once you average the weights, you calculate accuracy on the TEST set of each client. So why are you calling it train accuracy/loss?

Anything I'm missing here?

regarding the training accuracy

Is this really training accuracy? Because in the update.py file the accuracy is computed as the test accuracy for the local test data.

Federated-Learning-PyTorch/src/federated_main.py

Line 115 in 1bffc34

print('Train Accuracy: {:.2f}% \n'.format(100*train_accuracy[-1]))

RuntimeError: Invalid device string: '0'

federated_main.py not working

Hi, I tried to run "python src/federated_main.py --model=cnn --dataset=cifar --gpu=0 --iid=1 --epochs=10"
but it is not working (with any option for federated_main.py, including dataset, model, and so on).

I found several issues reported in your repository and modified those parts, but there seem to be additional problems with the loop in 'federated_main.py'.

Is anyone else running into the same issue, or has anyone fixed it?

Can you get the result in CIFAR10 of the paper?

The results in the paper are very good, but I can't get similar results with this code.

A Small Issue with the MLP Model

In the MLP model, I think in the last layer it should be F.log_softmax instead of softmax.
Otherwise, the NLL loss would return negative values.
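A small check makes the sign issue concrete. NLL loss returns the negated input at the target index, so it expects log-probabilities. If it is fed plain softmax probabilities, the result is a negative number between -1 and 0. The tensors below are made-up values for illustration.

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 0.5, -1.0]])  # illustrative scores for 3 classes
target = torch.tensor([0])

# NLL loss picks out -input[target], so it expects log-probabilities.
loss_with_softmax = F.nll_loss(F.softmax(logits, dim=1), target)
loss_with_log_softmax = F.nll_loss(F.log_softmax(logits, dim=1), target)

# Cross-entropy combines log_softmax and NLL loss in one call.
reference = F.cross_entropy(logits, target)
```

Here `loss_with_softmax` is negative, while `loss_with_log_softmax` is positive and matches `reference`.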

Traceback (most recent call last):
  File "federated_main.py", line 33, in <module>
    torch.cuda.set_device(args.gpu)
  File "D:\Anaconda3\lib\site-packages\torch\cuda\__init__.py", line 243, in set_device
    device = _get_device_index(device)
  File "D:\Anaconda3\lib\site-packages\torch\cuda\_utils.py", line 20, in _get_device_index
    device = torch.device(device)
RuntimeError: Expected one of cpu, cuda, mkldnn, opengl, opencl, ideep, hip, msnpu device type at start of device string: 0
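The traceback suggests that `args.gpu` reached `torch.cuda.set_device` as the string '0' and not as an integer. A minimal sketch of one way to normalize the value is shown below. `resolve_device` is a hypothetical helper, and it assumes that a missing `--gpu` value means "use the CPU".

```python
import torch


def resolve_device(gpu):
    """Map a --gpu argument (None, int, or numeric string) to a torch.device.

    Hypothetical helper; falls back to the CPU when CUDA is unavailable.
    """
    if gpu is None or not torch.cuda.is_available():
        return torch.device("cpu")
    # int() accepts both 0 and "0", so a string from argparse is fine here.
    return torch.device(f"cuda:{int(gpu)}")
```

Declaring the argument with `type=int` in argparse would address the same problem at the source.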

The optimizer of clients is created every epoch?

Hi, thanks for the code.

According to the lines in update.py:

if self.args.optimizer == 'sgd':
    optimizer = torch.optim.SGD(model.parameters(), lr=self.args.lr,
                                momentum=0.5)
elif self.args.optimizer == 'adam':
    optimizer = torch.optim.Adam(model.parameters(), lr=self.args.lr,
                                 weight_decay=1e-4)

The optimizer is created for every epoch, is that correct?

A small issue in creating an MLP model

It seems to me that the following line shouldn't be inside the loop and it should be moved outside the loop.

Federated-Learning-PyTorch/src/federated_main.py

Line 55 in 1bffc34

global_model = MLP(dim_in=len_in, dim_hidden=64,

Why is the accuracy so low when using the CIFAR dataset?

The accuracy is about 40% under the following conditions. Is there a way to improve it?
Local epochs: 10
Batch size: 10
Global epochs: 50
Learning rate: 0.5 ~ 0.001

I am not able to plot the graph

Does anybody have any idea?

NameError: name 'global_model' is not defined

AttributeError: 'CIFAR10' object has no attribute 'train_labels'

Files already downloaded and verified
Traceback (most recent call last):
  File "/Federated-Learning-PyTorch/src/sampling.py", line 282, in <module>
    d = cifar_noniid(dataset_train, num)
  File "/Federated-Learning-PyTorch/src/sampling.py", line 248, in cifar_noniid
    labels = np.array(dataset.train_labels)
AttributeError: 'CIFAR10' object has no attribute 'train_labels'
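Recent torchvision releases store CIFAR10 labels in `targets`, while older code read `train_labels`. A hedged workaround is to look the attribute up defensively. `get_labels` below is a hypothetical helper, not part of the repository.

```python
def get_labels(dataset):
    """Return the labels of a torchvision-style dataset.

    Hypothetical helper: tries `targets` first, then the older `train_labels`.
    """
    for attr in ("targets", "train_labels"):
        labels = getattr(dataset, attr, None)
        if labels is not None:
            return labels
    raise AttributeError("dataset exposes neither 'targets' nor 'train_labels'")
```

In the sampling code, `labels = np.array(get_labels(dataset))` would then replace the direct attribute access.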

What is the difference between the conventional way and the federated way?

Some issues in federated_main.py

On line 105, it should be
idxs=user_groups[c]
because we need to traverse all users.

Bug with calculating the Training Loss

Federated-Learning-PyTorch/src/federated_main.py

Line 105 in 235b8f0

idxs=user_groups[idx], logger=logger)

Here, I think we have to replace the idx with c because we want to calculate the Training error of the global_model on all training data (after the averaging).

about the average_weights function

In the original paper, it uses a weighted average here. However, the implementation in average_weights is the simple average. Is there a bug or do I misunderstand something? Thanks!
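For reference, the FedAvg update in the paper weights each client's parameters by its share of the training samples, n_k / n. The sketch below shows what a sample-weighted variant could look like. `weighted_average_weights` and `num_samples` are assumed names, and this is not the repository's `average_weights`.

```python
import copy

import torch


def weighted_average_weights(local_weights, num_samples):
    """Average state dicts, weighting client k by n_k / sum(n).

    Hypothetical variant for illustration. Integer buffers are cast to
    float, which a real implementation may want to handle separately.
    """
    total = float(sum(num_samples))
    avg = copy.deepcopy(local_weights[0])
    for key in avg:
        avg[key] = sum(
            w[key].float() * (n / total)
            for w, n in zip(local_weights, num_samples)
        )
    return avg
```

When every client holds the same number of samples, all weights equal 1 / K and this reduces to the simple average.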

regarding saving the file

While executing the federated learning code and the MLP code, I am getting this error:
Traceback (most recent call last):
  File "src/federated_main.py", line 129, in <module>
    with open(file_name, 'wb') as f:
FileNotFoundError: [Errno 2] No such file or directory: '../save/objects/cifar_cnn_5_C[0.1]_iid[1]_E[10]_B[10].pkl'
170500096it [09:57, 285128.32it/s]
Do I have to create some files?
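`open(..., 'wb')` creates the file but not its missing parent directories, and a relative path such as `../save/objects/` is resolved against the current working directory. A minimal sketch of a save helper that creates the directory first is shown below. `save_object` and the file name are illustrative, not taken from the repository.

```python
import os
import pickle
import tempfile


def save_object(obj, file_name):
    """Pickle `obj` to `file_name`, creating parent directories if needed."""
    parent = os.path.dirname(file_name)
    if parent:
        os.makedirs(parent, exist_ok=True)
    with open(file_name, "wb") as f:
        pickle.dump(obj, f)


# Usage with a throwaway directory; the file name is illustrative only.
base = tempfile.mkdtemp()
path = os.path.join(base, "save", "objects", "results.pkl")
save_object([0.5, 0.4], path)
```

Creating the `save/objects` directory by hand before running the script has the same effect.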

Question

Hello, I am just curious to know: what does line 72 in update.py do? Does it forward the image to the model?
log_probs = model(images)

problem when i'm trying to use the --unequal=1 case

raise ValueError("batch_size should be a positive integer value, "
ValueError: batch_size should be a positive integer value, but got batch_size=0
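The error means a DataLoader was constructed with `batch_size=0`. One plausible cause, stated here as an assumption and not a confirmed diagnosis, is a batch size derived as a fraction of a split that is very small for some users in the unequal setting. A hypothetical guard could look like this:

```python
def safe_batch_size(num_samples, divisor=10):
    """Return num_samples // divisor, but never less than 1.

    Hypothetical guard; `divisor` is an assumed parameter for illustration.
    """
    if num_samples <= 0:
        raise ValueError("cannot build a DataLoader from an empty split")
    return max(1, num_samples // divisor)
```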

Why is the test accuracy on CIFAR so low?

CNNcifar has a problem running

First, thanks for the amazing work. When I try to run the CNN on the CIFAR10 dataset, I get a runtime error, and I wonder how to solve it. The error message is at the link.

It is at line 64 of update.py.

Thanks again for your work.

AttributeError: 'Namespace' object has no attribute 'gpu_id'

Hello, I ran 'python src/federated_main.py --model=cnn --dataset=mnist --iid=0 --epochs=10 --gpu=1'
but I keep receiving this error message:
Traceback (most recent call last):
  File "src/federated_main.py", line 34, in <module>
    if args.gpu_id:
AttributeError: 'Namespace' object has no attribute 'gpu_id'

I am using Windows 10 and have made sure that I have GPU and GPU 1 in my Task Manager. Thanks
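The command line passes `--gpu`, so argparse stores the value under `args.gpu`, and a lookup of `args.gpu_id` fails. The sketch below reproduces that with an assumed argument definition. The real parser in the repository may differ.

```python
import argparse

parser = argparse.ArgumentParser()
# Assumed definition: the command above passes --gpu, so the parsed
# attribute is `args.gpu`, not `args.gpu_id`.
parser.add_argument("--gpu", type=int, default=None,
                    help="CUDA device index; omit to run on the CPU")

args = parser.parse_args(["--gpu", "1"])
gpu = getattr(args, "gpu", None)  # tolerant lookup; None means CPU
use_gpu = gpu is not None         # `if args.gpu:` would treat GPU 0 as False
```

Checking `is not None` instead of truthiness matters because device index 0 is falsy in Python.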

copy.deepcopy(model), why?

Hello, your project has helped me a lot. Thank you very much. I have a question: why do I need copy.deepcopy(model) when implementing a federated learning model? It seems that without copy.deepcopy all models end up with the same weights, and the models differ only when it is used. Why is that?
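A short experiment shows the aliasing behind this. The tensors returned by `state_dict()` share storage with the model's parameters, so they change whenever the model is updated in place, while a deep copy is independent. The tiny `nn.Linear` model here is only a stand-in.

```python
import copy

import torch
from torch import nn

model = nn.Linear(2, 1)
alias = model.state_dict()                    # tensors share storage with the model
snapshot = copy.deepcopy(model.state_dict())  # independent copy

with torch.no_grad():
    model.weight.add_(1.0)  # stands in for a training step that updates in place

alias_follows_model = torch.equal(alias["weight"], model.weight)
snapshot_unchanged = not torch.equal(snapshot["weight"], model.weight)
```

The same reasoning applies to passing one model object to every client: without a copy, each client would train the same underlying parameters.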

How to realize communication and "federated"?

I wonder why I can run "federated_main.py" on only one GPU (standalone deployment). Since I got acc.png and loss.png, I believe I ran the script successfully. Is that right? Do the code and experiments involve a communication phase? Can this be called federated learning?
If so, which lines of the code implement the communication?
How can I get information (specific figures) about the communication time and the volume of communicated data?

Looking forward to somebody's reply. Many thanks!

Is multi-machine training supported?

Is multi-machine training supported? Could you provide an example?
