fedml-ai / fedml

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.

Home Page: https://fedml.ai

License: Apache License 2.0

Python 78.17% Shell 1.87% Dockerfile 0.35% PowerShell 0.01% Batchfile 0.16% Java 2.91% Jupyter Notebook 14.82% CMake 0.11% C++ 1.40% C 0.02% Smarty 0.14% Jinja 0.03%
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

fedml's People

Contributors

alaydshah, amir-zsh, asce1885, bbuyukates, beiyuouo, chaoyanghe, dependabot[bot], elliebababa, emirceyani, fedml-ai-admin, fedml-alex, han-shanshan, joyerf, leigao97, mrigankraman, mzp0625, nicole456, phenomenal-manish, prosopher, raphael-jin, ray-ruisun, ryantrojans, taokz, wizard1203, wzpan, xiaoyang-wang, yanfangli1986, yvonne-fedml, zhang-tuo-pdf, zijian-hu


fedml's Issues

Errors in downloading datasets from Google Drive

I recommend that the data manager move the datasets from Google Drive to another data repository, because of the following errors.

  • 'Quota exceeded' error
    While executing 'CI-install.sh', I encountered a 'quota exceeded' error for some dataset files from Google Drive, which produced incomplete '.h5' files and caused a CI failure.
    The files may be locked for a 24-hour period before the quota is reset. [link]

  • 'Virus scan warning' error
    For the 'sh download_federatedEMNIST.sh' command in 'CI-install.sh', I encountered another type of error, which produced an incomplete 'emnist_test.h5' containing:

<title>Google Drive - Virus scan warning</title>

Google Drive can't scan this file for viruses.

emnist_test.h5 (233M) is too large for Google to scan for viruses. Would you still like to download this file?

Download anyway

© 2020 Google - Help - Privacy & Terms
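As a workaround until the datasets are rehosted, one option is to download the files with a tool that handles the virus-scan confirmation page and then check that the result is really an HDF5 file rather than saved HTML. Below is a minimal sketch assuming the gdown and h5py packages are installed; the file ID is a placeholder, not the real one from download_federatedEMNIST.sh.

import os
import gdown
import h5py

FILE_ID = "GOOGLE_DRIVE_FILE_ID"  # placeholder; in practice taken from download_federatedEMNIST.sh
OUTPUT = "emnist_test.h5"

# gdown follows the "Download anyway" confirmation page that plain wget/curl misses.
gdown.download(id=FILE_ID, output=OUTPUT, quiet=False)

# A truncated download or a saved "Virus scan warning" HTML page fails this check.
if not h5py.is_hdf5(OUTPUT):
    os.remove(OUTPUT)
    raise RuntimeError(f"{OUTPUT} is not a valid HDF5 file; the Google Drive download likely failed")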

Question about the step in wandb log

Hi, thanks for sharing this awesome project. I'm trying the standalone experiments and have a question about the CIFAR results. I consulted the online result (https://wandb.ai/automl/fedml/runs/2cxam561/), but I don't understand the total number of steps in the log. According to the plot, there are 400 steps in total, but the configuration is 100 rounds and 20 epochs. I read the code, and to my understanding the default evaluation frequency is 5, so there should be 20 steps (5 rounds per step) in the log. So why does the online log have 400 steps? Thanks!

It seems that the number of joining clients (not the number of computing clients) is set in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.

Here I mean that the total number of clients seems to be decided by the dataset, rather than by the input from run_fedavg_distributed_pytorch.sh.

Specifically, the variable client_num_in_total is changed here by the data_loader depending on the dataset. So I have two questions:

  1. How should we partition the whole dataset into client_num_in_total parts, where client_num_in_total is defined by the input?

  2. Maybe there is a better way: write APIs that split the data from the original dataset rather than creating a new dataset, which would be more convenient and flexible. Users could then download the original dataset and use the data-split API, saving disk space. (A possible data-split API is sketched after this list.)
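Along the lines of question 2, a data-split API could look roughly like the sketch below. This is not FedML's actual API; it is a hypothetical helper, assuming a generic torch Dataset, that partitions the sample indices uniformly at random into client_num_in_total subsets.

import numpy as np
from torch.utils.data import Dataset, Subset

def split_dataset_for_clients(dataset: Dataset, client_num_in_total: int, seed: int = 0):
    # Shuffle all sample indices once, then hand out one chunk per client.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(dataset))
    chunks = np.array_split(indices, client_num_in_total)
    return [Subset(dataset, chunk.tolist()) for chunk in chunks]

# Usage: client_datasets = split_dataset_for_clients(cifar10_train, client_num_in_total=100)

A non-IID variant could replace the uniform shuffle with, for example, a Dirichlet allocation over labels, in the spirit of the existing 'hetero' partition with partition_alpha.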

fed_CIFAR100 dataset loading error

Traceback (most recent call last):
  File "./main_fedavg.py", line 163, in <module>
    dataset = load_data(args, args.dataset)
  File "./main_fedavg.py", line 109, in load_data
    class_num = load_partition_data_federated_cifar100(args.dataset, args.data_dir, batch_size=args.batch_size)
  File "/home/chaoyanghe/sourcecode/fedml.ai/fedml_api/data_preprocessing/fed_cifar100/data_loader.py", line 115, in load_partition_data_federated_cifar100
    train_data_local, test_data_local = get_dataloader(dataset, data_dir, batch_size, batch_size, client_idx)
  File "/home/chaoyanghe/sourcecode/fedml.ai/fedml_api/data_preprocessing/fed_cifar100/data_loader.py", line 54, in get_dataloader
    test_dl = data.DataLoader(dataset = test_ds, batch_size=test_bs, shuffle = True, drop_last = False)
  File "/home/chaoyanghe/miniconda/envs/fedml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 224, in __init__
    sampler = RandomSampler(dataset, generator=generator)
  File "/home/chaoyanghe/miniconda/envs/fedml/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
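The num_samples=0 error indicates that some client's local test split is empty, so RandomSampler has nothing to draw from. One possible workaround, sketched below on the assumption that get_dataloader builds the per-client DataLoaders, is to skip the loader for an empty test set; this is only an illustration, not the maintainers' fix.

from torch.utils import data

def make_local_test_loader(test_ds, test_bs):
    # An empty per-client test set makes RandomSampler raise
    # "num_samples should be a positive integer value, but got num_samples=0".
    if len(test_ds) == 0:
        return None  # the caller should skip local evaluation for this client
    return data.DataLoader(dataset=test_ds, batch_size=test_bs, shuffle=True, drop_last=False)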

The code does not run (FedAvg algorithm)

I can SSH from the central server into the clients without a password, and the other software is all configured, but why does the server fail to recognize the clients' hostnames when I run the code on it?
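When the distributed scripts are launched through MPI, a hostname the server cannot recognize usually points at the host file or /etc/hosts rather than at FedML itself. The snippet below is a simple diagnostic sketch; the file name mpi_host_file is an assumption (use whatever file your run script passes to mpirun), and the script is not part of FedML.

import socket

# Assumed host file name; one hostname (optionally followed by a slot count) per line.
with open("mpi_host_file") as f:
    hosts = [line.split()[0] for line in f if line.strip() and not line.startswith("#")]

for host in hosts:
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror:
        print(host, "-> cannot be resolved; add it to /etc/hosts or DNS on the server")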

Processes are suspended for some unknown reason

When I run the fedgkt algorithm with the following command:
sh run_FedGKT.sh 8 cifar10 homo 10 20 1 Adam 0.001 1 0 resnet56 fedml_resnet56_homo_cifar10 "./../../../data/cifar10" 64

The processes are often suspended for some reason. I have obtained the result successfully only once.

image
The one cause I have figured out is a connection error between the process and wandb.
After solving the connection problem, there are still other potential causes.
How can I figure it out?

The arXiv uid in citation is from Horovod

Hi Chaoyang,

In the current branch, the arXiv uid (1802.05799) refers to Horovod. You may want to update it with your own.

@article{chaoyanghe2020fedml,
  Author = {Chaoyang He},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {FedML: a Flexible and Generic Federated Learning Library and Benchmark},
  Year = {2020}
}

The same error happens when running FedAvg in standalone mode, right after cloning the repository without any modification!

When I use the commands below, which are included in readme.md,
( nohup sh run_fedavg_standalone_pytorch.sh 2 10 64 cifar10 ./../../../data/cifar10 resnet56 homo 200 20 0.001 > ./fedavg_standalone.txt 2>&1
nohup sh run_fedavg_standalone_pytorch.sh 2 10 10 mnist ./../../../data/mnist lr hetero 200 20 0.03 > ./fedavg_standalone.txt 2>&1 &)

The same error occurs:
Traceback (most recent call last):
  File "./main_fedavg.py", line 160, in <module>
    trainer = FedAvgTrainer(dataset, model, device, args)
  File "/home/xx/proj/Source/FedML/fedml_api/standalone/fedavg/fedavg_trainer.py", line 22, in __init__
    self.model_global.train()
AttributeError: 'NoneType' object has no attribute 'train'

Given my limited understanding of PyTorch, can anyone point out where in the code things go wrong?
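For what it's worth, the AttributeError means the model object passed to FedAvgTrainer is None, which typically happens when the model/dataset combination falls through every branch of the model-construction code so that nothing is returned. A defensive check like the hypothetical sketch below (the create_model name and signature are assumptions based on main_fedavg.py, not a confirmed API) would surface the real cause immediately.

# Hypothetical guard around model construction in main_fedavg.py.
model = create_model(args, model_name=args.model, output_dim=class_num)  # assumed signature
if model is None:
    raise ValueError(
        f"create_model returned None for model='{args.model}' and dataset='{args.dataset}'; "
        "this combination is probably not handled in create_model"
    )
trainer = FedAvgTrainer(dataset, model, device, args)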

What should I do if I only have a GPU node with 8 graphics cards to run the distributed algorithms?

Could anyone please tell me what I should do if I only have a single GPU node, functioning as a login node and a compute node at the same time, to run the distributed algorithms?
It works pretty well when running the fedavg algorithm under FedML's distributed architecture on the hardware described above.
But when I try to run the fednas, fedgkt, or fedavg_robust algorithms, they all fail for the same reason in the end, as shown in the screenshots below.

FedNAS:
(screenshot)
FedGKT:
(screenshot)

Client Sampling Strategy

Hi, it is indeed a GREAT work!!

However, it seems that, when tested with total client number = client number per round, FedAvg distributed's device sampling lets the local training on a client, which should be isolated, use information from other local datasets. (Theoretically the issue also persists when total client number != client number per round.)

In the design, the local trainer ID is separated from the local dataset, i.e., you need to update the dataset for each trainer at each round with a given client index before doing the local training. This can be beneficial when the total client number is large, and when total client number = client number per round the device sampling does nothing more than permute client_indexes. However, doing so may cause the issue above.

As we can see from FedML/fedml_api/distributed/fedavg/FedAvgServerManager.py:

In each communication round, client_indexes is permuted, and at line 59 we can see that the receiver ID (trainer ID) is not necessarily linked to a specific dataset (determined by client_indexes[receiver_id]), because the order of client_indexes is not the same in each round. Even though all clients start each round with the same synced weights, so the weights are invariant to the local dataset (they are identical across clients), the optimizer's history differs across clients. This dissociation between trainer and local dataset means that optimizer history accumulated on one dataset is applied to another. The result is that each local client ends up with partial training information about the global dataset, unfairly favoring the results.

This can be verified by setting client_indexes to a fixed list in the case total client number = client number per round, which yields significantly worse results than permuting it. Theoretically the performance should be the same in that case, since the participating clients are the same in every round.

In realistic settings, sharing this optimizer history would incur significant data-traffic overhead (doubling the traffic volume).

image
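One way to rule out the leakage described above in an experiment is to make sure no optimizer state survives across rounds: since a trainer process may be assigned a different client index in each round, the optimizer should be rebuilt from scratch at the start of every local update. The sketch below is illustrative only, assuming a plain PyTorch local-training loop rather than FedML's actual trainer classes.

import torch

def local_update(model, global_weights, train_loader, args, device):
    # Start from the freshly synced global weights ...
    model.load_state_dict(global_weights)
    model.to(device).train()
    # ... and from a brand-new optimizer, so no momentum/Adam state
    # accumulated on another client's dataset leaks into this update.
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(args.epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model.cpu().state_dict()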

Nonsensical if statements

Out of curiosity, I ran a static code analysis over the code. It found three nonsensical if statements: sonarcloud.

Maybe SonarCloud should be incorporated into the CI process, so we are notified when bugs like these occur.

Network error from wandb

I want to use a customized model on a customized dataset. FedAvg works well with the ResNet56 model on our customized dataset, but ResNet56 is too big for our small dataset, so I tried the small model 'CNN_DropOut' in 'model/cv/cnn.py'. The problem is the following:
image

The model 'CNN_DropOut' works well in test_cnn.py, as shown below:
image

  1. Why did the network error "ConnectTimeout" happen? The GPUs are available. (See the sketch after this list.)
  2. Is the input shape of CNN_DropOut (batch_size, channels_num, img_width, img_height)? E.g., (32, 3, 28, 28) for the mnist dataset.
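Regarding question 1, the ConnectTimeout comes from wandb trying to reach its servers, not from the GPUs. If the machine has restricted network access, one workaround is to run wandb in offline mode and sync the run later; a minimal sketch, assuming a reasonably recent wandb version, looks like this:

import os
import wandb

# Either set the environment variable before any wandb.init() call ...
os.environ["WANDB_MODE"] = "offline"

# ... or pass the mode explicitly (supported in newer wandb releases).
wandb.init(project="fedml", name="fedavg-cnn-dropout-offline", mode="offline")

# Training proceeds as usual; logs are stored locally and can be uploaded
# later with `wandb sync <run-directory>` from a machine with network access.
wandb.log({"round": 0, "train_acc": 0.0})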

great job

Hi, glad to see this great repo.
Btw, could it be used for recommendation? If so, I will give it a try.
Thanks

comm_round cannot be more than 200?

I used FedML-Server/client_simulator/mobile_client_simulator.py to run two clients,
and FedML-Server/executor/app.py to run a server on the same PC.
Even though I set the argument "comm_round" to 250, the training gets stuck at round 200.
But if I set "comm_round" to 50, the training finishes at round 50.
Pictures:
(client screenshot)
(server screenshot)

Error on a new machine when running distributed/fedavg

I thought FedML could be used easily on another machine simply by cloning FedML without modifications, but the following errors occur:

Computer configuration:

  1. 4 * RTX 3090, CUDA 11.1
  2. PyTorch 1.7
  3. Environment set up according to CI-install.sh

Run the command:
sh run_fedavg_distributed_pytorch.sh 4 4 1 4 cnn homo 2 1 32 0.0001 digit5 "./../../../data/Digit5" 0
There are 4 clients and 4 workers.

The errors are as follows (Fig. 1: the program's warning; Fig. 2: the 4 clients on the GPUs may be wrong, since the same process appears on all GPUs):

image

image
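If the same process shows up on every GPU, each MPI worker is probably defaulting to cuda:0 (or to all visible devices) instead of being pinned to its own card. A common pattern, sketched below as an illustration rather than FedML's actual device-mapping code, is to derive the device from the process rank:

import torch

def map_process_to_gpu(process_rank: int) -> torch.device:
    # Pin each worker to one GPU, wrapping around if there are
    # more processes than cards (e.g. 5 processes on 4 * RTX 3090).
    if not torch.cuda.is_available():
        return torch.device("cpu")
    gpu_id = process_rank % torch.cuda.device_count()
    torch.cuda.set_device(gpu_id)
    return torch.device(f"cuda:{gpu_id}")

# Usage inside each worker: device = map_process_to_gpu(rank); model.to(device)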

Training loss becomes NaN in the Shakespeare experiment under the distributed environment

Code:
I just inserted one line of code in the file FedML/fedml_experiments/distributed/fedavg/main_fedavg.py, as shown in Fig. 1 below; everything else is based on the latest origin/master.
image

Cmd:
sh run_fedavg_distributed_pytorch.sh 10 10 1 8 rnn hetero 100 10 10 0.8 shakespeare "./../../../data/shakespeare" 0

Result:
image

wandb:
image

Question:
From wandb's screenshot, we can also see that train/acc and test/acc are not very high.
Is this normal? Do you have any advice on how to solve this?
Thanks for your attention.
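As a side note, RNN language models such as the Shakespeare model are prone to exploding gradients, especially if the 0.8 in the command above is the learning rate, so NaN losses are not surprising. Two common mitigations are lowering the learning rate and clipping gradients; the snippet below is a generic illustration of gradient clipping in a local training step, not a patch to FedML's trainer.

import torch

def train_step(model, batch, optimizer, criterion, max_grad_norm=5.0):
    x, y = batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip the global gradient norm so a single bad batch cannot blow up the RNN weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()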

How to install FedML standalone simulation?

I want to install FedML in standalone-simulation mode, but the docs only show the installation for FedML distributed computing. How do I install the standalone simulation?

[Shell script] how to handle more than 10 parameters in shell

It seems that there is a bug in fedml_experiments/standalone/fedavg/run_fedavg_standalone_pytorch.sh which can cause a parameter-passing error.
This is probably because braces are needed for ${10}, ${11}, … once a positional parameter index reaches 10 in the shell.

FedNAS: Loss becomes NaN after the 1st round in search.

Hi, while searching with the DARTS-based FedNAS code provided, the loss becomes NaN and the accuracies drop to 0 after the first round. Also, when directly setting stage=train, it gives an error stating that tensors were expected on cuda:0 but given on cuda:5. I have checked, but there does not seem to be any bug in the code. What could be wrong in both of these settings? Thanks

Function combine_batches in FedML/fedml_experiments/standalone/fedavg/main_fedavg.py seems to be inefficient

def combine_batches(batches):
    full_x = torch.from_numpy(np.asarray([])).float()
    full_y = torch.from_numpy(np.asarray([])).long()
    for (batched_x, batched_y) in batches:
        full_x = torch.cat((full_x, batched_x), 0)
        full_y = torch.cat((full_y, batched_y), 0)
    return [(full_x, full_y)]

I tried combining the batches to speed up the training process on local clients, but it takes about one hour just to combine the batches. Can you fix this problem? (A possible rewrite is sketched below.)
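The slowdown comes from calling torch.cat inside the loop, which reallocates and copies the accumulated tensors on every iteration (quadratic in the total number of samples). Collecting the batches first and concatenating once is linear; a possible rewrite, assuming the batches are (x, y) tensor pairs as in the original, is:

import torch

def combine_batches(batches):
    # Gather all mini-batches first, then concatenate once,
    # instead of re-copying the growing tensor on every iteration.
    xs, ys = zip(*batches)
    full_x = torch.cat(xs, dim=0).float()
    full_y = torch.cat(ys, dim=0).long()
    return [(full_x, full_y)]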

Missing partitioned CIFAR files

I couldn't find distribution.txt and net_dataidx_map.txt. Are these two files necessary? Or did I misunderstand the code?

def read_net_dataidx_map(filename='./data_preprocessing/non-iid-distribution/CIFAR10/net_dataidx_map.txt'):

def read_data_distribution(filename='./data_preprocessing/non-iid-distribution/CIFAR10/distribution.txt'):

Thanks for taking the time to help me!

SplitNN distributed computing

I ran the following script, and it seems the program gets stuck. Please help fix this issue.

fedml_experiments/distributed/split_nn/main_split_nn.py

/home/chaoyanghe/anaconda3/bin/python /home/chaoyanghe/sourcecode/fedml.ai/fedml_experiments/distributed/split_nn/main_split_nn.py
download = True
INFO:root:partition data******
Files already downloaded and verified
download = True
Files already downloaded and verified
INFO:root:N = 50000
INFO:root:traindata_cls_counts = {0: {0: 541, 1: 213, 2: 34, 3: 179, 4: 13, 5: 99, 6: 20, 7: 37, 8: 107, 9: 189}, 1: {0: 9, 1: 3, 2: 92, 3: 20, 4: 438, 5: 1582, 6: 29, 7: 1158}, 2: {0: 5, 1: 119, 2: 64, 3: 461, 4: 431, 5: 753, 6: 208, 7: 1, 8: 7, 9: 9}, 3: {0: 225, 1: 148, 2: 1435, 3: 5, 4: 27, 5: 1085, 6: 11, 7: 112, 8: 170}, 4: {0: 286, 1: 769, 2: 320, 3: 342, 4: 855, 5: 234, 6: 1, 7: 301, 8: 8, 9: 28}, 5: {0: 717, 1: 65, 2: 4, 3: 908, 4: 1301, 5: 35, 6: 34, 7: 20, 8: 786}, 6: {0: 5, 1: 16, 2: 287, 3: 94, 4: 614, 5: 32, 6: 806, 8: 286, 9: 1789}, 7: {0: 410, 1: 1046, 2: 661, 3: 1002, 4: 9}, 8: {0: 640, 1: 114, 2: 62, 3: 71, 4: 309, 5: 47, 6: 553, 7: 771, 8: 534, 9: 154}, 9: {0: 2, 1: 2, 2: 1062, 3: 61, 4: 128, 5: 289, 6: 1014, 7: 852}, 10: {0: 659, 1: 355, 2: 278, 3: 569, 4: 71, 5: 362, 6: 137, 7: 652, 8: 434}, 11: {0: 294, 1: 283, 2: 187, 3: 93, 5: 239, 6: 130, 7: 825, 8: 787, 9: 817}, 12: {0: 262, 1: 746, 2: 18, 3: 104, 4: 9, 6: 315, 7: 11, 8: 9, 9: 111}, 13: {0: 279, 1: 266, 2: 77, 3: 33, 4: 13, 5: 223, 6: 834, 7: 236, 8: 885, 9: 345}, 14: {0: 510, 1: 19, 2: 21, 3: 951, 4: 27, 5: 8, 6: 493, 7: 22, 8: 963, 9: 1465}, 15: {0: 156, 1: 836, 2: 398, 3: 107, 4: 755, 5: 12, 6: 415, 7: 2, 8: 24, 9: 93}}
download = True
Files already downloaded and verified
download = True
Files already downloaded and verified
INFO:root:train_dl_global number = 781
INFO:root:test_dl_global number = 781

Suggestion: wrapping shell parameters with {} in experiments/run_xxx.sh

Hi,

I have installed FedML on my MacOS and tried to run the standalone fedavg experiment locally.

When I input this command:

sh run_fedavg_standalone_pytorch.sh 0 2 2 4 mnist ./../../../data/mnist lr hetero 1 1 0.03 sgd 1

I got this error:

Traceback (most recent call last):
  File "./main_fedavg.py", line 258, in <module>
    trainer.train()
  File "/Users/xuyangchen/Developer/PycharmProjects/FedML/fedml_api/standalone/fedavg/fedavg_trainer.py", line 71, in train
    w, loss = client.train(w_global)
  File "/Users/xuyangchen/Developer/PycharmProjects/FedML/fedml_api/standalone/fedavg/client.py", line 76, in train
    return self.model.cpu().state_dict(), sum(epoch_loss) / len(epoch_loss)
ZeroDivisionError: division by zero

And I found that the problem occurs because the parameter for epochs is not actually captured:

INFO:root:Namespace(batch_size=4, ci=3, client_num_in_total=2, client_num_per_round=2, client_optimizer='02', comm_round=1, data_dir='./../../../data/mnist', dataset='mnist', epochs=0, frequency_of_the_test=5, gpu=0, lr=1.0, model='lr', partition_alpha=0.5, partition_method='hetero', wd=0.001)

The reason turns out to be that parameters after $9 were not wrapped with {} in fedml_experiments/standalone/fedavg/run_fedavg_standalone_pytorch.sh:

EPOCH=$10

LR=$11

OPT=$12

CI=$13

After I wrapped them, the experiment ran successfully:

EPOCH=${10}

LR=${11}

OPT=${12}

CI=${13}

Maybe it has been working well on Linux, but I still suggest that these parameters (which may also be found in other run_xxx.sh scripts) be improved.

Btw, this is really a great project for researchers, thanks to all contributors :)

Unable to run FedML Mobile

Hi, I want to do on-device federated learning on an Android phone. I followed the limited instructions but was not able to run the server and client processes. I want to run my custom model. Is this tool not ready for mobile yet? Thanks
