fedml-ai / fedml

FEDML - The unified and scalable ML library for large-scale distributed training, model serving, and federated learning. FEDML Launch, a cross-cloud scheduler, further enables running any AI job on any GPU cloud or on-premise cluster. Built on this library, FEDML Nexus AI (https://fedml.ai) is your generative AI platform at scale.

Home Page: https://fedml.ai

License: Apache License 2.0

Python 78.17% Shell 1.87% Dockerfile 0.35% PowerShell 0.01% Batchfile 0.16% Java 2.91% Jupyter Notebook 14.82% CMake 0.11% C++ 1.40% C 0.02% Smarty 0.14% Jinja 0.03%
ai-agent deep-learning distributed-training edge-ai federated-learning inference-engine machine-learning mlops model-deployment model-serving on-device-training

fedml's People

Contributors

alaydshah, amir-zsh, asce1885, bbuyukates, beiyuouo, chaoyanghe, dependabot[bot], elliebababa, emirceyani, fedml-ai-admin, fedml-alex, han-shanshan, joyerf, leigao97, mrigankraman, mzp0625, nicole456, phenomenal-manish, prosopher, raphael-jin, ray-ruisun, ryantrojans, taokz, wizard1203, wzpan, xiaoyang-wang, yanfangli1986, yvonne-fedml, zhang-tuo-pdf, zijian-hu


fedml's Issues

Errors in downloading datasets from Google Drive

I recommend that the data manager move the datasets from Google Drive to another data repository, because of the following errors.

  • 'Quota exceeded' error
    While executing 'CI-install.sh', I encountered a 'quota exceeded' error for some dataset files from Google Drive, which produced incomplete '.h5' files and caused a CI failure.
    The files may be locked for a 24-hour period before the quota is reset. [link]

  • 'Virus scan warning' error
    For the 'sh download_federatedEMNIST.sh' command in 'CI-install.sh', I encountered another type of error, which produced an incomplete 'emnist_test.h5' containing:

<title>Google Drive - Virus scan warning</title>

Google Drive can't scan this file for viruses.

emnist_test.h5 (233M) is too large for Google to scan for viruses. Would you still like to download this file?

Download anyway

© 2020 Google - Help - Privacy & Terms
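As a workaround until the datasets are rehosted, one option is to download the files with a tool that handles the virus-scan confirmation page and then check that the result is really an HDF5 file rather than saved HTML. Below is a minimal sketch assuming the gdown and h5py packages are installed; the file ID is a placeholder, not the real one from download_federatedEMNIST.sh.

import os
import gdown
import h5py

FILE_ID = "GOOGLE_DRIVE_FILE_ID"  # placeholder; in practice taken from download_federatedEMNIST.sh
OUTPUT = "emnist_test.h5"

# gdown follows the "Download anyway" confirmation page that plain wget/curl misses.
gdown.download(id=FILE_ID, output=OUTPUT, quiet=False)

# A truncated download or a saved "Virus scan warning" HTML page fails this check.
if not h5py.is_hdf5(OUTPUT):
    os.remove(OUTPUT)
    raise RuntimeError(f"{OUTPUT} is not a valid HDF5 file; the Google Drive download likely failed")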

Question about the step in wandb log

Hi, thanks for sharing this awesome project. I'm trying the standalone experiments and have a question about the CIFAR results. I consulted the online result (https://wandb.ai/automl/fedml/runs/2cxam561/), but I don't understand the total number of steps in the log. According to the plot, there are 400 steps in total, but the configuration is 100 rounds and 20 epochs. I read the code, and to my understanding the default evaluation frequency is 5, so there should be 20 steps (5 rounds per step) in the log. So why does the online log have 400 steps? Thanks!

It seems that the number of joining clients (not the number of computing clients) is set in fedml_api/data_preprocessing/**/data_loader and cannot be changed, except for the CIFAR10 dataset.

Here I mean that the total number of clients seems to be decided by the dataset, rather than by the input from run_fedavg_distributed_pytorch.sh.

Specifically, the variable client_num_in_total is changed here by the data_loader depending on the dataset. So I have two questions:

  1. How should we partition the whole dataset into client_num_in_total parts, where client_num_in_total is defined by the input?

  2. Maybe there is a better way: write APIs that split the data from the original dataset rather than creating a new dataset, which would be more convenient and flexible. Users could then download the original dataset and use the data-split API, saving disk space. (A possible data-split API is sketched after this list.)
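Along the lines of question 2, a data-split API could look roughly like the sketch below. This is not FedML's actual API; it is a hypothetical helper, assuming a generic torch Dataset, that partitions the sample indices uniformly at random into client_num_in_total subsets.

import numpy as np
from torch.utils.data import Dataset, Subset

def split_dataset_for_clients(dataset: Dataset, client_num_in_total: int, seed: int = 0):
    # Shuffle all sample indices once, then hand out one chunk per client.
    rng = np.random.default_rng(seed)
    indices = rng.permutation(len(dataset))
    chunks = np.array_split(indices, client_num_in_total)
    return [Subset(dataset, chunk.tolist()) for chunk in chunks]

# Usage: client_datasets = split_dataset_for_clients(cifar10_train, client_num_in_total=100)

A non-IID variant could replace the uniform shuffle with, for example, a Dirichlet allocation over labels, in the spirit of the existing 'hetero' partition with partition_alpha.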

fed_CIFAR100 dataset loading error

Traceback (most recent call last):
  File "./main_fedavg.py", line 163, in <module>
    dataset = load_data(args, args.dataset)
  File "./main_fedavg.py", line 109, in load_data
    class_num = load_partition_data_federated_cifar100(args.dataset, args.data_dir, batch_size=args.batch_size)
  File "/home/chaoyanghe/sourcecode/fedml.ai/fedml_api/data_preprocessing/fed_cifar100/data_loader.py", line 115, in load_partition_data_federated_cifar100
    train_data_local, test_data_local = get_dataloader(dataset, data_dir, batch_size, batch_size, client_idx)
  File "/home/chaoyanghe/sourcecode/fedml.ai/fedml_api/data_preprocessing/fed_cifar100/data_loader.py", line 54, in get_dataloader
    test_dl = data.DataLoader(dataset = test_ds, batch_size=test_bs, shuffle = True, drop_last = False)
  File "/home/chaoyanghe/miniconda/envs/fedml/lib/python3.7/site-packages/torch/utils/data/dataloader.py", line 224, in __init__
    sampler = RandomSampler(dataset, generator=generator)
  File "/home/chaoyanghe/miniconda/envs/fedml/lib/python3.7/site-packages/torch/utils/data/sampler.py", line 96, in __init__
    "value, but got num_samples={}".format(self.num_samples))
ValueError: num_samples should be a positive integer value, but got num_samples=0
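The num_samples=0 error indicates that some client's local test split is empty, so RandomSampler has nothing to draw from. One possible workaround, sketched below on the assumption that get_dataloader builds the per-client DataLoaders, is to skip the loader for an empty test set; this is only an illustration, not the maintainers' fix.

from torch.utils import data

def make_local_test_loader(test_ds, test_bs):
    # An empty per-client test set makes RandomSampler raise
    # "num_samples should be a positive integer value, but got num_samples=0".
    if len(test_ds) == 0:
        return None  # the caller should skip local evaluation for this client
    return data.DataLoader(dataset=test_ds, batch_size=test_bs, shuffle=True, drop_last=False)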

The code does not run (FedAvg algorithm)

I can SSH from the central server into the clients without a password, and the other software is all configured, but why does the server fail to recognize the clients' hostnames when I run the code on it?
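When the distributed scripts are launched through MPI, a hostname the server cannot recognize usually points at the host file or /etc/hosts rather than at FedML itself. The snippet below is a simple diagnostic sketch; the file name mpi_host_file is an assumption (use whatever file your run script passes to mpirun), and the script is not part of FedML.

import socket

# Assumed host file name; one hostname (optionally followed by a slot count) per line.
with open("mpi_host_file") as f:
    hosts = [line.split()[0] for line in f if line.strip() and not line.startswith("#")]

for host in hosts:
    try:
        print(host, "->", socket.gethostbyname(host))
    except socket.gaierror:
        print(host, "-> cannot be resolved; add it to /etc/hosts or DNS on the server")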

Processes are suspended for some unknown reason

When I run the fedgkt algorithm with the following command:
sh run_FedGKT.sh 8 cifar10 homo 10 20 1 Adam 0.001 1 0 resnet56 fedml_resnet56_homo_cifar10 "./../../../data/cifar10" 64

The processes are often suspended for some reason. I have obtained the result successfully only once.

image
The one cause I have figured out is a connection error between the process and wandb.
After solving the connection problem, there are still other potential causes.
How can I figure it out?

The arXiv uid in citation is from Horovod

Hi Chaoyang,

In the current branch, the arXiv uid (1802.05799) refers to Horovod. You may want to update it with your own.

@article{chaoyanghe2020fedml,
  Author = {Chaoyang He},
  Journal = {arXiv preprint arXiv:1802.05799},
  Title = {FedML: a Flexible and Generic Federated Learning Library and Benchmark},
  Year = {2020}
}

The same error happens when running FedAvg in standalone mode, right after cloning the repository without any modification!

When I use the commands below, which are included in readme.md,
( nohup sh run_fedavg_standalone_pytorch.sh 2 10 64 cifar10 ./../../../data/cifar10 resnet56 homo 200 20 0.001 > ./fedavg_standalone.txt 2>&1
nohup sh run_fedavg_standalone_pytorch.sh 2 10 10 mnist ./../../../data/mnist lr hetero 200 20 0.03 > ./fedavg_standalone.txt 2>&1 &)

The same error occurs:
Traceback (most recent call last):
  File "./main_fedavg.py", line 160, in <module>
    trainer = FedAvgTrainer(dataset, model, device, args)
  File "/home/xx/proj/Source/FedML/fedml_api/standalone/fedavg/fedavg_trainer.py", line 22, in __init__
    self.model_global.train()
AttributeError: 'NoneType' object has no attribute 'train'

Given my limited understanding of PyTorch, can anyone point out where in the code things go wrong?
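For what it's worth, the AttributeError means the model object passed to FedAvgTrainer is None, which typically happens when the model/dataset combination falls through every branch of the model-construction code so that nothing is returned. A defensive check like the hypothetical sketch below (the create_model name and signature are assumptions based on main_fedavg.py, not a confirmed API) would surface the real cause immediately.

# Hypothetical guard around model construction in main_fedavg.py.
model = create_model(args, model_name=args.model, output_dim=class_num)  # assumed signature
if model is None:
    raise ValueError(
        f"create_model returned None for model='{args.model}' and dataset='{args.dataset}'; "
        "this combination is probably not handled in create_model"
    )
trainer = FedAvgTrainer(dataset, model, device, args)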

What should I do if I only have a GPU node with 8 graphics cards to run the distributed algorithms?

Could anyone please tell me what I should do if I only have a single GPU node, functioning as a login node and a compute node at the same time, to run the distributed algorithms?
It works pretty well when running the fedavg algorithm under FedML's distributed architecture on the hardware described above.
But when I try to run the fednas, fedgkt, or fedavg_robust algorithms, they all fail for the same reason in the end, as shown in the screenshots below.

FedNAS:
(screenshot)
FedGKT:
(screenshot)

Client Sampling Strategy

Hi, it is indeed a GREAT work!!

However, it seems that, when tested with total client number = client number per round, FedAvg distributed's device sampling lets the local training on a client, which should be isolated, use information from other local datasets. (Theoretically the issue also persists when total client number != client number per round.)

In the design, the local trainer ID is separated from the local dataset, i.e., you need to update the dataset for each trainer at each round with a given client index before doing the local training. This can be beneficial when the total client number is large, and when total client number = client number per round the device sampling does nothing more than permute client_indexes. However, doing so may cause the issue above.

As we can see from FedML/fedml_api/distributed/fedavg/FedAvgServerManager.py:

In each communication round, client_indexes is permuted, and at line 59 we can see that the receiver ID (trainer ID) is not necessarily linked to a specific dataset (determined by client_indexes[receiver_id]), because the order of client_indexes is not the same in each round. Even though all clients start each round with the same synced weights, so the weights are invariant to the local dataset (they are identical across clients), the optimizer's history differs across clients. This dissociation between trainer and local dataset means that optimizer history accumulated on one dataset is applied to another. The result is that each local client ends up with partial training information about the global dataset, unfairly favoring the results.

This can be verified by setting client_indexes to a fixed list in the case total client number = client number per round, which yields significantly worse results than permuting it. Theoretically the performance should be the same in that case, since the participating clients are the same in every round.

In realistic settings, sharing this optimizer history would incur significant data-traffic overhead (doubling the traffic volume).

image
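One way to rule out the leakage described above in an experiment is to make sure no optimizer state survives across rounds: since a trainer process may be assigned a different client index in each round, the optimizer should be rebuilt from scratch at the start of every local update. The sketch below is illustrative only, assuming a plain PyTorch local-training loop rather than FedML's actual trainer classes.

import torch

def local_update(model, global_weights, train_loader, args, device):
    # Start from the freshly synced global weights ...
    model.load_state_dict(global_weights)
    model.to(device).train()
    # ... and from a brand-new optimizer, so no momentum/Adam state
    # accumulated on another client's dataset leaks into this update.
    optimizer = torch.optim.SGD(model.parameters(), lr=args.lr)
    criterion = torch.nn.CrossEntropyLoss()
    for _ in range(args.epochs):
        for x, y in train_loader:
            x, y = x.to(device), y.to(device)
            optimizer.zero_grad()
            loss = criterion(model(x), y)
            loss.backward()
            optimizer.step()
    return model.cpu().state_dict()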

Nonsensical if statements

Out of curiosity, I ran a static code analysis over the code. It found three nonsensical if statements: sonarcloud.

Maybe SonarCloud should be incorporated into the CI process, so we are notified when bugs like these occur.

Network error from wandb

I want to use a customized model on a customized dataset. FedAvg works well with the ResNet56 model on our customized dataset, but ResNet56 is too big for our small dataset, so I tried the small model 'CNN_DropOut' in 'model/cv/cnn.py'. The problem is the following:
image

The model 'CNN_DropOut' works well in test_cnn.py, as shown below:
image

  1. Why did the network error "ConnectTimeout" happen? The GPUs are available. (See the sketch after this list.)
  2. Is the input shape of CNN_DropOut (batch_size, channels_num, img_width, img_height)? E.g., (32, 3, 28, 28) for the mnist dataset.
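Regarding question 1, the ConnectTimeout comes from wandb trying to reach its servers, not from the GPUs. If the machine has restricted network access, one workaround is to run wandb in offline mode and sync the run later; a minimal sketch, assuming a reasonably recent wandb version, looks like this:

import os
import wandb

# Either set the environment variable before any wandb.init() call ...
os.environ["WANDB_MODE"] = "offline"

# ... or pass the mode explicitly (supported in newer wandb releases).
wandb.init(project="fedml", name="fedavg-cnn-dropout-offline", mode="offline")

# Training proceeds as usual; logs are stored locally and can be uploaded
# later with `wandb sync <run-directory>` from a machine with network access.
wandb.log({"round": 0, "train_acc": 0.0})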

great job

Hi, glad to see this great repo.
Btw, could it be used for recommendation? If so, I will give it a try.
Thanks

comm_round cannot be more than 200?

I used FedML-Server/client_simulator/mobile_client_simulator.py to run two clients,
and FedML-Server/executor/app.py to run a server on the same PC.
Even though I set the argument "comm_round" to 250, the training gets stuck at round 200.
But if I set "comm_round" to 50, the training finishes at round 50.
Pictures:
(client screenshot)
(server screenshot)

Error on a new machine when running distributed/fedavg

I thought FedML could be used easily on another machine simply by cloning FedML without modifications, but the following errors occur:

Computer configuration:

  1. 4 * RTX 3090, CUDA 11.1
  2. PyTorch 1.7
  3. Environment set up according to CI-install.sh

Run the command:
sh run_fedavg_distributed_pytorch.sh 4 4 1 4 cnn homo 2 1 32 0.0001 digit5 "./../../../data/Digit5" 0
There are 4 clients and 4 workers.

The errors are as follows (Fig. 1: the program's warning; Fig. 2: the 4 clients on the GPUs may be wrong, since the same process appears on all GPUs):

image

image
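If the same process shows up on every GPU, each MPI worker is probably defaulting to cuda:0 (or to all visible devices) instead of being pinned to its own card. A common pattern, sketched below as an illustration rather than FedML's actual device-mapping code, is to derive the device from the process rank:

import torch

def map_process_to_gpu(process_rank: int) -> torch.device:
    # Pin each worker to one GPU, wrapping around if there are
    # more processes than cards (e.g. 5 processes on 4 * RTX 3090).
    if not torch.cuda.is_available():
        return torch.device("cpu")
    gpu_id = process_rank % torch.cuda.device_count()
    torch.cuda.set_device(gpu_id)
    return torch.device(f"cuda:{gpu_id}")

# Usage inside each worker: device = map_process_to_gpu(rank); model.to(device)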

Training loss becomes NaN in the Shakespeare experiment under the distributed environment

Code:
I just inserted one line of code in the file FedML/fedml_experiments/distributed/fedavg/main_fedavg.py, as shown in Fig. 1 below; everything else is based on the latest origin/master.
image

Cmd:
sh run_fedavg_distributed_pytorch.sh 10 10 1 8 rnn hetero 100 10 10 0.8 shakespeare "./../../../data/shakespeare" 0

Result:
image

wandb:
image

Question:
From wandb's screenshot, we can also see that train/acc and test/acc are not very high.
Is this normal? Do you have any advice on how to solve this?
Thanks for your attention.
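As a side note, RNN language models such as the Shakespeare model are prone to exploding gradients, especially if the 0.8 in the command above is the learning rate, so NaN losses are not surprising. Two common mitigations are lowering the learning rate and clipping gradients; the snippet below is a generic illustration of gradient clipping in a local training step, not a patch to FedML's trainer.

import torch

def train_step(model, batch, optimizer, criterion, max_grad_norm=5.0):
    x, y = batch
    optimizer.zero_grad()
    loss = criterion(model(x), y)
    loss.backward()
    # Clip the global gradient norm so a single bad batch cannot blow up the RNN weights.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
    optimizer.step()
    return loss.item()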

How to install FedML standalone simulation?

I want to install FedML in standalone-simulation mode, but the docs only show the installation for FedML distributed computing. How do I install the standalone simulation?

[Shell script] how to handle more than 10 parameters in shell

It seems that there is a bug in fedml_experiments/standalone/fedavg/run_fedavg_standalone_pytorch.sh which can cause a parameter-passing error.
This is probably because braces are needed for ${10}, ${11}, … once a positional parameter index reaches 10 in the shell.

FedNAS: Loss becomes NaN after the 1st round in search.

Hi, while searching with the DARTS-based FedNAS code provided, the loss becomes NaN and the accuracies drop to 0 after the first round. Also, when directly setting stage=train, it gives an error stating that tensors were expected on cuda:0 but given on cuda:5. I have checked, but there does not seem to be any bug in the code. What could be wrong in both of these settings? Thanks

Function combine_batches in FedML/fedml_experiments/standalone/fedavg/main_fedavg.py seems to be inefficient

def combine_batches(batches):
    full_x = torch.from_numpy(np.asarray([])).float()
    full_y = torch.from_numpy(np.asarray([])).long()
    for (batched_x, batched_y) in batches:
        full_x = torch.cat((full_x, batched_x), 0)
        full_y = torch.cat((full_y, batched_y), 0)
    return [(full_x, full_y)]

I tried combining the batches to speed up the training process on local clients, but it takes about one hour just to combine the batches. Can you fix this problem? (A possible rewrite is sketched below.)
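The slowdown comes from calling torch.cat inside the loop, which reallocates and copies the accumulated tensors on every iteration (quadratic in the total number of samples). Collecting the batches first and concatenating once is linear; a possible rewrite, assuming the batches are (x, y) tensor pairs as in the original, is:

import torch

def combine_batches(batches):
    # Gather all mini-batches first, then concatenate once,
    # instead of re-copying the growing tensor on every iteration.
    xs, ys = zip(*batches)
    full_x = torch.cat(xs, dim=0).float()
    full_y = torch.cat(ys, dim=0).long()
    return [(full_x, full_y)]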

Missing partitioned CIFAR files

I couldn't find distribution.txt and net_dataidx_map.txt. Are these two files necessary? Or did I misunderstand the code?

def read_net_dataidx_map(filename='./data_preprocessing/non-iid-distribution/CIFAR10/net_dataidx_map.txt'):

def read_data_distribution(filename='./data_preprocessing/non-iid-distribution/CIFAR10/distribution.txt'):

Thanks for taking the time to help me!

SplitNN distributed computing

I ran the following script, and it seems the program gets stuck. Please help fix this issue.

fedml_experiments/distributed/split_nn/main_split_nn.py

/home/chaoyanghe/anaconda3/bin/python /home/chaoyanghe/sourcecode/fedml.ai/fedml_experiments/distributed/split_nn/main_split_nn.py
download = True
INFO:root:partition data******
Files already downloaded and verified
download = True
Files already downloaded and verified
INFO:root:N = 50000
INFO:root:traindata_cls_counts = {0: {0: 541, 1: 213, 2: 34, 3: 179, 4: 13, 5: 99, 6: 20, 7: 37, 8: 107, 9: 189}, 1: {0: 9, 1: 3, 2: 92, 3: 20, 4: 438, 5: 1582, 6: 29, 7: 1158}, 2: {0: 5, 1: 119, 2: 64, 3: 461, 4: 431, 5: 753, 6: 208, 7: 1, 8: 7, 9: 9}, 3: {0: 225, 1: 148, 2: 1435, 3: 5, 4: 27, 5: 1085, 6: 11, 7: 112, 8: 170}, 4: {0: 286, 1: 769, 2: 320, 3: 342, 4: 855, 5: 234, 6: 1, 7: 301, 8: 8, 9: 28}, 5: {0: 717, 1: 65, 2: 4, 3: 908, 4: 1301, 5: 35, 6: 34, 7: 20, 8: 786}, 6: {0: 5, 1: 16, 2: 287, 3: 94, 4: 614, 5: 32, 6: 806, 8: 286, 9: 1789}, 7: {0: 410, 1: 1046, 2: 661, 3: 1002, 4: 9}, 8: {0: 640, 1: 114, 2: 62, 3: 71, 4: 309, 5: 47, 6: 553, 7: 771, 8: 534, 9: 154}, 9: {0: 2, 1: 2, 2: 1062, 3: 61, 4: 128, 5: 289, 6: 1014, 7: 852}, 10: {0: 659, 1: 355, 2: 278, 3: 569, 4: 71, 5: 362, 6: 137, 7: 652, 8: 434}, 11: {0: 294, 1: 283, 2: 187, 3: 93, 5: 239, 6: 130, 7: 825, 8: 787, 9: 817}, 12: {0: 262, 1: 746, 2: 18, 3: 104, 4: 9, 6: 315, 7: 11, 8: 9, 9: 111}, 13: {0: 279, 1: 266, 2: 77, 3: 33, 4: 13, 5: 223, 6: 834, 7: 236, 8: 885, 9: 345}, 14: {0: 510, 1: 19, 2: 21, 3: 951, 4: 27, 5: 8, 6: 493, 7: 22, 8: 963, 9: 1465}, 15: {0: 156, 1: 836, 2: 398, 3: 107, 4: 755, 5: 12, 6: 415, 7: 2, 8: 24, 9: 93}}
download = True
Files already downloaded and verified
download = True
Files already downloaded and verified
INFO:root:train_dl_global number = 781
INFO:root:test_dl_global number = 781

Suggestion: wrapping shell parameters with {} in experiments/run_xxx.sh

Hi,

I have installed FedML on my MacOS and tried to run the standalone fedavg experiment locally.

When I input this command:

sh run_fedavg_standalone_pytorch.sh 0 2 2 4 mnist ./../../../data/mnist lr hetero 1 1 0.03 sgd 1

I got this error:

Traceback (most recent call last):
  File "./main_fedavg.py", line 258, in <module>
    trainer.train()
  File "/Users/xuyangchen/Developer/PycharmProjects/FedML/fedml_api/standalone/fedavg/fedavg_trainer.py", line 71, in train
    w, loss = client.train(w_global)
  File "/Users/xuyangchen/Developer/PycharmProjects/FedML/fedml_api/standalone/fedavg/client.py", line 76, in train
    return self.model.cpu().state_dict(), sum(epoch_loss) / len(epoch_loss)
ZeroDivisionError: division by zero

And I found that the problem occurs because the parameter for epochs is not actually captured:

INFO:root:Namespace(batch_size=4, ci=3, client_num_in_total=2, client_num_per_round=2, client_optimizer='02', comm_round=1, data_dir='./../../../data/mnist', dataset='mnist', epochs=0, frequency_of_the_test=5, gpu=0, lr=1.0, model='lr', partition_alpha=0.5, partition_method='hetero', wd=0.001)

The reason turns out to be that parameters after $9 were not wrapped with {} in fedml_experiments/standalone/fedavg/run_fedavg_standalone_pytorch.sh:

EPOCH=$10

LR=$11

OPT=$12

CI=$13

After I wrapped them, the experiment ran successfully:

EPOCH=${10}

LR=${11}

OPT=${12}

CI=${13}

Maybe it has been working well on Linux, but I still suggest that these parameters (which may also be found in other run_xxx.sh scripts) be improved.

Btw, this is really a great project for researchers, thanks to all contributors :)

Unable to run FedML Mobile

Hi, I want to do on-device federated learning on an Android phone. I followed the limited instructions but was not able to run the server and client processes. I want to run my custom model. Is this tool not ready for mobile yet? Thanks
