Comments (5)
I ran FedAvg again in the background (using nohup), and the same problem occurred:
nohup sh run_fedavg_distributed_pytorch.sh 4 4 1 4 resnet56 homo 10 5 32 0.001 cifar10 "./../../../data/cifar10" > ./fedavg-resnet-homo-cifar10.txt 2>&1 &
Why do the errors always occur in the final round? GPU memory is still available, yet the final round's aggregation breaks off.
from fedml.
Hi, yikang, that's not an error. Your training finished successfully. The log is somewhat misleading. Let me run a test and adjust the logging a little.
INFO:root:#######training########### round_id = 3
INFO:root:(client 642. Local Training Epoch: 0 Loss: 2.123745
INFO:root:#######finished###########
INFO:root:sys.exit(0)
INFO:root:add_model. index = 3
INFO:root:b_all_received = False
INFO:root:add_model. index = 1
INFO:root:b_all_received = False
INFO:root:add_model. index = 0
INFO:root:b_all_received = True
INFO:root:len of self.model_dict[idx] = 4
INFO:root:aggregate time cost: 0
INFO:root:################local_test_on_all_clients : 3
INFO:root:{'training_acc': 0.33828489880643486, 'training_loss': 2.1656083437750917}
INFO:root:{'test_acc': 0.3422873422873423, 'test_loss': 2.165701602372462}
INFO:root:__finish server
INFO:root:sys.exit(0)
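To check from a saved log whether the final round really completed, a small script along these lines may help. It is only a sketch: the function name `summarize_log` is hypothetical, and it assumes the log format shown above, where the server prints `__finish server` at the end and evaluation metrics appear as Python dict literals after `INFO:root:`.

```python
import ast
import re

# Matches lines such as: INFO:root:{'test_acc': 0.34, 'test_loss': 2.16}
METRIC_LINE = re.compile(r"INFO:root:(\{.*\})\s*$")


def summarize_log(text):
    """Return (finished, metrics) for a log dump in the format shown above.

    finished: True if the server's "__finish server" marker is present.
    metrics:  list of dicts parsed from the metric lines, in order.
    """
    finished = "__finish server" in text
    metrics = []
    for line in text.splitlines():
        match = METRIC_LINE.match(line)
        if match:
            # The metric lines are plain dict literals, so literal_eval is enough.
            metrics.append(ast.literal_eval(match.group(1)))
    return finished, metrics
```

With the nohup command above, the text to pass in would be the contents of `./fedavg-resnet-homo-cifar10.txt`. If `finished` is true and the last entries in `metrics` belong to the final round, the run ended normally even though the process output looks like it was cut off.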
I checked with 4 rounds, it works.
@weiyikang Hi, I've fixed the issue you reported. It happened because we didn't call MPI_Abort() after training finished. Please update to the latest code and try again. You can test with 2 rounds and 1 local epoch. Thank you for your valuable feedback.
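For readers who want to see the idea behind this fix, here is a minimal sketch and not the actual FedML patch. The helper name `finish_training` is hypothetical. It assumes an mpi4py-style communicator whose `Abort(errorcode)` method corresponds to MPI_Abort().

```python
import logging


def finish_training(comm, errorcode=0):
    """Log completion, then abort the MPI job so every rank exits.

    `comm` is assumed to be an mpi4py-style communicator. Calling Abort
    terminates all processes in the job instead of leaving some ranks
    blocked on messages that will never arrive.
    """
    logging.info("training finished, aborting MPI job")
    comm.Abort(errorcode)


# Typical call site (assumes mpi4py is installed and the script was
# started through an MPI launcher such as mpirun):
#   from mpi4py import MPI
#   finish_training(MPI.COMM_WORLD)
```

Passing the communicator as an argument keeps the helper easy to try out with a stand-in object before running it under MPI.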