Comments (1)
Hi @ebritsyn, not sure how this issue passed by without me noticing, apologies ;)
It's awesome to hear that you are using Horovod and Trains together!
Now back to your question:
Horovod sits on top of OpenMPI, and actually pipes all the stdout/stderr from all the nodes to the "main" (rank 0) node.
From your question I'm assuming this is a "manual" execution scenario (i.e. you launch the experiment training yourself). Do you have all the stdout/stderr from all the nodes on the "rank 0" experiment result page?
Did you add specific code to make sure you only create the experiment on the "rank 0" node, like #59 did?
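For reference, a rank-0 guard could look roughly like the sketch below. This is not the code from #59 (which is not reproduced here), just a minimal illustration. It assumes the job is launched through OpenMPI's mpirun, which exports OMPI_COMM_WORLD_RANK to every process; the helper names is_rank_zero and init_task_on_rank_zero are hypothetical, and the project/task names are placeholders.

```python
import os


def is_rank_zero(env=None):
    """Return True if this process is rank 0, or is not running under OpenMPI.

    Assumption: the job was started with OpenMPI's mpirun, which sets
    OMPI_COMM_WORLD_RANK for each launched process.
    """
    env = os.environ if env is None else env
    rank = env.get("OMPI_COMM_WORLD_RANK")
    if rank is None:
        # Not launched through mpirun: treat as a single-process run.
        return True
    return int(rank) == 0


def init_task_on_rank_zero(project_name, task_name):
    """Hypothetical helper: create the Trains experiment only on rank 0."""
    if not is_rank_zero():
        return None
    # Imported lazily so non-zero ranks never touch the Trains SDK.
    from trains import Task
    return Task.init(project_name=project_name, task_name=task_name)
```

In the training script you would call init_task_on_rank_zero("my_project", "my_horovod_experiment") once at startup; every other rank gets None back and skips experiment creation. If Horovod is already initialized, checking hvd.rank() == 0 serves the same purpose.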
Regarding source code & packaging, why don't you use the trains-agent to launch the code on all nodes? This way you can ensure the code & environment are exactly the same.
Basically all you need to do is (assuming you have trains.conf already on your nodes):
$ pip install trains_agent
$ trains-agent execute --full-monitoring --id <your_task_id>
Or, if you want to take the manual approach, you can use trains-agent to build the experiment environment (packages and code) on every node, then execute the code manually:
$ pip install trains_agent
$ trains-agent build --id <your_task_id> --target-folder ~/my_horovod_experiment
$ cd ~/my_horovod_experiment
$ . bin/activate
$ python ./code/my_main_script.py