Interesting scenario. It is possible, but only if you report the scalars manually (because the Tensorboard and Matplotlib reports will be automatically logged under the new experiment).
So let's assume we had experiment 1, with an experiment ID of abcdef (to get the experiment ID, click the ID icon next to the experiment name), and let's also assume it ran for 300,000 iterations.
We could do:
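from trains import Task
base_task = Task.get_task(task_id='abcdef')  # re-open the original experiment by its ID
base_task_iterations = 300000  # iterations completed by the original run
# inside the training loop, where i is the current iteration and loss is its value:
base_task.get_logger().report_scalar(title="loss", series="loss", iteration=i + base_task_iterations, value=loss)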
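To avoid repeating the `+ base_task_iterations` addition at every call site, the offset can be applied in one place. This is a minimal sketch, assuming a reporting callable that takes the keyword arguments `title`, `series`, `value` and `iteration` (the shape of `Logger.report_scalar` used above). `OffsetReporter` is a hypothetical helper, not part of Trains.

```python
from typing import Callable


class OffsetReporter:
    """Hypothetical helper: shifts every reported iteration by a fixed offset."""

    def __init__(self, report: Callable[..., None], offset: int) -> None:
        self._report = report  # e.g. base_task.get_logger().report_scalar
        self._offset = offset  # iterations already completed by the original run

    def report_scalar(self, title: str, series: str, value: float, iteration: int) -> None:
        # Forward the call with the iteration moved past the original run's last step
        self._report(title=title, series=series, value=value,
                     iteration=iteration + self._offset)
```

In the scenario above, it would be created once as `OffsetReporter(base_task.get_logger().report_scalar, base_task_iterations)`, after which the loop reports plain local iterations.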
Is there any hacky way to continue logging from Tensorboard? We often end up with multiple tasks with the same name when we want to continue training the same model from a checkpoint.
@crazyfrogspb do you need to access the previous checkpoint? Or are you asking if you can continue the iteration/step values?
Continuing the iteration/step values, for correct logging.
Hi @crazyfrogspb,
Are you using Tensorboard?
Yeah, torch.utils.tensorboard to be exact
Hi @crazyfrogspb,
If you are using torch.utils.tensorboard, then you are reporting the iteration manually, for example:
writer.add_scalar('Train/Loss', loss.data.item(), iter)
I can think of a simple solution in the form of:
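from trains import Task
# hyper-parameter holding the last iteration of the previous execution (0 for a fresh run)
cont_iteration = {'previous_iteration': 0}
Task.current_task().connect(cont_iteration)
writer.add_scalar('Train/Loss', loss.data.item(), iter + cont_iteration['previous_iteration'])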
Notice that Task.current_task().connect(cont_iteration) can be called from anywhere in your code. It adds an additional hyper-parameter named previous_iteration, which you can change, after you clone your experiment, to the last iteration of the previous execution.
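A variation on the `connect()` approach, sketched under two assumptions: the offset is passed on the command line (Trains automatically logs `argparse` arguments as hyper-parameters, so the value can likewise be edited after cloning), and the writer exposes TensorBoard's `add_scalar(tag, scalar_value, global_step)` method, as `torch.utils.tensorboard.SummaryWriter` does. The `--previous-iteration` flag and the `OffsetWriter` class are hypothetical names used only for illustration.

```python
import argparse


class OffsetWriter:
    """Hypothetical wrapper: adds a fixed offset to every global_step."""

    def __init__(self, writer, offset: int) -> None:
        self._writer = writer  # e.g. torch.utils.tensorboard.SummaryWriter()
        self._offset = offset  # last iteration of the previous execution

    def add_scalar(self, tag, scalar_value, global_step=None, **kwargs):
        # Leave global_step unset when the caller did not provide one
        step = None if global_step is None else global_step + self._offset
        self._writer.add_scalar(tag, scalar_value, global_step=step, **kwargs)

    def __getattr__(self, name):
        # Delegate everything else (flush, close, add_histogram, ...) unchanged
        return getattr(self._writer, name)


def parse_args(argv=None):
    parser = argparse.ArgumentParser()
    # Hypothetical flag: last iteration reached by the previous execution
    parser.add_argument("--previous-iteration", type=int, default=0)
    return parser.parse_args(argv)
```

The training loop then keeps calling `writer.add_scalar('Train/Loss', value, iter)` with local iterations, and only the construction line (`OffsetWriter(SummaryWriter(), args.previous_iteration)`) knows about the offset.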
That said, maybe we could introduce a new function, Logger.set_initial_iteration_step(), that you would call before you start training and that would essentially do the same thing as the code above. Of course, you would still need to somehow pass the last iteration of the previous execution. What do you think?
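One way to "somehow pass" the last iteration is to store it next to the checkpoint, so a resumed run can read it back instead of relying on a manually edited value. This is a sketch assuming checkpoints live in a local directory; `save_progress`, `load_progress` and the `progress.json` file name are hypothetical.

```python
import json
from pathlib import Path


def save_progress(checkpoint_dir: str, iteration: int) -> None:
    """Record the last completed iteration next to the model checkpoint."""
    path = Path(checkpoint_dir)
    path.mkdir(parents=True, exist_ok=True)
    (path / "progress.json").write_text(json.dumps({"last_iteration": iteration}))


def load_progress(checkpoint_dir: str) -> int:
    """Return the saved iteration, or 0 when starting from scratch."""
    progress_file = Path(checkpoint_dir) / "progress.json"
    if not progress_file.exists():
        return 0
    return int(json.loads(progress_file.read_text())["last_iteration"])
```

Calling `save_progress` whenever a checkpoint is written, and `load_progress` at startup, gives the value to use as the iteration offset.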
P.S. Apologies for the delayed reply; for some reason this issue was forgotten...
Hi @crazyfrogspb & @israelwei
We just released Trains 0.14.0, and we added Task.set_initial_iteration().
Basically, you can now make all reports of a specific experiment start from a specific iteration offset (obviously including any scalar/plot coming from Tensorboard, Matplotlib, etc.):
Task.current_task().set_initial_iteration(100000)
What do you think?
Hi @crazyfrogspb and @israelwei,
The latest Trains release can now fully support continuing previously trained models 🎉
Example (this is torch, but any framework will work here):
Experiment A (stage 1):
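import torch
from trains import Task
# output_uri points at the Trains file server, so output models are uploaded there
task = Task.init(project_name='demo', task_name='train stage1', output_uri='http://localhost:8081')
# some stuff (build and train `model`)
torch.save(model.state_dict(), 'model.pt')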
Experiment B (stage 2):
import torch
from trains import Task
task = Task.init(project_name='demo', task_name='train stage2', output_uri='http://localhost:8081')
previous_task = Task.get_task(project_name='demo', task_name='train stage1')
# continue the iteration count where stage 1 stopped
task.set_initial_iteration(previous_task.get_last_iteration())
# fetch a local copy of the last output model of stage 1
local_model = previous_task.models['output'][-1].get_local_copy()
model.load_state_dict(torch.load(local_model))  # `model` built as in stage 1
# do some stuff
torch.save(model.state_dict(), 'model2.pt')
Notice that I used output_uri and pointed it to the Trains file server. This makes sure that I automatically have a copy of all stored models on the file server. It also means that Experiment B can be executed on any machine: it will download the model from the file server and open a local copy of model.pt.
With the next Trains release, the model files will also be cached locally :)
Also notice that Experiment B will automatically have the output model of Experiment A as its own input model, so we can trace back the model evolution :)
I forgot to update: starting with Trains 0.16, you can continue a previously executed experiment 🚀
See #160 for details.