
Comments (10)

bmartinn avatar bmartinn commented on July 29, 2024

Interesting scenario. It is possible, but only if you report the scalars manually (because Tensorboard and Matplotlib outputs will automatically be logged under the new experiment).

So let's assume we had experiment 1 with an experiment ID of abcdef (to get the experiment ID, click the ID icon next to the experiment name), and let's also assume it ran for 300,000 iterations.

We could do:

from trains import Task

# Reconnect to the original experiment and keep reporting against it
base_task = Task.get_task(task_id='abcdef')
base_task_iterations = 300000  # last iteration of the original run

# `i` is the current loop iteration and `loss` the value to report
base_task.get_logger().report_scalar(
    title="loss", series="loss", iteration=i + base_task_iterations, value=loss)
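Wrapped in a training loop, the offset pattern might look like the following minimal sketch. Only the `i + base_task_iterations` arithmetic comes from the snippet above; the loop body, `loader`, and `train_step` are hypothetical, and the trains calls are commented out so the sketch stays self-contained:

```python
# Offset local loop indices by the last iteration of the previous run,
# so reported scalars continue on the same x-axis.
base_task_iterations = 300000  # matches the example above

def continued_iteration(i, offset=base_task_iterations):
    """Map a local loop index to the global iteration count."""
    return i + offset

# Inside the actual training loop you would do something like:
# logger = Task.get_task(task_id='abcdef').get_logger()
# for i, batch in enumerate(loader):
#     loss = train_step(batch)  # hypothetical training step
#     logger.report_scalar(title="loss", series="loss",
#                          iteration=continued_iteration(i), value=loss)
```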


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Is there any hacky way to continue logging from Tensorboard? We often end up with multiple tasks with the same name when we want to continue training the same model from a checkpoint.


bmartinn avatar bmartinn commented on July 29, 2024

@crazyfrogspb do you need to access the previous checkpoint? Or are you asking if you can continue the iteration/step values?


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Continuing the iteration/step values, for correct logging.


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb,

Are you using Tensorboard?


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Yeah, torch.utils.tensorboard to be exact


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb,

If you are using torch.utils.tensorboard, then you are reporting the iteration manually, for example:

writer.add_scalar('Train/Loss', loss.data.item(), iter)

I can think of a simple solution in the form of:

cont_iteration = {'previous_iteration': 0}
Task.current_task().connect(cont_iteration)  # expose the offset as a hyper-parameter
writer.add_scalar('Train/Loss', loss.data.item(), iter + cont_iteration['previous_iteration'])

Notice that Task.current_task().connect(cont_iteration) can be called from anywhere in your code. It adds a hyper-parameter named previous_iteration, and after you clone your experiment you can change this parameter to the last iteration of the previous execution.
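Since the connected dict is a plain mutable mapping, the offset logic itself can be factored out and exercised without a running task. A sketch; `Task.current_task().connect` is commented out because it needs an initialized task, and the 300000 value is just an illustrative offset:

```python
# Default offset is 0; after cloning, previous_iteration is edited in the
# web UI and connect() writes the edited value back into this dict.
cont_iteration = {'previous_iteration': 0}
# Task.current_task().connect(cont_iteration)

def global_step(local_iter, params=cont_iteration):
    """Iteration value to pass to writer.add_scalar()."""
    return local_iter + params['previous_iteration']

# First run: offset is 0, steps are reported as-is.
# Cloned run: the UI sets previous_iteration (e.g. to 300000), so every
# add_scalar call lands after the original curve instead of overlapping it.
```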

That said, maybe we could introduce a new function, Logger.set_initial_iteration_step(), that you would call before training starts; it would essentially do the same thing as the code above. Of course, you would still need to pass in the last iteration of the previous run somehow. What do you think?

p.s.
Apologies for the delayed reply; for some reason this issue was forgotten...


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb & @israelwei

We just released Trains 0.14.0, and we added Task.set_initial_iteration().
You can now make all reports of a specific experiment start from a specific iteration offset (including any scalar/plot coming from Tensorboard, Matplotlib, etc.):

Task.set_initial_iteration(100000)

What do you think?


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb and @israelwei ,
The latest Trains release can now fully support continuing previously trained models 🎉
Example (this is torch, but any framework will work here):

Experiment A (stage 1):

from trains import Task
task = Task.init(project_name='demo', task_name='train stage1', output_uri='https://localhost:8081')
# some stuff
torch.save(model.state_dict(), 'model.pt')  # torch.save() takes an object and a path

Experiment B (stage 2):

from trains import Task
task = Task.init(project_name='demo', task_name='train stage2', output_uri='https://localhost:8081')
previous_task = Task.get_task(project_name='demo', task_name='train stage1')
# continue iteration counting where stage 1 stopped
task.set_initial_iteration(previous_task.get_last_iteration())
# fetch stage 1's last output model and load it
local_model = previous_task.models['output'][-1].get_local_copy()
model.load_state_dict(torch.load(local_model))
# do some stuff
torch.save(model.state_dict(), 'model2.pt')

Notice that I used output_uri and pointed it at the Trains file server. This makes sure a copy of every stored model is automatically uploaded to the file server, which also means Experiment B can be executed on any machine: it will download the model from the file server and open a local copy of model.pt.
With the next Trains release, the model files will also be cached locally :)

Also notice that Experiment B will automatically have the output model of experiment A as its own input model, so we can trace back the model evolution :)


bmartinn avatar bmartinn commented on July 29, 2024

I forgot to update: starting with Trains 0.16, you can continue a previously executed experiment 🚀
See #160 for details.

