
Comments (10)

bmartinn avatar bmartinn commented on July 29, 2024

Interesting scenario. It is possible, but only if you report the scalars manually (because Tensorboard and Matplotlib outputs will automatically be logged under the new experiment).

So let's assume we had experiment 1 with an experiment ID of abcdef (to get the experiment ID, click the ID icon next to the experiment name), and let's also assume it ran for 300,000 iterations.

We could do:

from trains import Task

# Reconnect to the original experiment and keep reporting against it
base_task = Task.get_task(task_id='abcdef')
base_task_iterations = 300000  # last iteration of the original run

# `i` is the current loop iteration and `loss` the value to report
base_task.get_logger().report_scalar(
    title="loss", series="loss", iteration=i + base_task_iterations, value=loss)
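Wrapped in a training loop, the offset pattern might look like the following minimal sketch. Only the `i + base_task_iterations` arithmetic comes from the snippet above; the loop body, `loader`, and `train_step` are hypothetical, and the trains calls are commented out so the sketch stays self-contained:

```python
# Offset local loop indices by the last iteration of the previous run,
# so reported scalars continue on the same x-axis.
base_task_iterations = 300000  # matches the example above

def continued_iteration(i, offset=base_task_iterations):
    """Map a local loop index to the global iteration count."""
    return i + offset

# Inside the actual training loop you would do something like:
# logger = Task.get_task(task_id='abcdef').get_logger()
# for i, batch in enumerate(loader):
#     loss = train_step(batch)  # hypothetical training step
#     logger.report_scalar(title="loss", series="loss",
#                          iteration=continued_iteration(i), value=loss)
```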


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Is there any hacky way to continue logging from Tensorboard? We often end up with multiple tasks with the same name when we want to continue training the same model from a checkpoint.


bmartinn avatar bmartinn commented on July 29, 2024

@crazyfrogspb do you need to access the previous checkpoint? Or are you asking if you can continue the iteration/step values?


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Continuing the iteration/step values, for correct logging.


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb,

Are you using Tensorboard?


crazyfrogspb avatar crazyfrogspb commented on July 29, 2024

Yeah, torch.utils.tensorboard to be exact


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb,

If you are using torch.utils.tensorboard, then you are reporting the iteration manually, for example:

writer.add_scalar('Train/Loss', loss.data.item(), iter)

I can think of a simple solution in the form of:

cont_iteration = {'previous_iteration': 0}
Task.current_task().connect(cont_iteration)  # expose the offset as a hyper-parameter
writer.add_scalar('Train/Loss', loss.data.item(), iter + cont_iteration['previous_iteration'])

Notice that Task.current_task().connect(cont_iteration) can be called from anywhere in your code. It adds a hyper-parameter named previous_iteration, and after you clone your experiment you can change this parameter to the last iteration of the previous execution.
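Since the connected dict is a plain mutable mapping, the offset logic itself can be factored out and exercised without a running task. A sketch; `Task.current_task().connect` is commented out because it needs an initialized task, and the 300000 value is just an illustrative offset:

```python
# Default offset is 0; after cloning, previous_iteration is edited in the
# web UI and connect() writes the edited value back into this dict.
cont_iteration = {'previous_iteration': 0}
# Task.current_task().connect(cont_iteration)

def global_step(local_iter, params=cont_iteration):
    """Iteration value to pass to writer.add_scalar()."""
    return local_iter + params['previous_iteration']

# First run: offset is 0, steps are reported as-is.
# Cloned run: the UI sets previous_iteration (e.g. to 300000), so every
# add_scalar call lands after the original curve instead of overlapping it.
```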

That said, maybe we could introduce a new function, Logger.set_initial_iteration_step(), that you would call before training starts; it would essentially do the same thing as the code above. Of course, you would still need to pass in the last iteration of the previous run somehow. What do you think?

p.s.
Apologies for the delayed reply; for some reason this issue was forgotten...


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb & @israelwei

We just released Trains 0.14.0, and we added Task.set_initial_iteration().
You can now make all reports of a specific experiment start from a specific iteration offset (including any scalar/plot coming from Tensorboard, Matplotlib, etc.):

Task.set_initial_iteration(100000)

What do you think?


bmartinn avatar bmartinn commented on July 29, 2024

Hi @crazyfrogspb and @israelwei ,
The latest Trains release can now fully support continuing previously trained models 🎉
Example (this is torch, but any framework will work here):

Experiment A (stage 1):

from trains import Task
task = Task.init(project_name='demo', task_name='train stage1', output_uri='https://localhost:8081')
# some stuff
torch.save(model.state_dict(), 'model.pt')  # torch.save() takes an object and a path

Experiment B (stage 2):

from trains import Task
task = Task.init(project_name='demo', task_name='train stage2', output_uri='https://localhost:8081')
previous_task = Task.get_task(project_name='demo', task_name='train stage1')
# continue iteration counting where stage 1 stopped
task.set_initial_iteration(previous_task.get_last_iteration())
# fetch stage 1's last output model and load it
local_model = previous_task.models['output'][-1].get_local_copy()
model.load_state_dict(torch.load(local_model))
# do some stuff
torch.save(model.state_dict(), 'model2.pt')

Notice that I used output_uri and pointed it at the Trains file server. This makes sure a copy of every stored model is automatically uploaded to the file server, which also means Experiment B can be executed on any machine: it will download the model from the file server and open a local copy of model.pt.
With the next Trains release, the model files will also be cached locally :)

Also notice that Experiment B will automatically have the output model of experiment A as its own input model, so we can trace back the model evolution :)


bmartinn avatar bmartinn commented on July 29, 2024

I forgot to update: starting with Trains 0.16, you can continue a previously executed experiment 🚀
See #160 for details.

