🐛 Bug Hello! I've tried to train a comet model using my own da


Model outputs error right after finishing training about comet HOT 8 CLOSED

ZordoC commented on June 5, 2024

Model outputs error right after finishing training

from comet.

Comments (8)

ricardorei commented on June 5, 2024

Thanks for reporting that issue.

That problem came from updating the PyTorch Lightning version. In the older version, the on_fit_end() callback function only received 2 positional arguments. I thought I had solved that before updating the Lightning dependencies... I'll fix that today!
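For readers hitting a similar failure, here is a minimal sketch of that kind of signature mismatch, assuming the error was a TypeError raised when the framework passed more positional arguments than the hook accepted. The classes and the run_fit_end helper are hypothetical stand-ins, not the real Lightning or COMET code.

```python
# Hypothetical stand-ins: these are NOT the real Lightning or COMET classes.


class OldStyleCallback:
    """Hook written for a framework that passes only the trainer."""

    def on_fit_end(self, trainer):  # self + trainer = 2 positional arguments
        return "old hook ran"


class TolerantCallback:
    """Hook that accepts any extra arguments a newer framework may pass."""

    def on_fit_end(self, trainer, *args, **kwargs):
        return "tolerant hook ran"


def run_fit_end(callback, trainer, pl_module):
    """Mimic a newer framework that passes one extra positional argument."""
    return callback.on_fit_end(trainer, pl_module)


def try_hook(callback):
    """Call the hook and report a TypeError as a string instead of crashing."""
    try:
        return run_fit_end(callback, trainer=object(), pl_module=object())
    except TypeError as err:
        return f"TypeError: {err}"
```

Calling try_hook(OldStyleCallback()) reports a TypeError, while try_hook(TolerantCallback()) runs the hook. Matching the exact signature of the installed framework version is the stricter alternative to accepting *args and **kwargs.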


ricardorei commented on June 5, 2024

I released a version 0.0.6.post1 that solves that... tell me if it works!

Regards


ZordoC commented on June 5, 2024

Hey!

This time the model trained successfully according to the logs!

Epoch 2: 100%|██████████| 25000/25000 [1:16:41<00:00,  5.43it/s, loss=0.056, v_num=4-35, pearson=0.924, kendall=0.81, spearman=0.946, avg_loss=0.0621] 
                                                              
Training Report Experiment:
         train_loss_step  train_loss  ...  train_avg_loss  train_loss_epoch
Epoch 0         0.183138    0.183138  ...        0.099132               NaN
Epoch 1         0.006920    0.006920  ...        0.101763          0.107044
Epoch 2         0.001943    0.001943  ...        0.065580          0.067810

[3 rows x 12 columns]

All looks good, but when inspecting the experiments folder:

It seems like something is missing (the metadata from the csv)

Whenever I try to load the model:

Python 3.6.9 (default, Oct  8 2020, 12:12:24) 
[GCC 8.4.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> from comet.models import load_checkpoint
>>> model  = load_checkpoint("events.out.tfevents.1606298119.ip-172-31-41-58.27572.0")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/ubuntu/comet/lib/python3.6/site-packages/comet/models/__init__.py", line 135, in load_checkpoint
    checkpoint, hparams=hparams
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/core/saving.py", line 132, in load_from_checkpoint
    checkpoint = pl_load(checkpoint_path, map_location=lambda storage, loc: storage)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/pytorch_lightning/utilities/cloud_io.py", line 32, in load
    return torch.load(f, map_location=map_location)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/torch/serialization.py", line 529, in load
    return _legacy_load(opened_file, map_location, pickle_module, **pickle_load_args)
  File "/home/ubuntu/comet/lib/python3.6/site-packages/torch/serialization.py", line 692, in _legacy_load
    magic_number = pickle_module.load(f, **pickle_load_args)
_pickle.UnpicklingError: invalid load key, '\x18'.

I guess that's the correct way of loading the model, right? Could you provide an example if not?

Best

Jose


ricardorei commented on June 5, 2024

Actually, events.out.tfevents.1606298119.ip-172-31-41-58.27572.0 is a TensorBoard file, not the checkpoint file. The checkpoint file should end with .ckpt. From your ls, it looks like Lightning has not saved any checkpoint...
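A quick way to check this before calling load_checkpoint is to look for .ckpt files in the experiment folder. The sketch below uses only the standard library; the helper names and the example.ckpt file name are hypothetical, and it only checks the file suffix, not the file contents.

```python
import tempfile
from pathlib import Path


def find_checkpoints(experiment_dir):
    """Return all *.ckpt files under experiment_dir, sorted by path."""
    return sorted(Path(experiment_dir).rglob("*.ckpt"))


def looks_like_checkpoint(path):
    """True only for a .ckpt suffix; TensorBoard event files fail this check."""
    return Path(path).suffix == ".ckpt"


def demo():
    """Build a throwaway folder holding one event file and one checkpoint."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "events.out.tfevents.0").write_bytes(b"\x18")
        (root / "example.ckpt").write_bytes(b"")  # hypothetical file name
        return [p.name for p in find_checkpoints(root)]
```

If find_checkpoints returns an empty list for the experiment folder, no checkpoint was saved and there is nothing for load_checkpoint to load.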


ricardorei commented on June 5, 2024

I released another post-release, version 0.0.6.post2, which should fix that.

The problem was the new Lightning version, which deprecated the filepath parameter of ModelCheckpoint and changed the behaviour of the period parameter. Together, these two changes made the ModelCheckpoint callback useless.

Thanks once again! All bugs are welcome, especially now at the beginning 😃


ZordoC commented on June 5, 2024

No problem! I'll close the issue.

If there's anything I can help with, I'm interested! Maybe I could write some examples/docs on how to train a model? Would you be up for that? I've been interested in contributing to an OSS project for a while :-)

Thanks!


ricardorei commented on June 5, 2024

Yep, that would be awesome! If, for example, you write a tutorial on how to train a system, we can add that to the documentation!


ZordoC commented on June 5, 2024

Okay I will do that :-) !

Best

