Comments (16)
I summarized the experience above and trained big-lama as follows. If I made any mistakes, please correct me.
1. Modified pytorch_lightning/trainer/connectors/checkpoint_connector.py line 106:
https://github.com/PyTorchLightning/pytorch-lightning/blob/f9f4853f3663404362c7de8614a504b0403c25b8/pytorch_lightning/trainer/connectors/checkpoint_connector.py#L106
from:

    # restore training state
    self.restore_training_state(checkpoint)

to:

    # restore training state
    try:
        self.restore_training_state(checkpoint)
    except KeyError:
        rank_zero_warn(
            "File at `resume_from_checkpoint` Trying to restore training state but checkpoint contains only the model."
        )
2. Modified lama-main/saicinpainting/training/trainers/base.py line 109 from:

    if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0:
        self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)

to:

    if self.config.losses.get("sege_pl", {"weight": 0})['weight'] > 0:
        self.loss_sege_pl = ResNetPL(**self.config.losses.sege_pl)

3. Ran:

    python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=abspath\\to\\big-lama-with-discr-remove-loss_segm_pl.ckpt
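A note on why the change in step 2 silences the "Missing key" errors (my reading, not from the authors): the big-lama config defines losses.resnet_pl but no losses.sege_pl, so the .get() fallback of {"weight": 0} means the ResNetPL module is never instantiated and its parameters are no longer expected in the state_dict - the rename effectively disables the perceptual loss rather than remapping it.

```python
# Sketch of the lookup behavior; the `losses` dict below is a hypothetical
# stand-in for self.config.losses in the big-lama config.
losses = {"resnet_pl": {"weight": 30}}

# The renamed key is absent, so the fallback weight of 0 is returned
# and the loss module is never built:
assert losses.get("sege_pl", {"weight": 0})["weight"] == 0

# The original key would have enabled it:
assert losses.get("resnet_pl", {"weight": 0})["weight"] > 0
```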
https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
model shared by @Liang-Sen
from lama.
@windj007
Thanks for your reply.
I just removed "loss_segm_pl" from the checkpoint and it worked.
Sharing the stripped checkpoint here:
https://drive.google.com/file/d/1YTiKZ1hQnKvTEbXIxFXjGg61pBAch_N7/view?usp=sharing
Could you also share the training log or time of big-lama? Thanks so much.
Is the big-lama model trained on places-challenge dataset?
Not exactly Places Challenge - it was trained on a subset of 157 categories from Places Challenge. Please refer to the supplementary material for the exact list of these categories.
Does it perform significantly better than a big-lama trained on places2-standard?
The difference is quite noticeable to the naked eye, but the improvement from standard to the challenge subset is smaller than that from the most important contributions of our paper (e.g. masks, architecture, and segm-pl).
Could you also share the training log or time of big-lama? Thanks so much.
It took approximately 12 days to train this big-lama on 8xV100 32GB with a total batch size of 120 (8 GPUs x 15 samples).
Is it possible to release the full checkpoints of the big-lama model, so we can finetune it on other data?
I've just uploaded the full checkpoint to https://disk.yandex.ru/d/wJ2Ee0f1HvasDQ (subfolder big-lama-with-discr) - unlike the other checkpoints, this one includes the discriminator and SegmPL weights.
Please share your experience with finetuning - does it help, and how much?
Thanks so much! That is super helpful!
I'll close this issue for now - feel free to reopen if you have any issues with fine-tuning.
Hello,
I am having some issues loading big-lama-with-discr for finetuning. Please correct me if I am wrong, but I notice that the SegmPL weights are stored as loss_segm_pl.impl...
in the .ckpt, while the current trainer loads them as loss_resnet_pl.impl...
https://github.com/saic-mdal/lama/blob/ede702b19b027ad2c0380419b2b71a90fe90a14f/saicinpainting/training/trainers/base.py#L110
After modifying this, I get the following error:
KeyError: 'Trying to restore training state but checkpoint contains only the model. This is probably due to `ModelCheckpoint.save_weights_only` being set to `True`.'
@yzhouas did you have any success with this? I am wondering if it is just me.
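Since the checkpoint was apparently saved weights-only, a quick way to confirm is to inspect its top-level keys before resuming. This is a hedged sketch, not code from the repo; in PyTorch Lightning 1.2.x a full checkpoint carries optimizer and scheduler state next to 'state_dict':

```python
# Check whether a Lightning checkpoint contains resumable training state.
# A checkpoint written with ModelCheckpoint(save_weights_only=True) holds
# only the model weights, which is why restore_training_state() raises KeyError.
def has_training_state(ckpt: dict) -> bool:
    return "optimizer_states" in ckpt and "lr_schedulers" in ckpt

# Usage (path is a placeholder):
#   import torch
#   ckpt = torch.load("path/to/big-lama-with-discr/best.ckpt", map_location="cpu")
#   print(sorted(ckpt.keys()), has_training_state(ckpt))
```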
Apparently, this is a known issue in PyTorch Lightning, and for the suggested PyTorch Lightning 1.2.9 the problem seems to be here:
    # restore training state
    self.restore_training_state(checkpoint)

So, a very ugly hack would be to bypass it as:

    # restore training state
    try:
        self.restore_training_state(checkpoint)
    except KeyError:
        rank_zero_warn(
            "File at `resume_from_checkpoint` Trying to restore training state but checkpoint contains only the model."
        )
Hi @affromero !
Yeah, I forgot that we changed the name of this variable after training Big-LaMa... Another possible solution is to just strip loss_segm_pl.impl...
from the checkpoint altogether - it is initialized from a fixed ade20k checkpoint anyway.
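The stripping workaround can be sketched like this (my own sketch, assuming the standard Lightning checkpoint layout with a top-level 'state_dict'; the loss_segm_pl. prefix is the one named above, and the paths are placeholders):

```python
# Drop the frozen segmentation-perceptual-loss weights from the checkpoint;
# they are re-initialized from the fixed ade20k checkpoint anyway, so the
# stripped file loads without the loss_segm_pl/loss_resnet_pl name clash.
def strip_prefix(state_dict: dict, prefix: str = "loss_segm_pl.") -> dict:
    return {k: v for k, v in state_dict.items() if not k.startswith(prefix)}

# Usage (paths are placeholders):
#   import torch
#   ckpt = torch.load("big-lama-with-discr/best.ckpt", map_location="cpu")
#   ckpt["state_dict"] = strip_prefix(ckpt["state_dict"])
#   torch.save(ckpt, "big-lama-with-discr/best-stripped.ckpt")
```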
Trying to restore training state but checkpoint contains only the model.
I have not faced this issue yet. Have you resolved it?
Hi @windj007,
I looked into the supplementary material but was not able to find which categories from Places Challenge were used for training Big-LaMa. Could you please list these categories? Also, why didn't you use the entire Places Challenge for training Big-LaMa?
Thank you
Hi @windj007 ,
I am having the same issue loading big-lama-with-discr for finetuning; please correct me if I made any mistakes.
I ran this command:
python bin/train.py -cn big-lama location=my_dataset data.batch_size=10 +trainer.kwargs.resume_from_checkpoint=path\\to\\big-lama-with-discr\\best.ckpt
and got this error message:
RuntimeError: Error(s) in loading state_dict for DefaultInpaintingTrainingModule:
Missing key(s) in state_dict: "loss_resnet_pl.impl.conv1.weight", "loss_resnet_pl......
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl....
I modified base.py line 109 from:

    if self.config.losses.get("resnet_pl", {"weight": 0})['weight'] > 0:
        self.loss_resnet_pl = ResNetPL(**self.config.losses.resnet_pl)

to:

    if self.config.losses.get("sege_pl", {"weight": 0})['weight'] > 0:
        self.loss_sege_pl = ResNetPL(**self.config.losses.sege_pl)
The Missing key error disappeared, but I still get the Unexpected key error message:
Unexpected key(s) in state_dict: "loss_segm_pl.impl.conv1.weight", "loss_segm_pl.impl....
Do you have any suggestions for this?
@marcelsan The list is there, on page 5.
why haven't you used the entire Places Challenge for training Big-Lama?
Bigger datasets need bigger models - and smaller models work better when the dataset is more focused. And Big-LaMa is not that big in terms of the number of trainable parameters.
The Missing key error disappeared, but I still get the Unexpected key error message:
The quick solution is a couple of comments above:
Another possible solution is to just strip loss_segm_pl.impl... from the checkpoint altogether - anyway it is initialized from a fixed ade20k checkpoint.
I should have fixed and re-uploaded the checkpoint, but have not found time yet...
@Liang-Sen thank you!