
Comments (10)

passerer avatar passerer commented on September 28, 2024

I personally haven't encountered this situation before, but I would be happy to assist you. Could you please provide the parameter settings you used for training with multiple GPUs and for training with a single GPU?

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs. The training parameter settings are as follows:

parser = argparse.ArgumentParser(description='Train Mixformer')
parser.add_argument('--exp_name', type=str, default='Mixformer',
                    help='experiment name')
parser.add_argument('--model', type=str, default='M', choices=['S', 'M', 'L'],
                    help='model size')
parser.add_argument('--root', type=str, default='/root/autodl-tmp',
                    help='dataset directory')
parser.add_argument('--ext', type=str, choices=['.npy', '.png'], default='.png',
                    help='image suffix. npy or png is required')
parser.add_argument('--scale', type=int, default=4,
                    help='upscale factor')
parser.add_argument('--isY', action='store_true', default=True,
                    help='evaluate on y channel, if False evaluate on RGB channels')
parser.add_argument('--save_interval', type=int, default=10)
parser.add_argument('--test_interval', type=int, default=1)
parser.add_argument('--log_interval', type=int, default=100)
parser.add_argument('--epochs', type=int, default=420,
                    help='number of epochs')
parser.add_argument('--start-epoch', default=1, type=int,
                    help='manual start epoch number')
parser.add_argument('--lr', type=float, default=1.5e-4,
                    help='learning rate')
parser.add_argument('--step_size', type=int, default=60,
                    help='learning rate decay per step_size epochs')
parser.add_argument('--max_batch_size', type=int, default=32,
                    help='maximum training batch size')
parser.add_argument('--min_batch_size', type=int, default=8,
                    help='minimum training batch size')
parser.add_argument('--gamma', type=float, default=0.5,  # float, not int, so 0.5 parses correctly
                    help='learning rate decay factor for step decay')
parser.add_argument('--cuda', action='store_true', default=True,
                    help='use cuda')
parser.add_argument('--resume', default="", type=str,
                    help='path to checkpoint')
parser.add_argument('--pretrained', default="", type=str,
                    help='path to pretrained models')
parser.add_argument('--threads', type=int, default=8,
                    help='number of threads for data loading')
parser.add_argument('--max_patch_size', type=int, default=720,
                    help='maximum hr size')
parser.add_argument('--min_patch_size', type=int, default=192,
                    help='minimum hr size')
parser.add_argument('--seed', type=int, default=2,
                    help='random seed')
parser.add_argument('--tb_logger', action='store_true', default=False,
                    help='use tb_logger')

I also have some questions about HPINet:

  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?

Looking forward to your reply! Thanks a lot!

from hpinet.

passerer avatar passerer commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs.

You mentioned that the loss is added. One possible reason could be the large magnitude of the combined loss. Have you considered averaging the loss instead of summing it?
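For illustration, here is a minimal sketch of the difference, assuming a hypothetical multi-stage model whose intermediate predictions are collected in stage_outputs (placeholder names, not HPINet's actual API). Summing the per-stage L1 terms makes the total grow with the number of stages, while averaging keeps it at the scale of a single loss:

import torch
import torch.nn as nn

l1 = nn.L1Loss()  # reduction='mean' by default: averages over pixels within each term

def combined_loss(stage_outputs, hr, reduce='mean'):
    # stage_outputs: list of intermediate SR predictions, one per stage (hypothetical)
    losses = torch.stack([l1(sr, hr) for sr in stage_outputs])
    if reduce == 'sum':
        return losses.sum()   # magnitude grows with the number of stages
    return losses.mean()      # stays at the scale of a single L1 term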

from hpinet.

passerer avatar passerer commented on September 28, 2024
  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?
  1. During inference, we use the argmax operator to perform the matching. However, since argmax is not differentiable and thus cannot propagate gradients, we replace it with a differentiable equivalent form, Gumbel-Softmax, during training (see the sketch after this list). In fact, both methods achieve the same objective.
  2. Due to the locality prior, it is theoretically beneficial to exchange information with surrounding pixels; that is why we designed the IPSA module. Additionally, we designed the GPA module to capture distant features; you can refer to the paper for more detailed reasons. When the GPA learns to match a patch with itself, it degrades into an intra-patch attention mechanism, which is equivalent to the IPSA. This deviates from the original intention of the GPA, so we force it not to match itself.
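For readers unfamiliar with these two points, here is a minimal, self-contained sketch of hard patch matching with a self-exclusion mask; the shapes and names are illustrative only, not HPINet's actual code. At inference the best match comes from argmax over the masked similarity scores, while at training time torch.nn.functional.gumbel_softmax with hard=True yields the same one-hot selection but with a straight-through gradient:

import torch
import torch.nn.functional as F

def match_patches(sim, training):
    # sim: (N, N) similarity scores between N patch embeddings (illustrative)
    N = sim.size(0)
    # self-exclusion: a patch may never select itself
    sim = sim.masked_fill(torch.eye(N, dtype=torch.bool, device=sim.device), float('-inf'))
    if training:
        # differentiable one-hot selection (straight-through Gumbel-Softmax)
        return F.gumbel_softmax(sim, tau=1.0, hard=True, dim=-1)
    # hard argmax at inference; same objective, no gradients needed
    return F.one_hot(sim.argmax(dim=-1), num_classes=N).to(sim.dtype)

Either way the result is an (N, N) one-hot matrix that picks, for each patch, the index of its best-matching partner.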

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs.

You mentioned that the loss is added. One possible reason could be the large magnitude of the combined loss. Have you considered averaging the loss instead of summing it?

But the default setting of the L1 loss is already to average the loss.
[image: screenshot of the L1 loss definition]
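One distinction worth separating here (my reading, not stated in the thread): reduction='mean' in nn.L1Loss only averages over the pixels of a single prediction, so summing several per-stage terms still scales the total with the number of stages. A quick check:

import torch
import torch.nn as nn

l1 = nn.L1Loss()  # default reduction='mean': per-pixel average within one term
sr = torch.rand(1, 3, 64, 64)
hr = torch.rand(1, 3, 64, 64)
print(l1(sr, hr))                         # single term, about 0.33 for uniform inputs
print(sum(l1(sr, hr) for _ in range(6)))  # six summed terms, about 6x larger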

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024
  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?
  1. During inference, we use the argmax operator to perform the matching. However, since argmax is not differentiable and thus cannot propagate gradients, we replace it with a differentiable equivalent form, Gumbel-Softmax, during training. In fact, both methods achieve the same objective.
  2. Due to the locality prior, it is theoretically beneficial to exchange information with surrounding pixels; that is why we designed the IPSA module. Additionally, we designed the GPA module to capture distant features; you can refer to the paper for more detailed reasons. When the GPA learns to match a patch with itself, it degrades into an intra-patch attention mechanism, which is equivalent to the IPSA. This deviates from the original intention of the GPA, so we force it not to match itself.

I get your idea. I have also noticed that the upsample module uses 2 convolution layers when the scale factor is 4, which is different from factors 2 and 3. Is it better to design it this way? Thank you for your reply!

from hpinet.

passerer avatar passerer commented on September 28, 2024

Did you calculate the loss in the same way as here? If so, then I would recommend examining the loss step by step:

If the loss starts small and then diverges rapidly after a few steps, it could indicate that the learning rate is set too high or that there is a problem with the network architecture.
On the other hand, if the loss is already large at the first step, the reason may lie in the training framework. If so, I suggest conducting multi-GPU training without making any modifications to my model and code, and observing whether the loss behaves normally. If memory constraints are an issue, you can try training with HPINet-S.

from hpinet.

passerer avatar passerer commented on September 28, 2024

I get your idea. I have also noticed that the upsample module uses 2 convolution layers when the scale factor is 4, which is different from factors 2 and 3. Is it better to design it this way? Thank you for your reply!

In the x4 upscaling setting, using two cascaded convolutions is a common practice in SR models. I guess the primary reason for this convention is the parameter count: with a single convolution layer (conv(in_channel=c, out_channel=4*4*c)), the parameter count is double that of using two convolutions ([conv(in_channel=c, out_channel=2*2*c)] x 2).
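To make the parameter argument concrete, here is a rough sketch (my own illustration, not HPINet's code) of the two x4 heads built from 3x3 convolutions and PixelShuffle. With c input channels, the single-step head costs about 9*c*16c weights versus 2*9*c*4c for the cascaded one, i.e. roughly twice as many:

import torch.nn as nn

c = 64  # feature channels, illustrative

# single-step x4: one conv to 16*c channels, then PixelShuffle(4)
single_step = nn.Sequential(nn.Conv2d(c, 16 * c, 3, padding=1), nn.PixelShuffle(4))

# cascaded x4: two x2 stages, each a conv to 4*c channels plus PixelShuffle(2)
cascaded = nn.Sequential(
    nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(single_step), count(cascaded))  # 590848 vs 295424 (with biases), about 2:1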

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

In the x4 upscaling setting, using two cascaded convolutions is a common practice in SR models. I guess the primary reason for this convention is the parameter count: with a single convolution layer (conv(in_channel=c, out_channel=4*4*c)), the parameter count is double that of using two convolutions ([conv(in_channel=c, out_channel=2*2*c)] x 2).

That sounds reasonable. Thank you!

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

If the loss starts small and then diverges rapidly after a few steps, it could indicate that the learning rate is set too high or that there is a problem with the network architecture. On the other hand, if the loss is already large at the first step, the reason may lie in the training framework. If so, I suggest conducting multi-GPU training without making any modifications to my model and code, and observing whether the loss behaves normally. If memory constraints are an issue, you can try training with HPINet-S.

It is the second situation. I will check the model and training code again. Thank you for your patient reply!

from hpinet.
