
Comments (10)

passerer avatar passerer commented on September 28, 2024

I personally haven't encountered this situation before, but I would be happy to assist you. Could you please provide the parameter settings you used for training with multiple GPUs and for training with a single GPU?

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs. The training parameter settings are as follows:

parser = argparse.ArgumentParser(description='Train Mixformer')
parser.add_argument('--exp_name', type=str, default='Mixformer',
                    help='experiment name')
parser.add_argument('--model', type=str, default='M', choices=['S', 'M', 'L'],
                    help='model size')
parser.add_argument('--root', type=str, default='/root/autodl-tmp',
                    help='dataset directory')
parser.add_argument('--ext', type=str, choices=['.npy', '.png'], default='.png',
                    help='image suffix. npy or png is required')
parser.add_argument('--scale', type=int, default=4,
                    help='upscale factor')
parser.add_argument('--isY', action='store_true', default=True,
                    help='evaluate on y channel, if False evaluate on RGB channels')
parser.add_argument('--save_interval', type=int, default=10)
parser.add_argument('--test_interval', type=int, default=1)
parser.add_argument('--log_interval', type=int, default=100)
parser.add_argument('--epochs', type=int, default=420,
                    help='number of epochs')
parser.add_argument('--start-epoch', default=1, type=int,
                    help='manual start epoch number')
parser.add_argument('--lr', type=float, default=1.5e-4,
                    help='learning rate')
parser.add_argument('--step_size', type=int, default=60,
                    help='learning rate decay per step_size epochs')
parser.add_argument('--max_batch_size', type=int, default=32,
                    help='maximum training batch size')
parser.add_argument('--min_batch_size', type=int, default=8,
                    help='minimum training batch size')
parser.add_argument('--gamma', type=float, default=0.5,  # float, not int, so 0.5 parses correctly
                    help='learning rate decay factor for step decay')
parser.add_argument('--cuda', action='store_true', default=True,
                    help='use cuda')
parser.add_argument('--resume', default="", type=str,
                    help='path to checkpoint')
parser.add_argument('--pretrained', default="", type=str,
                    help='path to pretrained models')
parser.add_argument('--threads', type=int, default=8,
                    help='number of threads for data loading')
parser.add_argument('--max_patch_size', type=int, default=720,
                    help='maximum hr size')
parser.add_argument('--min_patch_size', type=int, default=192,
                    help='minimum hr size')
parser.add_argument('--seed', type=int, default=2,
                    help='random seed')
parser.add_argument('--tb_logger', action='store_true', default=False,
                    help='use tb_logger')

I also have some questions about HPINet:

  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?

Looking forward to your reply! Thanks a lot!

from hpinet.

passerer avatar passerer commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs.

You mentioned that the loss is added. One possible reason could be the large magnitude of the combined loss. Have you considered averaging the loss instead of summing it?
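For illustration, here is a minimal sketch of the difference, assuming a hypothetical multi-stage model whose intermediate predictions are collected in stage_outputs (placeholder names, not HPINet's actual API). Summing the per-stage L1 terms makes the total grow with the number of stages, while averaging keeps it at the scale of a single loss:

import torch
import torch.nn as nn

l1 = nn.L1Loss()  # reduction='mean' by default: averages over pixels within each term

def combined_loss(stage_outputs, hr, reduce='mean'):
    # stage_outputs: list of intermediate SR predictions, one per stage (hypothetical)
    losses = torch.stack([l1(sr, hr) for sr in stage_outputs])
    if reduce == 'sum':
        return losses.sum()   # magnitude grows with the number of stages
    return losses.mean()      # stays at the scale of a single L1 term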

from hpinet.

passerer avatar passerer commented on September 28, 2024
  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?
  1. During inference, we use the argmax operator to perform the matching. However, since argmax is not differentiable and thus cannot propagate gradients, we replace it with a differentiable equivalent form, Gumbel-Softmax, during training (see the sketch after this list). In fact, both methods achieve the same objective.
  2. Due to the locality prior, it is theoretically beneficial to exchange information with surrounding pixels; that is why we designed the IPSA module. Additionally, we designed the GPA module to capture distant features; you can refer to the paper for more detailed reasons. When the GPA learns to match a patch with itself, it degrades into an intra-patch attention mechanism, which is equivalent to the IPSA. This deviates from the original intention of the GPA, so we force it not to match itself.
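For readers unfamiliar with these two points, here is a minimal, self-contained sketch of hard patch matching with a self-exclusion mask; the shapes and names are illustrative only, not HPINet's actual code. At inference the best match comes from argmax over the masked similarity scores, while at training time torch.nn.functional.gumbel_softmax with hard=True yields the same one-hot selection but with a straight-through gradient:

import torch
import torch.nn.functional as F

def match_patches(sim, training):
    # sim: (N, N) similarity scores between N patch embeddings (illustrative)
    N = sim.size(0)
    # self-exclusion: a patch may never select itself
    sim = sim.masked_fill(torch.eye(N, dtype=torch.bool, device=sim.device), float('-inf'))
    if training:
        # differentiable one-hot selection (straight-through Gumbel-Softmax)
        return F.gumbel_softmax(sim, tau=1.0, hard=True, dim=-1)
    # hard argmax at inference; same objective, no gradients needed
    return F.one_hot(sim.argmax(dim=-1), num_classes=N).to(sim.dtype)

Either way the result is an (N, N) one-hot matrix that picks, for each patch, the index of its best-matching partner.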

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

Thank you for your reply! I am actually very interested in this training framework, since it is quite different from other SR training strategies, so I tried to train my own SR model with it. Even though I added the loss at the end of the model as HPINet does, the loss is not stable when training with multiple GPUs.

You mentioned that the loss is added. One possible reason could be the large magnitude of the combined loss. Have you considered averaging the loss instead of summing it?

But the default setting of the L1 loss is already to average the loss.
[image: screenshot of the L1 loss definition]
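One distinction worth separating here (my reading, not stated in the thread): reduction='mean' in nn.L1Loss only averages over the pixels of a single prediction, so summing several per-stage terms still scales the total with the number of stages. A quick check:

import torch
import torch.nn as nn

l1 = nn.L1Loss()  # default reduction='mean': per-pixel average within one term
sr = torch.rand(1, 3, 64, 64)
hr = torch.rand(1, 3, 64, 64)
print(l1(sr, hr))                         # single term, about 0.33 for uniform inputs
print(sum(l1(sr, hr) for _ in range(6)))  # six summed terms, about 6x larger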

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024
  1. Why is the matching strategy for patches different between training and testing?
  2. Why is there a self-exclusion in the attention map? I guess a patch can easily match itself at test time, but the patch itself can also contribute something during training. Or do you have experiments showing that it disturbs the SR results?
  1. During inference, we use the argmax operator to perform the matching. However, since argmax is not differentiable and thus cannot propagate gradients, we replace it with a differentiable equivalent form, Gumbel-Softmax, during training. In fact, both methods achieve the same objective.
  2. Due to the locality prior, it is theoretically beneficial to exchange information with surrounding pixels; that is why we designed the IPSA module. Additionally, we designed the GPA module to capture distant features; you can refer to the paper for more detailed reasons. When the GPA learns to match a patch with itself, it degrades into an intra-patch attention mechanism, which is equivalent to the IPSA. This deviates from the original intention of the GPA, so we force it not to match itself.

I get your idea. I have also noticed that the upsample module uses 2 convolution layers when the scale factor is 4, which is different from factors 2 and 3. Is it better to design it this way? Thank you for your reply!

from hpinet.

passerer avatar passerer commented on September 28, 2024

Did you calculate the loss in the same way as here? If so, then I would recommend examining the loss step by step:

If the loss starts small and then diverges rapidly after a few steps, it could indicate that the learning rate is set too high or that there is a problem with the network architecture.
On the other hand, if the loss is already large at the first step, the reason may lie in the training framework. If so, I suggest conducting multi-GPU training without making any modifications to my model and code, and observing whether the loss behaves normally. If memory constraints are an issue, you can try training with HPINet-S.

from hpinet.

passerer avatar passerer commented on September 28, 2024

I get your idea. I have also noticed that the upsample module uses 2 convolution layers when the scale factor is 4, which is different from factors 2 and 3. Is it better to design it this way? Thank you for your reply!

In the x4 upscaling setting, using two cascaded convolutions is a common practice in SR models. I guess the primary reason for this convention is the parameter count: with a single convolution layer (conv(in_channel=c, out_channel=4*4*c)), the parameter count is double that of using two convolutions ([conv(in_channel=c, out_channel=2*2*c)] x 2).
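To make the parameter argument concrete, here is a rough sketch (my own illustration, not HPINet's code) of the two x4 heads built from 3x3 convolutions and PixelShuffle. With c input channels, the single-step head costs about 9*c*16c weights versus 2*9*c*4c for the cascaded one, i.e. roughly twice as many:

import torch.nn as nn

c = 64  # feature channels, illustrative

# single-step x4: one conv to 16*c channels, then PixelShuffle(4)
single_step = nn.Sequential(nn.Conv2d(c, 16 * c, 3, padding=1), nn.PixelShuffle(4))

# cascaded x4: two x2 stages, each a conv to 4*c channels plus PixelShuffle(2)
cascaded = nn.Sequential(
    nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),
    nn.Conv2d(c, 4 * c, 3, padding=1), nn.PixelShuffle(2),
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(single_step), count(cascaded))  # 590848 vs 295424 (with biases), about 2:1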

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

In the x4 upscaling setting, using two cascaded convolutions is a common practice in SR models. I guess the primary reason for this convention is the parameter count: with a single convolution layer (conv(in_channel=c, out_channel=4*4*c)), the parameter count is double that of using two convolutions ([conv(in_channel=c, out_channel=2*2*c)] x 2).

That sounds reasonable. Thank you!

from hpinet.

RayTan183 avatar RayTan183 commented on September 28, 2024

If the loss starts small and then diverges rapidly after a few steps, it could indicate that the learning rate is set too high or that there is a problem with the network architecture. On the other hand, if the loss is already large at the first step, the reason may lie in the training framework. If so, I suggest conducting multi-GPU training without making any modifications to my model and code, and observing whether the loss behaves normally. If memory constraints are an issue, you can try training with HPINet-S.

It is the second situation. I will check the model and training code again. Thank you for your patient reply!

from hpinet.
