Comments (11)

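Nioolek commented on May 18, 2024

Thank you very much for reporting these issues; they are excellent. I did not account for half-precision training or for differences between PyTorch versions, and I will do my best to improve this.
I think the third problem is caused by an unsuitable lr.
https://github.com/PaddlePaddle/PaddleDetection/tree/release/2.4/configs/ppyoloe
As stated there, the author's default lr for the s model assumes training on 8 GPUs with 32 images per GPU; with fewer GPUs or a smaller batch size, the learning rate needs to be reduced. I did not handle this well before and will fix it within the next two days. Until then, I suggest trying an lr of 0.01 for the s model.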

from ppyoloe_pytorch.

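ymzis69 commented on May 18, 2024

@jimmy133719 Hello, that line is indeed never executed; it was changed only for consistency.

The learning rate below matches the setting in PaddlePaddle and also the one used in yolox_s, so I personally suggest using it. It may not affect accuracy much, though; the author may need to weigh in on this @Nioolek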
self.basic_lr_per_img = 0.01 / 64.0


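jimmy133719 commented on May 18, 2024

Hello, regarding the third problem, it still occurs after I adjusted the learning rate.
In addition, I tried changing max_epoch to 400 (with the learning rate adjusted as well) and then resuming training from the pretrained model, which runs into an optimizer mismatch.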
Screenshot from 2022-05-23 15-52-33


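Nioolek commented on May 18, 2024

Regarding the problem persisting after the learning rate was adjusted, I will try to reproduce it. Please also provide your training command and log.
The pretrained model does not contain optimizer parameters, so it cannot be used to resume training; if you want to resume from it, the code needs to be modified.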


jimmy133719 commented on May 18, 2024

Command:

python -m yolox.tools.train -f exps/ppyoloe/default/ppyoloe_s.py -d 1 -b 16 --fp16 -o -expn ppyoloe_s_sigmoid -c pretrained_weight/ppyoloe_s.pth

Log:
train_log.txt

I also tried not loading the optimizer when resuming training, and the third issue still occurs.

Thank you for your help!


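ymzis69 commented on May 18, 2024

Thanks to the author for the open-source code!
I also ran into the same three problems as @jimmy133719 in my experiments; it is quite likely a PyTorch version issue rather than a problem in the author's code.
@jimmy133719 The third issue in your code is probably caused by your fix for the second issue rather than by the learning rate setting. That fix is incomplete, because some computations need sigmoid applied to pred_scores, so I suggest modifying the code as follows.

(Later testing showed that this has a bug; the code should not be modified this way.)

In yoloe_head.py:

Line 135: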
            cls_score_list.append(cls_logit.flatten(2).permute((0, 2, 1)))
            # cls_score = F.sigmoid(cls_logit)
            # cls_score_list.append(cls_score.flatten(2).permute((0, 2, 1)))
Line 278:
            loss_cls = self.focal_loss(F.sigmoid(pred_scores), assigned_scores, alpha_l)
            # loss_cls = self.focal_loss(pred_scores, assigned_scores, alpha_l)
Line 260:
            F.sigmoid(pred_scores.detach()), 
            # pred_scores.detach(),
Line 287 (may be needed):
            loss = self.loss_weight['class'] * loss_cls + \
                   (self.loss_weight['iou'] * loss_iou).cuda() + \
                   (self.loss_weight['dfl'] * loss_dfl).cuda()
            # loss = self.loss_weight['class'] * loss_cls + \
            #         (self.loss_weight['iou'] * loss_iou).cuda() + \
            #         (self.loss_weight['dfl'] * loss_dfl).cuda()

In losses.py:

Line 135:
        weight = alpha * F.sigmoid(pred_score).pow(gamma) * (1 - label) + gt_score * label
        loss = (F.binary_cross_entropy_with_logits(pred_score, gt_score, reduction='none') * weight).sum()
        # weight = alpha * pred_score.pow(gamma) * (1 - label) + gt_score * label
        # loss = (F.binary_cross_entropy(pred_score, gt_score, reduction='none') * weight).sum()
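
The losses.py change above replaces F.binary_cross_entropy on probabilities with F.binary_cross_entropy_with_logits on raw logits. PyTorch's AMP documentation lists F.binary_cross_entropy as unsafe under autocast and recommends the with-logits variant. As a rough illustration of why, here is a plain-Python sketch (not the PyTorch kernels, which handle edge cases differently; all function names are hypothetical). The two forms agree for moderate logits, but the probability form breaks down once the sigmoid saturates, and in float16 saturation sets in at much smaller logits than in the float64 used here.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, written so that exp() never overflows."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def bce_from_prob(p: float, y: float) -> float:
    """Binary cross-entropy computed from a probability p in (0, 1)."""
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))


def bce_from_logit(x: float, y: float) -> float:
    """Binary cross-entropy computed directly from a logit x.

    Uses the standard rewrite max(x, 0) - x*y + log(1 + exp(-|x|)),
    which never takes the log of a saturated probability.
    """
    return max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))


# Moderate logit: both forms give the same loss up to rounding.
small_gap = abs(bce_from_prob(sigmoid(0.3), 0.7) - bce_from_logit(0.3, 0.7))

# Large logit: sigmoid(40.0) rounds to exactly 1.0, so the probability form
# would evaluate log(0.0), while the logit form stays finite.
stable_loss = bce_from_logit(40.0, 0.0)
```

This only illustrates the numerical motivation; it does not by itself explain the accuracy collapse reported later in the thread.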


jimmy133719 commented on May 18, 2024

@ymzis69 Thank you very much for clearing this up; it runs successfully now!
However, it seems these lines in ppyoloe_head.py are never executed?

Line 278:
            loss_cls = self.focal_loss(F.sigmoid(pred_scores), assigned_scores, alpha_l)
            # loss_cls = self.focal_loss(pred_scores, assigned_scores, alpha_l)

Also, a question for you and @Nioolek: the learning rate for ppyolo_s is currently defined in ppyolo_s.py as

self.basic_lr_per_img = 0.04 / 64.0

Looking at the code, the actual lr is multiplied back by the batch size.
If I switch to 1 GPU with batch size = 16, should this be changed to 0.04 / 16.0 or to (0.04 / 16.0) / 8, or does it not matter much?
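
The scaling described above can be sketched as follows, assuming the YOLOX-style rule lr = basic_lr_per_img * batch_size, where batch_size is the total across all GPUs. The helper name effective_lr is hypothetical, and the code only reproduces the arithmetic for the values mentioned in this thread.

```python
def effective_lr(basic_lr_per_img: float, batch_size: int) -> float:
    """Learning rate actually used, under the assumed linear scaling rule."""
    return basic_lr_per_img * batch_size


# Setting quoted in the question, with 1 GPU and a total batch size of 16.
lr_question = effective_lr(0.04 / 64.0, 16)     # about 0.01

# Setting suggested earlier in the thread, at the same batch size.
lr_suggested = effective_lr(0.01 / 64.0, 16)    # about 0.0025

# The same suggested setting at 8 GPUs x 32 images per GPU (total 256).
lr_reference = effective_lr(0.01 / 64.0, 8 * 32)  # about 0.04
```

If the rule holds as assumed, the per-image value is left unchanged when the batch size changes, because the multiplication already shrinks the lr for a smaller batch; dividing by the GPU count as well would scale it down a second time.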


Nioolek commented on May 18, 2024

Yes, my earlier learning rate setting was wrong; the latest pushed code fixes this.


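Nioolek commented on May 18, 2024

Thanks to the author for the open-source code! I also ran into the same three problems as @jimmy133719 in my experiments; it is quite likely a PyTorch version issue rather than a problem in the author's code. @jimmy133719 The third issue in your code is probably caused by your fix for the second issue rather than by the learning rate setting. That fix is incomplete, because some computations need sigmoid applied to pred_scores, so I suggest modifying the code as follows. (After testing, this should be fine.)

In yoloe_head.py:

Line 135: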
            cls_score_list.append(cls_logit.flatten(2).permute((0, 2, 1)))
            # cls_score = F.sigmoid(cls_logit)
            # cls_score_list.append(cls_score.flatten(2).permute((0, 2, 1)))
Line 278:
            loss_cls = self.focal_loss(F.sigmoid(pred_scores), assigned_scores, alpha_l)
            # loss_cls = self.focal_loss(pred_scores, assigned_scores, alpha_l)
Line 260:
            F.sigmoid(pred_scores.detach()), 
            # pred_scores.detach(),
Line 287 (may be needed):
            loss = self.loss_weight['class'] * loss_cls + \
                   (self.loss_weight['iou'] * loss_iou).cuda() + \
                   (self.loss_weight['dfl'] * loss_dfl).cuda()
            # loss = self.loss_weight['class'] * loss_cls + \
            #         (self.loss_weight['iou'] * loss_iou).cuda() + \
            #         (self.loss_weight['dfl'] * loss_dfl).cuda()

In losses.py:

Line 135:
        weight = alpha * F.sigmoid(pred_score).pow(gamma) * (1 - label) + gt_score * label
        loss = (F.binary_cross_entropy_with_logits(pred_score, gt_score, reduction='none') * weight).sum()
        # weight = alpha * pred_score.pow(gamma) * (1 - label) + gt_score * label
        # loss = (F.binary_cross_entropy(pred_score, gt_score, reduction='none') * weight).sum()

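My concern is that this applies sigmoid to the same feature map three times, which increases the computational cost. Is there a simpler way to stay compatible? I will try to reproduce and verify this.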


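ymzis69 commented on May 18, 2024

Hello @jimmy133719, modifying the code the way I described yesterday has a bug. I ran an 80-epoch finetune on another dataset: accuracy was normal for the first ten or so epochs, but then dropped to 0 as training continued.
I currently have no good way to train in half precision (fp16), so I suggest leaving the author's code unmodified and training in single precision; the results from single-precision training are completely normal.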


Nioolek commented on May 18, 2024

Half-precision training is not recommended for now.

