Comments (21)

saumitrabg avatar saumitrabg commented on September 8, 2024 2

@glenn-jocher Thanks. To make sure there is no user error, we went back, re-downloaded the yolov5 repo, and retrained, but it is still showing the same behavior. We have been training yolov5 models for 1.5 years now (they are really great) and the process is quite simple: change the coco128.yaml file to point at the corresponding train/val datasets, pick the right image size (640), and things have worked great. Also, our coco128.yaml file was old and used a different format for the train/val datasets; we fixed that to make sure we start with a clean slate. Not much progress. We will keep inspecting for user error, though there is not much involved in training.

from yolov5.

glenn-jocher avatar glenn-jocher commented on September 8, 2024 1

@saumitrabg I would put any differences down to your implementation or user error. All YOLOv5 models are trained on COCO from scratch on each release and results improve slightly in most cases.

saumitrabg avatar saumitrabg commented on September 8, 2024 1

@glenn-jocher any clue how we can make progress? Notice how we get a higher mAP score with the older yolov5 default weights. A few things:

  1. We are running the models in production and, for now, we are stuck on the old, dated yolo version because the new yolov5 is showing a low mAP score. Is there a way to pin versions so that when we try to use a default yolov5m model, it doesn't pull in a new baseline that is not backward compatible?
  2. We are doing the exact same thing with both the new and old yolov5, and our mAP scores are much lower. Any idea how we can get unblocked?
    [image]
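On the version-pinning question above, one option is to reference a fixed release tag instead of master. A minimal sketch, assuming the standard torch.hub "owner/repo:ref" syntax (which selects a branch or tag) and that v6.0 is the tag you want to stay on:

```python
# Sketch: pin a YOLOv5 release tag when loading via torch.hub, so a
# redeploy does not silently pick up new master code as the baseline.
# The "owner/repo:ref" form is standard torch.hub syntax; verify the
# exact tag you need before relying on it.
def pinned_hub_repo(tag: str = "v6.0") -> str:
    return f"ultralytics/yolov5:{tag}"

# Usage (requires torch and network access, so not executed here):
#   import torch
#   model = torch.hub.load(pinned_hub_repo("v6.0"), "yolov5m", pretrained=True)
print(pinned_hub_repo())  # ultralytics/yolov5:v6.0
```

The same idea applies to cloning for training: `git clone -b v6.0` pins the code so a re-clone reproduces the same defaults.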

saumitrabg avatar saumitrabg commented on September 8, 2024 1

@glenn-jocher we have only tried on our datasets, since we are building custom AI models. Regardless of the number of epochs, the master branch performs worse from the get-go.

saumitrabg avatar saumitrabg commented on September 8, 2024 1

@glenn-jocher if that was your conclusion, we should not have been asked to reproduce the difference between v6.0 and latest master :-).

TimbusCalin avatar TimbusCalin commented on September 8, 2024 1

This also happened in my case. @saumitrabg, is there any solution you used to tackle the problem? Thank you.

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to get started investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance, your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

saumitrabg avatar saumitrabg commented on September 8, 2024

FWIW, I see differences in hyperparameters between my old and new versions of YOLOv5.

Old:
ubuntu@ip-172-31-18-207:~/yolov5/data/hyps$ grep lrf *
hyp.finetune.yaml:lrf: 0.12
hyp.finetune_objects365.yaml:lrf: 0.17
hyp.scratch-p6.yaml:lrf: 0.2 # final OneCycleLR learning rate (lr0 * lrf)
hyp.scratch.yaml:lrf: 0.2 # final OneCycleLR learning rate (lr0 * lrf)
ubuntu@ip-172-31-18-207:~/yolov5/data/hyps$ grep SGD *
hyp.scratch-p6.yaml:lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
hyp.scratch-p6.yaml:momentum: 0.937 # SGD momentum/Adam beta1
hyp.scratch.yaml:lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
hyp.scratch.yaml:momentum: 0.937 # SGD momentum/Adam beta1
ubuntu@ip-172-31-18-207:~/yolov5/data/hyps$

New:
ubuntu@ip-172-31-17-53:~/yolov5/data/hyps$ grep lrf *
hyp.Objects365.yaml:lrf: 0.17
hyp.VOC.yaml:lrf: 0.15135
hyp.scratch-high.yaml:lrf: 0.1 # final OneCycleLR learning rate (lr0 * lrf)
hyp.scratch-low.yaml:lrf: 0.01 # final OneCycleLR learning rate (lr0 * lrf)
hyp.scratch-med.yaml:lrf: 0.1 # final OneCycleLR learning rate (lr0 * lrf)
ubuntu@ip-172-31-17-53:~/yolov5/data/hyps$ grep SGD *
hyp.scratch-high.yaml:lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
hyp.scratch-high.yaml:momentum: 0.937 # SGD momentum/Adam beta1
hyp.scratch-low.yaml:lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
hyp.scratch-low.yaml:momentum: 0.937 # SGD momentum/Adam beta1
hyp.scratch-med.yaml:lr0: 0.01 # initial learning rate (SGD=1E-2, Adam=1E-3)
hyp.scratch-med.yaml:momentum: 0.937 # SGD momentum/Adam beta1
ubuntu@ip-172-31-17-53:~/yolov5/data/hyps$
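The grep listings above reduce to one concrete change in the scratch defaults. A minimal sketch (values copied from the output above, comparing the old hyp.scratch.yaml against the new hyp.scratch-low.yaml):

```python
# Diff the default hyperparameters surfaced by the grep listings above
# (old hyp.scratch.yaml vs new hyp.scratch-low.yaml).
old_hyp = {"lr0": 0.01, "lrf": 0.2, "momentum": 0.937}
new_hyp = {"lr0": 0.01, "lrf": 0.01, "momentum": 0.937}

changed = {k: (old_hyp[k], new_hyp[k]) for k in old_hyp if old_hyp[k] != new_hyp[k]}
print(changed)  # {'lrf': (0.2, 0.01)}
```

So lr0 and momentum are unchanged, but the final-lr fraction lrf dropped from 0.2 to 0.01, meaning the new defaults end training at a much lower learning rate.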

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg judging by your results, it seems pretty apparent that, if all else is equal, the better-performing model simply started from pretrained weights and the lower one didn't.

saumitrabg avatar saumitrabg commented on September 8, 2024

@glenn-jocher yes, the same datasets. The yolov5m model was trained on the same real datasets from a base model that was trained on a bunch of synthetic data. This is the same mAP curve with the yolov5s model on the new code as well, with the exact same datasets. However, with the older yolo code we do the same thing (train with real datasets from a base model trained on a bunch of synthetic data) and get to a much higher mAP score after 300 epochs. We would ideally like to keep adding the new incremental data to the previous model, but as the yolo code changes, that is not possible. So we are keeping versions of datasets where the 1st model gets trained on the default yolo model that comes in the repo. What else can we explore? If you look at the charts, the new models off the new yolo code don't go beyond a 0.2 mAP score even after training on the same datasets for more epochs. That didn't happen with the older code. The other thing is that the x/lr0 curve is very different with the model trained on the new yolo code: it is always a straight line now.

[image]

[image]
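On the straight-line x/lr0 observation: one possible explanation, hedged and worth verifying against the exact branches involved, is that later YOLOv5 code defaults to a linear lr schedule (with the cosine schedule behind a --cos-lr flag), while the old defaults used the cosine "one_cycle" schedule, and the default lrf also dropped from 0.2 to 0.01. A minimal sketch of the two schedules, assuming the formula shapes used in recent YOLOv5 releases:

```python
import math

# Sketch of YOLOv5's two lr schedules (forms as in recent releases;
# verify against the branch you actually run). lr(epoch) = lr0 * lf(epoch).
def cosine_lf(x, epochs=300, lrf=0.2):
    # "one_cycle" cosine factor: curves from 1.0 down to lrf
    return ((1 - math.cos(x * math.pi / epochs)) / 2) * (lrf - 1) + 1

def linear_lf(x, epochs=300, lrf=0.01):
    # linear factor: a straight line from 1.0 down to lrf
    return (1 - x / epochs) * (1.0 - lrf) + lrf

lr0 = 0.01
print(lr0 * cosine_lf(300, lrf=0.2))   # old default final lr, ~0.002
print(lr0 * linear_lf(300, lrf=0.01))  # new default final lr, ~0.0001
```

If this is what changed, the final learning rate under the new defaults is roughly 20x lower, which could plausibly show up as different convergence behavior late in training.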

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg got it. We are training multiple models (i.e. 8+ models in parallel right now) across COCO and VOC both from scratch (COCO) and from pretrained (VOC) as part of our normal R&D, and both are operating and training correctly, so I don't see any sign of training issues today.

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg the only thing I can think of is an AutoAnchor bug which was resolved last week. See #7067 and #7060.

If you could provide a fully reproducible example of what you are seeing then we could start debugging it, but lacking that there is nothing for us to do. A reproducible example would be one data.yaml with autodownload capability and two branches that you say perform very differently.

git clone https://github.com/ultralytics/yolov5 yolov5-1 -b BRANCH1
cd yolov5-1
python train.py --data DATA.yaml

cd ..
git clone https://github.com/ultralytics/yolov5 yolov5-2 -b BRANCH2
cd yolov5-2
python train.py --data DATA.yaml
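For reference, the requested "one data.yaml with autodownload capability" might look like the following sketch. Field names (train/val/nc/names/download) follow the YOLOv5 dataset-YAML convention of that era; all paths, class names, and the URL are placeholders:

```python
# Hypothetical minimal YOLOv5 dataset YAML with an autodownload URL.
# Everything below the field names is a placeholder, not a real dataset.
minimal_data_yaml = """\
path: ../datasets/custom  # dataset root (placeholder)
train: images/train       # train images, relative to path
val: images/val           # val images, relative to path
nc: 2                     # number of classes
names: ['class0', 'class1']  # class names (placeholders)
download: https://example.com/custom-dataset.zip  # autodownload URL (placeholder)
"""
print(minimal_data_yaml)
```

With a YAML like this hosted somewhere reachable, both branches can be trained with the identical `python train.py --data DATA.yaml` command shown above.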

saumitrabg avatar saumitrabg commented on September 8, 2024

@glenn-jocher sure, we will provide the debug information (2 yolo snapshots). Do you recommend a particular branch from December 2021 that we can use?
[image]

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg well, if you're saying v5.0 and master are producing different results, then:

git clone https://github.com/ultralytics/yolov5 yolov5-1 -b v5.0
cd yolov5-1
python train.py --data DATA.yaml

cd ..
git clone https://github.com/ultralytics/yolov5 yolov5-2 -b master
cd yolov5-2
python train.py --data DATA.yaml

saumitrabg avatar saumitrabg commented on September 8, 2024

@glenn-jocher We confirmed that your v6.0 branch works well, while master and even a 2-month-old branch (tests/aws) don't work with default settings. We will stay on v6.0 for now and move from the small to the medium AI weights; however, we would like to understand what you need from us to help debug this. All models are trained on medium weights, and with our data and 4x T4 GPUs it takes 50-60 hrs to train. The red line (the 1st v6.0 model) had mAP go to 0 after the 48th epoch, so we restarted the 2nd v6.0 model using the 48th epoch's best.pt as a baseline and treat them as the same continuation.

A few other things that I saw:

  1. The x/lr0 curve is always a straight line with the latest master, while all my previous successful runs (older yolov5 or v6.0) have a curve to them.
  2. I also see the NMS threshold being used during training with the latest master or tests/aws.

[image]
[image]

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg is this just on your dataset? If you train coco128.yaml to 300 epochs do you see the same performance on both branches?

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg we need to be able to reproduce this ourselves; otherwise there is nothing for us to investigate. For example, the official v6.0 and v6.1 model records are here, and you can see near-identical performance across all 10 YOLOv5 models on the COCO dataset between the two versions.

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@saumitrabg yes, it's good you've confirmed a difference, but for us to investigate we need to be able to reproduce the difference ourselves, i.e. we would need your dataset and your data.yaml so we can run your same command and then try to figure out where the differences are originating from.

It seems the differences appear in less than 10 epochs, so it shouldn't take long; we just need your dataset, or any other dataset that you see producing the same behavior.

github-actions avatar github-actions commented on September 8, 2024

👋 Hello, this issue has been automatically marked as stale because it has not had recent activity. Please note it will be closed if no further activity occurs.


Feel free to inform us of any other issues you discover or feature requests that come to mind in the future. Pull Requests (PRs) are also always welcomed!

Thank you for your contributions to YOLOv5 🚀 and Vision AI ⭐!

glenn-jocher avatar glenn-jocher commented on September 8, 2024

@TimbusCalin 👋 hi, thanks for letting us know about this possible problem with YOLOv5 🚀. We've created a few short guidelines below to help users provide what we need in order to start investigating a possible problem.

How to create a Minimal, Reproducible Example

When asking a question, people will be better able to provide help if you provide code that they can easily understand and use to reproduce the problem. This is referred to by community members as creating a minimum reproducible example. Your code that reproduces the problem should be:

  • Minimal – Use as little code as possible to produce the problem
  • Complete – Provide all parts someone else needs to reproduce the problem
  • Reproducible – Test the code you're about to provide to make sure it reproduces the problem

For Ultralytics to provide assistance, your code should also be:

  • Current – Verify that your code is up-to-date with GitHub master, and if necessary git pull or git clone a new copy to ensure your problem has not already been solved in master.
  • Unmodified – Your problem must be reproducible using official YOLOv5 code without changes. Ultralytics does not provide support for custom code ⚠️.

If you believe your problem meets all the above criteria, please close this issue and raise a new one using the 🐛 Bug Report template with a minimum reproducible example to help us better understand and diagnose your problem.

Thank you! 😃

