
Comments (4)

pdollar commented on July 29, 2024

Hi @FateScript, yeah, good point. We never published this; it's some math I sketched out. I'll share it below with the caveats that I haven't carefully verified its full correctness and that it may lack context (e.g., variable meanings). I sketched the math and implemented it, and it works empirically. If you find an embarrassing mistake in it, please lmk! I probably won't have time to go into more detail, but I figured I'd share in the hope that you can decode it and that it's somewhat useful :)


Momentum formulation [α=.999]
v = α · v + (1 - α) · u

Update formulation [α=.001]:
v = (1 - α) · v + α · u

Two-step update rolled into one, assuming α^2 ≈ 0 and setting u = (u0 + u1)/2:
v1 = (1 - α) · v0 + α · u0
v2 = (1 - α) · v1 + α · u1
v2 = (1 - α) · ((1 - α) · v0 + α · u0) + α · u1
v2 ≈ (1 - α) · ((1 - α) · v0) + α · u0 + α · u1   [dropping the α^2 · u0 term]
v2 ≈ (1 - 2α + α^2) · v0 + α · u0 + α · u1
v2 ≈ (1 - 2α) · v0 + 2α · u
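
A quick numerical check of the two-step approximation above, as a minimal sketch. The function names and the sample values of v0, u0, u1 are made up for illustration and are not taken from pycls.

```python
def ema_step(v, u, alpha):
    """One EMA update in the 'update formulation': v <- (1 - alpha) * v + alpha * u."""
    return (1.0 - alpha) * v + alpha * u


def two_steps(v0, u0, u1, alpha):
    """Apply two consecutive EMA updates, each with rate alpha."""
    return ema_step(ema_step(v0, u0, alpha), u1, alpha)


def merged_step(v0, u0, u1, alpha):
    """Single update with rate 2 * alpha against the mean of u0 and u1."""
    return ema_step(v0, 0.5 * (u0 + u1), 2.0 * alpha)


alpha = 1e-3
v0, u0, u1 = 1.0, 2.0, 3.0  # arbitrary illustrative values

exact = two_steps(v0, u0, u1, alpha)
approx = merged_step(v0, u0, u1, alpha)
gap = abs(exact - approx)

# The only terms dropped are of order alpha^2, so the gap is alpha^2 * |v0 - u0|.
print(f"exact={exact:.9f} approx={approx:.9f} gap={gap:.3e}")
```

The gap scales with α^2, which is why the merge is harmless for α around .001 but would not be for a large update rate.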

The same holds for n >> 1 updates, not just 2, since for small α and αn << 1 the following holds:
(1 - α)^n = 1 - αn + n(n-1)α^2/2! - n(n-1)(n-2)α^3/3! + ... [binomial expansion]
(1 - α)^n ≈ 1 - αn + (αn)^2/2! - (αn)^3/3! + ... [n >> 1]
(1 - α)^n ≈ 1 - αn [αn << 1]
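
To see how tight the first-order approximation is, here is a small sketch comparing (1 - α)^n with 1 - αn. The helper names and the chosen values of α and n are illustrative assumptions only.

```python
def decay_exact(alpha, n):
    """Weight left on the old value after n EMA updates with rate alpha."""
    return (1.0 - alpha) ** n


def decay_linear(alpha, n):
    """First-order approximation, intended for alpha * n << 1."""
    return 1.0 - alpha * n


alpha = 1e-5
errors = {}
for n in (2, 32, 256):
    errors[n] = abs(decay_exact(alpha, n) - decay_linear(alpha, n))
    print(f"n={n:4d}  alpha*n={alpha * n:.2e}  error={errors[n]:.2e}")

# When alpha * n is not small (here alpha * n = 1), the approximation breaks down.
bad_error = abs(decay_exact(1e-3, 1000) - decay_linear(1e-3, 1000))
print(f"alpha*n=1.0  error={bad_error:.2e}")
```

The error stays below (αn)^2 in the small-αn cases, in line with the second-order term of the expansion.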

Thus, to make the update independent of batch size n, we specify α* (independent of batch_size) and use α in the update step, where:
α = α* · n
This makes the EMA behavior roughly independent of the batch size n. Furthermore, it is not necessary to perform an update at every iteration. If we perform the update every k iterations, we effectively do an update after seeing n·k examples, and thus can use:
α = α* · n · k
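
One way to sanity-check the batch-size independence claim is to compare how much weight the EMA keeps on its initial value after a fixed number of training examples, under different (n, k) settings. This is a hedged sketch: `retained_weight` is a hypothetical helper, and the specific values of α*, n, and k are chosen only so the divisions come out exact.

```python
import math


def retained_weight(alpha_star, batch_size, update_period, num_examples):
    """Weight remaining on the initial EMA value after num_examples examples.

    Uses alpha = alpha_star * n * k, with one update every k iterations.
    """
    alpha = alpha_star * batch_size * update_period
    num_updates = num_examples // (batch_size * update_period)
    return (1.0 - alpha) ** num_updates


alpha_star = 1e-6
num_examples = 2 ** 20  # illustrative; divisible by both n * k settings below

w_small = retained_weight(alpha_star, batch_size=64, update_period=1, num_examples=num_examples)
w_large = retained_weight(alpha_star, batch_size=256, update_period=4, num_examples=num_examples)
w_limit = math.exp(-alpha_star * num_examples)  # limit as alpha -> 0

print(f"n=64,  k=1: {w_small:.6f}")
print(f"n=256, k=4: {w_large:.6f}")
print(f"exp(-alpha* * N): {w_limit:.6f}")
```

Both settings land close to exp(-α* · N), so the effective averaging window is measured in examples seen rather than in iterations.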

Finally, to normalize by schedule length, we set:
α = α* · n · k / m
where m = #epochs. Empirically, we find that setting α this way allows using a fairly constant α* across schedule lengths, without needing to carefully tune it for each schedule length. The logic for this step isn't exactly equivalent to the derivation above; it's more that your “history” is proportional across runs with different epoch lengths. [Note: need to make this explanation more precise.]
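
Putting the three scalings together, the per-update rate could be computed roughly as follows. This is a sketch under stated assumptions: `ema_alpha` and its parameter names are hypothetical rather than the pycls API, and the range check is an addition of this example, not something described in the comment.

```python
import math


def ema_alpha(alpha_star, batch_size, update_period=1, num_epochs=1):
    """Per-update EMA rate: alpha = alpha_star * n * k / m.

    alpha_star    -- base rate, specified independently of batch size
    batch_size    -- n, examples per iteration
    update_period -- k, iterations between EMA updates
    num_epochs    -- m, schedule length in epochs
    """
    alpha = alpha_star * batch_size * update_period / num_epochs
    # Assumption of this sketch: reject rates where the small-alpha
    # approximation clearly no longer applies.
    if not 0.0 < alpha <= 1.0:
        raise ValueError(f"alpha={alpha} is outside (0, 1]; lower alpha_star or update_period")
    return alpha


# Illustrative values only.
alpha = ema_alpha(alpha_star=1e-5, batch_size=1024, update_period=32, num_epochs=100)
print(alpha)

# Doubling the schedule length halves the per-update rate.
alpha_long = ema_alpha(alpha_star=1e-5, batch_size=1024, update_period=32, num_epochs=200)
print(alpha_long)
```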

from pycls.

pdollar commented on July 29, 2024

[image: Screen Shot 2021-11-18 at 8 59 02 AM]

formatting got lost in my previous post. here's a screenshot of my note with formatting preserved.


FateScript commented on July 29, 2024

Thanks @pdollar , I understand how the magic code works now. It's soooooo kind of you : )

BTW, I want to discuss this issue a bit more.
In my opinion, if the total number of images in the training process does not change,
#iters = #epochs * #images per epoch / batch_size
so batch_size / #epochs can be treated as k / #iters, where k is a constant that depends on your dataset and can be absorbed into alpha.
Maybe adjust = update_period / total_iters is more intuitive? WDYT?
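
The equivalence suggested here can be checked numerically. In this sketch, the two `adjust_*` functions and the dataset numbers are hypothetical, picked so that the iteration counts are whole numbers; the point is only that the two forms differ by the constant factor #images per epoch.

```python
import math


def adjust_by_examples(batch_size, update_period, num_epochs):
    """Scaling factor in the n * k / m form."""
    return batch_size * update_period / num_epochs


def adjust_by_iters(update_period, total_iters):
    """Proposed alternative: update_period / total_iters."""
    return update_period / total_iters


images_per_epoch = 1_280_000  # illustrative dataset size
update_period = 32
ratios = []
for batch_size, num_epochs in ((256, 100), (512, 50), (1024, 200)):
    total_iters = num_epochs * images_per_epoch // batch_size
    a = adjust_by_examples(batch_size, update_period, num_epochs)
    b = adjust_by_iters(update_period, total_iters)
    ratios.append(a / b)
    print(f"n={batch_size} m={num_epochs}: ratio={a / b:.1f}")
```

The ratio equals images_per_epoch in every configuration, so the constant could be folded into alpha for a fixed dataset.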


pdollar commented on July 29, 2024

Hey thanks for digging in deeper! I don't think I have time to adjust this or think more deeply, and we're already using this way of defining EMA for many models we have trained. I find it works really well, but more importantly, I wouldn't want to break backward compatibility at this stage even if the result was more intuitive! Thanks for the discussion/suggestions tho.

