Comments (6)
Also, I have another question regarding dense & sparse gradient calculations:
For your drop & grow code, it seems the scores are derived from a normal dense backprop calculation that is multiplied by the mask after the fact. Is this masked gradient also used for the sparse parameter update?
The way I have it implemented, I use a backward hook that applies the mask to each layer's gradients before they are passed on through the backward pass (the unmasked grad matrix is copied and used to score the drops/grows). In other words, the mask is applied during the backward pass (affecting its outputs), as opposed to the "post-calculation" method you implemented.
Are you aware of this being possible, and did you deliberately choose the "post-calc" method? Do you think that may be a source of the accuracy loss I'm seeing?
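For context, a minimal PyTorch sketch (illustrative names, not from the rigl codebase) of the two orderings being discussed, assuming the forward pass already uses the masked weights: a hook that masks the weight gradient during backprop and masking the finished dense gradient afterwards produce the same result, and neither changes the error signal flowing to earlier tensors.

```python
import torch

def run(mask_in_backward: bool):
    torch.manual_seed(0)
    x = torch.randn(4, 3, requires_grad=True)
    w = torch.randn(3, 2, requires_grad=True)
    m = torch.tensor([[1., 0.], [0., 1.], [1., 1.]])  # binary mask
    if mask_in_backward:
        # Variant 1: hook masks the weight gradient inside the backward pass.
        w.register_hook(lambda g: g * m)
    loss = (x @ (w * m)).pow(2).sum()
    loss.backward()
    # Variant 2 ("post-calc"): mask the finished dense gradient instead.
    g_w = w.grad if mask_in_backward else w.grad * m
    return g_w, x.grad

gw_hook, gx_hook = run(True)
gw_post, gx_post = run(False)
assert torch.allclose(gw_hook, gw_post)  # same sparse weight gradient
assert torch.allclose(gx_hook, gx_post)  # same error signal to the inputs
```

This only shows the mechanics for a single layer; it does not by itself explain an accuracy gap, which is the open question in the thread.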
from rigl.
Hi,
I realize now we have a typo/mistake in the notation: it should be the remaining connections that we exclude, not the dropped ones. In other words, we don't want to select existing connections (L286). Thanks for noticing this; we will fix it in the paper, and sorry for the confusion.
Yes, indeed it is (1) yes, (2) no. For the second one we tried re-initialization, but it either made no difference or harmed performance (I don't remember exactly). You can try it yourself using the reinit_when_same flag (L307).
In this implementation the gradients are sparse by default, since the graph uses w*m to begin with. To obtain dense gradients we take the gradient with respect to masked_weights = w*m (see L556). What do you mean by the layer's gradient? The activation gradients/error signals that are passed to the previous layer are not masked. The weights can be masked during or after backprop; it makes no difference.
Let me know if you have other questions and feel free to reach out through e-mail, too. I can look at your code if needed. I would be happy to add your implementation to the README.md once you have it ready.
Best
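As an aside, the mechanism described above translates to a few lines of PyTorch (an illustrative sketch, not the repo's TensorFlow code): building the graph on masked_weights = w * m makes the gradient with respect to w sparse automatically, while retaining the intermediate recovers the dense gradient used to score grows.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)
w = torch.randn(3, 2, requires_grad=True)
m = (torch.rand(3, 2) > 0.5).float()   # binary mask

masked_weights = w * m
masked_weights.retain_grad()           # keep the dense intermediate gradient
loss = (x @ masked_weights).pow(2).sum()
loss.backward()

dense_grad = masked_weights.grad       # dense: usable as grow scores
# w's own gradient is sparse "by default": the chain rule through
# w * m multiplies the dense gradient elementwise by the mask.
assert torch.allclose(w.grad, dense_grad * m)
```

The same identity is why masking the weight gradient "during or after backprop" makes no difference when the forward pass already uses w*m.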
Thanks for the quick reply! I've never received a reply from an author about their work so I really appreciate your involvement :)
I think your notation was actually without error: you explicitly said the indices received from ArgTopK should not be within the set θ \ I_drop, which you further mentioned is the set of active connections that remain.
It only really got me thinking after reading Algorithm 1:
Since you do the parameter updates AFTER the grow selection, I thought there was no way you could gather your grow-selection candidates after the drop phase, because you explicitly stated to do these updates after both the drop & grow selections. There is technically no problem here, as you can update the masks before ever touching the parameters (as you did in your code), though I had to read the code to clear that up for myself.
Poking through your code some more, I was able to draw the same conclusions as well; however, the gradient question is still unclear to me. I will send you an email shortly. Thanks for being awesome!
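To make the ordering concrete, here is a hedged NumPy sketch of one drop/grow step in the spirit of Algorithm 1 (function name and details are illustrative, not the authors' code): the mask is updated first, and the weights are only touched afterwards.

```python
import numpy as np

def rigl_style_step(w, m, dense_grad, k):
    """Illustrative drop/grow step: update the mask first, then weights."""
    w, m = w.copy(), m.copy()
    active = np.flatnonzero(m)
    # Drop: k active connections with the smallest weight magnitude.
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    m.ravel()[drop] = 0.0
    # Grow: k inactive connections (after dropping) with the largest
    # dense-gradient magnitude; just-dropped connections are eligible.
    inactive = np.flatnonzero(m.ravel() == 0)
    grow = inactive[np.argsort(-np.abs(dense_grad.ravel()[inactive]))[:k]]
    m.ravel()[grow] = 1.0
    # Grown connections are zero-initialized, except that a connection
    # dropped and regrown in the same step keeps its old weight
    # (matching the behavior discussed in this thread).
    w.ravel()[np.setdiff1d(grow, drop)] = 0.0
    return w, m

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
m = np.ones((4, 4))
m[:2] = 0.0                          # half the connections inactive
g = rng.standard_normal((4, 4))
w2, m2 = rigl_style_step(w, m, g, 3)
assert m2.sum() == m.sum()           # sparsity level is preserved
```

Because the mask is finalized before any weight is modified, dropping k and growing k leaves the per-layer sparsity unchanged, which is the invariant the step relies on.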
That's true, false alarm 👍
Yes, we either do the mask update or the gradient update. Let's continue offline.
To confirm, the paper's
$i \notin \theta \setminus I_\text{drop}$ is correct? I.e., we can grow connections from the dropped set. (No TeX support in GitHub :( )
It IS correct, although slightly ambiguous. The "i cannot be within the set θ \ I_drop" notation is slightly misleading, but all it means is that the grow step cannot select connections that still exist AFTER dropping has been done.
Growing CAN be done upon connections that were just dropped; in the TensorFlow source code, if this happens the weights are left alone. The paper makes no reference to this, though, and before I read the source code I was confused about how those dropped & grown connections would be handled.
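In set terms, the condition under discussion is simply i ∉ θ \ I_drop. A tiny Python sketch (the index values are made up) showing that just-dropped connections remain eligible for growth:

```python
# Flattened connection indices (illustrative values, not from the paper).
theta = {0, 1, 2, 3}            # connections active before the drop phase
i_drop = {1, 3}                 # connections dropped this step
surviving = theta - i_drop      # theta \ I_drop: still-active connections

# Grow candidates are every i NOT in (theta \ I_drop), so the
# just-dropped indices 1 and 3 are eligible to be regrown.
candidates = {i for i in range(6) if i not in surviving}
assert candidates == {1, 3, 4, 5}
```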