Comments (6)
Also, I have another question regarding dense & sparse gradient calculations:
For your drop & grow code, it seems the scores are derived from a normal dense backprop calculation that is multiplied by the mask after the fact. Is this masked gradient also used for the sparse parameter update?
The way I have it implemented, I use a backward hook that applies the mask to each layer's gradients before they are passed on through the backward pass (the unmasked grad matrix is copied and used to score the drops/grows). In other words, the mask is applied during the backward pass (affecting its outputs), as opposed to the "post-calculation" method you implemented.
Are you aware of this being possible, and did you deliberately choose the "post-calc" method? Do you think that may be a source of the accuracy loss I'm seeing?
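For context, a minimal PyTorch sketch (illustrative names, not from the rigl codebase) of the two orderings being discussed, assuming the forward pass already uses the masked weights: a hook that masks the weight gradient during backprop and masking the finished dense gradient afterwards produce the same result, and neither changes the error signal flowing to earlier tensors.

```python
import torch

def run(mask_in_backward: bool):
    torch.manual_seed(0)
    x = torch.randn(4, 3, requires_grad=True)
    w = torch.randn(3, 2, requires_grad=True)
    m = torch.tensor([[1., 0.], [0., 1.], [1., 1.]])  # binary mask
    if mask_in_backward:
        # Variant 1: hook masks the weight gradient inside the backward pass.
        w.register_hook(lambda g: g * m)
    loss = (x @ (w * m)).pow(2).sum()
    loss.backward()
    # Variant 2 ("post-calc"): mask the finished dense gradient instead.
    g_w = w.grad if mask_in_backward else w.grad * m
    return g_w, x.grad

gw_hook, gx_hook = run(True)
gw_post, gx_post = run(False)
assert torch.allclose(gw_hook, gw_post)  # same sparse weight gradient
assert torch.allclose(gx_hook, gx_post)  # same error signal to the inputs
```

This only shows the mechanics for a single layer; it does not by itself explain an accuracy gap, which is the open question in the thread.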
from rigl.
Hi,
I realize now we have a typo/mistake in the notation: it should be the remaining connections that we exclude, not the dropped ones. In other words, we don't want to select existing connections (L286). Thanks for noticing this; we will fix it in the paper, and sorry for the confusion.
Yes, indeed it is (1) yes, (2) no. For the second one we tried re-initialization, but it either made no difference or harmed performance (I don't remember exactly). You can try it yourself using the reinit_when_same flag (L307).
In this implementation the gradients are sparse by default, since the graph uses w*m to begin with. To obtain dense gradients we take the gradient with respect to masked_weights = w*m (see L556). What do you mean by the layer's gradient? The activation gradients/error signals that are passed to the previous layer are not masked. The weights can be masked during or after backprop; it makes no difference.
Let me know if you have other questions and feel free to reach out through e-mail, too. I can look at your code if needed. I would be happy to add your implementation to the README.md once you have it ready.
Best
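As an aside, the mechanism described above translates to a few lines of PyTorch (an illustrative sketch, not the repo's TensorFlow code): building the graph on masked_weights = w * m makes the gradient with respect to w sparse automatically, while retaining the intermediate recovers the dense gradient used to score grows.

```python
import torch

torch.manual_seed(0)
x = torch.randn(5, 3)
w = torch.randn(3, 2, requires_grad=True)
m = (torch.rand(3, 2) > 0.5).float()   # binary mask

masked_weights = w * m
masked_weights.retain_grad()           # keep the dense intermediate gradient
loss = (x @ masked_weights).pow(2).sum()
loss.backward()

dense_grad = masked_weights.grad       # dense: usable as grow scores
# w's own gradient is sparse "by default": the chain rule through
# w * m multiplies the dense gradient elementwise by the mask.
assert torch.allclose(w.grad, dense_grad * m)
```

The same identity is why masking the weight gradient "during or after backprop" makes no difference when the forward pass already uses w*m.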
Thanks for the quick reply! I've never received a reply from an author about their work so I really appreciate your involvement :)
I think your notation was actually without error: you explicitly said the indices received from ArgTopK should not be within the set θ \ I_drop, which you further mentioned is the set of active connections that remain.
It only really got me thinking after reading Algorithm 1:
Since you do the parameter updates AFTER the grow selection, I thought there was no way you could gather your grow-selection candidates after the drop phase, because you explicitly stated to do these updates after both the drop & grow selections. There is technically no problem here, as you can update the masks before ever touching the parameters (as you did in your code), though I had to read the code to clear that up for myself.
Poking through your code some more, I was able to draw the same conclusions as well; however, the gradient question is still unclear to me. I will send you an email shortly. Thanks for being awesome!
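To make the ordering concrete, here is a hedged NumPy sketch of one drop/grow step in the spirit of Algorithm 1 (function name and details are illustrative, not the authors' code): the mask is updated first, and the weights are only touched afterwards.

```python
import numpy as np

def rigl_style_step(w, m, dense_grad, k):
    """Illustrative drop/grow step: update the mask first, then weights."""
    w, m = w.copy(), m.copy()
    active = np.flatnonzero(m)
    # Drop: k active connections with the smallest weight magnitude.
    drop = active[np.argsort(np.abs(w.ravel()[active]))[:k]]
    m.ravel()[drop] = 0.0
    # Grow: k inactive connections (after dropping) with the largest
    # dense-gradient magnitude; just-dropped connections are eligible.
    inactive = np.flatnonzero(m.ravel() == 0)
    grow = inactive[np.argsort(-np.abs(dense_grad.ravel()[inactive]))[:k]]
    m.ravel()[grow] = 1.0
    # Grown connections are zero-initialized, except that a connection
    # dropped and regrown in the same step keeps its old weight
    # (matching the behavior discussed in this thread).
    w.ravel()[np.setdiff1d(grow, drop)] = 0.0
    return w, m

rng = np.random.default_rng(1)
w = rng.standard_normal((4, 4))
m = np.ones((4, 4))
m[:2] = 0.0                          # half the connections inactive
g = rng.standard_normal((4, 4))
w2, m2 = rigl_style_step(w, m, g, 3)
assert m2.sum() == m.sum()           # sparsity level is preserved
```

Because the mask is finalized before any weight is modified, dropping k and growing k leaves the per-layer sparsity unchanged, which is the invariant the step relies on.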
That's true, false alarm 👍
Yes, we either do the mask update or the gradient update. Let's continue offline.
To confirm, the paper's
$i \notin \theta \setminus I_\text{drop}$ is correct? I.e., we can grow connections from the dropped set. (No TeX support in GitHub :( )
It IS correct, although slightly ambiguous. The "i cannot be within the set θ \ I_drop" notation is slightly misleading, but all it means is that the grow step cannot select connections that still exist AFTER dropping has been done.
Growing CAN be done upon connections that were just dropped; in the TensorFlow source code, if this happens the weights are left alone. The paper makes no reference to this, though, and before I read the source code I was confused about how those dropped & grown connections would be handled.
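In set terms, the condition under discussion is simply i ∉ θ \ I_drop. A tiny Python sketch (the index values are made up) showing that just-dropped connections remain eligible for growth:

```python
# Flattened connection indices (illustrative values, not from the paper).
theta = {0, 1, 2, 3}            # connections active before the drop phase
i_drop = {1, 3}                 # connections dropped this step
surviving = theta - i_drop      # theta \ I_drop: still-active connections

# Grow candidates are every i NOT in (theta \ I_drop), so the
# just-dropped indices 1 and 3 are eligible to be regrown.
candidates = {i for i in range(6) if i not in surviving}
assert candidates == {1, 3, 4, 5}
```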