Comments (7)
I think you are right! I've only had a brief look so far, and I'm double-checking because it's surprising this wasn't caught for so long. We even used the same implementation in a few other models and it worked fine.
Maybe multiplying the input by the gate gives a gradient signal for how much each expert "likes" to handle that input. So this may actually be working as intended.
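The difference discussed here can be sketched in a few lines. This is a hedged illustration, not the repository's actual code: the toy `expert`, `early_mult`, and `late_mult` functions below are made up for the example, and the gate is a scalar. For a nonlinear expert, scaling the input (early) and scaling the output (late) give different outputs and, more importantly, different gradient signals into the gate.

```python
# Toy sketch of early vs. late gate multiplication in a MoE layer.
# All names here are illustrative, not the repository's API.

def expert(x, w=2.0, b=-1.0):
    """A toy nonlinear expert: ReLU(w*x + b)."""
    z = w * x + b
    return z if z > 0 else 0.0

def early_mult(x, g):
    # Gate scales the *input* before the expert sees it.
    return expert(g * x)

def late_mult(x, g):
    # Gate scales the expert's *output*.
    return g * expert(x)

def grad_wrt_gate(f, x, g, eps=1e-6):
    # Finite-difference derivative d f / d g.
    return (f(x, g + eps) - f(x, g - eps)) / (2 * eps)

x, g = 1.0, 0.8
# Outputs differ: early gives expert(0.8) = 0.6, late gives 0.8 * expert(1.0) = 0.8.
print(early_mult(x, g), late_mult(x, g))
# Gate gradients differ too: early flows through the expert (≈ 2.0 here),
# late is just the expert's output (≈ 1.0 here).
print(grad_wrt_gate(early_mult, x, g), grad_wrt_gate(late_mult, x, g))
```

So with early multiplication the gate's update is entangled with the expert's internals, which matches the intuition that the gate learns how much the expert "likes" the input.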
I will spend a little more time checking before fixing this.
Thank you
from annotated_deep_learning_paper_implementations.
😁 Thanks for your attention. Actually, we also ran several experiments under this setting and everything looked fine.
Another attempt I made was to derive the gradient flow using the chain rule. The difference between the two settings is that with early multiplication, the update of the gate includes the gradient of the expert. I still haven't figured out what that means in practice, and I'm working on it.
Hope this will help.
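For reference, here is one way to write down that chain-rule difference, under the simplifying assumption of a scalar gate $g$, scalar input $x$, expert $f$, and loss $L(y)$:

```latex
% Late multiplication: y = g \, f(x)
\frac{\partial L}{\partial g} = \frac{\partial L}{\partial y} \, f(x)

% Early multiplication: y = f(g x)
\frac{\partial L}{\partial g} = \frac{\partial L}{\partial y} \, f'(g x) \, x
```

In the late case the gate's gradient is just the expert's output, while in the early case the expert's own derivative $f'(gx)$ appears in the gate update, which is what "the update of the gate includes the gradient of the expert" means here.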
Fixed it! Thanks again for pointing it out.
Interesting, will try it out and let you know
Sorry, I think I misunderstood the results. I had set a large load-balancing loss coefficient, so it dominated the gradient update. Early multiplication and late multiplication do indeed behave differently.
By the way, I am curious what the routing probabilities are supposed to look like. Should they be close to 1 for a particular expert, with the others small, or almost evenly distributed among the experts?
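Since the load-balancing coefficient came up above, here is a minimal sketch of the switch-transformer-style auxiliary loss (fraction of tokens routed to each expert times the mean routing probability for that expert, scaled by the number of experts). This is an illustrative reconstruction, not the repository's exact code, and the function names are made up:

```python
# Hedged sketch of a switch-transformer-style load-balancing loss:
# n_experts * sum_i (frac_tokens_i * mean_prob_i).
# Names are illustrative, not the repository's API.

def load_balancing_loss(route_probs, assignments, n_experts):
    """route_probs: per-token lists of softmax probabilities over experts;
    assignments: the expert index chosen for each token."""
    n_tokens = len(assignments)
    loss = 0.0
    for i in range(n_experts):
        # f_i: fraction of tokens dispatched to expert i
        frac = sum(1 for a in assignments if a == i) / n_tokens
        # P_i: mean routing probability assigned to expert i
        mean_p = sum(p[i] for p in route_probs) / n_tokens
        loss += frac * mean_p
    return n_experts * loss

# Perfectly balanced routing over 4 experts gives the minimum value:
probs = [[0.25, 0.25, 0.25, 0.25]] * 4
print(load_balancing_loss(probs, [0, 1, 2, 3], 4))  # prints 1.0
```

With a large coefficient, this term pushes the router toward uniform probabilities, which would explain why a dominating balance loss can mask the difference between the two multiplication placements.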
It seems like they get to around 0.5 for the selected expert. This may differ on larger datasets, where model capacity is a bottleneck.
https://app.labml.ai/run/2de889d0185c11ecb8bbbdc36d3aa63a/metrics
I also tested without the multiplication before I saw your comment, and here's what I got:
https://app.labml.ai/run/43358364185d11ecbebaabd247ae98e1/metrics
Thanks! I notice there are four experts in your model, so the picked expert gets about half the probability and the others share the remaining half. I get similar results, but in other settings I also get an extremely biased distribution. Do you know which one is expected?
As for the second comment, thanks for taking the trouble to run it. But I am curious why the loss of the without-multiplication version is smaller than that of the normal version. Does this mean it is better to have no gradient signal from the experts?