Comments (9)
Yes, categorical cross entropy cannot generally be used for policy gradients. I changed the cross-entropy target to a one-hot vector multiplied by the advantage, so the target is an advantage-scaled one-hot vector. That way, the categorical cross entropy becomes the objective function for the policy gradient.
I did this because I wanted to use model.fit() for the policy gradient, since it is simple! I will add an explanation for this. Thank you.
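Concretely, the trick looks roughly like this. This is a minimal sketch assuming CartPole-sized shapes; the network, the rollout data, and the names below are placeholders, not the repo's exact code:

```python
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

state_size, action_size = 4, 2  # CartPole-sized placeholders

model = Sequential([
    Dense(24, activation='relu', input_shape=(state_size,)),
    Dense(action_size, activation='softmax'),
])
# categorical_crossentropy computes -sum(target_i * log(prediction_i)),
# so with an advantage-scaled one-hot target it equals -A * log(pi(a|s)),
# the negated policy-gradient objective.
model.compile(loss='categorical_crossentropy', optimizer='adam')

# Placeholder episode: states, actions taken, and their advantages.
states = np.random.rand(10, state_size).astype('float32')
actions = np.random.randint(action_size, size=10)
advantages = np.random.randn(10).astype('float32')

# One-hot targets scaled by the advantage of the action actually taken.
targets = np.zeros((10, action_size), dtype='float32')
targets[np.arange(10), actions] = advantages

model.fit(states, targets, epochs=1, verbose=0)
```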
Categorical cross entropy is defined as H(p, q) = -sum(p_i * log(q_i)). For the action taken, a, you set p_a = A. q_a is the probability of taking action a, i.e. π(a, s). All other p_i are zero, thus we have H(p, q) = -A * log(π(a, s)). It's a very clever and generalizable trick!
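This identity is easy to sanity-check numerically; the values below are made up for illustration:

```python
import numpy as np

# With an advantage-scaled one-hot p, H(p, q) = -A * log(q_a).
q = np.array([0.7, 0.3])      # pi(.|s), the softmax output
A = 1.5                       # advantage of the action taken, a = 0
p = np.array([A, 0.0])        # the "target" fed to the cross entropy

lhs = -np.sum(p * np.log(q))  # categorical cross entropy H(p, q)
rhs = -A * np.log(q[0])       # the policy-gradient loss for action a
assert np.isclose(lhs, rhs)
```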
Softmax is not applied to the advantages (the p_i). It is applied to the action scores to get action probabilities. All other p_i are zero simply because they are initialized that way: cartpole_reinforce.py#L84
Note that p_i refers to the target (true) labels and q_i to the predictions in the standard classification case. So by feeding the advantages as the "target" on line 90, we are defining p_i to be the one-hot advantages.
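In code, that construction looks roughly like the following sketch, reconstructed from the description above rather than copied from the repo:

```python
import numpy as np

# The target array starts as all zeros, so every p_i except the entry
# for the action actually taken stays zero.
episode_length, action_size = 5, 2
actions = [0, 1, 1, 0, 1]                             # actions taken
discounted_rewards = np.random.randn(episode_length)  # placeholder advantages

advantages = np.zeros((episode_length, action_size))  # "initialized that way"
for i in range(episode_length):
    advantages[i][actions[i]] = discounted_rewards[i]

# These one-hot advantages are then fed to model.fit() as the "target".
```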
Thank you! Your explanation is clear and easy to understand. I will add an explanation about this soon.
Shall I make a pull request?
@fredcallaway
We would appreciate it if you did!
Hello, I am confused as to why "all other p_i are zero" if the p's are advantages and softmax is applied. Is this the answer? "You correctly pointed out that the softmax numerator should never have zero-values due to the exponential. However, due to floating point precision, the numerator could be a very small value, say, exp(-50000), which essentially evaluates to zero." (from https://stackoverflow.com/questions/39109674/the-output-of-a-softmax-isnt-supposed-to-have-zeros-right)
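The two effects can be checked separately (a small sketch with made-up scores): the softmax outputs q_i are strictly positive but can underflow in floating point, while the zeros among the p_i come from how the target array is initialized, not from softmax.

```python
import numpy as np

# Made-up action scores: softmax outputs are strictly positive in exact
# arithmetic, but an extreme score underflows toward zero in floats.
scores = np.array([5.0, -50.0])
q = np.exp(scores - scores.max())
q /= q.sum()
print(q)  # second entry is tiny but nonzero, roughly 1.3e-24

# The p_i in the trick, by contrast, are exact zeros by construction:
p = np.zeros(2)
p[0] = 1.5  # advantage of the action taken
print(p)    # [1.5, 0.0] -- the zero comes from np.zeros, not softmax
```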
I printed out the advantages array:
advantages final
[[ 0. 1.54572815]
[ 0. 1.21139962]
[ 0. 0.87369404]
[ 0. 0.53257729]
[ 0. 0.18801491]
[ 0. -0.16002789]
[ 0. -0.51158628]
[-0.86669576 0. ]
[ 0. -1.22539221]
[ 0. -1.58771185]]
So the final line, H(p, q) = -A * log(π(a, s)), means the cross-entropy loss is the negation of the advantage times the log probability of the action under the policy? (I kept seeing the negation everywhere cross entropy is explained, so I wanted to make it explicit.)
It still seems magical to me that the network outputs probabilities yet you can use the advantages like this.
Is this a hack? If so, I think it is very ingenious, but I would also prefer to learn it the plain vanilla way. Is Karpathy's pong example the best for that?
Good point regarding the negation. You're correct that the loss includes the negation. However, we are actually trying to maximize sum(p_i * log(q_i)), which corresponds to minimizing its negation.
I definitely agree that it's a bit confusing, but I think the idea of this library is not so much to teach you the math, but rather to give you some code to play around with to test the effects of hyperparameters and apply the algorithms to different environments.
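For comparison, here is the "plain vanilla" version of the same objective, written out explicitly so the negation and the log are visible; this is a sketch in TensorFlow, not code from this repo:

```python
import tensorflow as tf

# Minimize -A * log(pi(a|s)) directly instead of reusing
# categorical_crossentropy.
def reinforce_loss(action_one_hot, advantage, action_probs):
    # Probability the policy assigned to the action actually taken.
    pi_a = tf.reduce_sum(action_probs * action_one_hot, axis=1)
    return -tf.reduce_mean(advantage * tf.math.log(pi_a))

# Same numbers as the check above: the loss equals -1.5 * log(0.7).
probs = tf.constant([[0.7, 0.3]])
onehot = tf.constant([[1.0, 0.0]])
adv = tf.constant([1.5])
print(reinforce_loss(onehot, adv, probs).numpy())
```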