Comments (9)

dnddnjs commented on August 28, 2024 1

Yes, categorical cross entropy cannot generally be used for policy gradient as is. I changed the cross-entropy target to a one-hot vector multiplied by the advantage, so the target is also a one-hot-shaped vector that is nonzero only for the action taken. With that target, the categorical cross entropy becomes the objective function for the policy gradient.

I did this because I wanted to use model.fit() for the policy gradient update, since it is simple! I will add some explanation for this. Thank you.

from reinforcement-learning.

fredcallaway commented on August 28, 2024 1

Categorical cross entropy is defined as H(p, q) = sum(p_i * log(q_i)). For the action taken, a, you set p_a = A. q_a is the probability of taking action a, i.e. π(a, s). All other p_i are zero, thus we have H(p, q) = A * log(π(a, s)). It's a very clever and generalizable trick!
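As a quick check of the identity above, here is a minimal NumPy sketch. It uses the negated form that loss implementations typically minimize, and the probabilities, action index, and advantage are hypothetical values chosen only for illustration; the helper `categorical_cross_entropy` is written for this example and is not the repository's code.

```python
import numpy as np

def categorical_cross_entropy(p, q, eps=1e-12):
    """Loss form of cross entropy: -sum_i p_i * log(q_i)."""
    return -np.sum(p * np.log(q + eps))

# Hypothetical values, for illustration only.
q = np.array([0.3, 0.7])   # action probabilities from the softmax policy
action = 1                 # index of the action that was taken
advantage = 1.5            # advantage A for that action

# "Target" vector: zero everywhere except p_a = A.
p = np.zeros_like(q)
p[action] = advantage

loss = categorical_cross_entropy(p, q)
direct = -advantage * np.log(q[action])

print(loss, direct)  # the two values agree up to the small eps term
```

Because every other p_i is zero, only the term for the action taken survives the sum.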

fredcallaway commented on August 28, 2024 1

Softmax is not applied to the advantages (the p_i). It is applied to action scores to get action probabilities. All other p_i are zero simply because they are initialized that way: cartpole_reinforce.py#L84

Note that p_i refers to the target (true) labels and q_i to the predictions in the standard classification case. So by feeding advantages as the "target" on line 90, we are defining p_i to be the one-hot advantages.
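To make the initialization point concrete, the sketch below builds a target array that starts as all zeros and receives one advantage per row, at the index of the action taken. The function name `one_hot_advantages` and the sample values are assumptions for illustration and are not taken from cartpole_reinforce.py.

```python
import numpy as np

def one_hot_advantages(actions, advantages, action_size):
    """Return an (episode_length, action_size) array of advantage targets."""
    # Start from zeros, so every entry not written below stays exactly 0.
    targets = np.zeros((len(actions), action_size))
    # Write each step's advantage into the column of the action taken.
    targets[np.arange(len(actions)), actions] = advantages
    return targets

# Hypothetical three-step episode with two possible actions.
actions = np.array([1, 1, 0])
advantages = np.array([1.2, 0.5, -0.9])

targets = one_hot_advantages(actions, advantages, action_size=2)
print(targets)
```

No softmax touches this array; it is passed as the "target", while the softmax only produces the predictions q_i.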

dnddnjs commented on August 28, 2024

Thank you! Your explanation is clear and easy to understand. I will add an explanation about this soon.

fredcallaway commented on August 28, 2024

Shall I make a pull request?

dnddnjs commented on August 28, 2024

@fredcallaway
We would appreciate it if you did that!

nyck33 commented on August 28, 2024

Hello, I am confused as to why "all other p_i are zero" if p's are advantages and softmax is applied. Is this the answer? "You correctly pointed out that the softmax numerator should never have zero-values due to the exponential. However, due to floating point precision, the numerator could be a very small value, say, exp(-50000), which essentially evaluates to zero." from https://stackoverflow.com/questions/39109674/the-output-of-a-softmax-isnt-supposed-to-have-zeros-right

nyck33 commented on August 28, 2024

I printed out the advantages array:

 advantages final 
[[ 0.          1.54572815]
 [ 0.          1.21139962]
 [ 0.          0.87369404]
 [ 0.          0.53257729]
 [ 0.          0.18801491]
 [ 0.         -0.16002789]
 [ 0.         -0.51158628]
 [-0.86669576  0.        ]
 [ 0.         -1.22539221]
 [ 0.         -1.58771185]]


So the final line H(p, q) = -(A * log(policy(s, a))) means the cross-entropy loss is the negation of advantage * log probability of the policy? (I kept seeing the negation everywhere cross-entropy is explained, so I wanted to include it.)

It's still magical to me how the network outputs probabilities but you can use the advantages like this.

Is this a hack? If so I think it is very ingenious but I also prefer to learn it the plain vanilla way. Is Karpathy's pong example the best for that?

fredcallaway commented on August 28, 2024

Good point regarding the negation. You're correct that the loss has the negation. However, we are actually trying to maximize sum(p_i * log(q_i)), which corresponds to minimizing the negation.
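The equivalence between minimizing the negated loss and maximizing A * log π can be checked numerically. The sketch below assumes a softmax policy over two hypothetical action scores and takes one plain gradient-descent step on the loss; the scores, advantage, and learning rate are made up for illustration, and the analytic gradient used is the standard softmax result A * (q - onehot(a)).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of action scores."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

def loss(z, action, advantage):
    """Negated objective: -A * log(pi(a | s)) for a softmax policy."""
    return -advantage * np.log(softmax(z)[action])

# Hypothetical values, for illustration only.
z = np.array([0.2, -0.1])   # action scores before the softmax
action, advantage, lr = 1, 1.5, 0.1

# Gradient of the loss with respect to the scores: A * (q - onehot(a)).
grad = advantage * (softmax(z) - np.eye(len(z))[action])

# One gradient-descent step on the loss.
z_new = z - lr * grad

prob_before = softmax(z)[action]
prob_after = softmax(z_new)[action]
print(prob_before, prob_after)
```

With a positive advantage, descending the loss raises the probability of the action taken; with a negative advantage the same step would lower it.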

I definitely agree that it's a bit confusing, but I think the idea of this library is not so much to teach you the math, but rather to give you some code to play around with to test the effects of hyperparameters and apply the algorithms to different environments.
