
Comments (8)

haarnoja commented on August 29, 2024

Thanks for your question. We use uniform sampling because there is no direct way to evaluate the log-probabilities of actions under SVGD policies, which would be needed for the importance weights. Using some other tractable policy representation could fix this issue.

You're right that uniform samples do not necessarily scale well to higher dimensions. I haven't really studied how accurate the uniform value estimator is, but in my experience, using more samples to estimate the value improves performance only marginally.
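For concreteness, here is a minimal NumPy sketch of the uniform-sampling value estimator described above, assuming actions lie in [-1, 1]^d; q_fn is a hypothetical callable returning one Q-value per sampled action:

    import numpy as np

    def soft_value_uniform(q_fn, state, action_dim, alpha=1.0, n_samples=32):
        # Estimates V(s) = alpha * log E_{a~q}[exp(Q(s, a) / alpha) / q(a)]
        # with a uniform proposal q over [-1, 1]^action_dim.
        actions = np.random.uniform(-1.0, 1.0, size=(n_samples, action_dim))
        q_values = q_fn(state, actions)            # shape (n_samples,)
        log_q = -action_dim * np.log(2.0)          # log-density of the uniform proposal
        logits = q_values / alpha - log_q
        m = np.max(logits)                         # log-mean-exp for numerical stability
        return alpha * (m + np.log(np.mean(np.exp(logits - m))))

The number of samples needed for a given accuracy grows with the volume of the action space, which is the scaling concern raised above.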

immars commented on August 29, 2024

OK, I see.
Thanks for the reply!

[deleted user] commented on August 29, 2024

I could be totally misunderstanding, but doesn't Appendix C.2 talk about how one can use the sampling network for q_a' and derive the corresponding densities, so long as the Jacobian of a' with respect to epsilon' is non-singular?

haarnoja commented on August 29, 2024

I see, that's indeed confusing. You are right that we could compute the log-probs if the sampling network were invertible. My feeling is that, in our case, the network does not remain invertible, and that the log-probs we would obtain that way are wrong. We initially experimented with this trick (which is why we discuss it in the appendix), but in the end, uniform samples worked better. We'll fix this in the next version of the paper, thanks for pointing it out!
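For reference, here is the change-of-variables computation under discussion as a minimal NumPy sketch: a single layer a = tanh(W eps + b) with eps ~ N(0, I) is invertible exactly when the square matrix W is non-singular, in which case log p(a) = log p(eps) - log |det J|. All names here are illustrative, not from the repo:

    import numpy as np

    def log_prob_change_of_variables(eps, W, b):
        # a = tanh(W @ eps + b); requires W square and non-singular.
        a = np.tanh(W @ eps + b)
        d = eps.shape[0]
        log_p_eps = -0.5 * (eps @ eps + d * np.log(2.0 * np.pi))  # standard normal
        jacobian = (1.0 - a ** 2)[:, None] * W     # J = diag(1 - a^2) @ W
        _, log_det = np.linalg.slogdet(jacobian)
        return log_p_eps - log_det

A deeper network is invertible only if every layer is, which is the condition haarnoja suspects fails in practice.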

[deleted user] commented on August 29, 2024

My pleasure! Glad I was sort of on the right track. That's very interesting, especially since singular weight matrices or the choice of activation function are the only things off the top of my head that might make a feedforward net non-invertible. I might play around with that.

SJTUGuofei commented on August 29, 2024

Also, in softqlearning/softqlearning/algorithms/sql.py:

    # Single-sample soft Bellman target; stop_gradient keeps it fixed in the TD loss.
    ys = tf.stop_gradient(self._reward_scale * self._rewards_pl + (
        1 - self._terminals_pl) * self._discount * next_value)

I just wonder whether a single sample is sufficient for computing the expectation in $\hat{Q}$.
Thanks a lot!

haarnoja commented on August 29, 2024

Do you mean the expectation over states and actions in Eq. (11)? It is OK, since the corresponding gradient estimator is unbiased, though it can have high variance.
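A toy NumPy check of this point, using a made-up scalar target distribution: the single-sample estimate is unbiased across updates, but each individual sample is noisy:

    import numpy as np

    rng = np.random.default_rng(0)
    mu, sigma = 2.0, 5.0                           # hypothetical true target mean/std
    one_sample_targets = rng.normal(mu, sigma, size=100_000)

    print(one_sample_targets.mean())               # ~2.0: unbiased on average
    print(one_sample_targets.std())                # ~5.0: high per-sample variance

Averaging over minibatches during SGD is what keeps this variance manageable in practice.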

SJTUGuofei commented on August 29, 2024

I see.
Thank you so much!
