
Comments (9)

keiohta avatar keiohta commented on June 12, 2024 1

I agree with @ymd-h that the evaluation score does not include the discount factor.
I think the reason the DDQN paper reports the discounted return is to evaluate the overestimation phenomenon: since the Q-network estimates the discounted cumulative reward, the "true" return it is compared against must also be computed with the discount factor.
I don't think other papers report discounted returns.
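To make the distinction concrete, here is a minimal sketch (illustrative only, not tf2rl code) of the two quantities being discussed: the undiscounted total reward used for evaluation, and the discounted return that the Q-network estimates.

```python
# Minimal sketch (not tf2rl code): undiscounted vs. discounted return
# for a single episode's reward sequence.
def episode_returns(rewards, gamma=0.99):
    undiscounted = sum(rewards)
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return undiscounted, discounted

total, disc = episode_returns([1.0] * 5, gamma=0.9)
# total == 5.0; disc == 1 + 0.9 + 0.81 + 0.729 + 0.6561, roughly 4.0951
```

Evaluation scores in most papers correspond to `undiscounted`; `discounted` is what the Q-values are trained to predict.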

from tf2rl.

ymd-h avatar ymd-h commented on June 12, 2024

@naji-s

As far as I know, episode_return is used only for logging, not for training, so it doesn't matter.

In my opinion, when we compare experiments it is better to use the non-discounted total reward, because the discount factor is a tunable hyperparameter.
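A hypothetical evaluation loop along these lines might look like the following (a gym-style environment API is assumed here; this is a sketch, not tf2rl's actual code):

```python
# Hypothetical evaluation loop (gym-style env assumed, not tf2rl's code):
# the reported metric is the plain sum of rewards, independent of gamma.
def evaluate(env, policy, n_episodes=10):
    totals = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward  # no discounting here
        totals.append(total)
    # average undiscounted return over the evaluation episodes
    return sum(totals) / len(totals)
```

Because no gamma appears in the loop, the score is comparable across agents trained with different discount factors.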


naji-s avatar naji-s commented on June 12, 2024

@ymd-h well, the model is trained with that discount factor, so excluding it does not give a fair comparison between two different models. And here is an example of one of the algorithms implemented in tf2rl that actually reports discounted results:
https://arxiv.org/pdf/1509.06461.pdf


naji-s avatar naji-s commented on June 12, 2024

@keiohta, well, why would the average return even plateau without a discount factor, as it does in virtually all of these papers? Isn't it the case that most of these models, without a discount factor, would not even have converging return functions?


keiohta avatar keiohta commented on June 12, 2024

@naji-s I'm a bit confused. Are you talking about the training return or the evaluation return?
@ymd-h and I are talking about how to evaluate the policy, and I'm saying it's common to evaluate it as the average total reward over several episodes. The way tf2rl evaluates the policy follows other libraries and papers, which also use the average total reward as an evaluation metric.


naji-s avatar naji-s commented on June 12, 2024

@keiohta, I am also new to the literature, so I might be completely misunderstanding things. But let me give a more concrete example. In the paper "Decoupling Representation Learning from Reinforcement Learning" there are the following plots:
[Screenshot: learning-curve plots from the paper, captured 2021-08-03]
All these plots show the return converging to a value as the number of time steps increases. I do think that to guarantee this, a discount factor is necessary; otherwise the value would not converge. For example, in the second plot of the top row (cartpole: swingup), the value converges to about 800 as the number of steps increases. Am I wrong? Don't you need a discount factor for the return to be guaranteed to converge?
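The convergence intuition behind this question can be illustrated with a geometric series (toy numbers, nothing from the paper): with a bounded per-step reward and gamma < 1, the discounted return is bounded by r_max / (1 - gamma), whereas the undiscounted sum grows with the episode length.

```python
# Geometric-series illustration (toy numbers): discounted returns are
# bounded, undiscounted sums grow linearly with the horizon.
gamma, r_max, steps = 0.99, 1.0, 10_000
discounted = sum(r_max * gamma ** t for t in range(steps))
undiscounted = r_max * steps
bound = r_max / (1 - gamma)  # roughly 100
assert discounted < bound      # stays below the geometric bound
assert undiscounted == 10_000  # keeps growing with episode length
```

Note, though, that evaluation curves in papers can still plateau without discounting, because episodes have a finite (often capped) length.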


keiohta avatar keiohta commented on June 12, 2024

Hi @naji-s ,

I do think that to guarantee this, a discount factor is necessary; otherwise the value would not converge.

Yes, we need the discount factor when training an RL agent. However, evaluation is done without the discount factor. Training and evaluation are different.


naji-s avatar naji-s commented on June 12, 2024

Hi @keiohta,

Thank you so much for the clarification. Just a final question: are the plots above training results, then? Because they do converge.


ymd-h avatar ymd-h commented on June 12, 2024

@naji-s

Although the paper (maybe) doesn't state the definition, I think the plots show non-discounted rewards obtained from models trained with discounted rewards.

As long as the discount factor (gamma) is fixed (and the n-step is fixed), you can use the discounted reward for model comparison, but it is not a universal metric.
To improve model performance we try to tune the discount factor, so the metric itself should be independent of it.

