
Comments (9)

keiohta avatar keiohta commented on June 12, 2024 1

I agree with @ymd-h that the evaluation score does not include the discount factor.
I think the reason the DDQN paper reports the discounted return is to evaluate the overestimation phenomenon: since the Q-network estimates the discounted cumulative reward, the "true" return it is compared against must also be computed with the discount factor.
I don't think other papers report discounted returns.
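To make the distinction concrete, here is a minimal sketch (illustrative only, not tf2rl code) of the two quantities being discussed: the undiscounted total reward used for evaluation, and the discounted return that the Q-network estimates.

```python
# Minimal sketch (not tf2rl code): undiscounted vs. discounted return
# for a single episode's reward sequence.
def episode_returns(rewards, gamma=0.99):
    undiscounted = sum(rewards)
    discounted = sum((gamma ** t) * r for t, r in enumerate(rewards))
    return undiscounted, discounted

total, disc = episode_returns([1.0] * 5, gamma=0.9)
# total == 5.0; disc == 1 + 0.9 + 0.81 + 0.729 + 0.6561, roughly 4.0951
```

Evaluation scores in most papers correspond to `undiscounted`; `discounted` is what the Q-values are trained to predict.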

from tf2rl.

ymd-h avatar ymd-h commented on June 12, 2024

@naji-s

As far as I know, episode_return is used only for logging, not for training, so it doesn't matter.

In my opinion, when we compare experiments it is better to use the non-discounted total reward, because the discount factor is a tunable hyperparameter.
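A hypothetical evaluation loop along these lines might look like the following (a gym-style environment API is assumed here; this is a sketch, not tf2rl's actual code):

```python
# Hypothetical evaluation loop (gym-style env assumed, not tf2rl's code):
# the reported metric is the plain sum of rewards, independent of gamma.
def evaluate(env, policy, n_episodes=10):
    totals = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done = env.step(policy(obs))
            total += reward  # no discounting here
        totals.append(total)
    # average undiscounted return over the evaluation episodes
    return sum(totals) / len(totals)
```

Because no gamma appears in the loop, the score is comparable across agents trained with different discount factors.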


naji-s avatar naji-s commented on June 12, 2024

@ymd-h well, the model is trained with that discount factor, so excluding it does not give a fair comparison between two different models. And here is an example of one of the algorithms implemented in tf2rl that actually reports discounted results:
https://arxiv.org/pdf/1509.06461.pdf


naji-s avatar naji-s commented on June 12, 2024

@keiohta, well, why would the average return even plateau without a discount factor, as it does in virtually all of these papers? Isn't it the case that most of these models, without a discount factor, would not even have converging return functions?


keiohta avatar keiohta commented on June 12, 2024

@naji-s I'm a bit confused. Are you talking about the training return or the evaluation return?
@ymd-h and I are talking about how to evaluate the policy, and I'm saying it's common to evaluate it as the average total reward over several episodes. The way tf2rl evaluates the policy follows other libraries and papers, which also use the average total reward as an evaluation metric.


naji-s avatar naji-s commented on June 12, 2024

@keiohta, I am also new to the literature, so I might be completely misunderstanding things. But let me give a more concrete example. In the paper "Decoupling Representation Learning from Reinforcement Learning" there are the following plots:
[Screenshot: learning-curve plots from the paper, captured 2021-08-03]
All these plots show the return converging to a value as the number of time steps increases. I do think that to guarantee this, a discount factor is necessary; otherwise the value would not converge. For example, in the second plot of the top row (cartpole: swingup), the value converges to about 800 as the number of steps increases. Am I wrong? Don't you need a discount factor for the return to be guaranteed to converge?
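The convergence intuition behind this question can be illustrated with a geometric series (toy numbers, nothing from the paper): with a bounded per-step reward and gamma < 1, the discounted return is bounded by r_max / (1 - gamma), whereas the undiscounted sum grows with the episode length.

```python
# Geometric-series illustration (toy numbers): discounted returns are
# bounded, undiscounted sums grow linearly with the horizon.
gamma, r_max, steps = 0.99, 1.0, 10_000
discounted = sum(r_max * gamma ** t for t in range(steps))
undiscounted = r_max * steps
bound = r_max / (1 - gamma)  # roughly 100
assert discounted < bound      # stays below the geometric bound
assert undiscounted == 10_000  # keeps growing with episode length
```

Note, though, that evaluation curves in papers can still plateau without discounting, because episodes have a finite (often capped) length.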


keiohta avatar keiohta commented on June 12, 2024

Hi @naji-s ,

I do think that to guarantee this, a discount factor is necessary; otherwise the value would not converge.

Yes, we need the discount factor when training an RL agent. However, evaluation is done without the discount factor. Training and evaluation are different.


naji-s avatar naji-s commented on June 12, 2024

Hi @keiohta,

Thank you so much for the clarification. Just a final question: are the plots above training results, then? Because they do converge.


ymd-h avatar ymd-h commented on June 12, 2024

@naji-s

Although the paper (maybe) doesn't state the definition, I think the plots show non-discounted rewards obtained from models trained with discounted rewards.

As long as the discount factor (gamma) is fixed (and the n-step is fixed), you can use the discounted reward for model comparison, but it is not a universal metric.
To improve model performance we try to tune the discount factor, so the metric itself should be independent of it.

