Comments (9)
I agree with @ymd-h that the evaluation score does not include the discount factor.
I think the reason the DDQN paper reports the discounted return is to evaluate the overestimation phenomenon: since the Q-network estimates the discounted cumulative reward, the "true" return it is compared against must be computed with the same discount factor.
I don't think other papers report discounted returns.
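For concreteness, this is the quantity involved (a minimal sketch, not code from tf2rl or the DDQN paper; the function name and sample rewards are made up):

```python
import numpy as np

def discounted_return(rewards, gamma=0.99):
    """G = sum_t gamma^t * r_t for one episode.

    Since the Q-network estimates exactly this discounted sum, the DDQN
    overestimation check compares Q(s_0, a_0) against it. Illustrative
    sketch only; not code from tf2rl or the paper.
    """
    discounts = gamma ** np.arange(len(rewards))
    return float(np.sum(discounts * np.asarray(rewards)))

# Hypothetical episode: compare the network's Q(s_0, a_0) to this value
# to measure overestimation.
print(discounted_return([1.0, 1.0, 0.0, 1.0]))  # 1 + 0.99 + 0 + 0.99**3
```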
As far as I know, `episode_return` is used only for logging, not for training, so it doesn't matter.
In my opinion, when we compare experiments, it is better to use the non-discounted total reward, because the discount factor is a tunable hyperparameter.
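Roughly, the logging loop looks like this (a sketch assuming a Gym-style environment and a `policy(obs) -> action` callable, not tf2rl's actual code):

```python
def run_episode(env, policy):
    """Roll out one episode and accumulate the raw, non-discounted return.

    `episode_return` here is purely a logging quantity; the discount
    factor never enters it. Sketch only, assuming a Gym-style API.
    """
    obs = env.reset()
    episode_return, done = 0.0, False
    while not done:
        obs, reward, done, _ = env.step(policy(obs))
        episode_return += reward  # plain sum, no gamma
    return episode_return
```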
@ymd-h Well, the model is trained with that discount factor, so not including it does not lead to fair comparisons between two different models. And here is an example of one of the algorithms implemented in tf2rl whose paper actually reports discounted results:
https://arxiv.org/pdf/1509.06461.pdf
@keiohta Well, why would the average return even plateau if there is no discount factor, as it does in virtually all of these papers? Isn't it the case that without a discount factor most of these models won't even have converging returns?
@naji-s I'm a bit confused. Are you talking about the training return or the evaluation return?
@ymd-h and I are talking about how to evaluate the policy, and I'm saying it's common to evaluate it as the average total reward over several episodes. The way tf2rl evaluates the policy follows other libraries and papers, which also use the average total reward as their evaluation metric.
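Something like the following (a hypothetical sketch of that metric, not tf2rl's actual API):

```python
import numpy as np

def evaluate_policy(env, policy, n_episodes=10):
    """Average non-discounted total reward over several evaluation episodes.

    Hypothetical sketch (not tf2rl's API), assuming a Gym-style `env`
    and a `policy(obs) -> action` callable.
    """
    totals = []
    for _ in range(n_episodes):
        obs, done, total = env.reset(), False, 0.0
        while not done:
            obs, reward, done, _ = env.step(policy(obs))
            total += reward
        totals.append(total)
    return float(np.mean(totals))
```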
@keiohta I am also new to the literature, so I might be completely misunderstanding things, but let me give a more concrete example. In the paper "Decoupling Representation Learning from Reinforcement Learning" there are learning-curve plots in which the return converges to a value as the number of time steps increases. I think a discount factor is necessary to guarantee this; otherwise the value would not converge. For example, in the second plot of the top-left group (cartpole: swingup), the value converges to about 800 as the number of steps increases. Am I wrong? Don't you need a discount factor for the return to converge?
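For reference, the convergence being appealed to here is the geometric-series bound: assuming per-step rewards bounded by some r_max and 0 ≤ γ < 1,

```latex
\sum_{t=0}^{\infty} \gamma^{t} r_{t} \le \frac{r_{\max}}{1-\gamma}
```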
Hi @naji-s,

> I do think to be able to guarantee this a discount factor is necessary otherwise the value would not converge.

Yes, we need the discount factor for training an RL agent. However, the evaluation is done without the discount factor; training and evaluation are different. The evaluation return still plateaus because episodes have a finite horizon and bounded per-step rewards (for example, DeepMind Control episodes are capped at 1,000 steps with rewards in [0, 1]), so the undiscounted total reward is bounded even without discounting.
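To make the distinction concrete, here is where gamma typically appears, in a generic one-step Q-learning sketch (hypothetical values, not tf2rl's code):

```python
import numpy as np

gamma = 0.99
reward, done = 1.0, False
q_target_values = np.array([0.5, 1.2, 0.8])  # Q(s', a) for each action a

# Training: gamma enters the TD target the Q-network regresses toward.
td_target = reward + (1.0 - done) * gamma * np.max(q_target_values)

# Evaluation: the reported score is the plain, undiscounted episode sum.
rewards = [1.0, 1.0, 0.0, 1.0]
episode_score = sum(rewards)  # no gamma anywhere
```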
Hi @keiohta,
Thank you so much for the clarification. Just one final question: are the plots above training results, then? Because they do converge.
Although the paper (maybe) doesn't state the definition, I think the plots show non-discounted rewards obtained from models trained with discounted rewards.
As long as the discount factor (`gamma`) is fixed (and the `n`-step is fixed), you can use the discounted reward for model comparison, but it is not a universal metric. To improve model performance we often tune the discount factor, so the metric itself should be independent of it.
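To illustrate why it is not universal: the same trajectory scores differently under different discount factors, while the undiscounted total does not (made-up numbers):

```python
import numpy as np

def discounted_return(rewards, gamma):
    return float(np.sum(gamma ** np.arange(len(rewards)) * np.asarray(rewards)))

rewards = [1.0] * 100                     # one fixed trajectory
print(discounted_return(rewards, 0.90))   # ~10.0
print(discounted_return(rewards, 0.99))   # ~63.4
print(sum(rewards))                       # 100.0, independent of gamma
```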
Related Issues (20)
- GAIfO doesn't work with Ant-v2
- Implement SAC+AE
- Add recent version of TensorFlow
- Fix atari-py related errors
- Implement RAD
- Fix categorical policy
- Wrapping for absorbing state in case of off-policy GAIL/GAIfO/VAIL (DAC)
- Setting step size in trainer.py for evaluate policy
- Unused argument `max_steps` in the function restore_latest_n_traj
- Problem with load_trajectories function in experiments/utils.py
- How to use multiple GPUs to train
- How to save the model and, after training the policy, use it on a new environment?
- Improve documentation
- DeprecatedEnv error for Pendulum-v0
- "TypeError: 'range' object cannot be interpreted as an integer" when attempting to use the __call__() method of IRLTrainer
- ModuleNotFoundError: No module named 'future'
- Example of GAIfO for Atari?
- Possible error in critic update in SAC-AE algorithm
- AttributeError: module 'tensorflow' has no attribute '__version__'