
Comments (9)

michaelschaarschmidt commented on July 21, 2024

Which Python version? In general, TRPO can sometimes have small numerical instabilities.
I just kicked off a run with 3.6 and am seeing a steady increase:

[2017-07-24 23:15:25,367] Finished episode 300 after 136 timesteps
[2017-07-24 23:15:25,367] Episode reward: 136.0
[2017-07-24 23:15:25,368] Average of last 500 rewards: 48.662
[2017-07-24 23:15:25,368] Average of last 100 rewards: 138.07

2nd run:
[2017-07-24 23:18:02,651] Episode reward: 200.0
[2017-07-24 23:18:02,651] Average of last 500 rewards: 21.498
[2017-07-24 23:18:02,651] Average of last 100 rewards: 94.35
[2017-07-24 23:18:19,890] Finished episode 200 after 200 timesteps
[2017-07-24 23:18:19,891] Episode reward: 200.0
[2017-07-24 23:18:19,891] Average of last 500 rewards: 41.312
[2017-07-24 23:18:19,892] Average of last 100 rewards: 173.19


AdamStelmaszczyk commented on July 21, 2024

Python 2.7.12.

I also saw a steady increase in the beginning (but only in the beginning); the logs linked above show that.


michaelschaarschmidt commented on July 21, 2024

OK, I can have a go at reproducing that tomorrow in a venv. There might, however, always be problematic runs with bad seeds, but it seems fine in 3.6.2; here is the 3rd of 3 runs:
[2017-07-24 23:21:26,770] Finished episode 250 after 200 timesteps
[2017-07-24 23:21:26,770] Episode reward: 200.0
[2017-07-24 23:21:26,770] Average of last 500 rewards: 34.61
[2017-07-24 23:21:26,770] Average of last 100 rewards: 107.64
[2017-07-24 23:21:40,879] Finished episode 300 after 200 timesteps


AdamStelmaszczyk commented on July 21, 2024

but seems fine in 3.6.2

Could you please post a full log from python3 examples/quickstart.py (running till the end, 3000 episodes) with Python 3.6.2?


michaelschaarschmidt commented on July 21, 2024

Have to leave for today, but here are the first 1500 episodes from the previous run before I killed it... In general, I believe you that performance might go down again, because we did not implement KL-divergence annealing (where we stop updating after a while), but CartPole is considered solved if there are 100 episodes with 200 reward, I think:

[2017-07-24 23:23:54,495] Finished episode 1150 after 200 timesteps
[2017-07-24 23:23:54,495] Episode reward: 200.0
[2017-07-24 23:23:54,495] Average of last 500 rewards: 91.75
[2017-07-24 23:23:54,495] Average of last 100 rewards: 174.28
[2017-07-24 23:24:11,088] Finished episode 1200 after 200 timesteps
[2017-07-24 23:24:11,088] Episode reward: 200.0
[2017-07-24 23:24:11,088] Average of last 500 rewards: 105.69
[2017-07-24 23:24:11,088] Average of last 100 rewards: 197.2
[2017-07-24 23:24:27,112] Finished episode 1250 after 200 timesteps
[2017-07-24 23:24:27,113] Episode reward: 200.0
[2017-07-24 23:24:27,113] Average of last 500 rewards: 118.014
[2017-07-24 23:24:27,113] Average of last 100 rewards: 190.1
[2017-07-24 23:24:43,777] Finished episode 1300 after 80 timesteps
[2017-07-24 23:24:43,777] Episode reward: 80.0
[2017-07-24 23:24:43,777] Average of last 500 rewards: 130.174
[2017-07-24 23:24:43,777] Average of last 100 rewards: 183.11
[2017-07-24 23:25:03,404] Finished episode 1350 after 200 timesteps
[2017-07-24 23:25:03,404] Episode reward: 200.0
[2017-07-24 23:25:03,404] Average of last 500 rewards: 143.536
[2017-07-24 23:25:03,404] Average of last 100 rewards: 190.07
[2017-07-24 23:25:22,964] Finished episode 1400 after 200 timesteps
[2017-07-24 23:25:22,964] Episode reward: 200.0
[2017-07-24 23:25:22,964] Average of last 500 rewards: 157.062
[2017-07-24 23:25:22,964] Average of last 100 rewards: 198.56
[2017-07-24 23:25:41,130] Finished episode 1450 after 200 timesteps
[2017-07-24 23:25:41,130] Episode reward: 200.0
[2017-07-24 23:25:41,130] Average of last 500 rewards: 170.024
[2017-07-24 23:25:41,130] Average of last 100 rewards: 200.0
[2017-07-24 23:25:59,222] Finished episode 1500 after 200 timesteps
[2017-07-24 23:25:59,223] Episode reward: 200.0
[2017-07-24 23:25:59,223] Average of last 500 rewards: 181.628
[2017-07-24 23:25:59,223] Average of last 100 rewards: 200.0
[2017-07-24 23:26:16,896] Finished episode 1550 after 200 timesteps
[2017-07-24 23:26:16,896] Episode reward: 200.0
[2017-07-24 23:26:16,896] Average of last 500 rewards: 190.89
[2017-07-24 23:26:16,896] Average of last 100 rewards: 200.0
[2017-07-24 23:26:35,774] Finished episode 1600 after 200 timesteps
[2017-07-24 23:26:35,774] Episode reward: 200.0
[2017-07-24 23:26:35,775] Average of last 500 rewards: 195.774
[2017-07-24 23:26:35,775] Average of last 100 rewards: 200.0


AdamStelmaszczyk commented on July 21, 2024

CartPole is considered solved if there are 100 episodes with 200 reward, I think

Ah, I forgot about that, cheers. I checked https://gym.openai.com/envs/CartPole-v0 and it's even more lenient:

CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.

However, it still looks like my run of python examples/quickstart.py would not pass, but the second one, the one with the longer command, would.

I checked the configs and they looked the same...
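
For reference, checking whether a run clears that bar is straightforward given the per-episode rewards; a minimal sketch (the 195.0/100 numbers are the Gym criterion quoted above):

```python
# Minimal check of the CartPole-v0 "solved" criterion quoted above:
# average reward of at least 195.0 over 100 consecutive episodes.
def first_solved_episode(episode_rewards, window=100, threshold=195.0):
    """Return the number of episodes after which the first solving window
    completes, or None if the run never solves the task."""
    for end in range(window, len(episode_rewards) + 1):
        if sum(episode_rewards[end - window:end]) / window >= threshold:
            return end
    return None

# Example: a run scoring 20 for the first 150 episodes and 200 afterwards
# completes its first solving window after 248 episodes.
rewards = [20.0] * 150 + [200.0] * 300
print(first_solved_episode(rewards))  # -> 248
```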


AdamStelmaszczyk commented on July 21, 2024

I ran it a second time and now it looks much better.

There might however always be problematic runs with bad seeds

I think it might be this. By bad seeds, do you mean the network weights were drawn "unluckily" (i.e. in such a way that later TRPO could not get 200 for 100 episodes)? Would that mean it was getting stuck in local extrema (while finding weights) that were far from the optimum? That would be sad, as the network is small (2 layers with 32 neurons). If that's the case, does one have to run TRPO multiple times and record "the best run"? That seems very inefficient, especially with more complex environments in mind. Or are there parameter changes that would help?
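
For what it's worth, the "run multiple times and record the best run" approach is just bookkeeping over per-episode rewards; a minimal sketch with a placeholder training function (the real one would launch the TRPO quickstart with a given seed):

```python
# Sketch of "run TRPO several times and record the best run".
# train_trpo is a hypothetical placeholder; only the seed loop and
# bookkeeping are the point here.
import numpy as np

def train_trpo(seed, episodes=3000):
    """Placeholder for a real training run; returns per-episode rewards."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 200.0, size=episodes)  # stand-in rewards

results = {seed: train_trpo(seed) for seed in range(5)}
best_seed = max(results, key=lambda s: np.mean(results[s][-100:]))
print(f"best seed: {best_seed}, final 100-episode average: "
      f"{np.mean(results[best_seed][-100:]):.1f}")
```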


michaelschaarschmidt commented on July 21, 2024

Hi, OK, good. So there are two issues:

  • Yes, there can always be local extrema that the algorithm cannot escape. This is why it makes sense to add a heuristic for the learning rate that increases it when there is no improvement and decreases it when performance gets worse. This is always the case with reinforcement learning.
  • Another thing is that this is a stochastic policy, so it samples an action. This means it will always do slightly random things, even at 200. Hence, another thing one can do is to move from sampling to deterministic acting once the policy performs well (there is a flag in the act method for this). Both points are sketched below.
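
A minimal sketch of both ideas, assuming a generic agent/environment interface; the `agent.act(..., deterministic=...)` call stands in for the flag mentioned above, and the exact tensorforce signature may differ between versions:

```python
# Hypothetical illustrations of the two points above; `agent` and `env` are
# generic stand-ins, not the exact tensorforce API.

def adapt_learning_rate(lr, recent_avg, previous_avg,
                        grow=1.05, shrink=0.5, min_lr=1e-5, max_lr=1e-2):
    """First bullet: increase the learning rate when there is no improvement,
    decrease it when performance gets worse."""
    if recent_avg < previous_avg:          # got worse: take smaller steps
        return max(lr * shrink, min_lr)
    if recent_avg == previous_avg:         # stalled: nudge the step size up
        return min(lr * grow, max_lr)
    return lr                              # improving: leave it alone

def run_episode(env, agent, recent_avg, solved_threshold=195.0):
    """Second bullet: sample actions while learning, switch to deterministic
    (greedy) acting once the recent average clears the solved threshold."""
    state = env.reset()
    done, episode_reward = False, 0.0
    while not done:
        deterministic = recent_avg >= solved_threshold
        action = agent.act(state, deterministic=deterministic)
        state, reward, done, _ = env.step(action)
        episode_reward += reward
    return episode_reward
```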

Hope this helps; I would close the issue then, as this is just the general nature of RL algorithms, not a bug. And yes to the question: researchers always report averages/bests of many runs for this reason, which is also why reproducing results reliably is difficult.


AdamStelmaszczyk commented on July 21, 2024

Thank you Michael, that helps a lot. Keep up the good work.

