
Comments (9)

michaelschaarschmidt commented on July 21, 2024

Which Python version? In general, TRPO can sometimes have small numerical instabilities.
I just kicked off a run with 3.6 and am seeing a steady increase:

[2017-07-24 23:15:25,367] Finished episode 300 after 136 timesteps
[2017-07-24 23:15:25,367] Episode reward: 136.0
[2017-07-24 23:15:25,368] Average of last 500 rewards: 48.662
[2017-07-24 23:15:25,368] Average of last 100 rewards: 138.07

2nd run:
[2017-07-24 23:18:02,651] Episode reward: 200.0
[2017-07-24 23:18:02,651] Average of last 500 rewards: 21.498
[2017-07-24 23:18:02,651] Average of last 100 rewards: 94.35
[2017-07-24 23:18:19,890] Finished episode 200 after 200 timesteps
[2017-07-24 23:18:19,891] Episode reward: 200.0
[2017-07-24 23:18:19,891] Average of last 500 rewards: 41.312
[2017-07-24 23:18:19,892] Average of last 100 rewards: 173.19


AdamStelmaszczyk commented on July 21, 2024

Python 2.7.12.

I also saw a steady increase in the beginning (but only in the beginning); the logs linked above show that.


michaelschaarschmidt commented on July 21, 2024

OK, I can have a go at reproducing that tomorrow in a venv. There might, however, always be problematic runs with bad seeds, but it seems fine in 3.6.2; here is the 3rd of 3 runs:
[2017-07-24 23:21:26,770] Finished episode 250 after 200 timesteps
[2017-07-24 23:21:26,770] Episode reward: 200.0
[2017-07-24 23:21:26,770] Average of last 500 rewards: 34.61
[2017-07-24 23:21:26,770] Average of last 100 rewards: 107.64
[2017-07-24 23:21:40,879] Finished episode 300 after 200 timesteps


AdamStelmaszczyk commented on July 21, 2024

but seems fine in 3.6.2

Could you please post a full log from python3 examples/quickstart.py (running till the end, 3000 episodes) with Python 3.6.2?


michaelschaarschmidt commented on July 21, 2024

Have to leave for today, but here are the first 1500 episodes from the previous run before I killed it... In general, I believe you that performance might go down again, because we did not implement KL-divergence annealing (where we stop updating after a while), but CartPole is considered solved if there are 100 episodes with 200 reward, I think:

[2017-07-24 23:23:54,495] Finished episode 1150 after 200 timesteps
[2017-07-24 23:23:54,495] Episode reward: 200.0
[2017-07-24 23:23:54,495] Average of last 500 rewards: 91.75
[2017-07-24 23:23:54,495] Average of last 100 rewards: 174.28
[2017-07-24 23:24:11,088] Finished episode 1200 after 200 timesteps
[2017-07-24 23:24:11,088] Episode reward: 200.0
[2017-07-24 23:24:11,088] Average of last 500 rewards: 105.69
[2017-07-24 23:24:11,088] Average of last 100 rewards: 197.2
[2017-07-24 23:24:27,112] Finished episode 1250 after 200 timesteps
[2017-07-24 23:24:27,113] Episode reward: 200.0
[2017-07-24 23:24:27,113] Average of last 500 rewards: 118.014
[2017-07-24 23:24:27,113] Average of last 100 rewards: 190.1
[2017-07-24 23:24:43,777] Finished episode 1300 after 80 timesteps
[2017-07-24 23:24:43,777] Episode reward: 80.0
[2017-07-24 23:24:43,777] Average of last 500 rewards: 130.174
[2017-07-24 23:24:43,777] Average of last 100 rewards: 183.11
[2017-07-24 23:25:03,404] Finished episode 1350 after 200 timesteps
[2017-07-24 23:25:03,404] Episode reward: 200.0
[2017-07-24 23:25:03,404] Average of last 500 rewards: 143.536
[2017-07-24 23:25:03,404] Average of last 100 rewards: 190.07
[2017-07-24 23:25:22,964] Finished episode 1400 after 200 timesteps
[2017-07-24 23:25:22,964] Episode reward: 200.0
[2017-07-24 23:25:22,964] Average of last 500 rewards: 157.062
[2017-07-24 23:25:22,964] Average of last 100 rewards: 198.56
[2017-07-24 23:25:41,130] Finished episode 1450 after 200 timesteps
[2017-07-24 23:25:41,130] Episode reward: 200.0
[2017-07-24 23:25:41,130] Average of last 500 rewards: 170.024
[2017-07-24 23:25:41,130] Average of last 100 rewards: 200.0
[2017-07-24 23:25:59,222] Finished episode 1500 after 200 timesteps
[2017-07-24 23:25:59,223] Episode reward: 200.0
[2017-07-24 23:25:59,223] Average of last 500 rewards: 181.628
[2017-07-24 23:25:59,223] Average of last 100 rewards: 200.0
[2017-07-24 23:26:16,896] Finished episode 1550 after 200 timesteps
[2017-07-24 23:26:16,896] Episode reward: 200.0
[2017-07-24 23:26:16,896] Average of last 500 rewards: 190.89
[2017-07-24 23:26:16,896] Average of last 100 rewards: 200.0
[2017-07-24 23:26:35,774] Finished episode 1600 after 200 timesteps
[2017-07-24 23:26:35,774] Episode reward: 200.0
[2017-07-24 23:26:35,775] Average of last 500 rewards: 195.774
[2017-07-24 23:26:35,775] Average of last 100 rewards: 200.0


AdamStelmaszczyk commented on July 21, 2024

CartPole is considered solved if there are 100 episodes with 200 reward, I think

Ah, I forgot about that, cheers. I checked https://gym.openai.com/envs/CartPole-v0 and it's even more lenient:

CartPole-v0 defines "solving" as getting average reward of 195.0 over 100 consecutive trials.

However, it still looks like my run of python examples/quickstart.py would not pass, but the second one, the one with the longer command, would.

I checked the configs and they looked the same...
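
For reference, checking whether a run clears that bar is straightforward given the per-episode rewards; a minimal sketch (the 195.0/100 numbers are the Gym criterion quoted above):

```python
# Minimal check of the CartPole-v0 "solved" criterion quoted above:
# average reward of at least 195.0 over 100 consecutive episodes.
def first_solved_episode(episode_rewards, window=100, threshold=195.0):
    """Return the number of episodes after which the first solving window
    completes, or None if the run never solves the task."""
    for end in range(window, len(episode_rewards) + 1):
        if sum(episode_rewards[end - window:end]) / window >= threshold:
            return end
    return None

# Example: a run scoring 20 for the first 150 episodes and 200 afterwards
# completes its first solving window after 248 episodes.
rewards = [20.0] * 150 + [200.0] * 300
print(first_solved_episode(rewards))  # -> 248
```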


AdamStelmaszczyk commented on July 21, 2024

I ran it a second time and now it looks much better.

There might however always be problematic runs with bad seeds

I think it might be this. By bad seeds, do you mean the network weights were drawn "unluckily" (i.e. in such a way that later TRPO could not get 200 for 100 episodes)? Would that mean it was getting stuck in local extrema (while finding weights) that were far from the optimum? That would be sad, as the network is small (2 layers with 32 neurons). If that's the case, does one have to run TRPO multiple times and record "the best run"? That seems very inefficient, especially with more complex environments in mind. Or are there parameter changes that would help?
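
For what it's worth, the "run multiple times and record the best run" approach is just bookkeeping over per-episode rewards; a minimal sketch with a placeholder training function (the real one would launch the TRPO quickstart with a given seed):

```python
# Sketch of "run TRPO several times and record the best run".
# train_trpo is a hypothetical placeholder; only the seed loop and
# bookkeeping are the point here.
import numpy as np

def train_trpo(seed, episodes=3000):
    """Placeholder for a real training run; returns per-episode rewards."""
    rng = np.random.default_rng(seed)
    return rng.uniform(0.0, 200.0, size=episodes)  # stand-in rewards

results = {seed: train_trpo(seed) for seed in range(5)}
best_seed = max(results, key=lambda s: np.mean(results[s][-100:]))
print(f"best seed: {best_seed}, final 100-episode average: "
      f"{np.mean(results[best_seed][-100:]):.1f}")
```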


michaelschaarschmidt commented on July 21, 2024

Hi, OK, good. So there are two issues:

  • Yes, there can always be local extrema that the algorithm cannot escape. This is why it makes sense to add a heuristic for the learning rate that increases it when there is no improvement and decreases it when performance gets worse. This is always the case with reinforcement learning.
  • Another thing is that this is a stochastic policy, so it samples an action. This means it will always do slightly random things, even at 200. Hence, another thing one can do is to move from sampling to deterministic acting once the policy performs well (there is a flag in the act method for this). Both points are sketched below.
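
A minimal sketch of both ideas, assuming a generic agent/environment interface; the `agent.act(..., deterministic=...)` call stands in for the flag mentioned above, and the exact tensorforce signature may differ between versions:

```python
# Hypothetical illustrations of the two points above; `agent` and `env` are
# generic stand-ins, not the exact tensorforce API.

def adapt_learning_rate(lr, recent_avg, previous_avg,
                        grow=1.05, shrink=0.5, min_lr=1e-5, max_lr=1e-2):
    """First bullet: increase the learning rate when there is no improvement,
    decrease it when performance gets worse."""
    if recent_avg < previous_avg:          # got worse: take smaller steps
        return max(lr * shrink, min_lr)
    if recent_avg == previous_avg:         # stalled: nudge the step size up
        return min(lr * grow, max_lr)
    return lr                              # improving: leave it alone

def run_episode(env, agent, recent_avg, solved_threshold=195.0):
    """Second bullet: sample actions while learning, switch to deterministic
    (greedy) acting once the recent average clears the solved threshold."""
    state = env.reset()
    done, episode_reward = False, 0.0
    while not done:
        deterministic = recent_avg >= solved_threshold
        action = agent.act(state, deterministic=deterministic)
        state, reward, done, _ = env.step(action)
        episode_reward += reward
    return episode_reward
```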

Hope this helps; I would close the issue then, as this is just the general nature of RL algorithms, not a bug. And yes to the question: researchers always report averages/bests of many runs for this reason, which is also why reproducing results reliably is difficult.


AdamStelmaszczyk commented on July 21, 2024

Thank you Michael, that helps a lot. Keep up the good work.

